Data dictionary

Rows need grains, not vibes.

The frozen release documents what each generated file represents and what is still missing. This dictionary is frozen paper-release documentation, kept as a historical record; live pages carry current benchmark data.

Claim Evidence

Dictionary claims link to the page documenting the status of the generated dictionaries and schema manifest used by release verification.

Claim | Evidence
Release file grains and field semantics are checked artifacts, not prose-only documentation. | Data dictionary · Field dictionary · Schema manifest
File | Purpose | Rows
not-published | axisDefinitions | Unknown
not-published | axisDiagnostics | Unknown
not-published | axisIntervals | Unknown
not-published | canonicalResponses | Unknown
not-published | canonicalSample | Unknown
not-published | collectionReadinessJson | Unknown
not-published | dataDictionary | Unknown
not-published | duplicateResolution | Unknown
not-published | exclusions | Unknown
not-published | fieldDictionary | Unknown
not-published | itemDiagnostics | Unknown
not-published | limitations | Unknown
not-published | manifest | Unknown
not-published | modelCatalog | Unknown
not-published | modelRosterPreflight | Unknown
not-published | openEndedDiagnostics | Unknown
not-published | promptTemplateMd | Unknown
not-published | questionBankFlags | Unknown
not-published | questionReviewWaivers | Unknown
not-published | questions | Unknown
not-published | releaseSummary | Unknown
not-published | releaseValidation | Unknown
not-published | responseAttempts | Unknown
not-published | responseStyleControls | Unknown
not-published | runPlan | Unknown
not-published | runs | Unknown
not-published | schemaManifestJson | Unknown
not-published | scoringConfig | Unknown
not-published | truthGate | Unknown
not-published | validationManifest | Unknown
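
One way to read "grain" here is as the set of key fields that identifies a single row in a file, so that no key appears twice. The sketch below checks that property for an in-memory list of rows. It is a minimal illustration, not the release's verification code: the function name `grain_violations` and the fields `run_id` and `question_id` are hypothetical and do not come from the published files.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def grain_violations(
    rows: Sequence[Mapping[str, Any]], grain: Sequence[str]
) -> list[tuple]:
    """Return every grain key that occurs in more than one row."""
    counts = Counter(tuple(row[field] for field in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical rows; the field names are assumptions for illustration only.
rows = [
    {"run_id": "r1", "question_id": "q1", "answer": "A"},
    {"run_id": "r1", "question_id": "q2", "answer": "B"},
    {"run_id": "r1", "question_id": "q1", "answer": "C"},  # repeats the first key
]

duplicates = grain_violations(rows, ("run_id", "question_id"))
print(duplicates)
```

An empty result means the declared grain holds for the rows checked. A non-empty result lists the keys that would need duplicate resolution or a finer grain.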

Field Dictionary

File | Field | Type | Nullable | Use | Description
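
The File, Field, Type, and Nullable columns suggest how field semantics could be checked mechanically rather than described only in prose. The sketch below validates one row against a list of field entries. It is an assumption-laden illustration: the `FieldSpec` class, the type vocabulary in `PY_TYPES`, and the example fields `run_id` and `score` are all hypothetical, since the published field dictionary lists no rows.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class FieldSpec:
    """One field dictionary entry (hypothetical shape)."""
    file: str
    field: str
    type: str        # must be a key of PY_TYPES (assumed vocabulary)
    nullable: bool


# Assumed mapping from dictionary type names to Python types.
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_row(row: Mapping[str, Any], specs: Sequence[FieldSpec]) -> list[str]:
    """Return a list of problems found in one row; empty means the row passes."""
    problems = []
    for spec in specs:
        if spec.field not in row:
            problems.append(f"{spec.field}: missing")
            continue
        value = row[spec.field]
        if value is None:
            if not spec.nullable:
                problems.append(f"{spec.field}: null not allowed")
            continue
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and spec.type != "boolean"
        if is_stray_bool or not isinstance(value, PY_TYPES[spec.type]):
            problems.append(f"{spec.field}: expected {spec.type}")
    return problems


# Hypothetical entries for a file keyed "runs"; field names are assumptions.
specs = [
    FieldSpec("runs", "run_id", "string", nullable=False),
    FieldSpec("runs", "score", "number", nullable=True),
]

ok = check_row({"run_id": "r1", "score": None}, specs)
bad = check_row({"run_id": None, "score": "high"}, specs)
print(ok, bad)
```

Returning a list of problems, rather than raising on the first one, lets a validation report show every failing field for a row at once.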

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.