Duplicates

Duplicate run-question pairs are resolved, not hidden.

Resolution applies these preferences in order: parsed rows over unparsed rows, non-default answers over default answers, the preferred source pack, and finally the later artifact timestamp. This table is frozen paper-release documentation, kept as a historical record; live pages carry current benchmark data.
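A minimal sketch of that preference order follows. It assumes each response row carries a parsed flag, an answer, a source-pack name, and an artifact timestamp. The field names, the `DEFAULT_ANSWER` sentinel, the `PREFERRED_PACK` value, and the rule labels are all hypothetical and are not taken from the PoliBench release files. Python compares tuples element by element, so building one key tuple per row makes each earlier criterion dominate the later ones.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

# Hypothetical values: the real release may encode these differently.
DEFAULT_ANSWER = "default"
PREFERRED_PACK = "primary"

# Criteria in priority order, mirroring the resolution rule described above.
RULES = (
    "parsed_row",
    "non_default_answer",
    "preferred_source_pack",
    "later_artifact_timestamp",
)


@dataclass(frozen=True)
class Response:
    response_id: str
    run: str
    question: str
    parsed: bool
    answer: str
    source_pack: str
    artifact_timestamp: datetime


@dataclass(frozen=True)
class Resolution:
    run: str
    question: str
    kept: str | None          # None when the group could not be resolved
    removed: tuple[str, ...]  # response IDs dropped in favour of `kept`
    rule: str | None          # first criterion separating kept from runner-up


def preference_key(r: Response) -> tuple:
    # Earlier tuple elements dominate later ones when tuples are compared.
    return (
        r.parsed,
        r.answer != DEFAULT_ANSWER,
        r.source_pack == PREFERRED_PACK,
        r.artifact_timestamp,
    )


def resolve_group(rows: list[Response]) -> Resolution:
    # Highest-preference row first.
    ranked = sorted(rows, key=preference_key, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    for rule, a, b in zip(RULES, preference_key(best), preference_key(runner_up)):
        if a != b:
            removed = tuple(r.response_id for r in ranked[1:])
            return Resolution(best.run, best.question, best.response_id, removed, rule)
    # Every criterion tied: report it as unresolved instead of picking arbitrarily.
    return Resolution(best.run, best.question, None, (), None)


def resolve_duplicates(responses: list[Response]) -> list[Resolution]:
    # Group rows by (run, question); only groups with more than one row are duplicates.
    groups: dict[tuple[str, str], list[Response]] = defaultdict(list)
    for r in responses:
        groups[(r.run, r.question)].append(r)
    return [resolve_group(rows) for rows in groups.values() if len(rows) > 1]
```

Returning `rule=None` for a full tie is a design choice in this sketch: it keeps the kept and removed IDs plus the deciding rule visible for every resolved pair, and it surfaces a genuine tie as unresolved rather than breaking it silently.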

Claim Evidence

Duplicate-resolution claims link to the pages documenting the resolution log, the canonical responses, and the truth gate.

This page exists so duplicate handling can be reviewed without trusting a hidden cleanup script. The release keeps both the retained and removed response IDs visible, records the rule that chose the canonical row, and treats unresolved duplicate pairs as a release blocker instead of a cosmetic data issue.
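The release-blocker rule can be pictured as a check that runs before publishing. This sketch assumes the response table is available as `(run, question, response_id)` tuples and the resolution log as a dict keyed by `(run, question)`; `check_duplicates`, `LogEntry`, and `ReleaseBlocked` are hypothetical names, not part of any published PoliBench tooling. A duplicated key passes only if its log entry names a kept row, records a rule, and accounts for every response ID in the group.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


class ReleaseBlocked(Exception):
    """Hypothetical error: duplicate handling is incomplete, so the release stops."""


@dataclass(frozen=True)
class LogEntry:
    kept: str | None          # retained response ID
    removed: tuple[str, ...]  # removed response IDs, kept visible in the log
    rule: str | None          # rule that chose the canonical row


def check_duplicates(rows, log: dict) -> None:
    # Collect every response ID under its (run, question) key.
    groups = defaultdict(set)
    for run, question, response_id in rows:
        groups[(run, question)].add(response_id)

    problems = []
    for key, ids in sorted(groups.items()):
        if len(ids) < 2:
            continue  # not a duplicate pair
        entry = log.get(key)
        if entry is None or entry.kept is None or entry.rule is None:
            problems.append(f"{key}: unresolved")
        elif {entry.kept, *entry.removed} != ids:
            # Kept plus removed must cover exactly the IDs in the group.
            problems.append(f"{key}: log does not account for every response ID")

    if problems:
        raise ReleaseBlocked("; ".join(problems))
```

Raising instead of logging a warning is what makes an unresolved pair a blocker rather than a cosmetic issue: the release step cannot complete until the log is fixed.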

Claim | Evidence
Duplicate run-question pairs are resolved deterministically and kept inspectable. | Duplicate resolution · Canonical responses · Truth gate

Run | Question | Kept | Removed | Rule

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.