Benchmark card
The target construct is the observable political-response profile under fixed prompts.
PoliBench should not be used to claim a provider's political intent, a model's private belief, or the political impact of deployed systems without additional evidence.
Use Policy
- Does not measure model beliefs, provider intent, training-data ideology, or real-world political impact.
- No leaderboard rank without uncertainty and caveats.
- No public score without supporting raw responses.
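
The two publication rules above can be read as a pre-release check. What follows is a minimal sketch under stated assumptions, not PoliBench's actual pipeline: `ScoreRecord`, its field names, and `publication_blockers` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ScoreRecord:
    """Hypothetical record for one score; field names are illustrative only."""
    model: str
    score: float
    # Uncertainty interval (low, high), if one has been computed.
    interval: Optional[Tuple[float, float]] = None
    # Human-readable caveats attached to the score.
    caveats: List[str] = field(default_factory=list)
    # Identifiers of the raw responses that support the score.
    raw_response_ids: List[str] = field(default_factory=list)


def publication_blockers(record: ScoreRecord, as_rank: bool = False) -> List[str]:
    """Return the reasons a record may not be published; an empty list means it passes."""
    blockers = []
    # Rule: no public score without supporting raw responses.
    if not record.raw_response_ids:
        blockers.append("missing raw responses")
    # Rule: no leaderboard rank without uncertainty and caveats.
    if as_rank:
        if record.interval is None:
            blockers.append("missing uncertainty")
        if not record.caveats:
            blockers.append("missing caveats")
    return blockers


# Usage: a bare score with no supporting material fails every check.
bare = ScoreRecord(model="model-a", score=0.42)
print(publication_blockers(bare, as_rank=True))
```

Returning the list of reasons, rather than a single boolean, keeps each blocked release traceable to the specific rule it violates.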
Claim Evidence
Each benchmark-card claim links to the pages that document its release artifacts and validation status. Check those pages before reusing a claim.
| Claim | Evidence |
|---|---|
| The target construct is the observable political-response profile under fixed prompts. | Prompt template, Question bank |
| The evidence ceiling remains model-output evidence until human and external validation are collected. | Human status, External status |
| Public scores require supporting raw response attempts and canonical rows. | Response attempts, Canonical responses, Truth gate |
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.