Benchmark card

The target construct is the political-response profile.

PoliBench should not be used to claim a provider's political intent, a model's private belief, or the political impact of deployed systems without additional evidence.

Use Policy

  • Does not measure model beliefs, provider intent, training-data ideology, or real-world political impact.
  • No leaderboard rank without uncertainty and caveats.
  • No public score without supporting raw responses.
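
The last two rules can be read as a pre-publication check. The sketch below is a minimal illustration in Python, not PoliBench's actual pipeline: the `ScoreRecord` type, its field names, and the `publication_problems` helper are all hypothetical, and the example values are made up for the demonstration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ScoreRecord:
    # Hypothetical schema: these field names are assumptions for illustration,
    # not the benchmark's real release format.
    model: str
    score: float
    ci: Optional[Tuple[float, float]] = None  # uncertainty interval, if computed
    caveats: List[str] = field(default_factory=list)
    raw_response_ids: List[str] = field(default_factory=list)


def publication_problems(record: ScoreRecord) -> List[str]:
    """Return the use-policy rules this record would violate if published."""
    problems = []
    if not record.raw_response_ids:
        problems.append("no supporting raw responses")
    if record.ci is None:
        problems.append("no uncertainty interval")
    if not record.caveats:
        problems.append("no caveats")
    return problems


# Made-up records, used only to exercise the check.
complete = ScoreRecord(
    model="model-a",
    score=0.12,
    ci=(0.05, 0.19),
    caveats=["model-output evidence only"],
    raw_response_ids=["r1", "r2"],
)
bare = ScoreRecord(model="model-b", score=0.30)

print(publication_problems(complete))
print(publication_problems(bare))
```

Under these assumptions, a record is fit for a public table or leaderboard only when the returned list is empty. Returning the list of violated rules, rather than a single boolean, keeps the reason for a rejection visible next to the score.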

Claim Evidence

Each benchmark-card claim below links to the pages that document the relevant release artifacts and validation status. Check those pages before reusing a claim.

Claim | Evidence
The target construct is the observable political-response profile under fixed prompts. | Prompt template, Question bank
The evidence ceiling remains model-output evidence until human and external validation are collected. | Human status, External status
Public scores require supporting raw response attempts and canonical rows. | Response attempts, Canonical responses, Truth gate

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.