Reliability
Reliability metrics stay attached to every score.
Scores without stability evidence are marked as limited. Open-ended diagnostics stay separate from placement scores. The tables below are frozen paper-release documentation, kept as a historical record; live pages carry current benchmark data.
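The "marked as limited" rule can be illustrated with a minimal sketch. This is not the site's actual pipeline: the `ScoreRow` type, its field names, and the `"supported"` label are hypothetical placeholders, and only the `"limited"` label comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreRow:
    # Hypothetical record shape; field names are assumptions for illustration.
    model: str
    score: float
    paraphrase_stability: Optional[float] = None
    rerun_stability: Optional[float] = None


def evidence_status(row: ScoreRow) -> str:
    """Return "limited" when no stability evidence is attached to the score."""
    has_evidence = any(
        value is not None
        for value in (row.paraphrase_stability, row.rerun_stability)
    )
    # "supported" is a placeholder label, not a term used by the benchmark.
    return "supported" if has_evidence else "limited"
```

In this sketch a stability value of `0.0` still counts as evidence: the check is whether a measurement exists, not whether it is high.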
Claim Evidence
Reliability claims link to source evidence and pending-validation status before model rows are shown.
| Claim | Evidence |
|---|---|
| Reliability metrics are computed from benchmark response rows. | Canonical sample · Response style controls |
| The evidence ceiling remains model-output evidence until human and external validation are collected. | Human status · External status |
Model Reliability
Showing all 0 canonical models from the frozen paper release, so the table below has no model rows. Full per-run rows are browsable on the live runs index; the frozen paper pack (canonical_sample.csv, response_style_controls.csv) is no longer hosted from this repository.
These rows show reliability as evidence status, not as a competitive score. They are meant to explain stability, not to turn the page into a ranking contest.
| Model | Paraphrase | Contradiction | Repeat pass | Rerun | Resolution |
|---|---|---|---|---|---|
Evidence note
PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.
The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.
This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.