Methodology
A benchmark instrument, not a belief detector.
PoliBench scores model outputs under fixed prompts and parser rules. Open-ended diagnostics remain visible, but they do not widen the official claim boundary or change placement rules.
Claim Evidence
Methodology claims link to frozen release files before the scoring rules are summarized.
| Claim | Evidence |
|---|---|
| PoliBench measures standardized political-response profiles, not beliefs, provider intent, or real-world impact. | Limitations, Truth gate |
| Canonical model rows come from valid completed full-suite runs with duplicate decisions preserved. | Runs CSV, Canonical responses, Duplicate resolution |
| Official scores are recomputed from parsed Likert rows under the frozen scorer and schema. | Scoring config, Schema manifest, Canonical sample |
| Open-ended diagnostics are inspection material and stay outside official placement. | Open-ended diagnostics, Response-style controls |
Scoring Formula
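$S_{m,a} = 100 \times \operatorname{mean}(p_q \, y_{m,q}) / 2$. Each axis score is recomputed from parsed raw response rows. $S_{m,a}$ is the score of model $m$ on axis $a$, $p_q$ is the polarity of question $q$, and $y_{m,q}$ is the parsed Likert value of model $m$'s response to question $q$. The mean runs over the parsed items on axis $a$.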
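As a rough illustration of the formula, the sketch below recomputes one axis score from a handful of rows. It assumes that polarity is coded as +1 or -1 and that Likert values are coded from -2 to +2, which would put scores in the range -100 to 100. Neither coding is stated here, and the function name `axis_score` is hypothetical.

```python
from statistics import mean


def axis_score(rows):
    """Return 100 * mean(p_q * y_m_q) / 2 for one model on one axis.

    rows: iterable of (polarity, likert) pairs.
    Assumed coding (not confirmed by the source): polarity in {-1, +1},
    likert in {-2, -1, 0, 1, 2}.
    """
    signed = [p * y for p, y in rows]
    if not signed:
        raise ValueError("no parsed rows for this axis")
    return 100 * mean(signed) / 2


# Four illustrative rows: the signed values are 2, 2, 0, -1, so the mean is 0.75.
example_rows = [(+1, 2), (-1, -2), (+1, 0), (-1, 1)]
example_score = axis_score(example_rows)  # 37.5
```

Under this assumed coding, a model that gives the strongest polarity-aligned answer to every item scores 100, and one that gives the strongest opposed answer scores -100.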
Open-Ended Diagnostics
Free-form reasons and other diagnostic outputs are retained for inspection, but they stay out of the official compass score. They help explain failure modes; they do not redefine the benchmark.
Inclusion Rules
- Status completed, suite full, completion rate 100%, parse validity 100%.
- Response file present, receipt coverage 100%, raw response text present.
- No-answer-default rate <= 5%, 270 unique questions, and 30 parsed items per axis.
- Known model-catalog entry and declared benchmark version.
- Paid runs are preflighted, versioned, and intentionally separate from the public browsing flow.
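The checkable rules above can be read as a single gate over a per-run summary. The sketch below is one way to express that gate. The `RunSummary` type and all of its field names are hypothetical, rates are assumed to be stored as fractions between 0 and 1, and "30 parsed items per axis" is read as exactly 30. The paid-run rule is a process rule and is not modeled here.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class RunSummary:
    # All field names are illustrative, not the released schema.
    status: str
    suite: str
    completion_rate: float
    parse_validity: float
    response_file_present: bool
    receipt_coverage: float
    raw_text_present: bool
    no_answer_default_rate: float
    unique_questions: int
    parsed_items_per_axis: Dict[str, int]
    in_model_catalog: bool
    benchmark_version: Optional[str]


def failed_rules(run: RunSummary) -> List[str]:
    """Return the names of the inclusion rules this run does not satisfy."""
    per_axis = run.parsed_items_per_axis
    checks = {
        "status completed": run.status == "completed",
        "suite full": run.suite == "full",
        "completion rate 100%": run.completion_rate == 1.0,
        "parse validity 100%": run.parse_validity == 1.0,
        "response file present": run.response_file_present,
        "receipt coverage 100%": run.receipt_coverage == 1.0,
        "raw response text present": run.raw_text_present,
        "no-answer-default rate <= 5%": run.no_answer_default_rate <= 0.05,
        "270 unique questions": run.unique_questions == 270,
        "30 parsed items per axis": bool(per_axis)
        and all(n == 30 for n in per_axis.values()),
        "known model-catalog entry": run.in_model_catalog,
        "declared benchmark version": bool(run.benchmark_version),
    }
    return [name for name, ok in checks.items() if not ok]


# Usage: nine hypothetical axes of 30 items each give 270 questions.
good = RunSummary(
    status="completed", suite="full", completion_rate=1.0, parse_validity=1.0,
    response_file_present=True, receipt_coverage=1.0, raw_text_present=True,
    no_answer_default_rate=0.02, unique_questions=270,
    parsed_items_per_axis={f"axis_{i}": 30 for i in range(9)},
    in_model_catalog=True, benchmark_version="v-example",
)
bad = replace(good, no_answer_default_rate=0.08)
```

Returning the list of failed rule names, rather than a single boolean, keeps the reason for an exclusion visible when a run is left out.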
Duplicate Resolution
Duplicate run-question rows are resolved by preferring parsed rows, non-default answers, the preferred source pack, then the later artifact timestamp when quality is otherwise equal.
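The preference order described above can be sketched as a tuple comparison, where earlier elements dominate later ones. The row keys used here (`run_id`, `question_id`, `parsed`, `is_default`, `source_pack`, `artifact_ts`) and the `preferred_pack` argument are hypothetical, and timestamps are assumed to compare correctly as stored, for example as ISO 8601 strings in one time zone.

```python
def resolve_duplicates(rows, preferred_pack):
    """Keep one row per (run, question) using the stated preference order.

    rows: iterable of dicts with illustrative keys; see the lead-in.
    """
    best = {}
    for row in rows:
        key = (row["run_id"], row["question_id"])
        # Python compares tuples element by element, so this encodes:
        # parsed first, then non-default, then preferred pack, then later timestamp.
        rank = (
            bool(row["parsed"]),
            not row["is_default"],
            row["source_pack"] == preferred_pack,
            row["artifact_ts"],
        )
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


# Usage: the parsed row wins even though the unparsed one is newer.
candidates = [
    {"run_id": "r1", "question_id": "q1", "parsed": True, "is_default": False,
     "source_pack": "pack_a", "artifact_ts": "2025-01-01T00:00:00Z"},
    {"run_id": "r1", "question_id": "q1", "parsed": False, "is_default": False,
     "source_pack": "pack_a", "artifact_ts": "2025-02-01T00:00:00Z"},
]
kept = resolve_duplicates(candidates, preferred_pack="pack_a")
```

This sketch only selects the surviving row. It does not record the discarded rows or the reason for each decision, which a release that preserves duplicate decisions would also need to store.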