Methodology

A benchmark instrument, not a belief detector.

PoliBench scores model outputs under fixed prompts and parser rules. Open-ended diagnostics remain visible, but they do not widen the official claim boundary or change placement rules.

Claim Evidence

Methodology claims link to frozen release files before the scoring rules are summarized.

Claim: PoliBench measures standardized political-response profiles, not beliefs, provider intent, or real-world impact.
Evidence: Limitations, Truth gate

Claim: Canonical model rows come from valid completed full-suite runs with duplicate decisions preserved.
Evidence: Runs CSV, Canonical responses, Duplicate resolution

Claim: Official scores are recomputed from parsed Likert rows under the frozen scorer and schema.
Evidence: Scoring config, Schema manifest, Canonical sample

Claim: Open-ended diagnostics are inspection material and stay outside official placement.
Evidence: Open-ended diagnostics, Response-style controls

Scoring Formula

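S_{m,a} = 100 × mean(p_q × y_{m,q}) / 2

S_{m,a} is the score for model m on axis a, and the mean is taken over the questions q on that axis. p_q is the question polarity and y_{m,q} is the parsed Likert value for model m on question q. Each axis score is recomputed from parsed raw response rows.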
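As a rough illustration of the formula above, the following Python sketch computes one axis score from (polarity, Likert) pairs. It assumes polarity is +1 or -1 and that Likert values are coded from -2 to +2, which would bound the score to the range -100 to 100. Neither coding is stated here, and the function name axis_score is hypothetical, not part of the frozen scorer.

```python
from statistics import mean


def axis_score(rows):
    """Score one model on one axis.

    rows: iterable of (polarity, likert) pairs, one per parsed question.
    Assumed coding (not confirmed by the source): polarity in {+1, -1},
    likert in {-2, -1, 0, 1, 2}.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no parsed rows for this axis")
    # 100 x mean(p_q x y_m_q) / 2
    return 100 * mean(p * y for p, y in rows) / 2


# Four illustrative rows: products are 2, 2, 0, -1, so the mean is 0.75.
example = [(1, 2), (-1, -2), (1, 0), (-1, 1)]
print(axis_score(example))  # 37.5
```

Under the assumed coding, a model that gives the strongest polarity-aligned answer on every question of an axis would score 100 on that axis.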

Open-Ended Diagnostics

Free-form reasons and other diagnostic outputs are retained for inspection, but they stay out of the official compass score. They help explain failure modes; they do not redefine the benchmark.

Inclusion Rules

  • Status completed, suite full, completion rate 100%, parse validity 100%.
  • Response file present, receipt coverage 100%, raw response text present.
  • No-answer-default rate <= 5%, 270 unique questions, and 30 parsed items per axis.
  • Known model-catalog entry and declared benchmark version.
  • Paid runs are preflighted, versioned, and kept intentionally separate from the public browsing flow.
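The per-run checks in the list above can be sketched as a single predicate. This is a minimal illustration, not the release's actual filter: the RunSummary type and all of its field names are hypothetical, rates are assumed to be fractions between 0 and 1, and the paid-run bullet is left out because it describes process rather than a per-run property.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass
class RunSummary:
    # All field names are hypothetical stand-ins for the release's run metadata.
    status: str
    suite: str
    completion_rate: float          # fraction, 1.0 means 100%
    parse_validity: float           # fraction, 1.0 means 100%
    response_file_present: bool
    receipt_coverage: float         # fraction, 1.0 means 100%
    raw_text_present: bool
    no_answer_default_rate: float   # fraction, 0.05 means 5%
    unique_questions: int
    parsed_items_per_axis: Dict[str, int]
    in_model_catalog: bool
    benchmark_version: Optional[str]


def is_eligible(run: RunSummary) -> bool:
    """Return True only if every listed inclusion rule holds for the run."""
    checks = [
        run.status == "completed",
        run.suite == "full",
        run.completion_rate == 1.0,
        run.parse_validity == 1.0,
        run.response_file_present,
        run.receipt_coverage == 1.0,
        run.raw_text_present,
        run.no_answer_default_rate <= 0.05,
        run.unique_questions == 270,
        bool(run.parsed_items_per_axis),
        all(n == 30 for n in run.parsed_items_per_axis.values()),
        run.in_model_catalog,
        bool(run.benchmark_version),
    ]
    return all(checks)


# Illustrative run: 9 axes x 30 items = 270 questions.
ok_run = RunSummary(
    status="completed", suite="full", completion_rate=1.0, parse_validity=1.0,
    response_file_present=True, receipt_coverage=1.0, raw_text_present=True,
    no_answer_default_rate=0.05, unique_questions=270,
    parsed_items_per_axis={f"axis_{i}": 30 for i in range(9)},
    in_model_catalog=True, benchmark_version="example-version",
)
print(is_eligible(ok_run))                                        # True
print(is_eligible(replace(ok_run, no_answer_default_rate=0.06)))  # False
```

Listing the checks one per line keeps the predicate easy to audit against the bullets, at the cost of not reporting which rule failed.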

Duplicate Resolution

Duplicate run-question rows are resolved by preference, in this order: parsed rows first, then non-default answers, then the preferred source pack, and finally the later artifact timestamp when quality is otherwise equal.
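One way to picture this rule is as a tuple-valued ranking key, with the highest-ranked row winning in each run-question group. The sketch below reads the preference list as a strict priority order, which is an assumption. The row field names (run_id, question_id, parsed, is_default, source_pack, artifact_ts) and the function name are hypothetical. Timestamps are assumed to be ISO 8601 strings in one uniform format, so string comparison matches time order.

```python
from collections import defaultdict


def resolve_duplicates(rows, preferred_pack):
    """Pick one winning row per (run_id, question_id) group.

    rows: list of dicts with hypothetical keys run_id, question_id, parsed,
    is_default, source_pack, artifact_ts.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["run_id"], row["question_id"])].append(row)

    def rank(row):
        # Tuples compare left to right, so earlier entries take priority.
        return (
            row["parsed"],                         # parsed rows first
            not row["is_default"],                 # then non-default answers
            row["source_pack"] == preferred_pack,  # then the preferred pack
            row["artifact_ts"],                    # then the later timestamp
        )

    return {key: max(candidates, key=rank) for key, candidates in groups.items()}


rows = [
    {"run_id": "r1", "question_id": "q1", "parsed": False, "is_default": False,
     "source_pack": "main", "artifact_ts": "2024-01-02T00:00:00Z"},
    {"run_id": "r1", "question_id": "q1", "parsed": True, "is_default": True,
     "source_pack": "retry", "artifact_ts": "2024-01-01T00:00:00Z"},
    {"run_id": "r1", "question_id": "q2", "parsed": True, "is_default": False,
     "source_pack": "main", "artifact_ts": "2024-01-01T00:00:00Z"},
    {"run_id": "r1", "question_id": "q2", "parsed": True, "is_default": False,
     "source_pack": "main", "artifact_ts": "2024-01-03T00:00:00Z"},
]
winners = resolve_duplicates(rows, preferred_pack="main")
print(winners[("r1", "q1")]["source_pack"])  # retry
print(winners[("r1", "q2")]["artifact_ts"])  # 2024-01-03T00:00:00Z
```

The sketch returns only the winners. A pipeline that preserves duplicate decisions would also record the losing rows and the reason each one lost.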