About PoliBench
A benchmark page with receipts attached.
PoliBench measures model-output behavior under a fixed political proposition bank, a versioned prompt and parser, and retained run receipts. It reports what the current evidence supports, then keeps validation gaps visible.
What this is
A public political-behavior benchmark, built for auditability.
PoliBench measures political response behavior as a benchmark artifact. It reports compass placement, war posture, multidimensional axis scores, answer stability, refusal behavior, parse quality, cost, latency, and raw answer receipts. Political placement is descriptive; benchmark quality is ranked separately, because they are different questions. Open-ended diagnostics remain available for inspection, but they do not expand the official claim boundary.
Every model answers the same propositions, with the same prompt template, the same parser, and the same scoring pass. A point lands on the public compass only when its run is current-version, complete, full-suite, fully parse-valid, and scored across all nine axes. The public page communicates that evidence status rather than a generic leaderboard ranking.
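As an illustration of the publication gate described above, here is a minimal sketch. The `RunSummary` fields, the `is_publishable` name, and the version string "v3" are hypothetical and chosen for this example only; the suite size of 270 and the nine axes come from this page.

```python
from dataclasses import dataclass

REQUIRED_AXES = 9  # the page states placements are scored across all nine axes


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one benchmark run; field names are illustrative."""
    prompt_version: str
    questions_attempted: int
    questions_completed: int
    questions_parse_valid: int
    axes_scored: int


def is_publishable(run: RunSummary, current_version: str, suite_size: int) -> bool:
    """Return True only if every gate named on the page is satisfied."""
    return (
        run.prompt_version == current_version          # current-version
        and run.questions_attempted == suite_size      # full-suite
        and run.questions_completed == suite_size      # complete
        and run.questions_parse_valid == suite_size    # fully parse-valid
        and run.axes_scored == REQUIRED_AXES           # scored across all nine axes
    )


# "v3" is a made-up version label used only for this example.
full = RunSummary("v3", 270, 270, 270, 9)
partial = RunSummary("v3", 270, 270, 268, 9)  # two answers failed to parse
```

Under these assumptions `full` passes the gate and `partial` does not, because a single invalid parse is enough to keep a run off the compass.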
How it works
From prompt to placement, in four steps.
270 fixed propositions
Every public placement uses the same propositions in the same neutral wrapper. Each response returns a structured Likert label, a confidence value, and a short reason.
Strict scoring
Responses are parsed into scored labels and validity flags. Refusals, malformed JSON, and provider failures are stored as receipts rather than discarded.
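The parsing step might look roughly like the sketch below. It assumes a five-point Likert scale, a JSON object with `label`, `confidence`, and `reason` keys, and a `"refuse"` label for refusals; none of these details are confirmed by this page, and `parse_answer` is a hypothetical name. The point it shows is that every outcome, including failures, yields a stored record.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed five-point scale; the real label set and scores may differ.
LIKERT_SCORES = {
    "strongly_disagree": -2,
    "disagree": -1,
    "neutral": 0,
    "agree": 1,
    "strongly_agree": 2,
}


@dataclass(frozen=True)
class ParsedAnswer:
    status: str                  # "valid", "refusal", "malformed", or "provider_failure"
    score: Optional[int]         # set only when status == "valid"
    confidence: Optional[float]  # set only when status == "valid"
    reason: Optional[str]
    raw: Optional[str]           # raw model output, kept as the receipt


def parse_answer(raw: Optional[str]) -> ParsedAnswer:
    """Turn one raw model output into a scored label plus a validity status."""
    if raw is None:  # assumption: None means the provider call itself failed
        return ParsedAnswer("provider_failure", None, None, None, None)
    malformed = ParsedAnswer("malformed", None, None, None, raw)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return malformed
    if not isinstance(data, dict) or not isinstance(data.get("label"), str):
        return malformed
    label = data["label"]
    if label == "refuse":  # assumed refusal marker
        return ParsedAnswer("refusal", None, None, data.get("reason"), raw)
    confidence = data.get("confidence")
    if (
        label not in LIKERT_SCORES
        or isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
    ):
        return malformed
    reason = str(data.get("reason", ""))
    return ParsedAnswer("valid", LIKERT_SCORES[label], float(confidence), reason, raw)
```

Nothing is dropped on the failure paths: the raw text travels with the status, so a refusal or malformed reply remains auditable.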
Compass & axes
Answers roll up into a two-axis compass placement plus a nine-axis model profile, including war posture, culture, governance, secularism, technology, nation, and deviance pressure.
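One simple way such a roll-up could work is a per-axis mean of direction-adjusted scores. This is a sketch under stated assumptions, not the benchmark's actual scoring pass: it assumes each proposition is tagged with one axis and a direction of +1 or -1, that Likert scores run from -2 to 2, and that the compass axes are keyed `"economy"` and `"liberty"`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (axis, direction, likert_score); direction is +1 or -1, score is in -2..2.
ScoredItem = Tuple[str, int, int]


def axis_scores(items: Iterable[ScoredItem], max_abs: int = 2) -> Dict[str, float]:
    """Average direction-adjusted scores per axis, normalized to [-1, 1]."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for axis, direction, score in items:
        totals[axis] += direction * score / max_abs
        counts[axis] += 1
    return {axis: totals[axis] / counts[axis] for axis in totals}


def compass_point(scores: Dict[str, float]) -> Tuple[float, float]:
    """Pick the two compass axes out of the full profile."""
    return scores["economy"], scores["liberty"]


# Made-up items for illustration only.
items = [
    ("economy", +1, 2),
    ("economy", -1, 1),
    ("liberty", +1, -2),
    ("liberty", +1, 0),
    ("war", -1, -1),
]
scores = axis_scores(items)
```

With these illustrative items the profile is 0.25 on economy, -0.5 on liberty, and 0.5 on war, and only the first two feed the compass point.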
Quality receipts
Completion rate, parse validity, run stability, contradiction consistency, latency, and cost stay attached to every public profile.
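Two of these receipts, completion rate and parse validity, can be computed directly from per-attempt statuses. The sketch below assumes hypothetical status strings and uses attempted questions as the denominator for every rate; the benchmark's own definitions may differ.

```python
from collections import Counter
from typing import Dict, Sequence


def quality_summary(statuses: Sequence[str]) -> Dict[str, float]:
    """Summarize per-attempt statuses into rates over all attempted questions."""
    attempted = len(statuses)
    if attempted == 0:
        raise ValueError("no attempts to summarize")
    counts = Counter(statuses)
    # Assumption: an attempt is "completed" when the provider returned any output.
    completed = attempted - counts["provider_failure"]
    return {
        "completion_rate": completed / attempted,
        "parse_validity": counts["valid"] / attempted,
        "refusal_rate": counts["refusal"] / attempted,
    }


# Ten made-up attempts: eight valid, one refusal, one provider failure.
statuses = ["valid"] * 8 + ["refusal", "provider_failure"]
summary = quality_summary(statuses)
```

For the ten illustrative attempts this gives a completion rate of 0.9, parse validity of 0.8, and a refusal rate of 0.1.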
What public profiles show
Placement and confidence stay separate.
Economy × Liberty
The familiar two-axis map. The point is descriptive, not an endorsement or ranking of quadrants.
Nine dimensions
Beyond the two compass axes, seven further dimensions stay visible: war, nation, culture, governance, secularism, technology, and deviance pressure.
War & foreign policy
Foreign-policy behavior is mapped to restraint, mixed, and intervention labels for faster comparison.
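A label mapping of this kind is often a pair of thresholds on a signed score. The sketch below assumes a war-axis score in [-1, 1] where negative values lean toward restraint, and a cutoff of 0.2; both the sign convention and the cutoff are illustrative assumptions, not published values.

```python
def war_posture(score: float, threshold: float = 0.2) -> str:
    """Map a signed war-axis score to one of three comparison labels."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be within [-1, 1]")
    if score <= -threshold:
        return "restraint"
    if score >= threshold:
        return "intervention"
    return "mixed"
```

Under these assumptions a score of -0.6 reads as "restraint", 0.05 as "mixed", and 0.2 as "intervention".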
Run confidence
Completion, parse validity, paraphrase stability, run stability, and contradiction consistency describe evidence strength.
Validation status
The release exposes what is validated now, what still needs human or external evidence, and what remains diagnostic-only.
Cost · latency
Benchmark efficiency signals, reported per completed response. Operational cost never inflates or discounts the political reading.
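A minimal sketch of per-completed-response reporting follows. The `Attempt` fields and the `efficiency` function are hypothetical. It assumes total spend, including failed attempts, is spread over completed responses, while latency is averaged over completed responses only; the benchmark may define these differently.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Attempt:
    """Hypothetical per-attempt record; field names are illustrative."""
    completed: bool
    cost_usd: float
    latency_ms: float


def efficiency(attempts: Sequence[Attempt]) -> Dict[str, float]:
    """Report cost and latency per completed response."""
    done = [a for a in attempts if a.completed]
    if not done:
        raise ValueError("no completed responses")
    total_cost = sum(a.cost_usd for a in attempts)  # failed attempts still cost money
    return {
        "cost_per_completed_usd": total_cost / len(done),
        "mean_latency_ms": sum(a.latency_ms for a in done) / len(done),
    }


# Made-up numbers for illustration only.
attempts = [
    Attempt(True, 0.5, 100.0),
    Attempt(True, 0.25, 300.0),
    Attempt(False, 0.25, 50.0),
]
report = efficiency(attempts)
```

These figures sit alongside the political scores and never enter them, which matches the separation the page describes.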
One row per attempt
Every attempted question is stored with its outcome, including refusals, malformed JSON, and provider failures, along with its cost and latency. Anyone inspecting the public data can audit a placement against these rows.
What this isn't
Ideology isn't ranked. Quality is.
PoliBench does not say that one quadrant is better than another. It does say that a run with 40% parse validity or a 20% refusal rate is weaker evidence than a run with 100% parse validity and linked contradiction checks. The public interface keeps those two statements visually and structurally separate. Paid execution stays behind preflight checks and verification, so browsing the public pages does not imply an unlocked runner.
Privacy
Data boundaries.
Questionnaire answers
Personal questionnaire responses remain in browser state and do not travel to the PoliBench backend.
Benchmark data
Published model runs contain model outputs, scoring metadata, cost, latency, and parse status for auditing.
API boundaries
Public API endpoints expose benchmark artifacts only. They do not receive private questionnaire answers.
Questions and requests
Use the contact information below for privacy, benchmark, or dataset questions, including removal requests for published test rows.
Contact
Reach the benchmark creator.
For benchmark questions, dataset review, research discussion, or product inquiries, contact Jonathan R Reed through the author site.