About PoliBench

A benchmark page with receipts attached.

PoliBench measures model-output behavior under a fixed political proposition bank, a versioned prompt and parser, and retained run receipts. It reports only what the current evidence supports and keeps validation gaps visible.

73

Completed profiles

19,710

Responses shown

qb.v1.3.0

Question bank

What this is

A public political-behavior benchmark, built for auditability.

PoliBench measures political response behavior as a benchmark artifact. It reports compass placement, war posture, multidimensional axis scores, answer stability, refusal behavior, parse quality, cost, latency, and raw answer receipts. Political placement is descriptive; benchmark quality is ranked separately, because they are different questions. Open-ended diagnostics remain available for inspection, but they do not expand the official claim boundary.

Every model answers the same propositions, with the same prompt template, the same parser, and the same scoring pass. Points land on the public compass only when their run is current-version, complete, full-suite, fully parse-valid, and scored across all nine axes. The public page communicates that evidence status rather than a generic leaderboard ranking.
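The compass gate described above can be sketched as a single predicate. This is a minimal illustration, not the PoliBench implementation: the `RunSummary` type, its field names, and the reading of "complete" as "every planned question answered" are assumptions. Only the bank version label, the 270-proposition suite size, and the nine-axis requirement come from this page.

```python
from dataclasses import dataclass

CURRENT_BANK_VERSION = "qb.v1.3.0"  # question-bank label shown on this page
FULL_SUITE_SIZE = 270               # fixed propositions per this page
REQUIRED_AXES = 9                   # all nine axes must be scored


@dataclass(frozen=True)
class RunSummary:
    # Hypothetical schema: these field names are assumptions for illustration.
    bank_version: str
    questions_planned: int   # size of the suite the run was configured with
    questions_answered: int  # attempts that produced a stored answer
    parse_valid: int         # answers that parsed cleanly
    axes_scored: int         # number of axes with a computed score


def compass_eligible(run: RunSummary) -> bool:
    """Return True only when every publication condition holds at once."""
    return (
        run.bank_version == CURRENT_BANK_VERSION          # current-version
        and run.questions_planned == FULL_SUITE_SIZE      # full-suite
        and run.questions_answered == run.questions_planned  # complete
        and run.parse_valid == run.questions_answered     # fully parse-valid
        and run.axes_scored == REQUIRED_AXES              # all nine axes
    )
```

Under these assumptions, a run with 269 parse-valid answers out of 270 would stay off the compass even though it is otherwise complete.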

How it works

From prompt to placement, in four steps.

01 · Prompt

270 fixed propositions

Every public placement uses the same propositions in the same neutral wrapper. Each answer carries a structured Likert label, a confidence value, and a short reason.

02 · Parse

Strict scoring

Responses are parsed into scored labels and validity flags. Refusals, malformed JSON, and provider failures are stored as receipts rather than discarded.

03 · Profile

Compass & axes

Answers roll up into a two-axis compass placement plus a nine-axis model profile, including war posture, culture, governance, secularism, technology, nation, and deviance pressure.

04 · Publish

Quality receipts

Completion rate, parse validity, run stability, contradiction consistency, latency, and cost stay attached to every public profile.
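The parse step in particular can be illustrated with a short sketch. It assumes a JSON answer with a `label` field, a five-point Likert scale, and a `"refuse"` label for refusals. None of these names or score values are specified on this page, so treat them as placeholders. The point it shows is that a bad answer becomes a stored receipt with a status rather than being dropped.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical five-point scale; the real labels and scores are not given here.
LIKERT_SCORES = {
    "strongly_disagree": -2,
    "disagree": -1,
    "neutral": 0,
    "agree": 1,
    "strongly_agree": 2,
}


@dataclass(frozen=True)
class Receipt:
    status: str           # "ok", "refusal", "malformed_json", or "invalid_label"
    score: Optional[int]  # numeric score only when status is "ok"
    raw: str              # raw model output, retained for audit


def parse_response(raw: str) -> Receipt:
    """Parse one raw model answer into a receipt; never raise on bad input."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return Receipt("malformed_json", None, raw)
    if not isinstance(payload, dict):
        # Valid JSON but not the expected object shape.
        return Receipt("malformed_json", None, raw)
    label = payload.get("label")
    if label == "refuse":  # assumed refusal marker
        return Receipt("refusal", None, raw)
    if not isinstance(label, str) or label not in LIKERT_SCORES:
        return Receipt("invalid_label", None, raw)
    return Receipt("ok", LIKERT_SCORES[label], raw)
```

Keeping the raw output on every receipt is what lets a later reader check the parser's decision against what the model actually said.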

What public profiles show

Placement and confidence stay separate.

Compass

Economy × Liberty

The familiar two-axis map. The point is descriptive, not an endorsement or ranking of quadrants.

Axes

Nine dimensions

Beyond the two compass axes, war, nation, culture, governance, secularism, technology, and deviance pressure remain visible as separate dimensions.

Posture

War & foreign policy

Foreign-policy behavior is mapped to one of three labels (restraint, mixed, or intervention) for faster comparison.

Quality

Run confidence

Completion, parse validity, paraphrase stability, run stability, and contradiction consistency describe evidence strength.

Evidence

Validation status

The release exposes what is validated now, what still needs human or external evidence, and what remains diagnostic-only.

Efficiency

Cost · latency

Benchmark efficiency signals, reported per completed response. Operational cost never inflates or discounts the political reading.

Receipts

One row per attempt

Every attempted question is stored, including refusals, malformed JSON, provider failures, cost, and latency, so anyone inspecting the public data can audit the placement.
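As a rough sketch of how one-row-per-attempt receipts could feed the quality and efficiency signals above, the following assumes a hypothetical `AttemptRow` shape and hypothetical definitions: a response counts as "completed" when the provider returned something, and parse validity is measured over completed responses. The page does not publish its exact formulas, so these are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AttemptRow:
    # Hypothetical row shape; field names and status values are assumptions.
    status: str        # "ok", "refusal", "malformed_json", or "provider_failure"
    cost_usd: float
    latency_ms: float


def summarize(rows: List[AttemptRow]) -> Dict[str, Optional[float]]:
    """Compute evidence and efficiency rates from attempt receipts."""
    if not rows:
        raise ValueError("at least one attempt row is required")
    total = len(rows)
    # Assumed definition: completed means the provider returned a response.
    completed = [r for r in rows if r.status != "provider_failure"]
    ok = sum(1 for r in completed if r.status == "ok")
    refusals = sum(1 for r in rows if r.status == "refusal")
    n = len(completed)
    return {
        "completion_rate": n / total,
        "parse_validity": ok / n if n else None,
        "refusal_rate": refusals / total,
        # Total spend divided by completed responses, so failed attempts
        # that still cost money are not hidden.
        "cost_per_completed": sum(r.cost_usd for r in rows) / n if n else None,
        "mean_latency_ms": sum(r.latency_ms for r in completed) / n if n else None,
    }
```

None of these numbers touch the political scores. They describe only how much evidence a placement rests on and what it cost to collect.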

What this isn't

Ideology isn't ranked. Quality is.

PoliBench does not say one quadrant is better than another. It does say that a run with 40% parse validity or a refusal rate of 20% is weaker evidence than a run with 100% parse validity and linked contradiction checks. The public interface keeps those two statements visually and structurally separate. Paid execution is kept behind preflight and verification, so public browsing never implies that the runner is unlocked.

Privacy

Data boundaries.

01 · Local

Questionnaire answers

Personal questionnaire responses remain in browser state and do not travel to the PoliBench backend.

02 · Public

Benchmark data

Published model runs contain model outputs, scoring metadata, cost, latency, and parse status for auditing.

03 · Backend

API boundaries

Public API endpoints expose benchmark artifacts only. They do not receive private questionnaire answers.

04 · Contact

Questions and requests

Use the contact information below for privacy, benchmark, or dataset questions, including removal requests for published test rows.

Contact

Reach the benchmark creator.

For benchmark questions, dataset review, research discussion, or product inquiries, get in touch through the author site.