Changelog

Release changes are part of the evidence trail.

Paper release: frontend-live-backend, generated from live-convex-backend. The changelog stays explicit about claim boundaries, evidence status, and paid-run safety. This page documents the frozen paper release as a historical record; live pages carry current benchmark data.

Claim Evidence

Changelog claims link to the pages documenting generated release artifacts and the readiness audit packet.

Claim	Evidence
Release changes trace back to generated artifacts and release validation output.	Release summary · Manifest · Validation report
Claim-boundary and paid-run safety changes remain linked to public audit packets.	Readiness audit · Next required inputs

Current Release

Generated frontend-live-backend from source export live-convex-backend (immutable base: not-applicable).
Added parser, suite, paper-release, and commit metadata to release versions.
Added a truth gate requiring run, model, question, prompt hash, raw output, parsed answer, parser, scorer, benchmark, and lineage evidence before scores render.
Added checked schemas, field-dictionary contracts, schema manifest, and field-level data dictionary.
Documented bill-analysis as separate future work: not-applicable.
Created the canonical sample from strict full-suite runs.
Resolved 24 duplicate run-question pairs.
Added exclusions, duplicate table, scoring config, benchmark version, independent item-review schema, and external anchor schema.
Clarified the public release plan, including claim boundary, evidence status, open-ended diagnostics, and competitive positioning.
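
The truth gate listed above can be pictured as a completeness check that runs before a score is shown. The following is a minimal sketch, not the release's actual implementation: the snake_case field keys, the `score` key, and the function names `missing_evidence` and `renderable_score` are hypothetical, chosen only to mirror the ten evidence items named in the changelog entry.

```python
from typing import Any, List, Mapping, Optional

# The ten evidence items named in the changelog entry.
# The exact key names are an assumption made for illustration.
REQUIRED_EVIDENCE = (
    "run", "model", "question", "prompt_hash", "raw_output",
    "parsed_answer", "parser", "scorer", "benchmark", "lineage",
)


def missing_evidence(record: Mapping[str, Any]) -> List[str]:
    """Return the required fields that are absent, None, or blank strings."""
    missing = []
    for field in REQUIRED_EVIDENCE:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def renderable_score(record: Mapping[str, Any]) -> Optional[float]:
    """Return the score only when every evidence field is present.

    Returning None stands in for "do not render"; a real page might
    show a placeholder or an explanation instead (assumption).
    """
    if missing_evidence(record):
        return None
    return record.get("score")


if __name__ == "__main__":
    # Hypothetical record with placeholder values for every evidence field.
    record = {field: "placeholder" for field in REQUIRED_EVIDENCE}
    record["score"] = 0.5
    print(renderable_score(record))      # complete evidence: score passes the gate

    del record["prompt_hash"]
    print(missing_evidence(record))      # lists the field that blocks rendering
    print(renderable_score(record))      # gate closed: no score returned
```

Returning the list of missing fields, rather than a bare true/false, keeps the gate auditable: a page that withholds a score can say which piece of evidence is absent.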

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.