Live model card

GPT-4.1 Mini

Provider: openai. Latest completed live full-suite run.

Completion: 100%

Parse validity: 100%

Score status: renderable

No suppression reasons

Uncertainty: Live Convex full-suite summary. External validation and human coding remain separate from model-output profiling.

Claim Evidence

Model cards link score and validation statements to their evidence surfaces. Pending human or external validation stays pending.

Claim	Evidence
GPT-4.1 Mini is shown from a completed live full-suite run.	Run detail
Live model-output profiles are not frozen paper-release evidence.	Run detail

Axis Scores

Axis	Score	95% interval	Items	Coverage	Warning
economy	-10	—	30	Not reported	None
liberty	-11.67	—	30	Not reported	None
war	-21.67	—	30	Not reported	None
nation	-15	—	30	Not reported	None
culture	-20	—	30	Not reported	None
governance	-30	—	30	Not reported	None
secularism	-38.33	—	30	Not reported	None
technology	-3.33	—	30	Not reported	None
deviance	-31.67	—	30	Not reported	None
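
For readers who reuse these numbers, the following is a minimal sketch of loading the Axis Scores table above into typed records while keeping unreported fields explicitly empty instead of coercing them to zero. The names `AxisScore`, `parse_axis_table`, and `MISSING` are hypothetical and are not part of any PoliBench tooling. The sketch also assumes that the dash placeholder and the "Not reported" and "None" cells mean "no value published".

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Rows copied from the Axis Scores table on this page (tab-separated).
# "\u2014" is the dash placeholder shown in the 95% interval column.
AXIS_TABLE = """Axis\tScore\t95% interval\tItems\tCoverage\tWarning
economy\t-10\t\u2014\t30\tNot reported\tNone
liberty\t-11.67\t\u2014\t30\tNot reported\tNone
war\t-21.67\t\u2014\t30\tNot reported\tNone
nation\t-15\t\u2014\t30\tNot reported\tNone
culture\t-20\t\u2014\t30\tNot reported\tNone
governance\t-30\t\u2014\t30\tNot reported\tNone
secularism\t-38.33\t\u2014\t30\tNot reported\tNone
technology\t-3.33\t\u2014\t30\tNot reported\tNone
deviance\t-31.67\t\u2014\t30\tNot reported\tNone
"""

# Assumption: these cell values all mean "no value published".
MISSING = {"\u2014", "Not reported", "None", ""}


@dataclass(frozen=True)
class AxisScore:
    axis: str
    score: float
    interval: Optional[str]   # None when no 95% interval is reported
    items: int
    coverage: Optional[str]   # None when coverage is not reported
    warning: Optional[str]    # None when the page lists no warning


def _clean(cell: str) -> Optional[str]:
    """Map placeholder cells to None so they are never read as numbers."""
    cell = cell.strip()
    return None if cell in MISSING else cell


def parse_axis_table(text: str) -> List[AxisScore]:
    """Parse the tab-separated table into one AxisScore per axis."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        AxisScore(
            axis=row["Axis"],
            score=float(row["Score"]),
            interval=_clean(row["95% interval"]),
            items=int(row["Items"]),
            coverage=_clean(row["Coverage"]),
            warning=_clean(row["Warning"]),
        )
        for row in reader
    ]


rows = parse_axis_table(AXIS_TABLE)
# Axes whose score is published without an uncertainty interval.
without_interval = [r.axis for r in rows if r.interval is None]
```

In this run every axis lands in `without_interval`, so any downstream comparison between axes or models would be working from point scores only.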

Artifact Links

Run detail: /runs/jn7ae2rzspfcdav901hm50bf71868yjf/

Caveats

Live Convex run. Paper-release freezing, human coding, and external anchors remain separate gates.

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.