Literature

The benchmark is grounded in existing evaluation practice.

The sources listed below motivate the benchmark's emphasis on transparency, raw evidence, validity limits, reliability checks, artifact review, model cards, and datasheets.

Claim Evidence

Each literature claim points to the pages where the corresponding source family is applied to PoliBench methods and limits.

Claim	Evidence
The literature list informs methods, validity limits, artifact review, model cards, and datasheets.	Methodology · Validity · Benchmark card · Dataset datasheet

Reference	Link
Political Compass or Spinning Arrow?	https://aclanthology.org/2024.acl-long.816/
ACM Artifact Review and Badging	https://www.acm.org/publications/policies/artifact-review-and-badging-current
Holistic Evaluation of Language Models	https://arxiv.org/abs/2211.09110
NIST AI Risk Management Framework	https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
COBIAS	https://arxiv.org/abs/2402.14889
Whose Opinions Do Language Models Reflect?	https://arxiv.org/abs/2303.17548
Measuring Political Bias in Large Language Models	https://aclanthology.org/2024.acl-long.600/
Model Cards for Model Reporting	https://arxiv.org/abs/1810.03993
Datasheets for Datasets	https://arxiv.org/abs/1803.09010

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.
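
One way to picture "keeping the prompt set, parser, scorer, release files, and caveats close to the claim" is a record that carries a score together with its provenance. The sketch below is purely illustrative: the class `ScoredClaim`, its field names, and the example values are assumptions made for this note, not PoliBench's actual schema or release format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredClaim:
    """A reported number bundled with the context needed to read it narrowly.

    Hypothetical structure: field names are assumptions, not PoliBench's schema.
    """
    model: str
    score: float
    prompt_set: str
    parser: str
    scorer: str
    release_files: tuple = ()
    caveats: tuple = ()

    def missing_provenance(self) -> list:
        """Return the names of provenance fields that are empty."""
        required = {
            "prompt_set": self.prompt_set,
            "parser": self.parser,
            "scorer": self.scorer,
            "release_files": self.release_files,
            "caveats": self.caveats,
        }
        return [name for name, value in required.items() if not value]

    def is_traceable(self) -> bool:
        """True only when every provenance field is filled in."""
        return not self.missing_provenance()


# Example values are invented for illustration only.
claim = ScoredClaim(
    model="example-model",
    score=0.12,
    prompt_set="prompts-v1",
    parser="",  # deliberately left empty to show the check
    scorer="scorer-v1",
    release_files=("run.jsonl",),
    caveats=("one prompt set, one scoring pipeline",),
)
print(claim.missing_provenance())  # ['parser']
```

A check like `is_traceable()` could gate reuse of a number: if any provenance field is empty, the score should not be quoted until the linked run, model card, or artifact has been located.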