Datasheet

A dataset should explain itself.

This datasheet describes motivation, composition, preprocessing, missing evidence, and recommended use.

Claim Evidence

Datasheet claims link to the pages documenting the release manifest, schemas, preprocessing logs, and validation packet status.

Claim	Evidence
Dataset composition and preprocessing are represented by checked release artifacts.	Manifest · Schema manifest · Duplicate resolution · Exclusions
Missing human-coding, external-validation, and human-subjects evidence stays explicit.	Collection readiness · IRB status

Why was it created?	To compare LLM political-response profiles under one standardized benchmark.
What does each row represent?	It depends on the file: a row is a run, response, question, axis, exclusion, duplicate, or canonical model profile.
What preprocessing was done?	Stable-ID de-duplication, canonical latest-run selection, strict exclusion logging, and deterministic duplicate response resolution.
What is missing?	Human coding rows have not been collected. External anchors have not been externally validated. The human-subjects determination is not complete.
Appropriate use	Exploratory analysis of model outputs under this benchmark.
Inappropriate use	Claims about model beliefs, provider intent, representative public opinion, or real-world political impact.
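
The preprocessing steps named in the table (stable-ID de-duplication, canonical latest-run selection, exclusion logging, deterministic duplicate resolution) can be illustrated with a minimal sketch. This is not the PoliBench pipeline. The `Response` fields, the `model_id:question_id` stable-ID format, the tie-break order, and the exclusion reason string are all hypothetical, chosen only to show how such steps might fit together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    # All field names are assumptions for illustration, not the release schema.
    model_id: str
    question_id: str
    run_id: str
    run_timestamp: str  # ISO 8601 in one timezone, so string order matches time order
    text: str


def stable_id(r: Response) -> str:
    """Hypothetical stable ID: one slot per (model, question) pair."""
    return f"{r.model_id}:{r.question_id}"


def select_canonical(responses):
    """Keep one response per stable ID and log every response that is dropped.

    The latest run wins. Ties are broken by run_id and then text, so the
    result does not depend on the order of the input rows.
    """
    ordered = sorted(
        responses,
        key=lambda r: (r.run_timestamp, r.run_id, r.text),
        reverse=True,
    )
    kept = {}
    exclusions = []
    for r in ordered:
        key = stable_id(r)
        if key not in kept:
            kept[key] = r
        else:
            # Nothing is silently discarded: each dropped row gets a log entry.
            exclusions.append(
                {
                    "stable_id": key,
                    "run_id": r.run_id,
                    "reason": f"superseded_by:{kept[key].run_id}",
                }
            )
    return kept, exclusions


rows = [
    Response("m1", "q1", "run-a", "2024-01-01T00:00:00Z", "A"),
    Response("m1", "q1", "run-b", "2024-02-01T00:00:00Z", "B"),
    Response("m1", "q2", "run-a", "2024-01-01T00:00:00Z", "C"),
]
kept, exclusions = select_canonical(rows)
```

Sorting before selection is what makes the resolution deterministic in this sketch: shuffling `rows` yields the same `kept` mapping and the same set of exclusion entries. The actual rules should be read from the linked Duplicate resolution and Exclusions artifacts.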

Evidence note

PoliBench is a public benchmark surface for model outputs under fixed political prompts. Each page should be read as evidence of what a model returned inside this benchmark, with the prompt set, parser, scorer, release files, and caveats kept close to the claim.

The site keeps the claims narrow on purpose. Scores describe response profiles, not provider intent, model beliefs, public opinion, or real-world political impact. Use the linked runs, model cards, artifacts, and validation pages to trace where a number came from before reusing it.

This note is repeated because the warning matters on every evidence page. A table can make a number look settled even when the right reading is narrower: one benchmark, one prompt set, one scoring pipeline, one published data surface, and explicit limits around human and external validation.