Reliability

Reliability metrics stay attached to every score.

Scores without stability evidence are not treated as paper-level findings.
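The section does not publish formal definitions for its stability metrics, but a repeat-pass-style check can be sketched as re-running the same items and measuring how often answers stay identical. The function name and the metric definition below are assumptions for illustration, not the leaderboard's actual implementation:

```python
def repeat_pass_rate(runs: list[list[str]]) -> float:
    """Percentage of items whose answer is identical across all reruns.

    `runs` is a list of reruns; each rerun is a list of per-item answers.
    Hypothetical sketch -- the leaderboard's actual metric definitions
    are not given in this section.
    """
    n_items = len(runs[0])
    stable = sum(
        1 for i in range(n_items)
        if len({run[i] for run in runs}) == 1  # one distinct answer => stable
    )
    return 100.0 * stable / n_items

# Example: 3 reruns over 4 items; items 0, 1, and 3 are stable.
runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "X", "D"],
    ["A", "B", "C", "D"],
]
print(round(repeat_pass_rate(runs), 1))  # 75.0
```

A score reported without this kind of rerun evidence would then be flagged rather than promoted to a paper-level finding.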

Model Reliability

| Model | Paraphrase | Contradiction | Repeat pass | Rerun | Resolution |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 91.6 | 86.7 | 98.9 | 0 | 81.9 |
| Claude Opus 4.6 | 96.6 | 73.6 | 99.6 | 0 | 85.6 |
| Claude Opus 4.7 | 95.6 | 81.9 | 97.4 | 0 | 76.7 |
| Claude Sonnet 4.6 | 87.9 | 85 | 99.1 | 0 | 84.1 |
| Command A | 93.1 | 72.8 | 98.3 | 0 | 90.4 |
| Cydonia 24B V4.1 | 91.8 | 70.3 | 99.8 | 0 | 95.9 |
| DeepSeek R1 | 86.5 | 77.3 | 95 | 0 | 95.6 |
| DeepSeek V3.1 | 89.3 | 71.8 | 92.4 | 0 | 85.6 |
| DeepSeek V3.1 Terminus | 91.3 | 74.2 | 92.2 | 0 | 86.3 |
| DeepSeek V3.2 | 88.7 | 75.8 | 93.2 | 0 | 86.3 |
| DeepSeek V3.2 Exp | 91.4 | 75.7 | 96.1 | 0 | 87.8 |
| DeepSeek V4 Flash | 93.7 | 73.9 | 98 | 0 | 92.6 |
| DeepSeek V4 Pro | 89 | 77.3 | 95.9 | 0 | 87 |
| Gemini 2.5 Flash | 91.7 | 73.8 | 94.8 | 0 | 85.2 |
| Gemini 2.5 Flash Lite | 88.5 | 72.3 | 96.5 | 99 | 77 |
| Gemini 3 Flash Preview | 90.4 | 71.4 | 98.3 | 0 | 65.6 |
| Gemini 3.1 Flash Lite Preview | 93.9 | 80.1 | 97.8 | 0 | 77 |
| Gemini 3.1 Pro Preview | 93.1 | 86.7 | 98.3 | 0 | 15.2 |
| Gemma 4 26B A4B | 92.2 | 68 | 97.2 | 98.6 | 58.5 |
| Gemma 4 31B | 97.6 | 66.3 | 99.4 | 99.5 | 28.9 |
| GLM 4.7 Flash | 87.4 | 55.1 | 92.4 | 97.6 | 97.8 |
| GLM 5 | 87.4 | 77.6 | 92.6 | 0 | 94.1 |
| GLM 5.1 | 87.1 | 76.7 | 91.7 | 0 | 87.8 |
| Goliath 120B | 97.6 | 77.6 | 99.3 | 0 | 29.6 |
| GPT OSS 120B | 89.7 | 79.4 | 95.9 | 99 | 97.8 |
| GPT OSS 20B | 88.8 | 73.5 | 96.7 | 98.1 | 88.5 |
| GPT-4.1 Mini | 93.5 | 63.8 | 98.3 | 0 | 95.2 |
| GPT-5.2 | 93.9 | 71.5 | 98.2 | 0 | 85.2 |
| GPT-5.3 Chat | 89.8 | 76.1 | 94.3 | 0 | 97.8 |
| GPT-5.3 Codex | 94.1 | 77.5 | 97.8 | 0 | 98.1 |
| GPT-5.4 | 95.7 | 79.3 | 98.9 | 0 | 95.6 |
| GPT-5.4 Mini | 87 | 75.4 | 95.2 | 0 | 95.9 |
| GPT-5.4 Nano | 83.1 | 71 | 90.2 | 0 | 92.2 |
| GPT-5.5 | 92.6 | 75.4 | 98.3 | 0 | 91.1 |
| Grok 4 Fast | 92.2 | 73.2 | 98.3 | 0 | 97.4 |
| Grok 4.1 Fast | 94.3 | 81.9 | 97.8 | 0 | 97 |