Reliability

Reliability metrics stay attached to every score.

Scores without stability evidence are not treated as paper-level findings.
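The section does not publish formal definitions for its stability metrics, but a repeat-pass-style check can be sketched as re-running the same items and measuring how often answers stay identical. The function name and the metric definition below are assumptions for illustration, not the leaderboard's actual implementation:

```python
def repeat_pass_rate(runs: list[list[str]]) -> float:
    """Percentage of items whose answer is identical across all reruns.

    `runs` is a list of reruns; each rerun is a list of per-item answers.
    Hypothetical sketch -- the leaderboard's actual metric definitions
    are not given in this section.
    """
    n_items = len(runs[0])
    stable = sum(
        1 for i in range(n_items)
        if len({run[i] for run in runs}) == 1  # one distinct answer => stable
    )
    return 100.0 * stable / n_items

# Example: 3 reruns over 4 items; items 0, 1, and 3 are stable.
runs = [
    ["A", "B", "C", "D"],
    ["A", "B", "X", "D"],
    ["A", "B", "C", "D"],
]
print(round(repeat_pass_rate(runs), 1))  # 75.0
```

A score reported without this kind of rerun evidence would then be flagged rather than promoted to a paper-level finding.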

Model Reliability

| Model | Paraphrase | Contradiction | Repeat pass | Rerun | Resolution |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 91.6 | 86.7 | 98.9 | 0 | 81.9 |
| Claude Opus 4.6 | 96.6 | 73.6 | 99.6 | 0 | 85.6 |
| Claude Opus 4.7 | 95.6 | 81.9 | 97.4 | 0 | 76.7 |
| Claude Sonnet 4.6 | 87.9 | 85 | 99.1 | 0 | 84.1 |
| Command A | 93.1 | 72.8 | 98.3 | 0 | 90.4 |
| Cydonia 24B V4.1 | 91.8 | 70.3 | 99.8 | 0 | 95.9 |
| DeepSeek R1 | 86.5 | 77.3 | 95 | 0 | 95.6 |
| DeepSeek V3.1 | 89.3 | 71.8 | 92.4 | 0 | 85.6 |
| DeepSeek V3.1 Terminus | 91.3 | 74.2 | 92.2 | 0 | 86.3 |
| DeepSeek V3.2 | 88.7 | 75.8 | 93.2 | 0 | 86.3 |
| DeepSeek V3.2 Exp | 91.4 | 75.7 | 96.1 | 0 | 87.8 |
| DeepSeek V4 Flash | 93.7 | 73.9 | 98 | 0 | 92.6 |
| DeepSeek V4 Pro | 89 | 77.3 | 95.9 | 0 | 87 |
| Gemini 2.5 Flash | 91.7 | 73.8 | 94.8 | 0 | 85.2 |
| Gemini 2.5 Flash Lite | 88.5 | 72.3 | 96.5 | 99 | 77 |
| Gemini 3 Flash Preview | 90.4 | 71.4 | 98.3 | 0 | 65.6 |
| Gemini 3.1 Flash Lite Preview | 93.9 | 80.1 | 97.8 | 0 | 77 |
| Gemini 3.1 Pro Preview | 93.1 | 86.7 | 98.3 | 0 | 15.2 |
| Gemma 4 26B A4B | 92.2 | 68 | 97.2 | 98.6 | 58.5 |
| Gemma 4 31B | 97.6 | 66.3 | 99.4 | 99.5 | 28.9 |
| GLM 4.7 Flash | 87.4 | 55.1 | 92.4 | 97.6 | 97.8 |
| GLM 5 | 87.4 | 77.6 | 92.6 | 0 | 94.1 |
| GLM 5.1 | 87.1 | 76.7 | 91.7 | 0 | 87.8 |
| Goliath 120B | 97.6 | 77.6 | 99.3 | 0 | 29.6 |
| GPT OSS 120B | 89.7 | 79.4 | 95.9 | 99 | 97.8 |
| GPT OSS 20B | 88.8 | 73.5 | 96.7 | 98.1 | 88.5 |
| GPT-4.1 Mini | 93.5 | 63.8 | 98.3 | 0 | 95.2 |
| GPT-5.2 | 93.9 | 71.5 | 98.2 | 0 | 85.2 |
| GPT-5.3 Chat | 89.8 | 76.1 | 94.3 | 0 | 97.8 |
| GPT-5.3 Codex | 94.1 | 77.5 | 97.8 | 0 | 98.1 |
| GPT-5.4 | 95.7 | 79.3 | 98.9 | 0 | 95.6 |
| GPT-5.4 Mini | 87 | 75.4 | 95.2 | 0 | 95.9 |
| GPT-5.4 Nano | 83.1 | 71 | 90.2 | 0 | 92.2 |
| GPT-5.5 | 92.6 | 75.4 | 98.3 | 0 | 91.1 |
| Grok 4 Fast | 92.2 | 73.2 | 98.3 | 0 | 97.4 |
| Grok 4.1 Fast | 94.3 | 81.9 | 97.8 | 0 | 97 |