Performance Heatmap
Every model, every metric, color-coded by relative rank within each column.
| Model | Comp | F1 | VA | Hit | OM | HP | PW | τ | Jdg | DBA | 4B | PAN | Q1 | Q3 | Trn | Cnv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 54.3 | 0.138 | 0.250 | 23.0% | 84.4% | 77.4% | 63.7% | 0.301 | 84.4% | 81.0% | 76.3% | 0.908 | 61.0% | 45.5% | 54.7% | 53.5% |
| GPT-5.5 | 53.7 | 0.141 | 0.261 | 22.4% | 85.4% | 82.1% | 53.8% | 0.024 | 82.6% | 80.1% | 81.3% | 0.901 | 52.1% | 46.5% | 52.6% | 57.4% |
| Claude Opus 4.7 | 53.6 | 0.140 | 0.255 | 22.9% | 84.3% | 76.7% | 64.6% | 0.338 | 83.3% | 80.6% | 73.7% | 0.901 | 52.5% | 41.0% | 55.0% | 50.0% |
| MiMo-v2-Pro | 53.6 | 0.138 | 0.263 | 21.5% | 84.9% | 79.7% | 59.8% | 0.213 | 78.2% | 79.4% | 77.3% | 0.897 | 59.0% | 39.5% | 54.1% | 52.7% |
| Gemini 3.1 Pro | 52.9 | 0.133 | 0.278 | 19.2% | 85.9% | 81.8% | 50.3% | -0.015 | 81.3% | 80.1% | 80.9% | 0.889 | 54.5% | 41.0% | 51.6% | 57.3% |
| Claude Haiku 4.5 | 52.8 | 0.136 | 0.276 | 20.6% | 83.2% | 80.2% | 56.0% | 0.124 | 80.6% | 81.4% | 79.8% | 0.885 | 70.0% | 37.0% | 52.8% | 53.3% |
| Qwen 2.5 72B | 52.2 | 0.106 | 0.257 | 16.3% | 85.9% | 81.7% | 45.0% | -0.127 | 69.1% | 76.8% | 81.6% | 0.919 | 70.3% | 38.5% | 49.0% | 61.4% |
| Mistral Large | 51.8 | 0.137 | 0.227 | 24.9% | 86.2% | 82.7% | 51.7% | 0.029 | 79.6% | 80.2% | 81.4% | 0.892 | 66.3% | 24.5% | 51.5% | 52.9% |
| Claude Sonnet 4.6 | 50.3 | 0.138 | 0.251 | 22.9% | 84.1% | 79.1% | 51.2% | -0.030 | 81.4% | 80.1% | 69.8% | 0.906 | 62.5% | 34.5% | 50.7% | 48.8% |
| GPT-5.4 | 50.2 | 0.138 | 0.265 | 21.3% | 83.9% | 79.9% | 47.0% | -0.135 | 80.7% | 80.6% | 79.2% | 0.887 | 71.0% | 43.0% | 49.7% | 52.2% |
| Grok 4 | 50.1 | 0.135 | 0.269 | 19.3% | 83.6% | 77.9% | 46.1% | -0.125 | 79.5% | 80.5% | 79.7% | 0.863 | 54.8% | 45.0% | 49.0% | 54.4% |
Color scale: best in column, middle, worst in column.
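The heatmap colors each cell by its relative rank within its column. The page does not say how that rank is turned into a color, so the following is only a minimal sketch of one plausible scheme: each value is mapped to a score between 0.0 (worst in column) and 1.0 (best in column) based on its position among the column's distinct values. The function name `relative_rank` and the `higher_is_better` flag are hypothetical, and treating a higher Comp score as better is an assumption, not something the source states.

```python
def relative_rank(values, higher_is_better=True):
    """Score each value from 0.0 (worst in column) to 1.0 (best in column).

    Hypothetical helper: ranks are taken over the distinct values in the
    column, so tied values receive the same score.
    """
    # Order distinct values from worst to best.
    ordered = sorted(set(values), reverse=not higher_is_better)
    if len(ordered) == 1:
        # All values tie; place them in the middle of the scale.
        return [0.5 for _ in values]
    position = {v: i for i, v in enumerate(ordered)}
    return [position[v] / (len(ordered) - 1) for v in values]


# Comp column from the heatmap, in row order (Claude Opus 4.6 ... Grok 4).
comp = [54.3, 53.7, 53.6, 53.6, 52.9, 52.8, 52.2, 51.8, 50.3, 50.2, 50.1]

# Assumption: higher Comp is better.
scores = relative_rank(comp, higher_is_better=True)
```

Under these assumptions the top row scores 1.0, the bottom row scores 0.0, and the two models tied at 53.6 share a score. A metric where lower is better would pass `higher_is_better=False`, which only reverses the ordering.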