Leaderboard

Composite: 24% emotion · 49% evaluation · 27% holistic — 200 conversations per model in default mode. Models evaluated April 2026.

Default mode scores are based on 200 conversations per model. Omniscient (n=25) and Verbose (n=50) results come from smaller subsamples and should be interpreted accordingly.

#   Model               Org        Composite  emo    bin    pw     hol    draft
01  Claude Opus 4.6     Anthropic  54.3       0.138  84.4%  63.7%  76.3%  84.4%
02  GPT-5.5             OpenAI     53.7       0.141  85.4%  53.8%  81.3%  82.6%
03  Claude Opus 4.7     Anthropic  53.6       0.140  84.3%  64.6%  73.7%  83.3%
04  MiMo-v2-Pro         Xiaomi     53.6       0.138  84.9%  59.8%  77.3%  78.2%
05  Gemini 3.1 Pro      Google     52.9       0.133  85.9%  50.3%  80.9%  81.3%
06  Claude Haiku 4.5    Anthropic  52.8       0.136  83.2%  56.0%  79.8%  80.6%
07  Qwen 2.5 72B        Alibaba    52.2       0.106  85.9%  45.0%  81.6%  69.1%
08  Mistral Large       Mistral    51.8       0.137  86.2%  51.7%  81.4%  79.6%
09  Claude Sonnet 4.6   Anthropic  50.3       0.138  84.1%  51.2%  69.8%  81.4%
10  GPT-5.4             OpenAI     50.2       0.138  83.9%  47.0%  79.2%  80.7%
11  Grok 4              xAI        50.1       0.135  83.6%  46.1%  79.7%  79.5%

11 models × 200 conversations. The emo, bin, pw, hol, and draft columns correspond to the profile bars in the interactive table, where cell shading indicates relative rank within each column and column headers can be clicked to sort.

How to interpret these results

A short guide to scales, directionality, and the composite score.

Higher is better — with a few exceptions

Most metrics are displayed so that higher numbers indicate better performance. The exceptions are error-based metrics — Intensity MAE, Four-Branch MAE, PANAS Item MAE, Q3 Ordinal Distance, and Perspective Gap — where a lower value means the model is closer to the participant's ground truth. These are flagged as "lower is better" in their tooltips.
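
To make the direction concrete: an error-based metric such as Intensity MAE is simply a mean absolute difference between the model's predicted intensity and the participant's reported intensity on the 0–6 scale described under "Score scales" below. The sketch that follows illustrates that computation; the function and variable names are illustrative and are not taken from the benchmark's code.

```python
# Illustrative sketch: mean absolute error between model-predicted and
# participant-reported emotion intensities on a 0-6 scale.
# Names are illustrative, not AttuneBench identifiers.

def intensity_mae(predicted: list[float], reported: list[float]) -> float:
    """Return the mean absolute error; lower means closer to the participant."""
    if len(predicted) != len(reported):
        raise ValueError("predicted and reported must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reported)) / len(predicted)

# A model that is off by one point on two of four ratings:
print(intensity_mae([3, 5, 2, 4], [3, 4, 2, 5]))  # 0.5
```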

There's no single winner

Model rankings shift depending on which metric you look at. One of the core findings of this benchmark is that emotional intelligence in LLMs is multidimensional, and different models have different strengths. We recommend reading across the full row rather than anchoring on the Composite score alone.

About the Composite score

The Composite is a normalized aggregate across the benchmark's primary human-annotated metrics, weighted 24% emotion + 49% evaluation + 27% holistic. It is intended as a summary for broad comparison, not a definitive ranking. For any analysis aimed at understanding a model's specific strengths or failure modes, per-metric scores may be more informative than the Composite alone.
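
The benchmark's exact normalization is not spelled out on this page, but the weighting itself is easy to reproduce. The sketch below assumes each category score has already been mapped onto a common 0–100, higher-is-better scale; the names and example values are illustrative, not the benchmark's own.

```python
# Illustrative sketch of the 24/49/27 weighting. Assumes each category score
# is already normalized to a 0-100, higher-is-better scale; names and values
# are made up for illustration.

WEIGHTS = {"emotion": 0.24, "evaluation": 0.49, "holistic": 0.27}

def composite(scores: dict[str, float]) -> float:
    """Weighted aggregate of normalized category scores."""
    return sum(WEIGHTS[category] * scores[category] for category in WEIGHTS)

print(round(composite({"emotion": 60.0, "evaluation": 52.0, "holistic": 55.0}), 1))  # 54.7
```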

What the Judge score reflects

Judge and Draft Align scores are produced by a separate LLM evaluator that assesses the quality of each model's drafted response. These scores are reported alongside the human-annotated metrics for reference, but they reflect one model's judgements rather than direct human feedback. For these experimental runs, mistral-large-2512 served as the LLM judge, so Mistral Large's Judge and Draft Align scores should be read with possible self-evaluation bias in mind.

Score scales

Most metrics are reported as percentages on a 0–100 scale. Kendall τ and Perspective Gap range from −1 to +1, where 0 represents chance or no difference. Intensity MAE is a raw error value (0–6, lower is better). For binary classification metrics (OM Acc, HP Acc) and pairwise accuracy (PW Acc), chance performance is approximately 50%.
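
Because the metrics live on different scales and point in different directions, any side-by-side comparison (including the Composite) implies some per-metric transformation onto a common footing. The sketch below shows one plausible way to do that; it is an assumption for illustration, not the benchmark's actual procedure.

```python
# Illustrative sketch: mapping mixed-scale metrics onto a common 0-100,
# higher-is-better footing. One plausible normalization, not necessarily
# the one AttuneBench uses.

def normalize(value: float, kind: str) -> float:
    if kind == "percentage":      # accuracy-style metrics, already 0-100, higher is better
        return value
    if kind == "kendall_tau":     # rank correlation, -1..+1, higher is better
        return (value + 1) / 2 * 100
    if kind == "intensity_mae":   # raw error, 0-6, lower is better
        return (1 - value / 6) * 100
    raise ValueError(f"unknown metric kind: {kind}")

print(normalize(84.4, "percentage"))     # 84.4
print(normalize(0.42, "kendall_tau"))    # 71.0
print(normalize(1.2, "intensity_mae"))   # 80.0
```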

Mode differences and sample size

Default mode scores are based on 200 conversations and are the primary basis for comparison. Omniscient mode (the model is given participant background information) and Verbose mode (the model provides reasoning traces) are evaluated on smaller subsamples of 25 and 50 conversations respectively. Differences between models within these modes should be interpreted with this in mind.
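
For a rough sense of how much the smaller subsamples matter: the half-width of a simple binomial 95% confidence interval on an accuracy-style metric nearly triples going from n = 200 to n = 25. The back-of-the-envelope sketch below uses a normal approximation and is not part of the benchmark's own reporting.

```python
# Back-of-the-envelope sketch: approximate 95% confidence half-width for an
# accuracy-style metric, via a normal approximation to the binomial.
# Not part of AttuneBench's reporting.
import math

def ci_half_width(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for a proportion, in percentage points."""
    p = accuracy_pct / 100
    return z * math.sqrt(p * (1 - p) / n) * 100

print(round(ci_half_width(80.0, 200), 1))  # ~5.5 points at n=200
print(round(ci_half_width(80.0, 25), 1))   # ~15.7 points at n=25
```

At n = 25, gaps of several points between models can easily fall inside such an interval, which is one reason default mode remains the primary basis for comparison.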

Our "ground truth"

With the exception of Judge and Draft Align, all scores are computed against annotations provided by the human participants themselves — including the emotions they tagged, the behaviors they rated, the responses they preferred, and their post-conversation assessments. Models are not being scored against researcher judgements or synthetic labels.

AttuneBench · Evaluating Emotional Intelligence in LLMs