Leaderboard
Composite: 24% emotion · 49% evaluation · 27% holistic — 200 conversations per model in default mode. Models evaluated April 2026.
Default mode scores are based on 200 conversations. Omniscient (n=25) and Verbose (n=50) results reflect smaller subsamples and should be interpreted with that in mind.
| # | Model | Provider | | | | | |
|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Anthropic | 0.138 | 84.4% | 63.7% | 76.3% | 84.4% |
| 02 | GPT-5.5 | OpenAI | 0.141 | 85.4% | 53.8% | 81.3% | 82.6% |
| 03 | Claude Opus 4.7 | Anthropic | 0.140 | 84.3% | 64.6% | 73.7% | 83.3% |
| 04 | MiMo-v2-Pro | Xiaomi | 0.138 | 84.9% | 59.8% | 77.3% | 78.2% |
| 05 | Gemini 3.1 Pro | Google | 0.133 | 85.9% | 50.3% | 80.9% | 81.3% |
| 06 | Claude Haiku 4.5 | Anthropic | 0.136 | 83.2% | 56.0% | 79.8% | 80.6% |
| 07 | Qwen 2.5 72B | Alibaba | 0.106 | 85.9% | 45.0% | 81.6% | 69.1% |
| 08 | Mistral Large | Mistral | 0.137 | 86.2% | 51.7% | 81.4% | 79.6% |
| 09 | Claude Sonnet 4.6 | Anthropic | 0.138 | 84.1% | 51.2% | 69.8% | 81.4% |
| 10 | GPT-5.4 | OpenAI | 0.138 | 83.9% | 47.0% | 79.2% | 80.7% |
| 11 | Grok 4 | xAI | 0.135 | 83.6% | 46.1% | 79.7% | 79.5% |
How to interpret these results
A short guide to scales, directionality, and the composite score.
Higher is better — with a few exceptions
Most metrics are displayed so that higher numbers indicate better performance. The exceptions are error-based metrics — Intensity MAE, Four-Branch MAE, PANAS Item MAE, Q3 Ordinal Distance, and Perspective Gap — where a lower value means the model is closer to the participant's ground truth. These are flagged as "lower is better" in their tooltips.
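As a rough illustration of what an error-based metric means (not the benchmark's actual implementation; variable names and values are hypothetical), Intensity MAE is simply the mean absolute difference between the model's predicted intensity and the participant's self-reported intensity, so a smaller value means the model sits closer to the participant's own rating:

```python
import numpy as np

def intensity_mae(predicted: np.ndarray, participant_reported: np.ndarray) -> float:
    """Mean absolute error between predicted and participant-reported emotion
    intensities, both on the benchmark's 0-6 scale. Lower is better."""
    return float(np.mean(np.abs(predicted - participant_reported)))

# Hypothetical example: three conversations, intensities on a 0-6 scale.
pred = np.array([4, 2, 5])
truth = np.array([5, 2, 3])
print(intensity_mae(pred, truth))  # 1.0
```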
There's no single winner
Model rankings shift depending on which metric you look at. One of the core findings of this benchmark is that emotional intelligence in LLMs is multidimensional, and different models have different strengths. We recommend reading across the full row rather than anchoring on the Composite score alone.
About the Composite score
The Composite is a normalized aggregate across the benchmark's primary human-annotated metrics, weighted 24% emotion + 49% evaluation + 27% holistic. It is intended as a summary for broad comparison, not a definitive ranking. For any analysis aimed at understanding a model's specific strengths or failure modes, per-metric scores may be more informative than the Composite alone.
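A minimal sketch of how a weighted composite of this kind can be formed is below. It assumes min-max normalization of each metric group across models before the 24/49/27 weighting is applied; the benchmark's exact normalization scheme is not specified here, and all names and scores are illustrative.

```python
WEIGHTS = {"emotion": 0.24, "evaluation": 0.49, "holistic": 0.27}

def composite(scores_by_group: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weighted composite across metric groups.

    scores_by_group maps group name -> {model: group score}. Each group is
    min-max normalized across models before weighting, so no single group's
    scale dominates. Illustrative scheme, not necessarily the benchmark's.
    """
    models = list(next(iter(scores_by_group.values())))
    out = {m: 0.0 for m in models}
    for group, weight in WEIGHTS.items():
        vals = scores_by_group[group]
        lo, hi = min(vals.values()), max(vals.values())
        for m in models:
            norm = (vals[m] - lo) / (hi - lo) if hi > lo else 0.5
            out[m] += weight * norm
    return out

# Hypothetical group scores for two models.
print(composite({
    "emotion":    {"model_a": 0.62, "model_b": 0.55},
    "evaluation": {"model_a": 0.78, "model_b": 0.81},
    "holistic":   {"model_a": 0.80, "model_b": 0.79},
}))
```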
What the Judge score reflects
Judge and Draft Align scores are produced by a separate LLM evaluator assessing the quality of each model's drafted response. These scores are reported alongside human-annotated metrics for reference, but should be interpreted with the understanding that they reflect one model's judgements rather than direct human feedback. For these experimental runs, mistral-large-2512 was used as the LLM judge; the potential for self-evaluation bias should be kept in mind when reading Mistral Large's scores.
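For readers unfamiliar with the setup, the sketch below shows the general shape of LLM-as-judge scoring: the judge model receives the conversation and the drafted reply and returns a numeric rating. It is purely illustrative; the benchmark's actual judge prompt, rubric, and scoring scale are not reproduced here, and `query_model` is a stand-in for whatever client is used to call the judge model.

```python
import re

# Hypothetical prompt template; the benchmark's real rubric is not shown here.
JUDGE_TEMPLATE = (
    "You are evaluating a drafted reply to an emotionally sensitive message.\n"
    "Conversation:\n{conversation}\n\nDrafted reply:\n{draft}\n\n"
    "Rate the reply's quality from 0 to 100 and answer with the number only."
)

def judge_score(conversation: str, draft: str, query_model) -> float | None:
    """Ask the judge model for a quality rating and parse the first number.

    query_model is any callable that sends a prompt string to the judge model
    (mistral-large-2512 in these runs) and returns its text response.
    """
    reply = query_model(JUDGE_TEMPLATE.format(conversation=conversation, draft=draft))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None
```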
Score scales
Most metrics are reported as percentages from 0–100. Kendall τ and Perspective Gap range from −1 to +1, where 0 represents chance or no difference. Intensity MAE is a raw error value (0–6, lower is better). For binary classification metrics (OM Acc, HP Acc) and pairwise accuracy (PW Acc), chance performance is approximately 50%.
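To make these scales concrete, the snippet below (illustrative, not the benchmark's code) computes a Kendall τ for two hypothetical rankings and shows why random guessing on a binary metric lands near 50%:

```python
import numpy as np
from scipy.stats import kendalltau

# Kendall tau: rank agreement between two orderings, ranging from -1 to +1,
# with 0 meaning no association. Hypothetical rankings of five items.
model_ranks = [1, 2, 3, 4, 5]
participant_ranks = [1, 3, 2, 4, 5]
tau, _ = kendalltau(model_ranks, participant_ranks)
print(f"Kendall tau: {tau:.2f}")  # 0.80: one swapped pair out of ten

# Binary metrics (e.g. OM Acc, HP Acc): random guessing sits near 50%.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
guesses = rng.integers(0, 2, size=10_000)
print(f"Chance-level accuracy: {np.mean(labels == guesses):.3f}")  # ~0.50
```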
Mode differences and sample size
Default mode scores are based on 200 conversations and are the primary basis for comparison. Omniscient mode (the model is given participant background information) and Verbose mode (the model provides reasoning traces) are evaluated on smaller subsamples of 25 and 50 conversations respectively. Differences between models within these modes should be interpreted with this in mind.
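As a rough illustration of why the smaller subsamples call for caution (the numbers below are illustrative, not the benchmark's reported uncertainties), the standard error of an observed proportion shrinks only with the square root of the sample size, so a score measured on 25 conversations carries a much wider confidence interval than the same score measured on 200:

```python
import math

def se_of_proportion(p: float, n: int) -> float:
    """Standard error of an observed proportion p over n conversations."""
    return math.sqrt(p * (1 - p) / n)

# An 80% score measured at each mode's sample size (illustrative only).
for mode, n in [("default", 200), ("verbose", 50), ("omniscient", 25)]:
    se = se_of_proportion(0.80, n)
    print(f"{mode:<10} n={n:<4} approx 95% CI: +/-{1.96 * se:.1%}")
```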
Our "ground truth"
With the exception of Judge and Draft Align, all scores are computed against annotations provided by the human participants themselves — including the emotions they tagged, the behaviors they rated, the responses they preferred, and their post-conversation assessments. Models are not being scored against researcher judgements or synthetic labels.