Leaderboard
Composite: 24% emotion · 49% evaluation · 27% holistic — 200 conversations per model in default mode. Models evaluated April 2026.
Default mode scores are based on 200 conversations. Omniscient (n=25) and Verbose (n=50) results reflect smaller subsamples and should be interpreted with that in mind.
| # | Model | Provider | | | | | |
|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Anthropic | 0.138 | 84.4% | 63.7% | 76.3% | 84.4% |
| 02 | GPT-5.5 | OpenAI | 0.141 | 85.4% | 53.8% | 81.3% | 82.6% |
| 03 | Claude Opus 4.7 | Anthropic | 0.140 | 84.3% | 64.6% | 73.7% | 83.3% |
| 04 | MiMo-v2-Pro | Xiaomi | 0.138 | 84.9% | 59.8% | 77.3% | 78.2% |
| 05 | Gemini 3.1 Pro | Google | 0.133 | 85.9% | 50.3% | 80.9% | 81.3% |
| 06 | Claude Haiku 4.5 | Anthropic | 0.136 | 83.2% | 56.0% | 79.8% | 80.6% |
| 07 | Qwen 2.5 72B | Alibaba | 0.106 | 85.9% | 45.0% | 81.6% | 69.1% |
| 08 | Mistral Large | Mistral | 0.137 | 86.2% | 51.7% | 81.4% | 79.6% |
| 09 | Claude Sonnet 4.6 | Anthropic | 0.138 | 84.1% | 51.2% | 69.8% | 81.4% |
| 10 | GPT-5.4 | OpenAI | 0.138 | 83.9% | 47.0% | 79.2% | 80.7% |
| 11 | Grok 4 | xAI | 0.135 | 83.6% | 46.1% | 79.7% | 79.5% |
How to interpret these results
A short guide to scales, directionality, and the composite score.
Higher is better — with a few exceptions
Most metrics are displayed so that higher numbers indicate better performance. The exceptions are error-based metrics — Intensity MAE, Four-Branch MAE, PANAS Item MAE, Q3 Ordinal Distance, and Perspective Gap — where a lower value means the model is closer to the participant's ground truth. These are flagged as "lower is better" in their tooltips.
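As a rough illustration of what an error-based metric means (not the benchmark's actual implementation; variable names and values are hypothetical), Intensity MAE is simply the mean absolute difference between the model's predicted intensity and the participant's self-reported intensity, so a smaller value means the model sits closer to the participant's own rating:

```python
import numpy as np

def intensity_mae(predicted: np.ndarray, participant_reported: np.ndarray) -> float:
    """Mean absolute error between predicted and participant-reported emotion
    intensities, both on the benchmark's 0-6 scale. Lower is better."""
    return float(np.mean(np.abs(predicted - participant_reported)))

# Hypothetical example: three conversations, intensities on a 0-6 scale.
pred = np.array([4, 2, 5])
truth = np.array([5, 2, 3])
print(intensity_mae(pred, truth))  # 1.0
```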
There's no single winner
Model rankings shift depending on which metric you look at. One of the core findings of this benchmark is that emotional intelligence in LLMs is multidimensional, and different models have different strengths. We recommend reading across the full row rather than anchoring on the Composite score alone.
About the Composite score
The Composite is a normalized aggregate across the benchmark's primary human-annotated metrics, weighted 24% emotion + 49% evaluation + 27% holistic. It is intended as a summary for broad comparison, not a definitive ranking. For any analysis aimed at understanding a model's specific strengths or failure modes, per-metric scores may be more informative than the Composite alone.
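A minimal sketch of how a weighted composite of this kind can be formed is below. It assumes min-max normalization of each metric group across models before the 24/49/27 weighting is applied; the benchmark's exact normalization scheme is not specified here, and all names and scores are illustrative.

```python
WEIGHTS = {"emotion": 0.24, "evaluation": 0.49, "holistic": 0.27}

def composite(scores_by_group: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weighted composite across metric groups.

    scores_by_group maps group name -> {model: group score}. Each group is
    min-max normalized across models before weighting, so no single group's
    scale dominates. Illustrative scheme, not necessarily the benchmark's.
    """
    models = list(next(iter(scores_by_group.values())))
    out = {m: 0.0 for m in models}
    for group, weight in WEIGHTS.items():
        vals = scores_by_group[group]
        lo, hi = min(vals.values()), max(vals.values())
        for m in models:
            norm = (vals[m] - lo) / (hi - lo) if hi > lo else 0.5
            out[m] += weight * norm
    return out

# Hypothetical group scores for two models.
print(composite({
    "emotion":    {"model_a": 0.62, "model_b": 0.55},
    "evaluation": {"model_a": 0.78, "model_b": 0.81},
    "holistic":   {"model_a": 0.80, "model_b": 0.79},
}))
```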
What the Judge score reflects
Judge and Draft Align scores are produced by a separate LLM evaluator assessing the quality of each model's drafted response. These scores are reported alongside human-annotated metrics for reference, but should be interpreted with the understanding that they reflect one model's judgements rather than direct human feedback. For these experimental runs, mistral-large-2512 was used as the LLM judge; the potential for self-evaluation bias should be kept in mind when reading Mistral Large's scores.
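For readers unfamiliar with the setup, the sketch below shows the general shape of LLM-as-judge scoring: the judge model receives the conversation and the drafted reply and returns a numeric rating. It is purely illustrative; the benchmark's actual judge prompt, rubric, and scoring scale are not reproduced here, and `query_model` is a stand-in for whatever client is used to call the judge model.

```python
import re

# Hypothetical prompt template; the benchmark's real rubric is not shown here.
JUDGE_TEMPLATE = (
    "You are evaluating a drafted reply to an emotionally sensitive message.\n"
    "Conversation:\n{conversation}\n\nDrafted reply:\n{draft}\n\n"
    "Rate the reply's quality from 0 to 100 and answer with the number only."
)

def judge_score(conversation: str, draft: str, query_model) -> float | None:
    """Ask the judge model for a quality rating and parse the first number.

    query_model is any callable that sends a prompt string to the judge model
    (mistral-large-2512 in these runs) and returns its text response.
    """
    reply = query_model(JUDGE_TEMPLATE.format(conversation=conversation, draft=draft))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None
```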
Score scales
Most metrics are reported as percentages from 0–100. Kendall τ and Perspective Gap range from −1 to +1, where 0 represents chance or no difference. Intensity MAE is a raw error value (0–6, lower is better). For binary classification metrics (OM Acc, HP Acc) and pairwise accuracy (PW Acc), chance performance is approximately 50%.
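To make these scales concrete, the snippet below (illustrative, not the benchmark's code) computes a Kendall τ for two hypothetical rankings and shows why random guessing on a binary metric lands near 50%:

```python
import numpy as np
from scipy.stats import kendalltau

# Kendall tau: rank agreement between two orderings, ranging from -1 to +1,
# with 0 meaning no association. Hypothetical rankings of five items.
model_ranks = [1, 2, 3, 4, 5]
participant_ranks = [1, 3, 2, 4, 5]
tau, _ = kendalltau(model_ranks, participant_ranks)
print(f"Kendall tau: {tau:.2f}")  # 0.80: one swapped pair out of ten

# Binary metrics (e.g. OM Acc, HP Acc): random guessing sits near 50%.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
guesses = rng.integers(0, 2, size=10_000)
print(f"Chance-level accuracy: {np.mean(labels == guesses):.3f}")  # ~0.50
```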
Mode differences and sample size
Default mode scores are based on 200 conversations and are the primary basis for comparison. Omniscient mode (the model is given participant background information) and Verbose mode (the model provides reasoning traces) are evaluated on smaller subsamples of 25 and 50 conversations respectively. Differences between models within these modes should be interpreted with this in mind.
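As a rough illustration of why the smaller subsamples call for caution (the numbers below are illustrative, not the benchmark's reported uncertainties), the standard error of an observed proportion shrinks only with the square root of the sample size, so a score measured on 25 conversations carries a much wider confidence interval than the same score measured on 200:

```python
import math

def se_of_proportion(p: float, n: int) -> float:
    """Standard error of an observed proportion p over n conversations."""
    return math.sqrt(p * (1 - p) / n)

# An 80% score measured at each mode's sample size (illustrative only).
for mode, n in [("default", 200), ("verbose", 50), ("omniscient", 25)]:
    se = se_of_proportion(0.80, n)
    print(f"{mode:<10} n={n:<4} approx 95% CI: +/-{1.96 * se:.1%}")
```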
Our "ground truth"
With the exception of Judge and Draft Align, all scores are computed against annotations provided by the human participants themselves — including the emotions they tagged, the behaviors they rated, the responses they preferred, and their post-conversation assessments. Models are not being scored against researcher judgements or synthetic labels.