Analysis
What we learned from evaluating emotional intelligence across 11 models and 200 conversations.
Score Distributions
Averages can hide a lot of variance. These box plots show the full spread of composite scores across all 200 conversations per model. The box spans the interquartile range (Q1 to Q3), the line marks the median, and the whiskers extend to 1.5x the IQR. Some models are remarkably consistent. Others swing widely from one conversation to the next.
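For reference, the box statistics are straightforward to compute; a minimal Python sketch, where `scores` stands in for one model's 200 composite scores:

```python
import numpy as np

def box_stats(scores: np.ndarray) -> dict:
    """Quartiles, median, and 1.5x-IQR whisker bounds for one model's scores."""
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme observations within 1.5x the IQR of the box.
    whisker_lo = scores[scores >= q1 - 1.5 * iqr].min()
    whisker_hi = scores[scores <= q3 + 1.5 * iqr].max()
    return {"q1": q1, "median": median, "q3": q3,
            "lo": whisker_lo, "hi": whisker_hi}
```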
Emotion Tracking
How accurately do models name the emotions a participant is feeling at each turn? F1 measures exact tag matches. The VA score gives partial credit for emotionally adjacent predictions, like "afraid" instead of "nervous".
Emotion F1 (exact match)
Valence-Arousal Score (neighborhood credit)
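A minimal sketch of the two scorers, assuming per-turn tag sets and a valence-arousal lookup table; the coordinates, the linear decay rule, and the `radius` threshold here are illustrative, not the benchmark's exact values:

```python
import numpy as np

# Illustrative valence-arousal coordinates; the real lookup covers the full tag set.
VA_COORDS = {"afraid": (-0.6, 0.7), "nervous": (-0.4, 0.6)}

def emotion_f1(pred: set[str], gold: set[str]) -> float:
    """Exact-match F1 between predicted and ground-truth emotion tags."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def va_score(pred: set[str], gold: set[str], radius: float = 0.35) -> float:
    """Partial credit: each gold tag earns credit that decays with the
    distance to its nearest prediction in valence-arousal space."""
    if not pred or not gold:
        return 0.0
    credits = []
    for g in gold:
        nearest = min(np.hypot(*np.subtract(VA_COORDS[g], VA_COORDS[p])) for p in pred)
        credits.append(max(0.0, 1.0 - nearest / radius))
    return float(np.mean(credits))
```

Under this scheme, predicting "afraid" when the gold tag is "nervous" scores 0 on exact-match F1 but keeps part of its VA credit, since the two tags sit close together in valence-arousal space.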
Holistic Thinkers vs. Step-by-Step Annotators
Some models are great at the holistic, conversation-level view. Others are stronger at fine-grained, turn-by-turn annotation. The gap between these two views reveals fundamentally different strategies. Qwen leads on conversation-level scoring (+12.4%), while Opus and MiMo are stronger per turn.
Four-Branch EQ & Preference Prediction
Four-Branch EQ measures how well models rate the Mayer-Salovey dimensions: perceiving, facilitating, understanding, and managing. Pairwise accuracy measures how well a model predicts which response a human would actually prefer.
Four-Branch EQ (normalized)
Pairwise Preference Accuracy
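Pairwise accuracy itself is just an agreement rate with the human label; a sketch assuming a hypothetical long-form table of judged pairs:

```python
import pandas as pd

# Hypothetical columns: `model`, `predicted_winner`, `human_winner` ("A" or "B").
pairs = pd.read_csv("pairwise_judgments.csv")
pairwise_acc = (
    (pairs["predicted_winner"] == pairs["human_winner"])
    .groupby(pairs["model"])
    .mean()
)
print(pairwise_acc.sort_values(ascending=False))
```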
Conversation Quality Assessment
Q1 asks models to identify what the human was actually looking for: a vent, advice, validation, and so on. Q3 asks how well the model's responses fit the human's needs. Interestingly, the Q3 Fit leaders (Opus, Grok) drop to the bottom on Q1 Goals. Identifying what someone wants seems to be a distinct skill from judging response quality.
Q1: Conversation Goals
Q3: Response Fit (exact match)
The Perspective Gap
We ask binary questions two ways: from an outside observer's perspective, and from the human participant's perspective. Most models do worse when they have to think from the human's point of view. Opus is the only model that scores slightly higher from the human perspective (an observer-minus-human gap of -2.1%).
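The gap reported here is the per-model difference between the two accuracies; a sketch assuming a hypothetical long-form results table:

```python
import pandas as pd

# Hypothetical columns: `model`, `perspective` ("observer" or "human"), `correct` (0/1).
df = pd.read_csv("binary_results.csv")
acc = df.pivot_table(index="model", columns="perspective",
                     values="correct", aggfunc="mean")
acc["gap"] = acc["observer"] - acc["human"]  # negative = better from the human's view
print(acc.sort_values("gap"))
```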
Draft Response Quality
Each model drafts its own response before seeing the original model's response, and a judge (Mistral Large) scores the quality. Qwen is a stark outlier, 9 points below the next-lowest model. The pattern suggests it produces technically correct but holistically awkward responses.
Conversation Topics
The 200 conversations span 10 topic categories. The pie chart shows the dataset's composition, and the bars show the average composite score per topic across all models.
Dataset Distribution
Average Composite by Topic
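Both charts reduce to simple aggregations over a per-conversation results table; a sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical columns: `topic` and `composite` (one row per conversation x model).
results = pd.read_csv("conversation_scores.csv")
share = results["topic"].value_counts(normalize=True)      # dataset composition
by_topic = results.groupby("topic")["composite"].mean()    # average score per topic
print(share.round(3))
print(by_topic.sort_values(ascending=False).round(3))
```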
Impact of Participant Diagnosis
Performance broken down by participant-reported mental health diagnoses. The picture is split: on emotion perception (VA score), models score lower for participants reporting anxiety, depression, ASD, or ADHD, so these conversations are harder to read emotionally. On the overall composite the pattern is weaker, since the composite folds in evaluation and holistic metrics, where AnxDep conversations actually score slightly above the no-diagnosis group. For both metrics shown below, higher = better.
Metric Explorer
Explore the relationship between any two metrics across all 1,800 conversation evaluations (200 conversations x 11 models). Pick the axes, toggle models on and off, and hover any point for details.
PANAS Item-Level Prediction
Models predict the participant's post-conversation emotional state across all 20 PANAS items. This heatmap shows the average absolute error per emotion, per model. It reveals which specific emotions are hardest to predict, and whether models systematically over- or under-predict certain affects.
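The heatmap cells are mean absolute errors; a sketch that also tracks signed error to expose systematic over- or under-prediction, assuming a hypothetical long-form predictions table:

```python
import pandas as pd

# Hypothetical columns: `model`, `panas_item`, `predicted`, `actual` (1-5 ratings).
panas = pd.read_csv("panas_predictions.csv")
panas["abs_err"] = (panas["predicted"] - panas["actual"]).abs()
panas["signed_err"] = panas["predicted"] - panas["actual"]  # >0 means over-prediction

# One heatmap cell per (emotion, model): average absolute error.
mae = panas.pivot_table(index="panas_item", columns="model",
                        values="abs_err", aggfunc="mean")
bias = panas.groupby("panas_item")["signed_err"].mean()
print(mae.round(2))
print(bias.sort_values().round(2))
```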
Performance Across Conversation Position
Do models hold their quality throughout a conversation, or do they fade in later turns? Scores are split into early, middle, and late thirds of each conversation.
| Model | Emotion F1 (Early) | Emotion F1 (Mid) | Emotion F1 (Late) | Binary Acc (Early) | Binary Acc (Mid) | Binary Acc (Late) | Pairwise (Early) | Pairwise (Mid) | Pairwise (Late) | Draft Judge (Early) | Draft Judge (Mid) | Draft Judge (Late) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 0.141 | 0.141 | 0.130 | 86.6% | 83.0% | 82.7% | 62.6% | 65.8% | 64.2% | 84.7% | 84.6% | 83.3% |
| GPT-5.5 | 0.157 | 0.153 | 0.105 | 87.5% | 84.5% | 83.4% | 53.8% | 54.4% | 53.0% | 82.7% | 82.0% | 80.6% |
| Claude Opus 4.7 | 0.138 | 0.157 | 0.124 | 85.1% | 84.3% | 82.9% | 64.1% | 65.4% | 64.8% | 83.7% | 83.0% | 83.0% |
| MiMo-v2-Pro | 0.138 | 0.149 | 0.121 | 86.5% | 84.3% | 82.9% | 60.1% | 59.8% | 60.4% | 78.3% | 78.8% | 76.6% |
| Gemini 3.1 Pro | 0.139 | 0.140 | 0.121 | 87.4% | 85.1% | 84.6% | 51.5% | 51.1% | 47.2% | 82.0% | 81.3% | 80.2% |
| Claude Haiku 4.5 | 0.136 | 0.148 | 0.129 | 86.4% | 81.6% | 80.4% | 55.0% | 57.1% | 56.6% | 80.7% | 80.7% | 80.3% |
| Qwen 2.5 72B | 0.113 | 0.119 | 0.088 | 85.6% | 85.9% | 86.2% | 47.7% | 46.7% | 39.9% | 68.9% | 68.2% | 69.4% |
| Mistral Large | 0.140 | 0.151 | 0.119 | 87.2% | 85.7% | 85.3% | 51.4% | 53.1% | 50.7% | 78.7% | 80.1% | 79.5% |
| Claude Sonnet 4.6 | 0.131 | 0.152 | 0.134 | 85.8% | 83.2% | 82.7% | 50.6% | 52.7% | 50.6% | 82.2% | 81.2% | 79.8% |
| GPT-5.4 | 0.143 | 0.152 | 0.115 | 86.4% | 82.7% | 81.6% | 47.9% | 47.9% | 44.6% | 82.0% | 80.5% | 78.8% |
| Grok 4 | 0.164 | 0.124 | 0.106 | 85.6% | 82.9% | 81.1% | 47.0% | 47.6% | 43.4% | 79.1% | 80.2% | 79.1% |
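The thirds in the table above come from binning each turn's relative position within its conversation; a minimal sketch, assuming hypothetical per-turn score records:

```python
import pandas as pd

# Hypothetical columns: `conversation_id`, `turn_index`, `model`, `score`.
turns = pd.read_csv("per_turn_scores.csv")
n_turns = turns.groupby("conversation_id")["turn_index"].transform("max") + 1
turns["position"] = pd.cut(
    turns["turn_index"] / n_turns,          # relative position in [0, 1)
    bins=[0, 1 / 3, 2 / 3, 1.0],
    labels=["early", "mid", "late"],
    include_lowest=True,
)
print(turns.pivot_table(index="model", columns="position",
                        values="score", aggfunc="mean", observed=True))
```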
Effect of Evaluation Mode
Does giving the model extra context (omniscient mode, with the participant profile and pre-PANAS) or asking it to reason through its answers (verbose mode) actually improve emotional intelligence?
Composite Score by Mode
| Model | Provider | Default | Omniscient | Verbose | Δ Omni | Δ Verbose |
|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | 4.630 | 4.699 | 4.522 | +0.069 | -0.108 |
| Claude Opus 4.6 | Anthropic | 4.735 | 4.783 | 4.660 | +0.048 | -0.075 |
| Claude Opus 4.7 | Anthropic | 4.645 | 4.722 | 4.684 | +0.076 | +0.038 |
| Claude Sonnet 4.6 | Anthropic | 4.401 | 4.313 | 4.352 | -0.088 | -0.049 |
| Gemini 3.1 Pro | Google | 4.681 | 4.786 | 4.525 | +0.105 | -0.156 |
| Mistral Large | Mistral | 4.576 | 4.549 | 4.422 | -0.027 | -0.154 |
| GPT-5.4 | OpenAI | 4.423 | 4.336 | 4.308 | -0.086 | -0.115 |
| GPT-5.5 | OpenAI | 4.737 | 4.641 | 4.739 | -0.096 | +0.002 |
| Qwen 2.5 72B | Alibaba | 4.669 | 4.707 | 4.559 | +0.038 | -0.110 |
| Grok 4 | xAI | 4.464 | 4.485 | 4.432 | +0.021 | -0.032 |
| MiMo-v2-Pro | Xiaomi | 4.672 | 4.598 | 4.596 | -0.074 | -0.076 |
Mood Shift & Emotional Trajectory
How do participants' emotions evolve over the course of a conversation? Ground-truth mood shift tags reveal the emotional arc of each interaction, mapped onto the How We Feel valence-arousal framework.
Average valence, arousal, and intensity of ground-truth emotion tags across all 200 conversations, by turn position
Valence Shift: First Half vs Second Half
Each dot is a conversation. Points above the diagonal indicate valence increased during the conversation.
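The scatter reduces each conversation to two numbers; a sketch of that reduction, assuming hypothetical per-turn valence tags:

```python
import pandas as pd

# Hypothetical columns: `conversation_id`, `turn_index`, `valence` (in [-1, 1]).
tags = pd.read_csv("emotion_tags.csv")

def half_means(group: pd.DataFrame) -> pd.Series:
    """Mean ground-truth valence over the first and second halves of one conversation."""
    vals = group.sort_values("turn_index")["valence"].to_numpy()
    mid = len(vals) // 2
    return pd.Series({"first_half": vals[:mid].mean(), "second_half": vals[mid:].mean()})

halves = tags.groupby("conversation_id").apply(half_means)
improved = (halves["second_half"] > halves["first_half"]).mean()
print(f"{improved:.0%} of conversations ended above the diagonal")
```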
Temporal Performance Analysis
Do models stay consistent throughout a conversation, or do they degrade over time? The stuck rate measures how often a model's per-turn score drops more than one standard deviation below its mean.
Fraction of turns where a model's per-turn score falls more than one standard deviation below its own mean. Lower is better.
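A minimal sketch of the stuck-rate definition:

```python
import numpy as np

def stuck_rate(per_turn_scores: np.ndarray) -> float:
    """Fraction of turns scoring more than one standard deviation
    below this model's own mean per-turn score. Lower is better."""
    threshold = per_turn_scores.mean() - per_turn_scores.std()
    return float((per_turn_scores < threshold).mean())
```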
Statistical Significance
Pairwise Wilcoxon signed-rank tests, with Holm-Bonferroni correction applied. 25 of 36 model pairs are significant at p<0.05 (adjusted).
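A sketch of the test procedure, assuming a hypothetical `scores` mapping from each model name to its per-conversation composite scores, aligned so the arrays are paired by conversation:

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_tests(scores: dict[str, np.ndarray]):
    """Paired Wilcoxon signed-rank test for every model pair,
    with Holm-Bonferroni step-down correction across all pairs."""
    pairs = list(combinations(scores, 2))
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    return [(a, b, p, sig) for (a, b), p, sig in zip(pairs, p_adj, reject)]
```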
Kruskal-Wallis omnibus test (are models significantly different?)
| Metric | H statistic | p-value | Effect (η²) | Sig |
|---|---|---|---|---|
| Composite Score | 79.88 | 5.18e-14 | 0.0401 | *** |
| Emotion F1 | 14.45 | 0.0707 | 0.0036 | . |
| Emotion VA Score | 27.06 | 0.0007 | 0.0106 | *** |
| Binary OM Accuracy | 45.61 | 2.82e-7 | 0.0210 | *** |
| Binary HP Accuracy | 104.19 | 5.93e-19 | 0.0537 | *** |
| Pairwise Accuracy | 312.33 | 9.77e-63 | 0.1699 | *** |
| Draft Judge Score | 543.13 | 3.88e-112 | 0.2988 | *** |
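The η² column is consistent with the standard Kruskal-Wallis effect-size estimate, η² = (H - k + 1) / (n - k) for k groups and n total observations; a sketch of the omnibus test, reusing the hypothetical `scores` mapping from the pairwise sketch above:

```python
from scipy.stats import kruskal

def omnibus(scores):
    """Kruskal-Wallis H test across all models, with an eta-squared effect size."""
    groups = list(scores.values())
    H, p = kruskal(*groups)
    k, n = len(groups), sum(len(g) for g in groups)
    eta_sq = (H - k + 1) / (n - k)  # standard eta-squared estimate for Kruskal-Wallis
    return H, p, eta_sq
```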
Pairwise model comparisons (composite score)
| Model A | Model B | Δ | p (adj) | Sig | Effect \|r\| (L = large) |
|---|---|---|---|---|---|
| Claude Opus 4.6 | MiMo-v2-Pro | +0.69 | 0.1952 | ns | 0.594 L |
| Claude Opus 4.6 | Gemini 3.1 Pro | +1.37 | 0.0036 | ** | 0.635 L |
| Claude Opus 4.6 | Claude Haiku 4.5 | +1.47 | 0.0022 | ** | 0.657 L |
| Claude Opus 4.6 | Qwen 2.5 72B | +2.11 | 0.0001 | *** | 0.693 L |
| Claude Opus 4.6 | Mistral Large | +2.64 | <0.0001 | *** | 0.740 L |
| Claude Opus 4.6 | Claude Sonnet 4.6 | +4.01 | <0.0001 | *** | 0.899 L |
| Claude Opus 4.6 | Grok 4 | +4.16 | <0.0001 | *** | 0.826 L |
| Claude Opus 4.6 | GPT-5.4 | +4.06 | <0.0001 | *** | 0.822 L |
| MiMo-v2-Pro | Claude Sonnet 4.6 | +3.32 | <0.0001 | *** | 0.831 L |
| MiMo-v2-Pro | Grok 4 | +3.47 | <0.0001 | *** | 0.797 L |
| MiMo-v2-Pro | GPT-5.4 | +3.37 | <0.0001 | *** | 0.774 L |
| Gemini 3.1 Pro | GPT-5.4 | +2.69 | <0.0001 | *** | 0.748 L |
| Claude Haiku 4.5 | Claude Sonnet 4.6 | +2.54 | <0.0001 | *** | 0.776 L |
| Claude Haiku 4.5 | GPT-5.4 | +2.59 | <0.0001 | *** | 0.734 L |