Analysis
What we learned from evaluating emotional intelligence across 11 models and 200 conversations.
Score Distributions
Averages can hide a lot of variance. These box plots show the full spread of composite scores across all 200 conversations per model. The box spans the interquartile range (Q1 to Q3), the line marks the median, and the whiskers extend to 1.5x the IQR. Some models are remarkably consistent. Others swing widely from one conversation to the next.
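For reference, the box statistics are straightforward to compute; a minimal Python sketch, where `scores` stands in for one model's 200 composite scores:

```python
import numpy as np

def box_stats(scores: np.ndarray) -> dict:
    """Quartiles, median, and 1.5x-IQR whisker bounds for one model's scores."""
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme observations within 1.5x the IQR of the box.
    whisker_lo = scores[scores >= q1 - 1.5 * iqr].min()
    whisker_hi = scores[scores <= q3 + 1.5 * iqr].max()
    return {"q1": q1, "median": median, "q3": q3,
            "lo": whisker_lo, "hi": whisker_hi}
```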
Emotion Tracking
How accurately do models name the emotions a participant is feeling at each turn? F1 measures exact tag matches. The VA score gives partial credit for emotionally adjacent predictions, like "afraid" instead of "nervous".
Emotion F1 (exact match)
Valence-Arousal Score (neighborhood credit)
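A minimal sketch of the two scorers, assuming per-turn tag sets and a valence-arousal lookup table; the coordinates, the linear decay rule, and the `radius` threshold here are illustrative, not the benchmark's exact values:

```python
import numpy as np

# Illustrative valence-arousal coordinates; the real lookup covers the full tag set.
VA_COORDS = {"afraid": (-0.6, 0.7), "nervous": (-0.4, 0.6)}

def emotion_f1(pred: set[str], gold: set[str]) -> float:
    """Exact-match F1 between predicted and ground-truth emotion tags."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def va_score(pred: set[str], gold: set[str], radius: float = 0.35) -> float:
    """Partial credit: each gold tag earns credit that decays with the
    distance to its nearest prediction in valence-arousal space."""
    if not pred or not gold:
        return 0.0
    credits = []
    for g in gold:
        nearest = min(np.hypot(*np.subtract(VA_COORDS[g], VA_COORDS[p])) for p in pred)
        credits.append(max(0.0, 1.0 - nearest / radius))
    return float(np.mean(credits))
```

Under this scheme, predicting "afraid" when the gold tag is "nervous" scores 0 on exact-match F1 but keeps part of its VA credit, since the two tags sit close together in valence-arousal space.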
Holistic Thinkers vs. Step-by-Step Annotators
Some models are great at the holistic, conversation-level view. Others are stronger at fine-grained, turn-by-turn annotation. The gap between these two views reveals fundamentally different strategies. Qwen leads on conversation-level scoring (+12.4%), while Opus and MiMo are stronger per turn.
Four-Branch EQ & Preference Prediction
Four-Branch EQ measures how well models rate the Mayer-Salovey dimensions: perceiving, facilitating, understanding, and managing. Pairwise accuracy measures how well a model predicts which response a human would actually prefer.
Four-Branch EQ (normalized)
Pairwise Preference Accuracy
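Pairwise accuracy itself is just an agreement rate with the human label; a sketch assuming a hypothetical long-form table of judged pairs:

```python
import pandas as pd

# Hypothetical columns: `model`, `predicted_winner`, `human_winner` ("A" or "B").
pairs = pd.read_csv("pairwise_judgments.csv")
pairwise_acc = (
    (pairs["predicted_winner"] == pairs["human_winner"])
    .groupby(pairs["model"])
    .mean()
)
print(pairwise_acc.sort_values(ascending=False))
```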
Conversation Quality Assessment
Q1 asks models to identify what the human was actually looking for: a vent, advice, validation, and so on. Q3 asks how well the model's responses fit the human's needs. Interestingly, the Q3 Fit leaders (Opus, Grok) drop to the bottom on Q1 Goals. Identifying what someone wants seems to be a distinct skill from judging response quality.
Q1: Conversation Goals
Q3: Response Fit (exact match)
The Perspective Gap
We ask binary questions two ways: from an outside observer's perspective, and from the human participant's perspective. Most models do worse when they have to think from the human's point of view. Opus is the only model that scores slightly higher from the human perspective (an observer-minus-human gap of -2.1%).
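The gap reported here is the per-model difference between the two accuracies; a sketch assuming a hypothetical long-form results table:

```python
import pandas as pd

# Hypothetical columns: `model`, `perspective` ("observer" or "human"), `correct` (0/1).
df = pd.read_csv("binary_results.csv")
acc = df.pivot_table(index="model", columns="perspective",
                     values="correct", aggfunc="mean")
acc["gap"] = acc["observer"] - acc["human"]  # negative = better from the human's view
print(acc.sort_values("gap"))
```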
Draft Response Quality
Each model drafts its own response before seeing the original model's response, and a judge (Mistral Large) scores the quality. Qwen is a stark outlier, 9 points below the next-lowest model. The pattern suggests it produces technically correct but holistically awkward responses.
Conversation Topics
The 200 conversations span 10 topic categories. The pie chart shows the dataset's composition, and the bars show the average composite score per topic across all models.
Dataset Distribution
Average Composite by Topic
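Both charts reduce to simple aggregations over a per-conversation results table; a sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical columns: `topic` and `composite` (one row per conversation x model).
results = pd.read_csv("conversation_scores.csv")
share = results["topic"].value_counts(normalize=True)      # dataset composition
by_topic = results.groupby("topic")["composite"].mean()    # average score per topic
print(share.round(3))
print(by_topic.sort_values(ascending=False).round(3))
```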
Impact of Participant Diagnosis
Performance broken down by participant-reported mental health diagnoses. The picture is split: on emotion perception (VA score), models score lower for participants reporting anxiety, depression, ASD, or ADHD, so these conversations are harder to read emotionally. On the overall composite the pattern is weaker, since the composite folds in evaluation and holistic metrics, where AnxDep conversations actually score slightly above the no-diagnosis group. For both metrics shown below, higher = better.
Metric Explorer
Explore the relationship between any two metrics across all 1,800 conversation evaluations (200 conversations x 11 models). Pick the axes, toggle models on and off, and hover any point for details.
PANAS Item-Level Prediction
Models predict the participant's post-conversation emotional state across all 20 PANAS items. This heatmap shows the average absolute error per emotion, per model. It reveals which specific emotions are hardest to predict, and whether models systematically over- or under-predict certain affects.
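The heatmap cells are mean absolute errors; a sketch that also tracks signed error to expose systematic over- or under-prediction, assuming a hypothetical long-form predictions table:

```python
import pandas as pd

# Hypothetical columns: `model`, `panas_item`, `predicted`, `actual` (1-5 ratings).
panas = pd.read_csv("panas_predictions.csv")
panas["abs_err"] = (panas["predicted"] - panas["actual"]).abs()
panas["signed_err"] = panas["predicted"] - panas["actual"]  # >0 means over-prediction

# One heatmap cell per (emotion, model): average absolute error.
mae = panas.pivot_table(index="panas_item", columns="model",
                        values="abs_err", aggfunc="mean")
bias = panas.groupby("panas_item")["signed_err"].mean()
print(mae.round(2))
print(bias.sort_values().round(2))
```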
Performance Across Conversation Position
Do models hold their quality throughout a conversation, or do they fade in later turns? Scores are split into early, middle, and late thirds of each conversation.
| Model | Emotion F1 (Early) | Emotion F1 (Mid) | Emotion F1 (Late) | Binary Acc (Early) | Binary Acc (Mid) | Binary Acc (Late) | Pairwise (Early) | Pairwise (Mid) | Pairwise (Late) | Draft Judge (Early) | Draft Judge (Mid) | Draft Judge (Late) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 0.141 | 0.141 | 0.130 | 86.6% | 83.0% | 82.7% | 62.6% | 65.8% | 64.2% | 84.7% | 84.6% | 83.3% |
| GPT-5.5 | 0.157 | 0.153 | 0.105 | 87.5% | 84.5% | 83.4% | 53.8% | 54.4% | 53.0% | 82.7% | 82.0% | 80.6% |
| Claude Opus 4.7 | 0.138 | 0.157 | 0.124 | 85.1% | 84.3% | 82.9% | 64.1% | 65.4% | 64.8% | 83.7% | 83.0% | 83.0% |
| MiMo-v2-Pro | 0.138 | 0.149 | 0.121 | 86.5% | 84.3% | 82.9% | 60.1% | 59.8% | 60.4% | 78.3% | 78.8% | 76.6% |
| Gemini 3.1 Pro | 0.139 | 0.140 | 0.121 | 87.4% | 85.1% | 84.6% | 51.5% | 51.1% | 47.2% | 82.0% | 81.3% | 80.2% |
| Claude Haiku 4.5 | 0.136 | 0.148 | 0.129 | 86.4% | 81.6% | 80.4% | 55.0% | 57.1% | 56.6% | 80.7% | 80.7% | 80.3% |
| Qwen 2.5 72B | 0.113 | 0.119 | 0.088 | 85.6% | 85.9% | 86.2% | 47.7% | 46.7% | 39.9% | 68.9% | 68.2% | 69.4% |
| Mistral Large | 0.140 | 0.151 | 0.119 | 87.2% | 85.7% | 85.3% | 51.4% | 53.1% | 50.7% | 78.7% | 80.1% | 79.5% |
| Claude Sonnet 4.6 | 0.131 | 0.152 | 0.134 | 85.8% | 83.2% | 82.7% | 50.6% | 52.7% | 50.6% | 82.2% | 81.2% | 79.8% |
| GPT-5.4 | 0.143 | 0.152 | 0.115 | 86.4% | 82.7% | 81.6% | 47.9% | 47.9% | 44.6% | 82.0% | 80.5% | 78.8% |
| Grok 4 | 0.164 | 0.124 | 0.106 | 85.6% | 82.9% | 81.1% | 47.0% | 47.6% | 43.4% | 79.1% | 80.2% | 79.1% |
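The thirds in the table above come from binning each turn's relative position within its conversation; a minimal sketch, assuming hypothetical per-turn score records:

```python
import pandas as pd

# Hypothetical columns: `conversation_id`, `turn_index`, `model`, `score`.
turns = pd.read_csv("per_turn_scores.csv")
n_turns = turns.groupby("conversation_id")["turn_index"].transform("max") + 1
turns["position"] = pd.cut(
    turns["turn_index"] / n_turns,          # relative position in [0, 1)
    bins=[0, 1 / 3, 2 / 3, 1.0],
    labels=["early", "mid", "late"],
    include_lowest=True,
)
print(turns.pivot_table(index="model", columns="position",
                        values="score", aggfunc="mean", observed=True))
```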
Effect of Evaluation Mode
Does giving the model extra context (omniscient mode, with the participant profile and pre-PANAS) or asking it to reason through its answers (verbose mode) actually improve emotional intelligence?
Composite Score by Mode
| Model | Provider | Default | Omniscient | Verbose | Δ Omni | Δ Verbose |
|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | 4.630 | 4.699 | 4.522 | +0.069 | -0.108 |
| Claude Opus 4.6 | Anthropic | 4.735 | 4.783 | 4.660 | +0.048 | -0.075 |
| Claude Opus 4.7 | Anthropic | 4.645 | 4.722 | 4.684 | +0.076 | +0.038 |
| Claude Sonnet 4.6 | Anthropic | 4.401 | 4.313 | 4.352 | -0.088 | -0.049 |
| Gemini 3.1 Pro | Google | 4.681 | 4.786 | 4.525 | +0.105 | -0.156 |
| Mistral Large | Mistral | 4.576 | 4.549 | 4.422 | -0.027 | -0.154 |
| GPT-5.4 | OpenAI | 4.423 | 4.336 | 4.308 | -0.086 | -0.115 |
| GPT-5.5 | OpenAI | 4.737 | 4.641 | 4.739 | -0.096 | +0.002 |
| Qwen 2.5 72B | Alibaba | 4.669 | 4.707 | 4.559 | +0.038 | -0.110 |
| Grok 4 | xAI | 4.464 | 4.485 | 4.432 | +0.021 | -0.032 |
| MiMo-v2-Pro | Xiaomi | 4.672 | 4.598 | 4.596 | -0.074 | -0.076 |
Mood Shift & Emotional Trajectory
How do participants' emotions evolve over the course of a conversation? Ground-truth mood shift tags reveal the emotional arc of each interaction, mapped onto the How We Feel valence-arousal framework.
Average valence, arousal, and intensity of ground-truth emotion tags across all 200 conversations, by turn position
Valence Shift: First Half vs Second Half
Each dot is a conversation. Points above the diagonal indicate valence increased during the conversation.
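The scatter reduces each conversation to two numbers; a sketch of that reduction, assuming hypothetical per-turn valence tags:

```python
import pandas as pd

# Hypothetical columns: `conversation_id`, `turn_index`, `valence` (in [-1, 1]).
tags = pd.read_csv("emotion_tags.csv")

def half_means(group: pd.DataFrame) -> pd.Series:
    """Mean ground-truth valence over the first and second halves of one conversation."""
    vals = group.sort_values("turn_index")["valence"].to_numpy()
    mid = len(vals) // 2
    return pd.Series({"first_half": vals[:mid].mean(), "second_half": vals[mid:].mean()})

halves = tags.groupby("conversation_id").apply(half_means)
improved = (halves["second_half"] > halves["first_half"]).mean()
print(f"{improved:.0%} of conversations ended above the diagonal")
```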
Temporal Performance Analysis
Do models stay consistent throughout a conversation, or do they degrade over time? The stuck rate measures how often a model's per-turn score drops more than one standard deviation below its mean.
Fraction of turns where a model's per-turn score falls more than one standard deviation below its own mean. Lower is better.
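A minimal sketch of the stuck-rate definition:

```python
import numpy as np

def stuck_rate(per_turn_scores: np.ndarray) -> float:
    """Fraction of turns scoring more than one standard deviation
    below this model's own mean per-turn score. Lower is better."""
    threshold = per_turn_scores.mean() - per_turn_scores.std()
    return float((per_turn_scores < threshold).mean())
```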
Statistical Significance
Pairwise Wilcoxon signed-rank tests, with Holm-Bonferroni correction applied. 25 of 36 model pairs are significant at p<0.05 (adjusted).
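A sketch of the test procedure, assuming a hypothetical `scores` mapping from each model name to its per-conversation composite scores, aligned so the arrays are paired by conversation:

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_tests(scores: dict[str, np.ndarray]):
    """Paired Wilcoxon signed-rank test for every model pair,
    with Holm-Bonferroni step-down correction across all pairs."""
    pairs = list(combinations(scores, 2))
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    return [(a, b, p, sig) for (a, b), p, sig in zip(pairs, p_adj, reject)]
```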
Kruskal-Wallis omnibus test (are models significantly different?)
| Metric | H statistic | p-value | Effect (η²) | Sig |
|---|---|---|---|---|
| Composite Score | 79.88 | 5.18e-14 | 0.0401 | *** |
| Emotion F1 | 14.45 | 0.0707 | 0.0036 | . |
| Emotion VA Score | 27.06 | 0.0007 | 0.0106 | *** |
| Binary OM Accuracy | 45.61 | 2.82e-7 | 0.0210 | *** |
| Binary HP Accuracy | 104.19 | 5.93e-19 | 0.0537 | *** |
| Pairwise Accuracy | 312.33 | 9.77e-63 | 0.1699 | *** |
| Draft Judge Score | 543.13 | 3.88e-112 | 0.2988 | *** |
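The η² column is consistent with the standard Kruskal-Wallis effect-size estimate, η² = (H - k + 1) / (n - k) for k groups and n total observations; a sketch of the omnibus test, reusing the hypothetical `scores` mapping from the pairwise sketch above:

```python
from scipy.stats import kruskal

def omnibus(scores):
    """Kruskal-Wallis H test across all models, with an eta-squared effect size."""
    groups = list(scores.values())
    H, p = kruskal(*groups)
    k, n = len(groups), sum(len(g) for g in groups)
    eta_sq = (H - k + 1) / (n - k)  # standard eta-squared estimate for Kruskal-Wallis
    return H, p, eta_sq
```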
Pairwise model comparisons (composite score)
| Model A | Model B | Δ | p (adj) | Sig | Effect \|r\| (L = large) |
|---|---|---|---|---|---|
| Claude Opus 4.6 | MiMo-v2-Pro | +0.69 | 0.1952 | ns | 0.594 L |
| Claude Opus 4.6 | Gemini 3.1 Pro | +1.37 | 0.0036 | ** | 0.635 L |
| Claude Opus 4.6 | Claude Haiku 4.5 | +1.47 | 0.0022 | ** | 0.657 L |
| Claude Opus 4.6 | Qwen 2.5 72B | +2.11 | 0.0001 | *** | 0.693 L |
| Claude Opus 4.6 | Mistral Large | +2.64 | <0.0001 | *** | 0.740 L |
| Claude Opus 4.6 | Claude Sonnet 4.6 | +4.01 | <0.0001 | *** | 0.899 L |
| Claude Opus 4.6 | Grok 4 | +4.16 | <0.0001 | *** | 0.826 L |
| Claude Opus 4.6 | GPT-5.4 | +4.06 | <0.0001 | *** | 0.822 L |
| MiMo-v2-Pro | Claude Sonnet 4.6 | +3.32 | <0.0001 | *** | 0.831 L |
| MiMo-v2-Pro | Grok 4 | +3.47 | <0.0001 | *** | 0.797 L |
| MiMo-v2-Pro | GPT-5.4 | +3.37 | <0.0001 | *** | 0.774 L |
| Gemini 3.1 Pro | GPT-5.4 | +2.69 | <0.0001 | *** | 0.748 L |
| Claude Haiku 4.5 | Claude Sonnet 4.6 | +2.54 | <0.0001 | *** | 0.776 L |
| Claude Haiku 4.5 | GPT-5.4 | +2.59 | <0.0001 | *** | 0.734 L |