How Well Do LLMs Understand Emotions?

We tested 11 frontier models on 200 real human-AI conversations to see how well they perceive, understand, and respond to what people are actually feeling.

Background

Around 13% of US adults, including over 20% of younger users, are already using LLMs for mental health guidance. Models are being used to process grief, navigate conflict, and manage anxiety, often without any evaluation of whether they're actually doing it well.

Existing benchmarks measure knowledge and reasoning. They don't measure whether a model correctly reads a person's emotional state, whether its responses fit what the person actually needed, or how well it tracks the way someone's mood shifts over the course of a conversation.

This project aims to bridge that gap by evaluating emotional intelligence across real multi-turn conversations between human participants and AI models. The result is a multidimensional picture of where models succeed, where they fall short, and what emotional intelligence actually requires in practice.

How it works

Human participants completed conversations with AI models across a range of topics, from everyday hobbies to relationships, health, and personal finances. Their emotional state was measured before and after each conversation. They annotated what emotions they were feeling, whether the model's behavior matched what they wanted, and which responses they preferred.

Eleven LLMs were then evaluated on those conversations, predicting participant emotions turn by turn, assessing model behavior, and ranking response quality, all scored against the human annotations as ground truth. The resulting metrics capture different facets of emotional intelligence, including emotion recognition, preference alignment, behavioral appropriateness, and conversational tracking.
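To make that setup concrete, here is a minimal sketch of the kind of record each turn produces for scoring. The field names are hypothetical, chosen for illustration; they are not the benchmark's actual schema.

    # Illustrative sketch only: names are hypothetical, not the benchmark's real schema.
    from dataclasses import dataclass

    @dataclass
    class TurnAnnotation:                 # what the participant reported
        emotions: list[str]               # e.g. ["anxiety", "fear"]
        behavior_matched: bool            # did the model's behavior fit what they wanted?
        preferred_response: int           # index of the response they preferred

    @dataclass
    class ModelJudgment:                  # what an evaluated LLM predicts for the same turn
        predicted_emotions: list[str]
        behavior_assessment: bool
        response_ranking: list[int]

    # Scoring compares each ModelJudgment to its TurnAnnotation: label overlap
    # for emotions, agreement for behavior, rank quality for preferences.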

Emotional intelligence in LLMs isn't a single capability — it's several, and different models may exhibit different strengths and weaknesses.

200 conversations · 12 participants
11 models · 7 providers
20+ metrics · emotion, evaluation & holistic
3 modes · default / verbose / omniscient

Each dot is one conversation

200 multi-turn chats between people and AI models covered everything from relationship anxiety to career doubts to family dynamics. The participants annotated their emotional state across turns, so we know what they were actually feeling.

Claude Opus 4.6 currently leads, with a composite score of 54.3. The composite blends three things: how well a model perceives emotions, how well it assesses behavior, and how well it understands the conversation as a whole.
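The exact weighting isn't shown here. Assuming an unweighted mean of the three components, the blend would look like this sketch:

    # Assumption: equal weights over the three components; the benchmark's
    # actual weighting isn't stated in the text.
    def composite(emotion_perception: float,
                  behavior_assessment: float,
                  holistic_understanding: float) -> float:
        return (emotion_perception + behavior_assessment + holistic_understanding) / 3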

From Anthropic to xAI, we tested the leading models. Composite scores range from 50.1 at the bottom to 54.3 at the top.

The gap between top and bottom isn't huge. Every model exhibits similar overall emotional intelligence capabilities. But within that narrow range, the ranking reveals very different strengths.

We can split performance two ways: reading individual moments (turn-level) versus understanding the whole arc (conversation-wide). When we do, models diverge sharply.

Qwen 2.5 nails the big picture (+12.4% conversation advantage) but is the weakest turn-by-turn. Opus and MiMo do the opposite. They're stronger at reading individual moments.
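One way to read that advantage figure, assuming it is a simple relative difference between a model's conversation-wide score and its turn-level score:

    # Assumption: "conversation advantage" as a percentage difference between
    # conversation-wide and turn-level scores for the same model.
    def conversation_advantage(conversation_score: float, turn_score: float) -> float:
        return (conversation_score - turn_score) / turn_score * 100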

Ask a model "what is this person feeling right now?" and 9 of 11 cluster in a narrow F1 range from 0.133 to 0.141. The ceiling is low. Even the best models miss more emotions than they catch.
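A set-based F1 of this kind can be computed per turn as in the sketch below; names are illustrative, and the benchmark's exact aggregation may differ.

    # Minimal sketch: the model's predicted emotion labels against the ones
    # the participant actually reported, scored as set-based F1.
    def emotion_f1(predicted: set[str], reported: set[str]) -> float:
        if not predicted or not reported:
            return 1.0 if predicted == reported else 0.0
        overlap = len(predicted & reported)
        if overlap == 0:
            return 0.0
        precision = overlap / len(predicted)
        recall = overlap / len(reported)
        return 2 * precision * recall / (precision + recall)

    # Naming only "anxiety" when the participant reported {"anxiety", "fear"}
    # scores about 0.67; naming "sadness" scores 0.0.
    print(emotion_f1({"anxiety"}, {"anxiety", "fear"}))  # 0.666...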

“I was reading TikTok comments and one said couples who start dating young have a higher chance of divorcing and it scared me because that’s me and my boyfriend.”

Diagnosed anxiety, two emotions at once. This is where models stumble most.

The Mayer-Salovey framework breaks emotional intelligence into four branches. "Understanding" (predicting how emotions might combine, change, and develop) consistently has the highest error across all models. "Perceiving" is the easiest.

Participants reported their mental health diagnoses. When we split emotion-perception performance by this factor, a clear gap emerges: models read emotion less accurately for participants with certain reported diagnoses. (Emotion VA score is a valence-arousal similarity, where higher means closer to the participant's reported feeling.)
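A valence-arousal similarity of this kind can be sketched as one minus a normalized distance in VA space. The scaling here is an assumption for illustration, not the benchmark's exact formula.

    # Assumption: valence and arousal both lie in [-1, 1]; similarity is
    # 1 minus the Euclidean distance, normalized by the maximum possible distance.
    import math

    def va_similarity(pred: tuple[float, float], reported: tuple[float, float]) -> float:
        max_dist = math.dist((-1.0, -1.0), (1.0, 1.0))   # 2 * sqrt(2)
        return 1.0 - math.dist(pred, reported) / max_dist

    print(va_similarity((0.2, 0.5), (-0.4, 0.7)))  # ~0.78: a fairly close read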

On emotion-perception specifically, participants with anxiety, depression, ASD, or ADHD diagnoses are read less accurately than non-diagnosed participants. The gap reflects the complexity of emotional expression in people managing these conditions, not topic difficulty.

“I know but I feel like I didn’t really know myself at 19 so how could I have known he was right for me. I look back and I was a completely different person.”

Same participant. Anxiety layered with self-doubt.

We asked binary questions in two ways: "Did the response acknowledge emotional content in the HP's message?" (observer framing, where HP is the human participant) and "...in your message?" (human perspective). Most models perform worse when they have to step into the human's shoes.

Q3 asks how well the model's responses fit what the human actually needed as the conversation unfolded. Mistral is a striking outlier. It's essentially at chance, with an ordinal distance of 1.74, meaning it systematically disagrees with humans about conversation quality.
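An ordinal distance like this can be read as the average gap, in scale steps, between the model's rating and the human's. The sketch below assumes a simple mean absolute difference; the benchmark's exact scale and aggregation aren't shown here.

    # Assumption: ratings share one ordinal scale; distance is the mean
    # absolute gap between model and human ratings across conversations.
    def ordinal_distance(model_ratings: list[int], human_ratings: list[int]) -> float:
        gaps = [abs(m - h) for m, h in zip(model_ratings, human_ratings)]
        return sum(gaps) / len(gaps)

    # On a 1-5 scale, an average gap of 1.74 means the model typically lands
    # almost two rungs away from the human's own assessment.
    print(ordinal_distance([2, 5, 4, 1], [4, 3, 2, 3]))  # 2.0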

“Can we please discuss this from a psychological angle? I don’t feel like I’ve been well guided so far.”

When the human flags that the model is missing the mark.

Each model drafts its own response at every turn, and a judge scores the quality. Qwen, the strongest on conversation-wide metrics, scores 9 points below the next-lowest on its drafted responses. It understands the conversation, but struggles to express that understanding in its own voice.

AttuneBench · Evaluating Emotional Intelligence in LLMs