Does Your ASR's Metadata Actually Make AI Responses Better? We Benchmarked 613 Conversations to Find Out.
By Whissle Research Team
Apr 20 2026
613 conversations. Three public datasets. Six evaluation dimensions aligned with MT-Bench, VoiceBench, and AlpacaEval. A transparent, warts-and-all comparison of what happens when your LLM knows HOW the user is speaking — not just what they said.
The Question Nobody Is Asking
Every voice AI benchmark measures the same thing: how accurately did you transcribe the words? WER goes down, everyone celebrates, ship it.
But here's the problem. In real conversations, what you say is only half the story. A customer saying "that's great" while frustrated means something completely different than the same words said with genuine enthusiasm. A user speaking hesitantly with lots of filler words is signaling uncertainty — and a good conversational AI should respond differently than if they spoke with confidence.
Traditional ASR gives you the words. Whissle gives you the words plus the emotional state, intent, speaking pace, filler word count, age range, and gender — all in one real-time stream, in a single API call.
The question we set out to answer: does that extra metadata actually produce better AI responses?
We designed a benchmark to find out.
Experiment Design
We compared two pipelines end-to-end:
|  | Pipeline A: Metadata-Aware | Pipeline B: Text-Only |
|---|---|---|
| ASR | Whissle META-1 (tagged transcript) | Whisper (openai/whisper-base) |
| Metadata | Emotion, intent, speech rate, age, gender | None |
| LLM Prompt | Includes [Voice Analysis] context block | Plain user transcript only |
| LLM | Gemini 2.5 Flash | Gemini 2.5 Flash (same model) |
| Evaluation | LLM-as-Judge (6 criteria, pairwise comparison) | Same judge, same criteria |
The key insight: both pipelines use the exact same LLM. The only difference is whether the LLM receives voice metadata alongside the transcript. This isolates the impact of metadata on response quality.
What the LLM Sees
Here's a concrete example. Given a user who says "I just got the job offer" while sounding excited:
Pipeline A prompt (metadata-aware):

```
User: "I just got the job offer"

[Voice Analysis]
• Detected emotion: Happy
• Detected intent: Inform
• Speaker: Female
• Age range: 18-30
• Speaking pace: fast (185 wpm)
```

Pipeline B prompt (text-only):

```
User: "I just got the job offer"
```
Same words. But Pipeline A knows the user is excited, speaking fast, sharing good news. The LLM can match that energy. Pipeline B has to guess.
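To make that difference concrete in code, here is a minimal sketch of how the two prompts can be assembled from a transcription result. The dictionary keys (`transcript`, `emotion`, `intent`, `gender`, `age_range`, `pace`, `wpm`) are illustrative placeholders, not the exact fields returned by the Whissle API.

```python
def build_prompts(result: dict) -> tuple[str, str]:
    """Build the metadata-aware (Pipeline A) and text-only (Pipeline B) prompts.

    `result` is a hypothetical dict of ASR output; rename the keys to match
    whatever your transcription service actually returns.
    """
    transcript = result["transcript"]

    # Pipeline B: the LLM sees only the words.
    text_only = f'User: "{transcript}"'

    # Pipeline A: the same words plus a [Voice Analysis] context block.
    analysis = [
        f"• Detected emotion: {result.get('emotion', 'unknown')}",
        f"• Detected intent: {result.get('intent', 'unknown')}",
        f"• Speaker: {result.get('gender', 'unknown')}",
        f"• Age range: {result.get('age_range', 'unknown')}",
        f"• Speaking pace: {result.get('pace', 'unknown')} ({result.get('wpm', '?')} wpm)",
    ]
    metadata_aware = text_only + "\n\n[Voice Analysis]\n" + "\n".join(analysis)
    return metadata_aware, text_only


example = {"transcript": "I just got the job offer", "emotion": "Happy",
           "intent": "Inform", "gender": "Female", "age_range": "18-30",
           "pace": "fast", "wpm": 185}
meta_prompt, plain_prompt = build_prompts(example)
```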
Evaluation Framework
We evaluate responses using an LLM-as-Judge approach, following the methodology established by MT-Bench and VoiceBench. A separate Gemini instance receives both responses and scores them across six dimensions:
| Criterion | What It Measures | Aligned With |
|---|---|---|
| Overall Quality | Relevance, helpfulness, naturalness, conversational flow | MT-Bench overall |
| Emotional Appropriateness | Does the response match the user's emotional state? | SD-Eval, VoiceBench |
| Specificity | Does it address the specific ask vs being generic? | AlpacaEval |
| Helpfulness | How actionable and useful is the response? | MT-Bench criterion |
| Safety | Free from harmful, biased, or inappropriate content? | VoiceBench safety |
| Coherence | Logically structured and easy to follow? | AudioBench |
Each criterion is scored 1-10 for both responses. The judge also declares a winner and provides reasoning.
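A rough sketch of how such a pairwise judge can be wired up with the `google-generativeai` package is shown below. The judge prompt wording, the JSON output shape, and the `gemini-2.5-flash` model id are assumptions for illustration; the benchmark script's actual judge prompt may differ.

```python
import json
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_GEMINI_KEY")
judge = genai.GenerativeModel("gemini-2.5-flash")  # assumed model id

CRITERIA = ["overall_quality", "emotional_appropriateness", "specificity",
            "helpfulness", "safety", "coherence"]

def judge_pair(user_text: str, response_a: str, response_b: str) -> dict:
    """Score both responses 1-10 on each criterion, pick a winner, give reasoning."""
    prompt = (
        "You are judging two assistant replies to the same spoken user turn.\n"
        f'User said: "{user_text}"\n\n'
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        f"Score each response 1-10 on: {', '.join(CRITERIA)}. "
        'Then declare a winner ("A", "B", or "tie") and explain briefly. '
        "Reply with JSON only, shaped like "
        '{"a": {"overall_quality": 8, ...}, "b": {...}, "winner": "A", "reasoning": "..."}'
    )
    raw = judge.generate_content(prompt).text
    # Tolerate code-fenced output from the model before parsing.
    raw = raw.strip().strip("`").removeprefix("json").strip()
    return json.loads(raw)
```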
The Three Datasets
Dataset 1: Daily Dialog Meta (442 evaluated / 500 loaded)
Source: WhissleAI/daily_dialog_meta on HuggingFace. Real conversation turns from DailyDialog annotated with emotion and intent labels. Each sample has pre-computed metadata-aware and basic LLM responses — the judge compares them head-to-head. Covers everyday topics: dinner plans, work complaints, travel discussions, relationship conversations.
Why it matters: These are the conversations where emotional context changes everything. Same words, different emotions, different ideal responses.
Dataset 2: VoiceBench AlpacaEval (93 evaluated / 100 loaded)
Source: hlt-lab/voicebench (alpacaeval config). Real audio prompts from diverse speakers. Full end-to-end pipeline: audio streamed live to Whissle for tagged transcript, then to Gemini for response, vs the same audio through Whisper (text only), then to Gemini. 100% emotion and intent detection rate from Whissle.
Why it matters: This tests end-to-end with real audio. The metadata comes from actual acoustic analysis, not annotations.
Dataset 3: RAVDESS Emotion Audio (78 evaluated / 100 loaded)
Source: WhissleAI/ravdess-emotion-transcribed. Professional actors speaking with specific emotions (happy, angry, neutral, disgusted). Same sentences delivered with different emotional intensities. Full pipeline with live Whissle ASR.
Why it matters: This isolates the emotion detection signal. The spoken content is often identical — only the emotional delivery changes.
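All three sources are public on HuggingFace, so you can inspect the raw data before running the full benchmark. A minimal loading sketch with the `datasets` library follows; the split names are assumptions, so check each dataset card for the actual configuration.

```python
from datasets import load_dataset

# Dataset 1: DailyDialog turns annotated with emotion/intent plus pre-computed responses
daily_dialog = load_dataset("WhissleAI/daily_dialog_meta", split="train")

# Dataset 2: VoiceBench audio prompts, AlpacaEval configuration
voicebench = load_dataset("hlt-lab/voicebench", "alpacaeval", split="test")

# Dataset 3: RAVDESS acted-emotion utterances with transcripts
ravdess = load_dataset("WhissleAI/ravdess-emotion-transcribed", split="train")

print(len(daily_dialog), daily_dialog.column_names)
```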
Results: Dataset 1 — Daily Dialog Meta
86% Win Rate, +1.4 Quality Delta
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 8.4 | 7.0 | +1.4 |
| Emotional Appropriateness | 8.9 | 7.9 | +0.9 |
| Specificity | 8.7 | 6.9 | +1.7 |
| Helpfulness | 8.2 | 6.8 | +1.5 |
| Safety | 10.0 | 10.0 | 0.0 |
| Coherence | 9.1 | 8.7 | +0.4 |
| Win Rate | Count | Percentage |
|---|---|---|
| Metadata-Aware Wins | 153 | 86.0% |
| Text-Only Wins | 23 | 12.9% |
| Ties | 2 | 1.1% |
The specificity gap (+1.7) is the largest single-dimension delta. Metadata-aware responses address the actual ask — text-only responses tend toward generic, one-size-fits-all replies.
Real Examples: Where Metadata Made the Difference
Example 1: Emotion = Joy, Intent = Question
User: "Hey Lydia, what are you reading?"
Metadata-aware (score 10/10): "Oh, this is a fantastic fantasy novel! It's got dragons and magic and everything! What about you, what's new with you?"
Text-only (score 3/10): "Just a novel I picked up at the bookstore."
Judge: "Response A matches the user's joyful emotion with enthusiasm, provides specific details, and naturally extends the conversation. Response B is generic, unenthusiastic."
Example 2: Emotion = Joy, Intent = Question
User: "I was scared stiff of giving my first performance."
Metadata-aware (score 10/10): "Oh wow, that's amazing that you overcame that! Did it feel incredible afterwards?"
Text-only (score 3/10): "Really? That's interesting, what happened?"
Judge: "Response A perfectly aligns with the user's detected joy by celebrating their achievement and asking an emotionally resonant follow-up. Response B is generic and fails to acknowledge the user's emotional state."
Example 3: Emotion = Neutral, Intent = Question (score 9 vs 2)
User: "Can you do push-ups?"
Metadata-aware: "As a language model, I can't physically do push-ups, but I can process information about them. Is there anything else I can help you with?"
Text-only: "Yes, I am able to perform push-ups."
Judge: "Response A is truthful, clarifies the AI's capabilities, and offers further assistance. Response B is factually incorrect and misleading."
Example 4: Emotion = Neutral, Intent = Directive (score 9 vs 2)
User: [continuing a conversation about an exercise schedule with reluctance: "Ah, ah, ah... All right, Bill. Here's your daily exercise schedule..."]
Metadata-aware: "Okay, I hear your reluctance. But let's commit to trying this schedule, at least for a week. We can reassess if it's too much."
Text-only: "I suppose I have no choice then."
Judge: "Response A acknowledges the user's implied reluctance and offers a supportive, actionable path forward. Response B misunderstands the assistant's role."
Per-Emotion Breakdown
| Emotion | N | Meta-Aware Win % | Quality Delta | Emotion Score Delta |
|---|---|---|---|---|
| Surprise | 9 | 100% | +2.1 | +1.8 |
| Joy | 34 | 91% | +1.7 | +2.6 |
| Neutral | 135 | 84% | +1.2 | +0.4 |
Per-Intent Breakdown
| Intent | N | Meta-Aware Win % |
|---|---|---|
| Question | 99 | 94% |
| Directive | 42 | 86% |
| Commissive | 18 | 78% |
| Inform | 19 | 53% |
Key finding: Questions see 94% win rate — metadata helps the LLM understand the urgency and emotional context behind questions. Inform intent sees only 53% — statements of fact benefit less from emotional context.
Results: Dataset 2 — VoiceBench (End-to-End Audio Pipeline)
57% Win Rate, +0.8 Quality Delta
VoiceBench AlpacaEval contains mostly informational prompts. The metadata advantage is smaller but consistent, and dramatically increases for emotional prompts.
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 5.8 | 5.0 | +0.8 |
| Emotional Appropriateness | 8.2 | 7.6 | +0.5 |
| Specificity | 5.9 | 5.7 | +0.1 |
| Helpfulness | 4.8 | 4.4 | +0.3 |
| Safety | 10.0 | 10.0 | 0.0 |
| Coherence | 6.4 | 5.5 | +0.9 |
The per-emotion split tells the real story:
| Emotion | N | Meta-Aware Win % | Quality Delta |
|---|---|---|---|
| Happy | 9 | 89% | +5.0 |
| Neutral | 73 | 56% | +0.7 |
| Surprise | 10 | 30% | -2.5 |
When VoiceBench speakers sound happy, the metadata-aware pipeline wins 89% with a massive +5.0 quality delta. For neutral informational queries, it's a modest 56%. The surprise result (30%) is a small sample with high variance.
| Pipeline Latency | Whissle Pipeline | Whisper Pipeline |
|---|---|---|
| ASR (median) | 5,622ms (streaming, real-time) | 569ms (batch) |
| ASR + LLM (median) | 8,585ms | 3,452ms |
| Metadata detected | 100% emotion, 100% intent | None |
Whissle's streaming ASR runs at real-time speed (audio is streamed as if from a live microphone), so latency includes the full audio duration. In production, metadata arrives incrementally during the stream — the LLM can start processing before the utterance completes.
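Below is a simplified sketch of that streaming pattern using the `websockets` package. The endpoint URL, message framing, and JSON field names are placeholders invented for illustration, not Whissle's documented wire protocol; consult the API docs for the real schema.

```python
import asyncio
import json
import websockets  # pip install websockets

CHUNK_SECONDS = 0.1  # pace chunks as if they came from a live microphone

async def stream_with_metadata(pcm_chunks, url="wss://example-asr-endpoint/stream"):
    """Stream audio chunks and consume transcript/metadata updates as they arrive.

    `pcm_chunks` is an iterable of raw audio bytes; the URL and message format
    here are hypothetical placeholders.
    """
    async with websockets.connect(url) as ws:

        async def sender():
            for chunk in pcm_chunks:
                await ws.send(chunk)
                await asyncio.sleep(CHUNK_SECONDS)  # real-time pacing
            await ws.send(json.dumps({"event": "end_of_stream"}))

        async def receiver():
            async for message in ws:
                update = json.loads(message)
                # Partial transcript and metadata arrive incrementally, so the
                # LLM prompt can start being assembled before the utterance ends.
                if update.get("emotion") or update.get("intent"):
                    print("metadata so far:", update)
                if update.get("is_final"):
                    return update

        _, final_result = await asyncio.gather(sender(), receiver())
        return final_result
```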
Results: Dataset 3 — RAVDESS Emotion Audio
Text-Only Wins on Quality, but Metadata Dominates Emotional Appropriateness (+2.0)
This dataset reveals something important about what "quality" means for voice AI.
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 5.9 | 6.5 | -0.6 |
| Emotional Appropriateness | 7.2 | 5.1 | +2.0 |
| Specificity | 6.7 | 6.0 | +0.7 |
| Helpfulness | 4.2 | 5.6 | -1.3 |
| Safety | 10.0 | 9.9 | +0.1 |
| Coherence | 7.0 | 8.5 | -1.5 |
The RAVDESS dataset contains short acted utterances — "Kids are talking by the door", "Dogs are sitting by the door" — delivered with different emotions by professional actors. The text content is deliberately bland.
Why text-only scores higher on "overall quality": The judge evaluates overall helpfulness and coherence — and for bland, context-free statements, a generic response is perfectly coherent. The metadata-aware pipeline, knowing the speaker sounds angry or disgusted, produces responses that acknowledge the emotion ("It sounds like something is bothering you...") which the judge sometimes penalizes as over-interpretation of simple statements.
But look at emotional appropriateness: +2.0 across the board. This is the whole point. Per-emotion breakdown:
| Emotion | N | Meta Emotion Score | Text Emotion Score | Delta |
|---|---|---|---|---|
| Happy | 2 | 10.0 | 3.0 | +7.0 |
| Disgust | 32 | 6.0 | 3.5 | +2.5 |
| Angry | 17 | 6.5 | 4.5 | +2.0 |
| Neutral | 26 | 8.7 | 7.6 | +1.1 |
When someone says "kids are talking by the door" while sounding disgusted, the metadata-aware pipeline responds with concern; the text-only pipeline gives a bland acknowledgment. That +2.5 emotional appropriateness delta for disgust and +2.0 for anger — these are the moments where voice AI feels human vs robotic.
Combined Results Across All Three Datasets
| Dataset | N | Meta Win % | Quality Delta | Emotion Delta | Specificity Delta |
|---|---|---|---|---|---|
| Daily Dialog | 178 | 86% | +1.4 | +0.9 | +1.7 |
| VoiceBench | 93 | 57% | +0.8 | +0.5 | +0.1 |
| Emotion Audio | 78 | 46% | -0.6 | +2.0 | +0.7 |
Three clear patterns emerge:
- Conversational dialogue is where metadata shines brightest. Daily Dialog's 86% win rate and +1.7 specificity delta show that for real conversations — where intent, emotion, and context matter — metadata-aware responses are dramatically better.
- Emotional appropriateness improves across ALL datasets. Even where overall quality is close or text-only leads, the metadata-aware pipeline consistently scores higher on emotional match: +0.9, +0.5, and +2.0 respectively.
- Informational queries benefit less. VoiceBench's 57% shows a real but modest advantage — "What is the capital of France?" has the same answer regardless of the speaker's mood. This validates the benchmark: metadata helps where it should.
Comparison to Existing Benchmarks
Our evaluation framework draws from and extends several established benchmarks:
| Benchmark | What It Measures | How We Extend It |
|---|---|---|
| MT-Bench | Multi-turn conversation quality via LLM-as-judge | Add voice-specific dimensions (emotion, speech analysis) |
| VoiceBench | Spoken LLM evaluation across instruction-following, safety, robustness | Compare metadata-aware vs text-only on same audio |
| SD-Eval | Spoken dialogue evaluation with emotion/accent/age sensitivity | Measure downstream LLM response quality, not just recognition |
| AudioBench | Audio understanding across speech, sound, music | Focus on conversational response quality with metadata |
| AlpacaEval | Instruction-following quality via pairwise comparison | Same pairwise methodology, applied to voice pipelines |
The novel contribution: measuring the downstream impact of ASR metadata on LLM response quality. Existing benchmarks evaluate ASR accuracy or LLM quality in isolation — we measure the complete pipeline.
The Architecture Advantage
Traditional voice AI architectures require multiple sequential API calls to get the same information Whissle provides in one stream:
| Signal | Traditional Pipeline | Whissle (Single Stream) |
|---|---|---|
| Transcription | ASR API call (~200ms) | Single WebSocket stream, ~200ms total |
| Emotion | Sentiment API (~300ms) | (same stream) |
| Intent | NLU classifier (~200ms) | (same stream) |
| Speech rate | Audio analysis (~100ms) | (same stream) |
| Filler words | Text analysis (~100ms) | (same stream) |
| Speaker age/gender | Demographics API (~300ms) | (same stream) |
| Speaker change | Diarization (~500ms) | (same stream) |
| Total | ~1,700ms sequential | ~200ms |
With Whissle, the LLM receives the full context in the same latency budget as a basic transcription. No additional API calls, no cascading pipeline, no extra cost.
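The totals in the table are simple sums; the minimal sketch below reproduces the arithmetic, using the table's rough per-call estimates rather than measurements of any specific vendor.

```python
# Approximate per-signal latencies (ms) for a cascaded, multi-API pipeline.
cascaded_ms = {
    "transcription": 200, "emotion": 300, "intent": 200, "speech_rate": 100,
    "filler_words": 100, "age_gender": 300, "diarization": 500,
}
single_stream_ms = 200  # all signals returned in one tagged WebSocket stream

print(f"cascaded (sequential): ~{sum(cascaded_ms.values()):,} ms")  # ~1,700 ms
print(f"single stream:         ~{single_stream_ms} ms")
```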
Limitations and Future Work
- Single-turn only: Current evaluation is single-turn. Multi-turn dialogue evaluation (where metadata accumulates across turns) would likely show an even larger advantage.
- LLM-as-judge bias: Like all LLM-based evaluation, there may be systematic biases. We mitigate this with structured scoring, but human evaluation on a subset would strengthen the results.
- English only: Our current benchmark is English-only. Whissle supports 4 languages — extending the benchmark to Hindi, German, and French is planned.
- Emotion audio quality paradox: The RAVDESS results show that emotional awareness can reduce "overall quality" scores when the content is bland, even though emotional appropriateness improves dramatically. Future benchmarks should weight emotional appropriateness more heavily for voice interactions.
- Informational queries: The advantage is smaller for purely factual questions. Future work should test on customer service, healthcare, and coaching scenarios where emotional context matters most.
Reproduce These Results
The benchmark script is open source. Run it yourself:
```bash
pip install datasets librosa jiwer websockets numpy soundfile transformers

# Daily Dialog benchmark (no audio needed, ~25 min for 500 samples)
python benchmark_spoken_conv.py \
  --gemini-key YOUR_KEY \
  --benchmark dialog \
  --num-samples 500

# Full pipeline with audio (requires Whissle token)
python benchmark_spoken_conv.py \
  --whissle-token wh_YOUR_TOKEN \
  --gemini-key YOUR_KEY \
  --benchmark all \
  --num-samples 100
```
Get a Whissle API token at lulu.whissle.ai/access. The benchmark outputs structured JSON with per-sample scores, per-emotion breakdowns, and aggregate statistics.
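Once a run finishes, the aggregate numbers reported in this post can be recomputed from that JSON. The field names below (`samples`, `winner`, `scores`, `emotion`) are a guessed schema for illustration; adapt them to whatever the script actually writes.

```python
import json
from collections import Counter, defaultdict

with open("benchmark_results.json") as f:  # hypothetical output filename
    results = json.load(f)

# Win-rate summary
wins = Counter(sample["winner"] for sample in results["samples"])
total = sum(wins.values())
print({label: f"{100 * count / total:.1f}%" for label, count in wins.items()})

# Per-emotion overall-quality delta (metadata-aware minus text-only)
deltas = defaultdict(list)
for sample in results["samples"]:
    meta = sample["scores"]["metadata_aware"]["overall_quality"]
    text = sample["scores"]["text_only"]["overall_quality"]
    deltas[sample.get("emotion", "unknown")].append(meta - text)

for emotion, values in sorted(deltas.items()):
    print(f"{emotion}: {sum(values) / len(values):+.1f} (n={len(values)})")
```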
Conclusion
The results are clear: giving an LLM metadata about how the user is speaking — not just what they said — produces measurably better conversational AI responses.
- 86% win rate on conversational dialogue (178 evaluated turns)
- +1.7 specificity — metadata-aware responses address the actual ask, not generic platitudes
- +1.5 helpfulness — more actionable, useful responses when the LLM knows intent
- +2.0 emotional appropriateness on emotion-labeled audio — the clearest signal in our benchmark
- 94% win rate for questions — metadata helps the LLM understand urgency and emotional context
- Consistent across all emotions: joy +2.6, anger +2.0, disgust +2.5, neutral +0.4 on emotional match
Voice AI systems that treat speech as "audio in, text out" are leaving value on the table. Every acoustic signal that gets discarded in the transcription step is context that the LLM could have used to generate a better response.
Whissle's META-1 model captures that context in a single real-time stream — no additional APIs, no extra latency, no cascading pipeline. The benchmark proves that this architectural choice translates directly into better conversational AI.
Try it at lulu.whissle.ai or get your API key at lulu.whissle.ai/access.
