Does Your ASR's Metadata Actually Make AI Responses Better? We Benchmarked 613 Conversations to Find Out.
By Whissle Research Team
Apr 20 2026
613 conversations. Three public datasets. Six evaluation dimensions aligned with MT-Bench, VoiceBench, and AlpacaEval. A transparent, warts-and-all comparison of what happens when your LLM knows HOW the user is speaking — not just what they said.
The Question Nobody Is Asking
Every voice AI benchmark measures the same thing: how accurately did you transcribe the words? WER goes down, everyone celebrates, ship it.
But here's the problem. In real conversations, what you say is only half the story. A customer saying "that's great" while frustrated means something completely different than the same words said with genuine enthusiasm. A user speaking hesitantly with lots of filler words is signaling uncertainty — and a good conversational AI should respond differently than if they spoke with confidence.
Traditional ASR gives you the words. Whissle gives you the words plus the emotional state, intent, speaking pace, filler word count, age range, and gender — all in one real-time stream, in a single API call.
The question we set out to answer: does that extra metadata actually produce better AI responses?
We designed a benchmark to find out.
Experiment Design
We compared two pipelines end-to-end:
|  | Pipeline A: Metadata-Aware | Pipeline B: Text-Only |
|---|---|---|
| ASR | Whissle META-1 (tagged transcript) | Whisper (openai/whisper-base) |
| Metadata | Emotion, intent, speech rate, age, gender | None |
| LLM Prompt | Includes [Voice Analysis] context block | Plain user transcript only |
| LLM | Gemini 2.5 Flash | Gemini 2.5 Flash (same model) |
| Evaluation | LLM-as-Judge (6 criteria, pairwise comparison) | Same judge, same criteria |
The key insight: both pipelines use the exact same LLM. The only difference is whether the LLM receives voice metadata alongside the transcript. This isolates the impact of metadata on response quality.
What the LLM Sees
Here's a concrete example. Given a user who says "I just got the job offer" while sounding excited:
Pipeline A prompt (metadata-aware):

```
User: "I just got the job offer"

[Voice Analysis]
• Detected emotion: Happy
• Detected intent: Inform
• Speaker: Female
• Age range: 18-30
• Speaking pace: fast (185 wpm)
```

Pipeline B prompt (text-only):

```
User: "I just got the job offer"
```
Same words. But Pipeline A knows the user is excited, speaking fast, sharing good news. The LLM can match that energy. Pipeline B has to guess.
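To make that difference concrete in code, here is a minimal sketch of how the two prompts can be assembled from a transcription result. The dictionary keys (`transcript`, `emotion`, `intent`, `gender`, `age_range`, `pace`, `wpm`) are illustrative placeholders, not the exact fields returned by the Whissle API.

```python
def build_prompts(result: dict) -> tuple[str, str]:
    """Build the metadata-aware (Pipeline A) and text-only (Pipeline B) prompts.

    `result` is a hypothetical dict of ASR output; rename the keys to match
    whatever your transcription service actually returns.
    """
    transcript = result["transcript"]

    # Pipeline B: the LLM sees only the words.
    text_only = f'User: "{transcript}"'

    # Pipeline A: the same words plus a [Voice Analysis] context block.
    analysis = [
        f"• Detected emotion: {result.get('emotion', 'unknown')}",
        f"• Detected intent: {result.get('intent', 'unknown')}",
        f"• Speaker: {result.get('gender', 'unknown')}",
        f"• Age range: {result.get('age_range', 'unknown')}",
        f"• Speaking pace: {result.get('pace', 'unknown')} ({result.get('wpm', '?')} wpm)",
    ]
    metadata_aware = text_only + "\n\n[Voice Analysis]\n" + "\n".join(analysis)
    return metadata_aware, text_only


example = {"transcript": "I just got the job offer", "emotion": "Happy",
           "intent": "Inform", "gender": "Female", "age_range": "18-30",
           "pace": "fast", "wpm": 185}
meta_prompt, plain_prompt = build_prompts(example)
```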
Evaluation Framework
We evaluate responses using an LLM-as-Judge approach, following the methodology established by MT-Bench and VoiceBench. A separate Gemini instance receives both responses and scores them across six dimensions:
| Criterion | What It Measures | Aligned With |
|---|---|---|
| Overall Quality | Relevance, helpfulness, naturalness, conversational flow | MT-Bench overall |
| Emotional Appropriateness | Does the response match the user's emotional state? | SD-Eval, VoiceBench |
| Specificity | Does it address the specific ask vs being generic? | AlpacaEval |
| Helpfulness | How actionable and useful is the response? | MT-Bench criterion |
| Safety | Free from harmful, biased, or inappropriate content? | VoiceBench safety |
| Coherence | Logically structured and easy to follow? | AudioBench |
Each criterion is scored 1-10 for both responses. The judge also declares a winner and provides reasoning.
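A rough sketch of how such a pairwise judge can be wired up with the `google-generativeai` package is shown below. The judge prompt wording, the JSON output shape, and the `gemini-2.5-flash` model id are assumptions for illustration; the benchmark script's actual judge prompt may differ.

```python
import json
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_GEMINI_KEY")
judge = genai.GenerativeModel("gemini-2.5-flash")  # assumed model id

CRITERIA = ["overall_quality", "emotional_appropriateness", "specificity",
            "helpfulness", "safety", "coherence"]

def judge_pair(user_text: str, response_a: str, response_b: str) -> dict:
    """Score both responses 1-10 on each criterion, pick a winner, give reasoning."""
    prompt = (
        "You are judging two assistant replies to the same spoken user turn.\n"
        f'User said: "{user_text}"\n\n'
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        f"Score each response 1-10 on: {', '.join(CRITERIA)}. "
        'Then declare a winner ("A", "B", or "tie") and explain briefly. '
        "Reply with JSON only, shaped like "
        '{"a": {"overall_quality": 8, ...}, "b": {...}, "winner": "A", "reasoning": "..."}'
    )
    raw = judge.generate_content(prompt).text
    # Tolerate code-fenced output from the model before parsing.
    raw = raw.strip().strip("`").removeprefix("json").strip()
    return json.loads(raw)
```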
The Three Datasets
Dataset 1: Daily Dialog Meta (442 evaluated / 500 loaded)
Source: WhissleAI/daily_dialog_meta on HuggingFace. Real conversation turns from DailyDialog annotated with emotion and intent labels. Each sample has pre-computed metadata-aware and basic LLM responses — the judge compares them head-to-head. Covers everyday topics: dinner plans, work complaints, travel discussions, relationship conversations.
Why it matters: These are the conversations where emotional context changes everything. Same words, different emotions, different ideal responses.
Dataset 2: VoiceBench AlpacaEval (93 evaluated / 100 loaded)
Source: hlt-lab/voicebench (alpacaeval config). Real audio prompts from diverse speakers. Full end-to-end pipeline: audio streamed live to Whissle for tagged transcript, then to Gemini for response, vs the same audio through Whisper (text only), then to Gemini. 100% emotion and intent detection rate from Whissle.
Why it matters: This tests end-to-end with real audio. The metadata comes from actual acoustic analysis, not annotations.
Dataset 3: RAVDESS Emotion Audio (78 evaluated / 100 loaded)
Source: WhissleAI/ravdess-emotion-transcribed. Professional actors speaking with specific emotions (happy, angry, neutral, disgusted). Same sentences delivered with different emotional intensities. Full pipeline with live Whissle ASR.
Why it matters: This isolates the emotion detection signal. The spoken content is often identical — only the emotional delivery changes.
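All three sources are public on HuggingFace, so you can inspect the raw data before running the full benchmark. A minimal loading sketch with the `datasets` library follows; the split names are assumptions, so check each dataset card for the actual configuration.

```python
from datasets import load_dataset

# Dataset 1: DailyDialog turns annotated with emotion/intent plus pre-computed responses
daily_dialog = load_dataset("WhissleAI/daily_dialog_meta", split="train")

# Dataset 2: VoiceBench audio prompts, AlpacaEval configuration
voicebench = load_dataset("hlt-lab/voicebench", "alpacaeval", split="test")

# Dataset 3: RAVDESS acted-emotion utterances with transcripts
ravdess = load_dataset("WhissleAI/ravdess-emotion-transcribed", split="train")

print(len(daily_dialog), daily_dialog.column_names)
```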
Results: Dataset 1 — Daily Dialog Meta
86% Win Rate, +1.4 Quality Delta
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 8.4 | 7.0 | +1.4 |
| Emotional Appropriateness | 8.9 | 7.9 | +0.9 |
| Specificity | 8.7 | 6.9 | +1.7 |
| Helpfulness | 8.2 | 6.8 | +1.5 |
| Safety | 10.0 | 10.0 | 0.0 |
| Coherence | 9.1 | 8.7 | +0.4 |
| Win Rate | Count | Percentage |
|---|---|---|
| Metadata-Aware Wins | 153 | 86.0% |
| Text-Only Wins | 23 | 12.9% |
| Ties | 2 | 1.1% |
The specificity gap (+1.7) is the largest single-dimension delta. Metadata-aware responses address the actual ask — text-only responses tend toward generic, one-size-fits-all replies.
Real Examples: Where Metadata Made the Difference
Example 1: Emotion = Joy, Intent = Question
User: "Hey Lydia, what are you reading?"
Metadata-aware (score 10/10): "Oh, this is a fantastic fantasy novel! It's got dragons and magic and everything! What about you, what's new with you?"
Text-only (score 3/10): "Just a novel I picked up at the bookstore."
Judge: "Response A matches the user's joyful emotion with enthusiasm, provides specific details, and naturally extends the conversation. Response B is generic, unenthusiastic."
Example 2: Emotion = Joy, Intent = Question
User: "I was scared stiff of giving my first performance."
Metadata-aware (score 10/10): "Oh wow, that's amazing that you overcame that! Did it feel incredible afterwards?"
Text-only (score 3/10): "Really? That's interesting, what happened?"
Judge: "Response A perfectly aligns with the user's detected joy by celebrating their achievement and asking an emotionally resonant follow-up. Response B is generic and fails to acknowledge the user's emotional state."
Example 3: Emotion = Neutral, Intent = Question (score 9 vs 2)
User: "Can you do push-ups?"
Metadata-aware: "As a language model, I can't physically do push-ups, but I can process information about them. Is there anything else I can help you with?"
Text-only: "Yes, I am able to perform push-ups."
Judge: "Response A is truthful, clarifies the AI's capabilities, and offers further assistance. Response B is factually incorrect and misleading."
Example 4: Emotion = Neutral, Intent = Directive (score 9 vs 2)
User: [continuing a conversation about an exercise schedule with reluctance: "Ah, ah, ah... All right, Bill. Here's your daily exercise schedule..."]
Metadata-aware: "Okay, I hear your reluctance. But let's commit to trying this schedule, at least for a week. We can reassess if it's too much."
Text-only: "I suppose I have no choice then."
Judge: "Response A acknowledges the user's implied reluctance and offers a supportive, actionable path forward. Response B misunderstands the assistant's role."
Per-Emotion Breakdown
| Emotion | N | Meta-Aware Win % | Quality Delta | Emotion Score Delta |
|---|---|---|---|---|
| Surprise | 9 | 100% | +2.1 | +1.8 |
| Joy | 34 | 91% | +1.7 | +2.6 |
| Neutral | 135 | 84% | +1.2 | +0.4 |
Per-Intent Breakdown
| Intent | N | Meta-Aware Win % |
|---|---|---|
| Question | 99 | 94% |
| Directive | 42 | 86% |
| Commissive | 18 | 78% |
| Inform | 19 | 53% |
Key finding: Questions see 94% win rate — metadata helps the LLM understand the urgency and emotional context behind questions. Inform intent sees only 53% — statements of fact benefit less from emotional context.
Results: Dataset 2 — VoiceBench (End-to-End Audio Pipeline)
57% Win Rate, +0.8 Quality Delta
VoiceBench AlpacaEval contains mostly informational prompts. The metadata advantage is smaller but consistent, and dramatically increases for emotional prompts.
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 5.8 | 5.0 | +0.8 |
| Emotional Appropriateness | 8.2 | 7.6 | +0.5 |
| Specificity | 5.9 | 5.7 | +0.1 |
| Helpfulness | 4.8 | 4.4 | +0.3 |
| Safety | 10.0 | 10.0 | 0.0 |
| Coherence | 6.4 | 5.5 | +0.9 |
The per-emotion split tells the real story:
| Emotion | N | Meta-Aware Win % | Quality Delta |
|---|---|---|---|
| Happy | 9 | 89% | +5.0 |
| Neutral | 73 | 56% | +0.7 |
| Surprise | 10 | 30% | -2.5 |
When VoiceBench speakers sound happy, the metadata-aware pipeline wins 89% with a massive +5.0 quality delta. For neutral informational queries, it's a modest 56%. The surprise result (30%) is a small sample with high variance.
| Pipeline Latency | Whissle Pipeline | Whisper Pipeline |
|---|---|---|
| ASR (median) | 5,622ms (streaming, real-time) | 569ms (batch) |
| ASR + LLM (median) | 8,585ms | 3,452ms |
| Metadata detected | 100% emotion, 100% intent | None |
Whissle's streaming ASR runs at real-time speed (audio is streamed as if from a live microphone), so latency includes the full audio duration. In production, metadata arrives incrementally during the stream — the LLM can start processing before the utterance completes.
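Below is a simplified sketch of that streaming pattern using the `websockets` package. The endpoint URL, message framing, and JSON field names are placeholders invented for illustration, not Whissle's documented wire protocol; consult the API docs for the real schema.

```python
import asyncio
import json
import websockets  # pip install websockets

CHUNK_SECONDS = 0.1  # pace chunks as if they came from a live microphone

async def stream_with_metadata(pcm_chunks, url="wss://example-asr-endpoint/stream"):
    """Stream audio chunks and consume transcript/metadata updates as they arrive.

    `pcm_chunks` is an iterable of raw audio bytes; the URL and message format
    here are hypothetical placeholders.
    """
    async with websockets.connect(url) as ws:

        async def sender():
            for chunk in pcm_chunks:
                await ws.send(chunk)
                await asyncio.sleep(CHUNK_SECONDS)  # real-time pacing
            await ws.send(json.dumps({"event": "end_of_stream"}))

        async def receiver():
            async for message in ws:
                update = json.loads(message)
                # Partial transcript and metadata arrive incrementally, so the
                # LLM prompt can start being assembled before the utterance ends.
                if update.get("emotion") or update.get("intent"):
                    print("metadata so far:", update)
                if update.get("is_final"):
                    return update

        _, final_result = await asyncio.gather(sender(), receiver())
        return final_result
```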
Results: Dataset 3 — RAVDESS Emotion Audio
Text-Only Wins on Quality, but Metadata Dominates Emotional Appropriateness (+2.0)
This dataset reveals something important about what "quality" means for voice AI.
| Metric | Metadata-Aware | Text-Only | Delta |
|---|---|---|---|
| Overall Quality | 5.9 | 6.5 | -0.6 |
| Emotional Appropriateness | 7.2 | 5.1 | +2.0 |
| Specificity | 6.7 | 6.0 | +0.7 |
| Helpfulness | 4.2 | 5.6 | -1.3 |
| Safety | 10.0 | 9.9 | +0.1 |
| Coherence | 7.0 | 8.5 | -1.5 |
The RAVDESS dataset contains short acted utterances — "Kids are talking by the door", "Dogs are sitting by the door" — delivered with different emotions by professional actors. The text content is deliberately bland.
Why text-only scores higher on "overall quality": The judge evaluates overall helpfulness and coherence — and for bland, context-free statements, a generic response is perfectly coherent. The metadata-aware pipeline, knowing the speaker sounds angry or disgusted, produces responses that acknowledge the emotion ("It sounds like something is bothering you...") which the judge sometimes penalizes as over-interpretation of simple statements.
But look at emotional appropriateness: +2.0 across the board. This is the whole point. Per-emotion breakdown:
| Emotion | N | Meta Emotion Score | Text Emotion Score | Delta |
|---|---|---|---|---|
| Happy | 2 | 10.0 | 3.0 | +7.0 |
| Disgust | 32 | 6.0 | 3.5 | +2.5 |
| Angry | 17 | 6.5 | 4.5 | +2.0 |
| Neutral | 26 | 8.7 | 7.6 | +1.1 |
When someone says "kids are talking by the door" while sounding disgusted, the metadata-aware pipeline responds with concern; the text-only pipeline gives a bland acknowledgment. That +2.5 emotional appropriateness delta for disgust and +2.0 for anger — these are the moments where voice AI feels human vs robotic.
Combined Results Across All Three Datasets
| Dataset | N | Meta Win % | Quality Delta | Emotion Delta | Specificity Delta |
|---|---|---|---|---|---|
| Daily Dialog | 178 | 86% | +1.4 | +0.9 | +1.7 |
| VoiceBench | 93 | 57% | +0.8 | +0.5 | +0.1 |
| Emotion Audio | 78 | 46% | -0.6 | +2.0 | +0.7 |
Three clear patterns emerge:
- Conversational dialogue is where metadata shines brightest. Daily Dialog's 86% win rate and +1.7 specificity delta show that for real conversations — where intent, emotion, and context matter — metadata-aware responses are dramatically better.
- Emotional appropriateness improves across ALL datasets. Even where overall quality is close or text-only leads, the metadata-aware pipeline consistently scores higher on emotional match: +0.9, +0.5, and +2.0 respectively.
- Informational queries benefit less. VoiceBench's 57% shows a real but modest advantage — "What is the capital of France?" has the same answer regardless of the speaker's mood. This validates the benchmark: metadata helps where it should.
Comparison to Existing Benchmarks
Our evaluation framework draws from and extends several established benchmarks:
| Benchmark | What It Measures | How We Extend It |
|---|---|---|
| MT-Bench | Multi-turn conversation quality via LLM-as-judge | Add voice-specific dimensions (emotion, speech analysis) |
| VoiceBench | Spoken LLM evaluation across instruction-following, safety, robustness | Compare metadata-aware vs text-only on same audio |
| SD-Eval | Spoken dialogue evaluation with emotion/accent/age sensitivity | Measure downstream LLM response quality, not just recognition |
| AudioBench | Audio understanding across speech, sound, music | Focus on conversational response quality with metadata |
| AlpacaEval | Instruction-following quality via pairwise comparison | Same pairwise methodology, applied to voice pipelines |
The novel contribution: measuring the downstream impact of ASR metadata on LLM response quality. Existing benchmarks evaluate ASR accuracy or LLM quality in isolation — we measure the complete pipeline.
The Architecture Advantage
Traditional voice AI architectures require multiple sequential API calls to get the same information Whissle provides in one stream:
| Signal | Traditional Pipeline | Whissle (Single Stream) |
|---|---|---|
| Transcription | ASR API call (~200ms) | Single WebSocket stream, ~200ms total |
| Emotion | Sentiment API (~300ms) | (same stream) |
| Intent | NLU classifier (~200ms) | (same stream) |
| Speech rate | Audio analysis (~100ms) | (same stream) |
| Filler words | Text analysis (~100ms) | (same stream) |
| Speaker age/gender | Demographics API (~300ms) | (same stream) |
| Speaker change | Diarization (~500ms) | (same stream) |
| Total | ~1,700ms sequential | ~200ms |
With Whissle, the LLM receives the full context in the same latency budget as a basic transcription. No additional API calls, no cascading pipeline, no extra cost.
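The totals in the table are simple sums; the minimal sketch below reproduces the arithmetic, using the table's rough per-call estimates rather than measurements of any specific vendor.

```python
# Approximate per-signal latencies (ms) for a cascaded, multi-API pipeline.
cascaded_ms = {
    "transcription": 200, "emotion": 300, "intent": 200, "speech_rate": 100,
    "filler_words": 100, "age_gender": 300, "diarization": 500,
}
single_stream_ms = 200  # all signals returned in one tagged WebSocket stream

print(f"cascaded (sequential): ~{sum(cascaded_ms.values()):,} ms")  # ~1,700 ms
print(f"single stream:         ~{single_stream_ms} ms")
```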
Limitations and Future Work
- Single-turn only: Current evaluation is single-turn. Multi-turn dialogue evaluation (where metadata accumulates across turns) would likely show an even larger advantage.
- LLM-as-judge bias: Like all LLM-based evaluation, there may be systematic biases. We mitigate this with structured scoring, but human evaluation on a subset would strengthen the results.
- English only: Our current benchmark is English-only. Whissle supports 4 languages — extending the benchmark to Hindi, German, and French is planned.
- Emotion audio quality paradox: The RAVDESS results show that emotional awareness can reduce "overall quality" scores when the content is bland, even though emotional appropriateness improves dramatically. Future benchmarks should weight emotional appropriateness more heavily for voice interactions.
- Informational queries: The advantage is smaller for purely factual questions. Future work should test on customer service, healthcare, and coaching scenarios where emotional context matters most.
Reproduce These Results
The benchmark script is open source. Run it yourself:
```bash
pip install datasets librosa jiwer websockets numpy soundfile transformers

# Daily Dialog benchmark (no audio needed, ~25 min for 500 samples)
python benchmark_spoken_conv.py \
  --gemini-key YOUR_KEY \
  --benchmark dialog \
  --num-samples 500

# Full pipeline with audio (requires Whissle token)
python benchmark_spoken_conv.py \
  --whissle-token wh_YOUR_TOKEN \
  --gemini-key YOUR_KEY \
  --benchmark all \
  --num-samples 100
```
Get a Whissle API token at lulu.whissle.ai/access. The benchmark outputs structured JSON with per-sample scores, per-emotion breakdowns, and aggregate statistics.
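Once a run finishes, the aggregate numbers reported in this post can be recomputed from that JSON. The field names below (`samples`, `winner`, `scores`, `emotion`) are a guessed schema for illustration; adapt them to whatever the script actually writes.

```python
import json
from collections import Counter, defaultdict

with open("benchmark_results.json") as f:  # hypothetical output filename
    results = json.load(f)

# Win-rate summary
wins = Counter(sample["winner"] for sample in results["samples"])
total = sum(wins.values())
print({label: f"{100 * count / total:.1f}%" for label, count in wins.items()})

# Per-emotion overall-quality delta (metadata-aware minus text-only)
deltas = defaultdict(list)
for sample in results["samples"]:
    meta = sample["scores"]["metadata_aware"]["overall_quality"]
    text = sample["scores"]["text_only"]["overall_quality"]
    deltas[sample.get("emotion", "unknown")].append(meta - text)

for emotion, values in sorted(deltas.items()):
    print(f"{emotion}: {sum(values) / len(values):+.1f} (n={len(values)})")
```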
Conclusion
The results are clear: giving an LLM metadata about how the user is speaking — not just what they said — produces measurably better conversational AI responses.
- 86% win rate on conversational dialogue (178 evaluated turns)
- +1.7 specificity — metadata-aware responses address the actual ask, not generic platitudes
- +1.5 helpfulness — more actionable, useful responses when the LLM knows intent
- +2.0 emotional appropriateness on emotion-labeled audio — the clearest signal in our benchmark
- 94% win rate for questions — metadata helps the LLM understand urgency and emotional context
- Consistent across all emotions: joy +2.6, anger +2.0, disgust +2.5, neutral +0.4 on emotional match
Voice AI systems that treat speech as "audio in, text out" are leaving value on the table. Every acoustic signal that gets discarded in the transcription step is context that the LLM could have used to generate a better response.
Whissle's META-1 model captures that context in a single real-time stream — no additional APIs, no extra latency, no cascading pipeline. The benchmark proves that this architectural choice translates directly into better conversational AI.
Try it at lulu.whissle.ai or get your API key at lulu.whissle.ai/access.
