We Benchmarked 3 Streaming ASR Providers Across 17 Hours of Audio. Here's What We Found.
By Whissle Research Team
Apr 10 2026
203
5471
4,915 samples. Four datasets. Clean speech, Indian-accented tech interviews, noisy soccer broadcasts, and Indian Supreme Court proceedings. A transparent, warts-and-all comparison of Whissle, Deepgram, and AssemblyAI.
Most ASR benchmarks test on LibriSpeech and call it a day. The provider with the lowest number on clean, studio-recorded audiobook narration wins, and everyone moves on.
But that's not the audio your application deals with.
Your audio has accents. Background noise. People talking fast, interrupting each other, mumbling through technical jargon. The question isn't "which ASR is best on clean speech?" -- it's "which ASR holds up when conditions get real?"
We set out to answer that question. We benchmarked three streaming ASR providers -- Whissle, Deepgram (Nova-2), and AssemblyAI (Universal Streaming) -- across four increasingly difficult datasets. We measured accuracy, latency, reliability, and real-time metadata capabilities. We ran all three providers in parallel on the same audio to ensure a fair comparison.
Here's everything we found.
How We Tested
All three providers were tested using real-time WebSocket streaming -- the exact protocol you'd use in production. Audio was streamed in 100ms chunks at 1x real-time speed, simulating a live microphone.
| Provider | Model | Streaming Endpoint |
|---|---|---|
| Whissle | Default (ONNX runtime) | wss://api.whissle.ai/asr/stream |
| Deepgram | Nova-2 | wss://api.deepgram.com/v1/listen |
| AssemblyAI | Universal Streaming | wss://streaming.assemblyai.com/v3/ws |
Fair comparison measures:
- All providers received identical audio (PCM int16, mono, 16kHz)
- All three ran concurrently on each sample -- same network conditions
- Text normalized before WER: lowercased, punctuation stripped, whitespace collapsed
- WER computed using the standard jiwer library
Key metrics:
| Metric | What It Measures |
|---|---|
| WER | Word Error Rate -- (Insertions + Deletions + Substitutions) / Reference Words |
| Corpus WER | Total errors / total reference words (avoids inflation from short utterances) |
| CER | Character Error Rate -- same formula at character level |
| Time to first segment | Milliseconds from audio stream start to first non-empty transcript |
| Failure rate | Percentage of samples that returned no usable transcript |
The Four Datasets
We deliberately chose datasets that span the difficulty spectrum -- one industry standard, three that represent real-world conditions.
Dataset 1: LibriSpeech test-clean
The industry baseline. Clean, clear, studio-quality audiobook narration.
| Property | Value |
|---|---|
| Source | openslr/librispeech_asr (HuggingFace) |
| Samples | 2,620 |
| Total audio | 5.4 hours (19,453 seconds) |
| Duration range | 1.3s -- 35.0s (median 5.8s) |
| Language | English (US) |
| Domain | Read audiobook speech |
| Conditions | Studio-quality, single speaker, no background noise |
| Duration Bucket | Samples | Percentage |
|---|---|---|
| Short (< 5s) | 1,089 | 41.6% |
| Medium (5--10s) | 917 | 35.0% |
| Long (> 10s) | 614 | 23.4% |
Why include it: It's the universal reference point. Every ASR paper reports LibriSpeech results, making it the baseline for cross-provider comparison.
Why it's not enough: No accents, no noise, no domain-specific vocabulary, no overlapping speakers. LibriSpeech performance doesn't predict real-world performance.
Dataset 2: EN-IN Tech Interviews
Indian-accented English in technical interview settings. Where clean-speech benchmarks break down.
| Property | Value |
|---|---|
| Source | WhissleAI/Meta_STT_EN-IN_Tech_Interviews (HuggingFace) |
| Samples | 1,204 |
| Total audio | 5.3 hours (18,969 seconds) |
| Duration range | 1.1s -- 20.0s (median 20.0s) |
| Language | English with Indian accents (EN-IN) |
| Domain | Technical interview conversations |
| Conditions | Variable quality, phone/video call audio, technical vocabulary |
| Duration Bucket | Samples | Percentage |
|---|---|---|
| Short (< 5s) | 95 | 7.9% |
| Medium (5--10s) | 140 | 11.6% |
| Long (> 10s) | 969 | 80.5% |
What makes it hard:
- Indian English accents. Retroflex consonants, vowel substitutions (/v/ for /w/), syllable-timed rhythm -- systematic phonetic differences from US/UK English.
- Technical vocabulary. Framework names, API terminology, algorithm discussions.
- Conversational patterns. Hesitations, false starts, filler words, incomplete sentences.
- Long utterances. 80% of samples exceed 10 seconds -- errors compound over longer audio.
- Variable recording quality. Phone/video call audio with varying mic quality and background noise.
Each sample includes ground-truth annotations for speaker age, gender, emotion, and intent -- stripped before WER calculation.
Dataset 3: Soccer Broadcast Commentary
Live match commentary with crowd noise, rapid speech, and proper nouns from a dozen nationalities. The hardest test.
| Property | Value |
|---|---|
| Source | Custom dataset from YouTube soccer match broadcasts |
| Samples | 327 |
| Total audio | 2.7 hours (9,810 seconds) |
| Segment duration | 30.0s each (uniform) |
| Language | English (UK) |
| Domain | Live soccer match commentary |
| Primary source | Manchester City v Manchester United -- 2024 Community Shield (326 segments) |
Rich ground-truth annotations embedded in reference text (stripped before WER):
| Annotation Type | Count | Examples |
|---|---|---|
| Player names | 379 | Fernandez, McAtee, Rashford |
| Team names | 139 | Manchester United, Arsenal |
| Game actions | 102 | shot, scored, corner, penalty |
| Referee decisions | 47 | penalty, offside |
| Competition names | 39 | Community Shield |
| Game timestamps | 36 | 90 minutes |
| Field positions | 30 | right hand side |
| Scores | 11 | 1-1, 7-6 |
| Ground-Truth Emotion | Samples | Ground-Truth Intent | Samples |
|---|---|---|---|
| Neutral | 202 (61.8%) | Analysis | 164 (50.2%) |
| Happy | 66 (20.2%) | Play-by-play | 114 (34.9%) |
| Sad | 59 (18.0%) | Statistic | 8 (2.4%) |
Speaker demographics: All male. 65% aged 60+, 35% aged 45--60.
What makes it the hardest:
- Constant background noise -- crowd chanting, whistles, stadium atmosphere
- Rapid speech -- commentators average ~150 WPM, spiking during action
- Proper noun density -- 379 player name mentions across 327 segments
- Multiple speakers -- two commentators frequently overlapping
- Domain jargon -- offside, set piece, penalty shootout
- 30-second segments -- long continuous audio with dense speech
- Broadcast processing -- dynamic range compression, crowd noise mixing
Dataset 4: Supreme Court of India Proceedings
Indian-accented legal English in a reverberant courtroom. Multiple speakers, heavy domain vocabulary, and challenging acoustics.
| Property | Value |
|---|---|
| Source | WhissleAI/supreme-court-india-meta-speech (HuggingFace) |
| Samples | 764 |
| Total audio | 4.2 hours (15,280 seconds) |
| Segment duration | 20.0s each (uniform) |
| Language | English with Indian accents (EN-IN) |
| Domain | Supreme Court oral arguments and judicial proceedings |
| Conditions | Courtroom acoustics, multiple speakers, legal terminology, reverberant environment |
What makes it hard:
- Indian English accents in formal legal speech. Judges, advocates, and amicus curiae speaking with varying regional accents -- phonetic patterns that trip up models trained primarily on US/UK English.
- Dense legal vocabulary. Case citations, Latin legal terms, constitutional provisions, statutory references -- highly specialized language that rarely appears in general training data.
- Reverberant courtroom acoustics. Large courtroom spaces produce echo and reverberation that degrades audio clarity.
- Multiple speakers. Judges, advocates, and interveners -- with frequent interruptions and cross-talk during oral arguments.
- Formal yet spontaneous speech. Legal arguments are partially prepared but delivered extemporaneously, with restarts, corrections, and hedging.
Results at a Glance
Accuracy: Deepgram Wins on WER
LibriSpeech test-clean (clean read speech):
| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 9.75% | 4.83% | 12.55% | DG |
| WER (median) | 6.25% | 0.00% | 6.67% | DG |
| Corpus WER | 8.24% | 4.16% | 11.11% | DG |
| CER (mean) | 4.30% | 2.05% | 11.74% | DG |
| Word confidence | 0.9494 | 0.9868 | 0.8679 | DG |
EN-IN Tech Interviews (accented conversational speech):
| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 62.94% | 43.62% | 51.25% | DG |
| WER (median) | 42.86% | 28.17% | 41.51% | DG |
| Corpus WER | 40.67% | 29.83% | 39.45% | DG |
| CER (mean) | 44.13% | 36.46% | 45.62% | DG |
Soccer Broadcast (noisy commentary):
| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 73.74% | 64.99% | 64.18% | AAI |
| WER (median) | 50.00% | 36.25% | 40.35% | DG |
| Corpus WER | 53.89% | 43.79% | 45.62% | DG |
| CER (mean) | 59.69% | 56.17% | 57.58% | DG |
Supreme Court India (legal proceedings):
| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 70.48% | 67.44% | 61.31% | AAI |
| WER (median) | 53.19% | 42.86% | 47.37% | DG |
| Corpus WER | 53.50% | 45.50% | 47.98% | DG |
| CER (mean) | 57.12% | 61.33% | 55.54% | AAI |
Deepgram's Nova-2 model delivers the best corpus-level accuracy across all four datasets. On clean speech it's excellent (4.16% corpus WER). But even Deepgram drops to 30--46% corpus WER on challenging audio -- a 7--11x degradation from its LibriSpeech numbers. The Supreme Court dataset is the hardest test: all three providers exceed 45% corpus WER, reflecting the combined difficulty of Indian accents, legal vocabulary, and courtroom acoustics.
Latency: Whissle Is 2--3x Faster to First Words
How quickly does the first transcript appear after audio starts streaming? This matters enormously for live captioning, real-time coaching, and conversational AI.
| Dataset | Whissle | Deepgram | AssemblyAI | Whissle Advantage |
|---|---|---|---|---|
| LibriSpeech | 1,002ms | 3,382ms | 1,435ms | 3.4x faster than DG |
| EN-IN Tech | 1,681ms | 4,626ms | 2,318ms | 2.8x faster than DG |
| Soccer | 1,465ms | 3,153ms | 1,750ms | 2.2x faster than DG |
| Court India | 814ms | 2,423ms | 1,537ms | 3.0x faster than DG |
Whissle delivers the first transcript 2--3x faster than Deepgram across every dataset. When a user is watching live captions or an agent needs to respond in real time, that ~2-second head start is the difference between "responsive" and "laggy."
Deepgram appears to buffer more audio before emitting its first transcript -- a tradeoff that likely contributes to its higher accuracy. Whissle prioritizes responsiveness.
Latency percentiles (LibriSpeech):
| Percentile | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| P50 (median) | 1,002ms | 3,382ms | 1,435ms |
| P95 | 2,375ms | 5,375ms | 2,090ms |
| P99 | 4,002ms | 5,534ms | 2,512ms |
Even at the tail (P99), Whissle stays under 4 seconds. Deepgram's P95 and P99 are tightly clustered around 5.4s, suggesting a consistent buffering strategy rather than variance.
Reliability: One Provider Has a Serious Problem
| Dataset | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| LibriSpeech (2,620) | 0 (0.0%) | 0 (0.0%) | 987 (37.7%) |
| EN-IN Tech (1,204) | 0 (0.0%) | 0 (0.0%) | 253 (21.0%) |
| Soccer (327) | 0 (0.0%) | 0 (0.0%) | 17 (5.2%) |
| Court India (764) | 4 (0.5%) | 2 (0.3%) | 2 (0.3%) |
| Total (4,915) | 4 (0.08%) | 2 (0.04%) | 1,259 (25.6%) |
AssemblyAI's V3 streaming API failed to return a transcript for one in four samples overall -- driven almost entirely by catastrophic failure rates on LibriSpeech (37.7%) and EN-IN Tech (21.0%). On the court dataset, all three providers had a handful of failures, but the numbers are negligible.
When we adjust AssemblyAI's WER to account for failures (treating each failed sample as 100% WER), its LibriSpeech number jumps from 12.55% to 45.49%. Its accuracy advantage on soccer disappears entirely.
This is worth emphasizing: AssemblyAI's published WER numbers only reflect the samples that worked. For any production application, reliability is a prerequisite, not a feature.
Deep Dive: Where the Errors Happen
Raw WER numbers tell you how much a provider gets wrong. Error type analysis tells you how it gets things wrong -- and that pattern reveals something important about each provider's architecture.
Error Composition
Every transcription error is one of three types:
- Substitution -- the wrong word ("cat" instead of "cap")
- Deletion -- a word was dropped entirely
- Insertion -- a word was hallucinated that wasn't spoken
Whissle has a balanced, substitution-dominant profile. On LibriSpeech: 62% substitutions, 20% deletions, 18% insertions. This is the typical acoustic model pattern -- the model hears something but gets the word wrong. As conditions get harder, the deletion rate creeps up, but the profile stays balanced.
Deepgram shifts strategy between clean and difficult audio. On clean speech: 68% substitutions. On accented speech: 49% deletions. It becomes more conservative -- dropping words it's uncertain about rather than guessing. Better to omit than to hallucinate.
AssemblyAI is deletion-dominant everywhere. LibriSpeech: 78% deletions. EN-IN Tech: 76% deletions. Combined with its 26% failure rate, this points to a systemic issue in the streaming pipeline -- segments being dropped, turns ending prematurely, or transcripts being truncated.
| Provider | LibriSpeech | EN-IN Tech | Soccer | Court India | Pattern |
|---|---|---|---|---|---|
| Whissle | 62% sub / 20% del / 18% ins | 51% sub / 33% del / 16% ins | 45% sub / 33% del / 21% ins | 37% sub / 30% del / 34% ins | Balanced, substitution-led |
| Deepgram | 68% sub / 22% del / 11% ins | 36% sub / 49% del / 15% ins | 39% sub / 36% del / 25% ins | 27% sub / 8% del / 65% ins | Shifts strategy per dataset |
| AssemblyAI | 21% sub / 78% del / 2% ins | 18% sub / 76% del / 5% ins | 33% sub / 49% del / 18% ins | 20% sub / 52% del / 28% ins | Deletion-dominant |
The Degradation Problem
How much does each provider degrade from clean speech to real-world audio?
| Provider | LibriSpeech (clean) | EN-IN (accented) | Degradation | Soccer (noisy) | Degradation | Court India (legal) | Degradation |
|---|---|---|---|---|---|---|---|
| Whissle | 8.24% | 40.67% | 4.9x | 53.89% | 6.5x | 53.50% | 6.5x |
| Deepgram | 4.16% | 29.83% | 7.2x | 43.79% | 10.5x | 45.50% | 10.9x |
| AssemblyAI | 11.11% | 39.45% | 3.6x | 45.62% | 4.1x | 47.98% | 4.3x |
Deepgram has the best absolute numbers everywhere -- but it also degrades the most dramatically. Its 4.16% on LibriSpeech balloons 10.9x to 45.50% on court proceedings. This suggests its LibriSpeech excellence may partly reflect optimization for clean read speech rather than generalized robustness.
This is the central insight of this benchmark: LibriSpeech performance is a poor predictor of real-world performance. The provider with the best clean-speech score may not maintain that lead on your actual audio.
The WER Distribution Story
Mean WER tells you the average. But the distribution tells you what to actually expect.
WER by Duration (LibriSpeech)
| Duration | Whissle | Deepgram | AssemblyAI | Samples |
|---|---|---|---|---|
| Short (< 5s) | 12.42% | 5.82% | 15.46% | 1,089 |
| Medium (5--10s) | 7.81% | 4.19% | 10.23% | 917 |
| Long (> 10s) | 7.93% | 4.01% | 11.39% | 614 |
All providers improve on longer utterances (more context for language model rescoring). Deepgram's advantage is consistent across all duration buckets.
WER Brackets
LibriSpeech -- what percentage of samples fall into each quality bracket?
| Bracket | Whissle | Deepgram | AssemblyAI* |
|---|---|---|---|
| Perfect (0% WER) | 901 (34.4%) | 1,532 (58.5%) | 446 (27.3%) |
| Excellent (≤ 10%) | 1,723 (65.8%) | 2,235 (85.3%) | 1,046 (64.1%) |
| Good (≤ 20%) | 2,277 (86.9%) | 2,485 (94.8%) | 1,351 (82.7%) |
| Poor (> 50%) | 40 (1.5%) | 14 (0.5%) | 84 (5.1%) |
| Total failure (≥ 100%) | 9 (0.3%) | 8 (0.3%) | 11 (0.7%) |
*AssemblyAI percentages based on 1,633 successful samples only (987 failed)
EN-IN Tech Interviews -- where things get uncomfortable for everyone:
| Bracket | Whissle | Deepgram | AssemblyAI* |
|---|---|---|---|
| Perfect (0% WER) | 13 (1.1%) | 62 (5.2%) | 31 (3.3%) |
| Good (≤ 20%) | 156 (13.0%) | 471 (39.1%) | 250 (26.3%) |
| Poor (> 50%) | 487 (40.4%) | 381 (31.7%) | 394 (41.4%) |
| Total failure (≥ 100%) | 247 (20.5%) | 272 (22.6%) | 194 (20.4%) |
*AssemblyAI percentages based on 951 successful samples only (253 failed)
On accented speech, even the best provider gets fewer than 4 in 10 samples to "good" accuracy. This isn't a provider problem -- it's an industry problem. Accented ASR is far from solved.
Soccer Broadcast: Zero samples achieved perfect transcription for any provider. Not a single 30-second broadcast segment was transcribed without at least one error.
Streaming vs. Batch: Does Pre-Recorded Mode Help?
Streaming ASR processes audio in real time as it arrives. Batch (pre-recorded) ASR processes the entire audio file at once, giving the engine access to full context before producing output. Does this make a difference?
We ran identical audio through both streaming and batch APIs for Whissle and Deepgram across all four languages (100 samples each, 1000 for Hindi). For Whissle batch, we tested both greedy decoding and KenLM language model rescoring.
Streaming vs. Batch WER (%) -- lower is better:
| Language | Whissle Stream | Whissle Batch | Whissle Batch+LM | Deepgram Stream | Deepgram Batch |
|---|---|---|---|---|---|
| English | 6.21% | 5.63% | 5.31% | 4.14% | 3.48% |
| Spanish | 33.81% | 27.70% | 26.54% | 15.93% | 12.92% |
| German | 30.03% | 28.73% | 26.09% | 22.96% | 18.95% |
| Hindi | 52.86% | 48.76% | 45.15% | 42.84% | 37.91% |
Key observations:
- Batch consistently improves WER for both providers. Having the full audio context before decoding helps -- expected, since batch engines can look ahead and resolve ambiguities.
- Whissle's LM rescoring adds 1--4% absolute improvement on top of batch greedy decoding. The KenLM n-gram model catches spelling and grammar errors that pure CTC decoding misses.
- Deepgram's batch mode narrows the gap -- in English, Deepgram batch (3.48%) approaches Gemini 2.0 Flash (3.12%), the accuracy leader.
- Streaming is the harder problem. Every provider does better in batch mode. The question is whether your use case can wait for the full audio -- if you're building real-time applications (live captioning, call coaching, voice agents), streaming is not optional.
Batch mode is the right choice for post-call analytics, media transcription, and any workflow where latency doesn't matter. But for live applications, streaming accuracy is what counts -- and that's where the architectural differences between CTC-based streaming (Whissle) and proprietary streaming (Deepgram) become the deciding factor.
Beyond Transcription: Streaming Metadata
Here's where the comparison shifts from "who has the lowest WER" to "who gives you the most useful output."
All providers return transcripts with word timestamps and confidence scores. But the metadata landscape varies significantly -- and the delivery mode matters as much as the feature list:
| Feature | Whissle | Deepgram | AssemblyAI | Gemini 2.0 Flash |
|---|---|---|---|---|
| Transcription | Yes | Yes | Yes | Yes |
| Word timestamps | Yes | Yes | Yes | -- |
| Word confidence | Yes | Yes | Yes | -- |
| Interim/partial results | Yes | Yes | Yes | -- |
| Emotion detection | Yes (real-time, 7 classes) | Batch only (Sentiment) | Batch only (Sentiment) | Batch only (via prompt) |
| Intent classification | Yes (real-time, 6 classes) | Batch only | -- | Batch only (via prompt) |
| Speaker change detection | Yes (in-stream) | -- | -- | -- |
| Speaker demographics | Yes (age, gender) | -- | -- | -- |
| Speech rate (WPM) | Yes | -- | -- | -- |
| Filler word detection | Yes | -- | -- | -- |
| Emotion probabilities | Yes (per-segment) | -- | -- | -- |
| Intent probabilities | Yes (per-segment) | -- | -- | -- |
| Speaker diarization | -- | Add-on (streaming) | Batch only | -- |
| Topic detection | -- | Batch only | Batch only | Batch only (via prompt) |
| Entity detection | Via in-stream tokens | Batch only | Batch only | Batch only (via prompt) |
| Summarization | -- | Batch only (EN) | Batch only | Batch only (via prompt) |
| Delivery mode | Streaming WebSocket | Streaming + batch add-ons | Streaming + batch add-ons | Batch API only |
| Metadata latency | ~200ms (included) | 887--1,233ms (separate call) | Separate batch call | 1,819--2,219ms (separate call) |
Deepgram and AssemblyAI offer sentiment, intent, and entity features in their batch/pre-recorded APIs, but in streaming mode these are not available. Gemini 2.0 Flash can extract any metadata via prompting -- but only in batch mode, with 1.8--2.4 second latency per utterance. Deepgram's streaming add-on for speaker diarization is a notable exception.
Whissle streams all of the following in real time -- as part of the same WebSocket connection, with zero additional latency:
- Emotion detection -- per-segment classification across 7 emotions with full probability distributions
- Intent classification -- 20+ intent categories (inform, question, command, describe, etc.) with probabilities
- Speaker change detection -- in-stream tokens marking when the speaker changes
- Speech rate -- words per minute, measured in real time
- Filler word detection -- counts of "um", "uh", "like", "you know"
- Speaker demographics -- age range and gender classification
Metadata Coverage
| Dataset | Samples | Emotion Detected | Intent Detected | WPM Measured |
|---|---|---|---|---|
| LibriSpeech | 2,620 | 2,620 (100%) | 2,620 (100%) | 2,618 (99.9%) |
| EN-IN Tech | 1,204 | 1,088 (90.4%) | 1,088 (90.4%) | 1,060 (88.0%) |
| Soccer | 327 | 327 (100%) | 327 (100%) | 327 (100%) |
| Court India | 764 | 760 (99.5%) | 760 (99.5%) | 756 (99.0%) |
Metadata Evaluation: Emotion Detection
The soccer broadcast dataset includes human-annotated ground truth for emotion, intent, age, and gender -- allowing us to evaluate Whissle's metadata predictions against labels.
Emotion distribution across datasets:
| Emotion | LibriSpeech | EN-IN Tech | Soccer | Court India |
|---|---|---|---|---|
| Neutral | 2,173 (82.9%) | 1,082 (99.4%) | 326 (99.7%) | 276 (36.3%) |
| Angry | 216 (8.2%) | 2 (0.2%) | -- | 277 (36.4%) |
| Happy | 133 (5.1%) | 1 (0.1%) | 1 (0.3%) | 207 (27.2%) |
| Disgust | 62 (2.4%) | 1 (0.1%) | -- | -- |
| Surprise | 16 (0.6%) | 1 (0.1%) | -- | -- |
| Sad | 15 (0.6%) | 1 (0.1%) | -- | -- |
| Fear | 5 (0.2%) | -- | -- | -- |
LibriSpeech (audiobook narration) shows emotional variety -- professional narrators use distinct vocal tones for different characters and dramatic moments, which the model picks up. Tech interviews are almost entirely neutral, as expected for professional conversations. Soccer broadcast commentary is acoustically neutral even when describing exciting moments.
The Supreme Court dataset shows the most interesting emotion distribution: roughly equal parts angry (36.4%), neutral (36.3%), and happy (27.2%). This reflects the adversarial nature of courtroom proceedings -- advocates argue forcefully (detected as angry), judges maintain composure (neutral), and favorable rulings or successful arguments produce positive vocal tones (happy). Unlike soccer commentary where acoustic emotion is muted, courtroom speech carries genuine vocal intensity.
Ground truth comparison (Soccer Broadcast, 327 samples):
| Ground Truth Emotion | Samples | Whissle Correct | Recall |
|---|---|---|---|
| Neutral | 202 | 202 | 100.0% |
| Happy | 66 | 1 | 1.5% |
| Sad | 59 | 0 | 0.0% |
| Overall accuracy | 327 | 203 | 62.1% |
An important nuance: the ground truth labels in the soccer dataset annotate contextual emotion (happy when a goal is scored, sad when a chance is missed), while Whissle detects acoustic emotion (the actual tone of voice). Professional broadcast commentators maintain a relatively even vocal tone regardless of context -- a commentator describing a goal doesn't necessarily sound acoustically "happy" in the way a person celebrating would. The 100% recall on neutral reflects that the model is accurately reading the acoustic signal, even when the annotated context differs.
Metadata Evaluation: Intent Classification
Intent distribution across datasets (EN-IN percentages based on 1,088 samples with detected intent; 116 very short segments returned no intent):
| Intent | LibriSpeech | EN-IN Tech | Soccer | Court India |
|---|---|---|---|---|
| Inform | 2,337 (89.2%) | 1,054 (96.9%) | 326 (99.7%) | 752 (98.9%) |
| Question | 151 (5.8%) | 28 (2.6%) | 1 (0.3%) | 6 (0.8%) |
| Describe | 67 (2.6%) | -- | -- | -- |
| Command | 29 (1.1%) | 4 (0.4%) | -- | 1 (0.1%) |
| Other | 36 (1.4%) | 2 (0.2%) | -- | 1 (0.1%) |
The soccer dataset has domain-specific ground truth intents (Play-by-play, Analysis, Statistic, Reaction). Whissle uses a general-purpose intent taxonomy. When we map the domain-specific labels to their general equivalents (Play-by-play/Analysis/Statistic map to Inform), the alignment accuracy is 99.7% (326/327) -- the model correctly identifies that commentary is fundamentally informational speech.
LibriSpeech shows more intent variety because audiobook narration includes dialogue with questions (5.8%), descriptive passages (2.6%), and imperative speech (1.1%). Tech interviews are overwhelmingly informational with occasional questions -- consistent with interview dynamics. Court proceedings are similarly dominated by informational speech (98.9%) -- advocates presenting arguments and judges delivering rulings -- with occasional questions from the bench (0.8%).
Speech Rate Analysis
| Dataset | Mean WPM | Median WPM | Range |
|---|---|---|---|
| LibriSpeech (read speech) | 168.2 | 166.9 | 28.4 -- 284.0 |
| EN-IN Tech (conversational) | 115.2 | 114.3 | 25.0 -- 400.0 |
| Soccer (broadcast) | 149.7 | 150.0 | 83.0 -- 212.5 |
| Court India (legal) | 148.1 | 146.6 | 59.8 -- 252.2 |
LibriSpeech shows the highest speech rate (professional narrators reading at a steady pace). Tech interviews are slower (thinking pauses, technical explanations). Soccer commentary and court proceedings sit in between at similar median rates (~150 WPM), though court speech shows a wider range -- from measured judicial deliberation to rapid-fire advocacy.
Why Metadata Changes Everything
If you're building live captioning, a transcription API is enough. But if you're building anything that needs to understand the conversation -- not just record it -- you need more than words.
1. Detect escalations before they happen. Real-time emotion detection flags when a customer's tone shifts from neutral to frustrated -- before they ask for a manager. Intent classification distinguishes a genuine question from a complaint. In a call center, this means routing to a supervisor in seconds, not minutes.
2. Coach speakers in real time. Speech rate tracking tells presenters they're rushing. Filler word counts ("um", "uh", "like") show job candidates exactly what to work on. All streamed live during the conversation, not buried in a post-call report.
3. Understand conversations, not just words. Speaker change detection identifies turn-taking patterns. Demographics reveal who's dominating a meeting. Intent classification distinguishes questions from statements from commands. These signals transform a flat transcript into a structured understanding of what actually happened.
4. Ship with confidence. Near-zero failures across 4,915 samples means virtually every audio segment returns a transcript plus metadata. No silent drop-outs, no missing data, no fallback logic needed. Getting all of this in a single streaming WebSocket connection, with zero additional latency at the same per-minute price, is a fundamentally different value proposition.
Metadata Extraction: The Hidden Latency Cost
Competitors offer metadata features -- but as separate batch API calls that add latency and cost on top of transcription. We benchmarked the real cost of getting metadata from each provider across four languages (100 samples per language).
What each provider returns with metadata enabled:
| Feature | Whissle (streaming) | Gemini 2.0 Flash (batch) | Deepgram Nova-3 (batch) |
|---|---|---|---|
| Transcription | Yes | Yes | Yes |
| Emotion (7 classes) | Yes -- real-time | Yes -- batch* | -- |
| Intent (6 classes) | Yes -- real-time | Yes -- batch* | Yes -- batch |
| Sentiment | Yes -- real-time | Yes -- batch* | Yes -- batch |
| Topics | -- | Yes -- batch* | Yes -- batch |
| Named entities | Via in-stream tokens | Yes -- batch* | Yes -- batch |
| Summarization | -- | -- | Yes -- batch (EN only) |
| Emotion probabilities | Yes (per-segment) | -- | -- |
| Intent probabilities | Yes (per-segment) | -- | -- |
| Speech rate (WPM) | Yes -- real-time | -- | -- |
| Filler detection | Yes -- real-time | -- | -- |
| Speaker demographics | Yes -- real-time | -- | -- |
* Gemini metadata is extracted via LLM prompt -- accurate but batch-only with variable latency.
Transcription accuracy with metadata enabled (WER %):
| Language | Deepgram streaming | Gemini+Meta (batch) | Deepgram+Meta (batch) |
|---|---|---|---|
| English | 4.14% | 3.72% | 2.86% |
| Spanish | 15.09% | 12.48% | 13.26% |
| German | 23.01% | 17.77% | 18.38% |
| Hindi | 37.17% | 23.52% | 27.11% |
Median latency to get transcript + metadata (ms):
| Language | Whissle (streaming) | Gemini+Meta (batch) | Deepgram+Meta (batch) |
|---|---|---|---|
| English | ~200ms | 1,819ms | 1,068ms |
| Spanish | ~200ms | 2,193ms | 887ms |
| German | ~200ms | 2,219ms | 886ms |
| Hindi | ~200ms | 1,906ms | 1,233ms |
Whissle's streaming metadata arrives at ~200ms as part of the same WebSocket connection -- before competitors even finish their batch API calls. Deepgram's batch metadata API takes 886--1,233ms per utterance. Gemini's LLM-based extraction takes 1,819--2,219ms. Both require a separate API call after transcription, meaning the total pipeline latency is transcription time + metadata extraction time.
For a streaming use case processing 10 utterances per minute, the metadata overhead alone adds:
- Whissle: 0ms additional (included in streaming response)
- Deepgram batch meta: ~10s of additional API calls per minute
- Gemini meta: ~20s of additional API calls per minute
Label taxonomy alignment: Both Whissle and Gemini use the same emotion labels (EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE, EMOTION_DISGUST) and intent labels (INTENT_INFORM, INTENT_QUESTION, INTENT_COMMAND, INTENT_AFFIRM, INTENT_OTHER). Deepgram's batch API returns sentiment (positive/negative/neutral) rather than fine-grained emotion classes -- a coarser signal that conflates emotion with opinion polarity.
The takeaway: Gemini and Deepgram can extract metadata, but only after the audio is fully processed, only via batch APIs, and at significant latency cost. Whissle delivers richer metadata -- including emotion probabilities, intent probabilities, speech rate, and filler detection -- in real time, as part of the same streaming connection, with zero additional latency or cost.
Confidence Scores: Who's Honest About Uncertainty?
| Dataset | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| LibriSpeech | 0.949 | 0.987 | 0.868 |
| EN-IN Tech | 0.757 | 0.903 | 0.809 |
| Soccer | 0.820 | 0.934 | 0.842 |
| Court India | 0.785 | 0.941 | 0.837 |
Deepgram's confidence is high and justified -- it has the best WER and the highest confidence. Well calibrated.
Whissle's confidence drops sharply from 0.949 on clean speech to 0.757 on accented speech -- a 20-point drop that honestly reflects the increased difficulty. If you're building a system that needs to know when to trust the transcript less, Whissle's confidence is informative.
AssemblyAI's confidence stays between 0.81 and 0.87 across all datasets -- even though its WER is the worst on LibriSpeech and it has a 26% failure rate. The confidence doesn't drop proportionally to accuracy.
A confidence score is only useful if low confidence actually predicts errors. Whissle's confidence is honest -- it goes down when things get hard. AssemblyAI's confidence stays high regardless.
Pricing and Deployment
| Whissle | Deepgram | AssemblyAI | Gemini 2.0 Flash | |
|---|---|---|---|---|
| Cost per audio minute | $0.0043 | $0.0043 | $0.0050 | ~$0.001* |
| Cost per audio hour | $0.258 | $0.258 | $0.300 | ~$0.06* |
| What's included | Transcription + emotion, intent, speaker change, WPM, fillers, demographics | Transcription only (metadata features are add-ons) | Transcription only | Transcription + metadata via prompt |
| Streaming support | Yes (WebSocket) | Yes (WebSocket) | Yes (WebSocket) | No (batch only) |
| Deployment options | Cloud API + self-hosted Docker | Cloud API only | Cloud API only | Cloud API only |
| Self-hosted option | Yes (eliminates per-minute costs) | No | No | No |
* Gemini pricing is per-token, not per audio minute. Cost varies by audio length and prompt. Estimate based on typical 10-second utterances.
Whissle matches Deepgram's base streaming transcription price at $0.0043 per audio minute -- but at that price, you get all metadata (emotion, intent, speaker change, speech rate, filler detection, demographics) included. With Deepgram, metadata features are paid add-ons on top of the base transcription cost. Gemini is the cheapest per-minute option, but the lack of streaming support makes it unsuitable for real-time applications.
For high-volume use cases, Whissle's self-hosted Docker deployment eliminates per-minute costs entirely -- critical for call centers, media processing, and organizations with data residency requirements.
Who Wins What?
Choose Deepgram when:
- Transcription accuracy is your number one priority
- Your audio is primarily clean, well-recorded English
- You're comfortable with slightly higher first-segment latency (~3.4s)
- You need the best WER at competitive per-minute pricing
Choose Whissle when:
- You need the fastest first-segment response (1.0--1.7s) for real-time coaching, live captions, or conversational AI
- You need streaming emotion, intent, speech rate, or filler detection alongside transcription
- Near-zero failure rates are non-negotiable (medical, legal, safety, accessibility)
- You want to self-host ASR on your own infrastructure
- You're building beyond transcription -- coaching, analytics, meeting intelligence
Be cautious with AssemblyAI streaming:
- The V3 streaming API had a 25.6% failure rate across our benchmark
- When it works, accuracy is competitive -- but the failure rate makes it unreliable for production use
- The deletion-dominant error profile suggests systemic issues with the streaming pipeline
The Bigger Picture
Clean speech ASR is effectively solved. At 4--12% WER on LibriSpeech, we're in the range of human disagreement. Providers arguing over percentage points on clean speech are fighting the last war.
Real-world ASR is far from solved. The jump from 4% to 46% WER when you add accents, noise, and domain-specific vocabulary is not a small gap -- it's an 11x degradation. This is the frontier.
LibriSpeech benchmarks are misleading in isolation. A provider that's 2x better on LibriSpeech might be only 1.3x better on your actual audio -- or worse if their model is overfit to clean speech.
Reliability is a feature, not an assumption. One of three major providers failed on 26% of samples. Your test suite needs to measure failure rate, not just WER on the samples that succeed.
Batch metadata adds latency, not intelligence. Deepgram and Gemini can extract sentiment, intent, and entities -- but only through separate batch API calls that add 0.9--2.2 seconds per utterance. For a system processing 10 utterances per minute, that's 9--22 seconds of metadata overhead per minute. Whissle delivers richer metadata (emotion with probabilities, intent with probabilities, speech rate, filler detection, demographics) at zero additional latency, in the same streaming response.
Transcription is table stakes. The next wave of speech AI isn't about squeezing 0.5% more WER -- it's about what else you can extract from audio in real time. Emotion, intent, speech patterns, engagement signals. The provider that delivers these alongside transcription, without additional latency or cost, changes the application design space.
Cross-Dataset Summary
| Dataset | Samples | Audio | Whissle WER | Deepgram WER | AssemblyAI WER | AAI Failures |
|---|---|---|---|---|---|---|
| LibriSpeech test-clean | 2,620 | 5.40h | 9.75% | 4.83% | 12.55% | 987 (37.7%) |
| EN-IN Tech Interviews | 1,204 | 5.27h | 62.94% | 43.62% | 51.25% | 253 (21.0%) |
| Soccer Broadcast | 327 | 2.73h | 73.74% | 64.99% | 64.18% | 17 (5.2%) |
| Supreme Court India | 764 | 4.24h | 70.48% | 67.44% | 61.31% | 2 (0.3%) |
| Total | 4,915 | 17.64h | -- | -- | -- | 1,259 (25.6%) |
Methodology Appendix
Benchmark script: Open source, available in our repository. Uses HuggingFace datasets for LibriSpeech and EN-IN Tech, JSONL manifests for Soccer. Full per-sample JSON results published for independent verification.
Text normalization: Lowercase, strip all punctuation, collapse whitespace. For WhissleAI datasets, metadata tokens (AGE_*, GENDER_*, EMOTION_*, INTENT_*) and entity markers (ENTITY_*...END) stripped from reference text before comparison.
WER computation: jiwer library (standard implementation). Per-sample WER averaged for mean; total errors / total words for corpus WER.
Corpus WER vs Mean WER: Mean WER averages per-sample rates -- a 3-word utterance with 1 error scores 33% and weighs equally with a 100-word utterance. Corpus WER divides total errors by total words, giving proportional weight. We report both; corpus WER is generally more representative.
Infrastructure: Whissle on Cloud Run (CPU, ONNX runtime). Deepgram and AssemblyAI via their cloud APIs. All requests from the same machine. April 2026.
Reproducibility: All result JSONs (per-sample transcripts, WER, latency, metadata) are published. Every number in this post can be independently verified.
We built this benchmark because we believe the ASR industry needs more transparent, multi-dataset evaluations. If you want to run these tests yourself or add your own datasets, the tooling is open source.
If you're exploring ASR for your application, try the providers on your actual audio before making a decision. Benchmarks inform -- but your data decides.
