We Benchmarked 3 Streaming ASR Providers Across 17 Hours of Audio. Here's What We Found.

By Whissle Research Team

Apr 10 2026

4,915 samples. Four datasets. Clean speech, Indian-accented tech interviews, noisy soccer broadcasts, and Indian Supreme Court proceedings. A transparent, warts-and-all comparison of Whissle, Deepgram, and AssemblyAI.


Most ASR benchmarks test on LibriSpeech and call it a day. The provider with the lowest number on clean, studio-recorded audiobook narration wins, and everyone moves on.

But that's not the audio your application deals with.

Your audio has accents. Background noise. People talking fast, interrupting each other, mumbling through technical jargon. The question isn't "which ASR is best on clean speech?" -- it's "which ASR holds up when conditions get real?"

We set out to answer that question. We benchmarked three streaming ASR providers -- Whissle, Deepgram (Nova-2), and AssemblyAI (Universal Streaming) -- across four increasingly difficult datasets. We measured accuracy, latency, reliability, and real-time metadata capabilities. We ran all three providers in parallel on the same audio to ensure a fair comparison.

Here's everything we found.


How We Tested

All three providers were tested using real-time WebSocket streaming -- the exact protocol you'd use in production. Audio was streamed in 100ms chunks at 1x real-time speed, simulating a live microphone.

| Provider | Model | Streaming Endpoint |
| --- | --- | --- |
| Whissle | Default (ONNX runtime) | wss://api.whissle.ai/asr/stream |
| Deepgram | Nova-2 | wss://api.deepgram.com/v1/listen |
| AssemblyAI | Universal Streaming | wss://streaming.assemblyai.com/v3/ws |
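
As a rough sketch of the pacing described above (100ms chunks of 16kHz mono int16 audio at 1x speed), the chunking logic might look like the following. The function names are illustrative and not taken from the benchmark script, and the WebSocket send itself is left out because each provider expects its own message format.

```python
import time
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # from the post: 16kHz mono
BYTES_PER_SAMPLE = 2     # PCM int16
CHUNK_MS = 100           # from the post: 100ms chunks


def chunk_bytes(chunk_ms: int = CHUNK_MS) -> int:
    """Size in bytes of one chunk of mono int16 audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def stream_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS,
                  realtime: bool = True) -> Iterator[bytes]:
    """Yield successive chunks; sleep between them to mimic a live microphone."""
    size = chunk_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
        if realtime:
            # 1x real-time: wait one chunk duration before sending the next.
            time.sleep(chunk_ms / 1000)
```

Each 100ms chunk is 3,200 bytes under these assumptions. Passing `realtime=False` skips the sleep, which is handy for dry runs that do not touch the network.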

Fair comparison measures:

  • All providers received identical audio (PCM int16, mono, 16kHz)
  • All three ran concurrently on each sample -- same network conditions
  • Text normalized before WER: lowercased, punctuation stripped, whitespace collapsed
  • WER computed using the standard jiwer library

Key metrics:

| Metric | What It Measures |
| --- | --- |
| WER | Word Error Rate -- (Insertions + Deletions + Substitutions) / Reference Words |
| Corpus WER | Total errors / total reference words (avoids inflation from short utterances) |
| CER | Character Error Rate -- same formula at character level |
| Time to first segment | Milliseconds from audio stream start to first non-empty transcript |
| Failure rate | Percentage of samples that returned no usable transcript |
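
To make the normalization and the WER/CER definitions concrete, here is a minimal dependency-free sketch. The benchmark itself uses jiwer; this version swaps in a plain edit-distance routine so it runs anywhere. Treating "punctuation" as ASCII punctuation (which also removes apostrophes) is an assumption, not something the post specifies.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference).split()
    if not ref:
        raise ValueError("reference has no words after normalization")
    return edit_distance(ref, normalize(hypothesis).split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    ref = list(normalize(reference))
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, list(normalize(hypothesis))) / len(ref)
```

For example, `wer("The cat sat.", "the cap sat")` is one substitution over three reference words.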

The Four Datasets

We deliberately chose datasets that span the difficulty spectrum -- one industry standard, three that represent real-world conditions.

Dataset 1: LibriSpeech test-clean

The industry baseline. Clean, clear, studio-quality audiobook narration.

| Property | Value |
| --- | --- |
| Source | openslr/librispeech_asr (HuggingFace) |
| Samples | 2,620 |
| Total audio | 5.4 hours (19,453 seconds) |
| Duration range | 1.3s -- 35.0s (median 5.8s) |
| Language | English (US) |
| Domain | Read audiobook speech |
| Conditions | Studio-quality, single speaker, no background noise |

| Duration Bucket | Samples | Percentage |
| --- | --- | --- |
| Short (< 5s) | 1,089 | 41.6% |
| Medium (5--10s) | 917 | 35.0% |
| Long (> 10s) | 614 | 23.4% |

Why include it: It's the universal reference point. Every ASR paper reports LibriSpeech results, making it the baseline for cross-provider comparison.

Why it's not enough: No accents, no noise, no domain-specific vocabulary, no overlapping speakers. LibriSpeech performance doesn't predict real-world performance.


Dataset 2: EN-IN Tech Interviews

Indian-accented English in technical interview settings. Where clean-speech benchmarks break down.

| Property | Value |
| --- | --- |
| Source | WhissleAI/Meta_STT_EN-IN_Tech_Interviews (HuggingFace) |
| Samples | 1,204 |
| Total audio | 5.3 hours (18,969 seconds) |
| Duration range | 1.1s -- 20.0s (median 20.0s) |
| Language | English with Indian accents (EN-IN) |
| Domain | Technical interview conversations |
| Conditions | Variable quality, phone/video call audio, technical vocabulary |

| Duration Bucket | Samples | Percentage |
| --- | --- | --- |
| Short (< 5s) | 95 | 7.9% |
| Medium (5--10s) | 140 | 11.6% |
| Long (> 10s) | 969 | 80.5% |

What makes it hard:

  • Indian English accents. Retroflex consonants, consonant substitutions (/v/ for /w/), syllable-timed rhythm -- systematic phonetic differences from US/UK English.
  • Technical vocabulary. Framework names, API terminology, algorithm discussions.
  • Conversational patterns. Hesitations, false starts, filler words, incomplete sentences.
  • Long utterances. 80% of samples exceed 10 seconds -- errors compound over longer audio.
  • Variable recording quality. Phone/video call audio with varying mic quality and background noise.

Each sample includes ground-truth annotations for speaker age, gender, emotion, and intent -- stripped before WER calculation.


Dataset 3: Soccer Broadcast Commentary

Live match commentary with crowd noise, rapid speech, and proper nouns from a dozen nationalities. One of the two hardest tests.

| Property | Value |
| --- | --- |
| Source | Custom dataset from YouTube soccer match broadcasts |
| Samples | 327 |
| Total audio | 2.7 hours (9,810 seconds) |
| Segment duration | 30.0s each (uniform) |
| Language | English (UK) |
| Domain | Live soccer match commentary |
| Primary source | Manchester City v Manchester United -- 2024 Community Shield (326 segments) |

Rich ground-truth annotations embedded in reference text (stripped before WER):

| Annotation Type | Count | Examples |
| --- | --- | --- |
| Player names | 379 | Fernandez, McAtee, Rashford |
| Team names | 139 | Manchester United, Arsenal |
| Game actions | 102 | shot, scored, corner, penalty |
| Referee decisions | 47 | penalty, offside |
| Competition names | 39 | Community Shield |
| Game timestamps | 36 | 90 minutes |
| Field positions | 30 | right hand side |
| Scores | 11 | 1-1, 7-6 |

| Ground-Truth Emotion | Samples | Ground-Truth Intent | Samples |
| --- | --- | --- | --- |
| Neutral | 202 (61.8%) | Analysis | 164 (50.2%) |
| Happy | 66 (20.2%) | Play-by-play | 114 (34.9%) |
| Sad | 59 (18.0%) | Statistic | 8 (2.4%) |

Speaker demographics: All male. 65% aged 60+, 35% aged 45--60.

What makes it so hard:

  • Constant background noise -- crowd chanting, whistles, stadium atmosphere
  • Rapid speech -- commentators average ~150 WPM, spiking during action
  • Proper noun density -- 379 player name mentions across 327 segments
  • Multiple speakers -- two commentators frequently overlapping
  • Domain jargon -- offside, set piece, penalty shootout
  • 30-second segments -- long continuous audio with dense speech
  • Broadcast processing -- dynamic range compression, crowd noise mixing

Dataset 4: Supreme Court of India Proceedings

Indian-accented legal English in a reverberant courtroom. Multiple speakers, heavy domain vocabulary, and challenging acoustics.

| Property | Value |
| --- | --- |
| Source | WhissleAI/supreme-court-india-meta-speech (HuggingFace) |
| Samples | 764 |
| Total audio | 4.2 hours (15,280 seconds) |
| Segment duration | 20.0s each (uniform) |
| Language | English with Indian accents (EN-IN) |
| Domain | Supreme Court oral arguments and judicial proceedings |
| Conditions | Courtroom acoustics, multiple speakers, legal terminology, reverberant environment |

What makes it hard:

  • Indian English accents in formal legal speech. Judges, advocates, and amicus curiae speaking with varying regional accents -- phonetic patterns that trip up models trained primarily on US/UK English.
  • Dense legal vocabulary. Case citations, Latin legal terms, constitutional provisions, statutory references -- highly specialized language that rarely appears in general training data.
  • Reverberant courtroom acoustics. Large courtroom spaces produce echo and reverberation that degrades audio clarity.
  • Multiple speakers. Judges, advocates, and interveners -- with frequent interruptions and cross-talk during oral arguments.
  • Formal yet spontaneous speech. Legal arguments are partially prepared but delivered extemporaneously, with restarts, corrections, and hedging.

Results at a Glance

[Figure: WER Comparison Across Datasets]

Accuracy: Deepgram Wins on WER

LibriSpeech test-clean (clean read speech):

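| Metric | Whissle | Deepgram | AssemblyAI | Best |
| --- | --- | --- | --- | --- |
| WER (mean) | 9.75% | 4.83% | 12.55% | DG |
| WER (median) | 6.25% | 0.00% | 6.67% | DG |
| Corpus WER | 8.24% | 4.16% | 11.11% | DG |
| CER (mean) | 4.30% | 2.05% | 11.74% | DG |
| Word confidence | 0.9494 | 0.9868 | 0.8679 | DG |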

EN-IN Tech Interviews (accented conversational speech):

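| Metric | Whissle | Deepgram | AssemblyAI | Best |
| --- | --- | --- | --- | --- |
| WER (mean) | 62.94% | 43.62% | 51.25% | DG |
| WER (median) | 42.86% | 28.17% | 41.51% | DG |
| Corpus WER | 40.67% | 29.83% | 39.45% | DG |
| CER (mean) | 44.13% | 36.46% | 45.62% | DG |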

Soccer Broadcast (noisy commentary):

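| Metric | Whissle | Deepgram | AssemblyAI | Best |
| --- | --- | --- | --- | --- |
| WER (mean) | 73.74% | 64.99% | 64.18% | AAI |
| WER (median) | 50.00% | 36.25% | 40.35% | DG |
| Corpus WER | 53.89% | 43.79% | 45.62% | DG |
| CER (mean) | 59.69% | 56.17% | 57.58% | DG |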

Supreme Court India (legal proceedings):

| Metric | Whissle | Deepgram | AssemblyAI | Best |
| --- | --- | --- | --- | --- |
| WER (mean) | 70.48% | 67.44% | 61.31% | AAI |
| WER (median) | 53.19% | 42.86% | 47.37% | DG |
| Corpus WER | 53.50% | 45.50% | 47.98% | DG |
| CER (mean) | 57.12% | 61.33% | 55.54% | AAI |

Deepgram's Nova-2 model delivers the best corpus-level accuracy across all four datasets. On clean speech it's excellent (4.16% corpus WER). But even Deepgram drops to 30--46% corpus WER on challenging audio -- a 7--11x degradation from its LibriSpeech numbers. The Supreme Court dataset is the hardest test: all three providers exceed 45% corpus WER, reflecting the combined difficulty of Indian accents, legal vocabulary, and courtroom acoustics.

[Figure: Corpus-Level WER]

Latency: Whissle Is 2--3x Faster to First Words

How quickly does the first transcript appear after audio starts streaming? This matters enormously for live captioning, real-time coaching, and conversational AI.

[Figure: Time to First Segment]

| Dataset | Whissle | Deepgram | AssemblyAI | Whissle Advantage |
| --- | --- | --- | --- | --- |
| LibriSpeech | 1,002ms | 3,382ms | 1,435ms | 3.4x faster than DG |
| EN-IN Tech | 1,681ms | 4,626ms | 2,318ms | 2.8x faster than DG |
| Soccer | 1,465ms | 3,153ms | 1,750ms | 2.2x faster than DG |
| Court India | 814ms | 2,423ms | 1,537ms | 3.0x faster than DG |

Whissle delivers the first transcript 2.2--3.4x faster than Deepgram across every dataset. When a user is watching live captions or an agent needs to respond in real time, that roughly 2-second head start is the difference between "responsive" and "laggy."

Deepgram appears to buffer more audio before emitting its first transcript -- a tradeoff that likely contributes to its higher accuracy. Whissle prioritizes responsiveness.

Latency percentiles (LibriSpeech):

[Figure: Latency Percentiles]

| Percentile | Whissle | Deepgram | AssemblyAI |
| --- | --- | --- | --- |
| P50 (median) | 1,002ms | 3,382ms | 1,435ms |
| P95 | 2,375ms | 5,375ms | 2,090ms |
| P99 | 4,002ms | 5,534ms | 2,512ms |

At the tail (P99), Whissle reaches about 4 seconds (4,002ms), while AssemblyAI has the tightest tail of the three (2,512ms). Deepgram's P95 and P99 are tightly clustered around 5.4s, suggesting a consistent buffering strategy rather than variance.
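
For readers reproducing the percentile table from per-sample results, a minimal sketch follows. It uses the nearest-rank method, which is an assumption: the post does not say which percentile definition was used, and other definitions interpolate between samples. The latency values in the usage note are made up.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

With hypothetical first-segment latencies `[800, 900, 1000, 1100, 4000]` (ms), P50 is 1000 and P99 is 4000, which shows how a single slow sample dominates the tail.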


Reliability: One Provider Has a Serious Problem

[Figure: AssemblyAI Failure Rates]

| Dataset | Whissle | Deepgram | AssemblyAI |
| --- | --- | --- | --- |
| LibriSpeech (2,620) | 0 (0.0%) | 0 (0.0%) | 987 (37.7%) |
| EN-IN Tech (1,204) | 0 (0.0%) | 0 (0.0%) | 253 (21.0%) |
| Soccer (327) | 0 (0.0%) | 0 (0.0%) | 17 (5.2%) |
| Court India (764) | 4 (0.5%) | 2 (0.3%) | 2 (0.3%) |
| Total (4,915) | 4 (0.08%) | 2 (0.04%) | 1,259 (25.6%) |

AssemblyAI's V3 streaming API failed to return a transcript for one in four samples overall -- driven almost entirely by catastrophic failure rates on LibriSpeech (37.7%) and EN-IN Tech (21.0%). On the court dataset, all three providers had a handful of failures, but the numbers are negligible.

When we adjust AssemblyAI's WER to account for failures (treating each failed sample as 100% WER), its LibriSpeech number jumps from 12.55% to 45.49%. Its accuracy advantage on soccer disappears entirely.

[Figure: Reported vs Adjusted WER]
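
The failure adjustment is simple arithmetic: successful samples keep their mean WER and every failed sample counts as 100%. A small sketch, with a function name of our choosing:

```python
def adjusted_mean_wer(mean_wer_pct: float, total: int, failed: int) -> float:
    """Mean WER (%) over all samples, scoring each failed sample as 100% WER."""
    succeeded = total - failed
    return (mean_wer_pct * succeeded + 100.0 * failed) / total


# AssemblyAI figures from the tables above.
librispeech = adjusted_mean_wer(12.55, total=2620, failed=987)
soccer = adjusted_mean_wer(64.18, total=327, failed=17)
```

Here `librispeech` comes out at about 45.49% and `soccer` at about 66.04%, which is above Deepgram's 64.99% mean WER on the same dataset.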

This is worth emphasizing: the AssemblyAI WER numbers reported above reflect only the samples that returned a transcript. For any production application, reliability is a prerequisite, not a feature.


Deep Dive: Where the Errors Happen

Raw WER numbers tell you how much a provider gets wrong. Error type analysis tells you how it gets things wrong -- and that pattern reveals something important about each provider's architecture.

Error Composition

Every transcription error is one of three types:

  • Substitution -- the wrong word ("cat" instead of "cap")
  • Deletion -- a word was dropped entirely
  • Insertion -- a word was hallucinated that wasn't spoken

[Figure: Error Type Composition]

Whissle has a balanced, substitution-dominant profile. On LibriSpeech: 62% substitutions, 20% deletions, 18% insertions. This is the typical acoustic model pattern -- the model hears something but gets the word wrong. As conditions get harder, the deletion rate creeps up, but the profile stays balanced.

Deepgram shifts strategy between clean and difficult audio. On clean speech: 68% substitutions. On accented speech: 49% deletions. There it becomes more conservative -- dropping words it's uncertain about rather than guessing. Better to omit than to hallucinate. The court dataset breaks that pattern, though: 65% of Deepgram's errors there are insertions.

AssemblyAI is deletion-dominant everywhere. LibriSpeech: 78% deletions. EN-IN Tech: 76% deletions. Combined with its 26% failure rate, this points to a systemic issue in the streaming pipeline -- segments being dropped, turns ending prematurely, or transcripts being truncated.

| Provider | LibriSpeech | EN-IN Tech | Soccer | Court India | Pattern |
| --- | --- | --- | --- | --- | --- |
| Whissle | 62% sub / 20% del / 18% ins | 51% sub / 33% del / 16% ins | 45% sub / 33% del / 21% ins | 37% sub / 30% del / 34% ins | Balanced, substitution-led |
| Deepgram | 68% sub / 22% del / 11% ins | 36% sub / 49% del / 15% ins | 39% sub / 36% del / 25% ins | 27% sub / 8% del / 65% ins | Shifts strategy per dataset |
| AssemblyAI | 21% sub / 78% del / 2% ins | 18% sub / 76% del / 5% ins | 33% sub / 49% del / 18% ins | 20% sub / 52% del / 28% ins | Deletion-dominant |
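
The substitution/deletion/insertion split comes from aligning each hypothesis against its reference. A library such as jiwer can report these counts; the sketch below shows the underlying idea with a plain dynamic-programming alignment and a backtrace. It is illustrative only, and when several minimum-cost alignments exist, the split it reports depends on the tie-breaking order chosen here.

```python
def error_counts(ref: list, hyp: list) -> dict:
    """Count substitutions, deletions and insertions along one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = int(ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    counts = {"sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        diag_cost = int(ref[i - 1] != hyp[j - 1]) if i > 0 and j > 0 else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + diag_cost:
            counts["sub"] += diag_cost  # 0 for a match, 1 for a substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["del"] += 1          # reference word with no counterpart
            i -= 1
        else:
            counts["ins"] += 1          # hypothesis word with no counterpart
            j -= 1
    return counts
```

Summing these counts over a dataset and dividing each by the total gives percentages like those in the table.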

The Degradation Problem

How much does each provider degrade from clean speech to real-world audio?

[Figure: WER Degradation]

| Provider | LibriSpeech (clean) | EN-IN (accented) | Degradation | Soccer (noisy) | Degradation | Court India (legal) | Degradation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Whissle | 8.24% | 40.67% | 4.9x | 53.89% | 6.5x | 53.50% | 6.5x |
| Deepgram | 4.16% | 29.83% | 7.2x | 43.79% | 10.5x | 45.50% | 10.9x |
| AssemblyAI | 11.11% | 39.45% | 3.6x | 45.62% | 4.1x | 47.98% | 4.3x |

Deepgram has the best absolute numbers everywhere -- but it also degrades the most dramatically. Its 4.16% on LibriSpeech balloons 10.9x to 45.50% on court proceedings. This suggests its LibriSpeech excellence may partly reflect optimization for clean read speech rather than generalized robustness.
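
The degradation factors are each dataset's corpus WER divided by the same provider's LibriSpeech corpus WER. A short sketch using the figures from the table (the dictionary layout and function name are ours):

```python
CORPUS_WER = {
    "Whissle":    {"LibriSpeech": 8.24,  "EN-IN": 40.67, "Soccer": 53.89, "Court India": 53.50},
    "Deepgram":   {"LibriSpeech": 4.16,  "EN-IN": 29.83, "Soccer": 43.79, "Court India": 45.50},
    "AssemblyAI": {"LibriSpeech": 11.11, "EN-IN": 39.45, "Soccer": 45.62, "Court India": 47.98},
}


def degradation(wer_by_dataset: dict, baseline: str = "LibriSpeech") -> dict:
    """Ratio of each dataset's WER to the clean-speech baseline, rounded to one decimal."""
    base = wer_by_dataset[baseline]
    return {name: round(value / base, 1)
            for name, value in wer_by_dataset.items() if name != baseline}
```

A low ratio is not automatically good news: AssemblyAI degrades least in relative terms largely because its clean-speech baseline is the highest of the three.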

This is the central insight of this benchmark: LibriSpeech performance is a poor predictor of real-world performance. The provider with the best clean-speech score may not maintain that lead on your actual audio.


The WER Distribution Story

Mean WER tells you the average. But the distribution tells you what to actually expect.

WER by Duration (LibriSpeech)

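[Figure: WER by Duration]

| Duration | Whissle | Deepgram | AssemblyAI | Samples |
| --- | --- | --- | --- | --- |
| Short (< 5s) | 12.42% | 5.82% | 15.46% | 1,089 |
| Medium (5--10s) | 7.81% | 4.19% | 10.23% | 917 |
| Long (> 10s) | 7.93% | 4.01% | 11.39% | 614 |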

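All providers do better on utterances longer than 5 seconds than on short ones, likely because longer audio gives more context for language model rescoring. Only Deepgram keeps improving from medium to long utterances; Whissle and AssemblyAI are slightly worse on long than on medium. Deepgram's advantage is consistent across all duration buckets.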

WER Brackets

LibriSpeech -- what percentage of samples fall into each quality bracket?

| Bracket | Whissle | Deepgram | AssemblyAI* |
| --- | --- | --- | --- |
| Perfect (0% WER) | 901 (34.4%) | 1,532 (58.5%) | 446 (27.3%) |
| Excellent (≤ 10%) | 1,723 (65.8%) | 2,235 (85.3%) | 1,046 (64.1%) |
| Good (≤ 20%) | 2,277 (86.9%) | 2,485 (94.8%) | 1,351 (82.7%) |
| Poor (> 50%) | 40 (1.5%) | 14 (0.5%) | 84 (5.1%) |
| Total failure (≥ 100%) | 9 (0.3%) | 8 (0.3%) | 11 (0.7%) |

*AssemblyAI percentages based on 1,633 successful samples only (987 failed)

EN-IN Tech Interviews -- where things get uncomfortable for everyone:

| Bracket | Whissle | Deepgram | AssemblyAI* |
| --- | --- | --- | --- |
| Perfect (0% WER) | 13 (1.1%) | 62 (5.2%) | 31 (3.3%) |
| Good (≤ 20%) | 156 (13.0%) | 471 (39.1%) | 250 (26.3%) |
| Poor (> 50%) | 487 (40.4%) | 381 (31.7%) | 394 (41.4%) |
| Total failure (≥ 100%) | 247 (20.5%) | 272 (22.6%) | 194 (20.4%) |

*AssemblyAI percentages based on 951 successful samples only (253 failed)

On accented speech, even the best provider gets fewer than 4 in 10 samples to "good" accuracy. This isn't a provider problem -- it's an industry problem. Accented ASR is far from solved.

Soccer Broadcast: Zero samples achieved perfect transcription for any provider. Not a single 30-second broadcast segment was transcribed without at least one error.


Streaming vs. Batch: Does Pre-Recorded Mode Help?

[Figure: Streaming vs Batch WER Comparison]

Streaming ASR processes audio in real time as it arrives. Batch (pre-recorded) ASR processes the entire audio file at once, giving the engine access to full context before producing output. Does this make a difference?

We ran identical audio through both streaming and batch APIs for Whissle and Deepgram across four languages (English, Spanish, German, and Hindi; 100 samples each, 1,000 for Hindi). For Whissle batch, we tested both greedy decoding and KenLM language model rescoring.

Streaming vs. Batch WER (%) -- lower is better:

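| Language | Whissle Stream | Whissle Batch | Whissle Batch+LM | Deepgram Stream | Deepgram Batch |
| --- | --- | --- | --- | --- | --- |
| English | 6.21% | 5.63% | 5.31% | 4.14% | 3.48% |
| Spanish | 33.81% | 27.70% | 26.54% | 15.93% | 12.92% |
| German | 30.03% | 28.73% | 26.09% | 22.96% | 18.95% |
| Hindi | 52.86% | 48.76% | 45.15% | 42.84% | 37.91% |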

Key observations:

  • Batch consistently improves WER for both providers. Having the full audio context before decoding helps -- expected, since batch engines can look ahead and resolve ambiguities.
  • Whissle's LM rescoring adds a further 0.3--3.6 percentage points of absolute improvement on top of batch greedy decoding (smallest for English, largest for Hindi). The KenLM n-gram model catches spelling and grammar errors that pure CTC decoding misses.
  • Deepgram's batch mode narrows the gap -- in English, Deepgram batch (3.48%) approaches Gemini 2.0 Flash (3.12%), the accuracy leader.
  • Streaming is the harder problem. Every provider does better in batch mode. The question is whether your use case can wait for the full audio -- if you're building real-time applications (live captioning, call coaching, voice agents), streaming is not optional.

Batch mode is the right choice for post-call analytics, media transcription, and any workflow where latency doesn't matter. But for live applications, streaming accuracy is what counts -- and that's where the architectural differences between CTC-based streaming (Whissle) and proprietary streaming (Deepgram) become the deciding factor.


Beyond Transcription: Streaming Metadata

Here's where the comparison shifts from "who has the lowest WER" to "who gives you the most useful output."

All providers return transcripts with word timestamps and confidence scores. But the metadata landscape varies significantly -- and the delivery mode matters as much as the feature list:

| Feature | Whissle | Deepgram | AssemblyAI | Gemini 2.0 Flash |
| --- | --- | --- | --- | --- |
| Transcription | Yes | Yes | Yes | Yes |
| Word timestamps | Yes | Yes | Yes | -- |
| Word confidence | Yes | Yes | Yes | -- |
| Interim/partial results | Yes | Yes | Yes | -- |
| Emotion detection | Yes (real-time, 7 classes) | Batch only (Sentiment) | Batch only (Sentiment) | Batch only (via prompt) |
| Intent classification | Yes (real-time, 6 classes) | Batch only | -- | Batch only (via prompt) |
| Speaker change detection | Yes (in-stream) | -- | -- | -- |
| Speaker demographics | Yes (age, gender) | -- | -- | -- |
| Speech rate (WPM) | Yes | -- | -- | -- |
| Filler word detection | Yes | -- | -- | -- |
| Emotion probabilities | Yes (per-segment) | -- | -- | -- |
| Intent probabilities | Yes (per-segment) | -- | -- | -- |
| Speaker diarization | -- | Add-on (streaming) | Batch only | -- |
| Topic detection | -- | Batch only | Batch only | Batch only (via prompt) |
| Entity detection | Via in-stream tokens | Batch only | Batch only | Batch only (via prompt) |
| Summarization | -- | Batch only (EN) | Batch only | Batch only (via prompt) |
| Delivery mode | Streaming WebSocket | Streaming + batch add-ons | Streaming + batch add-ons | Batch API only |
| Metadata latency | ~200ms (included) | 886--1,233ms (separate call) | Separate batch call | 1,819--2,219ms (separate call) |

Deepgram and AssemblyAI offer sentiment, intent, and entity features in their batch/pre-recorded APIs, but in streaming mode these are not available. Gemini 2.0 Flash can extract any metadata via prompting -- but only in batch mode, with 1.8--2.2 second latency per utterance. Deepgram's streaming add-on for speaker diarization is a notable exception.

Whissle streams all of the following in real time -- as part of the same WebSocket connection, with zero additional latency:

  • Emotion detection -- per-segment classification across 7 emotions with full probability distributions
  • Intent classification -- 6 intent classes (inform, question, command, describe, etc.) with probabilities
  • Speaker change detection -- in-stream tokens marking when the speaker changes
  • Speech rate -- words per minute, measured in real time
  • Filler word detection -- counts of "um", "uh", "like", "you know"
  • Speaker demographics -- age range and gender classification

Metadata Coverage

| Dataset | Samples | Emotion Detected | Intent Detected | WPM Measured |
| --- | --- | --- | --- | --- |
| LibriSpeech | 2,620 | 2,620 (100%) | 2,620 (100%) | 2,618 (99.9%) |
| EN-IN Tech | 1,204 | 1,088 (90.4%) | 1,088 (90.4%) | 1,060 (88.0%) |
| Soccer | 327 | 327 (100%) | 327 (100%) | 327 (100%) |
| Court India | 764 | 760 (99.5%) | 760 (99.5%) | 756 (99.0%) |

Metadata Evaluation: Emotion Detection

The soccer broadcast dataset includes human-annotated ground truth for emotion, intent, age, and gender -- allowing us to evaluate Whissle's metadata predictions against labels.

Emotion distribution across datasets:

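| Emotion | LibriSpeech | EN-IN Tech | Soccer | Court India |
| --- | --- | --- | --- | --- |
| Neutral | 2,173 (82.9%) | 1,082 (99.4%) | 326 (99.7%) | 276 (36.3%) |
| Angry | 216 (8.2%) | 2 (0.2%) | -- | 277 (36.4%) |
| Happy | 133 (5.1%) | 1 (0.1%) | 1 (0.3%) | 207 (27.2%) |
| Disgust | 62 (2.4%) | 1 (0.1%) | -- | -- |
| Surprise | 16 (0.6%) | 1 (0.1%) | -- | -- |
| Sad | 15 (0.6%) | 1 (0.1%) | -- | -- |
| Fear | 5 (0.2%) | -- | -- | -- |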

LibriSpeech (audiobook narration) shows emotional variety -- professional narrators use distinct vocal tones for different characters and dramatic moments, which the model picks up. Tech interviews are almost entirely neutral, as expected for professional conversations. Soccer broadcast commentary is acoustically neutral even when describing exciting moments.

The Supreme Court dataset shows the most interesting emotion distribution: roughly equal parts angry (36.4%), neutral (36.3%), and happy (27.2%). This likely reflects the adversarial nature of courtroom proceedings -- advocates argue forcefully (detected as angry), judges maintain composure (neutral), and favorable rulings or successful arguments produce positive vocal tones (happy). Unlike soccer commentary, where acoustic emotion is muted, courtroom speech appears to carry genuine vocal intensity.

Ground truth comparison (Soccer Broadcast, 327 samples):

| Ground Truth Emotion | Samples | Whissle Correct | Recall |
| --- | --- | --- | --- |
| Neutral | 202 | 202 | 100.0% |
| Happy | 66 | 1 | 1.5% |
| Sad | 59 | 0 | 0.0% |
| Overall accuracy | 327 | 203 | 62.1% |

An important nuance: the ground truth labels in the soccer dataset annotate contextual emotion (happy when a goal is scored, sad when a chance is missed), while Whissle detects acoustic emotion (the actual tone of voice). Professional broadcast commentators maintain a relatively even vocal tone regardless of context -- a commentator describing a goal doesn't necessarily sound acoustically "happy" in the way a person celebrating would. The 100% recall on neutral is consistent with the model reading an acoustically even signal, though because it labeled 326 of the 327 segments neutral, that figure alone cannot confirm it.
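
The recall and accuracy figures can be recomputed from (ground truth, prediction) pairs. The sketch below rebuilds those pairs from the counts in this section; it relies on the Soccer column of the emotion distribution table (326 neutral and 1 happy prediction), which implies that every missed happy or sad segment was predicted neutral. Function names are ours.

```python
from collections import Counter


def per_class_recall(pairs: list) -> dict:
    """Recall per ground-truth label from (truth, predicted) pairs."""
    totals = Counter(truth for truth, _ in pairs)
    hits = Counter(truth for truth, pred in pairs if truth == pred)
    return {label: hits[label] / totals[label] for label in totals}


def accuracy(pairs: list) -> float:
    return sum(truth == pred for truth, pred in pairs) / len(pairs)


# Reconstructed from the tables: 202 neutral hits, 1 happy hit,
# and the remaining happy/sad segments predicted as neutral.
pairs = ([("neutral", "neutral")] * 202
         + [("happy", "happy")] * 1
         + [("happy", "neutral")] * 65
         + [("sad", "neutral")] * 59)

recall = per_class_recall(pairs)
overall = accuracy(pairs)
```

With these counts, recall is 100% for neutral, about 1.5% for happy and 0% for sad, and overall accuracy is about 62.1%.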

Metadata Evaluation: Intent Classification

Intent distribution across datasets (EN-IN percentages based on 1,088 samples with detected intent; 116 very short segments returned no intent):

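| Intent | LibriSpeech | EN-IN Tech | Soccer | Court India |
| --- | --- | --- | --- | --- |
| Inform | 2,337 (89.2%) | 1,054 (96.9%) | 326 (99.7%) | 752 (98.9%) |
| Question | 151 (5.8%) | 28 (2.6%) | 1 (0.3%) | 6 (0.8%) |
| Describe | 67 (2.6%) | -- | -- | -- |
| Command | 29 (1.1%) | 4 (0.4%) | -- | 1 (0.1%) |
| Other | 36 (1.4%) | 2 (0.2%) | -- | 1 (0.1%) |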

The soccer dataset has domain-specific ground truth intents (Play-by-play, Analysis, Statistic, Reaction). Whissle uses a general-purpose intent taxonomy. When we map the domain-specific labels to their general equivalents (Play-by-play/Analysis/Statistic map to Inform), the alignment accuracy is 99.7% (326/327) -- the model correctly identifies that commentary is fundamentally informational speech.

LibriSpeech shows more intent variety because audiobook narration includes dialogue with questions (5.8%), descriptive passages (2.6%), and imperative speech (1.1%). Tech interviews are overwhelmingly informational with occasional questions -- consistent with interview dynamics. Court proceedings are similarly dominated by informational speech (98.9%) -- advocates presenting arguments and judges delivering rulings -- with occasional questions from the bench (0.8%).

Speech Rate Analysis

[Figure: Speech Rate Distribution]

| Dataset | Mean WPM | Median WPM | Range |
| --- | --- | --- | --- |
| LibriSpeech (read speech) | 168.2 | 166.9 | 28.4 -- 284.0 |
| EN-IN Tech (conversational) | 115.2 | 114.3 | 25.0 -- 400.0 |
| Soccer (broadcast) | 149.7 | 150.0 | 83.0 -- 212.5 |
| Court India (legal) | 148.1 | 146.6 | 59.8 -- 252.2 |

LibriSpeech shows the highest speech rate (professional narrators reading at a steady pace). Tech interviews are slower (thinking pauses, technical explanations). Soccer commentary and court proceedings sit in between at similar median rates (~150 WPM), though court speech shows a wider range -- from measured judicial deliberation to rapid-fire advocacy.

Why Metadata Changes Everything

If you're building live captioning, a transcription API is enough. But if you're building anything that needs to understand the conversation -- not just record it -- you need more than words.

1. Detect escalations before they happen. Real-time emotion detection flags when a customer's tone shifts from neutral to frustrated -- before they ask for a manager. Intent classification distinguishes a genuine question from a complaint. In a call center, this means routing to a supervisor in seconds, not minutes.

2. Coach speakers in real time. Speech rate tracking tells presenters they're rushing. Filler word counts ("um", "uh", "like") show job candidates exactly what to work on. All streamed live during the conversation, not buried in a post-call report.

3. Understand conversations, not just words. Speaker change detection identifies turn-taking patterns. Demographics reveal who's dominating a meeting. Intent classification distinguishes questions from statements from commands. These signals transform a flat transcript into a structured understanding of what actually happened.

4. Ship with confidence. Near-zero failures across 4,915 samples means virtually every audio segment returns a transcript, and metadata coverage ranged from 88% to 100% depending on the dataset and signal. Silent drop-outs are rare, so fallback logic can stay minimal. Getting all of this in a single streaming WebSocket connection, with zero additional latency at the same per-minute price, is a fundamentally different value proposition.

Metadata Extraction: The Hidden Latency Cost

[Figure: Metadata Extraction Latency Comparison]

Competitors offer metadata features -- but as separate batch API calls that add latency and cost on top of transcription. We benchmarked the real cost of getting metadata from each provider across four languages (100 samples per language).

Metadata Feature Comparison

What each provider returns with metadata enabled:

| Feature | Whissle (streaming) | Gemini 2.0 Flash (batch) | Deepgram Nova-3 (batch) |
| --- | --- | --- | --- |
| Transcription | Yes | Yes | Yes |
| Emotion (7 classes) | Yes (real-time) | Yes (batch)* | -- |
| Intent (6 classes) | Yes (real-time) | Yes (batch)* | Yes (batch) |
| Sentiment | Yes (real-time) | Yes (batch)* | Yes (batch) |
| Topics | -- | Yes (batch)* | Yes (batch) |
| Named entities | Via in-stream tokens | Yes (batch)* | Yes (batch) |
| Summarization | -- | -- | Yes (batch, EN only) |
| Emotion probabilities | Yes (per-segment) | -- | -- |
| Intent probabilities | Yes (per-segment) | -- | -- |
| Speech rate (WPM) | Yes (real-time) | -- | -- |
| Filler detection | Yes (real-time) | -- | -- |
| Speaker demographics | Yes (real-time) | -- | -- |

* Gemini metadata is extracted via LLM prompt -- accurate but batch-only with variable latency.

Transcription accuracy with metadata enabled (WER %):

| Language | Deepgram streaming | Gemini+Meta (batch) | Deepgram+Meta (batch) |
| --- | --- | --- | --- |
| English | 4.14% | 3.72% | 2.86% |
| Spanish | 15.09% | 12.48% | 13.26% |
| German | 23.01% | 17.77% | 18.38% |
| Hindi | 37.17% | 23.52% | 27.11% |

Median latency to get transcript + metadata (ms):

| Language | Whissle (streaming) | Gemini+Meta (batch) | Deepgram+Meta (batch) |
| --- | --- | --- | --- |
| English | ~200ms | 1,819ms | 1,068ms |
| Spanish | ~200ms | 2,193ms | 887ms |
| German | ~200ms | 2,219ms | 886ms |
| Hindi | ~200ms | 1,906ms | 1,233ms |

Whissle's streaming metadata arrives at ~200ms as part of the same WebSocket connection -- before competitors even finish their batch API calls. Deepgram's batch metadata API takes 886--1,233ms per utterance. Gemini's LLM-based extraction takes 1,819--2,219ms. Both require a separate API call after transcription, meaning the total pipeline latency is transcription time + metadata extraction time.

For a streaming use case processing 10 utterances per minute, the metadata overhead alone adds:

  • Whissle: 0ms additional (included in streaming response)
  • Deepgram batch meta: ~10s of additional API calls per minute
  • Gemini meta: ~20s of additional API calls per minute
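
The per-minute overhead above is just median latency per call times utterances per minute. A tiny sketch of that arithmetic, using the fastest and slowest medians from the latency table (the function name is ours):

```python
def overhead_seconds_per_minute(latency_ms: float, utterances_per_minute: int = 10) -> float:
    """Total time spent in separate metadata calls for one minute of conversation."""
    return latency_ms * utterances_per_minute / 1000


deepgram_range = (overhead_seconds_per_minute(886), overhead_seconds_per_minute(1233))
gemini_range = (overhead_seconds_per_minute(1819), overhead_seconds_per_minute(2219))
```

That works out to roughly 9--12 seconds per minute for Deepgram's batch metadata and 18--22 seconds for Gemini, assuming the calls run one after another rather than in parallel.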

Label taxonomy alignment: Both Whissle and Gemini use the same emotion labels (EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE, EMOTION_DISGUST) and intent labels (INTENT_INFORM, INTENT_QUESTION, INTENT_COMMAND, INTENT_AFFIRM, INTENT_OTHER). Deepgram's batch API returns sentiment (positive/negative/neutral) rather than fine-grained emotion classes -- a coarser signal that conflates emotion with opinion polarity.

The takeaway: Gemini and Deepgram can extract metadata, but only after the audio is fully processed, only via batch APIs, and at significant latency cost. Whissle delivers richer metadata -- including emotion probabilities, intent probabilities, speech rate, and filler detection -- in real time, as part of the same streaming connection, with zero additional latency or cost.


Confidence Scores: Who's Honest About Uncertainty?

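[Figure: Confidence Scores]

| Dataset | Whissle | Deepgram | AssemblyAI |
| --- | --- | --- | --- |
| LibriSpeech | 0.949 | 0.987 | 0.868 |
| EN-IN Tech | 0.757 | 0.903 | 0.809 |
| Soccer | 0.820 | 0.934 | 0.842 |
| Court India | 0.785 | 0.941 | 0.837 |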

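Deepgram's confidence is the highest of the three, in line with its best-in-test WER. It does stay above 0.90 on every dataset, though, including those where its corpus WER is 30--46%, so the absolute values run optimistic on hard audio.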

Whissle's confidence drops sharply from 0.949 on clean speech to 0.757 on accented speech -- a 19-point drop that honestly reflects the increased difficulty. If you're building a system that needs to know when to trust the transcript less, Whissle's confidence is informative.

AssemblyAI's confidence stays between 0.81 and 0.87 across all datasets -- even though its WER is the worst on LibriSpeech and it has a 26% failure rate. The confidence doesn't drop proportionally to accuracy.

A confidence score is only useful if low confidence actually predicts errors. Whissle's confidence is honest -- it goes down when things get hard. AssemblyAI's confidence stays high regardless.
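
One way to check whether low confidence actually predicts errors on your own data is to bucket words by confidence and compare the error rate in each bucket. The sketch below is a generic calibration check, not the analysis behind the table above; the bucket edges and the sample data are hypothetical.

```python
from bisect import bisect_right


def error_rate_by_bucket(words: list, edges: tuple = (0.6, 0.8, 0.9)) -> list:
    """words: (confidence, is_correct) pairs. Returns the error rate per confidence bucket.

    Buckets are [0, 0.6), [0.6, 0.8), [0.8, 0.9), [0.9, 1.0] for the default edges.
    A bucket with no words yields None.
    """
    totals = [0] * (len(edges) + 1)
    errors = [0] * (len(edges) + 1)
    for confidence, is_correct in words:
        bucket = bisect_right(edges, confidence)
        totals[bucket] += 1
        errors[bucket] += int(not is_correct)
    return [e / t if t else None for e, t in zip(errors, totals)]


# Hypothetical per-word results, for illustration only.
sample = [(0.55, False), (0.58, False), (0.70, False), (0.75, True),
          (0.85, True), (0.88, False), (0.95, True), (0.99, True)]
rates = error_rate_by_bucket(sample)
```

If the error rate falls steadily as confidence rises, the score is doing useful work; a flat profile means it cannot be used to decide when to distrust the transcript.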

Pricing and Deployment

| | Whissle | Deepgram | AssemblyAI | Gemini 2.0 Flash |
| --- | --- | --- | --- | --- |
| Cost per audio minute | $0.0043 | $0.0043 | $0.0050 | ~$0.001* |
| Cost per audio hour | $0.258 | $0.258 | $0.300 | ~$0.06* |
| What's included | Transcription + emotion, intent, speaker change, WPM, fillers, demographics | Transcription only (metadata features are add-ons) | Transcription only | Transcription + metadata via prompt |
| Streaming support | Yes (WebSocket) | Yes (WebSocket) | Yes (WebSocket) | No (batch only) |
| Deployment options | Cloud API + self-hosted Docker | Cloud API only | Cloud API only | Cloud API only |
| Self-hosted option | Yes (eliminates per-minute costs) | No | No | No |

* Gemini pricing is per-token, not per audio minute. Cost varies by audio length and prompt. Estimate based on typical 10-second utterances.

Whissle matches Deepgram's base streaming transcription price at $0.0043 per audio minute -- but at that price, you get all metadata (emotion, intent, speaker change, speech rate, filler detection, demographics) included. With Deepgram, metadata features are paid add-ons on top of the base transcription cost. Gemini is the cheapest per-minute option, but the lack of streaming support makes it unsuitable for real-time applications.

For high-volume use cases, Whissle's self-hosted Docker deployment eliminates per-minute costs entirely -- critical for call centers, media processing, and organizations with data residency requirements.


Who Wins What?

[Figure: Scorecard]

Choose Deepgram when:

  • Transcription accuracy is your number one priority
  • Your audio is primarily clean, well-recorded English
  • You're comfortable with higher first-segment latency (2.4--4.6s in our tests)
  • You need the best WER at competitive per-minute pricing

Choose Whissle when:

  • You need the fastest first-segment response (0.8--1.7s) for real-time coaching, live captions, or conversational AI
  • You need streaming emotion, intent, speech rate, or filler detection alongside transcription
  • Near-zero failure rates are non-negotiable (medical, legal, safety, accessibility)
  • You want to self-host ASR on your own infrastructure
  • You're building beyond transcription -- coaching, analytics, meeting intelligence

Be cautious with AssemblyAI streaming:

  • The V3 streaming API had a 25.6% failure rate across our benchmark
  • When it works, accuracy is competitive -- but the failure rate makes it unreliable for production use
  • The deletion-dominant error profile suggests systemic issues with the streaming pipeline

The Bigger Picture

Clean speech ASR is effectively solved. At 4--12% WER on LibriSpeech, we're in the range of human disagreement. Providers arguing over percentage points on clean speech are fighting the last war.

Real-world ASR is far from solved. The jump from 4% to 46% WER when you add accents, noise, and domain-specific vocabulary is not a small gap -- it's an 11x degradation. This is the frontier.

LibriSpeech benchmarks are misleading in isolation. A provider that's 2x better on LibriSpeech might be only 1.3x better on your actual audio -- or worse if their model is overfit to clean speech.

Reliability has to be measured, not assumed. One of three major providers failed on 26% of samples. Your test suite needs to measure failure rate, not just WER on the samples that succeed.

Batch metadata adds latency, not intelligence. Deepgram and Gemini can extract sentiment, intent, and entities -- but only through separate batch API calls that add 0.9--2.2 seconds per utterance. For a system processing 10 utterances per minute, that's 9--22 seconds of metadata overhead per minute. Whissle delivers richer metadata (emotion with probabilities, intent with probabilities, speech rate, filler detection, demographics) at zero additional latency, in the same streaming response.

Transcription is table stakes. The next wave of speech AI isn't about squeezing 0.5% more WER -- it's about what else you can extract from audio in real time. Emotion, intent, speech patterns, engagement signals. The provider that delivers these alongside transcription, without additional latency or cost, changes the application design space.


Cross-Dataset Summary

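| Dataset | Samples | Audio | Whissle WER (mean) | Deepgram WER (mean) | AssemblyAI WER (mean) | AAI Failures |
| --- | --- | --- | --- | --- | --- | --- |
| LibriSpeech test-clean | 2,620 | 5.40h | 9.75% | 4.83% | 12.55% | 987 (37.7%) |
| EN-IN Tech Interviews | 1,204 | 5.27h | 62.94% | 43.62% | 51.25% | 253 (21.0%) |
| Soccer Broadcast | 327 | 2.73h | 73.74% | 64.99% | 64.18% | 17 (5.2%) |
| Supreme Court India | 764 | 4.24h | 70.48% | 67.44% | 61.31% | 2 (0.3%) |
| Total | 4,915 | 17.64h | -- | -- | -- | 1,259 (25.6%) |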

Methodology Appendix

Benchmark script: Open source, available in our repository. Uses HuggingFace datasets for LibriSpeech and EN-IN Tech, JSONL manifests for Soccer. Full per-sample JSON results published for independent verification.

Text normalization: Lowercase, strip all punctuation, collapse whitespace. For WhissleAI datasets, metadata tokens (AGE_*, GENDER_*, EMOTION_*, INTENT_*) and entity markers (ENTITY_*...END) stripped from reference text before comparison.
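
A possible sketch of the annotation stripping, to run before lowercasing. The exact marker layout is an assumption: we take `ENTITY_<TYPE> ... END` to wrap spoken words that should be kept, with only the markers removed, and the example tokens (`ENTITY_PLAYER_NAME`, `AGE_45_60`, `GENDER_MALE`) are illustrative rather than copied from the datasets.

```python
import re

# Standalone metadata tokens such as EMOTION_HAPPY or INTENT_INFORM.
META_TOKEN = re.compile(r"\b(?:AGE|GENDER|EMOTION|INTENT)_\w+\b")
# Opening entity marker, e.g. ENTITY_PLAYER_NAME (hypothetical type name).
ENTITY_OPEN = re.compile(r"\bENTITY_\w+\b")
# Closing entity marker; assumes an all-caps END never occurs as a spoken word.
ENTITY_CLOSE = re.compile(r"\bEND\b")


def strip_annotations(reference: str) -> str:
    """Remove metadata tokens and entity markers, keeping the words inside entities."""
    text = META_TOKEN.sub(" ", reference)
    text = ENTITY_OPEN.sub(" ", text)
    text = ENTITY_CLOSE.sub(" ", text)
    return " ".join(text.split())
```

Because the patterns are case-sensitive, this step has to run on the raw reference text; after lowercasing, the markers would no longer match.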

WER computation: jiwer library (standard implementation). Per-sample WER averaged for mean; total errors / total words for corpus WER.

Corpus WER vs Mean WER: Mean WER averages per-sample rates -- a 3-word utterance with 1 error scores 33% and weighs equally with a 100-word utterance. Corpus WER divides total errors by total words, giving proportional weight. We report both; corpus WER is generally more representative.
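
A small worked example of the difference, using the 3-word and 100-word utterances mentioned above. The 5 errors assigned to the long utterance are a made-up figure for illustration.

```python
def mean_wer(samples: list) -> float:
    """Average of per-sample WER; samples are (errors, reference_words) pairs."""
    return sum(errors / words for errors, words in samples) / len(samples)


def corpus_wer(samples: list) -> float:
    """Total errors divided by total reference words."""
    return sum(errors for errors, _ in samples) / sum(words for _, words in samples)


# One 3-word utterance with 1 error, one 100-word utterance with 5 errors (hypothetical).
samples = [(1, 3), (5, 100)]
mean_value = mean_wer(samples)      # about 0.192
corpus_value = corpus_wer(samples)  # about 0.058
```

The short utterance pulls the mean up to roughly 19%, while the corpus figure stays near 6% because each word carries equal weight.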

Infrastructure: Whissle on Cloud Run (CPU, ONNX runtime). Deepgram and AssemblyAI via their cloud APIs. All requests from the same machine. April 2026.

Reproducibility: All result JSONs (per-sample transcripts, WER, latency, metadata) are published. Every number in this post can be independently verified.


We built this benchmark because we believe the ASR industry needs more transparent, multi-dataset evaluations. If you want to run these tests yourself or add your own datasets, the tooling is open source.

If you're exploring ASR for your application, try the providers on your actual audio before making a decision. Benchmarks inform -- but your data decides.
