We Benchmarked 3 Streaming ASR Providers Across 17 Hours of Audio. Here's What We Found.

By Whissle Research Team

Apr 10 2026

4,915 samples. Four datasets. Clean speech, Indian-accented tech interviews, noisy soccer broadcasts, and Indian Supreme Court proceedings. A transparent, warts-and-all comparison of Whissle, Deepgram, and AssemblyAI.


Most ASR benchmarks test on LibriSpeech and call it a day. The provider with the lowest number on clean, studio-recorded audiobook narration wins, and everyone moves on.

But that's not the audio your application deals with.

Your audio has accents. Background noise. People talking fast, interrupting each other, mumbling through technical jargon. The question isn't "which ASR is best on clean speech?" -- it's "which ASR holds up when conditions get real?"

We set out to answer that question. We benchmarked three streaming ASR providers -- Whissle, Deepgram (Nova-2), and AssemblyAI (Universal Streaming) -- across four increasingly difficult datasets. We measured accuracy, latency, reliability, and real-time metadata capabilities. We ran all three providers in parallel on the same audio to ensure a fair comparison.

Here's everything we found.


How We Tested

All three providers were tested using real-time WebSocket streaming -- the exact protocol you'd use in production. Audio was streamed in 100ms chunks at 1x real-time speed, simulating a live microphone.
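The chunking arithmetic works out to 3,200 bytes per chunk: 16,000 samples/s x 2 bytes x 0.1s. A minimal sketch of the streaming loop (the function name and `realtime` flag are ours, not any provider's SDK):

```python
import time

SAMPLE_RATE = 16_000      # Hz, mono
BYTES_PER_SAMPLE = 2      # PCM int16
CHUNK_MS = 100
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3,200 bytes

def stream_chunks(pcm: bytes, realtime: bool = True):
    """Yield 100ms chunks, sleeping between them to simulate a live microphone."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield pcm[i:i + CHUNK_BYTES]
        if realtime:
            time.sleep(CHUNK_MS / 1000)

# One second of silence -> ten 100ms chunks
chunks = list(stream_chunks(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE, realtime=False))
```

In the benchmark, each chunk would be sent over the provider's WebSocket as a binary frame; with `realtime=True` the loop paces delivery at 1x speed.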

| Provider | Model | Streaming Endpoint |
|---|---|---|
| Whissle | Default (ONNX runtime) | wss://api.whissle.ai/asr/stream |
| Deepgram | Nova-2 | wss://api.deepgram.com/v1/listen |
| AssemblyAI | Universal Streaming | wss://streaming.assemblyai.com/v3/ws |

Fair comparison measures:

  • All providers received identical audio (PCM int16, mono, 16kHz)
  • All three ran concurrently on each sample -- same network conditions
  • Text normalized before WER: lowercased, punctuation stripped, whitespace collapsed
  • WER computed using the standard jiwer library
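The benchmark itself used jiwer, but the normalization and WER formula are simple enough to sketch in pure Python (this is an illustrative reimplementation, not the benchmark script):

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace -- the benchmark's rules
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    # (insertions + deletions + substitutions) / reference words, via edit distance
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

wer("The cat sat.", "the cat sat")   # 0.0 after normalization
wer("the cat sat", "the cap sat")    # 1 substitution / 3 words = 0.333...
```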

Key metrics:

| Metric | What It Measures |
|---|---|
| WER | Word Error Rate -- (Insertions + Deletions + Substitutions) / Reference Words |
| Corpus WER | Total errors / total reference words (avoids inflation from short utterances) |
| CER | Character Error Rate -- same formula at character level |
| Time to first segment | Milliseconds from audio stream start to first non-empty transcript |
| Failure rate | Percentage of samples that returned no usable transcript |

The Four Datasets

We deliberately chose datasets that span the difficulty spectrum -- one industry standard, three that represent real-world conditions.

Dataset 1: LibriSpeech test-clean

The industry baseline. Clean, clear, studio-quality audiobook narration.

| Property | Value |
|---|---|
| Source | openslr/librispeech_asr (HuggingFace) |
| Samples | 2,620 |
| Total audio | 5.4 hours (19,453 seconds) |
| Duration range | 1.3s -- 35.0s (median 5.8s) |
| Language | English (US) |
| Domain | Read audiobook speech |
| Conditions | Studio-quality, single speaker, no background noise |

| Duration Bucket | Samples | Percentage |
|---|---|---|
| Short (< 5s) | 1,089 | 41.6% |
| Medium (5--10s) | 917 | 35.0% |
| Long (> 10s) | 614 | 23.4% |

Why include it: It's the universal reference point. Every ASR paper reports LibriSpeech results, making it the baseline for cross-provider comparison.

Why it's not enough: No accents, no noise, no domain-specific vocabulary, no overlapping speakers. LibriSpeech performance doesn't predict real-world performance.


Dataset 2: EN-IN Tech Interviews

Indian-accented English in technical interview settings. Where clean-speech benchmarks break down.

| Property | Value |
|---|---|
| Source | WhissleAI/Meta_STT_EN-IN_Tech_Interviews (HuggingFace) |
| Samples | 1,204 |
| Total audio | 5.3 hours (18,969 seconds) |
| Duration range | 1.1s -- 20.0s (median 20.0s) |
| Language | English with Indian accents (EN-IN) |
| Domain | Technical interview conversations |
| Conditions | Variable quality, phone/video call audio, technical vocabulary |

| Duration Bucket | Samples | Percentage |
|---|---|---|
| Short (< 5s) | 95 | 7.9% |
| Medium (5--10s) | 140 | 11.6% |
| Long (> 10s) | 969 | 80.5% |

What makes it hard:

  • Indian English accents. Retroflex consonants, vowel substitutions (/v/ for /w/), syllable-timed rhythm -- systematic phonetic differences from US/UK English.
  • Technical vocabulary. Framework names, API terminology, algorithm discussions.
  • Conversational patterns. Hesitations, false starts, filler words, incomplete sentences.
  • Long utterances. 80% of samples exceed 10 seconds -- errors compound over longer audio.
  • Variable recording quality. Phone/video call audio with varying mic quality and background noise.

Each sample includes ground-truth annotations for speaker age, gender, emotion, and intent -- stripped before WER calculation.


Dataset 3: Soccer Broadcast Commentary

Live match commentary with crowd noise, rapid speech, and proper nouns from a dozen nationalities. The hardest test.

| Property | Value |
|---|---|
| Source | Custom dataset from YouTube soccer match broadcasts |
| Samples | 327 |
| Total audio | 2.7 hours (9,810 seconds) |
| Segment duration | 30.0s each (uniform) |
| Language | English (UK) |
| Domain | Live soccer match commentary |
| Primary source | Manchester City v Manchester United -- 2024 Community Shield (326 segments) |

Rich ground-truth annotations embedded in reference text (stripped before WER):

| Annotation Type | Count | Examples |
|---|---|---|
| Player names | 379 | Fernandez, McAtee, Rashford |
| Team names | 139 | Manchester United, Arsenal |
| Game actions | 102 | shot, scored, corner, penalty |
| Referee decisions | 47 | penalty, offside |
| Competition names | 39 | Community Shield |
| Game timestamps | 36 | 90 minutes |
| Field positions | 30 | right hand side |
| Scores | 11 | 1-1, 7-6 |

| Ground-Truth Emotion | Samples | Ground-Truth Intent | Samples |
|---|---|---|---|
| Neutral | 202 (61.8%) | Analysis | 164 (50.2%) |
| Happy | 66 (20.2%) | Play-by-play | 114 (34.9%) |
| Sad | 59 (18.0%) | Statistic | 8 (2.4%) |

Speaker demographics: All male. 65% aged 60+, 35% aged 45--60.

What makes it the hardest:

  • Constant background noise -- crowd chanting, whistles, stadium atmosphere
  • Rapid speech -- commentators average ~150 WPM, spiking during action
  • Proper noun density -- 379 player name mentions across 327 segments
  • Multiple speakers -- two commentators frequently overlapping
  • Domain jargon -- offside, set piece, penalty shootout
  • 30-second segments -- long continuous audio with dense speech
  • Broadcast processing -- dynamic range compression, crowd noise mixing

Dataset 4: Supreme Court of India Proceedings

Indian-accented legal English in a reverberant courtroom. Multiple speakers, heavy domain vocabulary, and challenging acoustics.

| Property | Value |
|---|---|
| Source | WhissleAI/supreme-court-india-meta-speech (HuggingFace) |
| Samples | 764 |
| Total audio | 4.2 hours (15,280 seconds) |
| Segment duration | 20.0s each (uniform) |
| Language | English with Indian accents (EN-IN) |
| Domain | Supreme Court oral arguments and judicial proceedings |
| Conditions | Courtroom acoustics, multiple speakers, legal terminology, reverberant environment |

What makes it hard:

  • Indian English accents in formal legal speech. Judges, advocates, and amicus curiae speaking with varying regional accents -- phonetic patterns that trip up models trained primarily on US/UK English.
  • Dense legal vocabulary. Case citations, Latin legal terms, constitutional provisions, statutory references -- highly specialized language that rarely appears in general training data.
  • Reverberant courtroom acoustics. Large courtroom spaces produce echo and reverberation that degrades audio clarity.
  • Multiple speakers. Judges, advocates, and interveners -- with frequent interruptions and cross-talk during oral arguments.
  • Formal yet spontaneous speech. Legal arguments are partially prepared but delivered extemporaneously, with restarts, corrections, and hedging.

Results at a Glance

WER Comparison Across Datasets

Accuracy: Deepgram Wins on WER

LibriSpeech test-clean (clean read speech):

| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 9.75% | 4.83% | 12.55% | DG |
| WER (median) | 6.25% | 0.00% | 6.67% | DG |
| Corpus WER | 8.24% | 4.16% | 11.11% | DG |
| CER (mean) | 4.30% | 2.05% | 11.74% | DG |
| Word confidence | 0.9494 | 0.9868 | 0.8679 | DG |

EN-IN Tech Interviews (accented conversational speech):

| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 62.94% | 43.62% | 51.25% | DG |
| WER (median) | 42.86% | 28.17% | 41.51% | DG |
| Corpus WER | 40.67% | 29.83% | 39.45% | DG |
| CER (mean) | 44.13% | 36.46% | 45.62% | DG |

Soccer Broadcast (noisy commentary):

| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 73.74% | 64.99% | 64.18% | AAI |
| WER (median) | 50.00% | 36.25% | 40.35% | DG |
| Corpus WER | 53.89% | 43.79% | 45.62% | DG |
| CER (mean) | 59.69% | 56.17% | 57.58% | DG |

Supreme Court India (legal proceedings):

| Metric | Whissle | Deepgram | AssemblyAI | Best |
|---|---|---|---|---|
| WER (mean) | 70.48% | 67.44% | 61.31% | AAI |
| WER (median) | 53.19% | 42.86% | 47.37% | DG |
| Corpus WER | 53.50% | 45.50% | 47.98% | DG |
| CER (mean) | 57.12% | 61.33% | 55.54% | AAI |

Deepgram's Nova-2 model delivers the best corpus-level accuracy across all four datasets. On clean speech it's excellent (4.16% corpus WER). But even Deepgram drops to 30--46% corpus WER on challenging audio -- a 7--11x degradation from its LibriSpeech numbers. The Supreme Court dataset is the hardest test: all three providers exceed 45% corpus WER, reflecting the combined difficulty of Indian accents, legal vocabulary, and courtroom acoustics.
Corpus-Level WER

Latency: Whissle Is 2--3x Faster to First Words

How quickly does the first transcript appear after audio starts streaming? This matters enormously for live captioning, real-time coaching, and conversational AI.

Time to First Segment
| Dataset | Whissle | Deepgram | AssemblyAI | Whissle Advantage |
|---|---|---|---|---|
| LibriSpeech | 1,002ms | 3,382ms | 1,435ms | 3.4x faster than DG |
| EN-IN Tech | 1,681ms | 4,626ms | 2,318ms | 2.8x faster than DG |
| Soccer | 1,465ms | 3,153ms | 1,750ms | 2.2x faster than DG |
| Court India | 814ms | 2,423ms | 1,537ms | 3.0x faster than DG |

Whissle delivers the first transcript 2--3x faster than Deepgram across every dataset. When a user is watching live captions or an agent needs to respond in real time, that ~2-second head start is the difference between "responsive" and "laggy."

Deepgram appears to buffer more audio before emitting its first transcript -- a tradeoff that likely contributes to its higher accuracy. Whissle prioritizes responsiveness.

Latency percentiles (LibriSpeech):

Latency Percentiles
| Percentile | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| P50 (median) | 1,002ms | 3,382ms | 1,435ms |
| P95 | 2,375ms | 5,375ms | 2,090ms |
| P99 | 4,002ms | 5,534ms | 2,512ms |

Even at the tail (P99), Whissle stays under 4 seconds. Deepgram's P95 and P99 are tightly clustered around 5.4s, suggesting a consistent buffering strategy rather than variance.


Reliability: One Provider Has a Serious Problem

AssemblyAI Failure Rates
| Dataset | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| LibriSpeech (2,620) | 0 (0.0%) | 0 (0.0%) | 987 (37.7%) |
| EN-IN Tech (1,204) | 0 (0.0%) | 0 (0.0%) | 253 (21.0%) |
| Soccer (327) | 0 (0.0%) | 0 (0.0%) | 17 (5.2%) |
| Court India (764) | 4 (0.5%) | 2 (0.3%) | 2 (0.3%) |
| Total (4,915) | 4 (0.08%) | 2 (0.04%) | 1,259 (25.6%) |

AssemblyAI's V3 streaming API failed to return a transcript for one in four samples overall -- driven almost entirely by catastrophic failure rates on LibriSpeech (37.7%) and EN-IN Tech (21.0%). On the court dataset, all three providers had a handful of failures, but the numbers are negligible.

When we adjust AssemblyAI's WER to account for failures (treating each failed sample as 100% WER), its LibriSpeech number jumps from 12.55% to 45.49%. Its accuracy advantage on soccer disappears entirely.
Reported vs Adjusted WER

This is worth emphasizing: AssemblyAI's published WER numbers only reflect the samples that worked. For any production application, reliability is a prerequisite, not a feature.
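The failure adjustment is a weighted mean: keep the per-sample WER for successful samples and count each failed sample as 100% WER. A sketch, using 1,633 successes (2,620 minus 987 failures) from the LibriSpeech numbers above:

```python
def adjusted_mean_wer(mean_wer_on_successes: float, n_success: int, n_failed: int) -> float:
    """Recompute mean WER treating each failed sample as 100% WER."""
    total = n_success + n_failed
    return (mean_wer_on_successes * n_success + 1.0 * n_failed) / total

# AssemblyAI on LibriSpeech: 12.55% mean WER over 1,633 successes, 987 failures
adj = adjusted_mean_wer(0.1255, 1633, 987)  # ~0.4549, i.e. 45.49%
```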


Deep Dive: Where the Errors Happen

Raw WER numbers tell you how much a provider gets wrong. Error type analysis tells you how it gets things wrong -- and that pattern reveals something important about each provider's architecture.

Error Composition

Every transcription error is one of three types:

  • Substitution -- the wrong word ("cat" instead of "cap")
  • Deletion -- a word was dropped entirely
  • Insertion -- a word was hallucinated that wasn't spoken
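Libraries like jiwer report these counts directly; for illustration, here is how the three error types fall out of an edit-distance backtrace (a simplified pure-Python version, not the benchmark's implementation):

```python
def error_counts(ref_words: list, hyp_words: list) -> tuple:
    """Count (substitutions, deletions, insertions) via edit-distance backtrace."""
    R, H = len(ref_words), len(hyp_words)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    subs = dels = ins = 0
    i, j = R, H
    while i > 0 or j > 0:  # walk back through the table, classifying each step
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and ref_words[i - 1] == hyp_words[j - 1]:
            i, j = i - 1, j - 1          # match
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1  # substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1               # deletion
        else:
            ins += 1; j -= 1                # insertion
    return subs, dels, ins
```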
Error Type Composition

Whissle has a balanced, substitution-dominant profile. On LibriSpeech: 62% substitutions, 20% deletions, 18% insertions. This is the typical acoustic model pattern -- the model hears something but gets the word wrong. As conditions get harder, the deletion rate creeps up, but the profile stays balanced.

Deepgram shifts strategy between clean and difficult audio. On clean speech: 68% substitutions. On accented speech: 49% deletions. It becomes more conservative -- dropping words it's uncertain about rather than guessing. Better to omit than to hallucinate.

AssemblyAI is deletion-dominant everywhere. LibriSpeech: 78% deletions. EN-IN Tech: 76% deletions. Combined with its 26% failure rate, this points to a systemic issue in the streaming pipeline -- segments being dropped, turns ending prematurely, or transcripts being truncated.

| Provider | LibriSpeech | EN-IN Tech | Soccer | Court India | Pattern |
|---|---|---|---|---|---|
| Whissle | 62% sub / 20% del / 18% ins | 51% sub / 33% del / 16% ins | 45% sub / 33% del / 21% ins | 37% sub / 30% del / 34% ins | Balanced, substitution-led |
| Deepgram | 68% sub / 22% del / 11% ins | 36% sub / 49% del / 15% ins | 39% sub / 36% del / 25% ins | 27% sub / 8% del / 65% ins | Shifts strategy per dataset |
| AssemblyAI | 21% sub / 78% del / 2% ins | 18% sub / 76% del / 5% ins | 33% sub / 49% del / 18% ins | 20% sub / 52% del / 28% ins | Deletion-dominant |

The Degradation Problem

How much does each provider degrade from clean speech to real-world audio?

WER Degradation
| Provider | LibriSpeech (clean) | EN-IN (accented) | Degradation | Soccer (noisy) | Degradation | Court India (legal) | Degradation |
|---|---|---|---|---|---|---|---|
| Whissle | 8.24% | 40.67% | 4.9x | 53.89% | 6.5x | 53.50% | 6.5x |
| Deepgram | 4.16% | 29.83% | 7.2x | 43.79% | 10.5x | 45.50% | 10.9x |
| AssemblyAI | 11.11% | 39.45% | 3.6x | 45.62% | 4.1x | 47.98% | 4.3x |

Deepgram has the best absolute numbers everywhere -- but it also degrades the most dramatically. Its 4.16% on LibriSpeech balloons 10.9x to 45.50% on court proceedings. This suggests its LibriSpeech excellence may partly reflect optimization for clean read speech rather than generalized robustness.

This is the central insight of this benchmark: LibriSpeech performance is a poor predictor of real-world performance. The provider with the best clean-speech score may not maintain that lead on your actual audio.


The WER Distribution Story

Mean WER tells you the average. But the distribution tells you what to actually expect.

WER by Duration (LibriSpeech)

WER by Duration
| Duration | Whissle | Deepgram | AssemblyAI | Samples |
|---|---|---|---|---|
| Short (< 5s) | 12.42% | 5.82% | 15.46% | 1,089 |
| Medium (5--10s) | 7.81% | 4.19% | 10.23% | 917 |
| Long (> 10s) | 7.93% | 4.01% | 11.39% | 614 |

All providers improve on longer utterances (more context for language model rescoring). Deepgram's advantage is consistent across all duration buckets.

WER Brackets

LibriSpeech -- what percentage of samples fall into each quality bracket?

| Bracket | Whissle | Deepgram | AssemblyAI* |
|---|---|---|---|
| Perfect (0% WER) | 901 (34.4%) | 1,532 (58.5%) | 446 (27.3%) |
| Excellent (≤ 10%) | 1,723 (65.8%) | 2,235 (85.3%) | 1,046 (64.1%) |
| Good (≤ 20%) | 2,277 (86.9%) | 2,485 (94.8%) | 1,351 (82.7%) |
| Poor (> 50%) | 40 (1.5%) | 14 (0.5%) | 84 (5.1%) |
| Total failure (≥ 100%) | 9 (0.3%) | 8 (0.3%) | 11 (0.7%) |

*AssemblyAI percentages based on 1,633 successful samples only (987 failed)

EN-IN Tech Interviews -- where things get uncomfortable for everyone:

| Bracket | Whissle | Deepgram | AssemblyAI* |
|---|---|---|---|
| Perfect (0% WER) | 13 (1.1%) | 62 (5.2%) | 31 (3.3%) |
| Good (≤ 20%) | 156 (13.0%) | 471 (39.1%) | 250 (26.3%) |
| Poor (> 50%) | 487 (40.4%) | 381 (31.7%) | 394 (41.4%) |
| Total failure (≥ 100%) | 247 (20.5%) | 272 (22.6%) | 194 (20.4%) |

*AssemblyAI percentages based on 951 successful samples only (253 failed)

On accented speech, even the best provider gets fewer than 4 in 10 samples to "good" accuracy. This isn't a provider problem -- it's an industry problem. Accented ASR is far from solved.

Soccer Broadcast: Zero samples achieved perfect transcription for any provider. Not a single 30-second broadcast segment was transcribed without at least one error.


Beyond Transcription: Streaming Metadata

Here's where the comparison shifts from "who has the lowest WER" to "who gives you the most useful output."

All three providers return transcripts with word timestamps and confidence scores. But the metadata landscape varies significantly:

| Feature | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| Transcription | Yes | Yes | Yes |
| Word timestamps | Yes | Yes | Yes |
| Word confidence | Yes | Yes | Yes |
| Interim/partial results | Yes | Yes | Yes |
| Emotion detection | Yes (real-time) | Batch only (Sentiment Analysis) | Batch only (Sentiment Analysis) |
| Intent classification | Yes (real-time) | -- | -- |
| Speaker change detection | Yes (in-stream) | -- | -- |
| Speaker demographics | Yes (age, gender) | -- | -- |
| Speech rate (WPM) | Yes | -- | -- |
| Filler word detection | Yes | -- | -- |
| Emotion probabilities | Yes (per-segment) | -- | -- |
| Intent probabilities | Yes (per-segment) | -- | -- |
| Speaker diarization | -- | Add-on (streaming) | Batch only |
| Topic detection | -- | Batch only | Batch only |
| Entity detection | Via in-stream tokens | -- | Batch only |

Deepgram and AssemblyAI offer sentiment analysis and some other features in their batch/pre-recorded APIs, but in streaming mode -- the mode tested here -- these are not available. Deepgram's streaming add-on for speaker diarization is a notable exception.

Whissle streams all of the following in real time -- as part of the same WebSocket connection, with zero additional latency:

  • Emotion detection -- per-segment classification across 7 emotions with full probability distributions
  • Intent classification -- 20+ intent categories (inform, question, command, describe, etc.) with probabilities
  • Speaker change detection -- in-stream tokens marking when the speaker changes
  • Speech rate -- words per minute, measured in real time
  • Filler word detection -- counts of "um", "uh", "like", "you know"
  • Speaker demographics -- age range and gender classification
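The post doesn't show the wire format, so as a purely illustrative sketch: a consumer of per-segment metadata might look like the following. Every field name here is hypothetical -- it is not Whissle's actual schema.

```python
import json

# Hypothetical message shape -- field names are illustrative, not Whissle's real API.
raw = json.dumps({
    "text": "we should refactor this module",
    "emotion": {"label": "neutral", "probs": {"neutral": 0.91, "happy": 0.05}},
    "intent": {"label": "inform", "probs": {"inform": 0.88, "command": 0.07}},
    "speaker_change": False,
    "wpm": 142.5,
    "fillers": {"um": 1, "uh": 0},
})

def handle_segment(message: str) -> dict:
    """Pull the transcript plus the metadata this section describes out of one message."""
    seg = json.loads(message)
    return {
        "text": seg["text"],
        "emotion": seg["emotion"]["label"],
        "intent": seg["intent"]["label"],
        "wpm": seg["wpm"],
        "filler_count": sum(seg["fillers"].values()),
    }

summary = handle_segment(raw)
```

The point of the sketch: because everything arrives on the same WebSocket as the transcript, the application gets structured signals per segment without a second API call.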

Metadata Coverage

| Dataset | Samples | Emotion Detected | Intent Detected | WPM Measured |
|---|---|---|---|---|
| LibriSpeech | 2,620 | 2,620 (100%) | 2,620 (100%) | 2,618 (99.9%) |
| EN-IN Tech | 1,204 | 1,088 (90.4%) | 1,088 (90.4%) | 1,060 (88.0%) |
| Soccer | 327 | 327 (100%) | 327 (100%) | 327 (100%) |
| Court India | 764 | 760 (99.5%) | 760 (99.5%) | 756 (99.0%) |

Metadata Evaluation: Emotion Detection

The soccer broadcast dataset includes human-annotated ground truth for emotion, intent, age, and gender -- allowing us to evaluate Whissle's metadata predictions against labels.

Emotion distribution across datasets:

| Emotion | LibriSpeech | EN-IN Tech | Soccer | Court India |
|---|---|---|---|---|
| Neutral | 2,173 (82.9%) | 1,082 (99.4%) | 326 (99.7%) | 276 (36.3%) |
| Angry | 216 (8.2%) | 2 (0.2%) | -- | 277 (36.4%) |
| Happy | 133 (5.1%) | 1 (0.1%) | 1 (0.3%) | 207 (27.2%) |
| Disgust | 62 (2.4%) | 1 (0.1%) | -- | -- |
| Surprise | 16 (0.6%) | 1 (0.1%) | -- | -- |
| Sad | 15 (0.6%) | 1 (0.1%) | -- | -- |
| Fear | 5 (0.2%) | -- | -- | -- |

LibriSpeech (audiobook narration) shows emotional variety -- professional narrators use distinct vocal tones for different characters and dramatic moments, which the model picks up. Tech interviews are almost entirely neutral, as expected for professional conversations. Soccer broadcast commentary is acoustically neutral even when describing exciting moments.

The Supreme Court dataset shows the most interesting emotion distribution: roughly equal parts angry (36.4%), neutral (36.3%), and happy (27.2%). This reflects the adversarial nature of courtroom proceedings -- advocates argue forcefully (detected as angry), judges maintain composure (neutral), and favorable rulings or successful arguments produce positive vocal tones (happy). Unlike soccer commentary where acoustic emotion is muted, courtroom speech carries genuine vocal intensity.

Ground truth comparison (Soccer Broadcast, 327 samples):

| Ground Truth Emotion | Samples | Whissle Correct | Recall |
|---|---|---|---|
| Neutral | 202 | 202 | 100.0% |
| Happy | 66 | 1 | 1.5% |
| Sad | 59 | 0 | 0.0% |
| Overall accuracy | 327 | 203 | 62.1% |

An important nuance: the ground truth labels in the soccer dataset annotate contextual emotion (happy when a goal is scored, sad when a chance is missed), while Whissle detects acoustic emotion (the actual tone of voice). Professional broadcast commentators maintain a relatively even vocal tone regardless of context -- a commentator describing a goal doesn't necessarily sound acoustically "happy" in the way a person celebrating would. The 100% recall on neutral reflects that the model is accurately reading the acoustic signal, even when the annotated context differs.

Metadata Evaluation: Intent Classification

Intent distribution across datasets (EN-IN percentages based on 1,088 samples with detected intent; 116 very short segments returned no intent):

| Intent | LibriSpeech | EN-IN Tech | Soccer | Court India |
|---|---|---|---|---|
| Inform | 2,337 (89.2%) | 1,054 (96.9%) | 326 (99.7%) | 752 (98.9%) |
| Question | 151 (5.8%) | 28 (2.6%) | 1 (0.3%) | 6 (0.8%) |
| Describe | 67 (2.6%) | -- | -- | -- |
| Command | 29 (1.1%) | 4 (0.4%) | -- | 1 (0.1%) |
| Other | 36 (1.4%) | 2 (0.2%) | -- | 1 (0.1%) |

The soccer dataset has domain-specific ground truth intents (Play-by-play, Analysis, Statistic, Reaction). Whissle uses a general-purpose intent taxonomy. When we map the domain-specific labels to their general equivalents (Play-by-play/Analysis/Statistic map to Inform), the alignment accuracy is 99.7% (326/327) -- the model correctly identifies that commentary is fundamentally informational speech.

LibriSpeech shows more intent variety because audiobook narration includes dialogue with questions (5.8%), descriptive passages (2.6%), and imperative speech (1.1%). Tech interviews are overwhelmingly informational with occasional questions -- consistent with interview dynamics. Court proceedings are similarly dominated by informational speech (98.9%) -- advocates presenting arguments and judges delivering rulings -- with occasional questions from the bench (0.8%).

Speech Rate Analysis

Speech Rate Distribution
| Dataset | Mean WPM | Median WPM | Range |
|---|---|---|---|
| LibriSpeech (read speech) | 168.2 | 166.9 | 28.4 -- 284.0 |
| EN-IN Tech (conversational) | 115.2 | 114.3 | 25.0 -- 400.0 |
| Soccer (broadcast) | 149.7 | 150.0 | 83.0 -- 212.5 |
| Court India (legal) | 148.1 | 146.6 | 59.8 -- 252.2 |

LibriSpeech shows the highest speech rate (professional narrators reading at a steady pace). Tech interviews are slower (thinking pauses, technical explanations). Soccer commentary and court proceedings sit in between at similar median rates (~150 WPM), though court speech shows a wider range -- from measured judicial deliberation to rapid-fire advocacy.

Why Metadata Changes Everything

If you're building live captioning, a transcription API is enough. But if you're building anything that needs to understand the conversation -- not just record it -- you need more than words.

1. Detect escalations before they happen. Real-time emotion detection flags when a customer's tone shifts from neutral to frustrated -- before they ask for a manager. Intent classification distinguishes a genuine question from a complaint. In a call center, this means routing to a supervisor in seconds, not minutes.

2. Coach speakers in real time. Speech rate tracking tells presenters they're rushing. Filler word counts ("um", "uh", "like") show job candidates exactly what to work on. All streamed live during the conversation, not buried in a post-call report.

3. Understand conversations, not just words. Speaker change detection identifies turn-taking patterns. Demographics reveal who's dominating a meeting. Intent classification distinguishes questions from statements from commands. These signals transform a flat transcript into a structured understanding of what actually happened.

4. Ship with confidence. Near-zero failures across 4,915 samples means virtually every audio segment returns a transcript plus metadata. No silent drop-outs, no missing data, no fallback logic needed. Getting all of this in a single streaming WebSocket connection, with zero additional latency at the same per-minute price, is a fundamentally different value proposition.


Confidence Scores: Who's Honest About Uncertainty?

Confidence Scores
| Dataset | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| LibriSpeech | 0.949 | 0.987 | 0.868 |
| EN-IN Tech | 0.757 | 0.903 | 0.809 |
| Soccer | 0.820 | 0.934 | 0.842 |
| Court India | 0.785 | 0.941 | 0.837 |

Deepgram's confidence is high and justified -- it has the best WER and the highest confidence. Well calibrated.

Whissle's confidence drops sharply from 0.949 on clean speech to 0.757 on accented speech -- a drop of nearly 20 points that honestly reflects the increased difficulty. If you're building a system that needs to know when to trust the transcript less, Whissle's confidence is informative.

AssemblyAI's confidence stays between 0.81 and 0.87 across all datasets -- even though its WER is the worst on LibriSpeech and it has a 26% failure rate. The confidence doesn't drop proportionally to accuracy.

A confidence score is only useful if low confidence actually predicts errors. Whissle's confidence is honest -- it goes down when things get hard. AssemblyAI's confidence stays high regardless.
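One simple way to check this yourself: bucket per-sample (confidence, WER) pairs and compare mean WER across buckets. A well-calibrated provider shows clearly higher WER in the low-confidence bucket. The bucket edges and toy data below are illustrative:

```python
def calibration_table(samples, edges=(0.8, 0.9)):
    """Bucket (confidence, wer) pairs and report mean WER per bucket.

    samples: iterable of (confidence, wer) floats; edges split low/mid/high confidence.
    """
    buckets = {"low": [], "mid": [], "high": []}
    for conf, wer in samples:
        key = "low" if conf < edges[0] else "mid" if conf < edges[1] else "high"
        buckets[key].append(wer)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}

# Toy data: for a calibrated provider, low confidence tracks high WER
table = calibration_table([(0.95, 0.02), (0.92, 0.05), (0.85, 0.20), (0.70, 0.55)])
```

Run against the published per-sample results, this kind of table makes the calibration claim above testable rather than anecdotal.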

Pricing and Deployment

| | Whissle | Deepgram | AssemblyAI |
|---|---|---|---|
| Cost per audio minute | $0.0043 | $0.0043 | $0.0050 |
| Cost per audio hour | $0.258 | $0.258 | $0.300 |
| What's included | Transcription + emotion, intent, speaker change, WPM, fillers, demographics | Transcription only (metadata features are add-ons) | Transcription only |
| Deployment options | Cloud API + self-hosted Docker | Cloud API only | Cloud API only |
| Self-hosted option | Yes (eliminates per-minute costs) | No | No |

Whissle matches Deepgram's base streaming transcription price at $0.0043 per audio minute -- but at that price, you get all metadata (emotion, intent, speaker change, speech rate, filler detection, demographics) included. With Deepgram, metadata features are paid add-ons on top of the base transcription cost.

For high-volume use cases, Whissle's self-hosted Docker deployment eliminates per-minute costs entirely -- critical for call centers, media processing, and organizations with data residency requirements.
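A back-of-envelope break-even check, using the $0.0043/minute price from the table above (the 10,000-hour volume is illustrative, and self-hosting infrastructure costs are your own assumptions to fill in):

```python
def monthly_api_cost(hours_per_month: float, price_per_minute: float = 0.0043) -> float:
    """Metered streaming cost per month at a given audio volume."""
    return hours_per_month * 60 * price_per_minute

# At 10,000 audio hours/month, metered streaming runs ~$2,580/month;
# a fixed self-hosting budget below that figure is a net saving.
cost = monthly_api_cost(10_000)
```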


Who Wins What?

Scorecard

Choose Deepgram when:

  • Transcription accuracy is your number one priority
  • Your audio is primarily clean, well-recorded English
  • You're comfortable with slightly higher first-segment latency (~3.4s)
  • You need the best WER at competitive per-minute pricing

Choose Whissle when:

  • You need the fastest first-segment response (1.0--1.7s) for real-time coaching, live captions, or conversational AI
  • You need streaming emotion, intent, speech rate, or filler detection alongside transcription
  • Near-zero failure rates are non-negotiable (medical, legal, safety, accessibility)
  • You want to self-host ASR on your own infrastructure
  • You're building beyond transcription -- coaching, analytics, meeting intelligence

Be cautious with AssemblyAI streaming:

  • The V3 streaming API had a 25.6% failure rate across our benchmark
  • When it works, accuracy is competitive -- but the failure rate makes it unreliable for production use
  • The deletion-dominant error profile suggests systemic issues with the streaming pipeline

The Bigger Picture

Clean speech ASR is effectively solved. At 4--12% WER on LibriSpeech, we're in the range of human disagreement. Providers arguing over percentage points on clean speech are fighting the last war.

Real-world ASR is far from solved. The jump from 4% to 46% WER when you add accents, noise, and domain-specific vocabulary is not a small gap -- it's an 11x degradation. This is the frontier.

LibriSpeech benchmarks are misleading in isolation. A provider that's 2x better on LibriSpeech might be only 1.3x better on your actual audio -- or worse if their model is overfit to clean speech.

Reliability is a feature, not an assumption. One of three major providers failed on 26% of samples. Your test suite needs to measure failure rate, not just WER on the samples that succeed.

Transcription is table stakes. The next wave of speech AI isn't about squeezing 0.5% more WER -- it's about what else you can extract from audio in real time. Emotion, intent, speech patterns, engagement signals. The provider that delivers these alongside transcription, without additional latency or cost, changes the application design space.


Cross-Dataset Summary

| Dataset | Samples | Audio | Whissle WER | Deepgram WER | AssemblyAI WER | AAI Failures |
|---|---|---|---|---|---|---|
| LibriSpeech test-clean | 2,620 | 5.40h | 9.75% | 4.83% | 12.55% | 987 (37.7%) |
| EN-IN Tech Interviews | 1,204 | 5.27h | 62.94% | 43.62% | 51.25% | 253 (21.0%) |
| Soccer Broadcast | 327 | 2.73h | 73.74% | 64.99% | 64.18% | 17 (5.2%) |
| Supreme Court India | 764 | 4.24h | 70.48% | 67.44% | 61.31% | 2 (0.3%) |
| Total | 4,915 | 17.64h | -- | -- | -- | 1,259 (25.6%) |

Methodology Appendix

Benchmark script: Open source, available in our repository. Uses HuggingFace datasets for LibriSpeech and EN-IN Tech, JSONL manifests for Soccer. Full per-sample JSON results published for independent verification.

Text normalization: Lowercase, strip all punctuation, collapse whitespace. For WhissleAI datasets, metadata tokens (AGE_*, GENDER_*, EMOTION_*, INTENT_*) and entity markers (ENTITY_*...END) stripped from reference text before comparison.
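Token stripping can be sketched with two regexes. The patterns below are inferred from the token families named above (AGE_*, GENDER_*, EMOTION_*, INTENT_*, ENTITY_*...END); the benchmark's actual regexes may differ:

```python
import re

def strip_meta_tokens(text: str) -> str:
    """Remove annotation tokens like AGE_30_45, EMOTION_HAPPY, and ENTITY_... END spans."""
    # Illustrative patterns -- a bare word "END" would also be stripped here,
    # which a production regex would need to guard against.
    text = re.sub(r"\b(AGE|GENDER|EMOTION|INTENT)_\S+", "", text)
    text = re.sub(r"\bENTITY_\S+|\bEND\b", "", text)
    return " ".join(text.split())

clean = strip_meta_tokens("he scored ENTITY_PLAYER Rashford END EMOTION_HAPPY just now")
# -> "he scored Rashford just now"
```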

WER computation: jiwer library (standard implementation). Per-sample WER averaged for mean; total errors / total words for corpus WER.

Corpus WER vs Mean WER: Mean WER averages per-sample rates -- a 3-word utterance with 1 error scores 33% and weighs equally with a 100-word utterance. Corpus WER divides total errors by total words, giving proportional weight. We report both; corpus WER is generally more representative.
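The difference is easy to see with the 3-word/100-word example from the paragraph above:

```python
def mean_and_corpus_wer(samples):
    """samples: list of (error_count, reference_word_count) pairs."""
    mean = sum(e / n for e, n in samples) / len(samples)
    corpus = sum(e for e, _ in samples) / sum(n for _, n in samples)
    return mean, corpus

# A 3-word clip with 1 error alongside a 100-word clip with 2 errors:
mean, corpus = mean_and_corpus_wer([(1, 3), (2, 100)])
# mean ~= 17.7% (the short clip dominates); corpus = 3/103 ~= 2.9% (weighted by words)
```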

Infrastructure: Whissle on Cloud Run (CPU, ONNX runtime). Deepgram and AssemblyAI via their cloud APIs. All requests from the same machine. April 2026.

Reproducibility: All result JSONs (per-sample transcripts, WER, latency, metadata) are published. Every number in this post can be independently verified.


We built this benchmark because we believe the ASR industry needs more transparent, multi-dataset evaluations. If you want to run these tests yourself or add your own datasets, the tooling is open source.

If you're exploring ASR for your application, try the providers on your actual audio before making a decision. Benchmarks inform -- but your data decides.

