Multilingual ASR Benchmarks: How N-Gram Language Models Cut Error Rates by 10% Across Languages

By Whissle Research Team

Apr 16 2026

300 samples across three languages. GPU-accelerated CTC inference with beam search and KenLM n-gram fusion. A transparent, four-column comparison of Whissle (greedy decoder), Whissle + Language Model, Deepgram Nova-3, and AssemblyAI Universal Streaming.


Most ASR benchmarks test on LibriSpeech and call it a day. The provider with the lowest number on clean, studio-recorded audiobook narration wins, and everyone moves on.

But real applications serve users across languages. Your call center handles Spanish callers. Your meeting tool transcribes German. And when your acoustic model produces uncertain outputs, a language model can mean the difference between a usable transcript and garbage.

We set out to answer two questions at once: How does Whissle perform on non-English languages? And does adding an n-gram language model to the CTC decoder actually improve accuracy?

We benchmarked four configurations -- Whissle (greedy decoder), Whissle + KenLM language model, Deepgram Nova-3 (multilingual), and AssemblyAI (Universal Streaming) -- across three languages: English, Spanish, and German. We measured accuracy, latency, reliability, and the precise impact of language model fusion on word error rates.

Here's everything we found.


What Changed Since Our Last Benchmark

Our previous benchmark tested English-only with Whissle on CPU. This update introduces three major changes:

  • GPU acceleration. Whissle now runs on NVIDIA L4 GPUs via Cloud Run (us-east4), replacing the CPU-only ONNX runtime. Real-time throughput improved significantly.
  • N-gram language model integration. KenLM-based 3-gram models, trained on Wikipedia text per language group, are fused into CTC beam search decoding. This gives the decoder linguistic context beyond raw acoustic scores.
  • Multilingual benchmarking. English, Spanish, and German -- with language-matched LM models. Deepgram upgraded from Nova-2 to Nova-3 (their latest multilingual model).

The result is a four-column comparison: Whissle greedy (pure acoustic model), Whissle + LM (beam search with KenLM), Deepgram Nova-3, and AssemblyAI Universal Streaming.


How We Tested

All providers were tested using real-time WebSocket streaming -- the exact protocol you'd use in production. Audio was streamed in 100ms chunks at 1x real-time speed, simulating a live microphone.

| Provider | Model / Config | Streaming Endpoint |
| --- | --- | --- |
| Whissle (greedy) | GPU + CTC greedy decode | wss://api.whissle.ai/asr/stream |
| Whissle + LM | GPU + CTC beam search + KenLM | wss://api.whissle.ai/asr/stream |
| Deepgram | Nova-3 (multilingual) | wss://api.deepgram.com/v1/listen |
| AssemblyAI | Universal Streaming | wss://streaming.assemblyai.com/v3/ws |
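The chunking arithmetic behind the 100ms streaming simulation is worth making concrete: at 16kHz mono PCM int16, every 100ms of audio is exactly 3,200 bytes. The sketch below is illustrative, not the actual benchmark client (in the real harness each chunk is sent over the WebSocket followed by a ~100ms sleep to hold 1x real-time pace).

```python
# Minimal sketch of the 100ms chunking used to simulate a live microphone.
# Constants match the benchmark's audio format; the function name is ours.

SAMPLE_RATE = 16_000   # Hz, mono
BYTES_PER_SAMPLE = 2   # PCM int16
CHUNK_MS = 100

def chunk_pcm(audio: bytes, chunk_ms: int = CHUNK_MS) -> list:
    """Split raw PCM int16 audio into fixed-duration chunks."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000  # 3,200
    return [audio[i:i + chunk_bytes] for i in range(0, len(audio), chunk_bytes)]

# One second of silence -> ten 3,200-byte chunks
one_second = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = chunk_pcm(one_second)
```

Streaming at 1x real time means a 60-second clip takes roughly 60 seconds to send, which is why the RTFX numbers later in this post hover just below 1.0 for every provider.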

Measures taken to keep the comparison fair:

  • All providers received identical audio (PCM int16, mono, 16kHz)
  • Both Whissle configurations (greedy and +LM) ran sequentially on each sample, then Deepgram and AssemblyAI ran concurrently
  • Text normalized before WER: lowercased, punctuation stripped, whitespace collapsed
  • WER computed using the standard jiwer library
  • Language parameter set for each provider (e.g. language=es for Spanish)
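The normalization and scoring steps above can be sketched in a few lines. The benchmark itself uses the jiwer library; the stdlib version below implements the same (Substitutions + Deletions + Insertions) / Reference Words definition, purely for illustration.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace -- the same
    normalization applied to both reference and hypothesis text."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance -- the same
    (S + D + I) / N quantity jiwer computes."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("he went to the store", "he went too the store")` is 0.2: one substitution over five reference words.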

Key metrics:

| Metric | What It Measures |
| --- | --- |
| WER (mean) | Word Error Rate -- (Insertions + Deletions + Substitutions) / Reference Words, averaged per sample |
| WER (median) | Median per-sample WER -- robust to outliers |
| CER | Character Error Rate -- same formula at character level |
| Time to first segment | Milliseconds from audio stream start to first non-empty transcript |
| RTFX | Real-time factor -- how much faster than real-time the transcription completes |
| Failure rate | Percentage of samples that returned no usable transcript |
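Most of these metrics are simple differences and percentages; RTFX is the one worth a worked example. Note that because audio is streamed at 1x real time, RTFX is capped near 1.0 for every provider in this benchmark. The wall-clock figure below is hypothetical; only the 670.6s audio duration comes from the dataset.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: audio duration divided by wall-clock transcription
    time. Streaming at 1x real time caps this near 1.0; values slightly
    below 1.0 reflect real-time pacing plus finalization overhead."""
    return audio_seconds / wall_seconds

# 670.6s of English audio finishing in ~770s of wall clock -> ~0.87x
# (the 770s timing is an invented example, not a measured number)
example = rtfx(670.6, 770.0)
```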

The Three Datasets

We chose three read-speech datasets -- one per language -- from standard academic benchmarks. This isolates the language variable: same recording conditions, same genre (audiobook narration), different languages.

Dataset 1: LibriSpeech test-clean (English)

The industry baseline. Clean, studio-quality audiobook narration in American English.

| Property | Value |
| --- | --- |
| Source | openslr/librispeech_asr (HuggingFace) |
| Samples | 100 (from test split) |
| Total audio | 670.6 seconds (~11.2 minutes) |
| Total reference words | 1,870 |
| Language | English (US) |
| KenLM model | ENGLISH.bin (trained on English Wikipedia) |

Dataset 2: Multilingual LibriSpeech -- Spanish

Read speech in Castilian and Latin American Spanish, from the Multilingual LibriSpeech corpus.

| Property | Value |
| --- | --- |
| Source | WhissleAI/multilingual-libri-test-spanish (HuggingFace) |
| Samples | 100 |
| Total audio | 1,488.4 seconds (~24.8 minutes) |
| Total reference words | 3,250 |
| Language | Spanish |
| KenLM model | EUROPEAN.bin (trained on Spanish, French, German, Portuguese, Italian, and 10 more European language Wikipedias) |

Dataset 3: Multilingual LibriSpeech -- German

Read speech in Standard German, from the same Multilingual LibriSpeech family.

| Property | Value |
| --- | --- |
| Source | WhissleAI/multilingual-libri-test-german (HuggingFace) |
| Samples | 100 |
| Total audio | 1,451.8 seconds (~24.2 minutes) |
| Total reference words | 2,680 |
| Language | German |
| KenLM model | EUROPEAN.bin (same model as Spanish -- covers all European group languages) |

The N-Gram Language Model: How It Works

Before diving into results, it's worth understanding exactly what the language model does and how it integrates with our CTC decoder. This is the technical contribution that makes the multilingual results possible.

The Problem: Acoustic Models Alone Aren't Enough

A CTC (Connectionist Temporal Classification) acoustic model processes audio frame by frame and outputs a probability distribution over the token vocabulary at each time step. A greedy decoder simply picks the most likely token at each frame, collapses consecutive duplicates and blanks, and produces text. It's fast -- but it only considers acoustic evidence. It has no knowledge of what words are likely to follow other words.

This means the greedy decoder makes mistakes that a human reader would immediately catch: "he went too the store" instead of "he went to the store." The acoustic signals for "too" and "to" are nearly identical. A language model knows that "to the" is far more probable than "too the" and can correct this.
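The greedy collapse rule is easy to see on a toy vocabulary. This sketch assumes the per-frame argmax has already been taken; the token ids and the SentencePiece-style "▁" word-boundary marker are illustrative.

```python
def ctc_greedy_decode(frame_ids: list, id_to_token: dict,
                      blank_id: int = 0) -> str:
    """Standard CTC greedy rule: collapse consecutive duplicate tokens,
    drop blanks, then join subword pieces into words."""
    out = []
    prev = None
    for tid in frame_ids:
        if tid != prev and tid != blank_id:
            out.append(id_to_token[tid])
        prev = tid
    # SentencePiece convention: "▁" marks the start of a new word
    return "".join(out).replace("▁", " ").strip()

# Toy vocabulary; id 0 is the CTC blank.
vocab = {1: "▁t", 2: "o", 3: "▁the"}
frames = [1, 1, 0, 2, 2, 0, 3, 3]          # argmax token id per frame
decoded = ctc_greedy_decode(frames, vocab)  # -> "to the"
```

Every decision here is local to a single frame -- which is exactly why the decoder cannot tell "to" from "too" without a language model.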

KenLM: N-Gram Language Models from Wikipedia

We train word-level 3-gram KenLM language models from Wikipedia text. The training pipeline works as follows:

N-Gram Language Model: Training and Decoding Pipeline

1. Text collection. For each language group, we download Wikipedia articles via HuggingFace Datasets. The Whissle ASR model uses a multilingual tokenizer organized into groups -- ENGLISH, EUROPEAN (de, nl, fr, es, pt, it, sv, da, and more), SLAVIC (ru, pl, cs, uk, bg, and more), INDO_ARYAN (hi, mr, bn, gu, ur), DRAVIDIAN (ta, te, kn), and MANDARIN (zh). Each group shares a SentencePiece tokenizer, so the language model is trained on all languages in the group combined.

Language Groups and KenLM Coverage

2. Text normalization. Raw Wikipedia text is NFKC-normalized, lowercased, stripped of non-word characters (preserving Unicode ranges for each script -- Latin, Cyrillic, Devanagari, CJK, etc.), and cleaned of short or degenerate lines. This produces a clean word-level corpus in the hundreds of megabytes per group.
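A minimal sketch of this normalization step for the Latin-script (EUROPEAN) group. The character range and the minimum-length threshold here are illustrative, not the production pipeline's exact values; the real pipeline handles Cyrillic, Devanagari, CJK, and other scripts with their own Unicode ranges.

```python
import re
import unicodedata
from typing import Optional

# Illustrative range: basic Latin letters plus Latin-1/Extended accents.
LATIN = r"a-z\u00c0-\u024f"

def clean_line(line: str, min_words: int = 3) -> Optional[str]:
    """NFKC-normalize, lowercase, keep only word characters for the target
    script, collapse whitespace, and drop short or degenerate lines."""
    line = unicodedata.normalize("NFKC", line).lower()
    line = re.sub(fr"[^{LATIN}\s]", " ", line)   # strip digits, punctuation, other scripts
    line = re.sub(r"\s+", " ", line).strip()
    return line if len(line.split()) >= min_words else None
```

For example, `clean_line("Müller wrote 3 books in 1999!")` yields `"müller wrote books in"`, while a two-character line is discarded.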

3. N-gram training. We use the KenLM toolkit's lmplz to train a 3-gram model with pruning thresholds 0 0 1 (keep all unigrams and bigrams, prune trigrams occurring fewer than 2 times). The ARPA model is then converted to a compact binary format using build_binary. A unigram word list (top 500,000 most frequent words) is also extracted for the beam search decoder's vocabulary constraint.

4. Binary model output. The final artifacts are ENGLISH.bin, EUROPEAN.bin, etc. -- one per tokenizer group. At server startup, these are loaded into CTCBeamSearchDecoder instances backed by pyctcdecode.

CTC Beam Search with LM Fusion

The key insight is shallow fusion: the beam search decoder combines two independent scores for each candidate word sequence:

score(w) = log P_acoustic(w) + α · log P_LM(w) + β · |w|

Where:

  • P_acoustic(w) is the CTC model's probability of word sequence w, computed from the frame-level token log-probabilities
  • P_LM(w) is the KenLM n-gram probability -- how likely is this word given the previous two words?
  • α (alpha) controls the LM weight -- how much linguistic context should influence the decision vs. raw acoustics
  • β (beta) is the word insertion bonus -- a positive value encourages longer transcripts (prevents the LM from favoring short, high-probability sequences)
  • |w| is the number of words in the candidate sequence

We use α = 0.1 and β = 0.5 as default parameters. The low alpha keeps the acoustic model dominant -- the LM is a gentle correction signal, not a sledgehammer. The beam width is 100 candidates.
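The fusion formula is a one-liner, and a toy rerank shows how even a small α flips an acoustically ambiguous choice like "too the" vs "to the". The log-probabilities below are invented for illustration; only α = 0.1 and β = 0.5 come from the benchmark configuration.

```python
import math

ALPHA, BETA = 0.1, 0.5  # LM weight and word-insertion bonus (benchmark defaults)

def fused_score(log_p_acoustic: float, log_p_lm: float, num_words: int,
                alpha: float = ALPHA, beta: float = BETA) -> float:
    """Shallow fusion: acoustic log-prob plus weighted LM log-prob plus a
    per-word insertion bonus."""
    return log_p_acoustic + alpha * log_p_lm + beta * num_words

# The acoustic model slightly prefers "too the", but the LM assigns it a
# far lower probability -- and even a gentle alpha of 0.1 flips the ranking.
to_the  = fused_score(math.log(0.48), math.log(0.02), 2)
too_the = fused_score(math.log(0.52), math.log(0.0001), 2)
```

Because both candidates have the same word count, the β term cancels here; its real job is to stop the LM from systematically favoring shorter hypotheses.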

Critical Implementation Details

Several engineering details proved essential for good performance:

  • Log-softmax before metadata suppression. Our model vocabulary has ~18,189 tokens, of which ~9,919 are metadata tokens (AGE_*, GENDER_*, EMOTION_*, INTENT_*, ENTITY_*). These must be suppressed for the beam search decoder. We apply log-softmax on the full vocabulary first, then zero out metadata tokens. If metadata is suppressed before normalization, the remaining speech tokens get inflated probabilities (normalization over fewer tokens), making the acoustic model appear ~2x more confident than it really is -- drowning out the LM's corrections.
  • Word boundary bias. For BPE tokens where a word-boundary variant (▁F) and a non-boundary variant (F) both exist, we boost the boundary variant by 5.0 logits. Without this, KenLM tends to merge words (e.g., "SanFrancisco" instead of "San Francisco"). The bias is only applied to tokens with ambiguous counterparts, not unique word-start tokens like ▁thousand.
  • Per-group decoder instances. Each language group gets its own CTCBeamSearchDecoder loaded at server startup. When a streaming request specifies language=es, the engine resolves Spanish to the EUROPEAN group and uses the corresponding decoder. If no language is specified, the engine auto-detects the language group from the acoustic model's token-range activations.
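The normalization-order pitfall in the first bullet can be demonstrated directly. In this toy frame, half the vocabulary is metadata (roughly the real model's ratio); normalizing after suppression doubles every surviving probability.

```python
import math

def log_softmax(logits: list) -> list:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

# Toy frame: tokens 0-1 are speech, tokens 2-3 are metadata, equal logits.
logits = [1.0, 1.0, 1.0, 1.0]
speech_ids = [0, 1]

# Correct order: normalize over the FULL vocabulary, then mask metadata.
correct = log_softmax(logits)                         # each prob = 0.25
# Wrong order: drop metadata first, then normalize over what is left.
wrong = log_softmax([logits[i] for i in speech_ids])  # each prob = 0.50
# With half the vocabulary masked before normalization, every surviving
# probability doubles -- the "~2x more confident" inflation that drowns
# out the LM's corrections.
```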

The result is that adding beam search + LM to a streaming pipeline adds negligible latency (the beam search runs on the same GPU, and the KenLM lookup is a hash table query per frame) while providing meaningful accuracy improvements -- especially for non-English languages.


Results at a Glance

WER Comparison Across Languages

English: LibriSpeech test-clean (100 samples, 670.6s audio)

| Metric | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | AssemblyAI | Best |
| --- | --- | --- | --- | --- | --- |
| WER (mean) | 5.79% | 6.21% | 4.14% | 10.21% | DG |
| WER (median) | 3.77% | 0.00% | 0.00% | 7.14% | DG/W+LM |
| CER (mean) | 2.06% | 2.50% | 1.89% | 10.46% | DG |
| First segment (P50) | 967ms | 961ms | 4,373ms | 1,418ms | W |
| First segment (P95) | 1,442ms | 1,328ms | 5,509ms | 2,227ms | W+LM |
| RTFX (median) | 0.87x | 0.86x | 0.88x | 0.87x | DG |
| Avg confidence | 0.956 | 0.956 | 0.988 | 0.883 | DG |
| Failed samples | 0 | 0 | 0 | 5 | -- |

LM impact on English: The KenLM model slightly increases mean WER (+0.42%), but improves the median from 3.77% to 0.00% -- more samples get perfect scores. The tradeoff: the ENGLISH.bin model introduces additional insertions (36 vs 22) while reducing substitutions (65 vs 76). For well-trained English acoustic models, the greedy decoder is already strong enough that the LM's corrections are roughly offset by the occasional word it adds.

Whissle delivers the first transcript in under 1 second -- 4.5x faster than Deepgram (967ms vs 4,373ms). Even at P95, Whissle stays under 1.5 seconds while Deepgram exceeds 5.5 seconds.

Spanish: Multilingual LibriSpeech (100 samples, 1,488.4s audio)

| Metric | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | AssemblyAI | Best |
| --- | --- | --- | --- | --- | --- |
| WER (mean) | 36.90% | 33.81% | 15.93% | 100% FAIL | DG |
| WER (median) | 35.48% | 31.82% | 13.71% | -- | DG |
| CER (mean) | 11.12% | 10.99% | 5.45% | -- | DG |
| First segment (P50) | 1,197ms | 1,214ms | 2,919ms | -- | W |
| First segment (P95) | 1,709ms | 1,732ms | 5,282ms | -- | W |
| RTFX (median) | 0.94x | 0.94x | 0.95x | -- | DG |
| Avg confidence | 0.840 | 0.841 | 0.974 | -- | DG |
| Failed samples | 0 | 0 | 0 | 100 | -- |

LM impact on Spanish: -3.09% absolute WER reduction (36.90% → 33.81%, or 8.4% relative improvement). The EUROPEAN.bin KenLM model -- trained on Spanish, French, German, Portuguese, Italian, and 10 other European language Wikipedias -- provides strong linguistic priors for Spanish word sequences.

The error breakdown tells the story: deletions dropped from 182 to 145 (the LM fills in words the acoustic model missed), substitutions dropped from 919 to 835 (the LM corrects acoustically ambiguous words), and correctly recognized words increased from 2,149 to 2,270 out of 3,250 total.

AssemblyAI's Universal Streaming returned zero transcripts for all 100 Spanish samples -- a 100% failure rate. This makes it unusable for any non-English streaming application.

German: Multilingual LibriSpeech (100 samples, 1,451.8s audio)

| Metric | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | AssemblyAI | Best |
| --- | --- | --- | --- | --- | --- |
| WER (mean) | 33.65% | 30.03% | 22.96% | 100% FAIL | DG |
| WER (median) | 32.88% | 30.20% | 21.53% | -- | DG |
| CER (mean) | 9.61% | 8.35% | 6.59% | -- | DG |
| First segment (P50) | 1,170ms | 1,152ms | 3,613ms | -- | W+LM |
| First segment (P95) | 2,231ms | 2,219ms | 5,460ms | -- | W+LM |
| RTFX (median) | 0.93x | 0.93x | 0.95x | -- | DG |
| Avg confidence | 0.877 | 0.876 | 0.940 | -- | DG |
| Failed samples | 0 | 0 | 0 | 100 | -- |

LM impact on German: -3.62% absolute WER reduction (33.65% → 30.03%, or 10.8% relative improvement). German shows the strongest LM benefit of all three languages, likely because German's compound words and case morphology create many acoustically ambiguous sequences that the LM can disambiguate.

The most dramatic change is in deletions: they dropped from 93 to 38 -- a 59% reduction. The LM is recovering words that the greedy decoder was losing entirely. Correctly recognized words jumped from 1,856 to 1,989 out of 2,680 reference words.

AssemblyAI again returned zero transcripts for all 100 German samples. Combined with the Spanish result, this confirms that AssemblyAI's streaming API does not support non-English languages.

The Language Model Impact: A Cross-Language View

N-Gram Language Model Impact on WER

| Language | Greedy WER | +LM WER | Absolute Change | Relative Change | Direction |
| --- | --- | --- | --- | --- | --- |
| English | 5.79% | 6.21% | +0.42% | +7.3% | Slight hurt |
| Spanish | 36.90% | 33.81% | -3.09% | -8.4% | Clear improvement |
| German | 33.65% | 30.03% | -3.62% | -10.8% | Strong improvement |

The pattern is clear: the LM helps most where the acoustic model struggles most. On English -- where the greedy decoder already achieves 5.79% WER -- the LM is roughly neutral (slightly worse mean, slightly better median). On Spanish and German -- where greedy WER is 33--37% -- the LM provides 3--4% absolute improvement.

Why the LM Helps Non-English More Than English

Three factors explain this asymmetry:

  1. English acoustic training data dominance. The acoustic model was trained on significantly more English data than any other language. This means the English acoustic scores are already well-calibrated -- the greedy decoder makes good choices because the model is confident and usually right. For Spanish and German, the acoustic model is less certain, creating more room for the LM to help.
  2. Morphological complexity. Spanish verb conjugations and German compound words create more acoustically ambiguous token sequences. "trabajamos" (we work) vs "trabaja más" (work more) are nearly identical acoustically but very different linguistically. The LM's n-gram statistics help the decoder navigate these ambiguities.
  3. Insertion-deletion tradeoff. For English, the LM introduces extra insertions (+14 words, from 22 to 36) while only modestly reducing substitutions (-11, from 76 to 65). For German, the LM's insertion increase (+43) is more than compensated by a massive deletion decrease (-55, from 93 to 38). The LM is recovering dropped words -- a net positive.

How the Language Model Changes Error Distribution

Error Composition with and without LM

Error Type Composition

| Language | Config | Insertions | Deletions | Substitutions | Correct | Total Ref |
| --- | --- | --- | --- | --- | --- | --- |
| English | Greedy | 22 | 10 | 76 | 1,784 | 1,870 |
| English | +LM | 36 | 9 | 65 | 1,796 | 1,870 |
| Spanish | Greedy | 59 | 182 | 919 | 2,149 | 3,250 |
| Spanish | +LM | 86 | 145 | 835 | 2,270 | 3,250 |
| German | Greedy | 43 | 93 | 731 | 1,856 | 2,680 |
| German | +LM | 86 | 38 | 653 | 1,989 | 2,680 |

The LM consistently increases insertions across all languages (it's more willing to "guess" a word than stay silent) while decreasing deletions and substitutions. For English, these effects roughly cancel out. For Spanish and German, the deletion and substitution reductions decisively outweigh the insertion increase.


Latency: Whissle Delivers 2.4--4.5x Faster First Segments

How quickly does the first transcript appear after audio starts streaming? This matters enormously for live captioning, real-time coaching, and conversational AI.

Time to First Segment

| Language | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | Whissle Advantage |
| --- | --- | --- | --- | --- |
| English | 967ms | 961ms | 4,373ms | 4.5x faster |
| Spanish | 1,197ms | 1,214ms | 2,919ms | 2.4x faster |
| German | 1,170ms | 1,152ms | 3,613ms | 3.1x faster |

Whissle delivers the first transcript 2.4--4.5x faster than Deepgram across every language. The LM adds negligible latency (~15ms difference on average) because KenLM lookups are hash table queries that execute in microseconds.

Latency percentiles:

Latency Percentiles

| Percentile | Whissle +LM (EN) | DG Nova-3 (EN) | Whissle +LM (ES) | DG Nova-3 (ES) | Whissle +LM (DE) | DG Nova-3 (DE) |
| --- | --- | --- | --- | --- | --- | --- |
| P50 | 961ms | 4,373ms | 1,214ms | 2,919ms | 1,152ms | 3,613ms |
| P95 | 1,328ms | 5,509ms | 1,732ms | 5,282ms | 2,219ms | 5,460ms |
| P99 | 1,551ms | 5,855ms | 1,991ms | 5,460ms | 2,594ms | 5,664ms |

Even at P99, Whissle stays under 2.6 seconds across all languages. Deepgram's P95 and P99 consistently land in the 5.3--5.9s range, suggesting a fixed buffering strategy independent of language.


Reliability: AssemblyAI Fails Completely on Non-English

Failure Rates

| Language | Samples | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | AssemblyAI |
| --- | --- | --- | --- | --- | --- |
| English | 100 | 0 (0%) | 0 (0%) | 0 (0%) | 5 (5%) |
| Spanish | 100 | 0 (0%) | 0 (0%) | 0 (0%) | 100 (100%) |
| German | 100 | 0 (0%) | 0 (0%) | 0 (0%) | 100 (100%) |
| Total | 300 | 0 (0%) | 0 (0%) | 0 (0%) | 205 (68.3%) |

Whissle and Deepgram achieved zero failures across all 300 samples in all three languages. AssemblyAI failed on 5% of English samples and 100% of Spanish and German samples -- their streaming API does not appear to support non-English languages at all.

For any multilingual production application, AssemblyAI's streaming API is not viable. Their batch/pre-recorded API may support additional languages, but the real-time streaming endpoint tested here does not.


Character Error Rate: A Finer-Grained View

CER Comparison

CER measures accuracy at the character level -- useful for morphologically rich languages where a single wrong character can change a word's meaning (e.g., German "Haus" vs "Maus").

| Language | Whissle (greedy) | Whissle + LM | Deepgram Nova-3 | LM Impact |
| --- | --- | --- | --- | --- |
| English | 2.06% | 2.50% | 1.89% | +0.44% (worse) |
| Spanish | 11.12% | 10.99% | 5.45% | -0.13% (neutral) |
| German | 9.61% | 8.35% | 6.59% | -1.26% (better) |

German shows the strongest CER improvement from the LM (-1.26%), consistent with its morphological complexity creating more character-level errors that the LM can correct. Spanish CER is nearly unchanged, suggesting the LM's word-level improvements for Spanish come from recovering whole words (reducing deletions) rather than correcting individual characters.


Correct Word Recovery

Correct Words by Language

Another way to visualize the LM's effect: how many words out of the reference text were correctly recognized?

| Language | Ref Words | Greedy Correct | +LM Correct | DG Nova-3 Correct | LM Gain |
| --- | --- | --- | --- | --- | --- |
| English | 1,870 | 1,784 (95.4%) | 1,796 (96.0%) | 1,797 (96.1%) | +12 words |
| Spanish | 3,250 | 2,149 (66.1%) | 2,270 (69.8%) | 2,765 (85.1%) | +121 words |
| German | 2,680 | 1,856 (69.3%) | 1,989 (74.2%) | 2,135 (79.7%) | +133 words |

The LM recovers 121 additional correct words in Spanish and 133 in German -- meaningful improvements in transcript quality. For German, the LM closes nearly half of the 279-word gap between greedy Whissle and Deepgram Nova-3.


Scorecard

Multilingual ASR Benchmark Scorecard

Choose Deepgram when:

  • Transcription accuracy is your number one priority across all languages
  • You can tolerate 3--5 second latency to first transcript
  • You need multilingual support with the lowest WER available

Choose Whissle when:

  • You need the fastest first-segment response (~1 second) for real-time coaching, live captions, or voice agents
  • You need streaming emotion, intent, speech rate, or filler detection alongside transcription
  • Zero failure rates are non-negotiable
  • You want to self-host ASR on your own GPU infrastructure (eliminating per-minute costs)
  • You need n-gram LM customization -- bring your own domain-specific language model for medical, legal, or financial vocabulary

Avoid AssemblyAI streaming for:

  • Any non-English language (100% failure rate on Spanish and German)
  • Production applications requiring reliable streaming (5% failure rate even on English)

The Bigger Picture

N-gram LMs are cheap and effective. A 3-gram KenLM model trained on Wikipedia in a few hours provides 3--4% absolute WER reduction for non-English languages at near-zero inference cost. This is low-hanging fruit that any CTC-based ASR system should exploit.

The LM helps most where you need it most. For well-resourced languages like English where the acoustic model is already strong, the LM is roughly neutral. For languages with less training data or more morphological complexity, the LM's linguistic priors provide meaningful corrections. As we expand to more languages (Hindi, Arabic, Mandarin, and more), we expect even stronger LM benefits.

Multilingual streaming ASR is not universally solved. AssemblyAI's 100% failure rate on non-English streaming shows that multilingual support can't be assumed. If your application serves users in multiple languages, test thoroughly -- don't trust marketing claims.

Latency and accuracy are complementary, not opposed. Whissle delivers first transcripts 3--4x faster than Deepgram while the LM closes the accuracy gap. The next frontier isn't choosing between fast and accurate -- it's having both.

Self-hosted LM customization is a game changer. Because our KenLM models are standard ARPA-format n-grams, users can train domain-specific models on their own data -- medical terminology, legal jargon, product names -- and load them at startup. This is impossible with cloud-only providers.


Methodology Appendix

Benchmark script: Open source, available in our repository (scripts/benchmark_whissle_vs_deepgram.py). Uses HuggingFace datasets for all three languages. Full per-sample JSON results published for independent verification.

Text normalization: Lowercase, strip all punctuation, collapse whitespace. Applied identically to reference and hypothesis text for all providers and all languages.

WER computation: jiwer library (standard implementation). Per-sample WER averaged for mean; individual error counts (insertions, deletions, substitutions, hits) summed across all samples.

Infrastructure: Whissle on Cloud Run with NVIDIA L4 GPU (us-east4 region). Deepgram and AssemblyAI via their cloud APIs. All requests from the same machine, sequential per language (English, then Spanish, then German). April 2026.

KenLM models: 3-gram, trained on Wikipedia text with pruning thresholds 0 0 1. ENGLISH.bin for English, EUROPEAN.bin for Spanish and German. Alpha = 0.1, beta = 0.5, beam width = 100.

Deepgram model: Nova-3 (multilingual), released 2025. This is an upgrade from our previous benchmark which used Nova-2.

Reproducibility: All result JSONs (per-sample transcripts, WER, latency, error counts) are published as results_{en,es,de}_gpu_lm_v2.json. Every number in this post can be independently verified.


This is the second edition of our ASR benchmark series. The first edition tested English-only across four diverse datasets (LibriSpeech, Indian tech interviews, soccer broadcasts, and Supreme Court proceedings) with Whissle on CPU. This edition adds GPU acceleration, n-gram language models, and multilingual evaluation. We will continue expanding to more languages and more challenging real-world datasets.

If you're exploring ASR for your application, try the providers on your actual audio before making a decision. Benchmarks inform -- but your data decides.
