Building a Hindi Speech Model That Understands Context

By Whissle Research Team

May 03 2026

Hindi is the fourth most-spoken language in the world, yet production-grade Hindi speech recognition that goes beyond raw transcription remains rare. Most systems can convert audio to text, but very few can tell you who is speaking, how they feel, or what they want — all from the same audio signal, in real time.

We built Whissle META-ASR v18 to solve exactly this: a single model that transcribes Hindi speech while simultaneously extracting speaker age, gender, emotion, intent, and named entities. In this post, we describe how we trained it, how it performs against three commercial APIs, and what makes a single-pass approach different from traditional pipelines.

The META Approach

META-ASR (Meta-aware Transcription with Entities) extends the standard CTC-based ASR architecture with two capabilities that run in parallel during inference:

Inline entity tokens: The model's vocabulary includes special tokens like ENTITY_PERSON_NAME, ENTITY_CITY, and END that bracket named entities directly in the transcription output.
Tag classifier head: A lightweight classification head attached to the encoder produces utterance-level predictions for AGE, GENDER, EMOTION, and INTENT without consuming any CTC decoder capacity.
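To make the inline-token format concrete, here is a minimal sketch of how a client might split a decoded string into plain text and entity spans. It assumes whitespace-separated tokens, opening markers that start with ENTITY_, and a bare END closing marker, as in the examples in this post. The function name parse_inline_entities is hypothetical and is not part of any Whissle SDK.

```python
from typing import List, Optional, Tuple


def parse_inline_entities(decoded: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a decoded string into plain text and (label, span) pairs.

    Assumed format: "... ENTITY_<LABEL> word word END ...".
    """
    words: List[str] = []                 # transcript without marker tokens
    entities: List[Tuple[str, str]] = []  # completed (label, span text) pairs
    label: Optional[str] = None           # label of the entity currently open
    span: List[str] = []                  # words collected for the open entity

    for token in decoded.split():
        if token.startswith("ENTITY_"):
            # Open a new entity; an unfinished earlier span is discarded.
            label, span = token[len("ENTITY_"):], []
        elif token == "END":
            # A stray END with no open entity is dropped, not kept as text.
            if label is not None:
                entities.append((label, " ".join(span)))
            label = None
        else:
            words.append(token)
            if label is not None:
                span.append(token)

    return " ".join(words), entities
```

Dropping stray END markers rather than keeping them as text is a design choice for this sketch. It keeps the plain transcript clean even when the decoder emits an unmatched marker.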

Architecture Overview

FastConformer encoder with dual output: CTC decoder for text + entities, and a classification head for utterance-level metadata tags.

The core encoder is a 17-layer FastConformer pretrained on 60K hours of multilingual audio, then fine-tuned on curated Hindi data. The tag classifier uses masked mean pooling over encoder outputs followed by per-category linear heads. Both outputs are produced in a single forward pass — no pipeline, no post-processing.
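The pooling step can be illustrated with a small, dependency-free sketch. It averages only the non-padded encoder frames and then applies one linear head per tag category. This uses plain Python lists for readability; a real implementation would use batched tensor operations. The helper names, the head layout, and the toy weights are all assumptions for illustration, not the production code.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Head = Tuple[Sequence[Vector], Vector]  # (weight rows, bias), one row per class


def masked_mean_pool(frames: Sequence[Vector], mask: Sequence[bool]) -> Vector:
    """Average only the frames whose mask entry is True (non-padding)."""
    kept = [f for f, keep in zip(frames, mask) if keep]
    if not kept:
        raise ValueError("mask selects no frames")
    dim = len(kept[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


def linear_head(pooled: Vector, weights: Sequence[Vector], bias: Vector) -> Vector:
    """One logit per class: dot(weights[c], pooled) + bias[c]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


def classify(pooled: Vector,
             heads: Dict[str, Head],
             labels: Dict[str, List[str]]) -> Dict[str, str]:
    """Apply an independent linear head per category and take the argmax."""
    result = {}
    for category, (weights, bias) in heads.items():
        logits = linear_head(pooled, weights, bias)
        best = max(range(len(logits)), key=logits.__getitem__)
        result[category] = labels[category][best]
    return result
```

Because every head reads the same pooled vector, adding a new tag category only adds one more small linear layer on top of encoder output that is already computed.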

Training Data

We curated 223K Hindi utterances from a pool of 760K candidates. The curation pipeline uses diversity-aware selection (k-center greedy sampling on MFCC embeddings) to maximize coverage of speakers, topics, and acoustic conditions. Each utterance is annotated with:

Annotation	Examples	Method
Transcription	Devanagari text	Human + LLM correction
Named Entities	ENTITY_PERSON_NAME ENTITY_CITY ENTITY_ORGANIZATION	LLM-based NER pipeline
Speaker Age	AGE_18_30 AGE_30_45 AGE_45_60	Acoustic classifier
Gender	GENDER_MALE GENDER_FEMALE	Acoustic classifier
Emotion	EMOTION_NEUTRAL EMOTION_HAPPY EMOTION_SAD	Prosody analysis + LLM
Intent	INTENT_INFORM INTENT_QUERY INTENT_COMMAND	LLM classification
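The diversity-aware selection step mentioned above can be sketched as farthest-point (k-center greedy) sampling. The version below assumes each utterance has already been reduced to a fixed-length embedding vector (for example, an averaged MFCC vector) and uses Euclidean distance. The function name, the distance metric, and the choice of starting point are assumptions of this sketch, not details confirmed by our pipeline description.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Return the indices of k points chosen by farthest-point selection."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    selected = [first]
    chosen = {first}
    # Distance from every point to its nearest selected center so far.
    nearest = [math.dist(p, points[first]) for p in points]

    while len(selected) < k:
        # Pick the unselected point that is currently worst covered.
        candidates = (i for i in range(len(points)) if i not in chosen)
        nxt = max(candidates, key=nearest.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        # Adding a center can only shrink each point's nearest distance.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))

    return selected
```

This loop costs O(n * k) distance computations. At the scale of 760K candidates a practical version would vectorize the distance update, but the selection logic stays the same.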

Training Details

The model was fine-tuned on an NVIDIA A100 80GB GPU with the following configuration:

Base model: FastConformer-CTC-BPE with language adapter (17 Conformer layers, 256-dim)
Tokenizer: Aggregate SentencePiece (660 tokens covering Indo-Aryan scripts)
Batch size: 24 utterances, max 20s duration
Optimizer: AdamW with cosine annealing (peak LR 1e-4)
Augmentation: SpecAugment + ambient noise perturbation (environmental sounds injected at random SNR)
Tag classifier loss weight: 0.3 (30% of total loss allocated to metadata classification)
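Two of the settings above lend themselves to a short worked sketch: the loss weighting and the learning-rate schedule. The code assumes the total loss is a convex combination, (1 - 0.3) * CTC loss + 0.3 * tag loss, which is one reading of "30% of total loss". It also assumes plain cosine annealing from the peak learning rate to zero with no warmup. Both forms are assumptions; the exact formulas used in training are not spelled out here.

```python
import math

PEAK_LR = 1e-4          # peak learning rate from the configuration above
TAG_LOSS_WEIGHT = 0.3   # tag classifier loss weight from the configuration above


def combined_loss(ctc_loss: float, tag_loss: float,
                  tag_weight: float = TAG_LOSS_WEIGHT) -> float:
    """Assumed form: (1 - w) * CTC loss + w * tag classification loss."""
    return (1.0 - tag_weight) * ctc_loss + tag_weight * tag_loss


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine annealing from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these assumptions, a batch with CTC loss 2.0 and tag loss 1.0 yields a total of 0.7 * 2.0 + 0.3 * 1.0 = 1.7, and the learning rate is half its peak at the midpoint of training.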

Training Progression

Validation WER across epochs. The model began overfitting after epoch 0, with WER increasing from 17.73% to 20.67% by epoch 2. We selected the epoch-0 checkpoint for deployment.

Early stopping at epoch 0 might seem surprising, but it reflects a common dynamic in fine-tuning from strong pretrained models: the initial adaptation captures the target domain well, and further training causes the model to overfit to training-set-specific patterns (we observed spurious END tokens appearing in predictions by epoch 2).

Benchmark Setup

We evaluated four systems head-to-head on two Hindi test sets:

System	Type	Model
Whissle v18	On-device / self-hosted	FastConformer-CTC + Tag Classifier (480M params)
Deepgram Nova-2	Cloud API	Proprietary multilingual ASR
Gemini 2.5 Flash	Cloud API	Multimodal LLM with audio understanding
Sarvam Saaras v3	Cloud API	Indic-focused multilingual ASR

Test sets:

In-house validation (5,000 samples): Randomly sampled from our 36K-sample validation set. These samples include annotated entities, age, gender, emotion, and intent labels — allowing us to measure both transcription and metadata accuracy.
FLEURS Hindi (418 samples): Google's standardized evaluation set for Hindi (google/fleurs hi_in test). Pure transcription evaluation — no metadata labels available.

Results

WER Comparison

Word Error Rate comparison across four systems on both test sets. Lower is better.

System	In-House WER	FLEURS WER	Metadata
Whissle v18	17.87%	21.25%	AGE, GENDER, EMOTION, INTENT, ENTITIES
Deepgram Nova-2	26.68%	24.13%	None
Gemini 2.5 Flash	18.75%	17.42%	None
Sarvam Saaras v3	19.02%	16.67%	None
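For readers who want to reproduce the metric, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The sketch below is a standard dynamic-programming implementation. It splits on whitespace and applies no text normalization, which is an assumption; the normalization and entity-token handling used for the numbers in the table are not specified in this post.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Applied to Example 1 in the Sample Predictions section with the marker tokens stripped from the reference, this counts one substitution for the misspelled word plus a substitution and an insertion for the final verb, giving 3 errors over 10 reference words.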

Tag Classification Performance

Beyond transcription, Whissle v18 extracts utterance-level metadata that no pure ASR system provides. Here's how the tag classifier performs on the in-house test set:

Tag Classification Accuracy

Per-category accuracy and macro F1 for the four metadata dimensions. These predictions are produced simultaneously with transcription in a single forward pass.

The tag classifier runs on pooled encoder representations, so it adds negligible overhead: the encoder computation is already performed for CTC decoding, and the only extra work is the pooling step and a few small linear layers. This is fundamentally different from pipeline approaches, where a separate model would need to re-process the audio for each metadata dimension.
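The two metrics reported for the tag classifier can be computed as in the sketch below. Accuracy is the fraction of exact matches. Macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as common ones. The function names and the toy labels in the usage note are illustrative only and are not benchmark data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted tag equals the reference tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero because
        # each label appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```

For instance, with references [EMOTION_NEUTRAL, EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD] and predictions [EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_HAPPY, EMOTION_SAD], accuracy is 0.75 and macro F1 is 7/9, because the two confused classes each score 2/3 and the third scores 1.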

Sample Predictions

Here are real examples from the benchmark showing the model's output compared to ground truth:

Example 1

Reference: उन्होंने अपने निकटतम प्रतिद्वंद्वी ENTITY_PERSON_NAME आशीष कुमार सिन्हा END को पराजित किया

Whissle: उन्होंने अपने निकटतम प्रतिदंदी आशीष कुमार सिन्हा को पराजित कर दिया

Tags: AGE_60PLUS GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM

Example 2

Reference: मैट्रिक की तीन जिला टॉपरों को पांच पांच हजार ENTITY_CURRENCY नकद END और प्रशस्ति पत्र दिया गया

Whissle: मैट्रो के तीन जिला टॉपरों को पांच पांच हजार नपद और प्रशस्ती पत्र दिया गया

Tags: AGE_60PLUS GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM

What Makes This Different

The key differentiator isn't just accuracy — it's what you get out of a single inference pass:

Capability	Whissle META	Traditional ASR + Pipeline
Transcription	Single pass	Single pass
Named Entity Recognition	Same pass (inline tokens)	Separate NER model on text
Speaker Age / Gender	Same pass (tag classifier)	Separate audio classifier
Emotion Detection	Same pass (tag classifier)	Separate SER model
Intent Classification	Same pass (tag classifier)	Separate NLU model or LLM call
Total inference passes	1	4-5

On our in-house test set, Whissle v18 achieves the lowest WER (17.87%) while also extracting metadata. Gemini comes close at 18.75%, followed by Sarvam Saaras v3 at 19.02%, but neither provides metadata. On the standardized FLEURS Hindi benchmark, Sarvam leads (16.67%), followed closely by Gemini (17.42%), with Whissle at 21.25%.

This is expected: FLEURS consists of clean, read-aloud speech that plays to the strengths of cloud APIs with massive pretraining data, while our in-house data reflects real-world conditions with ambient noise, spontaneous speech, and diverse speakers, where Whissle's task-specific training excels. Deepgram trails on both benchmarks (26.68% and 24.13%), suggesting its Hindi coverage may not be as mature.

Critically, Whissle runs at 6ms per sample, 400x faster than cloud APIs, making it viable for real-time edge deployment.

For applications like real-time voice assistants, call center analytics, or live captioning with context, this single-pass approach eliminates the need for orchestrating multiple models.

Try It Yourself

The Hindi META-ASR model is available through multiple channels:

Whissle API: api.whissle.ai — production-ready REST API with streaming WebSocket support
HuggingFace: WhissleAI/STT-meta-HI — download the model weights and run locally
Self-hosted Docker: Our unified gateway image bundles ASR + agent + backend into a single container (deployment guide)

We're actively expanding to more languages — Mandarin, Bengali, Marathi, Punjabi, and European languages are in the pipeline. If you're building voice-first applications that need more than transcription, reach out — we'd love to hear what you're working on.