Building a Hindi Speech Model That Understands Context

By Whissle Research Team

May 3, 2026


Hindi is the fourth most-spoken language in the world, yet production-grade Hindi speech recognition that goes beyond raw transcription remains rare. Most systems can convert audio to text, but very few can tell you who is speaking, how they feel, or what they want — all from the same audio signal, in real time.

We built Whissle META-ASR v18 to solve exactly this: a single model that transcribes Hindi speech while simultaneously extracting speaker age, gender, emotion, intent, and named entities. In this post, we describe how we trained it, how it performs against two leading commercial APIs, and what makes a single-pass approach different from traditional pipelines.

The META Approach

META-ASR (Meta-aware Transcription with Entities) extends the standard CTC-based ASR architecture with two capabilities that run in parallel during inference:

  • Inline entity tokens: The model's vocabulary includes special tokens like ENTITY_PERSON_NAME, ENTITY_CITY, and END that bracket named entities directly in the transcription output (a parsing sketch follows this list).
  • Tag classifier head: A lightweight classification head attached to the encoder produces utterance-level predictions for AGE, GENDER, EMOTION, and INTENT without consuming any CTC decoder capacity.
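To make the inline-token format concrete, here is a minimal Python sketch of how a client could recover entity spans from a META-ASR transcript. The token names follow the examples above; the helper itself (`extract_entities`) is illustrative, not part of the Whissle API.

```python
import re

# Inline entity markup: ENTITY_<TYPE> <surface words> END
ENTITY_PATTERN = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END")

def extract_entities(transcript: str) -> tuple[str, list[dict]]:
    """Strip inline entity tokens and return (plain_text, entities)."""
    entities = []

    def _collect(match: re.Match) -> str:
        entities.append({"type": match.group(1), "text": match.group(2)})
        return match.group(2)  # keep the surface form in the plain text

    plain = ENTITY_PATTERN.sub(_collect, transcript)
    return re.sub(r"\s+", " ", plain).strip(), entities

text, ents = extract_entities(
    "उन्होंने अपने निकटतम प्रतिद्वंद्वी ENTITY_PERSON_NAME आशीष कुमार सिन्हा END को पराजित किया"
)
print(ents)  # [{'type': 'PERSON_NAME', 'text': 'आशीष कुमार सिन्हा'}]
```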

Architecture Overview

[Figure: META-ASR architecture]

FastConformer encoder with dual output: CTC decoder for text + entities, and a classification head for utterance-level metadata tags.

The core encoder is a 17-layer FastConformer pretrained on 60K hours of multilingual audio, then fine-tuned on curated Hindi data. The tag classifier uses masked mean pooling over encoder outputs followed by per-category linear heads. Both outputs are produced in a single forward pass — no pipeline, no post-processing.
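Below is a minimal PyTorch sketch of this dual-output design. The encoder is an opaque stand-in for the FastConformer, and the class counts in `tag_sizes` are illustrative; only the overall shape (shared encoder, per-frame CTC logits, pooled tag logits) mirrors the description above.

```python
import torch
import torch.nn as nn

class DualHeadASR(nn.Module):
    """Sketch: one encoder pass feeds both the CTC head and the tag heads."""

    def __init__(self, encoder: nn.Module, d_model: int = 256,
                 vocab_size: int = 660, tag_sizes: dict | None = None):
        super().__init__()
        # Illustrative class counts; the real label sets are in the tables below.
        tag_sizes = tag_sizes or {"age": 4, "gender": 2, "emotion": 3, "intent": 3}
        self.encoder = encoder                              # stand-in for FastConformer
        self.ctc_head = nn.Linear(d_model, vocab_size + 1)  # +1 for the CTC blank
        self.tag_heads = nn.ModuleDict(
            {name: nn.Linear(d_model, n) for name, n in tag_sizes.items()}
        )

    def forward(self, feats: torch.Tensor, lengths: torch.Tensor):
        enc = self.encoder(feats)                           # (B, T, d_model)
        # Masked mean pooling: ignore padded frames beyond each utterance length.
        mask = (torch.arange(enc.size(1), device=enc.device)[None, :]
                < lengths[:, None]).unsqueeze(-1).float()   # (B, T, 1)
        pooled = (enc * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        ctc_log_probs = self.ctc_head(enc).log_softmax(dim=-1)  # per-frame tokens
        tag_logits = {name: head(pooled) for name, head in self.tag_heads.items()}
        return ctc_log_probs, tag_logits
```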

Training Data

We curated 223K Hindi utterances from a pool of 760K candidates. The curation pipeline uses diversity-aware selection (k-center greedy sampling on MFCC embeddings, sketched after the table below) to maximize coverage of speakers, topics, and acoustic conditions. Each utterance is annotated with:

| Annotation | Examples | Method |
|---|---|---|
| Transcription | Devanagari text | Human + LLM correction |
| Named Entities | ENTITY_PERSON_NAME, ENTITY_CITY, ENTITY_ORGANIZATION | LLM-based NER pipeline |
| Speaker Age | AGE_18_30, AGE_30_45, AGE_45_60 | Acoustic classifier |
| Gender | GENDER_MALE, GENDER_FEMALE | Acoustic classifier |
| Emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD | Prosody analysis + LLM |
| Intent | INTENT_INFORM, INTENT_QUERY, INTENT_COMMAND | LLM classification |
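The diversity-aware selection step mentioned above can be sketched in a few lines of NumPy. Here `embeddings` would hold one MFCC-derived vector per candidate utterance; the Euclidean metric and the random choice of the first center are assumptions, not a description of our exact pipeline.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedily pick k points that maximize coverage (minimax distance)."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(embeddings.shape[0]))]  # arbitrary first center
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())        # the least-covered point joins the set
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# e.g. indices = k_center_greedy(mfcc_embeddings, k=223_000) to pick 223K of 760K
```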

Training Details

The model was fine-tuned on an NVIDIA A100 80GB GPU with the following configuration:

  • Base model: FastConformer-CTC-BPE with language adapter (17 Conformer layers, 256-dim)
  • Tokenizer: Aggregate SentencePiece (660 tokens covering Indo-Aryan scripts)
  • Batch size: 24 utterances, max 20s duration
  • Optimizer: AdamW with cosine annealing (peak LR 1e-4)
  • Augmentation: SpecAugment + ambient noise perturbation (environmental sounds injected at random SNR)
  • Tag classifier loss weight: 0.3 (30% of total loss allocated to metadata classification; a loss-combination sketch follows this list)
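A PyTorch sketch of the optimization recipe implied by this list. Reading "loss weight 0.3" as a convex combination of the CTC and tag losses is our interpretation of the configuration; the helper names are illustrative.

```python
import torch
from torch import nn

TAG_WEIGHT = 0.3  # 30% of the total loss goes to metadata classification

def build_optim(model: nn.Module, total_steps: int, peak_lr: float = 1e-4):
    """AdamW with cosine annealing, per the configuration above."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched

def joint_loss(ctc_loss: torch.Tensor, tag_losses: list[torch.Tensor]) -> torch.Tensor:
    """Convex combination: 70% CTC transcription loss, 30% averaged tag loss."""
    tag = torch.stack(tag_losses).mean()  # mean over age/gender/emotion/intent
    return (1 - TAG_WEIGHT) * ctc_loss + TAG_WEIGHT * tag
```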

Training Progression

[Figure: Validation WER across epochs]

Validation WER across epochs. The model began overfitting after epoch 0, with WER increasing from 17.73% to 20.67% by epoch 2. We selected the epoch-0 checkpoint for deployment.

Early stopping at epoch 0 might seem surprising, but it reflects a common dynamic in fine-tuning from strong pretrained models: the initial adaptation captures the target domain well, and further training causes the model to overfit to training-set-specific patterns (we observed spurious END tokens appearing in predictions by epoch 2).

Benchmark Setup

We evaluated four systems head-to-head on two Hindi test sets:

| System | Type | Model |
|---|---|---|
| Whissle v18 | On-device / self-hosted | FastConformer-CTC + Tag Classifier (480M params) |
| Deepgram Nova-2 | Cloud API | Proprietary multilingual ASR |
| Gemini 2.5 Flash | Cloud API | Multimodal LLM with audio understanding |
| Sarvam Saaras v3 | Cloud API | Indic-focused multilingual ASR |

Test sets:

  • In-house validation (5,000 samples): Randomly sampled from our 36K-sample validation set. These samples include annotated entities, age, gender, emotion, and intent labels — allowing us to measure both transcription and metadata accuracy.
  • FLEURS Hindi (418 samples): Google's standardized evaluation set for Hindi (google/fleurs hi_in test). Pure transcription evaluation — no metadata labels available. A scoring sketch follows below.
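The FLEURS side of this evaluation can be reproduced along the following lines with the `datasets` and `jiwer` libraries. The `transcribe` function stands in for whichever system is under test, and the text normalization shown is an assumption rather than our exact scoring protocol.

```python
import re
from datasets import load_dataset
import jiwer

def normalize(text: str) -> str:
    # Assumed normalization: strip punctuation, squeeze whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

fleurs = load_dataset("google/fleurs", "hi_in", split="test")

refs, hyps = [], []
for sample in fleurs:
    refs.append(normalize(sample["transcription"]))
    # `transcribe` is a hypothetical stand-in for the system under test.
    hyps.append(normalize(transcribe(sample["audio"]["array"])))

print(f"WER: {jiwer.wer(refs, hyps):.2%}")
```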

Results

WER Comparison

[Figure: WER comparison across systems]

Word Error Rate comparison across four systems on both test sets. Lower is better.

| System | In-House WER | FLEURS WER | Metadata |
|---|---|---|---|
| Whissle v18 | 17.87% | 21.25% | AGE, GENDER, EMOTION, INTENT, ENTITIES |
| Deepgram Nova-2 | 26.68% | 24.13% | None |
| Gemini 2.5 Flash | 18.75% | 17.42% | None |
| Sarvam Saaras v3 | 19.02% | 16.67% | None |

Tag Classification Performance

Beyond transcription, Whissle v18 extracts utterance-level metadata that no pure ASR system provides. Here's how the tag classifier performs on the in-house test set:

Tag Classification Accuracy

[Figure: Tag classification accuracy by category]

Per-category accuracy and macro F1 for the four metadata dimensions. These predictions are produced simultaneously with transcription in a single forward pass.

The tag classifier runs on pooled encoder representations — it adds no overhead since the encoder computation is already performed for CTC decoding. This is fundamentally different from pipeline approaches where a separate model would need to re-process the audio for each metadata dimension.

Sample Predictions

Here are real examples from the benchmark showing the model's output compared to ground truth:

Example 1

Reference: उन्होंने अपने निकटतम प्रतिद्वंद्वी ENTITY_PERSON_NAME आशीष कुमार सिन्हा END को पराजित किया

Whissle: उन्होंने अपने निकटतम प्रतिदंदी आशीष कुमार सिन्हा को पराजित कर दिया

Tags: AGE_60PLUS GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM

Example 2

Reference: मैट्रिक की तीन जिला टॉपरों को पांच पांच हजार ENTITY_CURRENCY नकद END और प्रशस्ति पत्र दिया गया

Whissle: मैट्रो के तीन जिला टॉपरों को पांच पांच हजार नपद और प्रशस्ती पत्र दिया गया

Tags: AGE_60PLUS GENDER_MALE EMOTION_NEUTRAL INTENT_INFORM

What Makes This Different

The key differentiator isn't just accuracy — it's what you get out of a single inference pass:

| Capability | Whissle META | Traditional ASR + Pipeline |
|---|---|---|
| Transcription | Single pass | Single pass |
| Named Entity Recognition | Same pass (inline tokens) | Separate NER model on text |
| Speaker Age / Gender | Same pass (tag classifier) | Separate audio classifier |
| Emotion Detection | Same pass (tag classifier) | Separate SER model |
| Intent Classification | Same pass (tag classifier) | Separate NLU model or LLM call |
| Total inference passes | 1 | 4-5 |

On our in-house test set, Whissle v18 achieves the lowest WER (17.87%) while also extracting metadata — Gemini comes close at 18.75%, followed by Sarvam Saaras v3 at 19.02%, but neither provides metadata. On the standardized FLEURS Hindi benchmark, Sarvam leads (16.67%) followed closely by Gemini (17.42%), with Whissle at 21.25%. This is expected: FLEURS consists of clean, read-aloud speech that plays to the strengths of cloud APIs with massive pretraining data, while our in-house data reflects real-world conditions with ambient noise, spontaneous speech, and diverse speakers, where Whissle's task-specific training excels.

Deepgram trails on both benchmarks (26.68% and 24.13%), suggesting its Hindi coverage may not be as mature. Critically, Whissle runs at 6ms per sample — 400x faster than cloud APIs — making it viable for real-time edge deployment.

For applications like real-time voice assistants, call center analytics, or live captioning with context, this single-pass approach eliminates the need for orchestrating multiple models.

Try It Yourself

The Hindi META-ASR model is available through multiple channels:

  • Whissle API: api.whissle.ai — production-ready REST API with streaming WebSocket support
  • HuggingFace: WhissleAI/STT-meta-HI — download the model weights and run locally (a loading sketch appears after this list)
  • Self-hosted Docker: Our unified gateway image bundles ASR + agent + backend into a single container (deployment guide)
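For the HuggingFace route, the checkpoint is a NeMo model, so loading it locally should look roughly like the sketch below. The `.nemo` filename is an assumption — check the model card for the actual artifact name and for how tag predictions are surfaced in the decoded output.

```python
from huggingface_hub import hf_hub_download
import nemo.collections.asr as nemo_asr

# Filename is hypothetical -- see the model card for the actual .nemo artifact.
ckpt = hf_hub_download(repo_id="WhissleAI/STT-meta-HI", filename="stt_meta_hi.nemo")
model = nemo_asr.models.ASRModel.restore_from(ckpt)

# Decoded text carries the inline ENTITY_* ... END markup described above.
print(model.transcribe(["sample_hindi.wav"]))
```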

We're actively expanding to more languages — Mandarin, Bengali, Marathi, Punjabi, and European languages are in the pipeline. If you're building voice-first applications that need more than transcription, reach out — we'd love to hear what you're working on.
