Meta-aware Voice Action Model
For AI companions to feel truly interactive, speech recognition needs to do more than just transcribe words. It must understand context, adapt on the fly, and act in real time. While models like Whisper excel at transcription, they miss crucial metadata such as speaker changes, named entities, and user intents, all of which are vital for natural conversation.
Existing approaches often require complex pipelines of multiple models, and while recent multi-modal LLMs can understand context, they suffer from high latency. This makes real-time, context-aware voice interaction a challenge.
The Challenge with Traditional ASR
Modern end-to-end ASR systems are excellent at transcription but fall short in real-world conversations where contextual metadata is key. The common challenges include:
- Cascading Pipelines: Most voice applications use a multi-step process where transcription is just the first step, followed by separate models for understanding intent, entities, and sentiment. This is often slow and inefficient.
- Loss of Information: Important acoustic and prosodic cues (like tone and emotion) that are crucial for understanding human behavior are often lost during simple transcription.
- Latency in LLMs: While powerful, large multi-modal language models that can understand context often have high latency, making them unsuitable for real-time applications.
How META-1 Works
META-1 is a multimodal, context-aware speech recognition model that simultaneously transcribes speech and annotates it with contextual tags like intents, named entities, and speaker changes. It builds upon pre-trained speech encoders but extends their vocabulary with special <user-defined-symbols>. These symbols are used to predict the various tagging tasks alongside transcription.
The model is fine-tuned on pairs of speech recordings and annotated text. The text contains both the transcription and the user-defined symbols in-line. Here’s a look at the core components:
- Feature Extractor: An MFCC-based feature extractor processes raw audio into latent representations.
- Voice Encoder: A Transformer or Conformer-based encoder further processes these features.
- Multi-Task Decoder: A single autoregressive decoder generates both the transcription and the contextual tags. The special symbols are part of the decoder's vocabulary, allowing it to mark events like entities, intents, or speaker changes directly in the output.
This single-stage approach means the model outputs a final, enriched transcript without needing any post-processing, making it highly efficient.
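To make the vocabulary extension concrete, here is a minimal sketch of how the tag set could be added to a decoder tokenizer and how an annotated training target would be tokenized. It assumes a Hugging Face-style tokenizer and a Whisper checkpoint purely for illustration; the tag names are taken from the sample pairs below, and nothing here is META-1's actual code.

```python
# Minimal sketch (assumed API: Hugging Face transformers; Whisper checkpoint
# used only as a stand-in pre-trained speech model). Illustrative, not META-1's code.
from transformers import AutoTokenizer, AutoModelForSpeechSeq2Seq

# User-defined symbols covering entities, demographics, emotion, intent, and turns.
SPECIAL_TAGS = [
    "ENTITY_PRODUCT", "ENTITY_ORGANIZATION", "ENTITY_LOCATION", "END",
    "AGE_18_30", "AGE_30_45", "GER_MALE", "GER_FEMALE",
    "EMOTION_NEUTRAL", "EMOTION_HAP", "INTENT_INFORM", "SPEAKER_CHANGE",
]

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")

# Extend the decoder vocabulary so each tag becomes a single token,
# then resize the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TAGS})
model.resize_token_embeddings(len(tokenizer))

# An annotated target mixes plain transcription with in-line tags, so a single
# autoregressive pass can emit the enriched transcript directly.
target = (
    "The copyright of ENTITY_PRODUCT TiddlyWiki END is held in trust by "
    "ENTITY_ORGANIZATION UnaMesa END, a Non-profit organization. "
    "AGE_30_45 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
labels = tokenizer(target).input_ids
```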
Sample Training Pairs
| Audio Sample | Annotated Transcription |
|---|---|
| /audios/common_voice_en_19948199.mp3 | The copyright of ENTITY_PRODUCT TiddlyWiki END is held in trust by ENTITY_ORGANIZATION UnaMesa END, a Non-profit organization. AGE_30_45 GER_MALE EMOTION_NEUTRAL INTENT_INFORM |
| /audios/10_Countries_where_I_felt_SAFEST_Traveling_ALONE_+_5_That_Are_NOT_16k_chunk_3.wav | ENTITY_LOCATION iceland END. number two, ENTITY_LOCATION japan END. i have had just the most wonderful time in ENTITY_LOCATION japan END with the local people. they are always so kind, so helpful. even when they don't speak english very well, i find that people really try if i approach them. if i have like a translator, something like that, i find that people will put in the effort to help me out. cradim overall is also just lower in ENTITY_LOCATION japan END. AGE_18_30 GER_FEMALE EMOTION_NEU SPEAKER_CHANGE overall, it's also just lower inch. AGE_30_45 GER_FEMALE EMOTION_NEU INTENT_INFORM |
| /audios/10_Best_Travel_Channels_on_YouTube_to_follow_travel_virtually_my_personal_favorites_16k_chunk_13.wav | in europe in 2014. back then, he started to focus more and more on unique experiences, strange places and crazy festivals. AGE_30_45 GER_MALE EMOTION_HAP SPEAKER_CHANGE we are in ENTITY_LOCATION san juan de la vega END and ENTITY_LOCATION guanajuato END, ENTITY_LOCATION mexico END. and this is the exploding hammer festival. i've got these pumps. AGE_30_45 GER_MALE EMOTION_HAP INTENT_INFORM |
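Because the metadata is emitted in-line, downstream code only needs lightweight string parsing to recover structured information. The sketch below is a hypothetical helper (not part of any released API) that extracts entity spans and utterance-level tags from an enriched transcript in the format shown above.

```python
import re

# Utterance-level tags that trail a segment (tag names taken from the samples above).
UTTERANCE_TAGS = re.compile(r"\b(AGE_\w+|GER_\w+|EMOTION_\w+|INTENT_\w+|SPEAKER_CHANGE)\b")
# Entity spans look like: ENTITY_<TYPE> <surface text> END
ENTITY_SPAN = re.compile(r"ENTITY_(\w+)\s+(.+?)\s+END")

def parse_enriched(transcript: str) -> dict:
    """Split an enriched META-1-style transcript into plain text, entities, and tags."""
    entities = [
        {"type": m.group(1), "text": m.group(2)}
        for m in ENTITY_SPAN.finditer(transcript)
    ]
    tags = UTTERANCE_TAGS.findall(transcript)
    # Strip all tags to recover the plain transcription.
    plain = ENTITY_SPAN.sub(r"\2", transcript)
    plain = UTTERANCE_TAGS.sub("", plain)
    plain = re.sub(r"\s+", " ", plain).strip()
    return {"text": plain, "entities": entities, "tags": tags}

example = ("The copyright of ENTITY_PRODUCT TiddlyWiki END is held in trust by "
           "ENTITY_ORGANIZATION UnaMesa END, a Non-profit organization. "
           "AGE_30_45 GER_MALE EMOTION_NEUTRAL INTENT_INFORM")
print(parse_enriched(example))
```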
Architecture for Real-Time Performance
To ensure real-time operation, META-1 uses a two-part design:
- Backbone Transformer: A large encoder (based on Llama or Conformer) processes speech frames and previously generated tokens to capture long-range contextual dependencies.
- Lightweight Decoder: A smaller, specialized decoder head generates both transcription and metadata tokens at each step. This design keeps latency low, even with a large backbone, by amortizing the computational cost.
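The sketch below illustrates this split, assuming PyTorch: a large Transformer backbone (non-causal here for brevity, rather than the Llama/Conformer variants mentioned above) consumes speech frames plus previously generated tokens, and a small decoder head maps its states to the joint text-and-tag vocabulary. All dimensions, layer counts, and module names are illustrative assumptions rather than the actual META-1 configuration.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Large Transformer over speech frames + previously generated tokens."""
    def __init__(self, d_model=1024, n_layers=24, n_heads=16, vocab_size=51200):
        super().__init__()
        self.frame_proj = nn.Linear(80, d_model)           # e.g. 80-dim MFCC frames
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames, prev_tokens):
        # Concatenate audio features and token embeddings to capture
        # long-range contextual dependencies in a single pass.
        x = torch.cat([self.frame_proj(frames), self.token_emb(prev_tokens)], dim=1)
        return self.encoder(x)

class LightweightDecoder(nn.Module):
    """Small head that emits transcription + metadata tokens at each step."""
    def __init__(self, d_model=1024, d_head=256, vocab_size=51200):
        super().__init__()
        self.down = nn.Linear(d_model, d_head)              # project to a small width
        layer = nn.TransformerEncoderLayer(d_head, 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, 2)        # only two small layers
        self.lm_head = nn.Linear(d_head, vocab_size)          # joint text + tag vocabulary

    def forward(self, backbone_states):
        return self.lm_head(self.blocks(self.down(backbone_states)))

backbone, decoder = Backbone(), LightweightDecoder()
frames = torch.randn(1, 200, 80)                              # ~2 s of audio features
prev = torch.zeros(1, 1, dtype=torch.long)                    # start token
logits = decoder(backbone(frames, prev))                      # next-token scores
```

The design point is that per-step decoding cost is dominated by the tiny head, so the expensive backbone can stay large without blowing up latency.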
Efficient Training with Compute Amortization
Training the model with long audio contexts and metadata can be resource-intensive. We use a compute amortization scheme to manage this:
- The full backbone is trained on every input frame to maintain high transcription quality.
- The decoder is selectively trained for tagging tasks on a random subset of frames (e.g., 1 in 16).
This approach significantly improves training efficiency without a noticeable drop in performance.
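A minimal sketch of this scheme, assuming PyTorch and a shared output vocabulary: the transcription loss covers every output position, while the tagging loss is restricted to a freshly sampled 1-in-16 subset each step. The exact loss split and variable names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

AMORTIZE_RATIO = 16  # e.g. train the tagging objective on 1 in 16 positions

def training_step(logits, asr_targets, tag_targets, pad_id=-100):
    """logits: (B, T, V); asr_targets / tag_targets: (B, T) token ids."""
    B, T, V = logits.shape

    # Full transcription loss on every position keeps ASR quality high.
    asr_loss = F.cross_entropy(
        logits.reshape(B * T, V), asr_targets.reshape(B * T), ignore_index=pad_id
    )

    # Tagging loss only on a random ~1/16 subset, resampled each step.
    keep = torch.rand(B, T, device=logits.device) < (1.0 / AMORTIZE_RATIO)
    masked_tags = tag_targets.masked_fill(~keep, pad_id)
    tag_loss = F.cross_entropy(
        logits.reshape(B * T, V), masked_tags.reshape(B * T), ignore_index=pad_id
    )

    return asr_loss + tag_loss
```

Because the tagging gradient is computed on only a fraction of positions, the extra annotation objective adds little training cost on top of plain ASR.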
Datasets
We compiled a diverse, multilingual dataset from several public sources to train and evaluate our models, enabling them to understand and process speech in multiple languages for global applications.
| Language | Dataset Name | Video | Audio | # Samples | # Hours |
|---|---|---|---|---|---|
| English | AVSpeech | Yes | Yes | 39,117 | 71 |
| English | LibreSpeech-EN | No | Yes | 92,245 | 293 |
| English | YouTube | Yes | Yes | 8,633 | 50 |
| English | CommonVoice | No | Yes | 788,705 | 1,116 |
| English | PeopleSpeech | No | Yes | 1,737,742 | - |
| German | LibreSpeech-DE | No | Yes | 400,000 | 1,995 |
| German | CommonVoice | No | Yes | - | - |
| Spanish | LibreSpeech-ES | No | Yes | 14,135 | 59 |
| Spanish | CommonVoice | No | Yes | - | - |
| Italian | LibreSpeech-IT | No | Yes | 15,830 | 66 |
| Italian | CommonVoice | No | Yes | - | - |
| French | LibreSpeech-FR | No | Yes | 219,505 | 915 |
| French | CommonVoice | No | Yes | - | - |
| Portuguese | LibreSpeech-PT | No | Yes | 19,200 | 83 |
| Portuguese | CommonVoice | No | Yes | - | - |
Results and Conclusion
Our model achieves competitive Word Error Rates (WER) on benchmarks such as SLUE and SLURP while accurately inserting user-defined symbols for entity and intent tagging. The integrated approach yields higher F1 scores for entity and intent detection than multi-stage systems.
META-1 simplifies the deployment of context-aware voice applications by eliminating the need for separate NLP or diarization modules. Our ongoing work focuses on integrating external language models for even better performance and extending the model to more languages and complex conversational scenarios.