Using Visuals for Better Sound Awareness
In our daily lives, we effortlessly use visual cues—like lip movements or the surrounding scene—to understand what someone is saying, especially in noisy environments. However, Automatic Speech Recognition (ASR) systems often struggle when background noise is present. At Whissle, we're working to bridge this gap with a model that improves transcription by correlating noise sources with what is happening visually in the scene.
A New Approach to Noisy Speech
Unlike many audio-visual models that focus on lip-reading, our approach draws on broader visual information from the environment. This allows our model to separate speech from noise more naturally, much like the human brain does. This research is based on our recent paper, Visual-Aware Speech Recognition for Noisy Scenarios.
How It Works
We've designed an architecture that integrates pre-trained audio and visual encoders using a multi-headed attention mechanism. Trained on video inputs, the model learns to transcribe speech while simultaneously predicting labels for the noise sources it sees. This design is based on the hypothesis that teaching a model to recognize the visual sources of noise leads to better speech recognition.
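As a rough illustration, here is a minimal PyTorch sketch of this kind of fusion: audio frames from a pre-trained speech encoder attend over features from a pre-trained vision encoder, and the fused representation feeds two heads, one for transcription and one for noise labels. The module names, feature dimensions, and output-head setup are illustrative assumptions, not the exact architecture from the paper.

import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    # Illustrative fusion module: cross-attention from audio frames to visual frames,
    # followed by a transcription head and a noise-label head.
    def __init__(self, audio_dim=768, visual_dim=768, num_heads=8,
                 vocab_size=5000, num_noise_labels=50):
        super().__init__()
        # Project visual features into the audio feature space.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        # Audio frames (queries) attend over visual frames (keys/values).
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)
        # Two output heads: per-frame transcription logits and clip-level noise labels.
        self.asr_head = nn.Linear(audio_dim, vocab_size)
        self.noise_head = nn.Linear(audio_dim, num_noise_labels)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T_audio, audio_dim)  from a pre-trained speech encoder
        # visual_feats: (batch, T_video, visual_dim) from a pre-trained vision encoder
        v = self.visual_proj(visual_feats)
        attended, _ = self.cross_attn(query=audio_feats, key=v, value=v)
        fused = self.norm(audio_feats + attended)
        asr_logits = self.asr_head(fused)                 # per-frame token logits
        noise_logits = self.noise_head(fused.mean(dim=1)) # pooled noise-label logits
        return asr_logits, noise_logits

In this sketch the two heads would be trained jointly, for example with an ASR loss on the transcription logits and a classification loss on the noise labels, reflecting the idea that recognizing visual noise sources supports better transcription.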
A Scalable Dataset Pipeline
One of the key contributions of this work is a scalable pipeline for creating audio-visual datasets. We mix audio-visual noise clips (from AudioSet) with clean speech (from People's Speech) at varying signal-to-noise ratios. This automated process eliminates the need for manually curated, specialized datasets and allows us to generate a large training corpus.
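To make the mixing step concrete, here is a small sketch of how a clean-speech clip can be combined with a noise clip at a chosen signal-to-noise ratio. The function name and SNR grid are illustrative; the actual pipeline details (sampling rates, alignment with the noise video, label bookkeeping) may differ.

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `speech`."""
    # Loop or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Choose a scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: generate mixtures at several noise levels for one speech/noise pair.
# snr_grid = [20, 10, 5, 0]  # dB values are illustrative
# mixtures = {snr: mix_at_snr(clean_clip, audioset_clip, snr) for snr in snr_grid}

Sweeping a grid of SNRs like this lets one clean-speech corpus and one noise corpus expand into a much larger set of noisy training examples without any manual curation.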
Significant Improvements
Our results show a significant improvement in transcription accuracy over existing audio-only models in noisy scenarios. The experiments confirm that visual cues play a vital role in this improvement, allowing the model to outperform larger audio-only models while using fewer parameters.
Future Work
We plan to continue improving our model by exploring additional pre-trained checkpoints and expanding our dataset to include more complex noisy scenarios. We believe this approach has the potential to scale up and be adopted by the wider community to build more robust and efficient Audio-Visual Speech Recognition systems.