Cross-stitched Multi-modal Encoders

Karan Singla

Nov 14 2023


We propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune them on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech, and it efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource-efficient, and can be trained on a single consumer GPU card.
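The fusion step described above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact implementation: the class name, dimensions, and layer layout are assumptions, and the pretrained encoders are stubbed out with random features. It shows both ideas from the abstract: cross-modal multi-headed attention between the two token streams, and pooling plus concatenation for utterance-level prediction.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: fuse speech and text encoder outputs
    with multi-headed cross-modal attention in both directions."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Text tokens attend over speech frames, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(d_model)
        self.norm_speech = nn.LayerNorm(d_model)

    def forward(self, text_feats, speech_feats):
        # text_feats: (B, T_text, d) from a pretrained text encoder
        # speech_feats: (B, T_speech, d) from a pretrained speech encoder
        t_attn, _ = self.text_to_speech(text_feats, speech_feats, speech_feats)
        s_attn, _ = self.speech_to_text(speech_feats, text_feats, text_feats)
        # Residual connection + layer norm, keeping per-token outputs
        # so the model can also do token-level classification.
        text_out = self.norm_text(text_feats + t_attn)
        speech_out = self.norm_speech(speech_feats + s_attn)
        return text_out, speech_out

fusion = CrossModalFusion()
text = torch.randn(2, 10, 256)    # stand-in for text encoder output
speech = torch.randn(2, 50, 256)  # stand-in for speech encoder output
text_out, speech_out = fusion(text, speech)

# Utterance-level prediction: mean-pool each modality and concatenate,
# giving one fixed-size vector per utterance for a classifier head.
utt = torch.cat([text_out.mean(dim=1), speech_out.mean(dim=1)], dim=-1)
print(utt.shape)
```

The simpler baseline the post compares against would skip the attention layers and concatenate the pooled encoder outputs directly.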

Paper

Active use-cases

  • Punctuation restoration and capitalization using an audio LM. Similar code is provided by NVIDIA NeMo.
  • Prompt-enabled end-to-end (E2E) ASR.