AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

03/29/2023
by   Paul Hongsuck Seo, et al.

Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information while at the same time performing lightweight domain adaptation. Specifically, (i) we inject visual embeddings into a frozen ASR model using lightweight trainable adaptors, which we show can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters; (ii) we introduce a simple curriculum scheme during training, which we show is crucial for enabling the model to jointly process audio and visual information effectively; and (iii) we show that our model achieves state-of-the-art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). Qualitative results show that our model effectively leverages visual information for robust speech recognition.
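
To make point (i) concrete, here is a minimal sketch of what injecting visual embeddings into a frozen model through lightweight adaptors could look like. The abstract gives no implementation details, so everything below is an assumption made for illustration: the residual bottleneck form of the adaptor, the zero-initialised up-projection, the choice to prepend visual tokens to the audio sequence, and all names and dimensions (`adapter`, `project_visual`, `build_input`, `DIM`, and so on). It is not the authors' code, and the curriculum from point (ii) is left out because the abstract does not describe it.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def adapter(x, W_down, W_up):
    """Hypothetical residual bottleneck adaptor: x + W_up @ relu(W_down @ x)."""
    hidden = [max(0.0, h) for h in matvec(W_down, x)]
    return [xi + ui for xi, ui in zip(x, matvec(W_up, hidden))]

def project_visual(feat, W_proj, num_tokens, dim):
    """Linearly map one visual feature vector to num_tokens tokens of size dim."""
    flat = matvec(W_proj, feat)  # length is num_tokens * dim
    return [flat[i * dim:(i + 1) * dim] for i in range(num_tokens)]

def build_input(visual_tokens, audio_tokens):
    """Assumed fusion scheme: prepend visual tokens to the audio tokens."""
    return visual_tokens + audio_tokens

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
# Toy sizes chosen only for illustration.
DIM, BOTTLENECK, VIS_DIM, NUM_VIS, NUM_AUDIO = 4, 2, 6, 2, 5

# The only trainable pieces in this sketch: projection and adaptor weights.
W_proj = rand_matrix(NUM_VIS * DIM, VIS_DIM)
W_down = rand_matrix(BOTTLENECK, DIM)
# Zero-initialised up-projection (an assumption): the adaptor starts as the
# identity, so the frozen model's behaviour is unchanged before training.
W_up = [[0.0] * BOTTLENECK for _ in range(DIM)]

# Stand-ins for features a frozen ASR model and a visual encoder would produce.
audio_tokens = rand_matrix(NUM_AUDIO, DIM)
visual_feat = [random.uniform(-1.0, 1.0) for _ in range(VIS_DIM)]

visual_tokens = project_visual(visual_feat, W_proj, NUM_VIS, DIM)
sequence = build_input(visual_tokens, audio_tokens)
adapted = [adapter(tok, W_down, W_up) for tok in sequence]
```

In this toy setup the frozen model's own weights never appear as trainable values; only `W_proj`, `W_down` and `W_up` would be updated, which is one way to read the abstract's claim of minimal additional parameters.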


