Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

02/15/2022
by Zi-Qiang Zhang, et al.
Advances in self-supervised learning for the audio and visual modalities have made it possible to learn robust audio-visual speech representations. Such representations should improve audio-visual speech recognition (AVSR) performance, since multi-modal inputs in principle carry richer information than either modality alone. In this paper, we propose an audio-visual representation learning approach that builds on existing self-supervised representation learning methods for the audio modality. The approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. It can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The pre-trained model is evaluated on speech recognition and lipreading tasks using one or two modalities, where it shows superior performance.
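
The idea of masking out one modality before the fusion module can be illustrated with a minimal sketch. The abstract does not specify how masking is implemented, so everything below is an assumption for illustration only: the function names `mask_modality` and `sample_drop` are hypothetical, masking is modeled as zeroing a stream's frame features, fusion input is modeled as frame-wise concatenation, and the drop probabilities are arbitrary placeholders rather than values from the paper.

```python
import random


def mask_modality(audio, visual, drop=None):
    """Build a fusion-module input from frame-aligned audio and visual features.

    ASSUMPTION (not from the paper): masking a modality means replacing its
    frame features with zeros, and fusion input is frame-wise concatenation.
    `drop` is None (keep both), "audio", or "visual".
    """
    if drop not in (None, "audio", "visual"):
        raise ValueError("drop must be None, 'audio', or 'visual'")
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same number of frames")

    def zeros_like(frames):
        return [[0.0] * len(frame) for frame in frames]

    a = zeros_like(audio) if drop == "audio" else [list(f) for f in audio]
    v = zeros_like(visual) if drop == "visual" else [list(f) for f in visual]
    # Concatenate the two feature vectors of each frame.
    return [fa + fv for fa, fv in zip(a, v)]


def sample_drop(rng, p_audio=0.25, p_visual=0.25):
    """Randomly pick which modality to mask for one training example.

    ASSUMPTION: the probabilities are illustrative placeholders.
    """
    r = rng.random()
    if r < p_audio:
        return "audio"
    if r < p_audio + p_visual:
        return "visual"
    return None


if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames, 2-dim audio features
    visual = [[5.0], [6.0]]            # 2 frames, 1-dim visual features
    print(mask_modality(audio, visual))                  # both modalities kept
    print(mask_modality(audio, visual, drop="visual"))   # audio-only input
    print(mask_modality(audio, visual, drop=sample_drop(random.Random(0))))
```

Under these assumptions, the same downstream model sees inputs of identical shape whether it receives both streams or only one, which is what would let a single pre-trained model serve AVSR, audio-only recognition, and lipreading.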
