AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

02/10/2023
by Jiachen Lian, et al.

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under most settings.
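
To make "predicting contextualized representations" concrete, here is a minimal sketch of how uni-modal data2vec-style training is commonly described: a teacher whose weights are an exponential moving average (EMA) of the student's, targets formed by averaging the teacher's top layers, and a regression loss on masked time steps. Everything here is an assumption for illustration and not the authors' implementation: the function names, the plain-list data layout, the squared-error loss, and the values of `tau` and `top_k` are all hypothetical.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # one feature vector per time step


def ema_update(teacher: Vector, student: Vector, tau: float) -> Vector:
    """Move teacher weights toward the student: tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def contextualized_targets(layer_outputs: List[Sequence], top_k: int) -> Sequence:
    """Average the outputs of the last top_k teacher layers, per time step and feature."""
    chosen = layer_outputs[-top_k:]
    n_steps = len(chosen[0])
    dim = len(chosen[0][0])
    return [
        [sum(layer[t][d] for layer in chosen) / len(chosen) for d in range(dim)]
        for t in range(n_steps)
    ]


def masked_regression_loss(pred: Sequence, target: Sequence, mask: List[bool]) -> float:
    """Mean squared error computed only over time steps where mask is True."""
    total, count = 0.0, 0
    for p, y, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, y))
            count += len(p)
    return total / count if count else 0.0
```

In an audio-visual setting, the student input could be audio only, video only, or both fused, while the teacher sees the unmasked input. How AV-data2vec actually fuses the modalities and which loss it uses are not stated in the abstract, so this sketch leaves them open.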
