LiRA: Learning Visual Speech Representations from Audio through Self-supervision

06/16/2021
by   Pingchuan Ma, et al.

The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
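The core idea — training a visual encoder to predict acoustic features from unlabelled video, so the audio track itself serves as the supervision signal — can be sketched in a toy form. The snippet below is a minimal, hypothetical stand-in: a single linear map replaces the paper's ResNet+Conformer, the data are random placeholders, and the L1 regression loss is an assumed (not confirmed) choice of objective; it only illustrates the cross-modal training loop, not the actual LiRA implementation.

```python
import numpy as np

# Toy sketch of a LiRA-style cross-modal objective: a visual encoder
# (here a stand-in linear map, NOT the paper's ResNet+Conformer) is
# trained to regress acoustic features from paired video frames.
# No transcription labels are used; the audio features are the target.

rng = np.random.default_rng(0)
T, D_VID, D_AC = 50, 32, 13          # frames, video feat dim, acoustic feat dim

video_feats = rng.normal(size=(T, D_VID))      # placeholder visual speech features
acoustic_targets = rng.normal(size=(T, D_AC))  # placeholder features from the paired audio

W = np.zeros((D_VID, D_AC))                    # stand-in encoder parameters
lr = 0.01

def l1_loss(pred, target):
    """Mean absolute error between predicted and target acoustic features."""
    return np.abs(pred - target).mean()

losses = []
for step in range(200):
    pred = video_feats @ W                     # predict acoustic features from video
    losses.append(l1_loss(pred, acoustic_targets))
    # Subgradient of the L1 loss w.r.t. W, then a plain gradient step.
    grad = video_feats.T @ np.sign(pred - acoustic_targets) / (T * D_AC)
    W -= lr * grad

print(losses[0], "->", losses[-1])             # loss decreases as the encoder fits the mapping
```

After pre-training on such an objective, the learned encoder would be reused for lip-reading, either as a frozen feature extractor or by fine-tuning on a small labelled set, mirroring the experiments described in the abstract.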


Related research

- 07/08/2020 · Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
  The intuitive interaction between the audio and visual modalities is val...
- 11/28/2019 · Self-Supervised Learning by Cross-Modal Audio-Video Clustering
  The visual and audio modalities are highly correlated yet they contain d...
- 02/24/2022 · Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
  Training Transformer-based models demands a large amount of data, while ...
- 12/09/2021 · LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading
  The aim of this work is to investigate the impact of crossmodal self-sup...
- 12/10/2021 · FaceFormer: Speech-Driven 3D Facial Animation with Transformers
  Speech-driven 3D facial animation is challenging due to the complex geom...
- 04/01/2022 · WavFT: Acoustic model finetuning with labelled and unlabelled data
  Unsupervised and self-supervised learning methods have leveraged unlabel...
- 06/12/2023 · NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection
  Deepfake technologies empowered by deep learning are rapidly evolving, c...
