Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

07/08/2020
by Abhinav Shukla, et al.

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information, and the encoder can then be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech-related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
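The idea of combining an audio-only objective with a visual objective can be sketched as a weighted sum of two loss terms computed from the same audio encoding. This is a minimal illustration only: the abstract does not state which losses or weights are used, so the L1 loss, the `audio_weight` and `visual_weight` parameters, and the function names `l1_loss` and `joint_loss` are all assumptions made for this example.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    audio_attr_pred: Sequence[float],
    audio_attr_target: Sequence[float],
    face_pred: Sequence[float],
    face_target: Sequence[float],
    audio_weight: float = 1.0,   # hypothetical weighting, not from the paper
    visual_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of an audio-only term and a visual term.

    audio_attr_*: predicted vs. reference audio attributes (audio-only task).
    face_*: generated vs. reference face pixels, flattened (visual task).
    L1 is an assumed stand-in for whatever losses the method actually uses.
    """
    audio_term = l1_loss(audio_attr_pred, audio_attr_target)
    visual_term = l1_loss(face_pred, face_target)
    return audio_weight * audio_term + visual_weight * visual_term
```

In a training loop, both terms would be backpropagated into the shared audio encoder, so the encoder receives gradients from the audio attributes and from the face generation task at the same time. At evaluation time only the encoder is needed, which matches the abstract's point that the visual modality is not required downstream.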


research
11/28/2019

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

The visual and audio modalities are highly correlated yet they contain d...
research
02/10/2023

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Self-supervision has shown great potential for audio-visual speech recog...
research
06/16/2021

LiRA: Learning Visual Speech Representations from Audio through Self-supervision

The large amount of audiovisual content being shared online today has dr...
research
12/15/2022

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
research
04/24/2023

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Singing voice transcription converts recorded singing audio to musical n...
research
01/31/2022

Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data

Large scale databases with high-quality manual annotations are scarce in...
research
01/25/2020

Multi-task self-supervised learning for Robust Speech Recognition

Despite the growing interest in unsupervised learning, extracting meanin...
