Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

01/05/2022
by   Bowen Shi, et al.

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representations benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark, LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained on a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition yields a 40% relative WER reduction over the prior state of the art (1.3% vs. 2.3%). Code and models: https://github.com/facebookresearch/av_hubert
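The masked cluster-prediction objective described above can be illustrated with a minimal sketch: random spans of fused audio-visual frames are masked, and a predictor is trained to recover the frames' discrete cluster assignments (the "hidden units") with a cross-entropy loss computed only on the masked positions. Everything below is a toy stand-in, not the paper's implementation — the one-layer linear predictor replaces the transformer encoder, the features and cluster labels are random, and `mask_frames` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(T, span=3, p=0.3, rng=rng):
    """Sample a boolean mask by dropping random fixed-length spans of frames."""
    mask = np.zeros(T, dtype=bool)
    n_spans = max(1, int(p * T / span))
    for s in rng.integers(0, T - span + 1, size=n_spans):
        mask[s:s + span] = True
    return mask

# Toy fused audio-visual features: T frames x D dims (two streams concatenated).
T, D, K = 20, 16, 8                   # frames, feature dim, number of cluster units
feats = rng.normal(size=(T, D))
units = rng.integers(0, K, size=T)    # current-iteration cluster assignments (targets)

mask = mask_frames(T)
corrupted = feats.copy()
corrupted[mask] = 0.0                 # masked frames replaced (here: zeroed)

# Hypothetical linear predictor standing in for the transformer encoder.
W = rng.normal(scale=0.1, size=(D, K))
logits = corrupted @ W
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# The loss is computed only on masked frames, as in masked prediction.
loss = -log_probs[mask, units[mask]].mean()
print(f"masked-prediction loss: {loss:.4f}")
```

In the actual framework the targets are refined iteratively: clusters from one training round's features become the prediction targets for the next.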

