A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

07/14/2022
by Wei-Ning Hsu, et al.

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for speech recognition and speaker verification. In particular, our single model yields a 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
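The modality dropout mentioned in the abstract can be illustrated with a minimal sketch: during pre-training, one modality's features are occasionally zeroed out so the shared encoder learns to work from audio-only, video-only, or audio-visual input. The function name, probabilities, and zero-fill convention below are illustrative assumptions, not details taken from the paper.

```python
import random

def modality_dropout(audio_feat, video_feat, p_drop=0.5, p_keep_audio=0.5,
                     rng=random):
    """Sketch of modality dropout for audio-visual pre-training.

    With probability p_drop, exactly one modality is dropped; when a drop
    occurs, p_keep_audio decides whether audio (rather than video) is the
    modality that survives. A dropped modality is replaced by a zero
    vector of the same length, standing in for a learned "missing" embedding.
    """
    if rng.random() < p_drop:
        if rng.random() < p_keep_audio:
            video_feat = [0.0] * len(video_feat)  # keep audio, drop video
        else:
            audio_feat = [0.0] * len(audio_feat)  # keep video, drop audio
    return audio_feat, video_feat
```

Because the encoder sees all three input conditions during pre-training, a model fine-tuned on one modality can, in principle, be evaluated zero-shot on the others.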


Related research

06/10/2023 · OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
Speech Recognition builds a bridge between the multimedia streaming (aud...

05/15/2022 · Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
This paper investigates self-supervised pre-training for audio-visual sp...

03/29/2022 · Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Training a text-to-speech (TTS) model requires a large scale text labele...

03/14/2023 · BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Compared with ample visual-text pre-training research, few works explore...

02/24/2022 · Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Training Transformer-based models demands a large amount of data, while ...

09/10/2023 · Multimodal Fish Feeding Intensity Assessment in Aquaculture
Fish feeding intensity assessment (FFIA) aims to evaluate the intensity ...

12/09/2014 · Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition
We propose a transfer deep learning (TDL) framework that can transfer th...
