Practice of the Conformer Enhanced Audio-Visual HuBERT on Mandarin and English

02/28/2023
by Xiaoming Ren, et al.

Considering the bimodal nature of human speech perception, lip and teeth movements play a pivotal role in automatic speech recognition. Benefiting from correlated and noise-invariant visual information, audio-visual recognition systems are more robust in multiple scenarios. In previous work, audio-visual HuBERT (AV-HuBERT) appears to be the best practice for incorporating modality knowledge. This paper outlines a mixed methodology, named conformer enhanced AV-HuBERT, that pushes the AV-HuBERT system's performance a step further. Compared with the baseline AV-HuBERT, our method, in the one-phase evaluation under clean and noisy conditions, achieves 7 […] benchmark dataset LRS3. Furthermore, we establish a novel 1000h Mandarin AVSR dataset, CSTS. On top of the baseline AV-HuBERT, we exceed the WeNet ASR system by 14 […]. The conformer-enhanced AV-HuBERT we propose brings 7 […] reduction on CMLR, compared with the baseline AV-HuBERT system.

research
01/05/2022

Robust Self-Supervised Audio-Visual Speech Recognition

Audio-based automatic speech recognition (ASR) degrades significantly in...
research
01/06/2020

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Automatic recognition of overlapped speech remains a highly challenging ...
research
10/16/2020

Multimodal Speech Recognition with Unstructured Audio Masking

Visual context has been shown to be useful for automatic speech recognit...
research
07/11/2022

pMCT: Patched Multi-Condition Training for Robust Speech Recognition

We propose a novel Patched Multi-Condition Training (pMCT) method for ro...
research
12/14/2020

AV Taris: Online Audio-Visual Speech Recognition

In recent years, Automatic Speech Recognition (ASR) technology has appro...
research
05/11/2022

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Traditionally, audio-visual automatic speech recognition has been studie...
research
08/11/2023

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Visual Speech Recognition (VSR) differs from the common perception tasks...
