Multi-Modal Pre-Training for Automated Speech Recognition

10/12/2021
by David M. Chan, et al.

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable both to local-level corruption (such as audio-frame drops or loud noises) and to global-level noise (such as environmental or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by 7% to 45%.
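
The deep-fusion idea described above lends itself to a compact illustration. The PyTorch sketch below shows one plausible way to inject a single global multi-modal embedding into every layer of a conventional ASR encoder through a learned gate. All module names, layer sizes, and the gating scheme are illustrative assumptions, not the architecture from the paper.

```python
# Minimal sketch of deep fusion: a global, multi-modal context vector is
# injected into every layer of an ASR encoder. Everything here (names,
# dimensions, gating) is a hypothetical illustration of the idea, not the
# authors' exact method.
import torch
import torch.nn as nn


class DeepFusionLayer(nn.Module):
    """One encoder layer that mixes a global context vector into each
    audio frame via a sigmoid-gated projection."""

    def __init__(self, d_model: int, d_context: int):
        super().__init__()
        self.frame_proj = nn.Linear(d_model, d_model)
        self.context_proj = nn.Linear(d_context, d_model)
        self.gate = nn.Linear(d_model + d_context, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, frames: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # frames:  (batch, time, d_model) local acoustic features
        # context: (batch, d_context)     global multi-modal embedding
        ctx = context.unsqueeze(1).expand(-1, frames.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([frames, ctx], dim=-1)))
        fused = self.frame_proj(frames) + g * self.context_proj(ctx)
        return self.norm(torch.relu(fused) + frames)  # residual connection


class DeepFusionASREncoder(nn.Module):
    """Stack of fusion layers followed by a per-frame (CTC-style) output head."""

    def __init__(self, d_model=256, d_context=512, n_layers=4, n_tokens=32):
        super().__init__()
        self.layers = nn.ModuleList(
            DeepFusionLayer(d_model, d_context) for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, n_tokens)

    def forward(self, frames, context):
        for layer in self.layers:
            frames = layer(frames, context)
        return self.head(frames)


# Example: 2 utterances, 100 audio frames each, with a 512-d global context
# vector (e.g. pooled from a masked-modeling multi-modal encoder).
model = DeepFusionASREncoder()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 100, 32])
```

Gating the context at every layer, rather than concatenating it once at the input, lets the model decide per frame and per layer how much global information to use; the paper's actual fusion mechanism may differ in its details.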

Related research

02/15/2022 · Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
With the advance in self-supervised learning for audio and visual modali...

05/19/2022 · Content-Context Factorized Representations for Automated Speech Recognition
Deep neural networks have largely demonstrated their ability to perform ...

01/16/2023 · OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Inspired by humans comprehending speech in a multi-modal manner, various...

12/12/2022 · Jointly Learning Visual and Auditory Speech Representations from Raw Data
We present RAVEn, a self-supervised multi-modal approach to jointly lear...

11/23/2022 · Device Directedness with Contextual Cues for Spoken Dialog Systems
In this work, we define barge-in verification as a supervised learning t...

02/27/2023 · Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
Automatic Speech Recognition (ASR) is a technology that converts spoken ...

03/29/2022 · DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Most text-to-speech (TTS) methods use high-quality speech corpora record...
