Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

11/08/2019
by   Takaki Makino, et al.

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.
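
The abstract does not say how the YTDEV18 audio was corrupted. As a rough illustration only, one common way to add background noise to a clean utterance is to scale the noise so that the mixture reaches a chosen signal-to-noise ratio (SNR). The sketch below assumes this approach; the function names `mix_at_snr` and `measured_snr_db`, the 10 dB target, and the synthetic signals are all hypothetical and are not taken from the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper: an assumed corruption recipe, not the paper's.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    # Choose gain g so that p_speech / (g^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: a 220 Hz tone as "speech", Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Under the same assumption, overlapping speech could be simulated by passing a second utterance as the `noise` argument. In either case only the audio stream is changed and the video frames stay clean, which is what lets such a test isolate the contribution of the visual modality.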


research
01/25/2022

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) extends the speech re...
research
09/03/2018

LRS3-TED: a large-scale dataset for visual speech recognition

This paper introduces a new multi-modal dataset for visual and audio-vis...
research
07/13/2018

Large-Scale Visual Speech Recognition

This work presents a scalable solution to open-vocabulary visual speech ...
research
02/26/2022

Visual Speech Recognition for Multiple Languages in the Wild

Visual speech recognition (VSR) aims to recognise the content of speech ...
research
09/01/2023

Mi-Go: Test Framework which uses YouTube as Data Source for Evaluating Speech Recognition Models like OpenAI's Whisper

This article introduces Mi-Go, a novel testing framework aimed at evalua...
research
03/01/2019

KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

In this paper, we describe KT-Speech-Crawler: an approach for automatic ...
research
06/10/2023

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment

Speech Recognition builds a bridge between the multimedia streaming (aud...
