LRS3-TED: a large-scale dataset for visual speech recognition

09/03/2018
by Triantafyllos Afouras, et al.

This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The dataset is substantially larger than other public datasets available for general research.
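Since each face track is released together with its transcript and per-word timing, consumers of the dataset typically start by parsing those alignments. Below is a minimal sketch in Python; the file layout it assumes (a transcript header followed by a table of WORD START END rows) is a hypothetical illustration, not a confirmed specification of the released format.

```python
# Minimal sketch for reading an LRS3-style word-alignment file.
# ASSUMED (hypothetical) layout, for illustration only:
#   Text:  THE FULL TRANSCRIPT
#   ...other header lines...
#   WORD START END ASDSCORE
#   HELLO 0.00 0.31 2.7
#   WORLD 0.31 0.58 3.1
from dataclasses import dataclass
from typing import List

@dataclass
class WordAlignment:
    word: str
    start: float  # seconds from the start of the face track
    end: float

def parse_alignments(path: str) -> List[WordAlignment]:
    alignments: List[WordAlignment] = []
    in_table = False
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("WORD"):  # header row of the alignment table
                in_table = True
                continue
            if not in_table or not line:
                continue  # skip the transcript header and blank lines
            fields = line.split()
            if len(fields) < 3:
                continue  # malformed row; ignore
            alignments.append(
                WordAlignment(fields[0], float(fields[1]), float(fields[2]))
            )
    return alignments
```

Given boundaries like these, it is straightforward to crop the corresponding video segments for word-level or sentence-level training.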

Related research

01/21/2023 · A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV Dataset
In recent years, significant progress has been made in automatic lip rea...

01/16/2023 · OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
Inspired by humans comprehending speech in a multi-modal manner, various...

11/08/2019 · Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
This work presents a large-scale audio-visual speech recognition system ...

03/30/2023 · SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
Recently reported state-of-the-art results in visual speech recognition ...

09/12/2021 · MovieCuts: A New Dataset and Benchmark for Cut Type Recognition
Understanding movies and their structural patterns is a crucial task to ...

07/13/2018 · Large-Scale Visual Speech Recognition
This work presents a scalable solution to open-vocabulary visual speech ...

03/01/2023 · WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Large-scale, weakly-supervised speech recognition models, such as Whispe...
