SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

09/11/2023
by Haoxu Wang, et al.

Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the use of supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos totaling more than 1,000 hours, of which 473 hours are high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing the text information in the visual slide context. By applying keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential to improve speech recognition performance by incorporating textual information from the slides that accompany the videos.
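
To make the keyword extraction step more concrete, here is a minimal Python sketch of one way a keyword list could be built from slide text. It is not the method used for SlideSpeech: the function name `extract_keywords`, the `COMMON_WORDS` set, and the frequency-based ranking are all assumptions made for illustration, and the input is assumed to be plain text that has already been recovered from a slide (for example, by OCR).

```python
import re
from collections import Counter

# Hypothetical list of common words to ignore; a real system would likely
# use a much larger list or a frequency threshold from a training corpus.
COMMON_WORDS = {
    "the", "and", "for", "with", "that", "this", "from", "are", "was",
    "uses", "use", "task", "into", "our", "their",
}


def extract_keywords(slide_text, common_words=COMMON_WORDS, max_keywords=50):
    """Return candidate keywords found in the text of one slide.

    Illustrative heuristic only: keep alphabetic tokens of three or more
    characters that are not common words, then rank them by how often
    they occur on the slide (ties broken alphabetically).
    """
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", slide_text)
    counts = Counter()
    surface_form = {}
    for token in tokens:
        key = token.lower()
        if len(key) < 3 or key in common_words:
            continue
        counts[key] += 1
        # Remember the first spelling seen so casing such as "ASR" survives.
        surface_form.setdefault(key, token)
    ranked = sorted(counts, key=lambda k: (-counts[k], k))
    return [surface_form[k] for k in ranked[:max_keywords]]


if __name__ == "__main__":
    text = (
        "Conformer encoder for the ASR task\n"
        "The Conformer uses attention and convolution"
    )
    print(extract_keywords(text))
```

A list like this could then be passed to a contextual ASR model as a set of phrases to bias toward; how that biasing is done depends on the model and is outside the scope of this sketch.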

