Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous Speech Recognition

10/14/2022
by Jakob Poncelet, et al.

TV subtitles are a rich source of transcriptions of many types of speech, ranging from read speech in news reports to conversational and spontaneous speech in talk shows and soaps. However, subtitles are not verbatim (i.e. exact) transcriptions of speech, so they cannot be used directly to improve an Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder Transformer model that jointly performs ASR and automatic subtitling. The ASR decoder (possibly pre-trained) predicts the verbatim output and the subtitle decoder generates a subtitle, while sharing the encoder. The two decoders can be independent or connected. The model is trained to perform both tasks jointly, and is able to effectively use subtitle data. We show improvements on regular ASR and on spontaneous and conversational ASR by incorporating the additional subtitle decoder. The method does not require preprocessing (aligning, filtering, pseudo-labeling, ...) of the subtitles.
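The abstract says the model is trained on both tasks jointly but does not give the training objective. As a minimal sketch of one common way to set up such a multitask objective, the example below interpolates a cross-entropy loss for the ASR decoder with one for the subtitle decoder. The interpolation form, the `asr_weight` hyperparameter, and the function names are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target token at each output step.

    probs[i] is the decoder's predicted distribution over the vocabulary at
    step i, and targets[i] is the index of the reference token at that step.
    """
    if not targets or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equally long")
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def joint_loss(
    asr_probs: Sequence[Sequence[float]],
    asr_targets: Sequence[int],
    sub_probs: Sequence[Sequence[float]],
    sub_targets: Sequence[int],
    asr_weight: float = 0.5,  # hypothetical hyperparameter, not from the paper
) -> float:
    """Weighted sum of the verbatim (ASR) loss and the subtitle loss.

    Both decoders read the same shared encoder output, so in a real model the
    gradient of this single scalar would update the encoder from both tasks.
    """
    if not 0.0 <= asr_weight <= 1.0:
        raise ValueError("asr_weight must lie in [0, 1]")
    loss_asr = cross_entropy(asr_probs, asr_targets)
    loss_sub = cross_entropy(sub_probs, sub_targets)
    return asr_weight * loss_asr + (1.0 - asr_weight) * loss_sub


# Toy example: a 2-token vocabulary, two output steps per decoder.
asr_probs = [[0.5, 0.5], [0.25, 0.75]]
sub_probs = [[0.9, 0.1], [0.2, 0.8]]
loss = joint_loss(asr_probs, [0, 1], sub_probs, [0, 1], asr_weight=0.7)
```

Setting `asr_weight` to 1.0 in this sketch reduces the objective to plain ASR training, and 0.0 trains only the subtitle decoder; values in between let subtitle data influence the shared encoder.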
