End-to-End Multimodal Speech Recognition

04/25/2018
by Shruti Palaskar, et al.

Transcription or subtitling of open-domain videos remains difficult for Automatic Speech Recognition (ASR) because of the data's challenging acoustics, variable signal processing, and essentially unrestricted domain. In previous work, we showed that the visual channel, specifically object and scene features, can help adapt the acoustic model (AM) and language model (LM) of a recognizer; we now extend this work to end-to-end approaches. With a Connectionist Temporal Classification (CTC)-based approach, we retain the separation of AM and LM, whereas with a sequence-to-sequence (S2S) approach, both information sources are adapted together in a single model. This paper also analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus) and compares it with results on the clean Wall Street Journal (WSJ) corpus, providing insight into the robustness of both approaches.
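
As a rough illustration of why a CTC model can keep the AM separate from the LM, the sketch below shows the standard CTC collapsing rule applied to per-frame acoustic outputs: repeated labels are merged, then blanks are dropped, with no language model involved. It is not taken from the paper; the function name `ctc_greedy_collapse`, the `_` blank symbol, and the toy frame sequence are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical blank symbol for this sketch; real systems typically
# reserve an integer index for the CTC blank instead of a character.
BLANK = "_"


def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-label sequence into an output string.

    Step 1: merge consecutive repeats of the same label.
    Step 2: drop blank symbols.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Toy per-frame output (assumed, not real model output).
frames = list("__hh_e_ll_ll_oo__")
print(ctc_greedy_collapse(frames))
```

Because this step depends only on the acoustic model's frame-level outputs, a separately trained LM can in principle be applied afterwards during decoding, whereas an S2S decoder conditions each output on previous outputs within the same network.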

research
11/06/2017

Multilingual Speech Recognition With A Single End-To-End Model

Training a conventional automatic speech recognition (ASR) system to sup...
research
11/09/2018

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

Humans are capable of processing speech by making use of multiple sensor...
research
12/08/2016

Towards better decoding and language model integration in sequence to sequence models

The recently proposed Sequence-to-Sequence (seq2seq) framework advocates...
research
12/01/2017

Visual Features for Context-Aware Speech Recognition

Automatic transcriptions of consumer-generated multi-media content such ...
research
04/02/2020

Towards Relevance and Sequence Modeling in Language Recognition

The task of automatic language identification (LID) involving multiple d...
research
03/19/2018

Acoustic feature learning using cross-domain articulatory measurements

Previous work has shown that it is possible to improve speech recognitio...
research
02/03/2022

Joint Speech Recognition and Audio Captioning

Speech samples recorded in both indoor and outdoor environments are ofte...