Fine-Grained Grounding for Multimodal Speech Recognition

10/05/2020
by   Tejas Srinivasan, et al.

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.
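To make the contrast between global visual features and localized object proposals concrete, here is a minimal sketch. It is not the authors' model: the abstract does not specify how proposals are selected or combined, so the dot-product attention, the function names `attend_over_proposals` and `global_feature`, and the toy feature vectors are all assumptions made for illustration. The idea shown is only that a decoder state can weight individual proposal features, whereas a global feature treats the whole image as one vector.

```python
import math

def attend_over_proposals(query, proposals):
    """Hypothetical dot-product attention over object-proposal features.

    query:     decoder state as a list of floats (assumed, not from the paper)
    proposals: one feature vector per object proposal, same length as query
    Returns (weights, context): softmax weights and the weighted feature sum.
    """
    # Score each proposal by its dot product with the query.
    scores = [sum(q * p for q, p in zip(query, feat)) for feat in proposals]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the proposal features.
    dim = len(proposals[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, proposals))
               for d in range(dim)]
    return weights, context

def global_feature(proposals):
    """Stand-in for a global image feature: the unweighted mean of all regions."""
    n = len(proposals)
    return [sum(feat[d] for feat in proposals) / n
            for d in range(len(proposals[0]))]

# Toy example: three proposals, the first one aligned with the query.
toy_proposals = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
toy_weights, toy_context = attend_over_proposals([1.0, 0.0], toy_proposals)
```

In this toy setup nearly all of the attention weight falls on the first proposal, so the context vector stays close to that region's features, while the mean-pooled vector dilutes it with the unrelated regions.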
