Looking Enhances Listening: Recovering Missing Speech Using Images

02/13/2020
by Tejas Srinivasan, et al.

Speech is easier to understand when visual context is available; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments that show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.


