Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

04/27/2022
by Dan Oneata, et al.

Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by finetuning pretrained image recognition networks, significantly less attention has been paid to the other modality: the speech component. In this work, we investigate ways of improving the base speech recognition system by applying techniques similar to those used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR model significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains from including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and has the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
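The abstract does not spell out which speech augmentation is used, but the idea of masking spans of the audio so the model must lean on the visual stream can be illustrated with a SpecAugment-style time mask. Below is a minimal NumPy sketch; the function name `time_mask` and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def time_mask(spectrogram, max_width=20, num_masks=2, rng=None):
    """Zero out random contiguous time spans of a (time, freq) spectrogram.

    Removing acoustic evidence encourages a multimodal model to attend
    to the visual input to compensate for the masked audio frames.
    """
    rng = rng or np.random.default_rng()
    out = spectrogram.copy()
    n_frames = out.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))      # mask length in frames
        start = int(rng.integers(0, max(1, n_frames - width)))
        out[start:start + width, :] = 0.0                # silence the span
    return out

# Toy example: a 100-frame, 80-bin log-mel spectrogram of ones.
spec = np.ones((100, 80))
masked = time_mask(spec, max_width=10, num_masks=2, rng=np.random.default_rng(0))
```

The masked copy keeps the input shape, so it can be dropped into a training pipeline in place of the clean spectrogram without changing the model.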

Related research

02/25/2021
MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition
In this paper, we propose MixSpeech, a simple yet effective data augment...

10/29/2019
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation
Sequence-to-Sequence (S2S) models recently started to show state-of-the-...

04/05/2021
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
Speech-based image retrieval has been studied as a proxy for joint repre...

06/07/2021
Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios
Although end-to-end automatic speech recognition (E2E ASR) has achieved ...

06/30/2019
Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions
Multimodal learning allows us to leverage information from multiple sour...

03/04/2023
The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis
This paper further explores our previous wake word spotting system ranke...

09/30/2021
SpliceOut: A Simple and Efficient Audio Augmentation Method
Time masking has become a de facto augmentation technique for speech and...
