Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

06/05/2019
by Pingchuan Ma, et al.

Several audio-visual speech recognition models have recently been proposed which aim to improve robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments, which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip movements. In this paper, we investigate the impact of the Lombard effect on audio-visual speech recognition. To the best of our knowledge, this is the first work which does so using end-to-end deep architectures and presents results on unseen speakers. Our results show that properly modelling Lombard speech is always beneficial. Even if only a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In the case of audio-only approaches, performance is overestimated for SNRs higher than -3 dB and underestimated for lower SNRs.
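
To make the "noisy plain speech" setting concrete, the sketch below mixes a noise signal into a clean signal at a chosen signal-to-noise ratio (SNR), using the standard definition SNR = 10 log10(P_speech / P_noise). It is an illustration only: the function names, the synthetic sine-wave "speech", and the Gaussian noise are assumptions, and the abstract does not state which noise types or mixing procedure the authors used. Additive mixing of this kind does not model the Lombard effect, which changes how the speech itself is produced.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    # Solve P_speech / (gain^2 * P_noise) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, treating (mixture - speech) as the noise component."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins: one second of a 220 Hz tone at 16 kHz, plus Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# -3 dB is the crossover point the abstract reports for audio-only models.
noisy = mix_at_snr(speech, noise, snr_db=-3.0)
```

Calling `measured_snr_db(speech, noisy)` recovers the requested SNR up to floating-point error, which is a quick sanity check that the scaling is right before a mixture is used for training or evaluation.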


