End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

09/12/2018
by Fei Tao, et al.

Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, which increases the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whispered speech) and to background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture, in a principled way, the temporal relationships between acoustic and visual features. This study explores this idea by proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are learned directly from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% over a baseline implemented with a deep neural network (DNN). The proposed approach achieves 92.7% under a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high-definition camera and a close-talking microphone).

