Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

03/13/2018
by Matthijs Van Keirsbilck, et al.

Today's Automatic Speech Recognition (ASR) systems rely on acoustic signals alone and often perform poorly under noisy conditions. Multi-modal speech recognition, which processes acoustic speech signals and lip-reading video simultaneously, significantly improves performance, especially in noisy environments. This work presents the design of such an audio-visual ASR system, taking memory and computation requirements into account. First, a Long Short-Term Memory (LSTM) neural network for acoustic speech recognition is designed. Second, Convolutional Neural Networks (CNNs) are used to extract lip-reading features; these are combined with an LSTM network to model temporal dependencies and perform automatic lip-reading on video. Finally, the acoustic and visual lip-reading networks are combined to process acoustic and visual features simultaneously, with an attention mechanism maintaining performance in noisy environments. The system is evaluated on the TCD-TIMIT 'lipspeaker' dataset for audio-visual phoneme recognition, with clean audio and with additive white noise at an SNR of 0 dB. It achieves 75.70 percent accuracy, improving on the state of the art at all noise levels.
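The attention-based fusion described above can be illustrated with a minimal sketch: per-frame audio and visual feature vectors are each scored, the scores are normalized with a softmax into two weights that sum to one, and the fused frame is the weighted sum of the two modalities, letting the model shift weight to the visual stream when audio is noisy. The function names, the scalar-scoring parametrization, and the `w_score` vector are illustrative assumptions, not the paper's exact mechanism.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fusion(audio_seq, video_seq, w_score):
    """Fuse per-frame audio and visual features with two attention weights.

    audio_seq, video_seq: length-T lists of D-dim feature vectors, as
    produced by the modality-specific networks (shapes illustrative).
    w_score: D-dim learned scoring vector (hypothetical parameter).
    """
    fused = []
    for a, v in zip(audio_seq, video_seq):
        # One scalar score per modality; softmax turns them into
        # weights summing to 1, so a noisy modality can be down-weighted.
        scores = [sum(x * w for x, w in zip(a, w_score)),
                  sum(x * w for x, w in zip(v, w_score))]
        alpha = softmax(scores)
        fused.append([alpha[0] * x + alpha[1] * y for x, y in zip(a, v)])
    return fused

# With equal scores the two modalities get weight 0.5 each:
fused = attention_fusion([[1.0, 0.0]], [[0.0, 1.0]], [1.0, 1.0])
print(fused)  # [[0.5, 0.5]]
```

In the full model these fused features would feed a downstream LSTM and phoneme classifier; in practice the scoring would be a learned layer trained end-to-end rather than a fixed vector.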

