Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System

10/19/2017
by Marina Zimmermann, et al.

Automatic visual speech recognition is an interesting problem in pattern recognition, especially when audio data is noisy or not readily available. It is also a very challenging task, mainly because the visual articulations carry less information than the audible utterance. In this work, principal component analysis is applied to image patches extracted from the video data to learn the weights of a two-stage convolutional network. Block histograms are then extracted as the unsupervised learning features. These features are used to train a recurrent neural network with a set of long short-term memory cells to obtain spatiotemporal features. Finally, the obtained features are used in a tandem GMM-HMM system for speech recognition. Our results show that the proposed method outperforms the baseline techniques applied to the OuluVS2 audiovisual database for phrase recognition, with the frontal view cross-validation and testing sentence correctness reaching 79 cross-validation.
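
As a rough illustration of the first step described in the abstract (learning convolutional filter weights by applying PCA to image patches), the following is a minimal NumPy sketch of a single filter-learning stage. It is not the authors' implementation: the patch size, the number of filters, the stride-1 patch extraction, the per-patch mean removal, the random stand-in frames, and the function names extract_patches and learn_pca_filters are all assumptions made for illustration.

```python
import numpy as np


def extract_patches(images, k):
    """Collect every k x k patch (stride 1) from each image.

    Each patch is flattened and has its own mean removed
    (an assumed preprocessing choice, not taken from the paper).
    """
    patches = []
    for img in images:
        h, w = img.shape
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                p = img[i:i + k, j:j + k].ravel()
                patches.append(p - p.mean())
    return np.stack(patches)  # shape: (num_patches, k * k)


def learn_pca_filters(images, k=5, num_filters=4):
    """Return the leading principal components of the patches as k x k filters."""
    X = extract_patches(images, k)
    cov = X.T @ X / X.shape[0]              # (k*k, k*k) patch covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :num_filters]  # columns with the largest eigenvalues
    return top.T.reshape(num_filters, k, k)


# Hypothetical stand-in for grayscale mouth-region frames.
rng = np.random.default_rng(0)
frames = rng.random((3, 16, 16))

filters = learn_pca_filters(frames, k=5, num_filters=4)
print(filters.shape)
```

In a two-stage network of this kind, the same procedure would be repeated on the filter responses of the first stage; the block histograms, LSTM, and GMM-HMM stages mentioned in the abstract are not covered by this sketch.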

research
03/13/2018

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Today's Automatic Speech Recognition systems only rely on acoustic signa...
research
10/19/2017

Combining Multiple Views for Visual Speech Recognition

Visual speech recognition is a challenging research problem with a parti...
research
07/04/2014

Recognition of Isolated Words using Zernike and MFCC features for Audio Visual Speech Recognition

Automatic Speech Recognition (ASR) by machine is an attractive research ...
research
05/19/2020

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

The audio-visual speech fusion strategy AV Align has shown significant p...
research
03/12/2017

Combining Residual Networks with LSTMs for Lipreading

We propose an end-to-end deep learning architecture for word-level visua...
research
11/03/2018

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

Visual and audiovisual speech recognition are witnessing a renaissance w...
research
06/12/2023

Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

Convolutional frontends are a typical choice for Transformer-based autom...
