End-to-End Visual Speech Recognition for Small-Scale Datasets

04/02/2019
by   Stavros Petridis, et al.

Traditional visual speech recognition systems consist of two stages: feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification remains limited. In addition, most existing methods require large amounts of data to achieve state-of-the-art performance and under-perform otherwise. In this work, we present an end-to-end visual speech recognition system based on fully-connected layers and Long Short-Term Memory (LSTM) networks which is suitable for small-scale datasets. The model consists of two streams which extract features directly from the mouth images and the difference images, respectively. The temporal dynamics in each stream are modelled by a Bidirectional LSTM (BLSTM), and the two streams are fused by another BLSTM. An absolute improvement of 0.6%, 3.4%, [...] is reported on the [...] AVLetters and AVLetters2 databases, respectively.
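As a rough illustration of the two input streams mentioned in the abstract, the sketch below prepares per-frame vectors for a raw mouth-image stream and a difference-image stream. It rests on assumptions the abstract does not state: difference images are taken between consecutive frames, the first difference image is zero-padded so both streams have the same length, and each image is flattened into a vector before the fully-connected layers. The names `difference_images`, `flatten` and `two_stream_inputs` are hypothetical, and the encoder and BLSTM layers themselves are not shown.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale mouth image as rows of pixel values


def difference_images(frames: List[Frame]) -> List[Frame]:
    """Return frame-to-frame differences, frames[t] - frames[t-1].

    Assumption: the first difference image is all zeros, so the output
    has the same number of time steps as the input sequence.
    """
    if not frames:
        return []
    height, width = len(frames[0]), len(frames[0][0])
    diffs = [[[0.0] * width for _ in range(height)]]
    for prev, curr in zip(frames, frames[1:]):
        diffs.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return diffs


def flatten(frame: Frame) -> List[float]:
    """Flatten an image row by row into a single vector."""
    return [pixel for row in frame for pixel in row]


def two_stream_inputs(
    frames: List[Frame],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Build the per-frame input vectors for the raw and difference streams."""
    raw = [flatten(f) for f in frames]
    diff = [flatten(d) for d in difference_images(frames)]
    return raw, diff
```

In a full model, each of the two returned sequences would feed its own encoder and BLSTM, with a further BLSTM fusing the two stream outputs, as the abstract describes.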
