A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms

09/07/2016
by Toni Heidenreich, et al.

Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach that uses a 3D (spatio-temporal) Discrete Cosine Transform to extract features of each feasible sub-sequence of an input video. These sub-sequences are then classified individually using Support Vector Machines, and the classifications are combined to find the most likely phoneme sequence using a tailor-made Hidden Markov Model. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended with training on phoneme or viseme pairs (biphones) to counteract the ambiguity that co-articulation introduces into human speech. The test set accuracy for the recognition of phoneme sequences is 20% [...] best values reported in other papers by approximately 2% [...] the result is three-fold: Firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve the accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without biphones, verifying that the use of biphones has a positive impact on the result.
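
To make the feature extraction step concrete, the following is a minimal sketch of a separable 3D (spatio-temporal) DCT applied to one video sub-sequence. It is an illustration under stated assumptions, not the paper's implementation: the orthonormal DCT-II variant, the `block[t][y][x]` layout, and the low-frequency selection rule in `low_frequency_features` (with its `keep` parameter) are all hypothetical choices made for this example.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence (assumed variant)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_3d(block):
    """Separable 3D DCT of block[t][y][x]: 1D DCTs along x, then y, then t."""
    T, H, W = len(block), len(block[0]), len(block[0][0])

    # Transform every row (x axis) of every frame.
    a = [[dct_1d(row) for row in frame] for frame in block]

    # Transform every column (y axis) of every frame.
    b = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for x in range(W):
            col = dct_1d([a[t][y][x] for y in range(H)])
            for y in range(H):
                b[t][y][x] = col[y]

    # Transform along time (t axis) at every pixel position.
    c = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for y in range(H):
        for x in range(W):
            tube = dct_1d([b[t][y][x] for t in range(T)])
            for t in range(T):
                c[t][y][x] = tube[t]
    return c


def low_frequency_features(coeffs, keep):
    """Hypothetical selection rule: keep coefficients with t + y + x < keep."""
    T, H, W = len(coeffs), len(coeffs[0]), len(coeffs[0][0])
    return [
        coeffs[t][y][x]
        for t in range(T)
        for y in range(H)
        for x in range(W)
        if t + y + x < keep
    ]


# Example: a 4-frame sub-sequence of 2x2 grey-level mouth-region crops.
clip = [[[float(t + y + x) for x in range(2)] for y in range(2)] for t in range(4)]
features = low_frequency_features(dct_3d(clip), keep=2)
```

Because the transform includes the time axis, a single feature vector summarises the motion across the whole sub-sequence rather than one frame. In a pipeline like the one described, such a vector would be computed for each candidate sub-sequence and passed to the classifier.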
