
A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms

by Toni Heidenreich, et al.

Visual speech recognition aims to identify the sequence of phonemes from continuous speech. Unlike the traditional approach of using 2D image feature extraction methods to derive features of each video frame separately, this paper proposes a new approach that uses a 3D (spatio-temporal) Discrete Cosine Transform to extract features from each feasible sub-sequence of an input video. The sub-sequences are then classified individually using Support Vector Machines, and the classifications are combined by a tailor-made Hidden Markov Model to find the most likely phoneme sequence. The algorithm is trained and tested on the VidTimit database to recognise sequences of phonemes as well as visemes (visual speech units). Furthermore, the system is extended with training on phoneme or viseme pairs (biphones) to counteract the ambiguity that co-articulation introduces into human speech. The test set accuracy for the recognition of phoneme sequences is 20% [...] best values reported in other papers by approximately 2% [...] the result is three-fold: Firstly, this paper is the first to show that 3D feature extraction methods can be applied to continuous sequence recognition tasks despite the unknown start positions and durations of each phoneme. Secondly, the result confirms that 3D feature extraction methods improve the accuracy compared to 2D feature extraction methods. Thirdly, the paper is the first to specifically compare an otherwise identical method with and without biphones, verifying that the use of biphones has a positive impact on the result.
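
The feature extraction step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the window-length bounds used to enumerate "feasible" sub-sequences, and the choice to keep a low-frequency corner of the coefficient cube are all assumptions made for illustration. The sketch applies an orthonormal DCT-II separably along width, height, and time of a small block indexed as block[t][y][x].

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_3d(block):
    """Separable 3D DCT-II of block[t][y][x] (time, height, width)."""
    T, H, W = len(block), len(block[0]), len(block[0][0])
    # Pass 1: transform every row along the width axis.
    a = [[dct_1d(block[t][y]) for y in range(H)] for t in range(T)]
    # Pass 2: transform every column along the height axis.
    b = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for x in range(W):
            col = dct_1d([a[t][y][x] for y in range(H)])
            for y in range(H):
                b[t][y][x] = col[y]
    # Pass 3: transform every pixel trajectory along the time axis.
    c = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for y in range(H):
        for x in range(W):
            tube = dct_1d([b[t][y][x] for t in range(T)])
            for t in range(T):
                c[t][y][x] = tube[t]
    return c

def low_frequency_features(coeffs, kt, ky, kx):
    """Flatten the kt*ky*kx lowest-frequency coefficients (assumed selection rule)."""
    return [coeffs[t][y][x]
            for t in range(kt) for y in range(ky) for x in range(kx)]

def candidate_subsequences(num_frames, min_len, max_len):
    """Enumerate (start, end) frame windows; the length bounds are hypothetical."""
    return [(s, s + length)
            for length in range(min_len, max_len + 1)
            for s in range(0, num_frames - length + 1)]
```

In a pipeline of the kind the abstract outlines, each window returned by `candidate_subsequences` would be cropped from the video, transformed with `dct_3d`, and reduced with `low_frequency_features` before being passed to a classifier. Because phoneme start positions and durations are unknown, every candidate window has to be scored, which is why the enumeration step appears alongside the transform.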



