Comparing phonemes and visemes with DNN-based lipreading

05/08/2018
by   Kwanchiva Thangthai, et al.
0

There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as a representation of mouth ROI image. The phoneme lipreading system word accuracy outperforms the viseme based system word accuracy. However, the phoneme system achieved lower accuracy at the unit level which shows the importance of the dictionary for decoding classification outputs into words.

READ FULL TEXT

page 3

page 9

research
09/14/2018

Visual Speech Language Models

Language models (LM) are very powerful in lipreading systems. Language m...
research
12/31/2018

Advancing Acoustic-to-Word CTC Model with Attention and Mixed-Units

The acoustic-to-word model based on the Connectionist Temporal Classific...
research
12/19/2017

Subword and Crossword Units for CTC Acoustic Models

This paper proposes a novel approach to create an unit set for CTC based...
research
07/01/2021

Interactive decoding of words from visual speech recognition models

This work describes an interactive decoding method to improve the perfor...
research
03/15/2018

Advancing Acoustic-to-Word CTC Model

The acoustic-to-word model based on the connectionist temporal classific...
research
08/04/2016

An improved uncertainty decoding scheme with weighted samples for DNN-HMM hybrid systems

In this paper, we advance a recently-proposed uncertainty decoding schem...
research
06/15/2016

Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks

Phonemic or phonetic sub-word units are the most commonly used atomic el...

Please sign up or login with your details

Forgot password? Click here to reset