Decoding visemes: improving machine lipreading (PhD thesis)

10/03/2017
by Helen L. Bear, et al.

Machine lipreading (MLR) is speech recognition from visual cues and a niche research problem in speech processing and computer vision. Current challenges fall into two groups: the content of the video, such as the rate of speech; and the parameters of the video recording, e.g. the video resolution. We show that HD video is not needed to lipread successfully with a computer. In machine lipreading, the term "viseme" denotes a visual cue or gesture corresponding to a subgroup of phonemes that are visually indistinguishable from one another; a phoneme is the smallest unit of sound one can utter. Because there are several phonemes per viseme, maps between the two units show a many-to-one relationship. Many such maps have been presented; we compare them, and our results show that Lee's is best. We propose a new method for building speaker-dependent phoneme-to-viseme maps and compare these to Lee's. Our results show the sensitivity of phoneme clustering, and we use this new knowledge to augment a conventional MLR system. It has been observed in MLR that classifiers must be trained on the test subjects to achieve accuracy, so machine lipreading is highly speaker-dependent. Conversely, speaker independence is the robust classification of speakers absent from the training data. We investigate how phoneme-to-viseme maps depend on the speaker and show that the visemes themselves do not vary greatly, but that the trajectories between visemes vary considerably between individual speakers uttering the same ground truth. This implies a dependency on the number of visemes in each speaker's set. We show that prior phoneme-to-viseme maps rarely have enough visemes, and that the optimal set size, which varies by speaker, ranges from 11 to 35. Finally, we decode from visemes back to phonemes and on into words. Our novel approach uses visemes from this optimal range within hierarchical training of phoneme classifiers and demonstrates a significant increase in classification accuracy.
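
The speaker-dependent maps described above are built by clustering phonemes that a visual classifier confuses with one another. The sketch below illustrates the idea only: the phoneme set, the confusion counts, the greedy merge rule, and the target of three visemes are all invented for this example and are not the thesis's actual clustering procedure or data.

```python
# A minimal sketch of speaker-dependent phoneme-to-viseme clustering.
# All numbers below are invented; the thesis derives its maps from
# the confusions of real classifiers on real visual speech data.

import numpy as np

phonemes = ["p", "b", "m", "f", "v", "t", "d"]

# Hypothetical phoneme confusion matrix: rows are the ground-truth
# phoneme, columns are what the visual classifier recognised.
confusion = np.array([
    [20,  9,  8,  1,  0,  1,  1],   # p
    [10, 18,  9,  0,  1,  1,  1],   # b
    [ 8,  9, 21,  1,  0,  0,  1],   # m
    [ 1,  0,  1, 25, 12,  0,  1],   # f
    [ 0,  1,  0, 11, 26,  1,  1],   # v
    [ 1,  1,  0,  0,  1, 22, 14],   # t
    [ 1,  1,  1,  1,  0, 13, 23],   # d
], dtype=float)

def cluster_visemes(conf, labels, n_visemes):
    """Greedily merge the pair of clusters with the highest mutual
    confusion until only n_visemes clusters remain."""
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > n_visemes:
        best, best_score = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Symmetric off-diagonal confusion mass between clusters.
                score = sum(conf[i, j] + conf[j, i]
                            for i in clusters[a] for j in clusters[b])
                if score > best_score:
                    best, best_score = (a, b), score
        a, b = best
        clusters[a] += clusters.pop(b)   # b > a, so index a is unaffected
    return {f"V{k + 1:02d}": [labels[i] for i in members]
            for k, members in enumerate(clusters)}

viseme_map = cluster_visemes(confusion, phonemes, n_visemes=3)
print(viseme_map)
# {'V01': ['t', 'd'], 'V02': ['f', 'v'], 'V03': ['p', 'b', 'm']}

# Inverting the map gives the many-to-one phoneme-to-viseme lookup
# used to relabel training transcripts with viseme classes.
phoneme_to_viseme = {p: v for v, ps in viseme_map.items() for p in ps}
```

Because the clustering is driven by one speaker's confusions, different speakers naturally yield maps of different sizes, which is the dependency on viseme-set size the abstract refers to.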

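The final decoding step can likewise be sketched as a two-stage decision: first choose a viseme, then choose a phoneme within that viseme's subgroup. The scores below are invented stand-ins for the per-class likelihoods a trained model would produce (the thesis trains HMM classifiers on visual features); word-level decoding is omitted.

```python
# A toy two-stage (hierarchical) viseme-then-phoneme decode.
# All scores below are invented placeholders for classifier output.

viseme_map = {"V01": ["p", "b", "m"], "V02": ["f", "v"], "V03": ["t", "d"]}

# Hypothetical per-class scores for one observed lip-gesture sequence.
viseme_scores = {"V01": 0.62, "V02": 0.23, "V03": 0.15}
phoneme_scores = {"p": 0.21, "b": 0.48, "m": 0.31,
                  "f": 0.55, "v": 0.45, "t": 0.52, "d": 0.48}

def hierarchical_decode(viseme_scores, phoneme_scores, viseme_map):
    # Stage 1: the coarse viseme classifier narrows the search space.
    best_viseme = max(viseme_scores, key=viseme_scores.get)
    # Stage 2: the phoneme classifier decides only within that viseme's
    # phoneme subgroup, rather than over the full phoneme set at once.
    candidates = viseme_map[best_viseme]
    return max(candidates, key=lambda p: phoneme_scores[p])

print(hierarchical_decode(viseme_scores, phoneme_scores, viseme_map))  # 'b'
```

The point of the hierarchy is that the second-stage classifier only has to separate the few visually similar phonemes inside one viseme, a much easier problem than discriminating the whole phoneme inventory from visual cues alone.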
