
Phoneme-to-viseme mappings: the good, the bad, and the ugly

Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into exactly one viseme class but a viseme may represent many phonemes: a many-to-one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is a definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, `Bear' visemes, are shown to perform better than previously known units.



1 Introduction

Recognition and synthesis of expressive audio-visual speech has proven to be a most challenging problem. When comparing audio-visual speech with acoustic recognition, one can identify several sources of difficulty. Firstly, the visual component of speech brings new problems such as pose, lighting, frame rate, resolution, and so on. Secondly, old problems in acoustic recognition, such as person specificity or the optimal recognition units, appear in new ways in the visual domain. While some of these aspects have been partially studied, progress has been hampered by very small datasets. Furthermore, reliable tracking has eluded many researchers which in turn has led to sub-optimal feature extraction, consequent poor performance and hence, incorrect conclusions about the parts of the problem that are tractable or intractable. A further challenge is the lack of consensus on the recognition units and it is commonplace to need to compare, say, word error rates with viseme error rates computed from a different set of visemes. Our contention is that progress in expressive audio-visual speech will remain stunted while this fundamental uncertainty remains. In this paper we review the choice of visual recognition units and provide a comprehensive set of evaluations of the competing phoneme-to-viseme mappings. We give guidance on what works well and provide explanations for the differences in performance. We also devise new algorithms for selecting optimal visual units should this be desired.

We should note that while this paper tends to focus on visual-only recognition, or lipreading, this aspect is by far the most challenging so progress on lipreading can be used to provide more useful audio-visual systems.

The rest of this paper is structured as follows: we discuss the current restrictions on a conventional lipreading system and identify the limitation of each upon the system. We then study the current sets of published visemes, before presenting a new speaker-dependent clustering algorithm for creating sets of visemes for individual speakers. We show that creating these speaker-dependent visemes follows from simple clustering and merge algorithms. These new visemes are tested on both isolated words and continuous speech datasets before we evaluate the efficacy of the improved performance against the extra investment into a new lipreading system. Since it is computationally simple to develop these speaker-dependent visemes we contend they are also a useful step in the analysis of speaker variability which itself is one of the more challenging problems in general lipreading.

2 Limitations in lipreading systems

It is often said that lipreading is difficult because not all sounds appear on the lips (newman2010 compares the performance of a system that measures, via electromagnetic articulography, both the hidden and the visible parts of the mouth, so the extent of this statement can be quantified). This is true, but in reality there are a number of problems that can corrupt the lipreading signal even before one reaches the problem of trying to decode the visual signal. Table 1 provides a taxonomy of the challenges in lipreading. Some of them relate to the problems of extracting useful information from the visual signal whereas some appear later in the signal processing chain and relate to the coding and classification of the visual signal.

Evaluation Previously studied?
Motion Yes, ong2008robust ; Matthews_Baker_2004 ; 927467
Pose Yes, 6298439 ; pass2010investigation ; Moore2011541 ; 4218129 ; kaucic1998accurate ; 11011995 ; lucey2009visual
Expression Yes, pass2010investigation ; Moore2011541
Frame rate Yes, Blokland199897 ; saitoh2010study
Video quality Yes bearicip ; heckmann2003effects ; ACP:ACP371
Color Yes, kaucic1998accurate
Unit choice Yes, cappelletta2012phoneme ; howell2013confusion ; Hazen1027972 ; shin2011real ; bear2016decoding
Feature Yes, matthews1998nonlinear ; lan2009comparing ; improveVis ; 927467 ; Matthews_Baker_2004
Classifier technology Yes, 982900 ; htk34 ; zhu2000use ; cappelletta2012phoneme ; thangthai2015improving
Multiple persons Yes, 871067 ; simstruct ; visualvowelpercept ; 871073
Speaker identity Yes, 607030 ; bear2015speaker ; newman2010speaker
Rate of speech Yes, 6854158 ; bear2016decoding
Table 1: Challenges to successful machine lipreading. Each challenge has some references.

Motion is an important part of almost all realistic settings. It is therefore essential to have either some form of tracking or to devise features that are invariant to non-informational motions. An early dataset which captured speaker motion (not camera motion) is CUAVE 5745028 . Lipreading experiments on this dataset such as 1415153 examine two different features, one based on the Discrete Cosine Transform (DCT) and another on the Active Appearance Model (AAM). The AAM (which can be a shape-only, appearance-only, or combined shape and appearance model) 927467 is sometimes preceded by Linear Predictors (LP) ong2008robust . An AAM 927467 is a model trained on a combination of shape and/or appearance information from a subset of video frames. The model is usually built from video frames manually labeled with landmarks which are chosen to cover the full range of motion throughout the video. In 1415153 the authors prefer the DCT but note that there were implementation difficulties with the AAM which meant it was improperly tracked. Further lipreading experiments on CUAVE 7760575 clarify how challenging comparing results is, because there is no agreed evaluation protocol which could account for the motion challenge or face alignment. This is attributed to their partial success with particular speakers.

The majority of automatic lipreading systems use a frontal pose in which the speaker’s facial plane is normal to the principal ray of the camera. However, in Moore2011541 for example, an improvement in expression recognition is seen by both computers and humans when the pose is rotated to 45°. Other work 4218129 ; kaucic1998accurate looks more specifically at visual speech recognition and suggests that a profile view of a speaker may not lead to catastrophically low accuracies. This observation is consistent with 11011995 which measures human sentence perception from three viewing angles: full-frontal view (0°), angled view (45°), and side view (90°). In this single-subject study a post-lingually deaf woman was tested to measure accuracy at the three angles independently. The three angles were randomly presented in every lipreading session. The results indicated that the side-view angle is most effective. A model for pose-mismatched lipreading is presented in lucey2009visual in which it is shown that without training data at the correct pose, the recognition accuracy falls dramatically. However, the authors also show that this can be mitigated by projecting the features back to a canonical pose. This transformation principle is also used in 6298439 which presents a view-independent lipreading system. This investigation uses a continuous speech corpus compared to the small vocabulary dataset in lucey2009visual . This later study acknowledges a human lipreader's preference for a non-frontal view and suggests it could be attributed to lip protrusion. They show that the 45° angle is preferable. In short, when it comes to pose, there is evidence that it can be accounted for and need not be insurmountable. Therefore, for this work we stick to a frontal pose.

Expression can be difficult to disentangle from the spoken word when lipreading natural speech. Smiling (a happy expression) has a known effect on lip motions during speech shor1978production . Effects on the inner and outer lips and on lip protrusion have been measured in fagel2010effects , which shows that smiling during speech (particularly vowels) places a restriction on lip motion, with greater demand placed on the inner lips as variation in the outer lips and lip protrusion is reduced. This in turn creates a greater challenge when lipreading non-neutral speech as gestures become less distinct. Furthermore, expression also affects the temporal properties of speech kienast1999articulatory ; kienast2000acoustical . When a particular phoneme is uttered, its duration can be shortened (for example, when a speaker is angry, vowels in particular become shorter) or elongated (for example, when a speaker is sad).

To the best of our knowledge there is no systematic study which specifically investigates lipreading expressive speech. Rather, tasks focus on either synthesizing expression in faces shaw2016expressive ; hamza2004ibm ; khatri2014facial or recognizing expression during speech happy2015automatic ; yan2016sparse ; zhang2014multimodal .

Studies such as Blokland199897 on the effect of low video frame-rate on human speech intelligibility during video communications, suggest that lower frame rates, if they are visible to the speaker, encourage humans to over-articulate to compensate for the reduced visual information available, akin to a visual Lombard effect. Accuracy is maximized when the same frame rate is used for both training and testing saitoh2010study . They further recommend that when the training data cannot be recorded at the same frame rate as the test data, then it is best if the training data has a higher frame rate (for feature extraction) than the test data. A further observation is that word classification rates vary in a non-linear fashion as the frame rate is reduced.

When it comes to the dependence of lipreading on video quality, an investigation into the effects of compression artifacts, visual noise (simulated with white noise) and localization errors in training is presented in heckmann2003effects and in ACP:ACP371 . The authors undertake two experiments, of which the first includes some attention to spatial resolution (the number of pixels). However, here, resolution varies along with other parameters. Neither of these papers consider the simple removal of information from a smaller image compared to a larger one. A more systematic study of resolution can be found in bearicip in which video of varying resolution is parameterized using AAMs AAMs . This work shows that machines can lipread continuous speech with as little as two pixels per lip.

With regard to color, it has been surprisingly underused. In kaucic1998accurate algorithms are derived which contain three key components: shape models, motion models, and focused color feature detectors. In early works it was common to use colored lipstick or markers to help track the lips (tracking remains challenging), but many authors convert the image to grayscale and use grayscale features.

Unit choice refers to the question of whether to use phonemes, visemes, words or something else. Classifiers built on phonemes howell2013confusion , visemes Hazen1027972 , and words shin2011real have all been previously presented. Sometimes the unit choice is linked to the problem: word classifiers often use word units, whereas continuous speech has to use phonemes or visemes. It is essentially a trade-off since using phonemes means accepting that there will be units that do not appear on the lips (the words “bad”, “pad”, and “mad” are usually said to be visually indistinguishable), whereas using visemes leads to better unit accuracy but there is then the problem of homopheny (words that have identical visemic transcriptions but different spellings). One study has reviewed how the unit selection affects recognition in relation to the unit selection of the supporting language model bear2016decoding and has shown that phoneme networks work best for both phoneme and viseme classifiers. However, the practical reality is that many systems use visemes and there is a need to resolve which choice of visemes works best. Comparative studies such as cappelletta2012phoneme have attempted to compare some previous viseme sets but these often only consider a few different sets rather than the full range available.

Lan et al. present in improveVis a comparison of different features first presented in 927467 . Revisited in Matthews_Baker_2004 , AAM features are produced as either model-based (using shape information) or pixel-based (using appearance information). In improveVis Lan et al. observed that state-of-the-art AAM features with appearance parameters outperform other feature types like sieve features, 2D DCT, and eigen-lip features, suggesting appearance is more informative than shape. Also, pixel methods benefit from image normalization to remove shape and affine variation from the region of interest (in this example, the mouth and lips). The method in improveVis classified words with an audio-visual dataset known as RMAV, but recommended creating classifiers with viseme labels for future lipreading work, and advises that most information comes from the inner area of the mouth.

Some works have attempted to adapt features to address different problems, such as motion described above. For example, in seymour2008comparison the authors suggest altering HMM modeling to permit either frozen or occluded frames, and demonstrate that even low level jitter will significantly affect the quality of lip reading features.

When it comes to the choice of classifier technology it is the norm that machine lipreading systems adapt methods from acoustic recognition. This not only follows from the observation that visual and acoustic speech have the same origins but also from the practical observation that language models are expensive to create and it makes sense to re-use the models across the two modalities. The conventional classifier process is 1) data preparation (an acoustic example is creating MFCCs zhu2000use , whereas a visual example might be cappelletta2012phoneme ), 2) build Hidden Markov Model classifiers, and 3) feed the classification outputs through a language network to produce a transcript. Like feature selection, the choice of classifier is affected by the problem at hand. An optimal audio recognizer will not guarantee optimal performance in an audio-visual, or visual-only, domain. In potamianos2004audio , for example, it is noted that their audio-visual results should not be “read across” to lipreading.

More modern deep learning techniques for lipreading are an alternative approach which requires much more training data thangthai2015improving . A key disadvantage of these methods is a lack of understanding about what exactly a neural network is learning in order for it to classify unseen gestures. So often the results from deep learning are good but the scientific insight can be poor. Thus recent work has begun to demonstrate the performance of different deep learning approaches with a variety of neural network architectures. Convolutional neural networks (CNNs) have been particularly prevalent for image classification ( sharif2014cnn ; yan2015hd ) and Long Short-Term Memory networks (LSTMs) perform well on temporal problems (e.g. language modeling sundermeyer2012lstm or scene labeling byeon2015scene ). For lipreading, we have evidence that both of these achieve good recognition rates in end-to-end systems: in Chung2017 a CNN achieves 61.1% top accuracy and in wand2016lipreading an LSTM achieves 79.6% top accuracy on a small dataset. However, lipreading combines these challenges; it is a temporal-visual classification problem.

For lipreading multiple persons, simstruct ; visualvowelpercept detail human lipreading of multiple people: simstruct covers consonant recognition, and visualvowelpercept visual vowel perception. 871073 presents an audio-visual system for HCI which automatically detects a talking person (both spatially and temporally) using video and audio data from a single microphone. In summary there is no reason to think that multi-person lipreading is any less viable than single-person lipreading, although the challenge of variability due to speaker identity is real.

Speaker identity is a major challenge in machine lipreading because visual speech is not consistent across individuals. Sometimes this can be advantageous, as in 607030 where lipreading is used to identify speakers. With known speakers, lipreading recognition rates can be high, but with unknown speakers (referred to as speaker-independent lipreading) performance is as yet not at the same standard as speaker-dependent lipreading. In bear2015speaker results show that classifiers trained and tested on distinct speakers, compared to those trained and tested on the same speakers, are statistically significantly different. This is supported in newman2010speaker where the authors strive to discriminate languages from visual speech and conclude that one way to improve performance would be to move away from speaker-dependent features.

For acoustic speech it is acknowledged that people have different speaking styles, accents and rates of speech. For visual speech there is the additional confusion of what we call a “visual accent” in which very similar sounds can be made by persons with very different mouth shapes – examples of visual accent effects include people who talk out of the side of their mouths, ventriloquists and mimics. The rate of speech alters both utterance duration and articulator positions. Therefore both the sounds produced and, particularly, the visible appearance are altered. In 6854158 , the authors present an experiment which measures the effect of speech rate and shows that the effect is significantly higher on visual speech than on acoustic speech. Anecdotal evidence suggests that a speaker's visual style can evolve as they age, due to co-articulation reduction as a person travels and interacts with other adults bear2016decoding .

In summary, while audio-visual speech processing has a great number of challenges, one of the pivotal ones is the question of the visual units and how they should be derived. Since all language models are defined in terms of phonemes, the practical question is the choice of the mapping from phonemes to visemes. The literature has presented a great number of these phoneme-to-viseme (P2V) mappings but few consistent comparisons between them, so this is the topic of the next section.

3 Comparison of phoneme-to-viseme mappings

A summary of published P2V maps is provided in theobaldPHD Tables 2.3 and 2.4. This list is not exhaustive and these mappings are motivated by: a focus on just consonants binnie1976visual ; fisher1968confusions ; franks1972confusion ; walden1977effects ; being speaker-dependent kricos1982differences ; prioritizing particular visemes owens1985visemes ; or a focus on vowels disney ; montgomery1983physical . These are useful starting points, but for the purpose of this study we would like the phoneme-to-viseme mappings to include all phonemes in the transcript of the dataset, to accurately reflect the range of phonemes used in a full vocabulary. Therefore, some mappings used here are a pairing of two mappings suggested in the literature, e.g. one map for the vowels and one map for the consonants. A full list of the mappings used is in Tables 2 and 3. Of these mappings, the most common are ‘the Disney 12’ disney , the ‘lipreading 18’ by Nichie lip_reading18 , and Fisher’s fisher1968confusions .

Classification Viseme phoneme sets
Bozkurt bozkurt2007comparison {/ei/ //} {/ei/ /e/ /æ/} {//} {/i/ // // /y/} {//}
{// // // //} {/u/ // /w/}
Disney disney {// /h/} {// /i/ /ai/ /e/ //} {/u/} {// // //}
Hazen Hazen1027972 {// // /u/ // // /w/ //} {// //} {/æ/ /e/ /ai/ /ei/}
{// // /i/}
Jeffers jeffers1971speechreading {// /æ/ // /ai/ /e/ /ei/ // /i/ // // //} {// //} {//}
{// // // /u/}
Lee lee2002audio {/i/ //} {/e/ /ei/ /æ/} {// // /ai/ //} {// // //} {// /u/}
Montgomery montgomery1983physical {/i/ //} {/e/ /æ/ /ei/ /ai/} {// // //} {// // //}{//}
{/i/ /hh/} {// //} {/u/ /u/}
Neti neti2000audio {// // // // // // /H/} {/u/ // //} {/æ/ /e/ /ei/ /ai/}
{// /i/ //}
Nichie lip_reading18 {/uw/} {// //} {//} {/i/ // /ay/} {//} {/iy/ /æ/} {/e/ //}
{/u/} {// /ei/}
Table 2: Vowel phoneme-to-viseme maps previously presented in literature.
Classification Viseme phoneme sets
Binnie binnie1976visual {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/k/ /g/} {/w/} {/r/}
{/l/ /n/} {/t/ /d/ /s/ /z/}
Bozkurt bozkurt2007comparison {/g/ /H/ /k/ /N/} {/l/ /d/ /n/ /t/} {/s/ /z/} {/tS/ /S/ /dZ/ /Z/} {/T/ /D/}
{/r/} {/f/ /v/} {/p/ /b/ /m/}
Disney disney {/p/ /b/ /m/} {/w/} {/f/ /v/} {/T/} {/l/} {/d/ /t/ /z/ /s/ /r/ /n/}
{/S/ /tS/ /j/} {/y/ /g/ /k/ /N/}
Finn finn1988automatic {/p/ /b/ /m/} {/T/ /D/} {/w/ /s/} {/k/ /h/ /g/} {/S/ /Z/ /tS/ /j/}
{/y/} {/z/} {/f/} {/v/} {/t/ /d/ /n/ /l/ /r/}
Fisher fisher1968confusions {/k/ /g/ /N/ /m/} {/p/ /b/} {/f/ /v/} {/S/ /Z/ /dZ/ /tS/}
{/t/ /d/ /n/ /T/ /D/ /z/ /s/ /r/ /l/}
Franks franks1972confusion {/p/ /b/ /m/} {/f/} {/r/ /w/} {/S/ /dZ/ /tS/}
Hazen Hazen1027972 {/l/} {/r/} {/y/} {/b/ /p/} {/m/} {/s/ /z/ /h/} {/tS/ /dZ/ /S/ /Z/}
{/t/ /d/ /T/ /D/ /g/ /k/} {/N/} {/f/ /v/}
Heider heider1940experimental {/p/ /b/ /m/} {/f/ /v/} {/k/ /g/} {/S/ /tS/ /dZ/} {/T/} {/n/ /t/ /d/}
{/l/} {/r/}
Jeffers jeffers1971speechreading {/f/ /v/} {/r/ /q/ /w/} {/p/ /b/ /m/} {/T/ /D/} {/tS/ /dZ/ /S/ /Z/}
{/s/ /z/} {/d/ /l/ /n/ /t/} {/g/ /k/ /N/}
Kricos kricos1982differences {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/} {/t/ /d/ /s/ /z/}
{/k/ /n/ /j/ /h/ /N/ /g/} {/l/} {/T/ /D/} {/S/ /Z/ /tS/ /dZ/}
Lee lee2002audio {/d/ /t/ /s/ /z/ /T/ /D/} {/g/ /k/ /n/ /N/ /l/ /y/ /H/} {/dZ/ /tS/ /S/ /Z/}
{/r/ /w/} {/f/ /v/} {/p/ /b/ /m/}
Neti neti2000audio {/l/ /r/ /y/} {/s/ /z/} {/t/ /d/ /n/} {/S/ /Z/ /dZ/ /tS/} {/p/ /b/ /m/}
{/N/ /k/ /g/ /w/} {/f/ /v/} {/T/ /D/}
Nichie lip_reading18 {/p/ /b/ /m/} {/f/ /v/} {/W/ /w/} {/r/} {/s/ /z/} {/S/ /Z/ /tS/ /j/}
{/T/} {/l/} {/k/ /g/ /N/} {/H/} {/t/ /d/ /n/} {/y/}
Walden walden1977effects {/p/ /b/ /m/} {/f/ /v/} {/T/ /D/} {/S/ /Z/} {/w/} {/s/ /z/} {/r/}
{/l/} {/t/ /d/ /n/ /k/ /g/ /j/}
Woodward woodward1960phoneme {/p/ /b/ /m/} {/f/ /v/} {/w/ /r/ /W/}
{/t/ /d/ /n/ /l/ /T/ /D/ /s/ /z/ /tS/ /dZ/ /S/ /Z/ /j/ /k/ /g/ /h/}
Table 3: Consonant phoneme-to-viseme maps previously presented in literature.

In total, eight vowel- and fifteen consonant-maps are identified here and all of these are paired with each other to provide 120 P2V maps to test.
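As a concrete illustration of how the pairings are enumerated, the minimal Python sketch below (using the map names from Tables 2 and 3 as stand-in identifiers) builds every vowel/consonant combination:

```python
from itertools import product

# The eight vowel maps (Table 2) and fifteen consonant maps (Table 3).
vowel_maps = ["Bozkurt", "Disney", "Hazen", "Jeffers",
              "Lee", "Montgomery", "Neti", "Nichie"]
consonant_maps = ["Binnie", "Bozkurt", "Disney", "Finn", "Fisher",
                  "Franks", "Hazen", "Heider", "Jeffers", "Kricos",
                  "Lee", "Neti", "Nichie", "Walden", "Woodward"]

# Every vowel map is paired with every consonant map: 8 x 15 = 120 P2V maps.
p2v_maps = list(product(vowel_maps, consonant_maps))
assert len(p2v_maps) == 120
```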

Recent comparisons between maps include cappelletta2012phoneme and a comparison within theobaldPHD . In theobaldPHD the following reasons are given for discrepancies between classifier sets.

  • Variation between speakers - i.e. speaker identity.

  • Variation between viewers - indicating lipreading ability varies by individuals, those with more practice are better able to identify visemes.

  • The context of the speech presented - context has an influence on how consonants appear on the lips. In real tasks the context enables easier distinction between phonemes that are indistinguishable in syllable-only tests.

  • Clustering criteria - the grouping methods vary between authors. For example, ‘phonemes are said to belong to a viseme if, when clustered, the percent correct identification for the viseme is above some threshold, which is typically between 70 - 75% correct. A stricter grouping criterion has a higher threshold, so more visemes are identified.’theobaldPHD .

These last two points are reinforced by cappelletta2012phoneme who achieved highest accuracy with the phoneme-to-viseme map of Jeffers in an HMM-based lipreading system. They attribute this to the use of continuous speech which encapsulates the same viseme in more contexts within the training data, and suggest that the Jeffers map has better clustering of consonant visemes for those contexts.

In Table 4 we have described the sources and derivation methods for all of the phoneme-to-viseme maps used in our comparison study. We see the majority are constructed using human testing with few test subjects, for example Finn finn1988automatic used only one lipreader, and Kricos kricos1982differences twelve. Data-driven methods are most recent, e.g. Lee’s lee2002audio visemes were presented in 2002 and Hazen’s Hazen1027972 in 2004. The remaining visemes are based around linguistic/phonemic rules.

Author Year Inspiration Description Test subjects
Binnie 1976 Human testing Confusion patterns unknown
Bozkurt 2007 Subjective linguistics Common tri-phones 462
Disney - Speech synthesis Observations unknown
Finn 1988 Human perception Montgomery's visemes and /H/ 1
Fisher 1968 Human testing Multiple-choice intelligibility test 18
Franks 1972 Human perception Confusions among sounds produced in similar articulatory positions unknown
Hazen 2004 Data-driven Bottom-up clustering 223
Heider 1940 Human perception Confusions post-training unknown
Jeffers 1971 Linguistics Sensory and cognitive correlates unknown
Kricos 1982 Human testing Hierarchical clustering 12
Lee 2002 Data-driven Merging of Fisher visemes unknown
Montgomery 1983 Human perception Confusion patterns 10
Neti 2000 Linguistics Decision tree clusters 26
Nichie 1912 Human observations Human observation of lip movements unknown
Walden 1977 Human testing Hierarchical clustering 31
Woodward 1960 Linguistics Language rules and context unknown
Table 4: A comparison of literature phoneme-to-viseme maps.

As an example, the clustering method of Hazen Hazen1027972 involved bottom-up clustering using maximum Bhattacharyya distances bhattachayya1943measure to measure similarity between the phoneme-labeled Gaussian models. Before clustering, certain pairs of similar phonemes were manually merged.
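The exact merging criterion is not reproduced here, but as a rough sketch of the kind of similarity measure involved, the Bhattacharyya distance between two Gaussians can be computed as follows (a generic formulation in Python/numpy, not the code used in Hazen1027972 ):

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussian models."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.atleast_2d(cov1).astype(float), np.atleast_2d(cov2).astype(float)
    cov = 0.5 * (cov1 + cov2)                     # average covariance
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov

# Example: two phoneme-labeled Gaussians over a toy two-dimensional feature space.
d = bhattacharyya_distance([0.0, 0.0], np.eye(2), [1.0, 0.5], 1.5 * np.eye(2))
print(round(float(d), 3))
```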

A P2V map may be summarized by a ratio we call the “compression factor”,

$CF = \frac{|V|}{|P|}$,   (1)

which is the ratio of the number of output visemes, $|V|$, to the number of input phonemes, $|P|$. The compression factors for the P2V maps are listed in Table 5. Silence and garbage visemes are not included in the compression factors.
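As a trivial worked example of equation (1), here is the calculation for two of the consonant maps listed in Table 5 (a sketch only; some table values appear to be truncated rather than rounded):

```python
def compression_factor(n_visemes: int, n_phonemes: int) -> float:
    """CF = |V| / |P|, silence and garbage visemes excluded."""
    return n_visemes / n_phonemes

print(round(compression_factor(6, 24), 2))   # Lee consonants, 6:24    -> 0.25
print(round(compression_factor(10, 21), 2))  # Hazen consonants, 10:21 -> 0.48
```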

Consonant Map V:P CF Vowel Map V:P CF
Woodward 4:24 0.16 Jeffers 3:19 0.16
Disney 6:22 0.18 Neti 4:20 0.20
Fisher 5:21 0.23 Hazen 4:18 0.22
Lee 6:24 0.25 Disney 4:11 0.36
Franks 5:17 0.29 Lee 5:14 0.36
Kricos 8:24 0.33 Bozkurt 7:19 0.37
Jeffers 8:23 0.35 Montgomery 8:19 0.42
Neti 8:23 0.35 Nichie 9:15 0.60
Bozkurt 8:22 0.36 - - -
Finn 10:23 0.43 - - -
Walden 9:20 0.45 - - -
Binnie 9:19 0.47 - - -
Hazen 10:21 0.48 - - -
Heider 8:16 0.50 - - -
Nichie 18:33 0.54 - - -
Table 5: Compression factors for viseme maps previously presented in literature.

Because we have a British English dataset and some works were formulated using American English diacritics diacritics , we omit a small number of phonemes from some mappings (Disney disney , Bozkurt bozkurt2007comparison , Hazen Hazen1027972 , and Jeffers jeffers1971speechreading ). Moreover, Kricos provides speaker-dependent visemes kricos1982differences ; these have been generalized for our tests using the most common mixtures of phonemes. Where a viseme map does not include phonemes present in the ground truth transcript, these are grouped into a single garbage viseme. Note that all phonemes in each P2V map are in the dataset but no mapping includes all 29 phonemes in the AVL2 vocabulary.
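A minimal sketch of how a phoneme-to-viseme lookup with such a garbage class could be assembled is shown below; the viseme labels v01, v02, ... and the /gar/ garbage symbol are illustrative placeholders, and only a fragment of a consonant map is used:

```python
def build_p2v(viseme_groups, transcript_phonemes, garbage="/gar/"):
    """Map each phoneme to exactly one viseme; unmapped phonemes share a garbage viseme."""
    p2v = {}
    for i, group in enumerate(viseme_groups, start=1):
        for phoneme in group:
            p2v[phoneme] = f"v{i:02d}"
    for phoneme in transcript_phonemes:
        p2v.setdefault(phoneme, garbage)   # phonemes the published map does not cover
    return p2v

# Illustrative fragment of a consonant map.
groups = [["/p/", "/b/", "/m/"], ["/f/", "/v/"], ["/w/", "/r/"]]
p2v = build_p2v(groups, transcript_phonemes={"/p/", "/b/", "/m/", "/f/", "/v/", "/w/", "/r/", "/k/"})
print(p2v["/k/"])   # -> /gar/
```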

3.1 Data preparation

The AVLetters2 (AVL2) dataset cox2008challenge is used to train and test HMM classifiers based upon our 120 P2V mappings with HTK htk34 . AAM features (concatenated as in (4)) are used as they are known to outperform other feature methods in machine lipreading cappelletta2012phoneme . AVL2 cox2008challenge is an HD version of the AVLetters dataset matthews1998nonlinear . It is a single-word dataset of five male British English speakers reciting the alphabet seven times. We use four of these speakers, as the fifth tracked too poorly to have confidence in lipreading accuracy. The speakers in this dataset are illustrated in yogiThesis . AVL2 has 28 videos of varying durations. As the dataset provides isolated words of single letters, it lends itself to controlled experiments without needing to address matters such as varying co-articulation.

Speaker Shape Appearance Combined
S1 11 27 38
S2 9 19 28
S3 9 17 25
S4 9 17 25
Table 6:

The number of parameters in shape, appearance and combined shape & appearance AAM features for each speaker in the AVLetters2 dataset. Features retain 95% variance of facial information.

Table 6 describes the features extracted from the AVL2 videos. These features have been derived after tracking a full-face Active Appearance Model throughout the video before extracting features containing only the lip area. Therefore, they contain information representing only the speaker’s lips and none of the rest of the face. Speakers 2, 3 and 4 are similar in number of parameters contained in the features. The combined features are the concatenation of the shape and appearance features Matthews_Baker_2004 . All features retain 95% variance of facial shape and appearance information.
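The parameter counts in Table 6 come from truncating the PCA to the components that retain 95% of the variance; the sketch below shows that truncation generically with scikit-learn on synthetic stand-in data (it is not the authors' AAM pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: rows are video frames, columns are concatenated AAM feature values.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))                       # a low-dimensional "true" signal
frames = latent @ rng.normal(size=(10, 80)) + 0.01 * rng.normal(size=(500, 80))

# Keep the smallest number of components explaining at least 95% of the variance.
pca = PCA(n_components=0.95)
params = pca.fit_transform(frames)
print(params.shape[1], "parameters retained")
```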

Figure 1: Occurrence frequency of phonemes in the RMAV dataset.

The RMAV dataset consists of 20 British English speakers (we use 12 speakers, seven male and five female, who have been tracked to maintain comparability with earlier work), with 200 utterances per speaker of a subset of the Resource Management (RM) context-independent sentences from fisher1986darpa , which totals around 1000 words each. The sentences are selected to maintain a good coverage of all phonemes lan2010improving and to represent the coverage of phonemes in spoken speech. The original videos were recorded in high definition and in a full-frontal position. Individual speakers are tracked using Active Appearance Models Matthews_Baker_2004 and AAM features of concatenated shape and appearance information have been extracted.

Figure 1 plots the frequency of all phonemes within the RMAV dataset over 200 sentences and Table 7 lists the number of parameters of shape, appearance, and combined shape and appearance AAM features where the features retain 95% variance of facial information.

Speaker Shape Appearance Combined
S1 13 46 59
S2 13 47 60
S3 13 43 56
S4 13 47 60
S5 13 45 58
S6 13 47 60
S7 13 37 50
S8 13 46 59
S9 13 45 58
S10 13 45 58
S11 14 72 86
S12 13 45 58
Table 7: The number of parameters of shape, appearance, and combined shape and appearance AAM features for the RMAV dataset speakers. Features retain 95% variance of facial information.

3.2 Classification method

The method for these speaker-dependent classification tests on our combined shape and appearance features uses HMM classifiers built with HTK htk34 . The features selected are from the AVL2 and RMAV datasets. The videos are tracked with a full-face AAM (Figure 2, left) and the features extracted consist of only the lip information (Figure 2, right). This means that we obtain a robust tracking from the full-face model, then, using this fit information, we apply a sub-active appearance model of only the lips. The HMM classifiers are based upon viseme labels within each P2V map. A ground truth for measuring correct classification is a viseme transcription produced using the BEEP British English pronunciation dictionary beep and a word transcription. The phonetic transcript is converted to a viseme transcript assuming the visemes in the mapping being tested (Tables 2 and 3). We test using a leave-one-out seven-fold cross validation. Seven folds are selected as we have seven utterances of the alphabet per speaker in AVL2; this is increased to 10-fold cross-validation for RMAV speakers. The HMMs are initialized using ‘flat start’ training, re-estimated eight times and then force-aligned using HTK’s HVite. Training is completed by re-estimating the HMMs three more times with the force-aligned transcript.
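The HTK training commands themselves are not reproduced here, but the fold structure is straightforward; the sketch below shows the seven-fold leave-one-out split over the seven alphabet recitations of one AVL2 speaker (file names are hypothetical):

```python
def leave_one_out_folds(utterances):
    """Yield (train, test) splits in which each utterance is held out exactly once."""
    for i in range(len(utterances)):
        yield utterances[:i] + utterances[i + 1:], [utterances[i]]

recitations = [f"alphabet_{k:02d}" for k in range(1, 8)]   # seven recitations per speaker
for fold, (train, test) in enumerate(leave_one_out_folds(recitations), start=1):
    # In the experiments the HMMs are flat-started and re-estimated on `train`,
    # force-aligned with HVite, then scored on `test`.
    print(f"fold {fold}: train on {len(train)} recitations, test on {test[0]}")
```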

3.3 Active appearance models

An example full-face shape model is shown in Figure 2; a subset of its landmarks model the inner and outer lip contours.

Figure 2: Example Active Appearance Model shape mesh (left), a lips only model is on the right.

The shape of an AAM is the collection of coordinates of the $n$ vertices (landmarks) which make up a mesh,

$s = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T$.   (2)

These landmarks are aligned and normalized via Procrustes analysis procrustes and then analyzed via a Principal Component Analysis (PCA) to give

$s = s_0 + \sum_{i=1}^{m} p_i s_i$,   (3)

where $s_0$ is the mean shape, $p_i$ are the coefficient shape parameters, and $s_i$ are the eigenvectors of the covariance matrix associated with the $m$ largest eigenvalues Matthews_Baker_2004 .
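Equations (2) and (3) describe a simple linear model, so generating a shape from a parameter vector is a one-line operation; the numpy sketch below uses arbitrary stand-in values rather than a real trained shape model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_landmarks, n_modes = 30, 5                      # arbitrary sizes for illustration

s0 = rng.normal(size=2 * n_landmarks)             # mean shape (x1, y1, ..., xn, yn)
S = rng.normal(size=(n_modes, 2 * n_landmarks))   # shape eigenvectors, one per row
p = np.array([0.5, -0.2, 0.1, 0.0, 0.3])          # shape parameters p_i

s = s0 + p @ S                                    # Eq. (3): s = s0 + sum_i p_i * s_i
print(s.shape)                                    # (60,) -> 30 (x, y) landmark pairs
```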

Having built an Active Shape Model, the next step is to augment it with appearance data and hence compute an Active Appearance Model (AAM). Each shape model is used to warp the image data back to the mean shape. The appearance of those warped images is now modeled again using PCA 927467 ,

$A(x) = A_0(x) + \sum_{i=1}^{l} \lambda_i A_i(x)$,   (4)

where $\lambda_i$ are the appearance parameters, $A_0(x)$ is the shape-free mean appearance, and $A_i(x)$ are the appearance image eigenvectors of the covariance matrix.

Usually the best results are obtained using both shape and appearance information combined within a single AAM 982900 ; 927467 . Therefore, unless explicitly stated otherwise, we use these. Once an AAM is built and trained, we fit the model using the Inverse Compositional algorithm inversecompAlg to all frames in the video sequence Matthews_Baker_2004 .

3.4 Comparison of current phoneme-to-viseme maps

Recognition performance of the HMMs can be measured by both correctness, $C$, and accuracy, $A$,

$C = \frac{N - D - S}{N}$,   (5)

$A = \frac{N - D - S - I}{N}$,   (6)

where $S$ is the number of substitution errors, $D$ is the number of deletion errors, $I$ is the number of insertion errors and $N$ is the total number of labels in the reference transcriptions htk34 . An insertion error (insertions are notoriously common in lipreading hazen2006automatic ) occurs when the recognizer output contains extra words/visemes missing from the original transcript htk34 . As an example, one could say “Once upon a midnight dreary”, but the recognizer outputs “Once upon upon midnight dreary dreary”. Here the recognizer has inserted two words which were never present and has deleted one. (Once this utterance has been translated to viseme labels rather than words, using Montgomery’s visemes as an example, this sentence becomes “v09 v12 v04 v05 - v12 v01 v12 v04 - v12 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16”, where hyphens are included to show breaks between words. In this case, the same insertion errors would create the predicted output “v09 v12 v04 v05 - v12 v01 v12 v04 - v12 v01 v12 v04 - v01 v10 v04 v11 v04 - v04 v07 v16 v07 v16 - v04 v07 v16 v07 v16”.)
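A small sketch of the two measures applied to the example above (the helper functions are illustrative, not HTK code): the reference has N = 5 words, with one deletion, no substitutions and two insertions.

```python
def correctness(n, d, s):
    """C = (N - D - S) / N, as in Eq. (5)."""
    return (n - d - s) / n

def accuracy(n, d, s, i):
    """A = (N - D - S - I) / N, as in Eq. (6); insertions are additionally penalised."""
    return (n - d - s - i) / n

# "Once upon a midnight dreary" recognised as "Once upon upon midnight dreary dreary".
print(correctness(5, 1, 0))    # 0.8
print(accuracy(5, 1, 0, 2))    # 0.4
```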

In this experiment, classification performance of the HMMs is measured by correctness, (5), as there are no insertion errors to consider htk34 . It is acknowledged that word classification does not perform as well as viseme classification. However, as each viseme set being tested has a different number of phonemes and visemes, words are used as the common unit so that we can compare different viseme sets. It is the difference between each set, rather than the individual performance, which is of interest in this investigation.

Figure 3: Speaker-dependent all-speaker mean word classification correctness, $C$, comparing viseme classes on isolated word speech (top) and continuous speech (bottom).

Figure 3 shows the correctness of each pair of viseme sets. On the top is the isolated word case (the AVL2 data) and on the bottom the continuous data (RMAV). Each diagram is ordered by the mean correctness over all speakers. For the isolated words the Lee vowel and consonant sets lee2002audio are the best with the Montgomery vowels montgomery1983physical and Hazen consonants Hazen1027972 close behind. The worst performers are the Disney vowels disney and the Franks franks1972confusion and Woodward consonants woodward1960phoneme . For continuous speech the Disney vowels are the best performer disney as are the Woodward consonants woodward1960phoneme . It is notable that for continuous speech the more heavily compressed viseme sets (those with fewer visemes) work better than those with larger numbers of visemes. The most likely explanation is that continuous speech has additional variability due to co-articulation, so a few coarsely defined visemes are better than a greater number of finely defined ones.

Figure 4: Speaker-independent all-speaker mean word classification, $C$. For a given mapping (x-axis) the performance is measured after pairing with all vowel mappings (left) and vice versa on the right, on AVL2 isolated words (top) and RMAV continuous speech (bottom).

Figure 4 shows the mean word correctness, $C$, over all speakers, for pairings of vowel and consonant maps ordered by correctness from left to right. Again, isolated word results (the AVL2 data) are at the top and continuous results (RMAV) on the bottom. As previously, for isolated words, the Disney vowels are significantly worse than all others when paired with the consonant maps, measured over the whole group. The Lee lee2002audio , Montgomery montgomery1983physical and Bozkurt bozkurt2007comparison vowels are consistently above the mean and above the upper error bar for the Disney disney , Jeffers jeffers1971speechreading and Hazen Hazen1027972 vowels. In comparing the consonants, Lee lee2002audio and Hazen Hazen1027972 are the best whereas Woodward woodward1960phoneme and Franks franks1972confusion are the bottom performers. There is a significant difference between the ‘best’ visemes for individual speakers, which arises from the unique way in which everyone articulates their speech.

The continuous speech experiment results in Figure 4 (bottom) show that, for vowel visemes, the Disney set surpasses all others, whereas Woodward’s consonants are now a better fit. This is interesting as neither viseme set is data-derived. We recall that Disney’s disney are designed from human perception for the synthesis of characters, and Woodward’s woodward1960phoneme are from a pilot investigation into phoneme perception in lipreading using linguistic rules. As we move to more realistic data, continuous speech, many of the data-driven approaches degrade, which implies that the data used to derive these visemes was unrealistic. For example the Lee visemes lee2002audio were derived without any use of video data at all, so it is hardly surprising that they are fragile when presented with more realistic data.

The idea that vowel and consonant visemes should be treated differently is no surprise. The suggestion that vowel visemes are essentially mouth shapes and that the consonants govern how we move in and out of them was first presented in 1912 by Nichie, a profoundly deaf educator, from human observations lip_reading18 , and is supported by results in bear2014phoneme which show we should not mix vowel and consonant visemes for best results. Therefore, it is reassuring to see that the better speaker-independent phoneme-to-viseme mapping for continuous speech is a combination of two previous maps, where the two maps have differing derivation methods: perception and language rules.

Generally speaking the continuous case (bottom of Figure 4) gives improved accuracies compared to the isolated word case (top of Figure 4). The first response to explain this is to suggest the increase is caused by better training of classifiers with the greater volume of training samples in RMAV than in AVL2. However, we should note that this effect is marginally countered by the co-articulation effects in continuous speech, so a set of classifiers trained on a similarly large isolated word dataset would be expected to show an even greater increase in recognition over AVL2.

Figure 5: Critical difference of all phoneme-to-viseme maps independent of phoneme-to-viseme pair partner. Vowel maps are on the left side, consonants on the right. Isolated words are in the top row, and continuous speech along the bottom row.

Figure 5 shows critical difference plots between the viseme class sets based upon their classification performance criticaldiff with both isolated word and continuous speech training. Critical difference is a measure of the confidence intervals between different machine learning algorithms derived from Friedman tests on the ranked scores (here, the word correctness scores). Two assumptions within critical difference are: all measured results are ‘reliable’, and all algorithms are evaluated using the same random samples criticaldiff . As we use the HTK standard metrics young2006htk , and use results with consistent random sampling across folds, these assumptions are not a concern. We have selected critical differences here as these evaluate the performance of multiple classifiers on different datasets, whereas alternatives such as bouckaert2004evaluating ; bengio2004no often require paired data or identical datasets.
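As a generic sketch of the underlying test (not the authors' critical-difference code), scipy's Friedman test can be applied to per-speaker correctness scores for several competing viseme sets; the post-hoc step that yields the critical-difference diagram is only indicated in a comment:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows are speakers (the repeated measurements); columns are competing viseme sets.
rng = np.random.default_rng(2)
scores = rng.uniform(0.2, 0.6, size=(12, 4))   # stand-in correctness values

stat, p_value = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))
print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.3f}")
# A post-hoc test (e.g. Nemenyi) over the mean ranks then gives the critical differences.
```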

Figure 5 shows a significant difference between some sub-sets of visemes. This is shown by the horizontal bars which do not overlap all viseme sets. Where the horizontal bars do overlap, this shows the viseme sets are indistinguishable at a 95% confidence. When comparing isolated words with continuous speech we see fewer significant differences with continuous speech despite there being more test data.

Table 8 summarises the best-performing visemes (consonant and vowel) for the isolated and continuous word data. The first column shows that the Lee consonants are the best performing for isolated words, but also that Hazen, Nichie, Neti, etc. are indistinguishable from Lee (they lie within Lee’s critical difference). For continuous speech, the Woodward consonant visemes are the best but Fisher, Franks, Disney, etc. are indistinguishable. The viseme sets common to both isolated words and continuous speech are Lee, Hazen, Finn and Fisher. For the vowels (second column) there are no common sets. However if we look at best and second-best (the third column of Table 8) then Hazen and Neti emerge as common.

Isolated words:
  First position consonants: Lee, Hazen, Nichie, Neti, Walden, Jeffers, Kricos, Binnie, Finn, Bozkurt, Fisher
  First position vowels: Lee, Montgomery, Nichie, Bozkurt
  Second position vowels: Montgomery, Nichie, Bozkurt, Hazen, Neti
Continuous speech:
  First position consonants: Woodward, Fisher, Franks, Disney, Lee, Heider, Hazen, Finn
  First position vowels: Disney, Jeffers, Hazen
  Second position vowels: Jeffers, Hazen, Neti
Table 8: Critically different viseme sets change with isolated word and continuous speech data. Sets are listed in the order they appear in Figure 5.

Looking across all sets, the common method that performs near the top is that due to Hazen Hazen1027972 . Interestingly these visemes were derived using the most realistic data (an audio-visual corpus based on TIMIT) and formed by a tree-based clustering of phoneme-trained HMMs. Note that the Hazen visemes were derived from American English data whereas here we use British English speakers.

The effectiveness of each mapping as a function of compression factor is presented in Figure 6. The two plots representing continuous speech (bottom of Figure 6) show improving performance with decreasing compression factor – we speculated earlier that the coarser visemes were better able to handle co-articulation. For the isolated word case (top) there is little difference. Very roughly, the best performing methods appear to have around 2 to 4 phonemes per viseme.

Figure 6: Scatter plots showing the relationship between compression factor (x-axes) and word correctness classification, $C$, (y-axes) with consonant phoneme-to-viseme maps (left) and vowel phoneme-to-viseme maps (right); isolated word results are at the top, and continuous speech along the bottom.

So far we have seen that there are noticeable differences between classification performances associated with a variety of viseme sets in the literature. Given that quite a few of the viseme sets are incremental improvements on previous sets, it is good to see confirmation that these sets have rather similar performance. We have identified the best sets for the various conditions and have used critical difference plots to explain the similarity between methods. We have identified that the most robust methods seem to be based on clustering large amounts of data, but a question arises when it comes to individual speakers – is it viable to create viseme sets per speaker and, if so, how similar are they? This is the topic of the next section.

4 Encoding speaker-dependent visemes

In the second part of our phoneme-to-viseme mapping study, two approaches are used to find a better method of mapping phonemes to visemes. These approaches are both speaker-dependent and data-driven from phoneme classification. Two cases are considered:

  1. a strictly coupled map, where a phoneme can be grouped into a viseme only if it has been confused with all the phonemes within the viseme, and

  2. a relaxed coupled case, where a phoneme can be grouped into a viseme if it has been confused with any phoneme within the viseme.

With all new P2V mappings each phoneme can be allocated to only one viseme class. These new P2V maps are tested on the AVL2 dataset using the same classification method as described in Section 3.2. The results from the best performing P2V maps from our comparison study (Lee lee2002audio for isolated words, or Woodward woodward1960phoneme and Disney disney for continuous speech) are the benchmark against which improvements are measured with respect to the training data.

4.1 Viseme classes with strictly confusable phonemes

Our approaches for identifying visemes are speaker-dependent, data-driven and based on phoneme confusions within the classifier. The idea of speaker-dependent visemes is not new visualvowelpercept ; bear2015speaker but our algorithm is, and in conjunction with the fixed outputs available from HTK it enables easy reuse. The first undertaking in this work is to complete classification using phoneme-labeled HMM classifiers. The classifiers are built in HTK with flat-start HMMs and force-aligned training data for each speaker. The HMMs are re-estimated 11 times in total over seven folds of leave-one-out cross validation. This overall classification task does not perform well (see Table 9), particularly for an isolated word dataset. However, the HTK tool HResults is used to output a confusion matrix for each fold detailing which phoneme labels confuse with others and how often.

Speaker 1 Speaker 2 Speaker 3 Speaker 4
Phoneme
Table 9: Mean per-speaker correctness, $C$, of phoneme-labeled HMM classifiers.

For both data-driven speaker-dependent approaches, this first step of completing phoneme classification is essential to create the data from which to derive the P2V maps. This is completed for each speaker in both the AVL2 and RMAV datasets. Now, let us use a smaller seven-unit confusion matrix example, as in Table 10, to explain our clustering method.

1 0 0 0 0 0 4
0 0 0 2 0 0 0
1 0 0 0 0 0 1
0 2 1 0 2 0 0
3 0 1 1 1 0 0
0 0 0 0 0 4 0
1 0 3 0 0 0 1
Table 10: Demonstration confusion matrix showing confusions between phoneme-labeled classifiers, to be used for clustering to create new speaker-dependent visemes. True positive classifications are shown in red; confusions (false positives and false negatives) are shown in blue. The estimated classes are listed horizontally and the real classes vertically.

For the ‘strictly-confused’ viseme set (remembering there is one per speaker), the second step of deriving the P2V map is to check for single-phoneme visemes. Any phonemes which have only been correctly recognized and have no false positive/negative classifications are permitted to be single-phoneme visemes. In Table 10 we have highlighted the true positive classifications in red and both false positive and false negative classifications in blue, which shows that the sixth phoneme is the only phoneme to fit our ‘single-phoneme viseme’ definition: it has a true positive value of 4 and zero false classifications. Therefore this becomes our first viseme. This action is followed by defining all combinations of remaining phonemes which can be grouped into visemes and identifying the grouping that contains the largest number of confusions, by ordering all the viseme possibilities by descending size (Table 11).

Table 11: List of all possible subgroups of phonemes with an example set of seven phonemes

Our grouping rule states that phonemes can be grouped into a viseme class only if all of the phonemes within the candidate group are mutually confusable. This means each pair of phonemes within a viseme must have a total false positive and false negative classification greater than zero. Once a phoneme has been assigned to a viseme class it can no longer be considered for grouping, and so any possible phoneme combinations that include this viseme are discarded. This ensures phonemes can belong to only a single viseme.

By iterating through our list of all possibilities in order, we check if all the phonemes are mutually confused. This means all phonemes have a positive confusion value (a blue value in Table 10) with all others. The first phoneme possibility in our list where this is true is the group of the first, third and seventh phonemes of Table 10. This is confirmed by the Table 10 values: the confusion between phonemes 1 and 3 is 0 + 1 = 1; also, between phonemes 1 and 7 it is 4 + 1 = 5; and between phonemes 3 and 7 it is 1 + 3 = 4. This becomes our second viseme and thus our current viseme list looks like Table 12.

Viseme Phonemes
v01 {phoneme 6}
v02 {phonemes 1, 3, 7}
Table 12: Demonstration example 1: first iteration of clustering, a phoneme-to-viseme map for strictly-confused phonemes.

We now have only three remaining phonemes to cluster: phonemes 2, 4 and 5. This reduces our list of possible combinations substantially, see Table 13.

Table 13: List of all possible subgroups of phonemes with an example set of seven phonemes after the first viseme is formed.

The next iteration of our clustering algorithm identifies the combination of remaining phonemes which corresponds to the next largest number of confusions, and so on, until no phonemes can be merged. In our demonstration, phonemes 2 and 4 merge into a viseme, leaving phoneme 5 as a single-phoneme viseme. This leaves us with the final visemes in Table 14.

Viseme Phonemes
v01 {phoneme 6}
v02 {phonemes 1, 3, 7}
v03 {phonemes 2, 4}
v04 {phoneme 5}
Table 14: Demonstration example 2: final phoneme-to-viseme map for strictly-confused phonemes.
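The grouping procedure can be sketched in a few lines of Python. The sketch below is one reading of the rule described above, run on the Table 10 matrix, with phonemes referred to by their row/column index (1–7) since the original labels are not reproduced here:

```python
from itertools import combinations
import numpy as np

# Table 10: rows are the real phonemes, columns the estimated (recognised) phonemes.
conf = np.array([[1, 0, 0, 0, 0, 0, 4],
                 [0, 0, 0, 2, 0, 0, 0],
                 [1, 0, 0, 0, 0, 0, 1],
                 [0, 2, 1, 0, 2, 0, 0],
                 [3, 0, 1, 1, 1, 0, 0],
                 [0, 0, 0, 0, 0, 4, 0],
                 [1, 0, 3, 0, 0, 0, 1]])

def confusion(i, j):
    """Total false positives plus false negatives between phonemes i and j (0-indexed)."""
    return conf[i, j] + conf[j, i]

phonemes = set(range(len(conf)))

# Step 1: phonemes recognised correctly with no confusions become single-phoneme visemes.
visemes = [{p} for p in phonemes
           if conf[p, p] > 0 and all(confusion(p, q) == 0 for q in phonemes - {p})]
remaining = phonemes - set().union(*visemes)

# Step 2: repeatedly take the largest mutually-confused group (ties broken by
# the total amount of confusion within the group) until nothing can be merged.
while True:
    candidates = [set(c) for r in range(len(remaining), 1, -1)
                  for c in combinations(sorted(remaining), r)
                  if all(confusion(i, j) > 0 for i, j in combinations(c, 2))]
    if not candidates:
        break
    best = max(candidates, key=lambda c: (len(c), sum(confusion(i, j)
                                                      for i, j in combinations(c, 2))))
    visemes.append(best)
    remaining -= best

# Leftover phonemes stay as single-phoneme visemes (the relaxed pass re-allocates them).
visemes += [{p} for p in sorted(remaining)]
print([sorted(p + 1 for p in v) for v in visemes])   # [[6], [1, 3, 7], [2, 4], [5]]
```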

Our original phoneme classification has produced confusion matrices which permit confusions between vowel and consonant phonemes. We can see from the previously presented P2V maps in Section 3 (Tables 2 and 3) that vowel and consonant phonemes are not commonly mixed within visemes. Therefore, we make two types of P2V map: one which permits vowel and consonant phonemes to be mixed within the same viseme, and a second which restricts visemes to be vowel-only or consonant-only by adding an extra condition when checking for confusions greater than zero.

It should be remembered that not all phonemes present in the ground truth transcripts will have been recognized and included in the phoneme confusion matrix. Any of the remaining phonemes which have not been assigned to a viseme are grouped into a single garbage viseme. This approach ensures any phonemes which have been confused are grouped into a viseme and we do not lose any of the rarer, less common visual phonemes. For example, four phonemes are not in the original transcript and so can be placed into the garbage viseme. But for Speaker 2 the garbage viseme also contains two further phonemes, and for Speaker 4 another two, as these do not show up in the speaker’s phoneme classification outputs. This task has been undertaken for all four speakers in our dataset. The final P2V maps are shown in Table 15.

Classification P2V mapping - permitting mixing of vowels and consonants
Speaker1 {// /ai/ /i/ /n/ //} {/b/ /e/ /ei/ /y/ } {/d/ /s/} {/tS/ /l/} {// /v/}
(CF:0.48) {/w/} {/f/} {/k/} {// /v/} {/dZ/ /z/} {// /u/} {/t/}
Speaker2 {// /ai/ /ei/ /i/ /s/} {/e/ /v/ /w/ /y/} {/l/ /m/ /n/} {/b/ /d/ /p/}
(CF: 0.44) {/z/} {tS/} {/t/} {//} {/dZ/ /k/} {// /f/} {// /u/}
Speaker3 {/ei/ /f/ /n/} {/d/ /t/ /p/} {/b/ /s/} {/l/ /m/} {// /e/} {/i/} {/u/}
(CF: 0.68) {//} {/dZ/} {//} {/z/} {/y/} {/tS}/ {/ai/} {//} {//} {/dZ/} {//}
{/k/ /w/} {/v/} {/z/}
Speaker4 {// /ai/ /i/ /ei/ } {/m/ /n/} {// /e/ /p/} {/k/ /w/} {/d/ /s/} {/dZ/ /t/}
(CF: 0.64) {/f/} {/v/} {//} {/z/} {/tS/} {/b/} {//} {//} {/l/} {/u/} {/b/}
Classification P2V mapping - restricting mixing of vowels and consonants
Speaker1 {// /i/ // /u/} {// /ei/} {// /e/ /ei/} {/d/ /s/ /t/ } {/tS/ /l/ } {/k/}
(CF:0.50) {/z/} {/w/} {/f/} {/m/ /n/} {/dZ/ /v/} {/b/ /y/}
Speaker2 {/ai/ /ei/ /i/ /u/} {//} {//} {/e/} {//} {//} {/v/ /w/} {/dZ/ /p/ /y/}
(CF: 0.58) {/d/ /b/} {/t/} {/k/} {/tS/} {/l/ /m/ /n/} {/f/ /s/}
Speaker3 {/ei/ /i/} {/ai/} {// /e/} {//} {/d/ /p/ /t/} {/l/ /m/} {/k/ /w/} {/v/}
(CF: 0.68) {/tS/} {//} {/y/} {/u/} {//} {/z/} {/f/ /n/} {/b/ /s/} {/dZ/}
Speaker4 {// /ai/ /i/ /ei/} {// /e/} {/m/ /n/} {/k/ /l/} {/dZ/ /t/} {/d/ /s/} {/tS/}
(CF: 0.65) {//} {/y/} {/u/} {//} {/w/} {/f/} {/v/} {/b/}
Table 15: Strictly-confused phoneme speaker-dependent visemes. The score in brackets is the compression factor. The visemes permitting mixed vowels and consonants are listed on top, the split (vowel-only/consonant-only) visemes are listed at the bottom.

4.2 Viseme classes with relaxed confusions between phonemes

A disadvantage of the strictly confusable viseme set is that it contains some spurious single-phoneme visemes where the phoneme cannot be grouped because it is not confused with all other phonemes in the viseme. These types of phonemes are likely to be either: borderline cases at the extremes of a viseme cluster, i.e. they have subtle visual similarities to more than one phoneme cluster, or they do not occur frequently enough in the training data to be differentiated from other phonemes.

Viseme Phonemes
v01 {phoneme 6}
v02 {phonemes 1, 3, 5, 7}
v03 {phonemes 2, 4}
Table 16: Demonstration example 3: final phoneme-to-viseme map for relaxed-confused phonemes.

To address this we complete a second pass through the strictly-confused visemes listed in Table 14. We begin with the visemes as they currently stand (in our demonstration example, four classes) and relax the condition requiring confusion with all of the phonemes. Now any single-phoneme viseme (in our demonstration, phoneme 5) can be allocated to a previously existing viseme if it has been confused with any phoneme in that viseme. In Table 10 we see phoneme 5 was confused with phonemes 1, 3 and 4. Because phoneme 4 is not in the same viseme as phonemes 1 and 3, we use the value of confusion to decide which viseme to allocate it to, as follows.

The total confusion between phoneme 5 and the viseme containing phonemes 1, 3 and 7 is 3 + 1 + 0 = 4, whereas the total confusion with the viseme containing phonemes 2 and 4 is 0 + 3 = 3. We select the viseme with the most confusion to incorporate the unallocated phoneme 5. This reduces the number of viseme classes by merging single-phoneme visemes from Table 14 to form a second set shown in Table 16. This has the added benefit that we have also increased the number of training samples for each classifier.
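A sketch of this relaxed re-allocation on the same demonstration values follows; the pairwise confusion totals are taken from Table 10 and the function names are illustrative:

```python
# Non-zero pairwise confusions from Table 10 (1-indexed phoneme pairs).
pair_confusion = {(1, 3): 1, (1, 5): 3, (1, 7): 5, (2, 4): 4,
                  (3, 4): 1, (3, 5): 1, (3, 7): 4, (4, 5): 3}

def confusion(i, j):
    return pair_confusion.get((min(i, j), max(i, j)), 0)

def relax(visemes, singletons):
    """Attach each single-phoneme viseme to the existing viseme it confuses with most."""
    merged = [set(v) for v in visemes]
    for p in singletons:
        totals = [sum(confusion(p, q) for q in v) for v in merged]
        if max(totals) > 0:                        # only merge where some confusion exists
            merged[totals.index(max(totals))].add(p)
    return merged

# Phoneme 5's total confusion with {1, 3, 7} is 3 + 1 + 0 = 4, versus 0 + 3 = 3 with {2, 4},
# so the relaxed pass assigns it to the first viseme.
print(relax([{1, 3, 7}, {2, 4}], singletons=[5]))   # [{1, 3, 5, 7}, {2, 4}]
```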

Bear1: mixed vowels and consonants, strict confusion of phonemes.
Bear2: split vowels and consonants, strict confusion of phonemes.
Bear3: mixed vowels and consonants, relaxed confusion of phonemes.
Bear4: split vowels and consonants, relaxed confusion of phonemes.
Table 17: The four variations on speaker-dependent phoneme-to-viseme maps derived from phoneme confusion in phoneme classification.

Remember, as we have two versions of Table 14 – one with mixed vowel and consonant phonemes and a second with divided vowel and consonant phonemes – the same still applies to our relaxed-confused viseme sets. This means we end up with four types of speaker-dependent phoneme-to-viseme map, described in Table 17. For our strictly-confused P2V maps in Table 15, these become the relaxed P2V maps in Table 18. In Table 17 we have labeled the four variations Bear1, Bear2, Bear3 and Bear4 for ease of reference.

Classification P2V mapping - permitting mixing of vowels and consonants
Speaker1 {/b/ /e/ /ei/ /p/ /w/ /y/ /k/} {// /ai/ /f/ /i/ /m/ /n/ //}
(CF:0.28) {/dZ/ /z/} {// /u/} {/d/ /s/ /t/} {/tS/ /l/} {// /v/}{// /v/}
Speaker2 {// // /ai/ /ei/ /i/ /s/ /tS/} {/e/ /t/ /v/ /w/ /y/} {/l/ /m/ /n/}
(CF: 0.32) {// /f/} {/z/} {/b/ /d/ /p/} {// /u/} {/dZ/ /k/}
Speaker3 {// /ai/ /ei/ /f/ /i/ /n/} {// /e/ /y/ /tS/} {/b/ /s/ /v/} {/l/ /m/ /u/}
(CF: 0.40) {/dZ/} {//} {/z/} {/d/ /p/ /t/} {/k/ /w/} {//}
Speaker4 {// /ai/ /tS/ /i/ /ei/ } {// /m/ /u/ /n/} {// /e/ /p/ /v/ /y/}
(CF: 0.32) {/dZ/ /t/} {/k/ /l/ /w/} {//} {/d/ /f/ /s/} {/b/}
Classification P2V mapping - restricting mixing of vowels and consonants
Speaker1 {// /i/ // /u/} {// /ai/} {// /e/ /ei/} {/b/ /w/ /y/} {/d/ /f/ /s/ /t/}
(CF:0.47) {/k/} {/z/} {/m/} {/l/} {/tS/} {/dZ/ /k/ /v/ /z/}
Speaker2 {// // // /ai/ /ei/ /i/ // /u/} {/k/ /t/ /v/ /w/} {/tS/ /l/ /m/ /n/}
(CF: 0.29) {/f/ /s/} {/dZ/ /p/ /y/} {/b/ /d/} {/z/}
Speaker3 {// /ai/ /i/ /ei/} {// /e/} {/b/ /s/ /v/} {/d/ /p/ /t/} {/l/ /m/}
(CF: 0.56) {/y/} {/dZ/} {//} {/z/} {/u/} {// /e/} {/k/ /w/} {/f/ /n/} {//} {/tS/}
Speaker4 {// /ai/ /i/ /ei/} {/tS/ /k/ /l/ /w/} {/d/ /f/ /s/ /v/} {/m/ /n/}
(CF: 0.50) {/f/} {//} {/dZ/ /t/} {//} {/u/} {/y/} {/b/}
Table 18: Relaxed-confused phoneme speaker-dependent visemes. The score in brackets is the compression factor (the ratio of visemes to phonemes). Mixed vowel-and-consonant visemes are on top; visemes with vowels and consonants kept separate are listed below.

This is why these visemes are described as relaxed: for any remaining phoneme which has confusions but is so far not assigned to a viseme, the phoneme-pair confusions are used to map it to an appropriate viseme, even though it is not confused with all of the phonemes already in that viseme. Any remaining phonemes which are still not assigned to a viseme are grouped into a new garbage viseme. This approach ensures that any phoneme which has been confused with any other is grouped into a viseme.
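
A companion sketch of this final relaxed assignment, again with invented data: phonemes with at least one confusion join the viseme they are most confused with, and any phoneme confused with nothing falls into the garbage class.

```python
# Sketch of the relaxed assignment of leftover phonemes: any phoneme with at
# least one confusion joins its most-confused viseme; phonemes confused with
# nothing are collected into a garbage viseme. Data are illustrative only.
def assign_remaining(unassigned, visemes, confusion):
    garbage = []
    for p in unassigned:
        scores = [sum(confusion.get(p, {}).get(q, 0) + confusion.get(q, {}).get(p, 0)
                      for q in v) for v in visemes]
        if scores and max(scores) > 0:
            visemes[scores.index(max(scores))].append(p)
        else:
            garbage.append(p)            # never confused with any other phoneme
    if garbage:
        visemes.append(garbage)          # the garbage viseme
    return visemes

visemes = [["/s/", "/f/", "/z/"], ["/d/", "/t/"]]
confusion = {"/T/": {"/f/": 3}}
print(assign_remaining(["/T/", "/Z/"], visemes, confusion))
# [['/s/', '/f/', '/z/', '/T/'], ['/d/', '/t/'], ['/Z/']]
```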

4.3 Results analysis

Figure 7 (top) compares the new speaker-dependent viseme method with the Lee visemes which are the benchmark from the isolated word study.

Figure 7: Word classification correctness using all four new methods of deriving speaker-dependent visemes: AVL2 speakers (top) and RMAV speakers (bottom), against the Lee (top) and Woodward and Disney (bottom) benchmarks in black.

For Speaker 1 and Speaker 3, no new viseme map significantly improves upon Lee's performance, although we do see improvements for both Speaker 2 and Speaker 4. The strictly-confused, split viseme map (Bear2) improves upon Lee's previous best word classification.

The second set of our experiments, using continuous speech training data (RMAV), repeats our investigation with speaker-dependent visemes. These have been derived with the same methods described in Sections 4.1 and 4.2 and are listed in full for each speaker in Appendix A. Our classification method is identical to that used previously with HMMs. In the previous work of bear2014phoneme, we saw limited improvement in word classification with viseme classes due to the size of the dataset.

In Figure 7 (bottom) we have plotted the word correctness achieved for each RMAV speaker using all four variants of the speaker-dependent visemes. Our first observation is that the correctness scores in this figure span a noticeably higher range than those in Figure 7 (top). As before, this overall increase is attributed to the larger volume of training samples in RMAV compared to AVLetters2.

Compared to the benchmark of the Disney vowels and Montgomery consonant visemes, plotted in black on Figure 7 (bottom), we see that the comparison between speaker-dependent visemes and the best speaker-independent visemes depends on the speaker. For three of the 12 speakers (sp01, sp03, sp05), the speaker-dependent visemes are all worse than our benchmark. For another three of the 12 speakers (sp02, sp09, sp14), all of the speaker-dependent visemes out-perform the benchmark. For the six remaining speakers, the results are mixed. This suggests that speaker-dependent visemes could improve on speaker-independent ones, but that they must be exactly right for the individual, otherwise they become at worst detrimental, or at best a lot of effort for no significant improvement.

Careful observation of Figure 7 (top) shows that, when considering the performance of mixed versus split visemes, the split visemes significantly outperform the mixed ones (by more than one standard error). When considering relaxed versus split, the split visemes have a marginal advantage, but it is not significant (within one standard error).

The comparison of strict and split visemes for continuous speech (Figure 7, bottom) is consistent with the isolated-word observations. The strictly-confused visemes perform better than those with a relaxed confusion, but not statistically significantly (within one standard error). Again, we see that mixing vowel and consonant phonemes within individual viseme classes reduces classification performance, but not significantly.
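
Significance in these comparisons is judged against one standard error of the per-speaker scores. A minimal sketch of that test, with invented correctness values rather than the paper's measurements:

```python
# Sketch of a 'one standard error' comparison between two viseme sets.
# The per-speaker correctness values below are invented for illustration.
from statistics import mean, stdev
from math import sqrt

def beats_by_one_se(scores_a, scores_b):
    """True if set A's mean exceeds set B's mean by more than one standard
    error of set A's scores (one simple reading of the 1se criterion)."""
    se_a = stdev(scores_a) / sqrt(len(scores_a))
    return mean(scores_a) - mean(scores_b) > se_a

strict_split  = [0.14, 0.16, 0.15, 0.17]   # hypothetical per-speaker correctness
relaxed_split = [0.14, 0.15, 0.15, 0.16]
print(beats_by_one_se(strict_split, relaxed_split))  # False: within one s.e.
```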

Figure 8: Comparing the accuracy change between strict and relaxed visemes, showing the improvement in accuracy (i.e. reduction in insertion errors) for all 12 speakers in continuous speech. The baseline is classification correctness, which ignores insertion-error penalties.

In Figure 8 we have plotted accuracy and correctness for our best-performing speaker-dependent visemes on continuous speech. We also plot the accuracy scores of our benchmark, Woodward and Disney's visemes. These are compared with the correctness scores as a baseline to show the improvement. Whilst the improvement of speaker-dependent visemes is not significant when measured by correctness, by plotting the accuracy of the viseme classifiers we can see that they do have a positive influence in reducing insertion errors, which are a bugbear of lipreading.
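
The distinction rests on the usual HTK-style word scores: correctness ignores insertion errors whereas accuracy subtracts them, so fewer insertions show up as an accuracy gain even when correctness is unchanged. A small worked sketch with invented error counts:

```python
# HTK-style word scores: correctness C = (N - D - S) / N ignores insertions,
# accuracy A = (N - D - S - I) / N penalises them. Counts are invented.
def correctness(n, deletions, substitutions):
    return (n - deletions - substitutions) / n

def accuracy(n, deletions, substitutions, insertions):
    return (n - deletions - substitutions - insertions) / n

N, D, S = 1000, 150, 400
print(correctness(N, D, S))       # 0.45 regardless of insertions
print(accuracy(N, D, S, 300))     # 0.15 with many insertion errors
print(accuracy(N, D, S, 120))     # 0.33 once insertions are reduced
```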

5 Performance of individual visemes

Figure 9: Individual viseme classification, Pr, with speaker-dependent visemes for four speakers with isolated-word training of classifiers. B1 visemes (top) and B2 visemes (bottom).
Figure 10: Individual viseme classification, Pr, with speaker-dependent visemes for four speakers with isolated-word training of classifiers. B3 visemes (top) and B4 visemes (bottom).

In Figures 9 and 10, the contribution of each viseme has been listed in descending order along the horizontal axis for each speaker in AVL2. The contribution of each viseme is measured as the probability of each class, Pr; these values have been calculated from the HResults confusion matrices.

This analysis of visemes within a set is also used in bear2014some, which proposes a threshold subject to the information in the features.
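
One straightforward way to obtain such a per-class probability from a confusion matrix is to divide each diagonal count by its row total (the exact measure used for the plots may differ in detail). A minimal sketch with an invented three-viseme matrix:

```python
# Per-class probability of correct classification read from a confusion matrix
# (rows = reference viseme, columns = recognised viseme). Matrix is invented.
import numpy as np

conf = np.array([[30,  5,  5],
                 [10, 20, 10],
                 [ 8,  2, 10]])

per_class_pr = conf.diagonal() / conf.sum(axis=1)
for viseme, pr in sorted(zip(["V01", "V02", "V03"], per_class_pr),
                         key=lambda x: -x[1]):
    print(viseme, round(float(pr), 2))   # descending order, as in the figures
# V01 0.75, V02 0.5, V03 0.5
```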

Figure 11: Individual viseme classification, Pr with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B1 visemes (top) and B2 visemes (bottom).
Figure 12: Individual viseme classification, Pr with speaker-dependent visemes for twelve speakers with continuous speech training of classifiers. B3 visemes (top) and B4 visemes (bottom).

The same viseme comparison analysis has been repeated for our continuous speech recognition experiments and the results are shown in Figures 11 and 12.

In the isolated-word data (Figures 9 and 10) the difference between a high-performing speaker map and a poor one is striking. Speaker 3, for example, has at least five well-performing visemes (more in some configurations), whereas Speaker 1 has only one good viseme. Referring to Tables 15 and 18, there is no consistency in which viseme performs best, although visual silence generally appears to be easy to spot. This variation is to be expected: speaker variability is a very serious problem in lipreading.

Figures 11 and 12 show the same analysis for the continuous speech data. Now there is a shallower drop-off to the curve, and there are certainly no visemes that reach the highest per-class probabilities seen in the isolated-word data. Although there appears to be less variability among speakers, this is an illusion caused by the poorly-performing visemes being similar across speakers; within the top five visemes there are significant differences among speakers.

6 Conclusions

While lipreading, and hence expressive audio-visual speech recognition, faces a number of challenges, one of the persistent difficulties has been the multiplicity of mappings between phonemes and visemes. This paper has described a study of previously suggested phoneme-to-viseme (P2V) maps. For isolated-word classification, Lee's map lee2002audio is the best of the previously published maps. For continuous speech, a combination of Woodward's and Disney's visemes is better. The best-performing viseme sets have, on average, between two and four phonemes per viseme.

When looking at speaker-independent visemes, whilst most viseme sets show no difference in correctness between isolated and continuous speech, it is interesting to note that the Woodward consonant visemes, which are linguistically derived, are better for continuous speech, whereas the Lee visemes, which are data-derived, are better for isolated words. This suggests that an optimal set of visemes for all speakers would need to consider both the visual speech gestures of the individual and the rules of language. This, in essence, is the dilemma for visemes: does one choose units that make sense in terms of likely visual gestures, or in terms of the linguistic problem being solved?

Figure 13: A simple augmentation of the conventional lip-reading system to include speaker-dependent visemes.

We have also derived some new visemes, the ‘Bear’ visemes. These new data-driven visemes respect speaker individuality in speech, and we use this property to demonstrate that our second data-driven method tested, a strictly-confused viseme derivation with split vowel and consonant phonemes, can improve word classification. The best of the Bear visemes is the strictly-confused set with split vowels and consonants (Bear2) for both isolated and continuous speech.

Furthermore, a review of these speaker-dependent visemes (listed in Tables 15 and 18, and Appendix A) shows that formerly ‘accepted’ visemes such as {/S/ /Z/ /dZ/ /tS/} are no longer present. Similarly for our previous vowel-based visemes: six of our eight prior viseme sets pair two particular vowel phonemes together (albeit not as a complete viseme; other phonemes are also present), but with our best speaker-dependent visemes these two phonemes are not paired. This is an interesting insight because it suggests that formerly ‘accepted’ strong visemes might not be so useful for all speakers, and some adaptability, or further investigation into understanding viseme variation, is still needed. Our suggestion at this time is that linguistics, or co-articulation in continuous speech, is a strong influence causing this variation.

In practical terms, our new viseme derivation method is simple and can easily be included within a conventional lipreading system. This is demonstrated in Figure 13, where our clustering method is shown in dashed boxes. We recommend this approach for viseme classification since speaker-independent visemes are unlikely to perform well.

In general, speaker-dependent visemes reduce insertion errors when classifying continuous speech. This is thought to be because the phoneme confusions in speaker-dependent visemes are affected by speaker-specific visual co-articulation. For all viseme sets, not mixing vowel and consonant phonemes significantly improves classification.

7 Acknowledgments

We gratefully acknowledge the assistance of Dr Yuxuan Lan and Dr Barry-John Theobald, formerly of the University of East Anglia for their help with HTK and general advice and guidance.

This work was conducted while Helen L. Bear was in receipt of a studentship from the UK Engineering and Physical Sciences Research Council (EPSRC).

References

  • (1) J. L. Newman, B.-J. Theobald, S. J. Cox, Limitations of visual speech recognition, in: Proceedings of the International Conference on Audio-Visual Speech Processing, 2010.
  • (2) E. Ong, R. Bowden, Robust lip-tracking using rigid flocks of selected linear predictors, in: 8th IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG2008), 2008, pp. 247–254.
  • (3) I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164.
    URL http://www.springerlink.com/openurl.asp?id=doi:10.1023/B:VISI.0000029666.37597.d3
  • (4) T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681 –685. doi:10.1109/34.927467.
  • (5) Y. Lan, B.-J. Theobald, R. Harvey, View independent computer lip-reading, in: IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 432–437. doi:10.1109/ICME.2012.192.
  • (6) A. Pass, J. Zhang, D. Stewart, An investigation into features for multi-view lipreading, in: Image Processing (ICIP), 2010 17th IEEE International Conference on, IEEE, 2010, pp. 2417–2420.
  • (7) S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding 115 (4) (2011) 541 – 558. doi:10.1016/j.cviu.2010.12.001.
  • (8) K. Kumar, T. Chen, R. Stern, Profile view lip reading, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 4, 2007, pp. IV–429–IV–432. doi:10.1109/ICASSP.2007.366941.
  • (9) R. Kaucic, A. Blake, Accurate, real-time, unadorned lip tracking, in: Computer Vision, 1998. Sixth International Conference on, IEEE, 1998, pp. 370–375.
  • (10) S. L. Bauman, G. Hambrecht, Analysis of view angle used in speechreading training of sentences, American Journal of Audiology 4 (3) (1995) 67–70.
    URL http://aja.asha.org/cgi/content/abstract/4/3/67
  • (11) P. Lucey, G. Potamianos, S. Sridharan, Visual speech recognition across multiple views, in: A. W.-C. Liew, S. Wang (Eds.), Visual Speech Reognition: Lip Segmentation and Mapping, 2009. doi:10.4018/978-1-60566-186-5.
  • (12) A. Blokland, A. H. Anderson, Effect of low frame-rate video on intelligibility of speech, Speech Communication 26 (1-2) (1998) 97–103. doi:http://dx.doi.org/10.1016/S0167-6393(98)00053-3.
  • (13) T. Saitoh, R. Konishi, A study of influence of word lip reading by change of frame rate, in: Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), 2010.
  • (14) H. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Resolution limits on visual speech recognition, in: IEEE International Conference on Image Processing, 2014, pp. 2009–2013. doi:10.1109/ICIP.2014.7025274.
  • (15) M. Heckmann, F. Berthommier, C. Savariaux, K. Kroschel, Effects of image distortions on audio-visual speech recognition, in: AVSP 2003-International Conference on Audio-Visual Speech Processing, 2003, pp. 163–168.
  • (16) M. Vitkovitch, P. Barber, Visible speech as a function of image quality: Effects of display parameters on lipreading ability, Applied Cognitive Psychology 10 (2) (1996) 121–140. doi:10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V.
    URL http://dx.doi.org/10.1002/(SICI)1099-0720(199604)10:2<121::AID-ACP371>3.0.CO;2-V
  • (17) L. Cappelletta, N. Harte, Phoneme-to-viseme mapping for visual speech recognition., in: ICPRAM (2), 2012, pp. 322–329.
  • (18) D. Howell, B.-J. Theobald, S. J. Cox, Confusion modelling for automated lip-reading using weighted finite-state transducers., in: AVSP, 2013, pp. 197–202.
  • (19) T. J. Hazen, K. Saenko, C.-H. La, J. R. Glass, A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments, in: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI ’04, ACM, New York, NY, USA, 2004, pp. 235–242. doi:10.1145/1027933.1027972.
    URL http://doi.acm.org/10.1145/1027933.1027972
  • (20) J. Shin, J. Lee, D. Kim, Real-time lip reading system for isolated Korean word recognition, Pattern Recognition 44 (3) (2011) 559–571.
  • (21) H. L. Bear, R. Harvey, Decoding visemes: Improving machine lip-reading, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 2009–2013.
  • (22) I. Matthews, J. Bangham, R. Harvey, S. Cox, Non-linear scale decomposition based features for visual speech recognition, Proceedings of the IX European Signal Processing Conference (EUSIPCO) (1998) 303–305.
  • (23) Y. Lan, R. Harvey, B. Theobald, E.-J. Ong, R. Bowden, Comparing visual features for lipreading, in: International Conference on Auditory-Visual Speech Processing 2009, 2009, pp. 102–106.
  • (24) Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R. Bowden, Improving visual features for lip-reading, Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP) 7 (3) (2010) 42–48.
  • (25) I. Matthews, T. Cootes, J. Bangham, S. Cox, R. Harvey, Extraction of visual features for lipreading, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2) (2002) 198 –213. doi:10.1109/34.982900.
  • (26) S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchec, P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge University Engineering Department, 2006.
    URL http://htk.eng.cam.ac.uk/docs/docs.shtml
  • (27) Q. Zhu, A. Alwan, On the use of variable frame rate analysis in speech recognition, in: Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on, Vol. 3, IEEE, 2000, pp. 1783–1786.
  • (28) K. Thangthai, R. Harvey, S. Cox, B.-J. Theobald, Improving lip-reading performance for robust audiovisual speech recognition using dnns, in: Proc. FAAVSP, 1St Joint Conference on Facial Analysis, Animation and Audio–Visual Speech Processing, 2015.
  • (29) F. J. Huang, T. Chen, Tracking of multiple faces for human-computer interfaces and virtual environments, in: Proceedings of IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1563–1566. doi:10.1109/ICME.2000.871067.
  • (30) J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, P. A. Keating, Similarity structure in perceptual and physical measures for visual consonants across talkers, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, 2002, pp. I–441 –I–444. doi:10.1109/ICASSP.2002.5743749.
  • (31) S. Lesner, P. Kricos, Visual vowel and diphthong perception across speakers, Journal of the Academy of Rehabilitative Audiology 14 (1981) 252–258.
  • (32) R. Cutler, L. Davis, Look who’s talking: speaker detection using video and audio correlation, in: IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, 2000, pp. 1589–1592. doi:10.1109/ICME.2000.871073.
  • (33) J. Luettin, N. Thacker, S. Beet, Speaker identification by lipreading, in: Proceedings of the Fourth International Conference on Spoken Language (ICSLP), Vol. 1, 1996, pp. 62–65. doi:10.1109/ICSLP.1996.607030.
  • (34) H. L. Bear, S. J. Cox, R. W. Harvey, Speaker-independent machine lip-reading with speaker-dependent viseme classifiers, Facial Animation and Audio-Visual Speech Processing (FAAVSP) 2015 (2015) 190–195.
    URL http://www.isca-speech.org/archive/avsp15/papers/av15_190.pdf
  • (35) J. L. Newman, S. J. Cox, Speaker independent visual-only language identification, in: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, IEEE, 2010, pp. 5026–5029.
  • (36) S. Taylor, B.-J. Theobald, I. Matthews, The effect of speaking rate on audio and visual speech, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3037–3041. doi:10.1109/ICASSP.2014.6854158.
  • (37) E. K. Patterson, S. Gurbuz, Z. Tufekci, J. N. Gowdy, Cuave: A new audio-visual database for multimodal human-computer interface research, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, 2002, pp. II–2017–II–2020. doi:10.1109/ICASSP.2002.5745028.
  • (38) J. F. G. Perez, A. F. Frangi, E. L. Solano, K. Lukas, Lip reading for robust speech recognition on embedded devices, in: Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., Vol. 1, 2005, pp. 473–476. doi:10.1109/ICASSP.2005.1415153.
  • (39) K. Paleček, Lipreading using spatiotemporal histogram of oriented gradients, in: 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1882–1885. doi:10.1109/EUSIPCO.2016.7760575.
  • (40) R. E. Shor, The production and judgment of smile magnitude, The Journal of General Psychology 98 (1) (1978) 79–96.
  • (41) S. Fagel, Effects of smiling on articulation: Lips, larynx and acoustics, in: Development of multimodal interfaces: active listening and synchrony, Springer, 2010, pp. 294–303.
  • (42) M. Kienast, A. Paeschke, W. Sendlmeier, Articulatory reduction in emotional speech, in: Sixth European Conference on Speech Communication and Technology, 1999.
  • (43) M. Kienast, W. F. Sendlmeier, Acoustical analysis of spectral and temporal changes in emotional speech, in: ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000.
  • (44) F. Shaw, B.-J. Theobald, Expressive modulation of neutral visual speech, IEEE MultiMedia 23 (4) (2016) 68–78.
  • (45) W. Hamza, E. Eide, R. Bakis, M. Picheny, J. Pitrelli, The ibm expressive speech synthesis system, in: Eighth International Conference on Spoken Language Processing, 2004.
  • (46) N. N. Khatri, Z. H. Shah, S. A. Patel, Facial expression recognition: A survey, International Journal of Computer Science and Information Technologies (IJCSIT) 5 (1) (2014) 149–152.
  • (47) S. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE transactions on Affective Computing 6 (1) (2015) 1–12.
  • (48) J. Yan, W. Zheng, Q. Xu, G. Lu, H. Li, B. Wang, Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech, IEEE Transactions on Multimedia 18 (7) (2016) 1319–1329.
  • (49) S. Zhang, X. Wang, G. Zhang, X. Zhao, Multimodal emotion recognition integrating affective speech with facial expression, WSEAS Transactions on Signal Processing 10 (2014) (2014) 526–537.
  • (50) T. Cootes, G. Edwards, C. Taylor, Active appearance models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 23 (6) (2001) 681 –685. doi:10.1109/34.927467.
  • (51) R. Seymour, D. Stewart, J. Ming, Comparison of image transform-based features for visual speech recognition in clean and corrupted videos, Journal on Image and Video Processing 2008 (2008) 14.
  • (52) G. Potamianos, C. Neti, J. Luettin, I. Matthews, Audio-visual automatic speech recognition: An overview, Issues in Visual and Audio-Visual Speech Processing 22 (2004) 23.
  • (53) A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, Cnn features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
  • (54) Z. Yan, V. Jagadeesh, D. DeCoste, W. Di, R. Piramuthu, Hd-cnn: Hierarchical deep convolutional neural network for image classification, in: International Conference on Computer Vision (ICCV), Vol. 2, 2015.
  • (55) M. Sundermeyer, R. Schlüter, H. Ney, Lstm neural networks for language modeling., in: Interspeech, 2012, pp. 194–197.
  • (56) W. Byeon, T. M. Breuel, F. Raue, M. Liwicki, Scene labeling with lstm recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3547–3555.
  • (57) J. S. Chung, A. Zisserman, Lip Reading in the Wild, Springer International Publishing, Cham, 2017, pp. 87–103. doi:10.1007/978-3-319-54184-6_6.
    URL http://dx.doi.org/10.1007/978-3-319-54184-6_6
  • (58) M. Wand, J. Koutník, J. Schmidhuber, Lipreading with long short-term memory, in: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 6115–6119.
  • (59) B.-J. Theobald, Visual speech synthesis using shape and appearance models, Ph.D. thesis, University of East Anglia (2003).
  • (60) C. A. Binnie, P. L. Jackson, A. A. Montgomery, Visual intelligibility of consonants: A lipreading screening test with implications for aural rehabilitation, Journal of Speech and Hearing Disorders 41 (4) (1976) 530.
  • (61) C. G. Fisher, Confusions among visually perceived consonants, Journal of Speech, Language and Hearing Research 11 (4) (1968) 796.
  • (62) J. R. Franks, J. Kimble, The confusion of english consonant clusters in lipreading, Journal of Speech, Language and Hearing Research 15 (3) (1972) 474.
  • (63) B. E. Walden, R. A. Prosek, A. A. Montgomery, C. K. Scherr, C. J. Jones, Effects of training on the visual recognition of consonants, Journal of Speech, Language and Hearing Research 20 (1) (1977) 130.
  • (64) P. B. Kricos, S. A. Lesner, Differences in visual intelligibility across talkers., The Volta Review 82 (1982) 219–226.
  • (65) E. Owens, B. Blazek, Visemes observed by hearing-impaired and normal-hearing adult viewers, Journal of Speech and Hearing Research 28 (3) (1985) 381.
  • (66) J. Lander, Read my lips: Facial animation techniques, http://www.gamasutra.com/view/feature/131587/read_my_lips_facial_animation_.php, accessed: 2014-01-28 (2014).
  • (67) A. A. Montgomery, P. L. Jackson, Physical characteristics of the lips underlying vowel lipreading performance, The Journal of the Acoustical Society of America 73 (1983) 2134–2144.
  • (68) E. B. Nitchie, Lip-Reading, principles and practise: A handbook for teaching and self-practise, Frederick A Stokes Co, New York, 1912.
  • (69) E. Bozkurt, C. Erdem, E. Erzin, T. Erdem, M. Ozkan, Comparison of phoneme and viseme based acoustic units for speech driven realistic lip animation, in: 3DTV Conference, IEEE, 2007, pp. 1–4.
  • (70) J. Jeffers, M. Barley, Speechreading (lipreading), Thomas Springfield, IL:, 1971.
  • (71) S. Lee, D. Yook, Audio-to-visual conversion using Hidden Markov Models, in: PRICAI 2002: Trends in Artificial Intelligence, Springer, 2002, pp. 563–570.
  • (72) C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, J. Zhou, Audio-visual speech recognition, in: Final Workshop 2000 Report, Vol. 764, 2000.
  • (73) K. E. Finn, A. A. Montgomery, Automatic optically-based recognition of speech, Pattern Recognition Letters 8 (3) (1988) 159–164.
  • (74) F. Heider, G. M. Heider, An experimental investigation of lipreading, Psychological Monographs 52 (232) (1940) 124–153.
  • (75) M. F. Woodward, C. G. Barber, Phoneme perception in lipreading, Journal of Speech, Language and Hearing Research 3 (3) (1960) 212.
  • (76) A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bulletin of the Calcutta Mathematical Society 35 (1943) 99–109.
  • (77) K. Wilson, The Columbia guide to standard American English, New York : Columbia University Press, 1993.
  • (78) S. Cox, R. Harvey, Y. Lan, J. Newman, B. Theobald, The challenge of multispeaker lip-reading, in: International Conference on Auditory-Visual Speech Processing, 2008, pp. 179–184.
  • (79) H. L. Bear, Decoding visemes: improving machine lip-reading. PhD thesis, University of East Anglia, 2016.
  • (80) W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall, The DARPA speech recognition research database: specifications and status, in: Proceedings of the DARPA Workshop on speech recognition, 1986, pp. 93–99.
  • (81) Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, R. Bowden, Improving visual features for lip-reading, in: Proceedings of International Conference on Auditory-Visual Speech Processing, Vol. 201, 2010.
  • (82) Cambridge University, UK. BEEP pronunciation dictionary [online] (1997) [cited Jan 2013].
  • (83) J. Gower, Generalized procrustes analysis, Psychometrika 40 (1) (1975) 33–51. doi:10.1007/BF02291478.
    URL http://dx.doi.org/10.1007/BF02291478
  • (84) S. Baker, Inverse compositional algorithm, in: K. Ikeuchi (Ed.), Computer Vision, Springer US, 2014, pp. 426–428. doi:10.1007/978-0-387-31439-6_759.
    URL http://dx.doi.org/10.1007/978-0-387-31439-6_759
  • (85) T. J. Hazen, Automatic alignment and error correction of human generated transcripts for long speech recordings., in: INTERSPEECH, Vol. 2006, 2006, pp. 1606–1609.
  • (86) H. L. Bear, R. W. Harvey, B.-J. Theobald, Y. Lan, Which phoneme-to-viseme maps best improve visual-only computer lip-reading?, in: Advances in Visual Computing, Springer, 2014, pp. 230–239. doi:10.1007/978-3-319-14364-4_22.
  • (87) J. Demsar, Statistical comparisons of classifiers over multiple datasets, Journal of Machine Learning Research 7 (2006) 1–30.
  • (88) S. J. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK book version 3.4 (2006).
  • (89) R. R. Bouckaert, E. Frank, Evaluating the replicability of significance tests for comparing learning algorithms, in: Advances in knowledge discovery and data mining, Springer, 2004, pp. 3–12.
  • (90) Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation, The Journal of Machine Learning Research 5 (2004) 1089–1105.
  • (91) H. L. Bear, G. Owen, R. Harvey, B.-J. Theobald, Some observations on computer lip-reading: moving from the dream to the reality, in: SPIE Security+ Defence, International Society for Optics and Photonics, 2014, pp. 92530G–92530G. doi:10.1117/12.2067464.

Appendix A RMAV Speaker-dependent P2V maps

Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp01 /v01/ /dZ/ /m/ /v01/ /æ/ // // /ay/ /v01/ // /t/ /T/ /uw/ /v01/ /æ/ // // /ay/
/v02/ // // /iy/ /k/ /eh/ // // /iy/ /z/ /eh/ // // /iy/
/n/ /N/ /r/ /s/ /v02/ // // /v02/ // // /iy/ /k/ /v02/ /S/ /T/ /v/ /w/
/v03/ /ey/ /v03/ // // /ey/ /n/ /N/ /r/ /s/ /z/
/v04/ // /D/ /E/ /eh/ /v04/ // /sil/ /sil/ /sil/ /sil/ /sp/ /v03/ /b/ /d/ /f/ /k/
// /v05/ /uw/ /gar/ /gar/ // /æ/ // // /m/ /n/ /N/ /p/ /r/
/v05/ // /v06/ // // /ay/ // /b/ /tS/ /r/ /s/ /t/
/v06/ // /t/ /T/ /uw/ /v07/ // /tS/ /d/ /D/ /E/ /eh/ /sil/ /sil/ /sil/ /sp/
/z/ /v08/ // /eh/ /ey/ /f/ /g/ /H/ /gar/ /gar/ // // // //
/v07/ // // /p/ /w/ /v09/ // /H/ /dZ/ /m/ // // /D/ // /ey/ /g/ /H/
/v08/ /S/ /v10/ // // // /p/ /S/ // /H/ /dZ/ // // //
/v09/ // /v11/ /b/ /d/ /f/ /k/ // // /w/ /y/ /Z/ // // // /uw/ /Z/
/v10/ /æ/ /m/ /n/ /N/ /p/ /r/ /Z/ /Z/
/v11/ /d/ /g/ /H/ /r/ /s/ /t/
/v12/ /b/ /v12/ /D/ /dZ/
/v13/ /y/ /v13/ /S/ /T/ /v/ /w/
/v14/ // /ay/ /z/
/v15/ /Z/ /v14/ /g/
/v16/ // /v15/ /tS/ /H/
/v17/ /sil/ /v16/ /Z/
/v18/ // /sil/ /sil/ /sil/ /sp/
/v19/ /tS/
/v20/ //
/v21/ //
/gar/ /gar/ /sp/
Table 19: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp01
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp02 /v01/ /l/ /m/ /n/ /p/ /v01/ // /ay/ /E/ /eh/ /v01/ // /ay/ /b/ /d/ /v01/ // /ay/ /E/ /eh/
/s/ /S/ /t/ /v/ /w/ /ey/ // /iy/ /eh/ /ey/ /dZ/ /ey/ // /iy/
/w/ /v02/ // // // // /v02/ /l/ /m/ /n/ /p/ /v02/ /b/ /m/ /n/ /N/
/v02/ /g/ /H/ // // /v03/ /æ/ // // // /s/ /S/ /t/ /v/ /w/ /r/ /s/ /S/ /t/ /v/
/k/ /v04/ // /uw/ /w/ /v/ /w/ /y/ /z/
/v03/ // /ay/ /b/ /d/ /v05/ // /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/
/eh/ /ey/ /dZ/ /v06/ /sil/ /gar/ /gar/ // /æ/ // // /gar/ /gar/ // /æ/ // //
/v04/ // // /v07/ // // /tS/ /E/ // /f/ // /tS/ /d/ /D/ /f/
/v05/ // /uw/ /y/ /z/ /v08/ /b/ /m/ /n/ /N/ /f/ /g/ /H/ // // /f/ /g/ /H/ // /dZ/
/v06/ // // /r/ /s/ /S/ /t/ /v/ // /iy/ /k/ /N/ // /dZ/ /k/ /l/ // //
/v07/ /æ/ // // // /v/ /w/ /y/ /z/ // // // /T/ // // // /T/ // //
/v08/ /f/ /N/ // /v09/ /dZ/ // // /uw/ /y/ /z/ // /uw/ /Z/
/v09/ /E/ /v10/ /d/ /D/ /f/ /g/ /z/ /Z/
/v10/ /tS/ /T/ /k/ /l/
/v11/ /Z/ /v11/ /tS/ /T/
/v12/ // /v12/ /Z/
/v13/ /sil/ /sil/ /sil/ /sil/ /sp/
/gar/ /gar/ // /sp/ /gar/ /gar/ //
Table 20: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp02
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp03 /v01/ /ey/ /f/ // /iy/ /v01/ /E/ // /sil/ /uw/ /v01/ /ey/ /f/ // /iy/ /v01/ /ay/ /eh/ /ey/ //
/k/ /l/ /m/ /n/ /S/ /v02/ // /k/ /l/ /m/ /n/ /S/ /iy/ // //
/S/ /v03/ /ay/ /eh/ /ey/ // /S/ /v02/ /g/ /k/ /l/ /m/
/v02/ /D/ /g/ /iy/ // // /v02/ /E/ /r/ /s/ /t/ /p/ /r/ /s/ /t/ /T/
/v03/ /E/ /r/ /s/ /sil/ /v04/ // /z/ /T/
/uw/ /z/ /v05/ // /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/
/v04/ /d/ /T/ /v/ /w/ /v06/ /æ/ // // /gar/ /gar/ // /æ/ // // /gar/ /gar/ // /æ/ // //
/v05/ // // /p/ /v07/ // // /ay/ // /b/ /tS/ // // /b/ /tS/ /d/
/v06/ /æ/ /v08/ // /tS/ /d/ /D/ /eh/ // /d/ /D/ /E/ // /f/
/v07/ // /ay/ /b/ /tS/ /v09/ // // /g/ /H/ // /N/ /f/ /H/ /dZ/ /N/ //
/v08/ /N/ /v10/ /g/ /k/ /l/ /m/ /N/ // // // /p/ // /S/ // // /uw/
/v09/ /H/ /p/ /r/ /s/ /t/ /T/ /p/ /T/ // // /v/ /uw/ /v/ /w/ /y/ /z/
/v10/ // /eh/ // /T/ /v/ /w/ /y/ /Z/ /z/ /Z/
/v11/ // // /v11/ /tS/ /d/ /D/ /f/
/v12/ // // /v12/ /dZ/ /v/ /w/ /z/
/v13/ /Z/ /v13/ /b/
/v14/ // /v14/ /S/ /Z/
/v15/ // /v15/ /H/ /N/
/gar/ /gar/ // /sp/ /sil/ /sil/ /sil/ /sp/
/gar/ /gar/ //
Table 21: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp03
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp05 /v01/ /æ/ // // /D/ /v01/ /æ/ // // /eh/ /v01/ /ay/ /b/ /d/ /w/ /v01/ /ay/ /uw/
// /ey/ // /iy/ /k/ /ey/ // /iy/ // // /v02/ /æ/ // // /D/ /v02/ /d/ /D/ /f/ /dZ/
/k/ /l/ /n/ // // /ey/ // /iy/ /k/ /l/ /m/ /n/ /r/ /s/
/v02/ /p/ /r/ /s/ /t/ /v02/ /E/ // /k/ /l/ /n/ /s/ /S/
/z/ /v03/ /ay/ /uw/ /sil/ /sil/ /sil/ /sp/ /sil/ /sil/ /sil/ /sp/
/v03/ // /N/ /uw/ /v/ /v04/ // /gar/ /gar/ // // // // /gar/ /gar/ // /æ/ // //
/v04/ /tS/ // /v05/ // // /E/ /f/ /g/ /H/ // // // /b/ /tS/ /E/
/v05/ /ay/ /b/ /d/ /w/ /v06/ // // // /dZ/ /m/ /N/ // /E/ /eh/ // /ey/ /g/
/v06/ /f/ /m/ /v07/ /g/ /H/ /t/ /v/ // // // /p/ /r/ /g/ /H/ // // /iy/
/v07/ // /g/ /H/ /v08/ /p/ /w/ /y/ /r/ /s/ /S/ /t/ /T/ /iy/ /N/ // // //
/v08/ // /S/ /v09/ /d/ /D/ /f/ /dZ/ /T/ // // /uw/ /v/ // /p/ /t/ /T/ //
/v09/ /dZ/ /l/ /m/ /n/ /r/ /s/ /v/ /y/ /z/ /Z/ // // /v/ /w/ /y/
/v10/ /dZ/ /s/ /S/ /y/ /z/ /Z/
/v11/ /E/ /y/ /v10/ /N/ /T/
/v12/ /T/ /v11/ /b/ /tS/
/v13/ // // /v12/ /Z/
/v14/ /Z/ /sil/ /sil/ /sil/ /sp/
/v15/ // /gar/ /gar/ // //
/v16/ /j/ /h/
/gar/ /gar/ // // /sil/ /sp/
Table 22: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp05
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp06 /v01/ // /ay/ /d/ /D/ /v01/ // /æ/ // // /v01/ /H/ /N/ // // /v01/ // /æ/ // //
/eh/ // /k/ /l/ /n/ // // // // // /v02/ // /ay/ /d/ /D/ // // // // //
/n/ /p/ /s/ /t/ // /eh/ // /k/ /l/ /n/ //
/v02/ /v/ /w/ /y/ /z/ /v02/ /sil/ /uw/ /n/ /p/ /s/ /t/ /v02/ /k/ /l/ /m/ /n/
/v03/ /m/ /v03/ /ay/ /ey/ /iy/ // /sil/ /sil/ /sil/ /sp/ /r/ /s/ /S/ /t/ /v/
/v04/ /H/ /N/ // // /v04/ // // /gar/ /gar/ // /æ/ // // /v/ /w/ /y/ /z/
/v05/ /ey/ /iy/ /r/ /S/ /v05/ /E/ // /b/ /tS/ // /ey/ /sil/ /sil/ /sil/ /sp/
/v06/ // /v06/ // /ey/ /f/ /g/ // /iy/ /gar/ /gar/ // // /ay/ //
/v07/ // /æ/ // // /v07/ // /iy/ /dZ/ /m/ // /r/ /tS/ /d/ /D/ /E/ /ey/
/v08/ /f/ /T/ // /v08/ /k/ /l/ /m/ /n/ /r/ /S/ /T/ // // /ey/ /f/ /g/ /H/ /iy/
/v09/ /uw/ /r/ /s/ /S/ /t/ /v/ // /uw/ /v/ /w/ /y/ /iy/ /dZ/ /N/ // /T/
/v10/ /uw/ /v/ /w/ /y/ /z/ /y/ /z/ /Z/ /T/ // // /uw/ /Z/
/v11/ /b/ /tS/ /g/ /v09/ /b/ /tS/ /d/ /D/ /Z/
/v12/ // /dZ/ /g/ /dZ/
/v13/ /Z/ /v10/ /Z/
/v14/ /sil/ /v11/ /H/ /T/
/v15/ // /v12/ /N/
/v16/ // /sil/ /sil/ /sil/ /sp/
/v17/ /u/ /w/ /gar/ /gar/ //
/gar/ /gar/ // /sp/
Table 23: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp06
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp08 /v01/ /eh/ /f/ /H/ // /v01/ // /æ/ // // /v01/ /eh/ /f/ /H/ // /v01/ // /æ/ // //
/l/ /m/ /N/ /p/ /r/ /eh/ /ey/ // /iy/ /uw/ /l/ /m/ /N/ /p/ /r/ /eh/ /ey/ // /iy/ /uw/
/r/ /s/ /t/ /uw/ /r/ /s/ /t/ /uw/
/v02/ // /æ/ // // /v02/ // /sil/ /sil/ /sil/ /sp/ /v02/ /k/ /l/ /n/ /p/
/ey/ /n/ // /v03/ // // /gar/ /gar/ // /æ/ // // /s/ /t/ /T/ /v/ /w/
/v03/ /ay/ /b/ /uw/ /v04/ // // /ay/ // /b/ /tS/ /w/ /z/
/v04/ /g/ /v05/ // /E/ /tS/ /d/ /D/ /E/ // /sil/ /sil/ /sil/ /sp/
/v05/ /tS/ /v06/ // // // /ey/ /g/ // /dZ/ /gar/ /gar/ // // // /b/
/v06/ /S/ /y/ /v07/ // /dZ/ /k/ /n/ // // /d/ /D/ /E/ // /f/
/v07/ // /v08/ /k/ /l/ /n/ /p/ // // /S/ /T/ // /f/ /g/ /H/ // /dZ/
/v08/ /k/ /s/ /t/ /T/ /v/ /w/ // // /uw/ /v/ /w/ /dZ/ /m/ /N/ // //
/v09/ /dZ/ /w/ /z/ /w/ /y/ /z/ /Z/ // // /S/ // //
/v10/ /D/ /v/ /w/ /z/ /v09/ /d/ /D/ /f/ /H/ // /y/ /Z/
/v11/ /T/ /Z/ /N/
/v12/ // // /v10/ /g/ /dZ/
/v13/ // // /v11/ /b/ /tS/ /S/
/v14/ // /E/ /v12/ /Z/
/v15/ // /v13/ /y/
/v16/ // /sil/ /sil/ /sil/ /sp/
/gar/ /gar/ // /sil/ /sp/ /gar/ /gar/ // //
Table 24: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp08
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp09 /v01/ /D/ /E/ // /g/ /v01/ /E/ // /v01/ // /N/ // // /v01/ // //
/k/ /l/ /m/ /n/ /p/ /v02/ // /æ/ // // /v02/ /D/ /E/ // /g/ /v02/ // /æ/ // //
/p/ /eh/ /ey/ // /iy/ // /k/ /l/ /m/ /n/ /p/ /eh/ /ey/ // /iy/ //
/v02/ // /y/ // /p/ //
/v03/ /ay/ /r/ /s/ /S/ /v03/ // /uw/ /v03/ /ay/ /r/ /s/ /S/ /v03/ /k/ /l/ /m/ /n/
/v/ /w/ /z/ /v04/ // /v/ /w/ /z/ /p/ /r/ /s/ /S/ /t/
/v04/ /æ/ // // /b/ /v05/ // /sil/ /sil/ /sil/ /sp/ /t/ /T/ /z/
/T/ /v06/ // /gar/ /gar/ // /æ/ // // /sil/ /sil/ /sil/ /sp/
/v05/ /eh/ /ey/ /f/ // /v07/ // // // /b/ /tS/ /d/ /eh/ /gar/ /gar/ // // /b/ /tS/
/v06/ // /N/ // // /v08/ /sil/ /eh/ /ey/ /f/ /H/ // /D/ /E/ // /f/ /g/
/v07/ // /v09/ /k/ /l/ /m/ /n/ // // /dZ/ // /T/ /g/ /H/ // /dZ/ //
/v08/ // /p/ /r/ /s/ /S/ /t/ /T/ // // /uw/ /y/ // // // /uw/ /v/
/v09/ // /uw/ /t/ /T/ /z/ /y/ /Z/ /v/ /w/ /y/ /Z/
/v10/ /dZ/ /v10/ /f/
/v11/ /tS/ /v11/ /d/ /D/ /dZ/
/v12/ /Z/ /v12/ /g/ /v/ /w/ /y/
/v13/ // /v13/ /b/
/v14/ /sil/ /v14/ /tS/ /H/
/v15/ /H/ /v15/ /Z/
/v16/ // /sil/ /sil/ /sil/ /sp/
/v17/ /a/ /a/ /gar/ /gar/ // //
/gar/ /gar/ // // /sp/
Table 25: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp09
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp10 /v01/ // /iy/ /dZ/ /l/ /v01/ // /ay/ /eh/ // /v01/ // /uw/ /v/ /w/ /v01/ /æ/ // // //
/N/ // /iy/ // // /v02/ /H/ /n/ // // /v02/ // /ay/ /eh/ //
/v02/ /H/ /n/ // // /v02/ /æ/ // // // /r/ /s/ /t/ /T/ // /iy/ // //
/r/ /s/ /t/ /T/ /v03/ // /sil/ /sil/ /sil/ /sp/ /v03/ /d/ /D/ /f/ /H/
/v03/ /b/ /v04/ /E/ /uw/ /gar/ /gar/ // /æ/ // // /l/ /m/ /n/ /p/ /r/
/v04/ /æ/ /d/ /D/ /E/ /v05/ // // /ay/ // /b/ /tS/ /d/ /r/ /s/ /t/ /v/ /w/
/ey/ /f/ /v06/ // /d/ /D/ /E/ /eh/ // /w/ /z/
/v05/ /k/ /v07/ /sil/ // /ey/ /f/ /g/ // /v04/ /b/ /tS/ /y/
/v06/ // /uw/ /v/ /w/ /v08/ // // // /iy/ /dZ/ /k/ /sil/ /sil/ /sil/ /sp/
/v07/ /ay/ /S/ /sil/ /v09/ // /k/ /l/ /m/ /N/ // /gar/ /gar/ // // // /E/
/v08/ // /v10/ /d/ /D/ /f/ /H/ // /S/ // // /z/ // /dZ/ /N/ // /S/
/v09/ // // /z/ /l/ /m/ /n/ /p/ /r/ /z/ /Z/ /S/ /T/ // /uw/ /Z/
/v10/ // /r/ /s/ /t/ /v/ /w/ /Z/
/v11/ /tS/ /g/ /w/ /z/
/v12/ // // /v11/ /S/
/v13/ // // /v12/ /g/ /dZ/ /N/
/v14/ /Z/ /v13/ /b/ /tS/ /y/
/v15/ // /v14/ /Z/
/v16/ // /v15/ /T/
/gar/ /gar/ /sp/ /sil/ /sil/ /sil/ /sp/
Table 26: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp10
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp11 /v01/ /iy/ /k/ /m/ /n/ /v01/ /uw/ /v01/ // // /ay/ /tS/ /v01/ /æ/ // /ay/ /E/
// /p/ /r/ /s/ /t/ /v02/ /æ/ // /ay/ /E/ /ey/ // /ey/ // /iy/
/t/ // /ey/ // /iy/ /v02/ /d/ /D/ /f/ /v02/ /dZ/ /k/ /l/ /m/
/v02/ /v/ /v03/ // /v03/ /iy/ /k/ /m/ /n/ /N/ /p/ /r/ /s/ /t/
/v03/ // // /ay/ /tS/ /v04/ // // // // /p/ /r/ /s/ /t/ /t/ /w/
/ey/ /v05/ // /t/ /sil/ /sil/ /sil/ /sp/
/v04/ /d/ /D/ /f/ /v06/ // /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ // // // //
/v05/ /w/ /v07/ // /gar/ /gar/ // /æ/ // // /b/ /tS/ /d/ /D/ /f/
/v06/ /S/ /v08/ /sil/ /b/ /E/ // /g/ /H/ /f/ /g/ /H/ // //
/v07/ // /æ/ // /b/ /v09/ // /H/ // // /dZ/ /l/ // // // /S/ /T/
/v08/ // /E/ // // /v10/ // /l/ // // /S/ /T/ /T/ // // /uw/ /v/
/v09/ /T/ // /v11/ // /T/ // // /uw/ /v/ /v/ /y/ /z/ /Z/
/v10/ // /v12/ /dZ/ /k/ /l/ /m/ /v/ /w/ /y/ /z/ /Z/
/v11/ /g/ /y/ /z/ /N/ /p/ /r/ /s/ /t/ /Z/
/v12/ /H/ /l/ /t/ /w/
/v13/ // /uw/ /v13/ /d/ /f/ /g/ /H/
/v14/ /Z/ /v14/ /S/
/v15/ // /v15/ /y/ /z/
/v16/ /sil/ /v16/ /D/ /T/ /v/
/v17/ /dZ/ /v17/ /tS/
/v18/ // /v18/ /Z/
/gar/ /gar/ // /sp/ /v19/ /b/
/sil/ /sil/ /sil/ /sp/
/gar/ /gar/ //
Table 27: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp11
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp13 /v01/ // /d/ // /k/ /v01/ /æ/ // // /ay/ /v01/ // /d/ // /k/ /v01/ /æ/ // // /ay/
/n/ /p/ /s/ /uw/ /v/ // /ey/ // // /iy/ /n/ /p/ /s/ /uw/ /v/ // /ey/ // // /iy/
/v/ /z/ /Z/ /iy/ /v/ /z/ /Z/ /iy/
/v02/ // /v02/ /E/ // // /uw/ /sil/ /sil/ /sil/ /sp/ /v02/ /d/ /f/ /g/ /k/
/v03/ // /f/ /g/ /r/ /v03/ // /gar/ /gar/ // /æ/ // // /m/ /n/ /N/ /p/ /s/
/v04/ /b/ /D/ /E/ /eh/ /v04/ // // /ay/ // /b/ /tS/ /D/ /s/ /t/ /v/ /w/ /z/
/v05/ /tS/ /v05/ // // /D/ /E/ /eh/ // /ey/ /z/
/v06/ // /iy/ // // /v06/ /sil/ /ey/ /f/ /g/ /H/ // /sil/ /sil/ /sil/ /sp/
/v07/ // // /v07/ // // /iy/ /dZ/ /m/ /N/ /gar/ /gar/ // // // //
/v08/ /æ/ // // /ay/ /v08/ /dZ/ /r/ /S/ /y/ /N/ // // // /r/ /tS/ /D/ /E/ /H/ /dZ/
/v09/ // /y/ /v09/ /d/ /f/ /g/ /k/ /r/ /S/ /t/ /T/ // /dZ/ // // // /r/
/v10/ /m/ /sil/ /t/ /T/ /m/ /n/ /N/ /p/ /s/ // // /w/ /y/ /r/ /S/ /T/ // //
/v11/ /S/ /s/ /t/ /v/ /w/ /z/ // /uw/ /y/ /Z/
/v12/ /ey/ /z/
/v13/ // /w/ /v10/ /H/
/v14/ /N/ /v11/ /b/ /tS/ /D/
/gar/ /gar/ // /sp/ /v12/ /Z/
/v13/ /T/
/sil/ /sil/ /sil/ /sp/
/gar/ /gar/ //
Table 28: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp13
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp14 /v01/ /tS/ /iy/ /dZ/ /m/ /v01/ /æ/ // // /ay/ /v01/ /æ/ // /ey/ /f/ /v01/ /æ/ // // /ay/
// /p/ /r/ /s/ /t/ /eh/ // /ey/ // /iy/ /v02/ /S/ /v/ /w/ /y/ /eh/ // /ey/ // /iy/
/t/ /T/ /iy/ /v03/ /tS/ /iy/ /dZ/ /m/ /iy/
/v02/ // /ay/ /N/ /v02/ /uw/ // /p/ /r/ /s/ /t/ /v02/ /D/ /f/ /H/ /k/
/v03/ // /b/ /d/ /D/ /v03/ // /t/ /T/ /m/ /n/ /r/ /s/ /S/
/l/ /v04/ // // // /v04/ // /b/ /d/ /D/ /S/ /t/ /v/ /w/
/v04/ /S/ /v/ /w/ /y/ /v05/ // /sil/ /l/ /sil/ /sil/ /sil/ /sp/
/v05/ /g/ /H/ /k/ /v06/ // /sil/ /sil/ /sil/ /sp/ /gar/ /gar/ // // // //
/v06/ /E/ // /v07/ // /gar/ /gar/ // // // // /tS/ /d/ /g/ // /dZ/
/v07/ /æ/ // /ey/ /f/ /v08/ // // /E/ /g/ /H/ // /dZ/ /N/ // // //
/v08/ // /uw/ /v09/ // // /k/ /N/ // // // /p/ /T/ // //
/v09/ // /v10/ /a/ /a/ // // // /uw/ /Z/ // /uw/ /y/ /z/ /Z/
/v10/ // // /v11/ /D/ /f/ /H/ /k/ /Z/ /Z/
/v11/ // /m/ /n/ /r/ /s/ /S/
/v12/ /Z/ /S/ /t/ /v/ /w/
/v13/ // /v12/ /z/
/v14/ /sil/ /v13/ /y/
/v15/ // /v14/ /b/ /tS/ /d/ /T/
/v16/ /i/ /a/ /v15/ /p/
/gar/ /gar/ // // /sp/ /v16/ /g/
/v17/ /dZ/ /N/
/v18/ /Z/
/sil/ /sil/ /sil/ /sp/
/gar/ /gar/ // //
Table 29: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp14
Speaker Bear1 Bear2 Bear3 Bear4
Viseme Phonemes Viseme Phonemes Viseme Phonemes Viseme Phonemes
sp15 /v01/ // /d/ /D/ /ey/ /v01/ // /ay/ /eh/ /ey/ /v01/ // /d/ /D/ /ey/ /v01/ // /ay/ /eh/ /ey/
// /iy/ /k/ /l/ /m/ /iy/ // /uw/ // /iy/ /k/ /l/ /m/ /iy/ // /uw/
/m/ /n/ /y/ /v02/ // // // /E/ /m/ /n/ /y/ /v02/ /b/ /d/ /D/ /f/
/v02/ // /p/ /r/ /s/ // /sil/ /sil/ /sil/ /sp/ /k/ /l/ /m/ /n/ /N/
/t/ /T/ /z/ /v03/ // /gar/ /gar/ // /æ/ // // /N/ /p/ /v/
/v03/ /eh/ // /v04/ // /æ/ // /ay/ // /b/ /tS/ /E/ /sil/ /sil/ /sil/ /sp/
/v04/ // /æ/ // // /v05/ /sil/ // /E/ /eh/ // /g/ /H/ /gar/ /gar/ // /æ/ // //
/v05/ // /v06/ // /H/ // /dZ/ /N/ // // /tS/ /E/ // /H/
/v06/ /N/ /uw/ /v/ /v07/ // // // // /p/ /r/ /H/ // /dZ/ // //
/v07/ // /v08/ /b/ /d/ /D/ /f/ /r/ /s/ /S/ /t/ /T/ // /r/ /s/ /S/ /t/
/v08/ /g/ /H/ /dZ/ /k/ /l/ /m/ /n/ /N/ /T/ // // /uw/ /v/ /t/ /T/ // // /w/
/v09/ // /N/ /p/ /v/ /v/ /w/ /z/ /Z/ /w/ /y/ /z/ /Z/
/v10/ /b/ /tS/ /v09/ /r/ /s/ /S/ /t/
/v11/ // /z/
/v12/ /ay/ /E/ /v10/ /dZ/
/v13/ /sil/ // /v11/ /Z/
/v14/ // // /v12/ /w/ /y/
/v15/ /Z/ /v13/ /H/
/v16/ /e/ /r/ /v14/ /tS/
/gar/ /gar/ // /sp/ /sil/ /sil/ /sil/ /sp/
Table 30: A speaker-dependent phoneme-to-viseme mapping derived from phoneme recognition confusions for RMAV speaker sp15