A New Re-synchronization Method for Multi-modal Fusion in Automatic Continuous Cued Speech Recognition

01/03/2020 · Li Liu et al. · Grenoble Institute of Technology

Cued Speech (CS) is an augmented form of lip reading complemented by hand coding, and it is very helpful to deaf people. Automatic CS recognition can help communication between deaf people and others. Due to the asynchronous nature of lips and hand movements, their fusion in automatic CS recognition is a challenging problem. In this work, we propose a novel re-synchronization procedure for multi-modal fusion, which aligns the hand features with the lips features. It is realized by delaying the hand position and hand shape streams by their optimal hand preceding times, which are derived by investigating the temporal organization of hand position and hand shape movements in CS. This re-synchronization procedure is incorporated into a practical continuous CS recognition system that combines a convolutional neural network (CNN) with a multi-stream hidden Markov model (MSHMM). A significant improvement of about 4.6% is achieved, reaching 76.6% CS phoneme recognition correctness, compared with the state-of-the-art architecture (72.04%), which did not take into account the asynchrony of multi-modal fusion in CS. To our knowledge, this is the first work to tackle asynchronous multi-modal fusion in automatic continuous CS recognition.


I Introduction

Communication is one of the most important parts of human life, and increasing attention is being paid to improving communication for people with disabilities. The World Health Organization (WHO) [1] reported that more than 5% of the world's population (466 million people) has disabling hearing loss (432 million adults and 34 million children). As one of the most common communication means for deaf people, lip reading [2, 3] helps deaf or hearing-impaired people access spoken speech and has undoubtedly improved their communication considerably. However, a significant drawback is that lip reading alone provides only incomplete information in most cases. In particular, it cannot distinguish certain contrasts, for example [p] vs. [b], because of the similarity of the corresponding labial shapes. As a result, it is difficult for deaf or hearing-impaired people to access speech by lip reading alone.

To overcome the insufficiency of lip reading and improve the reading ability of deaf children, Cornett [4] invented the first Cued Speech (CS) system for American English in 1967; it complements lip reading and makes all the phonemes of a spoken language clearly visible. This system is based on four hand positions and eight hand shapes, designed under two major criteria: minimum effort for encoding and maximum visual contrast. In the French CS, named Langue française Parlée Complétée (LPC) [5] (see Fig. 1), five hand positions (i.e., mouth, chin, throat, side, cheek) are used to encode vowel groups, and eight hand shapes are used to encode consonant groups. For British English CS, four hand positions are used to code monophthong vowel groups and eight hand shapes are designed to code consonant groups. By using a CS system, sounds that may look identical on the lips (e.g., /y/, /u/ and /o/) can be distinguished using the hand information, and thus it becomes possible for deaf people to understand a spoken language using visual information alone.

CS has drawn increasing attention from all over the world and has been adapted to more than sixty spoken languages to date. Another widely used communication method in the deaf community is Sign Language (SL) [6, 7, 8], which was developed in the early 18th century. SL is a language with its own grammar and syntax, while CS is a visual representation of a spoken language. Deaf people are able to master and learn the native spoken language more easily by using CS [5, 9].

Fig. 1: Hand coding in French Cued Speech system. Five hand positions (mouth, chin, throat, side, cheek) are used to code vowel groups. Eight hand shapes are used to code consonant groups. The * in the side position is used to code a single consonant, and the ** in the full hand shape is used to code a single vowel.

This work investigates the framework of asynchronous multi-modal fusion [10, 11, 12] applied to automatic continuous CS recognition. Since the multi-modal streams in CS (i.e., lips, hand shape and hand position) that need to be fused are naturally asynchronous [13, 14], the feature fusion for CS recognition poses a major challenge. A supervised CS recognition system will suffer from interference if the lips and hand information are misaligned.

In the existing literature, the CS feature modalities are assumed to be synchronous in the recognition task. In [15, 16], direct feature fusion (i.e., direct concatenation of the features) was applied to isolated CS recognition (isolated meaning that the temporal segmentation of each phoneme to be recognized is given at the test stage), without taking the asynchrony issue into account. In the state of the art [17], a tandem architecture that combines a convolutional neural network (CNN) [18, 19] with a multi-stream hidden Markov model (MSHMM) [20] was used for continuous CS recognition; it is referred to as the baseline architecture in this work. In [17], the MSHMM merges different features by assigning weights to the three feature modalities of CS, but it does not take into account the asynchrony between them. Therefore, there is still room for improving the CS recognition performance by exploring a pre-processing approach that tackles the fusion of asynchronous multi-modalities.

In this work, we propose a new architecture based on a novel re-synchronization method for asynchronous multi-modal feature fusion in CS. The novelty lies in investigating the temporal organization of the hand movement, which allows us to obtain the optimal hand preceding time (HPT) for both vowels and consonants. More precisely, our method consists of two main stages:

  1. First, we build a hand preceding model (HPM) by analyzing the HPT on a database containing four CS speakers, and show that this model provides efficient segmentations of the hand movements for all four speakers. The proposed model significantly improves the hand position recognition accuracy and constitutes the first main contribution of this work.

  2. Secondly, we propose an efficient re-synchronization procedure based on the HPM to align the hand feature streams, so that the hand and lips movements become statistically synchronous, providing a good fusion condition. By incorporating this re-synchronization procedure into the tandem CNN-MSHMM architecture (see Fig. 2), automatic continuous CS recognition obtains a significant improvement compared with the state-of-the-art baseline [17]. This is the second main contribution of the present work. To our knowledge, the proposed re-synchronization procedure is the first method to tackle the asynchrony issue of CS multi-modal fusion applied to automatic continuous CS recognition.

Fig. 2: Overview of the proposed automatic CS recognition architecture with the re-synchronization procedure. The three sets of temporal segmentations/boundaries correspond to the phonemes in the lips, hand position and hand shape streams, respectively. Compared with the state-of-the-art architecture in [17], the temporal relationship modeling (dotted box) and the re-synchronization procedure (i.e., the block just before the MSHMM-GMM decoder) are added.

II Related works

The literature on automatic CS recognition can be classified into three main categories: feature extraction, multi-modal temporal modeling and CS recognition modeling. We will discuss them separately in this section.

II-A Feature extraction in CS

In the literature on automatic CS recognition, video images were recorded with artifices applied to the CS speaker (blue sticks on the lips, blue marks on the hand and forehead) to mark the pertinent information and make subsequent feature extraction easier. For example, in [15, 16], the lips features were extracted by tracking the color marks on the speaker's lips; a threshold was then applied to the gray-level images to segment the blue lips. The coordinates of the color marks on the fingers were used as features for hand shape and hand position modeling. In [21], the speaker wore a one-colored glove in order to facilitate the hand segmentation. Stillitano et al. [22] used active contours combined with parametric models to extract the lips contour in CS.

In our recent works [23, 24, 25], several methods to get rid of these artifices on the speaker’s lips and hand were explored. A modified constrained local neural fields (CLNF) model was proposed to extract the inner lips height and width, and an adaptive ellipse model was proposed for inner lips parameters extraction in CS. In this work, we adopt the deep CNN for the feature extraction of lips and hand shape, and use the artificial neural network (ANN) [18, 26] to process the hand position feature.

II-B Multi-modal temporal modeling in CS

The temporal organization of hand movements in CS was studied in [13, 27]. For CV syllables (a consonant followed by a vowel), it was found that the hand position reaches its target, on average, well before the vowel becomes visible at the lips, both for non-sense logatome syllables such as 'mamuma' [13] and for syllables extracted from French continuous sentences [27]. In this work, we focus not only on the HPT of vowels, but also on that of consonants.

In our previous work [28], the relationship between the HPT of vowels and their target time instant (i.e., position in time) was analyzed. It was found that the HPT follows a Gaussian distribution that remains almost the same for all vowel instants, except over a small time interval before the end of each sentence. In this work, instead of following the piece-wise linear relationship, which gives a different HPT for each vowel, we make the simple but efficient assumption that the mean value of the Gaussian distribution is suitable for all vowels. Moreover, based on the HPT for vowels in [28], we explore the optimal HPT of consonants, and then propose a re-synchronization procedure that aligns the hand features with the lips features for CS multi-modal feature fusion.

II-C CS recognition modeling

Early work on CS recognition addressed isolated vowel recognition [29, 30]. Then, isolated CS phoneme recognition was realized in [15], and continuous CS recognition based on a corpus of isolated words was conducted in [16] using a context-independent HMM. For automatic continuous CS recognition based on a corpus of continuous sentences, a tandem CNN-HMM architecture extracting the CS features from raw images was proposed in [17]. A one-stream context-dependent HMM and an MSHMM were both used for the multi-modal feature fusion, and the MSHMM obtained the better performance. However, none of these works takes into account the asynchrony of the multi-modalities in CS.

In this work, we propose a new automatic CS recognition architecture by adding a re-synchronization procedure that processes the features extracted by the CNNs and the ANN before feeding them to the context-dependent MSHMM-GMM decoder. The experimental results show that this re-synchronization procedure significantly improves the CS recognition performance compared with the state of the art in both isolated [15, 16] and continuous [17] CS recognition.

III Temporal organization of hand movement

In this section, we investigate the temporal organization of the hand movement in CS, which is important for establishing the HPMs for vowels and consonants in Section IV. We first illustrate the hand preceding phenomenon using an example of CS where the speaker utters fait des ([f E d e]) in the French sentence Il fait des achats. It can be seen in Fig. 3 that, when the CS speaker's hand points to the chin position for the vowel [E], she is only lip-reading the consonant [f], and the vowel [E] has not begun yet. This phenomenon is also observed for the syllable [d e] in this example.

Fig. 3: Illustration of the asynchrony phenomenon in the CS lips-hand movement. Two syllables [f E] and [d e] are extracted from the French sentence Il fait des achats. The red vertical lines show the target instants of the vowels and consonants in the audio signal, hand position and shape streams, respectively.

Let the rectangles on the 3rd row (hand position tier) denote the time intervals in which the hand reaches its target position, and the rectangles on the 4th row (hand shape tier) denote the time intervals in which the hand prepares its shape to indicate consonants. It can be seen in Fig. 3 that the corresponding rectangles on the 3rd and 4th rows are aligned at their right end, and that the bottom one is longer. The underlying reason is that, in our database (introduced in Section V-A), we observe the following two facts. (1) In the hand movement, the hand shape reaches its final configuration at the same time as the hand position reaches its target for a vowel. (2) The hand stays at the target position only for a short time interval, whereas the hand shape is almost formed well before that and persists while the hand continues moving and rotating; the hand shape intervals are therefore relatively long. In fact, when the hand begins to leave the target position, the fingers move quickly to prepare the next hand shape.

As a consequence, the hand position is more sensitive to the asynchrony issue than the hand shape. This may be due to the intrinsic fact that the hand stays at its target position for a relatively short time, while the full hand shape is maintained for a longer time during the CS coding process.

Fig. 4: Illustration of the different parameters of the hand movement temporal organization in CS. $\Delta t_c$ and $\Delta t_v$ are the HPT for consonants and vowels, respectively; $\Delta_p$ and $\Delta_s$ are the time intervals of the target hand position and target hand shape.

Now, we give some definitions and notations for the HPMs. Without loss of generality, we only consider CV syllables [4]. We are interested in how long the hand precedes the lips movement. As shown in Fig. 4, for a vowel in a syllable, the middle instant of this vowel in the audio signal is denoted by $t^a_v$, and the target instant of the hand position movement is denoted by $t^p_v$. $\Delta_p$ is the time interval in which the hand stays at its target position, which is set experimentally. Then the HPT (in ms) for vowels is

$\Delta t_v = t^a_v - t^p_v.$   (1)

For a consonant in a syllable, the middle instant of this consonant in the audio signal is denoted by $t^a_c$. Because of the hand preceding phenomenon, $t^p_v$ precedes $t^a_c$. We assume that the complete hand shape is realized at the same time as the target position is reached. However, this complete hand shape does not correspond to a single instant but naturally to a certain time interval. This interval lies before $t^p_v$ because, after this moment, the hand shape begins to change immediately. Let $\Delta_s$ be the time interval corresponding to a given hand shape; within it, there is a sub-interval in which the hand shape is almost formed but the hand continues moving and rotating. The middle instant of the target hand shape interval is denoted by $t^s_c$, which is the target instant of the hand shape. Then the HPT (in ms) for consonants is

$\Delta t_c = t^a_c - t^s_c.$   (2)
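To make these definitions concrete, here is a minimal Python sketch that computes the HPT values of (1) and (2) from annotated target instants. The dictionary keys and the toy numbers are illustrative assumptions, not the labels of our actual annotation files.

```python
import numpy as np

def hand_preceding_times(syllables):
    """Compute the HPT of (1) and (2) for a list of annotated CV syllables.

    Each syllable is a dict with (hypothetical) keys, all instants in seconds:
      't_v_audio' : middle instant of the vowel in the audio signal
      't_c_audio' : middle instant of the consonant in the audio signal
      't_pos'     : target instant of the hand position movement
      't_shape'   : middle instant of the target hand shape interval
    Returns two arrays in milliseconds: HPT of vowels and HPT of consonants.
    """
    dt_v = [(s['t_v_audio'] - s['t_pos']) * 1000.0 for s in syllables]    # eq. (1)
    dt_c = [(s['t_c_audio'] - s['t_shape']) * 1000.0 for s in syllables]  # eq. (2)
    return np.asarray(dt_v), np.asarray(dt_c)

# toy example: the hand reaches its targets before the corresponding audio instants
syllables = [{'t_v_audio': 1.20, 't_c_audio': 1.09, 't_pos': 1.05, 't_shape': 1.02}]
dt_v, dt_c = hand_preceding_times(syllables)
print(dt_v, dt_c)  # positive values mean that the hand precedes the audio
```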

IV Novel re-synchronization procedure

In this section, we first formulate the problem of multi-modal fusion in CS recognition. Then, based on the temporal organization of the hand movement described in Section III, we study the HPT of vowels and consonants, build the corresponding HPMs and derive a new re-synchronization procedure.

IV-A Problem formulation

In automatic continuous CS phoneme recognition, the features of the lips $x^l_t$, the hand position $x^p_t$ and the hand shape $x^s_t$ are merged and fed into the phonetic decoder. Let the phoneme extracted from a continuous French sentence at time $t$ be determined as

$\hat{\phi}_t = \arg\max_{\phi} P(\phi \mid x_t; \lambda_\phi),$   (3)

where $x_t = [x^l_t, x^p_t, x^s_t]$ is the merged feature and $\lambda_\phi$ is the model parameter for phoneme $\phi$.

As introduced in Section III, the hand position and hand shape features are both asynchronous with the lips features in CS. Therefore, at time $t$, the directly concatenated feature $x_t$ suffers from interference and is not suitable for training a particular phoneme class $\phi$. For example, in Fig. 3, if direct concatenation fusion is applied, the vowel [E] in the lips feature is merged with [e] in the hand position feature when training the reference vowel [E].

This work aims to propose a way to align and with , i.e., to build transformations and such that

(4)
(5)

are both synchronous with . Then the merged feature for phoneme can be used to train the model of this phoneme without any interference.
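The sketch below illustrates the formulation: the three streams are concatenated frame by frame and a phoneme class is chosen by maximizing a score, as in (3). The per-class Gaussian scoring is a deliberately simplified stand-in for the context-dependent MSHMM-GMM decoder of Section V-C2, and all names and dimensions are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def merge(x_l, x_p, x_s):
    """Frame-wise concatenation of the lips, hand position and hand shape
    features into the merged observation used in (3)."""
    return np.concatenate([x_l, x_p, x_s], axis=-1)

def decode_frame(x_t, models):
    """Toy instantiation of (3): pick the phoneme class whose Gaussian model
    gives the merged feature x_t the highest log-likelihood."""
    scores = {ph: multivariate_normal.logpdf(x_t, m['mean'], m['cov'])
              for ph, m in models.items()}
    return max(scores, key=scores.get)

# toy usage with 2-dim lips, hand position and hand shape features
rng = np.random.default_rng(0)
models = {'a': {'mean': np.zeros(6), 'cov': np.eye(6)},
          'i': {'mean': np.ones(6), 'cov': np.eye(6)}}
x_t = merge(rng.normal(size=2), rng.normal(size=2), rng.normal(size=2))
print(decode_frame(x_t, models))
```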

IV-B Hand preceding model for vowels

Now we establish the relationship between the HPT $\Delta t_v$ and the vowel instant based on our database (introduced in Section V-A). For this purpose, it is necessary to measure the hand position target instants $t^p_v$. To make them precise enough, we determine them manually: a manual temporal segmentation of the vowels and consonants of each sentence is carried out using the movie editor Magix [31, 32]. Here, $t^p_v$ is set to the middle instant of the temporal target interval, which contains several images around the hand target position.

All the $t^a_v$ can be obtained from the audio-based segmentation, and $\Delta t_v$ can then be calculated by (1). Taking the LM speaker as an example, the $\Delta t_v$ of 1066 vowels extracted from 138 sentences are plotted in Fig. 5(a) with respect to the vowel instant in the sentence. In this figure, all the points are aligned by the sentence end (the instant 0) instead of the beginning, since in this way a common rule between $\Delta t_v$ and the vowel instant emerges for all the vowels. More precisely, from the beginning of a sentence until a certain instant (about one second before the end), the statistical repartition of $\Delta t_v$ is almost the same. Then, $\Delta t_v$ decreases until the end of the sentence and finally converges. By aligning all the sentences on their end, this phenomenon becomes very evident: the distributions of short sentences and long sentences are entirely superposed at the end of the sentences. Besides, by comparing the distributions of the other three speakers, we find that they follow the same repartition, as shown in Fig. 5(b)-(d).

Based on these observations, we build the HPM as

$\Delta t_v(t) = \begin{cases} \mu_v, & t \le t_0, \\ a\,t + b, & t_0 < t \le 0, \end{cases}$   (6)

where $t$ is the time instant of the vowel (aligned so that the sentence end is 0), and $\mu_v$ is the mean value of $\Delta t_v$ before the turning instant $t_0$. As shown in Fig. 5, the HPT of all vowels before $t_0$ follows a Gaussian distribution; since its mean value reflects the main character of these points, we take it as the model in this region. After $t_0$, we make a linear regression on these points, since the HPT for vowels decreases linearly. $a$ and $b$ in (6) are the slope and intercept of this linear segment, respectively. In our case, $t_0$ lies at about one second before the end of a sentence. The HPM (6) fits all sentences of the four CS speakers.
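A minimal sketch of how the HPM (6) can be fitted and evaluated is given below, assuming the vowel instants are already aligned on the sentence end; the turning instant, the synthetic data and the function names are assumptions made for illustration only.

```python
import numpy as np

def fit_hpm(t, dt_v, t0=-1.0):
    """Fit the piecewise HPM of (6).

    t    : vowel instants aligned on the sentence end (t <= 0, in seconds)
    dt_v : measured HPT of the vowels, eq. (1)
    t0   : turning instant (an assumption here; read off the data in the paper)
    Returns (mu, a, b): the constant part and the slope/intercept of the linear part.
    """
    before = t <= t0
    mu = float(np.mean(dt_v[before]))                 # constant part of (6)
    a, b = np.polyfit(t[~before], dt_v[~before], 1)   # linear regression after t0
    return mu, a, b

def hpm(t, mu, a, b, t0=-1.0):
    """Evaluate the HPM (6) at vowel instant(s) t."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= t0, mu, a * t + b)

# synthetic check: constant HPT before t0, linearly decreasing afterwards
t = np.linspace(-4.0, 0.0, 200)
dt_true = np.where(t <= -1.0, 0.15, 0.15 - 0.1 * (t + 1.0))
mu, a, b = fit_hpm(t, dt_true)
print(mu, a, b, hpm([-2.0, -0.5], mu, a, b))
```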

Fig. 5: HPT distribution and HPM. The abscissa is the vowel instant in the sentence; all points are aligned at the end of the sentences, where the instant is 0. The ordinate is $\Delta t_v$ (in seconds). In (a), the red circles show the distribution for 50 long sentences and the blue stars the distribution for 88 short sentences of the LM corpus; the black curve shows the HPM. In (b), the blue stars show the distribution for 88 short sentences of the LM corpus and the magenta stars the distribution for 44 short sentences of the SC corpus. In (c), the blue stars show the distribution of all 83 vowels of the MD corpus, made of 50 single words. In (d), the blue stars show the distribution of all 1045 vowels of the CA corpus, made of 97 British English sentences.

The proposed HPM for the vowel will be evaluated in Section VI-A, where a hand position recognition experiment is carried out. In this experiment, we propose a temporal segmentation method for the hand position movement based on the HPM. More precisely, from the audio-based segmentation, each temporal segmentation of vowels is shifted by using the HPM in (6) according to the vowel instant in the sentence. Besides, the proposed temporal segmentation is also used to improve the efficiency of the ANN training for the hand position feature, as well as the CNN training for the hand shape feature (with respect to the HPM for the consonants) in the automatic CS recognition.

IV-C Hand preceding model for consonants

To investigate the HPM for consonants, we first perform a statistical study of

$D = t^a_v - t^a_c,$   (7)

where $t^a_v$ and $t^a_c$ are the middle instants of the vowel and of the consonant in the audio signal, respectively.

Based on the LM corpus, we randomly choose 10 sentences containing about 100 syllables. $D$ is calculated for all these syllables and varies over a broad range; therefore, we only consider its mean value (about 110 ms). Given the optimal HPT for vowels obtained in Section IV-B, we can deduce from Fig. 4 that the hand position target $t^p_v$ precedes the consonant instant $t^a_c$ by about 30 ms (i.e., the difference between $\Delta t_v^*$ and the mean of $D$). Based on a large number of observations, we assume that the sub-interval in which the hand shape is almost formed but still moving and rotating lasts about three images. Consequently, we obtain a theoretical estimate of how far $t^s_c$ precedes $t^a_c$, i.e., of the optimal $\Delta t_c^*$ for all consonants.

Then we determine the optimal $\Delta t_c^*$ experimentally. To do this, we perform a hand shape recognition experiment using the CNN-based hand features (introduced in Section V-C) and a multi-Gaussian classifier. In the recognition, we apply several different temporal segmentations, derived by shifting the audio-based segmentation by different values of $\Delta t_c$ (from 0 up to an upper bound with a fixed step). We denote by $\Delta t_c^*$ the $\Delta t_c$ that gives the best hand shape recognition accuracy. A hand shape recognition experiment based on all 476 sentences of the LM corpus is conducted; 80% of the data is used for training and the remaining 20% for test. The recognition results are shown in Fig. 6, which gives the recognition accuracy as a function of the monotonically increasing $\Delta t_c$. We observe a curve (red curve) with a clear maximum, which is coherent with the theoretical analysis of $\Delta t_c^*$. Indeed, the peak region of this curve is relatively smooth, but the clear maximum confirms that $\Delta t_c^*$ exists.
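The grid search described above can be sketched as follows; the candidate shift grid and the dummy scoring function are placeholders standing in for the real hand shape recognition experiment (CNN features plus a multi-Gaussian classifier).

```python
import numpy as np

def find_optimal_consonant_hpt(candidate_shifts_ms, segment_and_score):
    """Grid search for the optimal consonant HPT (Section IV-C).

    candidate_shifts_ms : candidate shifts (in ms) applied to the audio-based
                          segmentation of the consonants.
    segment_and_score   : callable that, for one shift, re-segments the hand
                          shape stream, runs the recognition experiment and
                          returns its accuracy (its internals are abstracted
                          away here).
    Returns the best shift and the whole accuracy curve.
    """
    accuracies = np.array([segment_and_score(s) for s in candidate_shifts_ms])
    return candidate_shifts_ms[int(np.argmax(accuracies))], accuracies

# usage sketch with a dummy scoring function peaking at an arbitrary shift
shifts = np.arange(0, 201, 20)                      # placeholder grid (ms)
dummy_score = lambda s: 0.8 - abs(s - 100) / 500.0  # stand-in for the experiment
best_shift, curve = find_optimal_consonant_hpt(shifts, dummy_score)
print(best_shift)
```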

The temporal segmentation of the hand shape is obtained in the same way as for the hand position. This temporal segmentation of the hand shape will be used in the CNN training for hand shape feature extraction in Section V-C.

Fig. 6: Hand shape recognition accuracy using different segmentations, as a function of $\Delta t_c$ (red curve). The green circle highlights the optimal recognition accuracy, reached at $\Delta t_c^*$.

IV-D Re-synchronization procedure

Instead of following the piecewise linear relationship in (6), which gives a different HPT for each vowel after $t_0$, we adopt the simple but efficient assumption, introduced in Section IV-B, that a single value $\Delta t_v^* = \mu_v$ is suitable for all vowels. Similarly, we assume that a single value $\Delta t_c^*$ is suitable for all consonants.

Now we propose the re-synchronization procedure to align the hand shape and hand position features with the lips features, based on the above HPMs for vowels and consonants. This procedure contains two steps:

  1. Positively shift the hand position stream $x^p$ by $\Delta t_v^*$, i.e.,

    $\tilde{x}^p_t = x^p_{t - \Delta t_v^*},$   (8)

    where $\Delta t_v^*$ is the optimal HPT for vowels (i.e., $\mu_v$ in (6)).

  2. Positively shift the hand shape stream $x^s$ by $\Delta t_c^*$, i.e.,

    $\tilde{x}^s_t = x^s_{t - \Delta t_c^*},$   (9)

    where $\Delta t_c^*$ is the optimal HPT for consonants determined in Section IV-C.

We take the vowel case in Fig. 7 as an example to illustrate this procedure. Fig. 7(a) shows the audio signal of the French sentence Ma chemise est roussie with its phonetic annotations. The lips features are assumed to be synchronous with the audio signal [33]. In Fig. 7(b), the hand position is defined as the coordinate of the hand back point. It is clear that the hand position stream is not synchronous with the audio signal; therefore, a direct fusion of these two streams causes interference. In Fig. 7(c), the aligned hand position stream is obtained by positively shifting the original one by $\Delta t_v^*$, as in (8). As a result, the hand position stream is re-synchronized with the audio signal on average.

For consonants, the alignment of the hand shape features is similar, but uses $\Delta t_c^*$. Even though the values of $\Delta t_v^*$ and $\Delta t_c^*$ may vary across speakers, the proposed re-synchronization procedure is a new, simple and efficient way to improve the multi-modal fusion for continuous CS recognition.
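A minimal sketch of the two-step procedure (8)-(9) applied to frame-synchronous feature streams is given below; the 50 fps frame rate follows the French corpora of Section V-A, and the padding strategy at the beginning of the stream is an implementation assumption.

```python
import numpy as np

def delay_stream(x, delay_ms, frame_rate=50):
    """Positively shift (delay) a feature stream, as in (8)-(9).

    x         : array of shape (T, d), one feature vector per video frame
    delay_ms  : delay in milliseconds (the optimal HPT of the stream)
    frame_rate: video frame rate (50 fps for the French corpora)
    The first frame is repeated to fill the gap created by the delay.
    """
    n = int(round(delay_ms * frame_rate / 1000.0))
    if n <= 0:
        return x
    pad = np.repeat(x[:1], n, axis=0)
    return np.vstack([pad, x])[:len(x)]

def resynchronize(x_lips, x_pos, x_shape, dt_v_ms, dt_c_ms):
    """Re-synchronization procedure: delay the hand position stream by the
    vowel HPT and the hand shape stream by the consonant HPT, then concatenate
    them with the lips stream."""
    x_pos_sync = delay_stream(x_pos, dt_v_ms)      # eq. (8)
    x_shape_sync = delay_stream(x_shape, dt_c_ms)  # eq. (9)
    return np.hstack([x_lips, x_pos_sync, x_shape_sync])
```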

Fig. 7: Illustration of the re-synchronization procedure. (a) The audio speech with its temporal segmentation and phonetic annotations. (b) The original hand position trajectory (i.e., the coordinate of the hand back point). (c) The re-synchronized hand position, derived by shifting the original hand position in (b) by $\Delta t_v^*$. The two green lines correspond to the audio-based temporal segmentation of the vowel [i].

V Experimental setup

In this section, we first introduce the database and the experimental metric for the HPMs as well as the CS continuous recognition experiments. Then technical details for the automatic continuous CS recognition are presented.

V-A Database

The database contains the recordings of three French CS speakers (LM, SC, MD) and one British English CS speaker (CA). The three French CS corpora were recorded in a sound-proof room at GIPSA-lab, France. Color video images of the speaker's upper body were recorded at 50 fps, with a spatial resolution of 720×576 pixels (RGB). The LM speaker pronounces and codes a set of 238 French sentences in CS derived from a corpus described in [29, 34]. Each sentence is repeated twice, resulting in a set of 476 sentences. The SC corpus is made of 267 sentences from the same database as LM. The MD corpus contains videos of 50 French words made of numbers and daily words; the whole corpus is uttered 10 times. It should be mentioned that the LM corpus was recorded without any artificial mark, while the SC and MD corpora were recorded earlier with artificial marks [27, 29]. However, in this work, the SC and MD corpora are only used to establish and evaluate the HPM and do not take part in the automatic CS recognition; their artificial marks are therefore not used at all.

The CA corpus is the first British English CS corpus (it will be made publicly available on Zenodo). It was recorded for this work at the Cued Speech UK association (http://www.cuedspeech.co.uk/), without any artificial mark. The professional CS speaker CA (with no hearing impairment) was asked to simultaneously utter and encode a set of British English sentences (e.g., I feel it is a time to move to a new chapter in my career). Color video images of the speaker's upper body were recorded at 25 fps. There are in total 907 monophthongs and 138 diphthongs in this corpus.

In Section IV, we take a subset of the whole database to build the HPM, which contains 138 sentences from the LM corpus, including 88 short sentences (in this work, a short sentence is a sentence with fewer than 4 vowels; otherwise it is a long sentence) and 50 long sentences (1066 vowels in total), as well as 44 short sentences (196 vowels) from the SC corpus. The hand position is determined manually, following the rule that it is taken as the 2D position of the middle finger if the middle finger is visible; otherwise, it is taken as the index finger. Moreover, the hand back point is also tracked manually. The advantage of this point is that it is always visible during the CS coding process, and thus all the hand back points can be collected without any interruption.

For the CS phoneme recognition, all 476 sentences of the LM corpus are used, with 34 French phoneme classes (vowels and consonants). We have made this database publicly available (https://doi.org/10.5281/zenodo.1206001). The French CS is described with lips visemes (as defined in [29]), eight different hand shapes and five different hand positions (as defined in Fig. 1). The phonetic transcription is extracted automatically using the LIA_Phon software [35] and post-checked manually to adapt it to the pronunciation of the CS speaker. The audio-based temporal segmentation of each vowel is obtained with a conventional automatic speech recognition (ASR) system built with HTK 3.4 [36]: using forced alignment, the audio signal, synchronous with the video, is automatically labeled.

V-B Experimental metric

For the experiments including the hand shape recognition (used to investigate the optimal $\Delta t_c^*$ in Section IV-C), the hand position recognition (used to evaluate the HPM in Section VI-A) and the CS vowel, consonant and phoneme recognition (used to evaluate the proposed re-synchronization procedure in Sections VI-B and VI-C), we randomly select 80% of the sentences as the training set and the remaining 20% as the test set. Because each sentence is recorded twice, a sentence and its repetition are always assigned together, either to the training set or to the test set. All the experiments are repeated ten times with different training and test sets, so that the standard deviation (std) of the results can be controlled. It will be seen in Section VI that the stds of all the experimental results are less than 0.5%.

In CS phoneme recognition, the correctness of the MSHMM-GMM decoder,

$\mathrm{Corr} = \dfrac{N - D - S}{N},$   (10)

is chosen as the metric, where $N$ is the number of phonemes in the test set, $D$ is the number of deletion errors and $S$ is the number of substitution errors.
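The metric (10) and the paired sentence-level split can be sketched as follows; the numbers in the example are arbitrary and unrelated to the results reported in Section VI.

```python
import random

def correctness(n_ref, n_del, n_sub):
    """Phoneme recognition correctness of (10): Corr = (N - D - S) / N."""
    return (n_ref - n_del - n_sub) / n_ref

def split_sentences(sentence_ids, test_ratio=0.2, seed=0):
    """80/20 split at the sentence level; a sentence and its repetition share
    the same id, so they always land in the same set (Section V-B)."""
    ids = sorted(set(sentence_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(round(test_ratio * len(ids)))
    test_ids = set(ids[:n_test])
    train_ids = [i for i in ids if i not in test_ids]
    return train_ids, sorted(test_ids)

print(f"{correctness(120, 13, 18):.2%}")           # arbitrary example counts
print(split_sentences([0, 0, 1, 1, 2, 2, 3, 3, 4, 4]))
```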

V-C Automatic continuous Cued Speech recognition

The continuous CS recognition (see Fig. 2) is carried out to evaluate the proposed re-synchronization procedure, and it contains three steps: feature extraction, multi-modal feature fusion and phonetic decoding. Note that the re-synchronization procedure is used to deal with the multi-modal feature fusion. Now we only introduce the methods for feature extraction and phonetic decoding.

V-C1 Cued Speech feature extraction

Lips and hand shape feature extraction

We first extract the regions of interest (ROIs) of the lips and of the hand. The lips ROI is extracted using the Kanade-Lucas-Tomasi (KLT) feature tracker [37] (with a dezooming process), and the hand ROI is extracted using an Adaptive Background Mixture Model (ABMM) [38], which is able to track an object of variable shape. Then, the lips and hand ROIs are converted to 2D gray-level images and resized to a fixed size using cubic interpolation.
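The sketch below illustrates this pre-processing with OpenCV, using MOG2 background subtraction as a stand-in for the ABMM of [38] and pyramidal Lucas-Kanade tracking in place of the KLT tracker of [37]; the ROI size, the initialization of the lips points and the parameter values are assumptions.

```python
import cv2

# Gaussian-mixture background subtractor, in the spirit of the ABMM of [38]
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

def hand_roi(frame, size=64):
    """Crop a square hand ROI around the largest moving region, convert it to a
    gray-level image and resize it with cubic interpolation (Section V-C1)."""
    mask = bg_subtractor.apply(frame)
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return cv2.resize(roi, (size, size), interpolation=cv2.INTER_CUBIC)

def track_lips_points(prev_gray, gray, prev_points):
    """One pyramidal Lucas-Kanade tracking step for lips landmark points
    (prev_points is an Nx1x2 float32 array; its initialization around the
    mouth is assumed to be given)."""
    new_points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_points, None)
    return new_points[status.flatten() == 1]
```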

Fig. 8: Implementation details of the CNNs used to extract the hand shape features from hand ROI in CS.

CNNs are very powerful for extracting high-level features from images [18]. In order to avoid using artifices (e.g., color marks on the speaker's lips and hand), in this work a CNN is used to extract the lips and hand shape features from the raw ROIs (i.e., natural images without marks). The CNN architecture is almost the same as in the state of the art [17] (see Fig. 8). The only difference is that, instead of using the hand movement temporal segmentation derived by a simple heuristic (i.e., forcing the left temporal boundary of each phoneme to extend to the left boundary of the previous phoneme), we use the HPM-based temporal segmentations of the hand position and hand shape movements for the CNN training.

Hand position feature extraction

In [15, 16], the hand position was tracked by detecting the color marks on the hand, and no automatic method was used. In our previous work [17], ABMMs were used to automatically extract the hand position in CS from the raw hand ROIs. The extracted hand position is then processed by a simple feed-forward ANN containing two fully connected hidden layers of four neurons each, with ReLU activation functions. After the softmax layer, it outputs the posterior probabilities of the target classes (dimension 6), which constitute the final hand position feature. A mini-batch gradient descent algorithm based on the Root Mean Square Propagation (RMSprop) adaptive learning rate method is used to optimize the ANN parameters.
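A possible PyTorch sketch of this small network is given below. The layer sizes, activations, output dimension and optimizer follow the description above, while the input dimension (the 2D hand coordinates given by the ABMMs), the learning rate and the dummy mini-batch are assumptions.

```python
import torch
import torch.nn as nn

class HandPositionANN(nn.Module):
    """Two hidden layers of four neurons with ReLU, softmax output of dimension
    6 used as the final hand position feature (Section V-C1)."""
    def __init__(self, in_dim=2, n_classes=6):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 4), nn.ReLU(),
                                    nn.Linear(4, 4), nn.ReLU())
        self.out = nn.Linear(4, n_classes)

    def forward(self, x):                      # logits, used by the training loss
        return self.out(self.hidden(x))

    def posteriors(self, x):                   # softmax posteriors = hand position feature
        return torch.softmax(self(x), dim=-1)

model = HandPositionANN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # RMSprop, as in the text
criterion = nn.CrossEntropyLoss()

# one mini-batch gradient descent step on dummy data
x, y = torch.randn(32, 2), torch.randint(0, 6, (32,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```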

V-C2 MSHMM-GMM phonetic decoding

As in the baseline architecture [17], in the proposed architecture each phoneme is modeled by a context-dependent triphone MSHMM (i.e., taking into account the contextual information of the preceding and following phonemes) [39]. Three emitting states are used, with GMMs modeling the features of lips, hand position and hand shape together with their first derivatives. The main difference between the two architectures is that, in the proposed one, the MSHMM-GMMs model the re-synchronized multi-modal features, whereas in the baseline they model the asynchronous multi-modal features. In the proposed architecture, the emission probability at state $j$ is

$b_j(o_t) = \prod_{s=1}^{S}\Big[\sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm})\Big]^{\gamma_s},$   (11)

where $o_t$ is the merged feature at time $t$ and $S$ is the number of feature streams, which is set to 3 in this work. $\mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm})$ is the Gaussian probability density of the observation $o_{st}$ of feature stream $s$ at time $t$, with mean $\mu_{jsm}$ and covariance matrix $\Sigma_{jsm}$ for state $j$, mixture component $m$ and stream $s$. For stream $s$, $M_s$ Gaussians with weights $c_{jsm}$ are used to model the observation. In the experiments, $M_s$ is iteratively increased from 1 to 4 during training and is finally set to the value achieving the best performance. $\gamma_s$ is the weight of feature stream $s$, optimized by cross-validation. These weights must satisfy

$\sum_{s=1}^{S} \gamma_s = 1.$   (12)

Finally, the optimal weights for the lips, hand shape and hand position streams are fixed by cross-validation, the hand position stream receiving a weight of 0.2 (cf. Section VI-C).
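The emission probability (11), under the constraint (12), can be sketched as follows; the toy single-component GMMs and the lips/hand-shape stream weights in the example are placeholders (only the 0.2 hand position weight is reported in Section VI-C).

```python
import numpy as np
from scipy.stats import multivariate_normal

def stream_likelihood(o_s, weights, means, covs):
    """GMM likelihood of one stream: sum_m c_m * N(o_s; mu_m, Sigma_m)."""
    return sum(c * multivariate_normal.pdf(o_s, m, S)
               for c, m, S in zip(weights, means, covs))

def mshmm_emission(obs_streams, gmms, stream_weights):
    """State emission probability of the MSHMM, eq. (11): product over the S
    streams of each stream's GMM likelihood raised to its stream weight gamma_s.
    `gmms` is a list of (weights, means, covs) triples, one per stream."""
    assert abs(sum(stream_weights) - 1.0) < 1e-6   # constraint (12)
    b = 1.0
    for o_s, (c, mu, sig), gamma in zip(obs_streams, gmms, stream_weights):
        b *= stream_likelihood(o_s, c, mu, sig) ** gamma
    return b

# toy example: three 2-dim streams (lips, hand shape, hand position)
gmm = ([1.0], [np.zeros(2)], [np.eye(2)])
weights = [0.4, 0.4, 0.2]   # lips/hand-shape weights are placeholders
print(mshmm_emission([np.zeros(2)] * 3, [gmm] * 3, weights))
```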

It should be mentioned that, in this work, we do not take into account the pronunciation dictionary or language model in order to directly compare the CS recognition results with the state-of-the-art [17], which does not incorporate them.

VI Results and Discussions

In this section, we first evaluate the performance of HPM for vowels. Then a two-step evaluation of the proposed CS recognition architecture is carried out to evaluate the performance of the proposed re-synchronization procedure.

Speaker   Audio-based segmentation   HPM-based segmentation   Ground truth segmentation
LM        64.23                      82.35                    96.93
SC        75.11                      83.93                    99.91
MD        65.77                      73.10                    98.26
CA        54.61                      65.03                    92.02

TABLE I: Hand position recognition accuracy (%) using a multi-Gaussian classifier with different temporal segmentations of the hand position movement, for the four CS speakers. The ground truth target finger is used as the hand position. The audio-based segmentation is the temporal segmentation used in previous work [15, 16], and the HPM-based segmentation is the temporal segmentation predicted by the proposed HPM for the hand movement.

VI-A Evaluation of the HPM-based temporal segmentation of hand position

To evaluate the HPM-based temporal segmentation of hand position, we compare it with both the ground truth and audio-based segmentations. (1) For the ground truth, the target instants of all vowels are determined manually, which constitutes a golden reference for the hand position recognition in CS. (2) In the literature [15, 16], the temporal segmentation of hand position is always based on the audio signal. It is important to compare this segmentation with the HPM-based segmentation to see the potential benefits of HPM.

We first evaluate the proposed HPM by visualizing the hand position distributions in a 2D image using different temporal segmentations. Then, we apply the simple Gaussian classifier to hand position recognition.

The hand position (target finger) spatial distributions of the vowels, for the four CS speakers and the different temporal segmentations, are shown in Fig. 9. It is reasonable to assume that a better temporal segmentation corresponds to more distinguishable and separable hand position distributions. It can be seen that the Gaussian ellipses become more and more distinguishable from left to right for all four speakers. Note that the second speaker, SC, codes CS with her left hand, while the other three speakers use their right hand. Taking the LM case, corresponding to (a), (b) and (c), as an example, the points in (a) overlap heavily across the five positions, while the points in (b) are clearly more separable; in particular, the points at the throat and chin positions are now separated. Moreover, the distribution of the Gaussian ellipses in (b) is very close to that in (c), with only a few intersections between the ellipses. These distributions illustrate the satisfactory performance of the HPM.

Fig. 9: Hand position (target finger position) distributions with different temporal segmentations for the four CS speakers LM, SC, MD and CA, respectively. (a), (d), (g) and (j): audio-based segmentation. (b), (e), (h) and (k): HPM-based segmentation. (c), (f), (i) and (l): ground truth temporal segmentation. For the French CS speakers LM, SC and MD, five groups of points correspond to the different hand positions: red, cheek; green, mouth; black, throat; cyan, chin; blue, side. Note that the speaker SC uses her left hand to code CS, while the other three use their right hand. In British English CS, only four hand positions are used to code all vowels; thus, for the speaker CA, the colors mean: red, mouth; black, throat; green, chin; blue, side.

For further evaluation, we apply these three temporal segmentations to the hand position recognition experiments for the four speakers. The results are reported in Table I. Five multi-Gaussian models are trained for the five positions, based on a sub-database containing 138 sentences from the LM corpus, a sub-database containing 44 sentences from the SC corpus, and all vowels from the 50 words of the MD corpus. For the British English corpus CA, four multi-Gaussian models are trained for the four positions based on the CA corpus.
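A sketch of such a multi-Gaussian position classifier is given below, using scikit-learn Gaussian mixtures; the number of mixture components and the synthetic points are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_position_models(points_by_class, n_components=1):
    """Fit one multi-Gaussian model per hand position class (Section VI-A).
    `points_by_class` maps a position label to an (N, 2) array of target finger
    coordinates collected with a given temporal segmentation."""
    return {label: GaussianMixture(n_components=n_components).fit(pts)
            for label, pts in points_by_class.items()}

def classify_position(models, point):
    """Assign a 2D hand point to the class with the highest log-likelihood."""
    point = np.atleast_2d(point)
    return max(models, key=lambda label: models[label].score(point))

# toy usage with synthetic 2D points for two positions
rng = np.random.default_rng(0)
data = {'chin': rng.normal([0.0, 0.0], 0.1, (50, 2)),
        'cheek': rng.normal([1.0, 1.0], 0.1, (50, 2))}
models = train_position_models(data)
print(classify_position(models, [0.95, 1.02]))   # -> 'cheek'
```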

Using the target finger positions and the ground truth temporal segmentation, the recognition accuracies of all four speakers are higher than in the other two cases, which use the audio-based and HPM-based segmentations. More precisely, for speakers LM, SC, MD and CA, the highest accuracies of 96.93%, 99.91%, 98.26% and 92.02% are achieved (see Fig. 9(c), (f), (i) and (l)), respectively. These constitute golden references for the hand position recognition in CS for these four speakers. However, when the audio-based segmentation is used, the recognition accuracies drop to 64.23%, 75.11%, 65.77% and 54.61% (see Fig. 9(a), (d), (g) and (j)), respectively. When the proposed HPM-based segmentation is used instead of the audio-based one, the accuracies increase significantly, to 82.35%, 83.93%, 73.10% and 65.03% (see Fig. 9(b), (e), (h) and (k)), respectively. All these results show that the audio-based segmentation is indeed not suitable for the hand position recognition, and that the proposed HPM improves the hand position recognition performance significantly.

Note that for the MD corpus, the improvement brought by the HPM-based temporal segmentation (around 7.5%) is not as marked as for the LM and SC corpora. This is because the MD corpus is composed of single words with comparatively short durations, while the LM and SC corpora are made of continuous sentences; the hand preceding phenomenon in the MD corpus is therefore not as strong as in the other two corpora.

Overall, the HPM-based temporal segmentation significantly improves the performance of hand position recognition.

VI-B Evaluation of the re-synchronization procedure applied to the vowel and consonant recognition in CS

Now, we evaluate the proposed re-synchronization procedure by carrying out the first step, i.e., vowel and consonant recognition experiments based on the fused features of two streams. We use MSHMM-GMM to see the benefits of this re-synchronization procedure.

VI-B1 Vowel recognition based on the fusion of lips and hand position

We mainly discuss the vowel recognition based on the fusion of lips and hand position. The vowel recognitions using only lips and only hand position information are also presented.

The lips and hand position features are merged and fed to a two-stream MSHMM with triphone context-dependent modeling. 14 vowels (i.e., [i, e, E, a, y, ø, u, o, O, Ẽ, œ̃, ã, Õ, @]) plus silence are the target labels. The results are shown in Table II. The vowel recognition using the re-synchronization procedure (8) obtains a higher correctness (74.65%) than the one without it (70.12%), an improvement of about 4.5%. For the single-stream case, using only the lips or only the hand position gives a very similar and low correctness. This is reasonable, since these two streams are meant to be combined to recognize the vowels, and each single stream only carries partial, group-level information.

                       one-stream   non-resyn   resyn
only lips              34.88%       —           —
only hand position     35.71%       —           —
lips + hand position   —            70.12%      74.65%

TABLE II: Vowel recognition results using lips and hand position. Values are the correctness Corr defined in (10), and '—' means that the case does not exist. 'one-stream' denotes CS vowel recognition using a single feature stream; 'non-resyn' denotes recognition without the re-synchronization procedure, and 'resyn' with it.

VI-B2 Consonant recognition based on the fusion of lips and hand shape

The lips and hand shape features are merged and fed to a two-stream MSHMM-GMM with triphone context-dependent modeling. 18 consonants (i.e., [p, t, k, b, d, f, s, S, v, z, Z, m, n, l, g, r, j, w]) plus silence are the target labels. In Table III, we can see that using the re-synchronization procedure achieves a higher correctness (82.28%) than not using it (80.33%). For the single-stream case, using only the lips or only the hand shape yields a much lower correctness than the fused features.

                   one-stream   non-resyn   resyn
only lips          42.25%       —           —
only hand shape    56.99%       —           —
lips + hand shape  —            80.33%      82.28%

TABLE III: Consonant recognition results using lips and hand shape. Values are the correctness Corr defined in (10), and '—' means that the case does not exist. 'one-stream' denotes CS consonant recognition using a single feature stream; 'non-resyn' denotes recognition without the re-synchronization procedure, and 'resyn' with it.

VI-C Evaluation of the re-synchronization procedure applied to the automatic CS phoneme recognition

To further evaluate the proposed re-synchronization procedure, the full multi-modal CS phoneme recognition architecture is investigated (see Fig. 2). We compare its results with the state-of-the-art baseline in [17], which does not take into account the asynchrony issue in CS. The results are shown in Fig. 10.

We first focus on the case where the hand positions given by the ABMMs are used, as in [17]. Without any re-synchronization procedure, the phoneme recognition correctness is 71.0%. When the proposed re-synchronization procedure is incorporated, the correctness increases to 72.67%, a minor improvement of 1.67% (see columns 3 and 4 in Fig. 10). We note that the triphone context-dependent modeling of the MSHMM already helps to correct recognition errors caused by co-articulation or by the asynchrony of the multi-modalities [40]; using context-dependent modeling may therefore hide the effect of the re-synchronization procedure. To avoid this, we also examine the phoneme recognition correctness without context-dependent modeling. In that case, only 60.4% is obtained without any re-synchronization procedure, while the correctness increases to 64.38% with the proposed re-synchronization procedure (see columns 1 and 2 in Fig. 10). The improvement of about 4% is thus more evident than with context-dependent modeling (1.67%).

The above improvement remains modest. This may be because (1) only a weight of 0.2 is applied to the hand position stream, and (2) the hand positions extracted by the ABMMs contain some errors (see Section V-C). Regarding the first reason, as mentioned in Section III, the hand position feature is much more sensitive to the asynchrony problem in CS, and the small weight reduces the impact of the re-synchronization procedure. Regarding the second reason, these errors directly reduce the efficiency of the re-synchronization procedure and introduce interference into the CS recognition system, since the hand position target can be recognized accurately only if the correct hand position feature is used within a proper temporal boundary for a particular vowel. Note that in this work we only consider the interference caused by hand position errors, because the CNN-based features for lips and hand shape were shown to be reliable in [17].

In order to see the effects when the hand position is correct, we use the ground truth hand position instead of that given by the ABMMs. The CS phoneme recognition results in this case are shown in Fig. 10, from column 5 to column 8.

In this case, without the context-dependent modeling or the re-synchronization procedure, the correctness is 62.33%, which is close to the 60.4% obtained with the hand positions given by the ABMMs; this can be explained by the first reason above. However, when the re-synchronization procedure is used, a correctness of 70.1% is achieved, a significant improvement of about 7.7% (see columns 5 and 6 in Fig. 10). This shows the benefit of the proposed re-synchronization procedure.

When using the context-dependent modeling (see columns 7 and 8 in Fig. 10), the correctness is 72.04% without the re-synchronization procedure. When both are used, a correctness of 76.63% is obtained, a significant improvement of 4.6% over the state of the art [17] for automatic continuous CS phoneme recognition. Moreover, this result also outperforms the 74.4% reported in [15] for automatic isolated CS phoneme recognition.

We also compute the CS phoneme recognition correctness when using only the lips information. The result is low (around 30%). By using CS, the phoneme recognition performance thus increases by a large margin of around 46.63% compared with lip reading alone. This confirms that CS can significantly help deaf and hard-of-hearing people with speech perception and production.

Fig. 10: Results of the continuous CS phoneme recognition with and without the proposed re-synchronization procedure and the context-dependent modeling. The shorthand 'non-resyn' means CS recognition without the re-synchronization procedure, while 'resyn' means with it. The error bars are the stds of the results (less than 0.5%), and all the differences are statistically significant.

VII Conclusion

In this work, a novel re-synchronization procedure is proposed for CS multi-modal feature fusion and applied to a practical automatic continuous French CS recognition system (i.e., a CNN-MSHMM architecture). First, by exploring the relationship between the HPT and the time instants of phonemes in French continuous sentences, we obtain the optimal HPTs for all vowels and for all consonants, and develop new HPMs. Then, the re-synchronization procedure is obtained by delaying the hand position and hand shape streams by these two optimal HPTs, respectively, so that the lips and hand features are re-synchronized on average. By incorporating the proposed re-synchronization procedure into the CNN-MSHMM based CS recognition system, we obtain a new architecture that takes into account the asynchrony issue in CS recognition. The evaluation of automatic CS phoneme recognition with this architecture shows a significant improvement (about 4.6%) over the state-of-the-art architecture [17]. In future work, we aim to investigate the proposed fusion method for CS in other languages (e.g., American English) and to generalize it to the speaker-independent case.

Acknowledgement

The authors would like to thank the CS speakers for their time spent on the French CS data recording, and Thomas Hueber for his help in CNN-HMM. The authors would like to thank the referees for their valuable comments and suggestions. This work is supported by a PhD thesis grant of Université Grenoble Alpes in France, and in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN239031.

References

  • [1] World Health Organization, “Deafness and hearing loss,” https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, 2019.
  • [2] Barbara Ed Dodd and Ruth Ed Campbell, Hearing by eye: The psychology of lip-reading., Lawrence Erlbaum Associates, Inc, 1987.
  • [3] Gaye H Nicholls and Daniel Ling Mcgill, “Cued speech and the reception of spoken language,” Journal of Speech, Language, and Hearing Research, vol. 25, no. 2, pp. 262–269, 1982.
  • [4] Richard Orin Cornett, “Cued speech,” American annals of the deaf, vol. 112, no. 1, pp. 3–13, 1967.
  • [5] Carol J LaSasso, Kelly Lamar Crain, and Jacqueline Leybaert, Cued Speech and Cued Language Development for Deaf and Hard of Hearing Children, Plural Publishing, 2010.
  • [6] William C Stokoe Jr, “Sign language structure: An outline of the visual communication systems of the american deaf,” Journal of deaf studies and deaf education, vol. 10, no. 1, pp. 3–37, 2005.
  • [7] Scott K Liddell and Robert E Johnson, “American sign language: The phonological base,” Sign language studies, vol. 64, no. 1, pp. 195–277, 1989.
  • [8] Clayton Valli and Ceil Lucas, Linguistics of American sign language: an introduction, Gallaudet University Press, 2000.
  • [9] Sarah Elizabeth Reynolds, “An examination of cued speech as a tool for language, literacy, and bilingualism for children who are deaf or hard of hearing,” 2007.
  • [10] Vicente Peruffo Minotto, Claudio Rosito Jung, and Bowon Lee, “Multimodal multi-channel on-line speaker diarization using sensor fusion through svm,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1694–1705, 2015.
  • [11] Ou Wu, Haiqiang Zuo, Weiming Hu, and Bing Li, “Multimodal web aesthetics assessment based on structural svm and multitask fusion learning,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1062–1076, 2016.
  • [12] Mengfan Tang, Xiao Wu, Pranav Agrawal, Siripen Pongpaichet, and Ramesh Jain, “Integration of diverse data sources for spatial pm2. 5 data interpolation,” IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 408–417, 2017.
  • [13] Virginie Attina, Denis Beautemps, Marie-Agnès Cathiard, and Matthias Odisio, “A pilot study of temporal organization in cued speech production of french syllables: rules for a cued speech synthesizer,” Speech Communication, vol. 44, no. 1, pp. 197–214, 2004.
  • [14] Virginie Attina, Marie-Agnès Cathiard, and Denis Beautemps, “Temporal measures of hand and speech coordination during french cued speech production,” in International Gesture Workshop. Springer, 2005, pp. 13–24.
  • [15] Panikos Heracleous, Denis Beautemps, and Noureddine Aboutabit, “Cued speech automatic recognition in normal-hearing and deaf subjects,” Speech Communication, vol. 52, no. 6, pp. 504–512, 2010.
  • [16] Panikos Heracleous, Denis Beautemps, and Norihiro Hagita, “Continuous phoneme recognition in cued speech for french,” in Signal Processing Conference (EUSIPCO), 2012 Proceedings of the 20th European. IEEE, 2012, pp. 2090–2093.
  • [17] Li Liu, Thomas Hueber, Gang Feng, and Denis Beautemps, “Visual recognition of continuous cued speech using a tandem cnn-hmm approach,” in Interspeech, 2018, 2018, pp. 2643–2647.
  • [18] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [19] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio, Deep learning, vol. 1, MIT press Cambridge, 2016.
  • [20] Gerasimos Potamianos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, and Andrew W Senior, “Recent advances in the automatic recognition of audiovisual speech,” Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
  • [21] Thomas Burger, Alice Caplier, and Stéphane Mancini, “Cued speech hand gestures recognition tool,” in Signal Processing Conference in European. IEEE, 2005, pp. 1–4.
  • [22] Sébastien Stillittano, Vincent Girondel, and Alice Caplier, “Lip contour segmentation and tracking compliant with lip-reading application constraints,” Machine vision and applications, pp. 1–18, 2013.
  • [23] Li Liu, Gang Feng, and Denis Beautemps, “Extraction automatique de contour de lèvre à partir du modèle clnf,” in Actes des 31èmes Journées d’Etude de la Parole, 2016.
  • [24] Li Liu, Gang Feng, and Denis Beautemps, “Automatic tracking of inner lips based on clnf,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference, 2017, pp. 5130–5134.
  • [25] Li Liu, Gang Feng, and Denis Beautemps, “Inner lips parameter estimation based on adaptive ellipse model,” in 14th International Conference on Auditory-Visual Speech Processing (AVSP 2017), 2017.
  • [26] Daniel Svozil, Vladimir Kvasnicka, and Jiri Pospichal, “Introduction to multi-layer feed-forward neural networks,” Chemometrics and Intelligent Laboratory Systems, vol. 39, no. 1, pp. 43–62, 1997.
  • [27] Noureddine Aboutabit, Denis Beautemps, and Laurent Besacier, “Hand and lip desynchronization analysis in french cued speech: Automatic temporal segmentation of hand flow,” in Proc. IEEE-ICASSP, 2006, vol. 1, pp. I–I.
  • [28] Li Liu, Gang Feng, and Denis Beautemps, “Automatic temporal segmentation of hand movement for hand position recognition in french cued speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference, 2018, pp. 3061–3065.
  • [29] Noureddine Aboutabit, Reconnaissance de la Langue Française Parlée Complété (LPC): décodage phonétique des gestes main-lèvres., Ph.D. thesis, Institut National Polytechnique de Grenoble-INPG, 2007.
  • [30] Denis Beautemps, Laurent Girin, Noureddine Aboutabit, Gérard Bailly, Laurent Besacier, Gaspard Breton, Thomas Burger, Alice Caplier, Marie-Agnès Cathiard, Denis Chêne, et al., “Telma: Telephony for the hearing-impaired people. from models to user tests,” 2007, pp. 201–208.
  • [31] Daniel V Abreu, Thomas K Tamura, Donald G Keamy Jr, Roland D Eavey, et al., “Podcasting: contemporary patient education,” Ear, Nose & Throat Journal, vol. 87, no. 4, pp. 208, 2008.
  • [32] Jeff Naylor, Magix Movie Edit Pro 2014 Revealed, Dtvpro Publishing, 2014.
  • [33] Angela Tinwell, Mark Grimshaw, and Deborah Abdel Nabi, “The effect of onset asynchrony in audio-visual speech and the uncanny valley in virtual characters,” International Journal of Mechanisms and Robotic Systems, vol. 2, no. 2, pp. 97–110, 2015.
  • [34] Guillaume Gibert, Gérard Bailly, Denis Beautemps, Frédéric Elisei, and Rémi Brun, “Analysis and synthesis of the three-dimensional movements of the head, face, and hand of a speaker using cued speech,” The Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 1144–1153, 2005.
  • [35] Frédéric Béchet, “Lia phon: un systeme complet de phonétisation de textes,” Traitement automatique des langues, vol. 42, no. 1, pp. 47–67, 2001.
  • [36] Steve J Young and Sj Young, The HTK hidden Markov model toolkit: Design and philosophy, University of Cambridge, Department of Engineering, 1993.
  • [37] Jianbo Shi and Tomasi Carlo, “Good features to track,” in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94., 1994 IEEE Computer Society Conference on. IEEE, 1994, pp. 593–600.
  • [38] Chris Stauffer and W Eric L Grimson, “Adaptive background mixture models for real-time tracking,” in Proc. IEEE-CVPR, 1999, vol. 2, pp. 246–252.
  • [39] Steve J Young, Julian J Odell, and Philip C Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 307–312.
  • [40] Jean-Luc Schwartz, Pierre Escudier, and Pascal Teissier, “Multimodal speech: Two or three senses are better than one,” Language and Speech Processing, pp. 377–415, 2009.