LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

10/16/2018 · by Shuang Yang, et al.

Large-scale datasets have repeatedly proven their fundamental importance in several research fields, especially for early progress in emerging topics. In this paper, we focus on the problem of visual speech recognition, also known as lipreading, which has received increasing interest in recent years. We present a naturally-distributed large-scale benchmark for lip reading in the wild, named LRW-1000, which contains 1000 classes with about 745,187 samples from more than 2000 individual speakers. Each class corresponds to the syllables of a Mandarin word which is composed of one or several Chinese characters. To the best of our knowledge, it is the largest word-level lipreading dataset and also the only public large-scale Mandarin lipreading dataset. This dataset aims at covering a "natural" variability over different speech modes and imaging conditions to incorporate challenges encountered in practical applications. This benchmark shows a large variation over several aspects, including the number of samples in each class, the resolution of videos, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up. Besides a detailed description of the dataset and its collection pipeline, we evaluate popular lipreading methods and perform a thorough analysis of the results from several aspects. The results demonstrate the consistency and challenges of our dataset, which may open up some new promising directions for future work. The dataset and corresponding code will be available on the web for research use.


I Introduction

Visual speech recognition, also known as lipreading, is the task of recognizing the speech content in a video from visual information alone, in particular by analyzing the speaker's lip movement sequence. It has been demonstrated that incorporating visual information into audio-based speech recognition systems can bring clear performance improvements, especially in cases where multiple speakers are present or the acoustic signal is noisy [18, 10, 14]. Lipreading also plays an important role in several other scenarios, such as aids for hearing-impaired persons, analysis of silent movies, liveness verification in video authentication systems, and so on.

The common procedure of lipreading involves two steps: analyzing the motion information in the given image sequence and transforming that information into words or sentences. This procedure links lipreading to two closely related fields: audio-based speech recognition and action recognition. Both lipreading and audio-based speech recognition aim to transcribe time-series data into text, while the similarity between lipreading and action recognition lies in the fact that both tasks analyze motion information to reach a decision. However, there is currently a large performance gap between lipreading and these two closely related tasks. One main reason is that there were few large-scale lipreading datasets in the past, which was likely the major obstacle to progress in lipreading.

Fortunately, with the development of deep learning technologies, some researchers have begun to collect large-scale data for lipreading with the help of deep learning tools in the past two years. Existing public datasets can be divided into two classes: word-level and sentence-level. We focus on word-level lipreading in this paper. One outstanding benchmark is the LRW [5] dataset proposed in 2016, which has 500 classes and covers a large variation of speech conditions. In addition, all the videos in this dataset are of fixed size and fixed length, which provides much convenience to the community. However, all the words in this dataset have a similar length, and the best classification accuracy on it has reached as high as 83.0% in merely two years. Some other popular word-level lipreading datasets include OuluVS1 [23] and OuluVS2 [1], which were proposed in 2009 and 2015 respectively. There are 10 classes in both datasets, and the state-of-the-art performance has reached an accuracy of more than 90%. These exciting results mark a significant and praiseworthy improvement of lipreading methods, which is very encouraging. However, lipreading in natural or "in the wild" settings remains a challenging problem due to the large variations in real-world environments. Meanwhile, these appealing results also call for more challenging datasets to trigger new progress and to inspire novel ideas for lipreading.

Fig. 1: Example frames in our dataset, which show a large variation in speech conditions, including lighting, resolution, and the speaker's age, pose, gender, and make-up.

To this end, we collect a naturally-distributed large-scale dataset for lipreading in the wild. The contributions in this paper are summarized as follows.

Firstly, we present a challenging lipreading dataset which contains 1000 classes with about 745,187 samples from more than 2000 speakers in total. Each class corresponds to the syllables of a Mandarin word which is composed of one or several Chinese characters. The samples together contain a large number of Chinese character instances covering a wide range of Mandarin syllables. To the best of our knowledge, this is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lipreading dataset.

Secondly, different from existing lipreading datasets, our benchmark aims to provide naturally-distributed data to the community, highlighted by the following properties. (a) It contains large variations in the speech conditions, including lighting conditions, resolution of videos, and the speaker's pose, speech rate, age, gender, and make-up, as shown in Fig. 1. (b) Some words contain more samples than others, which is consistent with the fact that some words indeed occur more frequently than others in practice. (c) Samples of the same word are not limited to a pre-specified length range, allowing for various speech rates. These three properties make our dataset quite consistent with practical settings.

Thirdly, we provide a comprehensive comparison of currently popular lipreading methods and perform a detailed analysis of their performance in several settings to observe the effect of different factors on lipreading, including performance with respect to image scale, word length, and speaker's pose, as well as model capacity on naturally-distributed data. The results demonstrate the consistency and the challenges of our benchmark, which may offer new inspiration to the related research communities.

II Related Work

Datasets       | # of Classes | # of Speakers | Resolution                | Pose       | Envir. | Color/Gray | Best Perf. | Year
AVICAR [12]    | 10           | 100           | -                         | Controlled | In-car | Gray       | 37.9%      | 2004
AVLetters [15] | 26           | 10            | Fixed                     | Controlled | Lab    | Gray       | 43.5%      | 2002
OuluVS1 [23]   | 10           | 20            | Fixed                     | Controlled | Lab    | Color      | 91.4%      | 2009
OuluVS2 [1]    | 10           | 53            | Fixed (6 different sizes) | Controlled | Lab    | Color      | 93.2%      | 2015
LRW [5]        | 500          | 1000+         | Fixed                     | Natural    | TV     | Color      | 83.0%      | 2016
LRW-1000       | 1000         | 2000+         | Naturally varied          | Natural    | TV     | Color      | 38.19%     | 2018
TABLE I: Summary of Existing Well-known Word-level Lipreading Datasets

In this section, we provide an overview of the current word-level lipreading datasets, followed by a survey of state-of-the-art lipreading methods.

II-A Word-level Lipreading Datasets

Some well-known word-level lipreading datasets are summarized in Table I. All these datasets have contributed greatly to the progress of automatic lipreading. In this part, we will give a brief review of these well-known datasets shown in the table.

AVICAR [12] and AVLetters [15] were proposed in 2004 and 2002 respectively and were widely used in the early period. The words in these two datasets are digits and letters, spoken by 100 and 10 speakers respectively. These two datasets provided an initial impetus for early progress in automatic lipreading.

OuluVS1 [23], released in 2009, consists of 10 phrases spoken by 20 subjects. This dataset provides cropped mouth-region sequences, which brings much convenience to related researchers. However, the number of samples in each class is quite small, which is not enough to cover the many sources of variation found in practical applications.

OuluVS2 [1], released in 2015, extends the number of subjects in OuluVS1 to 53. The speakers are recorded from five fixed views: frontal, profile, 30°, 45°, and 60°. One major difference compared with AVLetters and OuluVS1 is that OuluVS2 contains several fixed viewpoints, which makes it more difficult than the above three datasets; it has therefore been widely used in previous lipreading studies. However, there are few variations beyond the view conditions, and the number of speakers is still limited.

LRW [5], an appealing large-scale lipreading dataset released in 2016, contains 500 classes with more than a thousand speakers. The videos are no longer posed recordings made in controlled lab environments, but are extracted from TV shows and thus cover large variations in speech conditions. It remains a challenging dataset for lipreading and has therefore been widely used by most existing deep learning based methods. However, one defining property of this dataset is that all words are ensured to have a roughly equal duration and each class is specified to contain roughly the same number of samples. This setting causes a gap between the dataset and practical applications, because neither word frequency nor speech rate is constant in the real world. We believe that a method which achieves good performance on data reflecting these two properties should also perform well in practical applications.

Although there are several English lipreading datasets such as those listed above, very few public Mandarin lipreading datasets are available so far. With the rapid development of related technologies, automatic lipreading in any language is likely to attract more and more researchers' attention over time. We therefore hope that LRW-1000 can fill part of this gap for automatic lipreading of Mandarin.

II-B Lipreading Methods

Automated lipreading has been studied as a computer vision task for decades. Most early methods focus on designing appropriate hand-engineered features to obtain good representations. Some well-known features include the Discrete Cosine Transform (DCT) [17], the active appearance model (AAM), the motion history image (MHI) [9], Local Binary Patterns (LBP) [23], and vertical optical flow [20], to name a few. With the rapid development of deep learning technologies, more and more work has begun to perform end-to-end recognition with the help of deep neural networks (DNNs). According to the type of front-end network, modern lipreading methods can be roughly divided into the following three classes.

(1) Fully 2D CNN based: Two-dimensional convolution has proved successful in extracting representative features from images. With this inspiration, some early lipreading works [13], [16], [8] try to obtain a discriminative representation of each frame individually with pre-trained 2D CNN models, such as VGGNet [3] and residual networks [11]. One representative work is the multi-tower structure proposed by Chung and Zisserman in [8], where each tower takes a single frame or a multi-channel image as input, with each channel corresponding to a single grayscale frame. The activations from all the towers are then concatenated to produce the final representation of the whole sequence. This multi-tower structure has proved effective, with appealing results on the challenging LRW dataset.

(2) Fully 3D CNN based: One direct reason for the wide use of 3D convolutional layers in lipreading is the success of 3D CNNs in action recognition [22]. One popular model whose front-end network is based entirely on 3D convolution is LipNet [2]. It contains three 3D convolutional layers which transform the raw input video into spatio-temporal features and feed them to the following gated recurrent units (GRUs) to generate the final transcription. Its effectiveness is demonstrated by its remarkable performance on the public GRID dataset, where it surpassed professional human lipreaders by a large margin.

(3) Mixture of 2D and 3D convolution: Regular 2D spatial convolutional layers have proved effective at extracting discriminative features in the spatial domain, while spatio-temporal convolutional layers are believed to better capture the temporal dynamics of a sequence. For this reason, some researchers have begun to combine the advantages of the two types to generate even stronger features. In [21], Stafylakis and Tzimiropoulos proposed to combine two spatio-temporal convolutional layers with a 2D residual network to produce the final representation of the sequence. It achieves state-of-the-art performance on LRW with an accuracy of 83.0%.
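To make this family of front-ends concrete, the following is a minimal PyTorch sketch of a mixed 3D+2D front-end: a spatio-temporal stem followed by a per-frame 2D ResNet-18. It is not the exact architecture of [21]; the kernel sizes, channel widths, and the use of a single 3D layer are illustrative assumptions.

```python
# Hedged sketch of a "3D + 2D" lipreading front-end: a spatio-temporal
# convolution followed by a per-frame 2D ResNet. Layer sizes are
# illustrative assumptions, not the exact configuration of [21].
import torch
import torch.nn as nn
from torchvision.models import resnet18


class Mixed3D2DFrontEnd(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        # Spatio-temporal stem: captures short-term lip motion.
        self.stem3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame 2D ResNet-18, adapted to take the 64-channel stem output.
        resnet = resnet18()  # no pre-trained weights
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, feature_dim)
        self.resnet2d = resnet

    def forward(self, x):
        # x: (batch, 1, time, height, width), grayscale lip sequences.
        x = self.stem3d(x)                             # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into batch
        x = self.resnet2d(x)                           # (B*T, feature_dim)
        return x.view(b, t, -1)                        # per-frame features


if __name__ == "__main__":
    frames = torch.randn(2, 1, 29, 112, 112)           # 29-frame grayscale clips
    feats = Mixed3D2DFrontEnd()(frames)
    print(feats.shape)                                 # torch.Size([2, 29, 256])
```

The output is a sequence of per-frame features, which can then be passed to any recurrent back-end for classification.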

In this paper, we evaluate each of the above families of approaches on our proposed benchmark and present a detailed analysis of the results, which may provide some inspiration for future research.

III Data Construction

In this section, we describe the pipeline for collecting and processing the LRW-1000 benchmark, as shown in Fig. 2. We first present the choice of television programs from which the dataset was created, and then provide details of the data preprocessing procedures, which interleave automatic processing with manual annotation and extra filtering to make the data consistent for research.

III-A Program Selection and Data Collection

In our benchmark, all collected programs are either broadcast news or conversational programs with a focus on news and current events. To encourage diversity of speakers and speech content, we select programs from both regional and national TV stations, featuring a wide range of male and female TV presenters, guests, reporters, and interviewees who speak Mandarin or dialectal Chinese. The final program list spans multiple broadcast sources and yields a large amount of raw video over the two-month data collection period. This large duration endows the data with nearly full coverage of commonly used words and the same natural variability as in practical applications.

The broadcast collection described above is retrieved daily through an IPTV streaming service in China, hosted by Northeastern University. It produces H.264-encoded recordings whose video and audio bitrates vary by channel, and whose resolution is higher for high-definition channels than for standard-definition channels. This makes our data cover a wide range of scales. Since the source videos were recorded through cable TV and re-encoded in real time, they may contain temporal discontinuities which appear as frozen frames or artifacts. We clip each video up to the first occurrence of such abnormalities and feed the remaining videos to the following procedures.

Fig. 2: Pipeline to generate samples in our dataset.

III-B Shot Boundary Detection

We first employ a shot boundary detector that compares the color histograms of adjacent frames. Within each detected shot, we choose three evenly spaced frames and perform face detection with the CNN-based multi-view face detector in the SeetaFaceEngine2 toolkit [19]. If none of them contains a sufficiently large face, we dismiss the shot as not containing any potential speakers. It is worth noting that although we deliberately set a low minimum size for candidate faces to closely mimic in-the-wild settings, statistics on the final data still show that there are few samples with very low lip resolution, as shown in Fig. 3.
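As an illustration only, a histogram-based shot boundary detector of this kind can be sketched with OpenCV as follows; the histogram binning and the Bhattacharyya-distance threshold are assumptions, not the exact values used in our pipeline.

```python
# Minimal sketch of a color-histogram shot boundary detector
# (assumed parameters; not the exact thresholds of our pipeline).
import cv2


def frame_histogram(frame, bins=(8, 8, 8)):
    """Normalized 3D HSV color histogram of a frame."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()


def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where the Bhattacharyya distance between
    histograms of adjacent frames exceeds `threshold`."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = frame_histogram(frame)
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                boundaries.append(idx)  # new shot starts at this frame
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```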

III-C Annotations, Face Detection, and Face Tracking

Most Chinese TV programs have no verbatim subtitles, so we create rough transcripts of the videos with the commercial iFLYREC speech recognition service, time-aligned at the sentence level. This process automatically detects voiced segments in the audio track and diarizes them by speaker turn. We then isolate the sentences that fall within the shots retained in the previous stage, and manually annotate each video clip with the active speaker's position, gender, exact endpoints of the speech, and speech content. Finally, to further refine the manually checked text annotations, a more robust ASR tool by iFLYTEK is used to produce faithful transcripts of the utterances, which are compared again with the manually checked transcripts. After several rounds of interleaved manual and automatic checking, the final annotation is believed to be accurate enough for research use.

To associate each utterance with the corresponding speaker's face, we run the landmark detector in SeetaFaceEngine2 on the first frame and compare the coordinates of each detected face with the manual annotation. Then, a kernelized correlation filter (KCF) tracker is applied to the selected face over the given duration to obtain the whole speaking sequence. During tracking, we automatically validate the tracking quality at fixed intervals with the CNN-based face detector in SeetaFaceEngine2.
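A rough sketch of this track-then-verify loop is shown below, using OpenCV's KCF tracker (which requires the opencv-contrib package); `detect_faces` is a hypothetical stand-in for the SeetaFaceEngine2 detector, and the re-check interval and IoU threshold are assumptions.

```python
# Hedged sketch of KCF face tracking with periodic re-detection.
# `detect_faces` is a hypothetical callable returning (x, y, w, h) boxes.
import cv2


def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def track_face(frames, init_box, detect_faces, check_every=10, iou_thresh=0.5):
    """Track `init_box` (x, y, w, h) through `frames`, re-validating the
    track against the face detector every `check_every` frames."""
    tracker = cv2.TrackerKCF_create()
    tracker.init(frames[0], tuple(int(v) for v in init_box))
    boxes = [tuple(init_box)]
    for i, frame in enumerate(frames[1:], start=1):
        ok, box = tracker.update(frame)
        if not ok:
            break  # tracker lost the target
        if i % check_every == 0:
            detections = detect_faces(frame)
            if not any(iou(box, d) > iou_thresh for d in detections):
                break  # track drifted away from any detected face
        boxes.append(tuple(box))
    return boxes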

III-D Audio-to-Video Synchronization

After the above process, we check for synchronization issues and find that, similar to [7, 4], the audio and video streams in the collected videos can be out of sync, with the largest offset being less than one second. To tackle this problem, we introduce the SyncNet model of [6], which extracts visual features from frames of cropped faces using a 3D VGG-M network and computes their distance to MFCC-based audio features. The model searches for the offset, within a fixed window of frames, that minimizes the distance between the two feature streams so that the two modalities are synchronized. We run the model over all the extracted utterances from each video and average the distances across these samples. If the determined offset of any clip falls outside the search window or the samples do not reach a consensus, we shift the video stream manually to obtain the final synchronization.
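The offset-selection step can be sketched as follows, assuming per-frame visual and audio features have already been extracted by a SyncNet-style model; the search window and the consensus tolerance are assumptions.

```python
# Hedged sketch of audio-video offset selection from precomputed
# SyncNet-style feature streams.
import numpy as np


def best_av_offset(video_feats, audio_feats, max_offset=15):
    """video_feats, audio_feats: (T, D) arrays aligned frame-by-frame.
    Returns (offset, distance); the offset minimizes the mean feature
    distance between the two shifted streams."""
    distances = {}
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            v, a = video_feats[off:], audio_feats[:len(audio_feats) - off]
        else:
            v, a = video_feats[:off], audio_feats[-off:]
        n = min(len(v), len(a))
        if n == 0:
            continue
        distances[off] = float(np.linalg.norm(v[:n] - a[:n], axis=1).mean())
    best = min(distances, key=distances.get)
    return best, distances[best]


def consensus_offset(per_clip_offsets, tolerance=1):
    """All clips from one video should agree on the offset; otherwise
    return None to flag the video for manual shifting."""
    offsets = np.asarray(per_clip_offsets)
    if np.ptp(offsets) <= tolerance:
        return int(np.round(offsets.mean()))
    return None
```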

III-E Facial Landmark Detection and Mouth Region Extraction

At this stage, we have obtained face tracks of individuals speaking, together with synchronized audio and corresponding transcripts. The next step is to extract the mouth regions. We first detect facial landmarks with the SeetaFaceEngine2 toolkit. Using these landmarks, each detected face is rotated so that the eyes are level. Then a square mouth-centered RoI is extracted for each frame. To account for yaw variations, the size of the RoI is set to the horizontal distance between the two mouth corners extended by an empirically determined factor, or twice the distance between the nose tip and the center of the mouth, whichever is larger. However, this crop sometimes extends beyond the desired region for extremely small faces, so we cap the size of the region.

In other words, the side length of the RoI bounding box is determined by

$\mathrm{size} = \max\big(\alpha\,(x_r - x_l),\; 2\,\lVert p_{\text{nose}} - p_{\text{mouth}} \rVert\big)$,

where $x_l$ and $x_r$ are the horizontal coordinates of the left and right mouth corners, $p_{\text{nose}}$ is the nose tip, $p_{\text{mouth}}$ is the mouth center, and $\alpha$ is the empirical extension factor. Finally, to smooth the resulting boxes, we apply a first-order Savitzky-Golay filter to the estimated face rotations, the RoI center coordinates, and the RoI sizes.
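A minimal sketch of this RoI-size rule and the temporal smoothing is given below; `alpha` (the empirical extension factor), the optional size cap, and the smoothing window length are assumptions.

```python
# Hedged sketch of the mouth RoI size rule and Savitzky-Golay smoothing.
import numpy as np
from scipy.signal import savgol_filter


def roi_size(mouth_left, mouth_right, nose_tip, mouth_center,
             alpha=1.2, max_size=None):
    """Square RoI side length: the mouth width scaled by `alpha`, or twice
    the nose-to-mouth distance, whichever is larger (optionally capped)."""
    mouth_width = abs(mouth_right[0] - mouth_left[0])
    nose_to_mouth = np.linalg.norm(np.asarray(nose_tip) - np.asarray(mouth_center))
    size = max(alpha * mouth_width, 2.0 * nose_to_mouth)
    return min(size, max_size) if max_size is not None else size


def smooth_track(values, window=9, order=1):
    """First-order Savitzky-Golay smoothing of a per-frame quantity
    (rotation angle, RoI center coordinate, or RoI size)."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:  # window must not exceed the track length and be odd
        window = len(values) if len(values) % 2 == 1 else len(values) - 1
    if window < order + 2:
        return values          # track too short to smooth
    return savgol_filter(values, window_length=window, polyorder=order)
```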

III-F Validating the Extracted RoIs

On some extremely challenging videos where the yaw and pitch angles are large, the landmark predictor fails and the extracted RoIs are inaccurate or wrong. We train a binary CNN classifier to remove these non-lip images from the dataset. We begin training by using the initial unfiltered crops as positive samples and generate negative samples by randomly shifting the crop region within the original frame. After convergence, we filter the dataset with the trained model and fine-tune it on the resulting subset. The trained model has a high recall (e.g., it easily picks up glasses at the top corner) but sometimes fails on profile views and low-resolution images, which are scarcer in the dataset, so we ask a human annotator to revise the inference results and remove false alarms.
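For illustration, negative (non-lip) crops for such a classifier can be generated roughly as follows; the shift range and the minimum-displacement criterion are assumptions.

```python
# Hedged sketch of negative-sample generation by shifting the annotated RoI.
import random


def shifted_negative(frame_w, frame_h, roi, min_shift=0.5, tries=20):
    """roi: (x, y, size) square crop. Returns a same-sized crop whose
    position is displaced by at least `min_shift` * size, or None."""
    x, y, size = roi
    for _ in range(tries):
        dx = random.uniform(-1.5, 1.5) * size
        dy = random.uniform(-1.5, 1.5) * size
        if max(abs(dx), abs(dy)) < min_shift * size:
            continue  # too close to the positive crop to be a useful negative
        nx, ny = x + dx, y + dy
        if 0 <= nx and 0 <= ny and nx + size <= frame_w and ny + size <= frame_h:
            return (int(nx), int(ny), int(size))
    return None  # no valid shifted crop found inside the frame
```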

IV Dataset Statistics

LRW-1000 is challenging due to its large variations in scale, resolution, background clutter, and speaker’s attributes including pose, age, gender, and make-up. These factors are significant for building a robust practical lipreading system. To quantify these properties, we perform a comprehensive analysis based on the statistics of several aspects.

IV-A Statistics about the Source Videos

We select television programs covering news and current affairs, where each raw video has a variable duration ranging from minutes to hours. Because different programs usually feature different broadcasters, we assign all the videos of a single program to only one of the train, test, and validation sets, so that there are few or no overlapping speakers among the three sets. The split follows two principles: (a) there are few or no overlapping speakers across the three sets; (b) the total durations of the three sets follow a fixed ratio, so that the numbers of samples in the three sets follow a similar ratio.

Considering the above two points, we finally assign the videos of each program in full to the training, test, or validation set according to these principles.

IV-B Statistics about Word Samples

In this subsection, we present the statistics about the word samples in this benchmark.

The final extracted samples in our benchmark comprise more than 700,000 video clips in total, selected from the raw videos described above. On average, each class has about 700 samples, which is adequate for learning deep models. The sample lengths vary considerably, with an average of about 0.3 seconds per sample, which is reasonable because some words are indeed short in everyday speech. On the other hand, the abundance of such short instances in real-world settings also suggests that they should not be overlooked in our research.

IV-C Statistics about the Lip Region Resolution

Considering the diversity of input video scales in practical applications, we do not discard sequences with small or large lip regions. Instead, we collect the words according to their natural distribution and find that only a few samples have very small or very large lip regions; most fall at moderate resolutions. The statistics of the lip-region resolution are shown in Fig. 3. The two peaks in the figure arise because the raw videos come in two types, standard definition (SD) and high definition (HD). This broad coverage of scales brings our benchmark much closer to practical applications.

Fig. 3: Scale distribution of the data.

IV-D Statistics about Speakers

There are more than 2000 speakers in the videos used to construct our benchmark, mostly interviewers, broadcasters, program guests, and so on. Their large number and diversity equip the data with broad coverage of age, pose, gender, accent, and personal speaking habits, which makes the data challenging for existing lipreading methods. We evaluate state-of-the-art word-level lipreading models on our benchmark, and the results are meaningful for designing practical lipreading models. Among the various speaker characteristics, we report statistics on pose because it is believed to be a more critical factor for lipreading than the others. We present the distributions of the pitch, yaw, and roll angles in Fig. 4. Although we do not perform any particular filtering of the speakers, the data are still concentrated around frontal views.

(a) Pitch-angle Distribution
(b) Yaw-angle Distribution
(c) Roll-angle Distribution
Fig. 4: Pose distribution of the data in our benchmark.

V Experiments

In this section, we present the evaluation results of the popular lipreading methods and give a detailed analysis to show the characteristics and challenges of the proposed benchmark.

V-A Baseline Methods

We cast the word-level lipreading task on our benchmark as a multi-class recognition problem and evaluate three popular methods on this dataset. Specifically, we evaluate three types of models with different types of front-end network: a fully 2D CNN based front-end, a fully 3D CNN based front-end and a front-end mixing the 2D and 3D convolutional layers. Based on these three types of models, we hope to provide a relatively complete analysis and comparison of the currently popular methods.

The first network structure in our experiments is the LSTM-5 network based on the multi-tower structure proposed in [8], which is composed entirely of 2D convolutional layers. This structure has achieved appealing performance on the public word-level dataset LRW [5]. The second network is based on LipNet [2], which contains only three spatio-temporal convolutional layers as the front-end. The third network is the model proposed in [21], which contains two 3D convolutional layers cascaded with a residual network as the front-end. In our experiments, the original LipNet consistently failed to converge, likely because the data in this benchmark are too complex to be learned with only three spatio-temporal layers. Therefore, we transform DenseNet into a 3D counterpart and use it as the fully 3D convolutional front-end. We name this model D3D (DenseNet in 3D), whose structure is shown in Fig. 5. The three models are abbreviated as "LSTM-5", "3D+2D", and "D3D" respectively in the experiments.

To perform a fair comparison, all three models are combined with a back-end network of the same structure, which contains a two-layer bidirectional RNN to perform the final recognition. The recurrent units used in our experiments are bidirectional Gated Recurrent Units (GRUs). In the remainder of this section, we compare these three models side by side and report some interesting observations.
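For reference, a minimal sketch of such a shared back-end is given below, assuming 256-dimensional per-frame features from the front-end; the hidden size and the mean pooling over time are assumptions rather than the exact configuration used in our experiments.

```python
# Hedged sketch of the shared back-end: a two-layer bidirectional GRU over
# per-frame features, followed by a linear classifier over 1000 word classes.
import torch
import torch.nn as nn


class GRUBackEnd(nn.Module):
    def __init__(self, feature_dim=256, hidden_dim=256, num_classes=1000):
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, feats):
        # feats: (batch, time, feature_dim) from any of the front-ends.
        out, _ = self.gru(feats)        # (B, T, 2*hidden_dim)
        pooled = out.mean(dim=1)        # temporal pooling (an assumed choice)
        return self.classifier(pooled)  # (B, num_classes) word logits


if __name__ == "__main__":
    logits = GRUBackEnd()(torch.randn(2, 29, 256))
    print(logits.shape)                 # torch.Size([2, 1000])
```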

Fig. 5: The proposed D3D network (DenseNet in 3D version).

V-B Experimental Settings

1) Data Preprocessing:

In our experiments, all images are converted to grayscale and normalized with respect to the overall mean and variance. When fed into the models, the frames of each sequence are cropped at the same random position for training and at the center for validation and test. All images are first resized to a fixed size and then cropped to a slightly smaller fixed size. As an effective data augmentation step, we also randomly flip all frames of the same sequence horizontally. To accelerate training, we divide the training process into two stages. In the first stage, we choose shorter sequences whose length is below a threshold, which allows a larger batch size. We then add the remaining sequences to the training set once the models show a tendency to converge.
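A minimal sketch of this sequence-level preprocessing is given below; the resize/crop sizes and the normalization statistics are placeholders rather than the exact values used in our experiments.

```python
# Hedged sketch of sequence-level preprocessing: grayscale conversion,
# normalization, and one shared crop position / flip decision per sequence.
import random
import numpy as np
import cv2


def preprocess_sequence(frames, train=True, resize=122, crop=112,
                        mean=0.4, std=0.1):
    """frames: list of BGR images. Returns a (T, crop, crop) float32 array.
    `mean` and `std` are placeholders for the dataset statistics."""
    gray = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), (resize, resize))
            for f in frames]
    if train:
        x = random.randint(0, resize - crop)
        y = random.randint(0, resize - crop)
        flip = random.random() < 0.5
    else:
        x = y = (resize - crop) // 2       # center crop for val/test
        flip = False
    out = []
    for g in gray:
        patch = g[y:y + crop, x:x + crop]  # same crop position for all frames
        if flip:
            patch = patch[:, ::-1]         # same flip decision for all frames
        out.append(patch.astype(np.float32) / 255.0)
    seq = np.stack(out)
    return (seq - mean) / std              # normalize with dataset statistics
```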

2) Parameter Settings:

Our implementation is based on PyTorch, and the models are trained on servers with four NVIDIA Titan X GPUs with 12 GB memory each. We use the Adam optimizer, and all the networks are pre-trained on LRW. During training, we apply dropout to the last layer of each model to prevent it from being trapped in local optima associated with the LRW dataset.

3) Evaluation Protocols:

We provide two evaluation metrics in our experiments. Since this is a classification task, the recognition accuracy over all 1000 classes is naturally the base metric. Meanwhile, motivated by the large diversity of the data in many aspects, such as the number of samples in each class, we also report the Kappa coefficient as a second evaluation metric.
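For clarity, the two metrics can be computed as in the following sketch; the Kappa coefficient discounts the agreement expected by chance under the (imbalanced) class distribution.

```python
# Sketch of the two evaluation metrics: top-1 accuracy and Cohen's kappa.
import numpy as np


def accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def kappa(y_true, y_pred, num_classes=1000):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = (y_true == y_pred).mean()                  # observed agreement
    true_freq = np.bincount(y_true, minlength=num_classes) / len(y_true)
    pred_freq = np.bincount(y_pred, minlength=num_classes) / len(y_pred)
    p_e = float(np.dot(true_freq, pred_freq))        # chance agreement
    return float((p_o - p_e) / (1.0 - p_e)) if p_e < 1.0 else 1.0
```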

V-C Recognition Results

To evaluate the effects of different factors on lipreading, we split the data into different difficulty levels according to input scale, speaker's pose, and sample length, as shown in Table II. In the following, we present a thorough comparison of the models at all three levels to obtain a complete and comprehensive analysis of the results.

Target                    | Easy | Medium | Hard
Input Scales (pixels)     | 150  | 100    | 50
Pose (yaw angle, degrees) | 20   | 40     | 60
Sample Length (frames)    | 30   | 15     | 5
TABLE II: Partition of Different Difficulty Levels on LRW-1000

(1) General Performance: We show the results on LRW and LRW-1000 in Table III and Table IV respectively. The three models exhibit a similar trend on both LRW and LRW-1000. The method combining 3D and 2D convolution performs best on both datasets, while the LSTM-5 method, which relies only on 2D convolutional layers, performs worse than the other two models. This is reasonable because 3D convolution has the advantage of capturing short-term motion information, which has been shown to be important for lipreading. However, the network with a fully 3D front-end does not surpass the model combining 2D and 3D convolutional layers. This result indicates the necessity of 2D convolutional layers for separately extracting fine-grained features in the spatial domain, which is quite useful for discriminating words with similar lip movements. In addition, the performance gap between the three models on LRW-1000 is not very wide, with top-1 accuracy ranging from 25.76% to 38.19% over the 1000 classes, which confirms the challenges and the consistency of our data.

Method | Accuracy
LSTM-5 | 66.0%
D3D    | 78.0%
3D+2D  | 83.0%
TABLE III: Recognition Results on LRW

Method | Top-1  | Top-5  | Top-10 | Kappa (Top-1)
LSTM-5 | 25.76% | 48.74% | 59.73% | 0.24
D3D    | 34.76% | 59.80% | 69.81% | 0.33
3D+2D  | 38.19% | 63.50% | 73.30% | 0.37
TABLE IV: Recognition Results on LRW-1000

(2) Performance vs. word length: There is a small number of samples with very short duration in our benchmark, which can be used to roughly evaluate the performance of lipreading models in such extreme cases. Length is measured by the number of frames in our experiments. As shown in Table V, all the models perform similarly when the word has a relatively short duration. As the word length increases, the performance of all three models becomes better and more stable, likely because the context contained in a sample grows with word length. Another possible reason is that there are more samples with longer durations than with shorter durations.

Fig. 6: Recognition Accuracy w.r.t the Word Length
Methods | Easy   | Medium | Hard   | All
LSTM-5  | 25.76% | 25.27% | 24.63% | 25.76%
D3D     | 34.75% | 34.36% | 31.01% | 34.76%
3D+2D   | 38.19% | 37.34% | 30.44% | 38.75%
TABLE V: Performance w.r.t. the Word Length on LRW-1000

(3) Performance vs. input scales: Most existing datasets are composed of videos with a fixed resolution, which we think is not sufficient for training practical lipreading models. In our dataset, we do not limit the maximum face size but rather set the minimum size to a relatively small value to achieve a more complete coverage of different frame scales. This is based on the expectation that models learned from such naturally-distributed data should be more robust in practical applications.

We evaluate the models on the three levels divided by resolution as defined in Table II: data whose lip region is smaller than 50 pixels falls in the hard level, while data smaller than 100 and 150 pixels fall in the medium and easy levels respectively. The performance of the models does tend to increase as we move from the hard level to the medium level and from the medium level to the easy level. As shown in Table VI and Fig. 7, a larger input image does indeed help improve lipreading performance, but the performance saturates once the input scale exceeds a certain value. On the other hand, the performance gap between the three settings is not wide, and the accuracy of the best model is still above 30% over the 1000 classes even at the hard level, where all the test sequences have a small lip resolution. This result again demonstrates the consistency of our data, which covers a large variation of input scales.

Fig. 7: Recognition Accuracy w.r.t the Resolution Scale.
Methods | Easy   | Medium | Hard   | All
LSTM-5  | 26.41% | 24.38% | 21.02% | 25.76%
D3D     | 35.31% | 32.98% | 27.75% | 34.76%
3D+2D   | 39.08% | 36.07% | 31.18% | 38.75%
TABLE VI: Performance w.r.t. Input Scales on LRW-1000

(4) Performance vs. speaker's pose: In this section, we evaluate the models under different poses, measured by the yaw angle. As shown in Fig. 8 and Table VII, the performance of all three models drops greatly as the yaw angle increases. This poses a serious challenge to most current lipreading models in real-world scenarios. When speakers are viewed from a large angle, much of the lip region is occluded, which makes it hard to learn patterns from the data. The significant drop in performance as the camera moves from frontal to profile views points out a challenging direction worthy of deeper study.

Fig. 8: Recognition Accuracy w.r.t the Pose Angle.
Methods | Easy   | Medium | Hard   | All
LSTM-5  | 17.03% | 14.51% | 11.6%  | 25.76%
D3D     | 23.31% | 19.95% | 15.78% | 34.76%
3D+2D   | 24.89% | 20.76% | 15.9%  | 38.75%
TABLE VII: Performance w.r.t. the Pose Angle on LRW-1000

VI Conclusions

In this paper, we have proposed a naturally-distributed large-scale word-level benchmark, named LRW-1000, for lip reading in the wild. We have evaluated representative lipreading methods on our dataset to compare the effects of different factors on lipreading. With this new dataset, we wish to present the community with some of the challenges of the lipreading task, including scale, pose, and word-duration variations. These factors are all challenging for current lipreading models and are also ubiquitous in many real-world applications. We look forward to exciting research inspired by this benchmark and the corresponding analysis provided in this paper.

References

  • [1] I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen. Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pages 1–5, 2015.
  • [2] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Lipnet: Sentence-level lipreading. arXiv preprint, abs/1611.01599:1–12, 2016.
  • [3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
  • [4] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3444–3453, 2017.
  • [5] J. S. Chung and A. Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, pages 87–103, 2016.
  • [6] J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV, 2016.
  • [7] Joon Son Chung and Andrew Zisserman. Lip reading in profile. In British Machine Vision Conference, pages 1–11, 2017.
  • [8] Joon Son Chung and Andrew Zisserman. Learning to lip read words by watching videos. Computer Vision and Image Understanding, pages 1–10, 2018.
  • [9] P. Duchnowski, M. Hunke, D. Busching, U. Meier, and A. Waibel. Toward movement-invariant automatic lip-reading and speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, pages 109–112, 1995.
  • [10] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141–151, 2000.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [12] Bowon Lee, Mark Hasegawa-Johnson, Camille Goudeseune, Suketu Kamdar, Sarah Borys, Ming Liu, and Thomas Huang. Avicar: Audio-visual speech corpus in a car environment. In International Conference on Spoken Language Processing, pages 2489–2492, 1 2004.
  • [13] Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki. Lip reading using a dynamic feature of lip images and convolutional neural networks. In International Conference on Computer and Information Science, pages 1–6, 2016.
  • [14] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002.
  • [15] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002.
  • [16] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata. Lipreading using convolutional neural network. In Fifteenth Annual Conference of the International Speech Communication Association, pages 1149–1153, 2014.
  • [17] G. Potamianos, H. P. Graf, and E. Cosatto. An image transform approach for HMM based automatic lipreading. In International Conference on Image Processing, pages 173–177, 1998.
  • [18] S. Morishima, S. Ogata, K. Murai, and S. Nakamura. Audio-visual speech translation with automatic lip synchronization and face tracking based on 3D head model. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2117–2120, 2002.
  • [19] SeetaFace. Seetafaceengine2. https://github.com/seetaface.
  • [20] A. A. Shaikh, D. K. Kumar, W. C. Yau, M. Z. C. Azemin, and J. Gubbi. Lip reading using optical flow and support vector machines. In International Congress on Image and Signal Processing, pages 327–330, 2010.
  • [21] Themos Stafylakis and Georgios Tzimiropoulos. Combining residual networks with lstms for lipreading. arXiv preprint arXiv:1703.04105, pages 3652–3656, 2017.
  • [22] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
  • [23] G. Zhao, M. Barnard, and M. Pietikainen. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7):1254–1265, 2009.