End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance

11/18/2016 ∙ by Yan Xu, et al. ∙ Microsoft

In this paper, we propose an innovative end-to-end subtitle detection and recognition system for videos in East Asian languages. Our end-to-end system consists of multiple stages. Subtitles are first detected by a novel image operator based on the sequence information of consecutive video frames. Then, an ensemble of Convolutional Neural Networks (CNNs) trained on synthetic data is adopted for detecting and recognizing East Asian characters. Finally, a dynamic programming approach leveraging language models is applied to assemble the recognition results of entire text lines. The proposed system achieves average end-to-end accuracies of 98.2% on videos in Simplified Chinese and 98.3% on 40 videos in Traditional Chinese, significantly outperforming existing methods. The near-perfect accuracy of our system dramatically narrows the gap between human cognitive ability and state-of-the-art algorithms used for such a task.




1 Introduction

Detecting and recognizing video subtitle texts in East Asian languages (e.g. Simplified Chinese, Traditional Chinese, Japanese and Korean) is a challenging task with many promising applications like automatic video retrieval and summarization. Different from traditional printed document OCR, recognizing subtitle texts embedded in videos is complicated by cluttered backgrounds, diversified fonts, loss of resolution and low contrast between texts and backgrounds [1].

Given that video subtitles are almost always horizontal, subtitle detection can be partitioned into two steps: subtitle top/bottom boundary (STBB) detection and subtitle left/right boundary (SLRB) detection. These four detected boundaries enclose a bounding box that is likely to contain subtitle texts. Then the texts inside the bounding box are ready to be recognized.

Despite the similarity between video subtitle detection and scene text detection (i.e. detecting texts embedded in natural static images [1]), the intrinsic sequence information of videos makes it necessary to address these two tasks separately [2]. As illustrated in Fig.1, for most videos with single-line subtitles in East Asian languages, texts at the subtitle region exhibit homogeneous properties throughout the video, including consistent STBB position, color and single character width (SCW). Meanwhile, the non-subtitle region varies unpredictably from frame to frame. With the assistance of this valuable sequence information, we put forward a suitable image operator that can facilitate the detection of STBB and SCW. We call this image operator the Character Width Transform (CWT), as it exploits one of the most distinctive features of East Asian characters: consistent SCW.

Fig. 1: Illustration of the consistent STBB position throughout the video. The red box denotes the subtitle region, while the green box denotes the non-subtitle region.

Considering the complexity of backgrounds and the diversity of subtitle texts, adopting a high-capacity classifier for both text detection and recognition is imperative. CNNs have most recently proven their mettle handling image text detection and recognition [3, 4]. By virtue of their special bio-inspired structures (i.e. local receptive fields, weight sharing and sub-sampling), CNNs are extremely robust to noise, deformation and geometric transformations [5] and thus are capable of recognizing characters with diverse fonts and distinguishing texts from cluttered backgrounds. Besides, the architecture of CNNs enables efficient feature sharing across different tasks: features extracted from hidden layers of a CNN character classifier can also be used for text detection [4]. Additionally, the fixed input size of typical CNNs makes them especially suitable for recognizing East Asian characters whose SCW is consistent.

Because video subtitles are produced by a straightforward generation pipeline, it is technically feasible to obtain training data by simulating this pipeline. To be more specific, when equipped with a comprehensive dictionary, several fonts and numerous random backgrounds, machines can produce huge volumes of synthetic data covering thousands of characters in diverse fonts without strenuous manual labeling. Because abundant synthetic training data satisfies the “data-hungry” nature of CNNs, models trained merely on synthetic data can achieve competitive performance on real-world datasets.

Another observation is that the recognition performance degrades as the number of character categories grows (as in the case of East Asian languages). In a similar circumstance, Jaderberg et al. [6] attempt to alleviate this problem with a sophisticated incremental learning method. Here we propose a more straightforward solution: instead of using a single CNN, we independently train multiple (ten in this paper) CNN models that together form a CNN ensemble. These models are complementary to each other, as the training data is shuffled independently for training each model.

In this paper, by seamlessly integrating the above-mentioned cornerstones, we propose an end-to-end subtitle text detection and recognition system specifically customized to videos with a large concentration of subtitles in East Asian languages. Firstly, STBB and SCW are detected based on a novel image operator with the sequence information of videos. SCW being determined at an early stage can provide instructive information to improve the performance of the remaining modules in the system. Afterwards, SLRB is detected by a SVM text/non-text classifier (it takes CNN features as input) and a horizontal sliding window (its width is set to SCW). According to the detected top, bottom, left and right boundaries, the video subtitle is successfully detected. Finally, single characters are recognized by the CNN ensemble and the text line recognition result is determined by a dynamic programming algorithm leveraging a 3-gram language model. We show that the CNN ensemble produces a recognition accuracy of 99.4% on a large real-world dataset including around 177,000 characters in 20,000 frames. This dataset with ground truth annotations and our CNN models will be made publicly available.

Our contribution can be summarized as follows:

  • We propose an end-to-end subtitle detection and recognition system for East Asian languages. By achieving 98.2% and 98.3% end-to-end recognition accuracies for Simplified Chinese and Traditional Chinese respectively, this system remarkably narrows the gap to human-level reading performance (which is 99.6% according to the experiment in Section 4.1).

  • We define a novel image operator whose outputs enable the effective detection of STBB and SCW. The sequence information is integrated throughout the video to increase the reliability of the proposed image operator. This module achieves a competitive result on a dataset including 1,097 videos.

  • We leverage a CNN ensemble to perform the classification of East Asian characters across huge dictionaries. The ensemble reduces the recognition error rate by approximately 75% in comparison with a single CNN. CNNs in our system serve both as text detectors and character recognizers owing to efficient feature sharing. The visualization of CNNs proves that different CNN models can capture distinctive features of characters.

The remainder of this paper is organized as follows. Section 2 reviews related works. Section 3 describes the synthetic data generation scheme, the CNN ensemble and the end-to-end system. In Section 4, the proposed system and each module in it are evaluated on a large dataset, and the experimental results are presented. In Section 5, observations from our experiments are discussed. A conclusion and discussion of future work are given in Section 6.

2 Related work

In this section, we focus on reviewing relevant literature on image text detection and recognition. As for other text detection and recognition methods, several review papers [1, 7, 8, 9] can be referred to.

2.1 Image text detection

Generally, text detection methods are based on either connected components or sliding windows [4]. Connected component based methods, like Maximally Stable Extremal Regions (MSER) [10, 11, 12], offer computational efficiency and high recall rates, but suffer from a large number of false detections. Methods based on sliding windows [13, 3, 4, 14, 15, 16] adopt a multi-scale window to scan through all locations of an image, then apply a trained classifier with either hand-engineered features or learned features to distinguish texts from non-texts. Though this kind of method produces significantly fewer false detections, the computational cost of scanning every location of the image is prohibitive. Therefore, connected component based methods and sliding-window based methods are often utilized together for text detection [17, 6, 18, 12], where the former generate text region proposals and the latter eliminate false detections. This text detection scheme is also adopted in this paper, but our text region proposal method is based on the sequence information of video and thus not comparable to existing methods designed for scene text detection. Hence, we focus on reviewing methods based on video sequence information and text region verification works that aim to eliminate false detections.

2.1.1 Methods incorporating video sequence information

Tang et al. [19] analyze the difference of adjacent frames to detect the subtitle text based on the assumption that in each shot the scene changes more gradually than the subtitle text. Wang et al. [20] exploit a multi-frame integration technique within 30 consecutive frames to reduce the complexity of backgrounds before the text detection process. Liu et al. [21] compare the distribution of stroke-like edges between adjacent frames and segment the video into clips in which the same caption is contained. Then they adopt a temporal “and” operation to identify caption regions. However, contrary to the proposed method in this paper, these existing methods rarely exploit temporal information throughout the video.

2.1.2 Text region verification based on hand-engineered features

Traditional methods harness manually designed low-level features such as SIFT and histogram of oriented gradients (HOG) to train a classifier to distinguish texts from non-texts. For instance, Wang et al. [22] propose a new block partition method and combine the edge orient histogram feature with the gray scale contrast feature (EOH-GSC) for text verification. Neumann et al. [18] adopt the SVM classifier with a set of geometric features for text detection. Wang et al. [14] and Jaderberg et al. [6] eliminate false text detections by Random Ferns with HOG features. Minetto et al. [23] propose a HOG-based texture descriptor (T-HOG) that improves on traditional HOG features in the text/non-text discrimination task. Effective as these handcrafted features are at describing image content, they are suboptimal for representing text data due to their heavy dependence on prior knowledge and heuristic rules.

2.1.3 Text region verification based on feature learning

In contrast to these traditional methods, more advanced methods take advantage of high-capability feature learning to automatically learn a more robust representation of text data, hence possessing a powerful discrimination ability to eliminate false text detections. Delakis and Garcia [15] train a CNN to detect texts from raw images in a sliding window fashion. Wang et al. [3] and Huang et al. [12] utilize a multi-layer CNN for both text detection and recognition, and the first layer of the network is trained with an unsupervised learning algorithm [13]. Ren et al. [16] are the first to tackle Simplified Chinese scene text detection. They propose an algorithm called convolutional sparse auto-encoder (CSAE) to pre-train the first layer of CNN on unlabeled synthetic data for Simplified Chinese scene text detection.

Both the above-mentioned methods and our approach are based on feature learning, comparing favorably against methods based on hand-engineered features. We further promote East Asian text detection performance by training a CNN ensemble in an end-to-end manner on labeled synthetic data.

2.2 Image text recognition

Similar to Section 2.1 where the importance of features is addressed, existing image text recognition methods are also classified into those based on hand-engineered features [24, 25, 26, 18, 14, 27] and those based on feature learning [13, 17, 3, 4, 6, 28, 29, 30, 31, 32, 33, 34].

2.2.1 Image text recognition based on hand-engineered features

Bissacco et al. [26] propose a scene text recognition system by combining a neural network trained on HOG features with a powerful language model. Lee et al. [24] present a new text recognition method by merging gradient histograms, gradient magnitude and color features. Bai et al. [27] use HOG features, artificially generated training data and a neural network classifier for Simplified Chinese image text recognition. Though state-of-the-art performance was achieved, its 85.44% recognition accuracy still impedes its practical application.

2.2.2 Image text recognition based on feature learning

Elagouni et al. [34] harness a CNN to perform character recognition with the aid of a language model, and their system achieves outstanding performance on 12 videos in French. Jaderberg et al. [4] propose a novel CNN architecture that facilitates efficient feature sharing for different tasks like text detection, character classification and bigram classification. Alsharif and Pineau [17] utilize the Maxout network [35] together with an HMM with a fixed lexicon to recognize image words. Jaderberg et al. [6] propose a CNN that directly takes whole word images as input and classifies them across a dictionary of 90,000 English words.

Works tackling East Asian image text recognition with CNNs are relatively rare. Zhong et al. [33] adopt a CNN with a multi-pooling layer on top of the final convolutional layer to perform multi-font printed Simplified Chinese character recognition, which renders their method robust to spatial layout variations and deformations. Bai et al. [31] propose a CNN architecture for Simplified Chinese and English character recognition, and the hidden-layers are shared across these two languages. However, both works [33, 31] can only recognize an isolated character as opposed to a text line. Besides, the work of Bai et al. [31] can only recognize 500 Simplified Chinese characters, though there are thousands of characters commonly used [36]. Therefore, to the best of our knowledge, the system proposed in this paper is the first to leverage high-capability CNNs to recognize image text lines in Simplified Chinese (and also other East Asian languages) with a comprehensive alphabet consisting of 7,008 characters.

3 Method

In this section, we will describe the synthetic data generation pipeline, the CNN ensemble and the end-to-end system in detail. As illustrated in Fig.2, the end-to-end system consists of three modules including STBB and SCW detection, SLRB detection and subtitle recognition.

Fig. 2: Overview of the proposed system. The end-to-end system consists of three modules corresponding to three boxes with blue dashed borders in the figure. Given a set of video frames, the first module detects STBB and SCW. In the second module, SLRB is detected by a SVM text/non-text classifier with features extracted from the hidden layer of the CNN ensemble. In the third module, a sliding window with width equaling to SCW is employed, and the CNN ensemble recognizes characters in each window region. The final result is given by a dynamic programming algorithm with a language model.

3.1 Synthetic data generation

As it is easy to simulate the generation pipeline of subtitles, training data are synthetically generated in a scheme similar to [37, 38]. The labeled synthetic data in Simplified Chinese (SC), Traditional Chinese (TC) and Japanese (JP) are generated to train CNNs in SC, TC and JP respectively.

(1) Dictionary construction: three comprehensive dictionaries that respectively cover 7,009 SC characters, 4,809 TC characters and 2,282 JP characters are constructed. A space character is included in each dictionary.

(2) Font rendering: 22, 19 and 17 kinds of font for SC, TC and JP are collected respectively for introducing more variations to the training data.

(3) Random selection of background and character: 45,441 frames are randomly extracted from 11 news videos downloaded from the Internet. Afterwards, small background patches are randomly cropped from these frames. The size of every background patch is determined with regard to a random combination of a character and a font. 200,000 machine-born white characters with dark shadows are generated by repeatedly selecting a random combination of a font and a character from the dictionary.

(4) Random shift and Gaussian blur: every randomly generated machine-born character is superimposed on a randomly selected background patch with a random shift whose size in pixels is drawn from a uniform distribution on the interval [-2, 2]. Then every image is convolved with a Gaussian blur whose scale in pixels is drawn from a uniform distribution on the interval [0.5, 1.6]. The convolved images are then converted to grayscale images and resized to 24 × 24. In this way, 200,000 samples are generated for each of SC, TC and JP.

The procedure of generating training samples for the text/non-text SVM classifier is almost the same, except that the same number of background patches without characters are also stored as non-text training examples. Fig.3 presents some of the training data.

Fig. 3: Examples of the machine-simulated training data. The small patches on the first line are non-text training examples, while those on the second line are text training examples.
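The random shift and Gaussian blur of step (4) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the function names are hypothetical, images are assumed to be grayscale nested lists that already have the target size, the shift is assumed to be an integer applied independently on both axes, and font rendering, grayscale conversion and resizing are left out.

```python
import math
import random

def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def superimpose(background, glyph, dx, dy):
    """Paste the non-zero glyph pixels onto a copy of the background, shifted by (dx, dy)."""
    h, w = len(background), len(background[0])
    out = [row[:] for row in background]
    for y, row in enumerate(glyph):
        for x, value in enumerate(row):
            ty, tx = y + dy, x + dx
            if value > 0 and 0 <= ty < h and 0 <= tx < w:
                out[ty][tx] = value
    return out

def make_sample(background, glyph, rng):
    """Draw a shift in [-2, 2] and a blur scale in [0.5, 1.6], then build one sample."""
    dx = rng.randint(-2, 2)          # assumption: integer shift
    dy = rng.randint(-2, 2)          # assumption: independent vertical shift
    sigma = rng.uniform(0.5, 1.6)
    return gaussian_blur(superimpose(background, glyph, dx, dy), sigma)
```

Calling `make_sample` repeatedly with fresh random characters and background patches would give the text examples; storing the background patches alone would give the non-text examples.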

3.2 Convolutional Neural Networks ensemble

CNNs have been recently applied to recognize image texts with great success [4, 6, 17, 3]. The architecture of our CNN model is mainly inspired by [39], in which a four-layer CNN with local response normalization achieved an 11% test error rate on the CIFAR-10 dataset [40]. As delineated by Table 1, the configuration of our net is derived from the code shared by Krizhevsky [41]. Our CNN takes as input a character image rescaled to the size of 24 × 24 pixels and returns as output a vector of N values between 0 and 1, where N is the number of character categories. The input image is converted to a grayscale image so as to reduce the susceptibility of our model to variable text colors and alleviate the computational burden.

Layer   Type                 Size-in    Size-out   Kernel
conv1   convolutional        24×24×1    24×24×64   5×5×64, 1
pool1   max-pooling          24×24×64   12×12×64   3×3×64, 2
rnorm1  local response norm  12×12×64   12×12×64
conv2   convolutional        12×12×64   12×12×64   5×5×64, 1
rnorm2  local response norm  12×12×64   12×12×64
pool2   max-pooling          12×12×64   6×6×64     3×3×64, 2
local3  locally-connected    6×6×64     6×6×64     3×3×64, 1
local4  locally-connected    6×6×64     6×6×32     3×3×32, 1
fc      fully-connected      6×6×32     N
probs   softmax              N          N
Table 1: CNN configuration. The input and output sizes are described in rows × cols × channels. The kernel is specified as rows × cols × filters, stride. N represents the number of character categories.
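The spatial sizes in Table 1 can be checked with a few lines of arithmetic. The sketch below is only a sanity check under assumptions that the table does not state: "same" zero padding for the convolutional and locally-connected layers, and pooling whose output size is rounded up.

```python
import math

def conv_out(size, kernel, stride, pad):
    """Output width of a convolutional or locally-connected layer."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Output width of a pooling layer, assuming the size is rounded up."""
    return math.ceil((size - kernel) / stride) + 1

def trace_sizes(input_size=24):
    """Follow the spatial size of a 24 x 24 input through the layers of Table 1."""
    trace = []
    size = conv_out(input_size, 5, 1, 2)   # conv1: 5x5, stride 1, assumed pad 2
    trace.append(("conv1", size))
    size = pool_out(size, 3, 2)            # pool1: 3x3, stride 2
    trace.append(("pool1", size))
    size = conv_out(size, 5, 1, 2)         # conv2: 5x5, stride 1, assumed pad 2
    trace.append(("conv2", size))
    size = pool_out(size, 3, 2)            # pool2: 3x3, stride 2
    trace.append(("pool2", size))
    size = conv_out(size, 3, 1, 1)         # local3: 3x3, stride 1, assumed pad 1
    trace.append(("local3", size))
    size = conv_out(size, 3, 1, 1)         # local4: 3x3, stride 1, assumed pad 1
    trace.append(("local4", size))
    return trace

print(trace_sizes())
```

The normalization layers are omitted because they do not change the spatial size.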

Note that we do not perform the data augmentation proposed by [39], in which 24 × 24 patches are randomly cropped from the original 32 × 32 images in CIFAR-10 [40] to prevent overfitting. The reason behind this is twofold. On the one hand, the loss of critical information, including radicals and strokes in characters, is inevitable if the original images are randomly cropped. On the other hand, we are not concerned about overfitting because our synthetic dataset can be arbitrarily large.

3.2.1 Details of learning

Stochastic gradient descent with a batch size of 128 images is used to train our models. Parameters like learning rates, weight decay and momentum are consistent with the shared code [42]. 195,000 images are used for training while the remaining 5,000 images are used for validation. We train each model for only one epoch on the training set, which takes approximately two hours on one NVIDIA Tesla K20Xm GPU.

3.2.2 Visualization

In Fig.4, we visualize the learned CNN ensemble using the technique demonstrated in [43, 44]. It can be observed that the appearance of different shifts and fonts of a specific category is captured in a single image, and that the ten CNN models in the ensemble each learn something slightly different despite their overall similarity. The visualization indicates that the CNN ensemble has captured distinctive features of characters.

Fig. 4: Visualization of 5 character classes learned from the Traditional Chinese character classifier. There are 10 visualization results corresponding to 10 CNN models in each line. These images are generated by numerically optimizing the input image which maximizes the score of a specific character category [43, 44].

3.2.3 Training the text/non-text SVM classifier

We adopt a linear SVM classifier [45] to determine whether there is a character in a given image patch. The SVM takes the outputs of the local4 layer of the CNN ensemble as its features. The local4 layer of every CNN outputs a 6 × 6 × 32 feature map, which is 1152-dimensional after flattening. The CNN ensemble consists of 10 CNNs, thus the feature vector of the SVM is 11520-dimensional. The regularization parameter of the SVM, which controls the trade-off between margin maximization and errors of the SVM on training data, is optimized on the synthetic validation set.
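The dimensionality of the SVM input follows directly from the layer sizes: 6 × 6 × 32 = 1152 values per model and 10 × 1152 = 11520 values for the ensemble. A minimal sketch of the concatenation and of a linear decision rule is given below; the function names, the nested-list layout of the feature maps and the sign convention of the decision are illustrative assumptions.

```python
def flatten(feature_map):
    """Flatten an H x W x C nested list into a flat list of values."""
    return [value for row in feature_map for cell in row for value in cell]

def ensemble_features(feature_maps):
    """Concatenate the flattened local4 maps of all models in the ensemble."""
    features = []
    for feature_map in feature_maps:
        features.extend(flatten(feature_map))
    return features

def is_text(features, weights, bias):
    """Linear decision rule; a positive score is assumed to mean 'text'."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return score > 0

# Ten dummy 6 x 6 x 32 feature maps stand in for the outputs of the ten CNNs.
dummy_maps = [[[[0.0] * 32 for _ in range(6)] for _ in range(6)] for _ in range(10)]
print(len(ensemble_features(dummy_maps)))
```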

3.3 STBB and SCW detection

In this section, we describe the proposed image operator CWT and how it is applied with the sequence information to detect STBB and SCW.

3.3.1 Character Width Transform

One feature that distinguishes East Asian text from other elements of a video frame is its consistent SCW. SCWs of East Asian characters are identical as long as their font styles and font sizes are set the same. In this work, we leverage this fact to define CWT, which recovers regions that are likely to contain texts.

Fig. 5: Illustration of the distribution patterns of histograms at a subtitle region (window region 2) and non-subtitle regions (window region 1 and 3).

CWT is a local image operator. At each local region, CWT generates a histogram that estimates the distribution of SCWs of the subtitle text in this region. SCW is estimated by detecting pixels that are likely to be located in the space between characters and calculating the pairwise distances between these detected pixels. As illustrated in Fig. 5, the randomness at non-subtitle regions makes the pairwise distances distribute uniformly. Meanwhile, at subtitle regions, more pairwise distances come from the space between characters, leading to the emergence of a local peak in the vicinity of the SCW. Based on the distribution patterns of histograms constructed at different local regions, we expect that the STBB and the SCW can be determined simultaneously.

Detecting pixels in the space between characters requires the binarization of frames extracted from videos (see Fig. 6 (b) for illustration). Firstly, each RGB frame is transformed into LAB color space to avoid illumination interference [46]. Then, the Sauvola algorithm [47] is adopted to separate text components from the background (binarization) for its robustness to uneven illumination and noise. This algorithm performs local thresholding within a rectangular neighborhood; both the height and the width of the neighborhood are set to 150 pixels and the threshold is set to 0.34.

Fig. 6: (a) is an original RGB frame and (b) is the binarized frame. (c) illustrates the proposed vertical sliding window. In (c), the red box represents the vertical sliding window, and the dashed red arrow shows the direction in which the sliding window moves.

CWT is then applied to every local region in a sliding-window manner. Concretely, a vertical sliding window (as shown in Fig.6 (c)) is adopted, whose height is smaller than the frame height and is determined according to the resolution of videos. This window scans each frame by moving vertically from top to bottom at stride 1, yielding one window region per vertical position. Finally, we acquire one histogram per window region by applying CWT at every window region.

Consider the pixels of the binarized frame, indexed by their coordinates. Values of most text pixels are 1 after the binarization. For the sliding-window region whose top boundary is at a given vertical position, we compute the sum of the elements in each of its columns.

After that, pixels that are likely to be located in the space between characters are detected as local-minimum points of these column sums. As illustrated by Fig.7, the majority of the local-minimum points are interspersed among backgrounds as well as the space between characters. If more than 30 local-minimum points are connected (i.e. they occupy consecutive horizontal positions), they will be removed, which effectively eliminates the points from backgrounds while preserving those from the space between characters. The rationale for this constraint is that more than 30 connected local-minimum points could only come from backgrounds.

Fig. 7: The majority of local-minimum points are interspersed among backgrounds (denoted by red asterisks) and the space between characters (denoted by green asterisks).
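The detection of inter-character pixels can be sketched as follows. This is a minimal illustration, assuming that the binarized frame is a nested list of 0/1 values, that a local minimum is any column whose sum does not exceed that of its two neighbours (so flat background stretches count as connected minima), and that "connected" means consecutive horizontal positions. The function names are hypothetical.

```python
def column_sums(binary_frame, top, height):
    """Sum each column of the window covering rows top .. top + height - 1."""
    rows = binary_frame[top:top + height]
    return [sum(column) for column in zip(*rows)]

def local_minima(sums):
    """Indices whose value is <= both neighbours (plateaus are included)."""
    n = len(sums)
    points = []
    for x in range(n):
        left = sums[x - 1] if x > 0 else sums[x]
        right = sums[x + 1] if x < n - 1 else sums[x]
        if sums[x] <= left and sums[x] <= right:
            points.append(x)
    return points

def drop_long_runs(points, max_run=30):
    """Discard every run of more than max_run consecutive indices."""
    kept, run = [], []
    for x in points:
        if run and x != run[-1] + 1:
            # The current run has ended; keep it only if it is short enough.
            if len(run) <= max_run:
                kept.extend(run)
            run = []
        run.append(x)
    if run and len(run) <= max_run:
        kept.extend(run)
    return kept
```

With this definition a wide empty background produces a long run of minima that `drop_long_runs` removes, while the narrow gaps between characters survive.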

Then all pairwise distances between the remaining local-minimum points are calculated, and those lying between the minimum and the maximum SCW are stored in a set.

It is noteworthy that since the statistical information derived from a single frame is too coarse to provide a reliable estimation of SCW, we cannot construct a histogram directly from the distances of a single frame in the next step. This is where the sequence information of video comes in handy. As STBB and SCW are consistent throughout the video, we assume that the distances collected at the same window position in all frames of the video are drawn from the same underlying distribution. Based on this assumption, histograms can be constructed from frames throughout the video: for each window position, the histogram value at a given width is the number of stored pairwise distances, accumulated over all frames, that equal this width. In order to alleviate the computational burden, videos are downsampled to 0.0625 fps without compromising the STBB detection performance.
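A minimal sketch of how the pairwise distances of many frames might be pooled into one histogram for a single window position is given below. The function names and the data layout (one list of retained local-minimum positions per frame) are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def pairwise_distances(points, min_width, max_width):
    """Distances between all pairs of points that fall inside [min_width, max_width]."""
    return [b - a for a, b in combinations(sorted(points), 2)
            if min_width <= b - a <= max_width]

def accumulate_histogram(points_per_frame, min_width, max_width):
    """Count, over all frames, how often each admissible distance occurs."""
    histogram = Counter()
    for points in points_per_frame:
        histogram.update(pairwise_distances(points, min_width, max_width))
    return histogram

# Two toy frames whose character gaps are roughly 20 pixels apart.
frames = [[0, 20, 40, 43], [5, 25, 45]]
histogram = accumulate_histogram(frames, min_width=10, max_width=30)
print(histogram.most_common(1))
```

At 0.0625 fps one frame is kept every 16 seconds, so even a long video contributes only a few hundred frames to each histogram.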

3.3.2 Detecting the STBB and SCW

Given the histograms, the STBB and the SCW can be determined. Concretely, if the local peaks (see Fig.5) of several adjacent histograms are all located near the same width, the vertical range covered by these histograms will be regarded as the position of a set of candidate STBB, and that width will be the corresponding SCW. Our algorithm is presented in Algorithm 1, of which the output contains several candidate sets of STBB and estimated SCW.

Algorithm 1 STBB and SCW determination
Input: histograms, maximum SCW, minimum SCW, minimum subtitle height
Output: candidate STBB and SCW
Step 1. Find local peaks inside histograms: for every histogram and every candidate width, check whether the histogram has a local peak there; if so, estimate the position of the local peak by quadratic interpolation.
Step 2. Detect adjacent histograms with similar local peak positions: for every histogram that has a local peak, scan the following histograms and stop at the first one whose local peak is not at a similar position; if enough adjacent histograms agree, record them together with the peak position as a candidate.
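The two stages of Algorithm 1 can be illustrated with the sketch below. It is not a reconstruction of the authors' pseudocode: the peak test (a strict local maximum), the choice of the strongest peak per histogram, the tolerance used to call two peaks similar and all function names are assumptions. Only the quadratic (parabolic) interpolation of the peak position and the grouping of adjacent histograms come from the text.

```python
def interpolate_peak(hist, w):
    """Refine a peak at bin w by fitting a parabola through bins w-1, w and w+1."""
    left, mid, right = hist[w - 1], hist[w], hist[w + 1]
    denom = left - 2 * mid + right
    if denom == 0:
        return float(w)
    return w + 0.5 * (left - right) / denom

def strongest_peak(hist, min_width, max_width):
    """Interpolated position of the highest strict local peak, or None if there is none."""
    best = None
    for w in range(max(min_width, 1), min(max_width, len(hist) - 2) + 1):
        if hist[w] > hist[w - 1] and hist[w] > hist[w + 1]:
            if best is None or hist[w] > hist[best]:
                best = w
    return None if best is None else interpolate_peak(hist, best)

def group_rows(peaks, tolerance, min_height):
    """Group adjacent rows whose peaks stay within tolerance of the group's first peak.

    Returns (first_row, last_row, mean_peak) for every group spanning at least
    min_height rows.
    """
    candidates = []
    i = 0
    while i < len(peaks):
        if peaks[i] is None:
            i += 1
            continue
        j = i
        while (j + 1 < len(peaks) and peaks[j + 1] is not None
               and abs(peaks[j + 1] - peaks[i]) <= tolerance):
            j += 1
        if j - i + 1 >= min_height:
            mean_peak = sum(peaks[i:j + 1]) / (j - i + 1)
            candidates.append((i, j, mean_peak))
        i = j + 1
    return candidates
```

Here `peaks` would hold one entry per window position, obtained by calling `strongest_peak` on the histogram of that position; the row indices of a group then delimit a candidate STBB and the mean peak gives the estimated SCW.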

Note that the candidates output by Algorithm 1 are raw candidates, some of which might come from non-subtitle regions and should be eliminated. A post-processing algorithm is adopted to remove these false-positive candidates: (1) if two candidates with a similar SCW overlap, we eliminate the one whose subtitle height is smaller; (2) if two candidates have a similar STBB and the SCW of one of them is approximately twice that of the other, the candidate with the larger SCW is eliminated; (3) candidates whose STBB is located in the upper half of the frame are eliminated, because most subtitles are superimposed on the bottom half of the frame.

This post-processing algorithm eliminates almost all false detections, and the small number of surviving false positives will be further removed by the text/non-text classifier in the following step.

3.4 SLRB detection

Raw subtitle regions bounded by the detected STBB and the left/right boundaries of the original frames are cropped from the original frames. Each raw subtitle region is as wide as the frame and as tall as the subtitle height. Then, SLRB are detected in a sliding-window manner: three windows of different widths, each defined relative to the determined SCW, slide from left to right across the raw subtitle region with stride 1. Every window region is classified as either a text region or a non-text region by the SVM classifier described in Section 3.2.3. Each window region predicted as a text region is described by its left boundary position and its right boundary position. Algorithm 2 is designed to merge overlapping window regions predicted as text regions and subsequently determine the SLRB. According to the output of Algorithm 2, the subtitle region is detected by further removing non-subtitle regions on the two sides of the raw subtitle region. This process is illustrated in Fig.8. The gap parameter of Algorithm 2 is determined according to the resolution of videos. If it is too large, the real subtitle region is easily connected with non-subtitle regions that are incorrectly predicted; if it is too small, an integral sentence is easily broken into pieces.

Fig. 8: This delineates the subtitle detection procedure. STBB and SCW are detected firstly. Then a sliding window horizontally scans the subtitle region detected in the first step. Every window region is predicted either as text (T) or non-text (N) by the SVM classifier, which takes CNN features as input. Based on the predictions, Algorithm 2 finally determines SLRB. For illustration convenience, the stride of the sliding window is enlarged to SCW.
Algorithm 2 SLRB determination
Input: predicted text window regions, parameter controlling the maximum gap between two clauses separated by space, the determined SCW
Output: the left and the right boundaries of the subtitle
Procedure: scan the predicted text window regions from left to right; merge each region with the regions that follow it while they overlap or lie within the maximum gap; then determine the left and the right boundaries of the subtitle from the merged regions.
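The merging step can be illustrated with a standard interval-merging routine. This is a minimal sketch, assuming that each predicted text window is a `(left, right)` pair and that two windows belong to the same segment when the gap between them does not exceed a hypothetical `max_gap`. How the final boundaries are chosen among several merged segments is not recoverable from the text, so the sketch simply returns all segments.

```python
def merge_text_windows(windows, max_gap):
    """Merge (left, right) windows that overlap or are at most max_gap pixels apart."""
    merged = []
    for left, right in sorted(windows):
        if merged and left - merged[-1][1] <= max_gap:
            # Close enough to the current segment: extend its right boundary.
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return [tuple(segment) for segment in merged]

windows = [(100, 120), (101, 121), (110, 130), (160, 180), (400, 420)]
print(merge_text_windows(windows, max_gap=40))
```

In this toy input the first four windows form one segment because the 30-pixel space between them is below `max_gap`, while the isolated window far to the right stays separate, which mirrors the trade-off described for the gap parameter.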

3.5 Subtitle recognition

Now that the subtitle region has been successfully detected, we will describe the proposed subtitle recognition scheme with three steps including sliding window based segmentation, window region recognition and dynamic programming determination.

3.5.1 Sliding window based segmentation

In order to recognize each single character in the subtitle, the subtitle region must be properly segmented (i.e. the image text line must be split into patches, each of which contains a single character). This step is challenging due to touching characters and the inherent left-right structure of many East Asian characters, whose left and right parts are themselves separated. Unlike other methods where potential segmentation points must be determined precariously [48, 27, 26, 29], our method obviates this step since the SCW is known, which is an inborn advantage of our system. Three sliding windows identical to those in Section 3.4 are adopted again to slide from left to right across the subtitle region at stride one, and each window region is fed into the CNN ensemble for recognition.

3.5.2 Window region recognition

Given a window region, specified by its position and width, the softmax layer of each CNN model outputs the probability of each category, and the categories whose probabilities are among the top 20 are reserved. Then, the probabilities of these reserved categories are averaged across the 10 CNN models. If the largest average probability is greater than a threshold (i.e. 0.2), the candidate categories of the window region with the top 5 average probabilities are recorded before moving to the next window position. Otherwise, the window region probably lies between two adjacent characters; in this case, it is abandoned and the next window region is examined. Finally, those of the 5 recorded candidate categories whose probabilities are greater than 0.05 are stored with their associated recognition probabilities and the window position.
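The candidate selection for a single window region can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name, the dictionary-based input format, and the choice to count a category that is absent from a model's reserved top-20 list as probability zero when averaging are all assumptions. Only the thresholds (20, 0.2, 5 and 0.05) come from the text.

```python
from collections import defaultdict


def ensemble_candidates(model_probs, top_per_model=20, accept=0.2,
                        top_avg=5, keep=0.05):
    """Select candidate categories for one window region.

    model_probs: one dict {category: softmax probability} per CNN model
    (hypothetical input format). Returns a list of
    (category, averaged probability), or [] if the window is abandoned.
    """
    totals = defaultdict(float)
    for probs in model_probs:
        # Reserve only this model's highest-probability categories.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        for category, p in ranked[:top_per_model]:
            totals[category] += p

    # Assumption: a category missing from a model's reserved list adds 0.
    n_models = len(model_probs)
    averaged = sorted(((c, s / n_models) for c, s in totals.items()),
                      key=lambda kv: kv[1], reverse=True)

    # Abandon the window if even the best category is not confident enough;
    # such a window probably lies between two adjacent characters.
    if not averaged or averaged[0][1] <= accept:
        return []

    # Record the top candidates, then store only those above the floor.
    return [(c, p) for c, p in averaged[:top_avg] if p > keep]
```

Calling `ensemble_candidates` once per window position and width would yield the per-window candidate lists consumed by the dynamic programming step.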

3.5.3 Dynamic programming determination

The final recognition results are determined by a dynamic programming algorithm. Moving step by step from the leftmost window to the rightmost window, this algorithm builds the whole sentence by repeatedly appending the character in the next window position (i.e. one of the three window widths, in pixels, rightward) to the previously recognized sentence. In each step from one window to the next, every previously recognized sentence that arrives at the current window is processed by a character-based 3-gram language model. For every unique 3-gram word group consisting of the newly appended character and the two preceding characters, a recognition probability and a 3-gram language probability are recorded, from which the total score of the word group is calculated. A weighting parameter sets the proportion between the language score and the recognition score; it is 0.3 in our experiment.

Since the sliding window has three widths, several identical word groups may arrive at the same position with different scores during the building process. Therefore, a pruning strategy that keeps only the word group with the highest score is applied to reduce redundancy and improve efficiency. The building process terminates when the window approaches the right boundary of the image. The total score of each candidate sentence is then computed from the sum of all word-group scores in that sentence and the number of windows (i.e. characters) it contains. The sentence with the highest total score is selected as the final recognition result.
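The sentence-building procedure can be illustrated with the following Python sketch. It rests on several assumptions that are not taken from the paper: a log-linear word-group score, (1 - 0.3) * log(recognition probability) + 0.3 * log(language probability); a sentence score equal to the summed word-group scores divided by the number of characters; consecutive windows that tile the text line exactly; and the last two characters as the pruning key. The names `decode`, `windows` and `lm_prob` are hypothetical.

```python
import math


def decode(windows, lm_prob, start, end, alpha=0.3):
    """Build the best sentence from per-window candidates.

    windows: {(x, width): [(char, recognition_prob), ...]} with probs > 0.
    lm_prob(a, b, c): smoothed (> 0) 3-gram probability of c after a, b.
    start, end: left and right boundaries of the text line in pixels.
    """
    # best[x][(a, b)] = (score_sum, n_chars, text) for partial sentences
    # whose last window ends at x and whose last two characters are a, b.
    best = {start: {("", ""): (0.0, 0, "")}}

    # Windows only move rightward, so visiting start positions in
    # increasing order guarantees best[x] is complete when it is read.
    for x in sorted({key[0] for key in windows} | {start}):
        for (a, b), (total, n, text) in list(best.get(x, {}).items()):
            for (wx, width), candidates in windows.items():
                if wx != x:
                    continue
                for c, p in candidates:
                    # Assumed word-group score (log-linear combination).
                    s = ((1 - alpha) * math.log(p)
                         + alpha * math.log(lm_prob(a, b, c)))
                    slot = best.setdefault(x + width, {})
                    new = (total + s, n + 1, text + c)
                    # Pruning: keep only the highest-scoring hypothesis
                    # per position and trailing character pair.
                    if (b, c) not in slot or new[0] > slot[(b, c)][0]:
                        slot[(b, c)] = new

    finals = [v for v in best.get(end, {}).values() if v[1] > 0]
    if not finals:
        return ""
    # Assumed sentence score: summed scores normalized by sentence length.
    return max(finals, key=lambda v: v[0] / v[1])[2]
```

Normalizing by the number of characters keeps sentences built from many narrow windows comparable with sentences built from a few wide ones; under a different scoring formula the pruning key and the final selection would need to be revisited.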

4 Experiments

We conduct extensive experiments to evaluate each component of the proposed system. The end-to-end performance of our system is also reported in this section.

4.1 Dataset

As listed in Table 2, an extensive dataset containing 1097 videos in Simplified Chinese, Traditional Chinese and Japanese is constructed. These videos exhibit a wide range of diversity in TV program genres, including talk shows, documentaries, news reports, etc.

The STBBs of all videos and the SLRBs of the marked videos (the counts in parentheses in Table 2) are annotated manually. As our recognition module is almost error-free, the recognition results of the marked videos are annotated by a human annotator “A” on the basis of the outputs of the proposed system. The annotations obtained in this manner are regarded as ground truth. To test the quality of the ground truth annotations, we randomly select 400 frames containing 4494 characters from the already annotated frames and employ two other human annotators, “B” and “C”, to annotate these frames again independently. A final agreement is reached by comparing the annotations from “B” and “C”, and the annotations from “A” are checked against it. The annotations from “A” achieve 99.8% accuracy, indicating that the ground truth annotations are of high quality.

We also measure the human-level reading performance on these 400 frames. A human annotator “D” annotates these frames manually, and the annotations from “D” are checked against the final agreement mentioned above. The human-level reading performance is estimated by the performance of “D”, whose reading accuracy is 99.6%.

Language              #Videos     Resolution
Traditional Chinese   1015 (40)   480×320
Traditional Chinese   40          852×480
Simplified Chinese    40 (40)     852×480
Japanese              2           480×320
Table 2: Our dataset configuration. All videos are utilized to evaluate the STBB detection module, while only the videos counted in parentheses are randomly selected to evaluate the remaining modules and the end-to-end system.

4.2 Experiments on STBB and SCW detection

In order to demonstrate the efficacy of our method, all videos in the dataset are selected for evaluation. In the experiment, the height of the vertical sliding window is optimized separately for videos with 480×320 resolution and videos with 852×480 resolution.

The CNN ensemble trained on synthetic data with random shifts makes our system highly robust even if the STBB are not precisely detected. For this reason, our evaluation criterion is defined as follows: the STBB of a video are detected correctly if


where , , and denote positions of detected top boundary, ground-truth top boundary, detected bottom boundary and ground-truth bottom boundary respectively.

We perform a series of tests to determine the optimal height of the proposed vertical sliding window (Section 3.3.1). The three input variables of Algorithm 1 are set to 5, 40 and 12 respectively. Table 3 shows the performance of our STBB detection module for different window heights. The window height controls the trade-off between the STBB detection accuracy and the tolerance to noise. From our experiments, we observe that when the height is too small, the histogram becomes more susceptible to background noise as well as to strokes inside characters that do not reflect SCW. A height that is too large, however, compromises the STBB detection accuracy.

Resolution   Number of videos   Window height   Number of videos whose STBB are correctly detected   Rate
480×320      1017               1               972                                                  95.6%
480×320      1017               3               980                                                  96.4%
480×320      1017               5               951                                                  93.5%
480×320      1017               7               934                                                  91.8%
852×480      80                 3               73                                                   91.3%
852×480      80                 5               75                                                   93.8%
852×480      80                 7               75                                                   93.8%
Table 3: Parameter optimization. STBB detection precision is not presented because false positives are subsequently removed by the text/non-text classifier. Therefore, every video has only one final subtitle location. Note that the correctness of STBB determination always entails the correctness of SCW determination, hence only the former is reported.

4.3 Experiments on SLRB detection

In this section, the performance of our SLRB detection module is evaluated against two baseline methods based on hand-engineered features: T-HOG [23] and EOH-GSC [22]. The gap parameter of Algorithm 2 is set to 0.7 for videos in 480×320 resolution and 2.5 for videos in 852×480 resolution.

Our evaluation method is quite similar to the ICDAR’03 detection protocol [49]. For each ground-truth SLRB and its corresponding detected SLRB, the match is defined as twice the length of their intersection divided by the sum of their lengths, where the length of an SLRB is the distance between its left and right boundaries. The average match of a video is obtained by averaging over all the ground-truth SLRBs in the video.
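The match measure, twice the intersection length divided by the sum of the two lengths, can be written as a short Python sketch. The function names and the representation of an SLRB as a `(left, right)` tuple are assumptions made for illustration, as is returning 0 when both intervals have zero length.

```python
def interval_match(ground_truth, detected):
    """Match between two horizontal intervals given as (left, right).

    Returns 2 * |intersection| / (|ground_truth| + |detected|), a value
    in [0, 1] that equals 1 only when the two intervals coincide.
    """
    left = max(ground_truth[0], detected[0])
    right = min(ground_truth[1], detected[1])
    intersection = max(0.0, right - left)
    total = (ground_truth[1] - ground_truth[0]) + (detected[1] - detected[0])
    # Assumption: two zero-length intervals are treated as a non-match.
    return 2.0 * intersection / total if total > 0 else 0.0


def average_match(pairs):
    """Mean match over all (ground_truth, detected) pairs of one video."""
    return sum(interval_match(g, d) for g, d in pairs) / len(pairs)
```

For example, a detected SLRB of (110, 300) against a ground truth of (100, 300) gives 2 × 190 / (200 + 190), roughly 0.974, so a 10-pixel error on one side costs about 2.6 percentage points.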

Table 4 lists the statistics of the average match over 80 videos and shows the superiority of our CNN features over T-HOG [23] and EOH-GSC [22] features on the text/non-text classification task.

Language              CNN features   EOH-GSC [22]   T-HOG [23]
Simplified Chinese    99.4 ± 0.9%    96.1 ± 2.5%    91.7 ± 4.6%
Traditional Chinese   99.5 ± 0.4%    96.8 ± 3.3%    94.0 ± 5.1%
Table 4: The statistics of the average match. We randomly select 80 videos (40 in Simplified Chinese and 40 in Traditional Chinese) whose STBBs are correctly determined for evaluation.

4.4 Experiments on subtitle recognition

This section measures the performance of our character recognition module. For comparison, we test the same 80 videos as in the previous section with Grayscale based Chinese Image Text Recognition (gCITR) [27] as well as two commercial OCR products: ABBYY FineReader 12 [50] and Microsoft OCR library [51]. gCITR [27] is the previous state-of-the-art system for Simplified Chinese subtitle recognition, which achieved 85.44% word accuracy on another dataset. In addition, the performance of a single CNN is reported in order to demonstrate the efficacy of the CNN ensemble.

The performance of our subtitle recognition module is evaluated by the word accuracy, which is computed from the number of ground-truth words and the Levenshtein edit distance [52] required to change a recognized sentence into the ground truth.
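A word accuracy of this kind can be sketched in Python as shown below. The formula used, (N - d) / N with N the number of ground-truth words and d the summed edit distance, is an assumption based on the common definition and is not confirmed by the text; treating each character as one word and the function names are likewise assumptions.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (item_a != item_b),   # substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(recognized, ground_truth):
    """Assumed definition: (N - d) / N over paired sentences.

    recognized, ground_truth: equally long lists of sentences (strings);
    each character counts as one word (assumption).
    """
    n_words = sum(len(truth) for truth in ground_truth)
    distance = sum(levenshtein(rec, truth)
                   for rec, truth in zip(recognized, ground_truth))
    return (n_words - distance) / n_words
```

Under this definition a recognizer that inserts many spurious characters can score below zero, which is one reason to report the edit distance alongside the accuracy when comparing systems.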

TV programs  #Videos  #Words  ABBYY[50]  gCITR[27]  MS OCR[51]  Single CNN  CNN ensemble
HXLA         3        4630    52.4%      78.5%      89.9%       97.4%       99.7%
CFZG         3        7711    78.7%      91.8%      89.7%       98.1%       99.7%
ZGSY         3        8982    68.7%      81.6%      85.8%       98.5%       99.9%
DA           2        3936    64.8%      69.1%      89.0%       97.7%       99.7%
JXTZ         2        4682    66.8%      70.3%      88.3%       97.8%       99.6%
FNMS         2        5681    68.3%      87.7%      87.7%       99.2%       99.8%
JF           5        9299    54.3%      75.8%      84.8%       98.2%       99.3%
KJL          2        3372    61.9%      87.8%      61.3%       98.0%       99.8%
KXDG         1        2027    40.6%      76.2%      56.3%       97.5%       98.3%
AQGY         2        4850    56.6%      79.7%      56.9%       94.3%       96.9%
CCTVJS       2        3918    85.2%      71.1%      82.6%       96.2%       99.9%
SDGJ         3        8700    67.0%      83.2%      82.6%       98.4%       99.9%
DSGY         1        1872    68.9%      31.4%      63.4%       97.8%       99.0%
JXX          1        3618    67.8%      80.5%      71.7%       97.7%       99.6%
TTXS         1        2090    39.8%      68.7%      86.3%       96.7%       99.5%
YSRS         3        8914    48.6%      78.6%      80.8%       98.1%       99.7%
YST          2        4712    54.8%      85.7%      85.9%       97.1%       99.3%
BBQN         1        2751    51.9%      76.9%      76.8%       96.1%       99.6%
ZHDWM        1        1319    55.7%      82.2%      52.4%       95.9%       97.4%
Total        40       93064
Average                       62.0%      79.4%      80.5%       97.7%       99.4%
Table 5: Word Accuracy of Simplified Chinese.
TV programs  #Videos  #Words  ABBYY[50]  gCITR[27]*  MS OCR[51]  Single CNN  CNN ensemble
DXSLM        2        2024    62.8%      -           86.8%       98.2%       99.6%
KXLL         10       11819   84.4%      -           89.4%       97.1%       99.5%
NDXW         11       30683   38.3%      -           47.9%       96.7%       99.4%
QJXTW        2        6245    34.4%      -           61.9%       97.9%       99.6%
YXW          3        4361    54.0%      -           63.4%       97.5%       99.5%
XWWW         4        10124   41.6%      -           59.1%       96.7%       99.5%
XGD          2        5147    35.2%      -           62.1%       97.8%       99.4%
XTWJY        2        4264    39.2%      -           67.8%       97.8%       99.6%
XYZY         3        7603    93.2%      -           85.4%       97.3%       99.4%
YHHS         1        2103    53.9%      -           68.4%       97.0%       99.6%
Total        40       84373
Average                       50.8%      -           62.0%       97.1%       99.4%
Table 6: Word Accuracy of Traditional Chinese. * gCITR [27] is not designed for Traditional Chinese.

Table 5 and Table 6 show the performance of ABBYY [50], gCITR [27], Microsoft OCR library [51], our single CNN and the CNN ensemble on the Simplified Chinese and Traditional Chinese text line recognition tasks. The proposed method outperforms the baselines by a large margin. In order to demonstrate the efficacy of our system on other languages, we also test it on two videos in Japanese, on which an average word accuracy of 97.4% is achieved.

4.5 End-to-end performance

The same 80 videos as in the previous section are selected for evaluating the end-to-end performance. Table 7 compares the end-to-end performance of the proposed system with that of ABBYY [50], gCITR [27] and Microsoft OCR [51].

ABBYY[50] gCITR[27] MS OCR[51] Proposed
Simplified Chinese 60.7% 78.1% 79.3% 98.2%
Traditional Chinese 49.7% - 60.9% 98.3%
Table 7: End-to-end performance. Note that the three baselines take the subtitle regions detected by our system as input rather than raw video frames, as ABBYY [50] and Microsoft OCR [51] may generate many false detections on raw video frames and gCITR [27] can only perform text recognition.

5 Discussion

Although the STBB detection module has achieved competitive performance, there is still room for improvement. We observe that a majority of the incorrectly detected STBBs are located near the ground-truth boundaries (Fig. 9). More accurate boundary positions could be obtained by adopting a regression method such as the one in [6]. In the SLRB detection module, specific characters are sporadically misclassified as non-text. We find that the strokes of these characters are all very sparse and can easily be confused with edge or texture features in the background (Fig. 10). Confusion and loss of radicals and strokes are the two major mistakes made by the CNN character recognizer (Fig. 11). Character categories that are misclassified more than three times are examined and the causes of the errors are scrutinized. We find that 45.5% of the errors are caused by resemblances between two characters, 33.2% by cluttered backgrounds, 18.2% by the incorporation of the language model and 3.2% by large vertical shifts of characters.

Fig. 9: Typical mistakes made by the STBB detection module. Red boxes denote the detected STBB.
Fig. 10: Typical mistakes made by the SLRB detection module. Red boxes denote detected subtitle regions.
Fig. 11: Typical recognition mistakes made by the CNN ensemble. Red boxes mark the incorrectly recognized characters. The ground-truth characters are enclosed in parentheses.

6 Conclusion

In this paper, we present an end-to-end subtitle text detection and recognition system specifically designed for videos with subtitles in East Asian languages. By applying CWT and integrating the sequence information throughout the video, we are able to detect STBB and SCW simultaneously. This represents a departure from the scene text detection problem, where sophisticated methods are designed to detect texts in a single image. A CNN ensemble is leveraged to classify East Asian characters into thousands of categories. Our models are trained purely on synthetic data, which makes it possible for our system to be re-trained on other languages without requiring human labeling effort. Our system, as well as each module in it, compares favorably against existing methods on an extensive dataset. The near-human-level performance of our system qualifies it for practical application. For example, our system can provide accurate and reliable text labels for speech recognition research, since video subtitles are synchronous with speech in videos.

In future work, this system will be tested on videos in Korean or other languages with consistent SCW.


This work is supported by Microsoft Research under the eHealth program, the Beijing Natural Science Foundation in China under Grant 4152033, the Technology and Innovation Commission of Shenzhen in China under Grant shengfagai2016-627, the Beijing Young Talent Project in China, and the Fundamental Research Funds for the Central Universities of China under Grant SKLSDE-2015ZX-27 from the State Key Laboratory of Software Development Environment in Beihang University in China. We would like to thank Jinfeng Bai for conducting the gCITR baseline experiment.


  • [1] Q. Ye, D. Doermann, Text detection and recognition in imagery: A survey, IEEE Trans. Pattern Anal. Mach. Intell. (2015) 1480–1500.
  • [2] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, Icdar 2013 robust reading competition, in: International Conference on Document Analysis and Recognition (ICDAR), 2013, pp. 1484–1493.
  • [3] T. Wang, D. J. Wu, A. Coates, A. Y. Ng, End-to-end text recognition with convolutional neural networks, International Conference on Pattern Recognition (ICPR) (2012) 3304–3308.
  • [4] M. Jaderberg, A. Vedaldi, A. Zisserman, Deep features for text spotting, in: European Conference on Computer Vision (ECCV), 2014, pp. 512–528.
  • [5] J. C. Rajapakse, L. Wang, Neural information processing: research and development, Vol. 152, Springer, 2012.
  • [6] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, International Journal of Computer Vision (IJCV) (2016) 1–20.
  • [7] K. Jung, K. I. Kim, A. K. Jain, Text information extraction in images and video: a survey, Pattern Recognit. (2004) 977–997.
  • [8] N. Sharma, U. Pal, M. Blumenstein, Recent advances in video based document processing: A review, in: IAPR Workshop on Document Analysis Systems, 2012, pp. 63–68.
  • [9] J. Zhang, R. Kasturi, Extraction of text objects in video documents: Recent progress, in: IAPR Workshop on Document Analysis Systems, 2008, pp. 5–17.
  • [10] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide-baseline stereo from maximally stable extremal regions, British Machine Vision Conference (BMVC) (2004) 761–767.
  • [11] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, Scene text detection using graph model built upon maximally stable extremal regions, Pattern Recognit. Lett. (2013) 107–116.
  • [12] W. Huang, Y. Qiao, X. Tang, Robust scene text detection with convolution neural network induced mser trees, in: European Conference on Computer Vision (ECCV), 2014, pp. 497–511.
  • [13] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. J. Wu, A. Y. Ng, Text detection and character recognition in scene images with unsupervised feature learning, in: International Conference on Document Analysis and Recognition (ICDAR), 2011, pp. 440–445.
  • [14] K. Wang, B. Babenko, S. Belongie, End-to-end scene text recognition, in: International Conference on Computer Vision (ICCV), 2011, pp. 1457–1464.
  • [15] M. Delakis, C. Garcia, Text detection with convolutional neural networks, in: International Conference on Computer Vision Theory and Applications (VISAPP), 2008, pp. 290–294.
  • [16] X. Ren, K. Chen, X. Yang, Y. Zhou, A new unsupervised convolutional neural network model for chinese scene text detection, in: IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), 2015.
  • [17] O. Alsharif, J. Pineau, End-to-end text recognition with hybrid hmm maxout models, International Conference on Learning Representations (ICLR).
  • [18] L. Neumann, J. Matas, A method for text localization and recognition in real-world images, in: Asian Conference on Computer Vision (ACCV), 2010, pp. 770–783.
  • [19] X. Tang, X. Gao, J. Liu, H. Zhang, A spatial-temporal approach for video caption detection and recognition, IEEE Trans. Neural Netw. (2002) 961–971.
  • [20] R. Wang, W. Jin, L. Wu, A novel video caption detection approach using multi-frame integration, in: International Conference on Pattern Recognition (ICPR), 2004, pp. 449–452.
  • [21] X. Liu, W. Wang, Robustly extracting captions in videos based on stroke-like edges and spatio-temporal analysis, IEEE Trans. Multimedia (2012) 482–489.
  • [22] X. Wang, L. Huang, C. Liu, A new block partitioned text feature for text verification, in: International Conference on Document Analysis and Recognition (ICDAR), 2009, pp. 366–370.
  • [23] R. Minetto, N. Thome, M. Cord, N. J. Leite, J. Stolfi, T-hog: An effective gradient-based descriptor for single line text regions, Pattern Recognit. (2013) 1078–1090.
  • [24] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, R. Piramuthu, Region-based discriminative feature pooling for scene text recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 4050–4057.
  • [25] K. Wang, S. Belongie, Word spotting in the wild, in: European Conference on Computer Vision (ECCV), 2010, pp. 591–604.
  • [26] A. Bissacco, M. Cummins, Y. Netzer, H. Neven, Photoocr: Reading text in uncontrolled conditions, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 785–792.
  • [27] J. Bai, Z. Chen, B. Feng, B. Xu, Chinese image text recognition on grayscale pixels, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1380–1384.
  • [28] Z. Saidane, C. Garcia, Automatic scene text recognition using a convolutional neural network, International Workshop on Camera-Based Document Analysis and Recognition (CBDAR).
  • [29] Z. Saidane, C. Garcia, J. Dugelay, The image text recognition graph (itrg), in: Proc. Intl. Conf. on Multimedia and Expo, 2009, pp. 266–269.
  • [30] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng, Reading digits in natural images with unsupervised feature learning, Neural Information Processing Systems (NIPS).
  • [31] J. Bai, Z. Chen, B. Feng, B. Xu, Image character recognition using deep convolutional neural network learned from different languages, in: IEEE International Conference on Image Processing (ICIP), 2014, pp. 2560–2564.
  • [32] K. Elagouni, C. Garcia, F. Mamalet, P. Sébillot, Text recognition in multimedia documents: a study of two neural-based ocrs using and avoiding character segmentation, International Journal on Document Analysis and Recognition (IJDAR) (2014) 19–31.
  • [33] Z. Zhong, L. Jin, Z. Feng, Multi-font printed chinese character recognition using multi-pooling convolutional neural network, in: International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 96–100.
  • [34] K. Elagouni, C. Garcia, P. Sébillot, A comprehensive neural-based approach for text recognition in videos using natural language processing, in: International Conference on Multimedia Retrieval (ICMR), 2011, pp. 1–8.
  • [35] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, International Conference on Machine Learning (ICML) (2013) 1319–1327.
  • [36] A.-B. Wang, K.-C. Fan, Optical recognition of handwritten chinese characters by hierarchical radical matching method, Pattern Recognit. (2001) 15–35.
  • [37] J. Bai, Z. Chen, B. Feng, B. Xu, Chinese image character recognition using dnn and machine simulated training samples, in: International Conference on Artificial Neural Networks (ICANN), 2014, pp. 209–216.
  • [38] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, arXiv preprint arXiv:1406.2227.
  • [39] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
  • [40] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images.
  • [41] CNN configuration, http://code.google.com/p/cuda-convnet/source/browse/trunk/example-layers/layers-conv-local-11pct.cfg, (accessed 16.09.04) (2014).
  • [42] Layer parameters, https://code.google.com/p/cuda-convnet/source/browse/trunk/example-layers/layer-params-conv-local-11pct.cfg, (accessed 16.09.04) (2014).
  • [43] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, arXiv preprint arXiv:1312.6034.
  • [44] D. Erhan, Y. Bengio, A. Courville, P. Vincent, Visualizing higher-layer features of a deep network, Technical report, University of Montreal.
  • [45] C. Cortes, V. Vapnik, Support-vector networks, Machine learning (1995) 273–297.
  • [46] Y. Qu, W. Liao, S. Lu, S. Wu, Hierarchical Text Detection: From Word Level to Character Level, Springer, 2013.
  • [47] J. Sauvola, M. Pietikäinen, Adaptive document image binarization, Pattern Recognit. (2000) 225–236.
  • [48] B. Verma, A contour code feature based segmentation for handwriting recognition., in: International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 1203–1207.
  • [49] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, Icdar 2003 robust reading competitions, in: International Conference on Document Analysis and Recognition (ICDAR), 2003, p. 682.
  • [50] ABBYY FineReader 12, https://www.abbyy.com/finereader/, (accessed 16.09.04) (2016).
  • [51] Microsoft OCR library, https://code.msdn.microsoft.com/Uses-the-OCR-Library-to-2a9f5bf4, (accessed 16.09.04) (2014).
  • [52] V. I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Probl. Inf. Transmission (1965) 707–710.