AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

08/03/2020 ∙ by Wenhai Wang et al. ∙ SenseTime Corporation, Nanjing University, The University of Adelaide, The University of Hong Kong, and NetEase, Inc.

Scene text spotting aims to detect and recognize entire words or sentences with multiple characters in natural images. It is still challenging because ambiguity often occurs when the spacing between characters is large or the characters are evenly spread in multiple rows and columns, making many visually plausible groupings of the characters (e.g. "BERLIN" is incorrectly detected as "BERL" and "IN" in Fig. 1(c)). Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. The proposed AE TextSpotter has three important benefits. 1) The linguistic representation is learned together with the visual representation in one framework. To our knowledge, it is the first time text detection has been improved by using a language model. 2) A carefully designed language module is utilized to reduce the detection confidence of incorrect text lines, making them easily pruned in the detection stage. 3) Extensive experiments show that AE TextSpotter outperforms other state-of-the-art methods by a large margin. For example, we carefully select a validation set of extremely ambiguous samples from the IC19-ReCTS dataset, where our approach surpasses other methods by more than 4%. The training/validation list and evaluation scripts have been released at https://github.com/whai362/TDA-ReCTS.


1 Introduction

Text analysis in unconstrained scene images, including text detection and text recognition, is important in many applications, such as document recognition, license plate recognition, and text-based visual question answering. Although previous works [27, 33, 19, 20, 12, 13] have achieved great success, many challenges remain to be solved.

This work addresses one of the important challenges, which is reducing the ambiguous bounding box proposals in scene text detection. These ambiguous proposals widely occur when the spacing of the characters of a word is large or multiple text lines are juxtaposed in different rows or columns in an image. For example, as shown in Fig. 1(c)(d), the characters in an image can be grouped into multiple visually plausible words, making the detection results ambiguous and significantly hindering the accuracy of text detectors and text spotters.

Figure 1: The detection results of different methods in cases of ambiguity. Top row: an example of large character spacing. Bottom row: an example of juxtaposed text lines. (a) are original images. (b) are ground-truths. (c) are detection results of Mask TextSpotter [13]. (d) are our detection results.

Existing text spotters [19, 13, 22, 7] typically follow the pipeline shown in Fig. 2(a), where a text recognition module is stacked after a text detection module, and linguistic knowledge (e.g. a lexicon) is employed only for text recognition. However, their language-based modules are isolated from the detection modules, which detect text lines by learning visual features alone, leading to ambiguous proposals and hindering the accuracy of the entire system. As shown in Fig. 1(c), these vision-based text detectors are insufficient to detect text lines correctly in ambiguous samples (i.e. the horizontal bounding box versus the vertical bounding boxes in this example).

In contrast, unlike existing text detectors that only utilize visual knowledge, this work incorporates linguistic knowledge into text detection by learning linguistic representation to reduce ambiguous proposals. As shown in Fig. 2(b), the re-scoring step is not only the main difference between our approach and previous methods, but also the key step in removing ambiguity in text detection. Intuitively, without a natural language model, it is difficult to identify the correct text line in ambiguous scenarios. For example, prior art [13] detected incorrect text lines as shown in Fig. 1(c), neglecting the semantic meaning of the text. In our approach, the knowledge of natural language can be used to reduce such errors, as shown in Fig. 1(d): if a detected text line is correct, its content tends to follow the distribution of natural language; otherwise, its content falls outside that distribution. Therefore, the linguistic representation can help refine the detection results.

This paper proposes a novel framework, called Ambiguity Eliminating Text Spotter (AE TextSpotter) and shown in Fig. 2(b), to solve the ambiguity problem in text detection. The rationale of our method is to re-score the detection results by using linguistic representation, and to eliminate ambiguity by filtering out low-scored text line proposals. Specifically, AE TextSpotter consists of three important components: detection, recognition, and re-scoring.

Figure 2: Comparisons between (a) previous text spotters and (b) our AE TextSpotter. Different from previous methods [19, 13, 22, 7] where the modules of text detection and linguistic knowledge are isolated, our method lifts the language module ahead in text detection to re-score the detection results by learning linguistic representation.

In the detection step, we build a text detection module (TDM) to provide candidate text lines with high recall, in order to avoid missed detections. In this step, the number of candidates is very large because of the high recall rate. In the recognition step, we carefully design a character-based recognition module (CRM) to efficiently recognize the candidate text lines. In the re-scoring step, we propose a language module (LM) that learns linguistic representation to re-score the candidate text lines and eliminate ambiguity, making text lines that correspond to natural language score higher than those that do not. The three steps complement each other, enabling AE TextSpotter to successfully detect and recognize text lines even in extremely challenging scenarios. As shown in Fig. 1(d), AE TextSpotter can correctly detect text lines even when multiple vertical and horizontal text lines are present.

Our main contributions are summarized as follows. 1) AE TextSpotter identifies the importance of linguistic representation for text detection, in order to solve the text detection ambiguity problem. To our knowledge, this is the first time both visual and linguistic features are learned for text detection in one framework. 2) We design a novel text recognition module, the character-based recognition module (CRM), to achieve fast recognition of numerous candidate text lines. 3) Extensive experiments on several benchmarks demonstrate the effectiveness of AE TextSpotter. For example, AE TextSpotter achieves state-of-the-art performance on both text detection and recognition on IC19-ReCTS [32]. Moreover, we construct a hard validation set for benchmarking Text Detection Ambiguity, termed TDA-ReCTS (the training/validation list and evaluation scripts have been released at github.com/whai362/TDA-ReCTS), where AE TextSpotter achieves an F-measure of 81.39% and a 1-NED of 51.32%, outperforming its counterparts by 4.01% and 4.65%.

2 Related Work

Scene text detection

has been a research hotspot in computer vision for a long time, and deep learning based methods have become its mainstream. Tian et al. [27] and Liao et al. [14] successfully adapted object detection frameworks to text detection and achieved good performance on horizontal text detection. After that, many works [33, 24, 4, 15, 17] took the orientation of text lines into consideration and made it possible to detect arbitrarily-oriented text lines. Recently, curved text detection has attracted increasing attention, and segmentation-based methods [20, 12, 30, 31] have achieved excellent performance on curved text benchmarks. These methods have pushed the performance of text detection to a high level, but none of them can deal with the ambiguity problem in text detection. In this work, we introduce linguistic features into the text detection module to solve the text detection ambiguity problem.

Scene text recognition aims to decode a character sequence from variable-length text images. CRNN [25] introduced CTC [8] into a text recognition model, making the model trainable end-to-end. Following this pipeline, other CTC-based methods [18, 28] were proposed and achieved significant improvements in text recognition. Another direction of text recognition is the attention-based encoder-decoder framework [26, 2, 1], which focuses on one position and predicts the corresponding character at each time step. These attention-based methods have achieved high accuracy on various benchmarks. However, most existing methods are slow when recognizing a large number of candidate text lines in an image. In this paper, we design a character-based recognition module to solve this problem.

Scene text spotting aims to detect and recognize text lines simultaneously. Li et al. [11] proposed a framework for horizontal text spotting, which contains a text proposal network for text detection and an attention-based method for text recognition. FOTS [19] presented a method for oriented text spotting with a new RoIRotate operation, and significantly improved performance by jointly training the detection and recognition tasks. Mask TextSpotter [13] added character-level supervision to Mask R-CNN [9] to simultaneously identify text lines and characters, which is suitable for curved text spotting. Qin et al. [22] also developed a curved text spotter based on Mask R-CNN, and used a RoI masking operation to extract features for curved text recognition. However, in these methods, linguistic knowledge (e.g. a lexicon) is only used in post-processing, so it is isolated from network training. Different from them, the linguistic representation in our method is learned together with the visual representation in one framework, making it possible to eliminate ambiguity in text detection.

3 Proposed Method

3.1 Overall Architecture

Figure 3: The overall architecture of AE TextSpotter, which has three key modules: 1) Text Detection Module (TDM); 2) Character-based Recognition Module (CRM); 3) Language Module (LM).

Fig. 3 shows the overall architecture of AE TextSpotter, which consists of two vision-based modules and one language-based module, namely, the text detection module (TDM), the character-based recognition module (CRM), and the language module (LM). Among these modules, TDM and CRM aim to detect the bounding boxes and recognize the content of candidate text lines; and LM is applied to lower the scores of incorrect text lines by utilizing linguistic features, which is the key module to remove ambiguous samples.

In the forward phase, we first feed the input image to the backbone network (i.e., ResNet-50 [10] + FPN [16]) and the Region Proposal Network (RPN) [23] to produce text line proposals and character proposals. Secondly, based on the text line proposals, TDM predicts tight bounding boxes (i.e., rotated rectangles or polygons) and visual scores for candidate text lines (see Fig. 3(d)). Thirdly, using the character proposals and bounding boxes as input, CRM recognizes the text contents of the candidate text lines (see Fig. 3(e)). After that, we feed the text contents to LM to predict the linguistic scores of the candidate text lines. Then we combine the visual score S_v and the linguistic score S_l into the final score S as in Eqn. 1:

S = λ S_v + (1 − λ) S_l,    (1)

where λ is used to balance S_v and S_l. Finally, on the basis of the final scores S, we select the final bounding boxes and text contents of text lines (see Fig. 3(f)) by using NMS and removing low-score text lines.
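As a concrete illustration, the score fusion of Eqn. 1 can be sketched in plain Python. This is a minimal sketch: the function and variable names are ours, and we assume the convex-combination form of the equation with the balance weight 0.7 reported in Sec. 4.2.

```python
def fuse_scores(visual_scores, linguistic_scores, lam=0.7):
    """Combine the visual and linguistic scores of candidate text
    lines (Eqn. 1). `lam` balances the two cues; the paper sets it
    to 0.7, so visual evidence dominates, but linguistic evidence
    can still suppress implausible character groupings."""
    return [lam * s_v + (1.0 - lam) * s_l
            for s_v, s_l in zip(visual_scores, linguistic_scores)]

# A proposal with strong visual but weak linguistic support ends up
# scoring below one supported by both cues.
fused = fuse_scores([0.95, 0.90], [0.10, 0.85])
```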

The training process of AE TextSpotter is divided into two stages: 1) training the backbone network, RPN, TDM, and CRM end-to-end; 2) training LM with the predicted text lines and text contents. In the first stage, we fix the weights of LM, and optimize the backbone network, RPN, TDM, and CRM by a joint loss of L_rpn, L_tdm, and L_crm. In the second stage, we fix the weights of the backbone network, RPN, TDM, and CRM, and optimize LM by the loss function L_l.

3.2 Text Detection Module

Figure 4: The details of the detection head and the mask head in the text detection module (TDM). “Conv(3, 3)”, “FC” and “DConv” represent a convolution layer with a kernel size of 3×3, a fully connected layer, and a deconvolution layer, respectively. “×2” denotes a stack of two consecutive convolution layers.

The text detection module (TDM) aims to detect the candidate text lines with extremely high recall. Due to the requirement of high recall, TDM needs to be implemented by a regression-based method, whose recall can be controlled by the classification score threshold and the NMS threshold. In this work, following [31, 13, 22], TDM is modified based on Mask R-CNN [9]. For each text line, it predicts an axis-aligned rectangular bounding box and the corresponding segmentation mask.

As shown in Fig. 4, there are a detection head and a mask head in TDM. Firstly, we use RoIAlign [9] to extract feature patches (7×7×256) for the text line proposals predicted by RPN. Secondly, we feed the feature patches to the classification and regression branches in the detection head to predict visual scores and rectangular bounding boxes of text lines. Then, new feature patches (14×14×256) are extracted for the rectangular bounding boxes. After that, using the new feature patches as input, the mask head predicts the masks of text lines, whose contours are the tight bounding boxes. Finally, we use a loose score threshold and a loose NMS threshold to filter out redundant tight bounding boxes, and the remaining ones are the candidate text lines.
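The candidate filtering at the end of TDM can be sketched as follows. This is an illustrative sketch only: it uses axis-aligned boxes and a standard greedy NMS, whereas TDM actually produces rotated rectangles or polygons; all function names are hypothetical, and the loose thresholds (0.01 and 0.9) are the defaults from Sec. 4.2.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_candidates(boxes, scores, score_thr=0.01, nms_thr=0.9):
    """Keep high-recall candidates: drop boxes below a loose score
    threshold, then apply greedy NMS with a loose IoU threshold so
    near-duplicates are removed but ambiguous alternative groupings
    survive for the later re-scoring step. Returns kept indices."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
    return keep
```

With a tight NMS threshold (e.g. 0.5) many plausible alternatives would be discarded here, which is exactly what the loose setting avoids.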

3.3 Character-based Recognition Module

The character-based recognition module (CRM) is applied to efficiently recognize a large number of candidate text lines produced by TDM. CRM includes a detection head and a match-assemble (MA) algorithm. The detection head runs only once for each input image to detect and classify characters; and the proposed MA algorithm is used to group characters into text line transcriptions, which is rule-based and very fast. Therefore the CRM is very suitable for recognizing numerous candidate text lines predicted by TDM.

The detection head of CRM follows a pipeline similar to the detection head of TDM (see Fig. 4 (a)). Due to the large number of character categories (about 3,600 in IC19-ReCTS), the detection head of CRM needs a richer feature patch as input and a deeper classification branch. Therefore, in the detection head of CRM, the size of the feature patches extracted by RoIAlign is 14×14×256, and we increase the number of convolutional layers in the classification branch to 4. The output of the detection head is the rectangular bounding boxes and contents of characters.

After character detection and classification, we propose a match-assemble (MA) algorithm to obtain text contents. The procedure of the MA algorithm is presented in Fig. 5. For each candidate text line's bounding box b_t, we firstly match the predicted characters (see Fig. 5(a)) with the candidate text line by using Eqn. 2, and get the internal characters (see Fig. 5(c)) of the candidate text line:

C_t = { c | Inter(b_c, b_t) / Area(b_c) > T_match },    (2)

where Inter(·, ·), Area(·), and b_c are the intersection area of two boxes, the area of a box, and the bounding boxes of detected characters, respectively. Secondly, we assemble the internal characters according to their center positions. Specifically, we arrange the characters from left to right when the width of the external horizontal rectangle of the candidate text line is larger than its height, or else we arrange the characters from top to bottom. After the arrangement, we concatenate the character contents into the text content of the candidate text line.
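The match and assemble steps above can be sketched as follows. This is an illustrative sketch assuming axis-aligned boxes and hypothetical helper names; the match threshold 0.3 is taken from Sec. 4.2.

```python
def match_assemble(line_box, char_boxes, char_texts, t_match=0.3):
    """Match-assemble (MA): keep characters whose boxes overlap the
    candidate text line enough (Eqn. 2), order them along the reading
    direction, and concatenate their contents."""
    def inter(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        return ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    # Match: a character belongs to the line if Inter/Area > t_match.
    inside = [i for i, cb in enumerate(char_boxes)
              if inter(cb, line_box) / (area(cb) + 1e-9) > t_match]

    # Assemble: left-to-right for horizontal lines (width > height),
    # top-to-bottom otherwise, sorting by character centers.
    w = line_box[2] - line_box[0]
    h = line_box[3] - line_box[1]
    axis = 0 if w > h else 1  # sort by x-center or y-center
    inside.sort(key=lambda i: (char_boxes[i][axis]
                               + char_boxes[i][axis + 2]) / 2)
    return "".join(char_texts[i] for i in inside)
```

Because the character detector runs once per image and this grouping is purely rule-based, the per-line cost is negligible compared with running a sequence recognizer on every candidate.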

Figure 5: The details of the match-assemble (MA) algorithm. “Match” is the operation of matching character boxes with a text line box. “Assemble” is the operation of assembling the internal characters into a text line transcription.

3.4 Language Module

The language module (LM) is a language-based module that can re-score candidate text lines according to their contents. Specifically, LM predicts a linguistic score to determine whether the text content follows the distribution of natural language.

The pipeline of LM is presented in Fig. 6. At first, we use BERT [6], a widely-used pre-trained model in NLP, to extract sentence vectors for the input text content. Here, 16 is the maximum length of a text content, and we pad with 0 for text lines whose length is less than 16. Then we feed the sentence vectors to a classification network composed of BiGRU, AP, and FC layers to predict the linguistic scores of the input text contents. Here, BiGRU, AP, and FC represent a bidirectional gated recurrent unit [3], an average pooling layer, and a fully connected layer, respectively.

The label generation for LM is based on the Intersection over Union (IoU) metric. Concretely, for every candidate text line, we consider the label of its text content to be positive if and only if there is a matched ground truth whose IoU with the candidate text line is greater than a threshold (set to 0.8; see Sec. 4.2).
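This labeling rule can be sketched as follows (an axis-aligned IoU is used for brevity; the threshold 0.8 follows Sec. 4.2, and the function name is ours):

```python
def lm_label(candidate_box, gt_boxes, iou_thr=0.8):
    """Return 1 (positive) if some ground-truth box overlaps the
    candidate with IoU greater than the threshold, else 0."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / (union + 1e-9)
    return int(any(iou(candidate_box, g) > iou_thr for g in gt_boxes))
```

The high threshold means only candidates that almost exactly match a ground-truth line contribute positive language labels, so LM learns to separate real text lines from visually plausible but incorrect groupings.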

Figure 6: The details of the Language Module (LM). “BiGRU” and “FC” represent the bidirectional gated recurrent unit [3] and fully connected layer, respectively. “×2” denotes a stack of two consecutive layers.

3.5 Loss Function

As mentioned in Sec. 3.1, the training of AE TextSpotter has two stages. Correspondingly, the loss function has two parts: 1) the visual loss function L_v; 2) the linguistic loss function L_l.

The visual loss function L_v is used to optimize the backbone network, RPN, TDM, and CRM. It is a multi-task loss function that can be defined as Eqn. 3:

L_v = L_rpn + L_tdm + L_crm,    (3)

where L_rpn is the loss function of RPN as defined in [23]. L_tdm and L_crm are the loss functions of TDM and CRM respectively, which can be formulated as Eqn. 4 and Eqn. 5:

L_tdm = L_cls^tdm + L_reg^tdm + L_mask,    (4)
L_crm = L_cls^crm + L_reg^crm.    (5)

Here, L_cls^tdm and L_cls^crm are the loss functions of the classification branches in TDM and CRM respectively, which are implemented by cross-entropy losses. L_reg^tdm and L_reg^crm are the loss functions of the regression branches in TDM and CRM respectively, which are implemented by the smooth L1 loss. L_mask is the loss function of the mask head in TDM, which is the same as the mask loss in [9].

The linguistic loss function L_l is used to optimize LM. LM only involves the binary task of classifying correct and incorrect text lines among the candidates, so L_l is implemented by a cross-entropy loss.
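Assuming unit weights for each term (the paper states no per-term weighting), the composition of the visual loss can be sketched as:

```python
def visual_loss(l_rpn, l_tdm_cls, l_tdm_reg, l_tdm_mask,
                l_crm_cls, l_crm_reg):
    """L_v = L_rpn + L_tdm + L_crm (Eqns. 3-5), where
    L_tdm = cls + reg + mask and L_crm = cls + reg."""
    l_tdm = l_tdm_cls + l_tdm_reg + l_tdm_mask
    l_crm = l_crm_cls + l_crm_reg
    return l_rpn + l_tdm + l_crm
```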

4 Experiment

4.1 Datasets

AE TextSpotter needs character-level annotations in the training phase. In addition, text detection ambiguity is relatively rare in English due to its writing conventions. Therefore, we evaluate our method on a multi-language dataset with character-level annotations.

ICDAR 2019 ReCTS (IC19-ReCTS) [32] is a newly-released large-scale dataset that includes 20,000 training images and 5,000 testing images, covering multiple languages, such as Chinese, English, and Arabic numerals. The images in this dataset are mainly collected in scenes with signboards. All text lines and characters in this dataset are annotated with bounding boxes and transcripts.

TDA-ReCTS is a validation set for benchmarking text detection ambiguity, which contains 1,000 ambiguous images selected from the training set of IC19-ReCTS. To minimize the impact of subjective factors, we designed rules to pick out images with large character spacing or juxtaposed text lines, and then randomly sampled 1,000 images among them as the validation set.

When selecting ambiguous images, an image is regarded as a sample with large character spacing if at least one text line in the image has large character spacing. In addition, an image is treated as a sample with juxtaposed text lines if there is a pair of text lines aligned along the top, bottom, left, or right direction whose characters have similar scales. More details about both rules can be found in Appendix A.1.

4.2 Implementation Details

In the backbone network, ResNet-50 [10] is initialized with weights pre-trained on ImageNet [5]. In RPN, the aspect ratios of text line anchors are set to 1/8, 1/4, 1/2, 1, 2, 4, and 8 to match the extreme aspect ratios of text lines, and the aspect ratios of character anchors are set to 1/2, 1, and 2. All other settings in RPN are the same as in Mask R-CNN. In TDM, the classification branch makes a binary classification of text and non-text, and the two post-processing thresholds (i.e., the score threshold and the NMS threshold) are set to 0.01 and 0.9, respectively. In CRM, the match threshold is set to 0.3. In LM, candidate text lines are classified into correct and incorrect ones, and the training label threshold is set to 0.8. When combining the visual and linguistic scores, the balance weight in Eqn. 1 is set to 0.7. In the post-processing, the NMS threshold is set to 0.1 and the classification score threshold is set to 0.6.

As mentioned in Sec. 3.1, the training of AE TextSpotter is divided into two phases. In the first phase, we train the backbone network, RPN, TDM, and CRM with a batch size of 16 on 8 GPUs for 12 epochs; the initial learning rate is set to 0.02 and divided by 10 at the 8th and 11th epochs. In the second phase, we fix the weights of the backbone network, RPN, TDM, and CRM, and train LM with a batch size of 16 on 8 GPUs for 12 epochs; the initial learning rate is set to 0.2 and divided by 10 at the 8th and 11th epochs. We optimize all models using stochastic gradient descent (SGD) with a momentum of 0.9 and weight decay.
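The step schedule used in both phases can be sketched as follows (the function name is ours; base_lr is 0.02 for the first phase and 0.2 for the second):

```python
def learning_rate(epoch, base_lr, milestones=(8, 11), gamma=0.1):
    """Step schedule from Sec. 4.2: the learning rate starts at
    base_lr and is divided by 10 at the 8th and 11th epochs of the
    12-epoch run."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```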

In the training phase, we ignore the blurred text regions labeled as “DO NOT CARE” in all datasets; and apply random scale, random horizontal flip, and random crop on training images.

4.3 Ablation Study

The proportion of ambiguous text lines. Text detection ambiguity is caused by large character spacing and juxtaposed text lines. To measure the proportion of these two problematic cases, we use the rules mentioned in Sec. 4.1 to count the ambiguous text lines in the IC19-ReCTS training set and TDA-ReCTS. As shown in Table 1, ambiguous text lines account for 5.97% of all text lines in the IC19-ReCTS training set, which reveals that text detection ambiguity is not rare in natural scenes. Moreover, the proportion of ambiguous text lines in TDA-ReCTS reaches 25.80%. Although this proportion is much larger than in the IC19-ReCTS training set, normal text lines are still in the majority.

Type                    | IC19-ReCTS Training Set | TDA-ReCTS
                        | Number  | Proportion    | Number | Proportion
Large Character Spacing | 1,241   | 1.14%         | 589    | 7.07%
Juxtaposed Text Lines   | 5,379   | 4.94%         | 1,615  | 19.39%
Union of Two Categories | 6,509   | 5.97%         | 2,148  | 25.80%
All                     | 108,963 | -             | 8,327  | -
Table 1: The proportion of text lines with the problem of text detection ambiguity.

The high recall of the text detection module. To verify the high recall of the text detection module (TDM), we compare the recall and the number of candidate text lines per image on TDA-ReCTS under different score thresholds and NMS thresholds. As shown in Table 2, when the score threshold is 0 and the NMS threshold is 1, the recall of TDM reaches the upper bound of 0.971. However, in this case, the number of candidate text lines per image is 971.5, which is too large to be recognized. When the score threshold is 0.01 and the NMS threshold is 0.9, the recall of TDM is 0.969, slightly lower than the upper bound, but the number of candidate text lines per image is reduced to 171.7. If we use common post-processing thresholds (i.e., a score threshold of 0.5 and an NMS threshold of 0.5), the recall of TDM is only 0.852, which results in many missed text lines. Therefore, in our experiments, the score threshold and the NMS threshold are set to 0.01 and 0.9 by default, which achieves a high recall without too many candidate text lines.

The effectiveness of the character-based recognition module. The character-based recognition module (CRM) aims to quickly recognize the numerous candidate text lines predicted by TDM. To test the speed of CRM, we compare the time cost per text line of CRM and mainstream recognizers (e.g., CRNN [25] and ASTER [26]). For a fair comparison, we run these methods on a 1080Ti GPU and report their time cost per image on the IC19-ReCTS test set. Note that the input text images of CRNN and ASTER are scaled to a height of 32 when testing. In AE TextSpotter, the recognition module is used to recognize the candidate text lines, whose number reaches 171.7 per image. Table 3 shows the speed of the different methods. CRNN and ASTER cost 966.7 ms and 4828.2 ms respectively to recognize all candidate text lines in each image. The time cost of CRM on each image is only 122.3 ms, so CRM runs much faster than CRNN and ASTER. In addition, using the same detection results predicted by AE TextSpotter, our CRM achieves a 1-NED of 71.81 on the IC19-ReCTS test set, which is better than CRNN and ASTER.

Score Thr | NMS Thr | Recall | Number of Candidates
0         | 1       | 0.971  | 971.5
0.01      | 0.9     | 0.969  | 171.7
0.5       | 0.5     | 0.852  | 11.1
Table 2: The recall of TDM and the number of candidate text lines per image under different post-processing thresholds.

Method     | Time Cost | 1-NED
CRNN [25]  | 966.7 ms  | 68.32
ASTER [26] | 4828.2 ms | 58.49
CRM (ours) | 122.3 ms  | 71.81
Table 3: The time cost per image and 1-NED of different recognizers on the IC19-ReCTS test set.

Figure 7: The distributions of detection results predicted by models with and without LM. The points in red dotted boxes are false positives with high scores.

The effectiveness of the language module. To investigate the effectiveness of the language module (LM), we compare the performance of the models with and without LM on TDA-ReCTS. For a fair comparison, we build a model without LM by removing LM from AE TextSpotter in the training phase and setting the balance weight in Eqn. 1 to 1.0 in the testing phase. As shown in Table 4, on TDA-ReCTS, the model with LM obtains an F-measure of 81.39% and a 1-NED of 51.32%, significantly surpassing the model without LM by 3.46% and 3.57%.

In addition, we analyze the quality of the detection results predicted by the models with and without LM on TDA-ReCTS. Generally, the quality of a bounding box can be measured by its IoU with the matching ground truth, and a bounding box is considered to be correct if its IoU exceeds 0.5. As shown in Fig. 7(a), the detection results of the model without LM contain many incorrect results (IoU < 0.5), even though their final scores are greater than 0.8. This problem is alleviated in the model with LM (see Fig. 7(b)), where there are few detection results with high scores but low IoU. These experimental results demonstrate the power of LM.

4.4 Comparisons with State-of-the-Art Methods

To show the effectiveness of the proposed AE TextSpotter, we compare it with state-of-the-art methods on TDA-ReCTS and IC19-ReCTS test set.

On TDA-ReCTS, we compare the proposed AE TextSpotter with several representative methods. All results are obtained by using the authors' or officially released codebases, and these methods are trained on the 19,000 training images in IC19-ReCTS (excluding the 1,000 images in TDA-ReCTS). In the testing phase, we scale the short side of images to 800, and evaluate the performance of these methods on TDA-ReCTS by the evaluation metric defined in [32]. Table 4 shows the performance of these methods. The proposed AE TextSpotter achieves an F-measure of 81.39% and a 1-NED of 51.32%, outperforming the counterparts by at least 4.01% and 4.65% respectively. Here, 1-NED denotes the normalized edit distance, a recognition metric defined in [32]. Note that the proportion of ambiguous text lines in TDA-ReCTS is only 25.80%, and we believe the improvement would be more significant if the proportion were larger.

On the IC19-ReCTS test set, we train models on the complete training set of IC19-ReCTS. As shown in Table 5, our AE TextSpotter achieves an F-measure of 91.80% and a 1-NED of 71.81%, surpassing other methods. Moreover, when we enlarge the scale of input images (short side of 1000), the F-measure and 1-NED of our method are boosted to 91.86% and 73.12%. Without extra datasets, the single-scale detection performance of our method is comparable to that of the methods using extra datasets, multi-scale testing, and model ensembles (91.86% vs. 92.99%). Unlike the detection task, the multi-language recognition task relies more heavily on extra datasets. Therefore, our recognition performance is lower than that of the methods using extra datasets, multi-scale testing, and ensembles.

These results demonstrate that the proposed AE TextSpotter can correctly detect and recognize multi-language text lines, even in ambiguous scenarios. Some qualitative results on TDA-ReCTS are shown in Fig. 8.

Method                | Venue   | Backbone | P     | R     | F     | 1-NED
EAST [33]             | CVPR'17 | VGG16    | 70.58 | 61.59 | 65.78 | -
PSENet [29]           | CVPR'19 | ResNet50 | 72.14 | 65.53 | 68.68 | -
FOTS [19]             | CVPR'18 | ResNet50 | 68.20 | 70.00 | 69.08 | 34.87
Mask TextSpotter [13] | TPAMI   | ResNet50 | 80.07 | 74.86 | 77.38 | 46.67
Detection Only        | -       | ResNet50 | 79.58 | 74.93 | 77.19 | -
AE TextSpotter w/o LM | -       | ResNet50 | 79.18 | 76.72 | 77.93 | 47.74
AE TextSpotter        | -       | ResNet50 | 84.78 | 78.28 | 81.39 | 51.32
Table 4: The single-scale results on TDA-ReCTS. “P”, “R”, “F” and “1-NED” mean the precision, recall, F-measure, and normalized edit distance [32], respectively.
Method                    | Venue       | Backbone | P     | R     | F     | 1-NED
Tencent-DPPR Team* [32]   | Competition | -        | 93.49 | 92.49 | 92.99 | 81.45
SANHL_v1* [32]            | Competition | -        | 91.98 | 93.86 | 92.91 | 81.43
HUST_VLRGROUP* [32]       | Competition | -        | 91.87 | 92.36 | 92.12 | 79.38
EAST [33]                 | CVPR'17     | VGG16    | 74.25 | 73.65 | 73.95 | -
PSENet [29]               | CVPR'19     | ResNet50 | 87.31 | 83.86 | 85.55 | -
FOTS [19]                 | CVPR'18     | ResNet50 | 78.25 | 82.49 | 80.31 | 50.83
Mask TextSpotter [13]     | TPAMI       | ResNet50 | 89.30 | 88.77 | 89.04 | 67.79
Detection Only            | -           | ResNet50 | 90.72 | 90.73 | 90.72 | -
AE TextSpotter w/o LM     | -           | ResNet50 | 91.54 | 90.78 | 91.16 | 70.95
AE TextSpotter (800)      | -           | ResNet50 | 92.60 | 91.01 | 91.80 | 71.81
AE TextSpotter (1000)     | -           | ResNet50 | 92.28 | 91.45 | 91.86 | 73.12
Table 5: The single-scale results on the IC19-ReCTS test set. “P”, “R”, “F” and “1-NED” represent the precision, recall, F-measure, and normalized edit distance, respectively. “*” denotes the methods in the competition [32], which use extra datasets, multi-scale testing, and model ensembles. “800” means that the short side of input images is scaled to 800.

4.5 Time Cost Analysis of AE TextSpotter

We analyze the time cost of all modules in AE TextSpotter. Specifically, we evaluate our method on images in TDA-ReCTS and calculate the average time cost per image. All results are tested with PyTorch [21] with a batch size of 1 on one 1080Ti GPU in a single thread. Table 6 shows the time cost of the backbone, RPN, TDM, CRM, LM, and post-processing, respectively. Among them, the detection module takes up most of the execution time, because it needs to predict a mask for every candidate text line. The entire framework runs at 1.3 FPS when the short side of input images is 800.

Scale | FPS | Backbone | RPN  | TDM   | CRM   | LM    | Post-proc
800   | 1.3 | 31.3     | 35.1 | 440.0 | 122.3 | 118.6 | 3.8
1000  | 1.2 | 63.1     | 42.6 | 487.2 | 127.8 | 121.1 | 3.9
Table 6: The time cost (ms per image) of all modules in AE TextSpotter.

Figure 8: Visualization results of AE TextSpotter on TDA-ReCTS.

4.6 Discussion

As demonstrated in the previous experiments, the proposed AE TextSpotter works well in most cases, including scenarios of text detection ambiguity. Nevertheless, there is still room for improvement. 1) The recognition performance of our method is not high (only 51.32% in Table 4), suppressing the effectiveness of LM. 2) TDA-ReCTS is selected by rules that only give a vague description of ambiguity in text detection; therefore, a manually selected dataset is important for further exploring text detection ambiguity. 3) The proposed AE TextSpotter is a baseline for removing text detection ambiguity, and it is still worthwhile to explore better ways to combine visual and linguistic representation.

Although our AE TextSpotter works well in most cases, it still fails in some intricate images, such as those with text-like regions, strange character arrangements, and texts in fancy styles. More details about these failure cases can be found in Appendix A.2.

5 Conclusion & Future Work

In this paper, we first conducted a detailed analysis of the ambiguity problem in text detection, and revealed the importance of linguistic representation in solving this problem. Then we proposed a novel text spotter, termed AE TextSpotter, which introduces linguistic representation to eliminate ambiguity in text detection. For the first time, linguistic representation is utilized in scene text detection to deal with the problem of text detection ambiguity. Concretely, we proposed the Language Module (LM), which learns to re-score text proposals using linguistic representation. LM can effectively lower the scores of incorrect text lines while improving the scores of correct proposals. Furthermore, a new validation set is proposed for benchmarking the ambiguity problem. Extensive experiments demonstrate the advantages of our method, especially in scenarios of text detection ambiguity.
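The re-scoring idea behind LM can be summarized as blending a proposal's visual detection confidence with a linguistic plausibility score. The mixing weight and score values below are illustrative assumptions for a minimal sketch, not the exact formulation or constants from the paper.

```python
def rescore(visual_score: float, linguistic_score: float, lam: float = 0.5) -> float:
    """Blend visual and linguistic confidences for a text proposal.

    `lam` is a hypothetical mixing weight; both inputs are assumed to lie
    in [0, 1], the linguistic score reflecting how plausible a language
    model finds the recognized text.
    """
    return (1.0 - lam) * visual_score + lam * linguistic_score

# The ambiguous grouping "BERL" looks fine visually but is linguistically
# implausible, so its blended score drops below the correct "BERLIN".
berl = rescore(visual_score=0.92, linguistic_score=0.10)
berlin = rescore(visual_score=0.90, linguistic_score=0.95)
```

Under this blending, incorrect groupings such as "BERL" are easily pruned by the usual score threshold in the detection stage, which is the effect the paper attributes to LM.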

In the future, we will strive to build a more efficient and fully end-to-end network for ambiguous text spotting, and plan to build a large-scale and more challenging dataset which is full of ambiguous text samples.

Acknowledgments

This work is supported by the Natural Science Foundation of China under Grant 61672273 and Grant 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant BK20160021, and Scientific Foundation of State Grid Corporation of China (Research on Ice-wind Disaster Feature Recognition and Prediction by Few-shot Machine Learning in Transmission Lines). Chunhua Shen and his employer received no financial support for the research, authorship and publication of this paper.

References

  • [1] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5076–5084. Cited by: §2.
  • [2] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) Aon: towards arbitrarily-oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5571–5579. Cited by: §2.
  • [3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: Figure 6, §3.4, §3.5.
  • [4] D. Deng, H. Liu, X. Li, and D. Cai (2018) Pixellink: detecting scene text via instance segmentation. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.2.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.4.
  • [7] W. Feng, W. He, F. Yin, X. Zhang, and C. Liu (2019) TextDragon: an end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9076–9085. Cited by: Figure 2, §1.
  • [8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §2.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2, §3.2, §3.2, §3.5.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1, §4.2.
  • [11] H. Li, P. Wang, and C. Shen (2017) Towards end-to-end text spotting with convolutional recurrent neural networks. arXiv preprint arXiv:1707.03985. Cited by: §2.
  • [12] X. Li, W. Wang, W. Hou, R. Liu, T. Lu, and J. Yang (2018) Shape robust text detection with progressive scale expansion network. arXiv preprint arXiv:1806.02559. Cited by: §1, §2.
  • [13] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai (2019) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE transactions on pattern analysis and machine intelligence. Cited by: Figure 1, Figure 2, §1, §1, §1, §2, §3.2, Table 4, Table 5, Figure 10, Figure 11, §6.3.
  • [14] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu (2017) TextBoxes: a fast text detector with a single deep neural network.. In AAAI, pp. 4161–4167. Cited by: §2.
  • [15] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. IEEE transactions on image processing 27 (8), pp. 3676–3690. Cited by: §2.
  • [16] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.1.
  • [17] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu (2019) Pyramid mask text detector. arXiv preprint arXiv:1903.11800. Cited by: §2.
  • [18] W. Liu, C. Chen, K. K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition.. In BMVC, Vol. 2, pp. 7. Cited by: §2.
  • [19] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) Fots: fast oriented text spotting with a unified network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5676–5685. Cited by: Figure 2, §1, §1, §2, Table 4, Table 5, Figure 10, Figure 11, §6.3.
  • [20] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §1, §2.
  • [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §4.5.
  • [22] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao (2019) Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4704–4714. Cited by: Figure 2, §1, §2, §3.2.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.1, §3.5.
  • [24] B. Shi, X. Bai, and S. Belongie (2017) Detecting oriented text in natural images by linking segments. arXiv preprint arXiv:1703.06520. Cited by: §2.
  • [25] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304. Cited by: §2, §4.3, §4.3.
  • [26] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §4.3, §4.3.
  • [27] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao (2016) Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pp. 56–72. Cited by: §1, §2.
  • [28] J. Wang and X. Hu (2017) Gated recurrent convolution neural network for ocr. In Advances in Neural Information Processing Systems, pp. 335–344. Cited by: §2.
  • [29] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9336–9345. Cited by: Table 4, Table 5.
  • [30] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.
  • [31] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2019) Scene text detection with supervised pyramid context network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9038–9045. Cited by: §2, §3.2.
  • [32] R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang, et al. (2019) ICDAR 2019 robust reading challenge on reading chinese text on signboard. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1577–1581. Cited by: §1, §4.1, §4.4, Table 4, Table 5, §6.1, §6.3.
  • [33] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. arXiv preprint arXiv:1704.03155. Cited by: §1, §2, Table 4, Table 5.

6 Appendix

6.1 Ambiguous Images Selection

To ensure the objectivity of the TDA-ReCTS validation set, we designed two rules to select ambiguous images from the training set of IC19-ReCTS [32], and then randomly sampled 1,000 of them as the validation set.

For an image, we denote its text lines as {T_1, T_2, …, T_n}, and the internal characters of the i-th text line as C_i = {c^i_1, c^i_2, …, c^i_{m_i}}.

An image is regarded as a sample with large character spacing if at least one text line in the image has large character spacing. The character spacing of a text line is considered large if it satisfies Eqn. 6,

(6)

where c^i_j means the j-th character in the internal characters C_i, D(·, ·) denotes the Euclidean distance between the center points of two character boxes, and S(·) signifies the scale of a character box, which is calculated as the square root of the box area.
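The large-character-spacing test described above can be sketched as follows, using axis-aligned character boxes. The threshold value is a placeholder, since the exact constant in Eqn. 6 is the authors' choice and is not reproduced here.

```python
import math

def center(box):
    """Center point of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def scale(box):
    """Scale of a character box: square root of its area."""
    x1, y1, x2, y2 = box
    return math.sqrt(max(x2 - x1, 0) * max(y2 - y1, 0))

def has_large_spacing(char_boxes, thresh=2.0):
    """True if some adjacent character pair is far apart relative to
    character scale. `thresh` is an illustrative constant."""
    for a, b in zip(char_boxes, char_boxes[1:]):
        (ax, ay), (bx, by) = center(a), center(b)
        dist = math.hypot(bx - ax, by - ay)
        ref = max(scale(a), scale(b))
        if ref > 0 and dist / ref > thresh:
            return True
    return False
```

Normalizing the center distance by the character scale makes the rule invariant to image resolution, so the same threshold applies to small and large text alike.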

Moreover, an image is considered as a sample with juxtaposed text lines if the image has a pair of text lines aligned to the top, bottom, left, or right, and the characters in them have similar scales. Two text lines (i.e., T_p and T_q) are aligned if they satisfy Eqn. 7. The characters in two text lines (i.e., C_p and C_q) have similar scales if they satisfy Eqn. 8.

(7)
(8)

Here, m_i is the number of internal characters of the i-th text line, and the alignment term represents the absolute difference of the top/bottom/left/right coordinates of the two text line boxes.
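The alignment and similar-scale tests of Eqns. 7 and 8 can be sketched as follows. The tolerance constants are assumptions for illustration; the paper's exact thresholds are not reproduced here.

```python
def aligned(box_a, box_b, tol=5.0):
    """Two line boxes are aligned if any of their top/bottom/left/right
    edges nearly coincide. `tol` (pixels) is an illustrative tolerance."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    diffs = [abs(ay1 - by1),   # top edges
             abs(ay2 - by2),   # bottom edges
             abs(ax1 - bx1),   # left edges
             abs(ax2 - bx2)]   # right edges
    return min(diffs) < tol

def similar_scales(scales_a, scales_b, ratio=1.5):
    """Characters in two lines have similar scales if their mean
    character scales differ by less than `ratio` (an assumed bound)."""
    ma = sum(scales_a) / len(scales_a)
    mb = sum(scales_b) / len(scales_b)
    return max(ma, mb) / min(ma, mb) < ratio

def juxtaposed(box_a, box_b, scales_a, scales_b):
    """An image pairwise test for juxtaposed text lines."""
    return aligned(box_a, box_b) and similar_scales(scales_a, scales_b)
```

Requiring both alignment and similar character scales filters out incidental edge coincidences between unrelated text lines, which is what makes the selected samples genuinely ambiguous.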

6.2 Failure Analysis

As demonstrated in the previous experiments, the proposed AE TextSpotter works well in most cases, but it still fails in some intricate images, such as those with text-like regions (see Fig. 9(a)), strange character arrangements (see Fig. 9(b)), and texts in fancy styles (see Fig. 9(c)). Most text-like regions can be removed with a high classification score threshold, but some erroneous detections are still inevitable. In Fig. 9(b), the characters are arranged in an "X" shape, which does not follow common writing habits; therefore, it is difficult to detect text lines with extremely strange character arrangements. Detecting text lines in fancy styles is feasible; however, due to the fancy styles, it is hard to correctly recognize the text contents of these lines.

Figure 9: Failure examples.

6.3 Visual Comparisons between AE TextSpotter and State-of-the-Art Methods

In this section, we demonstrate the effectiveness of our AE TextSpotter by comparing the visual results of different methods. Specifically, we visualize and analyze the results predicted by Mask TextSpotter [13], FOTS [19], and our AE TextSpotter on TDA-ReCTS. For a fair comparison, all methods are trained on the 19,000 training images in IC19-ReCTS [32] (excluding the 1,000 images in TDA-ReCTS). In the testing phase, we scale the short side of test images to 800. Fig. 10 and Fig. 11 show examples of large character spacing and juxtaposed text lines, respectively. In these examples, our method is clearly better than Mask TextSpotter and FOTS. From these results, we can see that the proposed AE TextSpotter has the following abilities:

  • Detecting and recognizing text lines with large character spacing;

  • Detecting and recognizing juxtaposed text lines;

  • Detecting and recognizing text lines with various orientations;

  • Detecting and recognizing multi-language text lines (e.g., English, Chinese, and Arabic numerals);

  • Thanks to its strong feature representation, our method is also robust to complex and unstable illumination, different colors, and variable scales.

Figure 10: Examples of text lines with large character spacing. (a) are original images. (b) are ground-truths. (c) are results of Mask TextSpotter [13]. (d) are results of FOTS [19]. (e) are results of our AE TextSpotter.

Figure 11: Examples of juxtaposed text lines. (a) are original images. (b) are ground-truths. (c) are results of Mask TextSpotter [13]. (d) are results of FOTS [19]. (e) are results of our AE TextSpotter.