
Character Region Attention For Text Spotting

by   Youngmin Baek, et al.

A scene text spotter is composed of text detection and recognition modules. Many studies have unified these modules into an end-to-end trainable model to achieve better performance. A typical architecture places the detection and recognition modules in separate branches, and RoI pooling is commonly used to let the branches share a visual feature. However, there is still room to establish a more complementary connection between the modules when adopting a recognizer that uses an attention-based decoder and a detector that represents the spatial information of character regions. This is possible because the two modules share a common sub-task: finding the location of the character regions. Based on this insight, we construct a tightly coupled single-pipeline model. The architecture is formed by utilizing detection outputs in the recognizer and propagating the recognition loss through the detection stage. The use of the character score map helps the recognizer attend better to the character center points, and propagating the recognition loss to the detector module enhances the localization of character regions. In addition, a strengthened sharing stage allows feature rectification and boundary localization of arbitrary-shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available straight and curved benchmark datasets.



1 Introduction

Figure 1: Concept of the proposed method. The character region feature from the detector is used as an input character attention feature to the recognizer. Having a tightly coupled architecture lets the recognition loss flow through the whole network.

Scene text spotting, which includes text detection and recognition, has recently attracted much attention because of its variety of applications in instant translation, image retrieval, and scene parsing. Although existing text detectors and recognizers work well on horizontal texts, spotting curved text instances in scene images remains a challenge.

To spot curved texts in an image, a classic method is to cascade existing detection and recognition models so that each handles curved text instances on its own side. The detectors [32, 31, 2] attempt to capture the geometric attributes of curved texts by applying complicated post-processing techniques, and the recognizers apply multi-directional encoding [7] or rectification modules [37, 46, 11] to improve accuracy on curved texts.

As deep learning advanced, research has been conducted to combine detectors and recognizers into a jointly trainable end-to-end network [14, 29]. Having a unified model not only improves the size and speed of the model, but also helps the model learn a shared feature that raises overall performance. To benefit from this property, attempts have also been made to handle curved text instances using an end-to-end model [32, 34, 10, 44]. However, most existing works only adopt RoI pooling to share low-level feature layers between the detection and recognition branches. In the training phase, instead of training the whole network, only the shared feature layers are trained using both detection and recognition losses.

As shown in Fig. 1, we propose a novel end-to-end Character Region Attention For Text Spotting model, referred to as CRAFTS. Instead of isolating the detection and recognition modules in two separate branches, we form a single pipeline by establishing a complementary connection between the modules. We observe that a recognizer [1] using an attention-based decoder and a detector [2] encapsulating character spatial information share a common sub-task, which is to localize character regions. By tightly integrating the two modules, the outputs from the detection stage help the recognizer attend better to the character center points, and the loss propagated from the recognizer to the detector stage enhances the localization of the character regions. Furthermore, the network is able to maximize the quality of the feature representation used in the common sub-task. To the best of our knowledge, this is the first end-to-end work that constructs a tightly coupled loss propagation.

Our contributions are summarized as follows: (1) We propose an end-to-end network that can detect and recognize arbitrary-shaped texts. (2) We construct a complementary relationship between the modules by utilizing spatial character information from the detector in the rectification and recognition modules. (3) We establish a single pipeline by propagating the recognition loss throughout all the features in the network. (4) We achieve state-of-the-art performance on the IC13, IC15, IC19-MLT, and TotalText [20, 19, 33, 5] datasets, which contain numerous horizontal, curved, and multilingual texts.

2 Related Work

Text detection and recognition methods. Detection networks use regression-based [16, 24, 25, 48] or segmentation-based [9, 31, 43, 45] methods to produce text bounding boxes. Some recent methods such as [17, 26, 47] take Mask-RCNN [13] as the base network and gain advantages from both regression and segmentation methods by employing multi-task learning. In terms of the unit of text detection, all methods can also be sub-categorized by their use of word-level or character-level [16, 2] predictions.

Text recognizers typically adopt a CNN-based feature extractor and an RNN-based sequence generator, and are categorized by their sequence generators: connectionist temporal classification (CTC) [35] and attention-based sequential decoders [21, 36]. A detection model provides information about the text regions, but it is still a challenge for the recognizer to extract useful information from arbitrary-shaped texts. To help recognition networks handle irregular texts, some studies [36, 27, 37] utilize a spatial transformer network (STN) [18]. The works in [11, 46] further extend the use of STN by executing the rectification method iteratively. These studies show that running STN recursively helps the recognizer extract useful features from extremely curved texts. In [28], a Recurrent RoIWarp Layer was proposed to crop individual characters before recognizing them. The work shows that the task of finding a character region is closely related to the attention mechanism used in the attention-based decoder.

One way to construct a text spotting model is to place detection and recognition networks sequentially. A well-known two-stage architecture couples the TextBoxes++ [24] detector with the CRNN [35] recognizer. Despite its simplicity, the method achieves favorable results.

End-to-end using RNN-based recognizer. EAA [14] and FOTS [29] are end-to-end models based on the EAST detector [49]. The difference between these two networks lies in the recognizer: the FOTS model uses a CTC decoder [35], and the EAA model uses an attention decoder [36]. Both works implement an affine transformation layer to pool the shared feature. The affine transformation works well on horizontal texts, but shows limitations when handling arbitrary-shaped texts. TextNet [42] proposed a spatial-aware text recognizer with perspective-RoI transformation in the feature pooling layer. The network keeps an RNN layer to recognize a sequence of text in the 2D feature map, but because quadrangles have limited expressiveness, the network still shows limitations when detecting curved texts.

Qin et al. [34] proposed a Mask-RCNN [13] based end-to-end network. Given the box proposals, features are pooled from the shared layer and the RoI-masking layer is used to filter out background clutter. The method improves performance by ensuring attention only in the text region. Busta et al. proposed the Deep TextSpotter [3] network and extended their work in E2E-MLT [4]. The network is composed of an FPN-based detector and a CTC-based recognizer. The model predicts multiple languages in an end-to-end manner.

End-to-end using CNN-based recognizer. Most CNN-based models that recognize text at the character level have advantages when handling arbitrary-shaped texts. MaskTextSpotter [32] is a model that recognizes text using a segmentation approach. Although it has strengths in detecting and recognizing individual characters, the network is difficult to train since character-level annotations are usually not provided in public datasets. CharNet [44] is another segmentation-based method that makes character-level predictions. The model is trained in a weakly-supervised manner to overcome the lack of character-level annotations. During training, the method performs iterative character detection to create pseudo-ground-truths.

While segmentation-based recognizers have shown great success, they struggle when the number of target characters increases. Segmentation-based models require more output channels as the character set grows, and this increases memory requirements. The journal version of MaskTextSpotter [23] expands the character set to handle multiple languages, but the authors added an RNN-based decoder instead of using their initially proposed CNN-based recognizer. Another limitation of segmentation-based recognizers is the lack of contextual information in the recognition branch. Without sequential modeling such as RNNs, the accuracy of the model drops on noisy images.

TextDragon [10] is another segmentation-based method that localizes and recognizes text instances. However, a predicted character segment is not guaranteed to cover a single character region. To solve the issue, the model incorporates CTC to remove overlapping characters. The network shows good detection performance but shows limitations in the recognizer due to the lack of sequential modeling.

3 Methodology

3.1 Overview

The proposed CRAFTS network can be divided into three stages: the detection stage, the sharing stage, and the recognition stage. A detailed pipeline of the network is illustrated in Fig. 2. The detection stage takes an input image and localizes oriented text boxes. The sharing stage then pools backbone high-level features and detector outputs. The pooled features are rectified using the rectification module and concatenated to form a character attended feature. In the recognition stage, an attention-based decoder predicts text labels using the character attended feature. Finally, a simple post-processing technique is optionally used for better visualization.

Figure 2: Schematic overview of CRAFTS pipeline.

3.2 Detection Stage

The CRAFT detector [2] is selected as the base network because of its capability to represent semantic information of the character regions. The outputs of the CRAFT network represent the center probability of character regions and the linkage between them. We expect that this character centeredness information can support the attention module in the recognizer, since both modules aim to localize the center positions of characters. In this work, we make three changes to the original CRAFT model: backbone replacement, link representation, and orientation estimation.

Backbone replacement. Recent studies show that ResNet50 captures well-defined feature representations for both the detector and the recognizer [30, 1]. We therefore replace the backbone of the network, changing it from VGG-16 [40] to ResNet50 [15].

Link representation. Vertical text is uncommon in Latin scripts, but it is frequently found in East Asian languages such as Chinese, Japanese, and Korean. In this work, a binary center line is used to connect sequential character regions. This change was made because employing the original affinity maps on vertical texts often produced an ill-posed perspective transformation that generated invalid box coordinates. To generate the ground-truth link map, a line segment with thickness $t$ is drawn between adjacent characters. Here, $t = \max((d_1 + d_2)/2 \cdot \alpha,\ 1)$, where $d_1$ and $d_2$ are the diagonal lengths of the adjacent character boxes and $\alpha$ is the scaling coefficient. This makes the width of the center line proportional to the size of the characters. We set $\alpha$ to 0.1 in our implementation.
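The thickness rule for the ground-truth center line can be illustrated with a short sketch. It assumes axis-aligned character boxes given as `(x_min, y_min, x_max, y_max)` and assumes the thickness is the mean diagonal of the two boxes scaled by the coefficient and floored at one pixel; the function names are hypothetical and are not taken from the authors' code.

```python
import math


def diagonal(box):
    """Diagonal length of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return math.hypot(x1 - x0, y1 - y0)


def link_thickness(box_a, box_b, alpha=0.1):
    """Thickness of the center line joining two adjacent character boxes.

    Assumed form: mean diagonal scaled by alpha, never thinner than 1 pixel.
    """
    mean_diag = (diagonal(box_a) + diagonal(box_b)) / 2.0
    return max(mean_diag * alpha, 1.0)


def box_center(box):
    """Center point of a box; the link segment runs between two such centers."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


# Two 30x40 characters side by side: each diagonal is 50 pixels.
char_a = (0, 0, 30, 40)
char_b = (35, 0, 65, 40)
segment = (box_center(char_a), box_center(char_b))
thickness = link_thickness(char_a, char_b)
```

With the scaling coefficient at 0.1, large characters get a proportionally thicker line, while the floor keeps very small characters connected by at least a one-pixel line.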

Orientation estimation. It is important to obtain the right orientation of text boxes since the recognition stage requires well-defined box coordinates to recognize the text properly. To this end, we add a two-channel output to the detection stage; the channels predict the angle of the characters along the x-axis and the y-axis, respectively. To generate the ground truth of the orientation map, let the upward angle of the GT character bounding box be $\theta^*_{box}$; the channel for the x-axis takes the value $S^*_{cos} = (\cos\theta^*_{box} + 1) \times 0.5$, and the channel for the y-axis takes the value $S^*_{sin} = (\sin\theta^*_{box} + 1) \times 0.5$. The ground-truth orientation map is generated by filling the pixels in the region of the word box with the values of $S^*_{cos}$ and $S^*_{sin}$. The trigonometric functions are not used directly so that the channels have the same output range as the region map and the link map, that is, between 0 and 1.
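A minimal sketch of the orientation encoding follows. It assumes the two targets are the cosine and sine of the box angle, shifted and scaled into the range 0 to 1 so that they match the range of the region and link maps; the function names are illustrative only.

```python
import math


def encode_orientation(theta):
    """Map an angle (radians) to two targets in [0, 1].

    Assumed encoding: (cos(theta) + 1) / 2 for the x-axis channel and
    (sin(theta) + 1) / 2 for the y-axis channel.
    """
    s_cos = (math.cos(theta) + 1.0) * 0.5
    s_sin = (math.sin(theta) + 1.0) * 0.5
    return s_cos, s_sin


def decode_orientation(s_cos, s_sin):
    """Invert the encoding; subtracting 0.5 restores the sign of each component."""
    return math.atan2(s_sin - 0.5, s_cos - 0.5)


s_cos, s_sin = encode_orientation(math.pi / 3)
recovered = decode_orientation(s_cos, s_sin)
```

Using two bounded channels instead of regressing the angle directly avoids the discontinuity where the angle wraps around.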

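The loss function for the orientation map is calculated by Eq. 1:

$$L_\theta = S^*_r(p) \cdot \left( \lVert S_{sin}(p) - S^*_{sin}(p) \rVert_2^2 + \lVert S_{cos}(p) - S^*_{cos}(p) \rVert_2^2 \right) \qquad (1)$$

where $S^*_{sin}(p)$ and $S^*_{cos}(p)$ denote the ground truth of text orientation. Here, the character region score $S^*_r(p)$ is used as a weighting factor because it represents the confidence of the character centeredness. As a result, the orientation loss is calculated only in the positive character regions.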

The final objective function in the detection stage is defined as

$$L_{det} = L_r + L_l + \lambda L_\theta \qquad (2)$$

where $L_r$ and $L_l$ denote the character region loss and the link loss, which are exactly the same as in [2]. $L_\theta$ is the orientation loss, and it is multiplied by $\lambda$ to control its weight. In our experiments, we set $\lambda$ to 0.1.
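The weighted orientation loss and the combined detection objective can be sketched as below. The sketch assumes flattened per-pixel lists, a squared error on each orientation channel, and a sum over pixels (a mean would be an equally plausible reduction); the region and link losses are passed in as plain numbers because their definitions come from the CRAFT detector.

```python
def orientation_loss(region_score, pred_sin, pred_cos, gt_sin, gt_cos):
    """Squared orientation error per pixel, weighted by the character region score.

    Pixels with a region score of zero contribute nothing, so the loss is
    effectively restricted to character regions. Summation is an assumption.
    """
    total = 0.0
    for w, ps, pc, gs, gc in zip(region_score, pred_sin, pred_cos, gt_sin, gt_cos):
        total += w * ((ps - gs) ** 2 + (pc - gc) ** 2)
    return total


def detection_loss(region_loss, link_loss, theta_loss, lam=0.1):
    """Detection objective: region loss + link loss + lam * orientation loss."""
    return region_loss + link_loss + lam * theta_loss


# Two pixels: the first lies on a character, the second is background.
l_theta = orientation_loss(
    region_score=[1.0, 0.0],
    pred_sin=[0.5, 0.9],
    pred_cos=[0.5, 0.9],
    gt_sin=[1.0, 0.1],
    gt_cos=[0.5, 0.1],
)
l_det = detection_loss(region_loss=1.0, link_loss=2.0, theta_loss=l_theta)
```

The background pixel has a large orientation error, but its zero weight removes it from the loss.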

Figure 3: Schematic illustration of the backbone network and the detection head.

The architecture of the backbone and the modified detection head is illustrated in Fig. 3. The final output of the detector has four channels: the character region map $S_r$, the character link map $S_l$, and two orientation maps $S_{sin}$ and $S_{cos}$.

During inference, we apply the same post-processing as described in [2] to obtain text bounding boxes. First, using predefined threshold values, we make binary maps of the character region map $S_r$ and the character link map $S_l$. Then, using the two maps, the text blobs are constructed with connected components labeling (CCL). The final boxes are obtained by finding a minimum bounding box enclosing each text blob. We additionally determine the orientation of the bounding box with a pixel-wise averaging scheme. As shown in Eq. 3, the angle of the text box is found by taking the arctangent of the accumulated sine and cosine values of the predicted orientation map:

$$\theta_{box} = \arctan\left( \frac{\sum_p S_r(p)\,(S_{sin}(p) - 0.5)}{\sum_p S_r(p)\,(S_{cos}(p) - 0.5)} \right) \qquad (3)$$

where $\theta_{box}$ denotes the orientation of the text box, and $S_{cos}$ and $S_{sin}$ are the 2-ch orientation outputs. The same character centeredness-based weighting scheme that is used in the loss calculation is applied to predict the orientation as well.
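The pixel-wise averaging scheme can be sketched as follows. The sketch assumes the orientation channels hold sine and cosine values rescaled to the range 0 to 1, so 0.5 is subtracted before accumulation, and it weights each pixel by its region score. It uses `math.atan2` in place of a plain arctangent of the ratio, which is a substitution made here to keep the quadrant and avoid division by zero.

```python
import math


def box_angle(region_score, s_sin, s_cos):
    """Region-score-weighted orientation of one text box, in radians.

    Inputs are flattened lists over the pixels of the box. The 0.5 offset
    assumes the channels were trained on (sin + 1) / 2 and (cos + 1) / 2.
    """
    acc_sin = sum(w * (s - 0.5) for w, s in zip(region_score, s_sin))
    acc_cos = sum(w * (c - 0.5) for w, c in zip(region_score, s_cos))
    return math.atan2(acc_sin, acc_cos)


# Three pixels that agree on a 0.3 rad rotation, with different confidences.
theta = 0.3
weights = [0.9, 0.5, 0.1]
sin_map = [(math.sin(theta) + 1.0) * 0.5] * 3
cos_map = [(math.cos(theta) + 1.0) * 0.5] * 3
angle = box_angle(weights, sin_map, cos_map)
```

Because the sums are weighted by the region score, pixels near character centers dominate the estimate and low-confidence pixels near the box border have little influence.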

3.3 Sharing Stage

The sharing stage consists of two modules: the text rectification module and the character region attention (CRA) module. To rectify an arbitrarily-shaped text region, a thin-plate spline (TPS) [37] transformation is used. Inspired by the work of [46], our rectification module incorporates iterative TPS to acquire a better representation of the text region. By iteratively updating the control points, the curved geometry of the text in an image is progressively straightened. Through empirical studies, we find that three TPS iterations are sufficient for rectification.

A typical TPS module takes a word image as input, but we feed the character region map and link map since they encapsulate geometric information of the text regions. We use twenty control points to tightly cover the curved text region. To use these control points as a detection result, they are transformed to the original input image coordinates. We optionally perform 2D polynomial fitting to smooth the bounding polygon. Examples of iterative TPS and the final smoothed polygon output are shown in Fig. 4.

Figure 4: Example of iterative TPS. The middle rows show TPS control points on each iteration, and the bottom row shows rectified images on each stage. The control points are drawn in the image level for better visualization. Actual rectification is done in the feature space. Final result was smoothed using a 2D polynomial.

The CRA module is the key component that tightly couples the detection and recognition modules. Simply concatenating the rectified character score map with the feature representation gives the model the following advantages. Creating a link between the detector and the recognizer allows the recognition loss to propagate through the detection stage, and this improves the quality of the character score map. In addition, attaching the character region map to the feature helps the recognizer attend better to the character regions. An ablation study of this module is discussed in the experiment section.
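The concatenation step can be pictured with a small framework-free sketch. It assumes channel-first features stored as nested lists (`channels x height x width`) and appends the rectified score maps as extra channels; in a deep learning framework this would be a single channel-wise concatenation, and the function name here is hypothetical.

```python
def concat_character_attention(feature, score_maps):
    """Append rectified score maps to a pooled feature along the channel axis.

    feature:    list of C maps, each an H x W nested list.
    score_maps: list of K maps with the same H x W size.
    Returns a list of C + K maps (the character attended feature).
    """
    height, width = len(feature[0]), len(feature[0][0])
    for m in score_maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("score map and feature must share spatial size")
    return feature + score_maps


# A 2-channel 1x3 feature joined with one 1x3 region score map.
pooled = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6]]]
region = [[[0.0, 1.0, 0.0]]]
attended = concat_character_attention(pooled, region)
```

Because the score map enters the recognizer as an input, gradients of the recognition loss can reach the detector output that produced it.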

3.4 Recognition Stage

The modules in the recognition stage are formed based on the results reported in [1]. There are three components in the recognition stage: feature extraction, sequence modeling, and prediction. The feature extraction module is made lighter than that of a standalone recognizer since it takes high-level semantic features as input.

The detailed architecture of the module is shown in Table 1. After extracting the features, a bidirectional LSTM is applied for sequence modeling, and an attention-based decoder makes the final text prediction.

Table 1: A simplified ResNet feature extraction module. Layer order: Input (pooled feature), Conv1, MaxPool, Conv2, Conv3, Conv4, AvgPool. Convolutions are specified by channels (c) and kernel size (k); MaxPool, Conv3, Conv4, and AvgPool additionally specify stride (s) and padding (p).

At each time step, the attention-based recognizer decodes textual information by masking the features with the attention outputs. Although the attention module works well in most cases, it fails to predict characters when attention points are misaligned or vanish [6, 14]. Fig. 5 shows the effect of using the CRA module. Well-placed attention points allow robust text prediction.

Figure 5: Attention problems with and without the Character Region Attention module. The red dots represent attention points of the decoded characters. Missing characters are colored in blue, and misrecognized characters are colored in red. The cropped images are slightly different since the control points generated by each rectification module are inconsistent.

The objective function, $L_{reg}$, in the recognition stage is

$$L_{reg} = -\sum_i \log p(Y_i \mid X_i)$$

where $p(Y_i \mid X_i)$ indicates the generation probability of the character sequence, $Y_i$, from the cropped feature representation, $X_i$, of the $i$-th word box.

The final loss, $L$, used for training is composed of the detection loss and the recognition loss by taking $L = L_{det} + L_{reg}$. The overall flow of the recognition loss is shown in Fig. 6. The loss flows through the weights in the recognition stage and propagates towards the detection stage through the Character Region Attention module. The detection loss, on the other hand, is used as an intermediate loss, and thus the weights before the detection stage are updated using both the detection and recognition losses.
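The recognition objective and the final loss can be illustrated numerically. The sketch assumes the sequence probability factorizes into the probabilities the decoder assigns to each ground-truth character, which is the usual form for an autoregressive attention decoder, and it sums the detection and recognition losses without extra weighting; the function names are illustrative.

```python
import math


def sequence_log_prob(step_probs):
    """log p(Y | X) for one word box.

    step_probs holds the probability the decoder gave to the correct
    character at each decoding step (assumed factorization).
    """
    return sum(math.log(p) for p in step_probs)


def recognition_loss(words):
    """Negative log-likelihood summed over the word boxes in a batch."""
    return -sum(sequence_log_prob(step_probs) for step_probs in words)


def total_loss(detection_loss, recognition_loss_value):
    """Final training loss: detection loss plus recognition loss."""
    return detection_loss + recognition_loss_value


# Two word boxes: a two-character word and a confidently decoded one-character word.
batch = [[0.5, 0.5], [1.0]]
l_reg = recognition_loss(batch)
l_total = total_loss(3.0, l_reg)
```

A word decoded with full confidence adds nothing to the loss, so the gradient is driven by the boxes whose characters are still uncertain.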

Figure 6: The entire loss flow of CRAFTS model.

4 Experiment

4.1 Datasets

English datasets. The IC13 [20] dataset consists of high-resolution images, 229 for training and 233 for testing. A rectangular box is used to annotate word-level text instances. IC15 [19] consists of 1000 training and 500 testing images. A quadrilateral box is used to annotate word-level text instances. TotalText [5] has 1255 training and 300 testing images. Unlike the IC13 and IC15 datasets, it contains curved text instances and is annotated using polygon points.

Multi-language dataset. The IC19 [33] dataset contains 10,000 training and 10,000 testing images. The dataset contains texts in 7 different languages and is annotated using quadrilateral points.

4.2 Training strategy

We jointly train both the detector and the recognizer in the CRAFTS model. To train the detection stage, we follow the weakly-supervised training method described in [2]. The recognition loss is calculated by making a batch of randomly sampled cropped word features in each image. The maximum number of words per image is set to 16 to prevent out-of-memory errors. Data augmentation for the detector applies techniques such as crops, rotations, and color variations. For the recognizer, the corner points of the ground-truth boxes are perturbed within a range of 0 to 10% of the shorter side of the box.
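The corner perturbation used for the recognizer can be sketched as follows. The sketch assumes a quadrilateral given as four `(x, y)` points in order, takes the shortest edge as the shorter side, and moves each corner by a random distance up to 10% of that side in a random direction; the sampling distribution is an assumption, since the text only gives the range.

```python
import math
import random


def perturb_corners(quad, max_ratio=0.1, rng=random):
    """Jitter each corner of a quadrilateral for recognizer augmentation.

    The displacement length is drawn uniformly from [0, max_ratio * shortest
    edge] and the direction uniformly from [0, 2*pi); both are assumptions.
    """
    edges = [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]
    limit = max_ratio * min(edges)
    jittered = []
    for x, y in quad:
        radius = rng.uniform(0.0, limit)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        jittered.append((x + radius * math.cos(phi), y + radius * math.sin(phi)))
    return jittered


# A 100x20 word box: corners may move by at most 2 pixels.
word_box = [(0.0, 0.0), (100.0, 0.0), (100.0, 20.0), (0.0, 20.0)]
augmented = perturb_corners(word_box, rng=random.Random(0))
```

Training on slightly misplaced boxes makes the recognizer more tolerant of the imperfect boxes that the detection stage produces at inference time.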

The model is first trained on the SynthText dataset [12] for 50k iterations, and we further train the network on the target datasets. The Adam optimizer is used, and Online Hard Example Mining (OHEM) [39] is applied to enforce a 1:3 ratio of positive and negative pixels in the detection loss. When fine-tuning the model, the SynthText dataset is mixed in at a ratio of 1:5. We take 94 characters to cover letters, numbers, and special characters, and take 4267 characters for the multi-language dataset.
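The 1:3 positive-to-negative pixel ratio can be enforced with a selection step like the one sketched below. It assumes flattened per-pixel losses with a boolean positive mask, keeps all positive pixels plus the hardest negatives up to three times the positive count, and averages the kept losses; the fallback for images without positives and the averaging are assumptions made for this sketch.

```python
def ohem_pixel_loss(pixel_losses, is_positive, neg_ratio=3):
    """Average loss over all positives and the hardest negatives.

    At most neg_ratio negatives are kept per positive pixel, chosen by
    largest loss. With no positives, neg_ratio negatives are kept (assumed).
    """
    positives = [l for l, p in zip(pixel_losses, is_positive) if p]
    negatives = sorted(
        (l for l, p in zip(pixel_losses, is_positive) if not p), reverse=True
    )
    kept_negatives = negatives[: neg_ratio * max(len(positives), 1)]
    selected = positives + kept_negatives
    return sum(selected) / len(selected) if selected else 0.0


# One positive pixel and five negatives: only the three hardest negatives count.
losses = [1.0, 0.9, 0.1, 0.2, 0.8, 0.05]
mask = [True, False, False, False, False, False]
loss = ohem_pixel_loss(losses, mask)
```

Scene images are mostly background, so without this selection the many easy negative pixels would dominate the detection loss.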

                     IC13 (Det)        IC13 (E2E)        IC15 (Det)        IC15 (E2E)
Method               R     P     H     S     W     G     R     P     H     S     W     G     FPS
Deep TextSpotter[3]  -     -     -     89    86    77    -     -     -     54    51    47    9
TextBoxes++[24]      86    92    89    93    92    85    78.5  87.8  82.9  73.3  65.8  51.9  -
TextNet[42]          89.1  93.6  91.3  89.7  88.8  82.9  80.8  85.7  83.2  78.6  74.9  60.4  2.7
EAA[14]              89    91    90    91    89    86    86    87    87    82    77    63    -
TextDragon[10]       -     -     -     -     -     -     83.7  92.4  87.8  82.5  78.3  65.1  -
FOTS[29]             -     -     92.8  91.9  90.1  84.7  87.9  91.8  89.8  83.5  79.1  65.3  7.5
Li et al.[22]        80.5  91.4  85.6  92.5  91.2  84.9  -     -     -     84.4  78.9  66.1  1.3
Qin et al.[34]       -     -     -     -     -     -     87.9  91.6  89.7  85.5  81.9  69.9  4.7
CharNet[44]          -     -     -     -     -     -     90.4  92.6  91.5  85.0  81.2  71.0  -
MaskTextSpotter[23]  89.5  94.8  92.1  93.3  91.3  88.2  87.3  86.6  87.0  83.0  77.7  73.5  2.0
CRAFTS(ours)         90.9  96.1  93.4  94.2  93.8  92.2  85.3  89.0  87.1  83.1  82.1  74.9  5.4
Table 2: Results on horizontal Latin datasets. Some results are based on multi-scale tests. R, P, and H refer to recall, precision, and H-mean, and S, W, and G indicate strongly-, weakly-, and generic-contextualization results, respectively. The evaluation metric of the ICDAR 2013 detection task is DetEval, and the IoU metric is used for the other three cases. FPS is for reference only due to the different experimental environments.
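The H-mean reported in the tables is the harmonic mean of recall and precision. The short sketch below recomputes it from the rounded recall and precision of the CRAFTS detection rows; because the inputs are already rounded, small deviations from published H-mean values are possible for other rows.

```python
def h_mean(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


# CRAFTS detection results from Table 2.
ic13_h = round(h_mean(90.9, 96.1), 1)
ic15_h = round(h_mean(85.3, 89.0), 1)
```

The harmonic mean is pulled toward the smaller of the two values, so a method cannot score well by trading away recall for precision or the reverse.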

4.3 Experimental Results

Horizontal datasets (IC13, IC15). To target the IC13 benchmark, we take the model trained on the SynthText dataset and fine-tune it on the IC13 and IC19 datasets. During inference, we resize the longer side of the input to 1280. The results show a significant increase in performance compared with previous state-of-the-art works.

The model trained on the IC13 dataset is then fine-tuned on the IC15 dataset. During evaluation, the input size of the model is set to 2560x1440. Note that we perform generic evaluation without the generic vocabulary set. The quantitative results on the IC13 and IC15 datasets are listed in Table 2.

Our method surpasses previous methods in both the generic and weakly-contextualization end-to-end tasks, and shows comparable results in the other tasks. The generic performance is meaningful because a vocabulary set is not provided in practical scenarios. Note that we obtain slightly lower detection scores on the IC15 dataset and also observe lower performance in the strongly-contextualization results. The relatively low detection performance is mainly due to the granularity difference, which is discussed in Section 4.5.

Curved datasets (TotalText). Starting from the model trained on the IC13 dataset, we further train the model on the TotalText dataset. During inference, we resize the longer side of the input to 1920, and the control points from the rectification module are used for detector evaluation. The qualitative results are shown in Fig. 7. The character region map and the link map are illustrated using a heatmap, and the weighted pixel-wise angle values are visualized in the HSV color space. As shown in the figure, the network successfully localizes polygon regions and recognizes characters in the curved text regions. The two top-left figures show successful recognition of fully rotated and highly curved text instances.

                     Detection         E2E (None)        E2E (Full)
Method               R     P     H     R     P     H     H
TextDragon[10]       75.7  85.6  80.3  -     -     48.4  74.8
TextNet[42]          59.4  68.2  63.5  56.4  51.9  54.0  -
Li et al.[22]        59.8  64.8  62.2  -     -     57.8  -
MaskTextSpotter[23]  75.4  81.8  78.5  -     -     65.3  77.4
CharNet[44]          85.0  88.0  86.5  -     -     69.2  -
Qin et al.[34]       85.0  87.8  86.4  -     -     70.7  -
CRAFTS(ours)         85.4  89.5  87.4  72.2  86.5  78.7  -
Table 3: Results on TotalText dataset. None means no lexicon is used for contextualization. The full lexicon contains all words in the test set. Some entries use multi-scale inference, and some models were trained on private datasets.

Quantitative results on the TotalText dataset are listed in Table 3. DetEval [5] evaluates the performance of the detector, and a modified IC15 evaluation scheme measures the end-to-end performance. Our method outperforms previously reported methods by a large margin. Note that even without the vocabulary set, the end-to-end h-mean of 78.7 exceeds the previous best score (70.7, Qin et al. [34]) by 8.0 points.

Multi-language dataset (IC19). Evaluation on multiple languages is performed using the IC19-MLT dataset. The output channel in the prediction layer of the recognizer was expanded to 4267 to handle the characters in Arabic, Latin, Chinese, Japanese, Korean, Bangla, and Hindi. However, the occurrence of characters in the dataset is not evenly distributed. Among the 4267 characters in the training set, 1017 characters occur only once in the dataset, and this scarcity makes it hard for the model to make accurate label predictions. To address the class imbalance problem, we first freeze the weights in the detection stage and pretrain the weights in the recognizer with other publicly available multi-language datasets: SynthMLT, ArT, LSVT, ReCTS and RCTW [4, 8, 41, 38]. We then let the loss flow through the whole network and use the IC19 dataset to fine-tune the model. Since no paper reports performance on this benchmark, we compare our results with E2E-MLT [4, 33]. Samples from the IC19 dataset are shown in Fig. 8. We hope our study serves as a baseline for future work on the IC19-MLT benchmark.

                     Detection         E2E
Method               R     P     H     R     P     H
E2E-MLT[4]           -     -     -     20.5  37.4  26.5
CRAFTS(ours)         70.1  81.7  75.5  48.5  72.9  58.2
Table 4: Results on IC19-MLT dataset.
Figure 7: Results on the TotalText dataset. First row: each column shows the input image (top) with its respective region score map, link map, and orientation map.
Figure 8: Qualitative results on IC19 MLT dataset.

4.4 Ablation study

Attention assisted by Character Region Attention. In this section, we study how Character Region Attention (CRA) affects the performance of the recognizer by training a separate network without CRA.

Table 5 shows the effect of using CRA on the benchmark datasets. Without CRA, we observe performance drops on all of the datasets. The gap is larger on the perspective dataset (IC15) and the curved dataset (TotalText) than on the horizontal dataset (IC13). This implies that feeding character attention information improves the performance of the recognizer when dealing with irregular texts.

Dataset    Type         CRA  R     P     H     Gain
IC13       Horizontal   -    89.2  94.8  91.9  -
IC13       Horizontal   ✓    89.5  95.0  92.2  +0.3
IC15       Perspective  -    64.9  84.1  73.2  -
IC15       Perspective  ✓    65.9  86.7  74.9  +1.7
TotalText  Curved       -    71.9  84.0  77.5  -
TotalText  Curved       ✓    72.2  86.5  78.7  +1.2
Table 5: The end-to-end performance comparisons of using character attention maps. CRA denotes the use of Character Region Attention in the recognition stage. R, P, and H refer to Recall, Precision, and H-mean values.

Recognition loss in the detection stage. The recognition loss flowing through the detection stage affects the quality of the character region map and the character link map. The recognition loss is expected to help the detector localize character regions more explicitly. However, the improvement in character localization is not clearly visible in word-level evaluation. Therefore, to show the individual character localization ability of the detector, we take advantage of the pseudo-character box generation process in the CRAFT detector. When generating pseudo-ground-truths, the supervision network calculates the difference between the number of generated pseudo characters and the number of ground-truth characters in the word transcription. Table 6 shows the character error length on each dataset measured with fully trained networks.

Dataset     Total Lengths  SynthText  w.o R-loss  w. R-loss  Diff.
ICDAR2015   46,107         9,324      1,251       1,147      -104 (-8.3%)
TotalText   53,645         16,385     5,050       4,521      -529 (-10.4%)
Table 6: Comparison of character error length on each dataset with trained networks.
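The character error length can be sketched as a sum of per-word count differences. The sketch assumes absolute differences are accumulated over all words, which is one reading of the description, and the helper names are hypothetical. The second function recomputes the relative change from the Table 6 counts.

```python
def character_error_length(pseudo_counts, gt_lengths):
    """Total mismatch between detected character counts and transcription lengths.

    pseudo_counts[i] is the number of pseudo character boxes found in word i,
    gt_lengths[i] is the length of its ground-truth transcription.
    Summing absolute differences is an assumption.
    """
    return sum(abs(p - g) for p, g in zip(pseudo_counts, gt_lengths))


def relative_change(before, after):
    """Change from `before` to `after`, in percent of `before`."""
    return (after - before) / before * 100.0


# Three words: the second is split into one character too many.
error = character_error_length([5, 4, 7], [5, 3, 7])

# ICDAR2015 counts from Table 6, without and with the recognition loss.
ic15_change = relative_change(1251, 1147)
```

A lower error length means the detector separates characters in a way that matches the transcription more often, which word-level box metrics do not measure.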

When the network is trained only on the SynthText dataset, the character error length on each dataset is large. The error decreases as training is performed on real datasets, and it drops further when the recognition loss is propagated to the detection stage. This implies that the use of Character Region Attention improves the localization ability of the detector.

Importance of orientation estimation. Orientation estimation is important because scene text images contain many oriented texts. Our pixel-wise averaging scheme helps the recognizer receive well-defined features. We compare against the results of our model when the orientation information is not used. On the IC15 dataset, the performance drops from 74.9% to 74.1% (-0.8%), and on the TotalText dataset, the h-mean value drops from 78.7% to 77.5% (-1.2%). The results show that accurate angle information improves performance on rotated texts.

4.5 Discussions

Inference speed. Since inference speed varies with the input image size, we measure the FPS at different input resolutions, with the longer side set to 960, 1280, 1600, and 2560. The tests give FPS of 9.9, 8.3, 6.8, and 5.4, respectively. For all experiments, we use an Nvidia P40 GPU with an Intel(R) Xeon(R) CPU. Compared with the 8.6 FPS of the VGG-based CRAFT detector [2], the ResNet-based CRAFTS network achieves higher FPS on the same-sized input. In addition, directly using the control points from the rectification module removes the need for post-processing for polygon generation.

Figure 9: Failure cases in IC15 dataset due to granularity difference. Red boxes denote detection results, cyan boxes denote ground truths.

Granularity difference issue. We assume that the granularity difference between the ground-truth and predicted boxes causes the relatively low detection performance on the IC15 dataset. Character-level segmentation methods tend to infer character connectivity from space and color cues rather than capturing the whole feature of a word instance. For this reason, the outputs do not follow the annotation style of the boxes required by the benchmark. Fig. 9 shows failure cases in the IC15 dataset, which illustrate that detection results are marked incorrect even though the qualitative results are acceptable.

5 Conclusion

In this paper, we present an end-to-end trainable single-pipeline model that tightly couples the detection and recognition modules. Character region attention in the sharing stage fully exploits the character region map to help the recognizer rectify and attend better to the text regions. We also design the recognition loss to propagate through the detection stage, which enhances the character localization ability of the detector. In addition, the rectification module in the sharing stage enables fine localization of curved texts and removes the need to develop hand-crafted post-processing. The experimental results validate the state-of-the-art performance of CRAFTS on various datasets.


  • [1] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §3.2, §3.4.
  • [2] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee (2019) Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9365–9374. Cited by: §1, §1, §2, §3.2, §3.2, §3.2, §4.2, §4.5.
  • [3] M. Busta, L. Neumann, and J. Matas (2017) Deep textspotter: an end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2204–2212. Cited by: §2, Table 2.
  • [4] M. Busta, Y. Patel, and J. Matas (2018) E2E-mlt-an unconstrained end-to-end method for multi-language scene text. In Asian Conference on Computer Vision, pp. 127–143. Cited by: §2, §4.3, Table 4.
  • [5] C. K. Ch’ng and C. S. Chan (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In ICDAR, Vol. 1, pp. 935–942. Cited by: §1, §4.1, §4.3.
  • [6] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5076–5084. Cited by: §3.4.
  • [7] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) Aon: towards arbitrarily-oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5571–5579. Cited by: §1.
  • [8] C. K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al. (2019) ICDAR2019 robust reading challenge on arbitrary-shaped text-rrc-art. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1571–1576. Cited by: §4.3.
  • [9] D. Deng, H. Liu, X. Li, and D. Cai (2018) Pixellink: detecting scene text via instance segmentation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [10] W. Feng, W. He, F. Yin, X. Zhang, and C. Liu (2019) TextDragon: an end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9076–9085. Cited by: §1, §2, Table 2, Table 3.
  • [11] Y. Gao, Y. Chen, J. Wang, Z. Lei, X. Zhang, and H. Lu (2018) Recurrent calibration network for irregular text recognition. arXiv preprint arXiv:1812.07145. Cited by: §1, §2.
  • [12] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, pp. 2315–2324. Cited by: §4.2.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2980–2988. Cited by: §2, §2.
  • [14] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun (2018) An end-to-end textspotter with explicit alignment and attention. In CVPR, pp. 5020–5029. Cited by: §1, §2, §3.4, Table 2.
  • [15] W. He, X. Zhang, F. Yin, and C. Liu (2017) Deep direct regression for multi-oriented scene text detection. In CVPR, pp. 745–753. Cited by: §3.2.
  • [16] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding (2017) Wordsup: exploiting word annotations for character based text detection. In ICCV, Cited by: §2.
  • [17] Z. Huang, Z. Zhong, L. Sun, and Q. Huo (2019) Mask r-cnn with pyramid attention network for scene text detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 764–772. Cited by: §2.
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §2.
  • [19] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §1.
  • [20] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In ICDAR, pp. 1484–1493. Cited by: §1, §4.1.
  • [21] C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231–2239. Cited by: §2.
  • [22] H. Li, P. Wang, and C. Shen (2019) Towards end-to-end text spotting in natural scenes. arXiv preprint arXiv:1906.06013. Cited by: Table 2, Table 3.
  • [23] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai (2019) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, Table 2, Table 3.
  • [24] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing 27 (8), pp. 3676–3690. Cited by: §2, §2, Table 2.
  • [25] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai (2018) Rotation-sensitive regression for oriented scene text detection. In CVPR, pp. 5909–5918. Cited by: §2.
  • [26] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu (2019) Pyramid mask text detector. arXiv preprint arXiv:1903.11800. Cited by: §2.
  • [27] W. Liu, C. Chen, K. K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition.. In BMVC, Vol. 2, pp. 7. Cited by: §2.
  • [28] W. Liu, C. Chen, and K. K. Wong (2018) Char-net: a character-aware neural network for distorted scene text recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [29] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) FOTS: fast oriented text spotting with a unified network. In CVPR, pp. 5676–5685. Cited by: §1, §2, Table 2.
  • [30] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §3.2.
  • [31] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §1, §2.
  • [32] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–83. Cited by: §1, §1, §2.
  • [33] N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J. Burie, C. Liu, et al. (2019) ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—rrc-mlt-2019. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1582–1587. Cited by: §1, §4.1, §4.3.
  • [34] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao (2019) Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4704–4714. Cited by: §1, §2, Table 2, Table 3.
  • [35] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304. Cited by: §2, §2, §2.
  • [36] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016) Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176. Cited by: §2, §2.
  • [37] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence 41 (9), pp. 2035–2048. Cited by: §1, §2, §3.3.
  • [38] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai (2017) Icdar2017 competition on reading chinese text in the wild (rctw-17). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1429–1434. Cited by: §4.3.
  • [39] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. Cited by: §4.2.
  • [40] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §3.2.
  • [41] Y. Sun, Z. Ni, C. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al. (2019) ICDAR 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1557–1562. Cited by: §4.3.
  • [42] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding (2018) TextNet: irregular text reading from images with an end-to-end trainable network. In Asian Conference on Computer Vision, pp. 83–99. Cited by: §2, Table 2, Table 3.
  • [43] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9336–9345. Cited by: §2.
  • [44] L. Xing, Z. Tian, W. Huang, and M. R. Scott (2019) Convolutional character networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9126–9136. Cited by: §1, §2, Table 2, Table 3.
  • [45] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai (2019) Textfield: learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing. Cited by: §2.
  • [46] F. Zhan and S. Lu (2019) Esir: end-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2059–2068. Cited by: §1, §2, §3.3.
  • [47] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding (2019) Look more than once: an accurate detector for text of arbitrary shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10552–10561. Cited by: §2.
  • [48] Z. Zhong, L. Sun, and Q. Huo (2019) An anchor-free region proposal network for faster r-cnn-based text detection approaches. International Journal on Document Analysis and Recognition (IJDAR) 22 (3), pp. 315–327. Cited by: §2.
  • [49] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In CVPR, pp. 2642–2651. Cited by: §2.