CentripetalText: An Efficient Text Instance Representation for Scene Text Detection

07/13/2021 ∙ by Tao Sheng, et al. ∙ Peking University

Scene text detection remains a grand challenge due to the variation in text curvatures, orientations, and aspect ratios. One of the most intractable problems is how to represent text instances of arbitrary shapes. Although many state-of-the-art methods have been proposed to model irregular texts in a flexible manner, most of them lose simplicity and robustness. Their complicated post-processing and their regression under a Dirac delta distribution undermine the detection performance and the generalization ability. In this paper, we propose an efficient text instance representation named CentripetalText (CT), which decomposes text instances into the combination of text kernels and centripetal shifts. Specifically, we utilize the centripetal shifts to implement the pixel aggregation, guiding the external text pixels to the internal text kernels. A relaxation operation is integrated into the dense regression for centripetal shifts, allowing a correct prediction to fall within a range rather than at one specific value. The convenient reconstruction of the text contours and the tolerance of the prediction errors in our method guarantee the high detection accuracy and the fast inference speed, respectively. Besides, we shrink our text detector into a proposal generation module, namely CentripetalText Proposal Network (CPN), replacing SPN in Mask TextSpotter v3 and producing more accurate proposals. To validate the effectiveness of our designs, we conduct experiments on several commonly used scene text benchmarks, including both curved and multi-oriented text datasets. For the task of scene text detection, our approach achieves superior or competitive performance compared to other existing methods, e.g., an F-measure of 86.3% on Total-Text and an F-measure of 86.1% on MSRA-TD500. For the task of end-to-end scene text recognition, we outperform Mask TextSpotter v3 by 1.1% on Total-Text.




1 Introduction

In the past decade, scene text detection has attracted increasing interest from the computer vision community, as localizing the region of each text instance in natural images with high accuracy is an essential prerequisite for many practical applications such as blind navigation, scene understanding, and text retrieval. With the rapid development of object detection fasterrcnn; ssd; maskrcnn; fpn and segmentation fcn; bisenet; dice; zhaoetal, some promising methods east; textsnake; psenet; DB; PAN; contournet have been proposed to solve the problem. However, scene text detection is still a challenging task due to the variety of text curvatures, orientations, and aspect ratios.

How to represent text instances in real imagery is one of the major challenges for the scene text detection task, and there are usually two strategies to address it. The first treats text instances as a specific kind of object and uses rotated rectangles or quadrangles for description. Methods of this kind are typically inherited from generic object detection and often utilize manually designed anchors for better regression. This solution ignores the geometric traits of irregular texts, which may introduce considerable background noise; furthermore, it is difficult to formulate appropriate anchors to fit texts of various shapes. The other strategy decomposes text instances into several conceptual or physical components, and reconstructs the polygonal contours through a series of indispensable post-processing steps. For example, PAN PAN follows the idea of clustering and aggregates text pixels according to the distances between their embeddings. In TextSnake textsnake, text instances are represented with text center lines and ordered disks. Consequently, these methods are more flexible and more general than the previous ones in modeling. Nevertheless, most of them suffer from slow inference speed due to complicated post-processing steps, which stem from this kind of tedious multi-component-based representation strategy. In addition, their component prediction is modeled as a simple Dirac delta distribution, which strictly requires numerical outputs to reach the exact positions and thus weakens the ability to tolerate mistakes. A wrong component prediction propagates errors to the heuristic post-processing procedures, making the rebuilt text contours inaccurate. Based on the above observations, we conclude that a fast and accurate scene text detector depends heavily on a simple but effective text instance representation with a robust post-processing algorithm that can tolerate ambiguity and uncertainty.

Figure 1: Performance and speed of some top-performing real-time scene text detectors on the Total-Text dataset. Enabled by the proposed CT, an efficient representation of text instances, our method outperforms DB DB and PAN PAN, and achieves the best tradeoff between accuracy and speed. More results are shown in Tab. 3.

To overcome these problems, we propose an efficient component-based representation method named CentripetalText (CT) for arbitrary-shaped texts. Enabled by CT, our scene text detector outperforms other state-of-the-art methods and achieves the best tradeoff between accuracy and speed (as shown in Fig. 1). Specifically, as illustrated in Fig. 3, our method contains two steps: i) input images are fed to the convolutional neural network to predict the probability maps and centripetal shift maps; ii) pixels are grouped to form the text instances through heuristics based on text kernels and centripetal shifts. In detail, the text kernels are generated from the probability map by binarization followed by a connected component search, and the centripetal shifts are predicted at each position of the centripetal shift map. Then each pixel is shifted by the amount of its centripetal shift from its original position in the centripetal shift map to a text kernel pixel or a background pixel in the probability map. All the pixels that can be shifted into the region of the same text kernel form a text instance. In this manner, we can reconstruct the final text instances quickly and easily through a few matrix operations and several calls to functions of the OpenCV library. Moreover, we develop an enhanced regression loss, namely the Relaxed L1 Loss, mainly for dense centripetal shift regression, which further improves the detection precision. Benefiting from the new loss, our method is robust to the prediction errors of the centripetal shifts, because all centripetal shifts that guide a pixel into the region of the right text kernel are regarded as positive. Besides, CT can be fused with CNN-based text detectors or spotters in a plug-and-play manner. We replace SPN in Mask TextSpotter v3 masktextspotterv3 with our CentripetalText Proposal Network (CPN), a proposal generation module based on CT, which produces more accurate proposals and further improves the end-to-end text recognition performance.

To evaluate the effectiveness of the proposed CT and Relaxed L1 Loss, we adopt the design of network architecture in PAN PAN and train a powerful end-to-end scene text detector by replacing its text instance representation and loss function with ours. We conduct extensive experiments on the commonly used scene text benchmarks including Total-Text totaltext, CTW1500 ctw1500, and MSRA-TD500 msratd500, demonstrating that our method achieves superior or competitive performance compared to the state of the art, e.g., F-measure of 86.3% at 40.0 FPS on Total-Text, F-measure of 86.1% at 34.8 FPS on MSRA-TD500, etc. For the task of end-to-end text recognition, equipped with CPN, the F-measure value of Mask TextSpotter v3 can further be boosted to 71.9% and 79.5% without and with lexicon, respectively.

Major contributions of our work can be summarized as follows:

  • We propose a novel and efficient text instance representation method named CentripetalText, in which text instances are decomposed into the combination of text kernels and centripetal shifts. The attached post-processing algorithm is simple and robust, making the generation of text contours fast and accurate.

  • To reduce the burden of model training, we develop an enhanced loss function, namely the Relaxed L1 Loss, mainly for dense centripetal shift regression, which further improves the detection performance.

  • Equipped with the proposed CT and Relaxed L1 Loss, our scene text detector achieves superior or competitive results compared to other existing approaches on the curved or oriented text benchmarks, and our end-to-end scene text recognizer surpasses the current state of the art.

Figure 2: Illustration of the proposed CentripetalText representation. Text regions (in yellow) can be decomposed into the combination of text kernels (in blue) and centripetal shifts (in red and green). The centripetal shifts drawn as green arrows lead from background pixels to non-text-kernel pixels and do not contribute to the generation of text contours, while the centripetal shifts in red lead from text region (foreground) pixels to text kernel pixels and help define the shapes. All the pixels that can be shifted into the region of the same text kernel form a text instance. For a better view, we only visualize the centripetal shifts over the bottom text instance.

2 Related work

Text instance representation methods can be roughly classified into two categories: component-free methods and component-based methods.

Component-free methods treat every text instance as a complete whole and directly regress the rotated rectangles or quadrangles for describing scene texts without any reconstruction process. These methods are usually inspired by general object detectors such as Faster R-CNN fasterrcnn and SSD ssd, and often utilize heuristic anchors as prior knowledge. TextBoxes textboxes successfully adapted the object detection framework SSD for text detection by modifying the aspect ratios of anchors and the kernel scales of filters. TextBoxes++ textboxes++ and EAST east could predict either rotated rectangles or quadrangles for text regions with and without anchor priors, respectively. SPCNet spcnet modified Mask R-CNN maskrcnn by adding semantic segmentation guidance to suppress false positives.

Component-based methods prefer to model text instances from local perspectives and decompose instances into components such as characters or text fragments. SegLink seglink decomposed long texts into locally-detectable segments and links and combined the segments into whole words according to the links to get final detection results. MSR MSR detected scene texts by predicting dense text boundary points. PSENet psenet gradually expanded the detected areas from small kernels to complete instances via a progressive scale expansion algorithm.

Figure 3: An overview of our proposed model.

3 Methodology

In this section, we first introduce our new representation CT for texts of arbitrary shapes. Then, we elaborate on our method and training details.

3.1 Representation


An efficient scene text detector must have a well-defined representation for text instances. The traditional description methods inherited from generic object detection (e.g., rotated rectangles or quadrangles) fail to encode the geometric properties of irregular texts. To guarantee the flexibility and generality, we propose a new method named CentripetalText, in which text instances are composed of text kernels and centripetal shifts. As demonstrated in Fig. 2, CT expresses a text instance as a cluster of the pixels which can be shifted into the region of the same text kernel through centripetal shifts. As pixels are the basic units of digital images, CT has the ability to model different forms, regardless of shapes and lengths.

Mathematically, given an input image $I$, the ground truth annotations are denoted as $\{T_1, T_2, \ldots, T_i, \ldots\}$, where $T_i$ stands for the $i$-th text instance. Each text instance $T_i$ has its corresponding text kernel $K_i$, a shrunk version of the original text region. Since a text kernel is a subset of its text instance, which satisfies $K_i \subseteq T_i$, we treat it as the main basis of the pixel aggregation. Instead of the distance used in conventional methods, the centripetal shift $s_j$ defined at every position $p_j$ of the image guides the clustering of text pixels. In this sense, the text instance $T_i$ can be easily represented with the aggregated pixels which can be shifted into the region of its text kernel according to their centripetal shifts:

$$T_i = \{\, p_j \mid (p_j + s_j) \in K_i \,\}$$


Label generation

The label generation for the probability map is inspired by PSENet psenet, where the positive area of the text kernel (shaded areas in Fig. 4(b)) is generated by shrinking the annotated polygon (shaded areas in Fig. 4(a)) using the Vatti clipping algorithm vatti. The offset of shrinking is computed from the perimeter and area of the original polygon, and the shrink ratio is set to 0.7 empirically. Since the annotations in the dataset may not necessarily fit the text instances well, we develop a training mask $M$ to distinguish the supervision of the valid and ignored regions. With $T_i$ denoting the $i$-th text instance and $K_i$ its text kernel, the text instance excluding the text kernel ($T_i \setminus K_i$) is the ignored region, which means that the gradients in this area are not propagated back to the network. The training mask at pixel position $p_j$ can be formulated as follows:

$$M_j = \begin{cases} 0, & \text{if } p_j \in \bigcup_i (T_i \setminus K_i) \\ 1, & \text{otherwise} \end{cases}$$


We simply multiply the training mask by the loss of the segmentation branch to eliminate the influence brought by the wrong annotations.
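The shrinking offset mentioned above can be illustrated with a short sketch. It assumes the rule from the PSENet paper, d = Area × (1 − r²) / Perimeter, which the text only describes as being "computed based on the perimeter and area"; the function names and the sample rectangle are hypothetical, and the actual polygon clipping (Vatti) is not shown.

```python
import math


def polygon_area(points):
    """Shoelace formula; `points` is a list of (x, y) vertices in order."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of edge lengths of the closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def shrink_offset(points, shrink_ratio=0.7):
    # Assumed PSENet-style rule: d = Area * (1 - r^2) / Perimeter.
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return area * (1.0 - shrink_ratio ** 2) / perimeter


# Hypothetical 100 x 20 text box: area 2000, perimeter 240.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offset = shrink_offset(box)  # number of pixels to move each edge inwards
```

Under this rule, long thin instances receive a small offset, so their kernels stay wide enough to survive binarization.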

Figure 4: Label generation. (a) Text instance; (b) Text kernel; (c) Text kernel reference; (d) Both ends (from text instance excluding text kernel to text kernel reference) of centripetal shift.

In the label generation of the regression branch, each text instance $T_i$ with text kernel $K_i$ affects the centripetal shift map in three ways. First, the centripetal shift in the background region ($I \setminus \bigcup_i T_i$) should prevent the background pixels from entering any text kernel, and thus we set it to $(0, 0)$ intuitively. Second, the centripetal shift in the text kernel region ($\bigcup_i K_i$) should keep the text kernel pixels where they are, and we also set it to $(0, 0)$ for convenience. Third, we expect that each pixel in the region of the text instance excluding the text kernel ($\bigcup_i (T_i \setminus K_i)$) can be guided to its corresponding kernel by the centripetal shift. Therefore, we apply the erosion operation to the text kernel map twice in succession, compare the two intermediate results, and obtain the text kernel reference (polygons with solid lines in Fig. 4(c)) as the destination of the generated centripetal shift. As shown in Fig. 4(d), we build the centripetal shift between each pixel in the shaded area and its nearest text kernel reference to prevent numerical accuracy issues caused by rounding off. Note that if two instances overlap, the smaller instance has higher priority. The generation of the centripetal shift $s_j$ at pixel position $p_j$ can be formulated as follows:

$$s_j = \begin{cases} \overrightarrow{p_j r_j}, & \text{if } p_j \in \bigcup_i (T_i \setminus K_i) \\ (0, 0), & \text{otherwise} \end{cases}$$

where $r_j$ represents the nearest text kernel reference to the pixel $p_j$. During training, the Smooth L1 loss fastrcnn is applied for supervision. Nevertheless, according to a previous observation GFL, such dense regression is modeled as a simple Dirac delta distribution, which fails to consider the ambiguity and uncertainty in datasets. To address the problem, we develop a regression mask as a relaxation operation and integrate it into the Smooth L1 loss to reduce the burden of the model training. We extend the correct prediction from one specific value to a range, and any centripetal shift which moves the pixel into the right region is treated as positive during training. With $\hat{s}_j$ denoting the predicted centripetal shift at $p_j$, the regression mask can be formulated as follows:

$$M'_j = \begin{cases} 0, & \text{if } p_j \in T_i \text{ and } (p_j + \hat{s}_j) \in K_i \\ 1, & \text{otherwise} \end{cases}$$
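The centripetal shift labels described in this subsection can be sketched for a single text instance as follows. This is a minimal illustration, not the released implementation: it assumes boolean masks for the instance, its kernel, and the kernel reference are already rasterized, uses a brute-force nearest-neighbor search, and stores shifts in (dy, dx) order. The function name is hypothetical.

```python
import numpy as np


def centripetal_shift_labels(instance_mask, kernel_mask, reference_mask):
    """Build the shift label map for ONE text instance.

    All inputs are boolean (H, W) arrays. The result is an (H, W, 2) array of
    (dy, dx) shifts that is zero on background and kernel pixels.
    """
    h, w = instance_mask.shape
    shifts = np.zeros((h, w, 2), dtype=np.float32)

    ring = instance_mask & ~kernel_mask        # text pixels outside the kernel
    ring_pts = np.argwhere(ring)               # (N, 2) as (y, x)
    ref_pts = np.argwhere(reference_mask)      # (M, 2) as (y, x)
    if len(ring_pts) == 0 or len(ref_pts) == 0:
        return shifts

    # Squared distance from every ring pixel to every reference pixel.
    diff = ref_pts[None, :, :] - ring_pts[:, None, :]      # (N, M, 2)
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)   # (N,)

    # Shift = nearest reference position minus the pixel's own position.
    shifts[ring_pts[:, 0], ring_pts[:, 1]] = ref_pts[nearest] - ring_pts
    return shifts
```

The pairwise distance matrix costs O(N·M) memory, which is acceptable for a sketch; a distance transform would be the usual choice for full-size images.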


Like the segmentation loss, we multiply the regression mask by the Smooth L1 loss and form a novel loss function, namely the Relaxed L1 loss for dense centripetal shift prediction, to further improve the detection accuracy. Writing $M'_j$ for the regression mask at position $p_j$, the Relaxed L1 loss function can be formulated as follows:

$$\mathcal{L}_{\text{regression}} = \sum_j M'_j \cdot \text{Smooth}_{L1}(\hat{s}_j, s_j)$$

where $\text{Smooth}_{L1}(\cdot)$ denotes the standard Smooth L1 loss, and $\hat{s}_j$ and $s_j$ denote the predicted centripetal shift at the position $p_j$ and its ground truth, respectively.
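To make the relaxation concrete, the sketch below masks out every text pixel whose predicted shift already lands inside its own kernel, and sums the Smooth L1 loss over the remaining pixels. It is an illustration under stated assumptions rather than the authors' code: the rounding and clipping of landing positions, the `beta` threshold of the Smooth L1 loss, the unnormalized sum, and the use of integer label maps (0 for background) are all choices made here.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Element-wise Smooth L1 (quadratic below `beta`, linear above)."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)


def relaxed_l1(pred_shift, gt_shift, instance_labels, kernel_labels):
    """pred_shift, gt_shift: (H, W, 2) as (dy, dx).
    instance_labels, kernel_labels: (H, W) ints, 0 = background."""
    h, w = instance_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Where does each pixel land under the PREDICTED shift?
    ty = np.clip(np.rint(ys + pred_shift[..., 0]).astype(int), 0, h - 1)
    tx = np.clip(np.rint(xs + pred_shift[..., 1]).astype(int), 0, w - 1)
    landed = kernel_labels[ty, tx]

    # Regression mask: 0 where a text pixel already reaches its own kernel.
    already_correct = (instance_labels > 0) & (landed == instance_labels)
    mask = (~already_correct).astype(np.float64)

    per_pixel = smooth_l1(pred_shift, gt_shift).sum(axis=-1)
    return float((mask * per_pixel).sum())
```

With this mask, a prediction that overshoots the labeled reference point but still ends inside the right kernel contributes nothing to the loss, which is the tolerance the text describes.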

3.2 Scene text detection with CentripetalText


In order to detect texts with arbitrary shapes fast and accurately, we adopt the efficient model design in PAN PAN and follow it with our CT and Relaxed L1 loss. First, ResNet18 resnet is used as the default backbone for a fair comparison. Then, to remedy the weak representation ability of the lightweight backbone, two cascaded FPEMs PAN continuously enhance the feature pyramid in both top-down and bottom-up manners. Afterwards, the generated feature pyramids of different depths are fused by FFM PAN into a single basic feature before final segmentation. Finally, we predict the probability map and the centripetal shift map from the basic feature for further contour generation.


Our loss function can be formulated as:

$$\mathcal{L} = \mathcal{L}_{\text{segmentation}} + \lambda \, \mathcal{L}_{\text{regression}}$$

where $\mathcal{L}_{\text{segmentation}}$ denotes the segmentation loss of text kernels, and $\mathcal{L}_{\text{regression}}$ denotes the regression loss of centripetal shifts. $\lambda$ is a constant to balance the weights of the segmentation and regression losses. We set it to 0.05 in all experiments. Specifically, the prediction of text kernels is basically a pixel-wise binary classification problem and we apply the dice loss dice for this part. Equipped with the training mask $M$, the segmentation loss can be formulated as follows:

$$\mathcal{L}_{\text{segmentation}} = \sum_j \text{Dice}(\hat{c}_j, c_j) \cdot M_j$$

where $\text{Dice}(\cdot)$ denotes the dice loss function, and $\hat{c}_j$ and $c_j$ denote the predicted probability of text kernels at the position $p_j$ and its ground truth, respectively. Note that we adopt Online Hard Example Mining (OHEM) ohem to address the imbalance of positives and negatives while calculating $\mathcal{L}_{\text{segmentation}}$. Regarding the regression loss, a detailed description has been provided in Sec. 3.1.
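The combination of the two losses can be sketched as below. The weight 0.05 comes from the text; everything else is an assumption for illustration: the Dice loss uses the squared-sum denominator common in segmentation code, the training mask is applied by zeroing ignored pixels before the overlap is computed, OHEM is omitted, and the function names are hypothetical.

```python
import numpy as np


def masked_dice_loss(pred, target, mask, eps=1e-6):
    """Dice loss over the pixels kept by the training mask (1 = keep, 0 = ignore)."""
    pred = pred * mask
    target = target * mask
    intersection = (pred * target).sum()
    denominator = (pred ** 2).sum() + (target ** 2).sum() + eps
    return float(1.0 - 2.0 * intersection / denominator)


def total_loss(segmentation_loss, regression_loss, weight=0.05):
    """Weighted sum of the kernel segmentation loss and the shift regression loss."""
    return segmentation_loss + weight * regression_loss
```

Because ignored pixels are zeroed in both prediction and target, they change neither the intersection nor the denominator, so no gradient flows through them.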


The procedure for contour reconstruction is shown in Fig. 3. After the forward pass, the network produces the probability map and the centripetal shift map. We first binarize the probability map with a constant threshold (0.2) to get the binary map. Then, we find the connected components (text kernels) in the binary map, which serve as the clusters for pixel aggregation. Afterwards, we assign each pixel to a cluster according to which text kernel (or the background) the pixel is shifted into by its centripetal shift. Finally, we build the text contour from the pixels in each group. Note that our post-processing strategy differs essentially from that of PAN PAN. The post-processing in PAN is an iterative process, which gradually extends the text kernel to the text region by merging its neighbor pixels iteratively. In contrast, we conduct the aggregation in one step: the shifts of all pixels are applied in parallel by a single matrix operation, which saves inference time.
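The one-step aggregation can be sketched with a single indexing operation. The sketch assumes the kernel label map has already been produced by thresholding and connected component labeling (for instance with OpenCV's `cv2.connectedComponents`), that shifts are stored as (dy, dx), and that landing positions are rounded and clipped to the image; these details and the function name are illustrative assumptions.

```python
import numpy as np


def aggregate_pixels(kernel_labels, shift_map):
    """Assign every pixel the label of the kernel it is shifted into.

    kernel_labels: (H, W) ints, 0 = background, k > 0 = k-th text kernel.
    shift_map:     (H, W, 2) predicted centripetal shifts as (dy, dx).
    Returns an (H, W) instance label map.
    """
    h, w = kernel_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Landing position of every pixel, computed for all pixels at once.
    ty = np.clip(np.rint(ys + shift_map[..., 0]).astype(int), 0, h - 1)
    tx = np.clip(np.rint(xs + shift_map[..., 1]).astype(int), 0, w - 1)
    # One gather replaces PAN-style iterative neighbor merging.
    return kernel_labels[ty, tx]


# Toy row with two one-pixel kernels; neighbors point towards them.
labels = np.array([[0, 1, 0, 0, 0, 2, 0]])
shifts = np.zeros((1, 7, 2))
shifts[0, 0, 1] = 1
shifts[0, 2, 1] = -1
shifts[0, 4, 1] = 1
shifts[0, 6, 1] = -1
instances = aggregate_pixels(labels, shifts)
```

In the toy row, the middle pixel has a zero shift and lands on background, so it stays unassigned while each kernel absorbs its two neighbors.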

CentripetalText Proposal Network

Our scene text detector is shrunk to a text proposal module, termed as CentripetalText Proposal Network (CPN), by transforming the polygonal outputs to the minimum area rectangles and instance masks. We follow the main design of the text detection and recognition modules of Mask TextSpotter v3 masktextspotterv3 and replace SPN with our CPN for the comparison of proposal quality and recognition accuracy.

4 Experiments

4.1 Datasets

SynthText synthtext is a synthetic dataset, consisting of more than 800,000 synthetic images. This dataset is used to pre-train our model.

Total-Text totaltext is a curved text dataset including 1,255 training images and 300 testing images. This dataset contains horizontal, multi-oriented, and curved text instances labeled at the word level.

CTW1500 ctw1500 is another curved text dataset including 1,000 training images and 500 testing images. The text instances are annotated at text-line level with 14-polygons.

MSRA-TD500 msratd500 is a multi-oriented text dataset which contains 300 training images and 200 testing images with text-line level annotation. Due to its small scale, we follow the previous works east; textsnake and include 400 extra training images from HUST-TR400 hust.

Backbone | Neck | Total-Text F | Total-Text FPS | MSRA-TD500 F | MSRA-TD500 FPS
ResNet18 | FPN fpn | 84.0 | 46.8 | 81.6 | 42.0
ResNet18 | FPEM PAN | 84.9 | 40.0 | 83.0 | 34.8
ResNet50 | FPN fpn | 85.1 | 25.5 | 82.7 | 21.9
ResNet50 | FPEM PAN | 85.6 | 24.0 | 83.5 | 20.5
Table 1: Quantitative results of our text detection models with different backbones and necks. “F” means F-measure.
Rep. | Regression Loss | Total-Text F | Total-Text FPS | MSRA-TD500 F | MSRA-TD500 FPS
PAN PAN | - | 83.5 | 39.6 | 78.9 | 30.2
CT (Ours) | Smooth L1 fastrcnn | 83.0 | 40.0 | 81.0 | 34.8
CT (Ours) | Balanced L1 librarcnn | 83.8 | 40.0 | 81.7 | 34.8
CT (Ours) | Relaxed L1 | 84.9 | 40.0 | 83.0 | 34.8
Table 2: Comparison between PAN PAN and our models with different regression losses. “Rep.” denotes representation and “F” means F-measure.
Figure 5: Qualitative results of the proposed method. Images in row 1-3 are sampled from Total-Text, CTW1500, and MSRA-TD500, respectively. Ground truth annotations are in red and our detection results are in green.

4.2 Implementation details

To make fair comparisons, we use the same training settings described below. ResNet resnet pre-trained on ImageNet imagenet is used as the backbone of our method. All models are optimized by the Adam optimizer with the batch size of 16 on 4 GPUs. We train our model under two training strategies: (1) learning from scratch; (2) fine-tuning models pre-trained on the SynthText dataset. For pre-training, models are trained on SynthText for 50k iterations with a fixed learning rate of , and under either strategy models are trained on real datasets for 36k iterations with the “poly” learning rate strategy zhaoetal, where “power” is set to 0.9 and the initial learning rate is . Data augmentation follows the official implementation of PAN, including random scale, random horizontal flip, random rotation, and random crop. The blurred texts labeled as DO NOT CARE are ignored during training. In addition, we set the negative-positive ratio of OHEM to 3, and the shrinking rate of the text kernel to 0.7. All models are tested with a batch size of 1 on a GTX 1080Ti GPU without bells and whistles. As for end-to-end recognition, we leave the original training and testing settings of Mask TextSpotter v3 unchanged.
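For reference, the “poly” learning rate strategy is commonly defined as lr = base_lr × (1 − iter / max_iter)^power. The sketch below follows that usual definition with the power of 0.9 and the 36k iterations stated above; the base learning rate of 1e-3 is only a hypothetical placeholder, since the value is not given in the text.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay from `base_lr` at iteration 0 to 0 at `max_iterations`."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


BASE_LR = 1e-3            # hypothetical placeholder, not taken from the text
MAX_ITERATIONS = 36_000   # fine-tuning length stated in the text

# Learning rate at the start, the midpoint, and the end of training.
schedule = [poly_lr(BASE_LR, it, MAX_ITERATIONS) for it in (0, 18_000, 36_000)]
```

With a power below 1, the decay is slightly slower than linear for most of training and drops quickly near the end.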

4.3 Ablation study

To analyze our designs in depth, we conduct a series of ablation studies on both curved and multi-oriented text datasets (Total-Text and MSRA-TD500). In addition, all models in this subsection are trained from scratch.

On the one hand, to make full use of the capability of the proposed CT, we try different backbones and necks for the efficient architecture. As shown in Tab. 1, although “ResNet18 + FPN” and “ResNet50 + FPEM” are the fastest and the most accurate detectors, respectively, “ResNet18 + FPEM” achieves the best tradeoff between accuracy and speed. Thus, we keep this combination by default in the following experiments. On the other hand, we study the validity of the Relaxed L1 loss by replacing it with others (Tab. 2). Compared with the baseline Smooth L1 loss fastrcnn and the newly-released Balanced L1 loss librarcnn, the Relaxed L1 loss improves the F-measure of our method by more than 1% on both datasets, which indicates its effectiveness. Moreover, under the same model architecture, we outperform PAN by a clear margin (1.4% on Total-Text and 4.1% on MSRA-TD500) while keeping its fast inference speed, indicating that the proposed CT is more efficient.

4.4 Comparisons with state-of-the-art methods

Method | Ext. | Venue | Total-Text P | Total-Text R | Total-Text F | Total-Text FPS | CTW1500 P | CTW1500 R | CTW1500 F | CTW1500 FPS
CTPN CTPN | - | ECCV’16 | - | - | - | - | 60.4 | 53.8 | 56.9 | 7.1
SegLink seglink | - | CVPR’17 | 30.3 | 23.8 | 26.7 | - | 42.3 | 40.0 | 40.8 | 10.7
EAST east | - | CVPR’17 | 50.0 | 36.2 | 42.0 | - | 78.7 | 49.1 | 60.4 | 21.2
PSENet psenet | - | CVPR’19 | 81.8 | 75.1 | 78.3 | 3.9 | 80.6 | 75.6 | 78.0 | 3.9
PAN PAN | - | ICCV’19 | 88.0 | 79.4 | 83.5 | 39.6 | 84.6 | 77.7 | 81.0 | 39.8
CT-320 | - | - | 87.6 | 72.7 | 79.4 | 93.2 | 85.7 | 73.2 | 79.0 | 107.2
CT-512 | - | - | 87.9 | 80.8 | 84.2 | 57.0 | 85.2 | 78.4 | 81.7 | 59.8
CT-640 | - | - | 88.8 | 81.4 | 84.9 | 40.0 | 85.5 | 79.2 | 82.2 | 40.8
TextSnake textsnake | ✓ | ECCV’18 | 82.7 | 74.5 | 78.4 | - | 67.9 | 85.3 | 75.6 | -
MSR MSR | ✓ | IJCAI’19 | 83.8 | 74.8 | 79.0 | - | 85.0 | 78.3 | 81.5 | -
SegLink++ seglink++ | ✓ | PR’19 | 82.1 | 80.9 | 81.5 | - | 82.8 | 79.8 | 81.3 | -
PSENet psenet | ✓ | CVPR’19 | 84.0 | 78.0 | 80.9 | 3.9 | 84.8 | 79.7 | 82.2 | 3.9
SPCNet spcnet | ✓ | AAAI’19 | 83.0 | 82.8 | 82.9 | - | - | - | - | -
LOMO* LOMO | ✓ | CVPR’19 | 87.6 | 79.3 | 83.3 | - | 85.7 | 76.5 | 80.8 | -
CRAFT CRAFT | ✓ | CVPR’19 | 87.6 | 79.9 | 83.6 | - | 86.0 | 81.1 | 83.5 | -
Boundary boundary | ✓ | AAAI’20 | 85.2 | 83.5 | 84.3 | - | - | - | - | -
DB DB | ✓ | AAAI’20 | 87.1 | 82.5 | 84.7 | 32.0 | 86.9 | 80.2 | 83.4 | 22.0
PAN PAN | ✓ | ICCV’19 | 89.3 | 81.0 | 85.0 | 39.6 | 86.4 | 81.2 | 83.7 | 39.8
DRRG DRRG | ✓ | CVPR’20 | 86.5 | 84.9 | 85.7 | - | 85.9 | 83.0 | 84.5 | -
CT-320 | ✓ | - | 88.0 | 75.4 | 81.2 | 93.2 | 87.7 | 74.7 | 80.7 | 107.2
CT-512 | ✓ | - | 90.2 | 81.5 | 85.6 | 57.0 | 87.8 | 79.0 | 83.2 | 59.8
CT-640 | ✓ | - | 90.5 | 82.5 | 86.3 | 40.0 | 88.3 | 79.9 | 83.9 | 40.8
Table 3: Quantitative detection results on Total-Text and CTW1500. “P”, “R” and “F” represent the precision, recall and F-measure respectively. “Ext.” denotes external training data. * indicates the multi-scale testing is performed.
Method | Ext. | P | R | F | FPS
RRPN rrpn | - | 82.0 | 68.0 | 74.0 | -
EAST east | - | 87.3 | 67.4 | 76.1 | 13.2
PAN PAN | - | 80.7 | 77.3 | 78.9 | 30.2
CT-736 | - | 87.1 | 79.3 | 83.0 | 34.8
SegLink seglink | ✓ | 86.0 | 70.0 | 77.0 | 8.9
PixelLink pixellink | ✓ | 83.0 | 73.2 | 77.8 | 3.0
TextSnake textsnake | ✓ | 83.2 | 73.9 | 78.3 | 1.1
RRD rrd | ✓ | 87.0 | 73.0 | 79.0 | 10.0
TextField textfield | ✓ | 87.4 | 75.9 | 81.3 | -
CRAFT CRAFT | ✓ | 88.2 | 78.2 | 82.9 | 8.6
MCN MCN | ✓ | 88.0 | 79.0 | 83.0 | -
PAN PAN | ✓ | 84.4 | 83.8 | 84.1 | 30.2
DB DB | ✓ | 91.5 | 79.2 | 84.9 | 32.0
DRRG DRRG | ✓ | 88.1 | 82.3 | 85.1 | -
CT-736 | ✓ | 90.0 | 82.5 | 86.1 | 34.8
Table 4: Quantitative detection results on MSRA-TD500. “P”, “R” and “F” represent the precision, recall and F-measure respectively. “Ext.” denotes external training data.

Curved text detection

We first evaluate our CT on the datasets Total-Text and CTW1500 to test its ability for curved text detection. During testing, we set the short side of images to different scales (320, 512, 640) and keep their aspect ratios. We compare with state-of-the-art detectors in Tab. 3. When learning from scratch, CT-640 achieves a competitive F-measure of 84.9%, surpassing most state-of-the-art methods pre-trained on external text datasets. When pre-training on SynthText, the F-measure of our best model CT-640 reaches 86.3%, which is 0.6% better than the second-best DRRG DRRG, while still ensuring real-time detection speed (40.0 FPS). Fig. 1 shows the accuracy-speed tradeoff of the top-performing real-time text detectors, from which it can be observed that our CT pushes beyond the existing accuracy-speed boundary. Analogous results can be found on CTW1500. With external training data, the F-measure of CT-640 is 83.9%, which ranks second among all methods and is lower only than that of DRRG. Meanwhile, the speed still exceeds 40 FPS. In summary, the experiments conducted on Total-Text and CTW1500 demonstrate that the proposed CT achieves superior or competitive results compared to state-of-the-art methods, indicating its strength in modeling curved texts. We visualize our detection results in Fig. 5 for further inspection.

Multi-oriented text detection

We also evaluate CT on the dataset MSRA-TD500 to test the robustness in modeling multi-oriented texts. As shown in Tab. 4, CT achieves an F-measure of 83.0% at 34.8 FPS without external training data, outperforming PAN by 4.1%. When pre-training on SynthText, the F-measure of CT is further boosted to 86.1%. The highest F-measure and the fastest speed in Tab. 4 demonstrate that CT generalizes to texts with extreme aspect ratios and various orientations in complex natural scenes.
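The F-measure values reported in Tab. 3 and Tab. 4 can be cross-checked from the listed precision and recall, assuming the standard definition as their harmonic mean, F = 2PR / (P + R). The helper name below is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (P, R) pairs copied from the CT rows of Tab. 3 and Tab. 4.
total_text_scratch = round(f_measure(88.8, 81.4), 1)     # CT-640, no external data
total_text_pretrained = round(f_measure(90.5, 82.5), 1)  # CT-640, with SynthText
msra_pretrained = round(f_measure(90.0, 82.5), 1)        # CT-736, with SynthText
```

The three rounded values agree with the F columns of the tables (84.9, 86.3, and 86.1).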

(a) Segmentation Proposal Network (SPN)
(b) CentripetalText Proposal Network (CPN)
Figure 6: Qualitative comparison of proposals obtained by SPN and CPN. The blue rectangles denote the text proposals and the green areas denote the binary polygon masks.
Method | None | Full
TextBoxes* textboxes | 36.3 | 48.9
Mask TextSpotter v1 masktextspotterv1 | 52.9 | 71.8
Qin et al. qin | 63.9 | -
Boundary boundary | 65.0 | 76.1
Mask TextSpotter v2 masktextspotterv2 | 65.3 | 77.4
CharNet* charnet | 69.2 | -
ABCNet* abcnet | 69.5 | 78.4
Mask TextSpotter v3 masktextspotterv3 | 71.2 | 78.4
Mask TextSpotter v3 w/ CPN | 71.9 | 79.5
Table 5: Quantitative end-to-end recognition results on Total-Text. The evaluation protocol is the same as the one in Mask TextSpotter v3 masktextspotterv3. “None” means recognition without any lexicon. “Full” lexicon contains all words in the test set. * indicates the multi-scale testing is performed.

End-to-end text recognition

We simply replace SPN in Mask TextSpotter v3 with our proposed CPN to develop a more powerful end-to-end text recognizer. We evaluate CPN-based text spotter on Total-Text to test the proposal generation quality for the text spotting task. As shown in Tab. 5, equipped with CPN, Mask TextSpotter v3 achieves the F-measure value of 71.9% and 79.5% when the lexicon is not used and used respectively. Compared with the original version and other state-of-the-art methods, our method can obtain higher performance whether the lexicon is provided or not. Thus, the quantitative results demonstrate that CPN can produce more accurate text proposals than SPN, which is beneficial for recognition and can improve the performance of end-to-end text recognition further.

We visualize the text proposals and the polygon masks generated by SPN and CPN for intuitive comparison. As shown in Fig. 6, we can see that the polygon masks produced by CPN fit the text instances more tightly, which qualitatively proves the superiority of the proposed CPN to produce text proposals.

5 Conclusion

To keep the simplicity and robustness of text instance representation, we propose CentripetalText (CT) which decomposes text instances into the combination of text kernels and centripetal shifts. Text kernels identify the skeletons of text instances while centripetal shifts guide the external text pixels to the internal text kernels. Moreover, to reduce the burden of model training, a relaxation operation is integrated into the dense regression for centripetal shifts, allowing the correct prediction in a range. Equipped with the proposed CT, our detector achieves superior or comparable performance compared to state-of-the-art methods while keeping the real-time inference speed. The source code is available at ***, and we hope that the proposed CT can serve as a common representation for scene texts.


Appendix A Rotation robustness analysis

RA (°) | CharNet charnet P | R | F | MTSv2 masktextspotterv2 P | R | F | MTSv3 masktextspotterv3 P | R | F | MTSv3 w/ CPN P | R | F
0 | 61.7 | 61.2 | 61.4 | 86.3 | 75.2 | 80.3 | 89.0 | 73.0 | 80.2 | 89.7 | 76.3 | 82.4
15 | 66.3 | 61.9 | 64.0 | 78.4 | 53.5 | 63.6 | 87.2 | 69.8 | 77.5 | 87.8 | 72.0 | 79.1
30 | 60.9 | 56.5 | 58.6 | 73.9 | 54.7 | 62.9 | 87.8 | 67.5 | 76.3 | 89.6 | 69.4 | 78.2
45 | 34.2 | 33.5 | 33.9 | 66.4 | 45.8 | 54.2 | 88.5 | 66.8 | 76.1 | 89.4 | 66.9 | 76.5
60 | 10.3 | 8.4 | 9.3 | 68.2 | 48.3 | 56.6 | 88.5 | 67.6 | 76.6 | 88.6 | 67.1 | 76.4
75 | 0.3 | 0.2 | 0.2 | 77.0 | 59.2 | 67.0 | 86.9 | 67.6 | 76.0 | 88.1 | 67.7 | 76.5
90 | 0.0 | 0.0 | 0.0 | 82.0 | 56.9 | 67.1 | 85.9 | 57.9 | 69.1 | 87.8 | 60.6 | 71.7
Table 6: Quantitative end-to-end recognition results (without lexicon) on Rotated ICDAR2013. The evaluation protocol is the same as the one in ICDAR2015 dataset. CharNet is tested with the official released model. Mask TextSpotter v2 (MTSv2), Mask TextSpotter v3 (MTSv3) and our model (MTSv3 w/ CPN) are trained with the same rotating augmentation. “RA” is short for rotating angles. “P”, “R” and “F” represent the precision, recall and F-measure respectively.

To further demonstrate the rotation robustness of our method, we evaluate our CPN-based text spotter on the Rotated ICDAR2013 dataset.

Rotated ICDAR2013 masktextspotterv3 is an augmented text dataset that is generated from ICDAR2013 icdar2013. To form the Rotated ICDAR2013 dataset, all the images and annotations in the test set of the ICDAR2013 benchmark are rotated with some specific angles, including , , , , and . The dataset contains 229 training images and 233 testing images. The text instances are annotated at the text-line level with rotated rectangles. Since the annotations are extended from horizontal rectangles to multi-oriented ones, we adopt the evaluation protocols in the ICDAR2015 dataset icdar2015.

As shown in Tab. 6, we compare three top-performing methods, CharNet charnet, Mask TextSpotter v2 masktextspotterv2, and Mask TextSpotter v3 masktextspotterv3, with our proposed text spotter at different rotation angles. We can see that CharNet and Mask TextSpotter v2 fail to deal with multi-oriented texts and their performance falls well below ours. Moreover, our method surpasses Mask TextSpotter v3 by more than 1.5% when the rotation angles are , , and , and we obtain competitive performance under the other angles. These experiments demonstrate the robustness of our method to various orientations of scene texts.

Appendix B Limitation

Figure 7: Failure samples.

Although the proposed CT works well in most cases of scene text detection, it still fails in some difficult cases as shown in Fig. 7. On the one hand, our method may mistakenly treat some decorative patterns as texts and thus produce false positives (see Fig. 7). In this situation, the subsequent recognition module can effectively suppress such failures using high-level semantic information. On the other hand, whether two close text instances should be connected into one or not is still a challenging problem which strongly influences the detection performance (see Fig. 7). In the future, we plan to solve this problem and make the model more robust.

Appendix C More detection and recognition results

More detection results are shown in Fig. 8 (Total-Text), Fig. 9 (CTW1500), and Fig. 10 (MSRA-TD500), and end-to-end recognition results on Total-Text are shown in Fig. 11.

Figure 8: Detection results on Total-Text.
Figure 9: Detection results on CTW1500.
Figure 10: Detection results on MSRA-TD500.
Figure 11: Recognition results on Total-Text.