In the past decade, scene text detection has attracted increasing interest from the computer vision community, as accurately localizing each text instance in natural images is an essential prerequisite for many practical applications such as blind navigation, scene understanding, and text retrieval. With the rapid development of object detection fasterrcnn; ssd; maskrcnn; fpn and segmentation fcn; bisenet; dice; zhaoetal, some promising methods east; textsnake; psenet; DB; PAN; contournet have been proposed to solve the problem. However, scene text detection remains challenging due to the variety of text curvatures, orientations, and aspect ratios.
How to represent text instances in real imagery is one of the major challenges of the scene text detection task, and there are usually two strategies to address it. The first treats text instances as a specific kind of object and uses rotated rectangles or quadrangles for description. Methods of this kind are typically inherited from generic object detection and often rely on manually designed anchors for better regression. Obviously, this solution ignores the geometric traits of irregular texts, which may introduce considerable background noise, and furthermore, it is difficult to formulate appropriate anchors to fit texts of various shapes. The other strategy decomposes text instances into several conceptual or physical components and reconstructs the polygonal contours through a series of indispensable post-processing steps. For example, PAN PAN follows the idea of clustering and aggregates text pixels according to the distances between their embeddings. In TextSnake textsnake, text instances are represented with text center lines and ordered disks. Consequently, these methods are more flexible and more general than the previous ones in modeling. Nevertheless, most of them suffer from slow inference, due to the complicated post-processing steps that this kind of tedious multi-component representation entails. Moreover, their component prediction is modeled as a simple Dirac delta distribution, which strictly requires the numerical outputs to hit the exact positions and thus weakens the ability to tolerate mistakes. A wrong component prediction propagates errors to the heuristic post-processing procedures, making the rebuilt text contours inaccurate. Based on the above observations, we find that a fast and accurate scene text detector depends heavily on a simple but effective text instance representation with a robust post-processing algorithm that can tolerate ambiguity and uncertainty.
To overcome these problems, we propose an efficient component-based representation named CentripetalText (CT) for arbitrary-shaped texts. Enabled by CT, our scene text detector outperforms other state-of-the-art methods and achieves the best tradeoff between accuracy and speed (as shown in Fig. 1). Specifically, as illustrated in Fig. 3, our method contains two steps: i) input images are fed to a convolutional neural network to predict the probability map and the centripetal shift map; ii) pixels are grouped into text instances through heuristics based on text kernels and centripetal shifts. In detail, the text kernels are generated from the probability map by binarization followed by connected component search, and a centripetal shift is predicted at each position of the centripetal shift map. Each pixel is then moved by the amount of its centripetal shift from its original position to either a text kernel pixel or a background pixel. All the pixels that are shifted into the region of the same text kernel form a text instance. In this manner, we can reconstruct the final text instances quickly and easily with a few matrix operations and several calls to OpenCV functions. Moreover, we develop an enhanced regression loss, namely the Relaxed L1 Loss, mainly for dense centripetal shift regression, which further improves the detection precision. Benefiting from this loss, our method is robust to prediction errors of the centripetal shifts, because any centripetal shift that guides a pixel into the region of the right text kernel is regarded as positive. Besides, CT can be fused with CNN-based text detectors or spotters in a plug-and-play manner. We replace the SPN in Mask TextSpotter v3 masktextspotterv3 with our CentripetalText Proposal Network (CPN), a proposal generation module based on CT, which produces more accurate proposals and further improves the end-to-end text recognition performance.
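The pixel-grouping step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not our actual implementation: `label_components` is a toy stand-in for the connected-component search that would normally be done with OpenCV, and all function names here are hypothetical.

```python
import numpy as np
from collections import deque

def label_components(binary):
    """Toy 4-connectivity connected-component labeling (stand-in for OpenCV)."""
    labels = np.zeros(binary.shape, dtype=int)
    cur = 0
    H, W = binary.shape
    for i in range(H):
        for j in range(W):
            if binary[i, j] and labels[i, j] == 0:
                cur += 1
                labels[i, j] = cur
                q = deque([(i, j)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = cur
                            q.append((ny, nx))
    return labels

def group_pixels(prob_map, shift_map, thresh=0.2):
    """Assign every pixel to the text kernel its centripetal shift lands in (0 = background).

    prob_map: (H, W) kernel probability map; shift_map: (2, H, W) with (dy, dx) shifts.
    """
    kernels = label_components(prob_map > thresh)
    H, W = prob_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.clip(np.round(ys + shift_map[0]).astype(int), 0, H - 1)
    tx = np.clip(np.round(xs + shift_map[1]).astype(int), 0, W - 1)
    return kernels[ty, tx]  # one matrix lookup groups all pixels in parallel
```

Note that the final grouping is a single vectorized gather, which mirrors the one-step, non-iterative nature of our post-processing.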
To evaluate the effectiveness of the proposed CT and Relaxed L1 Loss, we adopt the network architecture of PAN PAN and train a powerful scene text detector by replacing its text instance representation and loss function with ours. We conduct extensive experiments on the commonly used scene text benchmarks Total-Text totaltext, CTW1500 ctw1500, and MSRA-TD500 msratd500, demonstrating that our method achieves superior or competitive performance compared to the state of the art, e.g., an F-measure of 86.3% at 40.0 FPS on Total-Text and an F-measure of 86.1% at 34.8 FPS on MSRA-TD500. For the task of end-to-end text recognition, equipped with CPN, the F-measure of Mask TextSpotter v3 is further boosted to 71.9% and 79.5% without and with lexicon, respectively.
Major contributions of our work can be summarized as follows:
We propose a novel and efficient text instance representation method named CentripetalText, in which text instances are decomposed into the combination of text kernels and centripetal shifts. The accompanying post-processing algorithm is simple and robust, making the generation of text contours fast and accurate.
To reduce the burden of model training, we develop an enhanced loss function, namely the Relaxed L1 Loss, mainly for dense centripetal shift regression, which further improves the detection performance.
Equipped with the proposed CT and Relaxed L1 Loss, our scene text detector achieves superior or competitive results compared to other existing approaches on the curved or oriented text benchmarks, and our end-to-end scene text recognizer surpasses the current state of the art.
2 Related work
Text instance representation methods can be roughly classified into two categories: component-free methods and component-based methods.
Component-free methods treat every text instance as a complete whole and directly regress rotated rectangles or quadrangles to describe scene texts, without any reconstruction process. These methods are usually inspired by general object detectors such as Faster R-CNN fasterrcnn and SSD ssd, and often utilize heuristic anchors as prior knowledge. TextBoxes textboxes successfully adapted the object detection framework SSD to text detection by modifying the aspect ratios of anchors and the kernel scales of filters. TextBoxes++ textboxes++ and EAST east predict either rotated rectangles or quadrangles for text regions, with and without anchor priors, respectively. SPCNet spcnet modified Mask R-CNN maskrcnn by adding semantic segmentation guidance to suppress false positives.
Component-based methods prefer to model text instances from local perspectives and decompose instances into components such as characters or text fragments. SegLink seglink decomposed long texts into locally-detectable segments and links and combined the segments into whole words according to the links to get final detection results. MSR MSR detected scene texts by predicting dense text boundary points. PSENet psenet gradually expanded the detected areas from small kernels to complete instances via a progressive scale expansion algorithm.
3 Proposed method
In this section, we first introduce CT, our new representation for texts of arbitrary shapes. Then, we elaborate on our method and training details.
3.1 CentripetalText representation
An efficient scene text detector must have a well-defined representation for text instances. Traditional description methods inherited from generic object detection (e.g., rotated rectangles or quadrangles) fail to encode the geometric properties of irregular texts. To guarantee flexibility and generality, we propose a new method named CentripetalText, in which text instances are composed of text kernels and centripetal shifts. As demonstrated in Fig. 2, CT expresses a text instance as the cluster of pixels that are shifted into the region of the same text kernel by their centripetal shifts. As pixels are the basic units of digital images, CT is able to model texts of different forms, regardless of shape and length.
Mathematically, given an input image $I$, the ground truth annotations are denoted as $T = \{T_1, T_2, \dots, T_n\}$, where $T_i$ stands for the $i$-th text instance. Each text instance $T_i$ has a corresponding text kernel $K_i$, a shrunken version of the original text region. Since a text kernel is a subset of its text instance, i.e., $K_i \subseteq T_i$, we treat it as the basis of pixel aggregation. Instead of the embedding distances used in conventional methods, the centripetal shift, defined at every position of the image, guides the clustering of text pixels. In this sense, a text instance can be easily represented as the set of pixels that are shifted into the region of its text kernel by their centripetal shifts:
$$T_i = \{\, p \mid p + s(p) \in K_i \,\},$$
where $s(p)$ denotes the centripetal shift at pixel $p$.
The label generation for the probability map is inspired by PSENet psenet: the positive area of a text kernel is generated by shrinking the annotated polygon with the Vatti clipping algorithm vatti (see Fig. 4). The clipping offset is computed from the perimeter and area of the original polygon, and the shrink ratio is set to 0.7 empirically. Since the annotations in a dataset may not fit the text instances perfectly, we develop a training mask to distinguish the supervision of valid and ignored regions. The text instance excluding its text kernel ($T_i \setminus K_i$) is the ignored region, which means that the gradients in this area are not propagated back to the network. The training mask can be formulated as follows:
$$M(p) = \begin{cases} 0, & p \in T_i \setminus K_i \text{ for some } i, \\ 1, & \text{otherwise}. \end{cases}$$
We simply multiply the training mask with the loss of the segmentation branch to eliminate the influence of imperfect annotations.
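For concreteness, the clipping offset used for shrinking can be computed as in PSENet, $d = A(1 - r^2)/L$, where $A$ and $L$ are the area and perimeter of the annotated polygon and $r$ is the shrink ratio; the function name below is our own, and the actual polygon offsetting would be carried out with a clipping library such as pyclipper.

```python
def shrink_offset(polygon, ratio=0.7):
    """Clipping offset d = A * (1 - r^2) / L for shrinking a polygon into its text kernel.

    polygon: list of (x, y) vertices in order. Uses the shoelace formula for the area.
    """
    n = len(polygon)
    area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        area += x1 * y2 - x2 * y1
        perimeter += ((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
    area = abs(area) / 2.0
    return area * (1.0 - ratio ** 2) / perimeter
```

For a 10x10 square, this gives an offset of $100 \times (1 - 0.49) / 40 = 1.275$ pixels.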
In the label generation of the regression branch, a text instance affects the centripetal shift map in three ways. First, the centripetal shift in the background region should prevent background pixels from entering any text kernel, so we intuitively set it to $(0, 0)$. Second, the centripetal shift in the text kernel region $K_i$ should keep the kernel pixels where they are, so we also set it to $(0, 0)$ for convenience. Third, we expect that each pixel in the region of the text instance excluding the text kernel ($T_i \setminus K_i$) is guided to its corresponding kernel by the centripetal shift. Therefore, we conduct the erosion operation over the text kernel map twice in succession, compare the two temporary results, and obtain the text kernel reference (polygons with solid lines in Fig. 4) as the destination of the generated centripetal shift. As shown in Fig. 4, we build the centripetal shift between each pixel in the shaded area and its nearest text kernel reference, which prevents numerical accuracy issues caused by rounding. Note that if two instances overlap, the smaller one has higher priority. The generation of the centripetal shift can be formulated as follows:
$$s(p) = \begin{cases} r_p - p, & p \in T_i \setminus K_i, \\ (0, 0), & \text{otherwise}, \end{cases}$$
where $r_p$ represents the nearest text kernel reference to the pixel $p$. During training, the Smooth L1 loss fastrcnn is applied for supervision. Nevertheless, according to a previous observation GFL, such dense regression is modeled as a simple Dirac delta distribution, which fails to consider the ambiguity and uncertainty in datasets. To address this problem, we develop a regression mask as a relaxation and integrate it into the Smooth L1 loss to reduce the burden of model training. We extend the correct prediction from one specific value to a range: any centripetal shift that moves a pixel into the right region is treated as positive during training. The regression mask can be formulated as follows:
$$M_r(p) = \begin{cases} 0, & p \in T_i \setminus K_i \text{ and } p + \hat{s}(p) \in K_i, \\ 1, & \text{otherwise}. \end{cases}$$
Like the segmentation loss, we multiply the regression mask with the Smooth L1 loss to form a novel loss function, namely the Relaxed L1 loss, for dense centripetal shift prediction, further improving the detection accuracy. The Relaxed L1 loss can be formulated as follows:
$$L_{reg} = \frac{1}{\sum_p M_r(p)} \sum_p M_r(p) \cdot \mathrm{SmoothL1}\big(\hat{s}(p), s(p)\big),$$
where $\mathrm{SmoothL1}(\cdot)$ denotes the standard Smooth L1 loss, and $\hat{s}(p)$ and $s(p)$ denote the predicted centripetal shift at position $p$ and its ground truth, respectively.
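The relaxation amounts to checking where each predicted shift lands before applying the element-wise Smooth L1 term. The sketch below is illustrative (NumPy rather than our training framework), and the helper names are our own: `instance_id` marks which text kernel each pixel belongs to, and `kernel_labels` is the labeled kernel map.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Standard element-wise Smooth L1."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def relaxed_l1_loss(pred, gt, instance_id, kernel_labels):
    """Relaxed L1: the Smooth L1 term is masked out (M_r = 0) at text pixels whose
    predicted shift already lands inside their own text kernel.

    pred, gt: (2, H, W) shift maps; instance_id, kernel_labels: (H, W) integer maps.
    """
    H, W = kernel_labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.clip(np.round(ys + pred[0]).astype(int), 0, H - 1)
    tx = np.clip(np.round(xs + pred[1]).astype(int), 0, W - 1)
    landed = kernel_labels[ty, tx]                    # kernel each pixel lands in
    relaxed = (instance_id > 0) & (landed == instance_id)
    mask = ~relaxed                                   # regression mask M_r
    per_pixel = smooth_l1(pred - gt).sum(axis=0)
    return (per_pixel * mask).sum() / max(mask.sum(), 1)
```

An imperfect shift that still lands in the correct kernel incurs zero loss, which is exactly the tolerance to ambiguity that the relaxation is meant to provide.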
3.2 Scene text detection with CentripetalText
In order to detect texts of arbitrary shapes quickly and accurately, we adopt the efficient model design of PAN PAN and equip it with our CT and Relaxed L1 loss. First, ResNet18 resnet is used as the default backbone for a fair comparison. Then, to remedy the weak representation ability of the lightweight backbone, two cascaded FPEMs PAN successively enhance the feature pyramid in both top-down and bottom-up manners. Afterwards, the feature pyramids of different depths are fused by the FFM PAN into a single basic feature. Finally, we predict the probability map and the centripetal shift map from the basic feature for contour generation.
Our loss function can be formulated as:
$$L = L_{ker} + \alpha L_{reg},$$
where $L_{ker}$ denotes the segmentation loss of text kernels and $L_{reg}$ denotes the regression loss of centripetal shifts. $\alpha$ is a constant balancing the weights of the segmentation and regression losses; we set it to 0.05 in all experiments. Specifically, the prediction of text kernels is basically a pixel-wise binary classification problem, and we apply the dice loss dice for this part. Equipped with the training mask $M$, the segmentation loss can be formulated as follows:
$$L_{ker} = \mathrm{Dice}\big(\hat{k}(p) \cdot M(p),\; k(p) \cdot M(p)\big),$$
where $\mathrm{Dice}(\cdot)$ denotes the dice loss function, and $\hat{k}(p)$ and $k(p)$ denote the predicted probability of text kernels at position $p$ and its ground truth, respectively. Note that we adopt Online Hard Example Mining (OHEM) ohem to address the imbalance of positives and negatives when calculating $L_{ker}$. Regarding the regression loss $L_{reg}$, a detailed description has been provided in Sec. 3.1.
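The objective above can be sketched as follows. `masked_dice_loss` is a plain dice loss restricted by the training mask $M$ (OHEM is omitted for brevity), and the function names are illustrative rather than taken from any framework.

```python
import numpy as np

def masked_dice_loss(pred, gt, train_mask, eps=1e-6):
    """Dice loss on the kernel map; ignored regions (M = 0) contribute nothing."""
    p = pred * train_mask
    g = gt * train_mask
    inter = 2.0 * (p * g).sum()
    return 1.0 - (inter + eps) / ((p * p).sum() + (g * g).sum() + eps)

def total_loss(l_ker, l_reg, alpha=0.05):
    """L = L_ker + alpha * L_reg, with alpha = 0.05 as in all our experiments."""
    return l_ker + alpha * l_reg
```

Note how a wrong prediction inside the ignored region is invisible to the loss, which is the intended effect of the training mask.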
The procedure of contour reconstruction is shown in Fig. 3. After feed-forwarding, the network produces the probability map and the centripetal shift map. We first binarize the probability map with a constant threshold (0.2) to obtain the binary map. Then, we find the connected components (text kernels) of the binary map as the clusters for pixel aggregation. Afterwards, we assign each pixel to a cluster according to which text kernel (or the background) the pixel is shifted into by its centripetal shift. Finally, we build the text contour from the pixels in each group. Note that our post-processing differs essentially from that of PAN PAN. The post-processing in PAN is an iterative procedure that gradually extends a text kernel to the text region by repeatedly merging neighboring pixels. In contrast, we conduct the aggregation in one step: the shift calculation for all pixels is performed in parallel by a single matrix operation, saving inference time.
3.3 CentripetalText Proposal Network
Our scene text detector is reduced to a text proposal module, termed the CentripetalText Proposal Network (CPN), by transforming the polygonal outputs into minimum-area rectangles and instance masks. We follow the main design of the text detection and recognition modules of Mask TextSpotter v3 masktextspotterv3 and replace its SPN with our CPN to compare proposal quality and recognition accuracy.
4 Experiments
4.1 Datasets
SynthText synthtext is a synthetic dataset consisting of more than 800,000 synthetic images. This dataset is used only to pre-train our model.
Total-Text totaltext is a curved text dataset including 1,255 training images and 300 testing images. It contains horizontal, multi-oriented, and curved text instances labeled at the word level.
CTW1500 ctw1500 is another curved text dataset including 1,000 training images and 500 testing images. The text instances are annotated at the text-line level with 14-point polygons.
MSRA-TD500 msratd500 is a multi-oriented text dataset containing 300 training images and 200 testing images with text-line level annotations. Due to its small scale, we follow previous works east; textsnake and include 400 extra training images from HUST-TR400 hust.
| Method | Regression loss | F (Total-Text) | FPS | F (MSRA-TD500) | FPS |
| CT (Ours) | Smooth L1 fastrcnn | 83.0 | 40.0 | 81.0 | 34.8 |
4.2 Implementation details
To make fair comparisons, we use the same training settings described below. ResNet resnet pre-trained on ImageNet imagenet is used as the backbone of our method. All models are optimized by the Adam optimizer with a batch size of 16 on 4 GPUs. We train our models under two strategies: (1) learning from scratch; (2) fine-tuning models pre-trained on the SynthText dataset. When pre-training is adopted, models are trained on SynthText for 50k iterations with a fixed learning rate; models are trained on the real datasets for 36k iterations with the "poly" learning rate policy zhaoetal, where the power is set to 0.9. Data augmentation follows the official implementation of PAN, including random scale, random horizontal flip, random rotation, and random crop. Blurred texts labeled as DO NOT CARE are ignored during training. In addition, we set the negative-positive ratio of OHEM to 3 and the shrink ratio of the text kernel to 0.7. All models are tested with a batch size of 1 on a GTX 1080Ti GPU without bells and whistles. As for end-to-end recognition, we leave the original training and testing settings of Mask TextSpotter v3 unchanged.
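The "poly" policy mentioned above decays the learning rate as $lr = lr_0 \times (1 - iter/max\_iter)^{power}$; a minimal sketch (function name ours):

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """'poly' learning-rate schedule: decays smoothly from base_lr to 0 over max_iter steps."""
    return base_lr * (1.0 - it / max_iter) ** power
```

With power = 0.9 the decay is slightly slower than linear for most of training and drops sharply near the end.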
4.3 Ablation study
To analyze our designs in depth, we conduct a series of ablation studies on both curved and multi-oriented text datasets (Total-Text and MSRA-TD500). All models in this subsection are trained from scratch.
On the one hand, to make full use of the capability of the proposed CT, we try different backbones and necks for an efficient architecture. As shown in Tab. 2, although "ResNet18 + FPN" and "ResNet50 + FPEM" are the fastest and the most accurate detectors, respectively, "ResNet18 + FPEM" achieves the best tradeoff between accuracy and speed. Thus, we keep this combination by default in the following experiments. On the other hand, we study the validity of the Relaxed L1 loss by replacing it with alternatives. Compared with the baseline Smooth L1 loss fastrcnn and the newly released Balanced L1 loss librarcnn, the F-measure of our method improves by over 1% on both datasets, which indicates the effectiveness of the Relaxed L1 loss. Moreover, under the same model architecture, we outperform PAN by a wide margin while keeping its fast inference speed, indicating that the proposed CT is more efficient.
4.4 Comparisons with state-of-the-art methods
Curved text detection
We first evaluate CT on Total-Text and CTW1500 to test its ability for curved text detection. During testing, we set the short side of images to different scales (320, 512, 640) and keep their aspect ratios. We compare with state-of-the-art detectors in Tab. 3. When learning from scratch, CT-640 achieves a competitive F-measure of 84.9%, surpassing most state-of-the-art methods pre-trained on external text datasets. When pre-trained on SynthText, our best model CT-640 reaches an F-measure of 86.3%, which is 0.6% better than the second-best DRRG DRRG, while still running at real-time speed (40.0 FPS). Fig. 1 shows the accuracy-speed tradeoff of the top-performing real-time text detectors, from which it can be observed that CT pushes the accuracy-speed boundary. Analogous results are obtained on CTW1500. With external training data, the F-measure of CT-640 is 83.9%, second among all methods and lower only than DRRG, while the speed still exceeds 40 FPS. In summary, the experiments on Total-Text and CTW1500 demonstrate that the proposed CT achieves superior or competitive results compared to state-of-the-art methods, indicating its superiority in modeling curved texts. We visualize detection results in Fig. 5 for further inspection.
Multi-oriented text detection
We also evaluate CT on MSRA-TD500 to test its robustness in modeling multi-oriented texts. As shown in Tab. 4, CT achieves an F-measure of 83.0% at 34.8 FPS without external training data, outperforming PAN by 4.1%. When pre-trained on SynthText, the F-measure of CT is further boosted to 86.1%. The highest performance and the fastest speed achieved by CT prove its ability to generalize to texts with extreme aspect ratios and various orientations in complex natural scenes.
| Method | F (no lexicon) | F (full lexicon) |
| Mask TextSpotter v1 masktextspotterv1 | 52.9 | 71.8 |
| Qin et al. qin | 63.9 | - |
| Mask TextSpotter v2 masktextspotterv2 | 65.3 | 77.4 |
| Mask TextSpotter v3 masktextspotterv3 | 71.2 | 78.4 |
| Mask TextSpotter v3 w/ CPN | 71.9 | 79.5 |
End-to-end text recognition
We simply replace the SPN in Mask TextSpotter v3 with our proposed CPN to build a more powerful end-to-end text recognizer, and evaluate the CPN-based text spotter on Total-Text to test the proposal quality for the text spotting task. As shown in Tab. 5, equipped with CPN, Mask TextSpotter v3 achieves F-measure values of 71.9% and 79.5% without and with the lexicon, respectively. Compared with the original version and other state-of-the-art methods, ours obtains higher performance whether the lexicon is provided or not. These quantitative results demonstrate that CPN produces more accurate text proposals than SPN, which benefits recognition and further improves end-to-end performance.
We visualize the text proposals and the polygon masks generated by SPN and CPN for an intuitive comparison. As shown in Fig. 6, the polygon masks produced by CPN fit the text instances more tightly, which qualitatively confirms the superiority of CPN in producing text proposals.
5 Conclusion
To keep text instance representation simple and robust, we propose CentripetalText (CT), which decomposes text instances into the combination of text kernels and centripetal shifts. Text kernels identify the skeletons of text instances, while centripetal shifts guide the external text pixels to the internal text kernels. Moreover, to reduce the burden of model training, a relaxation operation is integrated into the dense regression of centripetal shifts, allowing correct predictions within a range rather than at a single value. Equipped with the proposed CT, our detector achieves superior or comparable performance compared to state-of-the-art methods while keeping real-time inference speed. The source code is available at ***, and we hope that the proposed CT can serve as a common representation for scene texts.
Appendix A Rotation robustness analysis
| Rotation angle (°) | CharNet charnet | MTSv2 masktextspotterv2 | MTSv3 masktextspotterv3 | MTSv3 w/ CPN |
To further demonstrate the rotation robustness of our method, we evaluate our CPN-based text spotter on the Rotated ICDAR2013 dataset.
Rotated ICDAR2013 masktextspotterv3 is an augmented text dataset generated from ICDAR2013 icdar2013. To form it, all the images and annotations in the test set of the ICDAR2013 benchmark are rotated by a set of specific angles. The dataset contains 229 training images and 233 testing images. The text instances are annotated at the text-line level with rotated rectangles. Since the annotations are extended from horizontal rectangles to multi-oriented ones, we adopt the evaluation protocol of the ICDAR2015 dataset icdar2015.
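Generating such a rotated benchmark amounts to rotating each annotation point about the image center; a minimal sketch of the point rotation (function name ours, image resampling omitted):

```python
import math

def rotate_point(x, y, cx, cy, angle_deg):
    """Rotate (x, y) about the center (cx, cy) by angle_deg (counter-clockwise),
    turning axis-aligned box corners into rotated-rectangle corners."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))
```

Applying this to the four corners of each horizontal rectangle yields the multi-oriented annotations used for evaluation.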
As shown in Tab. 6, we compare three top-performing methods, CharNet charnet, Mask TextSpotter v2 masktextspotterv2, and Mask TextSpotter v3 masktextspotterv3, with our proposed text spotter at different rotation angles. CharNet and Mask TextSpotter v2 fail to deal with multi-oriented texts, and their performance falls well below ours. Moreover, our method surpasses Mask TextSpotter v3 by more than 1.5% at several rotation angles and obtains competitive performance at the others. These experiments prove the superior robustness of our method to various orientations of scene texts.
Appendix B Limitation
Although the proposed CT works well in most cases of scene text detection, it still fails in some difficult cases, as shown in Fig. 7. On the one hand, our method may mistakenly treat decorative patterns as texts and thus produce false positives (see Fig. 7). In this situation, a subsequent recognition module can effectively suppress such failures based on high-level semantic information. On the other hand, whether two close text instances should be merged into one remains a challenging problem that strongly influences detection performance (see Fig. 7). In the future, we plan to address this problem and make the model more robust.