Bidirectional Regression for Arbitrary-Shaped Text Detection

by Tao Sheng, et al.
Peking University

Arbitrary-shaped text detection has recently attracted increasing interest and witnessed rapid development with the popularity of deep learning algorithms. Nevertheless, existing approaches often obtain inaccurate detection results, mainly due to their relatively weak ability to utilize context information and their inappropriate choice of offset references. This paper presents a novel text instance expression which integrates both foreground and background information into the pipeline, and naturally uses the pixels near text boundaries as the offset starts. In addition, a corresponding post-processing algorithm is designed to sequentially combine the four prediction results and reconstruct the text instance accurately. We evaluate our method on several challenging scene text benchmarks, including both curved and multi-oriented text datasets. Experimental results demonstrate that the proposed approach obtains superior or competitive performance compared to other state-of-the-art methods, e.g., 83.4





1 Introduction

In many text-related applications, robustly detecting scene texts with high accuracy, namely localizing the bounding box or region of each text instance with high Intersection over Union (IoU) to the ground truth, is fundamental and crucial to the quality of service. For example, in vision-based translation applications, the process of generating clear and coherent translations is highly dependent on the text detection accuracy. However, due to the variety of text scales, shapes and orientations, and the complex backgrounds in natural images, scene text detection is still a tough and challenging task.

With the rapid development of deep convolutional neural networks (DCNN), a number of effective methods [44, 41, 38, 36, 35, 34] have been proposed for detecting texts in scene images, achieving promising performance. Most of these DCNN-based approaches can be roughly classified into two categories: regression-based methods with anchors and segmentation-based methods. Regression-based methods are typically motivated by generic object detection [30, 21, 11, 2], and treat text instances as a specific kind of object. However, it is difficult to manually design appropriate anchors for irregular texts. In contrast, segmentation-based methods regard scene text detection as a segmentation task [23, 5, 8, 20] and need extra predictions besides segmentation results to rebuild text instances. Specifically, as shown in Fig. 1, MSR [38] predicts central text regions and distance maps according to the distance between each predicted text pixel and its nearest text boundary, which brings state-of-the-art scene text detection accuracy. Nevertheless, the performance of MSR is still restricted by its relatively weak capability to utilize the context information around the text boundaries. That is to say, MSR only uses the pixels inside the central areas for detection, while ignoring the pixels around the boundaries, which carry necessary context information. Moreover, regressing from central text pixels makes the positions of the predicted boundary points ambiguous, because there is a large gap between the two, which requires the network to have a large receptive field.

Figure 1: The comparison of detection results of MSR [38] and our method. In the first row, MSR predicts central text regions (top right) and distance maps according to the distance between each predicted text pixel and its nearest text boundary (top left). In the second row, we choose the pixels around the text boundary for further regression. The pixels for regression are marked as light blue.

To solve these problems, we propose an arbitrary-shaped text detector, which makes full use of context information around the text boundaries and predicts the boundary points in a natural way. Our method contains two steps: 1) input images are fed to the convolutional network to generate text regions, text kernels, pixel offsets, and pixel orientations; 2) at the post-processing stage, these predictions are fused one by one to rebuild the final results. Unlike MSR, we choose the pixels around the text boundaries for further regression (Fig. 1 bottom). Because the regression is computed in two directions, either from the external or from the internal text region to the text boundary, we call our method Bidirectional Regression. For a fair comparison, we choose a simple and commonly used network structure as the feature extractor: ResNet50 [12] as the backbone and FPN [19] as the detection neck. Then, the segmentation branch predicts text regions and text kernels, while the regression branch predicts pixel offsets and pixel orientations, which help to separate adjacent text lines. We also propose a novel post-processing algorithm to reconstruct the final text instances accurately. We conduct extensive experiments on four challenging benchmarks, namely Total-Text [3], CTW1500 [41], ICDAR2015 [14], and MSRA-TD500 [39], which demonstrate that our method obtains superior or comparable performance compared to the state of the art.

Major contributions of our work can be summarized as follows:

  • To overcome the drawbacks of MSR, we propose a new kind of text instance expression which makes full use of context information and implements the regression in a natural manner.

  • To get complete text instances robustly, we develop a novel post-processing algorithm that sequentially combines the predictions to get accurate text instances.

  • The proposed method achieves superior or comparable performance compared to other existing approaches on two curved text benchmarks and two oriented text benchmarks.

2 Related Work

Convolutional neural network approaches have recently been very successful in the area of scene text detection. CNN-based scene text detectors can be roughly classified into two categories: regression-based methods and segmentation-based methods.

Regression-based detectors usually inherit from generic object detectors, such as Faster R-CNN [30] and SSD [21], directly regressing the bounding boxes of text instances. TextBoxes [15] adjusts the aspect ratios of anchors and the scales of convolutional kernels of SSD to deal with the significant variation of scene texts. TextBoxes++ [16] and EAST [44] further regress quadrangles of multi-oriented texts at the pixel level with and without anchors, respectively. For better detection of long texts, RRD [18] generates rotation-invariant features for classification and rotation-sensitive features for regression. RRPN [27] modifies Faster R-CNN by adding rotation to proposals for tilted text detection. The methods mentioned above achieve excellent results on several benchmarks. Nevertheless, most of them suffer from complex anchor settings or an inadequate description of irregular texts.

Segmentation-based detectors prefer to treat scene text detection as a semantic segmentation problem and apply a matching post-processing algorithm to get the final polygons. Zhang et al. [42] utilized FCN [23] to estimate text regions and further distinguish characters with MSER [29]. In PixelLink [6], text/non-text and link predictions at the pixel level are carried out to separate adjacent text instances. Chen et al. [4] proposed the concept of attention-guided text border for better training. SPCNet [36] and Mask TextSpotter [25] adopt the architecture of Mask R-CNN [11] in the instance segmentation task to detect texts with arbitrary shapes. In TextSnake [24], text instances are represented with text center lines and ordered disks. MSR [38] predicts central text regions and distance maps according to the distance between each predicted text pixel and its nearest text boundary. PSENet [35] proposes a progressive scale expansion algorithm, learning text kernels with multiple scales. However, these methods all lack a rational utilization of context information, and thus often produce inaccurate text detections.

Some previous methods [4, 25, 37] also try to strengthen the utilization of context information according to the inference of the text border map or the perception of the whole text polygon. Regrettably, they only focus on pixels inside the text polygon and miss pixels outside the label. Different from existing methods, a unique text expression is proposed in this paper to force the network to extract foreground and background information simultaneously and produce a more representative feature for precise localization.

3 Methodology

In this section, we first compare the text instance expressions of common object detection models, a curved text detector MSR [38], and ours. Then, we describe the whole network architecture of our proposed method. Afterwards, we elaborate on the post-processing algorithm and the generation procedure of arbitrary-shaped text instances. Finally, the details of the loss function in the training phase are given.

3.1 Text Instance Expression

Bounding Box A robust scene text detector must have a well-defined expression for text instances. In generic object detection methods, text instances are always represented as bounding boxes, namely rotated rectangles or quadrangles, whose shapes heavily rely on vertices or geometric centers. However, it is difficult to determine the vertices or geometric center of a curved text instance, especially in natural images. Besides, as shown in Fig. 2(a), the bounding box (orange solid line) cannot fit the boundary (green solid line) of the curved text well and introduces a large amount of background noise, which could be problematic for detecting scene texts. Moreover, at the stage of scene text recognition, the features extracted in this way may lead the model to produce incorrect recognition results.

MSR does not have this expression problem because, as shown in Fig. 2(b), it predicts central text regions (violet area) and offset maps according to the distance (orange line with arrow) between each predicted text pixel and its nearest text boundary. Note that the central text regions are derived from the original text boundary (green solid line), which helps to separate adjacent words or text lines. Obviously, a central text region can be discretized into a set of points $\{p_1, p_2, \dots, p_T\}$, where $T$ is the number of points. Naturally, the offset between $p_i$ and its nearest boundary can be represented as $\{\Delta p_i \mid i = 1, 2, \dots, T\}$. Further, the contour of the text instance can be represented with the point set

$$R = \{\, p_i + \Delta p_i \mid i = 1, 2, \dots, T \,\}. \tag{1}$$


We observe that MSR focuses on the location of center points and the relationship between center and boundary points, while ignoring the crucial context information around the text boundaries and choosing inappropriate references as the starts of offsets. More specifically: 1) the ignored features around the boundary are the most discriminative part of the text instance’s features extracted from the whole image, because the foreground information inside the boundary and the background information outside the boundary are almost completely different and easy to distinguish; 2) compared to the points around the boundary, the center points are farther away from the boundary and cannot give enough evidence to decide the position of the scene text instance. In other words, if we utilize the context information well and choose better references, we may be able to address MSR’s limitations and further improve the detection performance.

Figure 2: The comparison of three text instance expressions: (a) bounding boxes, (b) MSR’s expression [38], and (c-f) our Bidirectional Regression.

Bidirectional Regression (Ours) We now define our text instance expression so that the generated feature maps can have more useful contextual information for detection. We use four predictions to represent a text instance: the text region, text kernel, pixel offset, and pixel orientation. As shown in Fig. 2(c), inspired by PSENet [35], the text kernel is generated by shrinking the annotated polygon $p$ (green solid line) to the yellow dotted line using the Vatti clipping algorithm [33]. The offset of shrinking $d$ is computed based on the perimeter and area of the original polygon $p$:

$$d = \frac{\mathrm{Area}(p) \times (1 - \alpha^2)}{\mathrm{Perimeter}(p)}, \tag{2}$$

where $\alpha$ is the shrink ratio, set to 0.6 empirically. To contain the background information, we enlarge the annotated polygon to the blue dotted line in the same way and produce the text region. The expansion ratio is set to 1.2 in this work. Similarly, the text kernel and the text region can be treated as point sets, and the text border (violet area in Fig. 2(d)) is the difference set of both, which can be formulated as:

$$S_{\mathrm{border}} = S_{\mathrm{region}} \setminus S_{\mathrm{kernel}}. \tag{3}$$
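The offset computation can be sketched in a few lines of Python. This is a minimal illustration that assumes the PSENet-style form d = Area × (1 − ratio²) / Perimeter; the helper names (`polygon_area`, `polygon_perimeter`, `clip_offset`) and the example box are hypothetical, and the Vatti clipping step itself is not shown.

```python
import math

def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))

def clip_offset(points, ratio):
    """Assumed offset form: positive means shrink inward, negative means enlarge."""
    return polygon_area(points) * (1.0 - ratio ** 2) / polygon_perimeter(points)

# Hypothetical 100 x 20 text box (area 2000, perimeter 240).
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
kernel_offset = clip_offset(box, 0.6)  # shrink ratio from the text
region_offset = clip_offset(box, 1.2)  # expansion ratio from the text
```

Under this assumed form, a ratio above 1 yields a negative offset, which is consistent with enlarging the polygon "in the same way", and a ratio of exactly 1.0 leaves the annotated polygon unchanged.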


For convenience, we use $\{p_i \mid i = 1, 2, \dots, T\}$ to represent the text border, where $T$ is the number of points in this area. As shown in Fig. 2(e) and Fig. 2(f), we establish the pixel offset map according to the distance (orange line with arrow) between each text border point and its nearest text boundary, and the pixel orientation map according to the orientation (white line with arrow) from each text border point to its nearest text kernel point. Note that if two instances overlap, the smaller one has higher priority. Like MSR, we use the sets $\{\Delta p_i \mid i = 1, 2, \dots, T\}$ and $\{\vec{\theta}_i \mid i = 1, 2, \dots, T\}$ to represent the pixel offset and the pixel orientation respectively, where $\vec{\theta}_i$ is a unit vector. Similar to Eq. 1, the final predicted contour can be formulated as follows:

$$R = \left\{\, p_i + \Delta p_i \;\middle|\; D(p_i, k_i) < \gamma,\ \frac{\overrightarrow{p_i k_i}}{D(p_i, k_i)} \cdot \vec{\theta}_i > \epsilon,\ i = 1, 2, \dots, T \,\right\}, \tag{4}$$

where $k_i$ is the text kernel point nearest to $p_i$ and $D(\cdot, \cdot)$ means the Euclidean distance between two points. $\gamma$ and $\epsilon$ are constants chosen empirically. That is to say, if a text border point is close enough to a text kernel point, and the vector formed by these two points has a direction similar to the predicted pixel orientation, then the text border point is shifted by the predicted pixel offset and treated as a boundary point of the final text instance.

Comparing the areas for regression in MSR and our approach (Fig. 2(b)(e)), we can see that our proposed method chooses the points much closer to the boundary so that we do not require the network to have large receptive fields when detecting large text instances. Meanwhile, our model learns the foreground and background features macroscopically, and mixes them into a more discriminating one to localize the exact position of scene texts. To sum up, we design a powerful expression for arbitrary-shaped text detection.
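The selection rule for boundary points can be illustrated with a small Python sketch. It is a hedged reading of the prose above, not the authors' implementation: the function name `shift_border_points`, the toy coordinates, and the threshold values `gamma=3.0` and `eps=0.9` are all illustrative assumptions.

```python
import math

def shift_border_points(border, offsets, orientations, kernel, gamma, eps):
    """Keep and shift the border points that agree with one text kernel.

    border:       list of (x, y) text border points
    offsets:      predicted (dx, dy) for each border point
    orientations: predicted unit vectors (ux, uy) for each border point
    kernel:       list of (x, y) points belonging to one text kernel
    gamma, eps:   distance and cosine thresholds (illustrative values only)
    """
    contour = []
    for (px, py), (dx, dy), (ux, uy) in zip(border, offsets, orientations):
        # Nearest kernel point by Euclidean distance.
        kx, ky = min(kernel, key=lambda k: math.dist(k, (px, py)))
        dist = math.dist((kx, ky), (px, py))
        if dist == 0 or dist >= gamma:
            continue  # degenerate or too far from the kernel: discard
        # Cosine between the border-to-kernel direction and the predicted orientation.
        cos_sim = ((kx - px) * ux + (ky - py) * uy) / dist
        if cos_sim > eps:
            contour.append((px + dx, py + dy))
    return contour

kernel = [(5.0, 5.0), (6.0, 5.0)]
border = [(5.0, 3.0), (5.0, 20.0), (6.0, 7.0)]
offsets = [(0.0, -1.0), (0.0, 1.0), (0.0, 1.0)]
orientations = [(0.0, 1.0), (0.0, -1.0), (1.0, 0.0)]
points = shift_border_points(border, offsets, orientations, kernel, gamma=3.0, eps=0.9)
```

In this toy input only the first border point survives: the second is too far from the kernel, and the third has a predicted orientation perpendicular to the direction of its nearest kernel point.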

3.2 Network Architecture

From the network’s perspective, our model is surprisingly simple, and Fig. 3 illustrates the whole architecture. For a fair comparison with MSR, we employ ResNet-50 [12] as the backbone network to extract initial multi-scale features from input images. A total of 4 feature maps are generated from the Res2, Res3, Res4, and Res5 layers of the backbone, and they have strides of 4, 8, 16, and 32 pixels with respect to the input image, respectively. To reduce the computational cost and the network complexity, we use $1 \times 1$ convolutions to reduce the channel number of each feature map to 256. Inspired by FPN [19], the feature pyramid is enhanced by gradually merging the features of adjacent scales in a top-down manner. Then, the deep but thin feature maps are fused by bilinear interpolation and concatenation into a basic feature, whose stride is 4 pixels and whose channel number is 1024. The basic feature is used to predict text regions, text kernels, pixel offsets, and pixel orientations simultaneously. Finally, we apply a sequential and efficient post-processing algorithm to obtain the final text instances.
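The shape bookkeeping in this paragraph can be checked with a short Python sketch. It only does the arithmetic on strides and channels and does not build a network; the function name `fused_feature_shape` is hypothetical, and the input size is assumed to be divisible by 32.

```python
def fused_feature_shape(height, width, strides=(4, 8, 16, 32), reduced_channels=256):
    """Return per-level (C, H, W) shapes and the shape of the concatenated basic feature."""
    per_level = [(reduced_channels, height // s, width // s) for s in strides]
    # Every level is upsampled to the finest stride before concatenation,
    # so the fused map keeps the stride-4 resolution and stacks the channels.
    _, out_h, out_w = per_level[0]
    fused = (reduced_channels * len(strides), out_h, out_w)
    return per_level, fused

levels, basic = fused_feature_shape(640, 640)
```

For a 640 × 640 input this gives four 256-channel maps at 160, 80, 40, and 20 pixels per side, and a basic feature of 1024 channels at 160 × 160, matching the stride of 4 and the channel number of 1024 stated in the text.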

Figure 3: An overview of our proposed model. Our method contains two components, the CNN-based network and the post-processing algorithm. (a) We use ResNet-50 and FPN to extract the feature pyramid, and concatenate them into a basic feature for further prediction. (b) The post-processing algorithm takes two forecasts as inputs and produces a new one every step, reconstructing the scene texts with arbitrary shapes finally.

3.3 Post-Processing

As described in the above subsection and illustrated in Fig. 3(a-d), four predictions are generated from the basic feature, followed by the post-processing algorithm. The text region can describe the text instance coarsely but cannot separate two adjacent instances (Fig. 3(c)). In contrast, the text kernel can separate them but cannot describe them (see Fig. 3(d)). Therefore, we use text kernels to determine the coarse position, then use pixel orientations to classify the ungrouped text region points, and use pixel offsets to slightly modify the contours.

We first find the connected components in the text kernel map, and each connected component represents the kernel of a single text instance. For better visualization, different kernels are painted with different colors (see Fig. 3(d)). Moreover, the pixel offset map and the pixel orientation map are difficult to visualize, so we replace the real ones with diagrammatic sketches of a fictitious scene (see Fig. 3(a)(b)). Then, we combine the text region and the text kernel to obtain the text border (see Fig. 3(e)), which is the difference set of the two predictions and, at the same time, the collection of ungrouped text region points. Combined with the pixel orientation, each text border point has its own orientation. As shown in Fig. 3(f), four colors (yellow, green, violet, and blue) represent the four directions (up, down, left, and right), respectively. Afterwards, the oriented border and the text kernel are combined to assign the text border points to the connected components previously found in the text kernel map (see Fig. 3(g)), according to the difference between the predicted orientation and the orientation from each text border point to its nearest text kernel point. A text border point is discarded if it is too far from its nearest text kernel point. Furthermore, each grouped border point is shifted to its nearest point on the text boundary, which is calculated by adding the predicted offset in the pixel offset map to the coordinates of the point (see Fig. 3(h)). Finally, we adopt the Alpha-Shape Algorithm to produce concave polygons enclosing the shifted points of each text kernel group (see Fig. 3(i)), precisely reconstructing the shapes of the text instances in scene images. Through this post-processing algorithm, we can detect scene texts with arbitrary shapes quickly and accurately, as shown experimentally in Section 4.
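The first two steps of this procedure, labelling the kernels and extracting the text border, can be sketched as follows. This is a simplified illustration on binary maps stored as nested lists; the function names are hypothetical, the choice of 4-connectivity is an assumption, and the orientation-based grouping and the Alpha-Shape step are not shown.

```python
from collections import deque

def label_kernels(kernel_map):
    """Label 4-connected components of a binary kernel map (list of rows)."""
    h, w = len(kernel_map), len(kernel_map[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if kernel_map[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill of one kernel
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and kernel_map[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

def text_border(region_map, kernel_map):
    """Border pixels: inside the predicted text region but outside every kernel."""
    return [(y, x) for y, row in enumerate(region_map)
            for x, v in enumerate(row) if v and not kernel_map[y][x]]

# Toy maps: one merged text region that covers two separate kernels.
region = [[1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1, 1, 1]]
kernel = [[0, 0, 0, 0, 0, 0, 0],
          [0, 1, 1, 0, 1, 1, 0],
          [0, 0, 0, 0, 0, 0, 0]]
labels, num_kernels = label_kernels(kernel)
border = text_border(region, kernel)
```

The toy example mirrors the situation described in the text: the region map alone cannot separate the two adjacent instances, while the kernel map yields two components to which the 17 border pixels would then be assigned.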

3.4 Loss Function

Our loss function is a weighted sum of four terms: $L_{tr}$ and $L_{tk}$, the binary segmentation losses of text regions and text kernels respectively; $L_{po}$, the regression loss of pixel offsets; and $L_{or}$, the orientation loss of pixel orientations. Two normalization constants balance the weights of the segmentation and regression losses. We set them to 0.5 and 0.1 in all experiments.

The prediction of text regions and text kernels is basically a pixel-wise binary classification problem. We follow MSR and adopt the dice loss [28] for this part. Considering the imbalance of text and non-text pixels in the text regions, Online Hard Example Mining (OHEM) [32] is also adopted to select the hard non-text pixels when calculating the text region loss $L_{tr}$.
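As a minimal sketch of the segmentation term, the dice loss on flattened maps can be written as below. The squared-sum denominator follows a common formulation and, like the smoothing constant `eps`, is an assumption rather than a detail taken from the text; OHEM selection is omitted.

```python
def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for flat lists of probabilities and 0/1 labels."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

perfect = dice_loss([1.0, 0.0, 1.0], [1, 0, 1])  # prediction equals the label
disjoint = dice_loss([0.0, 1.0], [1, 0])         # no overlap at all
```

A perfect prediction gives a loss of zero and a prediction with no overlap gives a loss close to one, which is why the dice loss is less sensitive to the large number of easy background pixels than a per-pixel average would be.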

The prediction of the distance from each text border point to its nearest text boundary is a regression problem. Following the regression for bounding boxes in generic object detection, we use the Smooth L1 loss [9] for supervision, which is defined as:

$$L_{po} = \frac{1}{T} \sum_{i=1}^{T} \mathrm{SmoothL1}\left(\Delta p_i - \Delta p_i^{*}\right),$$

where $\Delta p_i$ and $\Delta p_i^{*}$ denote the predicted offset and the corresponding ground truth, respectively, $\mathrm{SmoothL1}(\cdot)$ denotes the standard Smooth L1 loss, and $T$ denotes the number of text border points. Moreover, the prediction of the orientation from each text border point to its nearest text kernel point can also be treated as a regression problem. For simplicity, we adopt the cosine loss defined as follows:

$$L_{or} = \frac{1}{T} \sum_{i=1}^{T} \left(1 - \vec{\theta}_i \cdot \vec{\theta}_i^{*}\right),$$

where $\vec{\theta}_i \cdot \vec{\theta}_i^{*}$ denotes the dot product of the predicted direction vector and its ground truth, which is equal to the cosine of the angle between these two orientations. Note that we only take the points in the text border into consideration when calculating the offset loss $L_{po}$ and the orientation loss $L_{or}$.
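The two regression terms can be sketched in plain Python as follows. This is an illustrative reading under stated assumptions: the Smooth L1 threshold `beta=1.0` is the common default, summing the two coordinates per point and averaging over border points are assumptions, and the function names are hypothetical.

```python
def smooth_l1(x, beta=1.0):
    """Standard Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def offset_loss(pred, gt):
    """Mean Smooth L1 over border points, summing both coordinates per point."""
    total = sum(smooth_l1(px - gx) + smooth_l1(py - gy)
                for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(pred)

def orientation_loss(pred, gt):
    """Mean of 1 - cos(angle) between predicted and ground-truth unit vectors."""
    total = sum(1.0 - (px * gx + py * gy) for (px, py), (gx, gy) in zip(pred, gt))
    return total / len(pred)

po = offset_loss([(0.5, 2.0)], [(0.0, 0.0)])
orient = orientation_loss([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (1.0, 0.0)])
```

In the toy inputs, the offset error of 0.5 falls in the quadratic regime and the error of 2.0 in the linear regime, while a perpendicular orientation prediction contributes a cosine loss of 1 and a perfectly aligned one contributes 0.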

4 Experiments

4.1 Datasets

SynthText [10] is a synthetic dataset containing more than 800,000 synthetic scene text images, most of which are annotated at word level with multi-oriented rectangles. We pre-train our model on this dataset.
Total-Text [3] is a curved text dataset which contains 1,255 training images and 300 testing images. The texts are all in English and contain a large number of horizontal, multi-oriented, and curved text instances, each of which is annotated at word level with a polygon.
CTW1500 [41] is another curved text dataset that has 1,000 images for training and 500 images for testing. The dataset focuses on curved texts, which are largely in English and Chinese, and annotated at text-line level with 14-point polygons.
ICDAR2015 [14] is a commonly used dataset for scene text detection, which contains 1,000 training images and 500 testing images. The images were captured with Google Glass, and text instances are annotated at word level with quadrilaterals.
MSRA-TD500 [39] is a small dataset that contains a total of 500 images, 300 for training and the remaining 200 for testing. All captured text instances are in English and Chinese, and are annotated at text-line level with best-aligned rectangles. Due to the rather small scale of the dataset, we follow previous works [44, 24] and add the 400 training images from HUST-TR400 [40] to the training set.

4.2 Implementation Details

The following settings are used throughout the experiments. The proposed method is implemented with the deep learning framework PyTorch on a regular GPU workstation with 4 NVIDIA GeForce GTX 1080 Ti GPUs. For the network architecture, we use the ResNet-50 [12] pre-trained on ImageNet [7] as our backbone. For learning, we train our model with a batch size of 16 on 4 GPUs for 36K iterations. Adam optimizer with a starting learning rate of is used for optimization. We use the “poly” learning rate strategy [43], where the initial rate is multiplied by $\left(1 - \frac{iter}{max\_iter}\right)^{power}$, and the “power” is set to 0.9 in all experiments. For data augmentation, we apply random scale, random horizontal flip, random rotation, and random crop on training images. We ignore the blurred texts labeled as DO NOT CARE in all datasets. For the others, Online Hard Example Mining (OHEM) is used to balance the positive and negative samples, and the negative-positive ratio is set to 3. We first pre-train our model on SynthText, and then fine-tune it on the other datasets. The training settings of the two stages are the same.
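The “poly” schedule can be sketched as a one-line function. This is a minimal illustration assuming the usual form of the strategy, base rate times (1 − iter/max_iter) raised to the power; the function name `poly_lr` is hypothetical, and the base rate of 1.0 used below is a placeholder so that the values read as multipliers, not the learning rate used in the experiments.

```python
def poly_lr(base_lr, iteration, max_iter=36000, power=0.9):
    """Learning rate at a given iteration under the 'poly' decay strategy."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Multipliers at the start, middle, and end of the 36K-iteration schedule.
start = poly_lr(1.0, 0)
middle = poly_lr(1.0, 18000)
end = poly_lr(1.0, 36000)
```

With a power of 0.9 the decay is close to linear: the multiplier is 1 at the first iteration, slightly above one half at the midpoint, and reaches 0 at the last iteration.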

4.3 Ablation Study

To demonstrate the effectiveness of our proposed expression for text instances, we carry out ablation studies on the curved text dataset Total-Text. Note that all the models in this subsection are pre-trained on SynthText first. The quantitative results of the same network architecture with different expressions (with corresponding post-processing algorithms) are shown in Tab. 1.

Expression   Expansion ratio   Precision   Recall   F-score
MSR          -                 84.7        77.3     80.8
Ours         1.0               85.8        79.1     82.3
Ours         1.2               87.0        80.1     83.4
Table 1: The results of models with different expressions over the curved text dataset Total-Text. The second column gives the expansion ratio of the text region.

Expression   F      F_0.5   F_0.6   F_0.7   F_0.8   F_0.9
MSR          80.8   88.2    85.5    76.4    48.2    5.7
Ours         83.4   87.9    85.6    78.4    56.9    15.2
Table 2: The results of models with different expressions over the varying IoU thresholds from 0.5 to 0.9. “F_x” means the IoU threshold is set to “x” when evaluating. The measure of “F” follows the Total-Text dataset, setting the TR threshold to 0.7 and the TP threshold to 0.6 for a fairer evaluation.

To better analyze the capability of the proposed expression, we replace our text instance expression in the proposed text detector with MSR’s. The F-score of the model with MSR’s expression (the first row in Tab. 1) is 2.6 points lower than that of our method (the third row in Tab. 1), which clearly indicates the effectiveness of our text instance expression. To show the necessity of introducing background pixels around the text boundary, we adjust the expansion ratio from 1.2 to 1.0 (the second row in Tab. 1). The F-score is 1.1 points higher with the expansion ratio of 1.2, that is, when the model extracts both foreground and background features. Furthermore, to judge whether the text border pixels are better references for the pixel offsets, we compare the model with MSR’s expression against ours without enlarging text instances, and observe that ours improves the F-score by about 1.5 points. We further compare the detection accuracy of the model with MSR’s expression and ours by varying the evaluation IoU threshold from 0.5 to 0.9. Tab. 2 shows that our method outperforms the competitor for most IoU settings, especially at high IoU levels, indicating that our predicted polygons fit text instances better.
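The F-scores in Tab. 1 and the gaps quoted above can be cross-checked from the reported precision and recall, assuming the F-score is the usual harmonic mean of the two. The snippet below is only an arithmetic check on the published numbers; the variable names are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from Tab. 1.
rows = {"MSR": (84.7, 77.3), "Ours (1.0)": (85.8, 79.1), "Ours (1.2)": (87.0, 80.1)}
scores = {name: round(f_score(p, r), 1) for name, (p, r) in rows.items()}

gap_vs_msr = round(scores["Ours (1.2)"] - scores["MSR"], 1)
gap_expansion = round(scores["Ours (1.2)"] - scores["Ours (1.0)"], 1)
gap_reference = round(scores["Ours (1.0)"] - scores["MSR"], 1)
```

The recomputed values reproduce the table (80.8, 82.3, 83.4) and the three differences discussed in the text (2.6, 1.1, and 1.5 points).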

4.4 Comparisons with State-of-the-Art Methods

Curved text detection We first evaluate our method on the datasets Total-Text and CTW1500, which contain many curved text instances. In the testing phase, we set the short side of images to 640 and keep their original aspect ratio. We show the experimental results in Tab. 3. On Total-Text, our method achieves an F-score of 83.4%, which surpasses all other state-of-the-art methods by at least 0.5%. In particular, we outperform our counterpart, MSR, in F-score by over 4%. Analogous results can be found on CTW1500. Our method obtains 81.8% in F-score, the second-best among all methods, which is lower only than PSENet [35] and surpasses MSR by 0.3%. To sum up, our experiments on these two datasets demonstrate the advantages of our method when detecting text instances with arbitrary shapes in complex natural scenes. We visualize our detection results in Fig. 4(a)(b) for further inspection.

                         Total-Text            CTW1500
Method                   P      R      F       P      R      F
SegLink [31]             30.3   23.8   26.7    42.3   40.0   40.8
EAST [44]                50.0   36.2   42.0    78.7   49.1   60.4
Mask TextSpotter [25]    69.0   55.0   61.3    -      -      -
TextSnake [24]           82.7   74.5   78.4    67.9   85.3   75.6
CSE [22]                 81.4   79.1   80.2    81.1   76.0   78.4
TextField [37]           81.2   79.9   80.6    83.0   79.8   81.4
PSENet-1s [35]           84.0   78.0   80.9    84.8   79.7   82.2
SPCNet [36]              83.0   82.8   82.9    -      -      -
TextRay [34]             83.5   77.9   80.6    82.8   80.4   81.6
MSR(Baseline) [38]       83.8   74.8   79.0    85.0   78.3   81.5
Ours                     87.0   80.1   83.4    85.7   78.2   81.8
Table 3: Experimental results on the curved-text-line datasets Total-Text and CTW1500. P, R, and F denote precision, recall, and F-score (%).

Oriented text detection Then we evaluate the proposed method on the multi-oriented text dataset ICDAR2015. In the testing phase, we set the short side of images to 736 for better detection. To fit its evaluation protocol, we use a minimum area rectangle to replace each output polygon. The performance on ICDAR2015 is shown in Tab. 4. Our method achieves an F-score of 82.2%, which is on par with MSR, while the FPS of ours is about 3 times that of MSR. Indeed, MSR adopts a multi-scale multi-stage detection network, a particularly time-consuming architecture, so it is no surprise that its speed is lower than ours. Compared with state-of-the-art methods, our method does not perform as well as some competitors (e.g. PSENet [35], CRAFT [1], DB [17]), but it matches the fastest inference speed in Tab. 4 (13.2 FPS, tied with EAST [44]) and keeps a good balance between accuracy and latency. The qualitative illustrations in Fig. 4(c) show that the proposed method can detect multi-oriented texts well.

Long straight text detection We also evaluate the robustness of our proposed method on the long straight text dataset MSRA-TD500. During inference, the short side of images is set to 736 for a fair comparison. As shown in Tab. 4, our method achieves 82.4% in F-score, which is 0.7% better than MSR and comparable to the best-performing detectors DB and CRAFT. Therefore, our method is robust for detecting texts with extreme aspect ratios in complex scenarios (see Fig. 4(d)).

                         ICDAR2015                    MSRA-TD500
Method                   P      R      F      FPS     P      R      F
SegLink [31]             73.1   76.8   75.0   -       86.6   70.0   77.0
RRPN [27]                82.0   73.0   77.0   -       82.0   68.0   74.0
EAST [44]                83.6   73.5   78.2   13.2    87.3   67.4   76.1
Lyu et al. [26]          94.1   70.7   80.7   3.6     87.6   76.2   81.5
DeepReg [13]             82.0   80.0   81.0   -       77.0   70.0   74.0
RRD [18]                 85.6   79.0   82.2   6.5     87.0   73.0   79.0
PixelLink [6]            82.9   81.7   82.3   7.3     83.0   73.2   77.8
TextSnake [24]           84.9   80.4   82.6   1.1     83.2   73.9   78.3
Mask TextSpotter [25]    85.8   81.2   83.4   4.8     -      -      -
PSENet-1s [35]           86.9   84.5   85.7   1.6     -      -      -
CRAFT [1]                89.8   84.3   86.9   8.6     88.2   78.2   82.9
DB [17]                  91.8   83.2   87.3   12.0    90.4   76.3   82.8
MSR(Baseline) [38]       86.6   78.4   82.3   4.3     87.4   76.7   81.7
Ours                     82.6   81.9   82.2   13.2    83.7   81.1   82.4
Table 4: Experimental results on the oriented-text-line dataset ICDAR2015 and long-straight-text-line dataset MSRA-TD500. P, R, and F denote precision, recall, and F-score (%); FPS is the inference speed in frames per second.

5 Conclusion

In this paper, we analyzed the limitations of existing segmentation-based scene text detectors and proposed a novel text instance expression to address them. Considering the limited ability of these detectors to utilize context information, our method extracts both foreground and background features for robust detection. The pixels around text boundaries are chosen as references of the predicted offsets for accurate localization. In addition, a corresponding post-processing algorithm is introduced to generate the final text instances. Extensive experiments demonstrated that our method achieves performance superior or comparable to other state-of-the-art approaches on several publicly available benchmarks.

Figure 4: The qualitative results of the proposed method. Images in columns (a)-(d) are sampled from the datasets Total-Text, CTW1500, ICDAR2015, and MSRA-TD500 respectively. The green polygons are the detection results predicted by our method, while the blue ones are ground-truth annotations.

Acknowledgements

This work was supported by Beijing Nova Program of Science and Technology (Grant No.: Z191100001119077), Center For Chinese Font Design and Research, and Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).


  • [1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee (2019) Character region awareness for text detection. In CVPR, pp. 9365–9374. Cited by: §4.4, Table 4.
  • [2] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §1.
  • [3] C. K. Ch’ng and C. S. Chan (2017) Total-Text: a comprehensive dataset for scene text detection and recognition. In ICDAR, Vol. 1, pp. 935–942. Cited by: §1, §4.1.
  • [4] J. Chen, Z. Lian, Y. Wang, Y. Tang, and J. Xiao (2019) Irregular scene text detection via attention guided border labeling. SCIS 62 (12), pp. 220103. Cited by: §2, §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. Cited by: §1.
  • [6] D. Deng, H. Liu, X. Li, and D. Cai (2018) PixelLink: detecting scene text via instance segmentation. In AAAI, Cited by: §2, Table 4.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §4.2.
  • [8] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, pp. 2393–2402. Cited by: §1.
  • [9] R. Girshick (2015) Fast R-CNN. In ICCV, pp. 1440–1448. Cited by: §3.4.
  • [10] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, pp. 2315–2324. Cited by: §4.1.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV, pp. 2961–2969. Cited by: §1, §2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §3.2, §4.2.
  • [13] W. He, X. Zhang, F. Yin, and C. Liu (2017) Deep direct regression for multi-oriented scene text detection. In ICCV, pp. 745–753. Cited by: Table 4.
  • [14] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §1, §4.1.
  • [15] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu (2017) TextBoxes: a fast text detector with a single deep neural network. In AAAI, pp. 4161–4167. Cited by: §2.
  • [16] M. Liao, B. Shi, and X. Bai (2018) TextBoxes++: a single-shot oriented scene text detector. IEEE Trans. Image Process. 27 (8), pp. 3676–3690. Cited by: §2.
  • [17] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai (2020) Real-time scene text detection with differentiable binarization. In AAAI, Vol. 34, pp. 11474–11481. Cited by: §4.4, Table 4.
  • [18] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai (2018) Rotation-sensitive regression for oriented scene text detection. In CVPR, pp. 5909–5918. Cited by: §2, Table 4.
  • [19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §1, §3.2.
  • [20] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In CVPR, pp. 8759–8768. Cited by: §1.
  • [21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §1, §2.
  • [22] Z. Liu, G. Lin, S. Yang, F. Liu, W. Lin, and W. L. Goh (2019) Towards robust curve text detection with conditional spatial expansion. In CVPR, pp. 7269–7278. Cited by: Table 3.
  • [23] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1, §2.
  • [24] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In ECCV, pp. 20–36. Cited by: §2, §4.1, Table 3, Table 4.
  • [25] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In ECCV, pp. 67–83. Cited by: §2, §2, Table 3, Table 4.
  • [26] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai (2018) Multi-oriented scene text detection via corner localization and region segmentation. In CVPR, pp. 7553–7563. Cited by: Table 4.
  • [27] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Image Process. 20 (11), pp. 3111–3122. Cited by: §2, Table 4.
  • [28] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571. Cited by: §3.4.
  • [29] L. Neumann and J. Matas (2010) A method for text localization and recognition in real-world images. In ACCV, pp. 770–783. Cited by: §2.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99. Cited by: §1, §2.
  • [31] B. Shi, X. Bai, and S. Belongie (2017) Detecting oriented text in natural images by linking segments. In CVPR, pp. 2550–2558. Cited by: Table 3, Table 4.
  • [32] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. Cited by: §3.4.
  • [33] B. R. Vatti (1992) A generic solution to polygon clipping. CACM 35 (7), pp. 56–63. Cited by: §3.1.
  • [34] F. Wang, Y. Chen, F. Wu, and X. Li (2020) TextRay: contour-based geometric modeling for arbitrary-shaped scene text detection. In ACM-MM, pp. 111–119. Cited by: §1, Table 3.
  • [35] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In CVPR, pp. 9336–9345. Cited by: §1, §2, §3.1, §4.4, §4.4, Table 3, Table 4.
  • [36] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2019) Scene text detection with supervised pyramid context network. In AAAI, Vol. 33, pp. 9038–9045. Cited by: §1, §2, Table 3.
  • [37] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai (2019) Textfield: learning a deep direction field for irregular scene text detection. IEEE Trans. Image Process. 28 (11), pp. 5566–5579. Cited by: §2, Table 3.
  • [38] C. Xue, S. Lu, and W. Zhang (2019) MSR: multi-scale shape regression for scene text detection. In IJCAI, pp. 989–995. Cited by: Figure 1, §1, §2, Figure 2, §3, Table 3, Table 4.
  • [39] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In CVPR, pp. 1083–1090. Cited by: §1, §4.1.
  • [40] C. Yao, X. Bai, and W. Liu (2014) A unified framework for multioriented text detection and recognition. IEEE Trans. Image Process. 23 (11), pp. 4737–4749. Cited by: §4.1.
  • [41] L. Yuliang, J. Lianwen, Z. Shuaitao, and Z. Sheng (2017) Detecting curve text in the wild: new dataset and new solution. arXiv preprint arXiv:1712.02170. Cited by: §1, §1, §4.1.
  • [42] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai (2016) Multi-oriented text detection with fully convolutional networks. In CVPR, pp. 4159–4167. Cited by: §2.
  • [43] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §4.2.
  • [44] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) East: an efficient and accurate scene text detector. In CVPR, pp. 5551–5560. Cited by: §1, §2, §4.1, Table 3, Table 4.