
FC2RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection

by Xugong Qin, et al.

Recent scene text detection work has focused mainly on curved text. In real applications, however, curved text is scarcer than multi-oriented text, and accurate detection of multi-oriented text with large variations in scale, orientation, and aspect ratio remains of great significance. Among multi-oriented detection methods, direct regression of text geometry offers a simple yet powerful pipeline and is popular in both academic and industrial communities, but it can produce imperfect detections, especially for long texts, due to the limited receptive field. In this work, we aim to improve on this while keeping the pipeline simple. We propose a fully convolutional corner refinement network (FC2RN) for accurate multi-oriented text detection, in which an initial corner prediction and a refined corner prediction are obtained in one pass. With a novel quadrilateral RoI convolution (QRC) operation tailored to multi-oriented scene text, the initial quadrilateral prediction is encoded into the feature maps, which are then used to predict the offset between the initial prediction and the ground truth as well as to output a refined confidence score. Experimental results on four public datasets, MSRA-TD500, ICDAR2017-RCTW, ICDAR2015, and COCO-Text, demonstrate that FC2RN outperforms state-of-the-art methods. An ablation study shows the effectiveness of corner refinement and scoring for accurate text localization.



1 Introduction

Reading text in the wild has attracted much attention due to its wide applications in scene understanding, license plate recognition, autonomous navigation, and document analysis. As a prerequisite of text recognition, text detection plays an essential role in the whole procedure of scene text understanding. Despite great progress achieved by recent scene text detection methods inspired by object detection and segmentation, detecting text in natural images remains very challenging due to large variations in scale, orientation, and aspect ratio as well as low image quality and perspective distortion. Although curved text detection has attracted much attention recently, the proportion of curved text in reality is relatively small. Accurate multi-oriented text detection is still one of the most important open problems to be solved.

Recently, various methods [58, 24, 12, 16, 29, 28, 18, 44, 56, 42, 25] have been proposed to detect multi-oriented text. These methods can be roughly divided into anchor-based and anchor-free methods. Anchor-based methods assume a set of prior boxes for reference, which simplifies the problem to learning relative offsets to anchors. Anchor-free regression methods get rid of the complicated design of anchor boxes and directly regress the geometry of text, which makes for a clear and effective scene text detection pipeline [58]. However, due to large variations in scale, orientation, and aspect ratio, direct regression may produce unsatisfactory results, as shown in Fig. 1. Recent research also reveals performance degradation when training with texts containing large rotation variations [25, 50]. Direct regression also shows deficiencies when detecting long texts with large aspect ratios because of the limited receptive field, as shown in Fig. 1.

Figure 1: Detection results of direct regression and FC2RN on long text lines. The red and blue boxes are the results of direct regression and FC2RN, respectively.

To localize multi-oriented text accurately while keeping a simple pipeline, we propose a fully convolutional corner refinement network (FC2RN) for multi-oriented scene text detection. As shown in Fig. 2, the network produces an initial prediction with direct regression. With a novel quadrilateral RoI convolution (QRC) operation tailored to multi-oriented scene text, the initial quadrilateral prediction is encoded into the feature maps, which are then used to predict the offset between the initial prediction and the ground truth as well as to output a refined confidence score. The refined prediction is obtained from the initial prediction and the predicted offset. The design of the whole network follows the spirit of anchor-free regression in a fully convolutional manner. With corner refinement and scoring, the proposed text detector can detect long texts and suppress low-quality detections produced in the initial prediction.

The contributions of this work could be summarized as follows:

  • We propose a novel built-in module that encodes an initial quadrilateral prediction into the feature maps via a lightweight deformable convolution operation.

  • We embed the proposed module into the baseline detector to perform corner refinement and scoring, leading to a fully convolutional corner refinement network for multi-oriented scene text detection.

  • Experimental results on several datasets show our method can outperform the state-of-the-art methods. The ablation study shows the effectiveness of the corner refinement and scoring for accurate localization.

Figure 2: Illustration of our proposed method. The red and blue dashed lines show the processes of direct regression and corner refinement, respectively. The yellow dashed box denotes the final detection result. The red and blue solid points represent the sampling positions of standard convolution and QRC.

2 Related Work

As a basic component of OCR, scene text detection has long been a hot research topic. With their powerful feature representation and end-to-end optimization, recent scene text detection works are almost all based on deep learning. We briefly review recent works, dividing them into anchor-based and anchor-free methods, and then review works on adaptive convolutional sampling.

Anchor-based methods [24, 16, 29, 18, 38, 35, 11, 53, 47, 33] assume a set of prior boxes, simplifying the learning problem to regressing relative offsets to anchors. Recent text detection methods benefit greatly from classic generic object detection methods. DMPNet [24], SegLink [35], SSTD [11], Textboxes++ [16], and RRD [18] are based on SSD [21]; CTPN [38] and RRPN [29] are based on Faster R-CNN [34]; and SPCNet [47] is based on Mask R-CNN [9]. Anchors of different scales and aspect ratios are used to model the large variations in scale and aspect ratio of scene text. However, different datasets usually require different anchor settings to achieve the best performance, which makes these methods inflexible to transfer across datasets.

Anchor-free methods either directly predict the geometry of scene text [58, 12, 56, 42, 27] or a segmentation confidence score [44, 57, 6, 39, 45, 17] of being text or not. In EAST [58], a clear and effective pipeline is proposed to regress a text geometry representation of rotated bounding boxes or quadrilaterals. DeepReg [12] also directly regresses the four corners of quadrilaterals for multi-oriented text detection. TextSnake [27] performs local geometry regression and then reconstructs text instances, which allows it to detect text of arbitrary shape. LOMO [56] regresses text geometry in a coarse-to-fine manner. Segmentation-based methods view text detection as an instance segmentation problem. Apart from the text score, a shrunk kernel score [44, 27, 45, 17], a link score with neighbors [6], or a similarity vector [39, 45] is used to distinguish different text instances.

Adaptive convolutional sampling. Deformable convolution [4] introduces learnable offsets in convolutional sampling. For scene text detection, ITN [42] applies affine transformation estimation at each location; the predicted affine transformation is then used to deform the sampling grid of regular convolution, leading to scale- and orientation-robust text detection. However, affine transformation modeling fails when dealing with text under perspective transformation. Moreover, it does not directly target detection, and cumulative error from the transformation prediction degrades overall performance. Adaptive convolution is proposed in Cascade RPN [41] for feature alignment, in which the offsets in deformable convolution are determined by the anchors. Nevertheless, this modeling does not suit multi-oriented scene text, which is tightly bounded by long quadrilaterals. Inspired by these methods, we propose QRC, which is tailored to multi-oriented scene text detection and is able to deal with proposals that are arbitrary convex quadrilaterals. With QRC, the initial geometry prediction is encoded into the features, and the receptive field changes correspondingly.

3 Methodology

Figure 3: The overall framework of FC2RN. It consists of an FPN backbone and heads shared across feature pyramid levels. The red and blue lines represent the initial prediction and the refined prediction, respectively.

In this section, we describe the framework of FC2RN in detail. The baseline model is introduced first. Next, we introduce QRC. Then, we describe the corner refinement and scoring heads. Finally, the learning targets and the loss functions used in optimization are introduced.

3.1 Baseline Model

The architecture of FC2RN is illustrated in Fig. 3. We first introduce the baseline model. We adopt ResNet50 [10] as the backbone, and standard FPN [19] is used to handle the scale variations of scene text. Instead of using a single output level as in [58, 44, 56], we use features from different levels, which naturally have different receptive fields, to predict texts of different scales. Multiple pyramid levels are used, and we include the highest-resolution level to better detect tiny text. The output features of all pyramid levels have the same number of channels. We add three consecutive convolutions to the detection heads for further feature extraction. After that, two convolutions are added to predict the offsets to the center point and the classification score, respectively.

In training, we project texts of different scales to different pyramid levels. Instead of using the short side of the text as the scale measurement, we compute the lengths of the two line segments connecting the midpoints of the two pairs of opposing edges and take the shorter one as the scale of the text instance. This is a more feasible and robust measurement of text scale. Texts of increasing scale are projected to increasingly coarse pyramid levels.
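The scale measurement above can be sketched as follows (an illustrative helper of our own, not the authors' code): the scale of a quadrilateral is the shorter of the two segments connecting midpoints of opposite edges.

```python
import math

def text_scale(quad):
    """quad: four (x, y) corners ordered top-left, top-right,
    bottom-right, bottom-left."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = quad

    def midpoint(ax, ay, bx, by):
        return ((ax + bx) / 2.0, (ay + by) / 2.0)

    # Midpoints of the top and bottom edges ...
    m_top = midpoint(x1, y1, x2, y2)
    m_bottom = midpoint(x4, y4, x3, y3)
    # ... and of the left and right edges.
    m_left = midpoint(x1, y1, x4, y4)
    m_right = midpoint(x2, y2, x3, y3)

    d_vertical = math.dist(m_top, m_bottom)    # connects one opposing edge pair
    d_horizontal = math.dist(m_left, m_right)  # connects the other pair
    return min(d_vertical, d_horizontal)
```

For an axis-aligned 100 x 20 text box this returns 20, matching the intuition that the "height" of a long text line, not its length, should determine its pyramid level.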

Given one level of the feature maps with stride s before the layer, the center of the feature bin at a given location on that feature map can be computed from the location and the stride. The ground-truth quadrilaterals are ordered clockwise from top-left to bottom-left. The learning target is the set of offsets between the center of the feature bin and the four corresponding corners of the ground truth, normalized by the stride s. The quadrilateral prediction is then obtained from the center coordinate and the offset prediction.
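A minimal sketch of this target encoding (the exact bin-center formula is not shown in the text, so the cell-center convention below is an assumption, and the helper name is ours):

```python
def regression_targets(x, y, stride, gt_quad):
    """x, y: integer bin location on the feature map; gt_quad: four
    (x, y) ground-truth corners, clockwise from top-left.
    Returns eight stride-normalized corner offsets."""
    # Assumed convention: the bin center is the center of the
    # stride x stride cell that the location covers in the image.
    cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
    targets = []
    for gx, gy in gt_quad:
        # Offset from bin center to ground-truth corner, in strides.
        targets.extend([(gx - cx) / stride, (gy - cy) / stride])
    return targets
```

Decoding a prediction simply inverts this: multiply the predicted offsets by the stride and add the bin center.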

3.2 Quadrilateral RoI Convolution

Given a feature map x, in standard 2D convolution the feature map is first sampled using a regular grid R, and the samples are summed with the weights w of a k x k kernel. Here, the grid R is defined by the kernel size and dilation. For each location p_0 on the output feature map y, we have

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n). (1)

In QRC, the regular grid R is augmented with offsets inferred from a new grid G, which is generated by uniformly sampling on the initial quadrilateral prediction:

y(p_0) = \sum_{n=1}^{k \cdot k} w(p_n) \cdot x(g_n), (2)

where p_n enumerates R and g_n is the corresponding position in G.
Let (q_1, q_2, q_3, q_4) denote the projection of the predicted corners (from top-left to bottom-left, clockwise) onto the feature map. Each position in the grid G is obtained by

g_{i,j} = f\big(j, k, f(i, k, q_1, q_2), f(i, k, q_4, q_3)\big), (3)

where f is a linear kernel computed as

f(i, k, p_a, p_b) = \Big(1 - \frac{i}{k-1}\Big) p_a + \frac{i}{k-1} p_b, (4)

and i, k, p_a, p_b denote the kernel index, the kernel size, and two corner points, respectively. Each g_{i,j} is therefore a linear combination of the four corners of the quadrilateral prediction. For example, with a 3 x 3 kernel, the nine sampling positions of QRC are the four corners, the midpoints of the four edges, and the center of the quadrilateral, as illustrated in Fig. 4. Given Eq. 2, QRC can be easily implemented with a deformable convolution layer [4]. It is worth noting that the proposed QRC requires no additional computation compared with vanilla convolution, which enables it to be integrated seamlessly into existing regression-based multi-oriented text detectors.
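The uniform sampling on a quadrilateral can be sketched as follows (a minimal bilinear-interpolation sketch under our reading of the linear kernel; helper names are ours): a k x k grid is obtained by interpolating the four corners, so every sampling position is a linear combination of the corners.

```python
def lerp(p, q, t):
    # Linear interpolation between two 2D points.
    return ((1 - t) * p[0] + t * q[0], (1 - t) * p[1] + t * q[1])

def qrc_grid(quad, k=3):
    """quad: corners (top-left, top-right, bottom-right, bottom-left).
    Returns k*k sampling positions, row by row."""
    tl, tr, br, bl = quad
    grid = []
    for j in range(k):
        v = j / (k - 1)          # vertical interpolation factor
        left = lerp(tl, bl, v)   # point on the left edge
        right = lerp(tr, br, v)  # point on the right edge
        for i in range(k):
            u = i / (k - 1)      # horizontal interpolation factor
            grid.append(lerp(left, right, u))
    return grid
```

For k = 3 the nine positions are exactly the four corners, the four edge midpoints, and the center, matching Fig. 4; in practice the offsets between these positions and the regular grid would be fed to a deformable convolution layer.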

Figure 4: Illustration of QRC. The gray quadrilateral represents quadrilateral prediction. The solid points on the quadrilateral are corners and the light blue points are obtained by uniformly sampling.

3.3 Corner Refinement and Scoring

Figure 5: The head architecture of FC2RN. The part in the dashed gray box is the head architecture of the baseline model.

The architecture of the detection head is illustrated in Fig. 5. The initial prediction and score are produced from the initial feature maps. We then use the initial features and the initial prediction to produce the refined features as described in Sec. 3.2, and two convolution heads are added to perform corner refinement and scoring. With QRC, the initial quadrilateral prediction is encoded into the initial features, which acts as an attention mechanism focusing on text instances rather than background. With the initial prediction encoded, the refinement head can predict the offsets between the initial prediction and the ground truth, producing more accurate localization results. The scoring head predicts a new classification score on the refined features, which have a more appropriate receptive field and are more discriminative in distinguishing foreground from background.

3.4 Optimization

Due to the fully convolutional nature of the proposed network, the pipeline is quite simple and can be optimized end-to-end. Given one feature bin on the output feature maps, the corresponding initial score, initial quadrilateral prediction, refined score, and refined quadrilateral prediction are obtained in one pass.

Label generation. We use the rules in [16] to decide the order of the four corners. The positive samples for the initial classification and regression tasks are shrunk versions of the text regions. Instead of shrinking the four edges by the same distance, we shrink them by the same proportion. We argue that feature bins near the edges may not be appropriate for regressing the whole text when dealing with long texts and tend to produce locally incomplete predictions. Moreover, feature bins near the edge region are more likely to be outliers that may dominate the gradient during training. If the center of a feature bin falls into the shrunk version of a ground-truth quadrilateral, that ground truth is assigned to the feature bin in training. Negative samples are the feature bins that do not fall into any ground-truth quadrilateral. Feature bins in do-not-care regions are ignored during training.
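Proportional shrinking can be sketched as follows (our own illustrative helper; the paper's exact construction may differ): each corner moves toward the quadrilateral's centroid by a fixed proportion, so all four edges shrink by the same ratio rather than by the same absolute distance.

```python
def shrink_quad(quad, ratio=0.25):
    """quad: list of four (x, y) corners; ratio: fraction of the
    corner-to-centroid distance removed (0 keeps the quad unchanged)."""
    # Centroid of the four corners.
    cx = sum(x for x, _ in quad) / 4.0
    cy = sum(y for _, y in quad) / 4.0
    # Move every corner toward the centroid by the same proportion.
    return [(x + (cx - x) * ratio, y + (cy - y) * ratio) for x, y in quad]
```

For a long text line this keeps the positive region proportionally centered, whereas a fixed-distance shrink would leave positives almost touching the short edges.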

The labels and learning targets for corner refinement and scoring are as follows. The two branches predict the offsets between the initial prediction and the ground truth and a new classification score, both based on the refined features. Ideally, the IoU (intersection over union) between the predicted quadrilateral and the ground truth is a good criterion for measuring prediction quality. We consider a feature bin a positive sample if its IoU with some ground truth is higher than a threshold; negative samples are those whose IoU with every ground-truth quadrilateral is below the threshold. For each positive feature bin, the learning target is the ground truth with which it has the highest IoU. However, computing the IoU of two quadrilaterals is hard to parallelize, so in practice we use the IoU between the minimum bounding boxes of the initial prediction and the ground truth instead. We set the IoU threshold to 0.5 and find that it works well in practice.
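The practical IoU surrogate can be sketched as follows (illustrative helpers, not the authors' implementation): take the IoU of the axis-aligned minimum bounding boxes of the two quadrilaterals instead of the exact quadrilateral IoU.

```python
def bounding_box(quad):
    # Axis-aligned minimum bounding box of a quadrilateral.
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)

def bbox_iou(quad_a, quad_b):
    ax1, ay1, ax2, ay2 = bounding_box(quad_a)
    bx1, by1, bx2, by2 = bounding_box(quad_b)
    # Intersection rectangle (may be empty).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

This only approximates the true quadrilateral IoU (it overestimates overlap for strongly inclined boxes), but it is cheap, trivially vectorizable, and sufficient for thresholding at 0.5.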

Loss function. The classification losses used in the network are the focal loss [20]:

FL(p, y) = -\alpha \, y (1 - p)^{\gamma} \log p - (1 - \alpha)(1 - y) \, p^{\gamma} \log(1 - p),

where y and p are the label and the prediction. In both classification losses, we set \alpha = 0.25 and \gamma = 2.
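A minimal sketch of binary focal loss as commonly defined (Lin et al. [20]); the defaults alpha = 0.25 and gamma = 2 are the standard values and assumed here:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of being text; y: binary label (0 or 1)."""
    eps = 1e-12  # numerical floor to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        # Well-classified positives (p near 1) are down-weighted by (1-p)^gamma.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Well-classified negatives (p near 0) are down-weighted by p^gamma.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

With gamma = 0 and alpha = 1 this reduces to plain cross-entropy; the modulating factor is what lets the dense text/non-text classification tolerate the extreme foreground/background imbalance.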

The regression losses we use are the smooth L1 loss [8]:

L_{reg} = \sum_i \mathrm{smooth}_{L1}(t_i - t_i^*), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases}

where t_i^* is the i-th coordinate offset between the detected quadrilateral and the ground truth and t_i is the corresponding predicted value. The whole loss function of the network is formulated as

L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L'_{cls} + \lambda_3 L'_{reg},

where L_{cls}, L_{reg}, L'_{cls}, and L'_{reg} are the losses for the initial classification, initial regression, refined classification, and refined regression. The balance factors \lambda_1, \lambda_2, and \lambda_3 are all set to 1.
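The regression loss and the four-term total can be sketched as follows (illustrative names; the balance factors default to 1 as stated above):

```python
def smooth_l1(x):
    # Quadratic near zero, linear beyond |x| = 1 (robust to outliers).
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def regression_loss(pred_offsets, target_offsets):
    # Sum of per-coordinate smooth L1 over the eight corner offsets.
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))

def total_loss(l_cls, l_reg, l_cls_refined, l_reg_refined,
               lambda1=1.0, lambda2=1.0, lambda3=1.0):
    # Initial classification + weighted initial regression,
    # refined classification, and refined regression terms.
    return l_cls + lambda1 * l_reg + lambda2 * l_cls_refined + lambda3 * l_reg_refined
```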

During inference, the refined quadrilateral prediction is obtained as

Q = c + s\,(o + o'),

where c, o, o', and s are the center coordinate of the feature bin, the initial offset prediction, the refined offset prediction, and the stride. The prediction is scored by the refined score.
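A sketch of combining the initial and refined offsets at inference time (symbols as we read them: bin center c, stride s, initial offsets o, refined offsets o'; the helper name is ours):

```python
def refined_quad(center, stride, init_offsets, refine_offsets):
    """center: (cx, cy); each offsets list holds eight stride-normalized
    values in corner order (x1, y1, ..., x4, y4)."""
    cx, cy = center
    quad = []
    for i in range(4):
        # Refined offset = initial offset + predicted correction.
        dx = init_offsets[2 * i] + refine_offsets[2 * i]
        dy = init_offsets[2 * i + 1] + refine_offsets[2 * i + 1]
        quad.append((cx + dx * stride, cy + dy * stride))
    return quad
```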

4 Experiments

In this section, we evaluate our approach on MSRA-TD500 [54], ICDAR2017-RCTW [36], ICDAR2015 [14], and COCO-Text [40] to show its effectiveness.

4.1 Datasets

MSRA-TD500 is a multilingual dataset focusing on oriented text lines, with large variations in text scale and orientation. It consists of 300 training images and 200 testing images.

ICDAR2017-RCTW comprises 8034 training images and 4229 test images with scene text in Chinese or English. The images are captured from different sources, including street views, posters, and screenshots. Multi-oriented words and text lines are annotated with quadrilaterals.

ICDAR2015 is an English-only multi-oriented text detection dataset, which includes 1000 training images and 500 testing images. The text regions are annotated with quadrilaterals.

COCO-Text is a large dataset containing 63,686 images, of which 43,686 are used for training, 10,000 for validation, and 10,000 for testing. It is one of the challenges of the ICDAR 2017 robust reading competition. The dataset is quite challenging due to the diversity of text in natural scenes.

4.2 Implementation Details

Our work is implemented with MMDetection [2]. An ImageNet pre-trained model is used to initialize the backbone. We use SGD as the optimizer with batch size 2, momentum 0.9, and weight decay 0.0001. Training runs for 48 epochs, with warm-up in the first 500 iterations. The initial learning rate is set to 0.00125 for all experiments and is decayed by 0.1 at the 32nd and 44th epochs. The shrink factor of the text region in training is set to 0.25. All experiments are performed on a GeForce GTX 1080 Ti.

Due to the imbalanced distribution of text scales and orientations, we adopt random rotation, random crop, random scale, and random flip as data augmentation in training. For testing, we use only single-scale testing for all datasets, because published methods use numerous multi-scale settings, which makes fair comparison difficult. The test scale and maximum size are (1200, 1600). We adopt the polygonal non-maximum suppression (PNMS) proposed in [23] to suppress redundant detections.

4.3 Ablation Study

We perform an ablation study on MSRA-TD500 with different settings to analyze the function of corner refinement and scoring. To better illustrate the ability to detect long texts, we use 4,000 well-annotated samples from [36] for pre-training, as adopted in [25]. The results are shown in Table 1. With corner refinement, the F-measure is 4.1 points higher than the baseline model. When we replace the initial confidence score with the refined confidence score, we gain a further 1.1 points of F-measure. These results quantitatively show the effectiveness of corner refinement and scoring.

An IoU threshold of 0.5 is usually adopted in detection. However, it is not strict enough for accurate scene text detection and the subsequent text recognition task, so we further constrain the IoU threshold to 0.75 for better illustration. As shown in Table 1, the F-measures of the three settings drop by 31.6, 17.1, and 13.4 points, respectively, compared with the 0.5 IoU metric. With corner refinement, an 18.6-point F-measure gain is obtained over the baseline, and using the refined score adds another 4.8 points. This clearly shows the effectiveness of QRC and the two subtasks under a high IoU threshold.

Method               IoU@0.5 (P / R / F)   IoU@0.75 (P / R / F)
Baseline             82.8 / 82.1 / 82.5    51.1 / 50.7 / 50.9
Baseline + CR        89.5 / 83.8 / 86.6    71.4 / 67.7 / 69.5
Baseline + CR + CS   90.3 / 85.2 / 87.7    80.5 / 69.0 / 74.3
Table 1: Evaluation results on MSRA-TD500 with different model settings. "CR" and "CS" represent corner refinement and corner scoring, respectively. "P", "R", and "F" indicate precision, recall, and F-measure, respectively.

Detection results of the three settings and the ground truth are visualized in Fig. 6, which illustrates the function of corner refinement and scoring well. When detecting long texts, the baseline tends to produce incomplete detections due to the limited receptive field, which severely degrades performance. With corner refinement, the model produces more accurate regression results than the baseline. However, the initial score describes how suitable a feature bin is for direct regression and is not suited to the refined regression; as a result, more accurate regression results may be suppressed by this unreasonable scoring process. The refined score is predicted from the refined features with the initial prediction encoded and measures how good the initial prediction is. With the refined score, the confidences of detection results are much more reasonable. Moreover, the refined score also shows a stronger ability to distinguish scene text from background.

(a) Baseline
(b) Baseline + CR
(c) Baseline + CR + CS
(d) GT
Figure 6: Visualization of results with different model settings. "CR" and "CS" denote corner refinement and corner scoring. (a-d) correspond to the detection results of the baseline model, the baseline model with corner refinement, FC2RN, and the ground truth, respectively. Detection results and the ground truth are marked with yellow and orange boxes.

4.4 Comparison with State-of-the-Arts

4.4.1 Detecting Oriented Multi-Lingual Text Lines.

We evaluate our method on challenging oriented text line datasets MSRA-TD500 and ICDAR2017-RCTW.

Method   MSRA-TD500 (P / R / F)   ICDAR2015 (P / R / F)
EAST [58] 87.3 67.4 76.1 83.6 73.5 78.2
SegLink [35] 86.0 70.0 77.0 73.1 76.8 75.0
DeepReg [12] 77.0 70.0 74.0 82.0 80.0 81.0
SSTD [11] - - - 80.2 73.9 76.9
WordSup [13] - - - 79.3 77.0 78.2
RRPN [29] 82.0 68.0 74.0 82.0 73.0 77.0
PixelLink [6] 83.0 73.2 77.8 85.5 82.0 83.7
Lyu et al. [28] 87.6 76.2 81.5 94.1 70.7 80.7
RRD [18] 87.0 73.0 79.0 85.6 79.0 82.2
MCN [26] 88.0 79.0 83.0 72.0 80.0 76.0
ITN [42] 90.3 72.3 80.3 85.7 74.1 79.5
FTSN [5] 87.6 77.1 82.0 88.6 80.0 84.1
IncepText [53] 87.5 79.0 83.0 90.5 80.6 85.3
TextSnake [27] 83.2 73.9 78.3 84.9 80.4 82.6
Border [51] 83.0 77.4 80.1 - - -
TextField [49] 87.4 75.9 81.3 84.3 80.5 82.4
SPCNet [47] - - - 88.7 85.8 87.2
PSENet-1s [44] - - - 86.9 84.5 85.7
CRAFT [1] 88.2 78.2 82.9 89.8 84.3 86.9
SAE [39] 84.2 81.7 82.9 88.3 85.0 86.6
Wang et al. [46] 85.2 82.1 83.6 89.2 86.0 87.6
LOMO [56] - - - 91.3 83.5 87.2
MSR [52] 87.4 76.7 81.7 86.6 78.4 82.3
BDN [25] 89.6 80.5 84.8 89.4 83.8 86.5
PAN [45] 84.4 83.8 84.1 84.0 81.9 82.8
GNNets [50] - - - 90.4 86.7 88.5
DB [17] 91.5 79.2 84.9 91.8 83.2 87.3
FOTS* [22] - - - 91.0 85.2 88.0
Mask TextSpotter* [15] - - - 86.6 87.3 87.0
CharNet R-50* [48] - - - 91.2 88.3 89.7
Qin et al.* [32] - - - 89.4 85.8 87.5
TextDragon* [7] - - - 92.3 83.8 87.9
Boundary* [43] - - - 89.8 87.5 88.6

Text Perceptron* - - - 92.3 82.5 87.1
FC2RN 90.3 81.8 85.8 89.0 88.7 88.9
Table 2: Experimental results on MSRA-TD500 and ICDAR2015. "P", "R", and "F" indicate precision, recall, and F-measure, respectively. "*" denotes end-to-end recognition methods.

MSRA-TD500. Because the training set is rather small, we follow previous works [58, 28, 27] and include the 400 images from HUST-TR400 [55] as training data. The detection results are listed in Table 2. FC2RN achieves state-of-the-art performance in terms of F-measure, outperforming the previous best by 0.9 points. FC2RN outperforms the anchor-free regression methods EAST [58] and DeepReg [12] thanks to the enlarged receptive field and the refinement heads, which enable the network to detect oriented texts with large aspect ratios. Compared with methods that group or cluster local results into final results, such as PixelLink [6], SegLink [35], Lyu et al. [28], MCN [26], and TextSnake [27], FC2RN outputs refined results in one pass, which eliminates the cumulative error of the intermediate process. For two-stage Mask R-CNN based methods such as FTSN [5], IncepText [53], and BDN [25], an IoU matching criterion based on square anchors or proposals may cause confusion in learning, because different inclined, tilted texts may have nearly overlapping bounding boxes; moreover, several inclined, tilted long texts may lie in the same bounding box. This makes it hard for the mask branch, which performs foreground/background binary classification, to distinguish which one is the foreground instance, and degrades performance on multi-oriented text.

ICDAR2017-RCTW. No extra data beyond the official RCTW training samples is used in training. Our single-scale results outperform the best existing single-scale results [53] by 3.4 F-measure, a large margin, and also outperform the state-of-the-art method LOMO [56] with multi-scale testing, which illustrates the ability of the proposed method to handle challenging long texts.

Method Precision Recall F-measure
Official baseline [36] 76.0 40.4 52.8
EAST [58] 59.7 47.8 53.1
RRD [18] 72.4 45.3 55.7
IncepText [53] 78.5 56.9 66.0
LOMO [56] 80.4 50.8 62.3
RRD MS [18] 77.5 59.1 67.0
Border MS [51] 78.2 58.8 67.1
LOMO MS [56] 79.1 60.2 68.4
FC2RN 77.5 63.0 69.4
Table 3: Experimental results on the RCTW benchmark. "MS" denotes multi-scale testing.

4.4.2 Detecting Oriented English Words.

To further demonstrate the effectiveness of our method in detecting English words, we evaluate our method on ICDAR2015 and COCO-Text.

ICDAR2015. To compare with the state-of-the-art methods, we follow [44, 50, 47, 22] and use ICDAR2017-MLT [30] to pre-train the model. The pre-trained model is then fine-tuned for another 48 epochs on the ICDAR2015 training data. The evaluation protocol follows [14]. As shown in Table 2, FC2RN outperforms all other methods with a single-scale test F-measure of 88.9. The recall of 88.7 outperforms the existing state-of-the-art method GNNets by nearly 2 points, even though GNNets uses ResNet152 as its backbone whereas ours is ResNet50. Compared with anchor-free detectors, FC2RN obtains more than 10 points of F-measure over EAST [58] and outperforms the recently proposed LOMO [56] and PSENet [44]. FC2RN also outperforms two-stage detectors such as IncepText [53] and SPCNet [47]. Compared with end-to-end recognition methods, which utilize recognition supervision, FC2RN outperforms all but CharNet [48], which also uses character-level supervision.

COCO-Text. We evaluate our model on the ICDAR2017 robust reading challenge on COCO-Text [3] with annotations V1.4. The results are reported in Table 4. FC2RN achieves state-of-the-art performance among all the methods. Our single-scale result outperforms RRD [18] and Lyu et al. [28] with multi-scale testing. FC2RN also outperforms the end-to-end recognition method Mask TextSpotter [15], which uses character-level annotations in training, and is comparable to the newly proposed end-to-end recognition method Boundary [43]. Moreover, under the 0.75 IoU metric, FC2RN achieves better performance than all other methods, with a 1.3 F-measure gain over the state-of-the-art method RRD with multi-scale testing. Compared with Lyu et al. MS, FC2RN obtains a larger improvement at IoU 0.75 than at IoU 0.5, which reveals that FC2RN outputs more accurate quadrilaterals. Qualitative results are shown in Fig. 7.

Method   IoU@0.5 (P / R / F)   IoU@0.75 (P / R / F)
UM [3] 47.6 65.5 55.1 23.0 31.0 26.0
TDN SJTU v2 [3] 62.4 54.3 58.1 32.0 28.0 30.0
Text Detection DL [3] 60.9 61.8 61.4 38.0 34.0 36.0
Lyu et al. [28] 72.5 52.9 61.1 40.0 30.0 34.6
Lyu et al. MS [28] 62.9 62.2 62.6 35.1 34.8 34.9
RRD MS [18] 64.0 57.0 61.0 38.0 34.0 36.0
Mask TextSpotter* [15] 66.8 58.3 62.3 - - -
Boundary* [43] 67.7 59.0 63.0 - - -
FCRN 68.5 58.2 63.0 44.7 32.1 37.3
Table 4: Experimental results on the COCO-Text challenge. "MS" denotes multi-scale testing. "P", "R", and "F" indicate precision, recall, and F-measure, respectively. "*" denotes end-to-end recognition methods.
(a) MSRA-TD500
(b) RCTW
(c) ICDAR2015
(d) COCO-Text
Figure 7: Examples of detection results. The four columns are results on MSRA-TD500, ICDAR2017-RCTW, ICDAR2015 and COCO-Text respectively.

4.5 Runtime Analysis

The speed of FC2RN is compared with the baseline model in Table 5. FC2RN is only slightly slower than the baseline; the additional parameters and computation come from QRC and the refinement heads. The refinement module adds only 0.61 M parameters to the base model's 28.42 M and, at the given input size, 50.3 GMac of computation to the baseline's 297.0 GMac. We find that the computation mainly comes from the dense convolutions on the highest-resolution pyramid level.

Method     Speed (scale 928)   Speed (scale 736)   Flops        Params
Baseline   7.7 FPS             11.0 FPS            297.0 GMac   28.42 M
FC2RN      5.8 FPS             8.8 FPS             347.3 GMac   29.03 M
Table 5: Comparison of speed at two input scales, computation, and model complexity between the baseline model and FC2RN.

4.6 Limitation

One limitation is that our method cannot deal well with text lines that have large character spacing, since there are few such samples in the training set. The same problem affects many other methods.

5 Conclusion

In this work, we propose a novel fully convolutional corner refinement network (FC2RN) for accurate multi-oriented scene text detection. FC2RN has a simple yet effective pipeline: compared with the baseline model, only a few convolutional layers are added. With the prediction of direct regression encoded, the network is able to refine the initial prediction and produce a new score. Experiments on several datasets show the effectiveness of our method, especially on long-text datasets. Furthermore, due to the lightweight and fully convolutional nature of our detector, it can be integrated seamlessly with other frameworks such as FOTS [22] and TextNet [37] for end-to-end recognition, or LOMO [56] for better detecting curved text. Tricks such as iterative refinement [56] at test time could also be included for better performance but are not involved in this work. In the future, we will extend this work to end-to-end text spotting.


  • [1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee (2019) Character region awareness for text detection. In CVPR, pp. 9365–9374. Cited by: Table 2.
  • [2] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.2.
  • [3] COCO-Text challenge. Cited by: §4.4.2, Table 4.
  • [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §2, §3.2.
  • [5] Y. Dai, Z. Huang, Y. Gao, Y. Xu, K. Chen, J. Guo, and W. Qiu (2018) Fused text segmentation networks for multi-oriented scene text detection. In ICPR, pp. 3604–3609. Cited by: §4.4.1, Table 2.
  • [6] D. Deng, H. Liu, X. Li, and D. Cai (2018) Pixellink: detecting scene text via instance segmentation. In AAAI, pp. 6773–6780. Cited by: §2, §4.4.1, Table 2.
  • [7] W. Feng, W. He, F. Yin, X. Zhang, and C. Liu (2019) TextDragon: an end-to-end framework for arbitrary shaped text spotting. In ICCV, pp. 9076–9085. Cited by: Table 2.
  • [8] R. Girshick (2015) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: §3.4.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2980–2988. Cited by: §2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.1.
  • [11] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li (2017) Single shot text detector with regional attention. In ICCV, pp. 3047–3055. Cited by: §2, Table 2.
  • [12] W. He, X. Zhang, F. Yin, and C. Liu (2017) Deep direct regression for multi-oriented scene text detection. In ICCV, pp. 745–753. Cited by: §1, §2, §4.4.1, Table 2.
  • [13] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding (2017) Wordsup: exploiting word annotations for character based text detection. In ICCV, pp. 4940–4949. Cited by: Table 2.
  • [14] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §4.4.2, §4.
  • [15] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai (2019) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.4.2, Table 2, Table 4.
  • [16] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing 27 (8), pp. 3676–3690. Cited by: §1, §2, §3.4.
  • [17] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai (2020) Real-time scene text detection with differentiable binarization. In AAAI. Cited by: §2, Table 2.
  • [18] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai (2018) Rotation-sensitive regression for oriented scene text detection. In CVPR, pp. 5909–5918. Cited by: §1, §2, §4.4.2, Table 2, Table 3, Table 4.
  • [19] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §3.1.
  • [20] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §3.4.
  • [21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §2.
  • [22] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) Fots: fast oriented text spotting with a unified network. In CVPR, pp. 5676–5685. Cited by: §4.4.2, Table 2, §5.
  • [23] Y. Liu, L. Jin, S. Zhang, C. Luo, and S. Zhang (2019) Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition 90, pp. 337–345. Cited by: §4.2.
  • [24] Y. Liu and L. Jin (2017) Deep matching prior network: toward tighter multi-oriented text detection. In CVPR, pp. 3454–3461. Cited by: §1, §2.
  • [25] Y. Liu, S. Zhang, L. Jin, L. Xie, Y. Wu, and Z. Wang (2019) Omnidirectional scene text detection with sequential-free box discretization. In IJCAI, pp. 3052–3058. Cited by: §1, §4.3, §4.4.1, Table 2.
  • [26] Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. Ling Goh (2018) Learning markov clustering networks for scene text detection. In CVPR, pp. 6936–6944. Cited by: §4.4.1, Table 2.
  • [27] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: a flexible representation for detecting text of arbitrary shapes. In ECCV, pp. 20–36. Cited by: §2, §4.4.1, Table 2.
  • [28] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai (2018) Multi-oriented scene text detection via corner localization and region segmentation. In CVPR, pp. 7553–7563. Cited by: §1, §4.4.1, §4.4.2, Table 2, Table 4.
  • [29] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 20 (11), pp. 3111–3122. Cited by: §1, §2, Table 2.
  • [30] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al. (2017) Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In ICDAR, Vol. 1, pp. 1454–1459. Cited by: §4.4.2.
  • [31] L. Qiao, S. Tang, Z. Cheng, Y. Xu, Y. Niu, S. Pu, and F. Wu (2020) Text perceptron: towards end-to-end arbitrary-shaped text spotting. In AAAI. Cited by: Table 2.
  • [32] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao (2019) Towards unconstrained end-to-end text spotting. In ICCV, pp. 4704–4714. Cited by: Table 2.
  • [33] X. Qin, Y. Zhou, D. Yang, and W. Wang (2019) Curved text detection in natural scene images with semi- and weakly-supervised learning. In ICDAR, pp. 559–564. Cited by: §2.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §2.
  • [35] B. Shi, X. Bai, and S. Belongie (2017) Detecting oriented text in natural images by linking segments. In CVPR, pp. 2550–2558. Cited by: §2, §4.4.1, Table 2.
  • [36] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai (2017) Icdar2017 competition on reading chinese text in the wild (rctw-17). In ICDAR, Vol. 1, pp. 1429–1434. Cited by: §4.3, Table 3, §4.
  • [37] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding (2018) TextNet: irregular text reading from images with an end-to-end trainable network. In ACCV, pp. 83–99. Cited by: §5.
  • [38] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao (2016) Detecting text in natural image with connectionist text proposal network. In ECCV, pp. 56–72. Cited by: §2.
  • [39] Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia (2019) Learning shape-aware embedding for scene text detection. In CVPR, pp. 4234–4243. Cited by: §2, Table 2.
  • [40] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie (2016) COCO-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: §4.
  • [41] T. Vu, H. Jang, T. X. Pham, and C. Yoo (2019) Cascade rpn: delving into high-quality region proposal network with adaptive convolution. In NeurIPS, pp. 1430–1440. Cited by: §2.
  • [42] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao (2018) Geometry-aware scene text detection with instance transformation network. In CVPR, pp. 1381–1389. Cited by: §1, §2, §2, Table 2.
  • [43] H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu (2020) All you need is boundary: toward arbitrary-shaped text spotting. In AAAI. Cited by: §4.4.2, Table 2, Table 4.
  • [44] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In CVPR, pp. 9336–9345. Cited by: §1, §2, §3.1, §4.4.2, Table 2.
  • [45] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In ICCV, pp. 8440–8449. Cited by: §2, Table 2.
  • [46] X. Wang, Y. Jiang, Z. Luo, C. Liu, H. Choi, and S. Kim (2019) Arbitrary shape scene text detection with adaptive text region representation. In CVPR, pp. 6449–6458. Cited by: Table 2.
  • [47] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2019) Scene text detection with supervised pyramid context network. In AAAI, Vol. 33, pp. 9038–9045. Cited by: §2, §4.4.2, Table 2.
  • [48] L. Xing, Z. Tian, W. Huang, and M. R. Scott (2019) Convolutional character networks. In ICCV, pp. 9126–9136. Cited by: §4.4.2, Table 2.
  • [49] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai (2019) TextField: learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing 28 (11), pp. 5566–5579. Cited by: Table 2.
  • [50] Y. Xu, J. Duan, Z. Kuang, X. Yue, H. Sun, Y. Guan, and W. Zhang (2019) Geometry normalization networks for accurate scene text detection. In ICCV, pp. 9137–9146. Cited by: §1, §4.4.2, Table 2.
  • [51] C. Xue, S. Lu, and F. Zhan (2018) Accurate scene text detection through border semantics awareness and bootstrapping. In ECCV, pp. 355–372. Cited by: Table 2, Table 3.
  • [52] C. Xue, S. Lu, and W. Zhang (2019) MSR: multi-scale shape regression for scene text detection. In IJCAI, pp. 989–995. Cited by: Table 2.
  • [53] Q. Yang, M. Cheng, W. Zhou, Y. Chen, M. Qiu, and W. Lin (2018) Inceptext: a new inception-text module with deformable psroi pooling for multi-oriented scene text detection. In IJCAI, pp. 1071–1077. Cited by: §2, §4.4.1, §4.4.1, §4.4.2, Table 2, Table 3.
  • [54] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In CVPR, pp. 1083–1090. Cited by: §4.
  • [55] C. Yao, X. Bai, and W. Liu (2014) A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23 (11), pp. 4737–4749. Cited by: §4.4.1.
  • [56] C. Zhang, B. Liang, Z. Huang, M. En, J. Han, E. Ding, and X. Ding (2019) Look more than once: an accurate detector for text of arbitrary shapes. In CVPR, pp. 10552–10561. Cited by: §1, §2, §3.1, §4.4.1, §4.4.2, Table 2, Table 3, §5.
  • [57] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai (2016) Multi-oriented text detection with fully convolutional networks. In CVPR, pp. 4159–4167. Cited by: §2.
  • [58] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In CVPR, pp. 2642–2651. Cited by: §1, §2, §3.1, §4.4.1, §4.4.2, Table 2, Table 3.