STELA: A Real-Time Scene Text Detector with Learned Anchor

by   Linjie Deng, et al.

To achieve high coverage of target boxes, conventional one-stage anchor-based detectors commonly utilize multiple priors at each spatial position, especially in scene text detection tasks. In this work, we present a simple and intuitive method for multi-oriented text detection in which each location of the feature maps is associated with only one reference box. The idea is inspired by the two-stage R-CNN framework, which can estimate the location of objects of any shape by using learned proposals. The aim of our method is to integrate this mechanism into a one-stage detector and to employ the learned anchor, obtained through a regression operation, in place of the original one in the final predictions. Based on RetinaNet, our method achieves competitive performance on several public benchmarks at real-time speed (26.5 fps at 800p), which surpasses all anchor-based scene text detectors. In addition, since it requires less attention to anchor design, we believe our method can be easily applied to other analogous detection tasks. The code will be publicly available at



1 Introduction

Text in scenes usually conveys valuable semantic information. Thus, detecting text in natural images has recently attracted increasing attention in the computer vision community, because perceiving information is a critical part of artificial general intelligence. It has been widely used in applications such as multilingual translation, automotive assistance and image retrieval. Previous works [5, 26, 1, 33] have been dominated by sliding windows or connected components with hand-crafted features, which divide the task into a sequence of distinct steps and use a bottom-up strategy to search for characters and words. Although these methods have shown promising performance, they may be limited in complex situations due to the diversity of text instances and undesirable image quality.

With the remarkable progress in object detection driven by deep learning [15], recent methods treat text as a specific object and extend general object detection frameworks [29, 27, 20] to hypothesize word or text locations. These approaches can be divided into two major groups: two-stage proposal-driven methods and one-stage proposal-free methods. Although two-stage frameworks [37, 24, 10] consistently achieve top accuracy on the public benchmarks [12, 11, 25], recent works [16, 22, 2, 9] based on one-stage frameworks demonstrate that faster text detectors with comparable accuracy are possible.

Unlike two-stage detectors, which can classify boxes at any position and of any shape by using learned proposals [29] and the region pooling operation [6], one-stage detectors rely heavily on how densely the anchors cover the space of possible target locations [19]. A popular approach for achieving high coverage is to use multiple anchors to cover boxes of various scales and aspect ratios, especially in scene text detection. TextBoxes++ [16] was based on SSD [20] and defined 7 specific aspect ratios (1, 2, 3, 5, 1/2, 1/3 and 1/5) for the default boxes at each location of the feature maps. In order to achieve multi-oriented text detection, DMPNet [22] added several rotated anchors, for a total of 12 (6 regular and 6 inclined), to find the best match to arbitrary-oriented text instances. Instead of choosing priors by hand, DeepTextSpotter [2] followed YOLOv2 [28] and ran k-means clustering on the training set bounding boxes to automatically find suitable priors.

Given the anchor designs of the above detectors, a natural question to ask is: could we decrease the number of anchors and maintain similar accuracy? Such a change would bring a twofold benefit: reducing the manual attention paid to anchors and improving efficiency at the inference stage. First, the shapes and scales of anchors have to be predefined for each task, and this must be done carefully because a poor design may harm detection performance [35]. Second, most anchors correspond to false candidates that are irrelevant to the targets, and a large number of anchors can lead to significant computational cost when the network involves heavy heads. Besides, although not mentioned in many papers, anchor generation itself usually costs a certain amount of time.

(a) ICDAR 2013
(b) ICDAR 2015
Figure 1: The recall rates of different anchor designs on ICDAR 2013 (a) and ICDAR 2015 (b). "#sc" means number of scales, "#ar" means number of aspect ratios. The black line represents the IoU threshold that is usually used in the training stage to discriminate foreground from background. The colored dashed lines represent different anchor designs with various numbers of scales and aspect ratios. The solid line represents the recall rates of the learned anchor proposed in this work. The performance of each design is shown in Section 3.2.

Attracted by simple network architectures and high computational efficiency, in this work we investigate the anchor design issue discussed above within one-stage detectors for multi-oriented text detection. In one-stage methods, the optimization target in training and the prediction reference in testing are both based on the coverage between the original anchors and the target boxes. Hence, the quality of those prior boxes has a critical impact on the performance of a detector. Normally, as the number of anchors increases, the coverage of targets increases, but it can still saturate in some situations, as shown in Figure 1(b). Therefore, we need to find a better way to choose priors that makes it easier for the network to learn to predict good detections. Inspired by the learned proposal mechanism [29] in the two-stage R-CNN framework, we utilize the learned anchor, obtained through a regression operation, in place of the original one in the final predictions. It is worth noting that, unlike the region proposal network (RPN) in two-stage detectors, which reduces the number of candidate locations down to one or two thousand, we maintain the original quantity of anchors and keep the remaining parts of the one-stage detector's architecture. To validate its effectiveness, we adopt the state-of-the-art RetinaNet [19] as our baseline model and present a simple and intuitive text detector named STELA (Scene TExt Detector with Learned Anchor), in which each location of the feature maps is associated with only one anchor. Following the standard evaluation protocols of each benchmark, our method achieves comparable performance, with an F-measure of 0.887 on ICDAR 2013 [12], 0.833 on ICDAR 2015 [11] and 0.715 on ICDAR 2017 MLT [25]. Besides, our method is a real-time scene text detector running at 26.5 fps at 800p, which surpasses all anchor-based methods.
Finally, since it requires less attention to anchor design, we believe our method can be easily applied to other analogous detection tasks. All of our training and testing code will be open-sourced soon.

2 Methodology

2.1 One-Stage Object Detection

In this section, we first review the one-stage detection pipeline. OverFeat [31] is one of the first modern one-stage object detectors based on deep neural networks. More recently, SSD [20] and YOLOv2 [28] have renewed interest in one-stage methods. Their key idea is to associate a set of pre-defined anchors, centered at each location of the feature maps, and to make the final predictions based on those reference boxes [14]. As shown in Figure 2(a), such a detector basically contains a backbone network for feature extraction over the entire image, followed by two parallel sub-networks: one predicts the probability distribution over multiple categories for each anchor, and the other regresses the offset from each positive candidate to a nearby ground-truth box, if one exists.

(a) One-Stage Detector
(b) Faster R-CNN
(c) Ours
Figure 2: The architectures of different frameworks. "I": input image, "Conv": backbone convolutions, "H": network head, "C": classification. "A0" denotes the original anchors in all architectures; "A1" in (b) and (c) represents the selected proposals and the learned anchors respectively. "A" and "C" in red represent the final outputs of each detector.

Compared with two-stage R-CNN methods (Figure 2(b)), one-stage detectors skip the region proposal generation step and give the final predictions (classification and regression) directly based on the original anchors. However, their detection accuracy usually lags behind that of two-stage approaches. One of the main reasons is that they must process a much larger set of candidate object locations regularly sampled across an image, so an extreme foreground-background class imbalance is encountered during training and hampers the resulting performance. More recently, RetinaNet [19] proposed the focal loss (FL) to address this class imbalance, so that one-stage detectors are able to match the accuracy of existing two-stage ones. The focal loss is modified from the standard cross entropy (CE) loss:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

In the above, $y$ specifies the ground-truth class and $p \in [0, 1]$ is the estimated probability for the class with label $y = 1$. Normally, we define $p_t$:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$

and rewrite $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$. Then, the classification loss is defined as:

$$L_{cls} = \mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $\alpha_t$ is a weighting factor and $\gamma$ is a focusing parameter. The modulating term $(1 - p_t)^{\gamma}$ is applied to the cross entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives. In our implementation, we follow the original focal loss [19] and set $\alpha = 0.25$ and $\gamma = 2$.
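As a minimal sketch of the loss described above, the following Python snippet evaluates the focal loss $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ for a single binary prediction and compares it with plain cross entropy. The function names, the defaults `alpha=0.25` and `gamma=2.0` (the values reported for RetinaNet [19]), and the convention that $\alpha_t$ equals $\alpha$ for positives and $1 - \alpha$ for negatives are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy for one prediction (y is 1 or 0)."""
    p_t = p if y == 1 else 1.0 - p
    # Clamp to avoid log(0).
    return -math.log(min(max(p_t, eps), 1.0))


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 1 for positive and 0 for negative.
    alpha, gamma: assumed defaults taken from RetinaNet.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for p, y in [(0.9, 1), (0.1, 0), (0.1, 1)]:
        print(p, y, cross_entropy(p, y), focal_loss(p, y))
```

An easy negative (for example `p = 0.1`, `y = 0`) keeps only a small fraction of its cross entropy value, while a hard positive (`p = 0.1`, `y = 1`) is reduced far less, which is the down-weighting behavior the text refers to.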

2.2 Rotated Bounding Box Regression

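As depicted in [4], using axis-aligned rectangular bounding boxes to localize multi-oriented text may result in redundant background noise and unnecessary overlap. Thus, we adopt rotated rectangular boxes to match arbitrary-oriented text instances. Each bounding box is represented by a five-tuple $(x, y, w, h, \theta)$, where $(x, y)$ is the center point, $(w, h)$ are the width and height, and $\theta$ is the angle to the horizontal. The task of the regression operation is to predict the distance of each item from a positive anchor to the nearby ground-truth. Normally, to encourage a regression invariant to scale and location, the distance vector $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta)$ is defined by:

$$\delta_x = \frac{g_x - b_x}{b_w}, \quad \delta_y = \frac{g_y - b_y}{b_h}, \quad \delta_w = \log\frac{g_w}{b_w}, \quad \delta_h = \log\frac{g_h}{b_h}, \quad \delta_\theta = \tan(g_\theta) - \tan(b_\theta)$$

where $b$ and $g$ represent a bounding box and its target ground-truth respectively. The regression task loss is calculated from the regression target $\Delta$ and the predicted tuple $t = (t_x, t_y, t_w, t_h, t_\theta)$:

$$L_{loc} = \sum_{i \in \{x, y, w, h, \theta\}} \mathrm{smooth}_{L_1}(t_i - \delta_i)$$

where $\mathrm{smooth}_{L_1}$ is the robust loss defined in [6]. Usually, to improve the effectiveness of multi-task learning, $\Delta$ is normalized by its mean and variance. In our experiments, the mean is set to (0, 0, 0, 0, 0) and the variance is set to (0.1, 0.1, 0.2, 0.2, 0.1).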
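The encoding of a rotated box against an anchor can be sketched as follows. This is only an illustration under stated assumptions: boxes are tuples `(x, y, w, h, theta)` with `theta` in radians, the center and size terms use the usual scale- and location-invariant R-CNN parameterization, the angle term is taken as `tan(g_theta) - tan(b_theta)` (an assumption), and normalization divides by the listed "variance" values in the SSD-style convention (also an assumption). The names `encode` and `decode` are hypothetical.

```python
import math

MEAN = (0.0, 0.0, 0.0, 0.0, 0.0)
# The paper lists these as "variance"; dividing by them is an assumed convention.
STD = (0.1, 0.1, 0.2, 0.2, 0.1)


def encode(b, g):
    """Normalized regression target from anchor b to ground-truth g."""
    bx, by, bw, bh, bt = b
    gx, gy, gw, gh, gt = g
    raw = (
        (gx - bx) / bw,              # center shift, relative to anchor width
        (gy - by) / bh,              # center shift, relative to anchor height
        math.log(gw / bw),           # log-scale width ratio
        math.log(gh / bh),           # log-scale height ratio
        math.tan(gt) - math.tan(bt)  # angle term (assumed form)
    )
    return tuple((r - m) / s for r, m, s in zip(raw, MEAN, STD))


def decode(b, delta):
    """Invert encode(): apply a normalized offset to anchor b."""
    bx, by, bw, bh, bt = b
    dx, dy, dw, dh, dt = (d * s + m for d, m, s in zip(delta, MEAN, STD))
    return (
        bx + dx * bw,
        by + dy * bh,
        bw * math.exp(dw),
        bh * math.exp(dh),
        math.atan(math.tan(bt) + dt),
    )


if __name__ == "__main__":
    anchor = (100.0, 100.0, 64.0, 32.0, 0.0)
    target = (110.0, 96.0, 80.0, 40.0, math.radians(15))
    delta = encode(anchor, target)
    print(delta)
    print(decode(anchor, delta))
```

Decoding the encoded target recovers the ground-truth box as long as the angles stay within (-90°, 90°), which is the range on which `atan` inverts `tan`.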

2.3 Learned Anchor

Figure 3: The view of different boxes. The blue, red and green boxes represent original anchor, learned anchor and final output box respectively. It is worth noting that the learned anchor (red) shares the same central point with the original one (blue). Viewing digitally with zooming is recommended.

Normally, the detector needs to search for the true positives among thousands of anchors and adjust their shapes and locations to fit the targets more tightly. It is difficult to determine whether a bounding box is a positive candidate, because it usually includes an object and some amount of background. In practice, this is solved by the IoU metric between a box $b$ and its nearest ground-truth $g$. Commonly, the threshold $T$ is a constant set to 0.5. If the IoU is above the threshold $T$, the bounding box $b$ is considered a positive example:

$$y = \begin{cases} 1 & \text{if } \mathrm{IoU}(b, g) \geq T \\ 0 & \text{otherwise} \end{cases}$$

Here, $y$ specifies the ground-truth class. It is worth noting that the conventional IoU based on axis-aligned rectangular boxes is unsatisfactory for our task, so we modify it to compute the overlap of rotated rectangles. Given all this, the optimization target in training is determined by the overlaps between the original anchors and the ground-truth boxes.
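One way to compute the overlap of rotated rectangles is to clip one rectangle against the other as convex polygons and measure the remaining area. The sketch below uses Sutherland-Hodgman clipping and the shoelace formula; this choice of algorithm, the box format `(x, y, w, h, theta)` with `theta` in radians, and the helper names are assumptions for illustration and are not taken from the authors' code.

```python
import math


def corners(box):
    """Counter-clockwise corner points of a rotated box (x, y, w, h, theta)."""
    x, y, w, h, t = box
    c, s = math.cos(t), math.sin(t)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets]


def clip(subject, a, b):
    """Keep the part of polygon `subject` on the left of the directed edge a->b."""
    def side(p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    out = []
    for i in range(len(subject)):
        p, q = subject[i], subject[(i + 1) % len(subject)]
        sp, sq = side(p), side(q)
        if sp >= 0:
            out.append(p)
        if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
            t = sp / (sp - sq)  # intersection of segment pq with the clip line
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def area(poly):
    """Polygon area by the shoelace formula."""
    s = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def rotated_iou(b1, b2):
    """IoU of two rotated rectangles."""
    poly = corners(b1)
    clipper = corners(b2)
    for i in range(4):
        poly = clip(poly, clipper[i], clipper[(i + 1) % 4])
        if not poly:
            return 0.0
    inter = area(poly)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union


def assign_label(anchor, gts, threshold=0.5):
    """Label an anchor positive (1) if its best IoU reaches the threshold."""
    best = max((rotated_iou(anchor, g) for g in gts), default=0.0)
    return 1 if best >= threshold else 0
```

For instance, a 2×2 square and the same square rotated by 45° around the shared center overlap in a regular octagon, giving an IoU of 1/√2 ≈ 0.707, so the rotated copy would still count as a positive match at a threshold of 0.5.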

However, the original anchors, whose fixed scales and aspect ratios are pre-defined manually, may not be the optimal design. In comparison with one-stage detectors, we argue that the most important part of the proposal scheme in two-stage methods is that the selected proposals are chosen by learning. This allows two-stage methods to reduce the search space of targets while also improving the quality of candidates. Inspired by this, we integrate this mechanism into one-stage detectors. We simply add an extra regression branch for anchor refining and use the learned anchor in the final classification and regression, as shown in Figure 2(c).

In particular, the regression targets of the learned anchor are not arbitrary. As noted in [35], one of the general rules for a reasonable anchor design is alignment: to use convolutional features as anchor representations, the center of an anchor needs to be well aligned with the feature map pixels. To this end, we only regress the offsets within $(\delta_w, \delta_h, \delta_\theta)$, which keeps the anchors aligned with the feature map, as depicted in Figure 3. Following the regression task, the anchor refining loss is defined as:

$$L_{ar} = \sum_{i \in \{w, h, \theta\}} \mathrm{smooth}_{L_1}(t_i - \delta_i)$$

where $t_i$ and $\delta_i$ denote the predicted and target offsets, and $\mathrm{smooth}_{L_1}$ is the robust loss defined in [6]. Unlike two-stage R-CNN methods, which filter anchors with an objectness score, we only adjust the shape of each anchor and keep the number of anchors unchanged. Evaluated on public benchmarks, we find that the coverage of targets is greatly enhanced after the anchor refining stage compared with the original anchors, as shown in Figure 1.
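The refining step can be illustrated with a short sketch in which only the width, height and angle of each anchor are updated while the center stays fixed on the feature map grid. The function name `refine_anchor`, the box format `(x, y, w, h, theta)` in radians, and the way the angle offset is applied (through the tangent) are assumptions for illustration only.

```python
import math


def refine_anchor(anchor, offsets):
    """Apply (dw, dh, dtheta) to an anchor; the center (x, y) is left untouched."""
    x, y, w, h, t = anchor
    dw, dh, dt = offsets
    return (
        x,                            # center stays aligned with the feature map
        y,
        w * math.exp(dw),             # log-scale width update
        h * math.exp(dh),             # log-scale height update
        math.atan(math.tan(t) + dt),  # angle update (assumed form)
    )


def refine_all(anchors, all_offsets):
    """Refine every anchor; the number of anchors is unchanged."""
    return [refine_anchor(a, o) for a, o in zip(anchors, all_offsets)]


if __name__ == "__main__":
    square = (64.0, 64.0, 32.0, 32.0, 0.0)
    # Stretch to a long, thin, inclined box that better fits a text line.
    offsets = (math.log(4.0), math.log(0.5), math.tan(math.radians(30)))
    print(refine_anchor(square, offsets))
```

In this example a single square anchor becomes a 128×16 box inclined by 30° around the same center, which is the kind of shape change that lets one anchor per location cover elongated text.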

2.4 Network Architecture

As a trade-off between efficiency and accuracy, all of our experiments are implemented on RetinaNet [19] with ResNet-50 [8] as the backbone, though other networks are also applicable. We also adopt the Feature Pyramid Network (FPN) from [18] to construct a rich, multi-scale feature pyramid from a single-resolution input image. The FPN consists of several levels of feature maps, and the corresponding base anchor sizes are adjusted from those of the source implementation for detecting small text instances. In the original RetinaNet, the two sub-networks (heads) are deeper, with 5 convolutional layers each. To improve the running speed, we streamline the heads by decreasing the number of layers from 5 to 2. This may result in a slight accuracy loss, but gives us a real-time text detector in return. Based on the above definitions, the model is trained to simultaneously minimize the losses on anchor refining, final regression and classification. Overall, the loss function is a weighted sum of the three losses:

$$L = \lambda_1 L_{ar} + \lambda_2 L_{loc} + \lambda_3 L_{cls}$$

where $L_{ar}$, $L_{loc}$ and $L_{cls}$ are the anchor refining, final regression and classification losses, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are user constants indicating the relative strength of each component. In order to keep the balance of the different loss types, we set them to 0.5, 0.5 and 1.0 respectively.
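The weighted sum can be sketched in a few lines of Python. The smooth L1 function follows its standard definition from [6]; the mapping of the weights 0.5, 0.5 and 1.0 to the anchor refining, final regression and classification losses follows the order in which the text lists them and is an assumption, as are all function names.

```python
def smooth_l1(x):
    """Smooth L1 (robust) loss of a single residual, as defined in Fast R-CNN."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def regression_loss(pred, target):
    """Sum of smooth L1 terms over the regressed components."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def total_loss(l_refine, l_reg, l_cls, weights=(0.5, 0.5, 1.0)):
    """Weighted sum of anchor refining, final regression and classification losses."""
    w_refine, w_reg, w_cls = weights
    return w_refine * l_refine + w_reg * l_reg + w_cls * l_cls


if __name__ == "__main__":
    # Anchor refining regresses three components (w, h, theta),
    # the final regression regresses five (x, y, w, h, theta).
    l_refine = regression_loss((0.2, -0.1, 0.05), (0.0, 0.0, 0.0))
    l_reg = regression_loss((0.1, 0.0, 0.3, -0.2, 1.5), (0.0, 0.0, 0.0, 0.0, 0.0))
    print(total_loss(l_refine, l_reg, l_cls=0.4))
```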

3 Experiments

3.1 Implementation Details

The backbone of the network is initialized with a model trained on ImageNet [30] for the classification task, and the other layers are initialized following [19]. The network is trained with the Adam [13] optimizer. The batch size is restricted by the hardware. We randomly pick 100,000 images from SynthText [7] to pretrain the network for 5 epochs, and collect real data from ICDAR 2013 [12], 2015 [11] and 2017 [25] to finetune a final model for 25 epochs. The learning rate is decayed after 15 epochs of finetuning. We use a multi-scale training scheme that randomly resizes the input size between 480 and 800. Random flipping is also used for data augmentation.

In particular, in order to capture more regression target candidates, we set the IoU threshold to 0.3 in anchor refining training. In the inference stage, a confidence threshold of 0.3 and a non-maximum suppression threshold of 0.3 are applied to yield the final outputs. The proposed method is implemented using PyTorch, and all experiments are carried out on a standard PC with an Intel i7-6800K and a single NVIDIA TITAN Xp.
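The inference-time filtering can be sketched as a score threshold followed by greedy non-maximum suppression, both at 0.3 as stated above. The overlap function is passed in as a parameter; for brevity this example uses an axis-aligned IoU on `(x1, y1, x2, y2)` boxes as a stand-in, whereas a rotated-rectangle IoU would be needed for the actual task. All names are hypothetical.

```python
def iou_axis_aligned(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2); a stand-in for rotated IoU."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def postprocess(boxes, scores, iou_fn, score_thr=0.3, nms_thr=0.3):
    """Return indices of boxes kept after score filtering and greedy NMS."""
    # Drop low-confidence detections, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou_fn(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
    scores = [0.9, 0.8, 0.7, 0.2]
    print(postprocess(boxes, scores, iou_axis_aligned))
```

In the example the second box overlaps the first with an IoU of about 0.68 and is suppressed, the last box falls below the confidence threshold, and the indices `[0, 2]` remain.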

3.2 Ablation Study

To investigate the effectiveness of our method, we conduct several ablation studies. Each model is evaluated on ICDAR 2013 [12] and 2015 [11] benchmarks.

Anchor Design: We first investigate the impact of different anchor designs on performance, including accuracy and efficiency. The baseline models are directly extended from RetinaNet by simply changing the regression strategy to the one introduced in Section 2.2. The aspect ratios of the anchors at a single location of the feature maps are selected from {0.25, 0.5, 1, 2, 4}, and either one or three scales are used per location. As shown in Table 1, as the number of anchors increases, the F-measure improves from 0.447 to 0.863 on ICDAR 2013, whereas on ICDAR 2015 it rises from 0.621 to 0.742 with three aspect ratios and then barely changes (0.753 with the densest design). Based on Figure 1, we argue that the coverage of targets becomes saturated on ICDAR 2015, but not on ICDAR 2013. This again indicates that the most important design factor in a one-stage detector is how densely it covers the space of target boxes.

Attaching the anchor refining operation to the first baseline, the resulting model obtains a large improvement in recall rates on both benchmarks (shown in Figure 1), and its F-measure rises from 0.447 to 0.887 on ICDAR 2013 and from 0.621 to 0.833 on ICDAR 2015. This strongly demonstrates the effectiveness of our approach. We also assessed the other baselines with anchor refining; they run slower and show no obvious improvement.

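| #sc | #ar | la | ICDAR 2013 | ICDAR 2015 | FPS |
|---|---|---|---|---|---|
| 1 | 1 | no | 0.447 | 0.621 | 27.3 |
| 1 | 3 | no | 0.789 | 0.742 | 26.3 |
| 3 | 3 | no | 0.788 | 0.742 | 24.1 |
| 3 | 5 | no | 0.863 | 0.753 | 22.3 |
| 1 | 1 | yes | 0.887 | 0.833 | 26.5 |

Table 1: The impact of different anchor designs, reported as F-measure. "#sc" is the number of scales, "#ar" is the number of aspect ratios, and "la" indicates whether the learned anchor is used. All input images are resized to 800 pixels.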
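To give a sense of how many anchors each design implies, the following sketch counts anchors for a square input. It assumes an 800×800 image and the five pyramid strides 8, 16, 32, 64 and 128 used by RetinaNet, with feature map sizes rounded up; these values and the function names are assumptions for illustration and may differ from the configuration actually used in the experiments.

```python
import math


def num_locations(size, strides=(8, 16, 32, 64, 128)):
    """Number of feature map positions over all pyramid levels for a square input."""
    return sum(math.ceil(size / s) ** 2 for s in strides)


def num_anchors(size, n_scales, n_ratios, strides=(8, 16, 32, 64, 128)):
    """Total anchors when every position holds n_scales * n_ratios anchors."""
    return num_locations(size, strides) * n_scales * n_ratios


if __name__ == "__main__":
    for n_scales, n_ratios in [(1, 1), (1, 3), (3, 3), (3, 5)]:
        print(n_scales, n_ratios, num_anchors(800, n_scales, n_ratios))
```

Under these assumptions a single anchor per location gives 13,343 anchors, while the densest design in the table (3 scales, 5 aspect ratios) gives 200,145, fifteen times as many boxes to classify and regress.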

Number of Stages: Like Cascade R-CNN [3], we add more stages of anchor refining to compare their influence. We also increase the IoU threshold of each refining stage, following Cascade R-CNN. The results are summarized in Table 2. Adding more refining stages does not lead to a significant improvement and can even decrease accuracy. Besides, increasing the number of stages slows down the running speed. Therefore, one refining stage is the best choice for our method.

| #stages | ICDAR 2013 | ICDAR 2015 | FPS |
|---|---|---|---|
| 1 | 0.887 | 0.833 | 26.5 |
| 2 | 0.889 | 0.830 | 25.1 |
| 3 | 0.879 | 0.827 | 23.7 |

Table 2: The impact of the number of stages, reported as F-measure. "#stages" means the number of anchor refining stages.

3.3 Comparison to State of the Art

We evaluate our method on several public benchmarks and compare it to recent state-of-the-art methods. Figure 4 shows some detection results from each dataset.

The ICDAR 2013 [12] dataset consists of 229 training and 233 testing images, which were captured by users explicitly focusing the camera on the text content of interest. It is a standard benchmark for evaluating horizontal or nearly horizontal text detection. On this benchmark, we set the scale of the input images to 800 for single-scale testing. We also evaluate multi-scale testing, in which the scales are set to 320, 480, 640 and 800. As shown in Table 3, the proposed method outperforms all anchor-based methods, including DeepText [37], FCRN [7] and CTPN [34], which are mainly designed for nearly horizontal text detection. For single-scale testing, our method achieves a real-time running speed of 26.5 fps. Even with multi-scale testing, our method runs at 10.5 fps. Compared with recent methods [21, 17, 23, 4], our method is comparable in accuracy and efficiency.

ICDAR Standard DetEval FPS
Recall Precision F-measure Recall Precision F-measure
TextBoxes++[16] 0.740 0.860 0.800 0.740 0.880 0.810 11.6
CTPN [34] 0.730 0.930 0.820 0.820 0.930 0.880 7.1
FCRN [7] 0.764 0.938 0.842 0.755 0.920 0.830 0.8
DeepText [37] 0.830 0.870 0.850 0.6
SegLink [32] 0.830 0.877 0.853 20.6
Lyu et al. [23] 0.794 0.933 0.858 10.4
SSTD [9] 0.860 0.880 0.870 0.860 0.890 0.880 7.7
EAST [38] 0.828 0.926 0.874 16.8
CRPN [4] 0.839 0.919 0.876 0.840 0.921 0.879 9.1
FOTS [21] 0.882 0.883 23.9
Ours 0.849 0.927 0.887 0.851 0.933 0.890 26.5
Lyu et al. [23]* 0.844 0.920 0.880 1.0
TextBoxes++ [16]* 0.840 0.910 0.880 0.860 0.920 0.890 2.3
RRD [17]* 0.860 0.920 0.890
Ours* 0.896 0.937 0.916 0.893 0.938 0.915 10.5
Table 3: Results on ICDAR 2013 Focused Scene Text. "*" means multi-scale test. "–" means no report in their papers.

The ICDAR 2015 [11] benchmark was released during the ICDAR 2015 Robust Reading Competition. It provides 1000 training and 500 testing images, which were collected without any specific prior attention to the text. It was designed for multi-oriented text detection, so all images are annotated with word-level quadrangles. To evaluate the adaptability of our learned anchor, we again set the input size to 800 pixels. As shown in Table 4, our method achieves an F-measure of 0.833, which also surpasses all of the anchor-based methods [22, 32, 9, 24, 16], including one-stage and two-stage frameworks. Compared with other approaches [38, 21], which utilize a deep regression network that directly predicts text regions, our method still keeps a clear lead in running speed.

| Method | Recall | Precision | F-measure | FPS |
|---|---|---|---|---|
| DMPNet [22] | 0.680 | 0.730 | 0.710 | – |
| SegLink [32] | 0.768 | 0.731 | 0.750 | 8.9 |
| SSTD [9] | 0.730 | 0.800 | 0.770 | 7.7 |
| EAST [38] | 0.735 | 0.836 | 0.782 | 13.2 |
| RRPN [24] | 0.770 | 0.840 | 0.800 | 3.0 |
| TextBoxes++ [16] | 0.707 | 0.941 | 0.807 | 3.6 |
| Lyu et al. [23] | 0.767 | 0.872 | 0.817 | 11.6 |
| RRD [17] | 0.790 | 0.856 | 0.822 | 6.5 |
| FOTS RT [21] | 0.798 | 0.856 | 0.828 | 22.6 |
| CRPN [4] | 0.807 | 0.887 | 0.845 | 5.0 |
| Ours | 0.786 | 0.887 | 0.833 | 26.5 |

Table 4: Results on ICDAR 2015 Incidental Scene Text. All results are reported with a single testing scale. "–" means no report in their papers.
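The F-measure reported in the tables is the harmonic mean of recall and precision. As a small worked check, the sketch below recomputes it from the recall and precision reported for our method on ICDAR 2015 (0.786, 0.887) and on ICDAR 2017 MLT (0.655, 0.787); the function name is hypothetical.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


if __name__ == "__main__":
    print(round(f_measure(0.786, 0.887), 3))  # ICDAR 2015
    print(round(f_measure(0.655, 0.787), 3))  # ICDAR 2017 MLT
```

Both values round to the reported F-measures of 0.833 and 0.715.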

ICDAR 2017 MLT [25] is a large-scale multi-lingual text dataset, which includes 7200 training, 1800 validation and 9000 testing images with text in 9 languages. It was proposed for verifying the generalization ability of each method and is therefore more difficult than previous ICDAR challenges. Due to the larger number of small text instances in this dataset, we double the scale of the testing images to 1600 pixels. Our method achieves 0.655, 0.787 and 0.715 in recall, precision and F-measure respectively, using the officially provided online evaluation system, as shown in Table 5. These results demonstrate that our method can be applied in practice to multi-lingual text detection.

| Method | Recall | Precision | F-measure |
|---|---|---|---|
| SARI_FDU_RRPN_v1 | 0.555 | 0.711 | 0.623 |
| Sensetime OCR | 0.694 | 0.569 | 0.625 |
| SCUT_DLVClab1 | 0.545 | 0.802 | 0.648 |
| Lyu et al. [23] | 0.556 | 0.838 | 0.668 |
| FOTS [21] | 0.575 | 0.809 | 0.672 |
| Border [21] | 0.621 | 0.777 | 0.690 |
| SPCNET [36] | 0.669 | 0.734 | 0.700 |
| Ours | 0.655 | 0.787 | 0.715 |
Table 5: Results on ICDAR 2017 Multi-lingual Scene Text Detection. All results are reported with a single testing scale. Some results are obtained from the ICDAR 2017 MLT leaderboard.

Figure 4: Selected results from the public benchmarks. Viewing digitally with zooming is recommended.

4 Conclusion and Future Work

In this work, we propose a simple and intuitive method based on RetinaNet for multi-oriented text detection, in which each location of the feature maps is associated with only one anchor. The aim of our method is to integrate the learning mechanism of the two-stage R-CNN framework into a one-stage detector and to use the learned anchor in place of the original one in the final predictions. Experimental results on public benchmarks confirm that the proposed method achieves performance comparable to state-of-the-art methods while running in real time. In the future, we are interested in integrating the detector with a text recognizer to form an end-to-end text reading system. In addition, we plan to evaluate it on other detection tasks to examine the generality of our approach.