ArbiText: Arbitrary-Oriented Text Detection in Unconstrained Scene

11/30/2017 ∙ by Daitao Xing, et al. ∙ NYU college 0

Arbitrary-oriented text detection in the wild is a very challenging task due to variations in aspect ratio, scale, orientation, and illumination. In this paper, we propose a novel method, namely the Arbitrary-oriented Text (ArbiText for short) detector, for efficient text detection in unconstrained natural scene images. Specifically, we first adopt circle anchors rather than rectangular ones to represent bounding boxes, which is more robust to orientation variations. Subsequently, we incorporate a pyramid pooling module into the Single Shot MultiBox Detector framework in order to simultaneously explore local and global visual information, which in turn yields more confident detection results. Experiments on established scene-text datasets, such as the ICDAR 2015 and MSRA-TD500 datasets, demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.






1 Introduction

Understanding texts in the wild plays an important role in many real-world applications such as PhotoOCR [2], road sign detection in intelligent vehicles [3], license plate detection [17], and assistive technology for the visually impaired [6] [1]. To achieve this goal, accurate arbitrary-oriented text detection becomes extremely important. Conventionally, when dealing with horizontal texts in controlled environments, this task can be accomplished through character-based methods such as [9], [10], [18], and [24], since individual letters can be easily segmented and distinguished. However, in an unconstrained natural-scene image, text detection becomes rather challenging due to uncontrolled text variations and uncertainties, such as multi-orientation, text distortion, background noise, occlusion, and illumination changes. To address these problems, many recent efforts have been devoted to employing state-of-the-art generic object detectors, such as the Fully Convolutional Network (FCN) [20], Region-based Convolutional Neural Network (R-CNN) [19], and Single Shot Detector (SSD) [14], for text detection in the wild.

Figure 1: Text detection results of the proposed ArbiText method. Images in the first row show examples from the ICDAR 2015 dataset; the second row shows examples from the MSRA-TD500 dataset.
Figure 2: The framework of the proposed method. Given an input image with a size of 384×384, the VGG-16 base network outputs the first feature map from the conv4_3 layer. More feature maps with cascading sizes are extracted from extra layers following the first feature map. The first feature map is also used to produce different sub-region representations through the Pyramid Pooling Module. These representations are then concatenated with the feature maps of the same size to form the final feature maps. Finally, those maps are fed into a convolution layer to produce the final predictions.

Despite their promising performance in generic object detection, these methods suffer from the gap between the data distributions of texts and generic objects. To enhance the generalization abilities of existing deep models, [7] proposed to naturally blend rendered words onto wild images for training data augmentation. A model trained on such data is robust to noise and uncontrolled variations. [31] and [13] attempted to integrate region-proposal layers into deep neural networks, which can generate text-specific proposals (e.g. bounding boxes with larger aspect ratios). In [16], bounding-box rotation proposals were introduced to make the model more adaptive to unknown text orientations in natural-scene images. Nevertheless, the aforementioned scene-text detectors either need to consider a large number of proposal hypotheses, which dramatically decreases computational efficiency, or rely on preset bounding-box characteristics that are insufficient to handle the severe visual variations of scene texts in unconstrained natural images.

To address the aforementioned drawbacks of existing methods, we propose a novel proposal-free model for arbitrary-oriented text detection in natural images based on circle anchors and the Single Shot Detector (SSD) framework. More specifically, we adopt circle anchors to represent the bounding boxes, which are more robust to orientation, aspect ratio, and scale variations than conventional rectangular ones. SSD, one of the state-of-the-art object detectors, is employed for its fast detection speed and promising accuracy in generic object detection. Besides the feature maps generated by the original SSD, we additionally incorporate a pyramid pooling module, which builds multiple feature representations at different spatial scales. By merging these different kinds of feature maps, both local and global information can be preserved, so that texts in unconstrained natural scenes can be detected more reliably. Subsequently, the merged feature maps are fed into a text detection module, consisting of several fully connected convolutional layers, to predict confident circle anchors. Furthermore, in order to overcome the difficulty of deciding positive points caused by the unfixed sizes of circle anchors, we introduce a novel mask loss function that assigns those ambiguous points to a new class. To obtain the final detection results, the Locality-Aware Non-Maximum-Suppression (LANMS) scheme [33] is employed. It should be noted that we do not utilize any proposals, which makes the proposed method more computationally efficient.

In summary, the contributions of our work are three-fold:

  • We propose a novel proposal-free method for detecting arbitrary-oriented texts in unconstrained natural scene images, based on the circular anchor representation and the Single Shot Detector framework. The circular anchors are more robust to different aspect ratio, scale, and orientation variations, compared to conventional rectangular ones.

  • We incorporate a pyramid pooling module into SSD, which can explore both the local and global visual information for robust text detection.

  • We develop a new mask loss function to overcome the difficulty of deciding positive points caused by unfixed sizes of circle anchors, which can therefore improve the final detection accuracy.

2 Related Works

Character-based detection methods have already achieved state-of-art results on horizontal texts in relatively controlled and stable environments. Methods like those proposed in [9], [10], [18], and [24] either detect individual characters by classification of sliding windows or utilize some form of connected-component and region-based framework such as the Maximally Stable Extremal Regions(MSER) detector.

However, some of these methods might not be ideal for detecting scene-texts or multi-oriented texts as more and more environmental variations and uncertainties in terms of text distortion, orientation, occlusion, and noise are introduced. Detecting individual characters in close clusters or ones that blend into the background can also be challenging. Hence, many researchers decide to tackle the problem by approaching the task of text detection as object detection: treating words and/or text-lines as the target object.

Region-based Convolutional Neural Network (R-CNN) [19], Single Shot Detector (SSD) [14], and segmentation-based Fully Convolutional Network (FCN) [20] are frequently re-purposed for text detection because their speed and accuracy are well suited to the time- and resource-constrained nature of the task. As expected, the majority of cutting-edge research in scene-text detection, including ours, is based on one of the aforementioned object detection models, which we analyze below.

Segmentation-based Methods: Both [29] and [26] accomplish semantic segmentation of text lines by utilizing the Fully Convolutional Network (FCN), which has achieved great performance in pixel-level classification tasks. In [29], for example, a pixel-wise text/non-text saliency map is first produced via the FCN; subsequently, geometric and character processing is applied to generate and filter text-line hypotheses. Although these methods can achieve state-of-the-art results even for scene-text detection in the wild, the sophisticated post-processing they require for word partitioning and false positive removal can be too time consuming and computationally intensive for real-world applications.

A more recent method proposed in [33], however, seeks a dramatic improvement in efficiency over [29] and [26] by eliminating intermediate steps such as candidate aggregation and word partitioning in the neural network. Nevertheless, the inherent nature of the segmentation approach - dense per-pixel processing and prediction - is still a bottleneck that prevents segmentation-based methods from outperforming their competitors.

Region-Proposal-based Methods: Although region-proposal-based models like R-CNN are already state-of-the-art object detectors, they cannot be used for text detection without modification, since their anchor box design is not ideal for the large aspect ratios of words and text lines. [31] addresses this problem by proposing a novel Region Proposal Network (RPN) called Inception-RPN, which contains a preset of text-characteristic prior bounding boxes to generate text-specific proposals and thus filter out low-quality word regions.

However, [31] only performs well on horizontal texts since bounding box characteristics are extremely unpredictable for scene-texts in the wild; multi-oriented and distorted texts can create countless possibilities and variations of bounding-box size, shape, and orientation.

To address this challenge, some researchers designed novel region-proposal methods: the rotation proposal method in [16] can predict the orientation of a text line and thus generate inclined bounding boxes for oriented texts, while the quadrilateral sliding windows in [15] create a much tighter bounding-box fit around text regions, thus dramatically reducing background noise and interference. Other researchers modify the model architecture instead, such as [5], which adds 2D offsets to the standard convolution to enable free-form deformation of the sampling grid, and [8], which utilizes direct bounding-box regression originating from a center anchor point in a proposal region.

SSD-based Methods: SSD-based methods are highly stable and efficient in generating word proposals because SSD is one of the fastest object detectors while also being as accurate as slower region-proposal-based models like R-CNN. However, SSD has similar shortcomings in terms of anchor box design when it comes to scene-text detection. Thus, [13] supplements SSD with "textbox layers" that can generate bounding boxes with larger aspect ratios and simultaneously predict text presence and bounding boxes. Unfortunately, this method only works on horizontal texts, and not on scene texts.

Thus, in this paper, we attempt to overcome the aforementioned limitations of previous detection models by utilizing a proposal-free method based on circular anchors and the SSD framework. Our method is computationally more efficient than both segmentation-based and region-proposal-based models because of the removal of the region-proposal layer in our network. Our method also improves upon the existing SSD-based method by being able to detect both arbitrary-oriented texts and generic objects.

3 Method

In this section, we will describe the details of our proposed model - ArbiText. We will first introduce the framework and network architecture of our method. Subsequently, we will elaborate on the key components such as the circle anchor representation and the proposed loss function.

Figure 3: Matching of Circle Anchors

The red circle is generated from the ground-truth coordinates, and the blue circle anchor is the one on the feature point with the same size as the red circle. The blue circle is associated with the red one by a vector (cx, cy, area, diameter, angle), where cx and cy are the offsets between the centers of the circles.

3.1 Model Framework

Our proposed method, in essence, is a multi-scale, proposal-free framework based on the Single Shot Detector. As shown in Fig.2, our model mainly consists of the following four components: 1) the backbone-based network for converting original images into dense feature representations; 2) the feature maps component with cascading map size for detecting multi-scale texts; 3) the Pyramid Pooling Module [30] for extracting sub-region feature representations; and 4) the final text detection layer for circle anchor prediction.

We adopt VGG-16 [22] as our base network and utilize the six feature maps at the conv4_3, conv7, conv8_2, conv9_2, conv10_2, and global layers. However, local information is lost as the layers go deeper, which results in poor detection precision, especially on texts with complex contextual information.

Inspired by [30], we introduce the Pyramid Pooling Module to leverage low-level visual information even in deeper layers. This module fuses feature maps at different pyramid scales. As shown in Fig. 2, the first feature map from the base network is divided into sub-regions at several pyramid levels, and each level outputs a pooled representation after a convolution layer. Thus, the low-level information of the original image is preserved in multi-scale feature maps, which are then concatenated with the feature maps of the same size to form the final feature map for text detection. By merging these two types of features, both local and global visual information can be explored.
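The pooling step of the Pyramid Pooling Module can be illustrated with a minimal sketch. It covers only the sub-region average pooling on a single-channel map; the per-level convolution and the concatenation with same-size SSD feature maps are omitted. The function names and the pyramid levels (1, 2, 4) are assumptions chosen for illustration, not values taken from the paper.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def adaptive_avg_pool(fmap, out_size):
    """Average-pool a 2D map (list of rows) into an out_size x out_size grid."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for i in range(out_size):
        r0 = (i * rows) // out_size            # first row of this bin
        r1 = ceil_div((i + 1) * rows, out_size)  # one past the last row
        out_row = []
        for j in range(out_size):
            c0 = (j * cols) // out_size
            c1 = ceil_div((j + 1) * cols, out_size)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            out_row.append(sum(cells) / len(cells))
        pooled.append(out_row)
    return pooled


def pyramid_pool(fmap, levels=(1, 2, 4)):
    """Return one pooled sub-region representation per pyramid level.

    The levels are hypothetical; the paper does not list them here.
    """
    return {n: adaptive_avg_pool(fmap, n) for n in levels}


if __name__ == "__main__":
    feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    for level, pooled in pyramid_pool(feature_map, levels=(1, 2)).items():
        print(level, pooled)
```

Coarse levels summarize global context, while fine levels keep more local detail, which is the property the module relies on when its outputs are fused with the detector's feature maps.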

Finally, the text detection layer applies a convolution kernel on the fused feature map to output prediction on the text bounding box.

3.2 Circle Anchors

As illustrated in Fig. 3, instead of the traditional rectangular anchors, we use circle anchors to represent the bounding box. Specifically, a bounding box can be represented by a 5-dimensional vector (cx, cy, a, r, θ), where (cx, cy) is the center and a, r, and θ denote the area, radius, and rotation angle of a circle anchor.


On a feature map of size , location, denoted , associates a circle anchor with , indicating that a unique circle anchor, represented by , is detected with confidence , where


Here, we use the area and radius for computational stability. We also multiply each value by a factor of 1.5.

The angle is the intersection angle between the long edge of the bounding box and the horizontal axis. Thus, the value of ranges from to .
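To make the circle anchor representation concrete, the following sketch converts a rotated rectangle to a circle anchor and back. It assumes, following the caption of Figure 4 ("given the area and diagonal of rectangle"), that the area is the rectangle's area w·h and the radius is half of its diagonal; the function names are hypothetical, and the scaling factor of 1.5 is not applied. Under these assumptions, w + h = sqrt(d² + 2a) and w − h = sqrt(d² − 2a), where d = 2r and w is the long edge.

```python
import math


def rect_to_circle(cx, cy, w, h, theta):
    """Encode a rotated rectangle as (cx, cy, area, radius, theta).

    Assumption: area = w * h and radius = half of the rectangle's diagonal.
    """
    area = w * h
    radius = math.hypot(w, h) / 2.0
    return cx, cy, area, radius, theta


def circle_to_rect(cx, cy, area, radius, theta):
    """Recover (cx, cy, w, h, theta) with w >= h from a circle anchor."""
    d2 = (2.0 * radius) ** 2                      # squared diagonal: w^2 + h^2
    w_plus_h = math.sqrt(d2 + 2.0 * area)         # sqrt((w + h)^2)
    w_minus_h = math.sqrt(max(d2 - 2.0 * area, 0.0))  # sqrt((w - h)^2)
    w = (w_plus_h + w_minus_h) / 2.0
    h = (w_plus_h - w_minus_h) / 2.0
    return cx, cy, w, h, theta


if __name__ == "__main__":
    anchor = rect_to_circle(10.0, 20.0, 4.0, 2.0, 0.3)
    print(anchor)
    print(circle_to_rect(*anchor))
```

Because the area and radius do not depend on the angle, the same pair describes the box at any orientation, which is why the orientation can be regressed as a separate value.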

In a deep neural network, each layer has a receptive field that indicates how much contextual information we can utilize. Although the circle anchor representation is invariant to scale variations, [32] has shown that feature maps have limited receptive fields that are much smaller than theoretical ones, especially on high-level layers. As a result, if we do not utilize multi-scale feature maps, the detection scope of the proposed circle anchor representation will be restricted. And considering the size of the extra feature layers, this operation only adds a small amount of computational cost.

3.3 Training Labels Rebuilding and the Loss Function Formulation


Figure 4: (a) shows the coordinates of a rectangle. Given the area and diagonal of rectangle, and can be calculated by rotation angles. and are diagonal points of and , respectively, which also have negative values. (b) shows the predictable vertical flag, which is if the angles of the bounding box are between and ; otherwise, it is set to .
Figure 5: Score Distribution. (a) shows a bounding box of the text. The red and yellow rectangles are possible bounding boxes that have maximum IOU overlap scores with the ground truth. (b) shows the ellipse-shaped score function we use in ArbiText.
Figure 6: The Selection of Ground Truth. The blue rectangle is the bounding box and the grid represents a feature map with the same size. Only the feature points in the red ellipse are labeled as positives; the points outside of the yellow ellipse are labeled as negatives. The feature points in the yellow region are labeled as their own separate class.

For an SSD-based method, all points on a feature map will potentially be used for minimizing a specific loss function. Each feature point needs to be labeled as either "positive" or "negative". Specifically, in SSD, the points labeled as "positive" are chosen from the regions where the overlap between the default anchor and the ground-truth bounding box is larger than 0.5. There is no default anchor in our method, but we can still calculate a confidence score for each point. As illustrated in Fig. 5 (a), a feature point on the edge of the bounding box can have a maximum overlap of 0.5 (bounding boxes are colored in red and yellow). The score therefore follows an elliptical distribution (as illustrated in Fig. 5 (b)). The score has a maximum value of 1.0 at the center of the ellipse and decreases to 0.5 when the points reach the edge. We use a semi-ellipse as the function to compute the score for each point. As a result, the points outside of the ellipse have a score of 0.

Consider an elliptical score function with rotation angle , whose semi-major and semi-minor axes have lengths of and , respectively, where and are the width and height of the bounding box. Thus, the score function can be represented as:


where , are the distances between a feature point and the center of the bounding box, respectively, and s is the score. According to the score function, all points inside the ellipse have a score greater than 0.5. However, the closer a point is to the edge of the bounding box, the larger the noise from outside of the bounding box will be, which can make the training of the network harder to converge. Thus, only the points with a score larger than a threshold are treated as positives (as shown in Fig. 6, only points inside the red zone are labeled as "positive"). Points outside of the bounding box are labeled as "negative". Points with a score between 0.5 and this threshold are assigned an additional label. Thus, there will be a total of classes (without background). This additional class is only involved in calculating the classification loss.
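The labeling rule above can be sketched as follows. Since the exact score formula is missing from this text, the sketch assumes a semi-ellipse of the form s = sqrt(1 − (u/a)² − (v/b)²) with semi-axes a = w/√3 and b = h/√3, chosen so that the score is 1.0 at the center and 0.5 where the box axes meet the box edge, as the description requires. The function names, the axis lengths, and the default threshold of 0.7 (the value reported later for the ICDAR 2015 experiment) are assumptions for illustration.

```python
import math


def ellipse_score(dx, dy, w, h, theta):
    """Semi-ellipse score of a point offset (dx, dy) from the box center.

    Assumed form: sqrt(1 - (u/a)^2 - (v/b)^2), with a = w/sqrt(3) and
    b = h/sqrt(3); returns 0.0 outside the ellipse.
    """
    # Rotate the offset into the bounding box's own frame.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    a = w / math.sqrt(3.0)
    b = h / math.sqrt(3.0)
    q = 1.0 - (u / a) ** 2 - (v / b) ** 2
    return math.sqrt(q) if q > 0.0 else 0.0


def label_point(dx, dy, w, h, theta, threshold=0.7):
    """Assign 'positive', 'ambiguous' (the extra class), or 'negative'."""
    s = ellipse_score(dx, dy, w, h, theta)
    if s > threshold:
        return "positive"
    if s >= 0.5:
        return "ambiguous"
    return "negative"


if __name__ == "__main__":
    for offset in [(0.0, 0.0), (4.5, 0.0), (20.0, 0.0)]:
        print(offset, label_point(*offset, w=10.0, h=4.0, theta=0.0))
```

The middle band keeps uncertain points out of the positive set without forcing the network to treat them as background.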
The feature maps with different sizes can detect texts with different scales. A default box will be labeled as positive if


where is the height of bounding box and is the height of cell on feature map. For training, we use the following objective loss function:


where is the number of object categories, is the prediction location, and is the ground-truth location. For each class, if the corresponding point is labeled as "positive" and belongs to the -th class. is the loss for vertical bounding box classification that only includes the positive points. We adopt the smooth L1 loss for localization and the Softmax loss for classification.

Figure 7: Results of ArbiText on the ICDAR2015 dataset. Blue rectangles are text regions detected by ArbiText.

4 Experiments

4.1 Datasets

In order to evaluate the performance of the proposed method, we ran experiments on two benchmark datasets: the ICDAR 2015 dataset and the MSRA-TD500(TD500) dataset.

SynthText in the Wild[7] dataset contains more than 800,000 synthetic images created by blending rendered words on wild images. Only samples with width larger than pixels are chosen for training.

ICDAR 2015 [34] incidental text dataset is from Challenge 4 of the ICDAR 2015 Robust Reading Competition and includes 1000 training images and 500 testing images. Since those images were collected with Google Glass, they suffer from motion blur. The blurry texts are labeled "###" and are excluded from our experiment. We also included training and testing images from the ICDAR 2013 dataset [12], which helps us build a more robust text detector.

MSRA-TD500(TD500)[4] is a multilingual dataset that includes oriented texts in both Chinese and English. Unlike ICDAR 2015, texts in MSRA-TD500 are annotated at the text-line level and the images were captured more formally, thus texts are much clearer and standardized. There are a total of 500 images, 300 of them were used as training data and 200 were used as testing data.

4.2 Implementation Details

Base network In our experiment, we use a pre-trained VGG-16 as our base network. This network is widely used in object detection tasks. All images are resized to after data augmentation. We extracted five layers with cascading resolution as our feature maps, which are conv4_3, conv7, conv8, conv9, and conv10. We first trained our model on the SynthText dataset for 50,000 iterations with a learning rate of 0.001. Then, we fine-tuned our model on the other datasets with a learning rate of 0.0005. The details of training on the different datasets are described in later sections. We tested our model on different values and we discovered that our model achieves optimal performance when is set to for text detection.

Locality-Aware NMS In the post-processing stage, bounding boxes with a confidence score greater than 0.5 are used to produce the final output by NMS merging. The naive NMS has O(n^2) computational complexity in the number of candidate boxes, which is not ideal for real-world applications. We adopt Locality-Aware NMS [33] to speed up the merging of bounding boxes.
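The idea behind Locality-Aware NMS can be sketched as below: neighboring boxes are first merged in a single pass by score-weighted averaging, and standard NMS is then run on the much smaller merged set. This is a simplified illustration, not the implementation used in the experiments: it assumes axis-aligned boxes of the form (x1, y1, x2, y2, score) arriving in row-major order, whereas the detector outputs rotated boxes, and the IoU threshold of 0.3 is a hypothetical value.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2, ...)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def weighted_merge(a, b):
    """Average the coordinates weighted by score; the merged score is the sum."""
    total = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / total for i in range(4)]
    return coords + [total]


def standard_nms(boxes, thr):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thr for k in kept):
            kept.append(box)
    return kept


def locality_aware_nms(boxes, thr=0.3):
    """Merge consecutive overlapping boxes in one pass, then run standard NMS."""
    merged, last = [], None
    for box in boxes:  # assumed to arrive in row-major order
        box = list(box)
        if last is not None and iou(box, last) > thr:
            last = weighted_merge(box, last)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, thr)


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10, 0.9), (1, 0, 11, 10, 0.6), (50, 50, 60, 60, 0.8)]
    print(locality_aware_nms(candidates))
```

Because dense predictions from neighboring feature points overlap heavily, the merging pass usually shrinks the candidate set a great deal before the quadratic step runs.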

Hard Negative Mining Hard Negative Mining is essential for SSD-based methods because of the imbalance between positive and negative training samples. We adopt the same configuration as SSD [14], selecting only the top negative training samples in a fixed proportion to the number of positive training samples. As noted above, the Locality-Aware NMS [33] used in our experiments produces bounding boxes with greater precision in less time.
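The hard negative mining step can be sketched as follows. The ratio of negatives to positives is not stated in this text; the default of 3 used here is the ratio from the SSD paper and should be read as an assumption, as should the function name and the use of a per-point classification loss for ranking.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return a boolean mask of the samples that contribute to the loss.

    All positives are kept. Negatives are ranked by their loss, and only the
    hardest neg_pos_ratio * num_pos of them are kept (ratio assumed from SSD).
    """
    num_pos = sum(1 for p in is_positive if p)
    negatives = [i for i, p in enumerate(is_positive) if not p]
    negatives.sort(key=lambda i: losses[i], reverse=True)  # hardest first
    kept_negatives = set(negatives[: neg_pos_ratio * num_pos])
    return [bool(p) or (i in kept_negatives) for i, p in enumerate(is_positive)]


if __name__ == "__main__":
    point_losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
    positives = [True, False, False, False, False, False, False]
    print(hard_negative_mining(point_losses, positives))
```

Keeping only the highest-loss negatives prevents the abundant easy background points from dominating the gradient.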

Data Augmentation

We utilize a data augmentation pipeline similar to the one in SSD to make our model more robust against different text variations. The original image is randomly cropped into patches. The crop size is chosen from [0.1, 1] of the original image size. Each sample patch is horizontally flipped with a fixed probability. In order to balance samples of different orientations, we also augment the datasets by rotating images by an angle randomly chosen from the following set of degrees: (-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90).
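The sampling of augmentation parameters described above can be sketched as below. The sketch only draws the parameters (crop window, flip flag, rotation angle) and leaves the image operations to whatever library is in use. Interpreting the crop size as a side-length scale in [0.1, 1], the flip probability of 0.5, and the function name are assumptions; the flip probability is not given in this text.

```python
import random

# The rotation set from the text: -90 to 90 degrees in steps of 15.
ROTATION_ANGLES = tuple(range(-90, 91, 15))


def sample_augmentation(width, height, rng, flip_prob=0.5):
    """Draw one set of augmentation parameters for an image of the given size.

    flip_prob is a hypothetical default, not a value from the paper.
    """
    scale = rng.uniform(0.1, 1.0)            # crop side as a fraction of the image
    crop_w = max(1, round(width * scale))
    crop_h = max(1, round(height * scale))
    left = rng.randint(0, width - crop_w)    # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return {
        "crop": (left, top, left + crop_w, top + crop_h),
        "flip": rng.random() < flip_prob,
        "angle": rng.choice(ROTATION_ANGLES),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_augmentation(1280, 720, rng))
```

Passing an explicit `random.Random` instance keeps the sampled augmentations reproducible across runs.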

Figure 8: Results of ArbiText on the MSRA-TD500 dataset. Blue rectangles are text regions detected by ArbiText.

Method Precision Recall F-score
HUST_MCLAB 47.5 34.8 40.2
NJU_Text 72.7 35.8 48.0
StradVision-2 77.5 36.7 49.8
MCLAB_FCN[29] 70.8 43.0 53.6
CTPN[23] 51.6 74.2 60.9
Megcii-Image++ 72.4 57.0 63.8
Yao et al.[26] 72.3 58.7 64.8
Seglink [21] 73.1 76.8 75.0
ArbiText 79.2 73.5 75.9
Table 1: Comparison results of various methods on the ICDAR 2015 Incidental Text dataset

4.3 Detection Results

4.3.1 Detecting Oriented English Text

First, our model is tested on the ICDAR 2015 dataset. The pre-trained model is fine-tuned for 20k iterations using both the ICDAR 2013 and the ICDAR 2015 training datasets. Considering that all images in ICDAR 2015 have high resolution, testing images are first resized to . The threshold is set to 0.7, similar to the one used in the pre-training stage. Performance is evaluated using the official off-line evaluation scripts.

Table 1 lists the results of our model along with other state-of-the-art object and text detection methods. The results of the other methods were taken from the original papers. The best previous result on this dataset was obtained by Seglink [21], which achieved an F-measure of 75.0%. Our model obtains 75.9%. The improvement comes from our high precision, which exceeds that of Seglink, the method with the second-highest F-measure, by 6.1 percentage points (79.2% versus 73.1%).

Figure 7 shows several detection results taken from the testing set of ICDAR 2015. Our proposed method ArbiText can distinguish and localize a wide variety of scene text in noisy backgrounds.

4.3.2 Detecting Multi-Lingual Text in Long Lines

We further tested our method on the TD500 dataset, which consists of long text lines in English and non-Latin scripts. We augmented this dataset as follows: 1) we randomly place each image on a canvas, filled with mean values, whose size is between 1 and 3 times the original image size; 2) we apply random crops according to the overlap strategy described in Section 4.3. This yields enough images for training. The pre-trained model is fine-tuned for 20K iterations. All images are resized to , which is consistent with the training stage. The experiment has demonstrated that this technique can dramatically increase detection speed without losing much precision. As illustrated in Table 2, ArbiText achieves F-measure scores comparable to those of other state-of-the-art methods. Moreover, benefiting from a lighter network architecture and a simplified anchor mechanism, ArbiText has the highest FPS, 12.1.

Figure 8 shows that ArbiText can detect long lines of text in mixed languages (English and Chinese) without changing any parameters or structures.

Method Precision Recall F-score FPS
Kang et al.[11] 71 62 66 -
Yao et al.[25] 63 63 60 0.14
Yin et al.[27] 81 63 74 0.71
Yin et al.[28] 71 61 65 1.25
Zhang et al.[29] 83 67 74 0.48
Yao et al.[26] 77 75 76 1.61
Seglink [21] 86 70 77 8.9
ArbiText 78 72 75 12.1
Table 2: Comparison results of various methods on the MSRA-TD500 dataset
Figure 9: Failure cases on MSRA-TD500. The blue rectangles are true positives, the red ones are false negatives, and the yellow ones are false positives. In (a) and (b), ArbiText fails to detect curved texts. In (c), ArbiText fails to detect certain hand-written texts.

4.4 Limitations

As shown in Figure 9 (a) and (b), curved texts cannot be represented by circle anchors. Moreover, Figure 9 (c) shows our model's weakness in detecting hand-written texts.

5 Conclusion

We have presented ArbiText, a novel, proposal-free object detection method that can be utilized to detect both arbitrary-oriented texts and generic objects simultaneously. Its outstanding performance on different benchmarks demonstrates that ArbiText is accurate, robust, and flexible for real-world applications. In the future, we will extend the Circle Anchor methodology to detect deformable objects and/or texts.