Understanding text in the wild plays an important role in many real-world applications, such as PhotoOCR, road-sign detection in intelligent vehicles, license-plate detection, and assistive technology for the visually impaired. To achieve this goal, accurate arbitrary-oriented text detection becomes extremely important. Conventionally, when dealing with horizontal text in controlled environments, this task can be accomplished by character-based methods, since individual letters can be easily segmented and distinguished. However, in unconstrained natural-scene images, text detection becomes rather challenging due to uncontrolled text variations and uncertainties, such as multiple orientations, text distortion, background noise, occlusion, and illumination changes. To address these problems, much recent effort has been devoted to adapting state-of-the-art generic object detectors, such as the Fully Convolutional Network (FCN), the Region-based Convolutional Neural Network (R-CNN), and the Single Shot Detector (SSD), to text detection in the wild.
Despite their promising performance in generic object detection, these methods struggle to bridge the gap between the data distributions of text and generic objects. To enhance the generalization ability of existing deep models, one line of work blends rendered words onto natural images to augment the training data; models trained on such data are robust to noise and uncontrolled variations. Other works integrate region-proposal layers into deep neural networks to generate text-specific proposals (e.g., bounding boxes with larger aspect ratios), and rotated bounding-box proposals have been introduced to make models more adaptive to the unknown orientations of text in natural-scene images. Nevertheless, the aforementioned scene-text detectors either need to consider a large number of proposal hypotheses, dramatically decreasing computational efficiency, or rely on insufficient preset bounding-box characteristics to handle the severe visual variations of scene text in unconstrained natural images.
To address these drawbacks, we propose a novel proposal-free model for arbitrary-oriented text detection in natural images, based on circle anchors and the Single Shot Detector (SSD) framework. More specifically, we adopt circle anchors to represent the bounding boxes; they are more robust to orientation, aspect-ratio, and scale variations than conventional rectangular anchors. The SSD, one of the state-of-the-art object detectors, is employed for its fast detection speed and promising accuracy in generic object detection. Besides the feature maps generated by the original SSD, we additionally incorporate a pyramid pooling module, which builds multiple feature representations at different spatial scales. By merging these feature maps, both local and global information can be preserved, so that text in unconstrained natural scenes can be detected more reliably. The merged feature maps are then fed into a text detection module, consisting of several convolutional layers, to predict circle anchors with confidence scores. Furthermore, to overcome the difficulty of deciding positive points caused by the unfixed sizes of circle anchors, we introduce a novel mask loss function that assigns ambiguous points to a new class. To obtain the final detection results, the Locality-Aware Non-Maximum Suppression (LANMS) scheme is employed. It should be noted that we do not utilize any proposals, which makes the proposed method more computationally efficient.
In summary, the contributions of our work are three-fold:
We propose a novel proposal-free method for detecting arbitrary-oriented text in unconstrained natural-scene images, based on the circle anchor representation and the Single Shot Detector framework. Circle anchors are more robust to aspect-ratio, scale, and orientation variations than conventional rectangular anchors.
We incorporate a pyramid pooling module into SSD, which can explore both the local and global visual information for robust text detection.
We develop a new mask loss function to overcome the difficulty of deciding positive points caused by unfixed sizes of circle anchors, which can therefore improve the final detection accuracy.
2 Related Works
Character-based detection methods have already achieved state-of-the-art results on horizontal text in relatively controlled and stable environments. These methods either detect individual characters by classifying sliding windows or rely on connected-component and region-based frameworks such as the Maximally Stable Extremal Regions (MSER) detector.
However, some of these methods may not be ideal for detecting scene text or multi-oriented text once environmental variations and uncertainties such as text distortion, orientation, occlusion, and noise are introduced. Detecting individual characters in close clusters, or characters that blend into the background, can also be challenging. Hence, many researchers tackle the problem by casting text detection as object detection, treating words and/or text lines as the target objects.
The Region-based Convolutional Neural Network (R-CNN), the Single Shot Detector (SSD), and the segmentation-based Fully Convolutional Network (FCN) are frequently re-purposed for text detection because their superior speed and accuracy suit the time- and resource-constrained nature of the task. Expectedly, a majority of cutting-edge research in scene-text detection, including ours, is based on one of these object detection models, which we analyze below.
Segmentation-based Methods: Several works accomplish semantic segmentation of text lines with the Fully Convolutional Network (FCN), which has achieved great performance in pixel-level classification tasks. In a representative pipeline, a pixel-wise text/non-text saliency map is first produced by the FCN, and geometric and character-level processing is then applied to generate and filter text-line hypotheses. Although these methods can achieve state-of-the-art results even for scene-text detection in the wild, the required post-processing of word partitioning and false-positive removal can be too time-consuming and computationally intensive for real-world applications.
A more recent method, however, makes dramatic improvements in efficiency by eliminating intermediate steps such as candidate aggregation and word partitioning from the network. Nevertheless, the inherent nature of the segmentation approach, dense per-pixel processing and prediction, remains a bottleneck that prevents segmentation-based methods from outperforming their competitors.
Region-Proposal-based Methods: Although region-proposal-based models like R-CNN are state-of-the-art generic object detectors, they cannot be used for text detection without modification, since their anchor-box design is not suited to the large aspect ratios of words and text lines. One approach addresses this problem with a novel Region Proposal Network (RPN) called Inception-RPN, which contains a preset of text-characteristic prior bounding boxes to generate text-specific proposals and thus filter out low-quality word regions.
However, this approach only performs well on horizontal text, since bounding-box characteristics are highly unpredictable for scene text in the wild; multi-oriented and distorted text creates countless variations of bounding-box size, shape, and orientation.
To address this challenge, some researchers designed novel region-proposal methods: a rotation-proposal method can predict the orientation of a text line and thus generate inclined bounding boxes for oriented text, while quadrilateral sliding windows create a much tighter bounding-box fit around text regions, dramatically reducing background noise and interference. Others modify the model architecture instead, for example by adding 2D offsets to the standard convolution to enable free-form deformation of the sampling grid, or by using direct bounding-box regression from a center anchor point in a proposal region.
SSD-based Methods: SSD-based methods are highly stable and efficient at generating word proposals, because SSD is one of the fastest object detectors while being as accurate as slower region-proposal-based models like R-CNN. However, SSD has similar shortcomings in its anchor-box design when it comes to scene-text detection. TextBoxes therefore supplements SSD with "textbox layers" that generate bounding boxes with larger aspect ratios and simultaneously predict text presence and bounding boxes. Unfortunately, this method only works on horizontal text, not arbitrary scene text.
In this paper, we attempt to overcome the aforementioned limitations of previous detection models with a proposal-free method based on circle anchors and the SSD framework. Our method is computationally more efficient than both segmentation-based and region-proposal-based models because of the removal of the region-proposal layer from our network. It also improves upon existing SSD-based methods by being able to detect both arbitrary-oriented text and generic objects.
3 Proposed Method
In this section, we describe the details of our proposed model, ArbiText. We first introduce the framework and network architecture of our method, and then elaborate on its key components, such as the circle anchor representation and the proposed loss function.
3.1 Model Framework
Our proposed method is, in essence, a multi-scale, proposal-free framework based on the Single Shot Detector. As shown in Fig. 2, our model mainly consists of four components: 1) the backbone network for converting original images into dense feature representations; 2) the feature-map component with cascading map sizes for detecting multi-scale text; 3) the Pyramid Pooling Module for extracting sub-region feature representations; and 4) the final text detection layer for circle anchor prediction.
We adopt VGG-16 as our base network and utilize the six feature maps at the conv4_3, conv7, conv8_2, conv9_2, conv10_2, and global layers. However, local information is lost as the layers go deeper, which results in poor detection precision, especially on text with complex contextual information.
We therefore introduce the Pyramid Pooling Module to leverage low-level visual information even in deeper layers. This module fuses feature maps at different pyramid scales. As shown in Fig. 2, the first feature map from the base network is divided into sub-regions at several pyramid levels, and each level outputs a pooled representation after a convolution layer. The low-level information of the original image is thus preserved in multi-scale feature maps, which are then concatenated with feature maps of the same size to form the final feature map for text detection. By merging these two types of features, both local and global visual information can be explored.
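As a rough illustration of the idea, the pooling-and-merge step can be sketched in plain NumPy. The level sizes and the nearest-neighbour upsampling here are assumptions for illustration; the actual module applies learned convolutions after pooling:

```python
import numpy as np

def pyramid_pool(feature_map, levels=(1, 2, 3, 6)):
    """Average-pool the map into n x n sub-regions for each pyramid
    level, upsample each pooled map back to the input size, and
    concatenate everything with the original features."""
    h, w, c = feature_map.shape
    outputs = [feature_map]
    for n in levels:
        pooled = np.zeros((n, n, c))
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                pooled[i, j] = feature_map[ys[i]:ys[i + 1],
                                           xs[j]:xs[j + 1]].mean(axis=(0, 1))
        # nearest-neighbour upsample back to (h, w)
        rows = np.minimum(np.arange(h) * n // h, n - 1)
        cols = np.minimum(np.arange(w) * n // w, n - 1)
        outputs.append(pooled[rows][:, cols])
    return np.concatenate(outputs, axis=-1)
```

With levels (1, 2, 3) and a C-channel input, the output has 4C channels; the level-1 branch simply broadcasts the global average, which is how global context reaches every location.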
Finally, the text detection layer applies a convolution kernel to the fused feature map to predict the text bounding boxes.
3.2 Circle Anchors
As illustrated in Fig. 3, instead of traditional rectangular anchors, we use circle anchors to represent the bounding box. Specifically, a bounding box is represented by a 5-dimensional vector (x, y, a, r, θ), where (x, y) is the center and a, r, and θ denote the area, radius, and rotation angle of the circle anchor.
At each location of a feature map, the network outputs a confidence score c together with the circle-anchor parameters, indicating that a unique circle anchor (x, y, a, r, θ) is detected at that location with confidence c.
Here, we use the area and radius for computational stability. We also multiply each value by a constant factor λ = 1.5.
The angle θ is the intersection angle between the long edge of the bounding box and the horizontal axis; thus, θ ranges from −90° to 90°.
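To make the encoding concrete, here is a minimal sketch of the conversion between a rotated box and a circle anchor, under the assumption that r is the radius of the box's circumscribed circle and a is the box area (the λ = 1.5 scaling is omitted, and the function names are ours):

```python
import math

def box_to_anchor(x, y, w, h, theta):
    """Encode a rotated box as a circle anchor (x, y, a, r, theta):
    a is the box area, r the circumscribed-circle radius (assumed)."""
    return x, y, w * h, math.hypot(w, h) / 2.0, theta

def anchor_to_box(x, y, a, r, theta):
    """Decode by solving w*h = a and w^2 + h^2 = (2r)^2 with w >= h,
    so the long edge is always reported as the width."""
    s = math.sqrt(4.0 * r * r + 2.0 * a)            # w + h
    d = math.sqrt(max(4.0 * r * r - 2.0 * a, 0.0))  # w - h
    return x, y, (s + d) / 2.0, (s - d) / 2.0, theta
```

The pair (a, r) determines (w, h) uniquely up to the long/short-edge ordering, which is why the representation stays valid for any aspect ratio.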
In a deep neural network, each layer has a receptive field that determines how much contextual information can be utilized. Although the circle anchor representation is invariant to scale variations, prior work has shown that feature maps have effective receptive fields much smaller than their theoretical ones, especially in high-level layers. As a result, without multi-scale feature maps, the detection scope of the proposed circle anchor representation would be restricted. Considering the small size of the extra feature layers, using them adds only a small amount of computational cost.
3.3 Training Label Rebuilding and the Loss Function
For an SSD-based method, all points on a feature map can potentially be used to minimize a specific loss function, so each feature point needs to be labeled as either "positive" or "negative". Specifically, in SSD, the points labeled as "positive" are chosen from regions where the overlap between the default anchor and the ground-truth bounding box is larger than 0.5. Our method has no default anchors, but we can still compute a confidence score for each point. As illustrated in Fig. 5 (a), a feature point on the edge of the bounding box can have a maximum overlap of 0.5 (bounding boxes are colored in red and yellow), so the score follows an elliptical distribution (as illustrated in Fig. 5 (b)): it has a maximum value of 1.0 at the center of the ellipse and decreases to 0.5 at the edge. We use a semi-ellipse as the function to compute the score of each point, and points outside the ellipse have a score of 0.
Consider an ellipse score function with rotation angle θ, whose semi-major and semi-minor axes have lengths w/2 and h/2 respectively, where w and h are the width and height of the bounding box. The score function can then be represented as:

s = 0.5 + 0.5 · sqrt(1 − (2·dx / w)² − (2·dy / h)²),
where dx and dy are the distances between a feature point and the center of the bounding box along the two ellipse axes, and s is the score. According to the score function, all points inside the ellipse have a score greater than 0.5. However, the closer a point is to the edge of the bounding box, the larger the noise from outside the bounding box becomes, which can make training harder to converge. Thus, only points with a score larger than a threshold τ are treated as positives (as shown in Fig. 6, only points inside the red zone are labeled as "positive"). Points outside the bounding box are labeled as "negative". Points with a score between 0.5 and τ are assigned an additional label, giving one extra class beyond the original ones (not counting the background). This additional class is only involved in computing the classification loss.
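The labeling rule above can be sketched as follows; the exact form of the semi-ellipse and the default threshold τ = 0.7 follow our reading of the text, and the helper names are ours:

```python
import math

def ellipse_score(dx, dy, w, h, theta):
    """Semi-ellipse score: 1.0 at the box centre, 0.5 on the box
    edge, 0 outside. (dx, dy) is the offset from the box centre."""
    # rotate the offset into the box's own coordinate frame
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    q = (2.0 * u / w) ** 2 + (2.0 * v / h) ** 2
    return 0.0 if q > 1.0 else 0.5 + 0.5 * math.sqrt(1.0 - q)

def label_point(score, tau=0.7):
    """Positive above tau, the extra 'ambiguous' class for scores in
    (0.5, tau), negative otherwise (including outside the box)."""
    if score >= tau:
        return "positive"
    if score > 0.5:
        return "ambiguous"
    return "negative"
```

The ambiguous band absorbs the points whose positive/negative status would otherwise flip with small changes of the anchor size, which is the difficulty the mask loss is designed to handle.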
Feature maps of different sizes detect text at different scales: a default box is labeled as positive on a given feature map only if the height of its bounding box is comparable to the height of a cell on that feature map. For training, we use an objective loss function with two terms.
The first term is the classification loss, a Softmax loss computed over all labeled points (including the additional ambiguous class); the second is the localization loss, a smooth L1 loss between the predicted and ground-truth locations, computed only on the points labeled as "positive".
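A toy version of this two-term objective, with hypothetical names and a balancing weight α that the text does not specify, might look like:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss, summed over coordinates."""
    d = np.abs(np.asarray(pred) - np.asarray(target))
    return float(np.where(d < 1.0, 0.5 * d * d, d - 0.5).sum())

def softmax_ce(logits, label):
    """Softmax cross-entropy for a single point."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def total_loss(cls_logits, cls_labels, loc_pred, loc_gt, pos_mask, alpha=1.0):
    """Classification over all labelled points plus smooth-L1
    localisation on positives only, normalised by the positive count."""
    cls = sum(softmax_ce(cls_logits[i], cls_labels[i])
              for i in range(len(cls_labels)))
    loc = sum(smooth_l1(loc_pred[i], loc_gt[i])
              for i in range(len(cls_labels)) if pos_mask[i])
    return (cls + alpha * loc) / max(sum(pos_mask), 1)
```

Masking the localization term by `pos_mask` is what keeps the ambiguous class out of the regression: those points contribute a classification gradient only.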
4 Experiments

To evaluate the performance of the proposed method, we ran experiments on two benchmark datasets: the ICDAR 2015 dataset and the MSRA-TD500 (TD500) dataset.

4.1 Datasets
The SynthText in the Wild dataset contains more than 800,000 synthetic images created by blending rendered words onto natural images. Only samples whose width exceeds a minimum number of pixels are chosen for training.
The ICDAR 2015 incidental text dataset is from Challenge 4 of the ICDAR 2015 Robust Reading Competition and includes 1000 training images and 500 testing images. Since these images were collected with Google Glass, they suffer from motion blur. Blurry text regions are labeled "###" and are excluded from our experiment. We also included training and testing images from the ICDAR 2013 dataset, which helps us build a more robust text detector.
MSRA-TD500 (TD500) is a multilingual dataset that includes oriented text in both Chinese and English. Unlike ICDAR 2015, the text in MSRA-TD500 is annotated at the text-line level, and the images were captured more deliberately, so the text is clearer and more standardized. There are 500 images in total: 300 are used for training and 200 for testing.
4.2 Implementation Details
Base network: In our experiments, we use a pre-trained VGG-16, widely used in object detection tasks, as our base network. All images are resized to a fixed resolution after data augmentation. We extracted five layers with cascading resolutions as our feature maps: conv4_3, conv7, conv8, conv9, and conv10. We first trained our model on the SynthText dataset for 50,000 iterations with a learning rate of 0.001, and then fine-tuned it on the other datasets with a learning rate of 0.0005. The training details for the different datasets are described in later sections. We tested different values of the score threshold τ and chose the best-performing one for text detection.
Locality-Aware NMS: In the post-processing stage, bounding boxes with a confidence score greater than 0.5 are merged by NMS to produce the final output. Naive NMS has O(n²) computational complexity, which is not ideal for real-world applications, so we adopt Locality-Aware NMS to speed up the merging of bounding boxes.
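For intuition, here is a simplified sketch of locality-aware NMS for axis-aligned boxes. The rotated-box geometry and the exact merge rule are simplified: boxes are assumed to arrive in row order, and scores are accumulated when boxes merge:

```python
def iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    union = ((b1[2] - b1[0]) * (b1[3] - b1[1])
             + (b2[2] - b2[0]) * (b2[3] - b2[1]) - inter)
    return inter / union if union > 0 else 0.0

def locality_aware_nms(boxes, scores, merge_thresh=0.5, nms_thresh=0.3):
    """Greedily merge each incoming box with its predecessor
    (score-weighted average) while they overlap, then run standard
    NMS on the much smaller merged set."""
    merged, msc = [], []
    for b, s in zip(boxes, scores):
        if merged and iou(merged[-1], b) > merge_thresh:
            ps = msc[-1]
            merged[-1] = [(pc * ps + bc * s) / (ps + s)
                          for pc, bc in zip(merged[-1], b)]
            msc[-1] = ps + s
        else:
            merged.append(list(b))
            msc.append(s)
    keep = []
    for i in sorted(range(len(merged)), key=lambda k: -msc[k]):
        if all(iou(merged[i], merged[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return [merged[i] for i in keep]
```

The pass over the row-ordered candidates is linear, so the quadratic NMS step only sees the handful of merged boxes rather than every raw prediction.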
Hard Negative Mining: Hard negative mining is essential for SSD-based methods because of the imbalance between positive and negative training samples. We adopt the same configuration as SSD, selecting the top 3N hardest negative training samples, where N is the number of positive training samples.
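The selection step can be sketched as follows; the per-point loss input and the 3:1 ratio follow the SSD configuration, and the helper name is ours:

```python
import numpy as np

def hard_negative_mining(cls_loss, is_positive, neg_pos_ratio=3):
    """Keep every positive point and only the hardest negatives,
    capped at neg_pos_ratio negatives per positive; returns a
    boolean keep-mask over the points."""
    is_positive = np.asarray(is_positive, dtype=bool)
    cls_loss = np.asarray(cls_loss, dtype=float)
    n_keep = neg_pos_ratio * max(int(is_positive.sum()), 1)
    # exclude positives from the ranking by setting their loss to -inf
    neg_loss = np.where(is_positive, -np.inf, cls_loss)
    hardest = np.argsort(neg_loss)[::-1][:n_keep]
    keep = is_positive.copy()
    keep[hardest] = True
    return keep
```

Ranking negatives by their current classification loss focuses training on the background points the model confuses with text, instead of the vast majority of trivially easy negatives.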
We utilize a data augmentation pipeline similar to that of SSD to make our model more robust to text variations. The original image is randomly cropped into patches, with the crop size chosen from [0.1, 1] of the original image size. Each sampled patch is horizontally flipped with a fixed probability. To balance samples of different orientations, we also augment the datasets by randomly rotating images by θ degrees, where θ is chosen from the following angle set: (−90, −75, −60, −45, −30, −15, 0, 15, 30, 45, 60, 75, 90).
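Sampling one set of augmentation parameters might look like the sketch below; the flip probability of 0.5 and the shared crop scale per axis are assumptions, since the text only fixes the crop range and the angle grid:

```python
import random

ANGLES = (-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90)

def augment_params(img_w, img_h, rng=random):
    """Draw a random crop covering 10%-100% of the image, a coin-flip
    horizontal mirror, and a rotation angle from the 15-degree grid."""
    scale = rng.uniform(0.1, 1.0)
    cw = max(1, int(img_w * scale))
    ch = max(1, int(img_h * scale))
    x0 = rng.randint(0, img_w - cw)   # crop stays inside the image
    y0 = rng.randint(0, img_h - ch)
    return {
        "crop": (x0, y0, cw, ch),
        "flip": rng.random() < 0.5,
        "angle": rng.choice(ANGLES),
    }
```

Drawing the angle from a fixed grid, rather than uniformly, keeps the orientation distribution balanced across the exact bins the detector is evaluated on.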
Table 1 (excerpt): Results on the ICDAR 2015 dataset (%).

| Method | Precision | Recall | F-measure |
| --- | --- | --- | --- |
| Yao et al. | 72.3 | 58.7 | 64.8 |
4.3 Detection Results
4.3.1 Detecting Oriented English Text
First, our model is tested on the ICDAR 2015 dataset. The pre-trained model is fine-tuned on both the ICDAR 2013 and ICDAR 2015 training sets for 20k iterations. Since all images in ICDAR 2015 have high resolution, the testing images are first resized. The threshold τ is set to 0.7, similar to the value used in the pre-training stage. Performance is evaluated using the official offline evaluation scripts.
We list the results of our model along with other state-of-the-art object and text detection methods; the numbers are taken from the original papers. The previous best result on this dataset was obtained by SegLink, with an F-measure of 75.0%, whereas our model obtains 75.9%. The improvement comes from our high precision, which outperforms the second-highest model by 6.1%.
Figure 7 shows several detection results from the ICDAR 2015 testing set. Our proposed method, ArbiText, can distinguish and localize all kinds of scene text against noisy backgrounds.
4.3.2 Detecting Multi-Lingual Text in Long Lines
We further tested our method on the TD500 dataset, which consists of long lines of text in English and non-Latin scripts. We augmented this dataset as follows: 1) randomly place an image on a canvas of k times the original image size filled with mean pixel values, where k ranges from 1 to 3; 2) apply random cropping according to the overlap strategy described above. In this way, we obtained enough images for training. The pre-trained model is fine-tuned for 20K iterations. All images are resized to the same resolution as in the training stage. The experiments demonstrate that this technique can dramatically increase detection speed without losing much precision. As shown in Table 2, ArbiText achieves F-measure scores comparable to other state-of-the-art methods; moreover, benefiting from its lighter network architecture and simplified anchor mechanism, ArbiText attains the highest FPS, 12.1.
Figure 8 shows that ArbiText can detect long lines of text in mixed languages (English and Chinese) without changing any parameters or structures.
Table 2: Results on the MSRA-TD500 dataset.

| Method | Precision | Recall | F-measure | FPS |
| --- | --- | --- | --- | --- |
| Kang et al. | 71 | 62 | 66 | - |
| Yao et al. | 63 | 63 | 60 | 0.14 |
| Yin et al. | 81 | 63 | 74 | 0.71 |
| Yin et al. | 71 | 61 | 65 | 1.25 |
| Zhang et al. | 83 | 67 | 74 | 0.48 |
| Yao et al. | 77 | 75 | 76 | 1.61 |
As shown in Figure 9 (a, b), curved text cannot be represented by circle anchors. Moreover, Figure 9 (c) shows our model's weakness in detecting handwritten text.
5 Conclusion

We have presented ArbiText, a novel proposal-free detection method that can detect both arbitrary-oriented text and generic objects simultaneously. Its strong performance on different benchmarks demonstrates that ArbiText is accurate, robust, and flexible for real-world applications. In the future, we will extend the circle anchor methodology to detect deformable objects and/or text.
References

-  M. Bastan, H. Kandemir, and B. Canturk. MT3s: Mobile Turkish Scene Text-to-Speech System for the Visually Impaired. arXiv:1608.05054 [cs], Aug. 2016.
-  A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading Text in Uncontrolled Conditions. In 2013 IEEE International Conference on Computer Vision, pages 785–792, Dec. 2013.
-  L. Chen, Q. Li, M. Li, and Q. Mao. Traffic sign detection and recognition for intelligent vehicle. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 908–913, June 2011.
-  Y. Cong. MSRA Text Detection 500 Database (MSRA-TD500) - TC11, 2012.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable Convolutional Networks. arXiv:1703.06211 [cs], Mar. 2017. arXiv: 1703.06211.
-  N. Ezaki, M. Bulacu, and L. Schomaker. Text detection from natural scene images: towards a system for visually impaired persons. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 683–686, Aug. 2004.
-  A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic Data for Text Localisation in Natural Images. arXiv:1604.06646 [cs], Apr. 2016. arXiv: 1604.06646.
-  W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu. Deep Direct Regression for Multi-Oriented Scene Text Detection. arXiv:1703.08289 [cs], Mar. 2017. arXiv: 1703.08289.
-  W. Huang, Y. Qiao, and X. Tang. Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees. Sept. 2014.
-  M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep Features for Text Spotting. In ECCV, 2014.
-  L. Kang, Y. Li, and D. Doermann. Orientation Robust Text Line Detection in Natural Images. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 4034–4041, June 2014.
-  D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazàn, and L. P. d. l. Heras. ICDAR 2013 Robust Reading Competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484–1493, Aug. 2013.
-  M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. arXiv:1611.06779 [cs], Nov. 2016. arXiv: 1611.06779.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot MultiBox Detector. arXiv:1512.02325 [cs], 9905:21–37, 2016. arXiv: 1512.02325.
-  Y. Liu and L. Jin. Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection. arXiv:1703.01425 [cs], Mar. 2017. arXiv: 1703.01425.
-  J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. arXiv:1703.01086 [cs], Mar. 2017. arXiv: 1703.01086.
-  S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz. License Plate Detection and Recognition Using Deeply Learned Convolutional Neural Networks. arXiv:1703.07330 [cs], Mar. 2017. arXiv: 1703.07330.
L. Neumann and J. Matas.
Real-Time Lexicon-Free Scene Text Localization and Recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1872–1885, Sept. 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [cs], June 2015. arXiv: 1506.01497.
-  E. Shelhamer, J. Long, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. arXiv:1605.06211 [cs], May 2016. arXiv: 1605.06211.
-  B. Shi, X. Bai, and S. Belongie. Detecting Oriented Text in Natural Images by Linking Segments. arXiv:1703.06520 [cs], Mar. 2017. arXiv: 1703.06520.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs], Sept. 2014. arXiv: 1409.1556.
-  Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting Text in Natural Image with Connectionist Text Proposal Network. arXiv:1609.03605 [cs], Sept. 2016. arXiv: 1609.03605.
-  T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 3304–3308, Nov. 2012.
-  C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090, June 2012.
-  C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene Text Detection via Holistic, Multi-Channel Prediction. arXiv:1606.09002 [cs], June 2016. arXiv: 1606.09002.
-  X. C. Yin, W. Y. Pei, J. Zhang, and H. W. Hao. Multi-Orientation Scene Text Detection with Adaptive Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1930–1937, Sept. 2015.
-  X. C. Yin, X. Yin, K. Huang, and H. W. Hao. Robust Text Detection in Natural Scene Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):970–983, May 2014.
-  Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-Oriented Text Detection with Fully Convolutional Networks. arXiv:1604.04018 [cs], Apr. 2016. arXiv: 1604.04018.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. arXiv:1612.01105 [cs], Dec. 2016. arXiv: 1612.01105.
-  Z. Zhong, L. Jin, S. Zhang, and Z. Feng. DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images. arXiv:1605.07314 [cs], May 2016. arXiv: 1605.07314.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object Detectors Emerge in Deep Scene CNNs. arXiv:1412.6856 [cs], Dec. 2014. arXiv: 1412.6856.
-  X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. EAST: An Efficient and Accurate Scene Text Detector. arXiv:1704.03155 [cs], Apr. 2017. arXiv: 1704.03155.
-  X. Zhou, S. Zhou, C. Yao, Z. Cao, and Q. Yin. ICDAR 2015 Text Reading in the Wild Competition. arXiv:1506.03184 [cs], June 2015. arXiv: 1506.03184.