Recently, scene text detection has drawn great attention from the computer vision and machine learning communities. Driven by many content-based image applications such as photo translation and receipt content recognition, it has become a promising and challenging research area in both academia and industry. Detecting text in natural images is difficult because both text and background may be complex in the wild, and images often suffer from disturbances such as occlusion and uncontrollable lighting conditions.
Previous text detection methods[2, 3, 4, 5, 6] have achieved promising results on several benchmarks. The essential problem in text detection is to represent text regions using discriminative features. Conventionally, hand-crafted features are designed[3, 7, 8] to capture the properties of text regions such as texture and shape, while in the past few years, deep learning based approaches[9, 10, 6, 11, 12, 13] directly learn hierarchical features from training data, demonstrating more accurate and efficient performance on various benchmarks such as the ICDAR series contests[14, 15, 16].
Existing methods[10, 9, 6, 13] have obtained decent performance for detecting horizontal or near-horizontal text. While horizontal text detection is constrained to axis-aligned bounding-box ground truth, multi-oriented text is not restricted to a particular orientation and usually uses quadrilaterals for annotations. Therefore, reported accuracies on ICDAR 2015 Competition Challenge 4 “Incidental scene text localization” are relatively lower than those on horizontal scene text detection benchmarks[14, 15].
In general, there are currently four types of methods for multi-oriented text detection. Region-based methods[19, 22, 21] leverage advanced object detection techniques such as Faster RCNN and SSD. Segmentation-based methods[25, 26] mainly utilize fully convolutional networks (FCN) to generate text score maps, and often need several stages and components to reach final detections. Direct regression based methods regress the position and size of an object from a given point. Finally, hybrid methods combine text score maps with rotated or quadrangle bounding box generation to obtain efficient and accurate multi-oriented text detection.
Inspired by recent advances in instance-aware semantic segmentation[27, 28], we present a novel perspective on the task of multi-oriented text detection. In this work, we combine the merits of accurate region proposal based methods and flexible segmentation based methods, which can easily generate arbitrarily shaped text masks[25, 26]. The result is an end-to-end trainable framework that excludes redundant and inefficient pipelines such as text/non-text saliency maps and text-line generation. Based on a region proposal network (RPN), our approach detects and segments text instances simultaneously, followed by non-maximum suppression (NMS) to suppress overlapping instances. Finally, a minimum quadrangle bounding box fitting each instance area is generated as the result of the whole detection process.
Our main contributions are summarized as follows:
We present an end-to-end, efficient and trainable solution for multi-oriented text detection from an instance-aware segmentation perspective, excluding any redundant pipelines.
During feature extraction, feature maps are composed in a fused fashion to adaptively satisfy the finer representation of text instance.
Mask-NMS is introduced to improve the standard NMS when facing heavily inclined or line-level text instances.
Without many bells and whistles, our approach outperforms state of the art on current multi-oriented text detection benchmarks.
II Related Work
Detecting text in natural images has been widely studied in the past few years, motivated by many text-related real-world applications such as photo OCR and blind navigation. One mainstream family of traditional methods for scene text detection is Connected Components (CCs) based methods[29, 30, 10, 31, 32], which consider text as a group of individual components such as characters. Within these methods, stroke width transform (SWT)[3, 31] and maximally stable extremal regions (MSER)[33, 32, 7] are usually used to seek character candidates. Finally, these candidates are combined to obtain text objects. Although these bottom-up approaches may be accurate on some benchmarks[14, 15], they often involve many pipeline stages, which may cause inefficiency. Another mainstream family of traditional methods is sliding window based[2, 34, 10]. These methods often slide a fixed-size or multi-scale window across the image, searching for the regions most likely to contain text. However, sliding a window may incur a large computational cost. Generally, traditional methods require several steps to obtain final detections, and hand-designed features are usually used to represent properties of text. Therefore, they may suffer from inefficiency and poor generalization to complex situations such as non-uniform illumination.
Recent progress on deep learning based approaches for object detection and semantic segmentation has provided new techniques for reading text in the wild, which can also be seen as an instance of general object detection. Driven by the advance of object detection frameworks such as Faster RCNN and SSD, these methods achieved state-of-the-art results by either using a region proposal network to first classify text region proposals[22, 17], or directly regressing text bounding box coordinates from a set of default boxes[13, 19]. These methods are able to achieve leading performance on horizontal or multi-oriented scene text detection benchmarks. However, they may also be restricted to rectangular bounding boxes, even with appropriate rotation. Different from these methods, FCN based approaches generate a text/non-text map which classifies text at the pixel level. Though this may suit arbitrarily shaped text in natural images, it often involves several pipeline stages, which leads to inefficiency[25, 17].
Inspired by recent advances in instance-aware semantic segmentation[27, 28], we present an end-to-end trainable framework called Fused Text Segmentation Networks (FTSN) to handle arbitrary-shape text detection with no extra pipelines involved. It inherits merits from both object detection and semantic segmentation architectures: it efficiently detects and segments a text instance simultaneously and gives accurate predictions at the pixel level. As text may rely on finer feature representation, a fused structure formed by multi-level feature maps is used to fit this property.
The proposed framework for multi-oriented scene text detection is diagrammed in Fig.2. It is a deep CNN model which mainly consists of three parts. Feature representations of each image are extracted through a ResNet-101 backbone, then multi-level feature maps are fused into FusedMapA, which is fed to the region proposal network (RPN) for text region of interest (ROI) generation, and FusedMapB, which is used for the later PSROIPooling of the ROIs. Finally, the ROIs are sent to the detection, segmentation and box regression branches to output text instances at the pixel level along with their corresponding bounding boxes. The post-processing part includes NMS and minimal quadrilateral generation.
III-A Network Architecture
The convolutional feature representation is designed in a fusion fashion. A text instance is unlike general objects such as people and cars, which have relatively strong semantics. On the contrary, texts often vary tremendously in intra-class geometries. Consequently, low-level features should be taken into consideration. Basically, ResNet-101 consists of five stages. Before region proposal, stage3 and upsampled stage4 feature maps are combined to form FusedMapA through element-wise addition, then upsampled feature maps from stage5 are fused with FusedMapA to form FusedMapB. Note that downsampling is not involved during stage5. Instead, we use the “hole algorithm”[37, 38] to keep the feature stride and maintain the receptive field. The reason is that both text properties and the segmentation task may require finer features, and a final downsampling may lose useful information.
Because using the feature stride of stage3 may produce millions of anchors in the original RPN, which makes model training hard, we add a stride-2 convolution to reduce the number of anchors.
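The fusion described above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions: the function names are hypothetical, all three inputs are assumed to already share one channel count, stage4 and stage5 are assumed to be at half the resolution of stage3, and nearest-neighbour upsampling stands in for whatever upsampling the network actually uses.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_feature_maps(stage3, stage4, stage5):
    """Element-wise fusion following the description in the text.

    Assumptions (not taken from the paper): equal channel counts,
    stage4/stage5 at half the spatial size of stage3, and
    nearest-neighbour upsampling.
    """
    fused_a = stage3 + upsample2x(stage4)   # FusedMapA, fed to the RPN
    fused_b = fused_a + upsample2x(stage5)  # FusedMapB, used for PSROIPooling
    return fused_a, fused_b

# Toy inputs with constant values so the sums are easy to follow.
stage3 = np.ones((4, 8, 8), dtype=np.float32)
stage4 = np.full((4, 4, 4), 2.0, dtype=np.float32)
stage5 = np.full((4, 4, 4), 3.0, dtype=np.float32)
fused_a, fused_b = fuse_feature_maps(stage3, stage4, stage5)
```

Both fused maps keep the spatial resolution of stage3, which is the point of the design: the finer map carries the low-level detail that text instances may need.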
Following FCIS, we use joint mask prediction and classification to simultaneously classify and mask the text instance on inside/outside score maps generated through PSROIPooling on conv-cls-seg feature maps, and the box regression branch utilizes feature maps from conv-box after PSROIPooling (”” means one class is for text and the other for background). We use shown in Fig.2 in our experiments by default. It is noted that after PSROIPooling, the resolution of feature maps becomes . Therefore, we use global average pooling for the classification (after pixel-wise max) and box regression branches, and pixel-wise softmax on the mask branch.
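A minimal sketch of the two read-outs described above, assuming the pooled ROI scores are arranged as a `(num_classes, 2, h, w)` array whose second axis holds the inside and outside maps. The array layout, the function names, and the toy 6x6 resolution are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_mask_and_class(score_maps):
    """score_maps: (num_classes, 2, h, w), axis 1 = (inside, outside).

    The layout is an illustrative assumption, not the paper's tensor format.
    """
    # Classification: pixel-wise max over inside/outside, then global
    # average pooling, then softmax across classes.
    pixel_max = score_maps.max(axis=1)             # (num_classes, h, w)
    class_logits = pixel_max.mean(axis=(1, 2))     # (num_classes,)
    class_prob = softmax(class_logits, axis=0)
    # Mask: pixel-wise softmax over inside/outside, keep the inside channel.
    mask_prob = softmax(score_maps, axis=1)[:, 0]  # (num_classes, h, w)
    return class_prob, mask_prob

rng = np.random.default_rng(0)
roi_scores = rng.normal(size=(2, 2, 6, 6))  # (background/text, inside/outside, h, w)
class_prob, mask_prob = joint_mask_and_class(roi_scores)
```

Sharing one set of score maps between the two read-outs is what makes the prediction joint: the same inside/outside evidence decides both whether the ROI is text and which pixels belong to it.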
III-B Ground Truth and Loss Function
The whole multi-task loss can be interpreted as
The full loss consists of two sub stage losses: RPN loss where is for region proposal classification and is for box regression, and text instance loss based on each ROI, where represent losses for instance classification, mask and box regression task respectively. is the hyper-parameter to control the balance among each loss term. They are set as in our experiments.
Classification and mask task both use cross-entropy as loss function, whereas we use smooth-L1 for box regression task formulated as
is set to 3 in our experiments which makes the box regression loss less sensitive to outliers.
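As an illustration of the loss structure described above, the sketch below implements the smooth-L1 variant with a sharpness parameter that is common in Faster R-CNN style code, and a weighted sum over the five named loss terms. It assumes that the parameter set to 3 is this sharpness parameter (called `sigma` here); the function names, the term names and the default weights of 1.0 are placeholders rather than the paper's values.

```python
import numpy as np

def smooth_l1(x, sigma=3.0):
    """Smooth-L1 with sharpness sigma (assumed parameterisation).

    Quadratic for |x| < 1/sigma^2, linear elsewhere, continuous at the switch.
    """
    x = np.asarray(x, dtype=np.float64)
    s2 = sigma ** 2
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0 / s2, 0.5 * s2 * x ** 2, abs_x - 0.5 / s2)

def multitask_loss(rpn_cls, rpn_box, ins_cls, ins_mask, ins_box, weights=None):
    """Weighted sum of the five loss terms named in the text.

    Default weights of 1.0 are placeholders, not the paper's settings.
    """
    terms = {
        "rpn_cls": rpn_cls, "rpn_box": rpn_box,
        "ins_cls": ins_cls, "ins_mask": ins_mask, "ins_box": ins_box,
    }
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())
```

With this parameterisation the loss grows only linearly for large residuals, which is why a badly regressed box contributes less than it would under a squared error.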
Ground truth of each text instance is represented by a bounding box and a mask, as shown in Fig.3. In most multi-oriented text detection datasets, annotations are given as quadrilaterals, as in IC15, or can be converted to quadrilaterals, as in TD500. For each instance, we directly generate the mask from the quadrilateral coordinates and use the minimal rectangle containing the mask as the bounding box.
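The ground-truth construction can be sketched as follows. This is a NumPy-only illustration that rasterises the quadrilateral with an even-odd test at pixel centres and takes an axis-aligned box around the mask; the function names, the pixel-centre convention and the axis-aligned interpretation of "minimal rectangle" are assumptions.

```python
import numpy as np

def quad_to_mask(quad, height, width):
    """Rasterise a quadrilateral (4 (x, y) vertices) into a boolean mask.

    A pixel is inside when its centre passes the even-odd ray test.
    """
    quad = np.asarray(quad, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5
    inside = np.zeros((height, width), dtype=bool)
    for i in range(len(quad)):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % len(quad)]
        if y1 == y2:
            continue  # a horizontal edge never crosses a horizontal ray
        crosses = (y1 > py) != (y2 > py)
        x_at_y = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < x_at_y)
    return inside

def mask_to_bbox(mask):
    """Axis-aligned (x_min, y_min, x_max, y_max) of the mask's pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = quad_to_mask([(2, 1), (6, 1), (6, 4), (2, 4)], height=8, width=8)
bbox = mask_to_bbox(mask)
```

Deriving the box from the mask rather than from the annotation keeps the two targets consistent for any quadrilateral, including heavily inclined ones.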
III-C Post Processing
Mask-NMS To obtain final detection results, we use Non-Maximum Suppression mechanism (NMS) to filter overlapped text instances and preserve those with highest scores. After NMS, we generate a minimum quadrilateral for each text instance covering the mask as shown in Fig.1.
Standard NMS computes IoU among bounding boxes, which may be fine for filtering word-level and near-horizontal results. However, it may filter out some correct line-level detections when they are close and heavily inclined, as shown in Fig.4, or when words stay close in the same line, as shown in Fig.5. Consequently, we propose a modified NMS called Mask-NMS to handle such situations. Mask-NMS mainly changes the bounding box IoU computation to the so-called mask-maximum-intersection (MMI), formulated as:
$$\mathrm{MMI} = \max\left(\frac{I}{I_A}, \frac{I}{I_B}\right)$$
where $I_A$ and $I_B$ are the mask areas of the two text instances being compared and $I$ is the intersection area between the masks. The maximum intersection over the mask areas replaces the original IoU because detections may easily include line-level and word-level text instances simultaneously on the same line, as shown in Fig.5. The proposed Mask-NMS improves performance for multi-oriented scene text detection, as shown in section.5.
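A minimal sketch of Mask-NMS as described above, taking MMI to be the larger of the two ratios between the intersection area and each mask's own area. The function names and the suppression threshold of 0.5 are assumptions; the text does not state the threshold used.

```python
import numpy as np

def mmi(mask_a, mask_b):
    """Mask-maximum-intersection between two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    if inter == 0:
        return 0.0
    return float(max(inter / mask_a.sum(), inter / mask_b.sum()))

def mask_nms(masks, scores, threshold=0.5):
    """Greedy NMS using MMI instead of box IoU (threshold is hypothetical)."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for idx in order:
        if all(mmi(masks[idx], masks[k]) <= threshold for k in keep):
            keep.append(int(idx))
    return keep

# A text line, a word lying inside that line, and an unrelated instance.
line = np.zeros((6, 12), dtype=bool); line[0:2, :] = True
word = np.zeros((6, 12), dtype=bool); word[0:2, 0:4] = True
other = np.zeros((6, 12), dtype=bool); other[4:6, 0:5] = True
kept = mask_nms([line, word, other], scores=[0.9, 0.8, 0.7])
```

In this toy case the word has a low IoU with the line (4 of 12 columns) but an MMI of 1.0, so it is suppressed in favour of the higher-scoring line, which is the word-inside-line situation the text refers to.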
To evaluate the proposed framework, we conduct quantitative experiments on three public benchmarks: ICDAR2015, MSRA-TD500 and Total-Text.
ICDAR 2015 Incidental Text (IC15) is Challenge 4 of the ICDAR 2015 Robust Reading Competition. IC15 contains 1000 training and 500 testing incidental images taken with Google Glass without paying attention to viewpoint and image quality. Therefore, large variations in text scale, orientation and resolution make text detection difficult. Annotations of the dataset are given as word-level quadrilaterals.
MSRA-TD500 (TD500) was presented earlier in . The dataset is multi-oriented and multi-lingual, including both Chinese and English text, and consists of 300 training and 200 testing images. Different from IC15, annotations of TD500 are line-level rotated rectangles.
Total-Text was presented at ICDAR 2017. It consists of 1555 images with more than 3 different text orientations: horizontal, multi-oriented, and curved.
SynthText in the Wild (SynthText) contains 800,000 synthetic images in which text with random color, font, scale and orientation is carefully rendered on natural images to have a realistic look. Annotations are given at character, word and line level.
IV-B Implementation Details
Training We pretrain the proposed FTSN on a subset of SynthText containing 160,000 images, then finetune on IC15, TD500 and Total-Text. For optimization, standard SGD is used during training with learning rate for first 5 epochs and for the last epoch, and we also apply online hard example mining (OHEM) for balancing the positive and negative samples. Different from original RPN anchor ratios and scales setting for object detection, anchor scales of  and ratios of [1/3,1/2,1,2,3,5,7] are set because text often has a large aspect ratio and a small scale.
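To make the anchor setting concrete, the sketch below enumerates anchor shapes for the seven ratios listed above. The scale values are hypothetical, since the actual scales are not given here, and the convention that each anchor has area `scale**2` with width/height equal to the ratio is also an assumption.

```python
import math

RATIOS = [1/3, 1/2, 1, 2, 3, 5, 7]  # taken from the text
SCALES = [32, 64, 128]              # hypothetical values for illustration only

def anchor_shapes(scales, ratios):
    """Return (width, height) pairs, one per (scale, ratio) combination.

    Assumed convention: area = scale**2 and width / height = ratio.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            shapes.append((w, h))
    return shapes

shapes = anchor_shapes(SCALES, RATIOS)
```

Under this convention the ratios 5 and 7 yield long, thin anchors, which matches the observation that text often has a large aspect ratio.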
Data augmentation Multi-scale training, rotation and color jittering are applied during training. Scales are randomly chosen from [600,720,960,1100], where each number represents the short edge of the input image. Rotation with , and are applied with horizontal flip. Consequently, the dataset is enlarged to 8x its original size. Random brightness, contrast and saturation jittering are applied to input images.
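Multi-scale training with a short-edge target can be sketched as below. The helper names are hypothetical, and preserving the aspect ratio while rounding to whole pixels is an assumption about how the resizing is done.

```python
import random

TRAIN_SCALES = [600, 720, 960, 1100]  # short-edge targets from the text

def resize_short_edge(height, width, short_edge):
    """New (height, width) with the shorter side equal to short_edge.

    Aspect ratio is preserved (assumed), sizes are rounded to whole pixels.
    """
    scale = short_edge / min(height, width)
    return round(height * scale), round(width * scale)

def sample_training_size(height, width, rng=random):
    """Pick one of the training scales at random for this image."""
    return resize_short_edge(height, width, rng.choice(TRAIN_SCALES))

new_h, new_w = resize_short_edge(720, 1280, 600)
```

For a 720x1280 input and a target of 600, this yields a 600x1067 image.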
Testing Input images are resized to when testing. After NMS, mask voting is used to obtain an ensemble text instance mask by averaging all reasonable detections.
Experiments are conducted on MXNet and run on a server with Intel i7 6700K CPU, 64GB RAM, GTX 1080 and Ubuntu 14.04 OS.
Table 1 (IC15)

| Method | Precision (%) | Recall (%) | Hmean (%) |
|---|---|---|---|
| Zhang et al. | 71.0 | 43.0 | 54.0 |
| Qin et al. | 79.0 | 65.0 | 71.0 |
| He et al. | 82.0 | 80.0 | 81.0 |

Table 2 (TD500)

| Method | Precision (%) | Recall (%) | Hmean (%) |
|---|---|---|---|
| Yao et al. | 63.0 | 63.0 | 60.0 |
| Zhang et al. | 83.0 | 67.0 | 74.0 |
| He et al. | 77.0 | 70.0 | 74.0 |

Table 3 (Total-Text)

| Method | Precision (%) | Recall (%) | Hmean (%) |
|---|---|---|---|
Table 1 shows results of the proposed FTSN on IC15 compared with previously published state-of-the-art methods. SNMS and MNMS represent standard NMS and Mask-NMS respectively. Our FTSN with Mask-NMS outperforms the former best result by 5.3% in Precision and 3.1% in Hmean. It is evaluated by the official submission server (http://rrc.cvc.uab.es/?ch=4).
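For reference, Hmean is the harmonic mean of precision and recall. The small helper below (the function name is hypothetical) shows the computation on the precision and recall of one row reported above.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall, both on the same scale."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Precision 82.0 and recall 80.0, as reported for He et al. on IC15.
value = hmean(82.0, 80.0)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot reach a high Hmean by trading away recall for precision or the reverse.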
Results on TD500 are shown in Table 2 along with other state-of-the-art methods. Our method outperforms the current state-of-the-art approaches by a large margin in Hmean and Recall, without adding extra real-world training images.
Our method also shows great flexibility on the Total-Text dataset, which contains curved text. As the dataset is new to the community, few experiments have been conducted on it, so our results, shown in Table 3, serve as a baseline. The evaluation metric uses an IoU of 0.5 between instance masks.
Outperforming the current state of the art, our approach runs at about 4 FPS on images and 2.5 FPS when using Mask-NMS, which demonstrates both efficiency and accuracy.
Note that the proposed Mask-NMS improves Hmean by 0.7 and 0.3 percentage points on IC15 and TD500 respectively, mainly by handling the situations in Fig.4 and Fig.5.
Fig.6 shows example results of FTSN. From left to right, it illustrates results on the IC15, TD500 and Total-Text datasets. The decent performance for word-level, line-level and curved text detection with large variations in resolution, viewpoint, scale and language suggests strong generalization ability.
We present FTSN, an end-to-end, efficient and accurate multi-oriented scene text detection framework. It outperforms previous state-of-the-art approaches on word-level and line-level annotated benchmarks and reports a baseline on Total-Text, demonstrating decent generalization ability and flexibility.
The research is supported by The National Key Research and Development Program of China under grant 2017YFB1002401.
-  Y. Zhu, C. Yao, and X. Bai, “Scene text detection and recognition: recent advances and future trends,” Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
-  X. Chen and A. L. Yuille, “Detecting and reading text in natural scenes,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 366–373.
-  B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2963–2970.
-  M. Buta, L. Neumann, and J. Matas, “Fastext: Efficient unconstrained scene text detector,” in IEEE International Conference on Computer Vision, 2015, pp. 1206–1214.
-  S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan, “Text flow: A unified text detection system in natural scene images,” in IEEE International Conference on Computer Vision, 2016, pp. 4651–4659.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
-  L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3538–3545.
-  A. Zamberletti, L. Noce, and I. Gallo, Text Localization Based on Fast Feature Pyramids and Multi-Resolution Maximally Stable Extremal Regions. Springer International Publishing, 2014.
-  W. Huang, Y. Qiao, and X. Tang, Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees. Springer International Publishing, 2014.
-  M. Jaderberg, A. Vedaldi, and A. Zisserman, “Deep features for text spotting,” in European Conference on Computer Vision, 2014, pp. 512–528.
-  A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” pp. 2315–2324, 2016.
-  Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, “Multi-oriented text detection with fully convolutional networks,” in Computer Vision and Pattern Recognition, 2016.
-  M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, “Textboxes: A fast text detector with a single deep neural network,” in Association for the Advancement of Artificial Intelligence, 2017.
-  A. Shahab, F. Shafait, and A. Dengel, “Icdar 2011 robust reading competition challenge 2: Reading text in scene images,” in International Conference on Document Analysis and Recognition, 2011, pp. 1491–1496.
-  D. Karatzas, F. Shafait et al., “Icdar 2013 robust reading competition,” in International Conference on Document Analysis and Recognition, 2013, pp. 1484–1493.
-  D. Karatzas, L. Gomez-Bigorda et al., “Icdar 2015 competition on robust reading,” in International Conference on Document Analysis and Recognition, 2015, pp. 1156–1160.
-  S. Qin and R. Manduchi, “Cascaded segmentation-detection networks for word-level text spotting,” 2017.
-  W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Deep direct regression for multi-oriented scene text detection,” arXiv preprint arXiv:1703.08289, 2017.
-  B. Shi, X. Bai, and S. Belongie, “Detecting oriented text in natural images by linking segments,” in Computer Vision and Pattern Recognition, 2017.
-  X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “East: An efficient and accurate scene text detector,” in Computer Vision and Pattern Recognition, 2017.
-  Y. Liu and L. Jin, “Deep matching prior network: Toward tighter multi-oriented text detection,” arXiv preprint arXiv:1703.01425, 2017.
-  J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” arXiv preprint arXiv:1703.01086, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks.” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 39, no. 6, p. 1137, 2017.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, “Multi-oriented text detection with fully convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4159–4167.
-  T. He, W. Huang, Y. Qiao, and J. Yao, “Accurate text localization in natural image with cascaded convolutional text network,” arXiv preprint arXiv:1603.09423, 2016.
-  Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” arXiv preprint arXiv:1703.06870, 2017.
-  L. Neumann and J. Matas, “Real-time lexicon-free scene text localization and recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1872–1885, 2016.
-  K. Wang and S. Belongie, “Word spotting in the wild,” in European Conference on Computer Vision. Springer, 2010, pp. 591–604.
-  W. Huang, Z. Lin, J. Yang, and J. Wang, “Text localization in natural images using stroke feature transform and text covariance descriptors,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1241–1248.
-  L. Neumann and J. Matas, “A method for text localization and recognition in real-world images,” in Asian Conference on Computer Vision. Springer, 2010, pp. 770–783.
-  J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image & Vision Computing, vol. 22, no. 10, pp. 761–767, 2004.
-  S. M. Hanif and L. Prevost, “Text detection and localization in complex scene images using constrained adaboost algorithm,” in Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on. IEEE, 2009, pp. 1–5.
-  S. Zhang, M. Lin, T. Chen, L. Jin, and L. Lin, “Character proposal network for robust text extraction,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 2633–2637.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1083–1090.
-  C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in 14th IAPR International Conference on Document Analysis and Recognition ICDAR, 2017.
-  A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object detectors with online hard example mining,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
-  J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.