Recently, scene text detection (STD) in the wild has drawn extensive attention because of its practical applications, such as blind navigation and autonomous driving. The performance of STD has been greatly enhanced by advanced object detection and segmentation frameworks, and existing methods can be divided into two categories: 1) Segmentation-based methods. These methods draw inspiration from instance segmentation and conduct dense predictions at the pixel level. 2) Regression-based methods. Scene texts are detected using adapted one-stage or two-stage frameworks which have proved effective in general object detection tasks.
However, STD remains a challenging task due to its unique characteristics. Firstly, since the convolutional operation, which is widely used in segmentation and detection networks, only processes a local neighbourhood, it hardly captures long-range dependencies even when stacked repeatedly. Thus, CNN-based STD methods sometimes fail to detect long text instances because they extend far beyond the CNN’s receptive field. Secondly, although detecting words or text lines with a relatively simple rectangle or quadrilateral representation has been well tackled, curve text detection with a tighter representation is not well solved. Finally, some text instances are extremely tiny, which makes their precise shape description more difficult because even a small segmentation deviation may lead to failure. Therefore, a single segmentation network struggles to process images that vary greatly in text scale.
To solve the problems mentioned above, we propose NASK, which contains a Text Instance Segmentation network (TIS) and a Fiducial pOint eXpression module (FOX), connected by Text RoI pooling. TIS is a context-attended FCN with a proposed Group Spatial and Channel Attention (GSCA) module for text instance segmentation. GSCA captures long-range dependencies by directly computing interactions between any two positions across both space and channels, which enhances the semantic information of the shared feature maps. Then, similar to Faster R-CNN, Text RoI pooling accepts the shared feature maps and the bounding box coordinates generated by TIS as input and "warps" these rectangular RoIs into a fixed size. Finally, FOX reconstructs texts with a set of fiducial points which are calculated using the predicted geometry attributes.
The main contributions of this work are summarized as follows: (1) A Group Spatial and Channel Attention module (GSCA), which aggregates contextual information, is introduced into FCN for feature refinement. (2) We propose a Fiducial pOint eXpression module (FOX) for tighter arbitrary shape text detection. (3) A novel two-stage segmentation-based STD detector named NASK, incorporating GSCA and FOX, is trained jointly and achieves state-of-the-art performance on two curve text detection benchmarks.
In this section, we describe the pipeline of the proposed NASK. Firstly, the overall pipeline of the whole model is briefly described in Section 2.1. Next, we elaborate on all proposed modules including GSCA and FOX. Finally, the optimization details are given in Section 2.4.
2.1 Overall Pipeline
The overall architecture is shown in Fig 1. Firstly, a ResNet-50 based FCN with GSCA makes up the first-stage text instance segmentation network TIS. Then the Text RoI Pooling module transforms the rectangular text proposals to a fixed size. Finally, FOX is applied to obtain a tighter representation of curve text instances.
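The step that transforms a rectangular proposal to a fixed size can be illustrated with a small RoI pooling sketch. This is only an illustration under stated assumptions: the text does not say which pooling operator Text RoI Pooling uses, so max pooling (as in the original RoI pooling of Fast/Faster R-CNN) is assumed here, and the function name `roi_max_pool`, the single-channel list-of-rows feature map and the exclusive box convention are hypothetical choices made for this example.

```python
def roi_max_pool(feature, box, out_h, out_w):
    """Max-pool the region `box` of a 2D feature map into an out_h x out_w grid.

    feature: list of rows (H x W); box: (x0, y0, x1, y1) with x1, y1 exclusive.
    Max pooling is an assumption; the paper only states that RoIs are
    warped to a fixed size.
    """
    x0, y0, x1, y1 = box
    roi_h, roi_w = y1 - y0, x1 - x0
    if roi_h <= 0 or roi_w <= 0:
        raise ValueError("empty RoI")
    pooled = []
    for i in range(out_h):
        # Row span of bin i: floor for the start, ceil for the end,
        # which guarantees that every bin covers at least one row.
        r0 = y0 + (i * roi_h) // out_h
        r1 = y0 + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0 = x0 + (j * roi_w) // out_w
            c1 = x0 + ((j + 1) * roi_w + out_w - 1) // out_w
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the proposal size, the output grid always has `out_h` by `out_w` cells, which is what lets the second stage operate on a fixed-size input.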
2.2 Group Spatial and Channel Attention Module
Inspired by the Non-local network, which is based on the self-attention mechanism, a Group Spatial and Channel Attention module is proposed. The detailed structure is displayed in Fig 2. Compared to the Non-local network, which only models the interactions between spatial positions in the same channel, GSCA explicitly learns the correlations among all elements across both space and channels. To alleviate the huge computational burden, GSCA incorporates the channel grouping idea to gather all $C$ channels into $G$ groups. Only the relationships within each group, which contains $C/G$ channels, are calculated, and the computational complexity decreases from $O((HWC)^2)$ to $O((HWC)^2/G)$ for an $H \times W \times C$ feature map. As for the affiliation among different groups, similar to SENet, the branch of global channel attention in Fig 2 is set to generate global channel-wise attention and distribute information among every group.
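Counting element pairs makes the saving from channel grouping concrete. The sketch below assumes a feature map with `height * width * channels` elements and counts one affinity per ordered pair of elements inside the same group; the function name and argument names are introduced here for illustration only.

```python
def pairwise_interactions(height, width, channels, groups=1):
    """Number of element pairs whose affinity is computed.

    With groups=1 every element interacts with every other element across
    space and channels; with G groups only pairs inside a group are counted.
    """
    if groups < 1 or channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    elements_per_group = height * width * (channels // groups)
    return groups * elements_per_group ** 2


full = pairwise_interactions(32, 32, 256)
grouped = pairwise_interactions(32, 32, 256, groups=4)
print(full // grouped)  # the grouped variant computes G times fewer pairs
```

Each of the G groups holds 1/G of the elements, so it costs 1/G² of the full computation, and G such groups together cost 1/G of it.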
Specifically, the attended feature map is expressed as $Y = f(\theta(X), \phi(X))\, g(X)$. Here $\theta$, $\phi$ and $g$ are learnable spatial transformations implemented as serially connected convolution and reshape, while $f(\cdot, \cdot)$ is defined as the matrix product for simplification. Then we have $Y_k = \theta(X_k)\, \phi(X_k)^{\top} g(X_k)$, where $X_k$ is the $k$-th group result of $X$. Another branch aiming to capture global channel weights is implemented with two convolution layers and one fully connected layer. Thus, through $Y = [Y_1, \dots, Y_G]$, we deduce $Z = [w_1 y_1, \dots, w_C y_C]$, where $C$, $w_i$ and $y_i$ denote the number of channels, the $i$-th channel weight and the $i$-th channel feature map respectively. Meanwhile, a short-cut path is used to preserve the local information and the final output can be written as $X + Z$.
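The data flow of this paragraph (attention inside each channel group, a global channel weight, and a short-cut path) can be traced with a toy, dependency-free sketch. It is not the authors' implementation. Assumptions made purely for illustration: the learnable transformations are replaced by the identity, the affinity is a plain product normalised by the number of elements in a group, and the channel weights default to a sigmoid of each channel's mean instead of the convolution and fully connected branch. The name `gsca_toy` is hypothetical.

```python
import math


def gsca_toy(x, groups, channel_weights=None):
    """Toy group spatial-and-channel attention.

    x: list of C channels, each a flat list of N = H*W floats.
    Returns a list with the same shape as x.
    """
    channels = len(x)
    if groups < 1 or channels % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = channels // groups
    n = len(x[0])
    if channel_weights is None:
        # Stand-in for the global channel attention branch (assumption).
        channel_weights = [1.0 / (1.0 + math.exp(-sum(ch) / n)) for ch in x]
    out = []
    for g in range(groups):
        chans = x[g * per_group:(g + 1) * per_group]
        # Flatten space and channels of this group into one vector.
        flat = [v for ch in chans for v in ch]
        m = len(flat)
        # Affinity between elements i and j is flat[i] * flat[j]; each
        # element aggregates all elements of its own group only.
        attended = [sum(flat[i] * flat[j] * flat[j] for j in range(m)) / m
                    for i in range(m)]
        for k, ch in enumerate(chans):
            w = channel_weights[g * per_group + k]
            # Short-cut path: input plus the channel-weighted attended map.
            out.append([ch[p] + w * attended[k * n + p] for p in range(n)])
    return out
```

Elements in different groups never interact directly in this sketch; only the per-channel weights carry information that is computed independently of the grouping.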
2.3 Fiducial Point Expression Module
As depicted in Fig 3, the geometrical representation of text instances includes the text center line (TCL) $t$, character scale $s$, character orientation $\alpha$ and text orientation $\beta$. Specifically, the text center line is a binary mask based on the side-shrunk version of text polygon annotations. The scale $s$ is half the height of the character, while the text orientation $\beta$ is defined as the horizontal angle between the current quadrilateral center $c_i$ and the next one $c_{i+1}$. We take the midpoints on the top and bottom edges of each character quadrilateral as fiducial points, and the character orientation $\alpha$ is defined as the direction from the midpoint of the bottom edge to that of the top edge.
Mathematically, a text instance can be viewed as an ordered sequence $S = \{v_1, \dots, v_n\}$, where $n$ is a hyper-parameter which denotes the number of character segments. Each node $v_i$ is associated with a group of geometrical attributes and can be represented as $v_i = (c_i, s_i, \alpha_i, \beta_i)$, i.e. its center, character scale, character orientation and text orientation, where every element is defined as above.
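The overall text polygon generation process is illustrated in Fig 4. Firstly, two up-sampling layers and one convolution with 6 output channels are applied to regress all the geometrical attributes. The output is $(\hat{s}, \hat{t}, \widehat{\sin\alpha}, \widehat{\cos\alpha}, \widehat{\sin\beta}, \widehat{\cos\beta})$, where $\hat{s}$ and $\hat{t}$ denote the character scale of each pixel and the probability of pixels on TCL respectively. $\widehat{\sin\alpha}$ and $\widehat{\cos\alpha}$ are normalized as $\sin\alpha = \widehat{\sin\alpha} / \sqrt{\widehat{\sin\alpha}^2 + \widehat{\cos\alpha}^2}$ and $\cos\alpha = \widehat{\cos\alpha} / \sqrt{\widehat{\sin\alpha}^2 + \widehat{\cos\alpha}^2}$ to ensure their quadratic sum equals 1. $\widehat{\sin\beta}$ and $\widehat{\cos\beta}$ are normalized in the same way. Then $n$ points are equidistantly sampled on the center line, named $c_1, \dots, c_n$. For each $c_i$, according to the geometric relationship, two corresponding fiducial points are computed as follows:

$$p_i^{\mathrm{top}} = c_i + s_i\,(\cos\alpha_i, \sin\alpha_i), \qquad p_i^{\mathrm{bottom}} = c_i - s_i\,(\cos\alpha_i, \sin\alpha_i),$$

where $c_i$, $s_i$ and $\alpha_i$ are the center coordinate, scale and orientation for the $i$-th character respectively. Therefore, one single text instance can be represented with $2n$ fiducial points. Finally, text polygons are generated by simply applying approxPolyDP in OpenCV and then mapped back to the original image proportionally.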
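As a minimal sketch of the geometry described above, the code below normalises a raw sine/cosine pair so that its squared sum is 1 and then places two fiducial points per sampled center, one scale-length along the character orientation and one against it. The function names, the `eps` guard and the polygon ordering (top points left to right, then bottom points right to left) are assumptions made for this example; the final `approxPolyDP` simplification mentioned in the text is not reproduced here.

```python
import math


def normalize_sin_cos(raw_sin, raw_cos, eps=1e-12):
    """Rescale a raw (sin, cos) prediction so that sin^2 + cos^2 == 1."""
    norm = max(math.sqrt(raw_sin ** 2 + raw_cos ** 2), eps)
    return raw_sin / norm, raw_cos / norm


def fiducial_polygon(centers, scales, sin_cos):
    """Build 2n fiducial points from n sampled centers.

    centers: [(x, y)], scales: [s] (half character height),
    sin_cos: [(sin, cos)] of the character orientation, already normalised.
    """
    top, bottom = [], []
    for (cx, cy), s, (sin_a, cos_a) in zip(centers, scales, sin_cos):
        dx, dy = s * cos_a, s * sin_a
        top.append((cx + dx, cy + dy))      # midpoint of the top edge
        bottom.append((cx - dx, cy - dy))   # midpoint of the bottom edge
    # Walk along the top edge, then back along the bottom edge, so the
    # points form a closed polygon outline (ordering is an assumption).
    return top + bottom[::-1]
```

For example, two centers at (0, 0) and (2, 0) with scale 1 and an orientation pointing along the positive y axis yield the four corners of a 2 by 2 box.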
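2.4 Optimization
The whole network is trained in an end-to-end manner using the following loss function:

$$L = \lambda_0 L_{TIS} + L_{FOX},$$

where $L_{TIS}$ and $L_{FOX}$ are the losses for the Text Instance Segmentation network and the Fiducial Point Expression module respectively. $L_{TIS}$ is the cross-entropy loss for text regions with OHEM adopted. $L_{FOX}$ can be expressed as follows:

$$L_{FOX} = \lambda_1 L_{tcl} + \lambda_2 L_{s} + \lambda_3 L_{\sin\alpha} + \lambda_4 L_{\cos\alpha} + \lambda_5 L_{\sin\beta} + \lambda_6 L_{\cos\beta},$$

where $L_{tcl}$ is the cross-entropy loss for TCL. $L_{s}$, $L_{\sin\alpha}$, $L_{\cos\alpha}$, $L_{\sin\beta}$ and $L_{\cos\beta}$ are all calculated using the Smoothed-L1 loss. For these terms, all pixels outside TCL are set to 0 since the geometrical attributes make no sense for non-TCL points. The hyper-parameters $\lambda_0, \dots, \lambda_6$ are all set to 1 in our experiments.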
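The regression part of this loss can be sketched in a few lines. The sketch assumes that "set to 0" means pixels outside TCL contribute nothing to the Smoothed-L1 terms and that each term is averaged over the TCL pixels; both are readings of the text, not confirmed details. The function names, the `beta` parameter and the weighting helper are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smoothed-L1 loss for a single value: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_smooth_l1(preds, targets, tcl_mask):
    """Average Smoothed-L1 over pixels on the text center line only."""
    losses = [smooth_l1(p, t)
              for p, t, on_tcl in zip(preds, targets, tcl_mask) if on_tcl]
    return sum(losses) / len(losses) if losses else 0.0


def weighted_sum(terms, weights=None):
    """Combine loss terms; all weights default to 1 as stated in the text."""
    if weights is None:
        weights = [1.0] * len(terms)
    return sum(w * t for w, t in zip(weights, terms))
```

In this sketch a badly wrong prediction at a non-TCL pixel does not change the loss, which matches the remark that geometrical attributes are meaningless away from the center line.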
To evaluate the effectiveness of the proposed NASK, we adopt two widely used datasets with arbitrary shape text instances for experiments and present detailed ablation studies.
Total-Text is a newly-released dataset for curve text detection which contains horizontal and multi-oriented texts as well. It is split into training and testing sets with 1255 and 300 images respectively.
SCUT-CTW1500 is a challenging dataset for long curve text detection. It consists of 1000 training images and 500 testing images. The text instances from this dataset are annotated as polygons with 14 vertices.
3.2 Implementation Details
The proposed method is implemented in PyTorch. For all datasets, images are randomly cropped and resized to a fixed size. The cropped image regions are rotated randomly in 4 directions. The experiments are conducted on four NVIDIA TitanX GPUs, each with 12GB memory. The training process is divided into two stages. Firstly, the first-stage segmentation network is trained using the synthetic dataset for 10 epochs. We take this step as a warm-up training strategy because a precise first-stage segmentation is a prerequisite for the subsequent text shape refinement. Then, in the fine-tuning step, the whole model is trained using the Adam optimizer with the learning rate re-initialized and the learning rate decay factor set to 0.9.
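The four-direction rotation used for augmentation might look like the sketch below. It assumes the four directions are multiples of 90 degrees, which the text above does not spell out, and it operates on a plain 2D list rather than a tensor; `rotate90` and `random_rotate` are hypothetical helper names.

```python
import random


def rotate90(image, k):
    """Rotate a 2D list counter-clockwise by k quarter turns."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image


def random_rotate(image, rng=random):
    """Pick one of four directions uniformly and return (rotated image, k)."""
    k = rng.randrange(4)
    return rotate90(image, k), k
```

In a real pipeline the polygon annotations would have to be rotated with the same `k`, which is why the helper returns it alongside the image.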
3.3 Evaluation on Curved Text Benchmark
We evaluate the performance of NASK on Total-Text and SCUT-CTW1500 after fine-tuning for about 10 epochs. The number of sample points in TCL is set to 8 and the group number of GSCA is set to 4. The thresholds for regarding pixels as text regions or TCL are set to (0.7, 0.6) and (0.8, 0.4) for Total-Text and SCUT-CTW1500 respectively. All quantitative results are shown in Table 1.
Note: R, P, H and F denote Recall, Precision, H-mean and FPS respectively. For fair comparison, no external data is used for any model.
From Table 1, we can see that NASK achieves the highest H-mean value of 82.2% with FPS reaching 8.4 on Total-Text. The quantitative results on the SCUT-CTW1500 dataset also show that NASK achieves a competitive result comparable to state-of-the-art methods, with H-mean and Precision attaining 80.5% and 82.8%. Selected detection results are shown in Fig 5.
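For readers unfamiliar with the metric, H-mean is the harmonic mean of precision and recall. A minimal helper is shown below; the function name is introduced here, and the numbers in the usage line are illustrative values, not results from the paper.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: a detector with P = 0.8 and R = 0.6.
print(round(h_mean(0.8, 0.6), 4))
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high H-mean by trading away recall for precision or vice versa.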
3.4 Ablation studies
We conduct several ablation experiments on SCUT-CTW1500 to analyze the proposed NASK. Details are discussed as follows.
Effectiveness of GSCA. We devise a set of comparative experiments to demonstrate the effectiveness of GSCA. For fair comparison, we replace GSCA with two stacked convolution layers so that they share almost the same computation overhead. The experimental results in Table 2(a) show that GSCA brings an obvious improvement in performance. For instance, by setting $G$ to 4, H-mean improves by 2.5% compared to the native model. The visualization analysis in Fig 6 indicates that GSCA is context-aware, in that most of the weights are focused on the pixels belonging to the same category as the reference pixel.
Note: TIS denotes the first-stage segmentation network; $G$ denotes the group number of the attention module.
Influence of the number of attention module groups G. Several experiments are conducted to study the impact of the group number of GSCA and the results are shown in Table 2(a). As expected, the detection speed increases as the group number rises and reaches its limit at about 12.9 FPS. It is also worth noticing that the detection result is not very sensitive to $G$. This may be attributed to the fact that the global channel attention effectively captures the rich correlations among groups.
Influence of the number of sample points n. The curve text representation is decided by a set of fiducial points. We evaluate NASK with different values of $n$ and the results are shown in Fig 7. The performance increases sharply when $n$ changes from 2 to 8 and then gradually converges. Therefore, we set $n$ to 8 in our experiments.
Effectiveness of the first-stage segmentation (TIS). To demonstrate the effectiveness of the two-stage architecture, we conduct experiments that directly apply FOX on the input image and the comparative results are listed in Table 2(b). The two-stage segmentation network clearly improves the detection performance, with H-mean improved by 5.5%.
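The role of n is easier to see with the sampling step written out. The sketch below places n points at equal arc-length intervals along a center line given as a polyline. Including both end points and using linear interpolation are assumptions of this example, and `sample_equidistant` is a hypothetical name.

```python
import math


def sample_equidistant(polyline, n):
    """Return n points spaced equally (by arc length) along a polyline."""
    if n < 2 or len(polyline) < 2:
        raise ValueError("need at least two samples and two polyline vertices")
    seg = [math.dist(a, b) for a, b in zip(polyline, polyline[1:])]
    total = sum(seg)
    points = []
    idx, walked = 0, 0.0  # current segment and arc length before it
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while idx < len(seg) - 1 and walked + seg[idx] < target:
            walked += seg[idx]
            idx += 1
        t = (target - walked) / seg[idx] if seg[idx] > 0 else 0.0
        (ax, ay), (bx, by) = polyline[idx], polyline[idx + 1]
        points.append((ax + t * (bx - ax), ay + t * (by - ay)))
    return points
```

With n = 2 only the two end points are kept, so any curvature between them is lost; larger n follows the curve more closely, which is consistent with the trend reported in Fig 7.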
In this paper, we propose a novel text detector NASK to facilitate the detection of arbitrary shape texts. The whole network consists of a serially connected Text Instance Segmentation network (TIS), Text RoI Pooling and a Fiducial Point Expression module (FOX). TIS conducts text instance segmentation while Text RoI Pooling transforms rectangular text bounding boxes to a fixed size. Then FOX achieves a tight and precise text detection result by predicting several geometric attributes. To capture long-range dependencies, a self-attention based mechanism called the Group Spatial and Channel Attention module (GSCA) is incorporated into TIS to augment the feature representation. The effectiveness and efficiency of the proposed NASK are demonstrated by experiments, with H-mean reaching 82.2% and 80.5% on Total-Text and SCUT-CTW1500 respectively.
-  Yingying Zhu, Minghui Liao, Mingkun Yang, and Wenyu Liu. Cascaded segmentation-detection networks for text-based traffic sign detection. IEEE transactions on intelligent transportation systems, 19(1):209–219, 2017.
-  Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4159–4167, 2016.
-  Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
-  Minghui Liao, Baoguang Shi, and Xiang Bai. Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing, 27(8):3676–3690, 2018.
-  Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
-  Gary Bradski and Adrian Kaehler. Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media, Inc., 2008.
-  Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761–769, 2016.
-  Chee Kheng Ch’ng and Chee Seng Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 935–942. IEEE, 2017.
-  Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.
-  Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
-  Yongchao Xu, Yukang Wang, Wei Zhou, Yongpan Wang, Zhibo Yang, and Xiang Bai. Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 2019.
-  Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European conference on computer vision, pages 56–72. Springer, 2016.
-  Yixing Zhu and Jun Du. Sliding line point regression for shape robust scene text detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3735–3740. IEEE, 2018.
-  Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2550–2558, 2017.
-  Xiang Li, Wenhai Wang, Wenbo Hou, Ruo-Ze Liu, Tong Lu, and Jian Yang. Shape robust text detection with progressive scale expansion network. arXiv preprint arXiv:1806.02559, 2018.