Scene text detection has been made great progress in recent years. The detection manners are evolving from axis-aligned rectangle to rotated rectangle and further to quadrangle. However, current datasets contain very little curve text, which can be widely observed in scene images such as signboard, product name and so on. To raise the concerns of reading curve text in the wild, in this paper, we construct a curve text dataset named CTW1500, which includes over 10k text annotations in 1,500 images (1000 for training and 500 for testing). Based on this dataset, we pioneering propose a polygon based curve text detector (CTD) which can directly detect curve text without empirical combination. Moreover, by seamlessly integrating the recurrent transverse and longitudinal offset connection (TLOC), the proposed method can be end-to-end trainable to learn the inherent connection among the position offsets. This allows the CTD to explore context information instead of predicting points independently, resulting in more smooth and accurate detection. We also propose two simple but effective post-processing methods named non-polygon suppress (NPS) and polygonal non-maximum suppression (PNMS) to further improve the detection accuracy. Furthermore, the proposed approach in this paper is designed in an universal manner, which can also be trained with rectangular or quadrilateral bounding boxes without extra efforts. Experimental results on CTW-1500 demonstrate our method with only a light backbone can outperform state-of-the-art methods with a large margin. By evaluating only in the curve or non-curve subset, the CTD + TLOC can still achieve the best results. Code is available at https://github.com/Yuliang-Liu/Curve-Text-Detector.READ FULL TEXT VIEW PDF
Recently, segmentation-based methods are quite popular in scene text
Scene text detection is an important step of scene text reading system. ...
Recent end-to-end trainable methods for scene text spotting, integrating...
Scene text in the wild is commonly presented with high variant
Omnidirectional scene text detection has received increasing research
Curve text or arbitrary shape text is very common in real-world scenario...
Analyzing and identifying the shortcomings of current subdivision method...
Text in the wild conveys valuable information, which can be used for real-time multi-lingual translation, behavior analysis, product identification, automotive assistance, etc.
Recently, the emergence of many text datasets, which are constructed on specific tasks and scene, have contributed to great progresses of text detection and recognition methods. Interestingly, it is observed that labels of text bounding boxes from emerging datasets also develop from rectangle to flexible quadrangle. For example, horizontal rectangular labels in ICDAR 2013 “Focus Scene Text”  and SVT ; rotated rectangular labels in MSRA-TD500  and USTB-SV1K ; four points labels in ICDAR 2015 “Incidental Text” , RCTW-17  and recent MLT competition dataset . Similarly, the advancements of scene text detection methods also change from axis-aligned rectangle based to rotated rectangle based and to quadrangle based. It was indicated in  that once the bounding box becomes tighter and flexible, it can improve the detecting confidence, reduce the risk of being suppressed by post-processing and be beneficial to subsequent text recognition.
To recognize the scene text, it is a strong requirement that the text can be tightly and robustly localized in advance. However, current datasets have very little curve text, and it is defective to label such text with quadrangle let alone rectangle. For example, as showed in Figure 1, using curve bounding box has three remarkable advantages:
Avoid needless overlap. Because the text may appear in many ways, the traditional four points localization may not handle such elusive peculiarity well. As shown in Figure 1 (a), quadrilateral bounding box can not avoid a mass of tanglesome overlap while curve bounding box can.
Less background noise. As shown in Figure 1 (b), if text appear in the curve form, quadrilateral bounding box suffers from background noise.
Avoid multiple text lines. The recent popular recognition methods [29, 28, 1, 19] all require single row text line in each bounding box. However, in some cases like Figure 1 (c), quadrilateral bounding box can not avoid multiple text lines from disturbing with each others while curve bounding box can exquisitely solve this problem.
Actually, curve text are also very common in our real-world. For examples, text in most kinds of columnar objects (bottles, stone piles, etc.), spherical objects, plicated plane (clothes, streamer, etc.), coins, logos, signboard and so on. However, to our best knowledge, current methods can not directly detect the curve text. The linking methods [32, 27, 12] can detect components of the text and then grouping them together to match the curve bounding box. But if there are many texts stack up together, in many cases like Figure 1 (a), empirical connection rules are almost impossible to group the tiny components properly, and somehow these methods always bring more false positives than direct detection manners in practice.
Therefore, in this paper, we collect text from various natural scenes, websites or image library like google open-image , and constructing a new dataset named CTW1500. This dataset contains 1500 images with over 10k text annotation, and each image contains at least one curve text. For evaluation and comparison, we split 1000 images as training set and 500 for testing. Based on our observation, for all kinds of the curve text regions, a 14 points polygon can be sufficient to localize them as shown in Figure 1 and Figure 2. By using the referenced equant lines, it doesn’t require much manpower to label as introduced in Sec. 3.
Based on the proposed dataset, we propose a simple but effective polygon based curve text detector (CTD), which may be a pioneering method that can directly detect the curve text. Unlike traditional detecting methods, the CTD separates the branches for width/height offsets predictions, which can be run below 4GB video memory with the speed of 13 FPS. In addition, the network architecture can be seamlessly integrated with a ingenious method we proposed, namely transverse and longitudinal offset connection (TLOC), which uses RNN to learn the inherent connection between locating points, making the detection more accurate and smooth. The CTD is also designed as an universal method, which can be trained with rectangular and quadrilateral bounding boxes without extra manual labels. Two simple but effective post-processing methods named non-polygon suppression (NPS) and polygonal non-maximum suppression (PNMS) are proposed to further intensify the generalization ability of CTD.
On the proposed dataset, the results demonstrate the CTD with a light reduced resnet-50  can effectively detect the curve text and outperform state-of-the-art methods with a large margin. By using TLOC, our method can remarkably improve the performance. Furthermore, we also evaluate our method on mere curve or non-curve test subset (by making the other text as not care regions), and the CTD+TLOC can still achieve the best results.
In the past dozen years, scene text detection methods have achieved great improvement. One of the main reasons of such unceasing progress is the evolution of the benchmark datasets - the data become harder, the amounts become larger and the labels become tighter. From 2003 on, rectangular labeled datasets such as ICDAR’03 , ICDAR’11 , ICDAR’13  and COCO-Text  have attracted a great number of research efforts. After 2010, the multi-oriented datasets with rotated rectangular labels (NEOCR , OSTD , MSRA-TD500  and USTB-SV1K ) come out, which stimulate many influential multi-oriented detecting methods in the literatures. And in 2015, the first quadrilateral labeled dataset ICDAR 2015 “Incidental Scene Text”  appears, which unprecedentedly attracted lots of attention according to its evaluating website  and many recent progress. Since then, quite a few larger and more challenge quadrilateral labeled dataset in ICDAR 2017 competition like RCTW-17  (dataset for Chinese and English text), DOST  (scene texts observed by video in the real environment) and MLT  (dataset for multi-lingual text) seem to become the next mainstream datasets.
Interestingly, the development of detecting manners also show a similar evolution tendency with datasets. Although the rectangular methods still receive interests, they seem becoming less mainstream. Since 2011, almost every year there are methods with rotated rectangular bounding box proposed. And in 2017, plentiful quadrilateral based detection methods [21, 27, 38, 10, 4] emerged. Basically, the quadrilateral based detection methods can also achieve best performance in rotated datasets or horizontal datasets (by evaluating the circumscribed rectangle), and they all show a fact that they can beat horizontal manners in multi-oriented datasets (by using the circumscribed rectangle to train and test) especially in terms of recall rate. This is mainly because the stronger supervision for quadrilateral labeled methods can avoid much background noise, unreasonable suppression and information loss. The viewpoint that stronger supervision helps detection can also be found on Mask-RCNN  which improves detecting results by jointly training with a branch of segmentation, and Ren et at.  also shows training with recognition can be conducive to text detection.
However, current text detection methods, even quadrangle based methods all show disappointing performance in curve texts, which are commonly appeared in our real world as introduced in section 1. The linking method like  can not detect the strong bending text (the last image in second row in its Fig. 3) as well. One reason is that all current datasets contain very little curve text and many of the curve text are labeled with unsatisfactory rectangles. The other reason is that current four points based detection methods can only loosely detect the curve text, which may cause severely mutual interference like Figure 1 (c). Therefore, in order to address the challenging problem of detecting curve text in the wild, we construct a new curve text based dataset named CTW1500, and then propose a novel method that can directly detect the curve text effectively.
Data description. The CTW1500 dataset contains 1500 images, with 10,751 bounding boxes (3,530 are curve bounding boxes) and at least one curve text per image. The images are manually harvested from internet, image library like google Open-Image  and our own data collected by phone cameras, which also contain lots of horizontal and multi-oriented text. The distribution of the images is various, containing indoor, outdoor, born digital, blurred, perspective distortion texts and so on. In addition, our dataset is multi-lingual with mainly Chinese and English text.
Annotation. The text is manually labeled by ourself with our labeling tool. For labeling text with horizontal or quadrilateral shape, simply two or four clicks are required. To surround the curve text, we create ten equidistant referenced lines to help label the extra 10 points (we practically find extra 10 points are sufficient to label all kinds of curve text as shown in Figure 2). The reason we use the equidistant lines is to ease the labeling effort, and reduce subjective interference. To evaluate the localization performance, we simply follow the PASCAL VOC protocol , which uses 0.5 IoU threshold to decide true or false positive. The only difference is we calculate the exact intersection-over-union (IoU) between the polygons instead of axis-aligned rectangles.
The labeling procedure is shown in Figure 3. Firstly, we click the four vertexes marked as 1, 2, 3, 4, and the referenced dashed line (blue) will be automatically created. Moving one of the mouse’s referenced line (horizontal and vertical black dashed lines) to the appropriate position (intersection of two referenced lines) and then click to determine the next point, and so on for the remanent points. We roughly calculate the labeling time of three shapes of text in Table 1, which shows labeling one curve text consumes approximately triple time than labeling with quadrangle. The CTW1500 dataset can be download at https://github.com/Yuliang-Liu/Curve-Text-Detector.
|Labeling Time (s)||2.5||4||13|
This section presents details of our curve text detector (CTD). We will first illustrate the architecture of the CTD and how we make use of the polygonal labels. After that, we will describe how a recurrent neural network (RNN) component is seamlessly connected to the CTD and subsequently introducing the universality of this method. Finally, we will present our two simple but effective post-processing methods that can further improve the performance.
The overall architecture of our CTD is shown in Figure 4
, which can be divided into three parts: backbone, RPN and regression module. Backbone usually adopts the popular models pre-trained by ImageNet and then uses the corresponding model to finetune, such as VGG-16 , ResNet  and so on. Region proposal network (RPN) and regression module are respectively connected to the backbone; while the former generates proposals for roughly recalling text and the latter finely adjusts the proposals to make it tighter.
In this paper, we use a reduced ResNet-50 (simply remove the last residual block) as our backbone, which requires less memory and can be faster. In the RPN stage, we use the default rectangular anchors to roughly recall the text but we set a very loose RPN-NMS threshold to avoid premature suppression. To detect the curve text with polygon, the CTD only need to modify the regression module by adding the curve locating points, which is inspired by DMPNet  and East  that adopting quadrilateral regressing branch separated with circumscribed rectangle regression. The rectangular branch can be easily learned by the network and let it converses fast, which can also roughly detect the text region in advanced and alleviate the following regression. Contrastively, the quadrilateral branch offers stronger supervision to guide the network being more accurate.
Similar to [25, 21], we also regress the relative positions for each points. Unlike , we use the minimum x and minimum y of the circumscribed rectangle as the datum point. Therefore, the relative length and ( ) of every point is greater than zero, which is somehow easier to train in practice. In addition, we separately predict the offsets and , which can not only reduce the parameters but more reasonable for sequential learning as introduced in the following subsection. The total number of the regressing items is 32; while 28 are the offset of the 14 points, and 4 are the x,y minimum and maximum of the circumscribed rectangle. The parameterizations of the 14 offsets ( and ) are listed below:
Where, and are ground truth and predicted offsets respectively. Besides, and are the width and height of the circumscribed rectangle. For boundary regression, we follow the same as Faster R-CNN . It is worth noticing that 28 values are enough to determine the position of 14 points, but in relative regressing mode, 32 values can be easier to retrieve the rest of 14 points and offer stronger supervision.
Using recurrent connection in text detection task was proved robust and effective by CTPN , which learns the latent sequence of tiny proposals and produces better results. However, CTPN is a linking based method, which requires empirical connection. Also, its connectionist proposal requires fixed images size for ensuring the fixed number of input time sequences of RNN. Unlike CTPN, our method can directly localize the curve region without exterior connection, and the number of the time sequences of RNN is not constrained by input image size. This is because the RNN is connected to the output of the position-sensitive RoI Pooling (PSROIPooling) 
, and the number of the output targets are fixed (14 width offsets and 14 height offsets). Basically, PSROIPooling is used to predict and vote the class probabilities and localization offsets, which evenly partitions each RoI into
bins to estimate position information. The dimension of the input convolutional layer should be, and thus the PSROIPooling can produce a score map for each category. For transverse and longitudinal offset prediction, we remove the background class localization score map and use a bin, so the input convolutional dimension is for height and width separately. Each value of -th bin is computed from the corresponding position in the (i; j)-th score map by using a average pooling:
where, is the pooled value in the -th bin for category , represents a score map from the corresponding dimension. denotes the left-top coordinate of a RoI, denotes the amount of pixels in the bin, and stands for all network parameters. After the PSROIPooling procedure, CTD will get the scores or the estimated offsets of each RoI via globally pooling on the position-sensitive score maps: , which produces a
-dimensional vector. The voting class score is then computed by the softmax operation across all categories and output the final confidence:
The localization offsets would be fed into the localization loss function. During training phase, we choose the similar multi-task loss functions for score and offsets prediction as follow:
where is the amount of both positive and negative proposals that match specific overlapping range and is the number of positive proposals because it is not necessary to refine the negative proposals. Besides, and are balance factors which weighs the importance among the classification and detection losses ( represents classification loss function; is localization loss function which can be smooth- loss or smooth- loss). Practically, we set to 3 or even more to balance localization loss which has much more targets. Moreover, represent the predicted class, estimated bounding-box, width and height offset respectively, and denote the corresponding ground-truth.
To improve detection performance, we separate transverse and longitudinal branches to predict the offsets for localizing the text region. Intuitively, each point is restricted by the last and next points and the textual region. For example, in the case of Figure 3, the offset width of the sixth labeling point should larger than the fifth point and less than the seventh point. Independently predicting each offset may lead to unsmooth text region, and somehow it may bring more false detection. Therefore, we assume the width/height of each point has associated context information, and using RNN to learn their latent characteristics. We name this method as recurrent transverse and longitudinal offset connection (TLOC). The TLOC structure is shown with purple patterns in Figure 4 and we also list some examples in Figure 5 to show the difference of whether adding TLOC or not. To adopt TLOC, we find the output of PSROIPooling is suitable to encode the offsets context information. Take the width offset branch as an example; firstly, PSROIPooling outputs 14 score map for voting of each proposal, and bins of the -th score map have voting values from respective position, which can be encoded as the feature of ; The RNN than takes width offsets feature of each point as sequential inputs, and recurrently updates the inherent state inside the hidden layers, , i.e.
where is the -th predicting offset from the corresponding psroi-pooling output channel. is a recurrent internal state computed from both current input () and the last state encoded in . The recurrence is computed by using a non-linear function
, where we adopt a bi-directional long short-term memory (BLSTM) architecture as our RNN. The internal state inside the RNN hidden layer associates the sequential context information by all previous estimated offsets through the recurrent connection, and we empirically use a 256D BLSTM hidden layer, thus . Finally, the output of the BLSTM is a dimensional vector, which is globally pooled by a () kernel to output the final prediction.
As introduced in Sec. 4.1, for each bounding box, the CTD outputs offsets for 14 points. However, almost all current benchmark datasets only have labels of two or four vertexes information. To train in these datasets, we can easily interpolate the equal division points in the largest side and its opposite side as shown in6.
Promisingly, by simply interpolating the points in the bounding box, our CTD can be effectively trained with all text region and it can achieve better results as demonstrated in the experiment section.
Non-polygon suppression (NPS). False positive detecting results are one of the important reasons that restricting the performance of text detection. However, in CTD, some diffident false positives will appear with invalid shape (for a valid polygon, there is not any intersecting side). In addition, there is hardly any scene text come out with intersecting side, and these invalid polygons are nearly impossible to recognize. Therefore, we simply suppress all these invalid polygons and we named it a non-polygon suppression (NPS), which can slightly improve the accuracy without influencing the recall rate.
Polygonal non-maximum suppression (PNMS). Non-maximum suppression  is proved very effective for object detection task. Because of the particularity of the curve scene text, rectangular NMS is limited to handle dense multi-oriented text as shown in Figure 1 (a) and (c). To solve this problem,  propose a locality-aware NMS and  devise a Mask-NMS to suppress the final output results. In this paper, we also improve the NMS by computing the overlapping area between polygons, named polygon non-maximum suppression (PNMS), which is proved effective in the following experiments.
|Algorithm||Whole set||Non-curve subset||Curve subset||S (FPS)|
|R (%)||P (%)||H (%)||R (%)||P (%)||H (%)||R (%)||P (%)||H (%)|
|CTD + TLOC (ours)||69.8||77.4||73.4||62.3||70.8||66.3||77.1||57.1||65.6||13.3|
|Algorithm||R (%)||P (%)||H (%)|
|CTD + NMS0.3||64.4||74.9||69.3|
|CTD + PNMS0.1||65.2||74.3||69.5|
|CTD + TLOC + PNMS0.1||69.8||77.4||73.4|
|CTD + TLOC + PNMS0.2||70.1||75.0||72.4|
|CTD + TLOC + PNMS0.3||70.8||71.6||71.2|
|CTD + TLOC + PNMS0.4||71.7||65.3||68.3|
|CTD + TLOC + NMS0.1||63.7||78.4||70.3|
|CTD + TLOC + NMS0.2||68.6||78.1||73.1|
|CTD + TLOC + NMS0.3||69.7||77.1||73.2|
|CTD + TLOC + NMS0.4||70.8||74.7||72.7|
In this section, we carry experiments on the proposed CTW1500 dataset to test our method. The testing environment is Ubuntu 16.04 64bit with single Nvidia 1080 GPU. All the text detection results are evaluated with the protocol introduced in section 3, which exactly calculates the IoU between the polygons instead of axis-align rectangles. For fair comparison, all the experiments only use the provided training data without any data augmentation.
Effectiveness of TLOC and PNMS. We first evaluate the availability of proposed TLOC and PNMS, and the results are given in Table 3. In this table, we can find whatever using CTD or CTD + TLOC, the PNMS can always slightly outperform the classic NMS. On the other hand, the results also demonstrate that by simply adding TLOC, the proposed CTD can be improved about 4 percents in terms of Hmean. Note that we don’t compare NPS here because it is a prerequisite post-processing method for evaluating polygonal overlapping area, which can slightly improve the accuracy.
Comparison with state-of-the-arts methods. For comprehensively evaluating our method, we compare the proposed CTD and CTD + TLOC with several state-of-the-art and common text detection methods. Note that for East  and Seglink , we re-implement these methods based on the unofficial source codes from github, and for fair comparison, we do not use huge synthetic data to pre-train seglink which may reduce its performance. For DMPNet  and CTPN 
, we both use Caffe framework to re-implement these methods. Besides, because none of these methods can be trained with curve text region, we follow the traditional circumscribed rectangular bounding box to make the labels trainable for them. Table 2 lists the experimental results. The results of the whole CTW-1500 test set show that the proposed CTD + TLOC can outperform state-of-the-art methods with more than 10 percents in terms of Hmean. To further evaluating the effectiveness, we also split the curve and non-curve text from the whole test set by simply regarding the other kind of text as difficult (not care) and making comparison. Promisingly, in the curve subset, proposed CTD + TLOC can outperform state-of-the-art methods with at least 28 percents of Hmean, and meanwhile, this method can also achieve the best results in detecting only non-curve texts, demonstrating its robustness and universality. Note that for our method, the results of the subset are both less than the whole set are because the way treating the other text as difficult will reduce the precision.
Besides, we compare the detection speed in the last column of the Table 2 and the results (13.3 or 15.2 FPS) show that our method is the second fastest, which also proves the effectiveness of our method.
Examples of the detection results are visualized in Figure 7. In the last column of this figure, we also use our CTD to detect the curve texts from other dataset, which qualitatively demonstrates its powerful ability for detecting curve texts, as well as its generalization ability.
Curve text is common in our real world but currently few datasets or methods are aiming at curve text detection. In order to facilitate this new challenging research of reading curve text in the wild, in this paper, we propose a new dataset named CTW1500, which is a novel dataset mainly constructed by curve text. The curve text in this dataset are approximated labeled by polygon which do not require too much manpower. In addition, we propose a new CTD approach which may be the first attempt to directly detect the curve texts. By devising a transverse and longitudinal offset connection (TLOC) method, the CTD can be seamlessly connected with RNN, which significantly improves the detecting performance. We also propose a simple but effective long side interpolation technique, which let CTD become an universal method that can also be trained with rectangular or quadrilateral bounding boxes without additional manual efforts. Lastly, we design two post-processing methods that are also demonstrated effective.
In future, the propose dataset could be enlarged as a curve text based recognition dataset, because the labeling manner seems good for recognition. Besides, although the flexible detecting manner like our CTD may be slightly slower than an rigid rectangle based detector, the former can solve more complicated problems like detecting curve text and achieving better results, which should worth further exploration.
Enriched deep recurrent visual attention model for multiple object recognition.In
Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 971–978. IEEE, 2017.
Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.