MSR: Multi-Scale Shape Regression for Scene Text Detection

01/09/2019 ∙ by Chuhui Xue, et al.

State-of-the-art scene text detection techniques predict quadrilateral boxes that are prone to localization errors while dealing with long or curved text lines in scenes. This paper presents a novel multi-scale shape regression network (MSR) that is capable of locating scene texts of arbitrary orientations, shapes and lengths accurately. MSR detects scene texts by predicting dense text boundary points instead of sparse quadrilateral vertices, which often suffer from regression errors while dealing with long text lines. Detection by linking dense boundary points also enables accurate localization of scene texts of arbitrary orientations and shapes, whereas most existing techniques using quadrilaterals often include undesired background in the ensuing text recognition. Additionally, the multi-scale network extracts and fuses features at different scales concurrently and seamlessly, which demonstrates superb tolerance to text scale variation. Extensive experiments over several public datasets show that MSR obtains superior detection performance for both curved and arbitrarily oriented text lines of different lengths, e.g. an 80.7 f-score on CTW1500 and an 81.7 f-score on MSRA-TD500.


I Introduction

Automated detection of texts in scenes has attracted increasing interest in recent years due to its growing demand in many real-world applications such as image search, autonomous driving, etc. With the advance of deep neural networks (DNNs), a number of DNN-based scene text detection systems [1, 2, 3, 4] have been reported and many have achieved very promising detection performance. The prevalent idea is to treat scene text as one specific type of object and adapt various generic object detection techniques [5, 6, 7] to the scene text detection task. Two typical approaches have been explored, where one employs anchors of different aspect ratios to detect text proposals and the other adapts semantic segmentation techniques by directly regressing text pixels to the four vertices of quadrilateral text localization boxes.

State-of-the-art scene text detection techniques still suffer from three typical constraints. The first is low localization accuracy due to the specific shape of scene text - thin and of very different lengths. Due to this, proposal-based techniques are often at a loss for selecting anchors of appropriate aspect ratios, and segmentation-based techniques often introduce large regression errors as pixels lying around the centre of long text lines are very far from the vertices of quadrilateral regression boxes. The second is inaccurate localization while dealing with text lines of arbitrary curvatures, where most existing techniques generate rectangular or quadrilateral localization boxes that often include undesired image background for the ensuing scene text recognition task. The third is unreliable detection while dealing with texts of abnormal sizes in images. Scene texts usually have much larger scale variations as compared with generic objects [8], e.g. the scale ratio between the largest and the smallest texts is up to 230 times for images in the ICDAR2017-RCTW dataset [9]. The large scale variation often leads to missed detection of ultra-small text instances or broken detection of ultra-large text instances.

Fig. 1: Scene text detection using the proposed multi-scale shape regression network (MSR): For scene texts with arbitrary orientations and shapes in (a), MSR first predicts dense text boundary points (in red color) as shown in (b) and then locates texts by a polygon (in green color) that encloses all boundary points of each text instance as shown in (c).

We design an innovative multi-scale shape regression network (MSR) that addresses all three constraints at one go as illustrated in Fig. 1. MSR regresses text pixels to the nearest text boundary points and locates texts by linking up the regressed text boundary points. It can thus detect scene texts of arbitrary orientations and shapes more accurately as compared with most existing techniques, which produce quadrilateral vertices and often include undesired image background. MSR can also locate text lines of arbitrary lengths more accurately because regressing to the nearest boundary points introduces much smaller regression errors as compared with regressing to the quadrilateral vertices. In addition, a multi-scale network is designed in MSR for better tolerance to the large scale variation of texts in scenes. It employs multiple network channels to extract and fuse features at different scales concurrently and seamlessly, which leads to more robust scene text detection. Experiments over several public datasets show that MSR is broadly applicable and achieves superior detection performance for scene texts of arbitrary orientations, shapes and lengths.

Fig. 2: The framework of the proposed technique: An image and its down-scaled version are fed to the multi-scale shape regression network (MSR) as input. The MSR employs multiple network channels to extract and fuse features at different scales concurrently and seamlessly to predict the central text regions, the distances from the central text regions to the text boundaries, and dense text boundary points. Scene texts of arbitrary orientations, shapes and lengths are located by a concave polygon that encloses all boundary points of each text instance.

The contributions of this work are threefold. First, it proposes a novel shape regression technique to predict dense text boundary points with which scene texts of arbitrary orientations, shapes and lengths can be located accurately. Second, it proposes a multi-scale network that employs multiple network channels to extract and fuse features at different scales concurrently and seamlessly. The network demonstrates great tolerance to the large text scale variation. Third, it develops an end-to-end trainable system that achieves superior scene text detection performance over a number of public datasets with arbitrarily-oriented and curved text lines of various lengths.

II Related Work

II-A Generic Object Detection

After years of effort with hand-crafted features such as the histogram of oriented gradients (HoG) and the scale-invariant feature transform (SIFT) [10, 11], object detection has achieved great success with the development of convolutional neural networks (CNNs). The CNN-based work adopts a multi-task learning structure but follows two different approaches. One approach, such as Faster R-CNN [5], YOLO [12], and SSD [6], first extracts object features using CNNs to generate proposals or default boxes and then classifies them into the corresponding categories, where a regressor is employed to regress proposals to more accurate localization boxes. The other approach, such as DenseBox [7], extracts features to predict object regions instead of proposals and classifies each pixel into a category, where a regressor is employed to regress each pixel to the localization boxes.

The generic object detection techniques often face various problems when applied to the scene text detection task. In particular, text lines in scenes are usually thin and of very different lengths, which makes it very difficult to select proposals of proper aspect ratios. In addition, most generic object detection techniques generate rectangular localization boxes which often include undesired background while dealing with multi-oriented or curved text lines in scenes. Further, inaccurate localization often happens due to regression errors while dealing with long text lines, where text pixels around the text line centre are far from the vertices of the rectangular localization boxes.

II-B Scene Text Detection

Scene text detection has been studied for years, and most existing techniques can be broadly classified into two categories. The first category extracts text-specific features such as boundaries [13], FAST keypoints [14], stroke symmetry [15], etc. by using traditional image processing techniques such as the stroke width transform (SWT) [16, 17], maximally stable extremal regions (MSERs) [18, 19, 20], HoG [21], etc. It usually involves multiple pre/post-processing steps and suffers from a clear performance drop while dealing with degraded images with blur, uneven lighting, etc.

The second category leverages CNNs to learn discriminative features and representations. One typical approach adapts generic object detection techniques, which either leverage text-specific proposals or default boxes [22, 23, 1, 2, 24, 25, 26, 27, 28] or follow the DenseBox idea by first extracting text regions and then regressing each text pixel to the vertices of localization boxes [3, 4]. In addition, some work treats scene text detection as a text segmentation problem [29, 30, 31], which predicts a pixel-level text feature map and localizes text instances by segmenting text regions directly from it.

Recent scene text detection research predicts quadrilateral boxes of arbitrary orientations for multi-oriented text lines, but still faces various problems while dealing with text lines of arbitrary shapes and lengths. Our proposed shape regression network instead predicts distances from text pixels to the nearest text boundary. It generates dense text boundary points that can be linked up to locate scene texts of arbitrary shapes and lengths accurately.

II-C Large Scale Variation

Large scale variation has been a grand challenge in the object detection literature [8]. Two typical approaches have been explored in the era of deep learning. The first approach exploits features of different scales that are extracted at multiple network layers of different depths, instead of just using features from the last network layer, which often carries large-scale global information only. For example, FPN [32] adopts a top-down network structure and treats it as a feature pyramid to make predictions at different network layers. U-Net [33] extracts features from different stages of the backbone network and fuses them by up-sampling features to the same scale. The other approach deals with large scale variation by employing images of different scales in network training. One typical way is to make predictions on images of different scales and then combine them into the final predictions [34, 35, 36]. In addition, [37, 8] attempt to predict objects at specific scales where the input image is resized to the corresponding resolutions, e.g. predicting smaller objects at higher resolutions of the input image.

Our proposed multi-scale network deals with large scale variations by marrying the merits of the two state-of-the-art approaches. Specifically, it designs a new network structure that employs multiple network channels to extract and fuse features at different network stages from images of different scales simultaneously and seamlessly. Experiments show its good tolerance to large object scale variations.

III Methodology

We propose a novel multi-scale shape regression network for accurate detection of scene texts of arbitrary orientations, shapes and lengths as illustrated in Fig. 2. A Multi-Scale Network is designed to extract and fuse features from images of different resolutions. The fused features are then fed to a Shape Regression module to detect central text regions and predict distances from each detected text pixel to its nearest text boundary. This produces a set of dense text boundary points that can be linked up to produce polygon localization boxes, as highlighted in red and green colors in the last two images in Fig. 2.
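To make the pipeline concrete, the following minimal Python sketch traces the inference flow under stated assumptions: it presumes a network object whose predict call returns the central-text-region score map and the two distance maps, and the function name, the score threshold and the interface are illustrative placeholders rather than the authors' actual implementation. The polygon-fitting step is sketched separately in Section III-A.

import numpy as np

def detect_boundary_points(image, network, score_thresh=0.8):
    """Sketch of one forward pass: score map + distance maps -> dense boundary points."""
    # 1. The multi-scale network predicts a central-text-region score map and
    #    two distance maps (x and y offsets to the nearest text boundary).
    score_map, dx_map, dy_map = network.predict(image)   # assumed interface

    # 2. Keep pixels classified as central text region.
    ys, xs = np.where(score_map > score_thresh)

    # 3. Each kept pixel regresses to its nearest boundary point:
    #    pixel coordinates plus the predicted offsets.
    boundary_x = xs + dx_map[ys, xs]
    boundary_y = ys + dy_map[ys, xs]
    return np.stack([boundary_x, boundary_y], axis=1)    # (N, 2) boundary points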

Fig. 3: Structure of the proposed multi-scale network (for the two-scale case): Features extracted from layers Conv2 - Conv5 of the two network channels are fused, where features of the same scale are fused by a Concat UpConv module as illustrated, and features from the deepest layer of the lower-scale channel are up-sampled to the scale of the previous layer for fusion.

III-A Shape Regression for Scene Text Localization

We design a novel shape regression technique for accurate localization of scene texts in arbitrary orientations, shapes and lengths. Instead of regressing to the vertices of rectangular or quadrilateral localization boxes, our proposed shape regressor regresses each text pixel to the nearest text boundary and predicts dense text boundary points for accurate localization of various texts in scenes.

The proposed shape regression module first performs text pixel classification and regression. The classification predicts central text regions (as illustrated in the first graph under the Shape Regression in Fig. 2) by using the fused feature map from the Multi-Scale Network. The regression predicts two distance maps (as illustrated in the second and third graphs under the Shape Regression) according to the distance between each predicted text pixel and its nearest text boundary in the horizontal and vertical directions (i.e. x and y coordinates), respectively. Note that the central text regions are derived from the original annotation boxes in training as shown in Figs. 4(a) and 4(c). They are smaller than the annotation boxes, which helps better separate neighboring words or text lines within the predicted text region map.
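As a concrete illustration of these prediction targets, the TensorFlow/Keras sketch below shows one possible output head producing a 1-channel central-text-region map and a 2-channel signed distance map. The channel widths, the sigmoid bounding trick and the max_offset value are assumptions made for illustration, not the paper's exact design.

import tensorflow as tf

def shape_regression_head(fused_features, max_offset=800.0):
    # Per-pixel binary classification of the central text region.
    score_map = tf.keras.layers.Conv2D(1, 1, activation='sigmoid',
                                       name='central_text_region')(fused_features)
    # Two-channel map of signed x/y offsets to the nearest text boundary point.
    # Bounding the output by an assumed max_offset keeps the regression stable.
    offsets = tf.keras.layers.Conv2D(2, 1, activation='sigmoid',
                                     name='boundary_offsets')(fused_features)
    offsets = (offsets - 0.5) * 2.0 * max_offset
    return score_map, offsets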

Each predicted text pixel thus regresses to one nearest point on the text boundary, which can be located by summing the coordinates of the text pixel and the predicted distances in the horizontal and vertical directions. Scene texts can then be located by a polygon that encloses all detected text boundary points. We adopt the Alpha-Shape Algorithm [38], which produces a concave polygon enclosing a set of given points. In the Alpha-Shape Algorithm, triangle edges whose circumradius is larger than the alpha threshold are removed from the Delaunay triangulation graph of the text boundary points. As this radius is sensitive to the size of each triangle, i.e. the size of the text instance, the coordinates of the boundary points of each text instance are first normalized to the range [0, 1] to simplify the alpha threshold selection. The polygon for each text instance is therefore first generated from the normalized boundary points and then resized back to the original scale.
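A sketch of this localization step is given below, using scipy's Delaunay triangulation to build a simple alpha-shape. The alpha value, the uniform coordinate normalization and the assumption that the boundary points of one text instance have already been grouped are illustrative choices, not the authors' exact settings.

import numpy as np
from scipy.spatial import Delaunay

def to_boundary_points(text_pixels_xy, dx, dy):
    """Each text pixel plus its predicted x/y offsets gives one boundary point."""
    return text_pixels_xy + np.stack([dx, dy], axis=1)

def alpha_shape_edges(points, alpha=0.05):
    """Boundary edges of the alpha-shape of normalized 2D points."""
    tri = Delaunay(points)
    edges = set()
    for ia, ib, ic in tri.simplices:
        a, b, c = points[ia], points[ib], points[ic]
        la, lb, lc = np.linalg.norm(b - c), np.linalg.norm(a - c), np.linalg.norm(a - b)
        s = (la + lb + lc) / 2.0
        area = max(s * (s - la) * (s - lb) * (s - lc), 1e-12) ** 0.5
        circumradius = la * lb * lc / (4.0 * area)
        if circumradius < alpha:          # triangles with larger circumradius are removed
            for e in ((ia, ib), (ib, ic), (ia, ic)):
                edges.symmetric_difference_update({tuple(sorted(e))})
    return edges   # edges kept by exactly one surviving triangle form the polygon boundary

def locate_polygon(boundary_pts):
    # Normalize coordinates uniformly to roughly [0, 1] so a single alpha works
    # regardless of text instance size (a simplification of the normalization step).
    lo = boundary_pts.min(axis=0)
    span = max(np.ptp(boundary_pts, axis=0).max(), 1e-6)
    normalized = (boundary_pts - lo) / span
    edges = alpha_shape_edges(normalized)
    return [(boundary_pts[i], boundary_pts[j]) for i, j in edges]   # edges at image scale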

To train the proposed shape regressor, a central text region and two distance maps are first extracted from the annotation polygon of each training text instance as illustrated in Fig. 4. Given a text annotation (which may have four edges or more for curved text lines) as shown in Fig. 4(a), triangulation is first performed over the annotation vertices, where each formed triangle has two vertices from the upper (or lower) side of the annotation and the third from the lower (or upper) side as illustrated in Fig. 4(b). For each newly formed triangle edge connecting the upper and lower sides of the text annotation, two points at 25% of the edge length from each end are determined, which form the vertices of the central text region as illustrated in Figs. 4(b) and 4(c). For each pixel in the central text region, the nearest point on the text annotation lines can be located (yellow-color points in Fig. 4(d)), and the distances between the pixel and this nearest boundary point can then be determined to generate the distance maps in the x and y directions as illustrated in Figs. 4(e) and 4(f).

Fig. 4: Illustration of ground-truth generation: Given a text annotation polygon in (a), triangulation is performed over the polygon vertices to locate the vertices (green points in (b)) of the central text region shown in blue in (c). For each central-text-region pixel (blue in (d)), the nearest point on the text annotation box (yellow in (d)) is determined as the nearest text boundary point, and the distance between them is used to generate the ground-truth distance maps shown in (e) and (f).
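The ground-truth distance maps can be produced with a straightforward routine such as the one below. It takes the shrunk central-text-region mask as given (the 25% triangulation-based shrinking is omitted) and simply searches for the nearest point on the annotation polygon for every central-region pixel, so it is a slow reference sketch rather than the authors' implementation.

import numpy as np

def nearest_point_on_segment(p, a, b):
    """Closest point to p on segment ab (2D numpy arrays)."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return a + t * ab

def ground_truth_distance_maps(central_mask, polygon):
    """central_mask: HxW boolean map; polygon: (K, 2) numpy array of vertices (x, y)."""
    h, w = central_mask.shape
    dx_map = np.zeros((h, w), dtype=np.float32)
    dy_map = np.zeros((h, w), dtype=np.float32)
    segments = [(polygon[k].astype(np.float32),
                 polygon[(k + 1) % len(polygon)].astype(np.float32))
                for k in range(len(polygon))]
    ys, xs = np.nonzero(central_mask)
    for x, y in zip(xs, ys):
        p = np.array([x, y], dtype=np.float32)
        candidates = [nearest_point_on_segment(p, a, b) for a, b in segments]
        q = min(candidates, key=lambda c: np.linalg.norm(c - p))   # nearest boundary point
        dx_map[y, x], dy_map[y, x] = q[0] - x, q[1] - y            # signed x/y offsets
    return dx_map, dy_map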

III-B Multi-Scale Multi-Stage Detection Network

We design a multi-scale multi-stage network architecture for robust detection of scene texts of different sizes. Instead of extracting features at multiple network stages or from images of multiple scales, our proposed network extracts features at multiple network stages and from images of multiple scales concurrently and seamlessly which clearly performs better than using either strategy alone while dealing with scene texts of very different sizes.

The proposed network adopts a multi-channel structure to accommodate images of different scales. Given a training image, it is first re-sampled by a factor of 2 to produce multiple re-scaled images. The training image and the re-scaled images are then fed to multiple network channels for feature extraction. Fig. 3 shows one network structure that uses two channels for the original training image and its half-scaled version (this network is adopted in our implemented scene text detection system). Within each network channel, image features are extracted at multiple network stages to capture details at different levels. At the end, features extracted from multiple network channels and multiple network stages (within each channel) are fused and fed to the Shape Regression module, which predicts the central text regions and distance maps as described in Section III-A.

Concurrent and seamless feature learning from images of different scales at different network stages is a challenging task, as features from the same network stage of different network channels have different scales. Fig. 3 illustrates how our proposed network architecture addresses this challenge. As Fig. 3 shows, features from Conv5 of Channel 2 are first up-sampled by a factor of 2 so that they have the same scale as features from Conv5 of Channel 1 (and as Conv4 of Channel 2). A Concat UpConv module is exploited for feature fusion, which first concatenates features from three network stages, then up-samples the concatenated features by a factor of 2 (after a 1x1 convolution and a 3x3 convolution) [3], and finally passes the up-sampled features to the earlier network stage for further fusion. Note that we only use features from stages Conv2 to Conv5 as those from Conv1 are too shallow.
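A minimal Keras sketch of this fusion scheme is shown below. The filter counts and activations are assumed values, and the two functions only illustrate the Concat UpConv idea (concatenate, 1x1 and 3x3 convolutions, up-sample by 2) rather than reproducing the exact MSR configuration.

import tensorflow as tf

def concat_upconv(feat_channel1, feat_channel2, prev_fused, filters=128):
    """Fuse same-scale features from the two channels with the previously fused map."""
    x = tf.keras.layers.Concatenate()([feat_channel1, feat_channel2, prev_fused])
    x = tf.keras.layers.Conv2D(filters, 1, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return tf.keras.layers.UpSampling2D(size=2)(x)   # passed to the earlier stage

def fuse_deepest(conv5_channel1, conv5_channel2, filters=128):
    """Top of the pyramid: Conv5 of the half-scale channel is up-sampled by 2 first."""
    up = tf.keras.layers.UpSampling2D(size=2)(conv5_channel2)
    x = tf.keras.layers.Concatenate()([conv5_channel1, up])
    x = tf.keras.layers.Conv2D(filters, 1, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return tf.keras.layers.UpSampling2D(size=2)(x)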

The proposed network improves the detection of scene texts of very different sizes in two respects. On the one hand, the multi-stage design improves the prediction of details at different levels by fusing local and global features effectively. On the other hand, the multi-scale design extracts features from text images of different scales, which addresses the large text size variation directly.

III-C Network Training

The proposed multi-scale shape regression network takes the original training image and the corresponding central text regions and distance maps as inputs. The training aims to minimize the following multi-task loss function:

L = L_cls + λ · L_reg    (1)

where L_cls and L_reg refer to the loss of classification (for the prediction of central text regions) and regression (for the prediction of distances to the nearest text boundary), respectively. The parameter λ is the weight that balances the two losses, which is empirically set to 1.0 in our implemented system.

The prediction of central text regions is actually a pixel-wise binary classification problem. We adopt the Dice Coefficient [39, 40] loss that is defined by:

L_cls = 1 - 2|G ∩ P| / (|G| + |P|)    (2)

where G and P refer to the ground-truth central text region and the predicted central text region, respectively.

The prediction of distances from central text pixels to the nearest text boundary is a regression problem. We define the regression loss based on the Smooth L1 loss [41]:

L_reg = SmoothL1(x - x̂) + SmoothL1(y - ŷ)    (3)

where x and y denote the ground-truth distances in the horizontal and vertical directions, respectively, and x̂ and ŷ denote the correspondingly predicted distances. SmoothL1(·) denotes the standard Smooth L1 loss.
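The two losses and their combination follow directly from Eqs. (1)-(3); the TensorFlow sketch below uses the ground-truth central text region as the regression mask, which is an assumption about the masking detail not spelled out above.

import tensorflow as tf

def dice_loss(gt_region, pred_region, eps=1e-6):
    """Eq. (2): Dice coefficient loss over the central-text-region maps."""
    intersection = tf.reduce_sum(gt_region * pred_region)
    return 1.0 - 2.0 * intersection / (tf.reduce_sum(gt_region) + tf.reduce_sum(pred_region) + eps)

def smooth_l1(diff):
    """Standard Smooth L1 applied element-wise."""
    abs_diff = tf.abs(diff)
    return tf.where(abs_diff < 1.0, 0.5 * tf.square(diff), abs_diff - 0.5)

def regression_loss(gt_dx, gt_dy, pred_dx, pred_dy, mask, eps=1e-6):
    """Eq. (3): Smooth L1 over the x/y distance maps, averaged over central text pixels."""
    per_pixel = smooth_l1(gt_dx - pred_dx) + smooth_l1(gt_dy - pred_dy)
    return tf.reduce_sum(per_pixel * mask) / (tf.reduce_sum(mask) + eps)

def total_loss(gt_region, pred_region, gt_dx, gt_dy, pred_dx, pred_dy, lam=1.0):
    """Eq. (1): classification loss plus lambda-weighted regression loss (lambda = 1.0)."""
    return dice_loss(gt_region, pred_region) + lam * regression_loss(
        gt_dx, gt_dy, pred_dx, pred_dy, mask=gt_region)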

IV Experiments

The proposed technique has been extensively evaluated over four public datasets that contain scene texts of arbitrary orientations, shapes and lengths. It has also been compared with state-of-the-art techniques and analyzed through ablation studies, as described in the ensuing subsections.

IV-A Datasets

SynthText [42] contains more than 800,000 synthetic scene text images most of which are at word level with multi-oriented rectangular annotations.

CTW1500 [43] consists of 1,000 training images and 500 test images that contain 10,751 multi-oriented text instances of which 3,530 are arbitrarily curved. Each text instance is annotated at text-line level by using 14 vertices, where texts are largely in English and Chinese.

Total-Text [44] consists of 1,255 training images and 300 test images where texts are all in English. It contains a large number of multi-oriented curved text instances each of which is annotated at word level by using a polygon.

MSRA-TD500 [17] consists of 300 training images and 200 test images. All captured text instances are printed in English and Chinese and are annotated at text-line level by using best-aligned rectangles.

ICDAR2013 [45] has 229 training images and 233 test images. All captured text instances are in English which are annotated at word level by using horizontal rectangles.

Methods Precision Recall F-score
SegLink [25] 42.3 40.0 40.8
EAST [3] 78.7 49.1 60.4
DMPNet [24] 69.9 56.0 62.2
CTD* [43] 74.3 65.2 69.5
CTD+TLOC* [43] 77.4 69.8 73.4
TextSnake* [46] 67.9 85.3 75.6
Ours* 83.8 77.8 80.7
TABLE I: Experimental results over CTW1500, where methods with ‘*’ specifically address curved text lines.
Methods Precision Recall F-score
SegLink [25] 30.3 23.8 26.7
EAST [3] 50.0 36.2 42.0
Baseline [47] 33.0 40.0 36.0
Mask TextSpotter* [48] 69.0 55.0 61.3
TextSnake* [46] 82.7 74.5 78.4
Ours* 85.2 73.0 78.6
TABLE II: Experimental results over Total-Text, where methods with ‘*’ specifically address curved text lines. Results of SegLink and EAST are taken from [46].

IV-B Implementation Details

The proposed technique is implemented using TensorFlow [49] on a regular GPU workstation with 2 Nvidia GeForce GTX 1080 Ti GPUs, an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz and 32GB RAM. The network is optimized by the Adam optimizer [50] with a starting learning rate of . The network is pre-trained on SynthText [42] and then fine-tuned by using the training images of each evaluated dataset with a batch size of 10. ResNet-50 [51] is used as the network backbone.

IV-C Experimental Results

The proposed technique has been evaluated quantitatively and qualitatively over four public datasets as shown in Tables I, II, III and IV and Fig. 5. It has also been analyzed through ablation studies as shown in Table V and Fig. 6.

Fig. 5: Illustration of the proposed scene text detection method: Sample images in rows 1-4 are selected from ICDAR2013, MSRA-TD500, CTW1500 and Total-Text, where green boxes show polygon localization boxes and blue boxes show ground-truth boxes. For images with straight text lines in rows 1-2, red quadrilateral boxes are derived for evaluations. A few typical unsuccessful cases are given in the last column of the corresponding datasets.

IV-C1 Texts in Arbitrary Orientations and Shapes

The proposed technique has been evaluated over the datasets CTW1500 and Total-Text where many scene texts were captured in arbitrary orientations, shapes and lengths. The purpose is to study how the proposed technique performs under the presence of many different text appearances. Tables I and II show experimental results.

Methods Precision Recall F-score
Kang et al. [19] 71.0 62.0 66.0
Zhang et al. [52] 83.0 67.0 74.0
He et al. [4] 77.0 70.0 74.0
EAST [3] 87.3 67.4 76.1
SegLink [25] 86.0 70.0 77.0
Wu et al. [30] 77.0 78.0 77.0
PixelLink [31] 83.0 73.2 77.8
TextSnake* [46] 83.2 73.9 78.3
RRD [27] 87.0 73.0 79.0
Xue et al. [28] 83.0 77.4 80.1
ITN-ResNet50 [53] 90.3 72.3 80.3
Lyu et al. [26] 87.6 76.2 81.5
Ours* 87.4 76.7 81.7
TABLE III: Experimental results over MSRA-TD500, where methods with ‘*’ specifically address curved text lines.

As Tables I and II show, the proposed method achieves f-scores of 80.7% and 78.6% on CTW1500 and Total-Text, respectively, which are significantly higher than those of state-of-the-art methods that did not specifically address curved text lines. In addition, the proposed method is on par with or significantly outperforms the very recent methods that specifically addressed curved text lines. In particular, it outperforms the best prior f-score by 5.1% on CTW1500, which is annotated at text-line level, demonstrating its superiority in dealing with text lines of different lengths. In fact, the proposed method is capable of dealing with text lines of arbitrary lengths as illustrated in Fig. 5 because it regresses to the nearest text boundary instead of the four quadrilateral vertices. On Total-Text, where scene texts are annotated at word level, our proposed method also achieves state-of-the-art performance, demonstrating its superiority in dealing with curved texts with limited length variations.

IV-C2 Texts in Arbitrary Orientations and Lengths

The proposed method is also evaluated over MSRA-TD500 where most scene texts are straight but in arbitrary orientations with annotations at text-line level. Similar to state-of-the-art methods [26, 53, 28, 46] with the best f-scores, we also include HUST-TR400 [54] training images in training. As MSRA-TD500 provides annotations by rotated rectangles, we derive an oriented rectangular box from each determined concave polygon as illustrated in the second row of Fig. 5 (highlighted in red-color boxes) for overlap computation in evaluations. Table III shows experimental results and comparisons with state-of-the-art techniques.
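One common way to derive such an oriented rectangle from the detected concave polygon (assumed here for illustration; the paper does not spell out its exact conversion procedure) is the minimum-area rotated rectangle provided by OpenCV:

import numpy as np
import cv2

def polygon_to_rotated_rect(polygon_points):
    """polygon_points: (N, 2) array of concave-polygon vertices; returns 4 corner points."""
    rect = cv2.minAreaRect(polygon_points.astype(np.float32))  # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)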

As Table III shows, the proposed method achieves a state-of-the-art f-score, demonstrating its superior capability in dealing with straight text lines of arbitrary orientations and lengths, a subject that has been studied for many years in the scene text detection community. In fact, it outperforms most state-of-the-art methods that predict rectangular boxes, which are more suitable for straight text lines. Further, it outperforms TextSnake, a very recent method that specifically addressed curved text lines, by 3.4% in f-score.

Methods Precision Recall F-score
Zhang et al. [52] 88.0 78.0 83.0
SegLink [25] 87.7 83.0 85.3
He et al. [4] 92.0 81.0 86.0
CTPN [22] 93.0 83.0 88.0
SSTD [2] 89.0 86.0 88.0
Lyu et al. [26] 92.0 84.4 88.0
RRD [27] 92.0 86.0 89.0
Xue et al. [28] 91.5 87.1 89.2
WordSup [55] 93.3 87.5 90.3
Mask TextSpotter* [48] 94.1 88.1 91.0
Ours* 91.8 88.5 90.1
TABLE IV: Experimental results over ICDAR2013, where methods with ‘*’ specifically address curved text lines.

IV-C3 Normal Scene Texts

We also evaluate our method over ICDAR2013, where most captured scene texts are horizontal with annotations at word level. Similar to MSRA-TD500, rectangular boxes are further derived from the determined concave polygons as illustrated in the first row of Fig. 5 (highlighted in red-color boxes) for evaluation. Table IV shows experimental results and comparisons with state-of-the-art techniques. As Table IV shows, the proposed method also obtains state-of-the-art performance for this long-studied dataset.

Fig. 6: Ablation study of the proposed technique: For sample images from CTW1500 in (a), (b-e) show scene text detection in green-color boxes by using the ‘Baseline’, ‘Baseline+Multi-Scale’, ‘Baseline+Shape Regression’ and ‘Multi-Scale Shape Regression’, respectively, while ground-truth boxes are shown in blue-color.

IV-D Ablation Study

The proposed multi-scale shape regression network consists of two innovative components, namely, a multi-scale network and a shape regression module. We perform an ablation study over CTW1500 to identify the contribution of these two components. Four models are trained as shown in Table V. The first is ‘Baseline’ which refers to the original EAST model [3] that regresses text pixels to four quadrilateral vertices. The second is ‘Baseline+Multi-Scale’ which uses EAST but includes the proposed multi-scale network structure. The third is ‘Baseline+Shape Regression’ that uses EAST but regresses to the nearest text boundary. The last is ‘Multi-Scale Shape Regression’ that fully implements both Multi-Scale and Shape Regression.

As Table V shows, the inclusion of the multi-scale network alone improves the recall significantly with a certain sacrifice of precision, and the inclusion of the proposed shape regression alone improves both recall and precision clearly, leading to a 17% improvement in f-score. In addition, the inclusion of both the multi-scale network and the shape regression module improves the f-score by over 20% beyond the baseline. Fig. 6 illustrates the ablation study, where many missed and broken detections by the Baseline, Baseline+Multi-Scale and Baseline+Shape Regression are correctly detected by the full Multi-Scale Shape Regression.

Methods P R F-score
Baseline 78.7 49.1 60.4
Baseline+Multi-Scale 72.8 60.8 66.3
Baseline+Shape Regression 82.8 72.1 77.1
Multi-Scale Shape Regression 83.8 77.8 80.7
TABLE V: Ablation study of the proposed technique over the dataset CTW1500 (P: precision; R: recall)

IV-E Discussion

The proposed technique is capable of producing accurate localization of scene texts of arbitrary orientations, lengths and shapes. Fig. 7 shows f-scores of our proposed method and the baseline EAST when different IoU thresholds are used in evaluation (on MSRA-TD500). As Fig. 7 shows, the f-score difference between our proposed method and EAST keeps increasing from 9.9% to 20.5% as the IoU threshold increases from 0.5 to 0.8, demonstrating the more accurate localization of our proposed method.
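For reference, the polygon IoU used in such an evaluation can be computed as below. This is a generic sketch using shapely, not the official evaluation script, and the matching of detections to ground truths is omitted.

from shapely.geometry import Polygon

def polygon_iou(det_points, gt_points):
    """det_points, gt_points: lists of (x, y) vertices of the two polygons."""
    det, gt = Polygon(det_points), Polygon(gt_points)
    if not det.is_valid or not gt.is_valid:
        return 0.0
    inter = det.intersection(gt).area
    union = det.union(gt).area
    return inter / union if union > 0 else 0.0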

The proposed method still faces certain constraints under several specific scenarios. First, it could fail while dealing with text lines that spatially overlap with each other, as shown in the last images of the first and third rows of Fig. 5, largely due to the ambiguity in differentiating the central text regions when text lines overlap. Second, it could produce broken detections when characters in a word or text line are widely separated, as shown in the last example image in the second row of Fig. 5. Without text semantics, it is a common challenge to decide whether characters belong to the same text line when they are widely separated. Third, it may produce incomplete or missed detections when scene texts are printed in rarely used fonts and the training data contain few similar samples, as shown in the last example image in the fourth row of Fig. 5.

Fig. 7: The proposed method improves the localization accuracy greatly: the f-score gap with respect to EAST keeps increasing as the IoU threshold increases (evaluated on MSRA-TD500).

V Conclusion

This paper presents a novel multi-scale shape regression network that is capable of locating scene texts of arbitrary orientations, shapes and lengths accurately. The proposed method predicts dense text boundary points instead of sparse quadrilateral vertices, which are prone to large regression errors while dealing with long text lines. The dense boundary points also enable accurate localization of scene texts of arbitrary orientations and curvatures, whereas state-of-the-art techniques using quadrilaterals often include undesired background in the ensuing scene text recognition task. The multi-scale network extracts and fuses features at different scales, which demonstrates superb tolerance to text scale variation. Extensive experiments over several public datasets show the superior performance of the proposed technique.

References

  • [1] M. Liao, B. Shi, and X. Bai, “Textboxes++: A single-shot oriented scene text detector,” IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3676–3690, 2018.
  • [2] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, “Single shot text detector with regional attention,” in The IEEE International Conference on Computer Vision (ICCV), vol. 6, no. 7, 2017.
  • [3] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “East: an efficient and accurate scene text detector,” in Proc. CVPR, 2017, pp. 2642–2651.
  • [4] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, “Deep direct regression for multi-oriented scene text detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 745–753.
  • [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [6] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, 2016, pp. 21–37.
  • [7] L. Huang, Y. Yang, Y. Deng, and Y. Yu, “Densebox: Unifying landmark localization with end to end object detection,” arXiv preprint arXiv:1509.04874, 2015.
  • [8] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection–snip,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
  • [9] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai, “Icdar2017 competition on reading chinese text in the wild (rctw-17),” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1.    IEEE, 2017, pp. 1429–1434.
  • [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [11] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
  • [12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [13] S. Lu, T. Chen, S. Tian, J.-H. Lim, and C.-L. Tan, “Scene text extraction based on edges and support vector regression,” International Journal on Document Analysis and Recognition (IJDAR), vol. 18, no. 2, pp. 125–135, 2015.
  • [14] M. Busta, L. Neumann, and J. Matas, “Fastext: Efficient unconstrained scene text detector,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1206–1214.
  • [15] Z. Zhang, W. Shen, C. Yao, and X. Bai, “Symmetry-based text line detection in natural scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2558–2567.
  • [16] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.    IEEE, 2010, pp. 2963–2970.
  • [17] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.    IEEE, 2012, pp. 1083–1090.
  • [18] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.    IEEE, 2012, pp. 3538–3545.
  • [19] L. Kang, Y. Li, and D. Doermann, “Orientation robust text line detection in natural images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4034–4041.
  • [20] H. Cho, M. Sung, and B. Jun, “Canny text detector: Fast and robust scene text localization algorithm,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3566–3573.
  • [21] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan, “Text flow: A unified text detection system in natural scene images,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4651–4659.
  • [22] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting text in natural image with connectionist text proposal network,” in European conference on computer vision.    Springer, 2016, pp. 56–72.
  • [23] S. Tian, S. Lu, and C. Li, “Wetext: Scene text detection under weak supervision,” in Proc. ICCV, 2017.
  • [24] Y. Liu and L. Jin, “Deep matching prior network: Toward tighter multi-oriented text detection,” in Proc. CVPR, 2017, pp. 3454–3461.
  • [25] B. Shi, X. Bai, and S. Belongie, “Detecting oriented text in natural images by linking segments,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).    IEEE, 2017, pp. 3482–3490.
  • [26] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, “Multi-oriented scene text detection via corner localization and region segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7553–7563.
  • [27] M. Liao, Z. Zhu, B. Shi, G.-s. Xia, and X. Bai, “Rotation-sensitive regression for oriented scene text detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5909–5918.
  • [28] C. Xue, S. Lu, and F. Zhan, “Accurate scene text detection through border semantics awareness and bootstrapping,” in European Conference on Computer Vision.    Springer, 2018, pp. 370–387.
  • [29] A. Polzounov, A. Ablavatski, S. Escalera, S. Lu, and J. Cai, “Wordfence: Text detection in natural images with border awareness,” in Image Processing (ICIP), 2017 IEEE International Conference on.    IEEE, 2017, pp. 1222–1226.
  • [30] Y. Wu and P. Natarajan, “Self-organized text detection with minimal post-processing via border learning,” in Proc. ICCV, 2017.
  • [31] D. Deng, H. Liu, X. Li, and D. Cai, “Pixellink: Detecting scene text via instance segmentation,” arXiv preprint arXiv:1801.01315, 2018.
  • [32] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection.” in CVPR, vol. 1, no. 2, 2017, p. 3.
  • [33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.    Springer, 2015, pp. 234–241.
  • [34] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu, “Scale-aware face detection,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.
  • [35] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, vol. 1, no. 2, 2017, p. 3.
  • [36] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in The European Conference on Computer Vision (ECCV), September 2018.
  • [37] S. Qiao, W. Shen, W. Qiu, C. Liu, and A. L. Yuille, “Scalenet: Guiding object proposal generation in supermarkets and beyond.” in ICCV, 2017, pp. 1809–1818.
  • [38] N. Akkiraju, H. Edelsbrunner, M. Facello, P. Fu, E. Mucke, and C. Varela, “Alpha shapes: definition and software,” in Proceedings of the 1st International Computational Geometry Software Workshop, vol. 63, 1995, p. 66.
  • [39] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
  • [40] T. Sørensen, “A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons,” Biol. Skr., vol. 5, pp. 1–34, 1948.
  • [41] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [42] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [43] L. Yuliang, J. Lianwen, Z. Shuaitao, and Z. Sheng, “Detecting curve text in the wild: New dataset and new solution,” arXiv preprint arXiv:1712.02170, 2017.
  • [44] C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1.    IEEE, 2017, pp. 935–942.
  • [45] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras, “Icdar 2013 robust reading competition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on.    IEEE, 2013, pp. 1484–1493.
  • [46] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “Textsnake: A flexible representation for detecting text of arbitrary shapes,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 20–36.
  • [47] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
  • [48] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” in The European Conference on Computer Vision (ECCV), September 2018.
  • [49] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
  • [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [52] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, “Multi-oriented text detection with fully convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4159–4167.
  • [53] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao, “Geometry-aware scene text detection with instance transformation network,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [54] C. Yao, X. Bai, and W. Liu, “A unified framework for multioriented text detection and recognition,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4737–4749, 2014.
  • [55] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding, “Wordsup: Exploiting word annotations for character based text detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.