TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Driven by deep neural networks and large-scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.



1 Introduction

In recent years, the community has witnessed a surge of research interest and effort regarding the extraction of textual information from natural scenes, a.k.a. scene text detection and recognition. The driving factors stem from both application prospect and research value. On the one hand, scene text detection and recognition have been playing ever-increasingly important roles in a wide range of practical systems, such as scene understanding, product search, and autonomous driving. On the other hand, the unique traits of scene text, for instance, significant variations in color, scale, orientation, aspect ratio and pattern, make it obviously different from general objects. Therefore, particular challenges are posed and special investigations are required.

Figure 1: Comparison of different representations for text instances. (a) Axis-aligned rectangle. (b) Rotated rectangle. (c) Quadrangle. (d) TextSnake. Obviously, the proposed TextSnake representation is able to effectively and precisely describe the geometric properties, such as location, scale, and bending of curved text with perspective distortion, while the other representations (axis-aligned rectangle, rotated rectangle or quadrangle) struggle with giving accurate predictions in such cases.

Text detection, as a prerequisite step in the pipeline of textual information extraction, has recently advanced substantially with the development of deep neural networks and large image datasets. Numerous innovative works [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] have been proposed, achieving excellent performance on standard benchmarks.

However, most existing methods for text detection share a strong assumption that text instances are roughly linear in shape, and therefore adopt relatively simple representations (axis-aligned rectangles, rotated rectangles or quadrangles) to describe them. Despite their progress on standard benchmarks, these methods may fall short when handling text instances of irregular shapes, for example, curved text. As depicted in Fig. 1, for curved text with perspective distortion, conventional representations struggle to give precise estimates of the geometric properties.

In fact, instances of curved text are quite common in real life [12, 13]. In this paper, we propose a more flexible representation that fits text of arbitrary shapes well, i.e., text in horizontal, multi-oriented and curved forms. This representation describes text with a series of ordered, overlapping disks, each of which is located on the center axis of the text region and associated with a potentially variable radius and orientation. Owing to its excellent capability to adapt to the complex multiplicity of text structures, just like a snake changing its shape to adapt to the external environment, the proposed representation is named TextSnake. The geometry attributes of text instances, i.e., central axis points, radii and orientations, are estimated with a single Fully Convolutional Network (FCN) model. Besides ICDAR 2015 and MSRA-TD500, the effectiveness of TextSnake is validated on Total-Text and SCUT-CTW1500, two newly released benchmarks that focus mainly on curved text. The proposed algorithm achieves state-of-the-art performance on the two curved text datasets, while at the same time outperforming previous methods on horizontal and multi-oriented text, even in the single-scale testing mode. Specifically, TextSnake improves on the baseline on Total-Text by more than 40% in F-measure.

In summary, the major contributions of this paper are three-fold: (1) We propose a flexible and general representation for scene text of arbitrary shapes; (2) Based on this representation, an effective method for scene text detection is proposed; (3) The proposed text detection algorithm achieves state-of-the-art performance on several benchmarks, including text instances of different forms (horizontal, oriented and curved).

2 Related Work

In the past few years, the most prominent trend in the area of scene text detection has been the shift from conventional methods [14, 15] to deep learning based methods [16, 17, 4, 3, 2]. In this section, we look back on relevant previous works. For comprehensive surveys, please refer to [18, 19]. Before the era of deep learning, SWT [14] and MSER [15] were two representative algorithms that influenced a variety of subsequent methods [20, 21]. Modern methods are mostly based on deep neural networks and can be coarsely classified into two categories: regression based and segmentation based.

Regression based text detection methods [4] mainly draw inspiration from general object detection frameworks. TextBoxes [4] adopted SSD [22] and added “long” default boxes and filters to handle the significant variation of aspect ratios of text instances. Based on Faster-RCNN [23], Ma et al. [24] devised Rotation Region Proposal Networks (RRPN) to detect arbitrary-oriented text in natural images. EAST [3] and Deep Regression [25] both directly produce the rotated boxes or quadrangles of text, in a per-pixel manner.

Segmentation based text detection methods cast text detection as a semantic segmentation problem, and FCN [26] is often taken as the reference framework. Yao et al. [1] modified FCN to produce multiple heatmaps corresponding to various properties of text, such as text region and orientation. Zhang et al. [27] first use FCN to extract text blocks and then hunt character candidates from these blocks with MSER [15]. To better separate adjacent text instances, the method of [6] classifies each pixel into three categories: non-text, text border and text. These methods mainly vary in the way they separate text pixels into different instances.

The methods reviewed above have achieved excellent performance on various benchmarks in this field. However, most works, except for [1, 7, 12], have not paid special attention to curved text. In contrast, the representation proposed in this paper is suitable for text of arbitrary shapes (horizontal, multi-oriented and curved). It is primarily inspired by [1, 7], and the geometric attributes of text are also estimated via the multiple-channel outputs of an FCN-based model. Unlike [1], our algorithm does not need character level annotations. In addition, it shares a similar idea with SegLink [2], by successively decomposing text into local components and then composing them back into text instances. Analogous to [28], we also detect linear symmetry axes of text instances for text localization.

Another advantage of the proposed method lies in its ability to reconstruct the precise shape and regional strike of text instances, which can largely facilitate the subsequent text recognition process, because all detected text instances could be conveniently transformed into a canonical form with minimal distortion and background (see the example in Fig.9).

3 Methodology

In this section, we first introduce the new representation for text of arbitrary shapes. Then we describe our method and training details.

3.1 Representation

Figure 2: Illustration of the proposed TextSnake representation. Text region (in yellow) is represented as a series of ordered disks (in blue), each of which is located at the center line (in green, a.k.a. symmetric axis or skeleton) and associated with a radius $r$ and an orientation $\theta$. In contrast to conventional representations (e.g., axis-aligned rectangles, rotated rectangles and quadrangles), TextSnake is more flexible and general, since it can precisely describe text of different forms, regardless of shapes and lengths.

As shown in Fig. 1, conventional representations for scene text (e.g., axis-aligned rectangles, rotated rectangles and quadrangles) fail to precisely describe the geometric properties of text instances of irregular shapes, since they generally assume that text instances are roughly in linear forms, which does not hold true for curved text. To address this problem, we propose a flexible and general representation: TextSnake. As demonstrated in Fig. 2, TextSnake expresses a text instance as a sequence of overlapping disks, each of which is located at the center line and associated with a radius and an orientation. Intuitively, TextSnake is able to change its shape to adapt for the variations of text instances, such as rotation, scaling and bending.

Mathematically, a text instance $t$, consisting of several characters, can be viewed as an ordered list $S(t) = \{D_0, D_1, \ldots, D_i, \ldots, D_n\}$, where $D_i$ stands for the $i$-th disk and $n$ is the number of the disks. Each disk $D$ is associated with a group of geometry attributes, i.e. $D = (c, r, \theta)$, in which $c$, $r$ and $\theta$ are the center, radius and orientation of disk $D$, respectively. The radius $r$ is defined as half of the local width of $t$, while the orientation $\theta$ is the tangential direction of the center line around the center $c$. In this sense, the text region $t$ can be easily reconstructed by computing the union of the disks in $S(t)$.

Note that the disks do not correspond to the characters belonging to $t$. However, the geometric attributes in $S(t)$ can be used to rectify text instances of irregular shapes and transform them into rectangular, straight image regions, which are more friendly to text recognizers.
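As a minimal sketch of the reconstruction step described above, the following pure-Python snippet rasterizes the union of a list of disks into a set of pixels. The tuple layout `(cx, cy, r, theta)`, the function name `reconstruct_region`, and the toy disk values are illustrative assumptions, not part of the original method description.

```python
import math
from typing import List, Set, Tuple

# A disk is (cx, cy, r, theta); theta is carried along but not needed for the union.
Disk = Tuple[float, float, float, float]


def reconstruct_region(disks: List[Disk], height: int, width: int) -> Set[Tuple[int, int]]:
    """Return the set of integer (x, y) pixels covered by at least one disk."""
    covered: Set[Tuple[int, int]] = set()
    for cx, cy, r, _theta in disks:
        # Visit only the bounding box of the disk, clipped to the image.
        x0, x1 = max(0, math.floor(cx - r)), min(width - 1, math.ceil(cx + r))
        y0, y1 = max(0, math.floor(cy - r)), min(height - 1, math.ceil(cy + r))
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                    covered.add((x, y))
    return covered


# Three overlapping disks along a gently bending center line (toy values).
disks = [(5.0, 5.0, 3.0, 0.0), (8.0, 5.5, 3.0, 0.3), (11.0, 7.0, 3.0, 0.5)]
region = reconstruct_region(disks, height=16, width=16)
```

Because the region is simply a union, the same routine works unchanged for straight, oriented and curved instances; only the disk list differs.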

3.2 Pipeline

Figure 3: Method framework: network output and post-processing

In order to detect text with arbitrary shapes, we employ an FCN model to predict the geometry attributes of text instances. The pipeline of the proposed method is illustrated in Fig. 3. The FCN based network predicts score maps of text center line (TCL) and text regions (TR), together with geometry attributes, including $r$, $\cos\theta$ and $\sin\theta$. The TCL map is further masked by the TR map since TCL is naturally part of TR. To perform instance segmentation, a disjoint set is utilized, given the fact that TCLs do not overlap with each other. A striding algorithm is used to extract the central axis point lists and finally reconstruct the text instances.

3.3 Network Architecture

Figure 4: Network Architecture. Blue blocks are convolution stages of VGG-16.

The whole network is shown in Fig. 4. Inspired by FPN[29] and U-net[30], we adopt a scheme that gradually merges features from different levels of the stem network. The stem network can be convolutional networks proposed for image classification, e.g. VGG-16/19[31] and ResNet[32]. These networks can be divided into 5 stages of convolutions and a few additional fully-connected (FC) layers. We remove the FC layers, and feed the feature maps after each stage to the feature merging network. We choose VGG-16 as our stem network for the sake of direct and fair comparison with other methods.

As for the feature merging network, several stages are stacked sequentially, each consisting of a merging unit that takes feature maps from the last stage and the corresponding stem network layer. The merging unit is defined by the following equations:

$$h_1 = f_5 \tag{1}$$

$$h_i = \mathrm{conv}_{3 \times 3}(\mathrm{conv}_{1 \times 1}[f_{6-i};\ \mathrm{UpSampling}_{\times 2}(h_{i-1})]), \quad \text{for } i \geq 2 \tag{2}$$

where $f_i$ denotes the feature maps of the $i$-th stage in the stem network and $h_i$ is the feature maps of the corresponding merging units. In our experiments, upsampling is implemented as a deconvolutional layer as proposed in [33].
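The resolution bookkeeping of the merging network can be checked with a small sketch. It assumes, as is common for VGG-style stems, that stage k outputs features at 1/2^k of the input resolution and that each merging unit upsamples by a factor of 2 before merging; the function name `merged_strides` is hypothetical.

```python
def merged_strides(num_stages: int = 5) -> list:
    """Downsampling factor of each merged feature map relative to the input image."""
    # Assumption: stage k of the stem outputs features at 1 / 2**k resolution.
    stem = [2 ** k for k in range(1, num_stages + 1)]  # stem stages 1..n
    merged = [stem[-1]]                                # first unit starts from the deepest stage
    for i in range(2, num_stages + 1):
        upsampled = merged[-1] // 2                    # x2 upsampling of the previous unit
        skip = stem[num_stages - i]                    # stem features merged at step i
        assert upsampled == skip, "resolutions must match before merging"
        merged.append(upsampled)
    return merged


print(merged_strides())  # [32, 16, 8, 4, 2]
```

Under these assumptions the last merging unit sits at half the input resolution, so one further x2 upsampling is enough to produce predictions at full image size.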

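After the merging, we obtain a feature map whose size is $\frac{1}{2}$ of the input images. We apply an additional upsampling layer and 2 convolutional layers to produce dense predictions:

$$h_{final} = \mathrm{UpSampling}_{\times 2}(h_5) \tag{3}$$

$$P = \mathrm{conv}_{1 \times 1}(\mathrm{conv}_{3 \times 3}(h_{final})) \tag{4}$$

where $h_5$ is the output of the last merging unit and $P \in \mathcal{R}^{h \times w \times 7}$, with $4$ channels for logits of TR/TCL, and the last $3$ respectively for $r$, $\cos\theta$ and $\sin\theta$ of the text instance. As a result of the additional upsampling layer, $P$ has the same size as the input image. The final predictions are obtained by taking softmax for TR/TCL and regularizing $\cos\theta$ and $\sin\theta$ so that the squared sum equals $1$.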
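The final decoding step (softmax for TR/TCL and normalization of the direction components) can be sketched per pixel as follows. The channel order used here is an assumption made for illustration, as is the function name `decode_pixel`; the text only states that the output holds TR/TCL logits followed by the three geometry values.

```python
import math


def decode_pixel(p):
    """Turn 7 raw outputs at one pixel into (tr_prob, tcl_prob, r, cos, sin).

    Assumed (hypothetical) channel order:
    [tr_background, tr_text, tcl_background, tcl_text, r, cos, sin].
    """
    def foreground_prob(bg, fg):
        # Two-way softmax, shifted by the larger logit for numerical stability.
        m = max(bg, fg)
        e_bg, e_fg = math.exp(bg - m), math.exp(fg - m)
        return e_fg / (e_bg + e_fg)

    tr_prob = foreground_prob(p[0], p[1])
    tcl_prob = foreground_prob(p[2], p[3])
    # Rescale (cos, sin) so that cos^2 + sin^2 = 1; guard against a zero vector.
    norm = math.hypot(p[5], p[6]) or 1.0
    return tr_prob, tcl_prob, p[4], p[5] / norm, p[6] / norm


tr, tcl, r, c, s = decode_pixel([0.0, 0.0, 0.0, 0.0, 5.0, 3.0, 4.0])
```

In the example, the raw direction `(3, 4)` is rescaled to `(0.6, 0.8)`, which satisfies the unit squared-sum constraint.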

3.4 Inference

After feed-forwarding, the network produces the TCL, TR and geometry maps. For TCL and TR, we apply thresholding with values $T_{tcl}$ and $T_{tr}$ respectively. Then, the intersection of TR and TCL gives the final prediction of TCL. Using a disjoint set, we can efficiently separate TCL pixels into different text instances.
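The thresholding, masking and disjoint-set steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the maps are plain nested lists, pixels are linked with 4-connectivity, and the names `find` and `segment_tcl` as well as the threshold values in the example are hypothetical.

```python
def find(parent, i):
    """Return the root of element i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def segment_tcl(tr_prob, tcl_prob, t_tr, t_tcl):
    """Group TCL pixels (masked by TR) into instances with a disjoint set."""
    h, w = len(tr_prob), len(tr_prob[0])
    # A pixel belongs to the final TCL only if it passes both thresholds.
    mask = [[tr_prob[y][x] > t_tr and tcl_prob[y][x] > t_tcl for x in range(w)]
            for y in range(h)]
    parent = list(range(h * w))
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            # Union with the right and bottom neighbours (4-connectivity).
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w and mask[ny][nx]:
                    a, b = find(parent, y * w + x), find(parent, ny * w + nx)
                    if a != b:
                        parent[b] = a
    groups = {}
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                groups.setdefault(find(parent, y * w + x), []).append((x, y))
    return list(groups.values())


tr_map = [[0.9] * 5 for _ in range(3)]
tcl_map = [[0.9, 0.9, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.9, 0.9]]
instances = segment_tcl(tr_map, tcl_map, t_tr=0.5, t_tcl=0.5)
```

The two short center-line fragments in the toy maps are not connected, so they end up as two separate instances.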

Finally, a striding algorithm is designed to extract an ordered point list that indicates the shape and course of the text instance, and also to reconstruct the text instance areas. Two simple heuristics are applied to filter out false positive text instances: 1) The number of TCL pixels should be at least times their average radius; 2) At least half of the pixels in the reconstructed text area should be classified as TR.
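The two filtering heuristics translate into a short predicate. Since the multiplier in the first rule is not recoverable from the text here, it is exposed as a hypothetical parameter `min_length_ratio`; the value used in the example is purely illustrative, as are the function and argument names.

```python
def keep_instance(tcl_pixel_count: int,
                  mean_radius: float,
                  area_pixel_count: int,
                  tr_pixels_in_area: int,
                  min_length_ratio: float) -> bool:
    """Return True if a candidate text instance passes both heuristics."""
    # Rule 1: the center line must be long enough relative to its average radius.
    long_enough = tcl_pixel_count >= min_length_ratio * mean_radius
    # Rule 2: at least half of the reconstructed area must be classified as TR.
    mostly_text = 2 * tr_pixels_in_area >= area_pixel_count
    return long_enough and mostly_text


# Illustrative call; min_length_ratio=2.0 is an arbitrary placeholder value.
ok = keep_instance(tcl_pixel_count=40, mean_radius=8.0,
                   area_pixel_count=500, tr_pixels_in_area=400,
                   min_length_ratio=2.0)
```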

Figure 5: Framework of Post-processing Algorithm. Act(a) Centralizing: relocate a given point to the central axis; Act(b) Striding: a directional search towards the ends of text instances; Act(c) Sliding: a reconstruction by sliding a circle along the central axis.

The procedure for the striding algorithm is shown in Fig. 5. It features three main actions, denoted as Act(a), Act(b), and Act(c), as illustrated in Fig. 6. First, we randomly select a pixel as the starting point and centralize it. Then, the search process forks into two opposite directions, striding and centralizing until it reaches the ends. This process generates two ordered point lists in two opposite directions, which are combined to produce the final central axis list that follows the course of the text and describes its shape precisely. Details of the actions are given below.

Figure 6: Mechanisms of Centralizing, Striding and Sliding

Act(a) Centralizing As shown in Fig.6, given a point on the TCL, we can draw the tangent line and the normal line, respectively denoted as dotted line and solid line. This step can be done with ease using the geometry maps. The midpoint of the intersection of the normal line and the TCL area gives the centralized point.

Act(b) Striding The algorithm takes a stride to the next point to search. With the geometry maps, the displacement for each stride is computed and represented as $(\frac{1}{2} r \cos\theta, \frac{1}{2} r \sin\theta)$ and $(-\frac{1}{2} r \cos\theta, -\frac{1}{2} r \sin\theta)$, respectively for the two directions. If the next step is outside the TCL area, we decrement the stride gradually until it is inside, or it hits the ends.

Act(c) Sliding The algorithm iterates through the central axis and draws circles along it. Radii of the circles are obtained from the $r$ map. The area covered by the circles indicates the predicted text instance.

In conclusion, taking advantage of the geometry maps and the TCL that precisely describes the course of the text instance, we can go beyond detection of text and also predict their shape and course. Besides, the striding algorithm saves our method from traversing all pixels that are related.
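The centralizing and striding actions can be sketched on a toy example as follows. This is a simplified illustration, not the authors' code: the TCL is a set of integer pixels, the geometry maps are passed as callables `theta_of` and `r_of`, the stride length is a hypothetical fraction `stride_ratio` of the local radius, the stride shrinks by one pixel at a time, and `max_steps` is a safety guard added here.

```python
import math


def centralize(p, tcl, theta_of):
    """Move p to the midpoint of the TCL cross-section along the normal at p."""
    x, y = p
    theta = theta_of((round(x), round(y)))
    nx, ny = -math.sin(theta), math.cos(theta)  # unit normal to the text direction
    ends = []
    for sign in (1, -1):
        k = 0
        # Walk pixel by pixel along the normal until the TCL area is left.
        while (round(x + sign * (k + 1) * nx), round(y + sign * (k + 1) * ny)) in tcl:
            k += 1
        ends.append((x + sign * k * nx, y + sign * k * ny))
    return ((ends[0][0] + ends[1][0]) / 2, (ends[0][1] + ends[1][1]) / 2)


def stride(p, tcl, theta_of, r_of, sign, stride_ratio):
    """Step along the tangent; shrink the step until it lands inside the TCL."""
    pixel = (round(p[0]), round(p[1]))
    theta, step = theta_of(pixel), stride_ratio * r_of(pixel)
    while step >= 1.0:
        q = (p[0] + sign * step * math.cos(theta), p[1] + sign * step * math.sin(theta))
        if (round(q[0]), round(q[1])) in tcl:
            return q
        step -= 1.0
    return None  # an end of the text instance has been reached


def trace_axis(start, tcl, theta_of, r_of, stride_ratio=0.5, max_steps=1000):
    """Return the ordered central-axis points of one TCL instance."""
    center = centralize(start, tcl, theta_of)
    halves = []
    for sign in (1, -1):
        points, p = [], center
        for _ in range(max_steps):
            q = stride(p, tcl, theta_of, r_of, sign, stride_ratio)
            if q is None:
                break
            p = centralize(q, tcl, theta_of)
            points.append(p)
        halves.append(points)
    forward, backward = halves
    return backward[::-1] + [center] + forward


# Toy TCL: a horizontal band, 3 pixels thick, with constant geometry.
tcl = {(x, y) for x in range(21) for y in (4, 5, 6)}
axis = trace_axis((10, 6), tcl, theta_of=lambda _: 0.0, r_of=lambda _: 4.0)
```

Starting from an off-center pixel, the search first snaps to the middle row and then walks to both ends, visiting only a handful of points instead of every TCL pixel.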

3.5 Label Generation

3.5.1 Extracting Text Center Line

For triangles and quadrangles, it’s easy to directly calculate the TCL with algebraic methods, since in this case, TCL is a straight line. For polygons of more than 4 sides, it’s not easy to derive a general algebraic method.

Instead, we propose a method based on the assumption that text instances are snake-shaped, i.e., that they do not fork into multiple branches. A snake-shaped text instance has two edges that are respectively the head and the tail. The two edges adjacent to the head or the tail run parallel but in opposite directions.

Figure 7: Label Generation. (a) Determining text head and tail; (b) Extracting text center line and calculating geometries; (c) Expanded text center line.

For a text instance represented by a group of vertexes $\{v_0, v_1, v_2, \ldots, v_n\}$ in clockwise or counterclockwise order, we define a measurement for each edge $e_{i,i+1}$ as $M(e_{i,i+1}) = \cos\langle e_{i+1,i+2}, e_{i-1,i} \rangle$. Intuitively, the two edges with $M$ nearest to $-1$, e.g. $AH$ and $DE$ in Fig. 7, are the head and tail. After that, an equal number of anchor points are sampled on the two sidelines, e.g. $ABCD$ and $HGFE$ in Fig. 7. TCL points are computed as midpoints of corresponding anchor points. We shrink the two ends of TCL by $\frac{1}{2} r_{end}$ pixels, so that the TCL lies inside the TR, which makes it easy for the network to learn to separate adjacent text instances. $r_{end}$ denotes the radius of the TCL points at the two ends. Finally, we expand the TCL area by $\frac{1}{5} r$, since a single-point line is prone to noise.
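The head/tail selection can be illustrated with a short sketch. It assumes the edge measurement is the cosine of the angle between the two edges adjacent to a given edge, so that edges whose neighbours run in opposite directions score close to -1; the function name `head_tail_edges` and the toy polygon are illustrative.

```python
import math


def head_tail_edges(vertices):
    """Indices i of the two edges v_i -> v_{i+1} whose neighbouring edges are most anti-parallel."""
    n = len(vertices)

    def edge(i):
        a, b = vertices[i % n], vertices[(i + 1) % n]
        return (b[0] - a[0], b[1] - a[1])

    def measure(i):
        # Cosine of the angle between the previous and the next edge.
        (px, py), (qx, qy) = edge(i - 1), edge(i + 1)
        return (px * qx + py * qy) / (math.hypot(px, py) * math.hypot(qx, qy))

    # The two edges whose measure is nearest to -1 are taken as head and tail.
    ranked = sorted(range(n), key=lambda i: abs(measure(i) + 1.0))
    return sorted(ranked[:2])


# A slightly bent strip: top sideline v0-v1-v2, bottom sideline v3-v4-v5.
polygon = [(0, 2), (5, 3), (10, 2), (10, 0), (5, 1), (0, 0)]
print(head_tail_edges(polygon))  # [2, 5]
```

For this polygon the short edges `v2 -> v3` and `v5 -> v0` are selected, leaving the two three-vertex chains as sidelines. Note that in a rectangle every edge has anti-parallel neighbours, which is consistent with the text treating quadrangles separately. The sketch does not handle zero-length edges.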

3.5.2 Calculating $r$ and $\theta$

For each point on TCL: (1) $r$ is computed as the distance to the corresponding point on sidelines; (2) $\theta$ is computed by fitting a straight line on the TCL points in the neighborhood. For non-TCL pixels, their corresponding geometry attributes are set to 0 for convenience.
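A compact sketch of the label computation for center-line points is given below. It makes two simplifying assumptions that should be read as such: anchor points are sampled evenly by arc length on each sideline, and the orientation is estimated from the two neighbouring center points instead of a line fit over a larger neighbourhood. The names `resample` and `tcl_labels` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def resample(polyline, n):
    """Return n >= 2 points evenly spaced by arc length along a polyline."""
    seg = [math.dist(a, b) for a, b in zip(polyline, polyline[1:])]
    total = sum(seg)
    points = []
    for k in range(n):
        target = total * k / (n - 1)
        i = 0
        # Skip whole segments until the target distance falls inside segment i.
        while i < len(seg) - 1 and target > seg[i]:
            target -= seg[i]
            i += 1
        t = min(1.0, target / seg[i]) if seg[i] > 0 else 0.0
        a, b = polyline[i], polyline[i + 1]
        points.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
    return points


def tcl_labels(side_a, side_b, n):
    """Return a list of (cx, cy, r, theta) for n center-line points."""
    pa, pb = resample(side_a, n), resample(side_b, n)
    centers = [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2) for a, b in zip(pa, pb)]
    labels = []
    for i, (c, a) in enumerate(zip(centers, pa)):
        # Simplification: direction between neighbouring center points.
        prev, nxt = centers[max(i - 1, 0)], centers[min(i + 1, n - 1)]
        theta = math.atan2(nxt[1] - prev[1], nxt[0] - prev[0])
        labels.append((c[0], c[1], math.dist(c, a), theta))
    return labels


# Both sidelines are listed in the same direction (left to right).
labels = tcl_labels([(0, 2), (10, 2)], [(0, 0), (10, 0)], n=3)
```

For the straight strip in the example, every center point lies halfway between the sidelines with radius 1 and orientation 0.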

3.6 Training Objectives

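The proposed model is trained end-to-end, with the following loss functions as the objectives:

$$L = L_{cls} + L_{reg} \tag{5}$$

$$L_{cls} = \lambda_1 L_{tr} + \lambda_2 L_{tcl} \tag{6}$$

$$L_{reg} = \lambda_3 L_{r} + \lambda_4 L_{\sin} + \lambda_5 L_{\cos} \tag{7}$$

$L_{cls}$ in Eq. 5 represents the classification loss for TR and TCL, and $L_{reg}$ the regression loss of $r$, $\cos\theta$ and $\sin\theta$. In Eq. 6, $L_{tr}$ and $L_{tcl}$ are the cross-entropy losses for TR and TCL. Online hard negative mining [34] is adopted for the TR loss, with the ratio between the negatives and positives kept to 3:1 at most. For TCL, we only take into account pixels inside TR and adopt no balancing methods.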
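The effect of online hard negative mining on the TR loss can be sketched as follows. This is a minimal illustration, assuming a binary cross-entropy over flattened pixels and averaging over the selected pixels; the function name `tr_loss_with_ohnm`, the `eps` clamp and the averaging choice are assumptions, while the 3:1 cap comes from the text.

```python
import math


def tr_loss_with_ohnm(probs, labels, max_neg_ratio=3, eps=1e-7):
    """Cross-entropy over all positives and only the hardest negatives.

    probs: predicted probability of "text" per pixel; labels: 0 or 1 per pixel.
    """
    pos_losses, neg_losses = [], []
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            pos_losses.append(-math.log(p))
        else:
            neg_losses.append(-math.log(1.0 - p))
    # Keep at most max_neg_ratio negatives per positive, hardest first.
    neg_losses.sort(reverse=True)
    kept = pos_losses + neg_losses[: max_neg_ratio * len(pos_losses)]
    return sum(kept) / len(kept) if kept else 0.0


probs = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4]
labels = [1, 0, 0, 0, 0, 0]
loss = tr_loss_with_ohnm(probs, labels)
```

With one positive pixel, only the three negatives with the largest loss contribute, so easy background pixels do not dominate the objective.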

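In Eq. 7, the regression losses, i.e. $L_r$, $L_{\sin}$ and $L_{\cos}$, are calculated as Smoothed-L1 loss [35]:

$$\begin{pmatrix} L_r \\ L_{\cos} \\ L_{\sin} \end{pmatrix} = \mathrm{SmoothedL1} \begin{pmatrix} \frac{\hat{r} - r}{r} \\ \cos\hat{\theta} - \cos\theta \\ \sin\hat{\theta} - \sin\theta \end{pmatrix} \tag{8}$$

where $\hat{r}$, $\cos\hat{\theta}$ and $\sin\hat{\theta}$ are the predicted values, while $r$, $\cos\theta$ and $\sin\theta$ are their corresponding ground truth. Geometry losses outside TCL are set to 0, since these attributes make no sense for non-TCL points.

The weight constants $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are all set to 1 in our experiments.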
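A minimal sketch of the geometry regression terms follows. It uses the standard Smoothed-L1 definition and assumes that the radius error is normalized by the ground-truth radius and that each term is averaged over TCL pixels; the averaging and the function names are assumptions made for illustration.

```python
def smooth_l1(x: float) -> float:
    """Standard Smoothed-L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def geometry_loss(pred, gt, tcl_mask):
    """Return (L_r, L_cos, L_sin) averaged over TCL pixels.

    pred, gt: per-pixel (r, cos, sin) tuples; tcl_mask: 0 or 1 per pixel.
    Ground-truth radii are assumed to be positive on TCL pixels.
    """
    idx = [i for i, m in enumerate(tcl_mask) if m]
    if not idx:
        return 0.0, 0.0, 0.0  # no TCL pixels, no geometry loss
    l_r = sum(smooth_l1((pred[i][0] - gt[i][0]) / gt[i][0]) for i in idx) / len(idx)
    l_cos = sum(smooth_l1(pred[i][1] - gt[i][1]) for i in idx) / len(idx)
    l_sin = sum(smooth_l1(pred[i][2] - gt[i][2]) for i in idx) / len(idx)
    return l_r, l_cos, l_sin


pred = [(12.0, 1.0, 0.0), (5.0, 0.0, 1.0)]
gt = [(10.0, 1.0, 0.0), (9.0, 1.0, 0.0)]
l_r, l_cos, l_sin = geometry_loss(pred, gt, tcl_mask=[1, 0])
l_reg = l_r + l_sin + l_cos  # all weights equal to 1, as stated in the text
```

The second pixel is off the center line, so its large errors are ignored and only the 20% radius error of the first pixel contributes.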

4 Experiments

In this section, we evaluate the proposed algorithm on standard benchmarks for scene text detection and compare it with previous methods. Analyses and discussions regarding our algorithm are also given.

4.1 Datasets

The datasets used for the experiments in this paper are briefly introduced below:

SynthText [36] is a large scale dataset that contains about 800K synthetic images. These images are created by blending natural images with text rendered with random fonts, sizes, colors, and orientations, thus these images are quite realistic. We use this dataset to pre-train our model.

TotalText [12] is a newly-released benchmark for text detection. Besides horizontal and multi-oriented text instances, the dataset specially features curved text, which rarely appears in other benchmark datasets, but is actually quite common in real environments. The dataset is split into training and testing sets with 1255 and 300 images, respectively.

CTW1500 [13] is another dataset mainly consisting of curved text. It consists of 1000 training images and 500 test images. Text instances are annotated with polygons with 14 vertexes.

ICDAR 2015 was proposed as Challenge 4 of the 2015 Robust Reading Competition [37] for incidental scene text detection. Scene text images in this dataset were taken with Google Glass without taking care of positioning, image quality, and viewpoint. This dataset features small, blurred, and multi-oriented text instances. There are 1000 images for training and 500 images for testing. The text instances from this dataset are labeled as word level quadrangles.

MSRA-TD500 [38] is a dataset with multi-lingual, arbitrary-oriented and long text lines. It includes 300 training images and 200 test images with text line level annotations. Following previous works [3, 10], we also include the images from HUST-TR400 [39] as training data when fine-tuning on this dataset, since its training set is rather small.

For experiments on ICDAR 2015 and MSRA-TD500, we fit a minimum bounding rectangle based on the output text area of our method.

4.2 Data Augmentation

Images are randomly rotated, and cropped with areas ranging from to and aspect ratios ranging from to . After that, noise, blur, and lightness are randomly adjusted. We ensure that the text on the augmented images are still legible, if they are legible before augmentation.

Figure 8: Qualitative results by the proposed method. Top: Detected text contours (in yellow) and ground truth annotations (in green). Bottom: Combined score maps for TR (in red) and TCL (in yellow). From left to right in column: image from ICDAR 2015, TotalText, CTW1500 and MSRA-TD500. Best viewed in color.

4.3 Implementation Details

Our method is implemented in TensorFlow 1.3.0. The network is pre-trained on SynthText for one epoch and fine-tuned on other datasets. We adopt the Adam optimizer [41] as our learning rate scheme. During the pre-training stage, the learning rate is fixed to . During the fine-tuning stage, the learning rate is set to initially and decays with a rate of every 5000 iterations. During fine-tuning, the number of iterations is decided by the sizes of datasets. All the experiments are conducted on a regular workstation (CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz; GPU: Titan X; RAM: 384GB). We train our model with the batch size of 32 on GPUs in parallel and evaluate our model on 1 GPU with batch size set as . Hyper-parameters are tuned by grid search on training set.

4.4 Experiment Results

Experiments on Curved Text (Total-Text and CTW1500)

Fine-tuning on these two datasets stops at about iterations. Thresholds $T_{tr}$, $T_{tcl}$ are set to and respectively on Total-Text and CTW1500. In testing, all images are rescaled to for Total-Text, while for CTW1500, the images are not resized, since the images in CTW1500 are rather small (the largest image is merely ). For comparison, we also evaluated the models of EAST [3] and SegLink [2] on Total-Text and CTW1500. The quantitative results of different methods on these two datasets are shown in Tab. 1 and Tab. 2, respectively.

Method Precision Recall F-measure
SegLink [2]
EAST [3]
Baseline (DeconvNet[42])
TextSnake 82.7 74.5 78.4
Table 1: Quantitative results of different methods evaluated on Total-Text. Note that EAST and SegLink were not fine-tuned on Total-Text. Therefore their results are included only for reference.

As shown in Tab. 1, the proposed method achieves 82.7%, 74.5%, and 78.4% in precision, recall and F-measure on Total-Text, significantly outperforming previous methods. Note that the F-measure of our method is more than double that of the baseline provided in the original Total-Text paper [12].

Method Precision Recall F-measure
SegLink [2]
EAST [3]
DMPNet [43]
CTD+TLOC[13] 77.4
TextSnake 67.9 85.3 75.6
Table 2: Quantitative results of different methods evaluated on CTW1500. Results other than ours are obtained from [13].

On CTW1500, the proposed method achieves 67.9%, 85.3%, and 75.6% in precision, recall and F-measure, respectively. Compared with CTD+TLOC, which was proposed together with the CTW1500 dataset in [13], the F-measure of our algorithm (75.6%) is higher.

The superior performances of our method on Total-Text and CTW1500 verify that the proposed representation can handle well curved text in natural images.
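As a quick consistency check on the reported numbers, the F-measure is the harmonic mean of precision and recall. The snippet below recomputes it for the TextSnake rows of Tab. 1 and Tab. 2; the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


total_text = round(f_measure(82.7, 74.5), 1)  # TextSnake on Total-Text
ctw1500 = round(f_measure(67.9, 85.3), 1)     # TextSnake on CTW1500
print(total_text, ctw1500)  # 78.4 75.6
```

Both values agree with the F-measure columns of the tables.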

Experiments on Incidental Scene Text (ICDAR 2015)

Fine-tuning on ICDAR 2015 stops at about iterations. In testing, all images are resized to . $T_{tr}$, $T_{tcl}$ are set to . Considering that images in ICDAR 2015 contain many unlabeled small text instances, predicted rectangles with the shorter side less than 10 pixels or the area less than 300 are filtered out.
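The small-rectangle filter described above can be written as a one-line predicate. The sketch assumes each prediction is summarized by the side lengths of its fitted rectangle in pixels, with the area measured in square pixels; the function name is hypothetical, while the two limits come from the text.

```python
def keep_rectangle(width: float, height: float,
                   min_side: float = 10.0, min_area: float = 300.0) -> bool:
    """Drop rectangles whose shorter side is under 10 px or whose area is under 300."""
    return min(width, height) >= min_side and width * height >= min_area


sizes = [(40, 12), (40, 8), (20, 12), (30, 10)]
kept = [s for s in sizes if keep_rectangle(*s)]
print(kept)  # [(40, 12), (30, 10)]
```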

The quantitative results of different methods on ICDAR 2015 are shown in Tab.3. With only single-scale testing, our method outperforms most competitors (including those evaluated in multi-scale). This demonstrates that the proposed representation TextSnake is general and can be readily applied to multi-oriented text in complex scenarios.

Method Precision Recall F-measure FPS
Zhang et al. [27] 70.8 43.0 53.6 0.48
CTPN [44] 74.2 51.6 60.9 7.1
Yao et al. [1] 72.3 58.7 64.8 1.61
SegLink [2] 73.1 76.8 75.0 -
EAST [3] 80.5 72.8 76.4 6.52
SSTD [45] 80.0 73.0 77.0 7.7
WordSup [8] 79.3 77.0 78.2 2
EAST [3] 83.3 78.3 80.7 -
He et al. [25] 82.0 80.0 81.0 1.1
PixelLink [46] 85.5 82.0 83.7 3.0
TextSnake 84.9 80.4 82.6 1.1
Table 3: Quantitative results of different methods on ICDAR 2015. stands for multi-scale, indicates that the base net of the model is not VGG16.

Experiments on Long Straight Text Lines (MSRA-TD500)

Fine-tuning on MSRA-TD500 stops at about iterations. Thresholds for $T_{tr}$, $T_{tcl}$ are . In testing, all images are resized to . Results are shown in Tab. 4. The F-measure (78.3%) of the proposed method is higher than that of the other methods.

Method Precision Recall F-measure FPS
Kang et al. [47] 71.0 62.0 66.0 -
Zhang et al. [27] 83.0 67.0 74.0 0.48
Yao et al. [1] 76.5 75.3 75.9 1.61
EAST [3] 81.7 61.6 70.2 6.52
EAST [3] 87.3 67.4 76.1 13.2
SegLink [2] 86.0 70.0 77.0 8.9
He et al. [25] 77.0 70.0 74.0 1.1
PixelLink [46] 83.0 73.2 77.8 3.0
TextSnake 83.2 73.9 78.3 1.1
Table 4: Quantitative results of different methods on MSRA-TD500. indicates models whose base nets are not VGG16.

4.5 Analyses and Discussions

Precise Description of Text Instances What distinguishes our method from others is its ability to predict a precise description of the shape and course of text instances (see Fig. 8).

We attribute this ability to the TCL mechanism. The text center line can be seen as a kind of skeleton that props up the text instance, with the geo-attributes providing more details. Text, as a form of written language, can be seen as a stream of signals mapped onto 2D surfaces. Naturally, it should follow a course as it extends.

Therefore we propose to predict the TCL, which is much narrower than the whole text instance. This has two advantages: (1) A slim TCL can better describe the course and shape; (2) TCLs, intuitively, do not overlap with each other, so that instance segmentation can be done in a very simple and straightforward way, thus simplifying our pipeline.

Moreover, as depicted in Fig.9, we can exploit local geometries to sketch the structure of the text instance and transform the predicted curved text instances into canonical form, which may largely facilitate the recognition stage.

Figure 9: Text instances transformed to canonical form using the predicted geometries.

Generalization Ability To further verify the generalization ability of our method, we train and fine-tune our model on datasets without curved text and evaluate it on the two benchmarks featuring curved text. Specifically, we fine-tune our models on ICDAR 2015, and evaluate them on the target datasets. The models of EAST [3], SegLink [2], and PixelLink [46] are taken as baselines, since these three methods were also trained on ICDAR 2015.

Datasets Total-Text CTW1500
Methods Precision Recall F-measure Precision Recall F-measure
SegLink[2] 35.6 33.2 34.4 33.0 28.4 30.5
EAST[3] 49.0 43.1 45.9 46.7 37.2 41.4
PixelLink [46] 53.5 52.7 53.1 50.6 42.8 46.4
TextSnake 61.5 67.9 64.6 65.4 63.4 64.4
Table 5: Comparison of cross-dataset results of different methods. The following models are fine-tuned on ICDAR 2015 and evaluated on Total-Text and CTW1500. Experiments for SegLink, EAST and PixelLink are done with the open source code. The evaluation protocol is DetEval [48], the same as Total-Text. While ICDAR 2015 and Total-Text have word-level labels, CTW1500 uses line-level ones. We deem DetEval [48] preferable to PASCAL [49], since otherwise the line-level labels of CTW1500 would significantly penalize models fine-tuned on the word-level labels of ICDAR 2015.

As shown in Tab. 5, our method still performs well on curved text and significantly outperforms the three strong competitors SegLink, EAST and PixelLink, without fine-tuning on curved text. We attribute this excellent generalization ability to the proposed flexible representation. Instead of taking text as a whole, the representation treats text as a collection of local elements and integrates them together to make decisions. Local attributes are kept when formed into a whole. Besides, they are independent of each other. Therefore, the final predictions of our method can retain most information of the shape and course of the text. We believe that this is the main reason for the capacity of the proposed text detection algorithm in hunting text instances with various shapes.

5 Conclusion and Future Work

In this paper, we present a novel, flexible representation for describing the properties of scene text with arbitrary shapes, including horizontal, multi-oriented and curved text instances. The proposed text detection method based upon this representation obtains state-of-the-art or comparable performance on two newly-released benchmarks for curved text (Total-Text and SCUT-CTW1500) as well as two widely-used datasets (ICDAR 2015 and MSRA-TD500) in this field, proving the effectiveness of the proposed method. As for future work, we would explore the direction of developing an end-to-end recognition system for text of arbitrary shapes.


  • [1] Yao, C., Bai, X., Sang, N., Zhou, X., Zhou, S., Cao, Z.: Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002 (2016)
  • [2] Shi, B., Bai, X., Belongie, S.: Detecting oriented text in natural images by linking segments. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)

  • [3] Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., Liang, J.: EAST: An efficient and accurate scene text detector. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [4] Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: Textboxes: A fast text detector with a single deep neural network. In: AAAI. (2017) 4161–4167
  • [5] Huang, L., Yang, Y., Deng, Y., Yu, Y.: Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874 (2015)
  • [6] Wu, Y., Natarajan, P.: Self-organized text detection with minimal post-processing via border learning. In: Proceedings of the IEEE Conference on CVPR. (2017) 5000–5009
  • [7] He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A.G., Kifer, D., Giles, C.L.: Multi-scale fcn with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE (2017) 474–483
  • [8] Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., Ding, E.: Wordsup: Exploiting word annotations for character based text detection. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [9] Tian, S., Lu, S., Li, C.: Wetext: Scene text detection under weak supervision. arXiv preprint arXiv:1710.04826 (2017)
  • [10] Lyu, P., Yao, C., Wu, W., Yan, S., Bai, X.: Multi-oriented scene text detection via corner localization and region segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. (2018)
  • [11] Sheng, Z., Yuliang, L., Lianwen, J., Canjie, L.: Feature enhancement network: A refined scene text detector. In: Proceedings of AAAI, 2018. (2018)
  • [12] Kheng Chng, C., Chan, C.S.: Total-text: A comprehensive dataset for scene text detection and recognition. arXiv preprint arXiv:1710.10400 (2017)
  • [13] Yuliang, L., Lianwen, J., Shuaitao, Z., Sheng, Z.: Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170 (2017)
  • [14] Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE (2010) 2963–2970
  • [15] Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Asian Conference on Computer Vision, Springer (2010) 770–783
  • [16] Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In: European conference on computer vision, Springer (2014) 512–528
  • [17] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116(1) (2016) 1–20
  • [18] Ye, Q., Doermann, D.: Text detection and recognition in imagery: A survey. IEEE transactions on pattern analysis and machine intelligence 37(7) (2015) 1480–1500
  • [19] Zhu, Y., Yao, C., Bai, X.: Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science 10(1) (2016) 19–36
  • [20] Yin, X.C., Yin, X., Huang, K., Hao, H.W.: Robust text detection in natural scene images. IEEE transactions on pattern analysis and machine intelligence 36(5) (2014) 970–983
  • [21] Huang, W., Qiao, Y., Tang, X.: Robust scene text detection with convolution neural network induced mser trees. In: European Conference on Computer Vision, Springer (2014) 497–511
  • [22] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European conference on computer vision, Springer (2016) 21–37
  • [23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
  • [24] Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., Xue, X.: Arbitrary-oriented scene text detection via rotation proposals. arXiv preprint arXiv:1703.01086 (2017)
  • [25] He, W., Zhang, X.Y., Yin, F., Liu, C.L.: Deep direct regression for multi-oriented scene text detection. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [26] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431–3440
  • [27] Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., Bai, X.: Multi-oriented text detection with fully convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4159–4167
  • [28] Zhang, Z., Shen, W., Yao, C., Bai, X.: Symmetry-based text line detection in natural scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 2558–2567
  • [29] Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017)
  • [30] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer International Publishing (2015)
  • [31] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [32] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [33] Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2010) 2528–2535
  • [34] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. (2016) 761–769
  • [35] Girshick, R.: Fast r-cnn. In: The IEEE International Conference on Computer Vision (ICCV). (December 2015)
  • [36] Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2315–2324
  • [37] Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: Icdar 2015 competition on robust reading. In: Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, IEEE (2015) 1156–1160
  • [38] Yao, C., Bai, X., Liu, W., Ma, Y., Tu, Z.: Detecting texts of arbitrary orientations in natural images. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1083–1090
  • [39] Yao, C., Bai, X., Liu, W.: A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23(11) (2014) 4737–4749
  • [40] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. Volume 16. (2016) 265–283
  • [41] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of ICLR. (2015)
  • [42] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. (2015) 1520–1528
  • [43] Liu, Y., Jin, L.: Deep matching prior network: Toward tighter multi-oriented text detection. (2017)
  • [44] Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: European Conference on Computer Vision, Springer (2016) 56–72
  • [45] He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., Li, X.: Single shot text detector with regional attention. In: The IEEE International Conference on Computer Vision (ICCV). (Oct 2017)
  • [46] Deng, D., Liu, H., Li, X., Cai, D.: Pixellink: Detecting scene text via instance segmentation. AAAI (2018)
  • [47] Kang, L., Li, Y., Doermann, D.: Orientation robust text line detection in natural images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 4034–4041
  • [48] Wolf, C., Jolion, J.M.: Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR) 8(4) (2006) 280–296
  • [49] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111(1) (2015) 98–136