Scene Text Detection and Recognition: The Deep Learning Era

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advances in mindset, approach, and performance. This survey aims to summarize and analyze the major changes and significant progress in scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect this review paper to serve as a reference for researchers in this field. Related resources are also collected and compiled in our Github repository:




1 Introduction

Undoubtedly, text is among the most brilliant and influential creations of humankind. As the written form of human languages, text makes it feasible to reliably and effectively spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.

On the one hand, text, as a vital tool for communication and collaboration, has been playing a more important role than ever in modern society; on the other hand, the rich and precise high-level semantics embodied in text can be beneficial for understanding the world around us. For example, text information can be used in a wide range of real-world applications, such as image search [134, 116], instant translation [23, 102], robot navigation [21, 79, 80, 117], and industrial automation [39, 47, 16]. Therefore, automatic text reading from natural environments (a schematic diagram is depicted in Fig. 1), a.k.a. scene text detection and recognition [172] or PhotoOCR [8], has become an increasingly popular and important research topic in computer vision.

Fig. 1: Schematic diagram of scene text detection and recognition. The image sample is from Total-Text [15].

However, despite years of research, a series of grand challenges may still be encountered when detecting and recognizing text in the wild. The difficulties mainly stem from three aspects [172]:

Diversity and Variability of Text in Natural Scenes: Unlike scripts in documents, text in natural scenes exhibits much higher diversity and variability. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations, and shapes. Moreover, the aspect ratios and layouts of scene text may vary significantly. All these variations pose challenges for detection and recognition algorithms designed for text in natural scenes.

Complexity and Interference of Backgrounds: Backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may lead to confusion and mistakes.

Imperfect Imaging Conditions: In uncontrolled circumstances, the quality of text images and videos cannot be guaranteed. That is, in poor imaging conditions, text instances may have low resolution and severe distortion due to an inappropriate shooting distance or angle, be blurred because of defocus or shaking, be noisy on account of low light levels, or be corrupted by highlights or shadows.

These difficulties persisted through the years before deep learning showed its potential in computer vision as well as in other fields. As deep learning came to prominence after AlexNet [68] won the ILSVRC2012 [115] contest, researchers turned to deep neural networks for automatic feature learning and began more in-depth studies. The community is now working on ever more challenging targets. The progress made in recent years can be summarized as follows:

Incorporation of Deep Learning: Nearly all recent methods are built upon deep learning models. Most importantly, deep learning frees researchers from the exhausting work of repeatedly designing and testing hand-crafted features, which has given rise to a flourishing body of work that pushes the envelope further. Specifically, the use of deep learning substantially simplifies the overall pipeline. Besides, these algorithms provide significant improvements over previous ones on standard benchmarks. Gradient-based training routines also facilitate end-to-end trainable methods.

Target-Oriented Algorithms and Datasets: Researchers are now turning to more specific aspects and targets. To address difficulties in real-world scenarios, newly published datasets are collected with unique and representative characteristics. For example, there are datasets that feature long text, blurred text, and curved text, respectively. Driven by these datasets, almost all algorithms published in recent years are designed to tackle specific challenges. For instance, some are proposed to detect oriented text, while others aim at blurred and unfocused scene images. These ideas are also combined to make more general-purpose methods.

Advances in Auxiliary Technologies: Apart from new datasets and models devoted to the main task, auxiliary technologies that do not solve the task directly, such as synthetic data and bootstrapping, also find their place in this field.

In this survey, we present an overview of recent development in scene text detection and recognition with focus on the deep learning era. We review methods from different perspectives, and list the up-to-date datasets. We also analyze the status quo and predict future research trends.

There are already several excellent review papers [136, 154, 160, 172] that also survey and analyze works related to text detection and recognition. However, these papers were published before deep learning came to prominence in this field, and therefore mainly focus on more traditional, feature-based methods. We refer readers to these papers as well for a more comprehensive view and knowledge of the history. This article mainly concentrates on text information extraction from still images, rather than videos. For scene text detection and recognition in videos, please also refer to [60, 160].

The remaining parts of this paper are arranged as follows. In Section 2, we briefly review the methods before the deep learning era. In Section 3, we list and summarize algorithms based on deep learning in a hierarchical order. In Section 4, we take a look at the datasets and evaluation protocols. Finally, we present potential applications and our own opinions on the current status and future trends.

2 Methods before the Deep Learning Era

2.1 Overview

In this section, we take a brief retrospective look at algorithms before the deep learning era. More detailed and comprehensive coverage of these works can be found in [136, 154, 160, 172]. For both text detection and recognition, attention was focused on the design of features.

In this period, most text detection methods adopted either Connected Components Analysis (CCA) [52, 98, 24, 135, 159, 156, 58] or Sliding Window (SW) based classification [70, 142, 17, 144]. CCA based methods first extract candidate components in a variety of ways (e.g., color clustering or extremal region extraction), and then filter out non-text components using manually designed rules or classifiers automatically trained on hand-crafted features (see Fig. 2). In sliding window classification methods, windows of varying sizes slide over the input image, and each window is classified as a text segment/region or not. Those classified as positive are further grouped into text regions with morphological operations [70], Conditional Random Fields (CRF) [142], or other graph based methods [17, 144].
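The sliding-window scheme can be illustrated with a minimal Python sketch. The function names, the window sizes, the stride, and the `is_text` classifier are hypothetical placeholders and are not taken from any of the cited methods; a real system would classify hand-crafted features extracted from each window and then group the positive windows.

```python
from typing import Callable, Iterator, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(image_w: int, image_h: int,
                    sizes: List[Tuple[int, int]],
                    stride: int) -> Iterator[Window]:
    """Yield every window of each size that fits fully inside the image."""
    for win_w, win_h in sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, win_w, win_h)


def detect_text_windows(image_w: int, image_h: int,
                        sizes: List[Tuple[int, int]],
                        stride: int,
                        is_text: Callable[[Window], bool]) -> List[Window]:
    """Keep the windows that the (hypothetical) classifier marks as text."""
    return [w for w in sliding_windows(image_w, image_h, sizes, stride)
            if is_text(w)]
```

The number of windows grows with the number of sizes and the inverse square of the stride, which is one reason such pipelines tend to be slow.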

Fig. 2: Illustration of traditional methods with hand-crafted features: (1) Maximally Stable Extremal Regions (MSER) [98], assuming chromatic consistency within each character; (2) Stroke Width Transform (SWT) [24], assuming consistent stroke width within each character.

For text recognition, one branch adopted feature-based methods. Shi et al. [126] and Yao et al. [153] proposed character segment based recognition algorithms. Rodriguez et al. [110, 109], Gordo et al. [35], and Almazan et al. [3] utilized label embedding to directly perform matching between strings and images. Strokes [10] and character key-points [104] are also detected as features for classification. Another branch decomposed the recognition process into a series of sub-problems. Various methods have been proposed to tackle these sub-problems, which include text binarization [167, 93, 139, 71], text line segmentation [155], character segmentation [101, 127, 114], single character recognition [12, 120], and word correction [165, 138, 94, 62, 145].

Fig. 3: Overview of recent progress and dominant trends.

There have been efforts devoted to integrated (i.e. end-to-end as we call it today) systems as well [142, 97]. In Wang et al. [142], characters are considered as a special case in object detection and detected by a nearest neighbor classifier trained on HOG features [19] and then grouped into words through a Pictorial Structure (PS) based model [26]. Neumann and Matas [97] proposed a decision delay approach by keeping multiple segmentations of each character until the last stage when the context of each character is known. They detected character segmentations using extremal regions and decoded recognition results through a dynamic programming algorithm.

In summary, text detection and recognition methods before the deep learning era mainly extract low-level or mid-level hand-crafted image features, which entails demanding and repetitive pre-processing and post-processing steps. Constrained by the limited representation ability of hand-crafted features and the complexity of pipelines, those methods can hardly handle intricate circumstances, e.g. blurred images in the ICDAR2015 dataset [63].

3 Methodology in the Deep Learning Era

Fig. 4: Typical pipelines of scene text detection and recognition. (a) [55] and (b) [152] are representative multi-step methods. (c) and (d) are simplified pipelines. (c) [168] contains only a detection branch, and is therefore used together with a separate recognition model. (d) [81, 45] jointly train a detection model and a recognition model.

As implied by the title of this section, we would like to address recent advances as changes in methodology instead of merely new methods. Our conclusion is grounded in the observations as explained in the following paragraph.

Methods in recent years are characterized by the following two distinctions: (1) most methods utilize deep-learning based models; (2) most researchers approach the problem from a diversity of perspectives. Methods driven by deep learning enjoy the advantage that automatic feature learning saves us from designing and testing a large number of potential hand-crafted features. At the same time, researchers with different viewpoints are enriching the field and pushing the community toward more in-depth work aimed at different targets, e.g. faster and simpler pipelines [168], text of varying aspect ratios [121], and synthetic data [38]. As we will see further in this section, the incorporation of deep learning has fundamentally changed the way researchers approach the task and has greatly enlarged the scope of research. This is the most significant change compared to the former epoch.

In a nutshell, recent years have witnessed a blossoming expansion of research into distinct sub-trends. We summarize these changes and trends in Fig. 3, and we follow this diagram in our survey.

In this section, we classify existing methods into a hierarchical taxonomy and introduce them in a top-down style. First, we divide them into four kinds of systems: (1) text detection, which detects and localizes text in natural images; (2) recognition systems, which transcribe and convert the content of the detected text regions into linguistic symbols; (3) end-to-end systems, which perform both text detection and recognition in one single pipeline; (4) auxiliary methods, which aim to support the main task of text detection and recognition, e.g. synthetic data generation and image deblurring. Under each system, we review recent methods from different perspectives.

3.1 Detection

There are three main trends in the field of text detection, and we introduce them one by one in the following sub-sections. They are: (1) pipeline simplification; (2) changes in prediction units; (3) specified targets.

3.1.1 Pipeline Simplification

One of the important trends is the simplification of the pipeline, as shown in Fig. 4. Most methods before the era of deep learning, and some early methods that use deep learning, have multi-step pipelines. More recent methods have simplified and much shorter pipelines, which is key to reducing error propagation and simplifying the training process. More recently, separately trained two-stage methods have been surpassed by jointly trained ones. The main components of these methods are end-to-end differentiable modules, which is an outstanding characteristic.

Multi-step methods: Early deep-learning based methods [152, 166, 41] cast the task of text detection into a multi-step process. In [152], a convolutional neural network is used to predict, for each pixel in the input image, (1) whether it belongs to a character, (2) whether it is inside the text region, and (3) the text orientation around the pixel. Connected positive responses are considered as a detection of a character or text region. For characters belonging to the same text region, Delaunay triangulation [61] is applied, after which graph partition based on the predicted orientation attribute groups characters into text lines.

Similarly, [166] first predicts a dense map indicating which pixels are within text line regions. For each text line region, MSER [99] is applied to extract character candidates. Character candidates reveal information about the scale and orientation of the underlying text line. Finally, the minimum bounding box is extracted as the final text line candidate.

In [41], the detection process also consists of several steps. First, text blocks are extracted. Then the model crops each extracted text block and focuses only on it to extract the text center line (TCL), which is defined to be a shrunk version of the original text line. Each text line represents the existence of one text instance. The extracted TCL map is then split into several TCLs. Each split TCL is then concatenated to the original image. A semantic segmentation model then classifies each pixel as either belonging to the same text instance as the given TCL or not.

Simplified pipeline: More recent methods [44, 59, 73, 82, 121, 163, 90, 111, 74, 119] follow a 2-step pipeline, consisting of an end-to-end trainable neural network model and a post-processing step that is usually much simpler than previous ones. These methods mainly draw inspiration from techniques in general object detection [76, 27, 31, 30, 107, 42], and benefit from the highly integrated neural network modules that can predict text instances directly. There are mainly two branches: (1) anchor-based methods [44, 73, 82, 121] that predict the existence of text and regress the location offset only at pre-defined grid points of the input image; (2) region proposal methods [163, 74, 59, 90, 111, 119] that predict and regress on the basis of extracted image regions.

Since the original targets of most of these works are not merely the simplification of pipeline, we only introduce some representative methods here. Other works will be introduced in the following parts.

Anchor-based methods draw inspiration from SSD [76], a general object detection network. As shown in Fig. 5 (b), a representative work, TextBoxes [73], adapts the SSD network specifically to fit the varying orientations and aspect ratios of text lines. Specifically, at each anchor point, default boxes are replaced by default quadrilaterals, which can capture the text line more tightly and reduce noise.

A variant of the standard anchor-based default box prediction method is EAST [168]. In the standard SSD network, there are several feature maps of different sizes, on which default boxes of different receptive fields are detected. In EAST, all feature maps are integrated by gradual upsampling, specifically with a U-Net [113] structure. The final feature map is a multi-channel map at a fraction of the original input resolution. Under the assumption that each pixel belongs to only one text line, each pixel on the final feature map, i.e. each position of the feature tensor, is used to regress the rectangular or quadrilateral bounding box of the underlying text line. Specifically, the existence of text, i.e. text/non-text, and the geometries, e.g. orientation and size for rectangles and vertex coordinates for quadrilaterals, are predicted. EAST makes a difference to the field of text detection with its highly simplified pipeline and its efficiency. Since EAST is best known for its speed, we revisit it in later parts, with emphasis on its efficiency.

Region proposal methods usually follow the standard object detection framework of R-CNN [31, 30, 107], where a simple and fast pre-processing method is applied to extract a set of region proposals that could contain text lines. A neural network then classifies each proposal as text/non-text and corrects the localization by regressing the boundary offsets. However, adaptations are necessary.

Rotation Region Proposal Networks [90] follow and adapt the standard Faster R-CNN framework. To fit text of arbitrary orientations, rotated region proposals are generated instead of the standard axis-aligned rectangles.
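As a rough illustration of how rotated proposals extend axis-aligned anchors, the sketch below enumerates anchors over scales, aspect ratios, and angles at one location. The `(cx, cy, width, height, angle)` parameterization, the function name, and all numeric values are assumptions made for this example and are not taken from [90].

```python
import math
from itertools import product
from typing import List, Tuple

# (center x, center y, width, height, angle in degrees); layout is assumed.
RotatedAnchor = Tuple[float, float, float, float, float]


def rotated_anchors(cx: float, cy: float,
                    scales: List[float],
                    aspect_ratios: List[float],
                    angles_deg: List[float]) -> List[RotatedAnchor]:
    """Enumerate one anchor per (scale, aspect ratio, angle) combination."""
    anchors = []
    for scale, ratio, angle in product(scales, aspect_ratios, angles_deg):
        width = scale * math.sqrt(ratio)   # ratio = width / height
        height = scale / math.sqrt(ratio)  # area stays at scale ** 2
        anchors.append((cx, cy, width, height, angle))
    return anchors
```

Compared with axis-aligned anchors, the extra angle dimension multiplies the number of anchors per location by the number of orientations considered.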

Similarly, R2CNN [59] modifies the standard region proposal based object detection methods. To adapt to the varying aspect ratios, three Region of Interest (RoI) poolings of different sizes are used and concatenated for further prediction and regression. In FEN [119], adaptively weighted poolings are applied to integrate different pooling sizes. The final prediction is made by leveraging the textness scores for poolings of different sizes.

Fig. 5: High-level illustration of existing anchor/RoI-pooling based methods: (a) Similar to YOLO [105], predicting at each anchor position. Representative methods include rotating default boxes [82]. (b) Variants of SSD [76], including TextBoxes [73], predicting at feature maps of different sizes. (c) Direct regression of bounding boxes [168], also predicting at each anchor position. (d) Region proposal based methods, including rotating Regions of Interest (RoI) [90] and RoIs of varying aspect ratios [59].

3.1.2 Different Prediction Units

A main distinction between text detection and general object detection is that text is homogeneous as a whole and shows locality, while general objects are not. By homogeneity and locality, we refer to the property that any part of a text instance is still text. Humans do not have to see the whole text instance to know that it belongs to some text.

Such a property lays a cornerstone for a new branch of text detection methods that only predict sub-text components and then assemble them into a text instance.

In this part, we take the perspective of the granularity of text detection. There are two main levels of prediction granularity: text instance level and sub-text level.

In text instance level methods [163, 73, 74, 59, 90, 82, 119, 18, 46, 168], detection of text follows the standard routine of general object detection, where a region-proposal network and a refinement network are combined to make predictions. The region-proposal network produces an initial, coarse guess for the localization of possible text instances, and then a refinement part classifies the proposals as text/non-text and also corrects the localization of the text.

In contrast, sub-text level detection methods [89, 20, 148, 152, 41, 44, 40, 166, 121, 133, 140, 171] predict only parts that are then combined to make a text instance. Such sub-text units are mainly at the pixel level and the component level.

In pixel-level methods [20, 148, 152, 41, 44, 166], an end-to-end fully convolutional neural network learns to generate a dense prediction map indicating whether each pixel in the original image belongs to any text instance or not. Post-processing methods then group pixels together depending on which pixels belong to the same text instance. Since text can appear in clusters, which makes predicted pixels connected to each other, the core of pixel-level methods is to separate text instances from each other. PixelLink [20] learns to predict whether two adjacent pixels belong to the same text instance by adding link prediction to each pixel. The border learning method [148] classifies each pixel into one of three categories: text, border, and background, assuming that the border can separate text instances well. In Holistic [152], pixel-prediction maps include both the text-block level and the character-center level. Since the centers of characters do not overlap, the separation is done easily.

In this part we only intend to introduce the concept of prediction units. We return to the details regarding the separation of text instances in the section on Specific Targets.

Component-level methods [89, 121, 133, 171, 40, 140] usually predict at a medium granularity. A component refers to a local region of a text instance, sometimes containing one or more characters.

As shown in Fig. 6 (a), SegLink [121] modifies the original framework of SSD [76]. Instead of default boxes that represent whole objects, default boxes used in SegLink have only one aspect ratio and predict whether the covered region belongs to any text instance or not. Such a region is called a text segment. Besides, links between default boxes are predicted, indicating whether the linked segments belong to the same text instance.

Corner localization [89] proposes to detect the corners of each text instance. Since each text instance has only four corners, the prediction results and their relative positions can indicate which corners should be grouped into the same text instance.

SegLink[121] and Corner localization[89] are proposed specially for long and multi-oriented text. We only introduce the idea here and discuss more details in the section of Specific Targets, regarding how they are realized.

In a clustering based method [140], pixels are clustered according to their color consistency and edge information. The fused image segments are called superpixels. These superpixels are further used to extract characters and predict text instances.

Another branch of component-level methods is the Connectionist Text Proposal Network (CTPN) [133, 171, 147]. CTPN models inherit the ideas of anchoring and of recurrent neural networks for sequence labeling. They usually consist of a CNN-based image classification network, e.g. VGG, with an RNN stacked on top of it. Each position in the final feature map represents features in the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Geometries are also predicted.

3.1.3 Specific Targets

Another characteristic of current text detection systems is that most of them are designed for special purposes, attempting to address unique difficulties in detecting scene text. We broadly classify them into the following aspects.

Long Text

Unlike general objects, text usually comes in varying aspect ratios. Text instances often have much more extreme aspect ratios, and thus general object detection frameworks would fail. Several methods [121, 59, 89] have been specially designed to detect long text.

R2CNN [59] gives an intuitive solution, where ROI poolings with different sizes are used. Following the framework of Faster R-CNN [107], three ROI poolings with different pooling sizes are performed for each box generated by the region-proposal network, and the pooled features are concatenated to compute the textness score.
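To make the idea of multi-size ROI pooling concrete, the following sketch max-pools a single-channel feature region into several grid sizes and concatenates the flattened results. It is a simplified illustration in plain Python: the function names are hypothetical, the region is a nested list instead of a tensor, and the pooled sizes used in the test values are chosen for the example only, not the sizes used in [59].

```python
from typing import List, Tuple

Grid = List[List[float]]


def roi_max_pool(region: Grid, out_h: int, out_w: int) -> Grid:
    """Max-pool a 2-D feature region into an out_h x out_w grid."""
    in_h, in_w = len(region), len(region[0])
    pooled = []
    for i in range(out_h):
        top = (i * in_h) // out_h
        bottom = -(-((i + 1) * in_h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            left = (j * in_w) // out_w
            right = -(-((j + 1) * in_w) // out_w)  # ceiling division
            row.append(max(region[y][x]
                           for y in range(top, bottom)
                           for x in range(left, right)))
        pooled.append(row)
    return pooled


def multi_size_roi_features(region: Grid,
                            pool_sizes: List[Tuple[int, int]]) -> List[float]:
    """Pool the same region at several sizes and concatenate the results."""
    features: List[float] = []
    for out_h, out_w in pool_sizes:
        for row in roi_max_pool(region, out_h, out_w):
            features.extend(row)
    return features
```

A wide, short pooling grid preserves horizontal detail for long horizontal text, while a tall, narrow one does the same for vertical text, which is the intuition behind combining several sizes.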

Another branch learns to detect local sub-text components which are independent from the whole text [121, 89, 20]. SegLink [121] proposes to detect components, i.e. square areas that are text, and how these components are linked to each other. PixelLink [20] predicts which pixels belong to any text and whether adjacent pixels belong to the same text instances. Corner localization [89] detects text corners. All these methods learn to detect local components and then group them together to make final detections.

Fig. 6: Illustration of representative bottom-up methods: (a) SegLink [121]: with SSD as the base network, predict word segments at each anchor position, and connections between adjacent anchors. (b) PixelLink [20]: for each pixel, predict text/non-text classification and whether it belongs to the same text as adjacent pixels or not. (c) Corner Localization [89]: predict the four corners of each text and group those belonging to the same text instance. (d) TextSnake [85]: predict text/non-text and local geometries, which are used to reconstruct the text instance.

Multi-Oriented Text

Another distinction from general object detection is that text detection is rotation-sensitive and skewed text is common in the real world, while traditional axis-aligned prediction boxes would incorporate noisy background that affects the performance of the subsequent text recognition module. Several methods have been proposed to adapt to it [59, 73, 82, 90, 121, 168, 74, 141].

Extending from general anchor-based methods, rotating default boxes [82, 73] are used, with predicted rotation offset. Similarly, rotating region proposals [90] are generated with different orientations. Regression-based methods [168, 121, 59] predict the rotation and positions of vertexes, which are insensitive to orientation. Further, in Liao et al. [74], rotating filters [169] are incorporated to model orientation-invariance explicitly. The peripheral weights of filters rotate around the center weight, to capture features that are sensitive to rotation.
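The relation between a rotated box and its vertexes can be written down directly. The sketch below converts a box given as center, size, and rotation angle into its four corner points; the parameter layout and function name are assumptions for illustration, and individual methods differ in how they encode orientation.

```python
import math
from typing import List, Tuple


def rotated_box_corners(cx: float, cy: float,
                        width: float, height: float,
                        angle_rad: float) -> List[Tuple[float, float]]:
    """Return the four corners of a box rotated by angle_rad about its center."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_w, half_h = width / 2.0, height / 2.0
    corners = []
    # Corner offsets listed in order around the unrotated box.
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners
```

Predicting the vertexes directly avoids the angle parameter altogether, at the cost of regressing eight values instead of five.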

While the aforementioned methods may entail additional post-processing, Wang et al. [141] propose to use a parametrized Instance Transformation Network (ITN) that learns to predict an appropriate affine transformation to perform on the last feature layer extracted by the base network, in order to rectify oriented text instances. Their method, with ITN, can be trained end-to-end.

Fig. 7: (a)-(c): Representing text as horizontal rectangles, oriented rectangles, and quadrilaterals. (d): The sliding-disk representation proposed in TextSnake [85].

Text of Irregular Shapes

Apart from varying aspect ratios, another distinction is that text can have a diversity of shapes, e.g. curved text. Curved text poses a new challenge, since a regular rectangular bounding box would incorporate a large proportion of background and even other text instances, making recognition difficult.

Extending from the quadrilateral bounding box, it is natural to use bounding 'boxes' with more than four vertexes. Bounding polygons [163] with a larger number of vertexes are proposed, followed by a Bi-LSTM [48] layer to refine the coordinates of the predicted vertexes. In their framework, however, axis-aligned rectangles are extracted as intermediate results in the first step, and the locations of the bounding polygons are predicted upon them.

Similarly, Lyu et al. [88] modify the Mask R-CNN [42] framework, so that for each region of interest (in the form of an axis-aligned rectangle), character masks are predicted separately for each character class. These predicted characters are then aligned together to form a polygon as the detection result. Notably, they propose their method as an end-to-end system. We refer to it again in the following part.

Viewing the problem from a different perspective, Long et al. [85] argue that text can be represented as a series of sliding round disks along the text center line (TCL), which accords with the running direction of the text instance, as shown in Fig. 7. With this novel representation, they present a new model, TextSnake, as shown in Fig. 6 (d), that learns to predict local attributes, including TCL/non-TCL, text-region/non-text-region, radius, and orientation. The intersection of TCL pixels and text region pixels gives the final prediction of the pixel-level TCL. Local geometries are then used to extract the TCL in the form of an ordered point list, as demonstrated in Fig. 6 (d). With the TCL and radius, the text line is reconstructed. It achieves state-of-the-art performance on several curved text datasets as well as more widely used ones, e.g. ICDAR2015 [63] and MSRA-TD500 [135]. Notably, Long et al. propose a cross-validation test across different datasets, where models are only fine-tuned on datasets with straight text instances and tested on the curved datasets. On all existing curved datasets, TextSnake achieves improvements over other baselines in F1-score.
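The final reconstruction step of the sliding-disk representation can be sketched as the union of disks placed along the center line. The code below is a simplified illustration rather than the TextSnake implementation: it assumes the ordered center points and per-point radii have already been extracted, and the function name and the binary-mask output format are choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def reconstruct_text_mask(height: int, width: int,
                          centers: Sequence[Tuple[float, float]],
                          radii: Sequence[float]) -> List[List[int]]:
    """Rasterize the union of disks centered on the text center line.

    centers holds (x, y) points along the center line; radii holds the
    predicted disk radius at each point.
    """
    mask = [[0] * width for _ in range(height)]
    for (cx, cy), radius in zip(centers, radii):
        # Restrict the scan to the disk's bounding box, clipped to the image.
        y_lo = max(0, math.floor(cy - radius))
        y_hi = min(height - 1, math.ceil(cy + radius))
        x_lo = max(0, math.floor(cx - radius))
        x_hi = min(width - 1, math.ceil(cx + radius))
        for y in range(y_lo, y_hi + 1):
            for x in range(x_lo, x_hi + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    mask[y][x] = 1
    return mask
```

Because each disk follows the local center point and radius, the resulting region can bend with the text, which axis-aligned or quadrilateral boxes cannot do.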


Current text detection methods place more emphasis on speed and efficiency, which is necessary for application in mobile devices.

The first work to gain a significant speedup is EAST [168], which makes several modifications to previous frameworks. Instead of VGG [129], EAST uses PVANet [67] as its base network, which strikes a good balance between efficiency and accuracy in the ImageNet competition. Besides, it simplifies the whole pipeline into a prediction network and a non-maximum suppression step. The prediction network is a U-shaped [113] fully convolutional network that maps an input image to a feature map in which each position holds a feature vector describing the predicted text instance, that is, the location of the vertexes or edges, the orientation, and the offsets of the center for the text instance corresponding to that feature position. Feature vectors that correspond to the same text instance are merged by non-maximum suppression. It achieves state-of-the-art speed as well as leading performance on most datasets.
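For reference, the non-maximum suppression step can be sketched in its standard greedy form. This is a generic version that assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and a hypothetical threshold of 0.5; it is not the exact variant used in EAST, which operates on rotated or quadrilateral geometries.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes: Sequence[Box],
                            scores: Sequence[float],
                            iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With dense per-pixel predictions, the number of candidate boxes is large, so the cost of this quadratic greedy loop becomes a practical concern for fast detectors.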

Fig. 8: Frameworks of text recognition models. The basic methodology is to first resize the cropped image to a fixed height, then extract features and feed them to an RNN that produces a character prediction for each column. As the number of columns of the features is not necessarily equal to the length of the word, the CTC technique [36] is used as a post-processing stage. (a) RNN stacked with CNN [122]; (b) Sequence prediction with FCN [28]; (c) Attention-based models [14, 29, 69, 150, 123, 83], allowing decoding of text of varying lengths; (d) Cheng et al. [13] proposed to apply supervision to the attention module; (e) To alleviate the misalignment problem in previous methods with fixed-length decoding with attention, Edit Probability [6] is proposed to reorder the predicted sequential distribution.

Easy Instance Segmentation

As mentioned above, recent years have witnessed methods with dense predictions, i.e. pixel-level predictions [20, 41, 148, 103]. These methods generate a prediction map classifying each pixel as text or non-text. However, as text instances may lie close to each other, pixels of different text instances may be adjacent in the prediction map. Therefore, separating pixels becomes important.

A pixel-level text center line is proposed in [41], since center lines are far from each other. In [41], a prediction map indicating text lines is predicted. These text lines can be easily separated as they are not adjacent. To produce a prediction for a text instance, a binary map of its text center line is attached to the original input image and fed into a classification network. A saliency mask is generated indicating the detected text. However, this method involves several steps. The text-line generation step and the final prediction step cannot be trained end-to-end, and errors propagate.

Another way to separate different text instances is to use the concept of border learning [148, 103, 149], where each pixel is classified into one of three classes: text, non-text, and text border. The text border then separates text pixels that belong to different instances. Similarly, in the work of Xue et al. [149], text is considered to be enclosed by segments, i.e. a pair of long-side borders (abdomen and back) and a pair of short-side borders (head and tail). The method of Xue et al. is also the first to use DenseNet [51] as its base network, which provides a consistent performance boost in F1-score over ResNet [43] on all datasets on which it is evaluated.

Following the linking idea of SegLink, PixelLink [20] learns to link pixels belonging to the same text instance. Text pixels are efficiently grouped into different instances via a disjoint-set algorithm. Treating the task in the same way, Liu et al. [84] propose a method for predicting the composition of adjacent pixels with Markov Clustering [137] instead of neural networks. The Markov Clustering algorithm is applied to the saliency map of the input image, which is generated by neural networks and indicates whether each pixel belongs to any text instance. The clustering results then give the segmented text instances.
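The grouping of linked pixels can be sketched with a disjoint-set (union-find) structure. The example below is a hedged illustration rather than PixelLink's code: the function name `group_pixels`, the `(row, col)` pixel format, and the explicit list of positive links are assumptions standing in for the network's thresholded outputs.

```python
def group_pixels(text_pixels, links):
    """Group text pixels into instances using a disjoint-set structure.

    text_pixels: list of (row, col) pixels classified as text.
    links:       list of (pixel_a, pixel_b) pairs predicted as linked.
    Returns a sorted list of instances, each a sorted list of pixels.
    """
    parent = {p: p for p in text_pixels}

    def find(p):
        # Walk to the root, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in links:
        # Only links between two text pixels are used.
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for p in text_pixels:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


pixels = [(0, 0), (0, 1), (0, 2), (5, 5), (5, 6)]
links = [((0, 0), (0, 1)), ((0, 1), (0, 2)), ((5, 5), (5, 6))]
instances = group_pixels(pixels, links)
```

Each connected set of linked pixels becomes one text instance, so two words that are close but not linked stay separate.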

Retrieving Designated Text

Different from the classical setting of scene text detection, sometimes we want to retrieve a certain text instance given a description. Rong et al. [112] propose a multi-encoder framework to retrieve text as designated. Specifically, text is retrieved as required by a natural language query. The multi-encoder framework includes a Dense Text Localization Network (DTLN) and a Context Reasoning Text Retrieval (CRTR) model. DTLN uses an LSTM to decode the features of an FCN into a sequence of text instances. CRTR encodes the query and the features of the scene text image to rank the candidate text regions generated by DTLN. To the best of our knowledge, this is the first work that retrieves text according to a query.

Against Complex Background

Attention mechanism is introduced to silence the complex background [44]. The stem network is similar to that of the standard SSD framework predicting word boxes, except that it applies inception blocks on its cascading feature maps, obtaining what’s called Aggregated Inception Feature (AIF). An additional text attention module is added, which is again based on inception blocks. The attention is applied on all AIF, reducing the noisy background.

3.2 Recognition

In this section, we introduce methods that tackle the text recognition problem. The inputs to these methods are cropped text instance images which contain one word or one text line.

In traditional text recognition methods [8, 127], the task is divided into three steps: image pre-processing, character segmentation, and character recognition. Character segmentation is considered the most challenging part due to the complex background and irregular arrangement of scene text, and it largely constrains the performance of the whole recognition system. Two major techniques are adopted to avoid segmentation of characters, namely Connectionist Temporal Classification (CTC) [36] and the attention mechanism. We introduce recognition methods in the literature based on which technique they employ, while other novel work will also be presented. Mainstream frameworks are illustrated in Fig.8.

3.2.1 CTC-based Methods

CTC computes the conditional probability $P(L \mid Y)$, where $Y = y_1, \dots, y_T$ represents the per-frame predictions of the RNN and $L$ is the label sequence, so that the network can be trained using only sequence-level labels as supervision. The first application of CTC in the OCR domain can be traced to the handwriting recognition system of Graves et al. [37]. Now this technique is widely adopted in scene text recognition [130, 78, 122, 28, 157].
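A small worked example may help show what this conditional probability means. The sketch below sums the probability of every per-frame path that collapses to the label, where collapsing merges repeated symbols and then drops blanks. The names `collapse` and `ctc_probability`, the `"-"` blank symbol, and the dictionary format of the per-frame distributions are assumptions for illustration. Real systems use the dynamic-programming forward algorithm instead of this exponential enumeration.

```python
from itertools import product

BLANK = "-"  # assumed blank symbol for this example


def collapse(path):
    """Map a per-frame path to a label: merge repeats, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_probability(frame_probs, label):
    """Brute-force probability of a label: sum over all paths that collapse to it.

    frame_probs: list with one dict {symbol: probability} per frame.
    """
    alphabet = list(frame_probs[0].keys())
    total = 0.0
    for path in product(alphabet, repeat=len(frame_probs)):
        if collapse(path) == label:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total


# Two frames, each uniform over {blank, "a"}.
frame_probs = [{"-": 0.5, "a": 0.5}, {"-": 0.5, "a": 0.5}]
prob_a = ctc_probability(frame_probs, "a")
```

Here three of the four paths ("aa", "a-", "-a") collapse to "a", so its probability is 0.75, while the remaining path collapses to the empty string.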

Shi et al. [122] propose a model that stacks a CNN with an RNN to recognize scene text images. As illustrated in Fig.8 (a), CRNN consists of three parts: (1) convolutional layers, which extract a feature sequence from the input image; (2) recurrent layers, which predict a label distribution for each frame; (3) a transcription layer (CTC layer), which translates the per-frame predictions into the final label sequence.

Instead of an RNN, Gao et al. [28] adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, an approach characterized by lower computational complexity and easier parallel computation. The overall difference from other frameworks is illustrated in Fig.8 (b).

Yin et al. [157] also avoid using an RNN in their model. They simultaneously detect and recognize characters by sliding character models over the text line image; the models are learned end-to-end on text line images labeled with text transcripts.

3.2.2 Attention-based methods

The attention mechanism was first presented in [5] to improve the performance of neural machine translation systems, and has flourished in many machine learning application domains, including scene text recognition [14, 13, 29, 69, 150, 123, 83].

Lee et al. [69] presented recursive recurrent neural networks with attention modeling (R2AM) for lexicon-free scene text recognition. The model first passes input images through recursive convolutional layers to extract encoded image features, and then decodes them to output characters by recurrent neural networks with implicitly learned character-level language statistics. The attention-based mechanism performs soft feature selection for better image feature usage.

Cheng et al. [13] observed the attention drift problem in existing attention-based methods and proposed a Focus Attention Network (FAN) to attenuate it. As shown in Fig.8 (d), the main idea is to add localization supervision to the attention module, whereas the alignment between image features and the target label sequence is usually learned automatically in previous work.

In [6], Bai et al. proposed an edit probability (EP) metric to handle the misalignment between the ground truth string and the attention's output sequence of probability distributions, as shown in Fig.8 (e). Unlike the aforementioned attention-based methods, which usually employ a frame-wise maximum likelihood loss, EP tries to estimate the probability of generating a string from the output sequence of probability distributions conditioned on the input image, while considering the possible occurrences of missing or superfluous characters.

In [83], Liu et al. proposed an efficient attention-based encoder-decoder model, in which the encoder part is trained under binary constraints. Their recognition system achieves state-of-the-art accuracy while consuming much less computation than the aforementioned methods.

Among those attention-based methods, some work made efforts to accurately recognize irregular (perspectively distorted or curved) text. Shi et al. [123, 124] proposed a text recognition system which combines a Spatial Transformer Network (STN) [56] and an attention-based Sequence Recognition Network. The STN predicts a Thin-Plate-Spline transformation which rectifies the irregular input text image into a more canonical form.

Yang et al. [150] introduced an auxiliary dense character detection task to encourage the learning of visual representations that are favorable to text patterns. They also adopted an alignment loss to regularize the estimated attention at each time-step. Further, they use a coordinate map as a second input to enforce spatial awareness.

In [14], Cheng et al. argue that encoding a text image as a 1-D sequence of features, as implemented in most methods, is not sufficient. They encode an input image into four feature sequences in four directions: horizontal, reversed horizontal, vertical, and reversed vertical. A weighting mechanism is designed to combine the four feature sequences.

Liu et al. [77] presented a hierarchical attention mechanism (HAM) which consists of a recurrent RoI-Warp layer and a character-level attention layer. They adopt a local transformation to model the distortion of individual characters, which improves efficiency and handles different types of distortion that are hard to model with a single global transformation.

3.2.3 Other Efforts

Jaderberg et al. [54, 53] perform word recognition on the whole image holistically. They train a deep classification model solely on data produced by a synthetic text generation engine, and achieve state-of-the-art performance on some benchmarks containing English words only. However, the application of this method is quite limited, as it cannot recognize long sequences such as phone numbers.

3.3 End-to-End System

In the past, text detection and recognition were usually cast as two independent sub-problems that are combined to perform text retrieval from images. Recently, many end-to-end text detection and recognition systems (also known as text spotting systems) have been proposed, profiting a lot from the idea of designing differentiable computation graphs. Efforts to build such systems have gained considerable momentum as a new trend.

While earlier work [142, 144] first detects single characters in the input image, recent systems usually detect and recognize text at the word level or line level. Some of these systems first generate text proposals using a text detection model and then recognize them with another text recognition model [55, 73, 38]. Jaderberg et al. [55] use a combination of Edge Box proposals [173] and a trained aggregate channel features detector [22] to generate candidate word bounding boxes. Proposal boxes are filtered and rectified before being sent into their recognition model proposed in [54]. In [73], Liao et al. combined an SSD-based [76] text detector and CRNN [122] to spot text in images. Lyu et al. [88] propose a modification of Mask R-CNN that is adapted to produce shape-free recognition of scene text, as shown in Fig.9 (c). For each region of interest, character maps are produced, indicating the existence and location of a single character. A post-processing step that links these characters together gives the final results.

One major drawback of the two-step methods is that the propagation of error between the text detection models and the text recognition models leads to less satisfactory performance. Recently, more end-to-end trainable networks have been proposed to tackle this problem [7, 11, 72, 45, 81].

Bartz et al. [7] presented a solution which utilizes an STN [56] to circularly attend to each word in the input image and then recognize them separately. The united network is trained in a weakly-supervised manner in which no word bounding box labels are used. Li et al. [72] substitute the object classification module in Faster-RCNN [107] with an encoder-decoder based text recognition model to make up their text spotting system. Liu et al. [81], Busta et al. [11] and He et al. [45] developed unified text detection and recognition systems with a very similar overall architecture, which consists of a detection branch and a recognition branch. Liu et al. [81] and Busta et al. [11] adopt EAST [168] and YOLOv2 [106] as their detection branch respectively, and have a similar text recognition branch in which text proposals are mapped into fixed-height tensors by bilinear sampling and then transcribed into strings by a CTC-based recognition module. He et al. [45] also adopted EAST [168] to generate text proposals, and they introduced character spatial information as explicit supervision in the attention-based recognition branch.

3.4 Auxiliary Techniques

Recent advances are not limited to detection and recognition models that aim to solve the tasks directly. We should also give credit to auxiliary techniques that have played an important role. In this part, we briefly introduce several promising trends: synthetic data, bootstrapping, text deblurring, incorporating context information, and adversarial training.

3.4.1 Synthetic Data

Most deep learning models are data-thirsty. Their performance is guaranteed only when enough data are available. Therefore, artificial data generation has been a popular research topic, e.g. Generative Adversarial Nets (GAN) [34]. In the field of text detection and recognition, this problem is more urgent since most human-labeled datasets are small. Fortunately, there has been work [54, 38, 164] that can generate data instances of relatively high quality, which have been widely used for pre-training models for better performance.

Fig. 9: Illustration of mainstream end-to-end scene text detection and recognition frameworks. The basic idea is to concatenate the two branches. (a): In SEE [7], the detection results are represented as grid matrices. Image regions are cropped and transformed before being fed into the recognition branch. (b): In contrast to (a), some methods crop from the feature maps and feed them to the recognition branch [11, 72, 45, 81]. (c): While frameworks (a) and (b) utilize CTC-based and attention-based recognition branches, it is also possible to retrieve each character as a generic object and compose the text [88].

Jaderberg et al. [54] first propose synthetic data for text recognition. Their method blends text with randomly cropped natural images from human-labeled datasets after rendering of font, border/shadow, color, and distortion. The results show that training merely on these synthetic data can achieve state-of-the-art performance and that synthetic data can act as augmentative data sources for all datasets.

SynthText [38] first proposes to embed text in natural scene images for training text detection, while most previous work only prints text on a cropped region, and such synthetic data are only suitable for text recognition. Printing text on whole natural images poses new challenges, as it needs to maintain semantic coherence. To produce more realistic data, SynthText makes use of depth prediction [75] and semantic segmentation [4]. Semantic segmentation groups pixels together as semantic clusters, and each text instance is printed on one semantic surface, not overlapping multiple ones. A dense depth map is further used to determine the orientation and distortion of the text instance. A model trained only on SynthText achieves state-of-the-art results on many text detection datasets. It is later used in other works [168, 121] as well for initial pre-training.

Further, Zhan et al. [164] equip text synthesis with other deep learning techniques to produce more realistic samples. They introduce selective semantic segmentation so that word instances only appear on sensible objects, e.g. a desk or wall instead of someone's face. Text rendering in their work is adapted to the image so that the text fits into the artistic style and does not stand out awkwardly.

3.4.2 Bootstrapping

Bootstrapping, or weak and semi-supervision, is also important in text detection and recognition [132, 111, 50]. It is mainly used to obtain word-level [111] or character-level [132, 50] annotations.

Bootstrapping for word-box. Rong et al. [111] propose to combine an FCN-based text detection network with Maximally Stable Extremal Region (MSER) features to generate new training instances annotated at the box level. First, they train an FCN, which predicts the probability of each pixel belonging to text. Then, MSER features are extracted from regions where the text confidence is high. The final prediction is made using single linkage criterion (SLC) based algorithms [128, 32].

Bootstrapping for character-box. Character-level annotations are more accurate and more informative. However, most existing datasets do not provide character-level annotations. Since characters are smaller and close to each other, character-level annotation is more costly and inconvenient. There has been some work on semi-supervised character detection [132, 50]. The basic idea is to initialize a character detector and apply rules or thresholds to pick the most reliable predicted candidates. These reliable candidates are then used as an additional source of supervision to refine the character detector. Both methods aim to augment existing datasets with character-level annotations. They only differ in details.

Fig. 10: Overview of semi-supervised and weakly-supervised methods. Existing methods differ in how filtering is done. (a): WeText [132], mainly by thresholding the confidence level and filtering by word-level annotation. (b) and (c): Scoring-based methods, including WordSup [50], which assumes that text instances are straight lines and uses an eigenvalue-based metric to measure their straightness; Rong et al. [111] evaluate each predicted text region with MSER features combined with the SLC algorithm.

WordSup [50] first initializes the character detector by training for some warm-up iterations on a synthetic dataset, as shown in Fig.10 (b). For each image, WordSup generates character candidates, which are then filtered with word boxes. For the characters in each word box, the following score is computed to select the most probable character list:

$$ s = w \cdot \frac{\mathrm{area}(B_{chars})}{\mathrm{area}(B_{word})} + (1 - w) \cdot \left(1 - \frac{\lambda_2}{\lambda_1}\right) $$

where $B_{chars}$ is the union of the selected character boxes; $B_{word}$ is the enclosing word bounding box; $\lambda_1$ and $\lambda_2$ are the first and second largest eigenvalues of a covariance matrix $C$, computed from the coordinates of the centers of the selected character boxes; and $w$ is a weight scalar. Intuitively, the first term measures how completely the selected characters cover the word box, while the second term measures whether the selected characters are located on a straight line, which is a main characteristic of word instances in most datasets.
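The two terms of this score can be sketched numerically. The code below assumes the score is a weighted sum of a coverage ratio and a straightness term $1 - \lambda_2 / \lambda_1$, which is our reading of the description above rather than a verified transcription of WordSup. It further assumes axis-aligned `(x1, y1, x2, y2)` boxes and non-overlapping character boxes, so that the union area is the sum of the individual areas. The function name `wordsup_score` and the default `w = 0.5` are hypothetical.

```python
import math


def wordsup_score(char_boxes, word_box, w=0.5):
    """Weighted sum of word-box coverage and center-line straightness.

    Assumes char_boxes do not overlap, so the area of their union is the
    sum of their areas (a simplification made for this sketch).
    """
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    coverage = sum(area(b) for b in char_boxes) / area(word_box)

    # Covariance matrix of the character box centers.
    xs = [(b[0] + b[2]) / 2 for b in char_boxes]
    ys = [(b[1] + b[3]) / 2 for b in char_boxes]
    n = len(char_boxes)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

    # Closed-form eigenvalues of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
    trace, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(max(0.0, trace * trace / 4 - det))
    lam1, lam2 = trace / 2 + disc, trace / 2 - disc
    straightness = 1.0 - lam2 / lam1 if lam1 > 0 else 0.0

    return w * coverage + (1 - w) * straightness


# Three characters on a straight line that fill the word box exactly.
straight = wordsup_score(
    [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10)], (0, 0, 30, 10))
# Three characters scattered in an L shape inside a larger word box.
scattered = wordsup_score(
    [(0, 0, 10, 10), (20, 20, 30, 30), (0, 20, 10, 30)], (0, 0, 30, 30))
```

The collinear, fully covering candidate list scores close to 1, while the scattered list scores 0.5 here, so the former would be preferred.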

WeText [132] starts with a small dataset annotated at the character level. It follows two paradigms of bootstrapping: semi-supervised learning and weakly-supervised learning. In the semi-supervised setting, detected character candidates are filtered with a high threshold value. In the weakly-supervised setting, ground-truth word boxes are used to mask out false positives outside them. New instances detected in either way are added to the initial small dataset, and the model is re-trained.

3.4.3 Text Deblurring

By nature, text detection and recognition are more sensitive to blurring than general object detection. Some methods [49, 66] have been proposed for text deblurring.

Fig. 11: Selected samples from Chars74K, SVT-P, IIIT5K, MSRA-TD500, ICDAR2013, ICDAR2015, ICDAR2017 MLT, ICDAR2017 RCTW, and Total-Text.

Hradis et al. [49] propose an FCN-based deblurring method. The core FCN takes the blurred image as input and generates a deblurred image. They collect a dataset of well-taken images of documents, and process them with kernels designed to mimic hand-shake and de-focus.

Khare et al. [66] propose a quite different framework. Given a blurred image, it aims to alternately optimize the original image and the kernel by minimizing the following energy value:


where is the regularization weight, with operator as the Gaussian weighted (w) norm. The optimization is done by alternatively optimizing over the kernel and the original image .

3.4.4 Context Information

Another way to make more accurate predictions is to take context information into account. Intuitively, we know that text only appears on certain surfaces, e.g. billboards and books. Text is less likely to appear on the face of a human or an animal. Following this idea, Zhu et al. [170] propose to incorporate the semantic segmentation result as part of the input. The additional feature filters out false positives whose patterns look like text.

3.4.5 Adversarial Attack

Text detection and recognition have a broad range of applications. In some scenarios, the security of the applied algorithms becomes a key factor, e.g. autonomous vehicles and identity verification. Yuan et al. [162] propose the first adversarial attack algorithm for text recognition. They propose a white-box attack algorithm that induces a trained model to generate a desired wrong output. Specifically, they aim to optimize a joint target consisting of (1) a term for minimizing the alteration applied to the original image and (2) the loss function with regard to the probability of the targeted output. They adapt the automated weighting method proposed by Kendall et al. [65] to find the optimum weight of the two targets. Their method realizes a high success rate together with a speedup compared to other state-of-the-art attack methods. Most importantly, their method showed a way to carry out sequential attacks.

4 Benchmark Datasets and Evaluation Protocols

As cutting-edge algorithms achieved better performance on previous datasets, researchers were able to tackle more challenging aspects of the problems. New datasets aimed at different real-world challenges have been and are being crafted, further benefiting the development of detection and recognition methods.

In this section, we list and briefly introduce the existing datasets and the corresponding evaluation protocols. We also identify current state-of-the-art performance on the widely used datasets when applicable.

Dataset (Year) Image Num (train/test) Text Num (train/test) Orientation Language Characteristics Detection Task Recognition Task
ICDAR03(2003) 258/251 1110/1156 Horizontal EN -
ICDAR13 Scene Text(2013) 229/233 848/1095 Horizontal EN -
ICDAR15 Incidental Text(2015) 1000/500 -/- Multi-Oriented EN
ICDAR RCTW(2017) 8034/4229 -/- Multi-Oriented CN -
Total-Text (2017) 1255/300 -/- Curved EN, CN Polygon label
SVT(2010) 100/250 257/647 Horizontal EN -
*CUTE(2014) -/80 -/- Curved EN -
CTW (2017) 25K/6K 812K/205K Multi-Oriented CN Fine-grained annotation
CASIA-10K (2018) 7K/3K -/- Multi-Oriented CN -
*MSRA-TD500 (2012) 300/200 1068/651 Multi-Oriented EN, CN Long text -
HUST-TR400 (2014) 400/- -/- Multi-Oriented EN, CN Long text -
ICDAR17MLT(2017) 9000/9000 -/- Multi-Oriented 9 languages - -
CTW1500 (2017) 1000/500 -/- Curved EN -
*IIIT 5K-Word(2012) 2000/3000 2000/3000 Horizontal - - -
SVTP(2013) -/639 -/639 Multi-Oriented EN Perspective text -
SVHN(2010) 73257/26032 73257/26032 Horizontal - House number digits -
TABLE I: Existing datasets: * indicates datasets that are the most widely used across recent publications. Newly published ones representing real-world challenges are marked in bold. EN stands for English and CN stands for Chinese.

4.1 Benchmark Datasets

We collect existing datasets and summarize their features in Tab.I. We also select some representative image samples from some of the datasets, which are demonstrated in Fig.11. Links to these datasets are also collected in our Github repository mentioned in abstract, for readers’ convenience.

4.1.1 Datasets with both detection and recognition tasks

The ICDAR 2003 and 2005

Held in 2003, the ICDAR 2003 Robust Reading Competition [87] is the first such benchmark dataset ever released for scene text detection and recognition. Among the 509 images, 258 are used for training and 251 for testing. The dataset is also used in the ICDAR 2005 Text Locating Competition [86]. ICDAR also includes a digit recognition track.

Method Precision Recall F-measure FPS
Zhang et al. [166] 88 78 83 -
SynthText[38] 92.0 75.5 83.0 -
Holistic[152] 88.88 80.22 84.33 -
PixelLink[20] 86.4 83.6 84.5 -
CTPN [133] 93 83 88 7.1
He et al. [41] 93 79 85 -
SegLink [121] 87.7 83.0 85.3 20.6
He et al.  [46] 92 80 86 1.1
TextBox++[73] 89 83 86 1.37
EAST [168] 92.64 82.67 87.37 -
SSTD[44] 89 86 88 7.69
Lyu et al.[89] 93.3 79.4 85.8 10.4
Liu et al.[84] 88.2 87.2 87.7 -
He et al.[45] 88 87 88 -
Xue et al. [149] 91.5 87.1 89.2 -
WordSup  [50] 93.34 87.53 90.34 -
Lyu et al.[88] 94.1 88.1 91.0 4.6
FEN[119] 93.7 90.0 92.3 1.11
TABLE II: Detection performance on ICDAR2013. means multi-scale, stands for the base net of the model is not VGG16. The performance is based on DetEval.

In the ICDAR 2011 and 2013 Robust Reading Competitions, previous datasets are modified and extended, which makes the new ICDAR 2011 [118] and 2013 [64] datasets. Problems in previous datasets are corrected, e.g. imprecise bounding boxes. State-of-the-art results are shown in Tab.II for detection and Tab.VIII for recognition.

Method Precision Recall F-measure FPS
Zhang et al. [166] 71 43.0 54 -
CTPN [133] 74 52 61 7.1
Holistic [152] 72.26 58.69 64.77 -
He et al. [41] 76 54 63 -
SegLink [121] 73.1 76.8 75.0 -
SSTD [44] 80 73 77 -
EAST [168] 83.57 73.47 78.20 13.2
He et al.  [46] 82 80 81 -
R2CNN [59] 85.62 79.68 82.54 0.44
Liu et al.[84] 72 80 76 -
WordSup  [50] 79.33 77.03 78.16 -
Wang et al. [141] 85.7 74.1 79.5 -
Lyu et al.[89] 94.1 70.7 80.7 3.6
TextSnake[85] 84.9 80.4 82.6 1.1
He et al.[45] 84 83 83 -
Lyu et al.[88] 85.8 81.2 83.4 4.8
PixelLink [20] 85.5 82.0 83.7 3.0
TABLE III: Detection performance on ICDAR2015. means multi-scale, stands for the base net of the model is not VGG16.


In real world application, images containing text may be too small, blurred, or occluded. To represent such a challenge, ICDAR2015 is proposed as the Challenge 4 of the 2015 Robust Reading Competition [63] for incidental scene text detection. Scene text images in this dataset are taken by Google Glasses without taking care of the image quality. A large proportion of images are very small, blurred, and multi-oriented. There are 1000 images for training and 500 images for testing. The text instances from this dataset are labeled as word level quadrangles. State-of-the-art results are shown in Tab.III for detection and Tab.VIII for recognition.


In the ICDAR2017 Competition on Reading Chinese Text in the Wild [125], Shi et al. propose a new dataset, called CTW-12K, which mainly consists of Chinese text. It comprises 12263 images in total, among which 8034 are for training and 4229 are for testing. Text instances are annotated with parallelograms. It is the first large-scale Chinese dataset, and was also the largest published one at the time.


The Chinese Text in the Wild (CTW) dataset proposed by Yuan et al. [161] is the largest annotated dataset to date. It consists of high-resolution street view images of Chinese text. All images are annotated at the character level, including the underlying character type, bounding box, and other attributes. These attributes indicate whether the background is complex, and whether the character is raised, hand-written or printed, occluded, distorted, or rendered as word art. The dataset is split into a training set, a recognition test set, and a detection test set.


Unlike most previous datasets which only include text that are in straight lines, Total-Text consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved. Text instances in Total-Text are annotated with both quadrilateral boxes and polygon boxes of a variable number of vertexes. State-of-the-art results for Total-Text are shown in Tab.IV for detection and recognition.


The Street View Text (SVT) dataset [143, 142] is a collection of street view images. SVT has 350 images. It only has word-level annotations.


CUTE is proposed in [108]. The dataset focuses on curved text. It contains 80 high-resolution images taken in natural scenes. No lexicon is provided.

Method Detection Word Spotting
P R F None Full
DeconvNet[100] - -
Lyu et al.[88] 69.0 55.0 61.3 52.9 71.8
TextSnake[85] 82.7 74.5 78.4 - -
TABLE IV: Performance on Total-Text.
Method Precision Recall F-measure
SegLink [121]
EAST [168]
DMPNet [82]
CTD+TLOC[163] 77.4
TextSnake[85] 67.9 85.3 75.6
TABLE V: Detection performance on CTW1500.

4.1.2 Datasets with only detection task


The MSRA Text Detection Dataset (MSRA-TD500) [135] is a benchmark dataset featuring long and multi-oriented text. Text instances in MSRA-TD500 have much larger aspect ratios than other datasets. Later, an additional set of images, called HUST-TR400 [151], are collected in the same way as MSRA-TD500, usually used as additional training data for MSRA-TD500.


The dataset of the ICDAR2017 MLT Challenge [95] contains 18K images with scripts of 9 languages, 2K for each. It features the largest number of languages to date.


CASIA-10K is a newly published Chinese scene text dataset. This dataset contains 10000 images under various scenarios, with 7000 for training and 3000 for testing. As Chinese characters are not segmented by spaces, line-level annotations are provided.

SCUT-CTW1500 (CTW1500)

CTW1500 is another dataset which features curved text. It consists of 1000 training images and 500 test images. Annotations in CTW1500 are polygons defined by their vertexes. Performances on CTW1500 are shown in Tab.V for detection.

4.1.3 Datasets with only recognition task

IIIT 5K-Word

IIIT 5K-Word [94] is the largest dataset, containing both digital and natural scene images. Its variance in font, color, size and other noise makes it the most challenging one to date. There are 5000 images in total, 2000 for training and 3000 for testing.

SVT-Perspective (SVTP)

SVTP is proposed in [104] for evaluating the performance of recognizing perspective text. Images in SVTP are picked from the side-view images in Google Street View. Many of them are heavily distorted by the non-frontal view angle. The dataset consists of 639 cropped images for testing, each with a 50-word lexicon inherited from the SVT dataset.

SVHN

The street view house numbers (SVHN) dataset [96] contains digits of house numbers in natural scenes. The images are collected from Google Street View images. This dataset is usually used in digit recognition.

Method Precision Recall F-measure FPS
Kang et al. [61] 71 62 66 -
Zhang et al. [166] 83 67 74 -
Holistic [152] 76.51 75.31 75.91 -
He et al.  [46] 77 70 74 -
EAST  [168] 87.28 67.43 76.08 13.2
Wu et al. [148] 77 78 77 -
SegLink [121] 86 70 77 8.9
PixelLink [20] 83.0 73.2 77.8
TextSnake[85] 83.2 73.9 78.3 1.1
Xue et al. [149] 83.0 77.4 80.1 -
Wang et al. [141] 90.3 72.3 80.3 -
Lyu et al.[89] 87.6 76.2 81.5 5.7
Liu et al.[84] 88 79 83 -
TABLE VI: State-of-the-art detection performance on MSRA-TD500. stands for models whose base nets are not VGG16.

4.2 Evaluation Protocols

In this part, we briefly summarize the evaluation protocols for text detection and recognition.

As metrics for performance comparison of different algorithms, we usually refer to their precision, recall and F1-score. To compute these performance indicators, the list of predicted text instances must first be matched to the ground truth labels. Precision, denoted as $P$, is the proportion of predicted text instances that can be matched to ground truth labels. Recall, denoted as $R$, is the proportion of ground truth labels that have correspondents in the predicted list. The F1-score is then computed as $F_1 = \frac{2 \times P \times R}{P + R}$, taking both precision and recall into account. Note that the matching between the predicted instances and ground truth ones comes first.
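As a minimal sketch of how these three indicators follow from the match counts, the snippet below assumes that matching has already been done and that only the counts are available. The function and argument names are hypothetical and are not taken from any benchmark toolkit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def detection_scores(num_matched_pred, num_pred, num_matched_gt, num_gt):
    """Precision, recall and F1 from match counts.

    num_matched_pred: predictions that were matched to a ground truth label
    num_matched_gt:   ground truth labels that were matched by a prediction
    """
    precision = num_matched_pred / num_pred if num_pred else 0.0
    recall = num_matched_gt / num_gt if num_gt else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical counts: 10 predictions, 8 of them matched; 16 labels, 8 matched.
p, r, f1 = detection_scores(num_matched_pred=8, num_pred=10,
                            num_matched_gt=8, num_gt=16)
print(p, r, round(f1, 3))
```

The same formula can be checked against a published row: with the SegLink precision of 86 and recall of 70, `f_measure(86, 70)` gives about 77.2, in line with the F-measure of 77 listed for that method on MSRA-TD500.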

4.2.1 Text Detection

There are mainly two different protocols for text detection, the IoU-based PASCAL Eval and the overlap-based DetEval. They differ in the criterion for matching predicted text instances with ground truth ones. In the following part, we use these notations: $S_{GT}$ is the area of the ground truth bounding box, $S_P$ is the area of the predicted bounding box, $S_I$ is the area of the intersection of the predicted and ground truth bounding boxes, and $S_U$ is the area of their union.

PASCAL [25]: The basic idea is that, if the intersection-over-union value, i.e. $\frac{S_I}{S_U}$, is larger than a designated threshold, the predicted and ground truth boxes are matched together.

DetEval: DetEval imposes constraints on both precision, i.e. $\frac{S_I}{S_P}$, and recall, i.e. $\frac{S_I}{S_{GT}}$. Only when both are larger than their respective thresholds are the two boxes matched together.
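The two matching criteria can be sketched as follows for axis-aligned boxes given as `(x1, y1, x2, y2)`. This is an illustrative sketch rather than the official evaluation code. The box format, the function names and the default thresholds (0.5 for IoU, 0.8 for area recall and 0.4 for area precision) are assumptions made for the example.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); 0 if degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def pascal_match(pred, gt, iou_thresh=0.5):
    """PASCAL-style criterion: intersection over union above a threshold."""
    inter = intersection_area(pred, gt)
    union = area(pred) + area(gt) - inter
    return union > 0 and inter / union > iou_thresh


def deteval_match(pred, gt, t_recall=0.8, t_precision=0.4):
    """DetEval-style criterion: area precision AND area recall above thresholds.

    The default thresholds are assumptions of this sketch.
    """
    inter = intersection_area(pred, gt)
    if area(pred) == 0 or area(gt) == 0:
        return False
    area_precision = inter / area(pred)   # intersection / predicted area
    area_recall = inter / area(gt)        # intersection / ground truth area
    return area_precision > t_precision and area_recall > t_recall


gt_box = (0, 0, 10, 10)
pred_box = (0, 0, 10, 6)   # covers 60% of the ground truth, no background
print(pascal_match(pred_box, gt_box), deteval_match(pred_box, gt_box))
```

In this example the prediction passes the IoU test (0.6 > 0.5) but fails the DetEval-style test because its area recall (0.6) is below the assumed recall threshold, which illustrates how the two criteria can disagree on the same pair of boxes.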

Most datasets follow either of the two evaluation protocols, but with small modifications. We only discuss those that differ from the two protocols mentioned above.


ICDAR2003/2005: The match score is calculated in a way similar to IoU. It is defined as the ratio of the area of intersection over that of the minimum bounding rectangle containing both boxes.


ICDAR2011/2013: One major drawback of the evaluation protocol of ICDAR2003/2005 is that it only considers one-to-one matches. It does not consider one-to-many, many-to-many, and many-to-one matchings, which underestimates the actual performance. Therefore, ICDAR2011/2013 follows the method proposed by Wolf et al. [146]. The match score functions for predicted and ground truth boxes give a different score to each type of matching: a one-to-one match scores 1, no match scores 0, and a match against several ($k$) boxes scores $f_{sc}(k)$. Here $f_{sc}(k)$ is a function for punishment of many-matches, controlling the amount of splitting or merging.

Vocab List  Description
S           a per-image list of words: all words in the image + selected distractors
W           all words in the entire test set
G           a generic vocabulary
TABLE VII: Characteristics of the three vocabulary lists used in ICDAR 2013/2015. S stands for Strongly Contextualised, W for Weakly Contextualised, and G for Generic

Yao et al. [135] propose a new evaluation protocol for rotated bounding boxes, where both the predicted and the ground truth bounding boxes are rotated to horizontal around their centers. They are matched only when the standard IoU score is higher than the threshold and the difference between the rotations of the original bounding boxes is less than a pre-defined value.
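The rotated-box criterion described above can be sketched as below. The box representation `(cx, cy, w, h, theta)` with `theta` in radians, the function names, and the default thresholds (IoU of 0.5 and an angle tolerance of π/8) are assumptions of this sketch, not values stated here.

```python
import math


def horizontal_box(box):
    """Rotate a box (cx, cy, w, h, theta) to horizontal around its own center."""
    cx, cy, w, h, _theta = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Standard IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def angle_difference(theta_a, theta_b):
    """Smallest difference between two box orientations (period pi)."""
    diff = abs(theta_a - theta_b) % math.pi
    return min(diff, math.pi - diff)


def rotated_match(pred, gt, iou_thresh=0.5, max_angle_diff=math.pi / 8):
    """Match only if the horizontal IoU and the angle difference both pass.

    Both default thresholds are assumptions of this sketch.
    """
    overlap_ok = iou(horizontal_box(pred), horizontal_box(gt)) > iou_thresh
    angle_ok = angle_difference(pred[4], gt[4]) < max_angle_diff
    return overlap_ok and angle_ok


gt_box = (50, 20, 100, 22, 0.0)
print(rotated_match((50, 20, 100, 20, 0.05), gt_box))  # similar angle
print(rotated_match((50, 20, 100, 20, 1.0), gt_box))   # very different angle
```

Treating the orientation as periodic in π is a design choice of the sketch: a rectangle rotated by half a turn covers the same region, so it is counted as having the same orientation.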

4.2.2 Text Recognition and End-to-End System

Text recognition is another task, in which a cropped image containing exactly one text instance is given, and we need to extract the text content from the image in a form that a computer program can understand directly, e.g. string type in C++ or str type in Python. There is no need for matching in this task; the predicted text string is compared to the ground truth directly. Performance is evaluated either at the character level (i.e. how many characters are recognized) or at the word level (whether the predicted word is correct). ICDAR also introduces an edit-distance based performance evaluation. Note that in end-to-end evaluation, matching is first performed in a similar way to that of text detection. State-of-the-art recognition performance on the most widely used datasets is summarized in Tab. VIII.

Methods                         ConvNet, Data   IIIT5k               SVT           IC03                 IC13   IC15   SVTP   CUTE
                                                50     1k     0      50     0      50     Full   0      0      0      0      0
Yao et al. [153]                -               80.2   69.3   -      75.9   -      88.5   80.3   -      -      -      -      -
Rodríguez-Serrano et al. [109]  -               76.1   57.4   -      70.0   -      -      -      -      -      -      -      -
Jaderberg et al. [57]           -               -      -      -      86.1   -      96.2   91.5   -      -      -      -      -
Su and Lu [130]                 -               -      -      -      83.0   -      92.0   82.0   -      -      -      -      -
Gordo [35]                      -               93.3   86.6   -      91.8   -      -      -      -      -      -      -      -
Jaderberg et al. [55]           VGG, 90k        97.1   92.7   -      95.4   80.7   98.7   98.6   93.1   90.8   -      -      -
Jaderberg et al. [53]           VGG, 90k        95.5   89.6   -      93.2   71.7   97.8   97.0   89.6   81.8   -      -      -
Shi et al. [122]                VGG, 90k        97.8   95.0   81.2   97.5   82.7   98.7   98.0   91.9   89.6   -      -      -
*Shi et al. [123]               VGG, 90k        96.2   93.8   81.9   95.5   81.9   98.3   96.2   90.1   88.6   -      71.8   59.2
Lee et al. [69]                 VGG, 90k        96.8   94.4   78.4   96.3   80.7   97.9   97.0   88.7   90.0   -      -      -
Yang et al. [150]               VGG, Private    97.8   96.1   -      95.2   -      97.7   -      -      -      -      75.8   69.3
Cheng et al. [13]               ResNet, 90k+ST  99.3   97.5   87.4   97.1   85.9   99.2   97.3   94.2   93.3   70.6   -      -
Shi et al. [124]                ResNet, 90k+ST  99.6   98.8   93.4   99.2   93.6   98.8   98.0   94.5   91.8   76.1   78.5   79.5
TABLE VIII: State-of-the-art recognition performance across a number of datasets. “50”, “1k”, “Full” are lexicons. “0” means no lexicon. “90k” and “ST” are the Synth90k and the SynthText datasets, respectively. “ST” means including character-level annotations. “Private” means private training data.
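The recognition metrics mentioned in this subsection (word-level accuracy, a character-level rate and an edit-distance based score) can be sketched as follows. The function names are hypothetical, and the character-level rate used here (one minus total edit distance over total ground truth length) is only one possible definition, chosen as an assumption for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(preds, gts, case_sensitive=True):
    """Fraction of predicted words that equal the ground truth exactly."""
    if not case_sensitive:
        preds = [p.lower() for p in preds]
        gts = [g.lower() for g in gts]
    correct = sum(p == g for p, g in zip(preds, gts))
    return correct / len(gts) if gts else 0.0


def total_edit_distance(preds, gts):
    """Sum of edit distances over all (prediction, ground truth) pairs."""
    return sum(edit_distance(p, g) for p, g in zip(preds, gts))


def character_rate(preds, gts):
    """Assumed character-level rate: 1 - total edit distance / total gt length."""
    total_chars = sum(len(g) for g in gts)
    if total_chars == 0:
        return 0.0
    return max(0.0, 1.0 - total_edit_distance(preds, gts) / total_chars)


predictions = ["hello", "word"]
ground_truths = ["hello", "world"]
print(word_accuracy(predictions, ground_truths),
      total_edit_distance(predictions, ground_truths),
      character_rate(predictions, ground_truths))
```

Whether comparison is case-sensitive varies between benchmarks, which is why it is left as a parameter here.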

The evaluation of end-to-end systems combines both detection and recognition. Given the output to be evaluated, i.e. text locations and recognized content, predicted text instances are first matched with ground truth instances, and the text content is then compared.

The most widely used datasets for end-to-end systems are ICDAR2013 [64] and ICDAR2015 [63]. The evaluation over these two datasets is carried out under two different settings [1], the Word Spotting setting and the End-to-End setting. Under Word Spotting, the performance evaluation only focuses on the text instances from the scene image that appear in a predesignated vocabulary, while other text instances are ignored. On the contrary, all text instances that appear in the scene image are included under End-to-End. Three different vocabulary lists are provided for candidate transcriptions: Strongly Contextualised, Weakly Contextualised, and Generic. The three kinds of lists are summarized in Tab. VII. Note that under End-to-End, these vocabularies can still serve as references. State-of-the-art performances are summarized in Tab. IX.

Method                              Word Spotting           End-to-End              FPS
                                    S       W       G       S       W       G
ICDAR2015
Baseline OpenCV3.0+Tesseract [63]   14.7    12.6    8.4     13.8    12.0    8.0     -
TextSpotter [73]                    37.0    21.0    16.0    35.0    20.0    16.0    1
Stradvision [63]                    45.9    -       -       43.7    -       -       -
Deep2Text-MO [55, 158, 159]         17.58   17.58   17.58   16.77   16.77   16.77   -
TextProposals+DictNet [33, 54]      56.0    52.3    49.7    53.3    49.6    47.2    0.2
HUST_MCLAB [121, 122]               70.6    -       -       67.9    -       -       -
Deep Text Spotter [11]              58.0    53.0    51.0    54.0    51.0    47.0    9.0
FOTS [81]                           87.01   82.39   67.97   83.55   79.11   65.33   -
He et al. [45]                      85      80      65      82      77      63      -
Mask TextSpotter [88]               79.3    74.5    64.2    79.3    73.0    62.4    2.6
ICDAR2013
Jaderberg et al. [55]               90.5    -       76      86.4    -       -       -
FCRNall+multi-filt [38]             -       -       84.7    -       -       -       -
Textboxes [73]                      93.9    92.0    85.9    91.6    89.7    83.9
Deep text spotter [11]              92      89      81      89      86      77      9
Li et al. [72]                      94.2    92.4    88.2    91.1    89.8    84.6    1.1
FOTS [81]                           95.94   93.90   87.76   91.99   90.11   84.77   11.2
He et al. [45]                      93      92      87      91      89      86      -
Mask TextSpotter [88]               92.5    92.0    88.2    92.2    91.1    86.5    4.8
TABLE IX: State-of-the-art performance of End-to-End and Word Spotting tasks on ICDAR2015 and ICDAR2013. means multi-scale, stands for the base net of the model is not VGG16.
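The difference between the End-to-End and Word Spotting settings described in this subsection can be sketched as follows. This is a simplified illustration, not the official ICDAR scoring script. It assumes axis-aligned boxes, greedy one-to-one matching with an IoU threshold of 0.5, and case-insensitive text comparison, and all names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate(preds, gts, vocabulary=None, iou_thresh=0.5):
    """Return (precision, recall) for lists of (box, text) pairs.

    vocabulary=None     -> End-to-End: every ground truth instance counts.
    vocabulary=[words]  -> Word Spotting: ground truth instances whose text is
                           not in the vocabulary are ignored, and predictions
                           matched to them are neither rewarded nor penalised.
    """
    vocab = None if vocabulary is None else {w.lower() for w in vocabulary}
    cared = [vocab is None or text.lower() in vocab for _, text in gts]
    used = [False] * len(gts)
    true_pos = 0
    num_pred = 0
    for box, text in preds:
        # Step 1: localisation matching, as in text detection.
        best_j, best_iou = -1, iou_thresh
        for j, (gt_box, _) in enumerate(gts):
            score = iou(box, gt_box)
            if not used[j] and score > best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            used[best_j] = True
            if not cared[best_j]:
                continue  # matched an ignored instance: skip this prediction
        num_pred += 1
        # Step 2: compare the text content of the matched pair.
        if best_j >= 0 and text.lower() == gts[best_j][1].lower():
            true_pos += 1
    num_gt = sum(cared)
    precision = true_pos / num_pred if num_pred else 0.0
    recall = true_pos / num_gt if num_gt else 0.0
    return precision, recall


ground_truth = [((0, 0, 10, 10), "EXIT"), ((20, 0, 30, 10), "x7q")]
predictions = [((0, 0, 10, 9), "exit"), ((20, 0, 30, 10), "x7g")]
print(evaluate(predictions, ground_truth))                       # End-to-End
print(evaluate(predictions, ground_truth, vocabulary=["exit"]))  # Word Spotting
```

In this toy case the misread string "x7q" lowers both scores under End-to-End, while under Word Spotting it is outside the vocabulary and therefore does not affect the result.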

5 Application

The detection and recognition of text, the visual and physical carrier of human civilization, further strengthens the connection between vision and the understanding of its content. Apart from the applications we have mentioned at the beginning of this paper, there have been numerous specific application scenarios across various industries and in our daily lives. In this part, we list and analyze the most outstanding ones that have, or are to have, significant impact, improving our productivity and life quality.

Automatic Data Entry. Apart from building electronic archives of existing documents, OCR can also improve our productivity in the form of automatic data entry. Some industries involve time-consuming manual data entry, e.g. express orders written by customers in the delivery industry, and hand-written information sheets in the financial and insurance industries. Applying OCR techniques can accelerate the data entry process as well as protect customer privacy. Some companies have already been using these technologies, e.g. SF-Express. Another potential application is note taking, such as NEBO, a note-taking application for tablets like the iPad that performs instant transcription as the user writes down notes.

Identity Authentication. Automatic identity authentication is yet another field where OCR can be brought into full play. In fields such as Internet finance and customs, users and passengers are required to provide identification (ID) information, such as identity cards and passports. Automatic recognition and analysis of the provided documents requires OCR that reads and extracts the textual content, and can automate and greatly accelerate such processes. There are companies that have already started working on identification based on face and ID card, e.g. Megvii (Face++).

Augmented Computer Vision. As text is an essential element for the understanding of a scene, OCR can assist computer vision in many ways. In the scenario of autonomous vehicles, text-embedded panels carry important information, e.g. geo-location, current traffic conditions, and navigation. There have been several works on text detection and recognition for autonomous vehicles [92, 91]. The largest dataset so far, CTW [161], also places extra emphasis on traffic signs. Another example is instant translation, where OCR is combined with a translation model. This can be extremely helpful and time-saving as people travel or consult documents written in foreign languages. Google’s Translate application can perform such instant translation. A similar application is instant text-to-speech equipped with OCR, which can help those with visual disabilities and those who are illiterate [2].

Intelligent Content Analysis. OCR also allows the industries to perform more intelligent analysis, mainly for platforms like video-sharing websites and e-commerce. Text can be extracted from images and subtitles as well as real-time commentary subtitles (a kind of floating comments added by users, e.g. those in Bilibili). On the one hand, such extracted text can be used in automatic content tagging and recommendation systems. It can also be used to perform user sentiment analysis, e.g. to find which part of a video attracts the users most. On the other hand, website administrators can impose supervision and filtration on inappropriate and illegal content, such as terrorist advocacy.

6 Conclusion and Discussion

6.1 Status Quo

The past several years have witnessed the significant development of algorithms for text detection and recognition. As deep learning rose, the methodology of research has changed from searching for patterns and features to designing architectures that take up challenges one by one. We have seen how deep learning has resulted in great progress in performance on the benchmark datasets. Following a number of newly-designed datasets, algorithms aimed at different targets have attracted attention, e.g. those for blurred images and irregular text. Apart from efforts towards a general solution to all sorts of images, these algorithms can be trained and adapted to more specific scenarios, e.g. bankcards, ID cards, and driver’s licenses. Some companies have been providing such scenario-specific APIs, including Baidu Inc., Tencent Inc. and Megvii Inc. Recent development of fast and efficient methods [107, 168] has also allowed the deployment of large-scale systems [9]. Companies including Google Inc. and Amazon Inc. are also providing text extraction APIs.

Despite the success so far, algorithms for text detection and recognition are still confronted with several challenges. While humans have barely any difficulty localizing and recognizing text, current algorithms are not designed and trained effortlessly, and they have not yet reached human-level performance. Besides, most datasets are monolingual, so we have no idea how these models would perform on other languages. What exacerbates this is that the evaluation metrics we use today may be far from perfect. Under PASCAL evaluation, a detection result which only covers slightly more than half of the text instance would be judged as successful, as it passes the IoU threshold of $0.5$. Under DetEval, one can manually enlarge the detected area to meet the requirement of pixel recall, as DetEval requires a high pixel recall but only a rather low pixel precision. Both cases would be judged as failures from an oracle’s viewpoint, as the former cannot retrieve the whole text, while the latter encloses too much background. A new and more appropriate evaluation protocol is needed.
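The two failure cases above can be made concrete with a small worked example. The boxes are hypothetical, and the thresholds (IoU 0.5 for PASCAL, area recall 0.8 and area precision 0.4 for DetEval) are assumptions of this sketch.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a, b):
    """Overlap area of two axis-aligned boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


text_line = (0, 0, 100, 20)      # ground truth: a 100 x 20 text line

# Case 1: the detection covers only the first 55% of the text line.
partial = (0, 0, 55, 20)
inter = intersection(partial, text_line)
iou_partial = inter / (area(partial) + area(text_line) - inter)
passes_pascal = iou_partial > 0.5

# Case 2: the detection is enlarged by a 10-pixel margin on every side.
enlarged = (-10, -10, 110, 30)
inter = intersection(enlarged, text_line)
recall_enlarged = inter / area(text_line)
precision_enlarged = inter / area(enlarged)
passes_deteval = recall_enlarged > 0.8 and precision_enlarged > 0.4

print(iou_partial, passes_pascal)
print(recall_enlarged, round(precision_enlarged, 3), passes_deteval)
```

Under these assumptions the partial box is accepted with an IoU of 0.55 even though 45% of the text is missing, and the enlarged box is accepted even though more than half of its area is background.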

Moreover, few works except for TextSnake [85] have considered the problem of generalization ability across datasets. Generalization ability is important, as some application scenarios require adaptability to changing environments. For example, instant translation and OCR in autonomous vehicles should be able to perform stably under different situations: zoomed-in images with large text instances, far and small words, blurred words, different languages and shapes. However, these scenarios are only represented by different datasets individually. We would expect a more diverse dataset.

Synthetic data (such as SynthText [38]) has been widely adopted in recent scene text detection and recognition algorithms, but its diversity and degree of realism are actually quite limited. To develop scene text detection and recognition models with higher accuracy and generalization ability, it is worth exploring more powerful engines for text image synthesis.

Another shortcoming of deep learning based methods for scene text detection and recognition lies in their efficiency. Most of the current state-of-the-art systems are not able to run in real time when deployed on computers without GPUs or on mobile devices. To make text information extraction techniques and services available anytime and anywhere, current systems should be significantly sped up while maintaining high accuracy.

6.2 Future Trends

History is a mirror for the future. What we lack today tells us about what we can expect tomorrow.

Diversity among Datasets: More Powerful Model. Text detection and recognition is different from generic object detection in the sense that it is faced with unique challenges. We expect that new datasets aimed at new challenges, as we have seen so far [63, 15, 163], would draw attention to these aspects and help solve more real-world problems.

Diversity inside Datasets: More Robust Model. Despite the success we’ve seen so far, current methods are only evaluated on single datasets after being trained on them separately. Tests of authentic generalization are needed, where a single trained model is evaluated on a more diverse held-out set, e.g. a combination of current datasets. Naturally, a new dataset representing several challenges would also provide extra momentum for this field. Evaluation of cross-dataset generalization ability is also preferable, where the model is trained on only one dataset and then tested on another, as done in recent work on curved text [85].

Suitable Evaluation Metrics: a Fairer Play. As discussed above, an evaluation metric that fits the task more appropriately would be better. Current evaluation metrics (DetEval and PASCAL-Eval) are inherited from the more generic task of object detection, where detection results are all represented as rectangular bounding boxes. However, in text detection and recognition, shapes and orientations matter. A tighter and noise-free bounding region would also be friendlier to recognizers. Neglecting some parts in object detection may be acceptable, as the object remains semantically the same, but it would be disastrous for the final text recognition results, as some characters may be missing, resulting in different words.

Towards Stable Performance: as Needed in Security. As we have seen work that breaks sequence modeling methods [162] and attacks that interfere with image classification models [131], we should pay more attention to potential security risks, especially when such models are applied in security services, e.g. identity checks.


  • [1] Icdar 2015 robust reading competition (presentation). Accessed: 2018-07-30.
  • [2] Screen reader. Accessed: 2018-08-09.
  • [3] Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Valveny. Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence, 36(12):2552–2566, 2014.
  • [4] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2011.
  • [5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR 2015, 2014.
  • [6] Fan Bai, Zhanzhan Cheng, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Edit probability for scene text recognition. In CVPR 2018, 2018.
  • [7] Christian Bartz, Haojin Yang, and Christoph Meinel. See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404, 2017.
  • [8] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven. Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
  • [9] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.
  • [10] Michal Busta, Lukas Neumann, and Jiri Matas. Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1206–1214, 2015.
  • [11] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proc. ICCV, 2017.
  • [12] Xilin Chen, Jie Yang, Jing Zhang, and Alex Waibel. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on image processing, 13(1):87–99, 2004.
  • [13] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5086–5094. IEEE, 2017.
  • [14] Zhanzhan Cheng, Xuyang Liu, Fan Bai, Yi Niu, Shiliang Pu, and Shuigeng Zhou. Arbitrarily-oriented text recognition. CVPR2018, 2017.
  • [15] Chee Kheng Ch’ng and Chee Seng Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 935–942. IEEE, 2017.
  • [16] MM Aftab Chowdhury and Kaushik Deb. Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 2013.
  • [17] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J Wu, and Andrew Y Ng. Text detection and character recognition in scene images with unsupervised feature learning. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 440–445. IEEE, 2011.
  • [18] Yuchen Dai, Zheng Huang, Yuting Gao, and Kai Chen. Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272, 2017.
  • [19] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
  • [20] Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. Pixellink: Detecting scene text via instance segmentation. In Proceedings of AAAI, 2018.
  • [21] Guilherme N DeSouza and Avinash C Kak. Vision for mobile robot navigation: A survey. IEEE transactions on pattern analysis and machine intelligence, 24(2):237–267, 2002.
  • [22] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
  • [23] Yuval Dvorin and Uzi Ezra Havosha. Method and device for instant translation, June 4 2009. US Patent App. 11/998,931.
  • [24] Boris Epshtein, Eyal Ofek, and Yonatan Wexler. Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2963–2970. IEEE, 2010.
  • [25] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
  • [26] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International journal of computer vision, 61(1):55–79, 2005.
  • [27] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [28] Yunze Gao, Yingying Chen, Jinqiao Wang, and Hanqing Lu. Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303, 2017.
  • [29] Suman K Ghosh, Ernest Valveny, and Andrew D Bagdanov. Visual attention models for scene text recognition. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 943–948. IEEE, 2017.
  • [30] Ross Girshick. Fast r-cnn. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [31] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
  • [32] Lluis Gomez and Dimosthenis Karatzas. Object proposals for text extraction in the wild. In 13th International Conference on Document Analysis and Recognition (ICDAR), pages 206–210. IEEE, 2015.
  • [33] Lluís Gómez and Dimosthenis Karatzas. Textproposals: a text-specific selective search algorithm for word spotting in the wild. Pattern Recognition, 70:60–74, 2017.
  • [34] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [35] Albert Gordo. Supervised mid-level features for word image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 2956–2964, 2015.
  • [36] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
  • [37] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in neural information processing systems, pages 577–584, 2008.
  • [38] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2315–2324, 2016.
  • [39] Young Kug Ham, Min Seok Kang, Hong Kyu Chung, Rae-Hong Park, and Gwi Tae Park. Recognition of raised characters for automatic classification of rubber tires. Optical Engineering, 34(1):102–110, 1995.
  • [40] Dafang He, Xiao Yang, Wenyi Huang, Zihan Zhou, Daniel Kifer, and C Lee Giles. Aggregating local context for accurate scene text detection. In Asian Conference on Computer Vision, pages 280–296. Springer, 2016.
  • [41] Dafang He, Xiao Yang, Chen Liang, Zihan Zhou, Alexander G Ororbia, Daniel Kifer, and C Lee Giles. Multi-scale fcn with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 474–483. IEEE, 2017.
  • [42] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016.
  • [44] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. Single shot text detector with regional attention. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [45] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5020–5029, 2018.
  • [46] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Deep direct regression for multi-oriented scene text detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [47] Zhiwei He, Jilin Liu, Hongqing Ma, and Peihong Li. A new automatic extraction method of container identity codes. IEEE Transactions on intelligent transportation systems, 6(1):72–78, 2005.
  • [48] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [49] Michal Hradiš, Jan Kotera, Pavel Zemcík, and Filip Šroubek. Convolutional neural networks for direct text deblurring. In Proceedings of BMVC, volume 10, 2015.
  • [50] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. Wordsup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [51] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
  • [52] Weilin Huang, Zhe Lin, Jianchao Yang, and Jue Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1241–1248, 2013.
  • [53] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep structured output learning for unconstrained text recognition. ICLR2015, 2014.
  • [54] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
  • [55] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016.
  • [56] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
  • [57] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Deep features for text spotting. In Proceedings of European Conference on Computer Vision (ECCV), pages 512–528. Springer, 2014.
  • [58] Anil K Jain and Bin Yu. Automatic text location in images and video frames. Pattern recognition, 31(12):2055–2076, 1998.
  • [59] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2cnn: rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017.
  • [60] Keechul Jung, Kwang In Kim, and Anil K Jain. Text information extraction in images and video: a survey. Pattern recognition, 37(5):977–997, 2004.
  • [61] Le Kang, Yi Li, and David Doermann. Orientation robust text line detection in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4034–4041, 2014.
  • [62] Dimosthenis Karatzas and Apostolos Antonacopoulos. Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 634–637. IEEE, 2004.
  • [63] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. Icdar 2015 competition on robust reading. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 1156–1160. IEEE, 2015.
  • [64] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere de las Heras. Icdar 2013 robust reading competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 1484–1493. IEEE, 2013.
  • [65] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 3, 2017.
  • [66] Vijeta Khare, Palaiahnakote Shivakumara, Paramesran Raveendran, and Michael Blumenstein. A blind deconvolution model for scene text detection and recognition in video. Pattern Recognition, 54:128–148, 2016.
  • [67] Kye-Hyeon Kim, Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, and Minje Park. PVANET: deep but lightweight neural networks for real-time object detection. arXiv:1608.08021, 2016.
  • [68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [69] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2231–2239, 2016.
  • [70] Jung-Jin Lee, Pyoung-Hean Lee, Seong-Whan Lee, Alan Yuille, and Christof Koch. Adaboost for text detection in natural scene. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 429–434. IEEE, 2011.
  • [71] Seonghun Lee and Jin Hyung Kim. Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11):823–840, 2013.
  • [72] Hui Li, Peng Wang, and Chunhua Shen. Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [73] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In AAAI, pages 4161–4167, 2017.
  • [74] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5909–5918, 2018.
  • [75] Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5162–5170, 2015.
  • [76] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
  • [77] Wei Liu, Chaofeng Chen, and KKY Wong. Char-net: A character-aware neural network for distorted scene text recognition. In AAAI Conference on Artificial Intelligence. New Orleans, Louisiana, USA, 2018.
  • [78] Wei Liu, Chaofeng Chen, Kwan-Yee K Wong, Zhizhong Su, and Junyu Han. Star-net: A spatial attention residue network for scene text recognition. In BMVC, volume 2, page 7, 2016.
  • [79] Xiaoqing Liu and Jagath Samarabandu. An edge-based text region extraction algorithm for indoor mobile robot navigation. In Mechatronics and Automation, 2005 IEEE International Conference, volume 2, pages 701–706. IEEE, 2005.
  • [80] Xiaoqing Liu and Jagath K Samarabandu. A simple and fast text localization algorithm for indoor mobile robot navigation. In Image Processing: Algorithms and Systems IV, volume 5672, pages 139–151. International Society for Optics and Photonics, 2005.
  • [81] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. FOTS: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [82] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. 2017.
  • [83] Zichuan Liu, Yixing Li, Fengbo Ren, Hao Yu, and Wangling Goh. Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network. AAAI, 2018.
  • [84] Zichuan Liu, Guosheng Lin, Sheng Yang, Jiashi Feng, Weisi Lin, and Wang Ling Goh. Learning markov clustering networks for scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6936–6944, 2018.
  • [85] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
  • [86] Simon M Lucas. ICDAR 2005 text locating competition results. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 80–84. IEEE, 2005.
  • [87] Simon M Lucas, Alex Panaretos, Luis Sosa, Anthony Tang, Shirley Wong, and Robert Young. ICDAR 2003 robust reading competitions. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, page 682. IEEE, 2003.
  • [88] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
  • [89] Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. Multi-oriented scene text detection via corner localization and region segmentation. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.
  • [90] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
  • [91] Abdelhamid Mammeri, Azzedine Boukerche, et al. Mser-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE Symposium on Computers and Communication (ISCC), pages 1218–1223. IEEE, 2016.
  • [92] Abdelhamid Mammeri, El-Hebri Khiari, and Azzedine Boukerche. Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th Vehicular Technology Conference (VTC Fall), pages 1–5. IEEE, 2014.
  • [93] Anand Mishra, Karteek Alahari, and CV Jawahar. An mrf model for binarization of natural scene text. In ICDAR-International Conference on Document Analysis and Recognition. IEEE, 2011.
  • [94] Anand Mishra, Karteek Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In BMVC-British Machine Vision Conference. BMVA, 2012.
  • [95] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 1454–1459. IEEE, 2017.
  • [96] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
  • [97] Lukas Neumann and Jiri Matas. On combining multiple segmentations in scene text recognition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 523–527. IEEE, 2013.
  • [98] Lukas Neumann and Jiri Matas. A method for text localization and recognition in real-world images. In Asian Conference on Computer Vision, pages 770–783. Springer, 2010.
  • [99] Lukáš Neumann and Jiří Matas. Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3538–3545. IEEE, 2012.
  • [100] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015.
  • [101] Shigueo Nomura, Keiji Yamanaka, Osamu Katai, Hiroshi Kawakami, and Takayuki Shiose. A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11):1961–1975, 2005.
  • [102] Christopher Parkinson, Jeffrey J Jacobsen, David Bruce Ferguson, and Stephen A Pombo. Instant translation system, November 29 2016. US Patent 9,507,772.
  • [103] Andrei Polzounov, Artsiom Ablavatski, Sergio Escalera, Shijian Lu, and Jianfei Cai. Wordfence: Text detection in natural images with border awareness. ICIP/ICPR, 2017.
  • [104] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and Chew Lim Tan. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 569–576, 2013.
  • [105] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • [106] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
  • [107] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [108] Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 41(18):8027–8048, 2014.
  • [109] Jose A Rodriguez-Serrano, Albert Gordo, and Florent Perronnin. Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3):193–207, 2015.
  • [110] Jose A Rodriguez-Serrano, Florent Perronnin, and France Meylan. Label embedding for text recognition. In Proceedings of the British Machine Vision Conference. Citeseer, 2013.
  • [111] Li Rong, En MengYi, Li JianQiang, and Zhang HaiBin. Weakly supervised text attention network for generating text proposals in scene images. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 324–330. IEEE, 2017.
  • [112] Xuejian Rong, Chucai Yi, and Yingli Tian. Unambiguous text localization and retrieval for cluttered scenes. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3287. IEEE, 2017.
  • [113] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. Springer International Publishing, 2015.
  • [114] Partha Pratim Roy, Umapada Pal, Josep Llados, and Mathieu Delalandre. Multi-oriented and multi-sized touching character segmentation using dynamic programming. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 11–15. IEEE, 2009.
  • [115] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [116] Georg Schroth, Sebastian Hilsenbeck, Robert Huitl, Florian Schweiger, and Eckehard Steinbach. Exploiting text-related features for content-based image retrieval. In 2011 IEEE International Symposium on Multimedia, pages 77–84. IEEE, 2011.
  • [117] Ruth Schulz, Ben Talbot, Obadiah Lam, Feras Dayoub, Peter Corke, Ben Upcroft, and Gordon Wyeth. Robot navigation using human cues: A robot navigation system for symbolic goal-directed exploration. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA 2015), pages 1100–1105. IEEE, 2015.
  • [118] Asif Shahab, Faisal Shafait, and Andreas Dengel. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1491–1496. IEEE, 2011.
  • [119] Zhang Sheng, Liu Yuliang, Jin Lianwen, and Luo Canjie. Feature enhancement network: A refined scene text detector. In Proceedings of AAAI, 2018.
  • [120] Karthik Sheshadri and Santosh Kumar Divvala. Exemplar driven character recognition in the wild. In BMVC, pages 1–10, 2012.
  • [121] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [122] Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11):2298–2304, 2017.
  • [123] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4168–4176, 2016.
  • [124] Baoguang Shi, Mingkun Yang, XingGang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 31(11):855–868, 2018.
  • [125] Baoguang Shi, Cong Yao, Minghui Liao, Mingkun Yang, Pei Xu, Linyan Cui, Serge Belongie, Shijian Lu, and Xiang Bai. Icdar2017 competition on reading chinese text in the wild (rctw-17). In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 1429–1434. IEEE, 2017.
  • [126] Cunzhao Shi, Chunheng Wang, Baihua Xiao, Yang Zhang, Song Gao, and Zhong Zhang. Scene text recognition using part-based tree-structured character detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2961–2968. IEEE, 2013.
  • [127] Palaiahnakote Shivakumara, Souvik Bhowmick, Bolan Su, Chew Lim Tan, and Umapada Pal. A new gradient based character segmentation method for video text recognition. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 126–130. IEEE, 2011.
  • [128] Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The computer journal, 16(1):30–34, 1973.
  • [129] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [130] Bolan Su and Shijian Lu. Accurate scene text recognition based on recurrent neural network. In Asian Conference on Computer Vision, pages 35–48. Springer, 2014.
  • [131] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [132] Shangxuan Tian, Shijian Lu, and Chongshou Li. Wetext: Scene text detection under weak supervision. In Proc. ICCV, 2017.
  • [133] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In Proceedings of European Conference on Computer Vision (ECCV), pages 56–72. Springer, 2016.
  • [134] Sam S Tsai, Huizhong Chen, David Chen, Georg Schroth, Radek Grzeszczuk, and Bernd Girod. Mobile visual search on printed documents using text and low bit-rate features. In 18th IEEE International Conference on Image Processing (ICIP), pages 2601–2604. IEEE, 2011.
  • [135] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai, and Cong Yao. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090. IEEE, 2012.
  • [136] Seiichi Uchida. Text localization and recognition in images and video. In Handbook of Document Image Processing and Recognition, pages 843–883. Springer, 2014.
  • [137] Stijn Marinus Van Dongen. Graph clustering by flow simulation. PhD thesis, 2000.
  • [138] Steffen Wachenfeld, H-U Klein, and Xiaoyi Jiang. Recognition of screen-rendered text. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 2, pages 1086–1089. IEEE, 2006.
  • [139] Toru Wakahara and Kohei Kita. Binarization of color character strings in scene images using k-means clustering and support vector machines. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 274–278. IEEE, 2011.
  • [140] Cong Wang, Fei Yin, and Cheng-Lin Liu. Scene text detection with novel superpixel based character candidate extraction. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 929–934. IEEE, 2017.
  • [141] Fangfang Wang, Liming Zhao, Xi Li, Xinchao Wang, and Dacheng Tao. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1381–1389, 2018.
  • [142] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464. IEEE, 2011.
  • [143] Kai Wang and Serge Belongie. Word spotting in the wild. In Proceedings of European Conference on Computer Vision (ECCV), pages 591–604. Springer, 2010.
  • [144] Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308. IEEE, 2012.
  • [145] Jerod Weinman, Erik Learned-Miller, and Allen Hanson. Fast lexicon-based scene text recognition with sparse belief propagation. In ICDAR, pages 979–983. IEEE, 2007.
  • [146] Christian Wolf and Jean-Michel Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280–296, 2006.
  • [147] Dao Wu, Rui Wang, Pengwen Dai, Yueying Zhang, and Xiaochun Cao. Deep strip-based network with cascade learning for scene text localization. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 826–831. IEEE, 2017.
  • [148] Yue Wu and Prem Natarajan. Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5000–5009, 2017.
  • [149] Chuhui Xue, Shijian Lu, and Fangneng Zhan. Accurate scene text detection through border semantics awareness and bootstrapping. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
  • [150] Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C Lee Giles. Learning to read irregular text with attention mechanisms. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3280–3286, 2017.
  • [151] Cong Yao, Xiang Bai, and Wenyu Liu. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11):4737–4749, 2014.
  • [152] Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
  • [153] Cong Yao, Xiang Bai, Baoguang Shi, and Wenyu Liu. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4042–4049, 2014.
  • [154] Qixiang Ye and David Doermann. Text detection and recognition in imagery: A survey. IEEE transactions on pattern analysis and machine intelligence, 37(7):1480–1500, 2015.
  • [155] Qixiang Ye, Wen Gao, Weiqiang Wang, and Wei Zeng. A robust text detection algorithm in images and video frames. IEEE ICICS-PCM, pages 802–806, 2003.
  • [156] Chucai Yi and YingLi Tian. Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9):2594–2605, 2011.
  • [157] Fei Yin, Yi-Chao Wu, Xu-Yao Zhang, and Cheng-Lin Liu. Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727, 2017.
  • [158] Xu-Cheng Yin, Wei-Yi Pei, Jun Zhang, and Hong-Wei Hao. Multi-orientation scene text detection with adaptive clustering. IEEE transactions on pattern analysis and machine intelligence, 37(9):1930–1937, 2015.
  • [159] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao. Robust text detection in natural scene images. IEEE transactions on pattern analysis and machine intelligence, 36(5):970–983, 2014.
  • [160] Xu-Cheng Yin, Ze-Yu Zuo, Shu Tian, and Cheng-Lin Liu. Text detection, tracking and recognition in video: A comprehensive survey. IEEE Transactions on Image Processing, 25(6):2752–2773, 2016.
  • [161] Tai-Ling Yuan, Zhe Zhu, Kun Xu, Cheng-Jun Li, and Shi-Min Hu. Chinese text in the wild. arXiv preprint arXiv:1803.00085, 2018.
  • [162] Xiaoyong Yuan, Pan He, and Xiaolin Andy Li. Adaptive adversarial attack on scene text recognition. arXiv preprint arXiv:1807.03326, 2018.
  • [163] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170, 2017.
  • [164] Fangneng Zhan, Shijian Lu, and Chuhui Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. 2018.
  • [165] DongQuin Zhang and Shih-Fu Chang. A bayesian framework for fusing multiple word knowledge models in videotext recognition. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2003.
  • [166] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4159–4167, 2016.
  • [167] Zhou Zhiwei, Li Linlin, and Tan Chew Lim. Edge based binarization for video text images. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 133–136. IEEE, 2010.
  • [168] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [169] Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4961–4970. IEEE, 2017.
  • [170] Anna Zhu, Renwu Gao, and Seiichi Uchida. Could scene context be beneficial for scene text detection? Pattern Recognition, 58:204–215, 2016.
  • [171] Xiangyu Zhu, Yingying Jiang, Shuli Yang, Xiaobing Wang, Wei Li, Pei Fu, Hua Wang, and Zhenbo Luo. Deep residual text detection network for scene text. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, pages 807–812. IEEE, 2017.
  • [172] Yingying Zhu, Cong Yao, and Xiang Bai. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1):19–36, 2016.
  • [173] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of European Conference on Computer Vision (ECCV), pages 391–405. Springer, 2014.