Symmetry-constrained Rectification Network for Scene Text Recognition

08/06/2019 · by Mingkun Yang et al. · Huazhong University of Science & Technology, Peking University

Reading text in the wild is a very challenging task due to the diversity of text instances and the complexity of natural scenes. Recently, the community has paid increasing attention to the problem of recognizing text instances with irregular shapes. One intuitive and effective way to handle this problem is to rectify irregular text to a canonical form before recognition. However, these methods might struggle when dealing with highly curved or distorted text instances. To tackle this issue, we propose in this paper a Symmetry-constrained Rectification Network (ScRN) based on local attributes of text instances, such as center line, scale and orientation. Such constraints, together with an accurate description of text shape, enable ScRN to generate better rectification results than existing methods and thus lead to higher recognition accuracy. Our method achieves state-of-the-art performance on text with both regular and irregular shapes. Specifically, the system outperforms existing algorithms by a large margin on datasets that contain a large proportion of irregular text instances, e.g., ICDAR 2015, SVT-Perspective and CUTE80.




1 Introduction

Scene text reading [60, 32, 59, 53, 54, 36, 35] is an important and active research area in computer vision, with a wide range of real-world applications such as self-driving cars, assistive positioning systems, and guide board recognition [43]. Scene text recognition, which aims at converting text regions in images into machine-readable symbols, is a critical step in scene text reading systems. It remains challenging due to complex backgrounds, irregular shapes, varying fonts, non-uniform illumination, etc.

Figure 1: Comparison between ASTER [46] and ScRN (proposed in this paper), shown in (a) and (b) respectively.

Text instances in real-world scenarios have diverse shapes, e.g., in horizontal, oriented, or curved forms. Many works focus on dealing with irregular text instances. AON [8] applies sequence recognition in four different directions, which enables the recognition model to handle oriented text instances. RARE [45] and ASTER [46] employ a rectification module before recognition, which improves text recognition accuracy by rectifying irregularly shaped text into regular forms. These rectification modules are based on the spatial transformer network (STN) [21], which predicts the control points of the text outlines in a weakly supervised way, as shown in Fig. 1(a). They can handle text of various shapes with only word-level supervision. Ideally, the control points should spread evenly along the upper and lower edges of the text region, and each pair of upper and lower points should be symmetrical about the center line of the text. However, these STN-based methods predict the control points separately and neglect these priors. Without constraints enforcing such priors, the rectification can be unsatisfactory for highly curved or distorted text.

Figure 2: Pipeline of the proposed method.

To further improve the performance of irregular text rectification, we propose a Symmetry-constrained Rectification Network (ScRN) that uses the center line of each text instance and adds symmetrical constraints via geometrical attributes, including the orientation of the text center line and the orientations and scales of the characters. Specifically, the text center line flexibly describes the pose of either straight or curved text, and its associated geometrical attributes reliably estimate the orientation and the vertical extent of text lines. Furthermore, the generation process of the control points enforces the symmetrical constraints in their spatial distribution. ScRN is a simple segmentation network consisting of only two convolutional layers; it therefore incurs negligible computation and storage overhead when combined with a text recognizer. Compared with previous STN-based rectification methods, ScRN is superior in both robustness and interpretability, profiting from its symmetric constraints. In this way, ScRN further improves text recognition accuracy by enhancing the rectification of irregular text, as illustrated in Fig. 1.


The main contribution of this paper lies in the proposed rectification network for scene text recognition, whose advantages are three-fold. 1) The novel rectification network is more precise and robust, owing to the elaborate description of text shape and the explicit symmetric constraints. 2) It is a simple and lightweight segmentation network, so the extra computational complexity is negligible when combined with existing text recognizers. 3) With the rectification network, we achieve state-of-the-art performance on the standard scene text recognition benchmarks.

2 Related Work

2.1 Text Recognition

Existing works on scene text recognition can be roughly divided into traditional and deep learning based methods.

A popular pipeline of traditional methods [10, 50, 49, 51, 38, 37, 39, 55, 4] follows a bottom-up architecture: they first localize every character with a character proposal extractor, then filter the proposals with a character classifier, and finally group the remaining characters into words.

Deep learning based methods [20, 17, 25, 48, 45, 44, 7, 29, 3, 8, 46, 57] have dominated this area in recent years. Jaderberg et al. [20] cast scene text recognition as a word classification problem using a CNN classifier. However, this is limited to a pre-defined vocabulary. To overcome this limitation, various sequence-to-sequence models [44, 45, 7, 8, 46], which do not rely on pre-defined vocabularies, have been applied to scene text recognition. These methods can be roughly divided into two subcategories by their sequence decoders: one is based on Connectionist Temporal Classification (CTC) [13, 44, 17], while the other is based on attention decoders [8, 46, 7]. We refer readers to [32] for a more comprehensive survey.

2.2 Irregular Text Recognition

Irregular text includes, but is not limited to, oriented or perspectively distorted text and curved text. Recently, irregular text recognition [40, 52, 45, 8, 46, 30] has become popular. Cheng et al. [8] encode the input image into four feature sequences of four directions to handle text of oriented shapes. Yang et al. [52] add character-level supervision to guide attention learning on the 2D feature maps. Liu et al. [30] introduce "clean" images, which contain no geometric deformation, to supervise the learning process at both the pixel level and the feature level in a generative way. With such a generator-discriminator architecture, their model can handle text on a curved path but fails on text with cluttered backgrounds. Shi et al. [45, 46] propose to add a rectification module before recognition: with only word-level supervision, they adopt the spatial transformer network (STN) [21] to rectify text in a weakly supervised manner. To improve the rectification results, Li et al. [26] bring extra supervision to the STN and upgrade the model to a semi-supervised multi-task learning system by labeling transformation parameters for a portion of the dataset. The control points are expected to spread evenly along the upper and lower edges of text, and paired upper and lower points should be symmetrical about the center line of text. Nevertheless, these rectification modules predict the control points separately and do not explicitly consider such constraints, which limits their rectification quality. Our proposed method applies the constraints via geometrical attributes of text instances to rectify irregular text, gaining both robustness and interpretability.

3 Our Method

As illustrated in Fig. 2, the proposed pipeline consists of three major parts: the shared backbone network, the rectification network and the recognition network. The model is end-to-end trainable and integrates text rectification and recognition within a unified framework. The backbone is an FPN [28] equipped with ResNet-50 [16], shared by the rectification module and the recognition module. Using the shared feature maps, the rectification module yields dense pixel-wise predictions of text geometric attributes, with which the shared feature maps are rectified into regular ones via a Thin-Plate-Spline (TPS) [6] transformation. Finally, the rectified feature maps are translated into a character sequence by the recognition module, where a shallow network converts the maps into sequential features, followed by an attention decoder. The rectification module and recognition module are detailed in Sec. 3.1 and Sec. 3.2, respectively.

3.1 Rectification Module

The definition of the text shape is critical for text rectification, because the rectification process can be considered a shape transformation. Zhang et al. [58] design text proposals according to symmetry, while Long et al. [33] adopt local geometrical attributes to represent the text shape for scene text detection. From the analysis in Sec. 1, we conclude that symmetrical constraints are necessary for precise text rectification. To add such constraints into our rectification module, we use the text center line with its associated geometrical attributes, namely the character orientation, the text orientation and the text scale, to describe the shape of a text instance. In this section, we first introduce the new representation for text rectification. Then we describe how to rectify text images with the given geometric attributes. Finally, we highlight the necessity of the character orientation for accurate rectification.

3.1.1 Definition

The geometrical attributes of text used for rectification are illustrated in Fig. 3: the text center line, the scale, the character orientation and the text orientation.

A text instance can be viewed as an ordered sequence of n characters. Each character has a bounding box annotated as a free-form quadrilateral. First, we construct a center point list, which consists of the center point of each character box together with the midpoint of the first box's left edge and the midpoint of the last box's right edge. The text center line (TCL) is then constructed by linking the center points in sequential order. Each center point is associated with a group of geometrical attributes: the scale, the character orientation and the text orientation. Specifically, the scale is half the height of the character. The text orientation is defined as the tangential direction of the center line. The character orientation is defined as the direction from the midpoint of the top edge to the midpoint of the bottom edge of the character box. For points on the TCL that are not in the center point list, the geometrical attributes are linearly interpolated from the two nearest center points. In this way, the shape of the text instance is precisely described and can be leveraged in the subsequent rectification step.
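The interpolation step above can be sketched as follows. This is a minimal illustration in our own notation (the function name, the parameterization by t, and the array shapes are our assumptions, not from the paper): attributes attached to the discrete center points are linearly blended for any intermediate position along the center line.

```python
import numpy as np

def interpolate_attributes(center_pts, attrs, t):
    """Linearly interpolate per-point geometric attributes (scale,
    orientation terms, ...) at parameter t in [0, 1] along the TCL.
    center_pts: (n, 2) ordered center points; attrs: (n, k) attributes,
    one row per center point."""
    n = len(center_pts)
    pos = t * (n - 1)            # continuous index along the point list
    i = min(int(np.floor(pos)), n - 2)
    w = pos - i                  # blend weight between point i and i+1
    return (1 - w) * attrs[i] + w * attrs[i + 1]
```

A point halfway between two center points simply receives the average of their attributes.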

Figure 3: Illustration of the text representation.

3.1.2 Geometrical Attributes Prediction

The rectification process is shown in Fig. 4. To yield the geometrical attributes, we employ a lightweight predictor that consists of only two convolutional layers. The predictor outputs six maps: a TCL score map giving the probability of each pixel lying on the TCL; a scale map giving the character scale at each pixel; and four orientation maps, the pixel-wise sine and cosine of the character orientation and of the text orientation. Each sine/cosine pair is normalized so that its quadratic sum equals 1, as depicted in Eqn. (1). After that, the TCL score map, the scale map and the text orientation maps are used to extract the central point list, whose length is variable; we refer readers to [33] for details of this process. The central point list, the scale map and the character orientation maps are then used for rectification.
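The normalization of each predicted sine/cosine pair can be sketched as below (a minimal sketch; the function name and the eps guard are our additions). Dividing both raw maps by their joint magnitude guarantees that the pair lies on the unit circle, i.e. its quadratic sum equals 1.

```python
import numpy as np

def normalize_sin_cos(sin_map, cos_map, eps=1e-6):
    """Rescale raw (sin, cos) predictions so that sin^2 + cos^2 = 1
    at every pixel, as required for a valid orientation encoding."""
    norm = np.sqrt(sin_map ** 2 + cos_map ** 2) + eps
    return sin_map / norm, cos_map / norm
```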

Figure 4: The rectification process. Note that, for all figures in this paper, we use the input image to illustrate these points and rectified results, but the rectification is actually operated on the shared feature maps.

3.1.3 Rectification

Thin-Plate-Spline (TPS) transformation is employed to rectify the shared feature maps into regular ones. To compute the TPS transformation, we need a pair of point sets: the fiducial points in the irregular feature maps and the predefined anchor points on the rectified maps. The procedure is given in Fig. 4. First, we equidistantly sample N points from the central point list. For each sampled center point p_i, with scale s_i and unit character-orientation vector d_i, we take two points at a distance of s_i along the character orientation, one on each side. The coordinates of the two points are computed via

t_i = p_i + s_i d_i,    b_i = p_i - s_i d_i.

The anchor points are evenly placed along the top and bottom borders of the regular feature maps. Given the two point sets, the transformation matrix is calculated. Then, we apply the transformation to every pixel location in the regular feature maps to obtain a sampling grid on the irregular maps, from which the rectified feature maps are sampled using bilinear interpolation.

Theoretically, the TPS transformation can handle a variable number of fiducial points, so the central point list could be used directly. However, to build a mini-batch for batch-wise training, the number of fiducial points must be predefined and fixed. Therefore, we resample the central point list to a fixed length.
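The symmetric placement of control points described above can be sketched as follows (a minimal sketch in our own notation; the function name and array layout are our assumptions). Each pair of top/bottom control points is generated from the same center point by stepping a scale-length along the character orientation in opposite directions, which enforces the symmetry about the center line by construction.

```python
import numpy as np

def control_points(centers, scales, sin_o, cos_o):
    """Generate paired control points symmetrically about each sampled
    center point: t_i = p_i + s_i * d_i and b_i = p_i - s_i * d_i,
    where d_i = (cos_o_i, sin_o_i) is the unit character orientation.
    centers: (N, 2); scales, sin_o, cos_o: (N,)."""
    d = np.stack([cos_o, sin_o], axis=1)          # unit orientation vectors
    top = centers + scales[:, None] * d
    bottom = centers - scales[:, None] * d
    return top, bottom
```

Because top and bottom points are derived from one shared center and scale, the symmetry prior is guaranteed rather than merely hoped for, unlike independently regressed STN control points.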

3.1.4 Character Orientation

When the bounding boxes of all characters are rectangular, the character orientation is perpendicular to the text orientation. In more general cases, however, the direction perpendicular to the text orientation is not the correct character orientation, which may lead to failed rectification. As illustrated in Fig. 5, when the normal direction of the center line differs from the character orientation, rectification based on the character orientation is much better. It is therefore necessary to include the character orientation among the geometric attributes used for text rectification.

3.2 Recognition Module

The text recognition module aims to predict a character sequence from the rectified shared features. Using the hierarchical structure of the shared backbone, we obtain an enriched feature map. A sub-network further encodes the map into a vector sequence before it is fed into the final attention decoder. The settings of the recognition module are detailed in Tab. 1.


Figure 5: Control points and rectification results using the character orientation (Left) and normal direction of text orientation (Right).

To preserve more discriminative features for characters in compact text or narrow shapes, the input feature map is downsampled only once along the width axis, while the height axis keeps being collapsed until it reduces to 1. Then, the feature map is converted into a feature sequence by slicing along the width axis with stride 1. Finally, a Bidirectional LSTM [14] is attached to capture long-range dependencies in both directions, resulting in a higher-level feature sequence H.
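The map-to-sequence conversion can be sketched as below (a simplified illustration with our own names; the paper's sub-network collapses the height progressively with convolutions and pooling, which we abbreviate here as a single max over the height axis): a (C, H, W) map becomes W frame vectors of dimension C, one per horizontal position, ready for the BiLSTM.

```python
import numpy as np

def map_to_sequence(fmap):
    """Collapse the height axis of a (C, H, W) feature map to 1
    (abbreviated here as a max over height), then slice along the
    width axis with stride 1, yielding W frames of dimension C."""
    collapsed = fmap.max(axis=1)                   # (C, W)
    return [collapsed[:, i] for i in range(collapsed.shape[1])]
```

Keeping the width largely intact is what preserves per-character resolution for compact or narrow text.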

Next, a common attention-based decoder [2] equipped with a GRU [9] is adopted to translate the feature sequence H into a symbol sequence. To generate sequences of variable length, a special end-of-sequence symbol (EOS) is appended to the target sequence. The decoder iteratively predicts one symbol per step until EOS is emitted.

Given the input image I, the recognition loss can be formulated as

L_rec = - sum_{t=1}^{T} log p(y_t | I),

where y_t is the ground-truth symbol at step t and T is the length of the target sequence, EOS included.
Layer Name  Configuration  Out Size
…  …  …
BiLSTM  256  31
fc  nc  nc
Table 1: The architecture of the recognition module. The configuration column lists the kernel settings for convolutional and max-pooling layers, and the number of features for the LSTM hidden state and fully-connected layers. "Out size" is the feature map size of convolutional layers or the sequence length of recurrent layers. "nc" is the number of symbols.

3.3 Training and Inference

3.3.1 Training Objective

We add explicit supervision to both the rectification module and the recognition module. The whole network is trained end-to-end with the following objective function:

L = L_rec + L_geo.    (4)

For an input image, the loss comprises two parts, as shown in Eqn. (4): the recognition loss L_rec and the geometry loss L_geo, which measures the deviation of the predicted geometrical attributes from the ground truth. We train our model with SynthText [15] and Synth90k [18]. Since Synth90k has no character-level or word-level bounding box annotations, it is not used to supervise the geometrical attribute prediction.

L_geo is a weighted sum of six terms: a cross-entropy loss for the TCL score map, and Smoothed-L1 losses [11] for the scale map and the four orientation maps (the sine and cosine of the character orientation and of the text orientation), each comparing the predicted values with their ground truth. The loss weight for pixels outside the TCL is set to 0, since the geometrical attributes are meaningless for non-TCL points. The six balancing hyper-parameters are all set to 1 in our experiments.
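A minimal sketch of the masked geometry loss for the regression terms follows (function names, the dict-based interface and the normalization by the mask size are our assumptions; the TCL cross-entropy term is omitted for brevity). The TCL mask zeroes out pixels outside the center line, implementing the rule that non-TCL points contribute nothing.

```python
import numpy as np

def smooth_l1(pred, gt):
    """Smoothed-L1 [11]: quadratic for small errors, linear otherwise."""
    d = np.abs(pred - gt)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)

def geometry_loss(preds, gts, tcl_mask, weights=None):
    """Weighted sum of per-attribute Smoothed-L1 losses, masked so
    pixels outside the TCL have weight 0. preds/gts map attribute
    names to arrays; all balancing weights default to 1."""
    weights = weights or {k: 1.0 for k in preds}
    total = 0.0
    for name in preds:
        per_pixel = smooth_l1(preds[name], gts[name]) * tcl_mask
        total += weights[name] * per_pixel.sum() / max(tcl_mask.sum(), 1)
    return total
```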

3.3.2 Training Strategy

The feature maps generated from the backbone are shared by both the rectification module and the recognition module. Our training strategy is two-staged. In the first stage, the shared features are rectified with the ground truth geometrical attributes. Then, the rectified features are used for the recognition module training. Since Synth90k is not annotated with geometrical attributes, the shared features from Synth90k are not rectified in this stage. In the second stage, we use the predicted geometrical attributes for rectification. In this stage, all shared features are rectified before being fed into the recognition module.
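The stage-dependent choice of rectification attributes can be summarized by a small helper (a sketch with our own names; the paper describes the policy in prose): in stage one, ground-truth attributes drive the rectification and Synth90k samples, which lack them, bypass it; in stage two, the predicted attributes are always used.

```python
def attributes_for_rectification(batch, stage, predicted):
    """Stage 1: rectify with ground-truth attributes when available
    (None for Synth90k samples, which then skip rectification).
    Stage 2: always rectify with the predicted attributes."""
    if stage == 1:
        return batch["gt_attrs"]    # may be None -> no rectification
    return predicted
```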

Figure 6: Selected results from SVTP and CUTE80, which suffer from severe distortion. For every three rows, the first row shows the input image with evenly sampled center points (visualized as red points) and green control points. The second row shows the rectified images. The last row is the recognition results.

4 Experiments

4.1 Datasets

We evaluate our method on four general benchmarks, which mainly consist of regular text instances, and three datasets with irregular text instances, to demonstrate its rectification ability on curved, distorted and oriented text. A brief description of these datasets follows.

IIIT5K-Words (IIIT5K) [37]

contains 3,000 web images for testing. Each image is associated with a 50-word lexicon and a 1k-word lexicon.

Street View Text (SVT) [49] consists of 647 testing images, which are collected from the Google Street View. Many images are heavily corrupted by noise, blur or in low resolution. Each image specifies a 50-word lexicon.

ICDAR 2003 (IC03) [34] contains 860 images of cropped words after filtering. Following Wang et al. [49], words with non-alphanumeric characters or less than three characters are discarded. Each image has a 50-word lexicon and a “full lexicon” which contains all lexicon words.

ICDAR 2013 (IC13) [24] inherits most of its data from IC03 and contains 1,015 cropped word images.

ICDAR 2015 (IC15) [23] is collected via a pair of Google Glasses without careful positioning and focusing. The dataset contains 2,077 images with various distortions.

SVT-Perspective (SVTP) [40] is specifically proposed to evaluate the performance of perspective text recognition algorithms. It consists of 645 images for testing.

CUTE80 (CUTE) [41] is designed to evaluate curved text recognition. It has 288 cropped images for testing.

4.2 Implementation Details

The proposed method is implemented in PyTorch. Images are resized to a fixed size before being fed into the network. The feature maps produced by the shared backbone and the rectified feature maps have the same resolution, a fixed fraction of the input size; the ground-truth maps share this resolution as well. We expand the TCL by one pixel on each side, since a single-pixel line is prone to noise; the geometrical attributes of the expanded points are copied from the nearest point on the original TCL. To apply the TPS transformation in mini-batches, we equidistantly sample N = 10 points after the central point list is extracted. In total, 95 symbols are recognized, including digits, upper-case and lower-case letters, 32 punctuation marks and an end-of-sequence symbol (EOS).

Our model is trained on SynthText and Synth90k from scratch. We adopt ADADELTA [56] with default hyper-parameters (rho = 0.9, eps = 1e-6, weight decay = 0) to minimize the objective function. Each mini-batch has 512 samples, randomly selected from the two datasets. As mentioned in Sec. 3.3.2, our model is trained in two stages. In the first stage, we set the initial learning rate to 1.0 and decay it to 0.1 at the 4th epoch and to 0.01 at the 5th epoch; the first stage finishes after the 6th epoch. In the second stage, the predicted geometrical attributes are used for rectification, and the model is trained for one more epoch. All models are trained on 4 NVIDIA TITAN Xp graphics cards.
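The piecewise-constant schedule above can be written as a small lookup (epochs are 1-indexed; keeping stage 2 at the final stage-1 rate is our assumption, since the paper does not state the stage-2 learning rate):

```python
def learning_rate(stage, epoch):
    """Stage 1: 1.0 for epochs 1-3, 0.1 at epoch 4, 0.01 from epoch 5
    through the end of stage 1 (epoch 6). Stage 2 (one extra epoch):
    kept at 0.01 here -- an assumption, not stated in the paper."""
    if stage == 2:
        return 0.01
    if epoch <= 3:
        return 1.0
    if epoch == 4:
        return 0.1
    return 0.01
```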

4.3 Effect of Rectification

To analyze the effect of rectification, we remove the rectification module from our pipeline as the baseline, where the feature maps generated from the backbone are fed into the recognition module directly. As shown in Tab. 3, the model with the rectification module outperforms the baseline on nearly all datasets, particularly on IC15 (+1.8%), SVTP (+3.1%) and CUTE80 (+3.1%). The significant improvements on the three irregular text datasets demonstrate the effectiveness of the proposed ScRN. Furthermore, the attached rectification module needs only negligible computation and storage overhead, since it consists of two convolutional layers and some simple post-processing. Specifically, the baseline model and our method spend 12 ms and 13 ms per image at inference, respectively.

To further explain the improvements, we visualize several images with different types of distortion to illustrate the rectification results in Fig. 6. With the proposed geometrical attributes, ScRN obtains a precise description of the text shape, which results in evenly placed control points along the top and bottom text edges. Therefore, the subsequent TPS transformation can easily rectify these irregular text images. Although some rectified images are still slightly distorted, they become more readable than the originals and can be correctly recognized.

Variants IIIT5k SVT IC03 IC13 IC15 SVTP CUTE
baseline 88.4 79.9 92.1 88.9 67.3 66.5 80.6
multi-loss 87.6 79.1 91.3 90.0 67.0 66.7 79.5
ours 88.5 81.3 91.2 90.0 68.8 68.2 81.9
Table 2: Recognition accuracy to explore the effect of rectification module. All models are trained on SynthText only.

Unlike ASTER, our model is trained end-to-end with both the recognition loss and the geometry loss. To clarify whether the extra geometry loss or the rectification module improves the recognition results, we study another variant of the proposed model, called the multi-loss baseline, in which the geometry loss is retained but the rectification module is discarded. In this part, the baseline, the multi-loss baseline and our method are trained on SynthText only. Their performance is given in Tab. 2. Compared to the baseline model, the multi-loss variant achieves comparable results, while our method obtains improvements on most datasets, except for a slight decrease on IC03. These results reveal that the improvements derive from the rectification module rather than the extra geometry loss.

4.4 Comparison with STN-based Methods

Figure 7: Rectified results produced by our proposed ScRN, STN_baseline and STN_supervision, as well as their corresponding recognition results. Red characters are mistakenly recognized characters. Underlines in red represent the missed characters.

In this section, we compare our method with two STN-based methods. One is ASTER, a well-known STN-based method; the other is similar to ASTER but injects extra supervision into the STN. We do not compare with ASTER directly here, since our method rectifies the shared feature maps instead of raw images for the sake of complexity and efficiency. Therefore, we build an STN-based model, named STN_baseline, for a fair comparison. STN_baseline shares the same backbone and recognition module with our method; it only replaces our rectification module with an STN whose architecture is similar to the rectification network of ASTER. The second STN-based method has the same structure as STN_baseline, the only difference being extra supervision to further improve the accuracy of the predicted control points. This variant derives from [26], and we name it STN_supervision. All methods share the same training strategy. The results are shown in Tab. 3. Overall, STN_baseline outperforms the baseline model while performing slightly worse than STN_supervision, consistent with the findings of ASTER and [26]. We detail the comparisons with our method as follows.

On the datasets with irregular text such as IC15, SVTP and CUTE80, our method outperforms STN_baseline by 0.5%, 1.4% and 1.7%, respectively. With the extra supervision, STN_supervision slightly exceeds STN_baseline but still performs worse than our method by 0.2%, 1.1% and 1.0% on IC15, SVTP and CUTE80, respectively. Profiting from the elaborate description of the text pose, our rectification is more robust and accurate. In Fig. 7, we show rectified results yielded by the three methods. Overall, the STN-based methods suffer on heavily curved cases and predict imprecise control points, which leads to wrong rectifications, while our method works well. Although our method fails to perfectly rectify text images with cluttered backgrounds or rare fonts, it still obtains more readable results.

The results reveal that the geometrical attributes are more helpful for control point generation than either the weakly supervised network or the simply supervised one. Besides, the prediction network for the geometrical attributes is much smaller and is trained only on synthetic data, which is efficient and inexpensive.

We also study a variant of our model, shown in the last row of Tab. 3, in which we apply the rectification module to the input image rather than the shared feature maps. In this variant, the backbone network is run twice without sharing parameters, so the elapsed time and the model size are nearly doubled. Compared with this variant, our method achieves comparable or even better results while avoiding the heavy computation and space cost.

Variants IIIT5k SVT IC03 IC13 IC15 SVTP CUTE
baseline 94.4 86.9 94.7 93.6 76.9 77.7 84.4
STN_baseline 94.1 87.6 95.0 93.2 78.2 79.4 85.8
STN_supervision 94.0 88.1 94.9 93.8 78.5 79.7 86.5
ScRN 94.4 88.9 95.0 93.9 78.7 80.8 87.5
ScRN (on image) 95.0 88.4 95.6 93.7 78.4 81.1 90.6
Table 3: Recognition accuracy of different variants.

4.5 Comparison with State of the Art

Methods IIIT5k SVT IC03 IC13 IC15 SVTP CUTE80
50 1k 0 50 0 50 Full 0 0 0 0 0
Wang et al. [49] - - - 57.0 - 76.0 62.0 - - - - -
Mishra et al. [38] 64.1 57.5 - 73.2 - 81.8 67.8 - - - - -
Wang et al. [51] - - - 70.0 - 90.0 84.0 - - - - -
Bissacco et al. [5] - - - - - 90.4 78.0 - 87.6 - - -
Almazan et al. [1] 91.2 82.1 - 89.2 - - - - - - - -
Yao et al. [55] 80.2 69.3 - 75.9 - 88.5 80.3 - - - - -
Rodríguez-Serrano et al. [42] 76.1 57.4 - 70.0 - - - - - - - -
Jaderberg et al. [22] - - - 86.1 - 96.2 91.5 - - - - -
Su and Lu [47] - - - 83.0 - 92.0 82.0 - - - - -
Gordo [12] 93.3 86.6 - 91.8 - - - - - - - -
Jaderberg et al. [20] 97.1 92.7 - 95.4 80.7 98.7 98.6 93.1 90.8 - - -
Jaderberg et al. [19] 95.5 89.6 - 93.2 71.7 97.8 97.0 89.6 81.8 - - -
Shi et al. [44] 97.8 95.0 81.2 97.5 82.7 98.7 98.0 91.9 89.6 - - -
Shi et al. [45] 96.2 93.8 81.9 95.5 81.9 98.3 96.2 90.1 88.6 - 71.8 59.2
Lee et al. [25] 96.8 94.4 78.4 96.3 80.7 97.9 97.0 88.7 90.0 - - -
Yang et al. [52] 97.8 96.1 - 95.2 - 97.7 - - - - 75.8 69.3
Cheng et al. [7] 99.3 97.5 87.4 97.1 85.9 99.2 97.3 94.2 93.3 70.6 - -
Cheng et al. [8] 99.6 98.1 87.0 96.0 82.8 98.5 97.1 91.5 - 68.2 73.0 76.8
Liu et al. [29] - - 92.0 - 85.5 - - 92.0 91.1 74.2 78.9 -
Bai et al. [3] 99.5 97.9 88.3 96.6 87.5 98.7 97.9 94.6 94.4 73.9 - -
Liu et al. [31] 97.0 94.1 87.0 95.2 - 98.8 97.9 93.1 92.9 - - -
Liu et al. [30] 97.3 96.1 89.4 96.8 87.1 98.1 97.5 94.7 94.0 - 73.9 62.5
Liao et al. [27] 99.8 98.8 91.9 98.8 86.4 - - - 91.5 - - 79.9
Shi et al. [46] 99.6 98.8 93.4 97.4 89.5 98.8 98.0 94.5 91.8 76.1 78.5 79.5
ScRN (ours) 99.5 98.8 94.4 97.2 88.9 99.0 98.3 95.0 93.9 78.7 80.8 87.5
Table 4: Results across a number of methods and datasets. “50”, “1k”, “Full” are lexicons. “0” means no lexicon.

We also compare our method with previous state-of-the-art models. Tab. 4 summarizes the recognition results on seven text recognition datasets. The IIIT5k, SVT and IC03 datasets provide lexicons to constrain the recognition results; in this setting, the predicted word is replaced by the lexicon word with the smallest edit distance to the original prediction. We achieve the best result in 6 out of 12 settings, compared with other state-of-the-art methods.
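The lexicon-constrained evaluation described above can be sketched as follows (function names are ours; the distance is the standard Levenshtein edit distance): the raw prediction is snapped to the closest lexicon word.

```python
def edit_distance(a, b):
    """Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds the old dp[j-1]; dp[j] is still the old row value
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def constrain_with_lexicon(prediction, lexicon):
    """Replace the raw prediction with the lexicon word at the
    smallest edit distance, as in lexicon-constrained evaluation."""
    return min(lexicon, key=lambda w: edit_distance(prediction, w))
```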

Our method works effectively on datasets containing irregular text. In particular, we obtain an 8% improvement on CUTE80 compared with ASTER, and we outperform other state-of-the-art methods on SVTP and IC15 by 2.3% and 2.6%, respectively. The improvement is attributed to our rectification module, which attenuates text irregularities and thus reduces recognition difficulty. Compared with AON [8], our method provides a more intuitive way to represent text directions. Thanks to the symmetrical constraints brought by the geometrical attributes, our method obtains more precise control points than ASTER.

Although our method mainly targets irregular text recognition, it also achieves comparable or even better performance on regular datasets. Compared with ASTER, we obtain 1%, 0.5% and 2.1% improvements on IIIT5K, IC03 and IC13 with no lexicon, respectively. On SVT, our method performs slightly worse than ASTER, by 0.6%. We conjecture that this is because images in SVT often contain incomplete characters on the left side: a unidirectional attention decoder in left-to-right order suffers from this noise, while the bidirectional decoder in ASTER can alleviate it.

4.6 Limitations

We also illustrate some failure cases produced by ScRN in Fig. 8. In Fig. 8(a), several characters are incorrectly recognized due to imperfect rectification. We observe that our rectification module struggles with curved text whose terminal characters have a nearly horizontal orientation and lie close to the image borders. In Fig. 8(b), ScRN gives satisfactory rectification results, yet the recognizer fails to handle such blurry or occluded cases.

Figure 8: Some bad cases produced by our recognition system. The meanings of these elements are the same as Fig. 6. Incorrectly recognized characters are in red.

Although our rectification module requires character-level annotations, such annotations can be obtained automatically, without manual labeling effort, using the synthesizing engine [15]. In addition, extra images with only word-level annotations, such as Synth90k, can be added to the training set to further improve performance.

5 Conclusion

In this paper, we have proposed a Symmetry-constrained Rectification Network (ScRN) for scene text recognition. Such a flexible module can be easily incorporated into existing recognition models or trained in an end-to-end manner within a unified framework. Our text recognition system incorporating the proposed ScRN achieves state-of-the-art performance on a number of benchmark datasets, especially those with a large portion of irregular text images. Due to the shared backbone, ScRN significantly improves recognition performance while requiring negligible extra computation. Comprehensive experiments demonstrate the effectiveness and robustness of our recognition system. In future work, we would like to extend the proposed method to an end-to-end text recognition system that can deal with text instances of arbitrary shapes.


This work was supported by NSFC 61733007. Dr. Xiang Bai was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team 2017QYTD08. In addition, we sincerely thank Shangbang Long for his help.


  • [1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. TPAMI 36 (12), pp. 2552–2566. Cited by: Table 4.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §3.2.
  • [3] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou (2018) Edit probability for scene text recognition. In CVPR, Cited by: §2.1, Table 4.
  • [4] X. Bai, C. Yao, and W. Liu (2016) Strokelets: a learned multi-scale mid-level representation for scene text recognition. IEEE Transactions on Image Processing 25 (6), pp. 2789–2802. Cited by: §2.1.
  • [5] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven (2013) PhotoOCR: reading text in uncontrolled conditions. In ICCV, pp. 785–792. Cited by: Table 4.
  • [6] F. L. Bookstein (1989) Principal warps: thin-plate splines and the decomposition of deformations. TPAMI 11 (6), pp. 567–585. Cited by: §3.
  • [7] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In ICCV, pp. 5086–5094. Cited by: §2.1, Table 4.
  • [8] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) AON: towards arbitrarily-oriented text recognition. In CVPR, pp. 5571–5579. Cited by: §1, §2.1, §2.2, §4.5, Table 4.
  • [9] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734. Cited by: §3.2.
  • [10] B. Epshtein, E. Ofek, and Y. Wexler (2010) Detecting text in natural scenes with stroke width transform. In CVPR, pp. 2963–2970. Cited by: §2.1.
  • [11] R. Girshick (2015) Fast R-CNN. In ICCV, Cited by: §3.3.1.
  • [12] A. Gordo (2015) Supervised mid-level features for word image representation. In CVPR, pp. 2956–2964. Cited by: Table 4.
  • [13] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376. Cited by: §2.1.
  • [14] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber (2009) A novel connectionist system for unconstrained handwriting recognition. TPAMI 31 (5), pp. 855–868. Cited by: §3.2.
  • [15] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, pp. 2315–2324. Cited by: §3.3.1, §4.6.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.
  • [17] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang (2016) Reading scene text in deep convolutional sequences.. In AAAI, Vol. 16, pp. 3501–3508. Cited by: §2.1.
  • [18] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227. Cited by: §3.3.1.
  • [19] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2015) Deep structured output learning for unconstrained text recognition. In ICLR, Cited by: Table 4.
  • [20] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. IJCV 116 (1), pp. 1–20. Cited by: §2.1, Table 4.
  • [21] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, pp. 2017–2025. Cited by: §1, §2.2.
  • [22] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Deep features for text spotting. In ECCV, pp. 512–528. Cited by: Table 4.
  • [23] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny (2015) ICDAR 2015 competition on robust reading. In Proc. ICDAR, pp. 1156–1160. Cited by: §4.1.
  • [24] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras (2013) ICDAR 2013 robust reading competition. In ICDAR, pp. 1484–1493. Cited by: §4.1.
  • [25] C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pp. 2231–2239. Cited by: §2.1, Table 4.
  • [26] G. Li, S. Xu, X. Liu, L. Li, and C. Wang (2018) Jersey number recognition with semi-supervised spatial transformer network. In CVPR Workshops, pp. 1783–1790. Cited by: §2.2, §4.4.
  • [27] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai (2019) Scene text recognition from two-dimensional perspective. In AAAI, Cited by: Table 4.
  • [28] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §3.
  • [29] W. Liu, C. Chen, and K. K. Wong (2018) Char-net: A character-aware neural network for distorted scene text recognition. In AAAI, pp. 7154–7161. Cited by: §2.1, Table 4.
  • [30] Y. Liu, Z. Wang, H. Jin, and I. J. Wassell (2018) Synthetically supervised feature learning for scene text recognition. In ECCV, pp. 449–465. Cited by: §2.2, Table 4.
  • [31] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu (2018) SqueezedText: a real-time scene text recognition by binary convolutional encoder-decoder network.. In AAAI, pp. 7194–7201. Cited by: Table 4.
  • [32] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §1, §2.1.
  • [33] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) TextSnake: a flexible representation for detecting text of arbitrary shapes. In ECCV, pp. 19–35. Cited by: §3.1.2, §3.1.
  • [34] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young (2003) ICDAR 2003 robust reading competitions. In ICDAR, pp. 682. Cited by: §4.1.
  • [35] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–83. Cited by: §1.
  • [36] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai (2018) Multi-oriented scene text detection via corner localization and region segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7553–7563. Cited by: §1.
  • [37] A. Mishra, K. Alahari, and C. Jawahar (2012) Scene text recognition using higher order language priors. In BMVC, Cited by: §2.1, §4.1.
  • [38] A. Mishra, K. Alahari, and C. Jawahar (2012) Top-down and bottom-up cues for scene text recognition. In CVPR, Cited by: §2.1, Table 4.
  • [39] T. Novikova, O. Barinova, P. Kohli, and V. S. Lempitsky (2012) Large-lexicon attribute-consistent text recognition in natural images. In ECCV, pp. 752–765. Cited by: §2.1.
  • [40] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In ICCV, pp. 569–576. Cited by: §2.2, §4.1.
  • [41] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 41 (18), pp. 8027–8048. Cited by: §4.1.
  • [42] J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin (2015) Label embedding: A frugal baseline for text recognition. IJCV 113 (3), pp. 193–207. Cited by: Table 4.
  • [43] X. Rong, C. Yi, and Y. Tian (2016) Recognizing text-based traffic guide panels with cascaded localization network. In ECCV Workshops, pp. 109–121. Cited by: §1.
  • [44] B. Shi, X. Bai, and C. Yao (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39 (11), pp. 2298–2304. Cited by: §2.1, Table 4.
  • [45] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016) Robust scene text recognition with automatic rectification. In CVPR, pp. 4168–4176. Cited by: §1, §2.1, §2.2, Table 4.
  • [46] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) ASTER: an attentional scene text recognizer with flexible rectification. TPAMI. Cited by: Figure 1, §1, §2.1, §2.2, Table 4.
  • [47] B. Su and S. Lu (2014) Accurate scene text recognition based on recurrent neural network. In ACCV, pp. 35–48. Cited by: Table 4.
  • [48] B. Su and S. Lu (2017) Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recognition 63, pp. 397–405. Cited by: §2.1.
  • [49] K. Wang, B. Babenko, and S. J. Belongie (2011) End-to-end scene text recognition. In ICCV, pp. 1457–1464. Cited by: §2.1, §4.1, §4.1, Table 4.
  • [50] K. Wang and S. Belongie (2010) Word spotting in the wild. In ECCV, pp. 591–604. Cited by: §2.1.
  • [51] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng (2012) End-to-end text recognition with convolutional neural networks. In ICPR, pp. 3304–3308. Cited by: §2.1, Table 4.
  • [52] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles (2017) Learning to read irregular text with attention mechanisms. In IJCAI, pp. 3280–3286. Cited by: §2.2, Table 4.
  • [53] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1083–1090. Cited by: §1.
  • [54] C. Yao, X. Bai, and W. Liu (2014) A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23 (11), pp. 4737–4749. Cited by: §1.
  • [55] C. Yao, X. Bai, B. Shi, and W. Liu (2014) Strokelets: A learned multi-scale representation for scene text recognition. In CVPR, pp. 4042–4049. Cited by: §2.1, Table 4.
  • [56] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. Cited by: §4.2.
  • [57] F. Zhan, S. Lu, and C. Xue (2018) Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In ECCV, pp. 257–273. Cited by: §2.1.
  • [58] Z. Zhang, W. Shen, C. Yao, and X. Bai (2015) Symmetry-based text line detection in natural scenes. In CVPR, pp. 2558–2567. Cited by: §3.1.
  • [59] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 5551–5560. Cited by: §1.
  • [60] Y. Zhu, C. Yao, and X. Bai (2016) Scene text detection and recognition: recent advances and future trends. Frontiers of Computer Science 10 (1), pp. 19–36. Cited by: §1.