ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification
Automated recognition of texts in scenes has been a research challenge for years, largely due to the arbitrary variation of text appearances in perspective distortion, text line curvature, text styles and different types of imaging artifacts. Recent deep networks are capable of learning robust representations with respect to imaging artifacts and text style changes, but still face various problems while dealing with scene texts with perspective and curvature distortions. This paper presents an end-to-end trainable scene text recognition system (ESIR) that iteratively removes perspective distortion and text line curvature as driven by better scene text recognition performance. An innovative rectification network is developed which employs a novel line-fitting transformation to estimate the pose of text lines in scenes. In addition, an iterative rectification pipeline is developed where scene text distortions are corrected iteratively towards a fronto-parallel view. ESIR is also robust to parameter initialization, and its training needs only scene text images and word-level annotations as required by most scene text recognition systems. Extensive experiments over a number of public datasets show that the proposed ESIR is capable of rectifying scene text distortions accurately, achieving superior recognition performance for both normal scene text images and those suffering from perspective and curvature distortions.
Texts in scenes contain high-level semantic information that is very useful in many practical applications such as indoor and outdoor navigation, content-based image retrieval, etc. Accurate and robust recognition of scene texts by machines has been a research challenge due to the huge variation in image background, text appearance, imaging artifacts, etc. The advances in deep learning research and its success in various computer vision problems have pushed the boundary of scene text recognition greatly in recent years [27, 16, 33, 34, 38, 2, 24]. On the other hand, deep learning based approaches still face various problems while dealing with the large amount of scene texts that suffer from arbitrary perspective distortions and text line curvature, as illustrated in Fig. 1.
We design an end-to-end trainable scene text recognition network via iterative rectification (ESIR). The ESIR employs an innovative rectification network that corrects perspective and curvature distortions of scene texts iteratively as illustrated in Fig. 2. The finally rectified scene text image is fed to a recognition network for recognition. The training of the iterative rectification network is driven by better scene text recognition as back-propagated from the recognition network, requiring no other annotations beyond the scene texts as used in most scene text recognition systems.
The proposed rectification network addresses two typical constraints in recent scene text recognition research. The first is robust rectification of perspective and curvature distortions in scene texts. For this we design a novel line-fitting transformation that is powerful and capable of modeling and correcting various scene text distortions reliably. The line-fitting transformation models the middle line of scene texts by using a polynomial which is able to estimate the pose of either straight or curved text lines flexibly, as illustrated in Fig. 3. In addition, it employs line segments which are capable of estimating the orientation and the boundary of text lines in the vertical direction reliably. The proposed rectification network is thus capable of correcting not only perspective distortions in straight text lines, as in spatial transformer networks and the bag-of-keypoints recognizer, but also various curvatures in curved text lines in scenes.
The second is accurate rectification of various perspective and curvature distortions in scene texts. For this we develop an iterative rectification framework that employs multiple feed-forward rectification modules to estimate and correct scene text distortions iteratively. As illustrated in Fig. 2, each iteration takes the image rectified in the last iteration for further distortion estimation as driven by higher scene text recognition accuracy. The iterative rectification is thus capable of producing much more accurate distortion correction compared with state-of-the-art methods that perform only a single distortion estimation and correction [35, 36]. In addition, the iteratively rectified images lead to superior scene text recognition accuracy, especially for datasets that contain a large amount of curved and/or perspectively distorted texts, as described in Experiments.
The contributions of this work are threefold. First, it proposes a novel line-fitting transformation that is flexible and robust for scene text distortion modeling and correction. Second, it designs an iterative rectification framework that clearly improves the scene text rectification and recognition performance with no extra annotations. Third, it develops an end-to-end trainable system that is robust to parameter initialization and achieves superior scene text recognition performance across a number of public datasets.
Existing scene text recognition work can be broadly grouped into two categories. One category adopts a bottom-up approach that first detects and recognizes individual characters. The other category takes a top-down approach that recognizes words or text lines directly without explicit detection and recognition of individual characters.
Most traditional scene text recognition systems follow a bottom-up approach that first detects and recognizes individual characters by using certain hand-crafted features and then links up the recognized characters into words or text lines using dynamic programming and language models. Various scene character detection and recognition techniques have been reported that use sliding windows [41, 40], connected components, Hough voting, co-occurrence histograms, extremal regions, etc. On the other hand, these methods are often constrained by the limited representation capacity of the hand-crafted features. With the advances of deep learning in recent years, various CNN architectures and frameworks have been designed for the scene character recognition task. For example, a fully connected network has been adopted to recognize characters, CNNs have been used for feature extraction, and CNNs have been exploited for unconstrained character recognition. On the other hand, these deep network based methods require localization of individual characters, which is not only resource-hungry but also prone to errors due to complex image background and heavy touching between adjacent characters.
To address the character localization issues, various top-down techniques have been proposed which recognize an entire word or text line without detecting individual characters. One typical approach is to treat a word as a unique object class and convert scene text recognition into an image classification problem. In addition, recurrent neural networks (RNNs) have been widely explored which encode a word or text line as a feature sequence and perform recognition without character segmentation. For example, [37, 38] extract histogram of oriented gradient features across a word or text line and use RNNs to convert them into a feature sequence. Several works propose end-to-end systems that use RNNs for visual feature representation and the connectionist temporal classification (CTC) loss for sequence prediction. In recent years, visual attention has been incorporated in various ways, which improves recognition by detecting more discriminative and informative regions in images. For example, one approach learns broader contextual information with a recursive CNN and then uses an attention based decoder for sequence generation. Another proposes a focus mechanism to eliminate attention drift and improve the scene text recognition performance.
[Figure panel titles: Input Images | Rectified Images]
The state-of-the-art combining RNNs and attention has achieved great success while dealing with horizontal or slightly distorted texts in scenes. On the other hand, most existing methods still face various problems while dealing with many scene texts that suffer from either perspective distortions or text line curvatures or both.
Prior works dealing with perspective distortions and text line curvatures are limited, but this problem has attracted increasing attention in recent years. The very early works [25, 9] correct perspective distortions in document texts as captured by digital cameras for better recognition. A later work handles scene texts by using a bag of key-points that is tolerant to perspective distortions. These early systems achieve limited success as they use hand-crafted features and also require character-level information. The recent works [35, 36] also take an image rectification approach but explore spatial transformer networks for scene text distortion correction. Similarly, [3, 23] integrate the rectification and recognition into the same network. These recent systems exploit deep convolutional networks for rectification and RNNs for recognition, which require few manually crafted features or extra annotations and have shown very promising recognition performance.
Our proposed technique adopts a rectification approach for robust and accurate recognition of scene texts with perspective and curvature distortions. Different from existing rectification based works [35, 36, 3, 23], it corrects distortions in an iterative manner which helps to improve the rectification and recognition greatly. In addition, we propose a novel line-fitting transformation that is robust and flexible in scene text distortion estimation and correction.
Note that some attempts have been reported in recent years to handle scene text perspectives and curvatures by managing deep network features. For example, one work presents an auxiliary dense detector to encourage visual representation learning, and another describes an arbitrary orientation network that extracts scene text features in four directions.
This section presents the proposed scene text recognition technique, including the iterative scene text rectification network, the sequence recognition network and the network training strategy.
The proposed iterative rectification network employs a novel line-fitting transformation and an iterative rectification strategy for optimal estimation and correction of perspective and curvature distortions within scene text images.
A novel line-fitting transformation is proposed to model the pose of scene texts and correct perspective and curvature distortions for better scene text recognition. As illustrated in Fig. 3, the fitting lines consist of a polynomial that models the middle line of text lines in horizontal direction, and line segments that estimate the orientation and the boundary of text lines in vertical direction. Since most scene texts are either along a straight line or a normal smooth curve, a polynomial of a certain order is sufficient for the text line pose estimation in the horizontal direction. In our trained network, a polynomial of order 4 and 10 line segments are employed for scene text pose estimation.
By setting the image center as the origin and normalizing the x-y coordinates of each pixel within scene text images, the middle line of text lines can be denoted by a polynomial of order $K$ as follows:
$$y = a_K x^K + a_{K-1} x^{K-1} + \dots + a_1 x + a_0$$
The $L$ line segments can be denoted by:
$$y = b_{1,l}\, x + b_{0,l} \;|\; r_l, \quad l = 1, 2, \dots, L$$
where $r_l$ denotes the length of line segments on the two sides of the middle line of scene texts, which can be approximated as the same. We therefore have $3L$ parameters for estimating the $L$ line segments. By including the middle line polynomial, the parameter number becomes $3L + K + 1$.
The proposed rectification network iteratively regresses to estimate the fitting-line parameters by employing a localization network together with image convolutions as illustrated in Fig. 4. Table 1 gives detailed structures of the localization network. It should be noted that the training of the localization network does not require any extra annotation of fitting lines but is completely driven by the gradients that are back-propagated from the recognition network. The underlying principle is that optimal recognition performance is usually achieved when scene text distortions are estimated and corrected properly.
Once the fitting line parameters are estimated, the coordinates of the two endpoints of each of the $L$ line segments can be determined. Scene text distortions can then be corrected by a thin plate spline (TPS) transformation that is determined by the $2L$ estimated line segment endpoints $P = [t_1, t_2, \dots, t_{2L}]^T$ and another $2L$ base points $P' = [t'_1, t'_2, \dots, t'_{2L}]^T$ which define the appearance of scene texts within the rectified image.
With $P$ and $P'$, the transformation parameters can be determined by solving the thin plate spline linear system built on the radial basis responses $S = [U(t' - t'_1), U(t' - t'_2), \dots, U(t' - t'_{2L})]^T$, where $U(r) = r^2 \log r$. For every pixel $t'$ within the rectified image, the corresponding pixel $t$ within the distorted scene text image can thus be determined by applying the transformation to $t'$.
With the estimated pixels $t$, a grid can be generated within the distorted scene text image for rectification. A sampler is implemented to produce the rectified scene text image by using the determined grid, where the value of the pixel $t'$ is bilinearly interpolated from the pixels near $t$ within the distorted scene text image. The sampler can back-propagate the image gradients as it is fully differentiable.
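To make the thin plate spline step concrete, the following is a minimal NumPy sketch of the textbook formulation, not the paper's exact equations. It assumes the radial basis $U(r) = r^2 \log r$, and the names `fit_tps` and `apply_tps` are hypothetical. The spline is fitted so that base points in the rectified image map to the estimated endpoints in the distorted image, and can then be evaluated at every rectified pixel to build the sampling grid.

```python
import numpy as np

def _tps_kernel(dist):
    """Radial basis U(r) = r^2 log(r), with U(0) defined as 0."""
    out = np.zeros_like(dist)
    mask = dist > 0
    out[mask] = dist[mask] ** 2 * np.log(dist[mask])
    return out

def fit_tps(base_pts, src_pts):
    """Solve for TPS parameters mapping base_pts (rectified) to src_pts (distorted).

    Both inputs are (n, 2) arrays; the result is an (n + 3, 2) parameter array
    holding n kernel weights followed by a 3-row affine part.
    """
    n = base_pts.shape[0]
    pairwise = np.linalg.norm(base_pts[:, None, :] - base_pts[None, :, :], axis=-1)
    system = np.zeros((n + 3, n + 3))
    system[:n, :n] = _tps_kernel(pairwise)
    system[:n, n] = 1.0
    system[:n, n + 1:] = base_pts
    system[n, :n] = 1.0
    system[n + 1:, :n] = base_pts.T
    target = np.zeros((n + 3, 2))
    target[:n] = src_pts
    return np.linalg.solve(system, target)

def apply_tps(params, base_pts, query_pts):
    """Map (m, 2) rectified-image points to distorted-image coordinates."""
    dist = np.linalg.norm(query_pts[:, None, :] - base_pts[None, :, :], axis=-1)
    basis = np.hstack([_tps_kernel(dist), np.ones((len(query_pts), 1)), query_pts])
    return basis @ params
```

The mapped coordinates are real-valued, which is why a bilinear sampler is needed to read pixel values from the distorted image.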
We develop an iterative rectification framework for optimal scene text rectification and recognition. At the first iteration, the rectification network takes the original scene text image as input and rectifies it to certain degrees by using the estimated transformation parameters as described in the last subsection. After that, the rectified scene text image is further fed to the same rectification network for parameter estimation and image rectification. This process repeats until a predefined number of rectification iterations is reached. The finally rectified image is then fed to the sequence recognition network for scene text recognition.
The iterative rectification improves the scene text image rectification performance greatly, as described in Experiments. On the other hand, it often encounters a critical ‘boundary effect’ problem if the iterative rectification is performed directly without control. In particular, each rectification iteration discards image pixels lying outside of the image sampling region, which accumulates and could lead to the discarding of certain text pixels during the iterative image rectification process. Besides, direct iterative image rectification also degrades the image clarity severely due to the multiple rounds of bilinear interpolation.
|BiLSTM||256 hidden units|
|BiLSTM||256 hidden units|
|AttLSTM||-||256 hidden, 256 attention|
|AttLSTM||-||256 hidden, 256 attention|
Inspired by prior work, we deal with the ‘boundary effects’ and clarity degradation by designing a novel network structure as illustrated in Fig. 4. In particular, the rectification network consists of localization networks for estimating rectification parameters, and a thin plate spline transformation for generating rectified scene text images by using the estimated rectification parameters. During the iterative image rectification process, the intermediately rectified scene text image is used for parameter estimation only, and the original instead of the intermediately rectified scene text image is consistently fed to the thin plate spline transformation for rectification. This new design helps avoid accumulating ‘boundary effects’ and degradation of image clarity, which greatly helps improve the scene text recognition, as presented in Experiments.
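The benefit of always warping the original image can be seen in a toy sketch. This is an illustration under strong simplifying assumptions, not the ESIR network: the "image" is a 1-D array, each "rectification" is an integer shift with zero fill, and the per-iteration estimates are supplied as a fixed list instead of being predicted by a localization network. The function names are hypothetical.

```python
import numpy as np

def shift(img, s):
    """Toy 'rectification': shift a 1-D signal by s samples, zero-filling the border."""
    out = np.zeros_like(img)
    if s > 0:
        out[s:] = img[:-s]
    elif s < 0:
        out[:s] = img[-s:]
    else:
        out[:] = img
    return out

def rectify_naive(img, steps):
    """Warp the previous output at every iteration: border losses accumulate."""
    for s in steps:
        img = shift(img, s)
    return img

def rectify_from_original(img, steps):
    """Accumulate the transformation and always warp the original image."""
    total = 0
    current = img
    for s in steps:
        # In the real network, s would be estimated from `current`.
        total += s
        current = shift(img, total)
    return current
```

With `np.arange(1, 7)` and steps `[2, -2]`, the naive version returns `[1, 2, 3, 4, 0, 0]` because the two pixels pushed outside the frame in the first step are lost for good, while the version that warps the original returns all six values unchanged.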
The recognition network employs a sequence-to-sequence model with an attention mechanism. It consists of an encoder and a decoder, and the detailed network structure is given in Table 2. In the encoder, the input is a rectified scene text image that is re-sized to a fixed size. A 53-layer residual network is used to extract features, where each residual unit consists of two convolutions. Strided convolutions down-sample the feature maps in the first two residual blocks, and the stride is changed in all following residual blocks so that less down-sampling is applied horizontally. This helps to preserve more information along the horizontal direction and is very useful for distinguishing neighboring characters. The residual network is followed by two layers of bidirectional long short-term memory (BiLSTM) with 256 hidden units. The decoder consists of 2-layer attentional LSTMs with 256 hidden units and 256 attention units, and it adopts the Luong attention mechanism. During testing, beam search is employed for decoding, which maintains a fixed number of candidates with the top accumulative scores.
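The beam search used at test time can be sketched as follows. This is a simplified illustration, not the paper's decoder: it assumes the per-step log-probabilities are given up front as a (T, V) array and do not depend on previously decoded tokens, whereas an attentional LSTM decoder would recompute them for each candidate. The function name and the beam width argument are hypothetical.

```python
import numpy as np

def beam_search(log_probs, beam_width):
    """Keep the beam_width candidates with the top accumulative scores.

    log_probs: (T, V) array of per-step log-probabilities (toy assumption:
    independent of the decoding history).
    Returns a list of (token_tuple, score) sorted by descending score.
    """
    beams = [((), 0.0)]
    for step in log_probs:
        candidates = [
            (seq + (tok,), score + step[tok])
            for seq, score in beams
            for tok in range(len(step))
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
```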
The training of image rectification networks is never a simple task. The major issue is that image rectification networks are sensitive to parameter initialization as frequently encountered in prior studies [35, 36]. In particular, random parameter initialization often leads to network convergence problems because it is liable to produce highly distorted scene text images that ruin the training of the recognition network and further the rectification network (whose training is driven by the scene text recognition performance).
We address the network initialization problem by avoiding predicting the line segment endpoints $P$ directly. Instead, we use an auxiliary $P_0$ that equals the base points $P'$ at the beginning and make $P = P_0 + \Delta P$, where $\Delta P$ is predicted by the rectification network iteratively. By assigning a small value to $\Delta P$, the initial $P$ will have a similar value as $P'$. This helps avoid generating highly distorted scene text images at the beginning stage and improves the network convergence greatly. Additionally, the gradual learning of $P$ via iterative estimation of $\Delta P$ makes the rectification network training smooth and stable.
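The initialization idea can be sketched in a few lines. This is an assumption-laden illustration, not the authors' code: it treats the predicted quantity as a set of 2-D control points, models the network's initial output as small Gaussian noise, and uses the hypothetical names `init_endpoints` and `delta_scale`. The point is only that a small predicted offset added to an auxiliary copy of the base points yields a near-identity starting transformation.

```python
import numpy as np

def init_endpoints(base_pts, delta_scale=1e-3, seed=0):
    """Return initial control points as auxiliary points plus a small offset."""
    rng = np.random.default_rng(seed)
    aux = base_pts.copy()  # auxiliary points start equal to the base points
    # Stand-in for the network's initial prediction: a small random offset.
    delta = delta_scale * rng.standard_normal(base_pts.shape)
    return aux + delta
```

Because the initial points stay close to the base points, the first rectified images are nearly identical to the inputs, so the recognition network is not trained on heavily distorted images at the start.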
This section describes a list of datasets and evaluation metrics that are used in the experiments.
All ESIR models are trained by using the Synth90K and SynthText, and there is no fine-tuning using any third dataset. The trained ESIR models are evaluated over 6 public datasets including 3 normal datasets ICDAR2013, IIIT5K and SVT where most scene texts are almost horizontal, and 3 distorted datasets ICDAR2015, SVTP and CUTE80 where a large amount of scene texts suffer from perspective and curvature distortions. The 6 datasets have been widely used for evaluations in scene text recognition research.
Synth90K contains 9 million synthetic text images with a lexicon of 90K, and it has been widely used for training scene text recognition models. It has no separation of training and test data and all images are used for training.
SynthText is a synthetic image dataset that was created for scene text detection research. It has been widely used for scene text recognition research as well by cropping text image patches using the provided annotation boxes. State-of-the-art methods crop different amounts of patches for evaluation, e.g. one method crops 4 million and Shi et al. crop over 7 million. We crop 4 million text image patches from this dataset, which is at the lower end, for fair benchmarking.
ICDAR2013  is used in the Robust Reading Competition in the International Conference on Document Analysis and Recognition (ICDAR) 2013. It contains 848 word images for model training and 1095 for testing.
ICDAR2015 was used in the Robust Reading Competition under ICDAR 2015. It contains incidental scene text images that are captured without any preparation. 2077 text image patches are cropped from this dataset, where a large amount of cropped scene texts suffer from perspective and curvature distortions.
IIIT5K  consists of 2000 training images and 3000 test images that are cropped from scene texts and born-digital images. Each word image in this dataset has a 50-word lexicon and a 1000-word lexicon, where each lexicon consists of a ground-truth word and a set of randomly picked words.
SVT is collected from the Google Street View images that were used for scene text detection research. 647 word images are cropped from 249 street view images, and words within most cropped word images are almost horizontal. Each word image has a 50-word lexicon.
SVTP  consists of 639 word images that are cropped from the SVT images. Most images in this dataset are heavily distorted by perspective distortions which are specifically picked for evaluation of scene text recognition under perspective views. Each word image has a 50-word lexicon as inherited from the SVT dataset.
CUTE  consists of 288 word images where most cropped scene texts are curved. All word images are cropped from the CUTE dataset which contains 80 high-resolution scene text images that are originally collected for the scene text detection research. No lexicon is provided for the 288 word images in this dataset.
Table 3: Scene text recognition accuracy (%) over the 6 public datasets. Brackets give [network backbone, training data], where SK denotes Synth90K and ST denotes SynthText. "50", "1k" and "None" denote the lexicon used.
|Methods||ICDAR2013 (None)||ICDAR2015 (None)||IIIT5K (50)||IIIT5K (1k)||IIIT5K (None)||SVT (50)||SVT (None)||SVTP (None)||CUTE (None)|
|Wang et al. [-]||-||-||-||-||-||70.0||-||-||-|
|Bissacco et al. [-]||87.6||-||-||-||-||-||-||-||-|
|Yao et al. [-]||-||-||80.2||69.3||-||75.9||-||-||-|
|Almazán et al. [-]||-||-||91.2||82.1||-||89.2||-||-||-|
|Gordo [-]||-||-||93.3||86.6||-||91.8||-||-||-|
|Jaderberg et al. [VGG, SK]||81.8||-||95.5||89.6||-||93.2||71.7||-||-|
|Jaderberg et al. [VGG, SK]||90.8||-||97.1||92.7||-||95.4||80.7||-||-|
|Shi et al. [VGG, SK]||88.6||-||96.2||93.8||81.9||95.5||81.9||71.8||59.2|
|Yang et al. [VGG, Private]||-||-||97.8||96.1||-||95.2||-||75.8||69.3|
|Cheng et al. [ResNet, SK+ST]||93.3||70.6||99.3||97.5||87.4||97.1||85.9||71.5||63.9|
|Cheng et al. [VGG, SK+ST]||-||68.2||99.6||98.1||87.0||96.0||82.8||73.0||76.8|
|Shi et al. [ResNet, SK+ST]||91.8||76.1||99.6||98.8||93.4||97.4||89.5||78.5||79.5|
|ESIR [VGG, SK]||87.4||68.4||95.8||92.9||81.3||96.7||84.5||73.8||68.4|
|ESIR [ResNet, SK]||89.1||70.1||97.8||96.1||82.9||97.1||85.9||75.8||72.1|
|ESIR [ResNet, SK+ST]||91.3||76.9||97.4||98.8||93.3||97.4||90.2||79.6||83.3|
We follow the protocol and evaluation metrics that have been widely used in scene text recognition research [7, 36]. In particular, the recognition covers 68 characters including 10 digits, 26 lower-case letters and 32 ASCII punctuation marks. In evaluation, only digits and letters are counted and the rest are directly discarded. If a lexicon is provided, the lexicon word that has the minimum edit distance to the predicted word is selected. In addition, evaluations are based on the correctly recognized words (CRW), which can be determined based on the ground truth transcription.
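The evaluation protocol above can be sketched in plain Python. This is a minimal illustration rather than the official evaluation script: the function names are hypothetical, and lower-casing both prediction and ground truth before comparison is an assumption based on the recognizer covering lower-case letters only.

```python
import re

def normalize(text):
    """Keep only digits and letters (assumption: compare case-insensitively)."""
    return re.sub(r"[^0-9a-z]", "", text.lower())

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def apply_lexicon(pred, lexicon):
    """Select the lexicon word with the minimum edit distance to the prediction."""
    return min(lexicon, key=lambda w: edit_distance(normalize(pred), normalize(w)))

def crw_accuracy(preds, truths, lexicons=None):
    """Fraction of correctly recognized words, optionally lexicon-constrained."""
    correct = 0
    for i, (pred, truth) in enumerate(zip(preds, truths)):
        if lexicons is not None:
            pred = apply_lexicon(pred, lexicons[i])
        correct += normalize(pred) == normalize(truth)
    return correct / len(truths)
```

For instance, a prediction of "bookstorf" is one edit away from "bookstore", so it counts as wrong without a lexicon but is corrected when a lexicon containing "bookstore" is supplied.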
The proposed scene text recognition network is implemented using the TensorFlow framework. ADADELTA is adopted as the optimizer, which employs an adaptive learning rate, and a weighted cross-entropy is used in the sequence loss calculation. The network is trained for 1 million iterations with a batch size of 64. In addition, the network training is performed on a workstation with one Intel Core i7-7700K CPU, one NVIDIA GeForce GTX 1080 Ti graphics card with 12GB memory and 32GB RAM.
Three ESIR models are trained for evaluation and benchmarking with the state of the art. The first is a baseline model ESIR [VGG, SK] as shown in Table 3, which uses VGG as the network backbone and Synth90K as the training data. The second model, denoted by ESIR [ResNet, SK], uses the same training data but ResNet as the network backbone. The third model, denoted by ESIR [ResNet, SK+ST], uses ResNet as the network backbone and a combination of Synth90K and SynthText as training data, largely for benchmarking with state-of-the-art models such as ASTER and AON that also use a combination of the two datasets in training. All three ESIR models are trained under the same parameter setting: rectification iteration number: 5; number of line segments: 20; order of the middle line polynomial: 4.
[Figure panel titles: Input Images | Rectified Images]
The proposed ESIR has been evaluated extensively over the 6 public datasets as described in Dataset that contain both normal scene text images and scene text images with a variety of perspective and curvature distortions. In addition, it has been benchmarked with a number of state-of-the-art scene text recognition techniques that employ rectification, feature learning, etc. as described in Related Work. Table 3 shows experimental results.
As Table 3 shows, ESIR [ResNet, SK] consistently outperforms ESIR [VGG, SK] across all 6 datasets evaluated due to the use of a more powerful network backbone. In addition, ESIR [ResNet, SK+ST] consistently outperforms ESIR [ResNet, SK] across the six datasets, demonstrating the value of including more data in network training. By taking a second look at the datasets, we observe that Synth90K mainly consists of English words with few number and punctuation samples, whereas SynthText contains a good amount of number and punctuation samples. This could partially explain why the inclusion of SynthText helps more for the datasets IIIT5K and CUTE, which contain a large amount of numbers and punctuation.
The proposed ESIR achieves superior scene text recognition performance across the 6 datasets as compared with state-of-the-art techniques. For the three distorted datasets, ESIR outperforms state-of-the-art techniques under all different settings, demonstrating the advantage of the proposed iterative rectification network. In particular, ESIR [VGG, SK] consistently outperforms prior work over SVTP and CUTE when a similar network backbone and training data are used. ESIR [ResNet, SK+ST] also consistently outperforms the state-of-the-art models trained under the same setting across ICDAR2015, SVTP and CUTE. For the three normal datasets, ESIR also achieves state-of-the-art performance. In particular, both ESIR [VGG, SK] and ESIR [ResNet, SK+ST] outperform state-of-the-art techniques over the SVT dataset, which contains a large amount of low-quality scene texts from street view imagery. For ICDAR2013, the method with the best accuracy requires character-level bounding box annotations. Two methods outperform ESIR [VGG, SK] slightly on ICDAR2013 under the same setting, but they only recognize words within a 90K dictionary. The method with the best accuracy on IIIT5K crops 7.2 million training images from SynthText, whereas we only crop 4 million training images.
Fig. 5 illustrates the scene text rectification and recognition by the proposed ESIR, where the three columns show several sample images from the CUTE and SVTP, the rectified images by using the ESIR [VGG, SK], and the recognized texts without (at the top) and with (at the bottom) using the proposed iterative rectification network (incorrectly recognized texts are highlighted in red color), respectively. As Fig. 5 shows, the proposed ESIR is capable of rectifying scene text images with various perspective and curvature distortions in most cases. For the last two severely distorted scene text images, the rectification could be further improved by employing a larger number of rectification iterations (beyond default 5 under the current setting). At the same time, we can see that the proposed ESIR does not degrade scene text images that do not suffer from perspective and curvature distortions as illustrated in the sixth sample image. Further, the proposed ESIR helps to improve the scene text recognition performance greatly as shown in the third column. Note that the recognition here does not use any lexicon, and the recognition performance can be greatly improved by including a lexicon or even a large dictionary, e.g. the mis-recognized ’bookstorf’ and ’tacation’ from the last two sample images could be corrected if a dictionary is used.
We conjecture that ESIR’s superior recognition performance, especially over the three distorted datasets, is largely due to our proposed iterative rectification network. Fig. 6 compares scene text rectifications by our proposed rectification network and two state-of-the-art rectification networks in RARE and ASTER. As Fig. 6 shows, the proposed network produces clearly better rectifications than RARE and ASTER. The better rectifications are largely due to the robust line-fitting transformation as well as the iterative rectification framework as described in Proposed Method.
The performance of the proposed technique is heavily affected by two key parameters, namely, the number of rectification iterations and the number of line segments as illustrated in Fig. 3. We study these two parameters separately by using the Synth90K and SynthText as training data consistently. Table 4 shows experimental results, where the first set of experiments fix the line segments at 20 but change the rectification iteration number from 1 to 5 and the second fix the rectification iteration number at 5 but use 5, 10 and 15 line segments, respectively.
As Table 4 shows, the model performance improves consistently when a larger number of rectification iterations is implemented. In particular, the improvement is more significant at the early stage when the first and second iterations of rectification are implemented i.e. the number of iterations changes from 0 to 1 and from 1 to 2. This can be observed more clearly over the two highly distorted datasets SVTP and CUTE as shown in Table 4. In addition, using a larger number of line segments also helps to improve the scene text recognition performance though the improvement is not as significant as implementing a larger number of rectification iterations. We conjecture that using a larger number of line segments helps to produce better estimations of text line poses which further helps to improve the scene text rectification and recognition performance.
Though the proposed ESIR performs multiple iterations of rectification, the overall computation cost increases only slightly as compared with state-of-the-art rectification techniques that do not use iterations [35, 36]. In particular, the proposed ESIR with 5 rectification iterations takes 3ms per image in training with a batch size of 64 and 28ms per image in testing with a batch size of 1. Under a similar network setting, ASTER takes 2.4ms per image in training and 20ms per image in testing. The similar computational cost is largely because the proposed rectification network, as shown in Table 1, is small and computationally light as compared with the feature extraction network and the sequence recognition network shown in Table 2.
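As a worked check of the timings quoted above (using only the reported per-image figures), the relative cost of ESIR with 5 iterations versus ASTER is:

$$\frac{3\ \text{ms}}{2.4\ \text{ms}} = 1.25 \ \text{(training)}, \qquad \frac{28\ \text{ms}}{20\ \text{ms}} = 1.4 \ \text{(testing)}$$

That is, an extra 0.6ms per image in training and 8ms per image in testing.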
This paper presents an end-to-end trainable scene text recognition network that is capable of recognizing distorted scene texts via iterative rectification. The proposed network estimates and corrects perspective distortion and text line curvature iteratively as driven by better scene text recognition performance. In particular, a novel line-fitting transformation is designed to estimate the pose of text lines in scenes, and an iterative rectification framework is developed for optimal scene text rectification and recognition. The proposed network is also robust to parameter initialization and does not require extra annotations. Experiments over a number of public datasets demonstrate its superior performance in scene text rectification and recognition.
A few issues will be further studied. First, the proposed technique works well on manually cropped scene text images with almost perfect scene text localization. In most end-to-end systems, however, the detection module usually introduces various localization errors. The tolerance to different localization errors as well as the integration into end-to-end scene text reading systems will be further explored. Second, this work focuses on scene texts in Latin script, and we will further investigate the recognition of scene texts in other scripts and languages.
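Table 4: Recognition accuracy (%) under different numbers of rectification iterations (N) and different numbers of line segments (L).
|Setting||ICDAR2015||SVTP||CUTE|
|N = 0||73.9||73.2||73.4|
|N = 1||75.8||77.3||78.8|
|N = 2||76.3||78.7||81.1|
|N = 3||76.7||79.3||82.7|
|N = 4||76.9||79.5||83.1|
|L = 5||75.8||78.0||81.7|
|L = 10||76.6||78.9||82.6|
|L = 15||76.9||79.3||83.0|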
Edit probability for scene text recognition. In CVPR, 2018.
Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pages 2231–2239, 2016.
Effective approaches to attention-based neural machine translation. In EMNLP, 2015.