Alchemy: Techniques for Rectification Based Irregular Scene Text Recognition

08/30/2019 ∙ by Shangbang Long, et al. ∙ Carnegie Mellon University ∙ Peking University

Reading text from natural images is challenging due to the great variety in text font, color, size, and complex backgrounds. Perspective distortion and the non-linear spatial arrangement of characters make it even more difficult. While the rectification based method is intuitively grounded and has pushed the envelope so far, its potential is far from being fully exploited. In this paper, we present a bag of tricks that prove to significantly improve the performance of rectification based methods. On curved text datasets, our method achieves an accuracy of 89.6% on CUTE80, surpassing the previous state-of-the-art by 6.3%. The combination of tricks helps us win the ICDAR 2019 Arbitrary-Shaped Text Challenge (Latin script), achieving an accuracy of 74.3% on the test set. We release our code as well as data samples for further exploration at https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy

1 Introduction

Recently, the detection and recognition of irregular text from natural images have become a popular research topic [long2018scene, Shi_2017_CVPR, Zhou_2017_CVPR, Deng2018, long2018textsnake, baek2019character, tian2019learning, shi2016robust, liao2018scene, li2018show]. However, most detection methods describe text as bounding boxes or as groups of pixels in the form of semantic segmentation, without any indication of its shape. Thus, recognition models still need a proper mechanism to deal with curved text.

Deep learning based text recognition algorithms have transitioned from one-dimensional methods, i.e. CRNN and CTC based methods [su2014accurate, liu2016star, lee2016recursive, cheng2017focusing, shi2017end, gao2017reading, yin2017scene], to quasi-two-dimensional methods using rectification [shi2016robust, shi2018aster], which is adapted from Spatial Transformer Networks (STN) [jaderberg2015spatial], and currently to two-dimensional methods with Fully Convolutional Networks [long2015fully] or spatial attention [xu2015show] such as [liao2018scene, li2018show]. These methods have established and followed a standard and well accepted training and evaluation protocol, using images rendered with synthesis engines [jaderberg2014synthetic, gupta2016synthetic] as training data and then evaluating on real world datasets. Besides, they share a standard pre-processing step of resizing all images to a fixed size, usually set as pixels.

Despite these widely-adopted trends, the aforementioned common practices fall short when faced with irregular text. Specifically, synthetic data generated by existing algorithms [jaderberg2014synthetic, gupta2016synthetic, zhan2018verisimilar, liao2019synthtext3d] largely consist of text aligned to straight lines. Although the 800K synthetic images generated by SynthText [gupta2016synthetic] do contain curved text, the proportion is small and they are only slightly curved. Besides, images with irregular text are usually seriously warped if they are resized to . Therefore, it is important and worthwhile to study the bottleneck effect of these factors in building strong and robust text recognizers.

In this paper, we examine a set of techniques from different aspects that improve the performance of recognizers, with a focus on irregular text. These techniques may seem to be slight modifications to models, data, and training procedures at first glance. However, they are well motivated, bring considerable improvements over baselines, and even beat state-of-the-art models by a large margin. We design comprehensive experiments to analyse the effect of these techniques. Furthermore, we are also the first to evaluate on the newly published curved text dataset, Total-Text [kheng2017total], which contains images, much larger than the CUTE80 dataset [risnumawan2014robust]. Images in Total-Text are characterized by even more challenging conditions, and thus should be a better performance indicator for curved text detectors.

Our paper is organized as follows. In Section 2, we first set up our baseline methods and experiment settings. The experiments throughout this paper follow the same settings for fair comparison, unless otherwise specified. In Section 3, we discuss the selection of training data. In Section 4, we evaluate several model modifications. In Section 5, we review our participation in the ICDAR 2019 ArT challenge, and describe our system used in the competition. In Section 6, we summarize our findings and contributions.

The code and training scripts are all included in our GitHub repository, which is linked in the abstract.

2 Baseline System and Experiment Settings

We follow the standard practice for training and evaluating scene text recognition algorithms, as most previous works do. We give a brief description of our baseline recognition system in Section 2.1. We choose a rectification based method as the basis of our experiments. Then, we specify our experiment settings in Section 2.2 and Section 2.3. Details regarding the models and experiments can be found in our code repository. In particular, we would like to thank Mingkun Yang for his repository (https://github.com/ayumiymk/aster.pytorch) of the Aster algorithm [shi2018aster].

2.1 Baseline Models with Rectification Layer

Figure 1: The pipeline of our implementation. The module marked with dotted rectangles is the rectification module. When the rectification module is removed, the model becomes a vanilla RNN and CNN based model. The green dots inside represent predicted control points. For implementation details, we refer readers to our code.

We follow and modify the PyTorch implementation of Aster [shi2018aster] released by its authors as mentioned above. The pipeline is shown in Fig. 1, and is composed of the following steps:

Preprocessing Input images are first resized to . Pixel values are subtracted and then divided by , thus normalized to a to base.

Feature Extractor We use a standard 50-layered ResNet [He_2017_Res] as backbone network and Feature Pyramid Networks [Lin_2017_CVPR] to extract features from images. The output feature maps have the resolution of input images, that is, , and channels. We believe that the widely-used ResNet+FPN feature extractor should serve as a fair playground for easy replication and controlled experiments.

Rectification Module To perform rectification, we first apply a control point header to localize the text with a -vertexed polygon. The header consists of a series of alternating convolutional layers and maxpooling layers that encode the feature maps from the feature extractor to a -d vector. Two fully connected networks follow and output the polygon. Then, the feature maps are rectified to axis-aligned form with Thin-Plate-Spline (TPS) [warps1989thin]. Details of the grid generator are the same as those specified in Aster.

Recognition Module The input to the recognition module is the rectified feature maps output by the rectification module, with a size of . We follow the PyTorch implementation of Aster, and attach several convolutional layers, maxpooling layers and Bi-LSTM [hochreiter1997long] layers which act as encoders, followed by attentional decoders. The decoder is an LSTM with an attention mechanism that predicts a sequence of symbols from a predefined character list. To produce sequences of variable length, a special end-of-sentence symbol, denoted as EOS, is added to the character list and appended to the end of each training target. At inference time, decoding ends once EOS is emitted. Apart from EOS, Arabic digits, lowercase letters, and printable ASCII punctuation marks are predicted.
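As a minimal illustration of the decoding rule described above (a sketch, not the released implementation), the following Python snippet appends EOS to a training target and stops greedy decoding once EOS is emitted. The character list, the `<EOS>` token name, the `step_fn` callback and the `max_len` cap are hypothetical stand-ins for the attentional LSTM decoder.

```python
from typing import Callable, List, Sequence

# Hypothetical character list: digits, lowercase letters and a few punctuation
# marks, with the special EOS symbol stored last.
CHARSET = list("0123456789abcdefghijklmnopqrstuvwxyz") + list("!?.,-'")
EOS = "<EOS>"
VOCAB = CHARSET + [EOS]
EOS_ID = len(VOCAB) - 1


def encode_target(text: str) -> List[int]:
    """Map a ground-truth string to symbol ids and append EOS.

    Raises ValueError if a character is not in the (assumed) character list.
    """
    return [VOCAB.index(ch) for ch in text.lower()] + [EOS_ID]


def greedy_decode(step_fn: Callable[[Sequence[int]], Sequence[float]],
                  max_len: int = 30) -> str:
    """Pick the highest-scoring symbol at each step until EOS is emitted.

    step_fn(previous_ids) stands in for one decoder step and returns one
    score per entry of VOCAB.
    """
    ids: List[int] = []
    for _ in range(max_len):
        scores = step_fn(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == EOS_ID:
            break  # decoding ends once EOS is emitted
        ids.append(next_id)
    return "".join(VOCAB[i] for i in ids)
```

In a real model, `step_fn` would wrap the decoder state and the attention over the rectified feature maps; here it is only a callback so that the stopping rule can be read in isolation.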

This model is referred to as rectification baseline in our paper or Rect for short. These two names will be used interchangeably in the following sections.

2.1.1 Non-Rectification Baseline

As shown in Fig.1, we can also remove the rectification module, in which case the model becomes a variant of the convolutional recurrent neural network (CRNN) [shi2017end], with an RNN based transcription module stacked upon convolutional layers. The difference between CRNN and our model is that the sequence output layer of CRNN is based on CTC, while our model uses the RNN-based encoder-decoder framework [sutskever2014sequence], which enables training and testing with variable lengths. We term our model without the rectification layer the non-rectification baseline, or Non-Rect for short. We only modify the rectification layer so that the Non-Rect model also serves as a benchmark for quantitative evaluation of the proposed techniques. The experiment results of Non-Rect can help attribute improvements to models and to different techniques respectively. Also note that the input and output of the rectification module have the same size, and therefore removing it does not change the rest of the pipeline.

2.2 Training Settings

Training Target Given an input image $I$ and ground truth text sequence $y = (y_1, \dots, y_T)$, the loss function is formulated as the cross-entropy loss averaged over time:

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid I)$$

Note that the last symbol of the ground truth sequence, $y_T$, is always the special symbol EOS.
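The time-averaged cross-entropy described above can be sketched in a few lines of plain Python. This is an illustrative stand-in rather than the training code: `step_probs` is a hypothetical name for the per-step probability distributions a decoder would output, and `target_ids` for the ground-truth symbol indices (ending with the EOS index).

```python
import math
from typing import Sequence


def sequence_cross_entropy(step_probs: Sequence[Sequence[float]],
                           target_ids: Sequence[int]) -> float:
    """Negative log-likelihood of the target symbols, averaged over time steps.

    step_probs[t][k] is the predicted probability of symbol k at step t.
    """
    if len(step_probs) != len(target_ids) or not target_ids:
        raise ValueError("need one probability vector per target symbol")
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total -= math.log(probs[target])  # cross-entropy against a one-hot target
    return total / len(target_ids)        # average over the sequence length
```

Averaging over the sequence length keeps the loss of long and short words on the same scale, which matters when word lengths vary as much as they do in scene text.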

Optimization We use ADADELTA [zeiler2012adadelta] with default hyper-parameters (rho=0.9, eps=1e-6, weight decay=0) to minimize the aforementioned loss function. Gradients are estimated using mini-batches with images that are randomly sampled from the training set. We train epochs in total. We initialize the learning rate to for the first epochs, for the fifth epoch and for the sixth epoch. All experiments are performed on an Ubuntu machine with NVIDIA TITAN Xp graphics cards, each with 12GB memory.

2.3 Evaluation

We evaluate the models on a total of datasets. These datasets are briefly described here:

IIIT 5K [mishra2012scene] contains 3000 horizontal testing images collected from the Internet that represent challenges in scene text recognition, such as the variety in font, size, and color.

Street View Text (SVT) [wang2011end] contains 647 testing images collected from Google Street View. Many of them are corrupted by noise, blur and low resolution.

ICDAR 2003 (IC03) [lucas2003icdar] recognition task contains 860 horizontal focused images. Following previous works [wang2011end], we only consider instances with at least characters.

ICDAR 2013 (IC13) [karatzas2013icdar] inherits mostly from IC03 and contains 1015 images.

ICDAR 2015 (IC15) [karatzas2015icdar] is collected with Google Glasses without taking care of pose, position, and focus. The images are blurred, of low quality, and most of the text instances are oriented and small. We only consider instances with at least characters. The total number of images is 1811.

SVT-Perspective (SVTP) [quy2013recognizing] is collected in a similar way to SVT, with a special focus on perspective text. It contains 645 images.

CUTE 80 [risnumawan2014robust] is designed for curved text. It only contains 288 instances for testing. So far, CUTE80 is the most widely used dataset to evaluate curved text recognition, but it is small compared to other datasets.

Total-Text [kheng2017total] is a dataset devoted to curved text detection and recognition. It has scene images for training and for testing. A large proportion of the instances are curved and greatly oriented, making them difficult for both detection and recognition. We crop text instances from the test set, and obtain 2201 cropped text images.

Figure 2: Selected samples of RectTotal produced by applying TextSnake with ground truth geometry attributes. Note that, not only the text themselves are rectified, but the image background is also largely eliminated.

Rectified Total-Text (RectTotal) We propose RectTotal, which is obtained by using the TextSnake [long2018textsnake] algorithm to rectify the test images using ground truth geometry attributes. If the detection module can precisely capture the shape of text instances instead of predicting them as groups of unordered pixels, curved text recognition algorithms may be less necessary. We include the results here for future reference. We also release the whole RectTotal dataset in our code repository. Some samples are shown in Fig.2.

ICDAR 2019 ArT (IC19-ArT) (https://rrc.cvc.uab.es/?ch=14) is a competition devoted to curved text detection and recognition. The dataset for the recognition track contains images for training and testing respectively, and contains both Latin script and Chinese script. In this paper, we only consider Latin script. The total number of Latin images in the training set is . Among these images, we randomly split images as a validation set, to match the size of the other datasets. Also, as the test set was not available at the time of writing, we use this validation set as the test set for evaluation.

Among the datasets described above, the first datasets have been widely used in research. Therefore, we can compare the performance of our methods with previous ones. For the latter curved text datasets, we are the first to report experiment results. Without a reference point, we can only use them to evaluate the effectiveness of the proposed techniques in the form of ablation tests. Nonetheless, they are much larger than the previous curved text benchmark CUTE 80, and should act as better benchmarks. Therefore we release the results so that they can serve as a proper baseline for future research.

3 Select the Right Training Data

3.1 Synthetic Text

In the last few years, training solely on synthetic data has become a widely accepted norm. To be more precise, there are two synthetic datasets that are used by researchers: (1) the synthetic images generated by the SynthText engine [gupta2016synthetic] that contain cropped text images and (2) cropped images from Synth90K [jaderberg2014synthetic]. Although SynthText has more realistic rendering, it is based on an imbalanced vocabulary. On the other hand, Synth90K has a balanced and large vocabulary. However, its images are monochrome, and the visual appearance is simpler than that of SynthText. Therefore, in most papers, the two datasets are combined to complement each other. Such a norm continues even after the recognition of curved text became a hot topic.

However, these two datasets only contain a low proportion of curved text, which may make it hard for recognizers to generalize to curved text. It is therefore reasonable and intuitive to generate synthetic curved text as a replacement. As SynthText provides an open-source code repository (https://github.com/ankush-me/SynthText) and Synth90K does not, we decide to work on SynthText.

3.1.1 Synthesizing curved text

The SynthText engine renders text as a curve when the sampled text line contains only one word and the word contains no more than characters. Then, it samples a parabolic trajectory and places characters one by one symmetrically around the origin point. According to the code (https://github.com/ankush-me/SynthText/blob/6640bf6a0d07b01cbb108984814af2aeb6e30344/text_utils.py#L59), is uniformly sampled from . The potential problem, however, is that the engine's default parameters are set such that the text source always attempts to sample a paragraph, i.e. multiple lines of text. Only in rare cases, when the selected text location is small enough, is there just one word to render. As a result, with the current parameters, curved text is rare in the original dataset.

To solve this issue, we revise the text rendering module of the engine. The main idea of the new text rendering module is two-fold: (1) we increase the proportion of single-word text so that more text is rendered as curved; (2) we randomly sample a radius and place text on the corresponding circle instead. The parameters are set such that there is a larger proportion of curved text and the curvature of text is greater. We also open-source the modified code for future research (https://github.com/Jyouhou/CurvedSynthText). For technical details, readers can refer to the code.
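To make the second idea concrete, here is a small geometric sketch of placing characters along a circle of a given radius. It is not the CurvedSynthText code: the function name `place_on_circle`, the use of character widths as arc lengths, and the choice of the top of the circle as the word centre are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def place_on_circle(char_widths: Sequence[float], radius: float,
                    center: Tuple[float, float] = (0.0, 0.0)
                    ) -> List[Tuple[float, float, float]]:
    """Return (x, y, tangent_angle_degrees) for each character centre.

    Characters are laid out along the top of the circle, symmetric around its
    apex. Image coordinates are assumed, with y growing downward.
    """
    total = sum(char_widths)
    if total >= 2 * math.pi * radius:
        raise ValueError("word is longer than the circle circumference")
    cx, cy = center
    placements = []
    travelled = 0.0
    for width in char_widths:
        mid = travelled + width / 2.0          # arc-length position of this centre
        theta = (mid - total / 2.0) / radius   # signed angle from the apex, in radians
        x = cx + radius * math.sin(theta)
        y = cy - radius * math.cos(theta)      # apex sits above the circle centre
        # The tangent at this point makes angle theta with the x-axis,
        # measured towards +y (clockwise on screen).
        placements.append((x, y, math.degrees(theta)))
        travelled += width
    return placements
```

A smaller radius yields stronger curvature for the same word, so drawing the radius at random, as described above, directly controls how curved the rendered text is.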

We obtain image crops in total. We estimate that around of the generated data are curved. We randomly select some samples from our modified SynthText for visualization, as shown in Fig.3. We refer to our version as CurvedSynth.

Figure 3: Randomly selected samples generated by our modified SynthText Engine.

3.1.2 Experiments with synthetic data

Now we have three synthetic datasets: the original SynthText, CurvedSynth, and Synth90K. To compare the effectiveness of different datasets, we first conduct an experiment with these datasets separately. When training with Synth90K, we transform the testing images to grey scale. We train the rectification baseline on these datasets respectively, and evaluate them on the real-world datasets. We also carry out the same experiment with the Non-Rect model to see how much the datasets can benefit models.

Model Dataset IIIT5K SVT IC03 IC13 IC15 SVTP CUTE Total-Text (RectTotal) IC19-ArT
Non-Rect Baseline CurvedSynth 0.906 0.836 0.927 0.888 0.698 0.695 0.847 0.692 (0.627) 0.653
Non-Rect Baseline SynthText 0.912 0.849 0.929 0.915 0.731 0.722 0.799 0.629 (0.666) 0.649
Non-Rect Baseline Synth90K 0.842 0.847 0.938 0.897 0.696 0.740 0.674 0.409 (0.661) 0.568
Rectification Baseline CurvedSynth 0.920 0.850 0.922 0.893 0.722 0.740 0.837 0.724 (0.660) 0.671
Rectification Baseline SynthText 0.916 0.853 0.927 0.907 0.727 0.735 0.792 0.636 (0.663) 0.660
Rectification Baseline Synth90K 0.846 0.849 0.949 0.900 0.612 0.771 0.677 0.415 (0.665) 0.569
Table 1: Experiments with rectification baseline and non-rectification baseline trained on different synthetic data. For each dataset including RectTotal, we mark the highest score in bold.

The experiment results are summarised in Tab.1. They show some consistency across models and datasets. We analyse these aspects, and inspect the evaluation datasets as well as the models and training datasets one by one, which leads to some interesting findings.

Performance on straight text We first focus on the horizontal datasets: IIIT5K, SVT, IC03, IC13, IC15, and SVTP. On IIIT5K and SVT, the rectification baseline performs better than the Non-Rect baseline. As for training datasets, models trained on CurvedSynth are on par with those trained on SynthText overall, and Synth90K ranks the worst. This is reasonable, as IIIT5K actually contains a small proportion of curved and oriented text. SVT also has a few curved samples, and remember that, more correct image on SVT means an improvement by . Therefore, the difference in the number of correctly recognized images may be small. We also note that, although models trained on SynthText perform slightly better when using Non-Rect, CurvedSynth catches up when using the rectification layer. It is reasonable to attribute this to the compatibility between the rectification layer and curved text. Otherwise, curved text is distorted greatly when resized to , which makes it difficult to recognize.

Figure 4: Randomly selected results by rectification baseline from IC03 and IC13 trained with different training datasets. The upper part of each sample is the input images which are resized to and the bottom is the rectified images.

On IC03 and IC13, things are different. As IC03 and IC13 contain few curved text instances, it is expected that the rectification layer brings no improvements. Besides, the fact that models trained with CurvedSynth perform worse than those trained with SynthText may indicate that redundant curved training examples make the model less competent on purely horizontal text. It is worth noting that models trained on Synth90K benefit from the rectification layer on IC03 by a considerable margin. According to a recent survey [long2018scene], the state-of-the-art performance on IC03 is when trained on Synth90K+SynthText, which is surpassed by ours. To understand what contributes to it, we inspect the evaluation results by manually checking the testing images one by one. We notice a striking phenomenon, select some representatives, and show them in Fig.4. The rectification layer actually shortens some text so that the aspect ratios of characters look better, although leaving some blank space at the end. Such resizing is frequently found in results produced by models that are trained on Synth90K. This phenomenon also exists in models trained with CurvedSynth, but not in those trained with the original SynthText. It is also not found in IC13, where there is also little improvement. We assume that encouraging such a phenomenon can be an important technical trick.

On IC15 and SVTP, we carry out a similar in-depth inspection of the testing images one by one and find a similar phenomenon. Models trained with Synth90K seem to be better at resizing the text to a more normal aspect ratio. It remains a challenge to investigate the reason behind this. We suppose it is because Synth90K has a balanced vocabulary, and therefore has more long words, while the SynthText engine uses an imbalanced vocabulary, and in natural corpora, most high-frequency words are short. Thus, models trained on Synth90K would have to learn to resize characters so that words with varying lengths can fit the input size better.

Performance on curved text On CUTE80, Total-Text, and IC19-ArT, we observe consistent improvements brought by our CurvedSynth over the original SynthText. It results in an improvement of - on CUTE80, around - on Total-Text, and about on IC19-ArT, depending on the model we use. In particular, when we use the rectification baseline, the use of CurvedSynth results in greater improvements over models trained with SynthText on Total-Text, from to , and IC19-ArT, from to . Although performance on CUTE80 drops after adding the rectification layer, this can be attributed to the small size of CUTE80 and the variance it may have.

On the other hand, when trained on SynthText or Synth90K, which contain little curved text, the rectification layer brings only insignificant improvements. The improvements are less than on CUTE80, on Total-Text, and - on IC19-ArT. These numbers are quite small compared to the improvements from introducing our CurvedSynth. When using CurvedSynth, the rectification layer can boost performance by on Total-Text and on IC19-ArT. We can conclude that our proposed CurvedSynth helps the rectification layer reach its potential by providing appropriate data.

Performance on rectified Total-Text We also evaluate all models on the RectTotal dataset we create using the TextSnake algorithm as specified in Section 2.3. The results are shown in brackets in the same table. We notice that the Non-Rect baseline trained on SynthText, which has few curved text instances, achieves the highest score of 0.666. The performance of the other models, except Non-Rect trained on CurvedSynth, is not significantly different from this best model. The reason why Non-Rect trained on CurvedSynth achieves the lowest accuracy may be that, since the images in RectTotal are rectified, the background is eliminated and the text compactly fits into and fills the images, as shown in the bottom row of Fig.2. This, as it turns out, makes these images visually more similar to the training images in SynthText and Synth90K. In contrast, the curved text in CurvedSynth contains background and complex spatial alignment, which is actually out-of-distribution for RectTotal.

Another counter-intuitive phenomenon is that models trained on CurvedSynth perform worse on RectTotal overall than on the original Total-Text. Note that, as the ground-truth polygon annotations of Total-Text are imprecise, the rectified images may be warped. Therefore, the performance gap may be due to the extra distortion introduced by this process. Overall, experiments on RectTotal lead us to the conclusion that, with a good detection model that can capture the shape of text and further rectify it, sophisticated model designs and tricks may be unnecessary. Specifically, for models trained on Synth90K, we observe absolute improvements in accuracy of about 25% on Total-Text if we rectify the images in advance (from 0.409 to 0.661 for Non-Rect and from 0.415 to 0.665 for the rectification baseline in Tab.1). But for now, we still need them as IC19-ArT is not rectified.

3.1.3 Experiments with multiple synthetic data

Model Dataset IIIT5K SVT IC03 IC13 IC15 SVTP CUTE Total-Text (RectTotal) IC19-ArT
SAR[li2018show] SynthText + Synth90K 0.915 0.845 - 0.910 0.692 0.764 0.833 - -
CA-FCN[liao2018scene] SynthText + private data 0.919 0.864 - 0.915 - - 0.799 0.616 (-) -
Rectification Baseline CurvedSynth + Synth90K 0.948 0.896 0.958 0.928 0.782 0.816 0.896 0.763 (0.734) 0.721
Table 2: Experiments with our rectification baseline trained on multiple synthetic datasets compared with previous state-of-the-art methods. Performance scores marked with ‘-‘ indicate that they are not reported. ‘*‘ indicates results provided by asking the authors but not presented in the paper. Again, performance on RectTotal is included only for future reference. Best results are marked in bold. All scores here are obtained by training models solely on synthetic data.

As most recent methods are trained on Synth90K and SynthText jointly, we also present our results when trained on Synth90K and CurvedSynth jointly, as shown in Tab.2. The total number of training data is . With our CurvedSynth, we surpass recent baselines on all datasets by a large margin. We achieve significant improvements on CUTE80 by 6.3% and on Total-Text by 14.7%. The comparison verifies the effectiveness of our CurvedSynth dataset.

3.2 Mixing with Real World Data

As there are several annotated real world datasets, using real world images seems a promising way to boost performance. However, the quantity of real world images is much smaller than that of synthetic data. It is therefore valuable to find a sampling scheme that balances real world data and synthetic data. This section is dedicated to this goal.

Split IIIT5K SVT IC03 IC13 IC15 SVTP CUTE Total-Text IC19-ArT COCO sum
train 2000 257 1156 848 4467 0 0 9267 33220 13943 65158
test 3000 647 860 1015 1811 645 288 2201 3000 0 13467
Table 3: Sizes of real world datasets with annotations. The last column entitled sum is the total amount of data for train/test respectively.

We collect available real world data and summarize them in Tab.3. As IC19-ArT was released only recently, we do not add it to our training set, for fair comparison with previous works that use real world data [li2018show]. Together we obtain real world images. Since the amount of synthetic data is much larger than that of real world images, we give real world data a larger sampling weight. We carry out experiments with different sampling weights such that the real world data account for 5%, 10%, 15%, 20%, and 25% of the total training data. Also note that we temporarily treat all real datasets the same and ignore the differences in their sizes.
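One simple way to realise such a sampling scheme is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, all real world datasets are pooled into one list (matching the note above that they are treated the same), and the dataset sizes used in any call are placeholders rather than the actual training set sizes.

```python
import random
from typing import List, Sequence, Tuple


def per_sample_weights(n_synth: int, n_real: int,
                       real_fraction: float) -> Tuple[float, float]:
    """Sampling weight of one synthetic and one real image such that real
    images make up `real_fraction` of the draws in expectation."""
    if not 0.0 <= real_fraction < 1.0:
        raise ValueError("real_fraction must lie in [0, 1)")
    return (1.0 - real_fraction) / n_synth, real_fraction / n_real


def sample_batch(synth: Sequence, real: Sequence, real_fraction: float,
                 batch_size: int, rng: random.Random) -> List:
    """Draw a batch by first choosing the pool, then a uniform sample from it."""
    batch = []
    for _ in range(batch_size):
        pool = real if rng.random() < real_fraction else synth
        batch.append(rng.choice(pool))
    return batch
```

Because the real pool is much smaller than the synthetic one, each real image ends up with a far larger per-sample weight, which is exactly the effect described in the paragraph above.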

Model Real World Data Ratio (%) IIIT5K SVT IC03 IC13 IC15 SVTP CUTE Total-Text IC19-ArT Avg
SAR[li2018show] - 0.950 0.912 - 0.940 0.788 0.864 0.896 - - -
Rectification Baseline 0 0.948 0.896 0.958 0.928 0.782 0.816 0.896 0.763 0.721 0.834
Rectification Baseline 5 0.961 0.927 0.952 0.949 0.834 0.842 0.938 0.817 0.768 0.868
Rectification Baseline 10 0.958 0.924 0.964 0.953 0.832 0.851 0.931 0.814 0.762 0.866
Rectification Baseline 15 0.957 0.921 0.955 0.945 0.847 0.842 0.934 0.815 0.772 0.869
Rectification Baseline 20 0.956 0.926 0.966 0.948 0.839 0.840 0.938 0.807 0.759 0.864
Rectification Baseline 25 0.954 0.924 0.958 0.951 0.838 0.828 0.951 0.809 0.752 0.862
Table 4: Experiments with synthetic data mixed with real world data.

As shown in Tab.4, our model consistently outperforms previous methods that use real world data by a large margin across different real world data ratios. We achieve improvements on all datasets except SVTP. Notably, IC15, SVTP, CUTE80, Total-Text and IC19-ArT all benefit significantly from real world data.

For an overall comparison, we compute the micro average score by taking the weighted average of all accuracies with dataset sizes as weights; the average scores are shown in the last column, labeled Avg. By this measure, the rectification baseline performs best when real world data take up 15% of the total training data.
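The Avg column can be reproduced from the test set sizes in Tab.3 and the per-dataset accuracies in Tab.4. The short script below does so for two rows (0% and 15% real world data); the variable and function names are chosen for this example only.

```python
# Test set sizes from Tab.3 (the "test" row).
TEST_SIZES = {
    "IIIT5K": 3000, "SVT": 647, "IC03": 860, "IC13": 1015, "IC15": 1811,
    "SVTP": 645, "CUTE": 288, "Total-Text": 2201, "IC19-ArT": 3000,
}

# Two rows of Tab.4: rectification baseline with 0% and 15% real world data.
RATIO_0 = {
    "IIIT5K": 0.948, "SVT": 0.896, "IC03": 0.958, "IC13": 0.928, "IC15": 0.782,
    "SVTP": 0.816, "CUTE": 0.896, "Total-Text": 0.763, "IC19-ArT": 0.721,
}
RATIO_15 = {
    "IIIT5K": 0.957, "SVT": 0.921, "IC03": 0.955, "IC13": 0.945, "IC15": 0.847,
    "SVTP": 0.842, "CUTE": 0.934, "Total-Text": 0.815, "IC19-ArT": 0.772,
}


def micro_average(accuracy: dict, sizes: dict) -> float:
    """Weighted average of accuracies, using dataset sizes as weights."""
    total = sum(sizes[name] for name in accuracy)
    return sum(accuracy[name] * sizes[name] for name in accuracy) / total


print(round(micro_average(RATIO_0, TEST_SIZES), 3))   # 0.834
print(round(micro_average(RATIO_15, TEST_SIZES), 3))  # 0.869
```

Since the weights are dataset sizes, the score equals the fraction of correctly recognized images over the pooled 13467 test images, so large datasets such as IIIT5K and IC19-ArT dominate it.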

As for performance on curved text datasets, ratios of 5% and 15% give fairly comparable results. A proportion of 5% for real world data results in slightly higher accuracy on Total-Text, but a 0.2% improvement is insignificant. Since the ratio does not make a large difference on most datasets, we continue with 15% in later experiments on IC19-ArT due to its superiority both on curved text and overall.

4 Model Design Tweaks

The previous sections mainly focus on the selection of data. In this part, we mainly talk about some model modifications that improve the performance of our rectification based methods. For fair comparison with most previous works as well as previous sections, we only train our method on synthetic data, i.e. Synth90K + CurvedSynth.

4.1 Keeping Aspect-Ratio for Input Images

As suggested in Section 3.1.2, the rectification layer learns to resize the input images, which is likely the reason why it performs better on those datasets. We suppose that an aspect-ratio-preserving input would benefit recognizers. Therefore, we design a new and intuitive pre-processing step as specified here.

For input images with different sizes, we first resize the long sides to pixels, with the short sides resized accordingly to keep the original aspect ratios. Then, we pad grey borders to the short sides such that the pre-processed images are square. We select and visualize some samples in Fig.5. Our pre-processing is clearly more friendly to curved text. Compared with the fixed-size scheme that resizes images to uniformly, our method keeps the images semantically legible. We term our pre-processing squarization.
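The squarization step can be sketched on a plain list-of-lists grey-scale image as follows. This is an illustrative sketch, not the released pre-processing code: the target side length `side`, the grey value 128, the centred placement and the nearest-neighbour resize are all assumptions, since the text above only specifies resizing the long side and padding the short side with grey borders.

```python
from typing import List


def squarize(image: List[List[int]], side: int,
             pad_value: int = 128) -> List[List[int]]:
    """Resize so the long side equals `side`, then pad the short side with grey.

    `image` is a row-major grey-scale image (list of equally long rows).
    """
    height, width = len(image), len(image[0])
    scale = side / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))

    # Nearest-neighbour resize that keeps the original aspect ratio.
    resized = [
        [image[min(height - 1, int(r / scale))][min(width - 1, int(c / scale))]
         for c in range(new_w)]
        for r in range(new_h)
    ]

    # Paste the resized image at the centre of a grey square canvas.
    top = (side - new_h) // 2
    left = (side - new_w) // 2
    canvas = [[pad_value] * side for _ in range(side)]
    for r in range(new_h):
        canvas[top + r][left:left + new_w] = resized[r]
    return canvas
```

For a wide word image, the grey borders end up above and below the text; for a tall or vertical one, they end up on the left and right, so characters are never stretched.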

Figure 5: Visualization of rectification results for squarization compared with the fixed resizing scheme. Within each dotted frame, the upper part is the input image and the bottom part is the rectification result. We can see that rectification based on square images produces higher image quality, clearer rectified images and better overall rectification.

4.1.1 Random Rotation

Another advantage of squarization is that images can now be rotated freely in both training and testing stages. As instances in curved text datasets usually have oriented or even vertical arrangement, it is important to incorporate such situations in the training stage as well. Therefore, we propose to randomly rotate the images during training. Specifically, the images are randomly rotated , , or degrees with a probability of respectively.
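A possible form of this augmentation on square images is sketched below. The specific rotation angles and their probabilities are not recoverable from the text above, so the defaults used here (multiples of 90 degrees with equal probability) are placeholders chosen purely for illustration, as are the function names.

```python
import random
from typing import List, Sequence


def rotate90(image: List[List[int]], quarter_turns: int) -> List[List[int]]:
    """Rotate a square image counter-clockwise by quarter_turns * 90 degrees.

    With zero effective turns the input list itself is returned.
    """
    out = image
    for _ in range(quarter_turns % 4):
        # One counter-clockwise quarter turn: transpose, then reverse row order.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def random_rotate(image: List[List[int]],
                  angles: Sequence[int] = (0, 90, 180, 270),       # placeholder angles
                  probs: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # placeholder weights
                  rng: random.Random = random) -> List[List[int]]:
    """Rotate by an angle drawn from `angles` according to `probs`."""
    angle = rng.choices(list(angles), weights=list(probs), k=1)[0]
    if angle % 90 != 0:
        raise ValueError("this sketch only supports multiples of 90 degrees")
    return rotate90(image, angle // 90)
```

Rotations by multiples of 90 degrees keep a square image square, which is why squarization makes this kind of augmentation straightforward.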

4.2 Rectification on Images

While rectification layer can alleviate distortions to some extent, it only operates on the extracted feature maps and therefore the features may still be affected by the distortion in images. Therefore, we carry out experiments to inspect whether performing rectification on images can benefit recognition.

Specifically, the images are first fed forward to obtain the control points as our rectification baseline does. Then the control points are used to rectify the input images. Then the rectified images are again fed into the same network. The recognition module is applied to the feature maps extracted from the rectified images.

In implementation, such two-pass computation nearly doubles memory usage, and makes it impossible to fit in the GPUs we have. Therefore, in the first pass, we downsample the input images to half size using bilinear interpolation, while the second pass is performed on the images at the original resolution. By reducing the input size of the first pass, the rectification module based on fully connected networks can also remain unchanged.

4.3 Experiments

Model                   Modification                    IIIT5K  SVT    IC03   IC13   IC15   SVTP   CUTE   Total-Text  IC19-ArT
Rectification Baseline  None                            0.948   0.896  0.958  0.928  0.782  0.816  0.896  0.763       0.721
Rectification Baseline  Squarization                    0.950   0.898  0.950  0.927  0.782  0.809  0.892  0.769       0.724
Rectification Baseline  Squarization + Random Rotation  0.949   0.898  0.948  0.925  0.781  0.817  0.889  0.781       0.728
Rectification Baseline  Squarization + Rectify Image    0.946   0.904  0.953  0.925  0.792  0.795  0.800  0.776       0.723
Rectification Baseline  All                             0.942   0.881  0.948  0.920  0.771  0.803  0.903  0.761       0.720
Table 5: Results of experiments with model modifications.

We list the experiment results in Tab.5. The combination of squarization and random rotation boosts performance on curved text recognition, improving Total-Text by 1.8 points (0.763 to 0.781) and IC19-ArT by 0.7 points (0.721 to 0.728). Squarization by itself changes results only slightly on nearly all datasets. However, we argue that squarization is still important because it enables the random rotation trick, which brings further improvements.

Unfortunately, rectifying images does not seem to be very effective. Although it scores highest on SVT, improving by 0.8 points (0.896 to 0.904), and on IC15, improving by 1.0 point (0.782 to 0.792), it fails on other datasets. We suppose that rectification on images may make the model difficult to train.
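The per-dataset differences discussed in this section can be recomputed directly from Tab.5. The snippet below is a small convenience sketch: the numbers are copied from the table, while the names `DATASETS`, `RESULTS`, and `gain_over_baseline` are introduced here for illustration only.

```python
DATASETS = ["IIIT5K", "SVT", "IC03", "IC13", "IC15", "SVTP", "CUTE",
            "Total-Text", "IC19-ArT"]

# Accuracies copied from Tab.5, keyed by modification.
RESULTS = {
    "None": [0.948, 0.896, 0.958, 0.928, 0.782, 0.816, 0.896, 0.763, 0.721],
    "Squarization": [0.950, 0.898, 0.950, 0.927, 0.782, 0.809, 0.892, 0.769, 0.724],
    "Squarization + Random Rotation": [0.949, 0.898, 0.948, 0.925, 0.781, 0.817, 0.889, 0.781, 0.728],
    "Squarization + Rectify Image": [0.946, 0.904, 0.953, 0.925, 0.792, 0.795, 0.800, 0.776, 0.723],
    "All": [0.942, 0.881, 0.948, 0.920, 0.771, 0.803, 0.903, 0.761, 0.720],
}


def gain_over_baseline(modification):
    """Return the change in accuracy, in percentage points, relative to 'None'."""
    baseline = RESULTS["None"]
    modified = RESULTS[modification]
    return {
        name: round((new - old) * 100, 1)
        for name, old, new in zip(DATASETS, baseline, modified)
    }
```

Calling `gain_over_baseline("Squarization")` gives the effect of squarization alone on every dataset, which makes it easy to see where it helps and where it slightly hurts.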

5 Evaluation on ICDAR-2019

In this part, we reveal details of how we build our recognition system for participation in IC19-ArT.

Data We pool all available data together: synthetic data including CurvedSynth and Synth90K, all real world data as listed in Tab.3 except the validation images we sample from IC19-ArT. We have synthetic data in total, and real world images. During training, we sample real world data with a larger weight so that real world data take up of total training data. We use the validation set to select the best models.
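The weighted sampling of real world data can be sketched as follows. This is one possible scheme, not necessarily the one in our released code: each training item is drawn from the real pool with probability `real_fraction`, a hypothetical parameter standing in for the target share of real data, and otherwise from the synthetic pool.

```python
import random


def sample_batch(synthetic, real, real_fraction, batch_size, rng):
    """Draw a batch in which each item is real with probability real_fraction."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    batch = []
    for _ in range(batch_size):
        # Pick the source first, then sample uniformly within that source.
        pool = real if rng.random() < real_fraction else synthetic
        batch.append(rng.choice(pool))
    return batch
```

Choosing the source before the sample decouples the share of real data from the raw dataset sizes, so a small real pool is not drowned out by a much larger synthetic one.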

Model We train an ensemble of the following models on the aforementioned data: (1) the rectification baseline; (2) the squarified rectification model with random rotation; (3) the rectification baseline with ResNet152; (4) the rectification baseline with rectification on images. We use the validation set to select the best parameters for each model and combine their predictions via a simple voting mechanism: we select the text that appears most often among the predictions.
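The voting step can be sketched in a few lines. The tie-breaking rule here (the earliest prediction in the list wins) is an assumption made for the sketch; the description above does not specify how ties are resolved.

```python
from collections import Counter


def vote(predictions):
    """Return the transcription predicted by the most models.

    Ties are resolved in favor of the prediction that appears first in the
    list (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    # Counter.most_common keeps first-seen order among equally frequent items.
    return Counter(predictions).most_common(1)[0][0]
```

With four models, for example, `vote(["STOP", "ST0P", "STOP", "SHOP"])` selects "STOP" because two of the four models agree on it.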

The models achieve validation accuracy from and . The final accuracy on the test set, whose annotations are not yet published, is 74.3%.

6 Conclusion

In this paper, we investigate several aspects that may affect the training and final performance of rectification based scene text recognizers on curved text datasets. We implement a rectification-based method with a standard ResNet+FPN as feature extractor, a rectification module, and a vanilla attentional RNN based encoder-decoder sequence learning module. We conduct comprehensive experiments to study the task of curved text recognition, and we finally arrive at the following conclusions:

  1. Training data that contain curved text are very important yet economical. Despite the latest fashion of training solely on the standard SynthText and Synth90K for fair comparison, these two datasets contain few curved instances, and are therefore not suitable for curved text recognition. Algorithms trained on these datasets are not able to reach their full potential. We demonstrate this by training a previous state-of-the-art method on curved synthetic text, and boost its performance by as much as on CUTE80, even surpassing the most recent state-of-the-art trained on both synthetic data and real world data. On the other hand, the synthesis of curved text is effortless and economical, and can replace existing datasets in a drop-in fashion. The proposed CurvedSynth is a better dataset to study the task of curved text recognition.

  2. Mixing synthetic data with real data also makes a difference. This rule of thumb is self-evident, but the technical details are important. As shown in our experiments, it is important to find an appropriate proportion for the real world data. Besides, these datasets vary greatly in size, from a few hundred to tens of thousands of images. Thus, it is also valuable to design a method to find an optimal ratio among real world datasets.

  3. Squarization works, and opens up new possibilities. Another standard practice we challenge is the fixed-size pre-processing that resizes all input images to the same preset size, which is usually . Although the proposed squarization only results in slight improvements on curved datasets and comparable results on straight text, it enables the use of random rotation, which can boost the performance on curved text to some extent. For future directions, it is also worthwhile to study inputs with variable lengths and widths, as done by the NLP community. This may be important for long text, such as Chinese, Japanese, and Korean.

  4. Recognizers are more robust to shape when working on images rectified by select detectors. We propose RectTotal, a dataset obtained by applying TextSnake to rectify test images of Total-Text using the ground truth polygon-based labels. Experiment results show that, even if algorithms are trained solely on straight text and are not equipped with special mechanisms for curved text recognition, they can achieve equally good results if the text is already rectified. This result places more importance on detectors that can capture the shape of text, such as TextSnake and CRAFT [baek2019character]. Text detection and recognition should be co-designed to achieve better results.

Finally, we release the training details for our participation in IC19-ArT. We hope it will help researchers in designing better algorithms and further mining new challenges that are not yet noticed by the community.

References