
SPIN: Structure-Preserving Inner Offset Network for Scene Text Recognition

by   Chengwei Zhang, et al.

Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle the problem by considering shape distortion, including perspective distortion, line curvature and other style variations. Therefore, methods based on spatial transformers have been extensively studied. However, chromatic difficulties in complex scenes have received little attention. In this work, we introduce a new learnable geometry-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows color manipulation of the source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream task, giving neural networks the ability to actively transform input intensity rather than only perform the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work with them in both independent and collaborative ways. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state of the art.



1 Introduction

Optical Character Recognition (OCR) has long been an important task in many practical applications. Recently, recognizing text in natural scene images, referred to as Scene Text Recognition (STR), has attracted great attention due to the diverse text appearances and the extreme conditions in which these scenes are captured. Modern deep neural network techniques have been widely introduced [32, 33, 20, 22, 34, 39, 5, 6, 23, 49, 48]. For example, Shi et al. [32] proposed CRNN to exploit contextual information by combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Lee et al. [20] introduced attention modeling to focus on the most reliable regions at every recurrent step. Hybrid attention methods combining various revisions have achieved great success [5, 34].

Since existing methods are quite powerful for regular text with clean backgrounds, irregular shape has become a challenging and popular research topic. After the spatial transformer network (STN) [16] was proposed, Shi et al. [33] integrated it into a unified mainstream procedure in which the STR framework is divided into two connected stages: the spatial rectification stage and the encoder-decoder recognition stage. Notably, the rectification module does not require any extra annotations yet is effective in an end-to-end network. Therefore, it was soon widely adopted and further explored in later works [2, 1, 34, 48, 25]. For example, Zhan et al. [48] proposed a line-fitting transformation to iteratively estimate the pose of text lines in scenes. Luo et al. [25] directly predicted the location offsets of the coordinates to perform multi-object rectification.

Figure 1: Examples of regular and irregular scene text, where the problems of irregular case can be categorized into geometric distortion and chromatic distortion.

Although almost all existing transformations are limited to geometric rectification, shape distortions do not account for all difficult cases. Severe conditions resulting from intensity variations, poor brightness, shadow, occlusion, low resolution, background and imaging noise also contribute to hard examples. By analogy with geometric irregularities, we term these problems chromatic distortions; they are exceedingly common and intractable in the real world. Existing transformations [33, 48, 25] were motivated to recognize images such as the one shown in Fig. 1(b), but samples such as Fig. 1(c) and (d) remained intractable. Until now, chromatic problems have attracted little attention in scene text recognition tasks. As Landau et al. [19] stated, humans prefer to categorize objects based on their shapes rather than colors, so even the geometrically distorted text in Fig. 1(b) can be easily recognized by humans provided that the shapes of the objects are clear enough. Therefore, we intend to design a novel network that preserves the important shapes of text and reduces the burden of visual inference by adjusting the distribution of image colors.

The difficulty of rectifying chromatic distortions lies in the fact that chromatic outliers include not only intensities inconsistent with the normal patterns but also consistent ones, which cannot be directly separated and lead to visual interference. Inspired by recent studies [12, 29] that revealed the vulnerability of models to intensity-related noise and attacks, we apply intensity manipulations to alleviate the chromatic outliers. In this work, we propose a new module called SPIN, which stands for Structure-Preserving Inner Offset Network. Here, we borrow the concept of offset from [25], originally intended for pixel-level spatial drift, and broaden it to cover both the channel intensity offset (denoted the inner offset) and the spatial offset (denoted the outer offset). The inner offset is designed to mitigate chromatic distortions, such as poor brightness and shadow, while the outer offset incorporates the existing geometric rectifications including [33, 34, 1, 25, 48]. Specifically, SPIN, which is the first work to focus on the inner offset of scene text images, preserves the shapes of objects while rectifying images by adjusting the intensities. The proposed network consists of two components: the Structure Preserving Network (SPN) and the Auxiliary Inner-offset Network (AIN). The SPN is responsible for alleviating the irregularities caused by inconsistent intensities, based on the Structure Preserving Transformation (SPT) [29]. The AIN is an accessory network that transforms the irregularities caused by consistent intensities, which the SPT cannot deal with, into inconsistent ones. These two components complement each other and are unified via an additional update gate. The rectified images thus become visually clearer in the intrinsic shape of the text and easier to tell apart. Moreover, we also explore the integration of both the inner and outer offsets by absorbing geometric rectification (i.e. TPS-based STN [34, 1]) into our proposed SPIN. Our experimental results show that unifying the inner and outer offsets in a single transformation module lets them complement each other, leading to even better visual quality and better recognition.

The contributions of the proposed SPIN modules can be summarized as follows:

(1) To the best of our knowledge, this is the first work that mainly handles chromatic distortions in STR tasks, rather than the extensively discussed spatial ones. We also introduce the novel concept of inner and outer offsets and propose a novel module, SPIN, to rectify images with a chromatic transformation.

(2) The proposed SPIN can be easily integrated into deep neural networks and trained in an end-to-end way without additional annotations or extra losses. Unlike the typical spatial transformation based on STN, which is tied to tedious initialization schemes [33, 48, 34], SPIN requires no sophisticated initialization, which makes it a more flexible module.

(3) The proposed SPIN is highly effective at recognizing both regular and irregular scene text. Furthermore, the combination of chromatic and geometric transformations has been experimentally shown to be practicable and to outperform existing techniques on multiple benchmarks.

2 Related Work

Scene Text Recognition. The recognition of text in natural scene images is one of the most important challenges in computer vision, and various methods have been proposed. Conventional methods were based on hand-crafted features, including sliding window methods [41, 42], connected components [28], strokelet generation [46], histogram of oriented gradients descriptors [37], etc. Conditional Random Field (CRF) based methods [35] were also proposed for text recognition. Later on, deep neural networks were introduced to extract features automatically. Bissacco et al. [3] applied a network with five hidden layers for character classification. Jaderberg et al. [14] combined CRF with CNN for unconstrained recognition. Moreover, recurrent neural networks (RNNs) [11, 7] were introduced and combined with CNN-based methods for better sequential learning [32]. The attention mechanism was applied to a stacked RNN on top of the recursive CNN [20], followed by some revisions of the attention [5]. Recently, works have further advanced the encoder-decoder structure with 2-dimensional (2-D) ones, including 2-D CTC [38], 2-D attention [21, 26] and character-based recognition before sequence semantic learning [43].

Irregular Rectification. Although modern techniques deal well with regular text in scenes, irregular text recognition remains a challenging task. When it comes to the irregularity of scene text, all the existing works are limited to geometric irregularity. Following the rectification approach of the Spatial Transformer Network (STN) [16], Shi et al. [33, 34] first used the thin-plate-spline [4] transformation and regressed the fiducial transformation points on curved text. Similarly, Zhan et al. [48] iteratively applied a new line-fitting transformation to estimate the pose of text lines in scenes. Luo et al. [25] performed multi-object rectification using predicted offset maps. In contrast to image-level rectification, Liu et al. [22] proposed a character-aware network that uses an affine transformation network to rotate each detected character into a regular one. All the above methods handle irregular text using an independent rectification module before recognition to ease the downstream task. Here, we also work on designing rectification modules, since this structure is flexible yet effective and can be deployed with any other recognition network. Unlike all the existing works, we point out that color rectification is also important. Our proposed network aims to rectify the input patterns from a broader perspective and also handles the chromatic difficulties in scene text recognition.

3 Methods

3.1 Structure-Preserving Inner Offset Network (SPIN)

3.1.1 Structure Preserving Network (SPN)

Inspired by the Structure Preserving Transformation (SPT) [29], which cheats modern classifiers by degrading the visual quality of chromaticity, we find that this kind of distortion is usually fatal to deep-learning based classifiers. The ability to control chromaticity suggests an even broader application given proper adaptation. In particular, we find that SPT-based transformations can also rectify images with color distortions by intensity manipulation, especially for examples like Fig. 1(c) and (d). Formally, given the input image $x$, let $x'$ denote the transformed image. A transformation $T$ on $x$ is defined as follows:

$$x'(i, j) = T[x(i, j)] \tag{1}$$

where $x(i, j)$ or $x'(i, j)$ is the intensity of the input or output image at the coordinate $(i, j)$. Specifically, the general form of SPT is defined to be a linear combination of multiple power functions:

$$T(x) = \sum_i \omega_i \, x^{\beta_i} \tag{2}$$

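where $\beta_i$ is the exponent of the i-th basis power function and $\omega_i$ is the corresponding weight. In general, these exponents can be either learned from data or chosen based on domain knowledge and fixed in advance. Unlike Peng et al. [29], where a group of values is manually set, here an empirical formula (3) is applied to calculate the $\beta_i$ in pairs, where $2K+1$ is the number of exponents:

$$\beta_i = \begin{cases} \mathrm{round}\left(\dfrac{\log\left(1 - \frac{i}{2(K+1)}\right)}{\log\left(\frac{i}{2(K+1)}\right)},\ 2\right), & 1 \le i \le K+1 \\[2ex] \mathrm{round}\left(\dfrac{1}{\beta_{i-(K+1)}},\ 2\right), & K+1 < i \le 2K+1 \end{cases} \tag{3}$$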

The calculation can be done efficiently at initialization. Then, the weights are generated from the input image with several common linear blocks. The whole SPT-based network can therefore be summarized as the linear combination of power functions described above with weights predicted by the network: they are computed from the weight parameters and the basic feature extractor output from Block7, as shown in Fig. 2(a), which are generated simultaneously by the network. The newly proposed transformation can improve the visual quality in an end-to-end trainable way by adjusting the intensities without ruining the structures of the original images.
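As an illustration of how the exponents could be precomputed at initialization, the following Python sketch generates $2K+1$ exponents in reciprocal pairs around 1.00. The exact rule used here (a ratio of logarithms rounded to two decimals, followed by its reciprocal) is an assumed reading of the empirical formula (3), and the function name `spt_exponents` is hypothetical.

```python
import math


def spt_exponents(K=6):
    """Return 2K+1 exponents under an assumed reading of formula (3)."""
    first_half = []
    for i in range(1, K + 2):  # i = 1 .. K+1
        t = i / (2 * (K + 1))
        # Assumed rule: ratio of logs, rounded to two decimals.
        first_half.append(round(math.log(1 - t) / math.log(t), 2))
    # i = K+2 .. 2K+1 reuse the reciprocals of the first K exponents.
    # For a very large K the first exponent could round to 0.00,
    # which this sketch does not guard against.
    second_half = [round(1 / b, 2) for b in first_half[:K]]
    return first_half + second_half


exponents = spt_exponents(6)
print(len(exponents), exponents)
```

With this rule the exponent for i = K+1 is exactly 1.00 (the identity mapping), the first K exponents are below 1 and the remaining K are their reciprocals, so brightening and darkening curves come in matched pairs.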

Figure 2: The overview of SPIN. (a) The input image x is first fed into a well-designed network, which outputs the updated image and a group of weights. (b) A structure-preserving transformation is performed on the updated image with the generated weights.
| Layers | Configurations | Output |
|---|---|---|
| Block1 | 3×3 Conv, 32, k: 2×2, s: 2×2 | 50×16 |
| Block2 | 3×3 Conv, 64, k: 2×2, s: 2×2 | 25×8 |
| Block3 | 3×3 Conv, 128, k: 2×2, s: 2×2 | 12×4 |
| Block4-1 / Block4-2 | 3×3 Conv, 256, k: 2×2, s: 2×2 / 3×3 Conv, 16, k: 2×2, s: 1×1 | 6×2 / 6×2 |
| Block5-1 / Block5-2 | 3×3 Conv, 256, k: 2×2, s: 2×2 / 3×3 Conv, 1 | 3×1 / 6×2 |
| Block6 | 3×3 Conv, 512 / - | 3×1 / - |
| Block7 | Linear, 512, 256 / - | 256 / - |
| Block8 | Linear, 256, 2K+2 / - | 2K+2 / - |

Table 1: The architecture of SPN and AIN. The kernel size of the convolution and the number of output channels are noted before and after ‘Conv’, respectively. The kernel size and stride of max-pooling are noted as ‘k’ and ‘s’, respectively. For linear blocks, the input and output channels are noted after ‘Linear’. In addition, SPN and AIN share the same layers from Block1 to Block3.
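The spatial sizes along the main branch of Table 1 can be traced with a few lines of Python. This is a sketch under stated assumptions: the Output column is read as width × height, the input is 100 × 32 (the 32 × 100 resizing in Section 4.2), each 3×3 convolution is assumed to keep the spatial size, and each 2×2 max-pooling with stride 2 is assumed to use no padding. The function names are hypothetical.

```python
def pooled_size(size, kernel=2, stride=2):
    """Output length of an unpadded pooling window along one axis."""
    return (size - kernel) // stride + 1


def trunk_sizes(width=100, height=32, num_blocks=5):
    """Trace (width, height) after each of the first five pooled blocks."""
    sizes = []
    for _ in range(num_blocks):
        width, height = pooled_size(width), pooled_size(height)
        sizes.append((width, height))
    return sizes


for block, size in enumerate(trunk_sizes(), start=1):
    print(f"Block{block}: {size[0]}x{size[1]}")
```

Under these assumptions the trace reproduces the Block1 to Block5-1 outputs of the table; the second branch (Block4-2, Block5-2) uses a different stride and is not modeled here.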

Note that structure preservation is realized by filtering the intensity levels of input images. In detail, all the pixels with the same intensity level in the original image have the same intensity level in the transformed image, where the set of pixels with intensity level c is defined as a structure-pattern [29]. Intuitively, we propose the SPN for rectifying chromatic distortions by taking advantage of this property in two respects: (1) Separating useful structure-patterns from harmful ones by mapping them to different intensity levels, which is likely to generate better contrast and brightness. (2) Aggregating different levels of structure-patterns by mapping them to close intensity levels, which helps alleviate fragments and renders a more unified image.
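The structure-preserving property can be checked on a toy image. The Python sketch below applies a linear combination of power functions pixel by pixel and verifies that pixels sharing an input intensity still share an output intensity. The exponents and weights are hand-picked illustrative values, not the ones computed or learned in the paper, and the helper names are hypothetical.

```python
def spt(image, weights, exponents):
    """Apply T(x) = sum_i w_i * x ** b_i to every pixel of a 2-D list."""
    def transform(x):
        return sum(w * (x ** b) for w, b in zip(weights, exponents))
    return [[transform(x) for x in row] for row in image]


def structure_patterns(image):
    """Group pixel coordinates by their intensity level."""
    patterns = {}
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            patterns.setdefault(value, []).append((i, j))
    return patterns


def preserves_structure(image, transformed):
    """True if every input intensity level maps to a single output level."""
    for coords in structure_patterns(image).values():
        levels = {transformed[i][j] for i, j in coords}
        if len(levels) != 1:
            return False
    return True


image = [[0.2, 0.2, 0.8],
         [0.8, 0.5, 0.2]]
# Illustrative values: a reciprocal pair (0.5, 2.0) plus the identity (1.0).
rectified = spt(image, weights=[0.6, 0.1, 0.3], exponents=[0.5, 2.0, 1.0])
print(preserves_structure(image, rectified))
```

Because the mapping depends only on the intensity value and not on the pixel position, the grouping of pixels is unchanged; only the levels assigned to the groups move, which is what allows contrast to be stretched or levels to be merged.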

3.1.2 Auxiliary Inner-offset Network (AIN)

Since SPN tries to separate and aggregate the specific structure-patterns by exploiting the spatial invariance of words or characters, it implicitly assumes that these patterns have inconsistent intensities, namely that they lie at different levels of structure-patterns. However, it does not consider the cases where the harmful structure-patterns are consistent with the useful ones, which we refer to as pattern coherence. In fact, such patterns cannot be separated due to the intrinsic property of SPN, and they may also degrade visual quality.

To handle the above difficulty, we borrow the concept of offset from a geometric transformation [25] and decouple geometric and chromatic offsets for a better understanding of the rectification procedures. Spatial transformations [22, 33, 34, 48, 25] rectify the location shift of patterns by predicting the corresponding coordinates, which generates geometric offsets (namely outer-offsets). They then re-sample the whole image based on these points. In this re-sampling, the transformation function is applied to the original coordinate (or the coordinate adjusted by outer-offsets), and a sampler computes each pixel of the output image by interpolating the neighboring pixels of the input image. The transformation function can take different forms, e.g. Affine [22, 33], Thin-Plate-Spline (TPS) [4] based STN [33, 34], or Line-Fitting Transformation [48].

In contrast, SPT generates chromatic offsets (namely inner-offsets) at each coordinate. As mentioned above, SPT alone is limited by the intrinsic problem of pattern coherence. Here, we design an auxiliary inner-offset network (AIN) to assist the main transformation in handling text with consistent intensities. The auxiliary inner-offsets are produced by an auxiliary inner-offset function, while the geometric transformation function is set to the identity and thus no sampling operation is needed. The auxiliary inner-offsets can mitigate the pattern coherence problem through slight intensity-level perturbations at each coordinate.

The updated image is obtained by combining the input image x and the predicted auxiliary inner-offsets through a learnable update gate, using the Hadamard (element-wise) product. The gate is a trainable scalar that forms part of the output of Block8 in Fig. 2(a); the other 2K+1 dimensions are the weights in Eq. (4). The update gate can receive signals from SPN and perceive the difficulty of different tasks. It is responsible for controlling the balance between the input image x and the predicted auxiliary inner-offsets at each coordinate. The offsets are predicted by the proposed AIN, whose workflow is illustrated in Figure 2 and whose architecture is given in Table 1. The AIN first divides the image into small patches and then predicts an offset for each. All the offset values are passed through an activation function and mapped to the size of the input image via bilinear interpolation. After that, the enhanced SPT is performed on the updated image, which yields the comprehensive transformation.

Note that throughout, the transformation manipulates the intensity without ruining the structure-patterns. The detailed network and the overall structure are illustrated in Table 1 and Figure 2, respectively.
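The update step of the AIN can be sketched in plain Python. Several details below are assumptions rather than statements from the text: the patch-level offsets and the gate are both passed through a sigmoid, the gate blends the input and the offsets as a convex combination (1 - alpha) * x + alpha * offset, and the bilinear interpolation aligns the corner pixels. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bilinear_resize(grid, out_h, out_w):
    """Resize a 2-D list to (out_h, out_w), aligning the corner pixels."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def apply_inner_offsets(image, patch_logits, gate_logit):
    """Blend the input with upsampled patch offsets (assumed convex blend)."""
    h, w = len(image), len(image[0])
    offsets = [[sigmoid(v) for v in row] for row in patch_logits]
    offsets = bilinear_resize(offsets, h, w)
    alpha = sigmoid(gate_logit)  # scalar update gate in (0, 1)
    return [[(1 - alpha) * image[i][j] + alpha * offsets[i][j]
             for j in range(w)] for i in range(h)]


image = [[0.0, 1.0], [1.0, 0.0]]
updated = apply_inner_offsets(image, patch_logits=[[0.0]], gate_logit=0.0)
print(updated)
```

When the gate is nearly closed the image passes through unchanged; as it opens, each pixel receives a small position-dependent perturbation, so pixels that shared an intensity level can end up at different levels before the SPT is applied.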

3.2 Recognition Network and Network Training

For fair comparison, we adopt a widely-used recognition framework proposed by Baek et al. [1], which consists of a four-stage recognition pipeline: Transformation, Feature Extraction, Sequence Modeling and Prediction. The Transformation network rectifies an input image with predicted transformation parameters to ease downstream stages. Then, the Feature Extraction network maps the input image to a compact representation. The Sequence Modeling module captures the contextual information and the Prediction module maps the identified representation into final characters. Specifically, to build the encoder, which includes the Feature Extraction and Sequence Modeling modules, the settings of CRNN [32] and FAN [5] are selected. The typical network consists of a 7-layer VGG [36] or a 32-layer ResNet [10] and two layers of bidirectional long short-term memory (BiLSTM) [11], each of which has 256 hidden units. Following [33, 34, 48, 1], the Prediction module adopts the attention mechanism, which consists of an attentional LSTM with 256 hidden units, 256 attention units and 69 output units (26 letters, 10 digits, 32 ASCII punctuation marks and 1 EOS symbol). Our model is fully end-to-end trainable by minimizing the conventional loss function on the recognition output probabilities. During the inference stage, we directly select the most probable characters. Only digits and letters are counted and the rest are discarded.
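The output vocabulary and the inference rule can be illustrated with a short Python sketch. The ordering of the classes (lowercase letters, digits, punctuation, then EOS) and the function names are assumptions made for the example; only the class counts and the greedy selection followed by alphanumeric filtering come from the description above.

```python
import string

# 26 letters + 10 digits + 32 ASCII punctuation marks, plus one EOS class.
CHARSET = string.ascii_lowercase + string.digits + string.punctuation
EOS = len(CHARSET)             # assumed index of the EOS symbol
NUM_CLASSES = len(CHARSET) + 1


def greedy_decode(step_probs):
    """Pick the most probable class at each step and stop at EOS."""
    chars = []
    for probs in step_probs:
        idx = max(range(len(probs)), key=probs.__getitem__)
        if idx == EOS:
            break
        chars.append(CHARSET[idx])
    return "".join(chars)


def keep_alnum(text):
    """Keep only letters and digits for evaluation."""
    return "".join(c for c in text if c.isalnum())


def one_hot(index):
    return [1.0 if k == index else 0.0 for k in range(NUM_CLASSES)]


steps = [one_hot(CHARSET.index(c)) for c in "a-1"] + [one_hot(EOS), one_hot(1)]
print(NUM_CLASSES, keep_alnum(greedy_decode(steps)))
```

In this toy sequence the step after EOS is ignored and the hyphen is dropped before scoring, mirroring the rule that only digits and letters are counted.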

4 Experiments

4.1 Dataset

Following formerly unified settings for text recognition [1], we use 2 synthetic datasets for training, including MJSynth (MJ) [15] which contains 9 million synthetic text images and SynthText (ST) [8] where we crop 7 million text image patches from original images for detection.

The popular benchmarks were categorized into 4 regular and 3 irregular datasets as follows according to the difficulty and geometric layout of the texts.

IIIT5K (IIIT) [27] contains scene texts and born-digital images. It consists of 2,000 images for training and 3,000 images for evaluation.

SVT [40] consists of 257 images for training and 647 images for evaluation. The images were collected from Google Street View. Some of these images are noisy, blurry, or of low-resolution.

ICDAR2003 (IC03) [24] contains 1,156 images for training and 860 images for evaluation which was used in the Robust Reading Competition in the International Conference on Document Analysis and Recognition (ICDAR) 2003.

ICDAR2013 (IC13) [18] contains 848 word images for model training and 1,015 for evaluation, and was used in ICDAR 2013.

ICDAR2015 (IC15) [17] contains 2,077 text image patches cropped from scene images, many of which suffer from perspective and curvature distortions; it was used in ICDAR 2015. Researchers have used two different versions for evaluation: 1,811 and 2,077 images. Some extremely geometrically distorted images were discarded in the 1,811 version. We use the 1,811-image version in our discussions and report both results when comparing with other techniques for fair comparison.

SVTP [30] contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.

CUTE80 (CT) [31] contains 288 cropped images for evaluation. Many of these are curved text images.

Note that models are trained on the MJSynth dataset by default. For fair comparison, the SynthText dataset is also used under the state-of-the-art settings. All experiments use only word-level annotations, with no fine-tuning stage and no extra private dataset.

| Rectification | Transformation | Outer offset | Inner offset | IIIT | SVT | IC03 | IC13 | IC15 | SVTP | CT | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| None | (a) None [33] | No | No | 82.4 | 81.9 | 91.2 | 87.3 | 67.4 | 70.9 | 61.6 | 77.5 |
| Chromatic | (b) SPIN w/o AIN | No | Yes | 82.1 | 84.2 | 91.9 | 87.8 | 67.3 | 72.3 | 63.5 | 78.4 |
| Chromatic | (c) SPIN | No | Yes | 82.3 | 84.1 | 91.6 | 88.7 | 68.1 | 72.3 | 64.6 | 78.8 |
| Geometric | (d) STN [1] | Yes | No | 82.7 | 83.3 | 92.0 | 88.5 | 69.6 | 74.1 | 62.9 | 79.0 |
| Both | (e) SPIN+STN | Yes | Yes | 83.6 | 84.4 | 92.7 | 89.2 | 70.9 | 73.2 | 64.6 | 79.8 |
| Both | (f) GA-SPIN | Yes | Yes | 84.1 | 83.2 | 92.3 | 88.7 | 71.0 | 74.5 | 67.1 | 80.1 |

Table 2: Comparison between the baseline and the models combined with different rectifications. All the models are evaluated by accuracy on several benchmarks. The ‘Offset’ columns indicate whether the corresponding offset module is enabled or disabled. ‘’ indicates the model is tested by us under a fair setting.

4.2 Implementation

All images are resized to 32 × 100 before entering the network. K = 6 is the default setting. The parameters are randomly initialized using the method of He et al. [9] unless otherwise specified. Models are trained with the AdaDelta [47] optimizer for 5 epochs with a batch size of 64. Gradient clipping is used at magnitude 5. The learning rate is set to 1.0 initially and decayed to 0.1 and 0.01 at the 4th and 5th epochs, respectively.
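The training schedule can be written down as a small Python sketch. Epochs are assumed to be counted from 1, and "magnitude 5" is interpreted here as a limit on the global L2 norm of the gradient, which is an assumption (clipping each value to [-5, 5] would be another reading). The function names are hypothetical.

```python
import math


def learning_rate(epoch):
    """Learning rate for a 1-indexed epoch: 1.0, then 0.1 at 4, 0.01 at 5."""
    if epoch >= 5:
        return 0.01
    if epoch == 4:
        return 0.1
    return 1.0


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print([learning_rate(e) for e in range(1, 6)])
print(clip_by_global_norm([6.0, 8.0]))
```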

4.3 Ablation Study

To examine the individual factors of the proposed model, we cumulatively enable each configuration one by one on top of a solid baseline. All models are trained on the MJSynth dataset with a 7-layer VGG backbone, and the following models share these settings by default. Table 2 lists the accuracy of each configuration.

Item (a) is the baseline [33] without the rectification module, consisting of the VGG backbone, 2 layers of BiLSTM for sequence modeling and an attention decoder. Items (b) and (c) add the rectification module before the backbone and verify the advantage of chromatic rectification. Specifically, (b) enables the proposed SPIN w/o AIN and improves on (a) by 0.9% on average, which demonstrates the effectiveness of chromatic rectification. Item (c) further enables AIN, which together with the previously added SPN composes the full SPIN. Although only two additional convolution layers are appended to (b), (c) obtains a further average improvement of 0.4%, for a total improvement of 1.3% over the baseline according to the averages in Table 2. The enhancement is mainly from SVT, IC15 and CUTE80, where many images with chromatic distortions are included.

As mentioned, all existing rectification modules focus on geometric transformations. Item (d) shows the baseline (a) equipped with the TPS-based STN [1], which rectifies images via outer-offsets and achieves superior performance among state-of-the-art methods. Interestingly, the performance of the newly proposed chromatic rectification module (c) is close to that of the geometric one. The difference is that the former mainly improves the SVT and CUTE80 benchmarks, which contain a large proportion of chromatically distorted images, while the latter mainly improves the IC15 and SVTP benchmarks, where rotated, perspective-shifted and curved images dominate. This suggests that color rectification is comparably important to shape transformation.

We then explore the combination of both geometric and chromatic rectifications. We expect to learn whether inner-offsets assist the existing outer-offsets [33] and how these two categories can jointly work better. In (e), we try to combine chromatic and geometric transformation by simply placing the SPIN (c) before the TPS-based STN (d) (denoted as SPIN+STN) without additional modification. An average of 0.8% improvement, compared with configuration (d), can be directly achieved, indicating that these two transformations are likely to be two complementary strategies, where the inner part was usually ignored by the former researchers.
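The averages and gains quoted in this ablation can be recomputed from the per-benchmark accuracies in Table 2. The Python sketch below copies the rows of the table and takes an unweighted mean over the seven benchmarks, which is an assumption about how the Avg column was obtained; the dictionary and function names are hypothetical.

```python
# Per-benchmark accuracies from Table 2:
# IIIT, SVT, IC03, IC13, IC15, SVTP, CT.
TABLE2 = {
    "a": [82.4, 81.9, 91.2, 87.3, 67.4, 70.9, 61.6],  # no rectification
    "b": [82.1, 84.2, 91.9, 87.8, 67.3, 72.3, 63.5],  # SPIN w/o AIN
    "c": [82.3, 84.1, 91.6, 88.7, 68.1, 72.3, 64.6],  # SPIN
    "d": [82.7, 83.3, 92.0, 88.5, 69.6, 74.1, 62.9],  # STN
    "e": [83.6, 84.4, 92.7, 89.2, 70.9, 73.2, 64.6],  # SPIN+STN
    "f": [84.1, 83.2, 92.3, 88.7, 71.0, 74.5, 67.1],  # GA-SPIN
}


def mean_accuracy(row):
    """Unweighted mean over the benchmarks, rounded to one decimal."""
    return round(sum(row) / len(row), 1)


def gain(later, earlier):
    """Difference between two rounded averages, in percentage points."""
    return round(mean_accuracy(TABLE2[later]) - mean_accuracy(TABLE2[earlier]), 1)


averages = {name: mean_accuracy(row) for name, row in TABLE2.items()}
print(averages)
print(gain("b", "a"), gain("c", "b"), gain("e", "d"))
```

The unweighted mean matches the Avg column for every row, so the reported gains are differences in percentage points between these rounded averages.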

The pipeline of SPIN and STN in (e) is beneficial in accuracy, but it brings additional computational cost and superfluous parameters. We therefore propose a geometric-absorbed SPIN (denoted as GA-SPIN) in (f), which unifies SPIN and the TPS-based STN in a single transformer. In detail, the two types of offset are integrated together: the parameters of the inner and outer offsets are predicted simultaneously by one single network rather than by two separate modules. The geometric and chromatic parts share the same convolutional network and differ only in the output channels of the last linear layer. For instance, the number of output channels for SPN is 2K+2, as listed in Table 1, and for GA-SPIN it is 2K+2+N, with only N additional outputs (N is the number of parameters for localization in TPS, e.g. 40). Although it reduces parameters and computation cost compared to the straightforward pipeline structure (e), it improves the results by 0.3% on average and gains a large margin of 1.1% over the solely geometric transformer (d). This means that GA-SPIN is a simple yet effective integration of the two kinds, and that the different types of rectification truly complement each other. Note that besides TPS, other kinds of spatial transformers (e.g., Affine [22], MORN [25]) can also be easily extended with our proposed SPIN using the method in either (e) or (f).
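To make the shared output layer concrete, the following Python sketch splits a GA-SPIN head output of length 2K+2+N into its three parts. The ordering of the slots (SPT weights first, then the update gate, then the TPS localization parameters) is an assumption made for illustration, and the function name is hypothetical.

```python
def split_head_output(values, K=6, N=40):
    """Split a flat head output into SPT weights, gate and TPS parameters."""
    expected = 2 * K + 2 + N
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    spt_weights = values[:2 * K + 1]   # one weight per exponent
    gate = values[2 * K + 1]           # scalar update gate
    tps_params = values[2 * K + 2:]    # N localization parameters
    return spt_weights, gate, tps_params


head = [0.0] * (2 * 6 + 2 + 40)
weights, gate, tps = split_head_output(head)
print(len(weights), len(tps))
```

With K = 6 and N = 40 the head has 54 outputs, of which only the last 40 are added on top of the chromatic-only SPIN head.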

4.4 Discussion

4.4.1 Discussion on sensitivity to weight initialization.

Proper weight initialization is necessary for geometric transformations, because these modules are liable to produce highly geometrically distorted scene text images that ruin the training of the recognition network [33, 34, 48]. Thus we follow the scheme proposed by Shi et al. [33] for all the geometric transformations in this paper. In contrast, chromatic transformations are less likely to severely distort the images because they preserve the structure-patterns. Random initialization works well in Table 2 (b) and (c), indicating that our modules are insensitive to weight initialization. However, we observed convergence failure when geometric transformations are included, which indicates that severe chromatic distortions do affect the performance of geometric rectification. For better flexibility when combined with other structures that require careful parameter settings, we initialize the last fully-connected layer of SPIN with zero weights and specific bias values: all biases are set to zero, except those corresponding to the exponent 1.00 and the update gate, which are set to 1.0 and -1.0, respectively.
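The effect of this initialization can be sketched in Python. The slot layout is an assumption for illustration: the 2K+1 SPT weights come first with the weight of the exponent 1.00 in the last of those slots, followed by the update gate. The function names are hypothetical.

```python
def initial_head_bias(K=6):
    """Bias vector of the last layer: zeros except two special slots."""
    num_exponents = 2 * K + 1
    bias = [0.0] * (num_exponents + 1)
    bias[num_exponents - 1] = 1.0   # weight of the x ** 1.00 term (assumed slot)
    bias[num_exponents] = -1.0      # update gate value before any activation
    return bias


def head_output(features, weight_rows, bias):
    """Plain linear layer: one dot product per output plus its bias."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight_rows, bias)]


bias = initial_head_bias(K=6)
zero_weights = [[0.0, 0.0] for _ in bias]
print(head_output([0.3, -1.2], zero_weights, bias))
```

With zero weights the layer ignores its input features and returns the bias, so the linear combination of power functions starts with weight 1.0 on the exponent 1.00 and zero elsewhere, and the gate starts at -1.0. Training can then move away from this starting point gradually.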

Figure 3: Results for different values of K. All the models are trained on the MJ dataset using the VGG feature extractor. The lines in coral, blue and red refer to the baseline without a rectification module, with SPIN and with GA-SPIN, respectively. The results are evaluated as the mean accuracy over the 4 regular benchmarks and the 3 irregular benchmarks.
| Transformation | VGG Regular (%) | VGG Irregular (%) | ResNet Regular (%) | ResNet Irregular (%) | Para. () |
|---|---|---|---|---|---|
| None [33] | 85.7 | 66.6 | 86.5 | 69.0 | 0 |
| TPS-STN [1] | 86.6 (+0.9) | 68.9 (+2.3) | 87.1 (+0.6) | 71.2 (+2.2) | 1.68 |
| ASTER [34] | 87.0 (+1.3) | 68.2 (+1.6) | 88.2 (+1.7) | 70.6 (+1.6) | - |
| ESIR [48] | - | 70.2 (+3.6) | - | 72.7 (+3.7) | - |
| SPIN w/o AIN | 86.5 (+0.8) | 67.7 (+0.9) | 87.5 (+1.0) | 69.9 (+0.9) | 2.28 |
| SPIN | 86.7 (+1.0) | 68.3 (+1.7) | 87.6 (+1.1) | 71.0 (+2.0) | 2.30 |
| GA-SPIN | 87.1 (+1.4) | 70.9 (+4.3) | 87.8 (+1.3) | 73.0 (+4.0) | 2.31 |

Table 3: Study of modules with respect to accuracy and the number of parameters. The accuracies are obtained by taking the mean of the results on the regular and irregular benchmarks, respectively, with the given module included. ‘*’ denotes that the reported results use a deeper 45-layer ResNet than our basic 32-layer ResNet setting. ‘’ indicates the model is tested by us under the same setting.
Figure 4: Visual comparison of different rectification methods: the input images are in the first column, columns 2-4 show the rectified images by using the TPS-based STN, SPIN, and GA-SPIN, respectively. The sample images are from IC15 and CUTE80. The proposed SPIN performs better in chromatic distortion rectification and GA-SPIN performs well on both geometric and chromatic distortion rectification.
Figure 5: Examples of the image rectification effects and the recognition results. “gt” indicates the ground truth. All the prediction results are shown under the images. Red characters are mistakenly recognized characters. The proposed rectification helps enhance the visual quality and improve recognition performance.

4.4.2 Discussion on regular and irregular cases.

The two types of datasets were originally grouped mainly according to the difficulty of their geometric layout [1], but the grouping also makes sense for evaluating our proposed chromatic rectification, as spatial and color distortions often coexist in relatively complex scenes. Taking the VGG results in Table 3 as an example, we find that SPIN w/o AIN yields similar gains on regular and irregular cases, about +1.0% on both VGG and ResNet. SPIN then builds on this by largely improving the performance on the irregular case (+1.7%) with few additional parameters. It is not surprising that [1, 34, 48] improve the accuracy on irregular datasets more than on regular ones, because they are intended to rectify text with geometric distortions. However, it is encouraging that the spatially invariant transformation can achieve similar improvements (approximately +1.0% on the regular case and +1.7% on the irregular case), demonstrating the comparable significance of chromatic rectification. More importantly, a clear enhancement (approximately +1.4% and +4.3% on the two cases, respectively) is observed with the GA-SPIN module, showing the strength of integrating geometric and chromatic transformations. The ResNet settings lead to similar conclusions.

4.4.3 Discussion on different values of K.

We empirically analyze the performance under different K values, from 0 to 12, both in SPIN and GA-SPIN models. The results are demonstrated in Figure 3 with mean accuracy under regular and irregular cases. We can conclude from the curves that both SPIN and GA-SPIN will steadily improve the performance when K becomes larger, indicating the effectiveness of chromatic transformations. A proper value of K=6 is selected. We do not observe additional performance gain when K is unreasonably large as the smaller K indicates simpler transformation and the larger K may result in complexity and redundancy.

4.5 Comparison with Rectification Methods

To further explore the role of chromatic rectification and of fusing the two kinds of rectification, this section uses the structures described in Table 2, (c) SPIN and (f) GA-SPIN. For fair comparison, we also report results under different network architectures, including a 7-layer VGG [32] and a 32-layer ResNet [5], and under different training datasets, as in [1].

Table 3 lists the top reported results among existing geometric-based transformations [33, 34, 48] on both regular and irregular datasets with different backbones. Both SPIN and GA-SPIN achieve superior recognition performance compared to existing rectification modules. The visualizations in Figure 4 show that the proposed SPIN can rectify chromatic distortions by adjusting colors. Unbalanced contrast (rows 1-2), low brightness (row 3) and shadow (row 4), which geometric rectifications do not consider, can all be mitigated. In addition, GA-SPIN leads to better visual results from both geometric and chromatic perspectives. Figure 5 illustrates the rectification effects and text predictions of the TPS-based STN [1], SPIN and GA-SPIN, where the five rows show several randomly sampled images from SVT, IC15 and CUTE80. It shows that rectifying chromatic distortion can improve recognition in two respects. First, it makes each character much easier to recognize directly: the first two rows show that models can easily be misled by special symbols or similar shapes under severe chromatic distortion. At the same time, the proposed SPIN does not degrade scene text images that are free of chromatic distortion, such as the samples in the third row. Second, chromatic and geometric rectification reinforce each other, as shown by the last two images, which are distorted both chromatically and geometrically.

4.6 Comparison with the State-of-the-Art

Table 4 lists the reported state-of-the-art results on STR tasks. The proposed SPIN achieves superior performance across the 7 datasets compared with other techniques, and GA-SPIN further boosts the results. In particular, SPIN consistently outperforms the existing results [32, 33, 25, 48] under the VGG or ResNet backbone trained on the MJ dataset, and GA-SPIN outperforms them by a larger margin. Under the full setting, with a ResNet backbone trained on the 2 synthetic datasets, SPIN already outperforms all methods that use spatial transformers [34, 1, 48] on all three irregular benchmarks. This indicates that, despite the widespread focus on geometric irregularity, color distortions are also vital in these complex scenes. Note that even against stronger settings, such as character-level annotations [5], deeper convolutional structures (e.g., the 45-layer ResNet in [34, 25]) and the newly designed framework of [43], SPIN obtains satisfying performance and GA-SPIN shows an impressive further improvement. Our methods fall only slightly behind the recently reported [43] on IC03 and IC13. We attribute this to its deeper convolutions, multi-scale features and newly designed decoupled attention mechanism, which are strong on regular text recognition but are outperformed on more complex scenes, including the IC15, SVTP and CUTE80 datasets, where GA-SPIN leads by 2.8 to 4.0 points in Table 4. In overall performance on the 7 benchmarks, both the proposed SPIN and the extended GA-SPIN surpass the existing methods.

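| Methods | ConvNet | Training data | IIIT5K | SVT | IC03 | IC13 | IC15 (a) | IC15 (b) | SVTP | CUTE80 |
|---|---|---|---|---|---|---|---|---|---|---|
| Jaderberg et al. [13] | VGG | MJ | - | 80.7 | 93.1 | 90.8 | - | - | - | - |
| Shi et al. [32] | VGG | MJ | 78.2 | 80.8 | 89.4 | 86.7 | - | - | - | - |
| Shi et al. [33] | VGG | MJ | 81.9 | 81.9 | 90.1 | 88.6 | - | - | 71.8 | 59.2 |
| Lee et al. [20] | VGG | MJ | 78.4 | 80.7 | 88.7 | 90.0 | - | - | - | - |
| Yang et al. [44] | VGG | P+C | - | - | - | - | - | - | 75.8 | 69.3 |
| Liu et al. [22] | ResNet | MJ+P | 83.3 | 83.6 | 89.9 | 89.1 | - | - | 73.5 | - |
| Cheng et al. [6] | VGG | MJ+ST | 87.0 | 82.8 | 91.5 | - | - | 68.2 | 73.0 | 76.8 |
| Cheng et al. [5] | ResNet | MJ+ST+C | 87.4 | 85.9 | 94.2 | 93.3 | 70.6 | - | - | - |
| Shi et al. [34] | ResNet | MJ+ST | 93.4 | 89.5 | 94.5 | 91.8 | 76.1 | - | 78.5 | 79.5 |
| Zhan et al. [48] | ResNet | MJ+ST | 93.3 | 90.2 | - | 91.3 | 76.9 | - | 79.6 | 83.3 |
| Baek et al. [1] | ResNet | MJ+ST | 87.9 | 87.5 | 94.9 | 92.3 | 77.6 | 71.8 | 79.2 | 74.0 |
| Luo et al. [25] | ResNet | MJ+ST | 91.2 | 88.3 | **95.0** | 92.4 | - | 68.8 | 76.1 | 77.4 |
| Xie et al. [45] | ResNet | MJ+ST | - | - | - | - | - | 68.9 | 70.1 | 82.6 |
| Wang et al. [43] | ResNet | MJ+ST | 94.3 | 89.2 | **95.0** | **93.9** | - | 74.5 | 80.0 | 84.4 |
| SPIN | VGG | MJ | 82.3 | 84.1 | 91.6 | 88.7 | 68.1 | 63.4 | 72.3 | 64.6 |
| GA-SPIN | VGG | MJ | 84.1 | 83.2 | 92.3 | 88.7 | 71.0 | 65.1 | 74.5 | 67.1 |
| SPIN | ResNet | MJ | 83.9 | 84.6 | 92.9 | 88.9 | 70.6 | 65.0 | 75.0 | 67.4 |
| GA-SPIN | ResNet | MJ | 84.2 | 83.9 | 93.0 | 90.0 | 72.7 | 67.1 | 76.1 | 70.1 |
| SPIN | ResNet | MJ+ST | **94.7** | 87.6 | 93.4 | 91.5 | 79.1 | 76.0 | 79.7 | 85.1 |
| GA-SPIN | ResNet | MJ+ST | **94.7** | **90.3** | 94.4 | 92.8 | **82.2** | **78.5** | **82.8** | **87.5** |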
Table 4: Performance of existing STR models with their training settings. ‘MJ’, ‘ST’, ‘C’ and ‘P’ indicate the MJSynth dataset, the SynthText dataset, character-level annotations [5, 44] and private training data [44, 22, 21], respectively. Top accuracy for each benchmark is shown in bold. For IC15, a and b indicate the versions with 1811 and 2077 examples, respectively.
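The margins discussed in this section can be checked directly from Table 4. The sketch below copies three ResNet, MJ+ST rows from the table and computes per-benchmark differences in percentage points. The column names in `BENCHMARKS` are an assumption based on the usual benchmark order, since the column headers are not legible in the flattened table, and the helper name `gaps` is purely illustrative.

```python
# Assumed column order for Table 4 (the original header row is not legible).
BENCHMARKS = ["IIIT5K", "SVT", "IC03", "IC13", "IC15a", "IC15b", "SVTP", "CUTE80"]

# Rows copied from Table 4 (ResNet backbone, MJ+ST training data).
# None marks a value that the table reports as "-".
ROWS = {
    "Wang et al. [43]": [94.3, 89.2, 95.0, 93.9, None, 74.5, 80.0, 84.4],
    "SPIN": [94.7, 87.6, 93.4, 91.5, 79.1, 76.0, 79.7, 85.1],
    "GA-SPIN": [94.7, 90.3, 94.4, 92.8, 82.2, 78.5, 82.8, 87.5],
}


def gaps(a, b):
    """Per-benchmark difference a - b in percentage points.

    Benchmarks where either method has no reported value are skipped.
    """
    return {
        name: round(x - y, 1)
        for name, x, y in zip(BENCHMARKS, ROWS[a], ROWS[b])
        if x is not None and y is not None
    }


if __name__ == "__main__":
    print("GA-SPIN vs SPIN:", gaps("GA-SPIN", "SPIN"))
    print("GA-SPIN vs [43]:", gaps("GA-SPIN", "Wang et al. [43]"))
```

Comparing rows this way keeps the regular and irregular benchmarks separate, which matters here because the methods rank differently on the two groups.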

5 Conclusion

This paper proposes the novel idea of chromatic rectification for scene text recognition. Our proposed module, SPIN, allows manipulation of channel intensities within the network, giving neural networks the ability to actively transform the input color to obtain clearer text patterns. Extensive experiments show its benefits for performance, especially in more complex scenes. We further verify that geometric and chromatic rectification can be unified in GA-SPIN rather than simply pipelining two transformers, which further boosts the results and outperforms existing rectification modules by a large margin. In the future, we intend to extend the chromatic transformer to more general research fields such as object classification.


  • [1] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. ICCV. Cited by: §1, §1, §3.2, §4.1, §4.3, §4.4.2, §4.5, §4.5, §4.6, Table 2, Table 3, Table 4.
  • [2] A. K. Bhunia, A. Das, A. K. Bhunia, P. S. R. Kishore, and P. P. Roy (2019) Handwriting recognition in low-resource scripts using adversarial learning. In CVPR, pp. 4767–4776. Cited by: §1.
  • [3] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven (2013) PhotoOCR: reading text in uncontrolled conditions. In ICCV, pp. 785–792. Cited by: §2.
  • [4] F. L. Bookstein (1989) Principal warps: thin-plate splines and the decomposition of deformations. TPAMI 11 (6), pp. 567–585. Cited by: §2, §3.1.2.
  • [5] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In ICCV, pp. 5076–5084. Cited by: §1, §2, §3.2, §4.5, §4.6, Table 4.
  • [6] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) AON: towards arbitrarily-oriented text recognition. In CVPR, pp. 5571–5579. Cited by: §1, Table 4.
  • [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555. Cited by: §2.
  • [8] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, pp. 2315–2324. Cited by: §4.1.
  • [9] K. He, X. Zhang, S. Ren, and S. Jian (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In CVPR, Cited by: §4.2.
  • [10] Cited by: §3.2.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2, §3.2.
  • [12] H. Hosseini and R. Poovendran (2018) Semantic adversarial examples. CoRR abs/1804.00499. Cited by: §1.
  • [13] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman Reading text in the wild with convolutional neural networks. IJCV 116 (1), pp. 1–20. Cited by: Table 4.
  • [14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep structured output learning for unconstrained text recognition. ICLR. Cited by: §2.
  • [15] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227. Cited by: §4.1.
  • [16] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, pp. 2017–2025. Cited by: §1, §2.
  • [17] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §4.1.
  • [18] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In ICDAR, pp. 1484–1493. Cited by: §4.1.
  • [19] B. Landau, L. B. Smith, and S. S. Jones (1988) The importance of shape in early lexical learning. Cognitive development 3 (3), pp. 299–321. Cited by: §1.
  • [20] C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pp. 2231–2239. Cited by: §1, §2, Table 4.
  • [21] H. Li, P. Wang, C. Shen, and G. Zhang (2019) Show, attend and read: a simple and strong baseline for irregular text recognition. In AAAI, Vol. 33, pp. 8610–8617. Cited by: §2, Table 4.
  • [22] W. Liu, C. Chen, K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition. Cited by: §1, §2, §3.1.2, §4.3, Table 4.
  • [23] Y. Liu, Z. Wang, H. Jin, and I. Wassell (2018) Synthetically supervised feature learning for scene text recognition. In ECCV, pp. 435–451. Cited by: §1.
  • [24] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young (2003) ICDAR 2003 robust reading competitions. In ICDAR, pp. 682–687. Cited by: §4.1.
  • [25] C. Luo, L. Jin, and Z. Sun (2019) MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognition, pp. 109–118. Cited by: §1, §1, §1, §2, §3.1.2, §4.3, §4.6, Table 4.
  • [26] P. Lyu, Z. Yang, X. Leng, X. Wu, R. Li, and X. Shen (2019) 2D attentional irregular scene text recognizer. CoRR abs/1906.05708. Cited by: §2.
  • [27] A. Mishra, K. Alahari, and C. Jawahar (2012) Scene text recognition using higher order language priors. In BMVC, Cited by: §4.1.
  • [28] L. Neumann and J. Matas (2012) Real-time scene text localization and recognition. In CVPR, pp. 3538–3545. Cited by: §2.
  • [29] D. Peng, Z. Zheng, and X. Zhang (2019) Structure-preserving transformation: generating diverse and transferable adversarial examples. In AAAI, Cited by: §1, §3.1.1, §3.1.1.
  • [30] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In ICCV, pp. 569–576. Cited by: §4.1.
  • [31] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: §4.1.
  • [32] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39 (11), pp. 2298–2304. Cited by: §1, §2, §3.2, §4.5, §4.6, Table 4.
  • [33] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016) Robust scene text recognition with automatic rectification. In CVPR, pp. 4168–4176. Cited by: §1, §1, §1, §1, §1, §2, §3.1.2, §3.2, §4.3, §4.3, §4.4.1, §4.5, §4.6, Table 2, Table 3, Table 4.
  • [34] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) ASTER: an attentional scene text recognizer with flexible rectification. TPAMI. Cited by: §1, §1, §1, §1, §2, §3.1.2, §3.2, §4.4.1, §4.4.2, §4.5, §4.6, Table 3, Table 4.
  • [35] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, and Z. Zhang (2013) Scene text recognition using part-based tree-structured character detection. In CVPR, pp. 2961–2968. Cited by: §2.
  • [36] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In CVPR, Cited by: §3.2.
  • [37] B. Su and S. Lu (2014) Accurate scene text recognition based on recurrent neural network. In ACCV, pp. 35–48. Cited by: §2.
  • [38] Z. Wan, F. Xie, Y. Liu, X. Bai, and C. Yao 2D-CTC for scene text recognition. Cited by: §2.
  • [39] J. Wang and X. Hu (2017) Gated recurrent convolution neural network for OCR. In NIPS, pp. 335–344. Cited by: §1.
  • [40] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In ICCV, pp. 1457–1464. Cited by: §4.1.
  • [41] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In ICCV, pp. 1457–1464. Cited by: §2.
  • [42] K. Wang and S. Belongie (2010) Word spotting in the wild. In ECCV, pp. 591–604. Cited by: §2.
  • [43] T. Wang, Y. Zhu, L. Jin, C. Luo, X. Chen, Y. Wu, Q. Wang, and M. Cai (2020) Decoupled attention network for text recognition. In AAAI, Cited by: §2, §4.6, Table 4.
  • [44] Y. Xiao, D. He, Z. Zhou, D. Kifer, and C. L. Giles (2017) Learning to read irregular text with attention mechanisms. In IJCAI, pp. 3280–3286. Cited by: Table 4.
  • [45] Z. Xie, Y. Huang, Y. Zhu, L. Jin, Y. Liu, and L. Xie (2019) Aggregation cross-entropy for sequence recognition. pp. 6538–6547. Cited by: Table 4.
  • [46] C. Yao, X. Bai, B. Shi, and W. Liu (2014) Strokelets: a learned multi-scale representation for scene text recognition. In CVPR, pp. 4042–4049. Cited by: §2.
  • [47] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701. Cited by: §4.2.
  • [48] F. Zhan and S. Lu (2019) ESIR: end-to-end scene text recognition via iterative image rectification. In CVPR, pp. 2059–2068. Cited by: §1, §1, §1, §1, §1, §2, §3.1.2, §3.2, §4.4.1, §4.4.2, §4.5, §4.6, Table 3, Table 4.
  • [49] Y. Zhang, S. Nie, W. Liu, X. Xu, D. Zhang, and H. T. Shen (2019) Sequence-to-sequence domain adaptation network for robust text image recognition. In CVPR, pp. 2740–2749. Cited by: §1.