
Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition

by Changxu Cheng, et al.

Semantic information has proved effective in scene text recognition. Most existing methods couple visual and semantic information in an attention-based decoder. As a result, the learning of semantic features is prone to bias toward the limited vocabulary of the training set, an effect called vocabulary reliance. In this paper, we propose a novel Visual-Semantic Decoupling Network (VSDN) to address the problem. Our VSDN contains a Visual Decoder (VD) and a Semantic Decoder (SD) to learn purer visual and purer semantic feature representations respectively. Besides, a Semantic Encoder (SE) is designed to match SD; the two can be pre-trained jointly on an additional, inexpensive large vocabulary via a simple word correction task. The semantic feature thus becomes more unbiased and precise, guiding the visual feature alignment and enriching the final character representation. Experiments show that our method achieves state-of-the-art or competitive results on the standard benchmarks, and outperforms the popular baseline by a large margin when the training set has a small vocabulary.


1 Introduction

Text carries rich semantic information that is useful in many practical applications such as autonomous driving, intelligent transportation systems, and scene understanding. Reading text in scene images therefore plays an important role in artificial intelligence.

In the community of scene text recognition, semantic information has been shown to be useful alongside visual features in end-to-end models [2, 18, 29, 32], especially for blurred or occluded text images. However, the learning of semantic features can be a double-edged sword, whose side effect is called vocabulary reliance [24]. Previous visual-semantic coupling methods (Figure 2(b)) suffer much from this effect: the model performs well on images whose words are within the training vocabulary but generalizes poorly on images whose words fall outside it. As shown in Figure 1, existing V-S coupling methods like ASTER [21] tend to misrecognize texts as words that have appeared in the training phase.

The vocabulary reliance effect mainly arises because these V-S coupling methods learn visual and semantic features simultaneously in a hybrid decoder. The semantics is learned only from the limited and noisy word set of the training image data, and thus becomes overfitted and inaccurate under a parameter-rich decoder. Hence, the processes of character alignment and character representation are poisoned by the wrongly learned semantics.

Figure 1: Comparison of VSDN with a V-S coupling method (ASTER [21]). Ours performs better when the words are outside the training vocabulary.
Figure 2: Different kinds of scene text recognition methods from the perspective of visual and semantic learning. 'V' and 'S' denote 'visual' and 'semantic' respectively.

To address the problem, we propose the Visual-Semantic Decoupling Network (VSDN), in which character-level visual and semantic features are learned in a visual decoder and a semantic decoder respectively. Besides, a bidirectional semantic encoder (SE) is designed to match the semantic decoder (SD). SE and SD make up a semantic module that can be pre-trained with a word correction task to gain extra semantic information from inexpensive text data (Figure 2(d)). As a result, the extracted character-level semantic feature is more accurate and robust, improving character feature alignment and representation.

Succinctly, the main contributions of this paper are three-fold. Firstly, we propose a novel Visual-Semantic Decoupling Network (VSDN) that decouples visual and semantic feature learning, which alleviates the vocabulary reliance problem. Secondly, we design a character-level semantic module, which can be easily pre-trained with a word correction task and then used to initialize the corresponding part of our VSDN. Thirdly, experiments on several public datasets demonstrate that our method achieves state-of-the-art or competitive recognition performance under fair comparison. In particular, the performance when training words and images are scarce is remarkable compared with visual-semantic coupled methods.

2 Related Works

As depicted in Figure 2, scene text recognition methods can be divided into three categories by the way they model semantic information: the visual method, the visual-semantic coupling method and the visual-semantic serial method.

Visual Method: Treating scene text recognition as a purely visual task, visual methods are semantic-free. [10] directly classifies a given text image into one of 90k pre-defined word classes and is thus incapable of coping with text images whose words fall outside the pre-defined lexicon. [20] uses a CNN and an RNN to encode a given text image into a sequence feature and then feeds it into a CTC [5] decoder to align each character at each time step. Inspired by the success of segmentation methods, [15] proposes a segmentation-based approach for STR, which uses an FCN to predict each pixel's character class and gathers the characters into a word. [23] is segmentation-based as well, representing the position and order of characters with different channels to better align characters. These segmentation-based methods need expensive character-level annotations.

Visual-Semantic Coupling Method: Semantic information plays a complementary role when visual information is insufficient due to low image quality. After encoding the input text image into a 1D sequence feature, [13] decodes the sequence feature into the target sequence with the attention mechanism [1]. At each time step, the result of the last time step influences the present result, and semantic information that captures the dependency between characters is built during this process. [21] applies a similar approach, adding a rectification module before the CNN to alleviate the difficulty brought by the spatial layout of text images. [28] utilizes a symmetry-constrained rectification module that rectifies text images better. [31] improves rectification by iteratively rectifying text images. [4] extracts image features in four directions and fuses them using a filter gate. Inspired by [8], which combines CTC with the attention mechanism for speech recognition, [33] uses a CTC-Attention mechanism to gain better performance. Since the CTC decoder has an advantage in inference speed while the attention-based decoder is better at learning good feature representations, [9] proposes to learn feature representations with powerful attentional guidance while using a CTC decoder to maintain fast inference. [2] uses a gate to control the influence of the last time step's semantic information on the present time step. [18] uses word embeddings from a pre-trained language model to predict additional global semantic information to guide the decoding process. Though achieving good performance, these attention-based methods build visual and semantic information in a coupled way, since they use one decoder to decode visual and semantic features simultaneously, and thus suffer from the vocabulary reliance problem.

Visual-Semantic Serial Method: [29] uses two modules in a serial structure to decode visual features and semantic features separately. When decoding in the semantic module, [29] utilizes the transformer unit [22] to build global semantic information. This serial design makes [29] depend heavily on the visual module's output, especially the feature alignment, which is generated without using the later semantic information.

3 Methodology

Figure 3: Architecture of the proposed Visual-Semantic Decoupling Network. After extracting the local visual features, we exploit a visual decoder and a semantic module (semantic encoder and decoder) respectively to get character-level visual and semantic features. The dashed line denotes the recurrent process.

The overall structure of our proposed network is illustrated in Figure 3. It comprises four components: (1) a shared feature extractor that takes a rectified image as input and encodes it into a local visual feature sequence; (2) an attention-based visual decoder that outputs the character-level visual feature sequence; (3) a semantic module, consisting of a character embedding layer, a semantic encoder and a semantic decoder, that outputs the character-level semantic feature sequence; (4) a fusion block that combines the character-level visual and semantic feature representations to get the final recognition result.

Given a rectified image $X \in \mathbb{R}^{H \times W}$, where $H = 64$ and $W = 256$, we first use a 45-layer ResNet and a two-layer bidirectional LSTM [6] to encode $X$ into a feature map $F$ [21], where the width $W' = 25$ and the channel number $C = 512$. We squeeze the height dimension so that $F$ becomes a sequence of local visual features $(f_1, \dots, f_{W'})$ with $f_i \in \mathbb{R}^{C}$. The CTC decoder [5], which consists of a two-layer bidirectional LSTM and a softmax layer, takes $F$ as input and outputs a sequence of probability distributions, from which we obtain a coarse prediction by selecting the most probable character at each time step.
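The coarse prediction step can be sketched as CTC greedy ("best path") decoding: pick the most probable class per time step, collapse consecutive repeats, then drop blanks. The charset and the blank index below are illustrative, not from the paper.

```python
BLANK = 0  # index of the CTC blank symbol (assumption)

def ctc_greedy_decode(prob_seq, charset):
    """prob_seq: list of per-step probability distributions (lists of floats)."""
    # 1) Pick the best class at each time step.
    best = [max(range(len(p)), key=lambda k: p[k]) for p in prob_seq]
    # 2) Collapse consecutive duplicates, 3) remove blanks.
    out, prev = [], None
    for k in best:
        if k != prev and k != BLANK:
            out.append(charset[k])
        prev = k
    return "".join(out)

# Distributions over (blank, 'a', 'b'); the best path b b _ a collapses to "ba".
probs = [
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
]
print(ctc_greedy_decode(probs, ["_", "a", "b"]))  # -> ba
```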

3.1 Visual Decoder

The visual decoder is the combination of an attention unit and a GRU unit for extracting character-level visual features.

At time step $t$, we concatenate the character-level semantic feature $s_t$ (detailed in Section 3.2) and a learnable step embedding $e_t$ as the query to get the attention weights $\alpha_t$ on the local visual feature sequence $(f_1, \dots, f_{W'})$:

$$\alpha_{t,i} = \frac{\exp(u_{t,i})}{\sum_{j} \exp(u_{t,j})}, \qquad u_{t,i} = w^{\top} \tanh(W_q [s_t; e_t] + W_f f_i)$$

When generating the attention weights $\alpha_t$, we choose the semantic feature $s_t$ as the query instead of the previous visual hidden state $v_{t-1}$, the common choice in previous attention-based methods: $s_t$ carries the semantic information built in the semantic module, and this information helps the visual decoder attend to the corresponding part of the feature map more accurately (analyzed in Section 4.6).

Since $\alpha_t$ can be regarded as a location mask on the feature map at the current step, the attention unit carries out the character feature alignment. The visual glimpse $g_t$ and the character-level visual feature $v_t$ are:

$$g_t = \sum_{i} \alpha_{t,i} f_i, \qquad v_t = \mathrm{GRU}(g_t, v_{t-1})$$

It is worth noting that we do not feed in the character predicted at the last time step: doing so would lead the visual decoder to capture dependencies between output characters and build extra semantic information, which is supposed to reside only in the semantic module.
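The alignment step above can be sketched as follows: score each local feature against the query, softmax the scores into a location mask, and take the glimpse as the weighted sum. A plain dot-product score stands in for the learned scoring function; all shapes and the query construction are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def align_and_glimpse(query, features):
    """query: vector; features: list of local visual feature vectors f_i."""
    # Dot-product score (stand-in for the learned scoring function).
    scores = [sum(q * f for q, f in zip(query, fi)) for fi in features]
    alpha = softmax(scores)            # attention weights = location mask
    dim = len(features[0])
    glimpse = [sum(a * fi[d] for a, fi in zip(alpha, features))
               for d in range(dim)]    # g_t = sum_i alpha_i * f_i
    return alpha, glimpse

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, g = align_and_glimpse([2.0, 0.0], feats)
assert abs(sum(alpha) - 1.0) < 1e-9   # weights form a distribution
```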

3.2 Semantic Module

Our semantic module is designed for word-level language modeling.

Semantic Encoder (SE) first maps a text sequence $(y_1, \dots, y_T)$ of length $T$ to embeddings with a character embedding layer; the embeddings are then fed into a two-layer bidirectional GRU, a linear function and an averaging operation (average across the $T$ dimension) to get the global semantic embedding $G$.
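The final averaging operation is a simple mean across the $T$ dimension; a minimal sketch, where the per-step vectors stand in for the BiGRU outputs after the linear layer:

```python
def global_embedding(hidden_states):
    """Average per-step hidden vectors across the T (time) dimension."""
    T, dim = len(hidden_states), len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / T for d in range(dim)]

# Two time steps, 2-dim hidden vectors.
G = global_embedding([[1.0, 4.0], [3.0, 0.0]])
print(G)  # -> [2.0, 2.0]
```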

Semantic Decoder (SD) is designed to generate semantic hidden states step by step. The global semantic embedding $G$ is mapped to an initial semantic hidden state $s_0$ and a common word embedding $m$ by two linear layers respectively. At time step $t$, the new hidden state $s_t$ is calculated from the character predicted in the previous step $\hat{y}_{t-1}$, the previous hidden state $s_{t-1}$ and the word embedding $m$ as follows:

$$s_t = \mathrm{GRU}([E(\hat{y}_{t-1}); m],\ s_{t-1})$$
3.2.1 Pre-training: a Word Correction Task

To make the semantic module learn word-level semantics beyond the limited-size vocabulary of the training image set, we adopt a simple language-modeling task: correcting words that may contain spelling errors. SE takes a string as input and SD outputs the corresponding correct word. Our training vocabulary is mainly derived from Synth90K [10], which contains 90k words. Besides, we add some random digit strings to enrich the vocabulary.

When preparing the input string, we follow some specially-designed rules to simulate the coarse text predictions of the CTC decoder as realistically as possible. Specifically, we corrupt a word by applying 3 probabilistic operations to a random character: replacement (40%), insertion (10%) and deletion (15%). We replace a character with another one sampled according to the visual similarity matrix $M$, which is calculated by $S = \mathrm{Cos}(W_c, W_c)$, $S_{ii} \leftarrow -\infty$, and $M = \mathrm{softmax}(\tau S)$ successively, where $W_c$ is the weight of the classifier in the CTC decoder (an approximate visual metric for characters), $\mathrm{Cos}$ is the pairwise cosine similarity function, $N$ is the number of classes, and $\tau$ is a hyper-parameter set to 3 empirically.
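The corruption rules can be sketched as below, with the stated probabilities: one random character is replaced (40%), inserted before (10%) or deleted (15%); otherwise the word is kept intact. Uniform replacement stands in for the visual-similarity-weighted sampling, and the alphabet is an assumption.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def corrupt(word, rng):
    """Simulate a noisy CTC-style prediction of `word` (illustrative rules)."""
    r = rng.random()
    i = rng.randrange(len(word))
    if r < 0.40:                                   # replacement (40%)
        c = rng.choice(ALPHABET.replace(word[i], ""))
        return word[:i] + c + word[i + 1:]
    if r < 0.50:                                   # insertion (10%)
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if r < 0.65:                                   # deletion (15%)
        return word[:i] + word[i + 1:]
    return word                                    # keep intact (35%)

rng = random.Random(0)
noisy = [corrupt("reliance", rng) for _ in range(5)]
assert all(abs(len(w) - len("reliance")) <= 1 for w in noisy)
```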

| Training Data | Method | IIIT5K InVoc | IIIT5K OutVoc | IIIT5K Total | IC15 InVoc | IC15 OutVoc | IC15 Total |
|---|---|---|---|---|---|---|---|
| Synth9K | No. of test images | 263 | 2737 | 3000 | 136 | 1675 | 1811 |
| | Aster | 90.1 | 30.0 | 35.3 | 76.5 | 19.0 | 23.3 |
| | Aster (w/o LM) | 88.2 | 67.6 | 69.4 | 69.1 | 57.6 | 58.4 |
| | VSDN | 76.8 | 57.0 | 58.8 | 66.2 | 47.7 | 49.1 |
| | VSDN (pre-trained) | 87.8 | 71.3 | 72.7 | 75.7 | 65.9 | 66.6 |
| Synth18K | No. of test images | 468 | 2532 | 3000 | 251 | 1560 | 1811 |
| | Aster | 88.7 | 54.4 | 59.7 | 80.9 | 46.3 | 51.1 |
| | Aster (w/o LM) | 83.1 | 70.3 | 72.3 | 63.7 | 59.0 | 59.7 |
| | VSDN | 85.7 | 67.7 | 70.5 | 72.1 | 56.9 | 59.0 |
| | VSDN (pre-trained) | 84.6 | 75.2 | 76.7 | 77.3 | 67.6 | 68.9 |
| Synth45K | No. of test images | 1231 | 1769 | 3000 | 742 | 1069 | 1811 |
| | Aster | 86.2 | 71.5 | 77.5 | 78.2 | 59.8 | 67.3 |
| | Aster (w/o LM) | 82.8 | 74.6 | 78.0 | 67.4 | 59.4 | 62.7 |
| | VSDN | 87.7 | 75.7 | 80.7 | 80.7 | 63.5 | 70.6 |
| | VSDN (pre-trained) | 86.3 | 80.2 | 82.7 | 80.1 | 68.1 | 73.0 |
| Synth90K | No. of test images | 2415 | 585 | 3000 | 1420 | 391 | 1811 |
| | Aster | 87.0 | 60.5 | 81.9 | 78.0 | 43.7 | 70.6 |
| | Aster (w/o LM) | 83.6 | 65.5 | 80.1 | 73.0 | 49.9 | 68.0 |
| | VSDN | 87.8 | 62.4 | 82.9 | 77.6 | 48.7 | 71.2 |
| | VSDN (pre-trained) | 87.6 | 70.1 | 84.2 | 80.1 | 53.7 | 74.4 |

Table 1: Accuracies of models trained on datasets with small vocabulary and sample sizes. "No. of test images" gives the sizes of the in-vocabulary / out-of-vocabulary test splits.
| Method | InVoc (1435) | OutVoc (1565) | Total (3000) |
|---|---|---|---|
| Aster | 68.6 | 0.5 | 33.1 |
| Aster (w/o LM) | 78.3 | 34.0 | 55.2 |
| VSDN | 86.5 | 18.6 | 51.1 |
| VSDN (pre-trained) | 85.5 | 36.9 | 60.1 |

Table 2: Accuracies of models trained on IIIT5K-2000 and tested on IIIT5K-3000.

3.3 Fusion Block

Both the visual information contained in the visual feature $v_t$ and the semantic information contained in the semantic feature $s_t$ are important for our model to make a precise prediction, so we combine $v_t$ with $s_t$ by concatenating them and then apply a linear function followed by a softmax to predict the current-step symbol:

$$p(\hat{y}_t) = \mathrm{softmax}(W_o [v_t; s_t] + b_o)$$
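A minimal sketch of this fusion step: concatenate the two feature vectors and apply a linear layer plus softmax. The weights, dimensions and class count below are illustrative.

```python
import math

def fuse(v_t, s_t, W, b):
    """Concatenate visual + semantic features, then linear layer + softmax."""
    x = v_t + s_t                                  # concatenation [v_t; s_t]
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)
    es = [math.exp(l - m) for l in logits]
    z = sum(es)
    return [e / z for e in es]

# 2-dim visual + 2-dim semantic features, 3 classes.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
p = fuse([2.0, 0.0], [0.0, 0.0], W, [0.0, 0.0, 0.0])
assert abs(sum(p) - 1.0) < 1e-9 and p.index(max(p)) == 0
```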
3.4 Loss Functions

The overall loss function consists of four parts:

$$L = \lambda_1 L_{ctc} + \lambda_2 L_{vis} + \lambda_3 L_{sem} + \lambda_4 L_{fuse}$$

where $L_{ctc}$ is the CTC loss function, and $L_{vis}$, $L_{sem}$, $L_{fuse}$ are the cross-entropy loss functions on the visual decoder's output, the semantic decoder's output, and the final output respectively. The labels are all the same. $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are hyper-parameters to control the trade-off among the four terms. In our experiments, $\lambda_1$, $\lambda_2$, $\lambda_3$ are set to 1.0 and $\lambda_4$ is set to 0.2.

4 Experiments

In this section, we first conduct experiments to validate the effectiveness of our method in alleviating the vocabulary reliance problem. Next, we compare it with previous state-of-the-art methods on several public benchmark datasets. Lastly, we conduct ablation studies to show the effectiveness of several components.

4.1 Datasets

Synth90K [10] and SynthText [7] are popular training datasets, containing 9 million and 8 million synthetic text line images respectively. IIIT5K-Words (IIIT5K) [16] consists of 2000 training images and 3000 testing images. Besides, Street View Text (SVT) [25] (647), ICDAR2013 (IC13) [12] (1015), ICDAR2015 (IC15) [11] (1811), SVT-Perspective (SVTP) [17] (645) and CUTE80 (CUTE) [19] (288) are common benchmark datasets for model evaluation.

4.2 Implementation Details

The number of classes to be recognized is 39, including 26 lower-case letters, 10 digits and 3 special symbols: end of sequence (EoS), unknown (UKN) and padding (PAD). For a fair comparison, we use the 2 synthetic datasets and their augmented versions released by SRN [29] as our training data. We choose Adadelta [30] as the optimizer and train for 6 epochs with a learning rate of 1.0, decayed to 0.1 and 0.01 at the 4th and 5th epoch respectively. The batch size is set to 1024 and all experiments run on two GeForce GTX 1080 Ti graphics cards.

4.3 Alleviating Vocabulary Reliance

To validate the effectiveness of VSDN trained on datasets with small vocabulary and few image samples, we construct several sub-datasets of Synth90K as training data, i.e., Synth9K, Synth18K and Synth45K, by randomly choosing 10%, 20% and 50% of the vocabulary of Synth90K. As in Synth90K, every word has 100 image samples.

We adopt the strong and popular model Aster [21] as the baseline, together with Aster without the implicit language model (denoted Aster (w/o LM), obtained by simply removing the previous-character embedding from its decoder), to show the double-edged sword effect directly. VSDN (pre-trained) uses pre-trained parameters for the semantic module to further demonstrate the effectiveness of the pre-training task.

Figure 4: Some examples for comparison between VSDN and Aster
Figure 5: Examples to show the predictions from different components

For each group of experiments, we first fix the training vocabulary and then split the test set into 2 parts: samples whose texts are in the vocabulary and samples whose texts are out of it. The respective and total accuracies are reported to measure the performance.
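This evaluation protocol can be sketched as follows: bucket each test sample by whether its ground-truth word appears in the training vocabulary, then report in-vocabulary, out-of-vocabulary and total accuracy. The data below is made up for illustration.

```python
def vocab_split_accuracy(preds, gts, train_vocab):
    """Accuracy split by ground-truth membership in the training vocabulary."""
    buckets = {"InVoc": [0, 0], "OutVoc": [0, 0], "Total": [0, 0]}
    for p, g in zip(preds, gts):
        for k in ("Total", "InVoc" if g in train_vocab else "OutVoc"):
            buckets[k][1] += 1                 # sample count
            buckets[k][0] += int(p == g)       # correct count
    return {k: c / n if n else 0.0 for k, (c, n) in buckets.items()}

acc = vocab_split_accuracy(
    preds=["cat", "dog", "house"],
    gts=["cat", "dot", "house"],
    train_vocab={"cat", "dot"},
)
print(acc)  # InVoc: 0.5, OutVoc: 1.0, Total: 2/3
```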

The results are shown in Table 1. After removing the implicit language model, Aster (w/o LM) performs better than Aster when the training vocabulary is insufficient, which means the coupled semantic information is harmful in this regime: it increases the bias towards the words of the training vocabulary.

Our VSDN without pre-training sees the same training vocabulary as Aster but performs better, which indicates that the decoupling structure is less affected by the small training vocabulary. Moreover, VSDN with the semantic module pre-trained (VSDN (pre-trained)) achieves the best accuracy. The simple language-modeling task largely enhances our model's capability of constructing semantic information and makes it more robust to an insufficient training vocabulary.

We also conduct experiments on the real dataset IIIT5K using only its 2000 real training samples. As shown in Table 2, the conclusion is the same.

| Method | IIIT5K | SVT | IC13 | IC15 | SVTP | CUTE |
|---|---|---|---|---|---|---|
| CNN [10] | - | 80.7 | 90.8 | - | - | - |
| CRNN [20] | 81.2 | 82.7 | 89.6 | - | - | - |
| RRN [13] | 78.4 | 80.7 | 90.0 | - | - | - |
| FAN [3] | 87.4 | 85.9 | 93.3 | 70.6 | - | - |
| AON [4] | 87.0 | 82.8 | - | 68.2 | 73.0 | 76.8 |
| ACE [27] | 82.3 | 82.6 | 89.7 | 68.9 | 70.1 | 82.6 |
| FCN [15] | 91.9 | 86.4 | 91.3 | - | - | - |
| ScRN [28] | 94.4 | 88.9 | 93.9 | 78.7 | 80.8 | 87.5 |
| SAR [14] | 91.5 | 84.5 | 91.0 | 69.2 | 76.4 | 83.3 |
| ESIR [31] | 93.3 | 90.2 | 91.3 | 76.9 | 79.6 | 83.3 |
| TextScanner [23] | 93.9 | 90.1 | 92.9 | 79.4 | 84.3 | 83.3 |
| DAN [26] | 94.3 | 89.2 | 93.9 | 74.5 | 80.0 | 84.4 |
| LAL [32] | 95.0 | 89.8 | 95.1 | 79.0 | 82.9 | 87.8 |
| SEED [18] | 93.8 | 89.6 | 92.8 | 80.0 | 81.4 | 83.6 |
| SRN [29] | 94.8 | 91.5 | 95.5 | 82.7 | 85.1 | 87.8 |
| Aster (Baseline) [21] | 93.4 | 89.5 | 91.8 | 76.1 | 78.5 | 79.5 |
| VSDN (Ours) | 94.4 | 92.3 | 93.5 | 84.5 | 85.3 | 85.1 |

Table 3: Comparisons on public benchmarks (accuracy, %).

4.4 Comparison with State-of-the-art

We compare our VSDN with previous state-of-the-art methods in Table 3. VSDN achieves the best results on SVT, SVTP and IC15 and competitive results on the remaining datasets. Note that LAL [32] uses additional curved synthetic text images for training, so a direct comparison is not entirely fair.

| Component | IIIT5K | SVT | IC13 | IC15 | SVTP |
|---|---|---|---|---|---|
| VSDN (CTC) | 92.1 | 89.0 | 90.9 | 80.6 | 78.3 |
| VSDN (VD) | 93.9 | 91.8 | 92.9 | 83.8 | 84.2 |
| VSDN (SD, top-1) | 84.7 | 90.4 | 91.2 | 75.8 | 81.6 |
| VSDN (SD, top-3) | 90.3 | 92.4 | 93.7 | 82.4 | 85.4 |
| VSDN (SD, top-5) | 92.4 | 94.0 | 95.2 | 85.8 | 88.1 |
| VSDN (Final) | 94.4 | 92.3 | 93.5 | 84.5 | 85.3 |

Table 4: Performances of the different components of VSDN.

On the regular datasets, compared with our baseline Aster [21], VSDN improves accuracy by 1.0% on IIIT5K (from 93.4% to 94.4%), 2.8% on SVT (from 89.5% to 92.3%) and 1.7% on IC13 (from 91.8% to 93.5%).

VSDN also performs well on the irregular datasets, improving by 8.4% on IC15 (from 76.1% to 84.5%), 6.8% on SVTP (from 78.5% to 85.3%) and 5.6% on CUTE (from 79.5% to 85.1%) compared with Aster [21].

Although many images in SVT are severely corrupted by noise, blur and low resolution, and many images in IC15 suffer from heavy perspective distortion, our method still largely outperforms previous methods on these two datasets. We believe this is because our VSDN constructs accurate semantic information, which is robust to visual defects. Examples are shown in Figure 4.

4.5 Component Analysis

To clarify how the visual and semantic features contribute to the final recognition, we also evaluate the performance of each component that is supervised during training. In particular, for the prediction of the semantic decoder, we adopt top-$k$ accuracy, since the word correction task usually has more than one valid answer (e.g., big vs. bug vs. bag).
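The top-$k$ metric can be sketched as follows: a sample counts as correct if the ground-truth word appears among the $k$ highest-ranked outputs. The candidate lists below are made up for illustration.

```python
def top_k_accuracy(ranked_predictions, gts, k):
    """ranked_predictions: per-sample lists of candidate words, best first."""
    hits = sum(gt in cands[:k] for cands, gt in zip(ranked_predictions, gts))
    return hits / len(gts)

# Two samples; ground truth "bug" is never ranked first but always in the top 3.
ranked = [["big", "bug", "bag"], ["bag", "big", "bug"]]
gts = ["bug", "bug"]
assert top_k_accuracy(ranked, gts, 1) == 0.0
assert top_k_accuracy(ranked, gts, 3) == 1.0
```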

As shown in Table 4, the CTC decoder has a relatively low word accuracy, but it is sufficient for the semantic module because its character error rate is adequately low. Thanks to the word correction ability, the semantic decoder surpasses the CTC decoder in top-$k$ accuracy, which shows that our VSDN is capable of semantic modeling. The final performance is better than that of the pure visual decoder, showing that the feature fusion is necessary. Note that the high accuracy of the visual decoder is not owed to visual learning alone, but also to the joint semantic learning and character feature alignment (see Section 4.6.2). Figure 5 shows some cases that illustrate the above analysis; the results of the semantic decoder are presented as the top-5 predictions.

4.6 Ablation Study

4.6.1 Effectiveness of Visual Loss and Semantic Loss

In the training stage, $L_{vis}$ and $L_{sem}$ supervise the outputs of the visual decoder and the semantic decoder respectively (detailed in Section 3.4). We conduct experiments to evaluate their effectiveness. As shown in Table 5, both loss terms matter: removing either leads to a drop in accuracy, which indicates that it is essential for both the visual and semantic parts to be supervised with labels.

Figure 6: Comparison of attention map by different queries.
| Method | SVT | IC13 | SVTP |
|---|---|---|---|
| VSDN w/o $L_{vis}$ & $L_{sem}$ | 90.6 | 92.7 | 82.9 |
| VSDN w/o $L_{sem}$ | 91.5 | 92.3 | 83.1 |
| VSDN w/o $L_{vis}$ | 91.5 | 93.0 | 83.3 |
| VSDN (full) | 92.3 | 93.5 | 85.3 |

Table 5: The effect of the visual and semantic losses on the training of VSDN.
| Query | SVT | IC15 | CUTE |
|---|---|---|---|
| visual hidden state $v_{t-1}$ | 90.1 | 81.8 | 83.2 |
| semantic hidden state $s_t$ | 92.3 | 84.5 | 85.1 |

Table 6: Comparison of different queries in the visual decoder.

4.6.2 Query in Character Feature Alignment

In the attention-based character feature alignment, the semantic feature $s_t$ of the current step, computed by the semantic decoder, is used as the query. In Aster [21], the previous hidden state $v_{t-1}$ of the decoder serves as the query instead.

To explore the difference between these two choices, we conduct experiments with each hidden state as the query; the quantitative results are shown in Table 6. Compared with the visual hidden state $v_{t-1}$, taking the semantic hidden state $s_t$ as the query achieves better performance. We argue that $v_{t-1}$ is prone to be biased by the limited and noisy training vocabulary, since its semantic content is implicitly modeled in a coupled way, and accurate alignment needs an accurate semantic feature. Figure 6 shows some visualization examples.

5 Conclusion

In this paper, we propose a novel Visual-Semantic Decoupling Network to learn character-level visual and semantic features independently. Taking advantage of the decoupling, we can pre-train the semantic module on inexpensive text data via a word correction task. Our method achieves state-of-the-art or competitive results on several public benchmarks, and shows great superiority over the baseline when the training data has only a small number of samples and a small vocabulary, which alleviates the problem of vocabulary reliance.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.
  • [2] X. Chen, T. Wang, Y. Zhu, L. Jin, and C. Luo (2020) Adaptive embedding gate for attention-based scene text recognition. Neurocomputing 381, pp. 261–271. Cited by: §1, §2.
  • [3] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In ICCV, pp. 5086–5094. Cited by: Table 3.
  • [4] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) AON: towards arbitrarily-oriented text recognition. In CVPR, pp. 5571–5579. Cited by: §2, Table 3.
  • [5] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, Vol. 148, pp. 369–376. Cited by: §2, §3.
  • [6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber (2009) A novel connectionist system for unconstrained handwriting recognition. TPAMI 31 (5), pp. 855–868. Cited by: §3.
  • [7] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In CVPR, pp. 2315–2324. Cited by: §4.1.
  • [8] T. Hori, S. Watanabe, Y. Zhang, and W. Chan (2017) Advances in joint ctc-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In INTERSPEECH, pp. 949–953. Cited by: §2.
  • [9] W. Hu, X. Cai, J. Hou, S. Yi, and Z. Lin (2020) GTC: guided training of CTC towards efficient and accurate scene text recognition. In AAAI, pp. 11005–11012. Cited by: §2.
  • [10] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. IJCV 116 (1), pp. 1–20. Cited by: §2, §3.2.1, §4.1, Table 3.
  • [11] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160. Cited by: §4.1.
  • [12] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazán, and L. de las Heras (2013) ICDAR 2013 robust reading competition. In ICDAR, pp. 1484–1493. Cited by: §4.1.
  • [13] C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pp. 2231–2239. Cited by: §2, Table 3.
  • [14] H. Li, P. Wang, C. Shen, and G. Zhang (2019) Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI, pp. 8610–8617. Cited by: Table 3.
  • [15] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai (2019) Scene text recognition from two-dimensional perspective. In AAAI, pp. 8714–8721. Cited by: §2, Table 3.
  • [16] A. Mishra, K. Alahari, and C. V. Jawahar (2012) Scene text recognition using higher order language priors. In BMVC, pp. 1–11. Cited by: §4.1.
  • [17] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan (2013) Recognizing text with perspective distortion in natural scenes. In ICCV, pp. 569–576. Cited by: §4.1.
  • [18] Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, and W. Wang (2020) SEED: semantics enhanced encoder-decoder framework for scene text recognition. In CVPR, pp. 13525–13534. Cited by: §1, §2, Table 3.
  • [19] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: §4.1.
  • [20] B. Shi, X. Bai, and C. Yao (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI 39 (11), pp. 2298–2304. Cited by: §2, Table 3.
  • [21] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2019) ASTER: an attentional scene text recognizer with flexible rectification. TPAMI 41 (9), pp. 2035–2048. Cited by: Figure 1, §1, §2, §3, §4.3, §4.4, §4.4, §4.6.2, Table 3.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §2.
  • [23] Z. Wan, M. He, H. Chen, X. Bai, and C. Yao (2020) TextScanner: reading characters in order for robust scene text recognition. In AAAI, pp. 12120–12127. Cited by: §2, Table 3.
  • [24] Z. Wan, J. Zhang, L. Zhang, J. Luo, and C. Yao (2020) On vocabulary reliance in scene text recognition. In CVPR, pp. 11422–11431. Cited by: §1.
  • [25] K. Wang, B. Babenko, and S. J. Belongie (2011) End-to-end scene text recognition. In ICCV, pp. 1457–1464. Cited by: §4.1.
  • [26] T. Wang, Y. Zhu, L. Jin, C. Luo, X. Chen, Y. Wu, Q. Wang, and M. Cai (2020) Decoupled attention network for text recognition. In the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12216–12224. Cited by: Table 3.
  • [27] Z. Xie, Y. Huang, Y. Zhu, L. Jin, Y. Liu, and L. Xie (2019) Aggregation cross-entropy for sequence recognition. In CVPR, pp. 6538–6547. Cited by: Table 3.
  • [28] M. Yang, Y. Guan, M. Liao, X. He, K. Bian, S. Bai, C. Yao, and X. Bai (2019) Symmetry-constrained rectification network for scene text recognition. In ICCV, pp. 9146–9155. Cited by: §2, Table 3.
  • [29] D. Yu, X. Li, C. Zhang, T. Liu, J. Han, J. Liu, and E. Ding (2020) Towards accurate scene text recognition with semantic reasoning networks. In CVPR, pp. 12110–12119. Cited by: §1, §2, §4.2, Table 3.
  • [30] M. D. Zeiler (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.2.
  • [31] F. Zhan and S. Lu (2019) ESIR: end-to-end scene text recognition via iterative image rectification. In CVPR, pp. 2059–2068. Cited by: §2, Table 3.
  • [32] Y. Zheng, W. Qin, D. Wijaya, and M. Betke (2020) LAL: linguistically aware learning for scene text recognition. In ACM MM, pp. 4051–4059. Cited by: §1, §4.4, Table 3.
  • [33] L. Zuo, H. Sun, Q. Mao, R. Qi, and R. Jia (2019) Natural scene text recognition based on encoder-decoder framework. IEEE Access 7, pp. 62616–62623. Cited by: §2.