Scene Text Recognition with Temporal Convolutional Encoder

by   Xiangcheng Du, et al.

Texts from scene images typically consist of several characters and exhibit a characteristic sequence structure. Existing methods capture this structure with sequence-to-sequence models, in which an encoder extracts visual representations and a decoder translates the features into the label sequence. In this paper, we study a text recognition framework that considers the long-term temporal dependencies in the encoder stage. We demonstrate that the proposed Temporal Convolutional Encoder, with its increased sequential extents, improves the accuracy of text recognition. We also study the impact of different attention modules in convolutional blocks for learning accurate text representations. We conduct comparisons on seven datasets, and the experiments demonstrate the effectiveness of the proposed approach.




1 Introduction

Scene text recognition is an essential task in computer vision research. Text carries rich semantic information and is important for understanding scene images. The text recognition task has various applications, such as image-based file retrieval, product recognition, information search, and intelligent inspection.

Although text recognition has been studied for years and many approaches with promising results have been proposed, recognizing text from scene images remains a challenging research problem. Recent deep learning based approaches treat text recognition as a sequence labeling problem, designing an encoder to extract the visual representations and a decoder to translate the features into the label sequence. The construction of a good encoder to represent text is of fundamental importance in building a robust text recognition system. Many previous works rely on a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) to produce the representations for text sequences (e.g., in [1]).

In contrast to previous works that use an RNN for sequential modeling, our framework employs convolutional layers to capture the sequential dependencies. In this paper, we propose a new text representation called the Temporal Convolutional Encoder (TCE), which is built upon an attention-based feature extractor and temporal convolutions. The use of long-term temporal dependencies makes TCE more discriminative than existing works, and the convolution-based design ensures efficiency through parallel operations. Since the accuracy of text recognition depends heavily on the abstract textual features, we also improve the CNN-based feature extractor through a novel attention mechanism.

Figure 1: The network architecture.

To summarize, the highlights of this paper are:

  • A novel Temporal Convolutional Encoder in the text recognition model by operating on the temporal resolution of the text sequences;

  • Strategies for the refinement of attention mechanisms with the channel and spatial context to improve the performance of scene text recognition;

  • Extensive evaluation on seven scene text datasets with very competitive results of the proposed encoder.

1.1 Related Work

Numerous efforts have been devoted to the design of effective text representations. [2, 3] surveyed the recent advances in scene text recognition approaches. In this section, we mainly discuss literature on deep learning based approaches, which are more related to this work.

Shi et al. [1] proposed an end-to-end model that captures sequence feature representations by combining a CNN and an RNN; the CTC loss [4] was then applied to the network outputs to calculate the conditional probability between the predicted and the target sequences. The CTC loss was also used in [5, 6] with Gated RCNN (GRCNN) and maxout CNN. As sequence-to-sequence models, most recognition models consist of an encoder and a decoder, and the attention mechanism is incorporated into the decoder. For example, Lee et al. [7] proposed an attention-based decoder for text-output prediction, while Cheng et al. [8] presented the Focusing Attention Network (FAN) to tackle the attention drift problem and improve the performance of regular text recognition. In addition, some previous works handle irregular scene text images at the beginning of the encoder. Cheng et al. [9] developed the arbitrary orientation network (AON) to extract features from irregular images, and [10, 11, 12] aimed to correct images with different rectification approaches.

In this paper, the proposed approach also targets the design of the encoder. However, instead of using a sophisticated rectifier, we adopt a spatial transformer network [13] to correct the image. Our focus is to construct an efficient contextual representation of the text sequences with CNNs. Probably the work most related to ours is [14], where a fully convolutional network (FCN) was trained as the encoder. Our approach differs from [14] in its design, since we use a temporal convolutional network after the attention module layers. Moreover, as will be shown in the experiments, the proposed temporal convolutional encoder outperforms the FCN-based method on various benchmarks.

2 Method

This section introduces the proposed framework in three parts. We start with the design of the framework, followed by the details of the temporal convolutions and the attention module in our encoder.

2.1 Architecture

We illustrate the overall architecture of our framework in Fig. 1. It is composed of two parts, i.e., the encoder and the decoder. The goal of the encoder is to extract rich and discriminative visual features to represent the text regions. Classic encoders such as [1] employ a two-stage model: a convolutional neural network to extract the sequential feature representations from input images, and a recurrent model built upon the convolutional layers to capture contextual information. In the first stage, we observe that using attention mechanisms improves the representation ability, and we thus use a feature extractor based on ResNet [15] and attention blocks. To avoid the expensive computation and vanishing gradients of the recurrent model, we design the encoder with temporal convolutions to efficiently capture the long-term sequence dependencies. We describe these components in detail in the next subsections. The decoder aims to estimate and output the text sequences from the extracted features of the text regions. While we use the attention-based prediction [16] in this paper, any decoding model will suffice (e.g., CTC [4]).
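
As an illustration of the CTC alternative mentioned above, the following minimal sketch shows greedy CTC decoding: take the best class per frame, merge consecutive repeats, and drop blanks. The function name, the choice of index 0 for the blank class, and the toy character set are assumptions made for illustration and are not taken from this paper.

```python
BLANK = 0  # assumed index of the CTC blank class


def ctc_greedy_decode(frame_scores, charset):
    """Greedy CTC decoding over per-frame class scores.

    frame_scores: list of per-frame score lists (one score per class).
    charset: string such that class index i (i >= 1) maps to charset[i - 1].
    """
    # Pick the best-scoring class for every frame.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]

    decoded = []
    previous = None
    for idx in best_path:
        # Keep a class only if it differs from the previous frame and is not blank.
        if idx != previous and idx != BLANK:
            decoded.append(charset[idx - 1])
        previous = idx
    return "".join(decoded)


# Toy example with classes [blank, 'a', 'b'].
scores = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.7, 0.2],    # 'a' again, merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # 'a', a new character because a blank came before
    [0.1, 0.2, 0.7],    # 'b'
]
print(ctc_greedy_decode(scores, "ab"))  # aab
```

The blank between the second and fourth frames is what allows the repeated character to survive the merge step.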

2.2 Temporal Convolutions

Layer    Kernel size   Channels   Dilation rate
layer1   3             256        1
layer2   3             256        2
layer3   3             256        4
layer4   3             256        8
Table 1: Settings of the temporal convolution layers, giving the kernel size, number of channels, and dilation rate of each layer. The dropout rate is 0.3 during model training.

Sequence modeling aims to model the dependencies between feature sequences and is a key step in recognizing text. Instead of recurrent neural networks, we use temporal convolutions to establish the relationship between feature sequences in this work. Let $x = (x_1, \dots, x_T)$ be the sequential representations generated by the extractor. To incorporate the causal relationship with the historical information, a causal convolution with a one-dimensional convolutional kernel $f$ is applied at the $t$-th element of the features,

$$F(x_t) = (x * f)(t) = \sum_{i=0}^{n-1} f(i) \, x_{t-i},$$

where $n$ is the size of $f$. The temporal extent that an ordinary causal convolution can model is limited by the kernel size. Capturing long-term dependencies requires more network layers or a larger kernel size. However, directly increasing these values results in high computational cost.

In this paper, we use dilated convolution to increase temporal extents. The dilated operation is defined as follows,

$$F_d(x_t) = (x *_d f)(t) = \sum_{i=0}^{n-1} f(i) \, x_{t - d \cdot i},$$

where $d$ is the expansion parameter of the dilated convolution. The dilated convolution allows for interval sampling of the input, such that, with the dilation rate doubling at each layer as in Table 1, the size of the receptive field grows exponentially with the number of layers. Overall, a temporal convolutional network with dilation can obtain a large receptive field with fewer layers. A diagram of our network design is shown in the middle part of Fig. 1.
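
To make the interval sampling concrete, here is a minimal pure-Python sketch of a dilated causal convolution over a sequence of scalars. It assumes the standard formulation in which the output at position t is the sum of kernel[i] * x[t - dilation * i], with positions before the start of the sequence treated as zeros. The function name and the scalar (single-channel) features are simplifications for illustration; the actual features are multi-channel.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """1-D causal convolution with dilation over a sequence of scalars.

    Output y[t] = sum_i kernel[i] * x[t - dilation * i]; positions before
    the start of the sequence are treated as zeros (left padding).
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for i, w in enumerate(kernel):
            j = t - dilation * i
            if j >= 0:  # positions before the sequence start contribute zero
                total += w * x[j]
        y.append(total)
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0, 1.0]  # kernel size 3, all ones for readability

# dilation 1: each output sums x[t], x[t-1], x[t-2]
print(dilated_causal_conv1d(x, kernel, dilation=1))  # [1.0, 3.0, 6.0, 9.0, 12.0, 15.0]
# dilation 2: each output sums x[t], x[t-2], x[t-4]
print(dilated_causal_conv1d(x, kernel, dilation=2))  # [1.0, 2.0, 4.0, 6.0, 9.0, 12.0]
```

With the same three kernel weights, the dilated version reaches back four positions instead of two, and no output ever depends on a later input.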

Although dilated convolution reduces the number of layers in the network, several layers are still required to obtain a complete receptive field. Moreover, when the channel feature information is passed between network layers, the vanishing gradient problem tends to occur. To this end, we adopt residual connections to convey feature information between network layers in sequence modeling. The network settings of our sequence model based on temporal convolutions are listed in Table 1.
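
As a rough check of how far such a stack can see, the receptive field can be computed directly from the kernel sizes and dilation rates. The sketch below assumes one convolution per layer (a residual block with two convolutions per layer would give a larger value); the helper name is hypothetical. Residual connections do not change the receptive field, so they are omitted here.

```python
def receptive_field(layers):
    """Receptive field of a stack of 1-D dilated convolutions.

    layers: list of (kernel_size, dilation) pairs, one convolution per layer.
    Each layer widens the field by (kernel_size - 1) * dilation positions.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Kernel size 3 with dilation rates 1, 2, 4, 8, as in Table 1.
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]
# Same depth and kernel size without dilation, for comparison.
plain = [(3, 1)] * 4

print(receptive_field(dilated))  # 31
print(receptive_field(plain))    # 9
```

Under these assumptions, four dilated layers cover 31 sequence positions, whereas four ordinary layers with the same kernel size cover only 9.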


2.3 Attention-based Feature Extractor

Figure 2: Pipeline of attention mechanism.

Our baseline feature extractor employs the backbone of ResNet-18 [15] for its efficiency and representation ability. Inspired by recent progress such as [17], we explore variations of attention modules to enhance the discriminative power of the convolution layers. Fig. 2 illustrates the configuration of attention for the convolution layers. We introduce the modules as follows.

Channel attention. Suppose we have the feature map $F$ with a size of $H \times W$ and $C$ channels after the convolutional layer. For channel attention, the feature map is first pooled along the spatial dimensions. Average pooling and max pooling are used to obtain two one-dimensional vectors, i.e., $F^c_{avg}$ and $F^c_{max}$. The vectors then go through the fully connected (FC) layer, and we obtain the channel-wise attention,

$$M_c = \sigma\big(\mathrm{FC}(F^c_{avg}) + \mathrm{FC}(F^c_{max})\big),$$

where $\sigma$ is the sigmoid function. By multiplying with the original feature map, the transformed map is computed by $F' = M_c \otimes F$.

Spatial attention. Channel attention focuses on the relationship between channel features but ignores the spatial information of the features. Thus, we adopt spatial attention to make up for this deficiency. Max pooling and average pooling along the channel dimension are employed to obtain the spatial feature maps $F^s_{max}$ and $F^s_{avg}$. A convolution operation following the concatenation of the spatial feature maps is designed for the spatial attention, i.e.,

$$M_s = \sigma\big(\mathrm{Conv}([F^s_{avg}; F^s_{max}])\big).$$

Similar to the channel attention, the result of the spatial attention is calculated as $F'' = M_s \otimes F'$. To avoid vanishing gradients with the attention modules, we also introduce the residual structure, and the final transformed feature map is defined as $\hat{F} = F + F''$.
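
The two attention steps and the residual connection can be sketched in pure Python on a tiny feature map. Several details below are assumptions rather than facts from the text: the two FC outputs are summed and passed through a sigmoid, a single shared FC layer is used, channel attention is applied before spatial attention, and the spatial convolution is reduced to a 1 x 1 kernel (a weighted sum of the two pooled maps). All function and parameter names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, fc_weights):
    """Scale each channel by a gate computed from its pooled statistics.

    feat: list of C channels, each an H x W nested list.
    fc_weights: C x C matrix of one shared fully connected layer (assumed).
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]

    def fc(vec):
        return [sum(w * v for w, v in zip(row, vec)) for row in fc_weights]

    # Assumption: the FC outputs are summed, then squashed with a sigmoid.
    gates = [sigmoid(a + m) for a, m in zip(fc(avg), fc(mx))]
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feat, gates)]


def spatial_attention(feat, w_avg, w_max, bias=0.0):
    """Scale each spatial position by a gate computed from channel pooling.

    Assumption: a 1 x 1 convolution over the two pooled maps, i.e. a
    weighted sum with the scalars w_avg and w_max.
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for r in range(height):
        for c in range(width):
            column = [feat[k][r][c] for k in range(channels)]
            pooled = w_avg * sum(column) / channels + w_max * max(column)
            gate = sigmoid(pooled + bias)
            for k in range(channels):
                out[k][r][c] = gate * feat[k][r][c]
    return out


def attention_block(feat, fc_weights, w_avg, w_max):
    """Channel attention, then spatial attention, plus a residual connection."""
    refined = spatial_attention(channel_attention(feat, fc_weights), w_avg, w_max)
    return [
        [[a + b for a, b in zip(row_in, row_out)]
         for row_in, row_out in zip(ch_in, ch_out)]
        for ch_in, ch_out in zip(feat, refined)
    ]


# Toy input: 2 channels, 2 x 2 spatial size, identity FC weights.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
out = attention_block(feat, fc_weights=[[1.0, 0.0], [0.0, 1.0]], w_avg=1.0, w_max=1.0)
print(len(out), len(out[0]), len(out[0][0]))  # shape is preserved: 2 2 2
```

Because both gates lie strictly between 0 and 1, the residual output for a non-negative input stays between the input and twice the input, so the block reweights features without discarding the original signal.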

Dataset   Type     Size
SVT       reg.     647
ICDAR03   reg.     867
ICDAR13   reg.     1015
IIIT      reg.     3000
CUTE80    irreg.   288
SVTP      irreg.   645
ICDAR15   irreg.   1811
Table 2: Statistics of the datasets. Two types are included: regular datasets (reg.) contain text images with horizontal text regions, while irregular datasets (irreg.) also contain distorted regions such as curved or rotated texts. Size indicates the number of test images in each dataset.
Method             Year   SVT    ICDAR03   ICDAR13   IIIT   CUTE80   SVTP   ICDAR15
CRNN [1]           2016   82.7   91.9      89.6      81.2   -        -      -
GRCNN [5]          2017   81.5   91.2      -         80.8   -        -      -
FAN [8]            2017   85.9   94.2      93.3      87.4   -        -      70.6
AON [9]            2018   82.8   91.5      -         87.0   76.8     73     -
EP [18]            2018   87.5   94.6      94.4      88.3   -        -      73.9
ASTER [10]         2018   93.4   94.5      91.8      93.4   79.5     78.5   76.1
FCN [14]           2019   82.7   89.2      88.0      81.8   -        -      62.3
MORAN [11]         2019   88.3   95.0      92.4      91.2   77.4     76.1   68.8
Baek et al. [16]   2019   87.5   94.4      92.3      87.9   74.0     79.2   77.6
SAR [19]           2019   84.5   -         91.0      91.5   83.3     76.4   69.2
ESIR [12]          2019   90.2   -         91.3      93.3   83.3     79.6   76.9
TCE [ours]                89.0   95.4      93.8      88.6   72.9     81.9   79.9
Table 3: Comparison with the state-of-the-arts on text recognition benchmarks. Bold text denotes the top result, while underlined text corresponds to the second runner-up.

3 Experiments

In this section, we will introduce the experimental setup and evaluate the performance of the proposed network.

3.1 Configuration

Datasets. We evaluate the proposed text recognition network on seven standard datasets: Street View Text (SVT) [20], SVT Perspective (SVTP) [21], IIIT [22], CUTE80 [23], ICDAR03 [24], ICDAR13 [25], and ICDAR15 [26]. The details of the datasets are summarized in Table 2. Following [16], our model is trained with the synthetic images from MJSynth [27] and SynthText [28], and the combination of training images from ICDAR13, ICDAR15, IIIT, and SVT is used as a validation set. We thus obtain a single network and apply it to the test sets of all the benchmarks. Throughout the experiments, we do not employ a lexicon, and the mean accuracy is used to evaluate the performance.
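
For reference, lexicon-free accuracy is usually computed as the percentage of test images whose predicted string matches the label exactly. The following minimal sketch illustrates this; the case-insensitive comparison on letters and digits only is an assumed protocol that is common in this literature but is not stated in this paper, and the function name is hypothetical.

```python
def word_accuracy(predictions, ground_truths):
    """Percentage of images whose predicted string matches the label exactly."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        return 0.0

    def normalize(text):
        # Assumed protocol: compare case-insensitively on letters and digits only.
        return "".join(ch for ch in text.lower() if ch.isalnum())

    correct = sum(
        normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths)
    )
    return 100.0 * correct / len(ground_truths)


preds = ["hello", "W0RLD", "text", "scene"]
labels = ["Hello", "world", "text", "scenes"]
print(word_accuracy(preds, labels))  # 50.0
```

A prediction counts only if the whole word is right, so a single wrong or missing character makes the sample incorrect.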

Baselines. We compare our temporal convolutional encoder with several baselines and state-of-the-art approaches. Among them, the first group contains several well-known recognition networks, including CRNN [1] and GRCNN [5]. We then compare ours with previous attention aware approaches such as FAN [8], FCN [14], and Baek et al. [16]. Other state-of-the-art approaches, e.g., ASTER [10], MORAN [11], and ESIR [12], are also included in the comparison.

Implementation details. Our recognition model is implemented using PyTorch [29]. The spatial transformer network [13] is used before the encoder to rectify the scene text image. To train the network, we first resize the training images to a fixed size and normalize the pixel values to the range of (-1, 1). The layers of the network are initialized using a Gaussian distribution, and the AdaDelta optimizer is used with a decay rate of 0.95. The training process converges after 250k iterations.
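
The pixel normalization and the optimizer update can be sketched in plain Python. The 8-bit input range, the mean and scale of 0.5, the epsilon of 1e-6, and the reading of the "decay rate" as AdaDelta's running-average coefficient are assumptions; in practice the library's own optimizer would be used rather than this scalar version.

```python
import math


def normalize_pixels(pixels):
    """Map 8-bit intensities in [0, 255] to roughly [-1, 1]."""
    return [(p / 255.0 - 0.5) / 0.5 for p in pixels]


def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single scalar parameter.

    state holds the running averages of squared gradients ("sq_grad")
    and squared updates ("sq_delta"); rho is the decay rate.
    """
    state["sq_grad"] = rho * state["sq_grad"] + (1.0 - rho) * grad * grad
    step_scale = math.sqrt(state["sq_delta"] + eps) / math.sqrt(state["sq_grad"] + eps)
    delta = -step_scale * grad
    state["sq_delta"] = rho * state["sq_delta"] + (1.0 - rho) * delta * delta
    return param + delta


print(normalize_pixels([0, 255]))  # [-1.0, 1.0]

# Minimize f(w) = w^2 (gradient 2w) as a toy check that the update descends.
w, state = 1.0, {"sq_grad": 0.0, "sq_delta": 0.0}
for _ in range(100):
    w = adadelta_step(w, 2.0 * w, state)
print(w < 1.0)  # True
```

AdaDelta needs no explicit learning rate here: the step size is the ratio of the two running averages, which is why only the decay rate appears as a hyperparameter.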

3.2 Results and Discussion

Comparison with the state of the art. Table 3 summarizes the recognition accuracies on all the benchmarks. The temporal convolutional encoder outperforms all the previous methods on ICDAR03, SVTP, and ICDAR15, and is comparable with the state of the art on the remaining datasets. It is worth mentioning that our proposed framework achieves better performance than the FCN-based method [14] on the benchmarks. We notice that the rectification approaches [10, 12] also achieve good performance on all the benchmarks. As these methods and ours focus on different stages of a text recognition system, we believe that the Temporal Convolutional Encoder is complementary to the rectification approaches, and we would like to examine this in the future.

Dataset   LSTM   TC
ICDAR03   94.6   95.4
ICDAR13   93.1   93.8
ICDAR15   78.1   79.9
Figure 3: Experiment results (a) and training losses at different iterations on the synthetic dataset (b) with encoder using LSTM (blue curve) and temporal convolutions (TC, red curve).

Setting    ICDAR03   ICDAR13   ICDAR15
Baseline   94.1      91.7      76.4
CA         95.3      92.9      78.0
SA         95.0      92.8      79.2
CA+SA      95.4      93.8      79.9
Table 4: The accuracy of using different attention modules. The baseline setting uses a ResNet-18 in the feature extraction stage. CA and SA are the channel attention and spatial attention, respectively.

Ablation study. Here we evaluate the design choices for constructing the encoder and report results on the ICDAR datasets. We first compare the sequence modeling approaches. Fig. 3(a) clearly shows that using temporal convolutions improves text recognition performance over the bidirectional LSTM, which is popular in the encoder part of previous work (e.g., in [10, 16]). We also observe that the proposed approach converges faster than the LSTM-based encoder network (see Fig. 3(b)).

We also explore the influence of different choices of attention module in the encoder. As mentioned earlier, channel attention or spatial attention can be added by placing it directly after a convolutional layer. In Table 4, we report the detailed results and find that adding attention boosts the accuracy. Moreover, using both channel and spatial attention leads to performance gains on all of the ICDAR datasets.

4 Conclusions

In this paper, we introduced a framework based on the Temporal Convolutional Encoder for scene text recognition. Text sequences were modeled with dilated convolutions to increase the temporal receptive field, resulting in a more efficient model. Channel and spatial attention mechanisms were also explored to refine the feature extractor. Experimental comparisons with the state-of-the-art approaches on seven standard datasets showed the effectiveness and efficiency of our proposed encoder for the text recognition task.


  • [1] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2016.
  • [2] Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
  • [3] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” arXiv preprint arXiv:1811.04256, 2018.
  • [4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning, 2006, pp. 369–376.
  • [5] J. Wang and X. Hu, “Gated recurrent convolution neural network for ocr,” in Advances in Neural Information Processing Systems, 2017, pp. 335–344.
  • [6] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang, “Reading scene text in deep convolutional sequences,” in AAAI Conference on Artificial Intelligence, 2016.
  • [7] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for ocr in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
  • [8] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, “Focusing attention: Towards accurate text recognition in natural images,” in International Conference on Computer Vision, 2017, pp. 5076–5084.
  • [9] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, “Aon: Towards arbitrarily-oriented text recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5571–5579.
  • [10] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [11] C. Luo, L. Jin, and Z. Sun, “Moran: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109 – 118, 2019.
  • [12] F. Zhan and S. Lu, “Esir: End-to-end scene text recognition via iterative image rectification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
  • [14] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu, “Reading scene text with fully convolutional sequence modeling,” Neurocomputing, vol. 339, pp. 161 – 170, 2019.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [16] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? dataset and model analysis,” in International Conference on Computer Vision, 2019.
  • [17] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
  • [18] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, “Edit probability for scene text recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [19] H. Li, P. Wang, C. Shen, and G. Zhang, “Show, attend and read: A simple and strong baseline for irregular text recognition,” in AAAI Conference on Artificial Intelligence, 2019.
  • [20] K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” in International Conference on Computer Vision, 2011, pp. 1457–1464.
  • [21] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan, “Recognizing text with perspective distortion in natural scenes,” in International Conference on Computer Vision, 2013, pp. 569–576.
  • [22] A. Mishra, K. Alahari, and C. V. Jawahar, “Scene text recognition using higher order language priors,” in British Machine Vision Conference, 2012.
  • [23] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust arbitrary text detection system for natural scene images,” Expert Systems with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
  • [24] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, “Icdar 2003 robust reading competitions,” in International Conference on Document Analysis and Recognition, 2003, pp. 682–687.
  • [25] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras, “Icdar 2013 robust reading competition,” in International Conference on Document Analysis and Recognition, 2013, pp. 1484–1493.
  • [26] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “Icdar 2015 competition on robust reading,” in International Conference on Document Analysis and Recognition, 2015, pp. 1156–1160.
  • [27] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
  • [28] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2315–2324.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017.