Focus-Enhanced Scene Text Recognition with Deformable Convolutions

by   Yanxiang Gong, et al.

Recently, scene text recognition methods based on deep learning have sprung up in the computer vision area. Existing methods achieve strong performance, but recognizing irregular text remains challenging because of its varied shapes and distorted patterns. Consider that when we read words in the real world, we normally do not rectify them in our minds but instead adjust our focus and visual field. Similarly, by using deformable convolutional layers, whose geometric structure is adjustable, we present an enhanced recognition network that handles irregular text without a rectification step. Experiments on public benchmarks demonstrate the effectiveness of the proposed components and show that our method achieves satisfactory performance. The code will be made publicly available soon.


1 Introduction

Text in scene images usually carries a large amount of semantic information, so text recognition plays an important role in computer vision. With the advent of deep learning, recognition methods have made great progress (Shi et al., 2017; Jaderberg et al., 2014b, a; Gupta et al., 2016). These techniques perform well on regular text images, but they handle irregular text images poorly owing to the fixed geometric structure of the layers in their modules. Unfortunately, irregular text is also very common in the wild, as illustrated in Figure 1. Therefore, some earlier works (Liu et al., 2016; Cheng et al., 2017; Sun et al., 2018) used rectification networks or attention mechanisms to mitigate this issue. Yao et al. (2014) proposed using a dictionary to correct errors in the recognition results in order to handle multi-oriented text images. Luo et al. (2019) added a multi-object rectification network before the recognition network. Shi et al. (2016b) used a spatial transformer network (Jaderberg et al., 2015) to rectify the word images automatically. Gupta et al. (2016) proposed treating the whole word as a class, which ignores the arrangement of the characters. Lyu et al. (2018) proposed an end-to-end learning procedure to handle text instances of irregular shapes. These methods are effective and have greatly alleviated the problem of irregular text recognition, yet most of them rectify the images in one way or another. This may require more manual design, such as additional preprocessing, and increases network complexity.

Figure 1: Examples of images with regular and irregular text from public benchmarks. (a) Regular text. (b) Curved text. (c) Tilted text. (d) Other kinds of irregular text.

Most recognition networks use standard convolutional layers, whose receptive fields have a fixed rectangular shape. These layers are effective to a degree, but they may not fit the text area well because they also cover redundant background noise, as shown in Figure 1. To improve performance, especially on irregular text images, we propose a focus-enhanced text recognition method that needs no image rectification. Imagine reading text in the real world: if the words are arranged irregularly, we usually do not turn our heads to follow their arrangement or try to rectify their shapes in our minds. We simply move our focus along the text line and adjust our visual field. Hence we aim to give the network the ability to focus on the text area and extract features precisely. Dai et al. (2017) recently proposed deformable convolutional layers, which learn additional offsets for each sampling location of the convolutional kernel from the image; this inspired us to let the layers change their structure. In our work, deformable layers, which can adjust the receptive field to better cover the region of interest, are applied to enhance the focus of text recognition networks. We integrate our components into CRNN (Shi et al., 2017), which is widely used as a baseline model. To assess our method, we carry out ablation experiments and comparisons with other methods on several public benchmarks. The results demonstrate that the proposed method achieves competitive performance on both regular and irregular text benchmarks. All training and testing code will be open-sourced soon.

2 Methodology

2.1 Baseline

In this section, we first review the architecture of CRNN (Convolutional Recurrent Neural Network) (Shi et al., 2017). To the best of our knowledge, it is the first attempt to integrate a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) for scene text recognition. Because an RNN can perform sequence labeling without segmentation, the model achieves end-to-end recognition. The network consists mainly of three modules: convolution, recurrent and transcription. Based on the VGG architecture (Simonyan and Zisserman, 2014), the convolutional module consists of convolution and max pooling layers and is responsible for extracting the features of the text as a sequence of frames. The recurrent layers use BiLSTM (Bidirectional Long Short-Term Memory) networks (Hochreiter and Schmidhuber, 1997), which use a forget gate to mitigate the vanishing gradient problem, and predict the content of each frame. The transcription layer decodes the per-frame predictions into a label sequence.
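The transcription step can be illustrated with a minimal sketch of greedy (best-path) CTC decoding. This is an illustration only, not the authors' code: the blank index `0`, the toy alphabet and the function name `ctc_greedy_decode` are assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Decode per-frame scores into a string.

    frame_scores: one list of scores per frame, ordered as [blank] + alphabet.
    """
    # 1. Pick the best-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(alphabet[idx - 1] for idx in collapsed if idx != BLANK)


alphabet = "abc"
# Best classes per frame: a, a, blank, a, b, b
scores = [
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.20, 0.60, 0.10],
]
print(ctc_greedy_decode(scores, alphabet))  # prints "aab"
```

The blank between the second and fourth frames is what allows the repeated "a" to survive the merge step; without it the two runs of "a" would collapse into one character.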

2.2 Deformable Modules

Figure 2: Illustration of a deformable convolution kernel. The blue points indicate the sampling locations of a standard convolution kernel. The arrows show the offsets, and the green points indicate the sampling locations of the deformable convolution.

Although CRNN (Shi et al., 2017) performs well on regular text images, it still cannot achieve satisfactory performance on irregular text images because of background noise and because most of the training data is regular. Inspired by Dai et al. (2017), who presented a network in which the convolutional layers learn to add an offset to each sampling location, we consider it necessary for the network to be able to shift its focus in order to handle irregular text. Hence we integrate these deformable layers into our baseline model to enhance its focus. Following the formulation of Dai et al. (2017), the 2D convolution at each location $p_0$ in the image can be expressed as

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n),$$

where $y$ is the output feature map, $x$ represents the input feature map, $w$ represents the weights of the sampled values and $\mathcal{R}$ defines the receptive field size and dilation. In the deformable convolution, $\mathcal{R}$ is augmented with offsets $\{\Delta p_n \mid n = 1, \dots, N\}$, where $N = |\mathcal{R}|$:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n).$$

The offsets on the sampling locations are learned automatically during training. The process of deformable convolution is illustrated in Figure 2. The receptive fields of the deformable convolution can adapt their shape to focus on the text area, which gives our network the ability to handle irregular text. However, the layers to replace cannot be chosen arbitrarily. The shallow layers usually extract basic information such as edges, shapes and textures, so deformable convolution may be ineffective there. Moreover, because the receptive fields need some space in the feature map to drift, the benefit may be limited in the deep layers, whose feature maps are too small. We therefore replace the convolutional layers in the middle of the network with deformable ones, as detailed in Section 3.3.2. A visualization of the layers in our network is shown in Figure 3.
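The sampling idea behind deformable convolution can be sketched for a single output location in plain Python. The sketch assumes a 3x3 kernel with dilation 1, bilinear interpolation for fractional positions (as in Dai et al.) and zero values outside the feature map. The offsets are fixed by hand here, whereas in the actual layer they are predicted from the input by an additional convolution; the names `bilinear` and `deform_conv_at` are hypothetical.

```python
import math


def bilinear(x, r, c):
    """Sample the 2D list x at fractional (row, col); outside pixels count as 0."""
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < len(x) and 0 <= cc < len(x[0]):
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


# Regular sampling grid of a 3x3 kernel with dilation 1.
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def deform_conv_at(x, w, p0, offsets):
    """Weighted sum over the grid, each grid point shifted by its own offset."""
    total = 0.0
    for (dr, dc), weight, (off_r, off_c) in zip(GRID, w, offsets):
        total += weight * bilinear(x, p0[0] + dr + off_r, p0[1] + dc + off_c)
    return total


# Toy feature map whose value equals the column index.
x = [[float(c) for c in range(5)] for _ in range(5)]
w = [1.0 / 9.0] * 9  # averaging kernel

standard = deform_conv_at(x, w, (2, 2), [(0.0, 0.0)] * 9)
shifted = deform_conv_at(x, w, (2, 2), [(0.0, 0.5)] * 9)
print(round(standard, 6), round(shifted, 6))
```

With all offsets at zero the function reduces to a standard convolution at that location. Shifting every sampling point half a pixel to the right moves the averaged region along the ramp, which is the "drift" of the receptive field described above.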

Figure 3: Illustration of fixed receptive fields in standard convolution (a)(c) and adaptive receptive fields in deformable convolution (b)(d). In each image triplet, the left shows the sampling locations of two levels of filters on the preceding feature map, the middle shows the sampling locations of a filter and the right shows two activation units. Two sets of locations are highlighted according to the activation units.

2.3 Network Architecture

The architecture of our modified network is depicted in Figure 4. Because the network becomes more complicated, residual blocks are used to avoid the vanishing gradient problem. Each residual block consists of two convolutions with a skip connection. For easier engineering implementation, adaptive max pooling layers are applied in our network. These layers derive the kernel size automatically from the input size and the required size of the output feature map.
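As an illustration of how an adaptive pooling layer can derive its windows from the input and output sizes, the sketch below uses the floor/ceil window rule found in common implementations such as PyTorch's adaptive pooling. Whether this matches the formula used by the authors is an assumption; the function names are hypothetical and the example is one-dimensional for brevity.

```python
def adaptive_windows(in_size, out_size):
    """Return the [start, end) input window for every output index.

    start = floor(i * in / out), end = ceil((i + 1) * in / out),
    computed with integer arithmetic to avoid rounding issues.
    """
    windows = []
    for i in range(out_size):
        start = (i * in_size) // out_size
        end = -(-((i + 1) * in_size) // out_size)  # ceiling division
        windows.append((start, end))
    return windows


def adaptive_max_pool1d(values, out_size):
    """Max over each adaptive window, so the output always has out_size items."""
    return [max(values[s:e]) for s, e in adaptive_windows(len(values), out_size)]


print(adaptive_windows(8, 4))   # evenly divisible: kernel 2, stride 2
print(adaptive_windows(10, 4))  # not divisible: windows overlap slightly
print(adaptive_max_pool1d([1, 5, 2, 8, 3, 9, 4, 7, 6, 0], 4))
```

When the input size is a multiple of the output size, the rule reduces to ordinary pooling with kernel and stride equal to their ratio; otherwise neighbouring windows overlap so that every input element is still covered.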

Figure 4: The pipeline of our method (a), and the architecture of the convolutional layers in the network together with the components of the residual blocks (b). In the figure, "Conv" denotes a standard convolutional layer with its kernel size and number of output channels, "MaxPool" denotes an adaptive max pooling layer with its output size, "DConv" denotes a deformable convolutional layer with its kernel size and number of output channels, and "BatchNorm" denotes a batch normalization operation with its number of input channels.

The pipeline of our network is shown in Figure 4(a) and the architecture of the convolutional layers is depicted in Figure 4(b). All activation functions are ReLU, and we do not modify the recurrent layers or the transcription layer of the baseline model.

3 Experiments

In the following part, we will describe the implementation details of our method. We use the standard evaluation protocols and run a number of ablations to analyze the effectiveness of the proposed components on public benchmarks.

3.1 Datasets

3.1.1 Training Data

The MJSynth dataset (Jaderberg et al., 2014b, a) includes about 9 million grayscale synthetic images. The fonts are collected from Google Fonts, and the backgrounds are collected from the ICDAR 2003 (Lucas et al., 2003) and SVT (Wang et al., 2011) datasets. Text samples are obtained through font, border and shadow rendering. After coloring and distortion, the samples are blended into natural images with some noise.

The SynthText in the Wild dataset (Gupta et al., 2016) includes about 7 million color synthetic images. The dictionary is collected from the SVT dataset (Wang et al., 2011), and the background images are from Google Image Search. To synthesize a text image, a text sample is rendered and then blended into a contiguous region of a scene image according to the semantic information.

3.1.2 Testing Data

To confirm the effectiveness of our method, five benchmarks have been utilized in the experiments. Each of them will be introduced below.

The TotalText (Total) dataset (Ch’ng and Chan, 2017) consists of 1255 training images and 300 testing images with more than three different text orientations, including horizontal, multi-oriented, and curved. In total, there are about 2,000 word images.

The ICDAR 2013 (IC13) dataset (Karatzas et al., 2013) consists of 229 training and 233 testing images, which contain about 1,000 words. The images were captured with the user explicitly directing the focus of the camera onto the text content of interest, so most of the text is regular and arranged horizontally.

The ICDAR 2015 (IC15) dataset (Karatzas et al., 2015) consists of 1000 training and 500 testing images with about 4,000 words. The images were collected without any specific prior attention to the text, so the word images may be blurred and not arranged horizontally.

The Street View Text (SVT) dataset (Wang et al., 2011) consists of 353 images, in which about 600 words were labeled as testing data. The images were collected from Google Street View, and many of them are severely corrupted by noise and blur or have very low resolution.

The IIIT 5K-Words (IIIT5K) dataset (Mishra et al., 2012) contains 3,000 cropped word test images. The dataset was harvested from Google image search.

Among these datasets, Total (Ch’ng and Chan, 2017) and IC15 (Karatzas et al., 2015) are commonly used to test the ability to recognize irregular text images, and the others mainly test the recognition of regular ones.

3.2 Implementation Details

The network is trained only on the synthetic text images described in Section 3.1.1; no real data is involved. All input images are resized to a fixed size. The loss function is the Connectionist Temporal Classification (CTC) loss proposed by Graves et al. (2006). The optimizer is SGD, the batch size is set to 64 and the learning rate is 0.00005. The network is trained for 8 epochs, which takes 3 days. The proposed method is implemented in PyTorch (Paszke et al., 2017). All experiments are carried out on a standard PC with an Intel i7-8700 CPU and a single Nvidia TITAN Xp GPU.
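To make the CTC objective concrete, the following sketch computes the negative log-likelihood of a label sequence with the forward (alpha) recursion of Graves et al. It is a didactic, numerically naive version working on plain probabilities; practical training would rely on a framework implementation such as PyTorch's `torch.nn.CTCLoss`. The blank index `0` and the function name are assumptions made for the example.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs[t][k]: probability of class k at frame t; target: list of class ids."""
    # Extended label sequence with blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s]: total probability of all partial paths ending in state s.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or in the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, two classes (blank and one character), uniform predictions.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, [1])
print(round(loss, 4))
```

In this toy case three of the four possible frame labelings collapse to the target, so the likelihood is 0.75. The function raises a `ValueError` from `math.log` if the target cannot be aligned to the given number of frames.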

3.3 Ablation Experiments

3.3.1 The Impacts of Proposed Components

We evaluate the effectiveness of our proposed components: using deformable convolutional layers, using residual blocks and resizing the images to a larger size. Our model takes input images of the larger size, and we also test it with the input size used by our baseline model. The comparisons are shown in Table 1 and some of the recognition results are shown in Figure 5. All models are trained in case-insensitive mode and tested without a lexicon.

DConv ResBlock Larger Size Total IC13 IC15 SVT IIIT5K
64.8 86.2 65.3 78.4 85.2
68.6 88.7 70.8 78.3 90.4
68.2 88.9 69.5 78.7 91.4
65.2 87.4 65.6 78.1 87.4
68.7 89.0 69.9 78.9 90.9
70.3 89.7 72.2 79.4 92.2
Table 1: Ablation experiment results. In the table, "DConv" means using the deformable convolution layers, "ResBlock" means adding residual blocks to the network and "Larger Size" means using the larger input image size.

From Table 1, we can observe that our modifications improve accuracy over the baseline model. The deformable layers are chiefly effective on irregular images, while their effect is less significant on regular text. The residual blocks are mainly effective on regular images. Using the larger input size does not bring significant improvements to our baseline network. However, because some space is required for the offsets, the larger input size is highly effective on our modified network. Finally, with all of the components, the model reaches the best performance.
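The numbers in Table 1 can be read as word-level accuracies. The sketch below assumes the usual scene text recognition protocol, in which a prediction counts as correct only if the whole word matches the ground truth after case folding and no lexicon is used to correct predictions. The paper only refers to "standard evaluation protocols", so this exact definition and the function name are assumptions.

```python
def word_accuracy(predictions, ground_truths):
    """Percentage of words recognized exactly, ignoring letter case."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    correct = sum(
        pred.lower() == truth.lower()
        for pred, truth in zip(predictions, ground_truths)
    )
    return 100.0 * correct / len(ground_truths)


predicted = ["Hello", "w0rld", "text"]
labels = ["hello", "world", "TEXT"]
print(round(word_accuracy(predicted, labels), 1))  # two of three words match
```

Under this definition a single wrong character makes the whole word count as an error, which is why blurred or low-resolution crops affect the scores strongly.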

Figure 5: The recognition results of our baseline model (top) and our method (bottom). The red characters are those recognized incorrectly.

3.3.2 The Impacts of Location of Deformable Layers

To demonstrate that it is best to use deformable convolutional layers as the fourth and fifth layers, we run ablation experiments with deformable layers at different positions. The results are shown in Table 2.

Location Total IC13 IC15 SVT IIIT5K
{3} 65.1 89.2 67.2 77.6 88.9
{4} 66.4 88.1 67.2 77.7 89.8
{5} 67.7 89.3 68.6 77.6 90.6
{3,4} 66.1 88.5 66.8 78.0 89.4
{4,5} 70.3 89.7 72.2 79.4 92.2
{3,5} 64.4 88.3 64.4 74.6 89.1
{3,4,5} 63.8 87.3 64.6 75.0 88.1
Table 2: Ablation experiment results for the location of the deformable layers. "Location" indicates which convolutional layers are replaced with deformable ones. For instance, "{3,5}" means replacing the third and fifth layers.

The table shows that the network generally performs better when deeper layers are replaced with deformable ones. The best two-layer configuration, {4,5}, outperforms every single-layer configuration, but when three layers are replaced, accuracy degrades. We believe this is because too many deformable layers cause over-fitting. According to these results, we apply deformable layers in the fourth and fifth convolutions.
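The selection of the {4,5} configuration can be checked directly against the numbers in Table 2. The short script below copies the table values and reports the best location per benchmark as well as the unweighted mean accuracy per configuration. The unweighted mean is only a convenience for this illustration and is not a metric used in the paper.

```python
BENCHMARKS = ("Total", "IC13", "IC15", "SVT", "IIIT5K")

# Accuracies copied from Table 2, keyed by the replaced layer indices.
TABLE2 = {
    "{3}": (65.1, 89.2, 67.2, 77.6, 88.9),
    "{4}": (66.4, 88.1, 67.2, 77.7, 89.8),
    "{5}": (67.7, 89.3, 68.6, 77.6, 90.6),
    "{3,4}": (66.1, 88.5, 66.8, 78.0, 89.4),
    "{4,5}": (70.3, 89.7, 72.2, 79.4, 92.2),
    "{3,5}": (64.4, 88.3, 64.4, 74.6, 89.1),
    "{3,4,5}": (63.8, 87.3, 64.6, 75.0, 88.1),
}


def best_per_benchmark(table):
    """Map each benchmark name to the location with the highest accuracy."""
    return {
        name: max(table, key=lambda loc: table[loc][i])
        for i, name in enumerate(BENCHMARKS)
    }


def mean_accuracy(table):
    """Unweighted mean over the five benchmarks for every location."""
    return {loc: sum(vals) / len(vals) for loc, vals in table.items()}


best = best_per_benchmark(TABLE2)
means = mean_accuracy(TABLE2)
print(best)
for loc in sorted(means, key=means.get, reverse=True):
    print(loc, round(means[loc], 2))
```

The {4,5} configuration is the best on all five benchmarks, so the choice does not depend on how the benchmarks are weighted.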

3.4 Comparative Evaluation

Methods Total IC13 IC15 SVT IIIT5K
Shi et al. (2017) - 86.7 - 80.8 78.2
Luo et al. (2019) - 92.4 68.8 88.3 91.2
Liu et al. (2016) - 89.1 - 83.6 83.3
Lyu et al. (2018) 52.9 86.5 62.4 - -
Sun et al. (2018) 54.0 83.0 60.5 - -
Shi et al. (2016a) - 88.6 - 81.9 81.9
Lee and Osindero (2016) - 90.0 - 80.7 78.4
Cheng et al. (2018) - - 68.2 82.8 87.0
Liu et al. (2018a) - 90.8 60.0 84.4 83.6
Liu et al. (2018b) - 94.0 - 87.1 89.4
Bissacco et al. (2013) - 87.6 - 78.0 -
Wang and Hu (2017) - - - 81.5 80.8
Jaderberg et al. (2014c) - 81.8 - 71.7 -
Jaderberg et al. (2014a) - 90.8 - 80.7 -
Tan et al. (2014) - - - 80.1 81.7
Baseline 64.8 86.2 65.3 78.4 85.2
Ours 70.3 89.7 72.2 79.4 92.2
Table 3: Experiment results. In the table, "Baseline" is our baseline model without any modification, trained on our training dataset. "-" indicates that the authors did not test their model on the dataset and no result is provided.

We compare our method with other methods, including Shi et al. (2017); Luo et al. (2019); Liu et al. (2016); Cheng et al. (2017); Lyu et al. (2018); Sun et al. (2018); Shi et al. (2016a); Lee and Osindero (2016); Cheng et al. (2018); Liu et al. (2018a, b); Bissacco et al. (2013); Wang and Hu (2017); Jaderberg et al. (2014c, a); Tan et al. (2014). All results, shown in Table 3, are obtained without a lexicon. Our method is effective on irregular text such as that from TotalText (Ch’ng and Chan, 2017) and ICDAR 2015 (Karatzas et al., 2015). Although there are no significant improvements on regular text from ICDAR 2013 (Karatzas et al., 2013) and IIIT5K (Mishra et al., 2012), our proposed model still reaches satisfactory performance. On SVT (Wang et al., 2011), our model does not achieve a high score; we attribute this to the fact that many of the images are severely corrupted by noise and blur or have very low resolution. Such images differ from those in the training dataset, which confuses the deformable layers: they cannot locate the text area as learned from the training images, because our goal is to handle irregular text rather than blurred text.

4 Conclusions

When dealing with irregular text images, rectification modules are usually the first idea that comes to mind. However, that differs from the way we read irregular text, which is to change and enhance our focus. In this work, we propose a method that recognizes both regular and irregular text images by using deformable convolutional layers, which enable the network to change and enhance its focus. The model reaches satisfactory performance without any component for rectifying the images. In the future, as more sophisticated recognition networks become available and attention mechanisms can be incorporated, our goal is to design a system that can handle images with text in any orientation without preprocessing.


References

  • A. Bissacco, M. Cummins, Y. Netzer, and H. Neven (2013) Photoocr: reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 785–792.
  • C. K. Ch’ng and C. S. Chan (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In 14th IAPR International Conference on Document Analysis and Recognition ICDAR, pp. 935–942.
  • Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5076–5084.
  • Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou (2018) Aon: towards arbitrarily-oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5571–5579.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376.
  • A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In IEEE Conference on Computer Vision and Pattern Recognition.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014a) Reading text in the wild with convolutional neural networks. arXiv preprint arXiv:1412.1842.
  • M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014b) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227.
  • M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014c) Deep structured output learning for unconstrained text recognition. arXiv preprint arXiv:1412.5903.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025.
  • D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160.
  • D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493.
  • C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231–2239.
  • W. Liu, C. Chen, K. K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition. In BMVC, Vol. 2, pp. 7.
  • W. Liu, C. Chen, and K. K. Wong (2018a) Char-net: a character-aware neural network for distorted scene text recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Y. Liu, Z. Wang, H. Jin, and I. Wassell (2018b) Synthetically supervised feature learning for scene text recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 435–451.
  • S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young (2003) ICDAR 2003 robust reading competitions. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., pp. 682–687.
  • C. Luo, L. Jin, and Z. Sun (2019) MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognition 90, pp. 109–118.
  • P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai (2018) Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 67–83.
  • A. Mishra, K. Alahari, and C. V. Jawahar (2012) Scene text recognition using higher order language priors. In BMVC.
  • A. Paszke, S. Gross, S. Chintala, and G. Chanan (2017) PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
  • B. Shi, X. Bai, and C. Yao (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304.
  • B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016a) Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176.
  • B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai (2016b) Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding (2018) TextNet: irregular text reading from images with an end-to-end trainable network. arXiv preprint arXiv:1812.09900.
  • Z. R. Tan, S. Tian, and C. L. Tan (2014) Using pyramid of histogram of oriented gradients on natural scene text recognition. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 2629–2633.
  • J. Wang and X. Hu (2017) Gated recurrent convolution neural network for ocr. In Advances in Neural Information Processing Systems, pp. 335–344.
  • K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In 2011 International Conference on Computer Vision, pp. 1457–1464.
  • C. Yao, X. Bai, and W. Liu (2014) A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing 23 (11), pp. 4737–4749.