Memory Matters: Convolutional Recurrent Neural Network for Scene Text Recognition

01/06/2016 ∙ by Guo Qiang, et al.

Text recognition in natural scenes is a challenging problem due to the many factors affecting text appearance. In this paper, we present a method that directly transcribes scene text images to text without the need for sophisticated character segmentation. We leverage recent advances in deep neural networks to model the appearance of scene text images with temporal dynamics. Specifically, we integrate a convolutional neural network (CNN) and a recurrent neural network (RNN), motivated by the complementary modeling capabilities of the two models. The main contribution of this work is an investigation of how temporal memory helps in a segmentation-free fashion for this specific problem. By using long short-term memory (LSTM) blocks as hidden units, our model can retain long-term memory, in contrast to HMMs, which only maintain short-term state dependences. We conduct experiments on the Street View House Number dataset, which contains highly variable number images. The results demonstrate the superiority of the proposed method over traditional HMM based methods.




I Introduction

Text recognition in natural scenes is an important problem in computer vision. However, due to the enormous appearance variations in natural images, e.g. different fonts, scales, rotations and illumination conditions, it is still quite challenging.

Identifying the position of a character and recognizing it are two interdependent problems. Straightforward methods treat the task as separate character segmentation and recognition steps[1, 2]. This paradigm is fragile in unconstrained natural images because it is difficult to deal with low resolution, low contrast, blurring, a large diversity of text fonts and highly complicated background clutter.

Due to the shortcomings of these methods, algorithms combining segmentation and recognition were proposed. GMM-HMMs are the most widely used models, especially in the speech and handwriting communities[3, 4, 5, 6].

In this paradigm, scene text images are transformed into frame sequences by a sliding window. GMMs are used for modeling frame appearance and HMMs are used to infer the target labels of the whole sequence[7].

The merit of this method is that it avoids the need for fragile character segmentation. However, HMMs have several obvious shortcomings, e.g. they do not take long-range context into account and they rely on improper independence assumptions.

Fig. 1: The whole architecture of CRNN.

With the flourishing development of deep neural networks (DNNs), convolutional neural networks (CNNs) have been used to form the hybrid CNN-HMM model[7], replacing GMMs as the observation model. This model generally performs better than the GMM-HMM model thanks to the strong representation capacity of the CNN, but it still does not eliminate the issues with HMMs.

In this work, we address the issues of HMMs while keeping the algorithm free of segmentation. The novelty of our method is the use of a Recurrent Neural Network (RNN), which can adaptively retain long-term dynamic memory, as the sequence model. We combine the CNN with the RNN to exploit their representation abilities on different aspects.

The RNN is a powerful connectionist model for sequences. Compared with static feed-forward networks, it introduces recurrent connections that enable the network to maintain an internal state. It makes no assumption about the independence of inputs, so each hidden unit can take more input information into account. Specifically, LSTM memory blocks are used to enable the RNN to retain a longer range of inter-dependences among input frames. Another virtue of using an RNN as the sequence model is the ease of building an end-to-end trainable system that is trained directly on the whole image without explicit segmentation. The main weakness of the RNN is its feature extraction capability.

To alleviate the shortcomings of both HMM and RNN, we propose a novel end-to-end sequence recognition model named Convolutional Recurrent Neural Network (CRNN). The model is composed of hierarchical convolutional feature extraction layers and recurrent sequence modeling layers.

CNNs are good at appearance modeling and RNNs have a strong capacity for modeling sequences. The model is trained with the Connectionist Temporal Classification (CTC)[8] objective, which enables the model to be learned directly from images without segmentation information.

Our idea is motivated by the complementary modeling capacities of CNN and RNN, and inspired by recent successful applications of LSTM architectures to various sequential problems, such as handwriting recognition[9], speech recognition[10] and image description[11, 12]. The whole architecture of our model is illustrated in Figure 1.

II Related Work

In this section, we briefly survey methods that focus on sequence modeling without segmentation.

The paradigms of scene text recognition algorithms are similar to those of handwriting recognition. Straightforward methods[1, 2] are composed of two separate parts: a segmentation algorithm followed by a classifier that determines the category of each segment. Often, the classification results are post-processed to form the final results.

To eliminate the need for explicit character segmentation, many researchers use GMM-HMM for text recognition[13, 14]. GMM-HMM is a classical model widely used by the speech community. HMMs make it possible to model sequences without the necessity of segmentation. However, GMM-HMM has several shortcomings that have kept it from being widely used in scene text recognition. First, the GMM is a weak model for characters in natural scenes. Second, the HMM has the limitations discussed in Section I.

To strengthen the representation capability of HMM based models, the CNN is used to replace the GMM as the observation model, forming the hybrid CNN-HMM model[7, 15]. While this improves performance in comparison with GMM-HMM, it still does not eliminate the shortcomings of the HMM.

Our idea is motivated by the recent success of RNN models applied to handwriting recognition[9], speech recognition[10] and image description[11, 12]. The main inspiration is the complementary modeling capacity of CNN and RNN. A CNN can automatically learn hierarchical image features, but only as a static model. An RNN is good at sequence modeling but lacks the ability to extract features. We integrate the two models to form an end-to-end scene text recognition system.

Unlike recent work[16] that uses similar ideas, we investigate deep RNNs built by stacking multiple recurrent hidden layers on top of each other. Our experiments show the improvement this brings.

III Problem Formulation

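We formulate scene text recognition as a sequence labeling problem by treating scene text as frame sequences. The label sequence is drawn from a fixed alphabet $L$. The length of the label sequence is not restricted to be equal to that of the frame sequence.

Each scene text image is treated as a sequence of frames denoted by $\mathbf{x} = (x_1, x_2, \dots, x_T)$. The target sequence is $\mathbf{z} = (z_1, z_2, \dots, z_U)$. We use bold typeface to denote sequences.

We constrain that $U \le T$. The input space $\mathcal{X} = (\mathbb{R}^m)^*$ is the set of all sequences of $m$-dimensional real valued vectors. The target space $\mathcal{Z} = L^*$ is the set of all sequences over the alphabet $L$ of labels. We refer to $\mathbf{z} \in \mathcal{Z}$ as a labeling.

Let $S$ be a set of training samples drawn independently from a fixed distribution $\mathcal{D}_{\mathcal{X} \times \mathcal{Z}}$ composed of sequence pairs $(\mathbf{x}, \mathbf{z})$. The task is to use $S$ to train a sequence labeling algorithm $h: \mathcal{X} \to \mathcal{Z}$ to label the sequences in a test set $S'$ as accurately as possible given the error criterion label error rate $E^{lab}$:

$$E^{lab}(h, S') = \frac{1}{|S'|} \sum_{(\mathbf{x}, \mathbf{z}) \in S'} \frac{ED(h(\mathbf{x}), \mathbf{z})}{|\mathbf{z}|}$$

where $ED(\mathbf{p}, \mathbf{q})$ is the edit distance between two sequences $\mathbf{p}$ and $\mathbf{q}$.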
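The label error rate can be illustrated with a minimal sketch. It assumes the criterion is the edit distance between prediction and target, normalized by the target length and averaged over the test set; the function names `edit_distance` and `label_error_rate` are illustrative and not taken from the paper.

```python
def edit_distance(p, q):
    """Levenshtein distance between two sequences p and q."""
    prev = list(range(len(q) + 1))  # distances from the empty prefix of p
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(predictions, targets):
    """Mean edit distance per sample, normalized by target length (assumed form)."""
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += edit_distance(pred, target) / len(target)
    return total / len(targets)


if __name__ == "__main__":
    # Two house numbers: the first is read correctly, the second has one wrong digit.
    print(label_error_rate(["123", "45"], ["123", "46"]))
```

Sequence accuracy, the metric reported in the result tables, is stricter: a prediction counts only if its edit distance to the target is zero.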

IV Method

IV-A The proposed model

The network architecture of our CRNN model is shown in Figure 1. The model is mainly composed of two parts: a deep convolutional network for feature extraction and a bidirectional recurrent network for sequence modeling.

A scene text image is transformed into a sequence of frames which are fed into the CNN model to extract feature vectors.

The CNN model only maps each frame to its corresponding output vector. The sequence of feature vectors is then used as the input of the RNN, which takes the whole sequence history into consideration.

The recurrent connections allow the network to retain previous inputs as memory in its internal states and to discover temporal correlations among time-steps, even those far from each other.

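Given an input sequence $\mathbf{x} = (x_1, \dots, x_T)$, a standard RNN computes the hidden vector sequence $\mathbf{h} = (h_1, \dots, h_T)$ and output vector sequence $\mathbf{y} = (y_1, \dots, y_T)$ as follows:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

where the $W$ terms denote weight matrices (e.g. $W_{xh}$ is the input-hidden weight matrix), the $b$ terms denote bias vectors (e.g. $b_h$ is the hidden bias vector), and $\mathcal{H}$ is the hidden layer activation function.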
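The recurrence described above can be sketched in NumPy as follows. This is a minimal illustration, assuming a tanh hidden activation and a zero initial hidden state (neither is stated in the text); the function name `rnn_forward` and the argument layout are hypothetical.

```python
import numpy as np


def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y, activation=np.tanh):
    """Run a standard RNN over a frame sequence x of shape (T, D).

    Returns the hidden sequence (T, H) and the output sequence (T, K).
    """
    T = x.shape[0]
    H = W_hh.shape[0]
    K = W_hy.shape[0]
    h = np.zeros((T, H))
    y = np.zeros((T, K))
    h_prev = np.zeros(H)  # assumed zero initial state
    for t in range(T):
        # The hidden state mixes the current frame with the previous state.
        h[t] = activation(W_xh @ x[t] + W_hh @ h_prev + b_h)
        y[t] = W_hy @ h[t] + b_y
        h_prev = h[t]
    return h, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, K, T = 3, 4, 2, 6
    h, y = rnn_forward(rng.normal(size=(T, D)),
                       rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                       rng.normal(size=(K, H)), np.zeros(H), np.zeros(K))
    print(h.shape, y.shape)
```

Because `h_prev` is carried across iterations, the output at a given time-step depends on every earlier frame, which is the memory effect the section describes.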

We stack a CTC layer on top of the RNN. With the CTC layer, we can train the RNN model directly on the sequences’ labelings without knowing frame-wise labels.

IV-B Feature extraction

CNNs[17, 18] have shown exceptionally powerful representation capability for images and have achieved state-of-the-art results in various vision problems. In this work, we build a CNN for feature extraction.

We use the CNN as a transforming function that takes an input image and outputs a fixed dimensional vector as the feature. The convolution and pooling operations in deep CNNs are specially designed to extract visual features hierarchically, from local low-level features to robust high-level ones. The hierarchically extracted features are robust to the variable factors that characters face in natural scenes.

IV-C Dynamic modeling with Bidirectional RNN and LSTM

One shortcoming of conventional RNNs is that they are only able to make use of previous context. However, it is reasonable to exploit both previous and future context. For scene text recognition, the left and right contexts are both useful for determining the category of a specific frame image.

In our model, we use Bidirectional RNN (BRNN)[19] to process sequential data from both directions with two separate hidden layers.

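The BRNN computes the forward hidden sequence $\overrightarrow{\mathbf{h}}$ and the backward hidden sequence $\overleftarrow{\mathbf{h}}$. Each time-step of the output sequence is computed by integrating both directional hidden states:

$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$$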

The BRNN provides the output layer with complete past and future context for every time-step in the input sequence. During the forward pass, the input sequence is fed to both directional hidden layers, and the output layer is not updated until both hidden layers have processed the entire input sequence. The backward pass of BPTT for the BRNN is similar to that of the unidirectional RNN, except that the output layer error is fed back to both directional hidden layers.
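The bidirectional forward pass can be sketched as below. This is an illustration under assumptions: plain tanh hidden units stand in for the LSTM blocks, the initial states are zero, and the helper names `run_direction` and `brnn_forward` are hypothetical.

```python
import numpy as np


def run_direction(x, W_xh, W_hh, b_h, reverse=False):
    """Hidden sequence of one directional layer over frames x of shape (T, D)."""
    T, H = x.shape[0], W_hh.shape[0]
    h = np.zeros((T, H))
    h_prev = np.zeros(H)  # assumed zero initial state
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h_prev = np.tanh(W_xh @ x[t] + W_hh @ h_prev + b_h)
        h[t] = h_prev
    return h


def brnn_forward(x, fwd, bwd, W_fy, W_by, b_y):
    """fwd and bwd are (W_xh, W_hh, b_h) tuples for the two hidden layers."""
    h_f = run_direction(x, *fwd)                # left-to-right context
    h_b = run_direction(x, *bwd, reverse=True)  # right-to-left context
    # The output layer is computed only after both layers saw the full sequence.
    return h_f @ W_fy.T + h_b @ W_by.T + b_y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, K, T = 3, 4, 2, 5
    fwd = (rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
    bwd = (rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
    y = brnn_forward(rng.normal(size=(T, D)), fwd, bwd,
                     rng.normal(size=(K, H)), rng.normal(size=(K, H)), np.zeros(K))
    print(y.shape)
```

In this sketch the output at the first frame already depends on the last frame through the backward layer, which a unidirectional RNN cannot provide.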

While in principle the RNN is a simple and powerful model, in practice it is unfortunately hard to train properly. An RNN can be seen as a deep neural network unfolded in time. A critical problem when training deep networks is the vanishing and exploding gradient problem[20]. When the error signal is transmitted along the network’s recurrent connections, it decays or blows up exponentially. Because of this, the range of context that can be accessed can be quite limited.

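The Long Short-Term Memory (LSTM) block[21] is designed to control the input, output and transition of signals so as to retain useful information and discard useless information. LSTMs are used as the memory blocks of the RNN and can alleviate the vanishing and exploding gradient issues. They use a special structure to form memory cells. The LSTM updates for time-step $t$ given inputs $x_t$, $h_{t-1}$ and $c_{t-1}$ are:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector $h$.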

The core part of the LSTM block is the memory cell, which encodes the information of the inputs that have been observed up to that step. The gates determine whether the LSTM keeps a value or discards it. The input gate controls whether the LSTM considers its current input, the forget gate allows it to forget its previous memory, and the output gate decides how much of the memory to transfer to the hidden state. These features enable the LSTM architecture to learn complex long-term dependences.
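The gating behaviour described above can be sketched as a single LSTM update. This is a minimal illustration assuming the common formulation without peephole connections and with tanh as the cell nonlinearity; the function `lstm_step` and the dictionary layout of the parameters are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update.

    W maps the keys 'xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xc', 'hc' to weight
    matrices; b maps 'i', 'f', 'o', 'c' to bias vectors (assumed layout).
    """
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # candidate cell input
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new input
    h = o * np.tanh(c)       # expose part of the memory as the hidden state
    return h, c


if __name__ == "__main__":
    H, D = 3, 2
    W = {k: np.zeros((H, D)) for k in ("xi", "xf", "xo", "xc")}
    W.update({k: np.zeros((H, H)) for k in ("hi", "hf", "ho", "hc")})
    # Forget gate saturated open, input gate saturated closed: memory is carried over.
    b = {"i": np.full(H, -50.0), "f": np.full(H, 50.0),
         "o": np.zeros(H), "c": np.zeros(H)}
    h, c = lstm_step(np.ones(D), np.zeros(H), np.array([0.3, -1.2, 2.0]), W, b)
    print(c)
```

With the forget gate near one and the input gate near zero, the cell state passes through a time-step almost unchanged, which is how the block can carry information over long ranges.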

IV-D Training

Both the CNN and the RNN are deep models, and stacking them makes joint training difficult. Starting from a random initialization, the supervision information propagated from the RNN to the CNN is quite ambiguous, so we train the two parts separately.

The CNN model is trained with stochastic gradient descent. The training samples for the CNN are obtained by performing forced alignment on the scene text images with a GMM-HMM model.


For training the RNN model, we need a loss function that can directly compute the probability of the target labelling from the frame-wise outputs of RNN given the observations.

Connectionist Temporal Classification (CTC)[8] is an objective function designed for sequence labeling problem when the segmentation of data is unknown. It does not require pre-segmented training data, or post-processing to transform the network outputs into labelings. It trains the network to map directly from input sequences to the conditional probabilities of the possible labelings.

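A CTC output layer contains one more unit than there are elements in the alphabet $L$; the extended alphabet is denoted as $L' = L \cup \{blank\}$. The elements of $L'^T$ are referred to as paths. For an input sequence $\mathbf{x}$ of length $T$, the conditional probability of a path $\pi \in L'^T$ is given by

$$p(\pi | \mathbf{x}) = \prod_{t=1}^{T} y^t_{\pi_t}$$

where $y^t_k$ is the activation of output unit $k$ at time $t$. An operator $\mathcal{B}$ is defined to merge the repeated labels and remove blanks. For example, $\mathcal{B}(a{-}ab{-})$ yields the labeling $aab$, where $-$ denotes the blank. The conditional probability of a given labeling $\mathbf{l}$ is the sum of the probabilities of all paths corresponding to it:

$$p(\mathbf{l} | \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi | \mathbf{x})$$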

We use CTC[8] as the objective function. A forward-backward algorithm[8] for CTC, similar to the forward-backward algorithm of HMMs, is designed to evaluate this probability efficiently.
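The path-merging operator and the forward half of the forward-backward algorithm can be sketched as follows. This is an illustration under assumptions: the blank is assigned index 0, the per-frame outputs `y` are already normalized probabilities, and the function names are hypothetical. A brute-force sum over all paths is included only to check the recursion on tiny inputs.

```python
import itertools

import numpy as np

BLANK = 0  # assumed index of the blank unit


def collapse(path, blank=BLANK):
    """Merge repeated labels, then remove blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return tuple(k for k in merged if k != blank)


def ctc_probability(y, labeling, blank=BLANK):
    """Probability of a labeling given per-frame outputs y of shape (T, K)."""
    T = y.shape[0]
    ext = [blank]                      # labeling with blanks interleaved
    for k in labeling:
        ext += [k, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return total


def ctc_probability_bruteforce(y, labeling, blank=BLANK):
    """Sum the probabilities of every path that collapses to the labeling."""
    T, K = y.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(labeling):
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 3))
    y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print(ctc_probability(y, (1, 2)), ctc_probability_bruteforce(y, (1, 2)))
```

The brute-force version enumerates every path and is exponential in the sequence length, while the recursion fills a table with one row per frame, which is why the dynamic-programming form is used in training.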

The objective function for CTC is the negative log probability of the correct labelings for the entire training set:

$$O = -\sum_{(\mathbf{x}, \mathbf{z}) \in S} \ln p(\mathbf{z} | \mathbf{x})$$


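Given the partial derivatives of a differentiable loss function $O$ with respect to the network outputs, we use the back propagation through time (BPTT) algorithm to determine the derivatives with respect to the weights.

Like standard back propagation, BPTT follows the chain rule to calculate derivatives. The subtle difference is that the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next time-step. Writing $a_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h$ for the hidden pre-activation, $\delta^h_t = \partial O / \partial a_t$ and $\delta^y_t = \partial O / \partial y_t$, the error back propagation formula is

$$\delta^h_t = \mathcal{H}'(a_t) \odot \left( W_{hy}^\top \delta^y_t + W_{hh}^\top \delta^h_{t+1} \right), \qquad \delta^h_{T+1} = 0$$

The same weights are reused at every time-step, so we sum over the whole sequence to get the derivatives with respect to the network weights:

$$\frac{\partial O}{\partial W_{xh}} = \sum_{t=1}^{T} \delta^h_t x_t^\top, \qquad \frac{\partial O}{\partial W_{hh}} = \sum_{t=1}^{T} \delta^h_t h_{t-1}^\top, \qquad \frac{\partial O}{\partial W_{hy}} = \sum_{t=1}^{T} \delta^y_t h_t^\top$$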
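A minimal sketch of BPTT for a plain tanh RNN is given below. To keep it short it assumes a squared-error loss on the outputs rather than CTC, a zero initial state, and a hypothetical parameter dictionary `p`; only the backward recursion over time and the summation of weight gradients are the point here.

```python
import numpy as np


def rnn_loss_and_grads(x, target, p):
    """Squared-error loss of a tanh RNN and its BPTT gradients.

    x: (T, D) frames, target: (T, K) float array,
    p: dict with 'W_xh', 'W_hh', 'W_hy', 'b_h', 'b_y' (assumed layout).
    """
    T = x.shape[0]
    H = p["W_hh"].shape[0]
    h = np.zeros((T + 1, H))   # h[t + 1] is the state after frame t; h[0] = 0
    y = np.zeros_like(target)
    for t in range(T):
        h[t + 1] = np.tanh(p["W_xh"] @ x[t] + p["W_hh"] @ h[t] + p["b_h"])
        y[t] = p["W_hy"] @ h[t + 1] + p["b_y"]
    loss = 0.5 * np.sum((y - target) ** 2)

    grads = {k: np.zeros_like(v) for k, v in p.items()}
    delta_next = np.zeros(H)   # no error arrives from beyond the last frame
    for t in reversed(range(T)):
        delta_y = y[t] - target[t]   # derivative of the loss w.r.t. y_t
        # Error reaches the hidden layer from the output layer and from
        # the hidden layer of the next time-step.
        back = p["W_hy"].T @ delta_y + p["W_hh"].T @ delta_next
        delta_h = (1.0 - h[t + 1] ** 2) * back   # tanh'(a) = 1 - tanh(a)^2
        # Weights are shared across time, so gradients accumulate over t.
        grads["W_hy"] += np.outer(delta_y, h[t + 1])
        grads["b_y"] += delta_y
        grads["W_xh"] += np.outer(delta_h, x[t])
        grads["W_hh"] += np.outer(delta_h, h[t])
        grads["b_h"] += delta_h
        delta_next = delta_h
    return loss, grads


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, H, K, T = 3, 4, 2, 5
    p = {"W_xh": rng.normal(size=(H, D)), "W_hh": rng.normal(size=(H, H)),
         "W_hy": rng.normal(size=(K, H)), "b_h": np.zeros(H), "b_y": np.zeros(K)}
    loss, grads = rnn_loss_and_grads(rng.normal(size=(T, D)),
                                     rng.normal(size=(T, K)), p)
    print(loss, grads["W_hh"].shape)
```

A finite-difference check on a few weight entries is a convenient way to validate such an implementation before swapping in a different loss.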


IV-E Decoding

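The decoding task is to find the most probable labeling $\mathbf{l}^*$ given an input sequence $\mathbf{x}$:

$$\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l} | \mathbf{x})$$

We use a simple and effective approximation by choosing the most probable path and then taking the labeling corresponding to that path:

$$\mathbf{l}^* \approx \mathcal{B}(\pi^*), \qquad \pi^* = \arg\max_{\pi} p(\pi | \mathbf{x})$$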
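The most-probable-path approximation can be sketched in a few lines. The sketch assumes the blank unit has index 0 and that `y` holds one row of output activations per frame; the function name `best_path_decode` is illustrative.

```python
import itertools

import numpy as np


def best_path_decode(y, blank=0):
    """Pick the most active unit per frame, merge repeats, drop blanks."""
    path = np.argmax(y, axis=1)                            # most probable path
    merged = [int(k) for k, _ in itertools.groupby(path)]  # merge repeated labels
    return [k for k in merged if k != blank]               # remove blanks


if __name__ == "__main__":
    y = np.array([[0.10, 0.80, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.90, 0.05, 0.05],
                  [0.20, 0.70, 0.10],
                  [0.10, 0.20, 0.70]])
    print(best_path_decode(y))
```

Since the most probable path does not always correspond to the most probable labeling, this is an approximation rather than an exact search.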


V Experiments

This section presents the details of our experiments. We compare the CRNN model with three methods: (1) a baseline method that directly uses the CNN to predict the character at each time-step, with a post-processing procedure to merge repeated outputs; (2) the GMM-HMM model; (3) the hybrid CNN-HMM model.

V-A Dataset

We explore the use of CRNN model on a challenging scene text dataset Street View House Numbers(SVHN)[2]. The dataset contains two versions of sample images. One contains only isolated digits with for training and for testing. The other contains unsegmented full house number images containing variable number of unsegmented digits. The full number version is composed of training images and testing. House numbers in the dataset show quite large appearance variability, blurring and unnormalized layouts.

We use the isolated version of the dataset for training the CNN, which is then used to extract features for each sequence frame. The full house number version is used for HMM based methods and CRNN.

Part of the training samples is randomly split off as the validation set. The validation set is used only for tuning the hyperparameters of the different models. All models use the same training, validation and test sets, which makes the comparison fair.

V-B Implementation details

The full number images are normalized to the same height while keeping the aspect ratio, then transformed into frame sequences by a sliding window. Each frame is fed into the CNN to produce a feature vector. We standardize the features by subtracting the global mean and dividing by the standard deviation.
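The framing and standardization steps can be sketched as follows. The window width, step size and image size below are placeholder values chosen for illustration, not the paper's settings, and the sketch assumes the mean and standard deviation are computed per feature dimension; `sliding_window_frames` and `standardize` are hypothetical helper names.

```python
import numpy as np


def sliding_window_frames(image, width, step):
    """Cut a height-normalized image of shape (H, W) into frames (T, H, width)."""
    H, W = image.shape
    starts = range(0, W - width + 1, step)
    return np.stack([image[:, s:s + width] for s in starts])


def standardize(features, mean, std, eps=1e-8):
    """Subtract the mean and divide by the standard deviation."""
    return (features - mean) / (std + eps)


if __name__ == "__main__":
    image = np.arange(32 * 100, dtype=float).reshape(32, 100)  # placeholder image
    frames = sliding_window_frames(image, width=32, step=4)
    print(frames.shape)

    rng = np.random.default_rng(0)
    feats = rng.normal(loc=3.0, scale=2.0, size=(1000, 8))     # placeholder features
    z = standardize(feats, feats.mean(axis=0), feats.std(axis=0))
    print(z.mean(axis=0).round(6), z.std(axis=0).round(6))
```

In practice the statistics would be estimated once on the training features and then reused unchanged for validation and test data.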

Our CNN model contains convolution and pooling layer pairs and fully connected layers. The activation neurons are all rectified linear units (ReLU)[22]. The output number sequence of the layers are . CNN is trained with stochastic gradient descent (SGD) under cross entropy loss with learning rate and momentum .

We choose the output of CNN’s first fully connected layer as frame features. The features are D and used for both HMM based models and CRNN.

We use HMMs coinciding with the extended alphabet . All the HMMs are of -state left-to-right topology, except for the category which has self-looped state. The GMM-HMM model is trained with Baum-Welch algorithm.

For the proposed CRNN model, we use a deep bidirectional RNN. We stack two RNN hidden layers, both of which are bidirectional. All hidden units are LSTM blocks. The CRNN model is trained with the BPTT algorithm using SGD. We use a learning rate of and a momentum of .

V-C Results

Fig. 2: Training curve during training of CRNN-2-layers.

CNN based sequence labelling

We train the CNN on the isolated version of SVHN, then use it to get the character predictions for each frame, choosing the most probable one as the result. After that, we merge consecutively repeated characters to get the final sequence labels. The recognition accuracy on the full number test set is only 0.23 (Table II). Note that the CNN model achieves an accuracy of on the isolated test set. The house numbers correctly recognized by the CNN mostly contain only or digits. This simple experiment gives an intuition of how hard the problem is and how important the sequential information is.
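The post-processing step of this baseline can be sketched as below. The function name `merge_repeats` and the example frame predictions are made up for illustration.

```python
from itertools import groupby


def merge_repeats(frame_predictions):
    """Collapse each run of identical consecutive predictions into one label."""
    return [label for label, _ in groupby(frame_predictions)]


if __name__ == "__main__":
    # Frame-wise predictions over a house number reading 1, 2, 5.
    print(merge_repeats([1, 1, 1, 2, 2, 5, 5, 5]))
    # If every frame of a doubled digit yields the same prediction,
    # the two digits collapse into a single label.
    print(merge_repeats([1, 1, 1, 1]))
```

Without a separator class, this rule by itself cannot distinguish one wide digit from two identical adjacent digits, which is one property of the merging step worth keeping in mind when reading the baseline result.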

HMM based models

We compare our method with two kinds of HMM based models: one is the GMM-HMM model, the other is the hybrid CNN-HMM model[7]. The number of mixture components is an important factor for GMM-HMM. We evaluate different numbers of Gaussian mixture components. The sequence accuracy stops improving at . The model tends to overfit when we keep increasing the number of mixture components.

Hybrid CNN-HMM improves GMM-HMM by using CNN as the observation model. The training process is an iterative procedure, where network retraining is alternated with HMM re-alignment to generate more accurate state assignments.


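| Error criterion | CRNN-1-layer epoch | CRNN-1-layer accuracy | CRNN-2-layers epoch | CRNN-2-layers accuracy |
|---|---|---|---|---|
| CTC error | 12 | 0.84 | 9 | 0.90 |
| label error | 15 | 0.86 | 10 | 0.91 |

TABLE I: Comparison of different CRNN architectures.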


Recent developments in deep learning show that depth is an important factor for feed-forward models to gain stronger representation capability[23]. We evaluated two architectures of CRNN. One uses one hidden layer, denoted as CRNN-1-layer; the other uses two hidden layers, denoted as CRNN-2-layers. CRNN-1-layer has LSTM memory cells, CRNN-2-layers has for the first hidden layer and for the second. Figure 1 shows the architecture of CRNN-2-layers.

Experimental results are presented in Table I. The epoch column lists the epoch at which the best model is reached with respect to each error criterion. The accuracy column shows the sequence accuracy of the best model on the test set.

As can be seen, the deeper architecture not only performs better but also needs fewer training epochs. This is a surprising finding, as intuitively the deeper model has more parameters, which should make it more difficult to train.


| Model | Accuracy |
|---|---|
| CNN | 0.23 |
| GMM-HMM | 0.56 |
| Hybrid CNN-HMM | 0.81 |
| CRNN | 0.91 |

TABLE II: Performance comparison of different models.

A performance comparison of CRNN with the other models is presented in Table II. As shown by the experiments, CRNN outperforms the CNN baseline and both HMM based models.

VI Conclusion

We have presented the Convolutional Recurrent Neural Network (CRNN) model for scene text recognition. It uses a CNN to extract robust high-level features and an RNN to learn sequence dependences. The model eliminates the need for character segmentation in scene text recognition. We apply our method to street view images and achieve promising results. CRNN performs much better than HMM based methods. However, the CNN is still trained separately. While the recognition process is segmentation-free, we still need cropped character samples for training the CNN. To eliminate the need for cropped samples in training, we plan to investigate using forced alignment with GMM-HMM to bootstrap CRNN. A better method would be to directly perform joint training of the CNN and RNN from scratch. Another promising direction would be to investigate the potential of stacking more hidden LSTM layers in the RNN.


This work was supported by the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (No. 2015A04).


  • [1] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “Photoocr: Reading text in uncontrolled conditions.” in ICCV, 2013, pp. 785–792.
  • [2] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, p. 5, 2011.
  • [3] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach.   Springer Science & Business Media, 2012, vol. 247.
  • [4] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30–42, 2012.
  • [5] U.-V. Marti and H. Bunke, “Using a statistical language model to improve the performance of an hmm-based cursive handwriting recognition system,” International journal of Pattern Recognition and Artificial intelligence, vol. 15, no. 01, pp. 65–90, 2001.
  • [6] S. Espana-Boquera, M. J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, “Improving offline handwritten text recognition with hybrid hmm/ann models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 4, pp. 767–779, 2011.
  • [7] Q. Guo, D. Tu, J. Lei, and G. Li, “Hybrid cnn-hmm model for street view house number recognition,” in ACCV 2014 Workshops, ser. Lecture Notes in Computer Science, C. Jawahar and S. Shan, Eds., 2015, vol. 9008, pp. 303–315.
  • [8] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.” in ICML, 2006, pp. 369–376.
  • [9] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks.” in NIPS, 2008, pp. 545–552.
  • [10] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks.” in ICML, 2014, pp. 1764–1772.
  • [11] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description.” 2014.
  • [12] A. Karpathy and F.-F. Li, “Deep visual-semantic alignments for generating image descriptions.” 2014.
  • [13] A. Vinciarelli, S. Bengio, and H. Bunke, “Offline recognition of unconstrained handwritten texts using hmms and statistical language models.” 2004, pp. 709–720.
  • [14] M. Kozielski, P. Doetsch, and H. Ney, “Improvements in rwth’s system for off-line handwriting recognition.” in ICDAR, 2013, pp. 935–939.
  • [15] T. Bluche, H. Ney, and C. Kermorvant, “Feature extraction with convolutional neural networks for handwritten word recognition.” in ICDAR, 2013, pp. 285–289.
  • [16] K. Elagouni, C. Garcia, F. Mamalet, and P. Sébillot, “Text recognition in videos using a recurrent connectionist approach.” in ICANN (2), 2012, pp. 172–179.
  • [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 1097–1105.
  • [19] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” Signal Processing, IEEE Transactions on, vol. 45, no. 11, pp. 2673–2681, Nov 1997.
  • [20] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks.” in ICML (3), 2013, pp. 1310–1318.
  • [21] S. Hochreiter and J. Schmidhuber, “Long short-term memory.” 1997, pp. 1735–1780.
  • [22] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines.” in ICML, 2010, pp. 807–814.
  • [23] R. Pascanu, G. Montufar, and Y. Bengio, “On the number of response regions of deep feed forward networks with piece-wise linear activations,” in International Conference on Learning Representations 2014 (ICLR 2014), Banff, Alberta, Canada, 2013.