An Efficient End-to-End Neural Model for Handwritten Text Recognition

07/20/2018 ∙ by Arindam Chowdhury, et al. ∙ 0

Offline handwritten text recognition from images is an important problem for enterprises attempting to digitize large volumes of handmarked scanned documents/reports. Deep recurrent models such as Multi-dimensional LSTMs have been shown to yield superior performance over traditional Hidden Markov Model based approaches that suffer from the Markov assumption and therefore lack the representational power of RNNs. In this paper we introduce a novel approach that combines a deep convolutional network with a recurrent Encoder-Decoder network to map an image to a sequence of characters corresponding to the text present in the image. The entire model is trained end-to-end using Focal Loss, an improvement over the standard Cross-Entropy loss that addresses the class imbalance problem, inherent to text recognition. To enhance the decoding capacity of the model, Beam Search algorithm is employed which searches for the best sequence out of a set of hypotheses based on a joint distribution of individual characters. Our model takes as input a downsampled version of the original image thereby making it both computationally and memory efficient. The experimental results were benchmarked against two publicly available datasets, IAM and RIMES. We surpass the state-of-the-art word level accuracy on the evaluation set of both datasets by 3.5




1 Introduction

Handwritten Text Recognition (HTR) has been a major research problem for several decades [Bunke(2003)] [Vinciarelli(2002)] and has gained recent impetus due to the potential value that can be unlocked from extracting the data stored in handwritten documents and exploiting it via modern AI systems. Traditionally, HTR is divided into two categories: offline and online recognition. In this paper, we consider the offline recognition problem which is considerably more challenging as, unlike the online mode which exploits attributes like stroke information and trajectory in addition to the text image, offline mode has only the image available for feature extraction.

Historically, HTR has been formulated as a sequence matching problem: a sequence of features extracted from the input data is matched to an output sequence composed of characters from the text, primarily using Hidden Markov Models ( HMM ) [El-Yacoubi et al.(1999)El-Yacoubi, Gilloux, Sabourin, and Suen][Marti and Bunke(2001)]. However, HMMs fail to make use of the context information in a text sequence, due to the Markovian assumption that each observation depends only on the current state. This limitation was addressed by the use of Recurrent Neural Networks ( RNN ) which encode the context information in the hidden states. Nevertheless, the use of RNN was limited to scenarios in which the individual characters in a sequence could be segmented, as the RNN objective functions require a separate training signal at each timestep. Improvements were proposed in form of models that have a hybrid architecture combining HMM with RNN [Bourlard and Morgan(2012)] [Bengio(1999)], but major breakthrough came in [Graves et al.(2009)Graves, Liwicki, Fernández, Bertolami, Bunke, and Schmidhuber] which proposed the use of Connectionist Temporal Classification ( CTC ) [Graves et al.(2006)Graves, Fernández, Gomez, and Schmidhuber] in combination with RNN. CTC allows the network to map the input sequence directly to a sequence of output labels, thereby doing away with the need of segmented input.

The performance of RNN-CTC model was still limited as it used handcrafted features from the image to construct the input sequence to the RNN. Multi-Dimensional Recurrent Neural Network (MDRNN) [Graves and Schmidhuber(2009)] was proposed as the first end-to-end model for HTR. It uses a hierarchy of multi-dimensional RNN layers that process the input text image along both axes thereby learning long term dependencies in both directions. The idea is to capture the spatial structure of the characters along the vertical axis while encoding the sequence information along the horizontal axis. Such a formulation is computationally expensive as compared to standard convolution operations which extract the same visual features as shown in [Puigcerver(2017)], which proposed a composite architecture that combines a Convolutional Neural Network ( CNN ) with a deep one-dimensional RNN-CTC model and holds the current state-of-the-art performance on standard HTR benchmarks.

In this paper, we propose an alternative approach which combines a convolutional network as a feature extractor with two recurrent networks on top for sequence matching. We use the RNN based Encoder-Decoder network [Cho et al.(2014)Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio] [Sutskever et al.(2014)Sutskever, Vinyals, and Le], which essentially performs the task of generating a target sequence from a source sequence and has been extensively employed for Neural Machine Translation ( NMT ). Our model incorporates a set of improvements in architecture, training and inference process in the form of Batch & Layer Normalization, Focal Loss and Beam Search, to name a few. Random distortions were introduced in the inputs as a regularizing step while training. In particular, we make the following key contributions:

Figure 1: Model Overview: (a) represents the generation of feature sequence from convolutional feature maps and (b) shows the mapping of the visual feature sequence to a string of output characters.
  • We present an end-to-end neural network architecture composed of convolutional and recurrent networks to perform efficient offline HTR on images of text lines.

  • We demonstrate that the Encoder-Decoder network with Attention provides significant boost in accuracy as compared to the standard RNN-CTC formulation for HTR.

  • We show that a reduction in computations and in memory consumption can be achieved by downsampling the input images to almost a sixteenth of the original size, without compromising the overall accuracy of the model.

2 Proposed Method

Our model is composed of two connectionist components: a Feature Extraction module that takes as input an image of a line of text and extracts visual features, and a Sequence Learning module that maps the visual features to a sequence of characters. A general overview of the model is shown in Figure 1. It consists of differentiable neural modules with a seamless interface, allowing fast and efficient end-to-end training.

2.1 Feature Extraction

Convolutional Networks have proven to be quite effective in extracting rich visual features from images, by automatically learning a set of non-linear transformations, essential for a given task. Our aim was to generate a sequence of features which would encode local attributes in the image while preserving the spatial organization of the objects in it. Towards this end, we use a standard CNN ( without the fully-connected layers ) to transform the input image into a dense stack of feature maps. A specially designed Map-to-Sequence [Shi et al.(2017)Shi, Bai, and Yao] layer is put on top of the CNN to convert the feature maps into a sequence of feature vectors, by depth-wise detaching columns from it. It means that the $i$-th feature vector is constructed by concatenating the $i$-th columns of all the feature maps. Due to the translational invariance of convolution operations, each column represents a vertical strip in the image ( termed as Receptive field ), moving from left to right, as shown in Figure 2. Before feeding to the network, all the images are scaled to a fixed height while the width is scaled maintaining the aspect ratio of the image. This ensures that all the vectors in the feature sequence conform to the same dimensionality without putting any restriction on the sequence length.
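The column-wise construction described above can be sketched in a few lines of plain Python. This is only an illustration under assumed conventions: the function name `map_to_sequence` and the nested-list layout (maps, then rows, then columns) are hypothetical and not taken from the paper.

```python
def map_to_sequence(feature_maps):
    """Convert C feature maps of shape H x W into W vectors of length C * H.

    feature_maps: list of C maps, each a list of H rows, each row a list of W values.
    (Hypothetical layout chosen for this sketch.)
    """
    num_maps = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    sequence = []
    for col in range(width):
        vector = []
        for m in range(num_maps):
            # Append the col-th column of map m, read top to bottom.
            vector.extend(feature_maps[m][row][col] for row in range(height))
        sequence.append(vector)
    return sequence


# Two 2 x 3 feature maps become a sequence of 3 vectors of length 4.
maps = [
    [[1, 2, 3],
     [4, 5, 6]],
    [[7, 8, 9],
     [10, 11, 12]],
]
seq = map_to_sequence(maps)
```

The sequence length equals the feature-map width, so wider images simply yield longer sequences while every vector keeps the same dimensionality.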

Figure 2: Visualization of feature sequence generation process and the possible Receptive fields of the feature vectors. Intermediate feature maps are stacked depth-wise in correspondence with the convolutional filters that generate them while the final feature maps are stacked row-wise.

2.2 Sequence Learning

The visual feature sequence extracted by the CNN is used to generate a target sequence composed of character tokens corresponding to the text present in the image. Our main aim, therefore, was to map a variable length input sequence into another variable length output sequence by learning a suitable relationship between them. In the Encoder-Decoder framework, the model consists of two recurrent networks, one of which constructs a compact representation based on its understanding of the input sequence while the other uses the same representation to generate the corresponding output sequence.

The encoder takes as input the source sequence $x = (x_1, \dots, x_{T_s})$, where $T_s$ is the sequence length, and generates a context vector $c$, representative of the entire sequence. This is achieved by using an RNN such that, at each timestep $t$, the hidden state is $h_t = g(x_t, h_{t-1})$ and, finally, $c = s(h_1, \dots, h_{T_s})$, where $g$ and $s$ are some non-linear functions. Such a formulation using a basic RNN cell is quite simple yet proves to be ineffective while learning even slightly long sequences due to the vanishing gradient effect [Hochreiter et al.(2001)Hochreiter, Bengio, Frasconi, Schmidhuber, et al.][Bengio et al.(1994)Bengio, Simard, and Frasconi] caused by repeated multiplications of gradients in an unfolded RNN. Instead, we use Long Short Term Memory ( LSTM ) [Hochreiter and Schmidhuber(1997)] cells, for their ability to better model and learn long-term dependencies due to the presence of a memory cell $c_t$. The final cell state $c_{T_s}$ is used as the context vector of the input sequence. In spite of its enhanced memory capacity, an LSTM cell is unidirectional and can only learn past context. To utilize both forward and backward dependencies in the input sequence, we make the encoder bidirectional [Schuster and Paliwal(1997)] by combining two LSTM cells, which process the sequence in opposite directions, as shown in Figure 3. The outputs of the two cells, forward $\overrightarrow{h_t}$ and backward $\overleftarrow{h_t}$, are concatenated at each timestep to generate a single output vector $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. Similarly, the final cell state is formed by concatenating the final forward and backward states, $c_{T_s} = [\overrightarrow{c_{T_s}}; \overleftarrow{c_{T_s}}]$.

Figure 3: RNN Encoder-Decoder Network with Attention layer. The encoder is a bidirectional LSTM whose outputs are concatenated at each step, while the decoder is a unidirectional LSTM with a Softmax layer on top. The character inputs are sampled from an embedding layer.

The context vector $c_{T_s}$ is fed to a second recurrent network, called the decoder, which is used to generate the target sequence. Following an affine transformation $c_1 = W_a c_{T_s}$, where $W_a$ is the transformation matrix, $c_1$ is used to initialize the cell state of the decoder. Unlike the encoder, the decoder is unidirectional as its purpose is to generate, at each timestep $t$, a token $y_t$ of the target sequence, conditioned on $c_1$ and its own previous predictions $\{y_1, \dots, y_{t-1}\}$. Basically, it learns a conditional probability distribution

$$p(y) = \prod_{t=1}^{T_d} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c_1)$$

over the target sequence $y = (y_1, \dots, y_{T_d})$, where $T_d$ is the sequence length. Using an RNN, each conditional is modeled as $p(y_t \mid \{y_1, \dots, y_{t-1}\}, c_1) = \mathrm{softmax}(g(y_{t-1}, s_{t-1}, c_1))$, where $g$ is a non-linear function and $s_{t-1}$ is the decoder RNN hidden state. As in the case of the encoder, we employ an LSTM cell to implement $g$.

The above framework proves to be quite efficient in learning a sequence-to-sequence mapping but suffers from a major drawback nonetheless. The context vector that forms a link between the encoder and the decoder often becomes an information bottleneck [Cho et al.(2014)Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio]. Especially for long sequences, the context vector tends to forget essential information that it saw in the first few timesteps. Attention models are an extension to the standard encoder-decoder framework in which the context vector is modified at each timestep based on the similarity of the previous decoder hidden state $s_{i-1}$ with the sequence of annotations $(h_1, \dots, h_{T_s})$ generated by the encoder for a given input sequence. As we use a bidirectional encoder, the Bahdanau [Bahdanau et al.(2014)Bahdanau, Cho, and Bengio] attention mechanism becomes a natural choice for our model. The context vector at the $i$-th decoder timestep is given by

$$c_i = \sum_{j=1}^{T_s} \alpha_{ij} h_j$$

The weight $\alpha_{ij}$ for each $h_j$ is given as

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_s} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

Here, $a$ is a feedforward network trained along with the other components.

Therefore, the context vector is modified as a weighted sum of the input annotations, where the weights measure how similar the output at position $i$ is to the input around position $j$. Such a formulation helps the decoder to learn local correspondence between the input and output sequences in tandem with a global context, which becomes especially useful in the case of longer sequences. Additionally, we incorporate the attention input feeding approach used in the Luong [Luong et al.(2015)Luong, Pham, and Manning] attention mechanism, in which the context vector from the previous timestep is concatenated with the input of the current timestep. It helps in building a local context, further augmenting the predictive capacity of the network.
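To make the weighted-sum construction concrete, the following is a minimal pure-Python sketch of additive attention for a single decoder step. The scoring network here is an assumed one-hidden-layer form, v · tanh(W_s s + W_h h); the names `additive_score`, `attend`, `w_s`, `w_h` and `v`, as well as all numeric values, are hypothetical and serve only as illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(prev_state, annotation, w_s, w_h, v):
    """Assumed scorer: v . tanh(W_s s + W_h h), a one-hidden-layer feedforward net."""
    hidden = []
    for row_s, row_h in zip(w_s, w_h):
        pre = sum(a * b for a, b in zip(row_s, prev_state))
        pre += sum(a * b for a, b in zip(row_h, annotation))
        hidden.append(math.tanh(pre))
    return sum(a * b for a, b in zip(v, hidden))


def attend(prev_state, annotations, w_s, w_h, v):
    """Return (weights, context): normalised scores and the weighted sum of annotations."""
    scores = [additive_score(prev_state, h, w_s, w_h, v) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return weights, context


# Toy values: three encoder annotations of size 2 and one decoder state.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_state = [0.5, -0.5]
w_s = [[1.0, 0.0], [0.0, 1.0]]
w_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
weights, context = attend(prev_state, annotations, w_s, w_h, v)
```

Because the weights form a probability distribution over encoder positions, the context vector always lies inside the convex hull of the annotations.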

We train the model by minimizing a cumulative categorical cross-entropy ( CE ) loss calculated independently for each token in a sequence and then summed up. For a target sequence $y = (y_1, \dots, y_{T_d})$, the loss is defined as $CE(y) = -\sum_{t=1}^{T_d} \log p(y_t)$, where $p(y_t)$ is the probability of the true class at timestep $t$. The input to the decoder at each timestep is an embedding vector, from a learnable embedding layer, corresponding to the gold prediction from the previous step, until the end-of-sequence or eos token is emitted. At this point, a step of gradient descent is performed across the recurrent network using Back Propagation Through Time ( BPTT ) followed by back propagation into the CNN to update the network parameters.

Although CE loss is a powerful measure of network performance in a complex multi-class classification scenario, it often suffers from the class imbalance problem. In such a situation, the CE loss is mostly composed of the easily classified examples, which dominate the gradient. Focal Loss [Lin et al.(2017)Lin, Goyal, Girshick, He, and Dollár] addresses this problem by assigning suitable weights to the contribution of each instance in the final loss. It is defined as $FL(p) = -(1 - p)^{\gamma} \log(p)$, where $p$ is the true-class probability and $\gamma$ is a tunable focusing parameter. Such a formulation ensures that the easily classified examples get smaller weights than the hard examples in the final loss, thereby making larger updates for the hard examples. Our primary motivation to use focal loss arises from the fact that, in every language, some characters in the alphabet have higher chances of occurring in regular text than the rest. For example, vowels occur with a higher frequency in English text than a character like z. Therefore, to make our model robust to such an inherent imbalance, we formulate our sequence loss as $FL(y) = -\sum_{t=1}^{T_d} (1 - p(y_t))^{\gamma} \log p(y_t)$, with $\gamma$ tuned to the value that worked best for our model.
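The down-weighting effect of the focusing parameter can be checked numerically with a small sketch. The true-class probabilities and the choice of gamma = 2.0 below are illustrative assumptions, not values taken from the experiments; setting gamma = 0 recovers the plain cross-entropy sum.

```python
import math


def focal_loss(true_class_probs, gamma):
    """Sequence focal loss: -sum_t (1 - p_t)^gamma * log(p_t)."""
    return -sum((1.0 - p) ** gamma * math.log(p) for p in true_class_probs)


probs = [0.9, 0.6, 0.1]            # illustrative true-class probabilities
ce = focal_loss(probs, gamma=0.0)  # gamma = 0 reduces to cross-entropy
fl = focal_loss(probs, gamma=2.0)  # gamma = 2.0 is an illustrative choice

# Share of the total loss contributed by the easy (p = 0.9) token.
easy_share_ce = -math.log(0.9) / ce
easy_share_fl = -((1.0 - 0.9) ** 2.0) * math.log(0.9) / fl
```

With these numbers the easy token accounts for a few percent of the cross-entropy but well under one percent of the focal loss, so the hard token dominates the gradient.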

To speed up training, we employ mini-batch gradient descent. Here, we optimize a batch loss which is a straightforward extension of the sequence loss, calculated as

$$FL_{batch} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T_d} (1 - p(y_{it}))^{\gamma} \log p(y_{it})$$

where $M$ is the batch size and $y_{it}$ represents the $t$-th timestep of the $i$-th instance of the batch.

For any sequence model, the simplest approach for inference is to perform Greedy Decoding ( GD ), which emits, at each timestep, the class with the highest probability from the softmax distribution as the output at that instance. GD operates with the underlying assumption that the best sequence is composed of the most likely tokens at each timestep, which may not necessarily be true. A more refined decoding algorithm is Beam Search, which aims to find the best sequence by maximizing the joint distribution

$$p(y_1, y_2, \dots, y_{T_d}) = p(y_1) \times p(y_2 \mid y_1) \times \dots \times p(y_{T_d} \mid \{y_1, \dots, y_{T_d - 1}\})$$

over a set of hypotheses, known as the beam. The algorithm selects the top-$K$ classes, where $K$ is the beam size, at the first timestep and obtains an output distribution individually for each of them at the next timestep. Out of the $K \times N$ hypotheses, where $N$ is the output vocabulary size, the top-$K$ are chosen based on the product $p(y_1) \times p(y_2 \mid y_1)$. This process is repeated till all the $K$ rays in the beam emit the eos token. The final output of the decoder is the ray having the highest value of $p(y_1, y_2, \dots, y_{T_d})$ in the beam.
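The procedure above can be sketched compactly in Python. This is a simplified illustration, not the decoder used in the experiments: the `step_fn` callback, the toy probability table and the `max_len` cap are assumptions, and log-probabilities are summed instead of multiplying probabilities, which gives the same ranking while avoiding numerical underflow.

```python
import math


def beam_search(step_fn, beam_size, eos, max_len):
    """Return (sequence, log-probability) of the best hypothesis found.

    step_fn(prefix) must return a dict {token: probability} for the next token,
    given the tuple of tokens emitted so far.
    """
    beam = [((), 0.0)]  # each ray is (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished rays are carried over
                continue
            for token, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the top-K hypotheses by joint log-probability.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(prefix[-1] == eos for prefix, _ in beam):
            break
    return beam[0]


def toy_model(prefix):
    """Hypothetical next-token distribution in which the greedy choice is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    return table.get(prefix, {"<eos>": 1.0})


greedy_seq, greedy_logp = beam_search(toy_model, beam_size=1, eos="<eos>", max_len=5)
beam_seq, beam_logp = beam_search(toy_model, beam_size=2, eos="<eos>", max_len=5)
```

With a beam of size 1 the search degenerates to greedy decoding and commits to "a" (joint probability 0.3), whereas a beam of size 2 keeps "b" alive and recovers the sequence starting with "b", "z" (joint probability 0.36).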

3 Implementation Details

3.1 Image Preprocessing

The inputs to our system are images that contain a line of handwritten text, which may or may not be a complete sentence. The images are single-channel ( grayscale ). We invert the images prior to training so that the foreground is composed of higher intensities on a dark background, making it slightly easier for the CNN activations to learn. We also scale down the input images to a smaller fixed height, while the width is scaled to maintain the aspect ratio of the original image, to reduce computations and memory requirements as shown in Table 2. As we employ mini-batch training, uniformity in dimensions is maintained in a batch by padding the images with background pixels on both left and right to match the width of the widest image in the batch. In preliminary experiments, our model had shown a tendency to overfit the training data. To prevent such an outcome, as a further regularization step, we introduced random distortions in the training images so that, ideally, in every iteration the model would process a previously unseen set of inputs. Every training batch is subjected to a set of four operations, viz. translation, rotation, shear and scaling. Parameters for all the operations are sampled independently from a Gaussian distribution. The operations and the underlying distribution were chosen by observing a few examples at the beginning of experimentation and were fixed from then on.
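One way to realise such distortions is to sample a single affine matrix per batch and apply it to the pixel coordinates. The sketch below only builds the matrix; the standard deviations, the function names and the composition order (scaling, then rotation applied to a horizontal shear, then translation) are assumptions made for illustration, since the text does not specify them.

```python
import math
import random


def sample_affine(rng, sigma_shift=1.0, sigma_angle=0.02, sigma_shear=0.02, sigma_scale=0.05):
    """Sample a 2 x 3 affine matrix combining translation, rotation, shear and scaling.

    All sigma_* defaults are hypothetical. Each parameter is drawn independently
    from a Gaussian (zero mean, except scaling which is centred on 1).
    """
    tx, ty = rng.gauss(0.0, sigma_shift), rng.gauss(0.0, sigma_shift)
    angle = rng.gauss(0.0, sigma_angle)
    shear = rng.gauss(0.0, sigma_shear)
    scale = 1.0 + rng.gauss(0.0, sigma_scale)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Linear part = scale * rotation matrix @ horizontal shear matrix.
    a = scale * cos_a
    b = scale * (cos_a * shear - sin_a)
    c = scale * sin_a
    d = scale * (sin_a * shear + cos_a)
    return [[a, b, tx], [c, d, ty]]


def apply_affine(matrix, x, y):
    """Map a single (x, y) coordinate through the affine matrix."""
    (a, b, tx), (c, d, ty) = matrix
    return a * x + b * y + tx, c * x + d * y + ty


rng = random.Random(0)                              # seeded for reproducibility
identity = sample_affine(rng, 0.0, 0.0, 0.0, 0.0)   # zero variance gives no distortion
distortion = sample_affine(rng)
```

In practice the sampled matrix would be handed to an image-warping routine of whichever imaging library is in use; that step is omitted here to keep the sketch dependency-free.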

Conv. filters 16 32 64 64 128 128 128
Maxpool( x ) ✓ ✓ ✕ ✕ ✕ ✕ ✕
Maxpool( x ) ✕ ✕ ✕ ✕ ✓ ✓ ✕
Table 1: Network configuration

Size Computations ( Tflops ) Memory ( GB )
Table 2: Effect of Image Downsampling on Model Performance while Training

3.2 Convolutional Network

Our model consists of seven convolutional ( conv ) layers stacked serially, with Leaky ReLU [Maas et al.(2013)Maas, Hannun, and Ng] activations. The first six layers share one kernel size and use input padding, while the final layer uses its own kernel size without input padding. Kernel strides are the same in both vertical and horizontal directions. Activations of the conv layers are Batch Normalized [Ioffe and Szegedy(2015)], to prevent internal covariate shift and thereby speed up training, before propagating to the next layer. Pooling operations are performed on the activations of certain conv layers to reduce the dimensionality of the input. A total of four max-pooling layers are used in our model, two of which use a kernel shape chosen to preserve the horizontal spatial distribution of text, while the other two use standard non-overlapping kernels. Table 1 shows the network configuration used in each conv layer.
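How such a stack shrinks an image can be traced with the usual output-size formula, out = floor((in + 2·padding − kernel) / stride) + 1. The concrete kernel sizes, paddings, pooling shapes and the 32 × 128 input below are hypothetical placeholders chosen for illustration; only the layer order (pooling after conv layers 1, 2, 5 and 6) follows Table 1.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling operation along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, layers):
    """Apply (kernel_h, kernel_w, stride_h, stride_w, padding) ops in order."""
    shapes = [(height, width)]
    for k_h, k_w, s_h, s_w, pad in layers:
        height = conv_out(height, k_h, s_h, pad)
        width = conv_out(width, k_w, s_w, pad)
        shapes.append((height, width))
    return shapes


# Hypothetical settings, NOT taken from the paper:
CONV_PADDED = (3, 3, 1, 1, 1)   # size-preserving convolution
CONV_FINAL = (2, 2, 1, 1, 0)    # unpadded final convolution
POOL_BOTH = (2, 2, 2, 2, 0)     # halves height and width
POOL_HEIGHT = (2, 1, 2, 1, 0)   # halves height only, keeps width

layers = [
    CONV_PADDED, POOL_BOTH,      # conv 1 + pool
    CONV_PADDED, POOL_BOTH,      # conv 2 + pool
    CONV_PADDED,                 # conv 3
    CONV_PADDED,                 # conv 4
    CONV_PADDED, POOL_HEIGHT,    # conv 5 + pool
    CONV_PADDED, POOL_HEIGHT,    # conv 6 + pool
    CONV_FINAL,                  # conv 7
]
shapes = trace_shapes(32, 128, layers)
```

Under these assumed settings the height collapses to 1 while the width is reduced far less, which is the property a height-only pooling kernel is meant to provide: the horizontal axis stays long enough to serve as the sequence dimension.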

3.3 RNN Encoder-Decoder

The encoder and decoder both use LSTM cells. We allow both networks to extend to multiple layers in depth to enhance their learning capacity. Residual connections [Kim et al.(2017)Kim, El-Khamy, and Lee] are created to facilitate gradient flow across the recurrent units to the layers below. Further, we use dropout [Pham et al.(2014)Pham, Bluche, Kermorvant, and Louradour] along depth connections to regularize the network, without modifying the recurrent connections, thereby preserving the network’s capacity to capture long-term dependencies. To prevent covariate shift due to mini-batch training, the activities of the cell neurons are Layer Normalized [Ba et al.(2016)Ba, Kiros, and Hinton], which proved to be quite effective in stabilizing the hidden state dynamics of the network. For the final prediction, we apply a linear transformation to the RNN outputs to generate logits of dimension $N$, where $N$ is the output vocabulary size. A Softmax operation is performed on the logits to define a probability distribution over the output vocabulary at each timestep.

3.4 Training & Inference

In our experiments, the model is trained in mini-batches using the Adam [Kingma and Ba(2014)] algorithm as the optimizer. The model was trained until the best validation accuracy was achieved. For inference, we use a beam size equal to the number of classes in the output.

4 Dataset

We use the following publicly available datasets to evaluate our method.
  • IAM Handwriting Database ( English ) [Marti and Bunke(2002)] is composed of pages of text written by multiple writers and partitioned into writer-independent training, validation and test sets of segmented lines. Its character set includes whitespace.

  • RIMES Database ( French ) [Augustin et al.(2006)Augustin, Carré, Grosicki, Brodin, Geoffrois, and Prêteux] consists of scanned pages of mail handwritten by multiple people, segmented into lines for training and testing. The original database doesn’t provide a separate validation set, so we randomly sampled a fraction of the training lines for validation. The final partition of the dataset therefore contains training, validation and test lines.

5 Experiments

We evaluate our model on the evaluation partition of both datasets using mean Character Error Rate ( CER ) and mean Word Error Rate ( WER ) as performance metrics, each determined as the mean over all text lines.

Our experiments were performed using an Nvidia Tesla K40 GPU. The inference time for the model is seconds.
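For readers who want to reproduce the metrics, the sketch below computes per-line CER and WER with the commonly used edit-distance definition and averages them over lines. This definition is an assumption: it is the standard one in the HTR literature, but the exact formulas used in the experiments are not reproduced here. The example line pairs are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def mean_cer_wer(pairs):
    """Average character- and word-level error rates over (reference, hypothesis) lines."""
    cers = [error_rate(ref, hyp) for ref, hyp in pairs]
    wers = [error_rate(ref.split(), hyp.split()) for ref, hyp in pairs]
    return sum(cers) / len(cers), sum(wers) / len(wers)


# Made-up example lines: one wrong character in the first line, none in the second.
pairs = [("the cat sat", "the cot sat"), ("hello world", "hello world")]
cer, wer = mean_cer_wer(pairs)
```

A single wrong character costs 1/11 at the character level but 1/3 at the word level for the first line, which illustrates why WER is usually much higher than CER.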

5.1 Results

Table 3 shows the effect of Layer Normalization ( LN ), Focal Loss and Beam Search on the base model. LN reduced the error rates of the base model by 2.3 to 4.3 percentage points across the two datasets. The use of Focal Loss reduced them by a further 1.7 to 2.4 points, but the major improvement was achieved by replacing greedy decoding with beam search, which lowered the error rates by another 3.3 to 4.4 points.

System IAM CER(%) IAM WER(%) RIMES CER(%) RIMES WER(%)
Baseline 17.4 25.5 12.0 19.1
+ LN 13.1 22.9 9.7 15.8
+ LN + Focal Loss 11.4 21.1 7.3 13.5
+ LN + Focal Loss + Beam Search 8.1 16.7 3.5 9.6
Table 3: Effect of Layer Normalization, Focal Loss & Beam Search on Model Performance

5.2 Comparison with the state-of-the-art

We provide a comparison of the accuracy of our method with previously reported algorithms in Table 4 and a comparison of the efficiency, in terms of maximum GPU memory consumption and number of trainable parameters, with the state-of-the-art in Table 5.

Methods IAM CER(%) IAM WER(%) RIMES CER(%) RIMES WER(%)
2DLSTM [Graves and Schmidhuber(2009)], reported by [Puigcerver(2017)] 8.3 27.5 4.0 17.7
CNN-1DLSTM-CTC [Puigcerver(2017)] 6.2 20.2 2.6 10.7
Our method 8.1 16.7 3.5 9.6
Table 4: Comparison with previous methods in terms of accuracy

Methods Memory ( GB ) # of Parameters ( Mi )
CNN-1DRNN-CTC [Puigcerver(2017)] 10.5 9.3
Our method 7.9 4.6
Table 5: Comparison with state-of-the-art in terms of efficiency

Although we beat the state-of-the-art [Puigcerver(2017)] in word level accuracy, our character level accuracy is slightly lower in comparison. This implies that our model is prone to making additional spelling mistakes in words that already contain mislabeled characters, but overall makes fewer spelling mistakes at the aggregate word level. This arises from the inference behavior of our model, which uses its previous predictions to generate the current output; as a result, a prior mistake can trigger a sequence of future errors. The higher word accuracy, however, indicates that our model most often gets entire words in a line correct. Essentially, the model is quite accurate at identifying words, but when a mistake does occur, the word level prediction is off by a larger number of characters.

6 Summary and Extensions

We propose a novel framework for efficient handwritten text recognition that combines the merits of two extremely powerful deep neural networks. Our model substantially exceeds the performance of all the previous methods on one public dataset and beats them by a reasonable margin on another. While the model performs satisfactorily on standard testing data, we intend to carry out further evaluations to ascertain its performance in completely unconstrained settings, with different writing styles and image quality.

An extension to the present method would be to develop a training procedure that optimizes a loss dependent on the correctness of a full sequence instead of a cumulative loss over independent characters, resulting in similar behavior of the model at training and inference. Also, a language model can be incorporated in the training scheme to further augment the performance of the model and correct mistakes, especially for rare sequences or words.