Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention

by Théodore Bluche, et al.

We present an attention-based model for end-to-end handwriting recognition. Our system does not require any segmentation of the input paragraph. The model is inspired by the differentiable attention models presented recently for speech recognition, image captioning or translation. The main difference is the covert and overt attention, implemented as a multi-dimensional LSTM network. Our principal contribution towards handwriting recognition lies in the automatic transcription without a prior segmentation into lines, which was crucial in previous approaches. To the best of our knowledge, this is the first successful attempt at end-to-end multi-line handwriting recognition. We carried out experiments on the well-known IAM Database. The results are encouraging and suggest that full paragraph transcription will be feasible in the near future.




1 Introduction

In offline handwriting recognition, the input is a variable-sized two-dimensional image, and the output is a sequence of characters. The cursive nature of handwriting makes it hard to first segment characters and then recognize them individually. Methods based on isolated characters were widely used in the nineties [3, 19], and were progressively replaced by the sliding window approach, in which features are extracted from vertical frames of the line image [18]. This method transforms the problem into a sequence-to-sequence transduction one, while potentially encoding the two-dimensional nature of the image by using convolutional neural networks [6] or by defining relevant features [5].

The recent advances in deep learning and the new architectures have made it possible to build systems that can handle both the 2D aspect of the input and the sequential aspect of the prediction. In particular, Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNNs [12]), associated with the Connectionist Temporal Classification (CTC [11]) objective function, yield low error rates and became the state-of-the-art model for handwriting recognition, winning most of the international evaluations in the field [7, 24, 28].

Current systems require segmented text lines, which are rarely readily available in real-world applications. A complete processing pipeline must therefore rely on automatic line segmentation algorithms in order to transcribe a document. We propose a model for multi-line recognition, built upon the recent “attention-based” methods, which have proven successful for machine translation [2], image caption generation [9, 29], or speech recognition [8, 10]. This proposal follows the longstanding and successful trend of making fewer and fewer segmentation hypotheses for handwriting recognition. The state of the art in text recognition moved from isolated character to isolated word recognition, then from isolated words to isolated lines, and we now propose to go further and recognize full pages without explicit segmentation.

Our domain of application bears similarities with the image captioning and speech recognition tasks. We aim at selecting the relevant parts of an input signal to sequentially generate text. As in image captioning, the inputs are images. As in speech recognition, we want to predict a monotonic and potentially long sequence of characters. In fact, we face here the challenges of both tasks. We need an attention mechanism that looks for content at a specific location and in a specific order. Moreover, in multi-line recognition, the reading order is encapsulated. For example, in Latin scripts, we have a primary order from left to right, and a secondary order from top to bottom. We deal here with a complex problem involving long two-dimensional sequences.

The system presented in this paper constitutes a whole new approach to handwriting recognition. Previous models make sequential predictions over the width of the image, with a horizontal step size fixed by the model. They have to resort to tricks to transform the 2D input image into a character sequence, such as sliding window and Hidden Markov Models, or collapsing representations and CTC, making it impossible to handle multiple lines of text. Those approaches need the text to be already segmented into lines to work properly. Moreover, the length of the predicted sequence, the reading order and the positions of predictions are directly embedded into the architecture. Here, the sequence generation and extraction of information from the multi-dimensional input are decoupled. The system may adjust the number of predictions and arbitrarily and iteratively select any part of the input. The first results show that this kind of model could remove the need for line segmentation in training and recognition. Furthermore, since the model makes no assumption about the reading order, it could be applied without any change to languages with a different reading order, such as Arabic (right-to-left, or even bidirectional when mixed with Latin scripts) or some Asian languages (top-to-bottom).

2 Handwriting Recognition with MDLSTM and CTC

Figure 1: MDLSTM-RNN for handwriting recognition, alternating LSTM layers in four directions and subsampling convolutions. After the last linear layer, the feature maps are collapsed in the vertical dimension, and character predictions are obtained after a softmax normalization (figure from [23]).

Multi-Dimensional Long Short-Term Memory recurrent neural networks (MDLSTM-RNNs) were introduced in [12] for unconstrained handwriting recognition. They generalize the LSTM architecture to multi-dimensional inputs. An overview of the architecture is shown in Figure 1. The image is presented to four MDLSTM layers, one layer for each scanning direction. The LSTM cell inner state and output are computed from the states and outputs of previous positions in the horizontal and vertical directions:

$$(h_{i,j}, q_{i,j}) = \mathrm{LSTM}(x_{i,j}, h_{i \pm 1, j}, h_{i, j \pm 1}, q_{i \pm 1, j}, q_{i, j \pm 1}) \tag{1}$$

where $x_{i,j}$ is the input feature vector at position $(i,j)$, and $h$ and $q$ represent the output and inner state of the cell, respectively. The $\pm 1$ choices in this recurrence depend on which of the four scanning directions is considered.
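To make the role of the scanning direction concrete, the following sketch runs a 2D recurrence over a small grid in each of the four directions. It is a simplification under stated assumptions: a plain tanh update stands in for the full LSTM cell (no gates, no separate inner state), and the names `scan_2d`, `di` and `dj` are illustrative and do not come from the paper.

```python
import numpy as np

def scan_2d(x, di=1, dj=1):
    """Toy 2D recurrence over x (shape H x W) in one scanning direction.

    di, dj in {+1, -1} pick the direction along rows and columns. The output
    at a position depends on the input there and on the outputs at the two
    previously visited neighbours. A tanh update replaces the LSTM cell
    (assumption made for brevity).
    """
    H, W = x.shape
    h = np.zeros((H, W))
    rows = range(H) if di == 1 else range(H - 1, -1, -1)
    for i in rows:
        cols = range(W) if dj == 1 else range(W - 1, -1, -1)
        for j in cols:
            pi, pj = i - di, j - dj  # previously visited row and column
            prev_row = h[pi, j] if 0 <= pi < H else 0.0
            prev_col = h[i, pj] if 0 <= pj < W else 0.0
            h[i, j] = np.tanh(x[i, j] + 0.5 * (prev_row + prev_col))
    return h

x = np.arange(6, dtype=float).reshape(2, 3) / 10.0
# One output map per scanning direction, as with the four MDLSTM layers.
outputs = {(di, dj): scan_2d(x, di, dj) for di in (1, -1) for dj in (1, -1)}
```

Each direction starts from a different corner of the grid, so the four maps together give every position access to context from the whole image.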

Each LSTM layer is followed by a convolutional layer, with a step size greater than one, subsampling the feature maps. As in usual convolutional architectures, the number of features computed by these layers increases as the size of the feature maps decreases. At the top of this network, there is one feature map for each label. A collapsing layer sums the features over the vertical axis, yielding a sequence of prediction vectors, effectively delaying the 2D to 1D transformation until just before the character predictions, which are normalized with a softmax activation.
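The collapse step described above can be sketched in NumPy as follows. This is a minimal illustration, assuming the top feature maps are stored as an array of shape (height, width, labels); the function name `collapse_and_softmax` and the random input are hypothetical.

```python
import numpy as np

def collapse_and_softmax(feature_maps):
    """feature_maps: array (H, W, C) with one map per label. Returns (W, C)."""
    scores = feature_maps.sum(axis=0)                    # sum over the vertical axis
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)      # softmax per horizontal position

rng = np.random.default_rng(0)
maps = rng.normal(size=(4, 10, 80))  # toy input: 4 rows, 10 columns, 80 labels
predictions = collapse_and_softmax(maps)
```

The result is a sequence of 10 probability vectors, one per horizontal position, which is why this design can only read a single line of text.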

In order to transform the sequence of predictions into a sequence of labels, an additional non-character – or blank – label is introduced, and a simple mapping is defined in order to obtain the final transcription. The connectionist temporal classification objective (CTC [11]), which considers all possible labellings of the sequence, is applied to train the network to recognize a line of text.
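The mapping used with CTC is usually defined as: merge consecutive repeated labels, then drop the blanks. A short sketch of that standard rule follows; the function name `ctc_collapse` and the choice of `-` as the blank symbol are assumptions made for illustration.

```python
def ctc_collapse(labels, blank="-"):
    """Map a frame-level label sequence to a transcription.

    Consecutive repeats are merged first, then blanks are removed, so a blank
    between two identical characters keeps both of them.
    """
    out = []
    prev = None
    for lab in labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

decoded = ctc_collapse("hh-e-ll-lo--")  # "hello"
```

The blank between the two runs of `l` is what allows the doubled letter to survive the merging step.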

The collapse/CTC paradigm already encodes the monotonicity of the prediction sequence, and makes it possible to recognize characters from 2D images. In this paper, we propose to go beyond single line recognition, and to directly predict character sequences, potentially spanning several lines in the input image. To do this, we replace the collapse and CTC framework with an attention-based decoder.

3 An Attention-Based Model for End-to-End Handwriting Recognition

The proposed model comprises an encoder of the 2D image of text, producing feature maps, and a sequential decoder that predicts characters from these maps. The decoder proceeds by combining the feature vectors of the encoded maps into a single vector, used to update an intermediate state and to predict the next character in the sequence. The weights of the linear combination of the feature vectors at every timestep are predicted by an attention network. In this work the attention is implemented with an MDLSTM network.

Figure 2: Proposed architecture. The encoder network has the same architecture as the standard network of Figure 1, except for the collapse and softmax layers. At each timestep, the feature maps, along with the previous attention map and state features, are fed to an MDLSTM network which outputs new attention weights at each position. The weighted sum of the encoder features is computed and given to the state LSTM, and to the decoder. The decoder also considers the new state features and outputs character probabilities.

The whole architecture, depicted in Figure 2, computes a fully differentiable function, whose parameters can be trained with backpropagation. The optimized cost is the negative log-likelihood of the correct transcription:

$$\mathcal{L}(I, y) = -\sum_{t} \log p(y_t \mid I) \tag{2}$$

where $I$ is the image, $y = y_1, \dots, y_T$ is the target character sequence and $p(\cdot \mid I)$ are the outputs of the network.
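As a small numerical illustration of this cost, the sketch below sums the negative log-probabilities that the network assigns to the correct character at each step. It assumes one output distribution per target character; the function name and the toy probabilities are made up for the example.

```python
import numpy as np

def negative_log_likelihood(probs, target):
    """probs: (T, C) per-timestep character distributions.
    target: length-T sequence of correct label indices."""
    probs = np.asarray(probs, dtype=float)
    target = np.asarray(target)
    picked = probs[np.arange(len(target)), target]  # probability of each correct label
    return -np.log(picked).sum()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = negative_log_likelihood(probs, [0, 1])  # -(log 0.7 + log 0.8)
```

The loss decreases towards zero as the probability of every correct character approaches one.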

In the previous architecture (Figure 1), we can see the MDLSTM network as a feature extraction module, and the last collapsing and softmax layers as a way to predict sequences. Taking inspiration from [9, 10, 29], we keep the MDLSTM network as an encoder of the image $I$ into high-level features:

$$e_{i,j} = \mathrm{Encoder}(I) \tag{3}$$

where $(i,j)$ are coordinates in the feature maps, and we apply an attention mechanism to read characters from them.

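The attention mechanism provides a summary of the encoded image at each timestep in the form of a weighted sum of feature vectors. The attention network computes a score for the feature vectors at every position:

$$z_{(i,j),t} = \mathrm{Attention}(e, \alpha_{t-1}, s_{t-1}) \tag{4}$$

We refer to $\alpha_t = \{\alpha_{(i,j),t}\}$ as the attention map at time $t$, whose computation depends not only on the encoded image $e$, but also on the previous attention map $\alpha_{t-1}$, and on a state vector $s_{t-1}$. The attention map is obtained by a softmax normalization:

$$\alpha_{(i,j),t} = \frac{\exp\left(z_{(i,j),t}\right)}{\sum_{i',j'} \exp\left(z_{(i',j'),t}\right)} \tag{5}$$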
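The softmax normalization of the attention scores runs over all positions of the 2D map at once, not per row or per column. A minimal sketch, assuming the scores are held in a 2D NumPy array (the function name is illustrative):

```python
import numpy as np

def attention_map(scores):
    """Softmax over every position of a 2D score map."""
    shifted = scores - scores.max()  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()   # normalize over the whole map

scores = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
alpha = attention_map(scores)
```

The weights are positive and sum to one over the whole map, so they can be used directly as coefficients of a weighted sum of feature vectors.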


In the literature of attention-based models, we find two main kinds of mechanisms. The first one is referred to as “location-based” attention. The attention network in this case only predicts the position to attend from the previous attended position $\alpha_{t-1}$ and the current state $s_{t-1}$ (e.g. in [14, 15]):

$$\alpha_t = \mathrm{Attention}(\alpha_{t-1}, s_{t-1}) \tag{6}$$

The second kind of attention is “content-based”. The attention weights are predicted from the current state and the encoded features $e$, i.e. the network looks for relevant content (e.g. in [2, 9]):

$$\alpha_t = \mathrm{Attention}(e, s_{t-1}) \tag{7}$$


We combine these two complementary approaches to obtain the attention weights from both the content and the position, similarly to Chorowski et al. [10], who compute convolutional features on the previous attention weights in addition to the content-based features.

In this paper, we combine the previous attention map with the encoded features through an MDLSTM layer, which can keep track of position and content (Eqn. 4). With this architecture, the attention potentially depends on the context of the whole image. Moreover, the LSTM gating system allows the network to use the content at one location to predict the attention weight for another location. In that sense, we can see this network as implementing a form of both overt and covert attention.

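The state vector $s_t$ allows the model to keep track of what it has seen and done. It is an ensemble of LSTM cells, whose inner states and outputs are updated at each timestep:

$$s_t = \mathrm{LSTM}(s_{t-1}, g_t) \tag{8}$$

where $g_t$ represents the summary of the image at time $t$, resulting from the attention given to the encoder features:

$$g_t = \sum_{i,j} \alpha_{(i,j),t} \, e_{i,j} \tag{9}$$

with $\alpha_{(i,j),t}$ the attention weight and $e_{i,j}$ the encoder feature vector at position $(i,j)$, and is used both to update the state vector and to predict the next character.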
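The image summary is a weighted sum of the encoder feature vectors, with the attention weights as coefficients. The sketch below shows that computation under the assumption that attention is an (H, W) array and the features an (H, W, D) array; the names are illustrative.

```python
import numpy as np

def image_summary(attention, features):
    """attention: (H, W) weights summing to 1; features: (H, W, D).
    Returns the D-dimensional weighted sum of feature vectors."""
    return np.tensordot(attention, features, axes=([0, 1], [0, 1]))

features = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# Attention fully focused on position (0, 1) returns that feature vector.
focused = np.zeros((2, 3))
focused[0, 1] = 1.0
summary_focused = image_summary(focused, features)

# Uniform attention returns the mean feature vector.
uniform = np.full((2, 3), 1.0 / 6.0)
summary_uniform = image_summary(uniform, features)
```

No part of the input is cropped or interpolated: the summary is always a convex combination of the feature vectors.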

The final component of this architecture is a decoder, which predicts the next character given the current image summary $g_t$ and state vector $s_t$:

$$y_t = \mathrm{Decoder}(s_t, g_t) \tag{10}$$

The end of sequence is predicted with a special token EOS. In this paper, the decoder is a simple multi-layer perceptron with one hidden layer (tanh activation) and a softmax output layer.
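A decoder of this kind, one tanh hidden layer followed by a softmax, can be sketched as below. How the image summary and the state vector are combined is not specified here, so concatenating them is an assumption of this sketch, as are the layer sizes, the random weights and the name `decoder_step`.

```python
import numpy as np

def decoder_step(summary, state, params):
    """One decoder step: MLP with a tanh hidden layer and a softmax output.

    summary and state are concatenated (assumption) before the hidden layer.
    """
    W1, b1, W2, b2 = params
    hidden = np.tanh(W1 @ np.concatenate([summary, state]) + b1)
    logits = W2 @ hidden + b2
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
summary_dim, state_dim, hidden_dim, n_labels = 5, 4, 8, 6  # toy sizes, EOS included in n_labels
params = (rng.normal(size=(hidden_dim, summary_dim + state_dim)),
          np.zeros(hidden_dim),
          rng.normal(size=(n_labels, hidden_dim)),
          np.zeros(n_labels))
probs = decoder_step(rng.normal(size=summary_dim), rng.normal(size=state_dim), params)
next_label = int(probs.argmax())  # greedy choice; decoding stops when this is EOS
```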

4 Related Work

Our system is based on the idea of [2] to learn to align and transcribe for machine translation. It is achieved by coupling an encoder of the input signal and a decoder predicting language tokens with an attention mechanism, which selects from the encoded signal the relevant parts for the next prediction.

It bears many similarities with the attention-based models for speech recognition [8, 10]. Indeed, we want to predict text from a sensed version of natural language (audio in speech recognition, image of handwritten text here). As for speech recognition, we need to deal with long sequences. Our network also has LSTM recurrences, but we use MDLSTM units to handle images, instead of bi-directional LSTMs. This is a different way of handling images, compared with the attention-based systems for image captioning for example [9, 29]. Besides the MDLSTM attention, the main difference in our architecture is that we do not input the previous character to predict the next one, so it is also quite different from the RNN transducers [13].

Contrary to some attention models like DRAW [16] or spatial transformer networks, our model does not select and transform a part of the input by interpolation, but only weights the feature vectors and combines them with a sum. We do not explicitly predict the coordinates of the attention, as other approaches do.

In similar models of attention, the weights are either computed from the content at each position individually (e.g. in [8, 29]), from the location of the previous attention (e.g. in [14, 15]) or from a combination of both (e.g. in [10, 15]). In our model, the content of the whole image is explicitly taken into account to predict the weight at every position, and the location is implicitly considered through the MDLSTM recurrences.

Finally, although attention models have been applied to the recognition of sequences of symbols (e.g. in [1, 26] for MNIST or SVHN digits, and [20, 25] for scene text OCR on cropped words), we believe that we present the first attempt to recognize multiple lines of cursive text without an explicit line segmentation.

5 Experiments

5.1 Experimental Setup

We carried out the experiments on the popular IAM database, described in detail in [22], consisting of images of handwritten English text documents. They correspond to English texts extracted from the LOB corpus. 657 writers produced between 1 and 59 handwritten documents. The training set comprises 747 documents (6,482 lines, 55,081 words), the validation set 116 documents (976 lines, 8,895 words) and the test set 336 documents (2,915 lines, 25,920 words). The texts in this database typically contain 450 characters in about nine lines. In 150 dpi images, the average character has a width of 20px.

The baseline corresponds to the architecture presented in Figure 1, with 4, 20 and 100 units in MDLSTM layers, 12 and 32 units in convolutional layers, and dropout after every MDLSTM as presented in [23]. The last linear layer has 80 outputs, and is followed by a collapse layer and a softmax normalization. In the attention-based model, the encoder has the same architecture as the baseline model, without the collapse and softmax. The attention network has 16 or 32 hidden LSTM units in each direction followed by a linear layer with one output. The state LSTM layer has 128 or 256 units, and the decoder is an MLP with 128 or 256 tanh neurons. The networks are trained with RMSProp [27] with a base learning rate of and mini-batches of 8 examples. We measure the Character Error Rate (CER%), i.e. the edit distance normalized by the number of characters in the ground-truth.
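For reference, the character error rate can be computed with a standard Levenshtein edit distance, as in the sketch below. The function names are illustrative, and unit costs for insertions, deletions and substitutions are an assumption of this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate in percent, normalized by the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

one_error = cer("handwriting", "handwritting")  # 1 insertion over 11 characters
repeated = cer("ab", "abababab")                # 6 insertions over 2 characters
```

Because the distance is divided by the length of the ground truth, a hypothesis with many insertions can score above 100%, as in the second example.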

5.2 The Usual Word and Line Recognition Tasks

We first trained the model to recognize words and lines. The inputs are images of several consecutive words from the IAM database. The encoder network has the standard architecture presented in Section 2, with dropout after each LSTM layer [23], and was pre-trained on the IAM database with CTC. The results are presented in Table 1. We see that the models tend to be better on longer inputs, and the results for complete lines are not far from the baseline performance.

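| Model | Inputs | CER (%) |
|---|---|---|
| MDLSTM + CTC | Full Lines | 6.6 |
| Attention-based | 1 word | 12.6 |
| Attention-based | 2 words | 9.4 |
| Attention-based | 3 words | 8.2 |
| Attention-based | 4 words | 7.8 |
| Attention-based | Full Lines | 7.0 |

Table 1: Multi-word recognition results (CER%).

| Two lines of… | CER (%) |
|---|---|
| 1 word | 11.8 |
| 2 words | 11.1 |
| 3 words | 10.9 |
| Full Lines | 9.4 |

Table 2: Multi-line recognition results (CER%).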

In Figure 3, we display the attention map and character predictions as recognition proceeds. We see that attention effectively shifts from one character to the next, in the proper reading order.

Figure 3: Visualization of the attention weights at each timestep for multiple words. The attention map is interpolated to the size of the input image. The outputs of the network at each timestep are displayed in blue.

5.3 Learning Line Breaks

Next, we evaluate the ability of this model to read multiple lines, i.e. to read all characters of one line before finding the next line. This is challenging because it has to consider two levels of reading orders, which is crucial to achieve whole paragraph recognition without prior line segmentation.

We started with a synthetic database derived from IAM, where the images of words or sequences of words are stacked to represent two short lines. The results (character error rate – CER) are presented in Table 2. Again, the system is better with longer inputs. The baseline from the previous section does not apply here anymore, and the error rate with two lines is worse than with a single line, but still in a reasonable range.

We show in Figure 4 the outputs of the decoder and of the attention network on an example of two lines of one word. We observe that the system learnt to look for the second line when the first line is read, with an attention split between the end of the first line and the beginning of the second line.

Figure 4: Visualization of the attention weights at each timestep for multiple lines. The attention map is interpolated to the size of the input image.

5.4 Towards Paragraph Recognition

Training this system on paragraphs raises several challenges. The model still has to learn to both align and recognize, but the alignment problem is much more complex. A typical paragraph from IAM contains 450 characters on 9 text lines. Moreover, the full backpropagation through time must cover those 450 timesteps, on images that are significantly bigger than the line images, which is prohibitive in terms of memory usage.

To tackle these challenges, we modified the training procedure in several ways. First, we truncated the backpropagation through time of the decoder to 30 timesteps in order to address the memory issue. Note that although 30 timesteps was chosen so that intermediate activations fit in memory even for full paragraphs, it roughly corresponds to half a line, or 4-5 words, and we suppose that it is sufficient to learn the relevant dependencies. Then, instead of using only full paragraphs (there are only 747 in the training set), we added the single lines and all concatenations of successive lines. To some extent, this may be seen as data augmentation by considering different crops of paragraphs.
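The augmentation by concatenation of successive lines can be illustrated by enumerating every run of consecutive lines in a paragraph. This is only a sketch of the counting: the helper name is hypothetical, and how the corresponding image crops are produced is not shown.

```python
def successive_line_crops(lines):
    """Return every run of consecutive lines as (start, end) pairs, end exclusive.

    Single lines and the full paragraph are included.
    """
    n = len(lines)
    return [(s, e) for s in range(n) for e in range(s + 1, n + 1)]

paragraph = [f"line {k}" for k in range(9)]  # a typical 9-line paragraph
crops = successive_line_crops(paragraph)
n_crops = len(crops)  # 9 * 10 / 2 = 45
```

A 9-line paragraph therefore yields 45 training examples instead of one, of which 9 are single lines.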

Finally, we applied several levels of curriculum learning [4]. One of these is the strategy proposed by [21], which samples training examples according to their target length. It prefers short sequences at the beginning of training (e.g. single lines) and progressively adds longer sequences (paragraphs). The second curriculum is similar to that of [1]: we train only to recognize the first few characters at the beginning. The targets are the first $N \times 50$ characters, with $N$ the epoch number, i.e. the first 50 during the first epoch, then the first 100, and so on. Note that 50 characters roughly correspond to the length of one line. This strategy amounts to training the model to recognize the first line during the first epoch, then the first two lines, and so on.
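The second curriculum amounts to truncating each target transcription to a length that grows with the epoch. A minimal sketch, assuming 1-based epoch numbers and a step of 50 characters as stated above (the function name is illustrative):

```python
def curriculum_target(text, epoch, chars_per_epoch=50):
    """Keep only the first epoch * chars_per_epoch characters of the target."""
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    return text[: epoch * chars_per_epoch]

paragraph = "x" * 450  # placeholder for a typical 450-character paragraph
lengths = [len(curriculum_target(paragraph, n)) for n in (1, 2, 9, 10)]
# lengths == [50, 100, 450, 450]
```

With 450-character paragraphs, the full target is reached at the ninth epoch, and later epochs leave it unchanged.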

The baseline here is the MDLSTM network trained with CTC for single lines, applied to the result of automatic line segmentation. We present in Table 3 the character error rates obtained with different input resolutions and segmentation algorithms. Note that the line segmentation on IAM is quite easy as the lines tend to be clearly separated.

| Resolution (DPI) | Line segmentation: GroundTruth | Line segmentation: Projection | Line segmentation: Shredding | Line segmentation: Energy | Attention-based (this work) |
|---|---|---|---|---|---|
| 90 | 18.8 | 24.7 | 19.8 | 20.8 | - |
| 150 | 10.3 | 17.2 | 11.1 | 11.8 | 16.2 |
| 300 | 6.6 | 13.8 | 7.5 | 7.9 | - |

Table 3: Character Error Rates (%) of CTC-trained RNNs on ground-truth lines and automatic segmentation of paragraphs with different resolutions.

We trained the attention-based model on 150 dpi images and the results after only twelve epochs are promising. In Figure 5, we show some examples of paragraphs being transcribed by the network. We report the character error rates on inputs corresponding to all possible sub-paragraphs of one to twelve lines from the development set in Figure 6. The Paragraphs column corresponds to the set of actual complete paragraphs, individually depicted as blue dots in the other columns. Note that for a few samples, the attention jumped back to a previous line at some point, causing the system to transcribe again a whole part of the image. In those cases, the insertion rate was very high and the final CER sometimes above 100%.

Figure 5: Transcribing full paragraphs of text. Character predictions are located at the center of mass of the attention weights. An online demo is available at
Figure 6: Character Error Rates (%) of the proposed model trained on multiple lines, evaluated with inputs containing different numbers of lines (150 dpi, after twelve epochs). The medians and means across all examples are displayed in red. The blue dots are complete paragraphs.
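Figure 5 places each predicted character at the center of mass of the attention weights. That position can be computed as in the following sketch, which assumes the attention map is an (H, W) array of non-negative weights; the function name is illustrative.

```python
import numpy as np

def center_of_mass(attention):
    """attention: (H, W) non-negative weights. Returns (row, col) as floats."""
    attention = np.asarray(attention, dtype=float)
    total = attention.sum()
    rows, cols = np.indices(attention.shape)
    return ((rows * attention).sum() / total,
            (cols * attention).sum() / total)

att = np.zeros((3, 5))
att[1, 2] = 0.5
att[1, 4] = 0.5
position = center_of_mass(att)  # halfway between the two attended columns
```

When the attention is split between two regions, for instance at a line break, the center of mass falls between them rather than on either one.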

6 Discussion

The results we present in this paper are promising and show that recognizing full paragraphs of text without an explicit segmentation into lines is feasible. Not only can we hope to perform full paragraph recognition in the near future, but we may also envision the recognition of complex documents. The attention mechanism would then be a way of performing document layout analysis and text recognition within a single end-to-end system.

We also carried out preliminary experiments on Arabic text lines and SVHN without any cropping, rescaling, or preprocessing. The results are interesting. For Arabic, the model effectively reads from right to left, and manages to handle bidirectional reading order in mixed Arabic/Latin inputs in several images. For SVHN, the model finds digits in the scene images.

In this version of the model, the prediction is not explicitly conditioned on the previous character, as for example in [10], and the integration of a language model is more complicated than with classical models trained with CTC. This should be addressed in future work. Finally, the presented system is very slow due to the computation of attention for each character in turn. The time and memory consumption is prohibitive for most industrial applications, but learning how to read whole paragraphs might open new directions of research in the field.

7 Conclusion

In this paper, we have presented a method to transcribe complete paragraphs of text without an explicit line segmentation. The system is based on MDLSTM-RNNs, widely applied to transcribe isolated text lines, and is inspired by the recent attention-based models. The proposed model is able to recognize multiple lines of text, and to learn encapsulated reading orders. It is not limited to handwritten Latin scripts, and could be applied without change to other languages (such as Chinese or Arabic), writing types (e.g. printed text), or more generally image-to-sequence problems.

Unlike similar models, the decoder output is not conditioned on the previous token. Future work will include this architectural modification, which would enable a richer decoding with a beam search. On the other hand, we proposed an MDLSTM attention network, which computes attention weights taking into account the context of the whole image, and merging location and content information.

The results are encouraging, and prove that explicit segmentation is not necessary, which we believe is an important contribution towards end-to-end handwriting recognition.