using rnn (lstm or gru) and ctc to convert line image into text, based on torch7 and warp-ctc
Offline handwriting recognition systems require cropped text line images for both training and recognition. On the one hand, the annotation of position and transcript at line level is costly to obtain. On the other hand, automatic line segmentation algorithms are prone to errors, compromising the subsequent recognition. In this paper, we propose a modification of the popular and efficient multi-dimensional long short-term memory recurrent neural networks (MDLSTM-RNNs) to enable end-to-end processing of handwritten paragraphs. More particularly, we replace the collapse layer transforming the two-dimensional representation into a sequence of predictions by a recurrent version which can recognize one line at a time. In the proposed model, a neural network performs a kind of implicit line segmentation by computing attention weights on the image representation. The experiments on paragraphs of Rimes and IAM database yield results that are competitive with those of networks trained at line level, and constitute a significant step towards end-to-end transcription of full documents.
Offline handwriting recognition consists in recognizing a sequence of characters in an image of handwritten text. Traditional approaches contain a first segmentation step, followed by a transcription step. Unlike printed texts, images of handwriting are difficult to segment into characters. Early methods tried to compute segmentation hypotheses for characters, for example by performing a heuristic over-segmentation, followed by a scoring of groups of segments (e.g. in [4, 28]). In the nineties, this kind of approach was progressively replaced by segmentation-free methods, where a whole word image is fed to a system providing a sequence of scores. A lexicon constrains a decoding step, which makes it possible to retrieve the character sequence. Some examples are the sliding window approach, in which features are extracted from vertical frames of the line image, or space-displacement neural networks. In the last decade, word segmentations were abandoned in favor of complete text line recognition with statistical language models.
Nowadays, the standard handwriting recognition systems are multi-dimensional long short-term memory recurrent neural networks (MDLSTM-RNNs), which consider the whole image, alternating MDLSTM layers and convolutional layers. The transformation of the 2D structure into a sequence is computed by a simple collapse layer summing the activations along the vertical axis. The further conversion of a sequence of predictions into a sequence of characters is achieved by a simple mapping involving a non-character label, which allows all possible character segmentations to be considered during training with the connectionist temporal classification (CTC) loss. These models have become very popular and won the recent evaluations of handwriting recognition [9, 43, 48].
However, current models still need segmented text lines, and full document processing pipelines should include automatic line segmentation algorithms. Although the segmentation of documents into lines is assumed in most descriptions of handwriting recognition systems, several papers or surveys state that it is a crucial step for handwritten text recognition systems [8, 31, 41]. The need for line segmentation to train the recognition system has also motivated several efforts to map a paragraph-level or page-level transcript to line positions in the image (e.g. recently [7, 20]).
In this paper, we pursue the traditional tendency to relax hard segmentation hypotheses in handwriting recognition systems (from character, then word segmentation, to full text lines), which consistently improved the performance. We propose a model for multi-line recognition based on the popular MDLSTM-RNNs, augmented with an attention mechanism inspired by the recent models for machine translation, image caption generation [13, 51], or speech recognition [11, 14, 15]. In the proposed model, the “collapse” layer is modified with an attention network, providing weights to modulate the importance given to different positions in the input. By iteratively applying this layer to a paragraph image, the network can transcribe each text line in turn, enabling a purely segmentation-free recognition of full paragraphs.
We carried out experiments on two public datasets of handwritten paragraphs: Rimes and IAM. We report results that are competitive with the state-of-the-art systems, which use the ground-truth line segmentation. The remainder of this paper is organized as follows. Section 2 presents methods related to the one presented here, in terms of the tackled problem and modeling choices. In Section 3, we introduce the baseline model: MDLSTM-RNNs. In Section 4, we present the proposed modification and give the details of the system. Experimental results are reported in Section 5, followed by a short discussion in Section 6, in which we explain how the system could be improved and present the challenge of generalizing it to complete documents.
Our work is clearly related to MDLSTM-RNNs, which we improve by replacing the simple collapse layer by a more elaborate mechanism, itself made of MDLSTM layers. The model we propose iteratively performs an implicit line segmentation at the level of intermediate representations.
Some line segmentation methods were devised using statistical models and machine learning techniques such as hidden Markov models, conditional random fields, or neural networks [16, 26, 35, 36]. In our model, the line segmentation is performed implicitly and integrated in the neural network. The intermediate features are shared by the transcription and the segmentation models, and they are jointly trained to minimize the transcription error.
In the field of computer vision, and particularly object detection and recognition, many neural architectures were proposed to both locate and recognize the objects, such as OverFeat. Although systems are now able to detect multiple similar objects in a scene, most methods localize only one object, or several objects that are different. For scene text recognition, which is perhaps the computer vision topic closest to our problem, most systems still rely on a two-step process (localization, then recognition), even though some approaches jointly optimize character segmentation and word recognition [12, 49, 50].
Recently, many “attention-based” models were proposed to iteratively select in an encoded signal the relevant parts to make the next prediction. This paradigm, already suggested by Fukushima in 1987, was successfully applied to various problems such as machine translation, image caption generation [13, 51], speech recognition [11, 14, 15], or cropped words in scene text. In those works, the localization is implicitly performed inside the neural network.
Other papers present similar methods to read short sequences of characters (mainly digits) with different implementations of the attention, e.g. DRAW, RAM, or recurrent spatial transformer networks. We recently proposed an attention-based model to transcribe full paragraphs of handwritten text, which predicts each character in turn.
Outputting one token at a time turns out to be prohibitive in terms of memory and time consumption for full paragraphs, which typically contain about 500 characters. In the proposed system, the encoded image is not summarized as a single vector at each timestep, but as a sequence of vectors representing full text lines. This yields a large speedup and is a return to the original MDLSTM-RNN architecture, in which the collapse layer is augmented with an MDLSTM attention network similar to the one presented in.
In this section, we briefly present the MDLSTM-RNNs. MDLSTM layers generalize LSTMs to two-dimensional inputs. They were first introduced in the context of handwriting recognition. The general architecture is displayed in Figure 1.
The MDLSTM layers scan the input in the four possible directions. The LSTM cell inner state and output are computed from the states and outputs of previous positions in the horizontal and vertical directions. Each LSTM layer is followed by a convolutional layer. The resolution of the learnt representations is decreased by setting a step size of the convolutional filters greater than one. As the size of the feature maps decreases, the number of extracted features increases. At the top of this network, there is one feature map for each character. A collapse layer sums the features along the vertical axis, yielding a sequence of prediction vectors, normalized with a softmax activation.
In order to transform the sequence of predictions into a sequence of labels, an additional non-character label is introduced, and a simple mapping is defined to retrieve the transcription. The connectionist temporal classification (CTC) objective, which considers all possible labellings of the sequence, may be applied to train the network to recognize text lines.
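The mapping from position-wise predictions to a transcription can be illustrated with a minimal Python sketch. It assumes the usual CTC convention (merge consecutive repeated labels, then drop the non-character label); the function name `ctc_collapse` and the `BLANK` symbol are illustrative choices, not names taken from the paper.

```python
from typing import List

BLANK = "-"  # hypothetical symbol standing for the non-character label


def ctc_collapse(frame_labels: List[str], blank: str = BLANK) -> str:
    """Merge consecutive repeats, then remove the non-character label."""
    output = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


if __name__ == "__main__":
    # One predicted label per position of the collapsed sequence.
    frames = list("--hh-e-ll-ll--oo-")
    print(ctc_collapse(frames))
```

The blank between the two runs of `l` is what allows a doubled letter to survive the merging step, which is why the non-character label is needed in addition to the character set.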
The 2D to 1D conversion happens in the collapsing layer, which applies a simple aggregation of the feature maps into vector sequences, i.e. maps of height 1. This is achieved by a simple sum across the vertical dimension:
$$z_i = \sum_{j=1}^{H} a_{ij} \qquad (1)$$
where $z_i$ is the $i$-th output vector, $a_{ij}$ is the input feature vector at coordinates $(i, j)$, and $H$ is the height of the feature maps. All the information in the vertical dimension is reduced to a single vector, regardless of its position in the feature maps, preventing the recognition of multiple lines within this framework.
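As a rough illustration of why the standard collapse cannot separate lines, the following Python sketch sums toy feature vectors over the vertical axis. The nested-list layout (`features[j][i]` is the vector at row `j`, column `i`) and the name `collapse_sum` are assumptions made for this example only.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # features[j][i]: vector at row j, column i


def collapse_sum(features: FeatureMap) -> List[Vector]:
    """Sum the feature vectors over the vertical axis, one output per column."""
    height = len(features)
    width = len(features[0])
    depth = len(features[0][0])
    return [
        [sum(features[j][i][k] for j in range(height)) for k in range(depth)]
        for i in range(width)
    ]


if __name__ == "__main__":
    # Toy feature maps: two rows, three columns, two features per position.
    top = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    bottom = [[0.5, 0.5], [1.0, 0.0], [0.0, 3.0]]

    # Swapping the rows gives the same output: vertical position is lost.
    print(collapse_sum([top, bottom]))
    print(collapse_sum([bottom, top]))
```

Both calls print the same sequence, so two stacked text lines would be mixed into a single sequence of vectors.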
In this paper, we replace the sum of Eqn. 1 by a weighted sum, in order to focus on a specific part of the input. The weighted collapse is defined as follows:
$$z_i^{(t)} = \sum_{j=1}^{H} \omega_{ij}^{(t)} a_{ij} \qquad (2)$$
where $a_{ij}$ is the feature vector at position $(i, j)$ and $\omega_{ij}^{(t)}$ are scalar weights between $0$ and $1$, computed at every time $t$ for each position $(i, j)$. The weights are computed by a recurrent neural network, illustrated in Figure 2, enabling the recognition of a text line at each timestep. This mechanism is differentiable and can be trained with backpropagation.
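The weighted collapse can be sketched in a few lines of Python. Here the weights are set by hand to show the effect; in the model they come from the attention network. The data layout and the name `weighted_collapse` are assumptions for illustration.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # features[j][i]: vector at row j, column i
WeightMap = List[List[float]]    # weights[j][i]: scalar in [0, 1]


def weighted_collapse(features: FeatureMap, weights: WeightMap) -> List[Vector]:
    """Return one vector per column: the weighted sum over the rows."""
    height = len(features)
    width = len(features[0])
    depth = len(features[0][0])
    return [
        [
            sum(weights[j][i] * features[j][i][k] for j in range(height))
            for k in range(depth)
        ]
        for i in range(width)
    ]


if __name__ == "__main__":
    # Toy encoded image: two rows (think of two text lines), three columns.
    line_1 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    line_2 = [[0.5, 0.5], [1.0, 0.0], [0.0, 3.0]]
    features = [line_1, line_2]

    # Weights concentrated on the first row read the first line only.
    focus_1 = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    # Weights concentrated on the second row read the second line only.
    focus_2 = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

    print(weighted_collapse(features, focus_1))
    print(weighted_collapse(features, focus_2))
```

With all weights equal to one, the function reduces to the plain vertical sum, so the standard collapse is a special case of the weighted one.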
Both this new architecture and the previous one are composed of an encoder (the MDLSTM network), an aggregation layer, and a decoder, described below.
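The bottom part of the architecture presented in Section 3 remains the same. We can see the MDLSTM network as a feature extraction module, or encoder of the input image $\mathcal{I}$ into high-level features:
$$a = (a_{ij})_{(i,j) \in [1,W] \times [1,H]} = \mathrm{Encoder}(\mathcal{I}) \qquad (3)$$
where $(i, j)$ are coordinates in the feature maps, of width $W$ and height $H$.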
In Section 3, a simple sum of the feature vectors is computed by a collapse layer. Here, we apply an attention mechanism to read text lines.
The weighted collapse is an attention mechanism providing a view of the encoded image at each timestep in the form of a weighted sum of feature vector sequences. The attention network computes a score for the feature vectors at every position:
$$\alpha_{ij}^{(t)} = \mathrm{Attention}(a, \omega^{(t-1)}) \qquad (4)$$
We refer to $\omega^{(t)} = \{\omega_{ij}^{(t)}\}$ as the attention map at time $t$, whose computation depends not only on the encoded image, but also on the previous attention features. A softmax normalization is applied to each column:
$$\omega_{ij}^{(t)} = \frac{e^{\alpha_{ij}^{(t)}}}{\sum_{j'=1}^{H} e^{\alpha_{ij'}^{(t)}}} \qquad (5)$$
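The column-wise softmax can be sketched as follows in Python. The scores here are arbitrary numbers standing in for the output of the attention network, and subtracting the column maximum before exponentiating is a common numerical-stability trick that is not mentioned in the text; both are assumptions of this example.

```python
import math
from typing import List

ScoreMap = List[List[float]]  # scores[j][i]: attention score at row j, column i


def column_softmax(scores: ScoreMap) -> ScoreMap:
    """Normalize the scores over the rows, independently for each column."""
    height = len(scores)
    width = len(scores[0])
    weights = [[0.0] * width for _ in range(height)]
    for i in range(width):
        column = [scores[j][i] for j in range(height)]
        shift = max(column)  # stability trick: does not change the result
        exps = [math.exp(s - shift) for s in column]
        total = sum(exps)
        for j in range(height):
            weights[j][i] = exps[j] / total
    return weights


if __name__ == "__main__":
    # Three rows, two columns of made-up scores.
    scores = [[0.0, 2.0], [0.0, -1.0], [0.0, 0.5]]
    for row in column_softmax(scores):
        print(["%.3f" % w for w in row])
```

Because each column is normalized separately, every horizontal position contributes to the output sequence, and the weights within a column compete for the vertical position of the line being read.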
This module is applied several times to the features from the encoder. The output of the attention module at iteration $t$, computed with Eqn. 2, is a sequence of feature vectors, intended to represent a text line. Therefore, we may see this module as a soft line segmentation neural network. The advantages over the neural networks trained for line segmentation [16, 26, 36, 35] are that (i) it works on the same features as those used for the transcription (multi-task encoder) and (ii) it is trained to maximize the transcription accuracy (i.e. more closely related to the goal of handwriting recognition systems, and easily interpretable).
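The final component of this architecture is a decoder, which predicts a character sequence from the feature vectors:
$$y = \mathrm{Decoder}(z) \qquad (6)$$
where $z$ is the concatenation of $z^{(1)}, z^{(2)}, \ldots, z^{(T)}$. Alternatively, the decoder may be applied to the sub-sequences $z^{(t)}$ to get the $y^{(t)}$, and $y$ is then the concatenation of $y^{(1)}, y^{(2)}, \ldots, y^{(T)}$.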
In the standard MDLSTM architecture of Section 3, the decoder is a simple softmax. However, a Bidirectional LSTM (BLSTM) decoder could be applied to the collapsed representations. This is particularly interesting in the proposed model, as the BLSTM would potentially process the whole paragraph, allowing a modeling of dependencies across text lines.
This model can be trained with CTC. If the line breaks are known in the transcript, the CTC could be applied to the segments corresponding to each line prediction, with the line transcript. Moreover, it will force the prediction at each timestep to correspond to a complete text line. Otherwise, one can directly apply CTC to the whole paragraph. The different training strategies of this model are illustrated in Figure 3.
In this work, we mainly investigated the second strategy, with CTC training at the paragraph level, and with a BLSTM decoder applied to the concatenation of all collapsing steps, for reasons developed in the next section.
Compared to the model presented in , the iterative decoder requires one step for each text line instead of one step for each character, which represents a speedup by a factor of 20-30. However, we lose the ability to handle arbitrary reading orders. Moreover, in this version, the model does not predict a “stop” token. Thus, the network predicts an arbitrary number $T$ of sequences, fixed by the experimenter. In our experiments, $T$ corresponds to the maximum number of lines in the dataset. The BLSTM decoder is applied to these sequences and was able to ignore the supplementary lines in shorter paragraphs. In numerous cases, we observed that during these additional steps, the attention was located in interlines, where the decoder can easily predict only non-characters. However, the inability to determine automatically the number of required steps is an important limitation, which should be fixed in future work.
Finally, the collapsing paradigm forces the model to output sequences that span the whole width of the image. We may replace the column-wise softmax of Eqn. 5 with a sigmoid to ignore some parts of the input, for shorter lines for example, but a refined mechanism that selects only a portion of the image will become crucial to handle complete documents with complex layouts. That issue will be discussed in more detail in Section 6.
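The difference between the two normalizations can be seen on a single column of scores. The sketch below is only an illustration with made-up scores: a softmax always distributes a total weight of one over the column, whereas independent sigmoids can assign a weight close to zero to every position, which is what would let a column be ignored.

```python
import math
from typing import List


def softmax_weights(column: List[float]) -> List[float]:
    """Weights that always sum to one over the column."""
    shift = max(column)  # stability trick, assumed here for convenience
    exps = [math.exp(s - shift) for s in column]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_weights(column: List[float]) -> List[float]:
    """Independent weights in (0, 1); their sum is not constrained."""
    return [1.0 / (1.0 + math.exp(-s)) for s in column]


if __name__ == "__main__":
    # A column where every position receives a low score (e.g. no text).
    column = [-6.0, -6.0, -6.0]
    print(sum(softmax_weights(column)))  # total weight stays at one
    print(sum(sigmoid_weights(column)))  # total weight stays close to zero
```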
We carried out the experiments on two public databases. The IAM database is made of handwritten English texts copied from the LOB corpus. There are 747 documents (6,482 lines) in the training set, 116 documents (976 lines) in the validation set and 336 documents (2,915 lines) in the test set. The Rimes database contains handwritten letters in French. The data consists of a training set of 1,500 paragraphs (11,333 lines), and a test set of 100 paragraphs (778 lines). We held out the last 100 paragraphs of the training set as a validation set.
The networks have the following architecture. The encoder first computes a 2x2 tiling of the input and alternates MDLSTM layers of 4, 20 and 100 units and 2x4 convolutions of 12 and 32 filters with no overlap. The last layer is a linear layer with 80 outputs for IAM and 102 for Rimes. The attention network is an MDLSTM network with 2x16 units in each direction followed by a linear layer with one output, and a softmax on columns (Eqn. 5). The decoder is a BLSTM network with 256 units. The networks are trained with RMSProp with a base learning rate of and mini-batches of 8 examples, to minimize the CTC loss over entire paragraphs.
In the following, we study the impact of adding a BLSTM decoder, and an attention-based collapse (Section 5.2), we compare our method to the baseline results on automatic and ground-truth line segmentation (Section 5.3) and we present a comparison of our system to the state of the art (Section 5.4).
As explained in Section 4.5, in our model, the weighted collapse method is followed by a BLSTM decoder. In this experiment, we compare the baseline system (standard collapse followed by a softmax) with the proposed model. In order to dissociate the impact of the weighted collapse from that of the BLSTM decoder, we also trained an intermediate architecture with a BLSTM layer after the standard collapse, but still limited to text lines.
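Table 1: Character error rates (CER%) on the validation sets, for 150 dpi images.
|Database||Collapse||Decoder||CER%|
|IAM||Standard||BLSTM + Softmax||7.5|
|IAM||Attention||BLSTM + Softmax||6.8|
|Rimes||Standard||BLSTM + Softmax||4.8|
|Rimes||Attention||BLSTM + Softmax||2.5|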
The character error rates (CER%) on the validation sets are reported in Table 1 for 150 dpi images. We observe that the proposed model outperforms the baseline by a large margin (relative 20% improvement on IAM, 50% on Rimes), and that the gain may be attributed to both the BLSTM decoder and the attention mechanism.
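For reference, a character error rate is commonly computed as the edit distance between the predicted and reference texts, divided by the length of the reference. The text does not spell out the exact scoring procedure, so the Python sketch below is an assumption based on that common definition, with illustrative function names.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for r_index, r_item in enumerate(reference, start=1):
        current = [r_index]
        for h_index, h_item in enumerate(hypothesis, start=1):
            substitution = previous[h_index - 1] + (r_item != h_item)
            insertion = current[h_index - 1] + 1
            deletion = previous[h_index] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over characters, as a percentage of the reference length."""
    if not reference:
        raise ValueError("the reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(character_error_rate("hello", "hallo"))
```

Since `edit_distance` accepts any sequence, the same function applied to lists of words gives a word-level error count.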
Our model performs an implicit line segmentation to transcribe paragraphs. The baseline considered in the last section has an unfair advantage, because it was evaluated on the ground-truth line segmentation. In this experiment, we add to the comparison the baseline models evaluated in a real scenario where they are applied to the result of an automatic line segmentation algorithm.
In Table 2, we report the CERs obtained with the ground-truth line positions, with three different segmentation algorithms, and with our end-to-end system, on the validation sets of both databases with different input resolutions. We see that applying the baseline networks on automatic segmentations increases the error rates, by an absolute 1% in the best case. We also observe that the models are better with higher resolutions.
Our models yield better performance than methods based on an explicit and automatic line segmentation, and comparable or better results than with ground-truth segmentation, even with a resolution divided by two. In Figure 4, we display a visualisation of the implicit line segmentation computed by the network. Each color corresponds to one step of the iterative weighted collapse. On the images, the color represents the weights given by the attention network (the transparency encodes their intensity). The texts below are the predicted transcriptions, and chunks are colored according to the corresponding timestep of the attention mechanism.
In this section, we also compute the word error rates (WER%) and evaluate our models on the test sets in order to compare the proposed approach to existing systems. For IAM, we applied an $n$-gram language model with a lexicon of 50,000 words, trained on the LOB, Brown and Wellington corpora (the parts of the LOB corpus used in the validation and evaluation sets were removed). This language model has a perplexity of 298 and an OOV rate of 4.3% on the validation set (329 and 3.7% on the test set).
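As a small illustration of the out-of-vocabulary figure, the sketch below computes an OOV rate as the percentage of running words that are absent from the lexicon. Counting over running words rather than distinct words, as well as the whitespace tokenization and the function name, are assumptions of this example and are not stated in the text.

```python
from typing import Iterable, Set


def oov_rate(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Percentage of running words that are not in the lexicon."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("at least one token is required")
    missing = sum(1 for token in token_list if token not in lexicon)
    return 100.0 * missing / len(token_list)


if __name__ == "__main__":
    lexicon = {"the", "quick", "fox"}
    text = "the quick brown fox"
    # "brown" is the only word outside the toy lexicon.
    print(oov_rate(text.split(), lexicon))
```

Words outside the lexicon cannot be produced by a word-level language model, which is why the OOV rate gives a floor on the achievable word error rate with such a model.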
The results are presented in Table 3 for Rimes and in Table 4 for IAM, for different input resolutions. When comparing the error rates, it is important to note that all systems in the literature used an explicit (ground-truth) line segmentation and a language model. [17, 29, 34] used a hybrid character/word language model to tackle the issue of out-of-vocabulary words. Moreover, all systems except [34, 39] … [5] is a combination of four systems.
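Table 3: Results on Rimes.
|System||Language model||Valid. WER%||Valid. CER%||Test WER%||Test CER%|
|This work, 150 dpi||no language model||12.7||2.8||13.6||3.2|
|This work, 300 dpi||no language model||12.0||2.5||12.6||2.9|
|Bluche, 2015||with language model||11.2||3.3||11.2||3.5|
|Pham et al., 2014||with language model||-||-||12.3||3.3|
|Doetsch et al., 2014||with language model||-||-||12.9||4.3|
|Messina & Kermorvant, 2014||with language model||-||-||13.3||-|
|Kozielski et al., 2013||with language model||-||-||13.7||4.6|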
On Rimes, the system applied to 150 dpi images already outperforms the state of the art in CER%, while being competitive in terms of WER%. The system for 300 dpi images is comparable to the best single system in WER% with a significantly better CER%.
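Table 4: Results on IAM.
|System||Language model||Valid. WER%||Valid. CER%||Test WER%||Test CER%|
|This work, 150 dpi||no language model||22.4||6.8||29.5||10.1|
|This work, 150 dpi||with language model||13.8||4.7||16.6||6.5|
|This work, 300 dpi||no language model||17.7||4.9||24.6||7.9|
|This work, 300 dpi||with language model||13.1||3.5||16.4||5.5|
|Bluche, 2015||with language model||9.6||3.3||10.9||4.4|
|Doetsch et al., 2014||with language model||8.4||2.5||12.2||4.7|
|Kozielski et al., 2013||with language model||9.5||2.7||13.3||5.1|
|Pham et al., 2014||with language model||11.2||3.7||13.6||5.1|
|Messina & Kermorvant, 2014||with language model||-||-||19.1||-|
|Espana-Boquera et al., 2011||with language model||19.0||-||22.4||9.8|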
On IAM, the language model turned out to be quite important, probably because there is more variability in the language (a simple language model yields a perplexity of 18 on Rimes). On 150 dpi images, the results are not too far from the state-of-the-art results. The WER% does not improve much on 300 dpi images, but we get a lower CER%. When analysing the errors, we noticed that there is a lot of punctuation in IAM, which was often missed by the attention mechanism.
The attention mechanism cannot handle arbitrary reading orders. Rather, it implements a sort of implicit line segmentation. In the current implementation, the iterative collapse runs for a fixed number of timesteps. Yet, the model can handle a variable number of text lines, and, interestingly, the focus is put on interlines in the additional steps. A more elegant solution should include the prediction of a binary variable indicating when to stop reading.
Our method was applied to paragraph images, so a document layout analysis should be applied to detect those paragraphs before applying the model. Naturally, the next step should be the transcription of complex documents without an explicit or assumed paragraph extraction. The limitation to paragraphs is inherent to this system. Indeed, the weighted collapse always outputs sequences corresponding to the whole width of the encoded image, which, in paragraphs, may correspond to text lines. In order to switch to full documents, several issues arise.
First, the size of the lines is determined by the size of the text block. Thus a method should be devised to only select a smaller part of the feature maps, representing only the considered text line. This is not possible in the presented framework. A potential solution could come from spatial transformer networks, performing a differentiable crop. However, that method is based on learning a grid transformation, with a fixed grid size, while we would like to crop variable-sized parts. Another solution would be hierarchical and comprise a first attention at the text block level, and a second one at the line level inside the block. Note that we would probably still need to crop the text block. In a different direction, we could also abandon the differentiability requirement, and learn to predict the crops with reinforcement learning techniques.
On the other hand, training will in practice become more difficult, not only because of the complexity of the task, but also because the reading order in complex documents cannot be exactly inferred in many cases. Even defining arbitrary rules can be tricky. Therefore, the matching of predictions with ground-truth texts should be addressed.
Finally, we would like to point out some important factors to take into account when training the presented model. Because CTC training may have difficulty finding good alignments and getting the network to predict actual characters rather than only non-character symbols, the convergence is much faster with a pre-trained encoder. For example, one can first train an MDLSTM-RNN with the standard collapse on text lines, and finetune it with the attention-based collapse on paragraphs in a second step. However, training the attention model on full paragraphs directly was actually not easy, and we found curriculum methods useful. Before switching to full paragraphs, we had to train for a few epochs on two or three lines to initiate the attention mechanism. This should also be taken into account for complete documents. A good curriculum will be harder to design, and probably crucial. Nonetheless, the amount of data used in our experiments is quite limited, and careful training might become less important with more data.
We have presented a model to transcribe full paragraphs of handwritten texts without an explicit line segmentation. Contrary to classical methods relying on a two-step process (segment then recognize), our system directly considers the paragraph image without elaborate pre-processing, and outputs the complete transcription. We proposed a simple modification of the collapse layer in the standard MDLSTM architecture to iteratively focus on single text lines. This implicit line segmentation is learnt with backpropagation along with the rest of the network to minimize the CTC error at the paragraph level.
We reported comparable error rates to the state of the art on two public databases. After switching from explicit to implicit character, then word segmentation for handwriting recognition, we showed that line segmentation can also be learnt inside the transcription model. The next step towards end-to-end handwriting recognition is now at the full page level.
text detection with convolutional neural networks. In VISAPP (2), pages 290–294, 2008.
Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics, 26(23):4985–4992, 1987.
International Conference on Machine Learning, pages 369–376, 2006.
Surgenerative Finite State Transducer n-gram for Out-Of-Vocabulary Word Recognition. In 11th IAPR Workshop on Document Analysis Systems (DAS2014), pages 212–216, 2014.