Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition

Théodore Bluche et al., A2iA. April 28, 2016.

Offline handwriting recognition systems require cropped text line images for both training and recognition. On the one hand, the annotation of position and transcript at the line level is costly to obtain. On the other hand, automatic line segmentation algorithms are prone to errors, compromising the subsequent recognition. In this paper, we propose a modification of the popular and efficient multi-dimensional long short-term memory recurrent neural networks (MDLSTM-RNNs) to enable end-to-end processing of handwritten paragraphs. More particularly, we replace the collapse layer, which transforms the two-dimensional representation into a sequence of predictions, by a recurrent version which can recognize one line at a time. In the proposed model, a neural network performs a kind of implicit line segmentation by computing attention weights on the image representation. Experiments on paragraphs of the Rimes and IAM databases yield results that are competitive with those of networks trained at the line level, and constitute a significant step towards end-to-end transcription of full documents.

1 Introduction

Offline handwriting recognition consists in recognizing a sequence of characters in an image of handwritten text. Traditional approaches comprise a first segmentation step, followed by a transcription step. Unlike printed text, images of handwriting are difficult to segment into characters. Early methods tried to compute segmentation hypotheses for characters, for example by performing a heuristic over-segmentation followed by a scoring of groups of segments (e.g. in [4, 28]). In the nineties, this kind of approach was progressively replaced by segmentation-free methods, where a whole word image is fed to a system providing a sequence of scores. A lexicon constrains a decoding step that retrieves the character sequence. Some examples are the sliding window approach [27], in which features are extracted from vertical frames of the line image, or space-displacement neural networks [4]. In the last decade, word segmentation was abandoned in favor of complete text line recognition with statistical language models [10].

Nowadays, the standard handwriting recognition systems are multi-dimensional long short-term memory recurrent neural networks (MDLSTM-RNNs [22]), which consider the whole image, alternating MDLSTM layers and convolutional layers. The transformation of the 2D structure into a sequence is computed by a simple collapse layer summing the activations along the vertical axis. The further conversion of a sequence of predictions into a sequence of characters is achieved by a simple mapping involving a non-character label, which allows all possible character segmentations to be considered during training with the connectionist temporal classification loss (CTC [21]). These models have become very popular and won the recent evaluations of handwriting recognition [9, 43, 48].

However, current models still need segmented text lines, and full document processing pipelines should include automatic line segmentation algorithms. Although the segmentation of documents into lines is assumed in most descriptions of handwriting recognition systems, several papers and surveys state that it is a crucial step for handwritten text recognition [8, 31, 41]. The need for line segmentation to train the recognition system has also motivated several efforts to map a paragraph-level or page-level transcript to line positions in the image (e.g. recently [7, 20]).

In this paper, we pursue the long-standing trend of relaxing hard segmentation hypotheses in handwriting recognition systems – from character, to word, to full text line segmentation – which has consistently improved performance. We propose a model for multi-line recognition based on the popular MDLSTM-RNNs, augmented with an attention mechanism inspired by recent models for machine translation [3], image caption generation [13, 51], and speech recognition [11, 14, 15]. In the proposed model, the “collapse” layer is modified with an attention network, providing weights to modulate the importance given to different positions in the input. By iteratively applying this layer to a paragraph image, the network can transcribe each text line in turn, enabling a purely segmentation-free recognition of full paragraphs.

We carried out experiments on two public datasets of handwritten paragraphs: Rimes and IAM. We report results that are competitive with state-of-the-art systems that use the ground-truth line segmentation. The remainder of this paper is organized as follows. Section 2 presents methods related to ours, in terms of the problem tackled and the modeling choices. In Section 3, we introduce the baseline model: MDLSTM-RNNs. We describe the proposed modification in Section 4, along with the details of the system. Experimental results are reported in Section 5, followed by a short discussion in Section 6, in which we explain how the system could be improved and present the challenge of generalizing it to complete documents.

2 Related Work

Our work is clearly related to MDLSTM-RNNs [22], which we improve by replacing the simple collapse layer with a more elaborate mechanism, itself made of MDLSTM layers. The model we propose iteratively performs an implicit line segmentation at the level of intermediate representations.

Classical text line segmentation algorithms are mostly based on image processing techniques and heuristics [32, 37, 38, 40, 42, 45, 53]. However, some methods were devised using statistical models and machine learning techniques, such as hidden Markov models [8], conditional random fields [24], or neural networks [16, 26, 35, 36]. In our model, the line segmentation is performed implicitly and integrated into the neural network. The intermediate features are shared by the transcription and segmentation models, which are jointly trained to minimize the transcription error.

In the field of computer vision, and particularly object detection and recognition, many neural architectures were proposed to both locate and recognize objects, such as OverFeat [44] or spatial transformer networks [25]. Although systems are now able to detect multiple similar objects in a scene, most methods localize only one object, or several objects that are different. For scene text recognition, which is perhaps the topic in computer vision closest to our problem, most systems still rely on a two-step process (localization, then recognition) [52], even though some approaches jointly optimize character segmentation and word recognition [12, 49, 50].

Recently, many “attention-based” models were proposed to iteratively select, in an encoded signal, the relevant parts for making the next prediction. This paradigm, already suggested by Fukushima in 1987 [19], was successfully applied to various problems such as machine translation [3], image caption generation [13, 51], speech recognition [11, 14, 15], or the reading of cropped words in scene text [30]. In those works, the localization is implicitly performed inside the neural network.

Other papers present similar methods to read short sequences of characters (mainly digits) with different implementations of the attention mechanism, e.g. DRAW [23], RAM [2], or recurrent spatial transformer networks [46]. We recently proposed an attention-based model that transcribes full paragraphs of handwritten text by predicting each character in turn [6].

Outputting one token at a time turns out to be prohibitive in terms of memory and time consumption for full paragraphs, which typically contain about 500 characters. In the proposed system, the encoded image is not summarized as a single vector at each timestep, but as a sequence of vectors representing a full text line. This represents a huge speedup, and a return to the original MDLSTM-RNN architecture, in which the collapse layer is augmented with an MDLSTM attention network similar to the one presented in [6].

3 Handwriting Recognition with MDLSTM and CTC

In this section, we briefly present the MDLSTM-RNNs [22]. MDLSTM layers generalize LSTMs to two-dimensional inputs. They were first introduced in the context of handwriting recognition. The general architecture is displayed in Figure 1.

Figure 1: MDLSTM-RNN architecture for handwriting recognition. LSTM layers in four scanning directions are followed by convolutions. The feature maps of the top layer are summed in the vertical dimension, and character predictions are obtained after a softmax normalization (figure from [39]).

The MDLSTM layers scan the input in the four possible directions. The LSTM cell inner state and output are computed from the states and outputs of previous positions in the horizontal and vertical directions. Each LSTM layer is followed by a convolutional layer. The resolution of the learnt representations is decreased by setting the step size of the convolutional filters to be greater than one. As the size of the feature maps decreases, the number of extracted features increases. At the top of this network, there is one feature map for each character. A collapse layer sums the features along the vertical axis, yielding a sequence of prediction vectors, normalized with a softmax activation.

In order to transform the sequence of predictions into a sequence of labels, an additional non-character label is introduced, and a simple mapping is defined to retrieve the transcription. The connectionist temporal classification objective (CTC [21]), which considers all possible labellings of the sequence, may be applied to train the network to recognize text lines.
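As an illustration of that mapping, here is a minimal Python sketch (not the authors' code): consecutive repeated labels are merged, then the non-character (blank) label is dropped, assuming label 0 denotes the blank.

```python
def ctc_collapse(frame_labels, blank=0):
    """CTC output mapping: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Frames "a a - b b - - b" map to "a b b": the blank separates the two b's.
print(ctc_collapse([1, 1, 0, 2, 2, 0, 0, 2]))  # -> [1, 2, 2]
```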

The 2D to 1D conversion happens in the collapse layer, which applies a simple aggregation of the feature maps into vector sequences, i.e. maps of height 1. This is achieved by a simple sum across the vertical dimension:

c_j = \sum_{i=1}^{H} e_{ij}        (1)

where c_j is the j-th output vector and e_{ij} is the input feature vector at coordinates (i, j). All the information in the vertical dimension is reduced to a single vector, regardless of its position in the feature maps, preventing the recognition of multiple lines within this framework.
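As a minimal sketch of Eqn. 1 (an illustration, not the authors' code), assuming PyTorch feature maps of shape (batch, channels, height, width):

```python
import torch

def standard_collapse(features):
    """Eqn. 1: sum the feature maps over the vertical axis.
    features: (batch, channels, height, width)
    returns:  (batch, channels, width), one vector per horizontal position."""
    return features.sum(dim=2)

e = torch.randn(1, 80, 10, 120)   # e.g. one feature map per character class
c = standard_collapse(e)          # (1, 80, 120): a sequence of 120 predictions
```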

4 An Iterative Weighted Collapse for End-to-End Handwriting Recognition

In this paper, we replace the sum of Eqn. 1 by a weighted sum, in order to focus on a specific part of the input. The weighted collapse is defined as follows:

c_j^{(t)} = \sum_{i=1}^{H} \omega_{ij}^{(t)} e_{ij}        (2)

where the \omega_{ij}^{(t)} are scalar weights between 0 and 1, computed at every timestep t for each position (i, j). The weights are computed by a recurrent neural network, illustrated in Figure 2, enabling the recognition of a distinct text line at each timestep.
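A sketch of Eqn. 2 under the same tensor layout as above, with the attention weights passed in (their computation is described in Section 4.2); this is illustrative, not the authors' implementation:

```python
import torch

def weighted_collapse(features, attention):
    """Eqn. 2: attention-weighted sum over the vertical axis.
    features:  (batch, channels, height, width)
    attention: (batch, height, width), each column summing to 1 (Eqn. 5)
    returns:   (batch, channels, width), one line-level sequence per call."""
    return (features * attention.unsqueeze(1)).sum(dim=2)
```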

Figure 2: Proposed modification of the collapse layer. While the standard collapse (left, top) computes a simple sum, the weighted collapse (right, bottom) includes a neural network to predict the weights of a weighted sum.

This collapse, weighted with a neural network, may be interpreted as the “attention” module of an attention-based neural network similar to those of [3, 13, 51]. This mechanism is differentiable and can be trained with backpropagation.

Both this new architecture and the previous one are composed of an encoder (the MDLSTM network), an aggregation layer, and a decoder, described below.

4.1 Encoder

The bottom part of the architecture presented in Section 3 remains the same. We can see the MDLSTM network as a feature extraction module, or encoder, of the input image x into high-level features:

e = Encoder(x)        (3)

where e_{ij} denotes the feature vector at coordinates (i, j) in the feature maps.

In Section 3, a simple sum of the e_{ij} is computed by a collapse layer. Here, we apply an attention mechanism to read text lines.

4.2 Attention

The weighted collapse is an attention mechanism providing a view of the encoded image at each timestep, in the form of a weighted sum of feature vector sequences. The attention network computes a score for the feature vectors at every position:

z_{ij}^{(t)} = Attention(e, z^{(t-1)})_{ij}        (4)

We refer to z^{(t)} as the attention map at time t, whose computation depends not only on the encoded image, but also on the previous attention features. A softmax normalization is applied to each column:

\omega_{ij}^{(t)} = \frac{\exp(z_{ij}^{(t)})}{\sum_{i'} \exp(z_{i'j}^{(t)})}        (5)

This module is applied several times to the features from the encoder. The output of the attention module at iteration t, computed with Eqn. 2, is a sequence of feature vectors, intended to represent a text line. Therefore, we may see this module as a soft line segmentation neural network. The advantages over neural networks trained explicitly for line segmentation [16, 26, 35, 36] are that (i) it works on the same features as those used for transcription (multi-task encoder) and (ii) it is trained to maximize the transcription accuracy, i.e. an objective more closely related to the goal of handwriting recognition systems, and easily interpretable.
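The sketch below illustrates Eqns. 4-5 in PyTorch. Note that the paper scores positions with an MDLSTM network; since MDLSTM layers are not a stock PyTorch module, a small convolutional net stands in for it here, so this is only a structural approximation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnAttention(nn.Module):
    """Eqns. 4-5 (sketch): score every position, then softmax per column.
    The encoder features are concatenated with the previous attention map
    so the scores depend on where the network looked at the last step."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(                 # stand-in for the MDLSTM
            nn.Conv2d(channels + 1, 16, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.Conv2d(16, 1, kernel_size=1),        # linear layer, one output
        )

    def forward(self, features, prev_attention):
        # features: (B, C, H, W); prev_attention: (B, H, W)
        x = torch.cat([features, prev_attention.unsqueeze(1)], dim=1)
        z = self.score(x).squeeze(1)                # scores z^(t): (B, H, W)
        return F.softmax(z, dim=1)                  # normalize each column
```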

4.3 Decoder

The final component of this architecture is a decoder, which predicts a character sequence from the feature vectors:

\hat{y} = Decoder(c)        (6)

where c is the concatenation of c^{(1)}, ..., c^{(T)}. Alternatively, the decoder may be applied to the T sub-sequences c^{(t)} to obtain predictions \hat{y}^{(t)}, \hat{y} then being the concatenation of \hat{y}^{(1)}, ..., \hat{y}^{(T)}.

In the standard MDLSTM architecture of Section 3, the decoder is a simple softmax. However, a Bidirectional LSTM (BLSTM) decoder could be applied to the collapsed representations. This is particularly interesting in the proposed model, as the BLSTM potentially processes the whole paragraph, allowing it to model dependencies across text lines.
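A sketch of such a BLSTM decoder over the concatenated collapsed sequences (layer sizes are illustrative, not the exact configuration):

```python
import torch
import torch.nn as nn

class BLSTMDecoder(nn.Module):
    """Eqn. 6 (sketch): a bidirectional LSTM over the concatenation of the
    collapsed line sequences, followed by per-frame label scores. Running
    it over the whole concatenation lets it model cross-line dependencies."""
    def __init__(self, channels, num_labels, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(channels, hidden, bidirectional=True,
                             batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_labels)

    def forward(self, line_seqs):
        # line_seqs: list of T tensors of shape (B, C, W), one per timestep
        c = torch.cat(line_seqs, dim=2).transpose(1, 2)   # (B, T*W, C)
        h, _ = self.blstm(c)
        return self.proj(h).log_softmax(dim=2)            # (B, T*W, labels)
```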

4.4 Training

Figure 3: Training strategies. If a ground-truth is available at the line level, the CTC objective function may be applied on line segments independently (left). If the ground-truth is only available at the paragraph level, the CTC objective is applied to the concatenation of all line predictions (right).

This model can be trained with CTC. If the line breaks are known in the transcript, CTC can be applied to the segments corresponding to each line prediction, with the corresponding line transcript. This also enforces that the prediction at each timestep corresponds to a complete text line. Otherwise, one can directly apply CTC to the whole paragraph. The different training strategies of this model are illustrated in Figure 3.

In this work, we mainly investigated the second strategy, with CTC training at the paragraph level, and with a BLSTM decoder applied to the concatenation of all collapsing steps, for reasons developed in the next section.
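A sketch of this second strategy with PyTorch's CTC loss; shapes and sizes are illustrative, and `log_probs` stands for the decoder output over the concatenation of all collapsing steps:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)                 # label 0 is the non-character label
B, S, L = 1, 4 * 120, 80                  # batch, frames (T lines x W), labels
logits = torch.randn(S, B, L, requires_grad=True)
log_probs = logits.log_softmax(2)         # (S, B, L), as CTCLoss expects
targets = torch.randint(1, L, (B, 200))   # full paragraph label sequence
input_lengths = torch.full((B,), S, dtype=torch.long)
target_lengths = torch.full((B,), 200, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow through decoder, attention and encoder alike
```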

4.5 Limitations

Compared to the model presented in [6], the iterative decoder requires one step per text line instead of one step per character, which represents a huge speedup (by a factor of 20-30). However, we lose the ability to handle arbitrary reading orders. Moreover, in this version, the model does not predict a “stop” token. Thus, the network predicts a fixed number T of line sequences, set by the experimenter. In our experiments, T corresponds to the maximum number of lines in the dataset. The BLSTM decoder is applied to these sequences and proved efficient at ignoring the supernumerary lines in shorter paragraphs. In numerous cases, we observed that during these additional steps, the attention was located on interlines, where the decoder can easily predict only non-characters. However, the inability to automatically determine the number of required steps is an important limitation, which should be fixed in future work.

Finally, the collapsing paradigm forces the model to output sequences that span the whole width of the image. We may replace the column-wise softmax of Eqn. 5 with a sigmoid to ignore some parts of the input, for shorter lines for example, but a refined mechanism that selects only a portion of the image will become crucial to handle complete documents with complex layouts. That issue is discussed in more detail in Section 6.
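Tying the sketches of Sections 3 and 4 together, the full forward pass could be organized as below (illustrative; `encoder` is any network producing (B, C, H, W) feature maps, and T is the fixed number of steps discussed above):

```python
import torch

def transcribe_paragraph(encoder, attention, collapse, decoder, image, T):
    """One forward pass of the proposed model (sketch)."""
    e = encoder(image)                        # Eqn. 3: (B, C, H, W) features
    B, C, H, W = e.shape
    att = torch.full((B, H, W), 1.0 / H)      # uniform initial attention map
    lines = []
    for _ in range(T):                        # one iteration per text line
        att = attention(e, att)               # Eqns. 4-5: where to look next
        lines.append(collapse(e, att))        # Eqn. 2: (B, C, W) line sequence
    return decoder(lines)                     # Eqn. 6: labels for all lines
```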

5 Experiments

5.1 Experimental Setup

We carried out the experiments on two public databases. The IAM database [33] is made of handwritten English texts copied from the LOB corpus. There are 747 documents (6,482 lines) in the training set, 116 documents (976 lines) in the validation set and 336 documents (2,915 lines) in the test set. The Rimes database [1] contains handwritten letters in French. The data consists of a training set of 1,500 paragraphs (11,333 lines), and a test set of 100 paragraphs (778 lines). We held out the last 100 paragraphs of the training set as a validation set.

The networks have the following architecture. The encoder first computes a 2x2 tiling of the input, and alternates MDLSTM layers of 4, 20 and 100 units with 2x4 convolutions of 12 and 32 filters with no overlap. The last layer is a linear layer with 80 outputs for IAM and 102 for Rimes. The attention network is an MDLSTM network with 2x16 units in each direction, followed by a linear layer with one output and a softmax on columns (Eqn. 5). The decoder is a BLSTM network with 256 units. The networks are trained with RMSProp [47], with a base learning rate of 0.001 and mini-batches of 8 examples, to minimize the CTC loss over entire paragraphs.
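For reference, the layer stack described above can be written out schematically as follows (a reading of the text, not a configuration file from the authors):

```python
# Encoder: tiling, then alternating MDLSTM and non-overlapping convolutions.
ENCODER = [
    ("tile",   {"size": (2, 2)}),
    ("mdlstm", {"units": 4}),
    ("conv",   {"filters": 12, "kernel": (2, 4), "stride": (2, 4)}),
    ("mdlstm", {"units": 20}),
    ("conv",   {"filters": 32, "kernel": (2, 4), "stride": (2, 4)}),
    ("mdlstm", {"units": 100}),
    ("linear", {"outputs": 80}),   # 80 labels for IAM, 102 for Rimes
]
# Attention: MDLSTM scores, one linear output, column-wise softmax (Eqn. 5).
ATTENTION = [("mdlstm", {"units": 16}), ("linear", {"outputs": 1}),
             ("softmax", {"axis": "columns"})]
# Decoder: BLSTM over the concatenated collapsed sequences.
DECODER = [("blstm", {"units": 256}), ("softmax", {})]
```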

In the following, we study the impact of adding a BLSTM decoder and an attention-based collapse (Section 5.2), compare our method to the baseline results on automatic and ground-truth line segmentations (Section 5.3), and compare our system to the state of the art (Section 5.4).

5.2 Impact of the Decoder

As explained in Section 4.5, in our model, the weighted collapse method is followed by a BLSTM decoder. In this experiment, we compare the baseline system (standard collapse followed by a softmax) with the proposed model. In order to dissociate the impact of the weighted collapse from that of the BLSTM decoder, we also trained an intermediate architecture with a BLSTM layer after the standard collapse, but still limited to text lines.

Database Collapse Decoder CER%
IAM Standard Softmax 8.4
Standard BLSTM + Softmax 7.5
Attention BLSTM + Softmax 6.8
Rimes Standard Softmax 4.9
Standard BLSTM + Softmax 4.8
Attention BLSTM + Softmax 2.5
Table 1: Character Error Rates (%) of CTC-trained RNNs on 150 dpi images. The Standard models are trained on segmented lines. The Attention models are trained on paragraphs.

The character error rates (CER%) on the validation sets are reported in Table 1 for 150 dpi images. We observe that the proposed model outperforms the baseline by a large margin (roughly 20% relative improvement on IAM, 50% on Rimes), and that the gain may be attributed both to the BLSTM decoder and to the attention mechanism.

5.3 Impact of Line Segmentation

Our model performs an implicit line segmentation to transcribe paragraphs. The baseline considered in the previous section is, in a sense, cheating, because it was evaluated on the ground-truth line segmentation. In this experiment, we add to the comparison the baseline models evaluated in a realistic scenario, where they are applied to the result of an automatic line segmentation algorithm.

Database Resolution GroundTruth Projection Shredding Energy This work
IAM 150 dpi 8.4 15.5 9.3 10.2 6.8
300 dpi 6.6 13.8 7.5 7.9 4.9
Rimes 150 dpi 4.8 6.3 5.9 8.2 2.8
300 dpi 3.6 5.0 4.5 6.6 2.5
Table 2: Character Error Rates (%) of CTC-trained RNNs on ground-truth lines and automatic segmentation of paragraphs with different resolutions. The last column contains the error rate of the attention-based model presented in this work, without an explicit line segmentation.

In Table 2, we report the CERs obtained with the ground-truth line positions, with three different segmentation algorithms, and with our end-to-end system, on the validation sets of both databases with different input resolutions. We see that applying the baseline networks on automatic segmentations increases the error rates, by an absolute 1% in the best case. We also observe that the models are better with higher resolutions.

Our models yield better performance than methods based on an explicit, automatic line segmentation, and comparable or better results than systems using the ground-truth segmentation, even with the resolution divided by two. In Figure 4, we display a visualisation of the implicit line segmentation computed by the network. Each color corresponds to one step of the iterative weighted collapse. On the images, the color represents the weights given by the attention network (the transparency encodes their intensity). The texts below are the predicted transcriptions, with chunks colored according to the corresponding timestep of the attention mechanism.

Figure 4: Transcription of full paragraphs of text and implicit line segmentation learnt by the network on IAM (left) and Rimes (right). Best viewed in color.

5.4 Comparison to Published Results

In this section, we also compute the word error rates (WER%) and evaluate our models on the test sets in order to compare the proposed approach to existing systems. For IAM, we applied a 3-gram language model with a lexicon of 50,000 words, trained on the LOB, Brown and Wellington corpora (the parts of the LOB corpus used in the validation and evaluation sets were removed). This language model has a perplexity of 298 and an OOV rate of 4.3% on the validation set (329 and 3.7% on the test set).

The results are presented in Table 3 for Rimes and in Table 4 for IAM, for different input resolutions. When comparing the error rates, it is important to note that all systems in the literature used an explicit (ground-truth) line segmentation and a language model. [17, 29, 34] used a hybrid character/word language model to tackle the issue of out-of-vocabulary words. Moreover, all systems except [34, 39] carefully pre-processed the line images (e.g. corrected the slant or skew, normalized the height, …), whereas we just normalized the pixel values to zero mean and unit variance. Finally, [5] is a combination of four systems.

System                                   Validation    Test
                                         WER%  CER%    WER%  CER%
This work, 150 dpi (no language model)   12.7  2.8     13.6  3.2
This work, 300 dpi (no language model)   12.0  2.5     12.6  2.9
Bluche, 2015 [5]                         11.2  3.3     11.2  3.5
Pham et al., 2014 [39]                   -     -       12.3  3.3
Doetsch et al., 2014 [17]                -     -       12.9  4.3
Messina & Kermorvant, 2014 [34]          -     -       13.3  -
Kozielski et al., 2013 [29]              -     -       13.7  4.6
Table 3: Final results on Rimes database

On Rimes, the system applied to 150 dpi images already outperforms the state of the art in CER%, while being competitive in terms of WER%. The system for 300 dpi images is comparable to the best single system [39] in WER% with a significantly better CER%.

System                                    Validation    Test
                                          WER%  CER%    WER%  CER%
This work, 150 dpi, no language model     22.4  6.8     29.5  10.1
This work, 150 dpi, with language model   13.8  4.7     16.6  6.5
This work, 300 dpi, no language model     17.7  4.9     24.6  7.9
This work, 300 dpi, with language model   13.1  3.5     16.4  5.5
Bluche, 2015 [5]                          9.6   3.3     10.9  4.4
Doetsch et al., 2014 [17]                 8.4   2.5     12.2  4.7
Kozielski et al., 2013 [29]               9.5   2.7     13.3  5.1
Pham et al., 2014 [39]                    11.2  3.7     13.6  5.1
Messina & Kermorvant, 2014 [34]           -     -       19.1  -
Espana-Boquera et al., 2011 [18]          19.0  -       22.4  9.8
Table 4: Final results on IAM database

On IAM, the language model turned out to be quite important, probably because there is more variability in the language (a simple language model yields a perplexity of 18 on Rimes [5]). On 150 dpi images, the results are not too far from the state of the art. The WER% does not improve much on 300 dpi images, but we get a lower CER%. When analysing the errors, we noticed that IAM contains a lot of punctuation, which was often missed by the attention mechanism.

6 Discussion

As already discussed in Section 4.5, the proposed model can transcribe complete paragraphs without segmentation and is orders of magnitude faster than the model of [6]. However, the mechanism cannot handle arbitrary reading orders. Rather, it implements a sort of implicit line segmentation. In the current implementation, the iterative collapse runs for a fixed number of timesteps. Yet, the model can handle a variable number of text lines, and, interestingly, the focus is put on interlines in the additional steps. A more elegant solution would include the prediction of a binary variable indicating when to stop reading.

Our method operates on paragraph images, so a document layout analysis is still required to detect the paragraphs before the model is applied. Naturally, the next step should be the transcription of complex documents without an explicit or assumed paragraph extraction. The limitation to paragraphs is inherent to this system. Indeed, the weighted collapse always outputs sequences corresponding to the whole width of the encoded image, which, in paragraphs, may correspond to text lines. Several issues arise when switching to full documents.

First, the size of the lines is determined by the size of the text block. Thus, a method should be devised to select only a smaller part of the feature maps, representing only the considered text line. This is not possible in the presented framework. A potential solution could come from spatial transformer networks [25], which perform a differentiable crop. However, that method is based on learning a grid transformation with a fixed grid size, while we would like to crop variable-sized parts. Another solution would be hierarchical, comprising a first attention at the text block level and a second one at the line level inside the block. Note that we would probably still need to crop the text block. In a different direction, we could also abandon the differentiability requirement and learn to predict the crops with reinforcement learning techniques.

On the other hand, training will in practice become more difficult, not only because of the complexity of the task, but also because the reading order in complex documents cannot be exactly inferred in many cases. Even defining arbitrary rules can be tricky. Therefore, the matching of predictions with ground-truth texts should be addressed.

Finally, we would like to point out some important factors to take into account when training the presented model. Because CTC training may have difficulty finding good alignments and getting the network to predict actual characters rather than only non-character symbols, convergence is much faster with a pre-trained encoder. For example, one can first train an MDLSTM-RNN with the standard collapse on text lines, and fine-tune it with the attention-based collapse on paragraphs in a second step. However, training the attention model on full paragraphs directly was actually not easy, and we found curriculum methods useful. Before switching to full paragraphs, we had to train for a few epochs on two or three lines to initiate the attention mechanism. This should also be taken into account for complete documents: a good curriculum will be harder to design, and probably crucial. Nonetheless, the amount of data used in our experiments is quite limited, and careful training might become less important with more data.

7 Conclusion

We have presented a model to transcribe full paragraphs of handwritten text without an explicit line segmentation. Contrary to classical methods relying on a two-step process (segment, then recognize), our system directly considers the paragraph image without elaborate pre-processing, and outputs the complete transcription. We proposed a simple modification of the collapse layer in the standard MDLSTM architecture to iteratively focus on single text lines. This implicit line segmentation is learnt with backpropagation along with the rest of the network to minimize the CTC error at the paragraph level.

We reported error rates comparable to the state of the art on two public databases. After the move from explicit to implicit character, and then word, segmentation in handwriting recognition, we showed that line segmentation can also be learnt inside the transcription model. The next step towards end-to-end handwriting recognition is now at the full page level.

References

  • [1] E. Augustin, M. Carré, E. Grosicki, J.-M. Brodin, E. Geoffrois, and F. Preteux. RIMES evaluation campaign for handwritten mail processing. In Proceedings of the Workshop on Frontiers in Handwriting Recognition, number 1, 2006.
  • [2] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
  • [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [4] Yoshua Bengio, Yann LeCun, Craig Nohl, and Chris Burges. Lerec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(6):1289–1303, 1995.
  • [5] Théodore Bluche. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition. Theses, Université Paris Sud - Paris XI, May 2015.
  • [6] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina. Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention. arXiv preprint arXiv:1604.03286, 2016.
  • [7] Théodore Bluche, Bastien Moysset, and Christopher Kermorvant. Automatic line segmentation and ground-truth alignment of handwritten documents. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.
  • [8] Vicente Bosch, Alejandro Hector Toselli, and Enrique Vidal. Statistical text line analysis in handwritten documents. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 201–206. IEEE, 2012.
  • [9] Sylvie Brunessaux, Patrick Giroux, Bruno Grilhères, Mathieu Manta, Maylis Bodin, Khalid Choukri, Olivier Galibert, and Juliette Kahn. The Maurdor Project: Improving Automatic Processing of Digital Documents. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 349–354. IEEE, 2014.
  • [10] Horst Bunke, Samy Bengio, and Alessandro Vinciarelli. Offline recognition of unconstrained handwritten texts using hmms and statistical language models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(6):709–720, 2004.
  • [11] William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.
  • [12] Chongmu Chen, Da-Han Wang, and Hanzi Wang. Image and Graphics: 8th International Conference, ICIG 2015, Tianjin, China, August 13-16, 2015, Proceedings, Part III, chapter Scene Character and Text Recognition: The State-of-the-Art, pages 310–320. Springer International Publishing, Cham, 2015.
  • [13] Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia content using attention-based encoder-decoder networks. Multimedia, IEEE Transactions on, 17(11):1875–1886, 2015.
  • [14] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: first results. arXiv preprint arXiv:1412.1602, 2014.
  • [15] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.
  • [16] Manolis Delakis and Christophe Garcia. Text detection with convolutional neural networks. In VISAPP (2), pages 290–294, 2008.
  • [17] Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.
  • [18] Salvador Espana-Boquera, Maria Jose Castro-Bleda, Jorge Gorbe-Moya, and Francisco Zamora-Martinez. Improving offline handwritten text recognition with hybrid HMM/ANN models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(4):767–779, 2011.
  • [19] Kunihiko Fukushima. Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics, 26(23):4985–4992, 1987.
  • [20] Basilis Gatos, Georgios Louloudis, Tim Causer, Kris Grint, Veronica Romero, Joan-Andreu Sánchez, Alejandro Hector Toselli, and Enrique Vidal. Ground-truth production in the transcriptorium project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 237–241. IEEE, 2014.
  • [21] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pages 369–376, 2006.
  • [22] A. Graves and J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pages 545–552, 2008.
  • [23] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • [24] David Hebert, Thierry Paquet, and Stephane Nicolas. Continuous crf with multi-scale quantization feature functions application to structure extraction in old newspaper. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 493–497. IEEE, 2011.
  • [25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2008–2016, 2015.
  • [26] Keechul Jung. Neural network-based text location in color images. Pattern Recognition Letters, 22(14):1503–1515, 2001.
  • [27] Alfred Kaltenmeier, Torsten Caesar, Joachim M Gloger, and Eberhard Mandler. Sophisticated topology of hidden Markov models for cursive script recognition. In Document Analysis and Recognition, 1993., Proceedings of the Second International Conference on, pages 139–142. IEEE, 1993.
  • [28] Stefan Knerr, Emmanuel Augustin, Olivier Baret, and David Price. Hidden Markov model based word recognition and its application to legal amount reading on French checks. Computer Vision and Image Understanding, 70(3):404–419, 1998.
  • [29] Michal Kozielski, Patrick Doetsch, Hermann Ney, et al. Improvements in RWTH’s System for Off-Line Handwriting Recognition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 935–939. IEEE, 2013.
  • [30] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. arXiv preprint arXiv:1603.03101, 2016.
  • [31] Laurence Likforman-Sulem, Abderrazak Zahour, and Bruno Taconet. Text line segmentation of historical documents: a survey. International Journal of Document Analysis and Recognition (IJDAR), 9(2-4):123–138, 2007.
  • [32] G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis. Text line and word segmentation of handwritten documents. Pattern Recognition, 42(12):3169–3183, December 2009.
  • [33] U-V Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
  • [34] R. Messina and C. Kermorvant. Surgenerative finite state transducer n-gram for out-of-vocabulary word recognition. In 11th IAPR Workshop on Document Analysis Systems (DAS2014), pages 212–216, 2014.
  • [35] Bastien Moysset, Pierre Adam, Christian Wolf, and Jérôme Louradour. Space displacement localization neural networks to locate origin points of handwritten text lines in historical documents. In International Workshop on Historical Document Imaging and Processing (HIP), 2015.
  • [36] Bastien Moysset, Christopher Kermorvant, Christian Wolf, and Jérôme Louradour. Paragraph text segmentation into lines with recurrent neural networks. In International Conference of Document Analysis and Recognition (ICDAR), 2015.
  • [37] Anguelos Nicolaou and Basilis Gatos. Handwritten Text Line Segmentation by Shredding Text into its Lines. International Conference on Document Analysis and Recognition, 2009.
  • [38] Vassilis Papavassiliou, Vassilis Katsouros, and George Carayannis. A Morphological Approach for Text-Line Segmentation in Handwritten Documents. In International Conference on Frontiers in Handwriting Recognition, 2010.
  • [39] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves recurrent neural networks for handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR2014), pages 285–290, 2014.
  • [40] Irina Rabaev, Ofer Biller, Jihad El-Sana, Klara Kedem, and Itshak Dinstein. Text line detection in corrupted and damaged historical manuscripts. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 812–816. IEEE, 2013.
  • [41] Zaidi Razak, Khansa Zulkiflee, Mohd Yamani Idna Idris, Emran Mohd Tamil, Mohd Noorzaily Mohamed Noor, Rosli Salleh, Mohd Yaakob, Zulkifli Mohd Yusof, and Mashkuri Yaacob. Off-line handwriting text line segmentation: A review. International journal of computer science and network security, 8(7):12–20, 2008.
  • [42] Raid Saabni and El-Sana Jihad. Language-Independent Text Lines Extraction Using Seam Carving. In International Conference of Document Analysis and Recognition, 2011.
  • [43] Joan Andreu Sánchez, Verónica Romero, Alejandro Toselli, and Enrique Vidal. ICFHR 2014 HTRtS: Handwritten Text Recognition on tranScriptorium Datasets. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.
  • [44] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [45] Zhixin Shi, Srirangaraj Setlur, and Venu Govindaraju. A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines. In International Conference on Document Analysis and Recognition, 2009.
  • [46] Søren Kaae Sønderby, Casper Kaae Sønderby, Lars Maaløe, and Ole Winther. Recurrent spatial transformer networks. arXiv preprint arXiv:1509.05329, 2015.
  • [47] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
  • [48] A. Tong, M. Przybocki, V. Maergner, and H. El Abed. NIST 2013 Open Handwriting Recognition and Translation (OpenHaRT13) Evaluation. In 11th IAPR Workshop on Document Analysis Systems (DAS2014), 2014.
  • [49] Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1457–1464. IEEE, 2011.
  • [50] Jerod J Weinman, Zachary Butler, Dugan Knoll, and Jacqueline Feild. Toward integrated scene text reading. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(2):375–387, 2014.
  • [51] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
  • [52] Qixiang Ye and David Doermann. Text detection and recognition in imagery: A survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(7):1480–1500, 2015.
  • [53] Abderrazak Zahour, L. Likforman-Sulem, W. Boussellaa, and B. Taconet. Text Line Segmentation of Historical Arabic Documents. In International Conference on Document Analysis and Recognition, 2007.