State-of-the-art offline handwriting recognition (HWR) models are based on deep Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BLSTM) networks and are trained on large amounts of labeled line images. Obtaining such large annotated training sets is expensive and time-consuming, because a person must segment thousands of text lines and manually type transcriptions for the ground truth. However, such a process is often necessary for each language and domain because trained HWR models often fail to generalize sufficiently across domains, languages, and writers that were not observed during training. Eliminating or lessening this requirement is the goal of unsupervised HWR and related approaches.
Prior work has attempted to address the lack of a large labeled training set for low-resource languages and domains through several means. Training on synthetic data is an appealing direction because an arbitrary amount of labeled data may be generated with little human effort. In some works, synthetic data is obtained by applying annotation-preserving transformations to real data in order to simulate the natural variability in handwriting [2, 3, 4]. However, these methods depend on the availability of sufficiently diverse labeled data, which is not always the case. Other works have modeled the writing process for generating isolated characters using prototypes for Chinese and Korean characters, though it is not clear how such models could be extended to cursive scripts. Elarian et al. proposed a concatenative model for handwritten Arabic, though it relies on a database of pre-segmented characters and the concatenation procedure is specific to Arabic.
An alternative semi-supervised formulation of the problem assumes that there is a small labeled training set and a larger unlabeled training set. The main methodology involves propagating annotations from the labeled set to the unlabeled set through model prediction. Subsequent models then train as if the noisy predicted labels were ground truth annotations. Frinken et al. explored this method for isolated word image recognition in the framework of co-training, where a Hidden Markov Model (HMM) and a BLSTM model each made predictions that were used to further train the other model. In a separate work, Frinken and Bunke used an ensemble of BLSTM networks for self-training, where high-confidence ensemble predictions on the unlabeled data are subsequently used as ground truth to further train the ensemble. Ball and Srihari used a similar idea to adapt writer-specific HWR models from a general model by iteratively updating segmented character prototypes after performing recognition on unlabeled data.
In this work, we propose a transfer learning methodology that allows us to train a HWR model for a target language for which we have no labeled images. Our method only requires a labeled training set of line images in a sufficiently similar source language, a trained Language Model (LM) in the target language, and a set of unlabeled images in the target language. A source language is sufficiently similar to the target language if the character sets of the two languages have a large overlap. For example, Latin based languages, such as English, French and Spanish, are all sufficiently similar because they all use the written Latin script. The LM can be obtained from digital text in the target language that is unrelated to the unlabeled images. Digital text for training a LM is much more commonly available than labeled handwriting images, so our methodology helps extend automated HWR to lower resource languages.
After training a HWR model on the source language, our proposed method begins a hybrid training procedure where training occurs on both source data and target data. There is no ground truth for the target data, so we combine the model prediction with the LM to produce a corrected prediction that we then use as ground truth.
We perform several experiments to analyze the behavior of our proposed transfer learning methodology for HWR. These experiments are performed using 4 datasets and 3 languages: English, Spanish, and French. We examine factors such as how long the source model is trained, LM decoding hyperparameters, and the proportion of source and target training used during hybrid training. In the best cases, we find that transferring produces Character Error Rates (CER) nearly as low as those obtained by traditional supervised learning on the target data.
II Language Transfer Learning
We formulate our problem as follows. Suppose we wish to obtain a trained HWR model for a target language Y that has no labeled training data available, but there are many unlabeled text line images in this language. We also have sufficient digital text in language Y, such that we can train a Language Model (LM). For another language X we have segmented text line images with corresponding ground truth transcriptions. Noting that languages X and Y have similar character sets, we want to use the data in both languages to produce the HWR model for language Y.
Though our discussion uses the term language, our methodology is also applicable to transfer learning HWR problems where there is a difference in domains (e.g. modern vs historical) or writers. We demonstrate this later by transferring between a modern English dataset and a historical English dataset.
II-A Source Model Training
We begin transfer learning by training a state-of-the-art HWR model on the source language for which we have ground truth transcriptions. Fig. 1 shows our CNN-BLSTM architecture, which is similar to a previously published CNN-BLSTM model but introduces an auxiliary classifier and loss. This model learns high-level features using convolution operations that are vertically collapsed to a 1D horizontal sequence of feature vectors that are fed to a 2-layer BLSTM. In the BLSTM, context is propagated both forwards and backwards along the sequence. Two separate frame-wise, linear character classifiers are applied, one to the output of the CNN and one to the output of the combined CNN-BLSTM. Both classifiers are trained using Connectionist Temporal Classification (CTC) loss, which automatically aligns frame-wise outputs with the ground truth transcriptions.
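To make the role of CTC concrete, the following minimal sketch shows the standard CTC collapsing rule that maps a frame-wise label sequence to a transcription: repeated labels are merged and blank labels are dropped. The blank symbol `-` and the function name are illustrative assumptions, not taken from our implementation.

```python
BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


print(ctc_collapse("hh-e-ll-lo--"))
```

A blank between two identical labels is what allows doubled letters, such as the "ll" in the example, to survive the collapse.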
The classifier that operates on the output of the CNN is considered an auxiliary classifier and it is discarded after the training procedure, meaning that the model outputs the predictions made by the main classifier that operates on the output of the BLSTM. We found that introducing an auxiliary classifier improves transferability of the model, likely because it forces the CNN visual features to be discriminative of characters themselves instead of depending on further processing from the BLSTM layers. When transferring between languages, the visual difference of some shared characters is small, so the CNN should be robust to the language difference. In contrast, the BLSTM considers the whole sequence, so it is more sensitive to transferring between datasets.
The precise architecture of our HWR model is based on a previously published CNN-BLSTM model. The input image has a fixed height and a width that can vary dynamically. The CNN is composed of 6 convolutional layers with 3x3 learnable kernels, and there are 64, 128, 256, 256, 512, and 512 feature maps respectively for the 6 layers. We apply Batch Norm (BN) after layers 4 and 5 and 2x2 Max-Pooling (MP) with a stride of 2 after layers 1 and 2. After layers 4 and 6, we vertically collapse features by using 2x2 MP with a vertical stride of 2 and a horizontal stride of 1. To form the input for the BLSTM and for the CNN auxiliary classifier, we concatenate features in the same column to form a 1D horizontal sequence of 1024-dimensional feature vectors. The BLSTM has 2 layers, each with 512 hidden nodes that have a 0.5 probability of node dropout. A linear classifier is applied to each time step to produce the final prediction, which is a probability distribution over characters at each time step.
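The pooling arithmetic above can be traced with a small sketch. It assumes that the 3x3 convolutions preserve spatial size (padded) and that pooling is unpadded; both are assumptions, as is the example input height of 32, which is chosen only because it yields the stated 1024-dimensional feature vectors under those assumptions.

```python
def pooled(size, kernel, stride):
    """Output length of an unpadded pooling window along one axis."""
    return (size - kernel) // stride + 1


def feature_sequence_shape(height, width, final_maps=512):
    """Return (sequence_length, feature_dim) of the CNN output.

    Assumes size-preserving convolutions (hypothetical), so only the
    four pooling steps change the spatial dimensions.
    """
    h, w = height, width
    for _ in range(2):  # 2x2 MP, stride 2, after layers 1 and 2
        h, w = pooled(h, 2, 2), pooled(w, 2, 2)
    for _ in range(2):  # 2x2 MP, vertical stride 2, horizontal stride 1
        h, w = pooled(h, 2, 2), pooled(w, 2, 1)
    # Features in the same column are concatenated into one vector.
    return w, final_maps * h


print(feature_sequence_shape(32, 400))
```

Under these assumptions a 32 x 400 image produces a sequence of 98 feature vectors, so the horizontal resolution is reduced by roughly a factor of 4 while the vertical axis is collapsed by a factor of 16.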
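The model is trained using CTC loss over both the main classifier and the auxiliary classifier:

$$\mathcal{L}(x, y) = \mathcal{L}_{CTC}(f_{BLSTM}(x), y) + \lambda \, \mathcal{L}_{CTC}(f_{CNN}(x), y) \qquad (1)$$

where $x$ represents an input image, $y$ the corresponding ground truth transcription, $\mathcal{L}_{CTC}$ is the CTC loss, $f_{CNN}$ is the auxiliary CNN classifier, and $f_{BLSTM}$ is the BLSTM classifier. We set the weight $\lambda$ of the auxiliary loss empirically, based on cross validation using validation data.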
II-B Language Model Decoding
The HWR model predicts each output character independently, and this may produce linguistically improbable sequences of characters. Decoding with a Language Model (LM) combines the individual predicted character probabilities with how likely sequences of characters occur together.
We use a 10-gram character LM, which estimates $P(c_i \mid c_{i-9}, \ldots, c_{i-1})$, the probability of a character given the 9 preceding characters, from digital text. Not all 10-gram character sequences are observed, so we smooth the empirical 10-gram distribution and employ backoff, where n-grams shorter than 10 are used to estimate the probability of infrequent 10-grams.
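The idea of backing off to shorter n-grams can be illustrated with a deliberately simplified sketch. It uses the "stupid backoff" scheme (a fixed multiplicative penalty per backoff step), which is a swapped-in simplification and not the smoothing method used in our experiments; its scores are not normalized probabilities. The class name, the penalty of 0.4, and the floor for unseen characters are all illustrative assumptions.

```python
from collections import Counter


class BackoffCharLM:
    """Character n-gram scorer with stupid backoff (illustrative only)."""

    def __init__(self, text, order=10, backoff=0.4):
        self.order = order
        self.backoff = backoff
        self.total = len(text)
        self.vocab = set(text)
        self.counts = Counter()
        # Count every substring of length 1..order.
        for n in range(1, order + 1):
            for i in range(len(text) - n + 1):
                self.counts[text[i:i + n]] += 1

    def score(self, history, char):
        """Score `char` given up to order-1 preceding characters."""
        history = history[-(self.order - 1):] if self.order > 1 else ""
        penalty = 1.0
        while True:
            ngram = history + char
            if self.counts[ngram] > 0:
                denom = self.counts[history] if history else self.total
                return penalty * self.counts[ngram] / denom
            if not history:
                # Character never observed: small assumed floor value.
                return penalty / (self.total + len(self.vocab))
            history = history[1:]      # drop the oldest character
            penalty *= self.backoff    # pay a fixed price for backing off


lm = BackoffCharLM("abracadabra", order=3)
print(lm.score("ab", "r"), lm.score("zb", "r"))
```

In the second call the trigram "zbr" was never observed, so the scorer falls back to the bigram "br" and discounts its relative frequency by the backoff penalty.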
The decoding operation finds the most likely sequence of hidden states in a Hidden Markov Model (HMM), where the emission probabilities are determined by the HWR model and the transition probabilities are determined by the LM:

$$\hat{S} = \arg\max_{S} \prod_{t} P(x_t \mid s_t) \, P(s_t \mid s_{<t})^{\alpha} \qquad (2)$$

where $S = (s_1, \ldots, s_T)$ is the sequence of hidden states corresponding to characters, $s_{<t}$ indicates all states prior to $s_t$, $X = (x_1, \ldots, x_T)$ is the observed data, and $\alpha$ determines the relative importance of the CNN-BLSTM and LM predictions. Because characters can span multiple output frames, we model each character using 3 states (corresponding to character start, middle, and end), as is commonly done in speech recognition. The LM directly encodes the $P(s_t \mid s_{<t})$ term, but the CNN-BLSTM outputs $P(s_t \mid x_t)$. Using Bayes' rule, we have

$$P(x_t \mid s_t) = \frac{P(s_t \mid x_t) \, P(x_t)}{P(s_t)} \qquad (3)$$

We can estimate $P(s_t)$ by examining the CNN-BLSTM outputs, but $P(x_t)$ is unknown. Following Bluche et al., we approximate $P(x_t \mid s_t) \approx P(s_t \mid x_t) / P(s_t)^{\beta}$, where $\beta$ is a hyperparameter. An exact solution to Eq. 2 can be intractable, so in practice we use a beam search, which efficiently searches the state space but in some cases may not find the exact maximal sequence of characters.
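The interaction between the network outputs, the state prior, and the LM can be sketched with a toy beam search. This is a simplified illustration under several assumptions: one character per frame, no 3-state character model and no blank label, an LM weight called `alpha` applied to the LM log-probability, and a prior exponent called `beta` applied to the state prior. The function names and toy numbers are hypothetical.

```python
import math


def beam_decode(frame_log_posteriors, log_prior, lm_logprob,
                alpha, beta, beam_width=3):
    """Beam search combining scaled network scores with LM scores."""
    beams = [("", 0.0)]  # (prefix, accumulated log score)
    for frame in frame_log_posteriors:
        candidates = []
        for prefix, score in beams:
            for ch, log_post in frame.items():
                # Pseudo-likelihood: posterior divided by prior^beta.
                emission = log_post - beta * log_prior[ch]
                # LM transition score, weighted by alpha.
                transition = alpha * lm_logprob(prefix, ch)
                candidates.append((prefix + ch, score + emission + transition))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # keep only the best hypotheses
    return beams[0][0]


def toy_lm(prefix, ch):
    """Toy LM: after 'q', strongly prefer 'u'."""
    if prefix.endswith("q"):
        return math.log(0.9 if ch == "u" else 0.1)
    return math.log(0.5)


frames = [
    {"q": math.log(0.9), "g": math.log(0.1)},
    {"u": math.log(0.4), "v": math.log(0.6)},
]
uniform_prior = {ch: math.log(0.25) for ch in "qguv"}

print(beam_decode(frames, uniform_prior, toy_lm, alpha=1.0, beta=0.5))
print(beam_decode(frames, uniform_prior, toy_lm, alpha=0.0, beta=0.5))
```

With the LM switched off (`alpha=0.0`) the decoder follows the network's frame-wise preference for "v"; with the LM enabled, the context of the preceding "q" overrides it in favor of "u".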
II-C Hybrid Training
Our hybrid training procedure leverages the recognition performance achieved by the source model on the source language to then learn recognition over the target language. The overall process is shown in Algorithm 1.
The main difference between hybrid and source training is in the data used for learning. During hybrid training, part of the data comes from the source dataset (typically 50%) with the rest coming from the target dataset for which there are no ground truth transcriptions. However, the training loss for hybrid training is the same as in source training (Eq. 1), which means that to train we need to provide some transcriptions for the target data.
We obtain target transcriptions by applying the LM of the target language to the predictions of the network. The intuition is that due to the similarity of the source and target languages, the predictions of the network will be much better than random, though still quite poor at first. Applying the LM will improve the poor predictions to make better targets, which in turn helps the network to learn the target language better. We do, however, continue to train on source data to stabilize the learning process with real ground truth.
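The construction of a hybrid mini-batch can be sketched as follows. This is an illustrative outline rather than our training code: `predict` and `lm_decode` are hypothetical callables standing in for the network forward pass and LM decoding, and the default batch size of 8 and source fraction of 0.5 mirror the settings reported in our experiments.

```python
import random


def make_hybrid_batch(source_data, target_images, predict, lm_decode,
                      batch_size=8, source_fraction=0.5, rng=random):
    """Mix labeled source samples with pseudo-labeled target samples.

    source_data   : list of (image, transcription) pairs
    target_images : list of unlabeled images
    predict       : image -> frame-wise network output (hypothetical)
    lm_decode     : network output -> transcription (hypothetical)
    """
    n_source = round(batch_size * source_fraction)
    batch = rng.sample(source_data, n_source)
    for image in rng.sample(target_images, batch_size - n_source):
        # The LM-corrected prediction is treated as ground truth.
        pseudo_label = lm_decode(predict(image))
        batch.append((image, pseudo_label))
    return batch
```

Every pair in the returned batch can then be fed to the same loss as in source training, since target images now carry a transcription.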
At the beginning of hybrid training, the model has never seen any instances of characters that are only part of the target language and will make incorrect predictions. However, the LM can correct some of these errors based on the context of surrounding correct predictions. For example, English words contain no accented characters, so a source model trained on English would never predict accented characters, but French and Spanish do use accents. The LM is able to correct the model predictions to include accents and thus introduce these characters into the ground truth so the model can learn to predict these characters in the future.
Because LM decoding depends on the marginal distribution of CNN-BLSTM outputs in Eq. 3, we need to periodically update this quantity. This is done in lines 3-7 in Algorithm 1. In normal HWR model training, this is unnecessary because the LM is applied only as post-processing and not as part of the training process.
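One simple way to estimate the marginal distribution of network outputs is to average the frame-wise posteriors over a set of images. The sketch below assumes this averaging scheme, which is an assumption made for illustration; the function name and the dictionary representation of a frame are hypothetical.

```python
def estimate_state_prior(posteriors_per_image):
    """Average frame-wise posteriors over all frames of all images.

    posteriors_per_image: list of images, each a list of frames,
    each frame a dict mapping state -> posterior probability.
    """
    totals = {}
    n_frames = 0
    for frames in posteriors_per_image:
        for frame in frames:
            for state, p in frame.items():
                totals[state] = totals.get(state, 0.0) + p
            n_frames += 1
    return {state: total / n_frames for state, total in totals.items()}


outputs = [
    [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}],
    [{"a": 0.6, "b": 0.4}],
]
print(estimate_state_prior(outputs))
```

Because the network weights change during hybrid training, an estimate of this kind becomes stale and would be recomputed at intervals.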
In this work we use 4 datasets: the IAM, Rimes, Rodrigo, and Bentham (2014 HTR competition) collections. Each dataset is composed of a number of line images with corresponding ground truth transcriptions.
Rodrigo is a single author, 853 page Spanish manuscript written in 1545 with 20000 segmented line images. We used the first 750 pages as training data, the next 50 pages as validation data, and the remaining pages as test data. The annotations contain some meta information that we preprocessed to exclude. Some examples of this include symbols that indicate that whitespace should be inserted or deleted for correctness, i.e. the manuscript author did not conform to modern usage of whitespace.
The Bentham collection consists of the writings of the English philosopher Jeremy Bentham (1748-1832), though some images may be handwritten copies of his works produced by others. For preprocessing, we deskewed the line images and performed height normalization. For IAM, we use the standard split, merging the two defined validation sets. For Rimes, there is only a defined train/test split, so we used a subset of the training data for a validation set.
Each image collection has different low-level characteristics (e.g. color, texture), so we opted to binarize each dataset to eliminate those differences. This allows our analysis to focus on adapting to salient differences in language and style rather than on adapting to low-level domain differences. For IAM, Rimes, and Rodrigo, we used Otsu binarization, but for Bentham we used adaptive Wolf binarization because it produced visually better binarizations.
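For reference, Otsu's method selects the global threshold that maximizes the between-class variance of the gray-level histogram. The sketch below is a plain-Python illustration operating on a flat list of 8-bit pixel values; the function names and the convention that bright pixels map to 1 are assumptions, and a practical pipeline would normally rely on an image processing library instead.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance,
    where the two classes are pixels <= t and pixels > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray levels at or below the candidate threshold
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 1 and the rest to 0."""
    return [1 if p > threshold else 0 for p in pixels]
```

Unlike this single global threshold, adaptive methods such as Wolf binarization compute a threshold per local window, which can help on unevenly illuminated or degraded pages.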
To train the LMs for each dataset used in most experiments, we used the transcriptions of the training data. Though this corresponds to having an optimal LM for hybrid training, we also explore using LMs trained on unrelated data. For these LMs, we sampled 50000 sentences from the United Nations proceedings subset of the Europarl machine translation dataset in Spanish, English, and French.
To obtain the character classes predicted by our models, we take the union of the character sets of each dataset. Because of this, during source model training, the classifiers output distributions over all characters, not just those characters contained in the source dataset. This way if the target dataset has additional characters, we do not need to modify the classifiers before hybrid training.
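Building a shared output alphabet from several datasets can be sketched as below. The function name is hypothetical, and reserving index 0 for the CTC blank label is an assumed convention rather than a detail stated above.

```python
def build_char_classes(transcriptions_by_dataset):
    """Map every character seen in any dataset to a class index.

    Index 0 is left free for the CTC blank (assumed convention).
    """
    charset = set()
    for lines in transcriptions_by_dataset.values():
        for line in lines:
            charset.update(line)
    # Sort for a deterministic mapping shared by all models.
    return {ch: idx for idx, ch in enumerate(sorted(charset), start=1)}
```

Because the mapping covers the union of all character sets, a classifier trained on the source data already has output units for characters that occur only in the target data.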
IV Experimental Results
Table II: CER (%) after hybrid training for each source and target dataset pair.

| Source Epochs | LM Data | LM Params | Source Amount | Bentham to IAM | Bentham to Rimes | Bentham to Rodrigo | IAM to Bentham | IAM to Rimes | IAM to Rodrigo | Rimes to Bentham | Rimes to IAM | Rimes to Rodrigo | Rodrigo to Bentham | Rodrigo to IAM | Rodrigo to Rimes | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Train | 0.4 / 0.5 | 50% | 9.2 | 7.3 | 12.0 | 8.2 | 6.3 | 13.0 | 8.8 | 8.2 | 12.6 | 64.5 | 77.2 | 35.1 | 8.0 |
| 50 | Train | 0.4 / 0.5 | 50% | 9.8 | 7.5 | 13.3 | 8.6 | 6.3 | 13.5 | 9.3 | 8.4 | 13.4 | 70.3 | 74.3 | 70.1 | 8.3 |
| 10 | Train | 0.4 / 0.5 | 75% | 9.3 | 7.6 | 13.2 | 7.5 | 5.7 | 12.1 | 9.4 | 7.8 | 12.8 | 59.4 | 64.7 | 33.4 | 7.8 |
| 10 | Train | 0.4 / 0.5 | 25% | 10.7 | 7.8 | 13.1 | 9.3 | 6.3 | 11.9 | 8.5 | 8.7 | 12.6 | 71.8 | 73.8 | 37.5 | 8.6 |
| 10 | Train | 0.8 / 0.4 | 50% | 12.1 | 8.1 | 100.0 | 8.2 | 6.7 | 97.0 | 8.9 | 8.9 | 13.5 | 67.2 | 80.1 | 18.3 | 8.8 |
| 10 | Train | 1.2 / 0.3 | 50% | 30.8 | 9.3 | 91.7 | 11.1 | 7.2 | 79.1 | 10.3 | 12.4 | 96.0 | 66.7 | 74.1 | 11.1 | 13.5 |
| 10 | Europarl | 0.4 / 0.5 | 50% | 32.6 | 80.5 | 85.8 | 13.4 | 12.0 | 23.6 | 20.7 | 18.6 | 36.8 | 99.8 | 99.1 | 99.8 | 29.6 |
| Source Model - no Hybrid Training | | | | 45.5 | 43.3 | 26.1 | 27.8 | 16.1 | 24.3 | 35.0 | 24.6 | 34.4 | 59.5 | 67.0 | 67.1 | 32.1 |
The following experiments use this protocol. For source models, we trained 4 models for each dataset for 10 epochs using the ADAM optimizer to perform weight updates. We then selected the best model using the Character Error Rate (CER) on the validation set after performing LM decoding using the dataset-specific LM. All reported numbers for source models are on the designated test splits for each dataset. These source models were used as the initial models in all hybrid training experiments, except where noted.
For hybrid training, we also trained 4 models where each hybrid model is initialized with the weights learned on the source dataset. Hybrid models are trained for approximately 12000 weight updates using mini-batches of 8 images, where mini-batches contain both source and target images. To report metrics, we select the best model based on the validation set for the target data and then evaluate this model on the target data. While in practice this is not feasible because target data would not have ground truth transcriptions, this allowed us to fairly compare different methods of hybrid training. We leave a method for selecting the best model without using ground truth as future work.
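Since all model selection and reporting is based on CER, a minimal sketch of the metric may be useful. It assumes the common definition, total character-level edit distance divided by the total number of reference characters, expressed as a percentage; the exact normalization used by a given evaluation tool may differ, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character Error Rate in percent over a set of text lines."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * errors / chars


print(cer(["hello"], ["hallo"]))
```

Note that CER can exceed 100% when a hypothesis contains many more characters than its reference.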
IV-A Source Model Evaluation
Table I shows the CER of source models when evaluated on each dataset. As expected, source models obtain low CER when the test data matches the training data and high CER when there is a mismatch. Though this result may be obvious, it demonstrates the need for our hybrid training methodology in order to transfer models from one language to another. We also note that even though IAM and Bentham are both English datasets, models trained on one do not perform well on the other and have need of transfer learning.
The CERs obtained are competitive when compared with previous results reported in the literature. For example, prior work reports CERs of 3.9 and 3.8 for IAM and Rimes respectively, while we achieve CERs of 8.4 and 4.9. A CER of 3.0 has been reported on the Rodrigo dataset, though this number is not directly comparable to our reported results because it uses a different data split and transcription preprocessing. Additionally, we binarized our data for transferability, and CNN-BLSTM models generally perform better when using grayscale inputs. The best CER on Bentham reported in the 2014 ICFHR HTR competition is 5.0 for the restricted track. Also, our reported numbers are for source models that have not been trained to convergence (this improves hybrid training), but further training of the source models produces 1-2% lower CERs.
IV-B Hybrid Training
In hybrid training, we varied 4 factors to gain a better understanding of the sensitivities of the method:
- Length of source model training time
- Proportion of source and target data
- Data used to train the LM
- The LM decoding hyperparameters
Table II shows the CER after hybrid training for all language pairs for all experimental settings. Here we explain the column semantics of Table II. Source Epochs indicates how long source models were trained before hybrid training began. We also varied the data used for LM training - either the ground truth training set transcriptions, Europarl corpus subset, or no LM was used. The next 2 columns respectively indicate the LM hyperparameters and percentage of source data used in hybrid training. Remaining columns indicate Source-Target dataset pairs, where the first header row indicates the source language with target languages listed below. For example, the first data column is Bentham as the source with IAM as the target. The last column shows the average performance of the 6 language pairs involving Bentham, IAM, and Rimes. For this average, we excluded Rodrigo because of the extremely high CERs of unsuccessful transfers, which would dominate the average. For comparison, the last row shows performance of the source models before hybrid training, i.e. the off-diagonal entries of Table I.
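As a worked check of how the average column is formed, the sketch below recomputes it for the first row of Table II. The pair labels assume that targets are listed in the same dataset order as the sources (Bentham, IAM, Rimes, Rodrigo); this ordering is an assumption for the two non-Rodrigo targets in each group, though it does not affect the resulting average.

```python
# First row of Table II: CER (%) per (source, target) pair.
first_row = {
    ("Bentham", "IAM"): 9.2, ("Bentham", "Rimes"): 7.3, ("Bentham", "Rodrigo"): 12.0,
    ("IAM", "Bentham"): 8.2, ("IAM", "Rimes"): 6.3, ("IAM", "Rodrigo"): 13.0,
    ("Rimes", "Bentham"): 8.8, ("Rimes", "IAM"): 8.2, ("Rimes", "Rodrigo"): 12.6,
    ("Rodrigo", "Bentham"): 64.5, ("Rodrigo", "IAM"): 77.2, ("Rodrigo", "Rimes"): 35.1,
}

# Average over the 6 pairs that do not involve Rodrigo.
kept = [cer for pair, cer in first_row.items() if "Rodrigo" not in pair]
avg_without_rodrigo = sum(kept) / len(kept)

# Average over all 12 pairs, for comparison.
avg_all = sum(first_row.values()) / len(first_row)

print(round(avg_without_rodrigo, 1), round(avg_all, 1))
```

Including the Rodrigo pairs would almost triple the average, which illustrates why the unsuccessful transfers are left out of the summary column.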
Considering the first 4 rows of Table II, pairwise transfers between Bentham, IAM, and Rimes are extremely successful, achieving CERs near to those obtained with full supervised training in some cases. It is interesting that while these three datasets can transfer to Rodrigo with CERs of about 13%, the reverse is not true. Only Rodrigo to Rimes hybrid training managed to significantly improve the CER over the source model, achieving 11.1% CER under one set of experimental conditions, though this language pair appears sensitive to variations in experimental conditions.
When the source model is trained to convergence, i.e. trained for 50 epochs instead of 10, CER on the source data improves by about 1-2% (data not shown), but the CER after transferring increases for all language pairs except one. The average CER increases by 0.3%. This is because after training for so long, the models can overfit the source data and may have difficulty unlearning factors unique to the source dataset.
Next we examined what proportion of source and target data is used during hybrid training. Overall, using 75% source data produces an average CER of 7.8% vs 8.0% for equal proportions and 8.6% for 25% source data. We also note that the optimal percentage of source data varies by language pair. Because the target labels provided by the LM are not always correct, the model can diverge if it is presented with too many poor quality target labels. Source data helps stabilize hybrid training, so using a larger proportion of source data may make training more stable.
Next we examined the two LM hyperparameters used during LM decoding (Eqs. 2, 3). We determined their default values by cross validation to optimize the CER of source models evaluated on the datasets that they were not trained on (i.e., the off-diagonal entries of Table I). For example, Fig. 2 shows heatmaps for the source IAM model evaluated on Bentham and Rimes. When evaluating the IAM model on Bentham or Rimes, the best-performing hyperparameter values differ from those that perform best when evaluating on IAM itself. We saw a similar trend when evaluating the Bentham source model on the other datasets.
A similar trend also holds for hybrid training. Our default parameters (0.4 / 0.5 in Table II) achieved an average CER of 8.0%, which is lower than the 8.8% obtained with 0.8 / 0.4 and the 13.5% obtained with 1.2 / 0.3. Also, transferring to Rodrigo becomes unsuccessful when using these alternate parameters. However, it is interesting that these parameter settings greatly improve transfer from Rodrigo to Rimes (achieving 18.3 and 11.1 CER). Thus the optimal LM hyperparameters vary based on the language pair, and unfortunately, they cannot be estimated by cross validation in a real setting, as cross validation relies on the ground truth for the target language.
We conclude our experiments by varying the data used to train the LM. If we do not apply LM decoding during hybrid training (or equivalently use a LM where all sequences of characters are equally likely), we see that some language pairs improve over the source model performance, though some do not improve or get worse. When we use the Europarl trained LMs, we see degraded performance with respect to the LMs trained on the dataset training sets, but this is expected to some degree. The Europarl corpus uses very formal language, and the modern Spanish is very different from the historical Spanish used in the 1545 Rodrigo manuscript. Transferring Bentham to IAM and vice versa, using the Europarl English LM greatly improves CER compared to using no LM at all. The same is also true for IAM and Rimes.
In this work we proposed a methodology that trains HWR on a target language without using any labeled data in that language. It does so by leveraging labeled images in a closely related source language and a language model in the target language. After training a source model, we train on both the source and target data, imputing target labels using the current model predictions decoded by the LM. We demonstrate that our approach is successful on many pairs of languages using the IAM, Rimes, Bentham, and Rodrigo datasets. We explored the design choices of our hybrid training approach and draw conclusions about the LM training data, LM hyperparameters, amount of source data in hybrid training, and length of source model training.
-  A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 855–868, 2009.
-  P. Y. Simard, D. Steinkraus, J. C. Platt et al., “Best practices for convolutional neural networks applied to visual document analysis.” in ICDAR, vol. 3, 2003, pp. 958–962.
-  T. Varga and H. Bunke, “Perturbation models for generating synthetic training data in handwriting recognition,” in Machine Learning in Document Analysis and Recognition. Springer, 2008, pp. 333–360.
-  C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen, “Data augmentation for recognition of handwritten words and lines using a cnn-lstm network,” in ICDAR, vol. 1. IEEE, 2017, pp. 639–645.
-  C.-H. Tung, Y.-J. Chen, and H.-J. Lee, “Performance analysis of an ocr system via an artificial handwritten chinese character generator,” Pattern Recognition, vol. 27, no. 2, pp. 221–232, 1994.
-  D.-H. Lee and H.-G. Cho, “A new synthesizing method for handwriting korean,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 12, no. 01, pp. 45–61, 1998.
-  Y. Elarian, I. Ahmad, S. Awaida, W. G. Al-Khatib, and A. Zidouri, “An arabic handwriting synthesis system,” Pattern Recognition, vol. 48, no. 3, pp. 849–861, 2015.
-  V. Frinken, A. Fischer, H. Bunke, and A. Fornés, “Co-training for handwritten word recognition,” in ICDAR. IEEE, 2011, pp. 314–318.
-  G. R. Ball and S. N. Srihari, “Semi-supervised learning for handwriting recognition,” in ICDAR. IEEE, 2009, pp. 26–30.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine learning. ACM, 2006, pp. 369–376.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011.
-  T. Bluche, H. Ney, and C. Kermorvant, “A comparison of sequence-trained deep neural networks and recurrent neural networks optical modeling for handwriting recognition,” in International Conference on Statistical Language and Speech Processing. Springer, 2014, pp. 199–210.
-  P. Voigtlaender, P. Doetsch, and H. Ney, “Handwriting recognition with large multidimensional long short-term memory recurrent neural networks,” in ICFHR. IEEE, 2016, pp. 228–233.
-  S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, vol. 13, no. 4, pp. 359–394, 1999.
-  M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” in Springer Handbook of Speech Processing. Springer, 2008, pp. 559–584.
-  U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting recognition,” IJDAR, vol. 5, no. 1, pp. 39–46, 2002.
-  E. Augustin, J.-m. Brodin, M. Carré, E. Geoffrois, E. Grosicki, and F. Prêteux, “RIMES evaluation campaign for handwritten mail processing,” in IWFHR, 2006.
-  N. Serrano, F. Castro, and A. Juan, “The rodrigo database.” in LREC, 2010, pp. 19–21.
-  J. A. Sánchez, V. Romero, A. H. Toselli, and E. Vidal, “Icfhr2014 competition on handwritten text recognition on transcriptorium datasets (htrts),” in ICFHR. IEEE, 2014, pp. 785–790.
-  N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
-  C. Wolf, J.-M. Jolion, and F. Chassaing, “Text Localization, Enhancement and Binarization in Multimedia Documents,” in Proceedings of the International Conference on Pattern Recognition, vol. 2, 2002, pp. 1037–1040.
-  P. Koehn, “Europarl: A parallel corpus for statistical machine translation,” in MT summit, vol. 5, 2005, pp. 79–86.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  E. Granell, E. Chammas, L. Likforman-Sulem, C.-D. Martínez-Hinarejos, C. Mokbel, and B.-I. Cîrstea, “Transcription of spanish historical handwritten documents with deep neural networks,” Journal of Imaging, vol. 4, no. 1, p. 15, 2018.