Since its introduction, the CRNN architecture for Handwritten Text Recognition (HTR) has repeatedly set state-of-the-art results Doetsch et al. (2014); Pham et al. (2014); Voigtlaender et al. (2016) and has been deployed in industrial applications Bluche and Messina (2017); Borisyuk et al. (2018). The gist of CRNN is a convolutional feature extractor that encodes visual details into latent vectors, followed by a recurrent sequence decoder that turns those latent vectors into human-readable characters. The whole architecture is trained end-to-end via the Connectionist Temporal Classification (CTC) loss function Graves et al. (2006) or an attention mechanism Bahdanau et al. (2015); Hori et al. (2017); Kim et al. (2017).
Inside CRNN, the sequence decoder (often implemented as a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997)) has been reported to act as a language model Sabir et al. (2017). In Sabir et al. (2017), the authors observe that a trained OCR model attains higher accuracy on meaningful text lines than on random ones. Experimental results from Ul-Hasan and Breuel (2013) also support this claim: an OCR model performs worse when tested on languages other than the one it was trained on (see Fig. 1). These results make intuitive sense, because knowledge of the surrounding characters provides clues for the correct prediction. We hypothesize that this effect is even more pronounced for HTR, where handwriting-style variations and real-world conditions can render characters visually confusing or unrecognizable, so that predicting such characters is only feasible by referring to the surrounding context. Consequently, enhancing the sequence decoder should improve CRNN performance.
Despite these promising empirical results, the LSTM, like other RNN-based models, is incapable of remembering long context due to the vanishing/exploding gradient problem Bengio et al. (1994); Le et al. (2019); Pascanu et al. (2013). Current HTR datasets, which contain only scene text or short segments of text, are not challenging enough to expose this weakness. However, on industrial data, which often involve transcribing whole lines of documents written in complicated languages, RNN-based decoders may fail to achieve good results.
Recently, a new class of neural network architectures, called Memory-Augmented Neural Networks (MANNs), has demonstrated the potential to replace RNN-based methods in sequential modeling. In essence, a MANN is a recurrent controller network equipped with an external memory module, with which the controller interacts via attention mechanisms Graves et al. (2014). Compared to an LSTM, which stores its hidden state in a single vector of memory cells, a MANN can store and retrieve its hidden states across multiple memory slots, making it more robust against the exploding/vanishing gradient problem. MANNs have been shown experimentally to outperform LSTMs on language modeling tasks Gulcehre et al. (2017); Kumar et al. (2016), and should thus benefit HTR, where language modeling supports recognition. However, to the best of our knowledge, no existing work employs a MANN for this problem.
In this work, we adapt recent memory-augmented neural networks by integrating an external memory module into a convolutional neural network. The CNN layers read the input image and encode it into a sequence of visual features. At each timestep of the sequence, two controllers store the features into the memory, access the memory in multiple refinement steps, and generate an output. The output is passed to a CTC layer to produce the CTC loss and the final predicted character. In summary, our main contributions are:
We introduce a memory-augmented recurrent convolutional architecture for OCR, called Convolutional Multi-way Associative Memory (CMAM),
We demonstrate the new architecture’s performance on three handwriting datasets: the English IAM, the Chinese SCUT-EPT, and a private Japanese dataset.
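As a shape-level sketch of this pipeline (with NumPy stand-ins for the trained CNN and memory decoder; the dimensions, class count, and function names below are our illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, N_CLASSES = 64, 128, 80   # feature-sequence length, feature dim, vocab + CTC blank (assumed)

def cnn_encoder(image):
    """Stand-in for the convolutional extractor: a grayscale line image
    becomes a (T, D) sequence of column-wise visual features."""
    return rng.standard_normal((T, D))

def memory_decoder(features):
    """Stand-in for the memory-augmented decoder (Sec. 3.2): per-timestep
    logits that would feed the CTC layer."""
    W = rng.standard_normal((features.shape[1], N_CLASSES)) * 0.01
    return features @ W

logits = memory_decoder(cnn_encoder(np.zeros((32, 256))))
assert logits.shape == (T, N_CLASSES)
```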
2 Related Work
2.1 Convolutional Recurrent Neural Networks
Attempts to improve CRNN have mostly focused on the convolutional encoder. Starting from the vanilla implementation Shi et al. (2017), convolution layers have since been combined with gating mechanisms Bluche and Messina (2017) and recurrent connections Wang and Hu (2017); Lee and Osindero (2016); et al. (2018); Gao et al. (2017); Zhan et al. (2017), and used alongside dropout Pham et al. (2014); Puigcerver (2017) or Maxout He et al. (2016). Multi-Dimensional LSTM (MDLSTM) Graves et al. (2007), which is often used in the feature encoder, can also serve as the sequence decoder Graves and Schmidhuber (2009); Pham et al. (2014); Voigtlaender et al. (2016). Improvements to the sequence decoder, on the other hand, have received little attention. Sun et al. (2017) propose GMU, inspired by the architectures of both LSTM and GRU (Chung et al., 2014), and achieve good results on both online English and Chinese handwriting recognition. Doetsch et al. (2014) control the shape of the LSTM gating functions with learnable scales. Recently, there have been attempts to replace CTC with a Seq2Seq decoder, both without Sahu and Sukhwani (2015) and with Bluche et al. (2017) attention support for decoding. Hybrid models combining CTC and attention decoders Hori et al. (2017); Kim et al. (2017) have also been proposed and yield large improvements in Chinese and Japanese speech recognition.
2.2 Memory-augmented Neural Networks
Memory-augmented neural networks (MANNs) have emerged as a promising new research topic in deep learning. In a MANN, the interaction between the controller and the memory is differentiable, allowing it to be trained end-to-end with the other components of the neural network Graves et al. (2014); Weston et al. (2014). Compared to the LSTM, it has been shown to generalize better on sequences longer than those seen during training Graves et al. (2014, 2016). This improvement comes at the expense of extra computational cost, although recent advances in memory addressing mechanisms allow MANNs to perform much more efficiently Le et al. (2019). In practice, MANNs have been applied to many problems, such as meta learning Santoro et al. (2016), healthcare Le et al. (2018b, c), dialog systems Le et al. (2018a), process analytics (Khan et al., 2018), and extensively to question answering Miller et al. (2016) and language modeling Gulcehre et al. (2017). Our work (CMAM) is one of the first attempts to utilize a MANN for HTR tasks.
3 Proposed System
3.1 Visual Feature Extraction with Convolutional Neural Networks
In our CMAM, the convolutional component is constructed by stacking convolutional and max-pooling layers as in a standard CNN model. This component extracts a sequential feature representation from an input image. Before being fed into the network, all images are scaled to the same height. A sequence of feature vectors is then extracted from the feature maps produced by the convolutional layers and serves as the input to the memory module. Specifically, the feature vectors are generated from the feature maps column by column, from left to right: the t-th feature vector is the concatenation of the t-th columns of all the maps (see Fig. 2). Each feature vector is then fed to a fully-connected layer to produce the final input for the memory module.
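The column-wise unrolling of the feature maps can be sketched as follows (NumPy, with toy dimensions of our choosing; the subsequent fully-connected projection is omitted):

```python
import numpy as np

def feature_maps_to_sequence(maps):
    """maps: (C, H', W') feature maps from the CNN. Returns a (W', C*H')
    sequence: the t-th vector concatenates the t-th column of every map."""
    C, H, W = maps.shape
    return maps.transpose(2, 0, 1).reshape(W, C * H)   # (W', C, H') -> (W', C*H')

maps = np.arange(2 * 3 * 4).reshape(2, 3, 4)           # toy maps: C=2, H'=3, W'=4
seq = feature_maps_to_sequence(maps)
assert seq.shape == (4, 6)
assert seq[0].tolist() == [0, 4, 8, 12, 16, 20]        # column 0 of both maps
```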
3.2 Sequential Learning with Multi-way Associative Memory
In this section, we propose a new memory-augmented neural network, namely the Multi-way Associative Memory (MAM), designed for HTR tasks.
3.2.1 Memory-augmented neural network overview
A memory-augmented neural network (MANN) consists of an LSTM controller that frequently accesses and modifies an external memory M ∈ R^{N×D}, where N and D are the number of memory slots and the dimension of each slot, respectively. At each timestep t, the controller receives the concatenation of the input signal x_t and the read values r_{t−1} retrieved from the memory at the previous step, and updates its hidden state h_t and output o_t as follows,
h_t, o_t = LSTM([x_t; r_{t−1}], h_{t−1}).
The controller output o_t is then used to compute the memory interface ξ_t and the short-term output ν_t by applying linear transformations,
ξ_t = W_ξ o_t,  ν_t = W_ν o_t,
where W_ξ and W_ν are trainable weight parameters. The interface ξ_t in our model is a set of vectors responsible for controlling memory access, which comprises the memory reading and writing procedures (see 3.2.3 and 3.2.4, respectively). After the memory is accessed, the read values r_t for the current step are computed and combined with the short-term output ν_t, producing the final output y_t of the memory module as follows,
y_t = ν_t + W_r r_t,
where W_r is a trainable weight parameter. With the integration of the read values, the final output contains not only short-term information from the controller but also long-term context from the memory. This design was first proposed in Graves et al. (2014) and has become a standard generic memory architecture for sequential modeling Graves et al. (2014, 2016); Le et al. (2018a, b, c).
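A minimal sketch of one such MANN step, assuming toy dimensions and a tanh cell standing in for the LSTM controller (all weight names are ours, and the interface is simplified to plain read scores over slots):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 16, 16, 4              # memory slots, slot width, read heads (tuned values, Sec. 4.2)
x_dim, h_dim, y_dim = 8, 12, 10  # toy input/hidden/output sizes (ours)

# Randomly initialised stand-ins for trained weights; all names are ours.
W_h  = rng.standard_normal((x_dim + R * D + h_dim, h_dim)) * 0.1
W_xi = rng.standard_normal((h_dim, N)) * 0.1     # interface, simplified to read scores
W_nu = rng.standard_normal((h_dim, y_dim)) * 0.1
W_r  = rng.standard_normal((R * D, y_dim)) * 0.1
M    = rng.standard_normal((N, D))               # external memory

def step(x_t, r_prev, h_prev):
    """One MANN step: the controller sees [x_t; r_{t-1}], emits an interface
    and a short-term output; read values feed back into the final output."""
    h_t = np.tanh(np.concatenate([x_t, r_prev, h_prev]) @ W_h)  # tanh cell as LSTM stand-in
    o_t = h_t                                     # hidden state doubles as controller output
    scores = o_t @ W_xi
    w = np.exp(scores) / np.exp(scores).sum()     # content weights over the N slots
    r_t = np.tile(w @ M, R)                       # R identical read heads, for brevity
    nu_t = o_t @ W_nu                             # short-term output
    y_t = nu_t + r_t @ W_r                        # plus long-term context from memory
    return y_t, r_t, h_t

y, r, h = step(np.ones(x_dim), np.zeros(R * D), np.zeros(h_dim))
assert y.shape == (y_dim,) and r.shape == (R * D,)
```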
3.2.2 Multi-way Associative Memory
We extend the standard memory-augmented architecture by marrying bidirectional control with multi-hop memory access. In particular, we use two controllers: a forward and a backward controller (both implemented as LSTMs). The forward controller reads the inputs in forward order (from timestep 1 to T) together with the contents read from the memory; it captures both short-term information from the past and long-term knowledge from the memory. The backward controller reads the inputs in backward order (from timestep T to 1) and thus captures only short-term information from the future. The backward controller may be useful when the near-future timesteps contribute locally to recognizing the character at the current timestep. However, it is unlikely that long-term future timesteps have a causal impact on the current temporary output, which is why the backward controller does not need to read content from the memory.
Moreover, our architecture supports multi-step computation to refine the outputs of the controllers (o_t^f and o_t^b) before producing the final output for that timestep. Let K denote the number of refinement steps. At the k-th refinement step, the short-term output ν_t^{k−1} of the previous refinement is used as the input to the controllers. In particular, the controllers compute their hidden states and temporary outputs for this refinement as follows,
h_t^{f,k}, o_t^{f,k} = LSTM_f([ν_t^{k−1}; r_t^{k−1}], h_{t−1}^{f,k}),  h_t^{b,k}, o_t^{b,k} = LSTM_b(ν_t^{k−1}, h_{t+1}^{b,k}),
where ν_t^0 = x_t. At timestep t, the forward controller updates its hidden state and computes the temporary output o_t^{f,k}. This temporary output is stored in a buffer, waiting for the backward controller’s temporary output o_t^{b,k}. The two buffered outputs are then used to compute the memory interface vector ξ_t^k and the short-term output ν_t^k as follows,
ξ_t^k = W_ξ [o_t^{f,k}; o_t^{b,k}],  ν_t^k = W_ν [o_t^{f,k}; o_t^{b,k}].
The refinement process simulates human multi-step reasoning: we often consult our memory several times before making a final decision. At each refinement step, the controllers re-access the memory to obtain information representing the current stage of thinking, which is richer than the raw representation stored in the memory before refinement. For example, without refinement, at timestep t the forward controller can only read values containing information from the past. From the first refinement step (k = 1) onward, however, the memory is already filled with information from the whole sequence, so a refined read at timestep t can incorporate (arbitrarily far, if necessary) future information. Fig. 3 illustrates the flow of operations in the architecture.
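The control flow of the refinement loop might be sketched as below (a structural sketch only; the step functions are placeholders for the forward/backward controllers, and k = 0 denotes the initial, unrefined pass):

```python
def refine(x_seq, K, forward_step, backward_step, combine):
    """Run K refinement passes over the whole sequence. Each step function
    maps (input, state) -> (state, output); `combine` fuses the two
    controllers' outputs into the short-term output nu[t] for the next pass."""
    T = len(x_seq)
    nu = list(x_seq)                      # nu^0: the raw inputs
    for k in range(K + 1):                # k = 0 is the initial (unrefined) pass
        h_f, fwd = None, []
        for t in range(T):                # forward controller (also reads memory)
            h_f, o_f = forward_step(nu[t], h_f)
            fwd.append(o_f)
        h_b, bwd = None, [None] * T
        for t in reversed(range(T)):      # backward controller (inputs only)
            h_b, bwd[t] = backward_step(nu[t], h_b)
        nu = [combine(fwd[t], bwd[t]) for t in range(T)]
    return nu

# Toy controllers, to show the data flow only.
out = refine([1.0, 2.0], K=1,
             forward_step=lambda x, h: (h, x + 1),
             backward_step=lambda x, h: (h, x * 2),
             combine=lambda a, b: a + b)
assert out == [13.0, 22.0]
```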
After the refinement process, the final output is computed by the generation unit as follows,
3.2.3 Memory reading
For the i-th read head, the addressing mechanism is based on the cosine similarity measure,
S(u, v) = (u · v) / (‖u‖ ‖v‖),
which is used to produce a content-based read weight w_t^{r,i} ∈ R^N whose elements are computed by a softmax over the memory locations,
w_t^{r,i}(n) = softmax(β_t^{r,i} S(k_t^{r,i}, M_t(n))).
Here, k_t^{r,i} is the read key and β_t^{r,i} is the strength parameter. After the read weights are determined, the i-th read value is retrieved as follows,
r_t^i = Σ_{n=1}^{N} w_t^{r,i}(n) M_t(n).
The final read-out is the concatenation of all read values, r_t = [r_t^1; …; r_t^R].
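A minimal NumPy sketch of this content-based addressing (cosine similarity sharpened by the strength parameter, then softmax-normalized):

```python
import numpy as np

def content_read(M, key, beta):
    """Cosine similarity between the read key and each memory slot,
    sharpened by the strength beta and softmax-normalised into read weights."""
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * (sim - sim.max()))          # numerically stable softmax
    w = e / e.sum()
    return w, w @ M                               # read weights and read value

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w, r = content_read(M, key=np.array([1.0, 0.0]), beta=10.0)
assert abs(w.sum() - 1.0) < 1e-6
assert w.argmax() == 0          # the slot most similar to the key dominates
```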
Different from Graves et al. (2016), we exclude the temporal linkage reading mechanism from our memory reading. Recent analyses in Franke et al. (2018) reveal that this mechanism increases computation time and physical storage dramatically. By examining memory usage, Franke et al. (2018) also show that temporal linkage reads are barely used in realistic tasks and can thus be safely removed without hurting performance much. We follow this practice and keep only the content-based reading mechanism. We note that this key-value retrieval resembles a traditional associative memory system (Baird et al., 1993). The mechanism is critical for OCR tasks, where referring to previous exposures of a visual feature can consolidate the confidence in judging the current one. Compared to the recurrent networks in current OCR systems, which use only the single visual feature stored in the hidden state to predict the output, referencing multiple visual features provides the model with richer information and thus better predictions.
3.2.4 Memory writing
To build the writing mechanism for our memory, we apply three different writing strategies. First, we make use of the dynamic memory allocation of Graves et al. (2016). This strategy tends to write to the least-used memory slots. We use the memory retention vector ψ_t to determine how much of each memory location will not be freed, as follows,
ψ_t = Π_{i=1}^{R} (1 − f_t^i w_{t−1}^{r,i}),
where, for each read head i, f_t^i denotes the free gate emitted by the interface and w_{t−1}^{r,i} denotes the read weighting vector from the previous timestep. The usage over memory locations at the current timestep is given by the memory usage vector u_t,
u_t = (u_{t−1} + w_{t−1}^w − u_{t−1} ⊙ w_{t−1}^w) ⊙ ψ_t.
The allocation vector a_t is then defined as follows,
a_t[φ_t[j]] = (1 − u_t[φ_t[j]]) Π_{m=1}^{j−1} u_t[φ_t[m]],
in which φ_t contains the indices of the memory locations sorted by usage u_t in ascending order.
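The retention and allocation computations can be sketched as follows (following the DNC-style formulas; variable names are ours):

```python
import numpy as np

def memory_retention(free_gates, read_weights_prev):
    """psi_t: element-wise product over read heads of (1 - f_i * w_{t-1}^{r,i});
    a slot that was just read with its free gate open is likely to be freed."""
    psi = np.ones_like(read_weights_prev[0])
    for f, w in zip(free_gates, read_weights_prev):
        psi = psi * (1.0 - f * w)
    return psi

def allocation_weight(u):
    """a_t: sort slots by usage (ascending); the freest slot gets the
    largest allocation weight, discounted by the usage of freer slots."""
    a = np.zeros_like(u)
    cum = 1.0
    for j in np.argsort(u):          # phi_t: indices in increasing usage
        a[j] = (1.0 - u[j]) * cum
        cum *= u[j]
    return a

u = np.array([0.9, 0.1, 0.5])
a = allocation_weight(u)
assert a.argmax() == 1               # the least-used slot is written first
psi = memory_retention([1.0], [np.array([0.5, 0.0, 0.0])])
assert np.allclose(psi, [0.5, 1.0, 1.0])
```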
The second strategy, which we propose, is last-read writing, in which the location to be written is the previously read location. We define the previous read location l_t by averaging the previous read weights,
l_t = (1/R) Σ_{i=1}^{R} w_{t−1}^{r,i}.
By writing to the previously read address, we assume that once read, the content at that address is no longer important for future predictions. This assumption makes sense in the OCR setting, where some visual features take part in recognizing only one character. After the model refers to these visual features to make predictions, it is safe to remove them from the memory to save space for other important features.
The third strategy is the common content-based writing, which is analogous to content-based reading. A content-based write weight c_t^w is computed as follows,
c_t^w(n) = softmax(β_t^w S(k_t^w, M_t(n))),
where k_t^w and β_t^w are the key and strength parameters for content-based writing, respectively.
To allow the model to select among the strategies and to refuse writing altogether, a write-mode indicator π_t ∈ R^3 and a write gate g_t^w are used to compute the final write weight w_t^w as follows,
w_t^w = g_t^w (π_t[1] a_t + π_t[2] l_t + π_t[3] c_t^w),
where the write-mode indicator π_t and the write gate g_t^w are normalized using softmax and sigmoid functions, respectively.
Finally, the write weight w_t^w is used together with the update value v_t and the erase value e_t to modify the memory content as follows,
M_t = M_{t−1} ⊙ (E − w_t^w e_t^⊤) + w_t^w v_t^⊤,
where E is a matrix of ones and ⊙ is the element-wise product.
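Putting the three strategies together, a sketch of the gated write and the erase/add memory update (toy sizes; in the real model pi and g_w would come from the interface rather than being set by hand):

```python
import numpy as np

def write_and_update(M, a, l, c, pi, g_w, v, e):
    """Blend the allocation (a), last-read (l) and content (c) write
    weights with mode weights pi and write gate g_w, then apply the
    NTM/DNC-style erase-then-add update."""
    w = g_w * (pi[0] * a + pi[1] * l + pi[2] * c)       # final write weight
    M = M * (1.0 - np.outer(w, e)) + np.outer(w, v)     # erase, then add
    return M, w

N, D = 4, 3
M0 = np.zeros((N, D))
a = np.array([1.0, 0.0, 0.0, 0.0])      # allocation picks slot 0
l = np.array([0.0, 1.0, 0.0, 0.0])      # last-read points at slot 1
c = np.array([0.0, 0.0, 1.0, 0.0])      # content match at slot 2
pi = np.array([1.0, 0.0, 0.0])          # mode indicator: choose allocation
M1, w = write_and_update(M0, a, l, c, pi, g_w=1.0, v=np.ones(D), e=np.zeros(D))
assert np.allclose(M1[0], 1.0) and np.allclose(M1[1:], 0.0)
```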
4 Experimental Evaluation
4.1 Experimental Settings
The main baseline used across experiments is the traditional CRNN with the same CNN architecture presented in 3.1, coupled with the bidirectional 1D LSTM layers proposed in (Puigcerver, 2017). We use the DNC (Graves et al., 2016) as the decoder for the visual sequence to form another MANN baseline against our memory-based architecture. Finally, we also include results reported in previous works on the SCUT-EPT dataset.
To validate our proposed model, we select IAM (Marti and Bunke, 2002) and SCUT-EPT (Zhu et al., 2018), two public datasets written in English and Chinese, respectively. We also collect a private dataset from our partner, a large corporation in Japan. This dataset consists of scanned documents written by Japanese scientists and thus contains noise, special symbols, and more characters per line than the other datasets. The statistics of the three datasets are summarized in Table 1.
Since our aim is to recognize long text lines with end-to-end models, we do not segment the lines into words. Rather, we train the models on whole lines and report the Character Error Rate (CER), Correct Rate (CR), and Accuracy Rate (AR) (Yin et al., 2013). To accommodate the Chinese and Japanese datasets, where there is no white-space between words, we exclude white-space from the vocabulary and do not measure word-error-rate metrics.
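For concreteness, CER can be computed as the Levenshtein edit distance normalized by the reference length (a standard definition; CR and AR follow Yin et al. (2013) and are not reproduced here):

```python
def cer(reference, hypothesis):
    """Character Error Rate: edit distance between the two strings,
    normalised by the reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(m, 1)

assert cer("kitten", "sitting") == 0.5
assert cer("abc", "abc") == 0.0
```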
4.2 Synthetically generated handwritten line images
This section describes the synthetic data generation process used for our experiments, which consists of the five steps illustrated in Fig. 4. We start by crawling text content, typically from news sites or Wikipedia, and remove bad symbols and characters from the text corpus (1). Since the corpus may contain characters unknown to us, i.e., characters for which we have no visual samples, we obtain these characters (a) by scouting for them on the web and (b) by generating them from fonts and applying random variations to make them less print-like and more handwriting-like (2). The corpus also largely consists of paragraph-level text, so we break it down into text lines, each containing on average 15 characters (3). For each text line, we combine random individual character images to make a line image (4). Finally, we normalize the line images to ease the abrupt variations between character images, and we augment the line images to increase style variation (5). After this process, we have a collection of line images with their corresponding ground-truth labels.
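Step (4), pasting individual character images into a line image, can be sketched as follows (toy images; a real pipeline would add the normalization and augmentation of step (5)):

```python
import numpy as np

def compose_line(char_images, pad=2):
    """Paste character images side by side (all pre-scaled to the same
    height) onto a white canvas, with `pad` blank columns between them."""
    h = char_images[0].shape[0]
    assert all(img.shape[0] == h for img in char_images)
    widths = [img.shape[1] for img in char_images]
    line = np.full((h, sum(widths) + pad * (len(char_images) - 1)), 255, dtype=np.uint8)
    x = 0
    for img, w in zip(char_images, widths):
        line[:, x:x + w] = img
        x += w + pad
    return line

chars = [np.zeros((8, 4), dtype=np.uint8), np.zeros((8, 6), dtype=np.uint8)]
line = compose_line(chars)
assert line.shape == (8, 4 + 6 + 2)
```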
There are two purposes for using synthetic data. First, we generate data for tuning our implemented models (CRNN, DNC, CMAM). In particular, after training with 10,000 synthesized line images (both Latin and Japanese) over a wide range of hyper-parameters, we find that the optimal memory size for the MANN models is 16×16 with 4 read heads. The LSTM controller hidden sizes for DNC and CMAM are 256 and 196, respectively. For the CRNN baseline, the best configuration is a 2-layer bidirectional LSTM of size 256. The optimal optimizer is RMSprop with a tuned learning rate. Second, since collecting and labeling line images is labor-intensive, we increase the number of training images with synthetic data for the Japanese recognition task: we generate 100,000 Japanese line images to pre-train our models before fine-tuning them on the private Japanese dataset.
4.3 Latin Recognition Task
In this task, we compare our model with CRNN and DNC. Two variants of our CMAM are tested: one with no refinement (K = 0) and one with a single refinement step (K = 1). This experiment serves as a simple ablation study to determine a good configuration of CMAM for the other tasks. We run each model 5 times and report the mean and standard deviation of the CER in Table 2.
DNC is the worst performer among the baselines, which indicates that naively applying this generic model to OCR tasks is inefficient. Both versions of CMAM outperform the other baselines, including the common CRNN architecture. Increasing the number of refinement steps helps CMAM outperform CRNN by more than 2% CER. The improvement is modest because the IAM text lines are short and clean. Note that our results are not comparable to those reported in other works (Puigcerver, 2017), as we train on whole lines and discard white-space predictions; we also do not use any language model, pre-processing, or post-processing techniques.
4.4 Chinese Recognition Task
| Model | CR (%) | AR (%) |
| CRNN (BiLSTM, ours) | 81.47 | 74.33 |
CRNNs remain strong baselines for this task, reaching much higher accuracies than attentional models. Compared to the CRNN of Zhu et al. (2018), our proposed CMAM attains a CR higher by more than 3%, yet an AR lower by 0.92%. However, CMAM outperforms our own CRNN implementation on both metrics, by 0.67% and 0.12%, respectively.
4.5 Japanese Recognition Task
The OCR model is trained on two sources of data: synthetically generated handwritten line images (see Sec. 4.2) and a privately collected in-house Japanese dataset. The dataset contains more than 17,000 handwritten line images, divided into training and testing sets (see Table 1); we split 2,000 images from the training set to form a validation set. The images are obtained from scanned notebooks, whose lines are meticulously located and labeled by our dedicated QA team. They cover various Japanese character types (Katakana, Kanji, Hiragana, alphabet, numbers, special characters, and symbols) so that our model works on general use cases.
We conducted two experiments with the Japanese dataset. In the first, we train the models directly on the real-world training set and report CER on the validation and testing sets in Table 4 (middle column). As the results show, CMAM outperforms CRNN significantly, reducing the error rate by around 14% on both the validation and testing sets. This demonstrates the advantage of using memory to capture distant visual features when the text line is long and complicated.
In the second experiment, we first train the models on the 100,000 synthetic line images. After the models converge on the synthetic data, we continue training on the real-world data. The results are listed in the rightmost column of Table 4. With more training data, both models achieve better performance. The improvement is clearer for CRNN, which implies that pre-training may be more important for CRNN than for CMAM. Nevertheless, CMAM still reaches lower error rates than CRNN on the validation and test sets (by 9% and 1%, respectively).
| Model | Original data (val / test) | Enriched data (val / test) |
| CRNN (BiLSTM, ours) | 32.84 / 27.00 | 24.83 / 11.62 |
In this paper, we presented a new architecture for handwritten text recognition that augments a convolutional recurrent neural network with an external memory unit. The memory unit is inspired by recent memory-augmented neural networks, especially the Differentiable Neural Computer, with extensions for a new writing strategy and multi-way memory access mechanisms. The whole architecture proves effective for handwritten text recognition in three experiments, in which our model demonstrates competitive or superior performance against common baselines from the literature.
- Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations. Cited by: §1.
- A neural network associative memory for handwritten character recognition using multiple chua characters. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 40 (10), pp. 667–674. Cited by: §3.2.3.
- Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §1.
- Scan, attend and read: end-to-end handwritten paragraph recognition with mdlstm attention. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1050–1055. Cited by: §2.1.
- Gated convolutional recurrent neural networks for multilingual handwriting recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 646–651. Cited by: §1, §2.1.
- Rosetta: large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79. Cited by: §1, §2.1.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §2.1.
- Fast and robust training of recurrent neural networks for offline handwriting recognition. In 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 279–284. Cited by: §1, §2.1.
- Robust and scalable differentiable neural computer for question answering. In Proceedings of the Workshop on Machine Reading for Question Answering, pp. 47–59. Cited by: §3.2.2, §3.2.3.
- Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303. Cited by: §2.1.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §1.
- Multi-dimensional recurrent neural networks. In International conference on artificial neural networks, pp. 549–558. Cited by: §2.1.
- Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in neural information processing systems, pp. 545–552. Cited by: §2.1.
- Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: §1, §2.2, §3.2.1, §3.2.3.
- Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: §2.2, §3.2.1, §3.2.3, §3.2.3, §3.2.4, §4.1.
- Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718. Cited by: §1, §2.2.
- Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §2.1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
- Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. arXiv preprint arXiv:1706.02737. Cited by: §1, §2.1.
- Memory-augmented neural networks for predictive process analytics. arXiv preprint arXiv:1802.00938. Cited by: §2.2.
- Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839. Cited by: §1, §2.1.
- Ask me anything: dynamic memory networks for natural language processing. In International Conference on Machine Learning, pp. 1378–1387. Cited by: §1.
- Variational memory encoder-decoder. In Advances in Neural Information Processing Systems, pp. 1515–1525. Cited by: §2.2, §3.2.1.
- Dual control memory augmented neural networks for treatment recommendations. In Advances in Knowledge Discovery and Data Mining, Cham, pp. 273–284. Cited by: §2.2, §3.2.1.
- Dual memory neural computer for asynchronous two-view sequential learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 1637–1645. Cited by: §2.2, §3.2.1.
- Learning to remember more with less memorization. In International Conference on Learning Representations. Cited by: §1, §2.2.
- Recursive recurrent nets with attention modeling for ocr in the wild. In , pp. 2231–2239. Cited by: §2.1.
- The iam-database: an english sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition 5 (1), pp. 39–46. Cited by: §4.1.
- Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126. Cited by: §2.2.
- On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. Cited by: §1.
- Dropout improves recurrent neural networks for handwriting recognition. In 2014 14th international conference on frontiers in handwriting recognition, pp. 285–290. Cited by: §1, §2.1.
- Are multidimensional recurrent layers really necessary for handwritten text recognition?. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 67–72. Cited by: §2.1, §4.1, §4.3.
- Implicit language model in lstm for ocr. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 7, pp. 27–31. Cited by: §1.
- Sequence to sequence learning for optical character recognition. arXiv preprint arXiv:1511.04176. Cited by: §2.1.
- Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850. Cited by: §2.2.
- An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304. Cited by: §2.1.
- GMU: a novel rnn neuron and its application to handwriting recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1062–1067. Cited by: §2.1.
- Can we build language-independent ocr using lstm networks?. In Proceedings of the 4th International Workshop on Multilingual OCR, pp. 9. Cited by: Figure 1, §1.
- Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 228–233. Cited by: §1, §2.1.
- Gated recurrent convolution neural network for ocr. In Advances in Neural Information Processing Systems, pp. 335–344. Cited by: §2.1.
- Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §2.2.
- ICDAR 2013 chinese handwriting recognition competition. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1464–1470. Cited by: §4.1.
- Handwritten digit string recognition by combination of residual network and rnn-ctc. In International Conference on Neural Information Processing, pp. 583–591. Cited by: §2.1.
- SCUT-ept: new dataset and benchmark for offline chinese text recognition in examination paper. IEEE Access 7, pp. 370–382. Cited by: §4.1, §4.4.