Fully Convolutional Recurrent Network for Handwritten Chinese Text Recognition

04/18/2016 ∙ by Zecheng Xie, et al. ∙ South China University of Technology International Student Union 0

This paper proposes an end-to-end framework, namely fully convolutional recurrent network (FCRN) for handwritten Chinese text recognition (HCTR). Unlike traditional methods that rely heavily on segmentation, our FCRN is trained with online text data directly and learns to associate the pen-tip trajectory with a sequence of characters. FCRN consists of four parts: a path-signature layer to extract signature features from the input pen-tip trajectory, a fully convolutional network to learn informative representation, a sequence modeling layer to make per-frame predictions on the input sequence and a transcription layer to translate the predictions into a label sequence. The FCRN is end-to-end trainable in contrast to conventional methods whose components are separately trained and tuned. We also present a refined beam search method that efficiently integrates the language model to decode the FCRN and significantly improve the recognition results. We evaluate the performance of the proposed method on the test sets from the databases CASIA-OLHWDB and ICDAR 2013 Chinese handwriting recognition competition, and both achieve state-of-the-art performance with correct rates of 96.40






I Introduction

Handwritten Chinese text recognition (HCTR) is a challenging problem that has received intensive attention from numerous researchers. The large character set, the diversity of writing styles and the character-touching problem are the main difficulties of HCTR. Traditional methods[1][2] overcome these difficulties by integrating segmentation and recognition. Generally, a segmentation-recognition candidate lattice[1] is first derived from the input pen-tip trajectory through over-segmentation, component combination and character recognition. Based on the lattice, the optimal path is then searched by jointly considering the character recognition scores and the geometric and linguistic contexts. Zhou et al.[1] proposed a method based on semi-Markov conditional random fields, which combined candidate character recognition scores with geometric and linguistic contexts. Zhou et al.[2] described an alternative parameter learning method, which aimed at minimizing the character error rate rather than the string error rate. Vision Objects Ltd., France, whose system yielded the best performance in the ICDAR 2013 Chinese handwriting recognition competition[3], introduced three ‘experts’ that were responsible for segmentation, recognition and interpretation. They employed a global discriminant training scheme on the text level to learn the classifier parameters and meta-parameters of the recognizer.

However, traditional methods based on over-segmentation can hardly rectify mis-segmentations when characters are not correctly separated. Segmentation-free models[4, 5, 6, 7, 8] have been studied and have proved to be useful in different areas. Liwicki et al.[4] and Graves et al.[5] combined bidirectional long short-term memory (LSTM) and connectionist temporal classification (CTC) to build a handwriting recognizer. Messina et al.[6] applied multi-dimensional LSTM with CTC to offline HCTR. Recently, Shi et al.[7] proposed a network architecture called convolutional recurrent neural network (CRNN), which consists of convolutional layers, recurrent layers and a transcription layer, for image-based sequence recognition.

Similar to the aforementioned tasks, variable-length input is also the fundamental difficulty when solving HCTR problems. In this paper, we propose a fully convolutional recurrent network (FCRN), which is a novel framework for HCTR problems that possesses the following advantages: (1) It applies a path-signature layer to generate signature feature maps for online data, which uniquely characterizes the pen-tip trajectory. (2) It takes an input sequence of arbitrary length and outputs a corresponding label sequence without pre-segmentation. (3) It is end-to-end trainable. All its components can be jointly trained to fit each other and improve the overall function and reliability. Language models are of great importance for speech recognition and online text recognition, and have been proved to be effective by Wang et al.[9] and Wu et al.[10]. In this paper, we adopted a refined beam search method to integrate a language model to decode our FCRN. Experiments showed that by incorporating lexical constraints and prior knowledge about a certain language, the language model can further decrease the error rate by 2%-5%.

Fig. 1: Architecture of the proposed fully convolutional recurrent network. Given the input pen-tip trajectory, the path-signature layer extracts signature feature maps with informative dynamics. Then a fully convolutional network produces a length T feature sequence whose frames correspond to receptive fields with height 126 pixels and width 62 pixels on the signature feature maps. After that, a multi-layer BLSTM predicts a probability distribution for each frame in the feature sequence. Finally, the transcription layer derives a label sequence from the per-frame predictions.

The remaining parts of the paper are organized as follows. In Section 2, we illustrate the framework of FCRN in detail. In Section 3, we describe language modeling and in Section 4, we present the experimental results. In Section 5, we conclude the paper.

II Fully Convolutional Recurrent Network

Given the training set $Q$ and a training instance $(X, Z)$, in which $X$ represents the pen-tip trajectory and $Z$ is the corresponding label sequence, the FCRN aims to minimize the loss function $L(Q)$, defined as the negative log probability of correctly labelling all the training examples in $Q$:

$$L(Q) = -\sum_{(X, Z) \in Q} \ln p(Z \mid X) \qquad (1)$$


Fig. 1 describes the network architecture of the proposed FCRN. The FCRN consists of four components. First, the path signature layer outputs $2^{m+1} - 1$ feature maps, where $m$ is the truncation level of the signature (Section II-A), that are used to characterize the pen-tip trajectory from the online handwritten text data. Second, a fully convolutional network (FCN) produces a feature sequence in which each frame represents the feature vector of a $126 \times 62$ (height by width) receptive field on the signature feature maps. Third, a multi-layer bidirectional LSTM (BLSTM) predicts a probability distribution for each frame in the feature sequence. Finally, the transcription layer derives a label sequence from the per-frame predictions.

II-A Path signature layer

The path signature, pioneered by Chen[11] in the form of iterated integrals and developed by Terry Lyons and his colleagues to play a fundamental role in rough theory[12, 13, 14], can extract sufficient information that uniquely characterizes paths (e.g., in online handwriting) of finite length.

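Assume a time interval $[T_1, T_2] \subset \mathbb{R}$ and the writing plane $W \subset \mathbb{R}^2$. Then a pen stroke can be expressed as $D: [T_1, T_2] \to W$. For intervals $[t_1, t_2] \subset [T_1, T_2]$, the $k$-th iterated integral of $D$ is the $2^k$-dimensional vector defined by

$$P^{k}_{t_1, t_2} = \int_{t_1 < \tau_1 < \cdots < \tau_k < t_2} 1 \, dD_{\tau_1} \otimes \cdots \otimes dD_{\tau_k} \qquad (2)$$

By convention, the $k = 0$ iterated integral is simply the number one (i.e., the offline map of the character), the $k = 1$ iterated integral represents the path displacement, and the $k = 2$ iterated integral represents the curvature of the path.

Note that the $k$-th iterated integral of $D$ increases rapidly in dimension as $k$ increases while carrying very little information. Hence, a truncated signature is preferred. If truncated at level $m$, the path signature can be expressed by

$$S(D)_{t_1, t_2}\big|_m = \left(1, P^{1}_{t_1, t_2}, P^{2}_{t_1, t_2}, \ldots, P^{m}_{t_1, t_2}\right)^{T} \qquad (3)$$

The dimension of the truncated path signature is $2^{m+1} - 1$ (i.e., the number of feature maps). When $D$ is a straight line, the iterated integrals can be calculated using

$$P^{k}_{t_1, t_2} = \frac{\overbrace{\Delta_{t_1, t_2} \otimes \cdots \otimes \Delta_{t_1, t_2}}^{k \text{ factors}}}{k!} \qquad (4)$$

where $\Delta_{t_1, t_2} = D_{t_2} - D_{t_1}$ denotes the path displacement. Fig. 1 shows the signature feature maps of online text data to better illustrate the idea of the path signature.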
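To make the straight-line case concrete, the following minimal sketch computes the truncated signature of a single pen segment as the successive tensor powers of its displacement, each divided by k!. The function name `segment_signature` and the flattening order of the tensor entries are illustrative assumptions, not taken from the paper.

```python
from itertools import product
from math import factorial


def segment_signature(p_start, p_end, level):
    """Truncated path signature of the straight segment p_start -> p_end.

    Hypothetical helper: the k-th term is the k-fold tensor power of the
    2-D displacement divided by k!, flattened in lexicographic index order.
    """
    delta = (p_end[0] - p_start[0], p_end[1] - p_start[1])
    features = [1.0]  # the k = 0 term is the constant one
    for k in range(1, level + 1):
        for idx in product(range(2), repeat=k):  # 2^k entries at level k
            term = 1.0
            for axis in idx:
                term *= delta[axis]
            features.append(term / factorial(k))
    return features


sig2 = segment_signature((0.0, 0.0), (2.0, 1.0), level=2)
print(len(sig2))  # 7 values, i.e. 1 + 2 + 4
print(sig2)       # [1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 0.5]
```

Truncating at levels 0, 1, 2 and 3 gives 1, 3, 7 and 15 values per segment, which matches the feature-map counts reported for Sig0 to Sig3 in Table III.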

II-B Fully convolutional network

A convolutional network is a powerful visual model that extracts high-level abstract features from an image. Inheriting this property, an FCN[15] takes an input image of arbitrary size and outputs a corresponding-sized dense response map. Unlike image cropping or sliding window-based approaches, an FCN eliminates redundant computations by sharing a convolutional response map layer-by-layer to make inference and backpropagation efficient.

Basic operations in a convolutional network, such as convolution, pooling and element-wise activation functions, are translation invariant. Therefore, each location in the last response map corresponds to a rectangular region in the original image, which is called its receptive field. Layer-wise formulas for the exact size and location of the receptive field are provided below:

$$r_l = (r_{l+1} - 1) \times s_l + k_l \qquad (5)$$

$$x_l = s_l \times x_{l+1} + \left(\frac{k_l - 1}{2} - p_l\right) \qquad (6)$$

where $r_l$ is the local region size of the $l$-th layer, $k_l$ is the kernel size, $s_l$ is the stride size, $x_l$ denotes the position and $p_l$ is the padding size of a particular layer.

The FCN takes the input of the signature feature maps and outputs a length T feature sequence. As shown in Fig. 2, successive frames in the output feature sequence correspond to the overlapped receptive fields on the original data.

Fig. 2: The receptive field. Successive frames in the output feature sequence of FCN correspond to the overlapped receptive fields on the original data.
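The layer-wise recursion for the receptive field can be illustrated with a short sketch along one spatial axis. It walks from the top layer down to the input, growing the region size by (size - 1) * stride + kernel and shifting the centre by stride * centre + ((kernel - 1) / 2 - padding). The three-layer stack used here is a hypothetical example, not the configuration of Table II.

```python
def receptive_field(layers, out_pos=0):
    """Size and centre (in input coordinates) of one output position.

    layers: list of (kernel, stride, padding) tuples, ordered from the
    input side (bottom) to the output side (top), along one axis.
    """
    size, center = 1, float(out_pos)
    for kernel, stride, pad in reversed(layers):
        size = (size - 1) * stride + kernel
        center = stride * center + ((kernel - 1) / 2 - pad)
    return size, center


# Hypothetical stack: 3x3 conv (pad 1), 2x2 pooling (stride 2), 3x3 conv (pad 1)
layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1)]
frame0 = receptive_field(layers, out_pos=0)
frame1 = receptive_field(layers, out_pos=1)
print(frame0)  # (8, 0.5)
print(frame1)  # (8, 2.5)
```

In this toy stack, neighbouring output frames see 8-pixel-wide regions whose centres are only 2 pixels apart, so successive receptive fields overlap in the way Fig. 2 depicts.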

II-C Multi-layer BLSTM

The traditional recurrent neural network (RNN) is well known for its self-connected hidden layer that recurrently transfers information from output to input. However, the traditional RNN suffers from the vanishing and exploding gradient problems. Long Short-Term Memory (LSTM)[16], the core of which is the memory cell and three gates (as illustrated in Fig. 3), is used here for its strong ability to capture complex and long-term temporal dynamics. In particular, the three sigmoidal nonlinear gates, namely the input gate, forget gate and output gate, control the information flow in and out of the cell unit. The input gate protects the cell unit from the influence of the current input along with past hidden states, the forget gate allows the memory cell to forget or maintain its previous states and the output gate decides how much memory is to be sent out as hidden states.
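The role of the three gates can be made concrete with a minimal single-unit sketch. It assumes the common LSTM formulation without peephole connections; the exact variant used in the authors' implementation is not specified here, and the weight layout `W` is a hypothetical convenience.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W):
    """One time step of a scalar LSTM cell (input size 1, hidden size 1).

    W maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(gate):
        w_x, w_h, b = W[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))    # how much new content enters the cell
    f = sigmoid(pre_activation("forget"))   # how much of the old cell state is kept
    o = sigmoid(pre_activation("output"))   # how much of the cell is exposed
    g = math.tanh(pre_activation("cell"))   # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# With all-zero weights every gate is 0.5 and the candidate content is 0,
# so the cell state is simply halved.
zero_weights = {name: (0.0, 0.0, 0.0) for name in ("input", "forget", "output", "cell")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, W=zero_weights)
print(c)  # 1.0
```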

To capture complex long-term dependencies, we adopted LSTM for modeling the input feature sequence produced by FCN. Each time it receives a frame from the input feature sequence, LSTM updates its hidden states and predicts a distribution for further transcription. We note that LSTM has the following properties for the HCTR problem. Shi et al.[7] showed that LSTM naturally captures the contextual information from a sequence, which makes the text recognition process more efficient and reliable than processing each character independently. Moreover, LSTM is not limited to fixed length inputs or outputs, which allows for modeling sequential data of arbitrary length. Furthermore, LSTM can be jointly trained with an FCN in a unified network (e.g., FCRN). Joint training can benefit both the convolutional layers and LSTM, and improve overall text recognition performance.

Standard LSTM can only use past contextual information in one direction. This is far from sufficient for HCTR in which bidirectional contextual knowledge is accessible. Bidirectional LSTM (BLSTM) can learn long-range context dynamics in both input directions and significantly outperform unidirectional networks. Furthermore, as suggested by Pascanu et al.[17], we stack multiple BLSTMs in our framework to capture higher-level abstract information for further transcription. Finally, fully connected layers were incorporated between the BLSTM and transcription layer to enhance classification.

Fig. 3: Long short-term memory (LSTM) cell.
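The bidirectional wiring can be sketched independently of the cell type. In the toy example below, a running sum stands in for the LSTM state update (an assumption made purely for readability), so that the forward output at each frame summarizes the past and the backward output summarizes the future.

```python
def run_recurrent(sequence, step, state):
    """Apply a recurrent step over the sequence and collect every state."""
    outputs = []
    for frame in sequence:
        state = step(state, frame)
        outputs.append(state)
    return outputs


def bidirectional(sequence, step_fwd, step_bwd, init_state):
    """Pair each frame's forward (past) and backward (future) context."""
    forward = run_recurrent(sequence, step_fwd, init_state)
    backward = run_recurrent(sequence[::-1], step_bwd, init_state)[::-1]
    return list(zip(forward, backward))


def running_sum(state, frame):
    # Toy stand-in for an LSTM update: accumulate everything seen so far.
    return state + frame


frames = [1, 2, 3]
context = bidirectional(frames, running_sum, running_sum, 0)
print(context)  # [(1, 6), (3, 5), (6, 3)]
```

Stacking several such layers, as the text describes, would feed each layer's paired outputs to the next layer as its input sequence.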

II-D Transcription

Traditional approaches for HCTR are confronted with the paradox of a circular dependency between segmentation and recognition. To avoid the difficulty of segmentation, we adopted connectionist temporal classification (CTC) as the transcription layer in our framework. CTC allows the FCN and LSTM to be trained on sequences without requiring any prior alignment between the input images and their corresponding label sequences.

We denote the character set as $C' = C \cup \{\text{blank}\}$, where $C$ contains all characters used in this task and ‘blank’ represents the null emission. Given length $T$ input sequences $u = (u_1, u_2, \ldots, u_T)$, where $u_t \in \mathbb{R}^{|C'|}$, we can obtain an exponentially large number of length $T$ label sequences, known as alignments, by assigning each time step a label and concatenating the labels to form a label sequence. The alignments are denoted by $\pi$ and their probability is given below:

$$p(\pi \mid u) = \prod_{t=1}^{T} p(\pi_t, t \mid u) \qquad (7)$$

By applying a sequence-to-sequence operation $\mathcal{B}$, alignments can be mapped onto a transcription (denoted by $l$) by first removing the repeated labels and then the blanks. For example, ‘apple’ can be transformed by $\mathcal{B}$ from ‘_aa_p_pllll_e’ or ‘_a_pp_p_l_ee_’. The total probability of a transcription can be calculated by summing the probabilities of all alignments that correspond to it:

$$p(l \mid u) = \sum_{\pi : \mathcal{B}(\pi) = l} p(\pi \mid u) \qquad (8)$$


As described by Graves and Jaitly[18], because we do not know the exact position of the labels within a particular transcription, we consider all locations where they could occur; this is what allows CTC to train a network without pre-segmented data. A detailed forward-backward algorithm to efficiently calculate the probability in Eq. (8) was described by Graves[19].
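The sketch below illustrates both ideas on a toy three-frame example: the collapsing operation (merge repeats, then drop blanks) and the total probability of a transcription, computed once by brute-force enumeration of all alignments and once with the standard CTC forward recursion. The per-frame distributions and the `_` blank symbol are made-up illustrative values.

```python
from itertools import product

BLANK = "_"


def collapse(alignment):
    """Merge repeated labels, then remove blanks."""
    out, prev = [], None
    for label in alignment:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def brute_force(probs, target):
    """Sum the probabilities of every alignment that collapses to target."""
    total = 0.0
    for path in product(list(probs[0]), repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total


def ctc_probability(probs, target):
    """Same quantity via the CTC forward recursion over the blank-padded target."""
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][label]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


probs = [
    {"_": 0.2, "a": 0.5, "b": 0.3},
    {"_": 0.6, "a": 0.3, "b": 0.1},
    {"_": 0.1, "a": 0.2, "b": 0.7},
]
print(collapse("_a_pp_p_l_ee_"))  # apple
print(abs(ctc_probability(probs, "ab") - brute_force(probs, "ab")) < 1e-12)  # True
```

The brute-force sum grows exponentially with the number of frames, whereas the recursion is linear in it, which is why the dynamic-programming form is used in practice.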

III Language Modeling

The statistical language model plays a significant role in many technological applications, including online and offline handwritten text recognition, speech recognition and language translation. The statistical model of language (e.g., a length $T$ sequence of words) is represented as follows:

$$p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_1^{t-1}) \qquad (9)$$

where $w_t$ is the $t$-th word in the sequence and $w_i^j$ denotes the sequence $(w_i, w_{i+1}, \ldots, w_j)$. In fact, closer words in a word sequence tend to be more dependent. Therefore, the n-gram model, which is constructed from the conditional probability of the next word given the last $n-1$ words, is more often used in practice:

$$p(w_1^T) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-n+1}^{t-1}) \qquad (10)$$

In this paper, we only considered the character bigram and trigram language model in the experiments.
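As a rough illustration of a character bigram model, the sketch below counts character pairs in a tiny made-up corpus and scores a string by the product of conditional probabilities. Add-one smoothing and the `<s>` start symbol are assumptions chosen for simplicity; they are not a description of the smoothing used by the toolkit in the experiments.

```python
import math
from collections import Counter


def train_bigram(corpus_lines):
    """Count character bigrams; each line starts from a <s> symbol."""
    context_counts, bigram_counts = Counter(), Counter()
    for line in corpus_lines:
        chars = ["<s>"] + list(line)
        for prev, cur in zip(chars, chars[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab = {ch for line in corpus_lines for ch in line}
    return context_counts, bigram_counts, vocab


def bigram_logprob(text, model):
    """Log probability of text under the bigram model (add-one smoothing)."""
    context_counts, bigram_counts, vocab = model
    chars = ["<s>"] + list(text)
    logp = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
        logp += math.log(p)
    return logp


model = train_bigram(["ab", "ab", "ac"])  # toy corpus, illustrative only
print(bigram_logprob("ab", model) > bigram_logprob("ac", model))  # True
```

A trigram model follows the same pattern with two characters of context instead of one.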

Decoding a CTC network can be easily accomplished through ‘naive decoding’[19], which takes the label with the highest probability for each frame and obtains the transcription by applying the collapsing operation of Section II-D (removing repeated labels and then blanks) to the resulting alignment. However, naive decoding ignores linguistic context and can be improved by language modeling. By incorporating lexical constraints and prior knowledge about the language, language modeling can rectify some obvious semantic errors and thus improve the recognition result. To integrate the language model and overcome the difficulty that the collapsing operation creates, we adopted a refined beam search method to decode the FCRN. Specifically, given the per-frame prediction distribution from the multi-layer BLSTM, the beam search method first selects, for each time step, the candidates with confidence scores higher than a probability threshold. It then finds the time steps whose one and only candidate is ‘blank’ and uses them to separate the alignments into regions. Finally, it enumerates the candidate paths in every region and sequentially concatenates the candidate paths from the first to the last region. In each concatenating step, both the confidence score and the language model score of the paths are considered, and only the top N paths are kept for the subsequent concatenating steps.
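The decoding steps described above can be sketched as follows. This is a simplified reading of the procedure, not the authors' implementation: the scoring rule (log confidence plus a weighted language model log probability), the fallback to the best label when nothing passes the threshold, and all names and parameters (`beam_decode`, `toy_lm`, `threshold`, `beam_width`, `lm_weight`) are assumptions made for illustration.

```python
import math
from itertools import product

BLANK = "_"


def collapse(path):
    """Merge repeated labels, then remove blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def naive_decode(probs):
    """Best label per frame, followed by collapsing."""
    return collapse(max(frame, key=frame.get) for frame in probs)


def beam_decode(probs, lm_logprob, threshold=0.2, beam_width=3, lm_weight=1.0):
    # Step 1: keep the per-frame candidates above the threshold.
    candidates = []
    for frame in probs:
        kept = {k: p for k, p in frame.items() if p > threshold}
        if not kept:  # assumed fallback: keep the single best label
            best = max(frame, key=frame.get)
            kept = {best: frame[best]}
        candidates.append(kept)
    # Step 2: frames whose only candidate is blank separate the regions.
    regions, current = [], []
    for cands in candidates:
        if set(cands) == {BLANK}:
            if current:
                regions.append(current)
            current = []
        else:
            current.append(cands)
    if current:
        regions.append(current)
    # Step 3: enumerate paths per region and concatenate region by region.
    beams = [("", 0.0)]
    for region in regions:
        region_best = {}  # best log confidence per collapsed string
        for path in product(*[list(c) for c in region]):
            score = sum(math.log(c[label]) for c, label in zip(region, path))
            text = collapse(path)
            region_best[text] = max(score, region_best.get(text, -math.inf))
        extended = {}
        for prefix, prefix_score in beams:
            for text, conf in region_best.items():
                lm, context = 0.0, prefix
                for ch in text:
                    lm += lm_logprob(context, ch)
                    context += ch
                total = prefix_score + conf + lm_weight * lm
                extended[context] = max(total, extended.get(context, -math.inf))
        beams = sorted(extended.items(), key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]


def toy_lm(context, ch):
    """Hypothetical bigram scores: after 'a', 'b' is far more likely than 'a'."""
    table = {("a", "b"): 0.9, ("a", "a"): 0.1}
    return math.log(table.get((context[-1:], ch), 0.5))


probs = [
    {"_": 0.10, "a": 0.80, "b": 0.10},
    {"_": 0.98, "a": 0.01, "b": 0.01},  # blank-only frame: region boundary
    {"_": 0.05, "a": 0.50, "b": 0.45},  # ambiguous frame
]
print(naive_decode(probs))         # aa
print(beam_decode(probs, toy_lm))  # ab
```

Because the boundary frames contain only blanks, collapsing each region separately and concatenating the results gives the same string as collapsing the whole alignment, which is what makes the region-wise enumeration valid in this sketch.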

IV Experimental Results and Analysis

IV-A Online handwritten text data

CASIA-OLHWDB[20] is a Chinese handwriting database that is often used for online Chinese handwriting recognition. It contains both isolated characters and unconstrained text lines. The training set of CASIA-OLHWDB for online handwritten text recognition contains 4072 pages of handwritten text, comprising 41,710 text lines and 1,082,220 characters of 2650 classes, whereas the test set (denoted as D-Casia) contains 1020 text pages, including 269,674 characters of 2631 classes. We randomly split the training set into two groups, with approximately 90% for training and the remainder for gauging the convergence of the training process and further parameter learning for the beam search. Furthermore, we assessed our proposed method on the test data (denoted as D-Com) of the online handwritten text recognition task of the ICDAR 2013 Chinese handwriting recognition competition[21], which contains 3432 text lines, including 91,576 characters of 1375 classes. However, our actual evaluation dataset is smaller than the reported one because we removed outlier characters that never appear in the training data; it contains 89,723 characters of 1258 classes.

IV-B Textual data

| Corpus | #characters | #classes |
|---|---|---|
| PFR | 2,199,492 | 4,689 |
| PH | 3,697,028 | 4,722 |
| SLD | 56,279,692 | 6,882 |

TABLE I: Character information in the corpora

The experiments were conducted on three corpora: the PFR corpus[22], which is news text from the 1998 People’s Daily corpus; the PH corpus[23], which is news text from the People’s Republic of China’s Xinhua news written between January 1990 and March 1991; and the SLD corpus[24], which contains news text from the 2006 Sogou Lab Data. Because the total amount of Sogou Lab Data was too large, we only used an extract in our experiments. Detailed information about these corpora is given in Table I.

We constructed our language models using the SRILM toolkit[25]. We built three language models based on these three corpora, and compared their roles in decoding the FCRN with the beam search method.

IV-C Experimental setting

Layer type Settings Stack times
transcription sequence labeling
inner product n: 2048
BLSTM c: 1024
convolution k: , s: , p:
convolution k: , s: , p:
pooling k: , s:
convolution k: , s: , p:
path-signature (train), (test)
input pen-tip trajectory
TABLE II: Detailed settings of our system

The detailed architecture of our FCRN for HCTR is listed in Table II. The kernel number of each layer in our FCN from bottom to top is 64, 128, 256, 256, 512 and 512. We also applied batch normalization[26] to the last four convolutional layers to enable them to converge faster and avoid over-fitting. To accelerate the training process, we trained our network with shorter texts segmented from text lines in the training data, which could be normalized to the same height of 128 pixels, while keeping the width below 576 pixels. In the test phase, we maintained the same height but increased the width to 2400 pixels to contain the text lines from the test set.

We constructed our FCRN network within the CAFFE[27] deep learning framework, in which the LSTM was implemented by Venugopalan et al.[28] and the other layers were implemented by ourselves. The optimization algorithm was AdaDelta with $\rho = 0.9$. We trained our FCRN with GeForce Titan-X GPUs and it took approximately four days to reach convergence.

We used the correct rate (CR) and accuracy rate (AR) performance measures discussed in the ICDAR 2013 Chinese handwriting recognition competition[21] to assess our framework.
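For readers unfamiliar with these two measures, the sketch below assumes the usual definitions based on an edit-distance alignment between the recognized text and the ground truth: CR = (N - D - S) / N and AR = (N - D - S - I) / N, where N is the number of ground-truth characters and S, D and I are the numbers of substitution, deletion and insertion errors. The function names and the tie-breaking among equally cheap alignments are illustrative assumptions.

```python
def edit_counts(reference, hypothesis):
    """(substitutions, deletions, insertions) of a minimum edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (distance, S, D, I) for reference[:i] versus hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # delete every reference character
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis character
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d, s, dl, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                best = (d, s, dl, ins)          # match
            else:
                best = (d + 1, s + 1, dl, ins)  # substitution
            d, s, dl, ins = cost[i - 1][j]
            best = min(best, (d + 1, s, dl + 1, ins))  # deletion
            d, s, dl, ins = cost[i][j - 1]
            best = min(best, (d + 1, s, dl, ins + 1))  # insertion
            cost[i][j] = best
    return cost[n][m][1:]


def correct_and_accuracy_rate(reference, hypothesis):
    s, d, i = edit_counts(reference, hypothesis)
    n = len(reference)
    return (n - d - s) / n, (n - d - s - i) / n


# One substitution (c -> x) and one insertion (f) against 5 reference characters.
cr, ar = correct_and_accuracy_rate("abcde", "abxdef")
print(cr, ar)  # 0.8 0.6
```

Under these definitions AR is never higher than CR, because insertion errors are penalized only in AR; this is consistent with the paired values reported in the result tables.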

IV-D Experimental results

We compared path signatures truncated at different levels (Sig0, Sig1, Sig2, and Sig3) on our network. Table III presents the results of our system with naive decoding (i.e., without language modeling). We observed that Sig2 outperformed the other signatures for both CR and AR, which suggests that Sig2 already extracts sufficient information for characterizing the pen-tip trajectory. Moreover, as the truncation level increased from Sig0 to Sig2, the CR on D-Casia improved monotonically from 90.94% to 94.52%, because the path signature captured more informative features from the pen-tip trajectory with higher iterated integrals. However, Sig3 performed worse than Sig2 in the experiment, because Sig3 captures only slightly more information than Sig2 but may introduce many more uninformative features. Experiments also showed that FCRN performed much better on dataset D-Casia than on D-Com because the per-character sample distribution of the training set was more similar to that of D-Casia than to that of D-Com.

| Path signature | Feature maps | D-Casia CR | D-Casia AR | D-Com CR | D-Com AR |
|---|---|---|---|---|---|
| Sig0 | 1 | 90.94 | 89.86 | 84.91 | 83.55 |
| Sig1 | 3 | 93.56 | 93.04 | 87.05 | 86.32 |
| Sig2 | 7 | 94.52 | 93.22 | 89.86 | 88.28 |
| Sig3 | 15 | 93.92 | 93.02 | 88.46 | 87.36 |

TABLE III: Correct rate and accuracy rate (%) on dataset D-Casia and D-Com with the path signatures in different truncated versions (without language modeling).

| Corpus | n-gram order | D-Casia CR | D-Casia AR | D-Com CR | D-Com AR |
|---|---|---|---|---|---|
| FCRN | - | 94.52 | 93.22 | 89.86 | 88.28 |
| PFR | 2 | 95.66 | 94.35 | 92.97 | 91.36 |
| PFR | 3 | 95.88 | 94.66 | 93.10 | 91.55 |
| PH | 2 | 95.70 | 94.40 | 92.76 | 91.17 |
| PH | 3 | 95.88 | 94.66 | 93.10 | 91.55 |
| SLD | 2 | 95.94 | 94.79 | 93.41 | 92.01 |
| SLD | 3 | 96.40 | 95.34 | 95.00 | 92.88 |

TABLE IV: Correct rate and accuracy rate (%) on dataset D-Casia and D-Com based on FCRN with Sig2, decoded with language models built on different corpora

| Dataset | Method | CR | AR |
|---|---|---|---|
| D-Casia | Zhou et al., 2013[1] | 94.34 | 93.75 |
| D-Casia | Zhou et al., 2014[2] | 95.32 | 94.69 |
| D-Casia | CRNN[7] | 90.94 | 89.86 |
| D-Casia | FCRN | 94.52 | 93.22 |
| D-Casia | FCRN with SLD corpus | 96.40 | 95.34 |
| D-Com | Zhou et al., 2013[1] | 94.62 | 94.06 |
| D-Com | Zhou et al., 2014[2] | 94.76 | 94.22 |
| D-Com | VO-3[3] | 95.03 | 94.49 |
| D-Com | CRNN[7] | 84.91 | 83.55 |
| D-Com | FCRN | 89.86 | 88.28 |
| D-Com | FCRN with SLD corpus | 95.00 | 92.88 |

TABLE V: Comparison with state-of-the-art methods based on correct rate and accuracy rate (%) on dataset D-Casia and D-Com

Because Sig2 performed the best among all truncation levels, we adopted it in our FCRN for the following experiments. We investigated the performance of decoding FCRN with different language models using the beam search method. Table IV shows that by integrating the bigram language model built on the PFR corpus, the CRs on dataset D-Casia and D-Com increased by 1.14 and 3.11 percentage points, respectively, and the ARs increased by 1.13 and 3.08 percentage points, respectively, proving the effectiveness of the language model. Experiments also showed that with a higher-order language model (e.g., trigram), our system improved further. Using PH for decoding achieved a similar effect. However, when we used a much larger corpus, SLD (about 56 million characters), for decoding, performance improved significantly. A larger and richer corpus made the language model more general and objective, and most importantly, helped to overcome the curse of dimensionality problem[29].
The CRNN architecture proposed by Shi et al.[7] is a special case of our FCRN with the path signature truncated at level zero (i.e., Sig0). As presented in Table V, FCRN outperformed CRNN on both D-Casia and D-Com, which suggests that FCRN captures more essential online information from the pen-tip trajectory and is a better choice for the HCTR problem. We also observed that our FCRN with naive decoding already achieved results comparable with those of Zhou et al.[1][2] on dataset D-Casia. Furthermore, when decoded with the trigram language model based on the SLD corpus, our system outperformed the other methods, with a CR of 96.40% and an AR of 95.34% on dataset D-Casia. On dataset D-Com, although the results cannot be strictly compared because of the removal of outlier characters, it may be safe to say that our system achieved state-of-the-art performance.

V Conclusion

This paper presented a novel fully convolutional recurrent network (FCRN) for handwritten Chinese text recognition. The proposed FCRN is an end-to-end architecture that directly uses online text data during the training process to solve the HCTR problem, completely avoiding the difficulty of segmentation. In the experiments, we found that the path signature truncated at level two captured the pen-tip trajectory of the online text data effectively without significantly increasing the computation during the training process. At the post-processing stage, we presented a refined beam search method that effectively integrates an explicit language model to perform decoding and significantly improves the recognition result. On the test set of CASIA-OLHWDB for online handwritten text recognition, our system outperformed all the compared methods. On the test set of the ICDAR 2013 Chinese handwriting recognition competition, our system achieved state-of-the-art performance.

In the experiment, our system performed much better on dataset D-Casia than D-Com because of the unbalanced per-character sample distribution on the datasets. Our future work will focus on enriching the training dataset to improve the performance on dataset D-Com using some data augmentation approaches such as sample synthesis or sample distortion.


This research is supported in part by NSFC (Grant No.: 61472144), GDSTP (Grant No.: 2013B010202004, 2015B010131004), GDUPS (2011), and the Research Fund for the Doctoral Program of Higher Education of China (Grant No.: 20120172110023).


  • [1] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, “Handwritten chinese/japanese text recognition using semi-markov conditional random fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2413–2426, 2013.
  • [2] X.-D. Zhou, Y.-M. Zhang, F. Tian, H.-A. Wang, and C.-L. Liu, “Minimum-risk training for semi-markov conditional random fields with application to handwritten chinese/japanese text recognition,” Pattern Recognition, vol. 47, no. 5, pp. 1904–1916, 2014.
  • [3] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, “ICDAR 2013 chinese handwriting recognition competition,” 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1464–1470, 2013.
  • [4] M. Liwicki, A. Graves, H. Bunke, and J. Schmidhuber, “A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks,” Proc. 9th Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 367–371, 2007.
  • [5] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
  • [6] R. Messina and J. Louradour, “Segmentation-free handwritten chinese text recognition with lstm-rnn,” 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 171–175, 2015.
  • [7] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” CoRR, vol. abs/1507.05717, 2015.
  • [8] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang, “Reading scene text in deep convolutional sequences,” CoRR, vol. abs/1506.04395, 2015.
  • [9] Q.-F. Wang, F. Yin, and C.-L. Liu, “Handwritten chinese text recognition by integrating multiple contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469–1481, 2012.
  • [10] Y.-C. Wu, F. Yin, and C.-L. Liu, “Evaluation of neural network language models in handwritten chinese text recognition,” 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 166–170, 2015.
  • [11] K.-T. Chen, “Integration of paths–a faithful representation of paths by noncommutative formal power series,” Transactions of the American Mathematical Society, vol. 89, no. 2, pp. 395–407, 1958.
  • [12] B. Hambly and T. Lyons, “Uniqueness for the signature of a path of bounded variation and the reduced path group,” Annals of Mathematics, pp. 109–167, 2010.
  • [13] T. Lyons and Z. Qian, System control and rough paths, 2002.
  • [14] T. Lyons, “Rough paths, signatures and the modelling of functions on streams,” CoRR, vol. abs/1405.4537, 2014.
  • [15] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
  • [16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [17] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” CoRR, vol. abs/1312.6026, 2013.
  • [18] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.
  • [19] A. Graves, Supervised sequence labelling. Springer, 2012.
  • [20] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Casia online and offline chinese handwriting databases,” 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 37–41, 2011.
  • [21] C.-L. Liu, F. Yin, Q.-F. Wang, and D.-H. Wang, “ICDAR 2011 chinese handwriting recognition competition,” 2011.
  • [22] “the people’ s daily corpus.” http://icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp, [Online] the People’ s Daily News and Information Center, the Peking University Institute of Computational Linguistics and Fujitsu Research and Development Center Limited. Accessed March 25, 2016.
  • [23] G. Jin, “The ph corpus.” ftp://ftp.cogsci.ed.ac.uk/pub/chinese, [Online] Accessed March 25, 2016.
  • [24] “Sogou lab data.” http://www.sogou.com/labs/dl/c.html, [Online] R&D Center of SOHU. Accessed March 25, 2016.
  • [25] A. Stolcke et al., “Srilm-an extensible language modeling toolkit.” INTERSPEECH, vol. 2002, p. 2002, 2002.
  • [26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
  • [27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” Proceedings of the ACM International Conference on Multimedia, pp. 675–678, 2014.
  • [28] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence-video to text,” Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542, 2015.
  • [29] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain, “Neural probabilistic language models,” Innovations in Machine Learning, pp. 137–186, 2006.