1 Introduction
Sequence recognition, or sequence labelling [13], assigns sequences of labels, drawn from a fixed alphabet, to sequences of input data. Examples include speech recognition [14, 2], scene text recognition [38, 39], and handwritten text recognition [34, 48], as shown in Fig. 1. Recent advances in deep learning [30, 41, 20] and new architectures [42, 5, 4, 46] have enabled the construction of systems that can handle one-dimensional (1D) [38, 34] and two-dimensional (2D) prediction problems [56, 4]. For 1D prediction problems, the topmost feature maps of the network are collapsed across the vertical dimension to generate a 1D prediction [5], because characters in the original images are generally distributed sequentially. Typical examples are regular scene text recognition [38, 54], online/offline handwritten text recognition [12, 34, 48], and speech recognition [14, 2]. For 2D prediction problems, characters in the input image are distributed in a specific spatial structure. For example, there are highly complicated spatial relations between adjacent characters in mathematical expression recognition [56, 57]. In paragraph-level text recognition, characters are generally distributed line by line [4, 46], whereas in irregular scene text recognition, they are generally distributed in a side-view or curved pattern [51, 8].
For the sequence recognition problem, traditional methods generally require separate training targets for each segment or timestep in the input sequence, resulting in inconvenient pre-segmentation and post-processing stages [12]. The recent emergence of CTC [13] and the attention mechanism [1] significantly alleviates this sequential training problem by circumventing the prior alignment between the input image and its corresponding label sequence. However, although CTC-based networks have exhibited remarkable performance in 1D prediction problems, the underlying methodology is sophisticated; moreover, its implementation, the forward-backward algorithm [12], is complicated, resulting in high computational cost. Besides, CTC can hardly be applied to 2D prediction problems. Meanwhile, the attention mechanism relies on its attention module for label alignment, resulting in additional storage and computation. As pointed out by Bahdanau et al. [2], a recognition model is difficult to learn from scratch with the attention mechanism, owing to the misalignment between ground truth strings and attention predictions, especially on longer input sequences [25, 9]. Bai et al. [3] also argue that the misalignment problem can confuse and mislead the training process, and consequently make the training costly and degrade recognition accuracy. Although the attention mechanism can be adapted for 2D prediction problems, it turns out to be prohibitive in terms of memory and time consumption, as indicated in [4] and [46].
Compelled by the above observations, we propose a novel aggregation cross-entropy (ACE) loss function for the sequence recognition problem, as detailed in Fig. 2. Given the prediction of the network, the ACE loss consists of three simple stages: (1) aggregation of the probabilities for each category along the time dimension; (2) normalization of the accumulative result and label annotation as probability distributions over all the classes; and (3) comparison between these two probability distributions using cross-entropy. The advantages of the proposed ACE loss function can be summarized as follows:


Owing to its simplicity, the ACE loss function is much quicker to implement (four fundamental formulas), faster to infer and backpropagate (approximately in parallel), less memory demanding (no parameter and basic runtime memory), and convenient to use (simply replace CTC with ACE), as compared to CTC and attention mechanism. This is illustrated in Table 5, Section 3.4, and Section 4.4.

Despite its simplicity, the ACE loss function achieves performance competitive with CTC and the attention mechanism, as established in experiments on regular and irregular scene text recognition and handwritten text recognition problems.

The ACE loss function can be adapted to the 2D prediction problem by flattening the 2D prediction into 1D prediction, as verified in the experiments of irregular scene text recognition and counting problems.

The ACE loss function does not require instance order information for supervision, which enables it to advance beyond sequence recognition, e.g., to the counting problem.
2 Related Work
2.1 Connectionist temporal classification
The advantages of the popular CTC loss were first demonstrated in speech recognition [16, 14] and online handwritten text recognition [15, 12]. Recently, an integrated CNN-LSTM-CTC model was proposed to address the scene text recognition problem [38]. There are also methods that aim to extend CTC in applications; e.g., Zhang et al. [55] proposed an extended CTC (ECTC) objective function adapted from CTC to allow RNN-based phoneme recognizers to be trained even when only word-level annotation is available. Hwang et al. [21] developed an expectation-maximization-based online CTC algorithm that allows RNNs to be trained with an infinitely long input sequence, without pre-segmentation or external reset. However, the calculation process of CTC is highly complicated and time-consuming, and it requires substantial effort to rearrange the feature map and annotation when applied to 2D problems [46, 4].
2.2 Attention mechanism
The attention mechanism was first proposed in machine translation [1, 42] to enable a model to automatically search for parts of a source sentence for prediction. Then, the method rapidly became popular in applications such as (visual) question answering [32, 52], image caption generation [50, 52, 31], speech recognition [2, 25, 32] and scene text recognition [39, 3, 19]. Most importantly, the attention mechanism can also be applied to 2D predictions, such as mathematical expression recognition [56, 57] and paragraph recognition [4, 5, 46]. However, the attention mechanism relies on a complex attention module to fulfill its functionality, resulting in additional network parameters and runtime. Besides, missing or superfluous characters can easily cause misalignment, confusing and misleading the training process, and consequently degrading the recognition accuracy [3, 2, 9].
3 Aggregation Cross-Entropy
Formally, given the input image $I$ and its sequence annotation $S$ from a training set $Q$, the general loss function for the sequence recognition problem evaluates the probability of annotation $S$ of length $L$ conditioned on image $I$ under model parameter $\omega$ as follows:
$$\mathcal{L}(\omega) = -\sum_{(I,S)\in Q} \log P(S \mid I; \omega) = -\sum_{(I,S)\in Q} \sum_{l=1}^{L} \log P(S_l \mid l, I; \omega) \qquad (1)$$
where $P(S_l \mid l, I; \omega)$ represents the probability of predicting character $S_l$ at the $l$-th position of the predicted sequence. Therefore, the problem is to estimate the general loss function Eq. (1) based on the model prediction $\{y_k^t,\ t = 1, 2, \ldots, T,\ k = 1, 2, \ldots, |C^\epsilon|\}$, where $C^\epsilon = C \cup \{\epsilon\}$, with $C$ being the character set and $\epsilon$ the blank label. Nevertheless, directly estimating the probability $P(S \mid I; \omega)$ was excessively challenging until the emergence of the popular CTC loss function. The CTC loss function elegantly calculates $P(S \mid I; \omega)$ using a forward-backward algorithm, which removes the need for pre-segmented data and external post-processing. The attention mechanism provides an alternative solution to estimate the general loss function by directly predicting $P(S_l \mid l, I; \omega)$ based on its attention module. However, the forward-backward algorithm of CTC is highly complicated and time-consuming, whereas the attention mechanism requires an extra complex network to ensure the alignment between attention prediction and annotation.
In this paper, we present the ACE loss function to estimate the general loss function based on model prediction $y_k^t$. In Eq. (1), the general loss function can be minimized by maximizing the predictions at each position of the sequence annotation, i.e., $P(S_l \mid l, I; \omega)$. However, directly calculating $P(S_l \mid l, I; \omega)$ based on $y_k^t$ is challenging because the alignment between the $l$-th character in the annotation and model prediction $y_k^t$ is unclear. Therefore, rather than precisely estimating the probability $P(S_l \mid l, I; \omega)$, the problem is mitigated by supervising only the accumulative probability of each class, without considering its sequential order in the annotation. For example, if a class appears twice in the annotation, we require its accumulative prediction probability over $T$ timesteps to be exactly two, anticipating that its two corresponding predictions approximate to one. Therefore, we can minimize the general loss function by requiring the network to precisely predict the character number of each class in the annotation as follows:
$$\mathcal{L}(\omega) = -\sum_{(I,S)\in Q} \sum_{l=1}^{L} \log P(S_l \mid l, I; \omega) \approx -\sum_{(I,S)\in Q} \sum_{k=1}^{|C^\epsilon|} \log P(N_k \mid k, I; \omega) \qquad (2)$$
where $N_k$ represents the number of times that character $C_k^\epsilon$ occurs in the sequence annotation $S$. Note that this new loss function does not require character order information but only the classes and their number for supervision.
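3.1 Regression-Based ACE Loss Function
Now, the problem is to bridge model prediction $y_k^t$ to the number prediction of each class. We propose to calculate the number of each class by summing up the probabilities of the $k$-th character over $T$ timesteps, i.e., $y_k = \sum_{t=1}^{T} y_k^t$, as illustrated by aggregation in Fig. 2. Note that,
$$\max \sum_{(I,S)\in Q} \sum_{k=1}^{|C^\epsilon|} \log P(N_k \mid k, I; \omega) \;\Leftrightarrow\; \min \sum_{(I,S)\in Q} \sum_{k=1}^{|C^\epsilon|} (N_k - y_k)^2 \qquad (3)$$
Therefore, we adapt the loss function (Eq. (3)) from the perspective of a regression problem as follows:
$$\mathcal{L}(I, S) = \frac{1}{2} \sum_{k=1}^{|C^\epsilon|} (N_k - y_k)^2 \qquad (4)$$
Also note that a total of $T - |S|$ predictions are expected to yield null emission. Therefore, we have $N_\epsilon = T - |S|$.
To find the gradient for each example $(I, S)$, we first differentiate $\mathcal{L}(I, S)$ with respect to the network output $y_k^t$:
$$\frac{\partial \mathcal{L}(I, S)}{\partial y_k^t} = \Delta_k^t = y_k - N_k \qquad (5)$$
where $y_k = \sum_{t=1}^{T} y_k^t$. Recall that for Softmax functions, we have:
$$y_{k'}^t = \frac{e^{a_{k'}^t}}{\sum_{k''} e^{a_{k''}^t}}, \qquad \frac{\partial y_{k'}^t}{\partial a_k^t} = y_{k'}^t \delta_{kk'} - y_{k'}^t y_k^t \qquad (6)$$
where $\delta_{kk'} = 1$ if $k = k'$ and zero otherwise. Now, we can differentiate the loss function with respect to $a_k^t$ to backpropagate the gradient through the output layer:
$$\frac{\partial \mathcal{L}(I, S)}{\partial a_k^t} = \sum_{k'=1}^{|C^\epsilon|} \frac{\partial \mathcal{L}(I, S)}{\partial y_{k'}^t} \frac{\partial y_{k'}^t}{\partial a_k^t} = \sum_{k'=1}^{|C^\epsilon|} \Delta_{k'}^t \left( y_{k'}^t \delta_{kk'} - y_{k'}^t y_k^t \right) = y_k^t \left( \Delta_k^t - \sum_{k'=1}^{|C^\epsilon|} \Delta_{k'}^t y_{k'}^t \right) \qquad (7)$$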
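The gradient of the regression-based loss can be sanity-checked numerically. The sketch below is illustrative only and assumes the squared-error form 0.5 * sum_k (N_k - y_k)^2 with y_k the per-class sum of Softmax outputs over time, so that the chain rule gives y_k^t * (Delta_k - sum_k' Delta_k' * y_k'^t) with Delta_k = y_k - N_k. All function names and the toy logits are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one timestep's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def regression_loss(logits, counts):
    """0.5 * sum_k (N_k - y_k)^2, with y_k summed over all timesteps."""
    probs = [softmax(row) for row in logits]
    y = [sum(p[k] for p in probs) for k in range(len(counts))]
    return 0.5 * sum((n - v) ** 2 for n, v in zip(counts, y))

def analytic_grad(logits, counts):
    """Closed form: y_k^t * (Delta_k - sum_k' Delta_k' * y_k'^t)."""
    probs = [softmax(row) for row in logits]
    y = [sum(p[k] for p in probs) for k in range(len(counts))]
    delta = [v - n for v, n in zip(y, counts)]
    grad = []
    for p in probs:
        inner = sum(d * q for d, q in zip(delta, p))
        grad.append([q * (d - inner) for q, d in zip(p, delta)])
    return grad

def numeric_grad(loss_fn, logits, counts, h=1e-6):
    """Central finite differences, perturbing one logit at a time."""
    grad = []
    for t in range(len(logits)):
        row = []
        for k in range(len(logits[t])):
            plus = [r[:] for r in logits]
            minus = [r[:] for r in logits]
            plus[t][k] += h
            minus[t][k] -= h
            row.append((loss_fn(plus, counts) - loss_fn(minus, counts)) / (2 * h))
        grad.append(row)
    return grad

# Toy problem (assumed values): T = 4 timesteps, 3 classes, counts sum to T.
logits = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [-0.3, 0.1, 0.2], [0.5, 0.0, -0.2]]
counts = [2, 1, 1]

a = analytic_grad(logits, counts)
n = numeric_grad(regression_loss, logits, counts)
max_err = max(abs(x - y) for ra, rn in zip(a, n) for x, y in zip(ra, rn))
print(max_err)  # should be tiny if the closed form matches the loss
```

Every term of the closed form carries the factor y_k^t, which is the quantity the following subsection is concerned with.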
3.1.1 Gradient vanishing problem
From Eq. (7), we observe that the regression-based ACE loss (Eq. (4)) is not convenient in terms of backpropagation. In the early training stage, we have $y_k^t \approx 1/|C^\epsilon|$. Therefore, $y_k^t$ will be negligible for large-vocabulary sequence recognition problems, where $|C^\epsilon|$ is large (e.g., 7,357 for the HCTR problem). Although the other terms in Eq. (7) (e.g., $\Delta_k^t$) have acceptable magnitudes for backpropagation, the gradient would be scaled to a remarkably small size by the terms $y_k^t$ and $y_{k'}^t$, resulting in the gradient vanishing problem.
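As a rough illustration of the scale involved, the sketch below assumes an untrained network whose logits are all equal, so that each Softmax output is exactly one over the number of classes. This is an idealized assumption rather than a measurement from the experiments; `initial_probability` is a hypothetical helper.

```python
import math

def softmax(row):
    """Numerically stable softmax over one timestep's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def initial_probability(num_classes):
    """Per-class probability when all logits are equal (assumed initial state)."""
    return softmax([0.0] * num_classes)[0]

# 37 classes (scene text) versus 7357 classes (HCTR), as quoted in the text.
for num_classes in (37, 7357):
    print(num_classes, initial_probability(num_classes))
```

Under this assumption the Softmax factor that multiplies the gradient is about two hundred times smaller for the 7,357-class problem than for the 37-class one.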
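3.2 Cross-Entropy-Based ACE Loss Function
To prevent the gradient vanishing problem, it is necessary to offset the negative effect of the term $y_k^t$ introduced by the Softmax function in Eq. (7). We borrow the concept of cross-entropy from information theory, which is designed to measure the “distance” between two probability distributions. Therefore, we first normalize the accumulative probability of the $k$-th character to $\bar{y}_k = y_k / T$, and the character numbers to $\bar{N}_k = N_k / T$. Then, the cross-entropy between $\bar{N}$ and $\bar{y}$ is expressed as:
$$\mathcal{L}(I, S) = -\sum_{k=1}^{|C^\epsilon|} \bar{N}_k \log \bar{y}_k \qquad (8)$$
The derivative of the loss function with respect to $a_k^t$, before the Softmax activation function, has the following form:
$$\frac{\partial \mathcal{L}(I, S)}{\partial a_k^t} = \frac{1}{T}\, y_k^t \left( \sum_{k'=1}^{|C^\epsilon|} \frac{\bar{N}_{k'}}{\bar{y}_{k'}}\, y_{k'}^t - \frac{\bar{N}_k}{\bar{y}_k} \right) \qquad (9)$$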
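The derivative of the cross-entropy-based loss can be checked the same way. The sketch below is an assumption-laden illustration: it takes the loss to be -sum_k Nbar_k * log(ybar_k) with ybar_k = y_k / T and Nbar_k = N_k / T, applies the chain rule through the Softmax, and compares the result with central finite differences. The names and toy values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one timestep's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, counts):
    """-sum_k Nbar_k * log(ybar_k), both normalized by the number of timesteps."""
    T = len(logits)
    probs = [softmax(row) for row in logits]
    y_bar = [sum(p[k] for p in probs) / T for k in range(len(counts))]
    return -sum((n / T) * math.log(v) for n, v in zip(counts, y_bar))

def ce_grad(logits, counts):
    """Chain rule: (y_k^t / T) * (sum_k' r_k' * y_k'^t - r_k), r_k = Nbar_k / ybar_k."""
    T = len(logits)
    probs = [softmax(row) for row in logits]
    y_bar = [sum(p[k] for p in probs) / T for k in range(len(counts))]
    ratio = [(n / T) / v for n, v in zip(counts, y_bar)]
    grad = []
    for p in probs:
        inner = sum(r * q for r, q in zip(ratio, p))
        grad.append([q / T * (inner - r) for q, r in zip(p, ratio)])
    return grad

def numeric_grad(loss_fn, logits, counts, h=1e-6):
    """Central finite differences, perturbing one logit at a time."""
    grad = []
    for t in range(len(logits)):
        row = []
        for k in range(len(logits[t])):
            plus = [r[:] for r in logits]
            minus = [r[:] for r in logits]
            plus[t][k] += h
            minus[t][k] -= h
            row.append((loss_fn(plus, counts) - loss_fn(minus, counts)) / (2 * h))
        grad.append(row)
    return grad

# Toy problem (assumed values): T = 4 timesteps, 3 classes, counts sum to T.
logits = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [-0.3, 0.1, 0.2], [0.5, 0.0, -0.2]]
counts = [2, 1, 1]

a = ce_grad(logits, counts)
n = numeric_grad(ce_loss, logits, counts)
max_err = max(abs(x - y) for ra, rn in zip(a, n) for x, y in zip(ra, rn))
print(max_err)  # should be tiny if the closed form matches the loss
```

In this form the Softmax output y_k^t appears divided by the normalized aggregate ybar_k, which is the ratio the discussion below reasons about.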
3.2.1 Discussion
In the following, we explain how the updated loss function solves the gradient vanishing problem:
(1) In the early training stage, has an approximately identical order of magnitude at all the timesteps. Thus, the normalized accumulated probability is also of an identical order of magnitude as . That is, ; therefore, the gradient through the th class is now . Thus, the gradient can straightforwardly backpropagate to through the characters that appear in sequence annotation . Besides, when , i.e., ; the corresponding gradient is approximately , which will encourage the model to make a larger prediction , whereas characters that do not appear in become smaller. This was our original intention.
(2) In the later training stage, only a few of the prediction will be very large, leaving the other predictions small enough to be omitted. In this situation, prediction will occupy the majority of , and we have . Therefore, when , the gradient can be straightforwardly backpropagated to the recognition network.
3.3 Two-dimensional Prediction
In some 2D prediction problems, such as irregular scene text recognition with image-level annotations, it is challenging to define the spatial relation between characters. Characters may be arranged in multiple lines, in a curved or sloped direction, or even distributed in a random manner. Fortunately, the proposed ACE loss function can naturally be generalized for the 2D prediction problem, because it does not require character-order information for the sequence-learning process.
Suppose that the output 2D prediction $y$ has height $H$ and width $W$, and the prediction at the $h$-th row and $w$-th column is denoted as $y_k^{hw}$. This requires a marginal adaptation of the calculation of $\bar{y}_k$ and $\bar{N}_k$ as follows: $\bar{y}_k = \frac{y_k}{HW} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} y_k^{hw}$, $\bar{N}_k = \frac{N_k}{HW}$. Then, the loss function for the 2D prediction can be transformed as follows:
$$\mathcal{L}(I, S) = -\sum_{k=1}^{|C^\epsilon|} \bar{N}_k \log \bar{y}_k = -\sum_{k=1}^{|C^\epsilon|} \frac{N_k}{HW} \log \frac{y_k}{HW} \qquad (10)$$
In our implementation, we directly flatten the 2D prediction $y$ into a 1D prediction $\{y^t,\ t = 1, 2, \ldots, T\}$, where $T = HW$, and then apply Eq. (8) to calculate the final loss.
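The equivalence between aggregating over the 2D grid and aggregating over the flattened sequence can be illustrated with a small sketch. The grid values and function names below are hypothetical; the only assumption is that the normalized aggregate is the mean of the per-position class probabilities.

```python
def aggregate_2d(pred):
    """Mean class probability over an H x W grid; pred[h][w] is a probability vector."""
    H, W, K = len(pred), len(pred[0]), len(pred[0][0])
    return [sum(pred[h][w][k] for h in range(H) for w in range(W)) / (H * W)
            for k in range(K)]

def flatten(pred):
    """Flatten the grid row by row into a sequence of T = H * W probability vectors."""
    return [cell for row in pred for cell in row]

def aggregate_1d(seq):
    """Mean class probability over a sequence of T probability vectors."""
    T, K = len(seq), len(seq[0])
    return [sum(step[k] for step in seq) / T for k in range(K)]

# Toy 2 x 2 grid with 3 classes (assumed values).
pred = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.3, 0.3, 0.4], [0.9, 0.05, 0.05]],
]
y_bar_2d = aggregate_2d(pred)
y_bar_1d = aggregate_1d(flatten(pred))
print(y_bar_2d, y_bar_1d)
```

Because the aggregation is a plain sum, the order in which grid cells are visited does not affect the loss; the flattening order only matters later, at decoding time.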
3.4 Implementation and Complexity Analysis
Implementation As illustrated in Eq. (3), $\mathcal{N} = \{N_k \mid k = 1, 2, \ldots, |C^\epsilon|\}$ represents the annotation for the ACE loss function; here, $N_k$ represents the number of times that the character $C_k^\epsilon$ occurs in the sequence annotation $S$. A simple example describing the translation of the sequence annotation cocacola into ACE’s annotation is shown in Fig. 2. In conclusion, given the model prediction $y_k^t$ and its annotation $\mathcal{N}$, the key implementation for a cross-entropy-based ACE loss function consists of four fundamental formulas:

$y_k = \sum_{t=1}^{T} y_k^t$ to calculate the character number of each class by summing up the probabilities of the $k$-th class for all $T$ timesteps.

$\bar{y}_k = y_k / T$ to normalize the accumulative probabilities.

$\bar{N}_k = N_k / T$ to normalize the annotation.

$\mathcal{L}(I, S) = -\sum_{k=1}^{|C^\epsilon|} \bar{N}_k \log \bar{y}_k$ to estimate the cross-entropy between $\bar{N}$ and $\bar{y}$.
In practical employment, the model prediction is generally provided by the integrated CNN-LSTM model (1D prediction) or FCN model (flattened 2D prediction). That is, the input assumption of ACE is identical to that of CTC; therefore, the proposed ACE can be conveniently applied by replacing the CTC layer in the framework.
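The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: class index 0 is taken to be the blank, the blank count is set to the number of timesteps minus the annotation length, and a small `eps` is added inside the logarithm for numerical safety. The names `ace_annotation`, `ace_loss`, and `eps` are hypothetical.

```python
import math
from collections import Counter

def ace_annotation(text, charset, num_steps):
    """Count vector N: index 0 is the blank, index i + 1 is charset[i]."""
    counts = Counter(text)
    return [num_steps - len(text)] + [counts.get(ch, 0) for ch in charset]

def ace_loss(probs, n, eps=1e-10):
    """probs: T x K per-timestep Softmax outputs; n: length-K count vector."""
    T = len(probs)
    K = len(n)
    y = [sum(probs[t][k] for t in range(T)) for k in range(K)]  # aggregate over time
    y_bar = [v / T for v in y]                                   # normalize prediction
    n_bar = [v / T for v in n]                                   # normalize annotation
    # Cross-entropy between the two distributions (eps is an added safeguard).
    return -sum(nb * math.log(yb + eps) for nb, yb in zip(n_bar, y_bar))

# "cocacola" over the toy charset "acol" with an assumed T = 12 timesteps.
n = ace_annotation("cocacola", "acol", 12)
print(n)  # blank count first, then counts for a, c, o, l

# Uniform predictions over the 5 classes, as a stand-in for an untrained model.
uniform = [[0.2] * 5 for _ in range(12)]
print(ace_loss(uniform, n))
```

Only the count vector is needed as supervision, so the same function accepts predictions from a recurrent model or from a flattened 2D feature map without modification.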
Complexity Analysis The overall computation of the ACE loss function is implemented based on the abovementioned four formulas, which have computation complexities of $\mathcal{O}(T \cdot |C^\epsilon|)$, $\mathcal{O}(|C^\epsilon|)$, $\mathcal{O}(|C^\epsilon|)$, and $\mathcal{O}(|C^\epsilon|)$, respectively. Therefore, the computation complexity of the ACE loss function is $\mathcal{O}(T \cdot |C^\epsilon|)$. Note, however, that the element-wise multiplication, division, and log operations in these four formulas can be implemented in parallel with GPU at $\mathcal{O}(1)$. In contrast, the implementation of CTC [12] based on a forward-backward algorithm has a computation complexity of $\mathcal{O}(T \cdot |S|)$. Because the forward variable $\alpha(t, u)$ and backward variable $\beta(t, u)$ [12] of CTC depend on the previous result (e.g., $\alpha(t-1, u)$ and $\beta(t+1, u)$) to calculate the present output, CTC can hardly be accelerated in parallel in the time dimension. Moreover, the elementary operation of CTC is already very complicated, resulting in larger overall time consumption than that of ACE. With regard to the attention mechanism, its computation complexity is proportional to the number of “attention” steps. However, the computation complexity of the attention module at each step already has a similar magnitude to that of CTC.
From the perspective of memory consumption, the proposed ACE loss function requires nearly no additional memory because the ACE loss result can be directly calculated based on the four fundamental formulas. However, CTC requires additional space to preserve the forward and backward variables, which is proportional to the number of timesteps and the length of the sequence annotation. Meanwhile, the attention mechanism requires an additional module to implement “attention”. Thus, its memory consumption is significantly larger than that of CTC and ACE.
In conclusion, the proposed ACE loss function exhibits significant advantages with regard to both computation complexity and memory demand, as compared to CTC and attention.
4 Performance Evaluation
In our experiments, three tasks were employed to evaluate the effectiveness of the proposed ACE loss function: scene text recognition, offline handwritten Chinese text recognition, and counting objects in everyday scenes. For these tasks, we estimated the ACE loss for 1D and 2D predictions, where 1D implies that the final prediction is a sequence of $T$ predictions and 2D indicates that the final feature map has 2D predictions of shape $H \times W$.
4.1 Scene Text Recognition
Scene text recognition often encounters problems owing to the large variations in the background, appearance, resolution, text font, and color, making it a challenging research topic. In this section, we study both 1D and 2D predictions on scene text recognition by utilizing the richness and variety of the testing benchmarks in this task.
4.1.1 Dataset
Two types of datasets are used for scene text recognition: regular text datasets, such as IIIT5K-Words [35], Street View Text [43], ICDAR 2003 [33], and ICDAR 2013 [24], and irregular text datasets, such as SVT-Perspective [36], CUTE80 [37], and ICDAR 2015 [23]. The regular datasets were used to study the 1D prediction for the ACE loss function while the irregular text datasets were applied to evaluate the 2D prediction.
IIIT5K-Words (IIIT5K) contains 3000 cropped word images for testing.
Street View Text (SVT) was collected from Google Street View, including 647 word images. Many of them are severely corrupted by noise and blur, or have very low resolutions.
ICDAR 2003 (IC03) contains 251 scene images that are labeled with text bounding boxes. The dataset contains 867 cropped images.
ICDAR 2013 (IC13) inherits most of its samples from IC03. It contains 1015 cropped text images.
SVT-Perspective (SVTP) contains 639 cropped images for testing, which are selected from side-view angle snapshots from Google Street View. Therefore, most of the images are perspective distorted. Each image is associated with a 50-word lexicon and a full lexicon.
CUTE80 (CUTE) contains 80 high-resolution images taken of natural scenes. It was specifically collected for curved text recognition. The dataset contains 288 cropped natural images for testing. No lexicon is associated.
ICDAR 2015 (IC15) contains 2077 cropped images, including more than 200 irregular text images.
4.1.2 Implementation Details
For 1D sequence recognition on regular datasets, our experiments were based on the CRNN [38] network, trained only on the 8 million synthetic data released by Jaderberg et al. [22]. For 2D sequence recognition on irregular datasets, our experiments were based on ResNet-101 [18], with conv1 changed to 3×3, stride 1, and conv4_x as output. The training dataset consists of the 8 million synthetic data released by Jaderberg et al. [22] and 4 million synthetic instances (excluding the images that contain non-alphanumeric characters) cropped from 80 thousand images [17]. The input images are normalized to the shape of (96, 100) and the final 2D prediction has the shape of (12, 13), as shown in Fig. 5. To decode the 2D prediction, we flattened the 2D prediction by concatenating each column in order from left to right and top to bottom and then decoded the flattened 1D prediction following the general procedure.
In our experiments, we observed that directly normalizing the input image to the size of (96, 100) overloads the network training process. Therefore, we trained another network to predict the character number in the text image and normalized the text image with respect to the character number to keep the character size within acceptable limits.
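The decoding step for 2D predictions can be sketched as follows. Two points are assumptions made for illustration: the flattening is read as column by column (left to right), with each column traversed top to bottom, and the "general procedure" is taken to be greedy best-path decoding that collapses repeated labels and drops blanks, as is common for CTC-style outputs. The function names and the toy grid are hypothetical.

```python
def flatten_column_major(pred):
    """Visit columns left to right and, within each column, rows top to bottom."""
    H, W = len(pred), len(pred[0])
    return [pred[h][w] for w in range(W) for h in range(H)]

def greedy_decode(seq, charset, blank=0):
    """Arg-max per step, collapse repeats, drop blanks (assumed decoding rule)."""
    out = []
    prev = blank
    for probs in seq:
        k = max(range(len(probs)), key=probs.__getitem__)
        if k != blank and k != prev:
            out.append(charset[k - 1])  # class i + 1 corresponds to charset[i]
        prev = k
    return "".join(out)

def one_hot(k, size=3):
    """Helper for building a toy prediction grid."""
    return [1.0 if i == k else 0.0 for i in range(size)]

# Toy 2 x 3 grid over classes (blank, 'a', 'b'): 'a' fills the first column,
# the middle column is blank, and 'b' sits at the top of the last column.
grid = [
    [one_hot(1), one_hot(0), one_hot(2)],
    [one_hot(1), one_hot(0), one_hot(0)],
]
decoded = greedy_decode(flatten_column_major(grid), "ab")
print(decoded)
```

Column-wise traversal keeps horizontally arranged characters in reading order even when a character spans several rows of the prediction map.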
Method  IIIT5K  SVT  IC03  IC13 

Shi et al. [38]  81.2  82.7  91.9  89.6 
ACE (1D, Regression)  19.4  6.6  12.0  9.3 
ACE (1D, Cross Entropy)  82.3  82.6  92.1  89.7 
4.1.3 Experimental Result
To study the role of regression and cross-entropy for the ACE loss function, we conducted experiments with 1D prediction using regular scene text datasets, as detailed in Table 1 and Fig. 3. Because there are only 37 classes in scene text recognition, the negative effect of the term $y_k^t$ in Eq. (7) is not as significant as that of the HCTR problem (7357 classes). As shown in Fig. 3, with regression-based ACE loss, the network can converge but at a relatively slow rate, probably due to the gradient vanishing problem. With cross-entropy-based ACE loss, the WER and CER evolve at a relatively higher rate and in a smoother manner at the early training stage and attain a significantly better convergence result in the subsequent training stage. Table 1 clearly reveals the superiority of the cross-entropy-based ACE loss function over the regression-based one. Therefore, we use cross-entropy-based ACE loss functions for all the remaining experiments. Moreover, with the same network setting (CRNN) and training set (8 million synthetic data), the proposed ACE loss function exhibits performance comparable with that of previous work [38] with CTC.
To validate the independence of the proposed ACE loss from character order, we conducted experiments with ACE, CTC, and attention on four datasets; the character order of the annotation is randomly shuffled at different ratios, as shown in Fig. 4. It is observed that the performance of attention and CTC on all the datasets degrades as the shuffle ratio increases. Specifically, attention is more sensitive than CTC because the misalignment problem can easily mislead the training process of attention [3]. In contrast, the proposed ACE loss function exhibits similar recognition results for all the settings of the shuffle ratio. This is because it only requires classes and their number for supervision, completely omitting character order information.
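The order independence follows directly from how the supervision target is built: only per-class counts enter the loss. The sketch below illustrates this with a hypothetical helper `count_vector`, assuming class 0 is the blank and an arbitrary prediction length of 12.

```python
import random
from collections import Counter

def count_vector(text, charset, num_steps):
    """Per-class counts used as the ACE target; index 0 is the blank."""
    counts = Counter(text)
    return [num_steps - len(text)] + [counts[ch] for ch in charset]

original = "cocacola"
chars = list(original)
random.Random(0).shuffle(chars)  # fixed seed so the run is repeatable
shuffled = "".join(chars)

# Both annotations map to the same target, so the ACE loss cannot tell them apart.
same_target = count_vector(original, "acol", 12) == count_vector(shuffled, "acol", 12)
print(shuffled, same_target)
```

A loss that aligns predictions to positions, such as CTC or attention, receives a different target once the characters are shuffled, which is consistent with the degradation reported for those methods.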
Method  2D  SVTP (50)  SVTP (Full)  SVTP (None)  CUTE (None)  IC15 (None)
Shi et al. [38]  –  92.6  72.6  66.8  54.9  –
Liu et al. [28]  –  94.3  83.6  73.5  –  –
Yang et al. [51]  ✓  93.0  80.2  75.8  69.3  –
Cheng et al. [7]  ✓  92.6  81.6  71.5  63.9  66.2
Cheng et al. [8]  ✓  94.0  83.7  73.0  76.8  68.2
Liu et al. [29]  –  –  –  73.9  62.5  –
Shi et al. [39]  –  –  –  74.1  73.3  –
ACE (2D)  ✓  94.9  87.8  70.1  82.6  68.9
For irregular scene text recognition, we conducted text recognition experiments with 2D prediction. In Table 2, we provide a comparison with previous methods that considered only recognition model and no rectification for fair comparison. As illustrated in Table 2, the proposed ACE loss function exhibits superior performance on the datasets CUTE and IC15, particularly on CUTE with an absolute error reduction of 5.8%. This is because the dataset CUTE was specifically collected for curved text recognition and therefore, fully demonstrates the advantages of the ACE loss function. For the dataset SVTP, our naive decoding result is less effective than that of Yang et al. [51]. This is because numerous images in the dataset SVTP have very low resolutions, which creates a very high requirement for semantic context modeling. However, our network is based only on CNN, with neither LSTM/MDLSTM nor attention mechanism to leverage the highlevel semantic context. Nevertheless, it is noteworthy that our recognition model achieved the highest result when using lexicons, with which semantic context is accessible. This again validates the robustness and effectiveness of the proposed ACE loss function.
In Fig. 5, we provide a few real images processed by a recognition model using the ACE loss function. The original text images were first normalized and placed in the center of a blank image of shape (96, 100). We observe that after recognition, the 2D prediction exhibits a spatial distribution highly similar to that of the characters in the original text image, which implies the effectiveness of the proposed ACE loss function.
4.2 Offline Handwritten Chinese Text Recognition
Owing to its large character set (7,357 classes), diverse writing style, and character-touching problem, the offline HCTR problem is highly complicated and challenging to solve. Therefore, it is a favorable testbed to evaluate the robustness and effectiveness of the ACE loss in 1D predictions.
4.2.1 Implementation Details
For the offline HCTR problem, our model was trained using the CASIA-HWDB [26] datasets and tested with the standard benchmark ICDAR 2013 competition dataset [53].
For the HCTR problem, our network architecture with a prediction sequence of length 70 is specified as follows:
,
where represents a convolutional layer with kernel number of and kernel size of ,
denotes a max pooling layer with kernel size of
, and is a fully connected layer with kernel number of , and ResLSTM is the residual LSTM proposed in [49]. The evaluation criteria for the HCTR problem are correct rate (CR) and accuracy rate (AR) specified by the ICDAR 2013 competition [53].
4.2.2 Experimental Result
In Table 3, we provide the comparison between ACE loss and previous methods. It is evident that the proposed ACE loss function exhibits higher performance than previous methods, including MDLSTM-based models [34, 47], an HMM-based model [10], and over-segmentation methods [27, 44, 45, 48], with and without a language model (LM). Compared to scene text recognition, the handwritten Chinese text recognition problem poses its own challenges, such as a large character set (7357 classes) and the character-touching problem. Therefore, the superior performance of the ACE loss function over previous methods supports its robustness and generality for sequence recognition problems.
Method  CR (w/o LM)  AR (w/o LM)  CR (with LM)  AR (with LM)
HIT2 [27]  –  –  88.76  86.73
Wang et al. [44]  –  –  91.39  90.75
Messina et al. [34]  –  83.50  –  89.40
Wu et al. [47]  87.43  86.64  –  92.61
Du et al. [10]  –  83.89  –  93.50
Wang et al. [45]  90.67  88.79  95.53  94.02
Wu et al. [48]  –  –  96.32  96.20
ACE (1D)  91.68  91.25  96.70  96.22
4.3 Counting Objects in Everyday Scenes
Counting the number of instances of object classes in natural everyday images generally encounters complex real life situations, e.g., large variance in counts, appearance, and scales of object. Therefore, we verified the ACE loss function on the problem of counting objects in everyday scenes to demonstrate its generality.
4.3.1 Implementation Details
As a benchmark for multi-label object classification and object detection tasks, the PASCAL VOC [11] datasets contain category labels per image, as well as bounding box annotations that can be converted to object number labels. In our implementation, we accumulated the prediction for category $k$ to obtain $y_k$, thresholding counts at zero and rounding predictions to the closest integers. Given these predictions $\hat{c}_{ik}$ and the ground truth counts $c_{ik}$ for a category $k$ and image $i$, RMSE and relRMSE are calculated over the $N$ test images by $\mathrm{RMSE}_k = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{c}_{ik} - c_{ik})^2}$ and $\mathrm{relRMSE}_k = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \frac{(\hat{c}_{ik} - c_{ik})^2}{c_{ik} + 1}}$.
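The two counting metrics can be sketched for a single category as follows. The formulas used here are the root-mean-square error and its relative variant, in which each squared error is divided by the ground-truth count plus one; this follows the usual definition in the counting literature and is stated as an assumption. The helper names and toy numbers are hypothetical.

```python
import math

def to_count(raw):
    """Round half up to the nearest integer, then clamp at zero."""
    return max(0, int(math.floor(raw + 0.5)))

def rmse(pred, truth):
    """Root-mean-square error over images for one category."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(pred, truth)) / len(truth))

def rel_rmse(pred, truth):
    """Relative RMSE: each squared error is scaled by (ground-truth count + 1)."""
    return math.sqrt(
        sum((p - c) ** 2 / (c + 1) for p, c in zip(pred, truth)) / len(truth)
    )

# Toy accumulated predictions for one category over three images (assumed values).
raw = [0.8, 1.2, -0.3]
truth = [1, 3, 0]
pred = [to_count(v) for v in raw]
print(pred, rmse(pred, truth), rel_rmse(pred, truth))
```

The relative variant down-weights errors on images with many instances, so a miss of two objects out of three counts less than a miss of two objects when none are present.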
4.3.2 Experimental Result
Table 4 presents a comparison between the proposed ACE loss function and previous methods on the PASCAL VOC 2007 test dataset for counting objects in everyday scenes. The proposed ACE loss function outperforms the previous glancing and subitizing method [6], the correlation loss method [40], and the Always-0 method (predicting the most-frequent ground truth count). The results show the generality of the ACE loss function, in that it can be readily applied to problems other than sequence recognition, e.g., counting problems, requiring minimal domain knowledge.
In Fig. 6, we provide real images processed by the counting model under ACE loss. As shown in the images, our counting model trained with ACE loss manages to pay “attention” to the position where crucial objects occur. Unlike the text recognition problem, where the recognition model trained with the ACE loss function tends to make a prediction for a character, the counting model trained with the ACE loss function provides a more uniform prediction distribution over the body of the object. Moreover, it assigns different levels of “attention” to different parts of an object. For example, when observing the red color in the pictures, we notice that the counting model pays more attention to the face of a person. This phenomenon corresponds to our intuition because the face is the most distinctive part of an individual.
4.4 Complexity Analysis
In Table 5, we compare the parameters, runtime memory, and run time of ACE with those of CTC and attention. The results were obtained with a minibatch of 64 and model prediction length T=144 on a single NVIDIA TITAN X graphics card with 12GB memory. Similar to CTC, the proposed ACE does not require any parameters to fulfill its function. Owing to its simplicity, ACE requires marginal runtime memory, at least five times less than that of CTC and attention. Furthermore, its speed is at least 30 times higher than that of CTC and attention. It is noteworthy that with all these advantages, the proposed ACE achieves performance that is comparable to or higher than that of CTC and attention.
Method  Para (37 classes)  Mem (37 classes)  Time (37 classes)  Para (7357 classes)  Mem (7357 classes)  Time (7357 classes)
CTC  none  0.1  3.1  none  47.8  16.2
Attention  2.8  6.6  78.9  17.2  143.6  85.5
ACE  none  0.02  0.1  none  4.2  0.1
5 Conclusion
In this paper, a novel and straightforward ACE loss function is proposed for the sequence recognition problem, with performance competitive with CTC and attention. Owing to its simplicity, the ACE loss function is easy to employ by simply replacing CTC with ACE, quick to implement with only four basic formulas, fast to infer and backpropagate (approximately in parallel), and has a marginal memory requirement. Two effective properties of the ACE loss function are also investigated: (1) it can easily handle 2D prediction problems with marginal adaptation and (2) it does not require character-order information for supervision, which allows it to advance beyond the sequence recognition problem, e.g., to the counting problem.
Acknowledgments
This research is supported in part by GDNSF (No. 2017A030312006), the National Key Research and Development Program of China (No. 2016YFB1001405), NSFC (Grant Nos. 61673182, 61771199), GDSTP (Grant No. 2017A010101027), and GZSTP (No. 201704020134).
References
 [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In ICASSP, 2016.
 [3] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou. Edit probability for scene text recognition. CVPR, 2018.
 [4] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. NIPS, 2016.
 [5] T. Bluche and R. Messina. Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention. ICDAR, 2016.
 [6] P. Chattopadhyay, R. Vedantam, R. R. Selvaraju, D. Batra, and D. Parikh. Counting everyday objects in everyday scenes. In CVPR, 2017.
 [7] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou. Focusing attention: Towards accurate text recognition in natural images. In ICCV, 2017.
 [8] Z. Cheng, X. Liu, F. Bai, Y. Niu, S. Pu, and S. Zhou. Arbitrarily-oriented text recognition. ICDAR, 2017.
 [9] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In NIPS, 2015.

 [10] J. Du, Z.-R. Wang, J.-F. Zhai, and J.-S. Hu. Deep neural network based hidden markov model for offline handwritten chinese text recognition. In ICPR.
 [11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2015.
 [12] A. Graves. Supervised sequence labelling. Springer, 2012.
 [13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. ICML, 2006.
 [14] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. ICML, 2014.
 [15] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE TPAMI, 2009.
 [16] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
 [17] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [19] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang. Reading scene text in deep convolutional sequences. AAAI, 2016.
 [20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [21] K. Hwang and W. Sung. Sequence to sequence training of ctc-rnns with partial windowing. In ICML, 2016.
 [22] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. CoRR, abs/1406.2227, 2014.
 [23] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. Icdar 2015 competition on robust reading. In ICDAR, 2015.
 [24] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. Icdar 2013 robust reading competition. In ICDAR, 2013.
 [25] S. Kim, T. Hori, and S. Watanabe. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In ICASSP, 2017.
 [26] C.L. Liu, F. Yin, D.H. Wang, and Q.F. Wang. Casia online and offline chinese handwriting databases. ICDAR, 2011.
 [27] C.L. Liu, F. Yin, Q.F. Wang, and D.H. Wang. ICDAR 2011 chinese handwriting recognition competition (2011).
 [28] W. Liu, C. Chen, K.Y. K. Wong, Z. Su, and J. Han. Starnet: A spatial attention residue network for scene text recognition. In BMVC, 2016.
 [29] Y. Liu, Z. Wang, H. Jin, and I. Wassell. Synthetically supervised feature learning for scene text recognition. In ECCV, 2018.
 [30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
 [31] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
 [32] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
 [33] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, et al. Icdar 2003 robust reading competitions: entries, results, and future directions. IJDAR, 2005.
 [34] R. Messina and J. Louradour. Segmentation-free handwritten chinese text recognition with lstm-rnn. ICDAR, 2015.
 [35] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012.
 [36] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013.
 [37] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications, 2014.
 [38] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE TPAMI, 2016.
 [39] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai. Aster: An attentional scene text recognizer with flexible rectification. IEEE TPAMI, 2018.
 [40] Z. Song and Q. Qiu. Learn to classify and count: A unified framework for object classification and counting. In ICIGP, 2018.
 [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
 [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
 [43] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011.
 [44] Q.F. Wang, F. Yin, and C.L. Liu. Handwritten chinese text recognition by integrating multiple contexts. IEEE TPAMI, 2012.
 [45] S. Wang, L. Chen, L. Xu, W. Fan, J. Sun, and S. Naoi. Deep knowledge training and heterogeneous cnn for handwritten chinese text recognition. In ICFHR, 2016.
 [46] C. Wigington, C. Tensmeyer, B. Davis, W. Barrett, B. Price, and S. Cohen. Start, follow, read: End-to-end full-page handwriting recognition. In ECCV, 2018.
 [47] Y.C. Wu, F. Yin, Z. Chen, and C.L. Liu. Handwritten chinese text recognition using separable multi-dimensional recurrent neural network. In ICDAR, 2017.
 [48] Y.C. Wu, F. Yin, and C.L. Liu. Improving handwritten chinese text recognition using neural network language models and convolutional neural network shape models. Pattern Recognition, 2017.
 [49] Z. Xie, Z. Sun, L. Jin, H. Ni, and T. Lyons. Learning spatial-semantic context with fully convolutional recurrent network for online handwritten chinese text recognition. IEEE TPAMI, 2018.
 [50] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
 [51] X. Yang, D. He, Z. Zhou, D. Kifer, and C. L. Giles. Learning to read irregular text with attention mechanisms. In IJCAI, 2017.
 [52] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
 [53] F. Yin, Q.F. Wang, X.Y. Zhang, and C.L. Liu. ICDAR 2013 chinese handwriting recognition competition. ICDAR, 2013.
 [54] F. Yin, Y.C. Wu, X.Y. Zhang, and C.L. Liu. Scene text recognition with sliding convolutional character models. CoRR, abs/1709.01727, 2017.
 [55] B. Zhang, Y. Gan, Y. Song, and B. Tang. Application of pronunciation knowledge on phoneme recognition by lstm neural network. In ICPR, 2016.
 [56] J. Zhang, J. Du, and L. Dai. Track, attend and parse (tap): An end-to-end framework for online handwritten mathematical expression recognition. TMM, 2018.
 [57] J. Zhang, J. Du, S. Zhang, D. Liu, Y. Hu, J. Hu, S. Wei, and L. Dai. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition. Pattern Recognition, 2017.