1 Introduction
Connectionist temporal classification (CTC) (Graves & Gomez, 2006) is a commonly used method in sequence recognition tasks, including speech recognition (Graves & Jaitly, 2014; Miao et al., 2016; Kim et al., 2017), text recognition (Alex et al., 2009; He et al., 2016b; Shi et al., 2017; Borisyuk et al., 2018) and so on. With an extra blank class, the output at each timestep in the sequence indicates either a specific label or no label. The outputs over all timesteps constitute a sequence of labels and blanks, called a path. A path is transformed into a label sequence by removing the repeated labels and then the blanks in it, and different paths can correspond to the same label sequence. CTC-based training maximizes the probability of the correct labelling, which is calculated by summing the probabilities of all the corresponding paths.
Some previous works try to improve CTC with regularization or reweighting/resampling heuristics. Hu et al. (2018) propose a maximum-entropy-based regularization for CTC (EnCTC) to enhance exploration during training and obtain models with better generalization. They also propose an entropy-based pruning algorithm (EsCTC) to rule out unreasonable paths. For weakly-supervised action labelling in video, Huang et al. (2016) introduce the Extended Connectionist Temporal Classification (ECTC) framework to enforce the consistency of all possible paths with frame-to-frame visual similarities. To solve the imbalance problem in sequence recognition, Feng et al. (2019) modify the traditional CTC by fusing focal loss with it, making the model attend to low-frequency samples at the training stage. All these works treat a sequence or a path as an example, and calculate the loss or perform the resampling on the basis of sequences or paths.
Different from them, we propose to treat each timestep in a sequence as an individual example, and regard the sequence recognition task as a classification task for each timestep. The classification of each timestep is similar to the classification branch in object detection (Liu et al., 2016), where the blank class corresponds to the background category and the labels correspond to the objects. In this case, CTC may suffer from a classic problem in object detection, the imbalance between background and object samples. As shown in Figure 1, the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks. This means that only a few timesteps are label samples, and the rest are all blank samples, which far outnumber the former. According to the error signals, the label samples are the hard examples during training, but become less hard as the network converges. By then, the network updates may be overwhelmed by the blanks.
In our experiments, we observe a phenomenon of accuracy degradation that supports this speculation. When a network is trained with the CTC loss and tested on the training set, the recognition accuracy starts to decrease after a certain number of iterations and becomes very unstable. But if batch normalization is performed within each mini-batch without using global statistics, the accuracy becomes reasonable, as illustrated in Figure 2. This shows that the network updating is unstable, so the mean and variance averaged over the past iterations do not suit the current network weights, which is probably caused by the overwhelming blanks.
Therefore, common heuristics for solving class imbalance in object detection, such as online hard example mining (OHEM) (Shrivastava et al., 2016), focal loss (Lin et al., 2017) and GHM (Li et al., 2019), can be introduced to improve CTC.
In this paper, we propose a novel method to reweight CTC, offering both a theoretical basis and supporting experimental results. The main contributions are summarized as follows: (1) We reinterpret the CTC loss for sequence labelling as the cross entropy loss for a classification problem, providing a new perspective for modifying CTC. (2) To deal with the class imbalance, we propose several weighted CTC losses, and demonstrate their effectiveness by comparison experiments on scene text recognition. The proposed weighted CTC has several advantages over the original CTC, including (1) preventing the accuracy degradation phenomenon, (2) alleviating the negative effects caused by imbalanced training data, and (3) facilitating model convergence, which means better performance and shorter training time.
2 Method
2.1 Connectionist Temporal Classification
CTC (Graves & Gomez, 2006) was proposed for labelling sequence data within a single network architecture that does not need pre-segmentation or post-processing. The basic idea is to interpret the network outputs as a conditional probability distribution over all possible output label sequences. Given this distribution, an objective function can be derived that directly maximises the probabilities of the correct label sequences.
At each timestep $t$, the network outputs a probability distribution over the label set $L' = L \cup \{\text{blank}\}$, where $L$ contains all the labels in the task and the extra blank represents ‘no label’. The activation $y^t_k$ is interpreted as the probability of observing label $k$ at time $t$. Given the length-$T$ input sequence $\mathbf{x}$, we get the conditional probability of observing a particular path $\pi$ through the lattice of label observations:
(1) $p(\pi|\mathbf{x}) = \prod_{t=1}^{T} y^t_{\pi_t}, \quad \forall \pi \in L'^T$
where $\pi_t$ is the label observed at time $t$ along path $\pi$, and $L'^T$ is the set of length-$T$ paths over $L'$.
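Paths are mapped onto label sequences by an operation $\mathcal{B}$ that simply removes the repeated labels and then the blanks in a sequence. For a given label sequence $\mathbf{l}$, more than one path corresponds to it, e.g. $\mathcal{B}(a{-}ab{-}) = \mathcal{B}({-}aa{-}{-}abb) = aab$, where ‘$-$’ denotes the blank. We can evaluate the conditional probability of $\mathbf{l}$ as the sum of the probabilities $p(\pi|\mathbf{x})$ of all the corresponding paths $\pi$:
(2) $p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x})$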
The CTC loss function is defined as the negative log probability of correctly labelling the sequence $\mathbf{l}$ given the input $\mathbf{x}$:
(3) $\mathcal{L}_{ctc} = -\ln p(\mathbf{l}|\mathbf{x})$
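As a minimal sketch of the definitions above, the following Python snippet collapses a path into a label sequence and evaluates the CTC loss by brute-force enumeration of all paths. The function names, the `-` blank symbol and the toy probabilities are assumptions made for illustration; enumeration is exponential in the number of timesteps, so this is only suitable for tiny examples and is not how the loss is computed in practice.

```python
import itertools
import math

BLANK = "-"  # hypothetical symbol for the extra 'no label' class


def collapse(path):
    """Map a path to a label sequence: merge repeated labels, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)


def label_probability(probs, target):
    """Sum the probabilities of every path that collapses to `target`.

    `probs` holds one dict per timestep, mapping symbol -> probability.
    """
    symbols = sorted(probs[0])
    total = 0.0
    for path in itertools.product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


def ctc_loss(probs, target):
    """Negative log probability of the target label sequence."""
    return -math.log(label_probability(probs, target))


# Toy per-timestep distributions over {blank, a, b} for three timesteps.
probs = [
    {"-": 0.6, "a": 0.3, "b": 0.1},
    {"-": 0.2, "a": 0.5, "b": 0.3},
    {"-": 0.5, "a": 0.1, "b": 0.4},
]

print(collapse("-aa--abb"))  # aab
print(label_probability(probs, "ab"))
print(ctc_loss(probs, "ab"))
```

In this toy case exactly five paths (`aab`, `abb`, `-ab`, `a-b`, `ab-`) collapse to `ab`, which shows how several paths contribute to one label sequence.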
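During training, to backpropagate the gradient through the output layer, we need the derivatives of the loss function with respect to the outputs $a^t_k$ before the activation function is applied. For the softmax activation function
(4) $y^t_k = \frac{e^{a^t_k}}{\sum_{k'} e^{a^t_{k'}}}$
where $k'$ ranges over $L'$, the derivative with respect to $a^t_k$ is
(5) $\frac{\partial \mathcal{L}_{ctc}}{\partial a^t_k} = y^t_k - \frac{1}{p(\mathbf{l}|\mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\, \pi_t = k} p(\pi|\mathbf{x})$
where $\sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\, \pi_t = k} p(\pi|\mathbf{x})$ is the sum of probabilities of all the paths corresponding to $\mathbf{l}$ that go through the label $k$ at time $t$.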
When the network is used for prediction, the predictions over all timesteps are converted into a label sequence. Since the computational complexity grows exponentially with the length of the path, it is not practical to find the most probable label sequence $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})$. There are many approximate alternatives, and best path decoding is one of the most commonly used. It assumes that the most probable path $\pi^*$ corresponds to the most probable label sequence:
(6) $\mathbf{l}^* \approx \mathcal{B}(\pi^*), \quad \pi^* = \arg\max_{\pi} p(\pi|\mathbf{x})$
It is not guaranteed to find the most probable label sequence, but the solution is good enough in most cases and the computation is trivial.
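The following sketch illustrates best path decoding and why it is only an approximation. It takes the most probable symbol at each timestep, collapses the resulting path, and compares the result with the label sequence found by exhaustive enumeration. The function names, the `-` blank symbol and the two-timestep toy distribution are assumptions of this example.

```python
import itertools
import math
from collections import defaultdict

BLANK = "-"  # hypothetical symbol for the blank class


def collapse(path):
    """Merge repeated labels, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)


def best_path_decode(probs):
    """Pick the most probable symbol at every timestep and collapse the path."""
    path = [max(dist, key=dist.get) for dist in probs]
    return collapse(path)


def most_probable_labelling(probs):
    """Exhaustive search over all paths; feasible only for tiny inputs."""
    symbols = sorted(probs[0])
    scores = defaultdict(float)
    for path in itertools.product(symbols, repeat=len(probs)):
        scores[collapse(path)] += math.prod(probs[t][s] for t, s in enumerate(path))
    return max(scores, key=scores.get)


# The single best path is "--" (probability 0.36), but the three paths
# "a-", "-a" and "aa" together give the label sequence "a" probability 0.64.
probs = [{"-": 0.6, "a": 0.4}, {"-": 0.6, "a": 0.4}]
print(repr(best_path_decode(probs)))         # ''
print(repr(most_probable_labelling(probs)))  # 'a'
```

When the per-timestep predictions are confident, as with the spiky outputs of a trained network, the two results usually coincide.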
2.2 Cross Entropy
The cross entropy (CE) is used to estimate the distance between two probability distributions. Given the groundtruth $\mathbf{y}'$ and the network outputs $\mathbf{y}$, the cross entropy loss is defined as
(7) $\mathcal{L}_{ce} = -\sum_{k} y'_k \ln y_k$
where $k$ ranges over all the classes, and $y_k$ and $y'_k$ are the model’s estimated and groundtruth probabilities for class $k$ respectively. Let $a_k$ be the model’s outputs before the softmax activation function is applied; the derivative of the loss function with respect to $a_k$ is
(8) $\frac{\partial \mathcal{L}_{ce}}{\partial a_k} = y_k - y'_k$
2.3 Focal Loss
The focal loss (Lin et al., 2017) is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training. Let $y \in \{\pm 1\}$ specify the groundtruth class for binary classification, and $p \in [0, 1]$ denote the model’s estimated probability for the class with label $y = 1$. For notational convenience, define $p_t$ as
(9) $p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$
The main idea of focal loss is to reshape the loss function to down-weight easy examples and thus focus training on hard negatives. On the basis of the cross entropy loss
(10) $\mathrm{CE}(p_t) = -\log(p_t)$
a modulating factor $(1 - p_t)^\gamma$ is added, with tunable focusing parameter $\gamma \ge 0$. The focal loss is defined as
(11) $\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$
To address class imbalance, a common method is to introduce a weighting factor $\alpha \in [0, 1]$ for class $1$ and $1 - \alpha$ for class $-1$. For notational convenience, $\alpha_t$ is defined analogously to $p_t$. The balanced variant of the focal loss is defined as
(12) $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
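A minimal sketch of the binary focal loss described above may help to see how the modulating factor behaves. The function name and default arguments are assumptions for illustration; `gamma=0` with `alpha=None` reduces to the plain cross entropy.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Binary focal loss for one example.

    p     -- estimated probability of the class labelled +1
    y     -- groundtruth class, +1 or -1
    gamma -- focusing parameter; 0 gives the plain cross entropy
    alpha -- optional balancing weight for class +1 (class -1 gets 1 - alpha)
    """
    p_t = p if y == 1 else 1.0 - p
    weight = (1.0 - p_t) ** gamma
    if alpha is not None:
        weight *= alpha if y == 1 else 1.0 - alpha
    return -weight * math.log(p_t)


# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
for p in (0.9, 0.1):
    ce = focal_loss(p, 1, gamma=0.0)
    fl = focal_loss(p, 1, gamma=2.0)
    print(p, ce, fl, fl / ce)
```

With `gamma=2`, the loss of the easy example is scaled by about 0.01, while the loss of the hard example is scaled by about 0.81.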
2.4 Cross Entropy Form of CTC
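Treating sequence recognition as the classification of each timestep, we rewrite the CTC loss in the form of a cross entropy loss. Given an input sequence $\mathbf{x}$ and its groundtruth label sequence $\mathbf{l}$, the network outputs probability distributions over the $T$ timesteps of the sequence. We define $\mathbf{y}^t$ as the predicted probability distribution for the sample of timestep $t$, and assume there is a corresponding groundtruth probability distribution $\mathbf{y}'^t$. The cross entropy loss of correctly labelling the sequence should be
(13) $\mathcal{L}_{ce} = -\sum_{t=1}^{T} \sum_{k \in L'} y'^t_k \ln y^t_k$
A feasible solution for $\mathbf{y}'^t$ can be found by following the conditions below:
(14) $y'^t_k = \frac{1}{p(\mathbf{l}|\mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\, \pi_t = k} p(\pi|\mathbf{x}), \quad \frac{\partial y'^t_k}{\partial y^t_k} = 0$
We can get the derivative of $\mathcal{L}_{ce}$ with respect to the pre-softmax output $a^t_k$
(15) $\frac{\partial \mathcal{L}_{ce}}{\partial a^t_k} = y^t_k - y'^t_k$
which equals the derivative of the CTC loss function in Equation (5).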
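The equality of the two derivatives can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: it enumerates all paths by brute force, computes for each timestep and class the share of the label-sequence probability carried by the paths passing through that class, freezes this quantity as the groundtruth, and compares finite-difference gradients of the CTC loss with those of the cross entropy against the frozen target. The function names, the blank index `0` and the toy logits are hypothetical.

```python
import itertools
import math

BLANK = 0  # class index reserved for the blank (assumption of this sketch)


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    norm = sum(exps)
    return [e / norm for e in exps]


def collapse(path):
    merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return tuple(k for k in merged if k != BLANK)


def ctc_prob_and_pseudo_gt(logits, target):
    """Brute-force p(target | x) and the per-timestep pseudo groundtruth."""
    y = [softmax(row) for row in logits]
    steps, classes = len(y), len(y[0])
    mass = [[0.0] * classes for _ in range(steps)]
    total = 0.0
    for path in itertools.product(range(classes), repeat=steps):
        if collapse(path) == tuple(target):
            p = math.prod(y[t][k] for t, k in enumerate(path))
            total += p
            for t, k in enumerate(path):
                mass[t][k] += p
    return total, [[m / total for m in row] for row in mass]


def ctc_loss(logits, target):
    return -math.log(ctc_prob_and_pseudo_gt(logits, target)[0])


def ce_loss(logits, fixed_gt):
    """Sum of per-timestep cross entropies against a frozen groundtruth."""
    y = [softmax(row) for row in logits]
    return -sum(
        g * math.log(p)
        for gt_row, y_row in zip(fixed_gt, y)
        for g, p in zip(gt_row, y_row)
        if g > 0.0
    )


def numeric_grad(loss, logits, eps=1e-6):
    """Central finite differences of `loss` with respect to every logit."""
    grad = []
    for t, row in enumerate(logits):
        grad.append([])
        for k in range(len(row)):
            plus = [r[:] for r in logits]
            minus = [r[:] for r in logits]
            plus[t][k] += eps
            minus[t][k] -= eps
            grad[t].append((loss(plus) - loss(minus)) / (2 * eps))
    return grad


logits = [[0.2, 1.0, -0.5], [0.1, -0.3, 0.8], [1.2, 0.0, 0.4], [-0.7, 0.5, 0.3]]
target = [1, 2]

_, pseudo_gt = ctc_prob_and_pseudo_gt(logits, target)  # frozen at the current outputs
grad_ctc = numeric_grad(lambda a: ctc_loss(a, target), logits)
grad_ce = numeric_grad(lambda a: ce_loss(a, pseudo_gt), logits)

row_sums = [sum(row) for row in pseudo_gt]
max_gap = max(
    abs(c - e)
    for row_c, row_e in zip(grad_ctc, grad_ce)
    for c, e in zip(row_c, row_e)
)
print(row_sums)  # each row of the pseudo groundtruth sums to 1 (up to rounding)
print(max_gap)   # the two gradients agree up to finite-difference error
```

Note that the groundtruth is held constant while the logits are perturbed in `ce_loss`, which corresponds to the zero-derivative condition discussed in this section.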
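According to Equations (1) and (2), the quantities that define the groundtruth distribution are calculated from the network outputs. It may therefore seem unreasonable that the derivative of the groundtruth with respect to the network outputs equals zero when the groundtruth depends on the outputs. We argue that the definition of the groundtruth is valid, and elaborate on it as follows.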
At first, we define an intermediate variable , where
(16) 
also denoted by for the expression convenience. In this function, the is a dimensional independent variable, whose value is different for each sequence in each minibatch. Over the entire training phase, has finite discrete values. Indexing the values of by , we define the groundtruth as a piecewise function
(17) 
where stands for neighborhood. It’s easy to know
(18) 
Omitting and substituting and into the above equation, we can get Equation (25).
Some examples of and the corresponding are illustrated in Figure 3 for an intuitive perception.
2.5 Weighted CTC Loss
Given the cross entropy loss function in Equation (26), we can apply weighting methods for classification tasks to improve CTC. Note that, unlike the groundtruth in a general classification task, the pseudo groundtruth does not follow a distribution in which one class has a probability of 1 and the other classes have a probability of 0, so there is no single groundtruth class. We adapt the weighting method in two different ways to accommodate this situation. One way is to assign weights to different classes, which is called class weighting. The other way is sample weighting, where each sample weight is calculated based on .
First, we introduce weighting factors to balance the label and blank samples. The class-weighted CTC loss function is defined as
(19) 
where
(20) 
is the weighting factor for class $k$, and $\alpha$ is a tunable parameter. The sample-weighted CTC loss function is
(21) 
where is the weighting factor for sample defined as
(22) 
It is easy to see that when $\alpha$ is taken as 0.5, the above two weighted CTC losses are equivalent to the CTC loss.
We introduce focal loss to CTC, naming it connectionist temporal focal loss (CTFL), to focus the training process on hard samples. Extending focal loss from binary classification to the multi-class case is straightforward. Defining $p_t$ as the estimated probability for the groundtruth class, $(1 - p_t)^\gamma$ is used to down-weight easy samples, where $\gamma$ is a tunable focusing parameter. But as analyzed before, there is no groundtruth class for the pseudo groundtruth. In this case, we extend the sample weights of focal loss to class weights form , where denotes the distance between the estimated and groundtruth probabilities for class $k$. For class weighting, we use the distances as the class weights of each sample, and define the classes-weighted CTFL as
(23)
As for sample weighting, each sample weight is calculated by summing the distances over all classes. The sample-weighted CTFL is given as
(24)
It is easy to see that when the value of $\gamma$ is 0, CTFL degenerates into the CTC loss.
See the appendix for the loss derivatives and formula derivation processes.
3 Experiments
To evaluate the effects of the weighted CTC losses, we compare them with the CTC loss according to the convergence process and recognition performance of the models. For all the experiments, the accuracy refers to sequence accuracy, i.e. the percentage of testing images correctly recognized. Although different losses are adopted for training, the output is always the CTC loss, for it is an indicator of the probability of correctly recognizing a sequence according to Equation (3).
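The sequence accuracy used throughout the experiments can be sketched as follows. This is an assumed, minimal implementation in which a prediction counts as correct only if it matches the groundtruth string exactly; the function name and the case-sensitive comparison are assumptions, since the matching rule is not specified here.

```python
def sequence_accuracy(predictions, groundtruths):
    """Percentage of sequences whose prediction equals the groundtruth exactly."""
    if len(predictions) != len(groundtruths):
        raise ValueError("predictions and groundtruths must have the same length")
    if not predictions:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, groundtruths))
    return 100.0 * correct / len(predictions)


# One wrong character makes the whole word count as an error.
print(sequence_accuracy(["cat", "dog", "bird", "fish"],
                        ["cat", "dot", "bird", "fish"]))  # 75.0
```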
3.1 Datasets
For all the following experiments, we use the synthetic dataset released by (Jaderberg et al., 2014) as the training data. The dataset consists of 8 million word images and their corresponding groundtruth words. All the images are generated by a synthetic data engine using a 90k-word dictionary, and are of different sizes. For training efficiency, we construct a training set Synth consisting of images: First, all word images are scaled to height 32 without changing their aspect ratios. If the scaled width is larger than 256, we continue to scale the image to . If there is enough room for the next scaled image, we append it to this image after 20 columns of zeros. Otherwise, we pad the scaled image to width 256 with zeros. Besides, we construct another training set, Synth100, which is more balanced between different classes. It is the same as Synth but contains 100 extra copies of the images containing digits. The number of characters of each class in Synth and Synth100 is displayed in Figure 4.
Four popular benchmarks for scene text recognition are used for model performance evaluation, namely IIIT5k-word (IIIT5k), Street View Text (SVT), ICDAR 2003 (IC03) and ICDAR 2013 (IC13). IIIT5k (Mishra et al., 2012) contains 3,000 cropped word images collected from the Internet. SVT (Kai et al., 2012) contains 647 word images cropped from 249 street-view images that are collected from Google Street View. IC03 (Lucas et al., 2003) contains 251 scene images; we discard words that either contain non-alphanumeric characters or have fewer than three characters, and get 860 cropped word images. IC13 (Karatzas et al., 2013) contains 1,095 word images in total; we discard words that contain non-alphanumeric characters, and get 1,015 word images with groundtruths. In addition, we construct a subset Train with the first 64,000 images taken from the training set to evaluate the model performance on the training data.
3.2 Implementation Details
Two networks are used in our experiments: one is CRNN (Shi et al., 2017), and the other is a CNN that replaces the BLSTM layers in CRNN with residual blocks (He et al., 2016a). The network configurations are summarized in Table 1.
Table 1: Network configurations. ‘Conv’ is short for convolutional layer, and ‘MP’ is short for max pooling layer. ‘c’ stands for channels, which denotes the number of feature maps for a convolutional layer, the number of hidden units for a BLSTM layer, and the bottleneck channels for a residual unit. ‘k’, ‘s’, ‘p’ stand for kernel, stride and padding sizes respectively. ‘bn’ stands for batch normalization, and ‘softmax’ stands for the softmax activation function. The residual unit used here is the full pre-activation version proposed in (He et al., 2016a).
Unless otherwise stated, we use the CNN as the default network and Synth100 as the default training set in our experiments. We implement the network architecture within the Caffe (Jia et al., 2014) framework, with a custom implementation of the loss layer. Networks are trained with stochastic gradient descent (SGD). The weight decay is 0.0005, and the momentum is 0.9. The initial learning rate is 0.01, and it is multiplied by 0.1 after every fixed number of iterations, denoted as the learning rate step. Three different training strategies are used in our experiments:
bs32400k: batch size = 32, learning rate step = 100,000, max iterations = 400,000;
bs32800k: batch size = 32, learning rate step = 200,000, max iterations = 800,000;
bs25640k: batch size = 256, learning rate step = 20,000, max iterations = 40,000.
3.3 Complexity Analysis
We propose four weighted forms of the CTC loss, and compare their algorithmic complexity with that of CTC. According to their loss function derivatives, the gradient updating procedures are performed based on and , which are also calculated for the CTC loss. First, a softmax activation is applied to get the normalized network outputs , whose time complexity is for the CPU implementation and for the parallel GPU implementation. Then a dynamic-programming algorithm similar to the forward-backward algorithm for HMMs (Rabiner, 1993) is performed to calculate , whose time complexity is for both CPU and GPU implementations. For the CTC loss, there is the final subtraction, whose time complexity is for CPU and for GPU. Meanwhile, the weighted losses need additional calculations of the weights. We obtain the time complexity of CTC by summing up the above terms, and list the time complexity of the additional calculation for each weighted loss in Table 2. The additional calculation does not change the time complexity, so the amount of additional calculation in the weighted losses is acceptable. This is also validated by experiments, where the changes in training time are negligible.
Method  Complexity (CPU)  Complexity (GPU)  Trn. Time
                                            197 min
                                            199 min
                                            195 min
                                            200 min
                                            194 min
The space complexity of CTC is . The algorithm reuses the original space as much as possible and keeps the additional space complexity of a weighted CTC down to , which does not change the original space complexity.
In short, the proposed weighted CTC losses have the same time and space complexity as the CTC loss.
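The dynamic-programming step mentioned in this section can be sketched with the forward half of the CTC forward-backward recursion (Graves & Gomez, 2006), which computes the probability of a label sequence in time proportional to the number of timesteps times the length of the label sequence, instead of enumerating all paths. The code below is a minimal sketch under assumptions: the function names, the blank index `0` and the toy probabilities are hypothetical, and the brute-force routine is included only as a reference for checking the result.

```python
import itertools
import math

BLANK = 0  # class index reserved for the blank (assumption of this sketch)


def forward_probability(y, target):
    """p(target | x) via the CTC forward recursion.

    y[t][k] is the probability of class k at timestep t.
    """
    # Extended sequence: blanks before, between and after the labels.
    ext = [BLANK]
    for k in target:
        ext += [k, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = y[0][BLANK]
    if len(ext) > 1:
        alpha[1] = y[0][ext[1]]
    for t in range(1, len(y)):
        new = [0.0] * len(ext)
        for s, k in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and k != BLANK and k != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * y[t][k]
        alpha = new
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def collapse(path):
    merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return tuple(k for k in merged if k != BLANK)


def brute_force_probability(y, target):
    """Reference value by enumerating every path; exponential in len(y)."""
    total = 0.0
    for path in itertools.product(range(len(y[0])), repeat=len(y)):
        if collapse(path) == tuple(target):
            total += math.prod(y[t][k] for t, k in enumerate(path))
    return total


y = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.7, 0.1, 0.2]]
print(forward_probability(y, [1, 2]), brute_force_probability(y, [1, 2]))
```

A matching backward pass yields the per-timestep, per-class path sums needed for the gradient; it is omitted here to keep the sketch short.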
3.4 Overall Comparison
We train a group of models with each weighted CTC loss under various hyperparameters. The training strategy is bs32400k. The convergence processes of the models are shown in Figure 5, and the recognition performances are illustrated in Figure 6. Note that when $\alpha$ is taken as 0.5 and $\gamma$ is taken as 0, a weighted CTC loss becomes the CTC loss.
The parameter $\alpha$ in and is used to adjust the ratio of influences coming from label and blank samples. Figure 5(a) shows the effect of is not ideal. Figure 5(b) suggests that the accuracy degradation is caused by excess blank samples, which is consistent with our class imbalance speculation, for the degradation is more obvious under a lower $\alpha$. can prevent the accuracy degradation by focusing more attention on label samples, but it brings no obvious improvement in recognition performance, as shown in Figure 6(b). Based on our experience, benefits the model performance in some situations, but the improvements are minor compared with CTFL. So we ignore the weighting for the rest of this paper, and focus on discussing the effects of CTFL.
According to Figure 5(c) and 5(d), CTFL can prevent the accuracy degradation, as well as improve the recognition performance on the training set. performs slightly better than on recognition accuracy. Moreover, compared with , has no negative impact on the CTC loss, i.e. the probability of correctly recognizing a sequence. It means that will not damage the recognition performance when the decoding method is based on the probability. Figure 6(c) and 6(d) suggest for and for . These values are adopted for the rest of our experiments. However it is possible that the optimal values of the parameters are different for different tasks.
3.5 Deal with Class Imbalance
To investigate the effectiveness of CTFL for imbalanced classes, we train a set of models with the CTC loss and CTFL on Synth and Synth100 respectively. The training strategy is bs32400k. The recognition performances on the four test sets are presented in Table 3. The results are also illustrated in Figure 7 for a more direct comparison.
Trn. Set      Method  IIIT5k  SVT   IC03  IC13
Digits ratio          9.2     0     3.8   8.4
Synth         CTC     72.9    75.6  85.6  78.7
Synth                 78.6    74.5  89.1  84.7
Synth                 80.4    75.6  90.3  84.7
Synth100      CTC     80.0    75.6  88.7  86.3
Synth100              80.0    76.2  89.7  85.4
Synth100              81.0    76.8  90.7  85.6
Comparing the models trained with the CTC loss, there are large margins between the accuracies on IIIT5k, IC03 and IC13 of models trained on Synth and on Synth100. Each accuracy margin is consistent with the digits ratio of the corresponding test set according to Table 3. Therefore, we speculate that the model trained with CTC on Synth cannot correctly recognize digits, and that this is caused by the severe class imbalance in Synth as shown in Figure 4.
CTFL focuses the network on hard samples during training, and thus improves the network’s recognition ability for disadvantaged classes, i.e. digits in this case. Comparing the models trained on Synth, the models trained with CTFL achieve much better performance than CTC, and their performance is nearly as good as that of models trained on the balanced training set Synth100.
3.6 Facilitate Convergence
The convergence process of a model is influenced by the training settings, including batch size, learning rate, network architecture, etc. To evaluate the effect of CTFL on convergence for different training settings, in addition to the experiments above, we conduct another two groups of comparison experiments. The training settings and model performances for the three groups of experiments are shown in Table 4.
Trn. St.       Method     IIIT5k  SVT   IC03  IC13  Train
CNN bs32400k   CTC        80.0    75.6  88.7  86.3  88.3
CNN bs32400k              80.0    76.2  89.7  85.4  88.7
CNN bs32400k              81.0    76.8  90.7  85.6  89.8
CNN bs25640k   CTC        77.2    71.4  86.9  84.1  85.5
CNN bs25640k              78.7    72.6  87.7  84.5  86.6
CNN bs25640k              78.8    75.1  88.6  85.2  87.9
CRNN bs32400k  CTC        76.4    69.1  87.2  82.8  84.0
CRNN bs32400k  CTC(800k)  77.1    72.3  87.3  82.8  86.3
CRNN bs32400k             76.5    74.8  89.7  85.1  88.3
CRNN bs32400k             78.3    73.7  90.9  84.8  89.5
Comparing models trained with the CTC loss over different training settings, the CNN and strategy bs32400k lead to the best model performance. The convergence process is illustrated in Figure 5 with . Accuracy degradation appears early during training, but as the learning rate decreases, the degradation gradually disappears and does not affect the final model performance. In this situation, CTFL prevents the accuracy degradation and stabilizes the convergence process, but its effect on facilitating convergence and improving performance is less obvious.
Compared with training strategy bs32400k, the strategy bs25640k leads to underfitting models. According to Table 4, the accuracies of models trained with the CTC loss drop by about 2–4% due to the different training strategy. The convergence processes of this strategy are illustrated in Figure 8. Compared with CTC-based training, CTFL greatly facilitates convergence and leads to much better model performance. We think that the advantages of CTFL over CTC are more obvious when CTC-based training suffers from underfitting. This opinion is also supported by the third group of experiments, where the underfitting is caused by the network structure, since recurrent structures are usually difficult to train. As shown in Figure 9, the models trained with CTFL easily outperform the model trained with CTC in convergence, even when the latter is trained for twice as long.
On the one hand, it usually takes a lot of effort to find proper training settings. CTFL facilitates convergence, and thus ensures a relatively reasonable model performance across various training settings. On the other hand, under the same training settings, models trained with CTFL always achieve better or similar performance within fewer training iterations.
3.7 Comparison With CRNN
Convolutional recurrent neural network (CRNN) (Shi et al., 2017) is one of the state-of-the-art approaches in CTC-based text recognition, and it is adopted in our experiments. The results of CRNN in Table 4 fall behind the results reported in (Shi et al., 2017), because the training sets and test sets are different. Although the training and test sets are built from the same datasets, the details of the building process can differ, which affects the experimental results. As shown in Table 5, we test the trained model released by (Shi et al., 2017) (downloaded from their code webpage, and supposed to perform similarly to the reported results) on our test sets, and get somewhat inferior results.
                          Method              IIIT5k  SVT   IC03  IC13
Reported                  (Shi et al., 2017)  81.2    82.7  91.9  89.6
Results on Our Test Sets  (Shi et al., 2017)  80.3    81.6  90.0  86.2
Results on Our Test Sets  CTC                 80.2    79.9  90.9  86.6
Results on Our Test Sets                      82.0    81.8  91.5  88.6
To ensure a fair comparison, we compare results obtained on our test sets. Besides, we adopt the same training set and rescaling strategy as described in (Shi et al., 2017), despite slightly different training settings. According to Table 5, the CTC model gets results similar to those of the released CRNN model, and the model trained with CTFL achieves better results.
4 Conclusion
In this paper, we provide a new perspective for modifying the CTC method. The basic idea is to treat each timestep in a sequence, rather than the whole sequence, as a sample. Accordingly, we reinterpret the CTC loss for sequences as the cross entropy loss for timesteps through a pseudo groundtruth. The cross entropy form makes it possible to combine the CTC loss with reweighting strategies.
We introduce label/blank weighting and focal loss to CTC and obtain four weighted CTC losses. The experiments show that the sample-weighted CTFL generally performs best among them. The proposed losses are shown to have the same complexity as CTC and the following benefits: eliminating the accuracy degradation, better performance when trained on imbalanced data, and faster and better convergence in some cases.
Apart from the reweighting method in this paper, the reinterpretation of CTC may be useful in other contexts, which is worth exploring in the future.
References
Alex et al. (2009) Alex, G., Marcus, L., Santiago, F., Roman, B., Horst, B., and Jürgen, S. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 31(5):855–868, 2009.
Borisyuk et al. (2018) Borisyuk, F., Gordo, A., and Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79, 2018.
Feng et al. (2019) Feng, X., Yao, H., and Zhang, S. Focal CTC loss for Chinese optical character recognition on unbalanced datasets. Complexity, 2019:11, 2019. doi: 10.1155/2019/9345861.
Graves & Gomez (2006) Graves, A. and Gomez, F. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pp. 369–376, 2006.
Graves & Jaitly (2014) Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1764–1772, Beijing, China, 22–24 Jun 2014. PMLR.
He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016a.
He et al. (2016b) He, P., Huang, W., Qiao, Y., Chen, C. L., and Tang, X. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence, pp. 3501–3508, 2016b.
Hu et al. (2018) Hu, L., Sheng, J., and Changshui, Z. Connectionist temporal classification with maximum entropy regularization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
Huang et al. (2016) Huang, D.-A., Fei-Fei, L., and Niebles, J. C. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision (ECCV), pp. 137–153. Springer, 2016.
Jaderberg et al. (2014) Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. Eprint Arxiv, 2014.
Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., and Long, J. Caffe: Convolutional architecture for fast feature embedding. Eprint Arxiv, pp. 675–678, 2014.
Kai et al. (2012) Kai, W., Babenko, B., and Belongie, S. End-to-end scene text recognition. In IEEE International Conference on Computer Vision, 2012.
Karatzas et al. (2013) Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., Mas, J., Mota, D. F., Almazan, J. A., and Heras, L. P. D. L. ICDAR 2013 robust reading competition. In International Conference on Document Analysis & Recognition, 2013.
Kim et al. (2017) Kim, S., Hori, T., and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839, March 2017.
Li et al. (2019) Li, B., Liu, Y., and Wang, X. Gradient harmonized single-stage detector. AAAI, 2019.
Lin et al. (2017) Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):2999–3007, 2017.
Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37, 2016.
Lucas et al. (2003) Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., and Young, R. ICDAR 2003 robust reading competitions. Proc of the ICDAR, 7(23):105–122, 2003.
Miao et al. (2016) Miao, Y., Gowayyed, M., and Metze, F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Automatic Speech Recognition & Understanding, 2016.
Mishra et al. (2012) Mishra, A., Alahari, K., and Jawahar, C. V. Scene text recognition using higher order language priors. 2012.
Rabiner (1993) Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 1993.
Shi et al. (2017) Shi, B., Bai, X., and Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(11):2298, 2017.
Shrivastava et al. (2016) Shrivastava, A., Gupta, A., and Girshick, R. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769, 2016.
Appendix A Cross Entropy Form of CTC
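In the paper, we define a pseudo groundtruth $\mathbf{y}'$, where
(25) $y'^t_k = \frac{1}{p(\mathbf{l}|\mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\, \pi_t = k} p(\pi|\mathbf{x})$
to reinterpret the CTC loss as the sum of cross entropy losses. To this end, we need to prove that $\mathbf{y}'$ is a feasible solution of
(26) $\mathcal{L}_{ctc} = -\sum_{t=1}^{T} \sum_{k \in L'} y'^t_k \ln y^t_k$
That is, given the definition of $\mathbf{y}'$, Eq. (26) holds.
In the ideal situation, the CTC loss and the cross entropy loss are both 0, and the equality holds. Therefore, we only need to prove that their derivatives are equal.
Letting $a^t_k$ denote the unnormalized network outputs, we normalize them with the softmax activation,
(27) $y^t_k = \frac{e^{a^t_k}}{\sum_{k'} e^{a^t_{k'}}}$
It follows that
(28) $\frac{\partial y^t_{k'}}{\partial a^t_k} = y^t_{k'} (\delta_{kk'} - y^t_k)$
where $\delta_{kk'}$ equals 1 if $k = k'$ and 0 otherwise. Treating $\mathbf{y}'$ as a constant, the derivative of the cross-entropy form of CTC with respect to $y^t_k$ can be calculated as
(29) $\frac{\partial \mathcal{L}_{ce}}{\partial y^t_k} = -\frac{y'^t_k}{y^t_k}$
and its derivative with respect to $a^t_k$ can be calculated as
(30) $\frac{\partial \mathcal{L}_{ce}}{\partial a^t_k} = \sum_{k'} \frac{\partial \mathcal{L}_{ce}}{\partial y^t_{k'}} \frac{\partial y^t_{k'}}{\partial a^t_k} = -\sum_{k'} y'^t_{k'} (\delta_{kk'} - y^t_k) = y^t_k - y'^t_k$
where the last step uses $\sum_{k'} y'^t_{k'} = 1$. It is equal to the derivative of CTC given in the paper, so Eq. (26) holds.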
Appendix B Derivation Process of
The class-weighted CTC loss function is
(31)
where
(32)
The derivative of with respect to can be calculated as
(33)
and its derivative with respect to can be calculated as
(34)
Appendix C Derivation Process of
The sample-weighted CTC loss function is
(35)
where
(36)
Since the derivative of with respect to can be calculated as
(37)
and its derivative with respect to is
(38)
the derivative of with respect to is
(39)
Appendix D Derivation Process of
The classes-weighted CTFL is defined as
(40)
and its derivative with respect to is
(41)
and its derivative with respect to is
(42)
where
(43) 
Appendix E Derivation Process of
The sample-weighted CTFL is given as
(44)
and we find its derivative with respect to as