A Novel Re-weighting Method for Connectionist Temporal Classification

04/24/2019 ∙ by Hongzhu Li, et al. ∙ 0

Connectionist temporal classification (CTC) enables end-to-end sequence learning by maximizing the probability of correctly recognizing sequences during training. With an extra blank class, CTC implicitly converts recognizing a sequence into classifying each timestep within the sequence. However, the CTC loss is not intuitive for such a classification task, so the class imbalance within each sequence, caused by the overwhelming number of blank timesteps, is a knotty problem. In this paper, we define a piecewise function as the pseudo ground-truth to reinterpret the sequence-based CTC loss as a timestep-based cross entropy loss. The cross entropy form makes it easy to re-weight the CTC loss. Experiments on text recognition show that the weighted CTC loss addresses the class imbalance problem and facilitates convergence, generally leading to better results than the CTC loss. Besides, the reinterpretation of CTC, as a new perspective, may be useful in other situations.




1 Introduction

Connectionist temporal classification (CTC) (Graves & Gomez, 2006) is a commonly used method in sequence recognition tasks, including speech recognition (Graves & Jaitly, 2014; Miao et al., 2016; Kim et al., 2017), text recognition (Alex et al., 2009; He et al., 2016b; Shi et al., 2017; Borisyuk et al., 2018) and so on. With an extra blank class, the output at each timestep in the sequence indicates either a specific label or no label. The outputs over all timesteps constitute a sequence of labels and blanks, called a path. A path is transformed into a label sequence by removing the repeated labels and then the blanks in it, and different paths can correspond to the same label sequence. CTC-based training maximizes the probability of the correct labelling, which is calculated by summing the probabilities of all the corresponding paths.

Some previous works try to improve the CTC with regularization or re-weighting/re-sampling heuristics.

Hu et al. (2018) propose a maximum entropy based regularization for CTC (EnCTC) to enhance exploration during training and obtain models with better generalization. They also propose an entropy-based pruning algorithm (EsCTC) to rule out unreasonable paths. For weakly-supervised action labelling in video, Huang et al. (2016) introduce the Extended Connectionist Temporal Classification (ECTC) framework to enforce the consistency of all possible paths with frame-to-frame visual similarities. To solve the imbalance problem in sequence recognition, Feng et al. (2019) modify the traditional CTC by fusing focal loss with it, making the model attend to low-frequency samples at the training stage. All these works treat a sequence or a path as an example, and calculate the loss or perform the re-sampling on the basis of sequences or paths.

Figure 1: (Graves & Gomez, 2006) Evolution of the network output and CTC error signal during training. Lines with different colors denote different labels, and the dashed line is the blank class.

Different from them, we propose to treat each timestep in a sequence as an individual example, and regard the sequence recognition task as a classification task for each timestep. The classification of each timestep is similar to the classification branch in object detection (Liu et al., 2016), where the blank class corresponds to the background category and the labels correspond to the objects. In this case, CTC may suffer from a classic problem in object detection, the imbalance between background and object samples. As shown in Figure 1, the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks. This means that only a few timesteps are label samples, and the rest are all blank samples, which far outnumber the former. According to the error signals, the label samples are the hard examples during training, but become less hard as the network converges. By then, the network updating may be overwhelmed by the blanks.

In our experiments, we observe a phenomenon of accuracy degradation that supports this speculation. When a network is trained with the CTC loss and tested on the training set, the recognition accuracy starts to decrease after a certain number of iterations and becomes very unstable. But if batch normalization is performed within each mini-batch without using global statistics, the accuracy becomes reasonable, as illustrated in Figure 2. This shows that the network updating is unstable, so the mean and variance averaged over past iterations do not suit the current network weights, which is probably caused by the overwhelming blanks.

Figure 2: The accuracy degradation phenomenon during CTC-based training. The network configuration is given in Table 1(b). The training set is Synth100 and the test set is Train, both described in Section 3.1. The learning rate is 0.01 and the batch size is 32; other implementation details can be found in Section 3.2.

Therefore, common heuristics for object detection to solve the class imbalance, such as online hard example mining (OHEM) (Shrivastava et al., 2016), focal loss (Lin et al., 2017) and GHM (Li et al., 2019), can be introduced to improve the CTC.

In this paper, we propose a novel method to re-weight the CTC loss, offering both a theoretical basis and successful experimental experience. The main contributions are summarized as follows: (1) We reinterpret the CTC loss for sequence labelling as the cross entropy loss for a classification problem, providing a new perspective for modifying CTC. (2) To deal with the class imbalance, we propose several weighted CTC losses, and demonstrate their effectiveness by comparison experiments on scene text recognition. The proposed weighted CTC has several advantages over the original CTC, including (1) preventing the accuracy degradation phenomenon, (2) alleviating the negative effects caused by imbalanced training data, and (3) facilitating the convergence of models, which means better performance and shorter training time.

2 Method

2.1 Connectionist Temporal Classification

CTC (Graves & Gomez, 2006) is proposed for labelling sequence data within a single network architecture that does not need pre-segmentation or post-processing. The basic idea is to interpret the network outputs as a conditional probability distribution over all possible output label sequences. Given this distribution, an objective function can be derived that directly maximises the probabilities of the correct label sequences.

At each timestep, the network outputs a probability distribution over the label set $L' = L \cup \{\text{blank}\}$, where $L$ contains all the labels in the task and the extra blank represents 'no label'. The activation $y_k^t$ is interpreted as the probability of observing label $k$ at time $t$. Given the length-$T$ input sequence $\mathbf{x}$, we get the conditional probability of observing a particular path $\pi$ through the lattice of label observations:

$$p(\pi \mid \mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \quad \forall \pi \in L'^{T}, \tag{1}$$

where $\pi_t$ is the label observed at time $t$ along path $\pi$, and $L'^{T}$ is the set of length-$T$ paths over $L'$.

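Paths are mapped onto label sequences by an operation $\mathcal{B}$ that simply removes the repeated labels and then the blanks in a sequence. For a given label sequence $\mathbf{l}$, more than one path corresponds to it, e.g. $\mathcal{B}(a\text{-}ab\text{-}) = \mathcal{B}(\text{-}aa\text{-}\text{-}abb) = aab$, where '-' denotes the blank. We can evaluate the conditional probability of $\mathbf{l}$ as the sum of probabilities of all the corresponding paths:

$$p(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi \mid \mathbf{x}). \tag{2}$$

The CTC loss function is defined as the negative log probability of correctly labelling the sequence:

$$L_{CTC} = -\ln p(\mathbf{l} \mid \mathbf{x}). \tag{3}$$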


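During training, to backpropagate the gradient through the output layer, we need the derivatives of the loss function with respect to the outputs $a_k^t$ before the activation function is applied. For the softmax activation function

$$y_k^t = \frac{e^{a_k^t}}{\sum_{k'} e^{a_{k'}^t}}, \tag{4}$$

where $k'$ ranges over $L'$, the derivative with respect to $a_k^t$ is

$$\frac{\partial L_{CTC}}{\partial a_k^t} = y_k^t - \frac{1}{p(\mathbf{l} \mid \mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\ \pi_t = k} p(\pi \mid \mathbf{x}), \tag{5}$$

where the sum in the second term is the sum of probabilities of all the paths corresponding to $\mathbf{l}$ that go through the label $k$ at time $t$.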
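Equations (1) to (3) can be checked on a tiny example by enumerating every path. The following is a minimal sketch, not the authors' implementation: the function names and the choice of index 0 for the blank are assumptions made for illustration, and the enumeration is exponential in the sequence length, so it is only usable for toy inputs.

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the blank class


def collapse(path, blank=BLANK):
    """Map a path to a label sequence: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return tuple(out)


def ctc_prob_bruteforce(y, labels, blank=BLANK):
    """Sum the path probabilities over every path that collapses to `labels`.

    y has shape (T, K); row t is the softmax output at timestep t.
    """
    T, K = y.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(labels):
            # Path probability: product of the chosen class probability per timestep.
            total += float(np.prod(y[np.arange(T), list(path)]))
    return total


# Three timesteps, two classes (blank and one label), uniform outputs.
y = np.full((3, 2), 0.5)
p = ctc_prob_bruteforce(y, [1])
loss = -np.log(p)
print(p, loss)
```

In this toy case six of the eight possible paths collapse to the single-label sequence, so the probability is 6/8.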

When the network is used for prediction, the predictions over all timesteps are converted into a label sequence. Since the computational complexity grows exponentially with the length of the path, it is not practical to find the most probable label sequence $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l} \mid \mathbf{x})$ exactly. There are many approximate alternatives, and best path decoding is one of the most commonly used methods. It assumes that the most probable path $\pi^*$ corresponds to the most probable label sequence:

$$\mathbf{l}^* \approx \mathcal{B}(\pi^*), \quad \pi^* = \arg\max_{\pi \in L'^{T}} p(\pi \mid \mathbf{x}).$$

It is not guaranteed to find the most probable label sequence, but the solution is good enough in most cases and the computation is trivial.
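Best path decoding can be sketched in a few lines: take the most probable class at each timestep, merge repeats, and remove blanks. The function name and the blank index 0 are assumptions for illustration.

```python
import numpy as np


def best_path_decode(y, blank=0):
    """Greedy CTC decoding for a (T, K) array of per-timestep probabilities."""
    path = np.argmax(y, axis=1)  # most probable class at each timestep
    decoded = []
    prev = None
    for k in path:
        k = int(k)
        # Keep a label only when it starts a new run and is not the blank.
        if k != prev and k != blank:
            decoded.append(k)
        prev = k
    return decoded


y = np.array([
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.20, 0.10, 0.70],
])
print(best_path_decode(y))  # [1, 1, 2]
```

The blank at the third timestep is what allows the label 1 to be emitted twice in a row.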

2.2 Cross Entropy

The cross entropy (CE) is used to estimate the distance between two probability distributions. Given the ground-truth $\hat{y}$ and the network outputs $y$, the cross entropy loss is defined as

$$L_{CE} = -\sum_{k} \hat{y}_k \ln y_k,$$

where $k$ ranges over all the classes, and $y_k$ and $\hat{y}_k$ are the model's estimated and ground-truth probabilities for class $k$ respectively. Let $a_k$ be the model's outputs before the softmax activation function is applied. The derivative of the loss function with respect to $a_k$ is then

$$\frac{\partial L_{CE}}{\partial a_k} = y_k - \hat{y}_k.$$
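The derivative of the softmax cross entropy with respect to the pre-activation outputs is the estimated distribution minus the ground-truth distribution, provided the ground-truth sums to one. A small numerical check, with illustrative values and helper names that are not from the paper:

```python
import numpy as np


def softmax(a):
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / e.sum()


def cross_entropy(a, y_hat):
    """Cross entropy between ground-truth y_hat and softmax(a)."""
    return -np.sum(y_hat * np.log(softmax(a)))


def numerical_grad(f, a, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(a)
    for i in range(a.size):
        d = np.zeros_like(a)
        d[i] = eps
        g[i] = (f(a + d) - f(a - d)) / (2 * eps)
    return g


a = np.array([0.5, -1.0, 2.0])      # pre-softmax outputs
y_hat = np.array([0.2, 0.1, 0.7])   # a soft ground-truth distribution
analytic = softmax(a) - y_hat
numeric = numerical_grad(lambda v: cross_entropy(v, y_hat), a)
print(np.max(np.abs(analytic - numeric)))
```

A soft ground-truth is used on purpose, since the pseudo ground-truth introduced later is not one-hot either.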


2.3 Focal Loss

The focal loss (Lin et al., 2017) is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training. Let $y \in \{\pm 1\}$ specify the ground-truth class for binary classification, and $p \in [0, 1]$ denote the model's estimated probability for the class with label $y = 1$. For notational convenience, define $p_t$ as

$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise.} \end{cases}$$

The main idea of focal loss is to reshape the loss function to down-weight easy examples and thus focus training on hard negatives. On the basis of the cross entropy loss

$$CE(p_t) = -\log(p_t),$$

a modulating factor $(1 - p_t)^{\gamma}$ is added, with tunable focusing parameter $\gamma \ge 0$. The focal loss is defined as

$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t).$$

To address class imbalance, a common method is to introduce a weighting factor $\alpha \in [0, 1]$ for class 1 and $1 - \alpha$ for class -1. For notational convenience, $\alpha_t$ is defined analogously to $p_t$. The $\alpha$-balanced variant of the focal loss is defined as

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t).$$
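The binary focal loss of Lin et al. (2017) and its balanced variant can be written compactly as below. This is a sketch for illustration; the function name and default values are assumptions.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Binary focal loss per example.

    p: estimated probability of the class with label 1.
    y: ground-truth labels in {+1, -1}.
    alpha: optional weighting factor for class 1 (1 - alpha for class -1).
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if alpha is not None:
        alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
        loss = alpha_t * loss
    return loss


# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
print(focal_loss([0.9, 0.1], [1, 1], gamma=2.0))
```

With a focusing parameter of 0 and no class weighting, the function reduces to the ordinary cross entropy.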


2.4 Cross Entropy Form of CTC

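Treating sequence recognition as the classification of each timestep, we rewrite the CTC loss in the form of a cross entropy loss. Given an input sequence $\mathbf{x}$ and its ground-truth label sequence $\mathbf{l}$, the network outputs probability distributions over the $T$ timesteps of the sequence. We define $y^t$ as the predicted probability distribution for the sample of timestep $t$, and assume there is a corresponding ground-truth probability distribution $\hat{y}^t$. The cross entropy loss of correctly labelling the sequence should be

$$L_{CE} = -\sum_{t=1}^{T} \sum_{k \in L'} \hat{y}_k^t \ln y_k^t.$$

A feasible solution for $\hat{y}^t$ can be found by following the conditions below:

$$\hat{y}_k^t = \frac{1}{p(\mathbf{l} \mid \mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\ \pi_t = k} p(\pi \mid \mathbf{x}), \qquad \frac{\partial \hat{y}_k^t}{\partial y_{k'}^{t'}} = 0.$$

We can get the derivative of $L_{CE}$ with respect to $a_k^t$

$$\frac{\partial L_{CE}}{\partial a_k^t} = y_k^t - \hat{y}_k^t = y_k^t - \frac{1}{p(\mathbf{l} \mid \mathbf{x})} \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l}),\ \pi_t = k} p(\pi \mid \mathbf{x}),$$

which equals the derivative of the CTC loss function in Equation (5).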
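The equality of the two derivatives can be verified numerically on a toy sequence. The sketch below assumes that the pseudo ground-truth at each timestep is the share of the total sequence probability carried by paths passing through each class, as the second term of Equation (5) suggests; it computes that quantity by brute-force enumeration and compares it with a finite-difference gradient of the CTC loss. All names are illustrative, and index 0 is assumed to be the blank.

```python
import itertools
import numpy as np


def collapse(path, blank=0):
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return tuple(out)


def softmax_rows(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def ctc_terms(y, labels, blank=0):
    """Return p(l|x) and g, where g[t, k] sums the probabilities of all
    valid paths that pass through class k at timestep t."""
    T, K = y.shape
    g = np.zeros((T, K))
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(labels):
            p_path = float(np.prod(y[np.arange(T), list(path)]))
            for t, k in enumerate(path):
                g[t, k] += p_path
    return g[0].sum(), g  # every valid path crosses timestep 0 exactly once


def ctc_loss(a, labels):
    p, _ = ctc_terms(softmax_rows(a), labels)
    return -np.log(p)


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))   # pre-softmax outputs: 4 timesteps, 3 classes
labels = [1, 2]
y = softmax_rows(a)
p, g = ctc_terms(y, labels)
y_hat = g / p                 # pseudo ground-truth, treated as a constant
analytic = y - y_hat          # gradient of the cross entropy form

numeric = np.zeros_like(a)
eps = 1e-6
for idx in np.ndindex(*a.shape):
    d = np.zeros_like(a)
    d[idx] = eps
    numeric[idx] = (ctc_loss(a + d, labels) - ctc_loss(a - d, labels)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Each row of the pseudo ground-truth sums to one, so it is a valid probability distribution over classes at every timestep.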

According to Equations (1) and (2), the pseudo ground-truth $\hat{y}$ is calculated on the basis of the network outputs $y$. It may therefore seem unreasonable that the derivative of $\hat{y}$ with respect to $y$ equals zero when $\hat{y}$ depends on $y$. We argue that the definition of $\hat{y}$ is valid, and elaborate on it as follows.

At first, we define an intermediate variable , where


also denoted by for the expression convenience. In this function, the is a -dimensional independent variable, whose value is different for each sequence in each mini-batch. Over the entire training phase, has finite discrete values. Indexing the values of by , we define the ground-truth as a piecewise function


where stands for neighborhood. It’s easy to know


Omitting and substituting and into the above equation, we can get Equation (25).

Some examples of the predicted distributions $y$ and the corresponding pseudo ground-truths $\hat{y}$ are illustrated in Figure 3 to give an intuitive picture.

Figure 3: Examples of predicted probability distributions $y$ and the corresponding ground-truths $\hat{y}$. The same sequence is shown at different iterations. Lines with different colors denote different labels, and the dashed line is the blank class.

2.5 Weighted CTC Loss

Given the cross entropy loss function in Equation (26), we can apply weighting methods for classification tasks to improve CTC. We should notice that, compared with the ground-truth in a general classification task, $\hat{y}$ does not follow the probability distribution where one of the classes has a probability of 1 and the other classes have a probability of 0, so there is no ground-truth class. We adopt the weighting method in two different ways to accommodate this situation. One way is to assign weights to different classes, which is called class weighting. The other way is sample weighting, where a weight is calculated for each sample.

First, we introduce weighting factors to balance the label and blank samples. The class-weighted CTC loss function is defined as




is the weighting factor for class , and is a tunable parameter. The sample-weighted CTC loss function is


where is the weighting factor for sample defined as


It is easy to see that when $\alpha$ is taken as 0.5, the above two weighted CTC losses are equivalent to the CTC loss.

We introduce focal loss to CTC, naming it connectionist temporal focal loss (CTFL), to focus the training process on hard samples. Extending focal loss from binary classification to the multi-class case is straightforward. Defining $p_t$ as the estimated probability for the ground-truth class, $(1 - p_t)^{\gamma}$ is used to down-weight easy samples, where $\gamma$ is a tunable focusing parameter. But as analyzed before, there is no ground-truth class for $\hat{y}$. In this case, we extend the sample weights of focal loss to the class-weight form $|\hat{y}_k - y_k|^{\gamma}$, where $|\hat{y}_k - y_k|$ denotes the distance between the estimated and ground-truth probabilities for class $k$. For class weighting, we use the distances as the class weights of each sample, and define the class-weighted CTFL as


As for sample weighting, each sample weight is calculated by summing the distances over all classes. The sample-weighted CTFL is given as


It is easy to see that when the value of $\gamma$ is 0, CTFL degenerates into the CTC loss.
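The class-weighted variants can be sketched on top of the cross entropy form. The exact weighting formulas are not reproduced here, so the code rests on two loudly stated assumptions: the class weight is a tunable value for label classes and one minus that value for the blank, and the focal class weight is the absolute difference between ground-truth and estimate raised to the focusing parameter. The function names are hypothetical, and the inputs are a predicted distribution and a pseudo ground-truth of shape (T, K) with the blank at index 0.

```python
import numpy as np


def ce_form(y, y_hat):
    """Unweighted cross entropy form of the loss, summed over timesteps and classes."""
    return -np.sum(y_hat * np.log(y))


def class_weighted_alpha(y, y_hat, alpha, blank=0):
    """ASSUMED form: weight alpha for label classes, 1 - alpha for the blank."""
    w = np.full(y.shape[1], alpha)
    w[blank] = 1.0 - alpha
    return -np.sum(w * y_hat * np.log(y))


def class_weighted_ctfl(y, y_hat, gamma):
    """ASSUMED form: per-class focal weight |y_hat - y| ** gamma."""
    w = np.abs(y_hat - y) ** gamma
    return -np.sum(w * y_hat * np.log(y))


y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y_hat = np.array([[0.5, 0.5, 0.0],
                  [0.0, 1.0, 0.0]])

print(ce_form(y, y_hat))
print(class_weighted_alpha(y, y_hat, alpha=0.5))  # half of the unweighted loss
print(class_weighted_ctfl(y, y_hat, gamma=0.0))   # equals the unweighted loss
```

Under these assumptions, a class weight of 0.5 rescales the unweighted loss by a constant factor, and a focusing parameter of 0 recovers it exactly, which matches the two degenerate cases described in the text.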

See the appendix for the loss derivatives and formula derivation processes.

3 Experiments

To evaluate the effects of the weighted CTC losses, we compare them with the CTC loss in terms of the convergence process and the recognition performance of the models. For all the experiments, accuracy refers to sequence accuracy, i.e. the percentage of test images correctly recognized. Although different losses are adopted for training, the loss value we report is always the CTC loss, since it indicates the probability of correctly recognizing a sequence according to Equation (3).

3.1 Datasets

For all the following experiments, we use the synthetic dataset released by Jaderberg et al. (2014) as the training data. The dataset consists of 8 million word images and their corresponding ground-truth words. All the images are generated by a synthetic data engine using a 90k word dictionary, and are of different sizes. For training efficiency, we construct a training set, Synth, as follows. First, all word images are scaled to height 32 without changing their aspect ratios. If the scaled width is larger than 256, we further rescale the image to width 256. If there is enough room for the next scaled image, we append it to this image after 20 columns of zeros. Otherwise, we pad the scaled image to width 256 with zeros. In addition, we construct another training set, Synth100, which is more balanced between different classes. It is the same as Synth except that the word images containing digits are 100 times more numerous. The number of characters of each class in Synth and Synth100 is displayed in Figure 4.
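The packing rule can be sketched as follows. This is an illustration under stated assumptions rather than the authors' preprocessing code: images are assumed to be already scaled to height 32 and width at most 256, packing is assumed to be greedy in input order, and the bookkeeping of the concatenated ground-truth words is omitted. The function and constant names are hypothetical.

```python
import numpy as np

HEIGHT, WIDTH, GAP = 32, 256, 20  # canvas size and separator width from the text


def pack_images(images):
    """Greedily pack pre-scaled images into zero-padded HEIGHT x WIDTH canvases."""
    canvases = []
    canvas = np.zeros((HEIGHT, WIDTH), dtype=np.float32)
    cursor = 0  # first free column of the current canvas
    for img in images:
        w = img.shape[1]
        assert img.shape[0] == HEIGHT and w <= WIDTH
        # Leave a gap of zero columns before every image except the first.
        start = cursor + GAP if cursor > 0 else 0
        if start + w > WIDTH:
            # Not enough room: close this canvas (already zero padded).
            canvases.append(canvas)
            canvas = np.zeros((HEIGHT, WIDTH), dtype=np.float32)
            start = 0
        canvas[:, start:start + w] = img
        cursor = start + w
    if cursor > 0:
        canvases.append(canvas)
    return canvases


words = [np.ones((HEIGHT, 100), dtype=np.float32) for _ in range(3)]
packed = pack_images(words)
print(len(packed), packed[0].shape)
```

With three images of width 100, the first two share a canvas (100 + 20 + 100 columns) and the third starts a new one.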

Figure 4: The number of characters for each class in Synth and Synth100. The word images containing digits in Synth100 are 100 times more than those in Synth.

Four popular benchmarks for scene text recognition are used for model performance evaluation, namely IIIT5k-word (IIIT5k), Street View Text (SVT), ICDAR 2003 (IC03) and ICDAR 2013 (IC13). IIIT5k (Mishra et al., 2012) contains 3,000 cropped word images collected from the Internet. SVT (Kai et al., 2012) contains 647 word images cropped from 249 street-view images collected from Google Street View. IC03 (Lucas et al., 2003) contains 251 scene images; we discard words that either contain non-alphanumeric characters or have fewer than three characters, and get 860 cropped word images. IC13 (Karatzas et al., 2013) contains 1,095 word images in total; we discard words that contain non-alphanumeric characters, and get 1,015 word images with ground-truths. In addition, we construct a subset, Train, with the first 64,000 images taken from the training set to evaluate the model performance on the training data.

3.2 Implementation Details

There are two networks used in our experiments, one is CRNN (Shi et al., 2017), the other is a CNN replacing the BLSTM layers in CRNN with residual blocks (He et al., 2016a). The network configurations are summarized in Table 1.

Type Configuration
Conv c64,k3x3,p1x1
MP k2x2,s2x2
Conv c128,k3x3,p1x1
MP k2x2,s2x2
Conv c256,k3x3,p1x1,bn
Conv c256,k3x3,p1x1
MP k2x2,s1x2,p1x0
Conv c512,k3x3,p1x1,bn
Conv c512,k3x3,p1x1
MP k2x2,s1x2,p1x0
Conv c512,k2x2,bn
BLSTM c256
BLSTM c256
Output c37,softmax
Loss -
(a) CRNN
Type Configuration
Conv c64,k3x3,p1x1
MP k2x2,s2x2
Conv c128,k3x3,p1x1
MP k2x2,s2x2
Conv c256,k3x3,p1x1,bn
Conv c256,k3x3,p1x1
MP k2x2,s1x2,p1x0
Conv c512,k3x3,p1x1,bn
Conv c512,k3x3,p1x1
MP k2x2,s1x2,p1x0
Conv c512,k2x2
ResUnit c128,k5x1,p2x0
ResUnit c128,k5x1,p2x0,bn
Output c37,softmax
Loss -
(b) CNN
Table 1: Network configurations. ‘Conv’ is short for convolutional layer, and ‘MP’ is short for max pooling layer. ‘c’ stands for channels, which denotes the number of feature maps for a convolutional layer, the number of hidden units for a BLSTM layer, and the bottleneck channels for a residual unit. ‘k’, ‘s’, ‘p’ stand for kernel, stride and padding sizes respectively. ‘bn’ stands for batch normalization, and ‘softmax’ stands for the softmax activation function. The residual unit used here is the full pre-activation version proposed in (He et al., 2016a).

Unless otherwise stated, we use the CNN as the default network and Synth100 as the default training set in our experiments. We implement the network architecture within the Caffe (Jia et al., 2014) framework, with a custom implementation of the loss layer. Networks are trained with stochastic gradient descent (SGD). The weight decay is 0.0005, and the momentum is 0.9. The initial learning rate is 0.01, and it is decreased by a factor of 0.1 after every fixed number of iterations, denoted as the learning rate step. Three different training strategies are used in our experiments:

  • bs32-400k: batch size = 32, learning rate step = 100,000, max iterations = 400,000;
  • bs32-800k: batch size = 32, learning rate step = 200,000, max iterations = 800,000;
  • bs256-40k: batch size = 256, learning rate step = 20,000, max iterations = 40,000.
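The step schedule implied by these settings can be written as a small helper. The dictionary layout and function name are illustrative assumptions; the numbers are the ones listed above.

```python
# Training strategies as listed in the text.
STRATEGIES = {
    "bs32-400k": {"batch_size": 32, "lr_step": 100_000, "max_iter": 400_000},
    "bs32-800k": {"batch_size": 32, "lr_step": 200_000, "max_iter": 800_000},
    "bs256-40k": {"batch_size": 256, "lr_step": 20_000, "max_iter": 40_000},
}


def learning_rate(iteration, lr_step, base_lr=0.01, factor=0.1):
    """Step decay: multiply the rate by `factor` after every `lr_step` iterations."""
    return base_lr * factor ** (iteration // lr_step)


cfg = STRATEGIES["bs32-400k"]
for it in (0, 99_999, 100_000, 250_000, 399_999):
    print(it, learning_rate(it, cfg["lr_step"]))
```

Under bs32-400k the rate is therefore reduced three times before training stops, while under bs256-40k it is reduced only once.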

For all the experiments, we get the recognition results by lexicon-free best path decoding (Graves & Gomez, 2006).

3.3 Complexity Analysis

We propose four weighted forms of the CTC loss, and compare the algorithm complexities of them to CTC. According to their loss function derivatives, the gradient updating procedures are performed based on and , which are also calculated for the CTC loss. First, a softmax activation is applied to get the normalized network outputs , whose time complexity is for the cpu implementation and for the parallel gpu implementation. Then a dynamic-programming algorithm similar to the forward-backward algorithm for HMMs (Rabiner, 1993) is performed to calculate , whose time complexity is for both cpu and gpu implementations. For the CTC loss, there is the final subtraction, whose time complexity is for cpu and for gpu. Meanwhile, the weighted losses need additional calculations of the weights. We obtain the time complexity of CTC by summing up the above terms, and list the time complexity of the additional calculation for each weighted loss in Table 2. It’s obvious that the additional calculation doesn’t change the time complexity, so the amount of additional calculation in the weighted loss is acceptable. This is also validated by experiments, where the changes of training time are negligible.

Method Complexity-cpu Complexity-gpu Trn. Time
Table 2: The additional time complexity and training time for each weighted loss, compared with the CTC loss. Trn. Time denotes training time spent for 100,000 iterations.

The space complexity of CTC is . Since the algorithm makes the most of the original space and keep down the additional space complexity of a weighted CTC to , which doesn’t change the original space complexity.

In short, the proposed weighted CTC losses have the same time and space complexity as the CTC loss.
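The dynamic-programming step mentioned above can be illustrated with the forward recursion of Graves & Gomez (2006), which computes the sequence probability without enumerating paths. This is a plain NumPy sketch without log-space scaling, not the Caffe loss layer used in the experiments; the function name and the blank index 0 are assumptions.

```python
import numpy as np


def ctc_forward(y, labels, blank=0):
    """Probability of `labels` given per-timestep outputs y of shape (T, K)."""
    T = y.shape[0]
    # Extended sequence: blanks before, between and after the labels.
    ext = [blank]
    for k in labels:
        ext += [k, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * y[t, ext[s]]

    # Valid paths end in the last label or the trailing blank.
    p = alpha[T - 1, S - 1]
    if S > 1:
        p += alpha[T - 1, S - 2]
    return p


y = np.full((3, 2), 0.5)
print(ctc_forward(y, [1]))  # 0.75
```

The two nested loops visit each timestep and each position of the extended label sequence once, so the cost grows linearly in both, in contrast to the exponential cost of enumerating paths.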

3.4 Overall Comparison

We train a group of models with each weighted CTC loss under various hyper-parameters. The training strategy is bs32-400k. The convergence processes of the models are shown in Figure 5, and the recognition performances are illustrated in Figure 6. Note that when $\alpha$ is taken as 0.5 and $\gamma$ is taken as 0, a weighted CTC loss becomes the CTC loss.

Figure 5: The convergence processes of models trained with the weighted CTC losses under variable hyper-parameters. The thinner lines indicate the CTC loss, and the wider lines denote the recognition accuracies on the training set.

Figure 6: Recognition accuracies on the four test sets and the training set. Each subfigure corresponds to the performances of models trained with one weighted CTC loss under various hyper-parameters.

The parameter in and is used to adjust the ratio of influences coming from label and samples. Figure 5(a) shows the effect of is not ideal. Figure 5(b) suggests that the accuracy degradation is caused by excess samples, which is consistent with our class imbalance speculation, for the degradation is more obvious under a lower . can prevent the accuracy degradation by focusing more attention on label samples, but it brings no obvious improvement for the recognition performance, as shown in Figure 6(b). Based on our experience, benefits the model performance in some situations, but the improvements are minor compared with CTFL. So we ignore the -weighting for the rest of this paper, and focus on discussing the effects of CTFL.

According to Figure 5(c) and 5(d), CTFL can prevent the accuracy degradation, as well as improve the recognition performance on the training set. performs slightly better than on recognition accuracy. Moreover, compared with , has no negative impact on the CTC loss, i.e. the probability of correctly recognizing a sequence. It means that will not damage the recognition performance when the decoding method is based on the probability. Figure 6(c) and 6(d) suggest for and for . These values are adopted for the rest of our experiments. However it is possible that the optimal values of the parameters are different for different tasks.

3.5 Deal with Class Imbalance

To investigate the effectiveness of CTFL for imbalanced classes, we train a set of models with the CTC loss and CTFL on Synth and Synth100 respectively. The training strategy is bs32-400k. The recognition performances on four test sets are presented in Table 3. The results are also illustrated in Figure 7 to provide a direct perception.

Trn. Set          Method  IIIT5k  SVT   IC03  IC13
Digits Ratio (%)  -       9.2     0     3.8   8.4
Synth             CTC     72.9    75.6  85.6  78.7
Synth             CTFL    78.6    74.5  89.1  84.7
Synth             CTFL    80.4    75.6  90.3  84.7
Synth100          CTC     80.0    75.6  88.7  86.3
Synth100          CTFL    80.0    76.2  89.7  85.4
Synth100          CTFL    81.0    76.8  90.7  85.6
Table 3: Recognition accuracies (%) on four English scene text datasets. Digits Ratio (%) indicates the percentage of word examples containing digits in the dataset. Synth and Synth100 in the first column denote the training sets for the network; CTC and CTFL in the second column denote the loss functions used during training.

Figure 7: The illustration of recognition performances in Table 3.

Comparing the models trained with the CTC loss, there are large margins between the accuracies on IIIT5k, IC03 and IC13 of models trained on Synth and Synth100. Each accuracy margin is consistent with the digits ratio of the corresponding test set according to Table 3. Therefore, we speculate that the model trained with CTC on Synth cannot correctly recognize digits, and that this is caused by the severe class imbalance in Synth shown in Figure 4.

CTFL focuses the network on hard samples during training, and thus improves the network’s recognition ability for disadvantaged classes, i.e. digits in this case. Comparing the models trained on Synth, the models trained with CTFL achieve much better performance than the model trained with CTC, and their performance is nearly as good as that of models trained on the more balanced training set Synth100.

3.6 Facilitate Convergence

The convergence process of a model is influenced by the training settings, including batch size, learning rate, network architecture, etc. To evaluate the effect of CTFL on convergence for different training settings, in addition to the experiments above, we conduct another two groups of comparison experiments. The training settings and model performances for the three groups of experiments are shown in Table 4.

Trn. St.         Method     IIIT5k  SVT   IC03  IC13  Train
CNN bs32-400k    CTC        80.0    75.6  88.7  86.3  88.3
CNN bs32-400k    CTFL       80.0    76.2  89.7  85.4  88.7
CNN bs32-400k    CTFL       81.0    76.8  90.7  85.6  89.8
CNN bs256-40k    CTC        77.2    71.4  86.9  84.1  85.5
CNN bs256-40k    CTFL       78.7    72.6  87.7  84.5  86.6
CNN bs256-40k    CTFL       78.8    75.1  88.6  85.2  87.9
CRNN bs32-400k   CTC        76.4    69.1  87.2  82.8  84.0
CRNN bs32-400k   CTC(800k)  77.1    72.3  87.3  82.8  86.3
CRNN bs32-400k   CTFL       76.5    74.8  89.7  85.1  88.3
CRNN bs32-400k   CTFL       78.3    73.7  90.9  84.8  89.5
Table 4: The performances of models trained with different loss functions under three groups of training settings. Trn. St. is short for training settings, which include the network configuration and the training strategy. Note that CTC(800k) denotes the model trained with the CTC loss under strategy bs32-800k.

Comparing models trained with the CTC loss over different training settings, the CNN and strategy bs32-400k lead to the best model performance. The convergence process is illustrated in Figure 5 with . Accuracy degradation appears early during training, but as the learning rate decreases, the degradation gradually disappears and does not affect the final model performance. In this situation, the CTFL prevents the accuracy degradation and stabilizes the convergence process. But its effect of facilitating the convergence and improving the performance is not so obvious.

Compared with training strategy bs32-400k, the strategy bs256-40k leads to under-fitted models. According to Table 4, the accuracies of models trained with the CTC loss drop by about 2-4% due to the different training strategy. The convergence processes under this strategy are illustrated in Figure 8. Compared with CTC-based training, CTFL greatly facilitates the convergence and leads to much better model performance. We think that the advantages of CTFL over CTC are more obvious when CTC-based training suffers from under-fitting. This opinion is also supported by the third group of experiments, where the under-fitting is caused by the network structure, since recurrent structures are usually difficult to train. As shown in Figure 9, the models trained with CTFL easily outperform the model trained with CTC in convergence, even when the latter is trained for twice as long.

Figure 8: The convergence processes of models of CNN trained under strategy bs256-40k.

Figure 9: The convergence processes of models of CRNN trained under strategy bs32-400k or bs32-800k.

On the one hand, it usually takes a lot of effort to find proper training settings. CTFL facilitates the convergence, and thus ensures reasonable model performance across various training settings. On the other hand, under the same training settings, models trained with CTFL achieve better or similar performance within fewer training iterations in our experiments.

3.7 Comparison With CRNN

The convolutional recurrent neural network (CRNN) (Shi et al., 2017) is one of the state-of-the-art approaches in CTC-based text recognition, and it is adopted in our experiments. The results of CRNN in Table 4 fall behind the results reported in (Shi et al., 2017), because the training sets and test sets are different. Although the training and test sets are built from the same datasets, the details of the building process can differ, which affects the experimental results. As shown in Table 5, we test the trained model released by Shi et al. (2017) (downloaded from their code webpage, and supposed to perform similarly to the reported results) on our test sets, and get somewhat inferior results.

Method                                        IIIT5k  SVT   IC03  IC13
Reported (Shi et al., 2017)                   81.2    82.7  91.9  89.6
Results on Our Test Sets (Shi et al., 2017)   80.3    81.6  90.0  86.2
CTC                                           80.2    79.9  90.9  86.6
CTFL                                          82.0    81.8  91.5  88.6
Table 5: The results of models with the CRNN architecture. Our models are trained with SGD, batch size 64, for 600k iterations on rescaled images and then 200k iterations on variable-size images.

To ensure a fair comparison, we compare results obtained on our test sets. Besides, we adopt the same training set and rescaling strategy as described in (Shi et al., 2017), despite slightly different training settings. According to Table 5, the CTC model gets results similar to those of the released CRNN model, and the model trained with CTFL achieves better results.

4 Conclusion

In this paper, we provide a new perspective for modifying the CTC method. The basic idea is to treat each timestep in the sequence, instead of the whole sequence, as a sample. Accordingly, we reinterpret the CTC loss for sequences as the cross entropy loss for timesteps through a pseudo ground-truth. The cross entropy form makes it possible to combine the CTC loss with re-weighting strategies.

We introduce label/blank weighting and focal loss to CTC and get four weighted CTC losses. The experiments show that the sample-weighted CTFL generally performs best among them. The proposed losses are shown to have the same complexity as CTC and the following benefits: eliminating the accuracy degradation, better performance when trained on imbalanced data, and faster and better convergence in some cases.

Apart from the re-weighting method in this paper, the reinterpretation of CTC may be useful in other contexts, which is worth exploring in the future.


  • Alex et al. (2009) Alex, G., Marcus, L., Santiago, F., Roman, B., Horst, B., and Jürgen, S. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 31(5):855–868, 2009.
  • Borisyuk et al. (2018) Borisyuk, F., Gordo, A., and Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79, 07 2018.
  • Feng et al. (2019) Feng, X., Yao, H., and Zhang, S. Focal CTC Loss for Chinese Optical Character Recognition on Unbalanced Datasets. Complexity, 2019:11, 2019. doi: 10.1155/2019/9345861.
  • Graves & Gomez (2006) Graves, A. and Gomez, F. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pp. 369–376, 2006.
  • Graves & Jaitly (2014) Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1764–1772, Bejing, China, 22–24 Jun 2014. PMLR.
  • He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016a.
  • He et al. (2016b) He, P., Huang, W., Qiao, Y., Chen, C. L., and Tang, X. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence, pp. 3501–3508, 2016b.
  • Hu et al. (2018) Hu, L., Sheng, J., and Changshui, Z. Connectionist temporal classification with maximum entropy regularization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
  • Huang et al. (2016) Huang, D.-A., Fei-Fei, L., and Niebles, J. C. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision (ECCV), pp. 137–153. Springer, 2016.
  • Jaderberg et al. (2014) Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. Eprint Arxiv, 2014.
  • Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., and Long, J. Caffe: Convolutional architecture for fast feature embedding. Eprint Arxiv, pp. 675–678, 2014.
  • Kai et al. (2012) Kai, W., Babenko, B., and Belongie, S. End-to-end scene text recognition. In IEEE International Conference on Computer Vision, 2012.
  • Karatzas et al. (2013) Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., Mas, J., Mota, D. F., Almazan, J. A., and Heras, L. P. D. L. Icdar 2013 robust reading competition. In International Conference on Document Analysis & Recognition, 2013.
  • Kim et al. (2017) Kim, S., Hori, T., and Watanabe, S. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839, March 2017.
  • Li et al. (2019) Li, B., Liu, Y., and Wang, X. Gradient Harmonized Single-stage Detector. AAAI, 2019.
  • Lin et al. (2017) Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):2999–3007, 2017.
  • Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37, 2016.
  • Lucas et al. (2003) Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., and Young, R. Icdar 2003 robust reading competitions. Proc of the Icdar, 7(2-3):105–122, 2003.
  • Miao et al. (2016) Miao, Y., Gowayyed, M., and Metze, F. Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding. In Automatic Speech Recognition & Understanding, 2016.
  • Mishra et al. (2012) Mishra, A., Alahari, K., and V. Jawahar, C. Scene text recognition using higher order language priors. 09 2012.
  • Rabiner (1993) Rabiner, L. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of The IEEE - PIEEE, 77, 01 1993.
  • Shi et al. (2017) Shi, B., Bai, X., and Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(11):2298, 2017.
  • Shrivastava et al. (2016) Shrivastava, A., Gupta, A., and Girshick, R. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769, 2016.

Appendix A Cross Entropy Form of CTC

In the paper, we define a pseudo ground-truth , where


to reinterpret the CTC loss as the sum of cross entropy losses. To this end, we need to prove that is a feasible solution of


It means given the definition of , Equ. (26) holds.

In the ideal case, the CTC loss and the cross-entropy loss are both 0, so the equality holds there. Therefore, we only need to prove that their derivatives are equal.

Letting denote the unnormalized network outputs, we normalize them with the softmax activation,


It is easy to show that


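For reference, these two steps can be sketched in the notation of Graves & Gomez (2006). The symbols are assumed here rather than recovered from the source: $u_k^t$ is the unnormalized output for class $k$ at timestep $t$, $y_k^t$ is its softmax-normalized value, and $\delta_{kk'}$ is the Kronecker delta.

```latex
y_k^t = \frac{e^{u_k^t}}{\sum_{k'} e^{u_{k'}^t}},
\qquad
\frac{\partial y_{k'}^t}{\partial u_k^t}
  = y_{k'}^t \left( \delta_{kk'} - y_k^t \right).
```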
The derivative of the cross-entropy form of CTC with respect to can be calculated as


and its derivative with respect to can be calculated as


This equals the derivative of the CTC loss given in the paper, so Equ. (26) holds.
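The claim can also be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the standard forward-backward recursions of Graves & Gomez (2006) and takes the pseudo ground-truth to be $z_k^t = \frac{1}{p(l|x)\, y_k^t} \sum_{s \in lab(l,k)} \alpha_t(s)\beta_t(s)$, which is the form implied by the usual CTC gradient. All function names are hypothetical. The script compares a finite-difference gradient of the CTC loss with $y_k^t - z_k^t$, the gradient of a cross entropy whose target $z$ is held constant.

```python
import math

BLANK = 0  # assumption: class index 0 is the blank


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def forward_backward(y, label):
    """Standard CTC alpha/beta tables for a non-empty label sequence."""
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]  # blanks interleaved around every label
    T, S = len(y), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    alpha[0][0], alpha[0][1] = y[0][ext[0]], y[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * y[t][ext[s]]
    beta[T - 1][S - 1] = y[T - 1][ext[S - 1]]
    beta[T - 1][S - 2] = y[T - 1][ext[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s]
            if s + 1 < S:
                b += beta[t + 1][s + 1]
            if s + 2 < S and ext[s + 2] != BLANK and ext[s + 2] != ext[s]:
                b += beta[t + 1][s + 2]
            beta[t][s] = b * y[t][ext[s]]
    p = alpha[T - 1][S - 1] + alpha[T - 1][S - 2]  # p(l|x)
    return ext, alpha, beta, p


def ctc_loss(u, label):
    y = [softmax(row) for row in u]
    return -math.log(forward_backward(y, label)[3])


def pseudo_ground_truth(u, label):
    """Return softmax outputs y and the assumed pseudo ground-truth z."""
    y = [softmax(row) for row in u]
    ext, alpha, beta, p = forward_backward(y, label)
    z = [[0.0] * len(u[0]) for _ in u]
    for t in range(len(u)):
        for s, k in enumerate(ext):
            z[t][k] += alpha[t][s] * beta[t][s] / (p * y[t][k])
    return y, z


def max_gradient_gap(u, label, eps=1e-6):
    """Largest |numeric dCTC/du - (y - z)| over all timesteps and classes."""
    y, z = pseudo_ground_truth(u, label)
    gap = 0.0
    for t in range(len(u)):
        for k in range(len(u[0])):
            up = [row[:] for row in u]
            dn = [row[:] for row in u]
            up[t][k] += eps
            dn[t][k] -= eps
            numeric = (ctc_loss(up, label) - ctc_loss(dn, label)) / (2 * eps)
            gap = max(gap, abs(numeric - (y[t][k] - z[t][k])))
    return gap


# Toy input: 4 timesteps, 3 classes (blank, 1, 2), target sequence [1, 2].
u = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [0.1, 0.1, 0.1], [-0.3, 0.6, 0.2]]
label = [1, 2]
print("max gradient gap:", max_gradient_gap(u, label))
```

On this toy input the gap should be at the level of finite-difference error. Each row of `z` sums to 1, which is what allows it to act as a per-timestep target distribution.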

Appendix B Derivation Process of

The class-weighted CTC loss function is




The derivative of with respect to can be calculated as


and its derivative with respect to can be calculated as


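As a hedged sketch only: the loss definition did not survive in this copy, so the form below is an assumption, not the paper's equation. Suppose the class-weighted loss scales each class term of the cross-entropy form by a constant weight $w_k$ (a hypothetical symbol), with the pseudo ground-truth $z_k^t$ treated as a constant and $y_k^t$ the softmax of $u_k^t$. Under that assumption, the chain rule through the softmax gives:

```latex
\mathcal{L}_{w} = -\sum_{t}\sum_{k} w_k \, z_k^t \ln y_k^t,
\qquad
\frac{\partial \mathcal{L}_{w}}{\partial y_k^t} = -\frac{w_k \, z_k^t}{y_k^t},
\qquad
\frac{\partial \mathcal{L}_{w}}{\partial u_k^t}
  = y_k^t \sum_{k'} w_{k'} \, z_{k'}^t \; - \; w_k \, z_k^t .
```

Setting every $w_k = 1$ and using $\sum_k z_k^t = 1$ recovers the unweighted gradient $y_k^t - z_k^t$, which serves as a quick sanity check on the assumed form.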
Appendix C Derivation Process of

The sample-weighted CTC loss function is




The derivative of with respect to can be calculated as


and its derivative with respect to is


Hence, the derivative of with respect to is


Appendix D Derivation Process of

The class-weighted CTFL is defined as


and its derivative with respect to is


and its derivative with respect to is




Appendix E Derivation Process of

The sample-weighted CTFL is given as


and we find its derivative with respect to as