The recent trend in automatic speech recognition (ASR) research is to simplify the recognition process by using a single neural network to approximate the direct mapping from acoustic signals to textual transcription. The introduction of connectionist temporal classification (CTC) [1, 2] is appealing thanks to its ability to directly model the alignment between acoustic observations and labels. Notably, the components of traditional phoneme-based systems, such as pronunciation lexicons, framewise alignments, and Hidden Markov Model (HMM) topologies, are not mandatory in end-to-end approaches [3, 4, 5] that use simple character sequences as labels. Other works go a step further with an acoustic-to-word (A2W) setup; in such cases, neither language models nor beam search decoders are necessary.
The aforementioned works have shown the potential of the CTC framework to jointly learn to predict the labels and to align inputs with outputs. However, the disadvantage of CTC is its training complexity: convergence is not guaranteed, and optimization may get stuck in local optima [7, 8]. Data sparsity is also known to hinder effective CTC training, as shown by work reporting that, without pre-training initialization, an A2W model is much harder to converge on the well-known Switchboard training set.
Our work is motivated by the findings of our previous study [10], in which a correlation was found between the phone probabilities estimated by CTC models and the hard labels produced by traditional framewise alignment models. This evidence points to a connection between two training schemes: sequence-wise prediction with CTC and framewise classification, typically trained with a cross-entropy (CE) optimization criterion. To the best of our knowledge, the idea of combining these two training criteria, CTC and CE, so that models can exploit both sequence-level and frame-level properties, has not been successfully explored.
In this work, we propose a novel architecture in which a shared neural network is trained with both CTC and framewise CE from a multi-task learning perspective. Our experiments show that both tasks can benefit from each other. First, we show that training an acoustic-to-word model with our approach is much more stable without pre-training initialization, and yields a significant improvement over a plain model. Second, we show that the performance of the framewise CE acoustic model can be further improved by joint training with a sequence-level criterion such as acoustic-to-word CTC.
2 Multi-task Learning of CTC & framewise CE
In this section, we review two popular optimization criteria, CTC (Section 2.1) and framewise CE (Section 2.2), frequently used to train neural network-based acoustic models. Our proposed network architecture and training method for combining these criteria are then described in Section 2.3.
2.1 CTC Task
Given an audio utterance and the corresponding transcript (a sequence of labels), the CTC framework estimates the alignment between the utterance and the transcript as a latent variable, dubbed the CTC path. Let $y_k^t$ denote the posterior probability that the neural network assigns to label $k$ at time $t$. The CTC loss function is then defined as the sum of the negative log likelihoods of the target label sequences over all training utterances:

$\mathcal{L}_{\mathrm{CTC}} = -\sum_{(\mathbf{x},\mathbf{z})} \ln p(\mathbf{z}|\mathbf{x}), \qquad p(\mathbf{z}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{z})} \prod_{t=1}^{T} y_{\pi_t}^t$

where $\mathbf{x}$ is the input utterance, $\mathbf{z}$ its transcript, and $\mathcal{B}^{-1}(\mathbf{z})$ the set of all CTC paths $\pi$ (label sequences including blanks and repetitions) that collapse to $\mathbf{z}$.
In order to optimize the CTC criterion, [1, 2] proposed the forward-backward algorithm, which efficiently computes the gradient with respect to the network activations for every input frame.
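The forward part of this computation can be sketched in plain Python. This is an illustrative sketch, not the implementation of [1]: the blank symbol, the dictionary-based posteriors, and the function name are all assumptions, and a real implementation works in log space and also runs the backward recursion to obtain gradients.

```python
# Minimal CTC forward recursion: computes -ln p(labels | input) by summing
# over all CTC paths with dynamic programming (illustrative sketch).
import math

BLANK = "_"  # assumed blank symbol

def ctc_neg_log_likelihood(posteriors, labels):
    """posteriors: list over frames of {label: prob}; labels: target sequence."""
    # Extended label sequence with blanks interleaved: _ l1 _ l2 _ ... _
    ext = [BLANK]
    for l in labels:
        ext += [l, BLANK]
    T, S = len(posteriors), len(ext)
    # alpha[s] = total probability of all path prefixes ending at ext[s]
    alpha = [0.0] * S
    alpha[0] = posteriors[0].get(BLANK, 0.0)
    if S > 1:
        alpha[1] = posteriors[0].get(ext[1], 0.0)
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            p = alpha[s]
            if s > 0:
                p += alpha[s - 1]
            # Skip transition allowed between distinct non-blank labels
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                p += alpha[s - 2]
            new[s] = p * posteriors[t].get(ext[s], 0.0)
        alpha = new
    total = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    return -math.log(total)
```

For two frames with equal blank/"a" probability, the three paths collapsing to "a" (aa, a_, _a) each have probability 0.25, giving $-\ln 0.75$.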
CTC loss is typically computed on entire training utterances so that the model can effectively learn label prediction and alignment at the same time. So far, bidirectional Long Short-Term Memory (BLSTM) networks, which are capable of capturing long-term context dependencies, are the most popular architecture for learning the representation trained via the CTC loss function.
2.2 Framewise CE Task
In HMM-based speech recognition, Viterbi forced alignment is used to obtain a state label $s_t$ (i.e., a context-dependent phone state) from the ground-truth transcript for each input frame $t$ of a training utterance. A neural network model is then trained to model this state distribution by optimizing the CE loss function:

$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \sum_{s} \delta(s, s_t) \ln y_s^t$

where $\delta$ is the Kronecker delta and $y_s^t$ is the network output for state $s$ at frame $t$. According to the HMM assumptions, a neural network model which minimizes the CE loss approximately maximizes the likelihood of the input.
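Because the targets are one-hot (Kronecker delta), the sum over states collapses to the negative log posterior of the aligned state at each frame. A minimal sketch, with illustrative names and dictionary-based posteriors:

```python
# Framewise CE loss with hard (forced-alignment) state targets:
# only the target state survives the Kronecker delta at each frame.
import math

def framewise_ce_loss(frame_posteriors, state_labels):
    """frame_posteriors: list over frames of {state: prob} from the network;
    state_labels: forced-alignment state label per frame."""
    assert len(frame_posteriors) == len(state_labels)
    loss = 0.0
    for probs, s in zip(frame_posteriors, state_labels):
        loss -= math.log(probs[s])  # -ln y_s^t for the aligned state s
    return loss
```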
The framewise CE loss does not consider each utterance in its entirety; instead, it is defined over individual pairs of acoustic frames and state labels. To build an acoustic model for speech recognition, successful hybrid Feed-Forward Neural Network (FFNN) models are typically trained on random batches drawn from all training samples.
Recently, BLSTMs have outperformed FFNNs and become the state of the art for acoustic modeling with the framewise CE criterion, as they are better at modeling the dependencies between a long sequence of acoustic frames and its corresponding states. In contrast to the optimization of the CTC criterion, several works [11, 12, 13] divide training utterances into fixed-size chunks (e.g., 50 frames) when training BLSTM acoustic models with the CE loss.
2.3 Multi-task Learning
Models trained with the CTC and framewise CE criteria provide very different label distributions even when the label sets used are identical, because they learn the mapping function at different scales. However, as in our previous work [10], the models resulting from training with either criterion end up sharing similar traits in their representations, which motivates us to use the two loss functions jointly. We consider each of them a task in which the model learns features at either a local or a global level. The aim of our work is to establish a shared underlying neural network that efficiently learns from both tasks in parallel.
Our proposed network architecture combining the two training criteria is illustrated in the accompanying figure. The entire LSTM stack used for encoding input sequences is shared; only two output layers are separated to perform the task-specific predictions. This structure allows gradients from both tasks to be propagated back into the encoding network as early as possible. We further add a small projection layer (200 neurons) on top of the LSTM layers in the shared network. The projection layer was found to speed up training and improve convergence: it significantly reduces the number of parameters in the task-specific layers, pushing the shared layers to learn the required representations.
In practice, the optimal setup for optimizing the CE loss divides training utterances into subsequences of frames, which differs from the setup for the CTC loss. To obtain optimal performance for both tasks, the loss functions would have to be combined at the utterance level while synchronizing complete utterances with their subsequences. However, such a synchronization is usually memory-inefficient or greatly increases training time.
In our experiments we found that training a BLSTM model with the CE loss on entire utterances gives performance comparable to the use of subsequences, even though training time increases due to reduced parallelization. We therefore propose to compute and combine the CE loss and the CTC loss over entire utterances. The combined loss function is then the weighted sum of the CTC and CE losses with a hyper-parameter $\lambda$:

$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \mathcal{L}_{\mathrm{CE}}$
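The combination itself is a simple interpolation of the two utterance-level losses. A minimal sketch, where the weight name `lam` is illustrative (the actual hyper-parameter value is tuned experimentally):

```python
# Weighted sum of the utterance-level CTC and framewise CE losses.
def combined_loss(ctc_loss, ce_loss, lam):
    """Interpolates the two losses; lam = 1 recovers pure CTC training,
    lam = 0 recovers pure framewise CE training."""
    assert 0.0 <= lam <= 1.0
    return lam * ctc_loss + (1.0 - lam) * ce_loss
```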
3 Experimental Setup
Our experiments were conducted on the Switchboard-1 Release 2 training corpus, which contains 300 hours of speech; the Hub5'00 evaluation data was used as the test set. We used a deep BLSTM with 5 layers of 320 units (500 units for the big models). All models were trained on 40 log-mel filter-bank features normalized per conversation. We adopted a new-bob training schedule in which the initial learning rate is fixed for 12 epochs and then decays exponentially with a factor of 0.8 whenever the cross-validation error degrades. For training the multi-task models and the plain CTC models, we used stochastic gradient descent with the loss averaged over the utterances in each mini-batch. For framewise CE training with subsequences, we normalized the loss per frame.
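The new-bob style schedule above can be sketched as follows; the function name, arguments, and the exact trigger condition are illustrative assumptions, not the training code used in the experiments:

```python
# New-bob style schedule: hold the initial learning rate for a fixed number
# of epochs, then multiply by a decay factor after every epoch in which the
# cross-validation error degrades.
def newbob_lr(initial_lr, cv_errors, hold_epochs=12, decay=0.8):
    """cv_errors: cross-validation error after each completed epoch.
    Returns the learning rate to use for the next epoch."""
    lr = initial_lr
    for epoch in range(1, len(cv_errors)):
        if epoch >= hold_epochs and cv_errors[epoch] > cv_errors[epoch - 1]:
            lr *= decay
    return lr
```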
We used the PyTorch toolkit to build and train the BLSTMs. The character and phoneme CTC systems were decoded using Eesen [4], while for the hybrid systems we used Janus [14]. A 4-gram language model, trained on the transcripts of the training data and the English Fisher corpus, was used in all decodings except for the word models.
4.1 CTC Baseline Model
First of all, we are interested in how the multi-task network improves performance on the CTC task. We experimented with three popular label types: phonemes, graphemes, and words.
In Table 1, we summarize results from recent studies that reported CTC systems and training optimizations on the Switchboard 300-hour training set. So far, effective training optimization typically involves a pre-trained model or a particular ordering of the training utterances. For fair comparison, the selected phoneme and grapheme systems use 45 English phones and 46 characters as label sets, and decoding was performed with n-gram language models. The word models were trained with 10k label units, and only greedy decoding was applied.
The convergence of models optimized with CTC is not stable. Several works have adopted a curriculum learning strategy in which training utterances are sorted in ascending order of frame length to improve the stability and accuracy of model training. One of these works proposed to use curriculum learning only in the first epoch and a random order for the remaining epochs.
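This ordering strategy can be sketched in a few lines; the representation of utterances as (id, length) pairs and the function name are illustrative assumptions:

```python
# Curriculum ordering: sort utterances by length (ascending) in the first
# epoch, then use a reproducible random order for later epochs.
import random

def epoch_order(utterances, epoch, seed=0):
    """utterances: list of (utt_id, num_frames) pairs."""
    if epoch == 0:
        return sorted(utterances, key=lambda u: u[1])
    rng = random.Random(seed + epoch)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    return shuffled
```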
When switching to word labels, CTC training convergence is even poorer due to data sparsity. It has been reported that on the Switchboard training set, model initialization through pre-training is critical, and random initialization of model parameters usually fails to converge. [9, 15, 16] used a pre-trained phone model and GloVe word embeddings [17] to initialize their word models; some of these works also used i-vectors and deltas as additional input features.
4.2 Performance of the CTC Task
We trained multi-task models in which the label set of the CTC task was varied while the framewise CE task of classifying 8,000 phone states was kept the same. We observed that when the combination weight between the two losses is set appropriately, both the token error rate (TER) and the phone error rate (PER), measured on the CTC and framewise CE tasks respectively, decrease faster than during the training of the plain models. This indicates an optimal value of the weight at which the learning of the shared network benefits maximally. Our training optimization then includes two steps: we first train the multi-task models with this weight until the combined loss converges, and then perform fine-tuning on the individual tasks.
Table 2 presents the results of the CTC systems trained with the proposed multi-task learning, together with the results of our plain CTC training on the same label sets. The performance of our plain models on phonemes and characters is on par with previously reported results. We tried to randomly initialize the plain word models, but this was not effective; however, when using the LSTM layers from the pre-trained phone model, word model training converged successfully.
[7, 8] used parameters initialized from a framewise CE model to stabilize phone-based CTC training. We found that learning an A2W model jointly with framewise CE solves the problem of data sparsity: without any pre-trained initialization, our word models converged as reliably as the phone and character models. In our setup, learning a shared network also leads to better accuracy on the CTC task, as the multi-task models improve over the plain models for all label sets.
Curriculum learning is usually effective because CTC training is unstable. However, since the framewise CE task stabilizes the training of our multi-task models, we can apply different optimizations instead. In this study, we use a random order of training utterances together with dropout [18] on the LSTM layers. These techniques improve generalization and have been shown to be effective for framewise CE training. With this optimization, we achieved a 13.2% relative WER improvement on the SWB subset for both character and word models compared to the plain models.
We also experimented with optimizing plain A2W models initialized with the parameters of multi-task models from different training epochs (labeled as m-pretrain-epoch). With the new optimization we were able to train these plain A2W models, although it was hard to gain further improvement. This indicates that the framewise CE task not only stabilizes training but also leads to a shared representation that is effective for the CTC criterion.
In Table 3, we experiment with training word models of different label-set sizes. OOV denotes the number of rare words mapped to the unknown token, while occurrences denotes the number of times a modeled word appears in the training set. The multi-task models train stably even with a large number of rare words (some modeled words are seen only 3 times in the whole training set). Our setup also allows training a word model on a 100-hour subset. With a bigger model, we achieved slightly better improvements.
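The construction of such A2W label sets can be sketched as a frequency cutoff over the training transcripts; the threshold value, the unknown-token name, and the function name are illustrative assumptions:

```python
# Build an A2W label set: keep words with at least `min_count` occurrences
# in the training transcripts and map the rest to an unknown token.
from collections import Counter

def build_word_labels(transcripts, min_count=3, unk="<unk>"):
    """transcripts: list of transcript strings.
    Returns (vocabulary, transcripts with rare words mapped to unk)."""
    counts = Counter(w for line in transcripts for w in line.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    mapped = [[w if w in vocab else unk for w in line.split()]
              for line in transcripts]
    return vocab, mapped
```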
4.3 Performance of Framewise CE Task
As shown in Table 2, the framewise CE task of the multi-task models converged at different phone error rates depending on the label set used for the CTC task. Interestingly, joint training with the A2W task supplements and boosts the performance of the hybrid model over plain training. This is shown in Table 4, where we compare several hybrid systems trained with and without multi-task learning. We also compare LSTM acoustic models using entire utterances or subsequences as input. Our setup for constructing subsequences follows the optimal setup found in earlier work, in which training utterances are divided into chunks of 50 frames with 25 overlapping frames between consecutive chunks. We achieved a significant improvement (12% relative) of the multi-task training over the plain training. The result of our bigger multi-task model is on par with the best previously reported model, even though that model employed more advanced input features.
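The subsequence construction above (50-frame chunks with 25-frame overlap) can be sketched as follows; the function name and the exclusive-end convention are illustrative assumptions:

```python
# Divide an utterance of `num_frames` frames into fixed-size chunks with a
# given overlap between two consecutive chunks (the last chunk may be short).
def make_chunks(num_frames, chunk_size=50, overlap=25):
    """Returns a list of (start, end) frame ranges, end exclusive."""
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < num_frames:
        chunks.append((start, min(start + chunk_size, num_frames)))
        if start + chunk_size >= num_frames:
            break
        start += step
    return chunks
```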
5 Related Work
[7, 8] used framewise CE training to initialize the LSTM layers of phone-based CTC models, finding that with such pre-trained parameters, CTC training is more stable than with random initialization. [19] trained deep feed-forward sequential memory networks (Deep-FSMN) with CTC and proposed to incorporate the CE loss as a regularization term, arguing that the CE loss helps to stabilize CTC training and to improve the alignments of CTC models, which then leads to significant improvements in WER.
Concurrent work proposed to use hierarchical pre-trained CTC, curriculum learning, and joint CTC-CE training to optimize A2W models; in earlier work, multi-task CTC-CE was investigated without success. Different from these approaches, we do not use any pre-training, and we found that additional training optimization is needed to improve the convergence of multi-task models. In our study, the performance of both the CTC and framewise CE tasks is improved by optimizing a shared representation.
6 Conclusion and Future Work
We have presented an efficient approach for training encoder networks that model both word sequences and context-dependent phone-state sequences concurrently. Our results suggest that such an encoder network can be shared among different training criteria. As future work, we will investigate the use of this encoding network in attention-based speech recognition models.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
-  Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1764–1772.
-  Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
-  Yajie Miao, Mohammad Gowayyed, and Florian Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 167–174.
-  Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke, “Advances in all-neural speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4805–4809.
-  Hagen Soltau, Hank Liao, and Hasim Sak, “Neural speech recognizer: Acoustic-to-word lstm model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016.
-  Haşim Sak, Andrew Senior, Kanishka Rao, Ozan Irsoy, Alex Graves, Françoise Beaufays, and Johan Schalkwyk, “Learning acoustic frame labeling for speech recognition with recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4280–4284.
-  Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
-  Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo, “Direct acoustics-to-word models for english conversational speech recognition,” arXiv preprint arXiv:1703.07754, 2017.
-  Thai-Son Nguyen, Sebastian Stueker, and Alex Waibel, “Exploring ctc-network derived features with conventional hybrid system,” in International Conference on Acoustics, Speech, and Signal Processing 2018 - ICASSP, 2018.
-  Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth annual conference of the international speech communication association, 2014.
-  Xiangang Li and Xihong Wu, “Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4520–4524.
-  Albert Zeyer, Patrick Doetsch, Paul Voigtlaender, Ralf Schlüter, and Hermann Ney, “A comprehensive study of deep bidirectional lstm rnns for acoustic modeling in speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2462–2466.
-  Michael Finke, Petra Geutner, Hermann Hild, Thomas Kemp, Klaus Ries, and Martin Westphal, “The karlsruhe VERBMOBIL speech recognition engine,” in Proc. of ICASSP, 1997.
-  Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny, “Building competitive direct acoustics-to-word models for english conversational speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4759–4763.
-  Chengzhu Yu, Chunlei Zhang, Chao Weng, Jia Cui, and Dong Yu, “A multistage training framework for acoustic-to-word model,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
-  Jeffrey Pennington, Richard Socher, and Christopher Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
-  Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, “Deep-fsmn for large vocabulary continuous speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.