Automatic speech recognition (ASR) has historically been addressed with modular approaches, in which multiple parts of the system are trained separately. For example, traditional ASR systems include components like frame classifiers, phonetic acoustic models, lexicons (which may or may not be learned from data), and language models. These components typically correspond to different levels of representation, such as frame-level triphone states, phones, and words. Breaking up the task into such modules makes it easy to train each of them separately, possibly on different data sets, and to study the effect of modifying each component separately.
Over time, ASR research has moved increasingly toward training multiple components of ASR systems jointly. Typically, such approaches involve training initial separate modules, followed by joint fine-tuning using sequence-level losses [2, 3]. Recently, completely integrated end-to-end training approaches, where all parameters are learned jointly using a loss at the final output level, have become viable and popular. End-to-end training is especially natural for deep neural network-based models, where the final loss gradient can be backpropagated through all layers. Typical end-to-end models are based on recurrent neural network (RNN) encoder-decoders [4, 5, 6, 7] or connectionist temporal classification (CTC)-based models [8, 9].
End-to-end training is appealing because it is conceptually simple and allows all model parameters to contribute to the same final goal, and to do so in the context of all other model parameters. End-to-end approaches have also achieved impressive results in ASR [4, 9, 10] as well as other domains [11, 12, 13]. On the other hand, end-to-end training has some drawbacks: Optimization can be challenging; the intermediate learned representations are not interpretable, making the system hard to debug; and the approach ignores potentially useful domain-specific information about intermediate representations, as well as existing intermediate levels of supervision.
Prior work on analyzing deep end-to-end models has found that different layers tend to specialize for different sub-tasks, with lower layers focusing on lower-level tasks and higher ones on higher-level tasks. This effect has been found in systems for speech processing [14, 15] as well as computer vision [16, 17].
We propose an approach for deep neural ASR that aims to maintain the advantages of end-to-end approaches, while also including the domain knowledge and intermediate supervision used in modular systems. We use a multitask learning approach that combines the final task loss (in our case, log loss on the output labels) with losses corresponding to lower-level tasks (such as phonetic recognition) applied on lower layers. This approach is intended to encapsulate the intuitive and empirical observation that different layers encode different levels of information, and to encourage this effect more explicitly. In other words, while we want the end-to-end system to take input acoustics and produce output text, we also believe that at some appropriate intermediate layer, the network should do a good job at distinguishing more basic units like states or phones. Similarly, while end-to-end training need not require supervision at intermediate (state/phone) levels, if they are available then our multitask approach can take advantage of them.
We demonstrate this approach on a neural attention-based encoder-decoder character-level ASR model. Our baseline model is inspired by prior work [18, 8, 19, 4, 7], and our lower-level auxiliary tasks are based on phonetic recognition and frame-level state classification. We find that applying an auxiliary loss at an appropriate intermediate layer of the encoder improves performance over the baseline.
2 Related Work
Multitask training has been studied extensively in the machine learning literature. Its application to deep neural networks has been successful in a variety of settings in speech and language processing [22, 23, 24, 25, 26, 27]. Most prior work combines multiple losses applied at the final output layer of the model, such as joint Mandarin character and phonetic recognition in [26] and joint CTC and attention-based training for English ASR [25]. Our work differs from this prior work in that our losses relate to different types of supervision and are applied at different levels of the model.
The idea of using low-level supervision at lower levels was, to our knowledge, first introduced by Søgaard & Goldberg [28] for natural language processing tasks, and has since been extended by [29]. The closest work to ours is the approach of Rao and Sak [30] using phoneme labels for training a multi-accent CTC-based ASR system in a multitask setting. Here we study the approach in the context of encoder-decoder models, and we compare a number of low-level auxiliary losses.
The multitask approach we propose can in principle be applied to any type of deep end-to-end model. Here we study the approach in the context of attention-based deep RNNs. Below we describe the baseline model, followed by the auxiliary low-level training tasks.
3.1 Baseline Model
The model is based on attention-enabled encoder-decoder RNNs, proposed by [19]. The speech encoder reads in acoustic features $x$ and outputs a sequence of high-level features (hidden states) $h$, which the character decoder attends to in generating the output character sequence $y$, as shown in Figure 1 (the attention mechanism and a pyramidal LSTM layer are not shown in the figure for simplicity).
3.1.1 Speech Encoder
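The speech encoder is a deep pyramidal bidirectional Long Short-Term Memory (BiLSTM) network. In the first layer, a BiLSTM reads in acoustic features $x = (x_1, \ldots, x_T)$ and outputs $h^{(1)} = (h^{(1)}_1, \ldots, h^{(1)}_T)$ given by:
$$\overrightarrow{h}^{(1)}_i = f^{(1)}\big(x_i, \overrightarrow{h}^{(1)}_{i-1}\big), \qquad \overleftarrow{h}^{(1)}_i = b^{(1)}\big(x_i, \overleftarrow{h}^{(1)}_{i+1}\big), \qquad h^{(1)}_i = \big(\overrightarrow{h}^{(1)}_i, \overleftarrow{h}^{(1)}_i\big)$$
where $i \in \{1, \ldots, T\}$ denotes the index of the timestep; $f^{(1)}(\cdot)$ and $b^{(1)}(\cdot)$ denote the first layer forward and backward LSTMs respectively. (For brevity we exclude the LSTM equations. The details can be found, e.g., in Zaremba et al. [32].)
The first layer output $h^{(1)}$ is then processed as follows:
$$\overrightarrow{h}^{(j)}_i = f^{(j)}\big([h^{(j-1)}_{2i-1}; h^{(j-1)}_{2i}], \overrightarrow{h}^{(j)}_{i-1}\big), \qquad \overleftarrow{h}^{(j)}_i = b^{(j)}\big([h^{(j-1)}_{2i-1}; h^{(j-1)}_{2i}], \overleftarrow{h}^{(j)}_{i+1}\big), \qquad h^{(j)}_i = \big(\overrightarrow{h}^{(j)}_i, \overleftarrow{h}^{(j)}_i\big), \qquad j = 2, 3, 4$$
where $f^{(j)}$ and $b^{(j)}$ denote the forward and backward running LSTMs at layer $j$. Following [4], we use pyramidal layers to reduce the time resolution of the final state sequence $h^{(4)}$ by a factor of $2^3 = 8$. This reduction brings down the input sequence length, initially $T = |x|$, where $|\cdot|$ denotes the length of a sequence of vectors, close to the output sequence length $K = |y|$ (for Switchboard, the average number of frames per character is about 7). For simplicity, we will refer to $h^{(4)}$ as $h$.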
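The effect of the pyramidal layers on sequence length can be illustrated with a small sketch. It assumes that each layer after the first consumes non-overlapping pairs of states and that a trailing unpaired frame is dropped; the function names `pyramid_lengths` and `pair_frames` are illustrative and not part of the original system.

```python
def pyramid_lengths(num_frames, num_layers=4):
    """Return the sequence length at each encoder layer.

    Layer 1 keeps the input resolution; every later layer halves it
    (assumption: an unpaired trailing frame is dropped).
    """
    lengths = [num_frames]
    for _ in range(num_layers - 1):
        lengths.append(lengths[-1] // 2)
    return lengths


def pair_frames(states):
    """Concatenate consecutive, non-overlapping pairs of state vectors."""
    return [states[i] + states[i + 1] for i in range(0, len(states) - 1, 2)]


# A 7-second utterance at 100 frames per second.
print(pyramid_lengths(700))  # [700, 350, 175, 87]
```

With about 7 frames per character on average, an 8-fold reduction leaves the top-layer sequence roughly as long as the character sequence, which is what keeps the attention computation cheap.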
3.1.2 Character Decoder
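The character decoder is a single-layer LSTM that predicts a sequence of characters $y = (y_1, \ldots, y_K)$ as follows:
$$P(y \mid x) = P(y \mid h) = \prod_{t=1}^{K} P(y_t \mid h, y_{<t})$$
The conditional dependence on the encoder state vectors $h$ is represented by context vector $c_t$, which is a function of the current decoder hidden state and the encoder state sequence:
$$u_{it} = v^\top \tanh(W_1 h_i + W_2 d_t + b_a), \qquad \alpha_{it} = \frac{\exp(u_{it})}{\sum_{j=1}^{|h|} \exp(u_{jt})}, \qquad c_t = \sum_{i=1}^{|h|} \alpha_{it} h_i$$
where the vectors $v, b_a$ and the matrices $W_1, W_2$ are learnable parameters; $d_t$ is the hidden state of the decoder at time step $t$. The time complexity of calculating the context vector $c_t$ for every time step is $O(|h|)$; reducing the resolution on the encoder side is crucial to reducing this runtime.
The hidden state of the decoder, $d_t$, which captures the previous character context $y_{<t}$, is given by:
$$d_t = g(\tilde{y}_{t-1}, d_{t-1}, c_{t-1})$$
where $g(\cdot)$ is the transformation of the single-layer LSTM, $d_{t-1}$ is the previous hidden state of the decoder, and $\tilde{y}_{t-1}$ is a character embedding vector for $y_{t-1}$, as is typical practice in RNN-based language models. Finally, the posterior distribution of the output at time step $t$ is given by:
$$P(y_t \mid h, y_{<t}) = \mathrm{softmax}(W_s [c_t; d_t] + b_s)$$
and the character decoder loss function is then defined as
$$L_c = -\log P(z^c \mid x)$$
where $z^c$ is the target character sequence.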
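The context vector computation described above can be sketched in a few lines of plain Python. This is a minimal illustration assuming an additive (MLP-style) scoring function with two matrices and two vectors as learnable parameters; the argument names `W1`, `W2`, `b`, and `v` are placeholders chosen for this sketch.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_context(enc_states, dec_state, W1, W2, b, v):
    """Return (context, weights) for a single decoder time step."""
    proj_dec = matvec(W2, dec_state)
    scores = []
    for h in enc_states:
        proj_enc = matvec(W1, h)
        hidden = [math.tanh(pe + pd + bias)
                  for pe, pd, bias in zip(proj_enc, proj_dec, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Numerically stable softmax over encoder positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, enc_states))
               for k in range(dim)]
    return context, weights
```

The loop over `enc_states` makes the per-step cost linear in the encoder sequence length, which is why reducing the encoder resolution directly reduces decoding time.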
3.2 Low-Level Auxiliary Tasks
As shown in Figure 1, we explore two types of auxiliary labels for multitask learning: phonemes and sub-phonetic states. We hypothesize that the intermediate representations needed for sub-phonetic state classification are learned at the lowest layers of the encoder, while representations for phonetic prediction may be learned at a somewhat higher level.
3.2.1 Phoneme-Based Auxiliary Tasks
We use phoneme-level supervision obtained from the word-level transcriptions and a pronunciation dictionary. We consider two types of phoneme transcription loss:
Phoneme Decoder Loss: Similar to the character decoder described above, we can attach a phoneme decoder to the speech encoder as well. The phoneme decoder has exactly the same mathematical form as the character decoder, but with a phoneme label vocabulary at the output. Specifically, the phoneme decoder loss is defined as
$$L_p = -\log P(z^p \mid x)$$
where $z^p$ is the target phoneme sequence. Since this decoder can be attached at any depth of the four-layer encoder described above, we have four depths to choose from. We attach the phoneme decoder to layer 3 of the speech encoder, and also compare this choice to attaching it to layer 4 (the final layer), which corresponds to a more typical multitask training approach.
CTC Loss: As an alternative to the phoneme decoder, we can apply a CTC loss [33] to the phoneme sequence. This involves adding an extra softmax output layer on top of the chosen intermediate layer of the encoder, and applying the CTC loss to the output of this softmax layer. Specifically, let $z^p$ be the target phoneme sequence, and $j$ be the speech encoder layer where the loss is applied. The probability of $z^p$ given the input sequence is
$$P(z^p \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(z^p)} \prod_{t=1}^{T_j} P(\pi_t \mid h^{(j)}_t)$$
where $\mathcal{B}(\cdot)$ removes repetitive symbols and blank symbols, $\mathcal{B}^{-1}$ is $\mathcal{B}$’s pre-image, $T_j$ is the number of frames at layer $j$ and $P(\pi_t \mid h^{(j)}_t)$ is computed by a softmax function. The final CTC objective is
$$L_{ctc} = -\log P(z^p \mid x)$$
The CTC objective computation requires the output length to be less than the input length, i.e., $|z^p| < T_j$. In our case the encoder reduces the time resolution by a factor of 8 between the input and the top layer, making the top-layer sequence occasionally shorter than the phoneme sequence of an utterance. We therefore cannot apply this loss to the topmost layer, and use it only at the third layer. (In fact, even at the third layer we find occasional instances, about 10 utterances in our training set, where the hidden state sequence is shorter than the target phoneme sequence, due to sequences of phonemes of duration less than 4 frames each. Anecdotally, these examples appear to correspond to incorrect training utterance alignments.)
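To make the CTC quantities above concrete, the following sketch implements the collapsing map (merge repeats, then drop blanks) and the standard forward recursion for the sequence probability. It is a plain-Python illustration rather than the training code used here; the blank index `0` and the function names are assumptions made for the example.

```python
def ctc_collapse(path, blank=0):
    """Merge repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out


def ctc_prob(frame_probs, target, blank=0):
    """P(target | input) via the CTC forward algorithm.

    frame_probs[t][k] is the softmax probability of symbol k at frame t.
    """
    # Interleave blanks: [blank, z1, blank, z2, ..., blank].
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][blank]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, len(frame_probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between distinct labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)
```

When there are fewer frames than target labels, no valid path exists and `ctc_prob` returns zero, so the negative log-likelihood is undefined. This is the length constraint that rules out applying the loss at the topmost encoder layer.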
3.2.2 State-Level Auxiliary Task
Sub-phonetic state labels provide another type of low-level supervision that can be borrowed from traditional modular HMM-based approaches. We apply this type of supervision at the frame level, as shown in Figure 1, using state alignments obtained from a standard HMM-based system. We apply this auxiliary task at layer 2 of the speech encoder. The probability of a sequence of states $z^s$ is defined as
$$P(z^s \mid x) = \prod_{t=1}^{T_2} P(z^s_t \mid h^{(2)}_t)$$
where $P(z^s_t \mid h^{(2)}_t)$ is computed by a softmax function, and $T_2$ is the number of frames at layer 2 (in this case $T/2$). Since we use this task at layer 2, we subsample the state labels to match the reduced resolution. The final state-level loss is
$$L_s = -\log P(z^s \mid x)$$
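The label subsampling and the frame-level loss can be sketched as follows. The text does not specify how the state labels are subsampled, so this example assumes the simplest choice, keeping the first label of every pair of frames; `subsample_labels` and `state_loss` are illustrative names.

```python
import math


def subsample_labels(frame_labels, factor=2):
    """Keep one label per block of `factor` frames (assumption: the first)."""
    usable = len(frame_labels) - len(frame_labels) % factor
    return frame_labels[0:usable:factor]


def state_loss(frame_posteriors, labels):
    """Negative log-probability of a state sequence under per-frame softmax
    outputs; frame_posteriors[t][k] is the probability of state k at frame t."""
    return -sum(math.log(p[s]) for p, s in zip(frame_posteriors, labels))


labels = subsample_labels([5, 5, 7, 7, 7, 9, 9])
print(labels)  # [5, 7, 7]
```

Other subsampling rules, such as a majority vote within each block, would fit the same interface.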
3.2.3 Training Loss
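The final loss function that we minimize is the average of the losses involved. For example, in the case where we use the character and phoneme decoder losses and the state-level loss, the loss would be
$$L = \frac{1}{3}\,(L_c + L_p + L_s)$$
where $L_c$, $L_p$, and $L_s$ denote the character decoder, phoneme decoder, and state-level losses, respectively.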
We use the Switchboard corpus (LDC97S62) [34], which contains roughly 300 hours of conversational telephone speech, as our training set. We reserve the first 4K utterances as a development set. Since the training set has a large number of repetitions of short utterances (“yeah”, “uh-huh”, etc.), we remove duplicates beyond a count threshold of 300. The final training set has about 192K utterances. For evaluation, we use the HUB5 Eval2000 data set (LDC2002S09), consisting of two subsets: Switchboard (SWB), which is similar in style to the training set, and CallHome (CHE), which contains unscripted conversations between close friends and family.
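The duplicate filtering step can be sketched as below. The sketch assumes that duplicates are identified by exact transcript match and that the first occurrences are the ones kept; neither detail is stated in the text, and `cap_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def cap_duplicates(utterances, max_count=300):
    """Keep at most `max_count` utterances per distinct transcript.

    utterances: iterable of (utterance_id, transcript) pairs.
    """
    seen = Counter()
    kept = []
    for utt_id, text in utterances:
        seen[text] += 1
        if seen[text] <= max_count:
            kept.append((utt_id, text))
    return kept


data = [(i, "yeah") for i in range(5)] + [(5, "hello there")]
print(len(cap_duplicates(data, max_count=3)))  # 4
```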
For input features, we use 40-dimensional log-mel filterbank features along with their deltas, normalized with per-speaker mean and variance normalization. The phoneme labels for the auxiliary task are generated by mapping words to their canonical pronunciations, using the lexicon in the Kaldi Switchboard training recipe. The HMM state labels were obtained via forced alignment using a baseline HMM/DNN hybrid system trained with the Kaldi NNet1 recipe. The HMM/DNN has 8396 tied states, which makes the frame-level softmax costly for multitask learning. We use the importance sampling technique described in [35] to reduce this cost.
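Per-speaker mean and variance normalization can be sketched in plain Python as follows. This is an illustrative version that pools all frames of a speaker before computing statistics; the small `eps` term and the function name `per_speaker_mvn` are assumptions added for numerical safety and readability.

```python
import math
from collections import defaultdict


def per_speaker_mvn(features, speakers, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance,
    with statistics pooled over all frames of the same speaker.

    features: list of utterances, each a list of frames (lists of floats).
    speakers: one speaker id per utterance.
    """
    dim = len(features[0][0])
    sums = defaultdict(lambda: [0.0] * dim)
    sq_sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for utt, spk in zip(features, speakers):
        for frame in utt:
            counts[spk] += 1
            for k, value in enumerate(frame):
                sums[spk][k] += value
                sq_sums[spk][k] += value * value
    normalized = []
    for utt, spk in zip(features, speakers):
        n = counts[spk]
        mean = [s / n for s in sums[spk]]
        std = [math.sqrt(max(q / n - m * m, 0.0)) + eps
               for q, m in zip(sq_sums[spk], mean)]
        normalized.append([[(v - m) / sd for v, m, sd in zip(frame, mean, std)]
                           for frame in utt])
    return normalized
```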
4.1 Model Details and Inference
The speech encoder is a 4-layer pyramidal bidirectional LSTM, resulting in an 8-fold reduction in time resolution. We use 256 hidden units in each direction of each layer. The decoder for all tasks is a single-layer LSTM with 256 hidden units. We represent the decoders’ output symbols (both characters and, at training time, phonemes) using 256-dimensional embedding vectors. At test time, we use a greedy decoder (beam size = 1) to generate the character sequence. The character with the maximum posterior probability is chosen at every time step and fed as input into the next time step. The decoder stops after encountering the “EOS” (end-of-sentence) symbol. We use no explicit language model.
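The greedy decoding procedure can be sketched independently of any particular network. In this illustration, `step_fn` is a hypothetical callable standing in for one decoder step, and `max_len` is a safety cap added for the example (the text only specifies stopping at the end-of-sentence symbol).

```python
def greedy_decode(step_fn, initial_state, sos, eos, max_len=200):
    """Greedy (beam size 1) decoding.

    step_fn(prev_symbol, state) must return (posterior, new_state), where
    posterior maps each symbol to its probability at the current step.
    """
    symbols = []
    prev, state = sos, initial_state
    for _ in range(max_len):
        posterior, state = step_fn(prev, state)
        # Pick the most probable symbol and feed it to the next step.
        prev = max(posterior, key=posterior.get)
        if prev == eos:
            break
        symbols.append(prev)
    return symbols


def toy_step(prev_symbol, position):
    """Toy decoder step that spells "hi" and then emits the EOS symbol."""
    script = "hi"
    target = script[position] if position < len(script) else "<eos>"
    posterior = {"h": 0.05, "i": 0.05, "<eos>": 0.05}
    posterior[target] = 0.9
    return posterior, position + 1


print("".join(greedy_decode(toy_step, 0, "<sos>", "<eos>")))  # hi
```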
We train all models using Adam [36] with a minibatch size of 64 utterances. The initial learning rate is 1e-3 and is decayed by a factor of 0.95 whenever there is an increase in log-likelihood of the development data, calculated after every 1K updates, over its previous value. All models are trained for 75K gradient updates (about 25 epochs) with early stopping. To further control overfitting we (a) use dropout at a rate of 0.1 on the output of all LSTM layers, and (b) sample the previous step’s prediction [20] in the character decoder, with a constant probability of 0.1 as in [4].
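The sampling of the previous step's prediction during training can be sketched as a single decision per decoder step. The function name and the `rng` argument are illustrative; only the constant probability of 0.1 comes from the text.

```python
import random


def choose_decoder_input(gold_prev, predicted_prev, sample_prob=0.1, rng=random):
    """Pick the symbol fed to the decoder at the next training step.

    With probability `sample_prob` the model's own previous prediction is
    used; otherwise the ground-truth previous symbol is used.
    """
    if rng.random() < sample_prob:
        return predicted_prev
    return gold_prev


rng = random.Random(0)
picks = [choose_decoder_input("gold", "pred", 0.1, rng) for _ in range(10000)]
rate = picks.count("pred") / len(picks)
print(round(rate, 2))
```

Over many steps, the fraction of inputs drawn from the model's own predictions should be close to the configured probability.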
We evaluate performance using word error rate (WER). We report results on the combined Eval2000 test set as well as separately on the SWB and CHE subsets. We also report character error rates (CER) on the development set.
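For reference, word and character error rates are both ratios of edit distance to reference length. The sketch below uses the standard Levenshtein distance; counting spaces as characters in the CER is an assumption of this example, and scoring conventions in standard evaluation tools may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is "word" or "char"."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * errors / total


print(round(error_rate(["the cat sat"], ["the cat sit"]), 1))  # 33.3
```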
Table 1: Development set results (%).
| Model | Dev CER | Dev WER |
|---|---|---|
| Enc-Dec + PhoneDec-3 | 13.8 | 24.9 |
| Enc-Dec + PhoneDec-4 | 14.5 | 25.9 |
| Enc-Dec + PhoneCTC-3 | 14.0 | 25.3 |
| Enc-Dec + State-2 | 13.6 | 24.1 |
| Enc-Dec + PhoneDec-3 + State-2 | 13.4 | 24.1 |
Development set results are shown in Table 1. We refer to the baseline model as “Enc-Dec” and the models with multitask training as “Enc-Dec + [auxiliary task]-[layer]”. Adding phoneme recognition as an auxiliary task at layer 3, either with a separate LSTM decoder or with CTC, reduces both the character error rates and the final word error rates.
In order to determine whether the improved performance is a basic multitask training effect or is specific to the low-level application of the loss, we compare these results to those of adding the phoneme decoder at the topmost layer (Enc-Dec + PhoneDec-4). The top-layer application of the phoneme loss produces worse performance than having the supervision at the lower (third) layer. Finally, we obtain the best results by adding both phoneme decoder supervision at the third layer and frame-level state supervision at the second layer (Enc-Dec + PhoneDec-3 + State-2). The results support the hypothesis that lower-level supervision is best provided at lower layers. Table 2 provides test set results, showing the same pattern of improvement on both the SWB and CHE subsets. For comparison, we also include a variety of other recent results with neural end-to-end approaches on this task. Our baseline model has better performance than the most similar previous encoder-decoder result [7]. With the addition of the low-level auxiliary task training, our models are competitive with all of the previous end-to-end systems that do not use a language model.
Figure 2 shows the training set log-likelihood for the baseline model and two multitask variants. The plot suggests that multitask training helps with optimization (improving the training error). Training error is very similar for both multitask models, while the development set performance is better for one of them (see Table 1), suggesting that there may also be an improved generalization effect and not only improved optimization.
Table 2: Test set WER (%) on the Eval2000 SWB and CHE subsets and on the full set.
| Model | SWB | CHE | Full |
|---|---|---|---|
| Enc-Dec + PhoneDec-3 | 24.5 | 40.6 | 32.6 |
| Enc-Dec + PhoneDec-4 | 25.4 | 41.9 | 33.7 |
| Enc-Dec + PhoneCTC-3 | 24.6 | 41.3 | 33.0 |
| Enc-Dec + State-2 | 24.7 | 42.0 | 33.4 |
| Enc-Dec + PhoneDec-3 + State-2 | 23.1 | 40.8 | 32.0 |
| Lu et al. [7] | | | |
| Enc-Dec (word) + 3-gram | 25.8 | 46.0 | 36.0 |
| Maas et al. [8] | | | |
| CTC + 3-layer RNN LM | 21.4 | 40.2 | 30.8 |
| Zweig et al. [9] | | | |
| CTC + Char Ngram | 19.8 | 32.1 | — |
| CTC + Dictionary + Word Ngram | 14.0 | 25.3 | — |
We have presented a multitask training approach for deep end-to-end ASR models in which lower-level task losses are applied at lower levels, and we have explored this approach in the context of attention-based encoder-decoder models. Results on Switchboard and CallHome show consistent improvements over baseline attention-based models and support the hypothesis that lower-level supervision is more effective when applied at lower layers of the deep model. We have compared several types of auxiliary tasks, obtaining the best performance with a combination of a phoneme decoder and frame-level state loss. Analysis of model training and performance suggests that the addition of auxiliary tasks can help in either optimization or generalization.
Future work includes studying a broader range of auxiliary tasks and model configurations. For example, it would be interesting to study even deeper models and word-level output, which would allow for more options of intermediate tasks and placements of the auxiliary losses. Viewing the approach more broadly, it may be fruitful to also consider higher-level task supervision, incorporating syntactic or semantic labels, and to view the ASR output as an intermediate output in a more general hierarchy of tasks.
We are grateful to William Chan for helpful discussions, and to the speech group at TTIC, especially Shane Settle, Herman Kamper, Qingming Tang, and Bowen Shi for sharing their data processing code. This research was supported by a Google faculty research award.
[1] M. Gales and S. Young, “The application of hidden Markov models in speech recognition,” Foundations and Trends in Signal Processing, vol. 1, 2008.
[2] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013.
[3] D. Povey and B. Kingsbury, “Evaluation of proposed modifications to MPE for large scale discriminative training,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015.
[6] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[7] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[8] A. L. Maas, Z. Xie, D. Jurafsky, and A. Y. Ng, “Lexicon-free conversational speech recognition with neural networks,” in North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL HLT), 2015.
[9] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, “Advances in all-neural speech recognition,” CoRR, vol. abs/1609.05935, 2016.
[10] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.
[11] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European Conference on Computer Vision (ECCV), 2016.
[12] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. E. Hinton, “Grammar as a foreign language,” in Neural Information Processing Systems (NIPS), 2015.
[13] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, “Google’s multilingual neural machine translation system: Enabling zero-shot translation,” CoRR, vol. abs/1611.04558, 2016.
[14] A.-r. Mohamed, G. E. Hinton, and G. Penn, “Understanding how deep belief networks perform acoustic modelling,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.
[15] T. Nagamine, M. L. Seltzer, and N. Mesgarani, “On the role of nonlinear transformations in deep neural network acoustic models,” in Interspeech, 2016.
[16] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV), 2014.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Neural Information Processing Systems (NIPS), 2014.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
[20] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Neural Information Processing Systems (NIPS), 2015.
[21] R. Caruana, “Multitask learning,” Machine Learning, 1997.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research (JMLR), 2011.
[23] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” in International Conference on Learning Representations (ICLR), 2016.
[24] A. Eriguchi, Y. Tsuruoka, and K. Cho, “Learning to parse and translate improves neural machine translation,” CoRR, vol. abs/1702.03525, 2017.
[25] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” CoRR, vol. abs/1609.06773, 2016.
[26] W. Chan and I. Lane, “On online attention-based speech recognition and joint Mandarin character-Pinyin training,” in Interspeech, 2016.
[27] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[28] A. Søgaard and Y. Goldberg, “Deep multi-task learning with low level tasks supervised at lower layers,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
[29] K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher, “A joint many-task model: Growing a neural network for multiple NLP tasks,” CoRR, vol. abs/1611.01587, 2016.
[30] K. Rao and H. Sak, “Multi-accent speech recognition with hierarchical grapheme based models,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
[31] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, 1997.
[32] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” CoRR, vol. abs/1409.2329, 2014.
[33] A. Graves, S. Fernández, and F. Gomez, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2006.
[34] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1992.
[35] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, “On using very large target vocabulary for neural machine translation,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2015.
[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[37] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.