1 Introduction and Background
Automatic speech recognition (ASR), the technology that enables computers to recognize and transcribe spoken language into text, is widely used across many applications. For the past few decades, ASR relied on intricate traditional techniques such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). These traditional models also require hand-crafted pronunciation dictionaries and predefined alignments between audio and phonemes [2, 3]. Although such models achieve state-of-the-art accuracies on most audio corpora, it is challenging to develop them without substantial acoustics expertise. Benefiting from the rapid development of deep learning, several end-to-end ASR models have therefore been proposed in recent years.
Connectionist temporal classification (CTC) based models and sequence-to-sequence (seq2seq) models with attention are the two major approaches to end-to-end ASR. Both address the problem of variable-length input audio and output text. Deep Speech 2, proposed by Baidu Silicon Valley AI Lab in 2016, makes full use of CTC and RNNs and achieved state-of-the-art recognition accuracy. On the seq2seq side, Chorowski et al. applied a seq2seq model with an attention mechanism to speech recognition. However, its accuracy is unsatisfactory, since the alignment estimated by the attention mechanism is easily corrupted by noise, especially in real-environment tasks.
To overcome this misalignment problem, a combination of CTC and an attention-based seq2seq model was proposed by Watanabe et al. in 2017. The key to this joint CTC-attention model is training a shared encoder with both the CTC and attention-decoder objectives simultaneously. This approach improves both training speed and recognition accuracy.
This paper is partly inspired by the above method. Our main contributions include exploring different encoder and decoder network architectures and adopting several optimization methods such as attention smoothing and L2 regularization. We demonstrate that our system outperforms other published end-to-end ASR models in WER on the LibriSpeech dataset.
2 Related Work
2.1 Hybrid CTC-attention architecture
The idea of this architecture is to use CTC as an auxiliary objective function when training the attention-based seq2seq network. Fig. 1 illustrates the architecture: the encoder consists of several convolutional neural network (CNN) layers followed by bidirectional long short-term memory (BiLSTM) layers, while the decoder includes a CTC module and an attention-based network. Using CTC alongside the attention decoder makes the network more robust, since CTC helps acquire appropriate alignments in noisy conditions; it also accelerates training.
CTC, introduced by Graves et al., provides a method to train RNNs without any prior alignment between inputs and outputs. Suppose the length of the input sequence is $T$. The probability of a CTC path $\pi = (\pi_1, \dots, \pi_T)$ is

$$p(\pi \mid X) = \prod_{t=1}^{T} q_t(\pi_t),$$

where $q_t(\pi_t)$ denotes the softmax probability of outputting label $\pi_t$ at frame $t$. The likelihood of the label sequence $y$ can then be computed as

$$p(y \mid X) = \sum_{\pi \in \Phi(y)} p(\pi \mid X),$$

where $\Phi(y)$ is the set of all possible CTC paths that can be mapped to $y$. The CTC loss is therefore

$$\mathcal{L}_{\mathrm{CTC}} = -\ln p(y \mid X).$$

As for the attention decoder, the probability of the label at each step $u$ depends on the input features $X$ and the previous labels $y_{1:u-1}$. The probability of the entire sequence is

$$p(y \mid X) = \prod_{u} p(y_u \mid X, y_{1:u-1}).$$

Following the hybrid CTC-attention framework, the network is trained with a multi-task objective that linearly interpolates the two losses with a weight $\lambda \in [0, 1]$:

$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}}.$$
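To make these quantities concrete, the following minimal numpy sketch (not the implementation used in our system) evaluates the CTC loss on a toy problem by brute-force enumeration of all paths, directly mirroring the sums and products above; practical implementations use the forward-backward recursion instead. All sizes and values are illustrative.

```python
# Toy numpy sketch of the CTC quantities above: p(pi|X), p(y|X) and L_CTC,
# computed by brute-force path enumeration (real systems use forward-backward).
import itertools
import numpy as np

BLANK = 0                      # index of the CTC blank symbol
vocab = [BLANK, 1, 2]          # blank plus two labels
T = 4                          # number of encoder frames

# q[t, k]: softmax probability of emitting symbol k at frame t (random toy values)
rng = np.random.default_rng(0)
q = rng.random((T, len(vocab)))
q /= q.sum(axis=1, keepdims=True)

def collapse(path):
    """Map a CTC path to a label sequence: merge repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_loss(y):
    """-ln of the summed probability of all CTC paths that collapse to y."""
    p_y = 0.0
    for path in itertools.product(vocab, repeat=T):      # every possible CTC path
        if collapse(path) == tuple(y):
            p_y += np.prod([q[t, k] for t, k in enumerate(path)])
    return -np.log(p_y)

print(ctc_loss([1, 2]))        # CTC loss of the toy label sequence (1, 2)
```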
2.2 Unit selection
Methods based on a large lexicon, such as phoneme-based or word-based ASR systems, cannot resolve the out-of-vocabulary (OOV) problem. Starting from LAS, seq2seq models therefore introduced character-based units. By tying frame-level acoustic information in audio clips to the corresponding characters, the OOV problem is resolved to some extent. However, many characters in English words are silent, and the same character can be pronounced differently in different words (e.g. "a" in "apple" and "approve"), so character-level decoding relies heavily on the sequence relationships modeled by the RNN rather than on the acoustic information in the audio frames, which makes character-level decoding uncertain. Considering these issues, a subword-based structure can resolve the OOV problem on one hand and learn the relationship between acoustic and character information on the other. An effective and fast method for generating subwords is byte-pair encoding (BPE), a compression algorithm that iteratively replaces the most frequent pair of units (or bytes) with an unused unit, eventually generating a number of new units equal to the number of iterations; a minimal sketch of this merge loop is shown below.
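The following is a small sketch of the BPE merge loop on a toy character-level corpus; the corpus, symbol inventory, and number of merges are purely illustrative, and our experiments rely on a standard BPE implementation applied to the training transcripts.

```python
# Minimal BPE merge loop: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged unit."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(corpus, num_merges):
    """corpus: dict mapping a tuple of symbols (one word) to its frequency."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                  # most frequent pair
        merges.append(best)
        corpus = {merge_word(word, best): freq for word, freq in corpus.items()}
    return merges

# Toy corpus: each word split into characters plus an end-of-word marker.
corpus = {('l', 'o', 'w', '</w>'): 5,
          ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
          ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6}
print(bpe_merges(corpus, 3))   # learned merge operations, in order
```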
3 Proposed Improvements
In this section, we detail our optimizations and improvements to the hybrid CTC-attention architecture described above. We present our changes to the encoder-decoder architecture and the attention mechanism in Sections 3.1 and 3.2, respectively.
3.1 Encoder-Decoder architecture
The authors of Espnet stack several BiLSTM layers on top of a few convolutional layers. The outputs of the last BiLSTM layer serve as inputs to both the CTC and attention decoders, as shown in Fig. 1. Our major improvement is inserting a BiLSTM layer, owned solely by the CTC branch, between the top shared encoder layer and the fully connected (FC) layer feeding the CTC decoder. The entire hybrid architecture is shown in Fig. 2.
According to our experiments in Section 4, setting the CTC weight $\lambda$ in the multi-task objective to a smaller value makes the network perform better. However, a low weight raises a new problem: since a lower $\lambda$ yields smaller gradients from the CTC loss during back-propagation, the shared encoder focuses more on the attention module than on the CTC module, which limits the performance of the CTC decoder. To compensate for this, we introduce a dedicated BiLSTM layer feeding the CTC decoder, as sketched below.
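The following condensed PyTorch sketch illustrates this branching: the shared encoder output feeds the attention decoder directly, while the CTC branch first passes through its own BiLSTM. The layer sizes and module names are illustrative assumptions, not our exact implementation.

```python
# Sketch of the CTC branch with its own BiLSTM on top of the shared encoder output.
import torch
import torch.nn as nn

class HybridBranches(nn.Module):
    def __init__(self, enc_dim=1024, hidden=1024, vocab=5000):
        super().__init__()
        # BiLSTM owned solely by the CTC branch (not shared with the attention decoder).
        self.ctc_blstm = nn.LSTM(enc_dim, hidden, batch_first=True, bidirectional=True)
        self.ctc_fc = nn.Linear(2 * hidden, vocab + 1)    # +1 for the CTC blank
        self.att_decoder = None                           # attention decoder omitted here

    def ctc_logits(self, shared_enc_out):
        # shared_enc_out: (batch, frames, enc_dim) from the shared encoder
        h, _ = self.ctc_blstm(shared_enc_out)
        return self.ctc_fc(h)                             # (batch, frames, vocab + 1)

enc_out = torch.randn(2, 100, 1024)                       # dummy shared encoder output
print(HybridBranches().ctc_logits(enc_out).shape)         # torch.Size([2, 100, 5001])
```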
3.2 Attention smoothing
Inspired by Chorowski et al., we use a location-based attention mechanism in our implementation. Specifically, the location-based attention energies are computed as

$$f_u = F * \alpha_{u-1}, \qquad e_{u,t} = w^{\top} \tanh(W s_{u-1} + V h_t + U f_{u,t} + b),$$

where $h_t$ is the encoder output at frame $t$, $s_{u-1}$ is the previous decoder state, and $f_u$ is obtained by convolving the previous attention weights $\alpha_{u-1}$ with the filter $F$; the attention weights are then a softmax over the frames:

$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'} \exp(e_{u,t'})}.$$

In our speech recognition system, subwords are chosen as the model units, which require more sequence context information than character-based units. However, the attention score distribution computed with the softmax above is usually very sharp. Hence, we apply an attention smoothing mechanism instead, computed as

$$\alpha_{u,t} = \frac{\sigma(e_{u,t})}{\sum_{t'} \sigma(e_{u,t'})},$$

where $\sigma$ denotes the logistic sigmoid.
The above mechanism successfully smooths the attention score distributions and thus keeps more context information for subword-based decoding; a small illustration follows.
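The following small sketch contrasts the usual softmax normalization of the attention energies with the sigmoid-based smoothing; the energy values are arbitrary and only illustrate the qualitative difference.

```python
# Softmax vs. sigmoid-normalized ("smoothed") attention weights over frames t.
import torch

e = torch.tensor([2.0, 0.5, -1.0, 3.0])              # attention energies over frames

alpha_softmax = torch.softmax(e, dim=-1)              # sharp: ~[0.25, 0.06, 0.01, 0.68]
sig = torch.sigmoid(e)
alpha_smooth = sig / sig.sum(dim=-1, keepdim=True)    # smoother: ~[0.32, 0.23, 0.10, 0.35]

print(alpha_softmax)
print(alpha_smooth)
```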
4 Experiments
4.1 Experimental Setup
We train and test our implementation on the LibriSpeech dataset. Specifically, we use train-clean-100, train-clean-360, and train-other-500 as our training set and dev-clean as our validation set. For evaluation, we report word error rates (WERs) on the test-clean, test-other, dev-clean, and dev-other subsets. We also adopt 3-fold speed perturbation (0.9x/1.0x/1.1x) for data augmentation. 80-dimensional Mel-filterbank features are generated using a sliding window of length 25 ms with a 10 ms stride, and feature extraction is performed with the KALDI toolkit. Subword units are extracted from all the training transcripts using the BPE algorithm, with the number of subword units set to 5000.
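Feature extraction itself is done with the KALDI toolkit; the snippet below is only an illustrative equivalent using torchaudio's Kaldi-compatible frontend with the same settings (80 Mel bins, 25 ms windows, 10 ms stride), applied to a dummy waveform.

```python
# Illustrative Kaldi-style log-Mel filterbank extraction with torchaudio.
import torch
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)                  # dummy 16 kHz audio, 1 second
fbank = kaldi.fbank(waveform,
                    sample_frequency=16000,
                    num_mel_bins=80,
                    frame_length=25.0,            # window length in ms
                    frame_shift=10.0)             # stride in ms
print(fbank.shape)                                # roughly (98, 80): frames x Mel bins
```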
The encoder is a 4-layer CNN followed by a 7-layer BiLSTM with 1024 cells per direction in each layer. In the CNN part, the input features are downsampled to 1/4 of the original frame rate through two max-pooling layers. The decoder consists of two branches: one branch is a one-layer BiLSTM followed by a CTC decoder, and the other is the attention decoder with a 2-layer LSTM of 1024 cells per layer. A rough sketch of the convolutional front-end follows.
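The channel counts and kernel sizes below are assumptions for illustration, but the two stride-2 max-pooling stages reproduce the 1/4 downsampling in time described above.

```python
# Rough sketch of the 4-layer CNN front-end with 1/4 downsampling via two max-pools.
import torch
import torch.nn as nn

cnn_frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                   # /2 in time and frequency
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                   # /4 in total
)

x = torch.randn(2, 1, 400, 80)                         # (batch, channel, frames, Mel bins)
print(cnn_frontend(x).shape)                           # torch.Size([2, 128, 100, 20])
```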
The AdaDelta algorithm with initial hyper-parameter eps = 1e-8 is used for optimization, and L2 regularization and gradient clipping are applied. We measure the accuracy on the validation set every 1000 iterations and decay eps by a factor of 0.1 whenever the average validation accuracy drops. All experiments are performed on 4 Tesla P40 GPUs with a fixed batch size per GPU.
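A hedged sketch of this optimization setup is shown below; apart from eps = 1e-8 and the 0.1 decay factor, the weight-decay strength, clipping threshold, and placeholder model are illustrative assumptions.

```python
# Sketch of the training setup: AdaDelta (eps=1e-8), L2 regularization via weight
# decay, gradient clipping, and eps decay when validation accuracy drops.
import torch

model = torch.nn.Linear(80, 5001)                       # stand-in for the full network
optimizer = torch.optim.Adadelta(model.parameters(), eps=1e-8, weight_decay=1e-6)

def clip_and_step(loss, max_norm=5.0):
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()

def decay_eps(factor=0.1):
    """Shrink AdaDelta's eps when the average validation accuracy drops."""
    for group in optimizer.param_groups:
        group['eps'] *= factor
```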
Our language model is a two-layer LSTM with 1536 units per layer, trained on the text of 14,500 public-domain books that is commonly used as training material for LibriSpeech language models. The SGD algorithm is used for optimization, with an initial learning rate of 1.0 decayed by 0.9 every 2 epochs. For decoding, we use beam search with a beam size of 20. A compact sketch of the language model follows.
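The sketch below shows the shape of such a language model, a 2-layer LSTM with 1536 units per layer over the subword vocabulary; the embedding size is an illustrative assumption.

```python
# Compact sketch of the subword RNN-LM: 2-layer LSTM, 1536 units per layer.
import torch
import torch.nn as nn

class SubwordLM(nn.Module):
    def __init__(self, vocab=5000, emb=512, hidden=1536, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.proj(h), state                       # logits over the next subword

tokens = torch.randint(0, 5000, (4, 20))                 # dummy batch of subword ids
logits, _ = SubwordLM()(tokens)
print(logits.shape)                                      # torch.Size([4, 20, 5000])
```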
4.2 Results
Fig. 3 shows the accuracy curve during training, from which we can see that the model converges after 35000 iterations. The perplexity of our trained RNN-LM is 50.4 on the training set and 46.9 on the test-clean subset. We conduct experiments with different numbers of encoder layers and with the additional BiLSTM layer on the CTC branch. Results are shown in Table 1, from which we make the following observations: both increasing the number of BiLSTM layers in the encoder and adding the BiLSTM layer on the CTC branch improve WER. Moreover, WER can be reduced by about 25% using our trained RNN-LM.
Table 1: WER for different numbers of encoder layers, without LM and with the RNN-LM.
Table 2: Comparison with other end-to-end systems on LibriSpeech (WER, %).

| Model | test-clean | test-other | dev-clean | dev-other |
|---|---|---|---|---|
| Baidu DS2 + LM | 5.15 | 12.73 | - | - |
| Espnet + LM | 4.6 | 13.7 | 4.5 | 13.0 |
| I-Attention + LM | 3.82 | 12.76 | 3.54 | 11.52 |
| Ours + no LM | 4.43 | 13.5 | 4.37 | 13.1 |
| Ours + LM | 3.34 | 10.54 | 3.15 | 9.98 |
Next, we compare different weights between the CTC loss and the attention loss, varying $\lambda$ in steps of 0.1. The results are shown in Fig. 4. Using a pure attention-based or a pure CTC-based system yields inferior performance. The curve also shows that decreasing $\lambda$ leads to better WER in the hybrid system, which is consistent with the original motivation for the CTC decoder: the CTC module mainly assists monotonic alignment and speeds up training convergence, while the decoding quality of the hybrid system relies mainly on the attention decoder. As Fig. 4 shows, the best tuned $\lambda$ is 0.1.
Finally, we compare our results with other reported state-of-the-art end-to-end systems on the LibriSpeech dataset in Table 2. The results show that our system achieves better WERs than other known end-to-end ASR models.
5 Conclusion
In summary, we explore a variety of structural improvements and optimization methods for the hybrid CTC-attention-based ASR system. By applying the CTC-branch BiLSTM, attention smoothing, and several other techniques, our system achieves a word error rate (WER) of 4.43% without LM and 3.34% with an RNN-LM on the test-clean subset of the LibriSpeech corpus.
Future work will focus on optimizing both the decoder structure and the training method, such as fine-tuning the CTC decoder branch after training the shared encoder. Another direction is applying this technique to other languages such as Mandarin, which contains many polyphonic words that must be handled during decoding.
-  Lawrence R Rabiner and Biing-Hwang Juang, Fundamentals of speech recognition, vol. 14, PTR Prentice Hall Englewood Cliffs, 1993.
-  Lawrence R Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” in Readings in speech recognition, pp. 267–296. Elsevier, 1990.
-  Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
-  Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
-  Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
-  Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4835–4839.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning, 2006, pp. 369–376.
-  William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
-  Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
-  Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
-  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
-  Matthew D Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
-  Tomoki Hayashi, Shinji Watanabe, Suyoun Kim, Takaaki Hori, and John R. Hershey, “Espnet: end-to-end speech processing toolkit,” https://github.com/espnet/espnet/pull/407/commits/, 2018.
-  Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.