Conventional deep neural network (DNN) HMM hybrid speech recognition systems [11, 22] usually require two steps in the training stage. First, a prior acoustic model such as a Gaussian mixture model (GMM) is used to generate HMM state alignments for the speech training data. Then, based on the acoustic features and the one-hot training targets generated from the state alignments, neural networks are trained to predict the frame-level state posterior probabilities. This separate two-step training process makes optimizing the acoustic model less efficient.
Recently, various end-to-end (E2E) models [6, 13, 14, 8, 7, 4, 2, 3, 20, 15, 18, 10] have been proposed to bypass the label alignment stage and directly learn the transduction from a sequence of acoustic features to a sequence of probabilities over output tokens. These E2E systems can be categorized into CTC based models [7, 8, 1, 15], sequence-to-sequence attention based models [5, 3, 4, 9], and combinations of CTC with sequence-to-sequence attention based models [24, 13, 12, 6].
Among these, CTC based models are attractive due to their simplicity in training and efficiency in decoding. In CTC based models, a special blank label is introduced to identify the less informative frames. In addition, CTC based systems allow repetition of labels. In this way, CTC based models automatically infer the alignment between speech frames and labels (usually with a delay in time), which removes the state alignment step in training. Using highly efficient greedy decoding with no lexicon or language model involved, CTC based models give competitive results. In greedy decoding, the prediction is the concatenation of the tokens that correspond to the spikes in the posterior distribution.
The CTC loss function imposes a conditional independence constraint on the output tokens given the whole input feature sequence, so it relies on the hidden feature vector of the current frame to make predictions. Armed with a memory mechanism, the LSTM's hidden feature vector for the current frame is able to capture information from previous frames. In other words, the current frame label is not predicted exclusively from the current frame features.
In LSTM-CTC based models, to get the posterior probability of the output labels, a projection matrix maps the hidden feature vector to the final output layer. The hidden feature vector is the output of the last layer in the multi-layer LSTMs or bidirectional LSTMs. The output layer has the same dimension as the training labels. Phonemes or characters are usually used as labels, so the output layer has a smaller dimension than the LSTM output. The projection matrix therefore becomes the bottleneck that limits the expressive capability of LSTM-CTC based models. To address similar issues in language modeling, a mixture of softmaxes method [23] is used to improve performance. In this paper, we propose to use a high rank projection layer to replace the single projection matrix to improve the expressiveness of LSTM-CTC based E2E models.
In the high rank projection layer, one hidden feature vector is first mapped to multiple vectors by a set of projection matrices together with a non-linear activation function. A weighted combination of these vectors is used as the output of the high rank projection layer. The non-linear activation function breaks the potential linear correlation among the output vectors that are obtained by mapping one feature vector via several projection matrices. As a result, the proposed projection layer has a higher rank than a mapping of feature vectors with one single projection matrix.
One simple approach to decoding with CTC based models is to concatenate the non-blank labels corresponding to the posterior spikes and to remove consecutive repeated output labels. However, such a simple greedy decoding method lacks the lexicon and language model information that could be leveraged to constrain the search path in decoding. In EESEN [15], a WFST based method is applied to integrate the CTC frame labels, lexicon and language model into one search graph. In this work, we follow EESEN’s approach to decoding with CTC based models.
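The greedy decoding described above can be sketched in a few lines. This is a minimal illustration rather than the decoder used in the experiments; the function name `ctc_greedy_decode`, the choice of index 0 for the blank label, and the toy posteriors are all assumptions made for the example.

```python
def ctc_greedy_decode(posteriors, blank=0):
    """Take the best label per frame, merge consecutive repeats, drop blanks."""
    # Best-scoring label index at every frame (the "spikes" for non-blank labels).
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in posteriors]
    decoded = []
    previous = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame's
        # label and is not the blank label.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy posteriors over the labels {blank, 1, 2} for six frames (made-up values).
frames = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
result = ctc_greedy_decode(frames)
print(result)
```

In this toy input the blank frame between the two occurrences of label 1 is what allows the same label to be emitted twice.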
In CTC training, the actual label sequence is obtained by inserting blank labels at the beginning, at the end and between every pair of consecutive labels in the original label sequence. The blank label therefore has a very high prior probability. That is one reason why, for a trained CTC model, the majority of frames take blank as their label and the non-blank labels occur only in very narrow regions with a peaky distribution. To address this issue, similar to EESEN [15], we use the label distribution of the augmented label sequences used in CTC training as a prior to normalize the posterior probability distribution.
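As a rough illustration of this prior normalization, the sketch below estimates label priors from blank-augmented label sequences and divides a frame posterior by them. It assumes that "normalize" means an element-wise division by the prior (without renormalizing to sum to one); the helper names, the blank index 0, and the toy data are hypothetical.

```python
from collections import Counter


def insert_blanks(labels, blank=0):
    """Insert a blank at the start, at the end and between consecutive labels."""
    augmented = [blank]
    for label in labels:
        augmented.extend([label, blank])
    return augmented


def label_priors(label_sequences, num_labels, blank=0):
    """Relative frequency of each label in the blank-augmented sequences."""
    counts = Counter()
    for labels in label_sequences:
        counts.update(insert_blanks(labels, blank))
    total = sum(counts.values())
    return [counts[k] / total for k in range(num_labels)]


def normalize_by_prior(posterior, priors, floor=1e-8):
    """Divide each posterior by its prior (assumed form of the normalization)."""
    return [p / max(q, floor) for p, q in zip(posterior, priors)]


# Two toy training label sequences over the labels {blank=0, 1, 2}.
priors = label_priors([[1, 2], [1]], num_labels=3)
scaled = normalize_by_prior([0.7, 0.2, 0.1], priors)
print(priors, scaled)
```

Because the blank label dominates the augmented sequences, its prior is the largest, so the division reduces the relative weight of blank frames.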
We evaluate the proposed high rank LSTM-CTC based end-to-end speech recognition on the Wall Street Journal (WSJ) [17] and LibriSpeech [16] corpora. In both experiments, no external data or data augmentation is applied. On both data sets, the proposed models outperform the baseline model. For easy comparison and reproduction of results, the source code for this study is released as an open source project (https://github.com/mobvoi/lstm_ctc).
The rest of the paper is organized as follows. In Section 2, we briefly review LSTM-CTC based models in E2E speech recognition systems and then describe the proposed high rank LSTM-CTC based models. In Section 3, we present the experiments on the WSJ and LibriSpeech benchmark data sets. Finally, we give our conclusions.
2 A High Rank LSTM-CTC Based Model
Let $X = (x_1, \dots, x_T)$ denote the input sequence of acoustic feature vectors with sequence length $T$, where $x_t \in \mathbb{R}^{d}$. Given $X$, the E2E ASR system gives a sequence of posterior probability vectors $Y = (y_1, \dots, y_T)$ of the output labels, where $y_t$ is the posterior probability vector of the output labels at position $t$. The dimension of the posterior probability vector is $K$, the number of target labels. The target labels are usually phonemes or characters. In this paper, we only use phonemes as output labels.
One typical problem for E2E speech recognition is that the length $U$ of the output label sequence is often shorter than the length $T$ of the input speech frames. To deal with this issue in training, CTC introduces a special blank label $\varnothing$ that is inserted between consecutive labels, and allows repetitions of labels. The label sequence $l$ is thus expanded to a path $\pi$ that has the same length as the input sequence. To get the posterior probability of a label sequence $l$, CTC needs to compute and sum the posterior probabilities of all the possible paths in $\Omega(l)$, the set of paths that map to $l$. Under the constraint that, given the input sequence, the posterior probabilities of the labels in an output sequence are conditionally independent of each other, the CTC loss is formulated as follows:
$$\mathcal{L}_{\mathrm{CTC}} = -\ln P(l \mid X) = -\ln \sum_{\pi \in \Omega(l)} \prod_{t=1}^{T} y_t(\pi_t) \tag{1}$$
where $y_t(\pi_t)$ is the posterior probability of label $\pi_t$ at frame $t$.
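The sum over all paths in the CTC loss is usually computed with the standard forward recursion rather than by enumeration. The sketch below is an illustration under assumptions (blank index 0, toy posteriors, hypothetical function names), not the training code of this work; it checks the recursion against a brute-force enumeration of all paths on a tiny example.

```python
import itertools
import math


def collapse(path, blank=0):
    """Map a frame-level path to a label sequence: merge repeats, drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out


def ctc_prob_brute_force(y, labels, blank=0):
    """Sum the probabilities of every length-T path that collapses to `labels`."""
    T, K = len(y), len(y[0])
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == list(labels):
            total += math.prod(y[t][s] for t, s in enumerate(path))
    return total


def ctc_prob_forward(y, labels, blank=0):
    """Standard CTC forward recursion over the blank-augmented label sequence."""
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(y)
    alpha = [0.0] * S
    alpha[0] = y[0][ext[0]]
    if S > 1:
        alpha[1] = y[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * y[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Toy posteriors: 4 frames, 3 labels (index 0 is the blank). Values are made up.
y = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5], [0.7, 0.1, 0.2]]
p_forward = ctc_prob_forward(y, [1, 2])
p_brute = ctc_prob_brute_force(y, [1, 2])
loss = -math.log(p_forward)
print(p_forward, p_brute, loss)
```

The brute-force version enumerates $K^T$ paths and is only feasible for toy sizes, whereas the recursion is linear in the number of frames.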
More specifically, in LSTM-CTC models, the sequence of hidden feature vectors $H = (h_1, \dots, h_T)$ is obtained by feeding the input acoustic features $X$ to multiple layers of LSTMs or bidirectional LSTMs. A projection matrix $W$ shared across the whole sequence is used to map the hidden feature vectors to logit vectors $Z = (z_1, \dots, z_T)$, each of which has $K$ nodes corresponding to the labels, including the blank label $\varnothing$. The projection can be formulated as follows:
$$z_t = W h_t \tag{2}$$
A softmax activation function is then applied to each logit vector $z_t$ to get the posterior probability vector $y_t$. Normally the number of output labels is relatively small. For example, there are 71 stressed phones in the WSJ data set and 43 unstressed phones in the LibriSpeech data set. This projection matrix becomes the bottleneck for the expressiveness of LSTM-CTC models. To address this issue, we propose a high rank projection layer to replace the single projection matrix.
2.2 A High Rank Projection Layer
As illustrated in Fig. 1, in the high rank projection layer, a set of projection matrices is used to map the input hidden feature vector $h_t$ (of dimension $D$) at frame $t$ to a set of logit vectors $z_t^1, \dots, z_t^N$ (each of dimension $K$):
$$[z_t^1, \dots, z_t^N] = f(W^p h_t)$$
where $f(\cdot)$ is the non-linear activation function and $N$ is the predefined number of projection matrices in this layer. $W^p = [W_1, \dots, W_N]$ is the concatenation of a set of projection matrices. Each $W_n$ is of dimension $K \times D$. The logit vector $z_t$ at speech frame $t$ is represented as an interpolation of the set of logit vectors as follows:
$$z_t = \sigma \sum_{n=1}^{N} w_t^n z_t^n$$
where $\sigma$ is a predefined scale factor to control the smoothness of the posterior probabilities. $w_t^n$ is the combination weight computed at time stamp $t$ for the $n$-th logit vector. It is the softmax after mapping the hidden feature vector $h_t$ to an $N$-dimensional vector via $W^c$:
$$w_t^n = \frac{\exp\big((W^c h_t)_n\big)}{\sum_{m=1}^{N} \exp\big((W^c h_t)_m\big)}$$
The projection matrices $W^p$ and the weight matrix $W^c$ are all trained jointly with the rest of the network parameters.
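To make the data flow of this layer concrete, the following sketch computes one frame's output in plain Python. It is an illustration under stated assumptions, not the released implementation: `tanh` is assumed as the non-linear activation, the scale factor is assumed to multiply the interpolated logits, and the dimensions, random weights and names (`high_rank_projection`, `proj_mats`, `W_c`) are hypothetical.

```python
import math
import random


def matvec(W, h):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def high_rank_projection(h, proj_mats, W_c, sigma=1.0, activation=math.tanh):
    """Weighted combination of N non-linearly projected logit vectors."""
    # Combination weights: softmax over an N-dimensional projection of h.
    weights = softmax(matvec(W_c, h))
    # One K-dimensional logit vector per projection matrix, after the activation.
    branch_logits = [[activation(z) for z in matvec(W_n, h)] for W_n in proj_mats]
    num_labels = len(branch_logits[0])
    # Interpolate the branches and apply the (assumed multiplicative) scale factor.
    logits = [sigma * sum(w * z[k] for w, z in zip(weights, branch_logits))
              for k in range(num_labels)]
    return softmax(logits), weights


rng = random.Random(0)
D, K, N = 5, 3, 3  # toy sizes: hidden dimension, labels, projection matrices


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


h_t = [rng.uniform(-1.0, 1.0) for _ in range(D)]
proj_mats = [rand_matrix(K, D) for _ in range(N)]
W_c = rand_matrix(N, D)
y_t, w_t = high_rank_projection(h_t, proj_mats, W_c, sigma=2.0)
print(y_t, w_t)
```

In a trained model the projection matrices and the weight matrix would be learned jointly with the LSTM parameters; here they are random only to exercise the computation.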
2.2.1 Non-linear activation function and temperature factor
To get a high rank projection, a non-linear activation needs to be used to break the potential linear correlation among the projection matrices in the projection layer. Without the non-linear activation, the logit vector $z_t$ at speech frame $t$ can be formulated as follows:
$$z_t = \sigma \sum_{n=1}^{N} w_t^n W_n h_t = \Big(\sigma \sum_{n=1}^{N} w_t^n W_n\Big) h_t \tag{7}$$
which is essentially the same as equation (2). The temperature factor $\sigma$ controls the smoothness of the label output distribution. The weighted interpolation usually smooths the output probability distribution. To make the output probability distribution more discriminative, in this study we choose $\sigma$ so as to sharpen the output distribution.
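The collapse to a single projection when the activation is removed can be checked numerically. The sketch below uses random toy weights, assumes a multiplicative scale factor, and uses hypothetical names; it compares the interpolated linear projections with a single effective matrix applied to the same hidden vector.

```python
import math
import random


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


rng = random.Random(0)
D, K, N, sigma = 4, 3, 2, 2.0  # toy sizes and an arbitrary scale factor

h = [rng.uniform(-1.0, 1.0) for _ in range(D)]
proj = [[[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(K)]
        for _ in range(N)]
W_c = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(N)]

# Combination weights for this frame.
w = softmax(matvec(W_c, h))

# Mixture of purely linear projections (no activation function).
branches = [matvec(proj[n], h) for n in range(N)]
mixed = [sigma * sum(w[n] * branches[n][k] for n in range(N)) for k in range(K)]

# The same logits from one effective K x D matrix for this frame.
W_eff = [[sigma * sum(w[n] * proj[n][k][d] for n in range(N)) for d in range(D)]
         for k in range(K)]
single = matvec(W_eff, h)

max_gap = max(abs(a - b) for a, b in zip(mixed, single))
print(max_gap)
```

The two results agree up to floating point error, which is the sense in which the linear mixture has the same form as a single projection for each frame.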
3.1 Data Sets
We carry out experiments on the Wall Street Journal (WSJ) corpus [17] and the LibriSpeech corpus [16] to verify the performance of the proposed method. The WSJ corpus is a combination of the LDC93S6B and LDC94S13B data sets obtained from LDC. After data preparation, we get 81 hours of transcribed speech audio, part of which is selected as training data; the rest is used as validation data. The development data (dev93) consists of 503 utterances, and the evaluation data (eval92) contains 333 utterances. LibriSpeech is an open source speech corpus (http://www.openslr.org/12/) that has almost 1000 hours of read speech based on public domain audio books. Similar to the WSJ data preparation, among the 960 hours of training data, we select 95% of the data for model training and the remaining 5% for validation. In LibriSpeech, the development data and evaluation data are split into "clean" and "other" subsets.
In decoding, we use the trigram language model provided with WSJ. In the LibriSpeech experiment, to be consistent with previous studies, the provided standard unpruned four-gram language model (http://www.openslr.org/resources/11/4-gram.arpa.gz) is used in decoding.
In our experiments, phonemes are used as CTC labels. For the WSJ experiment, the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) is used as the lexicon for WFST graph building. Including the blank label, we extract 72 labels in total from the CMU dictionary. In the LibriSpeech experiment, we use the unstressed phoneme based lexicon (http://www.openslr.org/resources/11/librispeech-lexicon.txt), from which 44 labels are extracted as CTC labels. Due to the lack of forced alignment, CTC training cannot deal with words that have multiple pronunciations. For every word, only the first pronunciation is used to form the lexicon. We did not use other existing models to find the best pronunciation per occurrence.
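Keeping only the first pronunciation of each word can be sketched as follows. The line format (a word followed by whitespace-separated phones, with alternative pronunciations listed as repeated entries for the same word) is an assumption made for the example, as are the function names, the `<blk>` symbol and the toy entries; here the label inventory is collected from the filtered lexicon purely for illustration.

```python
def first_pronunciations(lexicon_lines):
    """Build a word -> phone list mapping, keeping the first entry per word."""
    lexicon = {}
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, phones = parts[0], parts[1:]
        # setdefault keeps the pronunciation that was seen first.
        lexicon.setdefault(word, phones)
    return lexicon


def label_inventory(lexicon, blank="<blk>"):
    """Collect the phone set and put the blank label in front."""
    phones = sorted({p for pron in lexicon.values() for p in pron})
    return [blank] + phones


# Toy lexicon lines (assumed format), with two words that have two entries each.
lines = [
    "READ R IY D",
    "READ R EH D",
    "THE DH AH",
    "THE DH IY",
]
lexicon = first_pronunciations(lines)
labels = label_inventory(lexicon)
print(lexicon, labels)
```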
3.2 Model Structure and Hyper-parameter Setup
For both experiments, a 120-dimensional feature vector that consists of 40-dimensional filter bank features together with their first and second order derivatives is calculated at each speech frame. The features are normalized via mean subtraction and variance normalization per speaker. The splice of the feature vectors from the left, current and right frames (a 360-dimensional feature vector in total) is used as the input to the bidirectional LSTM. To speed up training, frame skipping is used: two out of every three frames are skipped during training. Four layers of bidirectional LSTMs are used to get the hidden feature vectors. There are 320 hidden neurons in each LSTM layer, with peephole connections. The forget gate bias is set to 5. Batch size is set to 64 for experiments on LibriSpeech and 32 for experiments on WSJ. The Adam adaptive learning rate method is used. The initial learning rate is set to 0.001 for WSJ experiments and 0.0004 for LibriSpeech experiments. The learning rate is decayed by a factor of 0.7 for WSJ experiments and 0.5 for LibriSpeech experiments when the model does not improve on the validation data. For the proposed high rank LSTM-CTC based models, we set the number of projection matrices to be the same as the output label size to achieve the highest rank of the projection layer.
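The splicing and frame skipping steps can be illustrated with a short sketch. It assumes that edge frames are handled by repeating the first and last frame and that skipping is applied after splicing by keeping every third frame; neither detail is specified above, and the function names are hypothetical.

```python
def splice(frames, context=1):
    """Concatenate each frame with `context` neighbours on both sides."""
    num_frames = len(frames)
    spliced = []
    for t in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            # Clamp the index so that edge frames reuse the first/last frame.
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


def skip_frames(frames, keep_every=3, start=0):
    """Keep one frame out of every `keep_every` frames."""
    return frames[start::keep_every]


# Seven toy frames of 120-dimensional features; frame t is filled with value t.
frames = [[float(t)] * 120 for t in range(7)]
spliced = splice(frames)        # each frame becomes 3 * 120 = 360 dimensions
skipped = skip_frames(spliced)  # frames 0, 3 and 6 are kept
print(len(spliced[0]), len(skipped))
```

With these settings the LSTM sees one third of the frames, while each kept frame still carries its immediate left and right context.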
Because some GPU operations are non-deterministic in TensorFlow, models trained multiple times with the same settings will differ. For a fair comparison, we report the average word error rate (WER) of five different models trained with the same settings.
Table 1 gives the WER comparison for different models on the WSJ corpus. Compared with our baseline model (our-LSTM-CTC), the proposed model (our-HR-LSTM-CTC) achieves relative WER reductions on both dev93 and eval92. As shown in Eq. (7), our-MOM-LSTM-CTC is similar to the baseline model except that it has more weight parameters. The results in Table 1 confirm that, with the non-linear activation function and temperature factor removed, a simple mixture of different projection matrices does not improve over the baseline model.
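For reference, the quantities used in this comparison (WER, its average over several training runs, and the relative WER reduction) can be computed as in the sketch below. The sentences and WER values are made-up toy inputs for illustration only and are not results from this work; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Word error rate in percent over a list of reference/hypothesis pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def mean_wer(run_wers):
    """Average WER over several models trained with the same settings."""
    return sum(run_wers) / len(run_wers)


def relative_reduction(baseline, proposed):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Toy inputs only (not results from the paper).
toy_wer = wer(["the cat sat on the mat"], ["the cat sit on mat"])
toy_mean = mean_wer([10.0, 10.4, 9.8, 10.2, 9.6])
toy_rel = relative_reduction(10.0, 9.0)
print(toy_wer, toy_mean, toy_rel)
```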
Table 2 shows the WER comparison of different models on the LibriSpeech corpus. The proposed model (our-HR-LSTM-CTC) shows consistent behavior on both WSJ and LibriSpeech.
Tables 1 and 2 also compare the results with those of other models using CTC loss. Due to the lack of open-sourced data, scripts and code, it is difficult to test our models under exactly the same settings as the published results. To present the status of CTC loss on these two data sets, we only cite the published results here. Note that some of the comparisons are not fair, as the models are not trained on exactly the same data. CTC-PL [25] is a model trained with CTC loss together with policy learning to optimize WER. In CTC-PL, the training data is augmented through random perturbations of tempo, pitch, volume and temporal alignment, along with added random noise. DS2 [1] uses all the publicly available English corpora together with data augmentation as training data. E2E-att [24] combines sequence attention modeling with CTC loss. It uses an additional 800M words for language model training. When an LSTM based LM is used in decoding, E2E-att gets the state-of-the-art result on LibriSpeech. ESPNET [21] in Table 1 uses a combination of CTC loss and sequence-to-sequence loss. However, it does not use any effective method to leverage language model and lexicon information in decoding.
In this paper, a high rank projection layer is proposed to replace the bottleneck projection matrix in conventional LSTM-CTC based models for E2E speech recognition. The output of the high rank projection layer is a weighted combination of multiple vectors that are obtained by feeding the hidden feature vector to a set of projection matrices followed by a non-linear activation function. On two benchmark corpora, WSJ and LibriSpeech, the proposed high rank LSTM-CTC model outperformed the baseline CTC model, reducing WER relative to the baseline on dev93 and eval92 for WSJ and on the test-clean, test-other, dev-clean and dev-other subsets for LibriSpeech.
[1] (2015) Deep Speech 2: end-to-end speech recognition in English and Mandarin. CoRR abs/1512.0.
[2] (2018) Exploring neural transducers for end-to-end speech recognition. In Proceedings of ASRU.
[3] (2016) Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of ICASSP.
[4] (2017) State-of-the-art speech recognition with sequence-to-sequence models. CoRR abs/1712.0.
[5] (2015) Attention-based models for speech recognition. In Proceedings of NIPS, NIPS’15, pp. 577–585.
[6] (2018) Advancing connectionist temporal classification with attention modeling. CoRR abs/1803.0.
[7] Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of ICML.
[8] (2014) Towards end-to-end speech recognition with recurrent neural networks. JMLR Workshop and Conference Proceedings.
[9] (2012) Sequence transduction with recurrent neural networks. CoRR abs/1211.3.
[10] (2014) DeepSpeech: Scaling up end-to-end speech recognition. arXiv:1412.5567.
[11] (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.
[12] (2017) Joint CTC/attention decoding for end-to-end speech recognition. In Proceedings of ACL.
[13] (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of ICASSP.
[14] (2017) Improved training for online end-to-end speech recognition systems. CoRR abs/1711.0.
[15] (2016) EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Proceedings of ASRU.
[16] (2015) Librispeech: An ASR corpus based on public domain audio books. In Proceedings of ICASSP.
[17] (1992) The design for the wall street journal-based CSR corpus. In Proceedings of the workshop on Speech and Natural Language - HLT ’91.
[18] (2017) Improving the performance of online neural transducer models. CoRR abs/1712.0.
[19] (2015) Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv.
[20] (2017) Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proceedings of Interspeech.
[21] (2018) ESPnet: end-to-end speech processing toolkit. Proceedings of Interspeech abs/1804.0.
[22] (2017) The Microsoft 2016 conversational speech recognition system. In Proceedings of ICASSP.
[23] (2017) Breaking the softmax bottleneck: A high-rank RNN language model. CoRR arXiv:1711, pp. 1–18.
[24] (2018) Improved training of end-to-end attention models for speech recognition. CoRR abs/1805.0.
[25] (2017) Improving end-to-end speech recognition with policy learning. Proceedings of ICASSP abs/1712.0.