Improved hybrid CTC-Attention model for speech recognition

by   Zhe Yuan, et al.
CloudWalk Technology Co., Ltd.

Recently, end-to-end speech recognition with a hybrid model consisting of connectionist temporal classification (CTC) and an attention-based encoder-decoder has achieved state-of-the-art results. In this paper, we propose a novel CTC decoder structure based on the experiments we conducted and explore the relation between decoding performance and the depth of the encoder. We also apply an attention smoothing mechanism to acquire more context information for subword-based decoding. Taken together, these strategies allow us to achieve a word error rate (WER) of 4.43% without LM and 3.34% with RNN-LM on the test-clean subset of the LibriSpeech corpus, which are by far the best reported WERs for end-to-end ASR systems on this dataset.




1 Introduction and Background

Automatic speech recognition (ASR), the technology that enables computers to recognize spoken language and convert it into text, has been widely used in different applications. In the past few decades, ASR relied on complicated traditional techniques including Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [1]. Besides, these traditional models also require hand-made pronunciation dictionaries and predefined alignments between audio and phonemes [2, 3]. Although these traditional models achieve state-of-the-art accuracies on most audio corpora, it is quite a challenge to develop ASR models without sufficient acoustics knowledge. Therefore, benefiting from the rapid development of deep learning, several end-to-end ASR models have been proposed in recent years.

Connectionist temporal classification (CTC) based models and sequence-to-sequence (seq2seq) models with attention are the two major approaches in end-to-end ASR systems. Both methods address the problem of variable-length input audio and output text. Deep Speech 2, proposed by Baidu Silicon Valley AI Lab in 2016 [4], makes full use of CTC and RNNs and achieved state-of-the-art recognition accuracy. As for the seq2seq model, Chorowski et al. used a seq2seq model with an attention mechanism to perform speech recognition [5]. However, the accuracy of the model is unsatisfactory since alignment estimation in the attention mechanism is easily corrupted by noise, especially in real-environment tasks.

To overcome the above misalignment problem, a combination of CTC and an attention-based seq2seq model was proposed by Watanabe et al. in 2017 [6]. The key to this joint CTC-attention model is training a shared encoder that is optimized with the CTC and attention-decoder objectives simultaneously. This approach improves both training speed and recognition accuracy.

This paper is partly inspired by the above method. Our main contributions include exploring different encoder and decoder network architectures and adopting several optimization methods such as attention smoothing and L2 regularization. We demonstrate that our system outperforms other published end-to-end ASR models in WER on the LibriSpeech dataset.

The paper is organized as follows. Section 2 briefly introduces the related work, mainly focusing on the hybrid CTC/Attention method. Section 3 details our model architecture and Section 4 presents our training methods and experimental results. Finally, Section 5 concludes this work.

2 Related Work

In this section, we review the hybrid CTC-attention architecture in Section 2.1 and unit selection methods in Section 2.2.

2.1 Hybrid CTC-attention architecture

The idea of this architecture is to use CTC as an auxiliary objective function to train the attention-based seq2seq network. Fig. 1 illustrates the architecture of the network, where the encoder has several convolutional neural network (CNN) layers followed by bidirectional long short-term memory (BiLSTM) layers, while the decoder includes a CTC module and an attention-based network. According to [7], using CTC along with the attention decoder makes the network more robust, since CTC helps acquire appropriate alignments in noisy conditions. Moreover, CTC also speeds up training.

CTC, introduced in [8], provides a method to train RNNs without any prior alignment between inputs and outputs. Suppose the length of the input sequence $x$ is $T$; then the probability of a CTC path can be computed as follows:

$$P(\pi \mid x) = \prod_{t=1}^{T} q_t(\pi_t) \tag{1}$$

where $q_t(\pi_t)$ denotes the softmax probability of outputting label $\pi_t$ at frame $t$ and $\pi = (\pi_1, \dots, \pi_T)$ denotes the CTC path. Hence the likelihood of the label sequence $y$ can be computed as follows:

$$P_{ctc}(y \mid x) = \sum_{\pi \in \Phi(y)} P(\pi \mid x) \tag{2}$$

where $\Phi(y)$ is the set of all possible CTC paths that can be mapped to $y$. Therefore, the CTC loss is:

$$L_{CTC} = -\ln P_{ctc}(y \mid x) \tag{3}$$
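As an illustration of the path probability, the label-sequence likelihood and the CTC loss described above, the following minimal sketch enumerates every CTC path for a toy input by brute force. The blank symbol `-`, the toy per-frame probabilities and the function names are assumptions made for illustration; practical implementations use the forward-backward dynamic program rather than enumeration.

```python
import math
from itertools import product

BLANK = "-"  # assumed blank symbol for this toy example


def collapse(path):
    """Map a CTC path to a label sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_likelihood(probs, target):
    """Sum path probabilities over all paths that collapse to `target`.

    probs[t][k] is the softmax probability of symbol k at frame t.
    """
    symbols = list(probs[0].keys())
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]  # product over frames
            total += p              # sum over valid paths
    return total


# Toy softmax outputs for T = 3 frames over the symbols {-, a, b}.
probs = [
    {"-": 0.1, "a": 0.8, "b": 0.1},
    {"-": 0.6, "a": 0.3, "b": 0.1},
    {"-": 0.2, "a": 0.1, "b": 0.7},
]
likelihood = ctc_likelihood(probs, "ab")
ctc_loss = -math.log(likelihood)
print(f"P(ab | x) = {likelihood:.3f}, CTC loss = {ctc_loss:.3f}")
```

For the target "ab" and three frames, the valid paths are `aab`, `abb`, `a-b`, `-ab` and `ab-`, so the likelihood is the sum of five path products.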


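As for the decoder part, the probability of label $y_u$ at each step $u$ depends on the input features $h$ and the previous labels $y_{1:u-1}$. The overall probability of the entire sequence can be obtained as follows:

$$P_{att}(y \mid x) = \prod_{u} P(y_u \mid h, y_{1:u-1}) \tag{4}$$

$$P(y_u \mid h, y_{1:u-1}) = \mathrm{Decoder}(s_{u-1}, c_u, y_{u-1}) \tag{5}$$

$$c_u = \sum_{t} a_{u,t} h_t \tag{6}$$

where $s$ denotes the decoder hidden states while $c$ is the context vector based on the input features $h$ and the attention weights $a$ in the above equations. The loss function of this part is defined as:

$$L_{att} = -\ln P_{att}(y \mid x) \tag{7}$$

The CTC loss $L_{CTC}$ and the attention loss are then combined into the overall training objective:

$$L = \alpha L_{CTC} + (1 - \alpha) L_{att} \tag{8}$$

where $\alpha$ denotes the weight of the different losses, $\alpha \in [0, 1]$.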
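The weighting between the two losses can be sketched in a few lines. This assumes the usual convex combination in which a single weight `alpha` in [0, 1] multiplies the CTC loss and `1 - alpha` multiplies the attention loss; the function name and the toy loss values are hypothetical.

```python
def hybrid_loss(loss_ctc, loss_att, alpha):
    """Combine the CTC and attention losses with a single weight.

    alpha scales the CTC term, (1 - alpha) scales the attention term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_ctc + (1.0 - alpha) * loss_att


# Toy per-utterance losses (made-up numbers).
loss_ctc, loss_att = 2.0, 1.0

for alpha in (0.0, 0.1, 0.5, 1.0):
    total = hybrid_loss(loss_ctc, loss_att, alpha)
    # The derivative of the total loss with respect to the CTC loss is
    # alpha, so a small alpha also shrinks the gradient that the CTC
    # branch sends back into the shared encoder.
    print(f"alpha={alpha:.1f}  total={total:.2f}  dL/dL_ctc={alpha:.1f}")
```

With `alpha = 0` the model is trained as a pure attention system and with `alpha = 1` as a pure CTC system; intermediate values give the hybrid objective.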

Figure 1: Architecture of the hybrid CTC-Attention model

2.2 Unit selection

Methods based on a large lexicon, such as phoneme-based or word-based ASR systems, cannot handle out-of-vocabulary (OOV) words. Starting from LAS [9], seq2seq models therefore introduced character-based units. By associating the frame information in audio clips with the corresponding characters, the OOV problem is resolved to some extent. However, many characters in English words are silent, and the same character may be pronounced differently in different contexts (e.g. "a" in "apple" and "approve"). Character-level decoding therefore relies heavily on the sequence relationship modeled by the RNN rather than on the acoustic information in the audio frames, which makes character-level decoding uncertain. Subword-based units address both issues: they resolve the OOV problem on the one hand, and can learn the relationship between acoustic information and character information on the other. An effective and fast method for generating subwords is byte-pair encoding (BPE) [10], a compression algorithm that iteratively replaces the most frequent pair of units (or bytes) with an unused unit, so that the number of new units generated equals the number of iterations.
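The merge procedure behind BPE can be sketched as follows. This is a toy illustration on a made-up word-frequency table, not the tooling used for the experiments; the function names and the tie-breaking rule (prefer the lexicographically larger pair when counts are equal) are assumptions chosen to make the result deterministic.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent unit pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged unit."""
    merged = {}
    for word, freq in vocab.items():
        new_word = []
        i = 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        key = tuple(new_word)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merges; each iteration creates one new unit."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
```

Each iteration adds exactly one new unit, which is why the size of the subword inventory is controlled by the number of merge iterations.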

3 Methodology

In this section, we detail our optimizations and improvements based on the previous hybrid CTC-attention architecture. We show our improvements to the encoder-decoder architecture and the attention mechanism in Section 3.1 and Section 3.2, respectively.

3.1 Encoder-Decoder architecture

The authors of ESPnet [11] stacked several BiLSTM layers on top of a few convolutional layers. The outputs of the last BiLSTM layer serve as inputs to both the CTC and the attention decoder, as shown in Fig. 1. Our major improvement consists of inserting a BiLSTM layer, used exclusively by the CTC branch, between the top shared encoder layer and the fully connected (FC) layer connected to CTC. The entire hybrid architecture is shown in Fig. 2.

Figure 2: Encoder and decoder architecture of our model

According to our experiments in Section 4, setting the CTC loss weight $\alpha$ in (8) to a smaller value makes the network perform better. However, a low weight raises a new problem. Since a lower $\alpha$ yields smaller gradients from the CTC loss during backpropagation, the shared encoder focuses more on the attention module than on the CTC module, which limits the performance of the CTC decoder. To compensate for this limitation, we introduce a dedicated BiLSTM layer connected only to the CTC decoder.

3.2 Attention smoothing

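Inspired by [5], we use a location-based attention mechanism in our implementation. Specifically, the location-based attention energies can be computed by the following equations:

$$e_{u,t} = w^\top \tanh(W s_{u-1} + V h_t + U f_{u,t} + b) \tag{9}$$

$$a_{u,t} = \frac{\exp(e_{u,t})}{\sum_{t'} \exp(e_{u,t'})} \tag{10}$$

where $s_{u-1}$ is the previous decoder hidden state, $h_t$ is the encoder output at frame $t$, and $f_u = F * a_{u-1}$ are location features obtained by convolving the previous attention weights $a_{u-1}$ with the filters $F$.

In our speech recognition system, subwords are chosen as the model units, which require more sequence context information than character-based units. However, the attention score distribution is usually very sharp when computed using the above equations. Hence, we apply an attention smoothing mechanism instead, which can be computed by

$$a_{u,t} = \frac{\sigma(e_{u,t})}{\sum_{t'} \sigma(e_{u,t'})} \tag{11}$$

where $\sigma$ is the logistic sigmoid. The above method smooths the attention score distribution and thus keeps more context information for subword-based decoding.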
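The effect of smoothing on a sharp attention distribution can be seen in a small numerical sketch. It assumes the smoothing takes the form proposed in [5], where the exponential in the softmax normalization is replaced by a logistic sigmoid; the energy values and function names are made up for illustration.

```python
import math


def softmax(energies):
    """Standard softmax normalization of attention energies."""
    m = max(energies)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    return [x / z for x in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def smoothed_attention(energies):
    """Sigmoid-normalized attention weights (assumed smoothing form)."""
    s = [sigmoid(e) for e in energies]
    z = sum(s)
    return [x / z for x in s]


# Toy energies for one decoder step over four encoder frames.
energies = [0.5, 4.0, 1.0, -1.0]
sharp = softmax(energies)
smooth = smoothed_attention(energies)

print("softmax :", [round(w, 3) for w in sharp])
print("smoothed:", [round(w, 3) for w in smooth])
```

Both variants sum to one and peak on the same frame, but the sigmoid-normalized weights spread noticeably more mass over the neighbouring frames, which is the extra context the subword units benefit from.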

4 Experiments

4.1 Experimental Setup

We train and test our implementation on the LibriSpeech dataset [12]. Specifically, we use train-clean-100, train-clean-360 and train-other-500 as our training set and dev-clean as our validation set. For evaluation, we report the word error rates (WERs) on the subsets test-clean, test-other, dev-clean and dev-other. We also adopt 3-fold speed perturbation (0.9x/1.0x/1.1x) for data augmentation. 80-dimensional Mel-filterbank features are generated using a sliding window of length 25 ms with a 10 ms stride, and the feature extraction is performed with the Kaldi toolkit [13]. Subword units are extracted from all the transcripts of the training data by the BPE algorithm. The number of subword units is set to 5000.
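For reference, WER is the word-level edit distance (substitutions, deletions and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the scoring script used in the experiments; the function names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return 100.0 * errors / words


refs = ["the cat sat on the mat"]
hyps = ["the cat sit on mat"]
print(f"WER = {wer(refs, hyps):.2f}%")
```

In this example there is one substitution ("sat" to "sit") and one deletion ("the") against six reference words.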

We use a 4-layer CNN architecture followed by a 7-layer BiLSTM where each layer is a BiLSTM with 1024 cell units per direction as encoder. In the CNN part, input features are downsampled to 1/4 through two max-pooling layers. The decoder consists of two branches where one branch is a one-layer BiLSTM followed by a CTC decoder and the other branch is a 2-layer LSTM with 1024 cell units per layer.
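To make the effect of the front end concrete, the following sketch computes how many feature frames an utterance produces and how many time steps remain after the 1/4 downsampling. It assumes no padding at the utterance edges and that each max-pooling layer halves the length with floor division; both are assumptions, as are the function names.

```python
def num_feature_frames(duration_s, window_ms=25, stride_ms=10):
    """Number of frames from a sliding window without edge padding."""
    duration_ms = int(round(duration_s * 1000))
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // stride_ms


def encoder_length(frames, num_pools=2, pool=2):
    """Sequence length after `num_pools` max-pooling layers of stride `pool`."""
    for _ in range(num_pools):
        frames //= pool
    return frames


frames = num_feature_frames(10.0)   # a 10-second utterance
steps = encoder_length(frames)
print(frames, steps)
```

Under these assumptions each encoder time step covers roughly 40 ms of audio (four strides of 10 ms), which shortens the sequences that the BiLSTM layers and the attention mechanism have to process.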

The AdaDelta algorithm [14] with initial hyper-parameter epsilon=1e-8 is used for optimization, and L2 regularization and gradient clipping are applied. We measure the accuracy on the validation set every 1000 iterations and decay epsilon by a factor of 0.1 whenever the average validation accuracy drops. All experiments are performed on 4 Tesla P40 GPUs with batchsize

on each GPU.
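The epsilon decay strategy can be sketched as a small scheduler. This is one possible reading of the description: "decayed by 0.1" is taken to mean multiplying epsilon by 0.1, and a "drop" is taken to mean that the current validation accuracy is lower than the one from the previous check. The class name and the accuracy values are hypothetical.

```python
class EpsilonDecay:
    """Shrink AdaDelta's epsilon whenever validation accuracy drops."""

    def __init__(self, eps=1e-8, factor=0.1):
        self.eps = eps
        self.factor = factor
        self.prev_acc = None

    def step(self, val_acc):
        """Call once per validation check; returns the epsilon to use next."""
        if self.prev_acc is not None and val_acc < self.prev_acc:
            self.eps *= self.factor  # assumed meaning of "decayed by 0.1"
        self.prev_acc = val_acc
        return self.eps


scheduler = EpsilonDecay()
# Made-up validation accuracies, one per 1000 training iterations.
history = [scheduler.step(acc) for acc in (0.80, 0.85, 0.84, 0.86, 0.85)]
print(history)
```

In this toy run the accuracy drops twice, so epsilon is reduced twice from its initial value.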

Our language model is a two-layer LSTM with 1536 units, trained on the text of 14500 public domain books that is commonly used as training material for LibriSpeech LMs. The SGD algorithm is used for optimization, with an initial learning rate of 1.0 and a learning-rate decay of 0.9 every 2 epochs. For decoding, we use the beam search algorithm with a beam size of 20.
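One way to read the language-model learning-rate schedule is as a step decay in which the rate is multiplied by 0.9 after every second epoch. The sketch below assumes that reading, with zero-based epoch indices; the function name is hypothetical.

```python
def lm_learning_rate(epoch, initial_lr=1.0, decay=0.9, every=2):
    """Step decay: multiply the rate by `decay` once per `every` epochs."""
    return initial_lr * decay ** (epoch // every)


for epoch in range(6):
    print(epoch, round(lm_learning_rate(epoch), 4))
```

Under this assumption epochs 0 and 1 use a rate of 1.0, epochs 2 and 3 use 0.9, and epochs 4 and 5 use 0.81.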

4.2 Results

Fig. 3 shows the accuracy curves during training, from which we can see that the model converges after 35000 iterations. The perplexity of our trained RNN-LM is 50.4 on the training set and 46.9 on the test-clean subset. We conduct experiments with different numbers of encoder layers, with and without the additional BiLSTM layer on the CTC branch. Results are shown in Table 1, from which we make the following observations. Both increasing the number of BiLSTM layers in the encoder and adding the BiLSTM layer on the CTC branch lead to better WER. Moreover, WER is reduced by about 25% relative when using our trained RNN-LM.

Figure 3: Accuracy curves with the number of iterations on both the train set and the validation set during training

Figure 4: WER performance as a function of alpha on test-clean subset

| Encoder layers | WER (%), no LM | WER (%), RNN-LM |
|---|---|---|
| 5 | 5.01 | 3.73 |
| 5+CTC-BiLSTM | 4.73 | 3.59 |
| 6 | 4.82 | 3.64 |
| 6+CTC-BiLSTM | 4.57 | 3.43 |
| 7 | 4.64 | 3.51 |
| 7+CTC-BiLSTM | 4.43 | 3.34 |

Table 1: Comparison of test-clean-subset WERs under different structures

| Model | test-clean | test-other | dev-clean | dev-other |
|---|---|---|---|---|
| Baidu DS2 [4] + LM | 5.15 | 12.73 | - | - |
| Espnet [15] + LM | 4.6 | 13.7 | 4.5 | 13.0 |
| I-Attention [16] + LM | 3.82 | 12.76 | 3.54 | 11.52 |
| Ours + no LM | 4.43 | 13.5 | 4.37 | 13.1 |
| Ours + LM | 3.34 | 10.54 | 3.15 | 9.98 |

Table 2: Performance of different networks on the LibriSpeech dataset
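The reduction of about 25% attributed to the RNN-LM can be checked directly against the test-clean numbers in Table 1. The short script below recomputes the relative WER reduction for each configuration; the dictionary simply restates the table, and the function name is hypothetical.

```python
# (WER without LM, WER with RNN-LM) on test-clean, copied from Table 1.
table1 = {
    "5": (5.01, 3.73),
    "5+CTC-BiLSTM": (4.73, 3.59),
    "6": (4.82, 3.64),
    "6+CTC-BiLSTM": (4.57, 3.43),
    "7": (4.64, 3.51),
    "7+CTC-BiLSTM": (4.43, 3.34),
}


def relative_reduction(before, after):
    """Relative WER reduction in percent."""
    return 100.0 * (before - after) / before


reductions = {
    name: relative_reduction(no_lm, with_lm)
    for name, (no_lm, with_lm) in table1.items()
}
for name, value in reductions.items():
    print(f"{name:>14}: {value:.1f}% relative WER reduction from the RNN-LM")
```

Every configuration lands between 24% and 26%, consistent with the figure quoted in the text.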

After that, we vary the weight between the CTC loss and the attention loss in steps of 0.1. The result is shown in Fig. 4. Both the pure attention-based system and the pure CTC-based system produce inferior performance. The curve also shows that decreasing $\alpha$ leads to better WER in the hybrid system, which is consistent with the original purpose of the CTC decoder: the CTC module is mainly used to assist the monotonic alignment and speed up training convergence, while the decoding quality of the hybrid system mainly relies on the attention decoder. As Fig. 4 shows, the best value of $\alpha$ is 0.1.

Finally, we compare our results with other reported state-of-the-art end-to-end systems on the LibriSpeech dataset in Table 2. The results show that, with an LM, our system achieves better WERs than the other end-to-end ASR models listed on all four subsets.

5 Conclusions

In summary, we explore a variety of structural improvements and optimization methods for the hybrid CTC-attention-based ASR system. By applying the CTC-decoder BiLSTM, attention smoothing and some other tricks, our system achieves a word error rate (WER) of 4.43% without LM and 3.34% with RNN-LM on the test-clean subset of the LibriSpeech corpus.

Future work will concentrate on optimizing both the decoder structure and the training method, such as fine-tuning the CTC decoder branch after training the shared encoder. Another direction is to apply this technique to other languages such as Mandarin, in which many polyphonic words need to be resolved during decoding.