Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

07/13/2019 ∙ by Ye Bai, et al.

Integrating an external language model into a sequence-to-sequence speech recognition system is non-trivial. Previous works integrate external language models through linear interpolation or a fusion network; however, these approaches introduce external components and increase decoding computation. In this paper, we instead propose a knowledge distillation based training approach to integrating external language models into a sequence-to-sequence model. A recurrent neural network language model, trained on large-scale external text, generates soft labels to guide the sequence-to-sequence model training; the language model thus plays the role of a teacher. This approach adds no external component to the sequence-to-sequence model during testing, and it can be flexibly combined with the shallow fusion technique for decoding. The experiments are conducted on the public Chinese datasets AISHELL-1 and CLMAD. Our approach achieves a character error rate of 9.3%, improving over the vanilla sequence-to-sequence model.




1 Introduction

Attention based sequence-to-sequence (Seq2Seq) models have achieved promising performance in automatic speech recognition (ASR) [1, 2, 3, 4]. A Seq2Seq model consists of two components: an encoder, which encodes the acoustic feature sequence into a high-level representation, and a decoder, which generates the corresponding word sequence. The model leverages an attention mechanism to fuse the extracted features into a fixed-dimensional vector that captures the global semantic information of a speech signal. The decoder is a conditional language model (LM) that captures the linguistic information of the transcriptions. During the decoding stage, the decoder predicts the current word at each step in terms of the acoustic encoding from the encoder, the history context, and the previous word. This architecture is also referred to as Listen, Attend, and Spell.


Compared with speech transcriptions, abundant unsupervised text corpora, which contain rich linguistic information, are much easier to obtain. Large-scale external text data is commonly used to train language models (LMs) that improve ASR performance in conventional hidden Markov model (HMM) or connectionist temporal classification (CTC) based ASR pipelines. However, because the encoder and the decoder are optimized jointly, it is non-trivial to integrate an external LM into a Seq2Seq model.

Shallow fusion and deep fusion are two approaches to integrating an LM into a Seq2Seq model [5]. Shallow fusion performs log-linear interpolation between the decoder of a Seq2Seq model and an external LM during beam search. The external LM can be an n-gram LM or a neural network language model (NNLM), and this approach has achieved success in ASR tasks [1, 6]. Deep fusion approaches instead leverage a neural network to fuse hidden representations of the Seq2Seq decoder and the external neural network based LM [5]. Cold fusion and component fusion utilize a pre-trained recurrent neural network language model (RNNLM) and a gating mechanism to improve ASR performance [7, 8]. These fusion approaches have shown promising performance. However, the external LM network increases the complexity of the Seq2Seq model; in particular, the fusion network of deep fusion introduces extra parameters into the Seq2Seq model. Moreover, both shallow fusion and deep fusion require the external LM during the test stage, which adds complexity to the ASR system.
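As a concrete illustration of shallow fusion, one beam-search step combines the two models' log-probabilities log-linearly before ranking candidate tokens. The following is a minimal sketch; the function name and the interpolation weight value are illustrative, not from the paper:

```python
import numpy as np

def shallow_fusion(s2s_log_probs, lm_log_probs, lm_weight=0.3, beam=2):
    """Log-linear interpolation of Seq2Seq decoder and external LM
    log-probabilities at one beam-search step. Returns the indices of the
    top-`beam` tokens under the fused score. `lm_weight` is a hypothetical
    tunable value, not the paper's setting."""
    fused = np.asarray(s2s_log_probs) + lm_weight * np.asarray(lm_log_probs)
    return np.argsort(fused)[::-1][:beam]   # highest fused score first
```

With `lm_weight = 0` the ranking reduces to the plain Seq2Seq scores, which is why shallow fusion requires no change to the Seq2Seq model itself, only to the beam-search scoring.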

We propose a knowledge distillation (KD) [9] based training approach to integrating an external LM into a Seq2Seq model. First, an RNNLM is trained on large-scale text data. Then, the RNNLM is used to generate soft labels of the speech transcriptions to train the Seq2Seq model. This training approach is also known as the teacher/student paradigm: the teacher (RNNLM) provides soft labels as prior knowledge to “teach” the student (Seq2Seq decoder). Thus, we refer to the proposed training approach as “Learn Spelling from Teachers” (LST). LST is simple to implement: it does not modify the model structure, and only requires training an RNNLM to generate soft labels. With LST, the external LM is needed only during training, so it does not increase the complexity of the model at test time. Furthermore, LST and shallow fusion can be used together to achieve better performance. We conduct experiments on the publicly available AISHELL-1 dataset (http://openslr.org/33/) [10] and the CLMAD text dataset (http://openslr.org/55/) [11] to show the effectiveness of the proposed LST. We use Speech-Transformer [3] as the backbone network. Our proposed approach reduced the character error rate (CER) compared with the baseline, and applying shallow fusion to the model trained with LST reduced the CER further.

The rest of this paper is organized as follows. Section 2 introduces the background. Section 3 introduces the proposed LST. Section 4 introduces the related work. Section 5 describes the experimental results. Section 6 summarizes the paper.

((a)) Vanilla Seq2Seq Model
((b)) Learn Spelling from Teachers (LST)
Figure 1: (a) illustrates a basic encoder-decoder architecture for ASR. $\mathbf{x}$ represents the acoustic features, $c_t$ denotes the context at step $t$, and $y_{t-1}$ denotes the previous token. The decoder predicts the current token in terms of the context $c_t$, the previous token $y_{t-1}$, and the acoustic vector generated by the encoder. The loss is computed with the softmax output of the decoder and the current ground truth token $y_t$. (b) illustrates the proposed “Learn Spelling from Teachers” (LST) approach. The RNNLM generates soft labels to train the Seq2Seq model, and it is removed during testing.

2 Background: Seq2Seq models for ASR

A basic Seq2Seq model is shown in Fig. 1(a). First, a speech signal is processed into an acoustic feature sequence. Then, an encoder network encodes the sequence into a high-level acoustic representation. The encoder can be a recurrent neural network [2, 12] or a transformer [3]. The decoder is a conditional LM: given the high-level acoustic representation, the previous token, and the history context, it predicts the current token. The probability distribution over the vocabulary is computed by a softmax function.

Attention is an important mechanism for capturing the relationship between the acoustic representations and the current state of the decoder. The attention scores are computed in terms of the current decoder state and the high-level acoustic representations, and the acoustic information and the decoder state are then fused.
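This score-then-fuse computation can be sketched as plain dot-product attention. This is an illustrative sketch under our own naming; real Seq2Seq models additionally use learned projections and, in the transformer case, multiple heads:

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """Dot-product attention: score each high-level acoustic representation
    against the current decoder state, normalize the scores with a softmax,
    and return the weighted sum that fuses the acoustic information into a
    single context vector (plus the attention weights)."""
    scores = encoder_outputs @ decoder_state / np.sqrt(len(decoder_state))
    scores = scores - scores.max()      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ encoder_outputs, weights
```

Frames whose representations align with the decoder state receive larger weights, so the fused vector emphasizes the acoustically relevant part of the utterance.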

The encoder and decoder are trained jointly. The training criterion is cross entropy:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{T} \sum_{k=1}^{K} \delta(k, y_t) \log P(k \mid y_{t-1}, c_t, \mathbf{x}; \theta),$$

where $k$ is the index of each token, $K$ is the vocabulary size, $y_t$ is the index of the corresponding ground truth token at step $t$, $y_{t-1}$ is the previous token, $c_t$ is the history context, $\mathbf{x}$ is the acoustic features, $P$ represents probability, and $\theta$ stands for the parameters of the whole network. $\delta(k, y_t)$ is $1$ if the two variables are equal, and $0$ otherwise.
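In code, the 1-of-K indicator simply selects the log-probability of the ground-truth token at each step, reducing the double sum to a single lookup per time step. A minimal numpy sketch, with function and variable names of our own choosing:

```python
import numpy as np

def ce_loss(log_probs, targets):
    """Sequence cross entropy with 1-of-K hard labels.

    log_probs: (T, K) array of decoder log-probabilities,
               log P(k | previous token, context, acoustics).
    targets:   length-T list of ground-truth token indices.
    The indicator in the criterion zeroes every term except the
    ground-truth entry, so we just pick that entry at each step."""
    return -float(sum(log_probs[t, targets[t]] for t in range(len(targets))))
```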

3 Distilling knowledge from external LMs

The basic idea of “Learn Spelling from Teachers” (LST) is: first, train an RNNLM on an external large-scale text corpus, and then use this RNNLM to guide the Seq2Seq model training. Besides the 1-of-K hard labels provided by the transcriptions, the RNNLM provides soft labels, which carry the knowledge of the text corpus. The soft labels are probabilities estimated by the RNNLM.

Fig. 2 shows the hard labels and soft labels of tokens in the vocabulary at one time step in a sequence. The soft labels contain more information than the hard labels: some tokens have relatively large probabilities while others have very small probabilities.

Figure 2: Hard labels and soft labels at one time step of a sequence for training. The values of the soft labels reflect knowledge of the external LM.

Given the context and the previous token, the probability of the $k$-th token in the vocabulary estimated by the RNNLM is

$$P_{\mathrm{LM}}(k \mid y_{t-1}, c_t) = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)},$$

where $z_k$ is the $k$-th node of the latent variable before the softmax function, $K$ is the vocabulary size, $y_{t-1}$ is the previous token, $c_t$ is the history context, and $T$ is a parameter called temperature which smooths the outputs.
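The temperature-scaled softmax can be sketched as follows; the function name and the default temperature value are illustrative, not the paper's setting:

```python
import numpy as np

def soft_labels(logits, T=2.0):
    """Soft labels from RNNLM logits z_k, smoothed by temperature T.
    T = 1 recovers the plain softmax; larger T flattens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Raising T shrinks the gaps between logits before exponentiation, which is exactly the smoothing effect the temperature hyper-parameter controls in Section 5.3.2.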

To make the Seq2Seq model learn the knowledge from the RNNLM, we minimize the Kullback-Leibler divergence (KLD) between the probability estimated by the RNNLM and the probability estimated by the Seq2Seq model. Let $P_k = P_{\mathrm{LM}}(k \mid y_{t-1}, c_t)$ denote the RNNLM probability and $Q_k = Q(k \mid y_{t-1}, c_t, \mathbf{x}; \theta)$ denote the Seq2Seq probability; the KLD is

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t=1}^{T} \sum_{k=1}^{K} P_k \log \frac{P_k}{Q_k}.$$

Since $P_k$ is fixed during training of the Seq2Seq model, minimizing the KLD is equivalent to minimizing the cross entropy form:

$$\mathcal{L}_{\mathrm{LST}} = -\sum_{t=1}^{T} \sum_{k=1}^{K} P_k \log Q_k.$$

We refer to the above loss as the LST loss.

The cross entropy loss $\mathcal{L}_{\mathrm{CE}}$ and the LST loss $\mathcal{L}_{\mathrm{LST}}$ are weighted with a coefficient $\lambda$, giving the final loss

$$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{CE}} + (1 - \lambda) \mathcal{L}_{\mathrm{LST}}.$$

We can simplify the above equation into the label interpolation form:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{K} \big( \lambda\, \delta(k, y_t) + (1 - \lambda)\, P_k \big) \log Q_k,$$

where $\delta(k, y_t)$ is the 1-of-K indicator of the ground truth token, $P_k$ is the RNNLM soft label, and $Q_k$ is the probability estimated by the Seq2Seq model. Thus, compared with the vanilla Seq2Seq model, we modify only the labels rather than the loss function during the training stage. The interpolated label combines the knowledge from the transcriptions with the knowledge from the LM. The LST approach is illustrated in Fig. 1(b).
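The label-interpolation form can be sketched directly in code. This is a minimal single-step numpy sketch; the function names and the coefficient value are ours, not the paper's:

```python
import numpy as np

def lst_targets(y_t, soft, lam):
    """Interpolated targets: lam * one-hot(y_t) + (1 - lam) * RNNLM soft labels.
    `lam` weights transcription knowledge against LM knowledge."""
    one_hot = np.zeros(len(soft))
    one_hot[y_t] = 1.0
    return lam * one_hot + (1.0 - lam) * np.asarray(soft)

def lst_loss(log_q, y_t, soft, lam):
    """Cross entropy of Seq2Seq log-probabilities against the interpolated
    targets; at each step this equals lam * CE + (1 - lam) * LST-loss."""
    return -float(np.sum(lst_targets(y_t, soft, lam) * np.asarray(log_q)))
```

Because the interpolated targets still sum to one, the training pipeline is unchanged: only the label vector fed to the cross-entropy computation differs from the vanilla setup.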

Comparing Fig. 1(b) with Fig. 1(a), we can see that the external RNNLM is used only for training and is removed during testing, so the test-time computation is the same as for the original Seq2Seq model. To achieve better performance, shallow fusion can additionally be used with LST during decoding. Moreover, beyond ASR, our proposed LST can be applied to Seq2Seq models in general.

4 Related work

Knowledge distillation. KD was proposed for model compression [9], and is also referred to as teacher-student learning. Kim et al. proposed using KD to reduce the size of a Seq2Seq model for machine translation [13]. KD has also been used for domain adaptation of acoustic models [14] and language models [15]. Different from these works, ours focuses on integrating external language models into Seq2Seq ASR systems.
Label smoothing. Label smoothing has been used to prevent Seq2Seq ASR models from making overconfident predictions [4, 3, 16]. It can be seen as a special case of KLD regularization in which the prior label distribution is assumed to be uniform [17]. Unlike label smoothing, LST leverages an RNNLM to provide a context-dependent prior distribution rather than a simple uniform one: instead of being assumed, the prior distribution is estimated in a data-driven way. Besides addressing the overconfidence problem, LST introduces knowledge from a large-scale external text corpus.

5 Experiments

5.1 Datasets

We use the Chinese corpus AISHELL-1 to evaluate our proposed approach [10]. The corpus is divided into a training set, a development set, and a test set, and the speakers of the three sets do not overlap. All the recordings are in WAV format, and the content of the speech is news covering different topics.

A subset of the CLMAD text dataset [11, 18] is used as the external text data (this subset has been shared via OneDrive: https://1drv.ms/u/s!An08U7hvUohBb234-V-Z0Qb_Zcc). We use the open source tool XenC to extract the subset of CLMAD whose topics match AISHELL-1 [19]. The preprocessing steps are as follows:

  1. Select the sentences that have small cross entropy with the AISHELL-1 training transcriptions [20];

  2. Remove the sentences that are too long;

  3. Mix the remaining sentences with the training transcriptions (which are duplicated several times to increase their proportion);

  4. Re-segment the word sequences into characters.

The information of the text data is shown in Table 1.

5.2 Experimental setup

In this paper, we employ Speech-Transformer [3, 21], a non-recurrent Seq2Seq model for speech recognition, as the backbone network. Instead of the hidden states and recurrent structures of RNNs, the transformer models context by computing attention directly. Please see [3, 12, 21] for details of the transformer.

The acoustic features are Mel-filter bank features (FBANK), extracted every 10 ms with a frame length of 25 ms. Each frame is spliced with the three left frames, and the sequence is subsampled every three frames. The Speech-Transformer consists of a stack of encoder blocks and a stack of decoder blocks, each containing multi-head attention and a fully connected feed-forward network. The modeling units of the decoder are characters, including the three special symbols “unk”, “sos”, and “eos”, which represent an unknown character, the start of a sentence, and the end of a sentence, respectively. The character embedding is shared with the output weights of the decoder [22]. Following [3], we use the Adam optimizer, and the learning rate is updated as follows:


$$lr = k \cdot d_{\mathrm{model}}^{-0.5} \cdot \min\left(n^{-0.5},\; n \cdot w^{-1.5}\right),$$

where $d_{\mathrm{model}}$ is the dimensionality of the model, $n$ is the step number, $k$ is a tunable parameter, and $w$ is the number of warm-up steps during which the learning rate increases linearly. The model is trained for a fixed number of epochs. The development set is used for validation, and only the model that achieves the lowest cross entropy on the development set is stored as the final model.
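This warm-up-then-decay schedule can be sketched as follows. The default values of d_model, k, and warmup below are placeholders of our own, since the paper's exact settings are not preserved in this text:

```python
def transformer_lr(step, d_model=512, k=1.0, warmup=8000):
    """Learning rate: increases linearly for the first `warmup` steps
    (the second min-argument dominates), then decays as the inverse
    square root of the step number, scaled by the tunable factor k."""
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

At exactly `step == warmup` the two branches of the `min` coincide, which is where the schedule peaks.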

The external RNNLM is a two-layer long short-term memory (LSTM) network whose modeling units are the same as those of the Seq2Seq model. The RNNLM is trained on the external text, using stochastic gradient descent (SGD) with momentum as the optimizer.

For decoding, we use beam search with a fixed beam width and a maximum decoding length.

#Sentences #Characters Size (MB)
Training Trans.
Test Trans.
External Text
Table 1: The description of the text data.

5.3 Results and analysis

5.3.1 The effectiveness of external text

Firstly, we demonstrate the effectiveness of the external text data and the RNNLM by computing perplexities on the AISHELL-1 test transcriptions, shown in Table 2. Note that the data is at the character level, so the perplexities are relatively small. Compared with the n-gram LM with Kneser-Ney smoothing trained on the training transcriptions, the n-gram LM trained on the external text achieves a significant reduction of perplexity. Moreover, the RNNLM achieves a further relative reduction over the n-gram LM trained on the external text.

5.3.2 The impact of hyper-parameters

Table 3(a) shows the character error rates (CERs) on the development set with different temperatures $T$, with the interpolation weight $\lambda$ fixed. The temperature controls the smoothness of the soft labels generated by the RNNLM. When it is too small, the soft labels are too sharp, and the Seq2Seq training is perturbed heavily. When it is too large, the soft labels are too smooth to influence the training. A moderate temperature achieves the best performance.

We then fix the temperature and evaluate the influence of $\lambda$, which controls the proportion of the ground truth hard labels relative to the soft labels of the RNNLM. The results are shown in Table 3(b). According to the results in Table 3, we select the best-performing temperature and $\lambda$ as the final hyper-parameters, and we refer to the model trained with this setting as “Seq2Seq+LST” in the rest of the paper.

n-gram (Training Trans.)
n-gram (Ext. Text)
RNNLM (Ext. Text)
Table 2: The perplexities on the AISHELL-1 test transcriptions.
Temperature $T$ CER%
((a)) Varying the temperature $T$ of the RNNLM softmax ($\lambda$ fixed).
Weight $\lambda$ CER%
((b)) Varying the weight $\lambda$ for label interpolation ($T$ fixed).
Table 3: Comparisons of different hyper-parameters on the development set.

5.3.3 The effectiveness of the proposed approach

Table 4 gives the results of each model on the test set. “Seq2Seq” is the plain Seq2Seq model without regularization. Compared with “Seq2Seq”, “Seq2Seq+LST” achieves a relative reduction in character error rate.

We also report results of two KLD based regularization approaches, namely label smoothing and unigram smoothing. For label smoothing, the prior label distribution is assumed to be uniform; it achieves a relative reduction in character error rate. For unigram smoothing, the prior label distribution is assumed to be the frequency of each label, estimated on the external text. Because the unigram distribution is too sharp, it introduces noise and disturbs training; we therefore add a small constant to the frequencies and re-normalize them to smooth the unigram. We can see that the original unigram hurts performance while the smoothed unigram improves it. Both label smoothing and unigram smoothing are effective regularizers, but the unigram must be smoothed to mitigate its sharpness.
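The smoothing step described above, adding a small constant to the unigram frequencies and re-normalizing, can be sketched as follows. The constant `eps` is illustrative, since the paper's value is not preserved in this text:

```python
import numpy as np

def smoothed_unigram(counts, eps=1e-3):
    """Turn raw label counts into a unigram prior, then flatten it by
    adding a small constant eps and re-normalizing, so that rare labels
    keep nonzero probability mass and the prior is less sharp."""
    freq = np.asarray(counts, dtype=float)
    freq = freq / freq.sum()       # raw unigram frequencies
    freq = freq + eps              # flatten the sharp distribution
    return freq / freq.sum()       # re-normalize to a valid distribution
```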

From Table 4, we can see that “Seq2Seq+LST” outperforms both label smoothing and unigram smoothing (without shallow fusion). We attribute this to the assumptions behind label smoothing (a uniform distribution) and unigram smoothing (unigram frequencies) not matching the real distribution, whereas LST is data-driven and does not assume a prior distribution.

We further apply shallow fusion with the RNNLM to each model, using the same RNNLM as for LST. Shallow fusion improves the performance of all models. The “Seq2Seq + LST” model outperforms the “Seq2Seq + Label Smoothing + SF” model, which demonstrates that LST is an effective way to improve the performance of Seq2Seq models. Moreover, using LST and shallow fusion together, i.e., “Seq2Seq + LST + SF”, achieves the best CER.

Model CER%
Seq2Seq (baseline)
Seq2Seq + SF
Seq2Seq + Label Smoothing
Seq2Seq + Label Smoothing + SF
Seq2Seq + Original Unigram Smoothing
Seq2Seq + Smoothed Unigram Smoothing
Seq2Seq + Smoothed Unigram Smoothing + SF
Seq2Seq + Proposed LST
Seq2Seq + Proposed LST + SF
Table 4: Comparisons on the test set. LST represents our proposed “Learn Spelling from Teachers” approach. SF means using shallow fusion during decoding.
Figure 3: The loss curves of the Seq2Seq model (left) and the Seq2Seq model with LST (right). For the Seq2Seq model, the training loss is lower than the validation loss; with LST, the training loss is higher than the validation loss. Moreover, the validation loss in the right figure is slightly smaller than in the left.

To further show the effect of our proposed approach, we plot the loss curves of the baseline “Seq2Seq” and the proposed “Seq2Seq+LST” in Fig. 3. For the “Seq2Seq” model, the training loss is lower than the validation loss; for “Seq2Seq+LST”, the training loss is higher than the validation loss, and the final validation loss is slightly smaller than that of “Seq2Seq”. This shows the regularization effect of LST.

6 Conclusions

In this paper, we proposed the LST training approach to integrating an external RNNLM into a Seq2Seq model. An RNNLM is first trained on large-scale external text data; the RNNLM then provides soft labels of the training transcriptions to train the Seq2Seq model. We used a transformer-based Seq2Seq model as the backbone and conducted experiments on the publicly available Chinese datasets AISHELL-1 (speech) and CLMAD (external text). The experiments demonstrate the effectiveness of our proposed approach. In the future, we will explore integrating more powerful language models into Seq2Seq systems.

7 Acknowledgements

This work is supported by the National Key Research & Development Plan of China (No.2017YFB1002801), the National Natural Science Foundation of China (NSFC) (No.61425017, No.61831022, No.61773379, No.61603390), the Strategic Priority Research Program of Chinese Academy of Sciences (No.XDC02050100), and Inria-CAS Joint Research Project (No.173211KYSB20170061).