In recent years, the performance of automatic speech recognition (ASR) has been greatly improved by sequence-to-sequence modeling, such as connectionist temporal classification (CTC) [graves2006ctc], the recurrent neural network transducer (RNN-T) [battenberg2017rnnt], and the attention-based encoder-decoder (AED) [hori2017joint]. Much of the earlier research has focused on autoregressive (AR) modeling, which generates the token sequence using a left-to-right probabilistic chain rule. Despite their strong performance, such AR models require one incremental model computation per generated token, resulting in high inference latency and considerable computational cost.
In contrast, non-autoregressive (NAR) modeling generates the token sequence in a constant number of steps and removes the chain-rule assumption. CTC [graves2006ctc] plays an essential role in recent NAR research. Modern NAR approaches outperform CTC by leveraging alignments (align-based) or the output token sequence (token-based). Based on the joint CTC/attention architecture [hori2017joint], Mask-CTC [higuchi2020mask] uses a (conditional) masked language model ((C)MLM) decoder to refine the CTC token sequence. [higuchi2021improved] proposes two auxiliary tasks to address the length prediction problem in Mask-CTC. From the alignment perspective, CTC alignments show their advantage for building NAR models in Align-Refine [chi2021alignrefine], CASS-NAT [fan2021improved] and AL-NAT [wang2022alignmentlearning]. Moreover, the self-supervised model wav2vec 2.0 [baevski_2020_wav2vec_20_framework] has achieved promising results with CTC.
However, two major challenges remain for NAR modeling. First, NAR models converge slowly and perform poorly compared to state-of-the-art (SOTA) AR models [higuchi2020mask, deng2022improving]. Second, although NAR models are often favored in resource-constrained situations for their fast inference speed and high accuracy [higuchi2020mask], the large model scale and computational cost limit the application of NAR modeling. Knowledge distillation (transfer learning) is typically used to address this problem by training a smaller student model [gou2021knowledge]. Specifically, the student aims to mimic the soft targets provided by a well-trained teacher by minimizing the Kullback-Leibler divergence (KLD) [huang2018knowledgea, takashima2019investigation, munim2019sequencelevel]. Nevertheless, when knowledge distillation is applied to non-autoregressive ASR, the weak NAR teacher limits the improvement.
In this paper, we propose a novel architecture that transfers and distills knowledge from an autoregressive (AR) teacher to a non-autoregressive (NAR) student, together with a beam-search decoding method, to boost the performance of non-autoregressive modeling. First, we introduce a beam-search decoding method to enlarge the search space for the (conditional) masked language model ((C)MLM) decoder [ghaz2019mask]. Then, we extend the knowledge distillation technique by transferring the AR teacher’s knowledge to the NAR student at two distillation levels, thereby improving the NAR student’s performance. The encoder distillation follows our previous setup [huang2018knowledgea]. For the decoder distillation, we develop frame- and sequence-level distillation from the attention-based autoregressive model into Mask-CTC. The distillation loss is customized for token-based NAR models, so that the NAR decoder can benefit from the AR decoder.
The paper is organized as follows. Section 2 briefly introduces the attention-based autoregressive model and non-autoregressive Mask-CTC. Section 3 presents the proposed beam search method for Mask-CTC and the knowledge distillation from AR to NAR ASR. Section 4 gives experimental results and analysis on the Mandarin AISHELL-1 and English LibriSpeech datasets. Finally, Section 5 concludes the paper.
2 Autoregressive and Non-autoregressive ASR
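Basically, end-to-end ASR models map speech features $X = [x_1, x_2, \dots, x_T]^\top$, $x_t \in \mathbb{R}^F$, to a token sequence $y = [y_1, y_2, \dots, y_L]^\top$, $y_l \in \mathcal{U}$, where $F$ is the feature dimension and $\mathcal{U}$ denotes the vocabulary set.
Traditional attention-based autoregressive (AR) ASR models [vaswani2017attention, gulati2020conformer] first encode the speech features $X$ into a hidden representation $H = \mathrm{Encoder}(X)$, and then combine it with the previous tokens $y_{<l}$ to estimate the posterior:
$$p_{\mathrm{att}}(y_l \mid y_{<l}, X) = \mathrm{Decoder}(y_{<l}, H), \tag{1}$$
and the whole sequence probability is
$$p_{\mathrm{att}}(y \mid X) = \prod_{l=1}^{L} p_{\mathrm{att}}(y_l \mid y_{<l}, X). \tag{2}$$
During inference, the AR model generates the hypothesis token by token.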
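The chain-rule factorization and token-by-token decoding described above can be illustrated with a minimal sketch. The function `toy_next_token_probs` is a hypothetical stand-in for the attention decoder (it ignores the acoustics entirely), and greedy selection is used here only for brevity; it is not the decoding setup of the experiments.

```python
import math

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for an AR decoder: returns p(y_l | y_<l, X) as a dict."""
    if not prefix:
        return {"a": 0.7, "b": 0.2, "<eos>": 0.1}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.6, "<eos>": 0.3}
    return {"a": 0.1, "b": 0.1, "<eos>": 0.8}

def sequence_log_prob(tokens, next_token_probs):
    """Chain rule: log p(y | X) is the sum over l of log p(y_l | y_<l, X)."""
    return sum(math.log(next_token_probs(tokens[:l])[tok])
               for l, tok in enumerate(tokens))

def greedy_decode(next_token_probs, max_len=10):
    """Generate one token per step; every step needs a new decoder call."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        best = max(probs, key=probs.get)
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix

hypothesis = greedy_decode(toy_next_token_probs)
log_p = sequence_log_prob(hypothesis + ["<eos>"], toy_next_token_probs)
```

The loop in `greedy_decode` makes the latency argument concrete: the number of decoder calls grows with the output length, which is the cost that NAR models avoid.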
Connectionist temporal classification (CTC) [graves2006ctc] is one of the earliest non-autoregressive (NAR) methods. It introduces a many-to-one function $\eta$ from the frame-level alignment $Z = [z_1, z_2, \dots, z_T]^\top$, $z_t \in \mathcal{U} \cup \{\text{blank}\}$, to the token sequence $y$, by merging repeated labels and removing the blanks in $Z$. The sequence probability is represented as:
$$p_{\mathrm{ctc}}(y \mid X) = \sum_{Z \in \eta^{-1}(y)} \prod_{t=1}^{T} p(z_t \mid X), \tag{3}$$
where $\eta^{-1}(y)$ is the set of all alignments $Z$ that $\eta$ maps to $y$. During inference, greedy CTC predicts the alignment by selecting the token with the highest probability at each frame.
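As an illustration of the many-to-one mapping and of greedy CTC decoding, the following sketch collapses a frame-level alignment by merging repeated labels and then dropping blanks. The blank symbol name and the dictionary-based posteriors are assumptions made for the example.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol

def ctc_collapse(alignment, blank=BLANK):
    """Many-to-one mapping: merge repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(alignment)]
    return [label for label in merged if label != blank]

def ctc_greedy_decode(frame_posteriors, blank=BLANK):
    """Pick the most probable label at each frame, then collapse the alignment."""
    alignment = [max(post, key=post.get) for post in frame_posteriors]
    return ctc_collapse(alignment, blank)

# Toy posteriors for four frames over the labels {a, b, blank}.
posteriors = [
    {"a": 0.90, "b": 0.05, BLANK: 0.05},
    {"a": 0.60, "b": 0.10, BLANK: 0.30},
    {"a": 0.10, "b": 0.10, BLANK: 0.80},
    {"a": 0.20, "b": 0.70, BLANK: 0.10},
]
tokens = ctc_greedy_decode(posteriors)
```

Note that a blank between two identical labels keeps them separate, which is how CTC can emit repeated tokens.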
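Mask-CTC [higuchi2020mask], a popular instantiation of NAR ASR, refines the CTC results with a conditional masked language model (MLM) [ghaz2019mask]. During training, ground-truth tokens in $y$ are randomly replaced by the special token <MASK>, and the MLM decoder predicts the masked tokens $y_{\mathrm{mask}}$ conditioned on the observed tokens $y_{\mathrm{obs}} = y \setminus y_{\mathrm{mask}}$:
$$p_{\mathrm{mlm}}(y_{\mathrm{mask}} \mid y_{\mathrm{obs}}, X) = \prod_{y_l \in y_{\mathrm{mask}}} p_{\mathrm{mlm}}(y_l \mid y_{\mathrm{obs}}, X). \tag{4}$$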
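During inference, the output $\hat{y}$ is initialized by CTC greedy decoding, and low-confidence tokens are replaced by <MASK> based on a pre-defined threshold $p_{\mathrm{thr}}$. After that, the masks are filled using the easy-first algorithm: all masks are filled in $K$ iterations, where $N$ denotes the total number of <MASK> tokens and each iteration predicts the top-$C$ tokens with the highest confidence given by the MLM:
$$\mathcal{C} = \operatorname*{arg\,max}_{\mathcal{C} \subset \hat{y}_{\mathrm{mask}},\ |\mathcal{C}| = C} \prod_{c \in \mathcal{C}} p_{\mathrm{mlm}}(\hat{y}_c \mid \hat{y}_{\mathrm{obs}}, X), \qquad \hat{y}_{\mathrm{obs}} \leftarrow \hat{y}_{\mathrm{obs}} \cup \mathcal{C}, \tag{5}$$
where $C = \lfloor N/K \rfloor$, $\mathcal{C}$ is the candidate set of <MASK> tokens and $\hat{y}_{\mathrm{obs}}$ is the updated result after mask filling.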
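The inference procedure can be sketched as follows, under stated assumptions: `toy_mlm_predict` is a hypothetical stand-in for the MLM decoder, the number of tokens filled per round is taken as the floor of the mask count divided by the iteration count (at least one), and the loop simply continues until no mask remains, which may take slightly more rounds than requested when the division is not exact.

```python
MASK = "<MASK>"

def mask_low_confidence(tokens, confidences, threshold):
    """Replace CTC tokens whose confidence is below the threshold with <MASK>."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]

def easy_first_fill(tokens, mlm_predict, num_iterations):
    """Fill the most confident masked positions first, a few per round."""
    tokens = list(tokens)
    n_mask = tokens.count(MASK)
    if n_mask == 0:
        return tokens
    per_iter = max(1, n_mask // num_iterations)  # assumed rounding rule
    while MASK in tokens:
        # mlm_predict returns {position: (token, confidence)} for masked positions.
        preds = mlm_predict(tokens)
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _) in ranked[:per_iter]:
            tokens[pos] = tok
    return tokens

def toy_mlm_predict(tokens):
    """Hypothetical MLM: more observed neighbours means higher confidence."""
    reference = ["h", "e", "l", "l", "o"]
    preds = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
            observed = sum(1 for n in neighbours if n != MASK)
            preds[i] = (reference[i], 0.5 + 0.2 * observed)
    return preds

ctc_tokens = ["h", "a", "l", "l", "o"]
ctc_conf = [0.999, 0.40, 0.995, 0.60, 0.999]
masked = mask_low_confidence(ctc_tokens, ctc_conf, threshold=0.99)
refined = easy_first_fill(masked, toy_mlm_predict, num_iterations=2)
```

The output length is fixed by the CTC initialization, which is why length errors made by CTC cannot be repaired by mask filling alone.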
3 Proposed Methods
In this section, we introduce: (1) the proposed beam search method for NAR, (2) the distillation architecture transferring knowledge from AR to NAR ASR.
3.1 Beam Search for NAR ASR
Inspired by joint and rescoring decoding [hori2017joint], we design a beam-search decoding method to enlarge the search space for the MLM decoder. The procedure is shown in Algorithm 1. It maintains two queues: a sorted queue that is updated at the beginning of each iteration, and a second queue that stores the final state of the first after the iteration. During each iteration, a beam of size $B$ is preserved, and the number of tokens updated per iteration is fixed. The top-$B$ candidates are selected according to the log-domain posterior probability and Equation 5.
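The following sketch shows one way a beam search over mask filling could be organized. It is not a reproduction of Algorithm 1: the scoring (sum of MLM log posteriors), the choice of positions (most confident first), the per-iteration token count, and the helper `toy_mlm_topk` are all assumptions made for illustration.

```python
import math

MASK = "<MASK>"

def beam_fill(tokens, mlm_topk, beam_size, num_iterations):
    """Beam search over mask filling; a fixed number of tokens is filled per iteration."""
    n_mask = tokens.count(MASK)
    if n_mask == 0:
        return list(tokens)
    per_iter = max(1, n_mask // num_iterations)  # assumed rounding rule
    beam = [(0.0, list(tokens))]                 # (accumulated log score, hypothesis)
    while MASK in beam[0][1]:
        candidates = []
        for score, hyp in beam:
            # One MLM pass per hypothesis: {position: [(token, log_prob), ...]}, best first.
            options = mlm_topk(hyp, beam_size)
            positions = sorted(options, key=lambda p: options[p][0][1],
                               reverse=True)[:per_iter]
            partial = [(score, hyp)]
            for pos in positions:
                expanded = []
                for s, h in partial:
                    for tok, logp in options[pos]:
                        new_h = list(h)
                        new_h[pos] = tok
                        expanded.append((s + logp, new_h))
                expanded.sort(key=lambda c: c[0], reverse=True)
                partial = expanded[:beam_size]
            candidates.extend(partial)
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]

def toy_mlm_topk(hyp, k):
    """Hypothetical MLM whose second-position posterior depends on the first token."""
    table = {
        (MASK, MASK): {0: [("a", 0.6), ("b", 0.4)], 1: [("x", 0.5), ("y", 0.5)]},
        ("a", MASK): {1: [("x", 0.5), ("y", 0.5)]},
        ("b", MASK): {1: [("y", 0.9), ("x", 0.1)]},
    }
    return {pos: [(tok, math.log(p)) for tok, p in opts[:k]]
            for pos, opts in table[tuple(hyp)].items()}

greedy = beam_fill([MASK, MASK], toy_mlm_topk, beam_size=1, num_iterations=2)
wider = beam_fill([MASK, MASK], toy_mlm_topk, beam_size=2, num_iterations=2)
```

In this toy case the greedy path commits to the locally best first token, while a beam of two keeps the alternative whose continuation scores higher overall, which is the effect an enlarged search space is meant to provide.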
3.2 Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive ASR
As previously stated, knowledge distillation performance on NAR is constrained owing to the poor performance of the NAR teacher. Here, we propose knowledge transfer and distillation from autoregressive (AR) to non-autoregressive (NAR) ASR, pushing the limits of NAR.
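First, we introduce two types of knowledge distillation techniques based on the Kullback-Leibler divergence (KLD):
$$\mathrm{KLD}(P, Q) = \sum_{i} P_i \log \frac{P_i}{Q_i}, \tag{6}$$
where $P$ and $Q$ are the teacher and student output distributions, respectively.
Frame-level knowledge distillation, as a basic distillation criterion, is formulated as:
$$\mathcal{L}_{\mathrm{F\text{-}KD}} = -\sum_{t} \sum_{c \in \mathcal{U}} P_T(c \mid t) \log P_S(c \mid t), \tag{7}$$
where $P_T(c \mid t)$ and $P_S(c \mid t)$ are the posterior probabilities of token $c$ at timestamp $t$ given by the teacher $T$ and the student $S$. The conditioning variables of these probabilities are omitted for simplicity. The teacher entropy term $\sum_{t} \sum_{c} P_T(c \mid t) \log P_T(c \mid t)$ is omitted when computing the KLD loss, because the teacher is frozen during training.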
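To make the relation between the KLD and the frame-level criterion concrete, the sketch below computes both on toy distributions and checks that they differ only by the teacher entropy, which does not depend on the student. The distributions and function names are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KLD(P, Q) = sum_i P_i * log(P_i / Q_i), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def frame_level_kd(teacher, student):
    """Cross-entropy between teacher and student posteriors, summed over frames."""
    loss = 0.0
    for p_t, p_s in zip(teacher, student):
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return loss

def teacher_entropy(teacher):
    """Entropy of the frozen teacher; constant with respect to the student."""
    return -sum(pt * math.log(pt) for row in teacher for pt in row if pt > 0)

# Two frames (or token positions), three-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]

kld_total = sum(kl_divergence(t, s) for t, s in zip(teacher, student))
kd_loss = frame_level_kd(teacher, student)
```

Since `kld_total` equals `kd_loss` minus the teacher entropy, minimizing either quantity gives the same gradient for the student.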
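Sequence-level knowledge distillation is another distillation criterion:
$$\mathcal{L}_{\mathrm{S\text{-}KD}} = -\sum_{\hat{y} \in \tau} P_T(\hat{y}) \log P_S(\hat{y}), \tag{8}$$
where $\hat{y}$ is a hypothesis from the teacher model, $\tau$ is the set of all possible sequences, and the conditions are omitted as in Equation 7. Computing this sequence-level loss exactly is unaffordable, as it requires the exponentially-sized sequence distribution $P_T(\hat{y})$. Similar to MWER training [prabhavalkar2017minimum], an N-best candidate set $\Omega$ is obtained by beam search, and $P_T(\hat{y})$ can then be approximated by:
$$P_T(\hat{y}) \approx \tilde{P}_T(\hat{y}) = \frac{P_T(\hat{y})}{\sum_{y' \in \Omega} P_T(y')}, \quad \hat{y} \in \Omega. \tag{9}$$
The knowledge distillation loss is then:
$$\mathcal{L}_{\mathrm{KD}} = \alpha_{\mathrm{F}}\, \mathcal{L}_{\mathrm{F\text{-}KD}} + \alpha_{\mathrm{S}}\, \mathcal{L}_{\mathrm{S\text{-}KD}}, \tag{10}$$
where $\alpha_{\mathrm{F}}$ and $\alpha_{\mathrm{S}}$ are hyper-parameters for the frame-level knowledge distillation loss $\mathcal{L}_{\mathrm{F\text{-}KD}}$ and the sequence-level loss $\mathcal{L}_{\mathrm{S\text{-}KD}}$, respectively.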
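A minimal sketch of the N-best approximation: the teacher's log scores for the beam-search hypotheses are renormalized over the N-best set and used as weights on the student's log probabilities. The function names, the weights in `combined_kd`, and the toy scores are assumptions for illustration only.

```python
import math

def normalize_nbest(teacher_log_probs):
    """Renormalize teacher sequence scores over the N-best set (stable softmax)."""
    m = max(teacher_log_probs)
    exps = [math.exp(lp - m) for lp in teacher_log_probs]
    z = sum(exps)
    return [e / z for e in exps]

def sequence_level_kd(teacher_log_probs, student_log_probs):
    """Approximate sequence-level loss restricted to the N-best hypotheses."""
    weights = normalize_nbest(teacher_log_probs)
    return -sum(w * s for w, s in zip(weights, student_log_probs))

def combined_kd(frame_loss, seq_loss, frame_weight, seq_weight):
    """Weighted sum of the frame-level and sequence-level distillation losses."""
    return frame_weight * frame_loss + seq_weight * seq_loss

# Three hypotheses from the teacher's beam search (toy log scores).
teacher_lp = [math.log(0.2), math.log(0.1), math.log(0.1)]
student_lp = [math.log(0.5), math.log(0.25), math.log(0.25)]

weights = normalize_nbest(teacher_lp)
seq_loss = sequence_level_kd(teacher_lp, student_lp)
```

Subtracting the maximum before exponentiating keeps the normalization numerically stable when sequence log probabilities are large in magnitude, as they typically are for long utterances.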
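As shown in Figure 1, the proposed knowledge distillation methods are divided into two parts: distillation after the encoder and distillation after the decoder. The encoder distillation is applied after the linear layer of the encoder, with frame-level and sequence-level losses similar to [huang2018knowledgea]. The decoder distillation is set up as follows. For frame-level distillation, only <MASK> positions are selected, so the objective function is normalized by the number of <MASK> tokens $N_{\mathrm{mask}}$:
$$\mathcal{L}^{\mathrm{dec}}_{\mathrm{F\text{-}KD}} = -\frac{1}{N_{\mathrm{mask}}} \sum_{t \in \mathrm{mask}} \sum_{c \in \mathcal{U}} P_T(c \mid t) \log P_S(c \mid t). \tag{11}$$
For sequence-level distillation, the approximate teacher probability $\tilde{P}_T(\hat{y})$, renormalized over the N-best set $\Omega$, is used:
$$\mathcal{L}^{\mathrm{dec}}_{\mathrm{S\text{-}KD}} = -\sum_{\hat{y} \in \Omega} \tilde{P}_T(\hat{y}) \log P_S(\hat{y}). \tag{12}$$
The final loss is then:
$$\mathcal{L} = \lambda_{\mathrm{enc}}\, \mathcal{L}^{\mathrm{enc}}_{\mathrm{KD}} + \lambda_{\mathrm{dec}}\, \mathcal{L}^{\mathrm{dec}}_{\mathrm{KD}}, \tag{13}$$
where $\lambda_{\mathrm{enc}}$ and $\lambda_{\mathrm{dec}}$ are weight coefficients for the encoder and decoder knowledge distillation.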
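The mask-restricted decoder criterion can be sketched as below: the cross-entropy between teacher and student posteriors is accumulated only at <MASK> positions and divided by their count. The toy posteriors, the helper names and the weighting function are assumptions, not the training code used in the experiments.

```python
import math

MASK = "<MASK>"

def masked_frame_kd(teacher, student, tokens, mask=MASK):
    """Frame-level distillation over masked positions, normalized by their count."""
    positions = [i for i, tok in enumerate(tokens) if tok == mask]
    if not positions:
        return 0.0
    total = 0.0
    for i in positions:
        total -= sum(pt * math.log(ps)
                     for pt, ps in zip(teacher[i], student[i]) if pt > 0)
    return total / len(positions)

def weighted_kd(encoder_kd, decoder_kd, encoder_weight, decoder_weight):
    """Weighted combination of the encoder and decoder distillation terms."""
    return encoder_weight * encoder_kd + decoder_weight * decoder_kd

# Decoder input with two masked positions; two-token vocabulary.
decoder_input = ["h", MASK, "l", MASK]
teacher_post = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0]]
student_post = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.25, 0.75]]

decoder_loss = masked_frame_kd(teacher_post, student_post, decoder_input)
```

Normalizing by the number of masks keeps the loss scale comparable across training samples, since the number of randomly masked tokens varies from utterance to utterance.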
|Model|| ||AISHELL-1 params||Dev I+D (%)||Dev CER (%)||Test I+D (%)||Test CER (%)||LibriSpeech params||test-clean I+D (%)||test-clean WER (%)||test-other I+D (%)||test-other WER (%)|
|W2V2-CTC [deng2022improving, baevski_2020_wav2vec_20_framework]||-||95M||-||4.8||-||5.3||95M||-||2.7||-||8.0|
|CASS-NAT v2 [fan2021improved]||-||-||-||4.9||-||5.4||-||-||3.1||-||7.2|
|AR|| ||46.8M (M)||0.2||4.5||0.2||4.9||115.0M (L)||0.2||2.8||1.3||6.6|
|NAR (Same size)|| ||46.8M (M)||0.4||5.4||0.5||6.4||115.0M (L)||0.6||4.1||2.0||10.2|
|NAR (Smaller)|| ||6.5M (XS)||0.4||7.2||2.1||8.5||12.4M (S)||0.8||5.1||2.4||12.4|
4 Experiments
4.1 Datasets
Our experiments are conducted on the Mandarin AISHELL-1 [bu2017AISHELL1] and the English LibriSpeech [panayotov2015LibriSpeech] corpora. AISHELL-1 contains a 150h training set, with development (dev) and test sets for evaluation, while LibriSpeech has a 960h training set, with test-clean/other (test c/o) used for testing. We report the character error rate (CER) on AISHELL-1 and the word error rate (WER) on LibriSpeech.
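For reference, CER and WER are both edit-distance error rates, computed over characters and words respectively. The sketch below counts substitutions, insertions and deletions along a minimum edit path and also reports the insertion-plus-deletion share; the function names are illustrative, and standard scoring tools may break ties between equally good alignments differently.

```python
def edit_counts(ref, hyp):
    """Return (total, substitutions, insertions, deletions) of a minimum edit path."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, 0, i)  # delete every reference token
    for j in range(1, cols):
        dp[0][j] = (j, 0, j, 0)  # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, n, d = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, n, d)          # match, no cost
            else:
                diag = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = dp[i][j - 1]
            ins = (t + 1, s, n + 1, d)
            t, s, n, d = dp[i - 1][j]
            dele = (t + 1, s, n, d + 1)
            dp[i][j] = min(diag, ins, dele, key=lambda c: c[0])
    return dp[-1][-1]

def error_rates(ref, hyp):
    """Error rate and insertion+deletion rate in percent of the reference length."""
    total, subs, ins, dels = edit_counts(ref, hyp)
    n = len(ref)
    return {"error_rate": 100.0 * total / n, "i_plus_d": 100.0 * (ins + dels) / n}

# Characters for a CER-style score; pass lists of words for a WER-style score.
rates = error_rates(list("abcd"), list("abd"))
```

Insertions and deletions change the output length, so the I+D share is a natural proxy for the length errors discussed for Mask-CTC.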
4.2 Model description
For acoustic feature extraction, 80-dimensional mel filterbank (Fbank) features are extracted with global-level cepstral mean and variance normalization (CMVN). For data augmentation, speed perturbation is applied only to AISHELL-1, and SpecAugment [park2019specaugment] is applied to both datasets. For text modeling, 5000 byte pair encoding (BPE) [kudo2018sentencepiece] subword units are adopted for English, and 4233 characters for Mandarin. The baseline follows the ESPnet v2 recipe [watanabe2018espnet]: a 12-layer conformer encoder with four-times down-sampling and a 6-layer transformer decoder. The weight for CTC module is fixed to .
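Global CMVN can be sketched as follows: per-dimension mean and standard deviation are estimated once over all training frames and then applied to every utterance. The tiny two-dimensional features stand in for 80-dimensional Fbank frames, and the variance floor is an assumption added to avoid division by zero.

```python
import math

def global_cmvn_stats(utterances):
    """Per-dimension mean and standard deviation over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [math.sqrt(max(v, 1e-20)) for v in var]  # floor is an assumption
    return mean, std

def apply_cmvn(utterance, mean, std):
    """Normalize each frame with the global statistics."""
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)]
            for frame in utterance]

# Two toy utterances with 2-dimensional features.
train = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0]]]
mean, std = global_cmvn_stats(train)
normalized = [apply_cmvn(utt, mean, std) for utt in train]
```

Using statistics from the training set only, and reusing them at test time, keeps the normalization identical across training and evaluation.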
For knowledge transfer and distillation, we first train a new NAR student model from scratch with the frame-level distillation loss for 80 epochs. The hyper-parameters are set to. Then we fine-tune the distillation procedure by adding the sequence-level distillation loss, and the tuning parameters are set as with 20 epochs. In Equations 8 and 9, we use beam-size , which is consistent with the decoding hyper-parameters in the AR model.
Different NAR student model sizes are explored in Table 2, identified as large (L), medium (M), small (S), and extremely small (XS). The AR teacher model keeps L size for LibriSpeech, and M-size for AISHELL-1.
In the inference stage, no language model is used in the following experiments. Model parameters are averaged over the last 5 checkpoints. For autoregressive models, joint CTC/attention one-pass decoding [hori2017joint] is used with beam size 10, and the CTC score interpolation weight is 0.3. For non-autoregressive Mask-CTC decoding, we follow the beam decoding method in Section 3.1, with beam , the threshold and , for AISHELL-1 and LibriSpeech corpus.
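Two of the inference settings above lend themselves to a short sketch: the interpolation of CTC and attention log scores with a CTC weight of 0.3, and parameter averaging over several checkpoints. Checkpoints are represented here as plain dictionaries of float lists, which is an assumption for illustration; real checkpoints hold tensors.

```python
def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Interpolate CTC and attention log scores for one hypothesis."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

def average_checkpoints(checkpoints):
    """Element-wise average of parameters stored as {name: [floats]}."""
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }

# Toy hypotheses with (CTC log score, attention log score).
hyps = {"hyp_a": (-2.0, -1.0), "hyp_b": (-1.0, -1.6)}
best = max(hyps, key=lambda h: joint_score(*hyps[h]))

averaged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```

Averaging the last few checkpoints smooths out fluctuations late in training and is applied before decoding, so it adds no inference cost.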
4.3 Results with the NAR Beam Decoding
As proposed in Section 3.1, we first evaluate the beam search performance together with the real time factor (RTF) in Table 3. RTF is computed on the test set using a single core of an Intel Xeon E5-2690 CPU. Consistent with the literature [higuchi2020mask], the NAR (M) model speeds up the AR (M) [hori2017joint] model by more than 10x, as the RTF is 0.58 for AR (M) and 0.31 for AR (S). Without too much degradation of inference speed (1.5x as slow as ‘Beam1’), the beam decoding method outperforms greedy decoding (‘Beam1’) by 5% to 9% relative WER reduction on the test set. As the beam size grows, the rate of improvement decreases.
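The quantities used in this comparison follow standard definitions, sketched below: RTF is processing time divided by audio duration, speed-up is the ratio of two RTFs, and relative reduction is the error difference divided by the baseline error. The numbers in the example are illustrative and are not taken from Table 3.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means decoding runs faster than real time."""
    return processing_seconds / audio_seconds

def speedup(rtf_baseline, rtf_new):
    """How many times faster the new system is than the baseline."""
    return rtf_baseline / rtf_new

def relative_reduction(baseline_error, new_error):
    """Relative error reduction with respect to the baseline."""
    return (baseline_error - new_error) / baseline_error

# Illustrative values only.
rtf = real_time_factor(processing_seconds=5.8, audio_seconds=10.0)
gain = relative_reduction(baseline_error=10.0, new_error=9.5)
```

Relative reductions should always be read against their baseline: the same absolute gain is a larger relative gain when the baseline error is already low.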
|Decoding||Dev (%)||Test (%)||RTF|
4.4 Results on Knowledge Transfer and Distillation
Table 1 compares the proposed knowledge transfer and distillation with other modern AR and NAR models on the AISHELL-1 and LibriSpeech datasets to validate its performance.
AISHELL-1: As shown in Table 1, the teacher AR model obtains more than 24% relative CER reduction compared with NAR (M), and 40% compared with NAR (XS). After knowledge distillation, the NAR (M) with ‘+’ achieves 8% and 16% relative CER reduction on the dev and test sets respectively, and the one based on ‘+’ with ‘++’ shows a further CER reduction of 15% on the test set. The student achieves performance (5.0%/5.4% CER) competitive with state-of-the-art NAR models such as CASS-NAT [fan2021improved] and AL-NAT [wang2022alignmentlearning]. Similar results are obtained for the distilled NAR (XS), with 18%/25% CER reduction on the two evaluation sets.
LibriSpeech: Table 1 also shows the performance comparison on the larger LibriSpeech corpus. The AR (L) model is adopted as the teacher, while NAR (L, S) models serve as students. The observations are consistent with those on AISHELL-1, and distillation further boosts the performance of the NAR Mask-CTC model at the L (3.3/7.8% WER) and S (3.7/9.2% WER) scales by 25% relative WER reduction. However, due to the limits of the AR teacher, the insertion and deletion error rate is high on AR (L).
Results show that our method narrows the gap between AR and NAR, with the improvement being significantly greater on the more difficult evaluation sets (i.e. the test set in AISHELL-1 and test-other in LibriSpeech). After knowledge transfer and distillation, the length error problem is greatly alleviated compared with the original NAR model, owing to the high prediction accuracy of the AR teacher. Moreover, both the frame-level and sequence-level distillation losses contribute to reducing insertion and deletion errors (‘I+D’), pushing the length error problem [higuchi2020mask] to its limits at 0.2% CER for ‘I+D’ in AISHELL-1 and 1.4% for LibriSpeech test-other. Meanwhile, the NAR student model achieves results comparable to other state-of-the-art NAR methods, including wav2vec2-CTC [baevski_2020_wav2vec_20_framework, deng2022improving], Improved CASS-NAT [fan2021improved] and AL-NAT [wang2022alignmentlearning].
5 Conclusion
In this paper, we propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve NAR performance while reducing the model size. To further boost the performance of NAR, we propose a beam search method for Mask-CTC, which enlarges the search space during inference. Experiments demonstrate that NAR beam search obtains a 5% relative reduction on the AISHELL-1 dataset with a tolerable RTF increase. For knowledge distillation, most results achieve over 15% relative CER/WER reduction for both large and smaller NAR models. Future work will explore the generalization of knowledge distillation from AR to NAR. Other non-autoregressive models such as CASS-NAT [fan2021improved] and AL-NAT [wang2022alignmentlearning] might be explored with external non-autoregressive language models. Hidden feature distillation [gou2021knowledge] is also a valuable extension of this work.
This work is supported in part by China NSFC projects under Grants 62122050 and 62071288, and in part by Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0102. The authors would like to thank Tian Tan and Wangyou Zhang for discussion.