
Distilling the Knowledge of BERT for CTC-based ASR

by Hayato Futami, et al.

Connectionist temporal classification (CTC)-based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora, but they significantly slow down CTC inference. In this study, we propose to distill the knowledge of BERT for CTC-based ASR, extending our previous study on attention-based ASR. The CTC-based model learns the knowledge of BERT during training and does not use BERT during testing, which maintains the fast inference of CTC. Unlike attention-based models, CTC-based models make frame-level predictions, which must be aligned with the token-level predictions of BERT for distillation. We propose to obtain such alignments by calculating the most plausible CTC paths. Experimental evaluations on the Corpus of Spontaneous Japanese (CSJ) and TED-LIUM2 show that our method improves the performance of CTC-based ASR without sacrificing inference speed.





1 Introduction

End-to-end automatic speech recognition (ASR), which directly maps acoustic features into text sequences, has shown remarkable results. There are several modeling variants: CTC-based models [12], attention-based sequence-to-sequence models [4, 9], and neural network transducers [13, 33]. Among them, CTC-based models have the advantage of lightweight and fast inference. They consist of an encoder followed only by a compact linear layer and can predict all tokens in parallel, which is called non-autoregressive generation. For these advantages, continuous efforts have been made to improve the ASR performance of CTC-based models [20, 25]. In terms of output units, CTC-based models and transducers are categorized as frame-synchronous models that make frame-level predictions, while attention-based models are categorized as label-synchronous models that make token-level predictions.

End-to-end ASR models, including CTC-based models, are trained on paired speech and transcripts. On the other hand, a much larger amount of text-only data is often available, and the most popular way to leverage it in end-to-end ASR is the integration of external language models (LMs). In $N$-best rescoring, the $N$-best hypotheses obtained from an ASR model are re-scored by an LM, and the hypothesis with the highest score is selected. In shallow fusion [6], the interpolated score of the ASR model and the LM is calculated at each ASR decoding step. These two LM integration approaches are simple and effective, and therefore widely used in CTC-based ASR. However, they degrade the fast inference that is the most important advantage of CTC over other variants of end-to-end ASR. Specifically, beam search [11] to obtain multiple hypotheses makes CTC lose its non-autoregressive nature, and on top of beam search, LM inference itself takes much time during testing.
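As a concrete illustration of these two approaches, the following minimal sketch (plain Python over illustrative token/score dictionaries; all names are our own, not from any ASR toolkit) shows how shallow fusion interpolates scores at one decoding step and how $N$-best rescoring re-ranks complete hypotheses:

```python
def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight):
    """Shallow fusion: interpolate ASR and LM log-probabilities for the
    candidate tokens at one decoding step of beam search."""
    return {tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
            for tok in asr_log_probs}

def nbest_rescore(hypotheses, lm_weight):
    """N-best rescoring: each hypothesis carries an ASR score and an LM
    score; the hypothesis with the highest interpolated score is selected."""
    return max(hypotheses,
               key=lambda h: h["asr_score"] + lm_weight * h["lm_score"])
```

Both require the LM at test time: shallow fusion calls it at every decoding step, and rescoring calls it once per hypothesis, which is exactly what slows down inference.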

Recently, knowledge distillation [15]-based LM integration has been proposed [3, 10, 2]. In this approach, an LM serves as a teacher model, and an attention-based ASR model serves as a student model. The knowledge of the LM is transferred to the ASR model during ASR training, and the LM is not required during testing. However, in the formulation of existing studies, the student ASR model has been limited to attention-based models that make token-level predictions. In this study, we propose an extension of this knowledge distillation to frame-synchronous CTC-based models, so as to integrate the LM while maintaining the fast inference of CTC. We use BERT [7] as a teacher LM that predicts each masked word on the basis of both its left and right context. We have shown that BERT outperforms conventionally used unidirectional LMs, which predict each word on the basis of only its left context, in distillation for attention-based ASR [10]. In addition, as recent successful CTC-based models mostly consist of a bidirectional encoder that looks at both left and right context, BERT, a bidirectional LM, is well suited as a teacher.

BERT and attention-based models give token-by-token predictions, while CTC-based models give frame-by-frame predictions. For distillation from BERT to attention-based ASR [3, 10, 2], the teacher BERT's prediction for the $i$-th token obviously becomes the soft target for the student attention-based model's prediction for the $i$-th token. However, to distill the knowledge of BERT for CTC-based ASR, it is not trivial how to map the teacher BERT's token-level predictions onto the student CTC model's frame-level predictions. In this study, we propose to leverage forced alignment from the CTC forward-backward (or Viterbi) algorithm [12] to solve this problem. During ASR training, the most plausible CTC path attributed to the label sequence is calculated to determine the correspondence between tokens and time frames. The proposed method improves the performance of CTC-based ASR, even with greedy decoding, without any additional inference steps related to BERT.

2 Preliminaries and related work

2.1 End-to-end ASR

2.1.1 CTC-based ASR

Let $X = (x_1, \ldots, x_{T'})$ denote the acoustic features in an utterance and $y = (y_1, \ldots, y_N)$ denote the label sequence of tokens corresponding to $X$. An encoder network that consists of RNN, Transformer, or Conformer [14] layers transforms $X$ into higher-level representations of length $T$. A CTC-based model predicts a CTC path $\pi = (\pi_1, \ldots, \pi_T)$ using the encoded representations. Let $V$ denote the vocabulary and $\phi$ denote a blank token. Then, we define the probability of predicting $k \in V \cup \{\phi\}$ for the $t$-th time frame as

$$P_{\mathrm{ctc}}(\pi_t = k \mid X). \quad (1)$$

The output sequence is obtained by $y = \mathcal{B}(\pi)$, where the mapping $\mathcal{B}$ removes blank tokens after removing repeated ones. The CTC loss function is defined over all possible paths that can be reduced to $y$:

$$\mathcal{L}_{\mathrm{ctc}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P_{\mathrm{ctc}}(\pi_t \mid X). \quad (2)$$
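To make the mapping $\mathcal{B}$ and the non-autoregressive decoding concrete, here is a minimal sketch in plain Python (the function names and blank symbol are our own choices):

```python
def ctc_collapse(path, blank="<b>"):
    """The CTC mapping B: merge repeated tokens, then remove blanks."""
    merged = [t for i, t in enumerate(path) if i == 0 or t != path[i - 1]]
    return [t for t in merged if t != blank]

def greedy_decode(frame_dists, blank="<b>"):
    """Greedy (best path) decoding: take the frame-wise argmax in parallel,
    then apply B. frame_dists is a list of {token: probability} per frame."""
    path = [max(dist, key=dist.get) for dist in frame_dists]
    return ctc_collapse(path, blank)
```

For example, the path (h, h, &lt;b&gt;, e, l, l, &lt;b&gt;, l, o) collapses to "hello"; note that the blank is what separates the repeated "l".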
2.1.2 Attention-based ASR

An attention-based ASR model consists of encoder and decoder networks. The decoder network predicts each token using the encoded representations and previously decoded tokens. We define the probability of predicting $k \in V$ for the $i$-th token as

$$P_{\mathrm{att}}(y_i = k \mid X, y_{<i}). \quad (3)$$

The loss function is defined as the cross-entropy:

$$\mathcal{L}_{\mathrm{att}} = -\sum_{i=1}^{N} \sum_{k \in V} \delta(k = y_i) \log P_{\mathrm{att}}(y_i = k \mid X, y_{<i}), \quad (4)$$

where $\delta(k = y_i)$ becomes $1$ when $k = y_i$, and $0$ otherwise.

2.2 BERT

BERT [7], which consists of Transformer encoders, was originally proposed as a pre-training method for downstream NLP tasks such as question answering and language understanding. BERT is pre-trained on large text corpora with a masked language modeling (MLM) objective, where some of the input tokens are masked and the original tokens are predicted given the unmasked tokens. After this pre-training, BERT can serve as an LM that predicts each masked word given both its left and right context. BERT as an LM has been applied to ASR via $N$-best rescoring [30, 28] and knowledge distillation [10, 2]. BERT has been reported to perform better than conventional LMs in ASR thanks to its use of bidirectional context.

2.3 Distilling the knowledge of BERT for attention-based ASR

We have proposed to apply BERT to attention-based ASR via knowledge distillation in [10]. BERT provides soft labels for attention-based ASR training to encourage more syntactically or semantically likely hypotheses. To generate better soft labels, context beyond the current utterance is used as input to BERT. We define BERT's prediction of $k$ for the $i$-th target as

$$P_{\mathrm{bert}}(y_i = k \mid \hat{y}^{(i)}), \quad (5)$$

where $\hat{y}^{(i)}$ is obtained by masking the $i$-th token of $y$, that is, replacing it with [MASK]. $\hat{y}^{(i)}$ is concatenated with tokens from the preceding utterances and tokens from the succeeding utterances to make an input sequence of fixed length.

The knowledge distillation (KD) loss function is formulated by minimizing the KL divergence between $P_{\mathrm{bert}}$ and $P_{\mathrm{att}}$, which is equivalent to minimizing the cross-entropy between them:

$$\mathcal{L}_{\mathrm{KD}} = -\sum_{i=1}^{N} \sum_{k \in V} P_{\mathrm{bert}}(y_i = k \mid \hat{y}^{(i)}) \log P_{\mathrm{att}}(y_i = k \mid X, y_{<i}). \quad (6)$$
The work in [2] also performed knowledge distillation from BERT to an attention-based non-autoregressive model [1], which is a label-synchronous model different from CTC.

2.4 Knowledge distillation for CTC-based ASR

Knowledge distillation (KD) [15] between two CTC-based models has been investigated [29, 19, 31, 8]. The simplest way is to minimize the KL divergence between the distributions of the student CTC and those of the teacher CTC frame-by-frame [29]. However, this assumes that the student and teacher models share the same frame-wise alignment, which does not hold in KD between CTC-based models with different topologies, such as KD from a bidirectional RNN-based CTC to a unidirectional RNN-based CTC [19, 31], which is oriented toward streaming ASR applications. In [19], a guiding CTC model encourages the student and teacher models to share the same alignment. Sequence-level KD [17] was also proposed to address the issue, where $N$-best hypotheses from the teacher CTC are used as targets for the student CTC training [31, 8].

KD from an attention-based model to a CTC-based model has also been proposed [24, 23]. Token-level predictions from the attention-based model need to be aligned with frame-level predictions from the CTC model, which is similar to our KD from BERT to CTC. The attention weights over the encoded frames are used for that purpose in [24, 23]. However, BERT does not attend to the acoustic features, so such attention weights cannot be obtained for KD from BERT. Note that this study is the first to propose KD from an LM, including BERT, to a CTC-based model.

3 Proposed method: Distilling the knowledge of BERT for CTC-based ASR

Figure 1: Illustration of our proposed method. The forced alignment path obtained with the forward-backward calculation determines which frames correspond to each token $y_i$. The CTC predictions at those frames are trained toward the $i$-th soft label from BERT.

In this study, we propose to apply BERT to CTC-based ASR via knowledge distillation. With this method, we expect CTC-based models to further learn the syntactic or semantic relationships between tokens from BERT. CTC-based models have difficulty in capturing such relationships because the conditional independence assumption prevents them from learning them explicitly from the output tokens. Soft labels from BERT help CTC-based models learn them implicitly from acoustic features and intermediate representations in the encoder. BERT provides token-level soft labels. To distill the knowledge of BERT for attention-based ASR, which makes token-level predictions, we simply minimized the KL divergence between $P_{\mathrm{bert}}$ and $P_{\mathrm{att}}$ as in Eq. (6). On the other hand, CTC makes frame-level predictions, so an alignment between token-level and frame-level predictions is necessary for using the token-level soft labels from BERT.

To solve this problem, we propose to use the forced alignment result, that is, the most plausible CTC path $\hat{\pi}$, as in [16], where the CTC path is used to enable a monotonic chunkwise attention (MoChA) model to learn optimal alignments for streaming ASR. $\hat{\pi}$ can be obtained by tracking the path that has the maximum product of forward and backward variables, which are obtained in the process of calculating the CTC loss function of Eq. (2) with the forward-backward algorithm [12]. This method does not introduce any additional architectures or external alignment information (e.g., HMM-based alignments). For each token $y_i$, $\hat{\pi}$ has one or more frames that correspond to $y_i$, and we define the alignment $al$ as a one-to-many mapping from the token index $i$ to the corresponding frame indices. Note that frames assigned to the blank token in $\hat{\pi}$ do not appear in $al$. For example, $\hat{\pi} = (a, a, \phi, b, b)$ defines $al(1) = \{1, 2\}$ and $al(2) = \{4, 5\}$.
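The one-to-many mapping $al$ can be read directly off the most plausible path. A minimal sketch (plain Python, 0-indexed tokens and frames, names our own):

```python
def alignment_from_path(best_path, blank="<b>"):
    """Build al: token index -> list of frame indices from the most
    plausible CTC path. Frames assigned to blanks are dropped, and a
    repeated symbol only starts a new label when separated by a blank
    or by a different symbol."""
    al, token_idx, prev = {}, -1, None
    for t, tok in enumerate(best_path):
        if tok != blank:
            if tok != prev:  # a new label starts here
                token_idx += 1
                al[token_idx] = []
            al[token_idx].append(t)
        prev = tok
    return al
```

For the path (a, a, &lt;b&gt;, b, b) this yields al = {0: [0, 1], 1: [3, 4]}, matching the mapping described above up to indexing convention.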

Once the alignment is obtained, the soft label $P_{\mathrm{bert}}(y_i = k \mid \hat{y}^{(i)})$ is used as the target for the CTC predictions at frames $t \in al(i)$. For example, given $al(2) = \{4, 5\}$, the CTC-based ASR is trained to make its predictions at frames 4 and 5 close to the soft label for the second token, as illustrated in Fig. 1. The KD loss function is formulated as

$$\mathcal{L}_{\mathrm{KD}}^{\mathrm{ctc}} = -\sum_{i=1}^{N} \sum_{t \in al(i)} \sum_{k \in V} P_{\mathrm{bert}}(y_i = k \mid \hat{y}^{(i)}) \log P_{\mathrm{ctc}}(\pi_t = k \mid X). \quad (7)$$
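This KD loss can be sketched as a plain-Python cross-entropy over the aligned frames (a simplified, 0-indexed sketch with our own names; a real implementation would use batched tensors):

```python
import math

def ctc_kd_loss(frame_log_probs, soft_labels, alignment, floor=-30.0):
    """Eq. (7)-style KD loss: cross-entropy between BERT's token-level
    soft labels and CTC's frame-level predictions at the aligned frames.

    frame_log_probs: per frame, {token: log P_ctc}.
    soft_labels:     per token position, {token: P_bert}.
    alignment:       {token index: [frame indices]} from forced alignment.
    """
    loss = 0.0
    for i, frames in alignment.items():
        for t in frames:
            for tok, p in soft_labels[i].items():
                # floor the log-prob for tokens missing from the CTC output
                loss -= p * frame_log_probs[t].get(tok, floor)
    return loss
```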
Finally, $\mathcal{L}_{\mathrm{KD}}^{\mathrm{ctc}}$ is interpolated with $\mathcal{L}_{\mathrm{ctc}}$ from Eq. (2) as

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\mathrm{ctc}} + \lambda\,\mathcal{L}_{\mathrm{KD}}^{\mathrm{ctc}}, \quad (8)$$

where $\lambda$ is a tunable hyperparameter.

The alignment is calculated on the fly at each training step. To mitigate the negative effects of unreliable alignments in early steps, the CTC-based model is first pre-trained with Eq. (2). The soft labels from BERT can be pre-computed for the whole training set. For memory efficiency, top-$K$ distillation [32] is applied, where the top-$K$ probabilities of BERT are normalized and smoothed by a temperature parameter to generate the soft labels for distillation.
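The top-$K$ soft-label construction can be sketched as follows (plain Python; the exact normalization used in [32] may differ, so treat this as one reasonable reading):

```python
import math

def topk_soft_labels(log_probs, k, temperature):
    """Keep the top-k of BERT's output distribution, smooth it with a
    temperature, and renormalize into a compact soft-label distribution."""
    top = sorted(log_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    scaled = {tok: math.exp(lp / temperature) for tok, lp in top}
    z = sum(scaled.values())
    return {tok: v / z for tok, v in scaled.items()}
```

With temperature 1 this simply renormalizes the top-$k$ probabilities; a higher temperature flattens them, making the soft labels less peaked.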

There is a clear advantage of our method over existing LM integration methods for CTC-based ASR such as $N$-best rescoring and shallow fusion. Our method introduces the knowledge of the LM (BERT) during training, so it does not change the inference time during testing. Rescoring and shallow fusion require LM modules during testing, which significantly increases the inference time. Our method benefits from the LM with plain greedy decoding and no runtime use of the LM, whereas rescoring and shallow fusion require time-consuming beam search decoding [11] with LM inference.

4 Experimental evaluations

4.1 Experimental conditions

We evaluated our method using the Corpus of Spontaneous Japanese (CSJ) [21] and the TED-LIUM2 corpus [27]. CSJ has two subcorpora of oral presentations: CSJ-APS on academic topics and CSJ-SPS on general topics. In the CSJ experiments, the transcribed speech of CSJ-APS was used for training ASR. The transcripts of CSJ (both CSJ-APS and CSJ-SPS) and additional text from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) [22] were used for training LMs. The ASR model and the LMs shared the same BPE vocabulary. In the TED-LIUM2 experiments, the transcribed speech was used for training ASR, and the text in the official LM data was used for training LMs.

The CTC-based ASR models consist of a Transformer encoder. The Adam optimizer with Noam learning rate scheduling [9] was used for training the ASR models. SpecAugment [26] was applied to the acoustic features, and speed perturbation [18] was also applied in the TED-LIUM2 experiments. When applying knowledge distillation, the CTC-based ASR was first pre-trained with Eq. (2) and then trained with Eq. (8). For a fair comparison, the baseline CTC-based ASR without knowledge distillation was trained for the same total number of epochs.

We compared three types of LMs: BERT and a Transformer LM (TLM), both consisting of Transformer layers, and an RNN LM consisting of LSTM layers. The Adam optimizer with learning rate warmup and linear decay was used for training the LMs. During training, fixed-length token sequences were fed into the LMs, and a portion of the tokens in each sequence was masked for BERT.

4.2 Experimental results

Table 1 shows the ASR results of our proposed method on CSJ. First of all, the CTC-based ASR trained with our proposed knowledge distillation (KD) method (A2) outperformed the baseline without KD (A1) in terms of both word error rate (WER) and PPL. PPL denotes the pseudo perplexity [5] of BERT on the resulting hypotheses:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P_{\mathrm{bert}}(y_i \mid \hat{y}^{(i)})\right).$$
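Given the per-token masked log probabilities from BERT, the pseudo perplexity is simply the exponentiated negative average; a minimal sketch:

```python
import math

def pseudo_perplexity(masked_token_log_probs):
    """Pseudo perplexity: each token of a hypothesis is masked in turn and
    scored by BERT given its bidirectional context; PPL is the exponentiated
    negative mean of those masked-token log probabilities."""
    n = len(masked_token_log_probs)
    return math.exp(-sum(masked_token_log_probs) / n)
```

Lower values mean BERT finds the hypothesis more plausible under its bidirectional context.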
The improvement in both PPL and WER with our method suggests that the knowledge of BERT was indeed incorporated into the CTC-based ASR and helped improve WER. $\lambda$ in Eq. (8) was determined using the development set. We also observed that further increasing $\lambda$ improved PPL but degraded WER. We also trained CTC-based ASR with the recently proposed regularization method InterCTC [20] (B1). A further WER improvement was obtained by training CTC with the combination of InterCTC and our method (B2).

Here, we explored a few different KD strategies to find out whether there is a better way. In (A3), only the leftmost frame index of each non-blank token in the most plausible path was used for alignment, while all indices were used in the proposed method; for example, $\hat{\pi} = (a, a, \phi, b, b)$ then gives $al(1) = \{1\}$ and $al(2) = \{4\}$. In (A4), only the rightmost indices were used instead. In (A5), the CTC-based ASR was trained with KD from scratch, while KD was applied after pre-training without KD in the proposed method. In (A6), soft labels from the TLM were used in KD, that is, $P_{\mathrm{bert}}$ in Eq. (7) was replaced with the TLM's predictions. In (A7), one-hot labels were used, that is, $P_{\mathrm{bert}}$ in Eq. (7) was replaced with $\delta(k = y_i)$, which becomes $1$ when $k = y_i$ and $0$ otherwise. This does not use the knowledge of any LM but simply encourages CTC's predictions to be aligned with the most plausible path. Among them, our proposed approach described in Section 3 performed the best, which demonstrates the effectiveness of our alignment allocation, pre-training, and the use of BERT.

Table 2 shows the results with other LM integration methods: rescoring and shallow fusion. Inference times relative to the plain CTC (A1), measured on a CPU, are shown in the table as "InferTime". Overall, these methods improved WER more than our KD-based method (A2), but they increased the inference time far beyond the baseline (A1), as they take much time for beam search and LM inference. Note that the RNNLM (C3, D3) can carry over its hidden states and is therefore faster than the TLM (C4, D4) in shallow fusion, and that the TLM (C5, D5) scores a hypothesis in a single step and is therefore faster than BERT (C6, D6) in rescoring [28]. On the other hand, our method did not affect the inference time, and the WER improvement was obtained with greedy decoding. Our method also improved the oracle WER (D7), and combinations of our method with rescoring or shallow fusion (D3-D6) further improved WER compared to rescoring or shallow fusion alone (C3-C6).

Table 1: ASR results on CSJ (WER on eval1 and PPL) with the proposed knowledge distillation (KD)-based LM integration. Different KD strategies are compared: (A2) +KD (BERT), (A3) +KD (BERT, leftmost), (A4) +KD (BERT, rightmost), (A5) +KD (BERT, scratch), (A6) +KD (TLM), (B1) InterCTC [20], and (B2) InterCTC +KD (BERT).

Table 2: Comparison and combinations with rescoring (Resc) and shallow fusion (SF), reporting WER (%) on eval1 and inference time relative to plain CTC ("InferTime"): beam search without an LM ("BS"; C2, D2), shallow fusion with the RNNLM (C3, D3) or TLM (C4, D4), rescoring with the TLM (C5, D5) or BERT (C6, D6), and oracle WERs (C7, D7).

Table 3: ASR results on TED-LIUM2.

Figure 2: An example of decoded hypotheses and top frame-level predictions (the upper row is more probable) from two CTC-based models on TED-LIUM2. "_" denotes a word boundary.

Table 3 summarizes the ASR results on TED-LIUM2, and it shows that our proposed KD with BERT improved the WER and PPL of CTC-based ASR there as well. Fig. 2 shows a decoding example for an utterance from TED-LIUM2. The box at the top of the figure shows the decoded hypotheses from two CTC-based models: (a) one trained without KD and (b) the other trained with our method. While "feed ourselves" was erroneously recognized as "fe our cell" without KD, it was correctly recognized with KD. The lower part of the figure shows the top frame-level predictions from the two models for the utterance. We see that the probabilities of the semantically plausible subwords "ed" and "selves" become higher with KD, leading to correct recognition. It is also interesting that semantically plausible subwords such as "why" and "_kids" appear in the top predictions with KD, which indicates that the model considers the relationships between tokens in the context much more than the baseline.

5 Conclusions

In this study, we have proposed knowledge distillation-based BERT integration for CTC-based ASR. For knowledge distillation, BERT provides token-level soft labels, while CTC-based ASR makes frame-level predictions. We obtained the alignment between them by calculating the most plausible CTC paths. Our method does not add any computational costs during testing, which maintains the fast inference of CTC. We demonstrated that our method improved the performance of CTC-based ASR on CSJ and TED-LIUM2 by exploiting the knowledge of BERT. For future work, we will investigate applying BERT to neural network transducers [13, 33] that are frame-synchronous models and have an autoregressive nature.


  • [1] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang (2020) Listen attentively, and spell once: whole sentence generation via a non-autoregressive architecture for low-latency speech recognition. In INTERSPEECH, pp. 3381–3385. Cited by: §2.3.
  • [2] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang (2021) Fast end-to-end speech recognition via non-autoregressive models and cross-modal knowledge transferring from BERT. TASLP, pp. 1897–1911. Cited by: §1, §1, §2.2, §2.3.
  • [3] Y. Bai, J. Yi, J. Tao, Z. Tian, and Z. Wen (2019) Learn spelling from teachers: transferring knowledge from language models to sequence-to-sequence speech recognition. In INTERSPEECH, pp. 3795–3799. Cited by: §1, §1.
  • [4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In ICASSP, pp. 4960–4964. Cited by: §1.
  • [5] X. Chen, A. Ragni, X. Liu, and M. J. F. Gales (2017) Investigating bidirectional recurrent neural network language models for speech recognition. In INTERSPEECH, pp. 269–273. Cited by: §4.2.
  • [6] J. Chorowski and N. Jaitly (2017) Towards better decoding and language model integration in sequence to sequence models. In INTERSPEECH, pp. 523–527. Cited by: §1.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional Transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1, §2.2.
  • [8] H. Ding, K. Chen, and Q. Huo (2019) Compression of CTC-trained acoustic models by dynamic frame-wise distillation or segment-wise n-best hypotheses imitation. In INTERSPEECH, pp. 3218–3222. Cited by: §2.4.
  • [9] L. Dong, S. Xu, and B. Xu (2018) Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In ICASSP, pp. 5884–5888. Cited by: §1, §4.1.
  • [10] H. Futami, H. Inaguma, S. Ueno, M. Mimura, S. Sakai, and T. Kawahara (2020) Distilling the knowledge of BERT for sequence-to-sequence ASR. In INTERSPEECH, pp. 3635–3639. Cited by: §1, §1, §2.2, §2.3.
  • [11] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In ICML. Cited by: §1, §3.
  • [12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376. Cited by: §1, §1, §3.
  • [13] A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv. Cited by: §1, §5.
  • [14] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang (2020) Conformer: convolution-augmented Transformer for speech recognition. In INTERSPEECH, pp. 5036–5040. Cited by: §2.1.1.
  • [15] G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv. Cited by: §1, §2.4.
  • [16] H. Inaguma and T. Kawahara (2021) Alignment knowledge distillation for online streaming attention-based speech recognition. arXiv. Cited by: §3.
  • [17] Y. Kim and A. M. Rush (2016) Sequence-level knowledge distillation. In EMNLP, pp. 1317–1327. Cited by: §2.4.
  • [18] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In INTERSPEECH, pp. 3586–3589. Cited by: §4.1.
  • [19] G. Kurata and K. Audhkhasi (2019) Guiding CTC posterior spike timings for improved posterior fusion and knowledge distillation. In INTERSPEECH, pp. 1616–1620. Cited by: §2.4.
  • [20] J. Lee and S. Watanabe (2021) Intermediate loss regularization for CTC-based speech recognition. In ICASSP, pp. 6224–6228. Cited by: §1, §4.2, Table 1.
  • [21] K. Maekawa (2003) Corpus of Spontaneous Japanese: its design and evaluation. In SSPR. Cited by: §4.1.
  • [22] K. Maekawa, M. Yamazaki, T. Ogiso, T. Maruyama, H. Ogura, W. Kashino, H. Koiso, M. Yamaguchi, M. Tanaka, and Y. Den (2014) Balanced corpus of contemporary written Japanese. Lang. Resour. Eval., pp. 345–371. Cited by: §4.1.
  • [23] T. Moriya, T. Ochiai, S. Karita, H. Sato, T. Tanaka, T. Ashihara, R. Masumura, Y. Shinohara, and M. Delcroix (2020) Self-distillation for improving CTC-Transformer-based ASR systems. In INTERSPEECH, pp. 546–550. Cited by: §2.4.
  • [24] T. Moriya, H. Sato, T. Tanaka, T. Ashihara, R. Masumura, and Y. Shinohara (2020) Distilling attention weights for CTC-based ASR systems. In ICASSP, pp. 6894–6898. Cited by: §2.4.
  • [25] J. Nozaki and T. Komatsu (2021) Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions. In INTERSPEECH, pp. 3735–3739. Cited by: §1.
  • [26] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In INTERSPEECH, pp. 2613–2617. Cited by: §4.1.
  • [27] A. Rousseau, P. Deléglise, and Y. Estève (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC, pp. 3935–3939. Cited by: §4.1.
  • [28] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff (2020) Masked language model scoring. In ACL, pp. 2699–2712. Cited by: §2.2, §4.2.
  • [29] A. Senior, H. Sak, F. de Chaumont Quitry, T. Sainath, and K. Rao (2015) Acoustic modelling with CD-CTC-SMBR LSTM RNNs. In ASRU, pp. 604–609. Cited by: §2.4.
  • [30] J. Shin, Y. Lee, and K. Jung (2019) Effective sentence scoring method using BERT for speech recognition. In ACML, pp. 1081–1093. Cited by: §2.2.
  • [31] R. Takashima, S. Li, and H. Kawai (2018) An investigation of a knowledge distillation method for CTC acoustic models. In ICASSP, pp. 5809–5813. Cited by: §2.4.
  • [32] X. Tan, Y. Ren, D. He, T. Qin, and T. Liu (2019) Multilingual neural machine translation with knowledge distillation. In ICLR. Cited by: §3.
  • [33] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar (2020) Transformer transducer: a streamable speech recognition model with Transformer encoders and RNN-T loss. In ICASSP, pp. 7829–7833. Cited by: §1, §5.