End-to-end automatic speech recognition (ASR), which directly maps acoustic features into text sequences, has shown remarkable results. Three model families are mainly used: CTC-based models [Graves06-CTC], attention-based sequence-to-sequence models [Chan16-LAS, Dong18-ST], and neural network transducers [Graves12-ST, Zhang20-TT]. Among them, we focus on CTC-based models, which have the advantage of lightweight and fast inference. They consist only of an encoder followed by a compact linear layer and can predict all tokens in parallel, which is called non-autoregressive generation. In contrast, attention-based models and transducers predict tokens one by one depending on previously decoded tokens, which is called autoregressive generation.
End-to-end ASR models, including CTC-based ones, are trained on paired speech and transcripts. However, it is usually difficult to prepare a sufficient amount of paired data in the target domain. On the other hand, a much larger amount of in-domain text-only data is often available, and the most popular way to leverage it in end-to-end ASR is the integration of external language models (LMs). In rescoring [Mikolov10-RNN, Shin19-ESS, Salazar20-MLMS], the $N$-best hypotheses obtained from an ASR model are rescored by an LM, and the hypothesis with the highest score is selected. In shallow fusion [Chorowski17-TBD, Kannan18-AILM], the interpolated score of the ASR model and the LM is calculated at each ASR decoding step. These two LM integration approaches are simple and effective, and therefore they are widely used in CTC-based ASR. However, they sacrifice fast inference, which is the most important advantage of CTC over other variants of end-to-end ASR. Specifically, the beam search needed to obtain multiple hypotheses makes CTC lose its non-autoregressive nature [Graves14-TES].
Besides rescoring and shallow fusion, knowledge distillation (KD) [Hinton15-KD] -based LM integration for attention-based ASR has been proposed [Bai19-LST, Futami20-DKB, Bai21-FESR]. The knowledge of the LM (teacher model) is transferred to the ASR model (student model) during ASR training. Though KD has the advantage of adding no inference steps at test time, the effect of the LM is often limited because the LM cannot directly affect ASR inference.
In this study, we propose an ASR error correction method in which a masked LM (MLM) corrects less confident tokens in a CTC-based ASR hypothesis. This method does not require beam search in CTC and corrects all less confident parts in parallel using the MLM; in other words, both the ASR and error correction procedures are conducted in a non-autoregressive manner. However, this MLM-based error correction alone does not work well because the MLM does not consider acoustic information during correction. To solve this problem, we propose the phone-conditioned masked LM (PC-MLM), which leverages phone information. CTC is jointly trained to predict phones from an intermediate layer of its encoder, known as hierarchical multi-task learning [Krishna18-HMTL]. PC-MLM uses both word and phone context for correction. Furthermore, to deal with insertion errors, we propose Deletable PC-MLM, which is trained to predict insertions to be removed.
2 Preliminaries and related work
2.1 CTC-based ASR
Let $X$ denote the acoustic features and $y = (y_1, \dots, y_L)$ denote the label sequence of $L$ tokens corresponding to $X$. An encoder network transforms $X$ into a higher-level representation of length $T$. A CTC-based model predicts a frame-level CTC path $\pi = (\pi_1, \dots, \pi_T)$ using the encoded representations. Let $\mathcal{V}$ denote the vocabulary and $\phi \notin \mathcal{V}$ denote a blank token. We define the probability of predicting $k \in \mathcal{V} \cup \{\phi\}$ for the $t$-th time frame as
$P_{ctc}(\pi_t = k \mid X). \qquad (1)$
In greedy decoding, the CTC path is decided as
$\hat{\pi}_t = \mathrm{argmax}_{k} \, P_{ctc}(\pi_t = k \mid X), \qquad (2)$
which is based on non-autoregressive generation. In beam search decoding [Graves14-TES], $\hat{\pi}_t$ depends on the previously decided path $\hat{\pi}_{<t}$ and its score, which is autoregressive. The output sequence is obtained by $\hat{y} = \mathcal{B}(\hat{\pi})$, where the mapping $\mathcal{B}$ removes blank tokens after condensing repeated tokens.
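As a minimal sketch, greedy decoding and the collapse mapping described above (condense repeats, then drop blanks) can be written as follows; the token ids and array shapes are illustrative assumptions:

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy (non-autoregressive) CTC decoding.

    log_probs: (T, V) array of per-frame log-probabilities over the
    vocabulary plus the blank token. Every frame is decided independently
    with argmax, then repeats are condensed and blanks removed.
    """
    path = log_probs.argmax(axis=-1)   # best token per frame, in parallel
    decoded, prev = [], None
    for t in path:                     # collapse: condense repeats, drop blanks
        if t != prev and t != blank:
            decoded.append(int(t))
        prev = t
    return decoded
```

Because the argmax over frames has no sequential dependency, this is the step that beam search would turn autoregressive.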
The CTC loss function is defined over all possible paths that can be reduced to $y$:
$\mathcal{L}_{ctc} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P_{ctc}(\pi_t \mid X). \qquad (3)$
2.2 Masked LM
Masked LM (MLM) was originally proposed as a pre-training objective for BERT [Devlin19-BERT], which has shown promising results in many downstream NLP tasks. During MLM training, some of the input tokens (usually $15\%$) are masked and the original tokens are predicted. MLM predicts all masked tokens in parallel, or non-autoregressively, given both the left and right unmasked context. We define the probability of predicting $k$ for the $i$-th token as
$P_{mlm}(y_i = k \mid y^{\setminus i}), \qquad (4)$
where the $i$-th token is masked in $y^{\setminus i}$. Conventionally, RNN or Transformer LMs have been used in ASR via rescoring [Mikolov10-RNN] and shallow fusion [Chorowski17-TBD, Kannan18-AILM]. They predict each token autoregressively, given only its left context. Recently, MLM has been applied to ASR via rescoring [Shin19-ESS, Salazar20-MLMS] and knowledge distillation (KD) [Futami20-DKB, Bai21-FESR]. MLM has been reported to perform better than conventional LMs thanks to its use of bidirectional context. However, rescoring with MLM takes a lot of time during testing because it requires $L$ steps to rescore a hypothesis of length $L$ by masking each token in turn [Salazar20-MLMS, Futami21-ELECTRA]. In KD with MLM, the following KL-divergence-based objective is minimized during ASR training.
$\mathcal{L}_{KD} = -\sum_{i=1}^{L} \sum_{k \in \mathcal{V}} P_{mlm}(y_i = k \mid y^{\setminus i}) \log P_{asr}(y_i = k \mid y_{<i}, X), \qquad (5)$
where $P_{asr}(y_i = k \mid y_{<i}, X)$ denotes the probability of predicting $k$ for the $i$-th token with an attention-based ASR model. In these previous studies, the student ASR model has been limited to an attention-based model that makes token-level predictions, as MLM does.
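As a concrete illustration of why MLM rescoring costs one forward pass per token, the pseudo-log-likelihood scoring of [Salazar20-MLMS] can be sketched as below; `mlm_log_prob` is a hypothetical stand-in for a trained MLM:

```python
def mlm_pseudo_log_likelihood(tokens, mlm_log_prob, mask="[MASK]"):
    """Pseudo-log-likelihood rescoring with a masked LM.

    Each of the L tokens is masked in turn and re-predicted, so scoring
    one hypothesis costs L forward passes of the MLM.
    `mlm_log_prob(masked_tokens, i, token)` stands in for the model:
    the log-probability of `token` at position i given the masked context.
    """
    score = 0.0
    for i, w in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]  # mask position i only
        score += mlm_log_prob(masked, i, w)
    return score
```

A hypothesis list is then reranked by this score, typically interpolated with the ASR score.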
As an extension of MLM, the conditional masked LM (CMLM) has been proposed in [Ghazvininejad19-MP] for non-autoregressive neural machine translation (NMT). CMLM is an encoder-decoder model that predicts all masked tokens in a non-autoregressive manner, conditioning on both the source text and the unmasked target translation. In ASR, Audio-CMLM (A-CMLM) [Chen19-LAF] and Mask CTC [Higuchi20-MCTC] adopt the CMLM architecture for non-autoregressive ASR. In Mask CTC, similar to our proposed method, less confident tokens in the CTC output are refined by a CMLM. However, it conditions on acoustic features and is jointly trained with CTC on paired data, while our proposed PC-MLM conditions on phone tokens and is trained separately from CTC on text-only data.
2.3 ASR error correction
ASR error correction aims to correct errors generated by ASR using another high-level model. Recently, it has been modeled with autoregressive sequence-to-sequence models that convert an ASR hypothesis into a corrected one, analogous to neural machine translation [Guo19-SC, Zhang19-ITSC, Mani20-AECDA, Wang20-AECAT, Hrinchuk20-CASR, Zhao21-BART]. These models are usually trained on paired ASR hypotheses and their corresponding references. However, such paired data is derived from a limited amount of paired speech and transcripts, which can cause overfitting. Some studies [Hrinchuk20-CASR, Zhao21-BART] use text-only data via initialization with a large pre-trained LM such as BERT or BART [Lewis19-BART]. In [Guo19-SC], recognition results of TTS-synthesized speech are used as pseudo ASR hypotheses, and in [Wang20-AECAT], a phone-level encoder is added to a sequence-to-sequence model to incorporate phone information. More recently, a non-autoregressive error correction model based on edit alignment was proposed in [Leng21-FC]. In this study, the phone-conditioned masked LM is used as an error correction model. It is trained not on paired data but on text-only data, and it realizes non-autoregressive and phone-aware correction.
3 Proposed method
3.1 Phone-conditioned masked LM (PC-MLM)
Phone-conditioned masked LM (PC-MLM) is a phone-to-word conversion model built as a Transformer-based CMLM [Ghazvininejad19-MP]. PC-MLM predicts the word tokens at masked positions given both phone tokens $p$ input to the encoder and word tokens input to the decoder. When the $i$-th token is masked in $y^{\setminus i}$, the probability of predicting $k$ for the $i$-th token can be defined as
$P_{pcmlm}(y_i = k \mid y^{\setminus i}, p). \qquad (6)$
Phone information can be automatically obtained from word sequences using a lexicon, so PC-MLM can be trained on text-only data, as LMs are. To prevent overfitting, some phone tokens are also randomly masked during training, which is called “text augmentation” in [Wang21-CRNNT].
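This training-data construction can be sketched as follows, assuming a lexicon that maps each word to its phone sequence; the masking rates and the lexicon contents are placeholder assumptions:

```python
import random

MASK = "[MASK]"

def make_pcmlm_example(words, lexicon, word_mask_rate=0.15,
                       phone_mask_rate=0.15, rng=random):
    """Build one PC-MLM training example from text-only data.

    `lexicon` maps a word to its phone sequence, so no speech is needed.
    Some word tokens are masked as prediction targets (as in MLM), and
    some phone tokens are masked too ("text augmentation") so the model
    does not rely on perfect phone input.
    """
    phones = [ph for w in words for ph in lexicon[w]]
    phones = [MASK if rng.random() < phone_mask_rate else ph for ph in phones]
    inputs, targets = [], []
    for w in words:
        if rng.random() < word_mask_rate:
            inputs.append(MASK)
            targets.append(w)       # predict the original word here
        else:
            inputs.append(w)
            targets.append(None)    # no loss at unmasked positions
    return phones, inputs, targets
```

The encoder consumes `phones` and the decoder consumes `inputs`; the loss is computed only where `targets` is not `None`.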
3.2 Error correction with PC-MLM
In this study, PC-MLM serves as an error correction model that corrects CTC-based ASR hypotheses. An overview of the proposed method is illustrated in Figure 1. We use confidence scores to determine which tokens are to be masked and then corrected. First, to obtain token-level confidence scores, we aggregate the frame-level CTC predictions in Eq. (1) into token-level predictions as
$P_{ctc}(y_i = k \mid X) = P_{ctc}(\pi_{\tau(i)} = k \mid X), \qquad (7)$
where the index mapping $\tau$ from token position $i$ to time frame $t$ is obtained from the greedy CTC path in Eq. (2). Then, part of the CTC output $\hat{y}$ is masked out to obtain $\hat{y}^{mask}$ based on the confidence score as
$\hat{y}^{mask}_i = \begin{cases} \mathrm{[MASK]} & (P_{ctc}(y_i = \hat{y}_i \mid X) < P_{thres}) \\ \hat{y}_i & (\mathrm{otherwise}). \end{cases} \qquad (8)$
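The mask-by-confidence step can be sketched as below; the threshold value is a placeholder, and the token-level confidences are assumed to have been gathered from the frame-level CTC outputs along the greedy path:

```python
def mask_low_confidence(tokens, confidences, threshold=0.9, mask="[MASK]"):
    """Replace low-confidence CTC tokens with a mask token.

    `tokens` is the greedy CTC output after collapsing, and
    `confidences[i]` is the token-level CTC probability of the i-th
    token, taken from the frame it was condensed from. Masked positions
    are then re-predicted by PC-MLM in parallel.
    """
    return [mask if p < threshold else w
            for w, p in zip(tokens, confidences)]
```

All masked positions are filled in a single PC-MLM forward pass, keeping the whole pipeline non-autoregressive.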
In addition to the word-level context $\hat{y}^{mask}$, we propose to leverage phone-level context $p$ in error correction as input to PC-MLM. We obtain these phone-level predictions via the hierarchical multi-task learning framework [Krishna18-HMTL], where an auxiliary phone-level target is added at an intermediate layer of the encoder. This multi-task training also improves word-level ASR at the final layer.
Then, as in Eq. (6), PC-MLM provides $P_{pcmlm}(y_i = k \mid \hat{y}^{mask}, p)$ given $\hat{y}^{mask}$ and $p$. Finally, to get a corrected hypothesis $\hat{y}'$, we can directly use the probability of PC-MLM for correction:
$\hat{y}'_i = \mathrm{argmax}_{k} \, P_{pcmlm}(y_i = k \mid \hat{y}^{mask}, p). \qquad (9)$
We also propose to use the interpolated score of CTC and PC-MLM as
$\hat{y}'_i = \mathrm{argmax}_{k} \left[ (1-\lambda) \log P_{pcmlm}(y_i = k \mid \hat{y}^{mask}, p) + \lambda \log P_{ctc}(y_i = k \mid X) \right]. \qquad (10)$
The proposed method has an advantage in inference speed over existing LM integration methods for CTC such as rescoring and shallow fusion. Our method requires only the $1$-best hypothesis, while rescoring or shallow fusion requires $N$-best ($N > 1$) hypotheses after or during decoding. With CTC, the $1$-best hypothesis can be obtained by non-autoregressive generation, but $N$-best hypotheses require autoregressive generation, and this difference has a significant impact on inference speed. After obtaining the hypothesis, PC-MLM corrects tokens in a non-autoregressive manner, which is fast. Our method can also be applied to decoded outputs from attention-based models and transducers. Furthermore, it can apply an LM to non-autoregressive models other than CTC, such as A-CMLM [Chen19-LAF], LASO [Bai20-LASO], and Insertion Transformer [Fujita20-IBM], which have had difficulty with LM integration because they do not use beam search.
3.3 Deletable PC-MLM
PC-MLM is trained to replace masked tokens with the same number of other tokens, which only deals with substitution errors from CTC-based ASR. We further propose Deletable PC-MLM to address insertion errors, inspired by [Gu19-LT, Higuchi21-IMCTC]. Deletable PC-MLM predicts a null token for each inserted error, and a corrected result is then obtained by removing the null tokens. During its training, some of the input tokens are randomly masked as in MLM, and some mask tokens ([MASK]) are additionally inserted between them. The number of mask tokens to be inserted is sampled from a Poisson distribution, as in [Lewis19-BART]. After masking and insertion, Deletable PC-MLM is trained to predict the original tokens at non-inserted positions and null tokens at inserted positions.
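The corruption procedure for Deletable PC-MLM training can be sketched as follows; the masking rate and the Poisson rate are placeholder hyperparameters:

```python
import math
import random

MASK, NULL = "[MASK]", "<null>"

def sample_poisson(lam, rng):
    """Sample from Poisson(lam) by Knuth's inversion method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def corrupt_for_deletable_pcmlm(words, mask_rate=0.15, insert_lam=0.2,
                                rng=random):
    """Build one (input, target) pair for Deletable PC-MLM training.

    Besides ordinary MLM masking, extra [MASK] tokens are inserted
    between words, with the count per gap drawn from a Poisson
    distribution as in BART's infilling; the target at an inserted
    position is the null token, which is removed at correction time.
    """
    inputs, targets = [], []
    for w in words:
        for _ in range(sample_poisson(insert_lam, rng)):
            inputs.append(MASK)
            targets.append(NULL)    # the model must learn to delete these
        if rng.random() < mask_rate:
            inputs.append(MASK)
            targets.append(w)       # ordinary MLM target
        else:
            inputs.append(w)
            targets.append(None)    # no loss at unmasked positions
    return inputs, targets
```

Dropping every position whose prediction is the null token recovers a hypothesis shorter than the input, which is how insertion errors are removed at test time.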
4 Experimental evaluations
4.1 Experimental conditions
We evaluated our methods on ASR using the Corpus of Spontaneous Japanese (CSJ) [maekawa03-CSJ] and the TED-LIUM2 corpus [Ted214]. CSJ consists of Japanese presentations, including the CSJ-APS subcorpus of academic presentation speech and the CSJ-SPS subcorpus of simulated public speaking on everyday topics. TED-LIUM2 consists of English presentations available on the TED website. Evaluations were done in a domain adaptation setting, where we assumed that paired data for training ASR in the target domain was not available but that text-only data in the target domain was. An ASR model was trained on paired data from another source domain and evaluated on the target domain. In the CSJ experiments, the ASR model was trained on CSJ-SPS (source domain) and evaluated on the test set of CSJ-APS (target domain), and LMs were trained on the CSJ-APS transcripts. In the TED-LIUM2 experiments, we adopted the Librispeech corpus [Libri15], consisting of paired data of English read books, as the source domain. The ASR model was trained on Librispeech (source domain) and evaluated on the test set of TED-LIUM2 (target domain), and LMs were trained on the text data prepared for TED-LIUM2 [Ted214].
We prepared a CTC-based ASR model that consists of a Transformer encoder followed by a linear layer. We also prepared a Conformer [Gulati20-CF] -based model of the same size. We used the Adam optimizer with Noam learning rate scheduling [Dong18-ST]. For data augmentation, SpecAugment [Park19-SA] was applied to the acoustic features. We prepared four types of language models (LMs): Transformer LM (TLM), masked LM (MLM), phone-conditioned masked LM (PC-MLM), and Deletable PC-MLM (Del PC-MLM). TLM and MLM share the same architecture, a Transformer encoder, while the PC-MLMs consist of a Transformer encoder and decoder with almost the same number of parameters as TLM and MLM. We trained TLM and MLM with the Adam optimizer using linear learning rate warmup followed by linear decay; for the PC-MLMs, Noam learning rate scheduling was applied.
ASR and the LMs shared the same Byte Pair Encoding (BPE) subword vocabulary for each corpus. For PC-MLM, phone tokens were obtained using the OpenJTalk-based grapheme-to-phoneme (g2p) tool (https://github.com/r9y9/pyopenjtalk) for CSJ, and the officially provided pronunciation dictionaries for TED-LIUM2. All our implementations are publicly available (https://github.com/emonosuke/emoASR/tree/main/asr/correct).
4.2 Experimental results
Table 1 shows the ASR results of our proposed error correction (EC) on CSJ. As mentioned in Section 3.2, the baseline CTC-based model (A1) is trained with hierarchical multi-task learning [Krishna18-HMTL], which was confirmed to improve the word error rate (WER) over training without the auxiliary phone target. We compared three LMs for error correction: MLM, PC-MLM, and Del PC-MLM. We also compared “with” and “without” the score interpolation of Eq. (10), where $\lambda = 0$ in the table means “without” interpolation. With score interpolation ($\lambda > 0$), $\lambda$ was determined using the development set, and its value is shown in the table. First, MLM, which ignores phone information, did not perform well, even with score interpolation (A2, A3). PC-MLM, which considers phone information, outperformed the baseline (A4), and the score interpolation led to further improvement (A5). Del PC-MLM, which is trained to delete insertion errors, further improved the WER (A6, A7). Compared to PC-MLM (A5), Del PC-MLM (A7) actually reduced insertion errors together with substitution errors, but increased deletion errors. A cascade approach with a phone-to-word CTC (A8) can also be considered, but it did not perform well because of error propagation. Our method was also effective for the improved Conformer-based baseline (B1, B2).
[Table 1: WERs on CSJ for EC with MLM (A2, A3), PC-MLM (A4, A5), and Del PC-MLM (A6, A7), each with and without score interpolation, and for the Conformer baseline with greedy decoding (B1) and with EC using Del PC-MLM (B2); numerical results are omitted here.]
Table 2 compares our method on CSJ with other LM integration methods. Real time factors (RTFs) are noted in the table; they were measured on an NVIDIA TITAN V GPU by averaging five runs. The PC-MLMs increased the RTF compared to MLM because they take phone tokens as input, which are generally longer sequences than word tokens. Our method worked much faster than rescoring (Resc) (C2, C3) and shallow fusion (SF) (C4), which require beam search (C1). In shallow fusion, LM computation at each decoding step is also required. Note that beam search is hard to parallelize, whereas the Transformer inference in our error correction benefits from GPU parallelization.
We also compared our method to knowledge distillation (KD) with MLM. To apply KD, we utilized a forced-aligned CTC path to align the frame-level predictions of CTC with the token-level predictions of MLM, inspired by [Inaguma21-AKD]. The KD loss function is formulated as
$\mathcal{L}_{KD} = -\sum_{i=1}^{L} \sum_{k \in \mathcal{V}} P_{mlm}(y_i = k \mid y^{\setminus i}) \log P_{ctc}(\pi_{\tau(i)} = k \mid X), \qquad (11)$
where $\tau$ denotes the mapping from token position $i$ to time frame $t$ based on the forced-aligned CTC path calculated with the CTC forward-backward algorithm [Graves06-CTC]. KD is applied during training, so its RTF does not increase from the baseline. However, its WER improvement was limited (D1) compared to our method. The combination of KD and our method (D2) led to a further WER improvement over the baseline, while maintaining fast inference.
[Table 2: WERs and RTFs on CSJ for EC with Del PC-MLM (A7), rescoring with TLM (C2) and MLM (C3), shallow fusion with TLM (C4), and KD combined with EC (D2); numerical results are omitted here.]
Table 3 shows the ASR results of our proposed error correction (EC) on TED-LIUM2. Our method improved ASR by taking advantage of the LM while maintaining fast inference, as on CSJ. However, in terms of WER, our method was not competitive with rescoring (Resc) (F2) and shallow fusion (SF) (F3). Comparing (E2) and (E3), we found that phone information was not as effective as on CSJ. English words (written in the alphabet) are phonograms that represent speech sounds, while Japanese words (written in kanji) are ideograms that represent concepts (meanings). Therefore, phone information is more closely tied to word information in English than in Japanese. This suggests that word-level and phone-level recognition errors are likely to occur at the same positions in English, so phone information was not helpful in error correction there. On the other hand, phone information had a complementary effect on Japanese words.
[Table 3: WERs and RTFs on TED-LIUM2 for EC with Del PC-MLM (E4), rescoring (F2) and shallow fusion (F3) with TLM, and KD combined with EC (G2); numerical results are omitted here.]
5 Conclusions
In this study, we have proposed an LM integration method for CTC-based ASR via error correction with the phone-conditioned masked LM (PC-MLM). PC-MLM corrects less confident tokens in the CTC output using phone information. We demonstrated that our proposed method works faster than conventional LM integration methods such as rescoring and shallow fusion: they require multiple hypotheses from autoregressive beam search, while our method requires only a single hypothesis from non-autoregressive greedy decoding, and PC-MLM itself works in a non-autoregressive manner. We also demonstrated on CSJ that our method improved ASR performance even more than rescoring, shallow fusion, and knowledge distillation. For future work, we will investigate recovering deletion errors and iterative refinement in PC-MLM, while keeping its inference speed.