1 Introduction
End-to-end (E2E) models have become mainstream in the research field of automatic speech recognition (ASR). One advantage of E2E models is the simplicity of the model structure: a single neural network receives an acoustic feature sequence and directly generates an output token sequence, with no need for the separate acoustic, language, and lexicon models used in conventional ASR systems. There has been a lot of work aiming to improve E2E models [8, 3, 1, 20, 27, 29]. Several studies reported that E2E models achieve comparable or better performance than conventional ASR systems, both in product systems [7, 22] and on publicly available corpora [19, 16, 14] such as LibriSpeech [18], Switchboard [9], and the Corpus of Spontaneous Japanese (CSJ) [17]. One of the state-of-the-art E2E models is the Transformer [25], which significantly outperformed RNN-based E2E models [15, 14].
Many of the above E2E models assume left-to-right autoregressive generation of the output token sequence. From the viewpoint of speech production, this assumption is reasonable because speech is produced in a left-to-right order given the linguistic content. From the viewpoint of speech perception, however, it is unclear whether left-to-right decoding is always best. For example, when we listen to speech and encounter a word whose pronunciation is unclear, we leave it as uncertain and re-estimate it using future context. Mimicking this perceptual process in ASR is scientifically important and also has the potential to improve on left-to-right decoding.
One family of non-left-to-right E2E models is the non-autoregressive Transformer (NAT), which has been heavily investigated in neural machine translation (NMT) [11, 24, 13, 4]. Mask-predict, one of the NAT models, has been applied to speech recognition [6]. It realizes non-autoregressive output token generation by introducing a special token that masks part of an output token sequence. During training, some tokens in the output sequence are randomly masked and the model is trained to estimate the masked tokens, with non-causal masking in the self-attention of the decoder. During decoding, the output token sequence is generated by iteratively estimating the masked tokens. Because there is no causal masking, non-left-to-right, non-autoregressive output sequence generation is realized. It achieved performance competitive with an autoregressive model, with faster decoding, on AISHELL [2].
However, mask-predict needs an additional component or heuristics to estimate the output token sequence length. To overcome this disadvantage, insertion-based models were proposed [24]. In theory, such a model can generate an output token sequence in an arbitrary order, without any additional component or heuristics for estimating the output sequence length. In NMT, performance competitive with the autoregressive Transformer has been reported with fewer decoding iterations [4].
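As an illustration of the mask-predict training scheme described above, the following sketch (illustrative only; the function name, masking ratio, and mask symbol are our own, not from any specific implementation) randomly hides part of a token sequence and keeps the hidden tokens as prediction targets:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, seed=0):
    """Randomly mask part of an output token sequence, as in mask-predict
    training: the model is trained to recover the masked positions."""
    rng = random.Random(seed)
    n_mask = rng.randint(1, len(tokens))             # how many tokens to hide
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}      # ground truth for masked slots
    return masked, targets

masked, targets = mask_tokens(list("hello"))
```

At decoding time, every output position starts masked, and the model repeatedly fills in the positions it is most confident about.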
This paper proposes using insertion-based models for E2E ASR, with an in-depth investigation of three insertion-based models originally proposed for NMT. In addition, we introduce a new formulation for joint modeling of connectionist temporal classification (CTC) [10] and insertion-based models. This formulation can be viewed as modeling the joint distribution of the CTC probability and the insertion-based sequence generation probability; the CTC probability thus depends on the insertion-based output token generation. With this new formulation, the monotonic alignment property of CTC is reinforced by insertion-based token generation. It achieves performance competitive with an autoregressive left-to-right model decoded under a similar decoding condition, in a non-left-to-right, non-autoregressive manner. The source code will be publicly available in the open-source E2E modeling toolkit ESPnet [28].
2 Related work
To the best of our knowledge, this is the first work to apply insertion-based models to ASR tasks.
As mentioned in Section 1, this work is another type of NAT applied to ASR, compared with [6]: our work does not use a special mask token and does not need to estimate the output sequence length in advance. Furthermore, our work handles autoregressive and non-autoregressive models in a single formulation and introduces a new formulation for joint modeling of CTC and insertion-based models.
Non-autoregressive E2E ASR using a CTC-like model is proposed in Imputer [5]. It assumes that the alignment at the current generation step depends on the alignment at the previous step. The alignment is estimated similarly to mask-predict in a non-autoregressive manner. Our work differs in using insertion-based models and in jointly modeling CTC and insertion-based token sequence generation.
3 Insertion-based end-to-end models
Let $X = (\mathbf{x}_t \in \mathbb{R}^{d} \mid t = 1, \dots, T)$ be a $d$-dimensional acoustic feature sequence and $C = (c_l \in \mathcal{V} \mid l = 1, \dots, L)$ be an output token sequence, where $T$ is the input length, $\mathcal{V}$ is a set of distinct tokens, and $L$ is the output length. Decoding of the E2E model is performed to maximize the posterior probability $p(C \mid X)$:

$$\hat{C} = \operatorname*{arg\,max}_{C} \log p(C \mid X). \tag{1}$$

Training of the E2E model is also based on this criterion. The difference between the various E2E models lies in how the posterior in Eq. (1) is defined.
In insertion-based models, the posterior is assumed to be marginalized over all possible insertion orders (permutations). Let $z \in \mathcal{Z}$ be an insertion order. For example, suppose $L = 4$; then $\mathcal{Z}$ is the set of all permutations of $(1, 2, 3, 4)$, i.e.,

$$\mathcal{Z} = \{(1,2,3,4), (1,2,4,3), \dots, (4,3,2,1)\}. \tag{2}$$

For an insertion order such as $z = (2, 4, 1, 3)$, the permuted output token sequence is

$$C^{z} = (c_2, c_4, c_1, c_3), \tag{3}$$

where $C^{z}$ denotes the output token sequence permuted with insertion order $z$. The number of all permutations is $L!$. Then, the posterior in Eq. (1) is factorized with the sum and product rules as:

$$p(C \mid X) = \sum_{z \in \mathcal{Z}} p(C^{z} \mid z, X)\, p(z \mid X). \tag{4}$$

We assume that the insertion order does not depend on the input features, hence $p(z \mid X) = p(z)$.
The definition of $p(C^{z} \mid z, X)$ in Eq. (4) differs between insertion-based models and is explained in the following subsections. Note that the left-to-right autoregressive model can be interpreted as a special case where $p(z)$ puts all probability mass on the left-to-right order and $p(C^{z} \mid z, X)$ in Eq. (4) is $\prod_{l=1}^{L} p(c_l \mid c_1, \dots, c_{l-1}, X)$.
When training the model of Eq. (4), a lower bound of the log-likelihood $\log p(C \mid X; \theta)$, where $\theta$ denotes the model parameters, is maximized under a predefined prior distribution $p(z)$ over insertion orders:

$$\log p(C \mid X; \theta) \ge \mathbb{E}_{z \sim p(z)}\big[\log p(C^{z} \mid z, X; \theta)\big], \tag{5}$$

which follows from Jensen's inequality. In the following subsections, three existing insertion-based models are explained.
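The marginalization in Eq. (4) and the Jensen lower bound in Eq. (5) can be checked numerically on a toy example. In the sketch below, the scores are arbitrary stand-ins for $p(C^z \mid z, X)$, not model outputs:

```python
import itertools
import math

def marginal_posterior(order_scores, prior):
    """Eq. (4) on a toy scale: p(C|X) = sum_z p(C^z|z,X) p(z)."""
    return sum(order_scores[z] * prior[z] for z in order_scores)

def lower_bound(order_scores, prior):
    """Eq. (5): E_{z~p(z)}[log p(C^z|z,X)], a Jensen lower bound of log p(C|X)."""
    return sum(prior[z] * math.log(order_scores[z]) for z in prior)

# All 3! = 6 insertion orders of a 3-token sequence.
orders = list(itertools.permutations(range(3)))
scores = {z: 0.10 + 0.05 * i for i, z in enumerate(orders)}  # dummy p(C^z|z,X)
prior = {z: 1.0 / len(orders) for z in orders}               # uniform p(z)

log_marginal = math.log(marginal_posterior(scores, prior))
bound = lower_bound(scores, prior)
```

Because the exact sum over all $L!$ permutations is intractable for realistic $L$, training samples a single order $z$ from the prior per minibatch.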
3.1 Insertion-based decoding
Insertion-based decoding (InDIGO) [12] is an insertion-based model using a Transformer with relative position representations. Let $R^{(k)}$ be the relative position representation at generation step $k$ under an insertion order $z$. Element $r^{(k)}_{i,j}$ is defined as:

$$r^{(k)}_{i,j} = \begin{cases} -1 & \text{token $i$ is to the left of token $j$,} \\ 0 & i = j, \\ 1 & \text{token $i$ is to the right of token $j$.} \end{cases} \tag{6}$$

$p(C^{z} \mid z, X)$ in Eq. (4) of InDIGO is defined as:

$$p(C^{z} \mid z, X) = \prod_{k=1}^{L} p\big(c^{z}_{k}, \mathbf{r}_{k} \mid C^{z}_{1:k-1}, R^{(k-1)}, X\big) \tag{7}$$

$$= \prod_{k=1}^{L} p\big(c^{z}_{k} \mid C^{z}_{1:k-1}, R^{(k-1)}, X\big)\, p\big(\mathbf{r}_{k} \mid c^{z}_{k}, C^{z}_{1:k-1}, R^{(k-1)}, X\big), \tag{8}$$

where $\mathbf{r}_{k}$ is the $k$-th column vector of $R^{(k)}$. The factorized form in Eq. (8) is modeled by a Transformer. Let $H^{(k)} \in \mathbb{R}^{d_{\mathrm{att}} \times k}$ be the final output of the decoder layer of the Transformer, where $d_{\mathrm{att}}$ is the dimension of the self-attention:

$$H^{(k)} = \mathrm{Decoder}\big(C^{z}_{1:k}, R^{(k)}, X\big). \tag{9}$$

For the word prediction term, a linear transform and a softmax operation are applied to $H^{(k)}$; for the position prediction term, a pointer network [26] is used. During decoding, the next token to be inserted is estimated by the word prediction term, and then its position is estimated by the position prediction term using Eq. (9). Because of this sequential operation, only a single token is generated per iteration during decoding. Therefore, InDIGO can decode in a non-left-to-right order but requires $L$ iterations.
3.2 Insertion Transformer and KERMIT
Another type of insertion-based model comprises the Insertion Transformer [24] and KERMIT (Kontextuell Encoder Representations Made by Insertion Transformations) [4]. The basic formulation of these two models is the same. Let $c^{\mathrm{ins}}_{k}$ be the token to be inserted and $l_{k}$ be the position where the token is inserted at the $k$-th generation step under an insertion order $z$. $p(C^{z} \mid z, X)$ in Eq. (4) is defined as:

$$p(C^{z} \mid z, X) = \prod_{k=1}^{L} p\big(c^{\mathrm{ins}}_{k}, l_{k} \mid \tilde{C}^{z}_{k-1}, X\big), \tag{10}$$

where $\tilde{C}^{z}_{k}$ is the sorted token sequence at the $k$-th generation step. For example, in the case of $C = (c_1, c_2, c_3, c_4)$ and $z = (2, 4, 1, 3)$, i.e., $C^{z} = (c_2, c_4, c_1, c_3)$, the hypothesis after the second step is $\tilde{C}^{z}_{2} = (c_2, c_4)$, and the third step inserts $c_1$ at position $l_3 = 0$.
Note that $l_{k}$ is a position relative to the hypothesis at the $(k-1)$-th generation step. In the case explained above, $l_{3} \in \{0, 1, 2\}$ because there are two tokens in the previous hypothesis.
The difference between the Insertion Transformer and KERMIT lies in the matrix $H$ used when the posterior is calculated. In the case of the Insertion Transformer, the final output of the decoder layer of the Transformer is used as $H$. KERMIT, on the other hand, uses only the encoder block of the Transformer: the acoustic feature sequence and the token embeddings are concatenated and fed into the encoder block, and the final output of the encoder layer is sliced at the positions corresponding to the tokens and then used as $H$. This difference is depicted in Figure 1.
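The concatenate-then-slice structure of KERMIT can be sketched as follows. The shapes are illustrative and the identity "encoder" is a stand-in, not the actual ESPnet implementation:

```python
# Illustrative shapes: T acoustic frames and L tokens, each a d_att-dim vector.
d_att, T, L = 4, 6, 3

acoustic = [[0.0] * d_att for _ in range(T)]   # stand-in projected acoustic features
token_emb = [[1.0] * d_att for _ in range(L)]  # stand-in token embeddings

# KERMIT feeds the concatenation through a single encoder stack; the
# "encoder" here is the identity, just to make the slicing explicit.
encoder_in = acoustic + token_emb              # (T + L) vectors
encoder_out = encoder_in                       # placeholder encoder

H_token = encoder_out[T:]      # slice used for insertion (word/position) prediction
H_acoustic = encoder_out[:T]   # slice reused for the CTC branch in Section 4
```

Because both modalities pass through one stack of self-attention layers, every output position can attend to both the acoustic frames and the partial hypothesis.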
Using the matrix $H$, the posterior in Eq. (10) is calculated as:

$$p\big(c^{\mathrm{ins}}_{k}, l_{k} \mid \tilde{C}^{z}_{k-1}, X\big) = \mathrm{softmax}\big(\mathrm{Linear}(H)\big), \tag{11}$$

i.e., the word and position prediction term is calculated by applying a linear transformation to $H$ followed by a softmax.
There are two ways of decoding. The first is autoregressive greedy decoding directly using the posterior in Eq. (11):

$$\big(\hat{c}^{\mathrm{ins}}_{k}, \hat{l}_{k}\big) = \operatorname*{arg\,max}_{c,\, l}\, p\big(c, l \mid \tilde{C}^{z}_{k-1}, X\big). \tag{12}$$

The second is non-autoregressive parallel decoding using only the word prediction term of Eq. (11), inserting a token into every slot $l$ of the current hypothesis simultaneously:

$$\hat{c}_{l} = \operatorname*{arg\,max}_{c}\, p\big(c \mid l, \tilde{C}^{z}_{k-1}, X\big). \tag{13}$$

When the balanced binary tree insertion order proposed in [24] is used as the prior $p(z)$, parallel decoding empirically finishes within $\lceil \log_2 L \rceil + 1$ iterations. This order inserts the centermost tokens of the current hypothesis. For example, suppose $C = (c_1, \dots, c_7)$; then the hypothesis grows as $(c_4) \rightarrow (c_2, c_4, c_6) \rightarrow (c_1, c_2, c_3, c_4, c_5, c_6, c_7)$.
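The balanced binary tree schedule can be sketched as a parallel insertion loop. This is an illustrative re-implementation of the centermost-insertion idea (ignoring end-of-slot termination), not the code of [24]:

```python
def bbt_rounds(tokens):
    """Yield the hypothesis after each parallel round of centermost
    insertions (balanced binary tree order): ~log2(L) rounds overall."""
    spans = [(0, len(tokens))]      # open slots as half-open index ranges
    hyp_idx = []                    # indices inserted so far, kept sorted
    while spans:
        new_spans = []
        for lo, hi in spans:
            mid = (lo + hi) // 2    # centermost token of this slot
            hyp_idx.append(mid)
            if lo < mid:
                new_spans.append((lo, mid))
            if mid + 1 < hi:
                new_spans.append((mid + 1, hi))
        hyp_idx.sort()
        spans = new_spans
        yield [tokens[i] for i in hyp_idx]

rounds = list(bbt_rounds(["c1", "c2", "c3", "c4", "c5", "c6", "c7"]))
```

For the 7-token example above, the rounds reproduce the growth pattern $(c_4) \rightarrow (c_2, c_4, c_6) \rightarrow (c_1, \dots, c_7)$ in three rounds.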
4 Insertion-based/CTC joint modeling
Speech is generated in a left-to-right order, hence the alignment between input and output is monotonic. Accordingly, an E2E model trained jointly with a CTC objective is reported to achieve faster convergence and higher accuracy [27]. It is natural to apply this technique to insertion-based models. In the case of InDIGO and the Insertion Transformer, the network is composed of an encoder and a decoder, so the technique can be applied in the same way as in [27]. However, KERMIT consists of only an encoder, so a new formulation must be introduced.
Let $C'$ be the output token sequence modeled by CTC; usually, $C'$ is set to $C$. Joint modeling extends the posterior in Eq. (4) as:

$$p(C \mid X) = \sum_{z \in \mathcal{Z}} p\big(C' \mid C^{z}, z, X\big)\, p\big(C^{z} \mid z, X\big)\, p(z). \tag{14}$$

The term $p(C' \mid C^{z}, z, X)$ in Eq. (14) is modeled by CTC. Let $A = (a_t \in \mathcal{V} \cup \{\varnothing\} \mid t = 1, \dots, T)$ be a sequence of tokens extended with a blank symbol $\varnothing$, and let $\mathcal{B}$ be a mapping function that deletes repetitions and blank symbols from a sequence, hence $\mathcal{B}(A) = C'$. The CTC probability is formulated as:

$$p\big(C' \mid C^{z}, z, X\big) = \sum_{A \in \mathcal{B}^{-1}(C')} \prod_{t=1}^{T} p\big(a_t \mid C^{z}, z, X\big). \tag{15}$$
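The mapping $\mathcal{B}$ can be sketched as follows (an illustrative stand-in using "-" as the blank symbol):

```python
def ctc_collapse(alignment, blank="-"):
    """The mapping B: remove repetitions first, then blanks, so that
    B(alignment) recovers the output token sequence C'."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out

# Several frame-level alignments map to the same token sequence; the
# CTC probability in Eq. (15) sums over all of them.
collapsed_1 = ctc_collapse(list("aa-b-bb"))
collapsed_2 = ctc_collapse(list("-a--bb-"))
```

Note that a blank between two identical tokens is what allows repeated tokens (here "b", "b") to survive the collapse.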
In the case of InDIGO and the Insertion Transformer, the final output of the encoder layer $H^{\mathrm{enc}}$ is used to calculate $p(a_t \mid C^{z}, z, X)$ in Eq. (15) as:

$$p\big(a_t \mid C^{z}, z, X\big) \approx p(a_t \mid X) = \mathrm{softmax}\big(\mathrm{Linear}(\mathbf{h}^{\mathrm{enc}}_t)\big), \tag{16}$$

where $\mathbf{h}^{\mathrm{enc}}_t$ is the $t$-th column of $H^{\mathrm{enc}}$, i.e., the probability is calculated by applying a linear transformation and a softmax to $H^{\mathrm{enc}}$.
For KERMIT, $p(a_t \mid C^{z}, z, X)$ in Eq. (15) cannot be approximated as in Eq. (16): KERMIT consists of only an encoder, and the acoustic feature sequence and token embeddings are concatenated and fed into the encoder block, so the output of the encoder block depends on both the acoustic features and the output token sequence. There might be several ways to calculate $p(a_t \mid C^{z}, z, X)$ in Eq. (15). In this work, the output of the KERMIT encoder is sliced at the positions corresponding to the acoustic features and used as:

$$p\big(a_t \mid C^{z}, z, X\big) = \mathrm{softmax}\big(\mathrm{Linear}(H[:, t])\big), \quad t = 1, \dots, T. \tag{17}$$

This process is depicted in Figure 1. This formulation can reinforce CTC by making it dependent not only on the acoustic feature sequence but also on the output token sequence from insertion-based generation. Note that this formulation still retains its non-autoregressive characteristics.
When training the model, in order to balance the ranges of the two terms in Eq. (14), a CTC weight $\lambda$ is introduced:

$$\mathcal{L} = -\lambda \log p\big(C' \mid C^{z}, z, X\big) - (1 - \lambda) \log p\big(C^{z} \mid z, X\big). \tag{18}$$

During decoding, either the CTC part or the insertion part of Eq. (14) can be used.
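The weighted objective of Eq. (18) reduces to a simple interpolation of the two log-likelihood terms. The sketch below uses placeholder probabilities, not model outputs:

```python
import math

def joint_loss(log_p_ctc, log_p_ins, ctc_weight):
    """Weighted combination of the two log-likelihood terms in Eq. (14),
    with the CTC weight (lambda in Eq. (18)) balancing their ranges."""
    return -(ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_ins)

# lambda = 1.0 trains only the CTC branch; lambda = 0.0 only the insertion branch.
loss_ctc_only = joint_loss(math.log(0.5), math.log(0.25), 1.0)
loss_ins_only = joint_loss(math.log(0.5), math.log(0.25), 0.0)
```

The intermediate settings (e.g., 0.3 or 0.9 in Table 1) let gradients from both branches shape the shared encoder.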
Table 1: Error rates (%) on CSJ (271 h), TEDLIUM2, and AISHELL. "Order" is the insertion-order prior; the CTC weight is shown for training and decoding.

| Model | Beam | Order | CTC weight (train) | CTC weight (decode) | CSJ Eval1 | CSJ Eval2 | CSJ Eval3 | TEDLIUM2 dev | TEDLIUM2 test | AISHELL dev | AISHELL test |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AT | 10 | — | 0.3 | 0.3 | 7.9 | 5.7 | 13.7 | 10.6 | 9.1 | 6.5 | 7.2 |
| AT | 1 | — | 0.3 | 0.3 | 8.1 | 5.4 | 13.9 | 12.7 | 10.1 | 6.7 | 8.1 |
| InDIGO | 1 | L2R | 0.0 | 0.0 | 8.4 | 6.2 | 14.7 | — | — | — | — |
| InDIGO | 1 | L2R | 0.3 | 0.3 | 7.8 | 5.5 | 13.3 | 13.6 | 9.6 | 6.1 | 6.7 |
| Insertion Transformer | 1 | L2R | 0.0 | 0.0 | 8.7 | 6.3 | 16.1 | — | — | — | — |
| Insertion Transformer | 1 | L2R | 0.3 | 0.3 | 8.3 | 5.4 | 13.9 | 11.2 | 9.6 | 6.8 | 7.6 |
| KERMIT | 1 | L2R | 0.0 | 0.0 | 11.0 | 8.0 | 18.9 | — | — | — | — |
| KERMIT | 1 | L2R | 0.9 | 0.9 | 9.2 | 6.7 | 15.7 | 14.9 | 12.4 | 7.7 | 8.9 |
| KERMIT | 1 | L2R | 1.0 | 1.0 | 9.5 | 6.7 | 14.8 | 16.1 | 15.4 | 7.8 | 8.8 |
| CTC | 1 | — | 1.0 | 1.0 | 8.5 | 6.1 | 13.8 | 16.1 | 16.3 | 6.8 | 7.6 |
| Insertion Transformer | 1 | BBT | 0.0 | 0.0 | 15.0 | 12.4 | 21.6 | — | — | — | — |
| Insertion Transformer | 1 | BBT | 0.3 | 0.3 | 14.1 | 10.8 | 18.0 | 19.1 | 16.3 | 9.6 | 10.6 |
| KERMIT | 1 | BBT | 0.0 | 0.0 | 12.5 | 9.7 | 18.5 | — | — | — | — |
| KERMIT (proposed formulation) | 1 | BBT | 0.9 | 0.9 | 11.5 | 9.1 | 16.7 | 18.8 | 15.0 | 9.8 | 10.9 |
| KERMIT (proposed formulation) | 1 | BBT | 1.0 | 1.0 | 7.9 | 5.5 | 13.6 | 12.0 | 10.7 | 6.9 | 7.8 |
5 Experiments
5.1 Setup
We used three corpora: the Corpus of Spontaneous Japanese (CSJ) [17], TEDLIUM2 [21], and AISHELL [2]. As a baseline model we chose CTC [10], similar to the work in [23], using Transformer encoder layers. Another baseline is the autoregressive Transformer (AT) [15].
The three insertion-based models described in Section 3 are compared with these baselines. For their parameters, we followed the Transformer recipe of ESPnet [28] based on [15]. The numbers of layers for the encoder and decoder were 12 and 6, respectively. We increased the number of encoder layers to 18 for CTC and KERMIT because they are composed of only an encoder. To simplify the comparison, we focused on two types of priors for $p(z)$ in Eq. (4): left-to-right (L2R) and balanced binary tree (BBT). L2R is evaluated to see whether explicitly modeling the insertion position of tokens makes a performance difference relative to AT. BBT is chosen because it can decode a length-$L$ sequence in empirically $\lceil \log_2 L \rceil + 1$ iterations. In training with the BBT prior, we increased the number of epochs from 50 to 300 because only a single step of output token sequence generation is trained per minibatch, while with the L2R prior the whole sequence generation can be trained.
Since our insertion-based models do not have a beam search algorithm, we mainly compare the L2R-prior insertion-based models with AT (beam = 1) and the BBT-prior insertion-based models with CTC (beam = 1).
5.2 Results
First, the results of AT and the insertion-based models with the L2R prior are shown in the upper part of Table 1. Compared with AT without beam search, the insertion-based models trained with the CTC objective mostly perform better, except on the Eval2 set of CSJ.
Next, the models with the BBT prior are compared in the lower part of Table 1. These are non-autoregressive models, hence their performance is first compared with CTC. Unfortunately, the Insertion Transformer with the BBT prior cannot compete with CTC, even with hybrid CTC training. KERMIT jointly trained with CTC, the new formulation introduced in Section 4, achieved better performance than CTC on CSJ and TEDLIUM2. Notably, on several test sets in Table 1 it outperformed AT without beam search in a non-autoregressive manner.
5.3 Discussion
One observation from the L2R-prior experiments is that explicit modeling of the position of an output token and hybrid training with the CTC objective work complementarily, improving the quality of the hypotheses in decoding.
Another observation is that with the BBT prior, the new formulation of joint training with CTC introduced in this paper appears to combine the benefits of left-to-right and non-left-to-right generation orders, at least on CSJ and TEDLIUM2. However, the performance of the BBT prior on AISHELL was not as good as CTC. On this task, unlike the others, CTC alone is already close to AT, and the effectiveness of the generation order may depend on the language or task.
6 Conclusions
This paper proposed applying three insertion-based models, originally proposed for NMT, to ASR tasks. In addition, we introduced a new formulation for joint training of an insertion-based model and CTC. Our experiments show that InDIGO and the Insertion Transformer trained with the L2R prior achieve performance comparable to or better than the autoregressive Transformer without beam search. Models trained with the BBT prior and the proposed formulation, which retains non-autoregressive characteristics, achieved better performance than CTC and performance competitive with the autoregressive Transformer without beam search on CSJ and TEDLIUM2, but were not as effective on AISHELL. In future work, we will investigate more solid use of insertion-based models, including an extension of the decoding algorithm with beam search.
References
[1] (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In Proc. of the 33rd International Conference on Machine Learning (ICML), pp. 173–182. Cited by: §1.
[2] (2017) AIShell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In Proc. Oriental COCOSDA 2017. Cited by: §1, §5.1.
[3] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
[4] (2019) KERMIT: generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604. Cited by: §1, §3.2.
[5] (2020) Imputer: sequence modelling via imputation and dynamic programming. arXiv preprint arXiv:2002.08926. Cited by: §2.
[6] (2020) Listen and fill in the missing letters: non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908. Cited by: §1, §2.
[7] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. Cited by: §1.
[8] (2015) Attention-based models for speech recognition. In Proc. Advances in Neural Information Processing Systems (NIPS) 28, pp. 577–585. Cited by: §1.
[9] (1992) SWITCHBOARD: telephone speech corpus for research and development. In Proc. 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 517–520. Cited by: §1.
[10] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of the 23rd International Conference on Machine Learning (ICML), pp. 369–376. Cited by: §1, §5.1.
[11] (2017) Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281. Cited by: §1.
[12] (2019) Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics 7, pp. 661–676. Cited by: §3.1.
[13] (2019) Levenshtein transformer. In Proc. Advances in Neural Information Processing Systems (NIPS) 32, pp. 11181–11191. Cited by: §1.
[14] (2019) A comparative study on Transformer vs RNN in speech applications. arXiv preprint arXiv:1909.06317. Cited by: §1.
[15] (2019) Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. Interspeech 2019, pp. 1408–1412. Cited by: §1, §5.1.
[16] (2019) RWTH ASR systems for LibriSpeech: hybrid vs attention. In Proc. Interspeech 2019, pp. 231–235. Cited by: §1.
[17] (2000) Spontaneous speech corpus of Japanese. In Proc. of the Second International Conference on Language Resources and Evaluation (LREC'00). Cited by: §1, §5.1.
[18] (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §1.
[19] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pp. 2613–2617. Cited by: §1.
[20] (2017) A comparison of sequence-to-sequence models for speech recognition. In Proc. Interspeech 2017, pp. 939–943. Cited by: §1.
[21] (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 3935–3939. Cited by: §5.1.
[22] (2020) A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. In Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6059–6063. Cited by: §1.
[23] (2019) Self-attention networks for connectionist temporal classification in speech recognition. In Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7115–7119. Cited by: §5.1.
[24] (2019) Insertion transformer: flexible sequence generation via insertion operations. In Proc. of International Conference on Machine Learning (ICML), pp. 5976–5985. Cited by: §1, §3.2.
[25] (2017) Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NIPS) 30, pp. 5998–6008. Cited by: §1.
[26] (2015) Pointer networks. In Proc. Advances in Neural Information Processing Systems (NIPS) 28, pp. 2692–2700. Cited by: §3.1.
[27] (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §1, §4.
[28] (2018) ESPnet: end-to-end speech processing toolkit. In Proc. Interspeech 2018, pp. 2207–2211. Cited by: §1, §5.1.
[29] (2018) Improved training of end-to-end attention models for speech recognition. In Proc. Interspeech 2018, pp. 7–11. Cited by: §1.