End-to-end (E2E) models have become mainstream in the research field of automatic speech recognition (ASR). One advantage of E2E models is the simplicity of the model structure: a single neural network receives an acoustic feature sequence and directly generates an output token sequence. It does not need the separate acoustic, language, and lexicon models commonly used in conventional ASR systems. There has been a lot of work aiming to improve E2E models [8, 3, 1, 20, 27, 29]. Several studies reported that E2E models achieved performance comparable to or better than conventional ASR systems in production systems [7, 22] and on publicly available corpora [19, 16, 14] such as Librispeech, Switchboard, and the Corpus of Spontaneous Japanese (CSJ). One of the state-of-the-art E2E models is the Transformer, which significantly outperformed RNN-based E2E models [15, 14].
Many of the above E2E models assume left-to-right autoregressive generation of the output token sequence. In the speech production context, this assumption is reasonable because speech is produced in left-to-right order given linguistic content. In the speech perception context, however, it is unclear whether left-to-right decoding is always best. For example, when we listen to speech and encounter a word whose pronunciation is unclear, we leave it as uncertain and re-estimate it using future context. Mimicking this perceptual process in ASR is scientifically important and also has the potential to improve on left-to-right decoding.
One family of non-left-to-right E2E models is the non-autoregressive Transformer (NAT), which has been heavily investigated in neural machine translation (NMT) [11, 24, 13, 4]. Mask-predict, one of the NAT models, has been applied to speech recognition. It realizes non-autoregressive output token generation by introducing a special token which masks part of the output token sequence. During training, some tokens in the output sequence are randomly masked and the model is trained to estimate the masked tokens, with non-causal masking in the self-attention of the decoder. During decoding, the output token sequence is generated by iteratively estimating the masked tokens. Because there is no causal masking, non-left-to-right, non-autoregressive output sequence generation is realized. It achieved performance competitive with an autoregressive model, with faster decoding, on AISHELL.
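The masking step of mask-predict training can be sketched as follows. This is a minimal illustration, not the authors' implementation; the mask symbol name and the toy sequence are assumptions:

```python
import random

MASK = "<mask>"  # assumed name for the special mask token

def mask_tokens(tokens, num_to_mask, rng):
    """Randomly replace `num_to_mask` tokens with the mask symbol, returning
    the masked sequence and the positions the model must recover."""
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions

rng = random.Random(0)
masked, targets = mask_tokens(["h", "e", "l", "l", "o"], 2, rng)
```

At inference time the roles are reversed: the sequence starts fully masked and the model fills in the masked positions over a few iterations.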
However, mask-predict needs an additional component or heuristics to estimate the output token sequence length. To overcome this disadvantage, insertion-based models have been proposed. In theory, these models can generate an output token sequence in an arbitrary order without any additional component or heuristics to estimate its length. In NMT, performance competitive with the autoregressive Transformer has been reported with fewer decoding iterations.
This paper proposes using insertion-based models for E2E ASR, with an in-depth investigation of three insertion-based models originally proposed for NMT. In addition, we introduce a new formulation for joint modeling of connectionist temporal classification (CTC) and insertion-based models. This formulation can be viewed as modeling the joint distribution of the CTC probability and the insertion-based sequence generation probability; hence the CTC probability depends on the insertion-based output token sequence generation. With this new formulation, the monotonic alignment property of CTC is reinforced by insertion-based token generation. It achieves performance competitive with an autoregressive left-to-right model under a similar decoding condition, in a non-left-to-right, non-autoregressive manner.
2 Related work
To the best of our knowledge, this is the first work to apply insertion-based models to ASR tasks.
As mentioned in Section 1, this work is another type of NAT applied to ASR tasks, compared with the mask-predict approach. Our work does not use a special mask token and does not need to estimate the output sequence length in advance. Furthermore, our work can handle autoregressive and non-autoregressive models in a single formulation, and also introduces a new formulation for joint modeling of CTC and insertion-based models.
Non-autoregressive E2E ASR using a CTC-like model is proposed in Imputer. It assumes that the alignment at the i-th generation step depends on the alignment at the (i−1)-th step. The alignment is estimated in a non-autoregressive manner, similarly to mask-predict. Our work differs in using insertion-based models and in the joint modeling of CTC and insertion-based token sequence generation.
3 Insertion-based end-to-end models
Let X = (x_t ∈ ℝ^D | t = 1, …, T) be a D-dimensional acoustic feature sequence and Y = (y_n ∈ V | n = 1, …, N) be an output token sequence, where T is the input length, V is a set of distinct tokens, and N is the output length. Then, decoding of the E2E model is performed to maximize the posterior probability:

Ŷ = argmax_Y p(Y | X). (1)
Training of the E2E model is also based on this criterion. The difference between the various E2E models lies in how the posterior in Eq. (1) is defined.
In the insertion-based models, it is assumed to be marginalized over all possible insertion orders (permutations). Let z ∈ Z be an insertion order. For example, suppose N = 4; then Z is the set of all permutations of the ordering of 4 tokens, i.e.,

Z = {(1, 2, 3, 4), (1, 2, 4, 3), …, (4, 3, 2, 1)}, (2)

and Y^z denotes the output token sequence permuted with an insertion order z. The number of all permutations is N!. Then, the posterior in Eq. (1) is factorized with the sum and product rules as:

p(Y | X) = Σ_{z ∈ Z} p(z | X) p(Y^z | z, X) (3)
= Σ_{z ∈ Z} p(z | X) Π_{n=1}^{N} p(y^z_n | Y^z_{1:n−1}, z, X). (4)
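For the 4-token example above, the size of the set of insertion orders can be checked directly (a trivial sketch):

```python
import itertools

# All insertion orders of a 4-token sequence are the permutations of (1, 2, 3, 4);
# the marginalization over insertion orders sums over every one of them.
orders = list(itertools.permutations(range(1, 5)))  # |Z| = 4! = 24
```

The factorial growth of this set is why practical training samples orders from a prior rather than enumerating them.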
We assume that the insertion order does not depend on the input features; hence p(z | X) = p(z).
The definition of p(y^z_n | Y^z_{1:n−1}, z, X) in Eq. (4) differs between the insertion-based models and is explained in the following subsections. Note that the left-to-right autoregressive model can be interpreted as a special case where z is fixed to the left-to-right order and p(z) in Eq. (4) puts all its mass on that order.
When training the model of Eq. (4), a lower bound of the log-likelihood log p(Y | X; θ), where θ denotes the model parameters, is maximized under a predefined prior distribution p(z) over the insertion order z:

log p(Y | X; θ) ≥ E_{z ∼ p(z)} [ log p(Y^z | z, X; θ) ]. (5)
In the following subsections, three existing insertion-based models are explained.
3.1 Insertion-based decoding
Insertion-based decoding (InDIGO) is an insertion-based model using Transformer with a relative-position representation. Let R^{z,n} be the relative-position representation at generation step n under an insertion order z. Element r^{z,n}_{i,j} is defined as:

r^{z,n}_{i,j} = −1 if y^z_i is to the left of y^z_j, 0 if i = j, and 1 if y^z_i is to the right of y^z_j. (6)

The term p(y^z_n | Y^z_{1:n−1}, z, X) in Eq. (4) of InDIGO is defined as:

p(y^z_n, r^z_n | Y^z_{1:n−1}, R^{z,n−1}, X) (7)
= p(y^z_n | Y^z_{1:n−1}, R^{z,n−1}, X) p(r^z_n | y^z_n, Y^z_{1:n−1}, R^{z,n−1}, X), (8)

where r^z_n is the n-th column vector of R^{z,n}. The factorized form in Eq. (8) is modeled by Transformer. Let H ∈ ℝ^{d×(n−1)} be the final output of the decoder layer of Transformer, where d is the dimension of the self-attention. For the word prediction term, a linear transformation and a softmax operation are applied to H; for the position prediction term, a pointer network is used. (9)
During decoding, the next token to be inserted is estimated by the word prediction term, and then its position is estimated by the position prediction term in Eq. (9). Because of this sequential operation, only a single token is generated per iteration during decoding. Therefore, InDIGO can decode in a non-left-to-right order but requires N iterations.
3.2 Insertion Transformer and KERMIT
Another type of insertion-based model is the Insertion Transformer and KERMIT (Kontextuell Encoder Representations Made by Insertion Transformations). The basic formulation of these two models is the same. Let c_n be the token to be inserted and l_n be the position where the token is inserted at the n-th generation step under an insertion order z. The term p(y^z_n | Y^z_{1:n−1}, z, X) in Eq. (4) is defined as:

p(y^z_n | Y^z_{1:n−1}, z, X) = p(c_n, l_n | Ŷ^z_{1:n−1}, X), (10)

where Ŷ^z_{1:n−1} is the sorted token sequence at the (n−1)-th generation step. For example, in the case of Y = (y_1, y_2, y_3, y_4) and z = (2, 3, 1, 4), i.e., Y^z = (y_2, y_3, y_1, y_4), the sorted token sequence after the 2nd generation step is Ŷ^z_{1:2} = (y_2, y_3), and at the 3rd step the token c_3 = y_1 is inserted at position l_3 = 0. Note that l_n is a position relative to the hypothesis at the n-th generation step. In the case explained above, l_3 ∈ {0, 1, 2} because there are two tokens in the previous hypothesis.
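The insertion operation at each generation step amounts to a plain list insertion (a minimal illustration; the token names are made up):

```python
def insert(hyp, token, pos):
    """Insert `token` into the current hypothesis at slot `pos`
    (0 = before the first token, len(hyp) = after the last)."""
    return hyp[:pos] + [token] + hyp[pos:]

# Build ["A", "B", "C"] by two insertions into ["B"]:
step1 = insert(["B"], "C", 1)   # -> ["B", "C"]
step2 = insert(step1, "A", 0)   # -> ["A", "B", "C"]
```

The position argument is always relative to the current hypothesis, which is why there are len(hyp) + 1 valid slots at every step.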
The difference between the Insertion Transformer and KERMIT is the matrix H used when the posterior is calculated. In the case of the Insertion Transformer, the final output of the decoder layer of Transformer is used as H. On the other hand, KERMIT uses only the encoder block of Transformer: the acoustic feature sequence and the token embeddings are concatenated and fed into the encoder block, and the final output of the encoder layer is sliced to keep only the token positions and then used as H. This difference is depicted in Figure 1.
Using the matrix H, the posterior in Eq. (10) is calculated as:

p(c_n, l_n | Ŷ^z_{1:n−1}, X) = p(c_n | l_n, Ŷ^z_{1:n−1}, X) p(l_n | Ŷ^z_{1:n−1}, X), (11)

where the word and position prediction terms are calculated by a linear transformation of H followed by a softmax.
There are two ways of decoding. The first is autoregressive greedy decoding, directly using the posterior in Eq. (11):

(ĉ_n, l̂_n) = argmax_{c, l} p(c, l | Ŷ^z_{1:n−1}, X). (12)
The second is non-autoregressive parallel decoding, using only the word prediction term in Eq. (11):

ĉ_l = argmax_c p(c | l, Ŷ^z_{1:n−1}, X) for every slot l simultaneously. (13)
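One parallel decoding iteration can be sketched as follows: every slot of the current hypothesis picks its argmax token at the same time, and a slot whose best prediction is an end-of-slot symbol inserts nothing. This is a toy sketch with hand-made scores; the symbol name is an assumption:

```python
EOS = "<eos>"  # assumed end-of-slot symbol that means "insert nothing here"

def parallel_step(hyp, slot_scores):
    """One non-autoregressive iteration: every one of the len(hyp) + 1 slots
    inserts its argmax token simultaneously; a slot predicting EOS stays empty."""
    out = []
    for i, scores in enumerate(slot_scores):
        best = max(scores, key=scores.get)
        if best != EOS:
            out.append(best)
        if i < len(hyp):
            out.append(hyp[i])
    return out

# Both slots around "B" fire in the same iteration:
new_hyp = parallel_step(["B"], [{"A": 0.9, EOS: 0.1}, {"C": 0.8, EOS: 0.2}])  # -> ["A", "B", "C"]
```

Decoding stops once every slot predicts the end-of-slot symbol.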
When the balanced binary tree insertion order proposed for the Insertion Transformer is used as p(z), parallel decoding empirically finishes within ⌈log2 N⌉ + 1 iterations. This order inserts the centermost tokens of the current hypothesis. For example, suppose Y = (A, B, C, D, E, F, G); then the hypothesis grows like (D) → (B, D, F) → (A, B, C, D, E, F, G).
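The balanced binary tree order can be sketched as follows: in every iteration, each not-yet-inserted span contributes its centermost token, so the hypothesis roughly doubles per iteration. This is a sketch, not the authors' implementation:

```python
import math

def bbt_steps(target):
    """Return the hypothesis after each parallel iteration when tokens are
    inserted in balanced binary tree (BBT) order: every not-yet-covered span
    contributes its centermost token per iteration."""
    hyp_idx = []                 # indices of target tokens already inserted
    spans = [(0, len(target))]   # half-open spans of not-yet-inserted indices
    steps = []
    while spans:
        new_spans = []
        for lo, hi in spans:
            mid = (lo + hi) // 2  # centermost token of this span
            hyp_idx.append(mid)
            if lo < mid:
                new_spans.append((lo, mid))
            if mid + 1 < hi:
                new_spans.append((mid + 1, hi))
        hyp_idx.sort()
        steps.append([target[i] for i in hyp_idx])
        spans = new_spans
    return steps

# For a 7-token sequence the hypothesis grows (D) -> (B, D, F) -> (A, ..., G):
steps = bbt_steps(list("ABCDEFG"))
```

With this order the number of iterations stays logarithmic in N, which is what makes parallel decoding fast.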
4 Insertion-based/CTC joint modeling
Speech is generated in a left-to-right order, hence the alignment between input and output is monotonic. Accordingly, an E2E model trained jointly with a CTC objective has been reported to achieve faster convergence and higher accuracy. It is natural to apply this technique to insertion-based models. In the case of InDIGO and the Insertion Transformer, the network is composed of an encoder and a decoder, so the technique can be applied in the same way as in the hybrid CTC/attention architecture. However, because KERMIT consists of only an encoder, a new formulation must be introduced.
Let Y′ be the output token sequence to be modeled by CTC; usually, Y′ = Y. Joint modeling extends p(y^z_n | Y^z_{1:n−1}, z, X) in Eq. (4) as:

p(Y′, y^z_n | Y^z_{1:n−1}, z, X) = p(Y′ | y^z_n, Y^z_{1:n−1}, z, X) p(y^z_n | Y^z_{1:n−1}, z, X). (14)

The term p(Y′ | y^z_n, Y^z_{1:n−1}, z, X) in Eq. (14) is modeled by CTC. Let A = (a_t ∈ V ∪ {∅} | t = 1, …, T) be a sequence of tokens extended with a blank symbol ∅, and let B be the mapping function which deletes repetitions and blank symbols from a sequence; hence Y′ = B(A). The CTC probability is formulated as:

p(Y′ | y^z_n, Y^z_{1:n−1}, z, X) = Σ_{A ∈ B^{−1}(Y′)} Π_{t=1}^{T} p(a_t | y^z_n, Y^z_{1:n−1}, z, X). (15)
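The CTC mapping that deletes repetitions and blank symbols can be sketched directly (a minimal illustration; the blank symbol's spelling is an assumption):

```python
BLANK = "<b>"  # blank symbol used by CTC (name is an assumption)

def ctc_collapse(path):
    """Apply the CTC mapping: merge consecutive repeated labels, then remove
    blanks, so e.g. (a, a, <b>, b, <b>, b) maps to (a, b, b)."""
    out = []
    prev = None
    for a in path:
        if a != prev and a != BLANK:
            out.append(a)
        prev = a
    return out

collapsed = ctc_collapse(["a", "a", BLANK, "b", BLANK, "b"])  # -> ["a", "b", "b"]
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents repeated tokens.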
In the case of InDIGO and the Insertion Transformer, the final output of the encoder layer, H_enc, is used to calculate p(Y′ | ·) in Eq. (15), dropping the dependence on the token sequence:

p(Y′ | y^z_n, Y^z_{1:n−1}, z, X) ≈ Σ_{A ∈ B^{−1}(Y′)} Π_{t=1}^{T} p(a_t | X), (16)

where p(a_t | X) is calculated by applying a linear transformation and softmax to H_enc.
For KERMIT, p(Y′ | ·) in Eq. (15) cannot be approximated as in Eq. (16). KERMIT consists of only an encoder; the acoustic feature sequence and the token embeddings are concatenated and then fed into the encoder block, so the output of the encoder block depends on both the acoustic features and the output token sequence. There are several possible ways to calculate p(Y′ | ·) in Eq. (15). In this work, the output of the KERMIT encoder is sliced at the acoustic-feature positions and used to compute p(a_t | y^z_n, Y^z_{1:n−1}, z, X). This process is depicted in Figure 1. This formulation can reinforce CTC by making it dependent not only on the acoustic feature sequence but also on the output token sequence from insertion-based generation. Note that this formulation still retains its non-autoregressive characteristics.
When training the model, in order to balance the two terms in Eq. (14), a CTC weight λ is introduced:

L = λ log p(Y′ | y^z_n, Y^z_{1:n−1}, z, X) + (1 − λ) log p(y^z_n | Y^z_{1:n−1}, z, X). (17)
During decoding, either the CTC part or the insertion part in Eq. (14) can be used.
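The weighted training objective can be sketched as a simple interpolation of the two negative log-likelihood terms (a minimal sketch; the function name and the numeric values are illustrative):

```python
def joint_loss(ctc_nll, insertion_nll, ctc_weight):
    """Interpolate the CTC and insertion negative log-likelihoods with a
    weight in [0, 1]; 1.0 keeps only the CTC term, 0.0 only the insertion term."""
    assert 0.0 <= ctc_weight <= 1.0
    return ctc_weight * ctc_nll + (1.0 - ctc_weight) * insertion_nll

loss = joint_loss(ctc_nll=2.0, insertion_nll=4.0, ctc_weight=0.5)  # -> 3.0
```

The weight lets training trade off the monotonic-alignment signal from CTC against the insertion-based generation objective.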
5 Experiments

Table 1: Results for each CTC weight on CSJ 271h, TEDLIUM2, and AISHELL.
We used three corpora: the Corpus of Spontaneous Japanese (CSJ), TEDLIUM2, and AISHELL. As a baseline model we chose CTC, similar to prior work using Transformer encoder layers. Another baseline is the autoregressive Transformer (AT).
The three insertion-based models described in Section 3 are compared to the baselines. For the parameters of these models, we followed the Transformer recipe of ESPnet. The numbers of layers for the encoder and decoder were 12 and 6, respectively. We increased the number of encoder layers to 18 for CTC and KERMIT because they are composed of only an encoder. To simplify the comparison, we focused on two types of priors for p(z) in Eq. (4): left-to-right (L2R) and balanced binary tree (BBT). L2R is evaluated in order to see whether explicit modeling of the token insertion positions makes a difference with respect to AT. BBT is chosen because it can empirically decode a length-N sequence in ⌈log2 N⌉ + 1 iterations. In training with the BBT prior, we increased the number of epochs from 50 to 300, because only a single step of output token sequence generation is trained per minibatch, whereas with the L2R prior the whole sequence generation can be trained.
Since our insertion-based models do not have a beam search algorithm, we mainly compare the L2R-prior insertion-based models with AT (beam=1) and the BBT-prior insertion-based models with CTC (beam=1).
First, the results of AT and the insertion-based models with the L2R prior are shown in the upper part of Table 1. Compared to AT without beam search, the performance of the insertion-based models trained with the CTC objective is mostly better, except on the eval2 set of CSJ.
Next, the models with the BBT prior are compared in the lower part of Table 1. These are non-autoregressive models, hence their performance is first compared to CTC. Unfortunately, the Insertion Transformer with the BBT prior cannot compete with CTC, even with hybrid CTC training. KERMIT with joint CTC training, the new formulation introduced in Section 4, achieved better performance than CTC on CSJ and TEDLIUM2. Notably, on several test sets (highlighted by underlined numbers in Table 1), it achieved better performance in a non-autoregressive manner than AT without beam search.
One observation from the experiments with the L2R prior is that explicit modeling of the output token position and hybrid training with the CTC objective worked complementarily, improving the quality of the decoded hypotheses.
Another observation is that, when the BBT prior is used, the new formulation of joint training with CTC introduced in this paper seems to combine the benefits of left-to-right and non-left-to-right generation orders, at least on CSJ and TEDLIUM2. However, the performance of the BBT prior on AISHELL was not as good as CTC. On this task, unlike the others, the performance of CTC is close to that of AT, and the effectiveness of the generation order may depend on the language or task.
6 Conclusions

This paper proposed applying three insertion-based models, originally proposed for NMT, to ASR tasks. In addition, we introduced a new formulation for joint training of an insertion-based model and CTC. Our experiments show that InDIGO and the Insertion Transformer trained with the L2R prior achieved performance comparable to or better than the autoregressive Transformer without beam search. Models trained with the BBT prior and the proposed formulation, which retains non-autoregressive characteristics, achieved better performance than CTC and were competitive with the autoregressive Transformer without beam search on CSJ and TEDLIUM2, but were not effective on AISHELL. In future work, we will investigate more solid use of insertion-based models, including an extension of the decoding algorithm with beam search.
References

- Deep speech 2: end-to-end speech recognition in English and Mandarin. In Proc. of the 33rd International Conference on Machine Learning (ICML), pp. 173–182.
- (2017) AIShell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In Proc. Oriental COCOSDA 2017.
- (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
- (2019) KERMIT: generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.
- (2020) Imputer: sequence modelling via imputation and dynamic programming. arXiv preprint arXiv:2002.08926.
- (2020) Listen and fill in the missing letters: non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908.
- (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778.
- (2015) Attention-based models for speech recognition. In Proc. Advances in Neural Information Processing Systems (NIPS) 28, pp. 577–585.
- (1992) SWITCHBOARD: telephone speech corpus for research and development. In Proc. 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 517–520.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of the 23rd International Conference on Machine Learning (ICML), pp. 369–376.
- (2017) Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
- (2019) Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics 7, pp. 661–676.
- (2019) Levenshtein transformer. In Proc. Advances in Neural Information Processing Systems (NIPS) 32, pp. 11181–11191.
- (2019) A comparative study on transformer vs RNN in speech applications. arXiv preprint arXiv:1909.06317.
- (2019) Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. Interspeech 2019, pp. 1408–1412.
- (2019) RWTH ASR systems for LibriSpeech: hybrid vs attention. In Proc. Interspeech 2019, pp. 231–235.
- (2000) Spontaneous speech corpus of Japanese. In Proc. of the Second International Conference on Language Resources and Evaluation (LREC'00).
- (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
- (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. Interspeech 2019, pp. 2613–2617.
- (2017) A comparison of sequence-to-sequence models for speech recognition. In Proc. Interspeech 2017, pp. 939–943.
- (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 3935–3939.
- (2020) A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. In Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6059–6063.
- (2019) Self-attention networks for connectionist temporal classification in speech recognition. In Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7115–7119.
- (2019) Insertion transformer: flexible sequence generation via insertion operations. In Proc. of the International Conference on Machine Learning (ICML), pp. 5976–5985.
- (2017) Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NIPS) 30, pp. 5998–6008.
- (2015) Pointer networks. In Proc. Advances in Neural Information Processing Systems (NIPS) 28, pp. 2692–2700.
- (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253.
- (2018) ESPnet: end-to-end speech processing toolkit. In Proc. Interspeech 2018, pp. 2207–2211.
- Improved training of end-to-end attention models for speech recognition. In Proc. Interspeech 2018, pp. 7–11.