1 Introduction
Non-autoregressive Transformers (NAT, Gu et al., 2018) introduce a parallel decoding paradigm with substantially higher decoding efficiency than autoregressive models (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). Unlike autoregressive models, NAT models impose a conditional independence assumption among output words to support parallel decoding of whole sentences during inference. This property has attracted many researchers to explore NAT in machine translation (Gu et al., 2018; Lee et al., 2018; Kaiser et al., 2018) and text-to-speech tasks (Chen et al., 2019; Peng et al., 2020).
Many researchers have devoted themselves to improving NAT's inferior generation quality, e.g., modeling word interdependencies via curriculum learning (Guo et al., 2020a; Liu et al., 2020) or iterative refinement mechanisms (Ghazvininejad et al., 2019; Guo et al., 2020b), introducing latent variables to decompose target sentences and serve as a springboard for decoding (Shu et al., 2019; Ma et al., 2019; Bao et al., 2021), and introducing inductive biases for model training (Wei et al., 2019; Li et al., 2019). The most successful method is the Glancing Transformer (GLAT, Qian et al., 2021a), which trains the NAT model by sampling partial target words as inputs to predict the remaining target words, explicitly building dependencies between the observed and unobserved words. Qian et al. (2021b) employ GLAT to achieve impressive results on the WMT21 translation task (http://statmt.org/wmt21/), even outperforming many strong autoregressive translation systems in BLEU score (Papineni et al., 2002).
Although existing NAT models achieve competitive results compared to autoregressive models on translation tasks, it is not negligible that they still need the help of an autoregressive Transformer (AT, Vaswani et al., 2017) as a teacher for training, i.e., sequence-level knowledge distillation (Kim and Rush, 2016). A well-recognized explanation is the multi-modality problem (Zhou et al., 2020; Sun and Yang, 2020): each input may have multiple valid outputs in the dataset, which prevents NAT models from learning to organize consistent outputs. Training with the outputs of an AT directly bypasses the multi-modal phenomenon in the dataset, effectively improving model performance.
However, training NAT models by knowledge distillation has limitations. First, it requires training an extra AT model, which inevitably enlarges the training cost. Second, it is hard to guarantee that the teacher (AT) model is accurate enough in all text generation settings, so the teacher may become the bottleneck for its student NAT model. Therefore, training a NAT model from scratch without the help of an AT model remains an open and interesting problem.
In this paper, we propose latent-GLAT, which can directly learn from the raw dataset. It alleviates the multi-modality problem in a divide-and-conquer spirit: we introduce a small set of discrete latent variables to capture the categorical information of target words and divide the original goal into latent variable modeling and sentence reconstruction. First, the categorical information exhibits fewer multi-modality phenomena than the original words and thus can be learned directly without knowledge distillation. Second, the word categorical information is informative for sentence reconstruction. We can therefore extend glancing training with these discrete latent variables for modeling the sentence, encouraging the model to build dependencies on word categorical information rather than on the words themselves, which works more robustly.
Experimental results on the WMT14, Quora, and DailyDialog datasets show that latent-GLAT achieves remarkable improvements over several strong baselines, verifying its effectiveness. More impressively, latent-GLAT even outperforms autoregressive models on the Quora and DailyDialog datasets, further validating our motivation for removing knowledge distillation. In-depth analyses indicate that the introduced discrete latent variables help alleviate the multi-modality problem and are necessary for the performance improvements.
2 Background
For a sequence-to-sequence task of predicting a target sequence $y = \{y_1, \cdots, y_T\}$ given an input sequence $x = \{x_1, \cdots, x_{T'}\}$, the classical autoregressive factorization decomposes the probability $P(y \mid x)$ into a series of conditional probabilities:

$P(y \mid x; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x; \theta)$,   (1)

where $y_{<t} = \{y_1, \cdots, y_{t-1}\}$ denotes the prefix of the target sequence.
Although such factorization has achieved great success in previous studies (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it predicts each word (strictly, BPE tokens in our experiments; for clarity, we use "words" and "tokens" interchangeably in this paper) based on the prefix words, which may suffer from error accumulation and slow decoding during inference.
Non-autoregressive Transformer.
To tackle the above problems, Gu et al. (2018) first propose the non-autoregressive Transformer (NAT), introducing a non-autoregressive factorization:

$P(y \mid x; \theta) = \prod_{t=1}^{T} p(y_t \mid x; \theta)$,   (2)

where each word $y_t$ is modeled independently. During inference, the NAT model decodes all words simultaneously by taking $\hat{y}_t = \arg\max_{y_t} p(y_t \mid x; \theta)$ for each position $t$, remarkably improving efficiency (about 15$\times$ speedup over an autoregressive Transformer).
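To make the contrast concrete, here is a minimal decoding sketch in PyTorch-style Python; the `step_fn` interface, the id arguments, and the tensor shapes are our own illustrative assumptions, not the API of any particular NAT implementation.

```python
import torch

def decode_autoregressive(step_fn, bos_id, eos_id, max_len):
    """Sequential decoding per Eq. (1): each token conditions on the prefix.
    `step_fn(prefix) -> logits over the next token` is an assumed interface."""
    prefix = [bos_id]
    for _ in range(max_len):
        next_id = int(torch.argmax(step_fn(prefix)))
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]

def decode_nonautoregressive(logits):
    """Parallel decoding per Eq. (2): positions are conditionally independent,
    so one argmax over a [T, vocab] logits tensor yields the whole sentence."""
    return torch.argmax(logits, dim=-1).tolist()  # single pass, no loop
```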
However, the independence assumption prevents the NAT model from leveraging the inherent word dependencies to organize consistent outputs. Consequently, the efficiency improvements of NAT come at the cost of quality, e.g., a performance degradation of more than 10.0 BLEU (Papineni et al., 2002) points on machine translation tasks (Gu et al., 2018). Besides, recent studies (Zhou et al., 2020; Sun and Yang, 2020) point out that the multi-modality phenomenon in the dataset aggravates the challenge for NAT models.
Glancing Transformer.
To mitigate the missing word dependencies in NAT models, Qian et al. (2021a) propose the Glancing Transformer (GLAT), introducing glancing training (GLT), which samples partial target tokens as decoder inputs for training NAT:

$\mathcal{L}_{\mathrm{GLT}} = -\sum_{y_t \in \bar{\mathbb{G}}(y)} \log p(y_t \mid \mathbb{G}(y), x; \theta)$,   (3)

where $\mathbb{G}(y)$ denotes the sampled partial target tokens and $\bar{\mathbb{G}}(y)$ is its complement set. GLAT progressively decreases the sampling ratio during training and obtains better performance on machine translation tasks.
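The sampling step of glancing training can be sketched as follows; the first-pass prediction and the Hamming-distance-based sample count follow Qian et al. (2021a), while the function and variable names are our own assumptions.

```python
import torch

def glancing_mask(targets, first_pass_logits, ratio):
    """Return a boolean mask marking positions whose reference tokens are
    revealed to the decoder (G(y)); the rest remain prediction targets."""
    first_pass = torch.argmax(first_pass_logits, dim=-1)   # [B, T] initial guess
    hamming = (first_pass != targets).sum(dim=-1)          # errors per sentence
    n_reveal = (ratio * hamming).long()                    # tokens to reveal
    mask = torch.zeros_like(targets, dtype=torch.bool)
    for i, n in enumerate(n_reveal):
        idx = torch.randperm(targets.size(1))[: int(n)]    # random positions
        mask[i, idx] = True
    return mask

# Usage: replace decoder inputs at masked positions with reference embeddings,
# then compute the loss of Eq. (3) only over the unmasked (unobserved) tokens.
```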
Nevertheless, we find in experiments that GLAT still suffers from a multi-modality problem (we include details of GLAT in Appendix A): first, its sampling ratio cannot be decreased to zero during training, which leaves an exposure bias between training and inference; second, it still relies heavily on a teacher model for further improvements (Qian et al., 2021a).
Latent Transformer.
To alleviate the multi-modality problem, Kaiser et al. (2018); Shu et al. (2019); Ma et al. (2019); Bao et al. (2021) propose Latent Transformer (LT) models, introducing latent variables $z$ into NAT predictions:

$P(y \mid x; \theta) = \sum_{z} P(z \mid x; \theta) \cdot P(y \mid z, x; \theta)$,   (4)

where $P(z \mid x; \theta)$ is typically trained by variational inference (Ma et al., 2019) or discretization techniques (Kaiser et al., 2018). Such latent variables are decomposed from the target sentence; they are informative enough to determine the mode of the sentence and thus alleviate the multi-modality problem.
Although Latent Transformer models improve performance in terms of BLEU, the autoregressive predictors (Kaiser et al., 2018; Bao et al., 2021) or deep iterative transformations (Shu et al., 2019; Ma et al., 2019) they use for predicting latent variables unavoidably sacrifice overall decoding efficiency. Besides, they do not explicitly build interdependencies among the outputs.
3 Proposed Method: latent-GLAT
In this section, we present latent-GLAT. latent-GLAT follows the Latent Transformer models (Kaiser et al., 2018; Bao et al., 2021) but introduces glancing training (Qian et al., 2021a) with the discrete latent variables. Our intuitions are as follows:
First, the introduced discrete latent variables may have fewer modes than the words while remaining informative enough to determine the modes of the sentences. In such a case, we can directly learn the discrete latent variables with glancing training (Qian et al., 2021a), keeping competitive inference efficiency. More importantly, we can employ the latent variables to invoke glancing training for modeling the target sentences, since they are informative enough to reduce the multi-modality of the original sentences. Besides, glancing at latent variables also works robustly because we can still obtain the latent variables during inference.
3.1 Introducing Discrete Latent Variables for Modeling Target Categorical Information
In this part, we describe the structure of latent-GLAT, which introduces a small set of discrete latent variables into a NAT model, largely following Kaiser et al. (2018); Roy et al. (2018); Bao et al. (2021).
Let $K$ be the size of the discrete latent space and let $\mathcal{K}$ denote the set $\{1, 2, \cdots, K\}$. For each target sentence $y$, we use a same-length latent variable sequence $z = \{z_1, z_2, \cdots, z_T\}$ to model it:

$P(y \mid x; \theta) = \sum_{z} P(z \mid x; \theta) \cdot P(y \mid z, x; \theta)$,   (5)

where $z_t \in \mathcal{K}$ and $\theta$ denotes the model parameters.
Discretization.
To discretize target sentences into latent variables, we use vector quantization (Roy et al., 2018), which works by dividing a large set of original vector representations into a small number of groups. We assign each token $y_t$ to the group whose maintained representation is nearest to the token's representation:

$z_t = \arg\min_{k \in \mathcal{K}} \lVert \mathrm{emb}(y_t) - q_k \rVert_2$,   (6)

where $q_k \in \mathbb{R}^{d_{\mathrm{model}}}$ is the maintained representation of group $k$ and $d_{\mathrm{model}}$ is its dimension. Following Bao et al. (2021), we use the word embedding $\mathrm{emb}(y_t)$ as the token representation. Finally, the model is trained to minimize

$\mathcal{L} = \mathcal{L}_y + \mathcal{L}_z$,   (7)

where $\mathcal{L}_y$ and $\mathcal{L}_z$ are the prediction losses for the words $y$ and the latent variables $z$, respectively.
The maintained representations are updated with an exponential moving average over each mini-batch of target tokens:

$q_k \leftarrow \lambda \cdot q_k + (1 - \lambda) \cdot \frac{1}{c_k} \sum_{t\,:\,z_t = k} \mathrm{emb}(y_t)$,   (8)

where $c_k$ is the number of tokens assigned to group $k$ in the mini-batch and $\lambda$ is the decay parameter, which we fix in our experiments.
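A compact sketch of Eqs. (6) and (8) is given below; the tensor shapes and the concrete decay value are assumptions for illustration.

```python
import torch

def quantize(token_emb, codebook):
    """Eq. (6): assign each token to its nearest code.
    token_emb: [N, d] token representations; codebook: [K, d] group vectors."""
    dists = torch.cdist(token_emb, codebook)   # [N, K] pairwise L2 distances
    return dists.argmin(dim=-1)                # [N] group indices z_t

@torch.no_grad()
def ema_update(codebook, token_emb, codes, decay=0.999):  # decay value assumed
    """Eq. (8): move each used code toward the mean embedding of its tokens."""
    K = codebook.size(0)
    one_hot = torch.nn.functional.one_hot(codes, K).float()  # [N, K] assignments
    counts = one_hot.sum(dim=0)                              # c_k per group
    sums = one_hot.t() @ token_emb                           # [K, d] summed embs
    used = counts > 0
    mean = sums[used] / counts[used].unsqueeze(-1)
    codebook[used] = decay * codebook[used] + (1 - decay) * mean
```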
Architecture.
As shown in Figure 1, latent-GLAT mainly consists of an encoder (NAT Encoder), a latent predictor (NAT Predictor), and a decoder (Mix. Decoder). We parameterize them with multi-head attention-based encoder and decoder stacks, similar to the Transformer (Vaswani et al., 2017). Their functions can be summarized as follows: the encoder maps the source sentence $x$ to hidden representations, the latent predictor predicts the latent sequence $z$ from these representations, and the decoder reconstructs the sentence $y$ conditioned on both. We also use an extra module to predict the target length and initialize the decoder inputs with the mechanism of Wei et al. (2019).
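A structural sketch of this composition is shown below. It uses stock PyTorch Transformer blocks as stand-ins; the class names, dimensions, maximum length, and the additive way the latent embeddings are injected are all illustrative assumptions (only the 4-layer predictor/decoder depth follows Section 4.1).

```python
import torch
import torch.nn as nn

class LatentGLAT(nn.Module):
    """Minimal sketch: NAT Encoder -> NAT (latent) Predictor -> Mix. Decoder."""

    def __init__(self, vocab_size, num_codes, d_model=512, nhead=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, 6)        # NAT Encoder
        self.latent_predictor = nn.TransformerDecoder(dec_layer, 4)  # NAT Predictor
        self.decoder = nn.TransformerDecoder(dec_layer, 4)           # Mix. Decoder
        self.code_embed = nn.Embedding(num_codes, d_model)
        self.len_head = nn.Linear(d_model, 256)   # length predictor; max length assumed
        self.code_head = nn.Linear(d_model, num_codes)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, dec_inputs):
        h = self.encoder(self.embed(src_ids))                    # source states
        len_logits = self.len_head(h.mean(dim=1))                # target length
        code_logits = self.code_head(self.latent_predictor(dec_inputs, h))
        z_emb = self.code_embed(code_logits.argmax(-1))          # predicted latents
        # Injecting latent information additively is an assumption of this sketch.
        word_logits = self.word_head(self.decoder(dec_inputs + z_emb, h))
        return len_logits, code_logits, word_logits
```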
3.2 Glancing at Discrete Latent Variables for Parallel Sequence Decoding
The small number $K$ of discrete latent variables can capture high-level categorical information of the target words, supporting a better learning design for parallel sequence decoding.
Our first insight is that we can learn to predict the discretized latent variables directly and non-autoregressively, without the help of distillation. Specifically, we parameterize $P(z \mid x; \theta)$ in a non-autoregressive fashion and optimize it with the glancing training technique (GLT, Qian et al., 2021a), as shown in Figure 1(a):

$\mathcal{L}_z = -\sum_{z_t \in \bar{\mathbb{G}}(z)} \log p(z_t \mid \mathbb{G}(z), x; \theta)$,   (9)

where $\mathbb{G}(z)$ is uniformly sampled from the discretized latent sequence $z$, following Qian et al. (2021a). We provide more training details of latent-GLAT in Appendix B.
Our next insight is to model the sentence based on the sampled latent variables $\mathbb{G}(z)$ rather than on sampled reference words alone, namely, glancing at $z$ for optimizing $P(y \mid z, x; \theta)$:

$\mathcal{L}_y = -\sum_{t} \log p(y_t \mid \mathbb{G}(z), x; \theta)$.   (10)
We find that Eqn. (10) works robustly in experiments and analyze it in Section 4.3.
As shown in Figure 1(b), we eventually employ the words to invoke glancing training as well. Namely, we optimize $P(y \mid z, x; \theta)$ by minimizing

$\mathcal{L}_y = -\sum_{y_t \in \bar{\mathbb{G}}(y)} \log p(y_t \mid \mathbb{G}(y), \mathbb{G}(z), x; \theta)$,   (11)

where $\mathbb{G}(y)$ and $\mathbb{G}(z)$ are the sampled target tokens and discrete latent variables, respectively.
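The following sketch shows how glancing with both words and latent variables (Eqs. 9-11) could assemble the decoder inputs; the masks come from a GLAT-style sampler such as `glancing_mask` above, and all names here are our own assumptions.

```python
import torch

def mix_decoder_inputs(dec_init, word_emb, code_emb, word_mask, code_mask):
    """dec_init:  [B, T, d] initialized decoder inputs
    word_emb:  [B, T, d] embeddings of reference tokens y
    code_emb:  [B, T, d] embeddings of discretized latents z
    word_mask/code_mask: [B, T] booleans marking glanced (revealed) positions."""
    mixed = dec_init.clone()
    mixed[code_mask] = code_emb[code_mask]   # reveal sampled latent variables G(z)
    mixed[word_mask] = word_emb[word_mask]   # reveal sampled reference tokens G(y)
    return mixed

# The word loss (Eq. 11) is then computed only at positions outside `word_mask`,
# while the latent loss (Eq. 9) is computed outside the latent glancing mask.
```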
Overall Training Loss.
Our full-fledged loss includes the latent variable prediction, sentence reconstruction, and length prediction losses:

$\mathcal{L} = \mathcal{L}_z + \mathcal{L}_y + \alpha \cdot \mathcal{L}_{\mathrm{len}}$,   (12)

where $\alpha$ is the hyper-parameter that adjusts the importance of the length prediction loss $\mathcal{L}_{\mathrm{len}}$.

3.3 Inference
In the inference phase, latent-GLAT predicts the target length, the latent variables, and the sentence in turn.
For the target length, latent-GLAT first predicts it with the length predictor. To mitigate length prediction errors, latent-GLAT expands the predicted length to a small range of candidate lengths (six candidates in total in our experiments). Then, latent-GLAT predicts the latent variables and the sentence for each length candidate.
Similar to Ma et al. (2019), latent-GLAT ranks the candidates by itself (self-reranking) and chooses the output with the highest score:

$\hat{y} = \arg\max_{y \in \mathcal{C}} \frac{\log P(y \mid z, x; \theta)}{|y|^{\beta}}$,   (13)

where $\beta$ is the length penalty ratio used to avoid length bias, $\mathcal{C}$ is the set of candidate outputs, and $|y|$ denotes the length of $y$.
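The inference procedure can be sketched as follows; the candidate offsets and the penalty value are assumptions, since the text above only fixes the number of candidates at six.

```python
def length_candidates(pred_len, offsets=(-2, -1, 0, 1, 2, 3)):
    """Expand the predicted length into six candidates; the exact offsets here
    are an assumption, only the count of six comes from the setup above."""
    return [max(1, pred_len + off) for off in offsets]

def rerank(candidates, log_probs, beta=1.1):  # beta value assumed
    """Self-reranking per Eq. (13): pick the candidate with the highest
    length-normalized log-probability log P(y|z,x) / |y|^beta."""
    scores = [lp / (len(c) ** beta) for c, lp in zip(candidates, log_probs)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```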
4 Experiments
We conduct experiments on several generation tasks, including machine translation, paraphrase generation, and dialog generation.
4.1 Experimental Setup
Dataset.
We choose the most popular benchmarks for each task:

Machine Translation (MT): We follow previous practices for NAT models and use the WMT14 English (EN) <-> German (DE) corpus (4.5M sentence pairs) and the IWSLT14 German (DE) -> English (EN) corpus (160K sentence pairs) to validate our proposed model. We obtain the datasets following the instructions open-sourced in fairseq (https://github.com/pytorch/fairseq). In detail, we first tokenize the datasets with the Moses script. Then, we use 37,000 and 10,000 operations to split the words into byte-pair encodings (BPE, Sennrich et al., 2016) for the WMT14 and IWSLT14 datasets, respectively. We also share subword embeddings between the source and target languages for each dataset.
Paraphrase Generation (PG): We use the Quora dataset (https://www.kaggle.com/c/quora-question-pairs/data) to evaluate the paraphrase generation task. The Quora dataset contains around 135K labeled paraphrase pairs. Following the standard dataset split, we sample 100K sentence pairs from the labeled paraphrases as training data and hold out 30K pairs for testing; the remaining roughly 5K pairs are used for validation. As in the MT tasks, we tokenize the corpus with Moses scripts and split the words into BPE units with 32K operations in total.

Dialog Generation (DG): We conduct dialog generation experiments on the DailyDialog dataset (Li et al., 2017). We obtain the processed DailyDialog dataset from Bao et al. (2020) (https://github.com/gmftbyGMFTBY/MultiTurnDialogZoo). The training set contains 87,170 sentence pairs (11,118 dialogues). The validation and test sets contain 8,069 pairs (1,000 dialogues) and 7,740 pairs (1,000 dialogues), respectively.
Note that these tasks emphasize different aspects. The MT task aims to transfer bilingual sentences under a semantically invariant condition. The PG task differs from machine translation in that it works on mode transformation within the same language: its goal is to synthesize a sentence different from the original input that conveys the same meaning. The DG task is the most challenging due to its complex generation goal.
Implementations.
We compare latent-GLAT with the Transformer (Vaswani et al., 2017), NAT (Gu et al., 2018), and GLAT (Qian et al., 2021a) models. We implement them based on the open-source framework fairseq (Ott et al., 2019).
For machine translation tasks, we use the base setting of the Transformer (Vaswani et al., 2017) for the WMT14 dataset and a smaller setting for the IWSLT14 dataset. The numbers of layers in the latent-GLAT decoder and latent predictor are both set to 4 in our experiments. We use inverse-square-root learning rate scheduling for WMT14 and a linearly annealed learning rate schedule over 250K steps for IWSLT14. The models are optimized with the Adam optimizer (Kingma and Ba, 2015) for 300K steps on WMT14 and 250K steps on IWSLT14. As for the ratio used in glancing sampling, we linearly anneal it over the whole training run. The mini-batch in each step consists of 2K tokens for IWSLT14 and 64K tokens for WMT14.
Since the scales of the Quora and DailyDialog datasets are close to that of IWSLT14, we keep the same settings as for IWSLT14, including the Adam optimizer, the linearly annealed learning rate schedule, and the batch size (2K tokens).
Evaluation.
To validate the effectiveness of our proposed method, we evaluate it in terms of quality and efficiency. We use tokenized and cased BLEU scores (Papineni et al., 2002), computed with the fairseq-score script, to evaluate the generation quality on the MT and PG tasks. For dialog generation, we also include BLEU-1 and BLEU-2 scores for analysis. Following common practice (Gu et al., 2018; Qian et al., 2021a), we measure the decoding latency of each model by decoding sentence by sentence and compute the speedup relative to the autoregressive Transformer (AT) model to reflect decoding efficiency. We highlight the best NAT result.
Table 1: Main results on WMT14, IWSLT14, Quora, and DailyDialog. Latency and speedup are measured relative to the autoregressive Transformer (AT).

| Models | WMT14 EN-DE | WMT14 DE-EN | IWSLT14 DE-EN | Quora (BLEU) | DailyDialog (BLEU-1) | (BLEU-2) | (BLEU) | Latency | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Transformer (AT) | 27.17 | 31.53 | 34.29 | 27.97 | 31.40 | 10.70 | 5.05 | 512.3 ms | 1.00x |
| NAT | 10.78 | 15.19 | 17.77 | 24.65 | 41.50 | 1.40 | 0.01 | 33.5 ms | 15.29x |
| GLAT | 16.71 | 24.78 | 29.07 | 27.01 | 39.50 | 26.20 | 26.13 | 33.5 ms | 15.29x |
| latent-GLAT | 24.71 | 29.16 | 32.31 | 29.11 | 41.00 | 28.30 | 27.50 | 45.3 ms | 11.31x |
4.2 Main Results
We can see from Table 1 that latent-GLAT outperforms the NAT baselines (NAT and GLAT) in generation quality on almost all tasks while keeping a competitive decoding speedup relative to the autoregressive counterpart.
Machine Translation.
As seen, without the help of an AT model for training, the vanilla NAT and the advanced GLAT model only obtain inferior generation quality. In contrast, latent-GLAT achieves competitive generation quality on machine translation tasks, indicating that the introduced latent variables effectively reduce the multi-modality issue and support glancing training well. It narrows the performance gap between non-autoregressive and autoregressive decoding from 11.46 (GLAT vs. AT) to 2.34 (latent-GLAT vs. AT) BLEU points on the WMT14 EN-DE task while keeping high decoding efficiency.
Paraphrasing.
Unlike the translation tasks, the performance gap between non-autoregressive and autoregressive decoding on paraphrase generation is minor (NAT vs. AT, -3.32 BLEU points; GLAT vs. AT, -0.96 BLEU points). Nevertheless, introducing discrete latent variables is still helpful for obtaining better performance: latent-GLAT realizes a non-autoregressive model that outperforms the autoregressive model on Quora (latent-GLAT vs. AT, +1.14 BLEU points).
Dialog Generation.
We can see a different trend on the DailyDialog dataset: the AT model performs worse than the NAT models. Both GLAT and latent-GLAT outperform the AT model in BLEU-1, BLEU-2, and BLEU scores, indicating that these models recall more reference tokens and organize the tokens well.
We conjecture that the weak and indirect association between the inputs and outputs in dialogue causes this unusual phenomenon. Specifically, the weak connection may encourage the AT model to predict tokens by paying more attention to its history outputs, degenerating into a target-side language model. In contrast, NAT models do not have this fast track, which pushes them to pay more attention to the inputs and recall more target tokens. We further find so-called safe responses (Li et al., 2016) in the AT's outputs, which supports our conjecture.
Table 2: Comparison with advanced NAT models on machine translation tasks (BLEU). "-" indicates results not reported; the three CMLM rows correspond to different numbers of refinement iterations.

| Models | WMT14 EN-DE | WMT14 DE-EN | IWSLT14 DE-EN | Speedup |
|---|---|---|---|---|
| CMLM | 10.88 | - | - | - |
| CMLM | 22.06 | - | - | 9.79x |
| CMLM | 24.65 | - | - | 3.77x |
| LevT | 24.43 | - | - | 2.93x |
| LV-NAR | 11.80 | - | - | 22.30x |
| SynST | 20.74 | 25.50 | 23.82 | 4.86x |
| Flowseq | 20.85 | 25.40 | - | 1.10x |
| CNAT | 21.30 | 25.73 | 29.81 | 10.37x |
| AT | 27.17 | 31.53 | 34.29 | 1.00x |
| NAT | 10.78 | 15.19 | 17.77 | 15.29x |
| GLAT | 16.71 | 24.78 | 29.07 | 15.29x |
| latent-GLAT | 24.71 | 29.16 | 32.31 | 11.31x |
More Comparisons.
We further compare with advanced NAT models that build upon latent variables or iterative refinement on machine translation tasks. Table 2 shows that introducing latent variables (LV-NAR, Flowseq, and CNAT) or decoding with multiple iterations (CMLM and LevT) both improve non-autoregressive decoding in translation quality. However, iterative refinement and deep transformations always sacrifice decoding efficiency. In contrast, the proposed latent-GLAT outperforms all the NAT models at a relatively low cost, keeping a competitive speedup over the autoregressive Transformer (AT). Specifically, latent-GLAT with one-pass decoding narrows the performance gap to the AT from 5.87 BLEU points to 2.34 BLEU points on the WMT14 EN-DE test set.
Decoding Efficiency.
We can see that there is a trade-off between translation quality and decoding efficiency in Table 2. We therefore present a scatter plot of the different models in Figure 3, showing the trend between translation quality and decoding efficiency. As seen, latent-GLAT is located at the top-right of the baselines: it outperforms the baselines in BLEU score at any fixed decoding speedup, and in decoding speedup at any fixed BLEU score.
4.3 Analysis
We now turn to verifying our intuition that latent-GLAT can alleviate the multi-modality problem.
Table 3: Effect of sequence-level knowledge distillation (KD) on different models (BLEU). "Avg Δ" is the average gain from KD over the three tasks.

| Methods | WMT14 EN-DE | WMT14 DE-EN | IWSLT14 DE-EN | Avg Δ |
|---|---|---|---|---|
| NAT | 10.78 | 15.19 | 17.77 | +6.58 |
| w/ KD | 17.69 | 22.02 | 23.78 | |
| GLAT | 16.71 | 24.78 | 29.07 | +5.19 |
| w/ KD | 25.21 | 29.84 | 31.07 | |
| Flowseq | 20.85 | 25.40 | 24.75 | +2.87 |
| w/ KD | 23.72 | 28.39 | 27.55 | |
| CNAT | 21.30 | 25.73 | 29.81 | +3.08 |
| w/ KD | 25.56 | 29.36 | 31.15 | |
| latent-GLAT | 24.71 | 29.16 | 32.31 | +0.95 |
| w/ KD | 26.64 | 29.93 | 32.47 | |
latent-GLAT largely alleviates the sentence-level multi-modality problem.
Previous research (Gu et al., 2018; Ma et al., 2019; Qian et al., 2021a; Bao et al., 2021) usually utilizes a Transformer model as a teacher for training NAT models, namely sequence-level knowledge distillation (Kim and Rush, 2016), which directly reduces the sentence-level multi-modal phenomenon in datasets. Therefore, we use the average gain from knowledge distillation to reflect a NAT model's ability to overcome this issue: the smaller the gain, the better the model copes with the raw data on its own.
As seen in Table 3, the pure NAT models rely heavily on knowledge distillation. By introducing target information through latent variables (Flowseq and CNAT) or sampled tokens (GLAT), the NAT models improve their ability to overcome the multi-modality issue. Our proposed latent-GLAT combines the above two techniques well: it obtains an average gain of only 0.95 BLEU points from distillation, validating our motivation.
Table 4: Token-level and sentence-level complexity (Zhou et al., 2020) of different prediction targets.

| Datasets | Configuration | Token-level | Sentence-level |
|---|---|---|---|
| WMT14 | Inputs → Raw outputs | 2.19 | 3.03 |
| WMT14 | Inputs → AT outputs | 1.38 | 2.13 |
| WMT14 | Inputs → Latent variables | 1.01 | 1.35 |
| Quora | Inputs → Raw outputs | 0.86 | 1.48 |
| DailyDialog | Inputs → Raw outputs | 1.19 | 4.23 |
Discrete latent variables have fewer modes than raw sentences.
To validate our intuition that the introduced latent variables are easier to predict than tokens, we follow Zhou et al. (2020) and compute complexity metrics on each dataset according to alignment relations. Specifically, we use the fast_align toolkit (https://github.com/clab/fast_align) to align the source inputs with the target outputs or with the discretized latent variable sequences $z$. Then, we compute the token-level complexity and the sentence-level complexity according to Zhou et al. (2020). These metrics can be trivially understood as the number of valid candidates for each input.

As shown in Table 4, the latent variables have the lowest token-level and sentence-level complexity. In other words, predicting the latent variable sequences is easier than predicting the others, which is consistent with our intuition. Although we can obtain a lower-complexity dataset by filtering the datasets with an autoregressive model (AT outputs versus Raw outputs), doing so may introduce model errors and requires extra training of the AT model. In contrast, the discrete latent variables are simple and informative enough to serve as a springboard for modeling target sentences.
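For intuition, a rough sketch of a token-level measure in this spirit is given below; it averages the entropies of the per-source-token alignment distributions, which is our simplified reading of Zhou et al. (2020) rather than their exact formula.

```python
import math
from collections import Counter, defaultdict

def token_level_complexity(alignments):
    """alignments: iterable of (source_token, target_token) pairs over a corpus,
    e.g., extracted from fast_align output. For each source token, treat its
    aligned targets as a categorical distribution and average the entropies."""
    table = defaultdict(Counter)
    for src, tgt in alignments:
        table[src][tgt] += 1
    entropies = []
    for counter in table.values():
        total = sum(counter.values())
        h = -sum((c / total) * math.log(c / total) for c in counter.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)   # mean entropy over source tokens
```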
Table 5: Ablation study on WMT14 EN-DE (BLEU). "Introduce z" adds the discrete latent variables; the two glancing columns denote glancing training with the reference words y and with the latent variables z, respectively.

| L# | Introduce z | Glancing with y | Glancing with z | BLEU |
|---|---|---|---|---|
| 1 | | | | 12.60 |
| 2 | ✓ | | | 13.43 (+0.83) |
| 3 | | ✓ | | 17.11 (+4.51) |
| 4 | ✓ | ✓ | | 18.88 (+6.20) |
| 5 | ✓ | | ✓ | 22.35 (+9.75) |
| 6 | ✓ | ✓ | ✓ | 23.64 (+11.04) |
Glancing with latent variables improves the performance by a large margin.
We can see in Table 5 that introducing latent variables obtains performance gains over the corresponding counterparts (L#2 vs. L#1, +0.83 points, and L#4 vs. L#3, +1.77 points). As expected, the gains are largely improved when adopting glancing training with the discrete latent variables (L#5 vs. L#1, +9.75 points), which already outperforms glancing training with the reference tokens (L#5 vs. L#4, +3.47 points). Finally, we jointly perform glancing training with the reference tokens and the discrete latent variables, achieving the best result (L#6 vs. L#1, +11.04 points).
Table 6: BLEU and ACC scores on the WMT14 EN-DE validation set with different numbers of latent variables K.

| K | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|
| BLEU (%) | 20.80 | 22.16 | 22.61 | 23.64 | 23.26 | 21.94 |
| ACC (%) | 61.20 | 53.10 | 43.57 | 39.24 | 36.39 | 33.84 |
Effects of the number of latent variables and the length penalty.
As shown in Figure 4 and Table 6, we search the hyper-parameters of latent-GLAT, i.e., the number of discrete latent variables K and the length penalty ratio, according to the validation performance. We notice that using more latent codes causes performance degradation during inference, in which case the latent variables may degenerate into tokens and contain more prediction errors at inference time. latent-GLAT implemented with 64 latent variables and the selected length penalty obtains the best result on the WMT14 EN-DE validation set.
5 Related Work
Gu et al. (2018) first propose the non-autoregressive Transformer (NAT) model for neural machine translation (NMT) and begin the exploration of parallel decoding. It abandons explicit modeling of word interdependencies to decode tokens in parallel, significantly improving inference speed. However, its translation quality is inferior to the Transformer (Vaswani et al., 2017).

To alleviate this performance degradation, many researchers work on enhancing word dependency modeling, including imitation learning (Wei et al., 2019; Li et al., 2019), curriculum learning (Guo et al., 2020a; Liu et al., 2020), iterative refinement (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Guo et al., 2020b; Huang et al., 2022), and simplified autoregressive processes (Sun et al., 2019). The most representative method is the Glancing Transformer (Qian et al., 2021a), which adaptively and progressively samples partial tokens as inputs and predicts the remaining tokens, effectively establishing dependencies between the sampled and remaining tokens. However, these models still rely on a teacher for training and cannot directly learn from raw datasets that contain one-to-many multi-modality phenomena.

Introducing latent variables (Bao et al., 2019, 2021) to organize the target sentence is also a helpful route. Among such work, our method is closest to Kaiser et al. (2018); Shu et al. (2019); Ma et al. (2019); Akoury et al. (2019); Bao et al. (2021). These methods decompose latent variables (hints) from the target sentence and divide the original goal into two parts: modeling the latent variables, and modeling the target sentence based on the latent variables. This implicitly overcomes the multi-modality of target sentences because the latent variables largely determine the mode of the sentence. However, these methods usually model the latent variables with an autoregressive predictor, which naturally sacrifices decoding efficiency.
Unlike them, our approach models the discrete latent variables in a non-autoregressive fashion and extends glancing training with the discrete latent variables. As a result, latent-GLAT achieves competitive performance in both decoding efficiency and quality.
6 Conclusion
We propose latent-GLAT, which can be trained directly without the help of knowledge distillation. Specifically, we employ discrete latent variables to capture word categorical information and divide the original goal into latent variable modeling and word prediction tasks. We then learn each task with glancing training, encouraging the model to build dependencies on the latent variables, which have fewer modes than the words yet remain informative for modeling the target sentences. Experimental results on machine translation, paraphrase generation, and dialogue generation tasks validate the effectiveness of latent-GLAT.
Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments. Shujian Huang is the corresponding author. This work is supported by National Science Foundation of China (No. U1836221, 6217020152).
References
Akoury et al. (2019). Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, pp. 1269–1281.

Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.

Bao et al. (2020). PLATO: Pre-trained dialogue generation model with discrete latent variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, pp. 85–96.

Bao et al. (2021). Non-autoregressive translation by learning target categorical codes. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), Online, pp. 5749–5759.

Bao et al. (2019). Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677.

Chen et al. (2019). Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition. arXiv preprint arXiv:1911.04908.

Gehring et al. (2017). Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, pp. 1243–1252.

Ghazvininejad et al. (2019). Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, pp. 6112–6121.

Gu et al. (2018). Non-autoregressive neural machine translation. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.

Gu et al. (2019). Levenshtein transformer. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, pp. 11179–11189.

Guo et al. (2020a). Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7839–7846.

Guo et al. (2020b). Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, pp. 376–385.

Hamming (1950). Error detecting and error correcting codes. The Bell System Technical Journal, 29(2), pp. 147–160.

Huang et al. (2022). Non-autoregressive translation with layer-wise prediction and deep supervision. In AAAI.

Kaiser et al. (2018). Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 2395–2404.

Kim and Rush (2016). Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, pp. 1317–1327.

Kingma and Ba (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.

Lee et al. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, pp. 1173–1182.

Li et al. (2016). A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT 2016, pp. 110–119.

Li et al. (2017). DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 986–995.

Li et al. (2019). Hint-based training for non-autoregressive translation. In NeurIPS (to appear).

Liu et al. (2020). Task-level curriculum learning for non-autoregressive neural machine translation. In AAAI.

Ma et al. (2019). FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, pp. 4282–4292.

Ott et al. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019 (Demonstrations), Minneapolis, Minnesota, pp. 48–53.

Papineni et al. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), Philadelphia, Pennsylvania, USA, pp. 311–318.

Peng et al. (2020). Non-autoregressive neural text-to-speech. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual Event, pp. 7586–7598.

Qian et al. (2021a). Glancing transformer for non-autoregressive neural machine translation. In ACL.

Qian et al. (2021b). The Volctrans GLAT system: Non-autoregressive translation meets WMT21. arXiv preprint arXiv:2109.11247.

Roy et al. (2018). Towards a better understanding of vector quantized autoencoders. arXiv preprint.

Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.

Shu et al. (2019). Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181.

Sun et al. (2019). Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, pp. 3011–3020.

Sun and Yang (2020). An EM approach to non-autoregressive conditional sequence generation. In International Conference on Machine Learning (ICML 2020), pp. 9249–9258.

Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, pp. 5998–6008.

Wei et al. (2019). Imitation learning for non-autoregressive neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, pp. 1304–1312.

Zhou et al. (2020). Understanding knowledge distillation in non-autoregressive machine translation. In 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia.
Appendix A Details of GLAT
According to the performance shown in Figure 4(a), we can see that a GLAT model degenerates to a NAT model when using a small sampling ratio. In this case, introducing an autoregressive Transformer as a teacher for training the GLAT model alleviates the issue (Figure 4(b)), indicating that the GLAT model still needs the help of knowledge distillation to alleviate multi-modality problems.
Appendix B Model Details of latent-GLAT
Decoder Inputs. As described in Section 3.1, we initialize the decoder inputs from the source representations with the mechanism of Wei et al. (2019); the training procedures below build on these inputs.
Training the Latent Predictor by glancing at sampled discrete latent variables.
With the decoder inputs and the discretized latent variable sequence $z$, we adopt the glancing sampling technique to train the latent predictor in the following steps (a code sketch follows the list):

1. Predicting $\hat{z}$: latent-GLAT predicts the latent variable sequence with its latent predictor, $\hat{z} = \arg\max_{z} P(z \mid x; \theta)$.

2. Determining the sample number $N_z$: given $\hat{z}$ and $z$, we compute the sampling number as

$N_z = \lambda \cdot d(\hat{z}, z)$,   (15)

where $\lambda$ is the sampling ratio, which decreases over the training steps, and $d(\cdot,\cdot)$ is the Hamming distance (Hamming, 1950), measuring the prediction quality.

3. Sampling observed latent variables $\mathbb{G}(z)$: given the discretized latent variable sequence $z$ and the sample number $N_z$, we obtain $\mathbb{G}(z)$ by randomly selecting $N_z$ elements from $z$.

4. Reconstructing inputs: we construct the new decoder inputs by position-wise replacing the original decoder inputs with the embeddings of $\mathbb{G}(z)$.

5. Updating the latent predictor: with the new inputs, we train the latent predictor to predict the unobserved references $\bar{\mathbb{G}}(z)$.
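The five steps can be condensed into the following sketch, which reuses the helpers sketched in Sections 2 and 3.2; the `predictor` interface and all names are our own assumptions rather than the released implementation.

```python
import torch

def glancing_step_for_latents(predictor, dec_inputs, enc_out, z_ref,
                              code_embed, ratio):
    # 1) Predict the latent sequence with the non-autoregressive predictor.
    logits = predictor(dec_inputs, enc_out)                  # [B, T, K]
    z_hat = logits.argmax(dim=-1)
    # 2) Sample number from the Hamming distance (Eq. 15).
    n = (ratio * (z_hat != z_ref).sum(dim=-1)).long()        # [B]
    # 3) Randomly select n positions whose reference latents are revealed.
    mask = torch.zeros_like(z_ref, dtype=torch.bool)
    for i, k in enumerate(n):
        mask[i, torch.randperm(z_ref.size(1))[: int(k)]] = True
    # 4) Position-wise replace decoder inputs with revealed latent embeddings.
    mixed = dec_inputs.clone()
    mixed[mask] = code_embed(z_ref)[mask]
    # 5) Train to predict the unobserved latents only.
    logits2 = predictor(mixed, enc_out)
    loss = torch.nn.functional.cross_entropy(
        logits2[~mask], z_ref[~mask])                        # L_z over complement
    return loss
```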
Training the Mix. Decoder with sampled discrete latent variables.
Training of the Mix. Decoder largely follows Qian et al. (2021a), except that it additionally uses latent variables as inputs. With the input $x$, the reference sentence $y$, and the sampled latent variables $\mathbb{G}(z)$, we train the Mix. Decoder in the following steps:

1. Predicting $\hat{y}$: latent-GLAT predicts the target sentence, $\hat{y} = \arg\max_{y} P(y \mid \mathbb{G}(z), x; \theta)$.

2. Determining the sample number $N_y$: given $\hat{y}$ and $y$, we compute the sampling number $N_y$ analogously to Eq. (15).

3. Sampling target tokens $\mathbb{G}(y)$: we obtain the glancing reference $\mathbb{G}(y)$ by randomly selecting $N_y$ tokens from the reference sequence $y$.

4. Reconstructing inputs: the new decoder inputs are constructed by position-wise replacing the original decoder inputs with the embeddings of $\mathbb{G}(y)$.

5. Updating the Mix. Decoder: we then train the Mix. Decoder to predict the unobserved references $\bar{\mathbb{G}}(y)$, with $\mathbb{G}(y)$ and $\mathbb{G}(z)$ as inputs.