Introduction
The Transformer has been the de facto architecture for neural machine translation
Vaswani et al. (2017). In this framework, the decoder generates words one by one in a left-to-right manner. Despite its strong performance, the autoregressive decoding method causes a large latency in the inference phase Gu et al. (2018). To break the bottleneck of the inference speed caused by the sequential conditional dependence, several non-autoregressive neural machine translation (NAT) models have been proposed to generate all tokens in parallel (Figure 1(a)) Gu et al. (2018); Łukasz Kaiser et al. (2018); Li et al. (2019); Ma et al. (2019). However, vanilla NAT models suffer a cost in translation accuracy because they remove the conditional dependence between target tokens. To close the gap from autoregressive models, iterative NAT models have been proposed to refine the translation results. They bring conditional dependency between target tokens within several iterations
Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). Among them, Ghazvininejad et al. Ghazvininejad et al. (2019) first explore applying a conditional masked language model (CMLM) to the NAT model (Figure 1(b)). Following this framework, several CMLM-based NAT models have been proposed and obtain state-of-the-art performance compared with other NATs Xie et al. (2020a); Guo, Xu, and Chen (2020). An open question is whether the potential of the CMLM-based NAT model has been fully exploited, since the masked language model has achieved significant breakthroughs in natural language processing.
To answer this question, we introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the CMLM-based NAT model. Specifically, our approach includes two regularization methods: shared mask consistency and model consistency. For shared mask consistency, as shown in Figure 1(c), we randomly mask different subsets of the same target sentence twice. Then we encourage the predicted distributions of the shared masked positions to be consistent with each other. As one example, consider the original sentence and two masked sentences in Table 1. The original token "window" is replaced with [MASK] in both masked sentences. Although the contexts of "window" differ due to the random mask strategies, their semantics and generated distributions are expected to be consistent across the two views. In summary, we introduce a new paradigm of regularization: different mask strategies for the same target sentence, where the tokens in the shared masked positions are semantic-preserving across different views. Although this approach is reminiscent of multi-view contrastive learning Tian, Krishnan, and Isola (2020), our method is not "contrastive" but only considers the consistency of "positive pairs".
original  the  cat  went  through  an  open  window  in  the  house  .
masked 1  the  cat  [MASK]  [MASK]  an  open  [MASK]  in  the  house  .
masked 2  the  cat  went  through  an  [MASK]  [MASK]  in  the  [MASK]  .
Regarding model consistency (Figure 1(d)), it is inspired by the observation that checkpoint averaging is an essential method for improving the performance of machine translation Vaswani et al. (2017). Similarly, Mean Teacher Tarvainen and Valpola (2017) shows that using an averaged model as a teacher improves the results. Correspondingly, we construct an average model by updating its weights with an exponential moving average (EMA) method. Then we penalize generated distributions that are inconsistent between this average model and the online model. Note that we adopt the bidirectional Kullback-Leibler (KL) divergence instead of the mean squared error (MSE) as the consistency cost. This is related to mutual learning Zhang et al. (2018) but without extra parameters.
As in prior work, we apply our MvSR-NAT model to several public benchmark datasets. It outperforms previous NAT models and achieves comparable results with the autoregressive Transformer. Intuitively, our two proposed regularization methods have two advantages: 1) they can be seen as stabilizers that promote the robustness of the model to randomness; 2) they reduce the discrepancy between the training and inference phases.
Specifically, the shared mask consistency first enhances the robustness of the model to the random mask. Secondly, we adopt the mask-predict decoding method Ghazvininejad et al. (2019), where the predicted target tokens are replaced by [MASK] symbols in the inference process. Especially in the first decoder iteration, all the target tokens are [MASK] symbols. This decoding strategy causes a discrepancy from the random masking used in training. By making the model more robust to the random mask, our proposed method reduces the discrepancy between training and inference caused by [MASK] symbols, thus improving the translation quality.
As for model consistency, it first penalizes the sensitivity of the model to its weights, thus improving robustness. Secondly, the average model and the online model have the same architecture but different dropout units during training. Therefore, this regularization term also makes our model more robust to random dropout. Moreover, dropout is disabled during inference, causing a discrepancy between training and inference. By being more robust to dropout, the proposed model consistency method implicitly strengthens the generalization ability of the model and improves its performance when dropout is disabled during inference.
Experimental results demonstrate that our model outperforms several state-of-the-art NAT models by 0.36–1.14 BLEU on the WMT14 EN-DE and WMT16 EN-RO datasets. Compared with the strong autoregressive Transformer (AT) baseline, our proposed NAT model achieves competitive performance while significantly reducing the inference time.
Background
Non-autoregressive Machine Translation
Recently, we have witnessed tremendous progress in neural machine translation (NMT) Sutskever, Vinyals, and Le (2014); Bahdanau, Cho, and Bengio (2015); Vaswani et al. (2017). Given a source sentence X = (x_1, ..., x_M), an NMT model aims to generate a target sentence Y = (y_1, ..., y_N). Typically, the probability model of an autoregressive model is defined as p(Y | X; \theta), where \theta denotes the parameters of the network. It is formulated as a chain of conditional probabilities:

p(Y \mid X; \theta) = \prod_{t=1}^{N+1} p(y_t \mid y_0, \ldots, y_{t-1}, X; \theta),    (1)

where y_0 and y_{N+1} are [BOS] and [EOS], representing the beginning and the end of a target sentence, respectively. Note that autoregressive NMT models adopt the teacher forcing method Vaswani et al. (2017) to capture the sequential conditional dependency between target tokens. During inference, they generate the target tokens one by one in a left-to-right manner.
As the performance of NMT models has been substantially promoted, non-autoregressive machine translation (NAT) with parallel decoding has become a research hotspot Gu et al. (2018); the architecture is shown in Figure 1(a). We define the probability model of a NAT model as p(Y | X; \theta). Mathematically, it is parameterized by a conditionally independent factorization:

p(Y \mid X; \theta) = \prod_{t=1}^{N} p(y_t \mid X; \theta).    (2)
The NAT model removes the conditional dependency between the target words, thus generating all target tokens simultaneously. Although the translation speed is significantly accelerated, the lack of dependence between target words reduces the translation quality. To promote the performance of NAT models, a promising research line is iterative decoding methods Lee, Mansimov, and Cho (2018); Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). Specifically, they explicitly consider the conditional dependency between target tokens within several decoding iterations, thus refining the translation results.
Conditional Masked Language Model
Since the masked language model (MLM) was proposed by BERT Devlin et al. (2019), it has achieved significant breakthroughs in natural language understanding Liu et al. (2019); Joshi et al. (2020); Song et al. (2019); Dong et al. (2019); Sun et al. (2021); Ahmad et al. (2021). However, due to the bidirectional nature of MLM, it is non-trivial to extend MLM to language generation tasks. Wang and Cho Wang and Cho (2019) start with a sentence of all [MASK] tokens and generate words one by one in arbitrary order (instead of the standard left-to-right chain decomposition), obtaining inadequate generation quality compared with autoregressive counterparts Brown et al. (2020). Recently, XLM Lample and Conneau (2019) leverages sentence-pair translation data for training a conditional masked language model (CMLM), which improves the performance on several downstream tasks, including machine translation.
Building upon these works, as shown in Figure 1(b), Ghazvininejad et al. Ghazvininejad et al. (2019) adopt the CMLM to optimize the non-autoregressive NMT model. During training, they predict the masked target tokens Y_mask conditioned on the source sentence X and the rest of the observed words Y_obs in the target sentence. The training objective of the CMLM-based NAT is therefore:

\mathcal{L}_{cmlm} = -\frac{1}{N_{mask}} \sum_{y_t \in Y_{mask}} \log p(y_t \mid Y_{obs}, X; \theta),    (3)

where N_{mask} denotes the number of masked tokens in the target sentence. During inference, they propose a mask-predict decoding strategy, which iteratively refines the generated translation given the most confident target words predicted in the previous iteration. In this paper, our work is built upon this CMLM-based NAT model, and we improve its performance with two proposed consistency regularization techniques.
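The CMLM objective above can be sketched in a few lines of plain Python; the function name and the per-token normalization are our assumptions for illustration, not the paper's code:

```python
import math

def cmlm_nll(log_probs, target, masked_positions):
    """Masked NLL in the spirit of Eq. (3): average -log p(y_t | Y_obs, X)
    over the masked positions only; observed positions contribute nothing."""
    return -sum(log_probs[t][target[t]] for t in masked_positions) / len(masked_positions)

# Toy example: a 3-token target where positions 0 and 2 were masked.
log_probs = [
    {"the": math.log(0.7), "a": math.log(0.3)},    # prediction at masked slot 0
    {"cat": math.log(0.9), "dog": math.log(0.1)},  # observed slot, ignored by the loss
    {"ran": math.log(0.6), "sat": math.log(0.4)},  # prediction at masked slot 2
]
loss = cmlm_nll(log_probs, ["the", "cat", "ran"], masked_positions=[0, 2])
```

The loss depends only on the masked slots, which is what trains the model to fill arbitrary subsets of the target during mask-predict decoding.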
Consistency Regularization
Consistency regularization has emerged as a gold-standard technique for semi-supervised learning
Sajjadi, Javanmardi, and Tasdizen (2016); Laine and Aila (2017); Zhai et al. (2019); Oliver et al. (2018); Xie et al. (2020b); Zheng et al. (2021). One strand of this idea is to regularize predictions to be stable under small perturbations of image data or language. These semantic-preserving augmentations can be image flipping or cropping, or adversarial noise on image data Miyato et al. (2019); Carmon et al. (2019) and natural language examples Zhu et al. (2019); Liu et al. (2020). Another strand of consistency regularization aims at penalizing sensitivity to model parameters Tarvainen and Valpola (2017); Athiwaratkun et al. (2019); Liang et al. (2021). In our work, we focus on the conditional masked language model setting, leveraging both strands of consistency regularization.
Approach
Model Architecture
Figure 2 illustrates the overall architecture of our proposed MvSR-NAT model. It is built upon the CMLM framework, comprising two Transformer-based modules: an Encoder and a Decoder. It is worth noting that the Encoder structure is based on the Transformer Vaswani et al. (2017), while the Decoder is slightly different. Specifically, the Decoder replaces the left-to-right mask with a bidirectional attention mask, allowing the Decoder to leverage both left and right contexts to predict the target words. To focus on our main contributions, we omit the detailed architecture and refer readers to Ghazvininejad et al. (2019).
In our work, we focus on the training of our model with the two proposed regularization methods. Before elaborating on the details, we first present some notation. Given a source sentence X and a target sentence Y, the goal of training is to learn a probability model p(Y | X; \theta). To construct the target input for our approach, we randomly mask the target sentence twice. We define the subsets of masked tokens as Y_mask^1 and Y_mask^2, and the observed unmasked tokens as Y_obs^1 and Y_obs^2, respectively. As shown in Figure 2, we feed the two masked sentences to the online model (blue part in the figure). Based on the probability model of the CMLM-based NAT presented in Equation 3, our learning objective is to minimize the negative log-likelihood (NLL) loss of the masked tokens, which is parameterized as:

\mathcal{L}_{nll}^{k} = -\frac{1}{N_{mask}^{k}} \sum_{y_t \in Y_{mask}^{k}} \log p(y_t \mid Y_{obs}^{k}, X; \theta), \quad k \in \{1, 2\},    (4)

where \theta represents the parameters of the online model, and N_mask^1 and N_mask^2 represent the numbers of masked words in the two views, respectively. The distributions p(y_t | Y_obs^1, X; \theta) and p(y_t | Y_obs^2, X; \theta) are generated by the online model, and we obtain the two NLL losses L_nll^1 and L_nll^2, respectively.
Note that the masked words in Y_mask^1 and Y_mask^2 are randomly selected and replaced by the [MASK] symbol. As shown in Table 1, the tokens masked in both views are marked with the [MASK] symbol in both masked sentences. We denote the collection of these shared masked words as Y_mask^s.
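This double-masking construction can be sketched as follows; the fixed 50% mask ratio is our simplification (CMLM-style training actually samples the number of masked tokens), and the function name is ours:

```python
import random

def mask_twice(target, mask_ratio=0.5, seed=0):
    """Mask the same target sentence twice with independent random subsets
    and return both views, the two masked index sets, and the indices
    masked in *both* views (the shared masked positions)."""
    rng = random.Random(seed)
    views, masked_sets = [], []
    for _ in range(2):
        k = max(1, int(len(target) * mask_ratio))
        idx = set(rng.sample(range(len(target)), k))
        views.append(["[MASK]" if i in idx else tok for i, tok in enumerate(target)])
        masked_sets.append(idx)
    shared = masked_sets[0] & masked_sets[1]  # positions regularized by shared mask consistency
    return views, masked_sets, shared

sentence = "the cat went through an open window in the house .".split()
views, masked_sets, shared = mask_twice(sentence)
```

The intersection `shared` is exactly the set of positions whose predicted distributions the shared mask consistency loss later compares.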
Consistency Regularization
In this section, we propose to improve the CMLM-based NAT with consistency regularization methods. Specifically, we focus on regularizing the generated predictions to be invariant to model parameters and to semantic-preserving data perturbations.
Model Consistency
We introduce the model consistency regularization method to encourage consistent predictions from an online model and an average model. The average model weights are maintained by an exponential moving average (EMA) method (yellow part in Figure 2). Previous works have demonstrated that averaged model weights tend to achieve better performance than the final model weights Polyak and Juditsky (1992); Vaswani et al. (2017); Tarvainen and Valpola (2017). To take advantage of the average model, we adopt the bidirectional KL divergence to encourage prediction consistency between the two models. Thereby we increase the robustness of our model to its weights and learn better representations. Furthermore, similar to the recent arXiv paper Liang et al. (2021), the dropout units of the online model and the average model differ due to randomness, so the prediction consistency also penalizes the sensitivity to random dropout. In summary, the proposed method brings two practical advantages: first, we strengthen the robustness of our model to stochastic model weights; second, the method improves model generalization and reduces the discrepancy between training and inference caused by dropout.
Formally, we define the model consistency loss as the distance between the token-level predictions produced by the online model and the average model, using the bidirectional KL divergence:

\mathcal{L}_{mc} = \frac{1}{2} \left( D_{KL}(P_{\theta}^{1} \| P_{\bar{\theta}}^{1}) + D_{KL}(P_{\bar{\theta}}^{1} \| P_{\theta}^{1}) \right) + \frac{1}{2} \left( D_{KL}(P_{\theta}^{2} \| P_{\bar{\theta}}^{2}) + D_{KL}(P_{\bar{\theta}}^{2} \| P_{\theta}^{2}) \right),    (5)

where \bar{\theta} represents the parameters of the average model. To ensure readability, P_\theta^1 and P_{\bar\theta}^1 are abbreviations of p(y_t | Y_obs^1, X; \theta) and p(y_t | Y_obs^1, X; \bar\theta); they represent the predictions from the online model and the average model given the first masked sentence, respectively. Similarly for P_\theta^2 and P_{\bar\theta}^2, but with the second masked sentence.
Moreover, the average model parameters \bar\theta are obtained by the EMA method. At training step t, the updated \bar\theta_t is computed as the EMA of successive weights:

\bar{\theta}_{t} = \alpha \bar{\theta}_{t-1} + (1 - \alpha) \theta_{t},    (6)

where \alpha is the moving average decay.
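The EMA update amounts to a one-line element-wise rule; a minimal sketch with parameters stored in a plain dict (real implementations iterate over model tensors):

```python
def ema_update(avg_params, online_params, decay=0.996):
    """EMA step: theta_bar_t = decay * theta_bar_{t-1} + (1 - decay) * theta_t,
    applied element-wise; decay = 0.996 is the value used in the experiments."""
    return {name: decay * avg_params[name] + (1.0 - decay) * online_params[name]
            for name in avg_params}

avg, online = {"w": 1.0}, {"w": 0.0}
avg = ema_update(avg, online)  # 0.996 * 1.0 + 0.004 * 0.0 = 0.996
```

With a decay this close to 1, the average model changes slowly and acts as a stable teacher for the online model.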
Shared Mask Consistency
We now present our shared mask consistency regularization method. As the examples in Table 1 and Figure 2 show, we randomly mask the same target sentence twice and forward the two views to the online model and the average model. Consider a simple example with the sentence pairs "the [MASK] is [MASK] . Diese Katze ist lustig ." and "[MASK] cat is [MASK] . Diese Katze ist lustig .". The shared [MASK] to predict can be thought of as a token-level, semantic-preserving "positive pair". We hypothesize that the representation is view-agnostic, and that the semantics are shared between the different views produced by random masking. Therefore, the predicted distributions of the shared masked tokens are expected to be consistent. Note that we do not consider the distribution consistency of the other positions. To illustrate the reason, take the second position as an example: "cat" is observed in the second target sentence, but it is masked in the first sentence. If we forced the distributions of the second position to be consistent, the model would be confused between "[MASK]" and "cat", leading to inferior performance.
Mathematically, we define the shared mask consistency cost to measure the distance between the prediction distributions at the shared masked positions. Similar to model consistency, we adopt the bidirectional KL divergence:

\mathcal{L}_{smc} = \mathcal{L}_{smc}^{oo} + \mathcal{L}_{smc}^{oa} + \mathcal{L}_{smc}^{ao},    (7)

where each term is the bidirectional KL divergence between the predictions for the two masked sentences, averaged over the N_mask^s shared masked positions, and N_mask^s represents the number of shared masked tokens between Y_mask^1 and Y_mask^2. The terms L_smc^{oo}, L_smc^{oa}, and L_smc^{ao} represent the shared mask consistency between the two masked sentences when they are fed into the online-online, online-average, and average-online models, respectively.
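A sketch of the bidirectional KL consistency cost over shared masked positions, with distributions as token-to-probability dicts; the helper names are ours, not the paper's:

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions over the same support."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def bi_kl(p, q):
    """Bidirectional (symmetrized) KL divergence used as the consistency cost."""
    return 0.5 * (kl(p, q) + kl(q, p))

def shared_mask_consistency(dists_a, dists_b, shared_positions):
    """Average bidirectional KL over positions masked in *both* views only;
    other positions are excluded so observed tokens are never pulled
    toward [MASK] predictions."""
    return sum(bi_kl(dists_a[t], dists_b[t]) for t in shared_positions) / len(shared_positions)

p = {"window": 0.8, "door": 0.2}
q = {"window": 0.6, "door": 0.4}
```

The symmetrized form is zero exactly when the two distributions agree, and is the same whichever view is treated as the reference.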
Length Prediction
Autoregressive NMT models generate words one by one, and the length of the target sentence is decided by encountering a special token [EOS]. However, non-autoregressive NAT models generate the target sentence in a parallel way, thus requiring the predicted length before decoding. Following Gu et al. (2018); Ghazvininejad et al. (2019), we add an additional special token [LEN] to the beginning of the source input. Then we predict the target length N from the source sentence X. Mathematically, we define the loss function of this classification task as:

\mathcal{L}_{len} = -\log p(N \mid X; \theta), \quad N \in \{1, \ldots, L_{max}\},    (8)

where L_{max} represents the max length of the target sentence in our corpus.
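Length prediction and its use at inference can be sketched as follows; the dict-based interface is our simplification of the real classifier head:

```python
def length_loss(length_log_probs, true_length):
    """Eq. (8) as a classification loss: the [LEN] position predicts the
    target length N; max_len is 1000 in the paper's setting."""
    return -length_log_probs[true_length]

def topk_lengths(length_probs, k=5):
    """At inference, the k most probable lengths (k = 5 in the experiments)
    are decoded in parallel and the highest-scoring output is kept."""
    return sorted(length_probs, key=length_probs.get, reverse=True)[:k]

candidates = topk_lengths({8: 0.10, 9: 0.30, 10: 0.40, 11: 0.15, 12: 0.05}, k=2)  # [10, 9]
```

Decoding several length candidates in parallel hedges against the length classifier being off by one or two tokens.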
Training Algorithm
The final training objective for MvSR-NAT is the sum of all the aforementioned loss functions:

\mathcal{L} = \mathcal{L}_{nll}^{1} + \mathcal{L}_{nll}^{2} + \mathcal{L}_{len} + \lambda \left( \mathcal{L}_{mc} + \mathcal{L}_{smc} \right),    (9)

where \lambda is a hyperparameter that controls the KL losses. By jointly training with the two proposed consistency regularization losses, we improve the robustness and generalization ability of our MvSR-NAT model to randomness (e.g., stochastic model weights and the random mask of the target sentence).
The overall training algorithm of our model is presented in Algorithm 1.
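As a minimal arithmetic sketch, the terms combine as below; the grouping of both KL terms under a single weight `lam` and its default value here are our assumptions:

```python
def total_loss(nll_1, nll_2, len_loss, l_mc, l_smc, lam=0.5):
    """Weighted sum of the two masked-NLL terms, the length loss, and the
    two consistency (KL) losses; lam is the ablated KL-loss weight."""
    return nll_1 + nll_2 + len_loss + lam * (l_mc + l_smc)

loss = total_loss(nll_1=1.0, nll_2=2.0, len_loss=0.5, l_mc=0.2, l_smc=0.4)
```

Keeping the consistency terms behind one scalar makes the regularization strength a single knob, which the later ablation on the KL loss weight tunes.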
Inference
During inference, we feed a sentence of all [MASK] tokens as the target input for the first iteration, where its length is determined by the length prediction. Then, over several iterations, we refine the translation result by masking out and re-predicting a subset of words whose probabilities are under a threshold. For more details, please refer to Ghazvininejad et al. (2019).
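A simplified sketch of the mask-predict loop described above; `predict_fn` stands in for a trained CMLM decoder (an assumption here), and the number of re-masked tokens decays linearly over iterations as in Ghazvininejad et al. (2019):

```python
def mask_predict(predict_fn, length, iterations=10, mask_token="[MASK]"):
    """Start from all [MASK]; at each step re-predict every position, then
    re-mask the least-confident tokens until no tokens remain masked."""
    tokens = [mask_token] * length
    for it in range(iterations):
        tokens, probs = predict_fn(tokens)  # decoder fills in all positions
        n_mask = int(length * (iterations - 1 - it) / iterations)  # linear decay
        if n_mask == 0:
            break
        for i in sorted(range(length), key=lambda i: probs[i])[:n_mask]:
            tokens[i] = mask_token  # re-mask lowest-confidence positions
    return tokens

# A constant dummy decoder just to show the control flow.
dummy = lambda toks: (["a", "b", "c", "d"], [0.9, 0.1, 0.8, 0.7])
result = mask_predict(dummy, length=4, iterations=2)
```

Because the whole sequence is re-predicted each iteration, the total decoding cost is the number of iterations rather than the sentence length.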
Experiments
Experimental Setup
Datasets
We conduct our experiments on five public benchmarks: WMT14 EN-DE and DE-EN (4.5M translation pairs), WMT16 EN-RO and RO-EN (610K translation pairs), and IWSLT DE-EN (150K translation pairs). We strictly follow the dataset configurations of previous works. Specifically, we preprocess the datasets following Lee, Mansimov, and Cho (2018). Then we tokenize the sentences into subword units using the BPE method Sennrich, Haddow, and Birch (2016). For WMT14 EN-DE, we use newstest2013 and newstest2014 as our development and test sets, respectively. For WMT16 EN-RO, we use newsdev2016 and newstest2016 as our development and test sets, respectively.
Evaluation Metrics
We adopt the widely used BLEU Papineni et al. (2002) to evaluate the translation quality.
Models  Iter.  WMT'14 EN-DE  WMT'14 DE-EN  WMT'16 EN-RO  WMT'16 RO-EN  IWSLT14 DE-EN  Speedup
Autoregressive Models
LSTM-S2S Bahdanau, Cho, and Bengio (2015)  T  24.60  -  -  -  -  -
ConvS2S Gehring et al. (2017)  T  26.42  -  -  -  -  -
Transformer Vaswani et al. (2017)  T  28.04  32.69  34.13  34.46  32.99  1.00x
Non-Autoregressive Models
NAT-FT Gu et al. (2018)  1  19.17  23.20  29.79  31.44  24.21  2.36x
Imit-NAT Wei et al. (2019)  1  24.15  27.28  31.45  31.81  -  -
NAT-Hint Li et al. (2019)  1  21.11  25.24  -  -  -  -
Flowseq Ma et al. (2019)  1  23.72  28.39  29.73  30.72  -  1.1x
NAT-DCRF Sun et al. (2019)  1  26.07  29.68  -  -  29.99  9.63x
GLAT-NAT Qian et al. (2021)  1  26.55  31.02  32.87  33.51  -  7.9x
NAT-IR Lee, Mansimov, and Cho (2018)  5  20.26  23.86  28.86  29.72  -  -
  10  21.61  25.48  29.32  30.19  23.94  1.50x
CMLM-NAT Ghazvininejad et al. (2019)  1  18.05  21.83  27.32  28.20  -  27.51x
  4  25.94  29.90  32.53  33.23  30.42  9.79x
  10  27.03  30.53  33.08  33.31  31.71  3.77x
MvSR-NAT (w/ kd)  4  26.25  30.27  32.76  32.96  -  9.79x
  10  27.39  31.18  33.38  33.56  -  3.77x
MvSR-NAT (w/o kd)  4  22.89  26.89  32.34  33.60  30.58  9.79x
  10  24.37  28.90  33.76  34.45  32.55  3.77x
Model Details
The online model settings strictly follow previous works. For the WMT14 EN-DE and WMT16 EN-RO datasets, the model setting is based on the base Transformer Vaswani et al. (2017). Specifically, we set the model dimension to 512 and the inner dimension to 2048. The Encoder and the Decoder each consist of a stack of 6 Transformer layers. For the smaller IWSLT16 DE-EN dataset, we follow the configuration of the small Transformer, in which the model dimension and inner dimension are 256 and 1024, respectively. The Encoder and the Decoder each consist of a stack of 5 Transformer layers. Besides, we set the maximum target sentence length to 1000.
The average model is built upon the online model with the same architecture, and its weights are maintained by the exponential moving average method with the moving average decay set to 0.996.
During training, we train the model with 2048 tokens per batch on eight GTX 2080Ti GPUs. We use the Adam optimizer Kingma and Ba (2015) and a warmup learning rate schedule. During inference, we set the number of length candidates to 5 for our NAT model. For a fair comparison, we set the beam size to 5 for the baseline AT model. Moreover, we evaluate the final translation accuracy by averaging 10 checkpoints.
Sequential-Level Knowledge Distillation
Previous works have demonstrated the effectiveness of sequential-level knowledge distillation on NAT models Gu et al. (2018); Lee, Mansimov, and Cho (2018); Gu, Wang, and Zhao (2019); Ghazvininejad et al. (2019); Zhou, Neubig, and Gu (2020). Following these works, we train our MvSR-NAT model on distilled corpora, which are produced by a standard left-to-right Transformer model. Since previous AT Transformers have different performance, we adopt the one used in CMLM-NAT Ghazvininejad et al. (2019), which is our primary baseline. In Section Effect of Sequential-level Knowledge Distillation, we identify the effect of knowledge distillation on our model.
Model Variants  WMT'16 EN-RO  WMT'16 RO-EN
  Iter=4  Iter=10  Iter=4  Iter=10
CMLM-NAT  31.40  32.86  32.87  33.15
+ model consistency  31.89 (+0.49)  33.10 (+0.24)  33.15 (+0.28)  33.52 (+0.37)
+ shared mask consistency  32.16 (+0.76)  33.49 (+0.63)  33.79 (+0.92)  34.08 (+0.93)
MvSR-NAT  32.34 (+0.94)  33.76 (+0.90)  33.60 (+0.73)  34.45 (+1.30)
Results
Table 2 shows our experimental results on three public datasets. Looking at the first part of the table, our model achieves performance comparable to the Transformer model. Notably, on the smaller WMT16 EN-RO and IWSLT14 DE-EN datasets, the translation results are only 0.01–0.44 BLEU behind.
In the second part of Table 2, compared with pure NAT models with one-shot decoding, the multiple iterative decoding methods achieve noticeable improvements. The same holds for the CMLM-based NAT models. This phenomenon is mainly due to the multi-modality problem Gu et al. (2018): one-shot decoding hardly considers the left-to-right dependency, while the iterative methods explicitly model the conditional dependency between target tokens within several iterations, thus obtaining better performance.
In contrast with our primary baseline, the CMLM-NAT model, our model is additionally optimized with the two regularization methods without changing the CMLM architecture. Our model outperforms CMLM-NAT by margins of 0.36–1.14 BLEU, illustrating the effectiveness of our methods.
Effect of Sequential-level Knowledge Distillation
The comparison results for knowledge distillation are shown in Table 2. On the large dataset, i.e., WMT14 EN-DE, our model gains improvements with sequential-level knowledge distillation. However, the improvements from knowledge distillation do not carry over to the small dataset, i.e., WMT16 EN-RO. We attribute this phenomenon to the complexity of the datasets Zhou, Neubig, and Gu (2020). Knowledge distillation is able to reduce the "modes" (alternative translations for an input) in the training data, thus benefiting NAT models. We conjecture that a small dataset is likely to contain fewer redundant "modes" than a large-scale dataset. As a result, knowledge distillation is more helpful and more efficient on a large dataset than on a small one.
Ablation Study
Model Consistency vs. Shared Mask Consistency
As shown in Table 3, we conduct comparative experiments on the validation set of the WMT16 EN-RO task to illustrate the contribution of our two proposed regularization methods. Note that the results are computed without knowledge distillation. Compared with the CMLM-NAT baseline model, our proposed model consistency and shared mask consistency regularization methods progressively improve the performance, and the shared mask consistency provides the larger improvement.
Furthermore, to further understand the two proposed regularization methods, Figure 3 shows the BLEU score against training epochs on the IWSLT14 DE-EN task with a single decoding iteration. For a fair comparison, in every training forward pass, we feed two source-target sentence pairs to the compared models. The training curves help us understand the effect of the two proposed regularization methods. We can see that a) the model consistency improves the performance without changing the convergence trend; b) the shared mask consistency method slows the convergence of the model in the early training period but obtains better performance in the final training epochs. This indicates that the shared mask consistency method can avoid premature fitting and improve the robustness and generalization ability of our model.
Effect of the KL Loss Weight
KL Loss Weight  WMT'16 EN-RO  WMT'16 RO-EN

33.68  34.40  
33.76  34.45  
33.42  34.02  
33.10  33.43  
32.72  32.68 
We investigate the effect of the loss weight \lambda, which is utilized for controlling the KL-divergence losses. We conduct ablation experiments on WMT16 EN-RO with different weight values. The results are shown in Table 4. We can see that small KL loss weights perform better than larger ones; too much regularization (e.g., a weight of 3) even decreases the model performance.
K-time Shared Mask Consistency
K  WMT'16 EN-RO  WMT'16 RO-EN
2  33.76  34.45
3  33.53  34.50
As the example in Table 1 shows, we forward two masked target sentences to the model and encourage their masked-subset predictions to be consistent. An interesting question is whether further improvements can be achieved if we forward three or more masked targets with different mask strategies. In this study, we define the number of masked targets as K. We conduct comparative experiments on K on the WMT16 EN-RO dataset. The results in Table 5 show that K = 2 is sufficient for these tasks. This indicates that our two proposed consistency regularization methods already have a strong regularization effect between two distributions, without the need to regularize more distributions.
Dropout Probability in Average Model
Dropout Prob.  WMT'16 EN-RO  WMT'16 RO-EN

0.1  33.17  33.87 
0.2  33.36  34.19 
0.3  33.76  34.45 
0.4  33.39  34.40 
0.5  33.10  34.25 
As mentioned in the subsection Model Consistency, the model consistency method strengthens the robustness of our model to model weights and random dropout. Here, we investigate different dropout rates in the average model. In this study, we apply different dropout values to the average model during training. As shown in Table 6, we test dropout values from 0.1 to 0.5 on the WMT16 EN-RO dataset. We can see that the best choice of dropout value for the average model is 0.3.
Conclusion
In this paper, building upon the CMLM-based architecture, we introduce the Multi-view Subset Regularization method to improve CMLM-based NAT performance. We first propose the shared mask consistency method to force the masked-subset predictions to be consistent under random mask strategies. Second, we propose model consistency to encourage the online model to generate distributions consistent with those of the average model, whose weights are maintained with an EMA method. On several benchmark datasets, we demonstrate that our approach achieves considerable improvements over previous non-autoregressive models and comparable results to the autoregressive Transformer model. This work introduces a new regularization paradigm, multi-view subset regularization, and we hope this paradigm can be helpful for current contrastive learning models.
References
 Ahmad et al. (2021) Ahmad, W. U.; Chakraborty, S.; Ray, B.; and Chang, K.-W. 2021. Unified Pre-training for Program Understanding and Generation. In NAACL.
 Athiwaratkun et al. (2019) Athiwaratkun, B.; Finzi, M.; Izmailov, P.; and Wilson, A. G. 2019. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. In ICLR.
 Bahdanau, Cho, and Bengio (2015) Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
 Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.
 Carmon et al. (2019) Carmon, Y.; Raghunathan, A.; Schmidt, L.; Liang, P.; and Duchi, J. C. 2019. Unlabeled Data Improves Adversarial Robustness. In NeurIPS.
 Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
 Dong et al. (2019) Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS.
 Gehring et al. (2017) Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. 2017. Convolutional Sequence to Sequence Learning. ArXiv, abs/1705.03122.
 Ghazvininejad et al. (2019) Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In EMNLP-IJCNLP.
 Ghazvininejad, Levy, and Zettlemoyer (2020) Ghazvininejad, M.; Levy, O.; and Zettlemoyer, L. 2020. Semi-Autoregressive Training Improves Mask-Predict Decoding. ArXiv, abs/2001.08785.
 Gu et al. (2018) Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O. K.; and Socher, R. 2018. Non-Autoregressive Neural Machine Translation. In ICLR.
 Gu, Wang, and Zhao (2019) Gu, J.; Wang, C.; and Zhao, J. 2019. Levenshtein Transformer. In NeurIPS.
 Guo, Xu, and Chen (2020) Guo, J.; Xu, L.; and Chen, E. 2020. Jointly Masked Sequence-to-Sequence Model for Non-Autoregressive Neural Machine Translation. In ACL.
 Joshi et al. (2020) Joshi, M.; Chen, D.; Liu, Y.; Weld, D. S.; Zettlemoyer, L.; and Levy, O. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8: 64–77.

 Kasai et al. (2020) Kasai, J.; Cross, J.; Ghazvininejad, M.; and Gu, J. 2020. Non-autoregressive machine translation with disentangled context transformer. In ICML, 5144–5155. PMLR.
 Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.
 Laine and Aila (2017) Laine, S.; and Aila, T. 2017. Temporal Ensembling for SemiSupervised Learning. ArXiv, abs/1610.02242.
 Lample and Conneau (2019) Lample, G.; and Conneau, A. 2019. Crosslingual Language Model Pretraining. ArXiv, abs/1901.07291.
 Lee, Mansimov, and Cho (2018) Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic NonAutoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP.
 Li et al. (2019) Li, Z.; Lin, Z.; He, D.; Tian, F.; Qin, T.; Wang, L.; and Liu, T.Y. 2019. HintBased Training for NonAutoregressive Machine Translation. In EMNLPIJCNLP.
 Liang et al. (2021) Liang, X.; Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; and Liu, T.Y. 2021. RDrop: Regularized Dropout for Neural Networks. ArXiv, abs/2106.14448.
 Liu et al. (2020) Liu, X.; Cheng, H.; He, P.; Chen, W.; Wang, Y.; Poon, H.; and Gao, J. 2020. Adversarial Training for Large Neural Language Models. ArXiv, abs/2004.08994.
 Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.
 Ma et al. (2019) Ma, X.; Zhou, C.; Li, X.; Neubig, G.; and Hovy, E. 2019. FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. In EMNLP-IJCNLP.
 Miyato et al. (2019) Miyato, T.; Maeda, S.-I.; Koyama, M.; and Ishii, S. 2019. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1979–1993.
 Oliver et al. (2018) Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. In NeurIPS.
 Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.
 Polyak and Juditsky (1992) Polyak, B. T.; and Juditsky, A. B. 1992. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4): 838–855.
 Qian et al. (2021) Qian, L.; Zhou, H.; Bao, Y.; Wang, M.; Qiu, L.; Zhang, W.; Yu, Y.; and Li, L. 2021. Glancing Transformer for Non-Autoregressive Neural Machine Translation. In ACL.
 Sajjadi, Javanmardi, and Tasdizen (2016) Sajjadi, M. S. M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. In NIPS.
 Sennrich, Haddow, and Birch (2016) Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.
 Song et al. (2019) Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML.
 Sun et al. (2021) Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y.; Liu, W.; Wu, Z.; Gong, W.; Liang, J.; Shang, Z.; Sun, P.; Liu, W.; Ouyang, X.; Yu, D.; Tian, H.; Wu, H.; and Wang, H. 2021. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. ArXiv, abs/2107.02137.
 Sun et al. (2019) Sun, Z.; Li, Z.; Wang, H.; Lin, Z.; He, D.; and Deng, Z.-H. 2019. Fast Structured Decoding for Sequence Models. ArXiv, abs/1910.11555.
 Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
 Tarvainen and Valpola (2017) Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS.
 Tian, Krishnan, and Isola (2020) Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Multiview Coding. In ECCV.
 Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS.
 Wang and Cho (2019) Wang, A.; and Cho, K. 2019. BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. ArXiv, abs/1902.04094.
 Wei et al. (2019) Wei, B.; Wang, M.; Zhou, H.; Lin, J.; Xie, J.; and Sun, X. 2019. Imitation Learning for Non-Autoregressive Neural Machine Translation. ArXiv, abs/1906.02041.
 Xie et al. (2020a) Xie, P.; Cui, Z.; Chen, X.; Hu, X.; Cui, J.; and Wang, B. 2020a. Infusing Sequential Information into Conditional Masked Translation Model with Self-Review Mechanism. In COLING.
 Xie et al. (2020b) Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2020b. Unsupervised Data Augmentation for Consistency Training. ArXiv, abs/1904.12848.
 Zhai et al. (2019) Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4L: Self-Supervised Semi-Supervised Learning. In ICCV, 1476–1485.
 Zhang et al. (2018) Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep Mutual Learning. In CVPR, 4320–4328.
 Zheng et al. (2021) Zheng, B.; Dong, L.; Huang, S.; Wang, W.; Chi, Z.; Singhal, S.; Che, W.; Liu, T.; Song, X.; and Wei, F. 2021. Consistency Regularization for Cross-Lingual Fine-Tuning. ArXiv, abs/2106.08226.
 Zhou, Neubig, and Gu (2020) Zhou, C.; Neubig, G.; and Gu, J. 2020. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. In ICLR.
 Zhu et al. (2019) Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2019. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. ArXiv, abs/1909.11764.
 Łukasz Kaiser et al. (2018) Łukasz Kaiser; Roy, A.; Vaswani, A.; Parmar, N.; Bengio, S.; Uszkoreit, J.; and Shazeer, N. 2018. Fast Decoding in Sequence Models using Discrete Latent Variables. In ICML.