The Transformer has been the de facto architecture for neural machine translation Vaswani et al. (2017). In this framework, the decoder generates words one by one in a left-to-right manner. Despite its strong performance, the autoregressive decoding method incurs high latency in the inference phase Gu et al. (2018). To break the inference-speed bottleneck caused by sequential conditional dependence, several non-autoregressive neural machine translation (NAT) models have been proposed to generate all tokens in parallel (Figure 1(a)) Gu et al. (2018); Łukasz Kaiser et al. (2018); Li et al. (2019); Ma et al. (2019). However, vanilla NAT models suffer a cost in translation accuracy because they remove the conditional dependence between target tokens.
To close the gap to autoregressive models, iterative NAT models have been proposed to refine the translation results. They introduce conditional dependency between target tokens over several iterations Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). Among them, Ghazvininejad et al. (2019) are the first to apply a conditional masked language model (CMLM) to a NAT model (Figure 1(b)). Following this framework, several CMLM-based NAT models have been proposed and obtain state-of-the-art performance compared with other NATs Xie et al. (2020a); Guo, Xu, and Chen (2020).
An open question is whether the potential of the CMLM-based NAT model has been fully exploited, since the masked language model has achieved significant breakthroughs in natural language processing.
To answer this question, we introduce Multi-view Subset Regularization (MvSR), a novel regularization method that improves the performance of the CMLM-based NAT model. Specifically, our approach includes two regularization methods: shared mask consistency and model consistency. For shared mask consistency, as shown in Figure 1(c), we randomly mask different subsets of the same target sentence twice. Then we encourage the predicted distributions at the shared masked positions to be consistent with each other. As an example, consider the original sentence and the two masked sentences in Table 1. The original token "window" is replaced with [MASK] in both masked sentences. Although the contexts of "window" differ due to the random mask strategies, its semantics and generated distributions are expected to be consistent across the two views. In summary, we introduce a new regularization paradigm: different mask strategies for the same target sentence, where the tokens at the shared masked positions are semantic-preserving across different views. While this approach is reminiscent of multi-view contrastive learning Tian, Krishnan, and Isola (2020), our method is not contrastive: it only considers the consistency of positive pairs.
Regarding model consistency (Figure 1(d)), it is inspired by the observation that checkpoint averaging is an essential method for improving the performance of machine translation Vaswani et al. (2017). Similarly, Mean Teacher Tarvainen and Valpola (2017) shows that using an averaged model as a teacher improves results. Correspondingly, we construct an average model by updating its weights with an exponential moving average (EMA) method. We then penalize generated distributions that are inconsistent between this average model and the online model. Note that we adopt the bidirectional Kullback-Leibler (KL) divergence instead of the mean squared error (MSE) as the consistency cost. This is related to mutual learning Zhang et al. (2018) but introduces no extra parameters.
As in prior work, we apply our MvSR-NAT model to several public benchmark datasets. It outperforms previous NAT models and achieves results comparable to the autoregressive Transformer. Intuitively, our two proposed regularization methods have two advantages: 1) they act as stabilizers that promote the robustness of the model to randomness; 2) they reduce the discrepancy between the training and inference phases.
Specifically, shared mask consistency first enhances the robustness of the model to random masking. Second, we adopt the mask-predict decoding method Ghazvininejad et al. (2019), where predicted target tokens are replaced by [MASK] symbols during inference; in the first decoder iteration in particular, all target tokens are [MASK] symbols. This decoding strategy creates a discrepancy with the random masking used in training. By making the model more robust to random masking, our proposed method reduces the discrepancy between training and inference caused by [MASK] symbols, thus improving translation quality.
As for model consistency, it first penalizes the sensitivity of the model to its weights, thus improving robustness. Second, the average model and the online model share the same architecture but have different dropout units during training, so this regularization term also makes our model more robust to random dropout. Moreover, dropout is disabled during inference, which causes a discrepancy between training and inference. By being more robust to dropout, the proposed model consistency method implicitly strengthens the generalization ability of the model and improves performance when dropout is disabled during inference.
Experimental results demonstrate that our model outperforms several state-of-the-art NAT models by 0.36-1.14 BLEU on the WMT14 ENDE and WMT16 ENRO datasets. Compared with the strong autoregressive Transformer (AT) baseline, our proposed NAT model achieves competitive performance while significantly reducing inference time.
Non-autoregressive Machine Translation
Recently, we have witnessed tremendous progress in neural machine translation (NMT) Sutskever, Vinyals, and Le (2014); Bahdanau, Cho, and Bengio (2015); Vaswani et al. (2017). Given a source sentence $X = (x_1, \dots, x_n)$, an NMT model aims to generate a target sentence $Y = (y_1, \dots, y_m)$. Typically, the probability model of an autoregressive model is defined as $P(Y \mid X; \theta)$, where $\theta$ denotes the parameters of the network. It is formulated as a chain of conditional probabilities:

$$P(Y \mid X; \theta) = \prod_{t=1}^{m+1} p(y_t \mid y_{0:t-1}, X; \theta),$$

where $y_0$ and $y_{m+1}$ are [BOS] and [EOS], representing the beginning and the end of a target sentence, respectively. Note that autoregressive NMT models adopt the teacher forcing method Vaswani et al. (2017) to capture the sequential conditional dependency between target tokens. During inference, they generate target tokens one by one in a left-to-right manner.
As the performance of NMT models has been substantially promoted, non-autoregressive machine translation (NAT) with parallel decoding has become a research hotspot Gu et al. (2018); the architecture is shown in Figure 1(a). We define the probability model of a NAT model as $P(Y \mid X; \theta)$. Mathematically, it is parameterized by a conditionally independent factorization:

$$P(Y \mid X; \theta) = \prod_{t=1}^{m} p(y_t \mid X; \theta).$$

The NAT model removes the conditional dependency between target words, thus generating all target tokens simultaneously. Although the translation speed is significantly accelerated, the lack of dependence between target words reduces the translation quality. To promote the performance of NAT models, a promising research line is iterative decoding Lee, Mansimov, and Cho (2018); Ghazvininejad et al. (2019); Ghazvininejad, Levy, and Zettlemoyer (2020); Xie et al. (2020a); Kasai et al. (2020); Guo, Xu, and Chen (2020). These methods explicitly consider the conditional dependency between target tokens within several decoding iterations, thus refining the translation results.
Conditional Masked Language Model
Since the masked language model (MLM) was proposed by BERT Devlin et al. (2019), it has achieved significant breakthroughs in natural language understanding Liu et al. (2019); Joshi et al. (2020); Song et al. (2019); Dong et al. (2019); Sun et al. (2021); Ahmad et al. (2021). However, due to the bidirectional nature of MLM, it is non-trivial to extend MLM to language generation tasks. Wang and Cho (2019) start with a sentence of all [MASK] tokens and generate words one by one in arbitrary order (instead of the standard left-to-right chain decomposition), obtaining inadequate generation quality compared with autoregressive counterparts Brown et al. (2020). Recently, XLM Lample and Conneau (2019) leverages sentence-pair translation data to train a conditional masked language model (CMLM), which improves the performance on several downstream tasks, including machine translation.
Building upon previous works, as shown in Figure 1(b), Ghazvininejad et al. (2019) adopt the CMLM to optimize the non-autoregressive NMT model. During training, they predict the masked target tokens $Y_{mask}$ conditioned on the source sentence $X$ and the rest of the observed words $Y_{obs}$ in the target sentence. Therefore, the training objective of the CMLM-based NAT is:

$$\mathcal{L}_{cmlm} = -\sum_{y_t \in Y_{mask}} \log p(y_t \mid Y_{obs}, X; \theta),$$

where $|Y_{mask}|$ denotes the number of masked tokens in the target sentence. During inference, they propose the Mask-Predict decoding strategy, which iteratively refines the generated translation, conditioning on the most confident target words predicted in the previous iteration. In this paper, our work is built upon this CMLM-based NAT model and improves its performance with two proposed consistency regularization techniques.
Consistency regularization has emerged as a gold-standard technique for semi-supervised learning Sajjadi, Javanmardi, and Tasdizen (2016); Laine and Aila (2017); Zhai et al. (2019); Oliver et al. (2018); Xie et al. (2020b); Zheng et al. (2021). One strand of this idea regularizes predictions against small perturbations of the input; these semantic-preserving augmentations can be image flipping or cropping, or adversarial noise on image data Miyato et al. (2019); Carmon et al. (2019) and natural language examples Zhu et al. (2019); Liu et al. (2020). Another strand aims at penalizing sensitivity to model parameters Tarvainen and Valpola (2017); Athiwaratkun et al. (2019); Liang et al. (2021). In our work, we focus on the conditional masked language model setting, leveraging both strands of consistency regularization.
Figure 2 illustrates the overall architecture of our proposed MvSR-NAT model. It is built upon the CMLM framework, comprising two Transformer-based modules: an Encoder and a Decoder. Note that the Encoder structure follows the Transformer Vaswani et al. (2017), while the Decoder is slightly different: it replaces the left-to-right attention mask with a bidirectional one, allowing the Decoder to leverage both left and right contexts to predict the target words. To focus on our main contributions, we omit the detailed architecture and refer readers to Ghazvininejad et al. (2019).
In our work, we focus on training our model with the two proposed regularization methods. Before elaborating on the details, we first present some notation. Given a source sentence $X = (x_1, \dots, x_n)$ and a target sentence $Y = (y_1, \dots, y_m)$, the goal of training is to learn a probability model $P(Y \mid X; \theta)$. To construct the target input for our approach, we randomly mask the target sentence twice. We define the subsets of masked tokens as $Y^1_{mask}$ and $Y^2_{mask}$, and the observed unmasked tokens as $Y^1_{obs}$ and $Y^2_{obs}$, respectively. As shown in Figure 2, we feed the two masked sentences to the online model (blue part in the figure). Building on the probability model of the CMLM-based NAT presented in Equation 3, our learning objective is to minimize the negative log-likelihood (NLL) loss of the masked tokens, parameterized as:

$$\mathcal{L}^k_{nll} = -\frac{1}{N_k} \sum_{y_t \in Y^k_{mask}} \log p(y_t \mid Y^k_{obs}, X; \theta), \quad k \in \{1, 2\},$$

where $\theta$ represents the parameters of the online model, $N_1$ and $N_2$ represent the numbers of masked words, and $p(\cdot \mid Y^1_{obs}, X; \theta)$ and $p(\cdot \mid Y^2_{obs}, X; \theta)$ represent the generated distributions from the online model. We thus obtain the two NLL losses $\mathcal{L}^1_{nll}$ and $\mathcal{L}^2_{nll}$, respectively.
Note that the masked words in $Y^1_{mask}$ and $Y^2_{mask}$ are randomly selected and replaced by [MASK] symbols. As shown in Table 1, the tokens masked in both sentences are marked with the [MASK] symbol in both views. We define the collection of these shared masked words as $Y^s_{mask} = Y^1_{mask} \cap Y^2_{mask}$.
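As an illustration of this double-masking step, the sketch below builds two masked views of one target sentence and collects the positions masked in both views. The helper name `mask_twice` and the per-token masking probability are our illustrative assumptions (the paper samples a random subset of positions), not the authors' implementation.

```python
import random

MASK = "[MASK]"

def mask_twice(target_tokens, mask_prob=0.5, seed=0):
    """Mask the same target sentence twice with independent random subsets;
    return both masked views and the set of positions masked in both."""
    rng = random.Random(seed)
    views, position_sets = [], []
    for _ in range(2):
        positions = {i for i in range(len(target_tokens))
                     if rng.random() < mask_prob}
        if not positions:  # always mask at least one token, as in CMLM training
            positions = {rng.randrange(len(target_tokens))}
        views.append([MASK if i in positions else tok
                      for i, tok in enumerate(target_tokens)])
        position_sets.append(positions)
    # Positions masked in BOTH views: the targets of shared mask consistency.
    shared = position_sets[0] & position_sets[1]
    return views, shared
```

The returned `shared` set plays the role of the collection of shared masked words used by the consistency loss.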
In this section, we propose to improve the CMLM-based NAT with consistency regularization methods. Specifically, we focus on regularizing the generated predictions to be invariant to model parameters and to semantic-preserving data perturbations.
We introduce the model consistency regularization method to encourage consistent predictions from an online model and an average model. The average model weights are maintained by an exponential moving average (EMA) method (yellow part in Figure 2). Previous works have demonstrated that averaged model weights tend to achieve better performance than the final model weights Polyak and Juditsky (1992); Vaswani et al. (2017); Tarvainen and Valpola (2017). To take advantage of the average model, we adopt the bidirectional KL divergence to encourage prediction consistency between the two models. We can thereby increase the robustness of our model to the model weights and learn better representations. Furthermore, similar to the recent work of Liang et al. (2021), the dropout units of the online model and the average model differ due to randomness, so the prediction consistency also penalizes sensitivity to random dropout. Overall, the proposed method brings two practical advantages: first, we strengthen the robustness of our model to stochastic model weights; second, the method improves model generalization and reduces the discrepancy between training and inference caused by dropout.
Formally, we define the model consistency loss as the distance between the token-level predictions produced by the online model and the average model using the bidirectional KL divergence:
where represents the parameters of the average model. In order to ensure readability, and are the abbreviation of and . They represent the predictions from the online model and the average model with the first masked sentence, respectively. Similar for and , but with the second masked sentence.
Moreover, the average model parameters $\theta'$ are obtained by the EMA method. At training step $t$, the updated $\theta'_t$ is computed as the EMA of successive weights:

$$\theta'_t = \alpha\, \theta'_{t-1} + (1 - \alpha)\, \theta_t,$$

where $\alpha$ is the moving average decay.
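A minimal sketch of this EMA step, with the parameters represented as a flat dict of scalars for illustration; the `ema_update` helper is our hypothetical naming, not the paper's code:

```python
def ema_update(avg_params, online_params, decay=0.996):
    """One EMA step: theta'_t = decay * theta'_{t-1} + (1 - decay) * theta_t,
    applied elementwise to every named weight."""
    return {name: decay * avg_params[name] + (1.0 - decay) * online_params[name]
            for name in avg_params}
```

In practice this update is applied to every tensor of the online model after each optimizer step; the decay of 0.996 matches the setting reported later in the paper.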
Shared Mask Consistency
We present our shared mask consistency regularization method in this part. As the examples in Table 1 and Figure 2 show, we randomly mask the same target sentence twice and forward the two views to the online model and the average model. Consider a simple example with the sentence pairs "the [MASK] is [MASK] . Diese Katze ist lustig ." and "[MASK] cat is [MASK] . Diese Katze ist lustig .". Each shared [MASK] to predict can be regarded as a semantic-preserving, token-level "positive pair". We hypothesize that the representation is view-agnostic, and that the semantics are shared between the different views produced by random masking. Therefore, the predicted distributions of shared masked tokens are expected to be consistent. Note that we do not enforce distribution consistency at the other positions. To illustrate the reason, consider the second position: "cat" is observed in the second target sentence but masked in the first. If we forced the distributions at this position to be consistent, the model would be confused between "[MASK]" and "cat", leading to inferior performance.
Mathematically, we define the shared mask consistency cost as the distance between the prediction distributions at the shared masked positions. Similar to model consistency, we adopt the bidirectional KL divergence:

$$\mathcal{L}_{smc} = \frac{1}{N_s} \sum_{y_t \in Y^s_{mask}} \left[ D_{KL}\!\left(P^1 \,\|\, P^2\right) + D_{KL}\!\left(P^2 \,\|\, P^1\right) \right],$$

where $N_s$ represents the number of shared masked tokens between $Y^1_{mask}$ and $Y^2_{mask}$. $\mathcal{L}^{oo}_{smc}$, $\mathcal{L}^{oa}_{smc}$, and $\mathcal{L}^{ao}_{smc}$ represent the shared mask consistency between the two masked sentences when they are fed into the online-online, online-average, and average-online model pairs, respectively.
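A minimal sketch of this cost, assuming plain Python lists as vocabulary distributions; the helper names are ours, and batching and gradients are omitted:

```python
import math

def kl_div(p, q):
    """KL(p || q) between two discrete vocabulary distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shared_mask_consistency(dists_a, dists_b, shared_positions):
    """Bidirectional KL between the two views' predictions, averaged over
    the positions masked in both views (the shared subset)."""
    total = sum(kl_div(dists_a[i], dists_b[i]) + kl_div(dists_b[i], dists_a[i])
                for i in shared_positions)
    return total / len(shared_positions)
```

The cost is zero exactly when the two views agree at every shared position, which is the consistency the regularizer encourages.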
Autoregressive NMT models generate words one by one, and the length of the target sentence is decided by encountering the special token [EOS]. However, non-autoregressive NAT models generate the target sentence in parallel, thus requiring the predicted length before decoding. Following Gu et al. (2018); Ghazvininejad et al. (2019), we add an additional special token [LEN] to the beginning of the source input. Then we predict the target length $m$ from the source sentence $X$. Mathematically, we define the loss function of this classification task as:

$$\mathcal{L}_{len} = -\log P(m \mid X; \theta), \quad m \le L_{max},$$

where $L_{max}$ represents the max length of the target sentences in our corpus.
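This length prediction is a standard softmax cross-entropy over candidate lengths; a dependency-free sketch follows, where the `length_nll` helper is hypothetical and `length_logits` stands for the scores produced from the [LEN] token representation:

```python
import math

def length_nll(length_logits, true_length):
    """Negative log-likelihood of the true target length under a softmax
    over candidate lengths (one logit per length up to the corpus maximum)."""
    m = max(length_logits)  # max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in length_logits))
    return log_z - length_logits[true_length]
```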
The final training objective for MvSR-NAT is the sum of all aforementioned loss functions:

$$\mathcal{L} = \mathcal{L}^1_{nll} + \mathcal{L}^2_{nll} + \mathcal{L}_{len} + \lambda \left( \mathcal{L}_{mc} + \mathcal{L}_{smc} \right),$$

where $\lambda$ is a hyperparameter that controls the KL losses. Jointly training with the two proposed consistency regularization losses improves the robustness and generalization ability of our MvSR-NAT model to randomness (e.g., model weights and random masks in the target sentence).
The overall training algorithm of our model is presented in Algorithm 1.
During inference, we feed a sentence with all [MASK] as target input for the first iteration, where its length is determined by the length prediction. Then we refine the translation result by masking-out and re-predicting a subset of words whose probabilities are under a threshold within several iterations. For more details, please refer to Ghazvininejad et al. (2019).
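The iterative procedure above can be sketched as follows, where `predict_fn` is a stand-in for the CMLM decoder conditioned on the source sentence and returns filled-in tokens with their confidences. Here the re-masked subset is chosen with the linear count schedule of Ghazvininejad et al. (2019); selecting tokens under a probability threshold, as described above, works analogously.

```python
def mask_predict(predict_fn, length, iterations=4, mask="[MASK]"):
    """Mask-predict decoding sketch: start from an all-[MASK] target whose
    length comes from the length predictor, then repeatedly re-mask and
    re-predict the least-confident tokens."""
    tokens = [mask] * length
    for t in range(iterations):
        tokens, scores = predict_fn(tokens)  # model fills in every position
        if t == iterations - 1:
            break
        # Re-mask the n least-confident tokens for the next iteration.
        n = int(length * (iterations - 1 - t) / iterations)
        for i in sorted(range(length), key=lambda i: scores[i])[:n]:
            tokens[i] = mask
    return tokens
```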
We conduct our experiments on five public benchmarks: WMT14 ENDE (4.5M translation pairs), WMT16 ENRO (610K translation pairs), and IWSLT DEEN (150K translation pairs). We strictly follow the dataset configurations of previous works. Specifically, we preprocess the datasets following Lee, Mansimov, and Cho (2018), then segment the tokens into subword units using the BPE method Sennrich, Haddow, and Birch (2016). For WMT14 ENDE, we use newstest-2013 and newstest-2014 as our development and test sets, respectively. For WMT16 ENRO, we use newsdev-2016 and newstest-2016 as our development and test sets, respectively.
We adopt the widely used BLEU Papineni et al. (2002) to evaluate the translation quality.
| Model | Iter. | WMT'14 ENDE | WMT'14 DEEN | WMT'16 ENRO | WMT'16 ROEN | IWSLT DEEN | Speedup |
|---|---|---|---|---|---|---|---|
| LSTMS2S Bahdanau, Cho, and Bengio (2015) | T | 24.60 | - | - | - | - | - |
| ConvS2S Gehring et al. (2017) | T | 26.42 | - | - | - | - | - |
| Transformer Vaswani et al. (2017) | T | 28.04 | 32.69 | 34.13 | 34.46 | 32.99 | 1.00x |
| NAT-FT Gu et al. (2018) | 1 | 19.17 | 23.20 | 29.79 | 31.44 | 24.21 | 2.36x |
| Imit-NAT Wei et al. (2019) | 1 | 24.15 | 27.28 | 31.45 | 31.81 | - | - |
| NAT-Hint Li et al. (2019) | 1 | 21.11 | 25.24 | - | - | - | - |
| Flowseq Ma et al. (2019) | 1 | 23.72 | 28.39 | 29.73 | 30.72 | - | 1.1x |
| NAT-DCRF Sun et al. (2019) | 1 | 26.07 | 29.68 | - | - | 29.99 | 9.63x |
| GLAT-NAT Qian et al. (2021) | 1 | 26.55 | 31.02 | 32.87 | 33.51 | - | 7.9x |
| NAT-IR Lee, Mansimov, and Cho (2018) | 5 | 20.26 | 23.86 | 28.86 | 29.72 | - | - |
| CMLM-NAT Ghazvininejad et al. (2019) | 1 | 18.05 | 21.83 | 27.32 | 28.20 | - | 27.51x |
| MvSR-NAT (w/ kd) | 4 | 26.25 | 30.27 | 32.76 | 32.96 | - | 9.79x |
| MvSR-NAT (w/o kd) | 4 | 22.89 | 26.89 | 32.34 | 33.60 | 30.58 | 9.79x |
The online model setting strictly follows previous works. For the WMT14 ENDE and WMT16 ENRO datasets, the model setting is based on the base Transformer Vaswani et al. (2017). Specifically, we set the model dimension to 512 and the inner dimension to 2048; the Encoder and the Decoder each consist of a stack of 6 Transformer layers. For the smaller IWSLT16 DEEN dataset, we follow the small Transformer configuration: the model dimension and inner dimension are 256 and 1024, respectively, and the Encoder and the Decoder each consist of a stack of 5 Transformer layers. Besides, we set the max target sentence length to 1000.
The average model is built upon the online model with the same architecture. Besides, its model weights are maintained by an exponential moving average method with the moving average decay set as 0.996.
During training, we train the model with 2048 tokens per batch on eight GTX 2080Ti GPUs. We use the Adam optimizer Kingma and Ba (2015) and a warmup learning rate schedule. During inference, we set the number of length candidates to 5 for our NAT model. For a fair comparison, we set the beam size to 5 for the baseline AT model. Moreover, we evaluate the final translation accuracy by averaging 10 checkpoints.
Sequential-Level Knowledge Distillation
Previous works have demonstrated the effectiveness of sequential-level knowledge distillation for NAT models Gu et al. (2018); Lee, Mansimov, and Cho (2018); Gu, Wang, and Zhao (2019); Ghazvininejad et al. (2019); Zhou, Neubig, and Gu (2020). Following these works, we train our MvSR-NAT model on distilled corpora produced by a standard left-to-right Transformer model. While previous AT Transformers differ in performance, we adopt the one used in CMLM-NAT Ghazvininejad et al. (2019), which is our primary baseline. In Section Effect of Sequential-level Knowledge Distillation, we examine the effect of knowledge distillation on our model.
| Model Variants | WMT'16 ENRO | | WMT'16 ROEN | |
|---|---|---|---|---|
| + model consistency | 31.89 (+0.49) | 33.10 (+0.24) | 33.15 (+0.28) | 33.52 (+0.37) |
| + shared mask consistency | 32.16 (+0.76) | 33.49 (+0.63) | 33.79 (+0.92) | 34.08 (+0.93) |
| MvSR-NAT | 32.34 (+0.94) | 33.76 (+0.90) | 33.60 (+0.73) | 34.45 (+1.30) |
Table 2 shows our experimental results on three public datasets. Looking at the first part of the table, our model achieves performance comparable to the Transformer model. Notably, on the smaller WMT16 ENRO and IWSLT14 DEEN datasets, the translation results are only 0.01-0.44 BLEU points behind.
In the second part of Table 2, compared with pure NAT models with one-shot decoding, the iterative decoding methods achieve noticeable improvements; the same holds for the CMLM-based NAT models. This phenomenon is mainly due to the multimodality problem Gu et al. (2018): one-shot decoding hardly considers the left-to-right dependency, while the iterative methods explicitly model the conditional dependency between target tokens across several iterations and thus obtain better performance.
Compared with our primary baseline, the CMLM-NAT model, our model is additionally optimized with the two regularization methods without changing the CMLM architecture. Our model outperforms CMLM-NAT by margins of 0.36-1.14 BLEU, illustrating the effectiveness of our methods.
Effect of Sequential-level Knowledge Distillation
The comparison results for knowledge distillation are shown in Table 2. On the large dataset, i.e., WMT14 ENDE, our model gains improvements from sequential-level knowledge distillation. However, these improvements do not carry over to the small dataset, i.e., WMT16 ENRO. We attribute this phenomenon to the complexity of the datasets Zhou, Neubig, and Gu (2020). Knowledge distillation reduces the "modes" (alternative translations for an input) in the training data, thus benefiting NAT models. We conjecture that a small dataset is likely to contain fewer redundant "modes" than a large-scale dataset. As a result, knowledge distillation is more helpful and efficient on a large dataset than on a small one.
Model Consistency vs. Shared Mask Consistency
As shown in Table 3, we conduct comparative experiments on the validation set of the WMT16 ENRO task to illustrate the contribution of the two proposed regularization methods. Note that the results are computed without knowledge distillation. Compared with the CMLM-NAT baseline model, the proposed model consistency and shared mask consistency regularization methods progressively improve the performance, with shared mask consistency providing the larger gain.
Furthermore, to further understand the two proposed regularization methods, Figure 3 shows the BLEU score over training epochs on the IWSLT14 DEEN task with a single decoding iteration. For a fair comparison, in every training forward pass we feed two source-target sentence pairs to each compared model. The training curves show that a) model consistency improves performance without changing the convergence trend; b) shared mask consistency slows convergence in the early training period but obtains better performance in the final epochs. This indicates that shared mask consistency can avoid premature fitting and improve the robustness and generalization ability of our model.
Effect of Weight $\lambda$
| KL Loss Weight $\lambda$ | WMT'16 ENRO | WMT'16 ROEN |
|---|---|---|
We investigate the effect of the loss weight $\lambda$, which controls the KL-divergence losses. We conduct ablation experiments on WMT16 ENRO with different values of $\lambda$. The results are shown in Table 4. We can see that small KL loss weights perform better than larger ones; in our setting, a small value of $\lambda$ is the best choice, and too much regularization (e.g., $\lambda = 3$) even decreases the model performance.
$K$-time Shared Mask Consistency
| $K$ | WMT'16 ENRO | WMT'16 ROEN |
|---|---|---|
As the example in Table 1 shows, we forward two masked target sentences to the model and encourage their masked-subset predictions to be consistent. An interesting question is whether further improvements can be achieved by forwarding three or more masked targets with different mask strategies. In this study, we define the number of masked targets as $K$. We conduct comparative experiments on $K$ on the WMT16 ENRO dataset. The results in Table 5 show that $K = 2$ is good enough for these tasks. This indicates that our two consistency regularization methods already impose a strong regularization effect between two distributions, without the need to regularize additional distributions.
Dropout Probability in Average Model
| Dropout Prob. | WMT'16 ENRO | WMT'16 ROEN |
|---|---|---|
As mentioned in subsection Model Consistency, the model consistency method strengthens the robustness of our model to model weights and random dropout. Here, we investigate different dropout rates in the average model: we apply different dropout values to the average model during training. As shown in Table 6, we test a range of dropout values on the WMT16 ENRO dataset. We can see that the best dropout value for the average model is 0.3.
In this paper, building upon the CMLM-based architecture, we introduce the Multi-view Subset Regularization method to improve CMLM-based NAT performance. We first propose the shared mask consistency method, which forces the masked-subset predictions to be consistent under random mask strategies. Second, we propose model consistency to encourage the online model to generate distributions consistent with those of the average model, whose weights are maintained with an EMA method. On several benchmark datasets, we demonstrate that our approach achieves considerable improvements over previous non-autoregressive models and comparable results to the autoregressive Transformer model. This work introduces a new regularization paradigm, multi-view subset regularization, and we hope it can be helpful for recent contrastive learning models.
- Ahmad et al. (2021) Ahmad, W. U.; Chakraborty, S.; Ray, B.; and Chang, K.-W. 2021. Unified Pre-training for Program Understanding and Generation. In NAACL.
- Athiwaratkun et al. (2019) Athiwaratkun, B.; Finzi, M.; Izmailov, P.; and Wilson, A. G. 2019. There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average. In ICLR.
- Bahdanau, Cho, and Bengio (2015) Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
- Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. ArXiv, abs/2005.14165.
- Carmon et al. (2019) Carmon, Y.; Raghunathan, A.; Schmidt, L.; Liang, P.; and Duchi, J. C. 2019. Unlabeled Data Improves Adversarial Robustness. In NeurIPS.
- Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.
- Dong et al. (2019) Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS.
- Gehring et al. (2017) Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. 2017. Convolutional Sequence to Sequence Learning. ArXiv, abs/1705.03122.
- Ghazvininejad et al. (2019) Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP.
- Ghazvininejad, Levy, and Zettlemoyer (2020) Ghazvininejad, M.; Levy, O.; and Zettlemoyer, L. 2020. Semi-Autoregressive Training Improves Mask-Predict Decoding. ArXiv, abs/2001.08785.
- Gu et al. (2018) Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O. K.; and Socher, R. 2018. Non-Autoregressive Neural Machine Translation. In ICLR.
- Gu, Wang, and Zhao (2019) Gu, J.; Wang, C.; and Zhao, J. 2019. Levenshtein Transformer. In NeurIPS.
- Guo, Xu, and Chen (2020) Guo, J.; Xu, L.; and Chen, E. 2020. Jointly Masked Sequence-to-Sequence Model for Non-Autoregressive Neural Machine Translation. In ACL.
- Joshi et al. (2020) Joshi, M.; Chen, D.; Liu, Y.; Weld, D. S.; Zettlemoyer, L.; and Levy, O. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8: 64–77.
- Kasai et al. (2020) Kasai, J.; Cross, J.; Ghazvininejad, M.; and Gu, J. 2020. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, 5144–5155. PMLR.
- Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.
- Laine and Aila (2017) Laine, S.; and Aila, T. 2017. Temporal Ensembling for Semi-Supervised Learning. ArXiv, abs/1610.02242.
- Lample and Conneau (2019) Lample, G.; and Conneau, A. 2019. Cross-lingual Language Model Pretraining. ArXiv, abs/1901.07291.
- Lee, Mansimov, and Cho (2018) Lee, J.; Mansimov, E.; and Cho, K. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP.
- Li et al. (2019) Li, Z.; Lin, Z.; He, D.; Tian, F.; Qin, T.; Wang, L.; and Liu, T.-Y. 2019. Hint-Based Training for Non-Autoregressive Machine Translation. In EMNLP-IJCNLP.
- Liang et al. (2021) Liang, X.; Wu, L.; Li, J.; Wang, Y.; Meng, Q.; Qin, T.; Chen, W.; Zhang, M.; and Liu, T.-Y. 2021. R-Drop: Regularized Dropout for Neural Networks. ArXiv, abs/2106.14448.
- Liu et al. (2020) Liu, X.; Cheng, H.; He, P.; Chen, W.; Wang, Y.; Poon, H.; and Gao, J. 2020. Adversarial Training for Large Neural Language Models. ArXiv, abs/2004.08994.
- Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692.
- Ma et al. (2019) Ma, X.; Zhou, C.; Li, X.; Neubig, G.; and Hovy, E. 2019. FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. In EMNLP-IJCNLP.
- Miyato et al. (2019) Miyato, T.; Maeda, S.-I.; Koyama, M.; and Ishii, S. 2019. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1979–1993.
- Oliver et al. (2018) Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. In NeurIPS.
- Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL.
- Polyak and Juditsky (1992) Polyak, B. T.; and Juditsky, A. B. 1992. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4): 838–855.
- Qian et al. (2021) Qian, L.; Zhou, H.; Bao, Y.; Wang, M.; Qiu, L.; Zhang, W.; Yu, Y.; and Li, L. 2021. Glancing Transformer for Non-Autoregressive Neural Machine Translation. In ACL.
- Sajjadi, Javanmardi, and Tasdizen (2016) Sajjadi, M. S. M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning. In NIPS.
- Sennrich, Haddow, and Birch (2016) Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.
- Song et al. (2019) Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T.-Y. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML.
- Sun et al. (2021) Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y.; Liu, W.; Wu, Z.; Gong, W.; Liang, J.; Shang, Z.; Sun, P.; Liu, W.; Ouyang, X.; Yu, D.; Tian, H.; Wu, H.; and Wang, H. 2021. ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation. ArXiv, abs/2107.02137.
- Sun et al. (2019) Sun, Z.; Li, Z.; Wang, H.; Lin, Z.; He, D.; and Deng, Z.-H. 2019. Fast structured decoding for sequence models. arXiv preprint arXiv:1910.11555.
- Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
- Tarvainen and Valpola (2017) Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS.
- Tian, Krishnan, and Isola (2020) Tian, Y.; Krishnan, D.; and Isola, P. 2020. Contrastive Multiview Coding. In ECCV.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS.
- Wang and Cho (2019) Wang, A.; and Cho, K. 2019. BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. ArXiv, abs/1902.04094.
- Wei et al. (2019) Wei, B.; Wang, M.; Zhou, H.; Lin, J.; Xie, J.; and Sun, X. 2019. Imitation learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1906.02041.
- Xie et al. (2020a) Xie, P.; Cui, Z.; Chen, X.; Hu, X.; Cui, J.; and Wang, B. 2020a. Infusing Sequential Information into Conditional Masked Translation Model with Self-Review Mechanism. In COLING.
- Xie et al. (2020b) Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2020b. Unsupervised Data Augmentation for Consistency Training. arXiv: Learning.
- Zhai et al. (2019) Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4L: Self-Supervised Semi-Supervised Learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 1476–1485.
- Zhang et al. (2018) Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep Mutual Learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4320–4328.
- Zheng et al. (2021) Zheng, B.; Dong, L.; Huang, S.; Wang, W.; Chi, Z.; Singhal, S.; Che, W.; Liu, T.; Song, X.; and Wei, F. 2021. Consistency Regularization for Cross-Lingual Fine-Tuning. ArXiv, abs/2106.08226.
- Zhou, Neubig, and Gu (2020) Zhou, C.; Neubig, G.; and Gu, J. 2020. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. In ICLR.
- Zhu et al. (2019) Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2019. Freelb: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764.
- Łukasz Kaiser et al. (2018) Łukasz Kaiser; Roy, A.; Vaswani, A.; Parmar, N.; Bengio, S.; Uszkoreit, J.; and Shazeer, N. 2018. Fast Decoding in Sequence Models using Discrete Latent Variables. In ICML.