1 Introduction
Non-autoregressive machine translation models can significantly improve decoding speed by predicting every word in parallel (Gu et al., 2018; Libovický and Helcl, 2018). This advantage comes at a cost to performance, since modeling word order is trickier when the model cannot condition on its previous predictions. A range of semi-autoregressive models (Lee et al., 2018; Stern et al., 2019; Gu et al., 2019; Ghazvininejad et al., 2019) have shown that there is a speed-accuracy tradeoff that can be optimized with limited forms of autoregression. However, increasing the performance of purely non-autoregressive models without sacrificing decoding speed remains an open challenge. In this paper, we present a new training loss for non-autoregressive machine translation that softens the penalty for word order errors, and significantly improves performance with no modification to the model or to the decoding algorithm.
Figure 1: The target sequence (“it tastes pretty good though”) and the model’s top-5 predictions at each position. The predictions are shifted with respect to the target, so position-wise cross entropy penalizes nearly every token despite the small edit distance.
Existing models (both autoregressive and non-autoregressive) are typically trained with cross entropy loss. Cross entropy is a strict loss function, where a penalty is incurred for every word that is predicted out of position, even for output sequences with small edit distances (see Figure 1). Autoregressive models learn to avoid such penalties, since words are generated conditioned on the sentence prefix. However, non-autoregressive models do not know the exact sentence prefix, and should (intuitively) focus more on root errors (e.g. a missing word) while allowing more partial credit for cascading errors (the right word in the wrong place).
To achieve this more relaxed loss, we introduce aligned cross entropy (AXE), a new objective function that computes the cross entropy loss based on an alignment between the sequence of token labels and the sequence of token distribution predictions. AXE uses dynamic programming to find the monotonic alignment that minimizes the cross entropy loss. It provides non-autoregressive models with a more accurate training signal by ignoring absolute positions and focusing on relative order and lexical matching. We efficiently implement AXE via matrix operations, and use it to train conditional masked language models (CMLM; Ghazvininejad et al., 2019) for machine translation. AXE only slightly increases training time compared to cross entropy, and requires no changes to parallel argmax decoding.
Extensive experiments on machine translation benchmarks demonstrate that AXE substantially boosts the performance of CMLMs while retaining the same decoding speed. On WMT’14 EN-DE, training CMLMs with AXE (instead of the regular cross entropy loss) increases performance by 5 BLEU points; we observe similar trends on WMT’16 EN-RO and WMT’17 EN-ZH. Moreover, AXE CMLMs significantly outperform state-of-the-art non-autoregressive models, such as FlowSeq (Ma et al., 2019), as well as the recent CRF-based semi-autoregressive model with bigram LM decoding (Sun et al., 2019). Our detailed analysis suggests that training with AXE makes models more confident in their predictions, thus reducing multimodality and alleviating a key problem in non-autoregressive machine translation.
2 Aligned Cross Entropy
Let $Y = y_1, \ldots, y_n$ be a target sequence of $n$ tokens, and let $P = P_1, \ldots, P_m$ be the model predictions, a sequence of $m$ token probability distributions. Our goal is to find a monotonic alignment between $Y$ and $P$ that will minimize the cross entropy loss, and thus focus the penalty on lexical errors (predicting the wrong token) rather than positional errors (predicting the right token in the wrong place). We define an alignment $\alpha$ to be a function that maps target positions to prediction positions, i.e. $\alpha: \{1, \ldots, n\} \to \{1, \ldots, m\}$. We further assume that this alignment is monotonic, i.e. $\alpha(i) \leq \alpha(j)$ if $i \leq j$. Given a specific alignment $\alpha$, we define a conditional loss $L_\alpha$ as:
$$L_\alpha(Y, P) = -\sum_{i=1}^{n} \log P_{\alpha(i)}(y_i) \;-\; \sum_{k \notin \alpha(\{1,\ldots,n\})} \log P_k(\varepsilon) \qquad (1)$$
The first term of this loss function is an aligned cross entropy between $Y$ and $P$, and the second term is a penalty for unaligned predictions. Epsilon ($\varepsilon$) is a special “blank” token in our vocabulary that appears in the probability distributions, but that does not appear in the final output string.
Now, the final loss is the minimum over all possible monotonic alignments of the conditional loss:

$$\mathrm{AXE}(Y, P) = \min_{\alpha} L_\alpha(Y, P) \qquad (2)$$
Finding the optimal monotonic alignment between two sequences is a well-studied problem. For instance, dynamic time warping (DTW) (Sakoe and Chiba, 1978) is a well-known algorithm for finding the optimal alignment between two different time series. Here we extend the idea to compute the optimal alignment between a sequence of target tokens and a sequence of prediction probability distributions. We use a simple dynamic program to find the optimal alignment while calculating the AXE loss.
Table 1: The three local operations of AXE’s dynamic program, where $A_{i,j}$ is the minimum loss for aligning the target prefix $y_1, \ldots, y_i$ to the prediction prefix $P_1, \ldots, P_j$.

Align — aligns the current target $y_i$ with the current prediction $P_j$, updating along the diagonal: $A_{i,j} = A_{i-1,j-1} - \log P_j(y_i)$.

Skip Prediction — skips the current prediction by predicting an empty token ($\varepsilon$), updating along the $j$ axis: $A_{i,j} = A_{i,j-1} - \log P_j(\varepsilon)$. This operation is akin to inserting an empty token into the target sequence at the $i$-th position.

Skip Target — skips the current target $y_i$ by predicting it without incrementing the prediction iterator $j$, updating along the $i$ axis: $A_{i,j} = A_{i-1,j} - \delta \log P_j(y_i)$. This operation is akin to duplicating the prediction $P_j$. The hyperparameter $\delta$ controls how expensive this operation is; high values of $\delta$ will discourage alignments that skip too many target tokens.
Figure 2: The target sequence (“it tastes pretty good though”), the model’s top-5 predictions per position, and the alignment AXE finds (2, 3, 3, 4, 5): $y_1$ aligns with $P_2$, both $y_2$ and $y_3$ align with $P_3$, and so on.
Dynamic Programming
Given a sequence of $n$ target tokens $Y$ and a sequence of $m$ predictions $P$, we propose a method to find the score of the optimal alignment between any prefix of these two sequences, $y_1, \ldots, y_i$ and $P_1, \ldots, P_j$, for any $i \leq n$ and $j \leq m$. The score of the optimal alignment for the full sequences is obtained at $i = n$ and $j = m$.
We start by defining a matrix $A$ of $(n+1)$-by-$(m+1)$ dimensions, respectively corresponding to $Y$ and $P$, where $A_{i,j}$ represents the minimum loss value for aligning $y_1, \ldots, y_i$ to $P_1, \ldots, P_j$ as defined in Equation 2. We initialize $A_{0,0}$ to be $0$ and then proceed to fill the matrix by taking the local minimum at each cell from three possible operators: Align, Skip Prediction, and Skip Target. Table 1 describes each operation and its update formula. Once the matrix is full, the cell $A_{n,m}$ will contain the cross entropy loss of the optimal alignment. Algorithm 1 lays out a straightforward implementation of AXE’s dynamic program.
According to Equation 2, the optimal alignment can be many-to-one, where multiple target positions are mapped to a single prediction. This is computed by aligning the first mapped token and skipping the rest of the target tokens. To discourage skipping too many target tokens, we penalize skip target operators separately with the hyperparameter $\delta$, as described in Table 1. Setting $\delta = 1$ will result in the loss function defined in Equation 2, but as we show in our ablation study (Section 4.3), higher values yield better performance in practice.
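To make the recursion concrete, here is a minimal Python sketch of the dynamic program (our own illustrative code, not the paper’s Algorithm 1; the function name, nested-list log-probabilities, and integer token ids are assumptions):

```python
def axe_loss(target, log_probs, blank, delta=1.0):
    """Minimum aligned cross entropy between a target and predictions.

    target: list of n token ids.
    log_probs: m lists, log_probs[j][v] = log P_{j+1}(v).
    blank: id of the epsilon token.
    delta: skip-target penalty coefficient (Table 1).
    """
    n, m = len(target), len(log_probs)
    INF = float("inf")
    # A[i][j] = min loss aligning the first i targets to the first j predictions.
    A = [[INF] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0.0
    # First row: every prediction so far is skipped by predicting epsilon.
    for j in range(1, m + 1):
        A[0][j] = A[0][j - 1] - log_probs[j - 1][blank]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            align = A[i - 1][j - 1] - log_probs[j - 1][target[i - 1]]
            skip_pred = A[i][j - 1] - log_probs[j - 1][blank]
            skip_tgt = A[i - 1][j] - delta * log_probs[j - 1][target[i - 1]]
            A[i][j] = min(align, skip_pred, skip_tgt)
    return A[n][m]
```

With $\delta = 1$, skipping a target token costs the same as aligning it, so the recursion computes exactly the minimum of Equation 2 over monotonic alignments.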
Efficient Implementation
The implementation in Algorithm 1 has $O(n \cdot m)$ time complexity. However, multiple updates of the matrix $A$ can be parallelized on GPUs and other tensor-processing architectures. Rather than iterating over each cell, we iterate over each antidiagonal, computing all the values along the antidiagonal in parallel. In other words, we first compute the value of $A_{1,1}$, followed by $A_{1,2}$ and $A_{2,1}$, and so on. Since the number of antidiagonals is $O(n + m)$, we arrive at a time complexity of $O(n + m)$. Since $m$ is typically on the same order of magnitude as $n$, the linear cost of computing AXE during training becomes negligible compared to forward and backward passes through the model.^1 By doing so, we are able to achieve training times similar to (about 1.2 times slower than) training with cross entropy loss.

^1 Batch implementation of this algorithm is straightforward.

Example
Figure 2 depicts an example application of AXE. We see that the predictions are generally good, but start with a shift with respect to the target. This misalignment would cause the regular cross entropy loss to severely penalize the first three predictions, even though $P_2$ and $P_3$ are correct when aligned with $y_1$ and $y_2$. AXE, on the other hand, finds an alignment between the target and the predictions, which allows it to focus the penalty on the redundant prediction in $P_1$ and the missing token $y_3$ (“pretty”), i.e. the root errors.
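The antidiagonal scheme described in the efficient-implementation paragraph above can be sketched with NumPy (an illustrative wavefront under our own naming, not the paper’s batched implementation): each step fills one antidiagonal of $A$ with a single vectorized update.

```python
import numpy as np

def axe_loss_wavefront(target, lp, blank, delta=1.0):
    """Antidiagonal ("wavefront") evaluation of the AXE dynamic program.

    target: length-n list of token ids.
    lp: (m, V) array of log-probabilities, lp[j, v] = log P_{j+1}(v).
    The sequential depth is O(n + m) instead of O(n * m).
    """
    n, m = len(target), lp.shape[0]
    C = -lp[:, target].T                 # C[i-1, j-1] = -log P_j(y_i), shape (n, m)
    E = -lp[:, blank]                    # E[j-1] = -log P_j(eps)
    A = np.full((n + 1, m + 1), np.inf)
    A[0, 0] = 0.0
    for k in range(1, n + m + 1):        # k-th antidiagonal: cells with i + j == k
        i = np.arange(max(0, k - m), min(n, k) + 1)
        j = k - i
        # Out-of-range branches are masked with inf; wrapped indices are discarded.
        align = np.where((i > 0) & (j > 0), A[i - 1, j - 1] + C[i - 1, j - 1], np.inf)
        skip_pred = np.where(j > 0, A[i, j - 1] + E[j - 1], np.inf)
        skip_tgt = np.where((i > 0) & (j > 0), A[i - 1, j] + delta * C[i - 1, j - 1], np.inf)
        A[i, j] = np.minimum(align, np.minimum(skip_pred, skip_tgt))
    return A[n, m]
```

Each antidiagonal only depends on the two previous ones, so the loop body can be dispatched as one tensor operation per step.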
3 Training Non-Autoregressive Models
We use AXE to train conditional masked language models (CMLMs) for non-autoregressive machine translation (Ghazvininejad et al., 2019).^2

^2 While in this work we apply AXE to CMLMs, the loss function can be used to train other models as well. We leave further investigation of this direction to future work.
3.1 Conditional Masked Language Models
A conditional masked language model takes a source sequence $X$ and a partially-observed target sequence $Y_{\mathrm{obs}}$ as input, and predicts the probabilities of the masked (unobserved) target sequence tokens $Y_{\mathrm{mask}}$. The underlying architecture is an encoder-decoder transformer (Vaswani et al., 2017).
In the original paper, CMLMs are used for machine translation, where a random subset of tokens is masked at training time. However, at inference time all target tokens are masked ($Y_{\mathrm{obs}} = \emptyset$) and the length of $Y_{\mathrm{mask}}$ (the number of masked tokens) is unknown. To estimate the length of $Y_{\mathrm{mask}}$, an auxiliary task is introduced to predict the target length based on the source sequence $X$.^3

^3 See Ghazvininejad et al. (2019) for further detail.

3.2 Adapting CMLMs to AXE
In our case, the model can also produce blank tokens ($\varepsilon$), which effectively shorten the predicted sequence’s length. To account for potentially skipped tokens during inference, we multiply the predicted length by a length multiplier hyperparameter (which is tuned on the validation set) before applying argmax decoding.
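A minimal sketch of this decoding rule (hypothetical names; the actual CMLM implementation differs): inflate the predicted length, take the argmax at every position in parallel, and drop the blank tokens from the output.

```python
def decode(logits, predicted_length, blank, multiplier=1.1):
    """Argmax-decode a non-autoregressive output, dropping blank tokens.

    logits: list of per-position score lists (one list per decoder slot).
    predicted_length: target length estimated from the source sequence.
    multiplier: inflates the length to leave room for blanks (tuned on dev).
    """
    # Decode multiplier * predicted_length positions in parallel.
    m = min(len(logits), round(multiplier * predicted_length))
    tokens = [max(range(len(logits[j])), key=lambda v: logits[j][v]) for j in range(m)]
    # Epsilon never appears in the final string: filter it out.
    return [t for t in tokens if t != blank]
```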
3.3 Adapting the Training Objectives to AXE
Since this work focuses on the purely non-autoregressive setting, the entire target sequence will be masked at inference time ($Y_{\mathrm{obs}} = \emptyset$). The same does not have to hold for training; we can utilize partially-observed sequences in order to provide the learner with easier and more focused training examples. We experiment with three variations:
Unobserved Input, Predict All
All the tokens in the target sequence are masked, and the model is expected to predict all of them. This is a direct replication of the task at inference time. While AXE allows the number of masked tokens $m$ to be different from the length of the gold target sequence $n$, we found that setting $m = n$ produced better models in preliminary experiments.
Partially-Observed Input, Predict All
As in the original CMLM training process, a random subset of the target sequence is masked before being passed to the model as input.^4 We then apply AXE on the entire sequence, regardless of which tokens were observed. When training on partially-observed inputs, we always set $m = n$ to avoid further alterations of the gold target sequence beyond masking.

^4 The number of masked input tokens is distributed uniformly between 1 and the sequence length.
Partially-Observed Input, Predict Masks
The straightforward application of AXE to CMLM training (which ignores whether each token was masked or observed) works well in practice. However, we can also allow AXE to skip the observed tokens when computing cross entropy, and focus the training signal on the actual task. We do so by zeroing the align cost for every observed token $y_i$; i.e. if the $i$-th token is observed and is aligned with the prediction corresponding to the same position ($\alpha(i) = i$), there is no penalty. Our ablation studies show that this modification provides a modest but consistent boost in performance (see Section 4.3). As a result, we use this setting for training our model.
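One way to implement this variant is to precompute the matrix of align costs and zero the entries for observed tokens before running the dynamic program; the sketch below is illustrative, with hypothetical names and shapes.

```python
def masked_axe_costs(target, log_probs, observed):
    """Align-cost matrix for the 'predict masks' variant described above.

    C[i][j] is the cost of aligning target token i with prediction j; a DP
    like the one in Section 2 can then consume C unchanged. For an observed
    token aligned with the prediction at its own position, the cost is
    zeroed so the loss focuses on the masked tokens.
    """
    n, m = len(target), len(log_probs)
    C = [[-log_probs[j][target[i]] for j in range(m)] for i in range(n)]
    for i in range(n):
        if observed[i] and i < m:
            C[i][i] = 0.0  # no penalty: this token was given as input
    return C
```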
Table 2: BLEU of CMLMs trained with cross entropy vs. AXE.

Model | WMT’14 EN-DE | WMT’14 DE-EN | WMT’16 EN-RO | WMT’16 RO-EN | WMT’17 EN-ZH | WMT’17 ZH-EN
Cross Entropy CMLM (Ghazvininejad et al., 2019) | 18.05 | 21.83 | 27.32 | 28.20 | 24.23 | 13.64
AXE CMLM (Ours) | 23.53 | 27.90 | 30.75 | 31.54 | 30.88 | 19.79
Table 3: BLEU of purely non-autoregressive models (single decoding iteration) on WMT’14 and WMT’16.

Model | Decoding Iterations | WMT’14 EN-DE | WMT’14 DE-EN | WMT’16 EN-RO | WMT’16 RO-EN
Autoregressive
Transformer Base | — | 27.61 | 31.38 | 34.28 | 33.99
+ Knowledge Distillation | — | 27.75 | 31.30 | — | —
Non-Autoregressive
Iterative Refinement (Lee et al., 2018) | 1 | 13.91 | 16.77 | 24.45 | 25.73
CTC Loss (Libovický and Helcl, 2018) | 1 | 17.68 | 19.80 | 19.93 | 24.71
NAT w/ Fertility (Gu et al., 2018) | 1 | 17.69 | 21.47 | 27.29 | 29.06
Cross Entropy CMLM (Ghazvininejad et al., 2019) | 1 | 18.05 | 21.83 | 27.32 | 28.20
Auxiliary Regularization (Wang et al., 2019) | 1 | 20.65 | 24.77 | — | —
Bag-of-ngrams Loss (Shao et al., 2019) | 1 | 20.90 | 24.61 | 28.31 | 29.29
Hint-based Training (Li et al., 2019) | 1 | 21.11 | 25.24 | — | —
FlowSeq (Ma et al., 2019) | 1 | 21.45 | 26.16 | 29.34 | 30.44
Bigram CRF (Sun et al., 2019) | 1 | 23.44 | 27.22 | — | —
AXE CMLM (Ours) | 1 | 23.53 | 27.90 | 30.75 | 31.54
4 Experiments
We evaluate CMLMs trained with AXE on 6 standard machine translation benchmarks, and demonstrate that AXE significantly improves performance over cross-entropy-trained CMLMs and over recently-proposed non-autoregressive models as well.
4.1 Setup
Translation Benchmarks
We evaluate our method on both directions of three standard machine translation datasets with various training data sizes: WMT’14 English-German (4.5M sentence pairs), WMT’16 English-Romanian (610k pairs), and WMT’17 English-Chinese (20M pairs). The datasets are tokenized into subword units using BPE (Sennrich et al., 2016).^5 We use the same data and preprocessing as Vaswani et al. (2017), Lee et al. (2018), and Wu et al. (2019) for WMT’14 EN-DE, WMT’16 EN-RO, and WMT’17 EN-ZH, respectively. We evaluate performance with BLEU (Papineni et al., 2002) for all language pairs, except for translating English to Chinese, where we use SacreBLEU (Post, 2018).^6

^5 We run joint BPE for all language pairs except English-Chinese.
^6 SacreBLEU hash: BLEU+case.mixed+lang.en-zh+numrefs.1+smooth.exp+test.wmt17+tok.zh+version.1.3.7
Hyperparameters
We generally follow the transformer base hyperparameters (Vaswani et al., 2017): 6 layers for the encoder and decoder, 8 attention heads per layer, 512 model dimensions, and 2048 hidden dimensions. We follow the weight initialization scheme from BERT (Devlin et al., 2018): we sample weights from $\mathcal{N}(0, 0.02^2)$, set biases to zero, and set layer normalization parameters to $\beta = 0$ and $\gamma = 1$. For regularization, we apply dropout, weight decay, and label smoothing. We train batches of 128k tokens using Adam (Kingma and Ba, 2015). The learning rate warms up within 10k steps, and then decays with the inverse square-root schedule. We train all models for 300k steps. We measure the validation loss at the end of each epoch, and average the 5 best checkpoints based on their validation loss to create the final model. We train all models with mixed precision floating point arithmetic on 16 Nvidia V100 GPUs. For autoregressive decoding, we use beam search (Vaswani et al., 2017) and tune the length penalty on the validation set. Similarly, for CMLM models we tune the number of length candidates, the length multiplier,^7 and the target skipping penalty ($\delta$) on the validation set.

^7 Our preliminary analysis shows that AXE selects Skip Prediction 5-10% of the time, roughly suggesting that five to ten percent of generated tokens are epsilons. Hence, we search the same range for the length multiplier.

Knowledge Distillation
Similar to previous work on non-autoregressive translation (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Stern et al., 2019), we use sequence-level knowledge distillation (Kim and Rush, 2016) by training CMLMs on translations generated by a standard left-to-right transformer model (transformer large for WMT’14 EN-DE and WMT’17 EN-ZH, transformer base for WMT’16 EN-RO). We report the performance of standard autoregressive base transformers trained on distilled data for WMT’14 EN-DE and WMT’17 EN-ZH.
Table 4: BLEU of non-autoregressive models trained on raw data (without knowledge distillation), with a single decoding iteration.

Model | Decoding Iterations | WMT’14 EN-DE | WMT’14 DE-EN | WMT’16 EN-RO | WMT’16 RO-EN
Knowledge Distillation
AXE CMLM (Ours) | 1 | 23.53 | 27.90 | 30.75 | 31.54
Raw Data
Cross Entropy CMLM (Ghazvininejad et al., 2019) | 1 | 10.64 | — | 21.22 | —
CTC Loss (Libovický and Helcl, 2018) | 1 | 17.68 | 19.80 | 19.93 | 24.71
FlowSeq (Ma et al., 2019) | 1 | 18.55 | 23.36 | 29.26 | 30.16
AXE CMLM (Ours) | 1 | 20.40 | 24.90 | 30.47 | 31.42
4.2 Main Results
AXE vs Cross Entropy
We first compare the performance of AXE-trained CMLMs to that of CMLMs trained with the original cross entropy loss. Table 2 shows that training with AXE substantially increases the performance of CMLMs across all benchmarks. On average, we gain 5.2 BLEU points by replacing cross entropy with AXE, with gains of up to 6.65 BLEU on WMT’17 EN-ZH.
State of the Art
We compare the performance of CMLMs with AXE against nine strong baseline models: the fertility-based sequence-to-sequence model (Gu et al., 2018), transformers trained with CTC loss (Libovický and Helcl, 2018), the iterative refinement approach (Lee et al., 2018), transformers trained with auxiliary regularization (Wang et al., 2019), CMLMs trained with (regular) cross entropy loss (Ghazvininejad et al., 2019), FlowSeq, a latent variable model based on generative flow (Ma et al., 2019), hint-based training (Li et al., 2019), bag-of-ngrams training (Shao et al., 2019), and the CRF-based semi-autoregressive model (Sun et al., 2019). All of these models except the last one are purely non-autoregressive, while the CRF-based model uses bigram statistics during decoding, which deviates from the purely non-autoregressive setting.^8

^8 CMLMs (Ghazvininejad et al., 2019) and the iterative refinement method (Lee et al., 2018) are presented as semi-autoregressive models that run in multiple decoding iterations. However, the first decoding iteration of these models is purely non-autoregressive, which is what we use as our baselines.
Table 3 shows that our system yields the highest BLEU scores of all non-autoregressive models. AXE-trained CMLMs outperform the best purely non-autoregressive model (FlowSeq) on both directions of WMT’14 EN-DE and WMT’16 EN-RO by 1.6 BLEU on average. Moreover, our approach achieves higher BLEU scores than the semi-autoregressive CRF decoder across all available benchmarks.
Raw Data
Finally, we compare the performance of AXE to other methods that train on raw data, without knowledge distillation. Table 4 shows that AXE CMLMs still significantly outperform other non-autoregressive models in the raw data scenario. In addition, the comparison between raw data and knowledge distillation training follows previously-published results that demonstrate the importance of knowledge distillation for non-autoregressive approaches (Gu et al., 2018; Ghazvininejad et al., 2019; Zhou et al., 2019), although the gap is much smaller for WMT’16 EN-RO.
4.3 Ablation Study
In this section, we consider several variations of our proposed method to investigate the effect of each component. We test the performance of AXE CMLMs with these variations on the WMT’14 DE-EN and EN-DE datasets. To prevent overfitting, we evaluate on the validation set.
Table 5: Validation BLEU for different training objectives on WMT’14.

Input Tokens | Loss Function | EN-DE | DE-EN
Unobserved | All Tokens | 21.97 | 26.32
Partially-Observed | All Tokens | 22.80 | 27.59
Partially-Observed | Only Masks | 23.13 | 28.01
Different Training Objectives  Table 5 shows the effects of different training objectives (Section 3.3), in which all or part of the target tokens are masked and the loss function is calculated on all tokens or on masked tokens only. We find that simulating the inference scenario, where all tokens are unobserved, is actually less effective than revealing a subset of the target tokens as input during training. We speculate that partially-observed inputs add easier examples to the training set, allowing for better optimization, as in curriculum learning (Bengio et al., 2009). We also see that including only the masked tokens in the loss function gives a modest but consistent boost in performance, possibly because the training signal is focused on the actual task.
Table 6: Validation BLEU and the percentage of skip target operations for different skip target penalties ($\delta$) on WMT’14.

Penalty ($\delta$) | EN-DE BLEU | EN-DE Skip Target | DE-EN BLEU | DE-EN Skip Target
1 | 22.60 | 17.57% | 26.84 | 16.87%
2 | 23.01 | 10.91% | 27.77 | 10.53%
3 | 22.85 | 9.56% | 27.87 | 9.04%
4 | 22.90 | 8.14% | 28.01 | 7.83%
5 | 23.13 | 7.40% | 27.79 | 6.95%
Skip Target Penalty  The hyperparameter $\delta$ acts as a coefficient for the penalty associated with skipping a target token (see Table 1 for a definition). We experiment with different values of $\delta$, and report our findings in Table 6. We observe that tuning $\delta$ can significantly improve performance with respect to the default of $\delta = 1$. As intended, high values of $\delta$ discourage alignments that skip too many target tokens.
Length Multiplier  The length multiplier inflates the length predicted by a CMLM to account for extra blank tokens ($\varepsilon$) that the model could potentially generate (see Section 3.2 for more detail). Table 7 compares the effect of different length multiplier values. Using the best length multiplier increases performance by 0.53 BLEU on average for WMT’14 EN-DE and WMT’16 EN-RO.
Table 7: Validation BLEU for different length multiplier values on WMT’14 and WMT’16.

Length Multiplier | EN-DE | DE-EN | EN-RO | RO-EN
— | 22.96 | 27.50 | 30.43 | 32.22
— | 23.06 | 27.56 | 30.43 | 32.25
— | 23.09 | 27.70 | 30.66 | 32.50
— | 23.11 | 27.81 | 30.75 | 32.69
— | 23.13 | 27.85 | 30.88 | 32.83
— | 23.13 | 27.93 | 30.88 | 32.94
— | 23.06 | 28.01 | 31.01 | 32.84
— | 23.09 | 27.93 | 31.10 | 32.61
— | 23.06 | 27.71 | 31.14 | 32.45
— | 23.07 | 27.68 | 31.06 | 32.14
— | 22.92 | 27.49 | 30.85 | 32.01
5 Analysis
We provide a qualitative analysis to offer some insight into where AXE improves over cross entropy, and potential directions for future research on non-autoregressive generation.
AXE Handles Long Sequences Better
We first measure the performance of cross entropy versus AXE-trained CMLMs for different sequence lengths. We use compare-mt (Neubig et al., 2019) to split the test sets of WMT’14 EN-DE and DE-EN into different buckets based on target sequence length, and calculate BLEU for each bucket. Table 8 shows that the performance of models trained with cross entropy drops drastically as the sequence length increases, while the performance of AXE-trained models remains relatively stable. One explanation for this result is that the longer the sequence, the more likely we are to observe misalignments between the model’s predictions and the target; AXE realigns these cases, providing the model with a cleaner signal for modeling long sequences.
Table 8: Test-set BLEU per target sequence length bucket on WMT’14.

WMT’14 EN-DE:
Length Bucket | Cross Entropy | AXE
— | 18.75 | 20.48
— | 21.69 | 23.92
— | 18.64 | 24.21
— | 15.37 | 22.65
— | 14.04 | 23.04
— | 11.62 | 23.43

WMT’14 DE-EN:
Length Bucket | Cross Entropy | AXE
— | 22.57 | 24.39
— | 25.28 | 27.86
— | 22.43 | 28.78
— | 19.03 | 27.18
— | 16.16 | 27.55
— | 12.23 | 27.64
AXE Increases Position Confidence
We also study how confident each model is about the position of each generated token. Ideally, we would like each predicted token to have a high probability at the position in which it was predicted and a very low probability in the neighboring positions. After applying argmax decoding, we compute the probability assigned to each generated token in all positions of the sequence, and average these probabilities based on the relative distance (positive or negative) to the generated position. Figure 3 plots these averaged probabilities for both short and long target sequences.
Both models are rather confident in their predictions for short sequences (Figure 3(a)): the probability has a high peak at the generated position and drops rapidly as we move further away. However, for longer sentences (Figure 3(b)), we observe that the plot for cross entropy loses its sharpness. Specifically, the immediate neighbors of the prediction position receive, on average, almost a third of the peak probability. Meanwhile, the probabilities predicted by the AXE-trained model are significantly sharper, assigning negligible probability to the generated token in neighboring positions when compared to the center.
One way to explain this result is that cross entropy training encourages predictions to place some probability mass on their neighbors, in order to “hedge their bets” in case the predictions are misaligned with the target. Since AXE finds the best alignment before computing the actual loss, spreading a token’s probability mass among its neighbors is no longer necessary.
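The averaging procedure described above can be sketched as follows (an illustrative reimplementation with assumed names and shapes, not the analysis code used in the paper):

```python
from collections import defaultdict

def confidence_by_offset(probs, tokens):
    """Average probability of each generated token by relative distance.

    probs[j][v]: probability of token v at position j.
    tokens[k]: the token generated (via argmax) at position k.
    For each generated token, we read off its probability at every position
    and bucket it by the signed distance to its generation position.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for k, tok in enumerate(tokens):
        for j in range(len(probs)):
            d = j - k  # relative distance (negative = earlier position)
            sums[d] += probs[j][tok]
            counts[d] += 1
    return {d: sums[d] / counts[d] for d in sums}
```

A sharply peaked model yields a high average at distance 0 and near-zero averages elsewhere.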
AXE Reduces Multimodality
We further argue that AXE reduces the multimodality problem in non-autoregressive machine translation (Gu et al., 2018). Due to minimal coordination between predictions in many non-autoregressive models, a model might consider many possible translations at the same time. In this situation, the model might merge two or more different translations and generate an inconsistent output that is typically characterized by token repetitions. We therefore use the frequency of repeated tokens as a proxy for measuring multimodality in a model.
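This proxy is straightforward to compute; the sketch below is our own, assuming the rate is defined as the fraction of tokens that repeat their immediate predecessor:

```python
def repetition_rate(sentences):
    """Fraction of generated tokens that repeat the immediately preceding
    token -- a proxy for multimodality in non-autoregressive output."""
    repeats = total = 0
    for toks in sentences:
        for prev, cur in zip(toks, toks[1:]):
            repeats += (prev == cur)
        total += max(len(toks) - 1, 0)
    return repeats / total if total else 0.0
```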
Table 9 shows the repetition rate for cross entropy and AXE-trained CMLMs. Replacing cross entropy with AXE drastically reduces multimodality, decreasing the number of repetitions by a multiplicative factor of 12.
Table 9: Token repetition rates on WMT’14.

Model | EN-DE | DE-EN
Cross Entropy CMLM | 16.72% | 12.31%
AXE CMLM | 1.41% | 1.03%
6 Related Work
Advances in neural machine translation techniques in recent years have brought an increasing interest in breaking the autoregressive generation bottleneck in translation models.
Semi-autoregressive models introduce partial parallelism into the decoding process. Some of these techniques include iterative refinement of translations based on previous predictions (Lee et al., 2018; Ghazvininejad et al., 2019, 2020; Gu et al., 2019; Kasai et al., 2020) and combining a lighter autoregressive decoder with a non-autoregressive one (Sun et al., 2019).
Building a fully non-autoregressive machine translation model is a much more challenging task. One branch of prior work approaches this problem with latent-variable models. Gu et al. (2018) introduce word fertility as a latent variable to model the number of generated tokens per source word. Ma et al. (2019) use generative flow to model a complex distribution over latent variables for parallel decoding of the target. Shu et al. (2019) propose a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure.
There is also work that develops alternative loss functions for non-autoregressive machine translation. Libovický and Helcl (2018) use the Connectionist Temporal Classification training objective, a loss function from the speech recognition literature that is designed to eliminate repetitions. Li et al. (2019) use the learning signal provided by the hidden states and attention distributions of an autoregressive teacher. Yang et al. (2019) improve the decoder’s hidden representations by adding the reconstruction error of the source sentence from these representations as an auxiliary regularization term to the loss function. Finally, Shao et al. (2019) introduce the bag-of-ngrams training objective to encourage the model to capture target-side sequential dependencies.

7 Conclusion
We introduced Aligned Cross Entropy (AXE) as an alternative loss function for training non-autoregressive models. AXE focuses on relative order and lexical matching instead of relying on absolute positions. We showed that, in the context of machine translation, a conditional masked language model (CMLM) trained with AXE significantly outperforms cross-entropy-trained models, setting a new state of the art for non-autoregressive models.
Acknowledgements
We thank Abdelrahman Mohamed for sharing his expertise on non-autoregressive models, and our colleagues at FAIR for valuable feedback.
References

- Bengio et al. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41-48.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ghazvininejad et al. (2019). Mask-Predict: parallel decoding of conditional masked language models. In Proc. of EMNLP-IJCNLP.
- Ghazvininejad et al. (2020). Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.
- Gu et al. (2018). Non-autoregressive neural machine translation. In Proc. of ICLR.
- Gu et al. (2019). Levenshtein transformer. In Proc. of NeurIPS.
- Kasai et al. (2020). Parallel machine translation with disentangled context transformer. arXiv preprint arXiv:2001.05136.
- Kim and Rush (2016). Sequence-level knowledge distillation. In Proc. of EMNLP.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Lee et al. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proc. of EMNLP.
- Li et al. (2019). Hint-based training for non-autoregressive machine translation. arXiv preprint arXiv:1909.06708.
- Libovický and Helcl (2018). End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3016-3021.
- Ma et al. (2019). FlowSeq: non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480.
- Neubig et al. (2019). compare-mt: a tool for holistic comparison of language generation systems. In Proceedings of NAACL (Demo Track).
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers.
- Sakoe and Chiba (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1), pp. 43-49.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Shao et al. (2019). Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. arXiv preprint arXiv:1911.09320.
- Shu et al. (2019). Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. arXiv preprint arXiv:1908.07181.
- Stern et al. (2019). Insertion transformer: flexible sequence generation via insertion operations. In Proc. of ICML.
- Sun et al. (2019). Fast structured decoding for sequence models. In Advances in Neural Information Processing Systems, pp. 3011-3020.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
- Wang et al. (2019). Non-autoregressive machine translation with auxiliary regularization. In Proc. of AAAI.
- Wu et al. (2019). Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations.
- Non-autoregressive video captioning with iterative refinement. arXiv preprint arXiv:1911.12018.
- Zhou et al. (2019). Understanding knowledge distillation in non-autoregressive machine translation. arXiv preprint arXiv:1911.02727.