1 Introduction
The Transformer (Vaswani et al., 2017), first proposed for neural machine translation, has been widely used in many text generation tasks, achieving great success for its promising performance. Nevertheless, the autoregressive property of the Transformer has been a bottleneck. Specifically, the decoder of the Transformer generates words sequentially, and later words are conditioned on previous ones in a sentence. This bottleneck prevents the decoder from exploiting parallel computation, and imposes a strong constraint on text generation: the generation order has to be left-to-right (or right-to-left) (Shaw et al., 2018; Vaswani et al., 2017). Recently, many studies (Gu et al., 2018; Lee et al., 2018; Wang et al., 2019; Wei et al., 2019) have been devoted to breaking the autoregressive bottleneck by introducing the non-autoregressive Transformer (NAT) for neural machine translation, where the decoder generates all words simultaneously instead of sequentially. Intuitively, NAT abandons feeding previously predicted words into the decoder state at the next time step, and instead directly copies the encoded source-side representations to the decoder inputs (Gu et al., 2018). However, without the autoregressive constraint, the search space of the output sentence becomes larger (Wei et al., 2019), which leads to a performance gap (Lee et al., 2018) between NAT and the autoregressive Transformer (AT). Related works propose inductive priors or learning techniques to boost the performance of NAT, but most previous work hardly considers explicitly modeling the positions of output words during text generation.
We argue that position prediction is an essential problem for NAT. Current NAT approaches do not explicitly model the positions of output words, and may ignore the reordering issue in generating output sentences. Compared to machine translation, the reordering problem is much more severe in tasks such as table-to-text generation (Liu et al., 2018) and dialog generation (Shen et al., 2017). Additionally, it is straightforward to explicitly model word positions in output sentences, as position embeddings are already used in the Transformer, whose attention is natively non-autoregressive, to include order information. Intuitively, if output positions are explicitly modeled, combining the predicted positions with the Transformer to realize non-autoregressive generation becomes more natural.
In this paper, we propose the non-autoregressive Transformer by position learning (PNAT). PNAT is simple yet effective: it explicitly models the positions of output words as latent variables in text generation. Specifically, we introduce a heuristic search process to guide the position learning, and max sampling is adopted to perform inference in the latent variable model.
PNAT is motivated by learning syntactic positions (also called syntactic distances). Shen et al. (2018) show that the syntactic positions of words in a sentence can be predicted by neural networks in a non-autoregressive fashion, even obtaining top parsing accuracy among strong parser baselines. Given the observations above, we try to directly predict the positions of output words to build a NAT model for text generation.
Our proposed PNAT has the following advantages:

We propose PNAT, which is the first to include the positions of output words as latent variables for text generation. Experiments show that PNAT achieves top results in non-autoregressive NMT, outperforming many strong baselines. PNAT also obtains better results than the AT model on the paraphrase generation task.

Further analysis shows that PNAT has great potential. As position prediction accuracy increases, the performance of PNAT improves significantly. This observation may shed light on future directions for NAT.

Thanks to the explicit modeling of positions, we could control generation by manipulating the position latent variable, which may enable interesting applications such as forcing one particular word to appear to the left of another. We leave this as future work.
2 Background
2.1 Autoregressive Decoding
A target sequence y = (y_1, ..., y_T) is decomposed into a series of conditional probabilities autoregressively, each of which is parameterized using neural networks. This approach has become a de facto standard in language modeling (Sundermeyer et al., 2012), and has also been applied to conditional sequence modeling by introducing an additional conditional variable x:

p(y | x) = ∏_{t=1}^{T} p(y_t | y_{<t}, x)    (1)
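As an illustration, the sequential dependency in Eqn. 1 corresponds to greedy decoding where each step must wait for the previous one. The sketch below uses a hypothetical `next_token_probs` callable as a stand-in for the model:

```python
def ar_greedy_decode(next_token_probs, vocab, max_len, eos="</s>"):
    """Autoregressive greedy decoding (Eqn. 1): each step conditions on
    the prefix generated so far, so the loop is inherently sequential.
    next_token_probs maps a prefix (list of tokens) to a distribution
    over the vocabulary."""
    prefix = []
    for _ in range(max_len):
        p = next_token_probs(prefix)
        tok = vocab[max(range(len(p)), key=p.__getitem__)]  # argmax index
        if tok == eos:
            break
        prefix.append(tok)
    return prefix

# Toy model: emit "a" twice, then end-of-sentence.
vocab = ["a", "b", "</s>"]
model = lambda prefix: [0.6, 0.3, 0.1] if len(prefix) < 2 else [0.1, 0.1, 0.8]
ar_greedy_decode(model, vocab, max_len=10)  # ['a', 'a']
```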
With different choices of neural network architectures, such as recurrent neural networks (RNNs) (Bahdanau et al., 2014; Cho et al., 2014), convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Gehring et al., 2017), as well as the self-attention-based Transformer (Vaswani et al., 2017), autoregressive decoding has achieved great success in tasks such as machine translation (Bahdanau et al., 2014), paraphrase generation (Gupta et al., 2018), speech recognition (Graves et al., 2013), etc.
2.2 Non-Autoregressive Decoding
Autoregressive models suffer from slow decoding at inference time, because tokens are generated sequentially and each of them depends on previous ones. As a solution to this issue, Gu et al. (2018) proposed the Non-Autoregressive Transformer (denoted as NAT) for machine translation, breaking the dependency among the target tokens through time by decoding all the tokens simultaneously. Put simply, NAT (Gu et al., 2018) factorizes the conditional distribution over a target sequence into a series of conditionally independent distributions with respect to time:

p(y | x) = ∏_{t=1}^{T} p(y_t | x)    (2)
which allows trivially finding the most likely target sequence by taking argmax_{y_t} p(y_t | x) for each timestep t, effectively bypassing the computational overhead and suboptimality of decoding from an autoregressive model.
Although non-autoregressive models achieve a significant speedup in machine translation compared with autoregressive models, this comes at the expense of potential performance degradation (Gu et al., 2018). The degradation results from the removal of conditional dependencies within the decoded sentence (y_t no longer depends on y_{<t}). Without such dependencies, it is hard for the decoder to leverage the inherent sentence structure in prediction.
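The conditional independence in Eqn. 2 makes decoding embarrassingly parallel; a minimal sketch with an illustrative toy vocabulary:

```python
def nat_argmax_decode(step_probs, vocab):
    """Non-autoregressive argmax decoding (Eqn. 2): each position is
    decoded independently, so this loop is trivially parallelizable
    (no step depends on a previous prediction)."""
    return [vocab[max(range(len(p)), key=p.__getitem__)] for p in step_probs]

vocab = ["<pad>", "hello", "world"]
probs = [
    [0.1, 0.7, 0.2],  # position 0 -> "hello"
    [0.2, 0.3, 0.5],  # position 1 -> "world"
]
nat_argmax_decode(probs, vocab)  # ['hello', 'world']
```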
2.3 Latent Variables for Non-Autoregressive Decoding
A non-autoregressive model could incorporate conditional dependency through a latent variable z to alleviate the degradation resulting from the absence of dependency:

p(y | x) = Σ_z p(z | x) · ∏_{t=1}^{T} p(y_t | z, x)    (3)
For example, NAT-FT (Gu et al., 2018) models the inherent sentence structure with a latent fertility variable, which represents how many target tokens a source token translates to. Lee et al. (2018) introduce the intermediate predictions as random variables, and refine the predictions from one iteration to the next in an iterative manner.
3 PNAT: Position-Based Non-Autoregressive Transformer
We propose the position-based non-autoregressive Transformer (PNAT), an extension of the Transformer that incorporates non-autoregressive decoding and position learning.
3.1 Modeling Position with Latent Variables
Languages are usually inconsistent with each other in word order, so reordering is usually required when translating a sentence from one language to another. In the NAT family, word representations or encoder states at the source side are copied to the target side and fed into the decoder as its inputs. Previously, Gu et al. (2018) utilized positional attention, which incorporates positional encoding into the decoder attention to perform local reordering. But such an implicit reordering mechanism may cause a repeated generation problem, because the position learning module is not optimized directly and is likely to be misguided by the target supervision.
To tackle this problem, we propose to explicitly model the position as a latent variable. We rewrite the target sequence y with its corresponding position latent variable z = (z_1, ..., z_T). The conditional probability p(y | x) is factorized with respect to the position latent variable:

p(y | x) = Σ_{z ∈ π_T} p(z | x) · ∏_{t=1}^{T} p(y_t | z, x)    (4)

where π_T is the set consisting of all permutations of T elements. At decoding time, the factorization allows us to decode sentences in parallel by pre-predicting the corresponding position variables z.
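Once a permutation over the decoder inputs is predicted, placing each decoder input at its assigned target position is a simple scatter operation; a minimal sketch (the convention that entry i of the permutation stores the target position of input i follows the case-study tables later in the paper):

```python
def apply_position_permutation(decoder_inputs, permutation):
    """Place decoder input i at target position permutation[i]."""
    output = [None] * len(decoder_inputs)
    for i, pos in enumerate(permutation):
        output[pos] = decoder_inputs[i]
    return output

apply_position_permutation(["law", "German", "is"], [1, 0, 2])
# ['German', 'law', 'is']
```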
3.2 Model Architecture
As shown in Figure 1, PNAT is composed of four modules: an encoder stack, a bridge block, a position predictor, and a decoder stack. Before detailing each component of the PNAT model, we give an overview of the architecture for a brief understanding.
Like most sequence-to-sequence models, PNAT first encodes a source sequence x = (x_1, ..., x_N) into its contextual word representations H = (h_1, ..., h_N) with the encoder stack. With the generated contextual word representations at the source side, the bridge block is leveraged to compute the target length T as well as the corresponding decoder input features D = (d_1, ..., d_T), which are fed into the decoder as its inputs. It is worth noting that the decoder inputs are computed without reordering, so the position predictor is introduced to deal with this issue by predicting a permutation z over D. Finally, PNAT generates the target sequence from the decoder inputs D and their permutation z.
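The data flow of the four modules can be sketched as follows; the four callables are hypothetical stand-ins for the encoder stack, bridge block, position predictor, and decoder stack described above:

```python
def pnat_generate(encode, bridge, predict_positions, decode, source_tokens):
    """High-level flow of the four PNAT modules (cf. Figure 1)."""
    H = encode(source_tokens)        # contextual source representations
    T, D = bridge(H)                 # target length + unordered decoder inputs
    z = predict_positions(D, H)      # permutation over the decoder inputs
    return decode(D, z, H)           # non-autoregressive generation

# Dummy modules wiring the interface together (identity pipeline).
out = pnat_generate(
    encode=lambda s: s,
    bridge=lambda H: (len(H), H),
    predict_positions=lambda D, H: list(range(len(D))),
    decode=lambda D, z, H: [D[i] for i in z],
    source_tokens=["a", "b"],
)
# out == ['a', 'b']
```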
Encoder and Decoder
Given a source sentence with length N, the PNAT encoder produces its contextual word representations H. The contextual word representations are further used in computing the target length T and the decoder initial states, and are also used as the memory of attention at the decoder side.
Generally, the PNAT decoder can be considered a Transformer with a broader vision, because it leverages future word information that is invisible to the autoregressive Transformer. Intuitively, we use relative position encoding in self-attention (Shaw et al., 2018), rather than the absolute encoding that is more likely to cause position errors. Following Shaw et al. (2018), with a clipping distance k set for relative positions, we preserve 2k + 1 relations.
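The clipped relative positions of Shaw et al. (2018) can be sketched as follows; with clipping distance k, only 2k + 1 distinct relation embeddings are needed regardless of sequence length:

```python
def relative_position_matrix(length, k):
    """Entry [i][j] is clip(j - i, -k, k) shifted to a non-negative
    index in [0, 2k], suitable for looking up one of 2k + 1 relation
    embeddings (Shaw et al., 2018)."""
    return [[max(-k, min(k, j - i)) + k for j in range(length)]
            for i in range(length)]

relative_position_matrix(3, 1)
# [[1, 2, 2],
#  [0, 1, 2],
#  [0, 0, 1]]
```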
Bridge
The bridge module predicts the target length T, and initializes the decoder inputs D from the source representations H. The target length T could be estimated from the source encoder representation:

(5)

where the length predictor produces a categorical distribution over a bounded range of candidate lengths. It is notable that we use the predicted length at the inference stage, while during training we simply use the length of each reference target sequence. Then, we adopt the method proposed by Li et al. (2019) to compute D. Given the source representation H and the estimated target length T, we linearly combine the embeddings of the neighboring source tokens to generate each d_j as follows:
w_{ij} = softmax_i(-|j - i| / τ)    (6)

d_j = Σ_{i=1}^{N} w_{ij} · h_i    (7)

where w_{ij} is a normalized weight that reflects the contribution of h_i to d_j, and τ is a hyperparameter indicating the sharpness of the weight distribution.
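A sketch of the bridge computation, assuming a soft-copy form with distance -|j - i|/τ; the exact distance function of Li et al. (2019) may differ, so this is an illustrative assumption rather than the paper's precise implementation:

```python
import math

def soft_copy(source_reprs, target_len, tau=0.3):
    """Each decoder input d_j is a softmax-weighted sum of source
    representations, with weights peaked around source position i = j.
    tau controls the sharpness of the weight distribution."""
    N = len(source_reprs)
    dim = len(source_reprs[0])
    out = []
    for j in range(target_len):
        scores = [-abs(j - i) / tau for i in range(N)]
        m = max(scores)                              # stabilize the softmax
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        w = [e / Z for e in exps]                    # normalized weights
        out.append([sum(w[i] * source_reprs[i][d] for i in range(N))
                    for d in range(dim)])
    return out
```

With a small tau, each decoder input is dominated by the nearest source vector; the target length can differ from the source length.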
Position Predictor
For the proposed PNAT, we model position permutations with a position predictor. As shown in Figure 1, the position predictor takes the decoder inputs D and the source representations H to predict a permutation z. The position predictor has a sub-encoder which stacks multiple encoder layers over the decoder inputs to compute the predictor inputs.
With the predicted inputs, we conduct an autoregressive position predictor, denoted as AR-Predictor. The AR-Predictor searches a permutation z with:

p(z | D, H; θ) = ∏_{t=1}^{T} p(z_t | z_{<t}, D, H; θ)    (8)

where θ is the parameter of the AR-Predictor, which includes an RNN-based model incorporated with a pointer network (Vinyals et al., 2015).
To pursue decoding efficiency, we also explore a non-autoregressive version of the position predictor, denoted as NAR-Predictor, which models the position permutation probabilities with:

p(z | D, H) = ∏_{t=1}^{T} p(z_t | D, H)    (9)
To obtain the permutation z, the AR-Predictor performs greedy search, whereas the NAR-Predictor performs a direct argmax over each step. We choose the AR-Predictor as the main position module of PNAT, and we also analyze the effectiveness of position modeling in Sec. 4.4.
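A sketch of the NAR-Predictor's direct argmax. Because the steps in Eqn. 9 are predicted independently, nothing prevents two decoder inputs from claiming the same target position; this collision risk is one plausible reason (an interpretation, not a claim from the paper) why it tends to be less accurate than the AR-Predictor:

```python
def nar_position_argmax(position_probs):
    """Per-step argmax for a non-autoregressive position predictor.
    position_probs[t] is a distribution over target positions for
    decoder input t; steps are independent, so the result may not be
    a valid permutation."""
    return [max(range(len(p)), key=p.__getitem__) for p in position_probs]

nar_position_argmax([[0.9, 0.1], [0.8, 0.2]])
# [0, 0]  -- both inputs claim position 0: not a permutation
```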
3.3 Training
Training requires maximizing the marginalized likelihood in Eqn. 4. However, this is intractable since we would need to enumerate all T! permutations of tokens. We therefore optimize this objective by a Monte Carlo sampling method with a heuristic search algorithm.
Heuristic Search for Positions
Intuitively, each target token should have a corresponding decoder input, and meanwhile each decoder input should be assigned to a target token. Based on this idea, we design a heuristic search algorithm to allocate positions. Given the decoder inputs and the target tokens, we first estimate the similarity between each pair of decoder input d_i and target token embedding e(y_j), where the embedding matrix is also the weight of the target word classifier:

sim(i, j) = cos(d_i, e(y_j))    (10)

Based on the cosine similarity matrix, the heuristic searched position (HSP) is designed to find a perfect matching between decoder inputs and target tokens:

ẑ = argmax_z Σ_i sim(i, z_i)    (11)
As shown in Algorithm 1, we use a greedy algorithm that iteratively selects the pair with the highest similarity score until the permutation is generated.
The complexity of this algorithm is O(T^3) (T is the length of the output sentence). Specifically, the complexity of selecting the maximum from the similarity matrix is O(T^2) for each loop, and we need T loops of greedy search to allocate positions for all decoder inputs.
The intuition behind this is that, if a decoder input is already the most similar one to a target word, it should be easier to keep and even reinforce this association while learning the model. We also analyze the effectiveness of the HSP in Sec. 4.4.
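A sketch of the greedy heuristic search (Algorithm 1), under the assumption that entry i of the returned permutation stores the target position assigned to decoder input i:

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def heuristic_search_positions(decoder_inputs, target_embeddings):
    """Greedy matching on the cosine similarity matrix: repeatedly pick
    the unmatched (input, token) pair with the highest similarity until
    every decoder input has a target position. O(T^3) as written."""
    T = len(decoder_inputs)
    sim = [[cosine(h, e) for e in target_embeddings] for h in decoder_inputs]
    free_inputs, free_tokens = set(range(T)), set(range(T))
    z = [None] * T
    for _ in range(T):
        i, j = max(((a, b) for a in free_inputs for b in free_tokens),
                   key=lambda p: sim[p[0]][p[1]])
        z[i] = j                       # decoder input i goes to position j
        free_inputs.remove(i)
        free_tokens.remove(j)
    return z

heuristic_search_positions([[0.0, 1.0], [1.0, 0.0]],
                           [[1.0, 0.0], [0.0, 1.0]])
# [1, 0]
```

Note this greedy matching is a heuristic; an optimal assignment could instead be found with the Hungarian algorithm at the same cubic cost.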
Objective Function
With the heuristically discovered positions as reference positions ẑ, the position predictor could be trained with a position loss:

L_p = -log p(ẑ | D, H)    (12)
Grounding on the reference positions, the generative process of target sequences is optimized by:

L_g = -Σ_{t=1}^{T} log p(y_t | ẑ, x)    (13)
Finally, combining the two loss functions mentioned above, a full-fledged loss is derived as:

L = L_g + α · L_p    (14)

where α is a hyperparameter balancing the two terms.
The length predictor is a classifier that follows the previous settings. We also follow previous practice (Gu et al., 2018; Wei et al., 2019) and perform an extra training process for the length predictor after the model is trained, without tuning the parameters of the encoder.
3.4 Inference
We follow the common choice of approximate decoding algorithms (Gu et al., 2018; Lee et al., 2018) to reduce the search space of the latent variable model.
Argmax Decoding
Following Gu et al. (2018), one simple and effective method is to select the best sequence by choosing the highest-probability latent sequence z:

ẑ = argmax_z p(z | D, H),    ŷ = argmax_y p(y | ẑ, x)

where identifying ŷ only requires independently maximizing the local probability for each output position.
Length Parallel Decoding
We also consider the common practice of noisy parallel decoding (Gu et al., 2018), which generates a number of decoding candidates in parallel and selects the best via rescoring with a pretrained autoregressive model. For PNAT, we first predict the target length as T̂, then generate an output sequence with argmax decoding for each target length candidate in [T̂ − ΔT, T̂ + ΔT] (ΔT = 4 in our experiments, giving 9 candidates), which we call length parallel decoding (LPD). We then use the pretrained autoregressive model to rank these sequences and identify the best overall output as the final output.
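LPD can be sketched as follows, where `decode_fn` and `score_fn` are hypothetical stand-ins for the PNAT decoder (argmax decoding at a fixed length) and the pretrained autoregressive rescorer:

```python
def length_parallel_decode(pred_len, delta, decode_fn, score_fn):
    """Length parallel decoding (LPD): decode one candidate per target
    length in [pred_len - delta, pred_len + delta] (independent, so
    parallelizable in practice), then keep the candidate the
    autoregressive rescorer ranks highest."""
    candidates = [decode_fn(L)
                  for L in range(pred_len - delta, pred_len + delta + 1)]
    return max(candidates, key=score_fn)

# Toy rescorer that prefers outputs of length 5.
best = length_parallel_decode(4, 4,
                              decode_fn=lambda L: ["tok"] * L,
                              score_fn=lambda c: -abs(len(c) - 5))
# len(best) == 5
```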
4 Experiments
We test PNAT on several benchmark sequence generation tasks. We first describe the experimental setting and implementation details, then present the main results, followed by in-depth analyses.
4.1 Experimental Setting
To show the generation ability of PNAT, we conduct experiments on the popular machine translation and paraphrase generation tasks. These sequence generation tasks evaluate models from different perspectives: translation tasks test the ability of semantic transformation across a bilingual corpus, while the paraphrase task focuses on rewriting within the same language while keeping the semantics.
Machine Translation
We validate the effectiveness of PNAT on the most widely used benchmarks for machine translation: WMT14 EN-DE (4.5M pairs) and IWSLT16 DE-EN (196k pairs). The datasets are processed with the Moses script (Koehn et al., 2007), and the words are segmented into subword units using byte-pair encoding (Sennrich et al., 2016, BPE). For both WMT translation directions, the source and target languages share the same set of subword embeddings, while for IWSLT we use separate embeddings.
Paraphrase Generation
We conduct experiments following previous work (Miao et al., 2019) on paraphrase generation. We use the established Quora dataset (https://www.kaggle.com/c/quora-question-pairs/data) to evaluate the paraphrase generation task. We consider supervised paraphrase generation and split the Quora dataset in the standard setting: we sample 100k sentence pairs as training data, and hold out 3k and 30k pairs for validation and testing, respectively.
4.2 Implementation Details
Module Setting
For machine translation, we follow the settings of Gu et al. (2018). For the IWSLT task, we use the small setting (d_model = 278, d_hidden = 507, p_dropout = 0.1, n_layer = 5 and n_head = 2) suggested by Gu et al. (2018) for the Transformer and NAT models. For the WMT task, we use the base setting of Vaswani et al. (2017) (d_model = 512, d_hidden = 512, p_dropout = 0.1, n_layer = 6).
For paraphrase generation, we follow the settings of Miao et al. (2019), and use a 2-layer GRU with 300-dimensional hidden states for the Seq-to-Seq (GRU) baseline. We empirically select Transformer and NAT models with hyperparameters (d_model = 400, d_hidden = 800, p_dropout = 0.1, n_layer = 3 and n_head = 4).
Optimization
We optimize the parameters with the Adam optimizer (Kingma and Ba, 2014). The hyperparameter α used in Eqn. 14 was set to 1.0 for WMT and 0.3 for IWSLT and Quora. We also use inverse square root learning rate scheduling (Vaswani et al., 2017) for WMT, and linear annealing of the learning rate (as suggested by Lee et al. (2018)) for IWSLT and Quora. Each mini-batch consists of approximately 2k tokens for IWSLT and Quora, and 32k tokens for WMT.
Knowledge Distillation
Sequence-level knowledge distillation is applied to alleviate the multi-modality problem during training, using the Transformer as a teacher (Hinton et al., 2015). Previous studies on non-autoregressive generation (Gu et al., 2018; Lee et al., 2018; Wei et al., 2019) have used translations produced by a pretrained Transformer model as the training data, which significantly improves the performance. We follow this setting in the translation tasks.
Model                                     WMT14 EN-DE  WMT14 DE-EN  IWSLT16 DE-EN  Speedup

Autoregressive Methods
Transformer-base (Vaswani et al., 2017)   27.30        /            /              /
*Transformer (beam=4)                     27.40        31.33        34.81          1.0x

Non-Autoregressive Methods
Flowseq (Ma et al., 2019)                 18.55        23.36        /              /
*NAT-base                                 /            11.02        /              /
*PNAT                                     19.73        24.04        /              /

NAT w/ Knowledge Distillation
NAT-FT (Gu et al., 2018)                  17.69        21.47        /              15.6x
LT (Kaiser et al., 2018)                  19.80        /            /              5.8x
IR-NAT (Lee et al., 2018)                 13.91        16.77        27.68          9.0x
ENAT (Guo et al., 2019)                   20.65        23.02        /              24.3x
NAT-REG (Wang et al., 2019)               20.65        24.77        /              -
imitate-NAT (Wei et al., 2019)            22.44        25.67        /              18.6x
Flowseq (Ma et al., 2019)                 21.45        26.16        /              1.1x
*NAT-base                                 /            16.69        /              13.5x
*PNAT                                     23.05        27.18        31.23          7.3x

NAT w/ Reranking or Iterative Refinements
NAT-FT (rescoring 10 candidates)          18.66        22.42        /              7.7x
LT (rescoring 10 candidates)              22.50        /            /              /
IR-NAT (refinement 10)                    21.61        25.48        32.31          1.3x
ENAT (rescoring 9 candidates)             24.28        26.10        /              12.4x
NAT-REG (rescoring 9 candidates)          24.61        28.90        /              -
imitate-NAT (rescoring 9 candidates)      24.15        27.28        /              9.7x
Flowseq (rescoring 30 candidates)         23.48        28.40        /              /
*PNAT (LPD n=9, ΔT=4)                     24.48        29.16        32.60          3.7x
4.3 Main Results
Machine Translation
We compare PNAT with strong NAT baselines, including the NAT with fertility (Gu et al., 2018, NAT-FT), the NAT with iterative refinement (Lee et al., 2018, IR-NAT), the NAT with regularization (Wang et al., 2019, NAT-REG), the NAT with enhanced decoder input (Guo et al., 2019, ENAT), the NAT learning from an autoregressive model (Wei et al., 2019, imitate-NAT), the NAT built on latent variables (Kaiser et al., 2018, LT), and the flow-based NAT model (Ma et al., 2019, Flowseq).
The results are shown in Table 1. We compare the proposed PNAT against the autoregressive counterpart both in terms of generation quality, which is measured with BLEU (Papineni et al., 2002), and inference speedup. For all our tasks, we obtain the performance of competitors either by directly using the figures reported in previous works when available, or by producing them with open-source implementations of the baseline algorithms on our datasets. For the sake of fairness, we chose the base setting for all competitors. Clearly, PNAT achieves comparable or better results than previous NAT models on both the WMT and IWSLT tasks.
We list the results of the NAT models trained without knowledge distillation in the second block of Table 1. PNAT achieves significant improvements (more than 13.0 BLEU points) over the naive baseline, which indicates that position learning greatly improves the capability of the NAT model. PNAT also achieves a better result than Flowseq by around 1.0 BLEU, which demonstrates the effectiveness of PNAT in modeling dependencies between the target outputs.
As shown in the third block of Table 1, without using reranking techniques, PNAT outperforms all the competitors by a large margin and achieves a balance between performance and efficiency. In particular, the previous state-of-the-art (WMT14 DE-EN) Flowseq achieves good performance at a slow speed (1.1x), while PNAT surpasses Flowseq in both respects.
Our best results are obtained with length parallel decoding, which employs an autoregressive model to rerank multiple generation candidates of different target lengths. Specifically, on the large-scale WMT14 DE-EN task, PNAT (+LPD) surpasses NAT-REG by 0.26 BLEU (29.16 v.s. 28.90). Without reranking, the gap increases to 2.4 BLEU (27.18 v.s. 24.77). The experiments show the power of explicit position modeling, which reduces the gap between non-autoregressive and autoregressive models.
Paraphrase Generation
Given a sentence, paraphrase generation aims to synthesize another sentence that differs from the given one but conveys the same meaning. Compared with the translation task, paraphrase generation prefers a more similar word order between the source and target sentences, which possibly makes the position model easier to learn; PNAT can potentially yield better results with the position model inferring the relatively ordered alignment relationship.
Model             Paraphrase BLEU (Valid)  Paraphrase BLEU (Test)

Seq-to-seq (GRU)  24.68                    24.75
Transformer       25.88                    25.46
NAT-base          19.80                    20.34
PNAT              29.30                    29.00
The results of paraphrase generation are shown in Table 2. Consistent with our intuition, PNAT achieves the best result on this task and even surpasses the Transformer by around 3.5 BLEU. The NAT-base model is not powerful enough to capture the latent position relationship. The comparison between NAT-base and PNAT shows that explicit position modeling in PNAT plays a crucial role in generating sentences.
Model                  permutation-acc (%)  relative-acc (%, r=4)  BLEU (WMT14 DE-EN)  Speedup

Transformer (beam=4)   /                    /                      30.68               1.0x
NAT-base               /                    /                      16.71               13.5x

PNAT w/ HSP            100.00               100.00                 46.03               12.5x
PNAT w/ AR-Predictor   25.30                59.27                  27.11               7.3x
PNAT w/ NAR-Predictor  23.11                55.57                  20.81               11.7x
4.4 Analysis
Effectiveness of Heuristic Searched Position
First, we analyze whether the positions derived from the heuristic search are suitable as supervision for the position predictor. We evaluate the effectiveness of the searched positions by training a PNAT as before and testing with the heuristically searched positions instead of the predicted ones. As shown in the second block of Table 3, PNAT w/ HSP achieves a significant improvement over NAT-base and even the Transformer, which demonstrates that the heuristic search for positions is effective.
Effectiveness and Efficiency of Position Modeling
We also analyze the accuracy of our position modeling and its influence on generation quality on the WMT14 DE-EN task. For evaluating position accuracy, we adopt the heuristically searched position as the position reference (denoted as "HSP"), which is the training target of the position predictor. PNAT requires position information in two places: first, the mutual relative relationships between the states used during decoding; second, the reordering of the decoded output after decoding. We therefore propose two corresponding metrics for evaluation: the relative position accuracy (with relation threshold r = 4) and the permutation accuracy.
As shown in Table 3, better position accuracy always yields better generation performance. The non-autoregressive position model is less effective than the autoregressive one, in both permutation accuracy and relative position accuracy. Even though the current PNAT with a simple AR-Predictor has surpassed previous NAT models, the position accuracy is still less than desirable (below 30%) and leaves great room for exploration. We provide a few examples in Appendix A. There is also a trade-off between effectiveness and efficiency: choosing the non-autoregressive predictor favors efficiency, while choosing the autoregressive one favors effectiveness.
Model     Paraphrase BLEU (w/ RR)  Paraphrase BLEU (w/o RR)  Gap

NAT-base  20.34                    19.45                     0.89
PNAT      29.00                    28.95                     0.05
Repeated Generation Analysis
Previous NAT models often suffer from the repeated generation problem due to the lack of sequential position information: NAT is less effective at distinguishing adjacent decoder hidden states, which are copied from adjacent source representations. To further study this problem, we evaluate the gains of simply removing the repeated tokens. As shown in Table 4, we perform the repeated generation analysis on the paraphrase generation task. Removing repeated tokens has little impact on the PNAT model, with only a 0.05 BLEU difference. However, for the NAT-base model, the gap is almost 1 BLEU (0.89). The results clearly demonstrate that the explicit position model essentially learns the sequential information for sequence generation.
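The repeat-removal post-processing evaluated in Table 4 amounts to collapsing consecutive duplicate tokens:

```python
def remove_repeats(tokens):
    """Collapse consecutive duplicate tokens (the 'w/ RR' condition in
    Table 4); non-adjacent repeats are kept."""
    out = []
    for tok in tokens:
        if not out or out[-1] != tok:
            out.append(tok)
    return out

remove_repeats("the the cat sat sat .".split())
# ['the', 'cat', 'sat', '.']
```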
Convergence Efficiency
We also perform a training efficiency analysis on the IWSLT16 DE-EN translation task. The learning curves are shown in Figure 2. The curve of PNAT is in the top-left corner. Remarkably, PNAT has the best convergence speed compared with the NAT competitors and even a strong autoregressive model. The results are in line with our intuition that position learning brings meaningful information about position relationships and benefits the generation of the target sentence.
5 Related Work
Gu et al. (2018) first developed a non-autoregressive Transformer for neural machine translation (NMT), which produces the outputs in parallel and thus significantly boosts inference speed. Due to the removal of the dependencies between the target outputs, this comes at the cost of largely sacrificed translation quality. A line of work has been proposed to mitigate such performance degradation. Some previous work focuses on enhancing the decoder inputs by replacing the target words as inputs, such as Guo et al. (2019) and Lee et al. (2018). Lee et al. (2018) proposed a method of iterative refinement based on a latent variable model and denoising autoencoders. Guo et al. (2019) enhance the decoder input by introducing the phrase table from statistical machine translation and embedding transformation. Another part of previous work focuses on improving the supervision of NAT's decoder states, including imitation learning from autoregressive models (Wei et al., 2019) and regularizing the decoder states with a backward reconstruction error (Wang et al., 2019). There is also a line of studies built upon latent variables: Kaiser et al. (2018) and Roy et al. (2018) utilize discrete latent variables to make decoding more parallelizable. Moreover, Shao et al. (2019) proposed a method to retrieve sequential information for NAT models. Unlike previous work, we explicitly model positions, which have shown their importance to the autoregressive model and can well model the dependence between states. To the best of our knowledge, PNAT is the first work to explicitly model position information for non-autoregressive text generation.
6 Conclusion
We proposed PNAT, a non-autoregressive Transformer with explicitly modeled positions, which bridges the performance gap between non-autoregressive and autoregressive decoding. Specifically, we model positions as latent variables and train with heuristically searched positions via a Monte Carlo method. As a result, PNAT leads to significant improvements and moves closer to closing the performance gap between NAT and AT on machine translation tasks. Besides, the experimental results on the paraphrase generation task show that PNAT can even exceed the autoregressive model, while still leaving large room for improvement. According to our further analysis of the effectiveness of position modeling, future work can still enhance the performance of NAT models by strengthening position learning.
Appendix A Case Study of Predicted Positions
We provide a few examples in Table 5. For each source sentence, we first analyze the generation quality of PNAT with a heuristically searched position; besides, we also show the translation with the predicted position. We have the following observations: First, the output generated by PNAT using the heuristically searched position always keeps high consistency with the reference, showing the effectiveness of the heuristically searched position. Second, better position accuracy always yields better generation performance (Cases 1 and 2 against Case 3). Third, as we can see in Case 4, even though the permutation accuracy is lower, PNAT still generates a good result; this is the reason why we chose relative self-attention instead of absolute self-attention.
Case 1
Source: bei dem deutschen Gesetz geht es um die Zuweisung bei der Geburt .
Reference: German law is about assigning it at birth .
Heuristic Searched Position (HSP): 3, 6, 1, 2, 10, 0, 5, 4, 7, 8, 9
PNAT w/ HSP: German law is about assigning them at birth .
Predicted Position: 3, 6, 1, 2, 10, 0, 5, 4, 7, 8, 9
PNAT w/ Predicted Position: German law is about assigning them at birth .

Case 2
Source: weiß er über das Telefon @@ Hacking Bescheid ?
Reference: does he know about phone hacking ?
Heuristic Searched Position (HSP): 2, 1, 3, 4, 8, 5, 6, 0, 7, 9
PNAT w/ HSP: does he know the telephone hacking ?
Predicted Position: 1, 0, 3, 4, 8, 5, 6, 2, 7, 9
PNAT w/ Predicted Position: he know about the telephone hacking ?

Case 3
Source: was CCAA bedeutet , möchte eine Besucherin wissen .
Reference: one visitor wants to know what CCAA means .
Heuristic Searched Position (HSP): 5, 6, 7, 8, 9, 3, 2, 0, 1, 11, 4, 10
PNAT w/ HSP: a visitor wants to know what CCAA means .
Predicted Position: 5, 0, 1, 2, 3, 7, 4, 8, 9, 11, 6, 10
PNAT w/ Predicted Position: CCAA means wants to know to a visitor .

Case 4
Source: eines von 20 Kindern in den Vereinigten Staaten hat inzwischen eine Lebensmittelallergie .
Reference: one in 20 children in the United States now have food allergies .
Heuristic Searched Position (HSP): 14, 1, 2, 3, 4, 5, 6, 7, 9, 8, 0, 10, 11, 12, 13
PNAT w/ HSP: one of 20 children in the United States now has food allergy .
Predicted Position: 14, 0, 1, 2, 3, 4, 5, 6, 8, 7, 9, 10, 11, 12, 13
PNAT w/ Predicted Position: of 20 children in the United States now has a food allergy .
References
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pp. 1724-1734.
Gehring, J., et al. (2017). Convolutional sequence to sequence learning. In ICML, pp. 1243-1252.
Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645-6649.
Gu, J., Bradbury, J., Xiong, C., Li, V. O. K., and Socher, R. (2018). Non-autoregressive neural machine translation. In ICLR.
Guo, J., et al. (2019). Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, Vol. 33, pp. 3723-3730.
Gupta, A., et al. (2018). A deep generative framework for paraphrase generation. In AAAI.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Kaiser, Ł., et al. (2018). Fast decoding in sequence models using discrete latent variables. In ICML, pp. 2395-2404.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Koehn, P., et al. (2007). Moses: Open source toolkit for statistical machine translation. In ACL, pp. 177-180.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097-1105.
Lee, J., Mansimov, E., and Cho, K. (2018). Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pp. 1173-1182.
Li, Z., et al. (2019). Hint-based training for non-autoregressive translation. In NeurIPS (to appear).
Liu, T., et al. (2018). Table-to-text generation by structure-aware seq2seq learning. In AAAI.
Ma, X., et al. (2019). FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pp. 4273-4283.
Miao, N., et al. (2019). CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In AAAI.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311-318.
Roy, A., et al. (2018). Towards a better understanding of vector quantized autoencoders. arXiv preprint.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL, pp. 1715-1725.
Shao, C., et al. (2019). Retrieving sequential information for non-autoregressive neural machine translation. In ACL.
Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. In NAACL-HLT, pp. 464-468.
Shen, X., et al. (2017). A conditional variational framework for dialog generation. In ACL (Volume 2: Short Papers), pp. 504-509.
Shen, Y., et al. (2018). Straight to the tree: Constituency parsing with neural syntactic distance. In ACL, pp. 1171-1180.
Sundermeyer, M., Schlüter, R., and Ney, H. (2012). LSTM neural networks for language modeling. In INTERSPEECH.
Vaswani, A., et al. (2017). Attention is all you need. In NIPS, pp. 5998-6008.
Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer networks. In NIPS, pp. 2692-2700.
Wang, Y., et al. (2019). Non-autoregressive machine translation with auxiliary regularization. In AAAI.
Wei, B., et al. (2019). Imitation learning for non-autoregressive neural machine translation. In ACL.