1 Introduction
Neural autoregressive models have become the de facto standard in a wide range of sequence generation tasks, such as machine translation Bahdanau et al. (2014), summarization Rush et al. (2015) and dialogue systems Vinyals and Le (2015). In these studies, a sequence is modeled autoregressively with the left-to-right generation order, which raises the question of whether generation in an arbitrary order is worth considering Vinyals et al. (2015a); Ford et al. (2018). Nevertheless, previous studies on generation orders mostly resort to a fixed set of generation orders, showing particular choices of ordering are helpful Wu et al. (2018); Ford et al. (2018); Mehri and Sigal (2018), without providing an efficient algorithm for finding adaptive generation orders, or restrict the problem scope to $n$-gram segment generation Vinyals et al. (2015a).
In this paper, we propose a novel decoding algorithm, Insertion-based Decoding with Inferred Generation Order (InDIGO), which models generation orders as latent variables and automatically infers the generation orders by simultaneously predicting a word and its position to be inserted at each decoding step. Given that absolute positions are unknown before generating the whole sequence, we use a relative-position-based representation to capture generation orders. We show that decoding consists of a series of insertion operations with a demonstration shown in Fig. 1.
We extend Transformer Vaswani et al. (2017) for supporting insertion operations, where the generation order is directly captured as relative positions through self-attention inspired by Shaw et al. (2018). For learning, we maximize the evidence lower bound (ELBO) of the maximum likelihood objective, and study two approximate posterior distributions of generation orders based on a predefined generation order and adaptive orders obtained from beam search, respectively.
Experimental results on word order recovery, machine translation, code generation and image captioning demonstrate that our algorithm can generate sequences with arbitrary orders, while achieving competitive or even better performance compared to the conventional left-to-right generation. Case studies show that the proposed method adopts adaptive orders based on input information.
2 Neural Autoregressive Decoding
Let us consider the problem of generating a sequence $y = (y_1, \dots, y_T)$ conditioned on some inputs, e.g., a source sequence $x = (x_1, \dots, x_{T'})$. Our goal is to build a model parameterized by $\theta$ that models the conditional probability of $y$ given $x$, which is factorized as:
$$p_\theta(y \mid x) = \prod_{t=0}^{T} p_\theta(y_{t+1} \mid y_{0:t}, x_{1:T'}) \tag{1}$$
where $y_0$ and $y_{T+1}$ are special tokens $\langle s \rangle$ and $\langle /s \rangle$, respectively. The model sequentially predicts the conditional probability of the next token at each step $t$, which can be implemented by any function approximator such as RNNs Bahdanau et al. (2014) and Transformer Vaswani et al. (2017).
Learning
A neural autoregressive model is commonly learned by maximizing the conditional likelihood of the target sequence given its source over a set of parallel examples.
Decoding
A common way to decode a sequence from a trained model is to make use of the autoregressive nature that allows us to predict one word at each step. Given any source sequence $x$, we essentially follow the order of factorization to generate tokens sequentially using some heuristic-based algorithms such as greedy decoding and beam search.
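As a rough illustration of left-to-right greedy decoding under the factorization in Eq. (1), the sketch below picks the most probable next token at every step. The function `toy_next_token_probs` is a hypothetical stand-in for a trained model (it is not part of the paper); it simply copies the source and then emits the end token.

```python
import math

BOS, EOS = "<s>", "</s>"

def toy_next_token_probs(prefix, source):
    """Hypothetical stand-in for p(y_{t+1} | y_{0:t}, x): copy the source, then stop."""
    n_generated = len(prefix) - 1  # prefix[0] is <s>
    if n_generated < len(source):
        return {source[n_generated]: 0.9, EOS: 0.1}
    return {EOS: 1.0}

def greedy_decode(source, next_token_probs, max_len=50):
    """Append the arg-max token at each step until </s> is produced."""
    prefix, log_prob = [BOS], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix, source)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])  # accumulates the log of Eq. (1)
        prefix.append(token)
        if token == EOS:
            break
    return prefix, log_prob

tokens, score = greedy_decode(["I", "have", "a", "dream"], toy_next_token_probs)
```

Beam search generalizes this loop by keeping several prefixes per step instead of a single one.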
3 Insertion-based Decoding with Inferred Generation Order (InDIGO)
Eq. 1 explicitly assumes a left-to-right (L2R) generation order of the sequence $y$. In principle, we can factorize the sequence probability in any permutation and train a model for each permutation separately. As long as we have an infinite amount of data with proper optimization performed, all these models are equivalent. Nevertheless, Vinyals et al. (2015a) have shown that the generation order of a sequence actually matters in many real-world tasks, e.g. language modeling.
Although the L2R order is a strong inductive bias, as it is "natural" for most human beings to read and write sequences from left to right, L2R is not necessarily the optimal option for generating sequences. For instance, people sometimes tend to think of central phrases first before building up a whole sentence; for programming languages, it is beneficial to generate code based on abstract syntax trees Yin and Neubig (2017).
Therefore, a natural question arises: how can we decode a sequence in its best order?
3.1 Orders as Latent Variables
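We address this question by modeling generation orders as latent variables. Similar to Vinyals et al. (2015a), we rewrite the target sequence $y$ in a particular order $\pi = (z_2, \dots, z_T, z_{T+1}) \in \mathcal{P}_T$ (where $\mathcal{P}_T$ is the set of all the permutations of $(1, \dots, T)$) as a set $y_\pi = \{(y_2, z_2), \dots, (y_{T+1}, z_{T+1})\}$, where $(y_t, z_t)$ represents the $t$-th generated token and its absolute position, respectively. Different from the common notation, the target sequence is 2-step drifted because the two special tokens $(y_0, z_0) = (\langle s \rangle, 0)$ and $(y_1, z_1) = (\langle /s \rangle, T+1)$ are always prepended to represent the left and right boundaries, respectively. Then, we model the conditional probability as the joint distribution of words and positions by marginalizing all the orders:
$$p_\theta(y \mid x) = \sum_{\pi \in \mathcal{P}_T} p_\theta(y_\pi \mid x)$$
where for each element:
$$p_\theta(y_\pi \mid x) = p_\theta(y_{T+2} \mid y_{0:T+1}, z_{0:T+1}, x_{1:T'}) \cdot \prod_{t=1}^{T} p_\theta(y_{t+1}, z_{t+1} \mid y_{0:t}, z_{0:t}, x_{1:T'}) \tag{2}$$
where the third special token $y_{T+2} = \langle \text{eod} \rangle$ is introduced to signal the end-of-decoding, and $p_\theta(y_{T+2} \mid y_{0:T+1}, z_{0:T+1}, x_{1:T'})$ is the end-of-decoding probability.
At decoding time, the factorization allows us to decode autoregressively by predicting a word and its position step by step. The generation order is automatically inferred during decoding.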
3.2 Relative Representation of Positions
It is difficult and inefficient to predict the absolute positions $z_t$ without knowing the actual length $T$. One solution is directly using the absolute positions $z_0^t, \dots, z_t^t$ of the partial sequence $y_{0:t}$ at each autoregressive step $t$. For example, the absolute positions for the sequence ($\langle s \rangle$, $\langle /s \rangle$, dream, I) are $(z_0^t = 0, z_1^t = 3, z_2^t = 2, z_3^t = 1)$ in Fig. 1 at step $t = 3$. It is however inefficient to model such explicit positions using a single neural network without recomputing the hidden states for the entire partial sequence, as some positions are changed at every step (as shown in Fig. 1).
Relative Positions
We propose using relative-position representations $r_{0:t}^t$ instead of absolute positions $z_{0:t}^t$. We use a ternary vector $r_i^t \in \{-1, 0, 1\}^{t+1}$ as the relative-position representation for $z_i^t$. The $j$-th element of $r_i^t$ is defined as:
$$r_{i,j}^t = \begin{cases} -1 & z_j^t > z_i^t \ \text{(left)} \\ 0 & z_j^t = z_i^t \ \text{(middle)} \\ 1 & z_j^t < z_i^t \ \text{(right)} \end{cases} \tag{3}$$
where the elements of $r_i^t$ show the relative positions with respect to all the other words in the partial sequence at step $t$. We use a matrix $R^t = [r_0^t, r_1^t, \dots, r_t^t]$ to show the relative-position representations of all the words in the sequence. The relative-position representation can always be mapped back to the absolute position by:
$$z_i^t = \sum_{j=0}^{t} \max(0, r_{i,j}^t) \tag{4}$$
One of the biggest advantages of using such vector-based representations is that at each step, updating the relative-position representations is simply extending the relative-position matrix $R^t$ with the next predicted relative position, because the (left, middle, right) relations described in Eq. (3) stay unchanged once they are created. Thus, we update $R^t$ as follows:
$$R^{t+1} = \begin{bmatrix} & & & r^{t+1}_{t+1,0} \\ & R^t & & \vdots \\ & & & r^{t+1}_{t+1,t} \\ -r^{t+1}_{t+1,0} & \cdots & -r^{t+1}_{t+1,t} & 0 \end{bmatrix} \tag{5}$$
where we use $r_{t+1}^{t+1}$ to represent the relative position at step $t+1$. This append-only property enables our method to reuse the previous hidden states without recomputing the hidden states at each step. For simplicity, the superscript of $r$ is omitted from now on without causing conflicts.
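The relations in Eqs. (3)-(5) can be checked with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the helper names are illustrative, and the matrix is stored so that `R[i][j]` holds $r_{i,j}$ (one row per word).

```python
def relative_matrix(abs_pos):
    """Eq. (3): r_ij = -1 if z_j > z_i, 0 if z_j == z_i, +1 if z_j < z_i."""
    def rel(zi, zj):
        return -1 if zj > zi else (1 if zj < zi else 0)
    return [[rel(zi, zj) for zj in abs_pos] for zi in abs_pos]

def absolute_positions(R):
    """Eq. (4): z_i is the number of words to the left of word i."""
    return [sum(max(0, r) for r in row) for row in R]

def append_position(R, r_new):
    """Eq. (5): r_new[j] = r_{t+1,j} for j = 0..t; existing rows gain -r_new[j]."""
    grown = [row + [-r_new[j]] for j, row in enumerate(R)]
    grown.append(list(r_new) + [0])
    return grown

# (<s>, </s>, dream, I) with absolute positions (0, 3, 2, 1), as in the example above.
R3 = relative_matrix([0, 3, 2, 1])
# Same matrix obtained by appending "I" (left of "dream") to (<s>, </s>, dream).
R3_appended = append_position(relative_matrix([0, 2, 1]), [1, -1, -1])
```

Because `append_position` only adds one row and one column, entries computed at earlier steps are never modified, which is the append-only property the text relies on.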
3.3 Insertion-based Decoding
Given a partial sequence $y_{0:t}$ and its corresponding relative-position representations $r_{0:t}$, not all of the $3^{t+2}$ possible vectors are valid for the next relative-position representation, $r_{t+1}$. Only those vectors corresponding to insertion operations satisfy Eq. (4). In Algorithm 1, we describe an insertion-based decoding framework based on this observation. The next word $y_{t+1}$ is predicted based on $y_{0:t}$ and $r_{0:t}$. We then choose an existing word $y_k$ ($0 \le k \le t$) from $y_{0:t}$ and insert $y_{t+1}$ to its left or right. As a result, the next position $r_{t+1}$ is determined by
$$r_{t+1,j} = \begin{cases} s & j = k \\ r_{k,j} & j \neq k \end{cases} \qquad \forall j \in [0, t] \tag{6}$$
where $s = -1$ if $y_{t+1}$ is on the left of $y_k$, and $s = 1$ otherwise. Finally, we use $r_{t+1}$ to update the relative-position matrix $R$ as shown in Eq. (5).
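A single insertion step can be sketched as below, assuming the word, the pointed-to index `k` and the side `s` have already been predicted. The function names and the particular insertion order in the demo are illustrative choices, not taken from the paper; `R[i][j]` again stores $r_{i,j}$.

```python
def insertion_vector(R, k, s):
    """Eq. (6): copy r_k and overwrite entry k with s (-1: left of y_k, +1: right)."""
    r_new = list(R[k])
    r_new[k] = s
    return r_new

def insert_word(words, R, word, k, s):
    """Append `word` to the generated list and grow R as in Eq. (5)."""
    r_new = insertion_vector(R, k, s)
    grown = [row + [-r_new[j]] for j, row in enumerate(R)]
    grown.append(r_new + [0])
    return words + [word], grown

def linearize(words, R):
    """Order words by the absolute positions recovered with Eq. (4)."""
    z = [sum(max(0, r) for r in row) for row in R]
    return [w for _, w in sorted(zip(z, words))]

words, R = ["<s>", "</s>"], [[0, -1], [1, 0]]
words, R = insert_word(words, R, "dream", k=1, s=-1)  # left of </s>
words, R = insert_word(words, R, "I", k=2, s=-1)      # left of "dream"
words, R = insert_word(words, R, "have", k=3, s=1)    # right of "I"
words, R = insert_word(words, R, "a", k=2, s=-1)      # left of "dream"
sentence = linearize(words, R)
```

The list `words` keeps tokens in generation order, while `linearize` recovers the surface order only when it is needed.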
4 Model
We present Transformer-InDIGO, an extension of Transformer Vaswani et al. (2017), supporting insertion-based decoding. To the best of our knowledge, Transformer-InDIGO is the first probabilistic model that takes generation orders for autoregressive decoding into account. The overall framework is shown in Fig. 2.
4.1 Network Design
We extend the decoder of Transformer with relative-position-based self-attention, joint word & position prediction and position updating modules.
Self-Attention
One of the major challenges that prevents the vanilla Transformer from generating sequences following arbitrary orders is that the absolute-position-based positional encodings are inefficient as mentioned in Section 3.2, in that absolute positions are changed during decoding, invalidating the previous hidden states. In contrast, we adapt Shaw et al. (2018) to use relative positions in self-attention. Different from Shaw et al. (2018), in which a clipping distance $k$ (usually $k \ge 2$) is set for relative positions, our relative-position representations only preserve $k = 1$ relations (Eq. (3)).
Each attention head in a multi-head self-attention module of Transformer-InDIGO takes the hidden states of a partial sequence $y_{0:t}$, denoted as $U = (u_0, \dots, u_t)$, and its corresponding relative position matrix $R^t$ as input, where each input state $u_i \in \mathbb{R}^{d_\text{model}}$. The logit $e_{i,j}$ for attention is computed as:
$$e_{i,j} = \frac{(u_i^\top Q) \cdot (u_j^\top K + A[r_{i,j}+1])^\top}{\sqrt{d_\text{model}}} \tag{7}$$
where $Q, K \in \mathbb{R}^{d_\text{model} \times d_\text{model}}$ and $A \in \mathbb{R}^{3 \times d_\text{model}}$ are parameter matrices. $A[r_{i,j}+1]$ is the row vector indexed by $r_{i,j}+1$, which biases all the input keys based on the relative position, $r_{i,j}$.
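To make the relative-position bias concrete, the following sketch computes attention logits in plain Python under the assumption that each key is shifted by one of three learned bias rows selected by $r_{i,j} + 1$. The tiny matrices are made-up values for illustration; a real model would use a tensor library and multiple heads.

```python
import math

def vec_mat(u, M):
    """Row vector u (length d) times matrix M (d x d), i.e. u^T M."""
    return [sum(u[a] * M[a][b] for a in range(len(u))) for b in range(len(M[0]))]

def attention_logits(U, R, Q, K, A):
    """e[i][j] = (u_i^T Q) . (u_j^T K + A[r_ij + 1]) / sqrt(d_model)."""
    d_model = len(U[0])
    queries = [vec_mat(u, Q) for u in U]
    keys = [vec_mat(u, K) for u in U]
    logits = []
    for i, q in enumerate(queries):
        row = []
        for j, key in enumerate(keys):
            bias = A[R[i][j] + 1]  # row 0: left, row 1: middle, row 2: right
            dot = sum(qa * (ka + ba) for qa, ka, ba in zip(q, key, bias))
            row.append(dot / math.sqrt(d_model))
        logits.append(row)
    return logits

identity = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]             # hidden states of <s> and </s>
R = [[0, -1], [1, 0]]                    # relative positions of (<s>, </s>)
A = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]  # illustrative bias rows
E = attention_logits(U, R, identity, identity, A)
```

Since the bias depends only on $r_{i,j}$, logits between already generated words are unaffected when a new word is inserted.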
Word & Position Prediction
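Like the vanilla Transformer, we take the representations from the last layer of self-attention, $H = (h_0, \dots, h_t)$ and $h_t \in \mathbb{R}^{d_\text{model}}$, to predict both the next word $y_{t+1}$ and its position vector $r_{t+1}$ in two stages based on the following factorization:
$$p(y_{t+1}, r_{t+1} \mid H) = p(y_{t+1} \mid H) \cdot p(r_{t+1} \mid y_{t+1}, H)$$
The prediction modules for word & position prediction are shown in Fig. 2(a).
First, we predict the next word $y_{t+1}$ from the categorical distribution $p_\text{word}(y \mid H)$ as:
$$p_\text{word}(y \mid H) = \text{softmax}\left((h_t^\top F) \cdot W^\top\right) \tag{8}$$
where $W \in \mathbb{R}^{d_V \times d_\text{model}}$ is the embedding matrix and $d_V$ is the size of vocabulary. We linearly project the last representation $h_t$ using $F \in \mathbb{R}^{d_\text{model} \times d_\text{model}}$ for querying $W$.
Then, as shown in Eq. (6), the prediction of the next position is done by performing insertion operations to existing words, which can be modeled similarly to Pointer Networks Vinyals et al. (2015b). We predict a pointer $k_{t+1} \in [0, 2t+1]$ based on:
$$p_\text{pointer}(k \mid y_{t+1}, H) = \text{softmax}\left((h_t^\top E + W[y_{t+1}]) \cdot \begin{bmatrix} H^\top C \\ H^\top D \end{bmatrix}^\top\right) \tag{9}$$
where $C, D, E \in \mathbb{R}^{d_\text{model} \times d_\text{model}}$ are parameter matrices and $W[y_{t+1}]$ is the embedding of the predicted word. $C, D$ are used to obtain the left and right keys, respectively, considering that each word has two "keys" (its left and right) for inserting the generated word. The query vector is obtained by adding up the word embedding $W[y_{t+1}]$ and the linearly projected state, $h_t^\top E$. The resulting relative-position vector, $r_{t+1}$, is computed using $k_{t+1}$ according to Eq. (6). We manually set $p_\text{pointer}(0 \mid \cdot) = p_\text{pointer}(2+t \mid \cdot) = 0$ to avoid any word from being inserted to the left of $\langle s \rangle$ and the right of $\langle /s \rangle$.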
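The pointer step can be sketched as a masked softmax over two keys per generated word. The sketch assumes the query-key scores are already computed and that the keys are ordered as all left keys followed by all right keys, with $\langle s \rangle$ at index 0 and $\langle /s \rangle$ at index 1; the function names and the score values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_distribution(scores_left, scores_right):
    """Softmax over 2(t+1) keys: left keys of y_0..y_t, then right keys of y_0..y_t.

    Index 0 (left of <s>) and index t+2 (right of </s>) are masked out.
    """
    t = len(scores_left) - 1
    scores = list(scores_left) + list(scores_right)
    scores[0] = float("-inf")
    scores[t + 2] = float("-inf")
    return softmax(scores)

def decode_pointer(pointer, t):
    """Map a pointer index to (k, s): s = -1 inserts left of y_k, s = +1 inserts right."""
    return (pointer, -1) if pointer <= t else (pointer - (t + 1), 1)

# Partial sequence (<s>, </s>, dream), so t = 2 and there are six keys.
probs = pointer_distribution([0.0, 1.0, 2.0], [0.5, 0.0, 0.3])
best = max(range(len(probs)), key=probs.__getitem__)
k, s = decode_pointer(best, t=2)
```

The pair `(k, s)` is exactly what Eq. (6) needs to build the new relative-position vector.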
Table 1: Predefined orders studied in this work.
| Predefined Order | Description |
|---|---|
| Left-to-right (L2R) | Generate words from left to right. Wu et al. (2018) |
| Right-to-left (R2L) | Generate words from right to left. Wu et al. (2018) |
| Odd-Even (ODD) | Generate words at odd positions from left to right, then generate even positions. Ford et al. (2018) |
| Balanced-tree (BLT) | Generate words with a top-down left-to-right order from a balanced binary tree. Stern et al. (2019) |
| Syntax-tree (SYN) | Generate words with a top-down left-to-right order from the dependency tree. Wang et al. (2018b) |
| Common-First (CF) | Generate all common words first from left to right, and then generate the others. Ford et al. (2018) |
| Rare-First (RF) | Generate all rare words first from left to right, and then generate the remaining. Ford et al. (2018) |
| Random (RND) | Generate words in a random order shuffled every time the example was loaded. |
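Several of the orders in Table 1 depend only on the sentence length and can be written as permutations of 0-based word indices. The sketch below is one possible reading: in particular, the balanced-tree order is interpreted here as a breadth-first (level-by-level) traversal, which is an assumption rather than something the table specifies.

```python
def l2r(n):
    return list(range(n))

def r2l(n):
    return list(range(n - 1, -1, -1))

def odd_even(n):
    """Odd positions (1st, 3rd, ...) first, then even positions."""
    return list(range(0, n, 2)) + list(range(1, n, 2))

def balanced_tree(n):
    """Midpoint first, then the midpoints of both halves, level by level."""
    order, queue = [], [(0, n)]
    while queue:
        lo, hi = queue.pop(0)
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        queue += [(lo, mid), (mid + 1, hi)]
    return order

def reorder(words, order):
    """Words in the sequence in which they would be generated."""
    return [words[i] for i in order]

generated = reorder(["I", "have", "a", "dream", "today"], odd_even(5))
```

The SYN, CF and RF orders additionally require a dependency parse or word-frequency statistics, so they are not covered by this sketch.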
Position Updating
As mentioned in Sec. 3.2, we update the relative-position representation $R^t$ with the predicted $r_{t+1}$ (Eq. (5)). Because updating the relative positions will not change the precomputed relative-position representations, Transformer-InDIGO can reuse the previous hidden states in the next decoding step, the same as the vanilla Transformer.
4.2 Learning
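Training requires maximizing the marginalized likelihood in Eq. (2). Yet this is intractable since we need to enumerate all of the $T!$ permutations of tokens. Instead, we maximize the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution of generation orders $q(\pi \mid x, y)$, which provides the probabilities of latent generation orders based on the ground-truth sequences $x$ and $y$:
$$\mathcal{L}_\text{ELBO} = \mathbb{E}_{\pi \sim q} \log p_\theta(y_\pi \mid x) + \mathcal{H}(q) = \mathbb{E}_{r_{2:T+1} \sim q} \Big( \sum_{t=1}^{T+1} \underbrace{\log p_\theta(y_{t+1} \mid y_{0:t}, r_{0:t}, x_{1:T'})}_{\text{word prediction loss}} + \sum_{t=1}^{T} \underbrace{\log p_\theta(r_{t+1} \mid y_{0:t+1}, r_{0:t}, x_{1:T'})}_{\text{position prediction loss}} \Big) + \mathcal{H}(q) \tag{10}$$
where $\pi = r_{2:T+1}$, sampled from $q(\pi \mid x, y)$, is represented as relative positions. $\mathcal{H}(q)$ is the entropy term which can be ignored if $q$ is fixed. Eq. (10) shows that given a sampled order, the learning objective is divided into word & position objectives. For calculating the position prediction loss, we aggregate the two probabilities corresponding to the same position by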
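$$p_\theta(r_{t+1} \mid \cdot) = p_\text{pointer}(k^l \mid \cdot) + p_\text{pointer}(k^r \mid \cdot) \tag{11}$$
where $p_\text{pointer}(k^l \mid \cdot)$ and $p_\text{pointer}(k^r \mid \cdot)$ are calculated simultaneously from the same softmax function in Eq. (9). $k^l$ and $k^r$ represent the keys corresponding to the same relative position. Here, we study two types of $q(\pi \mid x, y)$: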
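The aggregation of the two pointer probabilities can be illustrated as follows: a slot between two adjacent words is reachable either as "left of the right neighbour" or as "right of the left neighbour". The sketch assumes the pointer layout of all left keys followed by all right keys; the function name and the probability values are hypothetical.

```python
def position_probability(pointer_probs, abs_pos, new_pos):
    """Sum the two pointer entries that insert at absolute position `new_pos`.

    abs_pos[i] is the absolute position of word i before the insertion;
    new_pos ranges over 1..t, i.e. strictly between <s> and </s>.
    """
    t = len(abs_pos) - 1
    right_neighbour = abs_pos.index(new_pos)     # word that will be pushed right
    left_neighbour = abs_pos.index(new_pos - 1)  # word that stays on the left
    k_left = right_neighbour                     # left key of the right neighbour
    k_right = (t + 1) + left_neighbour           # right key of the left neighbour
    return pointer_probs[k_left] + pointer_probs[k_right]

# Partial sequence (<s>, </s>, dream) with absolute positions (0, 2, 1).
pointer_probs = [0.0, 0.1, 0.5, 0.2, 0.0, 0.2]
# Inserting between <s> and "dream": "left of dream" (index 2) or "right of <s>" (index 3).
p_slot = position_probability(pointer_probs, [0, 2, 1], new_pos=1)
```

Training on the summed probability leaves the model free to choose which of the two equivalent keys it prefers.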
Predefined Order
If we already possess some prior knowledge about the sequence, e.g., the L2R order is proven to be a strong baseline in many scenarios, we assume a Dirac-delta distribution $q(\pi \mid x, y) = \delta(\pi = \pi^*(x, y))$, where $\pi^*(x, y)$ is a predefined order. In this work, we study a set of predefined orders, which can be found in Table 1, for evaluating their effect on generation.
Searched Adaptive Order (SAO)
We choose the approximate posterior $q$ as the point estimation that maximizes $\log p_\theta(y_\pi \mid x)$. In practice, we approximate these generation orders $\pi$ through beam search (Pal et al., 2006). Unlike the original beam search for autoregressive decoding that searches in the sequence space to find the sequence maximizing the probability shown in Eq. 1, we search in the space of all the permutations of the target sequence to find $\pi$ maximising Eq. 2, as all the target tokens are known in advance during training.
More specifically, at each step $t$, for every sub-sequence $y_{0:t}$, we evaluate the probabilities of every possible choice from the remaining words $y' \in \{y \mid y \notin y_{0:t}\}$ and its corresponding position $r'$. We calculate the cumulative likelihood for each $(y', r')$, based on which we select the top-$B$ sub-sequences as the new set $\mathcal{B}$ for the next step. After obtaining the $B$ generation orders, we optimize our objective as an average over these orders:
$$\mathcal{L}_\text{SAO} = \frac{1}{B} \sum_{\pi \in \mathcal{B}} \log p_\theta(y_\pi \mid x) \tag{12}$$
where we assume $q(\pi \mid x, y) = 1/B$ if $\pi \in \mathcal{B}$, and $0$ otherwise.
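The search over generation orders can be sketched as a beam search over permutations of target indices, since during training every remaining target word has a known position. The scoring function `toy_step_log_prob` below is a hypothetical stand-in for the model's joint word and position log-probability; it is a simple Plackett-Luce-style scorer chosen only so that the example is reproducible.

```python
import math

def search_orders(n_tokens, step_log_prob, beam_size):
    """Beam search over permutations of the target indices 0..n_tokens-1."""
    beam = [((), 0.0)]
    for _ in range(n_tokens):
        candidates = []
        for order, score in beam:
            for idx in range(n_tokens):
                if idx not in order:  # only words that are not generated yet
                    candidates.append((order + (idx,), score + step_log_prob(order, idx)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam

EASE = [0.2, 0.9, 0.5]  # hypothetical per-token preference for being generated early

def toy_step_log_prob(order, idx):
    remaining = [i for i in range(len(EASE)) if i not in order]
    return math.log(EASE[idx] / sum(EASE[i] for i in remaining))

beam = search_orders(len(EASE), toy_step_log_prob, beam_size=2)
sao_objective = sum(score for _, score in beam) / len(beam)  # average as in Eq. (12)
```

In an actual training loop the scores would come from the current model, so the searched orders change as the parameters are updated.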
Beam Search with Dropout
The goal of beam search is to approximately find the most likely generation orders, which limits learning from exploring other generation orders that may not be favourable currently but may ultimately be deemed better. Prior research Vijayakumar et al. (2016) also pointed out that the search space of the standard beam search is restricted. We encourage exploration by injecting noise during beam search Cho (2016). Particularly, we found it effective to keep the dropout on (e.g. dropout ).
Bootstrapping from a Predefined Order
During preliminary experiments, sequences returned by beam search often degenerated by always predicting common or functional words (e.g. "the", ",", etc.) as the first several tokens, leading to inferior performance. We conjecture that this is due to the fact that the position prediction module learns much faster than the word prediction module, and it quickly captures spurious correlations induced by a poorly initialized model. It is essential to balance the learning progress of these modules. To do so, we bootstrap learning by pretraining the model with a predefined order (e.g. L2R), before training with beam-searched orders.
4.3 Decoding
As for decoding, we directly follow Algorithm 1 to sample or decode greedily from the proposed model. However, in practice beam search is important to explore the output space for neural autoregressive models. In our implementation, we perform beam search for InDIGO as a two-step search. Suppose the beam size is $B$; at each step, we do beam search for word prediction and then, with the searched words, try out all possible positions and select the top-$B$ sub-sequences. In preliminary experiments, we also tried doing beam search for words and positions simultaneously with their joint probability. However, it did not seem helpful.
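One expansion step of such a two-step search might look like the sketch below. For readability a hypothesis is stored as the already linearized token list and a position is a slot index in that list, which is a simplification of the relative-position bookkeeping; `toy_word_log_probs` and `toy_position_log_probs` are hypothetical stand-ins for the model's two prediction heads.

```python
import math

def two_step_expand(beam, word_log_probs, position_log_probs, beam_size):
    """Pick the top words per hypothesis, try every slot for them, keep the best."""
    candidates = []
    for hyp, score in beam:
        words = word_log_probs(hyp)  # dict: word -> log-probability
        top_words = sorted(words, key=words.get, reverse=True)[:beam_size]
        for w in top_words:
            for slot, lp in position_log_probs(hyp, w).items():
                new_hyp = hyp[:slot] + [w] + hyp[slot:]
                candidates.append((new_hyp, score + words[w] + lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

def toy_word_log_probs(hyp):
    return {"dream": math.log(0.6), "I": math.log(0.3), "a": math.log(0.1)}

def toy_position_log_probs(hyp, word):
    slots = range(1, len(hyp))  # strictly between <s> and </s>
    return {slot: -math.log(len(slots)) for slot in slots}

beam = [(["<s>", "</s>"], 0.0)]
beam = two_step_expand(beam, toy_word_log_probs, toy_position_log_probs, beam_size=2)
```

Restricting the position search to the top words keeps the number of candidates per step proportional to the beam size times the current length.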
5 Experiments
We evaluate InDIGO extensively on four challenging sequence generation tasks: word order recovery, machine translation, natural language to code generation (NL2Code, Ling et al., 2016) and image captioning. We compare our model trained with the predefined orders (the L2R order by default) and the adaptive orders obtained by beam search.
5.1 Experimental Settings
Dataset
The machine translation experiments are conducted on three language pairs for studying how the decoding order influences the translation quality of languages with diversified characteristics: WMT'16 Romanian-English (Ro-En; http://www.statmt.org/wmt16/translation-task.html), WMT'18 English-Turkish (En-Tr; http://www.statmt.org/wmt18/translation-task.html) and KFTT English-Japanese (En-Ja, Neubig, 2011; http://www.phontron.com/kftt/). The English part of the Ro-En dataset is used for the word order recovery task. We use the Django dataset Oda et al. (2015) (https://github.com/odashi/ase15-django-dataset) and MS COCO Lin et al. (2014) with the standard split Karpathy and Fei-Fei (2015) for the NL2Code task and image captioning, respectively. The dataset statistics can be found in Table 2.
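Table 2: Dataset statistics.
| Dataset | Train | Dev | Test | Length |
|---|---|---|---|---|
| WMT16 Ro-En | 620k | 2000 | 2000 | 26.48 |
| WMT18 En-Tr | 207k | 3007 | 3000 | 25.81 |
| KFTT En-Ja | 405k | 1166 | 1160 | 27.51 |
| Django | 16k | 1000 | 1801 | 8.87 |
| MS-COCO | 567k | 5000 | 5000 | 12.52 |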
Preprocessing
We apply the Moses tokenization (https://github.com/moses-smt/mosesdecoder) and normalization on all the text datasets except for codes. We perform joint BPE Sennrich et al. (2015) operations for the MT datasets, while using all the unique words as the vocabulary for NL2Code. For image captioning, we follow the same procedure as described by Lee et al. (2018), where we use dimensional image feature vectors (extracted from a pretrained ResNet-18 He et al. (2016)) as the input to the Transformer encoder. The image features are fixed during training.
| Model | WMT16 Ro→En | | | | WMT18 En→Tr | | | | KFTT En→Ja | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | BLEU | Ribes | Meteor | TER | BLEU | Ribes | Meteor | TER | BLEU | Ribes | Meteor | TER |
| RND | 20.20 | 79.35 | 41.00 | 63.20 | 03.04 | 55.45 | 19.12 | 90.60 | 17.09 | 70.89 | 35.24 | 70.11 |
| L2R | 31.82 | 83.37 | 52.19 | 50.62 | 14.85 | 69.20 | 33.90 | 71.56 | 30.87 | 77.72 | 48.57 | 59.92 |
| R2L | 31.62 | 83.18 | 52.09 | 50.20 | 14.38 | 68.87 | 33.33 | 71.91 | 30.44 | 77.95 | 47.91 | 61.09 |
| ODD | 30.11 | 83.09 | 50.68 | 50.79 | 13.64 | 68.85 | 32.48 | 72.84 | 28.59 | 77.01 | 46.28 | 60.12 |
| BLT | 24.38 | 81.70 | 45.67 | 55.38 | 08.72 | 65.70 | 27.40 | 77.76 | 21.50 | 73.97 | 40.23 | 64.39 |
| SYN | 29.62 | 82.65 | 50.25 | 52.14 | - | - | - | - | - | - | - | - |
| CF | 30.25 | 83.22 | 50.71 | 50.72 | 12.04 | 67.61 | 31.18 | 74.75 | 28.91 | 77.06 | 46.46 | 61.56 |
| RF | 30.23 | 83.29 | 50.72 | 51.73 | 12.10 | 67.44 | 30.72 | 73.40 | 27.35 | 76.40 | 45.15 | 62.14 |
| SAO | 32.47 | 84.10 | 53.00 | 49.02 | 15.18 | 70.06 | 34.60 | 71.56 | 31.91 | 77.56 | 49.66 | 59.80 |
Table 3: Results of translation experiments for three language pairs in different decoding orders. Scores are reported on the test set with four widely used evaluation metrics (BLEU, Meteor, TER and Ribes). We do not report models trained with SYN order on En-Tr and En-Ja due to the lack of reliable dependency parsers. The statistical significance analysis between the outputs of SAO and L2R is conducted using BLEU score as the metric, and the p-values are for all three language pairs.
Models
We set , , , , , and throughout all the experiments. The source and target embedding matrices are shared except for En-Ja, as our preliminary experiments showed that keeping the embeddings not shared significantly improves the translation quality. Both the encoder and decoder use relative positions during self-attention except for the word order recovery experiments (where the position embedding is removed in the encoder, as there is no ground-truth position information in the input). We do not introduce task-specific modules such as copying mechanism Gu et al. (2016) for model simplicity.
Training
When training with the predefined orders, we reorder words of each training sequence in advance accordingly, which provides supervision of the ground-truth positions at which each word should be inserted. We test the predefined orders listed in Table 1. The SYN orders were generated according to the dependency parse obtained by a dependency parser (https://spacy.io/usage/linguistic-features) following a parent-to-children left-to-right order. The CF & RF orders are obtained based on vocabulary cut-off so that the number of common words and the number of rare words are approximately the same Ford et al. (2018). We also consider on-the-fly sampling a random order for each sentence as the baseline (RND). When using L2R as the predefined order, Transformer-InDIGO is almost equivalent to the vanilla Transformer, as the position prediction simply learns to predict the next position as the left of the $\langle /s \rangle$ symbol. The only difference is that it enhances the vanilla Transformer with a small number of additional parameters for the position prediction.
We also train Transformer-InDIGO using the searched adaptive order (SAO) where we set the beam size to 8. By default, models trained with SAO are bootstrapped from a slightly pretrained (6,000 steps) model in L2R order.
Inference
At test time, we do beam search as described in Sec. 4.3. We observe from our preliminary experiments that models trained with different orders (either predefined or SAO) have very different optimal beam sizes for decoding. Therefore, we perform sensitivity studies, in which the beam sizes vary from and pick the beam size with the highest BLEU score on the validation set for each particular model.
5.2 Results and Analysis
Word Order Recovery
Word order recovery takes a bag of words as input and recovers its original word order, which is challenging as the search space is factorial. We do not restrict the vocabulary of the input words. We compare our model trained with the L2R order and eight searched adaptive orders (SAO) from beam search for word order recovery. The BLEU scores over various beam sizes are shown in Fig. 3. The model trained with SAO leads to higher BLEU scores than that trained with L2R, with a gain up to BLEU scores. Furthermore, increasing the beam size brings more improvements for SAO compared to L2R, suggesting that InDIGO produces more diversified predictions so that it has a higher chance to recover the correct outputs.
Machine Translation
As shown in Table 3, we compare our model trained with predefined orders and the searched adaptive orders (SAO) with varying setups. We use four evaluation metrics including BLEU Papineni et al. (2002), Ribes Isozaki et al. (2010), Meteor Banerjee and Lavie (2005) and TER Snover et al. (2006) to avoid using a single metric that might be in favor of a particular generation order.
Most of the predefined orders (except for the random order and the balanced tree (BLT) order) perform reasonably well with InDIGO on the three language pairs. The best score is reached by the L2R order among the predefined orders except for En-Ja, where the R2L order works slightly better according to Ribes. This indicates that in machine translation, the monotonic orders are reasonable and reflect the languages. ODD, CF and RF show similar performance, which is below the L2R and R2L orders by around BLEU scores. The tree-based orders, such as the SYN and BLT orders, do not perform well, indicating that predicting words following a syntactic path is not preferable. On the other hand, Table 3 shows that the model with SAO achieves competitive and even statistically significant improvements over the L2R order. The improvements are larger for Turkish and Japanese, which indicates that a flexible generation order may improve the translation quality for languages with different syntactic structures from English.
Table 4 shows the results of the ablation study for the searched order. Removing bootstrapping and the dropout noise in beam search degrades SAO by approximately 1 BLEU score on Ro-En, demonstrating the effectiveness of these two methods.
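Table 4: Ablation study for the searched order on Ro-En (BLEU).
| Model Variants | dev | test |
|---|---|---|
| SAO default | 33.60 | 32.47 |
| no bootstrap | 32.86 | 31.88 |
| no bootstrap, no noise | 32.64 | 31.72 |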
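Table 5: Results on code generation (Django) and image captioning (MS-COCO).
| Model | Django BLEU | Django Accuracy | MS-COCO BLEU | MS-COCO CIDEr-D |
|---|---|---|---|---|
| L2R | 36.74 | 13.6% | 22.12 | 68.88 |
| SAO | 42.33 | 16.3% | 22.58 | 69.42 |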
Code Generation
The goal of this task is to generate Python code based on a natural language description, which can be achieved by using a standard sequence-to-sequence generation framework such as the proposed Transformer-InDIGO. As shown in Table 5, SAO works significantly better than the L2R order in terms of both BLEU and accuracy. This shows that flexible generation orders are preferable in code generation.
Image Captioning
For the captioning task, one caption is generated per image and is compared against five human-created captions during testing. As shown in Table 5, we observe that SAO obtains higher BLEU and CIDEr-D Vedantam et al. (2015) compared to the L2R order, which implies that better captions are generated with different orders.
5.3 Case Study
We demonstrate how InDIGO works by uniformly sampling examples from the validation sets for machine translation (Ro-En), image captioning and code generation. As shown in Fig. 4, the proposed model generates sequences in different orders based on the order used for learning (either predefined or SAO). For instance, the model generates tokens approximately following the dependency parse when we used the SYN order for the machine translation task. On the other hand, the model trained using the RF order learns to produce verbs and nouns first, before filling up the sequence with the remaining functional words.
We observe several key characteristics about the inferred orders of SAO by analyzing the model's output for each task: (1) For machine translation, the generation order of an output sequence does not deviate too much from L2R. Instead, the sequences are shuffled with chunks, and words within each chunk are generated in an L2R order; (2) In the examples of image captioning and code generation, the model tends to generate most of the words in the L2R order and insert a few words afterward in certain locations. Moreover, we provide more examples in the appendix.
6 Related Work
Decoding for Neural Models
Neural autoregressive modelling has become one of the most successful approaches for generating sequences Sutskever et al. (2011); Mikolov (2012), which has been widely used in a range of applications, such as machine translation Sutskever et al. (2014), dialogue response generation Vinyals and Le (2015), image captioning Karpathy and Fei-Fei (2015) and speech recognition Chorowski et al. (2015). Another stream of work focuses on generating a sequence of tokens in a non-autoregressive fashion Gu et al. (2017); Lee et al. (2018); Oord et al. (2017), in which the discrete tokens are generated in parallel. Semi-autoregressive modelling Stern et al. (2018); Wang et al. (2018a) is a mixture of the two approaches, while largely adhering to left-to-right generation. Our method is radically different from these approaches as we support flexible generation orders, while preserving the dependencies among generated tokens.
Generation Orders
Previous studies on generation order of sequences mostly resort to a fixed set of generation orders. Wu et al. (2018) empirically show that R2L generation outperforms its L2R counterpart in a few tasks. Ford et al. (2018) devise a two-pass approach that produces partially-filled sentence "templates" and then fills in missing tokens. Zhu et al. (2019) also propose to generate tokens by first predicting a text template and infilling the sentence afterwards, in a more general way. Mehri and Sigal (2018) propose a middle-out decoder that first predicts a middle word and simultaneously expands the sequence in both directions afterwards. Another line of work models the probability of a sequence as a tree or directed graph Zhang et al. (2015); Dyer et al. (2016); Aharoni and Goldberg (2017); Wang et al. (2018b); Eriguchi et al. (2017). In contrast, Transformer-InDIGO supports fully flexible generation orders, which are inferred during decoding.
There are two concurrent works Welleck et al. (2019); Stern et al. (2019), which study sequence generation in a non-L2R order. Welleck et al. (2019) propose a tree-like generation algorithm. Unlike this work, the tree-based generation order only produces a subset of all possible generation orders compared to our insertion-based models. Further, Welleck et al. (2019) find L2R is superior to their learned orders on machine translation tasks, while Transformer-InDIGO with searched adaptive orders achieves better performance. Stern et al. (2019) propose a very similar idea of using insertion operations in Transformer for machine translation. The major difference is that they directly use absolute positions, while ours utilizes relative positions. As a result, their model needs to re-encode the partial sequence at every step, which is computationally more expensive. In contrast, our approach does not necessitate re-encoding the entire sentence during generation. In addition, knowledge distillation was necessary to achieve good performance in Stern et al. (2019), while our model is able to match the performance of L2R even without bootstrapping.
7 Conclusion
We have presented a novel approach, InDIGO, which supports flexible sequence generation. Our model was trained with either predefined orders or searched adaptive orders. In contrast to conventional neural autoregressive models which often generate from left to right, our model can flexibly generate a sequence following an arbitrary order. Experiments show that our method achieved competitive or even better performance compared to the conventional left-to-right generation on four tasks, including machine translation, word order recovery, code generation and image captioning.
For future work, it is worth exploring training InDIGO using a trainable inference model to directly predict the permutation Mena et al. (2018) instead of beam search. Also, the proposed InDIGO could be extended for post-editing tasks such as automatic post-editing for machine translation (APE) and grammatical error correction (GEC) by introducing additional operations such as "deletion" and "substitution".
References
 Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards stringtotree neural machine translation. arXiv preprint arXiv:1704.04743.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 6572.
 Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
 Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attentionbased models for speech recognition. In NIPS, pages 577585.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. arXiv preprint arXiv:1602.07776.
 Eriguchi et al. (2017) Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. arXiv preprint arXiv:1702.03525.
 Ford et al. (2018) Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E Dahl. 2018. The importance of generation order in language modeling. arXiv preprint arXiv:1808.07910.
 Gu et al. (2017) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Nonautoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
 Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequencetosequence learning. arXiv preprint arXiv:1603.06393.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
 Isozaki et al. (2010) Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952. Association for Computational Linguistics.
 Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
 Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
 Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
 Ling et al. (2016) Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744.
 Mehri and Sigal (2018) Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5523–5534. Curran Associates, Inc.
 Mena et al. (2018) Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. 2018. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665.
 Mikolov (2012) Tomáš Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.
 Neubig (2011) Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.
 Oda et al. (2015) Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, pages 574–584, Lincoln, Nebraska, USA. IEEE Computer Society.
 Oord et al. (2017) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
 Pal et al. (2006) Chris Pal, Charles Sutton, and Andrew McCallum. 2006. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 5, pages V–V. IEEE.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
 Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
 Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
 Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
 Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. volume 200.
 Stern et al. (2019) Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249.
 Stern et al. (2018) Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pages 10107–10116.

 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. NIPS.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
 Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
 Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
 Vinyals et al. (2015a) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015a. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
 Vinyals et al. (2015b) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015b. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
 Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
 Wang et al. (2018a) Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018a. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583.
 Wang et al. (2018b) Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018b. A tree-based decoder for neural machine translation. arXiv preprint arXiv:1808.09374.
 Welleck et al. (2019) Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho. 2019. Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192.
 Wu et al. (2018) Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120.
 Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696.
 Zhang et al. (2015) Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060.
 Zhu et al. (2019) Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.
Appendix
Additional Examples
We present additional examples for the En-Tr and En-Ja translation tasks in Fig. 5.