Sequence Modeling with Unconstrained Generation Order

11/01/2019 ∙ by Dmitrii Emelianenko, et al.

The dominant approach to sequence generation is to produce a sequence in some predefined order, e.g. left to right. In contrast, we propose a more general model that can generate the output sequence by inserting tokens in any arbitrary order. Our model learns decoding order as a result of its training procedure. Our experiments show that this model is superior to fixed order models on a number of sequence generation tasks, such as Machine Translation, Image-to-LaTeX and Image Captioning.




1 Introduction

Neural approaches to sequence generation have seen a variety of applications such as language modeling Mikolov et al. (2010), machine translation Bahdanau et al. (2014); Sutskever et al. (2014), music generation Briot et al. (2017) and image captioning Xu et al. (2015). All these tasks involve modeling a probability distribution over sequences of some kind of tokens.

Usually, sequences are generated left to right, by iteratively adding tokens to the end of an unfinished sequence. Although this approach is widely used due to its simplicity, it restricts the generation process: left-to-right generation reduces output diversity Mehri and Sigal (2018) and can be ill-suited to the structure of the target sequence Wu et al. (2018). To alleviate this issue, previous studies suggested exploiting prior knowledge about the task (e.g. the semantic roles of words in a natural language sentence or the concept of language branching) to select a preferable generation order Mehri and Sigal (2018); Wu et al. (2018); Ford et al. (2018). However, these approaches are still limited to a predefined generation order, which is the same for all input instances.

Figure 1: Examples of different decoding orders: left-to-right, alternative and right-to-left orders respectively. Each line represents one decoding step.

In this work, we propose INTRUS: INsertion TRansformer for Unconstrained order Sequence modeling. Our model has no predefined order constraint and generates sequences by iteratively adding tokens to a subsequence in any order, not necessarily in the order they appear in the final sequence. It learns a convenient generation order as a by-product of its training procedure, without relying on prior knowledge about the task it is solving.

Our key contributions are as follows:

  • We propose a neural sequence model that can generate the output sequence by inserting tokens in any arbitrary order;

  • The proposed model outperforms fixed-order baselines on several tasks, including Machine Translation, Image-to-LaTeX and Image Captioning;

  • We analyze learned generation orders and find that the model has a preference towards producing “easy” words at the beginning and leaving more complicated choices for later.

2 Method

We consider the task of generating a sequence $Y$ consisting of tokens $y_t$ given some input $X$. In order to remove the predefined generation order constraint, we need to reformulate the probability of the target sequence $p(Y \mid X)$ in terms of token insertions. Unlike traditional models, there are multiple valid insertions at each step. This formulation is closely related to the existing framework of generating unordered sets, which we briefly describe in Section 2.1. In Section 2.2, we introduce our approach.

2.1 Generating unordered sets

In the context of unordered set generation, Vinyals et al. (2016) proposed a method to learn sequence order from data jointly with the model. The resulting model samples a permutation $\pi(t)$ of the target sequence and then scores the permuted sequence with a neural probabilistic model:

$$P(Y_\pi \mid X, \theta) = \prod_t p(y_{\pi(t)} \mid X, y_{\pi(0)}, \ldots, y_{\pi(t-1)}, \theta). \tag{1}$$

The training is performed by maximizing the data log-likelihood over both model parameters $\theta$ and target permutation $\pi(t)$:

$$\theta^* = \arg\max_\theta \sum_{X,Y} \max_{\pi(t)} \log P(Y_\pi \mid X, \theta). \tag{2}$$

Exact maximization over $\pi(t)$ requires $O(|Y|!)$ operations, therefore it is infeasible in practice. Instead, the authors propose using greedy or beam search. The resulting procedure resembles the Expectation Maximization algorithm:

  1. E step: find optimal $\pi(t)$ for $Y$ under current $\theta$ with inexact search,

  2. M step: update parameters $\theta$ with gradient descent under $\pi(t)$ found on the E step.

EM algorithms are known to easily get stuck in local optima. To mitigate this issue, the authors sample permutations proportionally to $p(y_{\pi(t)} \mid X, y_{\pi(0)}, \ldots, y_{\pi(t-1)}, \theta)$ instead of maximizing over $\pi$.

2.2 Our approach

The task now is to build a probabilistic model over sequences of insertion operations. This can be viewed as an extension of the approach described in the previous section, which operates on ordered sequences instead of unordered sets. At step $t$, the model generates either a pair $(pos_t, token_t)$ consisting of a position $pos_t$ in the sub-sequence produced so far ($pos_t \in [0, t]$) and a token $token_t$ to be inserted at this position, or a special EOS element indicating that the generation process is terminated. It estimates the conditional probability of a new insertion $\tau_t$ given $X$ and a partial output $\tilde{Y}(\tau_{0:t-1})$ constructed from the previous inserts:

$$p(\tau \mid X, \theta) = \prod_t p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta). \tag{3}$$
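To make the notion of an insertion trajectory concrete, the following minimal sketch applies a list of (position, token) insertions to an initially empty output. The names `Insertion` and `apply_trajectory` are illustrative and do not come from the paper; the EOS element is left out for brevity.

```python
from typing import List, Tuple

Insertion = Tuple[int, str]  # (position, token); illustrative type alias


def apply_trajectory(trajectory: List[Insertion]) -> List[str]:
    """Build the output sequence by applying insertions in order."""
    partial: List[str] = []
    for step, (pos, token) in enumerate(trajectory):
        # At step t the partial output holds t tokens, so pos must lie in [0, t].
        if not 0 <= pos <= step:
            raise ValueError(f"position {pos} is out of range at step {step}")
        partial.insert(pos, token)
    return partial


# Three different trajectories that produce the same sequence.
left_to_right = [(0, "a"), (1, "cat"), (2, "sat")]
right_to_left = [(0, "sat"), (0, "cat"), (0, "a")]
middle_out = [(0, "cat"), (0, "a"), (2, "sat")]

for name, trajectory in [("left-to-right", left_to_right),
                         ("right-to-left", right_to_left),
                         ("middle-out", middle_out)]:
    print(name, apply_trajectory(trajectory))
```

All three trajectories end in the same token sequence, which is why the likelihood of a target has to account for more than one trajectory.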


Training objective

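We train the model by maximizing the log-likelihood of the reference sequence $Y$ given the source $X$, summed over the data set $D$:

$$L = \sum_{\{X,Y\} \in D} \log p(Y \mid X, \theta) = \sum_{\{X,Y\} \in D} \log \sum_{\tau \in T^*(Y)} p(\tau \mid X, \theta) = \sum_{\{X,Y\} \in D} \log \sum_{\tau \in T^*(Y)} \prod_t p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta), \tag{4}$$

where $T^*(Y)$ denotes the set of all trajectories leading to $Y$ (see Figure 2).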

Figure 2: Graph of trajectories for $Y$ = “a cat sat”.
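For a short target, the trajectory graph can be enumerated directly. The sketch below rests on one observation that is our own framing rather than a statement from the paper: a partial output can still be completed to the target by insertions exactly when it is a subsequence of the target. The function names are hypothetical.

```python
from functools import lru_cache
from typing import Set, Tuple

Tokens = Tuple[str, ...]


def is_subsequence(partial: Tokens, target: Tokens) -> bool:
    """True if `partial` can be obtained from `target` by deleting tokens."""
    remaining = iter(target)
    return all(token in remaining for token in partial)


def valid_insertions(partial: Tokens, target: Tokens) -> Set[Tuple[int, str]]:
    """All (position, token) pairs that keep the partial output on a path to `target`."""
    ops = set()
    for pos in range(len(partial) + 1):
        for token in set(target):
            extended = partial[:pos] + (token,) + partial[pos:]
            if is_subsequence(extended, target):
                ops.add((pos, token))
    return ops


def count_trajectories(target: Tokens) -> int:
    """Number of insertion sequences leading from the empty output to `target`."""

    @lru_cache(maxsize=None)
    def paths_from(partial: Tokens) -> int:
        if partial == target:
            return 1
        total = 0
        for pos, token in valid_insertions(partial, target):
            total += paths_from(partial[:pos] + (token,) + partial[pos:])
        return total

    return paths_from(())


print(count_trajectories(("a", "cat", "sat")))  # 6 trajectories for three distinct tokens
```

With distinct tokens the count equals the factorial of the target length, so exhaustive enumeration is only practical for toy examples.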

Intuitively, we maximize the total probability “flowing” through the acyclic graph defined by $T^*(Y)$. This graph has approximately $O(|Y|!)$ paths from an empty sequence to the target sequence $Y$. Therefore, directly maximizing (4) is impractical. Our solution, inspired by Vinyals et al. (2016), is to assume that for any input $X$ there is a trajectory $\tau^*$ that is the most convenient for the model. We want the model to concentrate the probability mass on this single trajectory. This can be formulated as a lower bound of the objective (4):

$$L \ge \sum_{\{X,Y\} \in D} \log \max_{\tau \in T^*(Y)} \prod_t p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta) = \sum_{\{X,Y\} \in D} \max_{\tau \in T^*(Y)} \sum_t \log p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta). \tag{5}$$

The lower bound is tight iff the entire probability mass in $T^*(Y)$ is concentrated along a single trajectory. This leads to a convenient property: maximizing (5) forces the model to choose a certain “optimal” sequence of insertions and concentrate most of the probability mass there.

The bound (5) depends only on the most probable trajectory $\tau^*$ and is therefore difficult to optimize directly. This may result in convergence to a local maximum. Similar to Vinyals et al. (2016), we replace $\max$ with an expectation w.r.t. trajectories sampled from $T^*(Y)$. We sample from the probability distribution over the trajectories obtained from the model. The new lower bound is:

$$\sum_{\{X,Y\} \in D} \mathbb{E}_{\tau \sim p(\tau \mid X, T^*(Y), \theta)} \sum_t \log p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta). \tag{6}$$

The sampled lower bound in (6) is less than or equal to (5). However, if the entire probability mass is concentrated on a single trajectory, both lower bounds are tight. Thus, when maximizing (6), we also expect most of the probability mass to be concentrated on one or a few “best” trajectories.

Training procedure

We train our model using stochastic gradient ascent of (6). For each pair $\{X, Y\}$ from the current mini-batch, we sample the trajectory from the model: $\tau \sim p(\tau \mid X, T^*(Y), \theta)$. We constrain sampling only to correct trajectories by allowing only the correct insertion operations (i.e. the ones that lead to producing $Y$). At each step along the sampled trajectory $\tau$, we maximize $\log p(ref(Y, \tau_{0:t-1}) \mid X, \tilde{Y}(\tau_{0:t-1}), \theta)$, where $ref(Y, \tau_{0:t-1})$ defines a set of all insertions $\tau'_t$ immediately after $\tau_{0:t-1}$, such that the trajectory $\tau_{0:t-1}$ extended with $\tau'_t$ is correct: $\tau_{0:t-1} \oplus \tau'_t \in T^*(Y)$. From a formal standpoint, this is a probability of picking any insertion that is on the path to $Y$. The simplified training procedure is given in Algorithm 1.

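  Algorithm 1: Training procedure (simplified)
  Inputs: batch $\{X, Y\}$, parameters $\theta$, learning rate $\alpha$
  $g := 0$ // $g$ is the gradient accumulator
  for $X_i, Y_i \in \{X, Y\}$ do
       $\tau \sim p(\tau \mid X_i, T^*(Y_i), \theta)$
       for $t \in 0, 1, \ldots, |\tau| - 1$ do
             $ref_t := \{\tau'_t : \tau_{0:t-1} \oplus \tau'_t \in T^*(Y_i)\}$ // correct inserts
             $L_{i,t} := \log \sum_{\tau'_t \in ref_t} p(\tau'_t \mid X_i, \tilde{Y}(\tau_{0:t-1}), \theta)$
             $g := g + \partial L_{i,t} / \partial \theta$
       end for
  end for
  return $\theta + \alpha \cdot g$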

The training procedure is split into two steps: (i) pretraining with uniform samples from the set of feasible trajectories $T^*(Y)$, and (ii) training on samples from our model’s probability distribution over $T^*(Y)$ until convergence. We discuss the importance of the pretraining step in Section 5.1.
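The per-step objective and the pretraining-style sampling can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `uniform_model` is a hypothetical stand-in for the decoder, correct insertions are found with a subsequence test, the final EOS step is omitted, and the next insertion is drawn uniformly among correct insertions (which is uniform over trajectories when all target tokens are distinct).

```python
import math
import random
from typing import Dict, List, Tuple

Tokens = Tuple[str, ...]
Insertion = Tuple[int, str]  # (position, token)


def is_subsequence(partial: Tokens, target: Tokens) -> bool:
    remaining = iter(target)
    return all(token in remaining for token in partial)


def valid_insertions(partial: Tokens, target: Tokens) -> List[Insertion]:
    """Insertions after which the partial output can still be completed to `target`."""
    ops = []
    for pos in range(len(partial) + 1):
        for token in sorted(set(target)):
            if is_subsequence(partial[:pos] + (token,) + partial[pos:], target):
                ops.append((pos, token))
    return ops


def uniform_model(partial: Tokens, vocab: List[str]) -> Dict[Insertion, float]:
    """Hypothetical stand-in for the decoder: uniform over all insertions and EOS."""
    ops = [(pos, token) for pos in range(len(partial) + 1) for token in vocab]
    prob = 1.0 / (len(ops) + 1)  # the extra outcome is EOS
    return {op: prob for op in ops}


def sample_and_score(target: Tokens, vocab: List[str], rng: random.Random):
    """Sample one correct trajectory and accumulate the per-step objective."""
    partial: Tokens = ()
    trajectory: List[Insertion] = []
    objective = 0.0
    while partial != target:
        ref = valid_insertions(partial, target)
        probs = uniform_model(partial, vocab)
        # Log-probability of picking any insertion that stays on a path to the target.
        objective += math.log(sum(probs[op] for op in ref))
        pos, token = rng.choice(ref)  # uniform choice among correct insertions
        trajectory.append((pos, token))
        partial = partial[:pos] + (token,) + partial[pos:]
    return trajectory, objective


trajectory, objective = sample_and_score(
    ("a", "cat", "sat"), ["a", "cat", "sat"], random.Random(0)
)
print(trajectory, objective)
```

In a real training loop the objective would be differentiated with respect to the model parameters; the toy model here has none, so only the bookkeeping is shown.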


Inference

To find the most likely output sequence according to our model, we have to compute the probability distribution over target sequences as follows:

$$p(Y \mid X, \theta) = \sum_{\tau \in T^*(Y)} p(\tau \mid X, \theta). \tag{7}$$

Computing such probability exactly requires summation over up to $O(|Y|!)$ trajectories, which is infeasible in practice. However, due to the nature of our optimization algorithm (explicitly maximizing the lower bound (6)), we expect most of the probability mass to be concentrated on one or a few “best” trajectories:

$$p(Y \mid X, \theta) \approx \max_{\tau \in T^*(Y)} p(\tau \mid X, \theta). \tag{8}$$

Hence, we perform approximate inference by finding the most likely trajectory of insertions, disregarding the fact that several trajectories may lead to the same $Y$. [Footnote 2: To justify this transition, we translated sentences with a fully trained model using beam size 128 and found only 4 occasions where multiple insertion trajectories in the beam led to the same output sequence.] The resulting inference problem is defined as:

$$Y^* = \tilde{Y}(\tau^*), \quad \tau^* = \arg\max_{\tau} \log p(\tau \mid X, \theta). \tag{9}$$


This problem is combinatorial in nature, but it can be solved approximately using beam search. In the case of our model, beam search compares partial output sequences and extends them by selecting the best token insertions. Our model also inherits a common problem of left-to-right machine translation models: it tends to stop too early and produce output sequences that are shorter than the reference. To alleviate this effect, we divide hypotheses’ log-probabilities by their length. This has already been used in previous works Vaswani et al. (2017); Wu et al. (2016); Gehring et al. (2017).
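A minimal sketch of beam search over insertions with length normalization is given below. The model interface `step_log_probs` (a function from a partial output to log-probabilities of insertions and of the string "EOS") and the toy model are assumptions made for illustration; duplicate hypotheses reached by different trajectories are not merged.

```python
import math
from typing import Callable, Dict, List, Tuple, Union

Tokens = Tuple[str, ...]
Op = Union[Tuple[int, str], str]  # (position, token) or the string "EOS"
StepFn = Callable[[Tokens], Dict[Op, float]]  # hypothetical model interface


def beam_search(step_log_probs: StepFn, beam_size: int, max_len: int,
                length_normalize: bool = True) -> Tokens:
    beam: List[Tuple[Tokens, float]] = [((), 0.0)]
    finished: List[Tuple[Tokens, float]] = []
    for _ in range(max_len + 1):
        candidates: List[Tuple[Tokens, float]] = []
        for partial, score in beam:
            for op, logp in step_log_probs(partial).items():
                if op == "EOS":
                    finished.append((partial, score + logp))
                elif len(partial) < max_len:
                    pos, token = op
                    extended = partial[:pos] + (token,) + partial[pos:]
                    candidates.append((extended, score + logp))
        # Keep only the highest-scoring partial outputs.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]
        if not beam:
            break
    if not finished:
        raise ValueError("no hypothesis produced EOS")

    def final_score(hypothesis: Tuple[Tokens, float]) -> float:
        tokens, logp = hypothesis
        return logp / max(len(tokens), 1) if length_normalize else logp

    return max(finished, key=final_score)[0]


def toy_step_log_probs(partial: Tokens) -> Dict[Op, float]:
    """Toy model that builds ("a", "cat", "sat") but stops early with probability 0.2."""
    target = ("a", "cat", "sat")
    missing = [token for token in target if token not in partial]
    if not missing:
        return {"EOS": 0.0}
    ops: Dict[Op, float] = {"EOS": math.log(0.2)}
    for token in missing:
        # Insert before the first already-generated token that follows it in the target.
        pos = sum(target.index(p) < target.index(token) for p in partial)
        ops[(pos, token)] = math.log(0.8 / len(missing))
    return ops


print(beam_search(toy_step_log_probs, beam_size=2, max_len=5))
print(beam_search(toy_step_log_probs, beam_size=2, max_len=5, length_normalize=False))
```

With this toy model the unnormalized score prefers the empty output, while dividing by length selects the full sequence, which mirrors the early-stopping effect described above.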

3 Model architecture

INTRUS follows the encoder-decoder framework. Specifically, the model is based on the Transformer Vaswani et al. (2017) architecture (Figure 3) due to its state-of-the-art performance on a wide range of tasks Bojar et al. (2018); Devlin et al. (2019); Barrault et al. (2019). INTRUS differs from left-to-right sequence models in two key ways. Firstly, our model’s decoder does not require the attention mask preventing attention to subsequent positions. Decoder self-attention is re-applied at each decoding step because the positional encodings of most tokens change when inserting a new one into an incomplete subsequence of $Y$. [Footnote 3: Though this makes training more computationally expensive than the standard Transformer, this does not hurt decoding speed much: on average decoding is only 50% slower than the baseline. We will discuss this in detail in Section 5.2.] Secondly, the decoder predicts the joint probability of a token and a position corresponding to a single insertion (rather than the probability of a token, as usually done in the standard setting). Consequently, the predicted probabilities should add up to 1 over all positions and tokens at each step. We achieve this by decomposing the insertion probability into the probabilities of a token and a position:

$$p(\tau_t \mid X, \tilde{Y}(\tau_{0:t-1}), \theta) = p(token \mid pos, X, \tilde{Y}(\tau_{0:t-1}), \theta) \cdot p(pos \mid X, \tilde{Y}(\tau_{0:t-1}), \theta),$$

$$p(token \mid pos, X, \tilde{Y}(\tau_{0:t-1}), \theta) = \mathrm{softmax}(h_{pos} \times W_{tok}), \quad p(pos \mid X, \tilde{Y}(\tau_{0:t-1}), \theta) = \mathrm{softmax}(H \times w_{loc}). \tag{10}$$

Here $h_{pos}$ denotes a single decoder hidden state (of size $d$) corresponding to an insertion at position $pos$; $H$ represents a matrix of all such states. $W_{tok}$ is a learned weight matrix that predicts token probabilities and $w_{loc}$ is a learned vector of weights used to predict positions. In other words, the hidden state of each token in the current sub-sequence defines (i) the probability that the next token will be generated at the position immediately preceding the current token and (ii) the probability for each particular token to be generated next.
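The decomposition into a position distribution and a per-position token distribution can be sketched in plain Python. The variable names `H`, `W_tok` and `w_loc` are chosen for this example, the numbers are arbitrary, and the handling of EOS is not modeled.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def insertion_probabilities(H: List[List[float]],
                            W_tok: List[List[float]],
                            w_loc: List[float]) -> List[List[float]]:
    """Joint probabilities p(pos, token) as a positions-by-vocabulary table."""
    # p(pos): softmax over positions of the scalar scores h_pos . w_loc
    pos_probs = softmax([sum(h * w for h, w in zip(h_pos, w_loc)) for h_pos in H])
    vocab_size = len(W_tok[0])
    joint = []
    for h_pos, p_pos in zip(H, pos_probs):
        # p(token | pos): softmax over the vocabulary of h_pos x W_tok
        logits = [sum(h * W_tok[i][v] for i, h in enumerate(h_pos))
                  for v in range(vocab_size)]
        joint.append([p_pos * p_tok for p_tok in softmax(logits)])
    return joint


# Three insertion positions, hidden size 2, vocabulary of 4 tokens (arbitrary values).
H = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
W_tok = [[0.1, 0.4, -0.2, 0.0], [0.3, -0.5, 0.9, 0.2]]
w_loc = [0.7, -0.1]

joint = insertion_probabilities(H, W_tok, w_loc)
print(sum(sum(row) for row in joint))  # the joint distribution sums to 1 up to rounding
```

Because each row is a token distribution scaled by its position probability, the table is normalized over all (position, token) pairs by construction.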

Figure 3: Model architecture for a single token insertion. The output of column $i$ for token $j$ defines the probability of inserting token $j$ into $\tilde{Y}(\tau_{0:t-1})$ before the token at the $i$-th position.

The encoder component of the model can have any task-specific network architecture. For the Machine Translation task, it can be an arbitrary sequence encoder: any combination of RNN Sutskever et al. (2014); Bahdanau et al. (2014), CNN Gehring et al. (2017); Elbayad et al. (2018) or self-attention Vaswani et al. (2017); Dehghani et al. (2018). For image-to-sequence problems (e.g. Image-To-LaTeX Deng et al. (2017)) any 2d convolutional encoder architecture from the domain of computer vision can be used Simonyan and Zisserman (2014); He et al. (2016); Huang et al. (2017).

3.1 Relation to prior work

The closest to ours is the work by Gu et al. (2019), who propose a decoding algorithm which supports flexible sequence generation in arbitrary orders through insertion operations. [Footnote 4: At the time of submission, this was a concurrent work.]

In terms of modeling, they describe a similar Transformer-based model but use a relative-position-based representation to capture generation orders. This effectively addresses the problem that absolute positional encodings are unknown before generating the whole sequence. In our model, positional encodings of most of the tokens change after each insertion operation and, therefore, decoder self-attention is re-applied at each generation step. The model by Gu et al. (2019) does not need this and has a better theoretical time complexity of $O(|Y|^2)$ in contrast to our $O(|Y|^3)$. However, in practice our decoding is on average only 50% slower than the baseline; for the details, see Section 5.2.

In terms of training objective, they use lower bound (5) with beam search over $T^*(Y)$, which is different from our lower bound (6). However, we found our lower bound to be beneficial in terms of quality and less prone to getting stuck in local optima. We will discuss this in detail in Section 5.1.

4 Experimental setup

We consider three sequence generation tasks: Machine Translation, Image-To-Latex and Image Captioning. For each, we now define input $X$ and output $Y$, the datasets and the task-specific encoder we use. Decoders for all tasks are Transformers in base configuration Vaswani et al. (2017) (either original or INTRUS) with identical hyperparameters.

Machine Translation

For MT, input $X$ and output $Y$ are sentences in different languages. The encoder is the Transformer-base encoder Vaswani et al. (2017).

Wu et al. (2018) suggest that left-to-right NMT models fit better for right-branching languages (e.g., English) and right-to-left NMT models fit better for left-branching languages (e.g., Japanese). This defines the choice of language pairs for our experiments. Our experiments include: En-Ru and Ru-En WMT14; En-Ja ASPEC Nakazawa et al. (2016); En-Ar, En-De and De-En IWSLT14 Machine Translation data sets. We evaluate our models on WMT2017 test set for En-Ru and Ru-En, ASPEC test set for En-Ja, concatenated IWSLT tst2010, tst2011 and tst2012 for En-De and De-En, and concatenated IWSLT tst2014 and tst2013 for En-Ar.

Sentences of all translation directions except the Japanese part of the En-Ja data set are preprocessed with the Moses tokenizer Koehn et al. (2007) and segmented into subword units using BPE Sennrich et al. (2016) with 32,000 merge operations. Before BPE segmentation, Japanese sentences were first segmented into words. [Footnote 5: Open-source word segmentation software is available at]

Image-To-LaTeX

In this task, $X$ is a rendered image of LaTeX markup, $Y$ is the markup itself. We use the ImageToLatex-140K Deng et al. (2017); Singh (2018) data set. We used the encoder CNN architecture, preprocessing pipeline and evaluation scripts by Singh (2018). [Footnote 6: We used]

Image captioning

Here $X$ is an image, $Y$ is its description in natural language. We use MSCOCO Chen et al. (2015), the standard Image Captioning dataset. The encoder is VGG16 Simonyan and Zisserman (2014) pretrained on the ImageNet task without the last layer. [Footnote 7: We use pretrained weights from keras applications, the same for both baseline and our model.]


Evaluation

We use BLEU Papineni et al. (2002) for evaluation of Machine Translation and Image-to-Latex models. [Footnote 8: BLEU is computed via SacreBLEU Post (2018) script with the following parameters: [src-lang]-[dst-lang]+#.1+s.exp+tok.13a+v.1.2.18] For En-Ja, we measure character-level BLEU to avoid the influence of word segmentation software. The scores on the MSCOCO dataset are obtained via the official evaluation script. [Footnote 9: Script is available at]

Training details

The models are trained until convergence with base learning rate 1.4e-3, 16,000 warm-up steps and batch size of 4,000 tokens. We vary the learning rate over the course of training according to Vaswani et al. (2017) and follow their optimization technique. We use beam search with the beam between 4 and 64 selected using the validation data for both baseline and INTRUS, although our model benefits more when using even bigger beam sizes. The pretraining phase of INTRUS is batches.
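The learning rate schedule of Vaswani et al. (2017) combines a linear warm-up with inverse-square-root decay. The sketch below is one possible reading of the setup above and makes an explicit assumption that the text does not confirm: the base learning rate of 1.4e-3 is treated as the peak value reached at the end of the 16,000 warm-up steps.

```python
def learning_rate(step: int, base_lr: float = 1.4e-3, warmup_steps: int = 16000) -> float:
    """Linear warm-up followed by inverse-square-root decay.

    Assumption: `base_lr` is the peak rate, reached exactly at `warmup_steps`.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    warmup_factor = step / warmup_steps
    decay_factor = (warmup_steps / step) ** 0.5
    return base_lr * min(warmup_factor, decay_factor)


for step in (1000, 8000, 16000, 64000):
    print(step, learning_rate(step))
```

Up to the constant scale, this has the same shape as the original formula, in which the rate grows linearly during warm-up and then decays proportionally to the inverse square root of the step number.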

5 Results

Model          En-Ru  Ru-En  En-Ja  En-Ar  En-De  De-En  Im2Latex  MSCOCO BLEU  MSCOCO CIDEr
Left-to-right  31.6   35.3   47.9   12.0   28.04  33.17  89.5      18.0         56.1
Right-to-left  -      -      48.6   11.5   -      -      -         -            -
INTRUS         33.2   36.4   50.3   12.2   28.36  33.08  90.3      25.6         81.0
Table 1: The results of our experiments. En-Ru, Ru-En, En-Ja, En-Ar, En-De and De-En are machine translation experiments. Statistical significance is assessed at a p-value of 0.05, computed via bootstrapping Koehn (2004).
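Significance testing by bootstrapping, in the spirit of Koehn (2004), can be sketched as paired resampling of the test set. This is a simplified illustration rather than the evaluation code behind Table 1: it compares sums of hypothetical per-sentence scores, whereas corpus-level BLEU would have to be recomputed on every resampled set.

```python
import random
from typing import List


def paired_bootstrap(scores_a: List[float], scores_b: List[float],
                     num_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A scores higher than system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Resample sentence indices with replacement, using the same indices for both systems.
        indices = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in indices) > sum(scores_b[i] for i in indices):
            wins += 1
    return wins / num_samples


# Hypothetical per-sentence scores for two systems on a five-sentence test set.
system_a = [0.42, 0.55, 0.31, 0.60, 0.48]
system_b = [0.40, 0.50, 0.33, 0.52, 0.47]
print(paired_bootstrap(system_a, system_b))
```

Under this scheme, system A would be called significantly better at the 0.05 level if it wins on at least 95% of the resampled sets.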

Among all tasks, the largest improvements are for Image Captioning: 7.6 BLEU and 25.1 CIDEr.

For Machine Translation, INTRUS substantially outperforms the baselines for most considered language pairs and matches the baseline for the rest. As expected, the right-to-left generation order is better than the left-to-right for translation into Japanese. However, our model significantly outperforms both baselines. For the tasks where left-to-right decoding order provides a strong inductive bias (e.g. in De-En translation task, where source and target sentences can usually be aligned without any permutations), generation in arbitrary order does not give significant improvements.

Image-To-Latex improves by 0.8 BLEU, which is a reasonable difference considering the high performance of the baseline.

5.1 Ablation analysis

In this section, we show the superior performance of the proposed lower bound of the data log-likelihood (6) over the natural choice of (5). We also emphasize the importance of the pretraining phase for INTRUS. Specifically, we compare performance of the following models:

  • Default — using the training procedure described in Section 2.2;

  • Argmax — trained with the lower bound (5) (the maximum is approximated using beam search with a beam of 4; this technique matches the one used in Gu et al. (2019));

  • Left-to-right pretraining — pretrained with the fixed left-to-right decoding order (in contrast to the uniform samples in the default setting);

  • No pretraining — with no pretraining phase;

  • Only pretraining — training is performed with a model-independent order, either uniform or left-to-right.

Training strategy                   BLEU
INTRUS                              27.5
Argmax                              26.6
Pretraining left-to-right           26.3
No pretraining                      27.1
Only pretraining, uniform           24.6
Only pretraining, left-to-right     25.5
Baseline left-to-right              25.8
Table 2: Training strategies of INTRUS. MT task, scores on the WMT En-Ru 2012-2013 test sets.

Table 2 confirms the importance of the chosen pretraining strategy for the performance of the model. In preliminary experiments, we also observed that introducing either of the two pretraining strategies increases the overall robustness of our training procedure and helps to avoid convergence to poor local optima. We attribute this to the fact that a pretrained model provides the main algorithm with a good initial exploration of the trajectory space $T^*(Y)$, while the Argmax training strategy tends to quickly converge to the current best trajectory, which may not be globally optimal. This leads to poor performance and unstable results. Argmax is the only strategy that required several consecutive runs to obtain reasonable quality, despite the fact that it starts from a good pretrained model.

5.2 Computational complexity

Despite its superior performance, INTRUS is more computationally expensive than the baseline. The main computational bottleneck in the model training is the generation of insertions required to evaluate the training objective (6). This generation procedure is inherently sequential. Thus, it is challenging to effectively parallelize it on GPU accelerators. In our experiments, training time of INTRUS is 3-4 times longer than that of the baseline. The theoretical computational complexity of the model’s inference is $O(|Y|^3)$ compared to $O(|Y|^2)$ for conventional left-to-right models. However, in practice this is unlikely to cause a drastic decrease in decoding speed. Figure 4 shows the decoding speed of both INTRUS and the baseline measured for the machine translation task. On average, INTRUS is only 50% slower because for sentences of a reasonable length it performs comparably to the baseline.
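The gap in theoretical complexity can be illustrated by counting pairwise attention scores in decoder self-attention. The cost model is an assumption made for this sketch: a left-to-right decoder caches earlier states, so step t scores the new token against t positions, while a decoder that re-applies self-attention recomputes all t-by-t scores at step t.

```python
def left_to_right_cost(length: int) -> int:
    """Attention scores computed with cached states: t per step."""
    return sum(t for t in range(1, length + 1))


def reapplied_attention_cost(length: int) -> int:
    """Attention scores computed when self-attention is re-applied: t * t per step."""
    return sum(t * t for t in range(1, length + 1))


for length in (10, 20, 40, 80):
    cached = left_to_right_cost(length)
    reapplied = reapplied_attention_cost(length)
    print(length, cached, reapplied, round(reapplied / cached, 1))
```

The ratio grows linearly with the output length, which corresponds to the extra factor of the sequence length in the complexity. Measured decoding time depends on many other costs, so this count does not predict the wall-clock slowdown.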

Figure 4: Inference time of INTRUS and the baseline models vs sentence length.

6 Analyzing learned generation orders

In this section, we analyze generation orders learned by INTRUS on the Ru-En translation task.

Visual inspection

We noticed that the model often follows a general decoding direction that varies from sentence to sentence: left-to-right, right-to-left, middle-out, etc. (Figure 8 shows several examples). [Footnote 10: More examples and the analysis for Image Captioning are provided in the supplementary material.] When following the chosen direction, the model deviates from it for translation of certain phrases. For instance, the model tends to decode pairs of quotes and brackets together. We also noticed that tokens which are generated first are often uninformative (e.g., punctuation, determiners, etc.). This suggests that the model has a preference towards generating “easy” words first.

Figure 8: Decoding examples: left-to-right (left), right-to-left (center) and middle-out (right). Each line represents one decoding step.

Part of speech generation order

We want to find out if the model has any preference towards generating different parts of speech in the beginning or at the end of the decoding process. For each part of speech, we compute the relative index on the generation trajectory (for the baseline, it corresponds to its relative position in a sentence). [Footnote 11: To derive part of speech tags, we used CoreNLP tagger Manning et al. (2014).] Figure 9 shows that INTRUS tends to generate punctuation tokens and conjunctions early in decoding. Other parts of speech like nouns, adjectives, prepositions and adverbs are the next easiest to predict. Most often they are produced in the middle of the generation process, when some context is already established. Finally, verbs and particles are the most difficult for the model to insert.
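The relative index of each token on a generation trajectory can be computed as sketched below. The normalization is an assumption made for this example (step number divided by the index of the last step, so that values lie in [0, 1]), and the function name is hypothetical.

```python
from typing import List, Tuple

Insertion = Tuple[int, str]  # (position, token)


def relative_generation_order(trajectory: List[Insertion]) -> List[Tuple[str, float]]:
    """Return (token, relative step) pairs listed in final sequence order."""
    sequence: List[Tuple[str, int]] = []
    for step, (pos, token) in enumerate(trajectory):
        # Remember at which step each token was inserted.
        sequence.insert(pos, (token, step))
    last_step = max(len(trajectory) - 1, 1)
    return [(token, step / last_step) for token, step in sequence]


# Punctuation first, then the noun, the determiner and finally the verb.
trajectory = [(0, "."), (0, "cat"), (0, "a"), (2, "sat")]
for token, relative_step in relative_generation_order(trajectory):
    print(token, round(relative_step, 2))
```

Grouping these values by part-of-speech tag and plotting their distributions would give a picture of the kind shown in Figure 9.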

Figure 9: The distributions of the relative generation order of different parts of speech.

These observations are consistent with the easy-first generation hypothesis: the early decoding steps mostly produce words which are the easiest to predict based on the input data. This is especially interesting in the context of previous work. Ford et al. (2018) study the influence of token generation order on language model quality. They developed a family of two-pass language models that depend on a partitioning of the vocabulary into a set of first-pass and second-pass tokens to generate sentences. The authors find that the most effective strategy is to generate function words in the first pass and content words in the second. While Ford et al. (2018) consider three manually defined strategies, our model learned to give preference to such behavior despite not having any inductive bias to do so.

7 Related work

In Machine Translation, decoding in the right-to-left order improves performance for English-to-Japanese Watanabe and Sumita (2002); Wu et al. (2018). The difference in translation quality is attributed to two main factors: Error Propagation Bengio et al. (2015) and the concept of language branching Wu et al. (2018); Berg (2011). In some languages (e.g. English), sentences normally start with subject/verb on the left and add more information in the rightward direction. Other languages (e.g. Japanese) have the opposite pattern.

Several works suggest first generating the most “important” token, and then the rest of the sequence using forward and backward decoders. The two decoders start the generation process from this first “important” token, which is predicted using classifiers. This approach was shown to be beneficial for video captioning Mehri and Sigal (2018) and conversational systems Qian et al. (2018). Other approaches to non-standard decoding include multi-pass generation models Xia et al. (2017); Ford et al. (2018); Geng et al. (2018); Lee et al. (2018) and non-autoregressive decoding Gu et al. (2018); Lee et al. (2018).

Several recent works proposed sequence models with arbitrary generation order. Gu et al. (2019) propose a similar approach using another lower bound of the log-likelihood which, as we showed in Section 5.1, underperforms ours. They, however, achieve $O(|Y|^2)$ time complexity by utilizing a different probability parameterization along with relative position encoding. Welleck et al. (2019) investigate the possibility of decoding output sequences by descending a binary insertion tree. Stern et al. (2019) focus on parallel decoding using one of several pre-specified generation orders.

8 Conclusion

In this work, we introduce INTRUS, a model which is able to generate sequences in any arbitrary order via iterative insertion operations. We demonstrate that our model learns a convenient generation order as a by-product of its training procedure. The model outperforms left-to-right and right-to-left baselines on several tasks. We analyze learned generation orders and show that the model has a preference towards producing “easy” words at the beginning and leaving more complicated choices for later.


The authors thank David Talbot and Yandex Machine Translation team for helpful discussions and inspiration.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. Presented at ICLR 2015 abs/1409.0473. External Links: Link Cited by: §1, §3.
  • L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri (2019) Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 1–61. External Links: Link, Document Cited by: §3.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015)

    Scheduled sampling for sequence prediction with recurrent neural networks

    In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 1171–1179. External Links: Link Cited by: §7.
  • T. Berg (2011) Structure in language: a dynamic perspective. Cited by: §7.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (wmt18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Belgium, Brussels, pp. 272–307. External Links: Link Cited by: §3.
  • J. Briot, G. Hadjeres, and F. Pachet (2017) Deep learning techniques for music generation - A survey. CoRR abs/1709.01620. External Links: Link, 1709.01620 Cited by: §1.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2018) Universal transformers. CoRR abs/1807.03819. Cited by: §3.
  • Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush (2017) Image-to-markup generation with coarse-to-fine attention. In

    Proceedings of the 34th International Conference on Machine Learning

    , D. Precup and Y. W. Teh (Eds.),
    Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 980–989. External Links: Link Cited by: §3, §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §3.
  • M. Elbayad, L. Besacier, and J. Verbeek (2018)

    Pervasive attention: {2d} convolutional neural networks for sequence-to-sequence prediction

    In Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 97–107. External Links: Link Cited by: §3.
  • N. Ford, D. Duckworth, M. Norouzi, and G. Dahl (2018) The importance of generation order in language modeling. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    Brussels, Belgium, pp. 2942–2946. External Links: Link Cited by: §1, §6, §7.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122. External Links: Link, 1705.03122 Cited by: §2.2, §3.
  • X. Geng, X. Feng, B. Qin, and T. Liu (2018) Adaptive multi-pass decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 523–532. External Links: Link Cited by: §7.
  • J. Gu, J. Bradbury, C. Xiong, V. O.K. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §7.
  • J. Gu, Q. Liu, and K. Cho (2019) Insertion-based decoding with automatically inferred generation order. Cited by: §3.1, §3.1, 2nd item, §7.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.
  • G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks.

    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , pp. 2261–2269.
    Cited by: §3.
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst (2007) Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, Stroudsburg, PA, USA, pp. 177–180. External Links: Link Cited by: §4.
  • P. Koehn (2004) Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: Table 1.
  • J. Lee, E. Mansimov, and K. Cho (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1173–1182. External Links: Link Cited by: §7.
  • C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60. External Links: Link Cited by: footnote 11.
  • S. Mehri and L. Sigal (2018) Middle-out decoding. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5518–5529. External Links: Link Cited by: §1, §7.
  • T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur (2010) Recurrent neural network based language model. In INTERSPEECH, Cited by: §1.
  • T. Nakazawa, M. Yaguchi, K. Uchimoto, M. Utiyama, E. Sumita, S. Kurohashi, and H. Isahara (2016) ASPEC: asian scientific paper excerpt corpus.. In LREC, Cited by: §4.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. External Links: Link, Document Cited by: §4.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. External Links: Link Cited by: footnote 8.
  • Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018) Assigning personality/profile to a chatting machine for coherent conversation generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4279–4285. External Links: Document, Link Cited by: §7.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link Cited by: §3, §4.
  • S. S. Singh (2018) Teaching machines to code : neural markup generation with interpretable attention. Cited by: §4.
  • M. Stern, W. Chan, J. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. Cited by: §7.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. CoRR abs/1409.3215. External Links: Link, 1409.3215 Cited by: §1, §3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §2.2, §3, §3, §4, §4, §4.
  • O. Vinyals, S. Bengio, and M. Kudlur (2016) Order matters: sequence to sequence for sets. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.1, §2.2, §2.2.
  • T. Watanabe and E. Sumita (2002) Bidirectional decoding for statistical machine translation. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING ’02, Stroudsburg, PA, USA, pp. 1–7. External Links: Link, Document Cited by: §7.
  • S. Welleck, K. Brantley, H. Daumé III, and K. Cho (2019) Non-monotonic sequential text generation. Cited by: §7.
  • L. Wu, X. Tan, D. He, F. Tian, T. Qin, J. Lai, and T. Liu (2018) Beyond error propagation in neural machine translation: characteristics of language also matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3602–3611. External Links: Link, Document Cited by: §1, §4, §7.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link, 1609.08144 Cited by: §2.2.
  • Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, and T. Liu (2017) Deliberation networks: sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 1784–1794. External Links: Link Cited by: §7.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. CoRR abs/1502.03044. External Links: Link, 1502.03044 Cited by: §1.

Appendix A MSCOCO part of speech analysis

The distributions in Figure 16, computed on the MSCOCO task, show that, unlike in translation, conjunctions are not produced at the beginning of generation: they can no longer be copied from the source and instead depend on the sequence generated so far. Adjectives and adverbs are now mostly produced at the end of generation, as a further refinement. They are hard to generate first from the image alone, because it is hard to tell in advance which aspects of the image the model will describe in the generated sentence. There are two reasons for this: the encoder was not pretrained to first process image features that can be described with adjectives or adverbs, and it is usually hard to select the main feature among the several qualities of an image that can be described with words.
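As an illustration of the kind of statistic plotted in Figure 16, the sketch below groups the relative generation position of each token by its part-of-speech tag. It rests on assumptions that are not taken from the paper: the function name `relative_order_by_pos` is hypothetical, each token is assumed to carry the 0-based decoding step at which it was inserted, and the relative position is assumed to be that step divided by the index of the last step.

```python
from collections import defaultdict


def relative_order_by_pos(steps, pos_tags):
    """Group relative generation positions by part-of-speech tag.

    steps[i]    -- 0-based decoding step at which token i was inserted
                   (assumed input format, not the paper's data layout).
    pos_tags[i] -- part-of-speech tag of token i.

    Returns a dict mapping each tag to a list of positions in [0, 1],
    where 0 means "generated first" and 1 means "generated last".
    """
    if len(steps) != len(pos_tags):
        raise ValueError("steps and pos_tags must have the same length")

    n = len(steps)
    by_tag = defaultdict(list)
    for step, tag in zip(steps, pos_tags):
        # Assumed normalisation: divide by the index of the last step.
        relative = step / (n - 1) if n > 1 else 0.0
        by_tag[tag].append(relative)
    return dict(by_tag)


# Toy caption with three tokens: the adjective is inserted last.
example = relative_order_by_pos([2, 0, 1], ["ADJ", "NOUN", "VERB"])
```

Pooling these lists over a whole data set and plotting a histogram per tag would give distributions of the same general shape as those discussed above, under the stated normalisation.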

Appendix B Decoding examples

Figure 10: Example of Ru-En decoding order.
Figure 11: Example of Ru-En decoding order.
Figure 12: Example of Ru-En decoding order.
Figure 13: Example of Ru-En decoding order.
Figure 14: Example of Ru-En decoding order.
Figure 15: Example of Ru-En decoding order.
Figure 16: The distributions of the relative generation order of different parts of speech on the MSCOCO data set.
Figure 17: Learned order dependency inversion fractions for several dependency types. (The dependency inversion fraction is the fraction of cases in which the dependent word is generated before the main word.)

1 The full dependency taxonomy is available on the Universal Dependencies website.

Figure 18: Example of Ru-En decoding order.
Figure 19: Example of Ru-En decoding order.
Figure 20: Example of Ru-En decoding order.
Figure 21: Example of MSCOCO decoding order.
Figure 22: Example of MSCOCO decoding order.
Figure 23: Example of MSCOCO decoding order.
Figure 24: Example of MSCOCO decoding order.
Figure 25: Example of MSCOCO decoding order.
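
The dependency inversion fraction defined in the caption of Figure 17 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `dependency_inversion_fraction` and the input format (per-token decoding steps plus a list of `(head, dependent, label)` arcs) are assumptions made for the example.

```python
def dependency_inversion_fraction(sentences, dep_type):
    """Fraction of arcs of type dep_type whose dependent precedes its head.

    sentences -- iterable of (steps, arcs) pairs (assumed format), where
                 steps[i] is the decoding step at which token i was generated
                 and arcs is a list of (head_index, dependent_index, label).
    dep_type  -- dependency label to evaluate, e.g. "amod".

    Returns None when no arc of the requested type occurs.
    """
    inverted = 0
    total = 0
    for steps, arcs in sentences:
        for head, dependent, label in arcs:
            if label != dep_type:
                continue
            total += 1
            # Inversion: the dependent word was generated before the main word.
            if steps[dependent] < steps[head]:
                inverted += 1
    return inverted / total if total else None


# Two toy sentences with hypothetical parses and decoding steps.
toy = [
    ([2, 0, 1], [(1, 0, "amod"), (1, 2, "nsubj")]),
    ([1, 0], [(0, 1, "amod")]),
]
amod_fraction = dependency_inversion_fraction(toy, "amod")
nsubj_fraction = dependency_inversion_fraction(toy, "nsubj")
```

In the toy data, one of the two `amod` arcs has its dependent generated before the main word, so the fraction for that label is one half, while the single `nsubj` arc is not inverted.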