Non-Monotonic Sequential Text Generation

02/05/2019 ∙ by Sean Welleck, et al. ∙ 8

Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy's own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order, while achieving competitive performance with conventional left-to-right generation.




1 Introduction

Most sequence-generation models, from n-grams (Bahl et al., 1983) to neural language models (Bengio et al., 2003), generate sequences in a purely left-to-right, monotonic order. This raises the question of whether alternative, non-monotonic orders are worth considering (Ford et al., 2018), especially given the success of “easy first” techniques in natural language tagging (Tsuruoka & Tsujii, 2005), parsing (Goldberg & Elhadad, 2010), and coreference (Stoyanov & Eisner, 2012), which allow a model to effectively learn its own ordering. In investigating this question, we are solely interested in non-monotonic generation that does not rely on external supervision, such as parse trees (Eriguchi et al., 2017; Aharoni & Goldberg, 2017).

Figure 1: A sequence, “how are you ?”, generated by the proposed approach trained on utterances from a dialogue dataset. The model first generated the word “are” and then recursively generated left and right subtrees (“how” and “you ?”, respectively) of this word. At each production step, the model may either generate a token, or an ⟨end⟩ token, which indicates that this subtree is complete. The full generation is performed in a level-order traversal, and the output is read off from an in-order traversal. The numbers in green squares (left) denote the order in which the nodes were generated (level-order); those in rounded blue squares (right) denote the nodes’ location in the final sequence (in-order).

In this paper, we propose a framework for training sequential text generation models which learn a generation order without having to specify an order in advance (§ 2). An example generation from our model is shown in Figure 1. We frame the learning problem as an imitation learning problem, in which we aim to learn a generation policy that mimics the actions of an oracle generation policy (§ 3). Because the tree structure is unknown, the oracle policy cannot know the exact correct actions to take; to remedy this we propose a method called annealed coaching which can yield a policy with learned generation orders, by gradually moving from imitating a maximum entropy oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed framework, it is possible to learn policies which generate text without pre-specifying a generation order, achieving easy first-style behavior. The policies achieve performance metrics that are competitive with or superior to conventional left-to-right generation in language modeling, word reordering, and machine translation (§ 5).

2 Non-Monotonic Sequence Generation

Formally, we consider the problem of sequentially generating a sequence of discrete tokens $Y = (w_1, \dots, w_N)$, such as a natural language sentence, where $w_i \in V$, a finite vocabulary. Let $\tilde{V} = V \cup \{\langle\text{end}\rangle\}$.

Unlike conventional approaches with a fixed generation order, often left-to-right (or right-to-left), our goal is to build a sequence generator that generates these tokens in an order it determines automatically, without any extra annotation or supervision of what might be a good order. We propose a method which does so by generating a word at an arbitrary position, then recursively generating words to its left and words to its right, yielding a binary tree like that shown in Figure 1.

We view the generation process as deterministically navigating a state space $\mathcal{S}$ where a state $s \in \mathcal{S}$ corresponds to a sequence of tokens from $\tilde{V}$. We interpret this sequence of tokens as a top-down traversal of a binary tree, where ⟨end⟩ terminates a subtree. The initial state $s_0$ is the empty sequence. For example, in Figure 1, $s_1 = \langle \text{are} \rangle$, $s_2 = \langle \text{are}, \text{how} \rangle$, …, $s_4 = \langle \text{are}, \text{how}, \text{?}, \langle\text{end}\rangle \rangle$. An action $a$ is an element of $\tilde{V}$ which is deterministically appended to the state. Terminal states are those for which all subtrees have been ⟨end⟩’ed. If a terminal state $s_T$ is reached, we have that $T = 2N + 1$, where $N$ is the number of words (non-⟨end⟩ tokens) in the tree. We use $\tau(t)$ to denote the level-order traversal index of the $t$-th node in an in-order traversal of a tree, so that $\langle a_{\tau(1)}, \dots, a_{\tau(T)} \rangle$ corresponds to the sequence of discrete tokens generated. The final sequence returned is this, postprocessed by removing all ⟨end⟩’s. In Figure 1, $\tau$ maps from the numbers in the blue squares to those in the green squares.
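The relationship between the level-order generation sequence and the in-order output can be illustrated with a short Python sketch. It is not the authors' code: the dict-based node layout, the helper names, and the concrete action list (one level-order sequence consistent with the Figure 1 caption) are assumptions made for illustration.

```python
from collections import deque

END = "<end>"  # stand-in for the special end token


def build_tree(actions):
    """Rebuild a binary tree from a complete level-order action sequence.

    Word nodes always receive two children; an <end> node closes its subtree.
    """
    it = iter(actions)
    root = {"token": next(it), "left": None, "right": None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node["token"] == END:
            continue  # nothing is generated below an <end> token
        for side in ("left", "right"):
            child = {"token": next(it), "left": None, "right": None}
            node[side] = child
            queue.append(child)
    return root


def in_order(node):
    """In-order traversal, including <end> tokens."""
    if node is None:
        return []
    return in_order(node["left"]) + [node["token"]] + in_order(node["right"])


def read_off(actions):
    """Final output: in-order traversal with all <end> tokens removed."""
    return [tok for tok in in_order(build_tree(actions)) if tok != END]


# Assumed level-order sequence for the "how are you ?" example.
actions = ["are", "how", "?", END, END, "you", END, END, END]
print(read_off(actions))  # ['how', 'are', 'you', '?']
```

The example sequence has 9 actions for 4 words, matching the count of one ⟨end⟩ leaf more than there are words.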

A policy $\pi$ is a (possibly) stochastic mapping from states to actions, and we denote the probability of an action $a \in \tilde{V}$ given a state $s$ as $\pi(a \mid s)$. A policy $\pi$’s behavior decides which and whether words appear before and after the token of the parent node. Typically there are many unique binary trees with an in-order traversal equal to a sequence $Y$. Each of these trees has a different level-order traversal, thus the policy is capable of choosing from many different generation orders for $Y$, rather than a single predefined order. Note that left-to-right generation can be recovered if $\pi(\langle\text{end}\rangle \mid s_t) = 1$ if and only if $t$ is odd (or non-zero and even for right-to-left generation).
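To give a sense of how many generation orders are available, the following sketch counts the binary trees whose in-order traversal equals a fixed sequence of `n` words. The count is the `n`-th Catalan number; the function name `num_trees` is ours and does not come from the paper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_trees(n):
    """Number of binary trees whose in-order traversal is a fixed n-word sequence."""
    if n == 0:
        return 1  # the empty subtree, i.e. a lone <end>
    # Pick the root: i words fall in the left subtree, n - 1 - i in the right.
    return sum(num_trees(i) * num_trees(n - 1 - i) for i in range(n))


print(num_trees(4))   # 14
print(num_trees(12))  # 208012
```

Left-to-right and right-to-left generation are two of these trees (the two chains), so even a 12-word sentence leaves the policy a very large set of alternatives.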

3 Learning for Non-Monotonic Generation

Learning in our non-monotonic sequence generation model (§ 2) amounts to inferring a policy $\pi$ from data. We first consider the unconditional generation problem (akin to language modeling) in which the data consists simply of sequences $Y$ to be generated. Subsequently (§ 4.1) we consider the conditional case in which we wish to learn a mapping from inputs $X$ to output sequences $Y$.

This learning problem is challenging because the sequences $Y$ alone only tell us what the final output sequences of words should be, but not what tree(s) should be used to get there. In left-to-right generation, the observed sequence $Y$ fully determines the sequence of actions to take. In our case, however, the tree structure is effectively a latent variable, which will be determined by the policy itself. This prevents us from using conventional supervised learning for training the parameterized policy. On the other hand, at training time, we do know which words should eventually appear, and their order; this substantially constrains the search space that needs to be explored, suggesting learning-to-search (Daumé et al., 2009) and imitation learning (Ross et al., 2011; Ross & Bagnell, 2014) as a learning strategy (see footnote 1).

Footnote 1: One could consider applying reinforcement learning to this problem. This would ignore the fact that at training time we know which words will appear, reducing the size of the feasible search space from $O(|V|^T)$ to $O(|X|^T)$, a huge savings. Furthermore, even with a fixed generation order, RL has proven to be difficult without partially relying on supervised learning (Ranzato et al., 2015; Bahdanau et al., 2015b, 2016).

Figure 2: A sampled tree for the sentence “a b c d” with an action space $\tilde{V} = (a, b, c, d, e, \langle\text{end}\rangle)$, showing an oracle’s distribution $\pi^*$ and consecutive subsequences (“valid actions”) $Y_t$ for $t \in \{0, 1, 2, 3, 6\}$. Each oracle distribution is depicted as 6 boxes showing $\pi^*(a_{t+1} \mid s_t)$ (lighter = higher probability). After b is sampled at the root, two empty left and right child nodes are created, associated with valid actions (a) and (c, d), respectively. Here, $\pi^*$ only assigns positive probability to tokens in $Y_t$.

The key idea in our imitation learning framework is that at the first step, an oracle policy’s action is to produce any word $w$ that appears anywhere in $Y$. Once picked, in a quicksort-esque manner, all words to the left of $w$ in $Y$ are generated recursively on the left (following the same procedure), and all words to the right of $w$ in $Y$ are generated recursively on the right. (See Figure 2 for an example.) Because the oracle is non-deterministic (many “correct” actions are available at any given time), we inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy, inspired by work in direct loss minimization (Hazan et al., 2010) and related techniques (Chiang, 2012; He et al., 2012).

3.1 Background: Learning to Search

In learning-to-search-style algorithms, we aim to learn a policy $\pi$ that mimics an oracle (or “reference”) policy $\pi^*$. To do so, we define a roll-in policy $\pi^{\text{in}}$ and roll-out policy $\pi^{\text{out}}$. We then repeatedly draw states $s$ according to the state distribution induced by $\pi^{\text{in}}$, and compute cost-to-go under $\pi^{\text{out}}$, for all possible actions $a$ at that state. The learned policy $\pi$ is then trained to choose actions to minimize this cost-to-go estimate.

Formally, denote the uniform distribution over $\{1, \dots, T\}$ as $U[T]$ and denote by $d^t_{\pi}$ the distribution of states induced by running $\pi$ for $t$-many steps. Denote by $C(\pi; \pi^{\text{out}}, s)$ a scalar cost measuring the loss incurred by $\pi$ against the cost-to-go estimates under $\pi^{\text{out}}$ (for instance, $C$ may measure the squared error between the vector $\pi(s)$ and the cost-to-go estimates). Then, the quantity being optimized is:

$$\mathbb{E}_{Y \sim D} \; \mathbb{E}_{t \sim U[2|Y|+1]} \; \mathbb{E}_{s_t \sim d^t_{\pi^{\text{in}}}} \left[ C(\pi; \pi^{\text{out}}, s_t) \right] \tag{1}$$

Here, $\pi^{\text{in}}$ and $\pi^{\text{out}}$ can use information not available at test-time (e.g., the ground-truth $Y$). Learning consists of finding a policy which only has access to states $s_t$ but performs as well or better than $\pi^*$. By varying the choice of $\pi^{\text{in}}$, $\pi^{\text{out}}$, and $C$, one obtains different variants of learning-to-search algorithms, such as DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014) or LOLS (Chang et al., 2015).

In the remainder of this section, we describe the cost function we use, a set of oracle policies and a set of roll-in policies, both of which are specifically designed for the proposed problem of non-monotonic sequential generation of a sequence. These sets of policies are empirically evaluated later in the experiments (§ 5).

3.2 Cost Measurement

There are many ways to measure the prediction cost $C(\pi; \pi^{\text{out}}, s)$; arguably the most common is squared error between cost-predictions by $\pi$ and observed costs obtained by $\pi^{\text{out}}$ at the state $s$. However, recent work has found that, especially when dealing with recurrent neural network policies (which we will use; see § 4), using a cost function more analogous to a cross-entropy loss can be preferred (Leblond et al., 2018; Cheng et al., 2018; Welleck et al., 2018). In particular, we use a KL-divergence type loss, measuring the difference between the action distribution produced by $\pi$ and the action distribution preferred by $\pi^{\text{out}}$:

$$C(\pi; \pi^{\text{out}}, s) = D_{\text{KL}}\left( \pi^{\text{out}}(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \right) = -\sum_{a \in \tilde{V}} \pi^{\text{out}}(a \mid s) \log \pi(a \mid s) + \text{const.} \tag{2}$$

Our approach estimates the loss in Eq. (1) by first sampling one training sequence, running the roll-in policy for $t$ steps, and computing the KL divergence (2) at that state using $\pi^*$ as $\pi^{\text{out}}$. Learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of $\pi$.
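The per-state cost can be sketched in a few lines of Python. This is an illustrative stand-alone version rather than the training code: the dictionary representation of distributions, the function name `kl_cost`, and the `eps` floor on policy probabilities are all assumptions.

```python
import math


def kl_cost(oracle_probs, policy_probs, eps=1e-12):
    """KL(oracle || policy) for distributions given as dicts action -> probability."""
    cost = 0.0
    for action, p in oracle_probs.items():
        if p > 0.0:  # actions with zero oracle mass contribute nothing
            q = max(policy_probs.get(action, 0.0), eps)  # avoid log(0)
            cost += p * (math.log(p) - math.log(q))
    return cost


# The oracle spreads its mass over two valid actions; the policy wastes half
# of its mass on an invalid action "c".
oracle = {"a": 0.5, "b": 0.5}
policy = {"a": 0.25, "b": 0.25, "c": 0.5}
print(kl_cost(oracle, policy))  # log(2), about 0.693
```

Because the oracle term does not depend on the policy parameters, minimizing this quantity is equivalent to minimizing the cross-entropy between the oracle and policy distributions.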

3.3 Roll-In Policies

The roll-in policy $\pi^{\text{in}}$ determines the state distribution over which the learned policy $\pi$ is to be trained. In most formal analyses, the roll-in policy is a stochastic mixture of the learned policy $\pi$ and the oracle policy $\pi^*$, ensuring that $\pi$ is eventually trained on its own state distribution (Daumé et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014; Chang et al., 2015). Despite this, experimentally, it has often been found that simply using the oracle’s state distribution is optimal (Ranzato et al., 2015; Leblond et al., 2018). This is likely because the noise incurred early on in learning by using $\pi$’s state distribution is not overcome by the benefit of matching state distributions, especially when the policy class is sufficiently high capacity so as to be nearly realizable on the training data (Leblond et al., 2018). In preliminary experiments, we observed the same is true in our setting: simply rolling in according to the oracle policy (§ 3.4) yielded the best results experimentally. Therefore, despite the fact that this can lead to inconsistency in the learned model (Chang et al., 2015), all experiments are with oracle roll-ins.

3.4 Oracle Policies

In this section we formalize the oracle policies that we consider. To simplify the discussion (we assume that the roll-in distribution is the oracle), we only need to define an oracle policy that takes actions on states it, itself, visits. All the oracles we consider have access to the ground truth output $Y$, and the current state $s$. We interpret the state $s$ as a partial binary tree and a “current node” in that binary tree where the next prediction will go. It is easiest to consider the behavior of the oracle as a top-down, level-order traversal of the tree, where in each state it maintains a sequence of “possible tokens” at that state. An oracle policy $\pi^*(\cdot \mid s_t)$ is defined with respect to $Y_t$, a consecutive subsequence of $Y$. At $s_0 = \langle\rangle$, $\pi^*$ uses the full $Y_0 = Y$. This is subdivided as the tree is descended. At each state $s_t$, $Y_t$ contains “valid actions”; labeling the current node with any token from $Y_t$ keeps the generation leading to $Y$. For instance, in Figure 2, after sampling b for the root, the valid actions $(a, b, c, d)$ are split into $(a)$ for the left child and $(c, d)$ for the right child.

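Given the consecutive subsequence $Y_t$, an oracle policy is defined as:

$$\pi^*(a \mid s_t) = \begin{cases} 1 & \text{if } a = \langle\text{end}\rangle \text{ and } Y_t = \langle\rangle \\ p_a & \text{if } a \in Y_t \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where the $p_a$s are arbitrary such that $\sum_{a \in Y_t} p_a = 1$. An oracle policy places positive probability only on valid actions, and forces an ⟨end⟩ output if there are no more words to produce. This is guaranteed to always generate $Y$, regardless of how the random coin flips come up.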

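When an action $a$ is chosen, at $s_t$, this “splits” the sub-sequence $Y_t = (w'_1, \dots, w'_{N'})$ into left and right sub-sequences, $\overleftarrow{Y}_t = (w'_1, \dots, w'_{i-1})$ and $\overrightarrow{Y}_t = (w'_{i+1}, \dots, w'_{N'})$, where $i$ is the index of $a$ in $Y_t$. (This split may not be unique due to duplicated words in $Y_t$, in which case we choose a valid split arbitrarily.) These are “passed” to the left and right child nodes, respectively.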
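The splitting procedure can be sketched as a level-order rollout that keeps, for each pending node, the consecutive subsequence of still-valid words. This is a minimal illustration under stated assumptions: the function name `oracle_rollout` is ours, and picking a position uniformly at random is just one admissible choice of oracle probabilities (it weights a repeated word by its count).

```python
import random
from collections import deque

END = "<end>"


def oracle_rollout(target, rng=random):
    """Sample a level-order action sequence whose in-order read-off is `target`.

    Each queue entry is the consecutive subsequence of valid actions for the
    node that will be generated next.
    """
    actions = []
    queue = deque([list(target)])
    while queue:
        valid = queue.popleft()
        if not valid:
            actions.append(END)        # no words left: <end> is forced
            continue
        i = rng.randrange(len(valid))  # any valid word keeps the target reachable
        actions.append(valid[i])
        queue.append(valid[:i])        # words to the left go to the left child
        queue.append(valid[i + 1:])    # words to the right go to the right child
    return actions


rollout = oracle_rollout("a b c d".split(), random.Random(0))
print(rollout)
```

Whatever the random choices, the rollout contains every target word once and ends after exactly `2 * len(target) + 1` actions.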

There are many possible oracle policies, and each of them is characterized by how $p_a$ in Eq. (3) is defined. Specifically, we propose three variants.

Uniform Oracle.

Motivated by Welleck et al. (2018), who applied learning-to-search to the problem of multiset prediction, we design a uniform oracle $\pi^*_{\text{uniform}}$. This oracle treats all possible generation orders that lead to the target sequence $Y$ as equally likely, without preferring any specific set of orders. Formally, $\pi^*_{\text{uniform}}$ gives uniform probabilities $p_a = 1/n$ for all words in $Y_t$, where $n$ is the number of unique words in $Y_t$. (Daumé (2009) used a similar oracle for unsupervised structured prediction, which has a similar non-deterministic oracle complication.)

Coaching Oracle.

An issue with the uniform oracle is that it does not prefer any specific set of generation orders, making it difficult for a parameterized policy to imitate. This gap has been noticed as a factor behind the difficulty in learning-to-search by He et al. (2012), who propose the idea of coaching. In coaching, the oracle takes into account the preference of a parameterized policy in order to facilitate its learning. Motivated by this, we design a coaching oracle as the product of the uniform oracle and current policy $\pi$:

$$\pi^*_{\text{coaching}}(a \mid s) \propto \pi^*_{\text{uniform}}(a \mid s) \, \pi(a \mid s)$$

This coaching oracle ensures that no invalid action is assigned any probability, while preferring actions that are preferred by the current parameterized policy, reinforcing the selection by the current policy if it is valid.
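A small sketch of how such a product could be computed for a single state follows. The dictionary representation, the function names, and the fallback to the uniform oracle when the policy assigns zero mass to every valid action are assumptions for illustration.

```python
END = "<end>"


def uniform_oracle(valid):
    """Uniform probabilities over the unique valid words, or <end> if none remain."""
    if not valid:
        return {END: 1.0}
    unique = set(valid)
    return {w: 1.0 / len(unique) for w in unique}


def coaching_oracle(valid, policy_probs):
    """Product of the uniform oracle and the current policy, renormalized."""
    uniform = uniform_oracle(valid)
    scores = {a: p * policy_probs.get(a, 0.0) for a, p in uniform.items()}
    total = sum(scores.values())
    if total == 0.0:
        return uniform  # assumed fallback: the policy ignores every valid action
    return {a: s / total for a, s in scores.items()}


policy = {"a": 0.1, "c": 0.6, "d": 0.2, END: 0.1}
coached = coaching_oracle(["c", "d"], policy)
print(coached["c"], coached["d"])
```

In this example the invalid actions `a` and `<end>` receive no probability, while `c` is favoured over `d` in the same 3:1 ratio the policy itself prefers.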

Annealed Coaching Oracle.

The multiplicative nature of the coaching oracle gives rise to an issue, especially in the early stage of learning, as it does not encourage learning to explore a diverse set of generation orders. We thus design a mixture of the uniform and coaching policies, which we refer to as an annealed coaching oracle:

$$\pi^*_{\text{annealed}}(a \mid s) = \beta \, \pi^*_{\text{uniform}}(a \mid s) + (1 - \beta) \, \pi^*_{\text{coaching}}(a \mid s)$$

We anneal $\beta$ from 1 to 0 over learning, on a linear schedule.
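The mixture and its schedule can be sketched as below, assuming a mixing weight that starts at 1 (pure uniform oracle) and decays linearly to 0 (pure coaching oracle). The function names and the use of a step counter with a fixed `total_steps` are illustrative assumptions; the source only states that the schedule is linear.

```python
def beta_schedule(step, total_steps):
    """Linearly anneal the mixing weight from 1 to 0 over `total_steps` updates."""
    return max(0.0, 1.0 - step / total_steps)


def annealed_oracle(uniform_probs, coaching_probs, beta):
    """beta * uniform + (1 - beta) * coaching, over the union of their supports."""
    actions = set(uniform_probs) | set(coaching_probs)
    return {
        a: beta * uniform_probs.get(a, 0.0) + (1.0 - beta) * coaching_probs.get(a, 0.0)
        for a in actions
    }


uniform = {"c": 0.5, "d": 0.5}
coaching = {"c": 0.75, "d": 0.25}
halfway = annealed_oracle(uniform, coaching, beta_schedule(50, 100))
print(halfway["c"], halfway["d"])  # 0.625 0.375
```

Early in training the target is close to uniform, which encourages exploring many generation orders; later it increasingly reinforces the policy's own valid preferences.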

Deterministic Left-to-Right Oracle.

In addition to the proposed oracle policies above, we also experiment with a deterministic oracle that corresponds to generating the target sequence from left to right: $\pi^*_{\text{left-right}}$ always selects the first un-produced word as the correct action, with probability 1. When both roll-in and oracle policies are set to the left-to-right oracle $\pi^*_{\text{left-right}}$, the proposed approach recovers maximum likelihood learning of an autoregressive sequence model, which is the de facto standard in neural sequence modeling. In other words, supervised learning of an autoregressive sequence model is a special case of the proposed approach.

4 Neural Net Policy Structure

We use a neural network to implement the proposed binary tree generating policy, as it has been shown to encode a variable-sized input and predict a structured output effectively (Cleeremans et al., 1989; Forcada & Ñeco, 1997; Sutskever et al., 2014; Cho et al., 2014b; Tai et al., 2015; Bronstein et al., 2017; Battaglia et al., 2018). This neural network takes as input a partial binary tree, or equivalently a sequence of nodes in this partial tree by level-order traversal, and outputs a distribution over the action set $\tilde{V}$. The policy we use is implemented as a recurrent network with long short-term memory (LSTM) units (Hochreiter & Schmidhuber, 1997) by considering the partial binary tree as a flat sequence of nodes in a level-order traversal $(a_1, \dots, a_t)$. The recurrent network encodes the sequence into a vector $h_t$ and computes a categorical distribution over the action set:

$$\pi(a \mid s_t) \propto \exp\left( u_a^\top h_t + b_a \right)$$

where $u_a$ and $b_a$ are weights and bias associated with $a$.
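The output layer is a standard softmax over per-action scores. A dependency-free sketch, with a plain list standing in for the LSTM's encoding of the state and hypothetical toy weights and biases, might look like this:

```python
import math


def policy_distribution(h, weights, biases):
    """Softmax over actions, with score(a) = dot(weights[a], h) + biases[a]."""
    logits = {
        a: sum(u * x for u, x in zip(weights[a], h)) + biases[a] for a in weights
    }
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {a: math.exp(z - m) for a, z in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


# Toy two-dimensional state encoding and three actions (all values are made up).
h = [1.0, -1.0]
weights = {"how": [0.5, 0.0], "are": [1.0, 1.0], "<end>": [0.0, -0.5]}
biases = {"how": 0.0, "are": 0.0, "<end>": 0.0}
probs = policy_distribution(h, weights, biases)
print(probs)
```

In the actual model the vector `h` would come from the recurrent network run over the level-order sequence, which this sketch does not attempt to reproduce.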

This LSTM structure relies entirely on the linearization of a partial binary tree, and minimally takes advantage of the actual tree structure or the surface order. It is possible to exploit the tree structure more thoroughly by using recent neural networks that are specifically designed to encode a tree (Zhang et al., 2015; Alvarez-Melis & Jaakkola, 2017; Dyer et al., 2015; Bowman et al., 2016). We do not consider these in this paper, but leave them for future investigation. We did, however, experiment with additionally conditioning $\pi$’s action distribution on the parent of the current node in the tree, but preliminary experiments did not show gains.

4.1 Conditional Sentence Generation

An advantage of using a neural network to implement the proposed policy is that it can be easily conditioned on extra context. This allows us to build a conditional non-monotonic sequence generator that can, for instance, be used for machine translation, image caption generation, speech recognition, and multimedia description generation in general (Cho et al., 2015). To do so, we assume that a conditioning input $X$ (e.g. an image or sentence) can be represented as a context vector. To obtain these vector representations, we learn an encoder function $f^{\text{enc}}$ and use its output to initialize the LSTM policy’s state. For machine translation experiments (§ 5.4) the encoder additionally outputs a set of vectors, and the policy computes an additional context vector at each step using a learned attention function, which is then combined with the policy’s state. The encoder’s parameters are learned jointly with the policy’s.

5 Experimental Results

In this section we experiment with our non-monotone sequence generation model across four tasks. The first two are unconditional generation tasks: language modeling (§ 5.1) and out-of-order sentence completion (§ 5.2). Our analysis in these tasks is primarily qualitative: we seek to understand what the non-monotone policy is learning and how it compares to a left-to-right model. The second two tasks are conditional generation tasks, which generate output sequences based on some given input sequence: word reordering (§ 5.3) and machine translation (§ 5.4).

5.1 Language Modeling

We begin by considering generating samples from our model, trained as a language model. Our goal in this section is to qualitatively understand what our model has learned. It would be natural also to evaluate our model according to a score like perplexity. Unfortunately, unlike conventional autoregressive language models, it is intractable to compute the probability of a given sequence in the non-monotonic generation setting, as it requires us to marginalize out all possible binary trees that lead to the sequence.


We use a dataset derived from the Persona-Chat (Zhang et al., 2018) dialogue dataset, which consists of multi-turn dialogues between two agents. Our dataset here consists of all unique persona sentences and utterances in Persona-Chat. We derive the examples from the same train, validation, and test splits as Persona-Chat, resulting in 133,176 train, 16,181 validation, and 15,608 test examples. Sentences are tokenized by splitting on spaces and punctuation. The training set has a vocabulary size of 20,090 and an average of 12.0 tokens per example.


We use a uni-directional LSTM that has 2 layers of 1024 LSTM units. See Appendix A for more details.

Oracle       %Novel   %Unique   Avg. Tokens   Avg. Span   Bleu
left-right     17.8      97.0          11.9        1.0    47.0
uniform        98.3      99.9          13.0        1.43   40.0
annealed       93.1      98.2          10.6        1.31   56.2
Validation     97.0     100            12.1        -      -

Table 1: Statistics computed over 10,000 sampled sentences (in-order traversals of sampled trees with ⟨end⟩ tokens removed) for policies trained on Persona-Chat. A sample is novel when it is not in the training set. Percent unique is the cardinality of the set of sampled sentences divided by the number of sampled sentences.


hey there , i should be !
not much fun . what are you doing ?
not . not sure if you .
i love to always get my nails done .
sure , i can see your eye underwater
  while riding a footwork .


i just got off work .
yes but believe any karma , it is .
i bet you are . i read most of good tvs
  on that horror out . cool .
sometimes , for only time i practice
  professional baseball .
i am rich , but i am a policeman .


i do , though . do you ?
i like iguanas . i have a snake . i wish
  i could win . you ?
i am a homebody .
i care sometimes . i also snowboard .
i am doing okay . just relaxing ,
  and you ?
Table 2: Samples from unconditional generation policies trained on Persona-Chat for each training oracle. The first sample’s underlying tree is shown. See Appendix B.1.2 for more samples.
Basic Statistics.

We draw 10,000 samples from each trained policy (by varying the oracle) and analyze the results using the following metrics: percentage of novel sentences, percentage of unique sentences, average number of tokens, average span size (see footnote 2), and Bleu (Table 1). We use Bleu to quantify the sample quality by computing the Bleu score of the samples using the validation set as reference, following Yu et al. (2016) and Zhu et al. (2018). In Appendix B.1.3 we report additional scores. We see that the non-monotonically trained policies generate many more novel sentences, and build trees that are bushy (average spans of 1.31 to 1.43), but not complete binary trees. The policy trained with the annealed oracle is most similar to the validation data according to Bleu.

Footnote 2: The average span is the average number of children for non-leaf nodes excluding the special token ⟨end⟩, ranging from 1.0 (chain, as induced by the left-right oracle) to 2.0 (full binary tree).
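One way to compute the average span from a level-order action sequence is sketched below. It reflects our reading of the metric (count word children only, and average over word nodes that have at least one), and relies on the fact that in a complete level-order sequence the children of the k-th word node (0-based) sit at positions 2k + 1 and 2k + 2.

```python
END = "<end>"


def average_span(actions):
    """Average number of word children over word nodes that have at least one."""
    num_words = sum(tok != END for tok in actions)
    spans = []
    for k in range(num_words):
        children = actions[2 * k + 1: 2 * k + 3]
        word_children = sum(tok != END for tok in children)
        if word_children > 0:  # skip leaves, whose children are both <end>
            spans.append(word_children)
    return sum(spans) / len(spans) if spans else 0.0


chain = ["a", END, "b", END, "c", END, END]       # left-to-right order
full = ["b", "a", "c", END, END, END, END]        # full binary tree
print(average_span(chain), average_span(full))    # 1.0 2.0
```

The two extremes reproduce the range stated for the metric: a chain gives 1.0 and a full binary tree gives 2.0.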

Content Analysis.

We investigate the content of the models in Table 2, which shows samples from policies trained with different oracles. None of the displayed samples is part of the training set. We provide additional samples organized by length in Appendix Tables 2 and 3, and samples showing the underlying trees that generated them in Appendix Figures 3-5. In Appendix B.1.1, we additionally examine word frequencies and part-of-speech tag frequencies, and find that the samples from each policy typically follow the validation set’s word and tag frequencies.

Figure 3: POS tag counts by tree-depth, computed by tagging 10,000 sampled sentences. Counts are normalized across each row (depth), then the marginal tag probabilities are subtracted. A light value means the probability of the tag occurring at that depth is higher than the prior probability of the tag occurring.

Generation Order.

We analyze the generation order of our various models by inspecting the part-of-speech (POS) tags each model tends to put at different tree depths (i.e. number of edges from node to root). Figure 3 shows POS counts by tree depth, normalized by the sum of counts at each depth (we only show the four most frequent POS categories). We also show POS counts for the validation set’s dependency trees, obtained with an off-the-shelf parser. Not surprisingly, policies trained with the uniform oracle tend to generate words with a variety of POS tags at each level. Policies trained with the annealed oracle, on the other hand, learned to frequently generate punctuation at the root node, often either the sentence-final period or a comma, in an “easy first” style, since most sentences contain a period. Furthermore, we see that the policy trained with the annealed oracle tends to generate a pronoun before a noun or a verb (tree depth 1), which is a pattern that policies trained with the left-right oracle also learn. Nouns typically appear in the middle of the trees produced by the annealed-oracle policy. Aside from verbs, the annealed policy’s trees, which have punctuation and pronouns near the root and nouns deeper, follow a structure similar to that of the dependency trees.

5.2 Sentence Completion

A major weakness of the conventional autoregressive model, especially with unbounded context, is that it cannot be easily used to fill in missing parts of a sentence except at the end. This is especially true when the number of tokens per missing segment is not given in advance. Achieving this requires significant changes to model architecture, learning, and inference (Berglund et al., 2015).

Our proposed approach, on the other hand, can naturally fill in missing segments in a sentence. Using models trained as language models from the previous section (§ 5.1), we can achieve this by initializing a binary tree with observed tokens in a way that they respect their relative positions. For instance, the first example shown in Table 3 can be seen as the template “___ favorite ___ food ___ ! ___” with variable-length missing segments. Generally, an initial tree with nodes $(w_i, \dots, w_k)$ ensures that each $w_j$ appears in the completed sentence, and that $w_i$ appears at some position to the left of $w_j$ in the completed sentence when $w_i$ is a left-descendant of $w_j$ (analogously for right-descendants).
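A rough sketch of completion from a seed tree follows. Everything here is illustrative: the dict-based nodes, the right-leaning chain used to seed “favorite food !”, and especially `sample_action`, a hypothetical stand-in for sampling from a trained policy given the level-order prefix.

```python
from collections import deque

END = "<end>"


def node(token, left=None, right=None):
    return {"token": token, "left": left, "right": right}


def complete(seed_root, sample_action):
    """Fill every open child slot of a seed tree, visiting nodes in level order."""
    prefix = [seed_root["token"]]
    queue = deque([seed_root])
    while queue:
        current = queue.popleft()
        if current["token"] == END:
            continue
        for side in ("left", "right"):
            if current[side] is None:  # open slot: ask the (stand-in) policy
                current[side] = node(sample_action(prefix))
            prefix.append(current[side]["token"])  # seed tokens are kept as-is
            queue.append(current[side])
    return seed_root


def in_order_words(n):
    if n is None or n["token"] == END:
        return []
    return in_order_words(n["left"]) + [n["token"]] + in_order_words(n["right"])


# Seed words keep their relative order by hanging each one to the right.
seed = node("favorite", right=node("food", right=node("!")))
scripted = iter(["my"])  # toy policy: emit "my" once, then always <end>
tree = complete(seed, lambda prefix: next(scripted, END))
print(in_order_words(tree))  # ['my', 'favorite', 'food', '!']
```

Because seed nodes are never overwritten, every observed token survives and the left-to-right order among them is fixed by the seed tree; only the open slots vary between samples.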

To quantify the completion quality, we first create a collection of initial trees by randomly sampling three words from each sentence from the Persona-Chat validation set of § 5.1. We then sample one completion for each initial tree and measure the Bleu of each sample using the validation set as reference as in § 5.1. According to Bleu, the policy trained with the annealed oracle sampled completions that were more similar to the validation data (Bleu 44.7) than completions from the policies trained with the uniform (Bleu 38.9) or left-to-right (Bleu 14.3) oracles.

In Table 3, we present some sample completions using the policy trained with the uniform oracle. The completions illustrate a property of the proposed non-monotonic generation that is not available in left-to-right generation.

Initial Tree Samples
lasagna is my favorite food !
my favorite food is mac and cheese !
what is your favorite food ? pizza , i love it !
whats your favorite food ? mine is pizza !
seafood is my favorite . and mexican food !
  what is yours ?
hello ! i like classical music . do you ?
hello , do you enjoy playing music ?
hello just relaxing at home listening to
  fine music . you ?
hello , do you like to listen to music ?
hello . what kind of music do you like ?
i am a doctor or a lawyer .
i would like to feed my doctor , i aspire
  to be a lawyer .
i am a doctor lawyer . 4 years old .
i was a doctor but went to a lawyer .
i am a doctor since i want to be a lawyer .
Table 3: Sentence completion samples from a policy trained on Persona-Chat with the uniform oracle. The left column shows the initial seed tree. In the sampled sentences, seed words are bold.

5.3 Word Reordering

We first evaluate the proposed models for conditional generation on the Word Reordering task, also known as Bag Translation (Brown et al., 1990) or Linearization (Schmaltz et al., 2016). In this task, a sentence $Y = (w_1, \dots, w_N)$ is given as an unordered collection $X = \{w_1, \dots, w_N\}$, and the task is to reconstruct $Y$ from $X$. We assemble a dataset of $(X, Y)$ pairs using sentences from the Persona-Chat sentence dataset of § 5.1. In our approach, we do not explicitly force the policies trained with our non-monotonic oracles to produce a permutation of the input and instead let them learn this automatically.


For encoding each unordered input $X = \{w_1, \dots, w_N\}$, we use a simple bag-of-words encoder: $f^{\text{enc}}(\{w_1, \dots, w_N\}) = \frac{1}{N} \sum_{i=1}^{N} \text{emb}(w_i)$. We implement $\text{emb}$ using an embedding layer followed by a linear transformation. The embedding layer is initialized with GloVe (Pennington et al., 2014) vectors and updated during training. As the policy (decoder) we use a flat LSTM with 2 layers of 1024 LSTM units. The decoder hidden state is initialized with a linear transformation of $f^{\text{enc}}(X)$.
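As a minimal sketch, assuming the bag-of-words encoder averages per-word embeddings, the key property is that the encoding does not depend on the order of the input words. The toy embedding table below is made up and stands in for the learned embedding layer and linear transformation.

```python
def bag_of_words_encode(words, emb):
    """Average the embedding vectors of an unordered collection of words."""
    vectors = [emb[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical 2-dimensional embeddings, chosen so the arithmetic is exact.
emb = {"how": [1.0, 0.0], "are": [0.0, 2.0], "you": [2.0, 4.0]}
print(bag_of_words_encode(["how", "are", "you"], emb))  # [1.0, 2.0]
print(bag_of_words_encode(["you", "how", "are"], emb))  # [1.0, 2.0]
```

Since the encoder discards order entirely, recovering the sentence is left to the decoder policy.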

                   Validation                 Test
Oracle        Bleu    F1      EM       Bleu    F1      EM
left-right    46.6    0.910   0.230    46.3    0.903   0.208
uniform       44.7    0.968   0.209    44.3    0.960   0.197
annealed      46.8    0.960   0.230    46.0    0.950   0.212

Table 4: Word Reordering results on Persona-Chat, reported according to Bleu score, F1 score, and exact match (EM) on validation and test data.

Table 4 shows Bleu, F1 score, and exact match for policies trained with each oracle. On the test set, the uniform and annealed policies outperform the left-right policy in F1 score (0.960 and 0.950 vs. 0.903). The policy trained using the annealed oracle also matches the left-right policy’s performance in terms of Bleu score (46.0 vs. 46.3) and exact match (0.212 vs. 0.208). The model trained with the uniform oracle does not fare as well on Bleu or exact match. See Appendix Figure 6 for example predictions.

Easy-First Analysis.

Figure 4 shows the entropy of each model as a function of depth in the tree (normalized to fall in $[0, 1]$). The left-right-trained policy has high entropy on the first word and then drops dramatically as additional conditioning from prior context kicks in. The uniform-trained policy exhibits similar behavior. The annealed-trained policy, however, makes its highest confidence (“easiest”) predictions at the beginning (consistent with Figure 3) and defers harder decisions until later.
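The text does not spell out how the entropy is normalized. One common convention, assumed here, divides the Shannon entropy by the log of the number of actions, so that a uniform distribution scores 1 and a deterministic one scores 0.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of `probs` divided by log(len(probs)); lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n)


print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: maximal entropy
print(normalized_entropy([0.97, 0.01, 0.01, 0.01]))  # confident: close to 0
```

Under this convention, an “easy first” policy would show low values at shallow depths and higher values deeper in the tree.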

Figure 4: Normalized entropy of $\pi(\cdot \mid s)$ as a function of tree depth for policies trained with each of the oracles. The anneal-trained policy, unlike the others, makes low entropy decisions early.

5.4 Machine Translation

Data and Preprocessing.

We evaluate the proposed models on the IWSLT’16 German → English (196k pairs) translation task. The data sets consist of TED talks. We use TED tst2013 as a validation dataset and tst2014 as test. We use the default Moses tokenizer script (Koehn et al., 2007) and segment each word into subwords using BPE (Sennrich et al., 2015), creating 40k tokens for both source and target. Similar to Bahdanau et al. (2015a), we filter sentence pairs that exceed 50 words and shuffle mini-batches.

                            Validation                                Test
Oracle            Bleu (BP)      Meteor   YiSi    Ribes     Bleu (BP)      Meteor   YiSi    Ribes
left-right        29.47 (0.97)   29.66    52.03   82.55     26.23 (1.00)   27.87    47.58   79.85
uniform           14.97 (0.63)   21.76    41.62   77.70     13.17 (0.64)   19.87    36.48   75.36
 +⟨end⟩-tuning    18.79 (0.89)   25.30    46.23   78.49     17.68 (0.96)   24.53    42.46   74.12
annealed          19.50 (0.71)   26.57    48.00   81.48     16.94 (0.72)   23.15    42.39   78.99
 +⟨end⟩-tuning    21.95 (0.90)   26.74    49.01   81.77     19.19 (0.91)   25.24    43.98   79.24

Table 5: Results of machine translation experiments for different training oracles across four different evaluation metrics.

Model & Training.

We use a bi-directional LSTM encoder-decoder architecture that has a single layer of size 512, with global concat attention (Luong et al., 2015). The learning rate for all of our models is initialized to 0.001 and multiplied by a factor of 0.5 on a fixed interval.


In preliminary results on validation data, we found that the annealed-trained models tended to overproduce ⟨end⟩ tokens. This likely happens because roughly 50% of the examples seen have ⟨end⟩ as the correct action (a binary tree with n word nodes has n + 1 ⟨end⟩ leaves), which makes ⟨end⟩ much more reliable than any other word, and the classifier learns to favor ⟨end⟩ too strongly as a result. This is reminiscent of other settings in which “learning to stop” can be difficult (Misra et al., 2017). To address this, we tune a linear offset in the policy logits, just for the ⟨end⟩ token. Formally, the logits are unchanged for every action other than ⟨end⟩, while the logit for ⟨end⟩ is replaced with a linearly offset version whose scalars are tuned on the validation data by grid search.
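The tuning step for the end-of-subtree logit can be illustrated with a small sketch. The exact form of the offset is not recoverable from the text here, so the affine form `scale * logit + shift` is an assumption, as are the function names, the index of the end token, and the toy logits. In practice the score would be a validation metric such as Bleu rather than the toy accuracy used below.

```python
import itertools

END = 0  # hypothetical index of the end token in the action vocabulary

def adjust_end_logit(logits, scale, shift, end_index=END):
    """Return a copy of the logits with only the end-token logit changed.

    Assumption: the offset is affine, scale * logit + shift.
    """
    adjusted = list(logits)
    adjusted[end_index] = scale * logits[end_index] + shift
    return adjusted

def grid_search(validation_score, scales, shifts):
    """Return the (scale, shift) pair with the highest validation score."""
    return max(itertools.product(scales, shifts),
               key=lambda pair: validation_score(*pair))

# Toy validation data: the first example wrongly prefers the end token.
toy_logits = [[2.0, 1.5, 0.1], [1.0, 1.2, 0.3], [3.0, 0.5, 1.5]]
toy_gold = [1, 1, 0]

def toy_score(scale, shift):
    """Number of toy examples whose greedy action matches the gold action."""
    correct = 0
    for logits, gold in zip(toy_logits, toy_gold):
        adjusted = adjust_end_logit(logits, scale, shift)
        predicted = max(range(len(adjusted)), key=adjusted.__getitem__)
        correct += int(predicted == gold)
    return correct

best = grid_search(toy_score, scales=[1.0, 2.0], shifts=[0.0, -1.0, -2.0])
print(best)
```

Because only one logit is modified, the relative ranking of all word actions is untouched; the search only trades off stopping too early against stopping too late.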


Results on validation and test data are in Table 5 according to four (very) different evaluation measures: Bleu, Meteor (Lavie & Agarwal, 2007), YiSi (Lo, 2018), and Ribes (Isozaki et al., 2010). The most dramatic score difference is the drastically superior performance of left-right according to Bleu. As previously observed (Callison-Burch et al., 2006; Wilks, 2008), Bleu tends to strongly prefer models with left-to-right language models because it focuses so strongly on getting a large number of 4-grams correct. We found that the annealed model significantly outperforms the left-right model in 1- and 2-gram precision, ties for 3-gram, and loses for 4-gram. This suggests that the Bleu score could be improved by explicitly modeling the linearization order in our approach. The other three measures of translation quality are significantly less sensitive to exact word order and focus more on whether the “semantics” is preserved (for varying definitions of “semantics”). For those, we see that the annealed+⟨end⟩-tuned models are more competitive, though still under-performing left-right by a few percent for Meteor and YiSi.
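To make the role of n-gram order concrete, the sketch below computes clipped n-gram precision, the per-order quantity that Bleu combines, for a single hypothesis and a single reference. It is a simplified illustration (sentence-level, one reference, no brevity penalty or smoothing), not the evaluation script used for Table 5; the function names and example sentences are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / total

reference = "how are you doing today ?".split()
scrambled = "you are how today doing ?".split()  # right words, wrong order

for n in range(1, 5):
    print(n, modified_precision(scrambled, reference, n))
```

In this toy case every word is correct, so unigram precision is perfect, but no higher-order n-gram matches. This is the pattern in which a model can preserve content while being penalized heavily by the 4-gram component of Bleu.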

6 Related Work

Arguably one of the most successful approaches for generating discrete sequences, or sentences, is neural autoregressive modeling (Sutskever et al., 2011; Tomas, 2012). It has become the de facto standard in machine translation (Cho et al., 2014a; Sutskever et al., 2014) and is widely studied for dialogue response generation (Vinyals & Le, 2015) as well as speech recognition (Chorowski et al., 2015). On the other hand, recent works have shown that it is possible to generate a sequence of discrete tokens in parallel by capturing strong dependencies among the tokens in a non-autoregressive way (Gu et al., 2017; Lee et al., 2018; Oord et al., 2017). Stern et al. (2018) and Wang et al. (2018) proposed to mix these two paradigms and build a semi-autoregressive sequence generator, while largely sticking to left-to-right generation. Our proposal radically departs from these conventional approaches by building an algorithm that automatically captures a distinct generation order.

In (neural) language modeling, there is a long tradition of modeling the probability of a sequence as a tree or directed graph. For example, Emami & Jelinek (2005) proposed to factorize the probability over a sentence following its syntactic structure and train a neural network to model conditional distributions, which was followed more recently by Zhang et al. (2015) and by Dyer et al. (2016). This approach was applied to neural machine translation by Eriguchi et al. (2017) and Aharoni & Goldberg (2017). In all cases, these approaches require the availability of the ground-truth parse of a sentence or access to an external parser during training or inference time. This is unlike the proposed approach which does not require any such extra annotation or tool and learns to sequentially generate a sequence in an automatically determined non-monotonic order.

7 Conclusion, Limitations & Future Work

We described an approach to generating text in non-monotonic orders that fall out naturally as the result of learning. We explored several different oracle models for imitation, and found that an annealed “coaching” oracle performed best, and learned a “best-first” strategy for language modeling, where it appears to significantly outperform alternatives. On a word re-ordering task, we found that this approach essentially ties left-to-right decoding, a rather promising finding given the decades of work on left-to-right models. In a machine translation setting, we found that, after tuning the probability of ending subtrees, the model learns to translate in a way that tends to preserve meaning but not n-grams.

There are several potentially interesting avenues for future work. One is to solve the “learning to stop” problem directly, rather than through an after-the-fact tuning step. Another is to better understand how to construct an oracle that generalizes well after mistakes have been made, in order to train off of the gold path(s).

Moreover, the proposed formulation of sequence generation by tree generation is limited to binary trees. It is possible to extend the proposed approach to n-ary trees by designing a policy that outputs multiple decisions at each node, leading to multiple child nodes. This would enlarge the set of generation orders that the proposed approach can capture to include all projective dependency parses. A new oracle must be designed to ensure that well-balanced n-ary trees are assigned enough probability, and we leave this as follow-up work.

Finally, although the proposed approach indeed learns to sequentially generate a sequence in a non-monotonic order, it cannot consider all possible orders. This is due to the constraint that no two edges may cross when the nodes (excluding ⟨end⟩ nodes) are arranged on a line following the in-order traversal, which we refer to as projective generation. Extending the proposed approach to non-projective generation, which we leave as future work, would expand the number of generation orders considered during learning.
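The binary-tree formulation underlying these limitations can be made concrete with a short sketch. It builds a tree from a sequence of actions produced in level-order and reads the sentence off with an in-order traversal, following the description of Figure 1. The class and function names, the `<end>` string, and the exact shape of the right subtree for “you ?” are assumptions made for illustration.

```python
from collections import deque

END = "<end>"  # stand-in for the end-of-subtree token

class Node:
    def __init__(self, token):
        self.token = token
        self.left = None
        self.right = None

def build_tree(actions):
    """Consume actions in level-order; every word node receives a left, then a right child."""
    actions = iter(actions)
    root = Node(next(actions))
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.token == END:
            continue  # an end node closes its subtree and has no children
        node.left = Node(next(actions))
        node.right = Node(next(actions))
        queue.extend([node.left, node.right])
    return root

def in_order(node):
    """Read the words off the tree left-to-right, skipping end nodes."""
    if node is None or node.token == END:
        return []
    return in_order(node.left) + [node.token] + in_order(node.right)

# Generation order: "are" first, then its children "how" and "you", and so on.
actions = ["are", "how", "you", END, END, END, "?", END, END]
print(" ".join(in_order(build_tree(actions))))
```

Because the output is always an in-order traversal of a binary tree, a word generated in a left subtree can never end up to the right of its ancestor, which is the projectivity constraint discussed above.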


We thank eBay, TenCent and NVIDIA for their support. This work was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent Structure).


8 Appendix

Appendix A Additional Experiment Details

A.1 Word Reordering


The decoder is a 2-layer LSTM with 1024 hidden units and dropout of 0.0, chosen based on a preliminary grid search. Word embeddings are initialized with GloVe vectors and updated during training. All presented Word Reordering results use greedy decoding.


Each model was trained for 48 hours on a single NVIDIA Tesla P100 GPU, using a maximum of 500 epochs, batch size of 32, Adam optimizer, gradient clipping with maximum ℓ2-norm of 1.0, and a learning rate starting at 0.001 and multiplied by a factor of 0.5 every 20 epochs. For evaluation we select the model state which had the highest validation Bleu score, which is evaluated after each training epoch.
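The learning-rate schedule described above is a step decay, sketched below. The sketch assumes zero-indexed epochs with the first halving taking effect at epoch 20; the text does not specify the indexing, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=0.001, factor=0.5, interval=20):
    """Step decay: multiply the initial rate by `factor` once per `interval` epochs.

    Assumption: epochs are zero-indexed, so epochs 0-19 use the initial rate.
    """
    return initial * factor ** (epoch // interval)

for epoch in (0, 19, 20, 40, 60):
    print(epoch, learning_rate(epoch))
```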


For the annealed oracle, the coefficient β is linearly annealed from 1.0 to 0.0 at a rate of 0.05 each epoch, after a burn-in period of 20 epochs in which β is not decreased. We use greedy decoding when the learned policy is selected at a roll-in step; we did not observe significant performance variations when stochastically sampling from the policy instead. These settings are based on a grid search using the model selected in the Model section above.
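The annealing schedule can be written down as a short function. This is a sketch under stated assumptions: epochs are zero-indexed, the coefficient stays at 1.0 throughout the 20 burn-in epochs, and the first decrease takes effect in the epoch after burn-in. The function name is hypothetical.

```python
def annealing_coefficient(epoch, burn_in=20, rate=0.05, start=1.0, end=0.0):
    """Linear annealing from `start` to `end` after a burn-in period.

    Assumption: zero-indexed epochs; the value is held at `start` for
    epochs 0..burn_in and then decreases by `rate` per epoch.
    """
    steps = max(0, epoch - burn_in)
    return max(end, start - rate * steps)

for epoch in (0, 20, 21, 30, 40, 100):
    print(epoch, annealing_coefficient(epoch))
```

With these settings the coefficient reaches 0.0 twenty epochs after the burn-in ends and is clamped there for the remainder of training.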

A.2 Unconditional Generation

We use the same settings as the Word Reordering experiments, except we always use stochastic sampling from the policy during roll-in. For evaluation we select the model state at the end of training.

Appendix B Additional Results

B.1 Unconditional Generation

B.1.1 Frequency Plots

Figures 5 and 6 contain word and part-of-speech frequencies, respectively, from the validation set and 10,000 samples from each model, ordered by the validation set frequencies.

B.1.2 Unconditional Samples

Samples in Tables 7-8 are organized as ‘short’ (≤ 5th percentile), ‘average-length’ (45-55th percentile), and ‘multi-sentence’ (≥ 3 punctuation tokens).

Each image in Figures 7, 8, and 9 shows a sampled sentence, its underlying tree, and its generation order.

Oracle      k     BLEU-2  BLEU-3  BLEU-4
left-right  10    0.905   0.778   0.624
left-right  100   0.874   0.705   0.514
left-right  1000  0.853   0.665   0.466
left-right  all   0.853   0.668   0.477
uniform     10    0.966   0.906   0.788
uniform     100   0.916   0.751   0.544
uniform     1000  0.864   0.651   0.435
uniform     all   0.831   0.609   0.395
annealed    10    0.966   0.895   0.770
annealed    100   0.931   0.804   0.628
annealed    1000  0.907   0.765   0.585
annealed    all   0.894   0.740   0.549
Table 6: Unconditional generation Bleu for various top-k samplers and policies trained with the specified oracle.

B.1.3 Additional Bleu Scores

Since absolute Bleu scores can vary with the softmax temperature (Caccia et al., 2018) or top-k sampler used, we report additional scores for several values of k and for Bleu-2, Bleu-3, and Bleu-4 in Table 6. Generally, the policy trained with the annealed oracle achieves the highest metrics.
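For reference, a top-k sampler with a softmax temperature can be sketched as below. This is a generic illustration of the two sampling knobs mentioned above, not the sampler used to produce Table 6; the function name and the toy logits are hypothetical, and `k=None` is used here to mean sampling from all actions.

```python
import math
import random

def top_k_sample(logits, k=None, temperature=1.0, rng=random):
    """Sample an index from the k highest logits after temperature scaling.

    k=None keeps every action; lower temperatures sharpen the distribution.
    """
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if k is not None:
        indices = indices[:k]
    scaled = [logits[i] / temperature for i in indices]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - largest) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]

toy_logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(5)]
print(samples)
```

Smaller k and lower temperature both concentrate samples on high-probability tokens, which is consistent with the higher Bleu scores at k = 10 in Table 6.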

Figure 5: Word frequencies.
Figure 6: POS tag frequencies.

B.2 Word Reordering

Figure 10 shows example predictions from the validation set, including the generation order and underlying tree.

Oracle | Short | Average-length
left-right | i can drive you alone . | do you like to test your voice to a choir ?
left-right | yeah it is very important . | no pets , on the subject in my family , yes .
left-right | i am a am nurse . | cool . i have is also a cat named cow .
left-right | do you actually enjoy it ? | i am doing good taking a break from working on it .
left-right | what pair were you in ? | i do not have one , do you have any pets ?
uniform | good just normal people around . | just that is for a while . and yourself right now ?
uniform | you run the hills right ? | i am freelance a writer but i am a writer .
uniform | i am great yourself ? | that is so sad . do you have a free time ?
uniform | i work 12 hours . | yes i do not like pizza which is amazing lol .
uniform | do you go to hockey ? | since the gym did not bother me many years ago .
annealed | are you ? i am . | yeah it can be . what is your favorite color ?
annealed | i like to be talented . | i do not have dogs . they love me here .
annealed | how are you doing buddy ? | no kids . . . i am . . you ?
annealed | i like healthy foods . | that is interesting . i am just practicing my piano degree .
annealed | i love to eat . | yea it is , you need to become a real nerd !
Table 7: Short (left) and Average-Length (right) unconditional samples from policies trained on Persona-Chat.
Oracle | Multi-sentence
left-right | nice ! i think i will get a jump blade again . have you done that at it ?
left-right | great . what kinds of food do you like best ? i love italian food .
left-right | wow . bike ride is my thing . i do nothing for kids .
left-right | i am alright . my mom makes work and work as a nurse . that is what i do for work .
left-right | that is awesome . i need to lose weight . i want to start a food place someday .
uniform | love meat . or junk food . i sometimes go too much i make . avoid me unhealthy .
uniform | does not kill anyone that can work around a lot of animals ? you ? i like trains .
uniform | baby ? it will it all here . that is the workforce .
uniform | i am good , thank you . i love my sci fi stories . i write books .
uniform | i am well . thank you . my little jasper is new .
annealed | i am definitely a kid . are you ? i am 10 !
annealed | i am in michigan state . . that is a grand state .
annealed | that is good . i work as a pharmacist in florida . . .
annealed | how are you ? wanna live in san fran ! i love it .
annealed | well that is awesome ! i do crosswords ! that is cool .
Table 8: Multi-sentence unconditional samples from policies trained on Persona-Chat.
Figure 7: Unconditional samples from a trained policy (one oracle per figure across Figures 7-9).
Figure 8: Unconditional samples from a trained policy (one oracle per figure across Figures 7-9).
Figure 9: Unconditional samples from a trained policy (one oracle per figure across Figures 7-9).
Figure 10: Word Reordering Examples. Each column shows a policy trained with a different oracle.
Figure 11: Translation outputs (test set) from a trained policy (one oracle per figure across Figures 11-13).
Figure 12: Translation outputs (test set) from a trained policy (one oracle per figure across Figures 11-13).
Figure 13: Translation outputs (test set) from a trained policy (one oracle per figure across Figures 11-13).