Most sequence-generation models, from n-grams (Bahl et al., 1983) to neural language models (Bengio et al., 2003), generate sequences in a purely left-to-right, monotonic order. This raises the question of whether alternative, non-monotonic orders are worth considering (Ford et al., 2018), especially given the success of “easy first” techniques in natural language tagging (Tsuruoka & Tsujii, 2005), parsing (Goldberg & Elhadad, 2010), and coreference (Stoyanov & Eisner, 2012), which allow a model to effectively learn its own ordering. In investigating this question, we are solely interested in considering non-monotonic generation that does not rely on external supervision, such as parse trees (Eriguchi et al., 2017; Aharoni & Goldberg, 2017).
In this paper, we propose a framework for training sequential text generation models which learn a generation order without having to specify an order in advance (§ 2). An example generation from our model is shown in Figure 1. We frame the learning problem as an imitation learning problem, in which we aim to learn a generation policy that mimics the actions of an oracle generation policy (§ 3). Because the tree structure is unknown, the oracle policy cannot know the exact correct actions to take; to remedy this we propose a method called annealed coaching, which can yield a policy with learned generation orders by gradually moving from imitating a maximum entropy oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed framework, it is possible to learn policies which generate text without pre-specifying a generation order, achieving easy first-style behavior. The policies achieve performance metrics that are competitive with or superior to conventional left-to-right generation in language modeling, word reordering, and machine translation (§ 5).
2 Non-Monotonic Sequence Generation
Formally, we consider the problem of sequentially generating a sequence of discrete tokens $Y = (w_1, \dots, w_N)$, such as a natural language sentence, where $w_i \in V$, a finite vocabulary. Let $\tilde{V} = V \cup \{\langle\text{end}\rangle\}$, the vocabulary augmented with a special ⟨end⟩ token.
Unlike conventional approaches with a fixed generation order, often left-to-right (or right-to-left), our goal is to build a sequence generator that generates these tokens in an order automatically determined by the sequence generator, without any extra annotation nor supervision of what might be a good order. We propose a method which does so by generating a word at an arbitrary position, then recursively generating words to its left and words to its right, yielding a binary tree like that shown in Figure 1.
We view the generation process as deterministically navigating a state space where a state $s$ corresponds to a sequence of tokens from $\tilde{V}$. We interpret this sequence of tokens as a top-down traversal of a binary tree, where ⟨end⟩ terminates a subtree. The initial state $s_0$ is the empty sequence. An action $a$ is an element of $\tilde{V}$ which is deterministically appended to the state. Terminal states are those for which all subtrees have been ⟨end⟩'ed. If a terminal state $s_T$ is reached, we have that $T = 2N + 1$, where $N$ is the number of words (non-⟨end⟩ tokens) in the tree. We use $\tau(i)$ to denote the level-order traversal index of the $i$-th node in an in-order traversal of the tree, so that $(a_{\tau(1)}, \dots, a_{\tau(T)})$ corresponds to the sequence of discrete tokens generated. The final sequence returned is this, post-processed by removing all ⟨end⟩'s. In Figure 1, $\tau$ maps from the numbers in the blue squares to those in the green squares.
A policy $\pi$ is a (possibly) stochastic mapping from states to actions, and we denote the probability of an action $a \in \tilde{V}$ given a state $s$ as $\pi(a \mid s)$. A policy $\pi$'s behavior decides which and whether words appear before and after the token of the parent node. Typically there are many unique binary trees with an in-order traversal equal to a sequence $Y$. Each of these trees has a different level-order traversal, thus the policy is capable of choosing from many different generation orders for $Y$, rather than a single predefined order. Note that left-to-right generation can be recovered if $\pi(\langle\text{end}\rangle \mid s_t) = 1$ if and only if $t$ is odd (or non-zero and even for right-to-left generation).
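To make the correspondence between action sequences, trees, and final sentences concrete, here is a minimal Python sketch (the function names and example action sequence are our own, not from the paper). It rebuilds the binary tree from a level-order action sequence in which ⟨end⟩ terminates a subtree, then reads off the final sentence via an in-order traversal with ⟨end⟩'s removed:

```python
from collections import deque

END = "<end>"

class Node:
    def __init__(self, token):
        self.token, self.left, self.right = token, None, None

def tree_from_level_order(actions):
    """Rebuild the binary tree from a level-order action sequence.
    Each non-<end> action opens two pending child slots; <end> closes a slot."""
    it = iter(actions)
    root = Node(next(it))
    queue = deque([root])          # nodes whose children are still unassigned
    while queue:
        node = queue.popleft()
        for side in ("left", "right"):
            token = next(it)
            if token != END:       # <end> terminates this subtree
                child = Node(token)
                setattr(node, side, child)
                queue.append(child)
    return root

def final_sequence(root):
    """In-order traversal with all <end> leaves removed."""
    if root is None:
        return []
    return final_sequence(root.left) + [root.token] + final_sequence(root.right)

# A terminal action sequence with N = 3 words has T = 2N + 1 = 7 actions:
actions = ["you", "are", "?", END, END, END, END]
print(final_sequence(tree_from_level_order(actions)))  # ['are', 'you', '?']
```

Note that the same final sentence arises from many different action sequences, one per binary tree whose in-order traversal matches it.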
3 Learning for Non-Monotonic Generation
Learning in our non-monotonic sequence generation model (§ 2) amounts to inferring a policy $\pi$ from data. We first consider the unconditional generation problem (akin to language modeling) in which the data consists simply of sequences $Y$ to be generated. Subsequently (§ 4.1) we consider the conditional case in which we wish to learn a mapping from inputs $X$ to output sequences $Y$.
This learning problem is challenging because the sequences $Y$ alone only tell us what the final output sequences of words should be, but not what tree(s) should be used to get there. In left-to-right generation, the observed sequence fully determines the sequence of actions to take. In our case, however, the tree structure is effectively a latent variable, which will be determined by the policy itself. This prevents us from using conventional supervised learning to train the parameterized policy. On the other hand, at training time, we do know which words should eventually appear, and their order; this substantially constrains the search space that needs to be explored, suggesting learning-to-search (Daumé et al., 2009) and imitation learning (Ross et al., 2011; Ross & Bagnell, 2014) as a learning strategy.¹
¹ One could consider applying reinforcement learning to this problem. This would, however, ignore the fact that at training time we know which words will appear, which dramatically reduces the size of the feasible search space. Furthermore, even with a fixed generation order, RL has proven to be difficult without partially relying on supervised learning (Ranzato et al., 2015; Bahdanau et al., 2015b, 2016).
The key idea in our imitation learning framework is that at the first step, an oracle policy’s action is to produce any word $w$ that appears anywhere in $Y$. Once picked, in a quicksort-esque manner, all words to the left of $w$ in $Y$ are generated recursively on the left (following the same procedure), and all words to the right of $w$ in $Y$ are generated recursively on the right. (See Figure 2 for an example.) Because the oracle is non-deterministic (many “correct” actions are available at any given time), we inform this oracle policy with the current learned policy, encouraging it to favor actions that are preferred by the current policy, inspired by work in direct loss minimization (Hazan et al., 2010) and related techniques (Chiang, 2012; He et al., 2012).
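The quicksort-esque recursion can be sketched in a few lines of Python (a toy illustration under our own naming, not the paper's implementation): pick any pivot word, recurse on the left and right subsequences, and note that no matter which pivots are chosen, reading the resulting tree in-order always reproduces $Y$:

```python
import random

END = "<end>"

def oracle_tree(words, rng):
    """Sample one valid generation tree for `words`, quicksort-style:
    any word may serve as the root; recurse on what lies to its left/right."""
    if not words:
        return END
    k = rng.randrange(len(words))  # the oracle may pick any word as pivot
    return (words[k], oracle_tree(words[:k], rng), oracle_tree(words[k + 1:], rng))

def read_in_order(tree):
    """In-order reading of a nested (token, left, right) tree, skipping <end>."""
    if tree == END:
        return []
    token, left, right = tree
    return read_in_order(left) + [token] + read_in_order(right)

Y = ["a", "b", "c", "d"]
for seed in range(5):              # every sampled generation order reproduces Y
    assert read_in_order(oracle_tree(Y, random.Random(seed))) == Y
```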
3.1 Background: Learning to Search
In learning-to-search-style algorithms, we aim to learn a policy $\pi$ that mimics an oracle (or “reference”) policy $\pi^\star$. To do so, we define a roll-in policy $\pi^{\text{in}}$ and a roll-out policy $\pi^{\text{out}}$. We then repeatedly draw states $s$ according to the state distribution induced by $\pi^{\text{in}}$, and compute cost-to-go under $\pi^{\text{out}}$ for all possible actions at that state. The learned policy $\pi$ is then trained to choose actions that minimize this cost-to-go estimate.
Formally, denote the uniform distribution over $\{1, \dots, T\}$ as $U[T]$, and denote by $d^{\pi}_t$ the distribution of states induced by running $\pi$ for $t$-many steps. Denote by $C(\pi; \pi^{\text{out}}, s)$ a scalar cost measuring the loss incurred by $\pi$ against the cost-to-go estimates under $\pi^{\text{out}}$ (for instance, $C$ may measure the squared error between the vector $\pi(\cdot \mid s)$ and the cost-to-go estimates). Then, the quantity being optimized is:

$$\mathbb{E}_{Y \sim \mathcal{D}}\; \mathbb{E}_{t \sim U[T]}\; \mathbb{E}_{s \sim d^{\pi^{\text{in}}}_t} \left[\, C(\pi; \pi^{\text{out}}, s) \,\right] \tag{1}$$
Here, $\pi^{\text{out}}$ and $C$ can use information not available at test time (e.g., the ground-truth $Y$). Learning consists of finding a policy $\pi$ which only has access to states $s$ but performs as well as or better than $\pi^\star$. By varying the choice of $\pi^{\text{in}}$, $\pi^{\text{out}}$, and $C$, one obtains different variants of learning-to-search algorithms, such as DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014) or LOLS (Chang et al., 2015).
In the remainder of this section, we describe the cost function we use, a set of oracle policies, and a set of roll-in policies, all of which are specifically designed for the proposed problem of non-monotonic sequence generation. These sets of policies are empirically evaluated later in the experiments (§ 5).
3.2 Cost Measurement
There are many ways to measure the prediction cost $C$; arguably the most common is the squared error between cost predictions by $\pi$ and observed costs obtained by $\pi^{\text{out}}$ at the state $s$. However, recent work has found that, especially when dealing with recurrent neural network policies (which we will use; see § 4), using a cost function more analogous to a cross-entropy loss is preferable (Leblond et al., 2018; Cheng et al., 2018; Welleck et al., 2018). In particular, we use a KL-divergence type loss, measuring the difference between the action distribution produced by $\pi$ and the action distribution preferred by $\pi^{\text{out}}$:

$$C(\pi; \pi^{\text{out}}, s) = D_{\mathrm{KL}}\!\left( \pi^{\text{out}}(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \right) \tag{2}$$

Our approach estimates the loss in Eq. (1) by first sampling one training sequence, running the roll-in policy for $t$ steps, and computing the KL divergence (2) at that state using $\pi^\star$ as $\pi^{\text{out}}$. Learning corresponds to minimizing this KL divergence iteratively with respect to the parameters of $\pi$.
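As a concrete sketch of the per-state cost (our own helper, with action distributions represented as hypothetical Python dictionaries), the KL divergence in (2) between the oracle's and the policy's action distributions is:

```python
import math

def kl_cost(oracle_dist, policy_dist, eps=1e-12):
    """KL(pi_out(.|s) || pi(.|s)) over the action set.
    Distributions are dicts mapping actions to probabilities;
    `eps` guards against log(0) when the policy misses a valid action."""
    return sum(p * math.log(p / max(policy_dist.get(a, 0.0), eps))
               for a, p in oracle_dist.items() if p > 0.0)

oracle = {"a": 0.5, "b": 0.5, "<end>": 0.0}
policy = {"a": 0.7, "b": 0.2, "<end>": 0.1}
loss = kl_cost(oracle, policy)   # positive; zero iff the distributions match
```

In training, `oracle_dist` would come from one of the oracles of § 3.4 at a rolled-in state, and `policy_dist` from the network's softmax output.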
3.3 Roll-In Policies
The roll-in policy determines the state distribution over which the learned policy $\pi$ is to be trained. In most formal analyses, the roll-in policy is a stochastic mixture of the learned policy $\pi$ and the oracle policy $\pi^\star$, ensuring that $\pi$ is eventually trained on its own state distribution (Daumé et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014; Chang et al., 2015). Despite this, experimentally, it has often been found that simply using the oracle’s state distribution is optimal (Ranzato et al., 2015; Leblond et al., 2018). This is likely because the noise incurred early on in learning by using $\pi$’s state distribution is not overcome by the benefit of matching state distributions, especially when the policy class is sufficiently high capacity so as to be nearly realizable on the training data (Leblond et al., 2018). In preliminary experiments, we observed the same is true in our setting: simply rolling in according to the oracle policy (§ 3.4) yielded the best results experimentally. Therefore, despite the fact that this can lead to inconsistency in the learned model (Chang et al., 2015), all experiments use oracle roll-ins.
3.4 Oracle Policies
In this section we formalize the oracle policies that we consider. To simplify the discussion (we assume that the roll-in distribution is the oracle's), we only need to define an oracle policy that takes actions on states it, itself, visits. All the oracles we consider have access to the ground-truth output $Y = (w_1, \dots, w_N)$ and the current state $s_t$. We interpret the state as a partial binary tree and a “current node” in that binary tree where the next prediction will go. It is easiest to consider the behavior of the oracle as a top-down, level-order traversal of the tree, where in each state it maintains a sequence of “possible tokens” at that state. An oracle policy is defined with respect to $Y_{i:j}$, a consecutive subsequence of $Y$. At $s_0$, the oracle uses the full $Y = Y_{1:N}$. This is subdivided as the tree is descended. At each state, $Y_{i:j}$ contains the “valid actions”; labeling the current node with any token from $Y_{i:j}$ keeps the generation on a path leading to $Y$. For instance, in Figure 2, after sampling b for the root, the words preceding b become the valid actions for the left child and the words following b become the valid actions for the right child.
Given the consecutive subsequence $Y_{i:j}$, an oracle policy is defined as:

$$\pi^\star(a \mid s) = \begin{cases} 1 & \text{if } a = \langle\text{end}\rangle \text{ and } Y_{i:j} \text{ is empty} \\ p_a & \text{if } a \in \{w_i, \dots, w_j\} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where the $p_a$'s are arbitrary such that $\sum_{a \in \{w_i, \dots, w_j\}} p_a = 1$. An oracle policy places positive probability only on valid actions, and forces an ⟨end⟩ output if there are no more words to produce. This is guaranteed to always generate $Y$, regardless of how the random coin flips come up.
When an action $a$ is chosen, this “splits” the subsequence $Y_{i:j}$ into left and right subsequences, $Y_{i:k-1}$ and $Y_{k+1:j}$, where $k$ is the index of $a$ in $Y_{i:j}$. (This split may not be unique due to duplicated words in $Y_{i:j}$, in which case we choose a valid split arbitrarily.) These are “passed” to the left and right child nodes, respectively.
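In code, the split is a one-liner (our own sketch; with duplicated words we take the first matching index, one of the arbitrary valid choices mentioned above):

```python
def split_valid_actions(subseq, action):
    """Split Y_{i:j} around the chosen word into the subsequences
    passed to the left and right child nodes."""
    k = subseq.index(action)       # first occurrence: an arbitrary valid split
    return subseq[:k], subseq[k + 1:]

left, right = split_valid_actions(["a", "b", "c", "b"], "b")
# left == ["a"], right == ["c", "b"]
```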
There are many possible oracle policies, and each of them is characterized by how $p_a$ in Eq. (3) is defined. Specifically, we propose three variants.
Uniform Oracle.
Motivated by Welleck et al. (2018), who applied learning-to-search to the problem of multiset prediction, we design a uniform oracle $\pi^\star_{\text{uniform}}$. This oracle treats all possible generation orders that lead to the target sequence as equally likely, without preferring any specific set of orders. Formally, $\pi^\star_{\text{uniform}}$ gives uniform probabilities $p_a = 1/m$ for all words in $Y_{i:j}$, where $m$ is the number of unique words in $Y_{i:j}$. (Daumé (2009) used a similar oracle for unsupervised structured prediction, which has a similar non-deterministic oracle complication.)
Coaching Oracle.
An issue with the uniform oracle is that it does not prefer any specific set of generation orders, making it difficult for a parameterized policy to imitate. This gap has been noticed as a factor behind the difficulty of learning-to-search by He et al. (2012), who propose the idea of coaching. In coaching, the oracle takes into account the preference of the parameterized policy in order to facilitate its learning. Motivated by this, we design a coaching oracle as the product of the uniform oracle and the current policy $\pi$:

$$\pi^\star_{\text{coaching}}(a \mid s) \propto \pi^\star_{\text{uniform}}(a \mid s)\; \pi(a \mid s) \tag{4}$$

This coaching oracle ensures that no invalid action is assigned any probability, while preferring actions that are preferred by the current parameterized policy, reinforcing the selection by the current policy if it is valid.
Annealed Coaching Oracle.
The multiplicative nature of the coaching oracle gives rise to an issue, especially in the early stage of learning, as it does not encourage the policy to explore a diverse set of generation orders. We thus design a mixture of the uniform and coaching oracles, which we refer to as an annealed coaching oracle:

$$\pi^\star_{\text{annealed}}(a \mid s) = \beta\, \pi^\star_{\text{uniform}}(a \mid s) + (1 - \beta)\, \pi^\star_{\text{coaching}}(a \mid s) \tag{5}$$

We anneal $\beta$ from 1 to 0 over the course of learning, on a linear schedule.
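The three stochastic oracles differ only in how they place the probabilities $p_a$ over the valid actions. A compact dict-based sketch (our own notation; the policy's distribution is a hypothetical input):

```python
def uniform_oracle(valid):
    """Equal probability 1/m for each of the m unique valid words."""
    uniq = sorted(set(valid))
    return {a: 1.0 / len(uniq) for a in uniq}

def coaching_oracle(valid, policy):
    """Renormalized product of the uniform oracle and the current policy:
    invalid actions get zero mass; valid ones follow the policy's preference."""
    u = uniform_oracle(valid)
    scores = {a: u[a] * policy.get(a, 0.0) for a in u}
    z = sum(scores.values())
    return u if z == 0.0 else {a: s / z for a, s in scores.items()}

def annealed_oracle(valid, policy, beta):
    """Mixture beta * uniform + (1 - beta) * coaching; beta anneals 1 -> 0."""
    u, c = uniform_oracle(valid), coaching_oracle(valid, policy)
    return {a: beta * u[a] + (1.0 - beta) * c[a] for a in u}

policy = {"a": 0.8, "b": 0.1, "c": 0.1}
assert annealed_oracle(["a", "b"], policy, beta=1.0) == uniform_oracle(["a", "b"])
```

(When the valid set is empty, Eq. (3) forces ⟨end⟩ with probability 1, which these helpers assume is handled by the caller.)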
Deterministic Left-to-Right Oracle.
In addition to the proposed oracle policies above, we also experiment with a deterministic oracle that corresponds to generating the target sequence from left to right: $\pi^\star_{\text{left-right}}$ always selects the first un-produced word as the correct action, with probability 1. When both the roll-in and oracle policies are set to the left-to-right oracle $\pi^\star_{\text{left-right}}$, the proposed approach reduces to maximum likelihood learning of an autoregressive sequence model, which is the de facto standard in neural sequence modeling. In other words, supervised learning of an autoregressive sequence model is a special case of the proposed approach.
4 Neural Net Policy Structure
We use a neural network to implement the proposed binary-tree-generating policy, as neural networks have been shown to effectively encode variable-sized inputs and predict structured outputs (Cleeremans et al., 1989; Forcada & Ñeco, 1997; Sutskever et al., 2014; Cho et al., 2014b; Tai et al., 2015; Bronstein et al., 2017; Battaglia et al., 2018). This neural network takes as input a partial binary tree, or equivalently a sequence of nodes in this partial tree by level-order traversal, and outputs a distribution over the action set $\tilde{V}$. The policy is implemented as a recurrent network with long short-term memory (LSTM) units (Hochreiter & Schmidhuber, 1997) by considering the partial binary tree as a flat sequence of nodes in a level-order traversal $s_t = (a_1, \dots, a_t)$. The recurrent network encodes the sequence into a vector $h_t$ and computes a categorical distribution over the action set:

$$\pi(a \mid s_t) \propto \exp\!\left( u_a^\top h_t + b_a \right) \tag{6}$$

where $u_a$ and $b_a$ are the weights and bias associated with action $a$.
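A dependency-free sketch of this softmax output layer (our own toy code: in practice the vector $h$ would come from the LSTM's encoding of the level-order state, and there is one weight vector and bias per action):

```python
import math

def policy_distribution(h, weights, biases):
    """Softmax over logits u_a . h + b_a, one (u_a, b_a) pair per action.
    `h` stands in for the recurrent encoding of the level-order state."""
    logits = {a: sum(ui * hi for ui, hi in zip(u, h)) + biases[a]
              for a, u in weights.items()}
    m = max(logits.values())                     # stabilize the softmax
    exps = {a: math.exp(z - m) for a, z in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

h = [0.3, -1.2]
weights = {"you": [1.0, 0.0], "<end>": [1.0, 0.0]}
biases = {"you": 0.0, "<end>": 0.0}
dist = policy_distribution(h, weights, biases)   # identical logits -> 0.5 each
```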
This LSTM structure relies entirely on the linearization of a partial binary tree, and minimally takes advantage of the actual tree structure or the surface order. It is possible to exploit the tree structure more thoroughly by using recent neural network architectures specifically designed to encode trees (Zhang et al., 2015; Alvarez-Melis & Jaakkola, 2017; Dyer et al., 2015; Bowman et al., 2016). We do not consider these in this paper, but leave them for future investigation. We did, however, experiment with additionally conditioning $\pi$'s action distribution on the parent of the current node in the tree, but preliminary experiments did not show gains.
4.1 Conditional Sentence Generation
An advantage of using a neural network to implement the proposed policy is that it can easily be conditioned on extra context. This allows us to build a conditional non-monotonic sequence generator that can, for instance, be used for machine translation, image caption generation, speech recognition, and multimedia description generation in general (Cho et al., 2015). To do so, we assume that a conditioning input $X$ (e.g. an image or sentence) can be represented as a fixed-dimensional context vector. To obtain this vector representation, we learn an encoder function $f^{\text{enc}}(X)$ and use its output to initialize the LSTM policy's hidden state, $h_0 = \tanh\!\left(W^{\text{enc}} f^{\text{enc}}(X) + b^{\text{enc}}\right)$, where $W^{\text{enc}}$ and $b^{\text{enc}}$ are learned parameters. For the machine translation experiments (§ 5.4), the encoder additionally outputs a vector per input token, and the policy computes an attention-weighted context vector at each step using a learned attention function, which is then combined with the policy's state. The encoder's parameters are learned jointly with the policy's.
5 Experimental Results
In this section we experiment with our non-monotonic sequence generation model across four tasks. The first two are unconditional generation tasks: language modeling (§ 5.1) and out-of-order sentence completion (§ 5.2). Our analysis in these tasks is primarily qualitative: we seek to understand what the non-monotonic policy is learning and how it compares to a left-to-right model. The latter two tasks are conditional generation tasks, which generate output sequences based on a given input sequence: word reordering (§ 5.3) and machine translation (§ 5.4).
5.1 Language Modeling
We begin by considering generating samples from our model, trained as a language model. Our goal in this section is to qualitatively understand what our model has learned. It would be natural also to evaluate our model according to a score like perplexity. Unfortunately, unlike conventional autoregressive language models, it is intractable to compute the probability of a given sequence in the non-monotonic generation setting, as it requires us to marginalize out all possible binary trees that lead to the sequence.
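To see why this marginalization is intractable, note that the number of distinct binary trees whose in-order traversal equals a fixed $N$-word sentence is the $N$-th Catalan number, which grows exponentially in $N$ (a standard combinatorial fact, not a result from the paper):

```python
from math import comb

def num_generation_orders(n):
    """Number of binary trees with a fixed in-order traversal of n nodes:
    the n-th Catalan number, C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

print(num_generation_orders(3))    # 5
print(num_generation_orders(12))   # 208012
```

Even a 12-token sentence (the average length in the dataset below) admits over two hundred thousand trees, each a distinct generation order to sum over.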
We use a dataset derived from the Persona-Chat (Zhang et al., 2018) dialogue dataset, which consists of multi-turn dialogues between two agents. Our dataset here consists of all unique persona sentences and utterances in Persona-Chat. We derive the examples from the same train, validation, and test splits as Persona-Chat, resulting in 133,176 train, 16,181 validation, and 15,608 test examples. Sentences are tokenized by splitting on spaces and punctuation. The training set has a vocabulary size of 20,090 and an average of 12.0 tokens per example.
We use a uni-directional LSTM that has 2 layers of 1024 LSTM units. See Appendix A for more details.
Table 1: %Novel, %Unique, average tokens, average span, and Bleu of 10,000 samples from the policy trained with each oracle.
Table 2: Samples generated by the trained policies:
- hey there , i should be !
- not much fun . what are you doing ?
- not . not sure if you .
- i love to always get my nails done .
- sure , i can see your eye underwater while riding a footwork .
- i just got off work .
- yes but believe any karma , it is .
- i bet you are . i read most of good tvs on that horror out . cool .
- sometimes , for only time i practice professional baseball .
- i am rich , but i am a policeman .
- i do , though . do you ?
- i like iguanas . i have a snake . i wish i could win . you ?
- i am a homebody .
- i care sometimes . i also snowboard .
- i am doing okay . just relaxing , and you ?
We draw 10,000 samples from each trained policy (by varying the oracle) and analyze the results using the following metrics: percentage of novel sentences, percentage of unique sentences, average number of tokens, average span size (the average number of children of non-leaf nodes, excluding the special ⟨end⟩ token; this ranges from 1, a chain as induced by the left-right oracle, to 2, a full binary tree), and Bleu (Table 1). We use Bleu to quantify sample quality by computing the Bleu score of the samples using the validation set as reference, following Yu et al. (2016) and Zhu et al. (2018). In Appendix B.1.3 we report additional scores. We see that the non-monotonically trained policies generate many more novel sentences, and build trees that are bushy (average span close to 2), but not complete binary trees. The policy trained with the annealed oracle is most similar to the validation data according to Bleu.
We investigate the content of the models in Table 2, which shows samples from policies trained with different oracles. None of the displayed samples appears in the training set. We provide additional samples organized by length in Appendix Tables 2 and 3, and samples showing the underlying trees that generated them in Appendix Figures 3-5. In Appendix B.1.1, we additionally examine word frequencies and part-of-speech tag frequencies, and find that the samples from each policy typically follow the validation set's word and tag frequencies.
We analyze the generation order of our various models by inspecting the part-of-speech (POS) tags each model tends to put at different tree depths (i.e., the number of edges from the node to the root). Figure 3 shows POS counts by tree depth, normalized by the sum of counts at each depth (we only show the four most frequent POS categories). We also show POS counts for the validation set's dependency trees, obtained with an off-the-shelf parser. Not surprisingly, policies trained with the uniform oracle tend to generate words with a variety of POS tags at each level. Policies trained with the annealed oracle, on the other hand, learn to frequently generate punctuation at the root node, often either the sentence-final period or a comma, in an “easy first” style, since most sentences contain a period. Furthermore, we see that the policy trained with the annealed oracle tends to generate a pronoun before a noun or a verb (tree depth 1), a pattern that policies trained with the left-right oracle also learn. Nouns typically appear in the middle of the annealed policy's trees. Aside from verbs, the annealed policy's trees, which have punctuation and pronouns near the root and nouns deeper, follow a structure similar to the dependency trees.
5.2 Sentence Completion
A major weakness of the conventional autoregressive model, especially with unbounded context, is that it cannot easily be used to fill in missing parts of a sentence except at the end. This is especially true when the number of tokens per missing segment is not given in advance. Achieving this requires significant changes to the model architecture, learning, and inference (Berglund et al., 2015).
Our proposed approach, on the other hand, can naturally fill in missing segments in a sentence. Using the models trained as language models in the previous section (§ 5.1), we achieve this by initializing a binary tree with the observed tokens in a way that respects their relative positions. For instance, the first example shown in Table 3 can be seen as the template “ favorite food ! ” with variable-length missing segments. Generally, an initial tree with nodes $a_1, \dots, a_k$ ensures that each $a_i$ appears in the completed sentence, and that $a_i$ appears at some position to the left of $a_j$ in the completed sentence when $a_i$ is a left-descendant of $a_j$ (and analogously for right-descendants).
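This ordering guarantee can be checked mechanically: the seeded tokens, read in in-order, must appear as a subsequence of the completed sentence. A small sketch (our own helper, not from the paper):

```python
def respects_template(completion, seeds_in_order):
    """True iff the seeded tokens occur in the completed sentence in the
    same relative order as the in-order traversal of the initial tree."""
    it = iter(completion)
    return all(tok in it for tok in seeds_in_order)  # `in` consumes the iterator

sentence = "lasagna is my favorite food !".split()
assert respects_template(sentence, ["my", "favorite", "!"])
assert not respects_template(sentence, ["favorite", "my"])
```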
To quantify the completion quality, we first create a collection of initial trees by randomly sampling three words from each sentence from the Persona-Chat validation set of § 5.1. We then sample one completion for each initial tree and measure the Bleu of each sample using the validation set as reference as in § 5.1. According to Bleu, the policy trained with the annealed oracle sampled completions that were more similar to the validation data (Bleu 44.7) than completions from the policies trained with the uniform (Bleu 38.9) or left-to-right (Bleu 14.3) oracles.
In Table 3, we present some sample completions using the policy trained with the uniform oracle. The completions illustrate a property of the proposed non-monotonic generation that is not available in left-to-right generation.
Table 3: Sample completions from the policy trained with the uniform oracle:
- lasagna is my favorite food !
- my favorite food is mac and cheese !
- what is your favorite food ? pizza , i love it !
- whats your favorite food ? mine is pizza !
- seafood is my favorite . and mexican food ! what is yours ?
- hello ! i like classical music . do you ?
- hello , do you enjoy playing music ?
- hello just relaxing at home listening to fine music . you ?
- hello , do you like to listen to music ?
- hello . what kind of music do you like ?
- i am a doctor or a lawyer .
- i would like to feed my doctor , i aspire to be a lawyer .
- i am a doctor lawyer . 4 years old .
- i was a doctor but went to a lawyer .
- i am a doctor since i want to be a lawyer .
5.3 Word Reordering
We first evaluate the proposed models for conditional generation on the Word Reordering task, also known as Bag Translation (Brown et al., 1990) or Linearization (Schmaltz et al., 2016). In this task, a sentence $Y = (w_1, \dots, w_N)$ is given as an unordered collection $X = \{w_1, \dots, w_N\}$, and the task is to reconstruct $Y$ from $X$. We assemble a dataset of $(X, Y)$ pairs using sentences from the Persona-Chat sentence dataset of § 5.1. In our approach, we do not explicitly force the policies trained with our non-monotonic oracles to produce a permutation of the input; instead we let them learn this automatically.
For encoding each unordered input $X$, we use a simple bag-of-words encoder: $f^{\text{enc}}(X) = \frac{1}{|X|} \sum_{w \in X} \text{emb}(w)$. We implement $\text{emb}(\cdot)$ using an embedding layer followed by a linear transformation. The embedding layer is initialized with GloVe (Pennington et al., 2014) vectors and updated during training. As the policy (decoder) we use a flat LSTM with 2 layers of 1024 LSTM units. The decoder hidden state is initialized with a linear transformation of $f^{\text{enc}}(X)$.
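A toy version of the bag-of-words encoder (averaging embeddings; the embedding values here are made up and the trailing linear layer is omitted). The key property is order-invariance, which is what makes the input an unordered bag:

```python
def bow_encode(bag, emb):
    """Mean of the word embeddings of an unordered input bag.
    `emb` maps each token to a fixed-length vector (toy values)."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for w in bag:
        for i, v in enumerate(emb[w]):
            total[i] += v
    return [x / len(bag) for x in total]

emb = {"you": [1.0, 2.0], "are": [3.0, 4.0]}
# Order-invariant by construction:
assert bow_encode(["you", "are"], emb) == bow_encode(["are", "you"], emb)
```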
Table 4 shows Bleu, F1 score, and exact match for policies trained with each oracle. The uniform and annealed policies outperform the left-right policy in F1 score (0.96 and 0.95 vs. 0.903). The policy trained using the annealed oracle also matches the left-right policy’s performance in terms of Bleu score (46.0 vs. 46.3) and exact match (0.212 vs. 0.208). The model trained with the uniform policy does not fare as well on Bleu or exact match. See Appendix Figure 6 for example predictions.
Figure 4 shows the entropy of each model as a function of depth in the tree (normalized to fall in $[0, 1]$). The left-right-trained policy has high entropy on the first word, which then drops dramatically as additional conditioning from prior context kicks in. The uniform-trained policy exhibits similar behavior. The annealed-trained policy, however, makes its highest-confidence (“easiest”) predictions at the beginning (consistent with Figure 3) and defers harder decisions until later.
5.4 Machine Translation
Data and Preprocessing.
We evaluate the proposed models on the IWSLT’16 German→English (196k pairs) translation task. The datasets consist of TED talks. We use tst2013 as a validation dataset and tst2014 as the test set. We use the default Moses tokenizer script (Koehn et al., 2007) and segment each word into subwords using BPE (Sennrich et al., 2015), yielding 40k tokens for both source and target. Similar to Bahdanau et al. (2015a), we filter out sentence pairs that exceed 50 words and shuffle mini-batches.
Table 5: Results of machine translation experiments for different training oracles across four different evaluation metrics.
Model & Training.
We use a bi-directional LSTM encoder-decoder architecture with a single layer of 512 units and global concat attention (Luong et al., 2015). The learning rate for all of our models is initialized to 0.001 and multiplied by a factor of 0.5 at fixed intervals.
In preliminary results on the validation data, we found that the annealed-trained models tended to overproduce ⟨end⟩ tokens. This likely happens because roughly 50% of the examples seen during training have ⟨end⟩ as the correct action, which is much more reliable than any other word, and the classifier learns to favor ⟨end⟩ too strongly as a result. This is reminiscent of other settings in which “learning to stop” can be difficult (Misra et al., 2017). To address this, we tune a linear offset in the policy logits (§ 4), just for the ⟨end⟩ token. Formally, the distribution is unchanged for all $a \neq \langle\text{end}\rangle$, but the ⟨end⟩ logit is replaced with $\alpha \left( u_{\langle\text{end}\rangle}^\top h_t + b_{\langle\text{end}\rangle} \right) + \gamma$, where $\alpha$ and $\gamma$ are scalars tuned on the validation data by grid search.
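The ⟨end⟩ correction can be sketched as a scale-and-shift of a single logit before the softmax (the exact parameterization above is a reconstruction from context; the names and example values here are ours):

```python
import math

def softmax_with_end_offset(logits, alpha, gamma, end="<end>"):
    """Apply logit_end <- alpha * logit_end + gamma, then softmax.
    All other action logits are left untouched."""
    adjusted = dict(logits)
    adjusted[end] = alpha * adjusted[end] + gamma
    m = max(adjusted.values())
    exps = {a: math.exp(z - m) for a, z in adjusted.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

logits = {"hello": 1.0, "<end>": 1.0}
p = softmax_with_end_offset(logits, alpha=1.0, gamma=-2.0)
# Down-weighting <end> shifts probability mass to real words:
assert p["<end>"] < p["hello"]
```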
Results on validation and test data are in Table 5, according to four (very) different evaluation measures: Bleu, Meteor (Lavie & Agarwal, 2007), YiSi (Lo, 2018), and Ribes (Isozaki et al., 2010). The most dramatic score difference is the drastically superior performance of left-right according to Bleu. As previously observed (Callison-Burch et al., 2006; Wilks, 2008), Bleu tends to strongly prefer models with left-to-right language models because it focuses so strongly on getting a large number of $n$-grams correct. We found that the annealed model significantly outperforms the left-right model in 1- and 2-gram precision, ties for 3-grams, and loses for 4-grams. This suggests that Bleu score could be improved by explicitly modeling the linearization order in our approach. The other three measures of translation quality are significantly less sensitive to exact word order and focus more on whether the “semantics” is preserved (for varying definitions of “semantics”). For those, we see that the annealed models with the tuned ⟨end⟩ offset are more competitive, though still under-performing left-right by a few percent for Meteor and YiSi.
6 Related Work
Arguably one of the most successful approaches for generating discrete sequences, or sentences, is neural autoregressive modeling (Sutskever et al., 2011; Tomas, 2012). It has become the de facto standard in machine translation (Cho et al., 2014a; Sutskever et al., 2014) and is widely studied for dialogue response generation (Vinyals & Le, 2015) as well as speech recognition (Chorowski et al., 2015). On the other hand, recent work has shown that it is possible to generate a sequence of discrete tokens in parallel by capturing strong dependencies among the tokens in a non-autoregressive way (Gu et al., 2017; Lee et al., 2018; Oord et al., 2017). Stern et al. (2018) and Wang et al. (2018) proposed to mix these two paradigms and build a semi-autoregressive sequence generator, while largely sticking to left-to-right generation. Our proposal radically departs from these conventional approaches by building an algorithm that learns a generation order automatically, rather than fixing it in advance.
In (neural) language modeling, there is a long tradition of modeling the probability of a sequence as a tree or directed graph. For example, Emami & Jelinek (2005) proposed to factorize the probability over a sentence following its syntactic structure and train a neural network to model the conditional distributions, which was followed more recently by Zhang et al. (2015) and Dyer et al. (2016). This approach was applied to neural machine translation by Eriguchi et al. (2017) and Aharoni & Goldberg (2017). In all cases, these approaches require the availability of a ground-truth parse of the sentence or access to an external parser during training or inference. This is unlike the proposed approach, which does not require any such extra annotation or tool and learns to sequentially generate a sequence in an automatically determined non-monotonic order.
7 Conclusion, Limitations & Future Work
We described an approach to generating text in non-monotonic orders that fall out naturally as the result of learning. We explored several different oracle models for imitation, and found that an annealed “coaching” oracle performed best, and learned a “best-first” strategy for language modeling, where it appears to significantly outperform alternatives. On a word re-ordering task, we found that this approach essentially ties left-to-right decoding, a rather promising finding given the decades of work on left-to-right models. In a machine translation setting, we found that, after tuning the probability of ending subtrees, the model learns to translate in a way that tends to preserve meaning but not n-grams.
There are several potentially interesting avenues for future work. One is to solve the “learning to stop” problem directly, rather than through an after-the-fact tuning step. Another is to better understand how to construct an oracle that generalizes well after mistakes have been made, in order to train off of the gold path(s).
Moreover, the proposed formulation of sequence generation as tree generation is limited to binary trees. The approach could be extended to K-ary trees by designing a policy that outputs up to K decisions at each node, leading to up to K child nodes. This would enlarge the set of generation orders the proposed approach can capture, which would then include all projective dependency parses. A new oracle would need to be designed to ensure that well-balanced K-ary trees are assigned sufficient probability, and we leave this as future work.
Finally, although the proposed approach indeed learns to sequentially generate a sequence in a non-monotonic order, it cannot consider all possible orders. This is due to the constraint that no two edges may cross when the nodes (excluding ⟨end⟩ nodes) are arranged on a line following an in-order traversal, which we refer to as projective generation. Extending the proposed approach to non-projective generation, which we leave as future work, would expand the set of generation orders considered during learning.
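To make the projectivity constraint concrete, the sketch below shows how a generated binary tree is linearized into its token sequence by an in-order traversal; because each token's position is fixed by the traversal, edges can never cross once the nodes are laid out on a line. This is an illustrative minimal example, not the paper's implementation, and the `Node` class and `<end>` marker names are our own.

```python
# Minimal sketch: a generated binary tree yields its final token sequence
# via an in-order traversal. <end> leaves terminate subtrees and emit nothing.

class Node:
    def __init__(self, token, left=None, right=None):
        self.token, self.left, self.right = token, left, right

def inorder(node):
    """Recover the sequence from a generated tree, skipping <end> nodes."""
    if node is None or node.token == "<end>":
        return []
    return inorder(node.left) + [node.token] + inorder(node.right)

# Example: a tree rooted at "sat", generated non-monotonically, still
# linearizes to the left-to-right sentence "the cat sat down".
tree = Node("sat",
            left=Node("cat", left=Node("the"), right=Node("<end>")),
            right=Node("down"))
print(inorder(tree))  # ['the', 'cat', 'sat', 'down']
```

Any order in which the tree's nodes are *produced* is allowed, but the resulting sequence is always the in-order traversal, which is exactly the projectivity restriction discussed above.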
- Aharoni & Goldberg (2017) Aharoni, R. and Goldberg, Y. Towards string-to-tree neural machine translation. arXiv preprint arXiv:1704.04743, 2017.
- Alvarez-Melis & Jaakkola (2017) Alvarez-Melis, D. and Jaakkola, T. S. Tree-structured decoding with doubly-recurrent neural networks. International Conference on Learning Representations (ICLR), 2017.
- Bahdanau et al. (2015a) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015a.
- Bahdanau et al. (2015b) Bahdanau, D., Serdyuk, D., Brakel, P., Ke, N. R., Chorowski, J., Courville, A., and Bengio, Y. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456, 2015b.
- Bahdanau et al. (2016) Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
- Bahl et al. (1983) Bahl, L. R., Jelinek, F., and Mercer, R. L. A maximum likelihood approach to continuous speech recognition. IEEE transactions on pattern analysis and machine intelligence, 5(2):179–190, 1983.
- Battaglia et al. (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
- Berglund et al. (2015) Berglund, M., Raiko, T., Honkala, M., Kärkkäinen, L., Vetek, A., and Karhunen, J. T. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pp. 856–864, 2015.
- Bowman et al. (2016) Bowman, S. R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C. D., and Potts, C. A fast unified model for parsing and sentence understanding. arXiv preprint arXiv:1603.06021, 2016.
- Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- Brown et al. (1990) Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. A statistical approach to machine translation. Comput. Linguist., 16(2):79–85, June 1990. ISSN 0891-2017. URL http://dl.acm.org/citation.cfm?id=92858.92860.
- Caccia et al. (2018) Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. Language GANs falling short. arXiv preprint arXiv:1811.02549, 2018.
- Callison-Burch et al. (2006) Callison-Burch, C., Osborne, M., and Koehn, P. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006.
- Chang et al. (2015) Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé III, H., and Langford, J. Learning to search better than your teacher. arXiv preprint arXiv:1502.02206, 2015.
- Cheng et al. (2018) Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
- Chiang (2012) Chiang, D. Hope and fear for discriminative training of statistical translation models. Journal of Machine Learning Research, 13(Apr):1159–1187, 2012.
- Cho et al. (2014a) Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014a.
- Cho et al. (2014b) Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014b.
- Cho et al. (2015) Cho, K., Courville, A., and Bengio, Y. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886, 2015.
- Chorowski et al. (2015) Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585, 2015.
- Cleeremans et al. (1989) Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. Finite state automata and simple recurrent networks. Neural computation, 1(3):372–381, 1989.
- Daumé et al. (2009) Daumé, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine learning, 75(3):297–325, 2009.
- Daumé (2009) Daumé III, H. Unsupervised search-based structured prediction. In International Conference on Machine Learning (ICML), Montreal, Canada, 2009.
- Dyer et al. (2015) Dyer, C., Ballesteros, M., Ling, W., Matthews, A., and Smith, N. A. Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075, 2015.
- Dyer et al. (2016) Dyer, C., Kuncoro, A., Ballesteros, M., and Smith, N. A. Recurrent neural network grammars. arXiv preprint arXiv:1602.07776, 2016.
- Emami & Jelinek (2005) Emami, A. and Jelinek, F. A neural syntactic language model. Machine learning, 60(1-3):195–227, 2005.
- Eriguchi et al. (2017) Eriguchi, A., Tsuruoka, Y., and Cho, K. Learning to parse and translate improves neural machine translation. arXiv preprint arXiv:1702.03525, 2017.
- Forcada & Ñeco (1997) Forcada, M. L. and Ñeco, R. Recursive hetero-associative memories for translation. In International Work-Conference on Artificial Neural Networks, 1997.
- Ford et al. (2018) Ford, N., Duckworth, D., Norouzi, M., and Dahl, G. E. The importance of generation order in language modeling. arXiv preprint arXiv:1808.07910, 2018.
- Goldberg & Elhadad (2010) Goldberg, Y. and Elhadad, M. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 742–750. Association for Computational Linguistics, 2010.
- Gu et al. (2017) Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
- Hazan et al. (2010) Hazan, T., Keshet, J., and McAllester, D. A. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pp. 1594–1602, 2010.
- He et al. (2012) He, H., Eisner, J., and Daume, H. Imitation learning by coaching. In Advances in Neural Information Processing Systems, pp. 3149–3157, 2012.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Isozaki et al. (2010) Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsukada, H. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 944–952. Association for Computational Linguistics, 2010.
- Koehn et al. (2007) Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177–180. Association for Computational Linguistics, 2007.
- Lavie & Agarwal (2007) Lavie, A. and Agarwal, A. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 228–231. Association for Computational Linguistics, 2007.
- Leblond et al. (2018) Leblond, R., Alayrac, J.-B., Osokin, A., and Lacoste-Julien, S. SeaRNN: Training RNNs with global-local losses. In ICLR, 2018.
- Lee et al. (2018) Lee, J., Mansimov, E., and Cho, K. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901, 2018.
- Lo (2018) Lo, C. YiSi: A semantic machine translation evaluation metric for evaluating languages with different levels of available resources. Unpublished, 2018. URL http://chikiu-jackie-lo.org/home/index.php/yisi.
- Luong et al. (2015) Luong, T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Association for Computational Linguistics, 2015.
- Misra et al. (2017) Misra, D., Langford, J., and Artzi, Y. Mapping instructions and visual observations to actions with reinforcement learning. In Empirical Methods in Natural Language Processing (EMNLP), 2017.
- Oord et al. (2017) Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
- Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
- Ranzato et al. (2015) Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
- Ross & Bagnell (2014) Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
- Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.
- Schmaltz et al. (2016) Schmaltz, A., Rush, A. M., and Shieber, S. Word ordering without syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2319–2324. Association for Computational Linguistics, 2016. doi: 10.18653/v1/D16-1255. URL http://aclweb.org/anthology/D16-1255.
- Sennrich et al. (2015) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Stern et al. (2018) Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pp. 10107–10116, 2018.
- Stoyanov & Eisner (2012) Stoyanov, V. and Eisner, J. Easy-first coreference resolution. Proceedings of COLING 2012, pp. 2519–2534, 2012.
- Sutskever et al. (2011) Sutskever, I., Martens, J., and Hinton, G. E. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Tai et al. (2015) Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
- Tomas (2012) Tomas, M. Statistical language models based on neural networks. Brno University of Technology, 2012.
- Tsuruoka & Tsujii (2005) Tsuruoka, Y. and Tsujii, J. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pp. 467–474. Association for Computational Linguistics, 2005.
- Vinyals & Le (2015) Vinyals, O. and Le, Q. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
- Wang et al. (2018) Wang, C., Zhang, J., and Chen, H. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583, 2018.
- Welleck et al. (2018) Welleck, S., Yao, Z., Gai, Y., Mao, J., Zhang, Z., and Cho, K. Loss functions for multiset prediction. In Advances in Neural Information Processing Systems, pp. 5788–5797, 2018.
- Wilks (2008) Wilks, Y. Machine translation: its scope and limits. Springer Science & Business Media, 2008.
- Yu et al. (2016) Yu, L., Zhang, W., Wang, J., and Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016. URL http://dblp.uni-trier.de/db/journals/corr/corr1609.html#YuZWY16.
- Zhang et al. (2018) Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., and Weston, J. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2204–2213, Melbourne, Australia, 2018. Association for Computational Linguistics.
- Zhang et al. (2015) Zhang, X., Lu, L., and Lapata, M. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060, 2015.
- Zhu et al. (2018) Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. SIGIR, 2018.
Appendix A Additional Experiment Details
A.1 Word Reordering
The decoder is a 2-layer LSTM with 1024 hidden units and dropout of 0.0, chosen by a preliminary grid search. Word embeddings are initialized with GloVe vectors and updated during training. All presented Word Reordering results use greedy decoding.
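Under these settings, the decoder might be configured as follows. This is a sketch only, assuming a PyTorch implementation (the paper does not specify the framework here); the vocabulary size and the 300-dimensional embedding size are illustrative placeholders, with the latter matching common pre-trained GloVe vectors.

```python
import torch.nn as nn

# Illustrative configuration matching the text: 2-layer LSTM decoder with
# 1024 hidden units and no dropout. vocab_size/embed_dim are placeholders.
vocab_size, embed_dim = 20000, 300

embedding = nn.Embedding(vocab_size, embed_dim)  # init from GloVe, then fine-tune
decoder = nn.LSTM(input_size=embed_dim, hidden_size=1024,
                  num_layers=2, dropout=0.0, batch_first=True)
```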
For the annealed coaching oracle, the coaching coefficient is linearly annealed from 1.0 to 0.0 at a rate of 0.05 per epoch, after a burn-in period of 20 epochs in which it is not decreased. We use greedy decoding when the policy is selected at a roll-in step; we did not observe significant performance variation with stochastic sampling from the policy. These settings are based on a grid search using the model selected in the Model section above.
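The annealing schedule just described can be sketched as a small function (the function name and epoch convention are illustrative, not from the paper):

```python
def coaching_coefficient(epoch, burn_in=20, rate=0.05):
    """Linear annealing of the coaching coefficient from 1.0 toward 0.0.

    Held at 1.0 for the first `burn_in` epochs, then decreased by `rate`
    per epoch and clamped at 0.0.
    """
    return max(0.0, 1.0 - rate * max(0, epoch - burn_in))

# The coefficient stays at 1.0 through burn-in, then decays linearly:
assert coaching_coefficient(0) == 1.0
assert coaching_coefficient(20) == 1.0
assert coaching_coefficient(30) == 0.5
assert coaching_coefficient(40) == 0.0
```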
A.2 Unconditional Generation
We use the same settings as the Word Reordering experiments, except that we always sample stochastically from the policy during roll-in. For evaluation we select the model state at the end of training.
Appendix B Additional Results
B.1 Unconditional Generation
B.1.1 Frequency Plots
B.1.2 Unconditional Samples
B.1.3 Additional BLEU Scores
B.2 Word Reordering
Figure 10 shows example predictions from the validation set, including the generation order and underlying tree.
left-right:
- i can drive you alone .
- do you like to test your voice to a choir ?
- yeah it is very important .
- no pets , on the subject in my family , yes .
- i am a am nurse .
- cool . i have is also a cat named cow .
- do you actually enjoy it ?
- i am doing good taking a break from working on it .
- what pair were you in ?
- i do not have one , do you have any pets ?

uniform:
- good just normal people around .
- just that is for a while . and yourself right now ?
- you run the hills right ?
- i am freelance a writer but i am a writer .
- i am great yourself ?
- that is so sad . do you have a free time ?
- i work 12 hours .
- yes i do not like pizza which is amazing lol .
- do you go to hockey ?
- since the gym did not bother me many years ago .

annealed:
- are you ? i am .
- yeah it can be . what is your favorite color ?
- i like to be talented .
- i do not have dogs . they love me here .
- how are you doing buddy ?
- no kids . . . i am . . you ?
- i like healthy foods .
- that is interesting . i am just practicing my piano degree .
- i love to eat .
- yea it is , you need to become a real nerd !
left-right:
- nice ! i think i will get a jump blade again . have you done that at it ?
- great . what kinds of food do you like best ? i love italian food .
- wow . bike ride is my thing . i do nothing for kids .
- i am alright . my mom makes work and work as a nurse . that is what i do for work .
- that is awesome . i need to lose weight . i want to start a food place someday .

uniform:
- love meat . or junk food . i sometimes go too much i make . avoid me unhealthy .
- does not kill anyone that can work around a lot of animals ? you ? i like trains .
- baby ? it will it all here . that is the workforce .
- i am good , thank you . i love my sci fi stories . i write books .
- i am well . thank you . my little jasper is new .

annealed:
- i am definitely a kid . are you ? i am 10 !
- i am in michigan state . . that is a grand state .
- that is good . i work as a pharmacist in florida . . .
- how are you ? wanna live in san fran ! i love it .
- well that is awesome ! i do crosswords ! that is cool .