Connecting the Dots Between MLE and RL for Sequence Generation

11/24/2018 ∙ by Bowen Tan, et al. ∙ Petuum, Inc. ∙ Carnegie Mellon University

Sequence generation models such as recurrent networks can be trained with a diverse set of learning algorithms. For example, maximum likelihood learning is simple and efficient, yet suffers from the exposure bias problem. Reinforcement learning like policy gradient addresses the problem but can have prohibitively poor exploration efficiency. A variety of other algorithms such as RAML, SPG, and data noising, have also been developed from different perspectives. This paper establishes a formal connection between these algorithms. We present a generalized entropy regularized policy optimization formulation, and show that the apparently divergent algorithms can all be reformulated as special instances of the framework, with the only difference being the configurations of reward function and a couple of hyperparameters. The unified interpretation offers a systematic view of the varying properties of exploration and learning efficiency. Besides, based on the framework, we present a new algorithm that dynamically interpolates among the existing algorithms for improved learning. Experiments on machine translation and text summarization demonstrate the superiority of the proposed algorithm.




1 Introduction

Sequence generation is a ubiquitous problem in many applications, such as machine translation (Wu et al., 2016; Sutskever et al., 2014), text summarization (Hovy and Lin, 1998; Rush et al., 2015), image captioning (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015), and so forth. Great advances in these tasks have been made by the development of sequence models such as recurrent neural networks (RNNs) with different cells (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015). These models can be trained with a variety of learning algorithms.

The standard training algorithm is based on maximum-likelihood estimation (MLE), which seeks to maximize the log-likelihood of ground-truth sequences. Despite its computational simplicity and efficiency, MLE training suffers from exposure bias (Ranzato et al., 2016). That is, the model is trained to predict the next token given the previous ground-truth tokens, while at test time, since the resulting model does not have access to the ground truth, tokens generated by the model itself are instead used to make the next prediction. This discrepancy between training and test leads to the issue that mistakes in prediction can quickly accumulate. Recent efforts have been made to alleviate the issue, many of which resort to reinforcement learning (RL) techniques (Ranzato et al., 2016; Bahdanau et al., 2017; Ding and Soricut, 2017). For example, Ranzato et al. (2016) adopt policy gradient (Sutton et al., 2000), which avoids the training/test discrepancy by using the same decoding strategy. However, RL-based approaches for sequence generation can face challenges of prohibitively poor sample efficiency and high variance. For more practical training, a diverse set of methods has been developed that sit in a middle ground between the two paradigms of MLE and RL. For example, RAML (Norouzi et al., 2016) adds reward-aware perturbation to the MLE data examples; SPG (Ding and Soricut, 2017) leverages the reward distribution for effective sampling in policy gradient. Other approaches such as data noising (Xie et al., 2017) also show improved results.
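The train/test discrepancy behind exposure bias can be illustrated with a small sketch. The lookup-table "model", its single wrong entry, and the `decode` helper below are all hypothetical constructions for illustration, not part of any method discussed in this paper. With ground-truth prefixes (teacher forcing) a single mistake stays isolated, whereas feeding the model its own predictions moves it onto prefixes it has never seen.

```python
# Hypothetical toy "model": maps a prefix to the next token.
# It was fit on ground-truth prefixes only and contains one mistake.
y_star = ["a", "b", "c", "d"]
table = {
    (): "a",
    ("a",): "x",            # mistake: the ground-truth token is "b"
    ("a", "b"): "c",
    ("a", "b", "c"): "d",
}

def next_token(prefix):
    # Prefixes never seen in training fall back to "<unk>".
    return table.get(tuple(prefix), "<unk>")

def decode(y_star, teacher_forcing):
    prefix, outputs = [], []
    for gold in y_star:
        pred = next_token(prefix)
        outputs.append(pred)
        # Training-style decoding conditions on the gold token,
        # test-style decoding conditions on the model's own prediction.
        prefix.append(gold if teacher_forcing else pred)
    return outputs

teacher_forced = decode(y_star, teacher_forcing=True)
free_running = decode(y_star, teacher_forcing=False)
errors_tf = sum(p != g for p, g in zip(teacher_forced, y_star))
errors_fr = sum(p != g for p, g in zip(free_running, y_star))
```

In this toy setting the teacher-forced pass makes one error while the free-running pass makes three, since every step after the first mistake conditions on an unseen prefix.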

In this paper, we establish a unified perspective of the broad set of learning algorithms. Specifically, we present a generalized entropy regularized policy optimization framework, and show that the apparently diverse algorithms, such as MLE, RAML, SPG, and data noising, can all be re-formulated as special instances of the framework, with the only difference being the choice of reward and the values of a couple of hyperparameters (Figure 1). In particular, we show MLE is equivalent to using a delta-function reward that assigns 1 to samples that exactly match data examples and $-\infty$ to any other samples. Such an extremely restricted reward disables any exploration of the model beyond the training data, yielding the exposure bias. Other algorithms essentially use smoother rewards, and also leverage the model distribution for exploration, which generally results in a larger effective exploration space, more difficult training, and better test-time performance.

Besides the new understanding of the existing algorithms, the unified perspective also facilitates the development of new algorithms for improved learning. We present an example new algorithm that, as training proceeds, gradually expands the exploration space by annealing the reward and hyperparameter values. The annealing in effect dynamically interpolates among the existing algorithms. Experiments on machine translation and text summarization show the interpolation algorithm achieves significant improvement over the various existing methods.

2 Related Work

Sequence generation models are usually trained to maximize the log-likelihood of data by feeding the ground-truth tokens during decoding. Reinforcement learning (RL) addresses the discrepancy between training and test by also using the model's own predictions at training time. Various RL approaches have been applied to sequence generation, such as policy gradient (Ranzato et al., 2016) and actor-critic (Bahdanau et al., 2017). Softmax policy gradient (SPG) (Ding and Soricut, 2017) additionally incorporates the reward distribution to generate high-quality sequence samples. The algorithm is derived by applying a log-softmax trick to adapt the standard policy gradient objective. Reward augmented maximum likelihood (RAML) (Norouzi et al., 2016) is an algorithm in between MLE and policy gradient. It was originally developed to go beyond the maximum likelihood criterion and incorporate a task metric (such as BLEU for machine translation) to guide model learning. Mathematically, RAML shows that MLE and maximum-entropy policy gradient respectively minimize KL divergences in opposite directions. We reformulate both SPG and RAML from a new perspective, and show that they are precisely instances of a general entropy regularized policy optimization framework. The new framework provides a more principled formulation for both algorithms. Besides the algorithms discussed in the paper, there are other learning methods for sequence models. For example, Hal Daumé et al. (2009); Leblond et al. (2018); Wiseman and Rush (2016) use a learning-to-search paradigm for sequence generation or structured prediction. Scheduled Sampling (Bengio et al., 2015) adapts MLE by randomly replacing ground-truth tokens with model predictions as the input for decoding the next-step token. Hu et al. (2017); Yang et al. (2018); Fedus et al. (2018) learn (conditional) text generation with holistic discriminators. Zhu et al. (2018) explore the new setting of text infilling that leverages both left- and right-side context for generation.

Policy optimization for reinforcement learning has been studied extensively in robotics and game environments. For example, Peters et al. (2010) introduce a relative entropy regularization to reduce information loss during learning. Schulman et al. (2015) develop a trust-region approach for monotonic improvement. Dayan and Hinton (1997); Levine (2018); Abdolmaleki et al. (2018) study policy optimization algorithms from a probabilistic inference perspective. Hu et al. (2018b) show the connections between policy optimization, Bayesian posterior regularization (Hu et al., 2016; Ganchev et al., 2010), and GANs (Goodfellow et al., 2014) for combining structured knowledge with deep generative models. The entropy-regularized policy optimization formulation presented here can be seen as a generalization of many of the previous policy optimization methods, as shown in the next section. In addition, we formulate the framework in the sequence generation context.

3 Connecting the Dots

We first present a generalized formulation of an entropy regularized policy optimization framework, to which a broad set of learning algorithms for sequence generation are connected. In particular, we show the conventional maximum likelihood learning is a special case of the policy optimization formulation. This provides new understandings of the exposure bias problem as well as the exploration efficiency of the algorithms. We further show that the framework subsumes as special cases other well-known learning methods that were originally developed in diverse perspectives. We thus establish a unified, principled view of the broad class of works.

Let us first set up the notation for the sequence generation setting. Let $x$ be the input and $y = (y_1, \dots, y_T)$ the sequence of $T$ tokens in the target space. For example, in machine translation, $x$ is the sentence in the source language and $y$ is the sentence in the target language. Let $(x, y^*)$ be a training example drawn from the empirical data distribution, where $y^*$ is the ground-truth sequence. We aim to learn a sequence generation model $p_\theta(y|x) = \prod_t p_\theta(y_t | y_{1:t-1}, x)$ parameterized with $\theta$. The model can, for example, be a recurrent network. It is worth noting that though we present the framework in the sequence generation context, the formulations can straightforwardly be extended to other settings such as robotics and game environments.

Figure 1: A unified formulation of different learning algorithms. Each algorithm is a special instance of the general ERPO framework taking certain specifications of the hyperparameters (Eq.1).

3.1 Generalized Entropy Regularized Policy Optimization (ERPO)

Policy optimization is a family of reinforcement learning (RL) algorithms that seek to learn the parameter $\theta$ of the model $p_\theta$ (a.k.a. policy). Given a reward function $R(y|y^*)$ (e.g., BLEU score in machine translation) that evaluates the quality of generation $y$ against the true $y^*$, the general goal of policy optimization is to maximize the expected reward. A rich research line of entropy regularized policy optimization (ERPO) stabilizes the learning by augmenting the objective with information theoretic regularizers. Here we present a generalized formulation of ERPO. Assuming a general distribution $q(y|x)$ (more details below), the objective we adopt is written as

$$\mathcal{L}(q, \theta) = \mathbb{E}_{q}\left[R(y|y^*)\right] - \alpha\,\mathrm{KL}\left(q(y|x)\,\|\,p_\theta(y|x)\right) + \beta\,\mathrm{H}(q), \tag{1}$$

where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence forcing $q$ to stay close to $p_\theta$; $\mathrm{H}(\cdot)$ is the Shannon entropy imposing a maximum entropy assumption on $q$; and $\alpha$ and $\beta$ are balancing weights of the respective terms. In the RL literature, the distribution $q$ has taken various forms, leading to different policy optimization algorithms. For example, setting $q$ to a non-parametric policy and $\beta = 0$ results in the prominent relative entropy policy search (Peters et al., 2010) algorithm. Assuming $q$ as a parametric distribution and $\alpha = 0$ leads to the commonly-used maximum entropy policy gradient (Ziebart, 2010; Haarnoja et al., 2017). Letting $q$ be a variational distribution and $\beta = 0$ corresponds to the probabilistic inference formulation of policy gradient (Abdolmaleki et al., 2018; Levine, 2018). Related objectives have also been used in other popular RL algorithms (Schulman et al., 2015, 2017; Teh et al., 2017).

We assume a non-parametric $q$. The above objective can be maximized with an EM-style procedure that iterates two coordinate ascent steps optimizing $q$ and $\theta$, respectively. At iteration $n$:

$$\begin{aligned} \text{E-step:}\quad & q^{n+1}(y|x) \propto \exp\left\{\frac{\alpha \log p_{\theta^n}(y|x) + R(y|y^*)}{\alpha + \beta}\right\}, \\ \text{M-step:}\quad & \theta^{n+1} = \arg\max_\theta\, \mathbb{E}_{q^{n+1}}\left[\log p_\theta(y|x)\right]. \end{aligned} \tag{2}$$

The E-step is obtained with simple Lagrange multipliers (Hu et al., 2016). Note that $q$ has a closed-form solution in the E-step. We can have an intuitive interpretation of its form. First, it is clear that as $\alpha \to \infty$, we have $q^{n+1} = p_{\theta^n}$. This is also reflected in the objective Eq.(1), where the weight $\alpha$ encourages $q$ to be close to $p_\theta$. Second, the weight $\beta$ serves as the temperature of the $q$ softmax distribution. In particular, a large temperature $\beta \to \infty$ makes $q$ a uniform distribution, which is consistent with the outcome of an infinitely large maximum entropy regularization in Eq.(1). Regarding the M-step, the update rule can be interpreted as maximizing the log-likelihood of samples from the distribution $q$.
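The two limiting behaviours of the E-step can be checked numerically on a small enumerated sample space. The following is a minimal sketch, assuming the closed form $q(y|x) \propto \exp\{(\alpha \log p_\theta(y|x) + R(y|y^*)) / (\alpha + \beta)\}$; the four-candidate space, the probabilities, and the reward values are hypothetical numbers chosen only for illustration.

```python
import numpy as np

def e_step(log_p, reward, alpha, beta):
    """Closed-form E-step: q(y) proportional to exp{(alpha*log p(y) + R(y)) / (alpha+beta)}."""
    logits = (alpha * log_p + reward) / (alpha + beta)
    logits = logits - np.max(logits)   # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()

# Hypothetical toy sample space of four candidate sequences.
p = np.array([0.1, 0.2, 0.3, 0.4])     # current model distribution p_theta
R = np.array([1.0, 0.5, 0.0, -1.0])    # reward of each candidate
log_p = np.log(p)

q_balanced = e_step(log_p, R, alpha=1.0, beta=1.0)
q_large_alpha = e_step(log_p, R, alpha=1e6, beta=1.0)  # approaches p_theta
q_large_beta = e_step(log_p, R, alpha=1.0, beta=1e6)   # approaches uniform
```

Here `q_large_alpha` is numerically close to `p`, and `q_large_beta` is close to the uniform distribution over the four candidates, matching the interpretation of $\alpha$ and $\beta$ given above.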

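In the context of sequence generation, it is sometimes more convenient to express the equations at the token level, as shown shortly. To this end, we decompose $R(y|y^*)$ along the time steps:

$$R(y|y^*) = \sum_t \left[ R(y_{1:t}|y^*) - R(y_{1:t-1}|y^*) \right] =: \sum_t \Delta R(y_t \mid y^*, y_{1:t-1}), \tag{3}$$

where $\Delta R(y_t | y^*, y_{1:t-1})$ measures the reward contributed by token $y_t$. The solution of $q$ in Eq.(2) can then be re-written as:

$$q^{n+1}(y|x) \propto \prod_t \exp\left\{\frac{\alpha \log p_{\theta^n}(y_t | y_{1:t-1}, x) + \Delta R(y_t | y^*, y_{1:t-1})}{\alpha + \beta}\right\}. \tag{4}$$

The above ERPO framework has three key hyperparameters, namely $(R, \alpha, \beta)$. In the following, we show that different values of the three hyperparameters correspond to different learning algorithms (Figure 1). We first connect MLE to the above general formulation, and compare and discuss the properties of MLE and regular ERPO from the new perspective.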

3.2 MLE as a Special Case of ERPO

Maximum likelihood estimation is the most widely-used approach to learn a sequence generation model due to its simplicity and efficiency. It aims to find the optimal parameter value that maximizes the data log-likelihood:

$$\theta^* = \arg\max_\theta \log p_\theta(y^*|x) = \arg\max_\theta \sum_t \log p_\theta(y^*_t \mid y^*_{1:t-1}, x). \tag{5}$$

As discussed in section 1, MLE suffers from the exposure bias problem as the model is only exposed to the training data, rather than its own predictions, by using the ground-truth subsequence $y^*_{1:t-1}$ to evaluate the probability of $y^*_t$.


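We show that the MLE objective can be recovered from Eq.(2) with specific reward and weight configurations. Consider a $\delta$-reward defined as:

$$R_\delta(y|y^*) = \begin{cases} 1 & \text{if } y = y^*, \\ -\infty & \text{otherwise}. \end{cases} \tag{6}$$

(Footnote: For the token level, define $R_\delta(y_{1:t}|y^*) = t/T^*$ if $y_{1:t} = y^*_{1:t}$ and $-\infty$ otherwise, where $T^*$ is the length of $y^*$. Note that the reward value $1$ can also be set to any constant larger than $-\infty$.)

Let $(R = R_\delta, \alpha \to 0, \beta = 1)$. From the E-step of Eq.(2), we have $q(y|x) = 1$ if $y = y^*$ and $0$ otherwise. The M-step is therefore equivalent to $\arg\max_\theta \log p_\theta(y^*|x)$, which recovers precisely the MLE objective in Eq.(5).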

That is, MLE can be seen as an instance of the policy optimization algorithm with the $\delta$-reward and the above weight values. Any sample $y$ that fails to match the data $y^*$ precisely receives a negative infinite reward and never contributes to model learning.
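The collapse of the E-step onto the ground truth can be verified on a toy example. The sketch below assumes the E-step form $q \propto \exp\{(\alpha \log p_\theta + R)/(\alpha+\beta)\}$ with a $\delta$-reward, a very small $\alpha$ standing in for $\alpha \to 0$, and $\beta = 1$; the candidate set and model probabilities are hypothetical.

```python
import numpy as np

# Hypothetical toy sample space; the first candidate is the ground truth.
candidates = ["a b c", "a b d", "a c c", "b b c"]
y_star = "a b c"
p = np.array([0.1, 0.2, 0.3, 0.4])     # current model distribution
log_p = np.log(p)

# delta-reward: 1 for an exact match, -inf for everything else.
R_delta = np.array([1.0 if y == y_star else -np.inf for y in candidates])

alpha, beta = 1e-8, 1.0                # alpha close to 0, beta = 1
logits = (alpha * log_p + R_delta) / (alpha + beta)
q = np.exp(logits - logits.max())      # exp(-inf) evaluates to 0
q = q / q.sum()

# M-step objective E_q[log p(y)], summed over the support of q only.
support = q > 0
m_step_objective = float(np.sum(q[support] * log_p[support]))
```

In this sketch `q` places all of its mass on `y_star`, so the M-step objective equals $\log p_\theta(y^*|x)$, which is the MLE objective for this example.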

Exploration efficiency

The ERPO reformulation of MLE provides a new statistical explanation of the exposure bias problem. Specifically, a very small $\alpha$ value makes the model distribution ignored during sampling from $q$, while the $\delta$-reward permits only samples that match training examples. The two factors in effect rule out any exploration beyond the small set of training data (Figure 2(a)), leading to a brittle model that performs poorly at test time due to the extremely restricted exploration. On the other hand, a key advantage of the $\delta$-reward specification is that its regular reward shape allows extreme pruning of the huge sample space, resulting in a space that includes exactly the training examples. This makes the MLE implementation very simple and the computation very efficient in practice.

In contrast, common rewards (e.g., BLEU) used in policy optimization are smoother than the $\delta$-reward, and permit exploration in a broader space. However, such rewards usually do not have a regular shape like the $\delta$-reward, and thus are not amenable to sample space pruning. Generally, a larger exploration space leads to a harder training problem. Also, when it comes to the huge sample space, the rewards are still very sparse (e.g., most sequences have BLEU=0 against a reference sequence). Such reward sparsity can make exploration inefficient and even impractical.

Given the opposite algorithm behaviors in terms of exploration and computation efficiency, it is a natural idea to seek a middle ground between the two extremes to combine the advantages of both. A broad set of such approaches have been recently developed. We re-visit some of the popular ones, and show that these apparently divergent approaches can all be reformulated within our ERPO framework (Eqs.1-4) with varying reward and weight specifications.

Figure 2: Effective exploration space of different algorithms. (a): The exploration space of MLE is exactly the set of training examples. (b): RAML and Data Noising use smooth rewards and allow larger exploration space surrounding the training examples. (c): Common policy optimization such as SPG basically allows the whole exploration space.

3.3 Reward-Augmented Maximum Likelihood (RAML)

RAML (Norouzi et al., 2016) was originally proposed to incorporate task metric reward into MLE training, and has shown superior performance to vanilla MLE. Specifically, it introduces an exponentiated reward distribution $e(y|y^*) \propto \exp\{R(y|y^*)\}$ where $R$, as in vanilla policy optimization, is a task metric such as BLEU. RAML maximizes the following objective:

$$\mathcal{L}_{\text{RAML}}(\theta) = \mathbb{E}_{y \sim e(y|y^*)}\left[\log p_\theta(y|x)\right]. \tag{7}$$

That is, unlike MLE, which directly maximizes the data log-likelihood, RAML first perturbs the data proportionally to the reward distribution $e$, and maximizes the log-likelihood of the resulting samples.

The RAML objective reduces to the vanilla MLE objective if we replace the task reward $R$ in $e(y|y^*)$ with the MLE $\delta$-reward (Eq.6). The relation between MLE and RAML still holds within our new formulation (Eqs.1-2). In particular, similar to how we recovered MLE from Eq.(2), let $(\alpha \to 0, \beta = 1)$, but set $R$ to the task metric reward; then the M-step of Eq.(2) is precisely equivalent to maximizing the above RAML objective. (Footnote: The exponentiated reward distribution can also include a temperature $\tau$ (Norouzi et al., 2016). In this case, we set $\beta = \tau$.)
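As an illustration of the perturbation step, the sketch below builds an exponentiated reward distribution over a tiny enumerable sample space and draws perturbed targets from it. Negative Hamming distance is used here as a hypothetical stand-in for a task metric such as BLEU, and the vocabulary, sequence length, and temperature are assumptions chosen so that the whole space can be enumerated; practical systems sample without enumeration.

```python
import itertools
import numpy as np

vocab = ["a", "b", "c"]                 # hypothetical toy vocabulary
y_star = ("a", "b", "c")                # ground-truth sequence

def reward(y, y_star):
    # Stand-in for a task metric: negative Hamming distance to y_star.
    return -sum(t != s for t, s in zip(y, y_star))

def exponentiated_reward_distribution(y_star, tau=1.0):
    """Return all same-length candidates and e(y|y*) proportional to exp{R(y|y*) / tau}."""
    candidates = list(itertools.product(vocab, repeat=len(y_star)))
    scores = np.array([reward(y, y_star) / tau for y in candidates])
    probs = np.exp(scores - scores.max())
    return candidates, probs / probs.sum()

candidates, e = exponentiated_reward_distribution(y_star, tau=0.5)

# Draw perturbed targets; each would then be used as an MLE target.
rng = np.random.default_rng(0)
idx = rng.choice(len(candidates), size=5, p=e)
perturbed_targets = [candidates[i] for i in idx]

# A very low temperature concentrates the mass on y_star, approaching MLE.
_, e_cold = exponentiated_reward_distribution(y_star, tau=1e-3)
```

The mode of `e` is the ground truth itself, and lowering `tau` sharpens the distribution toward the $\delta$-reward case.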

Formulating RAML within the same framework allows an immediate comparison between RAML and the other algorithms. In particular, compared to MLE, the use of a smooth task metric reward $R$ instead of $R_\delta$ permits a larger effective exploration space surrounding the training data (Figure 2(b)), which helps to alleviate the exposure bias problem. On the other hand, as in MLE, $\alpha \to 0$ still limits the exploration as it ignores the model distribution. Thus, RAML takes a step from MLE toward regular RL, and has an effective exploration space size and exploration efficiency in between.

3.4 Softmax Policy Gradient (SPG)

SPG (Ding and Soricut, 2017) was developed from the perspective of adapting the vanilla policy gradient (Sutton et al., 2000) to use reward for sampling. SPG has the following objective:

$$\mathcal{L}_{\text{SPG}}(\theta) = \log \mathbb{E}_{p_\theta}\left[\exp\{R(y|y^*)\}\right], \tag{8}$$

where $R$ is a common reward as above. As a variant of the standard policy gradient algorithm, SPG aims to address the exposure bias problem and shows promising results (Ding and Soricut, 2017).

We show SPG can readily fit into our ERPO framework. Specifically, taking the gradient of Eq.(8) w.r.t. $\theta$, we immediately get the same update rule as in Eq.(2) with $(\alpha = 1, \beta = 0, R = \text{common reward})$.
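The gradient step can be spelled out as follows. This is a short derivation sketch, assuming the SPG objective $\log \mathbb{E}_{p_\theta}[\exp\{R(y|y^*)\}]$ and using the identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$:

```latex
\begin{aligned}
\nabla_\theta \log \mathbb{E}_{p_\theta(y|x)}\left[\exp\{R(y|y^*)\}\right]
&= \frac{\sum_y \nabla_\theta p_\theta(y|x)\,\exp\{R(y|y^*)\}}
        {\sum_{y'} p_\theta(y'|x)\,\exp\{R(y'|y^*)\}} \\
&= \sum_y \frac{p_\theta(y|x)\,\exp\{R(y|y^*)\}}
        {\sum_{y'} p_\theta(y'|x)\,\exp\{R(y'|y^*)\}}\,
   \nabla_\theta \log p_\theta(y|x) \\
&= \mathbb{E}_{q(y|x)}\left[\nabla_\theta \log p_\theta(y|x)\right],
\qquad q(y|x) \propto p_\theta(y|x)\,\exp\{R(y|y^*)\}.
\end{aligned}
```

The sampling distribution $q \propto \exp\{\log p_\theta + R\}$ is the E-step solution with $\alpha = 1$ and $\beta = 0$, and the resulting update is the gradient of the M-step objective $\mathbb{E}_q[\log p_\theta(y|x)]$ with $q$ held fixed.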

Note that the only difference between the SPG and RAML configurations is that now $\alpha = 1$. SPG thus moves a step further than RAML by leveraging both the reward and the model distribution for full exploration (Figure 2(c)). Sufficient exploration at training time would in theory boost the test-time performance. However, with the increased learning difficulty, additional sophisticated optimization and approximation techniques have to be used (Ding and Soricut, 2017) to make the training practical.

3.5 Data Noising

Adding noise to training data is a widely adopted technique for regularizing models. Previous work (Xie et al., 2017) has proposed several data noising strategies in the sequence generation context. For example, unigram noising, with probability $\gamma$, replaces each token in data $y^*$ with a sample from the unigram frequency distribution. The resulting noisy data is then used in MLE training.
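Unigram noising as described above can be sketched in a few lines. The corpus, the function name `unigram_noise`, and the value of the replacement probability are hypothetical choices for illustration, and the sketch follows the description in the text rather than any particular released implementation.

```python
import random
from collections import Counter

def unigram_noise(tokens, unigram, gamma, rng):
    """With probability gamma, replace each token by a draw from the unigram distribution."""
    words = list(unigram.keys())
    weights = list(unigram.values())
    noised = []
    for tok in tokens:
        if rng.random() < gamma:
            noised.append(rng.choices(words, weights=weights, k=1)[0])
        else:
            noised.append(tok)
    return noised

# Hypothetical toy corpus used to estimate unigram frequencies.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
counts = Counter(tok for sent in corpus for tok in sent)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

rng = random.Random(0)
noisy = unigram_noise(corpus[0], unigram, gamma=0.3, rng=rng)
unchanged = unigram_noise(corpus[0], unigram, gamma=0.0, rng=random.Random(1))
```

The noised sequence always has the same length as the original, and setting `gamma=0.0` returns the sequence unchanged, which corresponds to plain MLE training data.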

Though previous literature has commonly seen such techniques as a data pre-processing step that differs from the above learning algorithms, we show the ERPO framework can also subsume data noising as a special instance. Specifically, starting from the ERPO reformulation of MLE which takes $(R = R_\delta, \alpha \to 0, \beta = 1)$ (section 3.2), data noising can be formulated as using a locally relaxed variant of $R_\delta$. For example, assume $y$ has the same length as $y^*$ and let $\Delta_{y,y^*}$ be the set of tokens in $y$ that differ from the corresponding tokens in $y^*$; then a simple data noising strategy that randomly replaces a single token $y^*_t$ with another uniformly picked token is equivalent to using a reward $R'_\delta(y|y^*)$ that takes $1$ when $|\Delta_{y,y^*}| = 1$ and $-\infty$ otherwise. Likewise, the above unigram noising (Xie et al., 2017) is equivalent to using a reward

$$R^{\text{unigram}}_\delta(y|y^*) = \begin{cases} \log\left(\gamma^{|\Delta_{y,y^*}|} \cdot \prod_{y_t \in \Delta_{y,y^*}} u(y_t)\right) & \text{if } |y| = |y^*|, \\ -\infty & \text{otherwise}, \end{cases} \tag{9}$$

where $u(\cdot)$ is the unigram frequency distribution.

With a relaxed (i.e., smoothed) reward, data noising expands the exploration space of vanilla MLE locally (Figure 2(b)). The effect is essentially the same as the RAML algorithm (section 3.3), except that RAML expands the exploration space based on the task metric reward.

Other Algorithms

Ranzato et al. (2016) made an early attempt to address the exposure bias problem by exploiting the classic policy gradient algorithm (Sutton et al., 2000) and mixing it with MLE training. We show in the supplementary materials that the algorithm is closely related to the ERPO framework, and can be recovered with moderate approximations. Section 2 discusses more relevant algorithms for sequence generation learning.

4 Interpolation Algorithm

We have presented the generalized ERPO framework, and connected a series of well-used learning algorithms by showing that they are all instances of the framework with certain specifications of the three hyperparameters $(R, \alpha, \beta)$. Each of the algorithms can be seen as a point in the hyperparameter space (Figure 1). Generally, a point with a more restricted reward function $R$ and a very small $\alpha$ tends to have a smaller effective exploration space and allow efficient learning (e.g., MLE), while in contrast, a point with smooth $R$ and a larger $\alpha$ would lead to a more difficult learning problem, but permit more sufficient exploration and better test-time performance (e.g., (softmax) policy gradient). The unified perspective provides a new understanding of the existing algorithms, and also facilitates the development of new algorithms for further improvement. Here we present an example algorithm that interpolates the existing ones.

The interpolation algorithm exploits the natural idea of starting learning from the most restricted yet easiest problem configuration, and gradually expanding the exploration space to reduce the discrepancy from test time. The easy-to-hard learning paradigm resembles curriculum learning (Bengio et al., 2009). As we have mapped the algorithms to points in the hyperparameter space, interpolation becomes very straightforward, requiring only annealing of the hyperparameter values.

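Specifically, in the general update rules Eq.(2), we would like to anneal from using $R_\delta$ to using a smooth common reward, and anneal from exploring by only $R$ to exploring by both $R$ and $p_\theta$. Let $R_{comm}$ denote a common reward (e.g., BLEU). The interpolated reward can be written in the form $R = \lambda R_{comm} + (1 - \lambda) R_\delta$, for $\lambda \in [0, 1]$. Plugging $R$ into $q$ in Eq.(2) and re-organizing the scalar weights, we obtain the numerator of $q$ in the form $c \cdot (\lambda_1 \log p_\theta + \lambda_2 R_{comm} + \lambda_3 R_\delta)$, where $(\lambda_1, \lambda_2, \lambda_3)$ is defined as a distribution (i.e., $\lambda_1 + \lambda_2 + \lambda_3 = 1$), and, along with $c \in \mathbb{R}$, is determined by $(\alpha, \beta, \lambda)$. For example, $\lambda_1 = \alpha / (\alpha + 1)$. We gradually increase $\lambda_1$ and $\lambda_2$ and decrease $\lambda_3$ as the training proceeds.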
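The re-organization of the scalar weights can be made concrete with a small sketch. Only $\lambda_1 = \alpha/(\alpha+1)$ is stated in the text; the expressions for $\lambda_2$, $\lambda_3$, and $c$ below are derived here by normalizing the three weights $(\alpha, \lambda, 1-\lambda)$ of the exponent $(\alpha \log p_\theta + \lambda R_{comm} + (1-\lambda) R_\delta)/(\alpha+\beta)$, and the function name is hypothetical.

```python
def interpolation_weights(alpha, beta, lam):
    """Return (c, lambda1, lambda2, lambda3) such that
    (alpha*log_p + lam*R_comm + (1-lam)*R_delta) / (alpha + beta)
        == c * (lambda1*log_p + lambda2*R_comm + lambda3*R_delta)
    with lambda1 + lambda2 + lambda3 == 1."""
    total = alpha + 1.0                  # alpha + lam + (1 - lam)
    c = total / (alpha + beta)
    return c, alpha / total, lam / total, (1.0 - lam) / total

c, l1, l2, l3 = interpolation_weights(alpha=0.5, beta=1.0, lam=0.4)

# Check the identity on arbitrary finite values (hypothetical numbers).
log_p, r_comm, r_delta = -1.2, 0.7, 1.0
lhs = (0.5 * log_p + 0.4 * r_comm + 0.6 * r_delta) / (0.5 + 1.0)
rhs = c * (l1 * log_p + l2 * r_comm + l3 * r_delta)
```

Under this parameterization, annealing amounts to moving $(\lambda_1, \lambda_2, \lambda_3)$ from $(0, 0, 1)$, which is MLE, toward larger $\lambda_1$ and $\lambda_2$.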

Further, noting that $R_\delta$ is a Delta function (Eq.6), which would make the above direct function interpolation problematic, we borrow the idea from the Bayesian spike-and-slab factor selection method (Ishwaran et al., 2005). That is, we introduce a categorical random variable $z \in \{1, 2, 3\}$ that follows the distribution $(\lambda_1, \lambda_2, \lambda_3)$, and augment $q$ as $q(y|x, z) \propto \exp\{c \cdot (\mathbb{1}(z=1) \log p_\theta + \mathbb{1}(z=2) R_{comm} + \mathbb{1}(z=3) R_\delta)\}$. The M-step is then to maximize the objective with $z$ marginalized out: $\max_\theta \mathbb{E}_{p(z)} \mathbb{E}_{q(y|x,z)}[\log p_\theta(y|x)]$. The spike-and-slab adaptation essentially transforms the product of experts in $q$ into a mixture, which resembles the bang-bang rewarded SPG method (Ding and Soricut, 2017), where the name bang-bang refers to a system that switches abruptly between extreme states (i.e., the $z$ values). Finally, similar to Ding and Soricut (2017), we adopt the token-level formulation (Eq.4) and associate each token with a separate variable $z$.
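The mixture view can be sketched at the token level as follows. This is an illustrative sketch only: the function `sample_token`, the three-token vocabulary, and the probability vectors are hypothetical, and the reward-based component is represented by a precomputed distribution rather than an actual BLEU-based computation.

```python
import numpy as np

def sample_token(z_probs, model_probs, reward_probs, gold_token, rng):
    """Draw z ~ Categorical(z_probs), then draw a token from the selected component."""
    z = int(rng.choice(3, p=z_probs))
    if z == 0:      # component 1: explore with the model distribution
        return int(rng.choice(len(model_probs), p=model_probs)), z
    if z == 1:      # component 2: explore with the common reward
        return int(rng.choice(len(reward_probs), p=reward_probs)), z
    return gold_token, z   # component 3: delta-reward, copy the ground truth

rng = np.random.default_rng(0)
model_probs = np.array([0.7, 0.2, 0.1])    # hypothetical p_theta over 3 tokens
reward_probs = np.array([0.1, 0.8, 0.1])   # hypothetical softmax of token rewards
lambdas = np.array([0.2, 0.3, 0.5])        # (lambda1, lambda2, lambda3)

samples = [sample_token(lambdas, model_probs, reward_probs, gold_token=1, rng=rng)
           for _ in range(1000)]

# With all mass on the third component, every sampled token is the gold token.
mle_like = [sample_token(np.array([0.0, 0.0, 1.0]), model_probs, reward_probs,
                         gold_token=1, rng=rng)[0] for _ in range(20)]
```

Setting the weights to $(0, 0, 1)$ reproduces teacher-forced MLE targets, while shifting weight toward the first two components lets the model's own distribution and the task reward drive the samples.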

We provide the pseudo-code of the interpolation algorithm in the supplements. It is notable that Ranzato et al. (2016) also develop an annealing strategy that mixes MLE and policy gradient training. As discussed in section 3 and the supplements, the algorithm can be seen as a special instance of the ERPO framework (with moderate approximation) we have presented. The next section shows improved performance of the proposed, more general algorithm compared to Ranzato et al. (2016).

5 Experiments

We evaluate the above interpolation algorithm in the tasks of machine translation and text summarization. The proposed algorithm consistently improves over a variety of previous methods. Implementation is based on Texar (Hu et al., 2018a), a general-purpose text generation toolkit.


In both tasks, we follow previous work (Norouzi et al., 2016; Ranzato et al., 2016) and use an attentional sequence-to-sequence model (Luong et al., 2015) where both the encoder and decoder are single-layer LSTM recurrent networks. The dimensions of word embedding, RNN hidden state, and attention are all set to 256. We apply dropout of rate 0.2 on the recurrent hidden state. We use Adam optimization for training, with an initial learning rate of 0.001 and batch size of 64. At test time, we use beam search decoding with a beam width of 5. Please see the supplementary materials for more configuration details.

Model BLEU
RAML (Norouzi et al., 2016)
SPG (Ding and Soricut, 2017)
MIXER (Ranzato et al., 2016)
Scheduled Sampling (Bengio et al., 2015)
Table 1: Results of machine translation.

5.1 Machine Translation


Our dataset is based on the common IWSLT 2014 (Cettolo et al., 2014) German-English machine translation data, as also used in previous evaluation (Norouzi et al., 2016; Ranzato et al., 2016). After proper pre-processing as described in the supplementary materials, we obtain the final dataset with train/dev/test size of around 146K/7K/7K, respectively. The vocabulary sizes of German and English are around 32K and 23K, respectively.


The BLEU metric (Papineni et al., 2002) is used as the reward and for evaluation. Table 1 shows the test-set BLEU scores of various methods. Besides the approaches described above, we also compare with the Scheduled Sampling method (Bengio et al., 2015) which combats the exposure bias by feeding model predictions at randomly-picked decoding steps during training. From the table, we can see the various approaches such as RAML provide improved performance over the vanilla MLE, as more sufficient exploration is made at training time. Our proposed new algorithm performs best, as it interpolates among the existing algorithms to gradually increase the exploration space and solve the generation problem better.

Figure 3 shows the test-set BLEU scores against the training steps. We can see that, with annealing, our algorithm improves BLEU smoothly, and surpasses other algorithms to converge at a better point.

Figure 3: Convergence curve of learning algorithms in the task of machine translation. RAML is the second-best performing method (Table 1), inferior only to our algorithm.

Figure 4: Improvement on the ROUGE-L metric in comparison to MLE (e.g., RAML improves ROUGE-L by 0.17).

RAML (Norouzi et al., 2016)
SPG (Ding and Soricut, 2017)
MIXER (Ranzato et al., 2016)
Scheduled Sampling (Bengio et al., 2015)
Table 2: Results of text summarization.

5.2 Text Summarization


We use the popular English Gigaword corpus (Graff et al., 2003) for text summarization, and pre-process the data following Rush et al. (2015). The resulting dataset consists of 200K/8K/2K source-target pairs in train/dev/test sets, respectively. More details are included in the supplements.


The ROUGE metrics (including -1, -2, and -L) (Lin, 2004) are the most commonly used metrics for text summarization. Following previous work (Ding and Soricut, 2017), we use the summation of the three ROUGE metrics as the reward in the learning algorithms. Table 2 shows the results on the test set. The proposed interpolation algorithm achieves the best performance on all three metrics. For easier comparison, Figure 4 shows the improvement of each algorithm compared to MLE in terms of ROUGE-L. The RAML algorithm, which performed well in machine translation, falls behind other algorithms in text summarization. In contrast, our method consistently provides the best results.

6 Conclusions

We have presented a unified perspective of a variety of well-used learning algorithms for sequence generation. The framework is based on a generalized entropy regularized policy optimization formulation, and we show these algorithms are mathematically equivalent to specifying certain hyperparameter configurations in the framework. The new principled treatment provides systematic understanding and comparison among the algorithms, and inspires further enhancement. The proposed interpolation algorithm shows consistent improvement in machine translation and text summarization. We would be excited to extend the framework to other settings such as robotics and game environments.


  • Abdolmaleki et al. (2018) A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In ICLR, 2018.
  • Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Bahdanau et al. (2017) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
  • Bengio et al. (2015) S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
  • Bengio et al. (2009) Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
  • Cettolo et al. (2014) M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, 2014.
  • Chung et al. (2014) J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • Dayan and Hinton (1997) P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
  • Ding and Soricut (2017) N. Ding and R. Soricut. Cold-start reinforcement learning with softmax policy gradient. In Advances in Neural Information Processing Systems, pages 2814–2823, 2017.
  • Fedus et al. (2018) W. Fedus, I. Goodfellow, and A. M. Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.
  • Ganchev et al. (2010) K. Ganchev, J. Gillenwater, B. Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • Graff et al. (2003) D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003.
  • Haarnoja et al. (2017) T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In ICML, pages 1352–1361, 2017.
  • Hal Daumé et al. (2009) I. Hal Daumé, J. Langford, and D. Marcu. Search-based structured prediction as classification. Journal Machine Learning, 2009.
  • Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hovy and Lin (1998) E. Hovy and C.-Y. Lin. Automated text summarization and the SUMMARIST system. In Proceedings of a workshop held at Baltimore, Maryland: October 13-15, 1998, pages 197–214. Association for Computational Linguistics, 1998.
  • Hu et al. (2016) Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing. Harnessing deep neural networks with logic rules. In ACL, 2016.
  • Hu et al. (2017) Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.
  • Hu et al. (2018a) Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, X. Yu, L. Qin, D. Wang, et al. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794, 2018a.
  • Hu et al. (2018b) Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, and E. Xing. Deep generative models with learnable knowledge constraints. In NIPS, 2018b.
  • Ishwaran et al. (2005) H. Ishwaran, J. S. Rao, et al. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
  • Karpathy and Fei-Fei (2015) A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • Leblond et al. (2018) R. Leblond, J.-B. Alayrac, A. Osokin, and S. Lacoste-Julien. SEARNN: Training RNNs with global-local losses. In ICLR, 2018.
  • Levine (2018) S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Lin (2004) C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  • Luong et al. (2015) M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
  • Ma et al. (2017) X. Ma, P. Yin, J. Liu, G. Neubig, and E. Hovy. Softmax Q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv preprint arXiv:1705.07136, 2017.
  • Norouzi et al. (2016) M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pages 1723–1731, 2016.
  • Papineni et al. (2002) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • Peters et al. (2010) J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010.
  • Ranzato et al. (2016) M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
  • Rush et al. (2015) A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, pages 379–389, 2015.
  • Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
  • Schulman et al. (2017) J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
  • Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • Teh et al. (2017) Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
  • Vinyals et al. (2015) O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
  • Wiseman and Rush (2016) S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, pages 1296–1306, 2016.
  • Wu et al. (2016) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
  • Xie et al. (2017) Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, and A. Y. Ng. Data noising as smoothing in neural network language models. In ICLR, 2017.
  • Yang et al. (2018) Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. In NIPS, 2018.
  • Zhou et al. (2017) Q. Zhou, N. Yang, F. Wei, and M. Zhou. Selective encoding for abstractive sentence summarization. In ACL, 2017.
  • Zhu et al. (2018) W. Zhu, Z. Hu, and E. P. Xing. Text infilling. 2018.
  • Ziebart (2010) B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. In PhD Thesis, 2010.

Appendix A Policy Gradient & MIXER

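Ranzato et al. (2016) made an early attempt to address the exposure bias problem by exploiting the policy gradient algorithm (Sutton et al., 2000). Policy gradient aims to maximize the expected reward:

$$\mathcal{L}_{PG}(\theta) = \mathbb{E}_{p_\theta}\left[ R_{PG}(\mathbf{y} \mid \mathbf{y}^*) \right], \tag{10}$$

where $R_{PG}$ is usually a common reward function (e.g., BLEU). Taking the gradient w.r.t. $\theta$ gives:

$$\nabla_\theta \mathcal{L}_{PG}(\theta) = \mathbb{E}_{p_\theta}\left[ R_{PG}(\mathbf{y} \mid \mathbf{y}^*) \, \nabla_\theta \log p_\theta(\mathbf{y} \mid \mathbf{x}) \right]. \tag{11}$$

We now reveal the relation between the ERPO framework we present and the policy gradient algorithm. Starting from the M-step of Eq.(2) and setting $(\alpha = 1, \beta = 0)$ as in SPG (section 3.4), we use $p_{\theta^n}$ as the proposal distribution and obtain the importance sampling estimate of the gradient (we omit the superscript $n$ for notational simplicity):

$$\mathbb{E}_{q}\left[ \nabla_\theta \log p_\theta(\mathbf{y} \mid \mathbf{x}) \right] = \mathbb{E}_{p_\theta}\left[ \frac{q(\mathbf{y} \mid \mathbf{x})}{p_\theta(\mathbf{y} \mid \mathbf{x})} \, \nabla_\theta \log p_\theta(\mathbf{y} \mid \mathbf{x}) \right] = \frac{1}{Z_\theta} \, \mathbb{E}_{p_\theta}\left[ \exp\{ R(\mathbf{y} \mid \mathbf{y}^*) \} \, \nabla_\theta \log p_\theta(\mathbf{y} \mid \mathbf{x}) \right], \tag{12}$$

where $Z_\theta = \sum_{\mathbf{y}} \exp\{ \log p_\theta(\mathbf{y} \mid \mathbf{x}) + R(\mathbf{y} \mid \mathbf{y}^*) \}$ is the normalization constant of $q$, which can be considered as adjusting the step size of gradient descent.

We can see that Eq.(12) recovers Eq.(11) if we further set $R = \log R_{PG}$ and omit the scaling factor $Z_\theta$. In other words, policy gradient can be seen as a special instance of the general ERPO framework with $(R = \log R_{PG}, \alpha = 1, \beta = 0)$ and with $Z_\theta$ omitted.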
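The relation above can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the output space is a handful of enumerable candidates, the policy is a softmax over a parameter vector `theta`, and the reward values are made up. The function names are hypothetical. It compares the M-step gradient under q (with alpha = 1, beta = 0), its importance-sampling form, and the policy gradient with R_PG = exp(R), and it checks the policy gradient against a finite-difference gradient of the expected reward.

```python
import numpy as np

def softmax(theta):
    z = np.asarray(theta, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def erpo_vs_policy_gradient(theta, reward):
    """Compare the ERPO M-step gradient with the policy gradient on a toy softmax policy."""
    reward = np.asarray(reward, dtype=float)
    p = softmax(theta)
    # For a softmax policy, row y of this matrix is grad_theta log p_theta(y).
    grad_log_p = np.eye(len(p)) - p
    # E-step solution with alpha = 1, beta = 0: q(y) = p(y) * exp{R(y)} / Z.
    weights = p * np.exp(reward)
    Z = weights.sum()
    q = weights / Z
    # M-step gradient: E_q[grad log p].
    m_step_grad = q @ grad_log_p
    # Importance-sampling form: (1/Z) * E_p[exp{R} * grad log p].
    is_grad = sum(p[y] * np.exp(reward[y]) * grad_log_p[y] for y in range(len(p))) / Z
    # Policy gradient with R_PG = exp{R}: E_p[R_PG * grad log p].
    pg_grad = (p * np.exp(reward)) @ grad_log_p
    return m_step_grad, is_grad, pg_grad, Z

def finite_difference_grad(theta, r_pg, eps=1e-5):
    """Central-difference gradient of the expected reward E_p[R_PG]."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (softmax(theta + step) @ r_pg - softmax(theta - step) @ r_pg) / (2 * eps)
    return grad
```

In this toy setting the policy gradient equals the M-step gradient multiplied by Z, so the two point in the same direction and differ only in step size.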

The MIXER algorithm (Ranzato et al., 2016) incorporates an annealing strategy that mixes MLE and policy gradient training. Specifically, given a ground-truth example $\mathbf{y}^*$, the first $m$ tokens are used for evaluating the MLE loss, and from step $m+1$ onward the policy gradient objective is used. The value of $m$ decreases as training proceeds. With the relation between policy gradient and ERPO established above, MIXER can be seen as a specific instance of the proposed interpolation algorithm (section 4) that follows a restricted annealing strategy for the token-level hyperparameters $(\lambda_1, \lambda_2, \lambda_3)$. That is, for $t \le m$ in Eq.(4) (i.e., the first $m$ steps), the hyperparameters take the values that reduce to MLE training, while for $t > m$ they take the values that reduce to policy gradient.
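The prefix-based schedule described for MIXER can be sketched as follows. This is a minimal illustration, not the implementation of Ranzato et al. (2016): the prefix length is called `m` here, and the function names and the decrement `step_size` are hypothetical choices.

```python
def mixer_objectives(seq_len, m):
    """Label each decoding step: 'mle' for the first m steps, 'pg' for the rest."""
    m = max(0, min(m, seq_len))
    return ["mle"] * m + ["pg"] * (seq_len - m)

def anneal_prefix(m, step_size=2):
    """Shrink the MLE prefix as training proceeds (step_size is an assumed value)."""
    return max(0, m - step_size)
```

For instance, `mixer_objectives(5, 3)` labels three steps for MLE and two for policy gradient, and repeated calls to `anneal_prefix` move the schedule toward pure policy gradient.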

Appendix B Interpolation Algorithm

Algorithm 1 summarizes the interpolation algorithm described in section 4.

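  Initialize model parameter $\theta$ and weights $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \lambda_3)$
  repeat
     Get training example $(\mathbf{x}, \mathbf{y}^*)$
     for $t = 1, \dots, T$ do
         Sample $z \in \{1, 2, 3\}$ with probabilities $(\lambda_1, \lambda_2, \lambda_3)$
         if $z = 1$ then
            Sample token $y_t$ from the model distribution $p_\theta(y_t \mid \mathbf{y}_{1:t-1}, \mathbf{x})$
         else if $z = 2$ then
            Sample token $y_t$ from the distribution induced by the exponentiated common reward
         else
            Sample token $y_t$ from the ground truth, i.e., set $y_t = y_t^*$
         end if
     end for
     Update $\theta$ by maximizing the log-likelihood $\log p_\theta(\mathbf{y} \mid \mathbf{x})$
     Anneal $\boldsymbol{\lambda}$ by increasing $\lambda_1$ and $\lambda_2$ and decreasing $\lambda_3$
  until convergence
Algorithm 1 Interpolation Algorithm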
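The inner sampling loop of Algorithm 1 can be sketched in Python. This is only an illustration under assumptions: `model_probs_fn` and `reward_probs_fn` are hypothetical callables that return a probability vector over the vocabulary, and the order of the three sources (model, reward-based, ground truth) is assumed rather than taken from the listing.

```python
import numpy as np

def interpolation_sample(target, model_probs_fn, reward_probs_fn, lam, rng):
    """Build one training sequence by choosing a token source at every step.

    lam = (lam1, lam2, lam3): probabilities of drawing the token from the model,
    from a reward-based distribution, or copying the ground-truth token.
    """
    lam = np.asarray(lam, dtype=float)
    lam = lam / lam.sum()
    prefix = []
    for t, gold in enumerate(target):
        z = rng.choice(3, p=lam)
        if z == 0:
            # Source 1: the current model distribution given the prefix.
            probs = model_probs_fn(prefix)
            token = int(rng.choice(len(probs), p=probs))
        elif z == 1:
            # Source 2: a reward-based distribution (assumed to be supplied by the caller).
            probs = reward_probs_fn(prefix, target, t)
            token = int(rng.choice(len(probs), p=probs))
        else:
            # Source 3: copy the ground-truth token.
            token = int(gold)
        prefix.append(token)
    return prefix
```

With `lam = (0, 0, 1)` the sampled sequence is exactly the ground truth, so maximizing its log-likelihood is ordinary MLE; shifting mass to the other two entries widens the exploration space.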

Appendix C Experimental Settings

C.1 Data Pre-processing

For the machine translation dataset, we follow Ma et al. (2017) for data pre-processing.

In text summarization, we sampled 200K out of the 3.8M pre-processed training examples provided by Rush et al. (2015) for the sake of training efficiency. We used the refined validation and test sets provided by Zhou et al. (2017).

C.2 Algorithm Setup

For RAML (Norouzi et al., 2016), we use the sampling approach (n-gram replacement) of Ma et al. (2017) to sample from the exponentiated reward distribution. For each training example we draw 10 samples. The softmax temperature is set to .
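To make the role of the softmax temperature concrete, the sketch below samples from an exponentiated reward distribution over an explicit candidate set. It is a simplified stand-in and not the n-gram replacement sampler of Ma et al. (2017): the candidate list, the reward values, the temperature, and the function names are all hypothetical.

```python
import numpy as np

def exp_reward_distribution(rewards, temperature):
    """Probabilities proportional to exp(reward / temperature) over a finite candidate set."""
    scaled = np.asarray(rewards, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def sample_candidates(candidates, rewards, temperature, n_samples, rng):
    """Draw n_samples candidates according to the exponentiated reward distribution."""
    probs = exp_reward_distribution(rewards, temperature)
    idx = rng.choice(len(candidates), size=n_samples, p=probs)
    return [candidates[i] for i in idx]
```

A lower temperature concentrates the distribution on the highest-reward candidates, while a higher one spreads the samples more evenly.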

For Scheduled Sampling (Bengio et al., 2015), the decay function we used is inverse-sigmoid decay. The probability of sampling from model , where is a hyperparameter controlling the speed of convergence, which is set to and in the machine translation and text summarization tasks, respectively.
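For reference, the inverse-sigmoid decay defined by Bengio et al. (2015) can be written as a short function. In that paper the decayed quantity is the probability of feeding the ground-truth token, so the probability of sampling from the model is taken here as its complement; whether this matches the exact parameterization used in these experiments is an assumption, and the function names are hypothetical.

```python
import math

def inverse_sigmoid_decay(i, k):
    """Bengio et al. (2015): eps_i = k / (k + exp(i / k)), with k >= 1.

    eps_i is the probability of feeding the ground-truth token at training step i.
    """
    ratio = i / k
    if ratio > 700:  # exp would overflow a float; the limit of eps_i is 0
        return 0.0
    return k / (k + math.exp(ratio))

def prob_sample_from_model(i, k):
    """Complement of eps_i: probability of feeding a token sampled from the model."""
    return 1.0 - inverse_sigmoid_decay(i, k)
```

Larger values of `k` keep the ground-truth probability high for longer, which slows the shift toward model samples.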

For MIXER (Ranzato et al., 2016), the advantage function we used for policy gradient is .

For the proposed interpolation algorithm, we initialize the weights $(\lambda_1, \lambda_2, \lambda_3)$ and then increase $\lambda_1$ and $\lambda_2$ while decreasing $\lambda_3$ each time the validation-set reward decreases. Specifically, one of the two weights is increased once and the other is then increased four times, in a repeating cycle. For example, the first time the validation-set reward decreases we increase the former, from the second to the fifth time we increase the latter, and so forth. The weight $\lambda_3$ is decreased every time either $\lambda_1$ or $\lambda_2$ is increased. Notice that we would not update when the validation-set reward decreases.
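The periodic annealing schedule can be sketched as a small update function that is called each time the validation-set reward decreases. Everything numeric here is an assumption for illustration: the step sizes, the choice of which weight is raised once per cycle, and the rule that the third weight drops by the same amount so the weights keep a constant sum. The function name is hypothetical.

```python
def anneal_weights(lam, n_decreases, step1=0.05, step2=0.05, period=5):
    """Update (lam1, lam2, lam3) after the n-th validation-reward decrease (1-based).

    Within each cycle of `period` events, the first event raises lam1 and the
    remaining events raise lam2; lam3 is lowered by the same amount.
    """
    lam1, lam2, lam3 = lam
    position = (n_decreases - 1) % period
    if position == 0:
        delta = min(step1, lam3)  # never push lam3 below zero
        lam1 += delta
    else:
        delta = min(step2, lam3)
        lam2 += delta
    lam3 -= delta
    return (lam1, lam2, lam3)
```

Starting from `(0.0, 0.0, 1.0)`, five consecutive events under these assumed step sizes give roughly `(0.05, 0.20, 0.75)`, after which the cycle repeats.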

Appendix D Additional Results

Here we present additional machine translation results using a dropout rate of 0.3 (Table 3). The improvement of the proposed interpolation algorithm over the baselines is comparable to that obtained with dropout 0.2 (Table 1 in the paper). For example, our algorithm improves over MLE by 1.5 BLEU points, and over the second-best method, RAML, by 0.49 BLEU points. (With dropout 0.2 in Table 1, the improvements are 1.42 and 0.64 BLEU points, respectively.) The proposed interpolation algorithm outperforms existing approaches by a clear margin.

Figure 5 shows the convergence curves of the comparison algorithms.

Model BLEU
RAML (Norouzi et al., 2016)
SPG (Ding and Soricut, 2017)
MIXER (Ranzato et al., 2016)
Scheduled Sampling (Bengio et al., 2015)
Table 3: Results of machine translation when dropout is 0.3.
Figure 5: Convergence curves of the learning algorithms in the task of machine translation with a dropout rate of 0.3. The horizontal dashed lines indicate the test-set results of each of the algorithms (reported in Table 3) picked according to the validation set performance.