1 Introduction
Sequence generation is a ubiquitous problem in many applications, such as machine translation (Wu et al., 2016; Sutskever et al., 2014), text summarization (Hovy and Lin, 1998; Rush et al., 2015), image captioning (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015), and so forth. Great advances in these tasks have been made by the development of sequence models such as recurrent neural networks (RNNs) with different cells (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and attention mechanisms (Bahdanau et al., 2015; Luong et al., 2015). These models can be trained with a variety of learning algorithms. The standard training algorithm is based on maximum-likelihood estimation (MLE), which seeks to maximize the log-likelihood of ground-truth sequences. Despite its computational simplicity and efficiency, MLE training suffers from
exposure bias (Ranzato et al., 2016). That is, the model is trained to predict the next token given the previous ground-truth tokens; at test time, however, since the resulting model does not have access to the ground truth, tokens generated by the model itself are instead used to make the next prediction. This discrepancy between training and test leads to the issue that mistakes in prediction can quickly accumulate. Recent efforts have been made to alleviate the issue, many of which resort to reinforcement learning (RL) techniques (Ranzato et al., 2016; Bahdanau et al., 2017; Ding and Soricut, 2017). For example, Ranzato et al. (2016) adopt policy gradient (Sutton et al., 2000), which avoids the training/test discrepancy by using the same decoding strategy in both phases. However, RL-based approaches for sequence generation can face challenges of prohibitively poor sample efficiency and high variance. For more practical training, a diverse set of methods has been developed that occupy a middle ground between the two paradigms of MLE and RL. For example, RAML
(Norouzi et al., 2016) adds reward-aware perturbation to the MLE data examples; SPG (Ding and Soricut, 2017) leverages the reward distribution for effective sampling of policy gradient. Other approaches such as data noising (Xie et al., 2017) also show improved results. In this paper, we establish a unified perspective of this broad set of learning algorithms. Specifically, we present a generalized entropy regularized policy optimization framework, and show that the apparently diverse algorithms, such as MLE, RAML, SPG, and data noising, can all be reformulated as special instances of the framework, with the only difference being the choice of reward and the values of a couple of hyperparameters (Figure 1). In particular, we show MLE is equivalent to using a delta-function reward that assigns 1 to samples that exactly match data examples and $-\infty$ to any other samples. Such an extremely restricted reward has literally disabled any exploration of the model beyond the training data, yielding the exposure bias. Other algorithms essentially use smoother rewards, and also leverage the model distribution for exploration, which generally results in a larger effective exploration space, more difficult training, and better test-time performance.
Besides the new understandings of the existing algorithms, the unified perspective also facilitates the development of new algorithms for improved learning. We present an example new algorithm that, as training proceeds, gradually expands the exploration space by annealing the reward and hyperparameter values. The annealing in effect dynamically interpolates among the existing algorithms. Experiments on machine translation and text summarization show that the interpolation algorithm achieves significant improvement over the various existing methods.
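To make the exposure-bias discussion above concrete, here is a toy, self-contained sketch (entirely illustrative and not from the paper; a deterministic bigram table stands in for a trained model): under teacher forcing every prediction is conditioned on the true prefix, while in free-running decoding a single early mistake puts the model in a context it never saw during training.

```python
# Toy illustration of exposure bias: a bigram next-token "model" trained
# with ground-truth contexts behaves well under teacher forcing, but a
# single early mistake derails free-running decoding.

def train_bigram(sequence):
    """Learn a deterministic next-token table from one ground-truth sequence."""
    table = {}
    for prev, nxt in zip(sequence, sequence[1:]):
        table[prev] = nxt
    return table

def teacher_forced(model, ground_truth):
    """At each step, predict the next token from the ground-truth prefix."""
    return [model.get(tok, "<unk>") for tok in ground_truth[:-1]]

def free_running(model, start, steps):
    """At each step, feed the model's own previous prediction back in."""
    out, tok = [], start
    for _ in range(steps):
        tok = model.get(tok, "<unk>")  # unseen context -> the model is lost
        out.append(tok)
    return out

truth = ["a", "b", "c", "d", "e"]
model = train_bigram(truth)

# Teacher forcing: every prediction is conditioned on the true prefix.
print(teacher_forced(model, truth))   # ['b', 'c', 'd', 'e']

# Free running after one early mistake ("x" instead of "b"): the model
# never recovers, because "x" was never seen as a context during training.
print(free_running(model, "x", 4))    # ['<unk>', '<unk>', '<unk>', '<unk>']
```

The accumulated errors in the second run are exactly the "mistakes quickly accumulate" phenomenon described above, in miniature.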
2 Related Work
Sequence generation models are usually trained to maximize the log-likelihood of data by feeding the ground-truth tokens during decoding. Reinforcement learning (RL) addresses the discrepancy between training and test by also using the model's own predictions at training time. Various RL approaches have been applied to sequence generation, such as policy gradient (Ranzato et al., 2016) and actor-critic (Bahdanau et al., 2017). Softmax policy gradient (SPG) (Ding and Soricut, 2017) additionally incorporates the reward distribution to generate high-quality sequence samples. The algorithm is derived by applying a log-softmax trick to adapt the standard policy gradient objective. Reward augmented maximum likelihood (RAML) (Norouzi et al., 2016) is an algorithm in between MLE and policy gradient. It was originally developed to go beyond the maximum likelihood criterion and incorporate a task metric (such as BLEU for machine translation) to guide the model learning. Mathematically, RAML shows that MLE and maximum-entropy policy gradient respectively minimize KL divergences in opposite directions. We reformulate both SPG and RAML in a new perspective, and show they are precisely instances of a general entropy regularized policy optimization framework. The new framework provides a more principled formulation for both algorithms. Besides the algorithms discussed in this paper, there are other learning methods for sequence models. For example, Hal Daumé et al. (2009); Leblond et al. (2018); Wiseman and Rush (2016) use a learning-to-search paradigm for sequence generation or structured prediction. Scheduled Sampling (Bengio et al., 2015) adapts MLE by randomly replacing ground-truth tokens with model predictions as the input for decoding the next-step token. Hu et al. (2017); Yang et al. (2018); Fedus et al. (2018) learn (conditional) text generation with holistic discriminators. Zhu et al. (2018) explore the new setting of text infilling that leverages both left- and right-side context for generation.
Policy optimization for reinforcement learning has been studied extensively in robotics and game environments. For example, Peters et al. (2010) introduce a relative entropy regularization to reduce information loss during learning. Schulman et al. (2015) develop a trust-region approach for monotonic improvement. Dayan and Hinton (1997); Levine (2018); Abdolmaleki et al. (2018) study policy optimization algorithms from a probabilistic inference perspective. Hu et al. (2018b) show the connections between policy optimization, Bayesian posterior regularization (Hu et al., 2016; Ganchev et al., 2010), and GANs (Goodfellow et al., 2014) for combining structured knowledge with deep generative models. The entropy-regularized policy optimization formulation presented here can be seen as a generalization of many of these previous policy optimization methods, as shown in the next section. Besides, we formulate the framework in the sequence generation context.
3 Connecting the Dots
We first present a generalized formulation of an entropy regularized policy optimization framework, to which a broad set of learning algorithms for sequence generation are connected. In particular, we show the conventional maximum likelihood learning is a special case of the policy optimization formulation. This provides new understandings of the exposure bias problem as well as the exploration efficiency of the algorithms. We further show that the framework subsumes as special cases other wellknown learning methods that were originally developed in diverse perspectives. We thus establish a unified, principled view of the broad class of works.
Let us first set up the notation for the sequence generation setting. Let $x$ be the input and $y = (y_1, \dots, y_T)$ the sequence of tokens in the target space. For example, in machine translation, $x$ is the sentence in the source language and $y$ is in the target language. Let $(x, y^*)$ be a training example drawn from the empirical data distribution, where $y^*$ is the ground-truth sequence. We aim to learn a sequence generation model $p_\theta(y|x)$ parameterized with $\theta$. The model can, for example, be a recurrent network. It is worth noting that though we present in the sequence generation context, the formulations can straightforwardly be extended to other settings such as robotics and game environments.
3.1 Generalized Entropy Regularized Policy Optimization (ERPO)
Policy optimization is a family of reinforcement learning (RL) algorithms that seeks to learn the parameters $\theta$ of the model $p_\theta(y|x)$ (a.k.a. the policy). Given a reward function $R(y|y^*)$ (e.g., the BLEU score in machine translation) that evaluates the quality of a generation $y$ against the true $y^*$, the general goal of policy optimization is to maximize the expected reward. A rich line of research on entropy regularized policy optimization (ERPO) stabilizes the learning by augmenting the objective with information-theoretic regularizers. Here we present a generalized formulation of ERPO. Assuming a general distribution $q(y|x)$ (more details below), the objective we adopt is written as
$$\mathcal{L}(q, \theta) = \mathbb{E}_q\big[R(y|y^*)\big] - \alpha\,\mathrm{KL}\big(q(y|x)\,\big\|\,p_\theta(y|x)\big) + \beta\,\mathbb{H}(q), \qquad (1)$$
where $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence forcing $q$ to stay close to $p_\theta$; $\mathbb{H}(\cdot)$ is the Shannon entropy imposing a maximum-entropy assumption on $q$; and $\alpha$ and $\beta$ are balancing weights of the respective terms. In the RL literature, the distribution $q$ has taken various forms, leading to different policy optimization algorithms. For example, setting $q$ to a non-parametric policy and $\beta = 0$ results in the prominent relative entropy policy search (Peters et al., 2010) algorithm. Assuming $q$ as a parametric distribution and $\alpha = 0$ leads to the commonly-used maximum-entropy policy gradient (Ziebart, 2010; Haarnoja et al., 2017). Letting $q$ be a variational distribution and $\beta = 0$ corresponds to the probabilistic inference formulation of policy gradient (Abdolmaleki et al., 2018; Levine, 2018). Related objectives have also been used in other popular RL algorithms (Schulman et al., 2015, 2017; Teh et al., 2017). We assume a non-parametric $q(y|x)$. The above objective can be maximized with an EM-style procedure that iterates two coordinate-ascent steps optimizing $q$ and $\theta$, respectively. At iteration $n$:
$$q^{n+1}(y|x) \propto \exp\Big\{ \frac{\alpha \log p_{\theta^n}(y|x) + R(y|y^*)}{\alpha + \beta} \Big\} \ \ \text{(E-step)}, \qquad \theta^{n+1} = \arg\max_\theta\, \mathbb{E}_{q^{n+1}}\big[\log p_\theta(y|x)\big] \ \ \text{(M-step)}. \qquad (2)$$
The E-step is obtained with simple Lagrange multipliers (Hu et al., 2016). Note that $q$ has a closed-form solution in the E-step. We can have an intuitive interpretation of its form. First, it is clear to see that if $\alpha \to \infty$, we have $q^{n+1} = p_{\theta^n}$. This is also reflected in the objective Eq.(1), where the weight $\alpha$ encourages $q$ to be close to $p_\theta$. Second, the weight $\beta$ serves as the temperature of the $q$ softmax distribution. In particular, a large temperature makes $q$ a uniform distribution, which is consistent with the outcome of an infinitely large maximum entropy regularization in Eq.(1). Regarding the M-step, the update rule can be interpreted as maximizing the log-likelihood of samples from the distribution $q^{n+1}$. In the context of sequence generation, it is sometimes more convenient to express the equations at token level, as shown shortly. To this end, we decompose $R(y|y^*)$ along the time steps:
$$R(y|y^*) = \sum\nolimits_t \Delta R(y_t \,|\, y^*, y_{1:t-1}), \qquad (3)$$
where $\Delta R(y_t \,|\, y^*, y_{1:t-1})$ measures the reward contributed by token $y_t$. The solution of $q$ in Eq.(2) can then be rewritten as:
$$q^{n+1}(y_t \,|\, y_{1:t-1}, x) \propto \exp\Big\{ \frac{\alpha \log p_{\theta^n}(y_t \,|\, y_{1:t-1}, x) + \Delta R(y_t \,|\, y^*, y_{1:t-1})}{\alpha + \beta} \Big\}. \qquad (4)$$
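As a concrete illustration of the token-level E-step, the following sketch (toy numbers of my own; the three-token vocabulary, the $\Delta R$ values, and the large negative constant standing in for $-\infty$ are all assumptions) computes the closed-form $q$ of the E-step and its two limiting behaviors: a large $\beta$ yields a near-uniform $q$, and a delta-style reward with $\alpha \to 0$ collapses $q$ onto the data token.

```python
# Minimal sketch of the token-level E-step:
# q(y_t) ∝ exp{(α·log p(y_t) + ΔR(y_t)) / (α + β)} over a toy vocabulary.
import math

def erpo_q(log_p, delta_r, alpha, beta):
    """Closed-form token-level q over a small vocabulary."""
    logits = [(alpha * lp + dr) / (alpha + beta) for lp, dr in zip(log_p, delta_r)]
    m = max(logits)
    exps = [math.exp(lg - m) for lg in logits]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

log_p = [math.log(0.7), math.log(0.2), math.log(0.1)]  # model distribution
delta_r = [0.0, 1.0, 0.5]                              # token-level rewards

q = erpo_q(log_p, delta_r, alpha=1.0, beta=1.0)
print([round(v, 3) for v in q])

# Large β acts as a high softmax temperature: q tends toward uniform,
# matching an infinitely strong maximum-entropy regularizer.
q_unif = erpo_q(log_p, delta_r, alpha=1.0, beta=1e9)

# Delta-style reward with α→0 (the MLE configuration of Section 3.2):
# q collapses onto the single token matching the data, here index 1.
NEG_INF = -1e30
q_mle = erpo_q(log_p, [NEG_INF, 0.0, NEG_INF], alpha=1e-12, beta=1.0)
print([round(v, 3) for v in q_mle])  # [0.0, 1.0, 0.0]
```

The last call previews the MLE special case derived in Section 3.2: with the delta reward and $\alpha \to 0$, sampling from $q$ can only ever return the ground-truth token.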
The above ERPO framework has three key hyperparameters, namely $(R, \alpha, \beta)$. In the following, we show that different values of the three hyperparameters correspond to different learning algorithms (Figure 1). We first connect MLE to the above general formulation, and compare and discuss the properties of MLE and regular ERPO from the new perspective.
3.2 MLE as a Special Case of ERPO
Maximum likelihood estimation is the most widely-used approach to learn a sequence generation model due to its simplicity and efficiency. It aims to find the optimal parameter $\theta^*$ that maximizes the data log-likelihood:
$$\theta^* = \arg\max_\theta\, \mathcal{L}_{\mathrm{MLE}}(\theta) = \arg\max_\theta\, \log p_\theta(y^*|x). \qquad (5)$$
As discussed in section 1, MLE suffers from the exposure bias problem, as the model is only exposed to the training data, rather than its own predictions, by using the ground-truth subsequence $y^*_{1:t-1}$ to evaluate the probability of $y^*_t$. We show that the MLE objective can be recovered from Eq.(2) with specific reward and weight configurations. Consider a $\delta$-reward defined as:$^2$ [Footnote 2: At token level, define $\Delta R_\delta(y_t \,|\, y^*, y_{1:t-1}) = 1/T^*$ if $y_{1:t} = y^*_{1:t}$ and $-\infty$ otherwise, where $T^*$ is the length of $y^*$. Note that the value of $R_\delta$ can also be set to any constant larger than $-\infty$.]
$$R_\delta(y|y^*) = \begin{cases} 1 & \text{if } y = y^*, \\ -\infty & \text{otherwise}. \end{cases} \qquad (6)$$
Let $(R = R_\delta,\ \alpha \to 0,\ \beta = 1)$. From the E-step of Eq.(2), we have $q^{n+1}(y|x) = 1$ if $y = y^*$ and $0$ otherwise. The M-step is therefore equivalent to $\arg\max_\theta \log p_\theta(y^*|x)$, which recovers precisely the MLE objective in Eq.(5).
That is, MLE can be seen as an instance of the policy optimization algorithm with the $\delta$-reward $R_\delta$ and the above weight values. Any sample that fails to match the data precisely will receive a negative infinite reward and never contribute to model learning.
Exploration efficiency
The ERPO reformulation of MLE provides a new statistical explanation of the exposure bias problem. Specifically, a very small $\alpha$ value makes the model distribution ignored during sampling from $q$, while the $\delta$-reward permits only samples that match training examples. The two factors in effect make void any exploration beyond the small set of training data (Figure 2(a)), leading to a brittle model that performs poorly at test time due to the extremely restricted exploration. On the other hand, however, a key advantage of the $\delta$-reward specification is that its regular reward shape allows extreme pruning of the huge sample space, resulting in a space that includes exactly the training examples. This makes the MLE implementation very simple and the computation very efficient in practice.
On the contrary, common rewards (e.g., BLEU) used in policy optimization are smoother than the $\delta$-reward, and permit exploration in a broader space. However, such rewards usually do not have a regular shape as the $\delta$-reward does, and thus are not amenable to sample space pruning. Generally, a larger exploration space leads to a harder training problem. Also, in the huge sample space, the rewards are still very sparse (e.g., most sequences have BLEU=0 against a reference sequence). Such reward sparsity can make exploration inefficient and even impractical.
Given the opposite algorithm behaviors in terms of exploration and computation efficiency, it is natural to seek a middle ground between the two extremes that combines the advantages of both. A broad set of such approaches has been developed recently. We revisit some of the popular ones, and show that these apparently divergent approaches can all be reformulated within our ERPO framework (Eqs.1–4) with varying reward and weight specifications.
3.3 RewardAugmented Maximum Likelihood (RAML)
RAML (Norouzi et al., 2016) was originally proposed to incorporate a task metric reward into MLE training, and has shown performance superior to vanilla MLE. Specifically, it introduces an exponentiated reward distribution $e(y|y^*) \propto \exp\{R(y|y^*)\}$, where $R$, as in vanilla policy optimization, is a task metric such as BLEU. RAML maximizes the following objective:
$$\mathcal{L}_{\mathrm{RAML}}(\theta) = \mathbb{E}_{y \sim e(y|y^*)}\big[\log p_\theta(y|x)\big]. \qquad (7)$$
That is, unlike MLE, which directly maximizes the data log-likelihood, RAML first perturbs the data proportionally to the reward distribution $e(y|y^*)$, and then maximizes the log-likelihood of the resulting samples.
The RAML objective reduces to the vanilla MLE objective if we replace the task reward in $e(y|y^*)$ with the MLE $\delta$-reward $R_\delta$ (Eq.6). The relation between MLE and RAML still holds within our new formulation (Eqs.1–2). In particular, similar to how we recovered MLE from Eq.(2), let $(\alpha \to 0,\ \beta = 1)$,$^3$ but set $R$ to the task metric reward; then the M-step of Eq.(2) is precisely equivalent to maximizing the above RAML objective. [Footnote 3: The exponentiated reward distribution can also include a temperature $\tau$ (Norouzi et al., 2016). In this case, we set $\beta = \tau$.]
Formulating RAML within the same framework allows an immediate comparison with the other algorithms. In particular, compared to MLE, the use of a smooth task metric reward instead of $R_\delta$ permits a larger effective exploration space surrounding the training data (Figure 2(b)), which helps to alleviate the exposure bias problem. On the other hand, $\alpha \to 0$ as in MLE still limits the exploration, as it ignores the model distribution. Thus, RAML takes a step from MLE toward regular RL, and has an effective exploration space size and exploration efficiency in between.
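The RAML perturbation step can be sketched as follows. This is an illustrative toy of my own: negative Hamming distance stands in for the task metric, and the support is restricted to single-token substitutions of $y^*$ so the normalizer is tractable (real metrics require approximate sampling schemes, e.g., the stratified sampling of Norouzi et al. (2016)).

```python
# Hedged sketch of RAML-style data perturbation: sample augmented targets
# from e(y|y*) ∝ exp{R(y|y*)/τ} over an explicit candidate set.
import math, random

def hamming_reward(y, y_star):
    """Toy task metric: negative number of mismatched positions."""
    return -sum(a != b for a, b in zip(y, y_star))

def raml_distribution(candidates, y_star, tau=1.0):
    """Exponentiated reward distribution over an explicit candidate set."""
    w = [math.exp(hamming_reward(y, y_star) / tau) for y in candidates]
    z = sum(w)
    return [v / z for v in w]

vocab = ["a", "b", "c"]
y_star = ("a", "b", "c")
# Candidates: y* plus all single-token substitutions of y*.
candidates = {y_star}
for t in range(len(y_star)):
    for v in vocab:
        candidates.add(y_star[:t] + (v,) + y_star[t + 1:])
candidates = sorted(candidates)

probs = raml_distribution(candidates, y_star, tau=0.5)
best = candidates[probs.index(max(probs))]
print(best)  # the exact match y* gets the highest probability

random.seed(0)
sample = random.choices(candidates, weights=probs, k=1)[0]  # a perturbed target
```

Training then maximizes the log-likelihood of such perturbed targets instead of $y^*$ alone, which is exactly the M-step with the smoothed reward.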
3.4 Softmax Policy Gradient (SPG)
SPG (Ding and Soricut, 2017) was developed from the perspective of adapting vanilla policy gradient (Sutton et al., 2000) to use the reward for sampling. SPG has the following objective:
$$\mathcal{L}_{\mathrm{SPG}}(\theta) = \log \mathbb{E}_{p_\theta}\big[\exp\{R(y|y^*)\}\big], \qquad (8)$$
where $R(y|y^*)$ is a common reward as above. As a variant of the standard policy gradient algorithm, SPG aims to address the exposure bias problem and shows promising results (Ding and Soricut, 2017).
We show SPG can readily fit into our ERPO framework. Specifically, taking the gradient of Eq.(8) w.r.t. $\theta$, we immediately get the same update rule as in Eq.(2) with $(\alpha = 1,\ \beta = 0,\ R = $ common reward$)$.
Note that the only difference between the SPG and RAML configurations is that now $\alpha = 1$. SPG thus moves a step further than RAML by leveraging both the reward and the model distribution for full exploration (Figure 2(c)). Sufficient exploration at training time would in theory boost the test-time performance. However, with the increased learning difficulty, additional sophisticated optimization and approximation techniques have to be used (Ding and Soricut, 2017) to make the training practical.
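The SPG reformulation can be checked numerically on a toy model (my own construction, not the authors' implementation): for a 3-outcome softmax policy, the gradient of the log-of-expected-exp-reward objective equals $q - p_\theta$, where $q \propto p_\theta \exp\{R\}$ is exactly the E-step distribution with $\alpha = 1, \beta = 0$. A finite-difference check confirms the identity.

```python
# Toy check that the SPG objective L(θ) = log Σ_y p_θ(y) exp{R(y)}
# has gradient q - p_θ, with q ∝ p_θ · exp{R} (the E-step distribution).
import math

R = [0.0, 1.0, 2.0]  # toy rewards for three outcomes

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [v / z for v in e]

def spg_objective(theta):
    p = softmax(theta)
    return math.log(sum(pi * math.exp(r) for pi, r in zip(p, R)))

def spg_grad(theta):
    """Analytic gradient: q_i - p_i, where q ∝ p·exp(R)."""
    p = softmax(theta)
    w = [pi * math.exp(r) for pi, r in zip(p, R)]
    z = sum(w)
    q = [v / z for v in w]
    return [qi - pi for qi, pi in zip(q, p)]

theta = [0.1, -0.3, 0.2]
g = spg_grad(theta)
eps = 1e-6
g_num = [(spg_objective([t + eps * (i == j) for j, t in enumerate(theta)])
          - spg_objective(theta)) / eps for i in range(3)]
print(all(abs(a - b) < 1e-4 for a, b in zip(g, g_num)))  # True
```

Because $q$ up-weights high-reward outcomes relative to $p_\theta$, the gradient pushes probability mass toward them, which is the "reward distribution for sampling" behavior described above.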
3.5 Data Noising
Adding noise to training data is a widely adopted technique for regularizing models. Previous work (Xie et al., 2017) has proposed several data noising strategies in the sequence generation context. For example, unigram noising, with probability $\gamma$, replaces each token in the data with a sample from the unigram frequency distribution. The resulting noisy data are then used in MLE training.
Though previous literature has commonly seen such techniques as a data preprocessing step distinct from the above learning algorithms, we show the ERPO framework can also subsume data noising as a special instance. Specifically, starting from the ERPO reformulation of MLE, which takes $(R = R_\delta,\ \alpha \to 0,\ \beta = 1)$ (section 3.2), data noising can be formulated as using a locally relaxed variant of $R_\delta$. For example, assume $y$ has the same length as $y^*$ and let $\mathcal{E} = \{t : y_t \ne y^*_t\}$ be the set of positions at which the tokens of $y$ differ from the corresponding tokens of $y^*$; then a simple data noising strategy that randomly replaces a single token with another uniformly picked token is equivalent to using a reward $R'_\delta(y|y^*)$ that takes $1$ when $|\mathcal{E}| \le 1$ and $-\infty$ otherwise. Likewise, the above unigram noising (Xie et al., 2017) is equivalent to using a reward
$$R_{\mathrm{unigram}}(y|y^*) = \sum\nolimits_t \log\big[(1-\gamma)\,\mathbb{1}(y_t = y^*_t) + \gamma\, u(y_t)\big], \qquad (9)$$
where $u(\cdot)$ is the unigram frequency distribution.
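The unigram noising scheme itself is straightforward to sketch (toy corpus counts of my own; `gamma` plays the role of the replacement probability $\gamma$ above):

```python
# Sketch of unigram noising in the spirit of Xie et al. (2017): each token
# is independently replaced, with probability gamma, by a sample from the
# unigram frequency distribution (toy corpus counts for illustration).
import random
from collections import Counter

def unigram_noise(tokens, unigram_counts, gamma, rng):
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    noisy = []
    for tok in tokens:
        if rng.random() < gamma:
            noisy.append(rng.choices(vocab, weights=weights, k=1)[0])
        else:
            noisy.append(tok)
    return noisy

corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
rng = random.Random(0)

clean = "the cat sat".split()
print(unigram_noise(clean, counts, gamma=0.0, rng=rng))  # unchanged
noised = unigram_noise(clean, counts, gamma=1.0, rng=rng)
print(all(tok in counts for tok in noised))              # True: all from vocab
```

MLE training on the noised targets is then, per the reformulation above, equivalent to an E-step with the smoothed reward of Eq.(9).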
With a relaxed (i.e., smoothed) reward, data noising expands the exploration space of vanilla MLE locally (Figure 2(b)). The effect is essentially the same as the RAML algorithm (section 3.3), except that RAML expands the exploration space based on the task metric reward.
Other Algorithms
Ranzato et al. (2016) made an early attempt to address the exposure bias problem by exploiting the classic policy gradient algorithm (Sutton et al., 2000) and mixing it with MLE training. We show in the supplementary materials that the algorithm is closely related to the ERPO framework, and can be recovered with moderate approximations. Section 2 discusses more relevant algorithms for sequence generation learning.
4 Interpolation Algorithm
We have presented the generalized ERPO framework, and connected a series of widely-used learning algorithms by showing that they are all instances of the framework with certain specifications of the three hyperparameters $(R, \alpha, \beta)$. Each of the algorithms can be seen as a point in the hyperparameter space (Figure 1). Generally, a point with a more restricted reward function and a very small $\alpha$ tends to have a smaller effective exploration space and allow efficient learning (e.g., MLE), while in contrast, a point with a smooth $R$ and a larger $\alpha$ would lead to a more difficult learning problem, but permit more sufficient exploration and better test-time performance (e.g., (softmax) policy gradient). The unified perspective provides new understandings of the existing algorithms, and also facilitates the development of new algorithms for further improvement. Here we present an example algorithm that interpolates the existing ones.
The interpolation algorithm exploits the natural idea of starting learning from the most restricted yet easiest problem configuration, and gradually expanding the exploration space to reduce the discrepancy from test time. This easy-to-hard learning paradigm resembles curriculum learning (Bengio et al., 2009). As we have mapped the algorithms to points in the hyperparameter space, interpolation becomes very straightforward, requiring only annealing of the hyperparameter values.
Specifically, in the general update rules Eq.(2), we would like to anneal from using the $\delta$-reward $R_\delta$ to using a smooth common reward, and anneal from exploring by only the reward to exploring by both the reward and $p_\theta$. Let $R_{\mathrm{comm}}$ denote a common reward (e.g., BLEU). The interpolated reward can be written in the form $R_{\mathrm{interp}} = \lambda_1 R_{\mathrm{comm}} + \lambda_2 R_\delta$, for $\lambda_1 + \lambda_2 = 1,\ \lambda_i \ge 0$. Plugging $R_{\mathrm{interp}}$ into $q$ in Eq.(2) and reorganizing the scalar weights, we obtain the numerator of $q$ in the form $c \cdot (\lambda_1 R_{\mathrm{comm}} + \lambda_2 R_\delta + \lambda_3 \log p_\theta)$, where $\lambda = (\lambda_1, \lambda_2, \lambda_3)$ is defined as a distribution (i.e., $\sum_i \lambda_i = 1,\ \lambda_i \ge 0$), and, along with $c$, is determined by $(\alpha, \beta)$. For example, $\lambda_3 \propto \alpha$. We gradually increase $\lambda_1$ and $\lambda_3$ and decrease $\lambda_2$ as the training proceeds.
Further, noting that $R_\delta$ is a delta function (Eq.6), which would make the above direct function interpolation problematic, we borrow the idea from the Bayesian spike-and-slab factor selection method (Ishwaran et al., 2005). That is, we introduce a categorical random variable $z \in \{1, 2, 3\}$ that follows the distribution $\lambda = (\lambda_1, \lambda_2, \lambda_3)$, and augment $q$ as a mixture $q(y|x) = \sum_z p(z)\, q(y|x, z)$, where each component $q(y|x, z)$ corresponds to one of the terms $R_{\mathrm{comm}}$, $R_\delta$, and $\log p_\theta$. The M-step is then to maximize the objective with $z$ marginalized out: $\max_\theta\, \mathbb{E}_{p(z)}\mathbb{E}_{q(y|x,z)}\big[\log p_\theta(y|x)\big]$. The spike-and-slab adaptation essentially transforms the product of experts in $q$ into a mixture, which resembles the bang-bang rewarded SPG method (Ding and Soricut, 2017), where the name bang-bang refers to a system that switches abruptly between extreme states (i.e., the $z$ values). Finally, similar to (Ding and Soricut, 2017), we adopt the token-level formulation (Eq.4) and associate each token with a separate variable $z_t$. We provide the pseudo-code of the interpolation algorithm in the supplements. It is notable that Ranzato et al. (2016) also develop an annealing strategy that mixes MLE and policy gradient training. As discussed in section 3 and the supplements, that algorithm can be seen as a special instance (with moderate approximation) of the ERPO framework we have presented. The next section shows improved performance of the proposed, more general algorithm compared to (Ranzato et al., 2016).
5 Experiments
We evaluate the above interpolation algorithm on the tasks of machine translation and text summarization. The proposed algorithm consistently improves over a variety of previous methods. Implementation is based on Texar (Hu et al., 2018a), a general-purpose text generation toolkit.
Setup
In both tasks, we follow previous work (Norouzi et al., 2016; Ranzato et al., 2016) and use an attentional sequence-to-sequence model (Luong et al., 2015) in which both the encoder and decoder are single-layer LSTM recurrent networks. The dimensions of the word embedding, RNN hidden state, and attention are all set to 256. We apply dropout with rate 0.2 on the recurrent hidden state. We use Adam optimization for training, with an initial learning rate of 0.001 and a batch size of 64. At test time, we use beam search decoding with a beam width of 5. Please see the supplementary materials for more configuration details.
5.1 Machine Translation
Dataset
Our dataset is based on the common IWSLT 2014 (Cettolo et al., 2014) German-English machine translation data, as also used in previous evaluations (Norouzi et al., 2016; Ranzato et al., 2016). After proper preprocessing as described in the supplementary materials, we obtain the final dataset with train/dev/test sizes of around 146K/7K/7K, respectively. The vocabulary sizes of German and English are around 32K and 23K, respectively.
Results
The BLEU metric (Papineni et al., 2002) is used both as the reward and for evaluation. Table 1 shows the test-set BLEU scores of various methods. Besides the approaches described above, we also compare with the Scheduled Sampling method (Bengio et al., 2015), which combats the exposure bias by feeding model predictions at randomly-picked decoding steps during training. From the table, we can see that the various approaches such as RAML provide improved performance over vanilla MLE, as more sufficient exploration is made at training time. Our proposed new algorithm performs best, as it interpolates among the existing algorithms to gradually increase the exploration space and solve the generation problem better.
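For reference, a simplified sentence-level BLEU in the spirit of Papineni et al. (2002) can be written as below. This is a smoothed toy version for illustration, not the exact evaluation script used in the experiments; real evaluation should use a standard implementation.

```python
# Simplified sentence-BLEU sketch: modified (clipped) n-gram precision
# with a brevity penalty, geometric mean over n = 1..4, with crude
# smoothing for zero-count n-grams.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    if not hypothesis:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())               # clipped counts
        total = max(sum(hyp.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total)  # crude smoothing
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_prec / max_n)

ref = "the cat sat on the mat".split()
print(round(sentence_bleu(ref, ref), 4))                # 1.0
print(sentence_bleu(ref, "the cat sat".split()) < 1.0)  # True
```

Note that most partial hypotheses score near zero against a reference, which is the reward sparsity issue raised in section 3.2.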
Figure 3 shows the test-set BLEU scores against the training steps. We can see that, with annealing, our algorithm improves BLEU smoothly, and surpasses the other algorithms to converge at a better point.
5.2 Text Summarization
Dataset
Results
The ROUGE metrics (including 1, 2, and L) (Lin, 2004) are the most commonly used metrics for text summarization. Following previous work (Ding and Soricut, 2017), we use the summation of the three ROUGE metrics as the reward in the learning algorithms. Table 2 shows the results on the test set. The proposed interpolation algorithm achieves the best performance on all three metrics. For easier comparison, Figure 4 shows the improvement of each algorithm over MLE in terms of ROUGE-L. The RAML algorithm, which performed well in machine translation, falls behind the other algorithms in text summarization. In contrast, our method consistently provides the best results.
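The summed-ROUGE reward can be sketched as below (toy recall-only versions of ROUGE-1/2/L of my own; the actual experiments presumably rely on a standard ROUGE package):

```python
# Toy sketch of a summed-ROUGE reward: ROUGE-1 and ROUGE-2 recall plus
# ROUGE-L (longest-common-subsequence) recall, added together.
from collections import Counter

def rouge_n_recall(ref, hyp, n):
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(ref_ngrams.values())
    return sum((ref_ngrams & hyp_ngrams).values()) / total if total else 0.0

def rouge_l_recall(ref, hyp):
    # Longest common subsequence via dynamic programming.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == hyp[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / m if m else 0.0

def summed_rouge_reward(ref, hyp):
    return (rouge_n_recall(ref, hyp, 1) + rouge_n_recall(ref, hyp, 2)
            + rouge_l_recall(ref, hyp))

ref = "police killed the gunman".split()
print(summed_rouge_reward(ref, ref))                       # 3.0 for a perfect match
print(summed_rouge_reward(ref, "the gunman".split()) < 3)  # True
```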
6 Conclusions
We have presented a unified perspective of a variety of widely-used learning algorithms for sequence generation. The framework is based on a generalized entropy regularized policy optimization formulation, and we show these algorithms are mathematically equivalent to specifying certain hyperparameter configurations in the framework. The new principled treatment provides systematic understanding and comparison among the algorithms, and inspires further enhancement. The proposed interpolation algorithm shows consistent improvement in machine translation and text summarization. We would be excited to extend the framework to other settings such as robotics and game environments.
References
 Abdolmaleki et al. (2018) A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In ICLR, 2018.
 Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 Bahdanau et al. (2017) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actorcritic algorithm for sequence prediction. In ICLR, 2017.
 Bengio et al. (2015) S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
 Bengio et al. (2009) Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48. ACM, 2009.
 Cettolo et al. (2014) M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, 2014.
 Chung et al. (2014) J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

 Dayan and Hinton (1997) P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
 Ding and Soricut (2017) N. Ding and R. Soricut. Cold-start reinforcement learning with softmax policy gradient. In Advances in Neural Information Processing Systems, pages 2814–2823, 2017.
 Fedus et al. (2018) W. Fedus, I. Goodfellow, and A. M. Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.

 Ganchev et al. (2010) K. Ganchev, J. Gillenwater, B. Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 Graff et al. (2003) D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34, 2003.
 Haarnoja et al. (2017) T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In ICML, pages 1352–1361, 2017.
 Hal Daumé et al. (2009) H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 2009.
 Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Hovy and Lin (1998) E. Hovy and C.-Y. Lin. Automated text summarization and the SUMMARIST system. In Proceedings of a workshop held at Baltimore, Maryland, October 13–15, 1998, pages 197–214. Association for Computational Linguistics, 1998.
 Hu et al. (2016) Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing. Harnessing deep neural networks with logic rules. In ACL, 2016.
 Hu et al. (2017) Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.
 Hu et al. (2018a) Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, X. Yu, L. Qin, D. Wang, et al. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794, 2018a.
 Hu et al. (2018b) Z. Hu, Z. Yang, R. Salakhutdinov, X. Liang, L. Qin, H. Dong, and E. Xing. Deep generative models with learnable knowledge constraints. In NIPS, 2018b.
 Ishwaran et al. (2005) H. Ishwaran, J. S. Rao, et al. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.

 Karpathy and Fei-Fei (2015) A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
 Leblond et al. (2018) R. Leblond, J.-B. Alayrac, A. Osokin, and S. Lacoste-Julien. SEARNN: Training RNNs with global-local losses. In ICLR, 2018.
 Levine (2018) S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Lin (2004) C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
 Luong et al. (2015) M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
 Ma et al. (2017) X. Ma, P. Yin, J. Liu, G. Neubig, and E. Hovy. Softmax Q-distribution estimation for structured prediction: A theoretical interpretation for RAML. arXiv preprint arXiv:1705.07136, 2017.
 Norouzi et al. (2016) M. Norouzi, S. Bengio, N. Jaitly, M. Schuster, Y. Wu, D. Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pages 1723–1731, 2016.
 Papineni et al. (2002) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
 Peters et al. (2010) J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010.
 Ranzato et al. (2016) M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.

 Rush et al. (2015) A. M. Rush, S. Chopra, and J. Weston. A neural attention model for abstractive sentence summarization. In EMNLP, pages 379–389, 2015.
 Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
 Schulman et al. (2017) J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.
 Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 Teh et al. (2017) Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multi-task reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
 Vinyals et al. (2015) O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
 Wiseman and Rush (2016) S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In EMNLP, pages 1296–1306, 2016.
 Wu et al. (2016) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Xie et al. (2017) Z. Xie, S. I. Wang, J. Li, D. Lévy, A. Nie, D. Jurafsky, and A. Y. Ng. Data noising as smoothing in neural network language models. In ICLR, 2017.
 Yang et al. (2018) Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. In NIPS, 2018.
 Zhou et al. (2017) Q. Zhou, N. Yang, F. Wei, and M. Zhou. Selective encoding for abstractive sentence summarization. In ACL, 2017.
 Zhu et al. (2018) W. Zhu, Z. Hu, and E. P. Xing. Text infilling. 2018.
 Ziebart (2010) B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.
Appendix A Policy Gradient & MIXER
Ranzato et al. (2016) made an early attempt to address the exposure bias problem by exploiting the policy gradient algorithm (Sutton et al., 2000). Policy gradient aims to maximize the expected reward:

(10)  $\mathcal{L}_{PG}(\theta) = \mathbb{E}_{p_\theta(y|x)}\left[ R(y, y^*) \right],$

where $R(y, y^*)$ is usually a common reward function (e.g., BLEU). Taking the gradient w.r.t. $\theta$ gives:

(11)  $\nabla_\theta \mathcal{L}_{PG}(\theta) = \mathbb{E}_{p_\theta(y|x)}\left[ R(y, y^*)\, \nabla_\theta \log p_\theta(y|x) \right].$
We now reveal the relation between the ERPO framework we present and the policy gradient algorithm. Starting from the M-step of Eq.(2) and setting the hyperparameters as in SPG (section 3.4), i.e., a common reward $R$ with $\alpha = 1$ and $\beta = 0$, the E-step distribution becomes $q^{n+1}(y|x) \propto p_{\theta^n}(y|x)\exp\{R(y, y^*)\}$. Using $p_{\theta^n}$ as the proposal distribution, we obtain the importance sampling estimate of the gradient (we omit the superscript $n$ for notation simplicity):

(12)  $\nabla_\theta\, \mathbb{E}_{q^{n+1}}\left[ \log p_\theta(y|x) \right] = \frac{1}{Z} \mathbb{E}_{p_\theta(y|x)}\left[ \exp\{R(y, y^*)\}\, \nabla_\theta \log p_\theta(y|x) \right],$

where $Z$ is the normalization constant of $q^{n+1}$, which can be considered as adjusting the step size of gradient descent.

We can see that Eq.(12) recovers Eq.(11) if we further set $\exp\{R(y, y^*)\} = R(y, y^*)$ and omit the scaling factor $1/Z$. In other words, policy gradient can be seen as a special instance of the general ERPO framework with $(\alpha = 1, \beta = 0)$ and a reward $R'$ such that $\exp\{R'\} = R$, with $1/Z$ omitted.
The MIXER algorithm (Ranzato et al., 2016) incorporates an annealing strategy that mixes between MLE and policy gradient training. Specifically, given a ground-truth example $y^*$, the first $m$ tokens $y^*_{1:m}$ are used for evaluating the MLE loss, and starting from step $m+1$, the policy gradient objective is used. The value of $m$ decreases as training proceeds. With the relation between policy gradient and ERPO as established above, MIXER can be seen as a specific instance of the proposed interpolation algorithm (section 4) that follows a restricted annealing strategy for the token-level hyperparameters. That is, for $t \le m$ in Eq.(4) (i.e., the first $m$ steps), $R$ is set to $R_\delta$ and $(\alpha, \beta)$ to $(0, 1)$, namely the MLE training; while for $t > m$, $R$ is set to the common reward and $(\alpha, \beta)$ to $(1, 0)$.
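As a sanity check on the reduction above, the following toy sketch enumerates a three-outcome softmax "policy" exactly (no sampling), so both gradient expressions can be computed in closed form. All function and variable names are illustrative assumptions, not the paper's code; the point is only that the reward-weighted, normalized estimate of Eq.(12) agrees with the vanilla policy gradient of Eq.(11) up to the factor $1/Z$ once the importance weight is taken to be $R$ itself.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_p(p, y, k):
    # d/d theta_k of log softmax(theta)[y] = 1[y == k] - p[k]
    return (1.0 if y == k else 0.0) - p[k]

def pg_gradient(logits, reward):
    """Eq.(11): exact policy gradient  E_p[ R(y) * grad log p(y) ]."""
    p = softmax(logits)
    K = len(logits)
    return [sum(p[y] * reward[y] * grad_log_p(p, y, k) for y in range(K))
            for k in range(K)]

def erpo_gradient(logits, weight):
    """Eq.(12)-style estimate: (1/Z) E_p[ w(y) * grad log p(y) ],
    where in SPG the weight is w(y) = exp{R(y)} and Z normalizes
    q(y) proportional to p(y) * w(y)."""
    p = softmax(logits)
    K = len(logits)
    Z = sum(p[y] * weight[y] for y in range(K))
    return [sum(p[y] * weight[y] * grad_log_p(p, y, k) for y in range(K)) / Z
            for k in range(K)]
```

In practice the expectations are of course estimated from samples drawn from $p_\theta$ rather than by enumeration; the enumeration here just makes the $1/Z$ relationship between the two estimates exact.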
Appendix B Interpolation Algorithm
Appendix C Experimental Settings
c.1 Data Preprocessing
For the machine translation dataset, we follow Ma et al. (2017) for data preprocessing.
c.2 Algorithm Setup
For RAML (Norouzi et al., 2016), we use the sampling approach (n-gram replacement) of Ma et al. (2017) to sample from the exponentiated reward distribution. For each training example we draw 10 samples. The softmax temperature is set to a fixed value. For Scheduled Sampling (Bengio et al., 2015), the decay function we used is inverse-sigmoid decay: the probability of sampling from the model follows the schedule $\epsilon_i = k/(k + \exp(i/k))$, where $k$ is a hyperparameter controlling the speed of convergence, set separately for the machine translation and text summarization tasks.
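A minimal sketch of the inverse-sigmoid schedule, following the formula in Bengio et al. (2015); the function name and the value of k below are illustrative, not the values used in the experiments:

```python
import math

def inverse_sigmoid_decay(i, k):
    """Decay schedule eps_i = k / (k + exp(i / k)) from Bengio et al. (2015).

    eps_i starts near 1 and decays toward 0 as training step i grows;
    k controls how quickly the schedule converges.
    """
    return k / (k + math.exp(i / k))
```

As `i` grows the schedule moves smoothly from relying on ground-truth tokens to sampling from the model, which is the annealing behavior scheduled sampling relies on.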
For MIXER (Ranzato et al., 2016), the advantage function we used for policy gradient is the sequence reward minus an estimated baseline, following the original work.
For the proposed interpolation algorithm, we initialize the weights as $(\lambda_1, \lambda_2, \lambda_3) = (0, 0, 1)$ (i.e., pure MLE training), and increase $\lambda_1$ and $\lambda_2$ while decreasing $\lambda_3$ every time the validation-set reward decreases. Specifically, we increase $\lambda_1$ once and increase $\lambda_2$ four times, periodically: at the first time the validation-set reward decreases we increase $\lambda_1$, and at the second to fifth times we increase $\lambda_2$, and so forth. The weight $\lambda_3$ is decreased every time we increase either $\lambda_1$ or $\lambda_2$. Notice that the weights are updated only when the validation-set reward decreases.
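The schedule above can be sketched as follows. This is a hedged illustration: the class name, the weight labels (`model`, `reward`, `data` for the three interpolation components), and the step size `delta` are assumptions for readability, not the paper's actual values.

```python
class InterpolationAnneal:
    """Annealing schedule sketch: start from pure MLE and shift weight
    toward reward-based and model-based components whenever the
    validation-set reward drops. `delta` is a hypothetical step size."""

    def __init__(self, delta=0.05):
        # (lambda_1, lambda_2, lambda_3) = (model, reward, data); start at MLE
        self.lam = {"model": 0.0, "reward": 0.0, "data": 1.0}
        self.delta = delta
        self.n_drops = 0       # number of validation-reward decreases seen
        self.prev = None       # previous validation reward

    def step(self, val_reward):
        """Advance the schedule only when validation reward decreases."""
        dropped = self.prev is not None and val_reward < self.prev
        self.prev = val_reward
        if not dropped:
            return self.lam
        # period of five: the first drop bumps the model weight,
        # the next four bump the reward weight
        if self.n_drops % 5 == 0:
            self.lam["model"] += self.delta
        else:
            self.lam["reward"] += self.delta
        self.lam["data"] = max(0.0, self.lam["data"] - self.delta)
        self.n_drops += 1
        return self.lam
```

In this sketch the weights would typically be renormalized before use; whether the original experiments renormalize is not specified here.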
Appendix D Additional Results
Here we present additional results of machine translation using a dropout rate of 0.3 (Table 3). The improvement of the proposed interpolation algorithm over the baselines is comparable to that obtained with dropout 0.2 (Table 1 in the paper). For example, our algorithm improves over MLE by 1.5 BLEU points, and over the second best performing method, RAML, by 0.49 BLEU points. (With dropout 0.2 in Table 1, the improvements are 1.42 and 0.64 BLEU points, respectively.) The proposed interpolation algorithm outperforms existing approaches by a clear margin.
Figure 5 shows the convergence curves of the comparison algorithms.