1 Introduction
Many problems in machine learning involve structured prediction, i.e., predicting a group of outputs that depend on each other. Recent advances in sequence labeling (Ma & Hovy, 2016), syntactic parsing (McDonald et al., 2005) and machine translation (Bahdanau et al., 2015) benefit from the development of more sophisticated discriminative models for structured outputs, such as the seminal work on conditional random fields (CRFs) (Lafferty et al., 2001) and large margin methods (Taskar et al., 2004), demonstrating the importance of the joint predictions across multiple output components.
A principal problem in structured prediction is direct optimization towards the task-specific metrics (i.e., rewards) used in evaluation, such as token-level accuracy for sequence labeling or BLEU score for machine translation. In contrast to maximum likelihood (ML) estimation, which uses likelihood as a reasonable surrogate for the task-specific metric, a number of techniques (Taskar et al., 2004; Gimpel & Smith, 2010; Volkovs et al., 2011; Shen et al., 2016) have emerged to incorporate task-specific rewards in optimization. Among these methods, reward augmented maximum likelihood (RAML) (Norouzi et al., 2016) has stood out for its simplicity and effectiveness, leading to state-of-the-art performance on several structured prediction tasks, such as machine translation (Wu et al., 2016) and image captioning (Liu et al., 2016). Instead of only maximizing the log-likelihood of the ground-truth output as in ML, RAML attempts to maximize the expected log-likelihood of all possible candidate outputs w.r.t. the exponentiated payoff distribution, which is defined as the normalized exponentiated reward. By incorporating task-specific reward into the payoff distribution, RAML combines the computational efficiency of ML with the conceptual advantages of reinforcement learning (RL) algorithms that optimize the expected reward (Ranzato et al., 2016; Bahdanau et al., 2017).
Simple as RAML appears to be, its empirical success has piqued interest in analyzing and justifying RAML from both theoretical and empirical perspectives. In their pioneering work, Norouzi et al. (2016) showed that both RAML and RL optimize the KL divergence between the exponentiated payoff distribution and the model distribution, but in opposite directions. Moreover, when applied to log-linear models, RAML can also be shown to be equivalent to the softmax-margin training method (Gimpel & Smith, 2010; Gimpel, 2012). Nachum et al. (2016) applied the payoff distribution to improve the exploration properties of policy gradient for model-free reinforcement learning.
Despite these efforts, the theoretical properties of RAML, especially the interpretation and behavior of the exponentiated payoff distribution, have largely remained understudied (§2). First, RAML attempts to match the model distribution with the heuristically designed exponentiated payoff distribution whose behavior has largely remained underappreciated, resulting in a non-intuitive asymptotic property. Second, there is no direct theoretical proof showing that RAML can deliver a prediction function better than ML. Third, no attempt (to the best of our knowledge) has been made to further improve RAML from the algorithmic and practical perspectives.
In this paper, we attempt to resolve the above-mentioned understudied problems by providing a theoretical interpretation of RAML. Our contributions are threefold: (1) Theoretically, we introduce the framework of softmax Q-distribution estimation, through which we are able to interpret the role the payoff distribution plays in RAML (§3). Specifically, the softmax Q-distribution serves as a smooth approximation to the Bayes decision boundary. By comparing the payoff distribution with this softmax Q-distribution, we show that RAML approximately estimates the softmax Q-distribution, therefore approximating the Bayes decision rule. Hence, our theoretical results provide an explanation of what distribution RAML asymptotically models, and why the prediction function provided by RAML outperforms the one provided by ML. (2) Algorithmically, we further propose softmax Q-distribution maximum likelihood (SQDML), which improves RAML by achieving the exact Bayes decision boundary asymptotically. (3) Experimentally, through one experiment using synthetic data on multiclass classification and one using real data on image captioning, we verify our theoretical analysis, showing that SQDML is consistently as good as or better than RAML on the task-specific metrics we desire to optimize. Additionally, through three structured prediction tasks in natural language processing (NLP) with rewards defined on sequential (named entity recognition), tree-based (dependency parsing) and complex irregular structures (machine translation), we deepen the empirical analysis of Norouzi et al. (2016), showing that RAML consistently leads to improved performance over ML on task-specific metrics, while ML yields better exact match accuracy (§4).
2 Background
2.1 Notations
Throughout we use uppercase letters for random variables (and occasionally for matrices as well), and lowercase letters for realizations of the corresponding random variables. Let $X \in \mathcal{X}$ be the input, and $Y \in \mathcal{Y}$ be the desired structured output, e.g., in machine translation $X$ and $Y$ are French and English sentences, respectively. We assume that the set of all possible outputs $\mathcal{Y}$ is finite. For instance, in machine translation all English sentences are up to a maximum length. $r(y, y^*)$ denotes the task-specific reward function (e.g., BLEU score) which evaluates a predicted output $y$ against the ground-truth $y^*$.
Let $P$ denote the true distribution of the data, i.e., $(X, Y) \sim P$, and let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be our training samples, where $\{x_i\}_{i=1}^{n}$ (resp. $\{y_i\}_{i=1}^{n}$) are usually i.i.d. samples of $X$ (resp. $Y$). Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ denote a parametric statistical model indexed by parameter $\theta \in \Theta$, where $\Theta$ is the parameter space. Some widely used parametric models are conditional log-linear models (Lafferty et al., 2001) and deep neural networks (Sutskever et al., 2014) (details in Appendix D.2). Once the parametric statistical model is learned, given an input $x$, model inference (a.k.a. decoding) is performed by finding an output $\hat{y}$ achieving the highest conditional probability:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} P_{\hat{\theta}}(y \mid x) \tag{1}$$
where $\hat{\theta}$ is the set of parameters learned on training data $D$.
2.2 Maximum Likelihood
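Maximum likelihood minimizes the negative log-likelihood of the parameters given training data:
$$\hat{\theta}_{\mathrm{ML}} = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} -\log P_\theta(y_i \mid x_i) = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X)}\Big[\mathrm{KL}\big(\tilde{P}(\cdot \mid X) \,\|\, P_\theta(\cdot \mid X)\big)\Big] \tag{2}$$
where $\tilde{P}(X)$ and $\tilde{P}(\cdot \mid X)$ are derived from the empirical distribution of training data $D$:
$$\tilde{P}(X = x, Y = y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(x_i = x, y_i = y) \tag{3}$$
and $\mathbb{I}(\cdot)$ is the indicator function. From (2), ML attempts to learn a conditional model distribution $P_{\hat{\theta}_{\mathrm{ML}}}(\cdot \mid X = x)$ that is as close to the conditional empirical distribution $\tilde{P}(\cdot \mid X = x)$ as possible, for each $x \in \mathcal{X}$. Theoretically, under certain regularity conditions (Wasserman, 2013), asymptotically as $n \to \infty$, $P_{\hat{\theta}_{\mathrm{ML}}}(\cdot \mid X = x)$ converges to the true distribution $P(\cdot \mid X = x)$, since $\tilde{P}(\cdot \mid X = x)$ converges to $P(\cdot \mid X = x)$ for each $x \in \mathcal{X}$.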
2.3 Reward Augmented Maximum Likelihood
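As proposed in Norouzi et al. (2016), RAML incorporates task-specific rewards by re-weighting the log-likelihood of each possible candidate output proportionally to its exponentiated scaled reward:
$$\hat{\theta}_{\mathrm{RAML}} = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} \Big\{ -\sum_{y \in \mathcal{Y}} q(y \mid y_i; \tau) \log P_\theta(y \mid x_i) \Big\} \tag{4}$$
where the reward information is encoded by the exponentiated payoff distribution, with the temperature $\tau$ controlling its smoothness:
$$q(y \mid y^*; \tau) = \frac{\exp\big(r(y, y^*)/\tau\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(r(y', y^*)/\tau\big)} \tag{5}$$
Norouzi et al. (2016) showed that (4) can be re-expressed in terms of KL divergence as follows:
$$\hat{\theta}_{\mathrm{RAML}} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X, Y)}\Big[\mathrm{KL}\big(q(\cdot \mid Y; \tau) \,\|\, P_\theta(\cdot \mid X)\big)\Big] \tag{6}$$
where $\tilde{P}$ is the empirical distribution in (3). As discussed in Norouzi et al. (2016), the globally optimal solution of RAML is achieved when the learned model distribution matches the exponentiated payoff distribution, i.e., $P_{\hat{\theta}_{\mathrm{RAML}}}(\cdot \mid X = x_i) = q(\cdot \mid Y = y_i; \tau)$ for each $(x_i, y_i) \in D$ and for some fixed value of $\tau$.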
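As a concrete illustration of the exponentiated payoff distribution, the following minimal sketch computes the normalized exponentiated reward over a tiny, fully enumerated output space. The function names, the toy label sequences and the use of token-level accuracy as the reward are assumptions made for this example only; realistic output spaces are far too large to enumerate like this.

```python
import math

def payoff_distribution(candidates, reference, reward, tau):
    """Normalized exponentiated reward over a finite candidate set."""
    scores = [reward(y, reference) / tau for y in candidates]
    top = max(scores)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {y: w / total for y, w in zip(candidates, weights)}

def token_accuracy(y, y_star):
    """Fraction of positions where y agrees with y_star (equal lengths assumed)."""
    return sum(a == b for a, b in zip(y, y_star)) / len(y_star)

# Toy output space (an assumption): all length-2 sequences over {"A", "B"}.
candidates = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
q = payoff_distribution(candidates, ("A", "B"), token_accuracy, tau=0.5)
```

The ground-truth sequence receives the largest mass, candidates sharing one token with it receive equal intermediate mass, and lowering `tau` concentrates the distribution further on the ground truth.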
Open Problems in RAML
We identify three open issues in the theoretical interpretation of RAML: i) Though both $P_\theta(\cdot \mid X = x)$ and $q(\cdot \mid Y = y^*; \tau)$ are distributions defined over the output space $\mathcal{Y}$, the former is conditioned on the input $X$ while the latter is conditioned on the output $Y$, which appears to serve as ground-truth but is sampled from the data distribution $P$. This makes the behavior of RAML attempting to match them unintuitive; ii) Suppose that in the training data there exist two training instances with the same input but different outputs, i.e., $(x, y_1), (x, y_2) \in D$ with $y_1 \neq y_2$. Then $P_\theta(\cdot \mid X = x)$ has two “targets”, $q(\cdot \mid Y = y_1; \tau)$ and $q(\cdot \mid Y = y_2; \tau)$, making it unclear what distribution $P_\theta(\cdot \mid X = x)$ asymptotically converges to; iii) There is no rigorous theoretical evidence showing that generating from $P_{\hat{\theta}_{\mathrm{RAML}}}$ yields a better prediction function than generating from $P_{\hat{\theta}_{\mathrm{ML}}}$.
To the best of our knowledge, no attempt has been made to theoretically address these problems. The main goal of this work is to theoretically analyze the properties of RAML, in the hope that we may eventually better understand it by answering these questions and further improve it by proposing a new training framework. To this end, in the next section we introduce a softmax Q-distribution estimation framework, facilitating our later analysis.
3 Softmax Q-Distribution Estimation
With the end goal of theoretically interpreting RAML in mind, in this section we present the softmax Q-distribution estimation framework. We first provide background on Bayesian decision theory (§3.1) and softmax approximation of deterministic distributions (§3.2). Then, we propose the softmax Q-distribution (§3.3), and establish the framework of estimating the softmax Q-distribution from training data, called softmax Q-distribution maximum likelihood (SQDML, §3.4). In §3.5, we analyze SQDML, which is central in linking RAML and softmax Q-distribution estimation.
3.1 Bayesian Decision Theory
Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification, which quantifies the tradeoffs between various classification decisions using the probabilities and rewards (losses) that accompany such decisions.
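Based on the notation set up in §2.1, let $\mathcal{H}$ denote all the possible prediction functions from input to output space, i.e., $\mathcal{H} = \{h : \mathcal{X} \to \mathcal{Y}\}$. Then, the expected reward of a prediction function $h$ is:
$$R(h) = \mathbb{E}_{P(X, Y)}\big[r(h(X), Y)\big] \tag{7}$$
where $r(\cdot, \cdot)$ is the reward function accompanying the structured prediction task.
Bayesian decision theory states that the global maximum of $R(h)$, i.e., the optimal expected prediction reward, is achieved when the prediction function is the so-called Bayes decision rule:
$$h^*(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbb{E}_{P(Y \mid X = x)}\big[r(y, Y)\big] = \operatorname*{argmax}_{y \in \mathcal{Y}} R(y \mid x) \tag{8}$$
where $R(y \mid x) = \mathbb{E}_{P(Y \mid X = x)}[r(y, Y)]$ is called the conditional reward. Thus, the Bayes decision rule states that to maximize the overall reward, compute the conditional reward for each output $y \in \mathcal{Y}$ and then select the output $y$ for which $R(y \mid x)$ is maximized.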
Importantly, when the reward function is the indicator function, i.e., $r(y, y^*) = \mathbb{I}(y = y^*)$, the Bayes decision rule reduces to a specific instantiation called the Bayes classifier:
$$h_c(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid X = x) \tag{9}$$
where $P(Y \mid X = x)$ is the true conditional distribution of data defined in §2.1.
In §2.2, we see that ML attempts to learn the true distribution $P$. Thus, in the optimal case, decoding from the distribution learned with ML, i.e., $P_{\hat{\theta}_{\mathrm{ML}}}(\cdot \mid X = x)$, produces the Bayes classifier $h_c(x)$, but not the more general Bayes decision rule $h^*(x)$. In the rest of this section, we derive a theoretical proof showing that decoding from the distribution learned with RAML, i.e., $P_{\hat{\theta}_{\mathrm{RAML}}}(\cdot \mid X = x)$, approximately achieves $h^*(x)$, illustrating why RAML yields a prediction function with improved performance towards the optimized reward function over ML.
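To make the gap between the Bayes classifier and the Bayes decision rule concrete, here is a small sketch on a hand-made conditional distribution. The three labels, their probabilities and the `partial_credit` reward are hypothetical values chosen only so that the two rules disagree; none of them come from the experiments in this paper.

```python
def conditional_reward(y, p_y_given_x, reward):
    """Expected reward of predicting y when the truth follows p_y_given_x."""
    return sum(p * reward(y, y_true) for y_true, p in p_y_given_x.items())

def bayes_decision_rule(p_y_given_x, reward):
    """Output with the highest conditional reward."""
    return max(p_y_given_x, key=lambda y: conditional_reward(y, p_y_given_x, reward))

def bayes_classifier(p_y_given_x):
    """Output with the highest conditional probability."""
    return max(p_y_given_x, key=p_y_given_x.get)

def exact_match(y, y_true):
    return 1.0 if y == y_true else 0.0

def partial_credit(y, y_true):
    # Hypothetical reward: "b" and "c" are near-equivalent outputs.
    if y == y_true:
        return 1.0
    return 0.8 if {y, y_true} == {"b", "c"} else 0.0

p = {"a": 0.40, "b": 0.35, "c": 0.25}  # assumed true P(Y | X = x)
```

Under `exact_match` both rules return `"a"`, the most probable output. Under `partial_credit` the Bayes decision rule switches to `"b"`, whose conditional reward is 0.35 + 0.8 × 0.25 = 0.55, even though `"a"` remains the most probable output.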
3.2 Softmax Approximation of Deterministic Distributions
Aimed at providing a smooth approximation of the Bayes decision boundary determined by the Bayes decision rule in (8), we first describe a widely used approximation of deterministic distributions using the softmax function.
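Let $\mathcal{F} = \{f_k : k \in \mathcal{K}\}$ denote a class of functions, where $f_k : \mathcal{X} \to \mathbb{R}$. We assume that $\mathcal{K}$ is finite. Then, we define the random variable $Z = \operatorname*{argmax}_{k \in \mathcal{K}} f_k(X)$, where $X$ is our input random variable. Obviously, $Z$ is deterministic when $X$ is given, i.e.,
$$P(Z = z \mid X = x) = \begin{cases} 1, & \text{if } z = \operatorname*{argmax}_{k \in \mathcal{K}} f_k(x) \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
for each $z \in \mathcal{K}$ and $x \in \mathcal{X}$.
The softmax function provides a smooth approximation of the point distribution in (10), with a temperature parameter, $\tau$, serving as a hyperparameter that controls the smoothness of the approximating distribution around the target one:
$$Q(Z = z \mid X = x; \tau) = \frac{\exp\big(f_z(x)/\tau\big)}{\sum_{k \in \mathcal{K}} \exp\big(f_k(x)/\tau\big)} \tag{11}$$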
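The role of the temperature in (11) can be checked numerically. The sketch below is a plain temperature-scaled softmax over three assumed function values; the numbers are arbitrary and only serve to show the two regimes.

```python
import math

def softmax(values, tau):
    """Temperature-scaled softmax over a finite list of function values."""
    top = max(values)  # shift by the max so that exp never overflows
    weights = [math.exp((v - top) / tau) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

scores = [0.2, 1.0, 0.5]              # assumed values of f_k(x) for three k
sharp = softmax(scores, tau=0.01)     # nearly a point mass on the argmax
flat = softmax(scores, tau=1000.0)    # nearly uniform over the three entries
```

With a small temperature almost all mass sits on the index of the largest value; with a large temperature every entry is close to 1/3.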
It should be noted that as $\tau \to 0$, the distribution $Q$ reduces to the original deterministic distribution in (10), and in the limit as $\tau \to \infty$, $Q$ is equivalent to the uniform distribution $\mathrm{Unif}(\mathcal{K})$.
3.3 Softmax Q-distribution
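We are now ready to propose the softmax Q-distribution, which is central in revealing the relationship between RAML and the Bayes decision rule. We first define the random variable $Z = h^*(X) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbb{E}_{P(Y \mid X)}[r(y, Y)]$. Then, $Z$ is deterministic given $X$, and according to (11), we define the softmax Q-distribution to approximate the conditional distribution of $Z$ given $X$:
$$Q(Z = z \mid X = x; \tau) = \frac{\exp\big(\mathbb{E}_{P(Y \mid X = x)}[r(z, Y)]/\tau\big)}{\sum_{y \in \mathcal{Y}} \exp\big(\mathbb{E}_{P(Y \mid X = x)}[r(y, Y)]/\tau\big)} \tag{12}$$
for each $x \in \mathcal{X}$ and $z \in \mathcal{Y}$. (Footnote 1: In the following derivations we omit $\tau$ in $Q(Z \mid X; \tau)$ for simplicity when there is no ambiguity.) Importantly, one can verify that decoding from the softmax Q-distribution provides us with the Bayes decision rule,
$$h^*(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} Q(y \mid X = x; \tau) \tag{13}$$
with any value of $\tau > 0$.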
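The claim that decoding from the softmax Q-distribution recovers the Bayes decision rule at every temperature follows because the softmax is monotone in the conditional reward. A minimal numerical check, using an assumed three-label problem with a hypothetical distance-based reward:

```python
import math

def softmax_q(p_y_given_x, reward, tau):
    """Softmax over conditional rewards, one entry per candidate output z."""
    outputs = list(p_y_given_x)
    cond = [sum(p * reward(z, y) for y, p in p_y_given_x.items()) for z in outputs]
    top = max(cond)
    weights = [math.exp((c - top) / tau) for c in cond]
    total = sum(weights)
    return {z: w / total for z, w in zip(outputs, weights)}

def reward(z, y):
    # Hypothetical reward on ordinal labels 0, 1, 2: decays with distance.
    return 1.0 - abs(z - y) / 2.0

p = {0: 0.45, 1: 0.10, 2: 0.45}   # assumed true P(Y | X = x)
decoded = {tau: max(softmax_q(p, reward, tau), key=softmax_q(p, reward, tau).get)
           for tau in (0.05, 1.0, 100.0)}
```

Here the conditional rewards are 0.5, 0.55 and 0.5, so the Bayes decision rule selects label 1 although it is the least probable label, and `decoded` maps every temperature to that same label.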
3.4 Softmax Qdistribution Maximum Likelihood
Because making predictions according to the softmax Q-distribution is equivalent to the Bayes decision rule, we would like to construct a (parametric) statistical model to directly model the softmax Q-distribution in (12), similarly to how ML models the true data distribution $P$. We call this framework softmax Q-distribution maximum likelihood (SQDML). This framework is model-agnostic, so any probabilistic model used in ML, such as conditional log-linear models and deep neural networks, can be directly applied to modeling the softmax Q-distribution.
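Suppose that we use a parametric statistical model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ to model the softmax Q-distribution. In order to learn “optimal” parameters $\theta$ from training data $D$, an intuitive and well-motivated objective function is the KL-divergence between the empirical conditional distribution of $Q(\cdot \mid X)$, denoted as $\tilde{Q}(\cdot \mid X)$, and the model distribution $P_\theta(\cdot \mid X)$:
$$\hat{\theta}_{\mathrm{SQDML}} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{Q}(X)}\Big[\mathrm{KL}\big(\tilde{Q}(\cdot \mid X) \,\|\, P_\theta(\cdot \mid X)\big)\Big] \tag{14}$$
We can directly set $\tilde{Q}(X) = \tilde{P}(X)$, which leaves the problem of defining the empirical conditional distribution $\tilde{Q}(Z \mid X)$. Before defining $\tilde{Q}(Z \mid X)$, we first note that if the defined empirical distribution $\tilde{Q}(Z \mid X)$ asymptotically converges to the true Q-distribution $Q(Z \mid X)$, the learned model distribution $P_{\hat{\theta}_{\mathrm{SQDML}}}(\cdot \mid X)$ converges to $Q(\cdot \mid X)$. Therefore, decoding from $P_{\hat{\theta}_{\mathrm{SQDML}}}(\cdot \mid X = x)$ ideally achieves the Bayes decision rule $h^*(x)$.
A straightforward way to define $\tilde{Q}(Z \mid X)$ is to use the empirical distribution $\tilde{P}(Y \mid X)$:
$$\tilde{Q}(Z = z \mid X = x) = \frac{\exp\big(\mathbb{E}_{\tilde{P}(Y \mid X = x)}[r(z, Y)]/\tau\big)}{\sum_{y \in \mathcal{Y}} \exp\big(\mathbb{E}_{\tilde{P}(Y \mid X = x)}[r(y, Y)]/\tau\big)} \tag{15}$$
where $\tilde{P}$ is the empirical distribution of $P$ defined in (3). Asymptotically as $n \to \infty$, $\tilde{P}$ converges to $P$. Thus, $\tilde{Q}$ asymptotically converges to $Q$.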
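Unfortunately, the empirical distribution (15) is not efficient to compute, since the expectation term is inside the exponential function (see Appendix D.2 for approximately learning $\hat{\theta}_{\mathrm{SQDML}}$ in practice). This leads us to seek an approximation of the softmax Q-distribution and its corresponding empirical distribution. Here we propose the following distribution $Q'$ to approximate the softmax Q-distribution defined in (12):
$$Q'(Z = z \mid X = x) = \mathbb{E}_{P(Y \mid X = x)}\left[\frac{\exp\big(r(z, Y)/\tau\big)}{\sum_{y \in \mathcal{Y}} \exp\big(r(y, Y)/\tau\big)}\right] \tag{16}$$
where we move the expectation term outside the exponential function. Then, the corresponding empirical distribution of $Q'$ can be written in the following form:
$$\tilde{Q}'(Z = z \mid X = x) = \mathbb{E}_{\tilde{P}(Y \mid X = x)}\left[\frac{\exp\big(r(z, Y)/\tau\big)}{\sum_{y \in \mathcal{Y}} \exp\big(r(y, Y)/\tau\big)}\right] \tag{17}$$
Approximating $\tilde{Q}$ with $\tilde{Q}'$, and plugging (17) into the RHS in (14), we have:
$$\hat{\theta}_{\mathrm{SQDML}} \approx \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X)}\Big[\mathrm{KL}\big(\tilde{Q}'(\cdot \mid X) \,\|\, P_\theta(\cdot \mid X)\big)\Big] = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X, Y)}\Big[\mathrm{KL}\big(q(\cdot \mid Y; \tau) \,\|\, P_\theta(\cdot \mid X)\big)\Big] = \hat{\theta}_{\mathrm{RAML}} \tag{18}$$
where $q(\cdot \mid y; \tau)$ is the exponentiated payoff distribution of RAML in (5).
Equation (18) states that RAML is an approximation of our proposed SQDML, obtained by approximating $\tilde{Q}$ with $\tilde{Q}'$. Interestingly, in the case that is most common in practice, when each input is unique in the training data, i.e., no two training pairs share the same input while having different outputs, we have $\tilde{Q} = \tilde{Q}'$, resulting in $\hat{\theta}_{\mathrm{SQDML}} = \hat{\theta}_{\mathrm{RAML}}$. In other words, the estimated distributions $P_{\hat{\theta}_{\mathrm{SQDML}}}$ and $P_{\hat{\theta}_{\mathrm{RAML}}}$ are exactly the same when each input is unique in the training data, since the empirical distributions $\tilde{Q}$ and $\tilde{Q}'$ estimated from the training data are the same.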
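The difference between the two empirical targets, with the expectation inside the exponential for SQDML and outside it for RAML, can be seen on a toy input with several references. The sketch below is illustrative only: the enumerated candidate set, the reference sequences and the token-level accuracy reward are assumptions, and the function names are hypothetical.

```python
import math

def softmax(scores, tau):
    top = max(scores)
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def token_accuracy(z, y):
    return sum(a == b for a, b in zip(z, y)) / len(y)

def sqdml_target(candidates, references, reward, tau):
    """Softmax of the average reward over references (expectation inside exp)."""
    avg = [sum(reward(z, y) for y in references) / len(references) for z in candidates]
    return softmax(avg, tau)

def raml_target(candidates, references, reward, tau):
    """Average of per-reference payoff distributions (expectation outside exp)."""
    per_ref = [softmax([reward(z, y) for z in candidates], tau) for y in references]
    return [sum(column) / len(references) for column in zip(*per_ref)]

candidates = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
multi = [("A", "A"), ("B", "B")]   # one input paired with two different references
single = [("A", "B")]              # an input with a unique reference
```

With the single reference the two targets coincide. With the two conflicting references, every candidate has the same average reward, so the SQDML target is uniform, while the RAML target keeps separate peaks on the two references.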
3.5 Analysis and Discussion of SQDML
In §3.4, we provided a theoretical interpretation of RAML by establishing the relationship between RAML and SQDML. In this section, we try to answer the questions about RAML raised in §2.3 using this interpretation, and further analyze the level of approximation from the softmax Q-distribution $Q$ in (12) to $Q'$ in (16) by proving an upper bound on the approximation error.
Let us first use our interpretation to answer the three questions regarding RAML in §2.3. First, instead of optimizing the KL divergence between the artificially designed exponentiated payoff distribution and the model distribution, RAML in our formulation approximately matches the model distribution $P_\theta(\cdot \mid X = x)$ with the softmax Q-distribution $Q(\cdot \mid X = x; \tau)$. Second, based on our interpretation, asymptotically as $n \to \infty$, RAML learns a distribution that converges to $Q'$ in (16), and therefore approximately converges to the softmax Q-distribution. Third, as mentioned in §3.3, generating from the softmax Q-distribution produces the Bayes decision rule, which theoretically outperforms the prediction function from ML w.r.t. the expected reward.
It is necessary to mention that both RAML and SQDML are trying to learn distributions, decoding from which (approximately) delivers the Bayes decision rule. There are other directions that can also achieve the Bayes decision rule, such as minimum Bayes risk decoding (Kumar & Byrne, 2004), which attempts to estimate the Bayes decision rule directly by computing the expectation w.r.t. the data distribution learned from training data.
So far our discussion has concentrated on the theoretical interpretation and analysis of RAML, without any concern for how well $Q'$ approximates $Q$. Now, we characterize the approximation error by proving an upper bound on the KL divergence between them:
Theorem 1.
From Theorem 1 (proof in Appendix A.1) we observe that the level of approximation mainly depends on two factors: the upper bound of the reward function ($R$) and the temperature parameter $\tau$. In practice, $R$ is often less than or equal to 1, when metrics like accuracy or BLEU are applied.
It should be noted that, at one extreme when $\tau$ becomes larger, the approximation error tends to zero. At the same time, however, the softmax Q-distribution becomes closer to the uniform distribution $\mathrm{Unif}(\mathcal{Y})$, providing less information for prediction. Thus, in practice, it is necessary to consider the trade-off between approximation error and predictive power.
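The trade-off described above can be observed on a toy problem by computing the KL divergence between the softmax Q-distribution and its approximation (expectation moved outside the exponential) at several temperatures. This is only a numerical illustration under assumed inputs: the two-sequence conditional distribution and the token-level accuracy reward are made up for the example, and the code does not reproduce the bound of Theorem 1.

```python
import math

def softmax(scores, tau):
    top = max(scores)
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def token_accuracy(z, y):
    return sum(a == b for a, b in zip(z, y)) / len(y)

def q_exact(candidates, p, reward, tau):
    """Softmax of the expected reward under p (expectation inside exp)."""
    cond = [sum(pr * reward(z, y) for y, pr in p.items()) for z in candidates]
    return softmax(cond, tau)

def q_approx(candidates, p, reward, tau):
    """Expected payoff distribution under p (expectation outside exp)."""
    mixed = [0.0] * len(candidates)
    for y, pr in p.items():
        for i, v in enumerate(softmax([reward(z, y) for z in candidates], tau)):
            mixed[i] += pr * v
    return mixed

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

candidates = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
p = {("A", "A"): 0.5, ("B", "B"): 0.5}   # assumed true P(Y | X = x)
taus = [0.5, 1.0, 5.0, 50.0]
errors = [kl(q_exact(candidates, p, token_accuracy, t),
             q_approx(candidates, p, token_accuracy, t)) for t in taus]
```

In this example `errors` decreases monotonically as the temperature grows, while at the largest temperature both distributions are nearly uniform and therefore carry little information about which output to prefer.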
What about the other extreme, with $\tau$ “as close to zero as possible”? With suitable assumptions about the data distribution $P$, we can characterize the approximation error by using the same KL divergence:
Theorem 2.
Suppose that the reward function is bounded , and , where is a constant. Suppose additionally that, like a sub-Gaussian, for every , satisfies the exponential tail bound w.r.t. , that is, for each , there exists a unique such that for every
(20) 
where is a distributiondependent constant. Assume that . Denote . Then, as ,
(21) 
4 Experiments
In this section, we perform two sets of experiments to verify our theoretical analysis of the relation between SQDML and RAML. As discussed in §3.4, RAML and SQDML deliver the same predictions when each input is unique in the data. Thus, in order to compare SQDML against RAML, the first set of experiments is designed on two data sets in which the input is not unique: synthetic data for cost-sensitive multiclass classification, and the MSCOCO benchmark dataset (Chen et al., 2015) for image captioning. To further confirm the advantages of RAML (and SQDML) over ML, and thus the necessity for better theoretical understanding, we perform the second set of experiments on three structured prediction tasks in NLP. In these cases SQDML reduces to RAML, as each input is unique in these three data sets.
4.1 Experiments on SQDML
4.1.1 Cost-sensitive Multiclass Classification
First, we perform experiments on synthetic data for cost-sensitive multiclass classification designed to demonstrate that RAML learns a distribution approximately producing the Bayes decision rule, which is asymptotically the prediction function delivered by SQDML.
The synthetic data set is for a 4-class classification task, where , and . We define four base points, one for each class:
For data generation, the distribution $P(X)$ is the uniform distribution on $\mathcal{X}$, and the log form of the conditional distribution $P(Y \mid X = x)$ for each $x \in \mathcal{X}$ is proportional to the negative distance from $x$ to each base point:
(22) 
where $d(\cdot, \cdot)$ is the Euclidean distance between two points. To generate training data, we first draw 1 million inputs $x$ from $P(X)$. Then, we independently generate 10 outputs $y$ from $P(Y \mid X = x)$ for each $x$ to build a data set with multiple references. Thus, the total number of training instances is 10 million. For validation and test data, we independently generate 0.1 million pairs of $(x, y)$ from $P(X, Y)$, respectively.
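The generation procedure can be sketched as follows. The base points and the input domain used here are placeholders chosen for illustration (the actual values are not reproduced in this sketch), and the sample sizes are scaled down so the example runs instantly; only the sampling scheme, with a conditional distribution whose log is proportional to the negative Euclidean distance to each class's base point, follows the description above.

```python
import math
import random

# Hypothetical base points, one per class (placeholders, not the paper's values).
BASE_POINTS = {0: (1.0, 1.0), 1: (-1.0, 1.0), 2: (-1.0, -1.0), 3: (1.0, -1.0)}

def conditional_distribution(x, base_points):
    """P(y | x) proportional to exp(-d(x, base_point_y)), d = Euclidean distance."""
    logits = {y: -math.dist(x, b) for y, b in base_points.items()}
    top = max(logits.values())
    weights = {y: math.exp(v - top) for y, v in logits.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def generate(num_inputs, outputs_per_input, rng, base_points=BASE_POINTS):
    """Draw inputs uniformly, then several labels per input (multiple references)."""
    data = []
    for _ in range(num_inputs):
        # Assumed input domain: the square [-1, 1] x [-1, 1].
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        p = conditional_distribution(x, base_points)
        labels, probs = zip(*p.items())
        for y in rng.choices(labels, weights=probs, k=outputs_per_input):
            data.append((x, y))
    return data

data = generate(num_inputs=100, outputs_per_input=10, rng=random.Random(0))
```

Because each input is paired with several independently sampled labels, the same `x` appears with different outputs, which is exactly the situation in which SQDML and RAML can differ.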
The model we used is a feed-forward (dense) neural network with 2 hidden layers, each of which has 8 units. Optimization is performed with mini-batch stochastic gradient descent (SGD) with learning rate 0.1 and momentum 0.9. Each model is trained for 100 epochs and we apply early stopping (Caruana et al., 2001) based on performance on validation sets.
The reward function is designed to distinguish the four classes. For “correct” predictions, the specific reward values assigned for the four classes are:
For “wrong” predictions, rewards are always zero, i.e., $r(y, y^*) = 0$ when $y \neq y^*$.
Figure 3 depicts the effect of varying the temperature parameter $\tau$ on model performance, ranging from 0.1 to 3.0 with step 0.1. For each fixed $\tau$, we report the mean performance over 5 repetitions. Figure 3 shows the averaged rewards obtained as a function of $\tau$ on both validation and test datasets of ML, RAML and SQDML, respectively. From Figure 3 we can see that when $\tau$ increases, the performance gap between SQDML and RAML keeps decreasing, indicating that RAML achieves a better approximation to SQDML. This evidence verifies the statement in Theorem 1 that the approximation error between RAML and SQDML decreases as $\tau$ continues to grow.
The results in Figure 3 raise a question: does a larger $\tau$ necessarily yield better performance for RAML? To further illustrate the effect of $\tau$ on the model performance of RAML and SQDML, we perform experiments with a wide range of $\tau$, from 1 to 10,000 with step 200. We also repeat each experiment 5 times. The results are shown in Figure 6. We see that the model performance (average reward), however, does not keep growing with increasing $\tau$. As discussed in §3.5, the softmax Q-distribution becomes closer to the uniform distribution when $\tau$ becomes larger, making it less expressive for prediction. Thus, when applying RAML in practice, the trade-off between approximation error and predictive power of the model needs to be considered. More details, results and analysis of the conducted experiments are provided in Appendix B.
RAML Reward | RAML BLEU | SQDML Reward | SQDML BLEU | RAML Reward | RAML BLEU | SQDML Reward | SQDML BLEU
10.77 | 27.02 | 10.82 | 27.08 | 10.84 | 27.26 | 10.82 | 27.03
10.81 | 27.27 | 10.78 | 26.92 | 10.82 | 27.29 | 10.80 | 27.20
10.88 | 27.62 | 10.91 | 27.54 | 10.74 | 26.89 | 10.78 | 26.98
10.82 | 27.33 | 10.79 | 27.02 | 10.77 | 27.01 | 10.72 | 26.66
Table 1: Average reward and corpus-level BLEU (standard evaluation metric) scores for the image captioning task with different $\tau$.
4.1.2 Image Captioning with Multiple References
Second, to show that optimizing toward our proposed SQDML objective yields better predictions than RAML on real-world structured prediction tasks, we evaluate on the MSCOCO image captioning dataset. This dataset contains 123,000 images, each of which is paired with at least five manually annotated captions. We follow the offline evaluation setting in (Karpathy & Li, 2015), and reserve 5,000 images for validation and testing, respectively. We implemented a simple neural image captioning model using a pre-trained VGGNet as the encoder and a Long Short-Term Memory (LSTM) network as the decoder. Details of the experimental setup are in Appendix C.
As in §4.1.1, for the sake of comparing SQDML with RAML to verify our theoretical analysis, we use the average reward as the performance measure by simply defining the reward as the pairwise sentence-level BLEU score between the model’s prediction and each reference caption (Footnote 2: Note that this is different from standard multi-reference sentence-level BLEU, which counts n-gram matches w.r.t. all sentences and then uses these sufficient statistics to calculate a final score.), though the standard benchmark metric commonly used in image captioning (e.g., corpus-level BLEU-4 score) is not simply defined as averaging over the pairwise rewards between prediction and reference captions.
We use stochastic gradient descent to optimize the objectives for SQDML (14) and RAML (4). However, the denominators of the softmax Q-distribution for SQDML (15) and the payoff distribution for RAML (5) contain summations over the intractable, exponentially large hypothesis space $\mathcal{Y}$. We therefore propose a simple heuristic approach to approximate the denominator by restricting the exponential space to a fixed set of sampled targets. Approximating the intractable hypothesis space using sampling is not new in structured prediction, and has been shown to be effective in optimizing neural structured prediction models (Shen et al., 2016). Specifically, the sampled candidate set is constructed by (i) including each ground-truth reference in the set; and (ii) uniformly replacing an $n$-gram () in one (randomly sampled) reference with a randomly sampled $n$-gram. We refer to this approach as $n$-gram replacement. We provide more details of the training procedure in Appendix C.
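The candidate-set construction by n-gram replacement can be sketched as below. This is a hedged illustration rather than the training code: the function name, the toy vocabulary and references, and the choice to draw the replacement n-gram as `n` independent uniform tokens from the vocabulary are all assumptions, since the text does not specify how the replacement n-gram is sampled.

```python
import random

def ngram_replacement(references, vocab, n, num_samples, rng):
    """Candidate set: every reference, plus copies with one n-gram replaced."""
    candidates = [tuple(ref) for ref in references]      # step (i): keep references
    for _ in range(num_samples):
        ref = list(rng.choice(references))                # pick one reference at random
        if len(ref) < n:
            continue                                      # too short to hold an n-gram
        start = rng.randrange(len(ref) - n + 1)           # uniform start position
        # Assumption: the new n-gram is n tokens drawn uniformly from the vocabulary.
        ref[start:start + n] = [rng.choice(vocab) for _ in range(n)]
        candidates.append(tuple(ref))                     # step (ii): add perturbed copy
    return candidates

rng = random.Random(0)
references = [["a", "dog", "runs", "fast"], ["the", "dog", "is", "running"]]
vocab = ["a", "the", "dog", "cat", "runs", "is", "fast", "running"]
candidates = ngram_replacement(references, vocab, n=2, num_samples=5, rng=rng)
```

The softmax normalizers in the SQDML and RAML objectives are then summed over `candidates` only, instead of over the full hypothesis space.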
Table 1 lists the results. We evaluate on both the average reward and the benchmark metric (corpus-level BLEU-4). We also tested a vanilla ML baseline, which achieves 10.71 average reward and 26.91 corpus-level BLEU. Both SQDML and RAML outperform ML according to the two metrics. Interestingly, comparing SQDML with RAML we did not observe a significant improvement in average reward. We hypothesize that this is because the reference captions for each image differ substantially, making it highly non-trivial for the model to predict a “consensus” caption that agrees with multiple references. As an example, we randomly sampled 300 images from the validation set and computed the averaged sentence-level BLEU between two references, which is only 10.09. Nevertheless, through case studies we still found some interesting examples, which demonstrate that SQDML is capable of generating predictions that match multiple candidates. Figure 7 gives two examples. In the two examples, SQDML’s predictions match multiple references, registering the highest average reward. On the other hand, RAML gives sub-optimal predictions in terms of average reward since it is an approximation of SQDML. Finally, since the objective of ML is solely to maximize the reward w.r.t. a single reference, it gives the lowest average reward, while achieving a higher maximum reward.
4.2 Experiments on Structured Prediction
Norouzi et al. (2016) already evaluated the effectiveness of RAML on the sequence prediction tasks of speech recognition and machine translation using neural sequence-to-sequence models. In this section, we further confirm the empirical success of RAML (and SQDML) over ML: (i) We apply RAML on three structured prediction tasks in NLP, including named entity recognition (NER), dependency parsing and machine translation (MT), using both classical feature-based log-linear models (NER and parsing) and state-of-the-art attentional recurrent neural networks (MT). (ii) Different from Norouzi et al. (2016), where edit distance is uniformly used as a surrogate training reward and the learning objective in (4) is approximated through sampling, we use task-specific rewards, defined on sequential (NER), tree-based (parsing) and complex irregular structures (MT). Specifically, instead of sampling, we apply efficient dynamic programming algorithms (NER and parsing) to directly compute the analytical solution of (4). (iii) We present further analysis comparing RAML with ML, showing that due to different learning objectives, RAML registers better results under task-specific metrics, while ML yields better exact-match accuracy.
4.2.1 Setup
In this section we describe experimental setups for three evaluation tasks. We refer readers to Appendix D for dataset statistics, modeling details and training procedure.
Named Entity Recognition (NER)
For NER, we experimented on the English data from CoNLL 2003 shared task (Tjong Kim et al., 2003). There are four predefined types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC. The dataset includes 15K training sentences, 3.4K for validation, and 3.7K for testing.
We built a linear CRF model (Lafferty et al., 2001) with the same features used in Finkel et al. (2005). Instead of using the official F1 score over complete span predictions, we use token-level accuracy as the training reward, as this metric can be factorized over words, and hence there exists an efficient dynamic programming algorithm to compute the expected log-likelihood objective in (4).
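Why factorization helps can be seen already on the normalizer of the payoff distribution. With token-level accuracy as the reward, the exponentiated reward is a product of per-token terms, so the sum over all label sequences collapses into a product of per-token sums. The sketch below checks this identity by brute force on a tiny label set; the label names and the temperature are arbitrary assumptions, and the sketch does not implement the CRF dynamic program used in the experiments.

```python
import itertools
import math

def brute_force_normalizer(labels, reference, tau):
    """Sum of exp(accuracy / tau) over every label sequence of the same length."""
    length = len(reference)
    total = 0.0
    for y in itertools.product(labels, repeat=length):
        accuracy = sum(a == b for a, b in zip(y, reference)) / length
        total += math.exp(accuracy / tau)
    return total

def factorized_normalizer(labels, reference, tau):
    """Same quantity as a product of identical per-token sums."""
    length = len(reference)
    # At each position: one matching label contributes exp(1 / (length * tau)),
    # and each of the remaining labels contributes exp(0) = 1.
    per_token = math.exp(1.0 / (length * tau)) + (len(labels) - 1)
    return per_token ** length

labels = ["O", "PER", "LOC"]                 # assumed toy label set
reference = ["PER", "O", "O", "LOC"]
brute = brute_force_normalizer(labels, reference, tau=0.3)
fast = factorized_normalizer(labels, reference, tau=0.3)
```

The brute-force sum visits every one of the 3^4 sequences, whereas the factorized form needs a single per-token term, which is the property that dynamic programming exploits for rewards that decompose over tokens or edges.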
Dependency Parsing
For dependency parsing, we evaluate on the English Penn Treebank (PTB) (Marcus et al., 1993). We follow the standard splits of PTB, using sections 2-21 for training, section 22 for validation and section 23 for testing. We adopt the Stanford Basic Dependencies (De Marneffe et al., 2006) using the Stanford parser v3.3.0 (Footnote 3: http://nlp.stanford.edu/software/lexparser.shtml). We applied the same data preprocessing procedure as in Dyer et al. (2015).
We adopt an edge-factorized tree-structure log-linear model with the same features used in Ma & Zhao (2012). We use the unlabeled attachment score (UAS) as the training reward, which is also the official evaluation metric of parsing performance. Similar to NER, the expectation in (4) can be computed efficiently using dynamic programming since UAS can be factorized over edges.
Machine Translation (MT)
We tested on the German-English machine translation task in the IWSLT 2014 evaluation campaign (Cettolo et al., 2014), a widely used benchmark for evaluating optimization techniques for neural sequence-to-sequence models. The dataset contains 153K training sentence pairs. We follow previous works (Wiseman & Rush, 2016; Bahdanau et al., 2017; Li et al., 2017) and use an attentional neural encoder-decoder model with LSTM networks. The size of the LSTM hidden states is 256. Similar to §4.1.2, we use the sentence-level BLEU score as the training reward and approximate the learning objective using $n$-gram replacement (). We evaluate using standard corpus-level BLEU.
4.2.2 Main Results
SB  CB  SB  CB  

28.67  27.42  29.37  28.49  
29.44  28.38  29.52  28.59  
29.59  28.40  29.54  28.63  
29.80  28.77  29.48  28.58  
29.55  28.45  29.34  28.40 
Methods  ML Baseline  Proposed Model 

Ranzato et al. (2016)  20.10  21.81 
Wiseman & Rush (2016)  24.03  26.36 
Li et al. (2017)  27.90  28.30 
Bahdanau et al. (2017)  27.56  28.53 
This Work  27.66  28.77 
Method | NER Acc. | NER F1 | NER E.M. | Parsing UAS | Parsing E.M. | MT SB | MT CB | MT E.M.
ML  97.0  84.9  78.8  90.7  39.9  29.15  27.66  3.79 
RAML  97.3  86.0  80.1  91.1  39.4  29.80  28.77  3.35 
The results of NER and dependency parsing are shown in Table 3 and Table 3, respectively. We observed that the RAML model obtained the best results at for NER, and for dependency parsing. Beyond , RAML models get worse than the ML baseline for both tasks, showing that in practice the temperature needs to be selected carefully. In addition, the rewards we directly optimized in training (token-level accuracy for NER and UAS for dependency parsing) are more stable w.r.t. $\tau$ than the evaluation metrics (F1 in NER), illustrating that in practice, choosing a training reward that correlates well with the evaluation metric is important.
Table 6 summarizes the results for MT. We also compare our model with previous works on incorporating task-specific rewards (i.e., BLEU score) in optimizing neural sequence-to-sequence models (cf. Table 6). Our approach, albeit simple, surprisingly outperforms previous works. Specifically, all previous methods require a pre-trained ML baseline to initialize the model, while RAML learns from scratch. This suggests that RAML is easier and more stable to optimize compared with existing approaches like RL (e.g., Ranzato et al. (2016) and Bahdanau et al. (2017)), which require sampling from the moving model distribution and suffer from high variance. Finally, we remark that RAML performs consistently better than the ML baseline (27.66) across most temperature settings.
4.2.3 Further Comparison with Maximum Likelihood
Table 6 illustrates the performance of ML and RAML under different metrics of the three tasks. We observe that RAML outperforms ML on both the directly optimized rewards (token-level accuracy for NER, UAS for dependency parsing and sentence-level BLEU for MT) and task-specific evaluation metrics (F1 for NER and corpus-level BLEU for MT). Interestingly, we find a trend that ML gets better results on two out of the three tasks under exact match accuracy, which is the reward that ML attempts to optimize (as discussed in (9)). This is in line with our theoretical analysis, in that RAML and ML each achieve better prediction functions w.r.t. the rewards they try to optimize.
5 Conclusion
In this work, we propose the framework of estimating the softmax Q-distribution from training data. Our theoretical analysis shows that, asymptotically, the prediction function learned by RAML approximately achieves the Bayes decision rule. Experiments on three structured prediction tasks demonstrate that RAML consistently outperforms ML baselines.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, San Diego, California, 2015.
 Bahdanau et al. (2017) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In Proceedings of ICLR, Toulon, France, 2017.

 Caruana et al. (2001) Rich Caruana, Steve Lawrence, and Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of NIPS, volume 13, pp. 402. MIT Press, 2001.
 Cettolo et al. (2014) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, pp. 2–11, 2014.
 Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.
 De Marneffe et al. (2006) Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pp. 449–454, 2006.
 Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL, pp. 334–343, Beijing, China, July 2015.
 Finkel et al. (2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL, pp. 363–370, Ann Arbor, Michigan, June 2005.
 Gimpel (2012) K. Gimpel. Discriminative Feature-Rich Modeling for Syntax-Based Machine Translation. PhD thesis, Carnegie Mellon University, 2012.
 Gimpel & Smith (2010) Kevin Gimpel and Noah A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Proceedings of NAACL, pp. 733–736, Los Angeles, California, June 2010.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Karpathy & Li (2015) Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, pp. 3128–3137, Boston, MA, USA, June 2015.
 Kumar & Byrne (2004) Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine translation. Technical Report, 2004.
 Lafferty et al. (2001) John Lafferty, Andrew McCallum, Fernando Pereira, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, volume 1, pp. 282–289, San Francisco, California, 2001.
 Li et al. (2017) Jiwei Li, Will Monroe, and Dan Jurafsky. Learning to decode for future success. CoRR, abs/1701.06549, 2017.
 Liu et al. (2016) Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370, 2016.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pp. 1412–1421, Lisbon, Portugal, 2015.
 Ma & Hovy (2016) Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL, pp. 1064–1074, Berlin, Germany, August 2016.
 Ma & Zhao (2012) Xuezhe Ma and Hai Zhao. Probabilistic models for high-order projective dependency parsing. Technical Report, arXiv:1502.04174, 2012.
 Marcus et al. (1993) Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
 McDonald et al. (2005) Ryan McDonald, Koby Crammer, and Fernando Pereira. Online large-margin training of dependency parsers. In Proceedings of ACL, pp. 91–98, Ann Arbor, Michigan, June 25–30, 2005.
 Nachum et al. (2016) Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring underappreciated rewards. arXiv preprint arXiv:1611.09321, 2016.
 Norouzi et al. (2016) Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Proceedings of NIPS, pp. 1723–1731, Barcelona, Spain, 2016.
 Paskin (2001) Mark A. Paskin. Cubic-time parsing and learning algorithms for grammatical bigram models. Citeseer, 2001.
 Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In Proceedings of ICLR, San Juan, Puerto Rico, 2016.
 Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In Proceedings of ACL, pp. 1683–1692, Berlin, Germany, August 2016.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR, San Diego, California, 2015.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pp. 3104–3112, Montreal, Canada, 2014.
 Taskar et al. (2004) Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:25, 2004.
 Tjong Kim et al. (2003) Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003, Volume 4, pp. 142–147, Edmonton, Canada, 2003.
 Volkovs et al. (2011) Maksims N. Volkovs, Hugo Larochelle, and Richard S. Zemel. Loss-sensitive training of probabilistic conditional random fields. arXiv preprint arXiv:1107.1805, 2011.
 Wallach (2004) Hanna M Wallach. Conditional random fields: An introduction. 2004.
 Wasserman (2013) Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
 Wiseman & Rush (2016) Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of EMNLP, pp. 1296–1306, Austin, Texas, 2016.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Yang et al. (2016) Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Ruslan R Salakhutdinov. Review networks for caption generation. In Proceedings of NIPS, pp. 2361–2369, 2016.
Appendix: Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML
Appendix A Softmax Q-distribution Maximum Likelihood
A.1 Proof of Theorem 1
Proof.
Since the reward function is bounded , we have:
Then,
(1) 
Now we can bound the conditional distribution and :
(2) 
and,
(3) 
Thus, ,
To sum up, we have:
∎
A.2 Proof of Theorem 2
Lemma 3.
For every ,
where .
Lemma 4.
Proof.
Lemma 5.
Proof.
Lemma 6.
Proof.
Since for every , , we have
If ,
If ,
∎
Lemma 7.
where .
Proof.
Now, we can prove Theorem 2 with the above lemmas.