On the Weaknesses of Reinforcement Learning for Neural Machine Translation

by   Leshem Choshen, et al.
Hebrew University of Jerusalem

Reinforcement learning (RL) is frequently used to increase performance in text generation tasks, including machine translation (MT), notably through the use of Minimum Risk Training (MRT) and Generative Adversarial Networks (GAN). However, little is known about what and how these methods learn in the context of MT. We prove that one of the most common RL methods for MT does not optimize the expected reward, as well as show that other methods take an infeasibly long time to converge. In fact, our results suggest that RL practices in MT are likely to improve performance only where the pre-trained parameters are already close to yielding the correct translation. Our findings further suggest that observed gains may be due to effects unrelated to the training signal, but rather from changes in the shape of the distribution curve.


Towards Reinforcement Learning for Pivot-based Neural Machine Translation with Non-autoregressive Transformer

Pivot-based neural machine translation (NMT) is commonly used in low-res...

A Study of Reinforcement Learning for Neural Machine Translation

Recent studies have shown that reinforcement learning (RL) is an effecti...

Exploring Supervised and Unsupervised Rewards in Machine Translation

Reinforcement Learning (RL) is a powerful framework to address the discr...

Neural Machine Translation with Adequacy-Oriented Learning

Although Neural Machine Translation (NMT) models have advanced state-of-...

Beyond General Purpose Machine Translation: The Need for Context-specific Empirical Research to Design for Appropriate User Trust

Machine Translation (MT) has the potential to help people overcome langu...

Quality expectations of machine translation

Machine Translation (MT) is being deployed for a range of use-cases by m...

If MaxEnt RL is the Answer, What is the Question?

Experimentally, it has been observed that humans and animals often make ...

1 Introduction

Reinforcement learning (RL) is an appealing path for advancement in Machine Translation (MT), as it allows training systems to optimize non-differentiable score functions, common in MT evaluation, as well as its ability to tackle the “exposure bias” (Ranzato et al., 2015) in standard training, namely that the model is not exposed during training to incorrectly generated tokens, and is thus unlikely to recover from generating such tokens at test time. These motivations have led to much interest in RL for text generation in general and MT in particular. Various policy gradient methods have been used, notably Reinforce (Williams, 1992) and variants thereof (e.g., Ranzato et al., 2015; Edunov et al., 2018) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016). Another popular use of RL is for training GANs Yang et al. (2018); Tevet et al. (2018). See §2. Nevertheless, despite increasing interest and strong results, little is known about what accounts for these performance gains, and the training dynamics involved.

We present the following contributions. First, our theoretical analysis shows that commonly used approximation methods are theoretically ill-founded, and may converge to parameter values that do not minimize the risk, nor are local minima thereof (§2.3).

Second, using both naturalistic experiments and carefully constructed simulations, we show that performance gains observed in the literature likely stem not from making target tokens the most probable, but from unrelated effects, such as increasing the

peakiness of the output distribution (i.e., the probability mass of the most probable tokens). We do so by comparing a setting where the reward is informative, vs. one where it is constant. In §4 we discuss this peakiness effect (PkE).

Third, we show that promoting the target token to be the mode is likely to take a prohibitively long time. The only case we find, where improvements are likely, is where the target token is among the first 2-3 most probable tokens according to the pretrained model. These findings suggest that Reinforce5) and CMRT (§6) are likely to improve over the pre-trained model only under the best possible conditions, i.e., where the pre-trained model is “nearly” correct.

We conclude by discussing other RL practices in MT which should be avoided for practical and theoretical reasons, and briefly discuss alternative RL approaches that will allow RL to tackle a larger class of errors in pre-trained models (§7).

2 RL in Machine Translation

2.1 Setting and Notation

An MT system generates tokens from a vocabulary one token at a time. The probability of generating given preceding tokens is given by , where is the source sentence and are the model parameters. For each generated token , we denote with the score, or reward, for generating given , , and the reference sentence . For brevity, we omit parameters where they are fixed within context. For simplicity, we assume does not depend on following tokens .

We also assume there is exactly one valid target token, as in practice MT systems are trained against a single reference translation per sentence Schulz et al. (2018). In practice, either a token-level reward is approximated using Monte-Carlo methods (e.g., Yang et al., 2018), or a sentence-level (sparse) reward is given at the end of the episode (sentence). The latter is equivalent to a uniform token-level reward.

is often either the negative log-likelihood, or based on standard MT metrics, e.g., BLEU (Papineni et al., 2002). When applying RL in MT, we seek to maximize the expected reward (denoted with ); i.e., to find


2.2 Reinforce

For a given source sentence, and partially generated sentence , Reinforce (Williams, 1992) samples tokens (

is a hyperparameter)

from and updates according to this rule:


The right-hand side of Eq. (2

) is an unbiased estimator of the gradient of the objective function, i.e.,

. Therefore, Reinforce is performing a form of stochastic gradient ascent on , and has similar formal guarantees. From here follows that if is constant with respect to , then the expected prescribed by Reinforce is zero. We note that may be shifted by a constant term (called a “baseline”), without affecting the optimal value for .

Reinforce is used by a variety of works in MT, text generation, and image-to-text tasks (Liu et al., 2016; Wu et al., 2018; Rennie et al., 2017; Shetty et al., 2017; Hendricks et al., 2016) – in isolation, or as a part of training Ranzato et al. (2015). Lately, an especially prominent use for Reinforce is adversarial training with discrete data, where another network predicts the reward (GAN). For some recent work on RL for NMT, see (Zhang et al., 2016; Li et al., 2017; Wu et al., 2017; Yu et al., 2017; Yang et al., 2018).

2.3 Minimum Risk Training

The term Minimum Risk Training (MRT) is used ambiguously in MT to refer either to the application of Reinforce to minimizing the risk (equivalently, to maximizing the expected reward, the negative loss), or more commonly to a somewhat different estimation method, which we term Contrastive MRT (CMRT) and turn now to analyzing. CMRT was proposed by Och (2003), adapted to NMT by Shen et al. (2016), and often used since (Ayana et al., 2016; Neubig, 2016; Shen et al., 2017; Edunov et al., 2018; Makarov and Clematide, 2018; Neubig et al., 2018).

The method works as follows: at each iteration, sample tokens from , and update according to the gradient of


Commonly (but not universally), deduplication is performed, so sums over a set of unique values Sennrich et al. (2017). This changes little in our empirical results and theoretical analysis.

Despite the resemblance in definitions of (Eq. (1)) and (indeed, is sometimes presented as an approximation of ), they differ in two important aspects. First, ’s support is , so increasing for some necessarily comes at the expense of for some . In contrast, increasing , as in Reinforce, may come at the expense of for any . Second, is a smoothness parameter: the closer is to 0, the closer is to be uniform.

We show in Appendix A that despite its name, CMRT does not optimize , nor does it optimize . That is, it may well converge to values that are not local maxima of , making it theoretically ill-founded.111Sakaguchi et al. (2017) discuss the relation between CMRT and Reinforce, claiming that CMRT is a variant of Reinforce. Appendix A shows that CMRT does not in fact optimize the same objective. However, given that CMRT is often used in practice, the strong results it yielded and the absence of theory for explaining it, we discuss it here. Given a sample , the gradient of is given by


where . See Appendix B.

Comparing Equations (2) and (3), the differences between Reinforce and CMRT are reflected again. First, has an additional term, proportional to , which yields the contrastive effect. This contrast may improve the rate of convergence since it counters the decrease of probability mass for non-sampled tokens.

Second, for a given , the relative weighting of the gradients is proportional to , or equivalently to . CMRT with deduplication sums over distinct values in (Eq. (3)), while Reinforce sums over all values. This means that the relative weight of the unique value is in Reinforce. For the expected value of these relative weights is the same, and so for (as is commonly used), more weight is given to improbable tokens, which could also have a positive effect on the convergence rate.222Not performing deduplication results in assigning higher relative weight to high-probability tokens, which may have an adverse effect on convergence rate. For an implementation without deduplication, see THUMT Zhang et al. (2017). However, if is too close to 0, vanishes. This tradeoff explains the importance of tuning reported in the literature. In §6 we present simulations with CMRT, showing very similar trends as presented by Reinforce.

3 Motivating Discussion

Implementing a stochastic gradient ascent, Reinforce is guaranteed to converge to a stationary point of under broad conditions. However, not much is known about its convergence rate under the prevailing conditions in NMT.

We begin with a qualitative, motivating analysis of these questions. As work on language generation empirically showed, RNNs quickly learn to output very peaky distributions Press et al. (2017). This tendency is advantageous for generating fluent sentences with high probability, but may also entail slower convergence rates when using RL to fine-tune the model, because RL methods used in text generation sample from the (pretrained) policy distribution, which means they mostly sample what the pretrained model deems to be likely. Since the pretrained model (or policy) is peaky, exploration of other potentially more rewarding tokens will be limited, hampering convergence.

Intuitively, Reinforce increases the probabilities of successful (positively rewarding) observations, weighing updates by how rewarding they were. When sampling a handful of tokens in each context (source sentence and generated prefix

), and where the number of epochs is not large, it is unlikely that more than a few unique tokens will be sampled from

. (In practice, is typically between 1 and 20, and the number of epochs between 1 and 100.) It is thus unlikely that anything but the initially most probable candidates will be observed. Consequently, Reinforce initially raises their probabilities, even if more rewarding tokens can be found down the list.

We thus hypothesize the peakiness of the distribution, i.e., the probability mass allocated to the most probable tokens, will increase, at least in the first phase. We call this the peakiness-effect (PkE), and show it occurs both in simulations (§4.1) and in full-scale NMT experiments (§4.2).

With more iterations, the most-rewarding tokens will be eventually sampled, and gradually gain probability mass. This discussion suggests that training will be extremely sample-inefficient. We assess the rate of convergence empirically in §5, finding this to be indeed the case.

4 The Peakiness Effect

We turn to demonstrate that the initially most probable tokens will initially gain probability mass, even if they are not the most rewarding, yielding a PkE.

Caccia et al. (2018) recently observed in the context of language modeling using GANs that performance gains similar to those GAN yield can be achieved by decreasing the temperature for the prediction softmax (i.e., making it peakier). However, they proposed no account as to what causes this effect. Our findings propose an underlying mechanism leading to this trend. We return to this point in §7. Furthermore, given their findings, it is reasonable to assume that our results are relevant for RL use in other generation tasks, whose output space too is discrete, high-dimensional and concentrated.

4.1 Controlled Simulations

We experiment with a 1-layer softmax model, that predicts a token with probability . are the model’s parameters.

This model simulates the top of any MT decoder that ends with a softmax layer, as essentially all NMT decoders do. To make experiments realistic, we use similar parameters as those reported in the influential Transformer NMT system

(Vaswani et al., 2017). Specifically, the size of (distinct BPE tokens) is 30715, and the initial values for

were sampled from 1000 sets of logits taken from decoding the standard newstest2013 development set, using a pretrained Transformer model. The model was pretrained on WMT2015 training data

Bojar et al. (2015). Hyperparameters are reported in Appendix C. We define one of the tokens in to be the target token and denote it with . We experiment with two reward functions:

  1. Simulated Reward: for , if is one of the 10 initially highest scoring tokens, and otherwise. This simulates a condition where the pretrained model is of decent but sub-optimal quality. here is at the scale of popular rewards used in MT, such as GAN-based rewards or BLEU (which are between 0 and 1).

  2. Constant Reward: is constantly equal to 1, for all tokens. This setting is aimed to confirm that PkE is not a result of the signal carried by the reward.

Experiments with the first setting were run 100 times, each time for 50K steps, updating after each step. With the second setting, it is sufficient to take a single step at a time, as the expected update after each step is zero, and so any PkE seen in a single step is only accentuated in the next. It is, therefore, more telling to run more repetitions rather than more steps per initialization. We, therefore, sample 10000 pretrained distributions, and perform a single Reinforce step.

As RL training in NMT lasts about 30 epochs before stopping, samples about 100K tokens per epoch, and as the network already predicts in about two thirds of the contexts,333Based on our NMT experiments, which we assume to be representative of the error rate of other NMT systems. we estimate the number of steps used in practice to be in the order of magnitude of 1M. For visual clarity, we present figures for 50K-100K steps. However, full experiments (with 1M steps) exhibit similar trends: where Reinforce was not close to converging after 50K steps, the same was true after 1M steps.

We evaluate the peakiness of a distribution in terms of the probability of the most probable token (the mode), the total probability of the ten most probable tokens, and the entropy of the distribution (lower entropy indicates more peakiness).

(a) Top 10
(b) Mode
Figure 1: A histogram of the update size (x-axis) to the total probability of the 10 most probable tokens (left) or the most probable token (right) in the Constant Reward setting. An update is overwhelmingly more probable to increase this probability than to decrease it.
Figure 2: Token probabilities through Reinforce training, in the controlled simulations in the Simulated Reward setting. The left/center/right figures correspond to simulations where the target token () was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to no-reward tokens.


The distributions become peakier in terms of all three measures: on average, the mode’s probability and the 10 most probable tokens increases, and the entropy decreases. Figure 0(a) presents the histogram of the difference in the probability of the 10 most probable tokens in the Constant Reward setting, after a single step. Figure 0(b) depicts similar statistics for the mode. The average entropy in the pretrained model is 2.9 is reduced to 2.85 after one Reinforce step.

Simulated Reward setting shows similar trends. For example, entropy decreases from 3 to about 0.001 in 100K steps. This extreme decrease suggests it is effectively a deterministic policy. PkE is achieved in a few hundred steps, usually before other effects become prominent (see Figure 2), and is stronger than for Constant Reward.

4.2 NMT Experiments

We turn to analyzing a real-world application of Reinforce to NMT. Important differences between this and the previous simulations are: (1) it is rare in NMT for Reinforce to sample from the same conditional distribution more than a handful of times, given the number of source sentences and sentence prefixes (contexts); and (2) in NMT shares parameters between contexts, which means that updating for one context may influence for another.

We follow the same pretraining as in §4.1. We then follow Yang et al. (2018) in defining the reward function based on the expected BLEU score. Expected BLEU is computed by sampling suffixes for the sentence, and averaging the BLEU score of the sampled sentences against the reference.

We use early stopping with a patience of 10 epochs, where each epoch consists of 5000 sentences sampled from the WMT2015 Bojar et al. (2015) German-English training data. We use . We retuned the learning-rate, and positive baseline settings against the development set. Other hyper-parameters were an exact replication of the experiments reported in (Yang et al., 2018).


Results indicate an increase in the peakiness of the conditional distributions. Our results are based on a sample of 1000 contexts from the pretrained model, and another (independent) sample from the reinforced model.

Indeed, the modes of the conditional distributions tend to increase. Figure 3 presents the distribution of the modes’ probability in the reinforced conditional distributions compared with the pretrained model, showing a shift of probability mass towards higher probabilities for the mode, following RL. Another indication of the increased peakiness is the decrease in the average entropy of , which was reduced from 3.45 in the pretrained model to an average of 2.82 following RL. This more modest reduction in entropy (compared to §4.1) might also suggest that the procedure did not converge to the optimal value for , as then we would have expected the entropy to substantially drop if not to 0 (overfit), then to the average entropy of valid next tokens (given the source and a prefix of the sentence).

Figure 3: The cumulative distribution of the probability of the most likely token in the NMT experiments. The green distribution corresponds to the pretrained model, and the blue corresponds to the reinforced model. The y-axis is the proportion of conditional probabilities with a mode of value (the x-axis).

5 Performance following Reinforce

We now turn to assessing under what conditions it is likely that Reinforce will lead to an improvement in the performance of an NMT system. As in the previous section, we use both controlled simulations and NMT experiments.

5.1 Controlled Simulations

We use the same model and experimental setup described in Section 4.1, this time only exploring the Simulated Reward setting, as a Constant Reward is not expected to converge to any meaningful . Results are averaged over 100 conditional distributions sampled from the pretrained model.

Caution should be exercised when determining the learning rate (LR). Common LRs used in the NMT literature are of the scale of . However, in our simulations, no LR smaller than 0.1 yielded any improvement in . We thus set the LR to be 0.1. We note that in our simulations, a higher learning rate means faster convergence as our reward is noise-free: it is always highest for the best option. In practice, increasing the learning rate may deteriorate results, as it may cause the system to overfit to the sampled instances. Indeed, when increasing the learning rate in our NMT experiments (see below) by an order of magnitude, early stopping caused the RL procedure to stop without any parameter updates.

Figure 2 shows how changes over the first 50K steps of Reinforce (probabilities are averaged over 100 repetitions), for a case where was initially the second, third and fourth most probable. Although these are the easiest settings for Reinforce, and despite the high learning rate, it fails to make the mode of the distribution within 100K steps, unless was initially the second most probable. In cases where is initially of a lower rank than four, it is hard to see any increase in its probability, even after 1M steps.

Figure 4: Cumulative percentage of contexts where the pretrained model ranks in rank or below and where it does not rank first (). In about half the cases it is ranked fourth or below.

5.2 NMT Experiments

We trained an NMT system, using the same procedure as in Section 4.2, and report BLEU scores over the news2014 test set. After training with an expected BLEU reward, we indeed see a minor improvement which is consistent between trials and pretrained models. While the pretrain BLEU score is 30.31, the reinforced one is 30.73.

Analyzing what words were influenced by the RL procedure, we begin by computing the cumulative probability of the target token to be ranked lower than a given rank according to the pretrained model. Results (Figure 4) show that in about half of the cases, is not among the top three choices of the pretrained model, and we thus expect it not to gain substantial probability following Reinforce, according to our simulations.

We next turn to compare the ranks the reinforced model assigns to the target tokens, and their ranks according to the pretrained model. Figure 5 presents the difference in the probability that is ranked at a given rank following RL and the probability it is ranked there initially. Results indicate that indeed more target tokens are ranked first, and less second, but little consistent shift of probability mass occurs otherwise across the ten first ranks. It is possible that RL has managed to push in some cases between very low ranks (<1000) to medium-low ranks (between 10 and 1000). However, token probabilities in these ranks are so low that it is unlikely to affect the system outputs in any way. This fits well with the results of our simulations that predicted that only the initially top-ranked tokens are likely to change.

In an attempt to explain the improved BLEU score following RL with PkE, we repeat the NMT experiment this time using a constant reward of 1. Our results present a nearly identical improvement in BLEU, achieving 30.72, and a similar pattern in the change of the target tokens’ ranks (see Appendix 8). Therefore, there is room to suspect that even in cases where RL yields an improvement in BLEU, it may partially result from reward-independent factors, such as PkE.444We tried several other reward functions as well, all of which got BLEU scores of 30.73–30.84. This improvement is very stable across metrics, trials and pretrained models.

Figure 5: Difference between the ranks of in the reinforced and the pretrained model. Each column corresponds to the difference in the probability that is ranked in rank in the reinforced model, and the same probability in the pretrained model.

6 Experiments with Contrastive MRT

Figure 6: The probability of different tokens following CMRT, in the controlled simulations in the Simulated Reward setting. The left/right figures correspond to simulations where the target token () was initially the second/third most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with .

In §2.3 we showed that CMRT does not, in fact, maximize , and so does not enjoy the same theoretical guarantees as Reinforce and similar policy gradient methods. However, it is still the RL procedure of choice in much recent work on NMT. We therefore repeat the simulations described in §4 and §5, assessing the performance of MRT in these conditions. We experiment with and , common settings in the literature, and average over 100 trials.

Figure 6 shows how the distribution changes over the course of 50K update steps to , where is taken to be the second and third initially most probable token (Simulated Reward setting). Results are similar in trends to those obtained with Reinforce: MRT succeeds in pushing to be the highest ranked token if it was initially second, but struggles where it was initially ranked third or below. We only observe a small PkE in MRT. This is probably due to the contrastive effect, which means that tokens that were not sampled do not lose probability mass.

All graphs we present here allow sampling the same token more than once in each batch (i.e., is a sample with replacements). Simulations with deduplication show similar results.

7 Discussion

In this paper, we showed that the type of distributions used in NMT entail that promoting the target token to be the mode is likely to take a prohibitively long times for existing RL practices, except under the best conditions (where the pretrained model is “nearly” correct). This leads us to conclude that observed improvements from using RL for NMT are likely due either to fine-tuning the most probable tokens in the pretrained model (an effect which may be more easily achieved using reranking methods, and uses but little of the power of RL methods), or to effects unrelated to the signal carried by the reward, such as PkE. Another contribution of this paper is in showing that CMRT does not optimize the expected reward and is thus theoretically unmotivated.

A number of reasons lead us to believe that in our NMT experiments, improvements are not due to the reward function, but to artefacts such as PkE. First, reducing a constant baseline from , so as to make the expected reward zero, disallows learning. This is surprising, as Reinforce, generally and in our simulations, converges faster where the reward is centered around zero, and so the fact that this procedure here disallows learning hints that other factors are in play. As PkE can be observed even where the reward is constant (if the expected reward is positive; see §4.1), this suggests PkE may play a role here. Second, we observe more peakiness in the reinforced model and in such cases, we expect improvements in BLEU Caccia et al. (2018). Third, we achieve similar results with a constant reward in our NMT experiments (§5.2). Fourth, our controlled simulations show that asymptotic convergence is not reached in any but the easiest conditions (§5.1).

Our analysis further suggests that gradient clipping, sometimes used in NMT

(Zhang et al., 2016), is expected to hinder convergence further. It should be avoided when using Reinforce as it violates Reinforce’s assumptions.

The per-token sampling as done in our experiments is more exploratory than beam search Wu et al. (2018), reducing PkE. Furthermore, the latter does not sample from the behavior policy, but does not properly account for being off-policy in the parameter updates.

Adding the reference to the sample , which some implementations allow (Sennrich et al., 2017) may help reduce the problems of never sampling the target tokens. However, as Edunov et al. (2018) point out, this practice may lower results, as it may destabilize training by leading the model to improve over outputs it cannot generalize over, as they are very different from anything the model assigns a high probability to, at the cost of other outputs.

8 Conclusion

The standard MT scenario poses several uncommon challenges for RL. First, the action space in MT problems is a high-dimensional discrete space (generally in the size of the vocabulary of the target language or the product thereof for sentences). This contrasts with the more common scenario studied by contemporary RL methods, which focuses mostly on much smaller discrete action spaces (e.g., video games (Mnih et al., 2015, 2016)), or continuous action spaces of relatively low dimensions (e.g., simulation of robotic control tasks (Lillicrap et al., 2015)). Second, reward for MT is naturally very sparse – almost all possible sentences are “wrong” (hence, not rewarding) in a given context. Finally, it is common in MT to use RL for tuning a pretrained model. Using a pretrained model ameliorates the last problem. But then, these pretrained models are in general quite peaky, and because training is done on-policy – that is, actions are being sampled from the same model being optimized – exploration is inherently limited.

Here we argued that, taken together, these challenges result in significant weaknesses for current RL practices for NMT, that may ultimately prevent them from being truly useful. At least some of these challenges have been widely studied in the RL literature, with numerous techniques developed to address them, but were not yet adopted in NLP. We turn to discuss some of them.

Off-policy methods, in which observations are sampled from a different policy than the one being currently optimized, are prominent in RL (Watkins and Dayan, 1992; Sutton and Barto, 1998), and were also studied in the context of policy gradient methods (Degris et al., 2012; Silver et al., 2014). In principle, such methods allow learning from a more “exploratory” policy. Moreover, a key motivation for using in CMRT is smoothing; off-policy sampling allows smoothing while keeping convergence guarantees.

In its basic form, exploration in Reinforce relies on stochasticity in the action-selection (in MT, this is due to sampling). More sophisticated exploration methods have been extensively studied, for example using measures for the exploratory usefulness of states or actions (Fox et al., 2018), or relying on parameter-space noise rather than action-space noise (Plappert et al., 2017).

For MT, an additional challenge is that even effective exploration (sampling diverse sets of observations), may not be enough, since the state-action space is too large to be effectively covered, with almost all sentences being not rewarding. Recently, diversity-based and multi-goal methods for RL were proposed to tackle similar challenges (Andrychowicz et al., 2017; Ghosh et al., 2018; Eysenbach et al., 2019). We believe the adoption of such methods is a promising path forward for the application of RL in NLP.


Appendix A Contrastive MRT does not Maximize the Expected Reward

We hereby detail a simple example where following the Contrastive MRT method (see §2.3) does not converge to the parameter value that maximizes .

Let be a real number in , and let be a family of distributions over three values such that:

Let . The expected reward as a function of is:

is uniquely maximized by = 0.25.

Table 1 details the possible samples of size , their probabilities, the corresponding and its gradient. Standard numerical methods show that over possible samples is positive for and negative for , where . This means that for any initialization of , Contrastive MRT will converge to if the learning rate is sufficiently small. For , , and there will be no gradient updates, so the method will converge to . Neither of these values maximizes .

We further note that resorting to maximizing instead, does not maximize either. Indeed, plotting as a function of for this example, yields a maximum at .

-- 0.5 +
1 0
0 0
-- 0.5 0
Table 1: The gradients of for each possible sample . The batch size is . Rows correspond to different sampled outcomes. is the gradient of given the corresponding value for .

Appendix B Deriving the Gradient of

Given , recall the definition of :

Taking the deriviative w.r.t. :

Figure 7: The probability of different tokens following Reinforce, in the controlled simulations in the Constant Reward setting. The left/center/right figures correspond to simulations where the target token () was initially the second/third/fourth most probable token. The green line corresponds to the target token, yellow lines to medium-reward tokens and red lines to tokens with .

Appendix C NMT Implementation Details

True casing and tokenization were used Koehn et al. (2007), including escaping html symbols and "-" that represents a compound was changed into a separate token of =. Some preprocessing used before us converted the latter to ##AT##-##AT## but standard tokenizers in use process that into 11 different tokens, which over-represents the significance of that character when BLEU is calculated. BPE Sennrich et al. (2016) extracted 30715 tokens. For the MT experiments we used 6 layers in the encoder and the decoder. The size of the embeddings was 512. Gradient clipping was used with size of 5 for pre-training (see Discussion on why not to use it in training). We did not use attention dropout, but 0.1 residual dropout rate was used. In pretraining and training sentences of more than 50 tokens were discarded. Pretraining and training were considered finished when BLEU did not increase in the development set for 10 consecutive evaluations, and evaluation was done every 1000 and 5000 for batches of size 100 and 256 for pretraining and training respectively. Learning rate used for rmsprop Tieleman and Hinton (2012) was 0.01 in pretraining and for adam Kingma and Ba (2015) with decay was 0.005 for training. 4000 learning rate warm up steps were used. Pretraining took about 7 days with 4 GPUs, afterwards, training took roughly the same time. Monte Carlo used 20 sentence rolls per word.

Appendix D Detailed Results for Constant Reward Setting

We present graphs for the constant reward setting in Figures 8 and 7. Trends are similar to the ones obtained for the Simulated Reward setting.

Figure 8: Difference between the ranks of in the reinforced with constant reward and the pretrained model. Each column corresponds to the difference in the probability that is ranked in rank in the reinforced model, and the same probability in the pretrained model.