1 Introduction
Machine translation has recently become a commodity service that is offered for free for online translation, e.g., by Google or Microsoft, or is integrated into e-commerce platforms (eBay) or social media (Facebook). Such commercial settings facilitate the collection of user feedback on the quality of machine translation output, either in the form of an explicit user rating or as an indirect signal that can be inferred from the interaction of the user with the translated content. While user feedback in the form of user clicks on displayed ads has been shown to be a valuable signal in response prediction for online advertising (Chapelle and Li, 2011; Bottou et al., 2013), the gold mine of free user feedback has not yet been exploited in the area of machine translation.
Recent research has proposed bandit structured prediction (Sokolov et al., 2016; Kreutzer et al., 2017; Nguyen et al., 2017) for online learning of machine translation from weak user feedback to predicted translations, instead of from costly manually created reference translations. The scenario investigated in these works is still far removed from real-world applications in commercial machine translation: besides the fact that previous research has been confined to simulated feedback, online bandit learning is unrealistic in commercial settings due to the additional latency and the desire for offline testing of system updates before deployment.
A natural solution would be to exploit counterfactual learning that reuses existing interaction data where the predictions have been made by a historic system different from the target system. However, both online learning and offline learning from logged data are plagued by the problem that exploration is prohibitive in commercial systems, since it means showing inferior translations to users. This effectively results in deterministic logging policies that lack explicit exploration, making an application of off-policy methods theoretically questionable.
Lawrence et al. (2017) recently showed that bandit learning of machine translation is possible even under deterministic logging. They proposed an application of techniques such as doubly-robust policy evaluation and learning (Dudik et al., 2011) or weighted importance sampling (Jiang and Li, 2016; Thomas and Brunskill, 2016) to offline learning from deterministically logged data, and presented evidence from simulation experiments that confirmed their conjecture that these techniques effectively serve to smooth out deterministic components. The purpose of our paper is to give a formal account of the possible degeneracies of the standard inverse propensity scoring technique (Rosenbaum and Rubin, 1983) and its reweighted variant (Kong, 1992) under stochastic and deterministic logging, with the goal of a clearer understanding of the effectiveness of the techniques proposed in Lawrence et al. (2017).
2 Counterfactual Learning for Machine Translation
In the following, we give a short overview of the methods developed in Lawrence et al. (2017). They formalize the problem of counterfactual learning for bandit structured prediction as follows: $\mathcal{X}$ denotes a structured input space, $\mathcal{Y}(x)$ denotes the set of possible structured outputs for input $x$, and $\delta: \mathcal{Y} \rightarrow [0,1]$ denotes a reward function quantifying the quality of structured outputs. A data log is denoted as a set of tuples $D = \{(x_t, y_t, \delta_t)\}_{t=1}^{n}$, where for inputs $x_t$, a logging policy $\mu$ produced an output $y_t$, which is logged with a corresponding reward $\delta_t$. In the case of stochastic logging, a propensity score $\mu(y_t|x_t)$ is logged in addition. Using the inverse propensity scoring approach (IPS), importance sampling achieves an unbiased estimate of the expected reward under the parametric target policy $\pi_w$:
$$\hat{V}_{\text{IPS}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \frac{\pi_w(y_t|x_t)}{\mu(y_t|x_t)}. \tag{1}$$
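In the case of deterministic logging, outputs are logged with propensity $\mu(y_t|x_t) = 1$, $t = 1, \ldots, n$, of the historical system. This results in empirical risk minimization, or empirical reward maximization, without correction of the sampling bias of the logging policy:
$$\hat{V}_{\text{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t|x_t). \tag{2}$$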
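Lawrence et al. (2017) call Equation (2) the deterministic propensity matching (DPM) objective, and propose a first modification by the use of weighted importance sampling (Precup et al., 2000; Jiang and Li, 2016; Thomas and Brunskill, 2016). The new objective is the reweighted deterministic propensity matching (DPM+R) objective:
$$\hat{V}_{\text{DPM+R}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \bar{\pi}_w(y_t|x_t) = \frac{\frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t|x_t)}{\frac{1}{n} \sum_{t=1}^{n} \pi_w(y_t|x_t)}, \tag{3}$$
with $\bar{\pi}_w(y_t|x_t) = \frac{\pi_w(y_t|x_t)}{\frac{1}{n} \sum_{t'=1}^{n} \pi_w(y_{t'}|x_{t'})}$. Setting $\bar{\pi}_w(y_t|x_t) = \frac{\pi_w(y_t|x_t) / \mu(y_t|x_t)}{\frac{1}{n} \sum_{t'=1}^{n} \pi_w(y_{t'}|x_{t'}) / \mu(y_{t'}|x_{t'})}$ recovers IPS with reweighting, $\hat{V}_{\text{IPS+R}}$ (Swaminathan and Joachims, 2015).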
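The four objectives introduced so far can be compared on a small example. The following is a minimal sketch, assuming the logged rewards, target-policy probabilities, and propensities are given as plain lists; the function names and the toy numbers are hypothetical and are not taken from Lawrence et al. (2017).

```python
from typing import Sequence


def ips(deltas: Sequence[float], pis: Sequence[float], mus: Sequence[float]) -> float:
    """IPS: average of delta_t * pi_t / mu_t over the log."""
    n = len(deltas)
    return sum(d * p / m for d, p, m in zip(deltas, pis, mus)) / n


def dpm(deltas: Sequence[float], pis: Sequence[float]) -> float:
    """DPM: IPS with every propensity fixed to 1 (deterministic logging)."""
    return ips(deltas, pis, [1.0] * len(deltas))


def ips_r(deltas: Sequence[float], pis: Sequence[float], mus: Sequence[float]) -> float:
    """IPS+R: importance weights are normalized by their sum over the log."""
    weights = [p / m for p, m in zip(pis, mus)]
    return sum(d * w for d, w in zip(deltas, weights)) / sum(weights)


def dpm_r(deltas: Sequence[float], pis: Sequence[float]) -> float:
    """DPM+R: reweighted objective with all propensities equal to 1."""
    return ips_r(deltas, pis, [1.0] * len(deltas))


# Hypothetical toy log with three tuples (one logged output per input).
deltas = [0.2, 0.8, 0.5]   # logged rewards delta_t
pis = [0.1, 0.6, 0.3]      # target policy probabilities pi_w(y_t | x_t)
mus = [0.5, 0.5, 0.25]     # logged propensities mu(y_t | x_t)

v_ips = ips(deltas, pis, mus)
v_dpm = dpm(deltas, pis)
v_ips_r = ips_r(deltas, pis, mus)
v_dpm_r = dpm_r(deltas, pis)
print(v_ips, v_dpm, v_ips_r, v_dpm_r)
```

Note that the reweighted variants are weighted averages of the logged rewards and therefore always lie between the smallest and the largest logged reward, whereas the unweighted variants have no such bound.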
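Lawrence et al. (2017) present further modifications of Equation (3) by the incorporation of a direct reward estimation method into IPS as proposed in the doubly-robust (DR) estimator (Dudik et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016). Let $\hat{\delta}(x, y)$ be a regression-based reward model trained on the logged data, and let $\hat{c}$ be a scalar that allows the estimator to be optimized for minimal variance (Ross, 2013). They define a doubly controlled empirical risk minimization objective $\hat{V}_{\hat{c}\text{DC}}$ as follows:
$$\hat{V}_{\hat{c}\text{DC}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \Big[ (\delta_t - \hat{c}\,\hat{\delta}_t)\, \bar{\pi}_w(y_t|x_t) + \hat{c} \sum_{y \in \mathcal{Y}(x_t)} \hat{\delta}(x_t, y)\, \pi_w(y|x_t) \Big], \tag{4}$$
with $\hat{\delta}_t = \hat{\delta}(x_t, y_t)$. Setting $\hat{c} = 1$ yields an objective called $\hat{V}_{\text{DC}}$. Setting additionally $\bar{\pi}_w(y_t|x_t) = \pi_w(y_t|x_t) / \mu(y_t|x_t)$ recovers the standard stochastic doubly-robust estimator $\hat{V}_{\text{DR}}$. The optimal scalar parameter $\hat{c}$ can be derived easily by setting the derivative of the variance term with respect to $\hat{c}$ to zero.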
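The variance argument can be sketched with the standard control-variate calculation described in Ross (2013). The symbols $Y$ (standing for the reweighted logged-reward term) and $Z$ (standing for the reweighted estimated-reward term) are labels introduced here for illustration and are not taken from the source; the sketch assumes that the direct-estimation sum acts as the known mean of $Z$ and therefore does not affect the minimizer.

```latex
\begin{align*}
\mathrm{Var}\big(Y - c\,(Z - \mathbb{E}[Z])\big)
  &= \mathrm{Var}(Y) - 2c\,\mathrm{Cov}(Y, Z) + c^{2}\,\mathrm{Var}(Z), \\
\frac{\partial}{\partial c}\,\mathrm{Var}\big(Y - c\,(Z - \mathbb{E}[Z])\big)
  &= -2\,\mathrm{Cov}(Y, Z) + 2c\,\mathrm{Var}(Z) = 0
  \;\Longrightarrow\;
  c^{*} = \frac{\mathrm{Cov}(Y, Z)}{\mathrm{Var}(Z)}.
\end{align*}
```

Under these assumptions, the covariance and variance would have to be estimated from the logged data, and the variance reduction is largest when the reward model correlates strongly with the logged rewards.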
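The learning algorithms in Lawrence et al. (2017) are defined by applying a stochastic gradient ascent update rule to the objective functions defined above. The gradients are shown in Table 1. In the experiments reported in Lawrence et al. (2017), the policy distribution is assumed to be a Gibbs model, i.e., $\pi_w(y|x) \propto \exp(\alpha\, w^{\top} \phi(x, y))$, based on a feature representation $\phi: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^{d}$, a weight vector $w \in \mathbb{R}^{d}$, and a smoothing parameter $\alpha \in \mathbb{R}^{+}$, yielding the following simple derivative:
$$\nabla \log \pi_w(y|x) = \alpha \Big( \phi(x, y) - \sum_{y' \in \mathcal{Y}(x)} \phi(x, y')\, \pi_w(y'|x) \Big).$$
3 Degenerate Behaviour in Counterfactual Learning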
Both the IPS and the DPM estimators can exhibit a degenerate behavior in that they can be maximized by simply setting all logged outputs to probability $1$, i.e., if $\pi_w(y_t|x_t) = 1$ for $t = 1, \ldots, n$. This is the case irrespective of whether data are logged stochastically (IPS) or deterministically (DPM). Obviously, this is undesired as the probability for low reward outputs should not be raised. For abbreviation, we set $\pi_t = \pi_w(y_t|x_t)$ and $\mu_t = \mu(y_t|x_t)$.
Theorem 1.
$\hat{V}_{\text{IPS}}$ and $\hat{V}_{\text{DPM}}$ attain their maximum when $\pi_t = 1$ for all $t = 1, \ldots, n$.
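Proof. We start by showing that the value of $\hat{V}_{\text{IPS}}$ where $\pi_t = 1$ for all $t$ is greater than the value of $\hat{V}_{\text{IPS}}$ where $\pi_t < 1$ for some $t$. W.l.o.g. assume that $(x_n, y_n, \delta_n)$ is the tuple with $\pi_n < 1$. Then
$$\begin{aligned}
\frac{1}{n} \sum_{t=1}^{n} \frac{\delta_t}{\mu_t} &> \frac{1}{n} \Big( \sum_{t=1}^{n-1} \frac{\delta_t}{\mu_t} + \frac{\delta_n \pi_n}{\mu_n} \Big) \\
\Leftrightarrow \quad \frac{\delta_n}{\mu_n} &> \frac{\delta_n \pi_n}{\mu_n} \\
\Leftrightarrow \quad 1 &> \pi_n,
\end{aligned} \tag{5}$$
where the last step divides by $\delta_n / \mu_n$ and thus requires $\delta_n > 0$, and the last line is true by assumption $\pi_n < 1$. Because DPM is a special case of IPS with $\mu_t = 1$ for $t = 1, \ldots, n$, the proof also holds for DPM.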
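The degenerate behavior described by Theorem 1 can be checked numerically. The following is a minimal sketch on a hypothetical three-tuple log (all numbers and names are illustrative assumptions): raising the probability of the lowest-reward logged output already increases the IPS value, and no random assignment of probabilities in $[0, 1)$ beats the assignment that sets every logged output to probability 1.

```python
import random


def ips(deltas, pis, mus):
    """IPS: average of delta_t * pi_t / mu_t over the log."""
    n = len(deltas)
    return sum(d * p / m for d, p, m in zip(deltas, pis, mus)) / n


# Hypothetical log: the first tuple has a low reward, the second a high one.
deltas = [0.1, 0.9, 0.4]
mus = [0.5, 0.25, 0.25]

current = [0.3, 0.6, 0.1]      # some non-degenerate target policy
raised_low = [1.0, 0.6, 0.1]   # only the low-reward output is raised to 1
all_ones = [1.0, 1.0, 1.0]     # every logged output gets probability 1

v_current = ips(deltas, current, mus)
v_raised_low = ips(deltas, raised_low, mus)
v_all_ones = ips(deltas, all_ones, mus)

# Random search over probabilities in [0, 1) never exceeds the all-ones value.
rng = random.Random(0)
best_random = max(
    ips(deltas, [rng.random() for _ in deltas], mus) for _ in range(1000)
)

print(v_current, v_raised_low, v_all_ones, best_random)
```

Replacing `mus` by a list of ones gives the same picture for DPM.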
The degenerate behavior of IPS and DPM described in Theorem 1 can be fixed by using reweighting, which results in defining a probability distribution over the log $D$. Under reweighting, increasing the probability of a low reward output takes away probability mass from the higher reward outputs. This decreases the value of the estimator, and will thus be avoided in learning.
However, IPS+R and DPM+R can still behave in a degenerate manner, as we will show in the following. We define the set $D_{\max}$ that contains all tuples that receive the highest reward observed in the log, and we assume that the log is non-empty, so that $D_{\max}$ has a cardinality of at least one.
Definition 1.
Let $\delta_{\max} = \max_{t} \delta_t$, then $D_{\max} = \{(x_t, y_t, \delta_t) \in D \mid \delta_t = \delta_{\max}\}$.
We will show that the estimators can be maximized by simply setting the probability of at least one tuple in $D_{\max}$ to a value higher than $0$, while leaving all other tuples in $D_{\max}$ at their probabilities $\pi_t$, and setting the probability of tuples in the set $D \setminus D_{\max}$ to $0$. Clearly, this is undesired as outputs with a reward close to $\delta_{\max}$ should not receive a probability of $0$. Furthermore, this learning goal is easy to achieve since a degenerate estimator only needs to be concerned about lowering the probability of tuples in $D \setminus D_{\max}$ as long as there is one tuple of $D_{\max}$ with a probability above 0. We want to prove the following theorem:
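Theorem 2.
$\hat{V}_{\text{IPS+R}}$ and $\hat{V}_{\text{DPM+R}}$ attain their maximum when $\pi_t = 0$ for all tuples in $D \setminus D_{\max}$ and $\pi_t > 0$ for at least one tuple in $D_{\max}$.
We introduce a definition of data indices belonging to the set $D_{\max}$ and its complement in $D$: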
Definition 2.
Let
where w.l.o.g. indices refer to tuples in and indices refer to indices in . Thus, and .
Proof. We need to show that the value of where for is lower than the value of where with . Then
(6)  
where the last line is true for as long as with as by definition.
Furthermore, we need to show that the value of where with is lower than the value of with for .
From the above, it is clear that with , thus is defined. W.l.o.g. assume that is the tuple with . Then
(7)  
(8) 
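Equation 8 is true because, by Definition 1, every tuple outside $D_{\max}$ has a reward strictly below $\delta_{\max}$. As DPM+R is a special case of IPS+R with $\mu_t = 1$ for $t = 1, \ldots, n$, the proof also holds for DPM+R.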
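The degenerate maximum of the reweighted estimators can be illustrated numerically. The following is a minimal sketch on a hypothetical log (the numbers and names are illustrative assumptions): the second tuple has a reward close to the maximum, yet the estimator is maximized by a policy that assigns it probability 0, as long as the maximal-reward tuple keeps some probability mass.

```python
import random


def ips_r(deltas, pis, mus):
    """IPS+R: importance weights are normalized by their sum over the log."""
    weights = [p / m for p, m in zip(pis, mus)]
    return sum(d * w for d, w in zip(deltas, weights)) / sum(weights)


# Hypothetical log: 0.9 is the maximal reward, 0.85 is close to it.
deltas = [0.9, 0.85, 0.2]
mus = [0.5, 0.5, 0.5]
delta_max = max(deltas)

balanced = [0.5, 0.5, 0.1]      # both good outputs keep a high probability
degenerate = [0.01, 0.0, 0.0]   # tiny mass on the maximal tuple, zero elsewhere

v_balanced = ips_r(deltas, balanced, mus)
v_degenerate = ips_r(deltas, degenerate, mus)

# Random search over strictly positive probabilities never exceeds delta_max.
rng = random.Random(0)
best_random = max(
    ips_r(deltas, [rng.random() + 1e-9 for _ in deltas], mus)
    for _ in range(1000)
)

print(v_balanced, v_degenerate, best_random)
```

Because the reweighted estimator is a weighted average of the logged rewards, its value can never exceed the maximal logged reward, which is exactly what the degenerate policy attains.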
When employing stochastic gradient ascent, IPS+R and DPM+R can be prevented from reaching their degenerate state by performing early stopping on a validation set. However, one cannot control what happens to the probability mass that is freed when lowering the probability of a logged output. The freed probability mass could be allocated to outputs that receive a lower reward than the logged output, which would create a system that is worse than the logging system.
The estimators DR and DC successfully solve this problem. The direct reward predictor takes the whole output space into account and thus assigns rewards to any structured output. The objective will now be increased if the probability of outputs with high estimated reward is increased, and decreased for outputs with low estimated reward. For this to happen, high reward outputs other than the ones with maximal reward will be considered, even if the outputs have not been seen in the training log. This will shift probability mass to unseen data with high estimated reward, which is a desired property in learning.
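The contrast between reweighting alone and the addition of a direct reward predictor can be made concrete on a toy example. The following is a minimal sketch under deterministic logging, assuming that the doubly controlled objective has the form $\frac{1}{n}\sum_t [(\delta_t - \hat{c}\hat{\delta}_t)\bar{\pi}_w(y_t|x_t) + \hat{c}\sum_y \hat{\delta}(x_t, y)\pi_w(y|x_t)]$ with $\hat{c} = 1$; the output spaces, reward model values, and all names are hypothetical.

```python
def dpm_r(log, policy):
    """Reweighted DPM: only the probabilities of logged outputs enter."""
    probs = [policy[x][y] for x, y, _ in log]
    return sum(d * p for (_, _, d), p in zip(log, probs)) / sum(probs)


def dc(log, policy, reward_model, c_hat=1.0):
    """Doubly controlled objective (assumed form, deterministic logging)."""
    n = len(log)
    mean_prob = sum(policy[x][y] for x, y, _ in log) / n
    total = 0.0
    for x, y, delta in log:
        pi_bar = policy[x][y] / mean_prob
        # Direct term: estimated reward of every output, weighted by the policy.
        direct = sum(reward_model[(x, y_alt)] * p for y_alt, p in policy[x].items())
        total += (delta - c_hat * reward_model[(x, y)]) * pi_bar + c_hat * direct
    return total / n


# Hypothetical log: one logged output per input, with its observed reward.
log = [("a", "a0", 0.3), ("b", "b0", 0.6)]
# Hypothetical reward model covering seen and unseen outputs.
reward_model = {
    ("a", "a0"): 0.25, ("a", "a1"): 0.9, ("a", "a2"): 0.1,
    ("b", "b0"): 0.6, ("b", "b1"): 0.2,
}

base = {"a": {"a0": 0.6, "a1": 0.2, "a2": 0.2}, "b": {"b0": 0.7, "b1": 0.3}}
# Both policies lower the logged output a0 by 0.2 but move the mass differently.
to_good = {"a": {"a0": 0.4, "a1": 0.4, "a2": 0.2}, "b": {"b0": 0.7, "b1": 0.3}}
to_bad = {"a": {"a0": 0.4, "a1": 0.2, "a2": 0.4}, "b": {"b0": 0.7, "b1": 0.3}}

r_good, r_bad = dpm_r(log, to_good), dpm_r(log, to_bad)
dc_base = dc(log, base, reward_model)
dc_good = dc(log, to_good, reward_model)
dc_bad = dc(log, to_bad, reward_model)
print(r_good, r_bad, dc_base, dc_good, dc_bad)
```

In this sketch the reweighted objective cannot distinguish the two policies, since it only sees the probabilities of logged outputs, whereas the objective with the direct term prefers the policy that moves the freed mass to the unseen output with high estimated reward.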
4 Experimental Evidence
For completeness, we report the experimental evidence that Lawrence et al. (2017) provide to show the effectiveness of their proposed techniques. They report an application of counterfactual learning in a domain-adaptation setup for machine translation. A model is trained on out-of-domain data using the hierarchical phrase-based machine translation framework that is based on a linear learner. The model is given in-domain data to translate, and outputs are logged together with their per-sentence BLEU score to the true reference, which simulates the reward signal. Experiments are conducted on two language pairs. The first is German-to-English and its baseline system is trained on the concatenation of the Europarl corpus, the Common Crawl corpus and the News corpus. The target domain is represented by a corpus containing transcribed TED talks. The second language pair is French-to-English. Its out-of-domain system is trained on the Europarl corpus and the target domain is the News corpus.
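Table 2

Deterministic logging:

| corpus | split | out-of-domain (BLEU) | DPM+R (BLEU difference) | DC (BLEU difference) | $\hat{c}$DC (BLEU difference) | in-domain (BLEU) |
|---|---|---|---|---|---|---|
| TED | validation | 22.39 | +0.59 | +1.50 | +1.89 | 25.43 |
| TED | test | 22.76 | +0.67 | +1.41 | +2.02 | 25.58 |
| News | validation | 24.64 | +0.62 | +0.99 | +1.02 | 27.62 |
| News | test | 25.27 | +0.94 | +1.05 | +1.13 | 28.08 |

Stochastic logging:

| corpus | split | out-of-domain (BLEU) | IPS+R (BLEU difference) | DR (BLEU difference) | $\hat{c}$DR (BLEU difference) | in-domain (BLEU) |
|---|---|---|---|---|---|---|
| TED | validation | 22.39 | +0.57 | +1.92 | +1.95 | 25.43 |
| TED | test | 22.76 | +0.58 | +2.04 | +2.09 | 25.58 |
| News | validation | 24.64 | +0.71 | +1.00 | +0.71 | 27.62 |
| News | test | 25.27 | +0.81 | +1.18 | +0.95 | 28.08 |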
As shown in Table 2, under deterministic logging, the best results are obtained by combining reweighting and double control in the $\hat{c}$DC method. The relations between the algorithms and even the absolute improvements are quite similar under stochastic logging. For an extended discussion see Lawrence et al. (2017).
5 Discussion
We presented an analysis of possible degenerate behavior in counterfactual learning scenarios. We analyzed the degeneracies of the standard inverse propensity scoring method and its weighted variant, both under stochastic and deterministic logging. Our analysis facilitates a clearer understanding of why doubly robust learning techniques serve to avoid such degeneracies, and why such techniques even make it possible to perform counterfactual learning under deterministic logging.
Lawrence et al. (2017) also discuss a possible implicit exploration effect by the stochastic selection of inputs. This phenomenon has recently been given a formal account by Bastani et al. (2017) and has not been analyzed formally in this paper.
An open question is the application of the techniques proposed by Lawrence et al. (2017) to machine translation with neural networks. For example, the necessity to normalize probabilities over the full set of logged data creates a memory bottleneck which makes it difficult to transfer the reweighting approach to neural networks.
Acknowledgments
The research reported in this paper was supported in part by the German Research Foundation (DFG).
References
Bastani, H., Bayati, M., and Khosravi, K. (2017). Exploiting the natural exploration in contextual bandits. ArXiv e-prints, 1704.09011.
Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260.
Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), Granada, Spain.
Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA.
Jiang, N. and Li, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY.
Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago, Illinois.
Kreutzer, J., Sokolov, A., and Riezler, S. (2017). Bandit structured prediction for neural sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.
Lawrence, C., Sokolov, A., and Riezler, S. (2017). Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.
Nguyen, K., Daumé, H., and Boyd-Graber, J. (2017). Reinforcement learning for bandit neural machine translation with simulated feedback. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.
Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), San Francisco, CA.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Ross, S. M. (2013). Simulation. Elsevier, fifth edition.
Sokolov, A., Kreutzer, J., Lo, C., and Riezler, S. (2016). Stochastic structured prediction under bandit feedback. In Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain.
Swaminathan, A. and Joachims, T. (2015). The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems (NIPS), Montreal, Canada.
Thomas, P. S. and Brunskill, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY.