1 Introduction
Learning from logged bandit feedback (Swaminathan) is a form of counterfactual inference given only observational data (Pearl). This problem is ubiquitous in many real-world decision-making scenarios such as personalized medicine (“what would have been the treatment leading to the optimal outcome for this particular patient?”) (Saria) or online marketing (“which ad should have been placed in order to maximize the click-through rate?”) (Strehl2010). We review existing approaches for solving these types of problems in Sec. 2.
In this paper, we focus on a specific flavor of learning from logged bandit feedback, which we name cost-effective incentive allocation. More precisely, we allocate economic incentives (e.g., online coupons) to customers and observe a response (e.g., whether the coupon is used or not). Each action is mapped to a cost and we further assume that the response is monotonically increasing with respect to the action’s cost. Furthermore, we incorporate budget constraints related to the global cost of the marketing campaign. This framework can be readily applied to the problem of allocating monetary values of coupons in a marketing campaign under fixed budget constraints from the management. We present the setting of batch learning from bandit feedback in Sec. 3 and the novel assumptions in Sec. 4.
Existing work in counterfactual inference using bandit feedback (Joachims2018; Shalit2016) does not make use of the supplementary structure for the rewards and is limited in practice when the cardinality of the action space is large (LefortierSGJR16). We therefore developed a novel algorithm which incorporates such structure. The algorithm, which we refer to as constrained policy optimization via structured incentive response estimation, has two components. First, we take into account the reward structure to estimate the incentive response. This step is based on representation learning for counterfactual inference (Johansson2016). Second, we rely on the estimates to optimize the coupon assignment policy under budget constraints. We derive error bounds for this algorithm in Sec. 5.
We benchmark our approach on simulated and real-world data against state-of-the-art approaches and show the advantages of our method in Sec. 6.
2 Related Work
Constrained Policy Optimization
Safety constraints in reinforcement learning (RL) are usually expressed via the level sets of a cost function. The main challenge of safe RL is that the cost of a given policy must be evaluated with an off-policy strategy, which is a hard problem (jiang16). Achiam2017 focus on developing a local policy search algorithm with guarantees of respecting cost constraints. This approach is based on a trust-region method (Schulman2015), which allows it to circumvent the off-policy evaluation step. Our setting is more akin to that of contextual multi-armed bandits, since customers are modeled as sampled iid from a unique distribution. In this scenario, checking for the satisfaction of the cost constraints is straightforward, which makes the original problem substantially easier. Most research contributions on policy optimization with budget constraints focus on the online learning setting (ding2013multi; badanidiyuru2018bandits; burnetas2017asymptotically; xia2015thompson; wu2015algorithms). The underlying algorithms are unfortunately not directly applicable to offline policy optimization.
Counterfactual Risk Minimization
The problem of learning from logged bandit feedback consists in maximizing the expected rewards
(1) 
where is the reward function, a parameterized policy and
the logging policy. All methods rely on importance sampling and it is therefore necessary to log the action probabilities along with the reward.
Strehl2010 developed error bounds when the logging policy is unknown and learned from the data. Dudik2011 focus on reducing the variance of offline policy evaluation using a doubly robust estimator.
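To make the role of the logged action probabilities concrete, the following is a minimal sketch of the basic importance-sampling (inverse propensity scoring) estimate on which the methods above build. The function name `ips_value` and the toy log are illustrative assumptions and are not taken from any of the cited implementations.

```python
import numpy as np


def ips_value(rewards, logged_propensities, target_propensities):
    """Importance-sampling estimate of the expected reward of a target policy.

    rewards[i]              : reward observed for the logged action of sample i
    logged_propensities[i]  : probability the logging policy gave to that action
    target_propensities[i]  : probability the evaluated policy gives to that action
    """
    rewards = np.asarray(rewards, dtype=float)
    logged = np.asarray(logged_propensities, dtype=float)
    target = np.asarray(target_propensities, dtype=float)
    # Reweight each logged reward by how much more (or less) likely the
    # evaluated policy is to take the logged action.
    weights = target / logged
    return float(np.mean(rewards * weights))


# Hypothetical toy log with four interactions.
toy_rewards = [1.0, 0.0, 1.0, 1.0]
toy_logged = [0.5, 0.5, 0.25, 0.25]
toy_target = [0.5, 0.25, 0.5, 0.25]
estimate = ips_value(toy_rewards, toy_logged, toy_target)
```

When the evaluated policy coincides with the logging policy, all weights equal one and the estimate reduces to the average logged reward.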
Swaminathan; Swaminathan2015counterfactual develop learning bounds based on an empirical Bernstein inequality (Maurer2009) and present a principle of Counterfactual Risk Minimization (CRM) which is based on empirical variance regularization. Wu2018 propose a trust-region variant and further bound the empirical variance with a chi-square divergence, evaluated with an f-GAN (fgan). Joachims2018 is based on Swaminathan2015self and focuses on a practical deep learning solution of equivariant estimation and stochastic optimization of Eq. (1). An advantage of this framework is that its mathematical assumptions are weak, while a disadvantage is that it is not clear how to make use of structured rewards.
Individualized Treatment Effect Estimation
The problem of Individualized Treatment Effect (ITE) estimation aims at estimating the difference in expectation between two treatments
(2) 
where is a point in customer feature space and
is a random variable corresponding to the rewards. The difficulty arises primarily from the fact that the historical data do not always fall into the ideal setting of a randomized control trial. That inevitably induces an estimation bias due to the discrepancy between the empirical distributions
and . hill2011 presents an instrumental method for such an estimation based on Bayesian Additive Regression Trees (BART). Johansson2016; Shalit2016 cast the counterfactual question into a domain adaptation problem that can be solved via representation learning. Essentially, they propose to find an intermediate feature space which embeds the customers and trades off the treatment discrepancy for the reward predictability. yoon2018ganite propose using generative adversarial networks to learn the counterfactual rewards and extend this framework to the multiple treatment case via a mean-square error loss. This line of work does not require knowing the logging policy beforehand. Remarkably, no previous work focuses on the setting of structured rewards.
3 Batch Learning from Bandit Feedback
For concreteness, we focus on the example of a marketing campaign. Let be an abstract space and
. Let be the mediator and a set of customers. We assume each customer is an iid sample from . Let be the set of financial incentives possibly emitted by and be the space of probability distributions over . Mediator deploys a marketing campaign which we model as a policy . For simplicity, we also denote the probability of an action under a policy for a customer using conditional distributions . In response, customers can either choose to purchase the product from or from another unobserved party. As the customers might engage in interaction with other parties, this is not a fully observable game. We therefore model our data acquisition process as a stochastic contextual bandit feedback problem. Given a context and an action , we observe a stochastic reward . In practice, the reward can be defined as any available proxy to the mediator profit (e.g., whether the coupon was used or not). From that, the mediator seeks an optimal policy:
(3) 
This setting is known as batch learning from bandit feedback (Swaminathan; Swaminathan2015counterfactual; Swaminathan2015self). This problem has connections to causal and particularly counterfactual inference. As described in Swaminathan; Swaminathan2015counterfactual, the data is incomplete in the sense that we do not observe what would have been the reward if another action had been taken. Furthermore, we cannot play the policy in real time; we instead only observe data sampled from a logging policy (the data generative process is described in Alg. 1). Therefore, the collected data is also biased since actions taken by the logging policy are overrepresented. Swaminathan2015self introduce a distinction between standard overfitting, which arises from performing model selection on finite-sample data, and propensity overfitting, which instead arises from performing risk minimization based on biased exploration data.
4 CostEffective Incentive Allocation
The setting of batch learning from bandit feedback might not be suitable when actions can be mapped to monetary values, as we illustrate in the following example.
Example 1.
(Monetary marketing campaign) Let be the space of discounts. Let be the customer’s response (e.g., how much he or she bought) according to his / her psychological profile and the discount rate . The trivial policy which always selects the most expensive action is a solution for problem (3).
In order to better pose the problem, we introduce novel modeling assumptions which further relate actions to the reward distribution.
Model Assumption 1.
(Structured action space) There exists a total ordering over the discrete space .
Model Assumption 2.
(Budget constraints) There exists a cost function monotone on with respect to the total ordering . Let be a maximal average budget per customer. We define the set of feasible policies as
(4) 
Model Assumption 3.
(Structured reward distribution) The reward distribution is compatible with the total ordering in the sense that
(5) 
We subsequently tailor the problem of offline policy optimization to the constrained case, which is formulated as follows:
(6) 
We claim that the estimation of the incentive response function is a computationally harder problem than the constrained optimization problem (6) in the following sense (proof in Appendix A):
Proposition 1.
Let us assume that the incentive response function is known. Then solving (6) reduces to a binary search over a unique Lagrange multiplier for the cost constraint. Particularly, each step of the search has time complexity with being the number of samples in the dataset.
This negative result shows that one should not rely on Incentive Response Estimation (IRE) solely in order to perform Constrained Policy Optimization (CPO). As we will see, however, the IRE problem becomes simpler under the structured reward assumption (Model Ass. 3). This is in contrast to CRM-based policy optimization methods which do not make use of the structure and therefore have the same complexity. Our approach is therefore a two-stage procedure. First, we will present a new method for IRE. That method is inspired by the ITE estimation literature but fully exploits the reward structure. Second, we solve the CPO problem based on the estimated structured reward function. We argue that exploiting the statistical structure when it exists gives advantages when compared to vanilla constrained policy optimization; this argument is similar to that comparing model-based and model-free reinforcement learning (pong2018temporal).
Statistical Benefits from Structured Rewards
Learning from bandit feedback is a hard problem when compared to supervised learning because of the partial feedback. On average, one single fully supervised feedback is equivalent to
instances of bandit feedback. When a structured reward is taken into account, a high reward for a given incentive means that a higher incentive will also probably yield a high reward. Even though this might not be quantifiable, the benefit is a less complex reward structure. Second, learning from bandit feedback is also hard due to the data collection bias. For the same reason, our structure enables one action to give clues about the consequences of other actions, which is helpful for alleviating such bias.
5 Constrained Policy Optimization via Structured Incentive Response Estimation
We begin by making our assumptions formal.
Model Assumption 4.
(Refined rewards structure) There exists a function such that
(7) 
and satisfies the structured reward assumption
(8) 
In order to ensure that the causal effect is identifiable, we adopt the following classical assumptions from counterfactual inference (rubin2005).
Math Assumption 1.
(Overlap) The logging policy satisfies the condition
(9) 
Math Assumption 2.
(No unmeasured confounding) Let
denote the vector of possible outcomes in the Rubin-Neyman potential outcomes framework
(rubin2005). We assume that the vector is independent of the action given .
These “strong ignorability” hypotheses are sufficient for identifying the reward function from the historical data (Shalit2016).
5.1 Bias estimation from domain adaptation bounds
We turn to the problem of estimating the function from historical data (i.e., generated by Alg. 1). To this end, we first write the estimation problem with a general population loss. Let
be a loss function and
a probability distribution on the product (which we refer to as a domain). Let be a set of functions parameterized by . We define the domain-dependent population risk as
(10) 
In the historical data, individual data points are sampled from the so-called source domain . However, in order to perform off-policy evaluation of a given policy , we would need ideally data from . The discrepancy between the two distributions will cause a strong bias in off-policy evaluation which can be quantified via learning bounds from domain adaptation (Blitzer2007). The issue is therefore that of finding a target domain from which the estimated incentive response would generalize to all policies for offline evaluation.
Intuitively, we want to find a domain which lies halfway and for all policies . We extend the work of Johansson2016; Shalit2016 for this purpose. In the ITE scenario, the domain that is central to the problem is the mixture between factual and counterfactual domain , in which the treatment assignment (i.e., in our setting) and the customer feature are decoupled. In the sequel, we therefore focus on selecting a target domain in the factorized form where is a categorical distribution on the action space with probability .
Following Shalit2016, we wish to derive a learning bound for the structured incentive response estimation problem that yields a principled algorithm. In particular, we wish to bound how much an estimate of based on data from the source domain can generalize to the target domain . To this end, we bound the discrepancy between the population risks on the source and target domains following (Blitzer2007):
(11) 
where is the set of functions
(12) 
We refer to the mathematical object in equation (11) as an integral probability metric (IPM) between distributions and with function class , noted . The general problem of computing IPMs is known to be NP-hard. Shalit2016 present some specific cases for two treatments in which this IPM can be estimated, for example using the maximum mean discrepancy (MMD) (MMD). Atan focus on multiple treatments but rely on adversarial neural networks (Ganin2016). We focus instead on a different nonparametric measure of independence, the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton2005; HSIC), which also yields an efficient estimation procedure for the IPM.
Proposition 2.
Let us assume that and are separable metric spaces. Let (resp. ) be a continuous, bounded, positive semidefinite kernel. Let (resp. ) be the corresponding reproducing kernel Hilbert space (RKHS). Let us assume that the function space
is included in the unit ball of the tensor space
. Let be the marginal frequency of actions under the logging policy :
(13) 
Under these assumptions, one can identify the IPM in equation (11) as follows:
Proof.
See Section 2.3 of smola2007hilbert, Definition 2 from MMD and Definition 11 from sejdinovic2013. ∎
Notably, the HSIC can be directly estimated via samples from (HSIC) in , with number of data points.
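As an illustration of the quadratic-time estimate, the sketch below computes the standard biased empirical HSIC between customer representations and logged actions. The choice of an RBF kernel on representations, a delta kernel on discrete actions, the `1/(n-1)^2` normalization, and all function names are assumptions made for this example; the exact kernels and normalization used in the experiments may differ.

```python
import numpy as np


def rbf_kernel(z, bandwidth=1.0):
    """Gram matrix of an RBF kernel on the rows of z (shape n x d)."""
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def delta_kernel(actions):
    """Gram matrix of the delta kernel on discrete actions (1 if equal, else 0)."""
    a = np.asarray(actions)
    return (a[:, None] == a[None, :]).astype(float)


def hsic_biased(K, L):
    """Biased empirical HSIC, trace(K H L H) / (n - 1)^2, computed in O(n^2).

    Centering K explicitly and taking an elementwise product with the
    symmetric matrix L avoids the cubic-cost matrix products.
    """
    n = K.shape[0]
    K_centered = (
        K
        - K.mean(axis=0, keepdims=True)
        - K.mean(axis=1, keepdims=True)
        + K.mean()
    )
    return float(np.sum(K_centered * L)) / (n - 1) ** 2


# Hypothetical toy data: the action is fully determined by the feature.
features = np.array([[0.0], [0.0], [3.0], [3.0]])
actions = np.array([0, 0, 1, 1])
dependence = hsic_biased(rbf_kernel(features), delta_kernel(actions))
```

A value near zero indicates that, under the chosen kernels, the representation carries little information about the logged action, which is the quantity the penalty is meant to drive down.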
5.2 Bias correction via feature transformation
Provided that kernels and are both characteristic, then the IPM in Eq. (11) is null if and only if the logging policy is independent from the customer feature (HSIC). In order to control the bias in estimating from the historical data, we follow Johansson2016; Shalit2016 and introduce an abstract feature space and a mapping such that . For technical reasons, we need to make the following assumption:
Math Assumption 3.
is a twice-differentiable one-to-one mapping. We let denote the corresponding inverse mapping.
Adopting the notation of Shalit2016, we can write the following bound
Proposition 3.
Let and an mapping satisfying Math Ass. 3. Let us assume there exists a constant such that is inside the unit ball of the tensor space . There exists a constant such that
(14) 
Remarkably, the first term of the right hand side is a prediction error for the historical reward (function of and ) while the second one measures how much the intermediate space is informative of the logging policy (a function of only).
5.3 Counterfactual policy optimization
We proceed to assessing the bias for policy evaluation as a function of the estimation error.
Proposition 4.
Let be a policy, a mapping and . Let us assume that is the loss. Under assumptions from Prop. 3, there exists a constant such that:
(15) 
Exploiting the reward structure
There are exactly two ways in which we expect an improvement from the structured rewards assumption in Eq. (15). First, inside the integral probability metric of Eq. (11), we need to take the supremum over a significantly smaller class of functions (functions which are nondecreasing in their second argument). This implies that weaker regularization or weaker kernels may provide satisfactory results in practice. Second, the learning bound in Eq. (15) only refers to population risk and does not include finite-sample concentration effects. In particular, the second term (the error for estimating from the historical data) can benefit from the monotonicity assumption in a similar fashion as the improved rates for isotonic regression or other shape-constraint scenarios.
5.4 Algorithm and implementation
Parameterizing the expected reward
One practical problem is how to restrain the search space for to functions that are increasing in for each fixed transformed feature . We use a feedforward neural network which takes as input and outputs a vector that parameterizes . Let us assume that the actions are ordered by increasing cost, and there are constraints:
(16) 
We enforce these constraints with the following simple steps. Let the final layer of the network have an exponential nonlinearity. We let denote its output. We apply the matrix product operation with the upper triangular matrix full of ones. Finally, for each action , is estimated via the quantity , where
denotes the sigmoid function. Since all individual entries of the vector
are positive, the constraints in (16) are satisfied. A similar scheme by first applying a lower triangular matrix and then the upper triangular matrix can be used to ensure concavity.
Policy Optimization
Given the function and a set of customers , the problem of policy optimization becomes
(17) 
for which multiple solutions can be used. Options include the exact binary search procedure from Prop. 1, stochastic optimization where is parameterized with a neural network, or more complex estimators such as a doubly robust estimator (Dudik2011) with our structured rewards.
We detail the practical implementation of our algorithm in Alg. 2.
(18) 
(19) 
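The monotone output layer described under “Parameterizing the expected reward” can be sketched in NumPy as follows. This is only an illustration of the exponential, cumulative-sum and sigmoid steps under the assumption that the network emits one unconstrained number per action; the function name `monotone_response` and the toy inputs are hypothetical, and the network that produces the pre-activations is omitted.

```python
import numpy as np


def monotone_response(pre_activations):
    """Map unconstrained outputs to responses that increase with the action index.

    pre_activations: array of shape (batch, n_actions), one row per customer,
    with actions ordered by increasing cost.
    """
    pre_activations = np.asarray(pre_activations, dtype=float)
    # Exponential nonlinearity: every entry becomes a strictly positive increment.
    increments = np.exp(pre_activations)
    n_actions = increments.shape[-1]
    # Upper-triangular matrix of ones: right-multiplying by it yields
    # cumulative sums, so column j holds increments[:, 0] + ... + increments[:, j].
    upper = np.triu(np.ones((n_actions, n_actions)))
    cumulative = increments @ upper
    # The sigmoid is increasing, so it preserves the ordering across actions.
    return 1.0 / (1.0 + np.exp(-cumulative))


# Hypothetical pre-activations for two customers and three actions.
toy_outputs = monotone_response(np.array([[0.0, -1.0, 0.5], [-2.0, -2.0, -2.0]]))
```

Because each increment is positive, every row of the result is increasing in the action index regardless of the network weights, so the constraint holds by construction rather than through a penalty.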
6 Experiments
We compare our method for cost-effective incentive allocation with a CRM baseline inspired from Joachims2018 and described in Appendix B. We note that this baseline also relies on deep learning. When applicable, we also benchmark our ITE estimation against BART (hill2011).
6.1 Fully-simulated data
Since it is difficult to obtain a realistic dataset meeting all our assumptions and containing full information, for benchmarking purposes we first construct a synthetic dataset following zhao2017uplift.
Simulation framework
The dataset is of the form and is generated following the procedure from Alg. 1. Random variable
represents the users’ features and is uniformly distributed in the cube:
(20) 
represents the action historically assigned to the customer, the cost of which is respectively. Our logging policy is defined as
(21) 
where represents the response after customer receives the incentive . In order to meet Model Assumption 4, we generate the response by function as
(22) 
where and
are the mean value and standard deviation of
respectively, denotes the sigmoid function and is defined as
(23) 
with , , . Specifically, we select a group of , and for independently according to the uniform distribution on for each repeated experiment.
Incentive Response Estimation via HSIC
Three common metrics for the estimation errors—Precision in Estimation of Heterogeneous Effect (PEHE), Average Treatment Effect (ATE) and Individualized Treatment Effect (ITE) (Johansson2016)—are compared. For the multiple treatments experiments, we use the Mean Squared Error (MSE) to evaluate our method following yoon2018ganite. Each experiment with the same HSIC penalty
is repeated 50 times. We use stochastic gradient descent as a first-order stochastic optimizer with a learning rate of 0.01, an RBF kernel and a three-layer neural network with 512 neurons for each hidden layer. Experimental results on the synthetic dataset for a binary treatment (resp. multiple treatments) are shown in Fig. 1 (resp. Fig. 1). For the binary treatments experiments, the dataset described above is modified so that only the first two actions are considered as treatments. In the binary treatments experiments of Fig. 1, we compare our approach with the ITE error of BART while the other two error metrics are also shown. The error curves are flat for . In the regime , the HSIC improves the performance on all three metrics, which shows that HSIC helps in improving the treatment estimation. For , the errors increase rapidly, which shows that HSIC may hurt representation learning if overoptimized. In the case of the multiple treatments experiments of Fig. 1, we can draw the conclusion that the HSIC also improves the incentive response estimation and outperforms BART.
Constrained Policy Optimization results
Finally, we perform constrained policy optimization via the incentive response estimation according to Alg. 2. Our results in Fig. 1 are averaged across 10 repeated experiments. The performance of the constrained policy optimization shows a similar trend as the incentive response estimation, with respect to . For suitable values of , we outperform the CRM baseline.
6.2 Simulating Structured Bandit Feedback from Nested Classification
Standard contextual bandit algorithms can be evaluated by simulating bandit feedback in a supervised learning setting (Agarwal2014). We propose a novel approach to evaluate cost-effective incentive allocation algorithms. To this end, we use a model of nested classification with bandit-type feedback that we describe in detail in Appendix C.
Experiment
We randomly select images with labels “Animal,” “Plant,” and “Natural object” from ImageNet (ImageNet), and we focus on the special nested structure
A logging policy, with the exact same form as in the fully-simulated experiment, is used to assign one of these four labels for each image. Feedback is if the label is correct, or otherwise. The corresponding costs for selecting these labels are .
The dataset is randomly split into training and testing datasets with ratio 8:2. The ratio of the positive and negative samples is equal to 1:10. Images are uniformly preprocessed, cropped to the same size and embedded into
with a pretrained convolutional neural network (VGG16 from simonyan2014very).
Results
We perform constrained policy optimization via structured incentive response estimation according to Alg. 2. Our estimation errors are reported in Fig. 2. We compare our learned policy with the CRM baseline in Fig. 2. We average all results across ten repeated experiments.
We span the same range of the HSIC penalty as in the previous experiments. All the results are obtained using the same parameters except the number of neurons for the hidden layers, which is doubled. In Fig. 2, the dashed line represents the MSE for . Consistent with the simulations, the error decreases first with but then increases. This demonstrates again that the HSIC improves the treatment estimation.
Correspondingly, the performance of the constrained policy optimization based on feature representation is also influenced by the value of . As shown in Fig. 2, with suitable we can also obtain better policies compared to the CRM baseline. Notably, it is comforting that the same leads to the smallest estimation error as well as the best performance for policy optimization. The superiority reaches a higher level of significance when the average budget constraint becomes smaller, due to the fact that policy optimization becomes harder with smaller budgets.
Moreover, after adopting the structured response assumption, both the incentive response estimation and the constrained policy optimization perform much better, for which proper values of still help. Specifically, we show the result from an experiment with structured incentive response estimation but without the HSIC penalty. The incentive response in this experiment fits the structured assumption perfectly, which is the main reason why the structured estimation helps so much.
7 Discussion
We have presented a novel framework for counterfactual inference based on the batch learning from bandit feedback scenario but with additional structure on the reward distribution as well as the action space. For this specific setting, we have proposed a novel algorithm based on domain adaptation which effectively trades off prediction power for the rewards against estimation bias. We obtained theoretical bounds which explicitly emphasize this trade-off and we presented empirical evaluations that show that our algorithm outperforms state-of-the-art methods based on the counterfactual risk minimization principle.
Our framework involves the use of a nonparametric measure of dependence to unbias the estimation of rewards. Penalizing the HSIC as we do for each minibatch implies that no information is aggregated during training about the embedding and how it might be biased with respect to the logging policy. On the one hand, this is positive since we do not have to estimate more parameters, especially if the joint estimation would require a minimax problem as in Atan; yoon2018ganite. On the other hand, that approach could be harmful if the HSIC could not be estimated with only a minibatch. Our experiments show this does not happen in a reasonable set of configurations. Trading a minimax problem for an estimation problem does not come for free. First, there are some computational considerations. The HSIC is computed in quadratic time but linear-time estimators of dependence (Jitkrittum2016) or random-feature approximations (PerezSuay2018) should be used for nonstandard batch sizes.
Following up on our work, a natural question is how to properly choose the optimal , the regularization strength for the HSIC. In previous work such as Shalit2016, such a parameter is chosen with cross-validation via splitting the datasets. However, in a more industrial setting, it is reasonable to expect the mediator to have tried several logging policies which, once aggregated into a mixture of deterministic policies, enjoy effective exploration properties (see, e.g., Strehl2010). In particular, an interesting methodological development would include deriving a new notion of Counterfactual Cross-Validation (which informally would reduce the variance when compared against a randomized CV by preventing the propensity overfitting).
Another scenario in which our framework could be extended is the case of continuous treatments. That extension would be natural in the setting of financial incentive allocation and has already been of interest in recent research work (kallus18a). The HSIC would still be an adequate tool for quantifying the selection bias since kernels are flexible tools for continuous measurements.
References
Appendix A Reduction of incentive estimation to policy optimization
Proof.
Let us denote the reward function as . We use the following estimator for the objective as well as the cost in problem (6). Let and . An empirical estimate of the expected reward can be written as
(24) 
There exists a Lagrange multiplier such that this problem has the same optimal set as the following
(25) 
In this particular setting, we can directly solve for the optimal action for each customer, which is given by
To search for the ad hoc Lagrange multiplier , a binary search can be used to make sure the resulting policy saturates the budget constraint. ∎
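The argument above can be sketched as a short routine, assuming the incentive response is available as a matrix of estimated rewards with one row per customer and one column per action. The names `allocate` and `budgeted_policy`, the search bound `lam_max`, and the toy numbers are hypothetical choices for this example; it also assumes that the average cost is non-increasing in the multiplier, which is what makes bisection applicable.

```python
import numpy as np


def allocate(rewards, costs, lam):
    """Per-customer best action for the Lagrangian reward - lam * cost."""
    return np.argmax(rewards - lam * costs[None, :], axis=1)


def budgeted_policy(rewards, costs, budget, lam_max=1e3, n_steps=50):
    """Bisection on the Lagrange multiplier of the average-cost constraint.

    rewards: (n_customers, n_actions) estimated responses
    costs:   (n_actions,) cost of each action
    budget:  maximal average cost per customer
    Each bisection step costs O(n_customers * n_actions).
    """
    rewards = np.asarray(rewards, dtype=float)
    costs = np.asarray(costs, dtype=float)
    actions = allocate(rewards, costs, 0.0)
    if costs[actions].mean() <= budget:
        return actions, 0.0  # the unconstrained optimum is already feasible
    if costs[allocate(rewards, costs, lam_max)].mean() > budget:
        raise ValueError("budget not met even at lam_max")
    lo, hi = 0.0, lam_max  # invariant: lo is infeasible, hi is feasible
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if costs[allocate(rewards, costs, mid)].mean() <= budget:
            hi = mid
        else:
            lo = mid
    return allocate(rewards, costs, hi), hi


# Hypothetical example: three customers, three actions of cost 0, 1 and 2.
toy_rewards = np.array([[0.1, 0.5, 0.9], [0.1, 0.2, 0.25], [0.3, 0.6, 0.7]])
toy_costs = np.array([0.0, 1.0, 2.0])
toy_actions, toy_lam = budgeted_policy(toy_rewards, toy_costs, budget=1.0)
```

Returning the allocation at the feasible end of the bracket guarantees that the budget is respected, at the price of possibly leaving a small part of it unused when costs are discrete.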
Appendix B Counterfactual Risk Minimization baseline
We consider again the constrained policy optimization problem in Eq. (6). We propose to use as a baseline the self-normalized IPS (Swaminathan2015self) estimator and a novel algorithmic procedure adapted from Joachims2018 to solve the constrained policy optimization problem with a stochastic gradient algorithm:
(26) 
Let us assume for simplicity that there exists a unique solution to this problem for a fixed budget . This solution has a specific value for the normalization covariate . If we knew the covariate , then we would be able to solve for:
(27) 
The problem, as explained in Joachims2018, is that we do not know beforehand but we know it is supposed to concentrate around its expectation. We can therefore search for an approximate , for each , in the grid given by standard nonasymptotic bounds. Also, by properties of Lagrangian optimization, we can turn the constrained problem into an unconstrained one and keep monotonic properties between the Lagrange multipliers and the search of the covariate (see Algorithm 3).
(28) 
Appendix C Nested classification to structured bandit feedback
Vanilla supervised learning aims at discriminating between very different entities (dogs, cats, humans, etc.) and is not straightforwardly compatible with our structured feedback. The vanilla setting lacks an ordering between the actions. We propose to simulate this ordering by using data with nested labels . In classification problems, the sets are disjoint. In our setting, they are monotonic with respect to the set inclusion. One concrete example can be constructed from the ImageNet dataset (ImageNet). We focus on the nested labels
Fully supervised feedback would reveal the optimal class to describe a given image. Classical bandit feedback would reveal whether or not the guess from the learner is valid. A structured bandit says only if the guess is semantically valid. In this particular example of nested classification, say we observe a dataset where is the pixel values of a single image and is the perfect descriptor for this image among the labels. The vanilla supervised learning problem refers to finding a labeling function that maximizes the classification accuracy
(29) 
Here, we are interested in a different setting where labels are nested and partial reward should be given when the label is correct but not optimal. To motivate the incentive allocation problem, we give to the agent full reward whenever the label is correctly assigned. The corresponding loss can be written as (after adequately permuting the labels):
(30) 
As in the marketing problem, a trivial labeling function constant equal to can maximize this.
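The structured feedback can be simulated with a few lines of code. The sketch below assumes an indexing convention that is ours rather than the paper's: labels are numbered from the most specific (index 0) to the broadest (index `n_labels - 1`), and `true_level` is the index of the most specific label that is valid for the image, with `n_labels` meaning that no label applies. The function name `nested_feedback` is likewise hypothetical.

```python
import numpy as np


def nested_feedback(guess_level, true_level):
    """Structured bandit feedback for nested labels.

    A guess is semantically valid whenever it is at least as broad as the
    most specific valid label, i.e. guess_level >= true_level.
    Returns 1.0 for a valid guess and 0.0 otherwise.
    """
    guess = np.asarray(guess_level)
    truth = np.asarray(true_level)
    return (guess >= truth).astype(float)


n_labels = 4
all_guesses = np.arange(n_labels)
# An image whose most specific valid label has index 1.
rewards_positive = nested_feedback(all_guesses, 1)
# An image outside every label set (encoded as true_level = n_labels).
rewards_negative = nested_feedback(all_guesses, n_labels)
```

Under this convention the reward is nondecreasing in the label index, and always guessing the broadest label collects every available reward, which is the trivial solution noted above.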
Cost constraints and analogy to incentive estimation
As a consequence, a vanilla policy optimization problem will learn to label all the dataset as “Animal” since this will give a maximal reward. Therefore, we need again to add cost constraints and a budget so that the learner will have to guess what is the optimal decision to take (i.e., with an optimal ratio of reward / cost). The problem becomes:
(31) 
Remarkably, it is easy to characterize when the problem is trivial or not.
Proposition 5.
Let us fix a cost function .

If the budget is greater than the cost of , then is feasible and optimal. Other solutions will make choices that are correct but suboptimal.

If the budget is exactly equal to the cost of , then the only solutions to the nested classification are optimal also for the supervised learning case.