We consider data of discrete decisions taken in contexts with features . The outcome of each decision has an associated cost (or equivalently, negative reward). For instance, we may obtain data from a hospital in which patients with features are given treatment to lower their blood pressure and denotes the change of pressure value. The observational data is drawn independently as follows
where we have used a causal factorization of the unknown data-generating process. The distribution of contexts is described by and summarizes a decision policy which is generally unknown.
Using the training data points, our goal is to automatically improve upon the past policy. That is, learn a new policy, which is a mapping from features to decisions
such that the outcome cost will tend to be lower than in the past. This policy partitions the feature space into disjoint regions. A sample from the resulting data generating process can then be expressed as
where the last identity follows if features overlap across decisions so that . The optimal policy for this problem is and is determined by the unknown training distribution (1). Thus a policy must be learned from training samples, where a fundamental source of uncertainty about outcomes is uneven feature overlap across decisions [4, 11] (see Fig. 0(a) for an illustration). Eq. (3) is equivalent to an off-policy learning problem in contextual bandit settings using logged data [13, 6, 19, 10, 18], but where the past policy is unknown.
A common approach is to learn a regression model of , which in the case of binary decisions and linear models restricts the class of policies to the form . To avoid the sensitivity to regression model misspecification, an alternative approach is to learn a model of and then approximately solve (3) by numerical search over a restricted parametric class of policies, see the analysis in .
Reliably estimating the expected cost of a policy would yield an important performance certificate in safety-critical applications. In such applications, however, reducing the prevalence of high-cost outcomes is a more robust strategy than reducing the expected cost, even when such tail events have low probability; see Figure 0(b) for an illustration. This is especially relevant when the conditional distribution of outcome costs is skewed or has a dispersion that varies with.
In this paper, we develop a method for learning a robust policy that
- targets the reduction of the tail of the cost distribution , rather than ,
- provides a statistically valid limit for each decision,
- is operational even when there is little feature overlap.
Moreover, when the past policy is unknown, the robust policy can be learned using unsupervised techniques, which obviates the need to specify associative models and/or . The method is demonstrated using both real and synthetic data.
2 Problem formulation
We consider a policy to be robust if it can reduce the tail costs at a specified level as compared to the past policy – even for finite and highly uneven feature overlap. We define the -tail as all for which the probability is greater than or equal to . An optimal robust policy therefore minimizes the
-quantile of the cost, viz. a solution to
Since a learned policy is a function of the training data , the probability is also defined over all i.i.d. training points.
The problem we consider is to learn a policy in a class that approximately solves (4) and certifies each decision by a limit that holds with a probability of at least for finite and highly uneven feature overlap.
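To make the tail criterion concrete, the following minimal sketch computes an empirical tail threshold from a sample of costs. It assumes the constraint in (4) bounds the probability of exceeding the threshold by the specified level; the function name `empirical_tail_threshold` and the variable `alpha` are illustrative and not part of the method itself.

```python
import math


def empirical_tail_threshold(costs, alpha):
    """Smallest observed cost t such that the fraction of costs
    strictly above t is at most alpha (an empirical tail quantile)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    ordered = sorted(costs)
    n = len(ordered)
    # At most floor(alpha * n) samples are allowed to exceed the threshold.
    allowed = math.floor(alpha * n)
    return ordered[n - allowed - 1]


# Hypothetical cost samples, for illustration only.
costs = [3, 1, 4, 1, 5, 9, 2, 6]
threshold = empirical_tail_threshold(costs, alpha=0.25)
```

Under this reading, a policy is preferred when its threshold at the given level is lower, regardless of how its mean cost compares.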
3 Learning Method
Since the cumulative distribution function (cdf) in (4) is unknown for a given policy, it is a challenging task to find the minimum which satisfies the constraint. We propose to restrict the policies to a class , constructed as follows: Suppose there exists a feature-specific limit for a given decision , such that is no less than . Then we define as all policies that select with the minimum cost limit at the specified level . That is, a class of robust policies
Learning a policy in therefore amounts to using to learn a set of functions that satisfy the constraints. Figure 2 illustrates constructed using the method described below, for a binary decision variable .
Remark: If there is a tie among , the policy can randomly draw from the minimizers. If the limits are non-informative, , the method will indicate that the data is not sufficiently informative for reliable cost-reducing decisions. See Figure 2 for regions in feature space where there is no data about outcomes for treated younger males and untreated older women; consequently for such pairs of features and decisions.
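The selection rule and the two special cases in the remark can be sketched as follows. This is an illustration under stated assumptions: the limits are passed in as a plain dictionary, a non-informative limit is represented by `math.inf`, and returning `None` is merely one hypothetical way of signalling that the data is not sufficiently informative.

```python
import math
import random


def robust_decision(limits, rng=random):
    """Pick the decision with the smallest cost limit.

    limits: dict mapping each decision to its cost limit for the
            current features (math.inf if non-informative).
    Returns a decision, or None when every limit is non-informative.
    """
    best = min(limits.values())
    if math.isinf(best):
        # No decision has an informative limit for these features.
        return None
    # Break ties by drawing uniformly among the minimizers.
    minimizers = [x for x, limit in limits.items() if limit == best]
    return rng.choice(minimizers)


choice = robust_decision({0: 2.5, 1: 1.0})
abstain = robust_decision({0: math.inf, 1: math.inf})
tie = robust_decision({0: 1.0, 1: 1.0})
```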
3.1 Statistically valid limits
To construct feature-specific limits that satisfy the constraint in (5), we leverage recent results developed using the conformal prediction framework [22, 14, 1]. We begin by quantifying the divergence of a sample in (2) from those in , using the residual
where is any predictor of the cost fitted using . Then can be viewed as a random non-conformity score with a cdf and quantile
Result 1 (Finite-sample validity).
For a given level and context , construct a set of probability weights
for and define an empirical cdf for the residuals
where . Then
satisfies the probabilistic constraint in (5).
Computing requires a search of the maximum value in the set (10), which can be implemented efficiently using interval halving. Each evaluation point in the set, however, requires re-fitting to in (6). For an efficient computation of (10), we therefore consider the locally weighted average of costs, i.e.,
which is linear in . This choice then defines a policy in and is illustrated in Figures 2(a) and 2(b). Each decision of the policy can then be certified by a limit obtained by setting in (10) and the probability of exceeding the limit is bounded by . For the sake of clarity, the computation of is summarized in Algorithm 1.
An important property of (10) is that it is statistically valid also for highly uneven feature overlap. As approaches for a given , the probability weights in (8) concentrate so that in (9). Consequently, converges to so that the proposed robust policy avoids decisions in contexts for which there is little or no training data.
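The behaviour described above can be illustrated with a simplified weighted quantile computation. The sketch below is not Algorithm 1: it assumes a split-conformal simplification in the spirit of [1], in which the predictor is held fixed instead of being re-fitted, the scores are signed residuals (cost minus prediction), and the weights are supplied by the caller. All names are hypothetical.

```python
import math


def weighted_conformal_limit(prediction, residuals, train_weights,
                             test_weight, alpha):
    """Upper limit on the cost that holds with probability >= 1 - alpha
    under a weighted empirical distribution of residuals.

    The test point carries probability mass at +infinity, so the limit
    becomes infinite when the training points carry too little mass.
    """
    total = sum(train_weights) + test_weight
    target = (1.0 - alpha) * total
    cumulative = 0.0
    # Accumulate weight over residuals in increasing order.
    for residual, weight in sorted(zip(residuals, train_weights)):
        cumulative += weight
        if cumulative >= target:
            return prediction + residual
    return math.inf  # non-informative limit


residuals = [0.1, 0.5, 0.2, 0.9]
informative = weighted_conformal_limit(1.0, residuals, [1, 1, 1, 1], 1.0, 0.25)
uninformative = weighted_conformal_limit(1.0, residuals, [1, 1, 1, 1], 100.0, 0.25)
```

When the weight of the test point dominates, as in the second call, the training residuals cannot reach the required mass and the limit is infinite, which mirrors the case of little or no feature overlap.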
3.2 Unsupervised learning of weights
In randomized control trials, and other controlled experiments, the weights in (8) are given by a known past policy. In the general case, however, must be learned from training data. This is effectively an unsupervised learning problem, which therefore circumvents the need for specifying associative models of (regression) or (propensity score).
The categorical distribution of past decisions, , is readily modeled as using . The conditional feature distribution can in turn be modeled by a flexible generative model, e.g. Gaussian mixture models or multinoulli models. The accuracy of the learned generative model can then be assessed using model validation methods, e.g. . If the training data contains high-dimensional covariates, we propose constructing features using dimension-reduction methods, such as autoencoders [2, 12, 16, 20]. The weights in (8) are learned via and , and using .
Remark: If a validated propensity score model already exists, one can simply use the equivalent form .
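A minimal sketch of this step is given below. It assumes that the decision probabilities given the features are recovered from the two learned models through Bayes' rule, and it substitutes a single one-dimensional Gaussian per decision for the more flexible generative models mentioned above. The function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev


def fit_generative_model(decisions, features):
    """Fit relative frequencies of the decisions and one Gaussian
    feature model per decision (needs >= 2 samples per decision)."""
    n = len(decisions)
    prior = {x: count / n for x, count in Counter(decisions).items()}
    likelihood = {}
    for x in prior:
        zs = [z for d, z in zip(decisions, features) if d == x]
        likelihood[x] = NormalDist(mean(zs), stdev(zs))
    return prior, likelihood


def propensity(x, z, prior, likelihood):
    """Probability of decision x given feature z via Bayes' rule."""
    joint = {d: prior[d] * likelihood[d].pdf(z) for d in prior}
    return joint[x] / sum(joint.values())


# Toy data: decision 0 for younger and decision 1 for older subjects.
decisions = [0, 0, 0, 1, 1, 1]
ages = [30.0, 35.0, 40.0, 60.0, 65.0, 70.0]
prior, likelihood = fit_generative_model(decisions, ages)
p_mid = propensity(0, 50.0, prior, likelihood)
p_young = propensity(0, 35.0, prior, likelihood)
```

Neither model involves the outcome cost, which is why the step can be treated as unsupervised.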
4 Numerical experiments
We study the statistical properties of policies in the robust class , which we denote . To illustrate some key differences between a mean-optimal policy (3) and a robust policy, we first consider a well-specified scenario in which the mean-optimal policy belongs to a given class . Subsequently, we study a scenario with misspecified models using real training data.
4.1 Synthetic data
We consider a scenario in which patients are assigned treatments to reduce their blood pressure. We create a synthetic dataset, drawing data points from the training distribution (1) where features represent age and gender ( for females and for males). The feature distribution for the population of patients is specified as
The treatment decision is assigned based on a past policy which we specify by the probability
is the sigmoid function. See Figures 2(a) and 2(b) for an illustration. While the assignment mechanism is not necessarily realistic, we use it to illustrate the relevant case of uneven feature overlap. Finally, the change in blood pressure is drawn randomly as
where and . While the expected cost for the untreated group is lower than for the treated group, we consider the untreated patients to have more heterogeneous outcomes, so that the dispersion is higher. That is, while .
Since the past policy is unknown, we learn weights (8) for in an unsupervised manner, using , where is a misspecified Gaussian model and is a Bernoulli model. We let . As a baseline comparison, we consider minimizing the expected cost (3) for a linear policy class . Since is a linear function in , this is a well-specified scenario in which the mean-optimal policy belongs to . We fit a correct linear model of the conditional mean and denote the resulting policy by .
Figures 2(a) and 2(b) show the decision taken by the robust and mean-optimal policy, and , respectively, as a function of features . Note that (14) leads to a mean-optimal policy , since the expected cost for the untreated group is lower than that of the treated group. By contrast, the robust policy takes into account that the dispersion of costs is much higher for untreated patients and therefore assigns to male patients in the age span 41-54 years as well as all females in the observable age span. To reduce the risk of increased blood pressure at the specified level, it therefore opts for treatments more often. This is highlighted in Figure 2(c) which shows the cost distribution, using the complementary cdf , for the different policies. We see that the robust policy safeguards against large increases in blood pressure, where the quantile is smaller than that for the mean-optimal policy. Thus the robust strategy trades off a higher expected cost for a lower tail cost at the -level.
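As a rough numerical illustration of this trade-off, the sketch below compares two hypothetical Gaussian cost distributions. The means, standard deviations and level are made-up values chosen only to mimic the qualitative setting (lower mean but higher dispersion for the untreated group); they are not the parameters used in (14).

```python
from statistics import NormalDist

ALPHA = 0.1  # hypothetical tail level

# Hypothetical cost models: untreated has lower mean, higher dispersion.
cost_models = {
    "untreated": NormalDist(mu=-1.0, sigma=4.0),
    "treated": NormalDist(mu=0.0, sigma=1.0),
}


def tail_quantile(dist, alpha):
    """Cost level exceeded with probability alpha under dist."""
    return dist.inv_cdf(1.0 - alpha)


# Decision minimizing the expected cost.
mean_optimal = min(cost_models, key=lambda d: cost_models[d].mean)
# Decision minimizing the tail cost at level ALPHA.
tail_optimal = min(cost_models,
                   key=lambda d: tail_quantile(cost_models[d], ALPHA))
```

With these values the two criteria disagree: the expected cost favours leaving patients untreated, while the tail criterion favours treatment.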
An important feature of the proposed methodology is that each decision of the policy has an associated limit , such that the probability of exceeding it, , is bounded by . Figure 2(d) shows the estimated probability under the robust policy versus the target level . Despite the misspecification of the Gaussian model , the target provides an accurate limit for the actual probability.
4.2 Infant Health and Development Program data
Next, the properties of the proposed method are studied using real data. We use data from the Infant Health and Development Program (IHDP), which investigated the effect of personalized home visits and intensive high-quality child care on the health of low birth-weight and premature infants. The data for each child included a 25-dimensional covariate vector, containing information on birth weight, head circumference, gender etc., standardized to zero mean and unit standard deviation, as well as a decision indicating whether a child received special medical care or not. The outcome cost is a child’s cognitive underdevelopment score (simply a sign change of a development score).
The covariate distribution is unknown. The past policy, which we also treat as unknown, was in fact a randomized control experiment, so that was a constant. This policy was found to be successful in improving cognitive scores of the treated children as compared to those in the control group. To obtain outcome costs for either decision in , we generate synthetically by the nonlinear associative models following [8, 5]:
where we consider different dispersions below. Here is selected as described in  and  so that the effect of treatment on the treated is . The unknown parameter is a 25-dimensional vector of coefficients drawn randomly from with probabilities , respectively, as specified in . The IHDP data contains data points and we randomly select a subset of training points that form . The remaining points are used to evaluate learned policies.
To learn the weights (8) for the robust policy, we first reduce the 25-dimensional covariates into 4-dimensional features using an autoencoder [2, sec. 7.1]. Then is a learned Gaussian mixture model with four mixture components and is a learned Bernoulli model. Together the models define (8) and a robust policy is learned for the target probability . For comparison, we also consider a linear policy that aims to minimize the expected cost (3) using linear models of the conditional means. Note that such models are well-specified and misspecified for the treated and untreated outcomes in (15), respectively.
Figure 3 shows the cost distribution for the past and learned policies when the dispersions in (15) are equal or different. We see that in the cases of equal dispersion in Figure 3(a) and higher dispersion for the untreated group in Figure 3(c), both the robust and linear policies reduce the quantile of the cost as compared to that for the past policy, with the robust policy doing slightly better.
Since the treated group tends to have a lower mean cost than the untreated group in the training data, the linear policy tends to assign to most patients in the test data. Moreover, the misspecified linear model leads to biased estimates of the expected cost and the resulting policy cannot fully capture the non-linear partition of the feature space implied by the mean-optimal policy based on .
Figure 3(e) shows the cost distribution when the treatment outcome costs have higher dispersion. Given the tendency toward treatment assignment by the linear policy, this results in heavier tails for the cost distribution. By contrast, the robust policy adapts to a higher cost dispersion in the treated group and assigns fewer treatments, which results in smaller tail costs. In this case, the tail cost is similar to the past policy since its proportion of (random) treatment assignments is small in the data.
The robust methodology also provides a certificate for each decision, as illustrated in Figures 3(b), 3(d) and 3(f) with respect to two standardized covariates for each child in the test set. The probability that the cost exceeds is , estimated using Monte Carlo runs, which is close to and no greater than the targeted probability despite the model misspecification of .
5 Conclusions
We have developed a method for learning decision policies from observational data that lower the tail costs of decisions at a specified level. This is relevant in safety-critical applications. By building on recent results in conformal prediction, the method also provides a statistically valid bound on the cost of each decision. These properties are valid under finite samples and even in scenarios with highly uneven overlap between features for different decisions in the observed data. Using both real and synthetic data, we illustrated the statistical properties and performance of the proposed method.
6 Broader Impact
We believe the work presented herein can provide a useful tool for decision support, especially in safety-critical applications where it is of interest to reduce the risk of incurring high costs. The methodology can leverage large and heterogeneous data on past decisions, contexts and outcomes, to improve human decision making, while providing an interpretable statistical guarantee for its recommendations. It is important, however, to consider the population from which the training data is obtained and used. If the method is deployed in a setting with a different population it may indeed fail to provide cost-reducing decisions. Moreover, if there are categories of features that are sensitive and subject to unwarranted biases, the population may need to be split into appropriate subpopulations or else the biases can be reproduced in the learned policies.
-  Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
-  Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  Jeanne Brooks-Gunn, Fong-ruey Liaw, and Pamela Kato Klebanov. Effects of early intervention on cognitive function of low birth weight preterm infants. The Journal of pediatrics, 120(3):350–359, 1992.
-  Alexander D’Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.
-  Vincent Dorie. Non-parametrics for Causal Inference, 2016.
-  Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
-  Sheng Fu, Qinying He, Sanguo Zhang, and Yufeng Liu. Robust outcome weighted learning for optimal individualized treatment rules. Journal of biopharmaceutical statistics, 29(4):606–624, 2019.
-  Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
-  Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
-  Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. 2018.
-  Fredrik D Johansson, Dennis Wei, Michael Oberst, Tian Gao, Gabriel Brat, David Sontag, and Kush R Varshney. Characterization of overlap in observational studies. arXiv preprint arXiv:1907.04138, 2019.
-  Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
-  John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
-  Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
-  Andreas Lindholm, Dave Zachariah, Petre Stoica, and Thomas B Schön. Data consistency approach to model validation. IEEE Access, 7:59788–59796, 2019.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
-  Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. arXiv preprint arXiv:1907.09623, 2019.
-  Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
-  Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
-  A.A. Tsiatis, M. Davidian, S.T. Holloway, and E.B. Laber. Dynamic Treatment Regimes: Statistical Methods for Precision Medicine. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2019.
-  Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
-  Lan Wang, Yu Zhou, Rui Song, and Ben Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243–1254, 2018.
-  Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
-  Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.