 # Learning Robust Decision Policies from Observational Data

We address the problem of learning a decision policy from observational data of past decisions in contexts with features and associated outcomes. The past policy may be unknown and in safety-critical applications, such as medical decision support, it is of interest to learn robust policies that reduce the risk of outcomes with high costs. In this paper, we develop a method for learning policies that reduce tails of the cost distribution at a specified level and, moreover, provide a statistically valid bound on the cost of each decision. By building on recent results in conformal prediction, these properties are valid under finite samples, even in scenarios with uneven or no overlap between features for different decisions in the observed data. The performance and statistical properties of the proposed method are illustrated using both real and synthetic data.


## 1 Introduction

We consider data of discrete decisions $x \in \mathcal{X}$ taken in contexts with features $z \in \mathcal{Z}$. The outcome of each decision has an associated cost $y \in \mathcal{Y}$ (or equivalently, negative reward). For instance, we may obtain data from a hospital in which patients with features $z$ are given treatment $x$ to lower their blood pressure and $y$ denotes the change of pressure value. The observational data is drawn independently as follows

$$(x_i, y_i, z_i) \sim p(x, y, z) = p(z)\,\underbrace{p(x \mid z)}_{\text{past policy}}\,p(y \mid x, z), \quad i = 1, \dots, n \tag{1}$$

where we have used a causal factorization of the unknown data-generating process. The distribution of contexts is described by $p(z)$, and $p(x \mid z)$ summarizes a decision policy which is generally unknown.

Using the $n$ training data points, $\mathcal{D} = \{(x_i, y_i, z_i)\}_{i=1}^{n}$, our goal is to automatically improve upon the past policy. That is, learn a new policy, which is a mapping from features to decisions

$$\pi(z) : \mathcal{Z} \to \mathcal{X},$$

such that the outcome cost will tend to be lower than in the past. This policy partitions the feature space into disjoint regions. A sample from the resulting data generating process can then be expressed as

$$(x, y, z) \sim p_\pi(x, y, z) = p(z)\,\mathbb{1}\{x = \pi(z)\}\,p(y \mid x, z), \tag{2}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function. In the treatment regime literature [17, 24, 25, 7, 21], the conventional aim is to minimize the expected cost, viz.

$$\min_{\pi \in \Pi} \; \mathbb{E}_\pi[y], \quad \text{where} \quad \mathbb{E}_\pi[y] \equiv \mathbb{E}\Big[\sum_{x \in \mathcal{X}} \mathbb{E}[y \mid x, z]\,\mathbb{1}\{x = \pi(z)\}\Big] \equiv \sum_{x \in \mathcal{X}} \mathbb{E}\Big[\frac{\mathbb{1}\{x = \pi(z)\}}{p(x \mid z)}\, y\Big], \tag{3}$$

where the last identity follows if features overlap across decisions so that $p(x \mid z) > 0$. The optimal policy for this problem is $\pi(z) = \arg\min_{x \in \mathcal{X}} \mathbb{E}[y \mid x, z]$ and is determined by the unknown training distribution (1). Thus a policy must be learned from training samples, where a fundamental source of uncertainty about outcomes is uneven feature overlap across decisions [4, 11] (see Fig. 0(a) for an illustration). Eq. (3) is equivalent to an off-policy learning problem in contextual bandit settings using logged data [13, 6, 19, 10, 18], but where the past policy is unknown.
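To make the importance-weighted form of (3) concrete, the following is a minimal sketch that estimates the expected cost of a candidate policy from logged data. It assumes, purely for illustration, that the past policy $p(x \mid z)$ is known; the toy data, the function names `ipw_expected_cost`, `past_propensity` and `always_treat` are hypothetical and not part of the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ipw_expected_cost(x, y, z, policy, propensity):
    """Sample version of the last expression in (3).

    policy(z) returns one decision per context; propensity(x, z) returns
    p(x | z) and must be strictly positive (the overlap condition).
    Both callables are hypothetical stand-ins.
    """
    match = (policy(z) == x).astype(float)      # 1{x_i = pi(z_i)}
    return np.mean(match * y / propensity(x, z))

# Toy logged data with an assumed past policy p(x=1 | z) = sigmoid(z).
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
x = (rng.random(n) < sigmoid(z)).astype(int)
y = 1.0 * x + rng.normal(size=n)                # E[y|x=1] = 1, E[y|x=0] = 0

def past_propensity(x, z):
    return np.where(x == 1, sigmoid(z), 1.0 - sigmoid(z))

def always_treat(z):
    return np.ones(z.shape[0], dtype=int)

estimate = ipw_expected_cost(x, y, z, always_treat, past_propensity)
print(f"estimated expected cost of always treating: {estimate:.3f}")
```

In this toy setting the estimate should land near 1, the true mean cost under treatment. When $p(x \mid z)$ is close to zero for some contexts, the ratio inside the mean becomes very large, which is the source of the high variance under uneven overlap.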

A common approach is to learn a regression model of $\mathbb{E}[y \mid x, z]$, which in the case of binary decisions and linear models restricts the class of policies to linear threshold rules in $z$. To avoid the sensitivity to regression model misspecification, an alternative approach is to learn a model of $p(x \mid z)$ and then approximately solve (3) by numerical search over a restricted parametric class of policies $\Pi$. In scenarios with highly uneven feature overlap, however, this approach leads to high-variance estimates of $\mathbb{E}_\pi[y]$.

Reliably estimating the expected cost of a policy would yield an important performance certificate in safety-critical applications. In such applications, however, reducing the prevalence of high-cost outcomes is a more robust strategy than reducing the expected cost, even when such tail events have low probability; see Figure 0(b) for an illustration. This is especially relevant when the conditional distribution of outcome costs $p(y \mid x, z)$ is skewed or has a dispersion that varies with $x$.

In this paper, we develop a method for learning a robust policy that

• targets the reduction of the tail of the cost distribution $p_\pi(y)$, rather than $\mathbb{E}_\pi[y]$,

• provides a statistically valid limit $y_\alpha$ for each decision,

• is operational even when there is little feature overlap.

Moreover, when the past policy is unknown, the robust policy can be learned using unsupervised techniques, which obviates the need to specify associative models $p(y \mid x, z)$ and/or $p(x \mid z)$. The method is demonstrated using both real and synthetic data.

## 2 Problem formulation

We consider a policy to be robust if it can reduce the tail costs at a specified level $\alpha$ as compared to the past policy, even for finite $n$ and highly uneven feature overlap. We define the $\alpha$-tail as all $y \geq y_\alpha$ for which the probability $P_\pi\{y \leq y_\alpha\}$ is greater than or equal to $1 - \alpha$. An optimal robust policy therefore minimizes the $(1-\alpha)$-quantile of the cost, viz. a solution to

$$\min_{\pi \in \Pi,\; y_\alpha} \; y_\alpha \quad \text{subject to} \quad P_\pi\{y \leq y_\alpha\} \geq 1 - \alpha. \tag{4}$$

Since a learned policy is a function of the training data $\mathcal{D}$, the probability is also defined over all $n$ i.i.d. training points.

The problem we consider is to learn a policy in a class $\Pi_\alpha$ that approximately solves (4) and certifies each decision by a limit $y_\alpha$ that holds with a probability of at least $1 - \alpha$ for finite $n$ and highly uneven feature overlap.
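The difference between a mean criterion and the quantile criterion in (4) can be seen in a small numerical sketch. The two cost distributions below are hypothetical and chosen only for illustration, as is the level `alpha = 0.1`; the point is that the policy with the lower mean cost may have the larger tail cost.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1                       # assumed tail level
n = 50_000

# Hypothetical costs under two policies (illustrative numbers only).
cost_a = rng.normal(loc=-1.0, scale=3.0, size=n)   # lower mean, wide spread
cost_b = rng.normal(loc=0.0, scale=1.0, size=n)    # higher mean, narrow spread

def tail_cost(costs, alpha):
    """Empirical (1 - alpha)-quantile of the cost, cf. the objective in (4)."""
    return np.quantile(costs, 1.0 - alpha)

for name, c in [("A", cost_a), ("B", cost_b)]:
    print(name, "mean:", round(c.mean(), 2), "tail:", round(tail_cost(c, alpha), 2))
```

A mean-optimal rule would prefer policy A here, whereas a rule that minimizes the $(1-\alpha)$-quantile would prefer policy B.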

## 3 Learning Method

Since the cumulative distribution function (cdf) in (4) is unknown for a given policy, it is a challenging task to find the minimum $y_\alpha$ which satisfies the constraint. We propose to restrict the policies to a class $\Pi_\alpha$, constructed as follows: Suppose there exists a feature-specific limit $y_\alpha(x, z)$ for a given decision $x$, such that $P\{y \leq y_\alpha(x, z)\}$ is no less than $1 - \alpha$. Then we define $\Pi_\alpha$ as all policies that select $x$ with the minimum cost limit at the specified level $\alpha$. That is, a class of robust policies

$$\Pi_\alpha = \Big\{ \pi : \pi(z) \in \arg\min_{x \in \mathcal{X}} y_\alpha(x, z), \;\; P\{y \leq y_\alpha(x, z)\} \geq 1 - \alpha \;\; \forall x \in \mathcal{X} \Big\}. \tag{5}$$

Learning a policy in $\Pi_\alpha$ therefore amounts to using $\mathcal{D}$ to learn a set of functions $\{y_\alpha(x, z)\}_{x \in \mathcal{X}}$ that satisfy the constraints. Figure 2 illustrates $y_\alpha(x, z)$ constructed using the method described below, for a binary decision variable $x \in \{0, 1\}$.

Remark: If there is a tie among $y_\alpha(x, z)$, the policy can randomly draw from the minimizers. If the limits are non-informative, in the sense that $y_\alpha(x, z)$ equals the largest cost in $\mathcal{Y}$ for every decision, the method will indicate that the data is not sufficiently informative for reliable cost-reducing decisions. See Figure 2 for regions in feature space where there is no data about outcomes for treated younger males and untreated older females; consequently the limit is non-informative for such pairs of features and decisions.
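The selection rule and the remark above can be sketched as follows. The function takes the limits for all decisions at one fixed context and returns the chosen decision together with its limit. The name `robust_decision`, the `y_max` argument marking a non-informative limit, and the convention of returning `None` are assumptions made for this illustration.

```python
import numpy as np

def robust_decision(limits, rng, y_max=np.inf):
    """Pick the decision with the smallest cost limit at a fixed context z.

    limits[x] holds the limit for decision x. If even the best limit is
    non-informative (>= y_max), no decision is recommended (None).
    """
    limits = np.asarray(limits, dtype=float)
    best = limits.min()
    if best >= y_max:
        return None, best                        # data too sparse to certify
    minimizers = np.flatnonzero(limits == best)  # all tied minimizers
    return int(rng.choice(minimizers)), best     # break ties at random

rng = np.random.default_rng(0)
print(robust_decision([3.0, 1.5], rng))          # decision 1 has the lower limit
print(robust_decision([np.inf, np.inf], rng))    # non-informative limits
```

Returning an explicit "no recommendation" value is one possible design choice; a deployment could instead defer to the past policy in such contexts.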

### 3.1 Statistically valid limits

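To construct feature-specific limits that satisfy the constraint in (5), we leverage recent results developed using the conformal prediction framework [22, 14, 1]. We begin by quantifying the divergence of a sample $(x, y, z)$ in (2) from those in $\mathcal{D}$, using the residual

$$s(x, y, z) = |y - \mu(x, y, z)| \geq 0, \tag{6}$$

where $\mu$ is any predictor of the cost fitted using $\mathcal{D} \cup \{(x, y, z)\}$. Then $s$ can be viewed as a random non-conformity score with a cdf $F$ and quantile

$$s_{1-\alpha}(F) = \inf\{s : F(s) \geq 1 - \alpha\}. \tag{7}$$

###### Result 1 (Finite-sample validity).

For a given level $\alpha$ and context $z$, construct a set of probability weights

$$p_k(x_i, z_i) \triangleq \frac{w_k(x_i, z_i)}{\sum_{j=1}^{n} w_k(x_j, z_j) + w_k(x, z)}, \quad \text{where} \quad w_k(x, z) \triangleq \mathbb{1}\{x = k\}\,\frac{p(z)}{p(z \mid x)\,p(x)}, \tag{8}$$

for $k \in \mathcal{X}$, and define an empirical cdf for the residuals

$$\widehat{F}_x(s) = \sum_{i=1}^{n} p_x(x_i, z_i)\,\mathbb{1}\{s \geq s_i\} + p_x(x, z)\,\mathbb{1}\{s \geq s(x, y, z)\}, \tag{9}$$

where $s_i = s(x_i, y_i, z_i)$. Then

$$y_\alpha(x, z) \triangleq \max\{y \in \mathcal{Y} : s(x, y, z) \leq s_{1-\alpha}(\widehat{F}_x)\} \tag{10}$$

satisfies the probabilistic constraint in (5).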

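###### Proof.

For a fixed decision $k$, the distribution of a sample in (2) can be expressed as $p(z)\,\mathbb{1}\{x = k\}\,p(y \mid x, z) = w_k(x, z)\,p(x, y, z)$, where $w_k(x, z)$ is given in (8). It then follows from [1, corr. 1] that the set in (10) will cover $y$ with a probability of at least $1 - \alpha$. ∎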
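The weighted empirical cdf in (9) and its quantile (7) reduce to a sort and a cumulative sum. The sketch below assumes the unnormalized weights in (8) are already available as arrays; the helper names `normalized_weights` and `weighted_quantile` are hypothetical, and a small tolerance is subtracted from the level to guard against floating-point rounding in the cumulative sum.

```python
import numpy as np

def normalized_weights(w_train, w_test):
    """Probability weights in (8): training weights and the new-sample weight."""
    w_train = np.asarray(w_train, dtype=float)
    total = w_train.sum() + w_test
    return w_train / total, w_test / total

def weighted_quantile(scores, probs, level):
    """inf{s : F(s) >= level} for the weighted empirical cdf in (9)."""
    scores = np.asarray(scores, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(scores)
    cdf = np.cumsum(probs[order])
    idx = np.searchsorted(cdf, level - 1e-12, side="left")
    return scores[order][min(idx, len(scores) - 1)]

# Three training residuals plus the residual of the new sample (last entry).
p_train, p_test = normalized_weights(np.ones(3), 1.0)
scores = np.array([1.0, 2.0, 3.0, 4.0])
probs = np.append(p_train, p_test)
print(weighted_quantile(scores, probs, 0.5))   # 2.0
print(weighted_quantile(scores, probs, 0.9))   # 4.0
```

With equal weights this is the ordinary empirical quantile; unequal weights shift probability mass toward training points that resemble samples drawn under the decision of interest.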

Computing $y_\alpha(x, z)$ requires a search for the maximum value in the set (10), which can be implemented efficiently using interval halving. Each evaluation point in the set, however, requires re-fitting $\mu$ to $\mathcal{D} \cup \{(x, y, z)\}$ in (6). For an efficient computation of (10), we therefore consider the locally weighted average of costs, i.e.,

$$\mu(x, y, z) = \sum_{i=1}^{n} p_x(x_i, z_i)\,y_i + p_x(x, z)\,y, \tag{11}$$

which is linear in $y$. This choice then defines a policy $\pi_\alpha$ in $\Pi_\alpha$ and is illustrated in Figures 2(a) and 2(b). Each decision of the policy can then be certified by a limit $y_\alpha$, obtained by setting $x = \pi_\alpha(z)$ in (10), and the probability of exceeding the limit is bounded by $\alpha$. For the sake of clarity, the computation of $y_\alpha(x, z)$ is summarized in Algorithm 1.

An important property of (10) is that it is statistically valid also for highly uneven feature overlap. As $p(x \mid z)$ approaches $0$ for a given $z$, the probability weights in (8) concentrate on the new sample so that $\widehat{F}_x(s) \to \mathbb{1}\{s \geq s(x, y, z)\}$ in (9). Consequently, $y_\alpha(x, z)$ converges to the largest cost in $\mathcal{Y}$, so that the proposed robust policy avoids decisions in contexts for which there is little or no training data.
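A minimal sketch of the limit computation, combining (8) to (11) with interval halving, is given below. It is not the paper's Algorithm 1: it assumes a finite upper bound `y_max` on the cost, assumes that candidate costs stop conforming at a single crossing point so that bisection applies, and uses hypothetical names (`cost_limit`, `w_train`, `w_test`). Training points with a different decision are expected to carry zero weight.

```python
import numpy as np

def weighted_quantile(scores, probs, level):
    """inf{s : F(s) >= level} for a weighted empirical cdf."""
    order = np.argsort(scores)
    cdf = np.cumsum(probs[order])
    idx = np.searchsorted(cdf, level - 1e-12, side="left")
    return scores[order][min(idx, len(scores) - 1)]

def cost_limit(y_train, w_train, w_test, alpha, y_max, tol=1e-6):
    """Approximate the limit in (10) for one decision and one context.

    y_train : observed costs; w_train : unnormalized weights (8), zero for
    training points whose decision differs; w_test : weight of the new sample.
    """
    total = w_train.sum() + w_test
    p, p0 = w_train / total, w_test / total
    m = np.dot(p, y_train)                     # training part of (11)
    probs = np.append(p, p0)

    def conforms(y):
        mu = m + p0 * y                        # locally weighted average (11)
        scores = np.append(np.abs(y_train - mu), abs(y - mu))
        return abs(y - mu) <= weighted_quantile(scores, probs, 1.0 - alpha)

    if conforms(y_max):                        # little or no overlap
        return y_max
    lo, hi = m / (1.0 - p0), y_max             # residual is zero at lo
    while hi - lo > tol:                       # interval halving
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if conforms(mid) else (lo, mid)
    return lo

rng = np.random.default_rng(0)
y_train = rng.normal(size=2000)
limit = cost_limit(y_train, np.ones(2000), 1.0, alpha=0.1, y_max=100.0)
sparse = cost_limit(y_train, np.ones(2000), 1e6, alpha=0.1, y_max=100.0)
print(limit, sparse)
```

In the first call all weights are equal, so the limit is roughly the mean cost plus the 0.9-quantile of the absolute residuals. In the second call the new sample dominates the weights, mimicking a context with almost no training data for the decision, and the limit reverts to `y_max`.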

### 3.2 Unsupervised learning of weights

In randomized control trials, and other controlled experiments, the weights in (8) are given by a known past policy. In the general case, however, the weights must be learned from training data. This is effectively an unsupervised learning problem, which therefore circumvents the need for specifying associative models of $p(y \mid x, z)$ (regression) or $p(x \mid z)$ (propensity score).

The categorical distribution of past decisions, $p(x)$, is readily modeled as $\widehat{p}(x)$ using $\mathcal{D}$. The conditional feature distribution $p(z \mid x)$ can in turn be modelled by a flexible generative model $\widehat{p}(z \mid x)$, e.g. Gaussian mixture models or multinoulli models. The accuracy of the learned generative model can then be assessed using model validation methods. If the training data contains high-dimensional covariates $\tilde{z}$, we propose constructing features $z$ using dimension-reduction methods, such as autoencoders [2, 12, 16, 20]. The weights in (8) are learned via $\widehat{p}(x)$ and $\widehat{p}(z \mid x)$, and using $\widehat{p}(z) = \sum_{x \in \mathcal{X}} \widehat{p}(z \mid x)\,\widehat{p}(x)$.

Remark: If a validated propensity score model already exists, one can simply use the equivalent form $w_k(x, z) = \mathbb{1}\{x = k\} / p(x \mid z)$.
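The weight construction described above can be sketched with a single Gaussian per decision for the feature model and empirical frequencies for the decisions. This is an illustrative simplification: a mixture model would replace `gaussian_pdf`, and the function names and toy data are hypothetical.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of z."""
    diff = z - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (quad + logdet + mean.size * np.log(2 * np.pi)))

def fit_weight_model(x, z):
    """Fit p(x) by frequencies and p(z | x) by one Gaussian per decision."""
    model = {}
    for k in np.unique(x):
        zk = z[x == k]
        cov = np.atleast_2d(np.cov(zk, rowvar=False))
        model[int(k)] = (np.mean(x == k), zk.mean(axis=0), cov)
    return model

def weight(model, k, x, z):
    """Unnormalized weights w_k(x, z) in (8) using the fitted models."""
    p_z = sum(pk * gaussian_pdf(z, mean, cov) for pk, mean, cov in model.values())
    p_k, mean, cov = model[k]
    return (x == k) * p_z / (gaussian_pdf(z, mean, cov) * p_k)

# Toy data: two-dimensional features whose location depends on the decision.
rng = np.random.default_rng(0)
n = 1000
x = (rng.random(n) < 0.3).astype(int)
z = rng.normal(size=(n, 2)) + 1.5 * x[:, None]
model = fit_weight_model(x, z)
w0 = weight(model, 0, np.zeros(n, dtype=int), z)
w1 = weight(model, 1, np.ones(n, dtype=int), z)
print(np.allclose(1.0 / w0 + 1.0 / w1, 1.0))
```

The final check reflects the remark above: each weight is the reciprocal of the implied propensity $\widehat{p}(x \mid z)$, and these propensities sum to one over the decisions.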

## 4 Numerical experiments

We study the statistical properties of policies in the robust class $\Pi_\alpha$, which we denote $\pi_\alpha$. To illustrate some key differences between a mean-optimal policy (3) and a robust policy, we first consider a well-specified scenario in which the mean-optimal policy belongs to a given class $\Pi$. Subsequently, we study a scenario with misspecified models using real training data.

### 4.1 Synthetic data

We consider a scenario in which patients are assigned treatments to reduce their blood pressure. We create a synthetic dataset, drawing $n$ data points from the training distribution (1), where features $z = [z_1, z_2]^\top$ represent age ($z_1$) and gender (encoded by the binary indicator $z_2 \in \{0, 1\}$). The feature distribution for the population of patients is specified as

$$p(z_1 \mid z_2) = z_2 \times \mathcal{N}(30,\, 5) + (1 - z_2) \times \mathcal{N}(45,\, 5) \quad \text{and} \quad p(z_2) \equiv 0.5. \tag{12}$$

The treatment decision is assigned based on a past policy which we specify by the probability

$$p(x = 1 \mid z) = z_2 \times 0.95\, f\Big(-\frac{z_1 - 20}{6}\Big) + (1 - z_2) \times 0.20\, f\Big(-\frac{z_1 - 45}{2}\Big), \tag{13}$$

where $f$ is the sigmoid function. See Figures 2(a) and 2(b) for an illustration. While the assignment mechanism is not necessarily realistic, we use it to illustrate the relevant case of uneven feature overlap. Finally, the change in blood pressure is drawn randomly as

$$p(y \mid x, z) = x \times \mathcal{N}(e_1^\top z - 45,\, \sigma_1^2) + (1 - x) \times \mathcal{N}(e_1^\top z - 46,\, \sigma_0^2), \tag{14}$$

where $e_1 = [1, 0]^\top$, so that $e_1^\top z = z_1$. While the expected cost for the untreated group is lower than for the treated group, we consider the untreated patients to have more heterogeneous outcomes, so that the dispersion is higher. That is, $\sigma_0 > \sigma_1$ while $\mathbb{E}[y \mid x = 0, z] < \mathbb{E}[y \mid x = 1, z]$.
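A sampler for this synthetic scenario could look as follows. Several details are assumptions rather than values taken from the paper: the second argument of the normal distributions in (12) is read as a standard deviation, the sigmoid arguments in (13) are read as $-(z_1 - 20)/6$ and $-(z_1 - 45)/2$, and the dispersions `sigma0` and `sigma1` are placeholder values chosen only to satisfy $\sigma_0 > \sigma_1$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_synthetic(n, rng, sigma0=5.0, sigma1=1.0):
    """Draw (x, y, z) following (12)-(14) under the assumptions stated above."""
    z2 = rng.integers(0, 2, size=n)                         # p(z2 = 1) = 0.5
    z1 = np.where(z2 == 1, rng.normal(30, 5, n), rng.normal(45, 5, n))
    p_treat = np.where(
        z2 == 1,
        0.95 * sigmoid(-(z1 - 20) / 6),                     # assumed reading of (13)
        0.20 * sigmoid(-(z1 - 45) / 2),
    )
    x = (rng.random(n) < p_treat).astype(int)               # past policy
    y = np.where(
        x == 1,
        rng.normal(z1 - 45, sigma1),                        # treated: narrow spread
        rng.normal(z1 - 46, sigma0),                        # untreated: wide spread
    )
    return x, y, np.column_stack([z1, z2])

rng = np.random.default_rng(0)
x, y, z = sample_synthetic(5000, rng)
print("fraction treated:", x.mean())
print("treated share by gender indicator:",
      x[z[:, 1] == 1].mean(), x[z[:, 1] == 0].mean())
```

Inspecting the treated share per group and age range gives a quick view of how uneven the overlap is under the assumed past policy.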

Since the past policy is unknown, we learn weights (8) for $x \in \{0, 1\}$ in an unsupervised manner, using $\widehat{p}(z) = \sum_{x} \widehat{p}(z \mid x)\,\widehat{p}(x)$, where $\widehat{p}(z \mid x)$ is a misspecified Gaussian model and $\widehat{p}(x)$ is a Bernoulli model. As a baseline comparison, we consider minimizing the expected cost (3) for a linear policy class. Since $\mathbb{E}[y \mid x, z]$ is a linear function in $z$, this is a well-specified scenario in which the mean-optimal policy belongs to the linear class. We fit a correct linear model of the conditional mean and refer to the resulting policy as the mean-optimal policy.

Figures 2(a) and 2(b) show the decisions taken by the robust and mean-optimal policies, respectively, as a function of features $z$. Note that (14) leads to a mean-optimal policy $\pi(z) \equiv 0$, since the expected cost for the untreated group is lower than that of the treated group. By contrast, the robust policy takes into account that the dispersion of costs is much higher for untreated patients and therefore assigns $x = 1$ to male patients in the age span 41-54 years as well as all females in the observable age span. To reduce the risk of increased blood pressure at the specified level, it therefore opts for treatments more often. This is highlighted in Figure 2(c), which shows the cost distribution, using the complementary cdf of $y$, for the different policies. We see that the robust policy safeguards against large increases in blood pressure, where the $(1-\alpha)$-quantile is smaller than that for the mean-optimal policy. Thus the robust strategy trades off a higher expected cost for a lower tail cost at the $\alpha$-level.

An important feature of the proposed methodology is that each decision $x$ of the policy has an associated limit $y_\alpha(x, z)$, such that the probability of exceeding it, $P_\pi\{y > y_\alpha(x, z)\}$, is bounded by $\alpha$. Figure 2(d) shows the estimated probability under the robust policy versus the target level $\alpha$. Despite the misspecification of the Gaussian model $\widehat{p}(z \mid x)$, the target provides an accurate limit for the actual probability.

### 4.2 Infant Health and Development Program data

Next, the properties of the proposed method are studied using real data. We use data from the Infant Health and Development Program (IHDP), which investigated the effect of personalized home visits and intensive high-quality child care on the health of low birth-weight and premature infants. The data for each child included a 25-dimensional covariate vector $\tilde{z}$, containing information on birth weight, head circumference, gender etc., standardized to zero mean and unit standard deviation, as well as a decision $x \in \{0, 1\}$ indicating whether a child received special medical care or not. The outcome cost $y$ is a child's cognitive underdevelopment score (simply a sign change of a development score).

The covariate distribution $p(\tilde{z})$ is unknown. The past policy, which we also treat as unknown, was in fact a randomized control experiment, so that $p(x \mid \tilde{z})$ was a constant. This policy was found to be successful in improving cognitive scores of the treated children as compared to those in the control group. To obtain outcome costs for either decision in $\{0, 1\}$, we generate $y$ synthetically by the nonlinear associative models following [8, 5]:

$$y \mid x = 0, \tilde{z} \sim \mathcal{N}\big(-\exp[(\tilde{z} + 0.5\,\mathbf{1})^\top \beta],\, \sigma_0\big) \quad \text{and} \quad y \mid x = 1, \tilde{z} \sim \mathcal{N}\big(-\tilde{z}^\top \beta - \omega,\, \sigma_1\big), \tag{15}$$

where we consider different dispersions $\sigma_0$ and $\sigma_1$ below. Here $\omega$ is selected so that the effect of treatment on the treated takes a prescribed value. The unknown parameter $\beta$ is a 25-dimensional vector of coefficients drawn randomly from a fixed set of values with prescribed probabilities. We randomly select a subset of the IHDP data points as training points that form $\mathcal{D}$. The remaining points are used to evaluate learned policies.

To learn the weights (8) for the robust policy, we first reduce the 25-dimensional covariates $\tilde{z}$ into 4-dimensional features $z$ using an autoencoder [2, sec. 7.1]. Then $\widehat{p}(z \mid x)$ is a learned Gaussian mixture model with four mixture components and $\widehat{p}(x)$ is a learned Bernoulli model. Together the models define (8) and a robust policy is learned for the target probability $\alpha$. For comparison, we also consider a linear policy that aims to minimize the expected cost (3) using linear models of the conditional means. Note that such models are well-specified and misspecified for the treated and untreated outcomes in (15), respectively.

Figure 4 shows the cost distribution for the past and learned policies when the dispersions in (15) are equal or different. We see that in the cases of equal dispersion in Figure 3(a) and higher dispersion for untreated in Figure 3(c), both the robust and linear policies reduce the $(1-\alpha)$-quantile of the cost as compared to that for the past policy, where the robust policy does slightly better.

Since the treated group tends to have a lower mean cost than the untreated group in the training data, the linear policy tends to assign $x = 1$ to most patients in the test data. Moreover, the misspecified linear model leads to biased estimates of the expected cost and the resulting policy cannot fully capture the non-linear partition of the feature space implied by the mean-optimal policy based on (15).

Figure 3(e) shows the cost distribution when the treatment outcome costs have higher dispersion. Given the tendency toward treatment assignment by the linear policy, this results in heavier tails for the cost distribution. By contrast, the robust policy adapts to a higher cost dispersion in the treated group and assigns fewer treatments, which results in smaller tail costs. In this case, the tail cost is similar to the past policy since its proportion of (random) treatment assignments is small in the data.

The robust methodology also provides a certificate for each decision, as illustrated in Figures 3(b), 3(d) and 3(f) with respect to two standardized covariates for each child in the test set. The probability that the cost exceeds $y_\alpha$, estimated using Monte Carlo runs, is close to and no greater than the targeted probability $\alpha$ despite the model misspecification of $\widehat{p}(z \mid x)$.

## 5 Conclusion

We have developed a method for learning decision policies from observational data that lower the tail costs of decisions at a specified level. This is relevant in safety-critical applications. By building on recent results in conformal prediction, the method also provides a statistically valid bound on the cost of each decision. These properties are valid under finite samples and even in scenarios with highly uneven overlap between features for different decisions in the observed data. Using both real and synthetic data, we illustrated the statistical properties and performance of the proposed method.