## 1 Introduction

The problem of learning treatment assignment policies, or mappings from individual characteristics to treatment assignments, is ubiquitous in applied economics and statistics.^{†} It arises, for example, in medicine when a doctor must decide which patients to refer for a risky surgery; in marketing when a company needs to choose which customers to send targeted offers to; and in government and policy settings, when assigning students to educational programs or inspectors to buildings and restaurants.

^{†} We are grateful for helpful conversations with colleagues including Victor Chernozhukov, David Hirshberg, Guido Imbens, Michael Kosorok, Alexander Luedtke, Eric Mbakop, Whitney Newey, Xinkun Nie, Alexander Rakhlin, James Robins, Max Tabord-Meehan and Zhengyuan Zhou, and for feedback from seminar participants at a variety of universities and workshops. We also thank Guido Imbens for sharing the California GAIN dataset with us. Generous financial support was provided by the Sloan Foundation, by the Office of Naval Research grant N00014-17-1-2131, and a Facebook Faculty Award.

The treatment assignment problem rarely arises in an unconstrained environment. Treatments are often expensive, and so a policy may need to respect budget constraints. Policies may need to be implemented in environments characterized by human or machine constraints; for example, emergency medical professionals or police officers may need to implement decision policies in the field, where a simple decision tree might be used. For internet or mobile services, algorithms may need to determine the set of information displayed to a user very quickly, and a simple lookup table may decrease the time it takes to respond to a user’s request. Fairness constraints may require a treatment assignment policy to depend only on particular types of covariates (for example, test scores or income), even when other covariates are observed.

This paper is about using observational data to learn policies that respect the types of constraints outlined above.
The existing literature on policy learning has mostly focused on settings where we want to optimize allocation of a binary treatment using data from a randomized trial, or from a study with a known, random treatment assignment policy. In many problems, however, one may need to leverage richer forms of observational data to learn treatment assignment rules. For example, if we want to learn whom to prescribe a drug to based on data from a clinical trial, we need to have methods that deal with non-compliance and resulting endogenous treatment assignments.^{1} Or, if we are interested in offering some customers discounts, then we need methods that let us study interventions to continuous variables (e.g., price) rather than just discrete ones. The goal of this paper is to develop methods for policy learning that don’t just work in randomized trials (or related settings), but can instead work with a rich variety of observational designs.

^{1} If we believed that compliance patterns when we deploy our policy would be similar to those in the clinical trial, then an intent-to-treat analysis may be a reasonable way to side-step endogeneity concerns. However, if we suspect that compliance patterns may change (e.g., if patients may be more likely to adhere to a treatment regime prescribed by their doctor than one randomly assigned in a clinical trial), then using an analysis that disambiguates received treatment from assigned treatment is necessary.

Formally, we study the problem where we have access to observational data and want to use it to learn a policy $\pi$ that maps a subject’s characteristics $X_i \in \mathcal{X}$ to a binary decision, $\pi : \mathcal{X} \to \{0, 1\}$. The practitioner has also specified a class $\Pi$ that encodes problem-specific constraints pertaining to budget, functional form, fairness, etc., and requires that our learned policy $\hat{\pi}$ satisfies these constraints, $\hat{\pi} \in \Pi$. Then, following Manski (2004, 2009) and, more recently, Hirano and Porter (2009), Kitagawa and Tetenov (2018) and Stoye (2009, 2012), we seek guarantees on the regret $R(\hat{\pi})$, i.e., the difference between the expected utility from deploying the learned policy $\hat{\pi}$ and the best utility that could be achieved from deploying any policy in the class $\Pi$.

Our paper builds on a rich literature at the intersection of econometrics, statistics and computer science on learning structured treatment assignment rules, including Kitagawa and Tetenov (2018), Swaminathan and Joachims (2015) and Zhao, Zeng, Rush, and Kosorok (2012). Most closely related to us, Kitagawa and Tetenov (2018) study a special case of our problem where treatments are binary and exogenous with known assignment probabilities, and show that an algorithm based on inverse-probability weighting achieves regret that depends optimally on the sample size and the complexity of the policy class $\Pi$.^{2}

^{2} Kitagawa and Tetenov (2018) also consider the case where treatment assignment probabilities are unknown; in this case, however, their method no longer achieves optimal dependence on the sample size.

Here, we develop a new family of algorithms that achieve regret guarantees with optimal dependence on the sample size and on the complexity of $\Pi$, but in considerably more generality with respect to the sampling design. We consider both the classical case where we want to optimize a binary treatment, and a related setting where we want to optimize infinitesimal nudges to a continuous treatment (e.g., a price). Moreover, our approach can leverage observational data where the treatment assignment mechanism may either be exogenous with unknown assignment probabilities, or endogenous, in which case we require an instrument.

Our approach starts from recent unifying results of Chernozhukov, Escanciano, Ichimura, Newey, and Robins (2018b) on semiparametrically efficient estimation. As discussed in more detail in Section 2, Chernozhukov et al. (2018b) show that in many problems of interest, we can construct efficient estimates of average-treatment-effect-like parameters as

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_i, \tag{1}$$

where $\hat{\Gamma}_i$ is an appropriate doubly robust score for the intervention of interest. This approach can be used to estimate the average effect of a binary treatment, the average derivative of a continuous treatment, or other related estimands.

In this paper we find that, whenever one can estimate the average utility of treating everyone using an estimator of the type (1) built via the doubly robust construction of Chernozhukov et al. (2018b), we can also efficiently learn whom to target with the intervention via a simple procedure: Given a policy class $\Pi$, we propose using the treatment assignment rule $\hat{\pi} \in \Pi$ that solves

$$\hat{\pi} = \operatorname{argmax} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left(2\pi(X_i) - 1\right) \hat{\Gamma}_i : \pi \in \Pi \right\}, \tag{2}$$

where $\hat{\Gamma}_i$ are the same doubly robust scores as used in (1). Our main result is that, under regularity conditions, the resulting policies $\hat{\pi}$ have regret $R(\hat{\pi})$ bounded on the order of $\sqrt{\mathrm{VC}(\Pi)/n}$ with high probability. Here, $\mathrm{VC}(\Pi)$ is the Vapnik-Chervonenkis dimension of the class $\Pi$ and $n$ is the sample size. We also highlight how the constants in this bound depend on fundamental quantities from the semiparametric efficiency literature.
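To make the optimization step concrete, the following Python sketch searches a small, hypothetical policy class (one-dimensional threshold rules that treat when $x \geq c$) for the rule maximizing the empirical objective $\frac{1}{n}\sum_i (2\pi(X_i) - 1)\hat{\Gamma}_i$, which is the form we assume (2) takes. The score values are made-up numbers standing in for doubly robust scores, and all function names are illustrative rather than taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[float], int]


def policy_objective(policy: Policy, xs: Sequence[float], scores: Sequence[float]) -> float:
    """Empirical objective (1/n) * sum_i (2*pi(x_i) - 1) * score_i."""
    n = len(xs)
    return sum((2 * policy(x) - 1) * g for x, g in zip(xs, scores)) / n


def threshold_policy(c: float) -> Policy:
    """Hypothetical policy class: treat (return 1) when x >= c."""
    return lambda x: 1 if x >= c else 0


def best_threshold(xs: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Exhaustive search over thresholds at observed x values, plus 'treat no one'."""
    candidates: List[float] = sorted(set(xs)) + [float("inf")]
    best_c, best_val = candidates[0], float("-inf")
    for c in candidates:
        val = policy_objective(threshold_policy(c), xs, scores)
        if val > best_val:
            best_c, best_val = c, val
    return best_c, best_val


# Toy data: the scores are illustrative placeholders, not estimates from real data.
xs = [0.1, 0.4, 0.5, 0.8, 0.9]
scores = [-2.0, -1.0, 0.5, 1.5, 3.0]
best_c, best_val = best_threshold(xs, scores)
print(best_c, best_val)
```

Exhaustive search is only feasible for very small classes such as this one; for richer classes the maximization is a non-convex problem and requires dedicated solvers.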

Our proof combines results from semiparametrics with carefully tailored analysis tools that build on classical ideas from empirical process theory. The reason we obtain strong guarantees for the approach (2) is closely tied to robustness properties of the estimator (1). In the setting where we only want to estimate a single average effect parameter, it is well known that non-doubly robust estimators can also be semiparametrically efficient (Hirano, Imbens, and Ridder, 2003). Here, however, we need convergence results that are strong enough to withstand optimization over the whole class $\Pi$. The fact that doubly robust estimators are fit for this task is closely tied to their ability to achieve semiparametric efficiency under general conditions, even if nuisance components are estimated via black-box machine learning methods for which we can only guarantee fast enough convergence in mean-squared error (Chernozhukov et al., 2018a; van der Laan and Rose, 2011).

We spell out our general framework in Section 2. For intuition, however, it is helpful to first consider this approach in the simplest case where we want to study the effect of a binary treatment $W_i \in \{0, 1\}$ on an outcome $Y_i \in \mathbb{R}$ interpreted as a utility and are willing to assume selection on observables: We have potential outcomes $\{Y_i(0), Y_i(1)\}$ such that $Y_i = Y_i(W_i)$ and $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$ (Imbens and Rubin, 2015). Then, the utilitarian regret of deploying a policy $\pi \in \Pi$ is (Manski, 2009)

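$$R(\pi) = \max\left\{ \mathbb{E}\left[Y_i(\pi'(X_i))\right] : \pi' \in \Pi \right\} - \mathbb{E}\left[Y_i(\pi(X_i))\right], \tag{3}$$

and we can construct our estimator (2) using the well known augmented inverse-propensity weighted scores of Robins, Rotnitzky, and Zhao (1994, 1995),

$$\hat{\Gamma}_i = \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) + \frac{W_i - \hat{e}(X_i)}{\hat{e}(X_i)\left(1 - \hat{e}(X_i)\right)} \left(Y_i - \hat{\mu}_{(W_i)}(X_i)\right), \tag{4}$$

where $\hat{\mu}_{(w)}(x)$ and $\hat{e}(x)$ are estimates of $\mu_{(w)}(x) = \mathbb{E}[Y_i(w) \mid X_i = x]$ and $e(x) = \mathbb{P}[W_i = 1 \mid X_i = x]$. In this setup, our result implies that, under regularity conditions, the estimator (2) with scores (4) has regret (3) bounded on the order of $\sqrt{\mathrm{VC}(\Pi)/n}$.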
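As a minimal illustration of the augmented inverse-propensity weighted construction, the sketch below computes scores of the form $\hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) + \frac{W_i - \hat{e}(X_i)}{\hat{e}(X_i)(1 - \hat{e}(X_i))}(Y_i - \hat{\mu}_{(W_i)}(X_i))$ from pre-computed nuisance predictions. The function name and the toy inputs are assumptions made for this example; in practice the predictions would come from separately fitted outcome and propensity models.

```python
from typing import List, Sequence


def aipw_scores(
    y: Sequence[float],
    w: Sequence[int],
    mu0_hat: Sequence[float],
    mu1_hat: Sequence[float],
    e_hat: Sequence[float],
) -> List[float]:
    """AIPW scores from outcome-model predictions (mu0_hat, mu1_hat)
    and propensity predictions (e_hat), one entry per observation."""
    scores = []
    for yi, wi, m0, m1, e in zip(y, w, mu0_hat, mu1_hat, e_hat):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity estimates must lie strictly in (0, 1)")
        mu_w = m1 if wi == 1 else m0            # prediction for the observed arm
        weight = (wi - e) / (e * (1.0 - e))     # +1/e if treated, -1/(1-e) if control
        scores.append(m1 - m0 + weight * (yi - mu_w))
    return scores


# Toy inputs (illustrative only): two units, constant nuisance predictions.
example_scores = aipw_scores(
    y=[3.0, 1.0], w=[1, 0],
    mu0_hat=[1.0, 1.0], mu1_hat=[2.0, 2.0], e_hat=[0.5, 0.5],
)
print(example_scores)
```

The first unit's score is pulled above the plug-in effect of 1 because its outcome exceeds the model prediction for the treated arm; the second unit's residual is zero, so its score equals the plug-in effect.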

Even in this simplest case, our result is considerably stronger than results currently available in the literature. The main result of Kitagawa and Tetenov (2018) is that, if treatment propensities are known, then a variant of inverse-propensity weighted policy learning achieves regret on the order of $\sqrt{\mathrm{VC}(\Pi)/n}$. However, in observational studies where the treatment propensities are unknown, the bounds of Kitagawa and Tetenov (2018) depend on the rate at which we can estimate the treatment propensities $e(\cdot)$, and will generally decay slower than $1/\sqrt{n}$. The only other available $1/\sqrt{n}$-bounds for policy learning in observational studies with a binary treatment that we are aware of are a result of van der Laan, Dudoit, and van der Vaart (2006) for the case where $\Pi$ consists of a finite set of policies whose cardinality grows with $n$, and a result of Kallus (2017a) in the special case where the response surface is assumed to belong to a reproducing kernel Hilbert space. The idea of using doubly robust scores to learn optimal treatment assignment of a binary treatment has been previously discussed in Dudík, Langford, and Li (2011) and Zhang, Tsiatis, Davidian, Zhang, and Laber (2012); however, neither paper provides regret bounds for this approach.

In the more general case where the observed treatment assignments may be continuous and/or we may need to use instrumental variables to identify causal effects, both the methods and regret bounds provided here are new. By connecting the policy learning problem to the semiparametric efficiency literature, we are able to develop a general framework that applies across a variety of settings.

The rest of the paper proceeds as follows. We discuss related work in more detail below. Next, we formalize the connection between evaluation of a single policy and optimization over a set of policies. Section 2.1 spells out how several commonly studied settings map into our framework. Formal results on upper and lower bounds on regret are provided in Sections 3 and 4. In Section 5, we study an empirical application to the problem of assigning individuals to a training program in California, and we conduct a simulation study to illustrate the application to settings with instrumental variables.

### 1.1 Related Work

The literature on optimal treatment allocation has been rapidly expanding across several fields. In the econometrics literature, the program of learning regret-optimal treatment rules was started by Manski (2004, 2009). One line of work considers the case where the policy class is unrestricted, and the optimal treatment assignment rule simply depends on the sign of the conditional average treatment effect for each individual unit. In this setting, Hirano and Porter (2009) show that when $1/\sqrt{n}$-rate estimation of the conditional average treatment effect function is possible, then treatment assignment rules obtained by thresholding an efficient estimate of the conditional average treatment effect are asymptotically minimax-optimal. Meanwhile, Stoye (2009) derives finite sample minimax decision rules in a class of problems where both the response surfaces and the policies may depend arbitrarily on covariates. Further results are given in Armstrong and Shen (2015), Bhattacharya and Dupas (2012), Chamberlain (2011), Dehejia (2005), Kasy (2016), Stoye (2012) and Tetenov (2012).

Building on this line of work, Kitagawa and Tetenov (2018) study policy learning in a non-parametric setting where the learned policy $\hat{\pi}$ is constrained to belong to a structured class $\Pi$ and show that, in this case, we can obtain regret bounds relative to the best policy in $\Pi$ that scale with the complexity of the class $\Pi$. A key insight from Kitagawa and Tetenov (2018) is that, when propensity scores are known and $\Pi$ has finite VC dimension, it is possible to get $1/\sqrt{n}$-rate regret bounds for policy learning over a class $\Pi$ even if the conditional average treatment effect function itself cannot be estimated at a $1/\sqrt{n}$-rate; in other words, we can reliably find a nearly best-in-class policy without needing to accurately estimate a model that describes all causal effects. As discussed above, our paper builds on this work by considering rate-optimal regret bounds for best-in-class policy learning in observational studies where propensity scores are unknown and treatment assignment may be endogenous.

One difference between our results and those of Kitagawa and Tetenov (2018) is that the latter provide finite sample regret bounds, whereas our results are asymptotic in the sample size $n$. The reason for this is that our bounds rely on results from the literature on semiparametric estimation (Bickel, Klaassen, Ritov, and Wellner, 1998; Chernozhukov, Escanciano, Ichimura, Newey, and Robins, 2018b; Chen, Hong, and Tarozzi, 2008; Hahn, 1998; Newey, 1994; Robins and Rotnitzky, 1995), which themselves are asymptotic. Recently, Armstrong and Kolesár (2017) showed that, in a class of average treatment effect estimation problems, finite sample conditionally minimax linear estimators are asymptotically efficient, thus providing a connection between desirable finite sample guarantees and asymptotic optimality. It would be interesting to examine whether similar connections are possible in the policy learning case.

Policy learning from observational data has also been considered in parallel literatures developed in both statistics (Chen, Zeng, and Kosorok, 2016; Luedtke and van der Laan, 2016a, b; Luedtke and Chambaz, 2017; Qian and Murphy, 2011; Zhang, Tsiatis, Davidian, Zhang, and Laber, 2012; Zhao, Zeng, Rush, and Kosorok, 2012; Zhou, Mayer-Hamblett, Khan, and Kosorok, 2017) and machine learning (Beygelzimer and Langford, 2009; Dudík, Langford, and Li, 2011; Kallus, 2017a, b; Swaminathan and Joachims, 2015). Two driving themes behind these literatures are the development of performant algorithms for solving the empirical maximization problems (and relaxations thereof) that underlie policy learning, and the use of doubly robust objectives for improved practical performance. Kallus (2017a), Swaminathan and Joachims (2015), Zhao et al. (2012) and Zhou et al. (2017) also prove regret bounds for their methods; however, they do not achieve a $1/\sqrt{n}$ sample dependence, with the exception of Kallus (2017a) in the special case of the reproducing kernel Hilbert space setting described above. Finally, Luedtke and Chambaz (2017) propose a class of regret bounds that decay faster than $1/\sqrt{n}$ by exploiting non-uniform asymptotics; see Section 4 for a further discussion.

The problem of optimal treatment allocation can also be seen as a special case of the broader problem of optimal data-driven decision making. From this perspective, our result is related to the work of Ban and Rudin (2018) and Bertsimas and Kallus (2014), who study data-driven rules for optimal inventory management. Much like in our case, they advocate learning with a loss function that is directly tied to a utility-based criterion.

Another relevant line of work studies the online “contextual bandit” setup where a practitioner seeks to learn a decision rule while actively making treatment allocation decisions for incoming subjects (e.g., Agarwal, Hsu, Kale, Langford, Li, and Schapire, 2014; Auer, Cesa-Bianchi, Freund, and Schapire, 2002; Bastani and Bayati, 2015; Dimakopoulou, Athey, and Imbens, 2017; Lai and Robbins, 1985; Perchet and Rigollet, 2013; Rakhlin and Sridharan, 2016). Despite aiming for a similar goal, the contextual bandit problem is quite different from ours: On one hand, it is harder because of an exploration/exploitation trade-off that arises in sequential trials; on the other hand, it is easier, because the experimenter has perfect control over the treatment assignment mechanism at each step of the procedure.

Finally, we also note a growing literature on estimating conditional average treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018; Athey, Tibshirani, and Wager, 2018; Chen, 2007; Künzel, Sekhon, Bickel, and Yu, 2017; Nie and Wager, 2017). Although the goal is similar to that of learning optimal treatment assignment rules, the specific results themselves differ; they focus on squared-error loss rather than utilitarian regret.

## 2 From Efficient Policy Evaluation to Learning

Our goal is to learn a policy $\pi \in \Pi$ that maps a subject’s features $X_i \in \mathcal{X}$ to a treatment decision: $\pi : \mathcal{X} \to \{0, 1\}$. In order to do so, we assume that we have independent and identically distributed samples $(X_i, Y_i, W_i, Z_i)$, where $Y_i \in \mathbb{R}$ is the outcome we want to intervene on, $W_i$ is the observed treatment assignment, and $Z_i$ is an (optional) instrument used for identifying causal effects. In cases where $W_i$ is exogenous, we simply take $Z_i = W_i$. Throughout our analysis, we interpret $Y_i$ as the utility resulting from our intervention on the $i$-th sample, e.g., $Y_i$ could measure the benefit accrued by a subject minus any cost of treatment. We then seek policies that make the expected value of $Y_i$ large.

We define the causal effect of the intervention in terms of the potential outcomes model (Neyman, 1923; Rubin, 1974), whereby the $\{Y_i(w)\}$ correspond to utilities we would have observed for the $i$-th sample had the treatment been set to $W_i = w$, and $Y_i = Y_i(W_i)$. When instruments are present, we always assume that the exclusion restriction holds so that this notation is well specified. We consider both examples with a binary treatment $W_i \in \{0, 1\}$ and with a continuous treatment $W_i \in \mathbb{R}$.

In the case where $W_i$ is binary, we follow the existing literature (Hirano and Porter, 2009; Kitagawa and Tetenov, 2018; Manski, 2004; Stoye, 2009), and study interventions that directly specify the treatment level. In this case, the utility of deploying a policy $\pi$ relative to treating no one is (Manski, 2009)

$$V(\pi) = \mathbb{E}\left[Y_i(\pi(X_i)) - Y_i(0)\right], \tag{5}$$

and the corresponding policy regret relative to the best possible policy in the class $\Pi$ is

$$R(\pi) = \max\left\{V(\pi') : \pi' \in \Pi\right\} - V(\pi). \tag{6}$$

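As discussed in the introduction, in this binary setting, Kitagawa and Tetenov (2018) show that if $W_i$ is exogenous with known treatment propensities, then we can use inverse-propensity weighting to derive a policy $\hat{\pi}_{IPW}$ whose regret decays as $1/\sqrt{n}$, with

$$\hat{\pi}_{IPW} = \operatorname{argmax}\left\{\hat{V}_{IPW}(\pi) : \pi \in \Pi\right\}, \qquad \hat{V}_{IPW}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{1\left(\{W_i = \pi(X_i)\}\right) Y_i}{\mathbb{P}\left[W_i = \pi(X_i) \mid X_i\right]}. \tag{7}$$

Here, we develop methods that can also be used in observational studies where treatment propensities may be unknown, and where we may need to use instrumental variables to identify $V(\pi)$ from (5).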

Meanwhile, when $W_i$ is continuous, we study infinitesimal interventions on the treatment level motivated by the work of Powell, Stock, and Stoker (1989). We define the utility of such an infinitesimal intervention as

$$V(\pi) = \left[\frac{d}{d\nu} \mathbb{E}\left[Y_i\left(W_i + \nu \pi(X_i)\right)\right]\right]_{\nu = 0}, \tag{8}$$

and then define regret in terms of $V(\pi)$ as in (6).
One interesting conceptual difference that arises in this case is that, now, our interventions and observed treatment assignments take values in different spaces. For instance, in a setting where we want to target some customers with personalized discounts, we may have access to past prices that take on a continuum of values, but are considering a new class of policies that only allow us to make a binary decision on whether to offer customers a small discount or not. The fact that we can still learn low-regret policies via the simple strategy (2) even when these two spaces are decoupled highlights the richness of the policy learning problem.^{3}

^{3} Another interesting question one could ask is how best to optimize the assignment of $W_i$ globally rather than locally (i.e., the case where we can set the treatment level to an arbitrary level, rather than simply nudge the pre-existing levels of $W_i$). This question would require different formal tools, however, as the results developed in this paper only apply to binary decisions.

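With both binary and continuous treatments, the regret of a policy can be written in terms of a conditional average treatment effect function,

$$\tau(x) = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid X_i = x\right] \quad \text{or} \quad \tau(x) = \left[\frac{d}{d\nu} \mathbb{E}\left[Y_i(W_i + \nu) \mid X_i = x\right]\right]_{\nu = 0}, \tag{9}$$

such that $V(\pi) = \mathbb{E}\left[\pi(X_i) \tau(X_i)\right]$ and regret $R(\pi)$ is as in (6). Our analysis pertains to any setup with a regret function that admits such a representation.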
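The representation of policy value as an average of the treatment effect function over the treated region, $V(\pi) = \mathbb{E}[\pi(X_i)\tau(X_i)]$, can be checked on a toy example. The sketch below assumes a discrete covariate distribution with a known (made-up) effect function and a hypothetical class of threshold policies, and computes value and regret exactly; in applications $\tau$ is of course unknown and must be estimated.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[float], int]
# Each population cell is (x, probability of x, tau(x)); the numbers are illustrative.
Population = Sequence[Tuple[float, float, float]]


def policy_value(policy: Policy, population: Population) -> float:
    """V(pi) = E[pi(X) * tau(X)] for a discrete covariate distribution."""
    return sum(p * policy(x) * tau for x, p, tau in population)


def regret(policy: Policy, policy_class: Sequence[Policy], population: Population) -> float:
    """Gap between the best value in the class and the value of `policy`."""
    best = max(policy_value(pi, population) for pi in policy_class)
    return best - policy_value(policy, population)


def make_threshold(c: float) -> Policy:
    """Hypothetical policy class member: treat when x >= c."""
    return lambda x: 1 if x >= c else 0


population: Population = [
    (0.0, 0.25, -1.0),
    (1.0, 0.25, -0.5),
    (2.0, 0.25, 0.5),
    (3.0, 0.25, 2.0),
]
policy_class: List[Policy] = [make_threshold(c) for c in (0.0, 1.0, 2.0, 3.0, 4.0)]

treat_all = policy_class[0]        # threshold 0.0 treats every unit
treat_positive = policy_class[2]   # threshold 2.0 treats only units with tau > 0

print(regret(treat_all, policy_class, population))
print(regret(treat_positive, policy_class, population))
```

Treating everyone incurs positive regret here because two cells have negative effects, while the threshold that isolates the positive-effect cells attains zero regret within this class.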

Given these preliminaries, recall that our goal is to learn low regret policies, i.e., to use observational data to derive a policy $\hat{\pi} \in \Pi$ with a guarantee that $R(\hat{\pi}) = \mathcal{O}_P(1/\sqrt{n})$. In order to do so, we need to make assumptions on the observational data generation distribution that allow for identification and adequate estimation of $V(\pi)$, and also control the size of $\Pi$ in a way that makes emulating the best-in-class policy a realistic objective. The following two subsections outline these required conditions; our main result is then stated in Section 2.3.

### 2.1 Identifying and Estimating Causal Effects

In order to learn a good policy $\hat{\pi}$, we first need to be able to evaluate $V(\pi)$ for any specific policy $\pi$. Our main assumption, following Chernozhukov, Escanciano, Ichimura, Newey, and Robins (2018b), is that we can construct a doubly robust score for the average treatment effect $\theta = \mathbb{E}[\tau(X)]$. At the end of this section we discuss how this approach applies to three important examples, and refer the reader to Chernozhukov et al. (2018b) for a more general discussion of when such doubly robust scores exist.

###### Assumption 1.

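Write $m(x, w) = \mathbb{E}[Y_i(w) \mid X_i = x] \in \mathcal{M}$ for the counterfactual response surface and suppose that the induced conditional average treatment effect function $\tau_m(x)$ is linear in $m$. Suppose moreover that we can define regret in terms of this $\tau$-function as in (6) with $V(\pi) = \mathbb{E}[\pi(X_i) \tau_m(X_i)]$. We assume that there exists a weighting function $g(x, z)$ that identifies this $\tau$-function,

$$\mathbb{E}\left[\tau_{\tilde{m}}(X_i) - g(X_i, Z_i)\, \tilde{m}(X_i, W_i) \mid X_i\right] = 0 \tag{10}$$

for any counterfactual response surface $\tilde{m}(x, w) \in \mathcal{M}$.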

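Given this setup, Chernozhukov et al. (2018b) propose first estimating $g(\cdot)$ and $m(\cdot)$, and then considering

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_i, \qquad \hat{\Gamma}_i = \tau_{\hat{m}}(X_i) + \hat{g}(X_i, Z_i)\left(Y_i - \hat{m}(X_i, W_i)\right). \tag{11}$$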

They show that this estimator is $\sqrt{n}$-consistent and asymptotically unbiased Gaussian for $\theta$, provided that the nuisance estimates $\hat{g}(\cdot)$ and $\hat{m}(\cdot)$ converge sufficiently fast and that we use cross-fitting (Chernozhukov et al., 2018a; Schick, 1986). This estimator is also semiparametrically efficient under general conditions (Newey, 1994).^{4}

^{4} Our results don’t depend on efficiency of (11); rather, we only use $\sqrt{n}$-consistency. In cases where (11) may not be efficient, our regret bounds still hold verbatim; the only difference being that we can no longer interpret the terms appearing in the bound as related to the semiparametric efficient variance.

Here, we also start with the doubly robust scores in (11); however, instead of using them for simply estimating $\theta$, we use them for policy learning by plugging them into (2). Our main result will establish that we can get strong regret bounds for learning policies under conditions that are similar to those used by Chernozhukov et al. (2018b) to show asymptotic normality of (11) and, more broadly, that build on assumptions often made in the literature on semiparametric efficiency (Bickel, Klaassen, Ritov, and Wellner, 1998; Chen, Hong, and Tarozzi, 2008; Hahn, 1998; Newey, 1994; Robins and Rotnitzky, 1995).

As in the recent work of Chernozhukov et al. (2018a) on double machine learning or that of van der Laan and Rose (2011) on targeted learning, we take an agnostic view on how the nuisance estimates $\hat{g}(\cdot)$ and $\hat{m}(\cdot)$ are obtained, and simply impose high level conditions on their rates of convergence. In applications, this allows practitioners to try several different machine learning methods for each component, or potentially combinations thereof (van der Laan, Polley, and Hubbard, 2007), and then use cross validation to pick the best method. For completeness, we allow problem specific quantities to change with the sample size $n$, and track this dependence with a subscript, e.g., $m_n(\cdot)$, $g_n(\cdot)$, etc. Given sufficient regularity, we can construct estimators that satisfy the rate condition (13) via sieve-based methods (Chen, 2007; Negahban, Ravikumar, Wainwright, and Yu, 2012) or kernel regression (Caponnetto and De Vito, 2007; Mendelson and Neeman, 2010).

###### Assumption 2.

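In the setting of Assumption 1, assume that $\mathbb{E}[m_n^2(X, W)] < \infty$ and $\mathbb{E}[\tau_{m_n}^2(X)] < \infty$ for all $n$. Moreover, we assume that we have access to uniformly consistent estimators of these nuisance components,

$$\sup_{x, w}\left\{\left|\hat{m}_n(x, w) - m_n(x, w)\right|\right\}, \ \ \sup_{x, z}\left\{\left|\hat{g}_n(x, z) - g_n(x, z)\right|\right\} \to_p 0, \tag{12}$$

whose errors decay as follows, for some $0 < \zeta_m, \zeta_g < 1$ with $\zeta_m + \zeta_g \geq 1$ and some $a(n) \to 0$, where $X$ is taken to be an independent test example drawn from the same distribution as the training data:^{5}

$$\mathbb{E}\left[\left(\hat{m}_n(X, W) - m_n(X, W)\right)^2\right], \ \mathbb{E}\left[\left(\tau_{\hat{m}_n}(X) - \tau_{m_n}(X)\right)^2\right] \leq \frac{a(n)}{n^{\zeta_m}}, \qquad \mathbb{E}\left[\left(\hat{g}_n(X, Z) - g_n(X, Z)\right)^2\right] \leq \frac{a(n)}{n^{\zeta_g}}. \tag{13}$$

^{5} A notable special case of this assumption is when $\zeta_m = \zeta_g = 1/2$; this is equivalent to the standard assumption in the semiparametric estimation literature that all nuisance components (i.e., in our case, both the outcome and weighting regressions) are $o(n^{-1/4})$-consistent in terms of $L_2$-error. The weaker requirement (13) reflects the fact that doubly robust treatment effect estimators can trade off accuracy of the $m$-model with accuracy of the $g$-model, provided the product of the error rates is controlled (Farrell, 2015).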

We end this section by verifying that Assumption 1 in fact covers several settings of interest, and is closely related to several standard approaches to semiparametric inference. In cases with selection on observables we do not need an instrument (or can simply set $Z_i = W_i$), so for simplicity of notation we replace all instances of $Z_i$ with $W_i$.

#### Binary treatment with selection on observables.

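Most existing work on policy learning, including Kitagawa and Tetenov (2018), has focused on the setup where $W_i$ is binary and unconfounded, i.e., $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$. In this case, weighting by the inverse propensity score lets us recover the average treatment effect, i.e., $g(x, w) = (w - e(x)) / (e(x)(1 - e(x)))$ with $e(x) = \mathbb{P}[W_i = 1 \mid X_i = x]$ identifies the conditional average treatment effect via (10). Moreover, the estimation strategy (11) yields

$$\hat{\Gamma}_i = \hat{m}(X_i, 1) - \hat{m}(X_i, 0) + \frac{W_i - \hat{e}(X_i)}{\hat{e}(X_i)\left(1 - \hat{e}(X_i)\right)}\left(Y_i - \hat{m}(X_i, W_i)\right), \tag{14}$$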

and thus recovers augmented inverse propensity weighting (Robins, Rotnitzky, and Zhao, 1994, 1995).

#### Continuous treatment with selection on observables.

In the case where $W_i$ is continuous and unconfounded, $\{Y_i(w)\} \perp W_i \mid X_i$, we can derive a representer via integration by parts (Powell, Stock, and Stoker, 1989). Under regularity conditions, the $\tau$-function can be identified via (10) using

$$g(X_i, W_i) = -\frac{d}{dw}\left[\log f(w \mid X_i)\right]_{w = W_i}, \tag{15}$$

where $f(\cdot \mid \cdot)$ denotes the conditional density of $W_i$ given $X_i$. Although closely related to existing proposals, including one by Ai and Chen (2007), the resulting doubly robust estimator was first derived via the general approach of Chernozhukov et al. (2018b).

#### Binary, endogenous treatment with binary treatment and instrument.

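Instead of unconfoundedness, now suppose that $Z_i \in \{0, 1\}$ is a valid instrument conditionally on features $X_i$ in the sense of Assumption 2.1 of Abadie (2003). Suppose moreover that treatment effects are homogeneous, meaning that the conditional average treatment effect matches the conditional local average treatment effect (Imbens and Angrist, 1994),^{6}

$$\tau(x) = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid X_i = x\right] = \frac{\operatorname{Cov}\left[Y_i, Z_i \mid X_i = x\right]}{\operatorname{Cov}\left[W_i, Z_i \mid X_i = x\right]}. \tag{16}$$

^{6} As discussed above, our notation with potential outcomes $Y_i(w)$ that do not involve the instrument is only meaningful when the exclusion restriction holds.

Then we can use a weighting function defined in terms of the compliance score (Abadie, 2003; Aronow and Carnegie, 2013),

$$g(X_i, Z_i) = \frac{1}{\Delta(X_i)} \frac{Z_i - z(X_i)}{z(X_i)\left(1 - z(X_i)\right)}, \qquad z(x) = \mathbb{P}\left[Z_i = 1 \mid X_i = x\right], \qquad \Delta(x) = \mathbb{P}\left[W_i = 1 \mid Z_i = 1, X_i = x\right] - \mathbb{P}\left[W_i = 1 \mid Z_i = 0, X_i = x\right], \tag{17}$$

to identify this $\tau$-function using (10). We note that our formal results all require that $g(\cdot)$ be bounded, which implicitly rules out the case of weak instruments (since if $\Delta(x)$ approaches 0, the $g$-weights blow up).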
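The identification property of compliance-score weights can be verified numerically. The sketch below assumes one standard form of the weights for a binary instrument, $g = (Z - z)/(z(1 - z)\Delta)$ with $z = \mathbb{P}[Z = 1 \mid X]$ and compliance score $\Delta = \mathbb{P}[W = 1 \mid Z = 1, X] - \mathbb{P}[W = 1 \mid Z = 0, X]$, and checks within a single covariate stratum that the weighted mean of an arbitrary response surface recovers the difference between its treated and control values. Function names and probabilities are illustrative assumptions.

```python
def compliance_weight(z: int, z_prob: float, delta: float) -> float:
    """Assumed weight g = (z - P[Z=1]) / (P[Z=1] * (1 - P[Z=1]) * delta)."""
    if not 0.0 < z_prob < 1.0:
        raise ValueError("instrument probability must lie strictly in (0, 1)")
    if delta == 0.0:
        raise ValueError("compliance score must be nonzero")
    return (z - z_prob) / (z_prob * (1.0 - z_prob) * delta)


def weighted_mean_response(
    m0: float, m1: float, z_prob: float, p_w_given_z1: float, p_w_given_z0: float
) -> float:
    """E[g(Z) * m(W)] within one covariate stratum, by enumerating (z, w).

    m0 and m1 are the values of an arbitrary response surface at w=0 and w=1.
    """
    delta = p_w_given_z1 - p_w_given_z0
    total = 0.0
    for z, pz in ((1, z_prob), (0, 1.0 - z_prob)):
        p_w1 = p_w_given_z1 if z == 1 else p_w_given_z0
        for w, pw in ((1, p_w1), (0, 1.0 - p_w1)):
            m = m1 if w == 1 else m0
            total += pz * pw * compliance_weight(z, z_prob, delta) * m
    return total


# The weighted mean recovers m1 - m0 = 3.0 for these made-up probabilities.
recovered = weighted_mean_response(m0=1.0, m1=4.0, z_prob=0.3,
                                   p_w_given_z1=0.8, p_w_given_z0=0.2)
print(recovered)

# A weak instrument (tiny compliance score) produces very large weights.
print(compliance_weight(1, 0.5, 0.01))
```

The second print illustrates why bounded weights rule out weak instruments: as the compliance score shrinks toward zero, the weight magnitude grows without bound.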

### 2.2 Assumptions about the Policy Class

Next, in order to obtain regret bounds that decay as $1/\sqrt{n}$, we need some control over the complexity of the class $\Pi$ (and again let $\Pi = \Pi_n$ potentially change with $n$ for generality). Here, we achieve this control by assuming that $\Pi_n$ is a Vapnik-Chervonenkis (VC) class whose dimension does not grow too fast with the sample size $n$. As is familiar from the literature on classification, we will find that the best possible uniform regret bounds scale as $\sqrt{\mathrm{VC}(\Pi_n)/n}$ (Vapnik, 2000).

###### Assumption 3.

We assume that there is a constant $0 < \beta < 1/2$ and sequence $1 \leq d_n \leq n^{\beta}$ such that the Vapnik-Chervonenkis dimension of $\Pi_n$ is bounded by $d_n$ for all $n$.

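In order to understand the role of this VC dimension bound in our proof, it is helpful to reformulate it in terms of a bound on the covering number of $\Pi$. For any discrete set of points $\{X_1, \ldots, X_m\}$ and any $\varepsilon > 0$, define the $\varepsilon$-Hamming covering number $N_H(\varepsilon, \Pi, \{X_1, \ldots, X_m\})$ as the smallest number of policies $\pi : \mathcal{X} \to \{0, 1\}$ (not necessarily contained in $\Pi$) required to $\varepsilon$-cover $\Pi$ under Hamming distance,

$$H(\pi_1, \pi_2) = \frac{1}{m} \sum_{j=1}^{m} 1\left(\{\pi_1(X_j) \neq \pi_2(X_j)\}\right). \tag{18}$$

Then, define the $\varepsilon$-Hamming entropy of $\Pi$ as $\log\left(N_H(\varepsilon, \Pi)\right)$, where

$$N_H(\varepsilon, \Pi) = \sup\left\{N_H\left(\varepsilon, \Pi, \{X_1, \ldots, X_m\}\right) : X_1, \ldots, X_m \in \mathcal{X};\ m \geq 1\right\} \tag{19}$$

is the number of functions needed to $\varepsilon$-cover $\Pi$ under Hamming distance for any discrete set of points. We note that this notion of entropy is purely geometric, and does not depend on the distribution used to generate the $X_i$.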
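The covering construction can be illustrated on a finite set of points. In the sketch below, each policy is represented by its vector of 0/1 decisions on $m$ points, Hamming distance is the fraction of points on which two policies disagree, and a greedy pass builds a cover whose centers are drawn from the class itself. Because the covering number defined above also allows centers outside the class, the size of this greedy cover is only an upper bound on it. The threshold class and the helper names are assumptions for this example.

```python
from typing import List, Sequence, Tuple

Decisions = Tuple[int, ...]  # a policy evaluated on m fixed points


def hamming_distance(p1: Sequence[int], p2: Sequence[int]) -> float:
    """Fraction of points on which two policies disagree."""
    return sum(a != b for a, b in zip(p1, p2)) / len(p1)


def greedy_cover(policies: Sequence[Decisions], eps: float) -> List[Decisions]:
    """Greedy eps-cover: every policy ends up within eps of some center.

    A policy becomes a new center only if it is farther than eps from all
    existing centers, so the result is a valid (not necessarily minimal) cover.
    """
    centers: List[Decisions] = []
    for p in policies:
        if all(hamming_distance(p, c) > eps for c in centers):
            centers.append(p)
    return centers


# Hypothetical class: threshold rules on m = 8 ordered points.
m = 8
threshold_policies: List[Decisions] = [
    tuple(1 if j >= k else 0 for j in range(m)) for k in range(m + 1)
]

cover = greedy_cover(threshold_policies, eps=0.25)
print(len(threshold_policies), len(cover))
```

Here nine distinct threshold rules are covered by three centers at radius 0.25, which reflects the low complexity of a class with VC dimension 1.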

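As argued in Pakes and Pollard (1989), the class $\Pi$ has a finite VC dimension if and only if there is a constant $\kappa$ for which

$$\log\left(N_H(\varepsilon, \Pi)\right) \leq \kappa \log\left(\varepsilon^{-1}\right) \ \text{ for all } \ 0 < \varepsilon < \frac{1}{2}. \tag{20}$$

Moreover, there are simple quantitative bounds for Hamming entropy in terms of the VC dimension: If $\Pi$ is a VC class of dimension $d$, then (Haussler, 1995)

$$N_H(\varepsilon, \Pi) \leq e\,(d + 1) \left(\frac{2e}{\varepsilon}\right)^{d}, \tag{21}$$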

meaning that (20) holds with whenever . Conversely, a direct calculation shows that, if (20) holds with , then must have VC dimension satisfying .

Whenever we invoke Assumption 3 in our proof, we actually work in terms of the
complexity bound (20). Sometimes, it may be easier to verify
(20) directly, rather than first proving a bound on the VC dimension.
For example, if is the set of all depth- decision trees with , we
can directly verify that^{7}^{7}7To establish this result for trees,
one can follow Bartlett and Mendelson (2002) and view each tree-leaf as a conjunction of
boolean functions, along with a sign. A simple argument then shows that a library of
boolean functions lets us approximate each leaf to within Hamming
error ; and so we can also approximate the tree to within
Hamming error. The resulting bound on follows by noting that a full tree
has splits, and so can be approximated using of these boolean functions.

(22) |

Then, thanks to the reductions in the previous paragraph, we see that Assumption 3 holds for trees whose depth may grow with sample size as for some .

Finally, we note that further high-level constraints on , e.g., budget constraints or constraints on marginal treatment rates among subgroups, simply reduce the complexity of the policy class and thus do not interfere with the present assumptions.

### 2.3 Bounding Asymptotic Regret

We are now ready to state our main result on the asymptotic regret of policy learning using doubly robust scores. Following Chernozhukov et al. (2018a, b) we assume that we run our method with scores obtained via cross-fitting, which is a type of data splitting that can be used to verify asymptotic normality given only high-level conditions on the predictive accuracy of the methods used to estimate nuisance components. In particular, cross-fitting allows for the use of black-box machine learning tools provided we can verify that they are accurate in mean-squared error as in Assumption 2.

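We proceed as follows: First divide the data into $K$ evenly-sized folds and, for each fold $k = 1, \ldots, K$, run an estimator of our choice on the other $K - 1$ data folds to estimate the functions $m_n(x, w)$ and $g_n(x, z)$; denote the resulting estimates $\hat{m}_n^{(-k)}(x, w)$ and $\hat{g}_n^{(-k)}(x, z)$. Throughout, we will only assume that these nuisance estimates are accurate in the sense of Assumption 2. Then, given these pre-computed values, we choose $\hat{\pi}_n$ by maximizing a doubly robust estimate of $A(\pi) = 2V(\pi) - \mathbb{E}[\tau(X)]$,

$$\hat{\pi}_n = \operatorname{argmax}\left\{\hat{A}_n(\pi) : \pi \in \Pi_n\right\}, \qquad \hat{A}_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left(2\pi(X_i) - 1\right) \hat{\Gamma}_i, \qquad \hat{\Gamma}_i = \tau_{\hat{m}_n^{(-k(i))}}(X_i) + \hat{g}_n^{(-k(i))}(X_i, Z_i)\left(Y_i - \hat{m}_n^{(-k(i))}(X_i, W_i)\right), \tag{23}$$

where $k(i) \in \{1, \ldots, K\}$ denotes the fold containing the $i$-th observation. The $K$-fold algorithmic structure used in (23) was proposed in an early paper by Schick (1986) as a general purpose tool for efficient estimation in semiparametric models, and has also been used in Robins et al. (2008, 2017), Wager et al. (2016) and Zheng and van der Laan (2011).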
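The fold mechanics described above can be sketched in a few lines of Python. The example shows only the out-of-fold structure: each observation receives a nuisance prediction from a model fitted on the other folds. The sample-mean "learner" is a deliberately trivial placeholder for whatever method is used to estimate the nuisance functions, the deterministic fold assignment stands in for a random split, and all names are assumptions for this illustration.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]
Fitter = Callable[[Sequence[Tuple[float, float]]], Model]


def fold_assignments(n: int, n_folds: int) -> List[int]:
    """Deterministic, evenly sized folds; a random permutation is typical in practice."""
    return [i % n_folds for i in range(n)]


def cross_fit_predictions(
    xs: Sequence[float], ys: Sequence[float], n_folds: int, fit: Fitter
) -> Tuple[List[float], List[int]]:
    """For each fold k, fit on all other folds and predict on fold k only."""
    n = len(xs)
    folds = fold_assignments(n, n_folds)
    preds: List[float] = [0.0] * n
    for k in range(n_folds):
        train = [(x, y) for x, y, f in zip(xs, ys, folds) if f != k]
        model = fit(train)  # never sees observations from fold k
        for i in range(n):
            if folds[i] == k:
                preds[i] = model(xs[i])
    return preds, folds


def fit_mean(train: Sequence[Tuple[float, float]]) -> Model:
    """Placeholder learner: predicts the training-sample mean of y everywhere."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
preds, folds = cross_fit_predictions(xs, ys, n_folds=3, fit=fit_mean)
print(folds)
print(preds)
```

The out-of-fold predictions would then be combined with the observed outcomes to form one doubly robust score per observation before the policy optimization step.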

To prove the result below, we also assume that the weighting function $g_n(x, z)$ is bounded uniformly above for some $\eta > 0$:

$$\left|g_n(x, z)\right| \leq \eta^{-1} \ \text{ for all } x, z. \tag{24}$$

In the case of a binary exogenous treatment, this is equivalent to the “overlap” assumption in the causal inference literature (Imbens and Rubin, 2015), whereby $\eta \leq \mathbb{P}[W_i = 1 \mid X_i = x] \leq 1 - \eta$ for all values of $x$. In our setting, the condition (24) acts as a generalization of the overlap assumption (Hirshberg and Wager, 2018). We also define

(25) |

where

bounds the second moment of the scores, and

is the asymptotic variance of (11) for estimating the policy improvement of the best policy in .^{8}

^{8}In the case where arguments from Newey (1994) imply that the doubly robust estimator (11) is efficient, then is the semiparametric efficient variance for estimating . We note that, unless we have an exceptionally large signal-to-noise ratio, we will have and so the rounded log-term in (27) below is just 0.

###### Theorem 1.

Given Assumption 1 and (24), define as in (23), and suppose that we can consistently estimate nuisance components as in Assumption 2. Suppose moreover that the irreducible noise is both uniformly sub-Gaussian conditionally on and and has second moments uniformly bounded from below, , and that the treatment effect function is uniformly bounded in and . Finally, suppose that satisfies Assumption 3 with VC dimension

(26) |

Then, for any , there is a universal constant^{9} , as well as a threshold that depends on the constants used to define the regularity assumptions, such that

(27) |

where denotes regret for the -th data-generating distribution.

^{9}Throughout this paper, we will use to denote different universal constants; no two instantiations of should be assumed to denote the same constant unless specified.

Theorem 1 establishes conditions under which we are guaranteed low regret provided we can solve the optimization problem in (23). This optimization problem is not convex, and so the numerical optimization task may be difficult. However, from an implementation point of view, several authors, including Beygelzimer and Langford (2009), Kitagawa and Tetenov (2018), Zhang, Tsiatis, Davidian, Zhang, and Laber (2012) and Zhao, Zeng, Rush, and Kosorok (2012), have noted that any optimization problem of the type (23) is numerically equivalent to a weighted classification problem,

(28) |

where we train a classifier with response using sample weights . Given this formalism, we can build on existing tools for weighted classification to learn , e.g., best-subset empirical risk minimization (Chen and Lee, 2016; Greenshtein, 2006) or optimal trees (Bertsimas and Dunn, 2017). In practice, it may also be interesting to consider the empirical performance of computationally less demanding methods that solve an approximation to the weighted classification problem, e.g., support vector machines (Cortes and Vapnik, 1995) or recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984); however, we caution that our formal results only apply to methods that solve the problem (28) exactly.

## 3 Formal Results

We study policy learning for a class of problems where regret can be written as in (6) using a function , and we obtain by maximizing a cross-fitted doubly robust estimate of defined in (23) over the class . If we could use , then (23) would directly yield the regret-minimizing policy in the class ; but of course we never know in applications. Thus, the main focus of our formal results is to study stochastic fluctuations of the empirical process for , and examine how they affect the quality of policies learned via (23).
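To make the maximization over a policy class concrete, the following sketch solves it exactly for a very small class and checks the weighted-classification reformulation numerically. It rests on assumptions that are not taken from the paper: the objective is the sample average of (2π(X) − 1) times a score, the class consists of threshold rules π(x) = 1{x > c} on a scalar covariate, the classification labels are the signs of the scores and the sample weights their absolute values, and the five data points are made up. All function names are hypothetical.

```python
def policy_value(cut, X, scores):
    """Average of (2 * pi(x) - 1) * score for the policy pi(x) = 1{x > cut}."""
    return sum((2 * (x > cut) - 1) * s for x, s in zip(X, scores)) / len(X)


def weighted_error(cut, X, scores):
    """Weighted misclassification rate with label 1{score > 0}, weight |score|."""
    return sum(abs(s) * ((x > cut) != (s > 0))
               for x, s in zip(X, scores)) / len(X)


def candidate_cuts(X):
    """One cut per distinct treated set: 'treat all', then each observed value."""
    values = sorted(set(X))
    return [values[0] - 1.0] + values


# Made-up covariates and scores, for illustration only.
X = [0.1, 0.4, 0.5, 0.8, 0.9]
scores = [-2.0, -0.5, 1.0, 0.3, 1.5]

cuts = candidate_cuts(X)
best_by_value = max(cuts, key=lambda c: policy_value(c, X, scores))
best_by_error = min(cuts, key=lambda c: weighted_error(c, X, scores))

# Identity behind the equivalence:
#   value(pi) = mean(|score|) - 2 * weighted_error(pi)
mean_abs = sum(abs(s) for s in scores) / len(scores)
gaps = [abs(policy_value(c, X, scores)
            - (mean_abs - 2 * weighted_error(c, X, scores)))
        for c in cuts]
```

Because the first term of the identity does not depend on the policy, maximizing the value and minimizing the weighted error select the same rule. Exhaustive search is only feasible for tiny classes like this one, but it solves the problem exactly rather than approximately.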

As preliminaries, we note that the results of Chernozhukov et al. (2018b) immediately imply that, given Assumption 2, is an asymptotically normal estimate of , where we use “1” as shorthand for the “always treat” policy. Furthermore, it is easy to check that given any fixed policy , is asymptotically normal around (Hirano, Imbens, and Ridder, 2003). What these existing results leave open is a study of error bounds for that hold uniformly across , and a characterization of how such errors interact with the maximization in (23).

### 3.1 Rademacher Complexities and Oracle Regret Bounds

We start our analysis by characterizing concentration of an ideal version of the objective in (23) based on the true influence scores , rather than doubly robust estimates thereof:

(29) |

The advantage of studying concentration of the empirical process over the set is that it allows us, for the time being, to abstract away from the specific machine learning methods used to obtain , and instead to focus on the complexity of empirical maximization over the class .

A convenient way to bound the supremum of this empirical process over any class is by controlling its “centered” Rademacher complexity , defined as

(30) |

where the are independent Rademacher (i.e., sign) random variables taking the values ±1 with probability 1/2 each (Bartlett and Mendelson, 2002). For intuition as to why Rademacher complexity is a natural complexity measure, note that characterizes the maximum (weighted) in-sample classification accuracy on randomly generated labels over classifiers ; thus, measures how much we can overfit to random coin flips using . The offset is not present in the standard notion of Rademacher complexity, but enables us to get sharper bounds by reducing variance.

Following this proof strategy, we bound the Rademacher complexity of “slices” of our policy class , defined as

(31) |

The reason we focus on slices of is that, when we use doubly robust scores, low-regret policies can generally be evaluated more accurately than high-regret policies, and using this fact allows for sharper bounds. Specifically, we can check that , and so

(32) |

where and are defined in (25). This type of slicing technique is common in the literature, and has been used in different contexts by, e.g., Bartlett, Bousquet, and Mendelson (2005) and Giné and Koltchinskii (2006).
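For intuition about what a Rademacher complexity measures, the sketch below approximates one by Monte Carlo. It uses the standard, uncentered, score-weighted version, i.e., the expected supremum over policies of the average of sign × score × π(X); the offset of the centered version and the slicing by regret are deliberately left out. The policy class (threshold rules on a scalar covariate with distinct values), the function name `rademacher_complexities`, and the simulated scores are all assumptions made for illustration.

```python
import random


def rademacher_complexities(X, scores, n_draws=200, seed=0):
    """Monte Carlo estimate of an (uncentered) weighted Rademacher complexity.

    Returns a pair: the estimate for threshold policies pi(x) = 1{x > c}
    (assuming distinct X values) and the estimate for the class of all
    possible treatment assignments on the sample.
    """
    rng = random.Random(seed)
    n = len(X)
    order = sorted(range(n), key=lambda i: X[i])
    total_threshold, total_all = 0.0, 0.0
    for _ in range(n_draws):
        signs = [rng.choice((-1, 1)) for _ in range(n)]
        # A threshold policy treats a suffix of the sorted sample, so the
        # supremum is the largest suffix sum (0.0 means "treat nobody").
        best, suffix = 0.0, 0.0
        for i in reversed(order):
            suffix += signs[i] * scores[i]
            best = max(best, suffix)
        total_threshold += best / n
        # With no restriction, the best policy treats exactly those units
        # whose signed score is positive.
        total_all += sum(max(signs[i] * scores[i], 0.0) for i in range(n)) / n
    return total_threshold / n_draws, total_all / n_draws


# Simulated covariates and scores (hypothetical).
rng = random.Random(2)
n = 200
X = [rng.random() for _ in range(n)]
scores = [rng.gauss(0.0, 1.0) for _ in range(n)]

threshold_complexity, unrestricted_complexity = rademacher_complexities(X, scores)
```

The threshold class can fit random signs far less well than the unrestricted class, which is the sense in which a structured policy class limits overfitting to noise.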

The following result provides such a bound in terms of the second moments of the doubly robust score, specifically and . This bound is substantially stronger than corresponding bounds used in existing results on policy learning. Kitagawa and Tetenov (2018) build their result on bounds that depend on , which can only be used with scores that are uniformly bounded in order to get optimal rates. Meanwhile, bounds that scale as are developed by Cortes, Mansour, and Mohri (2010), Maurer and Pontil (2009) and Swaminathan and Joachims (2015); however, the additional factor makes these bounds inappropriate for asymptotic analysis.

###### Lemma 2.

Suppose that the class satisfies Assumption 3, and that the scores in (29) are drawn from a sequence of uniformly sub-Gaussian distributions with variance bounded from below,

(33) |

for some constants and all . Then, there is a universal constant such that, for any ,

(34) |

Given this Rademacher complexity bound, we can obtain a uniform concentration bound for using standard methods. Here, we refine an argument of Bartlett and Mendelson (2002) using Talagrand’s inequality to obtain a bound that depends on second moments of rather than .

###### Corollary 3.

In our final argument, we will apply Corollary 3 for different -slices, and verify that we can in fact focus on those slices where is nearly 0. Before that, however, we also need to control the discrepancy between the feasible objective and the oracle surrogate studied here.

### 3.2 Uniform Coupling with the Doubly Robust Score

In the previous section, we established risk bounds that would hold if we could optimize the infeasible value function ; we next need to extend these bounds to cover the situation where we optimize a feasible value function. As discussed above, we focus on the doubly robust estimator (23), obtained using cross-fitting as in Chernozhukov et al. (2018a, b). For a single, fixed policy , Chernozhukov et al. (2018b) showed that

(37) |

meaning that the discrepancy between the two value estimates decays faster than the variance of either. However, in our setting, the analyst gets to optimize over all policies , and so coupling results established for a single pre-determined policy are not strong enough. The following lemma extends the work of Chernozhukov et al. (2018b) to the case where we seek to establish a coupling of the form (37) that holds simultaneously for all .

###### Lemma 4.

Under the conditions of Lemma 2, suppose that Assumption 1 holds, that we obtain using cross-fitted estimates of nuisance components satisfying Assumption 2, and that (24) holds. Then

(38) |

where the term hides a dependence on the overlap parameter (24) and the sub-Gaussianity parameter specified in Lemma 2.

The above result is perhaps surprisingly strong: Provided that the dimension of does not grow too fast with , the bound (38) is the same coupling bound as we might expect to obtain for a single policy , and the dimension of the class does not affect the leading-order constants in the bound. In other words, in terms of the coupling of and , we do not lose anything by scanning over a continuum of policies rather than just considering a single policy .

The doubly robust form used here is not the only way to construct efficient estimators for the value of a single policy —for example, Hirano, Imbens, and Ridder (2003) show that inverse-propensity weighting with non-parametrically estimated propensity scores may also be efficient—but it plays a key role in the proof of Lemma 4. In particular, under Assumption 2, the natural bound for the bias term due to misspecification of the nuisance components in fact holds simultaneously for all , and this helps us pay a smaller-than-expected price for seeking a uniform result as in (38). It is far from obvious that other efficient methods for evaluating a single policy , such as that of Hirano et al. (2003), would lead to equally strong uniform couplings over the whole class .

### 3.3 Proof of Theorem 1

Given that Assumptions 1, 2 and 3 hold and that, moreover, for some , a combination of results from Corollary 3 and Lemma 4 implies that concentrates around over . To conclude, it now remains to apply these bounds at two different values of . First we choose as below, so that for large enough :

(39) |

Then, by Corollary 3 and Lemma 4, we find that

(40) |

which in turn implies that any satisfies

(41) |

In other words, if we knew that our learned policy maximizes *and*
has regret less than , then we could guarantee that its regret decays at the desired rate.

To prove our result, it remains to show that often enough for (41) to capture the leading-order behavior of regret. To do so, we apply a similar argument as above, but at a different value of . We use , and note that for large enough . Then, by (36) we know that

(42) |

while from (38) paired with Markov’s inequality we know that

(43) |

By combining these two bounds, we see that

(44) |

and moreover, because is uniformly bounded, we find that

(45) |

Thus, the term controlled in (41) was in fact dominant, and the claimed result follows.

## 4 Lower Bounds

To complement the upper bounds given in Theorem 1, we also present lower bounds on the minimax risk for policy learning. Our goal is to show that our bounds are the best possible regret bounds that flexibly account for the distribution of the observed data and depend on the policy class through the Vapnik-Chervonenkis dimension