# Overlap in Observational Studies with High-Dimensional Covariates

Causal inference in observational settings typically rests on a pair of identifying assumptions: (1) unconfoundedness and (2) covariate overlap, also known as positivity or common support. Investigators often argue that unconfoundedness is more plausible when many covariates are included in the analysis. Less discussed is the fact that covariate overlap is more difficult to satisfy in this setting. In this paper, we explore the implications of overlap in high-dimensional observational studies, arguing that this assumption is stronger than investigators likely realize. Our main innovation is to frame (strict) overlap in terms of bounds on a likelihood ratio, which allows us to leverage and expand on existing results from information theory. In particular, we show that strict overlap bounds discriminating information (e.g., Kullback-Leibler divergence) between the covariate distributions in the treated and control populations. We use these results to derive explicit bounds on the average imbalance in covariate means under strict overlap and a range of assumptions on the covariate distributions. Importantly, these bounds grow tighter as the dimension grows large, and converge to zero in some cases. We examine how restrictions on the treatment assignment and outcome processes can weaken the implications of certain overlap assumptions, but at the cost of stronger requirements for unconfoundedness. Taken together, our results suggest that adjusting for high-dimensional covariates does not necessarily make causal identification more plausible.


## 1 Introduction

Accompanying the rapid growth in online platforms, administrative databases, and genetic studies, there has been a push to extend methods for observational studies to settings with high-dimensional covariates. These studies typically require a pair of identifying assumptions. First, unconfoundedness: conditional on observed covariates, treatment assignment is as good as random. Second, covariate overlap, also known as positivity or common support: all units have a non-zero probability of assignment to each treatment condition (Rosenbaum & Rubin, 1983).

A key argument for high-dimensional observational studies is that unconfoundedness is more plausible when the analyst adjusts for more covariates (Rosenbaum, 2002; Rubin, 2009). Setting aside notable counter-examples to this argument (Myers et al., 2011; Pearl, 2011, 2010), the intuition is straightforward to state: the richer the set of covariates, the more likely that unmeasured confounding variables become measured confounding variables. The intuition, however, has the opposite implications for overlap: the richer the set of covariates, the closer these covariates come to perfectly predicting treatment assignment for at least some subgroups.

This tension between unconfoundedness and overlap in the presence of many covariates is particularly relevant in light of recent methodological developments that incorporate machine learning methods into semiparametric causal effect estimation (van der Laan & Gruber, 2010; Chernozhukov et al., 2016; Athey et al., 2016). On one hand, by using machine learning to perform covariate adjustment, these methods can achieve parametric convergence rates under extremely weak nonparametric modeling assumptions. On the other hand, the cost of this nonparametric flexibility is that these methods are highly sensitive to poor overlap.

In this paper, we explore the population implications of overlap, arguing that this assumption has strong implications when there are many covariates. In particular, we focus on the strict overlap assumption, which asserts that the propensity score is bounded away from 0 and 1 with probability 1, and which is essential for the performance guarantees of common modern semiparametric estimators. Although strict overlap appears to be a local constraint that bounds the propensity score for each unit in the population, we show that it implies global restrictions on the discrepancy between the covariate distributions in the treated and control populations. In our main result, we derive explicit bounds on the average imbalance in covariate means. In several cases, we are able to show that, as the dimension of the covariates grows, strict overlap implies that the average imbalance in covariate means converges to zero. To put these results into context, we discuss how the implications of strict overlap intersect with common modeling assumptions, and how our results inform the common practice of trimming in high-dimensional contexts.

## 2 Preliminaries

### 2.1 Definitions

We focus on an observational study with a binary treatment. For each sampled unit $i$, $(Y_i(0), Y_i(1))$ are potential outcomes, $T_i \in \{0, 1\}$ is the treatment indicator, and $X_{i,1:\infty}$ is a sequence of covariates. Let $(Y_i(0), Y_i(1), T_i, X_{i,1:\infty})$ be independently and identically distributed according to a superpopulation probability measure $P$. We drop the subscript $i$ when discussing population stochastic properties of these quantities. We observe triples $(Y_i^{obs}, T_i, X_{i,1:p})$, where $Y_i^{obs} = Y_i(T_i)$. We would like to estimate the average treatment effect

$$\tau_{ATE} = E[Y(1) - Y(0)].$$

The standard approach in observational studies is to argue that identification is plausible conditional on a possibly large set of covariates (Rosenbaum & Rubin, 1983). Specifically, the investigator chooses a set of covariates $X_{1:p}$ and assumes the unconfoundedness condition below.

###### Assumption 1 (Unconfoundedness).

$(Y(0), Y(1)) \perp\!\!\!\perp T \mid X_{1:p}$.

Assumption 1 ensures

$$\tau_{ATE} = E\big[E[Y(1) \mid X_{1:p}] - E[Y(0) \mid X_{1:p}]\big] = E\big[E[Y^{obs} \mid T = 1, X_{1:p}] - E[Y^{obs} \mid T = 0, X_{1:p}]\big]. \quad (1)$$

Importantly, the conditional expectations in (1) are non-parametrically identifiable only if the following population overlap assumption is satisfied. Let $e(X_{1:p}) := P(T = 1 \mid X_{1:p})$ be the propensity score.

###### Assumption 2 (Population overlap).

$0 < e(X_{1:p}) < 1$ with probability 1.

Assumption 2 is sufficient for non-parametric identification of $\tau_{ATE}$, but is not sufficient for efficient semiparametric estimation of $\tau_{ATE}$, a fact we discuss in further detail in the next section. For this reason, investigators typically invoke a stronger variant of Assumption 2, which we call the strict overlap assumption.

###### Assumption 3 (Strict overlap).

For some constant $\eta \in (0, 0.5)$, $\eta \le e(X_{1:p}) \le 1 - \eta$ with probability 1.

We call $\eta$ the bound of the strict overlap assumption. The implications of the strict overlap assumption are the primary focus of this paper.

### 2.2 Necessity of Strict Overlap

Strict overlap is a necessary condition for the existence of regular semiparametric estimators of $\tau_{ATE}$ that are uniformly $\sqrt{n}$-consistent over a nonparametric model family (Khan & Tamer, 2010). Many estimators in this class have recently been proposed or modified to operate in high-dimensional settings (van der Laan & Rose, 2011; Chernozhukov et al., 2016). These estimators are regular in that they are $\sqrt{n}$-consistent and asymptotically normal along any sequence of parametric models that approach the true data-generating process.

All regular semiparametric estimators are subject to the following asymptotic variance lower bound, known as the semiparametric efficiency bound (Hahn, 1998; Crump et al., 2009):

$$V_{eff} = n^{-1/2} \cdot E\left[\frac{\mathrm{var}(Y(1) \mid X_{1:p})}{e(X_{1:p})} + \frac{\mathrm{var}(Y(0) \mid X_{1:p})}{1 - e(X_{1:p})} + \big(\tau(X_{1:p}) - \tau_{ATE}\big)^2\right], \quad (2)$$

where $\tau(X_{1:p}) := E[Y(1) - Y(0) \mid X_{1:p}]$ is the conditional average treatment effect. Since the propensity score appears in the denominators, these terms are bounded only if strict overlap holds. If it does not, there exists a parametric submodel for which this lower bound diverges, so no uniform guarantees of $\sqrt{n}$-consistency are possible (Khan & Tamer, 2010).
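To see this sensitivity concretely, the short sketch below tabulates the per-unit contribution to the integrand of (2) under hypothetical homoskedastic unit outcome variances, ignoring the $(\tau(X_{1:p}) - \tau_{ATE})^2$ term (the numbers are illustrative, not from the paper):

```python
# Per-unit contribution to the integrand of the efficiency bound (2),
# assuming (hypothetically) var(Y(0)|X) = var(Y(1)|X) = 1 and
# ignoring the (tau(X) - tau_ATE)^2 term.
contrib = {e: 1.0 / e + 1.0 / (1.0 - e) for e in [0.5, 0.1, 0.01, 0.001]}
for e, v in contrib.items():
    print(f"e = {e:5.3f}: variance contribution = {v:10.2f}")
```

A unit with propensity 0.001 contributes roughly 250 times more variance than a unit with propensity 0.5, which is why estimators built on this bound degrade quickly as overlap weakens.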

For these estimators, strict overlap is necessary because the guarantees are made under nonparametric modeling assumptions. $\tau_{ATE}$ can be estimated efficiently under weaker overlap conditions if one is willing to make assumptions about the outcome model or the conditional average treatment effect surface $\tau(X_{1:p})$. We discuss these assumption trade-offs in more detail in § 4.2.

###### Remark 1 (Strict overlap for other treatment effects).

$\tau_{ATE}$ can be decomposed into two parts: the average treatment effect on the treated,

$$\tau_{ATT} := E[Y(1) - Y(0) \mid T = 1],$$

and the average treatment effect on the control,

$$\tau_{ATC} := E[Y(1) - Y(0) \mid T = 0].$$

Letting $\pi := P(T = 1)$ be the marginal probability of treatment, these are related to the ATE by $\tau_{ATE} = \pi \tau_{ATT} + (1 - \pi) \tau_{ATC}$. In some cases, $\tau_{ATT}$ or $\tau_{ATC}$ are of independent interest.

$\tau_{ATT}$ and $\tau_{ATC}$ have weaker, one-sided strict overlap requirements for identification and estimation. In particular, a $\sqrt{n}$-consistent regular semiparametric estimator of $\tau_{ATT}$ exists only if $e(X_{1:p}) \le 1 - \eta$ with probability 1, and of $\tau_{ATC}$ only if $\eta \le e(X_{1:p})$ with probability 1. Many of the results that we present here can be adapted to the $\tau_{ATT}$ or $\tau_{ATC}$ cases.

## 3 Implications of Strict Overlap

### 3.1 Framework

In this section, we show that strict overlap restricts the overall discrepancy between the treated and control covariate measures, and that this restriction becomes more binding as the dimension $p$ increases. Formally, we write the control and treatment measures for covariates, for all measurable sets $A$, as:

$$P_0(X_{1:p} \in A) := P(X_{1:p} \in A \mid T = 0), \qquad P_1(X_{1:p} \in A) := P(X_{1:p} \in A \mid T = 1).$$

Let $\pi := P(T = 1)$ be the marginal probability that any unit is assigned to treatment. For the remainder of the paper, we will assume that $0 < \pi < 1$. With a slight abuse of notation, the relationship between the marginal probability measure on covariates implied by the superpopulation distribution $P$ and the condition-specific probability measures $P_0$ and $P_1$ is given by the mixture

$$P(X_{1:p} \in A) = \pi P_1(X_{1:p} \in A) + (1 - \pi) P_0(X_{1:p} \in A).$$

We write the densities of $P_0$ and $P_1$ with respect to a common dominating measure as $p_0$ and $p_1$, the marginal probability measures of finite-dimensional covariate sets as $P_0(X_{1:p})$ and $P_1(X_{1:p})$, and the corresponding marginal densities as $p_0(X_{1:p})$ and $p_1(X_{1:p})$. When discussing density ratios, we will omit the dominating measure.

By Bayes' Theorem, Assumption 3 is equivalent to the following bound on the density ratio between $P_1$ and $P_0$, which we will refer to as a likelihood ratio:

$$b_{min} := \frac{1 - \pi}{\pi} \cdot \frac{\eta}{1 - \eta} \;\le\; \frac{dP_1(X_{1:p})}{dP_0(X_{1:p})} \;\le\; \frac{1 - \pi}{\pi} \cdot \frac{1 - \eta}{\eta} =: b_{max}. \quad (3)$$
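As a concrete check of this equivalence, the sketch below (a hypothetical five-point covariate distribution, not from the paper) computes the propensity score from $P_0$, $P_1$, and $\pi$ via Bayes' theorem, takes $\eta$ to be the tightest overlap bound the population satisfies, and verifies that every likelihood ratio falls inside $[b_{min}, b_{max}]$:

```python
# Hypothetical discrete covariate with 5 support points (illustrative values).
p0 = [0.30, 0.25, 0.20, 0.15, 0.10]  # density under control, P0
p1 = [0.10, 0.15, 0.20, 0.25, 0.30]  # density under treatment, P1
pi = 0.4                             # marginal treatment probability

# Propensity score via Bayes' theorem: e(x) = pi*p1(x) / (pi*p1(x) + (1-pi)*p0(x)).
e = [pi * a / (pi * a + (1 - pi) * b) for a, b in zip(p1, p0)]

# Tightest strict-overlap bound eta that this population satisfies.
eta = min(min(ex, 1 - ex) for ex in e)

# Likelihood-ratio bounds implied by eq. (3).
b_min = (1 - pi) / pi * eta / (1 - eta)
b_max = (1 - pi) / pi * (1 - eta) / eta

ratios = [a / b for a, b in zip(p1, p0)]
assert all(b_min - 1e-9 <= r <= b_max + 1e-9 for r in ratios)
print(f"eta = {eta:.3f}; ratios in [{min(ratios):.3f}, {max(ratios):.3f}] "
      f"within [b_min, b_max] = [{b_min:.3f}, {b_max:.3f}]")
```

The lower bound is attained exactly at the support point where the propensity score equals $\eta$, illustrating that (3) is tight.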

Implications of bounded likelihood ratios are well-studied in information theory (Hellman & Cover, 1970; Rukhin, 1993, 1997). Each of the results that follow is an application of a theorem due to Rukhin (1997), which relates likelihood ratio bounds of the form (3) to upper bounds on $f$-divergences measuring the discrepancy between the distributions $P_0$ and $P_1$. We include an adaptation of Rukhin's theorem in the appendix, as Theorem 2. We also derive additional implications of this result in the appendix.

In what follows, we explore the implications of Assumption 3 when there are many covariates. To do so, we set up an analytical framework in which the covariate sequence is a stochastic process $X_{1:\infty}$. For any single problem, the investigator selects a finite set of covariates $X_{1:p}$ from the infinite pool $X_{1:\infty}$. Importantly, this framework includes no notion of sample size because we are examining the population-level implications of an assumption about the population measure $P$. Our results are independent of the number of samples that an investigator might draw from this population.

###### Remark 2 (Strict Overlap and Gaussian Covariates).

While we focus on the implications of strict overlap in high dimensions, this assumption also has surprising implications in low dimensions. For example, if $X$ is one-dimensional and follows a Gaussian distribution under both $P_0$ and $P_1$, strict overlap implies that $P_0 = P_1$, or that the covariate is perfectly balanced. This is because if $P_0 \neq P_1$, the log-density ratio diverges for values of $X$ with large magnitude, implying that $e(X)$ can be arbitrarily close to 0 or 1 with positive probability. Similar results can be derived when $X$ is multi-dimensional. Thus, for Gaussian-distributed covariates, the implications of strict overlap are so strong that they are uninteresting. For this reason, we do not give any examples of the implications of the strict overlap assumption when the covariates are Gaussian-distributed.
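A small numerical sketch of the Gaussian case (hypothetical means $\mu_0 = 0$ and $\mu_1 = 0.5$, unit variances, $\pi = 0.5$; none of these values come from the paper): for any candidate bound $\eta$, the propensity score eventually exceeds $1 - \eta$ far enough in the tail, so strict overlap fails for every $\eta > 0$.

```python
import math

def normpdf(x, mu, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

pi = 0.5
mu0, mu1 = 0.0, 0.5  # hypothetical Gaussian means under P0 and P1

def e(x):
    # Propensity score via Bayes' theorem for the two-Gaussian mixture.
    num = pi * normpdf(x, mu1)
    return num / (num + (1 - pi) * normpdf(x, mu0))

# For any candidate bound eta, e(x) exceeds 1 - eta far enough in the tail,
# so strict overlap cannot hold with any eta > 0.
for eta in [0.1, 0.01, 0.001]:
    x = 0.0
    while e(x) <= 1 - eta:
        x += 1.0
    print(f"eta = {eta}: e({x:.0f}) = {e(x):.6f} > 1 - eta")
```

Here the log-density ratio is linear in $x$, so the violation occurs at a finite $x$ for every $\eta$; with unequal variances it is quadratic in $x$ and the violation occurs in both tails.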

### 3.2 Strict Overlap Implies Bounded Mean Discrepancy

We now turn to the main results of the paper, which give concrete implications of strict overlap. Here, we show that strict overlap implies a strong restriction on the discrepancy between the means of $P_0$ and $P_1$. In particular, when $p$ is large, strict overlap implies that either the covariates are highly correlated under both $P_0$ and $P_1$, or the average discrepancy in means across covariates is small.

We represent the expectations and covariance matrices of $X_{1:p}$ under $P_0$ and $P_1$ as follows:

$$\mu_{0,1:p} := (\mu_0^{(1)}, \ldots, \mu_0^{(p)}) := E_{P_0}[X_{1:p}], \qquad \Sigma_{0,1:p} := \mathrm{var}_{P_0}(X_{1:p}),$$
$$\mu_{1,1:p} := (\mu_1^{(1)}, \ldots, \mu_1^{(p)}) := E_{P_1}[X_{1:p}], \qquad \Sigma_{1,1:p} := \mathrm{var}_{P_1}(X_{1:p}).$$

We use $\|\cdot\|$ to denote the Euclidean norm of a vector, and $\|\cdot\|_{op}$ to denote the operator norm of a matrix.

###### Theorem 1.

Assumption 3 implies

$$\|\mu_{0,1:p} - \mu_{1,1:p}\| \le \min\left\{ \|\Sigma_{0,1:p}\|_{op}^{1/2} \cdot \sqrt{B_{\chi^2(1\|0)}},\; \|\Sigma_{1,1:p}\|_{op}^{1/2} \cdot \sqrt{B_{\chi^2(0\|1)}} \right\}, \quad (4)$$

where $B_{\chi^2(1\|0)}$ and $B_{\chi^2(0\|1)}$ are constants, defined in the appendix, that are free of $p$.

The proof is included in the Appendix.

Theorem 1 has strong implications when $p$ is large. These implications become apparent when we examine how much each covariate mean can differ, on average, under (4).

###### Corollary 1.

Assumption 3 implies

$$\frac{1}{p}\sum_{k=1}^{p} \big|\mu_0^{(k)} - \mu_1^{(k)}\big| \le p^{-1/2} \min\left\{ \|\Sigma_{0,1:p}\|_{op}^{1/2} \cdot \sqrt{B_{\chi^2(1\|0)}},\; \|\Sigma_{1,1:p}\|_{op}^{1/2} \cdot \sqrt{B_{\chi^2(0\|1)}} \right\}. \quad (5)$$

The mean discrepancy bounds in Theorem 1 and Corollary 1 depend on the operator norms of the covariance matrices $\Sigma_{0,1:p}$ and $\Sigma_{1,1:p}$. The operator norm is equal to the largest eigenvalue of the covariance matrix, and is a proxy for the degree to which the covariates $X_{1:p}$ are correlated. In particular, the operator norm is large relative to the dimension $p$ if and only if a large proportion of the variance in $X_{1:p}$ is contained in a low-dimensional projection of $X_{1:p}$. For example, in the cases where the components of $X_{1:p}$ are independent, or where $X_{1:p}$ are samples from a stationary ergodic process, the operator norm scales like a constant in $p$. On the other hand, in the case where the variance in $X_{1:p}$ is dominated by a low-dimensional latent factor model, the operator norm scales linearly in $p$. We treat these examples precisely in the appendix.

Corollary 1 establishes that strict overlap implies that the average mean discrepancy across covariates is not too large relative to the operator norms of the covariance matrices $\Sigma_{0,1:p}$ and $\Sigma_{1,1:p}$. When $p$ is large, these implications are strong. To explore this, let $X_{1:\infty}$ be a sequence of covariates such that, for each $p$, Assumption 3 holds with a fixed bound $\eta$. When the smaller operator norm grows more slowly than $p$, the bound in (5) converges to zero, implying that the covariate means are, on average, arbitrarily close to balance. On the other hand, for the bound to remain non-zero as $p$ grows large, both operator norms must grow at the same rate as $p$. This is a strong restriction on the covariance structure; it implies that all but a vanishing proportion of the variance in $X_{1:p}$ concentrates in a finite-dimensional subspace under both $P_0$ and $P_1$.
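The sketch below illustrates this dichotomy using an equicorrelation covariance matrix $(1 - \rho)I + \rho \mathbf{1}\mathbf{1}^\top$, whose largest eigenvalue $1 + (p - 1)\rho$ is available in closed form; the constant $B$ stands in for the $\chi^2$ bound in (5) (all values here are illustrative assumptions, not taken from the paper):

```python
import math

B = 1.0  # stand-in for the chi-square bound constant in (5); free of p

def avg_imbalance_bound(p, rho):
    # Operator norm of the equicorrelation matrix (1 - rho)*I + rho*J:
    # its largest eigenvalue is 1 + (p - 1)*rho.
    op_norm = 1 + (p - 1) * rho
    return p ** -0.5 * math.sqrt(op_norm) * math.sqrt(B)

for p in [10, 100, 1000, 10000]:
    indep = avg_imbalance_bound(p, 0.0)   # independent covariates: bound -> 0
    factor = avg_imbalance_bound(p, 0.5)  # factor-like correlation: bound -> sqrt(rho)
    print(f"p = {p:6d}: bound (independent) = {indep:.4f}, bound (correlated) = {factor:.4f}")
```

With independent covariates the operator norm is constant, so the average-imbalance bound shrinks like $p^{-1/2}$; with factor-like correlation the operator norm grows linearly in $p$ and the bound stabilizes at a nonzero constant.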

###### Remark 3.

Theorem 1 bounds the mean discrepancy of $X_{1:p}$, which extends to a bound on functional discrepancies of the form $|E_{P_1}[g(X_{1:p})] - E_{P_0}[g(X_{1:p})]|$ for any function $g$ that is measurable and square-integrable under $P_0$ or $P_1$. This result is of independent interest, and is included in the appendix.

### 3.3 Strict Overlap Restricts General Distinguishability

In addition to bounds on mean discrepancies, strict overlap also implies restrictions on more general discrepancies between $P_0$ and $P_1$. In this section, we present two additional results showing that strict overlap restricts how well the covariate distributions can be distinguished from each other.

First, we show that Assumption 3 restricts the extent to which $P_1$ can be distinguished from $P_0$ by any classifier or statistical test. Let $\phi$ be a classifier that maps from the covariate support to $\{0, 1\}$. We have the following upper bound on the accuracy of any arbitrary classifier when Assumption 3 holds.

###### Proposition 1.

Let $\phi$ be an arbitrary classifier of $P_1$ against $P_0$. Assumption 3 implies the following upper bound on the accuracy of $\phi$:

$$P(\phi(X_{1:p}) = T) \le 1 - \eta. \quad (6)$$
###### Proof.

Let

$$\tilde\phi(X_{1:p}) = I\{e(X_{1:p}) \ge 0.5\} \quad (7)$$

be the Bayes optimal classifier. The probability of a correct decision from the Bayes optimal classifier is

$$P(\tilde\phi(X_{1:p}) = T) := \int \max\{e(X_{1:p}), 1 - e(X_{1:p})\}\, dP. \quad (8)$$

Assumption 3 immediately implies $P(\tilde\phi(X_{1:p}) = T) \le 1 - \eta$. The conclusion follows because the Bayes optimal classifier has the highest accuracy among all classifiers based on the covariate set $X_{1:p}$ (Devroye et al., 1996, Theorem 2.1). ∎
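A quick Monte Carlo sketch of Proposition 1 (a hypothetical population in which the propensity score takes only the extreme allowed values $\eta$ and $1 - \eta$, so the bound is tight; this setup is illustrative, not from the paper):

```python
import random

random.seed(0)
eta = 0.2
n = 200_000

correct = 0
for _ in range(n):
    # Hypothetical bounded propensity: e(X) is either eta or 1 - eta.
    e = random.choice([eta, 1 - eta])
    t = 1 if random.random() < e else 0
    phi = 1 if e >= 0.5 else 0  # Bayes classifier I{e(X) >= 0.5}, as in (7)
    correct += (phi == t)

accuracy = correct / n
print(f"Bayes accuracy = {accuracy:.4f} <= 1 - eta = {1 - eta}")
```

The empirical accuracy concentrates at exactly $1 - \eta$, matching the upper bound in (6).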

By Proposition 1, strict overlap implies that there exists no consistent classifier of $P_1$ against $P_0$ in the large-$p$ limit.

###### Definition 1.

A classifier $\phi$ is $P$-consistent if and only if $P(\phi(X_{1:p}) = T) \to 1$ as $p$ grows large.

###### Corollary 2 (No Consistent Classifier).

Let $X_{1:\infty}$ be a sequence of covariates, and for each $p$, let $X_{1:p}$ be a finite subset. If Assumption 3 holds with fixed bound $\eta$ as $p$ grows large, there exists no $P$-consistent test of $P_1$ against $P_0$.

We can characterize the relationship between the dimension $p$ and the distinguishability of $P_1$ from $P_0$ non-asymptotically by examining the Kullback-Leibler divergence. The following result is a special case of Theorem 2, included in the appendix.

###### Proposition 2 (KL Divergence Bound).

The strict overlap assumption with bound $\eta$ implies the following two inequalities:

$$KL\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) \le \frac{(1 - b_{min})\, b_{max} \log b_{max} + (b_{max} - 1)\, b_{min} \log b_{min}}{b_{max} - b_{min}} =: B_{KL(1\|0)}, \quad (9)$$

$$KL\big(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\big) \le -\frac{(1 - b_{min}) \log b_{max} + (b_{max} - 1) \log b_{min}}{b_{max} - b_{min}} =: B_{KL(0\|1)}. \quad (10)$$

In the case of balanced treatment assignment, i.e., $\pi = 0.5$, $B_{KL(1\|0)}$ and $B_{KL(0\|1)}$ have a simple form:

$$B_{KL(1\|0)} = B_{KL(0\|1)} = (1 - 2\eta)\left|\log\frac{\eta}{1 - \eta}\right|.$$
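The balanced-case simplification can be checked numerically against the general bounds in (9) and (10), with $b_{min}$ and $b_{max}$ taken from (3) (a sketch; the $\eta$ values are arbitrary):

```python
import math

def kl_bounds(eta, pi):
    # Likelihood-ratio bounds from eq. (3).
    b_min = (1 - pi) / pi * eta / (1 - eta)
    b_max = (1 - pi) / pi * (1 - eta) / eta
    # General KL bounds from Proposition 2, eqs. (9) and (10).
    B_10 = ((1 - b_min) * b_max * math.log(b_max)
            + (b_max - 1) * b_min * math.log(b_min)) / (b_max - b_min)
    B_01 = -((1 - b_min) * math.log(b_max)
             + (b_max - 1) * math.log(b_min)) / (b_max - b_min)
    return B_10, B_01

for eta in [0.05, 0.1, 0.2, 0.4]:
    B_10, B_01 = kl_bounds(eta, pi=0.5)
    simple = (1 - 2 * eta) * abs(math.log(eta / (1 - eta)))
    assert abs(B_10 - simple) < 1e-9 and abs(B_01 - simple) < 1e-9
    print(f"eta = {eta}: B_KL = {simple:.4f}")
```

Note that the bound stays modest even for quite small $\eta$ (e.g., about 2.65 nats at $\eta = 0.05$), which is what makes it binding in high dimensions.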

Proposition 2 becomes more restrictive for larger values of $p$. This follows because neither bound in Proposition 2 depends on $p$, while the KL divergence is free to grow in $p$. In particular, by the so-called chain rule, the KL divergence can be expanded into a summation of $p$ non-negative terms (Cover & Thomas, 2005, Theorem 2.5.3):

$$KL\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) = \sum_{k=1}^{p} E_{P_1}\Big[KL\big(P_1(X^{(k)} \mid X_{1:k-1}) \,\|\, P_0(X^{(k)} \mid X_{1:k-1})\big)\Big]. \quad (11)$$

Each term in (11) is the expected KL divergence between the conditional distributions of the $k$th covariate under $P_1$ and $P_0$, after conditioning on all previous covariates $X_{1:k-1}$. Thus, each term represents the discriminating information added by $X^{(k)}$, beyond the information contained in $X_{1:k-1}$. In the large-$p$ limit, strict overlap implies that the average unique discriminating information contained in each covariate converges to zero.

###### Corollary 3.

Let $X_{1:\infty}$ be a sequence of covariates, and for each $p$, let $X_{1:p}$ be a finite subset of $X_{1:\infty}$. As $p$ grows large, strict overlap with fixed bound $\eta$ implies

$$\frac{1}{p}\sum_{k=1}^{p} E_{P_1}\Big[KL\big(P_1(X^{(k)} \mid X_{1:k-1}) \,\|\, P_0(X^{(k)} \mid X_{1:k-1})\big)\Big] = O(p^{-1}), \quad (12)$$

and likewise for the KL divergence evaluated in the opposite direction.

By Corollary 3, strict overlap implies that, on average, the conditional distributions of each covariate $X^{(k)}$, given all previous covariates $X_{1:k-1}$, are arbitrarily close to balance. In the special case where the covariates are mutually independent under both $P_0$ and $P_1$, Corollary 3 implies that, on average, the marginal treated and control distributions of each covariate are arbitrarily close to balance.
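To illustrate the independent case, the sketch below considers hypothetical independent Bernoulli covariates sharing a common per-coordinate imbalance $\delta$, so the chain rule makes the total KL divergence $p$ times the per-coordinate divergence; the largest $\delta$ consistent with the KL bound (here the balanced-case value at $\eta = 0.1$) shrinks toward zero as $p$ grows:

```python
import math

def kl_bern(q1, q0):
    # KL divergence between Bernoulli(q1) and Bernoulli(q0).
    return q1 * math.log(q1 / q0) + (1 - q1) * math.log((1 - q1) / (1 - q0))

eta = 0.1
B_KL = (1 - 2 * eta) * abs(math.log(eta / (1 - eta)))  # balanced-case KL bound

def max_delta(p):
    # Largest common per-coordinate imbalance delta such that p independent
    # Bernoulli(0.5 + delta) vs Bernoulli(0.5) coordinates still satisfy the
    # total KL bound p * kl_bern(...) <= B_KL; found by bisection.
    lo, hi = 0.0, 0.49
    for _ in range(60):
        mid = (lo + hi) / 2
        if p * kl_bern(0.5 + mid, 0.5) <= B_KL:
            lo = mid
        else:
            hi = mid
    return lo

for p in [10, 100, 1000, 10000]:
    print(f"p = {p:6d}: max per-covariate imbalance = {max_delta(p):.4f}")
```

Since the per-coordinate KL is roughly $2\delta^2$ for small $\delta$, the feasible imbalance shrinks like $p^{-1/2}$, mirroring Corollary 1.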

## 4 Strict Overlap and Modeling Assumptions

### 4.1 Treatment Models: Strict Overlap with Fewer Implications

In this section, we discuss how the implications of strict overlap align with common modeling assumptions about the treatment assignment mechanism in a study. We show that certain modeling assumptions already impose many of the constraints that strict overlap implies. Thus, if one is willing to accept these modeling assumptions, strict overlap has fewer unique implications.

We will focus specifically on the class of modeling assumptions that assert that the propensity score is only a function of a sufficient summary $b(X_{1:p})$ of the covariates. In this case, overlap in the summary $b(X_{1:p})$ implies overlap in the full set of covariates $X_{1:p}$. Models in this class include sparse models and latent variable models.

###### Assumption 4 (Sufficient Condition for Strict Overlap).

There exists some function $b(X_{1:p})$ of the covariates satisfying the following two conditions:

$$X_{1:p} \perp\!\!\!\perp T \mid b(X_{1:p}), \quad (13)$$
$$\eta \le e_b(X_{1:p}) := P(T = 1 \mid b(X_{1:p})) \le 1 - \eta. \quad (14)$$

Here, the variable $b(X_{1:p})$ is a balancing score, introduced by Rosenbaum & Rubin (1983). $b(X_{1:p})$ is a sufficient summary of the covariates for the treatment assignment because the propensity score can be written as a function of $b(X_{1:p})$ alone, i.e., there exists some function $h$ such that

$$e(X_{1:p}) = h(b(X_{1:p})).$$

Overlap in a balancing score $b(X_{1:p})$ is a sufficient condition for overlap in the entire covariate set $X_{1:p}$.

###### Proposition 3 (Sufficient Condition Statement).

Assumption 4 implies strict overlap in $X_{1:p}$ with bound $\eta$.

###### Proof.

Note that $e(X_{1:p}) = e_b(X_{1:p})$ by (13). Then $\eta \le e_b(X_{1:p}) \le 1 - \eta$ w.p. 1 implies $\eta \le e(X_{1:p}) \le 1 - \eta$ w.p. 1. ∎

Assumption 4 has some trivial specifications, which are useful examples. At one extreme, we may specify that $b(X_{1:p}) = X_{1:p}$. In this case, Assumption 4 is vacuous: it puts no restrictions on the form of the propensity score, and strict overlap with respect to $b(X_{1:p})$ is equivalent to strict overlap. At the other extreme, we may specify $b(X_{1:p})$ to be a constant; i.e., we assume that the data were generated from a randomized trial. In this case, the overlap condition in Assumption 4 holds automatically.

Of particular interest are restrictions on $b(X_{1:p})$ between these two extremes, such as the sparse propensity score model in Example 1 below. Such restrictions trade stronger modeling assumptions on the propensity score for weaker implications of strict overlap. Importantly, these specifications exclude cases such as deterministic treatment rules: even when the covariates are high-dimensional, the information they contain about the treatment assignment is upper bounded by the information contained in $b(X_{1:p})$.

###### Example 1 (Sparse Propensity Score).

Consider a study where the propensity score is sparse in the covariate set $X_{1:p}$, so that for some fixed subset of covariates $X_{1:s}$ with $s < p$,

$$e(X_{1:p}) = e(X_{1:s}).$$

This implies

$$X_{1:p} \perp\!\!\!\perp T \mid X_{1:s}, \quad (15)$$

and $X_{1:s}$ is a balancing score. In this case, strict overlap in the finite-dimensional $X_{1:s}$ implies strict overlap for $X_{1:p}$.

Belloni et al. (2013) and Farrell (2015) propose a specification similar to this, with an “approximately sparse” specification for the propensity score. The approximately sparse specification in these papers is broader than the model defined here, but has similar implications for overlap.

###### Example 2 (Latent Variable Model for Propensity Score).

Consider a study where the treatment assignment mechanism is only a function of some latent variable $U$, such that $T \perp\!\!\!\perp X_{1:p} \mid U$. For example, such a structure exists in cases where treatment is assigned only as a function of a latent class or latent factor. Letting $e(U) := P(T = 1 \mid U)$, the projection of $e(U)$ onto $X_{1:p}$ is a balancing score:

$$b(X_{1:p}) = E[e(U) \mid X_{1:p}]. \quad (16)$$

Because of (16), strict overlap in the latent variable $U$ implies strict overlap in $b(X_{1:p})$, which implies strict overlap in $X_{1:p}$ by Proposition 3.

Athey et al. (2016) propose a specification similar to this in their simulations, in which the propensity score is dense with respect to observable covariates but can be specified simply in terms of a latent class.

###### Proof of (16).

To begin, note that $e(X_{1:p}) = P(T = 1 \mid X_{1:p}) = E[P(T = 1 \mid U, X_{1:p}) \mid X_{1:p}]$ by iterated expectations.

Now, we show that $P(T = 1 \mid U, X_{1:p}) = e(U)$. First, $T \perp\!\!\!\perp X_{1:p} \mid U$, so conditioning on $X_{1:p}$ in addition to $U$ does not change the probability of treatment. Second, $P(T = 1 \mid U) = e(U)$ by definition.

Thus, $e(X_{1:p}) = E[e(U) \mid X_{1:p}] = b(X_{1:p})$, so $b(X_{1:p})$ is the propensity score itself and is therefore a balancing score. ∎

The modeling assumptions discussed in this section can complicate the unconfoundedness assumption. In particular, if the treatment assignment mechanism admits a non-trivial sufficient summary $b(X_{1:p})$, then $\tau_{ATE}$ is only identified if unconfoundedness holds with respect to the sufficient summary alone (Rosenbaum & Rubin, 1983). Thus, simultaneously assuming unconfoundedness and a model of sufficiency on the treatment assignment mechanism indirectly imposes structure on the confounders, which may not be plausible.

### 4.2 Outcome Models: Efficient Estimation with Weaker Overlap

The average treatment effect can be estimated efficiently under weaker overlap conditions if one is willing to make structural assumptions about the data generating process. For example, if one assumes that the conditional expectation of the outcomes belongs to a restricted class, Hansen (2008) established that $\tau_{ATE}$ can be estimated under Assumption 1 and the following assumption.

###### Assumption 5 (Prognostic Identification).

There exists some function $r(X_{1:p})$ satisfying the following two conditions:

$$(Y(0), Y(1)) \perp\!\!\!\perp X_{1:p} \mid r(X_{1:p}), \quad (17)$$
$$\eta \le e_r(X_{1:p}) := P(T = 1 \mid r(X_{1:p})) \le 1 - \eta. \quad (18)$$

Modifying Hansen (2008)'s nomenclature slightly, we call $r(X_{1:p})$ a prognostic score. The assumption of strict overlap in a prognostic score in (18) is often weaker than Assumption 3. van der Laan & Gruber (2010) and Luo et al. (2017) propose methodology designed to exploit this sort of structure.

One can also weaken overlap requirements by imposing modeling assumptions on the outcome process via the conditional average treatment effect $\tau(X_{1:p})$. If $\tau(X_{1:p})$ is assumed constant, for example, as in the partially linear model (Belloni et al., 2014; Farrell, 2015), then estimation of $\tau_{ATE}$ only requires that strict overlap hold with positive probability, rather than with probability 1.

###### Assumption 6 (Strict Overlap with Positive Probability).

For some $\delta > 0$,

$$P\big(\eta \le e(X_{1:p}) \le 1 - \eta\big) > \delta. \quad (19)$$

Here, Assumption 6 is sufficient because the constant treatment effect assumption justifies extrapolation from subpopulations where the treatment effect can be estimated to other subpopulations for which strict overlap may fail. The constant treatment effect assumption can also be used to justify trimming strategies, which we discuss in more detail in § 5.2.

## 5 Discussion: Implications for Practice

### 5.1 Empirical Extensions

Strict overlap has observable, testable implications for any fixed overlap bound $\eta$. Of particular interest is the most favorable overlap bound compatible with the study population:

$$\eta^* := \sup_{\eta \in (0, 0.5)} \big\{\eta : \eta \le e(X_{1:p}) \le 1 - \eta \text{ with probability } 1\big\},$$

which enters into worst-case calculations of the variance of semiparametric estimators. By testing whether the implications of strict overlap hold in a given study, we can obtain estimates of $\eta^*$ with one-sided confidence guarantees. We describe this approach in detail in a separate paper.

### 5.2 Trimming

When Assumption 3 does not hold, one can still estimate an average treatment effect within a subpopulation in which strict overlap does hold. This motivates the common practice of trimming, where the investigator drops observations in regions without overlap (Rosenbaum & Rubin, 1983; Petersen et al., 2012). In general, trimming changes the estimand unless additional structure, such as a constant treatment effect, is imposed on the conditional treatment effect surface $\tau(X_{1:p})$.

Our results suggest that trimming may need to be employed more often when the covariate dimension is large, especially in cases where overlap violations result from small imbalances accumulated over many dimensions. In these cases, trimming procedures may have undesirable properties for the same reason that strict overlap does not hold. For example, in high dimensions, one may need to trim a large proportion of units to achieve desirable overlap in the new target subpopulation. The proportion of units that can be retained under a trimming policy designed to achieve overlap bound $\tilde\eta$ is related to the accuracy of the Bayes optimal classifier $\tilde\phi$ in (7) by the following proposition.

###### Proposition 4.
$$P\big(\tilde\eta \le e(X_{1:p}) \le 1 - \tilde\eta\big) \le \big[1 - P(\tilde\phi(X_{1:p}) = T)\big]\big/\tilde\eta. \quad (20)$$
###### Proof.

Define the event $A := \{\tilde\eta \le e(X_{1:p}) \le 1 - \tilde\eta\}$. The conclusion follows from

$$P(\tilde\phi(X_{1:p}) \ne T) \ge P(A)\, P\big(\tilde\phi(X_{1:p}) \ne T \mid A\big) \ge P(A)\, \tilde\eta. \qquad ∎$$

When large covariate sets enable units to be more accurately classified into treatment and control, the probability that a unit has an acceptable propensity score becomes small. In this case, a trimming procedure must throw away a large proportion of the sample. In the large-$p$ limit, if the Bayes optimal classifier is consistent in the sense of Definition 1, then the expected proportion of the sample that must be discarded to achieve any $\tilde\eta > 0$ approaches 1.
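A Monte Carlo sketch of this phenomenon (a hypothetical logistic propensity model whose per-covariate signal is held fixed as $p$ grows; none of these parameter values come from the paper):

```python
import math, random

random.seed(1)
eta = 0.1
beta = 0.3  # hypothetical per-covariate signal, held fixed as p grows

def retained_fraction(p, n=100_000):
    # e(X) = sigmoid(beta * (X_1 + ... + X_p)) with X_k iid N(0, 1), so the
    # linear score is N(0, beta^2 * p); we sample the score directly.
    sd = beta * math.sqrt(p)
    kept = 0
    for _ in range(n):
        e = 1 / (1 + math.exp(-random.gauss(0, sd)))
        kept += (eta <= e <= 1 - eta)
    return kept / n

fractions = {p: retained_fraction(p) for p in [1, 10, 100, 1000]}
for p, f in fractions.items():
    print(f"p = {p:5d}: fraction retained by trimming to [{eta}, {1 - eta}] = {f:.3f}")
```

As $p$ grows, the propensity scores pile up near 0 and 1, and the fraction of the population surviving a trimming rule targeting $[\eta, 1-\eta]$ drops toward zero, consistent with Proposition 4.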

This fact motivates methods beyond trimming that modify the covariates, rather than the sample, to eliminate information about the treatment assignment mechanism, while still maintaining unconfoundedness. Such methods would generalize the advice to eliminate instrumental variables from adjusting covariates (Myers et al., 2011; Pearl, 2010; Ding et al., 2017).

## Appendix A Strict Overlap Implies Bounded f-Divergences

Here, we adapt a theorem from information theory, due to Rukhin (1997), to derive general implications of strict overlap. The theorem states that a likelihood ratio bound of the form (3) implies upper bounds on $f$-divergences between $P_0$ and $P_1$.

$f$-divergences are a family of discrepancy measures between probability distributions, defined in terms of a convex function $f$ (Csiszár, 1963; Ali & Silvey, 1966; Liese & Vajda, 2006). Formally, the $f$-divergence from a probability measure $Q_0$ to another measure $Q_1$ is defined as

$$D_f\big(Q_1(X_{1:p}) \,\|\, Q_0(X_{1:p})\big) := E_{Q_0}\left[f\left(\frac{dQ_1(X_{1:p})}{dQ_0(X_{1:p})}\right)\right]. \quad (21)$$

$f$-divergences are non-negative, achieve a minimum when $Q_1 = Q_0$, and are, in general, asymmetric in their arguments. Common examples of $f$-divergences include the Kullback-Leibler divergence, with $f(t) = t \log t$, and the $\chi^2$- or Pearson divergence, with $f(t) = (t - 1)^2$. Here, we restate Rukhin's theorem in terms of strict overlap and the bounds $b_{min}$ and $b_{max}$ defined in (3).

###### Theorem 2.

Let $D_f$ be an $f$-divergence such that $f$ has a minimum at $1$. Assumption 3 implies

$$D_f\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) \le \frac{b_{max} - 1}{b_{max} - b_{min}}\, f(b_{min}) + \frac{1 - b_{min}}{b_{max} - b_{min}}\, f(b_{max}), \quad (22)$$

$$D_f\big(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\big) \le \frac{b_{min}^{-1} - 1}{b_{min}^{-1} - b_{max}^{-1}}\, f(b_{max}^{-1}) + \frac{1 - b_{max}^{-1}}{b_{min}^{-1} - b_{max}^{-1}}\, f(b_{min}^{-1}). \quad (23)$$

###### Proof.

Theorem 2.1 of Rukhin (1997) shows that the likelihood ratio bound in (3) implies the bounds in (22) and (23) when $f$ has a minimum at $1$ and is "bowl-shaped", i.e., non-increasing below $1$ and non-decreasing above $1$. The "bowl-shaped" constraint is satisfied because $f$ is convex with a minimum at $1$. ∎

## Appendix B Proof of Theorem 1

### B.1 Strict Overlap Implies Bounded Functional Discrepancy Using the χ²-Divergence

The proof of Theorem 1 follows from several steps, each of which is of independent interest.

Here, we apply Theorem 2 to show that strict overlap implies an upper bound on functional discrepancies of the form

$$\big|E_{P_0}[g(X_{1:p})] - E_{P_1}[g(X_{1:p})]\big| \quad (24)$$

for any function $g$ that is measurable and square-integrable under $P_0$ and $P_1$. This result plays a key role in the proof of Theorem 1, but is general enough to be of independent interest.

We establish this bound by applying Theorem 2 to the special case of the $\chi^2$-divergence:

$$\chi^2\big(Q_1(X_{1:p}) \,\|\, Q_0(X_{1:p})\big) := E_{Q_0}\left[\left(\frac{dQ_1(X_{1:p})}{dQ_0(X_{1:p})} - 1\right)^2\right]. \quad (25)$$

Strict overlap implies the following bound on the $\chi^2$-divergence.

###### Corollary 4.

Assumption 3 implies

$$\chi^2\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) \le (1 - b_{min})(b_{max} - 1) =: B_{\chi^2(1\|0)}, \quad (26)$$
$$\chi^2\big(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\big) \le (1 - b_{max}^{-1})(b_{min}^{-1} - 1) =: B_{\chi^2(0\|1)}. \quad (27)$$

In the case of balanced treatment assignment, i.e., $\pi = 0.5$, $B_{\chi^2(1\|0)}$ and $B_{\chi^2(0\|1)}$ have a simple form:

$$B_{\chi^2(1\|0)} = B_{\chi^2(0\|1)} = \frac{(1 - 2\eta)^2}{\eta(1 - \eta)}.$$
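At $\pi = 0.5$, expanding (26) and (27) with $b_{min} = \eta/(1-\eta)$ and $b_{max} = (1-\eta)/\eta$ gives the common value $(1 - 2\eta)^2 / (\eta(1 - \eta))$ for both bounds; a quick numerical check of this simplification (a sketch with arbitrary $\eta$ values):

```python
def chi2_bounds(eta, pi):
    # Likelihood-ratio bounds from eq. (3).
    b_min = (1 - pi) / pi * eta / (1 - eta)
    b_max = (1 - pi) / pi * (1 - eta) / eta
    B_10 = (1 - b_min) * (b_max - 1)          # eq. (26)
    B_01 = (1 - 1 / b_max) * (1 / b_min - 1)  # eq. (27)
    return B_10, B_01

for eta in [0.05, 0.1, 0.25]:
    B_10, B_01 = chi2_bounds(eta, pi=0.5)
    simple = (1 - 2 * eta) ** 2 / (eta * (1 - eta))
    assert abs(B_10 - simple) < 1e-9 and abs(B_01 - simple) < 1e-9
    print(f"eta = {eta}: B_chi2 = {simple:.4f}")
```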

We now apply Corollary 4 to show that strict overlap implies an explicit upper bound on functional discrepancies of form (24).

###### Corollary 5.

Assumption 3 implies

$$\big|E_{P_1}[g(X_{1:p})] - E_{P_0}[g(X_{1:p})]\big| \le \min\left\{ \sqrt{\mathrm{var}_{P_0}(g(X_{1:p}))} \cdot \sqrt{B_{\chi^2(1\|0)}},\; \sqrt{\mathrm{var}_{P_1}(g(X_{1:p}))} \cdot \sqrt{B_{\chi^2(0\|1)}} \right\}. \quad (28)$$
###### Proof.

By the Cauchy-Schwarz inequality, (24) has the following upper bound:

$$\big|E_{P_1}[g(X_{1:p})] - E_{P_0}[g(X_{1:p})]\big| = \left|E_{P_0}\left[(g(X_{1:p}) - C) \cdot \left(\frac{dP_1(X_{1:p})}{dP_0(X_{1:p})} - 1\right)\right]\right| \quad (29)$$
$$\le \|g(X_{1:p}) - C\|_{P_0,2} \cdot \sqrt{\chi^2\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big)}, \quad (30)$$

for any finite constant $C$, where $\|\cdot\|_{P_0,2}$ denotes the $L_2$-norm of a function under the measure $P_0$. A similar bound holds with respect to the $\chi^2$-divergence evaluated in the opposite direction.

Let $C = E_{P_0}[g(X_{1:p})]$, then apply (30) and Corollary 4. Do the same for $C = E_{P_1}[g(X_{1:p})]$, with the roles of $P_0$ and $P_1$ exchanged. ∎
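The Cauchy-Schwarz step in (29)–(30) can be checked exactly on small discrete distributions (hypothetical values, chosen only for illustration):

```python
import math

# Hypothetical discrete distributions and test function g.
p0 = [0.25, 0.25, 0.25, 0.25]
p1 = [0.10, 0.20, 0.30, 0.40]
g = [1.0, 3.0, -2.0, 5.0]

E0 = sum(q * x for q, x in zip(p0, g))
E1 = sum(q * x for q, x in zip(p1, g))
var0 = sum(q * (x - E0) ** 2 for q, x in zip(p0, g))

# chi-square divergence chi2(P1 || P0) = E_P0[(dP1/dP0 - 1)^2].
chi2_10 = sum(q0 * (q1 / q0 - 1) ** 2 for q0, q1 in zip(p0, p1))

lhs = abs(E1 - E0)
rhs = math.sqrt(var0) * math.sqrt(chi2_10)
assert lhs <= rhs
print(f"|E1 g - E0 g| = {lhs:.4f} <= sqrt(var0 * chi2) = {rhs:.4f}")
```

Choosing $C = E_{P_0}[g]$ turns the $L_2$-norm in (30) into the standard deviation of $g$ under $P_0$, which is what the right-hand side computes.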

Corollary 5 remains valid even when $g$ is not square-integrable under $P_0$ or $P_1$; in this case, the right-hand side of (28) is infinite, and the inequality holds automatically.

### B.2 Proof of Theorem 1

Theorem 1 is a special case of Corollary 5. In particular, let $g(X_{1:p}) = v^\top X_{1:p}$, where $v := (\mu_{0,1:p} - \mu_{1,1:p}) / \|\mu_{0,1:p} - \mu_{1,1:p}\|$ is a vector of unit length, and apply Corollary 5. $\mathrm{var}_{P_0}(v^\top X_{1:p})$ is upper-bounded by $\|\Sigma_{0,1:p}\|_{op}$ by definition, and likewise $\mathrm{var}_{P_1}(v^\top X_{1:p})$ is upper-bounded by $\|\Sigma_{1,1:p}\|_{op}$. The result follows.

## Appendix C Other implications of strict overlap

The decomposition in (29) can be used to construct additional upper bounds on the mean discrepancy in $X_{1:p}$ using Hölder's inequality in combination with upper bounds on $\chi^\alpha$-divergences (Vajda, 1973). These bounds can be tighter in terms of $\eta$, but are functions of higher-order moments of $g(X_{1:p})$. Formally, $\chi^\alpha$-divergences are a class of divergences that generalize the $\chi^2$-divergence (Vajda, 1973):

$$\chi^\alpha\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) := E_{P_0}\left[\left|\frac{dP_1(X_{1:p})}{dP_0(X_{1:p})} - 1\right|^\alpha\right] \quad \text{for } \alpha \ge 1. \quad (31)$$

The divergence in the opposite direction is obtained by switching the roles of $P_0$ and $P_1$.

Theorem 2.1 of Rukhin (1997) implies that, under strict overlap with bound $\eta$,

$$\chi^\alpha\big(P_1(X_{1:p}) \,\|\, P_0(X_{1:p})\big) \le \frac{(b_{max} - 1)(1 - b_{min})\big[(1 - b_{min})^{\alpha - 1} + (b_{max} - 1)^{\alpha - 1}\big]}{b_{max} - b_{min}},$$
$$\chi^\alpha\big(P_0(X_{1:p}) \,\|\, P_1(X_{1:p})\big) \le \frac{(b_{min}^{-1} - 1)(1 - b_{max}^{-1})\big[(1 - b_{max}^{-1})^{\alpha - 1} + (b_{min}^{-1} - 1)^{\alpha - 1}\big]}{b_{min}^{-1} - b_{max}^{-1}}.$$

We denote these bounds as $B_{\chi^\alpha(1\|0)}$ and $B_{\chi^\alpha(0\|1)}$, respectively.

Applying Hölder's inequality to (29), we obtain

$$\big|E_{P_1}[g(X_{1:p})] - E_{P_0}[g(X_{1:p})]\big| \le \min\left\{ \|g(X_{1:p}) - C\|_{P_0, q_\alpha} \cdot B_{\chi^\alpha(1\|0)}^{1/\alpha},\; \|g(X_{1:p}) - C\|_{P_1, q_\alpha} \cdot B_{\chi^\alpha(0\|1)}^{1/\alpha} \right\},$$

where $q_\alpha := \alpha / (\alpha - 1)$ is the Hölder conjugate of $\alpha$. Setting $C$ to the mean of $g(X_{1:p})$ under the corresponding measure establishes a relationship between the $q_\alpha$th central moment of $g(X_{1:p})$ and the functional discrepancy between $P_0$ and $P_1$. For small values of $\eta$, this bound scales as $\eta^{-(\alpha - 1)/\alpha}$, whereas (28) scales as $\eta^{-1/2}$.

## Appendix D Operator Norm

The behavior of the bounds in Theorem 1 and Corollary 1