Variance Reduction for Experiments with One-Sided Triggering using CUPED

12/25/2021
by Alex Deng, et al.
Airbnb, Inc.

In online experimentation, trigger-dilute analysis is an approach to obtain more precise estimates of intent-to-treat (ITT) effects when the intervention is only exposed, or "triggered", for a small subset of the population. Trigger-dilute analysis cannot be used for estimation when triggering is only partially observed. In this paper, we propose an unbiased ITT estimator with reduced variance for cases where triggering status is only observed in the treatment group. Our method is based on the efficiency augmentation idea of CUPED and draws upon identification frameworks from the principal stratification and instrumental variables literature. The unbiasedness of our estimation approach relies on a testable assumption that an augmentation term used for covariate adjustment equals zero in expectation. When this augmentation term fails a mean-zero test, we show how our estimator can incorporate in-experiment observations to reduce the augmentation's bias, at the cost of some of the variance reduction. This provides an explicit knob to trade off bias with variance. We demonstrate through simulations that our estimator can remain unbiased and achieve precision improvements as good as if triggering status were fully observed, and in some cases even outperform trigger-dilute analysis.


1. Introduction

In online A/B testing, treatment assignment typically occurs upstream of treatment exposure, and experimental subjects have to trigger an event to be exposed to the intervention of interest. Sometimes subjects do not reach that triggering point and therefore do not receive their assigned treatment condition. For example, a user might not reach the checkout page to trigger exposure to a change in payment options, or a user might not open an email to trigger exposure to a promotion code. The presence of triggering defines four sub-populations of interest, illustrated in the left panel of Figure 1: In the treatment group, there are those who triggered exposure to the active treatment condition ($T_1$) and those who did not trigger ($T_0$). Similarly, in the control group, there are those who triggered exposure to the control condition, meaning they would have triggered exposure to the active treatment had they been assigned to the treatment group ($C_1$), and those who did not trigger and would not have triggered even if assigned to treatment ($C_0$). In an experiment with triggering conditions, the naive difference-in-means between treatment and control outcomes, $\hat{\Delta}$, is an unbiased but non-optimal estimator for the overall ITT effect. A more precise estimator can be obtained via trigger-dilute analysis: estimate an average treatment effect for the triggered subjects ($T_1$ vs. $C_1$) and then multiply this triggered average treatment effect by the triggering rate. As a rough heuristic, trigger-dilute analysis can reduce sampling variance by a factor of the inverse of the triggering rate (e.g., approximately 20x variance reduction with a triggering rate of 5%). In this paper, we study the case where triggering status is not observed for control subjects, i.e., we do not observe $C_1$ and $C_0$ labels, so trigger-dilute analysis cannot be used. This one-sided triggering scenario, shown in Figure 1(b), is known as one-sided noncompliance in the causal inference literature (see Table 1 for a terminology mapping).

For one-sided triggering experiments, we propose an estimator that is unbiased for the overall ITT and has smaller variance than the popular difference-in-outcome-means estimator $\hat{\Delta}$. Our estimator is based on the fundamental idea behind CUPED (deng2013cuped), which is to engineer a mean-zero augmentation term $\hat{\Delta}_0$ by which we can further adjust the unbiased estimator, akin to covariate adjustment in a regression context. The key assumption we leverage is that there is no treatment effect for $T_0$ and $C_0$ subjects; these are trigger-complement subjects who would never trigger exposure to the intervention regardless of their study group assignment. We can therefore use observations from $T_0$ and $C_0$ to construct a mean-zero augmentation term $\hat{\Delta}_0$ for linear adjustment. (Because in practice we do not observe $C_0$ labels, construction of $\hat{\Delta}_0$ relies on reweighting the full control group to match $T_0$.) Since trigger-complements are a subset of the full experimental population, $\hat{\Delta}$ and $\hat{\Delta}_0$ will be correlated, so adjusting $\hat{\Delta}$ by $\hat{\Delta}_0$ with an appropriate scaling parameter will improve estimation efficiency compared to simply using $\hat{\Delta}$.

A/B Testing | Causal Inference
Treatment Assignment | Instrumental Variable
Triggering Counterfactual Status | Compliance Status (Principal Stratum)
Only triggered subjects can be affected by the treatment | Exclusion Restriction
Only the treatment group can receive the active treatment | Strong Monotonicity
Triggered Average Treatment Effect | Local Average Treatment Effect (Complier Average Treatment Effect)
Overall Average Treatment Effect | Intent-to-Treat Effect
T1, C1 (Triggered) | Compliers
T0, C0 (Trigger-Complement) | Never-Takers
Triggering Probability | Principal Score

Table 1. Terminology mapping between the A/B testing and causal inference literatures.
Figure 1. Illustration of two-sided triggering (a) and one-sided triggering (b) for the typical online experiment design where subjects assigned to control have no access to the active treatment condition. $T_1$ are subjects assigned to treatment who triggered exposure to the active treatment. $C_1$ are subjects assigned to control who triggered exposure to the control condition (and who would have triggered exposure to the active treatment if assigned to the treatment group). $T_0$ and $C_0$ are called the trigger-complement group and consist of subjects who would not trigger exposure to the active treatment regardless of which study group they are assigned to. In one-sided triggering, $C_1$ and $C_0$ labels are unobserved.

1.1. Setup and Notation

We consider a randomized experiment with binary treatment assignment $Z_i \in \{0, 1\}$. A binary label $S_i(z)$ indicates whether subject $i$ triggered exposure to the active treatment when assigned to study group $z$. For example, $S_i(0) = 1$ indicates that subject $i$ was assigned to the control group and triggered exposure to the active treatment. $(Y_i(0), Y_i(1))$ is the pair of counterfactual outcomes under assignment to control and treatment, respectively (Imbens:2015). Similarly, $(S_i(0), S_i(1))$ is the counterfactual pair of triggered exposure conditions under assignment to control and treatment (angrist1994; frangakis2002). We call $(S_i(0), S_i(1))$ the "triggering counterfactual status", or "triggering status" for short.

We focus on the typical case for online experiments where $S_i(0) = 0$ by design; that is, subjects assigned to control have no access to the active treatment and can only be exposed to the control condition. This allows us to simplify notation: We use $S_i = 1$ interchangeably with $S_i(1) = 1$ to denote subjects who would trigger exposure to the active treatment if assigned to the treatment group ($T_1$ and $C_1$). We use $S_i = 0$ interchangeably with $S_i(1) = 0$ to denote subjects who would not trigger exposure to the active treatment, regardless of their treatment assignment ($T_0$ and $C_0$). Mapping to Figure 1, $S_i = 1$ corresponds to the Triggered group while $S_i = 0$ corresponds to the Trigger-Complement group. $Y_i$ and $S_i$ refer to subject $i$'s observed outcome and observed triggering condition, respectively. We use subscripts $t$ and $c$ to denote treatment and control groups, so $\hat{\Delta} = \bar{Y}_t - \bar{Y}_c$ denotes the "delta" of the average observed outcome between the two study groups. The causal impact we wish to estimate is the overall ITT effect, $E[Y(1) - Y(0)]$.

1.2. Related Work

In the online A/B testing literature, trigger-dilute analysis is a popular approach for obtaining a more precise ITT estimate when only a small subset of the experimental population is exposed to the intervention (kohavi2020trustworthy; kohavi2007practical; deng2015diluted; abScale). First, an average treatment effect is estimated among the triggered subjects. This triggered average treatment effect is then multiplied ("diluted") by the triggering rate. The success of a trigger-dilute analysis depends on either a simple triggering condition, or a mechanism of counterfactual logging such that the experimentation system is able to compare the treatment vs. control experience at any time and label whether there is a realized difference in the study groups' experiences. Complex triggering conditions can easily lead to sample ratio mismatch, rendering the triggered analysis untrustworthy (fabijan2019diagnosing). Meanwhile, high-fidelity counterfactual logging is unwieldy and too expensive for many real applications. When triggering status is not observed for all units, trigger-dilute analysis cannot be applied.

To tackle scenarios with partially-observed triggering, we turn to the causal inference literature around noncompliance, which occurs when treatment exposure differs from treatment assignment for some units. Instrumental variables (IV) (angrist1994; angrist1996) and principal stratification (frangakis2002; ding2017principal; feller2017; jiang2020identification) frameworks provide strategies for identifying and estimating subgroup average treatment effects under a range of noncompliance conditions.

In the IV literature, for example, the causal quantity of interest is typically the local average treatment effect (LATE), which in our setting is equivalent to the average treatment effect for the triggered. The standard IV estimator for LATE is $\hat{\Delta}$ divided by the estimated proportion of Compliers (i.e., the triggering rate). Multiplying this IV estimator by the triggering rate therefore simply recovers $\hat{\Delta}$. As such, the standard IV approach is not aimed at producing a more precise estimate of the overall ITT. There are, however, extensions of the standard IV estimator that attempt to improve estimation efficiency. Weighted IV methods (coussens2021improving; HuntingtonKlein_2020; joffe2003weighting) use predicted compliance to weight both treatment and control groups when computing the LATE estimator. Our method is similar in that we also use predicted compliance or triggering probabilities, but whereas the weighted IV estimator is unbiased for LATE only when there is no correlation between treatment effect heterogeneity and compliance, our estimator is unbiased for ITT (and asymptotically unbiased for LATE after dividing by the triggering rate) as long as the augmentation term that we construct equals zero in expectation. Weighted IV and our method also have different variance reduction properties. To achieve variance reduction, weighted IV depends heavily on high prediction accuracy of triggering status, while our method can reduce variance even when triggering is completely random.

Beyond instrumental variables, the principal stratification literature generalizes identification and estimation strategies for causal effects under more complicated forms of noncompliance (ding2017principal; feller2017; yuan2019; jiang2020identification). We draw upon many ideas from the principal stratification literature to construct our proposed estimator. In particular, we invoke a key assumption called weak principal ignorability (jo_stuart2009; feller2017; ding2017principal) as a sufficient condition under which our estimator is unbiased for ITT. We also use principal scores (i.e., predicted triggering probabilities) to construct the augmentation term critical to our estimator.

More generally, there is a vast literature on using pre-assignment covariates for regression adjustment to increase estimation efficiency (fisher1925statistical; guo2021machine; poyarkov2016boosted; xie2016improving; lin2013agnostic; li_ding2020). Our method is based on the augmentation idea of CUPED (deng2013cuped; see also li_ding2020), applied to the one-sided triggering context, and is a general approach that can be used on top of any pre-assignment covariate regression adjustment. In particular, our approach has the flexibility to incorporate in-experiment observations for covariate adjustment without introducing bias, though there is a tradeoff in the amount of variance reduced.

1.3. Contribution and Organization

This paper makes the following contributions to the A/B testing community:

  1. We propose an unbiased ITT estimator with reduced variance for experiments with one-sided triggering. Our estimator relies on a testable assumption that an augmentation term used for covariate adjustment equals zero in expectation.

  2. We explain how to test for this mean-zero assumption. When the augmentation term fails a mean-zero test, we show how our estimator can incorporate in-experiment observations to reduce the augmentation's bias, at the cost of some of the variance reduction. This provides an explicit knob to trade off bias with variance. We believe this idea is novel and effective for many real applications.

  3. We study multiple flavors of augmentation and explain their differences in theory and through simulation studies.

The rest of the paper is organized as follows: Section 2 starts with a review of the general augmentation idea from deng2013cuped and an application of CUPED to two-sided triggering. This motivates our augmentation-based estimator for the case of one-sided triggering. Section 3 gives practical advice on how to test the mean-zero assumption that is crucial for the unbiasedness of our estimator. Section 4 presents a conceptual framework to guide variable selection when constructing the augmentation term, and explains how using in-experiment observations to achieve the mean-zero condition relates to a bias-variance tradeoff. Section 5 addresses estimation details behind the augmentation term that have practical implications for empirical variance reduction. We end with a simulation study in Section 6 that demonstrates the performance of our estimator under different specifications of the augmentation term.

2. Variance Reduction using CUPED

CUPED, an acronym for Controlled-experiment Using Pre-Experiment Data (deng2013cuped), is a variance reduction technique widely adopted in the A/B testing industry to improve the sensitivity of A/B tests (xie2016improving; kohavi2020trustworthy; poyarkov2016boosted). At its core, CUPED is an efficiency augmentation method applied on top of any existing unbiased estimator. If $\hat{\Delta}$ is an unbiased estimator for the ITT effect, the CUPED estimator is defined as

$\hat{\Delta}_{\text{cuped}} = \hat{\Delta} - \theta X$,    (1)

where $X$ is an augmentation such that $E[X] = 0$. The mean-zero requirement ensures that the CUPED estimator has the same expectation as the original estimator $\hat{\Delta}$, and is therefore also unbiased. Variance reduction is achieved when there is sufficient correlation between $\hat{\Delta}$ and $X$, since

$\mathrm{Var}(\hat{\Delta} - \theta X) = \mathrm{Var}(\hat{\Delta}) - 2\theta\,\mathrm{Cov}(\hat{\Delta}, X) + \theta^2 \mathrm{Var}(X) < \mathrm{Var}(\hat{\Delta})$

whenever $2\theta\,\mathrm{Cov}(\hat{\Delta}, X) > \theta^2 \mathrm{Var}(X)$. Moreover, for any fixed $c$, $cX$ is also a mean-zero augmentation, so we can solve for the $\theta$ that minimizes the variance of the augmented estimator:

$\theta^* = \arg\min_\theta \mathrm{Var}(\hat{\Delta} - \theta X)$.    (2)

deng2013cuped showed that this optimization has a solution similar to an ordinary least squares regression, with the optimal $\theta$ being $\mathrm{Cov}(\hat{\Delta}, X)/\mathrm{Var}(X)$, and the variance reduction rate equal to $\mathrm{Corr}(\hat{\Delta}, X)^2$. This can be generalized to multiple augmentations, where $X$ is a mean-zero vector and the optimal $\theta$ is the ordinary least squares solution of a multiple linear regression, giving a variance reduction rate equal to $R^2$, the coefficient of determination of the fitted regression. Despite its resemblance to regression adjustment in the typical case of using pre-assignment data to create an augmentation $X$, CUPED as an efficiency augmentation framework is more general and flexible than standard regression adjustment. We explain more in Section 2.2.
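To make the mechanics concrete, here is a minimal R sketch of CUPED with a single pre-experiment covariate (all variable names are illustrative, not from the paper): under i.i.d. sampling with equal per-arm covariances, the optimal $\theta = \mathrm{Cov}(\hat{\Delta}, X)/\mathrm{Var}(X)$ for the augmentation $X = \bar{x}_t - \bar{x}_c$ reduces to the pooled slope of $y$ on $x$.

# Minimal CUPED sketch (illustrative names): adjust the difference-in-means
# estimator with a mean-zero augmentation built from a pre-experiment covariate x.
cuped_estimate = function(y, x, z) {
  delta   = mean(y[z == 1]) - mean(y[z == 0])   # unbiased ITT estimator
  delta_x = mean(x[z == 1]) - mean(x[z == 0])   # mean-zero under randomization
  # For i.i.d. sampling with equal per-arm covariances, the optimal
  # theta = Cov(delta, delta_x) / Var(delta_x) reduces to the pooled slope:
  theta = cov(x, y) / var(x)
  delta - theta * delta_x
}

set.seed(1)
n = 1e4
x = rnorm(n)                       # pre-experiment covariate
z = rbinom(n, 1, 0.5)              # random assignment
y = 0.5 * x + 0.1 * z + rnorm(n)   # outcome with true ITT = 0.1
cuped_estimate(y, x, z)            # noticeably less noisy than the naive delta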

We end the review of CUPED with a few closing remarks and motivate the new ITT estimation method we describe in this paper. First, CUPED did not invent the idea of efficiency augmentation, but it provides a useful conceptual framework for identifying augmentations in the setting of online experimentation, where we can leverage knowledge of the randomization design. For example, randomization guarantees that the simple mean difference of any function $f$ of pre-assignment data between the study groups, $\overline{f(X)}_t - \overline{f(X)}_c$, equals zero in expectation and can therefore be used as a mean-zero augmentation. Second, with more knowledge of constraints in the data generating process (dgp), we can come up with other forms of augmentations. This paper proposes one form of augmentation that can be used when the dgp is constrained by one-sided triggering. Lastly, when we only have partial knowledge of the dgp and a mean-zero augmentation can only be achieved with extra assumptions, we need to test and validate the feasibility of the mean-zero assumption before applying CUPED.

2.1. CUPED for Completely Observed Triggering

To apply CUPED on experiments with triggering, we search for constraints in the data generating process that can be leveraged to construct mean-zero augmentations on top of the naive estimator $\hat{\Delta}$. We first examine the two-sided triggering case where triggering status is completely observed; that is, we fully observe the $T_1$ and $C_1$ triggered subjects ($S = 1$) as well as the $T_0$ and $C_0$ trigger-complement subjects ($S = 0$).

A key constraint we exploit is the assertion that there is no treatment effect for the trigger-complement group. (This assumption is satisfied in online experiments where triggering is synonymous with intervention exposure. For example, the triggering condition may be that a user must scroll to the bottom of a webpage in order to be shown one of three ads; these ads have no effect on users who never scroll to the bottom of the page.) This means that $\hat{\Delta}_0 = \bar{Y}_{T_0} - \bar{Y}_{C_0}$, the difference in mean outcomes for $T_0$ versus $C_0$ subjects, equals zero in expectation. Moreover, $\hat{\Delta}_0$ and $\hat{\Delta}$ are strongly correlated because the trigger-complement group is a subset of the full experimental population. Given these properties of $\hat{\Delta}_0$, we can take $X = \hat{\Delta}_0$ and use the two-sided triggering estimator

$\hat{\Delta}_{\text{cuped}} = \hat{\Delta} - \theta\,\hat{\Delta}_0$    (3)

to estimate ITT more efficiently. This idea was explored in (deng2015diluted), who also added other augmentations such as the difference in the proportion of triggered subjects between the study groups.
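As a concrete sketch (variable names illustrative), Eq (3) can be implemented with $\theta$ estimated from bootstrap replicates of $(\hat{\Delta}, \hat{\Delta}_0)$:

# Two-sided trigger CUPED sketch: S is observed in both arms.
two_sided_trigger = function(y, s, z, n_boot = 500) {
  est = function(idx) {
    yb = y[idx]; sb = s[idx]; zb = z[idx]
    delta   = mean(yb[zb == 1]) - mean(yb[zb == 0])
    # no treatment effect for the trigger-complement, so E[delta_0] = 0:
    delta_0 = mean(yb[zb == 1 & sb == 0]) - mean(yb[zb == 0 & sb == 0])
    c(delta, delta_0)
  }
  point = est(seq_along(y))
  boots = replicate(n_boot, est(sample(seq_along(y), replace = TRUE)))
  theta = cov(boots[1, ], boots[2, ]) / var(boots[2, ])  # Cov/Var from bootstrap
  point[1] - theta * point[2]                            # Eq (3)
}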

2.2. CUPED for One-Sided Triggering

We now consider the one-sided triggering case where triggering status is only observed in the treatment group. This occurs, for instance, when a subject has to opt in to receive an intervention. Among subjects assigned to treatment, those who opt in are $T_1$ and those who do not opt in are $T_0$. Subjects assigned to control are not given the choice to opt in, so we cannot distinguish whether a control subject is $C_1$ or $C_0$. However, we still assert that there is no treatment effect for the trigger-complement group. This means that if we can transform the entire control group $C$ to make it comparable to $T_0$ in terms of the outcome distribution, then we can use the difference in outcomes between $T_0$ and the transformed $C$ for mean-zero augmentation. And since $T_0$ and $C$ are drawn from the full experimental population, we can guarantee correlation between this augmentation term and $\hat{\Delta}$, thereby obtaining a more precise ITT estimate, following Eq (2).

To make $T_0$ and $C$ comparable, we leverage matching and covariate balancing techniques from the causal inference literature (Imbens:2015; rosenbaum1983central). The trick is to properly re-weight subjects in $C$ by their probability of being among $C_0$. Under randomization, the triggering rates in the treatment and control groups are the same in expectation, so we can use the treatment group to fit a triggering probability model (jo_stuart2009; ding2017principal; feller2017), and apply this model to $C$ to estimate each control subject's probability of belonging to $C_0$. Formally, given a set of pre-assignment covariates $X_i$, let $p_i = P(S_i = 1 \mid X_i)$ be the triggering probability for subject $i$, and define the augmentation term

$\hat{\Delta}_0 = \dfrac{\sum_{i \in T}(1 - S_i)\,Y_i}{\sum_{i \in T}(1 - S_i)} - \dfrac{\sum_{i \in C}(1 - p_i)\,Y_i}{\sum_{i \in C}(1 - p_i)}$.    (4)

The zero subscript in $\hat{\Delta}_0$ reminds us that we are trying to identify the $T_0$ and $C_0$ trigger-complement subjects. $\hat{\Delta}_0$ is the difference of two weighted averages, where for treated subjects we use a hard weight of the observed $1 - S_i$, and for control subjects we use a soft weight $1 - p_i$. (The weighted IV estimator uses soft weights for both treatment and control. That would not provide a mean-zero augmentation in our framework, because outcomes from $T_1$ would carry a treatment effect that biases our augmentation.) The first term on the right-hand side of Eq (4) is just $\bar{Y}_{T_0}$, the average of $Y$ in $T_0$. The second term is an importance-sampled average over the control subjects that mimics the average of $Y$ in $C_0$. That $E[\hat{\Delta}_0] = 0$ is straightforward to prove under the weak principal ignorability assumption (jiang2020multiply; jiang2020identification). Here, we state the assumption and theorem and leave the proof for Appendix A.

Assumption 2.2 (Weak Principal Ignorability). There exists a set of pre-assignment covariates $X$ such that

$E[Y(0) \mid S(1) = 1, X] = E[Y(0) \mid S(1) = 0, X]$.

This is implied by the slightly stronger assumption that $Y(0)$ and the triggering counterfactual $S(1)$ are conditionally independent given $X$.

Theorem 2.2. Under the weak principal ignorability Assumption 2.2, the expectation of the augmentation (4) asymptotically equals 0.

2.2.1. Steps to implement the one-sided triggering estimator

To apply the mean-zero augmentation with CUPED, we carry out the following steps:

  1. Fit a model to predict the triggering probability, $p_i = P(S_i = 1 \mid X_i)$, using treatment group data. Apply the fitted model to the control group to obtain predicted triggering probability weights for the control subjects. Alternatively, the weights can be optimized to directly balance a set of covariates (see Section 4).

  2. Define $\hat{\Delta}_0$ as in Eq (4).

  3. Define $\theta = \mathrm{Cov}(\hat{\Delta}, \hat{\Delta}_0)/\mathrm{Var}(\hat{\Delta}_0)$.

  4. The CUPED One-Sided Trigger estimator is

    $\hat{\Delta}_{\text{cuped}} = \hat{\Delta} - \theta\,\hat{\Delta}_0$,    (5)

    and has variance

    $\mathrm{Var}(\hat{\Delta}_{\text{cuped}}) = \mathrm{Var}(\hat{\Delta}) - \mathrm{Cov}(\hat{\Delta}, \hat{\Delta}_0)^2 / \mathrm{Var}(\hat{\Delta}_0)$.    (6)

In practice, we recommend using the bootstrap (efron1994introduction) to estimate $\mathrm{Cov}(\hat{\Delta}, \hat{\Delta}_0)$ and $\mathrm{Var}(\hat{\Delta}_0)$, and to refit the triggering probability model for each bootstrap sample. We can then estimate the variance of $\hat{\Delta}_{\text{cuped}}$ with

$\widehat{\mathrm{Var}}(\hat{\Delta}_{\text{cuped}}) = \widehat{\mathrm{Var}}(\hat{\Delta}) - \widehat{\mathrm{Cov}}(\hat{\Delta}, \hat{\Delta}_0)^2 / \widehat{\mathrm{Var}}(\hat{\Delta}_0)$.    (7)

(The variance estimator (7) treats $\theta$ as a fixed parameter, though in practice we need to estimate it. The additional uncertainty in the estimated $\theta$ may cause finite-sample bias, but at the large sample sizes common in online experimentation, and with enough bootstrap iterations, this bias should be negligible. This is confirmed in our simulation studies.)
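The following R sketch puts the steps together; it is a minimal implementation under the assumption that the data frame has columns z, s, y, x_1, and x_2 (all names illustrative).

# CUPED One-Sided Trigger sketch: refit the triggering model in every
# bootstrap sample, per the recommendation above.
one_sided_trigger = function(df, n_boot = 1000) {
  est = function(d) {
    treat = d[d$z == 1, ]; ctrl = d[d$z == 0, ]
    # Step 1: fit the triggering model on treatment, predict on control.
    fit = glm(s ~ x_1 + x_2, family = binomial, data = treat)
    p   = predict(fit, newdata = ctrl, type = "response")
    # Step 2: augmentation of Eq (4): hard weights 1 - s for treatment,
    # soft weights 1 - p for control.
    d0    = weighted.mean(treat$y, 1 - treat$s) - weighted.mean(ctrl$y, 1 - p)
    delta = mean(treat$y) - mean(ctrl$y)
    c(delta = delta, d0 = d0)
  }
  point = est(df)
  boots = replicate(n_boot, est(df[sample(nrow(df), replace = TRUE), ]))
  # Step 3: theta = Cov(delta, Delta_0) / Var(Delta_0), from bootstrap draws.
  theta = cov(boots["delta", ], boots["d0", ]) / var(boots["d0", ])
  list(itt = unname(point["delta"] - theta * point["d0"]),       # Eq (5)
       se  = sqrt(var(boots["delta", ]) -                        # Eq (7)
                  cov(boots["delta", ], boots["d0", ])^2 / var(boots["d0", ])))
}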

3. Testing the Mean-Zero Assumption

In Section 2.2, we discussed how $\hat{\Delta}_{\text{cuped}}$ is asymptotically unbiased for ITT under weak principal ignorability. This ignorability condition is not testable because it is based on counterfactuals $Y(0)$ and $S(1)$ that are not observable at the same time; e.g., we only observe $S(1)$ in the treatment group. But we can still make progress, because principal ignorability is a sufficient but not necessary condition for $\hat{\Delta}_{\text{cuped}}$ to be unbiased. In fact, $\hat{\Delta}_{\text{cuped}}$ is unbiased for ITT as long as the augmentation $\hat{\Delta}_0$ equals zero in expectation, and this is testable: for example, via a Wald test, using the delta method (Dengkdd2018; allofstat) to compute the variance of $\hat{\Delta}_0$, or using bootstrapped $\hat{\Delta}_0$ values, as described in Section 2.2.1.
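For instance, given the bootstrap draws of $\hat{\Delta}_0$ from Section 2.2.1, a simple Wald/z test is (sketch; names illustrative):

# Mean-zero test for the augmentation, given its point estimate and
# bootstrap replicates:
mean_zero_test = function(d0_point, d0_boots) {
  se = sd(d0_boots)                       # bootstrap SE of Delta_0
  z  = d0_point / se
  c(statistic = z, p.value = 2 * pnorm(-abs(z)))
}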

Only requiring that $E[\hat{\Delta}_0] = 0$ means practitioners have lots of freedom to modify and improve upon the augmentation term (4). As is common in observational data analyses, we can apply additional procedures such as weight bucketing, weight trimming, outcome outlier removal, etc., to further centralize $\hat{\Delta}_0$ towards a mean of 0. Alternatively, instead of using a triggering probability model to predict control weights, the weights can be optimized to directly balance covariates with other regularization criteria (imai2014covariate; hainmueller2012entropy; athey2018approximate) to match $C$ to $T_0$. Unlike in typical covariate balancing exercises, where we can only balance on pre-assignment covariates (to ensure that covariate balancing does not lead to biased treatment impact estimates), here we are allowed to balance on in-experiment observations, even the outcome of interest itself. This is because in our triggering experiment setup, where only triggered subjects can be affected by the treatment, we have the constraint that we should observe no treatment effect when comparing $T_0$ and (reweighted) $C$ units. However, matching $C$ too well to $T_0$ can eliminate precision gains. In the extreme case, we can balance the outcome of interest directly by finding weights such that the augmentation (4) is exactly 0. An augmentation of exactly 0 has no effect on the estimator it is augmenting and won't provide any efficiency gains.
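As a sketch of the balancing alternative (a bare-bones stand-in for packages such as ebal, under the usual exponential-tilting formulation): solve for control weights $w_i \propto \exp(\lambda^\top X_i)$ such that the weighted control covariate means match the $T_0$ means.

# Minimal entropy-balancing sketch: the dual objective is convex, so a
# quasi-Newton solver suffices for a handful of moment constraints.
entropy_balance = function(X_ctrl, target_means) {
  Xc = sweep(as.matrix(X_ctrl), 2, target_means)        # center at the target
  dual = function(lambda) log(sum(exp(Xc %*% lambda)))  # convex dual objective
  lambda = optim(rep(0, ncol(Xc)), dual, method = "BFGS")$par
  w = exp(Xc %*% lambda)
  as.numeric(w / sum(w))  # weighted means of X_ctrl now equal target_means
}

Applying these weights to $C$ and comparing the weighted control outcome mean against $\bar{Y}_{T_0}$ yields an augmentation whose balanced covariates have exactly zero mean difference by construction.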

4. How Covariate Balancing Relates to a Bias-Variance Tradeoff

In the previous section, we discussed specifying the control weights to directly balance the covariates of $C$ to match those of $T_0$. We can also expand covariate selection from pre-assignment observations to in-experiment observations. In one extreme case, when we treat the outcome of interest itself as an in-experiment covariate, we can force (4) to be a point mass at 0 and therefore trivially satisfy the mean-zero requirement. But this would also forego any precision gains: following Eq (6), if $\hat{\Delta}_0$ is a point mass at zero, then $\mathrm{Var}(\hat{\Delta}_{\text{cuped}}) = \mathrm{Var}(\hat{\Delta})$.

In theory, we would like to find the minimum set of covariates satisfying the weak ignorability Assumption 2.2. In practice, when pre-assignment covariates do not capture all relevant confounders of triggering status and outcome, balancing on in-experiment covariates can pull $E[\hat{\Delta}_0]$ closer to zero, in exchange for balancing away "good" variation introduced by the augmentation term, thereby weakening efficiency gains. This is an interesting observation with important practical implications:

  1. We are guaranteed to have a mean-zero augmentation by balancing the outcome directly between $T_0$ and (reweighted) $C$. But this provides no variance reduction.

  2. We can start by balancing on a small set of pre-assignment covariates to construct an augmentation that reduces variance. However, this augmentation may fail to pass the mean-zero test because it is missing important confounders of triggering status and outcome.

  3. We can gradually add more covariates for balancing, including in-experiment observations that are highly correlated with the outcome. Adding more balancing covariates can de-bias the augmentation, but will also lessen the amount of variance reduced.

4.1. Guidance on which covariates to balance

Consider the law of total variance, $\mathrm{Var}(Y) = \mathrm{Var}(E[Y \mid X]) + E[\mathrm{Var}(Y \mid X)]$, in which we decompose the variance of $Y$ into the variance explained by $X$ (the first term on the right-hand side of the equation) and the remaining variance not explained by $X$. In analyses of randomized experiments, removing variance explained by pre-assignment covariates generally improves estimation efficiency compared to the unadjusted $\hat{\Delta}$. This adjusts for covariate imbalances that make experiment results harder to interpret, since the difference in study group outcomes may be attributed to either the treatment or differences in baseline covariates. Common approaches to adjust for covariate imbalance include post-stratification for discrete covariates (miratrix2013adjusting) and regression adjustment for general covariates. Similarly, when applying CUPED estimators, we should always adjust for pre-assignment covariate imbalance, either by including augmentation terms such as $\overline{f(X)}_t - \overline{f(X)}_c$, or by first residualizing the outcomes and then applying CUPED to the residualized outcomes.

After adjusting for pre-assignment covariates, we consider whether to directly balance on certain covariates to further improve the efficiency of a CUPED estimator. Recall that the simplest form of the CUPED estimator is $\hat{\Delta} - \theta\,\hat{\Delta}_0$, with variance $\mathrm{Var}(\hat{\Delta}) - \mathrm{Cov}(\hat{\Delta}, \hat{\Delta}_0)^2/\mathrm{Var}(\hat{\Delta}_0)$. We reduce the variance of $\hat{\Delta}$ by adjusting away noise that is shared by $\hat{\Delta}$ and the augmentation $\hat{\Delta}_0$. In other words, we strive to maximize the variation captured in $\hat{\Delta}_0$ that correlates with variation in $\hat{\Delta}$. For the one-sided triggering augmentation (see Eq 4), this means we want to keep as much variation due to covariate imbalances as possible, and only minimally balance covariates in order for $\hat{\Delta}_0$ to have a mean of zero. rosenbaum1983central showed that the coarsest balancing score based on pre-assignment covariates is the propensity score, i.e., the probability that a data point from $T_0 \cup C$ belongs to $T_0$, given $X$. This means that under a weak ignorability assumption, if we know the ground truth triggering probabilities, or equivalently the propensity score of a data point from $T_0 \cup C$ belonging to $T_0$, then using these values to construct weights for $C$ will lead to the largest variance reduction while keeping the estimator unbiased. Any additional covariate balancing would decrease the amount of variance reduced by the augmentation.

So, should we balance on a covariate or not? For pre-assignment covariates, we should always combine one-sided triggering CUPED with pre-assignment augmentation or regression adjustment. This means variation in $\hat{\Delta}$ due to pre-assignment covariate imbalance should be removed whenever possible. Since the adjusted $\hat{\Delta}$ already seeks to exclude noise from pre-assignment covariate imbalance, keeping extra variation from those covariates in the augmentation no longer contributes to increased correlation between $\hat{\Delta}$ and $\hat{\Delta}_0$. Hence, covariate balancing on pre-assignment covariates at least won't hurt when regression adjustment by the same covariates is also applied. For in-experiment covariates, things are different. We cannot balance on in-experiment observations from $T_1$, since $T_1$ subjects are exposed to the active treatment and potentially affected by the intervention; using such covariates would bias the one-sided triggering estimator. We can balance on in-experiment observations from $T_0$, provided it is reasonable to believe that $T_0$ subjects are unaffected by the treatment intervention. Balancing on in-experiment covariates may reduce bias in $\hat{\Delta}_0$ when the pre-assignment covariates used for adjustment fail to include all confounders of triggering status and outcome. However, this additional balancing comes at the cost of reduced efficiency gains.

To summarize, we have a knob to trade off bias and variance when we include more in-experiment observations for balancing. On one end, we only use pre-assignment covariates, with the possibility of not including all confounders of triggering status and outcome. Then $E[\hat{\Delta}_0] \neq 0$ and the CUPED estimator is biased, though we will see efficiency gains. On the other end, we balance on the in-experiment outcome-of-interest and the augmentation becomes trivially mean-zero. Then the CUPED estimator is unbiased, but provides no efficiency gains.

5. Principal Score or Propensity Score?

In Section 2.2, we proposed estimating triggering probabilities — also known as principal scores in the principal stratification literature — and using them as balancing weights in (4). Alternatively, to balance $C$ to match $T_0$, we could use the propensity score $e(X) = P(i \in T_0 \mid i \in T_0 \cup C, X)$. Both balancing scores lead to the same weighting of subjects for (4). To see this, notice that

$e(X) = \dfrac{P(i \in T_0 \mid X)}{P(i \in T_0 \mid X) + P(i \in C \mid X)}$.

Let $q = P(Z = 1)$ be the treatment assignment probability, and denote the principal score $P(S(1) = 1 \mid X)$ by $p(X)$. We then have

$e(X) = \dfrac{q\,(1 - p(X))}{q\,(1 - p(X)) + (1 - q)}$.

Denote the odds $e(X)/(1 - e(X))$ by $w(X)$, and we have the relationship

$w(X) = \dfrac{q}{1 - q}\,(1 - p(X))$.    (8)

To re-weight $C$ to match $T_0$, the weights for data points in $C$ are proportional to the odds $e(X)/(1 - e(X))$. Using (8), this can be simplified as

$w(X) \propto 1 - p(X)$.

We have shown that the propensity score leads to the same weights defined by the principal score in (4).

In practice, however, the two balancing scores have different finite-sample performance, due to how we fit the principal and propensity score models on empirical data. The principal score model is trained using the treatment data only and applied to the control group; the predicted control weights are out-of-sample predictions. In contrast, fitting the propensity model on $T_0 \cup C$, with membership in $T_0$ as the positive label, yields in-sample predictions. In Section 6, we show that out-of-sample principal score predictions perform better in terms of variance reduction, compared to in-sample propensity score predictions, even when both methods lead to unbiased CUPED estimators.
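A quick numeric check of relationship (8) (values illustrative):

# With known principal scores p and treatment share q, the propensity-based
# odds weights for C are proportional to the principal-score weights 1 - p.
p = runif(10)                              # triggering probabilities P(S(1)=1|X)
q = 0.75                                   # share of units assigned to treatment
e = q * (1 - p) / (q * (1 - p) + (1 - q))  # P(in T0 | in T0 or C, X)
w_propensity = e / (1 - e)                 # odds weights for control units
w_principal  = 1 - p
range(w_propensity / w_principal)          # constant ratio q / (1 - q) = 3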

6. Simulation Study

We ran four simulation studies to compare the estimation performance of the CUPED One-Sided Trigger estimator against other standard ITT estimators. We also assess multiple versions of the One-Sided Trigger estimator, in which we alter specifications of the augmentation term. In the first study, we implement the One-Sided Trigger as described in Section 2.2.1 and compare its average bias and standard error against those of the Naive estimator $\hat{\Delta}$. We also compare to estimators that could be used when triggering status is fully observed in both treatment and control groups. In Study 2, we explore using predicted principal scores versus estimated propensity scores to construct the CUPED augmentation term (ref. Section 5). We demonstrate the phenomenon of increased variance due to covariate overbalancing when we use in-sample propensity score predictions (ref. Section 4). Study 3 shows the impact of combining regression adjustment with one-sided triggering augmentation. Study 4 provides an example where observed pre-assignment covariates are not sufficient to make the augmentation mean-zero, and it is necessary to balance on an in-experiment covariate.

6.1. Data generating process

Our simulation design, illustrated in Figure 2, mimics a conversion process commonly of interest in online experimentation. We outline the data generating process (dgp) here and refer the reader to Appendix B for details and R code.

We simulate a randomized experiment with 100,000 users, of which 75,000 are assigned to treatment and the remaining 25,000 are assigned to control. Each user belongs to either a high or low engagement tier, represented by an unobserved binary label $U$. Two pre-assignment covariates, $X_1$ and $X_2$, are generated from uniform distributions with a lower bound of 0. The upper bound for $X_1$ is 1 when $U = 1$, and 0.25 when $U = 0$; this allows us to set a higher conversion rate for high-engagement users. The upper bound for $X_2$ is 1. The user-specific triggering rate (which parameterizes the triggering counterfactual status $S(1)$) is a linear function of $X_1$ and $X_2$. The user-specific baseline conversion rate is a linear function of $U$ and $X_2$. For those who triggered in the treatment group, we add an additional constant treatment effect to the conversion rate. The outcome $Y$ is generated from a binomial distribution with 30 trials and probability of success equal to the user-specific conversion rate, which can be interpreted as total conversions in a 30-day period under a fixed daily conversion rate. From Figure 2, we see that $X_1$, $X_2$, and $U$ are confounders that affect both the triggering status $S$ and the outcome $Y$. The weak principal ignorability Assumption 2.2 is satisfied when we condition on $X_1$ and $X_2$, thereby blocking all the backdoor paths from $S$ to $Y$ (pearl2000models). Unless otherwise stated, we assume $S$ is only observed in the treatment group, i.e., we have an experiment with one-sided triggering.

The most important statistics for this simulation study are as follows:

  • The ground truth ITT effect is 0.075.

  • The triggering rate is 5%.

  • Treatment and control sample sizes are 75,000 and 25,000, respectively.

Figure 2. DAG for the simulation design. $Z$ is treatment assignment, $S$ is the actual triggering event, and $Y$ is the outcome. $S(1)$ is the latent triggering counterfactual status. $X_1$ and $X_2$ are pre-assignment covariates that are confounders of $S(1)$ and $Y$. $U$ is an unobserved confounder, but $X_1$ and $X_2$ fully block all backdoor paths from $S(1)$ to $Y$.

For each simulation study, we run 50,000 simulation trials and report the following for each estimator:

  • Est. ITT. We compare this to the true ITT effect of 0.075 to assess the mean bias of the estimator.

  • Sample standard deviation across the 50,000 trials. This is the Monte Carlo (estimated) true standard error, which we treat as the estimator's true standard error.

  • Average estimated standard error (SE). We estimate the SE of all One-Sided Trigger estimators following the variance estimator (7), with 1,000 bootstrap samples. We estimate SE for the Naive and Trigger-Dilute estimators using closed-form formulas: for Naive, $\sqrt{s_t^2/n_t + s_c^2/n_c}$, where $s_t^2$ and $s_c^2$ are the sample variances of the outcomes under treatment and control, respectively; for Trigger-Dilute, $\hat{r}\,\sqrt{s_{T_1}^2/n_{T_1} + s_{C_1}^2/n_{C_1}}$, where $\hat{r}$ is the triggering rate and $s_{T_1}^2$ and $s_{C_1}^2$ are the sample variances of the outcomes within the triggered treatment group ($T_1$) and the triggered control group ($C_1$), respectively (see the R sketch following this list).
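As a sketch, the closed-form SEs above can be computed as follows (treating the triggering rate as fixed, per the formulas; names illustrative):

se_naive = function(y, z) {
  sqrt(var(y[z == 1]) / sum(z == 1) + var(y[z == 0]) / sum(z == 0))
}
se_trigger_dilute = function(y, z, s) {   # requires S observed in both arms
  r_hat = mean(s[z == 1])                 # triggering rate from the treatment arm
  t1 = z == 1 & s == 1; c1 = z == 0 & s == 1
  r_hat * sqrt(var(y[t1]) / sum(t1) + var(y[c1]) / sum(c1))
}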

In the following sections, we summarize results from each simulation study.

6.2. Study 1: Benchmark against other unbiased estimators

We begin by benchmarking the CUPED One-Sided Trigger against other unbiased ITT estimators, including the Naive $\hat{\Delta}$, Trigger-Dilute, and the CUPED Two-Sided Trigger from Section 2.1. To obtain Trigger-Dilute and Two-Sided Trigger estimates, we assume $S$ is observed for control subjects. We implement the CUPED One-Sided Trigger using augmentation (4), where the weights are predicted triggering probabilities $\hat{p}_i$. In particular, for each simulation trial, we fit a logistic regression model of $S$ on $X_1$ and $X_2$ using treatment group data and make (out-of-sample) predictions on the control group.
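For reference, this model fit looks as follows on the synthetic data of Appendix B (a sketch using that code's column names; p_hat is a hypothetical new column):

treat = df[df$dim_treatment == "treatment", ]
ctrl  = df[df$dim_treatment == "control", ]
fit   = glm(S ~ x_1 + x_2, family = binomial, data = treat)
ctrl$p_hat = predict(fit, newdata = ctrl, type = "response")  # out-of-sample weights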

Simulation results are in Table 2. We make several observations. First, all estimators are unbiased for the true ITT of 0.075. Second, Trigger-Dilute and Two-Sided Trigger have the same ground truth variance and both reduce SE by about 4 times compared to Naive. This is roughly a 16x variance reduction rate, and corroborates the heuristic that variance reduction with trigger-dilute analysis can be as high as the reciprocal of the triggering rate (which is set at 5% in our dgp). The proposed One-Sided Trigger with augmentation (4) has the smallest SE. This means that when the weak principal ignorability assumption holds, not observing triggering status in the control group does not prevent us from using triggering to reduce variance. In fact, exploiting the ignorability assumption allows us to obtain precision improvements even beyond Trigger-Dilute. Third, all the variance estimators that we use indeed recover the ground truth SE. In particular, we have confirmed that the bootstrap procedure described in Section 2.2.1 is able to recover the true SE of the One-Sided Trigger estimator.

Estimator | Est. ITT | True SE | Est. SE
Naive | 0.0750 | 0.0122 | 0.0123
Trigger-Dilute | 0.0750 | 0.00315 | 0.00315
CUPED Two-Sided Trigger | 0.0750 | 0.00315 | 0.00324
CUPED One-Sided Trigger | 0.0750 | 0.00195 | 0.00195

Table 2. (Study 1) Performance of the CUPED One-Sided Trigger compared to Naive and two other unbiased estimators that could be used if the triggering counterfactual status were fully observed for all units.

A surprising result is that the CUPED One-Sided Trigger outperforms both Trigger-Dilute and the CUPED Two-Sided Trigger, even though the latter two estimators have the benefit of observing $S$ for both control and treatment groups. The extra efficiency gain for the One-Sided Trigger likely comes from its exploitation of weak principal ignorability, which allows it to use the entire control group (i.e., a larger sample size) to obtain a more precise estimate of the control outcome mean for the $C_0$ group. In our simulation setup, only 25% of units are assigned to control, so increasing the effective sample size of the control group can help reduce the variance of estimated control means, which are subsequently used to estimate the average treatment effect.

6.3. Study 2: Alternative ways to balance $T_0$ and $C$

This study explores different ways to balance $T_0$ and $C$ to create an augmentation term for the One-Sided Trigger. Specifically, we compare using predicted triggering probabilities versus estimated propensity scores as balancing weights in our augmentation (4). We also use weights obtained from entropy balancing (hainmueller2012entropy; qingyuan2017entropy) on different sets of covariates, to directly enforce a zero mean difference in each of these covariates when we compare the reweighted $C$ to $T_0$.

Estimator | Est. ITT | True SE | Est. SE
Naive | 0.0750 | 0.0122 | 0.0123
Trigger-Dilute | 0.0750 | 0.00315 | 0.00315
CUPED One-Sided Trigger estimators:
Triggering probability, out-of-sample prediction | 0.0750 | 0.00195 | 0.00195
Triggering probability, ground truth | 0.0750 | 0.00227 | 0.00228
Propensity score (in-sample prediction) | 0.0750 | 0.00738 | 0.00738
Entropy balance on pre-assignment covariates | 0.0750 | 0.00738 | 0.00737
Entropy balance on expanded covariate set | 0.0750 | 0.00797 | 0.00797
Entropy balance on true triggering probability | 0.0750 | 0.00729 | 0.00726

Table 3. (Study 2) Estimation performance of the CUPED One-Sided Trigger under different specifications of the augmentation term.

Results are shown in Table 3. Because we correctly account for the confounders $X_1$ and $X_2$ when fitting the triggering probability model, the propensity score model, and in entropy balancing, all the One-Sided Trigger estimators are unbiased. In terms of standard errors, using predicted triggering probabilities as weights gives the largest variance reduction. In this study, the out-of-sample triggering probability estimator has an SE of 0.00195, which is more than 3 times smaller than the SEs of the in-sample propensity score estimator and the entropy balance estimators (SE between 0.00729 and 0.00797). Compared to using propensity score estimates for balancing weights, entropy balancing on true triggering probabilities is slightly more efficient, while entropy balancing on the expanded covariate set leads to larger variance.

Section 5 showed that if we know the true triggering probability $p(X)$, we can compute the true propensity score $e(X)$, and the two probabilities lead to the same balancing weights in our augmentation (4). In practice, however, we have to estimate these probabilities. This study shows that using predicted triggering probabilities results in larger variance reduction than using estimated propensity scores. This is because the predicted triggering probabilities are out-of-sample predictions, whereas the estimated propensity scores are in-sample predictions. Using in-sample predictions to construct the CUPED augmentation term means there will be less "good" variation retained in the augmentation (see Section 4) that is correlated with $\hat{\Delta}$. This, in turn, means the amount of variance reduced will be smaller. Similarly, when we entropy balance on the expanded covariate set instead of just the pre-assignment confounders, we are overbalancing and again see a reduction in efficiency gains. It is also interesting to see that entropy balancing on true triggering probabilities results in a larger SE (0.00729) compared to using true triggering probabilities directly as weights (0.00227).

6.4. Study 3: Balancing $T_0$ and $C$ after regression adjustment

Following Section 4.1's guidance to always use regression adjustment for pre-assignment covariates, in this study we first fit a regression model predicting the outcome $Y$ using $X_1$ and $X_2$, and then apply our various estimators on the residual outcomes $\epsilon_Y = Y - \hat{E}[Y \mid X_1, X_2]$. This post-regression-adjustment analysis is analogous to CUPED with a pre-assignment augmentation already applied as a first step; we then seek a second augmentation term leveraging triggering and randomization.
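A sketch of this residualization step (column names from Appendix B; eps_Y is a hypothetical new column):

# Fit pooled over both arms so the treatment effect remains in the residuals.
fit_y = lm(Y ~ x_1 + x_2, data = df)
df$eps_Y = df$Y - predict(fit_y, newdata = df)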

Estimator | Est. ITT | True SE | Est. SE
Naive | 0.0750 | 0.00995 | 0.00999
Trigger-Dilute | 0.0750 | 0.00275 | 0.00275
CUPED One-Sided Trigger estimators:
Triggering probability, out-of-sample prediction | 0.0750 | 0.00194 | 0.00194
Triggering probability, ground truth | 0.0750 | 0.00194 | 0.00194
Propensity score (in-sample prediction) | 0.0750 | 0.00194 | 0.00194
Entropy balance on pre-assignment covariates | 0.0750 | 0.00194 | 0.00195
Entropy balance on expanded covariate set | 0.0750 | 0.00359 | 0.00357
Entropy balance on true triggering probability | 0.0750 | 0.00194 | 0.00194

Table 4. (Study 3) Estimation performance of the various estimators, applied on residual outcomes $\epsilon_Y$.
Estimator | Est. ITT | True SE | Est. SE
Naive | 0.0750 | 0.0104 | 0.0105
Trigger-Dilute | 0.0750 | 0.00287 | 0.00287
CUPED One-Sided Trigger estimators:
Triggering probability, out-of-sample prediction | 0.0797 | 0.00203 | 0.00202
Propensity score (in-sample prediction) | 0.0797 | 0.00203 | 0.00202
Entropy balance on the observed confounder only | 0.0797 | 0.00203 | 0.00203
Entropy balance on observed confounder + in-experiment confounder | 0.0750 | 0.00478 | 0.00480

Table 5. (Study 4) Not observing one of the confounders results in biased ITT estimates until we introduce an in-experiment covariate that is correlated with both the triggering counterfactual and the outcome. All estimation approaches here are applied on residual outcomes $\epsilon_Y$.

Results in Table 4 show that all estimators remain unbiased. Both Naive and Trigger-Dilute now have smaller SEs compared to Studies 1 and 2 because of the regression adjustment. All the CUPED One-Sided Trigger estimators also have smaller SEs compared to Table 3.

As mentioned in Section 4, when we combine regression adjustment with the one-sided trigger augmentation, balancing on the same pre-assignment covariates used for regression adjustment should not penalize us with increased variance. In fact, as Table 4 shows, all the CUPED One-Sided Trigger estimators now have the same standard error, with the exception of the estimator that entropy balances on the expanded covariate set. This exception is not surprising, and illustrates the point that balancing on additional covariates that are not included in the regression adjustment will increase variance. In our simulation model (see Figure 2), we know that controlling for $X_1$ and $X_2$ already satisfies the weak principal ignorability assumption. It is not necessary to balance on more covariates, and doing so will in fact be overbalancing; that is, balancing on covariates that are not needed to ensure the CUPED augmentation term has a zero mean, but that reduce the amount of covariate imbalance contained in the augmentation term and thereby reduce efficiency gains.

6.5. Study 4: Use in-experiment covariate for bias-variance tradeoff

This study explores how the One-Sided Trigger estimators perform when the weak principal ignorability assumption does not hold. Specifically, we pretend that one of the two triggering-impact confounders is not observed, so only the other is available as a pre-assignment covariate. The hidden confounder is instead treated as an in-experiment observation: it cannot be used to fit a triggering probability or propensity score model, but it can be used for covariate balancing to create a CUPED augmentation term. We replicate the simulation as in Study 3, with the only change being that the hidden confounder is removed from regression adjustment as well as from all model fitting and covariate balancing.

Comparing Table 5 to Table 4 shows that the Naive and Trigger-Dilute estimators are still unbiased, though now with slightly larger SEs due to the confounder missing from the regression adjustment. The main difference, however, is that the One-Sided Trigger estimators with out-of-sample triggering probability weights, with in-sample propensity score weights, and with entropy balancing on the observed confounder alone are no longer unbiased. This confirms that the observed confounder alone does not capture all confounding between triggering status $S$ and outcome $Y$. When the weak principal ignorability assumption is violated and no further adjustments are applied, the CUPED augmentation term (4) will be significantly different from zero, and the One-Sided Trigger will be biased. Because conditioning on both confounders does satisfy the weak principal ignorability assumption in our dgp, by blocking all backdoor paths from $S$ to $Y$, including the hidden confounder as an in-experiment covariate in entropy balancing successfully removed the bias, but resulted in a larger SE compared to its biased counterparts.

7. Conclusion

When the triggering rate in a randomized experiment is low, it is critical to exploit subjects’ triggering counterfactual status for efficient estimation of the overall ITT effect. However, it is not always possible to confirm whether a control group subject would have triggered the active treatment had they been assigned to the treatment group. This kind of one-sided triggering problem poses a challenge both in theory and in practice.

In this paper, we tackle one-sided triggering purely as a variance reduction problem. We take the inefficient difference-in-outcome-means estimator $\hat{\Delta}$ and add a mean-zero augmentation term as a form of linear covariate adjustment to reduce the variance of the estimated ITT. Any mean-zero augmentation is a search direction along which we can optimize for the best "step size" $\theta$ to obtain a new estimator: $\hat{\Delta}_{\text{cuped}} = \hat{\Delta} - \theta\,\hat{\Delta}_0$.

We derive our mean-zero augmentation by comparing the treatment trigger-complement $T_0$ against the entire control group $C$, where we properly weight data points in $C$ to make the distribution of outcomes from $C$ comparable to that of $T_0$. It is known in the principal stratification literature that such weights exist and can be estimated as a function of pre-assignment covariates when weak principal ignorability holds. Following this theory, we propose our augmentation using predicted triggering probabilities trained on the treatment group and predicted on the control group. We compare this approach to alternatives such as directly training a propensity model on $T_0 \cup C$ and weighting $C$ units by estimated propensity scores, or using entropy balancing to enforce tight covariate balance between $C$ and $T_0$. These discussions strengthen our understanding of the sources of variance reduction and their relation to regression adjustment and covariate balancing.

Our simulation study shows that our proposed estimator can be even more efficient than standard estimators that are used when the triggering counterfactual status is observed in both treatment and control groups. The augmentation that weights control units by their predicted triggering probabilities gives the largest variance reduction and is unbiased under the weak principal ignorability condition, while other alternative weighting methods also considerably reduce variance. When conditioning on pre-assignment covariates is insufficient to satisfy weak principal ignorability and our augmentation fails to pass a mean-zero test, we find that including in-experiment observations in covariate balancing (e.g., entropy balancing) can effectively reduce or eliminate bias, in exchange for extra variance. In the worst extreme, we balance on the outcome-of-interest so the augmentation becomes a point mass at zero, but then we also forfeit any efficiency gains. In practice, however, there are often plenty of in-experiment observations correlated with the triggering counterfactual that, when controlled for or balanced on, can mitigate confounding between triggering status and the outcome. Using in-experiment observations as a pragmatic knob to trade off bias with variance is another novel aspect of our method.

Acknowledgements.
We thank George Roumeliotis and Zhenyu Lai for the motivational problem and discussion, and Mitra Akhatari, Jenny Chen, and Yunshan Zhu for propensity modeling.

References

Appendix A Proof of Theorem 2.2

Recall from Eq (4) that

$\hat{\Delta}_0 = \dfrac{\sum_{i \in T}(1 - S_i)\,Y_i}{\sum_{i \in T}(1 - S_i)} - \dfrac{\sum_{i \in C}(1 - p_i)\,Y_i}{\sum_{i \in C}(1 - p_i)}$,

where $S_i$ indicates observed triggering status and $1 - p_i$ is the estimated probability that subject $i$ is in the trigger-complement group. We prove that $E[\hat{\Delta}_0]$ asymptotically equals 0 by showing that both terms converge to the same limit under weak principal ignorability.

First, the first term is simply the sample average of $Y$ in $T_0$, and it has expectation $E[Y(0) \mid S(1) = 0]$, where $Y(0)$ is the potential outcome under assignment to control (trigger-complement subjects experience no treatment effect, so their observed outcome equals $Y(0)$ regardless of assignment). For a set of covariates $X$ satisfying weak principal ignorability, let $\mu_0(X) = E[Y(0) \mid X]$ and $p(X) = P(S(1) = 1 \mid X)$. Then

$E[Y(0) \mid S(1) = 0] = E[\mu_0(X) \mid S(1) = 0]$    (9)
$= \dfrac{E[(1 - p(X))\,\mu_0(X)]}{E[1 - p(X)]}$,    (10)

where (9) follows from the weak principal ignorability assumption, and (10) follows from the law of iterated expectations:

$E[\mu_0(X) \mid S(1) = 0] = \dfrac{E[\mathbb{1}\{S(1) = 0\}\,\mu_0(X)]}{P(S(1) = 0)} = \dfrac{E[(1 - p(X))\,\mu_0(X)]}{E[1 - p(X)]}$.

Meanwhile, the second term is a weighted average of $Y$ in $C$, with weights $1 - p_i$, and has an asymptotic mean of

$\dfrac{E[(1 - p(X))\,Y(0)]}{E[1 - p(X)]}$.

Applying iterated expectations to the numerator gives

$E[(1 - p(X))\,Y(0)] = E\bigl[(1 - p(X))\,E[Y(0) \mid X]\bigr] = E[(1 - p(X))\,\mu_0(X)]$.

Thus,

$\dfrac{E[(1 - p(X))\,Y(0)]}{E[1 - p(X)]} = \dfrac{E[(1 - p(X))\,\mu_0(X)]}{E[1 - p(X)]} = E[Y(0) \mid S(1) = 0]$.

We have shown that both terms of $\hat{\Delta}_0$ converge in expectation to $E[Y(0) \mid S(1) = 0]$ under weak principal ignorability, so $E[\hat{\Delta}_0]$ asymptotically equals 0, thereby concluding our proof of Theorem 2.2.

Appendix B Simulation R Code


#’ Treatment and control sample sizes are fixed as inputs.
#’ There are two levels of users: higher tier has higher conversion rate r_high,
#’ and lower tier has lower conversion rate r_low.
#’ U is an unobserved user tier label. p0=P(U=1). U is i.i.d.
#’ X1 ~ Uniform(0,1) if U=1, and X1~Uniform(0,0.25) if U=0.
#’ So E(X1) = (3*p0+1)/8. Under default value p0=0.2, E(X1)=0.2.
#’ X2 ~ Uniform(0,1). So E(X2) = 1/2.
#’ P(S=1|X1,X2) = p1 + c1(X1 - (3*p0+1)/8) + c2(X2-1/2).
#’ S=1 indicates the subject will trigger the event that will expose them to the treatment.
#’ Conversion rate: If U=0, r_low+c3(X2-1/2). If U=1, r_high+c3(X2-1/2).
#’ Once a subject is exposed to the treatment (i.e., in treatment group, and S=1),
#’ their conversion rate will be raised by r_effect.
#’
#’ @param n_control
#’ @param n_treatment
#’ @param p_0 marginal proportion of high tier subjects
#’ @param p_1 marginal proportion of triggered/opt-in subjects
#’ @param c_1
#’ @param c_2
#' @param c_3
#' @param T number of binomial trials per subject (the conversion window in days)
#’ @param r_low conversion rate of low tier
#’ @param r_high conversion rate of high tier
#’ @param r_effect added conversion rate from treatment effect
#’ @param random_seed
#’
#’ @return a data.table of simulated data
#’ @export
get_synthetic_dataset = function(n_control, n_treatment, p_0 = 0.2, p_1 = 0.05,
    c_1 = 0.05, c_2 = 0.05, c_3 = 0.1, T = 30, r_low = 0.05, r_high = 0.1, r_effect = 0.05,
    random_seed = NULL) {
    if (!is.null(random_seed))
        set.seed(random_seed)
    assignment = c(rep("control", n_control), rep("treatment", n_treatment))
    n = n_control + n_treatment
    U = rbinom(n, 1, p_0)
    x_1 = runif(n, 0, 0.25 + 0.75 * U)  # if U=1, upper bound is 1, otherwise 0.25
    x_2 = runif(n, 0, 1)
    # c_1 and c_2 multiply the centered x_1 and x_2 so that p_1 is the mean of p_s
    p_s = p_1 + c_1 * (x_1 - (3 * p_0 + 1)/8) + c_2 * (x_2 - 0.5)
    S = rbinom(n, 1, p_s)
    df = data.table::data.table(dim_treatment = assignment, x_1 = x_1, x_2 = x_2,
        p_s = p_s, U = U, S = S, key = c("dim_treatment", "S"))
    # use the keyed column (rows are re-sorted by the key), not the outer vector
    df[, `:=`(dim_did_trigger, S * (dim_treatment == "treatment"))]
    df[, `:=`(r, dplyr::if_else(U == 1, r_high, r_low) + c_3 * (x_2 - 0.5) + r_effect *
        dim_did_trigger)]
    df[, `:=`(Y, rbinom(n, T, r))]
    df[]  # idiom to let data.table print
}