1. Introduction
In online A/B testing, treatment assignment typically occurs upstream of treatment exposure, and experimental subjects have to trigger an event in order to be exposed to the intervention of interest. Some subjects never reach that triggering point and therefore do not receive their assigned treatment condition. For example, a user might not reach the checkout page to trigger exposure to a change in payment options, or a user might not open an email to trigger exposure to a promotion code. The presence of triggering defines four subpopulations of interest, illustrated in the left panel of Figure 1: in the treatment group, there are those who triggered exposure to the active treatment condition (T1) and those who did not trigger (T0). Similarly, in the control group, there are those who triggered exposure to the control condition, meaning they would have triggered exposure to the active treatment had they been assigned to the treatment group (C1), and those who did not trigger and would not have triggered even if assigned to treatment (C0). In an experiment with triggering conditions, the naive difference-in-means between treatment and control outcomes, Δ, is an unbiased but non-optimal estimator for the overall ITT effect. A more precise estimator can be obtained via trigger-dilute analysis: estimate an average treatment effect for the triggered subjects (T1 vs. C1) and then multiply this triggered average treatment effect by the triggering rate. As a rough heuristic, such trigger-dilute analysis can reduce sampling variance by a factor of the inverse of the triggering rate (e.g., approximately 20x variance reduction with a 5% triggering rate). In this paper, we study the case where triggering status is not observed for control subjects, i.e., we do not observe C1 and C0 labels, so trigger-dilute analysis cannot be used. This one-sided triggering scenario, shown in Figure 1(b), is known as one-sided noncompliance in the causal inference literature (see Table 1). For one-sided triggering experiments, we propose an estimator that is unbiased for the overall ITT and has smaller variance than the popular difference-in-outcome-means estimator Δ.
Our estimator is based on the fundamental idea behind CUPED (deng2013cuped), which is to engineer a mean-zero augmentation term with which we can further adjust the unbiased estimator, akin to covariate adjustment in a regression context.
The key assumption we leverage is that there is no treatment effect for T0 and C0 subjects; these are trigger-complement subjects who would never trigger exposure to the intervention regardless of their study group assignment.
We can therefore use observations from T0 and C0 to construct a mean-zero augmentation term A0 for linear adjustment.
(Because in practice we do not observe C0 labels, construction of A0 relies on reweighting the full control group to match T0.)
Since trigger-complements are a subset of the full experimental population, A0 and Δ will be correlated, so adjusting Δ by A0 with an appropriate scaling parameter will improve estimation efficiency compared to simply using Δ alone.
Table 1. Terminology mapping between A/B testing and causal inference.

A/B Testing | Causal Inference
Treatment Assignment | Instrumental Variable
Triggering Counterfactual Status | Compliance Status (Principal Stratum)
Only triggered subjects can be affected by the treatment | Exclusion Restriction
Only the treatment group can receive the active treatment | Strong Monotonicity
Triggered Average Treatment Effect | Local Average Treatment Effect (LATE)
Overall Average Treatment Effect | Intent-to-Treat (ITT) Effect
T1, C1 (Triggered) | Compliers
T0, C0 (Trigger-Complement) | Never-Takers
Triggering Probability | Principal Score
1.1. Setup and Notation
We consider a randomized experiment with binary treatment assignment Z_i ∈ {0, 1}. A binary label S_i(z) indicates whether subject i triggered exposure to the active treatment when assigned to study group z. For example, S_i(0) = 1 indicates that subject i was assigned to the control group and triggered exposure to the active treatment. (Y_i(0), Y_i(1)) is the pair of counterfactual outcomes under assignment to control and treatment, respectively (Imbens:2015). Similarly, (S_i(0), S_i(1)) is the counterfactual pair of triggered exposure conditions under assignment to control and treatment (angrist1994; frangakis2002). We call (S_i(0), S_i(1)) the "triggering counterfactual status", or "triggering status" for short.
We focus on the typical case for online experiments where S_i(0) = 0 by design; that is, subjects assigned to control have no access to the active treatment and can only be exposed to the control condition. This allows us to simplify notation: we use S_i = 1 interchangeably with (S_i(0), S_i(1)) = (0, 1) to denote subjects who would trigger exposure to the active treatment if assigned to the treatment group (T1 and C1). We use S_i = 0 interchangeably with (S_i(0), S_i(1)) = (0, 0) to denote subjects who would not trigger exposure to the active treatment, regardless of their treatment assignment (T0 and C0). Mapping to Figure 1, S_i = 1 corresponds to the Triggered group while S_i = 0 corresponds to the Trigger-Complement group. Y_i and D_i refer to subject i's observed outcome and observed exposure condition, respectively. We use subscripts t and c to denote the treatment and control groups, so Δ = Ȳ_t − Ȳ_c denotes the "delta" of the average observed outcome between the two study groups. The causal impact we wish to estimate is the overall ITT effect, τ = E[Y_i(1) − Y_i(0)].
1.2. Related Work
In the online A/B testing literature, trigger-dilute analysis is a popular approach for obtaining a more precise ITT estimate when only a small subset of the experimental population is exposed to the intervention (kohavi2020trustworthy; kohavi2007practical; deng2015diluted; abScale). First, an average treatment effect is estimated among the triggered subjects. This triggered average treatment effect is then multiplied ("diluted") by the triggering rate. The success of a trigger-dilute analysis depends on either a simple triggering condition, or a mechanism of counterfactual logging such that the experimentation system is able to compare the treatment vs. control experience at any time and label whether there is a realized difference in the study groups' experiences. Complex triggering conditions can easily lead to sample ratio mismatch, rendering the triggered analysis untrustworthy (fabijan2019diagnosing). Meanwhile, high-fidelity counterfactual logging is unwieldy and too expensive for many real applications. When triggering status is not observed for all units, trigger-dilute analysis cannot be applied.
To tackle scenarios with partially-observed triggering, we turn to the causal inference literature on noncompliance, which occurs when treatment assignment differs from treatment exposure (i.e., Z_i ≠ D_i for some units i). Instrumental variables (IV) (angrist1994; angrist1996) and principal stratification (frangakis2002; ding2017principal; feller2017; jiang2020identification) frameworks provide strategies for identifying and estimating subgroup average treatment effects under a range of noncompliance conditions.
In the IV literature, for example, the causal quantity of interest is typically the local average treatment effect (LATE), which here is equivalent to the average treatment effect for the triggered. The standard IV estimator for LATE is Δ divided by the estimated proportion of Compliers (i.e., the triggering rate). Multiplying this IV estimator by the triggering rate therefore simply recovers Δ. As such, the standard IV approach is not aimed at producing a more precise estimate of the overall ITT. There are, however, extensions of the standard IV estimator that attempt to improve estimation efficiency. Weighted IV methods (coussens2021improving; HuntingtonKlein_2020; joffe2003weighting) use predicted compliance to weight both treatment and control groups when computing the LATE estimator. Our method is similar in that we also use predicted compliance (triggering) probabilities, but whereas the weighted IV estimator is unbiased for LATE only when there is no correlation between treatment effect heterogeneity and compliance, our estimator is unbiased for ITT (and asymptotically unbiased for LATE after dividing by the triggering rate) as long as the augmentation term that we construct equals zero in expectation. Weighted IV and our method also have different variance reduction properties. To achieve variance reduction, weighted IV depends heavily on high prediction accuracy of triggering status, while our method can reduce variance even when triggering is completely random.
Beyond instrumental variables, the principal stratification literature generalizes identification and estimation strategies for causal effects under more complicated forms of noncompliance (ding2017principal; feller2017; yuan2019; jiang2020identification). We draw upon many ideas from the principal stratification literature to construct our proposed estimator. In particular, we invoke a key assumption called weak principal ignorability (jo_stuart2009; feller2017; ding2017principal) as a sufficient condition under which our estimator is unbiased for ITT. We also use principal scores (i.e., predicted triggering probabilities) to construct the augmentation term critical to our estimator.
More generally, there is a vast literature on using pre-assignment covariates for regression adjustment to increase estimation efficiency (fisher1925statistical; guo2021machine; poyarkov2016boosted; xie2016improving; lin2013agnostic; li_ding2020). Our method is based on the augmentation idea of CUPED (deng2013cuped) (see also li_ding2020), applied to the one-sided triggering context, and is a general approach that can be used on top of any pre-assignment covariate regression adjustment. In particular, our approach has the flexibility to incorporate in-experiment observations for covariate adjustment without introducing bias, though there is a tradeoff in the amount of variance reduced.
1.3. Contribution and Organization
This paper makes the following contributions to the A/B testing community:

We propose an unbiased ITT estimator with reduced variance for experiments with one-sided triggering. Our estimator relies on a testable assumption that an augmentation term used for covariate adjustment equals zero in expectation.

We explain how to test this mean-zero assumption. When the augmentation term fails a mean-zero test, we show how our estimator can incorporate in-experiment observations to reduce the augmentation's bias, at the cost of the amount of variance reduced. This provides an explicit knob to trade off bias against variance. We believe this idea is novel and effective for many real applications.

We study multiple flavors of augmentation and explain their differences in theory and through simulation studies.
The rest of the paper is organized as follows: Section 2 starts with a review of the general augmentation idea from deng2013cuped and an application of CUPED to two-sided triggering. This motivates our augmentation-based estimator for the case of one-sided triggering. Section 3 gives practical advice on how to test the mean-zero assumption that is crucial for the unbiasedness of our estimator. Section 4 presents a conceptual framework to guide variable selection when constructing the augmentation term, and explains how using in-experiment observations to achieve the mean-zero condition relates to a bias-variance tradeoff. Section 5 addresses estimation details behind the augmentation term that have practical implications for empirical variance reduction. We end with a simulation study in Section 6 that demonstrates the performance of our estimator under different specifications of the augmentation term.
2. Variance Reduction using CUPED
CUPED, an acronym for Controlled-experiment Using Pre-Experiment Data (deng2013cuped), is a variance reduction technique widely adopted in the A/B testing industry to improve the sensitivity of A/B tests (xie2016improving; kohavi2020trustworthy; poyarkov2016boosted). At its core, CUPED is an efficiency augmentation method applied on top of any existing unbiased estimator. If Δ is an unbiased estimator for the ITT effect τ, the CUPED estimator is defined as

(1) Δ_cuped = Δ − A,

where A is an augmentation such that E[A] = 0. The mean-zero requirement ensures that the CUPED estimator has the same expectation as the original estimator Δ, and is therefore also unbiased for τ. Variance reduction is achieved when there is sufficient correlation between Δ and A, since

Var(Δ − A) = Var(Δ) − 2 Cov(Δ, A) + Var(A) < Var(Δ)

whenever 2 Cov(Δ, A) > Var(A). Moreover, for any fixed θ, θA is also a mean-zero augmentation, so we can solve for the θ that minimizes the variance of the adjusted estimator:

(2) θ* = argmin_θ Var(Δ − θA).

deng2013cuped showed that this optimization has a solution similar to an ordinary least squares regression, with the optimal θ being Cov(Δ, A)/Var(A), and the variance reduction rate equal to Cor(Δ, A)². This can be generalized to multiple augmentations, where A is a mean-zero vector and the optimal θ is the ordinary least squares solution of a multiple linear regression, giving a variance reduction rate equal to R², the coefficient of determination of the fitted regression. Despite its resemblance to regression adjustment in the typical case of using pre-assignment data to create an augmentation A, CUPED as an efficiency augmentation framework is more general and flexible than standard regression adjustment. We explain more in Section 2.2. We end the review of CUPED with a few closing remarks that motivate the new ITT estimation method we describe in this paper. First, CUPED did not invent the idea of efficiency augmentation, but it provides a useful conceptual framework for identifying augmentations in the setting of online experimentation, where we can leverage knowledge of the randomization design. For example, randomization guarantees that the simple mean difference between study groups of any function of pre-assignment data, X̄_t − X̄_c, equals zero in expectation and can therefore be used as a mean-zero augmentation. Second, with more knowledge of constraints in the data generating process (dgp), we can come up with other forms of augmentations. This paper proposes one form of augmentation that can be used when the dgp is constrained by one-sided triggering. Lastly, when we only have partial knowledge of the dgp and a mean-zero augmentation can only be achieved under extra assumptions, we need to test and validate the mean-zero assumption before applying CUPED.
2.1. CUPED for Completely Observed Triggering
To apply CUPED to experiments with triggering, we search for constraints in the data generating process that can be leveraged to construct mean-zero augmentations on top of the naive estimator Δ. We first examine the two-sided triggering case where triggering status is completely observed; that is, we fully observe the T1 and C1 triggered subjects (S = 1) as well as the T0 and C0 trigger-complement subjects (S = 0).
A key constraint we exploit is the assertion that there is no treatment effect for the trigger-complement group. (This assumption is satisfied in online experiments where triggering is synonymous with intervention exposure. For example, the triggering condition may be that a user must scroll to the bottom of a webpage in order to be shown one of three ads. These ads have no effect on users who never scroll to the bottom of the page.) This means that Δ0 = Ȳ_T0 − Ȳ_C0, the difference in mean outcomes for T0 versus C0 subjects, equals zero in expectation. Moreover, Δ0 and Δ are strongly correlated because the trigger-complement group is a subset of the full experimental population. Given these properties of Δ0, we can take A = Δ0 and use the two-sided triggering estimator

(3) Δ_ts = Δ − θΔ0

to estimate ITT more efficiently. This idea was explored in (deng2015diluted), who also added other augmentations such as the difference between study groups in the proportion of triggered subjects.
2.2. CUPED for OneSided Triggering
We now consider the one-sided triggering case where triggering status is only observed in the treatment group. This occurs, for instance, when a subject has to opt in to receive an intervention. Among subjects assigned to treatment, those who opt in are T1 and those who do not are T0. Subjects assigned to control are not given the choice to opt in, so we cannot distinguish whether a control subject is C1 or C0. However, we can still assert that there is no treatment effect for the trigger-complement group. This means that if we can transform the entire control group to make it comparable to T0 in terms of the outcome distribution, then we can use the difference in outcomes between T0 and the transformed control group as a mean-zero augmentation. And since the treatment and control groups are random samples from the full experimental population, we can guarantee correlation between this augmentation term and Δ, thereby obtaining a more precise ITT estimate, following Eq (2).
To make T0 and the control group comparable, we leverage matching and covariate balancing techniques from the causal inference literature (Imbens:2015; rosenbaum1983central). The trick is to properly reweight subjects in the control group by their probability of being trigger-complements. Under randomization, the triggering rates in the treatment and control groups are the same in expectation, so we can use the treatment group to fit a triggering probability model (jo_stuart2009; ding2017principal; feller2017), and apply this model to the control group to estimate each control subject's probability of triggering. Formally, given a set of pre-assignment covariates X, let p(X_i) = P(S_i = 1 | X_i) be the triggering probability for subject i, and define the augmentation term

(4) A0 = [ Σ_{i∈t} (1 − S_i) Y_i ] / [ Σ_{i∈t} (1 − S_i) ] − [ Σ_{i∈c} (1 − p(X_i)) Y_i ] / [ Σ_{i∈c} (1 − p(X_i)) ].

The zero subscript in A0 reminds us that we are trying to identify the T0 and C0 trigger-complement subjects. A0 is the difference of two weighted averages: for treated subjects we use a hard weight of the observed 1 − S_i, and for control subjects we use a soft weight 1 − p(X_i). (The weighted IV estimator uses soft weights for both treatment and control. That would not provide a mean-zero augmentation in our framework, because outcomes from T1 carry a treatment effect that would bias the augmentation.) The first term on the right hand side of Eq (4) is just Ȳ_T0, the average outcome in T0. The second term is an importance-sampled average over the control subjects that mimics the average outcome in C0. E[A0] = 0 is straightforward to prove under the weak principal ignorability assumption (jiang2020multiply; jiang2020identification). Here, we state the assumption and theorem and leave the proof for Appendix A.
[Weak Principal Ignorability] There exists a set of pre-assignment covariates X such that E[Y(0) | S = 1, X] = E[Y(0) | S = 0, X].
This is implied by the slightly stronger assumption that Y(0) and the triggering counterfactual status are conditionally independent given X.
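To make Eq (4) concrete, here is a small Python sketch with an assumed-known principal score p(x) (everything here is illustrative). The hard-weighted T0 average and the soft-weighted control average agree in expectation, so A0 is centered at zero:

```python
import random
import statistics

rng = random.Random(3)

def p_trigger(x):
    """Assumed-known principal score (illustrative)."""
    return 0.1 + 0.6 * x

treat, control = [], []
for _ in range(20000):
    x = rng.random()                       # pre-assignment covariate
    s = rng.random() < p_trigger(x)        # triggering counterfactual status
    z = rng.random() < 0.5                 # random assignment
    y = x + rng.gauss(0, 0.2) + (0.5 if (z and s) else 0.0)  # effect only for triggered treatment
    if z:
        treat.append((s, y))
    else:
        control.append((x, y))             # s is unobserved in control

# Eq (4): hard weight (1 - S) on treatment, soft weight (1 - p(X)) on control.
y_t0 = statistics.fmean(y for s, y in treat if not s)
den = sum(1 - p_trigger(x) for x, y in control)
y_c0 = sum((1 - p_trigger(x)) * y for x, y in control) / den
a0 = y_t0 - y_c0
print(a0)  # close to zero up to sampling noise
```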
2.2.1. Steps to implement the onesided triggering estimator
To apply the mean-zero augmentation with CUPED, we carry out the following steps:

Fit a model to predict the triggering probability p(X) using treatment group data. Apply the fitted model to the control group to obtain predicted triggering probabilities for the control subjects. Alternatively, the control weights can be optimized to directly balance a set of covariates (see Section 4).

Define A0 as in Eq (4).

Define θ = Cov(Δ, A0)/Var(A0).

The CUPED One-Sided Trigger estimator is

(5) Δ_os = Δ − θA0,

and has variance

(6) Var(Δ_os) = Var(Δ) − 2θ Cov(Δ, A0) + θ² Var(A0).
In practice, we recommend using the bootstrap (efron1994introduction) to estimate Var(Δ), Var(A0), and Cov(Δ, A0), and to refit the triggering probability model for each bootstrap sample. We can then estimate the variance of Δ_os with

(7) V̂ar(Δ_os) = V̂ar(Δ) − Côv(Δ, A0)² / V̂ar(A0).

(The variance estimator (7) treats θ as a fixed parameter, though in practice we need to estimate it. The additional uncertainty in the estimated θ may cause finite-sample bias, but at the large sample sizes common to online experimentation, and with enough bootstrap iterations, this bias should be negligible. This is confirmed in our simulation studies.)
3. Testing the MeanZero Assumption
In Section 2.2, we discussed how Δ_os is asymptotically unbiased for ITT under weak principal ignorability. This ignorability condition is not testable because it is based on counterfactuals Y(0) and S that are not observable at the same time; e.g., we only observe S in the treatment group. But we can still make progress, because principal ignorability is a sufficient but not necessary condition for Δ_os to be unbiased. In fact, Δ_os is unbiased for ITT as long as the augmentation A0 equals zero in expectation, and this is testable, for example via a Wald test, using the delta method (Dengkdd2018; allofstat) or the bootstrap described in Section 2.2.1 to estimate the variance of A0.
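Given the observed augmentation and its bootstrap replicates, the Wald test amounts to a z-statistic against a normal reference. A sketch, with the bootstrap draws simulated directly around an illustrative nonzero mean so that the test fails:

```python
import math
import random
import statistics

rng = random.Random(11)

# Observed augmentation and its bootstrap replicates (simulated here with an
# illustrative nonzero mean, so the test should reject the mean-zero assumption).
a0_obs = 0.012
a0_boot = [a0_obs + rng.gauss(0, 0.004) for _ in range(1000)]

z_stat = a0_obs / statistics.stdev(a0_boot)            # Wald statistic
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z_stat) / math.sqrt(2))))
print(z_stat, p_value)  # |z| around 3: reject mean-zero at the 5% level
```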
Only requiring that E[A0] = 0 means practitioners have lots of freedom to modify and improve upon the augmentation term (4). As is common in observational data analyses, we can apply additional procedures such as weight bucketing, weight trimming, and outcome outlier removal to further centralize A0 towards a mean of 0. Alternatively, instead of using a triggering probability model to predict control weights, the weights can be optimized to directly balance covariates under other regularization criteria (imai2014covariate; hainmueller2012entropy; athey2018approximate) that match the control group to T0. Unlike in typical covariate balancing exercises, where we can only balance on pre-assignment covariates (to ensure that covariate balancing does not lead to biased treatment impact estimates), here we are allowed to balance on in-experiment observations, even the outcome of interest itself. This is because in our triggering experiment setup, where only triggered subjects can be affected by the treatment, we have the constraint that we should observe no treatment effect when comparing T0 and (reweighted) control units. However, matching the control group too well to T0 can eliminate precision gains. In the extreme case, we can balance the outcome of interest directly by finding weights such that the augmentation (4) is exactly 0. An augmentation of exactly 0 has no effect on the estimator it augments and won't provide any efficiency gains.
4. How Covariate Balancing Relates to a Bias-Variance Tradeoff
In the previous section, we discussed choosing weights to directly balance the covariates of the control group to match those of T0. We can also expand covariate selection from pre-assignment observations to in-experiment observations. In one extreme case, when we treat the outcome of interest itself as an in-experiment covariate, we can force (4) to be a point mass at 0 and therefore trivially satisfy the mean-zero requirement. But this would also forego any precision gains: following Eq (6), if A0 is a point mass at zero, then Var(Δ_os) = Var(Δ).
In theory, we would like to find the minimum set of covariates satisfying the weak principal ignorability assumption (2.2). In practice, when pre-assignment covariates do not capture all relevant confounders of triggering status and outcome, balancing on in-experiment covariates can pull E[A0] closer to zero, in exchange for balancing away "good" variation introduced by the augmentation term, thereby weakening efficiency gains. This observation has important practical implications:

We are guaranteed a mean-zero augmentation by balancing the outcome directly between T0 and the reweighted control group. But this provides no variance reduction.

We can start by balancing on a small set of pre-assignment covariates to construct an augmentation that reduces variance. However, this augmentation may fail the mean-zero test because it is missing important confounders of triggering status and outcome.

We can gradually add more covariates for balancing, including in-experiment observations that are highly correlated with the outcome. Adding more balancing covariates can debias the augmentation, but will also lessen the amount of variance reduced.
4.1. Guidance on which covariates to balance
Consider the law of total variance, Var(Y) = Var(E[Y | X]) + E[Var(Y | X)], in which we decompose the variance of Y into the variance explained by X (the first term on the right hand side) and the remaining variance not explained by X. In analyses of randomized experiments, removing variance explained by pre-assignment covariates generally improves estimation efficiency relative to the unadjusted Δ. This adjusts for covariate imbalances that make experiment results harder to interpret, since the difference in study group outcomes may be attributed either to the treatment or to differences in baseline covariates. Common approaches to adjust for covariate imbalance include post-stratification for discrete covariates (miratrix2013adjusting) and regression adjustment for general covariates. Similarly, when applying CUPED estimators, we should always adjust for pre-assignment covariate imbalance, either by including augmentation terms, or by first residualizing the outcomes and then applying CUPED to the residualized outcomes.
After adjusting for pre-assignment covariates, we consider whether to directly balance on certain covariates to further improve the efficiency of a CUPED estimator. Recall that the simplest form of the CUPED estimator is Δ − θA, with variance Var(Δ) − 2θ Cov(Δ, A) + θ² Var(A). We reduce the variance of Δ by adjusting away noise that is shared by Δ and the augmentation A. In other words, we strive to maximize the variation captured in A that correlates with variation in Δ. For the one-sided triggering augmentation (see Eq 4), this means we want to keep as much variation due to covariate imbalances as possible, and only minimally balance covariates in order for A0 to have a mean of zero. rosenbaum1983central showed that the coarsest balancing score based on pre-assignment covariates is the propensity score, i.e., the probability that a data point from T0 ∪ C belongs to T0, given X. This means that under a weak ignorability assumption, if we know the ground truth triggering probabilities, or equivalently the propensity score of a data point from T0 ∪ C belonging to T0, then using these values to construct weights for A0 will lead to the largest variance reduction while keeping the estimator unbiased. Any additional covariate balancing would decrease the amount of variance reduced by the augmentation.
So, should we balance on a covariate or not? For pre-assignment covariates, we should always combine one-sided triggering CUPED with pre-assignment augmentation or regression adjustment. This means variation in Δ due to pre-assignment covariate imbalance should be removed whenever possible. Since the adjusted Δ already seeks to exclude this noise, keeping extra variation from covariate imbalance in the augmentation no longer contributes to increased correlation between Δ and A0. Hence, covariate balancing on pre-assignment covariates at least won't hurt when regression adjustment by the same covariates is also applied. For in-experiment covariates, things are different. We cannot balance on in-experiment observations from T1, since T1 subjects are exposed to the active treatment and potentially affected by the intervention; using their in-experiment covariates would bias the one-sided triggering estimator. We can balance on in-experiment observations from T0 and the control group, provided it is reasonable to believe that these subjects are unaffected by the treatment intervention. Balancing on in-experiment covariates may reduce bias in A0 when the pre-assignment covariates used for adjustment fail to include all confounders of triggering status and outcome. However, this additional balancing comes at the cost of reduced efficiency gains.
To summarize, we have a knob to trade off bias and variance as we include more in-experiment observations for balancing. On one end, we use only pre-assignment covariates, possibly omitting some confounders of triggering status and outcome; then E[A0] ≠ 0 and the CUPED estimator is biased, though we will see efficiency gains. On the other end, we balance on the in-experiment outcome of interest and the augmentation becomes trivially mean-zero; then the CUPED estimator is unbiased, but provides no efficiency gains.
5. Principal Score or Propensity Score?
In Section 2.2, we proposed estimating triggering probabilities — also known as principal scores in the principal stratification literature — and using them as balancing weights in (4). Alternatively, to balance the control group to match T0, we could use the propensity score q(X) = P(i ∈ T0 | X_i, i ∈ T0 ∪ C). Both balancing scores lead to the same weighting of subjects in (4). To see this, let r = P(Z = 1) denote the treatment assignment probability, and write the principal score as p(x) = P(S = 1 | X = x). Among units in T0 ∪ C with covariate value x, the probability of belonging to T0 is

(8) q(x) = r(1 − p(x)) / (r(1 − p(x)) + 1 − r).

To reweight the control group to match T0, the weights for data points in the control group are proportional to the propensity odds q(x)/(1 − q(x)). Using (8), this simplifies to r(1 − p(x))/(1 − r) ∝ 1 − p(x). We have thus shown that the propensity score leads to the same weights defined by the principal score in (4).
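This equivalence is easy to check numerically. The snippet below evaluates the propensity odds q(x)/(1 − q(x)) implied by Eq (8) at several principal-score values and confirms the odds are proportional to 1 − p(x):

```python
# Numerical check of Eq (8): with assignment probability r, a unit in
# T0-union-C with covariates x belongs to T0 with probability
#   q(x) = r(1 - p(x)) / (r(1 - p(x)) + 1 - r),
# and the implied odds weight for a control unit is proportional to 1 - p(x).
r = 0.75  # treatment assignment probability, matching the simulation study

def q(p):
    return r * (1 - p) / (r * (1 - p) + 1 - r)

ratios = [(q(p) / (1 - q(p))) / (1 - p) for p in (0.01, 0.05, 0.2, 0.5, 0.9)]
print(ratios)  # every entry equals the constant r / (1 - r) = 3.0
```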
In practice, however, the two balancing scores have different finite sample performance, due to how the principal and propensity score models are fit on empirical data. The principal score model is trained using treatment data only and then applied to the control group, so the predicted control weights are out-of-sample predictions. In contrast, fitting the propensity model on T0 ∪ C data, with membership in T0 as the positive label, yields in-sample predictions. In Section 6, we show that out-of-sample principal score predictions perform better in terms of variance reduction than in-sample propensity score predictions, even when both methods lead to unbiased CUPED estimators.
6. Simulation Study
We ran four simulation studies to compare the estimation performance of the CUPED One-Sided Trigger estimator versus other standard ITT estimators. We also assess multiple versions of the One-Sided Trigger estimator, in which we alter specifications of the augmentation term. In the first study, we implement the One-Sided Trigger as described in Section 2.2.1 and compare its average bias and standard error against those of the Naive estimator Δ. We also compare to estimators that could be used when triggering status is fully observed in both treatment and control groups. In Study 2, we explore using predicted principal scores versus estimated propensity scores to construct the CUPED augmentation term (ref. Section 5). We demonstrate the phenomenon of increased variance due to covariate over-balancing when we use in-sample propensity score predictions (ref. Section 4). Study 3 shows the impact of combining regression adjustment with one-sided triggering augmentation. Study 4 provides an example where observed pre-assignment covariates are not sufficient to make the augmentation mean-zero, and it is necessary to balance on an in-experiment covariate.
6.1. Data generating process
Our simulation design, illustrated in Figure 2, mimics a conversion process commonly of interest in online experimentation. We outline the data generating process (dgp) here and refer the reader to Appendix B for details and R code.
We simulate a randomized experiment with 100,000 users, of which 75,000 are assigned to treatment and the remaining 25,000 are assigned to control. Each user belongs to either a high or low engagement tier, represented by an unobserved binary label U. Two pre-assignment covariates, X1 and X2, are generated from uniform distributions with a lower bound of 0. The upper bound for X1 is larger when U = 1 than when U = 0; this allows us to set a higher conversion rate for high-engagement users. The upper bound for X2 is a fixed constant. The user-specific triggering rate (which parameterizes the triggering counterfactual status) is a linear function of X1 and X2. The user-specific baseline conversion rate is also a linear function of X1 and X2. For those who triggered in the treatment group, we add an additional constant treatment effect to the conversion rate. The outcome Y is generated from a binomial distribution with 30 trials and probability of success equal to the user-specific conversion rate, which can be interpreted as total conversions in a 30-day period under a fixed daily conversion rate. From Figure 2, we see that X1 and X2 affect both the triggering status and the outcome. The weak principal ignorability assumption (2.2) is satisfied when we condition on X1 and X2, thereby blocking all the backdoor paths from triggering status to outcome (pearl2000models). Unless otherwise stated, we assume triggering status is only observed in the treatment group, i.e., we have an experiment with one-sided triggering.
The most important statistics for this simulation study are as follows:

The ground truth ITT effect is 0.075.

The triggering rate is 5%.

Treatment and control sample sizes are 75,000 and 25,000, respectively.
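For readers who want to experiment without the appendix, the dgp described above can be paraphrased in Python as follows. All coefficients are illustrative stand-ins (the paper's exact values are in its Appendix B), chosen so the triggering rate lands near 5%; the sample is downsized for speed:

```python
import random

rng = random.Random(2023)

# Illustrative stand-in coefficients, not the paper's Appendix B values.
def make_user():
    u = rng.random() < 0.5                            # unobserved engagement tier
    x1 = rng.uniform(0, 0.30 if u else 0.10)          # upper bound depends on tier
    x2 = rng.uniform(0, 0.10)
    z = rng.random() < 0.75                           # 75% assigned to treatment
    s = rng.random() < 0.3 * x1 + 0.4 * x2            # triggering: linear in x1, x2
    conv = 0.02 + 0.10 * x1 + 0.05 * x2               # baseline conversion rate
    if z and s:
        conv += 0.05                                  # constant triggered effect
    y = sum(rng.random() < conv for _ in range(30))   # conversions over 30 "days"
    return z, s, y

users = [make_user() for _ in range(20000)]
trigger_rate = sum(s for _, s, _ in users) / len(users)
print(trigger_rate)  # near 0.05 by construction
```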
For each simulation study, we run 50,000 simulation trials and report the following for each estimator:

Est. ITT. We compare this to the true ITT effect of 0.075 to assess the mean bias of the estimator.

Sample standard deviation across the 50,000 trials. (This is the Monte Carlo estimate of the true standard error, which we treat as the estimator's true standard error.)

Average estimated standard error (SE). We estimate the SE of all One-Sided Trigger estimators following the variance estimator (7), with 1,000 bootstrap samples. The SE of the Naive estimator is estimated with the standard two-sample formula.
In the following sections, we summarize results from each simulation study.
6.2. Study 1: Benchmark against other unbiased estimators
We begin by benchmarking the CUPED One-Sided Trigger against other unbiased ITT estimators, including Naive, Trigger-Dilute, and the CUPED Two-Sided Trigger from Section 2.1. To obtain Trigger-Dilute and Two-Sided Trigger estimates, we assume triggering status is observed for control subjects. We implement the CUPED One-Sided Trigger using augmentation (4), where the weights are predicted triggering probabilities. In particular, for each simulation trial, we fit a logistic regression model using treatment group data and make (out-of-sample) predictions on the control group. Simulation results are in Table 2. We make several observations.
First, all estimators are unbiased for the true ITT of 0.075.
Second, Trigger-Dilute and Two-Sided Trigger have the same ground-truth variance, and both reduce SE by about a factor of 4 compared to Naive.
This is roughly a 16x variance reduction, which corroborates the heuristic that the variance reduction from trigger-dilute analysis can approach the reciprocal of the triggering rate (set at 5% in our DGP).
The proposed One-Sided Trigger with augmentation (4) has the smallest SE.
This means that when the weak principal ignorability assumption holds, not observing triggering status in the control group does not prevent us from using triggering to reduce variance.
In fact, exploiting the ignorability assumption allows us to obtain precision improvements even beyond Trigger-Dilute.
Third, all the variance estimators that we use indeed recover the ground-truth SE.
In particular, we have confirmed that the bootstrap procedure described in Section 2.2 recovers the true SE of the One-Sided Trigger estimator.
Table 2.
Estimator  Est. ITT  True SE  Est. SE
Naive  0.0750  0.0122  0.0123
Trigger-Dilute  0.0750  0.00315  0.00315
CUPED Two-Sided Trigger  0.0750  0.00315  0.00324
CUPED One-Sided Trigger  0.0750  0.00195  0.00195
A surprising result is that the CUPED One-Sided Trigger outperforms both Trigger-Dilute and the CUPED Two-Sided Trigger, even though the latter two estimators have the benefit of observing triggering status in both the control and treatment groups. The extra efficiency gain for the One-Sided Trigger likely comes from its exploitation of weak principal ignorability, which allows it to use the entire control group (i.e., a larger sample size) to obtain a more precise estimate of the relevant control outcome mean. In our simulation setup, only 25% of units are assigned to control, so increasing the effective sample size of the control group helps reduce the variance of the estimated control means, which are subsequently used to estimate the average treatment effect.
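The weighting scheme used in this study, fitting a triggering model on treatment-group data and predicting out of sample on the control group, can be sketched as follows. This is a toy with a single made-up covariate and assumed triggering rates, and a hand-rolled Newton-Raphson fit stands in for a packaged logistic regression.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Logistic regression via Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
n_t, n_c = 7_500, 2_500                        # scaled-down treatment / control sizes
x_t = rng.random(n_t)                          # a single pre-assignment covariate (toy)
x_c = rng.random(n_c)
s_t = (rng.random(n_t) < 0.02 + 0.06 * x_t).astype(float)  # triggering, observed in treatment only

beta = fit_logistic(np.column_stack([np.ones(n_t), x_t]), s_t)
# Out-of-sample predicted triggering probabilities for control units,
# which would then serve as balancing weights in augmentation (4).
p_hat = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(n_c), x_c]) @ beta)))
```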
6.3. Study 2: Alternative ways to balance the trigger-complement and control groups
This study explores different ways to balance the treatment trigger-complement and the control group to create an augmentation term for the One-Sided Trigger. Specifically, we compare using predicted triggering probabilities versus estimated propensity scores as balancing weights in our augmentation (4). We also use weights obtained from entropy balancing (hainmueller2012entropy; qingyuan2017entropy) on different sets of covariates, to directly enforce a zero mean difference in each of these covariates when we compare the reweighted control group to the trigger-complement.
Table 3.
Estimator  Est. ITT  True SE  Est. SE
Naive  0.0750  0.0122  0.0123
Trigger-Dilute  0.0750  0.00315  0.00315
CUPED One-Sided Trigger estimators:
Triggering probability (out-of-sample prediction)  0.0750  0.00195  0.00195
Triggering probability (ground truth)  0.0750  0.00227  0.00228
Propensity score (in-sample prediction)  0.0750  0.00738  0.00738
Entropy balance on  0.0750  0.00738  0.00737
Entropy balance on  0.0750  0.00797  0.00797
Entropy balance on true triggering probability  0.0750  0.00729  0.00726
Results are shown in Table 3. Because we correctly account for the confounders when fitting the triggering probability model, the propensity score model, and the entropy balancing weights, all the One-Sided Trigger estimators are unbiased. In terms of standard errors, using triggering probabilities as weights gives the largest variance reduction. In this study, the out-of-sample triggering probability estimator has an SE of 0.00195, which is more than 3 times smaller than the SEs of the in-sample propensity score estimator and the entropy balance estimators (SEs of roughly 0.0073 to 0.0080). Compared to using propensity score estimates as balancing weights, entropy balancing on true triggering probabilities is slightly more efficient, while entropy balancing on the larger covariate set leads to larger variance.
Section 5 showed that if we know the true triggering probability, we can compute the true propensity score, and the two probabilities lead to the same balancing weights in our augmentation (4). In practice, however, we have to estimate these probabilities. This study shows that using predicted triggering probabilities results in larger variance reduction than using estimated propensity scores. This is because the predicted triggering probabilities are out-of-sample predictions, whereas the estimated propensity scores are in-sample predictions. Using in-sample predictions to construct the CUPED augmentation term means less "good" variation is retained in the augmentation (see Section 4) that is correlated with the outcome. This, in turn, means the amount of variance reduced will be smaller. Similarly, when we entropy balance on the larger covariate set instead of the smaller one, we are overbalancing and again see a reduction in efficiency gains. It is also interesting that entropy balancing on true triggering probabilities results in a larger SE (0.00729) than using true triggering probabilities directly as weights (0.00227).
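Entropy balancing in its simplest form can be sketched as follows. This toy matches a single covariate mean by Newton iteration on the tilting parameter; the full method of hainmueller2012entropy matches several covariate moments simultaneously, and the sample and target value below are made up.

```python
import numpy as np

def entropy_balance_1d(x, target_mean, iters=100, tol=1e-12):
    """Exponential-tilting weights w_i proportional to exp(lam * x_i) such that
    the weighted mean of x equals target_mean (single-moment entropy balancing)."""
    lam = 0.0
    for _ in range(iters):
        w = np.exp(lam * (x - x.mean()))       # center x for numerical stability
        w = w / w.sum()
        m = (w * x).sum()
        if abs(m - target_mean) < tol:
            break
        v = (w * (x - m) ** 2).sum()           # d(weighted mean)/d(lam)
        lam = lam + (target_mean - m) / v      # Newton step
    return w

rng = np.random.default_rng(2)
x_control = rng.random(5_000)                  # a covariate in the control group (toy)
target = 0.40                                  # e.g., its mean in the trigger-complement
w = entropy_balance_1d(x_control, target)
```

The resulting weights are strictly positive and sum to one, so the reweighted control group matches the trigger-complement exactly on the balanced moment.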
6.4. Study 3: Balancing the trigger-complement and control groups after regression adjustment
Following the Section 4.1 guidance to always use regression adjustment for pre-assignment covariates, in this study we first fit a regression model predicting the outcome from the pre-assignment covariates, and then apply our various estimators to the residual outcomes.
This post-regression-adjustment analysis is analogous to CUPED with an augmentation already applied as a first step; we then seek a second augmentation term that leverages triggering and randomization.
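A minimal sketch of this two-step procedure, with an assumed linear outcome model and illustrative coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
treat = rng.random(n) < 0.75
x = rng.random((n, 2))                               # pre-assignment covariates
y = 1.0 + x @ np.array([2.0, 1.0]) + 0.1 * treat + rng.normal(0.0, 1.0, n)

# Step 1: regression-adjust the outcome on pre-assignment covariates (pooled OLS).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: apply an estimator to the residual outcomes; here, the naive
# difference in means, which stays unbiased because treatment is randomized
# and independent of the pre-assignment covariates.
itt_raw = y[treat].mean() - y[~treat].mean()
itt_adj = resid[treat].mean() - resid[~treat].mean()
```

Because the covariates explain part of the outcome variance, the residual-based estimate has a smaller standard error than the raw difference in means while targeting the same effect (0.1 in this toy).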
Table 4.
Estimator  Est. ITT  True SE  Est. SE
Naive  0.0750  0.00995  0.00999
Trigger-Dilute  0.0750  0.00275  0.00275
CUPED One-Sided Trigger estimators:
Triggering probability (out-of-sample prediction)  0.0750  0.00194  0.00194
Triggering probability (ground truth)  0.0750  0.00194  0.00194
Propensity score (in-sample prediction)  0.0750  0.00194  0.00194
Entropy balance on  0.0750  0.00194  0.00195
Entropy balance on  0.0750  0.00359  0.00357
Entropy balance on true triggering probability  0.0750  0.00194  0.00194
Table 5.
Estimator  Est. ITT  True SE  Est. SE
Naive  0.0750  0.0104  0.0105
Trigger-Dilute  0.0750  0.00287  0.00287
CUPED One-Sided Trigger estimators:
Triggering probability (out-of-sample prediction)  0.0797  0.00203  0.00202
Propensity score (in-sample prediction)  0.0797  0.00203  0.00202
Entropy balance on  0.0797  0.00203  0.00203
Entropy balance on  0.0750  0.00478  0.00480
Results in Table 4 show that all estimators remain unbiased. Both Naive and Trigger-Dilute now have smaller SEs than in Studies 1 and 2 because of the regression adjustment. All the CUPED One-Sided Trigger estimators also have smaller SEs compared to Table 3.
As mentioned in Section 4, when we combine regression adjustment with the one-sided trigger augmentation, balancing on the same pre-assignment covariates used for regression adjustment should not penalize us with increased variance. In fact, as Table 4 shows, all the CUPED One-Sided Trigger estimators now have the same standard error, with the exception of the estimator that includes an additional covariate in entropy balancing. This exception is not surprising, and illustrates the point that balancing on additional covariates that are not included in the regression adjustment will increase variance. In our simulation model (see Figure 2), we know that controlling for the pre-assignment covariates already satisfies the weak principal ignorability assumption. It is not necessary to balance further, and doing so is in fact overbalancing: balancing on covariates that are not needed to ensure the CUPED augmentation term has a zero mean, but that reduce the amount of covariate imbalance retained in the augmentation term and thereby reduce efficiency gains.
6.5. Study 4: Using in-experiment covariates for a bias-variance tradeoff
This study explores how One-Sided Trigger estimators perform when the weak principal ignorability assumption does not hold. Specifically, we pretend that one of the triggering-impact confounders is not observed pre-assignment, so we only have access to the other as a pre-assignment covariate. The former confounder is instead treated as an in-experiment observation: it cannot be used to fit a triggering probability or propensity score model, but it can be used for covariate balancing to create a CUPED augmentation term. We replicate the simulation of Study 3, with the only change being that this confounder is removed from the regression adjustment as well as from all model fitting and covariate balancing.
Comparing Table 5 to Table 4 shows that the Naive and Trigger-Dilute estimators are still unbiased, though now with slightly larger SEs because the in-experiment confounder is missing from the regression adjustment. The main difference, however, is that the One-Sided Trigger estimators with out-of-sample triggering probability weights, in-sample propensity score weights, or entropy balancing on the pre-assignment covariate alone are no longer unbiased. This confirms that the pre-assignment covariate alone does not capture all confounding between triggering status and the outcome. When the weak principal ignorability assumption is violated and no further adjustments are applied, the CUPED augmentation term (4) will be significantly different from zero, and the One-Sided Trigger will be biased. Because, in our DGP, conditioning on both confounders satisfies the weak principal ignorability assumption by blocking all backdoor paths from triggering status to the outcome, including the in-experiment covariate in entropy balancing successfully removes the bias, but results in a larger SE than its biased counterparts.
7. Conclusion
When the triggering rate in a randomized experiment is low, it is critical to exploit subjects' triggering counterfactual status for efficient estimation of the overall ITT effect. However, it is not always possible to confirm whether a control-group subject would have triggered the active treatment had they been assigned to the treatment group. This kind of one-sided triggering problem poses a challenge both in theory and in practice.
In this paper, we tackle one-sided triggering purely as a variance reduction problem. We take the inefficient difference-in-outcome-means estimator and add a mean-zero augmentation term, as a form of linear covariate adjustment, to reduce the variance of the estimated ITT. Any mean-zero augmentation is a search direction along which we can optimize for the best "step size" to obtain a new estimator: the original estimator minus the step size times the augmentation.
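For a base estimator and a mean-zero augmentation A, the variance-minimizing step size has the familiar CUPED form theta* = Cov(base, A) / Var(A). A Monte Carlo sketch with placeholder distributions (the correlation structure below is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n_reps = 200_000

# Replications of an unbiased base estimator (true ITT 0.075) and a
# mean-zero augmentation correlated with it (structure assumed).
shared = rng.normal(0.0, 1.0, n_reps)
tau0 = 0.075 + 0.010 * shared + 0.005 * rng.normal(0.0, 1.0, n_reps)
A = 0.010 * shared + 0.005 * rng.normal(0.0, 1.0, n_reps)

theta = np.cov(tau0, A)[0, 1] / A.var(ddof=1)  # optimal step size
tau_new = tau0 - theta * A                     # still unbiased, smaller variance
```

Because E[A] = 0, subtracting theta * A leaves the mean of the estimator unchanged for any theta; the choice of theta only affects the variance.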
We derive our mean-zero augmentation by comparing the treatment trigger-complement against the entire control group, properly weighting data points in the control group to make its outcome distribution comparable to that of the trigger-complement. It is known in the principal stratification literature that such weights exist and can be estimated as a function of pre-assignment covariates when weak principal ignorability holds. Following this theory, we propose an augmentation that uses triggering probabilities trained on the treatment group and predicted on the control group. We compare this approach to alternatives, such as directly training a propensity model and weighting units by estimated propensity scores, or using entropy balancing to enforce tight covariate balance between the trigger-complement and the control group. These discussions strengthen our understanding of the sources of variance reduction and their relation to regression adjustment and covariate balancing.
Our simulation study shows that our proposed estimator can be even more efficient than standard estimators used when the triggering counterfactual status is observed in both treatment and control groups. The augmentation that weights control units by their predicted triggering probabilities gives the largest variance reduction and is unbiased under the weak principal ignorability condition, while alternative weighting methods also considerably reduce variance. When conditioning on pre-assignment covariates is insufficient to satisfy weak principal ignorability and our augmentation fails a mean-zero test, we find that including in-experiment observations in covariate balancing (e.g., entropy balancing) can effectively reduce or eliminate bias, in exchange for extra variance. In the worst extreme, we balance on the outcome of interest itself, so the augmentation becomes a point mass at zero, but then we also forfeit any efficiency gains. In practice, however, there are often plenty of in-experiment observations correlated with the triggering counterfactual that, when controlled for or balanced on, can mitigate confounding between triggering status and the outcome. Using in-experiment observations as a pragmatic knob to trade off bias against variance is another novel aspect of our method.
Acknowledgements.
We thank George Roumeliotis and Zhenyu Lai for the motivating problem and discussions, and Mitra Akhatari, Jenny Chen, and Yunshan Zhu for propensity modeling.
Appendix A Proof of Theorem 2.2
Recall from Eq (4) that the augmentation is the difference between the average outcome in the treatment trigger-complement and a weighted average outcome over the control group, where S_i indicates observed triggering status and the weight for a control subject is the estimated probability that the subject is in the trigger-complement group. We prove that the augmentation has expectation zero by showing that both averages have the same (asymptotic) mean under weak principal ignorability.
First, the trigger-complement term is simply the sample average of the outcome in the treatment trigger-complement, and it has expectation E[Y(0) | S = 0], where Y(0) is the potential outcome under assignment to control. For a set of covariates X satisfying weak principal ignorability, let e(X) = P(S = 1 | X) and mu_0(X) = E[Y(0) | X]. Then
E[Y(0) | S = 0] = E[mu_0(X) | S = 0]   (9)
= E[(1 - e(X)) mu_0(X)] / P(S = 0),   (10)
where (9) follows from the weak principal ignorability assumption, and (10) follows from the law of iterated expectations:
E[mu_0(X) | S = 0] = E[1{S = 0} mu_0(X)] / P(S = 0) = E[(1 - e(X)) mu_0(X)] / P(S = 0).
Meanwhile, the control term is a weighted average of the outcome in the control group, with weights proportional to the trigger-complement probabilities 1 - e(X_i), and it has an asymptotic mean of
E[(1 - e(X)) Y(0)] / E[1 - e(X)].
Applying iterated expectations to the numerator gives
E[(1 - e(X)) Y(0)] = E[(1 - e(X)) mu_0(X)],
and the denominator equals E[1 - e(X)] = P(S = 0). Thus, the asymptotic mean of the control term equals E[(1 - e(X)) mu_0(X)] / P(S = 0), matching (10).
We have shown that the two terms of the augmentation have the same mean under weak principal ignorability, thereby concluding our proof of Theorem 2.2.
Appendix B Simulation R Code