Applying the Delta method in metric analytics: A practical guide with novel ideas

03/16/2018 ∙ by Alex Deng, et al. ∙ Microsoft 0

During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly and solve big data problems using distributed algorithms. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics, by highlighting both its classic and novel applications.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

1.1. Background

The era of big data brings both blessings and curses (Fan et al., 2014)

. On one hand, it allows us to measure more accurately and efficiently, to study smaller and more subtle effects, and to tackle problems with smaller signal-to-noise-ratio. On the other hand, it demands larger storage and more intensive computations, forcing the data science community to strive for efficient algorithms which can run in parallel in a distributed system. Many existing algorithms and methods (e.g., support vector machines) that are known to work well in small data scenarios do not scale well in a big data setting

(Cortes and Vapnik, 1995; Forero et al., 2010; Wang and Zhou, 2012). Recently, there has been an increasing amount of research interest in meta-algorithms, which can extend algorithms that are difficult to parallelize into distributed algorithms (Bottou, 2010; Zinkevich et al., 2010), and ideas that resemble the divide-and-conquer algorithm (Kleiner et al., 2014; Jordan et al., 2018).

At the same time, there is a class of algorithms which are trivially parallelizable and therefore can downsize big data problems into smaller ones. The key idea behind them, which dates back to Fisher (1922)

, is to summarize the original data set by a low-dimensional vector of summary statistics, which can often be computed in a parallel and distributed way. For example, to estimate the mean and variance of a normal distribution from independent and identically distributed (i.i.d.) observations, we only need to obtain their sum and sum of squares, which are the corresponding summary statistics

111In fact, they are sufficient statistics (Casella and Berger, 2002), i.e., they can represent the original data-set perfectly without losing any information. and can be trivially computed in a distributed fashion. In data-driven businesses such as information technology, these summary statistics are often referred to as metrics, and used for measuring and monitoring key performance indicators (Dmitriev and Wu, 2016; Deng and Shi, 2016). In practice, it is often the changes or differences between metrics, rather than measurements at the most granular level, that are of greater interest. In the context of randomized controlled experimentation (or A/B testing), inferring the changes of metrics can establish causality (Kohavi et al., 2013; Tang et al., 2010; Xu et al., 2015), e.g., whether a new user interface design causes longer view times and more clicks.

1.2. Central limit theorem and Delta method

We advocate directly modeling the metrics rather than the original data-set. When analyzing data at the most granular level (e.g., user), we need some basic assumptions of the underlying probabilistic model, such as i.i.d. observations. When looking at the metrics level, we also need to know their joint distribution. This is where the blessing of big data comes into play. Given large sample sizes, the metrics often possess desirable asymptotic properties due to the

central limit theorem (Van der Vaart, 2000)

. To ensure that the paper is self-contained, we first review the central limit theorem in its most well-known form. Let

be i.i.d. observations with finite mean and variance We let denote the sample average, then as the sample size in distribution

A common application of the central limit theorem is to construct the confidence interval of as where is the th quantile for This is arguably one of the most influential results of asymptotic statistics used in almost all scientific fields whenever an estimate with error bars is presented.

While an influential result, the central limit theorem in its basic form only applies to the average of i.i.d. random variables, and in practice our metrics are often more complex. To enjoy the blessing of big data, we employ the Delta method, which extends the normal approximations from the central limit theorem broadly. For illustration we only review the uni-variate case. For any random variable

(the subscript indicates its dependency on e.g., sample average) and constant such that in distribution as the Delta method allows us to extend its asymptotic normality to any continuous transformation To be more specific, by using the fact that and the first order Taylor expansion (Rudin et al., 1964)


we have in distribution

This is the Delta method. Without relying on any assumptions other than “big data,” the Delta method is general. It is also memorable – anyone with basic knowledge of calculus can derive it. Moreover, the calculation is trivially parallelizable and can be easily implemented in a distributed system. Nevertheless, although conceptually and theoretically straightforward, the practical difficulty is to find the right “link” function that transforms the simple average to our desired metric. Because of different attributes of the metrics, such as the underlying data generating process and aggregation levels, the process of discovering the corresponding transformation can be challenging. However, unfortunately, although various applications of the Delta method have previously appeared in the data mining literature (Deng et al., 2013; Kohavi et al., 2013; Tang et al., 2010; Kohavi et al., 2010), the method itself and the discovery of were often deemed technical details and only briefly mentioned or relegated to appendices. Motivated by this gap, we aim to provide a practical guide that highlights the effectiveness and importance of the Delta method, hoping to help fellow data scientists, developers and applied researchers conduct trustworthy metric analytics.

1.3. Scope and contributions

As a practical guide, this paper presents four applications of the Delta method in real-life scenarios, all of which have been deployed in Microsoft’s online A/B testing platform ExP (Kohavi et al., 2013, 2009a) and employed to analyze experimental results on a daily basis:

  • In Section 2, we derive the asymptotic distributions of ratio metrics222Pedantically, ratio of two measurements of the same metric, from the treatment and control groups.. Compared with standard approaches by Fieller (1940, 1954), the Delta method provides a much simpler and yet almost equally accurate and effective solution;

  • In Section 3, we analyze cluster randomized experiments, where the Delta method offers an efficient alternative algorithm to standard statistical machinery known as the mixed effect model (Bates et al., 2014b; Gelman and Hill, 2006)

    , and provides unbiased estimates;

  • In Section 4, by extending the Delta method to outer confidence intervals (Meyer, 1987), we propose a novel hybrid method to construct confidence intervals for quantile metrics with almost no extra computational cost333Our procedure computes two more quantiles for confidence interval. Since the main cost of quantile computing is sorting the data, computing three and one quantiles cost almost the same.. Unlike most existing methods, our proposal does not require repeated simulations as in bootstrap, nor does it require estimating an unknown density function, which itself is often a theoretically challenging and computationally intensive task, and has been a center of criticism (Brown and Wolfe, 1983);

  • In Section 5, we handle missing data in within-subject studies by combining the Delta method with data augmentation. We demonstrate the effectiveness of “big data small problem” approach when directly modeling metrics. Comparing to alternative methods that need to put up a model for individual subjects, our method requires less model assumptions.

The main purpose of this paper is to promote the Delta method as a general, memorable and efficient tool for metric analytics. In particular, our contributions to the existing data mining literature include:

  • Practical guidance for scenarios such as inferring ratio metrics, where the Delta method offers scalable and easier-to-implement solutions;

  • A novel and computationally efficient solution to estimate the sampling variance of quantile metrics;

  • A novel data augmentation technique that employs the Delta method to model metrics resulting from within-subject or longitudinal studies with unspecified missing data patterns.

For reproducibility and knowledge transfer, we provide all relevant computer programs at

2. Inferring percent changes

2.1. Percent change and Fieller interval

Measuring change is a common theme in applied data science. In online A/B testing (Kohavi et al., 2013; Tang et al., 2010; Deng and Shi, 2016; Kohavi et al., 2007; Dmitriev et al., 2017), we estimate the average treatment effect (ATE) by the difference of the same metric measured from treatment and control groups, respectively. In time series analyses and longitudinal studies, we often track a metric over time and monitor changes between different time points. For illustration, let be i.i.d. observations from the control group with mean and variance i.i.d. observations from the treatment group with mean and variance and the covariance444For A/B testing where the treatment and control groups are independently sampled from a super population, between ’s and ’s. Let and be two measurements of the same metric from the treatment and control groups, respectively, and their difference is an unbiased estimate of the ATE Because both and are approximately normally distributed, their difference also follows an approximate normal distribution with mean and variance

Consequently, the well-known 100(1-)% confidence interval of is where is the finite-sample analogue of and can be computed using the sample variances and covariance of the treatment and control observations, denoted as and respectively.

In practice, however, absolute differences as defined above are often hard to interpret because they are not scale-invariant. Instead, we focus on the relative difference or percent change estimated by The key problem of this section is constructing a 100(1-)% confidence interval for . For this classic problem, Fieller (1940, 1954) seems to be the first to provide a solution. To be specific, let denote the th quantile for the

distribution with degree of freedom

and then Fieller’s interval of is


Although widely considered as the standard approach for estimating variances of percent changes (Kohavi et al., 2009b), deriving (2) is cumbersome (see (Von Luxburg and Franz, 2009) for a modern revisit). The greater problem is that, even when given this formula, applying it often requires quite some effort. According to (2) we need to not only estimate the sample variances and covariance, but also the parameter

2.2. Delta method and Edgeworth correction

The Delta method provides a more intuitive alternative solution. Although they can be found in classic textbooks such as (Casella and Berger, 2002), this paper (as a practical guide) still provides all the relevant technical details. We let , and A multi-variate analogue of (1) suggests that which implies that


For let which are also i.i.d. observations. Consequently, we can re-write (3) as leading to the Delta method based confidence interval


(4) is easier to implement than (2), and is in fact the limit of (2) in the large sample case, because as we have and Although Fieller’s confidence interval can be more accurate for small samples (Hirschberg and Lye, 2010), this benefit appears rather limited for big data. Moreover, the Delta method can also be easily extended for a better approximation by using Edgeworth expansion (Hall, 2013; Boos and Hughes-Oliver, 2000). To be more specific, (3) suggests that in distribution

whose cumulative distribution function

contains a correction term in addition to the cumulative distribution function of the standard normal, and

, the skewness of the

’s. By simply replacing the terms “” in (4) with and the th and th quantiles of respectively, we obtain the corresponding Edgeworth expansion based confidence interval. Finally, we can add a second-order bias correction term to the point estimate in (4); the same correct term can be applied to the Edgeworth expansion based interval.

2.3. Numerical examples

To illustrate the performance of Fieller’s interval in (2), the Delta method interval in (4), and the Edgeworth interval with and without bias correction under different scenarios, we let the sample size For each fixed we assume that the treatment and control groups are independent, and consider three simulation models for i.i.d. experimental units

The above models aim to mimic the behaviors of our most common metrics, discrete or continuous. For each case, we repeatedly sample data sets, and for each data set we construct Fieller’s and the Delta method intervals, respectively, and then add correction to the Delta method result. We also construct the Edgeworth expansion interval without and with the bias correction.

Method n Fieller Delta Delta(BC) Edgeworth Edgeworth(BC)
Normal 20 0.9563 0.9421 0.9422 0.9426 0.9426
Normal 50 0.9529 0.9477 0.9477 0.9478 0.9477
Normal 200 0.9505 0.9490 0.9491 0.9490 0.9490
Normal 2000 0.9504 0.9503 0.9503 0.9503 0.9503
Poisson 20 0.9400 0.9322 0.9370 0.9341 0.9396
Poisson 50 0.9481 0.9448 0.9464 0.9464 0.9478
Poisson 200 0.9500 0.9491 0.9493 0.9496 0.9498
Poisson 2000 0.9494 0.9494 0.9495 0.9494 0.9495
Bernoulli 20 0.9539 0.9403 0.9490 0.9476 0.9521
Bernoulli 50 0.9547 0.9507 0.9484 0.9513 0.9539
Bernoulli 200 0.9525 0.9513 0.9509 0.9517 0.9513
Bernoulli 2000 0.9502 0.9500 0.9499 0.9501 0.9500
Table 1. Simulated examples: The first two columns contain simulation models and sample sizes. The next five columns present the coverage rates of various 95% confidence intervals – Fieller’s, Delta method based (w/o and w/ bias correction) and Edgeworth expansion based (w/o and w/ bias correction).

We report the corresponding coverage rates in Table 1, from which we can draw several conclusions. For big data (), all methods achieve nominal (i.e., ) coverage rates for all simulation models. For small data (), although Fieller’s interval seems more accurate for some simulation models (e.g., normal), other methods perform comparably, especially after the bias correction. For simplicity in implementation and transparency in applications, we recommend Algorithm 1, which uses the Delta method based interval (4) with the bias correction.

1:function deltaci( )
4:     bc = bias correction term
5:     pest = + bc point estimate
6:     vest =
7:     return: confidence interval
Algorithm 1 Confidence interval for ratio: Delta method + bias correction

3. Decoding cluster randomization

3.1. The variance estimation problem

Two key concepts in a typical A/B test are the randomization unit – the granularity level where sampling or randomization is performed, and the analysis unit – the aggregation level of metric computation. Analysis is straightforward when the randomization and analysis units agree (Deng et al., 2017), e.g., when randomizing by user while also computing the average revenue per user. However, often the randomization unit is a cluster of analysis units (it cannot be more granular than the analysis unit, otherwise the analysis unit would contain observations under both treatment and control, nullifying the purpose of differentiating the two groups). Such cases, sometimes referred to as cluster randomized experiments in the econometrics and statistics literature (Athey and Imbens, 2017; Klar and Donner, 2001), are quite common in practice, e.g., enterprise policy prohibiting users within the same organization from different experiences, or the need to reduce bias in the presence of network interference (Backstrom and Kleinberg, 2011; Eckles et al., 2017; Gui et al., 2015). Perhaps more ubiquitously, for the same experiment we usually have metrics with different analysis units. For example, to meet different business needs, most user-randomized experiments run by ExP contain both user-level and page-level metrics.

We consider two average metrics555Pedantically, they are two measurements of the same metric. We often use metric to refer to both the metric itself (e.g. revenue-per-user) and measurements of the metric (e.g. revenue-per-user of the treatment group) and this distinction would be clear in the context. of the treatment and control groups, assumed to be independent. Without loss of generality, we only focus on the treatment group with clusters. For the th cluster contains analysis unit level observations (). Then the corresponding average metric is We assume that within each cluster the observations ’s are i.i.d. with mean and variance , and across clusters are also i.i.d.

3.2. Existing and Delta method based solutions

is not an average of i.i.d. random variables, and the crux of our analysis is to estimate its variance. Under strict assumptions, closed-form solutions for this problem exist (Donner, 1987; Klar and Donner, 2001). For example, when and for all


where is the between-cluster variance and is the coefficient of intra-cluster correlation, which quantifies the contribution of between-cluster variance to the total variance. To facilitate understanding of the variance formula (5), two extreme cases are worth mentioning:

  1. If then for each and all In this case, and

  2. If , then for all and therefore the observations ’s are in fact i.i.d. In this case, and (5) reduces to .

However, unfortunately, although theoretically sound, (5) has only limited practical value because the assumptions it makes are unrealistic. In reality, the cluster sizes and distributions of for each cluster are different, which means that and are different.

Another common approach is the mixed effect model, also known as multi-level/hierarchical regression (Gelman and Hill, 2006), where depends on and while the parameters themselves follow a “higher order” distribution. Under this setting, we can infer the treatment effect as the “fixed” effect for the treatment indicator term666Y Treatment + (1—User) in lme4 notation. A detailed discussion is beyond the scope of this paper; see Gelman and Hill (2006); Bates et al. (2014a).. Stan (Carpenter et al., 2016)

offers a Markov Chain Monte Carlo (MCMC) implementation aiming to infer the posterior distribution of those parameters, but this needs significant computational effort for big data. Moreover, the estimated ATE, i.e., the coefficient for the treatment assignment indicator, is for the randomization unit (i.e., cluster) but not the analysis unit level, because it treats all clusters with equal weights and can be viewed as the effect on the double average

which is usually different than the population average (Deng and Shi, 2016). This distinction doesn’t make a systematic difference when effects across clusters are homogeneous. However, in practice the treatment effects are often heterogeneous, and using mixed effect model estimates without further adjustment steps could lead to severe biases.

On the contrary, the Delta method solves the problem directly from the metric definition. Re-write into Let , and divide both the numerator and denominator by ,

Albeit not an average of i.i.d. random variables, is a ratio of two averages of i.i.d. randomization unit level quantities (Deng et al., 2017). Therefore, by following (3) in Section 2.2,


Therefore, the variance of depends on the variance of a centered version of (by a multiple of , not the constant as typically done). Intuitively, both the variance of the cluster size and of the within-cluster sum of observations contribute to (6). In particular, is an important contributor of the variance of leading to a practical recommendation for the experimenters – it is desirable to make the cluster sizes homogeneous; otherwise it can be difficult to detect small treatment effects due to low statistical power.

3.3. Numerical examples

Because the coverage property of (6) has been extensively covered in Section  2.3, we only focus on comparing it with the mixed effect model here. We first describe the data generating process, which consists of a two-level hierarchy as described in the previous section. First, at the randomization unit level, let the total number of clusters To mimic cluster heterogeneity, we divide clusters into three categories: small, medium and large. We generate the numbers of clusters in the three categories by the following multinomial distribution:

For the th cluster, depending on which category it belongs to, we generate and in the following way777The positive correlation between and is not important, and reader can try out code with different configuration.:

  • Small:

  • Medium:

  • High:

Second, for each fixed let for all This choice is motivated by binary metrics such as page-click-rate, and because of it we can determine the ground truth by computing the weighted average of weighted by the cluster sizes and the mixture of small, medium and large clusters.

Our goal is to infer and we compare the following three methods:

  1. A naive estimator pretending all observations are i.i.d;

  2. Fitting a mixed effect model with a cluster random effect

  3. Using the metric as in the first method, but using the Delta method for variance estimation.

Based on the aforementioned data generating mechanism, we repeatedly and independently generate 1000 data sets. For each data set, we compute the point and variance estimates of using the naive, mixed effect, and delta methods. We then compute empirical variances for the three estimators, and compare them to the average of estimated variances. We report the results in Table  2.

Method Ground Truth SD(True) Estimate Avg. SE(Model)
Naive 0.667 0.00895 0.667 0.00522
Mixed effect 0.667 0.00977 0.547 0.00956
Delta method 0.667 0.00895 0.667 0.00908
Table 2. Simulated examples: The first three columns contain the chosen method, the true value of

and the true standard deviation of the corresponding methods. The last two columns contain the point estimates and average estimated standard errors.

Not surprisingly, the naive method under-estimates the true variance – we had treated the correlated observations as independent. Both the Delta method and the mixed effect model produced satisfactory variance estimates. However, although both the naive and the Delta method correctly estimated , the mixed effect estimator is severely biased. This shouldn’t be a big surprise if we look deeper into the model and where the random effects are centered so . The sum of the intercept terms and stands for the per-cluster mean , and represents the average of per-cluster mean, where we compute the mean within each cluster first, and then average over clusters. This is different from the metrics defined as simple average of in the way that in the former all clusters are equally weighted and in the latter case bigger clusters have more weight. The two definitions will be the same if and only if either there is no heterogeneity, i.e. per-cluster means are all the same, or all clusters have the same size. We can still use the mixed effect model to get a unbiased estimate. This requires us to first estimate every (thus ), and then compute by applying the correct weight . The mixed effect model with the above formula gave a new estimate , much closer to the ground truth. Unfortunately, it is still hard to get the variance of this new estimator.

In this study we didn’t consider the treatment effect. In ATE estimation, the mixed effect model will similarly result in a biased estimate for the ATE for the same reason, as long as per-cluster treatment effects vary and cluster sizes are different. The fact that the mixed effect model provides a double average type estimate and the Delta method estimates the “population” mean is analogous to the comparison of the mixed effect model with GEE (generalized estimating equations) (Liang and Zeger, 1986). In fact, in the Gaussian case, the Delta method can be seen as the ultimate simplification of GEE’s sandwich variance estimator after summarizing data points into sufficient statistics. But the derivation of GEE is much more involved than the central limit theorem, while we can explain the Delta method in a few lines and it is not only more memorable but also provides more insights in (6).

4. Efficient variance estimation for quantile metrics

4.1. Sample quantiles and their asymptotics

Although the vast majority of metrics are averages of user telemetry data, quantile metrics form another category that is widely used to focus on the tails of distributions. In particular, this is often the case for performance measurements, where we not only care about an average user’s experience, but even more so about those that suffer from the slowest responses. Within the web performance community, quantiles (of, for example, page loading time) at 75%, 95% or 99% often take the spotlight. In addition, the 50% quantile (median) is sometimes used to replace the mean, because it is more robust to outlier observations (e.g., errors in instrumentation). This section focuses on estimating the variances of quantile estimates.

Suppose we have i.i.d. observations generated by a cumulative distribution function and a density function 888We do not consider cases when has discrete mass, and will have jumps. In this case the quantile can take many values and is not well defined. In practice this case can be seen as continuous case with some discrete correction.. The theoretical th quantile for the distribution is defined as . Let be the ascending ordering of the original observations. The sample quantile at is if is an integer. Otherwise, let be the floor of , then the sample quantile can be defined as any number between and

or a linear interpolation of the two

999When p is 0.5, the 50% quantile, or median, is often defined as the average of the middle two numbers if we have even number of observations.. For simplicity here we use which will not affect any asymptotic results. It is a well-known fact that, if are i.i.d. observations, following the central limit theorem and a rather straightforward application of the Delta method, the sample quantile is approximately normal (Casella and Berger, 2002; Van der Vaart, 2000):


where However, unfortunately, in practice we rarely have i.i.d. observations. A common scenario is in search engine performance telemetry, where we receive an observation (e.g., page loading time) for each server request or page-view, while randomization is done at a higher level such as device or user. This is the same situation we have seen in Section 3, where are clustered. To simplify future notations, we let where is the indicator function. Then (6) can be used to compute , and (7) holds in the clustered case with . This generalizes the i.i.d. case where . Note that the Delta method is instrumental in proving (7) itself, but a rigorous proof involves a rather technical last step that is beyond our scope. A formal proof can be found in (Van der Vaart, 2000).

4.2. A Delta method solution to a practical issue

Although theoretically sound, the difficulty of applying (7) in practice lies in the denominator whose computation requires the unknown density function at the unknown quantile A common approach is to estimate from the observed

using non-parametric methods such as kernel density estimation

(Wasserman, 2003). However, any non-parametric density estimation method is trading off between bias and variance. To reduce variance, more aggressive smoothing and hence larger bias need to be introduced to the procedure. This issue is less critical for quantiles at the body of the distribution, e.g. median, where density is high and more data exists around the quantile to make the variance smaller. As we move to the tail, e.g. 90%, 95% or 99%, however, the noise of the density estimation gets bigger, so we have to introduce more smoothing and more bias. Because the density shows up in the denominator and density in the tail often decays to , a small bias in estimated density can lead to a big bias for the estimated variance (Brown and Wolfe (1983) raised similar criticisms with their simulation study). A second approach is to bootstrap, re-sampling the whole dataset many times and computing quantiles repeatedly. Unlike an average, computing quantiles requires sorting, and sorting in distributed systems (data is distributed in the network) requires data shuffling between nodes, which incurs costly network I/O. Thus, bootstrap works well for small scale data but tends to be too expensive in large scale in its original form (efficient bootstrap on massive data is a research area of its own (Kleiner et al., 2014)).

An alternative method without the necessity for density estimation is more desirable, especially from a more practical perspective. One such method is called outer confidence interval (outer CI) (Krewski, 1976; Meyer, 1987), which produces a closed-form formula for quantile confidence intervals using combinatorial arguments. Recall that and is the count of observations no greater than the quantile. In the aforementioned i.i.d. case,

follows a binomial distribution. Consequently, when

is large where . If the quantile value then . The above equation can be inverted into a confidence interval for

. This means with 95% probability the true percentile is between the lower rank

and upper rank !

The traditional outer CI depends on being i.i.d. But when there are clusters, instead of being simply takes a different formula (6) by the Delta method and the result above still holds. Hence the confidence interval for a quantile can be computed in the following steps:

  1. fetch the quantile

  2. compute }

  3. compute , , , ,

  4. compute by setting equal to the result of equation (6)

  5. compute

  6. fetch the two ranks and

We call this outer CI with pre-adjustment. This method reduces the complexity of computing a quantile and its confidence interval into a Delta method step and subsequently fetching three “ntiles”. However, in this algorithm the ranks depends on , whose computation depends on the quantile estimates (more specifically the requires a pass through the data after quantile is estimated). This means that this algorithm requires a first ‘ntile’ retrieval, and then a pass through the data for computation, and then another two ‘ntile’ retrievals. Turns out, computing all three ‘ntiles’ in one stage is much more efficient than splitting into two stages. This is because retrieving ‘ntiles’ can be optimized in the following way: if we only need to fetch tail ranks, it is pointless to sort data that are not at the tail; we can use sketching algorithm to narrow down the possible range where our ranks reside and only sort in that range, making it even more efficient to retrieve multiple ‘ntiles’ at once. Along this line of thoughts, to make the algorithm more efficient, we noticed that (7) also implies that the change from being i.i.d. to clustered only requires an adjustment to the numerator , which is a simple re-scaling step, and the correction factor does not depend on the unknown density function in the denominator. If the outer CI were to provide a good confidence interval in i.i.d. case, a re-scaled outer CI with the same correction term should also work for the clustered case, at least when is large. This leads to the outer CI with post-adjustment algorithm:

  1. compute

  2. fetch , and

  3. compute }

  4. compute , , , ,

  5. compute by setting equal to the result of equation (6)

  6. compute the correction factor and apply it to and

We implemented this latter method in ExP using Apache Spark (Zaharia et al., 2016) and SCOPE (Chaiken et al., 2008).

4.3. Numerical examples

To test the validity and performance of the adjusted outer CI method, we compare its coverage to a standard non-parametric bootstrap ( replicates). The simulation setup consists of users (clusters) with observations each (

are uniformly distributed). Each observation is the sum of two i.i.d. random variables

, where is constant for each user. We consider two cases, one symmetric and one heavy-tailed distribution:

First, we find the ”true” 95th percentile value of these distribution by computing its value for a very large sample (). Second, we compute the confidence intervals for simulation runs using bootstrap and outer CI with pre- and post-adjustment and compare their coverage estimates (0.002 standard error), shown in Table  3. We found that when the sample contains 1000 or more clusters, all methods provide good coverage. Pre- and post-adjustment outer CI results are both very close to the much more computationally expensive bootstrap (in our un-optimized simulations, the outer CI method was 20 times faster than bootstrap). When the sample size was smaller than 1000 clusters, bootstrap was noticeably inferior to outer CI. For all sample sizes, pre-adjustment provided slightly larger coverage than post-adjustment, and this difference increased for smaller samples. In addition, because adjustment tends to result in increased confidence intervals, unadjusted ranks are more likely to have the same value as the quantile value, and thus post-adjustment is more likely to underestimate the variance in that case. In conclusion, post-adjustment outer CI works very well for large sample sizes and reasonably well for smaller samples, but has slightly inferior coverage compared to pre-adjustment. For big data, outer CI with post-adjustment is recommended due its efficiency, while for median to small sample sizes, outer CI with pre-adjustment is preferred if accuracy is paramount.

Distribution Bootstrap Outer CI Outer CI
(pre-adj.) (post-adj.)
Normal 100 0.9039 0.9465 0.9369
Normal 1000 0.9500 0.9549 0.9506
Normal 10000 0.9500 0.9500 0.9482
Log-normal 100 0.8551 0.9198 0.9049
Log-normal 1000 0.9403 0.9474 0.9421
Log-normal 10000 0.9458 0.9482 0.9479
Table 3. Simulated examples: The first two columns contain the simulated models and sample sizes. The last three columns contain the coverage rates for the Bootstrap, pre-adjusted and post-adjusted outer confidence intervals.

5. Missing Data and Within-Subject Analyses

5.1. Background

Consider the case that we are tracking a metric over time. For example, businesses need to track key performance indicators like user engagement and revenue daily or weekly for a given audience. To simplify, let us say we are tracking weekly visits-per-user. Visits-per-user is defined as a simple average for each week. If there is a drop between and , how do we set up an alert so that we can avoid triggering it due to random noise and control the false alarm rate? Looking at the variance, we see

and we need to have a good estimate of the covariance term because we know it is non-zero.

Normally, estimating the covariance of two sample averages is trivial and the procedure is very similar to the estimation of the sample variance. But there is something special about this case — missing data. Not everyone uses the product every week. For any online and offline metric, we are only able to define the metric using observed data. If for every user we observe its value in week and , the covariance can be estimated using the sample covariance formula. But if there are many users who appear in week but not in , it is unclear how to proceed. The naive approach is to estimate the covariance using complete observations, i.e. users who appear in both weeks. However, this only works if the data is missing completely at random. In reality, active users will show up more often and are more likely to appear in both weeks, meaning the missing data are obviously not random.

In this section we show how the Delta method can be very useful for estimating for any two time points and . We then use this to show how we can analyze within-subject studies, also known as pre-post, repeated measurement, or longitudinal studies. Our method starts with metrics directly, highly contrasting with traditional methods such as mixed effect models which build models from individual user’s data. Because we study metrics directly, our model is small and easy to solve with its complexity constant to the scale of data.101010Part of the work in this section has previously appeared in a technical report (Guo and Deng, 2015). Since authoring the technical report, the authors developed a better understanding of the differences between mixed effect models and the proposed method, and we present our new findings here. The technical report has more details about other types of repeated measurement models.

5.2. Methodology

How do we estimate covariance with a completely unknown missing data mechanism? There are many existing works on handling missing data. One approach is to model the propensity of a data-point being missing using other observed predictors (Davidian et al., 2005). This requires additional covariates/predictors to be observed, plus a rather strong assumption that conditioned on these predictors, data are missing completely at random. We present a novel idea using the Delta method after data augmentation without the need of modeling the missing data mechanism. Specifically, we use an additional indicator for the presence/absence status of a user in each period . For user in period let if user appears in period , and otherwise. For each user in period , instead of one scalar metric value , we augment it to a vector . When , i.e. user is missing, we define . Under this simple augmentation, the metric value , taking the average over those non-missing measurements in period , is the same as In this connection,

where the last equality is by dividing both numerator and denominator by the same total number of users who have appeared in any of the two periods.111111Actually, if there are more than two period, we can either use only users appeared in any of the two periods, or users appeared in any of all the periods. It is mathematically the same thing if we added more users and then treat them as not appeared in and , i.e. remains the same. Thanks to the central limit theorem, the vector is also asymptotically (multivariate) normal with covariance matrix , which can be estimated using sample variance and covariance because there is no missing data after augmentation. In Section 2 we already applied the Delta method to compute the variance of a ratio of metrics by taking the first order Taylor expansion. Here, we can expand the two ratios to their first order linear form where and are the means of and and the expansion for is similar. can then be computed using .

We can apply this to the general case of a within-subject study. Without loss of any generality and with the benefit of a concrete solution, we only discuss a cross-over design here. Other designs including more periods are the same in principle. In a cross-over design, we have two groups I and II. For group I, we assign them to the treatment for the first period and control for the second. For group II, the assignment order is reversed. We often pick the two groups using random selection, so each period is an A/B test by itself.

Let be the metric for group in period . Let . We know for a mean vector and covariance matrix . has the form since the two groups are independent with the same distribution. With the help of the Delta method, we can estimate from the data and treat it as a constant covariance matrix (hence ). Our model concerns the mean vector using other parameters which represent our interest. In a cross-over design, where the first two are baseline means for the two periods and our main interest is the treatment effect . (We assume there is no treatment effect for group I carried over to the 2nd period. To study carry over effect, a more complicated design needs to be employed.) In this parameterization,


The maximum likelihood estimator and Fisher information theory (Van der Vaart, 2000) paved a general way for us to estimate as well as the variance of the estimators for various mean vector models. For example, if we want to model the relative change directly, we just need to change the addition of into multiplication of . Notice the model we are fitting here is very small, almost a toy example from a text book. All the heavy lifting is in computing which is dealt with by the Delta method where all computations are trivially distributed. When is linear as in the additive cross-over model above, and where the Fisher Information .

5.3. Numerical examples

We simulate two groups with 1000 users each. The first group receives treatment in period 1, then control in period 2, while the second group receives the inverse. For each user, we impose an independent user effect that follows a normal distribution with mean 10 and standard deviation 3, and independent noises with mean 0 and standard deviation 2. Each user’s base observations (before treatment effect) for the two periods are . We then model the missing data and treatment effect such that they are correlated. We define a user’s engagement level by its user effect through , i.e. the probability that a user’s is bigger than another random user. We model the treatment effect as an additive normal with mean 10 and standard deviation 0.3, multiplied by . For each user and each of the two periods, there is a 1 - max(0.1, ) probability of this user being missing. We can interpret the missing data as a user not showing up, or as a failure to observe. In this model, without missing data, every user has two observations and the average treatment effect should be because . Since we have missing data and it is more likely for lower engagement levels to be missing, we expect the average treatment effect for the population of all observed users to be between 5 to 10. In the current model we didn’t add a time period effect such that the second period could have a different mean from the first period’s mean , but in analysis we always assume that this effect could exist.

We simulate the following 1000 times: each time we run both the mixed effect model as well as the additive cross-over model (8) and record their estimates of the ATE and the corresponding estimated variance. We then estimate the true estimator variance using the sample variance of those estimates among 1000 trials and compare that to the mean of the estimated variance to evaluate the quality of variance estimations. We also compute the average of ATE estimates and compare to the true ATE to assess the bias.

Ground Truth SD(True) Estimate Avg. SE(Model)
mixed effect 6.592 0.1295 7.129 0.1261
Delta method 6.592 0.1573 6.593 0.1568
Table 4. Simulated examples: The first three columns contain the method, true ATE and standard deviations of the corresponding methods. The last two columns contain the point estimates and average estimated standard errors.

Table 4 summarizes the results. We found both methods provide good variance estimation and the mixed effect model shows smaller variance. However, mixed effect also displays an upward bias in the ATE estimate while the Delta method closely tracks the true effect. To further understand the difference between the two, we separate the data into two groups: users with complete data, and users who only appear in one period (incomplete group). We run the mixed effect model for the the two groups separately. Note that in the second group each user only appears once in the data, so the model is essentially a linear model. Our working hypothesis is the following: because the mixed effect model assumes a fixed treatment effect, the effect for the complete and incomplete groups must be the same. The mixed effect model can take advantage of this assumption and construct an estimator by weighted average of the two estimates from the two groups, with the optimal weight inversely proportional to their variances. Table  5 shows the weighted average estimator is indeed very close to mixed effect model for both estimate and variance. The weighted average is closer to the complete group because the variance there is much smaller than that of the incomplete group since within-subject comparison significantly reduces noise. This explains why the mixed effect model can produce misleading results in within-subject analysis whenever missing data patterns can be correlated with the treatment effect. The Delta method, on the other hand, offers a flexible and robust solution.

Estimate Var
mixed effect model 7.1290 0.00161
mixed effect model on complete group 7.3876 0.00180
linear model on incomplete group 5.1174 0.01133
weighted avg estimator 7.0766 0.00155
Table 5. Simulated examples: Point and variance estimates using the mixed effect vs. weighted average estimators.

6. Concluding remarks

Measure everything is not only an inspiring slogan, but also a crucial step towards the holy grail of data-driven decision making. In the big data era, innovative technologies have tremendously advanced user telemetry and feedback collection, and distilling insights and knowledge from them is an imperative task for business success. To do so, we typically apply certain statistical models at the level of individual observations, fit the model using numerical procedures such as solving optimization problems, and draw conclusions and assess uncertainties from the fitted models. However, for big data such an analytical procedure can be very challenging. Based on the key observation that metrics are approximately normal due to the central limit theorem, this paper offers an alternative perspective by advocating modeling metrics, i.e., summaries or aggregations of the original data sets, in a direct manner. By doing so, we can decompose big data into small problems. However, although conceptually sound, in practice these metric level models often involve nonlinear transformations of data or complex data generating mechanisms, posing several challenges for trustworthy and efficient metric analytics.

To address these issues, we promoted the Delta method’s central role in making it possible to extend the central limit theorem to new territories. We demonstrated how to apply the Delta method in four real-life applications, and illustrated how this approach naturally leads to trivially parallelizable and highly efficient implementations. Among these applications, ratio metrics, clustered randomizations and quantile metrics are all common and important scenarios for A/B testing, and business analytics in general. Within-subject studies are becoming more popular for superior sensitivities, and missing data with an unknown mechanism is ubiquitous in both the online and offline worlds. Our contribution to technical improvements, novel ideas and new understandings includes bias correction for ratio metric estimation, combination of the Delta method and outer confidence intervals for quantile metric variance estimation, and the idea of data augmentation for general missing data problems in within-subject studies. We also revealed the connection between the Delta method and mixed effect models, and explained their differences. In addition, we pointed out the advantage of the Delta method in the presence of an unknown missing data mechanism. Overall speaking, we hope this paper can serve as a practical guide of applying the Delta method in large-scale metric analyses and A/B tests, so that it is no longer just a technical detail, but a starting point and a central piece of analytical thinking.

Although the Delta method can help tackle big data problems, it does not replace the need for rigorous experimental designs and probabilistic modeling. For example, “optimal” choices for cluster configurations, randomization mechanisms and data transformations are known to increase the sensitivities of metrics (Xie and Aurisset, 2016; Kharitonov et al., 2017; Budylin et al., 2018). We leave them as future research directions.

We benefited from insightful discussions with Ya Xu and Daniel Ting on quantile metrics and outer confidence intervals. We thank Jong Ho Lee for implementing the efficient quantile metric estimation and confidence interval construction in ExP using Apache Spark, and Yu Guo for previous work on repeated measurements.


  • (1)
  • Athey and Imbens (2017) Susan Athey and Guido W Imbens. 2017. The econometrics of randomized experiments. Handbook of Economic Field Experiments 1 (2017), 73–140.
  • Backstrom and Kleinberg (2011) Lars Backstrom and Jon Kleinberg. 2011. Network bucket testing. In Proceedings of the 20th international conference on World wide web. ACM, 615–624.
  • Bates et al. (2014a) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2014a. Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823 (2014).
  • Bates et al. (2014b) Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, et al. 2014b. lme4: Linear mixed-effects models using Eigen and S4. R package version 1, 7 (2014), 1–23.
  • Boos and Hughes-Oliver (2000) Dennis D Boos and Jacqueline M Hughes-Oliver. 2000. How large does have to be for and intervals? The American Statistician 54, 2 (2000), 121–128.
  • Bottou (2010) Léon Bottou. 2010.

    Large-scale machine learning with stochastic gradient descent.

    In Proceedings of COMPSTAT’2010. Springer, 177–186.
  • Brown and Wolfe (1983) Morton B Brown and Robert A Wolfe. 1983. Estimation of the variance of percentile estimates. Computational Statistics & Data Analysis 1 (1983), 167–174.
  • Budylin et al. (2018) Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 55–63.
  • Carpenter et al. (2016) Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2016. Stan: A probabilistic programming language. Journal of Statistical Software 20 (2016), 1–37.
  • Casella and Berger (2002) George Casella and Roger L Berger. 2002. Statistical Inference, Second Edition. Duxbury Press: Pacific Grove, CA.
  • Chaiken et al. (2008) Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1 (2008), 1265–1276.
  • Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20 (1995), 273–297.
  • Davidian et al. (2005) M. Davidian, A.A. Tsiatis, and S. Leon. 2005. Semiparametric Estimation of Treatment Effect in a Pretest-Posttest Study with Missing Data. Statist. Sci. 20 (2005), 295–301. Issue 3.
  • Deng et al. (2017) A. Deng, J. Lu, and J. Litz. 2017. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 641–649.
  • Deng and Shi (2016) A. Deng and X. Shi. 2016. Data-driven metric development for online controlled experiments: Seven lessons learned. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • Deng et al. (2013) Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the 6th ACM WSDM Conference. 123–132.
  • Dmitriev et al. (2017) Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). ACM, New York, NY, USA, 1427–1436.
  • Dmitriev and Wu (2016) Pavel Dmitriev and Xian Wu. 2016. Measuring Metrics. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. ACM, 429–437.
  • Donner (1987) Allan Donner. 1987. Statistical methodology for paired cluster designs. American Journal of Epidemiology 126, 5 (1987), 972–979.
  • Eckles et al. (2017) Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference 5, 1 (2017).
  • Fan et al. (2014) Jianqing Fan, Fang Han, and Han Liu. 2014. Challenges of big data analysis. National Science Review 1 (2014), 293–314.
  • Fieller (1940) Edgar C Fieller. 1940. The biological standardization of insulin. Supplement to the Journal of the Royal Statistical Society 7, 1 (1940), 1–64.
  • Fieller (1954) Edgar C Fieller. 1954. Some problems in interval estimation. Journal of the Royal Statistical Society. Series B (Methodological) (1954), 175–185.
  • Fisher (1922) Ronald Aylmer Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222 (1922), 309–368.
  • Forero et al. (2010) Pedro A Forero, Alfonso Cano, and Georgios B Giannakis. 2010. Consensus-based distributed support vector machines. Journal of Machine Learning Research 11, May (2010), 1663–1707.
  • Gelman and Hill (2006) Andrew Gelman and Jennifer Hill. 2006. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
  • Gui et al. (2015) Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.
  • Guo and Deng (2015) Yu Guo and Alex Deng. 2015. Flexible Online Repeated Measures Experiment. arXiv preprint arXiv:1501.00450 (2015).
  • Hall (2013) Peter Hall. 2013. The bootstrap and Edgeworth expansion. Springer Science & Business Media.
  • Hirschberg and Lye (2010) Joe Hirschberg and Jenny Lye. 2010. A geometric comparison of the delta and Fieller confidence intervals. The American Statistician 64 (2010), 234–241.
  • Jordan et al. (2018) Michael I Jordan, Jason D Lee, and Yun Yang. 2018. Communication-efficient distributed statistical inference. J. Amer. Statist. Assoc. in press (2018). doi:10.1080/01621459.2018.1429274.
  • Kharitonov et al. (2017) Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning sensitive combinations of a/b test metrics. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 651–659.
  • Klar and Donner (2001) Neil Klar and Allan Donner. 2001. Current and future challenges in the design and analysis of cluster randomization trials. Statistics in medicine 20, 24 (2001), 3729–3740.
  • Kleiner et al. (2014) Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. 2014. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 4 (2014), 795–816.
  • Kohavi et al. (2009a) Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. 2009a. Online experimentation at Microsoft. In Proceedings of the Third International Workshop on Data Mining Case Studies, held at the 5th ACM SIGKDD Conference. 11–23.
  • Kohavi et al. (2013) Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online Controlled Experiments at Large Scale. Proceedings of the 19th ACM SIGKDD Conference (2013).
  • Kohavi et al. (2007) Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: listen to your customers not to the hippo. In Proceedings of the 13th ACM SIGKDD Conference. 959–967.
  • Kohavi et al. (2009b) Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009b. Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery 18, 1 (2009), 140–181.
  • Kohavi et al. (2010) R. Kohavi, R. Longbotham, and T. Walker. 2010. Online Experiments: Practical Lessons. Computer 43, 9 (Sept 2010), 82–85.
  • Krewski (1976) Daniel Krewski. 1976. Distribution-free confidence intervals for quantile intervals. J. Amer. Statist. Assoc. 71, 354 (1976), 420–422.
  • Liang and Zeger (1986) Kung-Yee Liang and Scott L Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 1 (1986), 13–22.
  • Meyer (1987) John S Meyer. 1987. Outer and inner confidence intervals for finite population quantile intervals. J. Amer. Statist. Assoc. 82, 397 (1987), 201–204.
  • Rudin et al. (1964) Walter Rudin et al. 1964. Principles of mathematical analysis. Vol. 3. McGraw-hill New York.
  • Tang et al. (2010) Diane Tang, Ashish Agarwal, Deirdre O’Brien, and Mike Meyer. 2010. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. Proceedings of the 16th ACM SIGKDD Conference (2010).
  • Van der Vaart (2000) Aad W Van der Vaart. 2000. Asymptotic statistics. Vol. 3. Cambridge university press.
  • Von Luxburg and Franz (2009) Ulrike Von Luxburg and Volker H Franz. 2009. A geometric approach to confidence sets for ratios: Fieller’s theorem, generalizations and bootstrap. Statistica Sinica (2009), 1095–1117.
  • Wang and Zhou (2012) Dongli Wang and Yan Zhou. 2012. Distributed support vector machines: An overview. In Control and Decision Conference (CCDC), 2012 24th Chinese. IEEE, 3897–3901.
  • Wasserman (2003) Larry Wasserman. 2003. All of Statistics: A Concise Course in Statistical Inference. Springer.
  • Xie and Aurisset (2016) Huizhi Xie and Juliette Aurisset. 2016. Improving the sensitivity of online controlled experiments: Case studies at netflix. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 645–654.
  • Xu et al. (2015) Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2227–2236.
  • Zaharia et al. (2016) Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
  • Zinkevich et al. (2010) Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems. 2595–2603.