Designing Experiments to Measure Incrementality on Facebook

06/07/2018 ∙ by C. H. Bryan Liu, et al.

The importance of Facebook advertising has risen dramatically in recent years, with the platform accounting for almost 20% of global online ad spend in 2017. An important consideration in advertising is incrementality: how much of the change in an experimental metric an advertising campaign is responsible for. To measure incrementality, Facebook provide lift studies. As Facebook lift studies differ from standard A/B tests, the online experimentation literature does not describe how to calculate parameters such as power and minimum sample size. Facebook also offer multi-cell lift tests, which can be used to compare campaigns that don't have statistically identical audiences. In this case, there is no literature describing how to measure the significance of the difference in incrementality between cells, or how to estimate the power or minimum sample size. We fill these gaps in the literature by providing the statistical power and required sample size calculation for Facebook lift studies. We then generalise the statistical significance, power, and required sample size calculation to multi-cell lift studies. We represent our results theoretically in terms of the distributions of test metrics and in practical terms relating to the metrics used by practitioners, making all of our code publicly available.




1. Introduction

In 2017, advertisers spent $204bn online (Zenith, 2018), with a large share ($40bn) spent targeting Facebook’s 2.13bn monthly active users (Inc., 2018a). To maximise their return on investment, advertisers continuously test and optimise their campaigns. It is increasingly common to use controlled experiments to maximise the incrementality of an advertising campaign. In the most common variant — known as A/B, or split testing — the target population is divided into two groups, a test group, where members are shown adverts, and a control group, where members are not shown adverts. The difference in a metric of interest (e.g. total sales or number of app installs) between the test group and control group is the incrementality of the campaign. Facebook offers advertisers the opportunity to measure the incrementality of their campaigns via lift studies.

Despite the importance of Facebook advertising, there is a lack of literature or documentation describing how to design experiments. The deficiencies are summarised in Table 1. (Footnote: On their experimentation website (Inc., 2018b), Facebook state that “To build a study with more rigorous calculations, or for more information on Conversion or Brand Lift, please reach out to your Facebook Account Representative.”) We address this issue by first describing how Facebook calculate incrementality and using this to derive measures of statistical significance, the test power and the minimum sample size for Facebook lift studies.

Existing literature on          Lift studies            Multi-cell lift studies
Test statistic                  (Gordon et al., 2017)   none
Statistical significance        (Gordon et al., 2017)   none
Power / Required sample size    none                    none

Table 1. Existing literature on calculating the test statistic (lift/incrementality), its statistical significance, test power, and the required sample size for Facebook lift studies and multi-cell lift studies. The only literature available is the white paper by Gordon et al. (Gordon et al., 2017).

A Facebook lift study is similar to an A/B test with two important differences. Firstly, the control group is scaled so that the size of the test and control groups are the same. This changes the variance of the metric of interest in the control group. (Footnote: If the control group is scaled up, the variance increases. Likewise, the variance decreases if the control group is scaled down.) Secondly, not everyone in the test group is shown an advert. This happens because the advertiser can lose every bid for a particular user or, when a bid is won, the advert can appear off the screen. Members of the test group who are shown the advert at least once during the test period are referred to as the reached audience, and those who have not seen the advert during the test period are referred to as the unreached audience. The activity of the unreached audience introduces variance that is not present in a standard A/B test, which must be factored in when calculating the power and required sample size.

Facebook has a mechanism that takes the scaled control group and the unreached audience into account when reporting on the incrementality and its associated statistical significance (Gordon et al., 2017) (see Section 2), but they do not cover the statistical power or required sample size. We introduce these calculations in this paper.

Facebook also support multi-cell lift studies, where the target population is split into multiple cells each with a control and test group of their own, as illustrated in Figure 1. These can be used to compare two marketing strategies where the target audience exhibits a selection bias (Liu and Chamberlain, 2018). An example is comparing campaigns that vary the bid size based on customer lifecycle, which result in a different user composition between the cells. In this case we are interested in measuring the difference between incrementalities attained by the campaigns.

While Facebook reports the incrementality of each individual cell in a multi-cell lift study, they do not report if the incrementality difference is statistically significant, nor advise on the statistical power or sample size required to design the experiment. A common pitfall is to apply the standard sample size calculation for a lift study to a multi-cell lift study. As there are more test/control groups in a multi-cell experiment, the variance of the test metric will be larger, even when the groups have the same size. Furthermore, changes in marketing strategies are likely to lead to changes in audience composition meaning that test group metrics from multiple cells are not directly comparable via standard t-tests. Permutation tests are also not possible in this setting as Facebook do not provide data regarding the control-test split.

Figure 1. A Facebook multi-cell lift study. The population (100 boxes), is randomly divided into multiple cells. Different campaigns with differing test-control splits can be run in each cell.

We resolve these problems by introducing a framework to calculate the power and minimum sample size for lift studies and multi-cell lift studies on Facebook. Our framework takes into account control group scaling and the effect of the unreached audience. We present our calculations both theoretically and in practical terms. Our theoretical results relate to the distribution of the test metrics, while in practical terms, we present results in the metrics used by advertising practitioners (e.g. lift or proportion of reached audience).

To summarise, our contributions are:

  1. We derive the statistical power and required sample size for Facebook lift studies, bridging the gap between the online controlled experimental literature and the reality of measuring incrementality on Facebook.

  2. We generalise the results to multi-cell lift studies, where incrementalities under different strategies are compared against each other.

  3. We make our results useful to advertising practitioners by presenting our statistical power and minimum required sample size calculations in terms of expected lift, reach percentage, and the ratio between test/control groups, as well as making the code used in the paper publicly available.

In the remainder of the paper we derive the distribution of the test metric and hence the test power and minimum sample size required in a Facebook lift study in Section 2. We then generalise the results to multi-cell lift studies in Section 3. Finally, we show a number of empirical results illustrating the correctness of the derived distributions and the difference in the required sample sizes in single-cell/multi-cell lift studies in Section 4.

2. Facebook Lift Studies

We first describe a lift study, concentrating on how Facebook derives the incrementality and lift (relative incrementality) of the metric of interest in Section 2.1. We then base our derivation of the distribution of lift as a test statistic (Section 2.2), as well as calculations on the test power and required sample size (Section 2.3), on their work. We will use conversions, defined as the number of transactions from users in the lift study, as our metric of interest, but our calculations are applicable to other metrics which can be described with a Poisson process. (Footnote: For metrics which cannot be described with a Poisson process, our framework, which supports the use of a simulated distribution generated from arithmetic operations of samples drawn from Poisson distributions, can still be applied by swapping in different base distributions.)

2.1. How does Facebook calculate incrementality and lift?

Facebook manages the test-control splitting and is therefore able to measure the conversions in each group. Facebook reports three results: (1) the number of conversions in the test group $N_T$, (2) the number of conversions in the control group $N_C$ and (3) the number of conversions from the reached audience in the test group $N_{TR}$. The sizes of the test and control groups are also reported, enabling the control group to be scaled to match the total audience of the test group. We base our calculations on the conversions in the control group, scaled so that the audience size matches that in the test group:

$$N_{CS} = \alpha N_C, \tag{1}$$

where $\alpha$ is the ratio of the test to control group sizes.

Figure 2. The Facebook incrementality calculation. $N_T$ and $N_{CS}$ represent the metric attained by the test and scaled control groups respectively. $N_{TR}$ and $N_{CSR}$ represent the contribution by the reached audience in the test and scaled control groups respectively. $N_{TU}$ and $N_{CSU}$ represent the contribution of the unreached audience in the test and scaled control groups respectively.

The conversions in the test and scaled control groups contain contributions from both the reached and unreached audiences

$$N_T = N_{TR} + N_{TU}, \tag{2}$$
$$N_{CS} = N_{CSR} + N_{CSU}, \tag{3}$$

and these are illustrated in Figure 2. Since the conversion rates in both unreached audiences are assumed to be the same

$$N_{TU} = N_{CSU}. \tag{4}$$

Reach is defined as the fraction of people in the test group who saw an advert

$$r = \frac{A_{TR}}{A_T}, \tag{5}$$

where $A_{TR}$ is the size of the reached audience and $A_T$ is the total audience size of the test group. We assume that the reach would be the same in both test and control groups, hence

$$r = \frac{A_{CSR}}{A_T}, \tag{6}$$

where $A_{CSR}$ is the size of the audience who would have been shown an advert in the (scaled) control group, whose total audience size equals $A_T$ after scaling. In the control group the conversion rates are the same in the unreached and reached audiences and so

$$N_{CSR} = r N_{CS}. \tag{7}$$

The incrementality $I$ is the difference in conversions between the test and scaled control groups and originates solely from the reached audiences

$$I = N_T - N_{CS} = N_{TR} - N_{CSR}. \tag{8}$$

The test statistic is lift ($l$), defined as incrementality divided by the number of reached conversions in the scaled control

$$l = \frac{I}{N_{CSR}} = \frac{N_T - N_{CS}}{N_{CSR}}, \tag{9}$$

which can be calculated in terms of $N_T$, $N_{TR}$ and $N_{CS}$ as

$$l = \frac{N_T - N_{CS}}{N_{TR} - (N_T - N_{CS})}. \tag{10}$$

Facebook’s Null Hypothesis Significance Test determines if there is a non-zero lift at 90% confidence level (two-tailed). In our calculations, we focus on the alternate hypothesis that a campaign is incremental at 5% significance level (one-tailed). (Footnote: While the calculations around test power and required sample size are nearly identical in both formulations, we are assuming an advert will not have a negative incrementality. This is most often the case when we run controlled experiments to measure an advert’s incrementality.) Formally

$$H_0: l = 0, \qquad H_1: l > 0, \tag{11}$$

where $H_0$ is the null and $H_1$ the alternate hypothesis.

2.2. Derivation of the lift distributions

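To obtain the power and required sample size for a lift study, it is necessary to understand the distributions of the test statistic under the null and alternate hypotheses. Here we derive the distribution of the test statistic $l$, which is not available in the literature. (Footnote: We take $l$ as the relative difference between a Poisson variable and the scalar multiple of a Poisson variable. This rules out the use of the Poisson means test (Krishnamoorthy and Thomson, 2004), which compares two standard Poisson variables with potentially different rates.) We begin by observing that $N_{CSR}$ is defined to be a scalar multiple of $N_{CS}$ by Equation (7), and hence $l$ can be written as

$$l = \frac{N_T - N_{CS}}{r N_{CS}}, \tag{12}$$

where $r$ is the reach. We assume $N_T$ follows a Poisson distribution with rate $\lambda_T$, and $N_{CS}$ is $N_C$, an independent Poisson random variable with rate $\lambda_C$, scaled by a factor of $\alpha$ (i.e. $N_{CS} = \alpha N_C$ and $N_{CSR} = r \alpha N_C$, by Equations (7) and (1)). The probability mass functions (PMF) of $N_T$ and $N_{CS}$ are then given as:

$$P(N_T = n) = \frac{\lambda_T^{n} e^{-\lambda_T}}{n!}, \qquad n \in \{0, 1, 2, \dots\}, \tag{13}$$
$$P(N_{CS} = m) = P\!\left(N_C = \frac{m}{\alpha}\right) = \frac{\lambda_C^{m/\alpha} e^{-\lambda_C}}{(m/\alpha)!}, \qquad m \in \{0, \alpha, 2\alpha, \dots\}, \tag{14}$$

where Equation (14) is a standard result on transformation of univariate random variables.

The cumulative mass function (CMF) of $l$ is

$$F_l(x) = P(l \le x) = P\!\left(\frac{N_T - N_{CS}}{r N_{CS}} \le x\right) \tag{15}$$
$$\approx P\big(N_T \le (1 + r x)\, N_{CS}\big), \tag{16}$$

where we use approximately equal in the expression as the probability distribution of $l$ is not well defined. (Footnote: $N_{CS}$ can be equal to zero, leading to the quotient having an undefined value with positive probability. In practice, with $\lambda_C$ being sufficiently large (say over 30, achieved by a sufficient number of naturally occurring conversions) we can safely proceed as the probability of $N_{CS}$ equal to zero is negligible ($P(N_{CS} = 0) = e^{-\lambda_C} < 10^{-13}$ for $\lambda_C = 30$, and the probability decreases with increasing $\lambda_C$). Alternatively, we can model $N_{CS}$ as a zero-truncated Poisson distribution, though with all these random variables related to each other by some arithmetic operations, this approach will introduce other complications when deriving the distribution of $l$.) The CMF has the form

$$F_l(x) \approx \sum_{m \in \{0, \alpha, 2\alpha, \dots\}} P(N_{CS} = m) \sum_{n=0}^{\lfloor (1 + r x) m \rfloor} P(N_T = n). \tag{17}$$

The outer summation is difficult to implement as it is defined over $\{0, \alpha, 2\alpha, \dots\}$, and $\alpha$ is unknown a priori. We substitute $m = \alpha k$ so that the outer summation sums over the natural numbers and uses the PMF of $N_C$ instead (see Equation (14)):

$$F_l(x) \approx \sum_{k=0}^{\infty} P(N_C = k) \sum_{n=0}^{\lfloor (1 + r x) \alpha k \rfloor} P(N_T = n) \tag{18}$$
$$= \sum_{k=0}^{\infty} \frac{\lambda_C^{k} e^{-\lambda_C}}{k!} \sum_{n=0}^{\lfloor (1 + r x) \alpha k \rfloor} \frac{\lambda_T^{n} e^{-\lambda_T}}{n!}. \tag{19}$$

The derived distribution can then be used to calculate the critical value of $l$, above which $H_0$ should be rejected. The critical value is necessary for calculating the power and required sample size.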
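A double sum of this kind can be evaluated numerically by letting the inner sum be a Poisson CDF and truncating the outer sum to the central mass of the control distribution. The sketch below is one possible implementation, not the authors' published code: the function name `lift_cdf`, the parameter names (`lam_t` and `lam_c` for the Poisson rates of the test and control conversions, `r` for reach, `alpha` for the test-to-control size ratio) and the truncation tolerance `tail` are assumptions made for the example.

```python
import numpy as np
from scipy.stats import poisson


def lift_cdf(x, lam_t, lam_c, r, alpha, tail=1e-12):
    """Approximate P(lift <= x) for a single-cell lift study.

    The outer sum runs over control counts k; the inner sum over test
    counts is the Poisson CDF evaluated at floor((1 + r x) * alpha * k).
    """
    # Truncate the outer sum to where the control Poisson has its mass.
    # k = 0 is skipped: the lift is undefined there and, for the rates
    # assumed here (lam_c well above 30), its probability is negligible.
    k_lo = max(int(poisson.ppf(tail, lam_c)), 1)
    k_hi = int(poisson.isf(tail, lam_c))
    k = np.arange(k_lo, k_hi + 1)
    # lift <= x  <=>  N_T <= (1 + r x) * alpha * N_C; the small epsilon
    # guards against floating-point round-off at exact integer limits.
    upper = np.floor((1.0 + r * x) * alpha * k + 1e-9)
    return float(np.sum(poisson.pmf(k, lam_c) * poisson.cdf(upper, lam_t)))


# Under the null hypothesis (lam_t = alpha * lam_c) about half of the
# probability mass lies at or below a lift of zero.
print(lift_cdf(0.0, lam_t=1000.0, lam_c=1000.0, r=1.0, alpha=1.0))
```

Truncating the outer sum keeps the cost proportional to the width of the control distribution rather than to its mean, which matters when the expected number of conversions is large.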

2.3. Power and Minimum Sample Size Calculation

A prerequisite of any A/B test is a calculation of the expected test power and the minimum sample size to achieve an acceptable test power (typically taken to be 0.8). While we have derived the necessary CMF to calculate power and sample size, we also explore the possibility of simulating the distribution of $l$ using a large number of samples. We show in Section 4.1 that the derived and simulated distributions are equivalent, and there are computational advantages to using the simulation approach. The simulation is also applicable if we assume the variables used in this section follow other distributions.

2.3.1. Power

Test power is the probability that the test will correctly reject the null hypothesis $H_0$ when the alternate hypothesis $H_1$ is true (the complement of the Type II error rate). For Facebook lift studies, test power is dependent on the minimum detectable lift $l$, the number of expected conversions in the control group $\lambda_C$, the scaling factor relating the size of the test group to the control group $\alpha$, and the reach $r$, which depends on many variables, in particular ad spend.

To calculate the test power we require the distribution for $l$. This can be done by using Equation (19). Alternatively, we can obtain an empirical distribution for $l$ by 1) treating $N_T$ and $N_C$ as Poisson random variables with means $\lambda_T$ and $\lambda_C$ respectively, 2) drawing samples from $N_T$ and $N_C$, and using Equations (7) and (1) to scale the latter to obtain samples for $N_{CS}$ and $N_{CSR}$, and 3) using Equation (9) to obtain samples for $l$.

We calculate the means $\lambda_T$ and $\lambda_C$ by expressing them in terms of the expected control conversions, the reach $r$, the scaling factor $\alpha$ and the expected lift $l$. We can approximate $\lambda_C$ with the number of control conversions $N_C$ expected for the study,

$$\lambda_C \approx N_C, \tag{20}$$

and are then able to calculate $\lambda_T$ as

$$\lambda_T = \alpha \lambda_C (1 + r l), \tag{21}$$

by rearranging Equation (12) and noting the scaling relationship between $N_{CS}$ and $N_C$ using Equations (7) and (1).

The procedure for calculating the test power is two-fold and is illustrated in Figure 4(a). First, the distribution of $l$ is calculated under $H_0$, in which $l = 0$ (i.e. $\lambda_T = \alpha \lambda_C$). Estimates for $\lambda_C$ and $\alpha$ can be taken from previous Facebook advertising results. For a one-tailed test at the 5% significance level the critical value $l_{\text{crit}}$ is calculated as the 95th percentile of this distribution:

$$P(l \ge l_{\text{crit}} \mid H_0) = 0.05. \tag{22}$$

Second, the distribution of $l$ is calculated under a specific $H_1$, in which $\lambda_T$ is as defined in Equation (21). Since the test power is strongly coupled to the reach $r$ (see Figure 4), it is important to have a reasonable estimate. Estimates for $r$ can be taken from previous Facebook advertising results. If no previous studies are available, we can estimate $r$ from a lightweight pre-study, or related studies in the literature. The test power can then be calculated as the percentage of this distribution above $l_{\text{crit}}$:

$$\text{power} = P(l > l_{\text{crit}} \mid H_1). \tag{23}$$
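The two-step procedure above lends itself to a short simulation. The following is a minimal sketch, assuming Poisson-distributed conversions as in the text; the function names, argument names, sample count and seed are illustrative choices, not taken from the authors' code. The null samples give the critical value as their 95th percentile, and the power is the share of alternative samples above it.

```python
import numpy as np


def simulate_lift(lam_c, lift, r, alpha, n_samples, rng):
    """Draw lift samples for a given expected number of control conversions."""
    lam_t = alpha * lam_c * (1.0 + r * lift)        # test mean for this lift
    n_t = rng.poisson(lam_t, n_samples)             # test conversions
    n_cs = alpha * rng.poisson(lam_c, n_samples)    # scaled control conversions
    n_csr = r * n_cs                                # reached part of scaled control
    # Assumes lam_c is large enough that a zero control count is negligible.
    return (n_t - n_cs) / n_csr


def power(lam_c, min_lift, r=1.0, alpha=1.0, n_samples=200_000, seed=0):
    """One-tailed power at the 5% level for detecting `min_lift`."""
    rng = np.random.default_rng(seed)
    null = simulate_lift(lam_c, 0.0, r, alpha, n_samples, rng)
    alt = simulate_lift(lam_c, min_lift, r, alpha, n_samples, rng)
    l_crit = np.quantile(null, 0.95)                # critical value under H0
    return float(np.mean(alt > l_crit))


p_small = power(lam_c=5_107, min_lift=0.05)
p_large = power(lam_c=20_000, min_lift=0.05)
print(f"power with 5,107 control conversions:  {p_small:.3f}")
print(f"power with 20,000 control conversions: {p_large:.3f}")
```

Because the same code path serves both hypotheses, swapping the Poisson draws for another base distribution only requires changing `simulate_lift`.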


2.3.2. Minimum sample size

The minimum sample size required to give a specified test power (commonly 80%) can be obtained from the power simulation by solving for the minimum expected number of control conversions $\lambda_C$ that will give a power greater than the specified value, using the bisection method (Burden and Faires, 1985). The minimum sample sizes to observe lifts of 1%, 2%, 5% and 10% are shown in Table 2.

               Single-cell                             Multi-cell
Effect size    Control conversions   Total audience    Control conversions (cell A)   Total audience
10%            1,352                 54,068            2,745                          219,596
5%             5,107                 204,271           10,754                         860,346
2%             31,571                1,262,848         67,453                         5,396,260
1%             124,459               4,978,355         264,745                        21,179,569

Table 2. Minimum number of conversions in the control group and total audience size required to achieve a power of 80%. For the multi-cell calculation the lift in cell A was taken to be 5%. To calculate the total audience size, we divide the number of conversions by the conversion rate (defined as the number of conversions divided by the total number of users, and assumed to be 5%), and multiply the result by the number of groups (two for single-cell, and four for two-cell lift studies).
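The bisection search over the number of control conversions can be sketched as follows. This is an illustration under assumptions, not the published implementation: the function names, the bracket `[lo, hi]`, the tolerance and the fixed random seed are choices made for the example. Reusing one seed for every power evaluation keeps the simulated power close to monotone in the sample size, which bisection relies on.

```python
import numpy as np


def power(lam_c, lift, r=1.0, alpha=1.0, n=200_000, seed=0):
    """Simulated one-tailed power at the 5% level (Poisson conversions)."""
    rng = np.random.default_rng(seed)

    def draw(l):
        n_t = rng.poisson(alpha * lam_c * (1.0 + r * l), n)
        n_cs = alpha * rng.poisson(lam_c, n)
        return (n_t - n_cs) / (r * n_cs)

    l_crit = np.quantile(draw(0.0), 0.95)
    return float(np.mean(draw(lift) > l_crit))


def min_control_conversions(lift, target=0.8, lo=50.0, hi=1e7, tol=1.0, **kw):
    """Bisect for the smallest control conversion count reaching `target` power."""
    if power(hi, lift, **kw) < target:
        raise ValueError("upper bracket does not reach the target power")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if power(mid, lift, **kw) >= target:
            hi = mid        # target met: search smaller sample sizes
        else:
            lo = mid        # target missed: search larger sample sizes
    return hi


def total_audience(conversions, conversion_rate=0.05, n_groups=2):
    """Convert control conversions into a total audience size."""
    return conversions / conversion_rate * n_groups


m_5 = min_control_conversions(0.05)
print(f"control conversions for a 5% lift: {m_5:.0f}")
print(f"total audience: {total_audience(m_5):.0f}")
```

Since the power is estimated by simulation, the result varies slightly with the seed and the number of samples, so it should land near, rather than exactly on, the corresponding table entry.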

3. Multi-cell lift studies

Multi-cell lift studies can be used to compare the incrementalities of multiple marketing strategies with potentially statistically different audiences. Here we consider the case of two cells, $A$ and $B$. To maximise the test power, we assume the cells are of the same size, with the same test-control split proportions. A common pitfall in multi-cell studies is to use the test power and minimum sample size derived in Section 2. As multi-cell studies have more test/control groups, the variance of the test statistic, which involves arithmetic operations on all groups, will increase even if the variance within each group stays the same. In Section 4.2 we demonstrate this and develop the mechanism for correctly calculating test parameters.

In a multi-cell lift study, Equations (9) and (10) still hold for individual cells:

$$l_A = \frac{N_{T,A} - N_{CS,A}}{N_{CSR,A}}, \qquad l_B = \frac{N_{T,B} - N_{CS,B}}{N_{CSR,B}},$$

where the additional subscripts $A$ and $B$ indicate the cells. Facebook provide advertisers with $N_{T,A}$, $N_{TR,A}$, $N_{C,A}$, $N_{T,B}$, $N_{TR,B}$ and $N_{C,B}$, together with the group sizes, and so $l_A$ and $l_B$ can be computed as

$$l_A = \frac{N_{T,A} - N_{CS,A}}{N_{TR,A} - (N_{T,A} - N_{CS,A})}, \qquad l_B = \frac{N_{T,B} - N_{CS,B}}{N_{TR,B} - (N_{T,B} - N_{CS,B})},$$

with $N_{CS,A} = \alpha N_{C,A}$ and $N_{CS,B} = \alpha N_{C,B}$ by Equation (1).

Test Statistic

We define the test statistic $\Delta$ as the absolute (as opposed to relative) difference between the lifts in cells $A$ and $B$:

$$\Delta = l_B - l_A,$$

which is directly comparable with the lift in a single-cell study. (Footnote: If we define the test statistic as the relative difference, the effect size between cells will be a percentage of the effect size achieved in the single-cell case. To illustrate, a 1% relative difference in lifts means we are comparing a 5% lift in cell A and a 5.05% lift in cell B. To detect such a difference with 80% power we require around 106M conversions in the control group of cell A (one out of four groups in a two-cell lift study), a number which even the largest companies struggle to meet for experimentation purposes.) The null and alternative hypotheses are defined to be

$$H_0: \Delta = 0, \qquad H_1: \Delta > 0.$$

While the distributions for $l_A$ and $l_B$ can be characterised by their CMF, it is difficult to obtain the PMF of these distributions. Accordingly, the distribution of $\Delta$ (e.g. its CMF or PMF) can not be readily evaluated using a convolution. We believe that deriving an analytical form for the distribution of $\Delta$ is of little practical use for test power and sample size calculation as there are other simpler alternatives such as simulating the distribution.

Under $H_0$ the distribution of $\Delta$ is defined by the Poisson rates of the test and control conversions in each cell, together with the scaling factor $\alpha$ and the reach $r$. It is reasonable to assume that $\alpha$ and $r$ are the same for both cells. In general, the audiences are not statistically identical in cells $A$ and $B$, so that equal control conversion rates ($\lambda_{C,A} = \lambda_{C,B}$) can not be assumed. However, if the strategy in $B$ has not previously been tested, there is no good way of estimating $\lambda_{C,B}$ and so we assume $\lambda_{C,A} = \lambda_{C,B}$ here.

Statistical Significance & Critical Value

As Facebook do not report the difference in lifts between cells (or its significance) in multi-cell studies, advertisers are free to choose the significance level that suits their needs. We use a one-tailed test at 5% for the calculations shown in Section 4.2 to be consistent with Section 2.

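The critical value $\Delta_{\text{crit}}$ is defined to satisfy the following equation:

$$P(\Delta \ge \Delta_{\text{crit}} \mid H_0) = 0.05.$$

This can be obtained by finding the 95th percentile of the samples simulating the distribution of $\Delta$ under $H_0$.

Under $H_1$ we define a minimum detectable difference $\delta$ such that

$$l_B = l_A + \delta,$$

and calculate the test power by the following equation:

$$\text{power} = P(\Delta > \Delta_{\text{crit}} \mid H_1).$$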

Minimum sample size

The minimum sample sizes required to be able to observe a difference in lift $\Delta$ with a power of 80% were calculated as described in Section 2.3.2. The equivalent numbers of conversions in the cell $A$ control group and total audience sizes are shown in Table 2.
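The multi-cell power calculation can be simulated in the same way as the single-cell one, with the difference in lift between two cells as the test statistic. The sketch below is an illustration under the assumptions stated in this section (equal cell sizes, the same reach and scaling factor in both cells, equal expected control conversions); the function and parameter names are hypothetical, and `lift_a` is the lift assumed in cell A under both hypotheses.

```python
import numpy as np


def cell_lift(lam_c, lift, r, alpha, n, rng):
    """Lift samples for one cell with `lam_c` expected control conversions."""
    n_t = rng.poisson(alpha * lam_c * (1.0 + r * lift), n)
    n_cs = alpha * rng.poisson(lam_c, n)
    return (n_t - n_cs) / (r * n_cs)


def multicell_power(lam_c, lift_a, min_diff, r=1.0, alpha=1.0,
                    n=200_000, seed=0):
    """One-tailed power at the 5% level for a lift difference of `min_diff`."""
    rng = np.random.default_rng(seed)
    # Null hypothesis: both cells share the same lift.
    null = (cell_lift(lam_c, lift_a, r, alpha, n, rng)
            - cell_lift(lam_c, lift_a, r, alpha, n, rng))
    d_crit = np.quantile(null, 0.95)
    # Alternative: cell B exceeds cell A by the minimum detectable difference.
    alt = (cell_lift(lam_c, lift_a + min_diff, r, alpha, n, rng)
           - cell_lift(lam_c, lift_a, r, alpha, n, rng))
    return float(np.mean(alt > d_crit))


p_multi = multicell_power(lam_c=10_754, lift_a=0.05, min_diff=0.05)
p_single_size = multicell_power(lam_c=5_107, lift_a=0.05, min_diff=0.05)
print(f"power with 10,754 conversions per control group: {p_multi:.3f}")
print(f"power with 5,107 conversions per control group:  {p_single_size:.3f}")
```

The second call reuses the single-cell sample size for a 5% effect and shows how far short of 80% power it falls once the statistic combines four groups instead of two.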

4. Evaluation

In this section, empirical results on the distribution of the test statistic in single-cell lift studies and the calculated power and sample size in both single-cell and multi-cell lift studies are provided. In Section 4.1 we show the correctness of our simulation of $l$ by comparing it to the analytical form in Equation (19). Finally, in Section 4.2, we calculate the test power and required sample size for a range of minimum detectable effects, for both single-cell and multi-cell lift studies.

4.1. Comparing the derived and simulated distribution of $l$

Figure 3. Comparison between the CMF of the lift derived in Section 2.2 (blue line) and the cumulative histogram of 1,000 samples drawn from the generative process in Section 2.3 (orange bars). Panels (a) to (d) each use a different parameter setting. Over a large range of the parameters $\lambda_T$, $\lambda_C$, $r$, and $\alpha$, the two methods produce largely identical distributions.

We first confirm that our simulation of $l$ matches the derived distribution (specified in Equation (19)) by running a number of Kolmogorov-Smirnov (K-S) tests (Smirnov, 1944; Daniel et al., 1978). Agreement between the two indicates that the simulated distribution can be safely used as an alternative for the purpose of power and required sample size calculation.

For each run we 1) randomly specify the four parameters required by both methods: $\lambda_T$, $\lambda_C$, the reach $r$, and the scaling factor $\alpha$, 2) generate a number of samples from the simulated distribution, 3) compute the K-S statistic w.r.t. the derived distribution, and 4) evaluate whether there is sufficient statistical significance to reject the null hypothesis that the two distributions are the same. Steps 3) and 4) are mostly handled by the kstest function in scipy.

We had 500 test runs (four are shown in Figure 3), and 28 of them have a K-S statistic that results in rejecting the null hypothesis at a 5% significance level. Taking into account that we are running multiple comparisons and hence should expect around 25 rejections given the two distributions are the same, we are satisfied that the derived and simulated distributions are statistically equivalent.
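A single run of this check might look like the sketch below, which passes a numerical version of the derived CMF to `scipy.stats.kstest` as a callable. The parameter values, sample count, seed and helper names are assumptions made for the example. Note that the K-S test is designed for continuous distributions, whereas the lift takes discrete (though very finely spaced) values, so the p-value here is best read as indicative.

```python
import numpy as np
from scipy.stats import kstest, poisson


def lift_cdf(x, lam_t, lam_c, r, alpha, tail=1e-12):
    """Vectorised numerical CMF of the lift (outer sum truncated to the
    central mass of the control Poisson; k = 0 is skipped as negligible)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(max(int(poisson.ppf(tail, lam_c)), 1),
                  int(poisson.isf(tail, lam_c)) + 1)
    # One row per evaluation point, one column per control count.
    upper = np.floor((1.0 + r * x[:, None]) * alpha * k[None, :] + 1e-9)
    return (poisson.pmf(k, lam_c)[None, :] * poisson.cdf(upper, lam_t)).sum(axis=1)


def simulate_lift(lam_t, lam_c, r, alpha, n, rng):
    """Draw lift samples from the generative process."""
    n_t = rng.poisson(lam_t, n)
    n_cs = alpha * rng.poisson(lam_c, n)
    return (n_t - n_cs) / (r * n_cs)


rng = np.random.default_rng(0)
lam_t, lam_c, r, alpha = 1050.0, 500.0, 0.8, 2.0   # illustrative parameters
samples = simulate_lift(lam_t, lam_c, r, alpha, 1000, rng)
result = kstest(samples, lambda x: lift_cdf(x, lam_t, lam_c, r, alpha))
print(f"K-S statistic = {result.statistic:.4f}, p-value = {result.pvalue:.3f}")
```

Repeating this with randomly drawn parameters and counting the rejections at the 5% level gives the kind of tally reported above.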

It is more than 30 times quicker to obtain the 95th percentile of the distribution of $l$ (i.e. the critical value) using the simulated distribution than the derived distribution. This was measured by comparing the time taken to:

  • (Simulated distribution) Find the value of the 95th percentile in the 10M samples simulating the distribution, versus

  • (Derived distribution) Find the root of the function $F_l(x) - 0.95$ under the same parameters, using the root-finding algorithm proposed by Brent (Brent, 2013).

This suggests it is more effective for an advertiser to obtain the test power using the simulated distribution for the single-cell case.

4.2. Comparison of single-cell and multi-cell test power and minimum sample size

Figure 4. Simulations for single-cell (a-d) and multi-cell (e-f) lift studies. a) Distributions of $l$ under $H_0$ and $H_1$ for 20,000 conversions in the control group, true lift of 5%, reach of 100% and a 50:50 control-test split. $l_{\text{crit}}$ marks the critical value for a one-tailed test at the 5% significance level. b) Test power against the number of control conversions for different minimum detectable lifts. c) Test power against reach percentage, holding the total audience size constant. d) Test power against the fraction of audience in the control group, holding the total audience size constant. e) Distributions of the difference in lift $\Delta$ between two cells under $H_0$ and $H_1$ where the true difference is 5%. f) Test power against the number of conversions in the control group for different minimum detectable relative differences in lift.

Finally, we visualise our power and required sample size calculations, recording the number of conversions (and thus users) required to detect certain effects in both single-cell and multi-cell lift studies.

Figures 4a & e show the power calculation for the single and multi-cell cases respectively. To be comparable, the total audience size is held fixed across the two cases. The power in the multi-cell case of 78% is meaningfully lower than the 100% power achieved in the single-cell case. Figures 4b, c & d show the variation of single-cell test power with audience size, reach and control-test split respectively. For a given audience size the maximum power can be obtained with a reach of 100% and a 50:50 split between the test and control groups (where $\alpha = 1$). Figure 4f is the multi-cell equivalent of Figure 4b. Comparing these figures shows that for the same number of conversions per control group, the power achieved is less in the multi-cell case. Furthermore, this effect is larger for smaller effect sizes.

Table 2 shows that to achieve a test power of 80% over twice as many conversions are needed per control group in the multi-cell than in the single-cell case. Since our multi-cell scenario has two cells, the total audience size needed in the multi-cell is over four times that of the single-cell case.
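The "over twice" and "over four times" statements can be checked directly against the tabulated minimum sample sizes. The short script below only re-uses the numbers reported in the table of minimum sample sizes; the variable names are illustrative.

```python
# Effect size -> (control conversions, total audience), copied from the table
# of minimum sample sizes for 80% power.
single_cell = {
    "10%": (1_352, 54_068),
    "5%": (5_107, 204_271),
    "2%": (31_571, 1_262_848),
    "1%": (124_459, 4_978_355),
}
multi_cell = {
    "10%": (2_745, 219_596),
    "5%": (10_754, 860_346),
    "2%": (67_453, 5_396_260),
    "1%": (264_745, 21_179_569),
}

# Ratio of multi-cell to single-cell requirements for each effect size.
ratios = {
    effect: (multi_cell[effect][0] / single_cell[effect][0],
             multi_cell[effect][1] / single_cell[effect][1])
    for effect in single_cell
}

for effect, (conv_ratio, audience_ratio) in ratios.items():
    print(f"{effect:>3}: conversions x{conv_ratio:.2f}, "
          f"audience x{audience_ratio:.2f}")
```

The conversion ratios come out slightly above 2 and the audience ratios slightly above 4, the latter being the former doubled by the move from two groups to four.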

5. Conclusion

We have described how to design experiments to measure the incrementality of advertising campaigns on Facebook, bridging the gap between the general literature in online controlled experiments and industrial practices. We provided the statistical power and required sample size calculation for Facebook lift studies, and generalised the statistical significance, power and required sample size calculation to multi-cell lift studies, which are used by advertisers to compare campaigns or strategies where the target audience can exhibit a selection bias. We make our results useful to practitioners by presenting our calculations in terms of common advertising metrics — expected lift, reach percentage, and ratio between test/control groups — and publishing all of our code.

The authors thank Markus Ojala and Lauri Kovanen for useful discussions and the anonymous reviewers for providing many improvements to the original manuscript.