In 2017, advertisers spent $204bn online (Zenith, 2018), with a large share ($40bn) spent targeting Facebook’s 2.13bn monthly active users (Inc., 2018a). To maximise their return on investment, advertisers continuously test and optimise their campaigns. It is increasingly common to use controlled experiments to maximise the incrementality of an advertising campaign. In the most common variant — known as A/B, or split testing — the target population is divided into two groups, a test group, where members are shown adverts, and a control group, where members are not shown adverts. The difference in a metric of interest (e.g. total sales or number of app installs) between the test group and control group is the incrementality of the campaign. Facebook offers advertisers the opportunity to measure the incrementality of their campaigns via lift studies.
Despite the importance of Facebook advertising, there is little literature or documentation describing how to design such experiments; the deficiencies are summarised in Table 1. On their experimentation website (Inc., 2018b), Facebook state that "To build a study with more rigorous calculations, or for more information on Conversion or Brand Lift, please reach out to your Facebook Account Representative." We address this issue by first describing how Facebook calculates incrementality, and then using this to derive measures of statistical significance, the test power and the minimum sample size for Facebook lift studies.
| Existing literature on | Lift studies | Multi-cell lift studies |
| --- | --- | --- |
| Test statistic | (Gordon et al., 2017) | ✗ |
| Statistical significance | (Gordon et al., 2017) | ✗ |
| Power / required sample size | ✗ | ✗ |
A Facebook lift study is similar to an A/B test, with two important differences. Firstly, the control group is scaled so that the sizes of the test and control groups are the same. This changes the variance of the metric of interest in the control group: if the control group is scaled up, the variance increases; if it is scaled down, the variance decreases. Secondly, not everyone in the test group is shown an advert. This happens because the advertiser can lose every bid for a particular user, or because, when a bid is won, the advert can be rendered off screen. Members of the test group who are shown the advert at least once during the test period are referred to as the reached audience, and those who have not seen the advert during the test period are referred to as the unreached audience. The activity of the unreached audience introduces variance that is not present in a standard A/B test, which must be factored in when calculating the power and required sample size.
Facebook has a mechanism that takes the scaled control group and the unreached audience into account when reporting on the incrementality and its associated statistical significance (Gordon et al., 2017) (see Section 2), but they do not cover the statistical power or required sample size. We introduce these calculations in this paper.
Facebook also support multi-cell lift studies, where the target population is split into multiple cells each with a control and test group of their own, as illustrated in Figure 1. These can be used to compare two marketing strategies where the target audience exhibits a selection bias (Liu and Chamberlain, 2018). An example is comparing campaigns that vary the bid size based on customer lifecycle, which result in a different user composition between the cells. In this case we are interested in measuring the difference between incrementalities attained by the campaigns.
While Facebook reports the incrementality of each individual cell in a multi-cell lift study, they do not report if the incrementality difference is statistically significant, nor advise on the statistical power or sample size required to design the experiment. A common pitfall is to apply the standard sample size calculation for a lift study to a multi-cell lift study. As there are more test/control groups in a multi-cell experiment, the variance of the test metric will be larger, even when the groups have the same size. Furthermore, changes in marketing strategies are likely to lead to changes in audience composition meaning that test group metrics from multiple cells are not directly comparable via standard t-tests. Permutation tests are also not possible in this setting as Facebook do not provide data regarding the control-test split.
We resolve these problems by introducing a framework to calculate the power and minimum sample size for lift studies and multi-cell lift studies on Facebook. Our framework takes into account control group scaling and the effect of the unreached audience. We present our calculations both theoretically and in practical terms. Our theoretical results relate to the distribution of the test metrics, while in practical terms, we present results in the metrics used by advertising practitioners (e.g. lift or proportion of reached audience).
To summarise, our contributions are:
We derive the statistical power and required sample size for Facebook lift studies, bridging the gap between the online controlled experimental literature and the reality on measuring incrementality on Facebook.
We generalise the results to multi-cell lift studies, where incrementalities under different strategies are compared against each other.
We make our results useful to advertising practitioners by presenting our statistical power and minimum required sample size calculations in terms of expected lift, reach percentage, and the ratio between test/control groups, as well as making the code used in the paper publicly available at https://github.com/liuchbryan/fb_lift_study_design.
In the remainder of the paper we derive the distribution of the test metric and hence the test power and minimum sample size required in a Facebook lift study in Section 2. We then generalise the results to multi-cell lift studies in Section 3. Finally, we show a number of empirical results illustrating the correctness of the derived distributions and the difference in the required sample sizes in single-cell/multi-cell lift studies in Section 4.
2. Facebook Lift Studies
We first describe a lift study, concentrating on how Facebook derives the incrementality and lift (relative incrementality) of the metric of interest in Section 2.1. We then base our derivation of the distribution of lift as a test statistic (Section 2.2), as well as the calculations of the test power and required sample size (Section 2.3), on their work. We will use conversions, defined as the number of transactions from users in the lift study, as our metric of interest, but our calculations are applicable to other metrics which can be described with a Poisson process. For metrics which cannot be described with a Poisson process, our framework, which supports the use of a simulated distribution generated from arithmetic operations on samples drawn from Poisson distributions, can still be applied by swapping in different base distributions.
2.1. How does Facebook calculate incrementality and lift?
Facebook manages the test-control split and is therefore able to measure the conversions in each group. Facebook reports three results: (1) the number of conversions in the test group, $C_t$; (2) the number of conversions in the control group, $C_c$; and (3) the number of conversions from the reached audience in the test group, $C_{t,r}$. The sizes of the test and control groups are also reported, enabling the control group to be scaled to match the total audience of the test group. We base our calculations on the conversions in the scaled control group, which is scaled so that its audience size matches that of the test group:

$C^s_c = s\, C_c, \qquad (1)$

where $s$ is the ratio of the test to control group sizes.
The conversions in the test and scaled control groups contain contributions from both the reached and unreached audiences,

$C_t = C_{t,r} + C_{t,u}, \qquad C^s_c = C^s_{c,r} + C^s_{c,u},$

and these are illustrated in Figure 2. Since the conversion rates in both unreached audiences are assumed to be the same,

$C_{t,u} = C^s_{c,u}.$
Reach, $r$, is defined as the fraction of people in the test group who saw an advert,

$r = \frac{N_{t,r}}{N_t},$

where $N_{t,r}$ is the size of the reached audience and $N_t$ is the total audience size of the test group. We assume that the reach would be the same in both test and control groups, hence

$N_{c,r} = r\, N_c,$

where $N_{c,r}$ is the size of the audience who would have been shown an advert in the control group. In the control group the conversion rates are the same in the unreached and reached audiences, and so

$C^s_{c,r} = r\, C^s_c, \qquad C^s_{c,u} = (1 - r)\, C^s_c. \qquad (7)$
The incrementality, $I$, is the difference in conversions between the test and scaled control groups, and originates solely from the reached audiences:

$I = C_t - C^s_c = C_{t,r} - C^s_{c,r}.$

The test statistic is the lift, $\ell$, defined as the incrementality divided by the number of reached conversions in the scaled control group,

$\ell = \frac{I}{C^s_{c,r}},$

which can be calculated in terms of $C_t$, $C_c$, $s$ and $r$ as

$\ell = \frac{C_t - s\, C_c}{r\, s\, C_c}. \qquad (9)$
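As a concrete illustration, the incrementality and lift can be computed from Facebook's three reported counts in a few lines. This is a minimal sketch; the function and variable names are ours, not part of any Facebook API:

```python
def incrementality_and_lift(conv_test, conv_control, s, r):
    """Incrementality and lift from the reported conversion counts.

    conv_test:    conversions in the test group
    conv_control: conversions in the (unscaled) control group
    s:            ratio of test group size to control group size
    r:            reach, the fraction of the test group shown an advert
    """
    scaled_control = s * conv_control            # scale control to test size
    incrementality = conv_test - scaled_control  # extra conversions from ads
    # divide by the reached portion of the scaled control group
    lift = incrementality / (r * scaled_control)
    return incrementality, lift
```

For example, with 1,100 test conversions, 500 control conversions, a test group twice the size of the control group ($s = 2$) and 50% reach, the incrementality is 100 conversions and the lift is 100 / 500 = 20%.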
Facebook’s null hypothesis significance test determines if there is a non-zero lift at the 90% confidence level (two-tailed). In our calculations, we focus on the alternate hypothesis that a campaign is incremental, at the 5% significance level (one-tailed). While the calculations of test power and required sample size are nearly identical in both formulations, we assume an advert will not have a negative incrementality, which is most often the case when running controlled experiments to measure an advert’s incrementality. Formally,

$H_0: \ell = 0, \qquad H_1: \ell > 0,$

where $H_0$ is the null and $H_1$ the alternate hypothesis.
2.2. Derivation of the lift distributions
To obtain the power and required sample size for a lift study, it is necessary to understand the distributions of the test statistic under the null and alternate hypotheses. Here we derive the distribution of the test statistic $\ell$, which is not available in the literature. (We take $\ell$ as the relative difference between a Poisson variable and a scalar multiple of a Poisson variable. This rules out the use of the Poisson means test (Krishnamoorthy and Thomson, 2004), which compares two standard Poisson variables with potentially different rates.) We begin by observing that $C^s_{c,r}$ is defined to be a scalar multiple of $C^s_c$ by Equation (7), and hence $\ell$ can be written as

$\ell = \frac{C_t - C^s_c}{r\, C^s_c},$

where $r$ is the reach. We assume $C_t$ follows a Poisson distribution with rate $\lambda_t$, and $C^s_c$ is an independent Poisson random variable $C_c$, with rate $\lambda_c$, scaled by a factor of $s$ (i.e. $C^s_c = s\, C_c$, by Equations (7) and (1)). The probability mass functions (PMF) of $C_t$ and $C^s_c$ are then given as:

$p_{C_t}(k) = \frac{\lambda_t^k e^{-\lambda_t}}{k!}, \qquad p_{C^s_c}(k) = p_{C_c}(k/s) = \frac{\lambda_c^{k/s} e^{-\lambda_c}}{(k/s)!} \quad \text{for } k/s \in \mathbb{N}_0, \qquad (14)$

where Equation (14) is a standard result on the transformation of univariate random variables.
The cumulative mass function (CMF) of $\ell$ is

$F_\ell(x) = P(\ell \le x) \approx P\!\left(C_t \le (1 + r x)\, C^s_c\right),$

where we use approximate equality in the expression as the probability distribution of $\ell$ is not well defined: $C^s_c$ can be equal to zero, leading to the quotient having an undefined value with positive probability. In practice, with $\lambda_c$ sufficiently large (say over 30, achieved by a sufficient number of naturally occurring conversions), we can safely proceed as the probability of $C^s_c$ being equal to zero is negligible ($P(C_c = 0) = e^{-\lambda_c} < 10^{-13}$ for $\lambda_c > 30$, and the probability decreases with increasing $\lambda_c$). Alternatively, we can model $C_c$ as a zero-truncated Poisson distribution, though with all these random variables related to each other by arithmetic operations, this approach introduces other complications when deriving the distribution of $\ell$. The CMF has the form

$F_\ell(x) \approx \sum_{m \,:\, m/s \in \mathbb{N}_0} p_{C^s_c}(m) \sum_{k=0}^{\lfloor (1+rx)\, m \rfloor} p_{C_t}(k).$
The outer summation is difficult to implement as it is defined over the multiples of $s$, and the support of $C^s_c$ is unknown a priori. We substitute $m = s\, n$ so that the outer summation sums over the natural numbers and uses the PMF of $C_c$ instead (see Equation (14)):

$F_\ell(x) \approx \sum_{n=0}^{\infty} \frac{\lambda_c^n e^{-\lambda_c}}{n!} \sum_{k=0}^{\lfloor (1+rx)\, s\, n \rfloor} \frac{\lambda_t^k e^{-\lambda_t}}{k!}. \qquad (19)$
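The double summation can be evaluated numerically by truncating the outer sum once the Poisson mass of the control count becomes negligible. A sketch in scipy (our variable names; the exact form of the summation follows our reading of the derivation above):

```python
import numpy as np
from scipy.stats import poisson

def lift_cmf(x, lam_t, lam_c, s, r):
    """Approximate P(lift <= x): outer sum over control counts n ~ Poisson(lam_c),
    inner Poisson CDF over test counts up to floor((1 + r*x) * s * n)."""
    # truncate the outer sum well past the bulk of Poisson(lam_c)
    n = np.arange(int(lam_c + 10 * np.sqrt(lam_c)) + 20)
    inner = poisson.cdf(np.floor((1 + r * x) * s * n), lam_t)
    return float(np.sum(poisson.pmf(n, lam_c) * inner))
```

For example, with $\lambda_t = \lambda_c$ and $s = r = 1$ the CMF at zero is close to one half, as the test and control counts are exchangeable.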
The derived distribution can then be used to calculate the critical value of $\ell$, above which $H_0$ should be rejected. The critical value is necessary for calculating the power and required sample size.
2.3. Power and Minimum Sample Size Calculation
A prerequisite of any A/B test is a calculation of the expected test power and of the minimum sample size needed to achieve an acceptable test power (typically taken to be 0.8). While we have derived the necessary CMF to calculate power and sample size, we also explore the possibility of proceeding by simulating the distribution of $\ell$ using a large number of samples. We show in Section 4.1 that the derived and simulated distributions are equivalent, and that there are computational advantages to using the simulation approach. The simulation approach is also applicable if we assume the variables used in this section follow other distributions.
2.3.1. Test power

Test power is the probability that the test will correctly reject the null hypothesis when the alternate hypothesis is true (the complement of the Type II error rate). For Facebook lift studies, the test power is dependent on the minimum detectable lift, the number of expected conversions in the control group $\lambda_c$, the scaling factor $s$ relating the size of the test group to the control group, and the reach $r$, which depends on many variables, in particular ad spend.
To calculate the test power we require the distribution of $\ell$. This can be obtained using Equation (19). Alternatively, we can obtain an empirical distribution for $\ell$ by 1) treating $C_t$ and $C_c$ as Poisson random variables with means $\lambda_t$ and $\lambda_c$ respectively, 2) drawing samples from $C_t$ and $C_c$, and using Equations (7) and (1) to scale them to obtain samples for $C^s_{c,r}$ and $C^s_c$, and 3) using Equation (9) to obtain samples for $\ell$.
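The simulation steps above can be sketched as follows (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lift(lam_t, lam_c, s, r, n_samples=100_000):
    """Empirical lift distribution via simulated Poisson conversion counts."""
    conv_test = rng.poisson(lam_t, n_samples)           # samples of the test count
    scaled_control = s * rng.poisson(lam_c, n_samples)  # samples of the scaled control
    # drop the (rare, for large lam_c) draws where the control count is zero
    ok = scaled_control > 0
    return (conv_test[ok] - scaled_control[ok]) / (r * scaled_control[ok])
```

With $\lambda_t = (1 + r\ell)\, s\, \lambda_c$, the sample mean of the simulated lift recovers the expected lift $\ell$ up to simulation noise.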
We calculate the means $\lambda_t$ and $\lambda_c$ by expressing them in terms of $s$, $r$ and the expected lift $\ell$. We can approximate $\ell$ with

$\ell \approx \frac{\lambda_t - s\,\lambda_c}{r\, s\, \lambda_c},$

and are then able to calculate $\lambda_t$ as

$\lambda_t = (1 + r\,\ell)\, s\, \lambda_c. \qquad (21)$
The procedure for calculating the test power is two-fold and is illustrated in Figure 3(a). First, the distribution of $\ell$ is calculated under $H_0$, in which $\ell = 0$ (i.e. $\lambda_t = s\,\lambda_c$). Estimates for $\lambda_c$ and $r$ can be taken from previous Facebook advertising results. For a one-tailed test at the 5% significance level the critical value $\ell_{\text{crit}}$ is calculated as the 95th percentile of this distribution:

$P(\ell > \ell_{\text{crit}} \mid H_0) = 0.05.$
Second, the distribution of $\ell$ is calculated under a specific $H_1$, in which $\lambda_t$ is as defined in Equation (21). Since the test power is strongly coupled to $\ell$ (see Figure 4), it is important to have a reasonable estimate. Estimates for $\ell$ can be taken from previous Facebook advertising results. If no previous studies are available, we can estimate $\ell$ from a lightweight pre-study, or from related studies in the literature. The test power can then be calculated as the proportion of this distribution above $\ell_{\text{crit}}$:

$\text{Power} = P(\ell > \ell_{\text{crit}} \mid H_1).$
2.3.2. Minimum sample size
The minimum sample size required to give a specified test power (commonly 80%) can be obtained from the power simulation by solving for the minimum $\lambda_c$ that gives a power greater than the specified value, using the bisection method (Burden and Faires, 1985). The minimum sample sizes required to observe lifts of 1%, 2%, 5% and 10% are shown in Table 10.
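The bisection can be sketched as below, treating the expected number of control-group conversions as the quantity to solve for (a sketch with our names; power is re-simulated at every step, so the tolerance should stay coarse relative to the simulation noise):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulated_power(lift, lam_c, s, r, n=100_000, alpha=0.05):
    """Simulation-based power, following the two-step procedure of Section 2.3."""
    def samples(lam_t):
        c_t = rng.poisson(lam_t, n)
        c_cs = s * rng.poisson(lam_c, n)
        ok = c_cs > 0
        return (c_t[ok] - c_cs[ok]) / (r * c_cs[ok])
    crit = np.quantile(samples(s * lam_c), 1 - alpha)
    return float(np.mean(samples((1 + r * lift) * s * lam_c) > crit))

def min_control_conversions(lift, s, r, target=0.8, lo=10, hi=10_000_000):
    """Bisect on lam_c until the simulated power crosses the target power."""
    while hi - lo > max(1, lo // 100):  # ~1% relative tolerance
        mid = (lo + hi) // 2
        if simulated_power(lift, mid, s, r) >= target:
            hi = mid  # enough conversions: shrink from above
        else:
            lo = mid  # not enough: raise the lower bound
    return hi
```

Power is monotone in $\lambda_c$ up to simulation noise, so plain bisection suffices here.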
3. Multi-cell lift studies
Multi-cell lift studies can be used to compare the incrementalities of multiple marketing strategies with potentially statistically different audiences. Here we consider the case of two cells, and . To maximise the test power, we assume the cells are of the same size, with the same test-control split proportions. A common pitfall in multi-cell studies is to use the test power and minimum sample size derived in Section 2. As multi-cell studies have more test/control groups, the variance of the test statistic, which involves arithmetic operations on all groups, will increase even if the variance within each group stays the same. In Section 4.2 we demonstrate this and develop the mechanism for correctly calculating test parameters.
The lifts in the two cells are defined analogously to Equation (9):

$\ell_A = \frac{I_A}{C^s_{c,r,A}}, \qquad \ell_B = \frac{I_B}{C^s_{c,r,B}},$

where the additional subscripts $A$ and $B$ indicate the cells. Facebook provide advertisers with $C_{t,A}$, $C_{c,A}$, $C_{t,B}$, $C_{c,B}$, $s$ and $r$, and so $\ell_A$ and $\ell_B$ can be computed as

$\ell_A = \frac{C_{t,A} - s\, C_{c,A}}{r\, s\, C_{c,A}}, \qquad \ell_B = \frac{C_{t,B} - s\, C_{c,B}}{r\, s\, C_{c,B}}.$
We define the test statistic as the absolute (as opposed to relative) difference between the lifts in cells $A$ and $B$,

$\Delta\ell = \ell_B - \ell_A,$

which is directly comparable with the lift in a single-cell study. (If we instead defined the test statistic as the relative difference, the effect size between cells would be a percentage of the effect size achieved in the single-cell case. To illustrate, a 1% relative difference in lifts means we are comparing a 5% lift in cell A with a 5.05% lift in cell B. To detect such a difference with 80% power we would require around 106M conversions in the control group of cell A (one of four groups in a two-cell lift study), a number which even the largest companies struggle to meet for experimentation purposes.) The null and alternative hypotheses are defined to be

$H_0: \Delta\ell = 0, \qquad H_1: \Delta\ell > 0.$
While the distributions of $\ell_A$ and $\ell_B$ can be characterised by their CMFs, it is difficult to obtain the PMFs of these distributions. Accordingly, the distribution of $\Delta\ell$ (e.g. its CMF or PMF) cannot be readily evaluated using a convolution. We believe that deriving an analytical form for the distribution of $\Delta\ell$ is of little practical use for test power and sample size calculation, as there are simpler alternatives such as simulating the distribution.
Under $H_1$ the distribution of $\Delta\ell$ is defined by $\lambda_{c,A}$, $\lambda_{c,B}$, $\ell_A$, $\ell_B$, $r$ and $s$. It is reasonable to assume that $r$ and $s$ are the same for both cells. In general, the audiences are not statistically identical in cells $A$ and $B$, so $\lambda_{c,A} = \lambda_{c,B}$ cannot be assumed. However, if the strategy in cell $B$ has not previously been tested, there is no good way of estimating $\lambda_{c,B}$, and so we assume $\lambda_{c,B} = \lambda_{c,A}$ here.
Statistical Significance & Critical Value
As Facebook do not report the difference in lifts between cells (or its significance) in multi-cell studies, advertisers are free to choose the significance level that suits their needs. We use a one-tailed test at 5% for the calculations shown in Section 4.2 to be consistent with Section 2.
The critical value $\Delta\ell_{\text{crit}}$ is defined to satisfy the following equation:

$P(\Delta\ell > \Delta\ell_{\text{crit}} \mid H_0) = 0.05.$

This can be obtained by finding the 95th percentile of the samples simulating the distribution of $\Delta\ell$ under $H_0$.
Test Power

Under $H_1$ we define a minimum detectable difference $\Delta\ell_{\text{min}}$ such that

$\ell_B = \ell_A + \Delta\ell_{\text{min}},$

and calculate the test power by the following equation:

$\text{Power} = P(\Delta\ell > \Delta\ell_{\text{crit}} \mid H_1).$
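The two-cell power calculation can be simulated by drawing independent lift samples for each cell under the shared-parameter assumptions above (a sketch; names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def cell_lift_samples(lam_t, lam_c, s, r, n=200_000):
    """Simulated lift values for one cell."""
    c_t = rng.poisson(lam_t, n)
    c_cs = s * rng.poisson(lam_c, n)
    ok = c_cs > 0
    return (c_t[ok] - c_cs[ok]) / (r * c_cs[ok])

def multicell_power(lift_a, delta, lam_c, s, r, alpha=0.05):
    """H0: both cells have lift lift_a; H1: cell B's lift is lift_a + delta.
    Assumes lam_c, s and r are equal in both cells, as in the text."""
    def delta_lift(lift_b):
        a = cell_lift_samples((1 + r * lift_a) * s * lam_c, lam_c, s, r)
        b = cell_lift_samples((1 + r * lift_b) * s * lam_c, lam_c, s, r)
        m = min(len(a), len(b))  # align lengths after zero-count filtering
        return b[:m] - a[:m]
    crit = np.quantile(delta_lift(lift_a), 1 - alpha)
    return float(np.mean(delta_lift(lift_a + delta) > crit))
```

Because the difference of two lifts carries the variance of four groups, the power for a given effect size is lower than in the single-cell case with the same conversions per control group.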
Minimum sample size

The minimum sample size can then be obtained from the simulated power in the same way as in Section 2.3.2, by bisecting on the expected number of conversions per control group until the target power is reached.
4. Results

In this section, empirical results on the distribution of the test statistic in single-cell lift studies, and the calculated power and sample size in both single-cell and multi-cell lift studies, are provided. In Section 4.1 we show the correctness of our simulation of $\ell$ by comparing it to the analytical form in Equation (19). Then, in Section 4.2, we calculate the test power and required sample size for a range of minimum detectable effects, for both single-cell and multi-cell lift studies.
4.1. Comparing the derived and simulated distributions of $\ell$
We first confirm that our simulation of $\ell$ agrees with the derived distribution (Equation (19)) by running a number of Kolmogorov-Smirnov (K-S) tests (Smirnov, 1944; Daniel et al., 1978). This indicates that the simulated distribution can safely be used as an alternative for the purpose of power and required sample size calculation.
For each run we 1) randomly specify the four parameters required by both methods: $\lambda_t$, $\lambda_c$, the reach $r$ and the scaling factor $s$; 2) generate a number of samples from the simulated distribution; 3) compute the K-S statistic w.r.t. the derived distribution; and 4) evaluate whether there is statistically significant evidence to reject the null hypothesis that the two distributions are the same. Steps 3) and 4) are mostly handled by the kstest function in scipy.
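One such run might look like the following sketch (parameter values chosen arbitrarily for illustration; the derived CMF follows our reading of Equation (19), and a small epsilon guards the floor against floating-point rounding):

```python
import numpy as np
from scipy.stats import kstest, poisson

rng = np.random.default_rng(4)
lam_t, lam_c, s, r = 1100.0, 1000.0, 1.0, 0.5  # one arbitrary parameter set

def derived_cmf(x):
    """Derived CMF of the lift, vectorised over x (truncated outer sum)."""
    n = np.arange(int(lam_c + 10 * np.sqrt(lam_c)) + 20)
    k = np.floor((1 + r * np.atleast_1d(x)[:, None]) * s * n[None, :] + 1e-9)
    return (poisson.pmf(n, lam_c)[None, :] * poisson.cdf(k, lam_t)).sum(axis=1)

# samples from the simulated distribution
c_t = rng.poisson(lam_t, 2_000)
c_cs = s * rng.poisson(lam_c, 2_000)
ok = c_cs > 0
samples = (c_t[ok] - c_cs[ok]) / (r * c_cs[ok])

stat, p_value = kstest(samples, derived_cmf)
```

A small K-S statistic (relative to the critical value for the sample size) indicates the two distributions agree.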
We ran 500 tests (four are shown in Figure 3), and 28 of them produced a K-S statistic that results in rejecting the null hypothesis at the 5% significance level. Taking into account that we are running multiple comparisons, and hence should expect around 25 rejections even when the two distributions are the same, we are satisfied that the derived and simulated distributions are statistically equivalent.
It is more than 30 times quicker to obtain the 95th percentile of the distribution of $\ell$ (i.e. the critical value) using the simulated distribution than using the derived distribution. This is established by comparing the time taken to:
(Simulated distribution) Find the value of the 95th percentile in the 10M samples simulating the distribution, versus
(Derived distribution) Find the root of $F_\ell(x) - 0.95$ under the same parameters, using the root-finding algorithm proposed by Brent (Brent, 2013).
This suggests it is more effective for an advertiser to obtain the test power using the simulated distribution for the single-cell case.
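The comparison can be reproduced along these lines (a sketch; exact timings depend heavily on hardware and implementation, so we only illustrate that the two approaches agree on the critical value):

```python
import time
import numpy as np
from scipy.optimize import brentq
from scipy.stats import poisson

rng = np.random.default_rng(5)
lam_c, s, r = 1000.0, 1.0, 0.5
lam_t = s * lam_c  # under H0 the lift is zero

# simulated distribution: 95th percentile of 10M lift samples
t0 = time.perf_counter()
c_t = rng.poisson(lam_t, 10_000_000)
c_cs = s * rng.poisson(lam_c, 10_000_000)
ok = c_cs > 0
crit_sim = np.quantile((c_t[ok] - c_cs[ok]) / (r * c_cs[ok]), 0.95)
t_sim = time.perf_counter() - t0

# derived distribution: root of F(x) - 0.95 via Brent's method
def lift_cmf(x):
    n = np.arange(int(lam_c + 10 * np.sqrt(lam_c)) + 20)
    inner = poisson.cdf(np.floor((1 + r * x) * s * n), lam_t)
    return float(np.sum(poisson.pmf(n, lam_c) * inner))

t0 = time.perf_counter()
crit_derived = brentq(lambda x: lift_cmf(x) - 0.95, -0.5, 0.5)
t_derived = time.perf_counter() - t0
```

The two critical values should agree up to the lattice spacing of the lift and the quantile's sampling noise.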
4.2. Comparison of single-cell and multi-cell test power and minimum sample size
Finally, we visualise our power and required sample size calculations, recording the number of conversions (and thus users) required to detect certain effects in both single-cell and multi-cell lift studies.
Figures 4a & 4e show the power calculation for the single-cell and multi-cell cases respectively, with the total audience size fixed to make the two comparable. The power of 78% in the multi-cell case is meaningfully lower than the 100% power achieved in the single-cell case. Figures 4b, 4c & 4d show the variation of single-cell test power with audience size, reach and control-test split respectively. For a given audience size the maximum power is obtained with a reach of 100% and a 50:50 split between the test and control groups (where $s = 1$). Figure 4f is the multi-cell equivalent of Figure 4b. Comparing these figures shows that for the same number of conversions per control group, the power achieved is lower in the multi-cell case; furthermore, this effect is larger for smaller effect sizes.
Table 10 shows that to achieve a test power of 80%, over twice as many conversions are needed per control group in the multi-cell case than in the single-cell case. Since our multi-cell scenario has two cells, the total audience size needed in the multi-cell case is over four times that of the single-cell case.
5. Conclusion

We have described how to design experiments to measure the incrementality of advertising campaigns on Facebook, bridging the gap between the general literature on online controlled experiments and industrial practice. We provided the statistical power and required sample size calculations for Facebook lift studies, and generalised the statistical significance, power and required sample size calculations to multi-cell lift studies, which are used by advertisers to compare campaigns or strategies where the target audience can exhibit a selection bias. We make our results useful to practitioners by presenting our calculations in terms of common advertising metrics (expected lift, reach percentage, and the ratio between test/control groups) and by publishing all of our code.
Acknowledgements. The authors thank Markus Ojala and Lauri Kovanen for useful discussions, and the anonymous reviewers for providing many improvements to the original manuscript.
References

- Brent (2013) Richard P. Brent. 2013. Algorithms for Minimization Without Derivatives. Courier Corporation.
- Burden and Faires (1985) R.L. Burden and J.D. Faires. 1985. Numerical analysis. Prindle, Weber & Schmidt.
- Daniel et al. (1978) Wayne W Daniel et al. 1978. Applied nonparametric statistics. Houghton Mifflin.
- Gordon et al. (2017) Brett R. Gordon, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2017. A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. (2017). http://www.kellogg.northwestern.edu/faculty/gordon_b/files/fb_comparison.pdf White paper.
- Inc. (2018a) Facebook Inc. 2018a. Facebook Reports Fourth Quarter and Full Year 2017 Results. (2018). https://investor.fb.com/investor-news/press-release-details/2018/Facebook-Reports-Fourth-Quarter-and-Full-Year-2017-Results/default.aspx
- Inc. (2018b) Facebook Inc. 2018b. What makes a lift study statistically powerful? (2018). https://www.facebook.com/business/help/165866720571247
- Krishnamoorthy and Thomson (2004) K. Krishnamoorthy and Jessica Thomson. 2004. A more powerful test for comparing two Poisson means. Journal of Statistical Planning and Inference 119, 1 (2004), 23–35.
- Liu and Chamberlain (2018) C.H. Bryan Liu and Benjamin Paul Chamberlain. 2018. Online Controlled Experiments for Personalised e-Commerce Strategies: Design, Challenges, and Pitfalls. arXiv preprint arXiv:1803.06258 (2018).
- Smirnov (1944) Nikolai Vasilyevich Smirnov. 1944. Approximate laws of distribution of random variables from empirical data. Uspekhi Matematicheskikh Nauk 10 (1944), 179–206.
- Zenith (2018) Zenith. 2018. Advertising Expenditure Forecasts March 2018. (2018). https://www.zenithmedia.com/product/advertising-expenditure-forecasts-march-2018/