1 Introduction
A/B tests refer to the factorial experiments whose purpose is to estimate the treatment effect(s) of a single controllable experimental factor. As an important tool for causal inference, A/B tests have been widely applied in the fields of clinical trials, marketing, business intelligence, etc. For instance, many technology companies use A/B tests to improve the quality of website products and online services. Initially, the experimental factor of A/B tests only had two levels, e.g., control and treatment, but the idea was extended to “A/B/n test”, in which the experimental factor has more than two levels. For simplicity, we use the term “A/B test” throughout this paper to represent both twolevel A/B test and multilevel A/B/n test. Also, we do not distinguish the actual meanings of settings, such as “treatment” or “control”. Both are simply regarded as two different treatment settings. Traditional A/B tests are usually conducted offline, meaning that the treatment settings are assigned to the test units before data collection. Nowadays, when an IT company experiments on the new website products, the treatment settings are assigned to users or test units sequentially when they enter the experiment at different times. Such sequential data collection process is known as online A/B tests (Bhat et al., 2017).
The response of an A/B testing experiment can be of various types. It can be a continuous variable such as the revenue of a product, or a discrete variable such as the number of clicks of a website advertisement. With the experimental data, statistical estimations and inferences are performed to conclude on the significance of the treatment effects (Wu and Hamada, 2011). In this paper, we follow the fundamental independence assumption that the test units do not affect each other’s behavior, also known as the “stable unit treatment value assumption” (Rubin, 1990). Besides the response observations, useful information of the test units is often available, which is described by covariate variables or covariates. Examples of the covariates can be patients’ medical history in a clinical trial or users’ cookie data in a web service experiment.
Randomized design is the most commonly used method to plan an A/B testing experiment, by which one hopes to achieve the covariate balance, a concept generally indicating the distributions of the covariates are similar across all treatment groups. When the sample size gets large, randomization can achieve the covariate balance asymptotically, whereas, for small to middlesized experiments, covariate imbalance can frequently occur and is likely to cause the confounding between the treatment effect(s) and some covariates. Many previous works have pointed out the shortcoming of purely relying on randomized design, including Seidenfeld (1981); Urbach (1985); Krause and Howard (2003); Hansen and Bowers (2008); Rubin (2008); Worrall (2010), etc. See Morgan et al. (2012) for a detailed review. Some actions are necessary to mitigate the covariate imbalance.
Two common design methods, stratification and blocking, can dealias the treatment and the covariate effects and reduce the bias. Both methods require the experimenter to have the liberty to choose the test units from the general population. In stratified sampling, the overall population is divided into the mutually exclusive strata. Each treatment group is constructed by randomly drawing the test units from all the strata, and the number of units from each stratum is proportional to the percentage of the corresponding stratum to the overall population. Thus, each treatment group is representative of the overall population and covariate balance is achieved. Similarly, block design puts the homogeneous units in a block (Sun et al., 1997; Cheng and Wu, 2002; Zhao et al., 2018). Within each block, the treatment settings are randomly assigned to the equal number of homogeneous units. The block design must be balanced in the sense that all blocks must have the same number of units (Wu and Hamada, 2011), which is different from stratification. The two methods have some obvious limitations. First, it is not always the case that the experimenter can randomly pick the test units from the general population to enter the experiment. For example, when an IT company wants to know which web design attracts more clicks and there is not much traffic visiting the web, the experimenter would rather collect all the data from the visitors than picking a subset. Second, both stratification and blocking are more suitable when there are only one or two covariate variables and they are all categorical. In the case when all the covariates are continuous, we can cluster the units based on covariates, and thus the clusters become the strata or blocks. But this approach may not truly achieve covariate balance due to the clusteringbased discretization.
When all the covariates are continuous, Morgan et al. (2012) and Morgan and Rubin (2015) proposed to rerandomize the test units across all treatment groups until the balance criterion based on Mahalanobis distance is sufficiently small. Branson et al. (2016) applied this method to construct a factorial design of treatment combinations with 48 covariates in a study on New York City high schools. Motivated by the same goal to achieve covariate balance, we propose a different balance criterion, which turns out to be consistent with the Mahalanobis distance as shown in our simulation study. Instead of rerandomization, we take the optimal design approach and find the best allocation of treatment settings by minimizing the proposed criterion. Similarly, Bhat et al. (2017) proposed another method which is essentially the optimal design that depends significantly on the model assumption (Atkinson and Donev, 1992; RomeroVillafranca et al., 2007). Bhat et al. (2017) used the semidefinite program (SDP) to solve a relaxation of the original optimal design problem for the offline design and dynamic programming for the online design. Although theoretically sound, the optimization methods cannot be extended to the case when the number of treatment settings is more than two, nor the modelbased criterion can be applied to categorical responses.
In this paper we consider the A/B testing experiments with continuous covariates. We propose a modelfree discrepancybased criterion in Section 2 and demonstrate theoretically that the smaller discrepancy would make the upper bound of the mean squared errors (MSE) of the parameter estimates smaller. The optimal design algorithms for both offline and online experiments are developed in Section 4. The discrepancybased criterion is modelfree and can be applied to both continuous and discrete responses as shown in Section 5. In Section 6, we demonstrate how to deal with high dimensional covariate using PCA through a real dataset. In Section 7 we discuss the extension as well as the limitation of the proposed design method.
2 DiscrepancyBased Criterion
In this section, we introduce a discrepancybased criterion for the design of A/B testing experiments. It aims at minimizing the differences of the covariate probability distributions between all treatment groups.
2.1 Motivation
The reason that randomized design can obtain the covariate balance can be explained in a probability sense. Assume that an A/B test has different treatment settings with . Without loss of generality, we label the treatment settings by . If there are test units, we denote the set as the experimental design, where is the treatment setting for the th test unit. We assume the covariates of any test unit as a
dimensional continuous random variable which is denoted by
and let be the observed covariates values of the th test unit. For a randomized design, each should be a random variable and independently follows the multinomial distribution with for , . Since the treatment assignments should be independent of the covariates, for any test unit , the conditional probability of should also be , i.e., for any . To construct the design , we only need to randomly sample from the multinomial distribution with probability for each test unit. Summarizing the above discussion, we have the following proposition. The proof is in Appendix.Proposition 2.1.
In a randomized design that is carried to the entire infinite population of test units, the treatment setting of any test unit should satisfy
(1) 
which holds if and only if the distribution of covariate in all treatment groups are identical, i.e. . Here is the density function of in the treatment group from the entire population.
Ideally, for a randomized design, the empirical distribution of , both and converge to the target multinomial distribution when . But in practice, due to the randomized sampling, even if we can enforce the marginal empirical distribution of to be balanced, the empirical conditional distribution of given can be different from target multinomial distribution. Therefore, we cannot ensure the condition (1) to hold for the empirical distribution for small to medium sized . Proposition 2.1 provides an intuitive direction of constructing the balanced design for A/B testings. To achieve covariate balance, one should make the distributions of the covariates in all treatment groups to be as close to each other as possible.
2.2 Discrepancybased criterion
Motivated by Proposition 2.1, one should assure in treatment groups. However, the practical experiment is carried with a limited sample size and we do not know the population density of
from the entire population nor of each treatment group. So we use a kernel density estimation (KDE) to approximate the probability density function of the covariates. KDE is a nonparametric method to estimate the probability density function of a random variable
(Epanechnikov, 1969). With covariate observations , the KDE of a probability density function for covariate is(2) 
where is a positive definite kernel function and is the positive definite bandwidth matrix. The choice of bandwidth is discussed in Section 4. Many kernel functions can be used here.
Denote the sample space of the covariate as . Common examples of are , bounded hypercube like , etc. For an experiment with treatment settings, we propose the following discrepancybased criterion for measuring the balancing property of covariates in treatment groups.
(3) 
where is the KDE of using the observed covariates of the th group, and is the KDE of using the complete covariates data of all the test units. Note that the KDE for all is independent of how the test units are partitioned into groups, but the KDE for each group does depend on the partition of the units. The norm is the function norm. Basically, (3) compares each group’s distribution of the covariates with the overall covariates’ distribution and returns the maximum discrepancy. Then, our goal is to partition the test units into treatment groups by minimizing the discrepancybased criterion so that the distributions of covariate in all treatment groups are as similar as possible as implied in Proposition 2.1. We show in Section 3 that small leads to small upper bound of MSE of the treatment effect estimate. This discrepancy criterion has appeared in the literature before. (Anderson et al., 1994)
proposed a twosample test statistics
as a measurement for the difference between two multivariate samples.One can also use as the criterion, but the maximum in (3) is obviously a stronger condition if we want all to be small. Note that it is not advisable to compare every pair of for , because it requires the computation of terms of , which becomes much more computational expensive than (3) as increases. Also, is a more accurate estimate of than any to , since is based on the complete data whereas any is only based on a smaller subset. Therefore, it is more reasonable to push to be close to . We propose the discrepancybased design method in the following two steps.

[leftmargin=*]

Partition all the available test units into groups by minimizing the criterion in (3). Here is the notation for a partition of test units, i.e., , where indicates which group the th unit is partitioned into.

Randomly assign the treatment settings to the partitioned groups.
Note that the partitioned group and the assigned treatment setting has a subtle difference due to Step 2. If any two units are in the same partition group, i.e., , their assigned treatment settings are also the same . But the th partition group with all may be assigned to the treatment setting and thus all the design for the units in this partition group.
3 Discrepancy and MSE
In this section, we show theoretically that the discrepancybased design has the advantage of regulating the upper bounds of the MSE of two estimators, differenceinmean and least squares estimator. For simplicity, we only consider a twolevel A/B testing experiment with the treatment settings A and B, and the response measurements are continuous. We also assume the number of test units in both groups are the same, i.e., . The results can be extended to the A/B testing with more than two levels and . Given a design , we assume the observations of the covariates and the response
follow the underlying linear regression model
(4) 
The parameter denotes the treatment effect,
is a dummy variable that is equal to 1 when the
th test unit is assigned with treatment A and 1 if assigned with treatment B,is the vector of basis functions on the covariates (
could include the intercept), is the vector of regression coefficients, and is the error. The errorsare independent and identically distributed with zero mean and variance
. For convenience, we index the data in treatment group A from 1 to and group B from to . Then, the above regression model (4) in vectormatrix notation is , where and are the vector of observed response and noise. The matrix can be written in the block matrixwith , , and is a vector of 1’s of length . The two block matrices are and .
To estimate the treatment effect , the differenceinmean and least squares estimators are most commonly used in practice. The optimal design approach, e.g., optimal design, constructs the design by minimizing the MSE of least squares estimator . However, it is a modelbased criterion and the basis functions must be known in advance. In the following two subsections, we show that for both estimators, the upper bounds of the MSE depends on the experimental design via a generally defined discrepancy.
3.1 Differenceinmean estimator
The differenceinmean estimator is widely used in causal inference to estimate the treatment effect. The notation and are the sample means of responses of the two treatment groups. In fact, coincides the least squares estimator based on the model assumption for the treatment A and for B. It implies the covariates do not affect the response. Under this model, the treatment effect is , the difference between the mean responses from the two groups. Thus,
is the unbiased estimator with the minimum variance among all possible linear estimators. However, more often than not, the response is affected by the covariates, and consequently, the true underlying model is, in fact, (
4) where . Under the model (4), the sample means and have the alternative form(5)  
(6) 
where and are the empirical CDF’s of covariate in group A and group B, respectively. The mean value of the sample mean errors and is 0 and their common variance is . With (5) and (6), the differenceinmean estimator is
In this paper, we assume the test units in the A/B testing are already fixed, i.e., the observed covariates are given. Thus, with given the observed covariates and design , the mean of the differenceinmean estimator is
With a randomized design, one hopes to have
and thus is nearly unbiased even if the oversimplified model assumption is wrong. But it is not guaranteed to hold for a single randomized partition. Assume the noise and the design are independent. Given the design that determines the empirical distributions and , then mean squared error of the differenceinmean estimator can be written as
(7) 
To find an upper bound of the , we first need to introduce the discrepancy concept in the reproducing kernel Hilbert space (RKHS). Let be a separable Hilbert space with a reproducing kernel function , with the reproducing property (Aronszajn, 1950; Berlinet and ThomasAgnan, 2011; Fasshauer, 2007),
Note that although sounds alike, the reproducing kernel used to define is different from the kernel for kernel density estimate defined in Section 2.
The KoksmaHlawka inequality (Hickernell, 2014) is
(8) 
where is a target probability measure or cdf and is an empirical distribution of that is supposed to approximate . The term is the variation of the integrand whose definition in (9) depends on if the RKHS contains constants,
(9) 
So measures the fluctuation of function and does not depend on or . The term is the discrepancy, which measures how much differs from based on the reproducing kernel . It is defined as
(10) 
By the KoksmaHlawka inequality (8), we can show Theorem 3.1. The proof is in Appendix.
Theorem 3.1.
Suppose that is the true distribution of covariates with sample space , and that is an RKHS of functions defined on with the reproducing kernel . Assuming that the true model of the response is (4) and the function lies in , the MSE of the differenceinmean estimator is bounded above by
(11) 
where is the discrepancy defined in (3.1) and is the variation defined in (9).
The upper bound of MSE in (11) consists of two parts. The second part is a constant. The first part is a product of two terms. The variation measures the smoothness of and does not depend on the design. The sum depends on , , and the true distribution of the covariates, but not the basis . For the twolevel A/B testing, the design is a partition of the test units into two groups, which in turn decides the empirical cdf and . The upper bound suggests we should choose the design such that and are as small as possible. In other words, the distributions of covariate in both groups should well imitate the population distribution of . Although a smaller upper bound does not directly implies a smaller , it regulates the largest possible MSE. Therefore, a design that makes both and small would make the differenceinmean estimator robust to the oversimplified assumption.
3.2 Least squares estimator
When we are confident that the covariates influence the response, i.e., in (4), instead of the differenceinmean estimator, the least squares estimate should be used. Usually, we assume a large set of basis to include all terms that are likely in the unknown true model. Model selection is then performed to select the significant ones in the analysis stage.
Denote for all the regression coefficients in (4). The least squares estimate of is
The corresponding least squares estimate of the treatment effect is
(12) 
where is the column vector whose first element is 1 and the rest elements are 0. Lemma 3.1 derives the MSE of the treatment effect estimate and the sum of MSE’s of ’s. Following the similar proof of Theorem 3.1, we derive the upper bound for .
Lemma 3.1.
The MSE of the least squares estimate of the treatment effect is
where is a column vector whose th element is and . The sum of is
Theorem 3.2.
Assuming , lies in the RKHS with reproducing kernel , the MSE of the least squares estimate of the treatment effect is upper bounded by
provided that .
It is straightforward to see that is invariant under different partitions of the test units. The term depends on the basis functions, but not the design. The term depends on the design but not the basis . As in the previous subsection, we should choose the design such that and are as small as possible. The condition , which is unitfree, is necessary to hold in order to regulate the worst case MSE. If some functions of are very oscillated, e.g., contains polynomials of high order, then it is more necessary for and to be small to guard against large ’s. When the condition is satisfied, smaller discrepancies and would lead to smaller worst case MSE of the least squares estimator.
Sometimes, the regression coefficients are also of interest to the experimenter. If so, we should minimize , on which the follow theorem provides an upper bound. We need to introduce some new notation. Recall that is the true distribution of covariates and is an RKHS of functions defined on with reproducing kernel . Define the information matrices at any covariates value in treatment A and group B as
respectively. Then, given the design, the information matrices of treatment group A and B are
and the information matrix of all the test units is
Similarly, we can also define the information matrix with respect to the true distribution ,
Suppose that the two functions
lie in the RKHS for any . Define the variation over all the basis functions in in group A and B as
respectively, where is the variation defined in (9).
Theorem 3.3.
The sum of for is bounded above by
provided that , and .
In the upper bound in Theorem 3.3, and are the trace of inverse information matrices in group A and B with respect to the true distribution of covariate , which depends on the true distribution of , the basis , but not the design . The terms in the denominators, and , depend on the design and basis . This implies that we should choose the design such that both and are as small as possible if we want to regulate the largest possible value of the sum of the MSE of the least squares estimates of the coefficients. Similar as the condition in Theorem 3.2, the conditions , and , which are also unitfree, are necessary to hold in order to regulate the worst case sum of MSE’s. If the basis has a large variation, e.g., contains polynomials of high order, then it is more necessary for and to be small to guard against large and . When the conditions are satisfied, smaller discrepancies and would lead to the smaller worstcase sum of MSE’s of the regression coefficients estimators.
In the proofs of Theorem 3.1, 3.2 and 3.3, the equality in each of the inequality is attainable thus the upper bound for each MSE is a tight bound. They show that we should choose the design that minimizes both and so as to regulate the worse MSE of the estimated parameters, including both treatment effect and the regression coefficients for the covariates. Although we only reach this conclusion for the case, it can be directly extended to case, since every two treatment settings can be considered as the case.
At the end of this section, we link the discrepancy defined on RPKH with the design criterion in (3). To calculate the value of any discrepancy, we need to choose a reproducing kernel function. Unfortunately, not every valid reproducing kernel function leads to a discrepancy that can be calculated easily. If we choose a relatively simple reproducing kernel function ,
the discrepancy corresponding to the treatment group becomes
Following the definition of distribution functions, we have and (number of observed in group that are equal to ). The true pdf is unknown, so we approximate it using , the KDE with all the observed covariates. The exact format of (which is the counted frequency) is not suitable for the continuous covariates. We use the KDE as a smoothed approximation of . Using the two approximation, the discrepancy can be approximated by
So we want to minimize for , it would be approximately the same to minimize all for all groups, which is equivalent to minimizing with respect to all the partitions.
4 Optimization Algorithms
In this section, we develop the algorithms that minimize discrepancybased criterion for both offline and online A/B tests. First, we derive the discrepancy criterion defined in (3) with a more concrete formula to facilitate its computation. Define a matrix operator as the summation of all the entries of a matrix. Denote the number of test units in the partitioned group as , and . Then, the discrepancy criterion can be computed as
(13)  
Detailed derivations are included in the Appendix. In (13) the matrix is a symmetric matrix of size , with elements defined as follows:
(14)  
(15) 
To illustrate the design approach, we choose a simple sample space and the commonly used Gaussian kernel function
, which is suitable for normally distributed covariates. Concrete expression of elements in matrix
are derived in the Appendix. Given a specific partition , we partition the matrix into submatrices accordingly, such that each submatrix corresponds to the test units in group and . That is, the entries of submatrix are