Introduction
Online experimentation has been playing a key role in data-driven decision making in the IT industry, including Microsoft (kohavi2009online; kohavi2014seven), Google (tang2010overlapping), LinkedIn (xu2018sqr), Netflix (xie2016improving), Uber, eBay (nie2020dealing), and many others (gupta2019top). Generally, online controlled experimentation, also known as A/B testing, is conducted for a predetermined amount of time to compare the difference in metrics between the treatment group and the control group, to which users are randomly assigned. Prior to experimentation, a set of high-quality metrics is determined to assess the effects of new features in the treatment group. The collected metric results can provide strong evidence to support hypotheses and hence accelerate the decision-making process (deng2016data; machmouchi2016principles; dmitriev2016measuring). In this study, we focus on the analysis of metrics that have incomplete measurements at the end of data collection in experiments.
According to their positions in the shopping funnel, metrics can be categorized as top-, middle-, and bottom-funnel metrics. For instance, a successful purchase typically requires users to take multiple steps from the homepage at the top of the funnel to the purchase page at the bottom. In online experimentation, it is common for millions of users to arrive at the top of the funnel (e.g., the homepage), while only a small percentage of users reach the bottom (e.g., the purchase page). During the transition from the top to the bottom of the funnel, users navigate through multiple pages, at any of which they can exit the shopping process. There are numerous scenarios in which users exit the funnel, resulting in incomplete records of their purchases or other metrics. A common cause is simply that each experiment has a fixed duration: keeping experiments alive for a long period of time is expensive due to high operational effort and business opportunity costs. When we close an experiment, we stop tracking all users, even though some users might not yet have completed their purchases. This incompleteness due to the delay in collecting measurements for bottom-funnel metrics is inevitable. There is also the possibility that users are lost to follow-up due to technical issues or user unavailability; for instance, when users switch from the desktop app to the mobile app, they become unavailable to the experiment. It is essential to fill in the incomplete metrics to improve metric quality, leading to trustworthy results and better decisions.
With incomplete metric measurements, the inference of the difference in metrics between the treatment and control groups is at risk of being biased and inaccurate (imbens2001estimating; imbens2000analysis; goldstein2007subtle). A naive approach to analyzing experiments with missing metric values is to disregard users with incomplete outcomes. This approach assumes that the missingness is completely at random and that the fully observed users are representative of the entire population. It also reduces the total number of users in the study, decreasing the experiment's statistical power; the decrease is substantial especially when the proportion of missingness is high.
Various imputation methods have been developed to address problems with missing data. One widely used method is single imputation, which fills in missing values with a single value, such as the mean of observed outcomes, for both the treatment group and the control group. Single imputation preserves the full sample size, but it raises concerns about a distorted distribution and underestimated uncertainty (spineli2020comparison). In addition, single imputation disregards information from other observed variables collected along users' journeys within the funnel. Other imputation methods have been developed for the missing at random (MAR) and missing not at random (MNAR) scenarios. MAR assumes that the missing mechanism is associated only with the observed variables (rubin1976inference; imai2009statistical; bhaskaran2014difference). Likelihood-based methods, such as generalized linear mixed models, have been developed for clinical trials with incomplete outcomes (molenberghs2004analyzing). The performance of these methods depends on the degree to which the MAR assumptions hold. Under MNAR, in which the effect of missing outcomes is nonignorable, the observed difference is a biased estimate of the average treatment effect (molenberghs2004analyzing). Regression-based imputation methods, such as logistic regression, are employed to model the indicator of missingness
(mao2021driving). Other prevalent methods, such as matching-based imputation, identify similar users from a set of variables. In general, these imputation methods require distinguishing users with missing outcomes from users whose outcomes are zero. In other words, general imputation methods cannot handle online experimentation scenarios in which users' unrecorded outcomes mix missing cases with zero cases.

To address these challenges, we propose a cluster-based k-nearest neighbors (kNN) imputation method for the analysis of online controlled experiments in the presence of incomplete metrics. The idea is to impute incomplete metrics from users' neighbors while incorporating the structure of the online experimentation data set. Specifically, the proposed method consists of two steps. The first step partitions the data set into clusters after stratifying on experiment-specific features, namely the treatment assignment and the buyers' characteristics. The second step performs kNN imputation within each cluster. We intend to improve metric quality so that experiment results are trustworthy and better data-driven decisions can be made. In our framework, the treatment assignment and user covariates are fully observed, whereas only the outcome at the bottom of the funnel has missing values. In addition, we divide users with missing outcomes into two categories: visitors and dropout buyers. The proposed method has three key advantages. First, it uses the informative covariates collected along users' journeys in the shopping funnel to impute incomplete metrics; in particular, it accounts for the heterogeneous impact of different user segments on missing rates. Second, the imputed values from our method are not limited to a single value.
Lastly, our method employs stratification and clustering to alleviate the computational burden in large-scale online experimentation data sets.
Throughout the paper, we use the purchase metric as an illustrative example of an incomplete metric at the bottom of the funnel. We also assume that purchase is the only metric (i.e., outcome) of interest in the experiment. The rest of the paper is organized as follows. In Sections 2 and 3, we detail the problem formulation, the proposed method, and the estimation procedures. In Section 4, we describe the competing methods and performance measures. A real case study is conducted in Section 5. We conclude this work with some discussion in Section 6.
Problem Formulation
In the context of online controlled experiments, we can classify users into three types based on their purchase behaviors: visitors, real buyers, and dropout buyers. Visitors participate in experiments but do not make contributions (e.g., purchases). Real buyers both participate in experiments and make contributions. Dropout buyers could have made their contributions (e.g., completed their transactions) within the experimentation period but failed to do so for various reasons. For example, users could drop out of the experiment because of unexpected external payment issues, or the experiment could lose users due to technical issues.
Suppose there are $n$ users in an experiment. Let $\delta_i$ denote whether the $i$th user is a buyer, and let $y_i$ denote the metric value of the $i$th user impacted by the experimentation. That is,
$$\delta_i = \mathbb{1}(\text{user } i \text{ completes a purchase}) \quad \text{and} \quad y_i = \text{purchase amount of user } i,$$
where $\mathbb{1}(\cdot)$ is the indicator function. We know for sure that user $i$ is a real buyer, with the corresponding value amount $y_i$, if he/she has completed transaction(s) during the experimentation period. In other cases, it is ambiguous whether he/she is a dropout buyer or merely a visitor. Therefore, we use $\delta_i = 1$ and the observed $y_i$ if the $i$th user is a real buyer, and $\delta_i = \text{NA}$ and $y_i = \text{NA}$ to represent the ambiguous situation (i.e., the user could be a dropout buyer or a visitor). To clarify,
$$(\delta_i, y_i) = \begin{cases} (1, \, y_i), & \text{if user } i \text{ is a real buyer}, \\ (\text{NA}, \, \text{NA}), & \text{otherwise}. \end{cases}$$
However, some practitioners arbitrarily treat all $\delta_i = \text{NA}$ and $y_i = \text{NA}$ as 0 without the diligence to distinguish between dropout buyers and visitors. Here, we denote such an arbitrary but simplified buyer indicator as
$$\delta_i^{\ast} = \begin{cases} 1, & \text{if } \delta_i = 1, \\ 0, & \text{if } \delta_i = \text{NA}. \end{cases}$$
The corresponding vectors are denoted as $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_n)^{\top}$ and $\boldsymbol{\delta}^{\ast} = (\delta_1^{\ast}, \ldots, \delta_n^{\ast})^{\top}$.
Additionally, let $\mathbf{x}_i$ denote the relevant features for user $i$, $i = 1, \ldots, n$, and let $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^{\top}$. Without loss of generality, we assume that the features are continuous variables. Suppose there are $m$ real buyers among the $n$ total users, and without loss of generality, let us assume the first $m$ users are real buyers. Denote users' purchase indicators and transactional amounts during the experimentation period using the vectors
$$\boldsymbol{\delta} = (\delta_1, \ldots, \delta_m, \boldsymbol{\delta}_{\text{mis}})^{\top} \quad \text{and} \quad \mathbf{y} = (y_1, \ldots, y_m, \mathbf{y}_{\text{mis}})^{\top},$$
where $\boldsymbol{\delta}_{\text{mis}}$ and $\mathbf{y}_{\text{mis}}$ collect the NA entries of the $n - m$ users with missing values.
The problem of interest is to impute the missing values $\boldsymbol{\delta}_{\text{mis}}$ and $\mathbf{y}_{\text{mis}}$ in the context of online experimentation. Among users with missing values, visitors are mixed with dropout buyers. Therefore, our proposed method first identifies the candidates for dropout buyers (i.e., the candidate 1s in $\boldsymbol{\delta}_{\text{mis}}$) with the help of a classification model and then imputes $\boldsymbol{\delta}_{\text{mis}}$ and $\mathbf{y}_{\text{mis}}$ using an efficient cluster-based nearest-neighbors approach.
The objective of the imputation problem is to impute missing values such that they are close to the underlying true data. The missing value imputation problem can be formulated as
$$\min_{\hat{\mathbf{y}}_{\text{mis}}} \; L(\hat{\mathbf{y}}_{\text{mis}}, \mathbf{y}_{\text{mis}}),$$
where $L(\cdot, \cdot)$ is a loss function that quantifies the difference between the imputed missing values $\hat{\mathbf{y}}_{\text{mis}}$ and the underlying true values $\mathbf{y}_{\text{mis}}$.

Imputing missing values with nonparametric methods such as the nearest neighbors algorithm in large-scale data sets is challenging due to the large computational requirements for distances between pairs of data points. To address this challenge, we propose to incorporate the data clustering patterns into the imputation. In other words, we partition users into clusters and then perform imputations within each cluster. Thus, the cluster-based imputation problem is described as
$$\min_{\hat{\mathbf{y}}_{\text{mis}}} \; L(\hat{\mathbf{y}}_{\text{mis}}, \mathbf{y}_{\text{mis}}) \quad \text{subject to} \quad \lVert \mathbf{x}_i - \mathbf{c}_k \rVert_2 \le r_k, \; i \in \mathcal{M}, \qquad (1)$$
where $\mathbf{x}_i$ denotes the features for user $i$, the constraint $\lVert \mathbf{x}_i - \mathbf{c}_k \rVert_2 \le r_k$ represents that the user with the missing value belongs to cluster $k$ with centroid $\mathbf{c}_k$, the constant $r_k$ controls the within-cluster distances, and $\lVert \cdot \rVert_2$ is the $L_2$-norm. The set of missing indices is defined as $\mathcal{M} = \{ i : \delta_i = \text{NA} \}$. After imputing $\boldsymbol{\delta}_{\text{mis}}$, we can estimate the corresponding $\mathbf{y}_{\text{mis}}$ as well.
Note that it is unknown whether a user with an incomplete metric is a visitor or a dropout buyer; the two are mixed because neither has purchase information recorded. To address this challenge, in Section 3.1 we apply a logistic regression model to identify a certain portion of visitors and narrow down the candidates for dropout buyers. Section 3.2 details the proposed cluster-based imputation. Notice that the data set in online controlled experiments is often so large that conventional clustering methods cannot be conducted efficiently. To alleviate the computational issue, Section 3.3 considers a stratification-based clustering and describes how to choose the number of clusters.
Identification of Dropout Buyer Candidates
The practitioners' simplified buyer indicator $\boldsymbol{\delta}^{\ast}$ reveals partial information about the true buyer indicator $\boldsymbol{\delta}$. Therefore, a classification model based on $\boldsymbol{\delta}^{\ast}$ provides us with the likelihood of purchases. Users with a high likelihood but missing purchase records can serve as candidates for dropout buyers. Since $\boldsymbol{\delta}^{\ast}$ is used as a substitute for $\boldsymbol{\delta}$, we call $\boldsymbol{\delta}^{\ast}$ the pseudo-response.
Specifically, we propose to apply the logistic regression model for buyer identification. Denote the conditional probability of being a buyer for user $i$ as $p_i$, that is,
$$p_i = \Pr(\delta_i^{\ast} = 1 \mid \mathbf{x}_i).$$
We model this conditional probability with the logistic model $\text{logit}(p_i) = \log\{p_i / (1 - p_i)\} = \mathbf{x}_i^{\top} \boldsymbol{\beta}$. Note that the features used in the logistic regression model are believed to be closely related to users' purchase behaviors. A threshold is needed in the logistic model for classification; one widely used threshold value is 0.5.
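As a concrete sketch of this modeling step (not the authors' implementation), the logistic model for the pseudo-response can be fit by plain gradient descent on the log-loss; in practice a library routine such as scikit-learn's `LogisticRegression` would be used. All function and variable names below are our own.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit logit(p) = b0 + x'b by gradient descent on the log-loss."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))       # predicted probabilities
        beta -= lr * Xb.T @ (p - y) / len(y)       # average-gradient step
    return beta

def predict_proba(X, beta):
    """Predicted probability that the pseudo-response equals 1."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ beta))
```

Thresholding `predict_proba` at 0.5 yields the predicted buyer indicator compared against the pseudo-response in Table 1.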
Comparing the model prediction with the pseudo-response, Table 1 summarizes the four types of classification results: false positive (FP), true negative (TN), false negative (FN), and true positive (TP). An FP indicates that a user whose pseudo-response is 0 is predicted to have purchase information; we use this inconsistency to identify the candidates for dropout buyers. That is, the FP cases can be either visitors or dropout buyers. A TN suggests agreement that the user has no purchase recorded, so we treat all TN cases as visitors. The FN and TP cases are users with recorded purchase behaviors, and hence they are real buyers, not dropout buyers or visitors.
Classification result  Pseudo-response ($\delta_i^{\ast}$)  Prediction  Description  
True Negative (TN)  0  0  Visitors 
False Positive (FP)  0  1  Candidates of dropout buyers 
False Negative (FN)  1  0  Real buyers 
True Positive (TP)  1  1  Real buyers 
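The mapping in Table 1 can be sketched as a small helper (function and label names are our own): given each user's pseudo-response and the thresholded model prediction, FP cases become dropout-buyer candidates, TN cases become visitors, and the rest are real buyers.

```python
import numpy as np

def categorize_users(delta_star, pred):
    """Map (pseudo-response, prediction) pairs to the user types of Table 1."""
    delta_star = np.asarray(delta_star)
    pred = np.asarray(pred)
    out = np.empty(delta_star.shape, dtype=object)
    out[delta_star == 1] = "real buyer"                               # TP, FN
    out[(delta_star == 0) & (pred == 0)] = "visitor"                  # TN
    out[(delta_star == 0) & (pred == 1)] = "dropout buyer candidate"  # FP
    return out
```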
Suppose $n_v$ visitors and $n_d$ dropout buyer candidates have been identified. Without loss of generality, let us assume the first $n_v$ users in the missing set are the visitors. Then we write $\boldsymbol{\delta}_{\text{mis}}$ as
$$\boldsymbol{\delta}_{\text{mis}} = (\underbrace{0, \ldots, 0}_{n_v}, \underbrace{1, \ldots, 1}_{n_d})^{\top},$$
with 0 representing visitors and 1 representing dropout buyers. Similarly, we denote the corresponding continuous response for the purchase amount as
$$\mathbf{y}_{\text{mis}} = (\underbrace{0, \ldots, 0}_{n_v}, y_{m + n_v + 1}, \ldots, y_n)^{\top},$$
where the zeros represent the purchase amounts of the estimated visitors and the remaining entries are the missing nonnegative responses of the dropout buyer candidates. In the following imputation methods, we consider $\boldsymbol{\delta}_{\text{mis}}$ known, and the aim is to impute the missing entries of $\mathbf{y}_{\text{mis}}$.
Cluster-based Imputation
Clustering improves data analysis efficiency by identifying inherent structure patterns and partitioning the large-scale data set into small subsets. In each stratum (described later in the stratification step), we perform the k-means clustering method (macqueen1967some) to form $K$ clusters, which is formulated as
$$\min_{C_1, \ldots, C_K} \; \sum_{k=1}^{K} \sum_{i \in C_k} \lVert \mathbf{x}_i - \mathbf{c}_k \rVert_2^2,$$
where $C_k$ denotes the set of users in the $k$th cluster and $\mathbf{c}_k$ is its centroid. Within each cluster
, we suggest the k-nearest neighbors (kNN) approach for imputation. The main idea of the kNN method is that nearby data points are similar to each other. The kNN algorithm is straightforward and does not require parametric model estimation, but it is computationally expensive and becomes slow as the size of the data set increases; this computational burden, however, is greatly mitigated by the clustering strategy. Given a specific cluster (i.e., a fixed constraint in (1)), the imputation problem (1) with the kNN method for the binary label can be written as
$$\min_{\hat{\delta}_i} \; \sum_{j \in \mathcal{N}_k(i)} (\hat{\delta}_i - \delta_j)^2, \qquad (2)$$
where $\delta_j$ is the binary label of neighbor $j$, $k$ is a positive integer representing the size of the target user's nearest neighborhood, and $\mathcal{N}_k(i)$ is the index set of the $k$ nearest neighbors of user $i$. In this work, we use a fixed value of 15 for $k$. It is not difficult to derive the solution to the objective function, which is written as
$$\hat{\delta}_i = \bar{\delta}_{\mathcal{N}_k(i)},$$
where $\bar{\delta}_{\mathcal{N}_k(i)}$ is the average of the responses in the nearest neighborhood.
With the imputed $\hat{\delta}_i$, we obtain the corresponding imputed missing value $\hat{y}_i$ from the cost function formulated as
$$\min_{\hat{y}_i} \; \sum_{j \in \mathcal{N}_k(i)} (\hat{y}_i - y_j)^2.$$
That is, the estimated $\hat{y}_i$ is given by
$$\hat{y}_i = \bar{y}_{\mathcal{N}_k(i)},$$
where $\bar{y}_{\mathcal{N}_k(i)}$ is the average of the responses in the nearest neighborhood.
The nearest neighbors are determined based on their distances to the target user; that is, the $k$ closest neighbors are found by
$$\mathcal{N}_k(i) = \arg\min_{S : \, |S| = k} \; \sum_{j \in S} d(i, j),$$
where $d(i, j)$ is the distance between users $i$ and $j$. In this study, we use the $L_2$-norm to measure distances, i.e., $d(i, j) = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$.
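Within a cluster, the imputation above reduces to averaging the k closest observed responses under the L2 distance. A minimal sketch with our own function names (k = 15 as in the paper, capped by the number of observed users in the cluster):

```python
import numpy as np

def knn_impute(X_obs, y_obs, X_mis, k=15):
    """Impute each missing outcome as the mean response of its k nearest
    observed neighbors (L2 distance) within the same cluster."""
    X_obs, X_mis = np.asarray(X_obs, float), np.asarray(X_mis, float)
    y_obs = np.asarray(y_obs, float)
    k = min(k, len(y_obs))                        # cap k by cluster size
    imputed = np.empty(len(X_mis))
    for i, x in enumerate(X_mis):
        dist = np.linalg.norm(X_obs - x, axis=1)  # distances to observed users
        nearest = np.argsort(dist)[:k]            # indices of k closest
        imputed[i] = y_obs[nearest].mean()        # average their responses
    return imputed
```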
Stratification-based Clustering
The data set in online controlled experimentation is often too large to cluster efficiently in the imputation step. To reduce the computational burden of clustering, we propose a stratification-based clustering approach. The key idea is to first stratify the user pool and then perform clustering within each stratum.
In the stratification step, we stratify users on two hierarchical levels: treatment assignment and users' buying characteristics. The treatment assignment, comprising the treatment group and the control group, is determined by the experimentation configuration. Generally, in online controlled experiments there are two treatment assignments, control and treatment, although more than two are possible in cases such as multivariant experiments. Users' buying characteristics, including new buyers, infrequent buyers, frequent buyers, and idle buyers, are categorized based on users' purchase activities at eBay; there are in total 12 buyer categories. Note that both the experimentation configuration and the users' buying segments are determined prior to the start of the experimentation. The hierarchical stratification is formulated as
$$\mathcal{X} = \bigcup_{t=1}^{T} \bigcup_{b=1}^{B} S_{tb},$$
where $S_{tb}$ is the stratum at the $t$th treatment level and the $b$th level of users' buying characteristics in the feature space $\mathcal{X}$, and there are in total $T$ levels of treatment assignment and $B$ levels of users' buying characteristics.
The combination of stratification and clustering within each stratum greatly improves the computational efficiency of the imputation step, where the neighbors of the target user are searched only within the cluster to which the user belongs.
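The stratify-then-cluster step can be sketched as follows; the grouping keys and the minimal Lloyd's-algorithm k-means are our own simplifications, and a library implementation such as scikit-learn's `KMeans` would be used in practice.

```python
import numpy as np
from collections import defaultdict

def stratify(treatment, segment):
    """Group user indices by (treatment assignment, buying segment)."""
    strata = defaultdict(list)
    for i, key in enumerate(zip(treatment, segment)):
        strata[key].append(i)
    return strata

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm, run separately within each stratum."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)              # assign to nearest centroid
        for j in range(n_clusters):
            if np.any(labels == j):               # update non-empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```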
The number of clusters in each stratum is chosen by maximizing a simplified version of the Silhouette score, also known as the simplified Silhouette. The Silhouette score is an effective measure of clustering goodness (rousseeuw1987silhouettes), but it requires an intensive computation of the distances between each data point and all other data points. The simplified Silhouette improves the computational efficiency by instead calculating the distances between each data point and the centroids of the clusters (hruschka2004evolutionary). The simplified Silhouette of data point $i$, denoted as $ss_i$, is defined as
$$ss_i = \frac{b_i - a_i}{\max(a_i, b_i)},$$
where $a_i$ is the distance between data point $i$ and the centroid of the cluster it belongs to, and $b_i$ is the minimum of the distances between data point $i$ and the centroids of the other clusters. The final simplified Silhouette is the average of all data points' simplified Silhouettes. Note that the distances of each data point to its cluster centroid have already been calculated and recorded during the k-means clustering, which greatly reduces the computational burden of the simplified Silhouette.
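The simplified Silhouette can be computed from the centroid distances alone. A sketch in our own notation; one would evaluate it for several candidate numbers of clusters and keep the maximizer:

```python
import numpy as np

def simplified_silhouette(X, labels, centroids):
    """Average simplified Silhouette: a_i is the distance of point i to its
    own centroid, b_i the minimum distance to any other centroid."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = np.arange(len(X))
    a = dist[idx, labels]                  # distance to own centroid
    dist_other = dist.copy()
    dist_other[idx, labels] = np.inf       # exclude the own cluster
    b = dist_other.min(axis=1)             # nearest other centroid
    return np.mean((b - a) / np.maximum(a, b))
```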
Pseudocode for the proposed method is summarized in Algorithm 1.
Competing Methods and Performance Measures
To evaluate the proposed method's performance, we compare the proposed method with a list of benchmark methods, including:

Complete-case analysis (BM$_1$);

Unconditional control-mean imputation (BM$_2$);

Unconditional treatment-mean imputation (BM$_3$);

Unconditional zero imputation (BM$_4$);

Best-case analysis (BM$_5$);

Worst-case analysis (BM$_6$).
Complete-case analysis removes cases with missing values and uses only the cases with complete outcomes. Specifically, we discard $\mathbf{y}_{\text{mis}}$, and the sample size is reduced to $m$, that is,
$$\mathbf{y} = (y_1, \ldots, y_m)^{\top}.$$
The complete-case analysis is easy to implement but wastes information, especially when the number of incomplete cases is substantial.
Unconditional control-mean imputation uses the mean of the observed users in the control group to impute missing values, while unconditional treatment-mean imputation uses the mean of the observed users in the treatment group. That is,
$$\hat{y}_i = \frac{1}{n_C} \sum_{j \in \mathcal{O}_C} y_j \quad \text{or} \quad \hat{y}_i = \frac{1}{n_T} \sum_{j \in \mathcal{O}_T} y_j, \quad i \in \mathcal{M},$$
where the index set $\mathcal{O}_C$ contains the observed users in the control group, the index set $\mathcal{O}_T$ contains the observed users in the treatment group, and $n_C$ and $n_T$ are the corresponding sample sizes of the control group and the treatment group, respectively. Unconditional zero imputation uses zero to impute missing values, that is,
$$\hat{y}_i = 0, \quad i \in \mathcal{M}.$$
These three imputation methods are variants of the single-value imputation approach, which preserves the full data size. However, they treat the missing values as fixed, distorting the distribution and ignoring the uncertainty in the missing values.
The best-case analysis imputes missing values in the treatment (control) group with the mean of the observed users in the treatment (control) group. In contrast, the worst-case analysis imputes missing values in the treatment (control) group with the mean of the observed users in the control (treatment) group. Here, we assume that the tested feature naturally has a positive impact, and thus the mean in the treatment group is expected to be greater than the mean in the control group. The best-case analysis and the worst-case analysis are expressed as
$$\text{best case: } \hat{y}_i^{T} = \bar{y}^{T}, \; \hat{y}_i^{C} = \bar{y}^{C}; \qquad \text{worst case: } \hat{y}_i^{T} = \bar{y}^{C}, \; \hat{y}_i^{C} = \bar{y}^{T},$$
where $\hat{y}_i^{T}$ ($\hat{y}_i^{C}$) is the imputed missing value in the treatment (control) group and $\bar{y}^{T}$ ($\bar{y}^{C}$) is the mean of the observed values in the treatment (control) group.
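The six benchmarks can be written compactly as array operations. A sketch with our own names, where `observed` flags complete cases and `treat` flags the treatment group:

```python
import numpy as np

def impute_benchmarks(y, observed, treat):
    """Apply the six benchmarks to an outcome vector y with missing entries."""
    y = np.asarray(y, dtype=float)
    mean_c = y[observed & ~treat].mean()          # observed control mean
    mean_t = y[observed & treat].mean()           # observed treatment mean
    out = {"BM1": y[observed]}                    # complete-case: drop missing
    for name, fill in [("BM2", mean_c), ("BM3", mean_t), ("BM4", 0.0)]:
        yi = y.copy()
        yi[~observed] = fill                      # single-value imputation
        out[name] = yi
    best, worst = y.copy(), y.copy()
    best[~observed & treat] = mean_t              # BM5: own-group mean
    best[~observed & ~treat] = mean_c
    worst[~observed & treat] = mean_c             # BM6: opposite-group mean
    worst[~observed & ~treat] = mean_t
    out["BM5"], out["BM6"] = best, worst
    return out
```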
To check the performance of the proposed method, we estimate the mean and variance in the control group, and compute the lift in the mean between the treatment group and the control group, the standard error (SE) of the difference between the treatment and control groups, the coefficient of variation (CV) for the control group, the zero rate (ZR), and the p-value. The lift in the mean between the treatment group and the control group is described as
$$\text{Lift} = \frac{\bar{y}^{T} - \bar{y}^{C}}{\bar{y}^{C}} \times 100\%,$$
where $\bar{y}^{T}$ and $\bar{y}^{C}$ are the means of the treatment group and the control group, respectively.
The SE is expressed as
$$\text{SE} = \sqrt{\text{SE}_T^2 + \text{SE}_C^2},$$
where $\text{SE}_T$ and $\text{SE}_C$ are the standard errors of the treatment group and the control group, respectively.
In online experimentation, the faster we run experiments, the greater the economic benefits and the lower the operational costs. Given constant user traffic, running experiments faster means that fewer users are required (wu2011experiments; deng2013improving). The CV is proportional to the number of users required to achieve a predetermined statistical power. The CV is expressed as
$$\text{CV} = \frac{s^{C}}{\bar{y}^{C}},$$
where $s^{C}$ is the standard deviation of the control group. The smaller the CV, the smaller the user size required to detect the difference at the specific statistical power, and thus the higher the sensitivity.
The ZR is the ratio of the number of zeros in the imputed $\hat{\mathbf{y}}$ to the total data size $n$, described as
$$\text{ZR} = \frac{\#\{ i : \hat{y}_i = 0 \}}{n}.$$
The ZR evaluates the proportion of visitors whose outcome is zero after imputation.
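The performance measures above can be computed as follows (a sketch in our own notation; the CV here uses the standard-deviation-to-mean ratio, which is our reading of the definition):

```python
import numpy as np

def experiment_metrics(y_treat, y_ctrl):
    """Lift (%), SE of the difference, control-group CV, and zero rate."""
    y_treat, y_ctrl = np.asarray(y_treat, float), np.asarray(y_ctrl, float)
    mean_t, mean_c = y_treat.mean(), y_ctrl.mean()
    lift = (mean_t - mean_c) / mean_c * 100.0     # lift in the mean, percent
    se_t = y_treat.std(ddof=1) / np.sqrt(len(y_treat))
    se_c = y_ctrl.std(ddof=1) / np.sqrt(len(y_ctrl))
    se = np.sqrt(se_t ** 2 + se_c ** 2)           # SE of the mean difference
    cv = y_ctrl.std(ddof=1) / mean_c              # control-group CV
    zeros = np.concatenate([y_treat, y_ctrl]) == 0
    return {"lift_pct": lift, "se": se, "cv": cv, "zero_rate": zeros.mean()}
```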
Case Study
To illustrate the proposed method, this section uses a past experiment whose objective was to improve eBay's item ranking in search results based on a ranking algorithm. The experiment hypothesis is that integrating information about negative buyer experiences into the ranking algorithm will reduce the visibility of items with a high probability of negative buyer experiences in search results, resulting in lower product return rates and increased revenue. The experiment lasted three weeks. A portion of eligible eBay users were selected and randomized into three variants, two treatment groups and one control group. The number of participating users in each variant exceeds 10 million. One of the most important outcomes is related to purchases, denoted here as PR.
The outcome PR is incomplete due to its high missing rate. PR is recorded when users make purchases during the experiment's data collection period, but not when either of the following occurs: users do not make purchases, or the platform is unable to record the purchases before the end of the data collection period. To impute PR and thus identify visitors and dropout buyers, we use the following informative covariates: the treatment assignment, the number of sessions, the number of sessions with searches, the number of sessions with qualified events highly related to purchases at eBay, and the user's buying characteristics. The treatment assignment is predetermined before running the experiment to assign users to the treatment group and the control group. The number of sessions corresponds to the number of sessions users have throughout the experiment. The number of sessions with searches is the number of sessions that contain at least one search activity. The number of sessions with qualified events is the number of sessions that include at least one qualified event activity. The buying characteristics of users are their historical purchasing patterns at eBay. These covariates are complete and have no missing values. We impute the outcome PR using the proposed cluster-based imputation method. In the stratification step, we divide the large-scale data set into smaller subsets based on two variables: the treatment assignment and the user's buying characteristics. When performing clustering within each stratum, we use the number of sessions, the number of sessions with searches, and the number of sessions with qualified events.
In Table 2, we compare the performance of the proposed cluster-based imputation method and the benchmark methods. The proposed method has a smaller mean in the control group and a smaller ZR than the other methods, except for BM$_4$. The proposed imputation method identifies visitors and dropout buyers among the missing values; that is, it imputes zeros for the visitors, who form a portion of the users with missing outcomes, and positive values for the dropout buyers. Compared to BM$_4$, the proposed imputation method imputes fewer zeros and thus has a larger mean in the control group. Compared to the mean-imputation methods that impute all missing values with a single value, the proposed imputation method has more zeros and a smaller mean in the control group. The proposed method has a larger CV in the control group than all other methods, with the exception of BM$_4$. This is largely attributable to the change in the mean of the control group, as the pooled standard errors for all methods, with the exception of BM$_1$, are quite close. The proposed method has the smallest lift, and all methods have a consistent direction of lift. Based on the p-value and a Type I error of 10%, the proposed method and BM$_5$ are statistically significant, indicating that there is sufficient evidence to reject the null hypothesis, whereas the other methods are not. This is expected because single imputation methods are well known to dilute mean differences, producing results suggesting no difference between the control group and the treatment group. The proposed method has a larger variance in the control group and a larger SE than the other methods, except for BM$_1$. BM$_1$ has a reduced sample size, resulting in the largest variance and SE for the control group. Unlike the other methods, with the exception of BM$_1$, the proposed method does not ignore the variance among the missing values, resulting in a greater variance.

Method  Mean  Variance  CV  Zero rate  Lift (%)  SE  p-value  
BM_{1}  107035.21  1235.8  0.265  0.00  0.37  0.33  0.17 
BM_{2}  20003.17  390.5  0.362  0.00  0.16  0.06  0.28 
BM_{3}  20004.96  389.9  0.363  0.00  0.17  0.06  0.28 
BM_{4}  20693.30  213.7  0.673  0.83  0.29  0.06  0.31 
BM_{5}  20003.17  390.5  0.362  0.00  0.29  0.06  0.05 
BM_{6}  20004.96  389.9  0.363  0.00  0.03  0.06  0.82 
Proposed  21661.66  246.3  0.598  0.80  0.60  0.06  0.02 
The mean, variance, CV, and SE are not real values; they are masked with a particular linear transformation to meet the disclosure requirement.
Figure 1 illustrates the increase in the mean of the control group across users' buying segments for the proposed cluster-based imputation method relative to the zero-imputation method. Different user segments have different mean values, with the top two being frequent buyer levels II and III. The proposed imputation method yields larger mean values than the zero-imputation method in nearly all user segments. The frequent buyer levels II and III segments have considerably larger mean increases than the idle buyer levels. This suggests that dropout buyers are more likely to occur in frequent buyer levels II and III, while in segments such as the idle buyer levels, users with unrecorded outcomes are more likely to be visitors. This is consistent with the findings in Figure 2 regarding the allocation of the zero rate across user segments. Different user segments have varying degrees of zero rate: the zero rates for frequent buyer levels II and III are approximately 45%, whereas the zero rates for idle buyer levels II and III are above 90%. This is reasonable given that users at frequent buyer levels II and III are more likely to make purchases, resulting in few zero values for the outcome PR. A high zero rate corresponds to a low mean value in Figure 1.
Figure 3 shows the distribution of the CV across user segments for the proposed imputation method and the zero-imputation method. For both methods, the CV values for the frequent buyer levels are less than half of those for the idle buyer levels. Moreover, the CV of the proposed method is consistently lower than that of the zero-imputation method across all user segments. The decrease in the CV indicates an improvement in sensitivity for the outcome PR, largely attributable to the change in the mean values.
Discussion
Metrics provide strong evidence to support hypotheses in online experimentation and hence reduce debate in the decision-making process. This paper introduces the concept of dropout buyers and classifies users with incomplete metric values into two categories: visitors and dropout buyers. For the analysis of incomplete metrics, we propose a cluster-based k-nearest neighbors imputation method. The proposed imputation method considers both the experiment-specific features and users' activities along their shopping paths. The method incorporates the uncertainty among missing values in the outcome metrics using the k-nearest neighbors method. To facilitate efficient imputation in large-scale online experimentation data sets, the proposed method employs a combination of stratification and clustering. Stratification divides the entire large-scale data set into small subsets to improve the computational efficiency of the clustering step, and clustering identifies inherent structure patterns to improve the performance of the k-nearest neighbors method within each cluster.
It is worth remarking that in this work the k-nearest neighbors method uses the simple average of the responses of the nearest neighbors. A weighted average, in which data points closer to the target contribute more to the decision than distant ones based on their distances, has been proposed (hechenbichler2004weighted). Another direction for future research is to study the effects of dynamic numbers of nearest neighbors (ougiaroglou2007adaptive) in the proposed imputation framework. Additionally, the proposed imputation method imputes missing values for each user with missing outcomes individually; it would be interesting to categorize users with missing outcomes into various hubs and investigate an imputation strategy for each hub of users altogether.