I Introduction
During web and mobile software product development cycles, innovations and improvements are continuously made to products by developing and rolling out new features. Integrating a feature into a mobile app or website requires changes to the code base on the client or server side. In a complex world of microservices and inter-service dependencies, it is very hard to ensure reliability manually. While new features are designed and tested cautiously before being released to end users, it is inevitable that a small fraction of them will have bugs in the code or flaws in their design. Such problematic features degrade the user experience when released (e.g. crashing the application, slowing down performance, draining the mobile battery, blocking users from requesting a trip, etc.).
Improper ways to release a new feature include (1) making the feature available to all users at once; (2) releasing a feature without a proper monitoring and alerting system in place. Due to the uncertainty of impact embedded in the new features, releasing the feature through an unguided process can be risky:

The negative impact may go unnoticed for a long time resulting in a bad customer experience.

The problematic feature may impact a large number of users, even if it is reverted timely.

When a metric degradation is detected, it is hard to attribute the degradation to a specific feature and fix a bug.
Online controlled experimentation, also known as A/B testing, is one way to evaluate a feature’s impact before formal release [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. In a standard A/B test, users are split into a treatment group (with the new feature turned on) and a control group (with the new feature turned off), and user engagement and business metrics are analyzed when the experiment concludes. While experimentation can help detect potential flaws in a feature before its formal release, such a feature release process can still be risky post experimentation. If the experiment was run on a certain population (say, a couple of cities) and the feature is then rolled out to a population that was not tested (all cities), the post-experiment release may affect populations that were underrepresented in the experiment. Thus, even with an A/B test in place, a proper rollout process is needed for the post-experimentation feature rollout.
A reliable feature release process can systematically reduce the negative impact of problematic feature updates. Such a process has the following desirable properties: 1) early detection of regressions, 2) reverting the feature rollout if a regression is detected, 3) reducing the impact scope in terms of user size, 4) rolling out flawless features at a fast speed, and 5) requiring minimal human attention and intervention in the process.
Based on the criteria mentioned above, this paper proposes a framework for rolling out features autonomously using an adaptive experimental design. The rollout starts with a small proportion of users exposed to the feature, gradually ramps up to a larger user population, and eventually rolls out to the entire target population. This rollout framework is composed of two main algorithmic components that make the process both intelligent and autonomous. First, a continuous monitoring algorithm based on a sequential test detects regressions early so that the team can be alerted and the feature reverted in a timely manner. Second, an adaptive ramp-up algorithm determines when and by how much to ramp up, based on the sample size required to reach sufficient power at the final stage. The proposed framework satisfies the five properties of a safe feature rollout process. It enables engineers to quickly evaluate and roll out new features with a controlled amount of risk, without much manual work or intervention during the process. As a result, this framework powers the development and rollout of mobile and web software products.
The contribution of this paper includes:

Proposing a staged rollout framework for releasing new features autonomously by combining a continuous monitoring algorithm with a ramp-up algorithm.

Providing formulas for estimating power and required sample size for the sequential test.

Proposing and comparing three ramp-up algorithms: time-based, power-based, and risk-based.

Evaluating the empirical performance of the staged rollout framework through examples.
This paper is organized as follows: Section II introduces the background and terminology of this work, as well as its relation to standard A/B testing; Section III presents the concept of the staged rollout framework for feature release; Section IV describes the monitoring algorithm: a sequential probability ratio test with nonparametric variance estimation; Section V introduces three ramp-up algorithms: the time-based ramp-up, the power-based ramp-up based on the statistical power of the sequential test, and the risk-based ramp-up based on Bayesian methods; Section VI evaluates how this rollout framework works in practice through both real-data and synthetic-data examples; and Section VII summarizes the work and discusses practical considerations.
II Background
II-A Terminology
In this section, the terminology used in this paper is defined. Some of the terms are commonly known in the software development world, and some have specific meanings and scopes in this paper.

Feature A new feature is defined as any change to the product that involves a code or configuration change. The change can happen either in the client codebase or in the backend codebase on the server. A feature can be as large as an app redesign, and as small as a few lines of code or configuration change that are not visible to users.

Feature Flag Feature flags are a gating system to control code flow in the product. A feature flag can turn a block of code (a feature) on or off; in other words, a feature flag can be used to control whether a user sees a new feature. For example, a feature is turned on by setting the feature flag value to True, and turned off by setting the value to False.
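As a concrete illustration, a feature-flag check can be as simple as a lookup gating two code paths. This is a hypothetical sketch: the flag store and the names used here are illustrative, and production systems typically read flags from a configuration service rather than a module-level dictionary.

```python
# Minimal feature-flag gating sketch (illustrative names; a real system
# would fetch flag values from a remote configuration service).

FLAGS = {"feature_x": False}  # flag is off by default

def is_enabled(flag_name: str) -> bool:
    """Return True if the feature behind `flag_name` is turned on."""
    return FLAGS.get(flag_name, False)

def serve_experience() -> str:
    # The flag controls which code path (experience) a user gets.
    if is_enabled("feature_x"):
        return "new"
    return "default"
```

Reverting a feature then amounts to flipping the stored flag value back to False, which restores the default experience for everyone.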

Rollout The process of exposing a new feature to the user population is defined as a rollout. The rollout starts with the state of no users getting exposed to the new feature and ends with the feature turned on for all users.

Staged Rollout The staged rollout framework proposed in this paper breaks the feature rollout process into different stages, each stage with a different proportion of the user population getting exposed to the feature. In this framework, a feature is gradually rolled out to the user population from smaller sample sizes to larger sample sizes.

Revert Reverting a feature means turning off the feature for all users. When the new feature is turned off, a default experience (often the legacy version of the feature) is served to the user.

Ramp-up Ramping up a feature means turning on the feature for more users. In the staged rollout framework, each subsequent stage has a larger sample size (in proportion) than the previous stage, so moving forward one stage is equivalent to ramping up the feature to more users.

Regression A regression is a situation where a feature stops functioning as intended or performs in a suboptimal manner. A regression can cause user experience degradation. A regression can be caused by a new feature that contains a code bug or design flaw.

Outage An outage, also known as downtime, is a serious regression in which a core function or the entire app stops working. A problematic feature can cause an outage.
II-B Relation to Standard A/B Testing
TABLE I: Differences between Staged Rollouts and Standard A/B Testing

|  | Staged Rollout | Standard A/B Testing |
| Main Purpose | Automated feature release process | Feature evaluation process |
| Question to Answer | Is the feature causing a regression? | Is the feature successful? |
| Experiment Design and Configuration | Multiple adaptive stages | One fixed stage |
| Metrics | Core user engagement and software performance | Feature-related user engagement |
| Statistical Test | Sequential test | Fixed-horizon test |
| Analysis Frequency | Continuous monitoring | One-time, fixed horizon |
There are similarities between staged rollouts and standard A/B testing. In both cases, there is a randomized control group to compare with the treatment group, and a large portion of the infrastructure can be shared between the two frameworks. However, there are major differences between the two, and standard A/B testing cannot solve the problem that staged rollouts focus on. See Table I for a list of differences between staged rollouts and standard A/B testing. In practice, the staged rollout framework can be used by itself to roll out a feature, but it can also be used within an A/B test to monitor and ramp up the experiment.
III Staged Rollout Framework Overview
The design for the staged rollout framework comes with two main components.
First, the feature rollout process is staged so that the feature is gradually ramped up. The initial stage exposes a relatively small proportion of users to the treatment, while the last stage uses a large sample size reflecting the final rollout goal. The stages are designed to be increasing in sample size. The number of stages and the sample size at each stage can be determined either by manual inputs or by algorithms (introduced in the following sections).
Second, the feature rollout process is monitored and controlled by statistical algorithms that evaluate the feature's impact and infer the optimal rollout action: revert, ramp up, or stay at the current stage. Two main statistical algorithms support the rollout decision-making: a continuous monitoring algorithm and an adaptive ramp-up algorithm. The monitoring algorithm continuously monitors the metrics for the rollout via a sequential test (Section IV). If no significant difference is detected in the metrics between treatment and control, the rollout process continues. If a significant difference is detected, an alert with the metric signal is sent to the feature developers, and the rollout can be reverted by the developer or the system. In addition to the monitoring algorithm for deciding when to revert a rollout, three methods for deciding when to ramp up to the next stage are introduced in Section V. The time-based ramp-up schedule is the most basic but transparent approach. The power-based ramp-up aims to achieve sufficient statistical power at a preset sensitivity level (minimum detectable difference) within a given time frame. The risk-based ramp-up algorithm sets the rollout speed and scope according to the signals collected and the risk tolerance, based on a Bayesian framework.
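The per-stage decision logic described above can be sketched as follows. This is a hypothetical outline, not the production implementation; the function and parameter names are illustrative.

```python
# Sketch of the per-stage rollout decision: revert on an alert from the
# monitoring algorithm, ramp up when the ramp-up algorithm allows it,
# otherwise stay at the current stage.

def rollout_decision(alert: bool, ramp_ready: bool,
                     current_pct: float, next_pct: float):
    """Return (action, rollout percentage) for the next stage."""
    if alert:
        return ("revert", 0.0)                           # regression detected
    if ramp_ready:
        return ("ramp_up", max(current_pct, next_pct))   # keep pct monotone
    return ("stay", current_pct)
```

The monotonicity guard (`max`) mirrors the framework's requirement that the exposed proportion only grows from stage to stage unless the feature is reverted.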
Table II shows an example scenario for a staged rollout. Each staged rollout is carried out for one feature, so that if a regression is detected, the feature development team can attribute the impact back to that feature. Multiple features can still be rolled out in parallel, each in a separate rollout experiment, using the multi-layering system [11]. Upon setting up the rollout, the feature engineer specifies the target population for the rollout (e.g. all active users in the US using the new app version 2.0 on Android), which is defined as all users. In this example, the feature is gradually rolled out through five stages, and each stage takes one day. The control group gets the default experience with the feature turned off, and the treatment group gets the new experience with the feature turned on. Monitoring is performed on the metric data collected from these two groups. The untreated group gets the same experience as the control group, but its data are not used for analysis, and these users are treated as outside the experiment. Equal sample sizes are kept between the control group and the treatment group to avoid Simpson’s Paradox [12]. During the rollout process, key user engagement and app health metrics are monitored by the sequential test introduced in the following section. The metrics can be defined at different analysis unit levels: trip, session, or user; and they can be proportion, continuous, or ratio metrics. For simplicity, session-level proportion metrics are used as examples.
TABLE II: An Example Staged Rollout

| Feature Flag | Feature X |
| Target Population | Users in the US using App Version 2.0 on Android |
| Stage | 1 | 2 | 3 | 4 | 5 |
| Time | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
| Control Sample Size |  |  |  |  |  |
| Treatment Sample Size |  |  |  |  |  |
| Untreated Sample Size |  |  |  |  |  |
| Metrics | Sessions with Login Success, Sessions with Orders, Sessions with App Crash, etc. |
IV Monitoring Algorithm: Sequential Test
In this section, a variant of the sequential test, the mixture sequential probability ratio test (mSPRT), is introduced as the core monitoring algorithm. In addition, a nonparametric variance estimation method (delete-a-group jackknife) is introduced to correct for correlation in the sequential algorithm. Statistical power and sample size estimation are also derived for the sequential test.
IV-A Mixture Sequential Probability Ratio Test (mSPRT)
The sequential probability ratio test [13] is widely used in clinical research, where scientists often make sample-size-dependent decisions based on the likelihood ratio of two hypotheses. The mSPRT introduced in [14] and [15] applies to A/B testing and enables repeated testing without inflating the false positive rate (FPR). In an A/B testing setting, assume the control observations $X_1, X_2, \ldots$ are independent random variables from a distribution with density function $f(x; \mu_C, \sigma)$, where $\mu_C$ and $\sigma$ represent the mean and standard deviation. Similarly, the density function for the treatment observations $Y_1, Y_2, \ldots$ is $f(y; \mu_T, \sigma)$. The hypothesis to be tested concerns the difference in distribution means:

$H_0: \theta = \theta_0$ (1)

$H_1: \theta \neq \theta_0$ (2)

where $\theta = \mu_T - \mu_C$ represents the difference in means between treatment and control, and $\theta_0$ is the difference value under the null hypothesis ($\theta_0 = 0$ for testing whether the two groups have the same metric mean). The test statistic used in the mSPRT is a likelihood ratio integrated over a prior distribution of $\theta$ values under the alternative hypothesis. Denote the prior density function as $\pi(\theta)$; for simplicity, a normal prior $N(\theta_0, \tau^2)$ is chosen in this paper. It can be proven that the integrated likelihood ratio statistic is a martingale under the null hypothesis. The always-valid confidence interval for $\theta$ is derived by [15] as

$\theta \in \left(\bar{Y}_n - \bar{X}_n\right) \pm \sqrt{\frac{V(V+\tau^2)}{\tau^2}\,\ln\frac{V+\tau^2}{\alpha^2 V}}$ (3)

with $\bar{X}_n$ and $\bar{Y}_n$ the sample means and $V$ the variance of the sample-mean difference (e.g. $V = s_C^2/n + s_T^2/n$, where $s_C^2$ and $s_T^2$ are the sample variance estimates for control and treatment).
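For normal data with a $N(\theta_0, \tau^2)$ prior, the mixture likelihood ratio has a closed form, and the test rejects at level $\alpha$ once the statistic reaches $1/\alpha$. The sketch below follows this standard always-valid form; it is illustrative and not necessarily the authors' exact implementation.

```python
import math

def msprt_statistic(xbar: float, ybar: float, n: int, V: float,
                    tau2: float, theta0: float = 0.0) -> float:
    """Mixture likelihood ratio for theta = mu_T - mu_C under a
    N(theta0, tau2) prior. Here V is the variance of a single paired
    treatment-control difference and n is the number of pairs observed
    so far. H0 is rejected at level alpha once the statistic >= 1/alpha."""
    z = ybar - xbar - theta0
    return math.sqrt(V / (V + n * tau2)) * math.exp(
        (n * n * tau2 * z * z) / (2.0 * V * (V + n * tau2)))
```

Under the null (observed difference near $\theta_0$), the statistic stays below 1, so the test never rejects no matter how often it is checked; a persistent true difference drives it past $1/\alpha$ as data accumulate.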
IV-B Variance Estimation
One assumption made by the sequential test is the independence of observations. However, this assumption often does not hold in practice. For example, if the click-through rate is the metric of interest, it is improper to assume each impression is independent, since the same user can use the product multiple times on different days, and multiple impressions and clicks can be generated by the same user. Such observations are correlated because they are generated by the same user. Violating the independence assumption in the sequential test can produce an inflated false positive rate. Embedding a variance estimator with correlation correction is one way to generalize the mSPRT to correlated data.
Several previous papers discuss the adjustment of metric variance in A/B testing. The delta method [16] and the bootstrap [17] are two variance estimation approaches that can correct the variance without the independence assumption. These two methods work well in theory; however, they require storing raw data (e.g. all mobile event-level data) in order to perform the analysis, so they are hard to scale when data storage is limited. In addition, processing the raw event-level data adds computation latency, which is a concern when monitoring all feature rollouts in real time. Therefore, a fast and scalable variance estimation method is preferred. In our system, a version of the delete-a-group jackknife [18] variance estimator is implemented, which meets these requirements.
To implement the delete-a-group jackknife method, the users (keyed by user ID or device ID, which is the experiment unit) are split into $G$ partitions with equal probability using a hash function within each experiment group. The hash function ([19], [9]) takes the user ID as input and outputs an integer as the partition ID. Note that this hash function should differ (by choosing a different random seed or algorithm) from the one used for experiment randomization, i.e. for splitting users into control and treatment groups, so that the user partition is independent of the experiment randomization.
The delete-a-group jackknife variance estimator can be expressed as

$\hat{V}_J = \frac{G-1}{G} \sum_{g=1}^{G} \left(\bar{X}_{(-g)} - \bar{X}\right)^2$ (4)

with

$\bar{X}_{(-g)} = \frac{\sum_{i} X_i \, 1\{h(u_i) \neq g\}}{\sum_{i} 1\{h(u_i) \neq g\}}$

where $\bar{X}_{(-g)}$ is the metric mean over all users except partition $g$, $1\{\cdot\}$ is an indicator function with value 1 if the argument condition is true and 0 otherwise, $h$ is the hash function, and $u_i$ is the user ID for the $i$-th observation.

The delete-a-group jackknife variance estimation is scalable since only the partition-level metrics need to be stored for the calculation, instead of the raw event-level data. Taking the CTR metric as an example, at each stage the system calculates the total number of impressions and clicks by partition; the cumulative (from the initial stage up to the current stage) leave-one-partition-out CTR ($\bar{X}_{(-g)}$) is then calculated by dividing the sum of clicks by the sum of impressions across stages, excluding partition $g$. The overall cumulative CTR ($\bar{X}$) is calculated by dividing the sum of clicks by the sum of impressions, where both sums are across all stages and partitions.
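A minimal sketch of the partition hashing and the partition-level variance computation for a ratio metric such as CTR follows. The salt and the choice of md5 are illustrative assumptions; the paper only requires that the partition hash differ from the experiment-assignment hash.

```python
import hashlib

def partition_id(user_id: str, G: int) -> int:
    # Salted hash so partitioning is independent of the hash used for
    # treatment/control assignment (the salt "dgj:" is an arbitrary choice).
    return int(hashlib.md5(f"dgj:{user_id}".encode()).hexdigest(), 16) % G

def dgj_variance(clicks, impressions):
    """Delete-a-group jackknife variance of a ratio metric (e.g. CTR) from
    G partition-level totals, per (4): (G-1)/G * sum_g (m_{-g} - m)^2,
    where m_{-g} is the metric computed leaving partition g out."""
    G = len(clicks)
    C, I = sum(clicks), sum(impressions)
    m = C / I                                           # overall cumulative CTR
    total = 0.0
    for g in range(G):
        m_g = (C - clicks[g]) / (I - impressions[g])    # leave partition g out
        total += (m_g - m) ** 2
    return (G - 1) / G * total
```

Only the per-partition click and impression totals are needed, which is what makes the method cheap to run continuously.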
The empirical performance of different variance estimation methods is compared using real data reflecting the correlation structure found in practice. Four methods are compared: the sequential test (assuming observation independence), the sequential test with bootstrap, the sequential test with the delta method, and the sequential test with the delete-a-group jackknife. A sample dataset is prepared for three metrics over a one-week period. The false positive rate in Figure 1 is calculated by simulating an A/A test scenario: the sample data are randomly assigned into control and treatment groups by hashing the user ID. The power in Figure 2 is calculated by simulating an A/B test scenario: using the same experiment assignment logic as the A/A test, but artificially adding a metric increase in the treatment group. The data are fed into the algorithm at an hourly level, and a positive result is defined as the test returning at least one significant result (at level $\alpha$) across the repeated tests. The results show that when the independence assumption does not hold, the original sequential test has an inflated false positive rate. All three improved variance estimation methods successfully control the false positive rate below the nominal level while having a similar level of power. The jackknife method is preferred considering its scalability, and its performance can be improved by increasing the number of partitions $G$. By default, the jackknife method is applied in all examples related to real data in this paper.
IV-C Power and Sample Size Estimation
The statistical power of the sequential test is needed to evaluate when the feature is ready to fully ramp up and conclude the rollout. At the beginning of the rollout, the developer sets a sensitivity level for metric degradation detection: the system should raise an alert if the true metric difference exceeds this level. In this paper, this minimum detectable difference is defined as the MDE (minimum detectable effect). The criterion for the feature to be formally rolled out is then to achieve enough statistical power to detect the MDE. If the power is achieved and no regression is detected, the system concludes with high confidence that the feature is not causing a regression greater than the MDE. In order to make this power-based criterion work, an estimate of power and sample size is needed during the rollout process.
To derive the power and sample size estimation formulas, define the stopping time $T$ as the smallest sample size at which the sequential test becomes significant given the true difference $\delta$. Note that $T$ is a random variable. The power of the sequential test at a finite sample size can be expressed as the probability that the sequential process becomes significant before that sample size is reached:

$\beta = P\left(T \leq N(\delta, \beta)\right)$ (5)

where $\beta$ indicates the power and $N(\delta, \beta)$ denotes the sample size threshold for achieving power $\beta$ given $\delta$.

Following the intuition that $N(\delta, \beta)$ is the $\beta$-th percentile of the distribution of $T$, it can be approximated by a linear combination of the distribution mean and standard deviation:

$N(\delta, \beta) \approx E[T] + c_\beta \cdot SD[T]$ (6)

A previous Monte Carlo study [20] found that, for a sequential test, the variance of the sample size needed to detect a given MDE is approximately proportional to the square of the average sample size: $Var[T] \propto E[T]^2$, i.e. $SD[T] \propto E[T]$.

In addition, the average sample size $E[T]$ can be approximated in closed form (see Appendix).
To estimate the coefficient $c_\beta$, an empirical power curve is fitted in a simulation study, yielding an approximation of the form

$N(\delta, \beta) \approx E[T]\left(1 + c_\beta \, k\right)$ (7)

where $k$ is the proportionality constant between $SD[T]$ and $E[T]$. Given this formula, the power can be estimated by solving for $\beta$ at a given sample size $N$. This power estimation works well empirically for sufficiently large sample sizes. Figure 3 shows the estimated power compared with the actual power.
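The quantity in (5) can also be estimated directly by simulation. The sketch below Monte Carlo estimates $P(T \leq N)$ for a mixture sequential test on normal data; the prior-variance choice and the stopping rule are illustrative assumptions, not the paper's fitted closed form.

```python
import math, random

def estimate_power(delta: float, sigma: float, N: int,
                   alpha: float = 0.05, reps: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of P(T <= N): the probability that the mixture
    sequential test crosses 1/alpha within N pairs when the true mean
    difference is delta. Simulation sketch under normality assumptions."""
    rng = random.Random(seed)
    V = 2.0 * sigma * sigma        # variance of one treatment-control pair difference
    tau2 = V                       # prior variance (illustrative choice)
    hits = 0
    for _ in range(reps):
        s = 0.0
        for n in range(1, N + 1):
            s += rng.gauss(delta, math.sqrt(V))     # simulate one pair difference
            z = s / n                               # running mean difference
            lam = math.sqrt(V / (V + n * tau2)) * math.exp(
                n * n * tau2 * z * z / (2.0 * V * (V + n * tau2)))
            if lam >= 1.0 / alpha:                  # significant: stop this replicate
                hits += 1
                break
    return hits / reps
```

Such a simulation is how the empirical power curve behind (7) could be traced out for a grid of $\delta$ and $N$ values.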
V Ramp-up Algorithms
During the rollout, when the monitoring algorithm raises an alert, the rollout is paused or reverted for investigation. However, if the monitoring algorithm does not raise an alert, the feature is not guaranteed to be flawless; it could be that the sample size is not yet sufficient for the monitoring algorithm to catch a regression at a certain level of sensitivity. Given the uncertainty in the feature's performance, rolling out too fast could expose more users than necessary to the feature. Therefore, a separate ramp-up algorithm is needed to decide how to increase the rollout percentage from one stage to the next. Three approaches for ramping up the rollout are introduced next.
V-A Time-based Ramp-up
The time-based ramp-up schedule requires engineers to decide the stages and corresponding sample size percentages before the rollout begins. In addition, the time between stages is specified. After the rollout starts, if no alert is raised by the monitoring algorithm within the given time window, the rollout ramps up to the next stage. In the example shown in Table II, if the rollout is set to ramp up every day, it will be fully rolled out in five days provided no regression is captured. The advantage of the time-based ramp-up is that it gives the developer more control over the rollout. However, since this schedule does not adapt to the information collected during the rollout, it may suffer from 1) ramping up too quickly when the risk and uncertainty are high, or 2) ramping up too slowly when the data show a high level of certainty.
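Because the schedule is fixed in advance, the time-based stage is a pure function of elapsed time, as the brief sketch below shows (the schedule values are illustrative, not from the paper).

```python
def time_based_pct(schedule, hours_elapsed, hours_per_stage=24):
    """Return the rollout percentage for the stage active at `hours_elapsed`,
    given a pre-declared per-stage percentage schedule."""
    stage = min(int(hours_elapsed // hours_per_stage), len(schedule) - 1)
    return schedule[stage]
```

Nothing about the observed metrics enters this function, which is exactly the limitation noted above.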
V-B Power-based Ramp-up
The power-based ramp-up schedule aims to achieve sufficient statistical power for detecting the predefined MDE within a given time frame. For example, if the feature developer wants enough power to detect a potential metric difference within one week, the power-based ramp-up schedule adjusts the sample size adaptively to achieve that rollout goal. At each stage, if the observed difference $\hat{\delta}$ is less than the predefined MDE, the sample size required for the MDE is calculated. Otherwise, if the observed difference is bigger than the MDE, the sample size required for detecting the current difference is calculated. This way, the system makes sure the ramp-up does not put more users than necessary at risk if the observed signal is negative. In addition, a maximum ramp-up proportion threshold can be set for each stage, so the ramp-up speed is not too aggressive. In sum, the sample size recommended by the power-based algorithm is

$N_{rec} = \begin{cases} \min(N_{max}, N_{MDE}) & \text{if } |\hat{\delta}| \leq \text{MDE} \\ \min(N_{max}, N_{\hat{\delta}}) & \text{otherwise} \end{cases}$

where $N_{max}$ is the maximum sample size allowed at the given stage, $N_{MDE}$ is the sample size required for the MDE, and $N_{\hat{\delta}}$ is the sample size corresponding to the observed difference $\hat{\delta}$. The recommended sample size can then be transformed into a sample size percentage by taking into account the predicted total sample size for the next stage.
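The rule above can be sketched as a small helper, where the sample sizes required for the MDE and for the observed difference are assumed to be computed separately (for example via the power formulas of Section IV-C); all names are hypothetical.

```python
def power_based_next_n(n_for_mde: int, n_for_obs: int, obs_diff: float,
                       mde: float, n_max: int) -> int:
    """Recommended treatment sample size for the next stage: target the
    sample size needed for the MDE, unless the observed difference already
    exceeds the MDE, in which case the (smaller) sample size needed to
    confirm that difference suffices; always cap at the per-stage maximum."""
    target = n_for_obs if abs(obs_diff) > mde else n_for_mde
    return min(target, n_max)
```

When a large negative signal is observed, the target shrinks to just what is needed to confirm it, which keeps the exposed population small.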
V-C Risk-based Ramp-up
The risk-based ramp-up algorithm proceeds with the rollout at the maximum speed allowed under a tolerable risk level. For example, suppose the risk criterion is set as "avoid losing trips from more than 1000 users". Then at any given stage, the system can calculate the maximum number of extra users that could be exposed given the current experiment signals.
To state this idea formally, the potential risk of an uncertain metric degradation is quantified and controlled via a probability threshold. Without loss of generality, assume the metric of interest is a proportion metric, e.g. the fraction of users with an order (the continuous-metric scenario can be derived in a similar way), with $p_C$ and $p_T$ denoting the metric means for control and treatment. The negative impact of the feature can be quantified as the number of treatment users not making an order because of the feature: $n(p_C - p_T)$, where $n$ is the treatment sample size. The risk tolerance during the rollout can be defined as:

$P\left(n(p_C - p_T) > C \mid D\right) < p$ (10)

where

$C$ is a positive constant, the pre-specified tolerable cost threshold. One choice is to set $C$ large enough that the cost budget suffices to yield sufficient statistical power.

$p$ is the pre-specified risk probability threshold; $p$ is a parameter that trades off rollout safety and speed.

$D$ indicates the data, meaning the probability is computed under a posterior distribution given the data.
Note that the formula above is an example of a one-sided test, assuming only a potential decrease in the metric is a risk to be controlled. This framework can easily be generalized to a one-sided increase risk or a two-sided risk specification. Without loss of generality, the one-sided decrease risk is used in the following derivation for simplicity.
In order to solve inequality (10), the posterior distribution of $\delta = p_T - p_C$ can be derived as follows. For simplicity, assume control and treatment have equal sample size and equal variance. By the central limit theorem, the sample means approximately follow normal distributions $\bar{X} \sim N(\mu_C, \sigma^2/n)$ and $\bar{Y} \sim N(\mu_T, \sigma^2/n)$, where $\mu_C$ and $\mu_T$ are the distribution means, $\sigma^2$ is the common variance, and $n$ is the sample size. In addition, assume the prior distributions for the mean parameters are $\mu_C \sim N(\mu_0, \sigma_0^2)$ and $\delta \sim N(\delta_0, \sigma_0^2)$, where $\mu_0$, $\sigma_0^2$, and $\delta_0$ are prior distribution parameters indicating the prior control mean, prior common variance, and prior difference. Under these assumptions, the posterior distribution of $\delta$ can be derived as a normal distribution (see [21]):

$\delta \mid D \sim N\left(\mu_{post}, \sigma_{post}^2\right)$ (11)

where

$\mu_{post} = \sigma_{post}^2\left(\frac{\delta_0}{\sigma_0^2} + \frac{\bar{Y}-\bar{X}}{2\sigma^2/n}\right), \qquad \sigma_{post}^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{2\sigma^2}\right)^{-1}$

Note that the only unknown parameter in this posterior distribution is the metric variance $\sigma^2$. In practice, this parameter can be estimated and substituted by the pooled sample variance.
At stage $t$, the observed cumulative sample size for treatment is $n_t$. The risk-based ramp-up algorithm aims to decide an appropriate sample size $n_{t+1}$ for the next stage that controls the potential risk. The risk inequality (10) can be written as

$P\left(n_{t+1}(p_C - p_T) > C \mid D\right) < p$ (12)

$\Leftrightarrow P\left(-\delta > C/n_{t+1} \mid D\right) < p$ (13)

$\Leftrightarrow P\left(\delta < -C/n_{t+1} \mid D\right) < p$ (14)

$\Leftrightarrow \Phi\left(\frac{-C/n_{t+1} - \mu_{post}}{\sigma_{post}}\right) < p$ (15)

The left side of the inequality is the cumulative distribution function of the posterior distribution of $\delta$. The maximum tolerable sample size can then be derived as:

$-C/n_{t+1} - \mu_{post} < z_p \, \sigma_{post}$ (16)

$\Leftrightarrow n_{t+1} < \frac{C}{-\mu_{post} - z_p \sigma_{post}} \quad \text{when } \mu_{post} + z_p \sigma_{post} < 0$ (17)

The corresponding treatment rollout percentage for the next stage can be calculated by dividing the treatment rollout sample size by the predicted total population size for the next stage. Maximum and minimum conditions are added to ensure the rollout percentage is monotonically increasing (unless the rollout is reverted) and does not exceed the maximum rollout percentage achievable by the ramp-up algorithm. The final decision to ramp from that maximum to full rollout (in one stage) is determined by the power calculation, which is beyond the control of the ramp-up algorithm. The predicted total population size for the next stage can be estimated by a generic time series model; discussion of this prediction model is beyond the scope of this paper.
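Under a normal posterior for $\delta$, the cap on the next-stage sample size can be computed directly. The sketch below is a reconstruction under the stated assumptions (one-sided decrease risk, normal posterior), not the authors' exact implementation.

```python
from statistics import NormalDist

def risk_based_max_n(mu_post: float, sd_post: float,
                     cost_c: float, p_risk: float) -> float:
    """Largest next-stage treatment size n such that
    P(n * (p_C - p_T) > C | data) < p, where delta = p_T - p_C has posterior
    N(mu_post, sd_post^2). Sketch of the rule in (10)-(17)."""
    # P(-n*delta > C | data) < p  <=>  -C/n < mu_post + z_p * sd_post
    q = mu_post + NormalDist().inv_cdf(p_risk) * sd_post  # p-quantile of posterior
    if q >= 0:
        return float("inf")   # even the pessimistic quantile loses nothing: no cap
    return cost_c / (-q)
```

A strongly negative posterior mean shrinks the cap toward zero (effectively pausing the ramp-up), while a clearly positive posterior leaves the ramp-up unconstrained up to the maximum percentage.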
VI Examples
TABLE III

| Metrics | Time-based A/A (Low Speed) | Time-based A/B (Low Speed) | Time-based A/A (High Speed) | Time-based A/B (High Speed) | Power-based A/A | Power-based A/B | Risk-based A/A | Risk-based A/B |
| Positive Rate (FPR for A/A, Power for A/B) | 0.04 | 0.985 | 0.015 | 0.985 | 0.015 | 0.975 | 0.01 | 0.98 |
| Average Time before Detection (h) | NA | 63 | NA | 55 | NA | 57 | NA | 52 |
| Average Time before Full Rollout (h) | 61 | NA | 57 | NA | 52 | NA | 47 | NA |
| Weighted Average Rollout Pct before Detection | NA |  | NA |  | NA |  | NA |  |
| Weighted Average Rollout Pct before Full Rollout |  | NA |  | NA |  | NA |  | NA |
| Average Sample Size Used before Detection | NA | 12085 | NA | 11803 | NA | 11528 | NA | 11857 |
| Average Sample Size Used before Full Rollout | 9591 | NA | 10186 | NA | 9876 | NA | 9300 | NA |
| Avg Total Loss | 21 | 136 | 18 | 137 | 20 | 132 | 18 | 136 |
| % of Tests Exceeding the Loss Tolerance |  |  |  |  |  |  |  |  |
TABLE IV

| Metrics | Time-based A/A (Low Speed) | Time-based A/B (Low Speed) | Time-based A/A (High Speed) | Time-based A/B (High Speed) | Power-based A/A | Power-based A/B | Risk-based A/A | Risk-based A/B |
| Positive Rate (FPR for A/A, Power for A/B) | 0.01 | 0.97 | 0.01 | 0.99 | 0.015 | 0.99 | 0.01 | 0.99 |
| Average Time before Detection (h) | NA | 58 | NA | 52 | NA | 54 | NA | 44 |
| Average Time before Full Rollout (h) | 67 | NA | 63 | NA | 60 | NA | 52 | NA |
| Weighted Average Rollout Pct before Detection | NA |  | NA |  | NA |  | NA |  |
| Weighted Average Rollout Pct before Full Rollout |  | NA |  | NA |  | NA |  | NA |
| Average Sample Size Used before Detection | NA | 11126 | NA | 10981 | NA | 9282 | NA | 9438 |
| Average Sample Size Used before Full Rollout | 13281 | NA | 13104 | NA | 13420 | NA | 12420 | NA |
| Avg Total Loss | 30 | 135 | 27 | 141 | 29 | 124 | 20 | 127 |
| % of Tests Exceeding the Loss Tolerance |  |  |  |  |  |  |  |  |
This section evaluates the empirical performance of the staged rollout framework with the three ramp-up algorithms on both real-data and synthetic-data examples.

The real data come from a feature rollout experiment that did not cause a regression. One of the key session-level metrics monitored is a binary metric. However, the session-level metric cannot be regarded as an i.i.d. sample from a Bernoulli distribution, since the same user can have multiple sessions. The data are collected hourly over several days. The synthetic data, on the other hand, are generated as an i.i.d. sample from a Bernoulli distribution. During the rollout, the sequential test runs every hour, and the rollout is reverted when a significant metric degradation is detected. The criterion for rolling out to all users is to achieve the target MDE at the target power. If no regression is detected, the system ramps up the rollout to the next stage on the next day. Although the ramp-up frequency is fixed (daily), the sample size for the next stage is determined by the three ramp-up algorithms. Under this setting, the three ramp-up algorithms are evaluated through both A/A and A/B tests, each with repeated replications. Each A/A trial is generated by randomly splitting the data into treatment and control groups by hashing the user ID. The success of an A/A test can be defined as the feature getting fully rolled out quickly. Each A/B trial is first generated by the same random split as the A/A test, but an artificial metric difference is then introduced in the treatment group: for each metric value in the treatment group, a relative difference drawn from a gamma distribution is applied to create the artificial difference. The success of an A/B test can be defined as detecting the difference with a relatively small sample size. The time-based ramp-up algorithm is tested at two speed levels with different fixed rollout percentage schedules. The power-based ramp-up algorithm sets a target MDE. For the risk-based ramp-up algorithm, the prior parameters, prior variance, and risk probability threshold are pre-specified; a nonzero prior mean is set in order to be conservative about the feature's performance at the beginning. In this example, the risk is controlled in the increasing direction.
Table III and Table IV present the results of the different rollout approaches. Overall, all three algorithms achieve a false positive rate below for A/A tests, and power above for A/B tests. The Time-based algorithm, as a simple benchmark approach, tends to take a longer time and more samples to reach a correct final decision. The Power-based ramp-up algorithm gives relatively stable performance. Since this algorithm is built on the sequential test used for monitoring, it often uses the smallest sample size to detect the regression. This is reasonable, since the Power-based algorithm estimates the sample size needed to detect the regression, which prevents overexposing users. The Risk-based algorithm on average achieves the earliest final detection in both A/A and A/B tests. It also controls the risk to be around the desired level ( in the simulated data example and in the real data example). In contrast, neither the Time-based nor the Power-based ramp-up algorithm exhibits consistent risk control.
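For context on how a power-based sample-size recommendation can be computed, here is the standard fixed-horizon two-sample approximation (a textbook formula with assumed default error rates, not the paper's variance-corrected sequential calculation):

```python
import math
from statistics import NormalDist

def power_based_sample_size(sigma: float, mde: float,
                            alpha: float = 0.05, beta: float = 0.2) -> int:
    """Per-group sample size for a two-sample z-test to detect an absolute
    difference `mde` with two-sided type I error `alpha` and power 1 - beta.
    Fixed-horizon approximation: n = 2 (z_a + z_b)^2 sigma^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)
```

Halving the MDE roughly quadruples the required sample size, which is why the Power-based ramp-up stops enlarging stages once the target power is reachable.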
Note that there are parameters that can be tuned for each of the three algorithms; the Risk-based algorithm in particular has multiple parameters. The results shown are based on one specific parameter setting, and the performance ordering can change with different settings and different preferences over safety and speed. In practice, such parameters can be determined or tuned through simulation before the actual rollout begins.
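One way such pre-rollout tuning could look is sketched below: candidate thresholds are scored by their false-positive rate on replicated A/A trials. The `detect` callable and all names here are hypothetical stand-ins for the monitoring test, not the paper's implementation:

```python
def tune_threshold(candidates, aa_trials, detect, target_fpr=0.05):
    """Return the largest detection threshold whose false-positive rate,
    measured on replicated A/A trials, stays at or below the target.
    `detect(trial, threshold)` is a stand-in for the monitoring test."""
    chosen = None
    for threshold in sorted(candidates):
        flags = [detect(trial, threshold) for trial in aa_trials]
        fpr = sum(flags) / len(flags)
        if fpr <= target_fpr:
            chosen = threshold  # keep the largest passing candidate
    return chosen
```

The same loop can score A/B replications for power, so safety (A/A false positives) and speed (A/B detection) are traded off before any real users are exposed.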
VII Conclusion
In this paper, a staged rollout framework is presented to automatically release a feature with special attention to optimizing safety and speed. This framework can be used broadly for code changes to a product, both with and without an A/B experiment. In our experiments, the variance-corrected sequential test with sample size and power calculation accommodates the practical needs of a continuous monitoring solution. The three ramp-up algorithms represent different trade-offs between safety and speed.
The empirical evaluation of the system shows that the sequential monitoring system can control the false positive rate effectively while delivering reasonable power. All three ramp-up algorithms, when configured properly, can make reasonable trade-offs between safety and speed with emphasis on different aspects. This framework has been tested and proven useful in practice for regression detection and ramp-up management, assisting developers with rolling out both safely and quickly.
For implementing the ramp-up algorithms in production, a practical recommendation is to start with the Time-based algorithm, and then move to the Power-based or Risk-based algorithm. The Time-based ramp-up algorithm is easy to implement, and it gives more control and transparency to the feature development team; compared with a direct rollout, it already provides sufficient safety for most business scenarios. If lack of statistical power is a concern, the Power-based algorithm can be used; if the risk during the rollout period needs to be further controlled, the Risk-based algorithm can be utilized.
While the examples in this paper focus on a single metric for illustration, the staged rollout framework can adapt to the case with multiple metrics. The monitoring algorithm can monitor multiple metrics in parallel, and the significance level can be adjusted to control family-wise false positives. While a large set of metrics can be placed under the monitoring system, a smaller set of key metrics can be selected for use in the ramp-up algorithm. The Time-based ramp-up extends to the multi-metric case naturally. The Power-based ramp-up can take the maximum of the recommended sample sizes across metrics to achieve the power, while the Risk-based ramp-up can take the minimum of the recommended sample sizes across metrics to control risk.
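The max/min aggregation rules described here can be sketched directly (the metric names are hypothetical):

```python
def power_based_next_n(recommended: dict) -> int:
    """Power-based multi-metric ramp-up: the next stage must be large
    enough for every key metric to reach its target power."""
    return max(recommended.values())

def risk_based_next_n(recommended: dict) -> int:
    """Risk-based multi-metric ramp-up: the next stage must be small
    enough that no key metric exceeds its risk bound."""
    return min(recommended.values())
```

The asymmetry follows from what each per-metric number means: a power recommendation is a floor (at least this many samples), while a risk recommendation is a ceiling (at most this many users exposed).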
Another practical consideration in the rollout system is monitoring the data quality of the rollout. Poor data quality (caused by the logging system, outliers, etc.) can lead to inflation of false positives or false negatives
[9]. Data quality checks [22, 23] and outlier removal [24] are useful in practice for the rollout system. When a red flag is raised on data quality, the rollout platform team and the feature development team need to engage and investigate the root cause. A detailed discussion of data quality monitoring and diagnosis is out of the scope of this paper.
The Power-based ramp-up algorithm is built on frequentist inference, and the Risk-based ramp-up algorithm is based on the Bayesian framework. One direction of future development is using reinforcement learning to guide the rollout process as a Markov Decision Process. Instead of focusing on statistical power and potential risk, the reinforcement learning algorithm would trade off the value of rolling out the new feature against the cost associated with the rollout, eventually optimizing the final reward given the value and cost specification of the feature. This algorithm is currently in development.
VIII Appendix
VIII-A Prior Distribution for mSPRT
In Section IV, the mSPRT test statistic is presented with a prior distribution . Here we describe a practical choice for the parameter . Although this choice does not affect the martingale property or the validity of the statistical test in theory, it affects the empirical convergence speed (i.e. the sample size needed to achieve significance) in practice. Pollak and Siegmund [25] derived the average sample size needed to detect the real difference under sequential analysis. In our context, the following formula provides a reasonable approximation to the average sample size:
(18) 
Taking derivatives yields the choice of that achieves the smallest sample size. However, as the real difference is unknown to experimenters, we want to find a substitute that provides comparable type I and type II error. For comparison, in a fixed-horizon t test the sample size is . We want to use .
Below is a simulation illustrating the effect of this choice on the type I and type II error. Sample data is generated from a binomial distribution with p varying from 0.1 to 0.9 and relative difference equal to 0 or 0.05. Along the sequence, 200 checks are conducted at equal intervals, with each checkpoint using all previous observations. The treatment and control groups have equal sample sizes at each check. A positive result is defined as the test reporting at least one significance (at level) out of the 200 tests, which gives the false positive rate in Figure 5 and the power in Figure 6.
From the above simulation, we see that using (i.e. the square of the observed difference) and using have similar performance in both empirical power and false positive rate. This is because in a fixed-horizon test, the average sample sizes under Type I error and Type II error satisfy the relationship . In Figure 6, the empirical power under three parameter choices is plotted against the empirical power under . When the empirical power under is small, the other three choices all yield relatively larger power, because of an overestimate of due to the small sample size. This discrepancy dies down as the empirical power of increases. To balance the type I and the type II error, we chose , yielding in practice.
Acknowledgment
We would like to extend our thanks to Akash Parikh, Donald Stayner, and Anando Sen for insightful discussion on defining the problem and forming the solution, to Sisil Mehta and Tim Knapik for thoughtful design and implementation of the algorithms in production. Special thanks to Professor Peter Dayan on discussion and formulation of Reinforcement Learning algorithm as future development. We would also like to thank Olivia Liao for her early research and empirical work of applying sequential test on the experimentation platform.
References
 [1] G. Keppel, W. H. Saufley, and H. Tokunaga, Introduction to design and analysis: A student’s handbook. Cambridge Univ Press, 1992.
 [2] R. Kohavi, R. M. Henne, and D. Sommerfield, “Practical guide to controlled experiments on the web: listen to your customers not to the hippo,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 959–967.
 [3] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, “Controlled experiments on the web: survey and practical guide,” Data mining and knowledge discovery, vol. 18, no. 1, pp. 140–181, 2009.
 [4] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann, “Online controlled experiments at large scale,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’13. New York, NY, USA: ACM, 2013, pp. 1168–1176. [Online]. Available: http://doi.acm.org/10.1145/2487575.2488217
 [5] R. Kohavi, A. Deng, R. Longbotham, and Y. Xu, “Seven rules of thumb for web site experimenters,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 1857–1866.
 [6] D. Tang, “Experimentation at google,” RecSys’14 Workshop: Controlled Experimentation, 2014.
 [7] B. Frasca, “A brief history of bing a/b,” RecSys’14 Workshop: Controlled Experimentation, 2014.
 [8] C. Smallwood, “The quest for the optimal experiment,” RecSys’14 Workshop: Controlled Experimentation, 2014.
 [9] Z. Zhao, M. Chen, D. Matheson, and M. Stone, “Online experimentation diagnosis and troubleshooting beyond a/a validation,” in Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on. IEEE, 2016, pp. 498–507.
 [10] Z. Zhao, Y. He, and M. Chen, “Inform product change through experimentation with data-driven behavioral segmentation,” in 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Oct 2017, pp. 69–78.
 [11] D. Tang, A. Agarwal, D. O’Brien, and M. Meyer, “Overlapping experiment infrastructure: More, better, faster experimentation,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010, pp. 17–26.
 [12] T. Crook, B. Frasca, R. Kohavi, and R. Longbotham, “Seven pitfalls to avoid when running controlled experiments on the web,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 1105–1114.
 [13] B. K. Ghosh and P. K. Sen, Handbook of sequential analysis. CRC Press, 1991.
 [14] R. Johari, P. Koomen, L. Pekelis, and D. Walsh, “Peeking at a/b tests: Why it matters, and what to do about it,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 1517–1525.
 [15] L. Pekelis, D. Walsh, and R. Johari, “The new stats engine,” Internet. Retrieved December, vol. 6, p. 2015, 2015.
 [16] A. Deng and X. Shi, “Data-driven metric development for online controlled experiments: Seven lessons learned,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 77–86.
 [17] E. Bakshy and D. Eckles, “Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013, pp. 1303–1311.
 [18] P. S. Kott, “The delete-a-group jackknife,” Journal of Official Statistics, vol. 17, no. 4, p. 521, 2001.
 [19] R. Kohavi, R. M. Henne, and D. Sommerfield, “Practical guide to controlled experiments on the web: listen to your customers not to the hippo,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 959–967.
 [20] C. P. Cox and T. D. Roseberry, “A note on the variance of the distribution of sample number in sequential probability ratio tests,” Technometrics, vol. 8, no. 4, pp. 700–704, 1966.

 [21] W. M. Bolstad and J. M. Curran, Introduction to Bayesian statistics. John Wiley & Sons, 2016.
 [22] N. Appiktala, M. Chen, M. Natkovich, and J. Walters, “Demystifying dark matter for online experimentation,” in Big Data (Big Data), 2017 IEEE International Conference on. IEEE, 2017, pp. 1620–1626.
 [23] R. Chen, M. Chen, M. R. Jadav, J. Bae, and D. Matheson, “Faster online experimentation by eliminating traditional a/a validation,” in Big Data (Big Data), 2017 IEEE International Conference on. IEEE, 2017, pp. 1635–1641.

 [24] Y. He and M. Chen, “A probabilistic, mechanism-independent outlier detection method for online experimentation,” in Data Science and Advanced Analytics (DSAA), 2017 IEEE International Conference on. IEEE, 2017, pp. 640–647.
 [25] M. Pollak and D. Siegmund, “Approximations to the expected sample size of certain sequential tests,” The Annals of Statistics, pp. 1267–1282, 1975.