Online Learning for Non-Stationary A/B Tests

The rollout of new versions of a feature in modern applications is a manual multi-stage process, as the feature is released to ever larger groups of users, while its performance is carefully monitored. This kind of A/B testing is ubiquitous, but suboptimal, as the monitoring requires heavy human intervention, is not guaranteed to capture consistent, but short-term fluctuations in performance, and is inefficient, as better versions take a long time to reach the full population. In this work we formulate this question as that of expert learning, and give a new algorithm Follow-The-Best-Interval, FTBI, that works in dynamic, non-stationary environments. Our approach is practical, simple, and efficient, and has rigorous guarantees on its performance. Finally, we perform a thorough evaluation on synthetic and real world datasets and show that our approach outperforms current state-of-the-art methods.

1 Introduction

Whether it is a minor tweak, or a major new update, releasing a new version of a running system is a stressful time. While the release has typically gone through rounds of offline testing, real world testing often uncovers additional corner cases that may manifest themselves as bugs, inefficiencies, or overall poor performance. This is especially the case in machine learning applications, where models are typically trained to maximize a proxy objective, and a model that performs better on offline metrics is not guaranteed to work well in practice.

The usual approach in such scenarios is to evaluate the new system through a series of closely monitored A/B tests. The new version is usually released to a small number of customers, and, if no concerns are found and metrics look good, the portion of traffic served by the new system is slowly increased.

While A/B tests provide a sense of safety in that a detrimental change will be quickly observed and corrected (or rolled back), they are not a silver bullet. First, A/B tests are labor intensive—they are typically monitored manually, with an engineer, or a technician, checking the results of the test on a regular basis (for example, daily or weekly). Second, the evaluation is usually dependent on average metrics—e.g. increasing clickthrough rate, decreasing latency, etc. However, fluctuations in that metric can be easily missed—for instance a system that performs well during the day, but lags at night, may appear better on a day-to-day view, and thus its sub-optimality at night is missed. Finally, due to their staged rollouts, A/B tests are inefficient—the better version of the system remains active only on a small subset of traffic until a human intervenes, even if it is universally better right off the bat.

We propose a method for automated version selection that addresses all of the inefficiencies described above. At a high level, we cede the decision of whether to use the old version or a new version to an automated agent. The agent repeatedly evaluates the performance of both versions and selects the better one automatically. As we describe below, this, by itself, is not a new approach, and is captured in the context of expert or bandit learning in the machine learning literature. However, the vast majority of previous work optimizes for the average case, where the distribution of real world examples is fixed (but unknown). This modeling choice cannot capture short-term but consistent fluctuations in the efficacy of one system over another. The main contribution of this work is a simple, practical, and efficient algorithm that selects the best version even under non-stationary assumptions, and comes with strong theoretical guarantees on its performance.

The paper is organized as follows: first we describe the problem of learning with expert advice and discuss previous work related to the contents of this paper. In Section 2 we formalize the expert learning scenario we will be analyzing. Our proposed algorithm and its performance analysis are presented in Sections 3 and 4 respectively. Finally, in Section 5 we conduct extensive empirical evaluations demonstrating the advantages of our algorithm over the current state of the art.

1.1 Expert Learning

Learning with expert advice is a classical setting in online learning. Over a course of time steps, a player is repeatedly asked to decide on a best action in a particular setting. For example, in parallel computing, the agent may need to decide how many workers to allocate for a particular computation; in advertising auctions, the auctioneer needs to decide on a minimum (reserve) price to charge; in ranking systems, the algorithm needs to decide which answer to rank at the top, and so on.

The player does not make these decisions in a vacuum; instead, he has access to a number of experts. Each of the experts may provide her own answer to the query; in our setting each expert represents a different version of the system. At a high level, the goal of the player is to perform (almost) as well as the best expert overall. The difference between the performance of the player and the best expert is known as regret, thus the player’s goal is equivalent to minimizing his regret. Note that if all experts perform poorly, the player is not penalized for not doing well, whereas if one expert does well, the player is best served by homing in on this ‘true’ expert and following her recommendations.

To prove bounds on the worst case regret achieved by the player following a particular strategy, we must make assumptions on the distribution from which queries are drawn. In the most studied versions of the problem, the environment is either stochastic (queries are drawn from a fixed, but unknown, distribution), or adversarial (where no assumptions are made). While it is not realistic to assume a fully stochastic setting, neither is an adversarial setting a good model for distributions observed in practice. Rather, the environment is better represented as a series of stochastic intervals, that may have sharp and unpredictable transitions between them. For a concrete example, consider the problem of predicting time to travel from point A to point B. If we have two experts, one performing well when the roads are congested, and another when the roads are relatively free, then a day may have multiple periods that are locally stochastic, but are drastically different from each other. Moreover, while some congested times are predictable, such as the morning and evening rush hour, those caused by accidents, special events, and holidays are not, and represent the scenario we target.

To model this setting, several notions such as shifting regret [1] and adaptive regret [2] have been introduced, and these works try to ensure that the performance of the algorithm over sub-intervals of time is comparable to the best expert for that time interval. Several algorithms were developed for this setting, and we provide a more comprehensive review of these algorithms in Section 1.2. Among these algorithms, AdaNormalHedge developed by Luo and Schapire [3] represents the current state of the art, since it is relatively fast, parameter-free, and gives provable regret bounds.

While AdaNormalHedge seems like the answer we seek, it has never been empirically validated and has some disadvantages which render it less effective in practice. The straightforward implementation has update time linear in the time elapsed since the start, and, while compression tricks borrowed from the streaming literature reduce the bound to polylogarithmic in theory, these updates are nevertheless slow and costly in practice. However, the main drawback of this algorithm is that it is not intuitive, and thus is hard to debug and interpret.

In this work, we propose a different approach. Our algorithm, Follow-The-Best-Interval (FTBI), builds on the standard Follow-The-Leader (FTL) algorithm for purely stochastic environments, and is easy to explain. We prove (see Theorem 1) that in the case of two experts, the algorithm has regret polylogarithmic in $T$, and quadratic in the number of changes in the environment. We also show that the update time complexity of our algorithm is $O(\log T)$, which is provably better than that of AdaNormalHedge even with the streaming technique. The regret bound of our algorithm is slightly weaker than that given by AdaNormalHedge; however, as we show through extensive simulation and real-data experimental results, in practice FTBI performs significantly better.

1.2 Related work

Our setup best matches that of learning with expert advice. This area of machine learning has been well studied since the pioneering works of [4, 5, 6, 7]. There are two common assumptions on the nature of the rewards observed by the agent: stochastic and adversarial rewards. In the stochastic setting, Follow-The-Leader (FTL) is an easy and intuitive algorithm that can achieve constant regret, while for the adversarial setting, the Weighted Majority algorithm [5, 6] and the Follow the Perturbed Leader algorithm [8, 9] are the most commonly used algorithms, with regret in $O(\sqrt{T})$.

The aforementioned algorithms measure regret against a single fixed best expert. This is a relatively simple and well-understood scenario. By contrast, non-stationary environments generally pose greater challenges in the design of regret minimizing algorithms. In this setting, the full learning period is usually partitioned into a few segments, and the algorithm aims to compete against the sequence of best experts over each segment. The regret in this setting is also known as shifting regret (see Section 2 for a precise definition). Similar to the non-adaptive scenario, we model the input as coming from stochastic or adversarial settings. In the stochastic setting, the rewards of each expert remain i.i.d. in each segment, but the distribution may change across the segments. For the adversarial setting, no assumptions are made as to how the rewards are generated. The first analysis of online learning with shifting environments was given by Herbster and Warmuth [1]. The authors proposed the Fixed-Share algorithm (and a few variants) to achieve low shifting regret in the adversarial setting. The idea of Fixed-Share, and other derivations of this algorithm [10, 11, 12, 13, 14], is to combine the Weighted Majority algorithm with a mixing strategy. In general these algorithms are guaranteed to achieve regret in $O(\sqrt{T})$ with an additional penalty for shifting.

A different framework for solving this problem is the so called sleeping expert technique, originally introduced by Freund et al. [15]. Although the algorithm achieves good shifting regret bounds, in its original form, this algorithm has prohibitively high computational costs. To address this issue, Hazan and Seshadhri [2] propose the Follow the Leading History (FLH) algorithm with a data streaming technique to reduce the time complexity of sleeping expert algorithms. The authors provide an algorithm with logarithmic shifting regret. However, this bound holds only for rewards distributed according to an exp-concave distribution. Since we make no assumption on the distribution generating our rewards, we cannot extend the results of FLH to our setup.

One final line of work dealing with shifting regret under adversarial rewards is the so called strongly adaptive online learning (SAOL) [16, 17, 18]. The SAOL framework aims to achieve low regret over any interval, and thus is strictly harder than achieving low shifting regret. Despite this difficulty, there are algorithms that achieve regret in $O(\sqrt{T})$, matching the traditional shifting regret scenario. Moreover, without any assumptions on the mechanism generating the rewards, this regret is in fact tight. In practice, however, rewards are hardly ever fully adversarial. While we certainly don’t expect rewards to be i.i.d. over the full time period, it is generally reasonable to assume we can model rewards as a stochastic process with shifts. For instance, rewards can be i.i.d. throughout a day but change on weekends or holidays. When this is the case, one can obtain exponential improvements and achieve regret of $O(\log T)$. The design of an algorithm with this regret bound in the stochastic setting with shifts was proposed as an open problem by Warmuth et al. [19]. A few solutions have been given since then. Sani et al. [20] propose an algorithm that can achieve low shifting regret in both stochastic and adversarial settings; however, their algorithm is not parameter-free and it requires proper tuning and knowledge of the number of shifts. Luo and Schapire [3] introduced the AdaNormalHedge (ANH) algorithm, a parameter-free approach that achieves shifting regret of , where $N$ is the number of shifts and $\Delta$ captures the gap between the best and second best choices.

While this bound is strictly better than the regret bound we provide for FTBI, we will extensively demonstrate that in practice FTBI consistently outperforms ANH both in speed and accuracy. Moreover, FTBI is a much more intuitive and explainable algorithm since it is a simple generalization of the classic Follow-The-Leader (FTL) approach. By contrast, the state of ANH depends on a streaming technique [2, 3] which offers no transparency in the decisions made by the algorithm.

2 Setup

Let $T$ denote a time horizon and $[T] = \{1, \ldots, T\}$. We consider the problem of designing an automated agent that, at every time $t \in [T]$, chooses between two versions of a system, $x_t \in \{1, 2\}$. (The setting easily extends to $K$ versions, and we will explore the performance of our algorithm for $K > 2$ in Section 5.) After the choice is made, a reward vector $r_t = (r_t(1), r_t(2))$ is revealed and the agent obtains a reward $r_t(x_t)$. We consider a full information, as opposed to a bandit, setup since in practice we can run simultaneous experiments that can provide us with the reward information for each version.

The goal of the monitoring system is to maximize the expected cumulative reward $\mathbb{E}\left[\sum_{t=1}^{T} r_t(x_t)\right]$. For instance, versions $1$ and $2$ could be two versions of a network allocation algorithm and $r_t$ corresponds to the average number of queries per second the network can handle. We will abuse our notation and denote by $r_t(k)$ the reward of using version $k$.

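We now describe the reward generation mechanism. Let $0 = \tau_0 < \tau_1 < \cdots < \tau_N = T$ be a partition of $[T]$, where $N$ denotes the total number of shifts. For every $i \in [N]$, let $\mathcal{T}_i = [\tau_{i-1} + 1, \tau_i]$ be the $i$-th segment of the partition. We assume that for $t \in \mathcal{T}_i$, reward vectors $r_t$ are drawn i.i.d. according to an unknown distribution $\mathcal{D}_i$. We make no assumptions on $\mathcal{D}_i$; in particular, $\mathcal{D}_i$ can differ from $\mathcal{D}_j$ for $i \neq j$.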

Notice that if we consider each version to be an expert, we can cast our problem as that of learning with expert advice under stochastic rewards and changing environments. Therefore, borrowing from the expert learning literature, we measure the performance of our agent using the shifting pseudo-regret.

Definition 1

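For every $i \in [N]$ and $k \in \{1, 2\}$, let $\mu_i(k)$ denote the expected reward of version $k$ with respect to the distribution $\mathcal{D}_i$. Let $k_i^* = \operatorname{argmax}_k \mu_i(k)$ and $\mu_i^* = \mu_i(k_i^*)$ denote the optimal version over segment $\mathcal{T}_i$ and its reward, respectively. The shifting pseudo-regret is defined by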

$$R_T := \sum_{i=1}^{N} \sum_{t=\tau_{i-1}+1}^{\tau_i} \Big( \mu_i^* - \mathbb{E}[r_t(x_t)] \Big),$$

where expectation is taken over both the reward vector and the randomization of the monitoring algorithm.

A successful algorithm for this scenario is one for which $R_T$ is in $o(T)$. That is, the agent learns to choose the optimal version in a sublinear number of rounds.

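Let $\Delta_i = \min_{k \neq k_i^*} \big(\mu_i^* - \mu_i(k)\big)$ denote the expected reward gap between the best and next best version in segment $\mathcal{T}_i$, and let $\Delta = \min_{i \in [N]} \Delta_i$. Throughout the paper we will assume that $\Delta > 0$. Notice that this is without loss of generality, as the case $\Delta_i = 0$ is uninteresting since playing either version would have yielded the same expected reward.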
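To make Definition 1 concrete, the following minimal Python sketch evaluates the regret of a fixed sequence of choices when the segment means are known, together with the gap $\Delta$. The function names, the 0-indexed experts, and the example means are illustrative assumptions, not part of the paper. Because the choices are passed in as a fixed list, the expectation over the algorithm's randomization is not taken here.

```python
def shifting_pseudo_regret(segment_means, segment_lengths, choices):
    """Sum over all rounds of mu*_i - mu_i(x_t) for a fixed choice sequence.

    segment_means[i][k] is the (assumed known) mean reward of expert k in
    segment i; segment_lengths[i] is the number of rounds in segment i;
    choices[t] is the 0-indexed expert played in round t.
    """
    regret = 0.0
    t = 0
    for means, length in zip(segment_means, segment_lengths):
        best = max(means)  # mu*_i, the reward of the optimal version
        for _ in range(length):
            regret += best - means[choices[t]]
            t += 1
    return regret


def min_gap(segment_means):
    """Smallest gap between the best and second best mean over all segments."""
    gaps = []
    for means in segment_means:
        ordered = sorted(means, reverse=True)
        gaps.append(ordered[0] - ordered[1])
    return min(gaps)


# Hypothetical example: two segments of two rounds each, the best expert flips.
means = [[0.9, 0.1], [0.2, 0.8]]
print(shifting_pseudo_regret(means, [2, 2], [0, 1, 0, 1]))
print(min_gap(means))
```

An agent that always plays the segment-optimal expert has regret zero under this computation; every round spent on the other expert costs that segment's gap.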

3 Follow-The-Best-Interval

We begin this section by describing the Follow-The-Leader (FTL) algorithm, a simple yet powerful algorithm for learning with expert advice under stochastic rewards (with no shifts). FTL maintains a weight $W_t(k)$ for each expert $k \in \{1, 2\}$. Weights correspond to the cumulative reward seen thus far. At time $t$, FTL chooses the expert with the highest weight, in other words, the leader. It is well known that if rewards are sampled i.i.d., FTL is guaranteed to achieve constant pseudo-regret in .
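The FTL rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: experts are 0-indexed, and ties are broken toward the lowest index, which is an assumption since the text does not specify a tie-breaking rule.

```python
def follow_the_leader(rewards, num_experts=2):
    """Play, in each round, the expert with the largest cumulative reward so far.

    rewards[t][k] is the reward of expert k in round t (full information).
    Returns the list of chosen experts, one per round.
    """
    weights = [0.0] * num_experts
    choices = []
    for r in rewards:
        # max() returns the first maximizer, so ties go to the lowest index.
        leader = max(range(num_experts), key=lambda k: weights[k])
        choices.append(leader)
        # Full information: every expert's weight is updated each round.
        for k in range(num_experts):
            weights[k] += r[k]
    return choices


print(follow_the_leader([[0, 1], [0, 1], [1, 0]]))
```

Because the weights are never reset, an expert that led for a long time stays the leader long after a shift, which is the behaviour FTBI is designed to avoid.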

If we had access to the switching times $\tau_1, \ldots, \tau_N$, we could simply run FTL and restart the weights kept by the algorithm at every breakpoint $\tau_i$. This would guarantee pseudo-regret of . In the absence of this information, we could try to detect when a shift has occurred. However, it is not hard to see that, due to the noise in the rewards, detecting a shift accurately would require time steps.

Instead, we propose the Follow-The-Best-Interval (FTBI) algorithm. At a high level, FTBI maintains a small number of FTL instances, each running over a different interval of a collection of nested subintervals of $[T]$. The algorithm then follows the action of the best performing FTL instance. More precisely, let

$$\mathcal{I}_n = \{[i \cdot 2^n, (i+1) \cdot 2^n - 1] : i \in \mathbb{N}\}.$$

Here, $\mathcal{I}_n$ is a set of disjoint intervals of length $2^n$. Let $\mathcal{I} = \bigcup_{n \ge 0} \mathcal{I}_n$ be the set of all such intervals. For every time $t$, define

$$\mathrm{ACTIVE}(t) := \{I \in \mathcal{I} : t \in I\}$$

as the set of intervals in $\mathcal{I}$ that contain $t$. It is immediate that $|\mathrm{ACTIVE}(t)| = O(\log T)$. A depiction of the set $\mathcal{I}$ is shown in Figure 1.
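As a small worked example of the active set, the Python sketch below enumerates the dyadic intervals that contain a given round. It assumes rounds are indexed from 1 and that the index $i$ ranges over the positive integers, so that exactly one interval of each length $2^n \le t$ is active; the function name is illustrative.

```python
def active_intervals(t):
    """Return the dyadic intervals [i * 2^n, (i + 1) * 2^n - 1], i >= 1, containing t."""
    if t < 1:
        raise ValueError("rounds are assumed to be indexed from 1")
    intervals = []
    # t.bit_length() counts the levels n with 2^n <= t.
    for n in range(t.bit_length()):
        start = (t >> n) << n  # largest multiple of 2^n that is <= t
        intervals.append((start, start + (1 << n) - 1))
    return intervals


print(active_intervals(6))
print(len(active_intervals(1000)))
```

Under these assumptions the number of active intervals at round $t$ is $\lfloor \log_2 t \rfloor + 1$, which is where the logarithmic per-round cost comes from.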

For every interval $I = [p, q] \in \mathcal{I}$, we keep an instance of FTL. Let

$$W_t(k, I) = \sum_{s=p}^{t} \big( r_s(k) - r_s(x_s) \big)$$

denote the weight that the FTL instance on $I$ assigns to expert $k$ at time $t$, where $x_s$ is the expert, or version, chosen by our algorithm at time $s$. Notice that this differs slightly from the traditional FTL definition as we subtract the reward of the action taken by our algorithm. In doing so, we ensure that the scale of the weights is roughly the same for all intervals $I$, regardless of their length.

The expert and interval chosen by FTBI is given by

$$(x_t, I_t) = \operatorname*{argmax}_{k \in \{1,2\},\ I \in \mathrm{ACTIVE}(t)} W_{t-1}(k, I).$$

A full description of FTBI can be found in Algorithm 1. The intuition for our algorithm is the following: consider a segment $\mathcal{T}_i$ where rewards are i.i.d.; due to the properties of the FTL algorithm, any FTL instance beginning inside this segment will, in constant time, choose the optimal action $k_i^*$. Therefore, as long as our algorithm chooses FTL instances initialized inside a segment $\mathcal{T}_i$, FTBI will choose the optimal action. We will show that the choice of weights will guarantee that only instances of FTL that observe i.i.d. rewards will have large weights.

Remark.

Since our algorithm is motivated by the A/B test problem, we focus on the case with two experts. However, the FTBI algorithm naturally extends to the setting with $K > 2$ experts. In that case, each interval $I$ needs to maintain $K$ weights, $W_t(k, I)$, $k \in [K]$. The algorithm chooses the expert and interval using:

$$(x_t, I_t) = \operatorname*{argmax}_{k \in [K],\ I \in \mathrm{ACTIVE}(t)} W_{t-1}(k, I).$$
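The selection rule and weight update can be put together in a short Python sketch. This is not the paper's Algorithm 1 (which is not reproduced in this text) but a minimal reading of the definitions above under stated assumptions: rounds are indexed from 1, the interval index $i$ ranges over the positive integers, experts are 0-indexed, a freshly started interval has weight 0 (empty sum), and ties are broken toward the shortest interval and lowest expert index. The function and variable names are illustrative.

```python
def ftbi(rewards, num_experts=2):
    """Sketch of Follow-The-Best-Interval; returns the expert chosen in each round.

    rewards[t][k] is the reward of expert k in round t + 1 (full information).
    """
    weights = {}  # level n -> weights W(k, I) of the level-n interval containing t
    choices = []
    for t, r in enumerate(rewards, start=1):
        levels = t.bit_length()  # levels n with 2^n <= t
        for n in range(levels):
            if t % (1 << n) == 0:  # a new interval [t, t + 2^n - 1] starts here
                weights[n] = [0.0] * num_experts
        # Follow the (expert, interval) pair with the largest weight W_{t-1}(k, I).
        best_k, best_w = 0, float("-inf")
        for n in range(levels):
            for k in range(num_experts):
                if weights[n][k] > best_w:
                    best_k, best_w = k, weights[n][k]
        choices.append(best_k)
        # Update every active instance with r_t(k) - r_t(x_t).
        for n in range(levels):
            for k in range(num_experts):
                weights[n][k] += r[k] - r[best_k]
    return choices


# Toy sequence: expert 0 is better for 32 rounds, then expert 1 is better.
toy = [[1.0, 0.0]] * 32 + [[0.0, 1.0]] * 31
print(ftbi(toy))
```

On this toy sequence the sketch moves to expert 1 within a few rounds of the change, because short intervals that started after the shift quickly accumulate a positive weight for expert 1. Storing one weight vector per level keeps the per-round work proportional to the number of active intervals times the number of experts.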

4 Analysis

In this section we provide regret guarantees for our algorithm. The main result of this paper is the following theorem.

Theorem 1

There exists a universal constant $C$ such that the shifting pseudo-regret of FTBI is bounded by

$$R_T \le \frac{C}{\Delta^3} N^2 \log^3 T.$$

4.1 Proof Sketch

We begin by defining a disjoint cover of each segment $\mathcal{T}_i$ by elements of $\mathcal{I}$. This cover was first introduced by [16] to analyze their strongly adaptive algorithm.

Theorem 2 ([16], Geometric Covering)

Any interval $I \subseteq [T]$ can be partitioned into two finite sequences of disjoint and consecutive intervals, denoted by $I_{-a}, \ldots, I_0 \in \mathcal{I}$ and $I_1, \ldots, I_b \in \mathcal{I}$, such that

$$(\forall\, i \ge 1)\quad |I_{-i}| / |I_{-i+1}| \le 1/2, \qquad (\forall\, i \ge 2)\quad |I_i| / |I_{i-1}| \le 1/2.$$

The above theorem shows that any interval in $[T]$ can be partitioned using at most $O(\log T)$ intervals from $\mathcal{I}$. Since the first sequence has geometrically increasing lengths and the second sequence has geometrically decreasing lengths, we call this partitioning the geometric covering of interval $I$. A depiction of this cover can be seen in Figure 2.
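One way to compute such a covering is the greedy procedure sketched below in Python: starting from the left endpoint, repeatedly take the longest dyadic interval that starts at the current position and still fits. This is an illustrative construction under the assumption that rounds are indexed from 1; the function name is hypothetical and the paper does not prescribe a particular procedure.

```python
def geometric_cover(a, b):
    """Partition [a, b] (1 <= a <= b) into intervals [i * 2^n, (i + 1) * 2^n - 1]."""
    if a < 1 or b < a:
        raise ValueError("expected 1 <= a <= b")
    cover = []
    p = a
    while p <= b:
        n = 0
        # Grow the interval while p stays aligned to 2^(n+1) and it still fits in [a, b].
        while p % (1 << (n + 1)) == 0 and p + (1 << (n + 1)) - 1 <= b:
            n += 1
        cover.append((p, p + (1 << n) - 1))
        p += 1 << n
    return cover


print(geometric_cover(3, 12))
```

For the interval $[3, 12]$ the lengths are 1, 4, 4, 1: the lengths first grow while the start position becomes better aligned, then shrink as the remaining room runs out, matching the two sequences in the theorem.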

Denote by $\mathcal{C}$ the collection of the geometric coverings for all segments $\mathcal{T}_1, \ldots, \mathcal{T}_N$. We know that $\mathcal{C}$ forms a partition of $[T]$, and there are at most $O(N \log T)$ intervals in $\mathcal{C}$. Furthermore, since each interval $I \in \mathcal{C}$ is completely contained in $\mathcal{T}_i$ for some $i$, the rewards are i.i.d. for any interval $I \in \mathcal{C}$. The main idea behind our proof is to analyze the regret of each interval $I \in \mathcal{C}$, and to sum them up to obtain the overall regret of FTBI. At first glance, bounding the regret for $I$ is straightforward since the observed rewards are i.i.d. However, the FTL instance over interval $I$ is not isolated. In particular, it could compete with other instances over intervals $J \supset I$. Therefore we must show that no FTL instance over such a $J$ accumulates a large weight, in which case the instance over $I$ can catch up to the weights of $J$ in a small number of rounds.

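For any $I = [p, q] \in \mathcal{C}$, let $L_I = \{J \in \mathcal{I} : I \subset J\}$. That is, $L_I$ consists of all intervals in $\mathcal{I}$ that contain $I$. One can verify that, from the definition of $\mathcal{I}$, if an interval $J \in \mathcal{I}$ overlaps with $I$, i.e., $J \cap I \neq \emptyset$, and it was active before $I$, i.e., $\min J < p$, then it must be that $I \subset J$, i.e., $J \in L_I$. In view of this, it is the weight of the intervals in $L_I$ that we need to bound. Let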

$$H_I^k = \operatorname*{argmax}_{J \in L_I} W_{p-1}(k, J)$$

denote the interval in $L_I$ that assigns the highest weight to action $k$ at the beginning of interval $I$. Let $\mathcal{F}_t$ denote the sigma algebra generated by all history up to time $t$ and denote by $R_I$ the pseudo-regret of FTBI over interval $I$ conditioned on all the previous history, i.e.,

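$$R_I := \max_{k \in \{1,2\}} \mathbb{E}\left[ \sum_{t=p}^{q} \big( r_t(k) - r_t(x_t) \big) \,\Big|\, \mathcal{F}_{p-1} \right].$$

Theorem 3

For any $I = [p, q] \in \mathcal{C}$, we have

$$\begin{aligned}
R_I &\le \mathbb{E}\big[ \max\{ W_q(1, H_I^1),\, W_q(2, H_I^2),\, 0 \} \,\big|\, \mathcal{F}_{p-1} \big] \\
&\le \max\{ W_{p-1}(1, H_I^1),\, W_{p-1}(2, H_I^2),\, 0 \} + \frac{28}{\Delta^3} \log T + O(1).
\end{aligned} \tag{1}$$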

We defer the full proof to the Appendix. This theorem shows that the conditional regret in each interval $I$ depends on the maximum weight accumulated over all FTL instances before time $p$ plus $\frac{28}{\Delta^3} \log T + O(1)$. More importantly, inequality (1) shows that at the end of interval $I$ the maximum weight over all active FTL instances and all actions can only increase by this additive term.

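We can now prove Theorem 1. First, by construction of the geometric covering, we have $|\mathcal{C}| = O(N \log T)$. Let $I^{(1)}, \ldots, I^{(m)}$ be an ordered enumeration of the intervals in $\mathcal{C}$. For every interval $I = [p, q]$, let $W^*(I) = \max\{ W_{p-1}(1, H_I^1),\, W_{p-1}(2, H_I^2),\, 0 \}$. Applying inequality (1) recursively it is easy to see that:

$$\mathbb{E}\big[ W^*(I^{(n)}) \big] \le (n-1) \left[ \frac{28}{\Delta^3} \log T + O(1) \right],$$

which again in view of (1) yields

$$\mathbb{E}\big[ R_{I^{(n)}} \big] \le n \left[ \frac{28}{\Delta^3} \log T + O(1) \right].$$

Thus

$$R_T = \sum_{n=1}^{m} \mathbb{E}\big[ R_{I^{(n)}} \big] \le \frac{1}{2} m (m+1) \left[ \frac{28}{\Delta^3} \log T + O(1) \right].$$

The proof is completed using the fact that $m = |\mathcal{C}| = O(N \log T)$.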
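For completeness, the final substitution can be spelled out as follows. This is a sketch that takes the bracketed term to be $\frac{28}{\Delta^3}\log T + O(1)$, assumes the number of covering intervals satisfies $m = O(N \log T)$, and assumes $\Delta \le 1$ so that the $O(1)$ term is absorbed into $\frac{\log T}{\Delta^3}$; the constants are not tracked.

```latex
\begin{align*}
R_T &\le \tfrac{1}{2}\, m(m+1) \left[ \tfrac{28}{\Delta^3} \log T + O(1) \right] \\
    &= O\!\left(m^2\right) \cdot O\!\left( \tfrac{\log T}{\Delta^3} \right) \\
    &= O\!\left(N^2 \log^2 T\right) \cdot O\!\left( \tfrac{\log T}{\Delta^3} \right)
     = O\!\left( \tfrac{N^2 \log^3 T}{\Delta^3} \right).
\end{align*}
```

The quadratic dependence on $N$ therefore comes from summing a per-interval bound that itself grows linearly with the position of the interval in the enumeration.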

5 Experiments

We conduct extensive simulations comparing our algorithm (FTBI) to AdaNormalHedge (ANH) on both synthetic and real world datasets. In order to obtain an efficient implementation of ANH, we use the streaming trick mentioned in [2, 3].

5.1 Synthetic Data

The purpose of these simulations is to understand the advantages of FTBI over ANH in a controlled environment for different parameters representing the reward gap $\Delta$, number of shifts $N$, and round length $T_i$. For each experiment we calculate the regret: . Each experiment is replicated 20 times and we report the average result. The error bars represent one standard deviation. For every experiment the rewards of the two experts over the $i$-th segment are sampled i.i.d. from a Bernoulli distribution with parameters .

We explore the performance of FTBI as a function of the reward gap, $\Delta$, the number of shifts, $N$, the round length, $T_i$, and the number of experts, $K$.
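A generator for this kind of piecewise-stationary Bernoulli data might look like the Python sketch below. The function name, the seed argument, and the example parameter values are hypothetical; the specific parameters used in the experiments are not recoverable from this text.

```python
import random


def piecewise_bernoulli_rewards(segment_params, rounds_per_segment, seed=0):
    """Sample reward vectors that are i.i.d. Bernoulli within each segment.

    segment_params[i][k] is the success probability of expert k in segment i.
    Returns one reward vector (list of 0.0 / 1.0) per round.
    """
    rng = random.Random(seed)  # seeded so that replications are reproducible
    rewards = []
    for params in segment_params:
        for _ in range(rounds_per_segment):
            rewards.append([1.0 if rng.random() < p else 0.0 for p in params])
    return rewards


# Hypothetical setting: two segments of 1000 rounds with the means swapped.
data = piecewise_bernoulli_rewards([[0.7, 0.3], [0.3, 0.7]], 1000, seed=1)
print(len(data))
```

Replicating an experiment then amounts to calling the generator with different seeds and averaging the resulting regret.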

5.1.1 Sensitivity to the reward gap Δ

For this experiment, we vary the reward gap $\Delta$ over the set and choose parameters such that . We allow for two intervals with rounds per interval, and plot the results in Figure 3(a). Observe that FTBI consistently outperforms ANH, and the regret exhibits a significantly better dependence on $\Delta$ than guaranteed by Theorem 1.

5.1.2 Sensitivity to the number of shifts, N

We vary the number of shifts $N$ from to . For each setting, we sample parameters , from a uniform distribution in , and fix the number of rounds per interval to . Observe that a new draw of initiates a shift with high probability as . We compare the regret of FTBI and ANH in Figure 3(b). Notice that although our theoretical bound shows a quadratic dependence on the number of shifts, we do not see this unfavorable dependence empirically, and again FTBI consistently outperforms ANH.

5.1.3 Sensitivity to the round length, Ti

We fix the number of intervals to . On the first interval expert has mean reward and expert has reward . The means are swapped in the second interval. We vary the length of each interval over the set , and show the results in Figure 3(c). This experiment captures the main downside of ANH. Specifically, ANH is slow to adjust to a shift in distributions. Indeed, the longer one arm dominates the other, the longer it takes ANH to switch when a distribution shift occurs. By contrast, FTBI adjusts much faster.

5.1.4 Sensitivity to the number of experts, K

As mentioned in Section 3, although our main focus is the A/B test problem, as an expert learning algorithm, FTBI also applies to scenarios where there are $K > 2$ experts. In this experiment, we study the performance of FTBI when we vary the number of experts $K$. We sample the parameters i.i.d. from a uniform distribution in . We fix the length of each segment to rounds, and vary the number of experts from to . We show the results in Figure 3(d). Although FTBI does not have theoretical guarantees in this setting, it consistently outperforms ANH. In addition, FTBI appears to scale better than ANH as the number of experts grows.

5.1.5 Sensitivity to the number of shifts and reward gap as functions of T

We also test the performance of FTBI and ANH in the settings where the number of shifts and/or the reward gap are functions of $T$. These settings simulate practical scenarios where there is a relatively large number of shifts and a small reward gap between the optimal and sub-optimal experts. More importantly, it is in these scenarios that our theoretical bounds become trivial, as setting or makes the bound of Theorem 1 simply . In theory our algorithm should perform worse than ANH. However, as seen in Figure 4, this is not the case. For Figure 4(a), we fixed and let . We observe that FTBI achieves significantly smaller regret than ANH. Again, this is due to the fact that longer periods without shifts are detrimental to the performance of ANH. For Figure 4(b), we fix and let . Since this setup corresponds to experts with similar performance, both algorithms achieve similar regret.

5.1.6 Scalability

To conclude this synthetic evaluation, we compare the running time of both algorithms. We vary the length of each round over the set and show the results in Figure 5. Notice that while the total running time of both algorithms grows linearly, and the per-round time increases logarithmically, FTBI is approximately 10 times faster across the board.

To conclude, while ANH has a stronger theoretical regret bound, FTBI consistently performs better than ANH in the experiments. We believe that our current regret analysis can be further improved so that the theoretical regret bounds can match the performance of FTBI in practice. We propose this direction in our future work.

5.2 Evaluation on Public Datasets

We now evaluate our algorithm by simulating the outcome of an A/B test based on public time series data. The methodology is the following: given a time series we consider two prediction systems. The first one predicts that the value will be above a fixed threshold and the other one predicts that it will be below . The reward of each system is when the system is correct and otherwise. The goal is to find a combination of systems that yields the largest cumulative reward. As mentioned in the introduction, this setup matches that of expert learning with two experts.
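The construction of the two thresholding experts can be sketched in Python as follows. The reward values (1 for a correct prediction, 0 otherwise) and the treatment of a value exactly equal to the threshold (counted as "below") are assumptions made for illustration, as are the function name and the toy series.

```python
def threshold_expert_rewards(series, threshold):
    """Turn a time series into reward vectors for two thresholding experts.

    Expert 0 predicts "above the threshold", expert 1 predicts "below".
    Assumed rewards: 1.0 when the prediction is correct, 0.0 otherwise.
    """
    rewards = []
    for value in series:
        above = value > threshold  # ties are counted as "below" (assumption)
        rewards.append([1.0 if above else 0.0, 0.0 if above else 1.0])
    return rewards


print(threshold_expert_rewards([5, 1, 3], 3))
```

The resulting list of reward vectors has exactly the full-information form used in the setup, so any expert learning algorithm can be run on it directly.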

The first time series we consider is the air quality data set (https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data) consisting of hourly measurements of air quality in Beijing. The threshold of our systems is given by corresponding to the rounded median of the air quality index. The second data set consists of measurements of electric power usage (https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption) and the threshold is given by .

We compare the performance of FTBI, ANH, and FTL algorithms in these tasks. Before presenting the results of these comparisons we first show why these data sets benefit from the adaptive nature of FTBI. In Figure 6(a)(b), we plot the difference of the cumulative reward of the thresholding systems. Notice that if one system consistently dominated the other (a favorable case for FTL) we would see a monotonically increasing curve. Instead, we observe an oscillatory behavior, which indicates that the best system changes several times over the time period.
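The diagnostic curve described above, the running difference between the cumulative rewards of the two systems, can be computed with a few lines of Python. The function name and the toy input are illustrative assumptions.

```python
from itertools import accumulate


def cumulative_reward_difference(rewards):
    """Running sum of r_t(0) - r_t(1), one value per round."""
    return list(accumulate(r[0] - r[1] for r in rewards))


# Toy input: system 0 wins twice, then system 1 wins once.
print(cumulative_reward_difference([[1, 0], [1, 0], [0, 1]]))
```

A curve that keeps rising (or keeps falling) indicates a single dominant system, while changes of direction mark periods in which the better system switches.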

We run the three algorithms on these data sets and show the total reward of FTBI and ANH as a relative improvement over FTL in Figure 6(c)(d). There, we see that both FTBI and ANH largely outperform FTL with improvements of up to , which validates the fact that these algorithms are able to adapt to the changes in the best system. Furthermore, our algorithm outperforms ANH. In the first data set it is considerably better, while in the second one the difference is not as large. By looking at the corresponding cumulative reward plots we see that the first task is noisier than the second one, which leads us to believe that FTBI performs better under noisy conditions.

5.3 Evaluation on an Advertising Exchange Platform

Finally, to showcase the advantage of our method in practice, we apply the FTBI algorithm to compare two pricing mechanisms on an advertising exchange platform using real auction data. We consider the problem of setting reserve (or minimum) prices in the auctions, with the goal of increasing revenue. Learning good reserve prices is a notoriously challenging problem, with a myriad of algorithms designed to solve this task  [21, 22, 23, 24].

In this experiment, we collect a sample of auctions (highest and second highest bid) for an eight day window from four different publishers; see the total number of auctions in Table 1. We consider an A/B test over two algorithms and for setting the reserve price; we treat them as black boxes for the purpose of the experiment. At every time , we can compute the revenues and that could be obtained by each of these algorithms. (The revenue in a particular auction equals the amount of money that the winner of the auction pays to the publisher.) We ignore the question of incentive compatibility of learning in auctions, as it is beyond the scope of this work.

Since the reserve price must be set prior to running the auction, our goal is to select one of these algorithms at each point of time , to maximize the overall revenue. We run the Follow-The-Leader (FTL) algorithm, AdaNormalHedge (ANH) and Follow-The-Best-Interval (FTBI) over these sequences of rewards. We present the relative revenue lift of FTBI (defined as the relative difference in revenue) over the two other algorithms in Figure 7, where we treat each day as a separate experiment. The box limits represent the 25 and 75 percent quantiles while the whiskers represent the 10 and 90 percent quantiles. Observe from Figure 7(a) that FTBI is consistently better than ANH, in some cases improving total revenue by more than 4%.
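As an illustration of how per-auction rewards and the reported lift could be computed, the Python sketch below assumes a second-price auction with a reserve price, which is consistent with the recorded highest and second highest bids but is not stated explicitly in the text. The function names, the example bids, and the exact normalization of the lift (difference divided by the baseline revenue) are likewise assumptions.

```python
def second_price_revenue(highest_bid, second_bid, reserve):
    """Revenue of one auction under an assumed second-price rule with reserve.

    If the highest bid does not meet the reserve, nothing is sold.
    Otherwise the winner pays the larger of the second bid and the reserve.
    """
    if highest_bid < reserve:
        return 0.0
    return max(second_bid, reserve)


def relative_lift(revenue, baseline_revenue):
    """Relative difference in revenue with respect to a baseline (assumed definition)."""
    return (revenue - baseline_revenue) / baseline_revenue


print(second_price_revenue(10.0, 4.0, 6.0))
print(relative_lift(104.0, 100.0))
```

Evaluating each black-box pricing algorithm's reserve price on the same recorded bids yields the two reward sequences between which the expert learning algorithms must choose.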

Figure 7(b) shows that FTBI does not always compare favorably with respect to FTL. This can be explained by the fact that in these sequences, one algorithm is consistently better than the other throughout the day. It is well known that FTL has the best regret guarantees in this non-switching scenario, and FTBI essentially matches its performance on publishers and . In cases where the optimum strategy switches between and , FTBI tends to outperform FTL, sometimes significantly.

To investigate this further, in Figure 8, we plot the difference in the cumulative revenue of and as a function of time for each publisher for four different days. When the time series increases, it represents a segment of time where was consistently better than ; the opposite is true when the series decreases. In particular, notice that, for publishers and , consistently outperforms , explaining the results in Figure 7(b). On the other hand, the best algorithm in publisher switches considerably in days 2 and 4 and there is a clear switch for publisher in day 2. It is in these cases that we obtain the most benefits from an adaptive algorithm such as FTBI.
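The quantity plotted in this kind of diagnostic can be computed with a short sketch such as the one below. The function name `cumulative_revenue_gap` is hypothetical; the inputs are the per-auction revenues of the two algorithms.

```python
import numpy as np

def cumulative_revenue_gap(revenue_a, revenue_b):
    """Running difference in cumulative revenue between two algorithms.

    An increasing stretch means the first algorithm earned more over that
    stretch; a decreasing stretch means the second one did.
    """
    diff = np.asarray(revenue_a, dtype=float) - np.asarray(revenue_b, dtype=float)
    return np.cumsum(diff)
```

A series that rises and then falls indicates a switch in the best algorithm, which is the regime where an adaptive method is expected to help.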

6 Conclusion

In this work, we study the setting of monitoring non-stationary A/B tests, and formulate an expert learning framework for this problem. We develop the Follow-The-Best-Interval (FTBI) algorithm and prove theoretical bounds on its regret. Our empirical evaluation suggests that it achieves lower regret than the state-of-the-art algorithm across many synthetic and real world datasets. In addition, FTBI is much faster than the state-of-the-art AdaNormalHedge algorithm. We suggest improving the theoretical guarantee of FTBI as a future research direction.

References

• [1] M. Herbster and M. K. Warmuth, “Tracking the best expert,” Machine learning, vol. 32, no. 2, pp. 151–178, 1998.
• [2] E. Hazan and C. Seshadhri, “Adaptive algorithms for online decision problems,” in Electronic Colloquium on Computational Complexity (ECCC), vol. 14, 2007.
• [3] H. Luo and R. E. Schapire, “Achieving all with no parameters: Adanormalhedge.,” in COLT, pp. 1286–1304, 2015.
• [4] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth, “How to use expert advice,” Journal of the ACM (JACM), vol. 44, no. 3, pp. 427–485, 1997.
• [5] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in European conference on computational learning theory, pp. 23–37, Springer, 1995.
• [6] N. Littlestone and M. K. Warmuth, “The weighted majority algorithm,” Information and computation, vol. 108, no. 2, pp. 212–261, 1994.
• [7] V. G. Vovk, “A game of prediction with expert advice,” in Proceedings of the eighth annual conference on Computational learning theory, pp. 51–60, ACM, 1995.
• [8] M. Hutter and J. Poland, “Adaptive online prediction by following the perturbed leader,” Journal of Machine Learning Research, vol. 6, no. Apr, pp. 639–660, 2005.
• [9] A. Kalai and S. Vempala, “Efficient algorithms for online decision problems,” Journal of Computer and System Sciences, vol. 71, no. 3, pp. 291–307, 2005.
• [10] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk, “A closer look at adaptive regret,” in International Conference on Algorithmic Learning Theory, pp. 290–304, Springer, 2012.
• [11] D. Adamskiy, M. K. Warmuth, and W. M. Koolen, “Putting bayes to sleep,” in Advances in Neural Information Processing Systems, pp. 135–143, 2012.
• [12] O. Bousquet and M. K. Warmuth, “Tracking a small set of experts by mixing past posteriors,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 363–396, 2002.
• [13] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz, “Mirror descent meets fixed share (and feels no regret),” in Advances in Neural Information Processing Systems, pp. 980–988, 2012.
• [14] M. Herbster and M. K. Warmuth, “Tracking the best linear predictor,” Journal of Machine Learning Research, vol. 1, no. Sep, pp. 281–309, 2001.
• [15] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth, “Using and combining predictors that specialize,” in Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pp. 334–343, ACM, 1997.
• [16] A. Daniely, A. Gonen, and S. Shalev-Shwartz, “Strongly adaptive online learning.,” in ICML, pp. 1405–1411, 2015.
• [17] K.-S. Jun, F. Orabona, R. Willett, and S. Wright, “Improved strongly adaptive online learning using coin betting,” arXiv preprint arXiv:1610.04578, 2016.
• [18] L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou, “Strongly adaptive regret implies optimally dynamic regret,” arXiv preprint arXiv:1701.07570, 2017.
• [19] M. K. Warmuth and W. M. Koolen, “Open problem: Shifting experts on easy data.,” in COLT, pp. 1295–1298, 2014.
• [20] A. Sani, G. Neu, and A. Lazaric, “Exploiting easy data in online optimization,” in Advances in Neural Information Processing Systems, pp. 810–818, 2014.
• [21] N. Cesa-Bianchi, P. Gaillard, C. Gentile, and S. Gerchinovitz, “Algorithmic chaining and the role of partial feedback in online nonparametric learning,” in Proceedings of COLT, pp. 465–481, 2017.
• [22] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, “Regret minimization for reserve prices in second-price auctions,” IEEE Trans. Information Theory, vol. 61, no. 1, pp. 549–564, 2015.
• [23] A. M. Medina and S. Vassilvitskii, “Revenue optimization with approximate bid predictions,” in Proceedings of NIPS, pp. 1856–1864, 2017.
• [24] M. Mohri and A. M. Medina, “Learning theory and algorithms for revenue optimization in second price auctions with reserve,” in Proceedings of ICML, pp. 262–270, 2014.

Appendix A Proof of Theorem 3

To simplify notation we fix and let for . We define and , we also let denote the conditional expectation with respect to , and denote the conditional probability with respect to . Finally, we assume, without loss of generality that expert 1 is optimal over the segment .

The proof of Theorem 3 depends on the following three technical lemmas.

Lemma 1

If , we have

$$R_I \le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big] \le \frac{6}{\Delta^2}\log T + O(1). \tag{2}$$
Lemma 2

If , we have

$$R_I \le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big] \le \Big(1 + \frac{2}{\Delta}\Big)(W_2 - W_1) + \frac{16}{\Delta^2}\log T + O(1). \tag{3}$$
Lemma 3

If , we have

$$R_I \le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big] \le W_2 - W_1 + \frac{6}{\Delta^2}\log T + O(1). \tag{4}$$

The proof of Theorem 3 is now immediate. Combining Lemma 2 and Lemma 3, we know that for , we always have

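$$
\begin{aligned}
R_I &\le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big] \\
&\le W_2 - W_1 + \Big(\frac{12}{\Delta^3} + \frac{16}{\Delta^2}\Big)\log T + O(1) \\
&\le W_2 - W_1 + \frac{28}{\Delta^3}\log T + O(1).
\end{aligned} \tag{5}
$$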
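The last step of (5) can be checked directly. Reading the coefficients as fractions in $\Delta$ and assuming $0 < \Delta \le 1$ (our assumption, natural when rewards lie in $[0,1]$ so that a gap cannot exceed 1), we have

$$\frac{12}{\Delta^3} + \frac{16}{\Delta^2} = \frac{12 + 16\Delta}{\Delta^3} \le \frac{28}{\Delta^3}.$$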

Moreover, by definition of and since we must have . Therefore by Lemma 1, Lemma 2 and the fact that is measurable we have:

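$$
\begin{aligned}
R_I \le R_I + W_1 &\le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\}\big] \\
&\le \max\{W_1, W_2, 0\} + \frac{28}{\Delta^3}\log T + O(1),
\end{aligned}
$$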

where the last inequality follows from (2) and (3).

Appendix B Proof of Lemma 1

By definition of regret we have

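$$
\begin{aligned}
R_I &= \mathbb{E}\Big[\sum_{t=p}^{q} r_t(1) - r_t(x_t)\Big] = \mathbb{E}\big[W_q(1,H_1) - W_1\big] \\
&\le \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big] := \tilde{R}_I.
\end{aligned}
$$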

Fix , and let , and if for all . Define the following two events:

 X:={Wq(1,H1)≥Wq(2,H2)},
 Y:={τ

A simple observation is that, if happens, we must have for all . Indeed, for , by definition. Thus , and then remains unchanged as for all .

Therefore, conditioned on the event , we always have

$$\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1 \le \delta + 1.$$

Therefore we can bound as

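$$
\begin{aligned}
\tilde{R}_I ={}& \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1 \mid X \cap Y\big]\,\mathbb{P}\{X \cap Y\} \\
&+ \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1 \mid \bar{X} \cup \bar{Y}\big]\,\mathbb{P}\{\bar{X} \cup \bar{Y}\} \\
\le{}& \delta + 1 + 2T\,\mathbb{P}\{\bar{X} \cup \bar{Y}\}.
\end{aligned}
$$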

Moreover, since , we have

$$\mathbb{P}\{\bar{X}\} \le \mathbb{P}\Big\{\sum_{t=p}^{q} r_t(1) - r_t(2) < 0\Big\} \le e^{-\frac{1}{2}\Delta^2 |I|},$$

where we have used the fact that and Hoeffding’s inequality.
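The Hoeffding step can be checked numerically. The sketch below assumes (our assumption, not stated in this passage) independent Bernoulli rewards whose means differ by a gap `delta_gap`, so that each difference lies in [-1, 1] and Hoeffding's inequality gives the exponent `-delta_gap**2 * n / 2` for an interval of length `n`. The function names are hypothetical.

```python
import numpy as np

def hoeffding_bound(delta_gap, n):
    """Upper bound exp(-delta_gap^2 * n / 2) on the probability that the
    better expert has a smaller total reward over n rounds."""
    return float(np.exp(-0.5 * delta_gap ** 2 * n))

def empirical_failure_rate(mean_1, mean_2, n, trials=20000, seed=0):
    """Monte Carlo estimate of P{sum_t r_t(1) - r_t(2) < 0} for independent
    Bernoulli rewards with the given means (illustrative reward model)."""
    rng = np.random.default_rng(seed)
    r1 = rng.binomial(1, mean_1, size=(trials, n))
    r2 = rng.binomial(1, mean_2, size=(trials, n))
    # Fraction of trials in which expert 2 strictly beats expert 1.
    return float(np.mean((r1 - r2).sum(axis=1) < 0))
```

For example, comparing `empirical_failure_rate(0.6, 0.4, 50)` with `hoeffding_bound(0.2, 50)` shows the simulated frequency falling below the bound, as expected since Hoeffding's inequality is not tight for this model.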

Let us bound . Notice that event does not happen if and only if and there exists such that and for . Then remains unchanged for and for every . Thus, the previous event can happen if and only if there exists such that and . Therefore, let

$$E_t := \{W_{t-1}(1,H_1) \ge W_1 + \delta \;\wedge\; x_t = 2\},$$

for all . It is easy to see that . Also, if , then . Therefore, we need only bound the probability of for . Recall in Algorithm 1, we define

$$(x_t, I_t) = \operatorname*{argmax}_{k \in \{1,2\},\ \tilde{I} \in \mathrm{ACTIVE}(t)} W_{t-1}(k, \tilde{I}). \tag{6}$$

Therefore, we have

 P{Et}= P{Wt−1(1,H1)≥W1+δ,xt=2} ≤ P{It=H2,xt=2}+∑J∈ACTIVE(t)pJ>pP{Wt−1(1,H1)≥W1+δ,It=J,xt=2} ≤ P{Wt−1(1,H1)pP{Wt−1(1,J)

where we denote by the starting time of interval . Applying Hoeffding’s inequality, we can bound the first term by . For the second term notice that if then the probability of the summand is . Thus we can restrict ourselves to intervals for which and we can bound the second term by

 ∑J∈ACTIVE(t)pJ>p,t−pJ≥δP{Wt−1(1,J)p,t−pJ≥δe−(t−pJ)Δ22≤Te−δΔ22,

where we again used Hoeffding’s inequality and we bound the number of active intervals by . Thus, by union bound, we get

$$\mathbb{P}\{\bar{X} \cup \bar{Y}\} \le e^{-\frac{\Delta^2 |I|}{2}} + T^2 e^{-\frac{\delta \Delta^2}{2}}.$$

Then, by choosing , and considering the two cases with and , we get the desired result.
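As a sanity check on the constant in Lemma 1, suppose $\delta = \frac{6}{\Delta^2}\log T$ (an assumption on our part: it is the choice consistent with the $\frac{6}{\Delta^2}\log T$ term in (2)). Then the second term of the union bound, multiplied by the factor $2T$ from the decomposition of $\tilde{R}_I$, contributes only a constant:

$$2T \cdot T^2 e^{-\frac{\delta \Delta^2}{2}} = 2T^3 e^{-3\log T} = 2.$$

The remaining term, $2T e^{-\frac{\Delta^2 |I|}{2}}$, depends on the length of the interval and is the one that requires the case analysis.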

Appendix C Proof of Lemma 2

The proof follows the same line of reasoning as the previous lemma. Let be as in the proof of Lemma 1. Let be a constant to be chosen later and define . Let , and be the events defined in the proof of Lemma 1. The same argument as before shows that

$$\tilde{R}_I \le \delta + 1 + 2T\,\mathbb{P}\{\bar{X} \cup \bar{Y}\}.$$

We proceed to bound . Without loss of generality, we assume that . We first consider the case where .

For event , we have

 P{¯X} =P{Wq(1,H1)

and since , by Hoeffding’s bound we have

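$$\mathbb{P}\{\bar{X}\} \le \mathbb{P}\Big\{\Big(\sum_{t=p}^{q} r_t(1) - r_t(2)\Big) - |I|\Delta < -\frac{1}{2}|I|\Delta\Big\} \le e^{-\frac{\Delta^2 |I|}{8}}.$$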

To bound we again bound the probability of the events for . Using the exact same technique as in Lemma 1 and letting we get

$$\mathbb{P}\{E_t\} \le e^{-\frac{\Delta^2 \delta}{8}} + T e^{-\frac{\Delta^2 \delta}{2}}.$$

Then, by union bound, we obtain

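$$\mathbb{P}\{\bar{X} \cup \bar{Y}\} \le e^{-\frac{\Delta^2 |I|}{8}} + T e^{-\frac{\Delta^2 \delta}{8}} + T^2 e^{-\frac{\Delta^2 \delta}{2}}.$$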

Recall that and . A simple calculation verifies that setting , yields

$$\tilde{R}_I \le \delta + O(1) \le \frac{2}{\Delta}(W_2 - W_1 + C\log T) + O(1) \le \frac{2}{\Delta}(W_2 - W_1) + \frac{16}{\Delta^2}\log T + O(1).$$

On the other hand if , we have

$$W_q(1,H_1) - W_1 \le \frac{2}{\Delta}W, \qquad W_q(2,H_2) - W_1 \le \frac{2}{\Delta}W + W_2 - W_1.$$

Therefore we must have

$$\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1 \le \frac{2}{\Delta}W + W_2 - W_1,$$

using the definition of and bounding the by a sum yields

$$\tilde{R}_I \le \Big(1 + \frac{2}{\Delta}\Big)(W_2 - W_1) + \frac{16}{\Delta^2}\log T + O(1).$$

Appendix D Proof of Lemma 3

Using the same notation as in the previous two lemmas, we strive to bound

$$\tilde{R}_I := \mathbb{E}\big[\max\{W_q(1,H_1), W_q(2,H_2), 0\} - W_1\big]$$

under the assumption that for . Let and define the stopping time

$$\eta := \inf\{q \ge t \ge p \mid \exists J \in \mathrm{ACTIVE}(t) \text{ with } W_t(1,J) > W_t(2,H_2)\},$$

with if the event does not occur. Notice that by definition of , expert is chosen for the first time at time . We consider the following three disjoint events:

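$$
\begin{aligned}
U &:= \{\eta = \infty\}, \\
V &:= \{\eta < \infty \;\wedge\; W_\eta(1,H_1) \le W_\eta(2,H_2)\}, \\
W &:= \{\eta < \infty \;\wedge\; W_\eta(1,H_1) > W_\eta(2,H_2)\}.
\end{aligned}
$$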

We make the following observations:

1. for . Indeed, since by definition of we have for it follows that remains constant.

2. If happens then . By definition of we must have , which implies the statement.

3. If happens then we must have

$$W_{\eta-1}(1,H_1) \le W_{\eta-1}(2,H_2) = W_2,$$

and this easily implies

 W2
4. if . This is immediate since is the largest weight over expert at time and by assumption .

Using observation 2, we can write as follows:

 ~RI =E[ξ∣U]P{U}+E[ξ∣V]P{V}+E[ξ1W] =(W2−W1)P{U}+E[ξ∣V