Background and motivation. In the presence of uncertainty and partial feedback on rewards, an agent that faces a sequence of decisions needs to judiciously use information collected from past observations when trying to optimize future actions. This fundamental paradigm is present in a variety of applications: an internet web site that seeks to customize recommendations to individual users whose tastes are a priori not known; a firm that launches a new product and needs to set a price to maximize profits but does not know the demand curve; a retailer that must select assortments of products among a larger variety of items but does not know the preferences of customers; a firm that selects routes over the internet to efficiently send packets of data to users but does not know the delay along available routes; as well as many other instances. In all the above examples decisions can be adjusted on a weekly, daily or hourly basis (if not more frequently), and the history of observations may be used to optimize current and future performance. To do so effectively, the decision maker tries to balance between the acquisition cost of new information (exploration) that may be used to improve future decisions and rewards, and the generation of instantaneous rewards based on the existing information (exploitation).
A widely studied paradigm that captures the tension between exploration and exploitation is that of multi-armed bandits (MAB), originally proposed in the context of drug testing by Thompson (1933), and placed in a general setting by Robbins (1952). The original setting has a gambler choosing among slot machines at each round of play, and upon that selection observing a reward realization. In this classical formulation the rewards are assumed to be independent and identically distributed according to an unknown distribution that characterizes each machine. The objective is to maximize the expected sum of (possibly discounted) rewards received over a given (possibly infinite) time horizon. Since their inception, MAB problems with various modifications have been studied extensively in Statistics, Economics, Operations Research, and Computer Science, and are used to model a plethora of dynamic optimization problems under uncertainty; examples include clinical trials (Zelen 1969), strategic pricing (Bergemann and Valimaki 1996), investment in innovation (Bergemann and Hege 2005), packet routing (Awerbuch and Kleinberg 2004), on-line auctions (Kleinberg and Leighton 2003), assortment selection (Caro and Gallien 2007), and on-line advertising (Pandey et al. 2007), to name but a few. For overviews and further references cf. the monographs by Berry and Fristedt (1985) and Gittins (1989) for Bayesian / dynamic programming formulations, and by Cesa-Bianchi and Lugosi (2006), which covers the machine learning literature and the so-called adversarial setting.
Since the set of MAB instances in which one can identify the optimal policy is extremely limited, a typical yardstick to measure performance of a candidate policy is to compare it to a benchmark: an oracle that at each time instant selects the arm that maximizes expected reward. The difference between the performance of the policy and that of the oracle is called the regret. When the growth of the regret as a function of the horizon is sub-linear, the policy is long-run average optimal: its long-run average performance converges to that of the oracle. Hence the first-order objective is to develop policies with this characteristic. The precise rate of growth of the regret as a function of the horizon $T$ provides a refined measure of policy performance. Lai and Robbins (1985) is the first paper that provides a sharp characterization of the regret growth rate in the context of the traditional (stationary random rewards) setting, often referred to as the stochastic MAB problem. Most of the literature has followed this path with the objective of designing policies that exhibit the "slowest possible" rate of growth in the regret (often referred to as rate optimal policies).
In many application domains, several of which were noted above, temporal changes in the structure of the reward distribution are an intrinsic characteristic of the problem. These are ignored in the traditional stochastic MAB formulation, but there have been several attempts to extend that framework. The origin of this line of work can be traced back to Gittins and Jones (1974) who considered a case where only the state of the chosen arm can change, giving rise to a rich line of work (see, e.g., Gittins 1979, and Whittle 1981). In particular, Whittle (1988) introduced the term restless bandits; a model in which the states (associated with the reward distributions) of the arms change in each step according to an arbitrary, yet known, stochastic process. Considered a notoriously hard class of problems (cf. Papadimitriou and Tsitsiklis 1994), this line of work has led to various approximation approaches, see, e.g., Bertsimas and Nino-Mora (2000), and relaxations, see, e.g., Guha and Munagala (2007) and references therein.
Departure from the stationarity assumption that has dominated much of the MAB literature raises fundamental questions as to how one should model temporal uncertainty in rewards, and how to benchmark performance of candidate policies. One extreme view is to allow the reward realizations of arms to be selected at any point in time by an adversary. These ideas have their origins in game theory with the work of Blackwell (1956) and Hannan (1957), and have since seen significant development; Foster and Vohra (1999) and Cesa-Bianchi and Lugosi (2006) provide reviews of this line of research. Within this so-called adversarial formulation, the efficacy of a policy over a given time horizon is often measured relative to a benchmark defined by the single best action one could have taken in hindsight (after seeing all reward realizations). The single best action benchmark represents a static oracle, as it is constrained to a single (static) action. For obvious reasons, this static oracle can perform quite poorly relative to a "dynamic oracle" that follows the optimal dynamic sequence of actions, as the latter optimizes the (expected) reward at each time instant over all possible actions.
Under a non-stationary reward structure it is immediate that the single best action may be sub-optimal in a large number of decision epochs, and the gap between the performance of the static and the dynamic oracles can grow linearly with $T$. Thus, a potential limitation of the adversarial framework is that even if a policy has a "small" regret relative to a static oracle, there is no guarantee with regard to its performance relative to the dynamic oracle.
Main contributions. At a high level, the main contribution of this paper lies in fully characterizing the (regret) complexity of a broad class of MAB problems with non-stationary reward structure by establishing a direct link between the extent of reward "variation" and the minimal achievable worst-case regret. More specifically, the paper's contributions are along four dimensions. On the modeling side we formulate a class of non-stationary reward structures that is quite general, and hence can be used to realistically capture a variety of real-world type phenomena, yet remain mathematically tractable. The main constraint that we impose on the evolution of the mean rewards is that their variation over the relevant time horizon is bounded by a variation budget $V_T$; a concept that was recently introduced in Besbes et al. (2013) in the context of non-stationary stochastic approximation. This limits the power of nature compared to the adversarial setup discussed above, where rewards can be picked to maximally damage the policy at each instance within $\{1, \ldots, T\}$. Nevertheless, this constraint still allows for a very rich class of temporal changes. In particular, this class extends most of the treatment in the non-stationary stochastic MAB literature, which mainly focuses on a finite (known) number of changes in the mean reward values, see, e.g., Garivier and Moulines (2011) and references therein (see also Auer et al. (2002) in the adversarial context). It is also consistent with more extreme settings, such as the one treated in Slivkins and Upfal (2008), where reward distributions evolve according to a Brownian motion and hence the regret is linear in $T$ (we explain these connections in more detail in §5).
The second dimension of contribution lies in the analysis domain. For the class of non-stationary reward distributions described above, we establish lower bounds on the performance of any non-anticipating policy relative to the dynamic oracle, and show that these bounds can be achieved, uniformly over the class of admissible reward distributions, by a suitable policy construction. The term "achieved" is meant in the sense of the order of the regret as a function of the time horizon $T$, the variation budget $V_T$, and the number of arms $K$. More precisely, our policies are shown to be minimax optimal up to a term that is logarithmic in the number of arms, and the regret is sublinear and is of the order of $(K V_T)^{1/3} T^{2/3}$. Auer et al. (2002), in the adversarial setting, and Garivier and Moulines (2011), in the stochastic setting, considered non-stationary rewards where the identity of the best arm can change a finite number of times; the regret in these instances (relative to a dynamic oracle) is shown to be of order $\sqrt{T}$ (up to logarithmic terms). Our analysis complements these results by treating a broader and more flexible class of temporal changes in the reward distributions, yet still establishing optimality results and showing that sublinear regret is achievable. When $V_T$ increases with the time horizon $T$, our results provide a spectrum of orders of the minimax regret ranging between order $T^{2/3}$ (when $V_T$ is a constant independent of $T$) and order $T$ (when $V_T$ grows linearly with $T$), mapping allowed variation to best achievable performance.
With the analysis described above we shed light on the exploration-exploitation trade-off that is a characteristic of the non-stationary reward setting, and the change in this trade-off compared to the stationary setting. In particular, our results highlight the tension that exists between the need to "remember" and "forget." This is characteristic of several algorithms that have been developed in the adversarial MAB literature, e.g., the family of exponential weight methods such as EXP3, EXP3.S, and the like; see, e.g., Auer et al. (2002) and Cesa-Bianchi and Lugosi (2006). In a nutshell, the fewer past observations one retains, the larger the stochastic error associated with one's estimates of the mean rewards, while at the same time using more past observations increases the risk of these being biased.
One interesting observation drawn in this paper is a connection between the adversarial MAB setting and the non-stationary environment studied here. In particular, as in Besbes et al. (2013), it is seen that a near-optimal policy in the adversarial setting may be suitably calibrated to perform near-optimally in the non-stationary stochastic setting. This will be further discussed after the main results are established.
Structure of the paper. §2 introduces the basic formulation of the stochastic non-stationary MAB problem. In §3 we provide a lower bound on the regret that any admissible policy must incur relative to a dynamic oracle. §4 introduces a policy that achieves that lower bound. §5 contains a brief discussion. Proofs can be found in the Appendix.
2 Problem Formulation
Let $\mathcal{K} = \{1, \ldots, K\}$ be a set of arms. Let $\mathcal{T} = \{1, \ldots, T\}$ denote the sequence of decision epochs faced by the decision maker. At any epoch $t \in \mathcal{T}$, a decision-maker pulls one of the $K$ arms. When pulling arm $k \in \mathcal{K}$ at epoch $t \in \mathcal{T}$, a reward $X_t^k \in [0,1]$ is obtained, where $X_t^k$ is a random variable with expectation $\mu_t^k = \mathbb{E}\left[X_t^k\right]$. We denote the best possible expected reward at decision epoch $t$ by $\mu_t^*$, i.e.,
$$\mu_t^* = \max_{k \in \mathcal{K}} \left\{ \mu_t^k \right\}, \qquad t \in \mathcal{T}.$$
Changes in the expected rewards of the arms. We assume the expected reward of each arm may change at any decision point. We denote by $\mu^k$ the sequence of expected rewards of arm $k$: $\mu^k = \{\mu_t^k\}_{t=1}^{T}$. In addition, we denote by $\mu$ the sequence of vectors of all $K$ expected rewards: $\mu = \{\mu_t\}_{t=1}^{T}$, where $\mu_t = \left(\mu_t^1, \ldots, \mu_t^K\right)$. We assume that the expected reward of each arm can change an arbitrary number of times, but bound the total variation of the expected rewards:
$$\sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right|.$$
Let $\{V_t : t = 1, 2, \ldots\}$ be a non-decreasing sequence of positive real numbers such that $V_1 = 0$ and $K V_t \le t$ for all $t$. We refer to $V_T$ as the variation budget over $\mathcal{T}$. We define the corresponding temporal uncertainty set $\mathcal{V}$ as the set of reward vector sequences that are subject to the variation budget $V_T$ over the set of decision epochs $\{1, \ldots, T\}$:
$$\mathcal{V} = \left\{ \mu \in [0,1]^{K \times T} \,:\, \sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right| \le V_T \right\}.$$
The variation budget captures the constraint imposed on the non-stationary environment faced by the decision-maker. While limiting the possible evolution in the environment, it allows for many different forms in which the expected rewards may change: continuously, in discrete shocks, and with a changing rate (for illustration, Figure 1 depicts two different variation patterns that correspond to the same variation budget). In general, the variation budget is designed to depend on the number of pulls $T$.
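To make the budget concrete, the sketch below (in the spirit of Figure 1) computes the total variation $\sum_{t} \sup_k |\mu_t^k - \mu_{t+1}^k|$ of two reward paths, one spending the budget continuously and one spending it in a single discrete shock; the horizon and amplitudes are illustrative choices of ours, not values from the text:

```python
import numpy as np

def variation(mu):
    # mu: (T, K) array of expected rewards; total variation as defined in the text:
    # sum over epochs of the largest one-step change across arms
    return np.abs(np.diff(mu, axis=0)).max(axis=1).sum()

T, K = 300, 2
t = np.arange(T)
# pattern 1: the budget is spent continuously (slow sinusoid around 1/2)
smooth = np.stack([0.5 + 0.1 * np.sin(2 * np.pi * t / T),
                   0.5 - 0.1 * np.sin(2 * np.pi * t / T)], axis=1)
V = variation(smooth)
# pattern 2: the same budget is spent in one discrete shock
shock = np.full((T, K), 0.5)
shock[T // 2:, 0] += V          # stays in [0, 1] since V < 0.5 here
print(variation(smooth), variation(shock))   # both equal V
```

Both paths exhaust the same budget $V$, yet induce very different environments, which is exactly the flexibility the uncertainty set is meant to allow.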
Admissible policies, performance, and regret. Let $U$ be a random variable defined over a probability space $\left(\mathbb{U}, \mathcal{U}, \mathbf{P}_u\right)$. Let $\pi_1 : \mathbb{U} \to \mathcal{K}$ and $\pi_t : [0,1]^{t-1} \times \mathbb{U} \to \mathcal{K}$ for $t = 2, 3, \ldots$ be measurable functions. With some abuse of notation we denote by $\pi_t \in \mathcal{K}$ the action at time $t$, that is given by
$$\pi_t = \begin{cases} \pi_1(U) & t = 1, \\ \pi_t\left(X_{t-1}^{\pi}, \ldots, X_1^{\pi}, U\right) & t = 2, 3, \ldots \end{cases}$$
The mappings $\{\pi_t : t = 1, \ldots, T\}$, together with the distribution $\mathbf{P}_u$, define the class of admissible policies. We denote this class by $\mathcal{P}$. We further denote by $\{\mathcal{F}_t : t = 1, \ldots, T\}$ the filtration associated with a policy $\pi \in \mathcal{P}$, such that $\mathcal{F}_1 = \sigma(U)$ and $\mathcal{F}_t = \sigma\left(\{X_j^{\pi}\}_{j=1}^{t-1}, U\right)$ for all $t \ge 2$. Note that policies in $\mathcal{P}$ are non-anticipating, i.e., depend only on the past history of actions and observations, and allow for randomized strategies via their dependence on $U$.
We define the regret under policy $\pi \in \mathcal{P}$ compared to a dynamic oracle as the worst-case difference between the expected performance of pulling at each epoch $t$ the arm which has the highest expected reward at epoch $t$ (the dynamic oracle performance) and the expected performance under policy $\pi$:
$$\mathcal{R}^{\pi}(\mathcal{V}, T) = \sup_{\mu \in \mathcal{V}} \left\{ \sum_{t=1}^{T} \mu_t^* - \mathbb{E}^{\pi} \left[ \sum_{t=1}^{T} \mu_t^{\pi_t} \right] \right\},$$
where the expectation $\mathbb{E}^{\pi}[\cdot]$ is taken with respect to the noisy rewards, as well as to the policy's actions. In addition, we denote by $\mathcal{R}^*(\mathcal{V}, T)$ the minimal worst-case regret that can be guaranteed by an admissible policy $\pi \in \mathcal{P}$:
$$\mathcal{R}^*(\mathcal{V}, T) = \inf_{\pi \in \mathcal{P}} \mathcal{R}^{\pi}(\mathcal{V}, T).$$
$\mathcal{R}^*(\mathcal{V}, T)$ is the best achievable performance. In the following sections we study the magnitude of $\mathcal{R}^*(\mathcal{V}, T)$ by establishing upper and lower bounds; in these bounds we refer to a constant as absolute if it is independent of $K$, $T$, and $V_T$.
3 Lower bound on the best achievable performance
We next provide a lower bound on the best achievable performance.
Theorem 1 (Lower bound). Assume that rewards have a Bernoulli distribution. Then, there is some absolute constant $C > 0$ such that for any policy $\pi \in \mathcal{P}$ and for any $T \ge 1$, $K \ge 2$, and $V_T \in \left[K^{-1}, K^{-1}T\right]$,
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \ge C \left(K V_T\right)^{1/3} T^{2/3}.$$
We note that when reward distributions are stationary, there are known policies such as UCB1 and $\varepsilon$-greedy (Auer et al. 2002) that achieve regret of order $\log T$ in the stochastic setup. When the environment is non-stationary and the reward structure is defined by the class $\mathcal{V}$, no policy may achieve such a performance, and any policy must incur a regret of at least order $(K V_T)^{1/3} T^{2/3}$. This additional complexity embedded in the stochastic non-stationary MAB problem compared to the stationary one will be further discussed in §5.
Remark 1 (Growing variation budget). Theorem 1 holds when $V_T$ is increasing with $T$. In particular, when the variation budget is linear in $T$, the regret grows linearly and long-run average optimality is not achievable. This also implies the observation of Slivkins and Upfal (2008) about linear regret in an instance in which expected rewards evolve according to a Brownian motion.
The driver of the change in the best achievable performance (relative to the one established in a stationary environment) is the optimal exploration-exploitation balance. Beyond the tension between exploring different arms and capitalizing on the information already collected, captured by the "classical" exploration-exploitation trade-off, a second tradeoff is introduced by the non-stationary environment, between "remembering" and "forgetting": estimating the expected rewards is done based on past observations of rewards. While keeping track of more observations may decrease the variance of mean reward estimates, the non-stationary environment implies that "old" information is potentially less relevant and creates a bias that stems from possible changes in the underlying rewards. The changing rewards give an incentive to dismiss old information, which in turn encourages enhanced exploration. The proof of Theorem 1 emphasizes these two tradeoffs and their impact on achievable performance. At a high level the proof of Theorem 1 builds on ideas of identifying a worst-case "strategy" of nature (e.g., Auer et al. 2002, proof of Theorem 5.1), adapting them to our setting. While the proof is deferred to the appendix, we next describe the key ideas.
Selecting a subset of feasible reward paths. We define a subset $\mathcal{V}' \subset \mathcal{V}$ of reward vector sequences and show that when $\mu$ is drawn randomly from $\mathcal{V}'$, any admissible policy must incur regret of order $(K V_T)^{1/3} T^{2/3}$. We define a partition of the decision horizon $\mathcal{T}$ into batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\tilde{\Delta}_T$ each (except, possibly, the last batch):
$$\mathcal{T}_j = \left\{ t \,:\, (j-1)\tilde{\Delta}_T < t \le \min\left\{ j \tilde{\Delta}_T, \, T \right\} \right\}, \qquad j = 1, \ldots, m,$$
where $m = \lceil T / \tilde{\Delta}_T \rceil$ is the number of batches. In $\mathcal{V}'$, in every batch there is exactly one "good" arm with expected reward $1/2 + \varepsilon$ for some $0 < \varepsilon \le 1/4$, and all the other arms have expected reward $1/2$. The "good" arm is drawn independently in the beginning of each batch according to a discrete uniform distribution over $\{1, \ldots, K\}$. Thus, the identity of the "good" arm can change only between batches. See Figure 2 for a description and a numeric example of possible realizations of a sequence that is randomly drawn from $\mathcal{V}'$.
Since there are $m$ batches we obtain a set $\mathcal{V}'$ of $K^m$ possible, equally probable realizations of $\mu$. By selecting $\varepsilon$ such that $\varepsilon T / \tilde{\Delta}_T \le V_T$, any $\mu \in \mathcal{V}'$ is composed of expected reward sequences with a variation of at most $V_T$, and therefore $\mathcal{V}' \subset \mathcal{V}$. Given the draws under which expected reward sequences are generated, nature prevents any accumulation of information from one batch to another, since at the beginning of each batch a new "good" arm is drawn independently of the history.
Countering possible policies. For the sake of simplicity, the discussion in this paragraph assumes a variation budget that is fixed and independent of $T$ (the proof of the theorem details the more general treatment for a variation budget that depends on $T$). The proof of Theorem 1 establishes that under the setting described above, if $\varepsilon \approx \sqrt{K / \tilde{\Delta}_T}$, no admissible policy can identify the "good" arm with high probability within a batch. Since there are $\tilde{\Delta}_T$ epochs in each batch, the regret that any policy must incur along a batch is of order $\tilde{\Delta}_T \cdot \varepsilon \approx \sqrt{K \tilde{\Delta}_T}$, which yields a regret of order $\sqrt{K / \tilde{\Delta}_T} \cdot T$ throughout the whole horizon. Selecting the smallest feasible $\tilde{\Delta}_T$ such that the variation budget constraint is satisfied leads to $\tilde{\Delta}_T \approx K^{1/3} (T / V_T)^{2/3}$, yielding a regret of order $(K V_T)^{1/3} T^{2/3}$ throughout the horizon.
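The batch-size calculus above can be summarized in one line (suppressing absolute constants):

```latex
\varepsilon \asymp \sqrt{K / \tilde{\Delta}_T},
\qquad
\varepsilon \cdot \frac{T}{\tilde{\Delta}_T} \le V_T
\;\Longrightarrow\;
\tilde{\Delta}_T \asymp K^{1/3} \left( \frac{T}{V_T} \right)^{2/3},
\qquad
\text{regret} \asymp \varepsilon T
\asymp \sqrt{\frac{K}{\tilde{\Delta}_T}} \, T
\asymp \left( K V_T \right)^{1/3} T^{2/3}.
```

The first relation is the largest gap nature can hide within a batch, the second is the variation budget constraint, and together they pin down both the batch size and the resulting regret rate.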
4 A near-optimal policy
In this section we apply the ideas underlying the lower bound in Theorem 1 to develop a rate optimal policy for the non-stationary MAB problem with a variation budget. Consider the following policy:
Rexp3. Inputs: a positive number $\gamma$, and a batch size $\Delta_T$.
1. Set batch index $j = 1$.
2. Repeat while $j \le \lceil T / \Delta_T \rceil$:
   - Set $\tau = (j - 1) \Delta_T$.
   - Initialization: for any $k \in \mathcal{K}$ set $w_{\tau+1}^k = 1$.
   - Repeat for $t = \tau + 1, \ldots, \min\left\{ T, \tau + \Delta_T \right\}$:
     - For each $k \in \mathcal{K}$, set
       $$p_t^k = (1 - \gamma) \frac{w_t^k}{\sum_{k'=1}^{K} w_t^{k'}} + \frac{\gamma}{K}.$$
     - Draw an arm $k'$ from $\mathcal{K}$ according to the distribution $\{p_t^k\}_{k=1}^{K}$.
     - Receive a reward $X_t^{k'}$.
     - For arm $k'$ set $\hat{X}_t^{k'} = X_t^{k'} / p_t^{k'}$, and for any $k \ne k'$ set $\hat{X}_t^k = 0$. For all $k \in \mathcal{K}$ update:
       $$w_{t+1}^k = w_t^k \exp\left\{ \frac{\gamma \hat{X}_t^k}{K} \right\}.$$
   - Set $j = j + 1$, and return to the beginning of step 2.
Clearly $\pi \in \mathcal{P}$. The Rexp3 policy uses Exp3, a policy introduced by Freund and Schapire (1997) for solving a worst-case sequential allocation problem, as a subroutine, restarting it every $\Delta_T$ epochs.
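The restart mechanism can be sketched as follows; this is a minimal Python rendering, where `draw_reward` is a stand-in environment interface of ours rather than anything from the text, and the parameter values in the usage line are arbitrary:

```python
import math
import random

def rexp3(T, K, gamma, batch_size, draw_reward):
    """Sketch of Rexp3: run Exp3 from a cold start in every batch of `batch_size` epochs."""
    actions = []
    t = 0
    while t < T:
        w = [1.0] * K                              # reset weights: "forget" past batches
        for _ in range(min(batch_size, T - t)):
            total = sum(w)
            p = [(1 - gamma) * wk / total + gamma / K for wk in w]
            k = random.choices(range(K), weights=p)[0]
            x = draw_reward(t, k)                  # observed reward in [0, 1]
            w[k] *= math.exp(gamma * (x / p[k]) / K)   # importance-weighted Exp3 update
            actions.append(k)
            t += 1
    return actions

# usage on a toy environment where arm 0 always pays 1
acts = rexp3(T=100, K=2, gamma=0.3, batch_size=20,
             draw_reward=lambda t, k: float(k == 0))
```

Wiping the weights at each restart is precisely the "forgetting" device: nothing learned in one batch influences the next.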
Theorem 2 (Upper bound). Let $\pi$ be the Rexp3 policy with a batch size $\Delta_T = \left\lceil (K \log K)^{1/3} (T / V_T)^{2/3} \right\rceil$ and with $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$. Then, there is some absolute constant $\bar{C}$ such that for every $T \ge 1$, $K \ge 2$, and $V_T \in \left[K^{-1}, K^{-1}T\right]$:
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \le \bar{C} \left( K \log K \cdot V_T \right)^{1/3} T^{2/3}.$$
Theorem 2 is obtained by establishing a connection between the regret relative to the single best action in the adversarial setting, and the regret with respect to the dynamic oracle in the non-stationary stochastic setting with a variation budget. Several classes of policies, such as exponential-weight policies (including Exp3) and polynomial-weight policies, have been shown to achieve regret of order $\sqrt{KT}$ (up to logarithmic terms) with respect to the single best action in the adversarial setting (see Auer et al. (2002) and Cesa-Bianchi and Lugosi (2006) for a review). While in general these policies tend to perform well numerically, there is no guarantee for their performance with respect to the dynamic oracle studied in this paper (see also Hartland et al. (2006) for a study of the empirical performance of one class of algorithms), since the single best action itself may incur linear (with respect to $T$) regret relative to the dynamic oracle. The proof of Theorem 2 shows that any policy that achieves regret of order $\sqrt{KT}$ with respect to the single best action in the adversarial setting can be used as a subroutine to obtain near-optimal performance with respect to the dynamic oracle in our setting.
Rexp3 emphasizes the two tradeoffs discussed in the previous section. The first tradeoff, information acquisition versus capitalizing on existing information, is captured by the subroutine policy Exp3. In fact, any policy that achieves a good performance compared to a single best action benchmark in the adversarial setting must balance exploration and exploitation, and therefore the loss incurred by experimenting on sub-optimal arms is indeed balanced with the gain of better estimation of expected rewards. The second tradeoff, "remembering" versus "forgetting," is captured by restarting Exp3 and forgetting any acquired information every $\Delta_T$ pulls. Thus, old information that may slow down the adaptation to the changing environment is being discarded.
Hence, we have quantified the impact of the extent of change in the environment on the best achievable performance in this broad class of problems. For example, for the case in which $V_T = C \cdot T^{\beta}$ for some absolute constant $C > 0$ and $0 \le \beta < 1$, the best achievable regret is of order $T^{(2+\beta)/3}$.
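The arithmetic behind this example is immediate: plugging $V_T = C \cdot T^{\beta}$ into the minimax rate gives

```latex
\left( K V_T \right)^{1/3} T^{2/3}
= K^{1/3} C^{1/3} \, T^{\beta/3} \, T^{2/3}
\;\asymp\; T^{(2+\beta)/3},
```

which sweeps from $T^{2/3}$ at $\beta = 0$ to linear regret as $\beta \to 1$.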
4.1 Numerical Results
We illustrate the upper bound on the regret by a numerical experiment that measures the average regret that is incurred by Rexp3, in the presence of changing environments.
Setup. We consider instances where two arms are available: $\mathcal{K} = \{1, 2\}$. The reward $X_t^k$ associated with arm $k$ at epoch $t$ has a Bernoulli distribution with a changing expectation $\mu_t^k$:
$$X_t^k = \begin{cases} 1 & \text{with probability } \mu_t^k, \\ 0 & \text{with probability } 1 - \mu_t^k, \end{cases}$$
for all $t \in \mathcal{T}$ and for any pulled arm $k \in \mathcal{K}$. The evolution patterns of $\mu_t^k$, $k = 1, 2$, will be specified below. At each epoch $t \in \mathcal{T}$ the policy selects an arm $k \in \mathcal{K}$. Then, the binary rewards are generated, and $X_t^k$ is observed. The pointwise regret that is incurred at epoch $t$ is $\mu_t^* - X_t^k$, where $\mu_t^* = \max\left\{\mu_t^1, \mu_t^2\right\}$. We note that while the pointwise regret at epoch $t$ is not necessarily positive, its expectation is. Summing over the whole horizon and replicating 20,000 times for each instance of changing rewards, the average regret approximates the expected regret compared to the dynamic oracle.
First stage (Fixed variation, different time horizons). The objective of the first part of the simulation is to measure the growth rate of the average regret incurred by the policy, as a function of the horizon length, under a fixed variation budget. We use two basic instances. In the first instance (displayed on the left side of Figure 1) the expected rewards of the two arms follow sinusoidal paths throughout the horizon. In the second instance (depicted on the right side of Figure 1) a similar sinusoidal evolution of the expected rewards is "compressed" into the first third of the horizon, while in the rest of the horizon the expected rewards remain fixed. Both instances describe different changing environments under the same (fixed) variation budget $V_T$. While in the first instance the variation budget is spent throughout the whole horizon, in the second one the same variation budget is spent only over the first third of the horizon. For different values of $T$ (between 3,000 and 40,000) and for both variation instances we estimated the regret through 20,000 replications (the average performance trajectories of Rexp3 are depicted in the upper-left and upper-right plots of Figure 3).
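A compressed version of this experiment can be reproduced with the sketch below; the sinusoidal amplitudes, horizon, replication count, and tuning parameters are illustrative assumptions of ours (the paper's exact instances and 20,000 replications are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_rexp3(mu, gamma, batch):
    """One trajectory of restarted Exp3 on a (T, K) matrix of Bernoulli means;
    returns the realized cumulative reward."""
    T, K = mu.shape
    total_reward = 0.0
    t = 0
    while t < T:
        w = np.ones(K)                            # fresh weights each batch
        for _ in range(min(batch, T - t)):
            p = (1 - gamma) * w / w.sum() + gamma / K
            p = p / p.sum()                       # guard against float drift
            k = rng.choice(K, p=p)
            x = float(rng.random() < mu[t, k])    # Bernoulli reward
            w[k] *= np.exp(gamma * (x / p[k]) / K)
            total_reward += x
            t += 1
    return total_reward

T, K = 2000, 2
t = np.arange(T)
# illustrative sinusoidal instance; amplitudes are our assumption, not the paper's
mu = np.stack([0.5 + 0.2 * np.sin(4 * np.pi * t / T),
               0.5 - 0.2 * np.sin(4 * np.pi * t / T)], axis=1)
oracle = mu.max(axis=1).sum()                     # dynamic oracle: best arm each epoch
reps = 50
avg = np.mean([run_rexp3(mu, gamma=0.1, batch=200) for _ in range(reps)])
print(oracle - avg)                               # averaged regret vs the dynamic oracle
```

Averaging realized rewards over replications and subtracting from the dynamic-oracle total is exactly the estimator described in the setup.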
Discussion of the first stage. The first part of the simulation illustrates the decision process of the policy, as well as the order of the growth rate of the regret. The upper parts of Figure 3 describe the performance trajectory of the policy. One may observe that the policy identifies the arm with the higher expected rewards, and selects it with higher probability. The Rexp3 policy adjusts to changes in the expected rewards and updates the probabilities of selecting each arm according to the received rewards. While the policy adapts quickly to the changes in the expected rewards (and in the identity of the "better" arm), it keeps experimenting with the sub-optimal arm (the policy's trajectory does not reach that of the dynamic oracle). The Rexp3 policy balances the remembering-forgetting tradeoff using the restarting points, occurring every $\Delta_T$ epochs. The exploration-exploitation tradeoff is balanced throughout each batch by the subroutine policy Exp3. While Exp3 explores over an order of $\sqrt{K \Delta_T}$ epochs in each batch, restarting it every $\Delta_T$ epochs ($V_T$ is fixed, therefore one has an order of $T^{1/3}$ batches, each batch with an order of $T^{2/3}$ epochs) yields an exploration rate of order $T^{2/3}$.
The lower-left and lower-right parts of Figure 3 show plots of the natural logarithm of the averaged regret as a function of the natural logarithm of the horizon length. The standard errors of all data points in these log-log plots are negligible. These plots detail the linear dependence between the natural logarithm of the averaged regret and the natural logarithm of $T$. In both cases the slope of the linear fit for increasing values of $T$ supports the $T^{2/3}$ dependence of the minimax regret.
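The slope estimation in these log-log plots amounts to a least-squares fit; the regret values below are synthetic stand-ins generated from an exact $T^{2/3}$ power law (the constant 1.8 is arbitrary), so the fitted slope recovers $2/3$:

```python
import numpy as np

Ts = np.array([3000, 5000, 10000, 20000, 40000])
regret = 1.8 * Ts ** (2 / 3)        # synthetic power law, stand-in for measured averages
slope, intercept = np.polyfit(np.log(Ts), np.log(regret), 1)
print(round(slope, 3))              # → 0.667
```

With measured (noisy) averages in place of the synthetic values, the same fit yields the empirical slopes reported in the figures.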
Second stage (Increasing the variation). The objective of the second part of the simulation is to measure how the growth rate of the averaged regret (as a function of $T$) established in the first part changes when the variation increases. For this purpose we used a variation budget of the form $V_T = C \cdot T^{\beta}$. Using the first instance of sinusoidal variation, we repeated the first step for different values of $\beta$ between $0$ (implying a constant variation, which was simulated in the first stage) and $1$ (implying linear variation). The upper plots of Figure 4 depict the average performance trajectories of the Rexp3 policy under different variation budgets. The different slopes, representing different growth rates of the regret for different values of $\beta$, appear in the table and the plot at the bottom of Figure 4.
Discussion of the second stage. The second part of the simulation illustrates the way variation affects the policy decision process and the minimax regret. Since $\Delta_T$ is of order $(T / V_T)^{2/3}$, holding $T$ fixed and increasing $V_T$ affects the decision process and in particular the batch size of the policy. This is illustrated in the top plots of Figure 4. The slopes that were estimated for each value of $\beta$ (in the variation structure $V_T = C \cdot T^{\beta}$) ranging from $0$ to $1$, describing the linear log-log dependencies (the case of $\beta = 0$ is already depicted in the bottom-left plot of Figure 3), are summarized in Table 1.
The bottom part of Figure 4 shows the slope of the linear fit between the data points of Table 1. It illustrates the growth rate of the regret when the variation (as a function of $T$) increases, supports the $T^{(2+\beta)/3}$ dependence of the minimax regret, and emphasizes the full spectrum of minimax regret rates (of order $T^{(2+\beta)/3}$ for $0 \le \beta \le 1$) that are obtained for different variation levels.
Contrasting with traditional (stationary) MAB problems.
The tight bounds that were established on the minimax regret in our stochastic non-stationary MAB problem allow one to quantify the "price of non-stationarity," which mathematically captures the added complexity embedded in changing rewards versus stationary ones. While Theorem 1 and Theorem 2 together characterize a minimax regret of order $(K V_T)^{1/3} T^{2/3}$ (up to a logarithmic term in $K$), the characterized minimax regret in the stationary stochastic setting is of order $\log T$ in the case where rewards are guaranteed to be "well separated" one from the other, and of order $\sqrt{KT}$ when expected rewards can be arbitrarily close to each other (see Lai and Robbins (1985) and Auer et al. (2002) for more details). Contrasting the different regret growth rates quantifies the "price," in terms of best achievable performance, of non-stationary rewards compared to stationary ones, as a function of the variation that is allowed in the non-stationary case. Clearly, this comparison shows that additional complexity is introduced even when the allowed variation is fixed and independent of the horizon length.
Contrasting with other non-stationary MAB instances.
The class of MAB problems with non-stationary rewards that is formulated in the current chapter extends other MAB formulations that allow rewards to change in a more structured manner. We already discussed in Remark 1 the consistency of our results (in the case where the variation budget grows linearly with the time horizon) with the setting treated in Slivkins and Upfal (2008), where rewards evolve according to a Brownian motion and hence the regret is linear in $T$. Two other representative studies are those of Garivier and Moulines (2011), who study a stochastic MAB problem in which expected rewards may change a finite number of times, and Auer et al. (2002), who formulate an adversarial MAB problem in which the identity of the best arm may change a finite number of times. Both studies suggest policies that, utilizing the prior knowledge that the number of changes must be finite, achieve regret of order $\sqrt{T}$ (up to logarithmic terms) relative to the best sequence of actions. However, the performance of these policies can deteriorate to regret that is linear in $T$ when the number of changes is allowed to depend on $T$. When there is a finite variation ($V_T$ is fixed and independent of $T$) but not necessarily a finite number of changes, we establish that the best achievable performance deteriorates to regret of order $T^{2/3}$. In that respect, it is not surprising that the "hard case" used to establish the lower bound in Theorem 1 describes a strategy of nature that allocates the allowed variation over a large (as a function of $T$) number of changes in the expected rewards.
Appendix A Proofs
Proof of Theorem 1. At a high level the proof adapts a general approach of identifying a worst-case nature "strategy" (see the proof of Theorem 5.1 in Auer et al. (2002), which analyzes the worst-case regret relative to a single best action benchmark in a fully adversarial environment), extending these ideas appropriately to our setting. Fix $T \ge 1$, $K \ge 2$, and $V_T \in \left[K^{-1}, K^{-1}T\right]$. In what follows we restrict nature to the class $\mathcal{V}' \subset \mathcal{V}$ that was described in §3, and show that when $\mu$ is drawn randomly from $\mathcal{V}'$, any policy in $\mathcal{P}$ must incur regret of order $(K V_T)^{1/3} T^{2/3}$.
Step 1 (Preliminaries). Define a partition of the decision horizon $\mathcal{T}$ into $m = \lceil T / \tilde{\Delta}_T \rceil$ batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\tilde{\Delta}_T$ each (except perhaps $\mathcal{T}_m$) according to (2). For some $\varepsilon > 0$ that will be specified shortly, define $\mathcal{V}'$ to be the set of reward vector sequences $\mu$ such that:
- $\mu_t^k \in \left\{ 1/2, \, 1/2 + \varepsilon \right\}$ for all $k \in \mathcal{K}$, $t \in \mathcal{T}$;
- in every batch $\mathcal{T}_j$ there is exactly one arm $k$ with $\mu_t^k = 1/2 + \varepsilon$ for all $t \in \mathcal{T}_j$.
For each sequence in $\mathcal{V}'$, in any epoch there is exactly one arm with expected reward $1/2 + \varepsilon$, the rest of the arms have expected reward $1/2$, and expected rewards cannot change within a batch. Let $\varepsilon \le V_T \tilde{\Delta}_T / T$. Then, for any $\mu \in \mathcal{V}'$ one has:
$$\sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right| \le \varepsilon \cdot \frac{T}{\tilde{\Delta}_T} \le V_T,$$
where the first inequality follows from the structure of $\mathcal{V}'$ (expected rewards change only between batches, and each change is of magnitude at most $\varepsilon$). Therefore, $\mathcal{V}' \subset \mathcal{V}$.
Step 2 (Single batch analysis). Fix some policy $\pi$, and fix a batch $j \in \{1, \ldots, m\}$. Let $k_j$ denote the "good" arm of batch $j$. We denote by $\mathbb{P}_j^k$ the probability distribution conditioned on arm $k$ being the "good" arm in batch $j$, and by $\mathbb{P}_0$ the probability distribution with respect to random rewards with expected reward $1/2$ for each arm. We further denote by $\mathbb{E}_j^k$ and $\mathbb{E}_0$ the respective expectations. Assuming binary rewards, we let $X$ denote the vector of rewards observed in batch $j$. We denote by $N_j^k$ the number of times arm $k$ was selected in batch $j$. In the proof we use Lemma A.1 from Auer et al. (2002), which characterizes the difference between the two different expectations of some function of the observed rewards vector:

Lemma 1. Let $f : \{0,1\}^{\tilde{\Delta}_T} \to [0, M]$ be a bounded real function. Then, for any $k \in \mathcal{K}$:
$$\mathbb{E}_j^k\left[ f(X) \right] - \mathbb{E}_0\left[ f(X) \right] \le \frac{M}{2} \sqrt{ - \mathbb{E}_0\left[ N_j^k \right] \log\left( 1 - 4\varepsilon^2 \right) }.$$
Recalling that $k_j$ denotes the "good" arm of batch $j$, one has
$$\mathbb{E}_j^{k_j}\left[ \sum_{t \in \mathcal{T}_j} \left( \mu_t^* - \mu_t^{\pi_t} \right) \right] = \varepsilon \left( |\mathcal{T}_j| - \mathbb{E}_j^{k_j}\left[ N_j^{k_j} \right] \right).$$
In addition, applying Lemma 1 with $f(X) = N_j^k$ (clearly $N_j^k \le \tilde{\Delta}_T$) we have:
$$\mathbb{E}_j^k\left[ N_j^k \right] \le \mathbb{E}_0\left[ N_j^k \right] + \frac{\tilde{\Delta}_T}{2} \sqrt{ - \mathbb{E}_0\left[ N_j^k \right] \log\left( 1 - 4\varepsilon^2 \right) }.$$
Summing over arms, one has:
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_j^k\left[ N_j^k \right] \le \frac{\tilde{\Delta}_T}{K} + \frac{\tilde{\Delta}_T}{2K} \sqrt{ - \log\left( 1 - 4\varepsilon^2 \right) } \sum_{k=1}^{K} \sqrt{ \mathbb{E}_0\left[ N_j^k \right] } \overset{(a)}{\le} \frac{\tilde{\Delta}_T}{K} + \frac{\tilde{\Delta}_T}{2K} \sqrt{ - \log\left( 1 - 4\varepsilon^2 \right) K \tilde{\Delta}_T } \overset{(b)}{\le} \frac{\tilde{\Delta}_T}{K} + 2 \varepsilon \tilde{\Delta}_T \sqrt{ \frac{ \log(4/3) \, \tilde{\Delta}_T }{ K } }$$
for any $0 < \varepsilon \le 1/4$, where: (a) holds since $\sum_{k=1}^{K} \mathbb{E}_0[N_j^k] = \tilde{\Delta}_T$, and thus by the Cauchy-Schwarz inequality $\sum_{k=1}^{K} \sqrt{\mathbb{E}_0[N_j^k]} \le \sqrt{K \tilde{\Delta}_T}$; and (b) holds since $-\log(1 - x) \le 4 \log(4/3) \, x$ for all $x \in [0, 1/4]$, applied with $x = 4\varepsilon^2$.
Step 3 (Regret along the horizon). Let $\mu$ be a random sequence of expected reward vectors, in which in every batch the "good" arm is drawn according to an independent uniform distribution over the set $\mathcal{K}$. Clearly, every realization of $\mu$ is in $\mathcal{V}'$. In particular, taking expectation over $\mu$ and applying the two bounds of Step 2, one has:
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \ge \frac{1}{K} \sum_{j=1}^{m} \sum_{k=1}^{K} \varepsilon \left( |\mathcal{T}_j| - \mathbb{E}_j^k\left[ N_j^k \right] \right) \ge \varepsilon T \left( 1 - \frac{1}{K} - 2 \varepsilon \sqrt{ \frac{ \log(4/3) \, \tilde{\Delta}_T }{ K } } \right) - \varepsilon \tilde{\Delta}_T.$$
On the other hand, if $\varepsilon = \frac{1}{4} \min\left\{ 1, \sqrt{K / \tilde{\Delta}_T} \right\}$, one has $2 \varepsilon \sqrt{ \log(4/3) \tilde{\Delta}_T / K } \le 1/2$, and therefore
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \ge C_0 \, \varepsilon T \ge C_0' \sqrt{ \frac{K}{\tilde{\Delta}_T} } \, T$$
for absolute constants $C_0, C_0' > 0$, where the last two inequalities hold by $K \ge 2$ and $\tilde{\Delta}_T \ge K$. Thus, since the choice $\tilde{\Delta}_T = \left\lceil K^{1/3} (T / V_T)^{2/3} \right\rceil$ guarantees that the variation budget constraint of Step 1 is satisfied, we have established that:
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \ge C \left( K V_T \right)^{1/3} T^{2/3}$$
for some absolute constant $C > 0$.
This concludes the proof.
Proof of Theorem 2. The structure of the proof is as follows. First, breaking the decision horizon into a sequence of batches of size $\Delta_T$ each, we analyze the difference in performance between the single best action and the performance of the dynamic oracle in a single batch. Then, we plug in a known performance guarantee for Exp3 relative to the single best action in the adversarial setting, and sum over batches to establish the regret of Rexp3 with respect to the dynamic oracle.
Step 1 (Preliminaries). Fix $T \ge 1$, $K \ge 2$, and $V_T \in \left[K^{-1}, K^{-1}T\right]$. Let $\pi$ be the Rexp3 policy described in §4, tuned by $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$ and a batch size $\Delta_T$ (to be specified later on). We break the horizon $\mathcal{T}$ into a sequence of batches $\mathcal{T}_1, \ldots, \mathcal{T}_m$ of size $\Delta_T$ each (except, possibly, $\mathcal{T}_m$) according to (2). Let $\mu \in \mathcal{V}$, and fix $j \in \{1, \ldots, m\}$. We decompose the regret in batch $j$:
$$\mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \left( \mu_t^* - \mu_t^{\pi_t} \right) \right] = \underbrace{ \sum_{t \in \mathcal{T}_j} \mu_t^* - \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] }_{J_{1,j}} + \underbrace{ \mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] - \mathbb{E}\left[ \sum_{t \in \mathcal{T}_j} \mu_t^{\pi_t} \right] }_{J_{2,j}}.$$
The first component, $J_{1,j}$, corresponds to the expected loss associated with using a single action over batch $j$. The second component, $J_{2,j}$, corresponds to the expected regret with respect to the best static action in batch $j$.
Step 2 (Analysis of $J_{1,j}$ and $J_{2,j}$). Defining $V_j = \sum_{t \in \mathcal{T}_j} \sup_{k \in \mathcal{K}} \left| \mu_t^k - \mu_{t+1}^k \right|$ for all $j \in \{1, \ldots, m\}$, we denote by $V_j$ the variation in expected rewards along batch $\mathcal{T}_j$. We note that
$$\sum_{j=1}^{m} V_j \le V_T.$$
Let $k_j$ be an arm with the best expected performance (the best static strategy) over batch $\mathcal{T}_j$, i.e., $k_j \in \arg\max_{k \in \mathcal{K}} \left\{ \sum_{t \in \mathcal{T}_j} \mu_t^k \right\}$. Then,
$$\mathbb{E}\left[ \max_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}_j} X_t^k \right] \ge \sum_{t \in \mathcal{T}_j} \mu_t^{k_j},$$
and therefore, one has:
$$J_{1,j} \le \sum_{t \in \mathcal{T}_j} \left( \mu_t^* - \mu_t^{k_j} \right) \overset{(a)}{\le} 2 V_j \Delta_T$$
for any $j \in \{1, \ldots, m\}$ and $\mu \in \mathcal{V}$, where (a) holds by the following argument: otherwise there is an epoch $t_0 \in \mathcal{T}_j$ for which $\mu_{t_0}^* - \mu_{t_0}^{k_j} > 2 V_j$. Indeed, let $k^* = \arg\max_{k \in \mathcal{K}} \mu_{t_0}^k$. In such a case, for all $t \in \mathcal{T}_j$ one has $\mu_t^{k^*} > \mu_t^{k_j}$, since $V_j$ is the maximal variation in batch $\mathcal{T}_j$. This, however, implies that the expected reward of $k_j$ is dominated by the expected reward of another arm throughout the whole batch, and contradicts the optimality of $k_j$.
In addition, Corollary 3.2 in Auer et al. (2002) points out that the regret with respect to the single best action of the batch that is incurred by Exp3 with the tuning parameter $\gamma = \min\left\{ 1, \sqrt{\frac{K \log K}{(e-1)\Delta_T}} \right\}$ is bounded by $2\sqrt{e-1} \sqrt{\Delta_T K \log K}$. Therefore, for each $j \in \{1, \ldots, m\}$ one has
$$J_{2,j} \le 2\sqrt{e-1} \sqrt{\Delta_T K \log K}$$
for any $\mu \in \mathcal{V}$, since within each batch arms are pulled according to Exp3($\gamma$).
Step 3 (Regret throughout the horizon). Summing over the $m = \lceil T / \Delta_T \rceil$ batches we have:
$$\mathcal{R}^{\pi}(\mathcal{V}, T) \overset{(a)}{\le} \sum_{j=1}^{m} \left( J_{1,j} + J_{2,j} \right) \overset{(b)}{\le} 2 \Delta_T V_T + \left( \frac{T}{\Delta_T} + 1 \right) 2\sqrt{e-1} \sqrt{\Delta_T K \log K},$$
where: (a) holds by (5), (