
A Regret bound for Non-stationary Multi-Armed Bandits with Fairness Constraints

The multi-armed bandit framework is the most common platform for studying strategies for sequential decision-making problems. Recently, the notion of fairness has attracted a lot of attention in the machine learning community. One can impose the fairness condition that at any given point of time, even during the learning phase, a poorly performing candidate should not be preferred over a better candidate. This fairness constraint is known to be one of the most stringent and has been studied in the stochastic multi-armed bandit framework in a stationary setting, for which regret bounds have been established. The main aim of this paper is to study this problem in a non-stationary setting. We present a new algorithm called Fair Upper Confidence Bound with Exploration (Fair-UCBe) for solving a slowly varying stochastic k-armed bandit problem. With this we present two results: (i) Fair-UCBe indeed satisfies the above-mentioned fairness condition, and (ii) it achieves a regret bound of O(k^{3/2} T^{1 - α/2} √(log T)), for some suitable α ∈ (0, 1), where T is the time horizon. To the best of our knowledge, this is the first fair algorithm with a sublinear regret bound applicable to non-stationary bandits. We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.





1. Introduction

Multi-armed bandits and other related frameworks for studying sequential decision-making problems have proved useful in a wide variety of practical applications. For example, bandit formulations have been used in healthcare for modelling treatment allocation (Villar et al., 2015; Durand et al., 2018), for studying influence in social networks (Wu et al., 2019; Wen et al., 2017), and in recommendation systems (Zhou et al., 2017). The present paper deals with incorporating fairness conditions in multi-armed bandit problems where the underlying environment is non-stationary.

How can a bandit algorithm be ‘unfair’? In the classic stochastic k-armed bandit problem, at each time step, an agent has to choose one out of k arms. When an arm is chosen, the agent receives a real-valued reward sampled from a probability distribution corresponding to the chosen arm. The goal of the agent is to maximize the expected reward obtained over some time horizon T. For this, the learning algorithm has to initially try out each arm to get an idea of its corresponding reward distribution. This is referred to as the exploration phase. Once the agent gathers enough information about the reward distribution of each arm, it can then make an informed decision and choose among the arms in such a way as to maximize the rewards obtained. The performance of the agent is usually measured with a notion of regret. This is defined as the expected difference in the rewards obtained if the agent follows the optimal policy of choosing the best arm versus the policy actually followed by the agent.

In some socially relevant practical problems of sequential decision making, judging an algorithm solely based on regret may not be enough. The reason is that regret only captures expected returns and says nothing about how decisions are taken, especially during the initial learning phase. Is it OK to be ‘unfair’ to a better candidate just because the learning agent or algorithm is still trying to learn? For example, selecting a poorly performing arm a constant number of times does not affect the asymptotic regret achieved by the algorithm, but this behavior would still be unfair to better performing arms, which have been ignored either deliberately or due to carelessness during the learning phase.

To deal with such issues, one can enforce some well-defined fairness constraints on a learning algorithm in addition to the goal of minimizing regret. Various definitions of fairness motivated by real-world applications have been studied in the context of stationary multi-armed bandit problems (Joseph et al., 2016; Li et al., 2020). Intuitively, these fairness conditions insist that any algorithm solving the bandit problem should consider all the arms and their rewards and should be ‘fair’ when selecting the arms. This might be an important requirement in many real-world applications, especially those involving humans. The learning algorithms should be designed such that they do not give rise to decisions that unduly discriminate between different individuals or groups of people.

One notion of fairness is what could be considered equality of opportunity, in which arms with similar reward distributions are given a similar opportunity of being chosen at any given time. For instance, with high probability (at least 1 − δ), at each time step, an arm can be assigned a higher probability of being chosen than another arm only if the expected reward of the former is strictly greater than that of the latter (Joseph et al., 2016). This notion of δ-fairness is what is considered in this paper when referring to the fairness of the proposed algorithm.

Most standard solutions to stochastic multi-armed bandit problems assume that the rewards are generated independently from the reward distributions of the arms, and that these distributions remain fixed over the entire time horizon T. However, in many practical problems, the underlying environment cannot be expected to remain fixed. This leads to multi-armed bandit problems where the reward distributions may change at each time step, and it requires learning algorithms that can cope with different kinds of changes in the environment. For example, a bandit algorithm for a recommendation system should be able to handle a change in user preferences over time (Zeng et al., 2016).

If there are no statistical assumptions or constraints on the rewards corresponding to any arm at any time step, the problem becomes what is referred to as the adversarial bandit problem (Auer et al., 1995). This problem is difficult to solve under the classic notion of regret since any information obtained about the reward distribution of an arm at a certain time becomes useless in the next time step, which means that an arm that is optimal at one time step need not be optimal later. However, this setting can still be studied and solved with respect to a weaker notion of regret. Here, the regret of an algorithm is measured against a policy that is allowed to choose the same arm during all time steps.

Since the adversarial setting is too general for obtaining good results with respect to the standard form of regret, other variations of non-stationary bandit problems have been extensively studied, which constrain how the reward distributions of the arms change as time passes. One possible constraint is a bound on the absolute change in the expected rewards of all arms at each time step (Wei and Srivatsva, 2018). In this paper, we consider a variant of this slowly varying environment.

Since existing fair algorithms assume a stationary environment, their fairness guarantees do not hold when the stationarity assumptions are no longer true. Hence, modifying these algorithms to respect the fairness constraints in a non-stationary environment is non-trivial. In this work, we address the problem of satisfying fairness constraints in a slowly varying non-stationary environment.


In the literature, fair bandit algorithms have not been studied in the non-stationary setting. The main contribution of this paper is a fair UCB-like algorithm for solving a non-stationary stochastic multi-armed bandit problem. The environment considered is a slowly varying environment. We prove that the proposed algorithm is -fair (the fairness condition considered in (Joseph et al., 2016)) and achieves a regret of order for some and constant . As the non-stationarity of the environment considered is reduced, this regret bound approaches that achieved by a fair algorithm in the stationary setting, up to logarithmic factors.

2. Preliminaries

Consider a bandit with arms and a time horizon . At time , let the reward distribution of arm be on , with mean . Here, denotes the set and similarly denotes the set . Given history of arms chosen and rewards obtained till time , the agent chooses an arm at time by sampling an arm from the probability distribution , with the probability of choosing arm being . Let be the arm chosen at time , i.e, . Note that in the stationary case, the reward distribution and the mean will remain constant, that is and , for all , .

In this paper, we consider multi-armed bandits in a non-stationary setting, and hence, we assume that the means of the reward distributions change as time progresses. Our assumption can be stated as follows. We assume that there exists a known parameter κ such that for all , and all arms , , where and are the means of the reward distribution of arm at times and respectively. In other words, κ controls how much the mean of the reward distribution of an arm is allowed to change at each time step. Note that the bound on the change in the mean depends only on the horizon T and not on the current time step t.

In this paper, we consider the notion of δ-fairness introduced by Joseph et al. (2016). The intuition behind this definition of fairness is that at each time step, with a high probability of 1 − δ, arms with similar reward distributions should have a similar chance of being selected. In other words, at any point in time, for any pair of arms, the learning algorithm should give preference to one of the arms over the other only if it is ‘reasonably’ certain that its expected reward is strictly greater than that of the other. This can be stated as follows.

Definition 1 (δ-Fairness).

(Joseph et al., 2016) A multi-armed bandit algorithm is said to be δ-fair if, with probability at least 1 − δ, ,

where and denote the probability assigned by the algorithm to choose arm and respectively at time given history of arms chosen and rewards obtained till time , and and are the means of the rewards distributions of arms and respectively at time .

The dynamic regret achieved by a bandit algorithm is defined as

where is the mean of the reward distribution of arm at time , and is the arm selected by the algorithm at time . Using the dynamic regret as defined above, the performance of a bandit algorithm is measured by comparing the expected reward of the arm selected at each time step against the expected reward of the optimal arm at that time step, taking into account the fact that the optimal arm changes with time. This is in contrast to the static regret considered in stationary and adversarial settings. In these settings, the performance of a bandit algorithm is measured against a single fixed optimal arm, which is the arm that gives the highest total expected reward over the entire time horizon , when chosen at every single time step. Thus, dynamic regret is a stronger performance criterion than static regret.
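The contrast between dynamic and static regret described above can be illustrated with a minimal Python sketch. The function names and the list-of-lists layout (one row of arm means per time step) are assumptions made for illustration and are not notation from the paper.

```python
def dynamic_regret(means, chosen):
    """Sum over time of (best mean at that step) - (mean of the chosen arm).

    means  : list of rows, means[t][i] is the expected reward of arm i at step t
    chosen : list of arm indices, chosen[t] is the arm pulled at step t
    """
    return sum(max(row) - row[a] for row, a in zip(means, chosen))


def static_regret(means, chosen):
    """Total expected reward of the best single fixed arm minus that of the policy."""
    k = len(means[0])
    best_fixed_total = max(sum(row[i] for row in means) for i in range(k))
    obtained = sum(row[a] for row, a in zip(means, chosen))
    return best_fixed_total - obtained


# Two arms whose means swap after the first step; the policy always pulls arm 0.
means = [[0.9, 0.1], [0.1, 0.9]]
chosen = [0, 0]
print(dynamic_regret(means, chosen), static_regret(means, chosen))
```

In this toy case the policy matches the best fixed arm, so its static regret is zero, while the dynamic regret still charges it for missing the arm that became optimal at the second step.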

3. Fair-UCBe Algorithm

Now we present some analysis that leads to our proposed algorithm. For satisfying the fairness constraint as given by Definition 1, an arm should be preferentially chosen only if it is known that, with high probability, that arm indeed gives a strictly greater expected reward. To estimate this with high probability, confidence intervals are constructed for each arm, similar to the Upper Confidence Bound (UCB1) algorithm (Auer et al., 2002). However, instead of choosing the arm with the highest upper confidence bound, arms with high estimates of expected rewards are chosen in such a way that fairness is maintained, as described below.

Let be the confidence interval of arm at time . Suppose it has been proved that with probability at least , , for all and . At each time , to minimize the regret, UCB1 deterministically chooses the arm with the highest upper confidence bound , say . Fairness demands that be chosen only if for all . However, if there exists an arm such that , their confidence intervals overlap. In that case, there is no guarantee that the expected reward of arm is greater than that of , forcing any fair agent to assign both these arms an equal probability of being chosen.

Now, with the above constraint, at time , regret minimization requires the agent to choose arms and with probability each and ignore all the other arms. However, this is fair only if all the other arms have expected rewards strictly less than those of and , which is not true if the confidence interval of some other arm overlaps with either of those of or . Arm should also be assigned an equal probability of being chosen as and , and the same argument extends to other arms whose confidence intervals overlap with that of and so on.

Let be the set of arms to be chosen at time with equal nonzero probability and let arms in be ignored. For optimality, should contain . For fairness, any arm whose confidence interval overlaps with that of any arm in should be added and this process should be repeated until no other arm can be added to . is called the active set of arms (Joseph et al., 2016) and at each time step, an arm is chosen uniformly from the active set.

Definition 2 (Active set).

Let be the confidence interval associated with each arm of the bandit at time . The active set is defined recursively as the set satisfying the following properties:

  1. If , then .

  2. If for some , then .

Intuitively, the active set of arms is the set of arms whose confidence intervals are chained via overlap to the confidence interval with the greatest upper confidence bound. For the algorithm to be fair, each arm in the active set should be assigned an equal probability of being selected.
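The chaining in Definition 2 can be sketched in Python as a closure computation over overlapping intervals. The function name `active_set`, the representation of intervals as `(lower, upper)` pairs, and the choice to treat intervals that merely touch at an endpoint as overlapping are assumptions made for illustration.

```python
def active_set(intervals):
    """Return the sorted indices of arms chained by overlap to the top interval.

    intervals : list of (lower, upper) confidence bounds, one pair per arm
    """
    if not intervals:
        return []
    # Property 1: the arm with the greatest upper confidence bound is active.
    start = max(range(len(intervals)), key=lambda i: intervals[i][1])
    active = {start}
    frontier = [start]
    # Property 2: any arm whose interval overlaps an active arm's interval
    # becomes active; repeat until no further arm can be added.
    while frontier:
        lj, uj = intervals[frontier.pop()]
        for i, (li, ui) in enumerate(intervals):
            if i not in active and li <= uj and lj <= ui:
                active.add(i)
                frontier.append(i)
    return sorted(active)


# Arm 2 does not overlap arm 0 directly but is chained to it through arm 1.
print(active_set([(0.8, 1.0), (0.6, 0.85), (0.4, 0.65), (0.0, 0.2)]))
```

Arm 3 in the example stays inactive because its interval is disjoint from every interval in the chain.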

Due to non-stationarity, as time progresses, older samples become less indicative of the current reward distribution. Therefore, at each time step, we choose only the latest samples of each arm to estimate the expected reward and construct the confidence interval, for suitably chosen . This progressive increase in the number of samples considered is similar to the use of a progressively increasing sliding window by Wei and Srivatsva (2018). Now, as time progresses, due to the increased number of samples obtained, the confidence intervals shrink, and the active set becomes small. If the arms that are not in the active set are ignored and not sampled for a long time, due to the nonstationarity of the environment, their expected rewards can change such that they fall into the confidence intervals of arms that are in the active set.

So, to ensure that the learning algorithm does not remain oblivious to the reward distributions of inactive arms, we propose, with some fixed probability at each time step, to choose uniformly from all arms. This exploration probability is chosen to be , for suitable to be specified. Due to this fixed exploration probability at each step, we refer to our proposed algorithm as Fair-UCB with exploration, or Fair-UCBe. The overall steps involved are listed in Algorithm 1. We present two results in this regard. First, we show that the proposed algorithm is indeed δ-fair. Then, we establish an upper bound for the regret.

Given: horizon T, drift parameter κ
Choose: the sample-count and exploration-probability parameters
Initialize an empty sample sequence for each arm
for each time step do
       // Construct confidence intervals for each arm
       for each arm do
             Select the latest samples of the arm for estimation
             Estimate the expected reward as the mean of the selected samples
             Determine the width of the confidence interval
       Find the arm with the highest upper confidence bound
       Active set ← all arms whose intervals are chained to the interval of that arm
       Sample a Bernoulli random variable with the exploration probability
       if the random variable equals 1 then   // Explore
             Sample an arm uniformly from all arms
       else   // Exploit
             Sample an arm uniformly from the active set
       Choose the sampled arm and append the observed reward to its sample sequence
Algorithm 1 Fair UCB with Exploration
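
The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustrative skeleton: the exact sample count, confidence width and exploration probability used by Fair-UCBe are not reproduced here, so they are passed in as the hypothetical arguments `window`, `radius` and `explore_prob`, and the `pull` callback standing in for the environment is likewise an assumption.

```python
import math
import random


def chained_active_set(intervals):
    """Arms whose intervals are chained by overlap to the highest upper bound."""
    start = max(range(len(intervals)), key=lambda i: intervals[i][1])
    active, frontier = {start}, [start]
    while frontier:
        lj, uj = intervals[frontier.pop()]
        for i, (li, ui) in enumerate(intervals):
            if i not in active and li <= uj and lj <= ui:
                active.add(i)
                frontier.append(i)
    return sorted(active)


def fair_ucbe_sketch(pull, k, horizon, explore_prob, window, radius, rng=None):
    """Run the explore/exploit loop and return the list of chosen arms.

    pull(arm, t)  : hypothetical environment callback returning a reward in [0, 1]
    window(t)     : assumed number of latest samples used at step t
    radius(n, t)  : assumed half-width of the interval built from n samples
    """
    rng = rng or random.Random(0)
    samples = [[] for _ in range(k)]
    choices = []
    for t in range(1, horizon + 1):
        intervals = []
        for i in range(k):
            recent = samples[i][-max(1, window(t)):]
            if recent:
                mean = sum(recent) / len(recent)
                r = radius(len(recent), t)
                intervals.append((mean - r, mean + r))
            else:
                # No data yet: use the vacuous interval for rewards in [0, 1].
                intervals.append((0.0, 1.0))
        active = chained_active_set(intervals)
        if rng.random() < explore_prob:
            arm = rng.randrange(k)      # explore: uniform over all arms
        else:
            arm = rng.choice(active)    # exploit: uniform over the active set
        samples[arm].append(pull(arm, t))
        choices.append(arm)
    return choices


choices = fair_ucbe_sketch(
    pull=lambda arm, t: 0.9 if arm == 0 else 0.1,
    k=2, horizon=500, explore_prob=0.1,
    window=lambda t: 25,
    radius=lambda n, t: 1.0 / math.sqrt(n),
    rng=random.Random(1),
)
print(choices.count(0) / len(choices))
```

With two well-separated deterministic arms, the intervals stop overlapping after a handful of samples, and the worse arm is then selected only through the explicit exploration step.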
Theorem 3.1.

The Fair-UCBe algorithm is -fair, as defined in Definition 1, for .

Theorem 3.2.

The regret achieved by Fair-UCBe satisfies


The above regret bound is non-trivial since it guarantees that this fair algorithm achieves sublinear regret even in the context of a non-stationary environment.

4. Discussion

The choice of parameters and in the algorithm is constrained by the inequality or equivalently . For large , can be chosen close to , leading to the constraint . Thus, when the non-stationarity in the environment is very high and , we have as well. Similarly, when the non-stationarity is very low and the environment is almost stationary, is large and can be chosen close to .

Joseph et al. (2016) showed that in a stationary environment, their δ-fair algorithm FairBandits achieves a regret of , for , which is in the limiting case of . They also showed that FairBandits achieves the best possible performance in that setting. One can see that this bound is equivalent, up to logarithmic factors, to the regret of achieved by Fair-UCBe in the limiting case of , which occurs for large and . In other words, as the non-stationarity of the environment is reduced, the performance of our algorithm remains consistent with the best performance possible in that setting.

When the change in the environment is high (i.e., is close to zero), the regret bound for Fair-UCBe is similar to that of SW-UCB# (Sliding Window Upper Confidence Bound) (Wei and Srivatsva, 2018), which assumes a non-stationarity constraint similar to ours but does not maintain fairness. The regret bounds are almost the same in terms of up to logarithmic factors, with the difference being a factor of , whose exponent goes to zero for large .

4.1. On Exploration

One aspect of our algorithm that distinguishes it from other upper confidence bound algorithms is the incorporation of an explicit fixed probability of exploration, in addition to the implicit exploration present in other similar algorithms. This exploration probability depends on the non-stationarity of the environment through the constraint . Smaller values of κ lead to a larger probability of exploration. This is intuitive in the sense that the more the environment varies, the greater the need to sample inactive arms via explicit exploration to keep track of changes in their reward distributions. Thus, there is a smooth trade-off between exploitation and exploration, depending on the degree of non-stationarity of the environment.

4.2. On Sublinearity

Even though the upper bound for the regret of the algorithm is seemingly sublinear in (since the exponent of is ), the extra factor of may actually result in the regret being more than . In order to achieve sublinear regret, ignoring logarithmic factors for simplicity, it is necessary that , or equivalently . Due to the constraint , increases drastically for small values of , necessitating a very large value of to obtain sublinear regret. In other words, the regret of the algorithm is linear in the context of highly non-stationary environments. As the non-stationarity reduces, and the constraint becomes , which is identical to the constraint for FairBandits (Joseph et al., 2016).

5. Proofs for Theorems 3.1 and 3.2

5.1. Proof of Theorem 3.1

At each time step, all arms in the active set are assigned an equal probability of being chosen, say , and all arms not in the active set are assigned an equal probability of being chosen, say . Now, , since is the probability of exploration and is the probability of choosing a specific arm when exploring. For any arm in the active set, the probability of being chosen is

From this, we have . Therefore, due to the choice of the active set definition, for the algorithm to be -fair, it is sufficient to prove that with probability at least , the expected rewards of all arms at all time steps fall in their confidence intervals. Now we proceed to prove these results.
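The selection probabilities used in this argument can be written out as a short Python sketch. It assumes, following the algorithm description, that exploration picks uniformly among all k arms and exploitation picks uniformly within the active set; the function and argument names are hypothetical.

```python
def selection_probabilities(k, active, explore_prob):
    """Per-arm probability of being chosen at one time step.

    k            : number of arms
    active       : iterable of indices of arms in the active set
    explore_prob : probability of taking the exploration branch
    """
    active = set(active)
    # An inactive arm can only be reached through exploration.
    p_inactive = explore_prob / k
    # An active arm is reached through exploration or uniform exploitation.
    p_active = (1.0 - explore_prob) / len(active) + explore_prob / k
    return [p_active if i in active else p_inactive for i in range(k)]


probs = selection_probabilities(k=4, active=[0, 1], explore_prob=0.2)
print(probs)
```

Every active arm receives the same probability, every inactive arm receives the same smaller probability, and the values sum to one.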

5.1.1. Spread of samples

For any arm at any time , the probability of that arm being chosen is at least . So, the expected number of time steps required for obtaining at least one sample from an arm is at most . This fact can be used to prove the following Lemma.

Lemma 5.1.

Let the time interval be divided into intervals of size , for as specified in Algorithm 1. Let be the event that each arm has at least one sample in each of these intervals. Then, , for .
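
The counting fact behind this lemma can be checked numerically with a small Python sketch. It assumes only what is stated above, namely that every arm is chosen with probability at least the exploration probability divided by k at each step; the function names are hypothetical.

```python
def expected_wait_upper_bound(k, explore_prob):
    """Upper bound on the expected number of steps until an arm is sampled.

    The waiting time is dominated by a geometric variable with success
    probability explore_prob / k, whose mean is k / explore_prob.
    """
    return k / explore_prob


def no_sample_probability_bound(k, explore_prob, n_steps):
    """Upper bound on the probability that an arm is skipped n_steps times in a row."""
    return (1.0 - explore_prob / k) ** n_steps


print(expected_wait_upper_bound(4, 0.2))
print(no_sample_probability_bound(4, 0.2, 20))
```

The second quantity decays geometrically in the interval length, which is why a moderately long interval suffices for every arm to be sampled in it with high probability.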

The constraint on can be simplified by the following Lemma. The proofs of both these Lemmas are given in the Appendix.

Lemma 5.2.

From the above Lemma, is sufficient for arbitrary . Moreover, for , the lower bound is a decreasing function of , and thus can be chosen much smaller, with the value going to for large .

5.1.2. Sufficiency of samples

At each point in time, we wish to choose the latest samples. But this many samples may not be available, especially if is small. From Lemma 5.1, we see that if is true, then each sample requires at most time steps. So, for the availability of a sufficient number of samples, we need . Suppose , then for , we have , which implies and hence .

So, if is true, after the initial time steps, the number of samples is always sufficient to construct a good confidence interval, provided the exponent of is sensible. We add the constraint, , to be satisfied by the exponent. This will be useful in the regret calculation. This can be simplified to or , which is the constraint specified in Algorithm 1.

5.1.3. Confidence intervals

At time , for arm , consider the latest rewards obtained. Let those rewards be , where with means . The Hoeffding inequality gives

We make use of the following Lemma, whose proof is provided in the Appendix.

Lemma 5.3.

Let , the empirical estimate of the expected reward of arm at time , using the latest samples obtained from that arm. Then, if is true, for ,

By letting , the above probability (that the true mean is outside the confidence interval) summed over all arms and times is bounded above by

Here, we use the fact that . Since the above analysis holds if is true, which happens with probability at least , the above confidence intervals hold with probability at least and the algorithm becomes δ-fair, where .
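
The Hoeffding step and the union bound over arms and time steps used in this subsection can be sketched numerically in Python. The sketch covers only the standard Hoeffding radius for independent rewards in [0, 1]; it deliberately omits the additional drift correction handled by Lemma 5.3, and the function names are hypothetical.

```python
import math


def hoeffding_radius(n_samples, failure_prob):
    """Radius r such that P(|empirical mean - its expectation| >= r) <= failure_prob
    for n_samples independent rewards bounded in [0, 1] (two-sided Hoeffding bound)."""
    return math.sqrt(math.log(2.0 / failure_prob) / (2.0 * n_samples))


def per_interval_failure_prob(total_failure_prob, k, horizon):
    """Split a total failure budget evenly over all k arms and all time steps,
    so that a union bound recovers the total budget."""
    return total_failure_prob / (k * horizon)


print(hoeffding_radius(100, 0.05))
print(hoeffding_radius(100, per_interval_failure_prob(0.05, k=5, horizon=1000)))
```

Spreading the failure budget over all arms and time steps only enlarges the radius logarithmically, which is consistent with the logarithmic factor in the regret bound.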

5.2. Proof of Theorem 3.2

The length of the confidence interval of arm is

The regret at any time step of exploitation is at most times the size of the largest confidence interval, and at most (and also, with probability less than , when any of the means fall outside their confidence intervals, it will be bounded by ). When is true, for the first time steps, the number of samples may be insufficient and the regret is bounded by , and for later time steps, , for all and

where the first two terms are due to exploitation epochs, the third term is due to exploration epochs and the fourth term corresponds to the event

. Since , we obtain

Since and equivalently , the first term becomes .

Now, consider the second term in the regret bound. Let . When is true and , and the term becomes

The second term in the above expression is , which is , since . By letting , the first term in the above expression becomes

Let , then and . The above expression becomes

The first inequality above is obtained by using integration by parts with functions and . The overall regret bound becomes

Now, . So, letting , we have and hence . Therefore,

6. Experiments

In this section, we present results from applying the proposed algorithm in a simulated environment. We consider a bandit with arms. The initial expected rewards for all arms are chosen uniformly from . The rewards at each time step are sampled from a beta distribution. For non-stationarity, each arm is assigned randomly at the beginning to be drifting either upwards or downwards. An upward drifting arm is more likely to drift upwards, with some fixed probability , and vice versa. At each time step, the drift in the expected value of each arm is sampled uniformly from and added to the expected reward, while constraining it to remain within the original interval. The results are plotted in Figure 1(a) as the ratio between the regret achieved by the algorithm and the regret bound .
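
A drifting environment of the kind described above could be simulated along the following lines in Python. All numeric defaults (the interval `[low, high]`, the probability `p_follow` of drifting in the assigned direction, the beta `concentration`) and all names are assumptions chosen for illustration, since the values used in the experiments are not reproduced here.

```python
import random


def make_drifting_means(k, horizon, max_drift, low=0.1, high=0.9,
                        p_follow=0.8, seed=0):
    """Return a list of rows; row t holds the expected reward of each arm at step t."""
    rng = random.Random(seed)
    means = [rng.uniform(low, high) for _ in range(k)]
    # Each arm is assigned an overall drift direction once, at the start.
    direction = [rng.choice((-1, 1)) for _ in range(k)]
    path = []
    for _ in range(horizon):
        path.append(list(means))
        for i in range(k):
            # Follow the assigned direction with probability p_follow.
            sign = direction[i] if rng.random() < p_follow else -direction[i]
            step = sign * rng.uniform(0.0, max_drift)
            # Clip so the mean stays inside the original interval.
            means[i] = min(high, max(low, means[i] + step))
    return path


def sample_beta_reward(mean, rng, concentration=10.0):
    """Draw a reward in [0, 1] from a beta distribution with the given mean."""
    return rng.betavariate(mean * concentration, (1.0 - mean) * concentration)


path = make_drifting_means(k=3, horizon=200, max_drift=0.01, seed=1)
print(path[0], path[-1])
```

Clipping can only shrink a step, so the per-step change in each mean never exceeds `max_drift`, matching the slowly varying assumption.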

Figure 1. (a) Plot showing the ratio of regret achieved by the algorithm and the regret bound , for time horizon and multiple values of . A smaller value of indicates a fast changing environment. (Best viewed in colour.) (b) and (c) show the change in the confidence intervals of the two arms under Fair-UCBe and FairBandits respectively in an environment with two arms, time horizon and , where the expected rewards of the arms continuously evolve in opposite directions. It can be observed that due to lack of exploration, throughout most of the time horizon, FairBandits fails to accurately estimate the reward distribution of whichever arm is not in the active set at that point in time and thus does not maintain fairness.

The apparent linear increase in the regret with time is due to the constant exploration probability at each time step, which is necessary to ensure fairness. This does not contradict the sublinear regret bound since the bound is for the cumulative expected regret achieved over the entire time horizon and does not constrain how the regret changes with time. It can be seen from Figure 1(a) that the cumulative regret achieved does not exceed the derived upper bound.

The sharp changes in the slopes of some of the lines in the plot correspond to changes in the composition of the active set. Dropping a sub-optimal arm from the active set results in a drastic reduction in the expected regret at each time step of exploitation.

To further illustrate the necessity of a sliding window and exploration to deal with non-stationarity, we consider another experiment with two arms and parameters and . In this experiment, we let the first arm start with an expected reward of and decrease by the maximum amount possible, at each time step. Similarly, we let the second arm start with an expected reward of and increase it by at each time step. We run Fair-UCBe and FairBandits in this environment and plot the upper and lower confidence bounds of the arms in Figures 1(b) and 1(c) respectively.

Under Fair-UCBe, we can see the gradual shift of the confidence intervals of both arms as the underlying reward distributions change. In contrast, under FairBandits, we observe that as soon as the first arm becomes known to be better than the second arm, the latter is discarded from the active set, and the algorithm loses track of its reward distribution. Only much later does the estimated mean reward of the first arm become low enough for its confidence interval to overlap with that of the second arm. The algorithm then becomes aware of the change in the reward distribution of the second arm. After a few time steps, the active set again contains only one arm, this time the second arm, and the lack of exploration leads to a biased estimation of the other arm again, as seen by the lack of change in the confidence bounds of the first arm nearing the end of the horizon. Thus, we see that every aspect of our algorithm is crucial for dealing with a non-stationary environment while being fair to all arms.

7. Related work

In this work, we considered a specific version of a slowly varying environment to study regret bounds for a fair algorithm solving a non-stationary multi-armed bandit problem. Several variations of non-stationary bandit problems have been extensively studied, which constrain in various ways how the reward distributions of the arms change as time passes. Garivier and Moulines (2011) studied a setting with a constraint on the number of times of occurrence of an arbitrary change in the reward distribution, referred to as an abruptly changing environment. Alternatively, change in the reward distribution could be allowed at every time instant, but the nature of each change is constrained, leading to a slowly varying or drifting environment like the one considered in this work. The constraint could be on the absolute change in the expected reward at each time step (Wei and Srivatsva, 2018), or a stochastic change in the form of a known distribution (Slivkins and Upfal, 2008). Another extensively studied setting consists of a constraint on the total absolute variation, throughout all time steps, of the expected rewards (Besbes et al., 2019; Russac et al., 2019).

Our non-stationarity assumption is similar to that of Wei and Srivatsva (2018). Their algorithm, SW-UCB# (Sliding Window UCB), was designed for dealing with a slowly varying environment in which the change in the expected mean of each arm’s reward distribution is assumed to be , which is a slightly weaker assumption than ours. Its use of an increasing sliding window of samples to estimate the current mean is similar to our algorithm. At each step, instead of considering all the reward samples obtained till then, the only rewards used are those obtained in the last time steps, for suitable values of and . However, their work differs from ours in terms of fairness due to the lack of explicit exploration and the deterministic selection of the arm with the highest upper confidence bound at every time step. Another difference is that they use a sliding window of a certain size, whereas, in our method, a certain number of latest samples are used, irrespective of how old those samples are.

A similar algorithmic choice for dealing with non-stationarity is the use of a fixed-size sliding window. Another technique is to discount older rewards when estimating the expected reward of an arm. This ensures that older samples affect the estimate less and reduces the bias induced by the environment’s non-stationarity. These two approaches have been studied by Garivier and Moulines (2011) for abruptly varying environments.
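
The two estimators mentioned above can be compared with a generic Python sketch. It is not taken from any of the cited algorithms: the function names and the discount factor `gamma` are assumptions, and `window` is assumed to be at least 1.

```python
def sliding_window_mean(rewards, window):
    """Average of the most recent `window` rewards (all of them if fewer exist)."""
    recent = rewards[-window:]
    return sum(recent) / len(recent)


def discounted_mean(rewards, gamma):
    """Weighted average in which a reward observed j steps ago has weight gamma**j."""
    value_sum = 0.0
    weight_sum = 0.0
    for r in rewards:
        value_sum = gamma * value_sum + r
        weight_sum = gamma * weight_sum + 1.0
    return value_sum / weight_sum


# An arm whose reward jumps from 0 to 1 halfway through its history.
rewards = [0.0] * 50 + [1.0] * 50
print(sum(rewards) / len(rewards))        # plain average lags behind the change
print(sliding_window_mean(rewards, 10))   # window sees only the recent regime
print(discounted_mean(rewards, 0.9))      # discounting nearly forgets the old regime
```

Both estimators track the recent regime far more closely than the plain average, at the cost of using fewer effective samples and hence wider confidence intervals.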

EXP3 (Auer et al., 2002) is an algorithm used for solving an adversarial multi-armed bandit. Besbes et al. (2019) repurposed this algorithm and showed that by restarting it every time steps, for some suitable , this algorithm can be used to solve a stochastic and non-stationary bandit and achieves the lower bound of regret achievable in that setting.

The notion of δ-fairness considered in this paper has been studied for classic and contextual bandits by Joseph et al. (2016). Another notion of fairness is to constrain the fraction of times an arm is chosen by a pre-specified lower bound (Li et al., 2020). However, the δ-fairness of Joseph et al. (2016) differs significantly from this notion, since the latter depends on an external lower bound specification that is independent of the reward distributions of the arms themselves.

8. Conclusion

In this work, we have studied the problem of designing a δ-fair algorithm for a stochastic non-stationary multi-armed bandit problem. Our non-stationarity assumption is that the absolute change in the expected reward of each arm is at most at each time step, for some known . We have shown that the proposed algorithm Fair-UCBe indeed satisfies the δ-fairness condition, for . We have also shown that it achieves a regret of , for some constant .

Appendix A. Proofs of Lemmas 5.1-5.3

For the proof of Lemma 5.1, we need the following result.

Lemma A.1.

For , ,


For , is a convex function in . For , this is clear since it is linear, and for , the function becomes , which is clearly a convex quadratic function. For , and .

Now, consider the sequence . Clearly, it is an increasing sequence, and it converges to , which implies .

So, for , , and by the convexity of ,

since and . ∎

A.1. Proof of Lemma 5.1

The probability that more than time steps are required to obtain a single sample is at most , for an arm to be rejected consecutively at least times. This is the probability that, for a specific arm, there is no sample in a single time interval of length . We need to consider this failure probability for all intervals and all arms. So, setting interval size and dividing the available failure probability into parts for each of the arms, we need

or equivalently,

By Lemma A.1, with and ,

So, for the Lemma to hold, should satisfy . From Algorithm 1, we have

which implies , and hence .

A.2. Proof of Lemma 5.2

Consider the function . , which is positive for and negative for . So, attains its maximum at , with . Hence,

attains a maximum of , at or , which is approximately 14.5669… Since we considered the maximum value possible, this upper bound is a worst case bound for the value of and for , this can be improved considerably, since the function reduces to as increases.

A.3. Proof of Lemma 5.3

We have

Let , where is the true mean at time , and the true mean at a previous time differs from this by some error , which is bounded by times the time difference, since is true. Now,

So, for , it is sufficient that


Therefore, for , the empirical estimate of the expected reward of arm at time using the latest samples obtained from that arm, and , we have