# Bandit Algorithms for Precision Medicine

The Oxford English Dictionary defines precision medicine as "medical care designed to optimize efficiency or therapeutic benefit for particular groups of patients, especially by using genetic or molecular profiling." It is not an entirely new idea: physicians from ancient times have recognized that medical treatment needs to consider individual variations in patient characteristics. However, the modern precision medicine movement has been enabled by a confluence of events: scientific advances in fields such as genetics and pharmacology, technological advances in mobile devices and wearable sensors, and methodological advances in computing and data sciences. This chapter is about bandit algorithms: an area of data science of special relevance to precision medicine. With their roots in the seminal work of Bellman, Robbins, Lai and others, bandit algorithms have come to occupy a central place in modern data science ( Lattimore and Szepesvari, 2020). Bandit algorithms can be used in any situation where treatment decisions need to be made to optimize some health outcome. Since precision medicine focuses on the use of patient characteristics to guide treatment, contextual bandit algorithms are especially useful since they are designed to take such information into account. The role of bandit algorithms in areas of precision medicine such as mobile health and digital phenotyping has been reviewed before (Tewari and Murphy, 2017; Rabbi et al., 2019). Since these reviews were published, bandit algorithms have continued to find uses in mobile health and several new topics have emerged in the research on bandit algorithms. This chapter is written for quantitative researchers in fields such as statistics, machine learning, and operations research who might be interested in knowing more about the algorithmic and mathematical details of bandit algorithms that have been used in mobile health.

## 1 Introduction

The Oxford English Dictionary defines precision medicine as “medical care designed to optimize efficiency or therapeutic benefit for particular groups of patients, especially by using genetic or molecular profiling.” It is not an entirely new idea: physicians from ancient times have recognized that medical treatment needs to consider individual variations in patient characteristics (Konstantinidou et al., 2017). However, the modern precision medicine movement has been enabled by a confluence of events: scientific advances in fields such as genetics and pharmacology, technological advances in mobile devices and wearable sensors, and methodological advances in computing and data sciences.

This chapter is about bandit algorithms: an area of data science of special relevance to precision medicine. With their roots in the seminal work of Bellman, Robbins, Lai and others, bandit algorithms have come to occupy a central place in modern data science (see the book by Lattimore and Szepesvári (2020) for an up-to-date treatment). Bandit algorithms can be used in any situation where treatment decisions need to be made to optimize some health outcome. Since precision medicine focuses on the use of patient characteristics to guide treatment, contextual bandit algorithms are especially useful since they are designed to take such information into account.

The role of bandit algorithms in areas of precision medicine such as mobile health and digital phenotyping has been reviewed before (Tewari and Murphy, 2017; Rabbi et al., 2019). Since these reviews were published, bandit algorithms have continued to find uses in mobile health and several new topics have emerged in the research on bandit algorithms. This chapter is written for quantitative researchers in fields such as statistics, machine learning, and operations research who might be interested in knowing more about the algorithmic and mathematical details of bandit algorithms that have been used in mobile health.

We have organized this chapter to meet two goals. First, we want to provide a concise exposition of basic topics in bandit algorithms. Section 2 will help the reader become familiar with basic problem setups and algorithms that appear frequently in applied work in precision medicine and mobile health (see, for example, Paredes et al. (2014); Piette et al. (2015); Rabbi et al. (2015); Piette et al. (2016); Yom-Tov et al. (2017); Rindtorff et al. (2019); Forman et al. (2019); Liao et al. (2020); Ameko et al. (2020); Aguilera et al. (2020); Tomkins et al. (2021)). Second, we want to highlight a few advanced topics that are important for mobile health and precision medicine applications but whose full potential remains to be realized. Section 3 will provide the reader with helpful entry points into the bandit literature on non-stationarity, robustness to corrupted rewards, satisfying additional constraints, algorithmic fairness, and causality.

## 2 Basic Topics

In this section, we begin by introducing the simplest of all bandit problems: the multi-armed bandit. Then we discuss a more advanced variant, called the contextual bandit, that is especially suitable for precision medicine applications. The last topic we discuss in this section is offline learning, which deals with algorithms that can use already collected data. The offline learning setting is to be contrasted with the online learning setting, where the bandit algorithm has control over the data it collects.

### 2.1 Multi-armed Bandit

In recent years, the multi-armed bandit (MAB) framework has attracted a lot of attention in many application areas such as healthcare, marketing, and recommendation systems. MAB is a simple model that describes the interaction between an agent (also referred to as a learner, statistician, or decision maker) and an environment. At every time step, the agent makes a choice from an action set and receives a reward. Since the historical roots of probability theory lie in gambling and casinos, it is not surprising that the MAB terminology comes from imagining a slot machine in a casino. A slot machine is also called a "one-armed bandit" as it robs you of your money. Therefore, we will use actions and arms interchangeably.

The agent may have different goals, such as maximizing the (discounted) cumulative reward within a time horizon, identifying the best arm, or competing with the arm with the best risk-return trade-off. In this section, we focus on maximizing the cumulative reward for simplicity. An important observation is that the agent needs to balance exploration and exploitation to achieve its goal of receiving high cumulative reward. That is, both under-explored arms and tried-and-tested arms with high rewards should be selected often, but for different reasons: the former have the potential to achieve high rewards and the latter are already confirmed to be good based on past experience.

To formally define the bandit framework, we start by introducing some notation. Suppose the agent interacts with the environment for $T$ time steps, where $T$ is called the horizon. In each round $t \in [T]$, the learner chooses an action $A_t$ from the action set $\mathcal{A}$ and receives a corresponding reward $R_t$. We denote the cardinality of $\mathcal{A}$ by $K$. The choice of $A_t$ depends on the action/reward history up to time $t-1$: $H_{t-1} = (A_1, R_1, \dots, A_{t-1}, R_{t-1})$. A policy $\pi_t$ is defined as a mapping from the history up to time $t-1$ to the actions. For short, we use $\pi$ to denote the sequence of policies $(\pi_1, \dots, \pi_T)$.

In a healthcare setting, the fundamental pattern that often occurs is the following. Of course, this simple pattern fails to capture the full complexity of decision making in healthcare, but it is a reasonable starting point, especially for theoretical analysis.

In the remainder of this section, we will review bandit algorithms that learn good decision policies over time. We focus on the two key settings: stochastic bandit and adversarial bandit. In both settings, the algorithms aim at minimizing their regret, which measures the difference between the maximal reward one can get, and the reward obtained by the algorithm. We will formally define regret in each setting.

#### 2.1.1 Stochastic Multi-armed Bandit

A stochastic bandit is a set of distributions $\nu = (P_a : a \in \mathcal{A})$ and we define the environment class $\mathcal{E}$ as a set of such distributions

$$\mathcal{E} = \{\nu = (P_a : a \in \mathcal{A}) : P_a \in \mathcal{M}_a \text{ for all } a \in \mathcal{A}\},$$

where for each $a \in \mathcal{A}$, $\mathcal{M}_a$ is a set of distributions. For unstructured bandits, playing one action cannot help the agent deduce anything about other actions. Environment classes that are not unstructured are called structured; examples include linear bandits (Abbasi-Yadkori et al., 2011), low-rank bandits (Lu et al., 2021b) and combinatorial bandits (Cesa-Bianchi and Lugosi, 2012). Throughout this chapter, we assume all bandit instances are 1-subgaussian, which means the reward distribution of every arm is 1-subgaussian.

###### Definition 1 (Subgaussianity).

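A random variable $X$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda X)] \le \exp(\lambda^2 \sigma^2 / 2)$.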

It is not hard to see from the definition that many well-known distributions are subgaussian, e.g., any bounded-domain distribution, Bernoulli distribution and Gaussian distributions. Intuitively, a subgaussian distribution has tails no heavier than a Gaussian distribution. Many nice concentration inequalities have been developed for subgaussian variables and are widely used in the proofs of bandit algorithms.

In the process of interaction, once the agent performs action $A_t$ following a particular policy, the environment samples a reward $R_t$ from the distribution $P_{A_t}$. The combination of an environment and an agent policy induces a probability measure on the sequence of outcomes $A_1, R_1, \dots, A_T, R_T$. A standard stochastic MAB protocol is the following. At every time step $t$, the learning agent

1. picks an action $A_t$ following policy $\pi_t$,

2. receives a reward $R_t \sim P_{A_t}$,

3. updates its policy to $\pi_{t+1}$.

We note that $P_{A_t}$ is the conditional reward distribution of $R_t$ given $A_1, R_1, \dots, A_{t-1}, R_{t-1}, A_t$, and $\pi_t$ is a function from the set of histories up to time $t-1$ to $\mathcal{A}$. The expected reward of action $a$ is defined by $\mu_a(\nu) = \int_{-\infty}^{\infty} x \, dP_a(x)$, where $x$ is used as the reward variable. Then the maximum expected reward and the optimal action are given by

$$\mu^*(\nu) = \max_{a \in \mathcal{A}} \mu_a(\nu) \quad \text{and} \quad a^*(\nu) \in \arg\max_{a \in \mathcal{A}} \mu_a(\nu).$$

According to the above definition, more than one optimal action can exist, and the optimal policy is to select an optimal action at every round. Actions whose expected rewards are smaller than that of an optimal action are called sub-optimal actions, and we define the reward gap between action $a$ and $a^*$ by $\Delta_a(\nu) = \mu^*(\nu) - \mu_a(\nu)$.

As mentioned earlier, the learner's goal is to maximize the cumulative reward $S_T = \sum_{t=1}^{T} R_t$. We now define a performance metric called regret, which is the difference between the expected cumulative reward that an optimal policy can obtain and $\mathbb{E}[S_T]$. Minimizing the regret is equivalent to maximizing the reward. The reason why we do not directly optimize $S_T$ is that the cumulative reward depends on the environment, and it is hard to tell whether a policy is good or not by merely looking at the cumulative reward unless it is compared to a good policy. So we define the problem-dependent regret of a policy $\pi$ on bandit instance $\nu$ by

$$\mathrm{Reg}_T(\pi, \nu) = T \mu^*(\nu) - \mathbb{E}[S_T],$$

where the expectation is taken over actions and rewards up to time $T$. The worst-case regret of a policy $\pi$ is defined by

$$\mathrm{Reg}_T(\pi) = \sup_{\nu \in \mathcal{E}} \mathrm{Reg}_T(\pi, \nu).$$

We will drop $\pi$ and $\nu$ from the regret when they are clear from the context.

Throughout this chapter, we focus on minimizing regret by balancing exploration and exploitation. We also want to point out that in a different setting, the exploration cost may not be a concern and the agent just wants to output a final recommendation for the best arm after an exploration phase. Problems of this type are called pure exploration problems. In such cases, algorithms are usually evaluated by sample complexity or simple regret (Bubeck et al., 2009). Pure exploration is also related to randomized controlled trials (RCTs), including modern variants that involve sequential randomization such as sequential multiple assignment randomized trials (SMARTs) (Lei et al., 2012) and micro-randomized trials (MRTs) (Klasnja et al., 2015). Randomized trials are typically designed to enable estimation of treatment effects with sufficient statistical power. Since the concerns of pure exploration and randomized controlled trials are different from those of bandit algorithms, we do not discuss them further in this chapter. However, note that in an actual application, methodology from bandits and randomized trials may need to be integrated. Researchers may start off with a randomized trial and follow it up with a bandit algorithm in the next iteration of their health app. They can also decide to run a randomized trial for one health outcome while simultaneously running a bandit for a different outcome (e.g., an outcome related to user engagement with the health app) in the same study. There is also ongoing work (Yao et al., 2020; Zhang et al., 2021) on enabling the kind of statistical analysis done after randomized trials on data collected via online bandit algorithms.

##### Explore-then-Commit (ETC).

We start with a simple two-stage algorithm: Explore-then-Commit (ETC). In the first stage of ETC, the learner plays every arm a fixed number of times ($m$) and obtains estimates of the expected rewards. In the second stage, the learner commits to the best arm according to the estimates from the first stage. For every arm $a$, let $\hat{\mu}_a(t)$ denote the estimated expected reward up to time $t$:

$$\hat{\mu}_a(t) = \frac{1}{T_a(t)} \sum_{s=1}^{t} \mathbb{1}\{A_s = a\} R_s,$$

where $T_a(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}$ is the number of times action $a$ has been performed up to round $t$.

With the above definitions, we are ready to present ETC in Algorithm 2. The overall performance of ETC crucially depends on the parameter $m$. If $m$ is too small, the algorithm cannot estimate the performance of every arm accurately, so it is likely to exploit a sub-optimal arm in the second stage, which leads to high regret. If $m$ is too big, the first stage (explore step) plays sub-optimal arms too many times, so the regret can be large again. The art is to choose an optimal value for $m$ in order to minimize the total regret incurred in both stages. Specifically, ETC achieves $O(T^{2/3})$ worst-case regret (ignoring parameters other than $T$) by choosing $m = O(T^{2/3})$ (Lattimore and Szepesvári, 2020). Sub-linear regret is good as a starting point. Next, we will introduce two more classic algorithms that incur even less regret.
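The two stages of ETC can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's Algorithm 2: the function name `explore_then_commit`, the round-robin exploration order, and the unit-variance Gaussian rewards are assumptions made for the example, and the returned quantity is the pseudo-regret computed from the true means.

```python
import random


def explore_then_commit(means, horizon, m, seed=0):
    """Illustrative ETC on a Gaussian bandit with unit variance (assumed setup).

    Plays each arm m times in round-robin order, then commits to the arm
    with the highest empirical mean. Returns (pseudo-regret, committed arm).
    """
    assert m >= 1 and horizon >= m * len(means)
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best_mean = max(means)
    regret = 0.0
    committed = None
    for t in range(horizon):
        if t < m * k:
            arm = t % k  # exploration stage: round-robin over arms
        else:
            if committed is None:
                # commit once, using the empirical means from the first stage
                committed = max(range(k), key=lambda a: sums[a] / counts[a])
            arm = committed
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    return regret, committed


regret, arm = explore_then_commit([1.0, 0.0], horizon=2000, m=50)
print(regret, arm)
```

When the commit step picks the optimal arm, all regret comes from the exploration stage, which is the trade-off in $m$ described above.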

##### Upper Confidence Bound (UCB).

There are several types of exploration strategies for selecting actions, such as greedy, Boltzmann, optimism and pessimism. Suppose the agent has reward estimates $\hat{\mu}_a$ for all actions. A greedy exploration strategy simply selects the action with the highest $\hat{\mu}_a$. A Boltzmann exploration strategy picks each action with probability proportional to $\exp(\eta \hat{\mu}_a)$, where $\eta$ is a tuning parameter. Boltzmann becomes greedy as $\eta$ goes to infinity. For an optimism strategy, one picks the action with the highest reward estimate plus some bonus term $\beta_a \ge 0$, i.e. $\arg\max_{a \in \mathcal{A}} \hat{\mu}_a + \beta_a$. In contrast, a pessimism strategy would pick the action $\arg\max_{a \in \mathcal{A}} \hat{\mu}_a - \beta_a$.

Out of these strategies, UCB algorithm follows the optimism strategy, in particular, a famous principle called optimism in the face of uncertainty (OFU), which means that one should act as if the environment is the best possible one among those that are plausible given current experience. The reason OFU works is that misplaced optimism gets corrected when under-explored actions are tried and low rewards are observed. In contrast, pessimism does not work (at least in the online setting; for the offline setting things can be different (Jin et al., 2021)) since wrong beliefs about low performance of under-explored actions do not get a chance to get revised by collecting more data from those actions.

At every step $t$, the UCB algorithm updates a value called the upper confidence bound, defined for each action $a$ and confidence level $\delta$ as follows.

$$\mathrm{UCB}_a(t-1, \delta) = \begin{cases} \infty, & \text{if } T_a(t-1) = 0, \\ \hat{\mu}_a(t-1) + \sqrt{\dfrac{2\log(1/\delta)}{T_a(t-1) \vee 1}}, & \text{otherwise.} \end{cases} \tag{1}$$

The learner chooses the action with the highest UCB value at each step. Overall, UCB (Algorithm 3) guarantees a $\tilde{O}(\sqrt{KT})$ worst-case regret (where the informal notation $\tilde{O}$ hides constants and logarithmic factors).

According to the construction of upper confidence bounds, an action will be selected under two circumstances: it is under-explored ($T_a(t-1)$ small) or well-explored with good performance ($\hat{\mu}_a(t-1)$ large). The upper confidence bound for an action gets close to its true mean after the action has been selected enough times. A sub-optimal action will only be played if its upper confidence bound is larger than that of the optimal arm. However, this is unlikely to happen too often: the upper confidence bound for a sub-optimal action will eventually fall below that of the optimal action as we play the sub-optimal action more times. We present the regret guarantee for UCB (Algorithm 3) in Theorem 1.
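The index rule in (1) translates almost directly into code. The sketch below is an illustration under assumptions, not the chapter's Algorithm 3 verbatim: the function names `ucb_index` and `run_ucb`, the Gaussian reward simulator, and the tie-breaking by lowest arm index are choices made for the example.

```python
import math
import random


def ucb_index(mean_hat, count, delta):
    """Upper confidence bound as in (1): infinite for arms never played."""
    if count == 0:
        return math.inf
    return mean_hat + math.sqrt(2.0 * math.log(1.0 / delta) / count)


def run_ucb(means, horizon, delta, seed=0):
    """Illustrative UCB on a unit-variance Gaussian bandit (assumed setup)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best_mean = max(means)
    regret = 0.0
    for _ in range(horizon):
        indices = [
            ucb_index(sums[a] / counts[a] if counts[a] else 0.0, counts[a], delta)
            for a in range(k)
        ]
        arm = max(range(k), key=indices.__getitem__)  # optimism: highest index
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    return regret, counts


T = 2000
regret, counts = run_ucb([1.0, 0.0], horizon=T, delta=1.0 / T**2)
print(regret, counts)
```

Because unplayed arms have an infinite index, every arm is tried once before the confidence widths start to matter.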

###### Theorem 1 (Regret for UCB Algorithm).

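If $\delta = 1/T^2$, then the problem-dependent regret of UCB, as defined in Algorithm 3, on any 1-subgaussian bandit $\nu$ is bounded by

$$\mathrm{Reg}_T(\mathrm{UCB}, \nu) \le 3 \sum_{a \in \mathcal{A}} \Delta_a + \sum_{a : \Delta_a > 0} \frac{16 \log T}{\Delta_a}, \tag{2}$$

where $\Delta_a$ represents the corresponding gap term. The worst-case regret bound of UCB is:

$$\mathrm{Reg}_T(\mathrm{UCB}) = O(\sqrt{KT \log T}). \tag{3}$$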
###### Proof.

We only present the worst-case regret for simplicity. For the problem-dependent regret proof, we refer the reader to Chapter 7 in Lattimore and Szepesvári (2020).

We define a good event as follows:

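$$E := \left\{ |\hat{\mu}_a(t-1) - \mu_a| \le \sqrt{\frac{2\log(1/\delta)}{T_a(t-1) \vee 1}}, \ \forall t \in [T], a \in \mathcal{A} \right\}.$$

By Hoeffding's inequality and a union bound, one can show that $P(E^c) \le 2TK\delta^4$. Next, we decompose the regret.

$$\begin{aligned} \mathrm{Reg}_T(\mathrm{UCB}) &= \mathbb{E}\left[ \sum_{t=1}^{T} \mu^* - \mathrm{UCB}_{a^*}(t-1,\delta) + \mathrm{UCB}_{a^*}(t-1,\delta) - \mathrm{UCB}_{A_t}(t-1,\delta) + \mathrm{UCB}_{A_t}(t-1,\delta) - \mu_{A_t} \right] \\ &\le \mathbb{E}\left[ \sum_{t=1}^{T} \mu^* - \mathrm{UCB}_{a^*}(t-1,\delta) + \mathrm{UCB}_{A_t}(t-1,\delta) - \mu_{A_t} \right]. \end{aligned}$$

The inequality in the above expression is due to the action selection criterion of the UCB algorithm. Conditioned on event $E$, we have $\mu^* - \mathrm{UCB}_{a^*}(t-1,\delta) \le 0$ for all $t \in [T]$; otherwise, the regret can be bounded by $2T$. Combining these arguments, we have

$$\begin{aligned} \mathrm{Reg}_T(\mathrm{UCB}) &\le P(E^c) \cdot 2T + P(E) \cdot \mathbb{E}\left[ \sum_{t=1}^{T} \mathrm{UCB}_{A_t}(t-1,\delta) - \mu_{A_t} \,\middle|\, E \right] \\ &\le 4T^2 K \delta^4 + 2\, \mathbb{E}\left[ \sum_{t=1}^{T} \sqrt{\frac{2\log(1/\delta)}{T_{A_t}(t-1) \vee 1}} \right] \quad \text{(by definition of event } E\text{)} \\ &\le 4T^2 K \delta^4 + \sqrt{8\log(1/\delta)} \sum_{a \in \mathcal{A}} \sum_{t=1}^{T} \mathbb{E}\left[ \sqrt{\frac{1}{T_{A_t}(t-1) \vee 1}} \, \mathbb{1}\{A_t = a\} \right] \\ &\le 4T^2 K \delta^4 + \sqrt{8\log(1/\delta)} \sum_{a \in \mathcal{A}} \int_{1}^{T_a(T)} \sqrt{1/s} \, ds \\ &\le 4T^2 K \delta^4 + \sqrt{8\log(1/\delta)} \sqrt{KT} = O(\sqrt{KT \log T}) \quad \text{(set } \delta = 1/T\text{)}. \end{aligned}$$

The last inequality above follows from the Cauchy-Schwarz inequality. ∎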

The UCB family has many variants, one of which is to replace the upper confidence bound for every action by . Even though the regret dominant terms ( and ) for this version has the same order as those of Algorithm 3, the leading constants for the two dominant terms become smaller.

Then one may ask the question: is it possible to further improve the regret bound of UCB and above variant? The answer is yes. Audibert et al. (2009) proposed an algorithm called MOSS (Minimax Optimal Strategy in the Stochastic case). MOSS replaces the upper confidence bounds in Algorithm 3 by

$$\hat{\mu}_a(t-1) + \sqrt{\frac{\max\left\{ \log\left( \frac{T}{K T_a(t-1)} \right), 0 \right\}}{T_a(t-1)}}.$$

Under this construction, the worst-case regret of MOSS is guaranteed to be only $O(\sqrt{KT})$.

However, MOSS is not always good. One can easily construct regimes where the problem-dependent regret of MOSS is worse than that of UCB (Lattimore, 2015). On the other hand, the improved UCB algorithm proposed by Auer and Ortner (2010) satisfies a problem-dependent regret bound that is similar to (2), but its worst-case regret is $O(\sqrt{KT \log K})$. Later on, by carefully constructing the upper confidence bounds, the Optimally Confident UCB algorithm (Lattimore, 2015) and the AdaUCB algorithm (Lattimore, 2018) were shown to achieve $O(\sqrt{KT})$ worst-case regret, and their problem-dependent regret bounds are also not worse than that of the UCB algorithm. There are many more UCB variants in the literature that we do not cover in this chapter. The reader may refer to Table 2 in Lattimore (2018) for a comprehensive summary.

##### Successive Elimination (SE).

We now describe the SE algorithm, which also relies on upper confidence bound calculations. The idea is similar to UCB: a sub-optimal arm is very unlikely to have a large upper confidence bound once it has been selected enough times. At every round, SE maintains a confidence interval for the mean reward of every arm and removes all arms whose upper confidence bound is smaller than the lower confidence bound of the arm with the largest estimated reward. The procedure ends when there is only one arm remaining. We describe the SE algorithm in Algorithm 4 and define the UCB terms as in (1) and the LCB terms as

$$\mathrm{LCB}_a(t-1, \delta) = \begin{cases} -\infty, & \text{if } T_a(t-1) = 0, \\ \hat{\mu}_a(t-1) - \sqrt{\dfrac{2\log(1/\delta)}{T_a(t-1) \vee 1}}, & \text{otherwise.} \end{cases} \tag{4}$$

SE was first proposed in Even-Dar et al. (2006) along with a similar action-elimination algorithm: Median Elimination (ME). They studied the probably approximately correct (PAC) setting (Haussler and Warmuth, 2018). In particular, Even-Dar et al. (2006) show that given $K$ arms, it suffices to pull the arms $O\left(\frac{K}{\epsilon^2} \log \frac{1}{\delta}\right)$ times to find an $\epsilon$-optimal arm with probability at least $1 - \delta$. It is not hard to prove that SE also satisfies the following regret bound.
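The elimination rule can be sketched as follows. This is an illustrative implementation under assumptions, not the chapter's Algorithm 4: active arms are played in round-robin phases, rewards are simulated as unit-variance Gaussians, the confidence width is $\sqrt{2\log(1/\delta)/T_a}$, and the name `successive_elimination` is hypothetical.

```python
import math
import random


def successive_elimination(means, horizon, delta, seed=0):
    """Illustrative SE: drop arms whose UCB falls below the leader's LCB."""
    rng = random.Random(seed)
    k = len(means)
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    best_mean = max(means)
    regret = 0.0
    t = 0
    while t < horizon:
        # One phase: play every arm that is still active once.
        for arm in active:
            if t == horizon:
                break
            reward = rng.gauss(means[arm], 1.0)
            counts[arm] += 1
            sums[arm] += reward
            regret += best_mean - means[arm]
            t += 1
        if len(active) > 1 and all(counts[a] > 0 for a in active):
            mean = {a: sums[a] / counts[a] for a in active}
            width = {
                a: math.sqrt(2.0 * math.log(1.0 / delta) / counts[a])
                for a in active
            }
            leader = max(active, key=lambda a: mean[a])
            threshold = mean[leader] - width[leader]  # LCB of the leader
            # Keep arms whose UCB is at least the leader's LCB.
            active = [a for a in active if mean[a] + width[a] >= threshold]
    return regret, active, counts


regret, active, counts = successive_elimination([1.0, 0.0], 3000, delta=1.0 / 3000)
print(regret, active, counts)
```

The leader always survives its own test, so the active set never becomes empty; once a single arm remains it is played for the rest of the horizon.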

###### Theorem 2 (Regret for SE Algorithm).

For a suitable choice of $\delta$ (an inverse polynomial in $T$), the worst-case regret of SE over 1-subgaussian bandit environments is bounded by

$$\mathrm{Reg}_T(\mathrm{SE}) = O(\sqrt{KT \log T}). \tag{5}$$
###### Proof.

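Without loss of generality, we assume the optimal arm $a^*$ is unique. Define the event $E$ by $E := \{ |\hat{\mu}_a(t) - \mu_a| \le c_a(t, \delta), \ \forall t \in [T], a \in \mathcal{A} \}$, where $c_a(t, \delta) = \sqrt{2\log(1/\delta) / (T_a(t) \vee 1)}$ denotes the confidence set width. By Hoeffding's inequality and a union bound, one can show that $E$ holds with high probability.

Define $t$ as the last round when arm $a$ is not eliminated yet. According to the elimination criterion in SE, the reward gap term can be bounded as: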

$$\Delta_a := \mu^* - \mu_a \le 2\left( c_{a^*}(t, \delta) + c_a(t, \delta) \right) = O(c_a(t, \delta)).$$

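The last equality holds as $T_{a^*}(t)$ and $T_a(t)$ differ by at most 1 by construction. Since $t$ is the last round in which arm $a$ is played, we have $T_a(t) = T_a(T)$ and hence $c_a(t, \delta) = c_a(T, \delta)$, which implies the following property for all non-optimal arms $a$: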

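$$\Delta_a \le O\left( \sqrt{\frac{\log(1/\delta)}{T_a(T)}} \right).$$

We thus obtain that under event $E$,

$$\sum_{t=1}^{T} \mathbb{E}[\mu^* - \mu_{A_t} \mid E] \le \sum_{a \in \mathcal{A} \setminus \{a^*\}} T_a(T) \Delta_a \le O\left( \sqrt{\log(1/\delta)} \right) \sum_{a \in \mathcal{A}} \sqrt{T_a(T)} \le O\left( \sqrt{KT \log(1/\delta)} \right),$$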

where the last inequality is by the Cauchy-Schwarz inequality. Taking $\delta$ to be a suitable inverse polynomial in $T$, the claimed bound follows by conditional expectation calculations. ∎

##### Thompson Sampling (TS).

All of the methods we have mentioned so far select their actions based on a frequentist view. TS uses one of the oldest heuristics (Thompson, 1933) for choosing actions and addresses the exploration-exploitation dilemma based on a Bayesian philosophy of learning. The idea is simple. Before the game starts, the agent chooses a prior distribution over a set of possible bandit environments. At every round, the agent samples an environment from the posterior and acts according to the optimal action in that environment. The exploration in TS comes from the randomization over bandit environments. At the beginning, the posterior is usually poorly concentrated, so the policy will likely explore. As more data are collected, the posterior tends to concentrate around the true environment and the rate of exploration decreases. We present the TS algorithm in Algorithm 5.
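To make the sample-then-act loop concrete, here is a minimal sketch for one particular conjugate model. The Bernoulli rewards, the independent Beta(1, 1) priors, and the function name `thompson_sampling_bernoulli` are assumptions chosen for illustration; the chapter's Algorithm 5 is stated for general priors.

```python
import random


def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Illustrative TS with Beta(1, 1) priors on Bernoulli arm means."""
    rng = random.Random(seed)
    k = len(means)
    alpha = [1.0] * k  # 1 + number of observed successes per arm
    beta = [1.0] * k   # 1 + number of observed failures per arm
    counts = [0] * k
    best_mean = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Sample one environment (a vector of arm means) from the posterior.
        sampled = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Act optimally in the sampled environment.
        arm = max(range(k), key=sampled.__getitem__)
        reward = 1 if rng.random() < means[arm] else 0
        # Conjugate posterior update for the played arm only.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        counts[arm] += 1
        regret += best_mean - means[arm]
    return regret, counts


regret, counts = thompson_sampling_bernoulli([0.8, 0.2], horizon=2000)
print(regret, counts)
```

Early on the Beta posteriors are wide, so the sampled means vary a lot and both arms get played; as counts grow the samples concentrate and the better arm dominates.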

To formally describe how TS works, we start with several definitions related to Bayesian bandits.

###### Definition 2 (K-armed Bayesian bandit environment).

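A $K$-armed Bayesian bandit environment is a tuple $(\mathcal{E}, \mathcal{G}, Q, P)$ where $(\mathcal{E}, \mathcal{G})$ is a measurable space and $Q$ is a probability measure on $(\mathcal{E}, \mathcal{G})$ called the prior. $P_\nu = (P_{\nu a} : a \in \mathcal{A})$ is the reward distribution for arms in bandit $\nu$, where $\nu \in \mathcal{E}$.

Given a $K$-armed Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$ and a policy $\pi$, the Bayesian regret is defined as:

$$\mathrm{BReg}_T(\pi, Q) = \int_{\mathcal{E}} \mathrm{Reg}_T(\pi, \nu) \, dQ(\nu).$$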

TS has been analyzed in both of the frequentist and the Bayesian settings and we will start with the Bayesian results.

###### Theorem 3 (Bayesian Regret for TS Algorithm).

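Consider a $K$-armed Bayesian bandit environment $(\mathcal{E}, \mathcal{G}, Q, P)$ such that $P_{\nu a}$ is 1-subgaussian for all $\nu \in \mathcal{E}$ and $a \in \mathcal{A}$, with mean in $[0, 1]$. Then the policy $\pi$ of TS satisfies

$$\mathrm{BReg}_T(\pi, Q) = O(\sqrt{KT \log T}). \tag{6}$$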
###### Proof.

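The proof is quite similar to that of UCB. We abbreviate $\mu_a = \mu_a(\nu)$ and let $a^* = \arg\max_{a \in \mathcal{A}} \mu_a$ be the optimal arm. Note that $a^*$ is a random variable depending on $\nu$. For every $a \in \mathcal{A}$ and $t \in [T]$, we define a clipped upper bound term

$$\mathrm{UCB}_a(t-1) = \hat{\mu}_a(t-1) + \sqrt{\frac{2\log(1/\delta)}{1 \vee T_a(t-1)}},$$

where $\hat{\mu}_a(t-1)$ and $T_a(t-1)$ are defined in the same way as those in UCB. We define event $E$ such that for all $t \in [T]$ and $a \in \mathcal{A}$,

$$|\hat{\mu}_a(t-1) - \mu_a| < \sqrt{\frac{2\log(1/\delta)}{1 \vee T_a(t-1)}}.$$

By Hoeffding's inequality and a union bound, one can show that $E$ holds with high probability. This result will be used in later steps.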

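Let $\mathcal{F}_{t-1} = \sigma(A_1, R_1, \dots, A_{t-1}, R_{t-1})$ be the $\sigma$-algebra generated by the interaction sequence up to time $t-1$. The key insight for the whole proof is to observe the following property from the definition of TS:

$$P(a^* = \cdot \mid \mathcal{F}_{t-1}) = P(A_t = \cdot \mid \mathcal{F}_{t-1}) \quad \text{a.s.} \tag{7}$$

Using the above property, which gives $\mathbb{E}[\mathrm{UCB}_{a^*}(t-1) \mid \mathcal{F}_{t-1}] = \mathbb{E}[\mathrm{UCB}_{A_t}(t-1) \mid \mathcal{F}_{t-1}]$, we have

$$\mathbb{E}[\mu_{a^*} - \mu_{A_t} \mid \mathcal{F}_{t-1}] = \mathbb{E}[\mu_{a^*} - \mathrm{UCB}_{a^*}(t-1) + \mathrm{UCB}_{A_t}(t-1) - \mu_{A_t} \mid \mathcal{F}_{t-1}],$$

and thus

$$\mathrm{BReg}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \mu_{a^*} - \mathrm{UCB}_{a^*}(t-1) \right) + \sum_{t=1}^{T} \left( \mathrm{UCB}_{A_t}(t-1) - \mu_{A_t} \right) \right].$$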

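Conditioning on the high-probability event $E$, the first sum is negative and the second sum is of the order $O(\sqrt{KT \log(1/\delta)})$, while the regret is at most of order $T$ when conditioning on $E^c$. Taking $\delta$ to be a suitable inverse polynomial in $T$, one can verify that $\mathrm{BReg}_T = O(\sqrt{KT \log T})$. ∎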

Compared to the analysis of Bayesian regret, the frequentist (worst-case) regret analysis of TS is a lot more technical. The key reason is that the worst-case regret does not involve an expectation with respect to the prior, and therefore the property in (7) cannot be used. Even though TS was well known to be easy to implement and competitive with state-of-the-art methods, it lacked a worst-case regret analysis for a long time. Significant progress was made by Agrawal and Goyal (2012) and Kaufmann et al. (2012). In Agrawal and Goyal (2012), the first logarithmic bound on the frequentist regret of TS was proven. Kaufmann et al. (2012) provided a bound that matches the asymptotic lower bound of Lai and Robbins (1985). However, both of these bounds were problem-dependent. The first near-optimal worst-case regret bound was proved by Agrawal and Goyal (2013) for Bernoulli bandits with a Beta prior, where the reward is either zero or one. For TS that uses a Gaussian prior, the same work proved an $O(\sqrt{KT \log K})$ worst-case regret. Jin et al. (2020) proposed a variant of TS called MOTS (Minimax Optimal TS) that achieves $O(\sqrt{KT})$ regret.

#### 2.1.2 Adversarial Multi-armed Bandit

In stochastic bandit models, the rewards are assumed to be strictly i.i.d. given actions. This assumption can be violated easily in practice. For example, the health feedback for a patient after certain treatments may vary slightly over time, and the way it varies is usually unknown. In such scenarios, a best action that maximizes the total reward still exists, but algorithms designed for stochastic bandit environments are no longer guaranteed to work. As a more robust counterpart to the stochastic model, we study the adversarial bandit model in this section, in which the assumption that a single action is good in hindsight is retained but the rewards are allowed to be chosen adversarially.

A $K$-armed adversarial bandit is an arbitrary sequence of reward vectors $r = (r_1, \dots, r_T)$ with $r_t \in [0, 1]^K$, where $r_t$ corresponds to the rewards over all actions at time $t$. In every round, the agent chooses a distribution $P_t$ over the actions. An action $A_t$ is sampled from $P_t$ and the agent receives the reward $R_t = r_{tA_t}$. A policy in this setting maps the history sequences to distributions over actions. We evaluate the performance of policy $\pi$ by the expected regret, which is the cumulative reward difference between the best fixed action and the agent's selections:

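$$\mathrm{Reg}_T(\pi, r) = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r_{ta} - \mathbb{E}\left[ \sum_{t=1}^{T} r_{tA_t} \right]. \tag{8}$$

The worst-case regret of policy $\pi$ is defined by

$$\mathrm{Reg}_T(\pi) = \sup_{r \in [0,1]^{T \times K}} \mathrm{Reg}_T(\pi, r). \tag{9}$$

It may not be clear at first why we define the regret by comparing to the best fixed action instead of the best action at every round. In the latter case, the regret would be $\mathbb{E}\left[ \sum_{t=1}^{T} \max_{a \in \mathcal{A}} r_{ta} - \sum_{t=1}^{T} r_{tA_t} \right]$. However, this definition gives the adversary too much power: for any policy, one can show that this regret can be linear in $T$ for certain reward vectors $r$.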

##### Remark on randomized policy:

In stochastic bandit models, the optimal action is deterministic and the optimal policy is simply to select the optimal action at every round. However, in the adversarial bandit setting, the adversary has great power in designing the rewards. It may know the agent's policy and design the rewards accordingly, so that a deterministic policy can incur linear regret. For example, suppose there are two actions whose rewards are either $0$ or $1$ at any time. Under a deterministic policy, the agent's choice of action at time $t$ is a known function of the history. The adversary therefore knows it and can set the reward of that action at time $t$ to be $0$ and the reward of the unselected action to be $1$. The cumulative regret will be at least $T/2$ after $T$ rounds. However, one can improve the performance with a randomized policy, e.g., choosing either action with probability $1/2$; then the adversary cannot make the agent incur regret at every round by manipulating the reward values for both actions.
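The two-action example can be checked with a small simulation. The sketch below is only an illustration: the function names, the policy signature `policy(t, history)`, and the two sample policies are hypothetical choices for this example. The adversary zeroes the reward of whichever action a deterministic policy is about to pick, and the last function computes the expected reward of the uniform randomized policy on a fixed reward table.

```python
def regret_against_informed_adversary(policy, horizon):
    """Regret of a deterministic two-action policy when the adversary sets
    the chosen action's reward to 0 and the other action's reward to 1."""
    arm_totals = [0, 0]
    agent_total = 0
    history = []
    for t in range(horizon):
        choice = policy(t, history)  # the adversary can compute this as well
        rewards = [1, 1]
        rewards[choice] = 0
        agent_total += rewards[choice]
        arm_totals[0] += rewards[0]
        arm_totals[1] += rewards[1]
        history.append((choice, rewards[choice]))
    return max(arm_totals) - agent_total


def always_first(t, history):
    return 0


def alternate(t, history):
    return t % 2


def expected_reward_uniform(reward_table):
    """Expected total reward when each of two actions is chosen w.p. 1/2."""
    return sum(0.5 * (r0 + r1) for r0, r1 in reward_table)


print(regret_against_informed_adversary(always_first, 100))
print(regret_against_informed_adversary(alternate, 100))
print(expected_reward_uniform([(0, 1)] * 100))
```

Both deterministic policies earn zero reward, so their regret against the best fixed action is at least half the horizon, whereas the uniform policy earns half a unit per round in expectation whenever exactly one action pays off.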

##### Exponential-weight algorithm for Exploration and Exploitation (EXP3).

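We now study one of the most famous adversarial bandit algorithms, called EXP3. Before describing the algorithm, we define some related terms. In a randomized policy, the conditional probability of action $a$ being played at round $t$ is denoted by

$$P_{ta} = P(A_t = a \mid A_1, R_1, \dots, A_{t-1}, R_{t-1}).$$

Assuming $P_{ta} > 0$ almost surely for all policies, a natural way to define the importance-weighted estimator of $r_{ta}$ is

$$\hat{R}_{ta} = \frac{\mathbb{1}\{A_t = a\} R_t}{P_{ta}}. \tag{10}$$

Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, R_1, \dots, A_{t-1}, R_{t-1}]$. A simple calculation shows that $\hat{R}_{ta}$ is conditionally unbiased, i.e. $\mathbb{E}_t[\hat{R}_{ta}] = r_{ta}$. However, the variance of the estimator $\hat{R}_{ta}$ can be extremely large when $P_{ta}$ is small and $r_{ta}$ is non-zero. Let $A_{ta} = \mathbb{1}\{A_t = a\}$; then the variance of the estimator is:

$$\mathbb{V}_t[\hat{R}_{ta}] = \mathbb{E}_t[\hat{R}_{ta}^2] - r_{ta}^2 = \mathbb{E}_t\left[ \frac{A_{ta} r_{ta}^2}{P_{ta}^2} \right] - r_{ta}^2 = \frac{r_{ta}^2 (1 - P_{ta})}{P_{ta}}.$$

An alternative estimator is:

$$\hat{R}_{ta} = 1 - \frac{\mathbb{1}\{A_t = a\}}{P_{ta}} (1 - R_t). \tag{11}$$

This estimator is still unbiased and its variance is

$$\mathbb{V}_t[\hat{R}_{ta}] = y_{ta}^2 \frac{1 - P_{ta}}{P_{ta}},$$

where we define $y_{ta} = 1 - r_{ta}$.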
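Both variance formulas can be verified by enumerating the two outcomes of the indicator of the event that action $a$ is played. The helper below is an illustrative check, with the name `estimator_moments` and the example values of the sampling probability and reward chosen for this sketch; it computes the exact conditional mean and variance of estimators (10) and (11).

```python
def estimator_moments(p, r):
    """Exact conditional mean and variance of estimators (10) and (11).

    p is the probability of playing action a and r is its reward.
    With probability p the action is played and reward r is observed;
    otherwise the indicator is zero.
    """
    probs = (p, 1.0 - p)
    iw_values = (r / p, 0.0)                   # estimator (10)
    alt_values = (1.0 - (1.0 - r) / p, 1.0)    # estimator (11)

    def moments(values):
        mean = sum(q * v for q, v in zip(probs, values))
        var = sum(q * (v - mean) ** 2 for q, v in zip(probs, values))
        return mean, var

    return moments(iw_values), moments(alt_values)


(iw_mean, iw_var), (alt_mean, alt_var) = estimator_moments(p=0.1, r=0.9)
print(iw_mean, iw_var)    # mean r, variance r^2 (1 - p) / p
print(alt_mean, alt_var)  # mean r, variance (1 - r)^2 (1 - p) / p
```

For a rarely played action with a large reward, the two estimators share the same mean but the variance of (11) is much smaller, since it scales with the squared loss rather than the squared reward.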

The best choice of estimator depends on the actual rewards. One should use (10) for small rewards and (11) for large rewards. So far we have learned how to construct reward estimators for given sampling distributions $P_{ta}$. The EXP3 algorithm provides a way to design the $P_{ta}$ terms. Let $\hat{S}_{ta} = \sum_{s=1}^{t} \hat{R}_{sa}$ be the total estimated reward until the end of round $t$, where $\hat{R}_{sa}$ is defined in (11). We present EXP3 in Algorithm 6.
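A compact sketch of EXP3 is given below. It assumes the standard exponential-weights rule, in which each action is sampled with probability proportional to the exponential of the learning rate times its cumulative estimated reward, combined with estimator (11); the chapter's Algorithm 6 is not reproduced here, so the function name `exp3`, the reward-table input format, and the numerical stabilisation by subtracting the maximum are assumptions of this example.

```python
import math
import random


def exp3(reward_table, eta, seed=0):
    """Illustrative EXP3. reward_table[t][a] is the reward in [0, 1] of
    action a at round t. Returns (total reward, last sampling distribution)."""
    rng = random.Random(seed)
    k = len(reward_table[0])
    s_hat = [0.0] * k  # cumulative estimated rewards
    total = 0.0
    probs = [1.0 / k] * k
    for rewards in reward_table:
        shift = max(s_hat)  # subtracting the max avoids overflow in exp
        weights = [math.exp(eta * (s - shift)) for s in s_hat]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = rewards[arm]  # only the played action's reward is observed
        total += reward
        # Estimator (11): every action gets +1, the played action is corrected.
        for a in range(k):
            s_hat[a] += 1.0
        s_hat[arm] -= (1.0 - reward) / probs[arm]
    return total, probs


T, K = 2000, 2
table = [(0.9, 0.1)] * T
eta = math.sqrt(math.log(K) / (T * K))
total, probs = exp3(table, eta)
print(total, probs)
```

The reward sequence in the usage lines is fixed in advance, which is the simplest kind of adversary; the sampling distribution shifts towards the first action as its estimated cumulative reward pulls ahead.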

Surprisingly, even though adversarial bandit problems look more difficult than stochastic bandit problems due to the strong power of the adversary, one can show that the adversarial regret of the EXP3 algorithm has the same order as before, i.e. $O(\sqrt{KT \log K})$.

###### Theorem 4 (Regret for EXP3 Algorithm).

Let $r \in [0,1]^{T \times K}$. With learning rate $\eta = \sqrt{\log(K) / (TK)}$, we have

$$\mathrm{Reg}_T(\mathrm{EXP3}, r) \le 2\sqrt{KT \log K}. \tag{12}$$
###### Proof.

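The proof for EXP3 is different from those for the stochastic bandit algorithms. We first define the expected regret relative to using action $a$ in $T$ rounds:

$$\mathrm{Reg}_{Ta} = \sum_{t=1}^{T} r_{ta} - \mathbb{E}\left[ \sum_{t=1}^{T} R_t \right].$$

According to the definition of the adversarial bandit regret, the final result will follow if we can bound $\mathrm{Reg}_{Ta}$ for every $a \in \mathcal{A}$. It is not hard to show that $\mathbb{E}[\hat{S}_{Ta}] = \sum_{t=1}^{T} r_{ta}$ and $\mathbb{E}_t[R_t] = \sum_{a \in \mathcal{A}} P_{ta} \mathbb{E}_t[\hat{R}_{ta}]$ hold using the definition of $\hat{R}_{ta}$. Then we can re-write $\mathrm{Reg}_{Ta}$ as $\mathbb{E}[\hat{S}_{Ta} - \hat{S}_T]$, where $\hat{S}_T = \sum_{t=1}^{T} \sum_{a \in \mathcal{A}} P_{ta} \hat{R}_{ta}$. Let $W_t = \sum_{a \in \mathcal{A}} \exp(\eta \hat{S}_{ta})$, $\hat{S}_{0a} = 0$, and $W_0 = K$; then one can show that

$$\exp(\eta \hat{S}_{Ta}) \le \sum_{a' \in \mathcal{A}} \exp(\eta \hat{S}_{Ta'}) = W_T = W_0 \prod_{t=1}^{T} \frac{W_t}{W_{t-1}} = K \prod_{t=1}^{T} \frac{W_t}{W_{t-1}}.$$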

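We next bound the ratio term

$$\frac{W_t}{W_{t-1}} \le 1 + \eta \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'} + \eta^2 \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'}^2 \le \exp\left( \eta \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'} + \eta^2 \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'}^2 \right),$$

using the inequalities $\exp(x) \le 1 + x + x^2$ for $x \le 1$ and $1 + x \le \exp(x)$ for $x \in \mathbb{R}$.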

Combining with previous results, we have

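$$\exp(\eta \hat{S}_{Ta}) \le K \exp\left( \eta \hat{S}_T + \eta^2 \sum_{t=1}^{T} \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'}^2 \right).$$

Taking the logarithm on both sides and rearranging gives

$$\hat{S}_{Ta} - \hat{S}_T \le \frac{\log K}{\eta} + \eta \sum_{t=1}^{T} \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'}^2. \tag{13}$$

To bound $\mathrm{Reg}_{Ta}$, we only need to bound the expectation of the second term above. By standard (conditional) expectation calculations, one can get

$$\mathbb{E}\left[ \sum_{t=1}^{T} \sum_{a' \in \mathcal{A}} P_{ta'} \hat{R}_{ta'}^2 \right] \le TK.$$

By substituting the above inequality into (13), we get

$$\mathrm{Reg}_{Ta} \le \frac{\log K}{\eta} + \eta TK = 2\sqrt{KT \log K}, \tag{14}$$

where we choose $\eta = \sqrt{\log(K) / (TK)}$. By definition, the overall regret has the same upper bound. ∎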

We have just bounded the expected regret of EXP3. However, if we consider the distribution of the random regret, EXP3 is not good enough. Define the random regret as $\widehat{\mathrm{Reg}}_T = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r_{ta} - \sum_{t=1}^{T} R_t$. One can show that for all large enough $T$ and reasonable choices of $\eta$, there exists a bandit such that the random regret of EXP3 is linear in $T$ with probability at least $c$, where $c > 0$ is a constant. That means EXP3 can sometimes incur linear regret with non-trivial probability, which makes EXP3 unsuitable for practical problems. This phenomenon is caused by the high variance of the regret distribution. Next, we discuss how to resolve this problem by slightly modifying EXP3.

##### EXP3-IX (EXP3 with Implicit eXploration).

We have learned that small $P_{ta}$ terms can cause enormous variance in the reward estimator, which then leads to high variance in the regret distribution. Thus, EXP3-IX (Neu, 2015) redefines the loss estimator as

$$\hat{Y}_{ta} = \frac{\mathbb{1}\{A_t = a\}\,Y_t}{P_{ta} + \gamma}, \tag{15}$$

where $Y_t$ denotes the loss at round $t$ and $\gamma > 0$. The estimator $\hat{Y}_{ta}$ is biased for $y_{ta}$ because of $\gamma$, but its variance is reduced. An optimal choice of $\gamma$ needs to balance the bias and the variance. Apart from this slight change to the loss estimator, the procedure is the same as EXP3. The name "IX" (Implicit eXploration) can be justified by the following argument:

$$\mathbb{E}_t[\hat{Y}_{ta}] = \frac{P_{ta}\,y_{ta}}{P_{ta} + \gamma} \le y_{ta}.$$

The effect of adding $\gamma$ to the denominator is that EXP3-IX decreases the large estimated losses of some actions, so that such actions can still be chosen occasionally. As a result, EXP3-IX explores more than EXP3. Neu (2015) proved the following high-probability regret bound for EXP3-IX.

###### Theorem 5 (Regret for EXP3-IX Algorithm).

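With $\eta = 2\gamma = \sqrt{2\log K/(KT)}$, EXP3-IX guarantees that

$$\widehat{\mathrm{Reg}}_T(\text{EXP3-IX}) \le 2\sqrt{2KT\log K} + \left(\sqrt{\frac{2KT}{\log K}} + 1\right)\log(2/\delta) \tag{16}$$

with probability at least $1 - \delta$.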
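The bias and variance trade-off behind the estimator (15) can be made concrete with exact one-step calculations. The sketch below is illustrative only: the helper names `ix_loss_estimate` and `conditional_mean_and_var`, and the values of the loss, the sampling probability, and $\gamma$, are hypothetical choices rather than quantities from the chapter.

```python
import numpy as np

def ix_loss_estimate(action, loss, probs, gamma):
    """Loss estimates as in (15): only the played arm gets a nonzero entry."""
    est = np.zeros_like(probs, dtype=float)
    est[action] = loss / (probs[action] + gamma)
    return est

def conditional_mean_and_var(y, p, gamma):
    """Exact conditional mean and variance of the estimate for one arm with
    loss y that is played with probability p (the estimate is 0 otherwise)."""
    value = y / (p + gamma)       # value of the estimate when the arm is played
    mean = p * value
    var = p * value**2 - mean**2
    return mean, var

y, p = 0.8, 0.01                  # a rarely played arm with a fairly large loss
gamma = 0.005                     # illustrative value, not a tuned constant

mean_plain, var_plain = conditional_mean_and_var(y, p, 0.0)
mean_ix, var_ix = conditional_mean_and_var(y, p, gamma)
print(f"gamma = 0:     mean {mean_plain:.3f}, variance {var_plain:.2f}")
print(f"gamma = {gamma}: mean {mean_ix:.3f}, variance {var_ix:.2f}")
```

With $\gamma = 0$ the estimator is unbiased but its variance scales like $1/P_{ta}$; a positive $\gamma$ pulls the mean below the true loss and shrinks the variance, which is the optimistic bias that the name "implicit exploration" refers to.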

#### 2.1.3 Lower Bound for MAB Problems

We have discussed two types of bandit models and their corresponding regret-minimization algorithms. A natural question is: what is the smallest regret we can hope for? To answer this question, we introduce two types of lower bound results: minimax lower bounds and instance-dependent lower bounds. Both are useful for describing the hardness of a class of bandit problems and are often used to evaluate the optimality of an existing algorithm. For example, if the worst-case regret of a policy matches the minimax lower bound up to a universal constant, we say that the policy is minimax-optimal.

##### Minimax Lower Bounds.

We consider a Gaussian bandit environment, in which the reward of every arm is Gaussian-distributed. We denote the class of $K$-armed Gaussian bandits with unit variance by $\mathcal{E}_{\mathcal{N}}^{K}$ and use $\mu \in \mathbb{R}^K$ as the reward mean vector. In particular, $\nu_\mu$ is a Gaussian bandit for which the $i$th arm has reward distribution $\mathcal{N}(\mu_i, 1)$. The following result provides a minimax lower bound for the Gaussian bandit class $\mathcal{E}_{\mathcal{N}}^{K}$.

###### Theorem 6 (Minimax Lower Bound for Gaussian Bandit Class).

Let $K > 1$ and $T \ge K - 1$. For any policy $\pi$, there exists a mean vector $\mu \in [0,1]^K$ such that

$$\mathrm{Reg}_T(\pi, \nu_\mu) = \Omega(\sqrt{KT}). \tag{17}$$
###### Proof.

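To prove the lower bound, we start by constructing two bandit instances that are very similar to each other and hard to distinguish. Let $\mu = (\Delta, 0, \ldots, 0)$ denote the mean vector of the first unit-variance Gaussian bandit, where $\Delta > 0$ is to be chosen later. We use $\mathbb{P}_\mu$ and $\mathbb{E}_\mu$ to denote the probability and expectation induced by environment $\nu_\mu$ and policy $\pi$ up to time $T$. To choose the second environment, let

$$i = \arg\min_{j > 1} \mathbb{E}_\mu[T_j(T)].$$

Define the reward mean vector of the second bandit as $\mu' = (\Delta, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)$, where $\mu'_i = 2\Delta$. Decomposing the regret leads to

$$\mathrm{Reg}_T(\pi, \nu_\mu) \ge \mathbb{P}_\mu\big(T_1(T) \le T/2\big)\frac{T\Delta}{2}, \qquad \mathrm{Reg}_T(\pi, \nu_{\mu'}) > \mathbb{P}_{\mu'}\big(T_1(T) > T/2\big)\frac{T\Delta}{2}.$$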

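Then, applying the Bretagnolle-Huber inequality, we get

$$\mathrm{Reg}_T(\pi, \nu_\mu) + \mathrm{Reg}_T(\pi, \nu_{\mu'}) \ge \frac{T\Delta}{4}\exp\big(-\mathrm{KL}(\mathbb{P}_\mu, \mathbb{P}_{\mu'})\big).$$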

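It remains to upper bound the KL-divergence term above. By the divergence decomposition, one can show that

$$\mathrm{KL}(\mathbb{P}_\mu, \mathbb{P}_{\mu'}) = \sum_{j=1}^{K} \mathbb{E}_\mu[T_j(T)]\,\mathrm{KL}(P_j, P'_j) = \mathbb{E}_\mu[T_i(T)]\,\mathrm{KL}\big(\mathcal{N}(0,1), \mathcal{N}(2\Delta,1)\big) = \mathbb{E}_\mu[T_i(T)]\frac{(2\Delta)^2}{2} \le \frac{2T\Delta^2}{K-1}.$$

Here $P_j$ and $P'_j$ denote the reward distributions of the $j$th arm in $\nu_\mu$ and $\nu_{\mu'}$, respectively; the two bandits differ only in arm $i$, so only the $i$th term of the sum survives. For the last inequality, since $\sum_{j=1}^{K} \mathbb{E}_\mu[T_j(T)] = T$ and $i$ minimizes $\mathbb{E}_\mu[T_j(T)]$ over $j > 1$, it holds that $\mathbb{E}_\mu[T_i(T)] \le T/(K-1)$. Combining with the previous results, we know that

$$\mathrm{Reg}_T(\pi, \nu_\mu) + \mathrm{Reg}_T(\pi, \nu_{\mu'}) \ge \frac{T\Delta}{4}\exp\left(-\frac{2T\Delta^2}{K-1}\right).$$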

Choosing $\Delta = \sqrt{(K-1)/(4T)}$, the result follows. ∎
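One way to see which gap $\Delta$ makes the final two-point bound largest is to evaluate its right-hand side on a grid. The sketch below is a numerical illustration under assumed values of $T$ and $K$; the function name `two_point_bound` is hypothetical. Setting the derivative of $\Delta \exp(-2T\Delta^2/(K-1))$ to zero gives the maximiser $\Delta = \sqrt{(K-1)/(4T)}$, which the grid search should reproduce approximately.

```python
import numpy as np

def two_point_bound(delta, T, K):
    """Right-hand side (T * delta / 4) * exp(-2 * T * delta**2 / (K - 1))."""
    return T * delta / 4.0 * np.exp(-2.0 * T * delta**2 / (K - 1))

T, K = 10_000, 10                      # illustrative horizon and number of arms
deltas = np.linspace(1e-4, 0.2, 2000)  # candidate gaps
values = two_point_bound(deltas, T, K)
grid_best = deltas[np.argmax(values)]

delta_star = np.sqrt((K - 1) / (4 * T))           # analytic maximiser
closed_form = np.sqrt((K - 1) * T) / 8.0 * np.exp(-0.5)

print(f"grid maximiser      : {grid_best:.4f}")
print(f"sqrt((K-1)/(4T))    : {delta_star:.4f}")
print(f"bound at delta_star : {two_point_bound(delta_star, T, K):.3f}")
print(f"sqrt((K-1)T)/8 e^-1/2: {closed_form:.3f}")
```

At this gap the sum of the two regrets is at least $\sqrt{(K-1)T}\,e^{-1/2}/8$, so at least one of the two bandits forces regret of order $\sqrt{KT}$, in line with (17).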

An algorithm is called minimax-optimal if its worst-case regret matches the minimax lower bound up to a universal constant.

##### Instance Dependent Lower Bounds.

An algorithm with nearly minimax-optimal regret is not always preferred, since it may fail to take advantage of environments that are not the worst case. In practice, it is more desirable to have algorithms that are near minimax-optimal and whose performance improves on "easier" instances (Lattimore and Szepesvári, 2020). This motivates the study of instance-dependent regret. In this section, we present two types of lower bounds for instance-dependent regret: one is asymptotic, the other is finite-time.

We first define a consistent policy and then present the asymptotic instance-dependent lower bound.

###### Definition 3 (Consistent Policy).

A policy $\pi$ is consistent over a class of bandit environments $\mathcal{E}$ if for all bandits $\nu \in \mathcal{E}$ and for all $p > 0$ it holds that

$$\mathrm{Reg}_T(\pi, \nu) = O(T^p) \quad \text{as } T \to \infty.$$

###### Theorem 7 (Asymptotic Instance Dependent Lower Bound for Gaussian Bandits (Lattimore and Szepesvári, 2020)).

For any policy $\pi$ consistent over the class of $K$-armed unit-variance Gaussian environments and any bandit $\nu$ in this class, it holds that

$$\liminf_{T \to \infty} \frac{\mathrm{Reg}_T(\pi, \nu)}{\log T} \ge \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.$$

A policy is called asymptotically optimal if equality holds in the above theorem. Interestingly, building on an idea similar to that of Theorem 7, one can also derive a finite-time instance-dependent lower bound.
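To get a feel for the constant in Theorem 7, it can be computed for a concrete mean vector. The sketch below is illustrative: the function name `asymptotic_lower_bound_constant` and the mean vector are hypothetical, and the gaps are taken to be $\Delta_i = \max_j \mu_j - \mu_i$.

```python
import numpy as np

def asymptotic_lower_bound_constant(means):
    """Return sum over suboptimal arms of 2 / Delta_i for unit-variance
    Gaussian arms, where Delta_i = max(means) - means[i]."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means
    suboptimal = gaps[gaps > 0]          # optimal arms contribute nothing
    return float(np.sum(2.0 / suboptimal))

means = [0.9, 0.8, 0.5]                  # hypothetical instance; gaps 0.1 and 0.4
constant = asymptotic_lower_bound_constant(means)
for T in (10**3, 10**5):
    print(f"T = {T}: leading term of the lower bound ~ {constant * np.log(T):.1f}")
```

Arms that are nearly optimal dominate the constant, because their contribution scales like $1/\Delta_i$: here the arm with gap $0.1$ contributes $20$ and the arm with gap $0.4$ contributes only $5$.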

###### Theorem 8 (Instance Dependent Lower Bound for Gaussian Bandits (Lattimore and Szepesvári, 2020)).

Let $\nu$ be a $K$-armed Gaussian bandit with mean vector $\mu \in \mathbb{R}^K$ and suboptimality gaps $\Delta \in [0, \infty)^K$. Define a bandit environment:

 E(ν)