1 Introduction
The Oxford English Dictionary defines precision medicine as “medical care designed to optimize efficiency or therapeutic benefit for particular groups of patients, especially by using genetic or molecular profiling.” It is not an entirely new idea: physicians from ancient times have recognized that medical treatment needs to consider individual variations in patient characteristics (Konstantinidou et al., 2017). However, the modern precision medicine movement has been enabled by a confluence of events: scientific advances in fields such as genetics and pharmacology, technological advances in mobile devices and wearable sensors, and methodological advances in computing and data sciences.
This chapter is about bandit algorithms: an area of data science of special relevance to precision medicine. With their roots in the seminal work of Bellman, Robbins, Lai, and others, bandit algorithms have come to occupy a central place in modern data science (see the book by Lattimore and Szepesvári (2020) for an up-to-date treatment). Bandit algorithms can be used in any situation where treatment decisions need to be made to optimize some health outcome. Since precision medicine focuses on the use of patient characteristics to guide treatment, contextual bandit algorithms are especially useful since they are designed to take such information into account.
The role of bandit algorithms in areas of precision medicine such as mobile health and digital phenotyping has been reviewed before (Tewari and Murphy, 2017; Rabbi et al., 2019). Since these reviews were published, bandit algorithms have continued to find uses in mobile health and several new topics have emerged in the research on bandit algorithms. This chapter is written for quantitative researchers in fields such as statistics, machine learning, and operations research who might be interested in knowing more about the algorithmic and mathematical details of bandit algorithms that have been used in mobile health.
We have organized this chapter to meet two goals. First, we want to provide a concise exposition of basic topics in bandit algorithms. Section 2 will help the reader become familiar with basic problem setups and algorithms that appear frequently in applied work in precision medicine and mobile health (see, for example, Paredes et al. (2014); Piette et al. (2015); Rabbi et al. (2015); Piette et al. (2016); Yom-Tov et al. (2017); Rindtorff et al. (2019); Forman et al. (2019); Liao et al. (2020); Ameko et al. (2020); Aguilera et al. (2020); Tomkins et al. (2021)). Second, we want to highlight a few advanced topics that are important for mobile health and precision medicine applications but whose full potential remains to be realized. Section 3 will provide the reader with helpful entry points into the bandit literature on non-stationarity, robustness to corrupted rewards, satisfying additional constraints, algorithmic fairness, and causality.
2 Basic Topics
In this section, we begin by introducing the simplest of all bandit problems: the multi-armed bandit. Then we discuss a more advanced variant, called the contextual bandit, that is especially suitable for precision medicine applications. The last topic we discuss in this section is offline learning, which deals with algorithms that can use already collected data. The offline learning setting is to be contrasted with the online learning setting, where the bandit algorithm has control over the data it collects.
2.1 Multi-armed Bandit
In recent years, the multi-armed bandit (MAB) framework has attracted a lot of attention in many application areas such as healthcare, marketing, and recommendation systems. MAB is a simple model that describes the interaction between an agent (also referred to as a learner, statistician, or decision maker) and an environment. At every time step, the agent makes a choice from an action set and receives a reward. Since the historical roots of probability theory lie in gambling and casinos, it is not surprising that the MAB terminology comes from imagining a slot machine in a casino: a slot machine is also called a "one-armed bandit" as it robs you of your money. Therefore, we will use actions and arms interchangeably. The agent may have different goals, such as maximizing the (discounted) cumulative reward within a time horizon, identifying the best arm, or competing with the arm with the best risk-return tradeoff. In this section, we focus on maximizing the cumulative reward for simplicity. An important observation is that the agent needs to balance exploration and exploitation to achieve its goal of receiving high cumulative reward. That is, both under-explored arms and tried-and-tested arms with high rewards should be selected often, but for different reasons: the former have the potential to achieve high rewards, and the latter are already confirmed to be good based on past experience.

To formally define the bandit framework, we start by introducing some notation. Suppose the agent interacts with the environment for $T$ time steps, where $T$ is called the horizon. In each round $t$, the learner chooses an action $A_t$ from the action set $\mathcal{A}$ and receives a corresponding reward $X_t$. We denote the cardinality of $\mathcal{A}$ by $K$. The choice of $A_t$ depends on the action/reward history up to time $t-1$: $H_{t-1} = (A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. A policy $\pi_t$ is defined as a mapping from the history up to time $t-1$ to actions. For short, we use $\pi$ to denote the sequence of policies $(\pi_t)_{t=1}^{T}$.
In a healthcare setting, the fundamental pattern that often occurs is the following: at each decision point, a treatment is selected for the patient, a health outcome is then observed, and this loop repeats over time. Of course, this simple pattern fails to capture the full complexity of decision making in healthcare, but it is a reasonable starting point, especially for theoretical analysis.
In the remainder of this section, we will review bandit algorithms that learn good decision policies over time. We focus on two key settings: stochastic bandits and adversarial bandits. In both settings, the algorithms aim to minimize their regret, which measures the difference between the maximal reward one can obtain and the reward obtained by the algorithm. We will formally define regret in each setting.
2.1.1 Stochastic Multi-armed Bandit
A stochastic bandit is a collection of reward distributions, one per action: $\nu = (P_a : a \in \mathcal{A})$. We define the environment class $\mathcal{E}$ as a set of such bandits. An environment class is called unstructured if it has a product form, i.e., $\mathcal{E} = \{(P_a : a \in \mathcal{A}) : P_a \in \mathcal{M}_a \text{ for all } a\}$, where for each $a$, $\mathcal{M}_a$ is a set of distributions. For unstructured bandits, playing one action cannot help the agent deduce anything about other actions. Environment classes that are not unstructured are called structured; examples include linear bandits (Abbasi-Yadkori et al., 2011), low-rank bandits (Lu et al., 2021b), and combinatorial bandits (Cesa-Bianchi and Lugosi, 2012). Throughout this chapter, we assume all bandit instances are 1-subgaussian, which means the reward distribution of every arm is 1-subgaussian.
Definition 1 (Subgaussianity).
A random variable $X$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda (X - \mathbb{E}[X]))] \le \exp(\lambda^{2}\sigma^{2}/2)$.

It is not hard to see from the definition that many well-known distributions are subgaussian, e.g., any distribution with bounded support, the Bernoulli distribution, and Gaussian distributions. Intuitively, a subgaussian distribution has tails no heavier than those of a Gaussian distribution. Many useful concentration inequalities have been developed for subgaussian variables and are widely used in the proofs of bandit algorithms.
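As a quick numeric illustration (a sketch in Python; the helper names are ours, not from any library), one can check the subgaussian moment generating function bound for a Rademacher variable, whose MGF is $\cosh(\lambda) \le \exp(\lambda^{2}/2)$:

```python
import math

def rademacher_mgf(lam):
    """MGF of a Rademacher variable (+1 or -1 with probability 1/2 each):
    E[exp(lam * X)] = cosh(lam)."""
    return math.cosh(lam)

def subgaussian_mgf_bound(lam, sigma=1.0):
    """The sigma-subgaussian MGF bound exp(lam^2 * sigma^2 / 2)."""
    return math.exp(lam * lam * sigma * sigma / 2.0)

# The Rademacher distribution has bounded support, so it is 1-subgaussian:
# its MGF never exceeds the Gaussian bound, for any lambda.
```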
In the process of interaction, once the agent performs an action following a particular policy, the environment samples a reward from the corresponding distribution: $X_t \sim P_{A_t}$. The combination of an environment and an agent policy induces a probability measure on the sequence of outcomes $(A_1, X_1, \ldots, A_T, X_T)$. A standard stochastic MAB protocol is as follows. At every time step $t$, the learning agent

1. picks an action $A_t$ following policy $\pi_t$,
2. receives reward $X_t$,
3. updates its policy to $\pi_{t+1}$.
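In code, one round of this protocol looks as follows (a minimal Python simulation; all function and class names are illustrative, and the environment is taken to be Gaussian for concreteness):

```python
import random

def run_bandit(policy, arm_means, horizon, seed=0):
    """Simulate the stochastic MAB protocol: at each step t the agent picks
    an action A_t with its current policy, receives reward X_t from the
    chosen arm, and updates its policy with the new (action, reward) pair."""
    rng = random.Random(seed)
    history = []
    for t in range(horizon):
        a = policy.select(t)                     # 1. pick A_t following pi_t
        x = arm_means[a] + rng.gauss(0.0, 1.0)   # 2. receive reward X_t
        policy.update(a, x)                      # 3. update to pi_{t+1}
        history.append((a, x))
    return history

class UniformRandomPolicy:
    """Baseline policy: ignores the history and picks arms uniformly."""
    def __init__(self, n_arms, seed=1):
        self.n_arms = n_arms
        self.rng = random.Random(seed)
    def select(self, t):
        return self.rng.randrange(self.n_arms)
    def update(self, action, reward):
        pass  # a real bandit algorithm would update its estimates here
```

The algorithms below (ETC, UCB, SE, TS) differ only in how `select` and `update` are implemented.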
We note that $P_a$ is the conditional distribution of the reward $X_t$ given $A_t = a$, and $\pi_t$ is a function from histories to actions. The expected reward of action $a$ is defined by $\mu_a = \mathbb{E}_{X \sim P_a}[X]$, where $X$ is used as the reward variable. Then the maximum expected reward and the optimal action are given by

$\mu^{*} = \max_{a \in \mathcal{A}} \mu_a, \qquad a^{*} = \operatorname*{argmax}_{a \in \mathcal{A}} \mu_a.$

According to the above definition, more than one optimal action can exist, and the optimal policy is to select an optimal action at every round. Actions whose expected rewards are less than that of an optimal action are called suboptimal actions, and we define the reward gap of action $a$ by $\Delta_a = \mu^{*} - \mu_a$.
As mentioned earlier, the learner's goal is to maximize the cumulative reward $\sum_{t=1}^{T} X_t$. We now define a performance metric called regret, which is the difference between the expected cumulative reward that the optimal policy can obtain and the expected cumulative reward obtained by the learner. Minimizing the regret is equivalent to maximizing the reward. The reason we do not directly optimize the cumulative reward is that it depends on the environment: it is hard to tell whether a policy is good merely by looking at its cumulative reward unless it is compared to that of a good policy. So we define the problem-dependent regret of a policy $\pi$ on a bandit instance $\nu$ by

$R_T(\pi, \nu) = T\mu^{*} - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right],$

where the expectation is taken over the actions and rewards up to time $T$. The worst-case regret of a policy $\pi$ over an environment class $\mathcal{E}$ is defined by

$R_T(\pi) = \sup_{\nu \in \mathcal{E}} R_T(\pi, \nu).$

We will drop $\pi$ and $\nu$ from the regret notation when they are clear from the context.
Remark about Pure Exploration.
Throughout this chapter, we focus on minimizing regret by balancing exploration and exploitation. We also want to point out that, in a different setting, the exploration cost may not be a concern and the agent just wants to output a final recommendation for the best arm after an exploration phase. Problems of this type are called pure exploration problems. In such cases, algorithms are usually evaluated by sample complexity or simple regret (Bubeck et al., 2009). Pure exploration is also related to randomized controlled trials (RCTs), including modern variants that involve sequential randomization, such as sequential multiple assignment randomized trials (SMARTs) (Lei et al., 2012) and micro-randomized trials (MRTs) (Klasnja et al., 2015). Randomized trials are typically designed to enable estimation of treatment effects with sufficient statistical power. Since the concerns of pure exploration and randomized controlled trials are different from those of bandit algorithms, we do not discuss them further in this chapter. However, note that in an actual application, methodology from bandits and randomized trials may need to be integrated. Researchers may start off with a randomized trial and follow it up with a bandit algorithm in the next iteration of their health app. They can also decide to run a randomized trial for one health outcome while simultaneously running a bandit for a different outcome (e.g., an outcome related to user engagement with the health app) in the same study. There is also ongoing work (Yao et al., 2020; Zhang et al., 2021) on enabling the kind of statistical analysis done after randomized trials on data collected via online bandit algorithms.

Explore-then-Commit (ETC).
We start with a simple two-stage algorithm: Explore-then-Commit (ETC). In the first stage of ETC, the learner plays every arm a fixed number of times $m$ and obtains estimates of the expected rewards. In the second stage, the learner commits to the best arm according to the estimates from the first stage. For every arm $a$, let $\hat{\mu}_a(t)$ denote the estimated expected reward up to time $t$:

$\hat{\mu}_a(t) = \frac{1}{T_a(t)} \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}\, X_s,$

where $T_a(t)$ is the number of times action $a$ has been performed up to round $t$.
With the above definitions, we are ready to present ETC in Algorithm 2. The overall performance of ETC crucially depends on the parameter $m$. If $m$ is too small, the algorithm cannot estimate the performance of every arm accurately, so it is likely to exploit a suboptimal arm in the second stage, which leads to high regret. If $m$ is too big, the first stage (the explore step) plays suboptimal arms too many times, so the regret can again be large. The art is to choose a value of $m$ that minimizes the total regret incurred in both stages. Specifically, ETC achieves $O(T^{2/3})$ worst-case regret (ignoring parameters other than $T$) by choosing $m \propto T^{2/3}$ (Lattimore and Szepesvári, 2020). Sublinear regret is a good starting point. Next, we will introduce two more classic algorithms that incur even less regret.
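Before moving on, the two stages of ETC can be sketched in a few lines (a toy implementation assuming unit-variance Gaussian rewards; the function and variable names are our own, not from Algorithm 2):

```python
import random

def explore_then_commit(arm_means, m, horizon, seed=0):
    """ETC sketch: stage 1 pulls each of the K arms m times; stage 2
    commits to the empirically best arm for the remaining rounds."""
    rng = random.Random(seed)
    K = len(arm_means)
    pull = lambda a: arm_means[a] + rng.gauss(0.0, 1.0)  # Gaussian reward
    totals = [0.0] * K
    reward = 0.0
    for a in range(K):                 # stage 1: uniform exploration
        for _ in range(m):
            x = pull(a)
            totals[a] += x
            reward += x
    best = max(range(K), key=lambda a: totals[a] / m)  # empirical best arm
    for _ in range(horizon - K * m):   # stage 2: commit
        reward += pull(best)
    return best, reward
```

With a large gap between the arms and a moderate $m$, the committed arm is the true best arm with high probability.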
Upper Confidence Bound (UCB).
There are several types of exploration strategies for selecting actions, such as greedy, Boltzmann, optimism, and pessimism. Suppose the agent has reward estimates $\hat{\mu}_a$ for all actions. A greedy exploration strategy simply selects the action with the highest $\hat{\mu}_a$. A Boltzmann exploration strategy picks each action with probability proportional to $\exp(\eta \hat{\mu}_a)$, where $\eta > 0$ is a tuning parameter; Boltzmann becomes greedy as $\eta$ goes to infinity. An optimism strategy picks the action with the highest reward estimate plus some bonus term, i.e., $\operatorname{argmax}_a \{\hat{\mu}_a + \mathrm{bonus}_a\}$. In contrast, a pessimism strategy would pick the action $\operatorname{argmax}_a \{\hat{\mu}_a - \mathrm{bonus}_a\}$.
Out of these strategies, the UCB algorithm follows the optimism strategy, in particular a famous principle called optimism in the face of uncertainty (OFU): one should act as if the environment is the best possible one among those that are plausible given current experience. The reason OFU works is that misplaced optimism gets corrected when under-explored actions are tried and low rewards are observed. In contrast, pessimism does not work (at least in the online setting; for the offline setting things can be different (Jin et al., 2021)), since wrong beliefs about the low performance of under-explored actions never get a chance to be revised by collecting more data from those actions.
At every step $t$, the UCB algorithm updates a value called the upper confidence bound, defined for each action $a$ and confidence parameter $\delta$ as follows:

$\mathrm{UCB}_a(t-1, \delta) = \begin{cases} \infty & \text{if } T_a(t-1) = 0, \\ \hat{\mu}_a(t-1) + \sqrt{\dfrac{2\log(1/\delta)}{T_a(t-1)}} & \text{otherwise.} \end{cases} \qquad (1)$
The learner chooses the action with the highest UCB value at each step. Overall, UCB (Algorithm 3) guarantees a worst-case regret of $\widetilde{O}(\sqrt{KT})$, where the informal $\widetilde{O}$ notation hides constants and logarithmic factors.
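The index computation in (1) can be sketched as follows (a toy implementation with Gaussian rewards and the $\delta = 1/T^{2}$ choice used later in the analysis; all names are illustrative):

```python
import math, random

def ucb(arm_means, horizon, seed=0):
    """UCB sketch: play each arm once, then repeatedly play the arm with
    the largest index mu_hat_a + sqrt(2 log(1/delta) / T_a), where
    delta = 1/T^2."""
    rng = random.Random(seed)
    K = len(arm_means)
    bonus = 2.0 * math.log(float(horizon) ** 2)  # 2 log(1/delta)
    counts, sums = [0] * K, [0.0] * K
    for t in range(horizon):
        if t < K:
            a = t  # initialization: UCB is infinite for unplayed arms
        else:
            a = max(range(K),
                    key=lambda i: sums[i] / counts[i] + math.sqrt(bonus / counts[i]))
        x = arm_means[a] + rng.gauss(0.0, 1.0)  # 1-subgaussian reward
        counts[a] += 1
        sums[a] += x
    return counts
```

Run on a two-armed instance with a large gap, the suboptimal arm is played only logarithmically often.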
According to the construction of the upper confidence bounds, an action will be selected under two circumstances: it is under-explored ($T_a(t-1)$ is small) or it is well-explored with good performance ($\hat{\mu}_a(t-1)$ is large). The upper confidence bound of an action gets close to its true mean after the action has been selected enough times. A suboptimal action will only be played if its upper confidence bound is larger than that of the optimal arm. However, this is unlikely to happen too often: the upper confidence bound of a suboptimal action eventually falls below that of the optimal action as the suboptimal action is played more times. We present the regret guarantee for UCB (Algorithm 3) in Theorem 1.
Theorem 1 (Regret for UCB Algorithm).
If $\delta = 1/T^{2}$, then the problem-dependent regret of UCB, as defined in Algorithm 3, on any 1-subgaussian bandit is bounded by

$R_T \le 3\sum_{a \in \mathcal{A}} \Delta_a + \sum_{a : \Delta_a > 0} \frac{16 \log T}{\Delta_a}, \qquad (2)$

where $\Delta_a$ represents the corresponding gap term. The worst-case regret bound of UCB is:

$R_T \le 8\sqrt{TK\log T} + 3\sum_{a \in \mathcal{A}} \Delta_a. \qquad (3)$
Proof.
We only present the worst-case regret proof for simplicity. For the problem-dependent regret proof, we refer the reader to Chapter 7 of Lattimore and Szepesvári (2020).

We define a good event $E$ as follows:

$E = \left\{ |\hat{\mu}_a(t) - \mu_a| \le \sqrt{\frac{2\log(1/\delta)}{T_a(t)}} \ \text{for all } a \in \mathcal{A} \text{ and } t \le T \right\}.$

By Hoeffding's inequality and a union bound, one can show that $E$ holds with high probability. Next, we decompose the regret.

The inequality in the above expression is due to the action selection criterion of the UCB algorithm. Conditioned on the event $E$, the index of the chosen action upper bounds $\mu^{*}$ for all $t$; otherwise, the regret contribution can be bounded using the small probability of $E^{c}$. Combining these arguments, we have

The last inequality above is by the Cauchy–Schwarz inequality. ∎
The UCB family has many variants, one of which replaces the upper confidence bound for every action by a slightly tighter bound. Even though the dominant regret terms for this version have the same order as those of Algorithm 3, the leading constants of the two dominant terms become smaller.
One may then ask: is it possible to further improve the regret bounds of UCB and the above variant? The answer is yes. Audibert et al. (2009) proposed an algorithm called MOSS (Minimax Optimal Strategy in the Stochastic case). MOSS replaces the upper confidence bounds in Algorithm 3 by

$\hat{\mu}_a(t-1) + \sqrt{\frac{4}{T_a(t-1)} \log^{+}\!\left(\frac{T}{K\, T_a(t-1)}\right)}, \quad \text{where } \log^{+}(x) = \log \max\{1, x\}.$

Under this construction, the worst-case regret of MOSS is guaranteed to be only $O(\sqrt{KT})$.
However, MOSS is not always good: one can construct regimes where the problem-dependent regret of MOSS is worse than that of UCB (Lattimore, 2015). On the other hand, the improved UCB algorithm proposed by Auer and Ortner (2010) satisfies a problem-dependent regret similar to (2), but its worst-case regret is $O(\sqrt{KT\log K})$. Later, by carefully constructing the upper confidence bounds, the Optimally Confident UCB algorithm (Lattimore, 2015) and the AdaUCB algorithm (Lattimore, 2018) were shown to achieve $O(\sqrt{KT})$ worst-case regret, and their problem-dependent regret bounds are also no worse than that of the UCB algorithm. There are many more UCB variants in the literature that we do not cover in this chapter; the reader may refer to Table 2 in Lattimore (2018) for a comprehensive summary.
Successive Elimination (SE).
We now describe the SE algorithm, which also relies on upper confidence bound calculations. The idea is similar to UCB in that a suboptimal arm is very unlikely to have a large upper confidence bound once it has been selected enough times. At every round, SE maintains a confidence interval for the mean reward of every arm and removes all arms whose reward upper bound is smaller than the lower bound of the arm with the largest estimated reward. The procedure ends when only one arm remains. We describe the SE algorithm in Algorithm 4, define the UCB terms as in (1), and define the LCB terms as

$\mathrm{LCB}_a(t-1, \delta) = \hat{\mu}_a(t-1) - \sqrt{\frac{2\log(1/\delta)}{T_a(t-1)}}. \qquad (4)$
SE was first proposed by Even-Dar et al. (2006) along with a similar action-elimination-based algorithm: Median Elimination (ME). They studied the probably approximately correct (PAC) setting (Haussler and Warmuth, 2018). In particular, Even-Dar et al. (2006) show that for $K$ given arms, it suffices to pull the arms $O\!\left(\frac{K}{\varepsilon^{2}} \log \frac{1}{\delta}\right)$ times in total to find an $\varepsilon$-optimal arm with probability at least $1 - \delta$. It is not hard to prove that SE also satisfies the following regret bound.
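Before stating the bound, here is a compact sketch of SE (illustrative Python with Gaussian rewards, not the exact pseudocode of Algorithm 4):

```python
import math, random

def successive_elimination(arm_means, horizon, seed=0):
    """SE sketch: keep a set of active arms; play every active arm once per
    round, then eliminate arms whose UCB falls below the largest LCB.
    Confidence widths use delta = 1/T^2, as in the analysis."""
    rng = random.Random(seed)
    K = len(arm_means)
    delta = 1.0 / horizon ** 2
    counts, sums = [0] * K, [0.0] * K
    active = list(range(K))
    t = 0
    while len(active) > 1 and t < horizon:
        for a in list(active):
            if t >= horizon:
                break
            x = arm_means[a] + rng.gauss(0.0, 1.0)  # 1-subgaussian reward
            counts[a] += 1
            sums[a] += x
            t += 1
        width = lambda a: math.sqrt(2.0 * math.log(1.0 / delta) / counts[a])
        best_lcb = max(sums[a] / counts[a] - width(a) for a in active)
        active = [a for a in active
                  if sums[a] / counts[a] + width(a) >= best_lcb]
    return active
```

On an instance with a clear gap, all suboptimal arms are eliminated well before the horizon.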
Theorem 2 (Regret for SE Algorithm).
If $\delta = 1/T^{2}$, the worst-case regret of SE over 1-subgaussian bandit environments is bounded by

$R_T = O\!\left(\sqrt{KT\log T}\right). \qquad (5)$
Proof.
Without loss of generality, we assume the optimal arm $a^{*}$ is unique. Define the event $E$ by $E = \{|\hat{\mu}_a(t) - \mu_a| \le w_a(t) \text{ for all } a \text{ and } t\}$, where $w_a(t)$ denotes the confidence interval width. By Hoeffding's inequality and a union bound, one can show that $E$ holds with high probability.

Define $\tau_a$ as the last round in which arm $a$ has not yet been eliminated. According to the elimination criterion in SE, the reward gap term can be bounded as:

The last equality holds because the UCB and LCB of an arm differ by at most twice the confidence width by construction. Since $\tau_a$ is the last round in which arm $a$ is played, its count no longer changes afterwards, which implies the following property for all suboptimal arms $a$:

We thus obtain that, under the event $E$,

where the last inequality is by the Cauchy–Schwarz inequality. Taking $\delta = 1/T^{2}$, the result follows by standard conditional expectation calculations. ∎
Thompson Sampling (TS).
All of the methods we have mentioned so far select their actions based on a frequentist view. TS uses one of the oldest heuristics (Thompson, 1933) for choosing actions and addresses the exploration-exploitation dilemma based on a Bayesian philosophy of learning. The idea is simple. Before the game starts, the agent chooses a prior distribution over a set of possible bandit environments. At every round, the agent samples an environment from the posterior and acts according to the optimal action in that environment. The exploration in TS comes from the randomization over bandit environments. At the beginning, the posterior is usually poorly concentrated, so the policy will likely explore. As more data is collected, the posterior tends to concentrate towards the true environment and the rate of exploration decreases. We present the TS algorithm in Algorithm 5.

To formally describe how TS works, we start with several definitions related to Bayesian bandits.
Definition 2 ($K$-armed Bayesian bandit environment).
A $K$-armed Bayesian bandit environment is a tuple $(\mathcal{E}, \mathcal{B}, Q)$, where $(\mathcal{E}, \mathcal{B})$ is a measurable space of bandit environments and $Q$ is a probability measure on it called the prior. For a bandit $\nu \in \mathcal{E}$, $P_{\nu, a}$ denotes the reward distribution of arm $a \in \mathcal{A}$.

Given a $K$-armed Bayesian bandit environment and a policy $\pi$, the Bayesian regret is defined as:

$\mathrm{BR}_T(\pi, Q) = \int_{\mathcal{E}} R_T(\pi, \nu) \, dQ(\nu).$
TS has been analyzed in both the frequentist and the Bayesian settings; we will start with the Bayesian results.
Theorem 3 (Bayesian Regret for TS Algorithm).
Consider a $K$-armed Bayesian bandit environment such that $P_{\nu, a}$ is 1-subgaussian for all $\nu$ and $a$, with means in $[0, 1]$. Then the policy $\pi$ of TS satisfies

$\mathrm{BR}_T(\pi, Q) = O\!\left(\sqrt{KT\log T}\right). \qquad (6)$
Proof.
The proof is quite similar to that of UCB. Let $A^{*}$ denote the optimal arm; note that $A^{*}$ is a random variable depending on the environment $\nu$. For every arm $a$ and time $t$, we define a clipped upper confidence bound term

where $\hat{\mu}_a(t)$ and $T_a(t)$ are defined in the same way as in UCB. We define the event $E$ such that for all $a$ and $t$,
By Hoeffding's inequality and a union bound, one can show that $E$ holds with high probability. This result will be used in later steps.
Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by the interaction sequence up to time $t$. The key insight for the whole proof is the following property, which follows from the definition of TS: conditioned on the history, the selected arm and the optimal arm have the same distribution,

$\mathbb{P}(A_t = \cdot \mid \mathcal{F}_{t-1}) = \mathbb{P}(A^{*} = \cdot \mid \mathcal{F}_{t-1}). \qquad (7)$
Using the above property and the tower rule for conditional expectations, we have

and thus

Conditioned on the high-probability event $E$, the first sum is negative and the second sum is of order $\sqrt{KT\log T}$; on the complement of $E$, the regret contribution is small. Taking $\delta = 1/T^{2}$, one can verify the claimed bound.
∎
Compared to the analysis of the Bayesian regret, frequentist (worst-case) regret analysis for TS is a lot more technical. The key reason is that the worst-case regret does not involve an expectation with respect to the prior, and therefore the property in (7) cannot be used. Even though TS was well known to be easy to implement and competitive with state-of-the-art methods, it lacked a worst-case regret analysis for a long time. Significant progress was made by Agrawal and Goyal (2012) and Kaufmann et al. (2012). In Agrawal and Goyal (2012), the first logarithmic bound on the frequentist regret of TS was proven. Kaufmann et al. (2012) provided a bound that matches the asymptotic lower bound of Lai and Robbins (1985). However, both of these bounds were problem-dependent. The first near-optimal worst-case regret bound, $O(\sqrt{KT\log T})$, was proved by Agrawal and Goyal (2013) for Bernoulli bandits with a Beta prior, where the reward is either zero or one. For TS with a Gaussian prior, the same work proved an $O(\sqrt{KT\log K})$ worst-case regret. Jin et al. (2020) proposed a variant of TS called MOTS (Minimax Optimal TS) that achieves the minimax optimal $O(\sqrt{KT})$ regret.
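For Bernoulli rewards with independent Beta priors, the TS loop can be sketched as follows (illustrative Python; the function name and defaults are our own):

```python
import random

def thompson_sampling_bernoulli(arm_probs, horizon, seed=0):
    """TS sketch for Bernoulli bandits with independent Beta(1, 1) priors:
    sample a mean for each arm from its posterior, play the argmax, and
    update the played arm's Beta posterior with the observed 0/1 reward."""
    rng = random.Random(seed)
    K = len(arm_probs)
    alpha = [1.0] * K  # posterior successes + 1
    beta = [1.0] * K   # posterior failures + 1
    counts = [0] * K
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = max(range(K), key=lambda i: samples[i])  # optimal arm in sampled model
        x = 1 if rng.random() < arm_probs[a] else 0  # Bernoulli reward
        alpha[a] += x
        beta[a] += 1 - x
        counts[a] += 1
    return counts
```

As the posteriors concentrate, the sampled means of suboptimal arms rarely exceed that of the best arm, so exploration tapers off automatically.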
2.1.2 Adversarial Multi-armed Bandit
In stochastic bandit models, the rewards are assumed to be i.i.d. given the actions. This assumption can easily be violated in practice. For example, the health feedback of a patient after a certain treatment may vary across time, and the way it varies is usually unknown. In such scenarios, a best action that maximizes the total reward still exists, but algorithms designed for stochastic bandit environments are no longer guaranteed to work. As a more robust counterpart to the stochastic model, we study the adversarial bandit model in this section, in which the assumption that a single action is good in hindsight is retained but the rewards are allowed to be chosen adversarially.
The adversarial bandit environment is often called the adversary. In an adversarial bandit problem, the adversary secretly chooses reward vectors $x_1, \ldots, x_T \in [0, 1]^{K}$, where $x_t$ corresponds to the rewards over all actions at time $t$. In every round, the agent chooses a distribution $P_t$ over the actions. An action $A_t$ is sampled from $P_t$, and the agent receives the reward $x_{t, A_t}$. A policy $\pi$ in this setting maps history sequences to distributions over actions. We evaluate the performance of a policy by the expected regret, which is the cumulative reward difference between the best fixed action and the agent's selections:

$R_T(\pi, x) = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} x_{t,a} - \mathbb{E}\left[\sum_{t=1}^{T} x_{t, A_t}\right]. \qquad (8)$
The worst-case regret of a policy $\pi$ is defined by

$R_T(\pi) = \sup_{x \in [0,1]^{T \times K}} R_T(\pi, x). \qquad (9)$
It may not be immediately clear why we define the regret by comparing to the best fixed action instead of the best action at every round. In the latter case, the regret would be $\sum_{t=1}^{T} \max_{a} x_{t,a} - \mathbb{E}[\sum_{t=1}^{T} x_{t, A_t}]$. However, this definition gives the adversary too much power: for any policy, one can show this quantity can be $\Omega(T)$ for certain reward vectors $x$.
Remark on Randomized Policies.
In stochastic bandit models, the optimal action is deterministic, and the optimal policy is simply to select the optimal action at every round. However, in the adversarial bandit setting, the adversary has great power in designing the rewards. It may know the agent's policy and design the rewards accordingly, so that a deterministic policy incurs linear regret. For example, consider two actions whose rewards are either $0$ or $1$ at any time. Under a deterministic policy, the agent's choice at time $t$ is a fixed function of the past; the adversary therefore knows it and can set the reward of that action at time $t$ to $0$ and the reward of the unselected action to $1$. The cumulative regret will be at least $T/2$ after $T$ rounds. However, one can improve the performance with a randomized policy, e.g., choosing either action with probability $1/2$: then the adversary cannot make the agent incur regret at every round by manipulating the reward values of both actions.
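This argument is easy to verify numerically (a toy simulation; names are illustrative): the adversary zeroes out the reward of the deterministic policy's predictable next action, so that policy earns nothing while the best fixed action earns half the rounds, whereas a uniformly random policy earns about half the total reward in expectation.

```python
import random

def adversarial_two_actions(horizon, seed=0):
    """Toy adversarial game with two actions and 0/1 rewards.  Against a
    deterministic policy, the adversary gives the policy's known next
    action reward 0 and the other action reward 1; the deterministic
    policy earns 0 while the best fixed action earns >= T/2."""
    rng = random.Random(seed)
    det_reward = 0.0
    rand_reward = 0.0
    fixed = [0.0, 0.0]  # cumulative rewards of the two fixed actions
    for t in range(horizon):
        det_action = t % 2           # any deterministic rule is predictable
        rewards = [1.0, 1.0]
        rewards[det_action] = 0.0    # adversary punishes the known choice
        det_reward += rewards[det_action]
        fixed[0] += rewards[0]
        fixed[1] += rewards[1]
        rand_reward += rewards[rng.randrange(2)]  # coin flip is unpredictable
    det_regret = max(fixed) - det_reward
    return det_regret, rand_reward
```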
Exponentialweight algorithm for Exploration and Exploitation (EXP3).
We now study one of the most famous adversarial bandit algorithms, called EXP3. Before describing the algorithm, we define some related terms. Under a randomized policy, the conditional probability that action $a$ is played at round $t$ is denoted by

$P_{t,a} = \mathbb{P}(A_t = a \mid H_{t-1}).$

Assuming $P_{t,a} > 0$ almost surely for all policies, a natural way to define the importance-weighted estimator of $x_{t,a}$ is

$\hat{X}_{t,a} = \frac{\mathbb{1}\{A_t = a\}\, x_{t,a}}{P_{t,a}}. \qquad (10)$
Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid H_{t-1}]$. A simple calculation shows that $\hat{X}_{t,a}$ is conditionally unbiased, i.e., $\mathbb{E}_t[\hat{X}_{t,a}] = x_{t,a}$. However, the variance of the estimator $\hat{X}_{t,a}$ can be extremely large when $P_{t,a}$ is small and $x_{t,a}$ is nonzero: the conditional variance is $\mathbb{V}_t[\hat{X}_{t,a}] = x_{t,a}^{2}\, \frac{1 - P_{t,a}}{P_{t,a}}$. An alternative estimator is:

$\hat{X}_{t,a} = 1 - \frac{\mathbb{1}\{A_t = a\}\,(1 - x_{t,a})}{P_{t,a}}. \qquad (11)$
This estimator is still unbiased, and its conditional variance is $y_{t,a}^{2}\, \frac{1 - P_{t,a}}{P_{t,a}}$, where we define the loss $y_{t,a} = 1 - x_{t,a}$.
The best choice of estimator depends on the actual rewards: one should use (10) for small rewards and (11) for large rewards. So far, we have seen how to construct reward estimators for a given sampling distribution $P_t$. The EXP3 algorithm provides a way to design the $P_t$ terms. Let $\hat{S}_{t,a} = \sum_{s=1}^{t} \hat{X}_{s,a}$ be the total estimated reward by the end of round $t$, where $\hat{X}_{s,a}$ is defined in (11). We present EXP3 in Algorithm 6.
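A sketch of the EXP3 update using the estimator in (11) (illustrative Python; the max-subtraction in the weights is only for numerical stability and does not change the sampling distribution):

```python
import math, random

def exp3(reward_matrix, eta, seed=0):
    """EXP3 sketch: maintain exponential weights over the cumulative reward
    estimates S_hat (built from the loss-based estimator (11)), and sample
    each action from the induced distribution P_t.
    reward_matrix[t][a] holds the adversary's reward for action a at time t."""
    rng = random.Random(seed)
    T, K = len(reward_matrix), len(reward_matrix[0])
    S = [0.0] * K          # cumulative reward estimates S_hat_{t,a}
    total_reward = 0.0
    for t in range(T):
        mx = max(S)
        weights = [math.exp(eta * (s - mx)) for s in S]  # stable softmax weights
        norm = sum(weights)
        P = [w / norm for w in weights]
        a = rng.choices(range(K), weights=P)[0]
        x = reward_matrix[t][a]
        total_reward += x
        # Estimator (11): S_hat_i grows by 1 for unplayed arms and by
        # 1 - (1 - x)/P_a for the played arm.
        for i in range(K):
            S[i] += 1.0 if i != a else 1.0 - (1.0 - x) / P[a]
    return total_reward
```

Against a sequence in which one action is always better, the weights concentrate on that action and the collected reward approaches the best fixed action's total.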
Surprisingly, even though adversarial bandit problems look more difficult than stochastic bandit problems due to the strong power of the adversary, one can show that the adversarial regret of the EXP3 algorithm has nearly the same order as before, i.e., $O(\sqrt{KT\log K})$.
Theorem 4 (Regret for EXP3 Algorithm).
Let $x \in [0,1]^{T \times K}$. With learning rate $\eta = \sqrt{\log K / (TK)}$, we have

$R_T \le 2\sqrt{TK\log K}. \qquad (12)$
Proof.
The proof for EXP3 is different from those for the stochastic bandit algorithms. We first define the expected regret relative to using action $a$ in all rounds:

$R_{T,a} = \sum_{t=1}^{T} x_{t,a} - \mathbb{E}\left[\sum_{t=1}^{T} x_{t, A_t}\right].$

According to the definition of the adversarial bandit regret, the final result follows if we can bound $R_{T,a}$ for every $a$. It is not hard to show that $\mathbb{E}[\hat{S}_{T,a}] = \sum_{t=1}^{T} x_{t,a}$ and $\mathbb{E}[\sum_{t=1}^{T} \hat{X}_{t,A_t}] = \mathbb{E}[\sum_{t=1}^{T} x_{t,A_t}]$ hold using the definition of $\hat{X}_{t,a}$. Then we can rewrite $R_{T,a}$ as $\mathbb{E}[\hat{S}_{T,a} - \hat{S}_T]$, where $\hat{S}_T = \sum_{t=1}^{T} \hat{X}_{t,A_t}$. Let $W_t = \sum_{a'} \exp(\eta \hat{S}_{t,a'})$ with $W_0 = K$; then one can show that
We next bound the ratio term $W_t / W_{t-1}$ using the inequalities $e^{x} \le 1 + x + x^{2}$ for $x \le 1$ and $1 + x \le e^{x}$ for all $x$.
Combining with previous results, we have
Taking logarithms on both sides and rearranging gives us
(13) 
To bound $R_{T,a}$, we only need to bound the expectation of the second term above. By standard (conditional) expectation calculations, one can get
Substituting the above inequality into (13), we get
(14) 
where we choose $\eta = \sqrt{\log K / (TK)}$. By definition, the overall regret satisfies the same upper bound. ∎
We have just proved a bound on the expected regret of EXP3. However, if we consider the distribution of the random regret, EXP3 is not good enough. Define the random regret as $\hat{R}_T = \max_{a} \sum_{t=1}^{T} x_{t,a} - \sum_{t=1}^{T} x_{t, A_t}$. One can show that for all large enough $T$ and reasonable choices of $\eta$, there exists a bandit such that the random regret of EXP3 is linear in $T$ with non-trivial probability. That means EXP3 can sometimes incur linear regret, which makes it unsuitable for practical problems. This phenomenon is caused by the high variance of the regret distribution. In the next section, we discuss how to resolve this problem by slightly modifying EXP3.
EXP3-IX (EXP3 with Implicit eXploration).
We have seen that small $P_{t,a}$ terms can cause enormous variance in the reward estimator, which then leads to high variance in the regret distribution. Thus, EXP3-IX (Neu, 2015) redefines the loss estimator as

$\hat{y}_{t,a} = \frac{\mathbb{1}\{A_t = a\}\, y_{t,a}}{P_{t,a} + \gamma}, \qquad (15)$

where $y_{t,a} = 1 - x_{t,a}$ denotes the loss at round $t$ and $\gamma > 0$. $\hat{y}_{t,a}$ is a biased estimator of $y_{t,a}$ due to the extra $\gamma$ in the denominator, but its variance is reduced. An optimal choice of $\gamma$ needs to balance the bias and the variance. Other than this slight change to the loss estimator, the remaining procedure is the same as EXP3. The name 'IX' (Implicit eXploration) can be justified by the following argument:
The effect of adding $\gamma$ to the denominator is that EXP3-IX shrinks the large loss estimates of some actions, so that such actions can still be chosen occasionally. As a result, EXP3-IX explores more than EXP3. Neu (2015) proved the following high-probability regret bound for EXP3-IX.
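The IX estimator in (15) and its downward bias are easy to state in code (illustrative helper functions, not taken from Neu (2015)):

```python
def ix_loss_estimate(played_action, action, loss, prob, gamma):
    """EXP3-IX loss estimator (15): importance weighting with an extra
    gamma in the denominator, which caps the estimate at loss / gamma."""
    indicator = 1.0 if played_action == action else 0.0
    return indicator * loss / (prob + gamma)

def expected_ix_estimate(loss, prob, gamma):
    """Conditional expectation of the IX estimator: prob/(prob+gamma) * loss.
    With gamma = 0 it is unbiased; with gamma > 0 it is biased downward,
    so large losses of rarely played actions are systematically shrunk."""
    return prob / (prob + gamma) * loss
```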
Theorem 5 (Regret for EXP3-IX Algorithm).
With $\eta = 2\gamma = \sqrt{\frac{2\log K}{KT}}$, EXP3-IX guarantees that

$\hat{R}_T \le 2\sqrt{2KT\log K} + \left(\sqrt{\frac{2KT}{\log K}} + 1\right)\log\frac{2}{\delta} \qquad (16)$

with probability at least $1 - \delta$.
2.1.3 Lower Bound for MAB Problems
We have discussed two types of bandit models and their corresponding regret-minimizing algorithms. A natural question is: what is the minimal regret we can hope for? To answer this question, we will introduce two types of lower bound results: minimax lower bounds and instance-dependent lower bounds. Both are useful for describing the hardness of a class of bandit problems and are often used to evaluate the optimality of an existing algorithm. For example, if the worst-case regret of a policy matches the minimax lower bound up to a universal constant, we say that the policy is minimax-optimal.
Minimax Lower Bounds.
We consider a Gaussian bandit environment, in which the reward of every arm is Gaussian-distributed. We denote the class of Gaussian bandits with unit variance by $\mathcal{E}_K$ and use $\mu \in \mathbb{R}^{K}$ as the reward mean vector. In particular, $\nu_\mu$ is a Gaussian bandit for which the $i$-th arm has reward distribution $\mathcal{N}(\mu_i, 1)$. The following result provides a minimax lower bound for the Gaussian bandit class $\mathcal{E}_K$.
Theorem 6 (Minimax Lower Bound for Gaussian Bandit Class).
Let $K > 1$ and $T \ge K - 1$. For any policy $\pi$, there exists a mean vector $\mu \in [0, 1]^{K}$ such that

$R_T(\pi, \nu_\mu) \ge \frac{1}{27}\sqrt{(K-1)T}. \qquad (17)$
Proof.
To prove the lower bound, we start by constructing two bandit instances that are very similar to each other and hard to distinguish. Let $\mu = (\Delta, 0, \ldots, 0)$ denote the mean vector of the first unit-variance Gaussian bandit, for some $\Delta > 0$ to be chosen later. We use $\mathbb{P}_\mu$ and $\mathbb{E}_\mu$ to denote the probability measure and expectation induced by the environment $\nu_\mu$ and policy $\pi$ up to time $T$. To choose the second environment, let

$i = \operatorname*{argmin}_{a \ne 1} \mathbb{E}_\mu[T_a(T)].$

Define the reward mean vector of the second bandit as $\mu'$, where $\mu'_a = \mu_a$ for $a \ne i$ and $\mu'_i = 2\Delta$. Decomposing the regret leads to

Then, applying the Bretagnolle–Huber inequality, we get

It remains to upper bound the KL-divergence term above. By the divergence decomposition, one can show that

Here, $\mathcal{N}(\mu_i, 1)$ and $\mathcal{N}(2\Delta, 1)$ denote the reward distributions of the $i$-th arm in $\nu_\mu$ and $\nu_{\mu'}$, respectively. For the last inequality, since $\sum_{a} \mathbb{E}_\mu[T_a(T)] = T$, it holds that $\mathbb{E}_\mu[T_i(T)] \le T/(K-1)$. Combining with the previous results, we know that

Choosing $\Delta = \sqrt{(K-1)/(4T)}$, the result follows. ∎
An algorithm is called minimax-optimal if its worst-case regret matches the minimax lower bound.
Instance Dependent Lower Bounds.
An algorithm with nearly minimax-optimal regret is not always preferred, since it may fail to take advantage of environments that are not the worst case. In practice, what is more desirable is to have algorithms that are nearly minimax-optimal while performing better on "easier" instances (Lattimore and Szepesvári, 2020). This motivates the study of instance-dependent regret. In this section, we present two types of lower bounds for instance-dependent regret: one asymptotic, the other finite-time.
We first define consistent policies and present the asymptotic instance-dependent lower bound result.
Definition 3 (Consistent Policy).
A policy $\pi$ is consistent over an environment class $\mathcal{E}$ if for all bandits $\nu \in \mathcal{E}$ and all $p > 0$ it holds that

$R_T(\pi, \nu) = o(T^{p}).$
Theorem 7 (Asymptotic Instance Dependent Lower Bound for Gaussian Bandits (Lattimore and Szepesvári, 2020)).
For any policy $\pi$ consistent over the class of $K$-armed unit-variance Gaussian environments and any bandit $\nu$ in that class, it holds that

$\liminf_{T \to \infty} \frac{R_T(\pi, \nu)}{\log T} \ge \sum_{a : \Delta_a > 0} \frac{2}{\Delta_a}.$
A policy is called asymptotically optimal if the equality in the above theorem holds. Interestingly, building on an idea similar to that of Theorem 7, one can also develop a finite-time instance-dependent lower bound result.
Theorem 8 (Instance Dependent Lower Bound for Gaussian Bandits (Lattimore and Szepesvári, 2020)).
Let $\nu$ be a $K$-armed Gaussian bandit with mean vector $\mu$ and suboptimality gaps $\Delta = (\Delta_a)_{a \in \mathcal{A}}$. Define a bandit environment: