1 Introduction
People often don’t do as they are told. Approximately 50% of patients suffering from chronic illness do not take prescribed medications Sabaté (2003). It is safe to assume that the rate at which patients or doctors will follow the recommendations provided by an algorithm will fall well short of 100%. Unfortunately, despite its importance in medical applications Vrijens et al. (2012); Hugtenburg et al. (2013), compliance has not been analyzed in the bandit literature.
In this paper, we introduce compliance awareness into the bandit setting. Bandit problems are concerned with optimal repeated decision-making in the presence of uncertainty Robbins (1952); Lai & Robbins (1985); Bubeck (2012). The main challenge is to trade off exploration and exploitation, so as to collect enough samples to estimate the rewards of different strategies whilst also strongly biasing samples towards those actions most likely to yield high rewards.
Our running example is an algorithm that recommends treatments to patients. For concreteness, consider a mobile app that encourages patients who have recently suffered a stroke to carry out various low-intensity interventions that may be beneficial in preventing future strokes. These could be as simple as meditating, going for a walk or taking an aspirin. The effects of the interventions on the probability of a future stroke may be small. The social benefits of collectively choosing the most effective interventions, however, may be large.
However, there are other settings in which compliance information is potentially available. For example, an algorithm could recommend treatments to doctors. Whether or not the doctor then prescribes the recommended treatment to the patient is extremely informative, since the doctor may make observations and have access to background knowledge that is not available to the algorithm. A quite different setting is online advertising, where bandit algorithms are extensively applied to recommend which ad to display Graepel et al. (2010); McMahan et al. (2013). In practice, the recommendations provided by the bandit may not be followed. For example, sales teams often have handwritten rules that override the bandit in certain situations. Clearly, the bandit algorithm should be able to learn more efficiently if it is provided with information about which ads were actually shown.
In the classic multi-armed bandit setting, the player chooses one of K arms on each round and receives a reward Auer et al. (2002); Auer (2002). The player is not told what the reward would have been had it chosen a different arm. The goal is to minimize the cumulative regret over a series of rounds. In the more general compliance setting, the action chosen by the algorithm is not necessarily the action that is finally carried out, see section 2.2. Instead, a compliance process mediates between the algorithm’s recommendation and the action that is actually taken. Importantly, the compliance process may depend on latent characteristics of the subject of the decision. We focus on the case where the outcome of the compliance process is observable.
Unfortunately, compliance information is a two-edged sword. There are settings where it is useful; but it can also lead to linear regret. We develop bounded-regret algorithms that incorporate compliance information.
Outline.
Section 2 introduces the formal compliance setting and three protocols for incorporating compliance information into bandit algorithms. It turns out that each protocol has strengths and weaknesses. The simplest protocol ignores compliance information – which yields the classical setting where standard regret bounds hold. If, instead of attending to its own recommendations, the bandit attends to whether the patient actually takes the treatment, then it is possible, in some scenarios, to learn faster than without compliance information. On the other hand, there are no guarantees on convergence when an algorithm attends purely to the compliance of patients and ignores its own prior recommendations – examples of linear regret are provided in section 2.3.
A natural goal is thus to simultaneously incorporate compliance information whilst preserving the no-regret guarantees of the classical setting. Section 3 presents two hybrid algorithms that do both. The first, HierarchicalBandit, is a two-level bandit algorithm. The bottom level learns three experts that specialize on different kinds of compliance information. The top level is another bandit that learns which expert performs optimally. The algorithm thus has no regret against both the treatments and two natural reward protocols that incorporate compliance information. The second algorithm, ThompsonBounded, rapidly converges to Thompson sampling, with standard guarantees. However, when Thompson sampling is unsure about which arm to pull, the algorithm takes advantage of the uncertainty to introduce arm-pulls sampled from HierarchicalBandit. Empirically, ThompsonBounded achieves a surplus of 8.9 extra survivals (that is, human lives) relative to the randomized baseline. The HierarchicalBandit algorithm with Epsilon Greedy as the base algorithm achieves a surplus of 9.2. In contrast, the best performing strategy that is not compliance aware is Thompson sampling, which yields 7.9 extra survivals.
Comparison with other bandit settings.
It is useful to compare noncompliance with other bandit settings. Partial monitoring and its generalizations, such as feedback graphs, are concerned with situations where the player only partially observes its loss Alon et al. (2015). Our setting is an extension of the bandit setting, where additional compliance information is provided. Whether or not a patient complies is a form of side-information. However, in contrast to the side-information available to contextual bandits, it is only available after an arm is pulled. An interesting question, left for future work, is how contextual and compliance information can both be incorporated into bandit algorithms.
Hybrid algorithms were previously proposed in the best-of-both-worlds scenario Bubeck & Slivkins (2012); Seldin & Slivkins (2014), where the goal is to construct a bandit that plays optimally in both stochastic and adversarial environments. Vapnik introduced a related notion of side-information into the supervised setting with his learning under privileged information framework Vapnik & Vashist (2009). Perhaps the closest setting to ours is the confounded bandits of Bareinboim et al. (2015), see section 2.1.
2 Models of Noncompliance
This section introduces a formal setting for bandits with noncompliance, together with protocols that prescribe how to make use of compliance information. Before diving into the formalism let us discuss, informally, how compliance information can be useful.
First, suppose that the patient population is homogeneous in its response to the treatment, and that patients take the treatment with probability p if prescribed and probability q otherwise, where p > q. In this setting, it is clear that a bandit algorithm will learn faster by rewarding arms according to whether the treatment was taken by the patient, rather than whether it was recommended to the patient.
As a second example, consider corrective compliance where patients who benefit from a treatment are more likely to take it, since they have access to information that the bandit does not. The bandit clearly benefits by learning from the information expressed in the behavior of the patients. Learning from the treatment actually taken is therefore more efficient than learning from the bandit’s recommendations. Further examples are provided in section 2.2.
2.1 Unobserved confounders.
An important point of comparison is the bandits with unobserved confounders model introduced in Bareinboim et al. (2015). That paper was motivated using an extended example involving two subpopulations (drunk and sober) gambling in a casino. Since we are primarily interested in clinical applications, we map their example onto two subpopulations of patients, rich and poor. Suppose that rich patients always take the treatment (since they can afford it) and that they are also healthier in general. Poor patients only take the treatment when prescribed by a doctor.
Bareinboim et al. observe that the question “what is the patient’s expected reward when taking the treatment (formally: E[r | a = 1])?” is confounded by the latent variable wealth. Estimating the effect of the treatment – which may differ between poor and rich patients – requires more refined questions. In our notation: “what is the patient’s expected reward when taking the treatment, given she is wealthy (formally: E[r | a = 1, rich])?” and “what is the patient’s expected reward when taking the treatment, given she is poor (formally: E[r | a = 1, poor])?”, see example 2.
The solution proposed in Bareinboim et al. (2015) is based on the regret decision criterion (RDC), which estimates the optimal action by maximizing the expected reward under an intervention conditioned on the patient’s latent inclination, where the action imposed may differ from that inclination. Essentially, computing the RDC requires imposing interventions via the do-operator. However, overruling a patient or doctor’s decision is often impossible and/or unethical in clinical settings. The counterfactual information required to compute the RDC may therefore not be available in practice.
Compliance information does not act as a direct substitute for the do-operator. However, compliance information is often readily available and, as we show below, can be used to ameliorate the effect of confounders by giving a partial view into the latent structure of the population that the bandit is interacting with.
2.2 Formal setting
More formally, we consider a sequential decision making problem where a process mediates between the actions chosen by the algorithm and the action carried out in the world. The general game is as follows:
Definition 1 (bandit with compliance information).
At each timestep t, the player selects an action c_t ∈ {0, …, K − 1} (the chosen action). The environment responds by carrying out an action a_t ∈ {0, …, K − 1} (the actual action) and providing reward r_t ∈ [0, 1], or loss ℓ_t = 1 − r_t.
The standard bandit setting is the special case where a_t is either unobserved or a_t = c_t for all t.
Compliance and outcomes are often confounded. For example, healthy patients may be less inclined to take a treatment than unhealthy patients. The set of compliance-behaviors is the set of functions from advice to treatment-taken Koller & Friedman (2009).
Definition 2 (model assumptions).
We make the following assumptions:

Compliance-behavior depends on a latent variable s sampled i.i.d. from an unknown distribution P.

Outcomes depend on compliance-behavior, treatment-taken and the latent variable s. That is, outcomes are a fixed function of the compliance-behavior, the treatment actually taken and s.
When K = 2 (arms 0 and 1, corresponding to control and treatment), we can list the compliance-behaviors explicitly.
Definition 3 (compliance behaviors).
For K = 2, the following four subpopulations capture all deterministic compliance-behaviors:
never-takers (s = n): a(0) = a(1) = 0  (1)
always-takers (s = a): a(0) = a(1) = 1  (2)
compliers (s = c): a(0) = 0, a(1) = 1  (3)
defiers (s = d): a(0) = 1, a(1) = 0  (4)
Let p_s denote the probability of sampling from subpopulation s ∈ {n, a, c, d}.
Unfortunately, the subpopulations cannot be distinguished from observations. For example, a patient that takes a prescribed treatment may be a complier or an always-taker. Nevertheless, observing compliance-behavior provides potentially useful side-information. The setting differs from contextual bandits because the side-information is only available after the bandit chooses an arm.
Definition 4 (stochastic reward model).
The expected reward given subpopulation s and the actual treatment a is
r̄(s, a) := E[r | s, a]  (5)
for s ∈ {n, a, c, d} and a ∈ {0, 1}.
The goal of the player is to maximize the cumulative reward received, i.e. choose a sequence of actions that maximizes E[Σ_{t=1}^T r_t]. We quantify the performance of algorithms in terms of regret, which compares the cumulative reward against that of the best action in hindsight.
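To make the setting concrete, the following sketch simulates one round of the game under the model assumptions above. The function name, the behavior table and the dictionary-based reward model are our own illustrative choices, not part of the formal definitions.

```python
import random

# Deterministic compliance behaviors for K = 2 (Definition 3):
# each maps the chosen arm to the arm actually taken.
BEHAVIORS = {
    "never":  lambda chosen: 0,           # never-takers always take control
    "always": lambda chosen: 1,           # always-takers always take treatment
    "comply": lambda chosen: chosen,      # compliers follow the recommendation
    "defy":   lambda chosen: 1 - chosen,  # defiers do the opposite
}

def play_round(chosen, pop_weights, reward_mean, rng=random):
    """One round of the game (Definition 1): sample a subpopulation i.i.d.,
    resolve compliance, and draw a Bernoulli reward.

    pop_weights: dict subpopulation -> sampling probability.
    reward_mean: dict (subpopulation, actual_arm) -> Bernoulli mean (Eq. 5).
    Returns (actual_arm, reward); the player observes both.
    """
    subpop = rng.choices(list(pop_weights), weights=list(pop_weights.values()))[0]
    actual = BEHAVIORS[subpop](chosen)
    reward = 1 if rng.random() < reward_mean[(subpop, actual)] else 0
    return actual, reward
```

Running a bandit against `play_round` with different subpopulation weights reproduces the qualitative regimes discussed in section 2.3.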
2.3 Reward protocols
Since compliance information is only available after pulling an arm, it cannot be used directly when selecting arms. However, compliance information can be used to modify the updates performed by the algorithm. For example, if the bandit recommends taking a treatment, and the patient does not do so, we have a choice about whether to update the arm that the bandit recommended (treatment) or the arm that the patient pulled (control).
Definition 5 (reward protocols).
We consider three protocols for assigning rewards to arms:

Chosen: chosen-treatment updates.
Assign reward r_t to arm i if c_t = i.
Actual: actual-treatment updates.
Assign reward r_t to arm i if a_t = i.
Comply: compliance-based updates.
Assign reward r_t to arm i if c_t = i and a_t = i.
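The three protocols amount to different credit-assignment rules over the observed triple of chosen arm, actual arm and reward. A minimal sketch (the function name is ours):

```python
def protocol_updates(chosen, actual, reward):
    """Return {arm: reward} updates under each protocol (Definition 5).

    Chosen rewards the recommended arm, Actual rewards the arm the patient
    actually pulled, and Comply only rewards on rounds where recommendation
    and behavior agree.
    """
    return {
        "Chosen": {chosen: reward},
        "Actual": {actual: reward},
        "Comply": {chosen: reward} if chosen == actual else {},
    }
```

On rounds where the patient complies, all three protocols agree; they differ only on noncompliant rounds.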
Each protocol has strengths and weaknesses.
Protocol #1: Chosen.
Under Chosen, the bandit advises the patient on which treatment to take, and ignores whether or not the patient complies.
Proposition 1.
Standard regret bounds hold for any algorithm under Chosen.
Proof.
The regret bound for any bandit algorithm holds since the setting is the standard bandit setting. ∎
Protocol #2: Actual.
Expected rewards depend on the treatment actually taken by the patient, Eq. (5), and not directly on the arm pulled by the bandit. Thus, a natural alternative to Chosen is Actual, where the bandit assigns rewards to the treatment that the patient actually used – which may not in general coincide with the arm that the bandit pulled.
Proposition 2.
There are settings where Actual outperforms Chosen and Comply.
Proof.
Suppose that the expected reward r̄(s, a) = r̄(a) depends on the treatment taken but not on the subpopulation. Further suppose the population is a mix of always-takers, never-takers, and compliers – but no defiers. Always-takers and never-takers ignore the bandit, which therefore only interacts with the compliers.
The rewards used to update Chosen are, in expectation,
E[r | c = 0] = p_n r̄(0) + p_a r̄(1) + p_c r̄(0)  (6)
E[r | c = 1] = p_n r̄(0) + p_a r̄(1) + p_c r̄(1)  (7)
whereas the rewards used to update Actual are
E[r | a = i] = r̄(i)  (8)
It follows that
E[r | c = 1] − E[r | c = 0] = p_c (r̄(1) − r̄(0))  (9)
E[r | a = 1] − E[r | a = 0] = r̄(1) − r̄(0)  (10)
Thus, Actual assigns rewards to arms based on their effect on compliers (which are the only subpopulation interacting with the bandit), whereas the rewards assigned to arms by Chosen are diluted by patients who do not take the treatment. Finally, Actual outperforms Comply because it updates more frequently. ∎
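The dilution argument can be checked numerically. The population weights and reward means below are hypothetical, chosen only for illustration.

```python
# Hypothetical numbers: a population of never-takers, always-takers and
# compliers (no defiers); rewards depend only on the treatment taken.
p_never, p_always, p_comply = 0.4, 0.3, 0.3
r0, r1 = 0.5, 0.8  # mean reward of control / treatment

# Expected reward credited to each arm under Chosen: non-compliers
# contribute the same mixture to both arms, diluting the gap.
chosen0 = p_never * r0 + p_always * r1 + p_comply * r0
chosen1 = p_never * r0 + p_always * r1 + p_comply * r1
chosen_gap = chosen1 - chosen0

# Under Actual, arm i is only credited with rewards from rounds where
# treatment i was actually taken, so the full gap r1 - r0 is visible.
actual_gap = r1 - r0

# The gap seen by Chosen shrinks by exactly the complier fraction.
assert abs(chosen_gap - p_comply * actual_gap) < 1e-9
```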
However, Actual can fail completely.
Example 1 (Actual has linear regret; defiers).
Suppose that the population consists of defiers and further suppose the treatment has a positive effect: r̄(1) > r̄(0).
Bandit algorithms using protocol #2 will learn to pull arm 1, causing defiers to pull arm 0. The best move in hindsight is the opposite: pulling arm 0 causes defiers to take the treatment.
A population of defiers is arguably a pathological special case. The next scenario is more realistic in clinical trials:
Example 2 (Linear regret; harmful treatment).
Suppose there are two subpopulations: the first consists of rich, healthy patients who always take the treatment. The second consists of poor, less healthy patients who only take the treatment if prescribed. Finally, suppose the treatment reduces wellbeing by ε on some metric. Writing q for the probability that the bandit prescribes the treatment, the rewards credited under Actual are, in expectation,
E[r | a = 1] = (p_rich (r̄_rich − ε) + p_poor q (r̄_poor − ε)) / (p_rich + p_poor q)  (11)
E[r | a = 0] = r̄_poor  (12)
If the population of healthy always-takers is sufficiently large, then Actual assigns higher rewards to the harmful treatment arm.
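A quick numeric check of this failure mode, with hypothetical values for the population shares, baseline health levels and treatment harm:

```python
# Hypothetical numbers for Example 2: rich always-takers are healthier,
# and the treatment lowers everyone's success probability by eps.
p_rich, p_poor = 0.7, 0.3
h_rich, h_poor, eps = 0.9, 0.5, 0.1
q = 0.5  # probability the bandit prescribes the treatment

# Rewards credited to arm 1 under Actual come from all rich patients plus
# the poor patients who happened to be prescribed the treatment.
w_rich, w_poor = p_rich, p_poor * q
arm1 = (w_rich * (h_rich - eps) + w_poor * (h_poor - eps)) / (w_rich + w_poor)

# Arm 0 only ever sees poor patients: rich patients never take the control.
arm0 = h_poor

assert arm1 > arm0  # the harmful treatment looks better under Actual
```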
Protocol #3: Comply.
Finally, Chosen and Actual can be combined to form Comply, which only rewards an arm if (i) it was chosen by the bandit and (ii) the patient followed the bandit’s advice.
Proposition 3.
There are settings where Comply outperforms Chosen and Actual.
Proof.
It is easy to see that Comply outperforms Chosen in the setting of Proposition 2.
Consider a population of never-takers, always-takers and compliers. Suppose that never-takers are healthier than compliers, r̄(n, ·) > r̄(c, ·), whereas always-takers are less healthy, r̄(a, ·) < r̄(c, ·).
Let q_0 and q_1 be the probability that the bandit pulls arms 0 and 1 respectively. The expected rewards received by Actual are
E[r | a = 0] = (p_n r̄(n, 0) + p_c q_0 r̄(c, 0)) / (p_n + p_c q_0)  (13)
E[r | a = 1] = (p_a r̄(a, 1) + p_c q_1 r̄(c, 1)) / (p_a + p_c q_1)  (14)
whereas the rewards used to update Comply are
E[r | c = 0, a = 0] = (p_n r̄(n, 0) + p_c r̄(c, 0)) / (p_n + p_c)  (15)
E[r | c = 1, a = 1] = (p_a r̄(a, 1) + p_c r̄(c, 1)) / (p_a + p_c)  (16)
It follows that the weight placed on the compliers is larger under Comply than under Actual:
p_c / (p_n + p_c) > p_c q_0 / (p_n + p_c q_0)  (17)
p_c / (p_a + p_c) > p_c q_1 / (p_a + p_c q_1)  (18)
The reward estimates for compliers are diluted under both Actual and Comply. However, Comply’s estimate is more accurate. ∎
It is easy to see that Comply also has unbounded regret on example 2.
The rewards assigned to each arm by the three protocols are summarized in the table below. None of the protocols successfully isolates the compliers. It follows, as seen above, that which protocol is optimal depends on the structure of the population, which is unknown to the learner.
Arm updated  Chosen   Actual   Comply
arm i        c_t = i  a_t = i  c_t = i and a_t = i
The table can be extended with additional reward protocols. In this paper, we restrict attention to the three most intuitive protocols.
3 Algorithms
In the noncompliance setting there is additional information available to the algorithm. Ignoring the compliance information (i.e. using the Chosen protocol) reduces to the standard bandit setting. However, it should be possible to improve performance by taking advantage of observations about when treatments are actually applied. Using compliance information is not trivial, since bandit algorithms that rely purely on treatments taken (Actual) or purely on compliance (Comply) can have linear regret.
This section proposes two hybrid algorithms that take advantage of compliance information, have bounded regret, and empirically outperform algorithms running the Chosen protocol.
3.1 Hierarchical bandits
A natural idea is to use the three protocols to learn three experts and, simultaneously, learn which expert to apply. The result is a hierarchical bandit algorithm. The hierarchical bandit integrates compliance information in a way that ensures the algorithm (i) has no regret, because one of the base algorithms uses Chosen; and (ii) benefits from the compliance information if it turns out to be useful.
The general construction is as follows. At the bottom level are three bandit algorithms implementing the three protocols (Chosen, Actual and Comply). On the top level is a fourth bandit algorithm whose arms are the three bottom-level algorithms. The bottom-level bandits optimally implement the three protocols, whereas the top-level bandit learns which protocol is optimal.
The top-level bandit is not in a stochastic environment even when the external environment is stochastic, since the low-level bandits are learning. We therefore use EXP3 as the top-level bandit Auer et al. (2002).
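A sketch of the two-level construction with EXP3 at both levels. The class and function names are ours, and the fixed mixing rate stands in for the tuned rates assumed by the analysis.

```python
import math
import random

class EXP3:
    """Minimal EXP3 (Auer et al., 2002) over n arms with mixing rate gamma."""
    def __init__(self, n, gamma):
        self.n, self.gamma = n, gamma
        self.w = [1.0] * n

    def probs(self):
        total = sum(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.n
                for wi in self.w]

    def draw(self, rng=random):
        return rng.choices(range(self.n), weights=self.probs())[0]

    def update(self, arm, reward):
        # importance-weighted reward estimate for the played arm
        p = self.probs()[arm]
        self.w[arm] *= math.exp(self.gamma * reward / (p * self.n))

def hierarchical_step(top, base, rng=random):
    """One round of the two-level construction: the top-level EXP3 picks a
    protocol expert (0: Chosen, 1: Actual, 2: Comply), and that expert's
    bottom-level EXP3 picks the treatment arm."""
    expert = top.draw(rng)
    arm = base[expert].draw(rng)
    return expert, arm
```

After the round, each bottom-level bandit is updated according to its protocol using the observed chosen arm, actual arm and reward, and the top-level bandit is updated with the realized reward of the executed expert.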
Theorem 1 (No-regret with respect to Actual, Comply and individual treatment advice).
Let EXP3 be the no-regret algorithm used in Algorithm 1 for both the bottom and top-level bandits, with suitable choice of learning rate. Then, HierarchicalBandit satisfies
E[Σ_{t=1}^T ℓ_t(HierarchicalBandit)] − E[Σ_{t=1}^T ℓ_t(P)] = O(√(T log 3)) for each P ∈ {Chosen, Actual, Comply},  (19)
where ℓ_t(P) denotes the expected loss of the EXP3 expert running the respective protocol on round t. Furthermore, the regret against individual treatments is bounded by
O(√(T log 3) + √(T K log K)).  (20)
Proof.
Apply Lemma 2 to HierarchicalBandit. ∎
Using EXP3 at the top level and a Thompson sampler at the bottom level also yields a no-regret algorithm. We modify the Thompson sampler to incorporate importance weighting, see Algorithm 5 in the Appendix.
3.2 Thompson bounding
The second strategy starts from the observation that Thompson sampling often outperforms other bandit algorithms in stochastic settings Thompson (1933); Chapelle & Li (2011) and has logarithmic regret Agrawal & Goyal (2012); Kaufmann et al. (2012). A natural goal is then to design an algorithm that performs like Thompson sampling under the Chosen protocol in the long run – since Thompson sampling under Chosen is guaranteed to converge on the best action in hindsight – but also takes advantage of compliance side-information when Thompson sampling has not converged onto sampling a single arm with high probability.
The proposed algorithm, ThompsonBounded, adds an additional component to the hierarchical bandit above: a Thompson sampler that learns from arm-pulls according to the Chosen protocol. The Thompson sampler is initially unbiased between arms; as it learns, the probabilities the Thompson sampler assigns to arms become increasingly concentrated. ThompsonBounded takes advantage of Thompson sampling’s uncertainty about which arm to pull in early rounds to safely introduce side-information. To do so, ThompsonBounded draws two samples: if they agree, it plays a third Thompson sample. If they disagree, it plays the arm chosen by the hierarchical bandit.
Intuitively, if Thompson sampling is uncertain, then ThompsonBounded tends to use the hierarchical bandit. As the Thompson sampler’s confidence increases, ThompsonBounded is more likely to follow its advice. The next theorem shows that mixing in side-information has no qualitative effect on the algorithm’s regret, which grows as O(log T).
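The double-sampling rule at the heart of ThompsonBounded fits in a few lines; a sketch with our own function names:

```python
def thompson_bounded_arm(thompson_sample, hierarchical_arm):
    """Arm-selection rule of ThompsonBounded (a sketch).

    thompson_sample: a zero-argument callable returning one Thompson draw
    (an arm index). hierarchical_arm: the arm recommended by the
    hierarchical bandit on this round.
    """
    # Draw two Thompson samples; agreement signals high confidence.
    if thompson_sample() == thompson_sample():
        return thompson_sample()  # confident: play a third Thompson sample
    return hierarchical_arm       # uncertain: defer to the hierarchical bandit
```

The probability of deferring on a given round equals the probability that two independent Thompson draws disagree, which vanishes as the posterior concentrates.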
Theorem 2.
The regret of ThompsonBounded is bounded by
Regret_T(ThompsonBounded) ≤ 3 · Regret_T(Thompson sampling) = O(log T).  (21)
Proof.
Suppose without loss of generality that arm 1 yields a higher average payoff. Let p_t be the probability that Thompson sampling assigns to arm 1 on round t, so that 1 − p_t is the probability that Thompson sampling does not pull arm 1. The probability that ThompsonBounded follows the hierarchical bandit is then 2 p_t (1 − p_t) ≤ 2 (1 − p_t). The additional expected regret from deviating from Thompson sampling is therefore at most twice the regret Thompson incurs by pulling suboptimal arms. Finally, it was shown in Agrawal & Goyal (2012); Kaufmann et al. (2012) that Thompson sampling has logarithmic regret. ∎
3.3 Dataefficiency.
The no-regret guarantees for HB and TB are provided, respectively, by the bottom-level expert running the Chosen protocol and the top-level Thompson sampler learning from the Chosen protocol. We refer to these strategies as certified. The other strategies comprising the hybrids are not certified, but may boost empirical performance by bringing in side-information from compliance behavior.
As described, the hybrid algorithms are data-inefficient since, despite the i.i.d. assumption on the patient population, the certified strategies only learn on rounds when they are executed. We describe a recycling trick to improve the efficiency of the certified strategies.
A naive approach to increasing data-efficiency is to reward the certified strategy on rounds where the executed strategy selects the same action as the certified strategy. However, this introduces a systematic bias. For example, consider two strategies: the first always picks arm 1; the second picks arms 1 and 2 with equal probability. Running a top-level algorithm that picks both with equal probability results in a mixed distribution biased towards arm 1.
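A cache that replays stored (action, reward) pairs without replacement avoids this bias; a minimal sketch with our own class and method names:

```python
import random

class RecyclingCache:
    """Cache of rewards observed for each action when non-certified
    strategies were executed, used to reward the certified strategy on
    rounds it was not executed (a sketch).

    Sampling is without replacement, so early unrepresentative samples
    cannot bias the certified strategy through repeated reuse.
    """
    def __init__(self, n_actions):
        self.pools = {a: [] for a in range(n_actions)}

    def store(self, action, reward):
        self.pools[action].append(reward)

    def ready(self):
        # need at least one observation of every action before recycling
        return all(self.pools[a] for a in self.pools)

    def draw(self, action, rng=random):
        # sample without replacement from the pool for the requested action
        pool = self.pools[action]
        return pool.pop(rng.randrange(len(pool)))
```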
The recycling trick stores actions and subsequent rewards by non-certified strategies in a cache. When there is at least one of each action in the cache, the certified strategy is rewarded on rounds where it was not executed by sampling, without replacement, from the cache. Sampling without replacement is important in our setting since it prevents early unrepresentative samples from introducing a bias into the behavior of the certified strategy through repeated sampling. A related trick, referred to as “experience replay”, was introduced in reinforcement learning in Mnih et al. (2015).
4 Clinical trial data
The simulation data is taken from The International Stroke Trial (IST) database, a randomised trial in which patients believed to have acute ischaemic stroke were treated with: aspirin, subcutaneous heparin, both, or neither Group (1997). Complete compliance and mortality data at 14 days are available for each of 19,422 patients. To the best of our knowledge, this is the largest publicly available clinical trial with compliance data. An extensive search failed to find other open randomized clinical trial datasets that included compliance. A systematic review by Ebrahim et al. (2014) identified 37 reanalyses of patient-level data from previously published randomized control trials; five were performed by entirely independent authors. Data from drug abuse clinical trials is used in Kuleshov & Precup (2014); however, noncompliance is coded as failure there, so that source, and drug dependence treatments more generally, cannot be used in our setting. Given there is substantial loss to follow-up at the 6-month measure, we focus on the 14-day outcome.
4.1 Compliance variables
The main sources of noncompliance in the dataset are: the initial event not being a stroke, clinical decision, administration problems, and missing more than 3 doses. A detailed table and counts of these are included in the dataset’s open access article Sandercock et al. (2011). While these might initially seem like reasons to discard the patients from the dataset, noncompliance is not necessarily random. Discarding these patients could cause algorithms to have unbounded regret (since the loss we care about is over all patients). In particular, misdiagnoses, administrative problems, not taking doses and other sources of noncompliance can be confounded with a patient’s socioeconomic status, age, and overall health, as well as the load and quality of the medical staff.
To construct our “actual arm” variable, we assume that noncompliance entails taking the opposite treatment. This is well-defined in the aspirin case, which has only two arms; noncompliance with the control arm thus amounts to taking the treatment.
Assigning an actual arm pulled in the heparin part of the trial is less clear cut, as it has three arms: none, low and medium. We construct the actual arm variable by combining assignment and noncompliance. Noncompliance with respect to the low and medium assigned treatments is coded as taking none, while noncompliance by a patient prescribed “none” is coded as low.
5 Results
Stroke trial.
In the stroke trial experiments, performance is measured in terms of the excess relative survival, or surplus, of different strategies: the number of surviving patients in expectation, relative to a baseline that uniformly randomizes between treatment and control.
We simulate 10,000 patients per run, which allows us to avoid oversampling the data in any single simulation; 2,000 runs are performed for each algorithm. The EXP3 gamma parameter was set ahead of time to 0.085, a choice determined by the regret bounds given the horizon and the number of arms. Epsilon Greedy uses a standard annealing schedule. No data-dependent parameter tuning was used. The simulation is carried out by creating a “counterfactual patient”: we sample (i.i.d.) one patient from each of the treatment and control groups in the clinical trial. If the algorithm selects the treatment, it receives the reward and observes the action taken by the subject sampled from the treatment group, and vice versa for the control.
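The counterfactual-patient construction can be sketched as follows; the function name and dictionary keys are our own illustrative choices, not the IST database schema.

```python
import random

def simulate_round(algorithm_choice, treatment_group, control_group, rng=random):
    """One simulated round of the trial-based evaluation (a sketch).

    A 'counterfactual patient' is built by sampling one patient i.i.d.
    from each trial group; the algorithm then observes the reward and the
    actual action of the patient in the group matching its recommendation.
    Each patient is a dict with hypothetical keys 'actual' and 'reward'.
    """
    treated = rng.choice(treatment_group)
    controlled = rng.choice(control_group)
    patient = treated if algorithm_choice == 1 else controlled
    return patient["actual"], patient["reward"]
```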
Empirically, ThompsonBounded achieves a surplus of 8.9 extra survivals (that is, human lives) relative to the randomized baseline. HierarchicalBandit with Epsilon Greedy as the base algorithm achieves a surplus of 9.2. In contrast, the best performing strategy that is not compliance aware is Thompson sampling, which yields 7.9 extra survivals. The gains are largely concentrated in the aspirin trial, which is consistent with the lack of benefits or severe ill effects found in the original study Group (1997) for heparin, and with the small but beneficial effect found for aspirin. If the underlying treatment has no positive or negative effect, side-information after the fact cannot be helpful.
Note that Actual, and to a lesser extent Comply, perform better than either Chosen or the hybrid algorithms. However, these cannot be used directly since no guarantees apply. The performance of the hybrids benefits from the information encoded in Actual and Comply whilst keeping the guarantees of Chosen.
Synthetic data.
To better understand the behaviour of the algorithms in a more varied range of settings, we present results of simulations with synthetic data.
The first simulation illustrates example 2. For comparison, the number of patients is kept at 10,000, and we consider the binary outcome case. We assign half the patients to rich and half to poor at random. Rich patients always take the treatment, which lowers their otherwise high probability of a favorable outcome. Poor patients only take the treatment when prescribed; their probability of a favorable outcome is lower than that of rich patients without treatment, and taking the treatment reduces it further. Fig. 2 shows that the performance of Actual and Comply is much worse than Chosen and the hybrid algorithms.
The second simulation concerns small horizons T. A motivation for very small adaptive clinical trials is provided by rare diseases, where the overall size of the patient population is by construction severely restricted. The priors for the mechanisms of action are also often poorly understood, so potential alternative treatments can have radically different probabilities of success. We simulate a small adaptive trial with binary outcomes, with two treatments whose expected rewards are drawn uniformly from the unit interval, and compliance behaviors drawn uniformly at random. We sample 1,000 such simulations. While our bounds are vacuous in this setting, it is interesting that there is on average an improvement from taking the noncompliance information into account, see Fig. 3.
6 Conclusions
This paper introduced compliance information into the bandit setting. Compliance information reflects the treatment actually taken by the patient, rather than the algorithm’s recommendation. In many cases (perhaps most cases in practice) compliance information can be used to accelerate learning. Unfortunately, however, naively incorporating compliance information leads to algorithms with linear regret, as seen in example 2 and figure 2. We have therefore developed hybrid strategies, the first algorithms that simultaneously incorporate compliance information while maintaining a worst-case guarantee.
Empirically, TB achieves a surplus of 8.9 extra survivals (that is, human lives) and HB achieves 9.2 surplus lives compared with 7.9 for the best classical algorithm. This suggests hybrid algorithms can make a significant difference to clinical outcomes.
References
 Agrawal & Goyal (2012) Agrawal, S and Goyal, N. Analysis of Thompson sampling for the multiarmed bandit problem. In Computational Learning Theory (COLT), 2012.
 Alon et al. (2015) Alon, Noga, CesaBianchi, Nicoló, Dekel, Ofer, and Koren, Tomer. Online Learning with Feedback Graphs: Beyond Bandits. In Computational Learning Theory (COLT), 2015.
 Auer (2002) Auer, Peter. Using confidence bounds for exploitationexploration tradeoffs. JMLR, 3:397–422, 2002.
 Auer et al. (2002) Auer, Peter, CesaBianchi, Nicoló, Freund, Yoav, and Schapire, Robert. The nonstochastic multiarmed bandit problem. SIAM J. Computing, 32(1):48–77, 2002.
 Bareinboim et al. (2015) Bareinboim, Elias, Forney, Andrew, and Pearl, Judea. Bandits with Unobserved Confounders: A Causal Approach. In Adv in Neural Information Processing Systems (NIPS), 2015.

Bubeck (2012) Bubeck, Sébastien. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 Bubeck & Slivkins (2012) Bubeck, Sébastien and Slivkins, Aleksandrs. The best of both worlds: stochastic and adversarial bandits. In Computational Learning Theory (COLT), 2012.
 Chang & Kaelbling (2005) Chang, YuHan and Kaelbling, Leslie Pack. Hedged learning: Regretminimization with learning experts. In ICML, 2005.
 Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In Adv in Neural Information Processing Systems (NIPS), 2011.
 Ebrahim et al. (2014) Ebrahim, Shanil, Sohani, Zahra N, Montoya, Luis, Agarwal, Arnav, Thorlund, Kristian, Mills, Edward J, and Ioannidis, John PA. Reanalyses of randomized clinical trial data. Jama, 312(10):1024–1032, 2014.
 Graepel et al. (2010) Graepel, T, QuioneroCandela, J, Borchert, T, and Herbrich, R. Webscale Bayesian clickthrough rate prediction for sponsored search and advertising in Microsoft’s Bing engine. In ICML, 2010.
 Group (1997) Group, International Stroke Trial Collaborative. The international stroke trial (ist): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19 435 patients with acute ischaemic stroke. The Lancet, 349(9065):1569–1581, 1997.
 Hugtenburg et al. (2013) Hugtenburg, Jacqueline G, Timmers, Lonneke, Elders, PJ, Vervloet, Marcia, and van Dijk, Liset. Definitions, variants, and causes of nonadherence with medication: a challenge for tailored interventions. Patient Prefer Adherence, 7:675–682, 2013.
 Kaufmann et al. (2012) Kaufmann, E, Korda, N, and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT, 2012.
 Koller & Friedman (2009) Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 Kuleshov & Precup (2014) Kuleshov, Volodymyr and Precup, Doina. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
 Lai & Robbins (1985) Lai, T L and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
 McMahan et al. (2013) McMahan, H. Brendan, Holt, Gary, Sculley, D., Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, Chikkerur, Sharat, Liu, Dan, Wattenberg, Martin, Hrafnkelsson, Arnar Mar, Boulos, Tom, and Kubica, Jeremy. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2013.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
 Robbins (1952) Robbins, H. Some aspects of the sequential design of experiments. Bull. AMS, 58:527–535, 1952.
 Sabaté (2003) Sabaté, Eduardo. Adherence to long-term therapies: evidence for action. World Health Organization, 2003.
 Sandercock et al. (2011) Sandercock, Peter AG, Niewada, Maciej, Członkowska, Anna, et al. The international stroke trial database. Trials, 12(1):1–7, 2011.
 Seldin & Slivkins (2014) Seldin, Yevgeny and Slivkins, Aleksandrs. One Practical Algorithm for Both Stochastic and Adversarial Bandits. In ICML, 2014.
 Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Vapnik & Vashist (2009) Vapnik, Vladimir and Vashist, Akshay. A new learning paradigm: Learning using privileged information. Neural Netw, 22:544–557, 2009.
 Vrijens et al. (2012) Vrijens, Bernard, De Geest, Sabina, Hughes, Dyfrig A, Przemyslaw, Kardas, Demonceau, Jenny, Ruppar, Todd, Dobbels, Fabienne, Fargher, Emily, Morrison, Valerie, Lewek, Pawel, et al. A new taxonomy for describing and defining adherence to medications. British journal of clinical pharmacology, 73(5):691–705, 2012.
7 No-regret for HierarchicalBandit
This section shows that constructing a hierarchical bandit with EXP3 yields a noregret algorithm. The result is straightforward; we include it for completeness. A similar result was shown in Chang & Kaelbling (2005).
First, we construct a hierarchical version of Hedge, Algorithm 3, which is applicable in the full-information setting. On the bottom level are $N$ instantiations of Hedge. Instantiation $i$, for $i \in \{1,\dots,N\}$, plays a $K$-dimensional weight vector $\mathbf{w}^{(i)}_t$ and receives a $K$-dimensional loss vector $\boldsymbol{\ell}^{(i)}_t$ on round $t$. We impose the assumption that all instantiations play $K$-dimensional vectors for notational convenience. The top level is another instantiation of Hedge, which plays a weighted combination $\mathbf{w}^{(0)}_t$ of the bottom-level instantiations:
(22)  $w^{(i)}_{t+1,j} \;\propto\; w^{(i)}_{t,j}\,\exp\!\bigl(-\eta_i\,\ell^{(i)}_{t,j}\bigr), \qquad i = 1,\dots,N$
(23)  $w^{(0)}_{t+1,i} \;\propto\; w^{(0)}_{t,i}\,\exp\!\bigl(-\eta_0\,\langle \mathbf{w}^{(i)}_t, \boldsymbol{\ell}^{(i)}_t\rangle\bigr)$
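The two-level construction can be sketched compactly. The following Python sketch is illustrative, not the paper's Algorithm 3: the class and parameter names (`HHedge`, `eta_top`, `eta_bot`) are ours, and the updates follow the standard multiplicative-weights form of Hedge.

```python
import numpy as np

class Hedge:
    """Multiplicative-weights (Hedge) over n actions with learning rate eta."""
    def __init__(self, n, eta):
        self.w = np.ones(n) / n
        self.eta = eta

    def play(self):
        return self.w  # current distribution over actions

    def update(self, loss):
        # exponential reweighting against the observed loss vector
        self.w = self.w * np.exp(-self.eta * loss)
        self.w /= self.w.sum()

class HHedge:
    """Two-level Hedge: a top-level Hedge over N bottom-level Hedge
    instantiations, each over K experts (full-information feedback)."""
    def __init__(self, N, K, eta_top, eta_bot):
        self.top = Hedge(N, eta_top)
        self.bottom = [Hedge(K, eta_bot) for _ in range(N)]

    def play(self):
        # joint distribution over all N * K experts
        u = self.top.play()
        return np.array([u[i] * b.play() for i, b in enumerate(self.bottom)])

    def update(self, losses):
        # losses: N x K matrix; each bottom instance sees its own row,
        # the top level sees the compound losses <w_i, loss_i>
        compound = np.array([b.play() @ losses[i]
                             for i, b in enumerate(self.bottom)])
        for i, b in enumerate(self.bottom):
            b.update(losses[i])
        self.top.update(compound)
```

Run against a fixed loss matrix with a single zero-loss expert, the joint distribution concentrates on that expert, as the regret bounds below require.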
We have the following lemma:
Lemma 1.
Introduce the compound loss vector $\boldsymbol{\ell}^{(0)}_t$ with entries $\ell^{(0)}_{t,i} := \langle \mathbf{w}^{(i)}_t, \boldsymbol{\ell}^{(i)}_t \rangle$. Then $\eta_0$ can be chosen in HHedge such that
(24)  $\displaystyle\sum_{t=1}^T \langle \mathbf{w}^{(0)}_t, \boldsymbol{\ell}^{(0)}_t \rangle - \min_{i} \sum_{t=1}^T \ell^{(0)}_{t,i} \;\le\; \sqrt{\tfrac{T}{2}\log N}$
Moreover, $\eta_0$ and $\eta_1, \dots, \eta_N$ can be chosen such that, for all $i$ and $j$,
(25)  $\displaystyle\sum_{t=1}^T \langle \mathbf{w}^{(0)}_t, \boldsymbol{\ell}^{(0)}_t \rangle - \sum_{t=1}^T \ell^{(i)}_{t,j} \;\le\; \sqrt{\tfrac{T}{2}\log N} + \sqrt{\tfrac{T}{2}\log K}$
Proof.
Apply regret bounds for Hedge twice. ∎
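The two applications can be made explicit. Assuming the standard Hedge regret bound of $\sqrt{(T/2)\log n}$ over $n$ actions for a suitably tuned learning rate (constants depend on the exact form of the bound used), and writing in the notation of Lemma 1:

```latex
% Step 1: top-level Hedge against the N compound losses
\sum_{t=1}^T \langle \mathbf{w}^{(0)}_t, \boldsymbol{\ell}^{(0)}_t \rangle
  \;\le\; \min_i \sum_{t=1}^T \ell^{(0)}_{t,i} + \sqrt{\tfrac{T}{2}\log N}

% Step 2: bottom-level instantiation i against its K experts
\sum_{t=1}^T \ell^{(0)}_{t,i}
  \;=\; \sum_{t=1}^T \langle \mathbf{w}^{(i)}_t, \boldsymbol{\ell}^{(i)}_t \rangle
  \;\le\; \min_j \sum_{t=1}^T \ell^{(i)}_{t,j} + \sqrt{\tfrac{T}{2}\log K}

% Chaining the two inequalities bounds the regret against any expert j
% of any instantiation i by \sqrt{(T/2)\log N} + \sqrt{(T/2)\log K}.
```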
Lemma 1 says, firstly, that HHedge has bounded regret relative to the bottom-level instantiations and, secondly, that it has bounded regret relative to any of the experts on the bottom level.
Algorithm 4 modifies HHedge so that it is suitable for bandit feedback, yielding HEXP3. A corresponding noregret bound follows immediately:
Lemma 2.
Define $\boldsymbol{\ell}^{(0)}_t$ as in Lemma 1. Then $\eta_0$ can be chosen in HEXP3 such that
(26)  $\displaystyle\mathbb{E}\Bigl[\sum_{t=1}^T \ell^{(0)}_{t,i_t}\Bigr] - \min_i \sum_{t=1}^T \ell^{(0)}_{t,i} \;\le\; 2\sqrt{T N \log N}$
Moreover, $\eta_0$ and $\eta_1, \dots, \eta_N$ can be chosen such that, for all $i$ and $j$,
(27)  $\displaystyle\mathbb{E}\Bigl[\sum_{t=1}^T \ell^{(i_t)}_{t,j_t}\Bigr] - \sum_{t=1}^T \ell^{(i)}_{t,j} \;\le\; 2\sqrt{T N \log N} + 2\sqrt{T K \log K}$
Proof.
Follows from Lemma 1 and bounds for EXP3. ∎
(28)  $\hat\ell^{(i)}_{t,j} \;=\; \dfrac{\ell^{(i)}_{t,j}}{w^{(0)}_{t,i}\,w^{(i)}_{t,j}}\,\mathbf{1}\bigl[(i_t, j_t) = (i, j)\bigr]$
(29)  $\hat\ell^{(0)}_{t,i} \;=\; \dfrac{\ell^{(0)}_{t,i}}{w^{(0)}_{t,i}}\,\mathbf{1}\bigl[i_t = i\bigr]$
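Under bandit feedback only the single loss of the arm actually pulled is observed, so each level updates on an importance-weighted estimate. The sketch below is illustrative of this scheme, not the paper's Algorithm 4; the class and parameter names (`HEXP3`, `eta_top`, `eta_bot`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

class HEXP3:
    """Hierarchical EXP3 sketch: the top level samples a bottom-level
    instance i_t, that instance samples an arm j_t, and only the loss of
    (i_t, j_t) is observed and importance-weighted before updating."""
    def __init__(self, N, K, eta_top, eta_bot):
        self.u = np.ones(N) / N        # top-level distribution over instances
        self.W = np.ones((N, K)) / K   # per-instance distributions over arms
        self.eta_top, self.eta_bot = eta_top, eta_bot

    def act(self):
        i = rng.choice(len(self.u), p=self.u)
        j = rng.choice(self.W.shape[1], p=self.W[i])
        return i, j

    def update(self, i, j, loss):
        # unbiased importance-weighted loss estimates for the pulled pair
        lhat_bot = loss / (self.u[i] * self.W[i, j])
        self.W[i, j] *= np.exp(-self.eta_bot * lhat_bot)
        self.W[i] /= self.W[i].sum()
        lhat_top = loss / self.u[i]    # estimate of instance i's compound loss
        self.u[i] *= np.exp(-self.eta_top * lhat_top)
        self.u /= self.u.sum()
```

With losses in $[0,1]$ and small learning rates, mass concentrates on the best (instance, arm) pair while every pair retains the exploration needed to keep the estimates unbiased.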
Hierarchical Bandit with Thompson sampler base.
Algorithm 5 (BTS) shows how to modify the Thompson sampler for use as a bottom-level algorithm in HierarchicalBandit. The modification applies the importance weighting trick: replace the observed reward $r_t$ in Thompson sampling with $r_t / q_t$, where $q_t$ is the probability that the top-level bandit calls BTS on the given round.
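For a Beta-Bernoulli Thompson sampler, one way to realize the trick is to scale the posterior pseudo-counts by the importance weight. This is a minimal sketch under that assumption, not the paper's Algorithm 5; the class name `BTS` and the choice of Beta-Bernoulli posteriors are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

class BTS:
    """Beta-Bernoulli Thompson sampler usable as a bottom-level bandit.
    The reward r is importance-weighted by 1/q, where q is the probability
    that the top-level bandit selected this instance on the round."""
    def __init__(self, K):
        self.alpha = np.ones(K)  # pseudo-counts of successes per arm
        self.beta = np.ones(K)   # pseudo-counts of failures per arm

    def act(self):
        # sample a mean estimate per arm from the posterior, play the argmax
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, k, r, q):
        # importance-weighted pseudo-count updates: in expectation over
        # whether the instance is called, this matches the q = 1 update
        self.alpha[k] += r / q
        self.beta[k] += (1.0 - r) / q
```

With $q_t \equiv 1$ the sampler reduces to ordinary Thompson sampling; as $q_t$ shrinks, each observation counts for more, compensating for the rounds on which the instance was not called.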