Compliance-Aware Bandits

02/09/2016 ∙ by Nicolás Della Penna, et al. ∙ Victoria University of Wellington ∙ Australian National University

Motivated by clinical trials, we study bandits with observable noncompliance. At each step, the learner chooses an arm; afterwards, instead of observing only the reward, it also observes the action that actually took place. We show that such noncompliance can be helpful or hurtful to the learner in general. Unfortunately, naively incorporating compliance information into bandit algorithms loses guarantees on sublinear regret. We present hybrid algorithms that maintain regret bounds up to a multiplicative factor and can incorporate compliance information. Simulations based on real data from the International Stroke Trial show the practical potential of these algorithms.




1 Introduction

People often don’t do as they are told. Approximately 50% of patients suffering from chronic illness do not take prescribed medications Sabaté (2003). It is safe to assume that the rate at which patients or doctors will follow the recommendations provided by an algorithm will fall well short of 100%. Unfortunately, despite its importance in medical applications Vrijens et al. (2012); Hugtenburg et al. (2013), compliance has not been analyzed in the bandit literature.

In this paper, we introduce compliance awareness into the bandit setting. Bandit problems are concerned with optimal repeated decision-making in the presence of uncertainty Robbins (1952); Lai & Robbins (1985); Bubeck (2012). The main challenge is to trade off exploration and exploitation, so as to collect enough samples to estimate the rewards from different strategies whilst also strongly biasing samples towards those actions most likely to yield high rewards.

Our running example is an algorithm that recommends treatments to patients. For concreteness, consider a mobile app that encourages patients who have recently suffered a stroke to carry out various low intensity interventions that may be beneficial in preventing future strokes. These could be as simple as meditating, going for a walk or taking an aspirin. The effects of the interventions on the probability of a future stroke may be small. The social benefits of collectively choosing the most effective interventions, however, may be large.

However, there are other settings in which compliance information is potentially available. For example, an algorithm could recommend treatments to doctors. Whether or not the doctor prescribes the recommended treatment to the patient is then extremely informative, since the doctor may make observations and have access to background knowledge that is not available to the algorithm. A quite different setting is online advertising, where bandit algorithms are extensively applied to recommend which ad to display Graepel et al. (2010); McMahan et al. (2013). In practice, the recommendations provided by the bandit may not be followed. For example, sales teams often have hand-written rules that override the bandit in certain situations. Clearly, the bandit algorithm should be able to learn more efficiently if it is provided with information about which ads were actually shown.

In the classic multi-armed bandit setting, the player chooses one of $K$ arms on each round and receives a reward Auer et al. (2002); Auer (2002). The player is not told what the reward would have been had it chosen a different arm. The goal is to minimize the cumulative regret over a series of rounds. In the more general compliance setting, the action chosen by the algorithm is not necessarily the action that is finally carried out, see section 2.2. Instead, a compliance process mediates between the algorithm’s recommendation and the action that is actually taken. Importantly, the compliance process may depend on latent characteristics of the subject of the decision. We focus on the case where the outcome of the compliance process is observable.

Unfortunately, compliance information is a two-edged sword. There are settings where it is useful; but it can also lead to linear regret. We develop bounded regret algorithms that incorporate compliance information.


Section 2 introduces the formal compliance setting and three protocols for incorporating compliance information into bandit algorithms. It turns out that each protocol has strengths and weaknesses. The simplest protocol ignores compliance information – which yields the classical setting where standard regret bounds hold. If, instead of attending to its recommendations, the bandit attends to whether the patient actually takes the treatment, then it is possible, in some scenarios, to learn faster than without compliance information. On the other hand, there are no guarantees on convergence when an algorithm attends purely to the compliance of patients and ignores its own prior recommendations – examples of linear regret are provided in section 2.3.

A natural goal is thus to simultaneously incorporate compliance information whilst preserving the no-regret guarantees of the classical setting. Section 3 presents two hybrid algorithms that do both. The first, HierarchicalBandit, is a two-level bandit algorithm. The bottom level learns three experts that specialize in different kinds of compliance information. The top level is another bandit that learns which expert performs optimally. The algorithm thus has no-regret against both the treatments and two natural reward protocols that incorporate compliance information. The second algorithm, ThompsonBounded, rapidly converges to Thompson sampling with standard guarantees. However, when Thompson sampling is unsure about which arm to pull, the algorithm takes advantage of the uncertainty to introduce arm-pulls sampled from HierarchicalBandit.

Empirically, ThompsonBounded achieves a surplus of 8.9 extra survivals (that is, human lives) relative to the randomized baseline. The HierarchicalBandit algorithm with Epsilon Greedy as the base algorithm achieves a surplus of 9.2. In contrast, the best performing strategy that is not compliance aware is Thompson sampling, which yields 7.9 extra survivals.

Comparison with other bandit settings.

It is useful to compare noncompliance with other bandit settings. Partial monitoring and its generalizations, such as feedback graphs, are concerned with situations where the player only partially observes its loss Alon et al. (2015). Our setting is an extension of the bandit setting, where additional compliance information is provided. Whether or not a patient complies is a form of side-information. However, in contrast to the side-information available to contextual bandits, it is only available after an arm is pulled. An interesting question, left for future work, is how contextual and compliance information can both be incorporated into bandit algorithms.

Hybrid algorithms were previously proposed in the best-of-both-worlds scenario Bubeck & Slivkins (2012); Seldin & Slivkins (2014), where the goal is to construct a bandit that plays optimally in both stochastic and adversarial environments. Vapnik introduced a related notion of side-information into the supervised setting with his learning under privileged information framework Vapnik & Vashist (2009). Perhaps the closest setting to ours is the confounded bandits of Bareinboim et al. (2015), see section 2.1.

2 Models of Noncompliance

This section introduces a formal setting for bandits with noncompliance and introduces protocols that prescribe how to make use of compliance information. Before diving into the formalism let us discuss, informally, how compliance information can be useful.

First, suppose that the patient population is homogeneous in their response to the treatment, and that patients take the treatment with probability $p$ if prescribed and probability $q$ otherwise, where $p > q$. In this setting, it is clear that a bandit algorithm will learn faster by rewarding arms according to whether the treatment was taken by the patient, rather than whether it was recommended to the patient.

As a second example, consider corrective compliance where patients who benefit from a treatment are more likely to take it, since they have access to information that the bandit does not. The bandit clearly benefits by learning from the information expressed in the behavior of the patients. Learning from the treatment actually taken is therefore more efficient than learning from the bandit’s recommendations. Further examples are provided in section 2.2.

2.1 Unobserved confounders.

An important point of comparison is the bandits with unobserved confounders model introduced in Bareinboim et al. (2015). That paper was motivated using an extended example involving two subpopulations (drunk and sober) gambling in a casino. Since we are primarily interested in clinical applications, we map their example onto two subpopulations of patients, rich and poor. Suppose that rich patients always take the treatment (since they can afford it) and that they are also healthier in general. Poor patients only take the treatment when prescribed by a doctor.

Bareinboim et al. observe that the question “what is the patient’s expected reward when taking the treatment?” is confounded by the latent variable wealth. Estimating the effect of the treatment – which may differ between poor and rich patients – requires more refined questions: “what is the patient’s expected reward when taking the treatment, given she is wealthy?” and “what is the patient’s expected reward when taking the treatment, given she is poor?”, see example 2.

The solution proposed in Bareinboim et al. (2015) is based on the regret decision criterion (RDC), which estimates the optimal action conditional on the patient’s latent inclination, where the action chosen may differ from that inclination. Essentially, computing the RDC requires imposing interventions via the $do$ operator. However, overruling a patient or doctor’s decision is often impossible and/or unethical in clinical settings. The counterfactual information required to compute the RDC may therefore not be available in practice.

Compliance information does not act as a direct substitute for the $do$ operator. However, compliance information is often readily available and, as we show below, can be used to ameliorate the effect of confounders by giving a partial view into the latent structure of the population that the bandit is interacting with.

2.2 Formal setting

More formally, we consider a sequential decision making problem where a process mediates between the actions chosen by the algorithm and the action carried out in the world. The general game is as follows:

Definition 1 (bandit with compliance information).

At each time-step $t$, the player selects an action $c_t$ (the chosen action). The environment responds by carrying out an action $a_t$ (the actual action) and providing reward $r_t$, or loss $\ell_t$.

The standard bandit setting is the special case where $a_t$ is either unobserved or $a_t = c_t$ for all $t$.

Compliance and outcomes are often confounded. For example, healthy patients may be less inclined to take a treatment than unhealthy patients. The set of compliance-behaviors is the set of functions from advice to treatment-taken Koller & Friedman (2009).

Definition 2 (model assumptions).

We make the following assumptions:

  1. Compliance-behavior depends on a latent variable $u$ sampled i.i.d. from an unknown distribution.

  2. Outcomes depend on compliance-behavior, treatment-taken and the latent $u$. That is, outcomes are a fixed function of these three quantities.

When there are two arms, 0 and 1 (corresponding to control and treatment), we can list the compliance-behaviors explicitly.

Definition 3 (compliance behaviors).

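For two arms, the following four subpopulations capture all deterministic compliance-behaviors, written as maps from the chosen action $c$ to the actual action:

never-takers ($\mathsf{n}$): $c \mapsto 0$ (1)
always-takers ($\mathsf{a}$): $c \mapsto 1$ (2)
compliers ($\mathsf{c}$): $c \mapsto c$ (3)
defiers ($\mathsf{d}$): $c \mapsto 1 - c$ (4)

Let $p_s$ denote the probability of sampling from subpopulation $s \in \{\mathsf{n}, \mathsf{a}, \mathsf{c}, \mathsf{d}\}$.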

Unfortunately, the subpopulations cannot be distinguished from observations. For example, a patient that takes a prescribed treatment may be a complier or an always-taker. Nevertheless, observing compliance-behavior provides potentially useful side-information. The setting differs from contextual bandits because the side-information is only available after the bandit chooses an arm.
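As an illustration of Definition 3, the four deterministic behaviors can be written as functions from the advised arm to the arm actually taken. The sketch below is not from the paper; the names `BEHAVIORS` and `consistent_behaviors` are hypothetical. It also shows why a single observed (chosen, actual) pair never pins down the subpopulation.

```python
# Deterministic compliance behaviors for two arms (0 = control, 1 = treatment).
# Each maps the chosen (advised) arm to the arm actually taken.
BEHAVIORS = {
    "never-taker": lambda chosen: 0,
    "always-taker": lambda chosen: 1,
    "complier": lambda chosen: chosen,
    "defier": lambda chosen: 1 - chosen,
}


def consistent_behaviors(chosen, actual):
    """Return the behaviors that could have produced one (chosen, actual) pair."""
    return sorted(name for name, f in BEHAVIORS.items() if f(chosen) == actual)


# Treatment advised and taken: complier or always-taker.
print(consistent_behaviors(1, 1))
# Treatment advised but not taken: never-taker or defier.
print(consistent_behaviors(1, 0))
```

Every observation is consistent with exactly two of the four behaviors, which is the ambiguity described above.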

Definition 4 (stochastic reward model).

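The expected reward given subpopulation $s$ and the actual treatment $j$ is

$$r_{s,j} := \mathbb{E}[r \mid s, a = j] \qquad (5)$$

for $s \in \{\mathsf{n}, \mathsf{a}, \mathsf{c}, \mathsf{d}\}$ and $j \in \{0, 1\}$.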

The goal of the player is to maximize the cumulative reward received, i.e. choose a sequence of actions that maximizes $\sum_{t=1}^{T} r_t$. We quantify the performance of algorithms in terms of regret, which compares the cumulative reward against that of the best action in hindsight.
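One common way to write this comparison in the stochastic setting is the pseudo-regret below. The notation is assumed here for illustration and may differ from the paper's: $c$ is the arm chosen by the player, $r_t$ the reward on round $t$, and $T$ the number of rounds.

```latex
\mathrm{Regret}_T
  \;=\; \max_{i} \; T \cdot \mathbb{E}\,[\, r \mid c = i \,]
  \;-\; \mathbb{E}\Big[ \sum_{t=1}^{T} r_t \Big]
```

An algorithm has no regret in this sense when $\mathrm{Regret}_T / T \to 0$ as $T$ grows.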

2.3 Reward protocols

Since compliance information is only available after pulling an arm, it cannot be used directly when selecting arms. However, compliance information can be used to modify the updates performed by the algorithm. For example, if the bandit recommends taking a treatment, and the patient does not do so, we have a choice about whether to update the arm that the bandit recommended (treatment) or the arm that the patient pulled (control).

Definition 5 (reward protocols).

We consider three protocols for assigning rewards to arms:

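  1. Chosen: chosen-treatment updates.
    Assign reward $r_t$ to arm $i$ if the chosen action satisfies $c_t = i$.

  2. Actual: actual-treatment updates.
    Assign reward $r_t$ to arm $i$ if the actual action satisfies $a_t = i$.

  3. Comply: compliance-based updates.
    Assign reward $r_t$ to arm $i$ if $c_t = i$ and $a_t = i$.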

Each protocol has strengths and weaknesses.
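The three protocols of Definition 5 differ only in which arm, if any, is credited with the observed reward. A minimal sketch, with the hypothetical helper name `arms_to_update` and string labels chosen here for illustration:

```python
def arms_to_update(protocol, chosen, actual):
    """Return the list of arms credited with this round's reward."""
    if protocol == "chosen":
        # Credit the arm the bandit recommended, whatever the patient did.
        return [chosen]
    if protocol == "actual":
        # Credit the arm the patient actually took.
        return [actual]
    if protocol == "comply":
        # Credit the arm only when recommendation and action coincide.
        return [chosen] if chosen == actual else []
    raise ValueError(f"unknown protocol: {protocol}")


# The bandit recommends treatment (1) but the patient takes control (0).
for name in ("chosen", "actual", "comply"):
    print(name, arms_to_update(name, chosen=1, actual=0))
```

Under Comply, rounds with noncompliance update nothing, which is why it updates less often than Actual.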

Protocol #1: Chosen.

Under Chosen, the bandit advises the patient on which treatment to take, and ignores whether or not the patient complies.

Proposition 1.

Standard regret bounds hold for any algorithm under Chosen.


The regret bound for any bandit algorithm holds since the setting is the standard bandit setting. ∎

Protocol #2: Actual.

Expected rewards depend on the treatment chosen by the patient, see Eq. (5), and not directly on the arm pulled by the bandit. Thus, a natural alternative to Chosen is Actual, where the bandit assigns rewards to the treatment that the patient actually used – which may not in general coincide with the arm that the bandit pulled.

Proposition 2.

There are settings where Actual outperforms Chosen and Comply.


Suppose that the expected reward depends on the treatment but not the subpopulation. Further suppose the population is a mix of always-takers, never-takers, and compliers – but no defiers. Always-takers and never-takers ignore the bandit, which therefore only interacts with the compliers.

The rewards used to update Chosen are, in expectation


whereas the rewards used to update Actual are


It follows that


Thus, Actual assigns rewards to arms based on their effect on compliers (which are the only subpopulation interacting with the bandit), whereas the rewards assigned to arms by Chosen are diluted by patients who do not take the treatment. Finally, Actual outperforms Comply because it updates more frequently. ∎
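As a sketch of the computation behind this argument, suppose the expected reward is $r_j$ under treatment $j$ regardless of subpopulation, and write $p_{\mathsf{n}}, p_{\mathsf{a}}, p_{\mathsf{c}}$ for the proportions of never-takers, always-takers and compliers. These symbols are assumptions made here and need not match the paper's own equations.

```latex
\begin{align*}
\mathbb{E}[r \mid \text{chosen} = 1] &= (p_{\mathsf{a}} + p_{\mathsf{c}})\, r_1 + p_{\mathsf{n}}\, r_0 \\
\mathbb{E}[r \mid \text{chosen} = 0] &= p_{\mathsf{a}}\, r_1 + (p_{\mathsf{n}} + p_{\mathsf{c}})\, r_0 \\
\mathbb{E}[r \mid \text{actual} = j] &= r_j \\
\underbrace{p_{\mathsf{c}}\,(r_1 - r_0)}_{\text{gap seen by Chosen}}
  &\;\le\; \underbrace{r_1 - r_0}_{\text{gap seen by Actual}}
  \quad \text{when } r_1 \ge r_0
\end{align*}
```

The gap between arms is shrunk by the factor $p_{\mathsf{c}}$ under Chosen, so the two arms are harder to tell apart when few patients comply.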

However, Actual can fail completely.

Example 1 (Actual has linear regret; defiers).

Suppose that the population consists of defiers and further suppose the treatment has a positive effect, so that the expected reward under treatment exceeds that under control. Bandit algorithms using protocol #2 will learn to pull arm 1, causing defiers to pull arm 0. The best move in hindsight is the opposite.

A population of defiers is arguably a pathological special case. The next scenario is more realistic in clinical trials:

Example 2 (Linear regret; harmful treatment).

Suppose there are two sub-populations: the first consists of rich, healthy patients who always take the treatment. The second consists of poor, less healthy patients who only take the treatment if prescribed. Finally, suppose the treatment reduces wellbeing by a fixed amount on some metric. We then have


If the population of healthy always-takers is sufficiently large, then Actual assigns higher rewards to the harmful treatment arm.
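To make Example 2 concrete, the following sketch computes the expected per-arm reward estimates under Chosen and Actual. All numbers (half the patients rich, baseline success probabilities 0.9 and 0.5, a harm of 0.1, uniform recommendations) are hypothetical values picked for illustration, not values from the paper.

```python
def expected_estimates(p_rich, rich_base, poor_base, harm, pi_treat):
    """Expected per-arm reward estimates under Chosen and Actual (Example 2).

    pi_treat is the probability that the bandit recommends the treatment.
    """
    p_poor = 1.0 - p_rich
    rich_treated = rich_base - harm  # rich patients always take the treatment
    poor_treated = poor_base - harm  # poor patients take it only if prescribed
    chosen = {
        0: p_rich * rich_treated + p_poor * poor_base,
        1: p_rich * rich_treated + p_poor * poor_treated,
    }
    # Under Actual, arm 0 is only credited with untreated poor patients, while
    # arm 1 mixes all rich patients with the poor patients advised to treat.
    treated_mass = p_rich + pi_treat * p_poor
    actual = {
        0: poor_base,
        1: (p_rich * rich_treated + pi_treat * p_poor * poor_treated) / treated_mass,
    }
    return chosen, actual


chosen, actual = expected_estimates(
    p_rich=0.5, rich_base=0.9, poor_base=0.5, harm=0.1, pi_treat=0.5
)
print("Chosen estimates:", chosen)
print("Actual estimates:", actual)
```

With these numbers Chosen ranks control above treatment, whereas Actual ranks the harmful treatment above control because healthy always-takers inflate the treatment arm.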

Protocol #3: Comply.

Finally, Chosen and Actual can be combined to form Comply, which only rewards an arm if (i) it was chosen by the bandit and (ii) the patient followed the bandit’s advice.

Proposition 3.

There are settings where Comply outperforms Chosen and Actual.


It is easy to see that Comply outperforms Chosen in the setting of Proposition 2.

Consider a population of never-takers, always-takers and compliers. Suppose that never-takers are healthier than compliers whereas always-takers are less healthy.

Let $\pi_0$ and $\pi_1$ be the probabilities that the bandit pulls arms 0 and 1 respectively. The expected rewards received by Actual are


whereas the rewards used to update Comply are


It follows that


The reward estimates for compliers are diluted under both Actual and Comply. However, Comply’s estimate is more accurate. ∎
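One way to make the comparison concrete, as a sketch under assumed notation rather than the paper's own equations: let $p_s$ be the proportion of subpopulation $s$, let $r_{s,j}$ be the expected reward of subpopulation $s$ under treatment $j$, and let $\pi_0$ be the probability that the bandit recommends arm 0. The average reward credited to arm 0 is then

```latex
\begin{align*}
\text{Actual:} \quad
  & \frac{p_{\mathsf{n}}\, r_{\mathsf{n},0} + \pi_0\, p_{\mathsf{c}}\, r_{\mathsf{c},0}}
         {p_{\mathsf{n}} + \pi_0\, p_{\mathsf{c}}} \\[4pt]
\text{Comply:} \quad
  & \frac{p_{\mathsf{n}}\, r_{\mathsf{n},0} + p_{\mathsf{c}}\, r_{\mathsf{c},0}}
         {p_{\mathsf{n}} + p_{\mathsf{c}}} \\[4pt]
\text{complier weight:} \quad
  & \frac{\pi_0\, p_{\mathsf{c}}}{p_{\mathsf{n}} + \pi_0\, p_{\mathsf{c}}}
    \;\le\; \frac{p_{\mathsf{c}}}{p_{\mathsf{n}} + p_{\mathsf{c}}}
    \quad \text{since } \pi_0 \le 1
\end{align*}
```

Both averages mix never-takers with compliers, but Comply gives the compliers at least as much weight, so its estimate lies closer to $r_{\mathsf{c},0}$. The same argument applies to arm 1 with always-takers in place of never-takers.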

It is easy to see that Comply also has unbounded regret on example 2.

The rewards assigned to each arm by the three protocols are summarized in the table below. None of the protocols successfully isolates the compliers. It follows, as seen above, that which protocol is optimal depends on the structure of the population, which is unknown to the learner.

Arm updated Chosen Actual Comply

The table can be extended with additional reward protocols. In this paper, we restrict attention to the three most intuitive protocols.

3 Algorithms

In the non-compliance setting there is additional information available to the algorithm. Ignoring the compliance-information (i.e. using the Chosen protocol) reduces to the standard bandit setting. However, it should be possible to improve performance by taking advantage of observations about when treatments are actually applied. Using compliance-information is not trivial, since bandit algorithms that rely purely on treatments (Actual) or purely on compliance (Comply) can have linear regret.

This section proposes two hybrid algorithms that take advantage of compliance information, have bounded regret, and empirically outperform algorithms running the Chosen protocol.

3.1 Hierarchical bandits

A natural idea is to use the three protocols to learn three experts and, simultaneously, learn which expert to apply. The result is a hierarchical bandit algorithm. The hierarchical bandit integrates compliance-information in a way that ensures the algorithm (i) has no regret, because one of the base-algorithms uses Chosen, and therefore has no regret; and (ii) benefits from the compliance-information if it turns out to be useful.

The general construction is as follows. At the bottom-level are three bandit algorithms implementing the three protocols (Chosen, Actual and Comply). On the top-level is a fourth bandit algorithm whose arms are the three bottom-level algorithms. The bottom-level bandits optimally implement the three protocols, whereas the top-level bandit learns which protocol is optimal.

  Input: Bandits $B_j$ running NoRegretAlgorithm on Chosen, Actual and Comply for $j = 1, 2, 3$ respectively, with arms corresponding to treatments
  Input: Bandit $H$ running NoRegretAlgorithm, with arms corresponding to the bandits $B_j$ above
  for $t = 1$ to $T$ do
     Draw bandit $j_t$ from $H$ and arm $c_t$ from $B_{j_t}$
     Pull arm $c_t$; incur loss $\ell_t$; observe compliance $a_t$
     Update $H$ with loss $\ell_t$ applied to bandit-arm $j_t$
     if $j_t = 1$ then
        Update $B_1$ with loss $\ell_t$ applied to treatment-arm $c_t$
     end if
     Update $B_2$, $B_3$ with loss $\ell_t$ according to relevant protocol
  end for
Algorithm 1 HierarchicalBandit (HB)

The top-level bandit is not in a stochastic environment even when the external environment is stochastic, since the low-level bandits are learning. We therefore use EXP3 as the top-level bandit Auer et al. (2002).
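The two-level construction can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it works with rewards in [0, 1] rather than losses, it assumes the Chosen expert is updated only on rounds where it was executed while the Actual and Comply experts are updated on every round their protocol allows, and it importance-weights each update by the updated bandit's own arm probability. The environment `complier_environment` and its success probabilities are hypothetical.

```python
import math
import random


class Exp3:
    """EXP3 with rewards in [0, 1] (Auer et al., 2002), using log-weights."""

    def __init__(self, n_arms, gamma, rng):
        self.n_arms, self.gamma, self.rng = n_arms, gamma, rng
        self.log_w = [0.0] * n_arms

    def probs(self):
        m = max(self.log_w)  # subtract the max for numerical stability
        w = [math.exp(lw - m) for lw in self.log_w]
        total = sum(w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.n_arms for wi in w]

    def draw(self):
        return self.rng.choices(range(self.n_arms), weights=self.probs())[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the updated arm.
        estimate = reward / self.probs()[arm]
        self.log_w[arm] += self.gamma * estimate / self.n_arms


def hierarchical_bandit(environment, n_arms, horizon, gamma=0.085, seed=0):
    """Run the two-level bandit and return the total reward collected."""
    rng = random.Random(seed)
    CHOSEN, ACTUAL, COMPLY = 0, 1, 2
    bottom = [Exp3(n_arms, gamma, rng) for _ in range(3)]
    top = Exp3(3, gamma, rng)
    total = 0.0
    for _ in range(horizon):
        j = top.draw()                 # which expert to follow this round
        chosen = bottom[j].draw()      # the arm that expert recommends
        actual, reward = environment(chosen, rng)
        total += reward
        top.update(j, reward)
        if j == CHOSEN:                # assumed: certified expert learns only when executed
            bottom[CHOSEN].update(chosen, reward)
        bottom[ACTUAL].update(actual, reward)
        if chosen == actual:
            bottom[COMPLY].update(chosen, reward)
    return total


def complier_environment(chosen, rng):
    """Hypothetical environment: everyone complies and arm 1 is better."""
    success = 0.7 if chosen == 1 else 0.3
    return chosen, (1.0 if rng.random() < success else 0.0)


print(hierarchical_bandit(complier_environment, n_arms=2, horizon=2000))
```

The `environment` callable returns the arm actually taken together with the reward, so other compliance behaviors can be plugged in without changing the bandit code.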

Theorem 1 (No-regret with respect to Actual, Comply and individual treatment advice).

Let EXP3 be the no-regret algorithm used in Algorithm 1 for both the bottom and top-level bandits, with suitable choice of learning rate. Then, HierarchicalBandit satisfies



where the loss vector in the bound is the expected loss vector of EXP3 under the respective protocol on round $t$. Furthermore, the regret against individual treatments is bounded by


Apply Lemma 2 to HierarchicalBandit. ∎

Using EXP3 at the top-level and a ThompsonSampler in the bottom-level also yields a no-regret algorithm. We modify the Thompson sampler to incorporate importance weighting, see Algorithm 5 in the Appendix.

  Input: Bandit algorithm $\mathcal{A}$
  Input: Thompson sampler under Chosen protocol
  for $t = 1$ to $T$ do
     Sample $x$ and $y$ from Thompson
     if $x = y$ then
        Pull arm sampled from Thompson
     else
        Pull arm chosen by $\mathcal{A}$
     end if
     Incur loss, update algorithm used to pull arm
  end for
Algorithm 2 ThompsonBounded (TB)

3.2 Thompson bounding

The second strategy starts from the observation that Thompson sampling often outperforms other bandit algorithms in stochastic settings Thompson (1933); Chapelle & Li (2011) and has logarithmic regret Agrawal & Goyal (2012); Kaufmann et al. (2012). A natural goal is then to design an algorithm that performs like Thompson sampling under the Chosen protocol in the long-run – since Thompson sampling under Chosen is guaranteed to match the best action in hindsight in time – but also takes advantage of compliance side-information when Thompson sampling has not converged onto sampling a single arm with high probability.

The proposed algorithm, ThompsonBounded, adds an additional component to the hierarchical bandit above: a Thompson sampler that learns from arm-pulls according to the Chosen protocol. The Thompson sampler is initially unbiased between arms; as it learns, the probabilities the Thompson sampler assigns to arms become increasingly concentrated. ThompsonBounded takes advantage of Thompson sampling’s uncertainty about which arm to pull in early rounds to safely introduce side-information. To do so, ThompsonBounded draws two samples: if they agree, it plays a third Thompson sample. If they disagree, it plays the arm chosen by the hierarchical bandit.

Intuitively, if Thompson sampling is uncertain, then ThompsonBounded tends to use the hierarchical bandit. As the Thompson sampler’s confidence increases, ThompsonBounded is more likely to follow its advice. The next theorem shows that mixing in side information has no qualitative effect on the algorithm’s regret, which grows as $O(\log T)$.
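The agreement test can be sketched as below. This is a minimal illustration assuming Bernoulli rewards with Beta(1, 1) priors; the class `BetaThompson`, the function `thompson_bounded_choice` and the `fallback_choice` callable (standing in for the hierarchical bandit) are hypothetical names, not taken from the paper.

```python
import random


class BetaThompson:
    """Thompson sampler with Beta posteriors for Bernoulli rewards."""

    def __init__(self, n_arms, rng):
        self.rng = rng
        self.successes = [1] * n_arms  # Beta(1, 1) prior on every arm
        self.failures = [1] * n_arms

    def sample_arm(self):
        draws = [
            self.rng.betavariate(s, f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        if reward > 0.5:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def thompson_bounded_choice(thompson, fallback_choice):
    """Return (arm, source) for one round of the two-sample agreement test."""
    if thompson.sample_arm() == thompson.sample_arm():
        # The two samples agree: play a third, fresh Thompson sample.
        return thompson.sample_arm(), "thompson"
    # The samples disagree: defer to the other bandit.
    return fallback_choice(), "fallback"


rng = random.Random(0)
sampler = BetaThompson(2, rng)
print(thompson_bounded_choice(sampler, fallback_choice=lambda: 0))
```

Only the algorithm named in the returned `source` should be updated with the round's loss, matching the last step of Algorithm 2.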

Theorem 2.

The regret of ThompsonBounded is bounded by


Suppose without loss of generality that arm 1 yields a higher average payoff. Let $p_t(i)$ be the probability that Thompson assigns to arm $i$ on round $t$, so that $1 - p_t(1)$ is the probability that Thompson sampling does not pull arm 1. The probability that ThompsonBounded follows the hierarchical bandit, which is the probability that the two Thompson samples disagree, is then at most $2(1 - p_t(1))$. The additional expected regret from deviating from Thompson sampling is therefore at most twice the regret Thompson incurs by pulling suboptimal arms. Finally, it was shown in Agrawal & Goyal (2012); Kaufmann et al. (2012) that Thompson sampling has logarithmic regret. ∎
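The factor of two in this argument can be checked directly. Writing $p_t(i)$ for the probability that Thompson sampling assigns to arm $i$ on round $t$ (notation assumed here), and letting $x$ and $y$ be two independent Thompson samples, a short derivation is:

```latex
\begin{align*}
\Pr[x \neq y]
  &= 1 - \sum_{i} p_t(i)^2 \\
  &\le 1 - p_t(1)^2 \\
  &= \big(1 - p_t(1)\big)\big(1 + p_t(1)\big) \\
  &\le 2\,\big(1 - p_t(1)\big)
\end{align*}
```

So the probability of deferring to the hierarchical bandit is at most twice the probability that Thompson sampling itself pulls a suboptimal arm.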

3.3 Data-efficiency.

The no-regret guarantees for HB and TB are provided, respectively, by the bottom-level expert running the Chosen protocol and the top-level Thompson sampler learning from the Chosen protocol. We refer to these strategies as certified. The other strategies comprising the hybrids are not certified, but rather may boost empirical performance by bringing side-information from the compliance.

As described, the hybrid algorithms are data-inefficient since, despite the i.i.d. assumption on the patient population, the certified strategies only learn when they are executed. We describe a recycling trick to improve the efficiency of the certified strategies.

A naive approach to increase data-efficiency is to reward the certified strategy on rounds where the executed strategy selects the same action as the certified strategy. However, this introduces a systematic bias. For example, consider two strategies: the first always picks arm 1, the second picks arms 1 and 2 with equal probability. Running a top-level algorithm that picks both with equal probability results in a mixed distribution biased towards arm 1.

The recycling trick stores actions and subsequent rewards by non-certified strategies in a cache. When there is at least one of each action in the cache, the certified strategy is rewarded on rounds where it was not executed by sampling, without replacement, from the cache. Sampling without replacement is important in our setting since it prevents early unrepresentative samples introducing a bias into the behavior of the certified strategy through repeated sampling. A related trick, referred to as “experience replay”, was introduced in reinforcement learning in Mnih et al. (2015).
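A minimal sketch of such a cache is given below. The class name `RecycleCache` and its methods are hypothetical, and the way the certified strategy consumes replayed rewards is an assumption about one reasonable reading of the description above.

```python
import random
from collections import defaultdict


class RecycleCache:
    """Stores (action, reward) pairs from non-certified rounds for later replay."""

    def __init__(self, n_arms, rng):
        self.n_arms = n_arms
        self.rng = rng
        self.rewards = defaultdict(list)

    def store(self, action, reward):
        self.rewards[action].append(reward)

    def ready(self):
        """True when at least one reward is cached for every action."""
        return all(self.rewards[a] for a in range(self.n_arms))

    def replay(self, action):
        """Remove and return a uniformly random cached reward for `action`."""
        pool = self.rewards[action]
        idx = self.rng.randrange(len(pool))
        pool[idx], pool[-1] = pool[-1], pool[idx]  # swap so pop() is O(1)
        return pool.pop()


cache = RecycleCache(n_arms=2, rng=random.Random(0))
cache.store(0, 1.0)
cache.store(1, 0.0)
if cache.ready():
    # The certified strategy would draw its own arm here; arm 1 is a stand-in.
    print(cache.replay(1))
```

Because `replay` removes the sampled reward, each cached observation is used at most once, which is the sampling-without-replacement property the text relies on.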

4 Clinical trial data

The simulation data is taken from the International Stroke Trial (IST) database, a randomised trial in which patients believed to have acute ischaemic stroke were treated with aspirin, subcutaneous heparin, both, or neither Group (1997). The database provides complete compliance and mortality data at 14 days for each of 19,422 patients. To the best of our knowledge, this is the largest publicly available clinical trial with compliance data. (Footnote 1: An extensive search failed to find other open randomized clinical trial datasets that included compliance. A systematic review by Ebrahim et al. (2014) identified 37 reanalyses of patient-level data from previously published randomized control trials; five were performed by entirely independent authors. Data from drug abuse clinical trials is used in Kuleshov & Precup (2014). However, noncompliance is coded as failure so this source, and drug dependence treatments more generally, cannot be used in our setting.) Given there is substantial loss to follow-up at the 6 month measure, we focus on the 14 day outcome.

4.1 Compliance variables

The main sources of noncompliance in the dataset are: the initial event not being a stroke, clinical decision, administration problem, and missing more than 3 doses. A detailed table and counts of these are included in the dataset’s open access article Sandercock et al. (2011). While these might initially seem like reasons to discard the patients from the dataset, noncompliance is not necessarily random. Discarding these patients could cause algorithms to have unbounded regret (since the loss we care about is over all patients). In particular, misdiagnoses, administrative problems, not taking doses and other sources of noncompliance can be confounded with a patient’s socioeconomic status, age, and overall health, as well as the load and quality of the medical staff.

To construct our “actual arm” variable, we assume that noncompliance entails taking the opposite treatment. This is well-defined in the Aspirin case, which has only two arms, so that noncompliance with placebo most likely means taking the treatment.

Assigning an actual arm pulled in the heparin part of the trial is less clear cut, as it has three arms: none, low and medium. We construct the actual arm variable by combining assignment and noncompliance. Noncompliance with respect to low and medium assigned treatments is coded as not-takers, while noncompliance by a patient prescribed “none” is coded as low.
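The coding rules described above can be summarized in a small function. The arm labels (`"none"`, `"aspirin"`, `"low"`, `"medium"`), the boolean `complied` flag and the function name `actual_arm` are hypothetical; the dataset's own field names and encodings are not reproduced here.

```python
def actual_arm(trial, assigned, complied):
    """Code the arm actually taken from the assignment and a compliance flag."""
    if complied:
        return assigned
    if trial == "aspirin":
        # Two arms: noncompliance is coded as taking the opposite arm.
        return "aspirin" if assigned == "none" else "none"
    if trial == "heparin":
        # Three arms: noncompliance with "low" or "medium" is coded as "none",
        # and noncompliance with "none" is coded as "low".
        return "low" if assigned == "none" else "none"
    raise ValueError(f"unknown trial: {trial}")


print(actual_arm("heparin", "medium", complied=False))
print(actual_arm("aspirin", "none", complied=False))
```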

5 Results

Stroke trial.

Figure 1: 14-day survivals: surplus over the expectation of a uniformly random arm, for simulated trials of 10,000 patients each for Aspirin and Heparin.

In the stroke trial experiments, performance is measured in terms of the excess relative survival or surplus of different strategies. That is, the number of surviving patients in expectation, relative to a baseline that uniformly randomizes between treatment and control.

We simulate 10,000 patients per run, which allows us to not oversample the data in any single simulation; 2000 runs are performed for each algorithm. The EXP3 gamma parameter was set ahead of time to 0.085, a choice determined by the regret bounds. Epsilon-Greedy uses a standard annealing schedule. No data dependent parameter tuning was used. The simulation is carried out by creating a “counterfactual patient” by sampling (i.i.d.) one patient from each of the treatment and control groups in the clinical trial. If the algorithm selects the treatment, it then receives the reward and observes the action taken by the subject sampled from the treatment group, and vice versa for the control.
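The counterfactual-patient construction can be sketched as follows. The record format, a list of `(actual_arm, reward)` tuples per assigned arm, and the function name `simulate_round` are assumptions made for illustration; the toy records are not trial data.

```python
import random


def simulate_round(choose_arm, records_by_arm, rng):
    """One simulated round: build a counterfactual patient and reveal one arm."""
    # Sample one trial record per arm, i.i.d. from that arm's group.
    patient = {arm: rng.choice(records) for arm, records in records_by_arm.items()}
    chosen = choose_arm()
    # Only the record from the chosen arm's group is revealed to the algorithm.
    actual, reward = patient[chosen]
    return chosen, actual, reward


# Toy records: (arm actually taken, survived at 14 days) for each assigned arm.
toy_records = {
    0: [(0, 1.0), (0, 0.0), (1, 1.0)],
    1: [(1, 1.0), (1, 1.0), (0, 0.0)],
}
rng = random.Random(0)
print(simulate_round(lambda: 1, toy_records, rng))
```

The returned triple is exactly what a compliance-aware bandit needs for its update: the arm it chose, the arm actually taken, and the reward.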

Empirically, ThompsonBounded achieves a surplus of 8.9 extra survivals (that is, human lives) relative to the randomized baseline. HierarchicalBandit with Epsilon Greedy as the base algorithm achieves a surplus of 9.2. In contrast, the best performing strategy that is not compliance aware is Thompson sampling, which yields 7.9 extra survivals.

The gains are largely concentrated in the Aspirin trial, which is consistent with the lack of benefits or severe ill effects found in the original study Group (1997) for heparin, and with the small but beneficial effect found for aspirin. If the underlying treatment has no positive or negative effect, side-information after the fact alone cannot be helpful.

Note that Actual, and to a lesser extent Comply, perform better than either Chosen or the hybrid algorithms. However, these cannot be used directly since no guarantees apply. The performance of the hybrids benefits from the information encoded in Actual and Comply whilst keeping the guarantees of Chosen.

Synthetic data.

To better understand the behaviour of the algorithms in a more varied range of settings, we present results of simulations with synthetic data.

Figure 2: Example 2 (rich and poor patients): surplus over expectation of uniform random arm on 1,000 bootstrap samples, each simulating a 10,000-patient trial.

The first simulation illustrates example 2. For comparison, $T$ is kept at 10,000, and we consider the binary outcome case. We assign half the patients to rich and half to poor randomly. Rich patients always take the treatment, which changes the probability that their outcome is 1. Poor patients only take the treatment when prescribed; taking the treatment reduces their probability of a favorable outcome. Fig. 2 shows that the performance of Actual and Comply is much worse than Chosen and the hybrid algorithms.

Figure 3: Expected surplus rewards relative to random assignment for an adaptive trial over 12 patients over 1,000 simulations.

The second simulation concerns small $T$. A motivation for very small adaptive clinical trials is provided by rare diseases. The overall size of the patient population is by construction severely restricted in this setting. The priors for the mechanisms of action are also often poorly understood, so potential alternative treatments can have radically different probabilities of success. We simulate a 12-patient adaptive trial with binary outcomes, with two treatments and expected rewards drawn uniformly from the unit interval, and compliance uniformly at random. We sample 1,000 such simulations. While our bounds are vacuous in this setting, it is interesting that there is on average an improvement from taking the noncompliance information into account, see Fig. 3.

6 Conclusions

This paper introduced compliance information into the bandit setting. Compliance-information reflects the treatment actually taken by the patient, rather than the algorithm’s recommendation. In many cases (perhaps most cases in practice) compliance information can be used to accelerate learning. Unfortunately, however, naively incorporating compliance information leads to algorithms with linear regret as seen in example 2 and figure 2. We have therefore developed hybrid strategies that are the first algorithms that simultaneously incorporate compliance information while maintaining a worst-case guarantee.

Empirically, TB achieves a surplus of 8.9 extra survivals (that is, human lives) and HB achieves 9.2 surplus lives compared with 7.9 for the best classical algorithm. This suggests hybrid algorithms can make a significant difference to clinical outcomes.

References


  • Agrawal & Goyal (2012) Agrawal, S and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Computational Learning Theory (COLT), 2012.
  • Alon et al. (2015) Alon, Noga, Cesa-Bianchi, Nicolò, Dekel, Ofer, and Koren, Tomer. Online Learning with Feedback Graphs: Beyond Bandits. In Computational Learning Theory (COLT), 2015.
  • Auer (2002) Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397–422, 2002.
  • Auer et al. (2002) Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert. The non-stochastic multi-armed bandit problem. SIAM J. Computing, 32(1):48–77, 2002.
  • Bareinboim et al. (2015) Bareinboim, Elias, Forney, Andrew, and Pearl, Judea. Bandits with Unobserved Confounders: A Causal Approach. In Adv in Neural Information Processing Systems (NIPS), 2015.
  • Bubeck (2012) Bubeck, Sébastien. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Bubeck & Slivkins (2012) Bubeck, Sébastien and Slivkins, Aleksandrs. The best of both worlds: stochastic and adversarial bandits. In Computational Learning Theory (COLT), 2012.
  • Chang & Kaelbling (2005) Chang, Yu-Han and Kaelbling, Leslie Pack. Hedged learning: Regret-minimization with learning experts. In ICML, 2005.
  • Chapelle & Li (2011) Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In Adv in Neural Information Processing Systems (NIPS), 2011.
  • Ebrahim et al. (2014) Ebrahim, Shanil, Sohani, Zahra N, Montoya, Luis, Agarwal, Arnav, Thorlund, Kristian, Mills, Edward J, and Ioannidis, John PA. Reanalyses of randomized clinical trial data. JAMA, 312(10):1024–1032, 2014.
  • Graepel et al. (2010) Graepel, T, Quiñonero-Candela, J, Borchert, T, and Herbrich, R. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In ICML, 2010.
  • Group (1997) Group, International Stroke Trial Collaborative. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19 435 patients with acute ischaemic stroke. The Lancet, 349(9065):1569–1581, 1997.
  • Hugtenburg et al. (2013) Hugtenburg, Jacqueline G, Timmers, Lonneke, Elders, PJ, Vervloet, Marcia, and van Dijk, Liset. Definitions, variants, and causes of nonadherence with medication: a challenge for tailored interventions. Patient Prefer Adherence, 7:675–682, 2013.
  • Kaufmann et al. (2012) Kaufmann, E, Korda, N, and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT, 2012.
  • Koller & Friedman (2009) Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • Kuleshov & Precup (2014) Kuleshov, Volodymyr and Precup, Doina. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
  • Lai & Robbins (1985) Lai, T L and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
  • McMahan et al. (2013) McMahan, H. Brendan, Holt, Gary, Sculley, D., Young, Michael, Ebner, Dietmar, Grady, Julian, Nie, Lan, Phillips, Todd, Davydov, Eugene, Golovin, Daniel, Chikkerur, Sharat, Liu, Dan, Wattenberg, Martin, Hrafnkelsson, Arnar Mar, Boulos, Tom, and Kubica, Jeremy. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2013.
  • Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
  • Robbins (1952) Robbins, H. Some aspects of the sequential design of experiments. Bull. AMS, 58:527–535, 1952.
  • Sabaté (2003) Sabaté, Eduardo. Adherence to long-term therapies: evidence for action. World Health Organization, 2003.
  • Sandercock et al. (2011) Sandercock, Peter AG, Niewada, Maciej, Członkowska, Anna, et al. The international stroke trial database. Trials, 12(1):1–7, 2011.
  • Seldin & Slivkins (2014) Seldin, Yevgeny and Slivkins, Aleksandrs. One Practical Algorithm for Both Stochastic and Adversarial Bandits. In ICML, 2014.
  • Thompson (1933) Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Vapnik & Vashist (2009) Vapnik, Vladimir and Vashist, Akshay. A new learning paradigm: Learning using privileged information. Neural Netw, 22:544–557, 2009.
  • Vrijens et al. (2012) Vrijens, Bernard, De Geest, Sabina, Hughes, Dyfrig A, Kardas, Przemyslaw, Demonceau, Jenny, Ruppar, Todd, Dobbels, Fabienne, Fargher, Emily, Morrison, Valerie, Lewek, Pawel, et al. A new taxonomy for describing and defining adherence to medications. British Journal of Clinical Pharmacology, 73(5):691–705, 2012.

7 No-regret for HierarchicalBandit

This section shows that constructing a hierarchical bandit with EXP3 yields a no-regret algorithm. The result is straightforward; we include it for completeness. A similar result was shown in Chang & Kaelbling (2005).

First, we construct a hierarchical version of Hedge, Algorithm 3, which is applicable in the full-information setting. On the bottom-level are instantiations of Hedge. Instantiation , for , plays an -dimensional weight vector and receives -dimensional loss vector on round . We impose the assumption that all instantiations play -vectors for notational convenience. The top-level is another instantiation of Hedge, which plays a weighted combination of the bottom-level instantiations.

  Input: for ; for
  for  to  do
     Set where .
     Set where .
     Receive feedback
     Incur loss
  end for
Algorithm 3 Hierarchical Hedge (HHedge)
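Since the symbols of Algorithm 3 did not survive in this copy, the following sketch shows one way a two-level Hedge can be written. It assumes standard exponential-weights updates at both levels, losses in [0, 1], and a compound loss for each bottom-level instance equal to the inner product of its weights with its loss vector. The names `hedge_weights`, `hhedge`, `eta_top` and `eta_bottom` are hypothetical and not taken from the paper.

```python
import numpy as np

def hedge_weights(cum_loss, eta):
    """Exponential weights from cumulative losses (shifted for stability)."""
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return w / w.sum()

def hhedge(losses, eta_top, eta_bottom):
    """Two-level Hedge in the full-information setting.

    losses has shape (T, M, N): round, bottom-level instance, expert.
    Returns the total loss incurred by the hierarchy.
    """
    T, M, N = losses.shape
    bottom_cum = np.zeros((M, N))   # cumulative expert losses per instance
    top_cum = np.zeros(M)           # cumulative compound losses per instance
    total = 0.0
    for t in range(T):
        # Each bottom-level instance plays a weight vector over its experts.
        w = np.stack([hedge_weights(bottom_cum[k], eta_bottom) for k in range(M)])
        # The top level plays a weight vector over the instances.
        v = hedge_weights(top_cum, eta_top)
        # Compound loss of instance k: inner product of its weights and losses.
        compound = np.einsum("kn,kn->k", w, losses[t])
        total += float(v @ compound)
        # Full information: every instance sees its whole loss vector.
        bottom_cum += losses[t]
        top_cum += compound
    return total
```

With learning rates of the usual form, for example `eta_top = sqrt(8 ln(M) / T)` and `eta_bottom = sqrt(8 ln(N) / T)`, the textbook Hedge analysis applies separately at each level.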

We have the following lemma:

Lemma 1.

Introduce compound loss vector with . Then can be chosen in HHedge such that


Moreover, and can be chosen such that, for all ,


Proof. Apply regret bounds for Hedge twice. ∎

Lemma 1 says, firstly, that HHedge has bounded regret relative to the bottom-level instantiations and, secondly, that it has bounded regret relative to any of the experts on the bottom-level.
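To make the "apply Hedge twice" argument concrete, here is a sketch using the textbook Hedge bound. The notation is assumed rather than recovered from the lemma: $T$ rounds, $M$ bottom-level instances with $N$ experts each, losses in $[0,1]$, top-level weights $v_t$, bottom-level weights $w_t^k$, compound loss $\tilde\ell_t(k) = \langle w_t^k, \ell_t^k \rangle$, and learning rates $\sqrt{8\ln M / T}$ and $\sqrt{8 \ln N / T}$. The constants in the paper's lemma may differ.

```latex
\begin{align*}
\sum_{t=1}^{T} \langle v_t, \tilde\ell_t \rangle - \sum_{t=1}^{T} \tilde\ell_t(k)
  &\le \sqrt{\tfrac{T}{2} \ln M}
  && \text{(top level, for every instance } k\text{)} \\
\sum_{t=1}^{T} \tilde\ell_t(k) - \sum_{t=1}^{T} \ell_t^k(j)
  &\le \sqrt{\tfrac{T}{2} \ln N}
  && \text{(instance } k\text{, for every expert } j\text{)} \\
\sum_{t=1}^{T} \langle v_t, \tilde\ell_t \rangle - \sum_{t=1}^{T} \ell_t^k(j)
  &\le \sqrt{\tfrac{T}{2} \ln M} + \sqrt{\tfrac{T}{2} \ln N}
  && \text{(sum of the two lines above)}
\end{align*}
```

The first line is the regret relative to the bottom-level instances and the third is the regret relative to any single expert, matching the two statements of the lemma in form.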

Algorithm 4 modifies HHedge so that it is suitable for bandit feedback, yielding HEXP3. A corresponding no-regret bound follows immediately:

Lemma 2.

Define as in Lemma 1. Then can be chosen in HEXP3 such that


Moreover, and can be chosen such that


Proof. Follows from Lemma 1 and bounds for EXP3. ∎

  Input: for ; for
  for  to  do
     Set where .
     Set where .
     Draw and .
     Incur loss
  end for
Algorithm 4 Hierarchical EXP3 (HEXP3)
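A bandit-feedback version can be sketched along the same lines. The exact updates of Algorithm 4 are not recoverable from this copy, so the code below is one plausible reading: the top level draws an instance, that instance draws an arm, and only the resulting loss is observed. Both levels then use importance-weighted loss estimates in a loss-based EXP3 update without uniform mixing. The names `exp_weights`, `hexp3` and `loss_fn` are hypothetical.

```python
import numpy as np

def exp_weights(cum_est, eta):
    """Exponential weights from cumulative loss estimates (shifted for stability)."""
    w = np.exp(-eta * (cum_est - cum_est.min()))
    return w / w.sum()

def hexp3(loss_fn, T, M, N, eta_top, eta_bottom, rng):
    """Two-level EXP3 under bandit feedback.

    loss_fn(t, k, j) returns the loss in [0, 1] of arm j of instance k on round t.
    Returns the total loss incurred.
    """
    bottom_est = np.zeros((M, N))   # importance-weighted arm loss estimates
    top_est = np.zeros(M)           # importance-weighted instance loss estimates
    total = 0.0
    for t in range(T):
        v = exp_weights(top_est, eta_top)
        k = rng.choice(M, p=v)                    # top level draws an instance
        w = exp_weights(bottom_est[k], eta_bottom)
        j = rng.choice(N, p=w)                    # that instance draws an arm
        loss = loss_fn(t, k, j)
        total += loss
        # ASSUMPTION: unbiased estimates obtained by dividing by the draw probability.
        top_est[k] += loss / v[k]                 # estimates the compound loss of k
        bottom_est[k, j] += loss / (v[k] * w[j])  # estimates the loss of arm (k, j)
    return total
```

Dividing the bottom-level estimate by `v[k] * w[j]` accounts for the fact that instance `k` is only called with probability `v[k]`, which is the same importance-weighting idea used for the Thompson sampler base.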

Hierarchical Bandit with Thompson sampler base.

Algorithm 5 (BTS) shows how to modify the Thompson sampler for use as a bottom-level algorithm in HierarchicalBandit. The modification applies the importance weighting trick: replace in Thompson sampling with , where is the probability that the top-level bandit calls BTS on the given round.

  Input: Probability that BTS is called by top-bandit
  For each arm sample
  Play arm and observe reward
  Sample from Bernoulli with success probability
  If then else
Algorithm 5 Base Thompson Sampler (BTS)
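The quantities in Algorithm 5 are missing from this copy, so the sketch below is only one plausible reading of the importance-weighting trick, not a reconstruction of BTS. It assumes Beta posteriors with Beta(1, 1) priors, rewards in [0, 1] that are binarised by a Bernoulli draw, and pseudo-count increments of 1/p, where p is the probability that the top-level bandit called the sampler. The class and method names are hypothetical.

```python
import numpy as np

class BaseThompsonSampler:
    """Beta-Bernoulli Thompson sampler with importance-weighted updates (sketch)."""

    def __init__(self, n_arms, rng):
        self.succ = np.ones(n_arms)   # Beta(1, 1) prior pseudo-counts (assumption)
        self.fail = np.ones(n_arms)
        self.rng = rng

    def select(self):
        """Sample one value per arm from its posterior and play the largest."""
        return int(np.argmax(self.rng.beta(self.succ, self.fail)))

    def update(self, arm, reward, p_called):
        """Update the posterior of `arm` after observing `reward` in [0, 1].

        p_called is the probability that the top-level bandit called this sampler.
        """
        # Bernoulli trial with success probability equal to the observed reward.
        outcome = self.rng.uniform() < reward
        # ASSUMPTION: the update is weighted by 1 / p_called, so that rounds on
        # which the sampler is rarely called count for more.
        if outcome:
            self.succ[arm] += 1.0 / p_called
        else:
            self.fail[arm] += 1.0 / p_called
```

A top-level bandit would call `select()` when it picks this sampler and then pass the probability with which it did so to `update()`.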