On the Identification and Mitigation of Weaknesses in the Knowledge Gradient Policy for Multi-Armed Bandits

by James Edwards et al.

The Knowledge Gradient (KG) policy was originally proposed for online ranking and selection problems but has recently been adapted for use in online decision making in general and multi-armed bandit problems (MABs) in particular. We study its use in a class of exponential family MABs and identify weaknesses, including a propensity to take actions which are dominated with respect to both exploitation and exploration. We propose variants of KG which avoid such errors. These new policies include an index heuristic which deploys a KG approach to develop an approximation to the Gittins index. A numerical study shows this policy to perform well over a range of MABs including those for which index policies are not optimal. While KG does not make dominated actions when bandits are Gaussian, it fails to be index consistent and appears not to enjoy a performance advantage over competitor policies when arms are correlated to compensate for its greater computational demands.





1 Introduction

Bayes sequential decision problems (BSDPs) constitute a large class of optimisation problems in which decisions (i) are made in time sequence and (ii) impact the system of interest in ways which may not be known or may be only partially known. Moreover, it is possible to learn about unknown system features by taking actions and observing outcomes. This learning is modelled using a Bayesian framework. One important subdivision of BSDPs is between offline and online problems. In offline problems some decision is required at the end of a time horizon and the purpose of actions through the horizon is to accumulate information to support effective decision-making at its end. In online problems each action can bring an immediate payoff in addition to yielding information which may be useful for subsequent decisions. This paper is concerned with a particular class of online problems, although it should be noted that some of the solution methods have their origins in offline contexts.

The sequential nature of the problems coupled with imperfect system knowledge means that decisions cannot be evaluated alone. Effective decision-making needs to account for possible future actions and associated outcomes. While standard solution methods such as stochastic dynamic programming can in principle be used, in practice they are computationally impractical and heuristic approaches are generally required. One such approach is the knowledge gradient (KG) heuristic. Gupta and Miescke [8] originated KG for application to offline ranking and selection problems. After a period in which it appears to have been studied little, Frazier et al. [5] expanded on KG's theoretical properties. It was adapted for use in online decision-making by Ryzhov et al. [14], who tested it on multi-armed bandits (MABs) with Gaussian rewards. They found that it performed well against an index policy which utilised an analytical approximation to the Gittins index; see Gittins et al. [7]. Ryzhov et al. [12] have investigated the use of KG to solve MABs with exponentially distributed rewards, while Powell and Ryzhov [10] give versions for Bernoulli, Poisson and uniform rewards, though without testing performance. They propose the method as an approach to online learning problems quite generally, with particular emphasis on its ability to handle correlated arms. Initial empirical results were promising but only encompassed a limited range of models. This paper utilises an important sub-class of MABs to explore properties of the KG heuristic for online use. Our investigation reveals weaknesses in the KG approach. We propose, inter alia, modifications to mitigate these weaknesses.

In Section 2 we describe a class of exponential family MABs that we will focus on, together with the KG policy for them. Our main analytical results revealing weaknesses in KG are given in Section 3. Methods aimed at correcting these KG errors are discussed in Section 4 and are evaluated in a computational study which is reported in Section 5. In this study a range of proposals are assessed for Bernoulli and Exponential versions of our MAB models. Gaussian MABs have characteristics which give the operation of KG distinctive features. The issues for such models are discussed in Section 6, together with an associated computational study in Section 6.1. Section 7 identifies some key conclusions to be drawn.

2 A class of exponential family multi-armed bandits

2.1 Multi-Armed Bandit Problems for Exponential Families

We consider multi-armed bandits (MABs) with geometric discounting operating over a time horizon which may be finite or infinite. Rewards are drawn from exponential families with independent conjugate priors for the unknown parameters. More specifically the set up is as follows:

  1. At each decision time t an action is taken. Associated with each action a ∈ {1, …, k} is an (unknown) parameter, which we denote θ_a. Action a (pulling arm a) yields a reward Y drawn from the density (relative to some σ-finite measure on ℝ)

    f(y | θ_a) = exp{yθ_a − ψ(θ_a)}, y ∈ Ω,

    where Ω is the support of Y, ψ is a cumulant generating function and the parameter space Θ is such that ψ(θ) < ∞ for all θ ∈ Θ. Reward distributions are either discrete or absolutely continuous, with Ω a discrete set or a continuous interval. We shall take a Bayesian approach to learning about the unknown parameters θ_a.

  2. We assume independent conjugate priors for the unknown θ_a, with Lebesgue densities given by

    π(θ | Σ, n) ∝ exp{Σθ − nψ(θ)},

    where Σ and n are known hyper-parameters. This then defines a predictive density for the next reward which has mean μ = Σ/n. Bayesian updating following an observed reward y on arm a produces a posterior of the same conjugate form with hyper-parameters (Σ + y, n + 1). Thus at each time we can define an arm's informational state as the current value of its hyper-parameters (Σ_a, n_a), such that the posterior for θ_a given the observations to date is π(θ | Σ_a, n_a). The posteriors for the arms are independent, so the informational states of arms not pulled are unchanged.

  3. The total return when reward y_t is received at time t is given by ∑_t γ^t y_t, where the discount rate γ satisfies 0 < γ ≤ 1 when the horizon T is finite or 0 < γ < 1 when T = ∞. The objective is to design a policy, a rule for choosing actions, to maximise the Bayes' return, namely the total return averaged over both realisations of the system and prior information.

The current informational state for all arms, denoted S, summarises all the information in the observations up to the current time.
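The informational-state update is easiest to see in the Bernoulli member of the family, where the conjugate prior is a Beta distribution. The following minimal sketch uses the Beta(α, β) parametrisation (an assumption of this sketch, rather than the generic hyper-parameters above):

```python
def update(alpha, beta, reward):
    """Conjugate update for a Bernoulli arm with a Beta(alpha, beta) prior.

    A success (reward = 1) increments alpha, a failure increments beta;
    arms that are not pulled keep their informational state unchanged.
    """
    return alpha + reward, beta + (1 - reward)

def posterior_mean(alpha, beta):
    """Predictive success probability, i.e. the arm's current mean reward."""
    return alpha / (alpha + beta)
```

For instance, after one success on a uniform Beta(1, 1) prior the state becomes (2, 1), with posterior mean 2/3.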

When T = ∞ the Bayes' return is maximised by the Gittins Index (GI) policy, see Gittins et al. [7]. This operates by choosing, in the current state, any action a satisfying

a ∈ argmax_b ν_GI(Σ_b, n_b, γ),

where ν_GI is the Gittins index. We describe Gittins indices in Section 4 along with versions adapted for use in problems with T < ∞. Given the challenge of computing Gittins indices and the general intractability of deploying dynamic programming to solve online problems, the prime interest is in the development of heuristic policies which are easy to compute and which come close to being return maximising.

2.2 The Knowledge Gradient Heuristic

The Knowledge Gradient policy KG is a heuristic which bases action choices both on immediate returns and also on the changes in informational state which flow from a single observed reward. It is generally fast to compute. To understand how KG works for MABs suppose that the decision time is t and that the system is in information state S then. The current decision is taken to be the last opportunity to learn, and so from time t + 1 through to the end of the horizon whichever arm has the highest mean reward following the observed reward at t will be pulled at all subsequent times. With this informational constraint, the best arm to pull at t (and the action mandated by KG in state S) is given by

a^KG ∈ argmax_a E[ y + H · max_b μ_b^{t+1} | S, a ],

where y is the observed reward at t. The conditioning indicates that the reward depends upon the current state S and the choice of action a. The constant H is a suitable multiplier of the mean return of the best arm at t + 1 to achieve an accumulation of rewards for the remainder of the horizon. It is given by

H = γ + γ² + ⋯ + γ^{T−t−1} = γ(1 − γ^{T−t−1})/(1 − γ)

when T < ∞, and by H = γ/(1 − γ) when T = ∞.

KG can be characterised as the policy resulting from the application of a single policy improvement step to a policy which always pulls an arm with the highest prior mean return throughout. Note that H is increasing both in γ (T fixed) and in T (γ fixed). For any sequence of (γ, T) values approaching (1, ∞) in a manner which is co-ordinatewise increasing, the value of H diverges to infinity. This fact is utilised heavily in Section 3.
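As a concrete sketch of the multiplier described above, the following computes the sum of discounts applied to the post-learning phase; the exact time-indexing convention is an assumption of this sketch:

```python
def horizon_multiplier(gamma, steps_remaining=None):
    """Sum gamma + gamma^2 + ... over the remaining post-learning steps.

    steps_remaining=None denotes an infinite horizon, which requires
    gamma < 1. The multiplier grows with both the discount rate and the
    horizon, and diverges as gamma -> 1 with an infinite horizon.
    """
    if steps_remaining is None:
        if not gamma < 1:
            raise ValueError("infinite horizon requires gamma < 1")
        return gamma / (1 - gamma)
    return sum(gamma ** s for s in range(1, steps_remaining + 1))
```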

We now develop an equivalent characterisation of KG based on Ryzhov et al. [14] which will be more convenient for what follows. We firstly develop an expression for the change in the maximal mean reward available from any arm when action a is taken in state S. We write

ν_a = E[ max_b μ_b^{t+1} − max_b μ_b^t | S, a ],

where μ_b^t is the current mean return of arm b and μ_b^{t+1} is the mean return available from arm b at the next time, conditional on the observed reward resulting from action a. Please note that μ_b^{t+1} is a random variable. It is straightforward to show that

a^KG ∈ argmax_a { μ_a^t + H ν_a }.

Hence KG gives a score μ_a^t + H ν_a to each arm and chooses the arm of highest score. It is not an index policy because the score depends upon the informational state of arms other than the one being scored. That said, there are similarities between KG scores and Gittins indices. The Gittins index exceeds the mean return by an amount termed the uncertainty or learning bonus. This bonus can be seen as a measure of the value of exploration in choosing an arm. The quantity H ν_a in the KG score is an alternative estimate of the learning bonus. Assessing the accuracy of this estimate will give an indication of the strengths and weaknesses of the policy.
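For Bernoulli arms with Beta priors the score just described can be computed in closed form, since a pull has only two outcomes. The sketch below follows the mean-plus-bonus characterisation above; the Beta(α, β) parametrisation is an assumption of the sketch:

```python
def kg_score(arm, alphas, betas, H):
    """KG score of `arm`: its posterior mean plus H times the expected
    change in the best posterior mean caused by one hypothetical pull."""
    means = [a / (a + b) for a, b in zip(alphas, betas)]
    mu, best = means[arm], max(means)
    others = [m for i, m in enumerate(means) if i != arm]
    best_other = max(others) if others else float("-inf")
    a, b = alphas[arm], betas[arm]
    up = (a + 1) / (a + b + 1)   # posterior mean after a success
    down = a / (a + b + 1)       # posterior mean after a failure
    # Expectation over the two outcomes, weighted by the predictive mean.
    expected_new_best = mu * max(up, best_other) + (1 - mu) * max(down, best_other)
    return mu + H * (expected_new_best - best)
```

With two uniform Beta(1, 1) arms and H = 1 each arm scores 7/12: the mean 1/2 plus a learning bonus of 1/12.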

2.3 Dominated arms

In our discussion of the deficiencies of the KG policy in the next section we shall focus, among other things, on its propensity to pull arms which are inferior to another arm with respect to both exploitation and exploration, so that there is an alternative which is better both from an immediate-return perspective and from an informational perspective. We shall call such arms dominated. We begin our discussion with a result concerning properties of Gittins indices established by Yu [19].

Theorem 1

The Gittins index ν_GI(Σ, n, γ) is decreasing in n for any fixed Σ and is increasing in Σ for any fixed n.

We proceed to a simple corollary whose proof is omitted. The statement of the result requires the following definition.

Definition 2

An arm in state (Σ, n) dominates one in state (Σ′, n′) if and only if Σ/n ≥ Σ′/n′ and n ≤ n′.

Corollary 3

The GI policy never chooses dominated arms.

Hence pulling dominated arms can never be optimal for infinite horizon MABs. We shall refer to the pulling of a dominated arm as a dominated action in what follows. Exploration of the conditions under which KG chooses dominated actions is a route to an understanding of its deficiencies and prepares us to propose modifications to it which achieve improved performance. This is the subject matter of the following two sections.
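Definition 2 translates directly into a two-line check. In this sketch an arm's state is taken to be its (Σ, n) hyper-parameter pair with posterior mean Σ/n; that parametrisation is an assumption of the sketch:

```python
def dominates(state_a, state_b):
    """True if an arm in state_a dominates one in state_b (Definition 2):
    a posterior mean at least as high, and no more observations behind
    it, so at least as much still to learn."""
    sigma_a, n_a = state_a
    sigma_b, n_b = state_b
    return sigma_a / n_a >= sigma_b / n_b and n_a <= n_b
```

For example, (1, 2) dominates (1, 3): it has the higher mean (1/2 versus 1/3) and the smaller effective sample size.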

3 The KG policy and dominated actions

3.1 Conditions for the choice of dominated actions under KG

This section will elucidate sufficient conditions for the KG policy to choose dominated arms. A key issue here is that the quantity ν_a (and hence the KG learning bonus) can equal zero in cases where the true learning bonus related to a pull of arm a may be far from zero. Ryzhov et al. [14] stated that ν_a > 0. However, crucially, that paper only considered Gaussian bandits. The next lemma is fundamental to the succeeding arguments. It says that, for sufficiently high H, the KG policy will choose the arm with the largest ν_a.

Lemma 4

For any state in which the ν_a are not all equal, there exists H* such that H > H* implies that the KG policy chooses an arm in argmax_a ν_a.

  • Proof. The result is a trivial consequence of the definition of the KG policy in Section 2.2 together with the fact that H diverges to infinity in the manner described in Section 2.  

The next result gives conditions under which ν_a = 0.

Lemma 5

Let denote . If and the observation state space, , is bounded below with minimum value then


while if and is bounded above with maximum value then


In cases where with unbounded below, and where with is unbounded above, we have .

  • Proof. Note that




    If and so then, observing that


    we infer from Eq. (3.4) that ν_a = 0 if and only if


    with probability

    under the distribution of . Under our set up as described in Section 2, this condition is equivalent to the right hand side of Eq. (3.1). If then and so, suitably modifying the previous argument, we infer that if and only if


    with probability under the distribution of . Under our set up as described in Section 2, this condition is equivalent to the right hand side of Eq. (3.2). The unbounded cases follow directly from the formula for as the change in due to an observation has no finite limit in the direction(s) of unboundedness. This completes the proof.  

Informally, ν_a = 0 if no outcome from a pull on arm a will change which arm has maximal mean value. When arm a is greedy this depends on the lower tail of the distribution of the observed reward, while otherwise it depends on the upper tail. This asymmetry is important in what follows.

Theorem 6

If Ω is bounded below then there are states and choices of H for which the KG policy chooses dominated arms.

  • Proof. If we consider cases for which


    then it follows that , and all arms except and can be ignored in the discussion. We first suppose that unbounded above. It follows from Lemma 5 that . Since we can further choose such that


    From the above result we infer that . We now suppose that is bounded above, and hence that . Choose as follows: . It is trivial that these choices mean that arm dominates arm . We have that


    and hence that . Further we have that


    and hence that . In both cases discussed (i.e., Ω bounded and unbounded above) we conclude from Lemma 4 the existence of H for which the KG policy chooses a dominated arm, as required. This concludes the proof.  

Although the part of the above proof dealing with the case in which Ω is bounded above identifies a specific state in which KG will choose a dominated arm when H is large enough, it indicates how such cases may be identified more generally. These occur when the maximum positive change in the mean of the dominated arm is larger than the maximum negative change in the mean of the dominating arm. This can occur both when the rewards have distributions skewed to the right and also where the corresponding means are both small, meaning that a large observation can effect a greater positive change in the mean of the dominated arm than a small observation can effect a negative change in the mean of the dominating arm. A detailed example of this is given for the Bernoulli MAB in the next section. Similar reasoning suggests that the more general sufficient condition for KG to choose dominated arms, namely ν₂ > ν₁ with arm 2 dominated, will hold in cases with Ω unbounded above if the distribution of rewards has an upper tail considerably heavier than its lower tail.

3.2 Stay-on-the-winner rules

Berry and Fristedt [1] demonstrated that optimal policies for MABs with Bernoulli rewards and general discount sequences (including all cases considered here) have a stay-on-the-winner property. If an arm is optimal at some epoch and a pull of it yields a success (y = 1) then that arm continues to be optimal at the next epoch. Yu [19] extends this result to the exponential family considered here in the following way: an optimal arm continues to be optimal following an observed reward which is sufficiently large. The next result is an immediate consequence.

Lemma 7

Suppose that Ω is bounded above. If an arm is optimal at some epoch and a pull of it yields a maximal reward (y = max Ω) then that arm is optimal at the next epoch.

The following result states that the KG policy does not share the stay-on-the-winner character of optimal policies as described in the preceding lemma. In its statement we use y e_a for the vector whose a-th component is y with zeroes elsewhere.

Proposition 8

If is bounded above and below choices of and for which , .

  • Proof. For the reasons outlined in the proof of Theorem 6 we may assume without loss of generality that . As in that proof we consider the state with . We suppose that a pull of arm yields an observed reward equal to . This takes the process state to say. We use the dashed notation for quantities associated with this new state. Observe that and hence that . We note that


    which implies via Lemma 5 that . We also have that


    which implies via Lemma 5 that . The existence of for which while now follows from Lemma 4.  

3.3 Examples

We will now give more details of how the KG policy chooses dominated actions in the context of two important members of the exponential family.

3.3.1 Exponential rewards

Suppose that rewards are Exponential with a conjugate Gamma prior, which yields the unconditional (predictive) density for a reward given by


with mean Σ/n. Let arm 1 dominate arm 2. For this case, and from Lemma 5, the unboundedness of Ω above means that ν₂ > 0, while ν₁ = 0 if and only if


Hence from Lemma 4 we can assert the existence of H for which KG chooses the dominated arm whenever Eq. (3.15) holds.

Ryzhov and Powell [13] discuss the online KG policy for Exponential rewards in detail. They observe that ν_a can be zero but do not appear to recognise that this can yield dominated actions under the policy. Later work, [4], showed that this can lead to the offline KG policy never choosing the greedy arm, an extreme case of dominated errors. However, with the online KG policy the greedy arm will eventually be selected as ν for the other arm tends to zero. These papers note that, in some states, the value of ν for the greedy arm, while not zero, penalises its choice relative to other arms in a similar way to the bias which yields dominated actions. Policies which mitigate such bias are given in the next section and are evaluated in the computational study following.

3.3.2 Bernoulli rewards

Suppose that rewards are Bernoulli with conjugate Beta(α, β) priors, so that Ω = {0, 1} and μ = α/(α + β). Since Ω is bounded above and below, dominated actions under KG will certainly occur. Demonstrating this in terms of the asymmetric updating of Beta priors can be helpful in understanding the more general case of bounded rewards. Use δ⁺ and δ⁻ for the magnitudes of the upward and downward change in μ under success and failure respectively. We have

δ⁺ = β/[(α + β)(α + β + 1)],   δ⁻ = α/[(α + β)(α + β + 1)],

from which we conclude that δ⁺ > δ⁻ if and only if μ < 1/2. Prompted by this analysis, consider a case in which, for a suitable choice of Beta states, arm 1 dominates arm 2. Further, the fact that


implies via Lemma 5 that ν₁ = 0. From Lemma 5 we also conclude that


The strict inequality in the right hand side of Eq. (3.19) will hold for appropriately chosen states. Thus, for suitably chosen states, the KG policy will take dominated actions in a wide range of cases. The immediate claim is that, under this condition, the KG policy will take the dominated action for H large enough. We now observe that in practice dominated actions can be taken for quite modest H. Returning to the characterisation of the KG policy, we infer that in the above example the dominated action will be chosen whenever


Such errors will often be costly. Note also that the analysis above suggests that dominated actions occur more often when arms have small mean rewards. This is investigated further in the computational study following.
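The effect described above can be reproduced numerically. The self-contained sketch below scores two Beta-Bernoulli arms with the KG rule (posterior mean plus H times the expected rise in the best mean) and reports whether the dominated arm wins; the specific priors used in the example are illustrative, not taken from the paper:

```python
def kg_prefers_dominated(a1, b1, a2, b2, H):
    """True if KG ranks arm 2 above arm 1 even though arm 2 is dominated
    (mean no higher, at least as many observations). Beta/Bernoulli sketch."""
    def score(a, b, other_mean):
        mu = a / (a + b)
        up, down = (a + 1) / (a + b + 1), a / (a + b + 1)
        new_best = mu * max(up, other_mean) + (1 - mu) * max(down, other_mean)
        return mu + H * (new_best - max(mu, other_mean))
    mu1, mu2 = a1 / (a1 + b1), a2 / (a2 + b2)
    dominated = mu2 <= mu1 and (a2 + b2) >= (a1 + b1)
    return dominated and score(a2, b2, mu1) > score(a1, b1, mu2)
```

With arm 1 in state Beta(1, 5) and arm 2 in the dominated state Beta(1, 6), a failure on arm 1 leaves its mean no lower than arm 2's, so arm 1's KG bonus is zero, while a success on arm 2 would overtake arm 1; for H > 2 the dominated arm is chosen.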

3.3.3 Gaussian rewards

Here rewards are Normal and Ω = ℝ. Hence Ω is unbounded and, if an arm is chosen, the distribution of the change in its posterior mean is symmetric about zero. In this case the KG policy does not choose dominated actions, and the value of ν is always greater for the arm with smaller prior precision. Despite this fact, KG can still take poor decisions by underestimating the learning bonus for the greedy arm. The Gaussian MAB is discussed further in Section 6.

4 Policies which modify KG to avoid taking dominated actions

In this section we present new policies which are designed to mitigate the defects of the KG approach elucidated in the previous section. The performance of these is assessed, along with some earlier proposals, in the numerical study of the next section.

Non-dominated KG (NKG): This proposal modifies standard KG by prohibiting dominated actions. It achieves this by always choosing a non-dominated arm with the highest KG score. Any greedy arm is non-dominated and hence a non-dominated arm always exists.

Positive KG (PKG): The KG score for a greedy arm reflects a negative change in its posterior mean while those for non-greedy arms reflect positive changes. The PKG policy modifies KG such that for all arms it is positive moves which are registered. It achieves this by modifying the KG score for each greedy arm: in the computation of the score, the negative change in the arm's posterior mean is replaced by the corresponding positive change. This adjustment transforms the KG scores to adjusted values. The change maintains the key distance used in the KG calculation but ensures that the quantity registered is non-negative. For non-greedy arms the scores are unchanged.
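A sketch of the NKG selection rule follows: compute the KG scores however you wish, then restrict the argmax to arms not strictly dominated by any other. Representing an arm's informational state by its (mean, number of observations) pair is an assumption of the sketch:

```python
def nkg_choice(scores, means, ns):
    """Non-dominated KG: highest KG score among the non-dominated arms.

    Arm j strictly dominates arm i if its mean is at least as high and its
    observation count no larger, with at least one of the two strict.
    Some arm is always undominated, so a candidate always exists.
    """
    k = len(scores)

    def dominated(i):
        return any(
            j != i
            and means[j] >= means[i] and ns[j] <= ns[i]
            and (means[j] > means[i] or ns[j] < ns[i])
            for j in range(k)
        )

    candidates = [i for i in range(k) if not dominated(i)]
    return max(candidates, key=lambda i: scores[i])
```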

Theorem 9

Policy PKG never chooses a strictly dominated arm.

  • Proof. Suppose that arm 2 is strictly dominated by arm 1. In the argument following we shall suppose that these are the only two arms. This is without loss of generality as the addition of any other arm does not affect the PKG score of arm 2 and can only increase the PKG score of the non-dominated arm 1. In order to establish the result it is enough to establish that the PKG score of arm 1 is at least that of arm 2. From the definitions of the quantities concerned we have that




    However, under the conditions satisfied by the arms it is easy to show that, for all possible observed rewards,


    and hence that


    But from Shaked and Shanthikumar [15] we infer that exceeds in the convex ordering. Since is convex in it follows that


    and the result follows.  

KG-index (KGI): Before we describe this proposal we note that Whittle [18] produced a proposal for index policies for a class of decision problems called restless bandits, which generalise MABs by permitting movement in the states of non-active arms. Whittle's indices generalise those of Gittins in that they are equal to the latter for MABs with T = ∞. Whittle's proposal is relevant for MABs with finite horizon since time-to-go now needs to be incorporated into state information, which in turn induces a form of restlessness. In what follows we shall refer to Gittins/Whittle indices as those which emerge from this body of work for all versions of the MABs under consideration here.

The KGI policy chooses between arms on the basis of an index which approximates the Gittins/Whittle index appropriate for the problem by using the KG approach. We consider a single arm with conjugate prior, finite horizon and discount factor γ. To develop the Gittins/Whittle index for such a bandit we suppose that a charge λ is levied for bandit activation. We then consider the sequential decision problem which chooses between the active and passive actions for the bandit at each epoch over the horizon, with a view to maximising expected rewards net of charges for bandit activation. The value function satisfies Bellman's equations as follows:


It is easy to show that this is a stopping problem in the sense that, once it is optimal to choose the passive action at some epoch, it will be optimal to choose the passive action at all subsequent epochs. Hence, Eq. (4.6) may be replaced by the following:


We further observe that the value of activation decreases as λ increases, with the remaining quantities kept fixed. This yields the notion of indexability in index theory. We now define the Gittins/Whittle index as


This index is typically challenging to compute.

We obtain an index approximation based on the KG approach as follows: in the stopping problem with value function above, we impose the constraint that whatever decision is made at the second epoch is final, namely that it will apply for the remainder of the horizon. This in turn yields an approximating value function which satisfies the equation


and which is also decreasing in λ for any fixed state and horizon. When T = ∞ the constant multiplying the expectation on the right hand side of Eq. (4) becomes γ/(1 − γ). The indices we use for the KGI policy when T < ∞ are given by


where the remaining quantities are as previously defined and the final argument is the time to the end of the horizon. Note that the second equation in Eq. (4.10) follows from the evident fact that the index is guaranteed to be no smaller than the mean.

Trivially both value functions are increasing in the horizon and consequently so are both of the corresponding indices. When γ < 1 the limits of these indices as the horizon tends to infinity are guaranteed to exist and be finite, the former being the Gittins index. We use these limiting indices for the KGI policy when T = ∞.

Theorem 10

The KGI policy does not choose dominated arms.

We establish this theorem via a series of intermediate results.

Lemma 11

The value function and its KG approximation are both increasing in Σ for any fixed values of the remaining arguments.

  • Proof. Since the quantity is increasing in and is stochastically increasing in it follows easily that the expectation on the right hand side of Eq. (4) is increasing in . The result then follows straightforwardly.  

We now proceed to consider the equivalent bandit, but with prior where .

Lemma 12

is decreasing in for any fixed values of and for any .

  • Proof. First note that for the quantity regarded as a function of is decreasing when . For and hence is trivially decreasing in . Note also that the quantity regarded as a function of is increasing and convex. We also observe from Yu [19] that is decreasing in the convex order as increases. It then follows that, for and for


    from which the result trivially follows via a suitable form of Eq. (4).  

The following is an immediate consequence of the preceding lemma and Eq. (4.10).

Corollary 13

is decreasing in for any fixed values of .

It now follows trivially from the properties of the index established above that if one arm dominates another then its KGI index is at least as large for any horizon. It must also follow that the same relation holds for the limiting indices when T = ∞. This completes the proof of the above theorem.

Closed form expressions for the indices are not usually available, but they exist in simple cases. For the Bernoulli rewards case of Subsection 3.3.2 we have that


In general, numerical methods such as bisection are required to obtain the indices. If the state space is finite it is recommended that all index values be calculated in advance.
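When no closed form exists, the index is the root of an equation that is monotone in the activation charge, so bisection suffices. The sketch below is generic: `net_value` stands in for the (approximate) value of activating the arm net of charges, assumed positive at `lo` and negative at `hi`:

```python
def index_by_bisection(net_value, lo, hi, tol=1e-9):
    """Locate the charge at which activating the arm stops being worthwhile.

    `net_value(lam)` must be decreasing in lam, positive at lo and
    negative at hi; the returned charge is the (approximate) index.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_value(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```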

Fast calculation is an essential feature of KG, but it should be noted that this speed is not universal and that index methods can be more tractable in general. An example of this is the MAB with multiple plays (Whittle [17]). Here m > 1 arms are chosen at each time rather than just one. Rewards are received from each of the m chosen arms as normal. For an index policy the computation required is unchanged: the index must be calculated for each arm as normal, with arms chosen in order of descending indices. The computation for KG is considerably larger than when m = 1. The KG score must be calculated for each possible combination of m arms. For each of these we must find the set of arms with largest expected reward conditional on each possible outcome. Even in the simplest case, with Bernoulli rewards, there are 2^m possible outcomes. For continuous rewards the problem becomes much more difficult even for small m. It is clear that KG is impractical for this problem.

An existing method with similarities to KG is the Expected Improvement algorithm of [9]. This is an offline method, of which KG can be thought of as a more detailed alternative; the two were compared in the offline setting in [6]. The Expected Improvement algorithm is simpler than KG and always assigns positive value to the greedy arm unless its true value is known exactly. Its arm values are “optimistic” in a manner analogous to the PKG policy described above, and it is reasonable to conjecture that it shares that rule's avoidance of dominated actions (see Theorem 9). As an offline method it is not tested here, but it may be possible to develop an online version.

5 Computational Study

This section will present the results of experimental studies for the Bernoulli and Exponential MAB. A further study will be made for the Gaussian MAB in Section 6.1.

5.1 Methodology

All experiments use the standard MAB setup as described in Section 2.1. For Bernoulli rewards, policy returns are calculated using value iteration. All other experiments use simulation for this purpose. These are truth-from-prior experiments, i.e. the priors assigned to each arm are assumed to be accurate.

For each simulation run, a true parameter value is drawn randomly from the prior for each arm. A bandit problem is run for each policy to be tested, using the same set of parameter values for each policy. Performance is measured by totalling, for each policy, the discounted true expected reward of the arms chosen. For each problem 160,000 simulation runs were made.
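The truth-from-prior loop itself is short. The sketch below is for the Bernoulli MAB with uniform Beta(1, 1) priors; the policy interface and the prior choice are assumptions of the sketch:

```python
import random

def run_truth_from_prior(policy, n_arms, horizon, gamma, seed=0):
    """One truth-from-prior run: draw each arm's true success probability
    from its prior, then total the discounted *true* mean reward of every
    arm the policy chooses (not the noisy observed reward).

    `policy(alphas, betas)` returns the index of the arm to pull.
    """
    rng = random.Random(seed)
    truth = [rng.random() for _ in range(n_arms)]    # true p ~ U(0,1) = Beta(1,1)
    alphas, betas = [1.0] * n_arms, [1.0] * n_arms
    total = 0.0
    for t in range(horizon):
        arm = policy(alphas, betas)
        total += (gamma ** t) * truth[arm]           # score with the true mean
        reward = 1 if rng.random() < truth[arm] else 0
        alphas[arm] += reward                        # conjugate Beta update
        betas[arm] += 1 - reward
    return total
```

A greedy policy, for instance, is `lambda a, b: max(range(len(a)), key=lambda i: a[i] / (a[i] + b[i]))`.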

In addition to the policies outlined in Section 4, also tested are the Greedy policy (described in Section 2.1) and a policy based on analytical approximations to the GI (Brezzi and Lai [2]), referred to here as GIBL. These approximations are based on the GI for a Wiener process and therefore assume Normally distributed rewards. However, they can be appropriate for other reward distributions by Central Limit Theorem arguments, and the authors found that the approximation was reasonable for Bernoulli rewards, at least when the relevant parameters are not too small. Other papers have refined these approximations but, although they may be more accurate asymptotically, they showed inferior performance for the discount rates tested here, and so only results for GIBL are given.

5.2 Bernoulli MAB

The first experiment tests performance over a range of discount rates γ for a set of arms each with uniform priors. The mean percentage of lost reward for five policies is given in Figure 1. The results for the greedy policy are not plotted as they are clearly worse than those of the other policies (percentage loss going from 0.64 to 1.77 over the range shown).

Figure 1: Mean percentage of lost reward compared to the GI policy for five policies for the Bernoulli MAB with uniform priors. The left and right plots show two settings of the problem parameters.

The overall behaviour of the policies is similar in the two settings shown. KGI is strong for lower γ but is weaker for higher γ, while GIBL is strongest as γ increases. The sharp change in performance for GIBL occurs because the GIBL index is a piecewise function. Both NKG and PKG improve on KG at the lower discount rates but the three KG variants are almost identical at the higher ones. The difference between KG and NKG gives the cost to the KG policy of dominated actions. These make up a large proportion of the lost reward for KG at lower γ but, as γ increases, over-greedy errors due to the myopic nature of the KG policy become more significant, and these are not corrected by NKG. These errors are also the cause of the deteriorating performance of KGI at higher γ. With uniform priors the states given in Section 3.3, where KG was shown to take dominated actions, occur infrequently. This is because, for larger numbers of arms, there will more often be an arm with a positive KG learning bonus, and such arms are chosen in preference to dominated arms.

However, states where ν = 0 for all arms will occur more frequently when arms have lower mean rewards. Here dominated actions can be expected to be more common. We can test this by using priors concentrated on smaller success probabilities. Figure 2 shows the effect of varying the relevant prior parameter for all arms.

Figure 2: Percentage lost reward relative to the GI policy for six policies for the Bernoulli MAB. The left and right plots show two settings of the problem parameters.

The discount rate used is quite high, a value at which the greedy policy can be expected to perform poorly since exploration is important. However, as the prior parameter increases the performance of KG deteriorates to the extent that it is outperformed by the greedy policy. This effect is still seen with larger numbers of arms. The superior performance of NKG shows that much of the loss of KG is due to dominated actions. Policy PKG improves further on NKG, suggesting that KG makes further errors due to asymmetric updating even when it does not choose dominated arms. A clearer example of this is given in Section 5.3. Both policies based on GI approximations perform well and are robust to changes in the prior. KGI is the stronger of the two, as GIBL is weaker when the rewards are less Normally distributed.

The same pattern can also be seen when arms have low success probabilities but the prior variance is high. Figure 3 gives results for such a setting. The range shown focuses on lower prior means, corresponding to the low-success-probability region of the previous experiment. The higher prior variance makes arms with higher success probabilities more likely than in the previous experiment but, as the prior mean is reduced, the performance of KG can still be seen to deteriorate markedly. The other policies tested do not show this problem.

Figure 3: Percentage lost reward relative to the GI policy for six policies for the Bernoulli MAB; the left and right plots correspond to the two settings considered.

Arms with low success probabilities are common in many applications, for example in direct mail marketing or web-based advertising, where the reward probability is the chance that a user responds to an advert. The unmodified KG policy is unlikely to be effective in such cases.

The equivalent plots under the other priors tested do not show any significant change in behaviour compared to uniform priors.

Another policy that is popular in the bandit literature and which has good theoretical properties is Thompson Sampling (e.g. Russo and Van Roy [11]). Results for this method are not given in detail here as its performance on these problems is far inferior to that of the other policies tested. For example, on the problems displayed in Figure 1 its losses were considerably larger for both numbers of arms, with its best relative performance coming at the highest discount rates. It is a stochastic policy and so makes many decisions that are suboptimal (including dominated errors). Its strength is that it explores well in the limit over time, eventually finding the true best arm. However, with discounted rewards or when the horizon is finite it gives up too much short-term reward to be competitive unless γ is close to 1 or the finite horizon is long. In addition, note that it will spend longer exploring as the number of arms increases, since it seeks to explore every alternative. Performance on the other problems in this paper was similar and so is not reported.
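For reference, Thompson Sampling for the Bernoulli MAB is simple to state: each round, sample a success probability from every arm's Beta posterior and play the arm with the largest sample. A minimal sketch (function name and parameters illustrative):

```python
import random

def thompson_bernoulli(p, horizon=100, seed=1):
    """One run of Thompson Sampling on a Bernoulli MAB with Beta(1,1) priors.
    Returns the number of pulls of each arm."""
    rng = random.Random(seed)
    k = len(p)
    alpha = [1.0] * k
    beta = [1.0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # sample one plausible success probability per arm from its posterior
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        r = 1 if rng.random() < p[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls
```

The posterior sampling step is what makes the policy stochastic: even a clearly inferior arm retains some probability of producing the largest sample, which drives the persistent exploration (and hence the short-term losses under discounting) described above.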

5.3 Exponential MAB

This section gives the results of simulations for policies run on the MAB with Exponentially distributed rewards as outlined in Section 3.3. These are shown in Figure 4. Here the lost reward is given relative to the KG policy (the negative values indicate that the other policies outperformed KG). Different priors give a similar pattern of results.

Figure 4: Mean percentage of lost reward relative to the KG policy for three policies for the Exponential MAB with Gamma(2,3) priors; the left and right plots correspond to the two settings considered.

The results show a clear improvement over the KG policy by the PKG and NKG policies. Notably, PKG earns greater reward than NKG, indicating that the bias that causes dominated errors also causes suboptimal choices when arms are not dominated. Policy KGI gives the best performance, although it is similar to that of PKG.
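The belief updating underlying this experiment is the standard Gamma–Exponential conjugate pair. Assuming a shape–rate parameterisation (so that the Gamma(2,3) prior of Figure 4 has shape 2 and rate 3) and a Gamma(a, b) prior on the unknown rate of an Exponential reward distribution, observing a reward x yields the posterior Gamma(a + 1, b + x), and the posterior expected reward is b / (a - 1) for a > 1:

```python
def gamma_update(a, b, x):
    """Conjugate update for Exponential rewards with a Gamma(a, b)
    (shape, rate) prior on the rate: posterior is Gamma(a + 1, b + x)."""
    return a + 1.0, b + x

def predictive_mean(a, b):
    """Posterior expected reward E[1/lambda] = b / (a - 1), defined for a > 1."""
    return b / (a - 1.0)
```

For example, the Gamma(2,3) prior implies a prior expected reward of 3; after observing a reward of 1 the posterior is Gamma(3,4) with expected reward 2. The asymmetry of this update (a always grows by exactly 1 while b grows by the observed reward) is the source of the biased, asymmetric updating that PKG corrects for.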

6 The Gaussian MAB

Here we consider the Gaussian case, with normally distributed rewards and Gaussian priors on the unknown arm means. In the brief discussion in Section 3 we noted that KG does not take dominated actions in this case. While Ryzhov et al. [14] give computational results which demonstrate that KG outperforms a range of heuristic policies, the policy still makes errors. In this section we describe how errors in the estimation of arms' learning bonuses constitute a further source of suboptimal actions. We also elucidate easily computed heuristics which outperform KG. A major advantage of KG cited by Ryzhov et al. [14] is its ability to incorporate correlated beliefs between arms. We will later show, in Section 6.1.3, that it is unclear whether KG enjoys a performance advantage in such cases.
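For independent Gaussian arms with known observation variance, the KG value of an arm has a well-known closed form: nu = s * f(z), where f(z) = z*Phi(z) + phi(z), s is the standard deviation of the one-step change in the posterior mean, and z = -|mu_a - max_{b != a} mu_b| / s. A sketch of this standard computation (the exact variant used in the paper's experiments may differ, e.g. in how the factor is scaled by the horizon or discount rate):

```python
import math

def kg_factor(mu, sigma, sigma_eps, a):
    """Knowledge-gradient value of sampling arm a in an independent
    Gaussian MAB: nu = s * f(z), f(z) = z*Phi(z) + phi(z)."""
    # std. dev. of the predictive change in arm a's posterior mean
    s = sigma[a] ** 2 / math.sqrt(sigma[a] ** 2 + sigma_eps ** 2)
    best_other = max(m for i, m in enumerate(mu) if i != a)
    z = -abs(mu[a] - best_other) / s
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # density
    return s * (z * Phi + phi)
```

Because f(z) is strictly positive and decreasing in |z|, every arm receives a positive learning bonus that shrinks as its mean falls behind the best alternative, which is why KG avoids dominated actions in the Gaussian case; the errors discussed below arise instead from how accurately this bonus approximates the true value of learning.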

We shall restrict the discussion to a particular subclass of these problems and will develop a notion of relative learning bonus (RLB) which will apply across a wide range of policies for such problems. We shall consider stationary policies whose action in state