, multi-armed bandit (MAB) problems have been considered within a Bayesian framework, in which the unknown parameters are modeled as random variables drawn from a known prior distribution. In this setting, the problem can be viewed as a Markov decision process (MDP) whose state is an information state: it describes the beliefs about the unknown parameters and evolves stochastically upon each play of an arm according to Bayes’ rule.
Under the objective of expected performance, where the expectation is taken with respect to the prior distribution over unknown parameters, the (Bayesian) optimal policy is characterized by Bellman equations that follow immediately from the MDP formulation. In the discounted infinite-horizon setting, the celebrated Gittins index [Gittins, 1979] determines an optimal policy, although its computation remains challenging. In the non-discounted finite-horizon setting, which we consider, the problem becomes more difficult [Berry and Fristedt, 1985], and except for some special cases, the Bellman equations are neither analytically nor numerically tractable, due to the curse of dimensionality. In this paper, we focus on the determination of the optimal policy (Opt) as an ideal goal that can be tackled by dynamic programming (DP).
We introduce the information relaxation framework [Brown et al., 2010], a recently developed technique that provides a systematic way of obtaining performance bounds on the optimal policy. It is common in multi-period stochastic DP problems to consider admissible policies that are required to make decisions based only on the previously revealed information. In our framework, we consider non-anticipativity as a constraint imposed on the policy space that can be relaxed, as in a usual Lagrangian relaxation. Under such a relaxation, the decision maker (DM) is allowed to access future information and is asked to solve an optimization problem so as to maximize her total reward, in the presence of penalties that punish violations of non-anticipativity. When the penalties satisfy a condition (dual feasibility, formally defined in §3), the expected value of the maximal reward adjusted by the penalties provides an upper bound on the expected performance of the (non-anticipating) optimal policy.
The idea of relaxing the non-anticipativity constraint has been studied over time in different contexts [Rockafellar and Wets, 1991, Davis and Karatzas, 1994, Rogers, 2002, Haugh and Kogan, 2004], and was later formulated as a formal framework by Brown et al. [2010], upon which our methodology is developed. This framework has been applied across a variety of applications including optimal stopping problems [Desai et al., 2012], linear-quadratic control [Haugh and Lim, 2012], dynamic portfolio execution [Haugh and Wang, 2014] and others (see Brown and Haugh [2017]).
Our contribution is to apply the information relaxation techniques to the finite-horizon stochastic MAB problem, exploiting the structure of the Bayesian learning process. In particular:
- we propose a series of information relaxations and penalties of increasing complexity;
- we systematically obtain upper bounds on the best achievable expected performance that trade off tightness against computational complexity;
- we obtain the associated (randomized) policies that generalize Thompson Sampling (TS) in the finite-horizon setting.
In our framework, which we call information relaxation sampling, each penalty function (and information relaxation) determines one policy and one performance bound given a particular problem instance specified by the time horizon and prior belief. As a base case for our algorithms, we have TS [Thompson, 1933] and the conventional regret benchmark that has been popularized for Bayesian regret analysis since Lai and Robbins [1985]. At the other extreme, the optimal policy Opt and its expected performance follow from the “ideal” penalty, which is intractable to specify. By picking increasingly strict information penalties, we can improve the policy and the associated bound between the two extremes of TS and Opt.
As an illustrative example, one of our algorithms, Irs.FH, provides a very simple modification of TS that takes into account the length of the time horizon. Recalling that TS makes a decision based on parameters sampled from the posterior distribution in each epoch, we focus on the fact that knowing the parameters is as informative as having an infinite number of future reward observations in terms of best arm identification. Irs.FH instead makes a decision based on future Bayesian estimates, updated with only a finite number of future reward realizations for each arm, where the rewards are randomly generated based on the posterior belief at the moment. When only one period remains (equivalently, at the last decision epoch), such a policy takes a myopically best action based only on the current estimates, which is indeed an optimal decision, whereas TS would still explore unnecessarily. While keeping the recursive structure of the sequential decision making process of TS, Irs.FH naturally performs less exploration than TS as the remaining time horizon diminishes.
Beyond this, we propose other algorithms that more explicitly quantify the benefit of exploration and more explicitly trade off exploration against exploitation, at the cost of additional computational complexity. As we increase complexity, we obtain policies with improved performance and, separately, tighter tractable upper bounds on the expected performance of any policy for a particular problem instance.
2 Notation and Preliminaries
Problem. We consider a classical stochastic MAB problem with independent arms and finite-horizon . At each decision epoch , the decision maker (DM) pulls an arm and earns a stochastic reward associated with arm . More formally, the reward from pull of arm is denoted by which is independently drawn from unknown distribution , where is the parameter associated with arm . We also have a prior distribution over unknown parameter , where , which we call belief, is a hyperparameter describing the prior distribution:
We define two mean reward functions and as functions of the unknown parameter and the prior belief, respectively. Throughout the paper, we assume that the rewards are absolutely integrable over the prior distribution: i.e., or more explicitly, for all .
For brevity, we collect the parameters and the beliefs across arms into two vectors. We additionally define an outcome as a combination of the parameters and all future reward realizations that incorporates all uncertainties in the environment that the DM encounters:
where represents the distribution of outcome.
Policy. Given an action sequence up to time , , define the number of pulls for each arm , and the corresponding reward realization . The natural filtration encodes the observations revealed up to time (inclusive).
Let be the action sequence taken by a policy . The (Bayesian) performance of a policy is defined as the expected total reward over the randomness associated with the outcome, i.e.,
A policy is called non-anticipating if each of its actions is measurable with respect to the natural filtration, and we consider the set of all non-anticipating policies, including randomized ones.
MDP formulation. We assume that we are equipped with a Bayesian update function so that after observing from an arm , the belief is updated from to according to Bayes’ rule. We will often use to describe the updating of the entire belief vector ; i.e., after observing from some arm , the belief vector is updated from to where only the component is updated in this step.
In a Bayesian framework, the MAB problem has a recursive structure. Given a time horizon and prior belief , suppose the DM had just earned by pulling an arm at time . The remaining problem for the DM is equivalent to a problem with time horizon and prior belief . We further know the (unconditional) distribution of what the DM will observe when pulling an arm , a doubly stochastic random variable, and we denote it by . Following from this Markovian structure, we obtain the Bellman equations for the MAB problem:
with for all . While the Bellman equation is intractable to analyze, it offers a characterization of the Bayesian optimal policy (Opt) and the best achievable performance : i.e., .
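The recursion above can be made concrete on a tiny instance. The following is a minimal sketch, assuming a Beta-Bernoulli bandit in which each arm's belief is a pair `(alpha, beta)` and Bayes' rule increments `alpha` on a reward of 1 and `beta` on a reward of 0. The function name `optimal_value` and the belief encoding are illustrative choices rather than notation from the paper, and the state space grows quickly, so this is only practical for very small horizons.

```python
from functools import lru_cache


def optimal_value(horizon, belief):
    """Exact Bayesian optimal value for a small Beta-Bernoulli bandit.

    `belief` is a sequence of (alpha, beta) pairs, one per arm (assumed encoding).
    """

    @lru_cache(maxsize=None)
    def value(t, y):
        # Terminal condition: no remaining pulls means no remaining reward.
        if t == 0:
            return 0.0
        best = float("-inf")
        for a, (alpha, beta) in enumerate(y):
            # Predictive probability of observing reward 1 from arm a.
            p = alpha / (alpha + beta)
            # Bayesian update of arm a only; other components are unchanged.
            y_success = y[:a] + ((alpha + 1, beta),) + y[a + 1:]
            y_failure = y[:a] + ((alpha, beta + 1),) + y[a + 1:]
            q = p * (1.0 + value(t - 1, y_success)) + (1.0 - p) * value(t - 1, y_failure)
            best = max(best, q)
        return best

    return value(horizon, tuple((alpha, beta) for alpha, beta in belief))
```

For example, `optimal_value(2, [(1, 1), (1, 1)])` evaluates the two-period problem with two uniformly distributed arms; its value exceeds the myopic total of 1 because the first observation informs the second pull.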
3 Information Relaxation Sampling
We propose a general framework, which we refer to as information relaxation sampling (IRS), that takes as an input a ‘penalty function’, and produces as outputs a policy and an associated performance bound.
Information relaxation penalties and inner problem. If we relax the nonanticipativity constraint imposed on policy space (i.e., is -measurable), the DM will be allowed to first observe all future outcomes in advance, and then pick an action (i.e., is -measurable). To compensate for this relaxation, we impose a penalty on the DM for violating the nonanticipativity constraint.
We introduce a penalty function to denote the penalty that the DM incurs at time , when taking an action sequence given a particular instance specified by , and . The clairvoyant DM can find the best action sequence that is optimal for a particular outcome in the presence of penalties , by solving the following (deterministic) optimization problem, referred to as the inner problem:
Definition 1 (Dual feasibility).
A penalty function is dual feasible if it is ex-ante zero-mean, i.e.,
To clarify the notion of conditional expectation, we remark that the penalty function is a stochastic function of the action sequence since the outcome is random. (As in usual probability theory, the conditional expectation represents the expected value of a random variable given the available information, and is itself a random variable.) The dual feasibility condition requires that a DM who makes decisions on the natural filtration will receive zero penalties in expectation.
IRS performance bound. Consider the expected maximal value of the inner problem when the outcome is randomly drawn from its prior distribution, i.e., the expected total payoff that a clairvoyant DM can achieve in the presence of penalties:
We can obtain this value numerically via simulation: draw outcomes independently from , solve the inner problem for each outcome separately, and then take the average of the maximal value over samples. The following theorem shows that is indeed a valid performance bound of the stochastic MAB problem.
Theorem 1 (Weak duality and strong duality).
If the penalty function is dual feasible, is an upper bound on the optimal value : for any and ,
There exists a dual feasible penalty function, referred to as the ideal penalty , such that
The ideal penalty function has the following functional form:
The ideal penalty yields the Bayesian optimal policy: i.e., .
| Algorithm | Inner problem |
|---|---|
| TS | Find a best arm given parameters. |
| Irs.FH | Find a best arm given finite observations. |
| Irs.V-Zero | Find an optimal allocation of pulls. |
| Irs.V-EMax | Find an optimal action sequence. |
| Opt | Solve Bellman equations. |
3.1 Thompson Sampling
With the penalty function , the inner problem ( ‣ 3) reduces to
Given an outcome , in the presence of penalties, a hindsight optimal action sequence is to keep pulling one arm , times in a row. The resulting performance bound is equivalent to the conventional regret benchmark, i.e.,
which measures how much the DM could have achieved if the parameters are revealed in advance. The corresponding IRS policy is equivalent to Thompson Sampling: when the sampled outcome is used instead, it pulls the arm where each , and this sampling-based decision making is repeated at each epoch, while updating the belief sequentially, as described in IRS-Outer in Algorithm 1.
Note that the optimal solution is determined by the parameters only – it does not need to consider the future rewards, and thus it takes computations to make a single decision in policy or to obtain a single sample of performance bound.
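As an illustration of the two outputs described in this subsection, the sketch below draws a single TS decision and estimates the conventional benchmark by simulation. It assumes a Beta-Bernoulli bandit, where the mean reward of an arm equals its parameter; the function names and the `(alpha, beta)` belief encoding are hypothetical, chosen only for this example.

```python
import random


def thompson_action(belief, rng):
    """One TS decision: sample a parameter per arm and pull the best sampled arm."""
    samples = [rng.betavariate(alpha, beta) for alpha, beta in belief]
    return max(range(len(belief)), key=lambda a: samples[a])


def ts_bound_estimate(horizon, belief, num_samples, rng):
    """Monte Carlo estimate of the conventional benchmark.

    For each sampled parameter vector, the hindsight-optimal plan pulls the arm
    with the largest mean reward for the whole horizon.
    """
    total = 0.0
    for _ in range(num_samples):
        thetas = [rng.betavariate(alpha, beta) for alpha, beta in belief]
        total += horizon * max(thetas)
    return total / num_samples
```

With two uniformly distributed arms, `ts_bound_estimate(10, [(1, 1), (1, 1)], 20000, random.Random(0))` should land near 10 × 2/3, since the expected maximum of two independent uniform draws is 2/3.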
3.2 IRS.FH
Let be the expected mean reward of an arm inferred from reward realizations . Given (12), the optimal solution to the inner problem is to pull an arm with the highest from beginning to the end:
Irs.FH is almost identical to TS except that is replaced with . Note that is less informative than from the DM’s perspective, since she will never be able to learn perfectly within a finite horizon. In terms of mean reward estimation, knowing the parameters is equivalent to having an infinite number of observations. The inner problem of TS asks the DM to “identify the best arm based on an infinite number of samples” whereas that of Irs.FH asks her to “identify the best arm based on a finite number of samples”, which takes into account the length of the time horizon explicitly.
Focusing on the randomness of and , we observe that the distribution of will be more concentrated around its mean . Following from Jensen’s inequality, we have for any problem instance, meaning that Irs.FH yields a performance bound tighter than the conventional benchmark. In terms of policy, the variance of (and ) also governs the degree of random exploration, deviating from the myopic decision of pulling an arm with the largest . As it approaches the end of the horizon, Irs.FH naturally explores less than TS.
Sampling at once. In order to obtain for a synthesized outcome , one may apply Bayes’ rule sequentially for each reward realization, which will take computations in total. It can be done in if the prior distribution is a conjugate prior of the reward distribution, in which case the belief can be updated in a batch using sufficient statistics of the observations. In the case of the Beta-Bernoulli MAB or the Gaussian MAB, for example, can be represented as a convex combination of the current estimate and the sample mean . We further know that the distribution of is for the Beta-Bernoulli case, and for the Gaussian case, where represents the noise variance. After sampling the parameter , we can sample directly from the known distribution, and use it to compute without sequentially updating the belief. In such cases, a single decision of can be made within operations, similar in complexity to TS.
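The batch update can be sketched for the Beta-Bernoulli case as follows. This is a hedged illustration and not the paper's implementation: it assumes that the number of synthesized future rewards per arm is the remaining horizon minus one (so that the last epoch reduces to the myopic rule), and the names `irs_fh_estimates` and `irs_fh_action` are hypothetical. The success count is drawn here by summing Bernoulli draws for portability; a library binomial sampler would avoid that loop.

```python
import random


def irs_fh_estimates(horizon, belief, rng):
    """Posterior-mean estimates after a batch of synthesized future rewards."""
    num_future = horizon - 1  # assumption: one fewer observation than remaining periods
    estimates = []
    for alpha, beta in belief:
        theta = rng.betavariate(alpha, beta)  # sampled parameter, as in TS
        # Sufficient statistic: number of successes among the synthesized rewards.
        successes = sum(rng.random() < theta for _ in range(num_future))
        # Conjugate batch update: convex combination of prior mean and sample mean.
        estimates.append((alpha + successes) / (alpha + beta + num_future))
    return estimates


def irs_fh_action(horizon, belief, rng):
    """Pull the arm with the largest synthesized estimate."""
    estimates = irs_fh_estimates(horizon, belief, rng)
    return max(range(len(estimates)), key=lambda a: estimates[a])
```

When `horizon` is 1 the estimates equal the current posterior means, so the decision is myopic; as `horizon` grows the estimates approach the sampled parameters and the decision approaches that of TS.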
3.3 IRS.V-Zero and IRS.V-EMax
IRS.V-Zero. Let be the expected mean reward of arm inferred from the first reward realizations:
Under this penalty, the DM earns from the pull of an arm : for example, if , the total payoff is .
Given an outcome , the total payoff is determined only by the total number of pulls of each arm, and not the sequence in which the arms had been pulled. Therefore, solving the inner problem ( ‣ 3) is equivalent to ‘‘finding the optimal allocation among remaining opportunities’’: omitting for brevity, the inner problem reduces to
where is the cumulative payoff from the first pulls of an arm , and is the set of all feasible allocations. Once the ’s are computed, this inner problem can be solved within operations by sequentially applying sup convolution times. The detailed implementation is provided in §A.1.
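The allocation problem can be solved by folding in one arm at a time with a max-plus (sup) convolution. The sketch below is one possible realization, assuming that `cumulative_payoffs[a][n]` already holds the cumulative payoff of the first `n` pulls of arm `a` for `n = 0, ..., horizon`; the function name and data layout are illustrative and are not taken from §A.1. Its cost is on the order of the number of arms times the square of the horizon.

```python
def optimal_allocation(cumulative_payoffs, horizon):
    """Maximize the sum of per-arm cumulative payoffs over allocations of `horizon` pulls."""
    neg_inf = float("-inf")
    # best[m]: best total payoff using exactly m pulls among the arms processed so far.
    best = [0.0] + [neg_inf] * horizon
    choices = []
    for payoff in cumulative_payoffs:
        new_best = [neg_inf] * (horizon + 1)
        choice = [0] * (horizon + 1)
        for m in range(horizon + 1):
            for n in range(m + 1):  # give n of the m pulls to the current arm
                candidate = best[m - n] + payoff[n]
                if candidate > new_best[m]:
                    new_best[m] = candidate
                    choice[m] = n
        best = new_best
        choices.append(choice)
    # Backtrack from the last arm to recover the optimal pull counts.
    allocation = [0] * len(cumulative_payoffs)
    remaining = horizon
    for a in range(len(cumulative_payoffs) - 1, -1, -1):
        allocation[a] = choices[a][remaining]
        remaining -= allocation[a]
    return best[horizon], allocation
```

Given the returned allocation, a policy can then pick the next arm from it, for instance the arm with the largest pull count.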
Given an optimal allocation , the policy needs to select which arm to pull next.
In principle, any arm that is included in the solution of the inner problem would be fine, but we suggest a selection rule in which the arm that needs the most pulls is chosen.
This rule guarantees that the policy behaves like TS when the remaining horizon is large, as formally stated in Proposition 1.
IRS.V-EMax. Irs.V-EMax includes an additional cost for using the information of future belief transitions. Compared to the ideal penalty (10), (14) is obtained by replacing the true value function with (16), as a tractable approximation. The use of leads to a simple expression for the conditional expectation with respect to the natural filtration. Since is distributed with , we have .
We observe that, given , the future belief is completely determined by how many times each arm had been pulled, irrespective of the sequence of the pulls. For example, consider two action sequences and . Even though the order of observations would differ, the agent will observe from arm 1 and from arm 2 in both cases that end up with the same belief .
Following from the observation above, the state (belief) space can be efficiently parameterized with the pull counts instead of action sequence .
Since the total number of possible future beliefs is , not , the inner problem can be solved by dynamic programming in operations, where is the cost of numerically calculating (see §A.2 for details).
IRS.Index policy. Finally, we propose Irs.Index, which does not strictly belong to the IRS framework, and does not produce a performance bound, but it exhibits strong empirical performance.
Roughly speaking, Irs.Index approximates the finite-horizon Gittins index [Kaufmann et al., 2012] using Irs.V-EMax. For each arm in isolation, it internally solves the single-armed bandit problem in which there is a competing outside option that yields a deterministic (known) reward. Applying Irs.V-EMax to a single-armed bandit problem, we can determine whether the stochastic arm is worth trying against a particular value of the outside option in . The threshold value that makes the arm barely worth trying can be obtained by binary search, repeatedly solving single-armed bandit problems while varying the value of the outside option. The policy plays an arm with the largest threshold value. See §A.3.
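The binary search over the outside option can be sketched as follows. Note that this example deliberately swaps in an exact dynamic program for the single-armed Beta-Bernoulli problem instead of Irs.V-EMax, so it illustrates only the threshold search and not the Irs.Index policy itself. It also assumes that switching to the outside option is permanent, which is without loss here because the outside reward is known and nothing is learned from it. The names `worth_trying` and `threshold_index` are hypothetical.

```python
from functools import lru_cache


def worth_trying(horizon, alpha, beta, outside_reward):
    """Is pulling the stochastic arm now at least as good as taking the outside option forever?"""

    @lru_cache(maxsize=None)
    def value(t, a, b):
        if t == 0:
            return 0.0
        # Either retire to the outside option for the remaining t periods, or pull.
        return max(outside_reward * t, pull_value(t, a, b))

    def pull_value(t, a, b):
        p = a / (a + b)  # predictive probability of reward 1
        return p * (1.0 + value(t - 1, a + 1, b)) + (1.0 - p) * value(t - 1, a, b + 1)

    return pull_value(horizon, alpha, beta) >= outside_reward * horizon


def threshold_index(horizon, alpha, beta, tol=1e-6):
    """Bisection for the outside reward at which the arm is barely worth trying."""
    lo, hi = 0.0, 1.0  # Bernoulli rewards lie in [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if worth_trying(horizon, alpha, beta, mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a single remaining period the threshold equals the posterior mean, and with more periods it rises above the mean, reflecting the value of what can still be learned.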
Remark 3 (Optimality at the end).
When , all , , , and take the optimal action that is pulling the myopically best arm .
Proposition 1 (Asymptotic behavior).
Assume almost surely for any two distinct arms . As , the distribution of Irs.FH’s action converges to that of Thompson Sampling:
Similarly, so does Irs.V-Zero (assuming the particular selection rule discussed in §3.3):
TS, and denote the action taken by policies , and , respectively, when the remaining time is and the prior belief is . These are random variables, since each of these policies uses a randomly sampled outcome on its own.
Remark 3 and Proposition 1 state that Irs.FH and Irs.V-Zero behave like TS during the initial decision epochs, gradually shift toward the myopic scheme, and end with the optimal decision; in contrast, TS will continue to explore throughout. The transition from exploration to exploitation under these IRS policies occurs smoothly, without relying on an auxiliary control parameter. While maintaining their recursive structure, IRS policies take into account the horizon , and naturally balance exploitation and exploration.
Theorem 2 (Monotonicity in performance bounds).
Irs.FH and Irs.V-Zero monotonically improve the performance bound:
Note that is the conventional regret benchmark.
We interpret that the tightness of performance bound reflects the degree of optimism that each algorithm would possess.
Recall that is the expected value of the best possible payoff when the agent is informed of some future outcomes in advance.
Weak duality implies that IRS algorithms are inherently optimistic, in the sense that the agent believes she can earn more than the optimal policy, in the hope that the additional information is true.
Even with the same outcome , depending on the penalties , the agent would have different anticipation about the future payoff.
As we incorporate the actual learning process, the agent’s anticipation becomes less optimistic and the performance bound gets tighter.
We define the ‘suboptimality gap’ of an IRS policy to be , and analyze it instead of the conventional (Bayesian) regret, . While its non-negativity is guaranteed from weak duality (Theorem 1), more desirably, the optimal policy yields a zero suboptimality gap (Theorem 1 & Remark 1). It coincides with the conventional regret measure only for TS.
Theorem 3 (Suboptimality gap).
For the Beta-Bernoulli MAB, for any and ,
We do not have a theoretical guarantee for monotonicity in the actual performance among IRS policies. Instead, Theorem 3 indirectly shows the improvements in suboptimality: although all the bounds have the same asymptotic order, the IRS policies improve the leading coefficient or the additional term. (Bubeck and Liu [2013] showed that the Bayesian regret of TS is bounded by when the rewards have bounded support in , which includes the Beta-Bernoulli MAB. Despite its lower asymptotic order, however, the actual number given in (23) is tighter than for small . As a side note, Lai [1987] showed that the Bayesian regret of the optimal policy has an asymptotic lower bound of .)
The proof of Theorem 3, provided in §C.4, relies on an interesting property of IRS policies that generalizes a property of TS. Russo and Van Roy [2014] observed that TS is randomized in such a way that, conditional on the past observations, the probability of choosing an action equals the probability that the action is chosen by someone who knows the parameters. Analogously, the IRS policy is randomized in such a way that, conditional on the past observations and the past actions, the probability of choosing an action matches the probability that the action is chosen by someone who knows the entire future but is penalized (see Proposition 7). Recall that the penalties are designed to penalize the gain from having additional future information. A better choice of penalty function prevents the policy from picking an action that is overly optimized to a randomly sampled future realization, which in turn improves the quality of the decision making.
5 Numerical Experiments
We visualize the effectiveness of IRS policies and performance bounds in the case of a Gaussian MAB with five arms () with different noise variances. More specifically, each arm has the unknown mean reward and yields the stochastic rewards where , , , and . Our experiment includes the state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian Upper Confidence Bound (Bayes-UCB, Kaufmann et al. [2012], with a quantile of ), Information Directed Sampling (IDS, Russo and Van Roy [2017]), and Optimistic Gittins Index (OGI, Farias and Gutin [2016], one-step look ahead approximation with a discount factor of ). The Irs.V-EMax algorithm is omitted here because of its time complexity. In §D, we provide the detailed simulation procedures and the results for the other settings including Irs.V-EMax.
Figure 1 shows the Bayesian regrets (solid lines, ) and the regret bounds (dashed lines, ) that are measured at the different values of . Note that lower regret curves are better, and higher bound curves are better. Also, the regret bound produced by TS is zero, since is the benchmark (16) used in this regret plot.
We first observe a clear improvement in both performances and bounds as we incorporate more complicated penalty functions from TS to Irs.V-Zero. As stated in Theorem 2, the monotonicity in the bound curves can be observed. The suboptimality gap (the gap between a regret curve and its corresponding bound curve) gets tightened, which is consistent with the implication of Theorem 3. As a trade-off, however, it requires a longer running time.
In this particular example, it is crucial to incorporate how much we can learn about each of the arms during the remaining time periods, which heavily depends on the noise level. (In order for the posterior distribution to be concentrated so as to have the standard deviation of , for example, one observation is enough for arm 1 whereas 100 and 10,000 observations are required for arm 3 and arm 5, respectively.) Comparing Irs.FH with TS, as a simple modification for the finite-horizon setting, the performance has improved significantly without degrading its computational efficiency. We also observe that IRS policies and IDS outperform the Bayes-UCB, OGI and TS algorithms, since they explicitly incorporate the value of exploration, namely how quickly the posterior distribution will be concentrated upon each observation.
This example also illustrates the significance of having a tighter performance bound. Benchmarking to , when , Irs.Index* policy achieves 94% of it. If the conventional benchmark is used instead, as in a usual regret analysis, we might have concluded that Irs.Index* only achieves 88% of that (looser) bound, which may suggest a larger margin of possible improvement.
- Berry and Fristedt  D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.
- Bradt et al.  R. N. Bradt, S. M. Johnson, and S. Karlin. On sequential designs for maximizing the sum of n observations. The Annals of Mathematical Statistics, pages 1060–1074, 1956.
- Brown and Haugh  David B. Brown and Martin B. Haugh. Information relaxation bounds for infinite horizon markov decision processes. Operations Research, 65(5):1355–1379, 2017.
- Brown et al.  David B. Brown, James E. Smith, and Peng Sun. Information relaxations and duality in stochastic dynamic programs. Operations Research, 58(4):785–801, 2010.
- Bubeck and Liu  Sebastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson Sampling. Proceedings of the 26th International Conference on Neural Information Processing Systems, 1(638-646), 2013.
- Davis and Karatzas  M. H. A. Davis and I. Karatzas. A Deterministic Approach to Optimal Stopping. Wiley, 1994.
- Desai et al.  Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi. Pathwise optimization for optimal stopping problems. Management Science, 58(12):2292–2308, 2012.
- Farias and Gutin  Vivek F. Farias and Eli Gutin. Optimistic Gittins indices. Proceedings of the 30th International Conference on Neural Information Processing Systems, (3161-3169), 2016.
- Gittins  J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41(2):148–177, 1979.
- Haugh and Kogan  Martin B. Haugh and Leonid Kogan. Pricing American options: A duality approach. Operations Research, 52(2):258–270, 2004.
- Haugh and Lim  Martin B. Haugh and Andrew E.B. Lim. Linear-quadratic control and information relaxations. Operations Research Letters, 40:521–528, 2012.
- Haugh and Wang  Martin B. Haugh and Chun Wang. Dynamic portfolio execution and information relaxations. SIAM Journal of Financial Math, 5:316–359, 2014.
- Kaufmann et al.  Emilie Kaufmann, Olivier Cappe, and Aurelien Garivier. On Bayesian upper confidence bounds for bandit problems. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 22:592–600, 2012.
- Lai  T. L. Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.
- Lai and Robbins  T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
- Marchal and Arbel  Olivier Marchal and Julyan Arbel. On the sub-Gaussianity of the Beta and Dirichlet distributions. 2017.
- Rockafellar and Wets  Rockafellar and Wets. Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research, 16(1):119–147, 1991.
- Rogers  L. C. G. Rogers. Monte carlo valuation of American options. Mathematical Finance, 12(3):271–286, 2002.
- Russo and Van Roy  Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- Russo and Van Roy  Daniel Russo and Benjamin Van Roy. Learning to optimize via Information-Directed Sampling. Operations Research, 66(1):230–252, 2017.
- Thompson  W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Appendix A Algorithms in Detail
A.1 Implementation of IRS.V-Zero
We provide pseudo-code for Irs.V-Zero, introduced in §3.3. The same logic can be directly used to compute the performance bound if the sampled outcome is replaced with the true outcome .
Function Irs.V-Zero()
A.2 Implementation of IRS.V-EMax
Given the penalty function defined in (14), we define the payoff of pulling an arm one more time after pulling each arm , times: with ,
where is a basis vector such that component is one and the others are zero. Note that we used the fact that . We also use the notation of to denote the belief as a function of pull counts , based on the observation that the belief is completely determined by how many times each arm was pulled, , no matter in what order they were pulled.
Consider a subproblem of ( ‣ 3) such that maximizes the total payoff given the number of pulls