A General Approach to Multi-Armed Bandits Under Risk Criteria

by Asaf Cassel, et al.
Columbia University

Different risk-related criteria have received recent interest in learning problems, where typically each case is treated in a customized manner. In this paper we provide a more systematic approach to analyzing such risk criteria within a stochastic multi-armed bandit (MAB) formulation. We identify a set of general conditions that yield a simple characterization of the oracle rule (which serves as the regret benchmark), and facilitate the design of upper confidence bound (UCB) learning policies. The conditions are derived from problem primitives, primarily focusing on the relation between the arm reward distributions and the (risk criteria) performance metric. Among other things, the work highlights some (possibly non-intuitive) subtleties that differentiate various criteria in conjunction with statistical properties of the arms. Our main findings are illustrated on several widely used objectives such as conditional value-at-risk, mean-variance, Sharpe-ratio, and more.




1 Introduction

Background and motivation. Consider a sequential decision making problem where at each stage one of independent alternatives is to be selected. When choosing alternative at stage (also referred to as time ), the decision maker receives a reward that is distributed according to some unknown distribution , and is independent of . (To ease notation, we avoid indexing with , and leave that implicit; the information will be encoded in the policy that governs said choices, which will be detailed in what follows.) At time , the decision maker has accumulated a vector of rewards . In our setting, performance criteria are defined by a function that maps the reward vector to a real-valued number. As is a random quantity, we consider the accepted notion of expected performance, i.e., . An oracle, with full knowledge of the arms’ distributions, will make a sequence of selections based on this information so as to maximize the expected performance criterion. This serves as a benchmark for any other policy which does not have such information a priori, and hence needs to learn it on the fly. The gap between the former (performance of the oracle) and the latter represents the usual notion of regret in the learning problem.

The most widely used performance criterion in the literature concerns the long run average reward, which involves the empirical mean, . In this case, the oracle rule, which maximizes the expected value of the above, just samples from the distribution with the highest mean value, namely, it selects . Learning algorithms for such problems date back to Robbins’ paper [Robbins(1952)] and were extensively studied subsequent to that. In particular, the seminal work of [Lai and Robbins(1985)] establishes that the regret in this problem cannot be made smaller than logarithmic order in the horizon, and that there exist learning algorithms that achieve this regret by maximizing a confidence bound modification of the empirical mean (this class of policies has since come to be known as UCB, or upper confidence bound policies). Some strands of literature that have emerged from this include [Auer et al.(2002)Auer, Cesa-Bianchi, and Fischer] (non-asymptotic analysis of UCB-policies), [Maillard et al.(2011)Maillard, Munos, and Stoltz] (empirical confidence bounds or KL-UCB), [Agrawal and Goyal(2012)] (Thompson sampling based algorithms), and various works which consider an adversarial formulation (see, e.g., [Auer et al.(1995)Auer, Cesa-Bianchi, Freund, and Schapire]).

In this paper we are interested in studying the above problem for more general path dependent criteria that are of interest beyond the average. Many of these objectives bear an interpretation as “risk criteria” insofar as they focus on a finer probabilistic nature of the primitive distributions than the mean, as viewed through the lens of the observations collected from the arms, and typically relate to the spread or tail behavior. Examples include: the so-called Sharpe ratio, which is the ratio between the mean and standard deviation; value-at-risk (VaR), which focuses on the percentile of the distribution (with small); or a close counterpart that integrates (averages) the values out in the tail beyond that point, known as the expected shortfall (or conditional value at risk; CVaR). The last example is of further interest as it belongs to the class of coherent risk measures, which has various attractive properties from the risk theory perspective; a discussion thereof is beyond the scope of this paper (see [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath] for further details). In our problem setting, the above criteria are applied via the function to the empirical observations, and then the decision maker seeks, as before, to optimize the expected value. A typical example where such criteria may be of interest is that of medical trials. More specifically, suppose several new drugs are sequentially tested on individuals who share similar characteristics. If we consider average performance, we may conclude that the best choice is a drug with a non-negligible fatality rate but a high success rate. If we wish to control the fatality rate then using CVaR, for example, may be appropriate.

While some of the above mentioned criteria have been examined in the decision making and learning literature (see references and more precise discussion below), the analysis tends to be driven by very case-specific properties of the criterion in question. Unlike the standard mean criterion, various subtleties may arise. To see this, consider the CVaR example, which we will reference repeatedly to communicate salient features of our analysis. In terms of , it is given by , where is the order statistic of . Now, for horizon and , an oracle will at first select the arm that maximizes the mean value, just as it would under the traditional mean-criterion. But in step 2 it would seek the arm that maximizes the expected value of the minimum of the first two observations, namely, . It is easy to see that this results in a rule that need not select the same arm throughout the horizon of the problem. This presents a further obstacle in characterizing a learning policy that seeks to minimize regret by mimicking the oracle rule. However, as our analysis will flesh out, the oracle policy can be approximated asymptotically by a simple policy, that is, one that does select a single arm throughout the horizon. This simplification can be leveraged to address the learning problem, which becomes much more tractable. It is therefore of interest to understand in which instances this simplified structure exists. This is one of the main thrusts of the paper.
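The two-step example can be checked by direct enumeration. The sketch below uses two hypothetical discrete arms (the names `risky` and `safe` and their distributions are illustrative assumptions, not taken from the paper) and computes the expected minimum of two independent observations for every fixed pair of arms. It only compares non-adaptive pairs, so it illustrates the change of objective between steps rather than the full adaptive oracle.

```python
from itertools import product

# Hypothetical arms as {reward: probability} maps (illustrative only).
ARMS = {
    "risky": {0.0: 0.5, 1.0: 0.5},  # mean 0.5, large spread
    "safe": {0.45: 1.0},            # mean 0.45, no spread
}


def mean(dist):
    """Expected reward of a discrete distribution."""
    return sum(x * p for x, p in dist.items())


def expected_min(dist_a, dist_b):
    """E[min(X, Y)] for independent X ~ dist_a and Y ~ dist_b."""
    return sum(
        min(x, y) * px * py
        for (x, px), (y, py) in product(dist_a.items(), dist_b.items())
    )


# Horizon 1: the criterion reduces to the mean of a single observation.
best_by_mean = max(ARMS, key=lambda name: mean(ARMS[name]))

# Horizon 2: the criterion is the expected minimum of the two observations.
pair_values = {
    (a, b): expected_min(ARMS[a], ARMS[b]) for a, b in product(ARMS, repeat=2)
}
best_pair = max(pair_values, key=pair_values.get)

print(best_by_mean, best_pair, pair_values)
```

In this toy instance the arm that is best under the mean is not part of the best pair under the expected minimum, which is the kind of horizon dependence the paragraph above describes.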

Main contributions of this paper. In this paper we consider a general approach to the analysis of performance criteria of the type outlined above. We identify the aforementioned examples, as well as others, as part of a wider class that we term Empirical Distribution Performance Measures (EDPM). In particular, let be the empirical distribution of the vector , i.e., is the fraction of rewards less than or equal to real valued . An EDPM evaluates performance by means of a function , which maps to , i.e., . Alternatively, may also serve to evaluate the distributions of the random variables (). These evaluations may be aggregated to form a different type of performance criteria that we term proxy regret and consider as an intermediate learning goal. The construct plays a central role in the framework we develop, and while it may seem somewhat vague at this stage, it will be illustrated shortly by revisiting the CVaR example.

Our main results provide easy-to-verify explicit conditions which characterize the asymptotic behavior of the oracle rule, and culminate in a UCB-type learning algorithm with regret. To make matters more concrete, we summarize our results for CVaR. First, its form as an EDPM is essentially given by (see (7) for exact definition). Our framework will establish that for arm distributions with integrable lower tails, choosing a single arm (simple policies) is asymptotically optimal. This, together with the above characterization of , yields the desired simplification in identifying its oracle rule, and subsequently this is leveraged and incorporated in a UCB-type learning algorithm that emulates the oracle policy. More concretely, if is the typical upper confidence bound, then a version of requires upper confidence bounds for and all , where the power of is a criterion dependent parameter. The implication for learning is that more exploration is required in the initial problem stages. Assuming sub-Gaussian arm distributions, the algorithm is shown to have regret, and under a further mild assumption yields the familiar regret which, in the traditional MAB objective, corresponds to the case where the means of the arms are “well separated.” Our framework allows for this analysis, and the results just mentioned for CVaR, to be easily derived for any admissible EDPM.

Previous works on bandits that concern path-dependent and risk criteria. To the best of our knowledge, the only works that consider path dependent criteria of the form presented here are [Sani et al.(2012)Sani, Lazaric, and Munos], which consider the mean-variance criterion and present the MV-UCB and MV-DSEE algorithms, and [Vakili and Zhao(2016)], which complete the regret analysis of said algorithms. Other works consider criteria which are more in line with our intermediate learning goal (proxy regret), and lead to a different notion of regret. [Galichet et al.(2013)Galichet, Sebag, and Teytaud] present the MaRaB algorithm, which uses CVaR in its implementation; however, they analyze the average reward performance, and do so under the assumptions that , and the CVaR and average optimal arms coincide. [Maillard(2013)] presents and analyzes the RA-UCB algorithm, which considers the measure of entropic risk with a parameter . [Zimin et al.(2014)Zimin, Ibsen-Jensen, and Chatterjee] consider criteria based on the mean and variance of distributions, and present and analyze the algorithm. We note that these criteria correspond to a much narrower class of problems than the ones considered here.

Paper structure. For brevity, all proofs are deferred to the Appendix. In Section 2 we formulate the problem setting, oracle, and regret. In Section 3 we characterize the asymptotic behavior of the oracle rule. In Sections 4 and 5 we provide the main results, and in Section 6 we demonstrate them on well-known risk criteria. We also include some negative examples, which show what can happen when the proposed conditions are not satisfied, indicating to some extent the necessity of these conditions for the unifying theme of our proposed framework.

2 Problem Formulation

Model and admissible policies.

Consider a standard MAB with , the set of arms. Arm is associated with a sequence () of random variables with distribution , the set of all distributions on the real line. When pulling arm for the time, the decision maker receives reward , which is independent of the remaining arms, i.e., the variables (for all ) are mutually independent.

We define the set of admissible policies (strategies) of the decision maker in the following way. Let be the number of times arm was pulled up to time . Let be a random variable over a probability space which is independent of the rewards. An admissible policy is a random process recursively defined by


We denote the set of admissible policies by , and note that admissible policies are non-anticipating, i.e., they depend only on the past history of actions and observations, and allow for randomized strategies via their dependence on . Formally, let be the filtration defined by , then is measurable.

Empirical Distribution Performance Measures (EDPM).

The classical bandit optimization criterion centers on the empirical mean i.e. . We generalize this by considering criteria that are based on the empirical distribution. Formally, the empirical distribution of a real number sequence is obtained through the mapping , given by,


where is the indicator function of the interval defined on the extended real line, i.e.

Of particular interest to this work are the empirical distributions of the reward sequence under policy , and of arm . We denote these respectively by,


The decision maker possesses a function , which measures the “quality” of a distribution. The resulting criterion is called EDPM, and the decision maker aims to maximize . In Section 6 we provide further examples (including the classic empirical mean), but for now, we continue to consider the CVaR ([Rockafellar and Uryasev(2000)]) as our canonical example. This criterion measures the average reward below percentile level , and for distribution is given by


where , is the reward at percentile level , which is also known as Value at Risk. For further motivation regarding EDPMs and their relation to permutation invariant criteria we refer the reader to Appendix A.
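As a rough illustration of the empirical distribution and of a tail-average criterion of the kind described above, the following sketch computes both from a list of rewards. The exact form of (7) is not reproduced in this text, so the convention used here is an assumption: the criterion is taken to be the average of the empirical quantile function over the lowest `alpha` fraction of probability mass, with a fractional observation weighted proportionally. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def empirical_cdf(rewards):
    """Return F_n, where F_n(y) is the fraction of rewards <= y."""
    ordered = sorted(rewards)
    n = len(ordered)
    return lambda y: bisect_right(ordered, y) / n


def empirical_cvar(rewards, alpha):
    """Average of the lowest alpha-fraction of the rewards, 0 < alpha <= 1.

    Assumed convention: (1 / alpha) times the integral of the empirical
    quantile function over (0, alpha].
    """
    if not rewards or not 0.0 < alpha <= 1.0:
        raise ValueError("need a non-empty sample and 0 < alpha <= 1")
    ordered = sorted(rewards)
    mass = alpha * len(ordered)   # number of observations covered by the tail
    whole = math.floor(mass)      # order statistics that are fully included
    total = sum(ordered[:whole])
    if whole < len(ordered):
        # The next order statistic enters with the leftover fractional weight.
        total += (mass - whole) * ordered[whole]
    return total / mass
```

Under this convention, a sample of size one returns the single observation, and a sample of size two with `alpha = 0.5` returns the minimum, which matches the two-step discussion in the introduction.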

When defining an objective, it was sufficient to consider as a mapping from (a set) to . Moving forward, our analysis relies on properties such as continuity and differentiability, which require that we consider as a mapping between Banach spaces. To that end is a subset of an infinite dimensional vector space for which norm equivalence does not hold. This hints at the importance of using the “correct” norm for each . As a result, our analysis is done with respect to a general norm and its matching Banach space , which will always be a subspace of , the space of all bounded functions , (i.e., ). We therefore consider EDPMs as mappings .

Oracle and regret.

For given horizon , the oracle policy is one that achieves optimal performance given full knowledge of the arm distributions (). Formally, it satisfies


Similarly to the classic bandit setting, we define a notion of regret that compares the performance of policy to that of . The expected regret of policy at time is given by,


where we note that this definition is normalized with respect to the horizon , thus transforming familiar regret bounds such as into . The goal of this work is to provide a generic analysis of this regret, similar to that of the classic bandit setting. However, unlike the latter, the oracle policy here need not choose a single arm. Since the typical learning algorithms are structured to emulate the oracle rule, we need to first understand the structure of the oracle policy before we can analyze .

3 The Infinite Horizon Oracle

Infinite horizon oracle.

The oracle problem in (8) does not admit a tractable solution, in the absence of further structural assumptions. In this section we consider a relaxation of the oracle problem which examines asymptotic behavior. We provide conditions under which this behavior is “simple” thus suggesting it as a proxy for the finite time performance. More concretely, let be the worst case asymptotic performance of policy , then the infinite horizon oracle satisfies


Note that is well defined as the limit inferior of a sequence of random variables, however we require that its expectation exist for (10) to be well defined.

Simple policies.

In the traditional Multi-Armed Bandit problem, the oracle policy, which selects a single arm throughout the horizon, is clearly simple. In this work, we consider “simple” to mean stationary policies whose actions are mutually independent and independent of the observed rewards. Such policies may differ from the single arm policy in that they allow for a specific type of randomization. The following defines this notion formally.

Definition (Simple policy). A policy is simple if are measurable random variables. Such policies satisfy

A deterministic simple policy further satisfies that for some . Denote the set of all simple policies by , and the dimensional simplex by,

Note that there is a one to one correspondence between and , we thus associate each with the simple policy defined by, for .
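The correspondence between points of the simplex and simple policies can be sketched as follows: a simple policy draws each action independently from a fixed probability vector, without looking at rewards. The function name `simple_policy` and its parameters are hypothetical, chosen for this illustration only.

```python
import random


def simple_policy(p, horizon, seed=None):
    """Draw `horizon` arm indices i.i.d. from the probability vector p.

    The draws ignore all observed rewards, which is what makes the policy
    "simple" in the sense described above.
    """
    if min(p) < 0 or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must lie in the probability simplex")
    rng = random.Random(seed)
    return rng.choices(range(len(p)), weights=p, k=horizon)


# A vertex of the simplex gives a deterministic simple policy (single arm).
print(simple_policy([0.0, 1.0, 0.0], horizon=5))
# An interior point gives a randomized simple policy.
print(simple_policy([0.2, 0.5, 0.3], horizon=5, seed=1))
```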


It may seem intuitive that EDPMs always admit a simple infinite horizon oracle policy. However, in Appendix E.2.3 we give counterexamples, which arise from the “bad behavior” that is still allowed by this objective. The following condition is sufficient for EDPMs to be “well behaved.” We denote the convex combinations of the arms’ reward distributions by


and use this in the following definition.

Definition (Stable EDPM). We say that is a stable EDPM if:

  1. is continuous on ;

  2.   almost surely .

Note that stability depends not only on but also on the given distributions . Meaning, a given could possibly be stable for some distributions and not stable for others. Moreover, the choice of a norm is important in order to get sharp conditions on the viable reward distributions. For example, consider the supremum norm given by . By the Glivenko-Cantelli theorem ([Van der Vaart(2000)]), it satisfies requirement 2 for any given distributions , . However, in most cases, requirement 1 holds only if the distributions have bounded support.
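The Glivenko-Cantelli statement for the supremum norm can be observed numerically. The sketch below draws Uniform(0, 1) samples (a choice made purely for illustration, since the true CDF is then the identity on [0, 1]) and computes the supremum distance between the empirical CDF and the true CDF for two sample sizes. The seed and sample sizes are arbitrary assumptions.

```python
import random


def sup_distance_to_uniform(sample):
    """sup_y |F_n(y) - y| for a sample in [0, 1] against the Uniform(0, 1) CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    # F_n jumps from i/n to (i+1)/n at the i-th order statistic (0-based),
    # so the supremum is attained at, or just before, one of the jumps.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(ordered))


rng = random.Random(0)  # fixed seed so the run is reproducible
distances = {
    n: sup_distance_to_uniform([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
print(distances)
```

This only illustrates requirement 2 for the supremum norm; it says nothing about requirement 1, which depends on the criterion and on the support of the arm distributions.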

Theorem (Stable EDPM admits a simple oracle policy). A stable EDPM has a simple infinite horizon oracle policy . Further assuming that is quasiconvex, a deterministic simple exists, i.e., choosing a single arm throughout the horizon is asymptotically optimal.

The main proof idea of Theorem 3 is as follows. We use requirement 2 of stability to show that with probability one and regardless of policy, any subsequence of the empirical distribution has a further subsequence that converges to an element of . Applying the continuity of , we conclude that asymptotic empirical performance is (almost surely) equivalent to that of elements in . However, similar claims show that such performance can also be achieved by a simple policy.

In Section 6 we will see that stability is not a necessary condition for simple oracle policies. However, this definition has the advantage of being relatively easy to verify. This is due in part to the fact that continuity is preserved by composition, which facilitates the analysis and creation of complicated rewards by representing them as a composition of simpler ones.

Example (CVaR).

We can now summarize how the presented framework applies to CVaR. First and foremost, we need to define the “correct” norm. We notice that CVaR, as defined in (7), integrates only the lower tail of the distribution. This leads us to define the following norm


Verifying requirement 1 (continuity) of stability is a simple technical task. As for requirement 2, using the Glivenko-Cantelli theorem ([Van der Vaart(2000)]), and the Strong Law of Large Numbers ([Simonnet(1996)]), it holds when (). Further noticing that CVaR is convex over , we may use Theorem 3 to conclude that the single arm solution is asymptotically optimal.

4 Proxy Regret


Having gained some understanding of the infinite horizon oracle, we consider an intermediate learning goal that uses the infinite horizon performance as a benchmark. We refer to this goal as the proxy regret and dedicate this section to the design and analysis of a learning algorithm that seeks to minimize it. Formally, let


be the proxy distribution, where we recall that is the distribution associated with arm . The proxy regret is then defined as,


where is defined in (11), and .

Section 3 presented stability as a means of understanding the asymptotic behavior of performance. As we now seek a finite time analysis (of the proxy regret), it stands to reason to employ a stronger notion of stability which quantifies the rate of convergence. For that purpose, denote the set of empirical distributions created from sequences of any length by

Definition (Strongly stable EDPM). We say that is a strongly stable EDPM if:

  1. There exist such that the restriction of to admits as a local modulus of continuity for all , i.e.,

  2. There exists a constant (which depends only on ), such that for all ,

One can easily verify that a strongly stable EDPM is indeed a stable EDPM. The first requirement quantifies the continuity of , and the second gives a rate of concentration for , thus refining Definition 3.

Proxy regret decomposition.

In the traditional bandit setting, which considers the average reward, the analysis of the regret is well understood. The same analysis extends to any linear EDPM, i.e., when is linear. This follows straightforwardly as such rewards can be formulated as the usual average criterion with augmented arm distributions. Linearity facilitates the regret analysis by providing a decomposition of contributions from each sub-optimal arm. Let be the performance gap for arm . Defining , we have that the regret of a linear EDPM is given by, . Departing from the pleasant realm of linearity, we seek a similar decomposition of the proxy regret.

Lemma (Proxy regret decomposition). Suppose that is a quasiconvex and strongly stable EDPM, then defining we have that

We note that while quasiconvexity is somewhat restrictive, it is also a necessity for the purpose of this decomposition. Foregoing this assumption leads to a seemingly similar yet inherently different decomposition which must be analyzed separately.

Learning algorithm.

We present , a natural adaptation of (see [Bubeck and Cesa-Bianchi(2012)]) to a strongly stable EDPM. Let,

where are the parameters of Definition 4. The policy is given by,


where for , it samples each arm once as initialization.
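The structure of such an index policy can be sketched generically: after pulling each arm once, always pull the arm whose criterion value on its own empirical sample, plus an exploration bonus, is largest. The paper's bonus depends on the parameters of Definition 4, which are not reproduced in this text, so the default bonus below is the classical UCB1 term and is purely an assumption of this sketch. The function name `ucb_for_edpm` and its signature are hypothetical.

```python
import math
from statistics import fmean


def ucb_for_edpm(arms, horizon, U=fmean, bonus=None):
    """Index policy: pull the arm maximizing U(own sample) + bonus.

    arms:  list of zero-argument callables, each returning one reward.
    U:     maps a list of observed rewards to a real number (the criterion).
    bonus: exploration term as a function of (t, pulls). The default is the
           classical UCB1 term, a stand-in and NOT the paper's bonus.
    Returns the number of pulls of each arm after `horizon` rounds.
    """
    if bonus is None:
        bonus = lambda t, pulls: math.sqrt(2.0 * math.log(t) / pulls)
    samples = [[arm()] for arm in arms]  # initialization: one pull per arm
    for t in range(len(arms) + 1, horizon + 1):
        index = [U(s) + bonus(t, len(s)) for s in samples]
        chosen = max(range(len(arms)), key=index.__getitem__)
        samples[chosen].append(arms[chosen]())
    return [len(s) for s in samples]


# Two hypothetical arms with constant rewards, so the run is deterministic.
pulls = ucb_for_edpm([lambda: 0.2, lambda: 0.8], horizon=500)
print(pulls)
```

Any criterion that maps a list of rewards to a number can be passed as `U`; a different `bonus` can be supplied to experiment with heavier early exploration.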

Theorem ( proxy regret). Suppose that for all , and is a quasiconvex and strongly stable EDPM. Then for defined in Lemma 4 and we have that

Example (CVaR).

Unlike stability, strong stability of CVaR requires control of both the upper and lower tails of the distribution. This leads us to consider the norm

Similarly to stability, verifying requirement 1 becomes mostly technical, and results in , and a value of which depends on an upper bound of the and values of the arm distributions. Requirement 2 then follows by Dvoretzky-Kiefer-Wolfowitz ([Massart(1990)]), and a sub-Gaussian assumption on the arm distributions (, ). We conclude that, for sub-Gaussian arms, incurs proxy regret.
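For reference, the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant bounds the supremum distance between the empirical and true CDFs by P(sup |F_n - F| > eps) <= 2 exp(-2 n eps^2). The helpers below (hypothetical names) evaluate this bound and invert it to obtain a confidence radius. They cover only the supremum-norm part of a norm; any additional tail terms would need the sub-Gaussian assumption mentioned above.

```python
import math


def dkw_tail_bound(n, eps):
    """Upper bound 2 * exp(-2 n eps^2) on P(sup_y |F_n(y) - F(y)| > eps)."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2))


def dkw_radius(n, delta):
    """The eps at which the bound above equals delta, for 0 < delta < 1."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


print(dkw_radius(200, 0.05), dkw_radius(800, 0.05))
```

The radius shrinks at rate one over the square root of the sample size, so quadrupling the number of observations halves it.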

5 Regret Bounds

The proxy regret is a relatively easy metric to analyze but leaves open the question of its relationship to the regret. In this section we answer this question, thus obtaining bounds on the regret.

Theorem (Strongly stable EDPM regret bound). Suppose that is a quasiconvex and strongly stable EDPM. Then for all and any satisfying we have that

where are constants that depend on the parameters of Definition 4, and on (). The proof of Theorem 5 may be split into two stages. Put . In the first stage we show that

and in the second, we bound in a way that does not depend on policy . For this purpose, we use the modulus of continuity to get


and then bound this term using the concentration assumption of strong stability. The main issue with Theorem 5 is the existence of instances where it may fail to capture the correct behavior of the regret. When it occurs, the source of this failure lies in the first inequality of (16). As an extreme example consider a linear . The left hand side of the inequality is clearly zero, while the right hand side behaves as even when . In order to fix this, we require an additional structural assumption that we term smoothness. Let be the space of bounded linear functionals on . For any , we define a residual function


where is the outcome of applying the linear operator to . Let,

be the empirical distributions that are no farther than from an element of . Note that may be a set but may also be a single element.

Definition (Smooth EDPM). We say that is a smooth EDPM, if there exist , , and , such that for any


Smoothness essentially amounts to validating the mean value theorem for . The importance of this added condition (smoothness) is summarized in the following result.

Theorem (Smooth and strongly stable EDPM regret bound). Suppose that is a quasiconvex, smooth, and strongly stable EDPM. Then for all , and any satisfying and we have that

where is a constant that depends on (), and the parameters of Definitions 4 and 5. The main idea in the proof of Theorem 5 is to subtract a zero-mean estimate of before performing the problematic transition in (16). We construct this estimator using the operator in Definition 5, and carefully perform the transition. This results in two residual functions of the form given in (17) which we then bound using the conditions of Definitions 4 and 5. We conclude with the following corollary, which is an immediate result of Theorems 4, 5 and 5.

Corollary ( regret). Suppose that for all , and is a quasiconvex and strongly stable EDPM. Then , and provided that is also smooth, then .

6 Illustrative Examples

The purpose of this section is, first and foremost, to show the relative ease with which various performance criteria can be analyzed within the framework developed in the previous sections. To make the exposition more accessible, we forego detailed introductions of the various criteria as well as various other technical details. We refer the interested reader to Appendix E for the complete details. At this stage we give a short summary of the main results.

  1. The infinite horizon oracle problem defined in (10) was shown to have a deterministic simple policy structure for stable and quasiconvex EDPMs.

  2. The regret defined in (9), and the proxy regret defined in (14) are such that:

    • A quasiconvex and strongly stable EDPM satisfies .

    • Provided the EDPM is also smooth, then .

In what follows we will see how these results hold for a wide range of criteria that satisfy the requisite assumptions, as well as some subtleties that arise.

Differentiable EDPMs.

Assuming that the “correct” norm is chosen, typical EDPMs are differentiable, thus making it easy to verify smoothness. Table 1 introduces some well-known criteria that are compositions of linear functionals, and as such differentiable and smooth (Definition 5). Table 2 presents the associated choice of norm and the constraints on arm distributions () required by our framework. It is not difficult to spot that the norms in Table 2 fall into a specific pattern, i.e., a baseline norm in the form of augmented with one or more semi-norms (linear operators). It then remains to verify strong stability (Definition 4). Verifying the modulus of continuity is more of a technicality. Verifying the concentration splits into two parts: for the baseline norm it follows by Dvoretzky-Kiefer-Wolfowitz ([Massart(1990)]); for the semi-norms it is provided by the sub-Gaussian conditions of Table 2. We note that for the purpose of stability (Definition 3), it suffices to require that for all . Furthermore, we did not find any known examples of risk criteria that are not either linear, convex, or quasiconvex.

Empirical reward | EDPM definition | Description
Mean | | The traditional MAB average reward.
Second moment | | An average of the squared reward.
Below target semi-variance | | Measures the negative variation from a threshold .
Entropic Risk | | A risk assessment using an exponential utility function with risk aversion parameter .
Negative variance | | Empirical variance of the reward.
Mean-variance (Markowitz) | | A weighted sum (using ) of the empirical mean and variance.
Sharpe ratio | | A ratio between the empirical mean and variance, where is a minimum average reward, and is a regularization factor.
Sortino ratio | | Sharpe ratio with variance replaced by the below target semi-variance measure.
Table 1: Differentiable EDPMs
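Several of the criteria in Table 1 can be written as compositions of two linear functionals of the empirical distribution, the mean and the second moment. The sketch below illustrates this on a list of rewards. The definition column of the table is not reproduced in this text, so the forms and sign conventions used here (larger is better, population variance, entropic risk as a negative scaled log-moment) are common textbook choices and should be read as assumptions; the Sharpe and Sortino ratios are omitted because their regularized forms are not recoverable here.

```python
import math
from statistics import fmean


def mean(xs):
    """Linear functional: average reward."""
    return fmean(xs)


def second_moment(xs):
    """Linear functional: average squared reward."""
    return fmean([x * x for x in xs])


def negative_variance(xs):
    """Composition of the two linear functionals above (assumed sign)."""
    return -(second_moment(xs) - mean(xs) ** 2)


def mean_variance(xs, rho):
    """Mean penalized by rho times the variance (assumed weighting)."""
    return mean(xs) + rho * negative_variance(xs)


def entropic_risk(xs, theta):
    """-(1/theta) * log of the average of exp(-theta * x) (assumed form)."""
    return -math.log(fmean([math.exp(-theta * x) for x in xs])) / theta


def below_target_semivariance(xs, target):
    """Negative average squared shortfall below `target` (assumed sign)."""
    return -fmean([max(target - x, 0.0) ** 2 for x in xs])


rewards = [1.0, 2.0, 3.0, 4.0]
print(mean(rewards), second_moment(rewards), negative_variance(rewards))
```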
Empirical reward | The function in | Constraints on the random rewards for all | 
Mean | | are sub-Gaussian | linear
Second moment | | are sub-Gaussian | linear
Below target semivariance | | are sub-Gaussian | linear
Entropic Risk | | are sub-Gaussian | convex
Variance | | are sub-Gaussian | convex
Mean-variance (Markowitz) | | are sub-Gaussian | convex
Sharpe ratio | | are sub-Gaussian | quasiconvex
Sortino ratio | | and are sub-Gaussian | quasiconvex
Table 2: EDPM properties (see details in appendix E.1)

Non-differentiable EDPMs.
Non-differentiable EDPMs.

We now consider two examples of non-differentiable criteria. The first, CVaR, is found to be smooth and strongly stable under appropriate conditions. The second, VaR, is strongly stable but appears to be non-smooth. In both cases the resulting conditions possess a more particular nature than those presented for differentiable EDPMs.

Recall the definitions of CVaR and VaR given in (7). We denote the level set of a function by, , and consider the following set of conditions:

  1. , for all .

  2. are sub-Gaussian for all .

  3. For all the cardinality of is at most 1.

  4. There exist such that for all ,

  5. All are twice continuously differentiable at .

Table 3 summarizes how these conditions correspond to the (strong) stability and smoothness of VaR and CVaR. We conclude with some remarks regarding the necessity of our conditions.

Proposition (VaR oracle policy). For , VaR always admits a deterministic simple oracle policy , i.e., choosing a single arm throughout the horizon is asymptotically optimal.

When considering the existence of simple oracle policies, Proposition 6 essentially means that condition 3, which implies stability, is unnecessary. However, for the purpose of regret analysis, we highlight the importance of conditions 3-5 by means of a simulation. Note that Theorem 5 relies on a fast convergence rate of the performance as measured by the regret, i.e., , to that of the proxy regret, i.e., . We denote this performance gap by , and calculate it in a simple simulation with arms. This is done for three different distributions, each violating a different subset of conditions 3-5. Figure 1 displays the simulation results, which show that the obtained rate is slower than the desired rate which is achieved in Theorem 5.

Reward / Property | Stable | Strongly stable | Smooth and strongly stable
VaR | Conditions: 3 | Conditions: 2, 4 | Conjecture: never smooth, but similar results assuming 2, 4, 5
CVaR | Conditions: 1 | Conditions: 2 | Conditions: 2, 4
Table 3: VaR and CVaR properties (see details in appendices E.2.2 and E.2.1)
Figure 1: and horizon gap for “bad” distributions. does not satisfy any of 3-5, does not satisfy 5, and does not satisfy 4. The figures essentially show that , thus suggesting that behaves as , which is slower than the desired .

7 Open Problems and Future Directions

One main question that we leave open is the dependence of the regret on the number of arms . We conjecture that a finer analysis of may reduce it from our to either or . The subject of lower bounds remains open as well. Future directions may include a more complete taxonomy of performance criteria, or an extension of this framework to different settings (e.g., adversarial or contextual). Additionally, we note that the majority of our proof techniques also apply to non-quasiconvex criteria. If such criteria are found to be of interest then extending the framework to this case may be appealing.

We thank Ron Amit, Guy Tennenholtz, Nir Baram and Nadav Merlis for helpful discussions of this work, and the anonymous reviewers for their helpful comments. This work was partially funded by the Israel Science Foundation under contract 1380/16 and by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 306638 (SUPREL).


  • [Agrawal and Goyal(2012)] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, pages 39.1–39.26, 2012.
  • [Artzner et al.(1999)Artzner, Delbaen, Eber, and Heath] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical finance, 9(3):203–228, 1999.
  • [Auer et al.(1995)Auer, Cesa-Bianchi, Freund, and Schapire] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE, 1995.
  • [Auer et al.(2002)Auer, Cesa-Bianchi, and Fischer] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [Bubeck and Cesa-Bianchi(2012)] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [Fisher(1992)] Evan Fisher. On the law of the iterated logarithm for martingales. The Annals of Probability, pages 675–680, 1992.
  • [Galichet et al.(2013)Galichet, Sebag, and Teytaud] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In ACML, pages 245–260, 2013.
  • [Klenke(2014)] Achim Klenke. Law of the Iterated Logarithm, pages 509–519. Springer London, London, 2014. ISBN 978-1-4471-5361-0. doi: 10.1007/978-1-4471-5361-0_22. URL http://dx.doi.org/10.1007/978-1-4471-5361-0_22.
  • [Lai and Robbins(1985)] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • [Maillard(2013)] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 218–233. Springer, 2013.
  • [Maillard et al.(2011)Maillard, Munos, and Stoltz] Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In COLT, pages 497–514, 2011.
  • [Massart(1990)] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
  • [Robbins(1952)] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • [Rockafellar and Uryasev(2000)] R Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • [Sani et al.(2012)Sani, Lazaric, and Munos] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
  • [Simonnet(1996)] Michel Simonnet. The Strong Law of Large Numbers, pages 311–325. Springer New York, New York, NY, 1996. ISBN 978-1-4612-4012-9. doi: 10.1007/978-1-4612-4012-9_15. URL http://dx.doi.org/10.1007/978-1-4612-4012-9_15.
  • [Vakili and Zhao(2016)] Sattar Vakili and Qing Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
  • [Van der Vaart(2000)] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [Zimin et al.(2014)Zimin, Ibsen-Jensen, and Chatterjee] Alexander Zimin, Rasmus Ibsen-Jensen, and Krishnendu Chatterjee. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833, 2014.

Appendix A EDPM Motivation.

The following provides some of the motivation behind EDPMs. Let , where is a function that measures the quality of a given reward sequence of length . A decision maker may then wish to maximize the expected performance, i.e. . It makes sense that the preferences of the decision maker remain fixed over time. This means () should, in some sense, be time invariant. However, such an invariance is hard to grasp when the functions do not share a domain. One way of addressing this issue is to assume that is permutation invariant, i.e., it maps all the permutations of its reward sequence to the same value. We provide a formal definition in the proof of the following (known) result.

Lemma A (Permutation invariant function representation). is permutation invariant if and only if there exists such that .

The representation given in Lemma A suggests as a shared domain, thus making it simple to define time invariance. We conclude that EDPMs describe the objectives that are time and permutation invariant.

Proof of Lemma A. We start with a few definitions. Let denote the set of permutation matrices (binary and doubly stochastic). is said to be permutation invariant if for all and . Let, , be the set of empirical distributions created from elements (the image of ). Let,

be the inverse image of at . Let,

be the set of all permutations of . We can now begin the proof.

First direction. Suppose . Notice that is indeed permutation invariant as permuting its input simply reorders its finite sum thus not changing the value. This clearly implies that is permutation invariant.
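The first direction can be checked numerically on a small case. The sketch below is only an illustration: the helper names (`empirical_distribution`, `empirical_cvar`, `performance`), the lower-tail CVaR convention, and the level 1/2 are assumptions made for this example, not notation from the paper. It represents the empirical distribution as a map from values to probability masses and confirms that a measure computed only through that map takes the same value on every permutation of the reward sequence.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def empirical_distribution(rewards):
    """Return the empirical distribution as a dict: value -> probability mass."""
    n = len(rewards)
    counts = Counter(rewards)
    return {value: Fraction(count, n) for value, count in counts.items()}


def empirical_cvar(dist, alpha):
    """Mean of the lowest alpha-fraction of mass of a discrete distribution.

    Assumed convention for this sketch: lower tail, 0 < alpha <= 1.
    """
    remaining = Fraction(alpha)
    total = Fraction(0)
    for value in sorted(dist):
        mass = min(dist[value], remaining)  # take as much mass as still needed
        total += mass * value
        remaining -= mass
        if remaining == 0:
            break
    return total / Fraction(alpha)


def performance(rewards, alpha=Fraction(1, 2)):
    """A measure that sees the rewards only through their empirical distribution."""
    return empirical_cvar(empirical_distribution(rewards), alpha)


rewards = (3, 1, 2, 1)
# Evaluate the measure on every reordering of the same rewards.
values = {performance(p) for p in permutations(rewards)}
```

Here `values` collapses to a single element, because reordering the rewards leaves the counts in `empirical_distribution` unchanged.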

Second direction. Suppose that is permutation invariant. Furthermore, assume that for any , we have that, . Then, define in the following way. For any choose arbitrarily . Further define by,

Then we have that, , and thus there exists , such that . We conclude that,

where the last step uses the permutation invariance of .
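The construction in the second direction can also be sketched in code. This is a hedged illustration under stated assumptions: the sorted reward tuple stands in for the empirical distribution (for a fixed length it determines it uniquely), the names `empirical_key`, `make_v` and `u` are hypothetical, and the example measure `u` is chosen only because it is visibly permutation invariant. Integer rewards are used so that floating-point summation order plays no role.

```python
from itertools import permutations


def empirical_key(rewards):
    """Hashable stand-in for the empirical distribution: the sorted reward tuple."""
    return tuple(sorted(rewards))


def make_v(u):
    """Build v on empirical keys from a permutation-invariant u on sequences."""
    def v(key):
        # The key is itself one element of the inverse image, so it serves
        # as the arbitrarily chosen representative sequence.
        representative = key
        return u(representative)
    return v


def u(rewards):
    """Hypothetical permutation-invariant measure: mean minus range."""
    return sum(rewards) / len(rewards) - (max(rewards) - min(rewards))


v = make_v(u)
rewards = (4, 0, 2)
# For every reordering p, v applied to the empirical key of p should equal u(p).
checks = [v(empirical_key(p)) == u(p) for p in permutations(rewards)]
```

Every entry of `checks` is true here precisely because `u` is permutation invariant; for a measure that depends on the order of the rewards, the representative's value would no longer agree with all reorderings.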

Proof of assumption. We show that for any , we have that,