The stochastic multi-armed bandit problem is a well-studied framework for modeling sequential decision-making problems. It has a wide range of theoretical as well as practical applications, such as clinical trials, web advertisement placement, and packet routing, to name a few. In the usual formulation, an agent (a learner, or an algorithm) has to choose one of several unknown distributions (which are called arms), receive a sample (a loss) from the arm chosen, and repeat this process for some prescribed amount of time. The goal of the learner is expected regret minimization, i.e., minimization of the expectation of the difference between its own cumulative loss and the cumulative loss of the best arm, where the best arm is the one with the smallest mean. However, for some applications the expectation criterion might not be the most desirable. For example, in clinical trials one might not be interested in the most effective treatment on average, but in the one that is more robust and still has a good effect on average. In terms of multi-armed bandits, in this case the best arm is defined not by the mean, but by some risk measure, which is a function of the distribution itself. This leads to the idea of the risk-averse bandit problem.
Risk-aversion has been extensively studied in other fields, starting from economic theory ([Markowitz01], [Neumann01]) and ending with the neighbouring field of reinforcement learning ([Defourny01], [Shen01], [Shen02], [Patek01]). In the field of online learning, risk-aversion was studied in the experts setting by [Even-dar03]. They obtained several negative and positive results for the cases where the Sharpe ratio ([Sharpe01]) and mean-variance ([Markowitz01]) were used as risk measures. [Warmuth01] studied the problem of pure variance minimization. Other risk measures were studied in [Sani01] and [Maillard01]. The former proposes to use the mean-variance criterion as a measure of risk and aims at minimizing a notion of regret that takes into account the variability of the algorithm. The latter considers the log-exponential risk measure, which belongs to the class of so-called coherent risk measures ([Rockafellar03]), and minimizes the regret defined using this measure.
There is no universally agreed notion of what a good measure of risk is, and the appropriate notion can vary from one problem to another. All previous works focused on particular risk measures, which immediately limited the applicability of the results and raised many questions about the quality of the particular risk measure. In this work, we take a different approach: instead of a specific risk measure, we define the risk-averse bandit problem with an arbitrary (but fixed) risk measure and the corresponding regret. We focus on risk measures defined as a function of the first two moments (the mean and the variance). This generalizes the setting of [Sani01] from linear to arbitrary functions, while considering a notion of regret similar to [Maillard01].
We present two motivating examples of our framework: (1) We consider the threshold variance problem, where we have the usual bandit setting and are interested in the means of the distributions (of the arms), but would like to choose only from those arms that have a variance smaller than a specified threshold. One possible formalization of this problem leads us to risk-averse regret minimization with a discontinuous function of the mean and the variance used as a risk measure. (2) We consider a risk measure that is a linear combination of the mean and the square root of the variance, where both summands are of the same order. This is a natural variant of mean-variance optimization and is a continuous function of the mean and the variance.
Our main results are as follows: (1) First we present an algorithm, namely, -LCB, which belongs to the wide family of Lower (Upper) Confidence Bound algorithms (the descendants of the UCB algorithm of [Auer02], see also, e.g. [Audibert01], [Garivier01]), and prove logarithmic risk-averse regret bounds for all continuous functions. (2) Second, we present an example of a discontinuous function where no natural algorithm (based on the optimism in the face of uncertainty principle) can achieve sublinear regret. (3) Finally, we present another algorithm, namely, -LCB2, that makes learning feasible under the mild assumption that no arm hits the discontinuity points. Our proof approach is similar to [Sani01] and [Maillard01], although the latter used a slightly different KL-divergence based version of the algorithm ([Maillard02]).
Other related works. In the bandit setting, risk-aversion has been approached from different perspectives. [Galichet01] designs an algorithm that uses conditional value at risk (CVaR) as a risk measure. However, they aim at minimizing the usual expected regret under the assumption that the arm with the best mean is also the best arm in terms of risk-aversion, which is completely different from our goal. [Yu04] derive PAC-bounds on the single- and multi-period risk for several different risk measures; nevertheless, the PAC style of their results makes them inapplicable to our problem. [Salomon01] considers the deviations of the regret in the standard setting, which seem to address the same issues, but it remains unclear if their results can be connected to risk-averse regret minimization.
Organization. In Section 2 we introduce the notation, formally state the problem, and present some examples which can be modeled in our framework. In Sections 3.1 and 3.2 we discuss two cases of the main problem and present the corresponding algorithms together with the risk-averse regret bounds. Section 4 discusses open problems and possible extensions of the setting. The paper concludes with the proofs of the main theorems in Section 5.
2 The problem
Let denote the set of distributions supported on . We consider the stochastic multi-armed bandit setting with arms and being the distributions of arms. At time step the learner chooses arm to pull and receives a sample drawn from , where is the number of times that arm is pulled by the -th time step, that is,
We consider the case where the learner is given a risk measure . The risk measure of arm is . This measure defines the best arm by
and the goal of the algorithm is to identify that arm. The performance of the algorithm is measured by means of risk-averse regret:
Note that this corresponds to the notion of pseudo-regret for stochastic bandits, but there is no regret notion in our setting that directly corresponds to true regret in stochastic bandits. One could try to define true regret as the difference of risk measures applied to the empirical distributions of the algorithm and the best arm (similar to [Sani01]). However, then the algorithm could be punished even for switching between the best arms, which can be an undesirable feature.
Some examples of such risk measures are with being a random variable (the usual stochastic bandit) and , considered in [Maillard01].
In this paper we focus on the risk measures of the following form:
In other words, the learner is supplied with a function , where . (Footnote 1: The domain of the second argument can be restricted to , since for a random variable which takes values in , the variance is upper bounded by .) If we denote the risk measure of arm by , i.e. , where and are the mean and the variance of the -th arm respectively, then and the regret is
Our class of risk measures is rich enough to model a lot of interesting problems:
Standard Bandit: . This is the standard stochastic multi-armed bandit setting.
Variance Minimization: . This is the variance minimization problem, considered in [Warmuth01].
Mean-variance Bandit: . This is a version of the problem considered in [Sani01]. A related and natural variant is , where both summands are of the same order.
Threshold Variance: . This risk measure can be used to model the threshold variance problem described in Section 1.
Log-Exponential Risk: . This measure can be seen as an approximation to the coherent risk measure, considered in [Maillard01]: , when it is restricted to the first two moments.
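To make the list above concrete, the following minimal sketch writes a few risk measures as plain functions of the mean and the variance and evaluates the resulting best arm and pseudo-regret. The exact functional forms, the constants `rho`, `threshold` and `penalty`, and all function names are illustrative assumptions and are not taken from the definitions in this section.

```python
import math

# NOTE: every functional form below is an assumed, illustrative instance of a
# risk measure f(mean, variance) for losses (smaller is better). The constants
# rho, threshold and penalty are hypothetical parameters.

def standard(mu, var):
    # Standard bandit: only the mean matters.
    return mu

def variance_min(mu, var):
    # Variance minimization: only the variance matters.
    return var

def mean_variance(mu, var, rho=1.0):
    # Assumed linear trade-off between mean and variance.
    return mu + rho * var

def mean_std(mu, var, rho=1.0):
    # Variant using the square root of the variance, so that both
    # summands are of the same order.
    return mu + rho * math.sqrt(var)

def threshold_variance(mu, var, threshold=0.1, penalty=1.0):
    # Discontinuous: arms above the variance threshold get a fixed penalty.
    return mu if var <= threshold else penalty

def best_arm(f, arms):
    # arms is a list of (mean, variance) pairs; the best arm minimizes f.
    return min(range(len(arms)), key=lambda i: f(*arms[i]))

def risk_averse_regret(f, arms, pulls):
    # Pseudo-regret: number of pulls of each arm times its gap to the best arm.
    risks = [f(mu, var) for mu, var in arms]
    best = min(risks)
    return sum(n * (r - best) for n, r in zip(pulls, risks))
```

With `arms = [(0.3, 0.2), (0.4, 0.05), (0.5, 0.01)]`, different choices of `f` single out different best arms, which is the point of treating the risk measure as a parameter of the problem.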
Our goal is to study conditions on the function under which learning is possible.
3 Our Results
We distinguish between two cases of the problem: continuous and discontinuous functions . In the continuous case we prove that learning is possible for every function. In the discontinuous case we present an example where learning is not possible. The negative example motivates a restriction, and we show that under the restriction learning is feasible.
3.1 Continuous functions
In this section we will show that learning is possible for any continuous function . We start with a characterization of continuous functions that will be used to present the algorithm.
For every continuous function , there exists a function , such that
is a strictly increasing function;
As an example, consider an -Hölder continuous function : in this case would satisfy the conditions of Lemma 1 by the definition of -Hölder continuity. But Lemma 1 is stated for every continuous function: as another example, consider the continuous function
It is not -Hölder continuous for any , but satisfies the conditions of Lemma 1 for .
We will use Lemma 1 to construct a high-confidence interval for from the confidence intervals for its arguments. We start by defining the empirical mean and the empirical variance of arm :
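As a minimal sketch of these two estimators, the code below computes the empirical mean and an empirical variance formed as the empirical second moment minus the squared empirical mean, which matches the second-moment estimator used in the proof of Theorem 3.1. The function names and the clipping at zero are assumptions made for illustration.

```python
def empirical_mean(samples):
    # Average of the observed losses of one arm.
    return sum(samples) / len(samples)

def empirical_variance(samples):
    # Empirical second moment minus squared empirical mean (assumed form).
    m = empirical_mean(samples)
    second_moment = sum(x * x for x in samples) / len(samples)
    # Clip tiny negative values caused by floating-point rounding.
    return max(0.0, second_moment - m * m)
```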
The following concentration results are the basis for our argument.
Lemma 2 (Chernoff-Hoeffding bound)
For every , , and , with probability at least
, with probability at least
Lemma 3 (Lemma 2 from [Antos01])
For all , , and , with probability at least
The algorithm -LCB will at time step choose an arm that minimizes the corresponding lower confidence bound:
The algorithm chooses arm if is really small or if is big. The former means that the algorithm tries to exploit an arm that has a small estimated risk measure, while the latter means that the estimate for the arm is rough and the algorithm tries to improve it by exploring this arm further. In other words, the -LCB algorithm deals with the exploration-exploitation trade-off using the so-called optimism in the face of uncertainty principle.
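The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the exact index of the algorithm: the confidence radius uses a Hoeffding-style form with an assumed constant, a single radius is used for both the mean and the variance, and `modulus` stands for a strictly increasing function of the kind provided by Lemma 1. All names (`confidence_radius`, `lcb_index`, `run_lcb`) are hypothetical.

```python
import math
import random

def confidence_radius(n_pulls, delta):
    # Hoeffding-style radius; the constant is an assumption.
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_pulls))

def lcb_index(f, modulus, mean_hat, var_hat, n_pulls, delta):
    # Estimated risk minus an exploration bonus: small when the estimated
    # risk is small or when the arm has been pulled only a few times.
    return f(mean_hat, var_hat) - modulus(confidence_radius(n_pulls, delta))

def run_lcb(f, modulus, arms, horizon, delta, seed=0):
    # arms: list of callables rng -> loss in [0, 1]. Returns pull counts.
    rng = random.Random(seed)
    k = len(arms)
    sums, squares, pulls = [0.0] * k, [0.0] * k, [0] * k

    def index(j):
        mean_hat = sums[j] / pulls[j]
        var_hat = max(0.0, squares[j] / pulls[j] - mean_hat ** 2)
        return lcb_index(f, modulus, mean_hat, var_hat, pulls[j], delta)

    for t in range(horizon):
        # Pull every arm once, then follow the lowest confidence index.
        i = t if t < k else min(range(k), key=index)
        x = arms[i](rng)
        sums[i] += x
        squares[i] += x * x
        pulls[i] += 1
    return pulls
```

For example, with `f = lambda mu, var: mu`, `modulus = lambda x: x`, and two Bernoulli arms with loss probabilities 0.2 and 0.8, the arm with the smaller mean loss ends up receiving the large majority of the pulls.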
Theorem 3.1 states the regret bound of the -LCB algorithm.
Theorem 3.1 (Feasibility of learning)
Consider a continuous function , then for with probability at least the regret of the -LCB algorithm at time is upper bounded by:
where . Moreover, for , if the algorithm is run with , then with probability at least the regret is upper bounded by:
Efficiency. Theorem 3.1 shows that learning is feasible for every continuous function. We now discuss the efficiency of the algorithm with respect to different classes of continuous functions.
Lipschitz functions: when is -Lipschitz, i.e. , the regret bound is
and the dependence on in front of matches the dependence in the regret of the -LCB algorithm in the standard stochastic bandit problem. The worse constant () term is an artifact of carrying out such a general analysis. This case covers the standard bandit and the variance minimization problems with , the log-exponential risk problem with , and the mean-variance bandit problem with in which .
Hölder functions: when is -Hölder continuous, i.e. , the regret bound is
This case covers the mean-variance problem with which is -Hölder continuous with . Note that the dependence on in this case is worse than for Lipschitz functions, but it is still polynomial.
Non-Hölder functions: to demonstrate how efficiency can decrease for the general class of continuous functions, consider from (1), then and the regret bound becomes
We can see that the term in front of grows exponentially as goes to in comparison to the polynomial growth for Lipschitz and Hölder functions.
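The effect of the function class on efficiency can be illustrated numerically. The sketch below assumes that, up to constants, the number of pulls of a suboptimal arm scales like the logarithmic term divided by the squared inverse of the modulus function evaluated at the gap; the constants, the helper names, and the parameters `L` and `alpha` are assumptions for illustration only.

```python
import math

def lipschitz_inverse(gap, L=1.0):
    # Inverse of the assumed modulus g(x) = L * x.
    return gap / L

def holder_inverse(gap, L=1.0, alpha=0.5):
    # Inverse of the assumed modulus g(x) = L * x ** alpha.
    return (gap / L) ** (1.0 / alpha)

def pulls_scale(inverse_modulus, gap, delta):
    # Order of magnitude of the pulls of a suboptimal arm, constants ignored.
    return math.log(1.0 / delta) / inverse_modulus(gap) ** 2
```

For a gap of 0.1, the Hölder case with `alpha = 0.5` yields a scale that is larger than the Lipschitz case by a factor of `1 / gap ** 2`, which is worse but still polynomial in the inverse gap.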
Note that it is possible to design an anytime version of -LCB for the case when is not known in advance. To do so, at each time step we take , where is a sequence decreasing at an appropriate rate. However, we do not pursue this direction further.
3.2 Discontinuous functions
The case of discontinuous functions is trickier. We present a negative example and a partially positive result. We start with an example of a discontinuous function where no algorithm following the optimism in the face of uncertainty principle can achieve sublinear regret.
Consider the following discontinuous function: Let
Consider two arms and such that and and and . Then any algorithm based on the optimism in the face of uncertainty principle will keep on choosing arm 1 with non-negligible probability. This is because if the estimate of the algorithm is not precisely the discontinuity point, then arm 1 will be chosen due to optimism.
However, in the case when no arm hits the discontinuity point, learning is possible as we will show. Let be the distance to the point representing -th arm. Define to be the set of discontinuities of and to be the distance to the closest discontinuity point. We will show that learning is possible under the following assumption.
For each arm there exists such that is continuous in .
Let us introduce , then by Lemma 1 there exists a function that satisfies the required condition, but only in instead of . So when our estimated values are in we can use the same algorithm as before. We present a new algorithm -LCB2 that first pulls each arm a number of times, such that with high probability is in for each arm, in other words, that . If we knew in advance, then to ensure this condition with high probability it would be enough (from Lemma 2 and Lemma 3) that
Hence, we would need to pull each arm times. But since is not known in advance, we pull each arm until its distance to is at most half the distance to the closest discontinuity point. Formally, the algorithm chooses each arm until
At the time when this happens, we can be sure that and this procedure does not increase the number of pulls too much. To ensure (5) with high probability it is enough that
After ensuring this for each arm, the algorithm proceeds as the -LCB algorithm, but uses for each arm instead of a common function :
Note that constructing requires knowledge of , but this can also be avoided if we construct it in the estimated (and smaller) region, defined at the time when (6) occurs. The following theorem states the regret bound of the resulting algorithm.
Consider a function that satisfies Assumption 1. Then for with probability at least for all the regret of the -LCB2 algorithm at time is upper bounded by:
where and as defined before. Moreover, if the algorithm is run with , then with probability at least for all the regret is upper bounded by:
The theorem can be applied to our motivating example: the threshold variance problem. There are two continuous regions, when and when . In either case we can take (in fact, we can take any increasing function for the region , since is just a constant there) and then the bound becomes
Actually, in this case the bound can be improved, since after Phase I the algorithm would know which arms have variance greater than and it would not pull them at all. Hence, for such arms the term can be removed. Note that the efficiency of the algorithm depends on how fast we can compute : for the threshold variance problem it can be done efficiently, because , i.e. it can be done in constant time.
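For the threshold variance problem, the first phase can be sketched as follows. The sketch assumes that the discontinuity set is the line where the variance equals the threshold, so the estimated distance to it is the absolute difference between the empirical variance and the threshold, and that an arm is pulled until a Hoeffding-style radius (with an assumed constant) is at most half of that estimated distance. The names `phase_one_pulls` and `distance_to_threshold`, and the cap `max_pulls`, are hypothetical.

```python
import math
import random

def distance_to_threshold(var_hat, threshold):
    # Distance from the estimate to the discontinuity line var = threshold;
    # computable in constant time.
    return abs(var_hat - threshold)

def phase_one_pulls(sample, threshold, delta, rng, max_pulls=100000):
    # Pull one arm until the confidence radius is at most half of the
    # estimated distance to the discontinuity (assumed stopping rule).
    total, total_sq = 0.0, 0.0
    for n in range(1, max_pulls + 1):
        x = sample(rng)
        total += x
        total_sq += x * x
        mean_hat = total / n
        var_hat = max(0.0, total_sq / n - mean_hat ** 2)
        radius = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        if radius <= 0.5 * distance_to_threshold(var_hat, threshold):
            return n
    return None  # the arm is too close to the discontinuity for this budget
```

An arm whose variance sits exactly at the threshold never satisfies the stopping rule, which mirrors the role of the assumption that no arm hits a discontinuity point.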
4 Conclusion and discussion
We described a framework for risk-averse regret minimization without restriction to any particular risk measure. For a specific class of risk measures, which are functions of the mean and the variance, we proposed two algorithms that achieve logarithmic regret: one for the case of continuous functions and one for the case of discontinuous functions. In the former case we proved a logarithmic regret bound for any continuous function, while in the latter the problem needs to satisfy a mild and reasonable assumption that arms should not hit the discontinuity points of the risk measure. Under this condition, the presented algorithm achieves logarithmic regret.
We believe that Assumption 1 might not be a necessary condition for learning. For example, even for the case when the risk measure is the Dirichlet function of the mean (which is continuous nowhere), it may be possible to design a sound algorithm, following the lines of [Cover02].
We remark that achieving optimal constants was not our goal and it is very likely that our bounds can be improved. An open problem, which we have not addressed in our work, is lower bounds on the risk-averse regret. Since the standard bandit problem is a particular case of our problem, we know that in this case the bound is tight (up to a constant), but obtaining a general lower bound remains an interesting research direction. Another open problem is the extension of our results to other classes of functions. While a long-term goal would be to consider general functionals, the class of coherent risk measures could be a plausible next step. It is interesting to note that while classes of coherent risk measures and general functions of the mean and the variance intersect, there is no inclusion in either direction. Finally, it is an interesting question to consider the best arm identification problem (e.g. [Bubeck02]) in the context of our framework. This problem is usually referred to as a pure exploration problem, where the goal is to explore the arms in the most efficient way, focusing on minimizing the notion of simple regret.
Proof (Lemma 1)
We will prove the lemma by directly constructing a candidate function satisfying the stated conditions. First note that by the Heine-Cantor theorem is uniformly continuous, since the domain is compact. Consider a sequence for , then for every such there exists , such that by uniform continuity. We now decrease each such that (if it is not the case). This does not invalidate the previous implication. Afterwards we construct the function . First, . Then for any we define
Then for . Now we need to deal with the case when . For this note that the fact for any implies for any . To see this, assume the former is true and fix such that . Take , then for both and : and hence . But then
We use the just proven fact to define for . Let be the smallest such that , then . To unify both cases we introduce
Letting , for . We then have that . By construction, satisfies Condition 1 and Condition 3 of the lemma (for any , and then ). Also, is well-defined, since for all (1) there exists some such that ; and (2) we have that and thus for . To deal with Condition 2, we can take any strictly increasing function that dominates at every point. For example, we can linearly interpolate between discontinuity points, i.e. define as
for and . It is strictly increasing (because is increasing, which we get from the definition of ) and Condition 3 follows from for .
Proof (Theorem 3.1)
The proof is similar to Theorem 1 from [Sani01] with minor modifications. We start with the following standard regret decomposition (recall that ).
Hence, our task is reduced to bounding for each arm. First, let be the second moment of the distribution of the arm , i.e. , where . Then
is the estimator of . Now we define a high probability event
Now let us consider the moment when arm is chosen at some time step . It means that its lower confidence index was lower than that of the best arm (by (4)):
We also know that on the event (by (3)):
Combining the last three inequalities,
Since is a strictly increasing function, it has a well-defined inverse and we can bound as follows:
If is the last time when arm is pulled, then and hence
Inserting this into (8) gives us the stated regret bound.
Proof (Theorem 3.2)
Again, as in Theorem 3.1, we are going to use regret decomposition (8). Hence, we will focus on bounding for each arm . We define the event as in (9) and everything we are deriving next is conditioned on . We introduce the following stopping times as
Then we have
where is the number of times the arm was pulled during the second phase of the algorithm. Conditioned on it can be bounded as in Theorem 3.1 by (10) with corresponding . Next we focus on . If we define
then, at time Condition (6) is necessarily fulfilled: