Multi-armed bandit (MAB) is a class of online learning and decision-making problems. The origins of the problem can be traced back to the work by Thompson on clinical trial applications. In this classic problem, there are a number of arms and a single player who plays the arms sequentially. Playing an arm at each time results in a reward drawn from an unknown distribution. The player observes only the reward of the selected arm. In the classic formulation of the problem, the objective is to maximize the cumulative reward over time or, equivalently, to minimize the regret, defined as the cumulative loss in comparison to the best arm selection policy under a known distribution model (i.e., always playing the arm with the highest expected reward). The crux of the classic MAB problem is the trade-off between exploration, i.e., learning the reward statistics of each arm, and exploitation, i.e., capitalizing on the gathered information to make the optimal decision at each time.
The classic MAB literature mainly focuses on the expected regret of arm selection policies. Motivated by emerging engineering and financial applications, there has recently been increased attention to MAB under risk measures. In this paper, we study a MAB problem under a mean-variance measure, which is a common risk measure in modern portfolio selection.
The mean-variance of a random variable X is defined as MV(X) = σ²(X) − ρμ(X), a linear combination of its mean μ(X) and its variance σ²(X). The higher the value of ρ is, the more risk tolerant the measure is. In mean-variance portfolio optimization, the objective is to maximize the expected return for a given level of variance, or to minimize the variance for a given expected return. The parameter ρ can be interpreted as the Lagrange multiplier in this constrained optimization problem.
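As a concrete illustration, the mean-variance of an empirical sample can be computed directly from its mean and variance. The sketch below assumes the minimization convention MV(X) = σ²(X) − ρμ(X) (smaller is better), with `rho` the risk-tolerance parameter; the arm values are toy numbers for illustration.

```python
import numpy as np

def mean_variance(x, rho):
    """Empirical mean-variance: variance minus rho times the mean.

    Smaller is better under the minimization convention; a larger rho
    puts more weight on the mean, i.e., more risk tolerance.
    """
    x = np.asarray(x, dtype=float)
    return x.var() - rho * x.mean()

# Two arms with equal means but different variances: for any rho > 0,
# the risk-averse measure assigns a smaller value to the safe arm.
safe = np.array([0.5, 0.5, 0.5, 0.5])    # mean 0.5, variance 0.0
risky = np.array([0.0, 1.0, 0.0, 1.0])   # mean 0.5, variance 0.25
```

Under this convention, the optimal arm is the one with the smallest mean-variance.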
Let π(t) denote the arm played by an arm selection policy π at time t, and let X_{π(t)}(t) denote the reward obtained at time t under policy π. We define the cumulative mean-variance of the observations as
Similarly to the risk-neutral MAB, we set our objective to minimize the cumulative mean-variance of the rewards or, equivalently, to minimize the regret, defined as the excess in cumulative mean-variance in comparison to the optimal arm selection policy under a known distribution model:
(π* denotes the optimal policy under the known distribution model).
The regret definition in risk-averse MAB is similar to that in risk-neutral MAB, except that the measure of expected value is replaced with the measure of mean-variance. However, the performance of a policy is not merely determined by the mean-variance of the rewards of the selected arms but also, as we shall see in Sec. 2, by the variance in the decisions; hence the title of the paper.
In the risk-neutral MAB, an Ω(log T) lower bound on the distribution-dependent regret was shown in the seminal work by Lai and Robbins. An Ω(√T) lower bound on the worst-case (minimax) regret can be concluded from , as well as from the lower bound results for non-stochastic MAB in . In Sec. 3, we establish fundamental limits on the performance of policies under the risk measure. Specifically, we show that the Ω(log T) lower bound on the distribution-dependent regret also holds under the mean-variance risk measure. However, the variance in the decisions makes an Ω(T^{2/3}) worst-case regret inevitable. This lower bound result is even stronger in the sense that it is established under the full feedback setting, where the player observes the rewards of all arms at each time, in contrast to observing only the reward of the selected arm.
We also analyze the performance of Upper Confidence Bound (UCB)-type policies under the risk-averse measure. In particular, we study a modification of the classic UCB policy introduced in  for risk-neutral MAB, referred to as Mean-Variance Lower Confidence Bound (MV-LCB). We also study a policy based on arm eliminations introduced in  for risk-neutral MAB, referred to as Confidence Bounds based Arm Elimination (CB-AE). We show that, while an O(log T) distribution-dependent regret is achievable, both MV-LCB and CB-AE have a linear worst-case regret in time. We also provide simulation examples showing that CB-AE outperforms MV-LCB.
1.2 Related Work
The risk-neutral MAB problem has been extensively studied for various applications, including clinical trials, internet advertisement, web search, and target tracking (see  and references therein), as well as various financial and economic applications (see  and references therein).
The MAB problem has been studied much less extensively under the measure of mean-variance. In  and , the empirical mean-variance of the observed reward process was considered as a risk measure (commonly referred to as volatility), and a modification of UCB and a modification of DSEE (introduced in  for risk-neutral MAB) were shown to achieve logarithmic distribution-dependent regret and O(T^{2/3}) worst-case regret, respectively. The empirical mean-variance calculated over the whole reward process is different from the actual instantaneous mean-variance considered in this paper, where the variance in the decisions causes a dominant term in the regret. A matching lower bound on the worst-case regret was established in , which showed the order optimality of the modified DSEE policy. Under a non-stochastic and full feedback setting,  considered a linear combination of mean and standard deviation (in contrast to mean-variance) and established a negative result showing the infeasibility of sublinear regret. In , the quality of an arm was measured by a general function of the mean and the variance of the random variable. That study, however, is closer to the risk-neutral MAB problems than to the problem studied in this paper, in that the variance in the decisions does not affect the regret in  as it dominantly does in our results.
In [16, 17], MAB under the measure of value at risk was studied. In , learning policies using the measure of conditional value at risk were developed. However, the performance guarantees were still within the risk-neutral MAB framework (in terms of the loss in the expected total reward), under the assumption that the best arm in terms of the mean value is also the best arm in terms of the conditional value at risk. Another risk measure for MAB problems was considered in .
2 Problem Formulation
Consider a stochastic MAB problem with a discrete set of actions. At each time t, a learner chooses an action π(t) and receives the corresponding reward X_{π(t)}(t), drawn from an unknown distribution. The rewards are independent across arms and i.i.d. over time. Let ν denote the set of distributions. We use E_ν and P_ν to denote the expectation and probability with respect to ν, and drop the subscript when it is clear from the context.
An arm selection policy π specifies a sequence of mappings from the history of observations to the arm to play at each time t. We use {X_{π(t)}(t)} to denote the random reward sequence under policy π.
The mean-variance of a random variable X is defined as MV(X) = σ²(X) − ρμ(X), a linear combination of its mean μ(X) and variance σ²(X). We also use μ_k, σ²_k, and MV_k to denote the mean, the variance, and the mean-variance of arm k.
The objective is to design an arm selection policy that minimizes the total mean-variance of the observations or, equivalently, minimizes the regret defined as
where π*, which plays a fixed arm at all times, is the optimal policy under a known model. Unlike the risk-neutral MAB, it is not obvious that the optimal policy under a known model is a single-arm policy. We shall show this using Lemma 1.
Let 1{·} denote the indicator function: for an event A, 1{A} = 1 if and only if A is true, and 1{A} = 0 otherwise. Let τ_k(t) denote the number of times that arm k has been played up to time t.
The risk-neutral regret can be expressed as a weighted sum of the expected numbers of plays E[τ_k(T)] of the suboptimal arms, with the gaps between the largest mean and the means of the suboptimal arms as weights. In contrast, the risk-averse regret given in (3) also depends on the variance of the decisions over time. Lemma 1 provides an expression of the regret which is used throughout the paper to analyze the performance of the policies. Let Γ_k = MV_k − MV_{k*} denote the mean-variance gap of arm k, and Δ_{k,k'} = μ_k − μ_{k'} the difference between the means of arms k and k'.
Lemma 1. The regret of a policy π under the measure of total mean-variance of rewards satisfies
Proof. See Appendix A.
The regret expression given in Lemma 1 shows that the regret of any policy π is nonnegative, which proves that the best single-arm policy is the optimal policy under the risk-averse measure. The second term in the regret expression in (4) captures the variance of the decisions.
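The decision-variance effect can be seen numerically. The sketch below uses two assumed toy arms whose mean-variances are both zero; a policy that randomizes between them nevertheless suffers a strictly positive per-step mean-variance, because the randomness of the decision itself inflates the variance of the observed reward.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.5, 200_000

# Arm A: constant 0     -> mean 0,   variance 0,    MV = 0
# Arm B: Bernoulli(0.5) -> mean 0.5, variance 0.25, MV = 0.25 - 0.5*0.5 = 0
# A policy flipping a fair coin between the arms observes a reward that
# equals 1 with probability 0.25: mean 0.25, variance 0.1875, so its
# per-step mean-variance is 0.1875 - 0.5*0.25 = 0.0625 > 0.
choose_b = rng.random(n) < 0.5
rewards = np.where(choose_b, (rng.random(n) < 0.5).astype(float), 0.0)
mv_observed = rewards.var() - rho * rewards.mean()
```

Even though each arm alone has mean-variance 0, the randomized policy pays a positive per-step premium, which is exactly the kind of contribution captured by the second term of the regret expression.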
We assume that the random variable X_k, for all k, is sub-Gaussian with parameter b, i.e., its moment generating function is bounded by that of a Gaussian distribution with variance b²:
As a result of the Chernoff-Hoeffding bound, we have the concentration inequalities on the sample mean and the sample mean-variance given in Lemma 2. The sample mean, the sample variance, and the sample mean-variance of each arm k up to time t are defined, respectively, as the empirical average μ̄_k(t) of the rewards observed from arm k, the empirical variance σ̄²_k(t) of those rewards, and the quantity σ̄²_k(t) − ρ μ̄_k(t). To keep the notation uncluttered, we drop the specification of the policy from these quantities when it is clear from the context.
Lemma 2 (Lemma 1 in )
Let M̄V be the sample mean-variance of a random variable X obtained from s i.i.d. observations. Let μ and σ² denote the mean and the variance of X, and assume that X has a sub-Gaussian distribution, i.e.,
for some constant. As a result, the centered variable also has a sub-Gaussian distribution, i.e.,
We then have, for all constants ε > 0,
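The concentration behavior in Lemma 2 can be checked empirically. The sketch below estimates how often the sample mean-variance of s i.i.d. observations deviates from the true mean-variance; the Bernoulli reward distribution, the sample sizes, and the tolerance `eps` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, s, runs, p = 0.5, 500, 2000, 0.3

true_mv = p * (1 - p) - rho * p   # variance - rho * mean

# Draw `runs` independent batches of s observations each and compute
# the sample mean-variance of every batch.
batches = (rng.random((runs, s)) < p).astype(float)
sample_mv = batches.var(axis=1) - rho * batches.mean(axis=1)

# Empirical probability of a deviation larger than eps; Lemma 2 bounds
# this by a term decaying exponentially in s for sub-Gaussian rewards.
eps = 0.05
dev_prob = float(np.mean(np.abs(sample_mv - true_mv) > eps))
```

With s = 500 the empirical deviation probability is already small, consistent with the exponential decay in s asserted by the lemma.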
3 Lower Bounds
The regret expression given in Lemma 1 consists of two terms. The first term comes from playing suboptimal arms, and the second term corresponds to the variance in the decisions. Since the second term is always positive, an Ω(log T) lower bound on the distribution-dependent regret can be established following similar lines as in the proofs of the lower bound results for risk-neutral MAB provided in [4, 5]. We thus focus on the lower bound on the worst-case regret, which is significantly different from its counterpart in risk-neutral MAB due to the dominant effect of the second term in the risk-averse regret under the worst-case setting.
For the results presented in the rest of this section, we assume the player observes the rewards of all arms at each time, in contrast to observing only the reward of the chosen arm. Following the terminology in the literature, we refer to this case as the full feedback setting. The lower bounds established under full feedback hold in the bandit setting as well, since extra information cannot worsen the performance. The following lemma is used in establishing the lower bound on the worst-case regret under the risk-averse setting.
Let P and Q be two probability distributions supported on some set, with P absolutely continuous with respect to Q. For any measurable function, we have
Our lower bound proof is based on a coupling argument in a 2-armed bandit. Let ν and ν′ denote two different distribution assignments for the 2-armed bandit. One arm is assigned a normal distribution with a given mean and variance. The other arm is assigned a Bernoulli distribution whose parameter differs slightly between ν and ν′. For simplicity of presentation, let us assume the difference between the parameters is small. Note that the variances of the Bernoulli arm under ν and ν′ then differ accordingly. For any arm selection policy π, we prove that, under at least one of the two assignments, the number of times the suboptimal arm is played is high in expectation.
For any arm selection policy π with full information and any parameter, in the 2-armed bandit with T rounds,
Proof. See Appendix B.
Using Lemma 4, we establish a lower bound on the worst-case regret performance of any policy π.
For any arm selection policy π with full information, there exists a distribution assignment for a 2-armed bandit under which
Proof. See Appendix C.
4 Risk-averse Policies
In this section, we introduce and analyze the performance of the Mean-Variance Lower Confidence Bound (MV-LCB) policy and the Confidence Bounds based Arm Elimination (CB-AE) policy.
MV-LCB is a modification of the classic UCB policy, first introduced in  for risk-neutral MAB and then adapted to risk-averse MAB in [12, 11]. At each time t, MV-LCB plays the arm with the smallest lower confidence bound on the mean-variance:
where the exploration constant depends on the distribution class parameter.
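A minimal sketch of one decision step of an MV-LCB-style rule follows; the confidence width and the constant `b` are illustrative assumptions, not the paper's exact choices, and the two deterministic arms are toy examples.

```python
import numpy as np

def mv_lcb_step(rewards_by_arm, t, rho, b=1.0):
    """Pick the arm with the smallest lower confidence bound on the
    sample mean-variance (schematic width b * sqrt(log t / n))."""
    scores = []
    for k, obs in enumerate(rewards_by_arm):
        if len(obs) == 0:
            return k  # play every arm once before comparing bounds
        x = np.asarray(obs, dtype=float)
        mv_hat = x.var() - rho * x.mean()
        width = b * np.sqrt(np.log(t) / len(obs))
        scores.append(mv_hat - width)
    return int(np.argmin(scores))

# Toy deterministic arms: arm 1 has the smaller mean-variance
# (MV = -rho for constant reward 1, versus 0 for constant reward 0),
# so the rule should play it most of the time.
rho, T = 0.5, 300
hist = [[], []]
for t in range(1, T + 1):
    k = mv_lcb_step(hist, t, rho)
    hist[k].append(0.0 if k == 0 else 1.0)
```

The suboptimal arm is still pulled occasionally, since its shrinking sample count keeps its confidence width wide; this exploration is what drives the logarithmic distribution-dependent bound.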
When there is a positive gap between the mean-variances of the best and the second-best arms, the regret of the MV-LCB policy satisfies the following bound, in which the constant is the distribution class parameter specified in the concentration inequalities in Lemma 2.
Proof. See Appendix D. Theorem 2 shows a logarithmic upper bound on the distribution-dependent regret of MV-LCB.
The CB-AE policy is a modification of the Improved UCB policy introduced in  and proceeds in steps m = 0, 1, 2, …. At each step m, each action in a set of active actions, initialized to the full action set, is played a prescribed number of times, initialized at a constant that depends only on the distribution class parameter. At each step, a number of actions are potentially removed from the active set based on upper and lower confidence bounds on their mean-variance, computed from the sample mean-variance obtained from the observations at step m. If the lower confidence bound of arm k is bigger than the minimum of the upper confidence bounds of all other remaining arms, arm k is removed; see lines 6-10 in Algorithm 2.
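The elimination loop above can be sketched as follows; the doubling episode lengths, the confidence width, and the constant `b` are illustrative assumptions rather than the paper's constants.

```python
import numpy as np

def cb_ae(pull, num_arms, rho, rounds=6, b=1.0):
    """Schematic arm-elimination loop in the spirit of CB-AE."""
    active = list(range(num_arms))
    obs = {k: [] for k in range(num_arms)}
    for m in range(rounds):
        n_m = 2 ** (m + 3)              # hypothetical episode length
        for k in active:
            obs[k].extend(pull(k) for _ in range(n_m))
        n = len(obs[active[0]])         # active arms share a sample count
        total = sum(len(v) for v in obs.values())
        w = b * np.sqrt(np.log(1 + total) / n)
        mv = {k: np.var(obs[k]) - rho * np.mean(obs[k]) for k in active}
        best_ucb = min(mv[k] + w for k in active)
        # Remove arms whose lower confidence bound exceeds the best
        # upper confidence bound among the remaining arms.
        active = [k for k in active if mv[k] - w <= best_ucb]
    return active
```

Because each surviving arm is played a fixed, pre-committed number of times per step, the elimination structure removes much of the randomness in the decisions, which is the intuition behind CB-AE's advantage over MV-LCB in the experiments.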
Let M denote the number of steps taken in CB-AE.
The risk-averse regret performance of the CB-AE policy satisfies
Theorem 3 shows a logarithmic upper bound on the distribution-dependent regret of CB-AE.
We compare the performance of the MV-LCB and CB-AE policies and experimentally verify the theoretical predictions discussed above. We simulate a set of Bernoulli-distributed arms over a long time horizon and repeat the experiment a number of times; see the appendix for further details on the exact simulation setup. We vary the MV gap between the optimal arm and all other arms to simulate different settings in which the optimal arm is quite easy to identify or, alternatively, in which the arms become nearly indistinguishable.
We show the performance of the two policies in terms of regret in Figure 1. As expected, CB-AE shows a better regret performance in the simulations in comparison to MV-LCB. The reason is that CB-AE, by fixing the arm elimination structure, reduces the variance in the decisions.
While both policies show a linear worst-case regret performance, it is worth mentioning that MV-LCB has a linear regret performance in all settings where a suboptimal arm has a sufficiently small mean-variance gap. On the other hand, CB-AE, as can be seen from the upper bound in Theorem 3, has a linear regret only in a particular corner case. The CB-AE policy recovers sublinear regret for the smaller values of the gap; that is, nearly equivalent good arms do not cause a linear regret, in contrast to the MV-LCB case, which is a useful property from a practical perspective.
In this paper, we studied MAB problems under a mean-variance measure. We showed that a dominant term in the risk-averse regret comes from the variance in the decisions. We established fundamental limits on learning policies: while a logarithmic distribution-dependent regret is achievable by UCB-type policies, similarly to the risk-neutral MAB, an Ω(T^{2/3}) worst-case regret is inevitable, in contrast to the Ω(√T) counterpart lower bound in the risk-neutral setting.
Proof 1 (Proof of Lemma 1)
We analyze the mean and the variance of the observed reward at time t under policy π. For the mean, we have:
For the variance, we have
We analyze the three terms in (14) separately.
The first term:
The last equality is proven similarly to (13).
The second term:
Equation (16) holds because and .
The third term:
Proof 2 (Proof of Lemma 4)
For the KL divergence between and , we have
Inequality (21) is obtained based on a truncated Taylor expansion, and the last inequality holds for all .
Let P_ν denote the joint distribution of the samples drawn from ν, and P_ν′ the joint distribution of the samples drawn from ν′.
Inequality (22) is obtained by Lemma 3. Inequality (23) is based on the assumption of i.i.d. samples for each arm over time, and (24) is obtained by replacing the upper bound on the KL divergence from (21). To derive the desired lower bound in (6), we consider two cases as follows.
If , then
If , then
Inequality (26) holds for .
Proof 3 (Proof of Theorem 1)
Let us consider a set of time instances. For each such set and any policy π in a 2-armed bandit, we construct a new policy, based on π, that is obtained by altering the decisions of π on this set. In particular,
In a 2-armed bandit, let , where and . For the second term in the regret expression given in (4), we have
The first term in the regret expression given in (4) is always positive. Thus
Since or , we have, for all