# Decision Variance in Online Learning

Online learning has classically focused on the expected behaviour of learning policies. Recently, risk-averse online learning has gained much attention. In this paper, a risk-averse multi-armed bandit problem, where the performance of policies is measured by the mean-variance of the rewards, is studied. The variance of the rewards depends on the variance of the underlying processes as well as the variance of the player's decisions. The performance of two existing policies is analyzed and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although an O(log T) distribution-dependent regret over time horizon T is achievable (similar to the risk-neutral setting), the worst-case (i.e., minimax) regret is lower bounded by Ω(T) (in contrast to the Ω(√T) lower bound in the risk-neutral setting). The lower bound results are even stronger in the sense that they are proven for the case of online learning with full feedback.


## 1 Introduction

Multi-armed bandit (MAB) is a class of online learning and decision-making problems. The origins of the problem can be traced back to the work of Thompson [1] on clinical trial applications. In this classic problem, there are K arms and a single player who plays the arms sequentially. Playing each arm at each time results in a reward drawn from an unknown distribution. The player observes only the reward of the selected arm. In the classic formulation of the problem, the objective is to maximize the cumulative reward over time or, equivalently, to minimize regret, defined as the cumulative loss in comparison to the best arm selection policy under a known distribution model (i.e., always playing the arm with the highest expected reward). The crux of the classic MAB problem is the trade-off between exploration (learning the statistics of the rewards of each arm) and exploitation (capitalizing on the gathered information to make the optimal decision at each time).

The classic MAB literature mainly focuses on the expected regret of arm selection policies. Motivated by emerging engineering and financial applications, there has recently been increased attention to MAB under risk measures. In this paper, we study a MAB problem under a mean-variance measure, which is a common risk measure in modern portfolio selection [2]. The mean-variance of a random variable X is defined as MV(X) = σ²(X) − ρμ(X), a linear combination of its mean μ(X) and its variance σ²(X), where ρ ≥ 0 is the risk tolerance parameter [3]. The higher the value of ρ, the more risk-tolerant the measure is. In mean-variance portfolio optimization, the objective is to maximize the expected return for a given level of variance or to minimize the variance for a given expected return. The parameter ρ can be interpreted as the Lagrange multiplier in this constrained optimization problem.
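As a concrete illustration of this trade-off, consider two hypothetical Gaussian arms under the sign convention MV(X) = σ²(X) − ρμ(X) assumed above (the source text states only the linear-combination form, so the convention and all numbers here are illustrative): a risk-averse player (small ρ) prefers the low-variance arm, while a risk-tolerant player (large ρ) prefers the high-mean arm.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_variance(x, rho):
    """Empirical mean-variance MV(X) = Var(X) - rho * E[X] (sign convention assumed)."""
    return np.var(x) - rho * np.mean(x)

# Two hypothetical arms: higher mean but higher variance vs. lower on both.
risky = rng.normal(1.0, 2.0, size=100_000)   # mu = 1.0, sigma^2 = 4.0
safe  = rng.normal(0.5, 0.5, size=100_000)   # mu = 0.5, sigma^2 = 0.25

# Since MV is to be minimized, a small rho favors the safe arm and a
# large rho favors the risky arm.
for rho in (0.1, 10.0):
    print(rho, mean_variance(risky, rho), mean_variance(safe, rho))
```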

Let π_t denote the arm played by an arm selection policy π at time t, and let X_{π_t,t} denote the reward obtained at time t under policy π. We define the cumulative mean-variance of observations as

$$ \text{MV}^\pi(T) = \sum_{t=1}^{T} \text{MV}(X_{\pi_t,t}). \tag{1} $$

Similar to the risk-neutral MAB, we set our objective to minimize the cumulative mean-variance of the rewards or, equivalently, to minimize the regret defined as the excess in cumulative mean-variance in comparison to the optimal arm selection policy π* under a known distribution model:

$$ R^\pi(T) = \text{MV}^\pi(T) - \text{MV}^{\pi^*}(T). \tag{2} $$

The regret definition in risk-averse MAB is similar to the one in risk-neutral MAB, except that the measure of expected value is replaced with the measure of mean-variance. However, the performance of a policy is not merely determined by the mean-variance of the rewards of the selected arms but also, as we shall see in Sec. 2, by the variance in the decisions; hence the title of the paper.

### 1.1 Contribution

In the risk-neutral MAB, an Ω(log T) lower bound on distribution-dependent regret was shown in the seminal work of Lai and Robbins [4]. An Ω(√T) lower bound on the worst-case (minimax) regret can be concluded from [5], as well as from the lower bound results for non-stochastic MAB in [6]. In Sec. 3, we establish fundamental limits on the performance of policies under the risk measure. Specifically, we show that the Ω(log T) lower bound on distribution-dependent regret also holds under the mean-variance risk measure. However, the variance in the decisions makes an Ω(T) worst-case regret inevitable. This lower bound result is even stronger in the sense that it is established in the full feedback setting, where the player observes the rewards of all arms at each time, in contrast to observing only the reward of the selected arm.

We also analyze the performance of Upper Confidence Bound (UCB)-type policies under the risk-averse measure. In particular, we study a modification of the classic UCB policy introduced in [7] for risk-neutral MAB, referred to as Mean-Variance Lower Confidence Bound (MV-LCB). We also study a policy based on arm eliminations, introduced in [8] for risk-neutral MAB, referred to as Confidence Bounds based Arm Elimination (CB-AE). We show that, while an O(log T) distribution-dependent regret is achievable, both MV-LCB and CB-AE have a linear worst-case regret in time. We also provide simulation examples showing that CB-AE outperforms MV-LCB.

### 1.2 Related Work

The risk-neutral MAB problem has been extensively studied for various applications including clinical trials, internet advertisement, web search, target tracking (see [9] and references therein), as well as various financial and economical applications (see [10] and references therein).

The MAB problem has been studied much less extensively under the measure of mean-variance. In [11] and [12], the empirical mean-variance of the observed reward process (commonly referred to as volatility) was considered as a risk measure, and a modification of UCB and a modification of DSEE (introduced in [13] for risk-neutral MAB) were shown to achieve a sublinear distribution-dependent regret and a sublinear worst-case regret, respectively. The empirical mean-variance calculated over the whole reward process is different from the actual instantaneous mean-variance considered in this paper, where the variance in the decisions causes a dominant term in regret. A matching lower bound on worst-case regret was established in [11], which showed the order optimality of the modified DSEE policy. Under a non-stochastic and full feedback setting, [14] considered a linear combination of mean and standard deviation (in contrast to mean-variance) and established a negative result showing the infeasibility of sublinear regret. In [15], the quality of an arm was measured by a general function of the mean and the variance of the random variable. That study, however, is closer to risk-neutral MAB problems than to the problem studied in this paper, in that the variance in the decisions does not affect the regret in [15] as it dominantly does in our results.

In [16, 17], MAB under the measure of value at risk was studied. In [17], learning policies using the measure of conditional value at risk were developed. However, the performance guarantees were still within the risk-neutral MAB framework (in terms of the loss in the expected total reward), under the assumption that the best arm in terms of the mean value is also the best arm in terms of the conditional value at risk. Another risk measure for MAB problems was considered in [18], in which the logarithm of the moment generating function was used as a risk measure and high-probability bounds on regret were obtained.

## 2 Problem Formulation

Consider a stochastic MAB problem with a discrete set [K] = {1, …, K} of actions. At each time t, a learner chooses an action π_t ∈ [K] and receives the corresponding reward X_{π_t,t}, drawn from an unknown distribution f_{π_t}. The rewards are independent across arms and i.i.d. over time. Let F = (f_1, …, f_K) denote the set of distributions. We use E_F and Pr_F to denote the expectation and probability with respect to F, and drop the subscript when it is clear from the context.

An arm selection policy π specifies a sequence of mappings from the history of observations to the arm to play at each time t: π = (π_1, π_2, …). We use {X_{π_t,t}}_{t ≥ 1} to denote the random reward sequence under policy π.

The mean-variance of a random variable X is defined as MV(X) = σ²(X) − ρμ(X), a linear combination of its mean μ(X) and variance σ²(X), with risk tolerance parameter ρ ≥ 0. We also use the notation μ_k, σ_k², and MV_k for the mean, the variance, and the mean-variance of arm k, respectively.

The objective is to design an arm selection policy that minimizes the total mean-variance of the observations or, equivalently, minimizes the regret defined as

$$ R^\pi(T) = \sum_{t=1}^{T} \text{MV}(X_{\pi_t,t}) - \sum_{t=1}^{T} \text{MV}(X_{\pi^*_t,t}), \tag{3} $$

where π*, with π*_t = k* for all t, is the optimal policy under a known model, and k* denotes the arm with the smallest mean-variance. Unlike in the risk-neutral MAB, it is not obvious that the optimal policy under a known model is a single-arm policy. We shall show this using Lemma 1.

Let I[·] denote the indicator function: for an event E, I[E] = 1 if and only if E is true and I[E] = 0 otherwise. Let τ_{k,t} denote the number of times that arm k has been played up to time t.

The risk-neutral regret, where the performance measure is the expected value of the rewards, can be expressed as a weighted sum of the expected counts E[τ_{k,T}] with weights given by the gaps in the mean values. In contrast, the risk-averse regret given in (3) also depends on the variance of the decisions through time. Lemma 1 provides an expression of regret which is used throughout the paper to analyze the performance of the policies. Let Γ_k = MV_k − MV_{k*} denote the mean-variance gap and Δ_k = μ_{k*} − μ_k the mean gap of arm k.

###### Lemma 1

The regret of a policy under the measure of total mean-variance of rewards satisfies

$$ R^\pi(T) = \sum_{k=1}^{K} E[\tau_{k,T}]\,\Gamma_k + \sum_{t=1}^{T} E\left[\left(\sum_{k \in [K]\setminus k^*} \big(I[\pi_t = k] - \Pr[\pi_t = k]\big)\,\Delta_k\right)^{2}\right]. \tag{4} $$

Proof. See Appendix A.

The regret expression given in Lemma 1 shows that for any policy π, R^π(T) ≥ 0, with equality for the single-arm policy that always plays k*; this proves that the optimal single-arm policy is the optimal policy under the risk-averse measure. The second term in the regret expression in (4) captures the variance of the decisions.
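The role of the second term in (4) can be seen in a quick Monte Carlo sketch. Here two hypothetical Gaussian arms are tuned so that each has mean-variance zero (taking the convention MV(X) = σ²(X) − ρμ(X) with ρ = 1; all numbers are illustrative), yet a uniformly randomized policy still pays p(1 − p)Δ² = 0.25 per step: the entire cost is decision variance.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 1.0, 500_000

# Two hypothetical Gaussian arms with MV = sigma^2 - rho*mu = 0 each,
# but different means (Delta = 1).
mu = np.array([1.0, 2.0])
sigma = np.sqrt(np.array([1.0, 2.0]))

# Uniformly randomized policy: each step picks either arm with prob 1/2.
arm = rng.integers(0, 2, size=n)
x = rng.normal(mu[arm], sigma[arm])

# Per-step mean-variance of the observed reward.  The decision-variance
# term of Lemma 1 predicts p(1-p)*Delta^2 = 0.25 even though each arm's
# own mean-variance is 0.
mv_mixture = np.var(x) - rho * np.mean(x)
print(round(mv_mixture, 3))  # close to 0.25
```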

#### Concentration Inequalities.

We assume the random variable (X_{k,1} − μ_k)², for all k ∈ [K], is sub-Gaussian with parameter b, i.e., its moment generating function is bounded by that of a Gaussian distribution with variance b²:

$$ E\big[\exp\big(u\big((X_{k,1}-\mu_k)^2-\sigma_k^2\big)\big)\big] \le \exp\left(\frac{u^2 b^2}{2}\right). $$

As a result of the Chernoff-Hoeffding bound ([21]), we have the concentration inequalities on the sample mean and the sample mean-variance given in Lemma 2. The sample mean, the sample variance and the sample mean-variance of each arm k up to time t are denoted, respectively, by μ̄_{k,t}, σ̄²_{k,t}, and M̄V_{k,t} = σ̄²_{k,t} − ρμ̄_{k,t}, each computed from the τ_{k,t} observations of arm k. To keep the notation uncluttered, we drop the specification of the policy from τ_{k,t}, μ̄_{k,t}, σ̄²_{k,t}, and M̄V_{k,t}.

###### Lemma 2 (Lemma 1 in [11])

Let M̄V_t be the sample mean-variance of a random variable X obtained from t i.i.d. observations. Let μ = E[X] and σ² = Var(X), and assume that (X − μ)² has a sub-Gaussian distribution, i.e.,

$$ E\big[e^{u((X-\mu)^2-\sigma^2)}\big] \le e^{\zeta_1 u^2/2} $$

for some constant ζ_1. As a result, X has a sub-Gaussian distribution, i.e.,

$$ E\big[e^{u(X-\mu)}\big] \le e^{\zeta_0 u^2/2}. $$

Let ζ = max{ζ_0, ζ_1}. We have, for all constants δ > 0 and a suitable constant α > 0 depending on ζ,

$$ \begin{cases} \Pr\big[\,\overline{MV}_t - \text{MV}(X) > \delta\,\big] \le 2\exp\left(-\dfrac{\alpha t \delta^2}{(2+\rho)^2}\right),\\[2ex] \Pr\big[\,\overline{MV}_t - \text{MV}(X) < -\delta\,\big] \le 2\exp\left(-\dfrac{\alpha t \delta^2}{(2+\rho)^2}\right). \end{cases} $$
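The exponential decay asserted by Lemma 2 can be checked by simulation. The sketch below estimates the deviation probability of the sample mean-variance of a standard Gaussian arm for growing sample sizes; the values of ρ, δ, the sample sizes, and the number of trials are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, rho = 0.0, 1.0, 1.0
true_mv = sigma**2 - rho * mu   # = 1.0 for this hypothetical arm
delta, trials = 0.3, 2000

# Estimate P[|MV_hat_t - MV(X)| > delta] over many independent runs for
# growing t; Lemma 2 predicts decay exponential in t.
probs = []
for t in (10, 40, 160):
    samples = rng.normal(mu, sigma, size=(trials, t))
    mv_hat = samples.var(axis=1) - rho * samples.mean(axis=1)
    probs.append(float(np.mean(np.abs(mv_hat - true_mv) > delta)))
print(probs)  # decreasing in t
```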

## 3 Lower Bounds

The regret expression given in Lemma 1 consists of two terms. The first term comes from playing suboptimal arms and the second term corresponds to the variance in the decisions. Since the second term is always positive, an Ω(log T) lower bound for distribution-dependent regret can be established following similar lines as in the proofs of the lower bound results for risk-neutral MAB provided in [4, 5]. We thus focus on the lower bound for worst-case regret, which is significantly different from its counterpart in risk-neutral MAB due to the dominant effect of the second term in risk-averse regret under the worst-case setting.

For the results presented in the rest of this section, we assume the player observes the rewards of all arms at each time t, in contrast to observing only the reward of the chosen arm. Following the terminology in the literature, we refer to this case as the full feedback setting. The lower bounds established under full feedback hold in the bandit setting as well (since extra information cannot worsen the achievable performance). The following lemma is used in establishing the lower bound on worst-case regret under the risk-averse setting.

###### Lemma 3

Let ν and ν′ be two probability distributions supported on some set 𝒳, with ν′ absolutely continuous with respect to ν. For any measurable function φ: 𝒳 → {0, 1}, we have

$$ \Pr_\nu(\varphi(X)=1) + \Pr_{\nu'}(\varphi(X)=0) \ge \frac{1}{2}\exp\big(-\text{KL}(\nu,\nu')\big). \tag{5} $$

The notation Pr_ν denotes the probability measure with respect to ν. Lemma 3 was used in [5] to establish a lower bound for the risk-neutral bandit with side information.
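Lemma 3 can be checked numerically on a toy case. The sketch below enumerates all deterministic tests φ on the support of two Bernoulli distributions (the parameter pairs are hypothetical) and compares the best achievable error sum with the bound ½·exp(−KL(ν, ν′)).

```python
import numpy as np
from itertools import product

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def min_test_error(p, q):
    """Smallest Pr_nu(phi=1) + Pr_nu'(phi=0) over all deterministic
    tests phi: {0,1} -> {0,1}, with nu = Bern(p), nu' = Bern(q)."""
    best = np.inf
    for phi0, phi1 in product([0, 1], repeat=2):
        pr_nu_1 = (1 - p) * phi0 + p * phi1                # Pr_nu(phi(X)=1)
        pr_nu2_0 = (1 - q) * (1 - phi0) + q * (1 - phi1)   # Pr_nu'(phi(X)=0)
        best = min(best, pr_nu_1 + pr_nu2_0)
    return best

ok = True
for p, q in [(0.3, 0.5), (0.1, 0.9), (0.45, 0.55)]:
    lhs = min_test_error(p, q)
    rhs = 0.5 * np.exp(-kl_bern(p, q))
    ok = ok and (lhs >= rhs)
    print(p, q, round(lhs, 3), round(rhs, 3))
print(ok)
```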

Our lower bound proof is based on a coupling argument in a 2-armed bandit. Let F = (f_1, f_2) and F′ = (f_1′, f_2′) denote two different distribution assignments for a 2-armed bandit. Let f_1 = f_1′ be a normal distribution with mean μ_1 and variance σ_1². Also, let f_2 and f_2′ be Bernoulli distributions with parameters p = 1/4 + 2Γ and q = 1/4 − 2Γ, respectively, with Γ > 0. For simplicity of presentation, let us normalize the risk parameter to ρ = 1. Note that for the variances of the above distributions we have σ_2² − σ_2′² = 2Γ, where σ_2² and σ_2′² denote the variance of arm 2 under F and F′, respectively, and σ_1² is chosen between σ_2′² and σ_2² so that a different arm is optimal under each assignment. For any arm selection policy π, we prove that, under at least one of the two systems, the number of times the suboptimal arm is played is high in expectation. Specifically, we show a lower bound on E_F[τ_{2,T}] ∨ E_{F′}[τ_{1,T}].

###### Lemma 4

For any arm selection policy π with full information and any gap parameter Γ ∈ (0, 1/8], in the 2-armed bandit above with number of rounds T ≥ 100,

$$ E_F[\tau_{2,T}] \vee E_{F'}[\tau_{1,T}] \ge \frac{0.01}{\Gamma^2} \wedge \frac{T}{2e}. \tag{6} $$

Proof. See Appendix B.

Using Lemma 4, we establish a lower bound on the worst case regret performance of any policy .

###### Theorem 1

For any arm selection policy π with full information, there exists a distribution assignment F to a 2-armed bandit under which

$$ R^\pi_F(T) \ge \frac{T}{4e}. \tag{7} $$

Proof. See Appendix C.

## 4 Risk-averse Policies

In this section, we introduce and analyze the performance of Mean-Variance Lower Confidence Bound (MV-LCB) policy and Confidence Bounds based Arm Elimination (CB-AE) policy.

### 4.1 MV-LCB

MV-LCB is a modification of the classic UCB policy, first introduced in [7] for the risk-neutral MAB and later adopted for risk-averse MABs in [12, 11]. At each time t, MV-LCB plays the arm with the smallest lower confidence bound on its mean-variance:

$$ \pi_t^{\text{MV-LCB}} = \arg\min_{k} \left\{ \overline{MV}_{k,t} - \sqrt{\frac{c \log t}{\tau_{k,t}}} \right\}, \tag{8} $$

where c is a constant that depends on the distribution class parameters specified in Lemma 2.
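A minimal sketch of the MV-LCB rule in (8), assuming the MV(X) = σ²(X) − ρμ(X) convention used above; the reward samplers, the horizon, and the confidence constant c = 2 are hypothetical choices, not values prescribed by the analysis.

```python
import numpy as np

def mv_lcb(arms, T, rho=1.0, c=2.0):
    """Sketch of MV-LCB: play the arm minimizing a lower confidence bound
    on its sample mean-variance.  `arms` is a list of zero-argument
    reward samplers; `c` is an assumed confidence constant."""
    K = len(arms)
    # Initialization: play each arm twice so the sample variance is defined.
    rewards = [[arm(), arm()] for arm in arms]
    choices = []
    for t in range(2 * K + 1, T + 1):
        lcb = []
        for k in range(K):
            x = np.asarray(rewards[k])
            mv_hat = x.var() - rho * x.mean()
            lcb.append(mv_hat - np.sqrt(c * np.log(t) / len(x)))
        k = int(np.argmin(lcb))
        rewards[k].append(arms[k]())
        choices.append(k)
    return choices

rng = np.random.default_rng(3)
# Hypothetical 2-armed instance: with rho = 1, arm 0 has mean-variance
# 0.25 - 1 = -0.75 and arm 1 has 4 - 1 = 3, so arm 0 is optimal.
arms = [lambda: rng.normal(1.0, 0.5), lambda: rng.normal(1.0, 2.0)]
choices = mv_lcb(arms, T=5000)
print(np.mean(np.array(choices) == 0))  # fraction of plays on the better arm
```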

###### Theorem 2

When there is a positive gap in the mean-variances of the best and the second-best arms, and for a sufficiently large constant c (in terms of the distribution class parameter specified in the concentration inequalities in Lemma 2), the regret of the MV-LCB policy satisfies

$$ R^{\pi_{\text{MV-LCB}}}(T) \le \sum_{k \in [K]\setminus k^*} \left( \left(\frac{4c \log T}{\Gamma_k^2} + 5\right) \wedge T \right) \left( \Gamma_k + \frac{(K-1)\Delta_k^2}{4} \right). \tag{9} $$

Proof. See Appendix D. Theorem 2 shows a logarithmic upper bound on the distribution-dependent regret of MV-LCB.

### 4.2 CB-AE

The CB-AE policy is a modification of Improved UCB introduced in [8], which proceeds in steps m = 1, 2, …. At each step m, a set of active arms A_m, initialized as A_1 = [K], is played, each arm n_m times, where n_1 is initialized at a constant that depends only on the distribution class parameter and n_m grows with m. At each step, a number of arms are potentially removed from A_m based on upper and lower confidence bounds on their mean-variance, in the form of M̄V_{k,m} + c_m and M̄V_{k,m} − c_m respectively, where M̄V_{k,m} is the sample mean-variance obtained from the observations at step m. If the lower confidence bound of arm k is bigger than the minimum of the upper confidence bounds of all other remaining arms, arm k is removed; see lines 6-10 in Algorithm 2.
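The elimination rule described above can be sketched as follows; the doubling schedule, the constant C, and the confidence radius are simplified stand-ins for the exact constants and schedule of Algorithm 2, and the two-arm instance is hypothetical.

```python
import numpy as np

def cb_ae(arms, T, rho=1.0, C=2.0):
    """Sketch of CB-AE: proceed in steps, play every surviving arm an
    equal (doubling) number of times, then eliminate any arm whose
    mean-variance LCB exceeds the smallest UCB among the other
    survivors.  `C` is an assumed confidence constant."""
    K = len(arms)
    active = list(range(K))
    samples = {k: [] for k in range(K)}
    spent, m = 0, 1
    while spent < T:
        n_m = 2 ** m  # plays per surviving arm at step m
        for k in active:
            for _ in range(n_m):
                if spent >= T:
                    break
                samples[k].append(arms[k]())
                spent += 1
        rad = np.sqrt(C * np.log(T) / n_m)  # confidence radius at step m
        mv = {k: np.var(samples[k]) - rho * np.mean(samples[k]) for k in active}
        if len(active) > 1:
            keep = [k for k in active
                    if mv[k] - rad <= min(mv[j] + rad for j in active if j != k)]
            active = keep or active
        m += 1
    return active, samples

rng = np.random.default_rng(4)
# Same hypothetical instance as before: arm 0 has the smaller mean-variance.
arms = [lambda: rng.normal(1.0, 0.5), lambda: rng.normal(1.0, 2.0)]
active, samples = cb_ae(arms, T=20000)
print(active, len(samples[0]), len(samples[1]))
```

Once the suboptimal arm is eliminated, the policy commits to the survivor, which is exactly the mechanism that reduces decision variance relative to MV-LCB.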

Let n_k denote the number of times arm k is played, with n_max = max_k n_k, and let M be the number of steps taken in CB-AE. Let Δ_max = max_{k ∈ [K]∖k*} Δ_k.

###### Theorem 3

The risk-averse regret performance of the CB-AE policy, for an appropriate choice of the confidence constants, satisfies

$$ \begin{aligned} R^{\pi_{\text{CB-AE}}}(T) \le{} & \sum_{k \in [K]\setminus k^*} \left( \left(\frac{4C_3 \log T}{\Gamma_k^2} + \log_2\frac{1}{\Gamma_k} + K\log_2 T + \frac{2}{T^3}\right) \wedge T \right) \Gamma_k \\ & + 12\,\Delta_{\max}^2 \log_2 T \sum_{k \in [K]\setminus k^*} \left(\frac{C\log T}{\Gamma_k^2} + 1\right) I[n_k \le n_{\max}] \\ & + \left( \frac{K\log_2 T + 2}{T^4} + \frac{K\log_2 T}{T} \right) \frac{(K-1)^2\, T\, \Delta_{\max}^2}{4}. \end{aligned} \tag{10} $$

Theorem 3 shows a logarithmic upper bound on the distribution-dependent regret of CB-AE.

## 5 Simulations

We compare the performance of the MV-LCB and CB-AE policies and experimentally verify the theoretical predictions discussed above. We simulate a set of Bernoulli-distributed arms over a long time horizon and repeat the experiment a large number of times; see the appendix for further details on the exact simulation setup. We modify the mean-variance gap between the optimal arm and all other arms to simulate different settings, ranging from ones where the optimal arm is easy to identify to ones where the arms become nearly indistinguishable as the gap shrinks.

We show the performance of the two policies in terms of regret in Figure 1. As expected, CB-AE shows better regret performance than MV-LCB in the simulations. The reason is that CB-AE, by fixing the arm elimination structure, reduces the variance in the decisions.

While both policies show a linear worst-case regret performance, it is worth mentioning that MV-LCB has a linear regret performance in all settings where there exists a suboptimal arm k with a vanishing mean-variance gap Γ_k but a non-vanishing mean gap Δ_k. On the other hand, CB-AE, as can be seen from the upper bound in Theorem 3, has a linear regret only in a particular corner of the parameter space. CB-AE recovers sublinear regret for the smaller values of the gap, that is, nearly equivalent good arms do not cause a linear regret, in contrast to the MV-LCB case, which is a useful property from a practical perspective.

## 6 Conclusion

In this paper, we studied MAB problems under a mean-variance measure. We showed that a dominant term in risk-averse regret comes from the variance in the decisions. We established fundamental limits on learning policies: while a logarithmic distribution-dependent regret is achievable by UCB-type policies, similar to the risk-neutral MAB, an Ω(T) worst-case regret is inevitable, in contrast to the Ω(√T) counterpart lower bound in the risk-neutral setting.

## Appendix A

###### Proof 1 (Proof of Lemma 1)

We analyze the mean and the variance of the observed reward at time t under policy π. For the mean we have:

$$ \begin{aligned} E[X_{\pi_t,t}] &= E\left[\sum_{k=1}^K I[\pi_t=k]\, X_{k,t}\right] = \sum_{k=1}^K E\big[I[\pi_t=k]\, X_{k,t}\big] & (11)\\ &= \sum_{k=1}^K E\Big[E\big[I[\pi_t=k]\, X_{k,t} \,\big|\, I[\pi_t=k]\big]\Big] = \sum_{k=1}^K E\Big[I[\pi_t=k]\, E\big[X_{k,t} \,\big|\, I[\pi_t=k]\big]\Big] & (12)\\ &= \sum_{k=1}^K \Pr[\pi_t=k]\,\mu_k. & (13) \end{aligned} $$

Equation (11) comes from the linearity of expectation, and equation (12) is a result of the property of conditional expectation that for two random variables X and Y, E[XY] = E[X E[Y | X]].

For the variance of , we have

$$ \begin{aligned} &E\big[(X_{\pi_t,t} - E[X_{\pi_t,t}])^2\big] \\ &= E\left[\left(\sum_{k=1}^K I[\pi_t=k] X_{k,t} - E\left[\sum_{k=1}^K I[\pi_t=k] X_{k,t}\right]\right)^{2}\right] \\ &= E\left[\left(\sum_{k=1}^K I[\pi_t=k] X_{k,t} - \sum_{k=1}^K I[\pi_t=k]\mu_k + \sum_{k=1}^K I[\pi_t=k]\mu_k - \sum_{k=1}^K \Pr[\pi_t=k]\mu_k\right)^{2}\right] \\ &= E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k) + \sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right)^{2}\right] \\ &= \underbrace{E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k)\right)^{2}\right]}_{\text{first term}} + \underbrace{E\left[\left(\sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right)^{2}\right]}_{\text{second term}} \\ &\quad + 2\,\underbrace{E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k)\right)\left(\sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right)\right]}_{\text{third term}} \end{aligned} \tag{14} $$

We analyze the three terms in (14) separately.

The first term:

$$ \begin{aligned} &E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k)\right)^{2}\right] \\ &= E\left[\sum_{j=1}^K\sum_{k=1}^K I[\pi_t=j]\, I[\pi_t=k]\, (X_{j,t}-\mu_j)(X_{k,t}-\mu_k)\right] \\ &= \sum_{k=1}^K E\big[I[\pi_t=k](X_{k,t}-\mu_k)^2\big] + \sum_{j=1}^K \sum_{\substack{k=1\\ k\neq j}}^K E\big[I[\pi_t=j]\, I[\pi_t=k]\,(X_{j,t}-\mu_j)(X_{k,t}-\mu_k)\big] \\ &= \sum_{k=1}^K \Pr[\pi_t=k]\,\sigma_k^2, \end{aligned} \tag{15} $$

where the cross terms vanish since I[π_t = j] I[π_t = k] = 0 for j ≠ k.

The last equality is proven similar to (13).

The second term:

$$ \begin{aligned} &E\left[\left(\sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right)^{2}\right] \\ &= E\left[\left(\sum_{\substack{k=1\\k\neq k^*}}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k + \big(I[\pi_t=k^*]-\Pr[\pi_t=k^*]\big)\mu_{k^*}\right)^{2}\right] \\ &= E\left[\left(\sum_{\substack{k=1\\k\neq k^*}}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k + \left(\Big(1-\sum_{\substack{k=1\\k\neq k^*}}^K I[\pi_t=k]\Big) - \Big(1-\sum_{\substack{k=1\\k\neq k^*}}^K \Pr[\pi_t=k]\Big)\right)\mu_{k^*}\right)^{2}\right] & (16)\\ &= E\left[\left(\sum_{\substack{k=1\\k\neq k^*}}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\Delta_k\right)^{2}\right]. & (17) \end{aligned} $$

Equation (16) holds because I[π_t = k*] = 1 − Σ_{k≠k*} I[π_t = k] and Pr[π_t = k*] = 1 − Σ_{k≠k*} Pr[π_t = k].

The third term:

$$ \begin{aligned} &E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k)\right)\left(\sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right)\right] \\ &= E\left[E\left[\left(\sum_{k=1}^K I[\pi_t=k](X_{k,t}-\mu_k)\right)\left(\sum_{k=1}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\mu_k\right) \,\middle|\, \pi_t\right]\right] = 0. \end{aligned} \tag{18} $$

Combining (13), (14), (15), (17), (18), we have

$$ \text{MV}(X_{\pi_t,t}) = \sum_{k=1}^K \Pr[\pi_t=k]\,\text{MV}_k + E\left[\left(\sum_{\substack{k=1\\k\neq k^*}}^K \big(I[\pi_t=k]-\Pr[\pi_t=k]\big)\Delta_k\right)^{2}\right]. \tag{19} $$

Summing up the mean-variance of observations at each time and subtracting that of the optimal single-arm strategy, we arrive at (4).

## Appendix B

###### Proof 2 (Proof of Lemma 4)

For the KL divergence between and , we have

$$ \begin{aligned} \text{KL}(f_2, f_2') &= p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \\ &= -\Big(\tfrac{1}{4}+2\Gamma\Big)\log\frac{\tfrac{1}{4}-2\Gamma}{\tfrac{1}{4}+2\Gamma} - \Big(\tfrac{3}{4}-2\Gamma\Big)\log\frac{\tfrac{3}{4}+2\Gamma}{\tfrac{3}{4}-2\Gamma} \\ &= -\Big(\tfrac{1}{4}+2\Gamma\Big)\log\left(1-\frac{4\Gamma}{\tfrac{1}{4}+2\Gamma}\right) - \Big(\tfrac{3}{4}-2\Gamma\Big)\log\left(1+\frac{4\Gamma}{\tfrac{3}{4}-2\Gamma}\right) \\ &\le -\Big(\tfrac{1}{4}+2\Gamma\Big)\left(-\frac{4\Gamma}{\tfrac{1}{4}+2\Gamma} - \frac{1}{2}\Big(\frac{4\Gamma}{\tfrac{1}{4}+2\Gamma}\Big)^{2} - \frac{1}{3}\Big(\frac{4\Gamma}{\tfrac{1}{4}+2\Gamma}\Big)^{3}\right) \\ &\quad - \Big(\tfrac{3}{4}-2\Gamma\Big)\left(\frac{4\Gamma}{\tfrac{3}{4}-2\Gamma} + \frac{1}{2}\Big(\frac{4\Gamma}{\tfrac{3}{4}-2\Gamma}\Big)^{2} + \frac{1}{3}\Big(1+\tfrac{1}{8}\Big)^{3}\Big(\frac{4\Gamma}{\tfrac{3}{4}-2\Gamma}\Big)^{3}\right) \\ &= \Gamma^2\left(\frac{8}{\tfrac{1}{4}+2\Gamma} + \frac{64\Gamma}{3\big(\tfrac{1}{4}+2\Gamma\big)^{2}} - \frac{8}{\tfrac{3}{4}-2\Gamma} - \frac{64\Gamma\big(1+\tfrac{1}{8}\big)^{3}}{3\big(\tfrac{3}{4}-2\Gamma\big)^{2}}\right) \\ &\le 22\,\Gamma^2. \end{aligned} \tag{21} $$

The first inequality in (21) is obtained from a truncated Taylor expansion of log(1 + x), and the last inequality holds for all Γ ≤ 1/8.

Let f_2^{(t)} denote the joint distribution of the first t samples drawn from f_2, and similarly f_2'^{(t)} for f_2'.

$$ \begin{aligned} E_F[\tau_{2,T}] \vee E_{F'}[\tau_{1,T}] &\ge \frac{1}{2}\big(E_F[\tau_{2,T}] + E_{F'}[\tau_{1,T}]\big) = \frac{1}{2}\sum_{t=1}^T \big(\Pr_F[\pi_t=2] + \Pr_{F'}[\pi_t=1]\big) \\ &\ge \frac{1}{2}\sum_{t=1}^T \exp\Big(-\text{KL}\big(f_2^{(t)}, f_2'^{(t)}\big)\Big) & (22)\\ &= \frac{1}{2}\sum_{t=1}^T \exp\left(-\sum_{s=1}^t \text{KL}(f_2, f_2')\right) & (23)\\ &\ge \frac{1}{2}\sum_{t=1}^T \exp\big(-22\, t\, \Gamma^2\big). & (24) \end{aligned} $$

Inequality (22) is obtained by Lemma 3. Equality (23) is based on the assumption of i.i.d. samples for each arm over time (the KL divergence of a product distribution is the sum of the marginal KL divergences), and (24) is obtained by substituting the upper bound on the KL divergence from (21). To derive the desired lower bound in (6), we consider two cases for Γ as follows.

#### Case 1

If Γ² ≤ 1/(22T), then

$$ \frac{1}{2}\sum_{t=1}^T \exp(-22\, t\, \Gamma^2) \ge \frac{T}{2e}. \tag{25} $$

#### Case 2

If Γ² > 1/(22T), then

$$ \begin{aligned} \frac{1}{2}\sum_{t=1}^T \exp(-22\, t\, \Gamma^2) &\ge \frac{1}{2}\int_{x=1}^T \exp(-22\, x\, \Gamma^2)\, dx & (26)\\ &= \frac{1}{44\Gamma^2}\big(\exp(-22\Gamma^2) - \exp(-22\, T\, \Gamma^2)\big) \\ &= \frac{1}{44\Gamma^2}\exp(-22\Gamma^2)\big(1 - \exp(-22(T-1)\Gamma^2)\big) \\ &\ge \frac{\exp(-\tfrac{22}{64})}{44\Gamma^2}\left(1 - \exp\Big(-\frac{T-1}{T}\Big)\right) \\ &\ge \frac{\exp(-\tfrac{22}{64})}{44\Gamma^2}\left(1 - \exp\Big(-\frac{99}{100}\Big)\right) \\ &\ge \frac{0.01}{\Gamma^2}. & (27) \end{aligned} $$

Inequality (26) holds since the integrand is decreasing in x; the subsequent steps use Γ ≤ 1/8, the case assumption 22(T−1)Γ² > (T−1)/T, and T ≥ 100.
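The final numerical step of Case 2 is easy to verify directly: the constant exp(−22/64)(1 − exp(−99/100))/44 indeed exceeds 0.01.

```python
import math

# Check the last inequality of Case 2: with Gamma <= 1/8 and T >= 100,
# exp(-22/64)/44 * (1 - exp(-99/100)) should be at least 0.01.
c = math.exp(-22 / 64) / 44 * (1 - math.exp(-99 / 100))
print(c)  # close to 0.0101
```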

Combining (24), (25) and (27), we arrive at the lemma.

## Appendix C

###### Proof 3 (Proof of Theorem 1)

Let S ⊆ {1, …, T} denote a set of time instances. For each S and any policy π in a 2-armed bandit, we construct a new policy π^S, based on π, obtained by altering the decisions of policy π on the set S. In particular,

$$ \pi^S_t = \begin{cases} \pi_t, & \text{if } t\notin S,\\ 3-\pi_t, & \text{if } t\in S. \end{cases} \tag{28} $$

In a 2-armed bandit, let Δ = |μ_1 − μ_2| denote the gap between the mean rewards of the two arms, and let k* denote the optimal arm under F. For the second term in the regret expression given in (4), we have

$$ \begin{aligned} E_F\left[\left(\sum_{\substack{k=1\\k\neq k^*}}^K \big(I[\pi_t=k]-\Pr_F[\pi_t=k]\big)\Delta_k\right)^{2}\right] &= E_F\Big[\big(\big(I[\pi_t\neq k^*]-\Pr_F[\pi_t\neq k^*]\big)\Delta\big)^{2}\Big] \\ &= \Pr_F[\pi_t\neq k^*]\big(1-\Pr_F[\pi_t\neq k^*]\big)\Delta^2. \end{aligned} $$

The first term in the regret expression given in (4) is always non-negative. Thus

$$ R^\pi_F(T) \ge \sum_{t=1}^T \Pr_F[\pi_t\neq k^*]\big(1-\Pr_F[\pi_t\neq k^*]\big)\Delta^2. \tag{29} $$

Since for each t either Pr_F[π_t ≠ k*] ≤ 1/2 or Pr_F[π_t ≠ k*] > 1/2, we have, for all