
# Minimax Optimal Algorithms for Adversarial Bandit Problem with Multiple Plays

We investigate the adversarial bandit problem with multiple plays under semi-bandit feedback. We introduce a highly efficient algorithm that asymptotically achieves the performance of the best switching m-arm strategy with minimax optimal regret bounds. To construct our algorithm, we introduce a new expert advice algorithm for the multiple-play setting. By using our expert advice algorithm, we additionally improve the best-known high-probability bound for the multi-play setting by O(√(m)). Our results are guaranteed to hold in an individual sequence manner since we have no statistical assumption on the bandit arm gains. Through an extensive set of experiments involving synthetic and real data, we demonstrate significant performance gains achieved by the proposed algorithm with respect to the state-of-the-art algorithms.


## I Introduction

### I-A Preliminaries

The multi-armed bandit problem is extensively investigated in the online learning [7, 9, 13, 34, 4, 33] and signal processing [22, 39, 11, 36, 21] literature, especially for applications where feedback is limited and exploration and exploitation must be balanced optimally. In the classical framework, the multi-armed bandit problem deals with choosing a single arm out of $K$ arms at each round so as to maximize the total reward. We study the multiple-play version of this problem, where we choose an $m$-sized subset of the $K$ arms at each round. We assume that

• The size $m$ is constant throughout the game and known a priori by the learner.

• The order of arm selections does not have an effect on the arm gains.

• The total gain of the selected arms is the sum of the gains of the selected individual arms.

• We can observe the gain of each one of the selected arms at the end of each round. Since we can observe the gains of the individual arms in the selected subset, we also obtain partial information about the other possible subset selections with common individual arms (semi-bandit feedback).

We point out that this framework is extensively used to model several real-life problems such as online shortest path and online advertisement placement[19, 25].

We investigate the multi-armed bandit problem with multiple plays (henceforth the MAB-MP problem) in an individual sequence framework, where we make no statistical assumptions on the data, in order to model chaotic, non-stationary or even adversarial environments [4]. To this end, we evaluate our algorithms from a competitive perspective and define our performance with respect to a competing class of strategies. As the competition class, we use the switching $m$-arm strategies, where the term $m$-arm denotes any $m$ distinct arms. We define the class of the switching $m$-arm strategies as the set of all deterministic $m$-arm selection sequences, where there are a total of $\binom{K}{m}^T$ sequences in a length $T$ game. We evaluate our performance with respect to the best strategy (maximum gain) in this class. We note that similar competing classes are widely used in control theory [16, 23], [20, 17], universal source coding theory [40, 24, 30], [5, 15, 38], due to their modelling power to construct competitive algorithms that also work under practical conditions.

In the class of the switching $m$-arm strategies, the optimal strategy is, by definition, the one whose $m$-arm selection yields the maximum gain at each round of the game. If the optimal strategy changes its $m$-arm selection $S-1$ times, i.e., switches, we say the optimal strategy has $S$ segments. Each such segment constitutes a part of the game (with possibly different lengths) where the optimum $m$-arm selection stays the same. For this setting, which we will refer to as the tracking the best $m$-arm setting, we introduce a highly efficient algorithm that asymptotically achieves the performance of the best switching $m$-arm strategy with minimax optimal regret bounds.

To construct our tracking algorithm, we follow the derandomizing approach [38]. We consider each $m$-arm strategy as an expert with a predetermined $m$-arm selection sequence, where the number of experts grows with $T$, and combine them in an expert advice algorithm under semi-bandit feedback. Although we have exponentially many experts, we derive an optimal regret bound with respect to the best $m$-arm strategy with a specific choice of initial weights. We then efficiently implement this algorithm with a weight-sharing network, which requires time and space. We note that our algorithm requires prior knowledge of the number of segments in the optimal strategy, i.e., $S$. However, it can be extended to a truly online form, i.e., without any knowledge of $S$, by using the analysis in [13] with an additional time complexity cost.

We point out that the state-of-the-art expert advice algorithms [18, 4] cannot combine $m$-arm sequences optimally due to the additional term in their regret bounds. Therefore, to construct an optimal algorithm, we introduce an optimal expert advice algorithm for the MAB-MP setting. In our expert advice algorithm, we utilize the structure of the expert set in order to improve the regret bounds of the existing expert advice algorithms [18, 4] up to . We then combine $m$-arm sequences optimally in this algorithm and obtain the minimax optimal regret bound. By using our expert advice algorithm, we additionally improve the best-known high-probability bound [26] by $O(\sqrt{m})$, hence close the gap between the high-probability bounds [26] and the expected regret bounds [18, 35]. Finally, we demonstrate significant performance gains achieved by our algorithms with respect to the state-of-the-art algorithms [14, 18, 35, 26, 4] through an extensive set of experiments involving synthetic and real data.

### I-B Prior Art and Comparison

The MAB-MP problem is mainly studied under three types of feedback: the full-information feedback [32], where the gains of all arms are revealed to the learner; the semi-bandit feedback [3, 14, 35, 18, 26], where the gains of the selected arms are revealed; and the full bandit feedback [10, 1, 8], where only the total gain of the selected $m$-arm is revealed. Since our study lies in the semi-bandit scenario, we focus on the relevant studies for the comparison.

The adversarial MAB-MP problem, where the player competes against the best fixed $m$-arm under semi-bandit feedback, has a regret lower bound of $O(\sqrt{mKT})$¹ for $K$ arms in a $T$ round game [3]. On the other hand, a direct application of Exp3 [4], i.e., the state-of-the-art for $m = 1$, achieves a regret bound with time and space complexity. One of the earliest studies to close this performance gap with an efficient algorithm is by György et al. [14]. They derived a regret bound with respect to the best fixed $m$-arm in hindsight with time complexity. This result is improved by Kale et al. [18] and Uchiya et al. [35] whose algorithms guarantee a regret bound with and time complexities respectively. Later, Audibert et al. [3] achieved the minimax optimal regret bound by the Online Stochastic Mirror Descent (OSMD) algorithm. The efficient implementation of OSMD is studied by Suehiro et al. [31] whose algorithm has time complexity.

¹ We use big-$O$ notation, i.e., $O(\cdot)$, to ignore constant factors and use soft-$O$ notation, i.e., $\tilde{O}(\cdot)$, to ignore the logarithmic factors as well.

We emphasize that although the minimax optimal bound has been achieved, all of these results have been proven to hold only in expectation. In practical applications, these algorithms suffer from the large variance of the unbiased estimator, which leads to regret in the worst case [4, 6]. This problem is addressed by György et al. [14] and Neu et al. [26] for the MAB-MP problem. They respectively derived and regret bounds holding with probability $1-\delta$.

In this paper, we introduce algorithms that achieve minimax optimal regret (up to logarithmic terms) with high probability for both the vanilla MAB-MP and the tracking the best $m$-arm settings. In order to generalize both settings in an optimal manner, we first introduce an optimal expert-mixture algorithm for the MAB-MP problem in Section III. In our expert-mixture algorithm, differing from the state-of-the-art [18], we exploit the structure of the expert set and introduce the notion of underlying experts. By exploiting the structure of the expert set, we improve the regret bound of the state-of-the-art expert mixture algorithm for the MAB-MP setting [18] up to and obtain the optimal regret bound against the best expert, which can follow any arbitrary strategy. We then consider the set of the deterministic $m$-arms in our expert-mixture algorithm in Remark III.1 and close the gap between the high-probability bounds [26, 14] and the expected regret bounds [18, 35] for the vanilla MAB-MP setting.

In addition to our improvement in the high-probability bound, we use our optimal expert mixture algorithm to develop a tracking algorithm for the MAB-MP setting. We note that when competing against the best switching $m$-arm strategy (as opposed to the best fixed $m$-arm), the minimax lower bound can be derived as $O(\sqrt{mSKT})$². However, similar to the case of Exp3, the direct application of the traditional multi-armed bandit algorithms to this problem suffers from poor performance guarantees. To the best of our knowledge, only György et al. [14] studied competing against the switching $m$-arm sequences and derived regret bound holding with probability $1-\delta$. In Section IV, by mixing the sets of switching $m$-arm sequences optimally in our expert mixture algorithm, we improve this result to an $\tilde{O}(\sqrt{mSKT})$ regret bound holding with probability $1-\delta$. We note that the computational complexity of our final algorithm is $O(K \log K)$, whereas György et al.'s algorithm [14, Section 6] requires per round. Therefore, we also provide a highly efficient counterpart of the state-of-the-art.

² When competing against the best switching bandit arm strategy (as opposed to the best fixed arm strategy), we can apply the $O(\sqrt{mKT})$ bound separately to each one of the $S$ segments (if we know the switching instants). Hence, maximization of the total regret bound yields a minimax bound of $O(\sqrt{mSKT})$, since the square-root function is concave and the bound is maximum when each segment is of equal length $T/S$.
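The concavity argument behind the switching lower bound can be spelled out as a short worked inequality. This is only a sketch: it assumes the per-segment bound has the form $c_0\sqrt{mKT_s}$ for a segment of length $T_s$, where $T_s$ and the constant $c_0$ are symbols introduced here for illustration and $S$ is the number of segments.

```latex
\sum_{s=1}^{S} c_0\sqrt{mKT_s}
  \;\le\; c_0\sqrt{mK}\,\sqrt{S\sum_{s=1}^{S} T_s}
  \;=\; c_0\sqrt{mSKT},
\qquad \text{with equality when } T_s = \frac{T}{S} \text{ for all } s .
```

The middle step is Jensen's inequality for the concave square root (equivalently, Cauchy-Schwarz), so equal-length segments maximize the summed bound.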

### I-C Contributions

Our main contributions are as follows:

• For the first time in the literature, we introduce an online algorithm, i.e., Exp3.MSP, that truly achieves (with minimax optimal regret bounds) the performance of the best multiple-arm selection strategy.

• We achieve this performance with computational complexity only log-linear in the number of arms, which is significantly smaller than the computational complexity of the state-of-the-art [14].

• In order to obtain the minimax optimal regret bound with Exp3.MSP, we introduce an optimal expert mixture algorithm for the MAB-MP setting, i.e., Exp4.MP. We derive a lower bound for the MAB-MP with expert advice setting and mathematically show the optimality of the Exp4.MP algorithm.

• By using Exp4.MP, we additionally improve the best-known high-probability bound for the multiple-play setting by $O(\sqrt{m})$, hence close the gap between the high-probability bounds [26, 14] and the expected regret bounds [18, 35].

### I-D Organization of the Paper

The organization of this paper is as follows: In Section II, we formally define the adversarial multi-armed bandit problem with multiple plays. In Section III, we introduce an optimal expert mixture algorithm for the MAB-MP setting. In Section IV, by using our expert mixture algorithm, we construct an algorithm that competes with the best switching $m$-arm strategy in a computationally efficient way. In Section V, we demonstrate the performance of our algorithms via an extensive set of experiments. We conclude with final remarks in Section VI.

## II Problem Description

We use bracket notation $[n]$ to denote the set of the first $n$ positive integers, i.e., $[n] = \{1, \ldots, n\}$. We use $C([n], m)$ to denote the $m$-sized combinations of the set $[n]$. We use $[K]$ to denote the set of arms and $C([K], m)$ to denote the set of all possible $m$-arm selections. We use to denote the column vector, whose component is if , and otherwise.

We study the MAB-MP problem, where we have $K$ arms and randomly select an $m$-arm at each round $t$. Based on our $m$-arm selection $U(t)$, we observe only the gains of the selected arms, i.e., $x_i(t)$ for $i \in U(t)$, and receive their sum as the gain of our selection $U(t)$. We assume $x_i(t) \in [0, 1]$ for notational simplicity; however, our derivations hold for any bounded gain after shifting and scaling in magnitude. We work in the adversarial bandit setting such that we do not assume any statistical model for the arm gains $x_i(t)$. The output of our algorithm at each round is strictly online and randomized. It is a function of only the past selections and observed gains.
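The semi-bandit feedback described above can be illustrated with a minimal sketch of a single round. The function name `play_round` and the toy values of `K` and `m` are assumptions made for illustration; the gains are drawn at random here only to have some numbers, whereas the setting itself allows them to be chosen adversarially.

```python
import random


def play_round(gains, selection):
    """Return (observed, total) for one semi-bandit round.

    gains:     list of K gains in [0, 1], fixed by the environment and hidden
               from the learner.
    selection: iterable of m distinct arm indices chosen by the learner.
    """
    chosen = sorted(set(selection))
    observed = {i: gains[i] for i in chosen}  # only the selected arms are revealed
    total = sum(observed.values())            # the round's gain is their sum
    return observed, total


rng = random.Random(0)
K, m = 6, 2                                   # illustrative sizes only
gains = [rng.random() for _ in range(K)]
selection = rng.sample(range(K), m)
observed, total = play_round(gains, selection)
```

The learner sees the individual entries of `observed`, not just `total`, which is what distinguishes semi-bandit feedback from full bandit feedback.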

In a $T$ round game, we define the variable $M_T$, which represents a deterministic $m$-arm selection sequence of length $T$, i.e., $M_T(t) \in C([K], m)$ for $t \in [T]$. In the rest of the paper, we refer to each such deterministic $m$-arm selection sequence, $M_T$, as an $m$-arm strategy. The total gain of an $m$-arm strategy and the total gain of our algorithm (for this section, say the name of our algorithm is ALG) are respectively defined as

$$G_{M_T} \triangleq \sum_{t=1}^{T} \sum_{i \in M_T(t)} x_i(t), \quad \text{and} \quad G_{\mathrm{ALG}} \triangleq \sum_{t=1}^{T} \sum_{i \in U(t)} x_i(t).$$

Since we make no statistical assumptions on the gain sequence, we define our performance with respect to the optimum strategy $M_T^*$, which is given as $M_T^* = \arg\max_{M_T} G_{M_T}$. In order to measure the performance of our algorithm, we use the notion of regret such that

$$R(T) \triangleq G_{M_T^*} - G_{\mathrm{ALG}}.$$

There are two different regret definitions for the randomized algorithms: the expected regret and high-probability regret. Since the algorithms that guarantee high-probability regret yield more reliable performance [4, 6], we provide high-probability regret with our algorithms. High-probability regret is defined as

$$\Pr[R(T) \ge \epsilon] \le \delta,$$

which means that the total gain of our selections up to $T$ is not much smaller than the total gain of the best strategy with probability at least $1-\delta$.

The regret depends on how hard it is to learn the optimum $m$-arm strategy $M_T^*$. Since at every switch we need to learn the optimal $m$-arm from scratch, we quantify the hardness of learning the optimum strategy by the number of segments it has, which we denote by $S$. Our goal is to achieve the minimax optimal regret up to logarithmic factors with high probability, i.e.,

$$\Pr\left[R(T) \ge \tilde{O}\left(\sqrt{mSKT}\right)\right] \le \delta.$$
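The gain, regret, and segment count defined in this section can be made concrete with a small sketch. It assumes that the number of segments equals one plus the number of rounds at which the selected $m$-arm changes; the helper names and the toy game are illustrative and not part of the paper.

```python
def total_gain(gains, strategy):
    """Sum the gains of the arms picked at every round.

    gains[t][i] is the gain x_i(t); strategy[t] holds the m arms picked at round t.
    """
    return sum(gains[t][i] for t, arms in enumerate(strategy) for i in arms)


def count_segments(strategy):
    """Assumed convention: segments = 1 + number of changes of the selected m-arm."""
    switches = sum(1 for a, b in zip(strategy, strategy[1:]) if set(a) != set(b))
    return 1 + switches


def regret(gains, reference, played):
    """R(T): gain of the reference strategy minus gain of the played selections."""
    return total_gain(gains, reference) - total_gain(gains, played)


# Toy game with K = 3 arms, m = 2 plays, T = 3 rounds (illustrative numbers only).
gains = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
best = [(0, 1), (0, 1), (1, 2)]      # picks the top two arms each round, one switch
played = [(0, 2), (1, 2), (0, 1)]
toy_regret = regret(gains, best, played)
toy_segments = count_segments(best)
```

In this toy game the reference strategy collects a gain of 4 over two segments, while the played selections collect 2.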

## III MAB-MP with Expert Advice

In this section, we consider selecting an $m$-arm with expert advice and introduce an optimal expert-mixture algorithm for the MAB-MP setting. We note that the primary aim of this section is to provide an optimal expert advice framework for the MAB-MP setting, on which we develop our optimal tracking algorithm in Section IV. By using our expert mixture algorithm, we additionally improve the best-known high-probability regret bound for the MAB-MP setting by $O(\sqrt{m})$ in the last remark of this section.

For this section, we define the phrase "expert advice" as the reference policies (or vectors) of the algorithm. The setting is as follows: At each round, each expert presents its $m$-arm selection advice as a $K$-dimensional vector, whose entries represent the marginal probabilities for the individual arms. The algorithm uses those vectors, along with the past performance of the experts, to choose an $m$-arm. The goal is to asymptotically achieve the performance of the best expert with high probability. For this setting, we introduce an optimal algorithm, Exp4.MP, which is shown in Algorithm 1. In Exp4.MP, instead of directly using the expert set, we use an underlying expert set to utilize the possible structure of the expert set. An underlying expert set is defined as a set of non-negative vectors such that the sums of its $m$-combinations constitute a set containing the expert advice vectors (see Figure 1). By using an underlying expert set, we replace the dependence of the regret on the size of the expert set with a dependence on the size of the underlying expert set $N_r$, and thus attain the minimax lower bound in the soft-Oh sense (proven in the following). In the rest of the paper, we use the term underlying experts to denote the elements of the underlying expert set, and the term actual experts (respectively actual advice vectors) to denote the experts (respectively expert advice) presented to the algorithm.

In Exp4.MP, we first get the actual advice vectors $\xi^k(t)$ for each actual expert $k$ in line 4. Since the entries of the actual advice vectors represent the marginal probabilities for the individual arms, they satisfy

$$\sum_{j=1}^{K} \xi^k_j(t) = m, \quad \max_{1 \le j \le K} \xi^k_j(t) \le 1, \quad \min_{1 \le j \le K} \xi^k_j(t) \ge 0.$$

Then we find the underlying experts, i.e., $\zeta^i(t)$ for $i \in [N_r]$, in line 5. We note that for the algorithms presented in this paper, we derive the underlying expert sets a priori. Therefore, our algorithms do not explicitly compute the underlying experts at each round.

In the algorithm, we keep a weight for each underlying expert, i.e., $w_i(t)$ for $i \in [N_r]$. We use those weights as a confidence measure to find the arm weights, i.e., $v_j(t)$ for $j \in [K]$, in line 6 as follows:

$$v_j(t) = \sum_{i=1}^{N_r} \frac{w_i(t)\,\zeta^i_j(t)}{\sum_{l=1}^{N_r} w_l(t)} \quad \text{for } j \in [K]. \tag{1}$$
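The arm-weight computation in (1) is a weighted average of the underlying experts' advice vectors. A minimal sketch follows; the function name `arm_weights` and the toy numbers are assumptions for illustration.

```python
def arm_weights(w, zeta):
    """Arm weights as in (1): v_j = sum_i w_i * zeta[i][j] / sum_l w_l.

    w:    list of N_r non-negative expert weights.
    zeta: list of N_r advice vectors, each of length K.
    """
    total = sum(w)
    K = len(zeta[0])
    return [sum(w_i * z[j] for w_i, z in zip(w, zeta)) / total for j in range(K)]


# Two underlying experts over K = 3 arms (illustrative numbers only).
w = [1.0, 3.0]
zeta = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
v = arm_weights(w, zeta)
```

Here the second expert carries three quarters of the total weight, so the arms it recommends receive most of the mass.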

In order to select an $m$-arm, the expected total number of selections should be $m$, i.e., $\sum_{j=1}^{K} p_j(t) = m$. To satisfy this, we cap the arm weights so that the arm probabilities are kept in the range $[0, 1]$. For the arm capping, we first check in line 7 whether any arm weight is too large to respect this range. If there is one, we find the capping threshold and define the set $U_0(t)$ that includes the indices of the weights larger than the threshold. We set the temporary weights of the arms in $U_0(t)$ to the threshold and leave the other weights unchanged, i.e., $v'_j(t) = v_j(t)$ for $j \in [K] - U_0(t)$ (the implementation of this procedure is detailed in Appendix A). We then calculate the arm probabilities with the capped arm weights by

$$p_j(t) = m\left((1-\gamma)\frac{v'_j(t)}{\sum_{l=1}^{K} v'_l(t)} + \frac{\gamma}{K}\right) \quad \text{for } j \in [K]. \tag{2}$$
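The capping step and the probability computation in (2) can be sketched as follows. The exact threshold rule lives in Appendix A, which is not reproduced here, so this sketch assumes one natural rule: choose the threshold so that every capped arm ends up with probability exactly 1. The function names, and the requirement that weights are strictly positive with the exploration parameter strictly between 0 and 1, are likewise assumptions.

```python
def cap_weights(v, m, gamma):
    """Cap large arm weights so that (2) yields probabilities of at most 1.

    Assumed rule (not taken from the text): with k capped arms and `rest` the
    total weight of the others, the threshold alpha solves
        alpha / (k * alpha + rest) = r,  r = (1/m - gamma/K) / (1 - gamma).
    Requires strictly positive weights and 0 < gamma < 1.
    """
    K = len(v)
    r = (1.0 / m - gamma / K) / (1.0 - gamma)
    total = sum(v)
    if max(v) <= r * total:
        return list(v), set()                    # no capping needed
    order = sorted(range(K), key=lambda j: v[j], reverse=True)
    rest = total
    for k in range(1, K):
        rest -= v[order[k - 1]]                  # weight outside the k largest arms
        alpha = r * rest / (1.0 - r * k)
        if v[order[k]] <= alpha:                 # remaining arms stay below the cap
            capped = set(order[:k])
            return [alpha if j in capped else v[j] for j in range(K)], capped
    raise ValueError("no valid threshold found; check that all weights are positive")


def arm_probabilities(v_capped, m, gamma):
    """Equation (2): mix the normalized capped weights with uniform exploration."""
    K = len(v_capped)
    total = sum(v_capped)
    return [m * ((1.0 - gamma) * vj / total + gamma / K) for vj in v_capped]


# One dominant arm out of K = 4, with m = 2 plays (illustrative numbers only).
v = [0.7, 0.1, 0.1, 0.1]
v_capped, U0 = cap_weights(v, m=2, gamma=0.1)
p = arm_probabilities(v_capped, m=2, gamma=0.1)
```

Under this rule the dominant arm is capped and played with probability 1, while the remaining probability mass of $m - 1$ is shared among the other arms.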

In order to efficiently select $m$ distinct arms with the marginal probabilities $p_j(t)$, we employ the Dependent Rounding (DepRound) algorithm [12] in line 16 (for the description of DepRound, see Appendix A). After selecting an $m$-arm, we observe the gain of each one of the selected arms and receive their sum as the reward of the round.
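Since Appendix A is not reproduced here, the following is a minimal sketch of a standard dependent-rounding procedure that selects exactly $m$ arms while preserving the given marginals. It assumes the probabilities sum to an integer $m$; the tolerance `eps` and the function name `dep_round` are implementation choices made for illustration.

```python
import random


def dep_round(p, rng=random, eps=1e-9):
    """Pick a subset whose inclusion probabilities match the marginals in p.

    p must contain values in [0, 1] that sum to an integer m; the returned
    list then has exactly m arm indices.
    """
    p = list(p)
    frac = [j for j, q in enumerate(p) if eps < q < 1.0 - eps]
    while len(frac) >= 2:
        i, j = frac[0], frac[1]
        a = min(1.0 - p[i], p[j])   # mass that can move from j to i
        b = min(p[i], 1.0 - p[j])   # mass that can move from i to j
        # The branch probabilities are chosen so that E[p_i] and E[p_j] are unchanged.
        if rng.random() < b / (a + b):
            p[i] += a
            p[j] -= a
        else:
            p[i] -= b
            p[j] += b
        # At least one of the two entries is now 0 or 1, so the loop terminates.
        frac = [k for k in frac if eps < p[k] < 1.0 - eps]
    return sorted(j for j, q in enumerate(p) if q > 0.5)


selection = dep_round([1.0, 1 / 3, 1 / 3, 1 / 3], random.Random(0))
```

Each iteration fixes at least one coordinate, so the procedure needs at most $K$ iterations for $K$ arms.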

To update the weights of the underlying experts, i.e., $w_i(t)$ for $i \in [N_r]$, we first find the estimated arm gains in lines 18-19:

$$\hat{x}_j(t) = \begin{cases} \dfrac{x_j(t)}{p_j(t)} & \text{if } j \in U(t) \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

Then, by using the estimated arm gains $\hat{x}_j(t)$ for $j \in [K]$, we calculate the estimated expected gain of the underlying experts by

$$\hat{y}_i(t) = \sum_{j \in [K] - U_0(t)} \zeta^i_j(t)\,\hat{x}_j(t) \quad \text{for } i \in [N_r]. \tag{4}$$

In order to obtain a high-probability bound, we use upper confidence bounds. However, we note that we cannot directly use the upper confidence bound of the single arm setting [6], since Exp4.MP includes an additional non-linear weight capping in lines 7-14. In the following, we show that we can use a similar upper bounding technique by not including the capped arms, i.e., the arms in $U_0(t)$:

$$\hat{u}_i(t) = \sum_{j \in [K] - U_0(t)} \frac{\zeta^i_j(t)}{p_j(t)}. \tag{5}$$

Then, by using $\hat{y}_i(t)$ and $\hat{u}_i(t)$, we update the weights in line 23 by

$$w_i(t+1) = w_i(t) \exp\left(\eta\left(\hat{y}_i(t) + \frac{c}{\sqrt{KT}}\,\hat{u}_i(t)\right)\right), \tag{6}$$

where $\eta$ is the learning rate and $c$ is the scaling factor, which determines the range of the confidence bound.
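The update in (3)-(6) can be sketched in a few lines. This assumes the confidence term is scaled by $c/\sqrt{KT}$ and that the confidence sum divides each advice entry by the arm probability, which is how the flattened formulas are read here; the function name and argument layout are illustrative.

```python
import math


def update_expert_weights(w, zeta, p, x_obs, capped, eta, c, T):
    """One round of the weight update for the underlying experts.

    w:      current expert weights w_i(t).
    zeta:   advice vectors zeta[i][j] of the underlying experts.
    p:      arm probabilities p_j(t).
    x_obs:  dict mapping each selected arm to its observed gain.
    capped: set U_0(t) of capped arms, excluded from both sums.
    """
    K = len(p)
    # (3): importance-weighted gain estimates, zero for unobserved arms.
    x_hat = [x_obs[j] / p[j] if j in x_obs else 0.0 for j in range(K)]
    free = [j for j in range(K) if j not in capped]
    scale = c / math.sqrt(K * T)
    new_w = []
    for w_i, z in zip(w, zeta):
        y_hat = sum(z[j] * x_hat[j] for j in free)   # (4)
        u_hat = sum(z[j] / p[j] for j in free)       # (5)
        new_w.append(w_i * math.exp(eta * (y_hat + scale * u_hat)))  # (6)
    return new_w


# Two experts that each recommend a single arm (illustrative numbers only).
new_w = update_expert_weights(
    w=[1.0, 1.0], zeta=[[1.0, 0.0], [0.0, 1.0]], p=[0.5, 0.5],
    x_obs={0: 1.0}, capped=set(), eta=0.1, c=1.0, T=2,
)
```

The expert whose arm was observed with a positive gain is boosted more, while the confidence term lifts every expert that puts mass on rarely played, uncapped arms.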

For the following theorems, we respectively define the total gain of the underlying expert with index $i$, and its estimation as

$$G_i \triangleq \sum_{t=1}^{T} \zeta^i(t) \cdot x(t) \quad \text{and} \quad \hat{G}_i \triangleq \sum_{t=1}^{T} \zeta^i(t) \cdot \hat{x}(t), \tag{7}$$

where $x(t)$ and $\hat{x}(t)$ are the column vectors containing the real and the estimated arm gains, i.e., $x(t) = [x_1(t), \ldots, x_K(t)]^\top$ and $\hat{x}(t) = [\hat{x}_1(t), \ldots, \hat{x}_K(t)]^\top$. Let us define a set $A$ that includes $m$ arbitrary underlying experts, i.e., $A \in C([N_r], m)$. Then, by using the total gain of the underlying experts in the best $A$ (in terms of the total gain), the total gain of the best actual expert can be written as

$$G_{\max} = \max_{A \in C([N_r], m)} \sum_{i \in A} G_i. \tag{8}$$

We also define the upper bounded estimated gain of a set $A$, i.e., $\hat{\Gamma}_A$, and the set with the maximum upper bounded estimated gain, i.e., $A^*$, as follows:

$$\hat{\Gamma}_A \triangleq \sum_{i \in A} \hat{G}_i + \frac{c}{\sqrt{KT}} \sum_{t=1}^{T} \sum_{i \in A} \hat{u}_i(t) \quad \text{and} \quad A^* = \arg\max_{A \in C([N_r], m)} \hat{\Gamma}_A. \tag{9}$$

In the following theorem, we provide a useful inequality that relates $\hat{\Gamma}_{A^*}$, $G_{\text{Exp4.MP}}$ and the initial weights of the underlying experts in $A^*$, i.e., $w_i(1)$ for $i \in A^*$, under a certain assumption. This inequality will be used to derive regret bounds for our algorithms in Corollary III.1 and Theorem IV.1, where we ensure that the assumption in Theorem III.1 holds.

###### Theorem III.1.

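Let $W_1$ denote $\sum_{i=1}^{N_r} w_i(1)$. Assuming

$$\eta\left(\hat{y}_i(t) + \frac{c\,\hat{u}_i(t)}{\sqrt{KT}}\right) \le 1, \quad \forall i \in [N_r] \text{ and } \forall t \in [T],$$

Exp4.MP ensures that

$$\left(1 - \gamma - \frac{2\eta K}{m}\right)\hat{\Gamma}_{A^*} + \frac{1-\gamma}{\eta}\left(\sum_{i \in A^*} \ln(w_i(1)) - m \ln\frac{W_1}{m}\right) \le G_{\text{Exp4.MP}} + c\sqrt{KT} + \frac{2\eta c^2 K}{\gamma m} \tag{10}$$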

holds for any .

###### Proof.

See Appendix B. ∎

In the following corollary, we derive the regret bound of Exp4.MP with uniform initialization.

###### Corollary III.1.

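If Exp4.MP is initialized with uniform weights and run with the parameters

$$\eta = \frac{m\gamma}{2K}, \quad \gamma = \sqrt{\frac{K \ln\frac{N_r}{m}}{mT}}, \quad c = \sqrt{m \ln\frac{N_r}{\delta}},$$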

for any and , it ensures that

$$G_{\max} - G_{\text{Exp4.MP}} \le 2\sqrt{mKT \ln\frac{N_r}{\delta}} + 4\sqrt{mKT \ln\frac{N_r}{m}} + m \ln\frac{N_r}{\delta} \tag{11}$$

holds with probability at least $1-\delta$.

###### Proof.

See Appendix B. ∎
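To get a feel for the parameter choices and the size of the bound in Corollary III.1, the quantities can be evaluated numerically. This is a sketch under the reading of the parameter formulas given in the corollary; the function names and the example values of `K`, `m`, `T`, `N_r`, and `delta` are assumptions for illustration.

```python
import math


def exp4mp_parameters(K, m, T, N_r, delta):
    """Learning rate, exploration rate, and confidence scale from Corollary III.1."""
    gamma = math.sqrt(K * math.log(N_r / m) / (m * T))
    eta = m * gamma / (2 * K)
    c = math.sqrt(m * math.log(N_r / delta))
    return eta, gamma, c


def regret_bound(K, m, T, N_r, delta):
    """Right-hand side of the high-probability bound in (11)."""
    return (2 * math.sqrt(m * K * T * math.log(N_r / delta))
            + 4 * math.sqrt(m * K * T * math.log(N_r / m))
            + m * math.log(N_r / delta))


# Illustrative values only: 10 arms, 3 plays, 10 underlying experts.
K, m, T, N_r, delta = 10, 3, 10_000, 10, 0.05
eta, gamma, c = exp4mp_parameters(K, m, T, N_r, delta)
bound = regret_bound(K, m, T, N_r, delta)
```

With these choices the last term on the right-hand side of the inequality in Theorem III.1 collapses to $c^2 = m \ln(N_r/\delta)$, which is the additive term that appears in (11).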

In the next theorem, we show that in the MAB-MP with expert advice setting, no strategy can enjoy a smaller regret guarantee than $O\left(\sqrt{mKT \ln N_r / \ln K}\right)$ in the minimax sense. Afterwards, we also show that the derived lower bound is tight and that it matches the regret bound of Exp4.MP given in Corollary III.1.

###### Theorem III.2.

Assume that $N_r = K^n$ for an integer $n$ and that $T$ is a multiple of $n$. Let us define the regret of an arbitrary forecasting strategy ALG in a game of length $T$ as

$$R_{\mathrm{ALG}}(T) = G_{\max} - G_{\mathrm{ALG}}. \tag{12}$$

Then there exists a distribution for gain assignments such that

$$\inf_{\mathrm{ALG}} \sup_{\xi} R_{\mathrm{ALG}}(T) \ge O\left(\sqrt{mKT\,\frac{\ln N_r}{\ln K}}\right), \tag{13}$$

where $\inf_{\mathrm{ALG}}$ is an infimum over all possible forecasting strategies, and $\sup_{\xi}$ is a supremum over all possible expert advice sequences.

###### Proof.

The presented proof is a modification of [29, Theorem 1] for the MAB-MP with expert advice setting. To derive a lower bound for the MAB-MP with expert advice setting, we split the interval $[T]$ into $n$ non-overlapping subintervals of length $T/n$, where each subinterval is assumed independent and indexed by $k \in [n]$. For each subinterval, we design a MAB-MP game, where the optimal policy is some different . We also design sequences of underlying expert advice, such that for every possible sequence of arms , there is an underlying expert that recommends the arms from the sequence throughout the corresponding subintervals. By using the lower bound for the vanilla MAB-MP setting [3], for each subinterval $k$, we have

$$\inf_{\mathrm{ALG}} R^k_{\mathrm{ALG}}(T/n) \ge O\left(\sqrt{mK\frac{T}{n}}\right),$$

where $R^k_{\mathrm{ALG}}(T/n)$ is the regret corresponding to the subinterval $k$. By summing all the regret components in each subinterval, and noting that $n = \ln N_r / \ln K$, we obtain

$$\inf_{\mathrm{ALG}} \sup_{\xi} R_{\mathrm{ALG}}(T) \ge O\left(\sqrt{mKT\,\frac{\ln N_r}{\ln K}}\right).$$

We note that for $N_r = K$, MAB-MP with expert advice can be reduced to the vanilla MAB-MP setting (by considering underlying experts as arms), and in this case our regret lower bound matches the lower and upper bounds for the MAB-MP shown in [3]. Therefore, we maintain that our lower bound is tight. Furthermore, we note that the regret bound of Exp4.MP in Corollary III.1 matches the presented lower bound with an additional term, while the state-of-the-art [18] provides a suboptimal regret bound (notably for as in the vanilla MAB-MP and the tracking the best $m$-arm settings). Therefore, we state that Exp4.MP is an optimal algorithm and that it is required to obtain the improvements presented in this paper.
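The summation step in the proof of Theorem III.2 can be written out explicitly. As a sketch, suppose each of the $n$ subintervals contributes a regret of at least $c_0\sqrt{mKT/n}$ for some constant $c_0$ (a symbol introduced here only for illustration), and take $n = \ln N_r / \ln K$, which is the relation the final bound implies:

```latex
\sum_{k=1}^{n} c_0\sqrt{mK\frac{T}{n}}
  \;=\; c_0\, n \sqrt{\frac{mKT}{n}}
  \;=\; c_0 \sqrt{mKTn}
  \;=\; c_0 \sqrt{mKT\,\frac{\ln N_r}{\ln K}} .
```

For $n = 1$ the expression reduces to $c_0\sqrt{mKT}$, the vanilla MAB-MP lower bound.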

In the following remark, we improve the best-known high-probability bound [26] for the vanilla $m$-arm multi-play setting by $O(\sqrt{m})$. We note that the resulting bound matches the minimax lower bound in the soft-Oh sense. Therefore, it cannot be improved in the practical sense.

###### Remark III.1.

If we use constant and deterministic actual advice vectors in Exp4.MP, i.e., where , the algorithm becomes a vanilla $K$-armed MAB-MP algorithm. We note that in this scenario, we can directly operate with , where . By Corollary III.1, if we use and , Exp4.MP guarantees the $\tilde{O}(\sqrt{mKT})$ regret bound with probability at least $1-\delta$. Since the most expensive operation of this scenario is arm capping, our algorithm achieves this performance with time and space.

## IV Competing Against the Switching Strategies

In this section, we consider competing against the switching $m$-arm strategies. We present Exp3.MSP, shown in Algorithm 2, which is guaranteed to achieve the performance of the best switching $m$-arm strategy with the minimax optimal regret bound.

We construct the Exp3.MSP algorithm by using the Exp4.MP algorithm. For this, we first consider a hypothetical scenario, where we mix each possible $m$-arm selection strategy as an actual expert in Exp4.MP. We point out that the actual advice vectors will be a repeated permutation of the vectors at each round, which we can write as the sum of $m$ sized subsets of the set . Therefore, in this hypothetical scenario, we can directly combine all possible single arm sequences as the underlying experts, where $N_r = K^T$. However, since the regret bound of Exp4.MP is $O(\sqrt{mKT \ln N_r})$, a straightforward combination of $K^T$ underlying experts produces a non-vanishing regret bound . To overcome this problem, we will assign a different prior weight for each one of the $K^T$ strategies based on its complexity cost, i.e., the number of segments $S$ (more detail will be given later on).

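Let $s_t$ be a sequence of $t$ single arm selections, and $w_{s_t}$ be its corresponding weight. For ease of notation, we define

$$\hat{n}_{s_t}(t) = \left(\hat{x}_{s_t}(t) + \frac{c}{p_{s_t}(t)\sqrt{KT}}\right) \mathbb{1}_{s_t(t) \notin U_0(t)} \tag{14}$$

where $p_{s_t}(t)$ is the probability of choosing $s_t(t)$ at round $t$, and

$$\hat{N}_{s_t}(1:t-1) = \sum_{\tau=1}^{t-1} \hat{n}_{s_t}(\tau) \tag{15}$$

where $s_t(i:j)$ denotes the $i$th through $j$th elements of the sequence $s_t$. Then, the weight of the sequence $s_t$ is given by

$$w_{s_t} = \pi_{s_t} \exp\left(\eta\, \hat{N}_{s_t}(1:t-1)\right), \tag{16}$$

where $\pi_{s_t}$ is the prior weight assigned to the sequence $s_t$.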

We point out that using non-uniform prior weights $\pi_{s_t}$ is required to have a vanishing regret bound, since the number of single arm sequences grows exponentially with $T$. As noted earlier, by Corollary III.1, the regret of Exp4.MP with uniform initialization is dependent on the logarithm of $N_r$. Therefore, combining $K^T$ strategies with uniform initialization results in a linear regret bound, which is undesirable (the average regret does not diminish). In order to overcome this problem, similar to the complexity penalties of AIC [2] and MDL [28], we assign different prior weights for each strategy based on its number of segments $S$. To get a truly online algorithm, we use a sequentially calculable prior assignment scheme that only depends on the last arm selection such that

$$\pi(s_t \mid s_t(t-1)) = \begin{cases} \dfrac{1}{K} & \text{if } t = 1 \\ 1-\beta & \text{if } s_t(t) = s_t(t-1) \text{ (no switch)} \\ \dfrac{\beta}{K-1} & \text{if } s_t(t) \ne s_t(t-1) \text{ (switch).} \end{cases} \tag{17}$$

With the assignment scheme in (17), the prior weights are sequentially calculable as

$$\pi_{s_t} = \pi(s_t \mid s_t(t-1))\, \pi_{s_t(1:t-1)}$$

and the weights of the arm selection strategies are given by

$$w_{s_t} = \pi(s_t \mid s_t(t-1))\, w_{s_t(1:t-1)} \exp\left(\eta\, \hat{n}_{s_t}(t-1)\right). \tag{18}$$
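The sequential prior in (17) can be checked on a small example. The sketch below computes the prior weight of a single arm sequence and verifies by brute force that the priors of all sequences of a given length sum to one; the function name `prior_weight` and the toy values are assumptions for illustration.

```python
from itertools import product


def prior_weight(seq, K, beta):
    """Prior weight of a single arm sequence under the scheme in (17)."""
    pi = 1.0 / K                                   # first round: uniform over arms
    for prev, cur in zip(seq, seq[1:]):
        # Staying on the same arm costs (1 - beta); switching costs beta / (K - 1).
        pi *= (1.0 - beta) if cur == prev else beta / (K - 1)
    return pi


K, T, beta = 3, 3, 0.2                             # illustrative values only
total_prior = sum(prior_weight(s, K, beta) for s in product(range(K), repeat=T))
no_switch = prior_weight((0, 0, 0), K, beta)
one_switch = prior_weight((0, 0, 1), K, beta)
```

For a small switching rate, sequences with fewer switches receive a larger prior, which is the complexity penalty the text refers to.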

In the following theorem, we show that the Exp4.MP algorithm running with $K^T$ underlying experts with the prior weighting scheme given in (17) and (18) guarantees the minimax optimal regret bound up to logarithmic factors with probability at least $1-\delta$.

###### Theorem IV.1.

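If Exp4.MP uses the prior weighting scheme given in (17) and (18), and the parameters

$$\eta = \frac{m\gamma}{2K}, \quad \beta = \frac{S-1}{T-1}, \quad \gamma = \sqrt{\frac{K \ln\left(\frac{eK(T-1)}{S-1}\right)}{mT}}, \quad c = \sqrt{mS \ln\left(\frac{eK(T-1)}{(S-1)\delta}\right)}$$

to combine all possible single arm sequences as the underlying experts, for any and

$$G_{M_T^*} - G_{\text{Exp4.MP}} \le 6\sqrt{mSKT \ln\left(\frac{eK(T-1)}{(S-1)\delta}\right)} + mS \ln\left(\frac{eK(T-1)}{(S-1)\delta}\right) \tag{19}$$

holds with probability at least $1-\delta$.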

###### Proof.

See Appendix C. ∎

Although we achieved minimax performance, we still suffer from exponential time and space requirements. In the following theorem, we show that by keeping $K$ weights and updating the weights as

$$v_j(t+1) = \frac{(1-\beta)\,\tilde{v}_j(t) + \frac{\beta}{K-1} \sum_{i \ne j} \tilde{v}_i(t)}{\sum_{l=1}^{K} \tilde{v}_l(t)}, \tag{20}$$

where

$$\tilde{v}_j(t) = \begin{cases} v_j(t) \exp\left(\eta\left(\hat{x}_j(t) + \dfrac{c}{p_j(t)\sqrt{KT}}\right)\right) & \text{if } j \in [K] - U_0(t) \\ v_j(t) & \text{otherwise,} \end{cases} \tag{21}$$

we can efficiently compute the same weights as in the hypothetical Exp4.MP run with a computational complexity linear in $K$. To show this, we extend [7, Theorem 5.1] for the MAB-MP problem:

###### Theorem IV.2.

For any , and for any , Exp4.MP algorithm that mixes all possible single arm sequences as underlying experts with the weighting scheme given in (17) and (18) has equal arm weights with Exp3.MSP algorithm, which updates its weights according to formulas in (20) and (21).

###### Proof.

See Appendix C. ∎

Theorem IV.2 proves that, for any parameter selection, Exp3.MSP is equivalent to the hypothetical Exp4.MP run. Therefore, Theorem IV.1 is valid for Exp3.MSP, which shows that Exp3.MSP has an $\tilde{O}(\sqrt{mSKT})$ regret bound holding with probability at least $1-\delta$ with respect to the optimal $m$-arm strategy. Since the most expensive operation in the algorithm is capping, Exp3.MSP requires $O(K \log K)$ time complexity per round.
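The weight-sharing update in (20) and (21) can be sketched in linear time per round. The sketch assumes that (20) normalizes the whole expression by the sum of the intermediate weights, so that the arm weights keep summing to one, and that the confidence term in (21) is $c/(p_j(t)\sqrt{KT})$. The function name and the example values are illustrative.

```python
import math


def share_update(v, p, x_hat, capped, eta, beta, c, T):
    """One weight-sharing step over K arm weights.

    v:      current arm weights v_j(t).
    p:      arm probabilities p_j(t).
    x_hat:  estimated arm gains (zero for unobserved arms).
    capped: set U_0(t) of capped arms, whose weights are not boosted.
    """
    K = len(v)
    scale = c / math.sqrt(K * T)
    # (21): exponential boost for every uncapped arm.
    v_tilde = [
        v[j] if j in capped else v[j] * math.exp(eta * (x_hat[j] + scale / p[j]))
        for j in range(K)
    ]
    total = sum(v_tilde)
    # (20): keep (1 - beta) of own mass, receive beta / (K - 1) of everyone else's.
    # Using total - v_tilde[j] for the sum over i != j keeps the step O(K).
    return [
        ((1.0 - beta) * v_tilde[j] + beta / (K - 1) * (total - v_tilde[j])) / total
        for j in range(K)
    ]


# Three arms, arm 0 observed with gain 1 (illustrative numbers only).
v_next = share_update(
    v=[0.5, 0.3, 0.2], p=[0.8, 0.7, 0.5], x_hat=[1.25, 0.0, 0.0],
    capped=set(), eta=0.05, beta=0.1, c=1.0, T=100,
)
```

The sharing term guarantees that every arm keeps a minimum amount of weight, which is what lets the algorithm recover quickly after the best $m$-arm switches.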

## V Experiments

In this section, we demonstrate the performance of our algorithms with simulations on real and synthetic data. These simulations are mainly meant to provide a visualization of how our algorithms perform in comparison to the state-of-the-art techniques and should not be seen as verification of the mathematical results in the previous sections. We note that simulations only show the loss/gain of an algorithm for a typical sequence of examples; however, the mathematical results of our paper are the worst-case bounds that hold even for adversarially-generated sequences of examples.

For the following simulations, we use four synthesized datasets and one real dataset. We compare Exp4.MP with Exp3.P [4], Exp3-IX [27], Exp3.M [35], FPL+GR.P [26], [14, Figure 2], and [18, Figure 4]. We compare Exp3.MSP with [14, Figure 4] and Exp3.S [4]. We note that all the simulated algorithms are constructed as instructed in their original publications, and their parameters are set as instructed by their respective publications. The game length and the number of segments in the best strategy are given a priori to all algorithms. In each subsection, all the compared algorithms are presented with identical games.

### V-A Robustness of the Performances

We conduct an experiment to demonstrate the robustness of our high-probability algorithms. For this, we run the algorithms several times and compare the distributions of their total gains. For the comparison, we use Exp4.MP with the deterministic and constant advice vectors. We compare Exp4.MP with Exp3.P [4], Exp3-IX [27], Exp3.M [35], FPL+GR.P [26] and the high-probability algorithm introduced by György et al. in [14]. Since György et al. did not name their algorithms, we use GYA-P to denote their high-probability algorithm. We highlight that all the algorithms except Exp3.M guarantee a regret bound with high probability, whereas Exp3.M guarantees an expected regret bound.

For this experiment, we construct a -arm bandit game where we choose bandit arms at each round. All gains are generated by independent draws of Bernoulli random variables. In the first half of the game, the mean gains of the first arms are , and the mean gains of the others are . In the second half of the game, the mean gains of the first arms have been reduced to , while the mean gains of the others have been increased to . We point out that based on these selections, the $m$-arm consisting of the last $m$ arms performs better than the others in the full game length.

We set the parameters , for all the algorithms, and for the high-probability algorithms. Our experiments are repeated times to obtain statistically significant results. Since the game environments are the same for all the algorithms, we directly compare the total gains. We study the total gain up to two interesting rounds in the game: up to , where the losses are independent and identically distributed, and up to , where the algorithms have to notice the shift in the gain distribution.

We have constructed box plots by using the resulting total gains of the algorithms. In the box plots, the lines extending from the boxes (the whiskers) illustrate the minimum and the maximum of the data. The boxes extend from the first quartile to the third quartile. The horizontal lines and the stars stand for the median and the mean of the distributions, respectively. Fig. 4 illustrates the distributions of the total gains up to . We observe that the variances of all the total gains are comparable. The mean total gain received by Exp4.MP is comparable only with that of Exp3.M and exceeds those of the rest. On the other hand, when the change occurs in the game, Exp4.MP outperforms the rest in overall performance (Fig. 4). As expected, Exp3.M and Exp4.MP receive relatively higher gains than the other algorithms. However, since we take special care to bound the variance, Exp4.MP has a more robust performance. From the results, we can conclude that Exp4.MP combines the superior performance of the algorithms with an expected regret guarantee and the robustness of the high-probability algorithms.
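The quantities shown in such a box plot can be computed from the total gains of repeated runs as in the sketch below. The helper name `box_plot_summary` and the sample values are hypothetical, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption, since the text does not state which convention was used.

```python
import statistics


def box_plot_summary(total_gains):
    """Summarize repeated-run total gains as described for the box plots:
    whiskers at min/max, box from first to third quartile, plus the
    median (line) and the mean (star)."""
    q1, median, q3 = statistics.quantiles(total_gains, n=4)
    return {
        "min": min(total_gains),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(total_gains),
        "mean": statistics.fmean(total_gains),
    }


# Hypothetical total gains from five runs of one algorithm.
summary = box_plot_summary([410, 395, 402, 388, 420])
```

Comparing the interquartile range (`q3 - q1`) across algorithms gives a simple numeric counterpart to the visual comparison of robustness.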

### V-B Choosing an m-arm with Expert Advice

In this part, we demonstrate the performance of Exp4.MP when the actual advice vectors are not necessarily constant or deterministic. We compare our algorithm with the only known algorithm that is capable of choosing an m-arm with expert advice, i.e., the Unordered Slate Algorithm with policies (USA-P) introduced in [18]. Since our main point is to improve the regret bound by , we compare the algorithms under different subset sizes. For this, we construct five -armed games, where we choose arms respectively. In each game, the gains of the first arms are , and the gains of the others are throughout the game. In order to satisfy the condition , we first generate the underlying expert set where . The generation process is as follows: The first underlying experts are chosen constant and deterministic where for . The first entries of the last two underlying experts are chosen , i.e., for , while the other entries are determined randomly at each round under the constraint that their sum is . The actual advice vectors are generated by summing each sized subset of the underlying expert set at each round, where we have a total of experts. We note that, based on our arm gain selection and the advice vector generating process, the actual expert that is the sum of the constant underlying experts is the best expert.
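The last step of the generation process, forming the actual advice vectors by summing subsets of the underlying experts, can be sketched as below. This assumes each underlying expert is a vector over the arms whose entries sum to one and that subsets of size m are used; the function name `actual_advice_vectors` and the toy experts are hypothetical and do not reproduce the experiment's expert set.

```python
from itertools import combinations
from math import comb


def actual_advice_vectors(underlying, m):
    """Sum every size-m subset of the underlying advice vectors.

    Each result is an advice vector over the arms whose entries sum to m,
    provided every underlying vector sums to one (an assumption here).
    """
    num_arms = len(underlying[0])
    return [
        [sum(underlying[i][arm] for i in subset) for arm in range(num_arms)]
        for subset in combinations(range(len(underlying)), m)
    ]


# Hypothetical underlying experts over 4 arms: two constant, deterministic
# experts and one that spreads its mass over the remaining arms.
underlying = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
experts = actual_advice_vectors(underlying, m=2)
```

In this toy case the first actual expert, `[1.0, 1.0, 0.0, 0.0]`, is the sum of the two constant underlying experts, matching the role of the best expert in the description above.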

In the experiment, we set the parameters for both algorithms and for Exp4.MP. We have repeated all the games times and plotted the ensemble distributions in Fig. 8, Fig. 8, and Fig. 8. Fig. 8 illustrates the time-averaged regret incurred by the algorithms at the end of the games with increasing . As can be seen, the regret incurred by our algorithm remains almost constant, while the regret of USA-P increases as increases. In order to observe the temporal performances, we have plotted Fig. 8, which illustrates the time-averaged regret performances of the algorithms when , and . We observe that our algorithm suffers a lower regret value at each round. To analyze this difference in the performances, we have also plotted the mean of the probability values assigned to the optimum arms by the algorithms at each round when (Fig. 8). We observe that although USA-P saturates at the same probability value as Exp4.MP, its convergence rate is slower. Therefore, since our algorithm can explore the optimum m-arm more rapidly, it is able to achieve better performance, especially for large subset sizes.

### V-C Sudden Game Change

In this section, we demonstrate the performance of Exp3.MSP in a synthesized game. We compare our algorithm with Exp3.S and the algorithm introduced by György et al. in [14, Section 6]. Since György et al. did not name their algorithms, we use GYA-SW to denote their algorithm. We also compare each algorithm against the trivial algorithm, Chance (i.e., random guess) for a baseline comparison.

For this experiment, we construct a game of length , where we need to choose arms out of bandit arms. The gains of the arms are deterministically selected as follows: Up to round , the gains of the first arms are , while the gains of the rest are . Between rounds and , the gains of the last arms are , while the gains of the rest are . In the rest of the game, the gain distribution is the same as in the first rounds. The optimum m-arms at consecutive segments are intentionally selected to be mutually exclusive in order to simulate sudden changes effectively. We point out that, based on our arm gain selections, the number of segments in the optimum m-arm sequence is , i.e., .
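A deterministic three-segment game of this kind can be sketched as follows. The function name `sudden_change_gains`, the assumption that the optimum m-arm is the first m arms in the outer segments and the last m arms in the middle segment, and all numeric values are illustrative choices rather than the experiment's actual settings.

```python
def sudden_change_gains(num_arms, m, num_rounds, change_points, high, low):
    """Build a deterministic gain table with three segments.

    Outside [first, second) the first m arms receive `high`; inside it the
    last m arms do. All other arms receive `low`.
    """
    if 2 * m > num_arms:
        raise ValueError("optimum m-arms must be mutually exclusive")
    first, second = change_points
    gains = []
    for t in range(num_rounds):
        in_middle = first <= t < second
        row = []
        for arm in range(num_arms):
            if in_middle:
                optimal = arm >= num_arms - m
            else:
                optimal = arm < m
            row.append(high if optimal else low)
        gains.append(row)
    return gains


# Hypothetical values, chosen only for illustration.
game = sudden_change_gains(num_arms=10, m=3, num_rounds=9,
                           change_points=(3, 6), high=1.0, low=0.0)
```

The check `2 * m > num_arms` enforces that the optimum m-arms of consecutive segments share no arm, which is what makes the changes sudden for every algorithm.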

For Exp3.MSP and GYA-SW, we set . We have repeated the games times and plotted the ensemble distributions in Fig. 11 and Fig. 11. Fig. 11 illustrates the time-averaged regret performance of the algorithms. We observe that our algorithm has a lower regret value at every time instant. To analyze this, we have plotted the mean probability values assigned to the optimum arms by the algorithms at each round in Fig. 11. We observe that Exp3.S cannot reach high values of probability due to the exponential size of its action set. We also see that although GYA-SW saturates at the same probability value as Exp3.MSP, its convergence rate is slower. Therefore, Fig. 11 shows that, since our algorithm adapts faster to the changes in the environment, it achieves a better performance throughout the game.

### V-D Random Game Change

In this part, we demonstrate the performance of Exp3.MSP on random data sequences. We compare our algorithm with two state-of-the-art techniques: Exp3.S [4] and GYA-SW [14]. We also compare each algorithm against the trivial algorithm, Chance (i.e., random guess) for a baseline comparison. For this experiment, we construct a game whose behavior is completely random, with the only regularization condition being that an m-arm should be optimum throughout a segment. We start to synthesize the dataset by randomly selecting gains in for all arms for all rounds. We predetermine the optimum m-arm in each segment and then switch the maximum gains with the gains of the optimum m-arm at each round. This synthesized dataset creates a game with randomly determined gains while maintaining that one m-arm is uniformly optimum throughout each segment. We synthesize multiple datasets to analyze the effects of the parameters of the game individually, where we compare the algorithms’ performances for varying game length , number of switches , number of arms and subset size . We start with the control group of , , , and both and are known a priori. Then, for each case, we vary one of the above four parameters. Differing from before, the time instances of the switches are not fixed to and but are instead selected randomly anywhere in the game. Thus, we create completely random games with three segments.
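The synthesis procedure above (random gains, then swapping the largest gains into the predetermined optimum m-arm) can be sketched as follows. The function names, the use of the interval [0, 1) for the random gains, and the parameter values in the usage line are assumptions made for illustration.

```python
import random


def plant_optimum(row, optimum):
    """Swap gains in place so the arms in `optimum` hold the m largest values."""
    m = len(optimum)
    top = sorted(range(len(row)), key=lambda a: row[a], reverse=True)[:m]
    need = [a for a in optimum if a not in top]    # optimum arms lacking a top gain
    donors = [a for a in top if a not in optimum]  # top gains on non-optimum arms
    for a, b in zip(need, donors):
        row[a], row[b] = row[b], row[a]
    return row


def random_change_game(num_arms, m, num_rounds, num_switches, seed=0):
    """Random gains with one m-arm kept optimum throughout each segment.

    Switch times are drawn uniformly at random over the game.
    """
    rng = random.Random(seed)
    switch_times = sorted(rng.sample(range(1, num_rounds), num_switches))
    optima = [rng.sample(range(num_arms), m) for _ in range(num_switches + 1)]
    gains, optimum_per_round = [], []
    segment = 0
    for t in range(num_rounds):
        if segment < num_switches and t == switch_times[segment]:
            segment += 1
        row = [rng.random() for _ in range(num_arms)]  # assumed range [0, 1)
        gains.append(plant_optimum(row, optima[segment]))
        optimum_per_round.append(optima[segment])
    return gains, optimum_per_round


# Hypothetical parameters: 8 arms, subset size 2, 200 rounds, 2 switches.
gains, optimum_per_round = random_change_game(8, 2, 200, 2)
```

Because the swap only permutes values within a round, the marginal randomness of the gains is preserved while the designated m-arm still collects the largest gains of every round in its segment.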

To observe the effect of game length, we selected the different game lengths, which are linearly spaced between and . We provided the algorithms with the prior information of both the game length and the number of switches. In Fig. 11(a), we have plotted the average regret incurred at the end of the game, i.e., , by the algorithms at different values of game length while fixing the other parameters. We note that the error bars in Fig. 11(a) illustrate the maximum and the minimum average regret incurred at a fixed value of game length. For any set of parameters, we have simulated the setting for times, recreating the game each time, to obtain statistically significant results. We observe that Exp3.S performs close to random guess up to a game length of approximately rounds, which is expected since it treats each action as a separate arm. We also observe that Exp3.MSP and GYA-SW perform better than chance for all values of game length. However, there is a significant performance difference in favor of Exp3.MSP.
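The average regret reported in these plots can be sketched as the gap between the total gain of the best switching m-arm strategy and the total gain of the played m-arms, divided by the game length. That normalization is an assumption here, and the names `average_regret` and `chance_policy` as well as the toy game are hypothetical.

```python
import random


def average_regret(gains, optimum_per_round, chosen_per_round):
    """Time-averaged regret against the per-segment optimum m-arm."""
    total = 0.0
    for row, best, chosen in zip(gains, optimum_per_round, chosen_per_round):
        total += sum(row[a] for a in best) - sum(row[a] for a in chosen)
    return total / len(gains)


def chance_policy(num_arms, m, num_rounds, seed=0):
    """Baseline that picks m distinct arms uniformly at random each round."""
    rng = random.Random(seed)
    return [rng.sample(range(num_arms), m) for _ in range(num_rounds)]


# Toy game with one switch: arms {0, 1} are optimum first, then arms {2, 3}.
gains = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
optimum = [[0, 1]] * 3 + [[2, 3]] * 3

regret_static = average_regret(gains, optimum, [[0, 1]] * 6)
regret_chance = average_regret(gains, optimum, chance_policy(4, 2, 6))
```

In the toy game, a player that never leaves arms {0, 1} loses two units of gain in each of the last three rounds, so `regret_static` evaluates to 1.0, whereas following the optimum sequence gives zero regret.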

To observe the effect of the number of switches on the performances, we created random change games with arms, subset size and game length of . We provided the algorithms with the prior information of both the game length and the number of switches. We selected different switch values, which are logarithmically spaced between and . In Fig. 11(b), we have plotted the average regret incurred at the end of the game by the algorithms at different values of the number of switches while fixing the other parameters. For any set of parameters, we have simulated the setting for times, recreating the game each time. Exp3.S performs similarly to random guess after approximately switches, i.e., . Both Exp3.MSP and GYA-SW degrade to the level of random guess at , which is comparable to the value of game length, i.e., . However, Exp3.MSP manages to provide better performance than the other algorithms for all numbers of switches.

To observe the effect of the number of bandit arms on the performances, we created random change games with segments, subset size of and game length of . We provided the algorithms with the prior information of both the game length and the number of switches. We selected the number of bandit arms to be even numbers between and . In Fig. 11(c), we have plotted the average regret incurred at the end of the game by the algorithms at different values of the number of bandit arms while fixing the other parameters. For any set of parameters, we have simulated the setting for times, recreating the game each time. Exp3.S performs similarly to random guess after approximately bandit arms. Exp3.MSP and GYA-SW outperform random guess for all values of the number of bandit arms. Moreover, Exp3.MSP uniformly outperforms all the other algorithms for all values of the number of bandit arms.

To observe the effect of the subset size on the performances, we created random change games with segments,