# Multi-armed Bandit Problems with Strategic Arms

We study a strategic version of the multi-armed bandit problem, where each arm is an individual strategic agent and we, the principal, pull one arm each round. When pulled, the arm receives some private reward v_a and can choose an amount x_a to pass on to the principal (keeping v_a − x_a for itself). All non-pulled arms get reward 0. Each strategic arm tries to maximize its own utility over the course of T rounds. Our goal is to design an algorithm for the principal incentivizing these arms to pass on as much of their private rewards as possible. When private rewards are stochastically drawn each round (v_a^t ← D_a), we show that:

- Algorithms that perform well in the classic adversarial multi-armed bandit setting necessarily perform poorly: for all algorithms that guarantee low regret in an adversarial setting, there exist distributions D_1, …, D_k and an approximate Nash equilibrium for the arms where the principal receives reward o(T).

- Still, there exists an algorithm for the principal that induces a game among the arms where each arm has a dominant strategy. When each arm plays its dominant strategy, the principal sees expected reward μ′T − o(T), where μ′ is the second-largest of the means E[D_a]. This algorithm maintains its guarantee if the arms are non-strategic (x_a = v_a), and also if there is a mix of strategic and non-strategic arms.



## 1 Introduction

Classically, algorithms for problems in machine learning assume that their inputs are drawn either stochastically from some fixed distribution or chosen adversarially. In many contexts, these assumptions do a fine job of characterizing the possible behavior of problem inputs. Increasingly, however, these algorithms are being applied to contexts (ad auctions, search engine optimization, credit scoring, etc.) where the quantities being learned are controlled by rational agents with external incentives. To this end, it is important to understand how these algorithms behave in strategic settings.

The multi-armed bandit problem is a fundamental decision problem in machine learning that models the trade-off between exploration and exploitation, and is used extensively as a building block in other machine learning algorithms (e.g. reinforcement learning). A learner (who we refer to as the principal) is a sequential decision maker who at each time step t must decide which of K arms to 'pull'. Pulling this arm bestows a reward (either adversarially or stochastically generated) to the principal, and the principal would like to maximize his overall reward. Known algorithms for this problem guarantee that the principal can do approximately as well as the best individual arm.

In this paper, we consider a strategic model for the multi-armed bandit problem where each arm is an individual strategic agent and each round one arm is pulled by an agent we refer to as the principal. Each round, the pulled arm receives a private reward v and then decides what amount w of this reward gets passed on to the principal (upon which the principal receives utility w and the arm receives utility v − w). Each arm therefore has a natural tradeoff between keeping most of its reward for itself and passing on the reward so as to be chosen more frequently. Our goal is to design mechanisms for the principal which simultaneously learn which arms are valuable while also incentivizing these arms to pass on most of their rewards.

This model captures a variety of dynamic agency problems, where at each time step the principal must choose to employ one of K agents to perform actions on the principal's behalf, where the agent's cost of performing that action is unknown to the principal (for example, hiring one of several contractors to perform some work, or hiring one of several investors with external information to manage some money). In this sense, this model can be thought of as a multi-agent generalization of the principal-agent problem in contract theory (see Section 1.2 for references). The model also captures, for instance, the interaction between consumers (as the principal) and many sellers deciding how steep a discount to offer the consumers - higher prices now lead to immediate revenue, but offering better discounts than your competitors will lead to future sales. In all domains, our model aims to capture settings where the principal has little domain-specific or market-specific knowledge, and can really only process the reward they get for pulling an arm and not any external factors that contributed to that reward.

### 1.1 Our results

#### 1.1.1 Low-regret algorithms are far from strategyproof

Many algorithms for the multi-armed bandit problem are designed to work in worst-case settings, where an adversary can adaptively decide the value of each arm pull. Here, algorithms such as EXP3 ([ACBFS03]) guarantee that the principal receives almost as much as if he had only pulled the best arm. Formally, such algorithms guarantee that the principal experiences at most o(T) regret over T rounds compared to any algorithm that only plays a single arm (when the adversary is oblivious).
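For concreteness, a minimal EXP3-style implementation might look as follows. This is our own simplified sketch (not the exact variant analyzed in [ACBFS03]): it mixes exponential weights with uniform exploration and updates the pulled arm with an importance-weighted reward estimate.

```python
import math
import random

def exp3(T, K, reward_fn, gamma=0.1):
    """Simplified EXP3 sketch; rewards from reward_fn must lie in [0, 1]."""
    weights = [1.0] * K
    total_reward = 0.0
    for t in range(T):
        wsum = sum(weights)
        # mix exponential weights with a gamma-fraction of uniform exploration
        probs = [(1 - gamma) * w / wsum + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        reward = reward_fn(arm, t)
        total_reward += reward
        # importance-weighted estimate keeps the reward estimator unbiased
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / K)
    return total_reward
```

With gamma tuned appropriately as a function of T and K, algorithms of this shape achieve the sublinear adversarial regret guarantee discussed above.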

Given these worst-case guarantees, one might naively expect low-regret algorithms such as EXP3 to also perform well in our strategic variant. It is important to note, however, that single arm strategies perform dismally in this strategic setting; if the principal only ever selects one arm, the arm has no incentive to pass along any surplus to the principal. In fact, we show that the objectives of minimizing adversarial regret and performing well in this strategic variant are fundamentally at odds.

###### Theorem 1.1 (informal restatement of Theorem 3.4).

Let M be a low-regret algorithm for the classic multi-armed bandit problem with adversarially chosen values. Then there exists an instance of the strategic multi-armed bandit problem and an approximate Nash equilibrium for the arms where a principal running M receives at most o(T) revenue.

Here we assume the game is played under a tacit observational model, meaning that arms can only observe which arms get pulled by the principal, but not how much value they give to the principal. In the explicit observational model, where arms can see both which arms get pulled and how much value they pass on, even stronger results hold.

###### Theorem 1.2 (informal restatement of Theorem 3.1).

Let M be a low-regret algorithm for the classic multi-armed bandit problem with adversarially chosen values. Then there exists an instance of the strategic multi-armed bandit problem in the explicit observational model along with an approximate Nash equilibrium for the arms where a principal running M receives zero revenue.

While not immediately apparent from the above claims, these instances where low-regret algorithms fail are far from pathological; in particular, there is a problematic equilibrium for any instance where arm i receives a fixed reward μ_i each round it is pulled, as long as the gap between the largest and second-largest μ_i is not too large (roughly a 1/k fraction, where k is the number of arms).

The driving cause behind both results is possible collusion between the arms (similar to collusion that occurs in the setting of repeated auctions, see [SH04]). For example, consider a simple instance of this problem with two strategic arms, where arm 1 always gets private reward 1 if pulled and arm 2 always gets private reward 0.8. In this example, we also assume the principal is using algorithm EXP3. By always reporting some value slightly larger than 0.8, arm 1 can incentivize the principal to almost always pull it in the long run. This gains arm 1 roughly 0.2 utility per round (and arm 2 nothing). On the other hand, if arm 1 and arm 2 never pass along any surplus to the principal, they will likely be played equally often, gaining arm 1 roughly 0.5 utility per round and arm 2 0.4 utility per round.
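The arithmetic in this two-arm example can be checked directly (a sketch using the paper's numbers and the informal long-run pull frequencies described above):

```python
v1, v2 = 1.0, 0.8   # fixed per-round private rewards of the two arms

# Competing: arm 1 reports just above arm 2's value, so a low-regret
# principal ends up pulling arm 1 almost every round.
compete_u1 = v1 - 0.8   # arm 1 keeps roughly 0.2 per round
compete_u2 = 0.0        # arm 2 is (almost) never pulled

# Colluding: both arms report 0; a low-regret principal then splits its
# pulls roughly evenly, and each arm keeps its full private reward.
collude_u1 = 0.5 * v1   # roughly 0.5 per round
collude_u2 = 0.5 * v2   # roughly 0.4 per round

# Both arms strictly prefer the collusive outcome.
assert collude_u1 > compete_u1 and collude_u2 > compete_u2
```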

To show such a market-sharing strategy works for general low-regret algorithms, much more work needs to be done. The arms must be able to enforce an even split of the principal's pulls (as soon as the principal starts lopsidedly pulling one arm more often than the others, the remaining arms can defect and start reporting their full value whenever pulled). As long as the principal guarantees good performance in the non-strategic adversarial case (achieving o(T) regret), we show that the arms can (at cost to themselves) cooperate so that they are all played equally often.

#### 1.1.2 Mechanisms for strategic arms with stochastic values

We next show that, in certain settings, it is in fact possible for the principal to extract positive value from the arms each round. We consider a setting where each arm i's reward when pulled is drawn independently from some distribution D_i with mean μ_i (known to arm i but not to the principal). In this case the principal can extract roughly the value of the second-best arm. In the statement below, we use the term "truthful mechanism" quite loosely as shorthand for "strategy that induces a game among the arms where each arm has a dominant strategy."

###### Theorem 1.3 (restatement of Corollary 4.2).

Let μ′ be the second largest mean amongst the set of μ_i's. Then there exists a truthful mechanism for the principal that guarantees revenue at least μ′T − o(T) when arms use their dominant strategies.

The mechanism in Theorem 1.3 is a slight modification of the second-price auction strategy adapted to the multi-armed bandit setting. The principal begins by asking each arm for its mean μ_i, where we incentivize arms to answer truthfully by compensating arms according to a proper scoring rule. For the remainder of the T rounds, the principal then asks the arm with the highest reported mean to give him the second-largest reported mean worth of value per round. If this arm fails to comply in any round, the principal avoids picking this arm for the remainder of the rounds. (A more detailed description of the mechanism can be seen in Mechanism 1 in Section 4.) In addition, we show that the performance of this mechanism is as good as possible in this setting; no mechanism can do better than the second-best arm in the worst case (Lemma 4.3).
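A minimal sketch of this flow, under the simplifying assumption that arms report truthfully and comply (the scoring-rule payments are omitted, and all class and function names here are ours, not the paper's):

```python
class Arm:
    """Toy strategic arm that reports truthfully and complies."""
    def __init__(self, mean):
        self.mean = mean

    def report_mean(self):
        return self.mean

    def pull(self, price):
        # complying (paying exactly the demanded price) is the arm's
        # dominant strategy under the punishment below
        return price

def run_principal(arms, T):
    # Phase 1: elicit means (scoring-rule incentives elided here)
    reports = [arm.report_mean() for arm in arms]
    best = max(range(len(arms)), key=lambda i: reports[i])
    price = sorted(reports)[-2]   # second-largest reported mean
    revenue = 0.0
    # Phase 2: demand `price` from the best arm every round
    for _ in range(T):
        paid = arms[best].pull(price)
        if paid < price:
            break                 # punish: never pick this arm again
        revenue += paid
    return revenue
```

With compliant arms this collects the second-largest mean each round, matching the μ′T − o(T) guarantee of Theorem 1.3 up to the omitted learning and scoring costs.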

We further show how to adapt this mechanism in the setting where some arms are strategic and some arms are non-strategic (and our mechanism does not know which arms are which).

###### Theorem 1.4 (restatement of Theorem 4.5).

Let μ′ be the second largest mean amongst the means of the strategic arms, and let μ′′ be the largest mean amongst the means of the non-strategic arms. Then there exists a truthful mechanism for the principal that guarantees (with high probability) revenue at least max(μ′, μ′′)T − o(T) when arms use their dominant strategies.

A detailed description of the modified mechanism can be found in Mechanism 2 in Section 4.

### 1.2 Related work

The study of classical multi-armed bandit problems was initiated by [Rob52], and has since grown into an active area of study. The most relevant results for our paper concern the existence of low-regret bandit algorithms in the adversarial setting, such as the EXP3 algorithm ([ACBFS03]), which achieves regret O(√(TK log K)). Other important results in the classical setting include the upper confidence bound (UCB) algorithm for stochastic bandits ([LR85]) and the work of [GJ74] for Markovian bandits. For further details about multi-armed bandit problems, see the survey [BC12].

One question that arises in the strategic setting (and other adaptive settings for multi-armed bandits) is what the correct notion of regret is; standard notions of regret guarantee little, since the best overall arm may still have a small total reward. [ADT12] considered the multi-armed bandit problem with an adaptive adversary and introduced the quantity of "policy regret", which takes the adversary's adaptiveness into account. They showed that any multi-armed bandit algorithm incurs Ω(T) policy regret against an adaptive adversary. This indicates that it is not enough to treat strategic behavior as an instance of adaptively adversarial behavior; good mechanisms for the strategic multi-armed bandit problem must explicitly take advantage of the rational self-interest of the arms.

Our model bears some similarities to the principal-agent problem of contract theory, where a principal employs a more informed agent to make decisions on the principal's behalf, but where the agent may have incentives misaligned with the principal's interests, for instance when it receives private savings (see, for example, [Cha13]). For more details on the principal-agent problem, see the book [LM02]. Our model can be thought of as a multi-armed version of the principal-agent problem, where the principal has many agents to select from (the arms) and can try to use competition between the agents to align their interests with the principal's.

Our negative results are closely related to results on collusion in repeated auctions. Existing theoretical work [MM92, AB01, JR99, Aoy03, Aoy07, SH04] has shown that collusive schemes exist in repeated auctions in many different settings, e.g., with/without side payments, with/without communication, and with finite/infinite type spaces. In some settings, efficient collusion can be achieved; i.e., bidders can collude to allocate the good to the bidder who values it the most while leaving the seller asymptotically zero revenue. Even without side payments and communication, [SH04] showed that tacit collusion exists and can achieve asymptotic efficiency with a large cartel.

Our truthful mechanism implicitly uses a proper scoring rule [Bri50, McC56]. In general, scoring rules are used to assess the accuracy of a probabilistic prediction. In our mechanisms, we use a logarithmic scoring rule to incentivize arms to truthfully report their average rewards.
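For intuition, the logarithmic scoring rule is proper: reporting the true mean maximizes the reporter's expected score. A toy check for a Bernoulli mean (our illustration, not the paper's exact payment scheme):

```python
import math

def log_score(report, outcome):
    # logarithmic scoring rule for a Bernoulli prediction
    p = min(max(report, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return math.log(p) if outcome else math.log(1 - p)

def expected_score(report, true_p):
    # expected score when outcomes are Bernoulli(true_p)
    return true_p * log_score(report, 1) + (1 - true_p) * log_score(report, 0)

true_p = 0.7
candidates = [0.5, 0.6, 0.7, 0.8, 0.9]
best = max(candidates, key=lambda q: expected_score(q, true_p))
assert best == 0.7  # the truthful report maximizes the expected score
```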

Our setting is similar to settings considered in a variety of work on dynamic mechanism design, often inspired by online advertising. [BV96] considers the problem where a buyer wants to buy a stream of goods with an unknown value from two sellers, and examines Markov perfect equilibria in this model. [BSS09, DK09, BKS10] study truthful pay-per-click auctions where the auctioneer wishes to design a truthful mechanism that maximizes the social welfare. [KMP14, FKKK14] consider the scenario where the principal cannot directly choose which arm to pull, and instead must incentivize a stream of strategic players to prevent them from acting myopically. [ARS13, ARS14] consider a setting where a seller repeatedly sells to a buyer with unknown value distribution, but the buyer is more heavily discounted than the seller. [KLN13] develops a general method for finding optimal mechanisms in settings with dynamic private information. [NSV08] develops an ex ante efficient mechanism for the Cost-Per-Action charging scheme in online advertising.

### 1.3 Open Problems and Future Directions

We are far from understanding the complete picture of multi-armed bandit problems in strategic settings. Many questions remain, both in our model and related models.

One limitation of our negative results is that they only show there exists some 'bad' approximate Nash equilibrium for the arms, i.e., one where any low-regret principal receives little revenue. This, however, says nothing about the space of all approximate Nash equilibria. Does there exist a low-regret mechanism for the principal along with an approximate Nash equilibrium for the arms where the principal extracts significant utility? An affirmative answer to this question would raise hope for the possibility of a mechanism that can perform well in both the adversarial and strategic setting, whereas a negative answer would strengthen our claim that these two settings are fundamentally at odds.

One limitation of our positive results is that all of the learning takes place at the beginning of the protocol, and is deferred to the arms themselves. As a result, our mechanism fails in cases where the arms’ distributions can change over time. Is it possible to design good mechanisms for such settings? Ideally, any good mechanism should learn the arms’ values continually throughout the rounds, but this seems to open up the possibility of collusion between the arms.

Throughout this paper, whenever we consider strategic bandits we assume their rewards are stochastically generated. Can we say anything about strategic bandits with adversarially generated rewards? The issue here seems to be defining what a strategic equilibrium is in this case - arms need some underlying priors to reason about their future expected utility. One possibility is to consider what happens when the arms all play no-regret strategies with respect to some broad class of strategies.

Finally, there are other quantities one may wish to optimize instead of the utility of the principal. For example, is it possible to design an efficient principal, who almost always picks the best arm (even if the arm passes along little to the principal)? Theorem 3.4 implies the answer is no if the principal also has to be efficient in the adversarial case, but are there other models where we can answer this question affirmatively?

## 2 Preliminaries

### 2.1 Classic Multi-Armed Bandits

We begin by reviewing the definition of the classic multi-armed bandits problem and associated quantities.

In the classic multi-armed bandit problem, a learner (the principal) chooses one of K choices (arms) per round, over T rounds. On round t, the principal receives some reward v_{i,t} for pulling arm i. The values v_{i,t} are either drawn independently from some distribution D_i corresponding to arm i (in the case of stochastic bandits) or adaptively chosen by an adversary (in the case of adversarial bandits). Unless otherwise specified, we will assume we are in the adversarial setting.

Let I_t denote the arm pulled by the principal at round t. The revenue of an algorithm M is the random variable

$$\mathrm{Rev}(M)=\sum_{t=1}^{T}v_{I_t,t}$$

and the regret of M is the random variable

$$\mathrm{Reg}(M)=\max_i\sum_{t=1}^{T}v_{i,t}-\mathrm{Rev}(M)$$
###### Definition 2.1 (δ-Low Regret Algorithm).

Mechanism M is a δ-low regret algorithm for the multi-armed bandit problem if

$$\mathbb{E}[\mathrm{Reg}(M)]\le\delta.$$

Here the expectation is taken over the randomness of M and the adversary.

###### Definition 2.2 ((ρ,δ)-Low Regret Algorithm).

Mechanism M is a (ρ,δ)-low regret algorithm for the multi-armed bandit problem if, with probability at least 1 − ρ,

$$\mathrm{Reg}(M)\le\delta.$$

There exist δ-low regret algorithms with δ = O(√(TK log K)), as well as (ρ,δ)-low regret algorithms with δ = O(√(TK log(K/ρ))), for the multi-armed bandit problem; see Section 3.2 of [BC12] for details.
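Mirroring these definitions, the revenue and regret of a fixed play sequence can be computed directly (an illustrative helper of our own):

```python
def revenue(values, pulls):
    # values[t][i]: reward of arm i at round t; pulls[t]: arm pulled at t
    return sum(values[t][pulls[t]] for t in range(len(pulls)))

def regret(values, pulls):
    K = len(values[0])
    # benchmark: the best single arm in hindsight
    best_single_arm = max(sum(v[i] for v in values) for i in range(K))
    return best_single_arm - revenue(values, pulls)
```

For instance, with `values = [[1, 0], [1, 0], [0, 1]]` and `pulls = [0, 1, 1]`, the revenue is 2 and the regret is 0, since the best single arm (arm 0) also totals 2.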

### 2.2 Strategic Multi-Armed Bandits

The strategic multi-armed bandits problem builds upon the classic multi-armed bandits problem with the notable difference that now arms are strategic agents with the ability to withhold some payment from the principal. Instead of the principal directly receiving the reward v_{i,t} when choosing arm i, now arm i receives this reward and passes along some amount w_{i,t} to the principal, gaining the remainder as utility.

For simplicity, in the strategic setting, we will assume the rewards are generated stochastically; that is, each round, v_{i,t} is drawn independently from a distribution D_i (where the distributions D_1, …, D_K are known to all arms but not to the principal). While it is possible to pose this problem in the adversarial setting (or other more general settings), this comes at the cost of there being no clear notion of strategic equilibrium for the arms.

This strategic variant comes with two additional modeling assumptions. The first is the informational model of this game: what information does an arm observe when some other arm is pulled? We define two possible observational models:

1. Explicit: After each round t, every arm sees the arm played I_t along with the quantity w_{I_t,t} reported to the principal.

2. Tacit: After each round t, every arm only sees the arm played I_t.

In both cases, only arm I_t knows the size of the original reward v_{I_t,t}; in particular, the principal also only sees the value w_{I_t,t} and learns nothing about the amount withheld by the arm. Collusion between arms is generally easier in the explicit observational model than in the tacit observational model.

The second modeling assumption is whether to allow arms to go into debt while paying the principal. In the restricted payment model, we impose that 0 ≤ w_{i,t} ≤ v_{i,t}; an arm cannot pass along more than it receives in a given round. In the unrestricted payment model, we let w_{i,t} be any nonnegative value (in particular, possibly exceeding v_{i,t}). We prove our negative results in the restricted payment model and our positive results in the unrestricted payment model, but our proofs for our negative results work in both models (in particular, it is easier to collude and prove negative results in the unrestricted payment model).

Finally, we proceed to define the set of strategic equilibria for the arms. We assume the mechanism M of the principal is fixed ahead of time and known to the arms. If each arm i is using a (possibly adaptive) strategy S_i, then the expected utility of arm i is defined as

$$u_i(M,S_1,\dots,S_K)=\mathbb{E}\left[\sum_{t=1}^{T}(v_{i,t}-w_{i,t})\cdot\mathbb{1}_{I_t=i}\right].$$

An ε-Nash equilibrium for the arms is then defined as follows.

###### Definition 2.3 (ε-Nash Equilibrium for the arms).

Strategies (S_1, …, S_K) form an ε-Nash equilibrium for the strategic multi-armed bandit problem if for all i and any deviating strategy S′_i,

$$u_i(S_1,\dots,S_i,\dots,S_K)\ge u_i(S_1,\dots,S'_i,\dots,S_K)-\varepsilon.$$

The goal of the principal is to choose a mechanism M which guarantees large revenue in any ε-Nash equilibrium for the arms.

In Section 4, we will construct mechanisms for the strategic multi-armed bandit problem which are truthful for the arms. We define the related terminology below.

###### Definition 2.4 (Dominant Strategy).

When the principal uses mechanism M, we say S_i is a dominant strategy for arm i if for any deviating strategy S′_i and any strategies S_1, …, S_{i−1}, S_{i+1}, …, S_K for the other arms,

$$u_i(M,S_1,\dots,S_i,\dots,S_K)\ge u_i(M,S_1,\dots,S'_i,\dots,S_K).$$
###### Definition 2.5 (Truthfulness).

We say that a mechanism M for the principal is truthful if every arm has a dominant strategy.

## 3 Negative Results

In this section, we show that algorithms that achieve low regret in the multi-armed bandit problem with adversarial values perform poorly in the strategic multi-armed bandit problem. Throughout this section, we will assume we are working in the restricted payment model (i.e., arm i can only pass along a value w_{i,t} that is at most v_{i,t}), but all proofs also work in the unrestricted payment model (and in fact are much easier there).

### 3.1 Explicit Observational Model

We begin by showing that in the explicit observational model, there is an approximate equilibrium for the arms that results in the principal receiving no revenue. Since arms can view other arms’ reported values, it is easy to collude in the explicit model; simply defect and pass along the full amount as soon as you observe another arm passing along a positive amount.
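This grim-trigger behavior can be sketched in a few lines (our illustrative rendering, from the point of view of one arm in the explicit observational model):

```python
def explicit_collusion_report(v, observed_other_reports):
    """Grim trigger: cooperate until the other arm is ever seen
    passing along a positive amount, then report everything forever.

    v: this round's private reward.
    observed_other_reports: amounts the other arm has been observed
    passing to the principal so far (visible in the explicit model).
    """
    if any(w > 0 for w in observed_other_reports):
        return v     # defect: pass along the full reward from now on
    return 0.0       # cooperate: give the principal nothing
```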

###### Theorem 3.1.

Let mechanism M be a δ-low regret algorithm for the multi-armed bandit problem. Then in the strategic multi-armed bandit problem under the explicit observational model, there exist distributions D_1, …, D_K and a (1+δ)-Nash equilibrium for the arms where a principal using mechanism M receives zero revenue.

###### Proof.

Consider the two-arm setting where D_1 and D_2 are both deterministic distributions supported entirely on {1}, so that v_{1,t} = v_{2,t} = 1 for all t. Consider the following strategy S* for arm i:

1. Set w_{i,t} = 0 if, at time t, the other arm has always reported 0 when pulled.

2. Set w_{i,t} = v_{i,t} otherwise.

We will show that (S*, S*) is a (1+δ)-Nash equilibrium. It suffices to show that arm 1 can gain at most 1+δ utility by deviating. Consider any deviating strategy S′ for arm 1. By convexity, we can assume S′ is deterministic (there is some best deterministic deviating strategy). Since mechanism M might be randomized, let R be the randomness used by M and define M_R to be the deterministic mechanism when M uses randomness R. Now, consider the case when arm 1 plays strategy S′, arm 2 plays strategy S*, and the principal is using mechanism M_R.

1. If arm 1 never reports any value larger than 0 when pulled, then (S′, S*) behaves exactly the same as (S*, S*). Therefore,

$$u_1(M_R,S',S^*)=u_1(M_R,S^*,S^*).$$
2. If arm 1 ever reports some value larger than 0 when pulled, let τ_R be the first time it does so. We know that (S′, S*) behaves the same as (S*, S*) before round τ_R. Therefore,

$$\begin{aligned} u_1(M_R,S',S^*)&\le u_1(M_R,S^*,S^*)+\sum_{t=\tau_R}^{T}(v_{1,t}-w_{1,t})\cdot\mathbb{1}_{I_t=1}\\ &\le u_1(M_R,S^*,S^*)+1+\sum_{t=\tau_R+1}^{T}\left(\max(w_{1,t},w_{2,t})-w_{1,t}\right)\cdot\mathbb{1}_{I_t=1} \end{aligned}$$

So in general, we have

$$u_1(M_R,S',S^*)\le u_1(M_R,S^*,S^*)+1+\sum_{t=\tau_R+1}^{T}\left(\max(w_{1,t},w_{2,t})-w_{1,t}\right)\cdot\mathbb{1}_{I_t=1}.$$

Therefore

$$\begin{aligned} u_1(M,S',S^*)&=\mathbb{E}_R[u_1(M_R,S',S^*)]\\ &\le \mathbb{E}_R[u_1(M_R,S^*,S^*)]+1+\mathbb{E}_R\left[\sum_{t=\tau_R+1}^{T}\left(\max(w_{1,t},w_{2,t})-w_{1,t}\right)\cdot\mathbb{1}_{I_t=1}\right]\\ &=u_1(M,S^*,S^*)+1+\mathbb{E}_R\left[\sum_{t=\tau_R+1}^{T}\left(\max(w_{1,t},w_{2,t})-w_{1,t}\right)\cdot\mathbb{1}_{I_t=1}\right]. \end{aligned}$$

Notice that this expectation is at most the regret of M in the classic multi-armed bandit setting when the adversary sets rewards equal to the values w_{1,t} and w_{2,t} passed on by the arms when they play (S′, S*). Therefore, by our low-regret guarantee on M, we have that

$$\mathbb{E}_R\left[\sum_{t=\tau_R+1}^{T}\left(\max(w_{1,t},w_{2,t})-w_{1,t}\right)\cdot\mathbb{1}_{I_t=1}\right]\le\delta.$$

Thus

$$u_1(M,S',S^*)\le u_1(M,S^*,S^*)+1+\delta$$

and this is a (1+δ)-approximate Nash equilibrium. Finally, it is easy to check that the principal receives zero revenue when both arms play according to this equilibrium strategy. ∎

### 3.2 Tacit Observational Model

We next show that even in the tacit observational model, where the arms don't see the amounts passed on by other arms, it is still possible for the arms to collude and leave the principal with o(T) revenue. The underlying idea here is that the arms work to maintain an equal market share, where each of the K arms is played approximately a 1/K fraction of the time. To ensure this happens, arms collude so that arms that are less likely to be pulled pass along a tiny amount θ to the principal, whereas arms that have been pulled a lot or are more likely to be pulled pass along 0; this ends up forcing any low-regret algorithm for the principal to choose all the arms equally often. Interestingly, unlike the collusion strategy in the explicit observational model, this collusion strategy is mechanism dependent, as arms need to estimate the probability they will be pulled in the next round.

We begin by proving this result for the case of two arms, where the proof is slightly simpler.

###### Theorem 3.2.

Let mechanism M be a (ρ,δ)-low regret algorithm for the multi-armed bandit problem with two arms, where ρ ≤ 1/T² and log T ≤ δ. Then in the strategic multi-armed bandit problem under the tacit observational model, there exist distributions and an O(√(Tδ))-Nash equilibrium where a principal using mechanism M gets at most O(√(Tδ)) revenue.

###### Proof.

Let D_1 and D_2 be distributions with means μ_1 and μ_2 respectively, such that μ_1/2 ≤ μ_2 ≤ μ_1. Additionally, assume both D_1 and D_2 are supported on [0,1]. We now describe the equilibrium strategy S* (the below description is for arm 1; the strategy for arm 2 is symmetric):

1. Set parameters θ = √(δ/T) and B = 6√(Tδ).

2. Define c_{1,t} to be the number of times arm 1 is pulled in rounds 1, …, t. Similarly, define c_{2,t} to be the number of times arm 2 is pulled in rounds 1, …, t.

3. For t = 1, …, T:

   (a) If there exists a t′ < t such that |c_{1,t′} − c_{2,t′}| > B, set w_{1,t} = v_{1,t}.

   (b) If the condition in (a) is not true, let p_{1,t} be the probability that the principal will pick arm 1 in this round conditioned on the history (assuming arm 2 is also playing S*), and let p_{2,t} = 1 − p_{1,t}. Then:

      i. If p_{1,t} > p_{2,t} and c_{1,t−1} ≥ c_{2,t−1}, set w_{1,t} = 0.

      ii. Otherwise, set w_{1,t} = θ.
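One round of this strategy can be sketched as follows (our illustrative rendering, with θ and B the parameters set in step 1 passed in as arguments):

```python
def tacit_report(v, theta, B, p_me, p_other, c_me, c_other, defected):
    """One round of the market-sharing strategy (illustrative sketch).

    v: this round's private reward; theta: small collusive offer;
    B: defection threshold on the pull-count imbalance;
    p_me/p_other: probability the principal pulls each arm this round;
    c_me/c_other: pull counts so far; defected: whether the imbalance
    threshold was exceeded in some earlier round.
    """
    if defected or abs(c_me - c_other) > B:
        return v          # step (a): defect and report the full reward
    if p_me > p_other and c_me >= c_other:
        return 0.0        # ahead in pulls: offer nothing
    return theta          # behind: offer theta to attract the principal
```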

We will now show that (S*, S*) is an O(√(Tδ))-Nash equilibrium. To do this, for any deviating strategy S′, we will both lower bound u_1(M, S*, S*) and upper bound u_1(M, S′, S*), hence bounding the net utility of deviation.

We begin by proving that u_1(M, S*, S*) ≥ μ_1T/2 − O(√(Tδ)). We need the following lemma.

###### Lemma 3.3.

If both arms are using strategy S*, then with probability at least 1 − 4/T, |c_{1,t} − c_{2,t}| ≤ B for all t.

###### Proof.

Assume that both arms are playing the strategy S* with the modification that they never defect (i.e. condition (a) in the above strategy is removed). This does not change the probability that |c_{1,t} − c_{2,t}| ≤ B for all t.

Define R_{1,t} to be the regret the principal experiences through round t for not playing only arm 1; that is, R_{1,t} = Σ_{s≤t} w_{1,s} − Σ_{s≤t} w_{I_s,s}. Define R_{2,t} similarly. We will begin by showing that with high probability, these regrets are bounded both above and below. In particular, we will show that with probability at least 1 − 2/T, R_{i,t} lies in [−δ − 2θ√(T log T), δ] for all t and i ∈ {1, 2}.

To do this, note that there are two cases where the regrets R_{1,t} and R_{2,t} can possibly change. The first is when p_{1,t} > p_{2,t} and c_{1,t−1} ≥ c_{2,t−1}. In this case, the arms offer (w_{1,t}, w_{2,t}) = (0, θ). With probability p_{1,t} the principal chooses arm 1 and the regrets update to (R_{1,t−1}, R_{2,t−1} + θ), and with probability p_{2,t} the principal chooses arm 2 and the regrets update to (R_{1,t−1} − θ, R_{2,t−1}). It follows that E[R_{1,t} + R_{2,t}] ≥ R_{1,t−1} + R_{2,t−1}.

In the second case, p_{2,t} ≥ p_{1,t} and c_{2,t−1} ≥ c_{1,t−1}, the arms offer (w_{1,t}, w_{2,t}) = (θ, 0), and a similar calculation shows again that E[R_{1,t} + R_{2,t}] ≥ R_{1,t−1} + R_{2,t−1}. It follows that R_{1,t} + R_{2,t} forms a submartingale.

From the above analysis, it is also clear that |(R_{1,t} + R_{2,t}) − (R_{1,t−1} + R_{2,t−1})| ≤ θ. It follows from Azuma's inequality that, for any fixed t,

$$\Pr\left[R_{1,t}+R_{2,t}\le -2\theta\sqrt{T\log T}\right]\le\frac{1}{T^2}$$

Applying the union bound, with probability at least 1 − 1/T, R_{1,t} + R_{2,t} ≥ −2θ√(T log T) for all t. Furthermore, since the principal is using a (ρ,δ)-low-regret algorithm, it is also true that with probability at least 1 − ρ (for any fixed t) both R_{1,t} and R_{2,t} are at most δ. Applying the union bound again, it is true that R_{1,t} ≤ δ and R_{2,t} ≤ δ for all t with probability at least 1 − Tρ ≥ 1 − 1/T. Finally, combining this with the earlier inequality (and applying the union bound once more), with probability at least 1 − 2/T, R_{i,t} ∈ [−δ − 2θ√(T log T), δ] for all i and t, as desired. For the remainder of the proof, condition on this being true.

We next proceed to bound the probability that (for a fixed t) c_{1,t} − c_{2,t} > B. Define the random variable τ to be the largest value s ≤ t such that c_{1,s} = c_{2,s}; note that if c_{1,t} − c_{2,t} > 0, then c_{1,s} ≥ c_{2,s} for all s in the range [τ, t]. Additionally, let Δ_s denote the random variable given by the difference (c_{1,s} − c_{2,s}) − (c_{1,s−1} − c_{2,s−1}). We can then write

$$c_{1,t}-c_{2,t}\le\sum_{s=\tau+1}^{t}\Delta_s\le\sum_{s=\tau+1}^{t}\Delta_s\cdot\mathbb{1}_{p_{1,s}>p_{2,s}}+\sum_{s=\tau+1}^{t}\Delta_s\cdot\mathbb{1}_{p_{1,s}\le p_{2,s}}$$

Here the first summand corresponds to times where one of the arms offers 0 (and hence the regrets change), and the second summand corresponds to times where both arms offer θ. Note that since c_{1,s} ≥ c_{2,s} in this interval, the regret R_{2,s} increases by θ whenever Δ_s = 1 and p_{1,s} > p_{2,s} (i.e., arm 1 is chosen), and furthermore no choice of arm can decrease R_{2,s} in this interval. Since we know that R_{2,s} lies in the interval [−δ − 2θ√(T log T), δ] for all s, this bounds the first sum by

$$\sum_{s=\tau+1}^{t}\Delta_s\cdot\mathbb{1}_{p_{1,s}>p_{2,s}}\le\frac{2\delta+2\theta\sqrt{T\log T}}{\theta}=\frac{2\delta}{\theta}+2\sqrt{T\log T}$$

On the other hand, when p_{1,s} ≤ p_{2,s}, then E[Δ_s] ≤ 0. By Hoeffding's inequality, it then follows that with probability at least 1 − 1/T²,

$$\sum_{s=\tau+1}^{t}\Delta_s\cdot\mathbb{1}_{p_{1,s}\le p_{2,s}}\le 2\sqrt{T\log T}$$

Altogether, this shows that with probability at least 1 − 1/T²,

$$c_{1,t}-c_{2,t}\le\frac{2\delta}{\theta}+4\sqrt{T\log T}\le 6\sqrt{T\delta}=B$$

The above inequality therefore holds for all t with probability at least 1 − 1/T. Likewise, we can show that c_{2,t} − c_{1,t} ≤ B also holds for all t with probability at least 1 − 1/T. Since we are conditioned on the regrets being bounded (which is true with probability at least 1 − 2/T), it follows that |c_{1,t} − c_{2,t}| ≤ B for all t with probability at least 1 − 4/T. ∎

By Lemma 3.3, we know that with probability at least 1 − 4/T, |c_{1,t} − c_{2,t}| ≤ B throughout the mechanism. In this case, arm 1 never uses step (a), and c_{1,T} ≥ (T − B)/2. Therefore

$$\begin{aligned} u_1(M,S^*,S^*)&\ge\left(1-\frac{4}{T}\right)\cdot(\mu_1-\theta)\cdot(T-B)/2\\ &\ge\frac{\mu_1T}{2}\left(1-\frac{4}{T}-\frac{\theta}{\mu_1}-\frac{B}{T}\right)\\ &=\frac{\mu_1T}{2}-2\mu_1-\frac{\theta T}{2}-\frac{B\mu_1}{2}\\ &\ge\frac{\mu_1T}{2}-O(\sqrt{T\delta}) \end{aligned}$$

Now we will show that u_1(M, S′, S*) ≤ μ_1T/2 + O(√(Tδ)). Without loss of generality, we can assume S′ is deterministic. Let M_R be the deterministic mechanism when M's randomness is fixed to some outcome R. Consider the situation when arm 1 is using strategy S′, arm 2 is using strategy S*, and the principal is using mechanism M_R. There are two cases:

1. is true for all . In this case, we have

$$u_1(M_R, S', S^*) \;\le\; c_{1,T}\cdot\mu_1 \;\le\; \mu_1(T + B)/2.$$
2. There exists some such that : Let be the smallest such that . We know that . Therefore we have

$$\begin{aligned} u_1(M_R, S', S^*) &= \sum_{t=1}^{T} (\mu_1 - w_{1,t})\cdot\mathbf{1}_{I_t = 1} \\ &= \sum_{t=1}^{T} (\mu_1 - w_{2,t})\cdot\mathbf{1}_{I_t = 1} + \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1} \\ &\le c_{1,\tau_R}\,\mu_1 + \mu_1 + (T - \tau_R - 1)\max(\mu_1 - \mu_2,\, 0) + \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1} \\ &\le \mu_1(\tau_R + B)/2 + \mu_1 + (T - \tau_R - 1)(\mu_1/2) + \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1} \\ &\le \mu_1 T/2 + \mu_1(B+1)/2 + \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1}. \end{aligned}$$

In general, we thus have that

$$u_1(M_R, S', S^*) \;\le\; \mu_1 T/2 + \mu_1(B+1)/2 + \max\left(0,\; \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1}\right).$$

Therefore

$$u_1(M, S', S^*) \;=\; \mathbb{E}_R\!\left[u_1(M_R, S', S^*)\right] \;\le\; \mu_1 T/2 + \mu_1(B+1)/2 + \mathbb{E}_R\!\left[\max\left(0,\; \sum_{t=1}^{T} (w_{2,t} - w_{1,t})\cdot\mathbf{1}_{I_t = 1}\right)\right].$$

Notice that is the regret of not playing arm 2 (i.e., in the proof of Lemma 3.3). Since the mechanism is low regret, with probability , this sum is at most (and in the worst case, it is bounded above by ). We therefore have that:

$$u_1(M, S', S^*) \;\le\; \frac{\mu_1 T}{2} + \frac{\mu_1(B+1)}{2} + \delta + \rho T \mu_2 \;\le\; \frac{\mu_1 T}{2} + O(\sqrt{T\delta})$$

From this and our earlier lower bound on , it follows that , thus establishing that is an -Nash equilibrium for the arms.

Finally, to bound the revenue of the principal, note that if the arms both play according to and for all (so they do not defect), the principal gets a maximum of revenue overall. Since (by Lemma 3.3) this happens with probability at least (and the total revenue the principal receives is bounded above by ), it follows that the total expected revenue of the principal is at most . ∎

We now extend this proof to the arm case, where can be as large as .

###### Theorem 3.4.

Let mechanism be a -low regret algorithm for the multi-armed bandit problem with arms, where , , and . Then in the strategic multi-armed bandit problem under the tacit observational model, there exist distributions and an -Nash Equilibrium for the arms where the principal gets at most revenue.

###### Proof Sketch.

As in the previous proof, let denote the mean of the th arm’s distribution . Without loss of generality, further assume that . We will show that as long as , there exists some -Nash equilibrium for the arms where the principal gets at most revenue.

We begin by describing the equilibrium strategy for the arms. Let denote the number of times arm has been pulled up to time . As before, set and set . The equilibrium strategy for arm at time is as follows:

1. If at any time in the past, there exists an arm with , defect and offer your full value .

2. Compute the probability , the probability that the principal will pull arm conditioned on the history so far.

3. Offer .
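Since the inline symbols above were lost in extraction, the following Python sketch restates the shape of the strategy under explicit assumptions: arms aim for an even 1/k split of pulls, defect once any arm's pull count exceeds its fair share by a slack B, and otherwise offer a small baseline amount θ only when the principal is essentially certain to pull them. All names and thresholds are illustrative, not the paper's exact constants.

```python
# Hypothetical sketch of the k-arm collusive strategy (Theorem 3.4).
# `theta` and `B` stand in for the extraction-lost constants.

def collusive_offer(i, t, pulls, p_i, v, theta, B, k):
    """Amount arm i passes to the principal at round t.

    pulls -- pulls[j] = number of times arm j was pulled before round t
    p_i   -- principal's probability of pulling arm i given the history
    v     -- arm i's realized private value this round
    """
    # Step 1: if any arm has been pulled far more than its fair share t/k,
    # treat the collusion as broken and offer the full value from now on.
    if any(pulls[j] > t / k + B for j in range(k)):
        return v
    # Steps 2-3: while colluding, offer the baseline theta only when the
    # principal is (nearly) certain to pull this arm anyway; otherwise 0.
    return theta if p_i > 1 - 1e-9 else 0.0
```

Under this profile each arm is pulled roughly T/k times but hands over only about θT/k in total, which is how the principal's revenue is driven down.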

The remainder of the proof proceeds similarly to the proof of Theorem 3.2. The full proof can be found in Appendix A. ∎

While the theorems above merely claim that a bad set of distributions for the arms exists, note that the proofs above show it is possible to collude in a wide range of instances - in particular, in any set of distributions which satisfies . A natural question is whether we can extend the above results to show that collusion is possible under any set of distributions.

One issue with the collusion strategies in the above proofs is that if , then arm 1 will have an incentive to defect in any collusive strategy that plays all the arms evenly (arm 1 can report a bit over per round, and make every round instead of every rounds). One solution to this is to design a collusive strategy that plays some arms more than others in equilibrium (for example, playing arm 90% of the time). We show how to modify our result for two arms to achieve an arbitrary market partition and thus work over a broad set of distributions.
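To make the defection incentive concrete, here is the arithmetic under the even 50/50 split. The threshold μ1 > 2μ2 is the condition suggested by the surrounding text, and the numbers below are illustrative:

```python
# Under the even split, arm 1 is pulled every other round and keeps
# roughly mu1 per pull, i.e. mu1/2 per round on average. If mu1 > 2*mu2,
# arm 1 does better by outbidding arm 2 (offering just over mu2) every
# round, keeping mu1 - mu2 > mu1/2 per round.
mu1, mu2 = 0.9, 0.3            # here mu1 > 2 * mu2

collude_rate = mu1 / 2         # per-round payoff when colluding 50/50
defect_rate = mu1 - mu2        # per-round payoff from outbidding arm 2
assert defect_rate > collude_rate
```

This is exactly why the modification in Theorem 3.5 tilts the market split toward the stronger arm instead of splitting pulls evenly.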

###### Theorem 3.5.

Let mechanism be a -low regret algorithm for the multi-armed bandit problem with two arms, where and . Then, in the strategic multi-armed bandit problem under the tacit observational model, for any distributions of values for the arms (supported on ), there exists an -Nash Equilibrium for the arms where a principal using mechanism gets at most revenue.

###### Proof.

See Appendix A. ∎

Unfortunately, it is not as easy to modify the proof of Theorem 3.4 to prove the same result for arms. It is an interesting open question whether there exist collusive strategies for arms that can achieve an arbitrary partition of the market.

## 4 Positive Results

In this section we will show that, in contrast to the previous results on collusion, there exists a mechanism for the principal that can obtain revenue from the arms. This mechanism essentially incentivizes each arm to report the mean of its distribution and then runs a second-price auction, asking the arm with the highest mean for the second-highest mean each round. By slightly modifying this mechanism, we can obtain a mechanism that works for a combination of strategic and non-strategic arms.
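A minimal sketch of this second-price idea follows. It is not the exact schedule of Mechanism 1 (the report-checking and bonus rounds are omitted), and all names are ours:

```python
# Simplified sketch of the positive-result mechanism: pull each arm once
# to collect a reported mean, then pull the highest reporter for the rest
# of the horizon and expect it to hand over the second-highest report.

def run_principal(reports, T):
    """reports -- reports[i] = mean reported by arm i on its first pull
       T       -- horizon, assumed larger than the number of arms"""
    k = len(reports)
    best = max(range(k), key=lambda i: reports[i])
    second = max(reports[i] for i in range(k) if i != best)
    # Middle phase: the best arm keeps (its mean - second) per round and
    # passes `second` on to the principal, for T - k rounds.
    return (T - k) * second

print(run_principal([0.9, 0.5, 0.3], 103))  # → 50.0 (100 rounds at 0.5)
```

The best arm is happy to hand over the second-highest mean each round because each paid draw still nets it a positive surplus in expectation, mirroring a second-price auction.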

Throughout this section we will assume we are working in the tacit observational model and the unrestricted payment model.

### 4.1 All Strategic Arms with Stochastic Values

We begin by considering the case when all arms are strategic.

Define as the mean of distribution for and . We assume throughout that .

We will first show that the dominant strategy of each arm in this mechanism includes truthfully reporting its mean at the beginning, and then compute the principal’s revenue under this dominant strategy.

###### Lemma 4.1.

The following strategy is the dominant strategy for arm in Mechanism 1:

1. (line 1 of Mechanism 1) Report the mean value of the first time when arm is played.

2. (lines 3,4 of Mechanism 1) If , for the rounds that the principal expects to see reported value , report the value . For the bonus round, report 0. If , report 0.

3. (line 5 of Mechanism 1) For all other rounds, report .

###### Proof.

Note that the mechanism is naturally divided into three parts (in the same way the strategy above is divided into three parts): (1) the start, where each arm is played once and reports its mean, (2) the middle, where the principal plays the best arm and extracts the second-best arm’s value (and plays each other arm once), and (3) the end, where the principal plays each arm some number of times, effectively paying them off for responding truthfully in step (1). To show the above strategy is dominant, we will proceed by backwards induction, showing that each part of the strategy is the best conditioned on an arbitrary history.

We start with step (3). It is easy to check that reports in these rounds do not affect how many times the arm is played. It follows that it is strictly dominant to just report 0 (and receive your full value for the turn). Note that the reward the arm receives in expectation for this round is ; we will use this later.

For step (2), assume that ; otherwise, arm is played only once, and the dominant strategy is to report and receive expected reward . Depending on what happened in step (1), there are two cases; either , or . We will show that if , the arm should play for the next rounds (not defecting) and report 0 for the bonus round. If , the arm should play (defecting immediately).

Note that we can recast step (2) as follows: arm starts by receiving a reward from its distribution . For the next turns, it can pay for the privilege of drawing a new reward from its distribution (ending the game immediately if it refuses to pay). If , then paying for a reward is positive in expectation, whereas if , then paying for a reward is negative in expectation. It follows that the dominant strategy is to continue to report if (receiving a total expected reward of ) and to immediately defect and report if (receiving a total expected reward of ).
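The recast argument is a one-line expected-utility comparison. The following check, with illustrative values (mu for the arm's mean, w for the per-draw price), confirms both directions:

```python
# Step (2) as a pay-per-draw game: the arm gets one free draw (mean mu),
# then may pay w per round for each additional draw. Continuing beats
# defecting exactly when mu > w.

def middle_phase_utility(mu, w, rounds, defect):
    if defect:
        return mu                     # keep the first draw, stop
    return mu + rounds * (mu - w)     # first draw plus `rounds` paid draws

# mu > w: continuing is better; mu < w: defecting immediately is better.
assert middle_phase_utility(0.8, 0.6, 50, defect=False) > middle_phase_utility(0.8, 0.6, 50, defect=True)
assert middle_phase_utility(0.4, 0.6, 50, defect=True) > middle_phase_utility(0.4, 0.6, 50, defect=False)
```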

Finally, we analyze step (1). We will show that, regardless of the values reported by the other arms, it is a dominant strategy for arm to report its true mean . If arm reports , and , then arm receives expected reward

$$G \;=\; (\mu_i - w_i) + \mu_i + \max(u + \log(w_i),\, 0)\cdot\mu_i$$

If , then this is maximized when and (note that by our construction of , ). On the other hand, if , then this is maximized when and . Since , the overall maximum occurs at .

Similarly, when arm reports and , then arm receives expected reward

$$G' \;=\; (\mu_i - w_i) + \min(0,\, R(\mu_i - w')) + \mu_i + \max(u + \log(w_i),\, 0)\cdot\mu_i$$

which is similarly maximized at . Finally, it follows that if , , so it is dominant to report . On the other hand, if , then reporting will ensure and so once again it is dominant to report . ∎

###### Corollary 4.2.

Under Mechanism 1, the principal will receive revenue at least when arms use their dominant strategies, where is the second largest mean in the set of means .

###### Lemma 4.3.

For any constant , no truthful mechanism can guarantee revenue in the worst case. Here, is the largest value among , and is the second largest value among .

###### Proof.

Suppose there exists a truthful mechanism that guarantees revenue for any distributions. We will show that this leads to a contradiction.

We now consider inputs. The -th input has and . Among these inputs, one arm (call it arm ) is always the arm with largest mean and another arm is always the arm with the second largest mean. Other arms have the same input distribution in all the inputs.

Suppose all the arms use their dominant strategies. For the -th input, let be the expected number of pulls by on the arm and be the expected amount arm gives to the principal. Because the mechanism is truthful, in the -th distribution, arm prefers its dominant strategy to the dominant strategy it uses in some -th distribution (). In other words, we have, for ,

$$b_i x_i - p_i \;\ge\; b_i x_j - p_j.$$

We also have, for all ,

$$b_i x_i - p_i \;\ge\; 0.$$

Combining these inequalities, we get, for all ,

$$p_i \;\le\; b_i x_i + \sum_{j=1}^{i-1} x_j (b_{j+1} - b_j).$$
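These chained constraints can be sanity-checked numerically. The values of b_i and x_i below are arbitrary illustrative choices, with each p_i set to the largest payment the constraints allow:

```python
# Numeric check of the bound p_i <= b_i x_i + sum_{j<i} x_j (b_{j+1} - b_j)
# for any payments satisfying b_i x_i - p_i >= b_i x_j - p_j and >= 0.
b = [0.5, 0.6, 0.7, 0.8]   # increasing means of the planted best arm
x = [0.9, 0.7, 0.8, 0.6]   # expected pull fractions (arbitrary)

p = []
for i in range(len(b)):
    # largest p_i consistent with IR (p_i <= b_i x_i) and each deviation
    # constraint p_i <= p_j + b_i (x_i - x_j) for j < i
    p.append(min([b[i] * x[i]] + [p[j] + b[i] * (x[i] - x[j]) for j in range(i)]))

for i in range(len(b)):
    bound = b[i] * x[i] + sum(x[j] * (b[j + 1] - b[j]) for j in range(i))
    assert p[i] <= bound + 1e-12
```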

On the other hand, ’s revenue in the -th distribution is at most . Therefore we have, for all ,

$$p_i + (1 - x_i)\mu' \;\ge\; \alpha\cdot b_i + (1 - \alpha)\mu'.$$

So we get

$$(1 - x_i)\mu' + b_i x_i + \sum_{j=1}^{i-1} x_j (b_{j+1} - b_j) \;\ge\; \alpha\cdot b_i + (1 - \alpha)\mu'.$$

This simplifies to

$$x_i \;\ge\; \alpha + \sum_{j=1}^{i-1} x_j\,\frac{b_{j+1} - b_j}{b_i - \mu'} \;=\; \alpha + \frac{1}{i}\sum_{j=1}^{i-1} x_j$$