 # Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling

Stochastic Rank-One Bandits (Katariya et al., 2017a,b) are a simple framework for regret minimization problems over rank-one matrices of arms. The initially proposed algorithms are proved to have logarithmic regret, but do not match the existing lower bound for this problem. We close this gap by first proving that rank-one bandits are a particular instance of unimodal bandits, and then providing a new analysis of Unimodal Thompson Sampling (UTS), initially proposed by Paladino et al. (2017). We prove an asymptotically optimal regret bound on the frequentist regret of UTS and we support our claims with simulations showing the significant improvement of our method compared to the state-of-the-art.


## 1 Introduction

We consider Stochastic Rank-One Bandits, a class of bandit problems introduced by Katariya et al. (2017b). These models provide a clear framework for the exploration-exploitation problem of adaptively sampling the entries of a rank-one matrix in order to find the largest one. Consider for instance the problem of finding the best design of a display, such as a colored shape to be used as a button on a website. One may have at hand a set of different shapes, and a set of different colors to be tested. A display is a combination of those two attributes, and a priori the tester has as many options as there are different pairs of shapes and colors. Now let us assume the effect of each factor is independent of the other factor. Then, the value of a combination, for example the click rate on the constructed button, is the product of the values of each of its attributes. The better the shape, the higher the rate, and similarly for the color. This type of independence assumption is ubiquitous in click models such as the position-based model (Chuklin et al., 2015; Richardson et al., 2007). It is also closely related to online learning to rank (Zoghi et al., 2017), where sequential duels allow one to find the optimal ordering of a list of options. We review the related literature in Section 4.

We formalize our example above into a Bernoulli rank-one bandit model (Katariya et al., 2017a): this model is parameterized by two nonzero vectors $u \in [0,1]^K$ and $v \in [0,1]^L$. There are $K \times L$ arms, indexed by $(i,j) \in [K] \times [L]$, where we use the notation $[n] = \{1, \dots, n\}$ for any positive integer $n$. Each arm $(i,j)$ is associated with a Bernoulli distribution with mean $\mu_{(i,j)} = u_i v_j$. Observe that the matrix of means $\mu = u v^\top$ has rank one, hence the name. We denote the class of all such instances . At each time step $t$, an agent selects an arm $(I(t), J(t))$ and receives a reward drawn from the corresponding Bernoulli distribution, independently from previous rewards. To select $(I(t), J(t))$, the agent may exploit the knowledge of previous observations and possibly some external randomness. Formally, $(I(t), J(t))$ is measurable with respect to the $\sigma$-field generated by the past observations and this external randomness.

The objective of the learner is to adjust their selection strategy to maximize the expected total reward accumulated. The oracle or optimal strategy here is to always play the arm with largest mean. Thus, maximizing rewards is equivalent to designing a strategy with small regret, where the $T$-step regret is defined as the difference between the expected cumulative rewards of the oracle and the expected cumulative rewards of the strategy $\mathcal{A}$:

$$R_\mu(T, \mathcal{A}) = \sum_{t=1}^{T} \left[ \max_{(i,j) \in [K] \times [L]} \mu_{(i,j)} - \mathbb{E}_\mu\left[\mu_{(I(t),J(t))}\right] \right]. \qquad (1)$$

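Letting $i^\star$ and $j^\star$ denote the indices of the largest entries of the two parameter vectors $u$ and $v$ respectively, we assume that $u_{i^\star} > u_i$ for all $i \neq i^\star$ and $v_{j^\star} > v_j$ for all $j \neq j^\star$. This assumption is equivalent to assuming that the rank-one bandit instance has a unique optimal action, which is $(i^\star, j^\star)$. We let denote this class of rank-one instances with a unique optimal arm. In this paper, we will furthermore restrict our attention to rank-one models for which either $u_i > 0$ for all $i \in [K]$ or $v_j > 0$ for all $j \in [L]$. This assumption is not very restrictive, but it rules out the possibility that $u_i = 0$ and $v_j = 0$ for a certain arm $(i,j)$ (i.e. neither shape nor color attracts any user). We found this assumption to be necessary to exhibit a unimodal structure in rank-one bandits.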
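To make the model concrete, the following minimal Python sketch simulates a Bernoulli rank-one bandit. The class name `BernoulliRankOneBandit`, its methods and the example vectors are illustrative assumptions, not part of the paper, and indices are 0-based whereas the text uses 1-based indices.

```python
import random


class BernoulliRankOneBandit:
    """Hypothetical simulator: arm (i, j) is Bernoulli with mean u[i] * v[j]."""

    def __init__(self, u, v, seed=0):
        self.u = list(u)
        self.v = list(v)
        # Matrix of means: the outer product of u and v, hence of rank one.
        self.means = [[ui * vj for vj in self.v] for ui in self.u]
        self.rng = random.Random(seed)

    def pull(self, i, j):
        """Draw a 0/1 reward for arm (i, j), independently of past draws."""
        return 1 if self.rng.random() < self.means[i][j] else 0

    def best_arm(self):
        """Entry with the largest mean (assumed unique, as in the text)."""
        arms = [(i, j) for i in range(len(self.u)) for j in range(len(self.v))]
        return max(arms, key=lambda a: self.means[a[0]][a[1]])


env = BernoulliRankOneBandit([0.9, 0.5], [0.8, 0.4, 0.2], seed=1)
reward = env.pull(0, 0)
```

Storing the full matrix of means is only convenient for small examples; a learner interacting with `pull` never sees it.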

An algorithm is called uniformly efficient if its regret is sub-polynomial on every instance of the class under consideration. That is, for every such instance $\mu$ and for all $\alpha \in (0,1)$, $R_\mu(T, \mathcal{A}) = o(T^\alpha)$. In their paper, Katariya et al. (2017b) provide the first uniformly efficient algorithm, Rank1Elim, for stochastic rank-one bandits, and Katariya et al. (2017a) propose an adaptation of this algorithm tailored for Bernoulli rewards, Rank1ElimKL. They also provide a problem-dependent asymptotic lower bound on the regret, in the spirit of Lai and Robbins (1985). This type of result gives a precise characterization of the regret that any uniformly efficient algorithm must incur on a specific instance of the problem. We report their result below.

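Proposition 1. For any uniformly efficient algorithm $\mathcal{A}$ and for any Bernoulli rank-one bandit problem with means $\mu$ and unique optimal arm $(i^\star, j^\star)$,

$$\liminf_{T \to \infty} \frac{R_\mu(\mathcal{A}, T)}{\log(T)} \geq \sum_{i \in [K] \setminus \{i^\star\}} \frac{\mu_{i^\star,j^\star} - \mu_{i,j^\star}}{\mathrm{kl}(\mu_{i,j^\star}, \mu_{i^\star,j^\star})} + \sum_{j \in [L] \setminus \{j^\star\}} \frac{\mu_{i^\star,j^\star} - \mu_{i^\star,j}}{\mathrm{kl}(\mu_{i^\star,j}, \mu_{i^\star,j^\star})},$$

where $\mathrm{kl}(p, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$ is the binary relative entropy.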
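As an illustration of how the constant in this lower bound can be evaluated, here is a small Python sketch. The function names `kl_bernoulli` and `rank_one_lower_bound_constant`, the clipping constant and the example vectors are assumptions made for this example; the code assumes a unique optimal arm, so that every divergence in a denominator is positive.

```python
import math


def kl_bernoulli(p, q):
    """Binary relative entropy kl(p, q), with the convention 0 * log(0) = 0."""
    eps = 1e-15  # clipping of q to avoid log(0); an implementation choice
    q = min(max(q, eps), 1 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def rank_one_lower_bound_constant(u, v):
    """Sum of gap / kl over the best row and the best column only."""
    i_star = max(range(len(u)), key=lambda i: u[i])
    j_star = max(range(len(v)), key=lambda j: v[j])
    best = u[i_star] * v[j_star]
    constant = 0.0
    for i in range(len(u)):  # entries of the best column
        if i != i_star:
            m = u[i] * v[j_star]
            constant += (best - m) / kl_bernoulli(m, best)
    for j in range(len(v)):  # entries of the best row
        if j != j_star:
            m = u[i_star] * v[j]
            constant += (best - m) / kl_bernoulli(m, best)
    return constant


c = rank_one_lower_bound_constant([0.8, 0.4], [0.8, 0.4])
```

Only $K + L - 2$ terms enter the sum, instead of the $KL - 1$ terms of the unstructured bound.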

In contrast to this result, the Lai and Robbins (1985) lower bound, which applies to algorithms that are uniformly efficient for any reward matrix $\mu$, involves a sum over all matrix entries instead of restricting to arms in the best row and in the best column of the matrix. Thus a good algorithm for the rank-one problem should manage to select all entries that are not in this best row and column only $o(\log T)$ times. However, neither Rank1Elim nor Rank1ElimKL achieves the asymptotic performance of Proposition 1: the regret upper bounds provided by Katariya et al. (2017b, a) show much larger constants, and the empirical performance is not much tighter. A natural question one might ask then is: is this lower bound achievable?

#### Contributions

The main contribution of this paper is to close this existing gap. To do so, we notice and prove that a stochastic rank-one bandit in which at least one of the two parameter vectors has only positive entries is a particular instance of Unimodal Bandits (Combes and Proutière, 2014). Interestingly, when derived in the specific rank-one bandits setting, the algorithm proposed in the latter reference achieves the optimal asymptotic regret of Proposition 1. Unifying those two apparently independent lines of work sheds new light on stochastic rank-one bandits.

Indeed, follow-up works on unimodal bandits sought ways to construct more efficient algorithms than OSUB. In particular, Paladino et al. (2017) propose UTS, a Bayesian strategy based on Thompson Sampling (Thompson, 1933). Unfortunately, the theoretical analysis they provide does not allow one to derive an upper bound on the performance of their algorithm. We shall comment on that in Section 2.3. Thus, a second major contribution of the present work is a new finite-time analysis of the frequentist regret of UTS for Bernoulli stochastic rank-one bandits. Doing so, we provide an optimal regret bound for an efficient and easy-to-implement rank-one bandit algorithm.

Finally, our analysis provides new insights on the calibration of the leader exploration parameter which is present in other algorithms.

#### Outline

The paper is organised as follows. Section 2 proves that rank-one bandits are an instance of unimodal bandits, and describes the UTS algorithm. The regret upper bound is proved in Section 3. In order to perform a fair empirical comparison with existing rank-one bandit algorithms, we give more background on this literature in Section 4. Finally, experiments in Section 5 provide empirical evidence of the optimality of UTS and show an improvement of an order of magnitude compared to the state-of-the-art Rank1ElimKL.

## 2 Rank-One Bandits, a particular case of Unimodal Bandits

In this section, we explain why the rank-one bandit model can be seen as a graphical unimodal bandit model as introduced by Yu and Mannor (2011); Combes and Proutière (2014). For completeness, we recall the relevant definition.

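Given an undirected graph $G = (V, E)$, a vector $\mu = (\mu_k)_{k \in V}$ is unimodal with respect to $G$ if

• there exists a unique $k^\star \in V$ such that $\mu_{k^\star} = \max_{k \in V} \mu_k$;

• from any $k \neq k^\star$, we can find an increasing path to the optimal arm. Formally: for all $k \neq k^\star$, there exists a path $p = (k_1 = k, k_2, \dots, k_m = k^\star)$ of length $m$, such that for all $i \in [m-1]$, $(k_i, k_{i+1}) \in E$ and $\mu_{k_i} < \mu_{k_{i+1}}$.

We denote by $\mathcal{U}(G)$ the set of vectors that are unimodal with respect to $G$.

A bandit instance is unimodal with respect to an undirected graph $G$ if its vector of means is unimodal with respect to $G$: $\mu \in \mathcal{U}(G)$. For a unimodal instance, we define the set of neighbors of an arm $k$ as $\mathcal{N}_G(k) = \{\ell \in V : (k, \ell) \in E\}$. Without loss of generality, we can assume that $G$ does not contain self-edges (which do not contribute to increasing paths), therefore $k \notin \mathcal{N}_G(k)$. The extended neighborhood of $k$ is defined as $\mathcal{N}_G^+(k) = \mathcal{N}_G(k) \cup \{k\}$.

In a unimodal bandit problem, the learner knows the graph $G$ (hence the neighborhoods $\mathcal{N}_G(k)$ for all $k$), but not its parameters $\mu$, which must be learnt adaptively by sampling the vertices of the graph.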
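The definition above can be checked mechanically on small examples. The sketch below is only an illustration: the function name `is_unimodal` and the adjacency-list representation of the graph are assumptions. It relies on the observation that, once the maximizer is unique, an increasing path to it exists from every vertex if and only if every other vertex has a neighbor with a strictly larger mean (following such neighbors gives a strictly increasing path that can only stop at the maximizer).

```python
def is_unimodal(neighbors, mu):
    """Check unimodality of the vector mu on a graph given as adjacency lists.

    neighbors[k] lists the vertices adjacent to k (no self-edges assumed).
    """
    best = max(mu)
    maximizers = [k for k in range(len(mu)) if mu[k] == best]
    if len(maximizers) != 1:
        return False  # the optimal arm must be unique
    k_star = maximizers[0]
    # Every sub-optimal vertex needs a strictly better neighbor.
    return all(
        any(mu[l] > mu[k] for l in neighbors[k])
        for k in range(len(mu))
        if k != k_star
    )


# Path graph 0 - 1 - 2: a single peak in the middle is unimodal.
line_graph = {0: [1], 1: [0, 2], 2: [1]}
print(is_unimodal(line_graph, [0.1, 0.5, 0.3]))
```

On the same graph, the vector `[0.5, 0.1, 0.3]` is rejected, since vertex 2 is a local maximum that is not the global one.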

### 2.1 Rank-One Bandits are Unimodal

We consider the undirected graph with vertices $[K] \times [L]$ in which $(i,j)$ and $(k,\ell)$ are neighbors if and only if $(i,j) \neq (k,\ell)$ and ($i = k$ or $j = \ell$). In words, viewing the vertices as a matrix, two distinct entries are neighbors if they belong to the same row or to the same column. In particular it can be observed that the graph has diameter two, and we shall exhibit below increasing paths of length at most two between any sub-optimal arm and the best arm $(i^\star, j^\star)$.

The main result of this section is Proposition 2.1. It allows us to build on the existing results for unimodal bandits in order to close the remaining theoretical gap in the understanding of rank-one bandits.

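Proposition 2.1. Let $u$ and $v$ be two nonzero vectors such that all entries of $u$ are positive or all entries of $v$ are positive. A rank-one bandit instance parameterized by $(u, v)$ is unimodal with respect to the graph defined above.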

#### Proof

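Let $u$ and $v$ be the two vectors parameterizing the rank-one bandit model, and denote the best arm by $(i^\star, j^\star)$. Then for any $(i,j)$ with $(i,j) \neq (i^\star, j^\star)$, one can find increasing paths in the graph from $(i,j)$ to $(i^\star, j^\star)$:

• If $i = i^\star$ or $j = j^\star$, then the path $((i,j), (i^\star, j^\star))$ is valid as the two arms are neighbors and $\mu_{(i,j)} < \mu_{(i^\star,j^\star)}$;

• Otherwise, first note that either $u_i > 0$ or $v_j > 0$. In the first case $((i,j), (i,j^\star), (i^\star,j^\star))$ is a valid increasing path. Indeed, $u_i > 0$ and $v_j < v_{j^\star}$ allow us to conclude that $\mu_{(i,j)} < \mu_{(i,j^\star)}$, and the last step is covered by the previous case. In the second case, one can similarly show that $((i,j), (i^\star,j), (i^\star,j^\star))$ is a valid increasing path.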

Figure 1 below illustrates a possible optimal path in a rank-one bandit with and also shows the neighbors of a particular arm in the graph .
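The case analysis of the proof translates directly into code. The following Python sketch uses hypothetical helper names (`rank_one_neighbors`, `rank_one_increasing_path`), 0-based indices, and assumes the conditions of the proposition: unique largest entries in both vectors, and one of the two vectors with only positive entries.

```python
def rank_one_neighbors(i, j, K, L):
    """Entries sharing a row or a column with (i, j), excluding (i, j)."""
    same_row = [(i, c) for c in range(L) if c != j]
    same_col = [(r, j) for r in range(K) if r != i]
    return same_row + same_col


def rank_one_increasing_path(i, j, u, v):
    """Increasing path from (i, j) to the best arm, following the proof."""
    i_star = max(range(len(u)), key=lambda r: u[r])
    j_star = max(range(len(v)), key=lambda c: v[c])
    if (i, j) == (i_star, j_star):
        return [(i, j)]
    if i == i_star or j == j_star:
        # Already in the best row or column: one step suffices.
        return [(i, j), (i_star, j_star)]
    if u[i] > 0:
        # Move along row i to the best column, then up to the best row.
        return [(i, j), (i, j_star), (i_star, j_star)]
    # Here u[i] == 0, so v[j] > 0 by assumption: go through the best row.
    return [(i, j), (i_star, j), (i_star, j_star)]


u, v = [0.9, 0.5, 0.0], [0.8, 0.4]
print(rank_one_increasing_path(2, 1, u, v))
```

In this example the arm `(2, 1)` has mean zero because `u[2] == 0`, so the path goes through the best row rather than through row 2, whose entries are all zero.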

### 2.2 Solving Unimodal Bandits

In their initial paper, Yu and Mannor (2011) propose an algorithm based on sequential elimination that does not efficiently exploit the structure of the graph. Combes and Proutière (2014) take over the unimodal bandit problem and provide a more in-depth analysis of the achievable regret in that setting. In particular, their Theorem 4.1 states an asymptotic regret lower bound that we state below for Bernoulli rewards.

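Let $G$ define a Bernoulli unimodal bandit problem, with $\mathcal{N}_G(k)$ denoting the set of neighbors of arm $k$. Let $\mathcal{A}$ be a uniformly efficient algorithm for every Bernoulli bandit instance with means in $\mathcal{U}(G)$. Then

$$\forall \mu \in \mathcal{U}(G), \quad \liminf_{T \to \infty} \frac{R_\mu(\mathcal{A}, T)}{\ln(T)} \geq \sum_{k \in \mathcal{N}_G(k^\star)} \frac{\mu_{k^\star} - \mu_k}{\mathrm{kl}(\mu_k, \mu_{k^\star})}.$$

In the particular case of the rank-one graph of Section 2.1, the neighbors of the best arm $(i^\star, j^\star)$ are exactly the other entries of row $i^\star$ and of column $j^\star$, and we recover Proposition 1. An asymptotically optimal algorithm for unimodal bandits therefore particularizes into an asymptotically optimal algorithm for rank-one bandits.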

### 2.3 Candidate algorithms and their analysis

There exist only a few optimal algorithms for unimodal bandits. Combes and Proutière (2014) propose OSUB, a computationally efficient algorithm that is proved to have the best achievable regret. Paladino et al. (2017) propose a Bayesian alternative; however, for reasons detailed below, we believe their regret analysis does not hold as is. Another valid algorithm would be OSSB (Combes et al., 2017), a generic method for structured bandits. However, its implementation for rank-one bandits is not obvious (the matrix of empirical means would need to have rank one), and its generality often makes it less empirically efficient than algorithms exploiting a particular structure, such as the rank-one structure considered here.

#### Notation

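We now present the existing algorithms for unimodal bandits with respect to some undirected graph $G$. For each arm $k$, we let $N_k(t)$ be the number of selections of arm $k$ up to round $t$ and $\hat\mu_k(t)$ be the empirical mean of the rewards from that arm. We also define the (empirical) leader $L(t)$, an arm with largest empirical mean $\hat\mu_k(t)$, and keep track of how many times each arm has been the leader in the past by defining $\ell_k(t)$, the number of rounds up to $t$ in which arm $k$ was the leader.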

#### Optimal Sampling for Unimodal Bandits (OSUB)

OSUB (Combes and Proutière, 2014) is the adaptation of the KL-UCB algorithm of Cappé et al. (2013), an asymptotically optimal algorithm for (unstructured) Bernoulli bandits. The vanilla KL-UCB algorithm uses as upper confidence bounds the indices

$$\mathrm{UCB}_k(t) = \max\left\{ q : N_k(t)\, \mathrm{kl}(\hat\mu_k(t), q) \leq f(t) \right\},$$

for some exploration rate $f(t)$, and selects at each round the arm with largest index.
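Since $q \mapsto \mathrm{kl}(\hat\mu, q)$ is increasing on $[\hat\mu, 1]$, the index above can be computed by bisection. The Python sketch below is an illustration under assumptions: the function names, the tolerance and the clipping constant are choices made for this example, and the value of the exploration rate $f(t)$ is passed in as `threshold` rather than fixed to a particular formula.

```python
import math


def kl_bernoulli(p, q):
    """Binary relative entropy kl(p, q), with the convention 0 * log(0) = 0."""
    eps = 1e-15  # clipping of q to avoid log(0); an implementation choice
    q = min(max(q, eps), 1 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def kl_ucb_index(mean, pulls, threshold, tol=1e-9):
    """Approximate the largest q in [mean, 1] with pulls * kl(mean, q) <= threshold."""
    if pulls == 0:
        return 1.0  # no observation yet: maximal optimism
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= threshold:
            lo = mid  # mid still satisfies the constraint
        else:
            hi = mid
    return lo


index = kl_ucb_index(mean=0.5, pulls=10, threshold=math.log(100))
```

The returned value is the lower end of the final bracket, so it always satisfies the constraint, at the cost of being at most `tol` below the exact index.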

The idea of OSUB is to restrict KL-UCB to the neighborhood of the leader while adding a leader exploration mechanism to ensure that the leader gets “checked” enough and can eventually be trusted. Letting

 (2)

OSUB selects at time $t+1$

$$A_{t+1} = \begin{cases} L(t) & \text{if } \ell_{L(t)}(t) \equiv 1 \ [\gamma], \\ \arg\max_{k} \tilde{u}_k(t) & \text{else.} \end{cases} \qquad (3)$$

The parameter $\gamma$ quantifies how often the leader should be checked. OSUB is proved to be asymptotically optimal when $\gamma$ is equal to the maximal degree in $G$, which yields for rank-one bandits. Compared to KL-UCB, the alternative exploration rate that appears in the index (2) makes the analysis of OSUB quite intricate.

#### Unimodal Thompson Sampling (UTS)

For classical bandits, Thompson Sampling (TS) is known to be a good alternative to KL-UCB as it shares its optimality property for Bernoulli distributions (Kaufmann et al., 2012; Agrawal and Goyal, 2013) without the need to tune any confidence interval and often with better performance.

Paladino et al. (2017) therefore naturally proposed Unimodal Thompson Sampling (UTS). The algorithm, described in detail in Section 3.1, consists in running Thompson Sampling instead of KL-UCB in the neighborhood of the leader, while keeping a leader exploration mechanism similar to the one in (3). In their analysis, the exploration parameter $\gamma$ should also be set to a specific value in the rank-one case in order to prove the asymptotic optimality of UTS.

The analysis proposed by Paladino et al. (2017) (detailed in Appendix A of the extended version Paladino et al. (2016)) hinges on adapting some elements of the Thompson Sampling proof of Kaufmann et al. (2012) and is not completely satisfying. Our main objection is the upper bound that is proposed on the number of times a sub-optimal arm is the leader (term of the second equation on page 8). To deal with this term, a quite imprecise reduction argument is given (definition of ) showing that one essentially needs to control the quantity for Thompson Sampling playing in and being the element with largest mean in this neighborhood. However, we do not believe this quantity can be easily controlled for Thompson Sampling, as we have to handle a random number of observations (that may be small) from both and . Besides, the upper bound on proposed by Paladino et al. (2017) holds for the choice in the rank-one case, which we show is unnecessary.

Due to the lack of accuracy of the existing proof, we believe that a new, precise analysis of Unimodal Thompson Sampling is needed to corroborate its good empirical performance for rank-one bandits, which we provide in the next section. Our analysis borrows elements from both the TS analysis of Agrawal and Goyal (2013) and that of Kaufmann et al. (2012). It also reveals that unlike what was previously believed, the leader exploration parameter can be set to an arbitrary value .

## 3 Analysis of Unimodal Thompson Sampling

In this section, we present the Unimodal Thompson Sampling algorithm (UTS) for Bernoulli rank-one bandits, and we state our main theorem proving a problem-dependent regret upper bound for this algorithm, which extends to the graphical unimodal case.

### 3.1 UTS for Rank-One Bandits

UTS is a very simple, computationally efficient, anytime algorithm. Its pseudo-code for Bernoulli rank-one bandits is given in Algorithm 1. It relies on one integer parameter $\gamma$ controlling the fraction of rounds spent exploring the leader. After an initialization phase where each entry is pulled once, at each round $t$, the algorithm computes the leader $L(t)$, that is, the empirical best entry in the matrix. If the number of times $L(t)$ has been leader is a multiple of $\gamma$, UTS selects the empirical leader. The rest of the time, it draws a posterior sample for every entry in the same row and column as the leader, and selects the entry associated with the largest posterior sample. This can be viewed as performing Thompson Sampling in the extended neighborhood of the leader in the graph defined in Section 2.

For completeness, we recall that, given a prior distribution, Thompson Sampling maintains a posterior distribution for each hidden Bernoulli parameter of the problem, that is, for each entry of the matrix. To do so, it uses a convenient uniform (Beta(1,1)) prior, for which the posterior distribution is a Beta distribution: after $S$ successes in $N$ draws of an entry, its posterior is Beta$(1+S, 1+N-S)$. We refer the interested reader to the recent survey of Russo et al. (2018) for more details on the topic.
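The description above can be turned into a short simulation. The following Python sketch is a hedged reading of the prose, not the authors' Algorithm 1: the function name `uts_rank_one`, the tie-breaking rule for the leader (first maximizer in row-major order), the convention `gamma=None` for no leader exploration, and the use of the true means to compute the pseudo-regret are all assumptions made for this example.

```python
import random


def uts_rank_one(u, v, horizon, gamma=2, seed=0):
    """Simulate a UTS-like strategy on a Bernoulli rank-one bandit.

    Returns the cumulated pseudo-regret and the number of pulls of each entry.
    Assumes horizon >= len(u) * len(v).
    """
    rng = random.Random(seed)
    K, L = len(u), len(v)
    means = [[u[i] * v[j] for j in range(L)] for i in range(K)]
    best = max(max(row) for row in means)
    arms = [(i, j) for i in range(K) for j in range(L)]
    pulls = {a: 0 for a in arms}
    successes = {a: 0 for a in arms}
    leader_count = {a: 0 for a in arms}
    regret = 0.0

    def play(arm):
        nonlocal regret
        i, j = arm
        reward = 1 if rng.random() < means[i][j] else 0
        pulls[arm] += 1
        successes[arm] += reward
        regret += best - means[i][j]

    for arm in arms:  # initialization: pull each entry once
        play(arm)

    for _ in range(horizon - len(arms)):
        # Leader: entry with the largest empirical mean.
        leader = max(arms, key=lambda a: successes[a] / pulls[a])
        leader_count[leader] += 1
        if gamma is not None and leader_count[leader] % gamma == 0:
            play(leader)  # leader exploration round
            continue
        li, lj = leader
        # Leader's row and column, leader included.
        candidates = [(li, c) for c in range(L)]
        candidates += [(r, lj) for r in range(K) if r != li]
        # Beta posterior under a uniform prior: Beta(1 + S, 1 + N - S).
        samples = {
            a: rng.betavariate(1 + successes[a], 1 + pulls[a] - successes[a])
            for a in candidates
        }
        play(max(candidates, key=samples.get))

    return regret, pulls


regret, pulls = uts_rank_one([0.9, 0.5, 0.3], [0.8, 0.4, 0.2], horizon=500)
```

Each round only samples from the $K + L - 1$ posteriors in the leader's row and column, which is what makes the strategy cheap compared with Thompson Sampling over all $KL$ entries.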

### 3.2 Regret upper bound and asymptotic optimality

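UTS can be easily extended to any graphical unimodal bandit problem with respect to a graph $G$, by performing Thompson Sampling on the extended neighborhood of the leader in $G$ instead of its row and column in the rank-one graph. For this more general algorithm, we state the following theorem, which is our main technical contribution.

Theorem 3.2. Let $\mu$ be a graphical unimodal bandit instance with respect to a graph $G$. For any choice of the leader exploration parameter $\gamma$, UTS with parameter $\gamma$ satisfies, for every $\epsilon > 0$,

$$R_\mu(T, \mathrm{UTS}(\gamma)) \leq (1+\epsilon) \sum_{k \in \mathcal{N}(k^\star)} \frac{\mu^\star - \mu_k}{\mathrm{kl}(\mu_k, \mu^\star)} \ln(T) + C(\mu, \gamma, \epsilon),$$

where $C(\mu, \gamma, \epsilon)$ is some constant depending on the environment $\mu$, on $\gamma$ and on $\epsilon$.

A consequence of this finite-time bound is that, for every parameter $\gamma$,

$$\limsup_{T \to \infty} \frac{R_\mu(T, \mathrm{UTS}(\gamma))}{\ln(T)} \leq \sum_{k \in \mathcal{N}(k^\star)} \frac{\mu^\star - \mu_k}{\mathrm{kl}(\mu_k, \mu^\star)},$$

therefore UTS is asymptotically optimal for any graphical unimodal bandit problem. Particularizing this result to rank-one bandits, one obtains that Algorithm 1 has a regret which asymptotically matches the lower bound in Proposition 1.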

Unlike previous work, in which logarithmic regret is proved only for a specific choice of $\gamma$ in the rank-one case (for general unimodal bandits, OSUB sets $\gamma$ to be the maximal degree of an arm, whereas UTS adaptively sets $\gamma$ to be the degree of the current leader; both parameterizations coincide for rank-one bandits), we emphasize that this result holds for any choice of the leader exploration parameter. We conjecture that UTS without any leader exploration scheme is also asymptotically optimal. However, our experiments of Section 5 reveal that this particular kind of “forced exploration” does not hurt for rank-one bandits, and that the choice $\gamma = 2$ actually leads to the best empirical performance.

### 3.3 Proof of Theorem 3.2

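We consider a general $K$-armed graphical unimodal bandit problem with respect to some graph $G$ and let $K(t)$ denote the arm selected at round $t$. We recall some important notations defined in Section 2.3: the number of arm selections $N_k(t)$, the empirical means $\hat\mu_k(t)$, the leader $L(t)$, which is an arm with largest empirical mean, and the number of times arm $k$ has been the leader up to time $t$, $\ell_k(t)$. Observe that the leader exploration scheme ensures that

$$\forall k \in \{1, \dots, K\}, \ \forall t \in \mathbb{N}, \quad N_k(t) \geq \lfloor \ell_k(t) / \gamma \rfloor. \qquad (4)$$

Introducing the gap $\Delta_k = \mu^\star - \mu_k$, recall that the regret can be rewritten as a sum over sub-optimal arms of $\Delta_k$ times the expected number of selections of arm $k$. Just like in the analysis of Combes and Proutière (2014) and Paladino et al. (2017), we start by distinguishing the times when the leader is the optimal arm, and the times when the leader is a sub-optimal arm:

$$\begin{aligned} R_\mu(T, \mathrm{UTS}(\gamma)) &= \sum_{k \neq k^\star} \Delta_k \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}(K(t) = k) \right] \\ &= \underbrace{\sum_{k \neq k^\star} \Delta_k \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}(K(t) = k, L(t) = k^\star) \right]}_{R_1(T)} + \underbrace{\sum_{k \neq k^\star} \Delta_k \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}(K(t) = k, L(t) \neq k^\star) \right]}_{R_2(T)}. \end{aligned}$$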

To upper bound $R_1(T)$, it can be noted that when $k^\star$ is the leader, the selected arm is necessarily in the extended neighborhood of $k^\star$, hence the sum can be restricted to the neighborhood of $k^\star$. Therefore, we expect to upper bound $R_1(T)$ by the same quantity which upper bounds the regret of Thompson Sampling restricted to the extended neighborhood of $k^\star$. Such an argument is used for KL-UCB and Thompson Sampling by Combes and Proutière (2014) and Paladino et al. (2017) respectively, without much justification. However, a proper justification does need some care: between two times at which the leader is $k^\star$, UTS may update the posterior of some arms in the neighborhood of $k^\star$, since they also belong to the neighborhoods of other potential leaders.

In this work, we carefully adapt the analysis of Agrawal and Goyal (2013) to get the following upper bound. The proof can be found in Appendix B.

For all and all ,

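$$R_1(T) \leq (1+\epsilon) \sum_{k \in \mathcal{N}(k^\star)} \frac{\Delta_k}{\mathrm{kl}(\mu_k, \mu^\star)} \ln(T) + \tilde{C}(\mu, \epsilon),$$

for some quantity $\tilde{C}(\mu, \epsilon)$ which depends on the means $\mu$ and on $\epsilon$ but not on $T$.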

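We now upper bound $R_2(T)$, which can be related to the probability that any given sub-optimal arm is the leader:

$$\begin{aligned} R_2(T) &\leq \sum_{\ell \neq k^\star} \sum_{k \neq k^\star} \Delta_k \mathbb{E}\left[ \sum_{t=1}^{T} \mathbb{1}(K(t) = k, L(t) = \ell) \right] \\ &\leq \sum_{\ell \neq k^\star} \sum_{t=1}^{T} \mathbb{E}\left[ \mathbb{1}(L(t) = \ell) \sum_{k \neq k^\star} \mathbb{1}(K(t) = k) \right] = \sum_{k \neq k^\star} \sum_{t=1}^{T} \mathbb{P}(L(t) = k). \end{aligned}$$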

For each , we define the set of best neighbors of , . Due to the unimodal structure, we know this set is nonempty because there exists at least one arm such that (such an arm belongs to the path from to ). All arms belonging to have same mean, that we note . We also introduce , the maximal number of best arms in the neighborhood of all sub-optimal arms, which is bounded by the maximum degree of the graph. With this notation, one can write, for any ,

$$\sum_{t=1}^{T} \mathbb{P}(L(t) = k) = \underbrace{\sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \exists k_2 \in \mathcal{BN}(k), N_{k_2}(t) > (\ell_k(t))^b \right)}_{\mathcal{T}^k_1(T)} + \underbrace{\sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \forall k_2 \in \mathcal{BN}(k), N_{k_2}(t) \leq (\ell_k(t))^b \right)}_{\mathcal{T}^k_2(T)}$$

The first term can be easily upper bounded by using the fact that if both arm and one of its best neighbors are selected enough, it is unlikely that .

On the event , the empirical mean of the -th arm is necessarily greater than that of the other arms (especially those in ) . Therefore, letting ,

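$$\begin{aligned} \mathcal{T}^k_1(T) &= \sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \exists k_2 \in \mathcal{BN}(k), \hat\mu_k(t) \geq \hat\mu_{k_2}(t), N_{k_2}(t) > (\ell_k(t))^b \right) \\ &\leq \sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \hat\mu_k(t) > \mu_k + \delta_k, N_k(t) > \lfloor \ell_k(t)/\gamma \rfloor \right) \qquad (6) \\ &\quad + \sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \exists k_2 \in \mathcal{BN}(k), \hat\mu_{k_2}(t) \leq \mu_{k_2} - \delta_k, N_{k_2}(t) > (\ell_k(t))^b \right), \end{aligned}$$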

where in (6), we have used the leader exploration mechanism (4). The term (6) and the last term of the bound can be upper bounded in the same way, by introducing the sequence of stopping times $(\tau^k_i)_i$, where $\tau^k_i$ is the instant at which arm $k$ is the leader for the $i$-th time (possibly infinite if arm $k$ is the leader only a finite number of times when UTS is run forever). For the last term, this gives

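$$\begin{aligned} &\sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \exists k_2 \in \mathcal{BN}(k), \hat\mu_{k_2}(t) \leq \mu_{k_2} - \delta_k, N_{k_2}(t) > (\ell_k(t))^b \right) \\ &\quad \leq \sum_{k_2 \in \mathcal{BN}(k)} \sum_{i=1}^{T} \sum_{t=1}^{T} \mathbb{E}\left[ \mathbb{1}\left( L(t) = k, \ell_k(t) = i, \hat\mu_{k_2}(t) \leq \mu_{k_2} - \delta_k, N_{k_2}(t) > i^b \right) \right] \\ &\quad = \tilde{B} \sum_{i=1}^{T} \mathbb{P}\left( \hat\mu_{k_2}(\tau^k_i) \leq \mu_{k_2} - \delta_k, N_{k_2}(\tau^k_i) > i^b, \tau^k_i \leq T \right) \\ &\quad \leq \tilde{B} \sum_{i=1}^{T} \sum_{u=i^b}^{T} \mathbb{P}\left( \hat\mu_{k_2,u} \leq \mu_{k_2} - \delta_k, N_{k_2}(\tau^k_i) = u \right) \\ &\quad \leq \tilde{B} \sum_{i=1}^{\infty} \sum_{u=i^b}^{\infty} \exp(-2\delta_k^2 u) \leq \tilde{B} \sum_{i=1}^{\infty} \frac{\exp(-2\delta_k^2 i^b)}{1 - \exp(-2\delta_k^2)}. \end{aligned}$$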

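The notation $\hat\mu_{k_2,u}$ used above denotes the empirical mean of the first $u$ observations from arm $k_2$, which are i.i.d. with mean $\mu_{k_2}$. Thus, Hoeffding’s inequality can be applied to obtain the last but one inequality, and the last one follows by summing the geometric series in $u$.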

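To upper bound (6) we use the same approach (with $i^b$ replaced by $\lfloor i/\gamma \rfloor$), which yields

$$\mathcal{T}^k_1(T) \leq \sum_{i=1}^{\infty} \frac{\exp(-2\delta_k^2 i^b)}{1 - \exp(-2\delta_k^2)} + \sum_{i=1}^{\infty} \frac{\exp(-2\delta_k^2 \lfloor i/\gamma \rfloor)}{1 - \exp(-2\delta_k^2)} := C_k(\mu, \gamma, b) < \infty.$$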

To finish the proof, we upper bound $\mathcal{T}^k_2(T)$ for some well chosen value of $b$. The upper bound given in Lemma 3.3 is a careful adaptation (and generalization) of the proof of Proposition 1 in Kaufmann et al. (2012), which says that for vanilla Thompson Sampling restricted to the extended neighborhood of $k$, the (unique) optimal arm cannot be drawn too few times. Observe that Lemma 3.3 can handle several optimal arms in this neighborhood. Again, we emphasize that in UTS, there is an extra difficulty due to the fact that arms in $\mathcal{BN}(k)$ are not only selected when $k$ is the leader. The proof of Lemma 3.3, given in Appendix C, overcomes this difficulty.

When , there exists and a constant such that

$$\sum_{t=1}^{T} \mathbb{P}\left( L(t) = k, \forall k_2 \in \mathcal{BN}(k), N_{k_2}(t) \leq (\ell_k(t))^b \right) \leq D_k(\mu, b, \gamma).$$

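Putting things together, one obtains, for all $\epsilon > 0$, with $b$ chosen as in Lemma 3.3,

$$R_\mu(\mathcal{A}, T) \leq (1+\epsilon) \sum_{k \in \mathcal{N}(k^\star)} \frac{\Delta_k}{\mathrm{kl}(\mu_k, \mu^\star)} \ln(T) + \tilde{C}(\mu, \epsilon) + \sum_{k \neq k^\star} \left[ C_k(\mu, \gamma, b) + D_k(\mu, b, \gamma) \right],$$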

which yields the claimed upper bound.

## 4 Related Work on Rank-One Bandits

Multi-armed bandits are a rich class of statistical models for sequential decision making (see Lattimore and Szepesvári (2019); Bubeck et al. (2012) for two surveys). They offer a clear framework as well as computationally efficient algorithms for many practical problems such as online advertising (Zoghi et al., 2017), a context in which the empirical efficiency of Thompson Sampling (Thompson, 1933) has often been noticed (Scott, 2010; Chapelle and Li, 2011). The wide success of Bayesian methods in bandit or reinforcement learning problems can no longer be ignored (Russo et al., 2018; Osband and Van Roy, 2017).

As already mentioned, stochastic rank-one bandits were introduced by Katariya et al. (2017b, a), which are among the works closest to ours. The original algorithm proposed therein, Rank1Elim, relies on a complex sequential elimination scheme. It operates in stages that progressively quadruple in length. At the end of each stage, the significantly worst rows and columns are eliminated; this is done using carefully tuned confidence intervals. The exploration is simple but costly: every remaining row is played with a randomly chosen remaining column, and conversely for the columns. At the end of the stage, the value of each row is computed by averaging over all columns, such that the estimate of the row parameter is scaled by some measurable constant that is the same for all rows. Then, UCB or KL-UCB confidence intervals are used to perform the elimination by Rank1Elim or Rank1ElimKL respectively. The advantage of this method is that the worst rows and columns disappear very early from the game. However, eliminating them requires that their confidence intervals no longer intersect, which is quite costly. Moreover, the averaging performed to compute individual estimates for each parameter may be arbitrarily bad: if all columns but one have a parameter close to zero, the scaling constant on the row estimates is close to zero and the rows become hard to distinguish. All those issues are mentioned in the corresponding papers. Nonetheless, the advantage of a rank-one algorithm, as opposed to playing a vanilla bandit algorithm, on a large (typically ) matrix remains significant, which has motivated various further work on the topic.

In particular, Kveton et al. (2017) generalize this elimination scheme to low-rank matrices, where the objective is to discover the best set of entries. Jun et al. (2019) slightly modify the problem and formulate it as Bilinear bandits, where the expected payoff of the two chosen vector arms is a bilinear form parameterized by a low-rank matrix. Kotłowski and Neu (2019) study an adversarial version of this problem, the Bandits Online PCA: the learner sequentially chooses vectors and observes a loss, where the loss is arbitrarily and possibly adversarially chosen by the environment. Zimmert and Seldin (2018) consider a more general problem where matrices are replaced by rank-one tensors. The main message of the paper is to propose a unified view of Factored Bandits encompassing both rank-one bandits and dueling bandits (Yue and Joachims, 2009).

## 5 Numerical Experiments

To assess the empirical efficiency of UTS against other competitors, we follow the same experimental protocol as Katariya et al. (2017a) and run the algorithm on simulated matrices of arms of increasing sizes. We set $K = L$ for different values of $K$. The parameters are defined symmetrically: such that the best entry of the matrix is always . In our experiments, the cumulative regret up to a horizon is estimated based on independent runs. The shaded areas on our plots show the 10% percentiles.

#### Study of hyperparameter γ

According to the original paper, the exploration parameter $\gamma$ of UTS should be set to a specific value for rank-one bandits. However, in the proof we derived in Section 3, there is no need to fix $\gamma$ to this value. To confirm this statement and study the influence of $\gamma$, we ran UTS on the toy problem described above, with different values of $\gamma$. We also ran the heuristic version of UTS that uses no leader exploration scheme (corresponding to $\gamma = +\infty$).

In Figure 2, we show the cumulative regret in log-scale. We notice that all curves align with the optimal logarithmic rate, with a lower offset for lower values of $\gamma$. Empirically, the performance seems the best for $\gamma = 2$.

Figure 2: Cumulative regret of UTS for $\gamma$ varying in $\{2, 5, 10, 20, +\infty\}$ for $K = 4$.

Figure 3: Cumulative regret of Rank1ElimKL, OSUB, UTS and KL-UCB, on $K \times K$ rank-one matrices with $K = 4$ (top left), $K = 8$ (top right) and $K = 16$ (bottom).

#### Cumulative regret and optimality of UTS.

We now compare the regret of run with to that of other algorithms on the above mentioned family of instances for different values of in . Note that in Katariya et al. (2017a), the simulations are run on larger matrices, for . In those settings, only outperforms for but it is better than and one can easily see that it scales better with the problem size than UCB1. However, given the much better performance of and , we were able to show the same trends with much smaller problem sizes.

In Figure 3 we compare the cumulative regret of UTS with Rank1ElimKL, OSUB (with ) and KL-UCB. One first obvious observation is that Rank1ElimKL has a regret an order of magnitude larger than all other policies, including KL-UCB, on this size of problems. We also notice that the final regret roughly doubles for all rank-one policies each time $K$ doubles, while it quadruples for KL-UCB, as expected. To illustrate the asymptotic optimality of UTS and OSUB compared to KL-UCB, we show in Figure 4 the results of the simulations in log-scale, and we plot the lower bound of Proposition 1. We observe that both optimal policies asymptotically align with the lower bound, while KL-UCB adopts a faster growth rate, which indeed corresponds to the constant of Lai and Robbins.

Figure 4: Regret for $K = 4$ in log-scale: the lower bound (in blue) shows the optimal asymptotic logarithmic growth of the regret. UTS and OSUB align with it, while KL-UCB has a larger slope.

## 6 Conclusion

This paper proposed a new perspective on the rank-one bandit problem by showing it can be cast into the unimodal bandit framework. This led us to propose an algorithm closing the gap between existing regret upper and lower bounds for Bernoulli rank-one bandits: Unimodal Thompson Sampling (UTS). UTS is easy to implement and very efficient in practice, as our experimental study reveals an improvement of a factor at least 20 with respect to the state-of-the-art algorithm. Our main theoretical contribution is a novel regret analysis of this algorithm in the general unimodal setting, which sheds new light on the leader exploration parameter to use. Interestingly, forcing exploration of the leader appears to help in practice in the rank-one example, and it may be interesting to investigate whether this remains the case for other structured bandit problems (Combes et al., 2017).

#### Acknowledgement

The authors acknowledge the French National Research Agency under projects BADASS (ANR-16-CE40-0002) and BOLD (ANR-19-CE23-0026-04).

## References

• Agrawal and Goyal (2013) Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th Conference on Artificial Intelligence and Statistics, 2013.
• Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
• Cappé et al. (2013) Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.
• Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
• Chuklin et al. (2015) Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1–115, 2015.
• Combes and Proutière (2014) Richard Combes and Alexandre Proutière. Unimodal bandits: Regret lower bounds and optimal algorithms. 2014.
• Combes et al. (2017) Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, 2017.
• Jun et al. (2019) Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. arXiv preprint arXiv:1901.02470, 2019.
• Katariya et al. (2017a) Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. In IJCAI, 2017a.
• Katariya et al. (2017b) Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b.
• Kaufmann et al. (2012) Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: an Asymptotically Optimal Finite-Time Analysis. In Proceedings of the 23rd conference on Algorithmic Learning Theory, 2012.
• Kotłowski and Neu (2019) Wojciech Kotłowski and Gergely Neu. Bandit principal component analysis. arXiv preprint arXiv:1902.03035, 2019.
• Kveton et al. (2017) Branislav Kveton, Csaba Szepesvári, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, and S Muthukrishnan. Stochastic low-rank bandits. arXiv preprint arXiv:1712.04644, 2017.
• Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
• Lattimore and Szepesvári (2019) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2019.
• Osband and Van Roy (2017) Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR.org, 2017.
• Paladino et al. (2016) Stefano Paladino, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Unimodal thompson sampling for graph-structured arms. arXiv:1611.05724v2, 2016.
• Paladino et al. (2017) Stefano Paladino, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Unimodal thompson sampling for graph-structured arms. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
• Richardson et al. (2007) Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pages 521–530. ACM, 2007.
• Russo et al. (2018) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
• Scott (2010) Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
• Thompson (1933) William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• Yu and Mannor (2011) Jia Yuan Yu and Shie Mannor. Unimodal bandits. Citeseer, 2011.
• Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.
• Zimmert and Seldin (2018) Julian Zimmert and Yevgeny Seldin. Factored bandits. In Advances in Neural Information Processing Systems, pages 2835–2844, 2018.
• Zoghi et al. (2017) Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4199–4208. JMLR.org, 2017.

## Appendix A Important Results

We recall two important results that are repeatedly used in our analysis.

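(Hoeffding’s inequality) Let $X_1, \dots, X_n$ be independent bounded random variables supported in $[0,1]$. For all $t \geq 0$,

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mathbb{E}[X_i]\right) \geq t\right) \leq \exp\left(-2nt^2\right)$$

and

$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mathbb{E}[X_i]\right) \leq -t\right) \leq \exp\left(-2nt^2\right).$$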
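As a quick sanity check of the upper-tail bound, the following sketch compares a Monte Carlo estimate of the deviation probability with $\exp(-2nt^2)$ for Bernoulli samples. The function names, the Bernoulli parameter, the sample sizes and the numerical tolerance are illustrative assumptions and do not come from the paper.

```python
import math
import random


def hoeffding_bound(n, t):
    """Hoeffding upper bound exp(-2 n t^2) for variables supported in [0, 1]."""
    return math.exp(-2.0 * n * t * t)


def empirical_tail(n, t, p=0.5, trials=5000, seed=0):
    """Monte Carlo estimate of P(mean(X_1..X_n) - p >= t) for X_i ~ Bernoulli(p)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Empirical mean of n independent Bernoulli(p) draws.
        mean = sum(rng.random() < p for _ in range(n)) / n
        # Small tolerance so that exact ties are not lost to floating-point rounding.
        if mean - p >= t - 1e-12:
            hits += 1
    return hits / trials


n, t = 50, 0.1
print("empirical tail:", empirical_tail(n, t))
print("Hoeffding bound:", hoeffding_bound(n, t))
```

The estimate should fall below the bound, typically with a visible margin, since Hoeffding's inequality holds for every distribution on $[0,1]$ and is not tight for any particular one.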

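(Beta-Binomial trick) Let $F^{\mathrm{Beta}}_{\alpha,\beta}$ and $F^{\mathrm{Bin}}_{n,p}$ respectively denote the cumulative distribution function of a Beta distribution with parameters $\alpha, \beta$, and of a Binomial distribution with parameters $n, p$. It holds that

$$F^{\mathrm{Beta}}_{\alpha,\beta}(y) = 1 - F^{\mathrm{Bin}}_{\alpha+\beta-1,\,y}(\alpha-1).$$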
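The identity can be checked numerically for integer parameters, which is the case that arises with Beta posteriors built from Bernoulli observations. The sketch below is only an illustration: `binom_cdf` and `beta_cdf` are hypothetical helper names, and the Beta CDF is obtained by composite Simpson integration of the density, not by a library routine.

```python
import math


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), computed by direct summation."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def beta_cdf(y, a, b, steps=2000):
    """CDF of Beta(a, b) at y for integers a, b >= 1 (composite Simpson rule)."""
    # 1 / B(a, b) for integer parameters.
    norm = math.factorial(a + b - 1) / (math.factorial(a - 1) * math.factorial(b - 1))

    def pdf(x):
        return norm * x ** (a - 1) * (1 - x) ** (b - 1)

    h = y / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(y)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


a, b, y = 3, 5, 0.4
lhs = beta_cdf(y, a, b)
rhs = 1.0 - binom_cdf(a - 1, a + b - 1, y)
print(lhs, rhs)
```

The two printed values should agree up to integration error. In analyses of Thompson Sampling, this identity is what allows probabilities about a Beta posterior sample to be rewritten as Binomial tail probabilities.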

## Appendix B Proof of Lemma 3.3

In this section, we adapt the analysis of Agrawal and Goyal (2013), highlighting the steps that need extra justification.

Let be a sub-optimal arm. We introduce two thresholds and such that , that we specify later. We define the following “good” events: (t) = {} and (t) = {}. The event can be decomposed as follows:

 {K(t)=k,L(t)=k