 # Stochastic Rank-1 Bandits

We propose stochastic rank-1 bandits, a class of online learning problems where at each step a learning agent chooses a pair of row and column arms, and receives the product of their values as a reward. The main challenge of the problem is that the individual values of the row and column are unobserved. We assume that these values are stochastic and drawn independently. We propose a computationally-efficient algorithm for solving our problem, which we call Rank1Elim. We derive a $O((K + L)(1/\Delta) \log n)$ upper bound on its $n$-step regret, where $K$ is the number of rows, $L$ is the number of columns, and $\Delta$ is the minimum of the row and column gaps, under the assumption that the mean row and column rewards are bounded away from zero. To the best of our knowledge, we present the first bandit algorithm that finds the maximum entry of a rank-1 matrix whose regret is linear in $K + L$, $1/\Delta$, and $\log n$. We also derive a nearly matching lower bound. Finally, we evaluate Rank1Elim empirically on multiple problems. We observe that it leverages the structure of our problems and can learn near-optimal solutions even if our modeling assumptions are mildly violated.


## 1 Introduction

We study the problem of finding the maximum entry of a stochastic rank-1 matrix from noisy and adaptively-chosen observations. This problem is motivated by two problems, ranking in the position-based model and online advertising.

The position-based model (PBM) is one of the most fundamental click models, a model of how people click on a list of $K$ items out of $L$. This model is defined as follows. Each item is associated with its attraction probability and each position in the list is associated with its examination probability. The attraction of any item and the examination of any position are i.i.d. Bernoulli random variables. An item in the list is clicked only if it is attractive and its position is examined. Under these assumptions, the pair of the item and position that maximizes the probability of clicking is the maximum entry of a rank-1 matrix, which is the outer product of the attraction probabilities of items and the examination probabilities of positions.

As another example, consider a marketer of a product who has two sets of actions, population segments and marketing channels. Given a product, some segments are easier to market to and some channels are more appropriate. Now suppose that a conversion happens only if both actions are successful and that the successes of these actions are independent. Then similarly to our earlier example, the pair of the population segment and marketing channel that maximizes the conversion rate is the maximum entry of a rank-1 matrix.

We propose an online learning model for solving our motivating problems, which we call a stochastic rank-1 bandit. The learning agent interacts with our problem as follows. At time $t$, the agent selects a pair of row and column arms, and receives the product of their individual values as a reward. The values are stochastic, drawn independently, and not observed. The goal of the agent is to maximize its expected cumulative reward, or equivalently to minimize its expected cumulative regret with respect to the optimal solution, the most rewarding pair of row and column arms.
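The interaction protocol above can be sketched as a small simulator. This is a minimal sketch with Bernoulli factors; the class name and the specific means are illustrative assumptions, and the agent only ever observes the product, never the individual row and column values.

```python
import numpy as np

class Rank1Bandit:
    """Illustrative stochastic rank-1 bandit environment."""

    def __init__(self, u_bar, v_bar, rng=None):
        self.u_bar = np.asarray(u_bar, dtype=float)  # mean row values
        self.v_bar = np.asarray(v_bar, dtype=float)  # mean column values
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i, j):
        # Row and column values are drawn independently; only their
        # product is observed, not the individual factors.
        u = self.rng.binomial(1, self.u_bar[i])
        v = self.rng.binomial(1, self.v_bar[j])
        return u * v

    def optimal_arm(self):
        # By independence, the best pair maximizes u_bar(i) * v_bar(j),
        # so the row and column can be optimized separately.
        return int(np.argmax(self.u_bar)), int(np.argmax(self.v_bar))

env = Rank1Bandit([0.9, 0.5], [0.8, 0.4, 0.2])
print(env.optimal_arm())  # (0, 0)
```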

We make five contributions. First, we precisely formulate the online learning problem of stochastic rank-1 bandits. Second, we design an elimination algorithm for solving it, which we call Rank1Elim. The key idea in Rank1Elim is to explore all remaining rows and columns randomly over all remaining columns and rows, respectively, to estimate their expected rewards; and then eliminate those rows and columns that seem suboptimal. This algorithm is computationally efficient and easy to implement. Third, we derive a $O((K + L)(1/\Delta) \log n)$ gap-dependent upper bound on its $n$-step regret, where $K$ is the number of rows, $L$ is the number of columns, and $\Delta$ is the minimum of the row and column gaps, under the assumption that the mean row and column rewards are bounded away from zero. Fourth, we derive a nearly matching gap-dependent lower bound. Finally, we evaluate our algorithm empirically. In particular, we validate the scaling of its regret, compare it to multiple baselines, and show that it can learn near-optimal solutions even if our modeling assumptions are mildly violated.

We denote random variables by boldface letters and define $[n] = \{1, \ldots, n\}$. For any sets $A$ and $B$, we denote by $A^B$ the set of all vectors whose entries are indexed by $B$ and take values from $A$.

## 2 Setting

We formulate our online learning problem as a stochastic rank-1 bandit. An instance of this problem is defined by a tuple $(K, L, P_{\textsc{u}}, P_{\textsc{v}})$, where $K$ is the number of rows, $L$ is the number of columns, $P_{\textsc{u}}$ is a probability distribution over a unit hypercube $[0, 1]^K$, and $P_{\textsc{v}}$ is a probability distribution over a unit hypercube $[0, 1]^L$.

Let $(u_t)_{t=1}^{n}$ be an i.i.d. sequence of vectors drawn from distribution $P_{\textsc{u}}$ and $(v_t)_{t=1}^{n}$ be an i.i.d. sequence of vectors drawn from distribution $P_{\textsc{v}}$, such that $u_t$ and $v_t$ are drawn independently at any time $t$. The learning agent interacts with our problem as follows. At time $t$, it chooses arm $(i_t, j_t) \in [K] \times [L]$ based on its history up to time $t$; and then observes $u_t(i_t) v_t(j_t)$, which is also its reward.

The goal of the agent is to maximize its expected cumulative reward in $n$ steps. This is equivalent to minimizing the expected cumulative regret in $n$ steps,

$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} R(i_t, j_t, u_t, v_t)\right],$$

where $R(i_t, j_t, u_t, v_t) = u_t(i^\ast) v_t(j^\ast) - u_t(i_t) v_t(j_t)$ is the instantaneous stochastic regret of the agent at time $t$ and

$$(i^\ast, j^\ast) = \arg\max_{(i, j) \in [K] \times [L]} \mathbb{E}[u_1(i) v_1(j)]$$

is the optimal solution in hindsight of knowing $P_{\textsc{u}}$ and $P_{\textsc{v}}$. Since $u_t$ and $v_t$ are drawn independently, and $u_t(i) \ge 0$ for all $i \in [K]$ and $v_t(j) \ge 0$ for all $j \in [L]$, we get that

$$i^\ast = \arg\max_{i \in [K]} \bar{u}(i), \qquad j^\ast = \arg\max_{j \in [L]} \bar{v}(j),$$

where $\bar{u} = \mathbb{E}[u_1]$ and $\bar{v} = \mathbb{E}[v_1]$. This is the key idea in our solution.

Note that the problem of learning $\bar{u}$ and $\bar{v}$ from stochastic observations $u_t(i_t) v_t(j_t)$ is a special case of matrix completion from noisy observations. This problem is harder than that of learning $(i^\ast, j^\ast)$. In particular, the most popular approach to matrix completion is alternating minimization of a non-convex function, where the observations are corrupted with Gaussian noise. In contrast, our proposed algorithm is guaranteed to learn the optimal solution with a high probability, and does not make any strong assumptions on $P_{\textsc{u}}$ and $P_{\textsc{v}}$.

## 3 Naive Solutions

Our learning problem is a $KL$-arm bandit with $K + L$ parameters, $\bar{u}$ and $\bar{v}$. The main challenge is to leverage this structure to learn efficiently. In this section, we discuss the challenges of solving our problem by existing algorithms. We conclude that a new algorithm is necessary and present it in Section 4.

Any rank-1 bandit is a multi-armed bandit with $KL$ arms. As such, it can be solved by UCB1. The $n$-step regret of UCB1 in rank-1 bandits is $O(KL(1/\Delta) \log n)$. Therefore, UCB1 is impractical when both $K$ and $L$ are large.

Note that $\log(\bar{u}(i)\bar{v}(j)) = \log(\bar{u}(i)) + \log(\bar{v}(j))$ for any $\bar{u}(i), \bar{v}(j) > 0$. Therefore, a rank-1 bandit can be viewed as a stochastic linear bandit and solved by LinUCB [8, 1], where the reward of arm $(i, j)$ is $\log(u_t(i) v_t(j))$ and its features $x_{i,j} \in \{0, 1\}^{K+L}$ are

$$x_{i,j}(e) = \begin{cases} \mathbb{1}\{e = i\}, & e \le K; \\ \mathbb{1}\{e - K = j\}, & e > K, \end{cases} \qquad (1)$$

for any $e \in [K + L]$. This approach is problematic for at least two reasons. First, the reward $\log(u_t(i) v_t(j))$ is not properly defined when either $u_t(i) = 0$ or $v_t(j) = 0$. Second,

$$\mathbb{E}[\log(u_t(i)) + \log(v_t(j))] \neq \log(\bar{u}(i)) + \log(\bar{v}(j)).$$

Nevertheless, note that both sides of the above inequality have maxima at $(i^\ast, j^\ast)$, and therefore LinUCB should perform well. We compare Rank1Elim to it in Section 6.2.
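The feature construction in (1) is simple to state in code. A minimal sketch follows, using zero-based indices; the means are illustrative, and the identity checked at the end is the log-linear pairing described above.

```python
import numpy as np

def features(i, j, K, L):
    # Indicator features x_{i,j} in {0,1}^{K+L} from (1): a one-hot row
    # index in the first K coordinates and a one-hot column index after.
    x = np.zeros(K + L)
    x[i] = 1.0
    x[K + j] = 1.0
    return x

# With theta = (log u_bar, log v_bar), the inner product recovers
# log(u_bar(i) v_bar(j)) whenever all means are positive.
u_bar = np.array([0.9, 0.5])
v_bar = np.array([0.8, 0.4])
theta = np.concatenate([np.log(u_bar), np.log(v_bar)])
x = features(1, 0, K=2, L=2)
assert np.isclose(x @ theta, np.log(u_bar[1] * v_bar[0]))
```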

Also note that $\bar{u}(i)\bar{v}(j) = \exp[\log(\bar{u}(i)) + \log(\bar{v}(j))]$ for $\bar{u}(i), \bar{v}(j) > 0$. Therefore, a rank-1 bandit can be viewed as a generalized linear bandit and solved by GLM-UCB, where the mean function is $\exp(\cdot)$ and the feature vector of arm $(i, j)$ is $x_{i,j}$ in (1). This approach is not practical for three reasons. First, the parameter space is unbounded, because $\log(\bar{u}(i)) \to -\infty$ as $\bar{u}(i) \to 0$ and $\log(\bar{v}(j)) \to -\infty$ as $\bar{v}(j) \to 0$. Second, the confidence intervals of GLM-UCB are scaled by the reciprocal of the minimum derivative of the mean function, $1 / \min_{(i,j)} \bar{u}(i)\bar{v}(j)$, which can be very large in our setting. In addition, the gap-dependent upper bound on the regret of GLM-UCB scales with $(K + L)^2$, which further indicates that GLM-UCB is not practical. Our upper bound in Theorem 1 scales much better with all quantities of interest. Third, GLM-UCB needs to compute the maximum-likelihood estimates of $\bar{u}$ and $\bar{v}$ at each step, which is a non-convex optimization problem (Section 2).

Some variants of our problem can be solved trivially. For instance, let $P_{\textsc{u}}$ and $P_{\textsc{v}}$ be such that $u_t(i)$ and $v_t(j)$ can be identified from their observed product $u_t(i) v_t(j)$. Then both factors are effectively observed, and the learning problem does not seem more difficult than a stochastic combinatorial semi-bandit. We do not focus on such degenerate cases in this paper.

## 4 Rank1Elim Algorithm

Our algorithm, Rank1Elim, is shown in Algorithm 1. It is an elimination algorithm, which maintains confidence intervals on the expected rewards of all rows and columns. Rank1Elim operates in stages, which quadruple in length. In each stage, it explores all remaining rows and columns randomly over all remaining columns and rows, respectively. At the end of the stage, it eliminates all rows and columns that cannot be optimal.

The eliminated rows and columns are tracked as follows. We denote by $h^{\textsc{u}}_\ell(i)$ the index of the most rewarding row whose expected reward is believed by Rank1Elim to be at least as high as that of row $i$ in stage $\ell$. Initially, $h^{\textsc{u}}_0(i) = i$. When row $i$ is eliminated by row $i'$ in stage $\ell$, $h^{\textsc{u}}_{\ell+1}(i)$ is set to $i'$; then when row $i'$ is eliminated by row $i''$ in a later stage $\ell'$, $h^{\textsc{u}}_{\ell'+1}(i)$ is set to $i''$; and so on. The corresponding column quantity, $h^{\textsc{v}}_\ell(j)$, is defined and updated analogously. The remaining rows and columns in stage $\ell$, $I_\ell$ and $J_\ell$, are then the unique values in $h^{\textsc{u}}_\ell$ and $h^{\textsc{v}}_\ell$, respectively; and we set these in Algorithm 1.

Each stage of Algorithm 1 has two main steps: exploration and elimination. In the row exploration step, each row $i$ is explored randomly over all remaining columns such that its expected reward up to stage $\ell$ is at least $\mu \bar{u}(i)$, where $\mu$ is in (4). To guarantee this, we sample column $j$ randomly and then substitute it with column $h^{\textsc{v}}_\ell(j)$, which is at least as rewarding as column $j$. This is critical to avoid a dependence on the reciprocal of the minimum entries of $\bar{u}$ and $\bar{v}$ in our regret bound, which can be large and is not necessary. The observations are stored in reward matrix $C^{\textsc{u}}_\ell$. As all rows are explored similarly, their expected rewards are scaled similarly, and this permits elimination. The column exploration step is analogous, with observations stored in $C^{\textsc{v}}_\ell$.

In the elimination step, the confidence intervals of all remaining rows, $[L^{\textsc{u}}_\ell(i), U^{\textsc{u}}_\ell(i)]$ for any $i \in I_\ell$, are estimated from matrix $C^{\textsc{u}}_\ell$; and the confidence intervals of all remaining columns, $[L^{\textsc{v}}_\ell(j), U^{\textsc{v}}_\ell(j)]$ for any $j \in J_\ell$, are estimated from $C^{\textsc{v}}_\ell$. This separation is needed to guarantee that the expected rewards of all remaining rows and columns are scaled similarly. The confidence intervals are designed such that

$$U^{\textsc{u}}_\ell(i) \le L^{\textsc{u}}_\ell(i_\ell) = \max_{i \in I_\ell} L^{\textsc{u}}_\ell(i)$$

implies that row $i$ is suboptimal with a high probability for any column elimination policy up to the end of stage $\ell$, and

$$U^{\textsc{v}}_\ell(j) \le L^{\textsc{v}}_\ell(j_\ell) = \max_{j \in J_\ell} L^{\textsc{v}}_\ell(j)$$

implies that column $j$ is suboptimal with a high probability for any row elimination policy up to the end of stage $\ell$. As a result, all suboptimal rows and columns are eliminated correctly with a high probability.
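The stage structure described in this section can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: the stage-length schedule, the confidence radius, and the per-stage (rather than cumulative) reward estimates are assumptions, and `env` is any object whose `pull(i, j)` returns the observed product reward.

```python
import numpy as np

def rank1elim(env, K, L, n, delta=0.05, rng=None):
    """Simplified sketch of the Rank1Elim stage loop."""
    rng = rng or np.random.default_rng(1)
    h_u, h_v = np.arange(K), np.arange(L)  # elimination pointers
    t, ell = 0, 0
    while True:
        I, J = np.unique(h_u), np.unique(h_v)      # remaining rows/columns
        n_ell = int(4 ** ell * np.log(1.0 / delta)) + 1  # stages quadruple
        if t + (len(I) + len(J)) * n_ell > n:      # stop when budget runs out
            break
        sum_u, sum_v = np.zeros(K), np.zeros(L)
        # Row exploration: each remaining row is played over random columns,
        # routed through h_v so all rows see the same column average.
        for i in I:
            for _ in range(n_ell):
                j = h_v[rng.integers(L)]
                sum_u[i] += env.pull(i, j)
                t += 1
        # Column exploration is symmetric.
        for j in J:
            for _ in range(n_ell):
                i = h_u[rng.integers(K)]
                sum_v[j] += env.pull(i, j)
                t += 1
        rad = np.sqrt(np.log(n) / (2.0 * n_ell))   # assumed confidence radius
        mu_u, mu_v = sum_u / n_ell, sum_v / n_ell
        # Eliminate a row/column whose UCB falls below the best LCB, and
        # redirect its pointer to the row/column that eliminated it.
        best_u = I[np.argmax(mu_u[I])]
        for i in I:
            if mu_u[i] + rad <= mu_u[best_u] - rad:
                h_u[h_u == i] = best_u
        best_v = J[np.argmax(mu_v[J])]
        for j in J:
            if mu_v[j] + rad <= mu_v[best_v] - rad:
                h_v[h_v == j] = best_v
        ell += 1
    return np.unique(h_u), np.unique(h_v)
```

The essential mechanics match the description above: the pointers `h_u` and `h_v` redirect eliminated rows and columns to the arms that eliminated them, so exploration is always routed through arms that are at least as rewarding.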

## 5 Analysis

This section has three subsections. In Section 5.1, we derive a gap-dependent upper bound on the $n$-step regret of Rank1Elim. In Section 5.2, we derive a gap-dependent lower bound that nearly matches our upper bound. In Section 5.3, we discuss the results of our analysis.

### 5.1 Upper Bound

The hardness of our learning problem is measured by two sets of metrics. The first metrics are gaps. The gaps of row $i$ and column $j$ are defined as

$$\Delta^{\textsc{u}}_i = \bar{u}(i^\ast) - \bar{u}(i), \qquad \Delta^{\textsc{v}}_j = \bar{v}(j^\ast) - \bar{v}(j), \qquad (2)$$

respectively; and the minimum row and column gaps are defined as

$$\Delta^{\textsc{u}}_{\min} = \min_{i \in [K]:\, \Delta^{\textsc{u}}_i > 0} \Delta^{\textsc{u}}_i, \qquad \Delta^{\textsc{v}}_{\min} = \min_{j \in [L]:\, \Delta^{\textsc{v}}_j > 0} \Delta^{\textsc{v}}_j, \qquad (3)$$

respectively. Roughly speaking, the smaller the gaps, the harder the problem. The second metric is the minimum of the average of entries in $\bar{u}$ and $\bar{v}$, which is defined as

$$\mu = \min\left\{\frac{1}{K} \sum_{i=1}^{K} \bar{u}(i),\ \frac{1}{L} \sum_{j=1}^{L} \bar{v}(j)\right\}. \qquad (4)$$

The smaller the value of $\mu$, the harder the problem. This quantity appears in our regret bound due to the averaging character of Rank1Elim (Section 4). Our upper bound on the regret of Rank1Elim is stated and proved below.
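For concreteness, these hardness quantities are easy to compute for a small instance. The means below are illustrative assumptions:

```python
import numpy as np

# Gaps in (2)-(3) and the minimum average mu in (4) for a toy instance.
u_bar = np.array([0.9, 0.7, 0.5])   # mean row values
v_bar = np.array([0.8, 0.4])        # mean column values

gaps_u = u_bar.max() - u_bar        # row gaps, as in (2)
gaps_v = v_bar.max() - v_bar        # column gaps
d_min_u = gaps_u[gaps_u > 0].min()  # minimum row gap, as in (3)
d_min_v = gaps_v[gaps_v > 0].min()  # minimum column gap
mu = min(u_bar.mean(), v_bar.mean())  # minimum average, as in (4)

assert np.isclose(mu, 0.6)
assert np.isclose(d_min_u, 0.2) and np.isclose(d_min_v, 0.4)
```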

###### Theorem 1.

The expected $n$-step regret of Rank1Elim is bounded as

$$R(n) \le \frac{1}{\mu^2} \left(\sum_{i=1}^{K} \frac{384}{\bar{\Delta}^{\textsc{u}}_i} + \sum_{j=1}^{L} \frac{384}{\bar{\Delta}^{\textsc{v}}_j}\right) \log n + 3(K + L),$$

where

$$\bar{\Delta}^{\textsc{u}}_i = \Delta^{\textsc{u}}_i + \mathbb{1}\{\Delta^{\textsc{u}}_i = 0\}\, \Delta^{\textsc{v}}_{\min}, \qquad \bar{\Delta}^{\textsc{v}}_j = \Delta^{\textsc{v}}_j + \mathbb{1}\{\Delta^{\textsc{v}}_j = 0\}\, \Delta^{\textsc{u}}_{\min}.$$

The proof of Theorem 1 is organized as follows. First, we bound the probability that at least one confidence interval is violated. The corresponding regret is small, $O(K + L)$. Second, by the design of Rank1Elim and because all confidence intervals hold, the expected reward of any remaining row $i$ is at least $\mu \bar{u}(i)$. Because all rows are explored in the same way, any suboptimal row is guaranteed to be eliminated after a number of observations that scales with $1/(\mu \bar{\Delta}^{\textsc{u}}_i)^2$ up to logarithmic factors. Third, we factorize the regret due to exploring a row into its row and column components, and bound both of them. This is possible because Rank1Elim eliminates rows and columns simultaneously. Finally, we sum up the regret of all explored rows and columns.

Note that the gaps in Theorem 1, $\bar{\Delta}^{\textsc{u}}_i$ and $\bar{\Delta}^{\textsc{v}}_j$, are slightly different from those in (2). In particular, all zero row and column gaps in (2) are substituted with the minimum column and row gaps, respectively. The reason is that the regret due to exploring optimal rows and columns is positive until all suboptimal columns and rows are eliminated, respectively. The proof of Theorem 1 is below.

###### Proof.

Let $R^{\textsc{u}}_\ell(i)$ and $R^{\textsc{v}}_\ell(j)$ be the stochastic regret associated with exploring row $i$ and column $j$, respectively, in stage $\ell$. Then the expected $n$-step regret of Rank1Elim is bounded as

$$R(n) \le \mathbb{E}\left[\sum_{\ell=0}^{n-1} \left(\sum_{i=1}^{K} R^{\textsc{u}}_\ell(i) + \sum_{j=1}^{L} R^{\textsc{v}}_\ell(j)\right)\right],$$

where the outer sum is over possibly $n$ stages. Let

$$\bar{u}_\ell(i) = \sum_{t=0}^{\ell} \mathbb{E}\left[\sum_{j=1}^{L} \frac{C^{\textsc{u}}_t(i, j) - C^{\textsc{u}}_{t-1}(i, j)}{n_\ell} \,\middle|\, h^{\textsc{v}}_t\right] = \bar{u}(i) \sum_{t=0}^{\ell} \frac{n_t - n_{t-1}}{n_\ell} \sum_{j=1}^{L} \frac{\bar{v}(h^{\textsc{v}}_t(j))}{L}$$

be the expected reward of row $i$ in the first $\ell + 1$ stages, where $C^{\textsc{u}}_{-1}(i, j) = 0$ and $n_{-1} = 0$; and let

$$\mathcal{E}^{\textsc{u}}_\ell = \left\{\forall i \in I_\ell: \bar{u}_\ell(i) \in [L^{\textsc{u}}_\ell(i), U^{\textsc{u}}_\ell(i)],\ \bar{u}_\ell(i) \ge \mu \bar{u}(i)\right\}$$

be the event that for all remaining rows at the end of stage $\ell$, the confidence interval on the expected reward holds and that this reward is at least $\mu \bar{u}(i)$. Let $\bar{\mathcal{E}}^{\textsc{u}}_\ell$ be the complement of event $\mathcal{E}^{\textsc{u}}_\ell$. Let

$$\bar{v}_\ell(j) = \sum_{t=0}^{\ell} \mathbb{E}\left[\sum_{i=1}^{K} \frac{C^{\textsc{v}}_t(i, j) - C^{\textsc{v}}_{t-1}(i, j)}{n_\ell} \,\middle|\, h^{\textsc{u}}_t\right] = \bar{v}(j) \sum_{t=0}^{\ell} \frac{n_t - n_{t-1}}{n_\ell} \sum_{i=1}^{K} \frac{\bar{u}(h^{\textsc{u}}_t(i))}{K}$$

denote the expected reward of column $j$ in the first $\ell + 1$ stages, where $C^{\textsc{v}}_{-1}(i, j) = 0$; and let

$$\mathcal{E}^{\textsc{v}}_\ell = \left\{\forall j \in J_\ell: \bar{v}_\ell(j) \in [L^{\textsc{v}}_\ell(j), U^{\textsc{v}}_\ell(j)],\ \bar{v}_\ell(j) \ge \mu \bar{v}(j)\right\}$$

be the event that for all remaining columns at the end of stage $\ell$, the confidence interval on the expected reward holds and that this reward is at least $\mu \bar{v}(j)$. Let $\bar{\mathcal{E}}^{\textsc{v}}_\ell$ be the complement of event $\mathcal{E}^{\textsc{v}}_\ell$. Let $\mathcal{E}$ be the event that all events $\mathcal{E}^{\textsc{u}}_\ell$ and $\mathcal{E}^{\textsc{v}}_\ell$ happen; and $\bar{\mathcal{E}}$ be the complement of $\mathcal{E}$, the event that at least one of $\mathcal{E}^{\textsc{u}}_\ell$ and $\mathcal{E}^{\textsc{v}}_\ell$ does not happen. Then the expected $n$-step regret of Rank1Elim is bounded from above as

$$R(n) \le \mathbb{E}\left[\left(\sum_{\ell=0}^{n-1} \left(\sum_{i=1}^{K} R^{\textsc{u}}_\ell(i) + \sum_{j=1}^{L} R^{\textsc{v}}_\ell(j)\right)\right) \mathbb{1}\{\mathcal{E}\}\right] + n P(\bar{\mathcal{E}}) \le \sum_{i=1}^{K} \mathbb{E}\left[\sum_{\ell=0}^{n-1} R^{\textsc{u}}_\ell(i)\, \mathbb{1}\{\mathcal{E}\}\right] + \sum_{j=1}^{L} \mathbb{E}\left[\sum_{\ell=0}^{n-1} R^{\textsc{v}}_\ell(j)\, \mathbb{1}\{\mathcal{E}\}\right] + 2(K + L),$$

where the last inequality is from Lemma 1 in Appendix A.

Let $H_\ell = (I_\ell, J_\ell)$ be the rows and columns in stage $\ell$, and

$$\mathcal{F}_\ell = \left\{\forall i \in I_\ell,\ j \in J_\ell: \Delta^{\textsc{u}}_i \le \frac{2 \tilde{\Delta}_{\ell-1}}{\mu},\ \Delta^{\textsc{v}}_j \le \frac{2 \tilde{\Delta}_{\ell-1}}{\mu}\right\}$$

be the event that all rows and columns with “large gaps” are eliminated by the beginning of stage $\ell$. By Lemma 2 in Appendix A, event $\mathcal{E}$ causes event $\mathcal{F}_\ell$. Now note that the expected regret in stage $\ell$ is independent of $\mathcal{E}$ given $H_\ell$. Therefore, the regret can be further bounded as

$$R(n) \le \sum_{i=1}^{K} \mathbb{E}\left[\sum_{\ell=0}^{n-1} \mathbb{E}\left[R^{\textsc{u}}_\ell(i) \,\middle|\, H_\ell\right] \mathbb{1}\{\mathcal{F}_\ell\}\right] + \sum_{j=1}^{L} \mathbb{E}\left[\sum_{\ell=0}^{n-1} \mathbb{E}\left[R^{\textsc{v}}_\ell(j) \,\middle|\, H_\ell\right] \mathbb{1}\{\mathcal{F}_\ell\}\right] + 2(K + L). \qquad (5)$$

By Lemma 3 in Appendix A,

$$\mathbb{E}\left[\sum_{\ell=0}^{n-1} \mathbb{E}\left[R^{\textsc{u}}_\ell(i) \,\middle|\, H_\ell\right] \mathbb{1}\{\mathcal{F}_\ell\}\right] \le \frac{384}{\mu^2 \bar{\Delta}^{\textsc{u}}_i} \log n + 1, \qquad \mathbb{E}\left[\sum_{\ell=0}^{n-1} \mathbb{E}\left[R^{\textsc{v}}_\ell(j) \,\middle|\, H_\ell\right] \mathbb{1}\{\mathcal{F}_\ell\}\right] \le \frac{384}{\mu^2 \bar{\Delta}^{\textsc{v}}_j} \log n + 1,$$

for any row $i$ and column $j$. Finally, we apply the above upper bounds to (5) and get our main claim.

### 5.2 Lower Bound

We derive a gap-dependent lower bound on the family of rank-1 bandits where $P_{\textsc{u}}$ and $P_{\textsc{v}}$ are products of independent Bernoulli variables, which are parameterized by their means $\bar{u}$ and $\bar{v}$, respectively. The lower bound is derived for any uniformly efficient algorithm, which is any algorithm such that for any problem $(\bar{u}, \bar{v})$ and any $\alpha \in (0, 1)$, $R(n) = o(n^\alpha)$.

###### Theorem 2.

For any problem $(\bar{u}, \bar{v})$ with a unique best arm $(i^\ast, j^\ast)$ and any uniformly efficient algorithm whose regret is $R(n)$,

$$\liminf_{n \to \infty} \frac{R(n)}{\log n} \ge \sum_{i \in [K] \setminus \{i^\ast\}} \frac{\bar{u}(i^\ast)\bar{v}(j^\ast) - \bar{u}(i)\bar{v}(j^\ast)}{d(\bar{u}(i)\bar{v}(j^\ast),\, \bar{u}(i^\ast)\bar{v}(j^\ast))} + \sum_{j \in [L] \setminus \{j^\ast\}} \frac{\bar{u}(i^\ast)\bar{v}(j^\ast) - \bar{u}(i^\ast)\bar{v}(j)}{d(\bar{u}(i^\ast)\bar{v}(j),\, \bar{u}(i^\ast)\bar{v}(j^\ast))},$$

where $d(p, q)$ is the Kullback-Leibler (KL) divergence between Bernoulli random variables with means $p$ and $q$.

The lower bound involves two terms. The first term is the regret due to learning the optimal row $i^\ast$, while playing the optimal column $j^\ast$. The second term is the regret due to learning the optimal column $j^\ast$, while playing the optimal row $i^\ast$. We do not know whether this lower bound is tight. We discuss its tightness in Section 5.3.
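The right-hand side of Theorem 2 can be evaluated numerically for a concrete instance. The following sketch assumes Bernoulli means in $(0, 1)$ so that all KL divergences are finite; the instance means are illustrative:

```python
import math

def kl_bern(p, q):
    # KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q);
    # assumes 0 < p, q < 1 so that all logarithms are finite.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lower_bound_constant(u_bar, v_bar):
    # Sum of the two terms on the right-hand side of Theorem 2: the cost
    # of learning the optimal row while playing the optimal column, plus
    # the symmetric column term.
    i_s = max(range(len(u_bar)), key=lambda i: u_bar[i])
    j_s = max(range(len(v_bar)), key=lambda j: v_bar[j])
    w = u_bar[i_s] * v_bar[j_s]          # optimal expected reward
    c = sum((w - u_bar[i] * v_bar[j_s]) / kl_bern(u_bar[i] * v_bar[j_s], w)
            for i in range(len(u_bar)) if i != i_s)
    c += sum((w - u_bar[i_s] * v_bar[j]) / kl_bern(u_bar[i_s] * v_bar[j], w)
             for j in range(len(v_bar)) if j != j_s)
    return c

print(lower_bound_constant([0.9, 0.5], [0.8, 0.4]))
```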

###### Proof.

The proof is based on the change-of-measure techniques from Kaufmann et al. and Lagree et al., who ultimately build on Graves and Lai. Let

$$w^\ast(\bar{u}, \bar{v}) = \max_{(i, j) \in [K] \times [L]} \bar{u}(i)\bar{v}(j)$$

be the maximum reward in model $(\bar{u}, \bar{v})$. We consider the set of models where $\bar{u}(i^\ast)$ and $\bar{v}(j^\ast)$ remain the same, but the optimal arm changes,

$$B(\bar{u}, \bar{v}) = \left\{(\bar{u}', \bar{v}') \in [0, 1]^K \times [0, 1]^L :\ \bar{u}(i^\ast) = \bar{u}'(i^\ast),\ \bar{v}(j^\ast) = \bar{v}'(j^\ast),\ w^\ast(\bar{u}, \bar{v}) < w^\ast(\bar{u}', \bar{v}')\right\}.$$

By Theorem 17 of Kaufmann et al.,

$$\liminf_{n \to \infty} \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\mathbb{E}[T_n(i, j)]\, d(\bar{u}(i)\bar{v}(j),\, \bar{u}'(i)\bar{v}'(j))}{\log n} \ge 1$$

for any $(\bar{u}', \bar{v}') \in B(\bar{u}, \bar{v})$, where $\mathbb{E}[T_n(i, j)]$ is the expected number of times that arm $(i, j)$ is chosen in $n$ steps in problem $(\bar{u}, \bar{v})$. From this and the regret decomposition

$$R(n) = \sum_{i=1}^{K} \sum_{j=1}^{L} \mathbb{E}[T_n(i, j)]\, (\bar{u}(i^\ast)\bar{v}(j^\ast) - \bar{u}(i)\bar{v}(j)),$$

we get that

$$\liminf_{n \to \infty} \frac{R(n)}{\log n} \ge f(\bar{u}, \bar{v}),$$

where

$$f(\bar{u}, \bar{v}) = \inf_{c \in \Theta} \sum_{i=1}^{K} \sum_{j=1}^{L} (\bar{u}(i^\ast)\bar{v}(j^\ast) - \bar{u}(i)\bar{v}(j))\, c_{i,j} \quad \text{s.t.} \quad \forall (\bar{u}', \bar{v}') \in B(\bar{u}, \bar{v}):\ \sum_{i=1}^{K} \sum_{j=1}^{L} d(\bar{u}(i)\bar{v}(j),\, \bar{u}'(i)\bar{v}'(j))\, c_{i,j} \ge 1$$

and $\Theta = [0, \infty)^{K \times L}$. To obtain our lower bound, we carefully relax the constraints of the above problem, so that we do not lose much in the bound. The details are presented in Appendix B. In the relaxed problem, only the entries in the optimal row $i^\ast$ and column $j^\ast$ are non-zero, as in Combes et al., and they are

$$c^\ast_{i,j} = \begin{cases} 1 / d(\bar{u}(i)\bar{v}(j^\ast),\, \bar{u}(i^\ast)\bar{v}(j^\ast)), & j = j^\ast,\ i \neq i^\ast; \\ 1 / d(\bar{u}(i^\ast)\bar{v}(j),\, \bar{u}(i^\ast)\bar{v}(j^\ast)), & i = i^\ast,\ j \neq j^\ast; \\ 0, & \text{otherwise}. \end{cases}$$

Now we substitute $c^\ast$ into the objective of the above problem and get our lower bound.

### 5.3 Discussion

We derive a gap-dependent upper bound on the $n$-step regret of Rank1Elim in Theorem 1, which is

$$O((K + L)(1/\mu^2)(1/\Delta) \log n),$$

where $K$ denotes the number of rows, $L$ denotes the number of columns, $\Delta$ is the minimum of the row and column gaps in (3), and $\mu$ is the minimum of the average of entries in $\bar{u}$ and $\bar{v}$, as defined in (4).

We argue that our upper bound is nearly tight on the following class of problems. The $i$-th entry of $u_t$, $u_t(i)$, is an independent Bernoulli variable with mean

$$\bar{u}(i) = p_{\textsc{u}} + \Delta_{\textsc{u}} \mathbb{1}\{i = 1\}$$

for some $p_{\textsc{u}} \in [0, 1)$ and row gap $\Delta_{\textsc{u}} \in (0, 1 - p_{\textsc{u}}]$. The $j$-th entry of $v_t$, $v_t(j)$, is an independent Bernoulli variable with mean

$$\bar{v}(j) = p_{\textsc{v}} + \Delta_{\textsc{v}} \mathbb{1}\{j = 1\}$$

for some $p_{\textsc{v}} \in [0, 1)$ and column gap $\Delta_{\textsc{v}} \in (0, 1 - p_{\textsc{v}}]$. Note that the optimal arm is $(1, 1)$ and that the expected reward for choosing it is $(p_{\textsc{u}} + \Delta_{\textsc{u}})(p_{\textsc{v}} + \Delta_{\textsc{v}})$. We refer to the instance of this problem by its parameters $K$, $L$, $p_{\textsc{u}}$, $p_{\textsc{v}}$, $\Delta_{\textsc{u}}$, and $\Delta_{\textsc{v}}$.
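The class of instances above is straightforward to construct. In this sketch, `p_u`, `p_v`, `d_u`, and `d_v` play the roles of $p_{\textsc{u}}$, $p_{\textsc{v}}$, $\Delta_{\textsc{u}}$, and $\Delta_{\textsc{v}}$; indices are zero-based, so the spiked arm is index 0 rather than 1:

```python
import numpy as np

def spike_instance(K, L, p_u, p_v, d_u, d_v):
    # One spiked row and one spiked column on top of flat baselines.
    u_bar = np.full(K, p_u)
    u_bar[0] += d_u            # spiked row (index 0 here, i = 1 in the text)
    v_bar = np.full(L, p_v)
    v_bar[0] += d_v            # spiked column
    return u_bar, v_bar

u_bar, v_bar = spike_instance(K=4, L=6, p_u=0.5, p_v=0.5, d_u=0.2, d_v=0.1)
# The optimal arm is the spiked pair, with expected reward (p_u + d_u)(p_v + d_v).
assert (int(np.argmax(u_bar)), int(np.argmax(v_bar))) == (0, 0)
assert np.isclose(u_bar[0] * v_bar[0], 0.7 * 0.6)
# mu in (4) stays close to the baselines, so the 1/mu^2 factor is a constant.
mu = min(u_bar.mean(), v_bar.mean())
assert mu >= 0.5
```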

Suppose that $p_{\textsc{u}}$ and $p_{\textsc{v}}$ are bounded away from zero by a constant. Then the upper bound in Theorem 1 is

$$O([K(1/\Delta_{\textsc{u}}) + L(1/\Delta_{\textsc{v}})] \log n)$$

since $\mu \ge \min\{p_{\textsc{u}}, p_{\textsc{v}}\} = \Omega(1)$. On the other hand, the lower bound in Theorem 2 is

$$\Omega([K(1/\Delta_{\textsc{u}}) + L(1/\Delta_{\textsc{v}})] \log n)$$

since the numerators in Theorem 2 are $\Theta(\Delta_{\textsc{u}})$ and $\Theta(\Delta_{\textsc{v}})$, and the Bernoulli KL divergences in the denominators are $O(\Delta_{\textsc{u}}^2)$ and $O(\Delta_{\textsc{v}}^2)$ when the means are bounded away from $0$ and $1$. Note that the bounds match in $K$, $L$, the gaps, and $n$.

We conclude with the observation that Rank1Elim is suboptimal in problems where $\mu$ in (4) is small. In particular, consider the above problem, and choose and