We study the problem of finding the maximum entry of a stochastic rank-$1$ matrix from noisy and adaptively-chosen observations. This problem is motivated by two applications: ranking in the position-based model and online advertising.
The position-based model (PBM) is one of the most fundamental click models, a model of how people click on a list of $K$ items out of $L$. This model is defined as follows. Each item is associated with its attraction and each position in the list is associated with its examination. The attraction of any item and the examination of any position are i.i.d. Bernoulli random variables. The item in the list is clicked only if it is attractive and its position is examined. Under these assumptions, the pair of item and position that maximizes the probability of being clicked is the maximum entry of a rank-$1$ matrix, which is the outer product of the attraction probabilities of the items and the examination probabilities of the positions.
As another example, consider a marketer of a product who has two sets of actions, population segments and marketing channels. Given a product, some segments are easier to market to and some channels are more appropriate. Now suppose that a conversion happens only if both actions are successful and that the successes of these actions are independent. Then, similarly to our earlier example, the pair of population segment and marketing channel that maximizes the conversion rate is the maximum entry of a rank-$1$ matrix.
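Both motivating examples reduce to the same computation: since the matrix of pair success probabilities is an outer product of two non-negative vectors, its maximum entry sits at the pair of individually-best indices. A minimal sketch, with made-up probabilities:

```python
# Maximum entry of a rank-1 matrix M = u v^T, where u and v hold success
# probabilities (attraction/examination, or segment/channel success rates).
# All numbers below are illustrative, not from the paper.
u = [0.2, 0.7, 0.5]   # e.g., attraction probabilities of items
v = [0.9, 0.4]        # e.g., examination probabilities of positions

# Build the rank-1 matrix of pair success probabilities.
M = [[ui * vj for vj in v] for ui in u]

# The maximum entry is at (argmax u, argmax v), because the entries are
# products of non-negative factors.
best = max((M[i][j], i, j) for i in range(len(u)) for j in range(len(v)))
i_star = max(range(len(u)), key=lambda i: u[i])
j_star = max(range(len(v)), key=lambda j: v[j])
assert (best[1], best[2]) == (i_star, j_star)
```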
We propose an online learning model for solving our motivating problems, which we call a stochastic rank-$1$ bandit. The learning agent interacts with our problem as follows. At time $t$, the agent selects a pair of row and column arms, and receives the product of their individual values as a reward. The values are stochastic, drawn independently, and not observed. The goal of the agent is to maximize its expected cumulative reward, or equivalently to minimize its expected cumulative regret with respect to the optimal solution, the most rewarding pair of row and column arms.
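This interaction can be simulated in a few lines. The sketch below is hypothetical (the function name, Bernoulli-distributed values, and all parameters are illustrative assumptions, not part of the paper's setup beyond what the model states):

```python
import random

def step(u_bar, v_bar, i, j, rng):
    """One round of a stochastic rank-1 bandit: draw u_t and v_t
    independently, return the product reward for arm (i, j).
    Note: the individual values u_t(i) and v_t(j) are NOT revealed."""
    u_t = [float(rng.random() < p) for p in u_bar]  # Bernoulli draws
    v_t = [float(rng.random() < q) for q in v_bar]
    return u_t[i] * v_t[j]

rng = random.Random(0)
u_bar, v_bar = [0.2, 0.8], [0.5, 0.9]
# The average reward of arm (1, 1) concentrates around 0.8 * 0.9 = 0.72.
mean = sum(step(u_bar, v_bar, 1, 1, rng) for _ in range(20000)) / 20000
assert abs(mean - 0.72) < 0.02
```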
We make five contributions. First, we precisely formulate the online learning problem of stochastic rank-$1$ bandits. Second, we design an elimination algorithm for solving it, which we call Rank1Elim. The key idea in Rank1Elim is to explore all remaining rows and columns randomly over all remaining columns and rows, respectively, to estimate their expected rewards; and then eliminate those rows and columns that seem suboptimal. This algorithm is computationally efficient and easy to implement. Third, we derive a gap-dependent upper bound on its $n$-step regret, where $K$ is the number of rows, $L$ is the number of columns, and $\Delta$ is the minimum of the row and column gaps, under the assumption that the mean row and column rewards are bounded away from zero. Fourth, we derive a nearly matching gap-dependent lower bound. Finally, we evaluate our algorithm empirically. In particular, we validate the scaling of its regret, compare it to multiple baselines, and show that it can learn near-optimal solutions even if our modeling assumptions are mildly violated.
We denote random variables by boldface letters and define $[n] = \{1, \dots, n\}$. For any sets $A$ and $B$, we denote by $A^B$ the set of all vectors whose entries are indexed by $B$ and take values from $A$.
We formulate our online learning problem as a stochastic rank-$1$ bandit. An instance of this problem is defined by a tuple $(K, L, P_u, P_v)$, where $K$ is the number of rows, $L$ is the number of columns, $P_u$ is a probability distribution over a unit hypercube $[0, 1]^K$, and $P_v$ is a probability distribution over a unit hypercube $[0, 1]^L$.
Let $(u_t)_{t=1}^n$ be an i.i.d. sequence of vectors drawn from distribution $P_u$ and $(v_t)_{t=1}^n$ be an i.i.d. sequence of vectors drawn from distribution $P_v$, such that $u_t$ and $v_t$ are drawn independently at any time $t$. The learning agent interacts with our problem as follows. At time $t$, it chooses arm $(i_t, j_t) \in [K] \times [L]$ based on its history up to time $t$; and then observes $u_t(i_t) v_t(j_t)$, which is also its reward.
The goal of the agent is to maximize its expected cumulative reward in $n$ steps. This is equivalent to minimizing the expected cumulative regret in $n$ steps,
$R(n) = E[\sum_{t=1}^n R(i^*, j^*, i_t, j_t)]$,
where $R(i^*, j^*, i_t, j_t) = u_t(i^*) v_t(j^*) - u_t(i_t) v_t(j_t)$ is the instantaneous stochastic regret of the agent at time $t$ and
$(i^*, j^*) = \arg\max_{(i, j) \in [K] \times [L]} \bar{u}_i \bar{v}_j$
is the optimal solution in hindsight of knowing $\bar{u}$ and $\bar{v}$. Since $u_t$ and $v_t$ are drawn independently, and $u_t(i) \in [0, 1]$ for all $i \in [K]$ and $v_t(j) \in [0, 1]$ for all $j \in [L]$, we get that
$E[u_t(i) v_t(j)] = \bar{u}_i \bar{v}_j$
for any $(i, j) \in [K] \times [L]$, where $\bar{u} = E[u_t]$ and $\bar{v} = E[v_t]$. This is the key idea in our solution.
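A useful consequence of this factorization is that the ranking of rows by expected reward is the same in every column (and vice versa), so rows and columns can be compared separately. A quick numeric check with illustrative means:

```python
# The expected reward factorizes: E[u_t(i) v_t(j)] = u_bar[i] * v_bar[j].
# Consequence: the order of rows by expected reward is identical in every
# column, because the positive factor v_bar[j] cancels out of comparisons.
# The mean vectors below are illustrative.
u_bar = [0.3, 0.9, 0.6]
v_bar = [0.2, 0.7]

order_by_u = sorted(range(len(u_bar)), key=lambda i: u_bar[i])
for j in range(len(v_bar)):
    order_in_col_j = sorted(range(len(u_bar)),
                            key=lambda i: u_bar[i] * v_bar[j])
    assert order_in_col_j == order_by_u
```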
Note that the problem of learning $\bar{u}$ and $\bar{v}$ from stochastic observations is a special case of matrix completion from noisy observations. This problem is harder than that of learning $(i^*, j^*)$. In particular, the most popular approach to matrix completion is alternating minimization of a non-convex function, where the observations are corrupted with Gaussian noise. In contrast, our proposed algorithm is guaranteed to learn the optimal solution with a high probability, and does not make any strong assumptions on $P_u$ and $P_v$.
3 Naive Solutions
Our learning problem is a $K L$-arm bandit with $K + L$ parameters, $\bar{u}$ and $\bar{v}$. The main challenge is to leverage this structure to learn efficiently. In this section, we discuss the challenges of solving our problem by existing algorithms. We conclude that a new algorithm is necessary and present it in Section 4.
Any rank-$1$ bandit is a multi-armed bandit with $K L$ arms. As such, it can be solved by UCB1. The $n$-step regret of UCB1 in rank-$1$ bandits is $O((K L / \Delta) \log n)$. Therefore, UCB1 is impractical when both $K$ and $L$ are large.
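For concreteness, the naive flattening treats every (row, column) pair as an independent arm. Below is a minimal UCB1 sketch over all $K L$ pairs, ignoring the rank-1 structure; it is illustrative only, and all names and constants are assumptions, not the algorithm proposed in this paper:

```python
import math
import random

def ucb1_flat(u_bar, v_bar, n, rng):
    """UCB1 over all K*L (row, column) pairs, ignoring the rank-1
    structure. Returns the number of pulls per arm. Illustrative sketch."""
    arms = [(i, j) for i in range(len(u_bar)) for j in range(len(v_bar))]
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    for t in range(1, n + 1):
        def index(a):
            # Unplayed arms get infinite priority; otherwise the UCB index.
            if counts[a] == 0:
                return float("inf")
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        i, j = max(arms, key=index)
        # Bernoulli product reward, as in the rank-1 bandit model.
        r = float(rng.random() < u_bar[i]) * float(rng.random() < v_bar[j])
        counts[(i, j)] += 1
        sums[(i, j)] += r
    return counts

rng = random.Random(1)
counts = ucb1_flat([0.2, 0.9], [0.3, 0.8], 3000, rng)
# With large gaps, the optimal pair should be pulled most often.
assert max(counts, key=counts.get) == (1, 1)
```

The weakness discussed above is visible here: the per-arm bookkeeping grows as $K L$, and every pair must be explored separately.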
Note that $\log(u_t(i) v_t(j)) = \log u_t(i) + \log v_t(j)$ whenever $u_t(i), v_t(j) > 0$, so the problem can also be viewed as a linear bandit over log-rewards and solved by a linear bandit algorithm, such as linear Thompson sampling (LinTS). This approach is problematic for at least two reasons. First, the reward is not properly defined when either $u_t(i_t) = 0$ or $v_t(j_t) = 0$. Second, by Jensen's inequality, $E[\log(u_t(i) v_t(j))] \le \log E[u_t(i) v_t(j)]$, so maximizing the expected log-reward is not equivalent to maximizing the expected reward. Nevertheless, note that both sides of the above inequality have maxima at $(i^*, j^*)$, and therefore LinTS should perform well. We compare to it in Section 6.2.
Also note that $\bar{u}_i \bar{v}_j = \exp[\log \bar{u}_i + \log \bar{v}_j]$ for $\bar{u}_i, \bar{v}_j > 0$. Therefore, a rank-$1$ bandit can be viewed as a generalized linear bandit and solved by GLM-UCB, where the mean function is $\exp(\cdot)$ and the feature vector of arm $(i, j)$ is $x_{i, j}$ in (1). This approach is not practical for three reasons. First, the parameter space is unbounded, because $\log \bar{u}_i \to -\infty$ as $\bar{u}_i \to 0$ and $\log \bar{v}_j \to -\infty$ as $\bar{v}_j \to 0$. Second, the confidence intervals of GLM-UCB are scaled by the reciprocal of the minimum derivative of the mean function, which can be very large in our setting. In particular, the minimum derivative is $\min_{(i, j)} \bar{u}_i \bar{v}_j$. In addition, the gap-dependent upper bound on the regret of GLM-UCB scales much worse with these quantities than our upper bound in Theorem 1, which further indicates that GLM-UCB is not practical. Third, GLM-UCB needs to compute the maximum-likelihood estimates of $\bar{u}$ and $\bar{v}$ at each step, which is a non-convex optimization problem (Section 2).
Some variants of our problem can be solved trivially. For instance, let $u_t(i) = \bar{u}_i$ for all $i \in [K]$ and $v_t(j) = \bar{v}_j$ for all $j \in [L]$, so that the observations are deterministic given the chosen arm. Then $(i^*, j^*)$ can be identified from a small number of observations, and the learning problem does not seem more difficult than a stochastic combinatorial semi-bandit. We do not focus on such degenerate cases in this paper.
Our algorithm, Rank1Elim, is shown in Algorithm 1. It is an elimination algorithm, which maintains confidence intervals on the expected rewards of all rows and columns. Rank1Elim operates in stages, which quadruple in length. In each stage, it explores all remaining rows and columns randomly over all remaining columns and rows, respectively. At the end of the stage, it eliminates all rows and columns that cannot be optimal.
The eliminated rows and columns are tracked as follows. We denote by $h^u_\ell(i)$ the index of the most rewarding row whose expected reward is believed by Rank1Elim to be at least as high as that of row $i$ in stage $\ell$. Initially, $h^u_0(i) = i$. When row $i$ is eliminated by row $i'$ in stage $\ell$, $h^u_\ell(i)$ is set to $i'$; then when row $i'$ is eliminated by row $i''$ in a later stage $\ell'$, $h^u_{\ell'}(i)$ is set to $i''$; and so on. The corresponding column quantity, $h^v_\ell(j)$, is defined and updated analogously. The remaining rows and columns in stage $\ell$, $I_\ell$ and $J_\ell$, are then the unique values in $h^u_\ell$ and $h^v_\ell$, respectively; and we set these at the beginning of each stage of Algorithm 1.
Each stage of Algorithm 1 has two main steps: exploration and elimination. In the row exploration step, each row $i \in [K]$ is explored randomly over all remaining columns such that its expected reward up to stage $\ell$ is at least $\mu \bar{u}_i$, where $\mu$ is in (4). To guarantee this, we sample column $j$ randomly and then substitute it with column $h^v_\ell(j)$, which is at least as rewarding as column $j$. This substitution is critical to avoid a dependence on the reciprocal of the smallest entry of $\bar{v}$ in our regret bound, which can be large and is not necessary. The observations are stored in reward matrix $C^u$. As all rows are explored similarly, their expected rewards are scaled similarly, and this permits elimination. The column exploration step is analogous.
In the elimination step, the confidence intervals of all remaining rows, $[L^u_\ell(i), U^u_\ell(i)]$ for any $i \in I_\ell$, are estimated from matrix $C^u$; and the confidence intervals of all remaining columns, $[L^v_\ell(j), U^v_\ell(j)]$ for any $j \in J_\ell$, are estimated from $C^v$. This separation is needed to guarantee that the expected rewards of all remaining rows and columns are scaled similarly. The confidence intervals are designed such that
$U^u_\ell(i) < \max_{i' \in I_\ell} L^u_\ell(i')$
implies that row $i$ is suboptimal with a high probability for any column elimination policy up to the end of stage $\ell$, and
$U^v_\ell(j) < \max_{j' \in J_\ell} L^v_\ell(j')$
implies that column $j$ is suboptimal with a high probability for any row elimination policy up to the end of stage $\ell$. As a result, all suboptimal rows and columns are eliminated correctly with a high probability.
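The explore-then-eliminate loop described above can be sketched in a few dozen lines. This is a simplified illustration of the elimination idea only, not Algorithm 1 itself: the stage bookkeeping, the confidence-interval constants, and the environment are all illustrative assumptions:

```python
import math
import random

def rank1elim_sketch(env, K, L, n, rng):
    """Simplified sketch of elimination in a rank-1 bandit: each stage,
    explore remaining rows over random remaining columns (and vice versa),
    then drop rows/columns whose upper confidence bound falls below the
    best lower confidence bound. Stage lengths quadruple, as in the paper;
    the constants are illustrative. `env(i, j)` returns a reward in [0, 1].
    Returns the sets of surviving rows and columns."""
    rows, cols = list(range(K)), list(range(L))
    budget, t = 16, 0                       # pulls per row/column per stage
    while t < n and (len(rows) > 1 or len(cols) > 1):
        row_stats = {i: [0, 0.0] for i in rows}   # [pulls, reward sum]
        col_stats = {j: [0, 0.0] for j in cols}
        while t < n and any(s[0] < budget for s in row_stats.values()):
            for i in rows:                        # explore rows
                r = env(i, rng.choice(cols)); t += 1
                row_stats[i][0] += 1; row_stats[i][1] += r
            for j in cols:                        # explore columns
                r = env(rng.choice(rows), j); t += 1
                col_stats[j][0] += 1; col_stats[j][1] += r
        def conf(stats):
            # Hoeffding-style confidence intervals (illustrative radius).
            out = {}
            for k, (m, s) in stats.items():
                mean = s / max(m, 1)
                rad = math.sqrt(math.log(n) / (2 * max(m, 1)))
                out[k] = (mean - rad, mean + rad)
            return out
        cu, cv = conf(row_stats), conf(col_stats)
        best_low_row = max(lo for lo, hi in cu.values())
        best_low_col = max(lo for lo, hi in cv.values())
        rows = [i for i in rows if cu[i][1] >= best_low_row]
        cols = [j for j in cols if cv[j][1] >= best_low_col]
        budget *= 4                               # stages quadruple
    return rows, cols

rng = random.Random(2)
u_bar, v_bar = [0.1, 0.9, 0.2], [0.15, 0.85]
env = lambda i, j: float(rng.random() < u_bar[i]) * float(rng.random() < v_bar[j])
rows, cols = rank1elim_sketch(env, 3, 2, 50000, rng)
assert 1 in rows and 1 in cols   # the optimal row and column survive
```

Note how rows are only ever compared to rows (and columns to columns), which is what the factorization of the expected reward permits.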
This section has three subsections. In Section 5.1, we derive a gap-dependent upper bound on the $n$-step regret of Rank1Elim. In Section 5.2, we derive a gap-dependent lower bound that nearly matches our upper bound. In Section 5.3, we discuss the results of our analysis.
5.1 Upper Bound
The hardness of our learning problem is measured by two sets of metrics. The first metrics are gaps. The gaps of row $i$ and column $j$ are defined as
$\Delta^u_i = \bar{u}_{i^*} - \bar{u}_i, \qquad \Delta^v_j = \bar{v}_{j^*} - \bar{v}_j,$
respectively; and the minimum row and column gaps are defined as
$\Delta^u_{\min} = \min_{i : \Delta^u_i > 0} \Delta^u_i, \qquad \Delta^v_{\min} = \min_{j : \Delta^v_j > 0} \Delta^v_j,$
respectively. Roughly speaking, the smaller the gaps, the harder the problem. The second metric is the minimum of the average of entries in $\bar{u}$ and $\bar{v}$, which is defined as
$\mu = \min\left\{ \frac{1}{K} \sum_{i=1}^K \bar{u}_i, \; \frac{1}{L} \sum_{j=1}^L \bar{v}_j \right\}.$
The smaller the value of $\mu$, the harder the problem. This quantity appears in our regret bound due to the averaging character of Rank1Elim (Section 4). Our upper bound on the regret of Rank1Elim is stated and proved below.
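These hardness metrics can be computed directly from the mean vectors. A small illustration with made-up means (the variable names are ours, not the paper's):

```python
# Hardness metrics of an illustrative rank-1 bandit instance:
# row/column gaps and mu, the minimum of the averages of u_bar and v_bar.
u_bar = [0.9, 0.6, 0.5]
v_bar = [0.8, 0.7]

gap_u = [max(u_bar) - x for x in u_bar]        # row gaps Delta^u_i
gap_v = [max(v_bar) - x for x in v_bar]        # column gaps Delta^v_j
min_gap_u = min(g for g in gap_u if g > 0)     # minimum positive row gap
min_gap_v = min(g for g in gap_v if g > 0)     # minimum positive column gap
mu = min(sum(u_bar) / len(u_bar), sum(v_bar) / len(v_bar))

assert abs(min_gap_u - 0.3) < 1e-12
assert abs(min_gap_v - 0.1) < 1e-12
assert abs(mu - 2.0 / 3.0) < 1e-12             # avg of u_bar is the smaller
```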
The expected $n$-step regret of Rank1Elim is bounded as
$R(n) = O\left( (K + L) \frac{1}{\mu^2 \Delta} \log n \right),$
where $\Delta = \min\{\Delta^u_{\min}, \Delta^v_{\min}\}$.
The proof of Theorem 1 is organized as follows. First, we bound the probability that at least one confidence interval is violated. The corresponding regret is small. Second, by the design of Rank1Elim and because all confidence intervals hold, the expected reward of any row $i$ is at least $\mu \bar{u}_i$. Because all rows are explored in the same way, any suboptimal row $i$ is guaranteed to be eliminated after $O\!\left(\frac{1}{(\mu \Delta^u_i)^2} \log n\right)$ observations. Third, we factorize the regret due to exploring row $i$ into its row and column components, and bound both of them. This is possible because Rank1Elim eliminates rows and columns simultaneously. Finally, we sum up the regret of all explored rows and columns.
Note that the gaps in Theorem 1 are slightly different from those in (2). In particular, all zero row and column gaps in (2) are substituted with the minimum column and row gaps, respectively. The reason is that the regret due to exploring optimal rows and columns is positive until all suboptimal columns and rows are eliminated, respectively. The proof of Theorem 1 is below.
Let $R^u_\ell(i)$ and $R^v_\ell(j)$ be the stochastic regret associated with exploring row $i$ and column $j$, respectively, in stage $\ell$. Then the expected $n$-step regret of Rank1Elim is bounded as
$R(n) \le E\left[ \sum_\ell \left( \sum_i R^u_\ell(i) + \sum_j R^v_\ell(j) \right) \right],$
where the outer sum is over possibly $O(\log n)$ stages. Let
$\bar{u}_\ell(i)$ be the expected reward of row $i$ in the first $\ell$ stages; and let $E^u_\ell$ be the event that for all remaining rows $i \in I_\ell$ at the end of stage $\ell$, the confidence interval on the expected reward holds and that this reward is at least $\mu \bar{u}_i$. Let $\bar{E}^u_\ell$ be the complement of event $E^u_\ell$. Let $\bar{v}_\ell(j)$ denote the expected reward of column $j$ in the first $\ell$ stages; and let $E^v_\ell$ be the event that for all remaining columns $j \in J_\ell$ at the end of stage $\ell$, the confidence interval on the expected reward holds and that this reward is at least $\mu \bar{v}_j$. Let $\bar{E}^v_\ell$ be the complement of event $E^v_\ell$. Let $E$ be the event that all events $E^u_\ell$ and $E^v_\ell$ happen; and $\bar{E}$ be the complement of $E$, the event that at least one of $E^u_\ell$ and $E^v_\ell$ does not happen. Then the expected $n$-step regret of Rank1Elim is bounded from above as
$R(n) \le E\left[ \mathbb{1}\{E\} \sum_\ell \left( \sum_i R^u_\ell(i) + \sum_j R^v_\ell(j) \right) \right] + n P(\bar{E}).$
Let $I_\ell$ and $J_\ell$ be the rows and columns in stage $\ell$, and $F_\ell$ be the event that all rows and columns with "large gaps" are eliminated by the beginning of stage $\ell$. By Lemma 2 in Appendix A, event $E$ causes event $F_\ell$. Now note that the expected regret in stage $\ell$ is determined by the surviving rows and columns given $F_\ell$. Therefore, the regret of each stage can be further bounded in terms of the gaps of the surviving rows and columns, for any row $i$ and column $j$. Finally, we apply the above upper bounds to (5) and get our main claim.
5.2 Lower Bound
We derive a gap-dependent lower bound on the family of rank-$1$ bandits where $P_u$ and $P_v$ are products of independent Bernoulli variables, which are parameterized by their means $\bar{u}$ and $\bar{v}$, respectively. The lower bound is derived for any uniformly efficient algorithm, which is any algorithm such that for any problem instance in this family and any $\alpha \in (0, 1)$, $R(n) = o(n^\alpha)$.
For any problem in this family with a unique best arm $(i^*, j^*)$ and any uniformly efficient algorithm whose regret is $R(n)$,
$\liminf_{n \to \infty} \frac{R(n)}{\log n} \ge \sum_{i \ne i^*} \frac{\Delta^u_i \, \bar{v}_{j^*}}{KL(\bar{u}_i \bar{v}_{j^*}; \bar{u}_{i^*} \bar{v}_{j^*})} + \sum_{j \ne j^*} \frac{\bar{u}_{i^*} \, \Delta^v_j}{KL(\bar{u}_{i^*} \bar{v}_j; \bar{u}_{i^*} \bar{v}_{j^*})},$
where $KL(p; q)$ is the Kullback-Leibler (KL) divergence between Bernoulli random variables with means $p$ and $q$.
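The Bernoulli KL divergence in the denominator is a standard quantity; a small helper makes the scaling visible (for close means, $KL(p; q) \approx (p - q)^2 / (2 q (1 - q))$, so small gaps inflate the bound):

```python
import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for q in (0, 1)."""
    if p == 0:
        return math.log(1.0 / (1.0 - q))
    if p == 1:
        return math.log(1.0 / q)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

assert kl_bern(0.5, 0.5) == 0.0
# Closed form check at p = 0.4, q = 0.5.
assert abs(kl_bern(0.4, 0.5)
           - (0.4 * math.log(0.8) + 0.6 * math.log(1.2))) < 1e-12
```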
The lower bound involves two terms. The first term is the regret due to learning the optimal row $i^*$, while playing the optimal column $j^*$. The second term is the regret due to learning the optimal column $j^*$, while playing the optimal row $i^*$. We do not know whether this lower bound is tight. We discuss its tightness in Section 5.3.
Let $r^* = \bar{u}_{i^*} \bar{v}_{j^*}$ be the maximum reward in model $B$. We consider the set of models where $\bar{u}_{i^*}$ and $\bar{v}_{j^*}$ remain the same, but the optimal arm changes.
By Theorem 17 of Kaufmann et al.,
$\liminf_{n \to \infty} \frac{\sum_{(i, j)} E[T_n(i, j)] \, KL(\bar{u}_i \bar{v}_j; \bar{u}'_i \bar{v}'_j)}{\log n} \ge 1$
for any model $B'$ in this set, where $E[T_n(i, j)]$ is the expected number of times that arm $(i, j)$ is chosen in $n$ steps in problem $B$. From this and the regret decomposition
$R(n) = \sum_{(i, j) \ne (i^*, j^*)} (r^* - \bar{u}_i \bar{v}_j) \, E[T_n(i, j)],$
we get that the asymptotic regret is bounded from below by the value of an optimization problem over the variables $E[T_n(i, j)] / \log n$. To obtain our lower bound, we carefully relax the constraints of the above problem, so that we do not lose much in the bound. The details are presented in Appendix B. In the relaxed problem, only entries in the optimal solution's row and column are non-zero, as in Combes et al., and their values follow from the per-arm constraints. Now we substitute these values into the objective of the above problem and get our lower bound.
We derive a gap-dependent upper bound on the $n$-step regret of Rank1Elim in Theorem 1, which is $O\left( (K + L) \frac{1}{\mu^2 \Delta} \log n \right)$.
We argue that our upper bound is nearly tight on the following class of problems. The $i$-th entry of $u_t$, $u_t(i)$, is an independent Bernoulli variable with mean
$\bar{u}_i = p + \Delta^u \mathbb{1}\{i = 1\}$
for some $p \in (0, 1)$ and row gap $\Delta^u \in (0, 1 - p)$. The $j$-th entry of $v_t$, $v_t(j)$, is an independent Bernoulli variable with mean
$\bar{v}_j = p + \Delta^v \mathbb{1}\{j = 1\}$
for column gap $\Delta^v \in (0, 1 - p)$. Note that the optimal arm is $(1, 1)$ and that the expected reward for choosing it is $(p + \Delta^u)(p + \Delta^v)$. We refer to the instance of this problem by $B(K, L, p, \Delta^u, \Delta^v)$; and parameterize it by $K$, $L$, $p$, $\Delta^u$, and $\Delta^v$.
Let $\Delta^u = \Delta^v = \Delta$ for some $\Delta > 0$. Then the upper bound in Theorem 1 is $O\left( (K + L) \frac{1}{p^2 \Delta} \log n \right)$, since $\mu = \Omega(p)$. On the other hand, the lower bound in Theorem 2 is $\Omega\left( (K + L) \frac{p}{\Delta} \log n \right)$, since $\Delta^u \bar{v}_{j^*} = \Theta(p \Delta)$ and the KL divergences in its denominators are $\Theta(\Delta^2)$. Note that the bounds match in $K$, $L$, the gaps, and $\log n$.
We conclude with the observation that Rank1Elim is suboptimal in problems where $\mu$ in (4) is small. In particular, consider the above problem, and choose