## 1 Introduction

Many interactive online information systems (search, recommendation) present rankings of a set of items to a stream of users in response to a specific query. As feedback, these systems often observe a click (or a tap) on one (or more) of these items. Such systems are considered good if users click on items close to the top of the retrieved ranked list, because this means they spent little time finding the information they sought (under the simplifying assumption that a typical user scans the list from top to bottom).

We model this as the following iterative game. There is a fixed set $V$ of $n$ objects.
For simplicity, we first describe the *single choice* setting in which for $t=1,\dots,T$, exactly one item $u_t$ from $V$ is chosen.
At each step $t$,
the system outputs a (randomized) ranking $\pi_t$ of the set, and then $u_t$ is revealed to it.
The system loses nothing if $u_t$ is the first element in $\pi_t$, a unit cost if $u_t$ is in the second position,
$2$ units if it is in the third position, and so on. The goal of the system is to minimize its total loss after
$T$ steps. (For simplicity we assume $T$ is known in this work.)
The expected loss of the system is (additively) compared against that of the best (in hindsight) single ranking played throughout.

More generally, nature can choose a subset $U_t$ of the item set $V$ per round. We view the set of chosen items in round $t$ as an indicator function $s_t : V \to \{0,1\}$ so that $s_t(u) = 1$ if and only if $u \in U_t$. The loss function now penalizes the algorithm by the sum, over the elements of $U_t$, of the positions of those elements in $\pi_t$.

We term such feedback *discrete choice*, thinking of the elements of the chosen subset as
items chosen by a user in an online system.
This paper studies online ranking over discrete choice problems, as well as over other more complex forms of feedback.
We derive both upper and lower regret bounds and improve on the state-of-the-art.
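The game described above can be illustrated with a small sketch. It is not taken from the paper: the helper names `position_loss` and `best_fixed_ranking_loss` are hypothetical, positions are taken to be 0-based (so that the top item costs nothing, as in the description above), and the best ranking in hindsight is found by brute force, which is only feasible for tiny item sets.

```python
from itertools import permutations

def position_loss(ranking, chosen):
    """Sum of the 0-based positions of the chosen items in `ranking`."""
    pos = {item: i for i, item in enumerate(ranking)}
    return sum(pos[item] for item in chosen)

def best_fixed_ranking_loss(items, history):
    """Total loss of the best single ranking in hindsight (brute force)."""
    return min(
        sum(position_loss(r, chosen) for chosen in history)
        for r in permutations(items)
    )

items = ["a", "b", "c"]
history = [{"a"}, {"c"}, {"a", "c"}, {"a"}]   # chosen subsets, one per round
played = [("b", "a", "c")] * len(history)      # a deliberately poor fixed strategy

total = sum(position_loss(r, ch) for r, ch in zip(played, history))
best = best_fixed_ranking_loss(items, history)
regret = total - best                           # additive regret of `played`
```

Here the regret is the additive gap between the total loss of the played rankings and that of the best single ranking, which is the quantity the bounds in this paper control in expectation.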

### 1.1 Main Results

For the discrete choice setting, we design an algorithm and derive bounds on its maximal expected regret as a function of and a uniform upper bound on . Our main result for discrete choice is given in Theorem 3.1 below. Essentially, we show an expected regret bound of . We argue in Theorem 3.3 that this bound is tight. The proofs of these theorems are given in Sections 6 and 7.

In Section 4 we compare our result to previous approaches. To the best of our knowledge, our bound is better than the two best previous approaches (which are incomparable): (1) we improve on the analysis of the Follow the Perturbed Leader (FPL) algorithm of Kalai & Vempala (2005) by a factor of , and (2) we improve on a more general algorithm for learning permutations by Helmbold & Warmuth (2009) by a factor of . It should be noted, however, that a more careful analysis of FPL results in regret bounds comparable with ours and, equivalently, in a faster learning rate than that guaranteed in Kalai & Vempala (2005). (This argument is explained in detail in Section 8.)

In Section 5, we show that using our techniques, the problem of online rank aggregation over the Spearman correlation measure, commonly used in nonparametric statistics Spearman (1904), also enjoys improved regret bounds. This connects our work to Yasutake et al. (2012) on a similar problem with respect to the Kendall-$\tau$ distance.

In the full version of this extended abstract we discuss a more general class of loss functions which assigns other importance weights to the various positions in the output ranking (other than the linear function defined above). The result and the proof idea are presented in Section 8.

### 1.2 Main Techniques

Our algorithm maintains a weight vector over the items, which is updated at each step after nature reveals the chosen subset. This weight vector is, in fact, a histogram counting the number of times each element has appeared so far. In the next round, the algorithm uses this weight vector as input to a noisy sorting procedure.^{1}

^{1}By this we mean a procedure that outputs a randomized ranking of an input set.

The main result of this work is that, as long as the output of the noisy sorting procedure satisfies a certain property (see Lemma 6.1), the algorithm has the desired regret bounds. Stated simply, this property ensures that for any fixed pair of items, the marginal distribution of the order between the two elements follows a multiplicative weight update scheme with respect to the weights of the two items. We show that two noisy sorting procedures, one a version of QuickSort and the other based on a statistical model for rank data by Plackett and Luce, satisfy this property. (We refer the reader to the book Marden (1995) for more details about the Plackett-Luce model in statistics.)
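A minimal sketch of a Plackett-Luce style noisy sort is given below. The paper's own procedure is not reproduced in this section, so the details are assumptions: items are scored by `exp(eta * weight)`, where `eta` stands in for the learning rate, and the function names are hypothetical. The helper `pairwise_marginal` enumerates all rankings to compute the exact probability that one item precedes another, which for a Plackett-Luce model depends only on the scores of those two items.

```python
import math
import random
from itertools import permutations

def plackett_luce_sample(weights, eta, rng=random):
    """Draw a ranking: repeatedly pick the next item with probability
    proportional to exp(eta * weight) among the items not yet placed."""
    remaining = list(weights)
    ranking = []
    while remaining:
        scores = [math.exp(eta * weights[u]) for u in remaining]
        pick = rng.choices(remaining, weights=scores, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

def plackett_luce_prob(ranking, weights, eta):
    """Exact probability that the sampler above outputs `ranking`."""
    prob, rest = 1.0, list(ranking)
    for u in ranking:
        total = sum(math.exp(eta * weights[x]) for x in rest)
        prob *= math.exp(eta * weights[u]) / total
        rest.remove(u)
    return prob

def pairwise_marginal(u, v, weights, eta):
    """Exact P(u is ranked before v), by enumerating all rankings."""
    return sum(
        plackett_luce_prob(r, weights, eta)
        for r in permutations(weights)
        if r.index(u) < r.index(v)
    )
```

Under these assumptions, `pairwise_marginal("a", "b", w, eta)` agrees with `exp(eta*w["a"]) / (exp(eta*w["a"]) + exp(eta*w["b"]))`, which is the two-action multiplicative weights form the paragraph above alludes to.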

## 2 Definitions and Problem Statement

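Let $V$ be a ground set of $n$ items. A ranking $\pi$ over $V$ is an injection $\pi : V \to [n]$, where $[n]$ denotes the set of $n$ available positions.
We let $S(V)$ denote the space of rankings over $V$.
The expression $\pi(v)$ for $v \in V$ is the *position* of $v$ in the ranking, where we think of lower positions as *more favorable*.
For distinct $u, v \in V$, we say that $u \prec_\pi v$ if $\pi(u) < \pi(v)$ (in words: $u$ *beats* $v$). We use $[u,v]_\pi$ as shorthand
for the indicator function of the predicate $u \prec_\pi v$.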

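At each step $t$ the algorithm outputs a ranking $\pi_t$ over $V$ and then observes a subset $U_t \subseteq V$, which we also denote by its indicator function $s_t$. The instantaneous loss incurred by the algorithm at step $t$ is

$$\ell(\pi_t, s_t) = \pi_t \cdot s_t = \sum_{u \in V} \pi_t(u)\, s_t(u), \tag{2.1}$$

namely, the dot product of $\pi_t$ and $s_t$, both viewed as vectors in $\mathbb{R}^n$.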
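Since in this work we are interested in bounding *additive* regret, we can equivalently work with any loss function that
differs from $\ell$ by a constant that may depend on $s_t$ (but not on $\pi_t$). This work will take advantage of this
fact and will use the following *pairwise* loss function, $\hat\ell$, defined as follows:

$$\hat\ell(\pi_t, s_t) = \sum_{u \neq v} [u,v]_{\pi_t}\,[s_t(v) - s_t(u)]_+, \tag{2.2}$$

where $[x]_+$ is $x$ if $x \ge 0$ and $0$ otherwise. In words, this will introduce a cost of $1$ whenever $s_t(v) = 1$, $s_t(u) = 0$, and the pair is misordered in the sense that $u \prec_{\pi_t} v$. A zero loss is incurred exactly if the algorithm places the elements in the preimage $s_t^{-1}(1)$ before the elements in $s_t^{-1}(0)$. It should be clear that for any $\pi$ and $s$, the losses $\ell(\pi, s)$ and $\hat\ell(\pi, s)$ differ by a number that depends on $s$ only. Slightly abusing notation, we define

$$\hat\ell(\pi, s, u, v) = [u,v]_\pi\,[s(v) - s(u)]_+ + [v,u]_\pi\,[s(u) - s(v)]_+$$

so that $\hat\ell(\pi, s)$ takes the form $\sum_{\{u,v\}} \hat\ell(\pi, s, u, v)$, the sum ranging over unordered pairs.^{2}

^{2}Note that this expression makes sense because $\hat\ell(\pi, s, u, v)$ is symmetric in its last two arguments.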
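The claim that the linear loss and the pairwise loss differ by a quantity depending only on the chosen set can be checked by brute force on a small instance. The sketch below is illustrative only: it assumes 0-based positions and binary feedback, and the function names are hypothetical.

```python
from itertools import permutations

def linear_loss(ranking, chosen):
    """Sum of the 0-based positions of the chosen items (assumed convention)."""
    pos = {u: i for i, u in enumerate(ranking)}
    return sum(pos[u] for u in chosen)

def pairwise_loss(ranking, chosen):
    """Number of pairs in which an unchosen item is ranked ahead of a chosen one."""
    loss = 0
    for i, u in enumerate(ranking):
        for v in ranking[i + 1:]:
            if u not in chosen and v in chosen:
                loss += 1
    return loss

def loss_gaps(items, chosen):
    """Set of values taken by (linear - pairwise) over all rankings of `items`."""
    return {
        linear_loss(r, chosen) - pairwise_loss(r, chosen)
        for r in permutations(items)
    }
```

With three chosen items out of five, `loss_gaps` contains the single value 3, the number of pairs of chosen items; with one chosen item the gap is 0, so under this convention the two losses coincide in the single choice case.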

Over a horizon of $T$ steps, the algorithm’s total loss is $\sum_{t=1}^{T} \ell(\pi_t, s_t)$.
We will compare the expected total loss of our algorithm with that of $\pi^*$, where $\pi^* \in \operatorname{argmin}_{\pi} \sum_{t=1}^{T} \ell(\pi, s_t)$.^{3}

^{3}We slightly abuse notation by thinking of $\pi^*$ both as a ranking and as an algorithm that outputs the same ranking at each step.

Thinking of the aforementioned applications, we say that $u$ is *chosen* at step $t$ if and only if $s_t(u) = 1$.
If exactly one item is chosen at each step we say that we are in the *single choice*
setting. If at most $k$ items are chosen we say that we are in the $k$-choice model.
Note that in the single choice case, the two instantaneous losses $\ell$ and $\hat\ell$ are identical at each time step.

We will need an invariant which measures a form of complexity of the value functions , given as

(2.3) |

Note that since is a binary function, this is also equivalent to , namely, the maximal loss of any ranking at any time step. (Later in the discussion we will study non-binary , where this will not hold). In fact, we need an upper bound on , which (abusing notation) we will also denote by . In the most general case, can be taken as (achieved if exactly half of the elements are chosen). In the single choice case, can be taken as . In the -choice case, can be taken as . (We will assume always that .)
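For binary feedback, the remark above says the invariant coincides with the maximal loss of any ranking in a single step. As a hedged illustration (using the pairwise loss, in which a cost is paid for every unchosen item ranked ahead of a chosen one), the worst ranking for a round with `k` chosen items out of `n` puts all unchosen items first and costs `k * (n - k)`. The function names below are hypothetical.

```python
def max_pairwise_loss(n, k):
    """Worst-case pairwise loss of one round with k chosen items out of n:
    every unchosen item is ranked ahead of every chosen one."""
    return k * (n - k)

def invariant_bound(n, max_chosen):
    """Largest single-round pairwise loss when at most `max_chosen` items are chosen."""
    return max(max_pairwise_loss(n, k) for k in range(max_chosen + 1))
```

For `n = 10` this gives 9 in the single choice case, 21 when at most 3 items are chosen, and 25 when any number may be chosen, the last value being attained when exactly half of the elements are chosen.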

## 3 The Algorithm and its Guarantee for Discrete Choice

Our algorithm (Algorithm 1) takes as input the ground set $V$, a learning rate parameter $\eta$, a reference to a randomized sorting procedure and a time horizon $T$. We present two possible randomized sorting procedures, one based on QuickSort (Algorithm 2) and one based on the Plackett-Luce model (Algorithm 3). Both options satisfy an important property, described below in Lemma 6.1. Our main result for discrete choice is as follows.

###### Theorem 3.1.

Assume the time horizon is at least . If is run with either or and with , then

(3.1) |

Additionally, the running time per step is .

The proof of the theorem is deferred to Section 6. We present a useful corollary for the cases of interest.

###### Corollary 3.2.

We also have the following lower bound.

###### Theorem 3.3.

There exists an integer and some function such that for all and , for any algorithm, the minimax expected total regret in the single choice case after steps is at least .

Note that we did not make an effort to bound the function in the theorem, which relies on weak convergence properties guaranteed by the central limit theorem. Better bounds could be derived by considering tight convergence rates of binomial distributions to the normal distribution. We leave this to future work.
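The overall scheme described at the start of this section (a histogram of past choices fed to a randomized sorting procedure) can be sketched as follows. This is not the paper's pseudocode, which does not appear in this text. It assumes a QuickSort variant in which an item `u` goes ahead of the pivot `p` with probability `exp(eta*w[u]) / (exp(eta*w[u]) + exp(eta*w[p]))`, 0-based positions for the loss, and hypothetical function names.

```python
import math
import random

def noisy_quicksort(items, weights, eta, rng):
    """Randomized QuickSort: a uniformly random pivot p is drawn, and every other
    item u goes ahead of p with probability
    e^{eta*w(u)} / (e^{eta*w(u)} + e^{eta*w(p)})."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    left, right = [], []
    for u in items:
        if u == pivot:
            continue
        # Same probability as above, written as a logistic function.
        # math.exp can overflow for very large eta * weight gaps; fine for a sketch.
        p_before = 1.0 / (1.0 + math.exp(eta * (weights[pivot] - weights[u])))
        (left if rng.random() < p_before else right).append(u)
    return (noisy_quicksort(left, weights, eta, rng) + [pivot]
            + noisy_quicksort(right, weights, eta, rng))

def online_rank(items, eta, feedback, seed=0):
    """Play one ranking per round, then add the revealed subset to the histogram."""
    rng = random.Random(seed)
    weights = {u: 0 for u in items}
    total_loss = 0
    for chosen in feedback:
        ranking = noisy_quicksort(list(items), weights, eta, rng)
        pos = {u: i for i, u in enumerate(ranking)}
        total_loss += sum(pos[u] for u in chosen)   # 0-based positions
        for u in chosen:
            weights[u] += 1
    return total_loss, weights
```

A call such as `online_rank("abcd", 0.5, [{"a"}] * 20)` returns the accumulated loss and the final histogram; as the count of `"a"` grows, it is placed first with increasing probability.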

## 4 Comparison With Previous Work

There has been much work on online ranking with various types of feedback and loss functions. We are not aware of work that studies the exact setting here.

Yasutake et al. (2012) consider online learning for *rank aggregation*, where at
each step nature chooses a permutation, and the algorithm incurs a loss equal to the Kendall-$\tau$ distance between its output ranking and that permutation.
Optimizing this loss summed over all steps is NP-Hard even in the offline setting Dwork et al. (2001), while our problem, as we shall shortly see, is easy to solve offline. Additionally, our problem is different and is not simply an easy instance of the problem of Yasutake et al. (2012).

A naïve, obvious approach to the problem of predicting rankings, which we state for the sake of self-containment, is to view each permutation as one of $n!$ actions, and to “track” the best permutation using a standard Multiplicative Weight (MW) update. Such schemes Freund & Schapire (1995); Littlestone & Warmuth (1994) guarantee an expected regret bound of . The guarantee of Theorem 3.1 is better by at least a factor of in the general case, in the $k$-choice case and in the single choice case. The distribution arising in the MW scheme would assign a probability proportional to for any ranking at time , and for some learning rate . This distribution is equivalent to neither of our two sorting procedures, and it is not clear how to efficiently draw from it for large $n$.

### 4.1 A Direct Online Linear Optimization View

Our problem easily lends itself to online linear optimization Kalai & Vempala (2005) over a discrete subset of a real vector space. In fact, there are multiple ways for doing this.

The loss $\ell$, as defined in Section 2, is a linear function of the ranking, viewed as a vector of positions. This vector can take any vertex in the *permutahedron*, equivalently, the set of vectors with distinct coordinates over the set of positions. It is easy to see that for any real vector $s$, minimizing the dot product of a ranking with $s$ is done by ordering the elements in decreasing $s$-value and assigning positions accordingly.

The highly influential paper of Kalai & Vempala (2005) suggests Follow the Perturbed Leader (FPL) as a general approach for solving such online linear optimization problems. The bound derived there yields an expected regret bound of for our problem. This bound is comparable to ours for the single choice case, is worse by a factor of in the $k$-choice case and by a factor of in the general case.

To see how the bound is derived, we remind the reader of how FPL works: at time $t$, count for each element the number of earlier steps in which it was chosen (the number of appearances of the element in the current history). The algorithm then outputs the permutation ordering the elements in decreasing order of perturbed count, where the perturbation added to each element's count is an iid real random variable drawn from an “uncertainty” distribution with a shape parameter that is controlled by a chosen learning rate, determined by the algorithm. One version of FPL in Kalai & Vempala (2005) considers an uncertainty distribution which is uniform in the interval $[0, 1/\epsilon]$ for a shape parameter $\epsilon$. The analysis there guarantees an expected regret of $2\sqrt{DRAT}$ as long as $\epsilon$ is taken as $\sqrt{D/(RAT)}$, where (here) $D$ is the diameter of the permutahedron in $\ell_1$ sense, $R$ is defined as the maximal per-step loss and $A$ is the maximal $\ell_1$ norm of the indicator vectors. A quick calculation shows that we have, for the $k$-choice case, , , , giving the stated bound.

As mentioned in the introduction, however, it seems that this suboptimal bound is due to the fact that the analysis of FPL should be done more carefully, taking advantage of the structure of rankings and of the loss functions we consider. We further elaborate on this in Section 8.
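The FPL step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the exact procedure analyzed in Kalai & Vempala (2005): the parameter `inv_eps` plays the role of the interval length $1/\epsilon$, positions are 0-based, and the function names are hypothetical. With zero noise the routine reduces to "follow the leader", that is, sorting by decreasing count, which is the offline minimizer mentioned at the start of this subsection.

```python
import random
from itertools import permutations

def fpl_ranking(counts, inv_eps, rng):
    """Rank items by decreasing perturbed count; each perturbation is an
    independent Uniform[0, inv_eps] draw (inv_eps stands in for 1/epsilon)."""
    perturbed = {u: c + rng.uniform(0.0, inv_eps) for u, c in counts.items()}
    return sorted(counts, key=lambda u: -perturbed[u])

def linear_cost(ranking, counts):
    """Cumulative loss of a fixed ranking: sum of (0-based position) * count."""
    return sum(i * counts[u] for i, u in enumerate(ranking))

counts = {"a": 5, "b": 1, "c": 3}
leader = fpl_ranking(counts, 0.0, random.Random(0))      # no noise: pure leader
noisy = fpl_ranking(counts, 4.0, random.Random(0))       # perturbed leader
```

Comparing `linear_cost(leader, counts)` against a brute-force minimum over all rankings confirms that sorting by decreasing count is optimal offline for this small instance.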

Very recently, Suehiro et al. (2012) considered a similar problem, in a setting in which the loss vector can be assumed to be anything with coordinates bounded by . In particular, that result applies to the case in which is binary. They obtain the same expected regret bound, but with a per-step time complexity of , which is worse than our . Their analysis takes advantage of the fact that optimization over the permutahedron can be viewed as a prediction problem under submodular constraints.

Continuing our comparison to previous results, Dani et al. (2007) provide, for online linear optimization problems, a regret bound of

(4.1) |

where is the ambient dimension of the set . Clearly , hence this bound is worse than ours by a factor of in the single choice case and in the -choice case.

A less efficient embedding can be done in using the Birkhoff-von Neumann embedding, as follows. Given , we define the matrix by

. For an indicator function we define the embedding by . It is clear that defined above is equivalently given by . Using the analysis of FPL Kalai & Vempala (2005) gives an expected regret bound of in the single choice case and in the -choice case, which is worse than our bounds by at least a factor of and , respectively.

Another recent work that studied linear optimization over cost functions of the form for general cost
matrices is that of Helmbold & Warmuth (2009). The expected regret
bound for that algorithm in our case is (assuming there is no prior upper bound on the total
optimal loss).^{4} This is worse by a factor of than our bounds.

^{4}Note that one needs to carefully rescale the bounds to obtain a correct comparison with Helmbold & Warmuth (2009). Also, the variable there, upper bounding the highest possible optimal loss, is computed by assuming all elements are chosen exactly times.

#### Comparison of the Single Choice Case to Previous Algorithms for the Bandit Setting

It is worth noting that in the single choice case, given the ranking played at a step and the loss incurred at that step, it is possible to recover the chosen item exactly.
This means that we can study the game in the single choice case in the so-called *bandit setting*, where
the algorithm only observes the loss at each step.^{5}
This allows us to compare our algorithm’s regret guarantees to those of algorithms for online linear
optimization in the bandit setting.

^{5}Note that generally the bandit setting is more difficult than the full-information setting, where the losses of all actions are known to the algorithm. The fact that the two are equivalent in the single choice case is a special property of the problem.
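The recovery argument can be made concrete with a short sketch. It assumes the single choice loss is the 0-based position of the chosen item in the played ranking, consistent with the cost description in the introduction; the function names are hypothetical.

```python
def single_choice_loss(ranking, chosen_item):
    """Loss in the single choice case: 0-based position of the chosen item."""
    return ranking.index(chosen_item)

def recover_chosen_item(ranking, observed_loss):
    """Invert the loss: the chosen item sits at the position given by the loss."""
    return ranking[observed_loss]

ranking = ["c", "a", "b"]
recovered = [
    recover_chosen_item(ranking, single_choice_loss(ranking, item))
    for item in ranking
]
```

Because the map from chosen item to loss is one-to-one for a fixed ranking, observing only the loss reveals the full feedback in this case.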

Cesa-Bianchi & Lugosi (2012) have studied the problem of optimizing
in the bandit setting,
where is the ranking embedding in defined above.
They build on the methodology of Dani et al. (2007).
They obtain an expected regret bound of , which is much worse
than the single choice bound in Corollary 3.2.^{6} Also, it is worth noting that the method for drawing a random ranking in each step in their
algorithm relies on the idea of approximating the permanent, which is much more complicated than the algorithms
presented in this work.

^{6}This is not explicitly stated in their work, and requires plugging in various calculations (which they provide) in the bound provided in their main theorem, in addition to scaling by .

Finally, we mention the online linear optimization approach in the bandit setting of Abernethy et al. (2008) in case the search is in a convex polytope. The expected regret for our problem in the single choice setting using their approach is , where is the ambient dimension of the polytope, and is a number that can be bounded by the number of its facets Hazan (2013). In the compact embedding (in ), and . In the embedding in , we have and . For both embeddings and for all cases we study, the bound is worse than ours.

#### Comparison of Lower Bounds

Our lower bound (Theorem 3.3) is a refinement of the lower bound in Helmbold & Warmuth (2009), because the lower bound there was derived for a larger class of loss functions. In fact, the method used there for deriving the lower bound could not be used here. Briefly explained, they reduce from simple online optimization over experts, each mapped to a ranking so that no two rankings share the same element in the same position. That technique cannot be used to derive lower bounds in our settings, because all such rankings would have the exact same loss.

## 5 Implications for Rank Aggregation

The (unnormalized) Spearman correlation between two rankings , as .

The corresponding *online rank aggregation* problem, closely related to that of Yasutake et al. (2012),
is defined as follows. A sequence of rankings is chosen in advance by the
adversary. At each time step, the algorithm outputs a ranking, and then the adversary's ranking for that step is revealed to it. The instantaneous loss
is defined as . The total loss is , and the goal is to minimize the expected regret, defined with respect to .^{7}

^{7}For the purpose of rank aggregation, the Spearman correlation is something that we would want to maximize. We prefer to keep the mindset of *loss minimization*, and hence work with instead.

Notice now that there was nothing in our analysis leading to Theorem 3.1 that required to be a binary
function. Indeed, if we identify , then the loss (2.1) is exactly . Additionally,
the pairwise loss (2.2) satisfies that for all and ,
,
where is a constant that depends on only. To see why, one trivially verifies that when
moving from to a ranking obtained from by swapping two *consecutive* elements,
the two differences and are equal. Hence again, we can consider
regret with respect to , instead of . The value of from (2.3) is clearly .
Hence, by an application of Theorem 3.1, we conclude the following bound for online rank aggregation over Spearman correlation:

###### Corollary 5.1.

Assume a time horizon larger than some global constant. If is run with either or , for for all and , then the expected regret is at most .

A similar comparison to previous approaches can be done for the rank aggregation problem, as we did in Section 4 for the cases of binary . Comparing with the direct analysis of FPL, the expected regret would be (using here). Comparing to Helmbold & Warmuth (2009), we again obtain here an improvement of .

## 6 Proof of Theorem 3.1

Let denote an optimal ranking of in hindsight. In order to analyze Algorithm 1 with both and , we start with a simple lemma.

###### Lemma 6.1.

The random ranking returned by satisfies that for any given pair of distinct elements , the probability of the event equals , for both and .

The proof for the QuickSort case uses techniques from, e.g., Ailon et al. (2008).

###### Proof.

For the case , the internal order between and can be determined in one of two ways. (i) The element (resp. ) is chosen as pivot in some recursive call, in which (resp. ) is part of the input. Denote this event . (ii) Some element is chosen as pivot in a recursive call in which both and are part of the input, and in this recursive call the elements and are separated (one goes to the left recursion, the other to the right one). Denote this event .

It is clear that the collection of events is a disjoint cover of the probability space of . If is the (random) output, then it is clear from the algorithm that

It is also clear, using Bayes rule, that for all ,

as required. For the case , for any subset containing and , let denote the event that, when the first of is chosen in Line 5, the value of (in the main loop) equals . It is clear that is a disjoint cover of the probability space of the algorithm. If now denotes the output of , then the proof is completed by noticing that for any , . ∎
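The conditioning step in case (ii) of the QuickSort argument can be written out explicitly. The following is a sketch under an assumption not visible in this text: that a comparison against a pivot $p$ places an element $u$ ahead of $p$ with probability $x_u/(x_u + x_p)$, where $x_u$ is a positive score attached to $u$ (for instance an exponential of its weight). The symbols $x_u$, $x_v$, $x_p$ are introduced here for illustration only.

```latex
% Event: pivot p separates u and v. Then u precedes v exactly when
% u goes ahead of p and v goes behind p.
\begin{align*}
\Pr[u \prec v \mid p \text{ separates } u, v]
  &= \frac{\frac{x_u}{x_u + x_p}\cdot\frac{x_p}{x_v + x_p}}
          {\frac{x_u}{x_u + x_p}\cdot\frac{x_p}{x_v + x_p}
           + \frac{x_v}{x_v + x_p}\cdot\frac{x_p}{x_u + x_p}} \\
  &= \frac{x_u x_p}{x_u x_p + x_v x_p}
   = \frac{x_u}{x_u + x_v}.
\end{align*}
```

The pivot's score cancels, so the conditional probability matches the one obtained in case (i), where $u$ and $v$ are compared directly; averaging over the disjoint cover then gives the same pairwise marginal.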

The conclusion from the lemma is, as we show now, that for each pair the algorithm plays a standard multiplicative update scheme over the set of two possible actions, namely and

. We now make this precise. For each ordered pair

of two distinct elements in , let . We also let . On one hand, we have(6.1) | |||||

On the other hand,

It is now easily verified that for any ,

(6.2) | |||||

## 7 Proof of Theorem 3.3

We provide a proof for the single choice case in this extended abstract, and include notes for the $k$-choice case within the proof. For the single choice case, recall that the linear loss and the pairwise loss are identical.

Fix and of size , and assume . Assume the adversary chooses the sequence of single elements so that each element is chosen independently and uniformly at random from . [For general , we will select subsets of size at each step, uniformly at random from the space of such subsets]. For each , let denote the frequency of in the sequence, namely . Clearly, the minimizer of can be taken to be any ranking satisfying . For ease of notation we let , namely the element in position in . The cost is given by . For any number , let , namely, the number of elements with frequency at least . Changing order of summation, can also be written as

. This, in turn, equals .

By linearity of expectation, . This clearly equals , where are any two fixed, distinct elements of . Note that each frequency is distributed $B(T, 1/n)$, where $B(N, p)$ denotes the Binomial distribution with $N$ trials and probability $p$ of success. In what follows we let be a random variable distributed $B(T, 1/n)$. Let $\mu = T/n$ be the expectation of , and let $\sigma = \sqrt{T \cdot \tfrac{1}{n}\left(1 - \tfrac{1}{n}\right)}$ be its standard deviation. [For general $k$, instead, we have the moments of the binomial with $T$ trials and probability $k/n$ of success.] We will assume for simplicity that is an integer (although this requirement can be easily removed). We will fix an integer that will be chosen later. We split the last expression as , where

Before we bound , first note that for any , the random variable is distributed . Also, for any the function is monotonically decreasing in . Hence, for any ,

(7.2) |

#### Bounding :

We use Chernoff bound, stating that for any integer and probability ,

(7.3) | |||||

(7.4) |

#### Bounding :

Using the same as just chosen, possibly increasing and applying the central limit theorem, we conclude that there exists a function such that for all and ,

(7.6) |

where is the normal cdf. For notation purposes, let and . Hence,

We now make some rough estimates of the normal cdf. The reason for doing these tedious calculations will be made clear shortly. One verifies that

, , , , , , . Hence,

It is now easy to verify using standard analysis that for all ,

(7.7) |

Therefore,

(Note that the crux of the entire proof is in getting the first summand in the last expression to be for some . This is the reason we needed t
