A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model

07/25/2017 · Xi Chen et al. · Princeton University, NYU

We study the active learning problem of top-k ranking from multi-wise comparisons under the popular multinomial logit model. Our goal is to identify the top-k items with high probability by adaptively querying sets for comparisons and observing the noisy output of the most preferred item from each comparison. To achieve this goal, we design a new active ranking algorithm without using any information about the underlying items' preference scores. We also establish a matching lower bound on the sample complexity even when the set of preference scores is given to the algorithm. These two results together show that the proposed algorithm is nearly instance optimal (similar to instance optimal [FLN03], but up to polylog factors). Our work extends the existing literature on rank aggregation in three directions. First, instead of studying a static problem with fixed data, we investigate the top-k ranking problem in an active learning setting. Second, we show our algorithm is nearly instance optimal, which is a much stronger theoretical guarantee. Finally, we extend the pairwise comparison to the multi-wise comparison, which has not been fully explored in ranking literature.


1 Introduction

The problem of inferring a ranking over a set of $n$ items (e.g., products, movies, URLs) is an important problem in machine learning and finds numerous applications in recommender systems, web search, social choice, and many other areas. To learn the global ranking, an effective way is to present at most $M$ ($M \ge 2$) items at each time and ask about the most favorable item among them. The answers from these multi-wise comparisons are then aggregated to infer the global ranking. When the number of items becomes large, instead of inferring the global ranking over all the items, it is often of more interest to identify the top-$k$ items for a pre-specified $k$. In this paper, we study the problem of active top-$k$ ranking from multi-wise comparisons, where the goal is to adaptively choose at most $M$ items for each comparison and accurately infer the top-$k$ items with the minimum number of comparisons (i.e., the minimum sample complexity). As an illustration, consider a practical scenario: an online retailer is facing the problem of choosing the $k$ best handbag designs among $n$ candidate designs. One popular way is to display several designs to each arriving customer and observe which handbag is chosen. Since a shopping website has a limited number of display spots, each comparison involves at most $M$ possible designs.

Given the wide application of top-$k$ ranking, this problem has received a lot of attention in recent years, e.g., [SW15, STZ17] (please see Section 1.4 for more details). Our work greatly extends the existing literature on top-$k$ ranking in the following three directions:

  1. Most existing work studies a non-active ranking aggregation problem, where the answers of comparisons are provided statically or the items for each comparison are chosen completely at random. Instead of considering a passive ranking setup, we propose an active ranking algorithm, which adaptively chooses the items for comparisons based on previously collected information.

  2. Most existing work chooses some specific function (call it $f$) of the problem parameters (e.g., $n$, $k$, and the preference scores) and shows that the algorithm's sample complexity is at most $O(f)$. For the optimality, they also show that for any value of $f$, there exists an instance whose sample complexity equals that value, and any algorithm needs at least $\Omega(f)$ comparisons on this instance. However, this type of algorithm could perform poorly on instances other than those used for establishing the lower bounds (see examples from [CGMS17]); moreover, the designed algorithm can vary a lot with the form of the function $f$.

    To address this issue, we establish a much more refined upper bound on the sample complexity. The derived sample complexity matches a lower bound that holds even when all the parameters (including the set of underlying preference scores of the items) are given to the algorithm. Together, these results show that our lower bound is tight and that our algorithm is nearly instance optimal (see Definition 1.1 for the definition of nearly instance optimal).

  3. Existing work mainly focuses on pairwise comparisons. We extend the pairwise comparison to the multi-wise comparison (at most $M$ items per comparison) and further quantify the role of $M$ in the sample complexity. From our sample complexity result (see Section 1.2), we show that pairwise comparisons can be as helpful as multi-wise comparisons unless the underlying instance is very easy.

1.1 Model

In this paper, we adopt the widely used multinomial logit (MNL) model [Luc59, McF73, Tra03] for modeling multi-wise comparisons. In particular, we assume that each item $i$ has an underlying preference score $\theta_i$ (a.k.a. utility in economics), for $i = 1, \dots, n$. These scores, which are unknown to the algorithm, determine the underlying ranking of the items. Specifically, $\theta_i > \theta_j$ means that item $i$ is preferred to item $j$ and should have a higher rank. Without loss of generality, we assume that $\theta_1 > \theta_2 > \cdots > \theta_n$, and thus the true top-$k$ items are items $1, \dots, k$. At each time $t$, the algorithm chooses a subset of at least two items, denoted by $S_t$, for query/comparison. The size of the set is upper bounded by a pre-fixed parameter $M$, i.e., $|S_t| \le M$.

Given the set $S_t$, the agent will report her most preferred item following the multinomial logit (MNL) model:

$\Pr\big[\text{item } i \text{ is chosen from } S_t\big] = \dfrac{e^{\theta_i}}{\sum_{j \in S_t} e^{\theta_j}}, \quad i \in S_t.$ (1)

When the size of $S_t$ is two (i.e., $|S_t| = 2$), the MNL model reduces to the Bradley-Terry model [BT52], which has been widely studied in the rank aggregation literature in machine learning (see, e.g., [NOS17, JKSO13, RA14, CS15]).

In fact, the MNL model has a simple probabilistic interpretation [Tra03]. Given the set $S_t$, the agent draws her valuation $u_i = \theta_i + \epsilon_i$ for each item $i \in S_t$, where $\theta_i$ is the mean utility of item $i$ and each $\epsilon_i$ is an independent, identically distributed random variable following the Gumbel distribution. Then, the probability that item $i \in S_t$ is chosen as the most favorable item is $\Pr[u_i \ge u_j \text{ for all } j \in S_t]$. With some simple algebraic derivation using the density of the Gumbel distribution (see Chapter 3.1 in [Tra03]), this choice probability has the explicit expression in (1). For notational convenience, we define $\gamma_i = e^{\theta_i}$ for $i = 1, \dots, n$, so the choice probability in (1) can be equivalently written as $\gamma_i / \sum_{j \in S_t} \gamma_j$. By adaptively querying the sets $S_t$ and observing the reported most favorable item in each $S_t$, the goal is to identify the set of top-$k$ items with high probability using the minimum number of queries.
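To make this random-utility interpretation concrete, here is a minimal sketch (not from the paper; the scores below are made-up examples) checking numerically that adding i.i.d. Gumbel noise to the mean utilities and reporting the argmax reproduces the MNL choice probabilities in (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preference scores theta_i for a query set S of 4 items.
theta = np.array([1.0, 0.5, 0.0, -0.5])

# Closed-form MNL choice probabilities: exp(theta_i) / sum_j exp(theta_j).
p_mnl = np.exp(theta) / np.exp(theta).sum()

# Random-utility simulation: u_i = theta_i + Gumbel noise, report the argmax.
trials = 200_000
noise = rng.gumbel(size=(trials, len(theta)))
winners = np.argmax(theta + noise, axis=1)
p_emp = np.bincount(winners, minlength=len(theta)) / trials

print("MNL probabilities      :", np.round(p_mnl, 3))
print("Gumbel-max frequencies :", np.round(p_emp, 3))  # agree up to sampling error
```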

For notational convenience, we assume the $i$-th item (the item with preference score $\theta_i$) is labeled as $\pi(i)$ by the algorithm at the beginning. Since the algorithm has no prior knowledge of the ranking of the items before it makes any comparison, the ranking of the items should have no correlation with their labels. Therefore, $\pi$ is distributed as a uniform permutation of $\{1, \dots, n\}$.

The notion of instance optimality was originally defined and emphasized as an important concept in [FLN03]. With the MNL model in place, we provide a formal definition of near instance optimality for our problem. To obtain a definition of (exact) instance optimality, one can simply replace $\widetilde{O}(\cdot)$ with $O(\cdot)$ in Definition 1.1; the "nearly" just means that we allow polylog factors.

Definition 1.1 (Nearly Instance Optimal).

Given an instance $\theta = (\theta_1, \dots, \theta_n)$, define $N(\theta)$ to be the sample complexity of an optimal adaptive algorithm on this instance. We say that an algorithm is nearly instance optimal if, for any instance $\theta$, the algorithm outputs the top-$k$ items with high probability using at most $\widetilde{O}(N(\theta))$ comparisons. (Here $\widetilde{O}(\cdot)$ hides polylog factors of $n$.)

1.2 Main results

Under the MNL model described in Section 1.1, the main results of this paper include the following upper and lower bounds on the sample complexity.

Theorem 1.2.

We design an active ranking algorithm which uses

comparisons with set size at most $M$ (these can be 2-wise, 3-wise, …, $M$-wise comparisons) to identify the top-$k$ items with high probability.

We note that in Theorem 1.2, the $\widetilde{O}(\cdot)$ notation hides polylog factors of $n$.

Next, we present a matching lower bound result, which shows that our sample complexity in Theorem 1.2 is nearly instance optimal.

Theorem 1.3.

For any (possibly active) ranking algorithm $\mathcal{A}$ that uses comparisons of set size at most $M$: even when $\mathcal{A}$ is given the values of $\theta_1, \dots, \theta_n$ (note that $\mathcal{A}$ does not know which item takes which preference score), $\mathcal{A}$ still needs

comparisons to identify the top-$k$ items with probability at least a fixed constant.

Here we give some intuitive explanations of the terms in the above bounds before introducing the proof overview:

  1. The $n/M$ term: since each comparison has size at most $M$, we need at least $n/M$ comparisons to query each item at least once.

  2. The term summing over the top-$k$ items: as the proof will suggest, in order to find the top-$k$ items, we need to observe most items in the top-$k$ set as chosen items in comparisons. However, we do not have to observe most items in the bottom-$(n-k)$ set, so there is no corresponding term for the bottom items in the bound.

  3. The remaining term: roughly speaking, it measures the amount of information that a comparison between two items reveals about their relative order, so intuitively its inverse is the number of comparisons needed to tell that one item ranks after the other. The other quantities can also be understood from an information-theoretic perspective.

It is also worthwhile to note that when $M$ is a constant, it is easy to check that

This gives a simpler expression for the instance optimal sample complexity when $M$ is a constant.

Based on the sample complexity results in Theorems 1.2 and 1.3, we summarize the main theoretical contributions of this paper:

  1. We design an active ranking algorithm for identifying the top-$k$ items under the popular MNL model. We further prove a matching lower bound, which establishes that the proposed algorithm is nearly instance optimal.

  2. Our result shows that the improvement of multi-wise comparisons over pairwise comparisons depends on the difficulty of the underlying instance. Note that the only term in the sample complexity involving $M$ is $n/M$. Therefore, multi-wise comparisons make a significant difference over pairwise comparisons only when $n/M$ is the leading term in the sample complexity.

    Therefore, unless the underlying instance is really easy (e.g., the instance-adaptive terms are dominated by $n/M$; one implication of this is that most of the $\gamma_i$'s are much smaller than the largest scores), pairwise comparisons are as helpful as multi-wise comparisons.

1.3 Proof overview

In this section, we give a very high-level overview of how we prove Theorem 1.2 and Theorem 1.3.

1.3.1 Algorithms

To prove Theorem 1.2, we consider two separate cases, depending on whether the maximum comparison size $M$ is logarithmic in $n$ or superlogarithmic.

  1. In the first case, by losing a log-factor, we can focus on using only pairwise comparisons. Our algorithm first randomly selects a collection of pairs and proceeds by querying all of them once per iteration. After getting the query results, by a standard binomial concentration bound, we are able to construct a confidence interval of the pairwise preference probability for each pair selected at the beginning. At a high level, the algorithm declares that item $i$ beats item $j$ for a selected pair if the lower end of the corresponding confidence interval rises above $1/2$, or if such declarations have already been made along a chain of intermediate items connecting $i$ to $j$. We are able to show that, with enough total queries, the algorithm successfully makes the corresponding declaration for every pair whose preference scores are sufficiently separated. Thus, we can remove a constant fraction of the items and recurse on a smaller set.

  2. The more interesting case is when $M$ is large. As we have argued before, it is only beneficial to use multi-wise comparisons when $n/M$ is the leading term of the sample complexity. This implies that the instance-adaptive terms are small, and therefore more than half of the $\gamma_i$'s are smaller than some constant fraction of the largest scores. Thus, intuitively, if we select a random subset of items that contains a top item and keep querying this set, then, instead of seeing all items in this set with roughly equal probability, we will see the top item much more often than the median of the frequencies of the items in the set. Thus, our algorithm can select an item if it "appears very often when querying a set containing it" (see the sketch after this list). We show that, with enough total queries, we are able to select all of the top-$k$ items while not selecting any of the bottom items. Thus, we can again remove a constant fraction of the items and recurse on a smaller set.
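As a rough illustration of this frequency gap (the set size, score values, and number of queries below are made-up stand-ins, not the quantities the analysis actually prescribes), the following sketch repeatedly queries one random set containing a single strong item and compares its win count with the median win count in the set.

```python
import numpy as np

rng = np.random.default_rng(1)

n, M, trials = 200, 50, 5000
# Hypothetical instance: one strong item (index 0), the rest much weaker.
gamma = np.ones(n) * 0.05   # gamma_i = exp(theta_i), mostly small
gamma[0] = 1.0              # the top item

# Query one random M-subset containing item 0 repeatedly; count wins per item.
members = np.concatenate(([0], rng.choice(np.arange(1, n), size=M - 1, replace=False)))
probs = gamma[members] / gamma[members].sum()
wins = np.bincount(rng.choice(len(members), size=trials, p=probs), minlength=len(members))

print("wins of the top item   :", wins[0])
print("median wins in the set :", int(np.median(wins)))
# The top item's frequency dwarfs the median frequency, so a simple
# "appears much more often than the median" rule can pick it out.
```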

1.3.2 Lower bounds

To prove Theorem 1.3, we establish several lower bounds and combine them using a simple averaging argument. Most of these lower bounds follow the general proof strategy below:

  1. For a given instance, consider other instances on which no algorithm can output $\{\pi(1), \dots, \pi(k)\}$ with high probability. (Recall that $\pi(i)$ denotes the initial label of the $i$-th item given as input to the algorithm, and thus the true top-$k$ items are labeled by $\pi(1), \dots, \pi(k)$.) For example, if we just change $\theta_k$ to $\theta_{k+1}$, then no algorithm can output $\{\pi(1), \dots, \pi(k)\}$ with probability more than $1/2$. This is because item $\pi(k)$ and item $\pi(k+1)$ now look the same, and thus every algorithm outputs $\pi(k)$ and $\pi(k+1)$ with the same probability on the modified instance.

  2. We then consider a well-designed distribution over these modified instances. We show that for any algorithm $\mathcal{A}$ that does not use enough comparisons, the transcript of running $\mathcal{A}$ on the original instance is distributed very closely to the transcript of running $\mathcal{A}$ on the well-designed distribution over modified instances (a rough numerical illustration follows this list).

  3. Finally, since the transcript also includes the output, step 2 tells us that if $\mathcal{A}$ does not use enough comparisons, then $\mathcal{A}$ must fail to output the top-$k$ items with some constant probability.
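As a back-of-the-envelope illustration of step 2 (this calculation is not from the paper; the scores are made up), one can estimate how many queries of the pair of items $k$ and $k+1$ are needed before the transcripts on the original and modified instances become distinguishable, using the per-query KL divergence together with Pinsker's inequality.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical scores of items k and k+1 (gamma_i = exp(theta_i)).
gamma_k, gamma_k1 = 1.05, 1.00

# Original instance: P(item k beats item k+1) in a pairwise query.
p_orig = gamma_k / (gamma_k + gamma_k1)
# Modified instance: theta_k lowered to theta_{k+1}, so the pair is a fair coin flip.
p_mod = 0.5

kl = kl_bernoulli(p_orig, p_mod)
# By Pinsker's inequality, distinguishing the two transcripts with constant
# advantage needs on the order of 1/KL queries of this pair.
print(f"per-query KL  : {kl:.2e}")
print(f"~1/KL queries : {1.0 / kl:.0f}")
```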

1.4 Related Works

Rank aggregation from pairwise comparisons is an important problem in computer science and has been widely studied under different comparison models. Most existing works focus on the non-active setting: the pairs of items for comparisons are fixed (or chosen completely at random) and the algorithm cannot adaptively choose the next pair to query. In this non-active ranking setup, when the goal is to obtain a global ranking over all the items, Negahban et al. [NOS17] proposed the RankCentrality algorithm under the popular Bradley-Terry model, which is a special case of the MNL model for pairwise comparisons. Lu and Boutilier [LB11] proposed a ranking algorithm under the Mallows model. Rajkumar and Agarwal [RA14] investigated different statistical assumptions (e.g., a generalized low-noise condition) for guaranteeing recovery of the true ranking. Shah et al. [SBGW17] studied rank aggregation under a non-parametric comparison model, the strong stochastic transitivity (SST) model, and converted the ranking problem into a matrix estimation problem under shape constraints. Most machine learning literature assumes that there is a true global ranking of the items and that the output of each pairwise comparison follows a probabilistic model. Another way of formulating the ranking problem is via the minimum feedback arc set problem on tournaments, which does not assume a true global ranking and aims to find a ranking that minimizes the number of inconsistent pairs. There is a vast literature on the minimum feedback arc set problem and we omit a survey of this direction (please see [KMS07] and the references therein). Due to the increasing number of items in many internet applications, it is often practically more useful to identify only the top-$k$ items. Chen and Suh [CS15], Jang et al. [JKSO13], and Suh et al. [STZ17] proposed various spectral methods for top-$k$ item identification under the BTL model or mixtures of BTL models. Shah and Wainwright [SW15] proposed a counting-based algorithm under the SST model, and Chen et al. [CGMS17] argued that the notion of instance optimality, originally defined and emphasized as an important concept in [FLN03] for identifying the top-$k$ objects from sorted lists, is needed for rank aggregation from noisy pairwise comparisons under complicated noise models; they further improved [SW15] by proposing an algorithm with a competitive ratio against the best algorithm for each instance and proving that this ratio is tight.

In addition to static rank aggregation, active noisy sorting and ranking problems have received a lot of attention in recent years. For example, several works [BM08, Ail11, JN11, WMJ13] studied the active sorting problem from noisy pairwise comparisons and explored the sample complexity of approximately recovering the true ranking in terms of some distance function (e.g., Kendall's tau). Chen et al. [CBCTH13] proposed a Bayesian online ranking algorithm under a mixture of BTL models. Dwork et al. [DKNS01] and Ailon et al. [ACN08] considered the related Kemeny optimization problem, where the goal is to determine the total ordering that minimizes the sum of the distances to different permutations. For top-$k$ identification, Braverman et al. [BMW16] initiated the study of how the round complexity of active algorithms affects the sample complexity. Szörényi et al. [SBPH15] studied a special case of this problem under the BTL model. Heckel et al. [HSRW16] investigated active ranking under a general class of nonparametric models and also established a lower bound on the number of comparisons for parametric models. A very recent work by Mohajer and Suh [MS16] proposed an active algorithm for top-$k$ identification under a general class of pairwise comparison models, where the instance difficulty is characterized by a key quantity defined in terms of the pairwise preference probabilities $p_{ij}$; here, $p_{ij}$ is the probability that item $i$ is preferred over item $j$. However, according to our result in Theorem 1.3, the sample complexities obtained in these previous works are not instance optimal. We note that the lower bound result in Theorem 1.3 holds even for algorithms that know all the values of the $\theta_i$'s (but not which item corresponds to which value), and thus it characterizes the difficulty of each instance. Moreover, we study multi-wise comparisons, which have not been fully explored in the ranking aggregation literature but have a wide range of applications.

Finally, we note that the top-$k$ ranking problem is related to the best arm identification problem in the multi-armed bandit literature [BWV13, JMNB14, ZCL14, CCZZ17]. However, in the latter problem, the samples are i.i.d. random variables rather than comparisons, and the goal is to identify the top-$k$ distributions with the largest means.

2 Algorithm

For notational simplicity, throughout the paper we use "w.h.p." to denote "with probability at least $1 - n^{-c}$" for a sufficiently large constant $c$.

2.1 Top-$k$ item identification (for logarithmic $M$)

For logarithmic $M$, we can always use pairwise comparisons while losing only a polylog factor. Therefore, we focus only on the case $M = 2$ in this section.

Before presenting the algorithm, let us first consider a graph in which each edge is labeled with one of a small set of comparison labels (see Line 9 in Algorithm 1). Based on the labeling of edges, we give the following definition of label monotonicity, which will be used in Algorithm 1.

Definition 2.1 (Monotone).

We call a path $v_1 \to v_2 \to \cdots \to v_m$ strictly label monotone if:

  1. For every $\ell$, the edge $(v_\ell, v_{\ell+1})$ carries a label asserting that $v_\ell$ is (strictly or weakly) preferred to $v_{\ell+1}$.

  2. At least one edge along the path carries a strict label.

Moreover, we call a path "label monotone" if only property 1 holds.
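To make Definition 2.1 concrete, here is a small helper (the label names 'succ' and 'succeq' are made-up placeholders for the strict and weak labels assigned in Line 9 of Algorithm 1) that checks whether a strictly label monotone path of bounded length exists between two vertices.

```python
from collections import deque

def has_strict_monotone_path(edges, src, dst, max_len):
    """edges maps (u, v) -> 'succ' or 'succeq', meaning u is (strictly/weakly)
    preferred to v.  Returns True iff there is a path src -> ... -> dst with at
    most max_len edges, all labeled 'succ'/'succeq', and at least one 'succ'."""
    start = (src, False)          # state: (current vertex, seen a strict edge yet?)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u, strict = queue.popleft()
        d = dist[(u, strict)]
        if d == max_len:
            continue
        for (a, b), label in edges.items():
            if a != u:
                continue
            nxt = (b, strict or label == 'succ')
            if nxt not in dist:
                dist[nxt] = d + 1
                if b == dst and nxt[1]:
                    return True
                queue.append(nxt)
    return False

# Tiny example: 1 strictly beats 2 and 2 weakly beats 3, so there is a strictly
# monotone path from 1 to 3; the path 2 -> 3 -> 4 has no strict edge.
edges = {(1, 2): 'succ', (2, 3): 'succeq', (3, 4): 'succeq'}
print(has_strict_monotone_path(edges, 1, 3, max_len=3))  # True
print(has_strict_monotone_path(edges, 2, 4, max_len=3))  # False
```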

Theorem 2.2.

For every set of $n$ items, Algorithm 1, given a random permutation of the labels and $k$, returns the top-$k$ items w.h.p. using

total number of pairwise comparisons.

Due to the page limit, we defer the proof of Theorem 2.2 to Appendix A and only provide the pseudocode in Algorithm 1. In Algorithm 1, note that the current set size is denoted by a different letter (instead of $n$) because we will run the algorithm recursively on smaller sets. Also note that the sampling parameter in Line 3 is set independently of the value of $k$. We defer our result for superlogarithmic $M$ to Appendix A.

1: Parameters: the number of sampled pairs and the per-iteration query counts (set as in the analysis).
2: Input: a set $S$ of randomly permuted item labels; $k$: the number of top items to identify.
3: Uniformly at random sample subsets of $S$, each of size $2$. Associate these subsets with a graph $G$ whose vertices are the items in $S$ and whose edges are the sampled pairs.
4: Initialize $S_{\mathrm{top}} \leftarrow \emptyset$, $S_{\mathrm{bottom}} \leftarrow \emptyset$.
5: while true do
6:     Update the number of queries to be used in this iteration.
7:     Query each sampled pair the prescribed number of times and record the query results. (Each result indicates the reported most favorable item of the pair.)
8:     For each sampled pair and each item in it, compute the empirical frequency with which that item is reported as the most favorable.
9:     For each edge, assign a label according to the empirical win frequencies: a strict label or a weak label in the appropriate direction when the frequency deviates sufficiently far from $1/2$, and no informative label otherwise.

10:     For every pair of items $i, j$, declare $i \succ j$ if there exists a strictly label monotone path of bounded length from $i$ to $j$.
11:     For each item $i$, if there exist at least $k$ items $j$ for which $j \succ i$ has been declared, then add $i$ to $S_{\mathrm{bottom}}$. ($S_{\mathrm{bottom}}$ is the subset of items that we are sure are not in the top-$k$.)
12:     For each item $i$, if there exist at least $|S| - k$ items $j$ for which $i \succ j$ has been declared, then add $i$ to $S_{\mathrm{top}}$. ($S_{\mathrm{top}}$ is the subset of items that we are sure are in the top-$k$.)
13:     Break if enough items have been placed in $S_{\mathrm{top}} \cup S_{\mathrm{bottom}}$.
14: $S \leftarrow S \setminus (S_{\mathrm{top}} \cup S_{\mathrm{bottom}})$, $k \leftarrow k - |S_{\mathrm{top}}|$.
15: Return $S_{\mathrm{top}}$ together with the output of AlgPairwise run recursively on the remaining set $S$ with parameter $k$.
Algorithm 1 AlgPairwise
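The following Python sketch conveys the flavor of the pairwise declaration step in Algorithm 1; the confidence radius, the $1/2$ threshold, and the query counts below are illustrative placeholders rather than the paper's exact choices.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def mnl_pair_winner(gamma_i, gamma_j):
    """One pairwise MNL (Bradley-Terry) query: True if item i wins."""
    return rng.random() < gamma_i / (gamma_i + gamma_j)

def declare_order(gamma_i, gamma_j, queries, delta=1e-3):
    """Query the pair `queries` times; return '>', '<', or '?' depending on
    whether the empirical win rate's confidence interval separates from 1/2.
    (The radius below is a standard Hoeffding radius, used here only as a
    stand-in for the paper's concentration bound.)"""
    wins = sum(mnl_pair_winner(gamma_i, gamma_j) for _ in range(queries))
    p_hat = wins / queries
    radius = math.sqrt(math.log(2 / delta) / (2 * queries))
    if p_hat - radius > 0.5:
        return '>'
    if p_hat + radius < 0.5:
        return '<'
    return '?'

# Example: a well-separated pair resolves quickly, a close pair stays undecided.
print(declare_order(1.0, 0.5, queries=500))   # likely '>'
print(declare_order(1.0, 0.98, queries=500))  # likely '?'
```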

3 Lower bounds

We will prove lower bounds on the number of comparisons used by any algorithm that identifies the top-$k$ items, even when the values of the preference scores are given to the algorithm (the algorithm just does not know which item has which score). Due to the page limit, all the proofs are deferred to Appendix B.

3.1 Lower bounds for close weights

Theorem 3.1.

Assume the preference scores are sufficiently close to each other. For any algorithm $\mathcal{A}$ (which can be adaptive), if $\mathcal{A}$ uses too few comparisons of any size (these can be $m$-wise comparisons for any $m$), then $\mathcal{A}$ will identify the top-$k$ items with probability at most a constant bounded away from one.

Theorem 3.2.

Assume the preference scores are sufficiently close to each other (under a second closeness condition). For any algorithm $\mathcal{A}$ (which can be adaptive), if $\mathcal{A}$ uses too few comparisons of any size (these can be $m$-wise comparisons for any $m$), then $\mathcal{A}$ will identify the top-$k$ items with probability at most a constant bounded away from one.

3.2 Lower bounds for arbitrary weights

Theorem 3.3.

Under an additional assumption on the instance, for any algorithm $\mathcal{A}$ (which can be adaptive), if $\mathcal{A}$ uses too few comparisons of any size (these can be $m$-wise comparisons for any $m$), then $\mathcal{A}$ will identify the top-$k$ items with probability at most a constant bounded away from one.

Theorem 3.4.

For any algorithm $\mathcal{A}$ (which can be adaptive), if $\mathcal{A}$ uses too few comparisons of any size (these can be $m$-wise comparisons for any $m$), then $\mathcal{A}$ will identify the top-$k$ items with probability at most a constant bounded away from one.

Theorem 3.5.

Under an additional assumption on the instance, for any algorithm $\mathcal{A}$ (which can be adaptive), if $\mathcal{A}$ uses too few comparisons of size at most $M$ (these can be 2-wise, 3-wise, …, $M$-wise comparisons), then $\mathcal{A}$ will identify the top-$k$ items with probability at most a constant bounded away from one.

3.3 Combining lower bounds

Corollary 3.6 (Restatement of Theorem 1.3).

For any algorithm $\mathcal{A}$ (which can be adaptive), suppose $\mathcal{A}$ uses comparisons of size at most $M$ (these can be 2-wise, 3-wise, …, $M$-wise comparisons). Then $\mathcal{A}$ needs

comparisons to identify the top-$k$ items with probability at least a fixed constant.

Proof.

To prove this corollary, we just need to combine the results in Theorem 3.1, Theorem 3.2, Theorem 3.3, Theorem 3.4, and Theorem 3.5, and then use the simple averaging fact that if $a_1 + a_2 + \cdots + a_5 \ge T$, then there exists some $i$ with $a_i \ge T/5$. ∎

Appendix A Additional Results and Proofs of Section 2

Throughout the proofs we are going to use the following claim, which is a simple fact about binomial concentration.

Claim A.1 (Binomial concentration).

For every $p \in [0, 1]$ and every positive integer $m$, suppose $X \sim \mathrm{Binomial}(m, p)$; then w.h.p. (with high probability with respect to $n$), $X$ is close to its mean $mp$.
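The following quick numerical check (with arbitrary example parameters, and a Bernstein-style radius chosen only for illustration) shows the kind of binomial concentration the claim refers to.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 1000                      # the "w.h.p." is with respect to n
m, p = 5000, 0.3              # example binomial parameters
samples = rng.binomial(m, p, size=10_000)

deviation = np.abs(samples - m * p)
bound = np.sqrt(3 * m * p * np.log(n)) + 3 * np.log(n)  # illustrative concentration radius

print("max observed deviation:", deviation.max())
print("illustrative bound    :", round(float(bound), 1))
print("fraction within bound :", float((deviation <= bound).mean()))
```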

A.1 Top-$k$ item identification (for logarithmic $M$)

In this section, we prove Theorem 2.2 of Section 2.

Following Claim A.1, we know that for every sampled pair and every iteration, the empirical win counts concentrate around their means w.h.p. Without loss of generality, let us focus on the event that this bound is satisfied for all pairs and all iterations.

We have the following lemma about the labeling:

Lemma A.1 (Label).

For every labeled edge, w.h.p. we have:

  1. if the score gap between the two endpoints is large, then the edge receives a strict or weak label in the correct direction;

  2. if the score gap is moderate, then the edge receives at least the weak label in the correct direction;

  3. if the edge receives a strict or weak label in one direction, then the corresponding relation between the scores holds;

  4. if the edge receives no informative label, then the two scores are close.

Proof of Lemma A.1.
  1. We know that for and :

  2. Again by and , we know that , therefore, we have:

  3. Let us suppose , otherwise we already complete the proof. Now, we have:

    Which implies that

    Therefore, by , we have:

    Which implies that

  4. Let us suppose , otherwise we already complete the proof. Again, we have:

    Which implies that

Lemma A.1 implies that w.h.p. the labeling of each edge is consistent with the order of the underlying scores. Now, the algorithm declares $i \succ j$ if there exists a strictly label monotone path from $i$ to $j$. Using the lemma above, we can show that if such a path exists, then $\theta_i > \theta_j$. To show the other direction, namely that such paths exist when the scores are sufficiently separated, we first prove the following graph lemma that gives the existence of monotone paths in a random graph $G(n, p)$.

Lemma A.2 (Graph Path).

For every $p$ and every random graph $G(n, p)$ on $n$ vertices, if $p$ is sufficiently large, then w.h.p. the following holds: for every pair of vertices $i < j$ whose indices are sufficiently far apart, there exists a path $i = v_1 \to v_2 \to \cdots \to v_m = j$ such that

  1. the path is short (its length is bounded), and

  2. the indices increase along the path, i.e., $v_\ell < v_{\ell+1}$ for every $\ell$ (in particular, consecutive vertices are adjacent in $G$).

We call such a path a monotone path from $i$ to $j$.

Proof of Lemma A.2.

It suffices to consider the case when $p$ is not too close to $1$; otherwise, w.h.p. the graph is a complete graph and the lemma is automatically true.

We consider a sequential way of generating : At each time , a vertex arrives and there exists an edge between and each with probability . Let us consider a fixed and . Let . We will divide the set into subsets such that

Since we know that .

Let us define the random variable as:

and .

For each , we define

Clearly, and each is i.i.d. random variable in {0, 1} with . On the other hand, by definition,

We consider two cases:

  1. , then .

  2. , then by for , we have .

Consider a fixed and for each , let be the random variable. By standard Chernoff bound, we have:

  1. If , then w.h.p. .

  2. , then w.h.p. .

    Recall that and , therefore,

    Which implies that w.h.p. .

Putting everything together, we know that the required events hold w.h.p. Therefore, conditioning on these events and combining the bounds,

we complete the proof. ∎
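The following sketch (the edge probability, graph size, and gaps are arbitrary illustrative choices) empirically checks the phenomenon behind Lemma A.2: in a random graph, index-monotone paths between far-apart vertices exist and are short, while they may fail to exist between nearby vertices.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(4)

def forward_path_len(adj, i, j):
    """Length of the shortest path from i to j that only uses edges going to
    a larger vertex index (a "monotone" path); None if no such path exists."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v > u and v not in dist:
                dist[v] = dist[u] + 1
                if v == j:
                    return dist[v]
                queue.append(v)
    return None

# Sample an Erdos-Renyi graph G(n, p) with an illustrative edge probability.
n = 400
p = 20 * np.log(n) / n
adj = [[] for _ in range(n)]
for u in range(n):
    for v in range(u + 1, n):
        if rng.random() < p:
            adj[u].append(v)
            adj[v].append(u)

# Monotone paths are easy to find between far-apart indices, harder for close ones.
for gap in (2, 20, 200):
    pairs = [(i, i + gap) for i in range(0, n - gap, 37)]
    lengths = [forward_path_len(adj, i, j) for i, j in pairs]
    found = [length for length in lengths if length is not None]
    print(f"gap {gap:3d}: monotone path found for {len(found)}/{len(pairs)} pairs, "
          f"longest found has {max(found) if found else 0} edges")
```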

Having established this lemma, we can present the main lemma about the algorithm:

Lemma A.3 (Main 3).

Suppose , then w.h.p. the following holds:

  1. , .

  2. If and , then .

  3. If and , then .

The algorithm terminates within a bounded number of recursions, and in each recursion it makes a bounded number of queries. Therefore, Lemma A.3 implies the stated bound on the total number of queries used by the algorithm.

Proof of Lemma A.3.
  1. It suffices to show that whenever the declaration is made, the corresponding order of the scores holds. To see this, consider a strictly label monotone path. By Lemma A.1, every edge along the path certifies the corresponding weak relation between the scores, and at least one edge certifies a strict relation. Multiplying everything together, we conclude that

  2. Let us denote the set , we will prove that . Consider one , by Lemma A.