# Spectral Method and Regularized MLE Are Both Optimal for Top-K Ranking

This paper is concerned with the problem of top-K ranking from pairwise comparisons. Given a collection of n items and a few pairwise binary comparisons across them, one wishes to identify the set of K items that receive the highest ranks. To tackle this problem, we adopt the logistic parametric model---the Bradley-Terry-Luce model, where each item is assigned a latent preference score, and where the outcome of each pairwise comparison depends solely on the relative scores of the two items involved. Recent works have made significant progress towards characterizing the performance (e.g. the mean square error for estimating the scores) of several classical methods, including the spectral method and the maximum likelihood estimator (MLE). However, where they stand regarding top-K ranking remains unsettled. We demonstrate that under a random sampling model, the spectral method alone, or the regularized MLE alone, is minimax optimal in terms of the sample complexity---the number of paired comparisons needed to ensure exact top-K identification. This is accomplished via optimal control of the entrywise error of the score estimates. We complement our theoretical studies by numerical experiments, confirming that both methods yield low entrywise errors for estimating the underlying scores. Our theory is established based on a novel leave-one-out trick, which proves effective for analyzing both iterative and non-iterative optimization procedures. Along the way, we derive an elementary eigenvector perturbation bound for probability transition matrices, which parallels the Davis-Kahan sinΘ theorem for symmetric matrices. This further allows us to close the gap between the ℓ_2 error upper bound for the spectral method and the minimax lower limit.


## 1 Introduction

Imagine we have a large collection of items, and we are given partially revealed comparisons between pairs of items. These paired comparisons are collected in a non-adaptive fashion, and could be highly noisy and incomplete. The aim is to aggregate these partial preferences so as to identify the K items that receive the highest ranks. This problem, which is called top-K rank aggregation, finds applications in numerous contexts, including web search (Dwork et al., 2001), recommendation systems (Baltrunas et al., 2010), and sports competition (Masse, 1997), to name just a few. The challenge is both statistical and computational: how can one achieve reliable top-K ranking from a minimal number of pairwise comparisons, while retaining computational efficiency?

### 1.1 Popular approaches

To address the aforementioned challenge, many prior approaches have been put forward based on certain statistical models. Arguably one of the most widely used parametric models is the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959), which assigns a latent preference score to each of the n items. The BTL model posits that the chance of each item winning a paired comparison is determined by the relative scores of the two items involved, or more precisely,

 P{item j is preferred over item i} = w∗_j / (w∗_i + w∗_j) (1)

in each comparison of item i against item j. The items are repeatedly compared in pairs according to this parametric model. The task then boils down to identifying the K items with the highest preference scores, given these pairwise comparisons.
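For concreteness, the pairwise winning probability in (1) can be evaluated directly from a score vector. A minimal sketch follows; the scores, item indices, and the function name are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Illustrative preference scores for 5 items (values are hypothetical).
w = np.array([3.0, 2.0, 1.5, 1.0, 0.5])

def btl_win_prob(w, i, j):
    """P{item j is preferred over item i} under the BTL model, eq. (1)."""
    return w[j] / (w[i] + w[j])

# Item 4 (score 0.5) rarely beats item 0 (score 3.0).
print(btl_win_prob(w, 0, 4))  # 0.5 / 3.5 ≈ 0.142857...
```

Note that, by construction, `btl_win_prob(w, i, j) + btl_win_prob(w, j, i) = 1`: every comparison produces exactly one winner.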

Among the ranking algorithms tailored to the BTL model, the following two procedures have received particular attention, both of which rank the items based on appropriate estimates of the latent preference scores.

1. The spectral method. By connecting the winning probability in (1) with the transition probability of a reversible Markov chain, the spectral method attempts recovery of w∗ via the leading left eigenvector of a sample transition matrix. This procedure, also known as Rank Centrality (Negahban et al., 2017a), bears similarity to the PageRank algorithm.

2. The maximum likelihood estimator (MLE). This approach proceeds by finding the score assignment that maximizes the likelihood function (Ford, 1957). When parameterized appropriately, solving the MLE becomes a convex program, and hence is computationally feasible. There are also important variants of the MLE that enforce additional regularization.

Details are postponed to Section 2.2. In addition to their remarkable practical applicability, these two ranking paradigms are appealing in theory as well. For instance, both of them provably achieve intriguing ℓ_2 accuracy when estimating the latent preference scores (Negahban et al., 2017a).

Nevertheless, the ℓ_2 error for estimating the latent scores merely serves as a “meta-metric” for the ranking task, which does not necessarily reveal the accuracy of top-K identification. In fact, given that the ℓ_2 loss only reflects the estimation error in some average sense, it is certainly possible that an algorithm obtains minimal ℓ_2 estimation loss but incurs (relatively) large errors when estimating the scores of the highest ranked items. Interestingly, a recent work Chen and Suh (2015) demonstrates that a careful combination of the spectral method and the coordinate-wise MLE is optimal for top-K ranking. This leaves open the following natural questions: where does the spectral method alone, or the MLE alone, stand in top-K ranking? Are they capable of attaining exact top-K recovery from minimal samples? These questions form the primary objectives of our study.

As we will elaborate later, the spectral method part of the preceding questions was recently explored by (Jang et al., 2016), for a regime where a relatively large fraction of item pairs have been compared. However, it remains unclear how well the spectral method can perform in a much broader — and often much more challenging — regime, where the fraction of item pairs being compared may be vanishingly small. Additionally, the ranking accuracy of the MLE (and its variants) remains unknown.

### 1.2 Main contributions

The central focal point of the current paper is to assess the accuracy of both the spectral method and the regularized MLE in top-K identification. Assuming that the pairs of items being compared are randomly selected and that the preference scores fall within a fixed dynamic range, our paper delivers a somewhat surprising message:

• Both the spectral method and the regularized MLE achieve perfect identification of top-K ranked items under optimal sample complexity (up to some constant factor)!

It is worth emphasizing that these two algorithms succeed even under the sparsest possible regime, a scenario where only an exceedingly small fraction of pairs of items have been compared. This calls for precise control of the entrywise error — as opposed to the ℓ_2 loss — for estimating the scores. To this end, our theory is established upon a novel leave-one-out argument, which might shed light on how to analyze the entrywise error for more general optimization problems.

As a byproduct of the analysis, we derive an elementary eigenvector perturbation bound for (asymmetric) probability transition matrices, which parallels Davis-Kahan’s theorem for symmetric matrices. This simple perturbation bound immediately leads to an improved ℓ_2 error bound for the spectral method, which allows us to close the gap between the theoretical performance of the spectral method and the minimax lower limit.

### 1.3 Notation

Before proceeding, we introduce a few notations that will be useful throughout. To begin with, for any strictly positive probability vector π ∈ R^n, we define the inner product space indexed by π as a vector space in R^n endowed with the inner product ⟨x, y⟩_π = Σ_{i=1}^n x_i y_i π_i. The corresponding vector norm and the induced matrix norm are defined respectively as ‖x‖_π = √(⟨x, x⟩_π) and ‖A‖_π = max_{x ≠ 0} ‖x^⊤A‖_π / ‖x‖_π.

Additionally, the notation f(n) = O(g(n)) or f(n) ≲ g(n) means there is a constant c > 0 such that |f(n)| ≤ c|g(n)|, f(n) ≳ g(n) means there is a constant c > 0 such that |f(n)| ≥ c|g(n)|, f(n) ≍ g(n) means that there exist constants c₁, c₂ > 0 such that c₁|g(n)| ≤ |f(n)| ≤ c₂|g(n)|, and f(n) = o(g(n)) means f(n)/g(n) → 0.

Given a graph G with vertex set V = {1, …, n} and edge set E, we denote by L_G := Σ_{(i,j)∈E} (e_i − e_j)(e_i − e_j)^⊤ the (unnormalized) Laplacian matrix (Chung, 1997) associated with it, where e_1, …, e_n are the standard basis vectors in R^n. For a matrix M with n real eigenvalues, we let λ₁(M) ≥ λ₂(M) ≥ ⋯ ≥ λ_n(M) be the eigenvalues sorted in descending order.

## 2 Statistical models and main results

### 2.1 Problem setup

We begin with a formal introduction of the Bradley-Terry-Luce parametric model for binary comparisons.

Preference scores. As introduced earlier, we assume the existence of a positive latent score vector

 w∗ = [w∗_1, ⋯, w∗_n]^⊤ (2)

that comprises the underlying preference scores assigned to each of the items. Alternatively, it is sometimes more convenient to reparameterize the score vector by

 θ∗ = [θ∗_1, ⋯, θ∗_n]^⊤, where θ∗_i = log w∗_i. (3)

These scores are assumed to fall within a dynamic range given by

 w∗_i ∈ [w_min, w_max], or θ∗_i ∈ [θ_min, θ_max] (4)

for all 1 ≤ i ≤ n and for some w_min > 0, w_max > 0, θ_min = log w_min, and θ_max = log w_max. We also introduce the condition number as

 κ := w_max / w_min. (5)

Notably, the current paper primarily focuses on the case with a fixed dynamic range (i.e. κ is a fixed constant independent of n), although we will also discuss extensions to the large dynamic range regime in Section 3. Without loss of generality, it is assumed that

 w_max ≥ w∗_1 ≥ w∗_2 ≥ … ≥ w∗_n ≥ w_min, (6)

meaning that items 1 through K are the desired top-K ranked items.

Comparison graph. Let G = (V, E) stand for a comparison graph, where the vertex set V = {1, …, n} represents the items of interest. The items i and j are compared if and only if (i, j) falls within the edge set E. Unless otherwise noted, we assume that G is drawn from the Erdős–Rényi random graph G(n, p), such that an edge between any pair of vertices is present independently with some probability p. In words, p captures the fraction of item pairs being compared.

Pairwise comparisons. For each (i, j) ∈ E, we obtain L independent paired comparisons between items i and j. Let y^{(l)}_{i,j} be the outcome of the l-th comparison, which is independently drawn as

 y^{(l)}_{i,j} = { 1, with probability w∗_j / (w∗_i + w∗_j) = e^{θ∗_j} / (e^{θ∗_i} + e^{θ∗_j});  0, else. } (7)

By convention, we set y^{(l)}_{j,i} = 1 − y^{(l)}_{i,j} for all (i, j) ∈ E throughout the paper. This is also known as the logistic pairwise comparison model, due to its strong resemblance to logistic regression. It is self-evident that the sufficient statistics under this model are given by

 y := { y_{i,j} ∣ (i, j) ∈ E }, where y_{i,j} := (1/L) Σ_{l=1}^{L} y^{(l)}_{i,j}. (8)

To simplify the notation, we shall also take

 y∗_{i,j} := w∗_j / (w∗_i + w∗_j) = e^{θ∗_j} / (e^{θ∗_i} + e^{θ∗_j}).

Goal. The goal is to identify the set of top-K ranked items — that is, the set of K items that enjoy the largest preference scores — from the pairwise comparison data y.
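The sampling model above — an Erdős–Rényi comparison graph, L comparisons per edge as in (7), and the sufficient statistics (8) — can be simulated in a few lines. The sketch below is our own illustrative implementation; the function name and parameter values are not from the paper.

```python
import numpy as np

def simulate_comparisons(w, p, L, rng):
    """Draw G ~ G(n, p); for each edge (i, j), draw L Bernoulli comparisons
    with P{j beats i} = w[j] / (w[i] + w[j]) as in (7), and return the
    sufficient statistics y[(i, j)] = average outcome as in (8)."""
    n = len(w)
    edges, y = [], {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:                  # edge appears w.p. p
                edges.append((i, j))
                wins = rng.binomial(L, w[j] / (w[i] + w[j]))
                y[(i, j)] = wins / L              # y_{i,j}
                y[(j, i)] = 1.0 - y[(i, j)]       # convention: y_{j,i} = 1 - y_{i,j}
    return edges, y

rng = np.random.default_rng(0)
w_star = np.array([2.0, 1.5, 1.2, 1.0, 0.8, 0.6])
edges, y = simulate_comparisons(w_star, p=1.0, L=2000, rng=rng)
# With p = 1 every pair is compared, and y_{i,j} concentrates around y*_{i,j}.
print(abs(y[(0, 5)] - w_star[5] / (w_star[0] + w_star[5])))
```

With p = 1 and L = 2000, each empirical frequency deviates from its population counterpart by roughly 1/√L, illustrating the concentration that the subsequent analysis exploits.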

### 2.2 Algorithms

#### 2.2.1 The spectral method: Rank Centrality

The spectral ranking algorithm, or Rank Centrality (Negahban et al., 2017a), is motivated by the connection between the pairwise comparisons and a random walk over a directed graph. The algorithm starts by converting the pairwise comparison data y into a transition matrix P = [P_{i,j}]_{1≤i,j≤n} in such a way that

 P_{i,j} = { (1/d) y_{i,j}, if (i, j) ∈ E;  1 − (1/d) Σ_{k: (i,k)∈E} y_{i,k}, if i = j;  0, otherwise, } (9)

for some given normalization factor d > 0, and then proceeds by computing the stationary distribution π of the Markov chain induced by P. As we shall see later, the parameter d is taken to be on the same order as the maximum vertex degree of G while ensuring the non-negativity of P. As asserted by Negahban et al. (2017a), π is a faithful estimate of w∗ up to some global scaling. The algorithm is summarized in Algorithm 1.

To develop some intuition regarding why this spectral algorithm gives a reasonable estimate of w∗, it is perhaps more convenient to look at the population transition matrix P∗ = [P∗_{i,j}]_{1≤i,j≤n}:

 P∗_{i,j} = { (1/d) w∗_j / (w∗_i + w∗_j), if (i, j) ∈ E;  1 − (1/d) Σ_{k: (i,k)∈E} w∗_k / (w∗_i + w∗_k), if i = j;  0, otherwise, }

which coincides with P by taking y_{i,j} = y∗_{i,j}. It can be seen that the normalized score vector

 π∗ := (1 / Σ_{i=1}^n w∗_i) [w∗_1, w∗_2, …, w∗_n]^⊤ (10)

is the stationary distribution of the Markov chain induced by the transition matrix P∗, since P∗ and π∗ are in detailed balance, namely,

 π∗_i P∗_{i,j} = π∗_j P∗_{j,i}, ∀(i, j). (11)

As a result, one expects the stationary distribution π of the sample version P to form a good estimate of π∗, provided the sample size is sufficiently large.
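The construction (9)–(11) can be checked numerically. The sketch below is a minimal stand-in for Algorithm 1 (not its verbatim pseudocode): it assembles the transition matrix from comparison frequencies and extracts the stationary distribution as the leading left eigenvector. Feeding it the population frequencies y∗ recovers π∗ exactly, as detailed balance predicts; the scores and the choice d = n are illustrative.

```python
import numpy as np

def rank_centrality(n, edges, y, d):
    """Build the transition matrix P of (9) from frequencies y and return
    its stationary distribution (leading left eigenvector of P)."""
    P = np.zeros((n, n))
    for i, j in edges:
        P[i, j] = y[(i, j)] / d
        P[j, i] = y[(j, i)] / d
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))       # make every row sum to 1
    vals, vecs = np.linalg.eig(P.T)                # left eigenvectors of P
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return pi / pi.sum()

# Population sanity check: with y_{i,j} = w*_j / (w*_i + w*_j), the stationary
# distribution is exactly pi* = w* / sum(w*), by detailed balance (11).
w_star = np.array([2.0, 1.5, 1.2, 1.0, 0.8])
n = len(w_star)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
y_pop = {}
for i, j in edges:
    y_pop[(i, j)] = w_star[j] / (w_star[i] + w_star[j])
    y_pop[(j, i)] = 1.0 - y_pop[(i, j)]
pi = rank_centrality(n, edges, y_pop, d=n)
print(np.max(np.abs(pi - w_star / w_star.sum())))  # ~0 (numerical precision)
```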

#### 2.2.2 The regularized MLE

Under the BTL model, the negative log-likelihood function conditional on G is given by (up to some global scaling)

 L(θ; y) := −Σ_{(i,j)∈E, i>j} { y_{j,i} log( e^{θ_i} / (e^{θ_i} + e^{θ_j}) ) + (1 − y_{j,i}) log( e^{θ_j} / (e^{θ_i} + e^{θ_j}) ) }
          = Σ_{(i,j)∈E, i>j} { −y_{j,i} (θ_i − θ_j) + log( 1 + e^{θ_i − θ_j} ) }. (12)

The regularized MLE then amounts to solving the following convex program

 minimize_{θ ∈ R^n}  L_λ(θ; y) := L(θ; y) + (λ/2) ‖θ‖²_2, (13)

for a regularization parameter λ > 0. As will be discussed later, we shall adopt the choice λ ≍ √(n log n / (pL)) throughout this paper. For the sake of brevity, we let θ represent the resulting penalized maximum likelihood estimate whenever it is clear from the context. Similar to the spectral method, one reports the K items associated with the largest entries of θ.
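As a minimal sketch of (13), the convex program can be solved by plain gradient descent (our own choice of solver; the paper does not prescribe one). For each edge (i, j), the gradient of (12) with respect to θ_i is σ(θ_i − θ_j) − y_{j,i}, with σ the logistic function and y_{j,i} the empirical frequency that i beats j; the ridge term contributes λθ.

```python
import numpy as np

def regularized_mle(n, edges, y, lam, step=0.2, iters=3000):
    """Minimize L(theta; y) + (lam/2) * ||theta||_2^2 by gradient descent."""
    theta = np.zeros(n)
    for _ in range(iters):
        grad = lam * theta                        # gradient of the ridge term
        for i, j in edges:
            s = 1.0 / (1.0 + np.exp(theta[j] - theta[i]))  # sigmoid(theta_i - theta_j)
            grad[i] += s - y[(j, i)]              # y_{j,i}: frequency that i beats j
            grad[j] += (1.0 - s) - y[(i, j)]
        theta -= step * grad
    return theta

# Population sanity check: with exact frequencies, the estimate recovers the
# true ordering (theta is identifiable only up to a global shift; the small
# ridge term pins the shift down).
w_star = np.array([2.0, 1.5, 1.2, 1.0, 0.8])
n = len(w_star)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
y = {}
for i, j in edges:
    y[(i, j)] = w_star[j] / (w_star[i] + w_star[j])
    y[(j, i)] = 1.0 - y[(i, j)]
theta = regularized_mle(n, edges, y, lam=1e-3)
print(np.argsort(-theta))  # sorted by estimated score: the true order
```

The step size and iteration count are illustrative; since (13) is convex with a Lipschitz gradient, any off-the-shelf convex solver would do.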

### 2.3 Main results

The most challenging part of top-K ranking is to distinguish the K-th and the (K+1)-th items. In fact, the score difference of these two items captures the distance between the item sets {1, …, K} and {K+1, …, n}. Unless their latent scores are sufficiently separated, the finite-sample nature of the model would make it infeasible to distinguish these two critical items. With this consideration in mind, we define the following separation measure

 Δ_K := (w∗_K − w∗_{K+1}) / w_max. (14)

This metric turns out to play a crucial role in determining the minimal sample complexity for perfect top-K identification.

The main finding of this paper concerns the optimality of both the spectral method and the regularized MLE in the presence of a fixed dynamic range (i.e. κ ≍ 1). Recall that under the BTL model, the total number N of samples we collect concentrates sharply around its mean, namely,

 N = (1 + o(1)) E[N] = (1 + o(1)) n²pL / 2 (15)

occurs with high probability. Our main result is stated in terms of the sample complexity required for exact top-K identification.

###### Theorem 1.

Consider the pairwise comparison model specified in Section 2.1 with κ ≍ 1. Suppose that p ≥ c₀ log n / n and that

 n²pL / 2 ≥ c₁ n log n / Δ²_K (16)

for some sufficiently large positive constants c₀ and c₁. Further assume L ≤ c₂ n^{c₃} for any absolute constants c₂, c₃ > 0. With probability exceeding 1 − O(n⁻⁵), the set of top-K ranked items can be recovered exactly by the spectral method given in Algorithm 1, and by the regularized MLE given in (13). Here, we take d = c_d np in the spectral method and λ = c_λ √(n log n / (pL)) in the regularized MLE, where c_d and c_λ are some absolute constants.

###### Remark 1.

We emphasize that p ≳ log n / n is a fundamental requirement for the ranking task. In fact, if p < (1 − ϵ) log n / n for any constant ϵ > 0, then the comparison graph G is disconnected with high probability. This means that there exists at least one isolated item — one that has not been compared with any other item — which cannot possibly be ranked.

###### Remark 2.

In fact, the assumption that L ≤ c₂ n^{c₃} for any absolute constants c₂, c₃ > 0 is not needed for the spectral method.

###### Remark 3.

Here, we assume the same number L of comparisons for each (i, j) ∈ E to simplify the presentation as well as the proof. The result still holds true if we have distinct numbers of comparisons L_{i,j} for each (i, j) ∈ E, as long as the L_{i,j}'s are all on the same order.

Theorem 1 asserts that both the spectral method and the regularized MLE achieve a sample complexity on the order of n log n / Δ²_K. Encouragingly, this sample complexity coincides with the minimax limit identified in (Chen and Suh, 2015, Theorem 2) in the fixed dynamic range, i.e. κ ≍ 1.

###### Theorem 2 (Chen and Suh (2015)).

Fix ϵ ∈ (0, 1/2), and suppose that

 n²pL ≤ ( 2c₂ (1 − ϵ) n log n − 2 ) / Δ²_K, (17)

where c₂ > 0 is some universal constant. Then for any ranking procedure ψ, one can find a score vector w∗ with separation Δ_K such that ψ fails to retrieve the top-K items with probability at least ϵ.

We are now positioned to compare our results with Jang et al. (2016), which also investigates the accuracy of the spectral method for top- ranking. Specifically, Theorem 3 in Jang et al. (2016) establishes the optimality of the spectral method for the relatively dense regime where

 p ≳ √( log n / n ).

In this regime, however, the total sample size necessarily exceeds

 n²pL / 2 ≥ n²p / 2 ≳ √( n³ log n ), (18)

which rules out the possibility of achieving minimal sample complexity when Δ_K is sufficiently large. For instance, consider the case where Δ_K ≍ 1; then the optimal sample size — as revealed by Theorem 1 or (Chen and Suh, 2015, Theorem 1) — is on the order of

 n log n / Δ²_K ≍ n log n,

which is a factor of √( n / log n ) lower than the bound in (18). By contrast, our results hold all the way down to the sparsest possible regime where p ≍ log n / n, confirming the optimality of the spectral method even for the most challenging scenario. Furthermore, we establish that the regularized MLE shares the same optimality guarantee as the spectral method, which was previously out of reach.

### 2.4 Optimal control of entrywise estimation errors

In order to establish the ranking accuracy as asserted by Theorem 1, the key is to obtain precise control of the ℓ∞ loss of the score estimates. Our results are as follows.

###### Theorem 3 (Entrywise error of the spectral method).

Consider the pairwise comparison model in Section 2.1 with κ ≍ 1. Suppose np ≥ c₀ log n for some sufficiently large constant c₀. Choose d = c_d np for some sufficiently large constant c_d in Algorithm 1. Then the spectral estimate π satisfies

 ‖π − π∗‖_∞ / ‖π∗‖_∞ ≲ √( log n / (npL) ) (19)

with probability 1 − O(n⁻⁵), where π∗ is the normalized score vector (cf. (10)).

###### Theorem 4 (Entrywise error of the regularized MLE).

Consider the pairwise comparison model specified in Section 2.1 with κ ≍ 1. Suppose that np ≥ c₀ log n for some sufficiently large constant c₀ and that L ≤ c₂ n^{c₃} for any absolute constants c₂, c₃ > 0. Set the regularization parameter to be λ = c_λ √(n log n / (pL)) for some absolute constant c_λ. Then the regularized MLE θ satisfies

 ‖θ − ¯θ∗‖_∞ ≲ √( log n / (npL) )

with probability exceeding 1 − O(n⁻⁵), where ¯θ∗ := θ∗ − ( (1/n) Σ_{i=1}^n θ∗_i ) 1 and 1 := [1, …, 1]^⊤.

Theorems 3–4 indicate that if the number of comparisons associated with each item — which concentrates around npL — exceeds the order of log n, then both methods are able to achieve a small ℓ∞ error when estimating the scores.

Recall that the ℓ₂ estimation error of the spectral method has been characterized by Negahban et al. (2017a) (or Theorem 9 of this paper, which improves it by removing the logarithmic factor), which obeys

 ‖π − π∗‖_2 / ‖π∗‖_2 ≲ √( log n / (npL) ) (20)

with high probability. Similar ℓ₂ guarantees have been derived for another variant of the MLE (the constrained version) under a uniform sampling model as well (Negahban et al., 2017a). In comparison, our results indicate that the estimation errors for both algorithms are almost evenly spread out across all coordinates rather than being localized or clustered. Notably, the pointwise errors revealed by Theorems 3–4 immediately lead to exact top-K identification, as claimed by Theorem 1.

###### Proof of Theorem 1.

In what follows, we prove the theorem for the spectral method part. The regularized MLE part follows from an almost identical argument and hence is omitted.

Since the spectral algorithm ranks the items in accordance with the score estimate π, it suffices to demonstrate that

 π_i − π_j > 0, ∀ 1 ≤ i ≤ K, K+1 ≤ j ≤ n.

To this end, we first apply the triangle inequality to get

 (π_i − π_j) / ‖π∗‖_∞ ≥ (π∗_i − π∗_j) / ‖π∗‖_∞ − |π_i − π∗_i| / ‖π∗‖_∞ − |π_j − π∗_j| / ‖π∗‖_∞ ≥ Δ_K − 2 ‖π − π∗‖_∞ / ‖π∗‖_∞. (21)

In addition, it follows from Theorem 3 as well as our sample complexity assumption that

 ‖π − π∗‖_∞ / ‖π∗‖_∞ ≲ √( log n / (npL) )  and  n²pL ≳ n log n / Δ²_K.

These conditions taken collectively imply that 2 ‖π − π∗‖_∞ / ‖π∗‖_∞ < Δ_K as long as c₁ exceeds some sufficiently large constant. Substitution into (21) reveals that π_i − π_j > 0, as claimed. ∎
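The Δ_K/2 argument underlying (21) can be stress-tested numerically: any estimate whose entrywise deviation, measured relative to ‖π∗‖∞, stays strictly below Δ_K/2 must identify the top-K set exactly. The scores, the choice K = 2, and the perturbation scheme below are illustrative.

```python
import numpy as np

def topk_exact(pi_hat, k, true_topk):
    """True iff the k largest entries of pi_hat are exactly true_topk."""
    return set(np.argsort(-pi_hat)[:k]) == set(true_topk)

rng = np.random.default_rng(1)
w_star = np.array([2.0, 1.9, 1.2, 1.1, 1.0])
pi_star = w_star / w_star.sum()
K = 2
delta_K = (w_star[K - 1] - w_star[K]) / w_star.max()   # eq. (14)

for _ in range(100):
    # entrywise perturbation strictly below (Delta_K / 2) * ||pi*||_inf
    noise = rng.uniform(-1, 1, size=5) * 0.49 * delta_K * pi_star.max()
    assert topk_exact(pi_star + noise, K, [0, 1])
print("exact top-K recovery under the Delta_K / 2 entrywise error bound")
```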

### 2.5 Heuristic arguments

We pause to develop some heuristic explanation as to why the estimation errors are expected to be spread out across all entries. For simplicity, we focus on the case where p = 1 and L is sufficiently large, so that y_{i,j} and P sharply concentrate around y∗_{i,j} and P∗, respectively.

We begin with the spectral algorithm. Since and are respectively the invariant distributions of the Markov chains induced by and , we can decompose

 (π − π∗)^⊤ = π^⊤P − π∗^⊤P∗ = (π − π∗)^⊤P + ξ^⊤, where ξ^⊤ := π∗^⊤(P − P∗). (22)

When p = 1 and κ ≍ 1, the entries of π∗ (resp. the off-diagonal entries of P and P∗) are all of the same order and, as a result, the energy of the uncertainty term ξ is spread out (using standard concentration inequalities). In fact, we will demonstrate in Section 5.2 that

 ‖ξ‖_∞ / ‖π∗‖_∞ ≲ √( log n / (npL) ) ≍ ‖π − π∗‖_2 √(log n) / ‖π∗‖_2, (23)

which coincides with the optimal rate. Further, if we look at each entry of (22), then for all 1 ≤ m ≤ n,

 π_m − π∗_m = [ (π − π∗)^⊤ P ]_m + ξ_m. (24)

By construction of the transition matrix, one can easily verify that P_{m,m} is bounded away from 0 and 1 for all 1 ≤ m ≤ n. As a consequence, the identity [ (π − π∗)^⊤ P ]_m = P_{m,m} (π_m − π∗_m) + Σ_{i: i ≠ m} P_{i,m} (π_i − π∗_i) allows one to treat each π_m − π∗_m as a mixture of three effects: (i) the first term of (24) behaves as an entrywise contraction of the error; (ii) the second term of (24) is a (nearly uniformly weighted) average of the errors over all coordinates, which can essentially be treated as a smoothing operator applied to the error components; and (iii) the uncertainty term ξ_m. Rearranging terms in (24), we are left with

 (1 − P_{m,m}) |π_m − π∗_m| ≲ (1/n) Σ_{i=1}^n |π_i − π∗_i| + |ξ_m|, ∀ 1 ≤ m ≤ n, (25)

which further gives,

 ‖π − π∗‖_∞ ≲ (1/n) Σ_{i=1}^n |π_i − π∗_i| + ‖ξ‖_∞. (26)

There are two possibilities compatible with this bound (26): (1) ‖π − π∗‖_∞ ≲ (1/n) Σ_{i=1}^n |π_i − π∗_i|, in which case the maximal error is dominated by the average error; and (2) ‖π − π∗‖_∞ ≲ ‖ξ‖_∞ by (23). In either case, the errors are fairly delocalized, revealing that ‖π − π∗‖_∞ / ‖π∗‖_∞ ≲ √( log n / (npL) ).

We now move on to the regularized MLE, following a very similar argument. By the optimality condition that ∇L_λ(θ; y) = 0, one can derive (for some step size η > 0 to be specified later)

 θ − θ∗ = θ − η∇L_λ(θ) − θ∗
        = θ − η∇L_λ(θ) − ( θ∗ − η∇L_λ(θ∗) ) − η∇L_λ(θ∗)
        ≈ ( I − η∇²L_λ(θ∗) )(θ − θ∗) − ζ, where ζ := η∇L_λ(θ∗).

Write ∇²L_λ(θ∗) = D − A, where D and −A denote respectively the diagonal and off-diagonal parts of ∇²L_λ(θ∗). Under our assumptions, one can check that D_{m,m} ≍ np for all 1 ≤ m ≤ n and 0 ≤ A_{j,m} ≲ 1 for any j ≠ m. With these notations in place, one can write the entrywise error as follows

 θ_m − θ∗_m ≈ (1 − ηD_{m,m})(θ_m − θ∗_m) + Σ_{j: j ≠ m} ηA_{j,m} (θ_j − θ∗_j) − ζ_m.

By choosing η = c/(np) for some sufficiently small constant c > 0, we get 0 < 1 − ηD_{m,m} < 1 and ηA_{j,m} ≲ 1/(np). Therefore, the right-hand side of the above relation also comprises a contraction term as well as an error smoothing term, similar to (24). Carrying out the same argument as for the spectral method, we see that the estimation errors of the regularized MLE are expected to be spread out.

### 2.6 Numerical experiments

It is worth noting that extensive numerical experiments on both synthetic and real data have already been conducted in Negahban et al. (2017a) to confirm the practicability of both the spectral method and the regularized MLE. See also Chen and Suh (2015) for the experiments on the Spectral-MLE algorithm. This section provides some additional simulations to complement their experimental results as well as our theory. Throughout the experiments, we set the number of items n to be 200, while the number of repeated comparisons L and the edge probability p can vary with the experiments. Regarding the tuning parameters, we choose d = c_d d_max in the spectral method, where d_max is the maximum degree of the graph, and λ ≍ √(n log n / (pL)) in the regularized MLE, which are consistent with the configurations considered in the main theorems. Additionally, we also display the experimental results for the unregularized MLE, i.e. λ = 0. All of the results are averaged over 100 Monte Carlo simulations.

We first investigate the ℓ∞ error of the spectral method and the (regularized) MLE when estimating the preference scores. To this end, we generate the latent scores w∗_i (1 ≤ i ≤ n) independently and uniformly at random over a fixed interval. Figure 1(a) (resp. Figure 1(b)) displays the entrywise error in the spectral score estimation as the number of repeated comparisons L (resp. the edge probability p) varies. As is seen from the plots, the error of all methods gets smaller as L and p increase, confirming our results in Theorems 3-4. Next, we show in Figure 1(c) the relative ℓ∞ error while fixing the total number of samples (i.e. n²pL). It can be seen that the performance almost does not change if the sample complexity remains the same. It is also interesting to see that the errors of the spectral method and the MLE are very similar. In addition, Figure 2 illustrates the relative ℓ∞ error and the relative ℓ₂ error in score estimation for all three methods. As we can see, the relative ℓ∞ errors are not much larger than the relative ℓ₂ errors (recall that ‖π∗‖∞ ≍ ‖π∗‖₂ / √n when κ ≍ 1), thus offering empirical evidence that the errors in the score estimates are spread out across all entries.

Further, we examine the top-K ranking accuracy of all three methods. Here, we fix p and L, and let w∗_i take one common value for all top-K items and a smaller common value for all remaining items, so that the score separation Δ_K can be controlled directly. Figure 3 illustrates the accuracy in identifying the top-K ranked items. The performance of all three methods improves when the score separation becomes larger, which matches our theory in Theorem 1.
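The spirit of Figure 2 can be reproduced in a few lines: generate BTL comparisons on G(n, p), run the spectral method, and compare the relative ℓ∞ and ℓ₂ errors. The parameters below are small illustrative values, not the configuration used in the paper's figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, L = 60, 0.5, 40
w_star = np.exp(rng.uniform(0.0, 1.0, size=n))     # theta* ~ Unif[0, 1], kappa = e
pi_star = w_star / w_star.sum()

# Sample the comparison graph and the empirical transition matrix (9).
P = np.zeros((n, n))
d = 2 * n * p                                       # d on the order of the max degree
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < p:
            y_ij = rng.binomial(L, w_star[j] / (w_star[i] + w_star[j])) / L
            P[i, j], P[j, i] = y_ij / d, (1.0 - y_ij) / d
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

vals, vecs = np.linalg.eig(P.T)                     # stationary distribution of P
pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
pi /= pi.sum()

rel_inf = np.max(np.abs(pi - pi_star)) / np.max(pi_star)
rel_2 = np.linalg.norm(pi - pi_star) / np.linalg.norm(pi_star)
print(rel_inf, rel_2)   # comparable magnitudes: the error is spread out
```

If the error were localized on a few coordinates, the relative ℓ∞ error would be roughly √n times the relative ℓ₂ error; the two being comparable is the delocalization phenomenon the figures illustrate.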

### 2.7 Other related works

The problem of ranking based on partial preferences has received much attention during the past decade. Two types of observation models have been considered: the cardinal-based model, where users provide explicit numerical ratings of the items, and the ordinal-based model, where users are asked to make comparative measurements. See Ammar and Shah (2011) for detailed comparisons between them.

In terms of the ordinal-based model — and in particular, ranking from pairwise comparisons — both parametric and nonparametric models have been extensively studied. For example, Hunter (2004) examined variants of the parametric BTL model, and established the convergence properties of the minorization-maximization algorithm for computing the MLE. Moreover, the BTL model falls under the category of low-rank parametric models, since the preference matrix is generated by passing a rank-2 matrix through the logistic link function (Rajkumar and Agarwal, 2016). Additionally, the work Jiang et al. (2011) proposed a least-squares type method to estimate the full ranking, which generalizes the simple Borda count algorithm (Ammar and Shah, 2011). For many of these algorithms, the sample complexities needed for perfect total ranking were determined by Rajkumar and Agarwal (2014), although the top-K ranking accuracy was not considered there.

Going beyond the parametric models, a recent line of works Shah et al. (2017); Shah and Wainwright (2015); Chen et al. (2017); Pananjady et al. (2017) considered the nonparametric stochastically transitive model, where the only model assumption is that the comparison probability matrix follows certain transitivity rules. This type of models subsumes the BTL model as a special case. For instance, Shah and Wainwright (2015) suggested a simple counting-based algorithm which can reliably recover the top-K ranked items for various models. However, the sampling paradigm considered therein is quite different from ours in the sparse regime; for instance, their model does not come close to the setting where p is small but L is large, which is the most challenging regime of the model adopted in our paper and Negahban et al. (2017a); Chen and Suh (2015).

All of the aforementioned papers concentrate on the case where there is a single ground-truth ordering. It would also be interesting to investigate the scenarios where different users might have different preference scores. To this end, Negahban et al. (2017b); Lu and Negahban (2014) imposed the low-rank structure on the underlying preference matrix and adopted the nuclear-norm relaxation approach to recover the users’ preferences. Additionally, several papers explored the ranking problem for the more general Plackett-Luce model (Hajek et al., 2014; Soufiani et al., 2013), in the presence of adaptive sampling (Jamieson and Nowak, 2011; Busa-Fekete et al., 2013; Heckel et al., 2016; Agarwal et al., 2017), for the crowdsourcing scenario (Chen et al., 2013), and in the adversarial setting (Suh et al., 2017). These are beyond the scope of the present paper.

Speaking of the error metric, the ℓ∞ norm is appropriate for the top-K ranking problem and other learning problems as well. In particular, ℓ∞ perturbation bounds for eigenvectors of symmetric matrices (Koltchinskii and Lounici, 2016; Fan et al., 2016; Eldridge et al., 2017; Abbe et al., 2017) and singular vectors of general matrices (Koltchinskii and Xia, 2016) have been studied. In stark contrast, we study the ℓ∞ norm errors of the leading eigenvector of a class of asymmetric matrices (probability transition matrices) and of the regularized MLE. Furthermore, most existing results require the expectations of data matrices to have low rank, at least approximately. We do not impose such assumptions.

When it comes to the technical tools, it is worth noting that the leave-one-out idea has been invoked to analyze random designs for other high-dimensional problems, e.g. robust M-estimators (El Karoui, 2017), confidence intervals for Lasso (Javanmard and Montanari, 2015), likelihood ratio tests (Sur et al., 2017), and nonconvex statistical learning (Ma et al., 2017; Chen et al., 2018). In particular, Zhong and Boumal (2017) and Abbe et al. (2017) use it to precisely characterize entrywise behavior of eigenvectors of a large class of symmetric random matrices, which improves upon prior eigenvector analysis. Consequently, they are able to show the sharpness of spectral methods in many popular models. Our introduction of leave-one-out auxiliary quantities is similar in spirit to these papers.

Finally, the family of spectral methods has been successfully applied in numerous applications, e.g. matrix completion (Keshavan et al., 2010), phase retrieval (Chen and Candès, 2017), graph clustering (Rohe et al., 2011; Abbe et al., 2017), joint alignment (Chen and Candes, 2016). All of them are designed based on the eigenvectors of some symmetric matrix, or the singular vectors if the matrix of interest is asymmetric. Our paper contributes to this growing literature by establishing a sharp eigenvector perturbation analysis framework for an important class of asymmetric matrices — the probability transition matrices.

## 3 Extension: general dynamic range

All of the preceding results concern the regime with a fixed dynamic range (i.e. κ ≍ 1). This section moves on to discussing the case with large κ.

To start with, by going through the same proof technique, we can readily obtain — in the general κ setting — the following performance guarantees for both the spectral estimate π and the regularized MLE θ.

###### Theorem 5.

Consider the pairwise comparison model in Section 2.1. Suppose that np ≥ c₀ κ² log n for some sufficiently large constant c₀, and choose d = c_d np for some sufficiently large constant c_d in Algorithm 1. Then with probability exceeding 1 − O(n⁻⁵),

1. the spectral estimate π satisfies

 ‖π − π∗‖_∞ / ‖π∗‖_∞ ≲ κ √( log n / (npL) ),

where π∗ is the normalized score vector as defined in (10).

2. the set of top- ranked items can be recovered exactly by the spectral method given in Algorithm 1, as long as

 n²pL / 2 ≥ c₁ κ² n log n / Δ²_K

for some sufficiently large constant .

###### Theorem 6.

Consider the pairwise comparison model in Section 2.1. Suppose that np ≥ c₀ κ² log n for some sufficiently large constant c₀ and that L ≤ c₂ n^{c₃} for any absolute constants c₂, c₃ > 0. Set the regularization parameter to be λ = c_λ √(n log n / (pL)) for some absolute constant c_λ. Then with probability exceeding 1 − O(n⁻⁵),

1. the regularized MLE θ satisfies

 ‖θ − ¯θ∗‖_∞ ≲ κ² √( log n / (npL) ),

where ¯θ∗ := θ∗ − ( (1/n) Σ_{i=1}^n θ∗_i ) 1 and 1 := [1, …, 1]^⊤.

2. the set of top-$K$ ranked items can be recovered exactly by the regularized MLE given in (13), as long as

$$\frac{n^2 p L}{2} \;\ge\; c_1 \frac{\kappa^4\, n \log n}{\Delta_K^2}$$

for some sufficiently large constant $c_1 > 0$.

###### Remark 4.

The guarantees on exact top-$K$ recovery for both the spectral method and the regularized MLE are immediate consequences of their $\ell_\infty$ error bounds, as we have argued in Section 2.4. Hence we will focus on proving the $\ell_\infty$ error bounds in Sections 5 and 6.

Notably, the achievability bounds for top-$K$ ranking in Theorems 5 and 6 do not match the lower bound asserted in Theorem 2 in terms of $\kappa$. This is partly because the separation measure $\Delta_K$ fails to capture the information bottleneck in the general $\kappa$ setting. In light of this, we introduce the following new measure, which seems to be a more suitable metric to reflect the hardness of the top-$K$ ranking problem:

$$\Delta_K^* := \frac{w_K^* - w_{K+1}^*}{w_{K+1}^*} \cdot \sqrt{\frac{1}{n} \sum_{i=1}^n \frac{w_{K+1}^*\, w_i^*}{\left(w_K^* + w_i^*\right)^2}}, \tag{27}$$

which will be termed the generalized separation measure. Informally, $\Delta_K^*$ serves as a reasonably tight upper bound on a certain normalized KL divergence metric. With this metric in place, we derive another lower bound as follows.
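For concreteness, the generalized separation in (27) is straightforward to evaluate numerically. The following Python sketch (the function name and example scores are illustrative, not from the paper) computes $\Delta_K^*$ for a given score vector:

```python
import numpy as np

def generalized_separation(w, K):
    """Evaluate the generalized separation measure of Eq. (27) for a
    preference score vector w and a target rank K (1-indexed)."""
    w = np.sort(np.asarray(w, dtype=float))[::-1]   # w[0] >= w[1] >= ...
    wK, wK1 = w[K - 1], w[K]                        # w*_K and w*_{K+1}
    mean_term = np.mean(wK1 * w / (wK + w) ** 2)    # (1/n) sum over all items
    return (wK - wK1) / wK1 * np.sqrt(mean_term)

# Example: a clear gap between the top-2 scores and the rest.
w = np.array([4.0, 3.0, 1.0, 0.9, 0.8])
print(generalized_separation(w, K=2))
```

Note that the average inside the square root runs over all $n$ items, so items with scores far from $w_K^*$ contribute little, which is what lets $\Delta_K^*$ track the information bottleneck more faithfully than the plain gap.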

###### Theorem 7.

Fix $\epsilon \in (0, 1/2)$, and let the comparison graph be drawn from the Erdős–Rényi model $\mathcal{G}(n, p)$. Consider any preference score vector $w^*$, and let $\Delta_K^*$ denote its generalized separation. If

$$n^2 p L \;\le\; \epsilon^2 \cdot \frac{2n}{(\Delta_K^*)^2},$$

then there exists another preference score vector with the same generalized separation but a different set of top-$K$ items such that $P_e \ge \epsilon$ holds for any ranking scheme. Here, $P_e$ represents the probability of error in distinguishing these two score vectors given the observed comparisons.

###### Proof.

See Appendix A.∎

The preceding sample complexity lower bound scales inversely proportionally to $(\Delta_K^*)^2$. To see why this generalized measure may be more suitable than the original separation metric, we single out three examples in Appendix B. Unfortunately, our current analyses do not yield a matching upper bound with respect to $\Delta_K^*$ unless $\kappa$ is a constant. For instance, the analysis of the spectral method relies on the eigenvector perturbation bound (Theorem 8), where the spectral gap and the size of the matrix perturbation play a crucial role. However, the current results for controlling these quantities have explicit dependency on $\kappa$ (Negahban et al., 2017a). It is not clear whether we could incorporate the new measure to eliminate such dependency on $\kappa$. This calls for more refined analysis techniques, which we leave for future investigation.

Moreover, it is not obvious whether the spectral method alone or the regularized MLE alone can achieve the minimal sample complexity in the general $\kappa$ regime. It is possible that one needs to first screen out the items with extremely high or low scores using methods like Borda count (Ammar and Shah, 2012), as advocated in prior work (Negahban et al., 2017a; Chen and Suh, 2015; Jang et al., 2016). All in all, finding tight upper bounds for general $\kappa$ remains an open question.

## 4 Discussion

This paper justifies the optimality of both the spectral method and the regularized MLE for top-$K$ rank aggregation in the fixed dynamic range regime. Our theoretical studies are by no means exhaustive, and there are numerous directions that would be of interest for future investigation. We point out a few possibilities below.

General condition number $\kappa$. As mentioned before, our current theory is optimal in the presence of a fixed dynamic range with $\kappa \asymp 1$. We have also made a first attempt at the large-$\kappa$ regime. It would be desirable to characterize the statistical and computational limits for more general $\kappa$.

Goodness-of-fit. Throughout this paper, we have assumed the BTL model captures the randomness underlying the data we collect. A practical question is whether the real data actually follows the BTL model. It would be interesting to investigate how to test the goodness-of-fit of this model.

Unregularized MLE. We have studied the optimality of the regularized MLE with a prescribed choice of the regularization parameter. Our analysis relies on the regularization term to establish convergence of the gradient descent algorithm (see Lemma 11). It is natural to ask whether such a regularization term is necessary at all; this question remains open.

More general comparison graphs. So far we have focused on a tractable but somewhat restrictive comparison graph, namely, the Erdős–Rényi random graph. It would certainly be important to understand the performance of both methods under a broader family of comparison graphs, and to see which algorithms would enable optimal sample complexities under general sampling patterns.

Entrywise perturbation analysis for convex optimization. This paper provides the perturbation analysis for the regularized MLE using the leave-one-out trick as well as an inductive argument along the algorithmic updates. We expect this analysis framework to carry over to a much broader family of convex optimization problems, which may in turn offer a powerful tool for showing the stability of optimization procedures in an entrywise fashion.

## 5 Analysis for the spectral method

This section is devoted to proving Theorem 5 and hence Theorem 3, which characterizes the pointwise error of the spectral estimate.

### 5.1 Preliminaries

Here, we gather some preliminary facts about reversible Markov chains as well as the Erdős–Rényi random graph.

The first important result concerns the eigenvector perturbation for probability transition matrices, which can be treated as the analogue of the celebrated Davis-Kahan theorem (Davis and Kahan, 1970). Due to its potential importance for other problems, we promote it to a theorem as follows.

###### Theorem 8 (Eigenvector perturbation).

Suppose that $P$, $\hat{P}$, and $P^*$ are probability transition matrices with stationary distributions $\pi$, $\hat{\pi}$, $\pi^*$, respectively. Also, assume that $P^*$ represents a reversible Markov chain. When $\|P - \hat{P}\|_{\pi^*} < 1 - \max\{\lambda_2(P^*), -\lambda_n(P^*)\}$, it holds that

$$\left\|\pi - \hat{\pi}\right\|_{\pi^*} \;\le\; \frac{\left\|\pi^\top \left(P - \hat{P}\right)\right\|_{\pi^*}}{1 - \max\{\lambda_2(P^*),\, -\lambda_n(P^*)\} - \left\|P - \hat{P}\right\|_{\pi^*}}.$$
###### Proof.

See Appendix C.1. ∎

Several remarks regarding Theorem 8 are in order. First, in contrast to standard perturbation results like Davis-Kahan's theorem, our theorem involves three matrices in total, where $P$ and $\hat{P}$ can both be arbitrary probability transition matrices. For example, one may choose $P^*$ to be the population transition matrix, and $P$ and $\hat{P}$ to be two finite-sample versions associated with $P^*$. Second, we only impose reversibility on $P^*$, whereas $P$ and $\hat{P}$ need not induce reversible Markov chains. Third, Theorem 8 allows one to derive the $\ell_2$ estimation error of the spectral estimate in Negahban et al. (2017a) directly, without resorting to the power method; in fact, our estimation error bound improves upon that of Negahban et al. (2017a) by some logarithmic factor.
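As a quick numerical sanity check of Theorem 8, one can build a reversible chain $P^*$, take $P = P^*$ (so that $\pi = \pi^*$), perturb it into a non-reversible $\hat{P}$, and evaluate both sides of the bound. The sketch below is our own construction; in particular, we interpret $\|y\|_{\pi^*}$ as the weighted norm $(\sum_i y_i^2/\pi^*_i)^{1/2}$ for row vectors, together with the matrix norm it induces, which may differ from the paper's exact conventions in Appendix C.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Reversible P*: random walk on a weighted graph; pi*_i is proportional
# to the weighted degree, and detailed balance holds exactly.
W = rng.uniform(0.5, 1.5, size=(n, n))
W = (W + W.T) / 2
P_star = W / W.sum(axis=1, keepdims=True)
pi_star = W.sum(axis=1) / W.sum()

# A perturbed (generally non-reversible) transition matrix P_hat.
E = rng.uniform(0.0, 0.02, size=(n, n))
P_hat = (P_star + E) / (P_star + E).sum(axis=1, keepdims=True)

def stationary(P):
    """Stationary distribution, via the leading left eigenvector of P."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

pi_hat = stationary(P_hat)

# pi*-weighted vector norm and the matrix norm it induces on row vectors.
d = np.sqrt(pi_star)
vec_norm = lambda y: np.linalg.norm(y / d)
mat_norm = lambda A: np.linalg.norm(np.diag(d) @ A @ np.diag(1 / d), 2)

# Spectral quantities of P* via its symmetrization S = D P* D^{-1}.
S = np.diag(d) @ P_star @ np.diag(1 / d)
lam = np.linalg.eigvalsh(S)           # real, ascending; lam[-1] == 1
contraction = max(lam[-2], -lam[0])   # max{lambda_2(P*), -lambda_n(P*)}

# Evaluate both sides of Theorem 8 with P = P* (hence pi = pi*).
gap = 1 - contraction - mat_norm(P_star - P_hat)
lhs = vec_norm(pi_star - pi_hat)
rhs = vec_norm(pi_star @ (P_star - P_hat)) / gap
assert gap > 0 and lhs <= rhs
```

The choice $P = P^*$ removes any ambiguity between $\|P - \hat{P}\|_{\pi^*}$ and $\|P^* - \hat{P}\|_{\pi^*}$ in the denominator, so the asserted inequality is exactly the one stated in the theorem for this instance.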

###### Theorem 9.

Consider the pairwise comparison model specified in Section 2.1 with $\kappa \asymp 1$. Suppose that $np \ge c_0 \log n$ for some sufficiently large constant $c_0 > 0$, and that $d \asymp np$ in Algorithm 1. Then with high probability, one has

$$\frac{\|\pi - \pi^*\|_2}{\|\pi^*\|_2} \;\lesssim\; \frac{1}{\sqrt{npL}}.$$
###### Proof.

See Appendix C.2. ∎

Notably, Theorem 9 matches the minimax lower bound derived in (Negahban et al., 2017a, Theorem 3). As far as we know, this is the first result that demonstrates the orderwise optimality of the spectral method when measured by the $\ell_2$ loss.
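To illustrate the $\ell_2$ behavior empirically, the simulation below generates BTL comparisons on an Erdős–Rényi graph, forms a Rank Centrality style transition matrix (our reading of Algorithm 1, with the illustrative choice $d = 2np$), and compares the relative $\ell_2$ error of the spectral estimate against the $1/\sqrt{npL}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, L = 300, 0.2, 40
w = rng.uniform(0.5, 1.0, size=n)        # true BTL scores (kappa = 2)
pi_star = w / w.sum()                    # normalized score vector

# Erdős–Rényi comparison graph and empirical win frequencies.
G = np.triu(rng.random((n, n)) < p, 1)
y = np.zeros((n, n))                     # y[i, j]: fraction of wins of j over i
for i, j in zip(*np.nonzero(G)):
    frac = rng.binomial(L, w[j] / (w[i] + w[j])) / L
    y[i, j], y[j, i] = frac, 1 - frac

# Spectral (Rank Centrality style) transition matrix with d = 2np.
d = 2 * n * p
P = np.where(G | G.T, y / d, 0.0)
P[np.diag_indices(n)] = 1 - P.sum(axis=1)

vals, vecs = np.linalg.eig(P.T)          # stationary distribution of P
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

rel_err = np.linalg.norm(pi - pi_star) / np.linalg.norm(pi_star)
print(rel_err, 1 / np.sqrt(n * p * L))   # relative error vs. 1/sqrt(npL)
assert rel_err < 10 / np.sqrt(n * p * L)
```

The construction obeys detailed balance in expectation, $w_i \cdot w_j/(w_i + w_j) = w_j \cdot w_i/(w_i + w_j)$, so the idealized stationary distribution is proportional to $w$ on any connected comparison graph; the finite-$L$ error observed here is on the order of $1/\sqrt{npL}$, consistent with Theorem 9.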

The next result is concerned with the concentration of the vertex degrees in an Erdős–Rényi random graph.

###### Lemma 1 (Degree concentration).

Suppose that the comparison graph $\mathcal{G}$ is drawn from the Erdős–Rényi model $\mathcal{G}(n, p)$. Let $d_i$ be the degree of node $i$, and define $d_{\min} := \min_i d_i$ and $d_{\max} := \max_i d_i$. If $p \ge c_0 \log n / n$ for some sufficiently large constant $c_0 > 0$, then the following event

$$\mathcal{A}_0 = \left\{\frac{np}{2} \le d_{\min} \le d_{\max} \le \frac{3np}{2}\right\} \tag{28}$$

obeys

$$\mathbb{P}(\mathcal{A}_0) \ge 1 - O(n^{-10}).$$
###### Proof.

The proof follows from the standard Chernoff bound and is hence omitted. ∎

Since $d$ is chosen to be $c_d np$ for some constant $c_d > 0$ in Algorithm 1, we have, by Lemma 1, that the maximum vertex degree obeys $d_{\max} \lesssim d$ with high probability.
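The degree concentration in Lemma 1 is easy to confirm by simulation; the sketch below (with illustrative parameters $n = 500$ and $c_0 = 40$, chosen large enough that the event holds comfortably) samples an Erdős–Rényi graph and checks the event $\mathcal{A}_0$ of (28):

```python
import numpy as np

# Simulate G(n, p) with p = c0 * log(n) / n and check that all vertex
# degrees lie within [np/2, 3np/2], as asserted by Lemma 1.
rng = np.random.default_rng(1)
n, c0 = 500, 40                      # illustrative values
p = c0 * np.log(n) / n               # here p < 1

A = np.triu(rng.random((n, n)) < p, 1).astype(int)
A = A + A.T                          # symmetric adjacency, no self-loops
deg = A.sum(axis=1)

print(deg.min(), deg.max(), n * p)
assert n * p / 2 <= deg.min() and deg.max() <= 3 * n * p / 2
```

Each degree is Binomial$(n-1, p)$ with mean about $np \approx 249$, so the interval $[np/2, 3np/2]$ sits many standard deviations away from the mean, matching the $1 - O(n^{-10})$ guarantee.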

### 5.2 Proof outline of Theorem 5

In this subsection, we outline the proof of Theorem 5.

Recall that $\pi$ and $\pi^*$ are the stationary distributions associated with $P$ and $P^*$, respectively. This gives

$$\pi^\top P = \pi^\top \qquad \text{and} \qquad \pi^{*\top} P^* = \pi^{*\top}.$$

For each $1 \le m \le n$, one can decompose

$$\pi_m - \pi^*_m = \pi^\top P_{\cdot m} - \pi^{*\top} P^*_{\cdot m} = \pi^{*\top}\left(P_{\cdot m} - P^*_{\cdot m}\right) + (\pi - \pi^*)^\top P_{\cdot m} = \underbrace{\sum_{j} \pi^*_j \left(P_{j,m} - P^*_{j,m}\right)}_{:= I_1^m} + \underbrace{(\pi_m - \pi^*_m)\, P_{m,m}}_{:= I_2^m} + \sum_{j: j \neq m} (\pi_j - \pi^*_j)\, P_{j,m},$$

where $P_{\cdot m}$ (resp. $P^*_{\cdot m}$) denotes the $m$-th column of $P$ (resp. $P^*$). Then it boils down to controlling $I_1^m$, $I_2^m$, and the remaining term $\sum_{j: j \neq m} (\pi_j - \pi^*_j) P_{j,m}$.
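The decomposition above is a purely algebraic identity, which can be verified numerically for arbitrary transition matrices. The sketch below (our own construction) draws two random chains playing the roles of $P$ and $P^*$ and checks the identity entry by entry:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

def random_chain(rng, n):
    """A random row-stochastic matrix together with its stationary distribution."""
    P = rng.uniform(0.1, 1.0, size=(n, n))
    P /= P.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return P, v / v.sum()

P, pi = random_chain(rng, n)            # plays the role of (P, pi)
P_star, pi_star = random_chain(rng, n)  # plays the role of (P*, pi*)

# Check pi_m - pi*_m = I1 + I2 + sum_{j != m} (pi_j - pi*_j) P_{j,m} for all m.
for m in range(n):
    I1 = np.sum(pi_star * (P[:, m] - P_star[:, m]))
    I2 = (pi[m] - pi_star[m]) * P[m, m]
    rest = sum((pi[j] - pi_star[j]) * P[j, m] for j in range(n) if j != m)
    assert np.isclose(pi[m] - pi_star[m], I1 + I2 + rest)
print("decomposition verified")
```

The identity uses only the stationarity relations $\pi^\top P = \pi^\top$ and $\pi^{*\top} P^* = \pi^{*\top}$, so it holds exactly (up to floating point) for any pair of chains.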

1. Since $\pi^*$ is deterministic while $P$ is random, we can easily control $I_1^m$ using Hoeffding's inequality. The resulting bound is the following.

###### Lemma 2.

With probability exceeding , one has

$$\max_m |I_1^m| \;\lesssim\; \sqrt{\frac{\log n}{L d}}\, \|\pi^*\|_\infty.$$
###### Proof.

See Appendix C.3. ∎

2. Next, we show that the term $I_2^m$ behaves as a contraction of $\pi_m - \pi^*_m$.

###### Lemma 3.

With high probability, there exists some constant $0 < c < 1$ such that for all $1 \le m \le n$,

$$|I_2^m| \le c\, |\pi_m - \pi^*_m|.$$

###### Proof.

See Appendix C.4. ∎

3. The statistical dependency between $\pi$ and $P$ introduces difficulty in obtaining a sharp estimate of the third term $\sum_{j: j \neq m} (\pi_j - \pi^*_j) P_{j,m}$. Nevertheless, the leave-one-out technique helps us decouple the dependency and obtain effective control of this term. The key component of the analysis is the introduction of a new probability transition matrix $P^{(m)}$, which is a leave-one-out version of the original matrix $P$. More precisely, $P^{(m)}$ replaces all of the transition probabilities involving the $m$-th item with their expected values (unconditional on the comparisons involving the $m$-th item); that is, for any $1 \le i, j \le n$,

$$P^{(m)}_{i,j} := \begin{cases} P_{i,j}, & i \neq m \text{ and } j \neq m, \\ \frac{p}{d}\, y^*_{i,j}, & i = m \text{ or } j = m, \end{cases}$$

with $y^*_{i,j} := \frac{w^*_j}{w^*_i + w^*_j}$. For any