# Stochastically Transitive Models for Pairwise Comparisons: Statistical and Computational Issues

There are various parametric models for analyzing pairwise comparison data, including the Bradley-Terry-Luce (BTL) and Thurstone models, but their reliance on strong parametric assumptions is limiting. In this work, we study a flexible model for pairwise comparisons, under which the probabilities of outcomes are required only to satisfy a natural form of stochastic transitivity. This class includes parametric models including the BTL and Thurstone models as special cases, but is considerably more general. We provide various examples of models in this broader stochastically transitive class for which classical parametric models provide poor fits. Despite this greater flexibility, we show that the matrix of probabilities can be estimated at the same rate as in standard parametric models. On the other hand, unlike in the BTL and Thurstone models, computing the minimax-optimal estimator in the stochastically transitive model is non-trivial, and we explore various computationally tractable alternatives. We show that a simple singular value thresholding algorithm is statistically consistent but does not achieve the minimax rate. We then propose and study algorithms that achieve the minimax rate over interesting sub-classes of the full stochastically transitive class. We complement our theoretical results with thorough numerical simulations.

## Authors

• 31 publications
• 51 publications
• 16 publications
• 86 publications
• ### Estimation from Pairwise Comparisons: Sharp Minimax Bounds with Topology Dependence

Data in the form of pairwise comparisons arises in many domains, includi...
05/06/2015 ∙ by Nihar B. Shah, et al. ∙ 0

• ### Active Ranking from Pairwise Comparisons and when Parametric Assumptions Don't Help

We consider sequential or active ranking of a set of n items based on no...
06/28/2016 ∙ by Reinhard Heckel, et al. ∙ 0

• ### Social Choice Random Utility Models of Intransitive Pairwise Comparisons

There is a growing need for discrete choice models that account for the ...
10/05/2018 ∙ by Rahul Makhijani, et al. ∙ 6

• ### Feeling the Bern: Adaptive Estimators for Bernoulli Probabilities of Pairwise Comparisons

We study methods for aggregating pairwise comparison data in order to es...
03/22/2016 ∙ by Nihar B. Shah, et al. ∙ 0

• ### Analysis of Minimax Error Rate for Crowdsourcing and Its Application to Worker Clustering Model

While crowdsourcing has become an important means to label data, crowdwo...
02/13/2018 ∙ by Hideaki Imamura, et al. ∙ 0

• ### A General Pairwise Comparison Model for Extremely Sparse Networks

Statistical inference using pairwise comparison data has been an effecti...
02/20/2020 ∙ by Ruijian Han, et al. ∙ 0

• ### Optimal Full Ranking from Pairwise Comparisons

We consider the problem of ranking n players from partial pairwise compa...
01/21/2021 ∙ by Pinhan Chen, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Pairwise comparison data is ubiquitous and arises naturally in a variety of applications, including tournament play, voting, online search rankings, and advertisement placement problems. In rough terms, given a set of objects along with a collection of possibly inconsistent comparisons between pairs of these objects, the goal is to aggregate these comparisons in order to estimate underlying properties of the population. One property of interest is the underlying matrix of pairwise comparison probabilities—that is, the matrix in which entry corresponds to the probability that object is preferred to object in a pairwise comparison. The Bradley-Terry-Luce [BT52, Luc59] and Thurstone [Thu27] models are mainstays in analyzing this type of pairwise comparison data. These models are parametric in nature: more specifically, they assume the existence of an

-dimensional weight vector that measures the quality or strength of each item. The pairwise comparison probabilities are then determined via some fixed function of the qualities of the pair of objects. Estimation in these models reduces to estimating the underlying weight vector, and a large body of prior work has focused on these models (e.g., see the papers

[NOS12, HOX14, SBB16]). However, such models enforce strong relationships on the pairwise comparison probabilities that often fail to hold in real applications. Various papers [DM59, ML65, Tve72, BW97] have provided experimental results in which these parametric modeling assumptions fail to hold.

Our focus in this paper is on models that have their roots in social science and psychology (e.g., see Fishburn [Fis73] for an overview), in which the only coherence assumption imposed on the pairwise comparison probabilities is that of strong stochastic transitivity, or SST for short. These models include the parametric models as special cases but are considerably more general. The SST model has been validated by several empirical analyses, including those in a long line of work [DM59, ML65, Tve72, BW97]. The conclusion of Ballinger et al. [BW97] is especially strongly worded:

All of these parametric c.d.f.s are soundly rejected by our data. However, SST usually survives scrutiny.

We are thus provided with strong empirical motivation for studying the fundamental properties of pairwise comparison probabilities satisfying SST.

In this paper, we focus on the problem of estimating the matrix of pairwise comparison probabilities—that is, the probability that an item will beat a second item

in any given comparison. Estimates of these comparison probabilities are useful in various applications. For instance, when the items correspond to players or teams in a sport, the predicted odds of one team beating the other are central to betting and bookmaking operations. In a supermarket or an ad display, an accurate estimate of the probability of a customer preferring one item over another, along with the respective profits for each item, can effectively guide the choice of which product to display. Accurate estimates of the pairwise comparison probabilities can also be used to infer partial or full rankings of the underlying items.

##### Our contributions:

We begin by studying the performance of optimal methods for estimating matrices in the SST class: our first main result (Theorem 1) characterizes the minimax rate in squared Frobenius norm up to logarithmic factors. This result reveals that even though the SST class of matrices is considerably larger than the classical parametric class, surprisingly, it is possible to estimate any SST matrix at nearly the same rate as the classical parametric family. On the other hand, our achievability result is based on an estimator involving prohibitive computation, as a brute force approach entails an exhaustive search over permutations. Accordingly, we turn to studying computationally tractable estimation procedures. Our second main result (Theorem 2) applies to a polynomial-time estimator based on soft-thresholding the singular values of the data matrix. An estimator based on hard-thresholding was studied previously in this context by Chatterjee [Cha14]. We sharpen and generalize this previous analysis, and give a tight characterization of the rate achieved by both hard and soft-thresholding estimators. Our third contribution, formalized in Theorems 3 and 4, is to show how for certain interesting subsets of the full SST class, a combination of parametric maximum likelihood [SBB16] and noisy sorting algorithms [BM08] leads to a tractable two-stage method that achieves the minimax rate. Our fourth contribution is to supplement our minimax lower bound with lower bounds for various known estimators, including those based on thresholding singular values [Cha14], noisy sorting [BM08], as well as parametric estimators [NOS12, HOX14, SBB16]. These lower bounds show that none of these tractable estimators achieve the minimax rate uniformly over the entire class. The lower bounds also show that the minimax rates for any of these subclasses is no better than the full SST class.

##### Related work:

The literature on ranking and estimation from pairwise comparisons is vast and we refer the reader to various surveys [FV93, Mar96, Cat12] for a more detailed overview. Here we focus our literature review on those papers that are most closely related to our contributions. Some recent work [NOS12, HOX14, SBB16] studies procedures and minimax rates for estimating the latent quality vector that underlie parametric models. Theorem 4 in this paper provides an extension of these results, in particular by showing that an optimal estimate of the latent quality vector can be used to construct an optimal estimate of the pairwise comparison probabilities. Chatterjee [Cha14] analyzed matrix estimation based on singular value thresholding, and obtained results for the class of SST matrices. In Theorem 2, we provide a sharper analysis of this estimator, and show that our upper bound is—in fact—unimprovable.

In past work, various authors [KMS07, BM08] have considered the noisy sorting problem, in which the goal is to infer the underlying order under a so-called high signal-to-noise ratio (SNR) condition. The high SNR condition means that each pairwise comparison has a probability of agreeing with the underlying order that is bounded away from by a fixed constant. Under this high SNR condition, these authors provide polynomial-time algorithms that, with high probability, return an estimate of true underlying order with a prescribed accuracy. Part of our analysis leverages an algorithm from the paper [BM08]; in particular, we extend their analysis in order to provide guarantees for estimating pairwise comparison probabilities as opposed to estimating the underlying order.

As will be clarified in the sequel, the assumption of strong stochastic transitivity has close connections to the statistical literature on shape constrained inference (e.g., [SS11]), particularly to the problem of bivariate isotonic regression. In our analysis of the least-squares estimator, we leverage metric entropy bounds from past work in this area (e.g., [GW07, CGS15]).

In Appendix D of the present paper, we study estimation under two popular models that are closely related to the SST class, and make even weaker assumptions. We show that under these moderate stochastic transitivity (MST) and weak stochastic transitivity (WST) models, the Frobenius norm error of any estimator, measured in a uniform sense over the class, must be almost as bad as that incurred by making no assumptions whatsoever. Consequently, from a statistical point of view, these assumptions are not strong enough to yield reductions in estimation error. We note that the “low noise model” studied in the paper [RA14] is identical to the WST class.

##### Organization:

The remainder of the paper is organized as follows. We begin by providing a background and a formal description of the problem in Section 2. In Section 3, we present the main theoretical results of the paper. We then present results from numerical simulations in Section 4. We present proofs of our main results in Section 5. We conclude the paper in Section 6.

## 2 Background and problem formulation

Given a collection of items, suppose that we have access to noisy comparisons between any pair of distinct items. The full set of all possible pairwise comparisons can be described by a probability matrix , in which is the probability that item is preferred to item . The upper and lower halves of the probability matrix are related by the

shifted-skew-symmetry condition

111In other words, the shifted matrix is skew-symmetric. for all . For concreteness, we set for all .

### 2.1 Estimation of pairwise comparison probabilities

For any matrix with for every

, suppose that we observe a random matrix

with (upper-triangular) independent Bernoulli entries, in particular, with for every and . Based on observing , our goal in this paper is to recover an accurate estimate, in the squared Frobenius norm, of the full matrix .

Our primary focus in this paper will be on the setting where for items we observe the outcome of a single pairwise comparison for each pair. We will subsequently (in Section 3.5) also address the more general case when we have partial observations, that is, when each pairwise comparison is observed with a fixed probability.

For future reference, note that we can always write the Bernoulli observation model in the linear form

 Y =M∗+W, (1)

where is a random matrix with independent zero-mean entries for every given by

 Wij ∼{1−M∗ijwith probability M∗ij−M∗ijwith probability 1−M∗ij, (2)

and for every . This linearized form of the observation model is convenient for subsequent analysis.

### 2.2 Strong stochastic transitivity

Beyond the previously mentioned constraints on the matrix —namely that and —more structured and interesting models are obtained by imposing further restrictions on the entries of . We now turn to one such condition, known as strong stochastic transitivity (SST), which reflects the natural transitivity of any complete ordering. Formally, suppose that the full collection of items is endowed with a complete ordering . We use the notation to convey that item is preferred to item in the total ordering . Consider some triple such that . A matrix satisfies the SST condition if the inequality holds for every such triple. The intuition underlying this constraint is the following: since dominates in the true underlying order, when we make noisy comparisons, the probability that is preferred to should be at least as large as the probability that is preferred to . The SST condition was first described222We note that the psychology literature has also considered what are known as weak and moderate stochastic transitivity conditions. From a statistical standpoint, pairwise comparison probabilities cannot be consistently estimated in a minimax sense under these conditions. We establish this formally in Appendix D. in the psychology literature (e.g., [Fis73, DM59]).

The SST condition is characterized by the existence of a permutation such that the permuted matrix has entries that increase across rows and decrease down columns. More precisely, for a given permutation , let us say that a matrix is -faithful if for every pair such that , we have for all . With this notion, the set of SST matrices is given by

 C\scalebox{.5}{{SST}}={M∈[0,1]n×n∣ Mba=1−Mab∀(a,b), and ∃ perm. π∗ s.t. M is π∗-faithful}. (3)

Note that the stated inequalities also guarantee that for any pair with , we have for all , which corresponds to a form of column ordering. The class is our primary focus in this paper.

### 2.3 Classical parametric models

Let us now describe a family of classical parametric models, one which includes Bradley-Terry-Luce and Thurstone (Case V) models [BT52, Luc59, Thu27]. In the sequel, we show that these parametric models induce a relatively small subclass of the SST matrices .

In more detail, parametric models are described by a weight vector that corresponds to the notional qualities of the items. Moreover, consider any non-decreasing function such that for all ; we refer to any such function as being valid. Any such pair induces a particular pairwise comparison model in which

 M∗ij=F(w∗i−w∗j)for all pairs (i,j). (4)

For each valid choice of , we define

 C\scalebox{.5}{{PAR}}(F)={M∈[0,1]n×n∣ M induced by Equation~{}(???) for some w∗∈Rn}. (5)

For any choice of , it is straightforward to verify that is a subset of , meaning that any matrix induced by the relation (4) satisfies all the constraints defining the set . As particular important examples, we recover the Thurstone model by setting where is the Gaussian CDF, and the Bradley-Terry-Luce model by setting

, corresponding to the sigmoid function.

##### Remark:

One can impose further constraints on the vector without reducing the size of the class . In particular, since the pairwise probabilities depend only on the differences , we can assume without loss of generality that . Moreover, since the choice of can include rescaling its argument, we can also assume that . Accordingly, we assume in our subsequent analysis that belongs to the set

 {w∈Rn∣such that ⟨w,1⟩=0 and ∥w∥∞≤1.}.

### 2.4 Inadequacies of parametric models

As noted in the introduction, a large body of past work (e.g., [DM59, ML65, Tve72, BW97]) has shown that parametric models, of the form (5) for some choice of , often provide poor fits to real-world data. What might be a reason for this phenomenon? Roughly, parametric models impose the very restrictive assumption that the choice of an item depends on the value of a single latent factor (as indexed by )—e.g., that our preference for cars depends only on their fuel economy, or that the probability that one hockey team beats another depends only on the skills of the goalkeepers.

This intuition can be formalized to construct matrices that are far away from every valid parametric approximation as summarized in the following result:

###### Proposition 1.

There exists a universal constant such that for every , there exist matrices for which

 1n2infvalid FinfM∈C\scalebox{.5}{{PAR}}(F)|||M−M∗|||2\tiny{F}≥c. (6)

Given that every entry of matrices in lies in the interval , the Frobenius norm diameter of the class is at most , so that the scaling of the lower bound (6) cannot be sharpened. See Appendix B for a proof of Proposition 1.

What sort of matrices are “bad” in the sense of satisfying a lower bound of the form (6)? Panel (a) of Figure 1 describes the construction of one “bad” matrix . In order to provide some intuition, let us return to the analogy of rating cars. A key property of any parametric model is that if we prefer car 1 to car 2 more than we prefer car 3 to car 4, then we must also prefer car 1 to car 3 more than we prefer car 2 to car 4.333This condition follows from the proof of Proposition 1. This condition is potentially satisfied if there is a single determining factor across all cars—for instance, their fuel economy.

This ordering condition is, however, violated by the pairwise comparison matrix from Figure 1(a). In this example, we have and . Such an occurrence can be explained by a simple two-factor model: suppose the fuel economies of cars and are , , and kilometers per liter respectively, and the comfort levels of the four cars are also ordered , with meaning that is more comfortable than . Suppose that in a pairwise comparison of two cars, if one car is more fuel efficient by at least 10 kilometers per liter, it is always chosen. Otherwise the choice is governed by a parametric choice model acting on the respective comfort levels of the two cars. Observe that while the comparisons between the pairs , and of cars can be explained by this parametric model acting on their respective comfort levels, the preference between cars and , as well as between cars and , is governed by their fuel economies. This two-factor model accounts for the said values of , , and , which violate parametric requirements.

While this was a simple hypothetical example, there is a more ubiquitous phenomenon underlying our example. It is often the case that our preferences are decided by comparing items on a multitude of dimensions. In any situation where a single (latent) parameter per item does not adequately explain our preferences, one can expect that the class of parametric models to provide a poor fit to the pairwise preference probabilities.

The lower bound on approximation quality guaranteed by Proposition 1 means that any parametric estimator of the matrix should perform poorly. This expectation is confirmed by the simulation results in panel (b) of Figure 1. After generating observations from a “bad matrix” over a range of , we fit the data set using either the maximum likelihood estimate in the Thurstone model, or the singular value thresholding (SVT) estimator, to be discussed in Section 3.2. For each estimator , we plot the rescaled Frobenius norm error versus the sample size. Consistent with the lower bound (6), the error in the Thurstone-based estimator stays bounded above a universal constant. In contrast, the SVT error goes to zero with , and as our theory in the sequel shows, the rate at which the error decays is at least as fast as .

## 3 Main results

Thus far, we have introduced two classes of models for matrices of pairwise comparison probabilities. Our main results characterize the rates of estimation associated with different subsets of these classes, using either optimal estimators (that we suspect are not polynomial-time computable), or more computationally efficient estimators that can be computed in polynomial-time.

### 3.1 Characterization of the minimax risk

We begin by providing a result that characterizes the minimax risk in squared Frobenius norm over the class of SST matrices. The minimax risk is defined by taking an infimum over the set of all possible estimators, which are measurable functions . Here the data matrix is drawn from the observation model (1).

###### Theorem 1.

There are universal constants such that

 cℓn≤inf˜MsupM∗∈C\scalebox{.5}{{SST}}1n2E[|||˜M−M∗|||2\tiny{F}]≤culog2(n)n, (7)

where the infimum ranges over all measurable functions of the observed matrix .

We prove this theorem in Section 5.1. The proof of the lower bound is based on extracting a particular subset of the class such that any matrix in this subset has at least positions that are unconstrained, apart from having to belong to the interval . We can thus conclude that estimation of the full matrix is at least as hard as estimating Bernoulli parameters belonging to the interval based on a single observation per number. This reduction leads to an lower bound, as stated.

Proving an upper bound requires substantially more effort. In particular, we establish it via careful analysis of the constrained least-squares estimator

 ˆM∈\operatornamewithlimitsarg minM∈C\scalebox{.5}{{SST}}|||Y−M|||2\tiny{F}. (8a) In particular, we prove that there are universal constants (c0,c1,c2) such that, for any matrix M∗∈C\scalebox{.5}{{SST}}, this estimator satisfies the high probability bound P[1n2|||ˆM−M∗|||2\tiny{F}≥c0log2(n)n] ≤c1e−c2n. (8b)

Since the entries of and all lie in the interval , integrating this tail bound leads to the stated upper bound (7) on the expected mean-squared error. Proving the high probability bound (8b) requires sharp control on a quantity known as the localized Gaussian complexity of the class . We use Dudley’s entropy integral (e.g., [VDVW96, Corollary 2.2.8]) in order to derive an upper bound that is sharp up to a logarithmic factor; doing so in turn requires deriving upper bounds on the metric entropy of the class for which we leverage the prior work of Gao and Wellner [GW07].

We do not know whether the constrained least-squares estimator (8a) is computable in time polynomial in , but we suspect not. This complexity is a consequence of the fact that the set is not convex, but is a union of convex sets. Given this issue, it becomes interesting to consider the performance of alternative estimators that can be computed in polynomial-time.

### 3.2 Sharp analysis of singular value thresholding (SVT)

The first polynomial-time estimator that we consider is a simple estimator based on thresholding singular values of the observed matrix

, and reconstructing its truncated singular value decomposition. For the full class

, Chatterjee [Cha14] analyzed the performance of such an estimator and proved that the squared Frobenius error decays as uniformly over . In this section, we prove that its error decays as , again uniformly over , and moreover, that this upper bound is unimprovable.

Let us begin by describing the estimator. Given the observation matrix , we can write its singular value decomposition as , where the matrix is diagonal, whereas the matrices and are orthonormal. Given a threshold level , the soft-thresholded version of a diagonal matrix is the diagonal matrix with values

 [Tλn(D)]jj =max{0,Djj−λn}for every integer j∈[1,n]. (9)

With this notation, the soft singular-value-thresholded (soft-SVT) version of is given by . The following theorem provides a bound on its squared Frobenius error:

###### Theorem 2.

There are universal positive constants such that the soft-SVT estimator with satisfies the bound

 P[1n2|||ˆMλn−M∗|||2\tiny{F}≥cu√n] ≤c0e−c1n (10a) for any M∗∈C\scalebox{.5}{{SST}}. Moreover, there is a universal constant cℓ>0 such that for any choice of λn, we have supM∗∈C\scalebox{.5}{{SST}}1n2|||ˆMλn−M∗|||2\tiny{F}≥cℓ√n. (10b)

A few comments on this result are in order. Since the matrices and have entries in the unit interval , the normalized squared error is at most . Consequently, by integrating the the tail bound (10a), we find that

 supM∗∈C\scalebox{.5}{{SST}}E[1n2|||ˆMλn−M∗|||2\tiny{% F}] ≤cu√n+c0e−c1n≤c′u√n.

On the other hand, the matching lower bound (10b) holds with probability one, meaning that the soft-SVT estimator has squared error of the order , irrespective of the realization of the noise.

To be clear, Chatterjee [Cha14] actually analyzed the hard-SVT estimator, which is based on the hard-thresholding operator

 [Hλn(D)]jj =Djj1{Djj≥λn}.

Here denotes the 0-1-valued indicator function. In this setting, the hard-SVT estimator is simply, . With essentially the same choice of as above, Chatterjee showed that the estimate has a mean-squared error of . One can verify that the proof of Theorem 2 in our paper goes through for the hard-SVT estimator as well. Consequently the performance of the hard-SVT estimator is of the order , and is identical to that of the soft-thresholded version up to universal constants.

Note that the hard and soft-SVT estimators return matrices that may not lie in the SST class . In a companion paper [SBW16], we provide an alternative computationally-efficient estimator with similar statistical guarantees that is guaranteed to return a matrix in the SST class.

Together the upper and lower bounds of Theorem 2 provide a sharp characterization of the behavior of the soft/hard SVT estimators. On the positive side, these are easily implementable estimators that achieve a mean-squared error bounded by uniformly over the entire class . On the negative side, this rate is slower than the rate achieved by the least-squares estimator, as in Theorem 1.

In conjunction, Theorems 1 and 2 raise a natural open question: is there a polynomial-time estimator that achieves the minimax rate uniformly over the family ? We do not know the answer to this question, but the following subsections provide some partial answers by analyzing some polynomial-time estimators that (up to logarithmic factors) achieve the optimal -rate over some interesting sub-classes of . In the next two sections, we turn to results of this type.

### 3.3 Optimal rates for high SNR subclass

In this section, we describe a multi-step polynomial-time estimator that (up to logarithmic factors) can achieve the optimal rate over an interesting subclass of . This subset corresponds to matrices that have a relatively high signal-to-noise ratio (SNR), meaning that no entries of fall within a certain window of . More formally, for some , we define the class

 C\scalebox{.5}{{HIGH}}(γ)={M∈C\scalebox{.5}{{SST}}∣max(Mij,Mji)≥1/2+γ  ∀  i≠j}. (11)

By construction, for any matrix , the amount of information contained in each observation is bounded away from zero uniformly in , as opposed to matrices in which some large subset of entries have values equal (or arbitrarily close) to . In terms of estimation difficulty, this SNR restriction does not make the problem substantially easier: as the following theorem shows, the minimax mean-squared error remains lower bounded by a constant multiple of . Moreover, we can demonstrate a polynomial-time algorithm that achieves this optimal mean squared error up to logarithmic factors.

The following theorem applies to any fixed independent of , and involves constants that may depend on but are independent of .

###### Theorem 3.

There is a constant such that

 inf˜MsupM∗∈C\scalebox{.5}% {{HIGH}}(γ)1n2E[|||˜M−M∗|||2\tiny{F}] ≥cℓn, (12a) where the infimum ranges over all estimators. Moreover, there is a two-stage estimator ˆM, computable in polynomial-time, for which P[1n2|||ˆM−M∗|||2\tiny{F}≥culog2(n)n] ≤cn2, (12b) valid for any M∗∈C\scalebox{.5}{{HIGH}}(γ).

As before, since the ratio is at most , so the tail bound (12b) implies that

 supM∗∈C\scalebox{.5}{{HIGH}}(γ)1n2E[|||ˆM−M∗|||2\tiny{F}] ≤culog2(n)n+cn2≤c′ulog2(n)n. (13)

We provide the proof of this theorem in Section 5.3. As with our proof of the lower bound in Theorem 1, we prove the lower bound by considering the sub-class of matrices that are free only on the two diagonals just above and below the main diagonal. We now provide a brief sketch for the proof of the upper bound (12b). It is based on analyzing the following two-step procedure:

1. [leftmargin=*]

2. In the first step of algorithm, we find a permutation of the items that minimizes the total number of disagreements with the observations. (For a given ordering, we say that any pair of items are in disagreement with the observation if either is rated higher than in the ordering and , or if is rated lower than in the ordering and .) The problem of finding such a disagreement-minimizing permutation is commonly known as the minimum feedback arc set (FAS) problem. It is known to be NP-hard in the worst-case [ACN08, Alo06], but our set-up has additional probabilistic structure that allows for polynomial-time solutions with high probability. In particular, we call upon a polynomial-time algorithm due to Braverman and Mossel [BM08] that, under the model (11), is guaranteed to find the exact solution to the FAS problem with high probability. Viewing the FAS permutation as an approximation to the true permutation , the novel technical work in this first step is show that is “good enough” for Frobenius norm estimation, in the sense that for any matrix , it satisfies the bound

 1n2|||π∗(M∗)−ˆπ\tiny{% FAS}(M∗)|||2\tiny{F} ≤clognn (14a) with high probability. In this statement, for any given permutation π, we have used π(M∗) to denote the matrix obtained by permuting the rows and columns of M∗ by π. The term 1n2|||π∗(M∗)−ˆπ\tiny{FAS}(M∗)|||2\tiny{F} can be viewed in some sense as the bias in estimation incurred from using ˆπ\tiny{FAS} in place of π∗.
3. Next we define as the class of “bivariate isotonic” matrices, that is, matrices that satisfy the linear constraints for all , and whenever and . This class corresponds to the subset of matrices that are faithful with respect to the identity permutation. Letting denote the image of this set under , the second step involves computing the constrained least-squares estimate

 ˆM∈\operatornamewithlimitsarg minM∈ˆπ\tiny{FAS}(C\scalebox{.5}{{BISO}})|||Y−M|||2\tiny{F}. (14b)

Since the set is a convex polytope, with a number of facets that grows polynomially in , the constrained quadratic program (14b) can be solved in polynomial-time. The final step in the proof of Theorem 3 is to show that the estimator also has mean-squared error that is upper bounded by a constant multiple of .

Our analysis shows that for any fixed , the proposed two-step estimator works well for any matrix . Since this two-step estimator is based on finding a minimum feedback arc set (FAS) in the first step, it is natural to wonder whether an FAS-based estimator works well over the full class . Somewhat surprisingly the answer to this question turns out to be negative: we refer the reader to Appendix C for more intuition and details on why the minimal FAS estimator does not perform well over the full class.

### 3.4 Optimal rates for parametric subclasses

Let us now return to the class of parametric models introduced earlier in Section 2.3. As shown previously in Proposition 1, this class is much smaller than the class , in the sense that there are models in that cannot be well-approximated by any parametric model. Nonetheless, in terms of minimax rates of estimation, these classes differ only by logarithmic factors. An advantage of the parametric class is that it is possible to achieve the minimax rate by using a simple, polynomial-time estimator. In particular, for any log concave function , the maximum likelihood estimate can be obtained by solving a convex program. This MLE induces a matrix estimate via Equation (4), and the following result shows that this estimator is minimax-optimal up to constant factors.

###### Theorem 4.

Suppose that is strictly increasing, strongly log-concave and twice differentiable. Then there is a constant , depending only on , such that the minimax risk over is lower bounded as

 inf˜MsupM∗∈C\scalebox{.5}% {{PAR}}(F)1n2E[|||˜M−M∗|||2\tiny{F}]≥cℓn, (15a) Conversely, there is a constant cu≥cℓ, depending only on F, such that the matrix estimate M(ˆw\tiny{ML}) induced by the MLE satisfies the bound supM∗∈C\scalebox{.5}{{PAR}}(F)1n2E[|||M(ˆw\tiny{ML})−M∗|||2\tiny{F}]≤cun. (15b)

To be clear, the constants in this theorem are independent of , but they do depend on the specific properties of the given function . We note that the stated conditions on are true for many popular parametric models, including (for instance) the Thurstone and BTL models.

We provide the proof of Theorem 4 in Section 5.4. The lower bound (15a) is, in fact, stronger than the the lower bound in Theorem 1, since the supremum is taken over a smaller class. The proof of the lower bound in Theorem 1 relies on matrices that cannot be realized by any parametric model, so that we pursue a different route to establish the bound (15a). On the other hand, in order to prove the upper bound (15b), we make use of bounds on the accuracy of the MLE from our own past work (see the paper [SBB16]).

### 3.5 Extension to partial observations

We now consider the extension of our results to the setting in which not all entries of are observed. Suppose instead that every entry of is observed independently with probability . In other words, the set of pairs compared is the set of edges of an Erdős-Rényi graph that has the items as its vertices.

In this setting, we obtain an upper bound on the minimax risk of estimation by first setting whenever the pair is not compared, then forming a new matrix as

 Y′:=1pobsY−1−pobs2pobs11T, (16a) and finally computing the least squares solution (16b)

Likewise, the computationally-efficient singular value thresholding estimator is also obtained by thresholding the singular values of with a threshold . See our discussion following Theorem 5 for the motivation underlying the transformed matrix .

The parametric estimators continue to operate on the original (partial) observations, first computing a maximum likelihood estimate of using the observed data, and then computing the associated pairwise-comparison-probability matrix via (4).

###### Theorem 5.

In the setting where each pair is observed with a probability , there are positive universal constants , and such that:

1. The minimax risk is sandwiched as

 cℓpobsn≤inf˜MsupM∗∈C\scalebox{.5}{{SST}}1n2E[|||˜M−M∗|||2\tiny{F}]≤cu(logn)2pobsn, (17a) when pobs≥c0n.
2. The soft-SVT estimator, with , satisfies the bound

 supM∗∈C\scalebox{.5}{{SST}}1n2E[|||ˆMλn−M∗|||2\tiny{% F}] ≤cu√npobs. (17b)
3. For a parametric sub-class based on a strongly log-concave and smooth , the estimator induced by the maximum likelihood estimate of the parameter vector has mean-squared error upper bounded as

 supM∗∈C\scalebox{.5}{{PAR}}(F)1n2E[|||M(ˆw\tiny{ML})−M∗|||2\tiny{F}]≤cupobsn, (17c)

when .

The intuition behind the transformation (16a) is that the matrix can equivalently be written in a linearized form as

 Y′=M∗+1pobsW′, (18a) where W′ has entries that are independent on and above the diagonal, satisfy skew-symmetry, and are distributed as [W′]ij=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩pobs(12−[M∗]ij)+12with probability pobs[M∗]ijpobs(12−[M∗]ij)−12with probability pobs(1−[M∗]ij)pobs(12−[M∗]ij)with probability 1−pobs. (18b)

The proofs of the upper bounds exploit the specific relation (18a) between the observations and the true matrix , and the specific form of the additive noise (18b).

The result of Theorem 5(b) yields an affirmative answer to the question, originally posed by Chatterjee [Cha14], of whether or not the singular value thresholding estimator can yield a vanishing error when .

We note that we do not have an analogue of the high-SNR result in the partial observations case since having partial observations reduces the SNR. In general, we are interested in scalings of which allow as . The noisy-sorting algorithm of Braverman and Mossel [BM08] for the high-SNR case has computational complexity scaling as , and hence is not computable in time polynomial in when . This restriction disallows most interesting scalings of with .

## 4 Simulations

In this section, we present results from simulations to gain a further understanding of the problem at hand, in particular to understand the rates of estimation under specific generative models. We investigate the performance of the soft-SVT estimator (Section 3.2) and the maximum likelihood estimator under the Thurstone model (Section 2.3).444We could not compare the algorithm that underlies Theorem 3, since it is not easily implementable. In particular, it relies on the algorithm due to Braverman and Mossel [BM08] to compute the feedback arc set minimizer. The computational complexity of this algorithm, though polynomial in , has a large polynomial degree which precludes it from being implemented for matrices of any reasonable size. The simulations in this section add to the simulation results of Section 2.4 (Figure 1) demonstrating a large class of matrices in the SST class that cannot be represented by any parametric class. The output of the SVT estimator need not lie in the set of matrices; in our implementation, we take a projection of the output of the SVT estimator on this set, which gives a constant factor reduction in the error.

In our simulations, we generate the ground truth in the following five ways:

• [leftmargin=*]

• Uniform: The matrix is generated by drawing values independently and uniformly at random in and sorting them in descending order. The values are then inserted above the diagonal of an matrix such that the entries decrease down a column or left along a row. We then make the matrix skew-symmetric and permute the rows and columns.

• Thurstone: The matrix is generated by first choosing uniformly at random from the set satisfying . The matrix is then generated from via Equation (4) with

chosen as the CDF of the standard normal distribution.

• Bradley-Terry-Luce (BTL): Identical to the Thurstone case, except that is given by the sigmoid function.

• High SNR: A setting studied previously by Braverman and Mossel [BM08], in which the noise is independent of the items being compared. Some global order is fixed over the items, and the comparison matrix takes the values for every pair where is ranked above in the underlying ordering. The entries on the diagonal are .

• Independent bands: The matrix is chosen with diagonal entries all equal to . Entries on diagonal band immediately above the diagonal itself are chosen i.i.d. and uniformly at random from . The band above is then chosen uniformly at random from the allowable set, and so on. The choice of any entry in this process is only constrained to be upper bounded by and lower bounded by the entries to its left and below. The entries below the diagonal are chosen to make the matrix skew-symmetric.

Figure 2 depicts the results of the simulations based on observations of the entire matrix . Each point is an average across trials. The error bars in most cases are too small and hence not visible. We see that the uniform case (Figure 1(a)) is favorable for both estimators, with the error scaling as . With data generated from the Thurstone model, both estimators continue to perform well, and the Thurstone MLE yields an error of the order (Figure 1(b)). Interestingly, the Thurstone model also fits relatively well when data is generated via the BTL model (Figure 1(c)). This behavior is likely a result of operating in the near-linear regime of the logistic and the Gaussian CDF where the two curves are similar. In these two parametric settings, the SVT estimator has squared error strictly worse than order but better than . The Thurstone model, however, yields a poor fit for the model in the high-SNR (Figure 1(d)) and the independent bands (Figure 1(e)) cases, incurring a constant error as compared to an error scaling as for the SVT estimator. We recall that the poor performance of the Thurstone estimator was also described previously in Proposition 1 and Figure 1.

In summary, we see that while the Thurstone MLE estimator gives minimax optimal rates of estimation when the underlying model is parametric, it can be inconsistent when the parametric assumptions are violated. On the other hand, the SVT estimator is robust to violations of parametric assumptions, and while it does not necessarily give minimax-optimal rates, it remains consistent across the entire SST class. Finally, we remark that our theory predicts that the least squares estimator, if implementable, would outperform both these estimators in terms of statistical error.

## 5 Proofs of main results

This section is devoted to the proofs of our main results–namely, Theorems 1 through 5. Throughout these and other proofs, we use the notation and so on to denote positive constants whose values may change from line to line. In addition, we assume throughout that is lower bounded by a universal constant so as to avoid degeneracies. For any square matrix , we let denote its singular values (ordered from largest to smallest), and similarly, for any symmetric matrix , we let

denote its ordered eigenvalues. The identity permutation is one where item

is the most preferred item, for every .

### 5.1 Proof of Theorem 1

This section is devoted to the proof of Theorem 1, including both the upper and lower bounds on the minimax risk in squared Frobenius norm.

#### 5.1.1 Proof of upper bound

Define the difference matrix - between and the optimal solution to the constrained least-squares problem. Since is optimal and is feasible, we must have , and hence following some algebra, we arrive at the basic inequality

 12|||ˆΔ|||2\tiny{F}≤⟨⟨ˆΔ,W⟩⟩, (19)

where is the noise matrix in the observation model (1), and denotes the trace inner product.

We introduce some additional objects that are useful in our analysis. The class of bivariate isotonic matrices is defined as

 C\scalebox{.5}{{BISO}}:={M∈[0,1]n×n∣Mkℓ≥Mij whenever k≤i and ℓ≥j}. (20)

For a given permutation and matrix , we let denote the matrix obtained by applying to its rows and columns. We then define the set

 C\scalebox{.5}{{DIFF}} :={π1(M1)−π2(M2)∣for some M1,M2∈C\scalebox{.5}{{BISO}}, and % perm. π1 and π2}. (21)

corresponding to the set of difference matrices. Note that by construction. One can verify that for any , we are guaranteed the inclusion

 {M−M∗∣M∈C\scalebox{.5}{{SST}}, |||M−M∗|||\tiny{F}≤t}⊂C% \scalebox{.5}{{DIFF}}∩{|||D|||\tiny{F}≤t}.

Consequently, the error matrix must belong to , and so must satisfy the properties defining this set. Moreover, as we discuss below, the set is star-shaped, and this property plays an important role in our analysis.

, define the random variable

 Z(t) :=supD∈C\scalebox{.5}{{DIFF}},|||D|||\tiny{F}≤t⟨⟨D,W⟩⟩. (22)

Using our earlier basic inequality (19), the Frobenius norm error then satisfies the bound

 12|||ˆΔ|||2\tiny{F} ≤⟨⟨ˆΔ,W⟩⟩≤Z(|||ˆΔ|||\tiny{F}). (23)

Thus, in order to obtain a high probability bound, we need to understand the behavior of the random quantity .

One can verify that the set is star-shaped, meaning that for every and every . Using this star-shaped property, we are guaranteed that there is a non-empty set of scalars satisfying the critical inequality

 E[Z(δn)] ≤δ2n2. (24)

Our interest is in the smallest (strictly) positive solution to the critical inequality (24), and moreover, our goal is to show that for every , we have with probability at least .

For each , define the “bad” event as

 At ={∃Δ∈C\scalebox{.5}{{DIFF}}∣|||Δ|||\tiny{F}≥√tδn% and⟨⟨Δ,W⟩⟩≥2|||Δ|||\tiny{F}√tδn}. (25)

Using the star-shaped property of , it follows by a rescaling argument that

 P[At]≤P[Z(δn)≥2δn√tδn]for all t≥δn.

The entries of lie in , are i.i.d. on and above the diagonal, are zero-mean, and satisfy skew-symmetry. Moreover, the function is convex and Lipschitz with parameter . Consequently, from known concentration bounds(e.g., [Led01, Theorem 5.9][Sam00]) for convex Lipschitz functions, we have

 P[Z(δn)≥E[Z(δn)]+√tδnδn] ≤2e−c1tδnfor all t≥δn.

By the definition of , we have for any , and consequently

 P[At] ≤P[Z(δn)≥2δn√tδn]≤2e−c1tδnfor all t≥δn.

Consequently, either