1 Introduction
Pairwise comparison data is ubiquitous and arises naturally in a variety of applications, including tournament play, voting, online search rankings, and advertisement placement problems. In rough terms, given a set of objects along with a collection of possibly inconsistent comparisons between pairs of these objects, the goal is to aggregate these comparisons in order to estimate underlying properties of the population. One property of interest is the underlying matrix of pairwise comparison probabilities—that is, the matrix in which entry (i, j) corresponds to the probability that object i is preferred to object j in a pairwise comparison. The Bradley-Terry-Luce [BT52, Luc59] and Thurstone [Thu27] models are mainstays in analyzing this type of pairwise comparison data. These models are parametric in nature: more specifically, they assume the existence of an n-dimensional weight vector that measures the quality or strength of each item. The pairwise comparison probabilities are then determined via some fixed function of the qualities of the pair of objects. Estimation in these models reduces to estimating the underlying weight vector, and a large body of prior work has focused on these models (e.g., see the papers
[NOS12, HOX14, SBB16]). However, such models enforce strong relationships on the pairwise comparison probabilities that often fail to hold in real applications. Various papers [DM59, ML65, Tve72, BW97] have provided experimental results in which these parametric modeling assumptions fail to hold.

Our focus in this paper is on models that have their roots in social science and psychology (e.g., see Fishburn [Fis73] for an overview), in which the only coherence assumption imposed on the pairwise comparison probabilities is that of strong stochastic transitivity, or SST for short. These models include the parametric models as special cases but are considerably more general. The SST model has been validated by several empirical analyses, including those in a long line of work [DM59, ML65, Tve72, BW97]. The conclusion of Ballinger and Wilcox [BW97] is especially strongly worded:
All of these parametric c.d.f.s are soundly rejected by our data. However, SST usually survives scrutiny.
We are thus provided with strong empirical motivation for studying the fundamental properties of pairwise comparison probabilities satisfying SST.
In this paper, we focus on the problem of estimating the matrix of pairwise comparison probabilities—that is, the probability that any given item i will beat a second item j
in any given comparison. Estimates of these comparison probabilities are useful in various applications. For instance, when the items correspond to players or teams in a sport, the predicted odds of one team beating the other are central to betting and bookmaking operations. In a supermarket or an ad display, an accurate estimate of the probability of a customer preferring one item over another, along with the respective profits for each item, can effectively guide the choice of which product to display. Accurate estimates of the pairwise comparison probabilities can also be used to infer partial or full rankings of the underlying items.
Our contributions:
We begin by studying the performance of optimal methods for estimating matrices in the SST class: our first main result (Theorem 1) characterizes the minimax rate in squared Frobenius norm up to logarithmic factors. This result reveals that even though the SST class of matrices is considerably larger than the classical parametric class, surprisingly, it is possible to estimate any SST matrix at nearly the same rate as the classical parametric family. On the other hand, our achievability result is based on an estimator involving prohibitive computation, as a brute force approach entails an exhaustive search over permutations. Accordingly, we turn to studying computationally tractable estimation procedures. Our second main result (Theorem 2) applies to a polynomial-time estimator based on soft-thresholding the singular values of the data matrix. An estimator based on hard-thresholding was studied previously in this context by Chatterjee [Cha14]. We sharpen and generalize this previous analysis, and give a tight characterization of the rate achieved by both hard- and soft-thresholding estimators. Our third contribution, formalized in Theorems 3 and 4, is to show how, for certain interesting subsets of the full SST class, a combination of parametric maximum likelihood [SBB16] and noisy sorting algorithms [BM08] leads to a tractable two-stage method that achieves the minimax rate. Our fourth contribution is to supplement our minimax lower bound with lower bounds for various known estimators, including those based on thresholding singular values [Cha14], noisy sorting [BM08], as well as parametric estimators [NOS12, HOX14, SBB16]. These lower bounds show that none of these tractable estimators achieves the minimax rate uniformly over the entire class. The lower bounds also show that the minimax rate for any of these subclasses is no better than for the full SST class.
Related work:
The literature on ranking and estimation from pairwise comparisons is vast, and we refer the reader to various surveys [FV93, Mar96, Cat12] for a more detailed overview. Here we focus our literature review on those papers that are most closely related to our contributions. Some recent work [NOS12, HOX14, SBB16] studies procedures and minimax rates for estimating the latent quality vector that underlies parametric models. Theorem 4 in this paper provides an extension of these results, in particular by showing that an optimal estimate of the latent quality vector can be used to construct an optimal estimate of the pairwise comparison probabilities. Chatterjee [Cha14] analyzed matrix estimation based on singular value thresholding, and obtained results for the class of SST matrices. In Theorem 2, we provide a sharper analysis of this estimator, and show that our upper bound is—in fact—unimprovable.
In past work, various authors [KMS07, BM08] have considered the noisy sorting problem, in which the goal is to infer the underlying order under a so-called high signal-to-noise ratio (SNR) condition. The high SNR condition means that each pairwise comparison has a probability of agreeing with the underlying order that is bounded away from 1/2 by a fixed constant. Under this high SNR condition, these authors provide polynomial-time algorithms that, with high probability, return an estimate of the true underlying order with a prescribed accuracy. Part of our analysis leverages an algorithm from the paper [BM08]; in particular, we extend their analysis in order to provide guarantees for estimating pairwise comparison probabilities as opposed to estimating the underlying order.
As will be clarified in the sequel, the assumption of strong stochastic transitivity has close connections to the statistical literature on shape-constrained inference (e.g., [SS11]), particularly to the problem of bivariate isotonic regression. In our analysis of the least-squares estimator, we leverage metric entropy bounds from past work in this area (e.g., [GW07, CGS15]).
In Appendix D of the present paper, we study estimation under two popular models that are closely related to the SST class but make even weaker assumptions. We show that under these moderate stochastic transitivity (MST) and weak stochastic transitivity (WST) models, the Frobenius norm error of any estimator, measured in a uniform sense over the class, must be almost as bad as that incurred by making no assumptions whatsoever. Consequently, from a statistical point of view, these assumptions are not strong enough to yield reductions in estimation error. We note that the "low noise model" studied in the paper [RA14] is identical to the WST class.
Organization:
The remainder of the paper is organized as follows. We begin by providing a background and a formal description of the problem in Section 2. In Section 3, we present the main theoretical results of the paper. We then present results from numerical simulations in Section 4. We present proofs of our main results in Section 5. We conclude the paper in Section 6.
2 Background and problem formulation
Given a collection of n items, suppose that we have access to noisy comparisons between any pair of distinct items. The full set of all possible pairwise comparisons can be described by an n x n probability matrix M*, in which entry M*_{ij} is the probability that item i is preferred to item j. The upper and lower halves of the probability matrix are related by the shifted-skew-symmetry condition M*_{ij} + M*_{ji} = 1 for all pairs (i, j); in other words, the shifted matrix M* - (1/2) 11^T is skew-symmetric. For concreteness, we set M*_{ii} = 1/2 for all i.

2.1 Estimation of pairwise comparison probabilities
For any matrix M* with M*_{ij} in [0, 1] for every pair (i, j), suppose that we observe a random matrix Y with (upper-triangular) independent Bernoulli entries; in particular, with Y_{ij} ~ Bernoulli(M*_{ij}) for every i < j, and Y_{ji} = 1 - Y_{ij}. Based on observing Y, our goal in this paper is to recover an accurate estimate, in the squared Frobenius norm, of the full matrix M*.

Our primary focus in this paper will be on the setting where for n items we observe the outcome of a single pairwise comparison for each pair. We will subsequently (in Section 3.5) also address the more general case when we have partial observations, that is, when each pairwise comparison is observed with a fixed probability.
For future reference, note that we can always write the Bernoulli observation model in the linear form

Y = M* + W,   (1)

where W is a random matrix with independent zero-mean entries, given for every i < j by

W_{ij} = 1 - M*_{ij} with probability M*_{ij}, and W_{ij} = -M*_{ij} with probability 1 - M*_{ij},   (2)

and W_{ji} = -W_{ij} for every i > j. This linearized form of the observation model is convenient for subsequent analysis.
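To make the observation model concrete, the following sketch (our own illustration, not code from the paper) draws a single comparison per pair from a given probability matrix; the function name sample_comparisons and the notation M, Y are our choices.

```python
import numpy as np

def sample_comparisons(M, rng=None):
    """Draw one pairwise comparison per pair from the Bernoulli model.

    M is an n x n matrix of comparison probabilities satisfying the
    shifted-skew-symmetry condition M + M.T = 11^T off the diagonal.
    Returns a matrix Y with Y[i, j] ~ Bernoulli(M[i, j]) independently
    for i < j, Y[j, i] = 1 - Y[i, j], and Y[i, i] = 1/2.
    """
    rng = np.random.default_rng(rng)
    n = M.shape[0]
    Y = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            Y[i, j] = rng.random() < M[i, j]   # Bernoulli(M[i, j])
            Y[j, i] = 1.0 - Y[i, j]            # enforce Y + Y.T = 11^T
    np.fill_diagonal(Y, 0.5)
    return Y
```

Averaging many independent draws recovers M entrywise, which is exactly the statement that the noise matrix W = Y - M* in the linearized form (1) has zero mean.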
2.2 Strong stochastic transitivity
Beyond the previously mentioned constraints on the matrix M*—namely that M* + (M*)^T = 11^T and M*_{ii} = 1/2—more structured and interesting models are obtained by imposing further restrictions on the entries of M*. We now turn to one such condition, known as strong stochastic transitivity (SST), which reflects the natural transitivity of any complete ordering. Formally, suppose that the full collection of n items is endowed with a complete ordering. We use the notation i ≻ j to convey that item i is preferred to item j in the total ordering. Consider some triple (i, j, k) such that i ≻ j. A matrix M* satisfies the SST condition if the inequality M*_{ik} >= M*_{jk} holds for every such triple. The intuition underlying this constraint is the following: since i dominates j in the true underlying order, when we make noisy comparisons, the probability that i is preferred to k should be at least as large as the probability that j is preferred to k. The SST condition was first described in the psychology literature (e.g., [Fis73, DM59]). (We note that the psychology literature has also considered what are known as weak and moderate stochastic transitivity conditions. From a statistical standpoint, pairwise comparison probabilities cannot be consistently estimated in a minimax sense under these conditions; we establish this formally in Appendix D.)
The SST condition is characterized by the existence of a permutation pi such that the permuted matrix has entries that increase across rows and decrease down columns. More precisely, for a given permutation pi, let us say that a matrix M is pi-faithful if for every pair (i, k) such that pi(i) < pi(k), we have M_{ij} >= M_{kj} for all j. With this notion, the set of SST matrices is given by

C_SST := { M in [0, 1]^{n x n} : M + M^T = 11^T, and M is pi-faithful for some permutation pi }.   (3)

Note that, in conjunction with the shifted-skew-symmetry condition, the stated inequalities also guarantee that for any pair (j, l) with pi(j) < pi(l), we have M_{ij} <= M_{il} for all i, which corresponds to a form of column ordering. The class C_SST is our primary focus in this paper.
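The faithfulness condition can be checked mechanically. The sketch below (our own, with hypothetical function names) tests pi-faithfulness, and brute-forces over all n! permutations to test SST membership; the exhaustive search is only feasible for tiny illustrative examples, which is precisely the computational obstacle discussed later in the paper. The check assumes its input already satisfies M + M^T = 11^T.

```python
import numpy as np
from itertools import permutations

def is_faithful(M, perm):
    """Re-order rows/columns of M by perm (strongest item first) and
    check that entries are non-increasing down every column, which is
    the definition of a perm-faithful matrix."""
    P = np.asarray(M)[np.ix_(perm, perm)]
    return bool(np.all(np.diff(P, axis=0) <= 1e-12))

def is_sst(M):
    """Brute-force SST membership test: search all n! orderings.
    Exponential in n; intended only for tiny examples."""
    n = np.asarray(M).shape[0]
    return any(is_faithful(M, p) for p in permutations(range(n)))
```

A cyclic ("rock-paper-scissors") matrix fails the test under every ordering, illustrating that SST genuinely restricts the class.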
2.3 Classical parametric models
Let us now describe a family of classical parametric models, one which includes the Bradley-Terry-Luce and Thurstone (Case V) models [BT52, Luc59, Thu27]. In the sequel, we show that these parametric models induce a relatively small subclass of the SST matrices C_SST.
In more detail, parametric models are described by a weight vector w in R^n that corresponds to the notional qualities of the n items. Moreover, consider any non-decreasing function F: R -> [0, 1] such that F(t) = 1 - F(-t) for all t; we refer to any such function as being valid. Any such pair (w, F) induces a particular pairwise comparison model in which

M_{ij} = F(w_i - w_j).   (4)
For each valid choice of F, we define

C_PAR(F) := { M in [0, 1]^{n x n} : M_{ij} = F(w_i - w_j) for some w in R^n }.   (5)
For any choice of valid F, it is straightforward to verify that C_PAR(F) is a subset of C_SST, meaning that any matrix induced by the relation (4) satisfies all the constraints defining the set C_SST. As particular important examples, we recover the Thurstone model by setting F(t) = Phi(t), where Phi is the standard Gaussian CDF, and the Bradley-Terry-Luce model by setting F(t) = 1/(1 + e^{-t}), corresponding to the sigmoid function.
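The construction (4) is easy to reproduce in code. The sketch below is our own illustration of the two examples just named; the helper names parametric_matrix, thurstone_F, and btl_F are ours.

```python
import numpy as np
from math import erf

def parametric_matrix(w, F):
    """Build M[i, j] = F(w[i] - w[j]) for a valid comparison function F,
    as in Equation (4)."""
    w = np.asarray(w, dtype=float)
    return np.vectorize(F)(w[:, None] - w[None, :])

def thurstone_F(t):
    """Standard Gaussian CDF, giving the Thurstone (Case V) model."""
    return 0.5 * (1.0 + erf(t / 2.0 ** 0.5))

def btl_F(t):
    """Sigmoid, giving the Bradley-Terry-Luce model."""
    return 1.0 / (1.0 + np.exp(-t))
```

When the entries of w are sorted in decreasing order, the resulting matrix is faithful to the identity permutation, consistent with C_PAR(F) being a subset of C_SST.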
Remark:
One can impose further constraints on the vector w without reducing the size of the class C_PAR(F). In particular, since the pairwise probabilities depend only on the differences w_i - w_j, we can assume without loss of generality that the entries of w sum to zero. Moreover, since the choice of F can include rescaling its argument, we can also fix the scale of w. Accordingly, we assume in our subsequent analysis that w belongs to the set of weight vectors that are centered and of this fixed scale.
2.4 Inadequacies of parametric models
As noted in the introduction, a large body of past work (e.g., [DM59, ML65, Tve72, BW97]) has shown that parametric models, of the form (5) for some choice of F, often provide poor fits to real-world data. What might be a reason for this phenomenon? Roughly, parametric models impose the very restrictive assumption that the choice of an item depends on the value of a single latent factor (as indexed by the weight vector w)—e.g., that our preference for cars depends only on their fuel economy, or that the probability that one hockey team beats another depends only on the skills of the goalkeepers.
This intuition can be formalized to construct matrices that are far away from every valid parametric approximation as summarized in the following result:
Proposition 1.
There exists a universal constant c > 0 such that for every valid F, there exist matrices M* in C_SST for which

(1/n^2) inf_{M in C_PAR(F)} ||M* - M||_F^2 >= c.   (6)
Given that every entry of matrices in C_SST lies in the interval [0, 1], the normalized squared Frobenius diameter of the class is at most 1, so that the constant-order scaling of the lower bound (6) cannot be sharpened. See Appendix B for a proof of Proposition 1.
What sort of matrices are "bad" in the sense of satisfying a lower bound of the form (6)? Panel (a) of Figure 1 describes the construction of one "bad" matrix M*. In order to provide some intuition, let us return to the analogy of rating cars. A key property of any parametric model is that if we prefer car 1 to car 2 more than we prefer car 3 to car 4, then we must also prefer car 1 to car 3 more than we prefer car 2 to car 4 (this condition follows from the proof of Proposition 1). This condition is potentially satisfied if there is a single determining factor across all cars—for instance, their fuel economy.
This ordering condition is, however, violated by the pairwise comparison matrix from Figure 1(a). Such an occurrence can be explained by a simple two-factor model: suppose the fuel economies of the four cars decrease from car 1 to car 4, with adjacent cars separated by less than 10 kilometers per liter and more distant pairs separated by at least 10, and that the comfort levels of the four cars are also ordered. Suppose that in a pairwise comparison of two cars, if one car is more fuel efficient by at least 10 kilometers per liter, it is always chosen; otherwise the choice is governed by a parametric choice model acting on the respective comfort levels of the two cars. Observe that while the comparisons between adjacent pairs of cars can be explained by this parametric model acting on their respective comfort levels, the preferences between the pairs of cars that are farther apart are governed by their fuel economies. This two-factor model accounts for pairwise comparison probabilities that violate the parametric requirements.
While this was a simple hypothetical example, there is a more ubiquitous phenomenon underlying it. It is often the case that our preferences are decided by comparing items along a multitude of dimensions. In any situation where a single (latent) parameter per item does not adequately explain our preferences, one can expect the class of parametric models to provide a poor fit to the pairwise preference probabilities.



[Figure 1: (a) Construction of a "bad" SST matrix that is far from every parametric model. (b) Simulation comparing the errors of the Thurstone maximum likelihood estimator and the SVT estimator on data generated from such a matrix.]
The lower bound on approximation quality guaranteed by Proposition 1 means that any parametric estimator of such a matrix M* should perform poorly. This expectation is confirmed by the simulation results in panel (b) of Figure 1. After generating observations from a "bad matrix" M* over a range of values of n, we fit the data set using either the maximum likelihood estimate in the Thurstone model, or the singular value thresholding (SVT) estimator, to be discussed in Section 3.2. For each estimator Mhat, we plot the rescaled Frobenius norm error (1/n^2) ||Mhat - M*||_F^2 versus the sample size. Consistent with the lower bound (6), the error in the Thurstone-based estimator stays bounded above a universal constant. In contrast, the SVT error goes to zero with n, and as our theory in the sequel shows, the rate at which the error decays is at least as fast as n^{-1/2}.
3 Main results
Thus far, we have introduced two classes of models for matrices of pairwise comparison probabilities. Our main results characterize the rates of estimation associated with different subsets of these classes, using either optimal estimators that we suspect are not polynomial-time computable, or more computationally efficient estimators that can be computed in polynomial time.
3.1 Characterization of the minimax risk
We begin by providing a result that characterizes the minimax risk in squared Frobenius norm over the class C_SST of SST matrices. The minimax risk is defined by taking an infimum over the set of all possible estimators, which are measurable functions Y -> Mhat(Y). Here the data matrix Y is drawn from the observation model (1).
Theorem 1.
There are universal constants 0 < c_l <= c_u such that

c_l / n <= inf_{Mhat} sup_{M* in C_SST} (1/n^2) E[ ||Mhat - M*||_F^2 ] <= c_u (log n)^2 / n,   (7)

where the infimum ranges over all measurable functions Y -> Mhat(Y) of the observed matrix Y.
We prove this theorem in Section 5.1. The proof of the lower bound is based on extracting a particular subset of the class C_SST such that any matrix in this subset has order n positions that are unconstrained, apart from having to belong to a fixed sub-interval of [0, 1]. We can thus conclude that estimation of the full matrix is at least as hard as estimating order n Bernoulli parameters belonging to this interval based on a single observation per parameter. This reduction leads to a lower bound of the order 1/n, as stated.
Proving an upper bound requires substantially more effort. In particular, we establish it via careful analysis of the constrained least-squares estimator

Mhat_LS in argmin_{M in C_SST} ||Y - M||_F^2.   (8a)

In particular, we prove that there are universal constants c_1, c_2 such that, for any matrix M* in C_SST, this estimator satisfies the high probability bound

P[ (1/n^2) ||Mhat_LS - M*||_F^2 >= c_1 (log n)^2 / n ] <= e^{-c_2 n}.   (8b)
Since the entries of Mhat_LS and M* all lie in the interval [0, 1], integrating this tail bound leads to the stated upper bound (7) on the expected mean-squared error. Proving the high probability bound (8b) requires sharp control on a quantity known as the localized Gaussian complexity of the class C_SST. We use Dudley's entropy integral (e.g., [VDVW96, Corollary 2.2.8]) in order to derive an upper bound that is sharp up to a logarithmic factor; doing so in turn requires deriving upper bounds on the metric entropy of the class C_SST, for which we leverage the prior work of Gao and Wellner [GW07].
We do not know whether the constrained least-squares estimator (8a) is computable in time polynomial in n, but we suspect not. This complexity is a consequence of the fact that the set C_SST is not convex, but is instead a union of convex sets, one for each of the n! possible permutations. Given this issue, it becomes interesting to consider the performance of alternative estimators that can be computed in polynomial time.
3.2 Sharp analysis of singular value thresholding (SVT)
The first polynomial-time estimator that we consider is a simple estimator based on thresholding the singular values of the observed matrix Y, and reconstructing its truncated singular value decomposition. For the full class C_SST, Chatterjee [Cha14] analyzed the performance of such an estimator and proved that the squared Frobenius error decays as n^{-1/4} uniformly over C_SST. In this section, we prove that its error decays as n^{-1/2}, again uniformly over C_SST, and moreover, that this upper bound is unimprovable.

Let us begin by describing the estimator. Given the observation matrix Y, we can write its singular value decomposition as Y = U D V^T, where the matrix D is diagonal, whereas the matrices U and V are orthonormal. Given a threshold level lambda > 0, the soft-thresholded version T_lambda(D) of the diagonal matrix D is the diagonal matrix with values

[T_lambda(D)]_{jj} = max{ D_{jj} - lambda, 0 }.   (9)

With this notation, the soft singular-value-thresholded (soft-SVT) version of Y is given by Mhat_lambda = U T_lambda(D) V^T. The following theorem provides a bound on its squared Frobenius error:
Theorem 2.
There are universal positive constants c_0, c_1, c_2, c_3 such that the soft-SVT estimator with lambda = c_0 sqrt(n) satisfies the bound

(1/n^2) ||Mhat_lambda - M*||_F^2 <= c_1 / sqrt(n) with probability at least 1 - e^{-c_2 n},   (10a)

for any M* in C_SST. Moreover, for any choice of lambda, we have

sup_{M* in C_SST} (1/n^2) ||Mhat_lambda - M*||_F^2 >= c_3 / sqrt(n).   (10b)
A few comments on this result are in order. Since the matrices Mhat_lambda and M* have entries in the unit interval [0, 1], the normalized squared error is at most 1. Consequently, by integrating the tail bound (10a), we find that the expected normalized squared error is also bounded by a constant multiple of n^{-1/2}. On the other hand, the matching lower bound (10b) holds with probability one, meaning that the soft-SVT estimator has squared error of the order n^{-1/2}, irrespective of the realization of the noise.
To be clear, Chatterjee [Cha14] actually analyzed the hard-SVT estimator, which is based on the hard-thresholding operator

[H_lambda(D)]_{jj} = D_{jj} 1{ D_{jj} >= lambda }.

Here 1{.} denotes the 0-1-valued indicator function. In this setting, the hard-SVT estimator is simply U H_lambda(D) V^T. With essentially the same choice of lambda as above, Chatterjee showed that this estimate has a mean-squared error of the order n^{-1/4}. One can verify that the proof of Theorem 2 in our paper goes through for the hard-SVT estimator as well. Consequently, the performance of the hard-SVT estimator is of the order n^{-1/2}, and is identical to that of the soft-thresholded version up to universal constants.
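Both SVT variants are a few lines of code on top of a standard SVD. The sketch below is our own illustration, not the authors' implementation; the threshold is a constant multiple of sqrt(n), matching the order prescribed above, but the particular constant 2.1 used in the test is an assumption of ours.

```python
import numpy as np

def svt_estimate(Y, lam, soft=True):
    """Singular value thresholding: write Y = U diag(s) V^T, threshold
    the singular values at level lam, and reconstruct."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if soft:
        s = np.maximum(s - lam, 0.0)     # shrink every singular value
    else:
        s = np.where(s >= lam, s, 0.0)   # keep large values intact
    return (U * s) @ Vt
```

On data drawn from a parametric (hence SST) matrix, both variants denoise substantially relative to the raw observation matrix.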
Note that the hard- and soft-SVT estimators return matrices that may not lie in the SST class C_SST. In a companion paper [SBW16], we provide an alternative computationally efficient estimator with similar statistical guarantees that is guaranteed to return a matrix in the SST class.
Together, the upper and lower bounds of Theorem 2 provide a sharp characterization of the behavior of the soft/hard SVT estimators. On the positive side, these are easily implementable estimators that achieve a mean-squared error of the order n^{-1/2} uniformly over the entire class C_SST. On the negative side, this rate is slower than the near-1/n rate achieved by the least-squares estimator, as in Theorem 1.
In conjunction, Theorems 1 and 2 raise a natural open question: is there a polynomial-time estimator that achieves the minimax rate uniformly over the family C_SST? We do not know the answer to this question, but the following subsections provide some partial answers by analyzing polynomial-time estimators that (up to logarithmic factors) achieve the optimal rate over some interesting subclasses of C_SST. In the next two sections, we turn to results of this type.
3.3 Optimal rates for high SNR subclass
In this section, we describe a multi-step polynomial-time estimator that (up to logarithmic factors) can achieve the optimal rate over an interesting subclass of C_SST. This subset corresponds to matrices that have a relatively high signal-to-noise ratio (SNR), meaning that no entries of M* fall within a certain window of 1/2. More formally, for some gamma in (0, 1/2], we define the class

C_HIGH(gamma) := { M in C_SST : |M_{ij} - 1/2| >= gamma for all pairs (i, j) with i != j }.   (11)
By construction, for any matrix M* in C_HIGH(gamma), the amount of information contained in each observation is bounded away from zero uniformly in n, as opposed to matrices in which some large subset of entries have values equal (or arbitrarily close) to 1/2. In terms of estimation difficulty, this SNR restriction does not make the problem substantially easier: as the following theorem shows, the minimax mean-squared error remains lower bounded by a constant multiple of 1/n. Moreover, we can demonstrate a polynomial-time algorithm that achieves this optimal mean-squared error up to logarithmic factors.
The following theorem applies to any fixed gamma independent of n, and involves constants that may depend on gamma but are independent of n.
Theorem 3.
There is a constant c_l > 0 such that

inf_{Mhat} sup_{M* in C_HIGH(gamma)} (1/n^2) E[ ||Mhat - M*||_F^2 ] >= c_l / n,   (12a)

where the infimum ranges over all estimators. Moreover, there is a two-stage estimator Mhat, computable in polynomial time, for which

(1/n^2) ||Mhat - M*||_F^2 <= c_1 (log n)^2 / n with high probability,   (12b)

valid for any M* in C_HIGH(gamma).
As before, since the ratio (1/n^2) ||Mhat - M*||_F^2 is at most 1, the tail bound (12b) implies that

sup_{M* in C_HIGH(gamma)} (1/n^2) E[ ||Mhat - M*||_F^2 ] <= c_2 (log n)^2 / n.   (13)
We provide the proof of this theorem in Section 5.3. As with our proof of the lower bound in Theorem 1, we prove the lower bound by considering the subclass of matrices that are free only on the two diagonals just above and below the main diagonal. We now provide a brief sketch for the proof of the upper bound (12b). It is based on analyzing the following two-step procedure:


In the first step of the algorithm, we find a permutation of the items that minimizes the total number of disagreements with the observations. (For a given ordering, we say that a pair of items (i, j) is in disagreement with the observation if either i is rated higher than j in the ordering and Y_{ij} = 0, or i is rated lower than j in the ordering and Y_{ij} = 1.) The problem of finding such a disagreement-minimizing permutation is commonly known as the minimum feedback arc set (FAS) problem. It is known to be NP-hard in the worst case [ACN08, Alo06], but our setup has additional probabilistic structure that allows for polynomial-time solutions with high probability. In particular, we call upon a polynomial-time algorithm due to Braverman and Mossel [BM08] that, under the model (11), is guaranteed to find the exact solution to the FAS problem with high probability. Viewing the FAS permutation pihat as an approximation to the true permutation pi*, the novel technical work in this first step is to show that pihat is "good enough" for Frobenius norm estimation, in the sense that for any matrix M* in C_HIGH(gamma), it satisfies the bound

(1/n^2) ||M*_{pihat} - M*||_F^2 <= c (log n)^2 / n   (14a)

with high probability. In this statement, for any given permutation pi, we have used M*_{pi} to denote the matrix obtained by permuting the rows and columns of M* by pi. The quantity in the bound (14a) can be viewed in some sense as the bias in estimation incurred from using pihat in place of pi*.
Next we define C_BISO as the class of "bivariate isotonic" matrices, that is, matrices M that satisfy the linear constraints M_{ij} + M_{ji} = 1 for all pairs (i, j), and M_{ij} >= M_{kl} whenever i <= k and j >= l. This class corresponds to the subset of matrices in C_SST that are faithful with respect to the identity permutation. Letting C_BISO(pihat) denote the image of this set under pihat, the second step involves computing the constrained least-squares estimate

Mhat in argmin_{M in C_BISO(pihat)} ||Y - M||_F^2.   (14b)

Since the set C_BISO(pihat) is a convex polytope, with a number of facets that grows polynomially in n, the constrained quadratic program (14b) can be solved in polynomial time. The final step in the proof of Theorem 3 is to show that the estimator Mhat also has mean-squared error that is upper bounded by a constant multiple of (log n)^2 / n.
Our analysis shows that for any fixed gamma > 0, the proposed two-stage estimator works well for any matrix M* in C_HIGH(gamma). Since this two-stage estimator is based on finding a minimum feedback arc set (FAS) in the first step, it is natural to wonder whether an FAS-based estimator works well over the full class C_SST. Somewhat surprisingly, the answer to this question turns out to be negative: we refer the reader to Appendix C for more intuition and details on why the minimal FAS estimator does not perform well over the full class.
3.4 Optimal rates for parametric subclasses
Let us now return to the class of parametric models introduced earlier in Section 2.3. As shown previously in Proposition 1, this class is much smaller than the class C_SST, in the sense that there are models in C_SST that cannot be well-approximated by any parametric model. Nonetheless, in terms of minimax rates of estimation, these classes differ only by logarithmic factors. An advantage of the parametric class is that it is possible to achieve the minimax rate by using a simple, polynomial-time estimator. In particular, for any log-concave function F, the maximum likelihood estimate of the weight vector w can be obtained by solving a convex program. This MLE induces a matrix estimate via Equation (4), and the following result shows that this estimator is minimax-optimal up to constant factors.
Theorem 4.
Suppose that F is strictly increasing, strongly log-concave and twice differentiable. Then there is a constant c_l > 0, depending only on F, such that the minimax risk over C_PAR(F) is lower bounded as

inf_{Mhat} sup_{M* in C_PAR(F)} (1/n^2) E[ ||Mhat - M*||_F^2 ] >= c_l / n.   (15a)

Conversely, there is a constant c_u, depending only on F, such that the matrix estimate Mhat induced by the MLE satisfies the bound

sup_{M* in C_PAR(F)} (1/n^2) E[ ||Mhat - M*||_F^2 ] <= c_u / n.   (15b)
To be clear, the constants in this theorem are independent of n, but they do depend on the specific properties of the given function F. We note that the stated conditions on F hold for many popular parametric models, including (for instance) the Thurstone and BTL models.
We provide the proof of Theorem 4 in Section 5.4. The lower bound (15a) is, in fact, stronger than the lower bound in Theorem 1, since the supremum is taken over a smaller class. The proof of the lower bound in Theorem 1 relies on matrices that cannot be realized by any parametric model, so we pursue a different route to establish the bound (15a). On the other hand, in order to prove the upper bound (15b), we make use of bounds on the accuracy of the MLE from our own past work (see the paper [SBB16]).
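For the BTL special case, the log-likelihood is concave in w, so the MLE can be computed by simple gradient ascent. The sketch below is our own illustration (with step size, iteration count, and the function name btl_mle chosen by us), not the solver used in the paper's experiments.

```python
import numpy as np

def btl_mle(Y, n_iters=500, lr=0.1):
    """Maximize the BTL log-likelihood by gradient ascent.  Y is a full
    0/1 comparison matrix with Y + Y^T = 11^T off the diagonal; the
    problem is concave in w, so small gradient steps converge."""
    n = Y.shape[0]
    w = np.zeros(n)
    off_diag = ~np.eye(n, dtype=bool)
    for _ in range(n_iters):
        P = 1.0 / (1.0 + np.exp(-(w[:, None] - w[None, :])))
        grad = ((Y - P) * off_diag).sum(axis=1)  # d/dw_i of log-likelihood
        w += lr * grad / n
        w -= w.mean()                            # identifiability: center w
    return w
```

The fitted scores induce a matrix estimate via Equation (4); on data generated from a BTL model, this plug-in estimate improves substantially on the raw observation matrix.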
3.5 Extension to partial observations
We now consider the extension of our results to the setting in which not all entries of Y are observed. Suppose instead that every entry of Y above the diagonal is observed independently with some probability p. In other words, the set of pairs compared is the set of edges of an Erdős-Rényi graph that has the n items as its vertices.
In this setting, we obtain an upper bound on the minimax risk of estimation by first setting Y_{ij} = 1/2 whenever the pair (i, j) is not compared, then forming a new matrix Y' as

Y' = (1/p) ( Y - (1/2) 11^T ) + (1/2) 11^T,   (16a)

and finally computing the least-squares solution

Mhat_LS in argmin_{M in C_SST} ||Y' - M||_F^2.   (16b)
Likewise, the computationally efficient singular value thresholding estimator is also obtained by thresholding the singular values of Y' with a threshold lambda of the order sqrt(n/p). See our discussion following Theorem 5 for the motivation underlying the transformed matrix Y'.
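The rescaling step can be sanity-checked numerically: filling unobserved pairs with 1/2 and rescaling makes the result an unbiased estimate of M* entrywise. The following sketch is our own; for brevity it treats entries independently and ignores the skew-symmetric coupling between (i, j) and (j, i).

```python
import numpy as np

def debias_partial(Y_obs, observed, p):
    """Given comparisons Y_obs and a boolean mask marking which pairs
    were actually observed (each with probability p), fill unobserved
    pairs with 1/2 and rescale so that E[Y'] = M* entrywise."""
    Y = np.where(observed, Y_obs, 0.5)
    return (Y - 0.5) / p + 0.5
```

With p = 1 the transform reduces to the identity, recovering the fully observed model of Section 2.1.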
The parametric estimators continue to operate on the original (partial) observations, first computing a maximum likelihood estimate of the weight vector w using the observed data, and then computing the associated pairwise-comparison-probability matrix via (4).
Theorem 5.
In the setting where each pair is observed with a probability p, there are positive universal constants c_0, c_l, c_u, c_1, c_2 and c_3 such that:

(a) The minimax risk is sandwiched as

c_l / (np) <= inf_{Mhat} sup_{M* in C_SST} (1/n^2) E[ ||Mhat - M*||_F^2 ] <= c_u (log n)^2 / (np),   (17a)

when np >= c_3 log n.

(b) The soft-SVT estimator, with lambda = c_0 sqrt(n/p), satisfies the bound

sup_{M* in C_SST} (1/n^2) E[ ||Mhat_lambda - M*||_F^2 ] <= c_1 / sqrt(np).   (17b)

(c) For a parametric subclass C_PAR(F) based on a strongly log-concave and smooth F, the estimator Mhat induced by the maximum likelihood estimate of the parameter vector has mean-squared error upper bounded as

sup_{M* in C_PAR(F)} (1/n^2) E[ ||Mhat - M*||_F^2 ] <= c_2 / (np),   (17c)

when np >= c_3 log n.
The intuition behind the transformation (16a) is that the matrix Y' can equivalently be written in the linearized form

Y' = M* + W',   (18a)

where W' has entries that are independent on and above the diagonal, satisfy the skew-symmetry condition W'_{ji} = -W'_{ij}, and are distributed as

W'_{ij} = 1/2 + 1/(2p) - M*_{ij} with probability p M*_{ij}, W'_{ij} = 1/2 - 1/(2p) - M*_{ij} with probability p (1 - M*_{ij}), and W'_{ij} = 1/2 - M*_{ij} with probability 1 - p.   (18b)
The proofs of the upper bounds exploit the specific relation (18a) between the observations and the true matrix , and the specific form of the additive noise (18b).
The result of Theorem 5(b) yields an affirmative answer to the question, originally posed by Chatterjee [Cha14], of whether or not the singular value thresholding estimator can yield a vanishing error when the observation probability p decays to zero (provided that np grows).
We note that we do not have an analogue of the high-SNR result in the partial observations case, since having partial observations reduces the SNR. In general, we are interested in scalings of p which allow p -> 0 as n grows. The noisy-sorting algorithm of Braverman and Mossel [BM08] for the high-SNR case has computational complexity that, though polynomial in n, has a degree that grows as the SNR parameter shrinks; hence it is not computable in time polynomial in n when the effective SNR decays with n. This restriction disallows most interesting scalings of p with n.
4 Simulations
In this section, we present results from simulations to gain a further
understanding of the problem at hand, in particular to understand the
rates of estimation under specific generative models. We investigate
the performance of the soft-SVT estimator (Section 3.2) and the maximum likelihood estimator under the Thurstone model (Section 2.3). (We could not compare against the algorithm that underlies Theorem 3, since it is not easily implementable: it relies on the algorithm due to Braverman and Mossel [BM08] to compute the feedback arc set minimizer, and the computational complexity of this algorithm, though polynomial in , has a polynomial degree large enough to preclude implementation for matrices of any reasonable size.)
The simulations in this section complement the simulation results of Section 2.4 (Figure 1), which demonstrated a large class of matrices in the SST class that cannot be represented by any parametric class. The output of the SVT estimator need not lie in the set of matrices; in our implementation, we project the output of the SVT estimator onto this set, which gives a constant-factor reduction in the error.
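A minimal sketch of this pipeline is given below. The toy low-rank signal, the noise level, and the threshold choice are our own assumptions for illustration, and clipping entries to [0, 1] stands in here for the full projection onto the constraint set.

```python
import numpy as np

def soft_svt(Y, lam):
    """Soft singular-value thresholding: shrink every singular value by lam."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(1)
n, sigma = 100, 0.2

# Toy low-rank matrix plus Gaussian noise, standing in for the actual
# observation model; the threshold is set at the noise operator-norm
# scale, 2 * sigma * sqrt(n).
u = rng.normal(size=(n, 1))
M = 0.5 + 0.1 * (u @ u.T) / n
Y = M + sigma * rng.normal(size=(n, n))

M_hat = np.clip(soft_svt(Y, lam=2.0 * sigma * np.sqrt(n)), 0.0, 1.0)

# Per-entry squared error, typically far below the raw noise level sigma**2.
err = np.linalg.norm(M_hat - M) ** 2 / n**2
```

Because the threshold sits at the operator-norm scale of the noise, almost all noise directions are zeroed out while the dominant signal directions survive, up to a shrinkage bias of order lam on each retained singular value.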
In our simulations, we generate the ground truth in the following five ways:


Uniform: The matrix is generated by drawing values independently and uniformly at random in and sorting them in descending order. The values are then inserted above the diagonal of an matrix such that the entries decrease down a column or left along a row. We then make the matrix skew-symmetric and permute the rows and columns.

Thurstone: The matrix is generated by first choosing uniformly at random from the set satisfying . The matrix is then generated from via Equation (4) with
chosen as the CDF of the standard normal distribution.

Bradley-Terry-Luce (BTL): Identical to the Thurstone case, except that is given by the sigmoid function.

High SNR: A setting studied previously by Braverman and Mossel [BM08], in which the noise is independent of the items being compared. Some global order is fixed over the items, and the comparison matrix takes the values for every pair where is ranked above in the underlying ordering. The entries on the diagonal are .

Independent bands: The matrix is chosen with diagonal entries all equal to . Entries on the diagonal band immediately above the diagonal itself are chosen i.i.d. and uniformly at random from . The band above it is then chosen uniformly at random from the allowable set, and so on. The choice of any entry in this process is constrained only to be upper bounded by and lower bounded by the entries to its left and below. The entries below the diagonal are chosen to make the matrix skew-symmetric.
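The Uniform construction above can be sketched as follows. This is a minimal illustration with our own variable names; filling the upper triangle level by level, moving away from the top-right corner, is simply one convenient way to enforce the required monotonicity, since every later level receives only values smaller than those of every earlier level.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

# Draw n(n-1)/2 values uniformly at random in [1/2, 1], sorted descending.
vals = np.sort(rng.uniform(0.5, 1.0, size=n * (n - 1) // 2))[::-1]

# Fill the strict upper triangle level by level (by distance from the corner
# (0, n-1)), so entries decrease down each column and leftward along each row.
positions = sorted(
    ((i, j) for i in range(n) for j in range(i + 1, n)),
    key=lambda ij: ij[0] + (n - 1 - ij[1]),
)
M_sst = np.full((n, n), 0.5)
for v, (i, j) in zip(vals, positions):
    M_sst[i, j] = v
    M_sst[j, i] = 1.0 - v  # shifted skew-symmetry: M[j, i] = 1 - M[i, j]

# Finally, apply a common random permutation to the rows and columns.
perm = rng.permutation(n)
M = M_sst[np.ix_(perm, perm)]
```

The matrix `M_sst` is SST under the identity permutation, and `M` is the same matrix with its underlying ordering hidden by the random permutation.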
Figure 2 depicts the results of the simulations based on observations of the entire matrix . Each point is an average across trials. The error bars in most cases are too small to be visible. We see that the uniform case (Figure 2(a)) is favorable for both estimators, with the error scaling as . With data generated from the Thurstone model, both estimators continue to perform well, and the Thurstone MLE yields an error of the order (Figure 2(b)). Interestingly, the Thurstone model also fits relatively well when data is generated via the BTL model (Figure 2(c)). This behavior is likely a result of operating in the near-linear regime of the logistic and the Gaussian CDF, where the two curves are similar. In these two parametric settings, the SVT estimator has squared error strictly worse than order but better than . The Thurstone model, however, yields a poor fit in the high-SNR (Figure 2(d)) and the independent-bands (Figure 2(e)) cases, incurring a constant error as compared to an error scaling as for the SVT estimator. We recall that the poor performance of the Thurstone estimator was also described previously in Proposition 1 and Figure 1.
In summary, we see that while the Thurstone MLE gives minimax-optimal rates of estimation when the underlying model is parametric, it can be inconsistent when the parametric assumptions are violated. On the other hand, the SVT estimator is robust to violations of parametric assumptions, and while it does not necessarily achieve minimax-optimal rates, it remains consistent across the entire SST class. Finally, we remark that our theory predicts that the least squares estimator, if implementable, would outperform both of these estimators in terms of statistical error.
5 Proofs of main results
This section is devoted to the proofs of our main results, namely Theorems 1 through 5. Throughout these and other proofs, we use the notation and so on to denote positive constants whose values may change from line to line. In addition, we assume throughout that is lower bounded by a universal constant so as to avoid degeneracies. For any square matrix , we let denote its singular values (ordered from largest to smallest), and similarly, for any symmetric matrix , we let
denote its ordered eigenvalues. The identity permutation is one where item
is the most preferred item, for every .
5.1 Proof of Theorem 1
This section is devoted to the proof of Theorem 1, including both the upper and lower bounds on the minimax risk in squared Frobenius norm.
5.1.1 Proof of upper bound
Define the difference matrix  between and the optimal solution to the constrained least-squares problem. Since is optimal and is feasible, we must have , and hence, after some algebra, we arrive at the basic inequality
(19) 
where is the noise matrix in the observation model (1), and denotes the trace inner product.
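The intermediate algebra can be spelled out; here is a sketch, writing $\widehat{M}$ for the least-squares solution, $M^*$ for the true matrix, and $\widehat{\Delta} = \widehat{M} - M^*$ (notation chosen for this sketch):

```latex
% Optimality of \widehat{M} and feasibility of M^* give
\|Y - \widehat{M}\|_F^2 \;\le\; \|Y - M^*\|_F^2 .
% Substituting Y = M^* + W and Y - \widehat{M} = W - \widehat{\Delta}:
\|W - \widehat{\Delta}\|_F^2 \;\le\; \|W\|_F^2
\quad\Longleftrightarrow\quad
\|\widehat{\Delta}\|_F^2 - 2\,\langle\!\langle W, \widehat{\Delta} \rangle\!\rangle \;\le\; 0,
% which rearranges to the basic inequality
\tfrac{1}{2}\,\|\widehat{\Delta}\|_F^2 \;\le\; \langle\!\langle W, \widehat{\Delta} \rangle\!\rangle .
```

This is the standard argument for constrained least squares: expanding the square cancels the $\|W\|_F^2$ terms, leaving only the error norm and the noise-error inner product.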
We introduce some additional objects that are useful in our analysis. The class of bivariate isotonic matrices is defined as
(20) 
For a given permutation and matrix , we let denote the matrix obtained by applying to its rows and columns. We then define the set
(21) 
corresponding to the set of difference matrices. Note that by construction. One can verify that for any , we are guaranteed the inclusion
Consequently, the error matrix must belong to , and so must satisfy the properties defining this set. Moreover, as we discuss below, the set is star-shaped, and this property plays an important role in our analysis.
For each choice of radius
, define the random variable
(22) 
Using our earlier basic inequality (19), the Frobenius norm error then satisfies the bound
(23) 
Thus, in order to obtain a high probability bound, we need to understand the behavior of the random quantity .
One can verify that the set is star-shaped, meaning that for every and every . Using this star-shaped property, we are guaranteed that there is a non-empty set of scalars satisfying the critical inequality
(24) 
Our interest is in the smallest (strictly) positive solution to the critical inequality (24), and moreover, our goal is to show that for every , we have with probability at least .
For each , define the “bad” event as
(25) 
Using the star-shaped property of , it follows by a rescaling argument that
The entries of lie in , are i.i.d. on and above the diagonal, are zero-mean, and satisfy skew-symmetry. Moreover, the function is convex and Lipschitz with parameter . Consequently, from known concentration bounds (e.g., [Led01, Theorem 5.9], [Sam00]) for convex Lipschitz functions, we have
By the definition of , we have for any , and consequently
Consequently, either