1 Introduction
The problems of ranking and estimation from ordinal data arise in a variety of disciplines, including web search and information retrieval [DKNS01], crowdsourcing [CBCTH13], tournament play [HMG06], social choice theory [CN91] and recommender systems [BMR10]. The ubiquity of such datasets stems from the relative ease with which ordinal data can be obtained, and from the empirical observation that using pairwise comparisons as a means of data elicitation can lower the noise level in the observations [Bar03, SBC05].
Given that the number of items to be compared can be very large, it is often difficult or impossible to obtain comparisons between all pairs of items. A subset of pairs to compare, which defines the comparison topology, must therefore be chosen. For example, such topologies arise from tournament formats in sports, experimental designs in psychology set up to aid interpretability, or properties of the elicitation process. For instance, in rating movies, pairwise comparisons between items of the same genre are typically more abundant than comparisons between items of dissimilar genres. For these reasons, studying the performance of ranking algorithms based on fixed comparison topologies is of interest. Fixed comparison topologies are also important in rank breaking [HOX14, KO16], and more generally in matrix completion based on structured observations [KTT15, PABN16].
An important problem in ranking is the design of accurate models for capturing uncertainty in pairwise comparisons. Given a collection of $n$ items, the results of pairwise comparisons are completely characterized by the $n \times n$ matrix of comparison probabilities (a comparison probability refers to the probability that item $i$ beats item $j$ in a comparison between them), and various models have been proposed for such matrices. The most classical models, among them the Bradley-Terry-Luce [BT52, Luc59] and Thurstone [Thu27] models, assign a quality vector to the set of items, and assign pairwise probabilities by applying a cumulative distribution function to the difference of qualities associated to the pair. There is now a relatively large body of work on methods for ranking in such parametric models (e.g., see the papers [NOS16, HOX14, CS15, SBB16] as well as references therein).
In contrast, less attention has been paid to a richer class of models proposed decades ago in the sociology literature [Fis73, ML65], which impose a milder set of constraints on the pairwise comparison matrix. Rather than positing a quality vector, these models impose constraints that are typically given in terms of a latent permutation that rearranges the matrix into a specified form, and hence can be referred to as permutation-based models. Two such models that have been recently analyzed are those of strong stochastic transitivity [SBGW17], as well as the special case of noisy sorting [BM08]. The strong stochastic transitivity (SST) model, in particular, has been shown to offer significant robustness guarantees and provide a good fit to many existing datasets [BW97], and this flexibility has driven recent interest in understanding its properties. Also, perhaps surprisingly, past work has shown that this additional flexibility comes at only a small price when one has access to all possible pairwise comparisons, or more generally, to comparisons chosen at random [SBGW17]; in particular, the rates of estimation in these SST models differ from those in parametric models by only logarithmic factors in the number of items. On a related note, permutation-based models have also recently been shown to be useful in other settings like crowd-labeling [SBW16b], statistical seriation [FMR16][PWC16].
Given pairwise comparison data from one of these models, the problem of estimating the comparison probabilities has applications in inferring customer preferences in recommender systems, advertisement placement, and sports, and is the main focus of this paper.
Our Contributions:
Our goal is to estimate the matrix of comparison probabilities for fixed comparison topologies, studying both the noisy sorting and SST classes of matrices. Focusing first on the worst-case setting in which the assignment of items to the topology may be arbitrary, we show in Theorem 1 that consistent estimation is impossible for many natural comparison topologies. This result stands in sharp contrast to parametric models, and may be interpreted as a “no free lunch” theorem: although it is possible to estimate SST models at rates comparable to parametric models when given a full set of observations [SBGW17], the setting of fixed comparison topologies is problematic for the SST class. This can be viewed as a price to be paid for the additional robustness afforded by the SST model.
Seeing as such a worst-case design may be too strong for permutation-based models, we turn to an average-case setting in which the items are assigned to a fixed graph topology in a randomized fashion. Under such an observation model, we propose and analyze two efficient estimators: Theorems 2 and 4 show that consistent estimation is possible under commonly used comparison topologies. Moreover, the error rates of these estimators depend only on the degree sequence of the comparison topology, and are shown to be unimprovable for a large class of graphs, in Theorem 3.
Our results therefore establish a sharp distinction between worst-case and average-case designs when using fixed comparison topologies in permutation-based models. Such a phenomenon arises from the difference between minimax risk and Bayes risk under a uniform prior on the ranking, and may also be worth studying for other ranking models.
Related Work:
The literature on ranking and estimation from pairwise comparisons is vast, and we refer the reader to some surveys [FV93, Mar96, Cat12] and references therein for a more detailed overview. Estimation from pairwise comparisons has been analyzed under various metrics like top ranking [CS15, SW15, JKSO16, CGMS17] and comparison probability or parameter estimation [HOX14, SBB16, SBGW17]. There have been studies of these problems under active [JN11, HSRW16, MG15], passive [NOS16, RA16], and collaborative settings [PNZ15, NOTX17], and also for fixed as well as random comparison topologies [WJJ13, SBGW17]. Here we focus on the subset of papers that are most relevant to the work described here.
The problem of comparison probability estimation under a passively chosen fixed topology has been analyzed for parametric models by Hajek et al. [HOX14] and Shah et al. [SBB16]. Both papers analyze the worst-case design setting in which the assignment of items to the topology may be arbitrary, and derive bounds on the minimax risk of parameter (or equivalently, comparison probability) estimation. While their characterizations are not sharp in general, the rates are shown to depend on the spectrum of the Laplacian matrix of the topology. We point out an interesting consequence of both results: in the parametric model, provided that the comparison graph is connected, the maximum likelihood solution, in the limit of infinite samples for each graph edge, allows for exact recovery of the quality vector, and hence of the matrix of comparison probabilities. We will see that this property no longer holds for the SST models considered in this paper: there are comparison topologies and SST matrices for which it is impossible to recover the full matrix even given an infinite amount of data per graph edge. It is also worth mentioning that the top ranking problem has been analyzed for parametric models under fixed design assumptions [JKSO16], and here as well, asymptotic consistency is observed for connected comparison topologies.
Notation:
Here we summarize some notation used throughout the remainder of this paper. We use to denote the number of items, and adopt the shorthand . We use to denote a Bernoulli random variable with success probability . For two sequences and , we write if there is a universal constant such that for all . The relation is defined analogously, and we write if the relations and hold simultaneously. We use to denote universal constants that may change from line to line. We use to denote the all-ones vector in . Given a matrix , its th row is denoted by . For a graph with edge set , let denote the entries of the matrix restricted to the edge set of , and let . For a matrix and a permutation , we use the shorthand , where represents the row permutation matrix corresponding to the permutation . We let denote the identity permutation. The Kendall tau distance [Ken48] between two permutations and is given by
Let represent the set of all connected, vertexinduced subgraphs of a graph , and let and represent the vertex and edge set of a subgraph , respectively. We let denote the size of the largest independent set of the graph , which is a largest subset of vertices that have no edges among them. Define a biclique of a graph as two disjoint subsets of its vertices and such that for all and . Define the biclique number as the maximum number of edges in any such biclique, given by . Let denote the degree of vertex .
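As an illustration of the Kendall tau distance introduced in the notation above, the following minimal sketch counts the pairs of items that two rankings order differently. It assumes the standard unnormalized definition (number of discordant pairs); the function name `kendall_tau_distance` and the encoding of a permutation as a list of ranks are choices made here for illustration only.

```python
from itertools import combinations


def kendall_tau_distance(pi, sigma):
    """Count pairs of items that the two rankings order differently.

    pi[i] and sigma[i] are the ranks assigned to item i by each permutation.
    Assumption: unnormalized count of discordant pairs.
    """
    if len(pi) != len(sigma):
        raise ValueError("rankings must have the same length")
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )
```

For instance, a ranking and its reversal on three items disagree on all three pairs, while swapping two adjacent items changes exactly one pair.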
2 Background and Problem Formulation
Consider a collection of $n$ items that obey a total ordering or ranking determined by a permutation $\pi^*$. More precisely, item $i$ is preferred to item $j$ in the underlying ranking if and only if $\pi^*(i) < \pi^*(j)$. We are interested in observations arising from stochastic pairwise comparisons between items. We denote the matrix of underlying comparison probabilities by $M^* \in [0,1]^{n \times n}$, with $M^*_{ij}$ representing the probability that item $i$ beats item $j$ in a comparison.
Each item is associated with a score, given by the probability that the item beats another item chosen uniformly at random. More precisely, the score $\tau^*_i$ of item $i$ is given by
$$\tau^*_i := \frac{1}{n-1} \sum_{j \neq i} M^*_{ij}. \tag{1}$$
Arranging the scores in descending order naturally yields a ranking of items. In fact, for the models we define below, the ranking given by the scores is consistent with the ranking given by $\pi^*$, i.e., $\tau^*_i \geq \tau^*_j$ if $\pi^*(i) < \pi^*(j)$. The converse also holds if the scores are distinct.
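The score computation can be sketched in a few lines. This is a minimal illustration, assuming that the score in equation (1) is the average of the off-diagonal entries in a row of the comparison matrix (the probability of beating one of the other items chosen uniformly at random); the function names are hypothetical.

```python
import numpy as np


def scores(M):
    """Score of item i: average of M[i, j] over the n - 1 other items j."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    off_diagonal_sum = M.sum(axis=1) - np.diag(M)
    return off_diagonal_sum / (n - 1)


def ranking_from_scores(tau):
    """Items listed from highest to lowest score (ties broken by index)."""
    return np.argsort(-np.asarray(tau), kind="stable")
```

Sorting the scores in decreasing order then recovers the ranking whenever the scores are distinct.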
2.1 Pairwise comparison models
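We consider a permutation-based model for the comparison matrix $M^*$, one defined by the property of strong stochastic transitivity [Fis73, ML65], or the SST property for short. In particular, a matrix $M^*$ of pairwise comparison probabilities is said to obey the SST property if for items $i$, $j$ and $k$ in the total ordering such that $\pi^*(i) < \pi^*(j)$, it holds that $M^*_{ik} \geq M^*_{jk}$ (we set $M^*_{ii} = 1/2$ by convention). Alternatively, recalling that $\pi(M)$ denotes the matrix obtained from $M$ by permuting its rows and columns according to the permutation $\pi$, the SST matrix class can be defined in terms of permutations applied to the class of bivariate isotonic matrices as
$$\mathbb{C}_{\mathsf{SST}} := \bigcup_{\pi} \left\{ \pi(M) : M \in \mathbb{C}_{\mathsf{BISO}} \right\}. \tag{2}$$
Here the class of bivariate isotonic matrices is given by
$$\mathbb{C}_{\mathsf{BISO}} := \left\{ M \in [0,1]^{n \times n} : M + M^\top = e e^\top \text{ and } M \text{ has non-decreasing rows and non-increasing columns} \right\},$$
where $e$ denotes a vector of all ones.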
As shown by Shah et al. [SBGW17], the SST class is substantially larger than the commonly used class of parametric models, in which each item $i$ is associated with a parameter $w_i$, and the probability that item $i$ beats item $j$ is given by $F(w_i - w_j)$, where $F$ is a smooth monotone function of its argument.
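A special case of the SST model that we study in this paper is the noisy sorting model [BM08], in which all underlying probabilities are described with a single parameter $\lambda \in [0, 1/2]$. The matrix $M_{\mathsf{NS}}(\pi, \lambda)$ has entries
$$[M_{\mathsf{NS}}(\pi, \lambda)]_{ij} = \frac{1}{2} + \lambda \cdot \mathrm{sgn}(\pi(j) - \pi(i)),$$
and the noisy sorting classes are given by
$$\mathbb{C}_{\mathsf{NS}}(\lambda) := \bigcup_{\pi} \left\{ M_{\mathsf{NS}}(\pi, \lambda) \right\}, \quad \text{and} \quad \mathbb{C}_{\mathsf{NS}} := \bigcup_{\lambda \in [0, 1/2]} \mathbb{C}_{\mathsf{NS}}(\lambda). \tag{3}$$
Here $\mathrm{sgn}$ is the sign operator, with the convention that $\mathrm{sgn}(0) = 0$. In words, the noisy sorting class models the case where the probability that item $i$ beats item $j$ depends only on the parameter $\lambda$ and whether $\pi(i) < \pi(j)$. Although a noisy sorting model is a very special case of an SST model, apart from the degenerate case $\lambda = 0$, it cannot be represented by any parametric model with a smooth function $F$, and so captures the essential difficulty of learning in the SST class.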
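To make the noisy sorting model concrete, the following sketch builds such a matrix from a permutation and a noise parameter, and checks the bivariate isotonic structure that an SST matrix exhibits once items are listed from best to worst. It assumes that a smaller rank means a more preferred item, that the entries equal one half plus or minus the parameter, and that "bivariate isotonic" means non-decreasing rows and non-increasing columns; the function names are hypothetical.

```python
import numpy as np


def noisy_sorting_matrix(pi, lam):
    """M[i, j] = 1/2 + lam * sign(pi[j] - pi[i]).

    pi[i] is the rank of item i (smaller rank = preferred item), so a
    preferred item wins with probability 1/2 + lam. The diagonal is 1/2.
    """
    pi = np.asarray(pi)
    return 0.5 + lam * np.sign(pi[None, :] - pi[:, None])


def is_bivariate_isotonic(M):
    """Check M + M^T = all-ones, rows non-decreasing, columns non-increasing."""
    M = np.asarray(M, dtype=float)
    rows_ok = np.all(np.diff(M, axis=1) >= 0)
    cols_ok = np.all(np.diff(M, axis=0) <= 0)
    skew_ok = np.allclose(M + M.T, 1.0)
    return bool(rows_ok and cols_ok and skew_ok)
```

With the identity permutation the matrix is bivariate isotonic as is; for any other permutation it becomes so only after relabeling the items by their ranks.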
We now turn to describing the observation models that we consider in this paper.
2.2 Partial observation models
Our goal is to provide guarantees on estimating the underlying comparison matrix when the comparison topology is fixed. Suppose that we are given data for comparisons in the form of a graph , where the vertices represent the items and edges represent the comparisons made between items. We assume that the observations obey the probabilistic model
(4) 
where indicates a missing observation. We set the diagonal entries of equal to , and also specify that for , so that . We consider two different instantiations of the edge set given the graph.
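The observation model (4) can be simulated as follows. This is a sketch under stated assumptions: one independent Bernoulli comparison is drawn per edge, the missing-observation symbol is encoded as NaN, the diagonal is set to one half, and the two entries of an observed pair sum to one. The function name and the edge-list input format are choices made for this illustration.

```python
import numpy as np


def sample_comparisons(M, edges, rng):
    """Draw one Bernoulli comparison per edge; NaN marks unobserved pairs.

    M[i, j] is the probability that item i beats item j, and edges is an
    iterable of pairs (i, j) with i != j.
    """
    n = M.shape[0]
    Y = np.full((n, n), np.nan)
    np.fill_diagonal(Y, 0.5)
    for i, j in edges:
        y = float(rng.random() < M[i, j])  # 1.0 if item i beats item j
        Y[i, j] = y
        Y[j, i] = 1.0 - y
    return Y
```

A call such as `sample_comparisons(M, [(0, 1), (1, 2)], np.random.default_rng(0))` returns a matrix that is informative only on the two listed pairs.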
2.2.1 Worst-case setting
In this setting, we assume that the assignment of items to vertices of the comparison graph is arbitrary. In other words, once the graph and its edges are fixed, we observe the entries of the matrix according to the observation model (4), and would like to provide uniform guarantees in the metric over all matrices in our model class given this restricted set of observations.
This setting is of the worst-case type, since the adversary is allowed to choose the underlying matrix with knowledge of the edge set . Providing guarantees against such an adversary is known to be possible for parametric models [HOX14, SBB16]. However, as we show in Section 3.1, such a guarantee is impossible to obtain even over the noisy sorting subclass of the full SST class. Consequently, the latter parts of our analysis apply to a less rigid, average-case setting.
2.2.2 Average-case setting
In this setting, we assume that the assignment of items to vertices of the comparison graph is random. Equivalently, given a fixed comparison graph having adjacency matrix , the subset of the entries that we observe can be modeled by the operator for a permutation chosen uniformly at random. For a fixed comparison matrix , our observations themselves consist of a random subset of the entries of the matrix determined by the operator : a location where (respectively ) indicates that entry is observed (respectively is not observed). Such a setting is reasonable when the graph topology is constrained, but we are still given the freedom to assign items to vertices of the comparison graph, e.g. in psychology experiments. A natural extension of such an observation model is the one of random designs, consisting of multiple random observation operators , chosen with independent, random permutations .
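The random assignment of items to vertices amounts to relabeling the adjacency matrix by a uniformly random permutation. The sketch below illustrates this under the assumption that item `i` is placed on vertex `pi[i]`; the function name is hypothetical, and `numpy.random.Generator.permutation` is used to draw the permutation.

```python
import numpy as np


def randomly_assign_items(A, rng):
    """Relabel the vertices of adjacency matrix A by a uniform permutation.

    Returns the observation pattern O, where O[i, j] = 1 indicates that the
    pair of items (i, j) is compared, together with the permutation used.
    """
    A = np.asarray(A)
    n = A.shape[0]
    pi = rng.permutation(n)      # item i is placed on vertex pi[i]
    O = A[np.ix_(pi, pi)]        # O[i, j] = A[pi[i], pi[j]]
    return O, pi
```

The degree sequence of the graph is preserved as a multiset, but which item receives which degree is now random. Repeating the call with fresh randomness gives the multiple random designs mentioned above.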
Our guarantees in the one-sample setting with the observation operator can be seen as a form of Bayes risk, where given a fixed observation pattern (consisting of the entries of the comparison matrix determined by the adjacency matrix of the graph , with representing the indicator that entry is observed), we want to estimate a matrix under a uniform Bayesian prior on the ranking . Studying this average-case setting is well-motivated, since given fixed comparisons between a set of items, there is no reason to assume a priori that the underlying ranking is generated adversarially.
We are now ready to state the goal of the paper. We address the problems of recovering the ranking and estimating the matrix in the Frobenius norm. More precisely, given the observation matrix (where the set is random in the average-case observation model), we would like to output a matrix that is a function of , and for which good control on the Frobenius norm error can be guaranteed.
3 Main results
In this section, we state our main results and discuss some of their consequences. Proofs are deferred to Section 5.
3.1 Worst-case design: minimax bounds
In the worst-case setting of Section 2.2.1, the performance of an estimator is measured in terms of the normalized minimax error
where the expectation is taken over the randomness in the observations as well as any randomness in the estimator, and represents the model class. Our first result shows that for many comparison topologies, the minimax risk is prohibitively large even for the noisy sorting model.
Theorem 1.
For any graph , the diameter of the set consistent with observations on the edges of is lower bounded as
(5a)  
Consequently, the minimax risk of the noisy sorting model is lower bounded as  
(5b) 
Note that via the inclusion , Theorem 1 also implies the same lower bound (5b) on the risk . In addition to these bounds, the lower bounds for estimation in parametric models, known from past work [SBB16], carry over directly to the SST model, since parametric models are subclasses of the SST class.
Theorem 1 is approximation-theoretic in nature: more precisely, inequality (5a) is a statement purely about the size of the set of matrices consistent with observations on the graph. Consequently, it does not capture the uncertainty due to noise, and thus can be a loose characterization of the minimax risk for some graphs, with the complete graph being one example. The bound (5a) on the diameter of the set of consistent observations may be interpreted as the worst-case error in the infinite sample limit of observations on . Hence, Theorem 1 stands in sharp contrast to analogous results for parametric models [HOX14, SBB16], in which it suffices for the graph to be connected in order to obtain consistent estimation in the infinite sample limit. For example, connected graphs with large independent sets of order do not admit consistent estimation over the noisy sorting and hence SST classes.
It is also worth mentioning that the connectivity properties of the graph that govern minimax estimation in the larger SST model are quite different from those appearing in parametric models. In particular, the minimax rates for parametric models are closely related (via the linear observation model) to the spectrum of the Laplacian matrix of the graph . In Theorem 1, however, we see other functions of the graph appearing that are not directly related to the Laplacian spectrum. In Section 4, we evaluate these functions for commonly used graph topologies, showing that for many of them, the risk is lower bounded by a constant even for graphs admitting consistent parametric estimation.
Seeing as the minimax error in the worst-case setting can be prohibitively large, we now turn to evaluating practical estimators in the random observation models of Section 2.2.2.
3.2 Average-case design: noisy sorting matrix estimation
In the average-case setting described in Section 2.2.2, we measure the performance of an estimator using the risk
It is important to note that the expectation is taken over both the comparison noise, as well as the random observation pattern (or equivalently, the underlying random permutation assigning items to vertices). We propose the Average-Sort-Project estimator (ASP for short) for matrix estimation in this metric, which is a natural generalization of the Borda count estimator [CM16, SBW16a]. It consists of three steps, described below for the noisy sorting model:
(i) Averaging step: Compute the average , corresponding to the fraction of comparisons won by item .
(ii) Sorting step: Choose the permutation such that the sequence is decreasing in , with ties broken arbitrarily.
(iii) Projection step: Find the maximum likelihood estimate by treating as the true permutation that sorts items in decreasing order. Output the matrix .
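The three steps above can be sketched as follows for the noisy sorting model. This is an illustrative implementation under stated assumptions rather than a definitive one: observations are encoded as 0/1 on compared pairs and NaN elsewhere, the two entries of each observed pair sum to one, ties are broken by item index, and the likelihood is maximized over a noise parameter restricted to the interval from 0 to 1/2. The function name `asp_noisy_sorting` is hypothetical.

```python
import numpy as np


def asp_noisy_sorting(Y):
    """Average-Sort-Project sketch for noisy sorting.

    Y: n x n array with entries in {0, 1} on observed pairs, NaN elsewhere.
    Returns the estimated matrix, the estimated ranks, and the estimated
    noise parameter.
    """
    Y = np.array(Y, dtype=float)
    n = Y.shape[0]
    np.fill_diagonal(Y, np.nan)                 # ignore self-comparisons
    observed = ~np.isnan(Y)
    degree = observed.sum(axis=1)
    if np.any(degree == 0):
        raise ValueError("every item needs at least one comparison")

    # Averaging: fraction of comparisons won by each item.
    wins = np.where(observed, Y, 0.0).sum(axis=1)
    tau_hat = wins / degree

    # Sorting: rank items by decreasing average (ties broken by index).
    order = np.argsort(-tau_hat, kind="stable")
    pi_hat = np.empty(n, dtype=int)
    pi_hat[order] = np.arange(n)                # pi_hat[i] = rank of item i

    # Projection: Bernoulli MLE of the noise parameter given pi_hat,
    # i.e. the win rate of the higher-ranked item minus 1/2, clipped.
    higher = pi_hat[:, None] < pi_hat[None, :]  # item i ranked above item j
    win_rate = Y[observed & higher].mean()
    lam_hat = float(np.clip(win_rate - 0.5, 0.0, 0.5))

    M_hat = 0.5 + lam_hat * np.sign(pi_hat[None, :] - pi_hat[:, None])
    return M_hat, pi_hat, lam_hat
```

Clipping is valid here because the Bernoulli log-likelihood is concave in the noise parameter, so the constrained maximizer is the projection of the unconstrained one onto the interval.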
We now state an upper bound on the mean-squared Frobenius error achievable using the ASP estimator. It involves the degree sequence of a graph without isolated vertices, meaning that for all .
Theorem 2.
Let the observation process be given by . For any graph without isolated vertices and any matrix , we have
(6a)  
(6b) 
A few comments are in order. First, while the results are stated in expectation, a high-probability bound can be proved for permutation estimation, namely
Second, it can be verified that , so that taking a supremum over the parameter guarantees that the mean-squared Frobenius error is upper bounded as , uniformly over the entire noisy sorting class . Third, it is also interesting to note the dependence of the bounds on the noise parameter of the noisy sorting model. The “high-noise” regime is a good one for estimating the underlying matrix, since the true matrix is largely unaffected by errors in estimating the true permutation. However, as captured by equation (6b), the permutation estimation problem is more challenging in this regime.
The bound (6a) can be specialized to the complete graph and the Erdős-Rényi random graph with edge probability to obtain the rates and , respectively, for estimation in the mean-squared Frobenius norm. These rates are strictly suboptimal for these graphs, since the minimax rates scale as and , respectively; both are achieved by the global MLE [SBGW17]. Such a phenomenon is consistent with the gap observed between computationally constrained and unconstrained estimators in similar and related problems [SBGW17, FMR16, PWC17].
Interestingly, it turns out that the estimation rate (6a) is optimal in a certain sense, and we require some additional notions to state this precisely. Fix constants and and two sequences and of (strictly) positive scalars. For each , define the family of graphs
As noted in Section 2.2.2, the average-case design observation model is equivalent to choosing the matrix from a random ensemble with the permutation chosen uniformly at random, and observing fixed pairwise comparisons. Such a viewpoint is useful in order to state our lower bound. Expectations are taken over the randomness of both and the Bernoulli observation noise.
Theorem 3.
(a) Let , where the permutation is chosen uniformly at random on the set . For any pair of sequences such that the set is nonempty for every , and for any estimators that are measurable functions of the observations on , we have
(b) For any graph , let , with the permutation chosen uniformly at random and the constant chosen sufficiently small. Then for any estimators that are measurable functions of the observations on , we have
Parts (a) and (b) of the lower bound may be interpreted respectively as the approximation error caused by having observations only on a subset of edges, and the estimation error arising from the Bernoulli observation noise. Note that part (b) applies to every graph, and is particularly noteworthy for sparse graphs. In particular, in the regime in which the graph has bounded average degree, it shows that the inconsistency exhibited by the ASP estimator is unavoidable for any estimator. A more detailed discussion for specific graphs may be found in Section 4.
Although part (a) of the theorem is stated for a supremum over graphs, we actually prove a stronger result that explicitly characterizes the class of graphs that attain these lower bounds. As an example, given the sequences and , we show that the ASP estimator is information-theoretically optimal for the sequence of graphs consisting of two disjoint cliques , which can be verified to lie within the class .
The ASP estimator for the SST model would replace step (iii), as stated, by a maximum likelihood estimate using the entries on the edges that we observe. However, analyzing such an estimator given only a single sample on the entries is a challenging problem due to dependencies between the different steps of the estimator, and the difficulty of solving the associated matrix completion problem. Consequently, we turn to an observation model consisting of two random designs, and design a different estimator that renders the matrix completion problem tractable.
3.3 Two random designs: SST matrix estimation
Recall the average-case setting with multiple random designs, as described in Section 2.2.2, in which the comparison topology is fixed ahead of time, but one can collect multiple observations by assigning items to the vertices of the underlying graph at random. In this section, we rely on two such independent observations and to design an estimator that is consistent over the SST class. In order to describe our estimator, we require some additional notation. For any matrix such that , we use to denote the vector of its row sums. Note that this vector is related to the vector of scores, as defined in equation (1), via .
Our estimator relies on the approximation of any matrix by a blockwise constant matrix, and we require some more definitions to make this precise. For any vector , fix some value and define a block partition of as
In particular, the blocking vector contains a partition of indices such that the row sums of the matrix within each block of the partition are within a gap of each other. Denote the set of all possible partitions of the set by . For any partition of the indices , define the set of blocks .
By definition, given a partition of , the set is a partition of the set into blocks. We are now ready to describe the blocking operation. For indices , denote by the block in that contains the tuple . Given a matrix satisfying , we define the blocked version of depending on observations in a set as
(7) 
In words, this defines a projection of the matrix onto the set of blockwise constant matrices, by blockwise averaging the entries of over the observed set of entries . We now turn to our estimator, called the Block-Average-Project estimator (BAP for short), of the underlying matrix . Given the observation matrix , define
where is the (random) degree of item . We now perform three steps:
(1) Blocking step: Fix , and obtain the blocking vector and permutation as in the sorting step of the ASP estimator.
(2) Averaging step: Average the matrix within each block to obtain the matrix .
(3) Projection step: Project onto the space , to obtain the estimator .
The blocking and averaging steps of the estimator are the main ingredients that we use to bound the error of the associated matrix completion problem. Also, the projection step of the estimator can be computed in polynomial time via bivariate isotonic regression [BDPR84].
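The blockwise averaging described in words after equation (7) can be sketched as below. This is only an illustration of the averaging operation, not of the full estimator: the block partition is taken as given (encoded by a hypothetical `labels` array), and blocks with no observed entries are filled with a default value of one half, which is an assumption made here and not a detail taken from the text.

```python
import numpy as np


def block_average(Y, observed, labels, default=0.5):
    """Replace each entry by the mean of the observed entries in its block.

    labels[i] is the index of the block containing item i, so the block of
    the pair (i, j) is (labels[i], labels[j]). Blocks without any observed
    entry receive `default` (an assumption of this sketch).
    """
    Y = np.asarray(Y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    labels = np.asarray(labels)
    B = np.full(Y.shape, default)
    for a in np.unique(labels):
        for b in np.unique(labels):
            block = np.ix_(labels == a, labels == b)
            obs = observed[block]
            if obs.any():
                B[block] = Y[block][obs].mean()
    return B
```

The output is constant on each block, so projecting it onto the class of bivariate isotonic matrices is the only remaining step.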
Theorem 4.
Let the observation process be given by . For any graph without isolated vertices and any matrix , we have
where the expectation is taken over the noise, and observation patterns and .
To be clear, the blocking estimate is well-defined even when we have just one sample instead of two samples and , where step (2) is replaced by the estimate . In the simulations of Section 4, we see that for a large variety of graphs, using a single sample enjoys similar performance to using two independent samples and . We require two independent samples of the observations in our theoretical analysis to decouple the randomness of the first step of the algorithm from the second. When using one sample , the dependencies that are introduced between the different steps of the algorithm make the analysis challenging.
4 Dependence on graph topologies
In this section, we discuss implications of our results for some comparison topologies. Let us focus first on the worst-case design setting, and the lower bound of Theorem 1. For the star, path (or more generally, any graph with bounded average degree), and complete bipartite graphs, one can verify that we have , so . If the graph is a union of disjoint cliques (or having a constant number of edges across the cliques, like a barbell graph), then we see that , so . Thus, our theory yields pessimistic results for many practically motivated comparison topologies under worst-case designs, even though all the connected graphs above admit consistent estimation for parametric models (the complete bipartite graph, for instance, admits optimal rates of estimation) as the number of samples grows. In the average-case setting of Section 2.2.2, Theorems 2, 3 and 4 characterize the mean-squared Frobenius norm errors of the corresponding estimators (up to constants) as .
In order to illustrate our results for the average-case setting, we present the results of simulations on data generated synthetically from two special cases of the SST model (note that the SST model has been validated extensively on real data in past work; see, e.g., Ballinger and Wilcox [BW97]). We fix without loss of generality, and generate the ground truth comparison matrix in one of two ways:

Noisy sorting with high SNR: We set .

SST with independent bands: We first set for every . Entries on the diagonal band immediately above the diagonal (i.e. for ) are chosen i.i.d. and uniformly at random from the set . The band above is then chosen uniformly at random from the allowable set, where every entry is constrained to be upper bounded by and lower bounded by the entries to its left and below. We also set to fill the rest of the matrix.
For each graph with adjacency matrix , the data is generated from ground truth by observing independent Bernoulli comparisons under the observation process , for a randomly generated permutation . For the SST model, we also generate data from two independent random observations and as required by the BAP estimator; however, we also simulate the behaviour of the estimator for one sample and show that it closely tracks that of the two-sample estimator.
Recall that the estimation error rate was dictated by the degree functional . While our graphs were chosen to illustrate scalings of , some variants of these graphs also naturally arise as comparison topologies.
(1) Two-disjoint-clique graph: For this graph , we have for every , and simple calculations yield . It is interesting to note that this graph has unfavorable guarantees for parametric estimation under the adversarial model, because it is disconnected (and thus has a Laplacian with zero spectral gap). We observe that this spectral property does not play a role in our analysis of the ASP or BAP estimator under the average-case observation model, and this behavior is corroborated by our simulations. Although we do not show it here, a similar behavior is observed for the stochastic block model, a practically motivated comparison topology when there are genres present among the items, which is a relaxation of the two-clique case allowing for sparser “communities” instead of cliques, and edges between the communities.
(2) Clique-plus-path graph: The nodes are partitioned into two sets of nodes each. The graph contains an edge between every two nodes in the first set, and a path starting from one of the nodes in the first set and chaining the other nodes. This is an example of a graph construction that has many () edges, but is unfavorable for noisy sorting or SST estimation. Simple calculations show that the degree functional is dominated by the constant degree terms and we obtain .
(3) Power law graph: We consider the special power law graph [BA99] with degree sequence for , and construct it using the Havel-Hakimi algorithm [Hav55, Hak62]. For this graph, we have a disparate degree sequence, but , and the simulated estimators are consistent.
(4) Regular bipartite graphs: A final powerful illustration of our theoretical guarantees is provided by a regular bipartite graph construction in which the nodes are partitioned into two sets of nodes each, and each node in one set is (deterministically) connected to nodes in the other set. This results in the degree sequence for all , and the degree functional evaluates to . The value of thus determines the scaling of the estimation error for the ASP estimator in the noisy sorting case, as well as the BAP estimator in the SST case, as seen from the slopes of the corresponding plots.
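To get a feel for how the degree sequence drives these scalings, the following sketch evaluates a degree functional on two of the graphs above. It assumes that the functional is the average over vertices of the inverse square root of the degree, which is our reading of the rates quoted for the complete and Erdős-Rényi graphs and should be checked against the statements of Theorems 2 and 4. All function names are hypothetical, and an even number of nodes is assumed.

```python
import numpy as np


def degree_functional(degrees):
    """Assumed form: (1/n) * sum over vertices v of 1 / sqrt(d_v)."""
    d = np.asarray(degrees, dtype=float)
    if np.any(d <= 0):
        raise ValueError("the graph must have no isolated vertices")
    return float(np.mean(1.0 / np.sqrt(d)))


def two_clique_degrees(n):
    """Two disjoint cliques on n/2 nodes each: every degree is n/2 - 1."""
    return np.full(n, n // 2 - 1)


def clique_plus_path_degrees(n):
    """Clique on n/2 nodes, with a path through the other n/2 nodes attached."""
    half = n // 2
    d = np.concatenate([np.full(half, half - 1), np.full(half, 2)])
    d[0] += 1    # clique node at which the path starts
    d[-1] = 1    # last node on the path
    return d
```

Under this assumed form, the two-clique value decays with the number of items, whereas the clique-plus-path value stays bounded away from zero because half of the vertices have degree at most two.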
5 Proofs
In this section, we provide the proofs of our main results. We assume throughout that , and use to denote universal constants that may change from line to line.
5.1 Proof of Theorem 1
For each fixed graph , define the quantity
corresponding to the diameter quantity that is lower bounded in equation (5a). Taking the lower bound (5a) as given for the moment, we first prove the lower bound (5b) on the minimax risk. It suffices to show that the minimax risk is lower bounded in terms of as
(8) 
In order to verify this claim, consider the two matrices that attain the supremum in the definition of ; note that such matrices exist due to the compactness of the space and the continuity of the squared loss. By construction, these two matrices satisfy the properties
We can now reduce the problem to one of testing between the two matrices and , with the distribution of observations being identical for both alternatives. Consequently, any procedure can do no better than to make a random guess between the two, so we have
which proves the claim (8).
It remains to prove the claimed lower bound (5a) on . This lower bound can be split into the following two claims:
(9a)  
(9b) 
We use a different argument to establish each claim.
Proof of claim (9a): Recall from Section 1 the definition of the largest independent set. Without loss of generality, let the largest independent set be given by . Assign item to vertex for . Now we choose permutations and so that

for ,

for ,

and agree on .
Note that the last step is possible because . Moreover, define the matrices and . Note that by construction, we have ensured that . However, it holds that
which completes the proof.
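The independent-set construction can be checked numerically on a small example. The sketch below uses a star graph, whose leaves form an independent set, and assumes our reading of the construction: both permutations place the leaves in the top ranks, one in the reverse order of the other, and agree on the remaining vertex. The matrix builder, the graph size, and the noise level 0.25 are choices made for illustration.

```python
import numpy as np


def noisy_sorting_matrix(pi, lam):
    """M[i, j] = 1/2 + lam * sign(pi[j] - pi[i]), with pi[i] the rank of item i."""
    pi = np.asarray(pi)
    return 0.5 + lam * np.sign(pi[None, :] - pi[:, None])


n, lam = 6, 0.25
edges = [(0, v) for v in range(1, n)]   # star graph: vertex 0 is the centre
leaves = np.arange(1, n)                # independent set of size n - 1

pi = np.empty(n, dtype=int)
pi[leaves] = np.arange(n - 1)           # leaves occupy the top n - 1 ranks
pi[0] = n - 1                           # centre is ranked last
pi_rev = pi.copy()
pi_rev[leaves] = pi[leaves][::-1]       # reverse the order among the leaves

M = noisy_sorting_matrix(pi, lam)
M_rev = noisy_sorting_matrix(pi_rev, lam)

# The two matrices are indistinguishable from comparisons on the edges...
agree_on_edges = all(M[i, j] == M_rev[i, j] for i, j in edges)
# ...yet differ by 2 * lam on every ordered pair of distinct leaves.
sq_frobenius_gap = float(np.sum((M - M_rev) ** 2))
```

With five leaves there are twenty ordered pairs, each contributing a squared difference of 0.25, so no amount of data on the edges of the star can separate the two matrices even though they are far apart in Frobenius norm.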
Proof of claim (9b): Recall the definition of a maximum biclique from Section 1. Since the complement graph has a biclique with edges, the graph has two disjoint sets of vertices and with that do not have edges connecting one to the other. We now pick the two permutations and so that

the permutation ranks items from as the top items, and ranks items from as the next items;

the permutation ranks items from as the top items, and ranks items from as the next items;

the permutations and agree with each other apart from the above constraints.
As before, we define and , and again, we have . The relative orders of items have been interchanged across the biclique, so it holds that , which completes the proof. ∎
5.2 Some useful lemmas for average-case proofs
We now turn to proofs for the average-case setting. For convenience, we begin by stating two lemmas that are used in multiple proofs. The first lemma bounds the performance of the permutation estimator for a general SST matrix, and is thus of independent interest.
Lemma 1.
For any matrix , the permutation estimator satisfies
(10a)  
and if additionally, , we have  
(10b) 
In addition, the score estimates satisfy the bounds
Our second lemma is a type of rearrangement inequality.
Lemma 2.
Let be an increasing sequence of positive numbers and let be a decreasing sequence of positive numbers. Then we have
5.2.1 Proof of Lemma 1
Assume without loss of generality that . We begin by applying Hölder’s inequality to obtain
In the case where , we have ; in the general case , we have . Next, if denotes the matrix obtained from permuting the rows of by , then it holds that
where the equality follows from the condition . We also have
where step is due to monotonicity along each column of , and step follows from the rearrangement inequality (see, e.g., Example 2 in the paper [Vin90]), using the fact that both sequences and are sorted in decreasing order. Combining the last three displays yields the claimed bounds (10a) and (10b).
In order to prove the second part of the lemma, it suffices to show that the random variable is sub-Gaussian with parameter , where . Let be the uniform random assignment of items to vertices with , and let denote the random degree of item . Note that conditioned on the event , the difference between a score and its empirical version can be written as