
Minimax Rates and Efficient Algorithms for Noisy Sorting

by   Cheng Mao, et al.

There has been a recent surge of interest in studying permutation-based models for ranking from pairwise comparison data. Despite being structurally richer and more robust than parametric ranking models, permutation-based models are less well understood statistically and generally lack efficient learning algorithms. In this work, we study a prototype of permutation-based ranking models, namely, the noisy sorting model. We establish the optimal rates of learning the model under two sampling procedures. Furthermore, we provide a fast algorithm to achieve near-optimal rates if the observations are sampled independently. Along the way, we discover properties of the symmetric group which are of theoretical interest.





1 Introduction

Pairwise comparison data is frequently observed in various domains, including recommender systems, website ranking, voting and social choice (see, e.g., Baltrunas et al., 2010; Dwork et al., 2001; Liu, 2009; Young, 1988; Caplin and Nalebuff, 1991). For these applications, it is of significant interest to produce a suitable ranking of the items by aggregating the outcomes of pairwise comparisons. The general problem of interest can be stated as follows. Suppose there are $n$ items to be compared and an underlying matrix $M$ of probability parameters, each entry $M_{i,j}$ of which represents the probability that item $i$ beats item $j$ if they are compared. Hence we have $M_{i,j} + M_{j,i} = 1$, and the event that item $i$ beats item $j$ in a comparison can be viewed as a Bernoulli random variable with probability $M_{i,j}$. Observing the outcomes of independent pairwise comparisons, we aim to estimate the absolute ranking of the items.

For the sake of consistency, one needs of course to impose some structure on the matrix $M$. These structural assumptions are traditionally split between parametric and nonparametric ones. Classical parametric models include the Bradley-Terry-Luce model (Bradley and Terry, 1952; Luce, 1959) and the Thurstone model (Thurstone, 1927). These models can be recast as log-linear models, which enables the use of the statistical and computational machinery of maximum likelihood estimation in generalized linear models (Hunter, 2004; Negahban et al., 2012; Rajkumar and Agarwal, 2014; Hajek et al., 2014; Shah et al., 2015; Negahban et al., 2016, 2017).

To allow richer structures on $M$ beyond the scope of parametric models, permutation-based models such as the noisy sorting model (Braverman and Mossel, 2008, 2009) and the strong stochastic transitivity (SST) model (Chatterjee, 2015; Shah et al., 2017a) have recently become more prevalent. These models only require shape constraints on the matrix $M$ and are typically called nonparametric. In these models, the underlying ranking of items is determined by an unknown permutation $\pi^*$, and, additionally, the comparison probabilities are assumed to have a bi-isotonic structure when the items are aligned according to $\pi^*$. While permutation-based models provide ordering structures that are not captured by parametric models (Agarwal, 2016; Shah et al., 2017a), they introduce both statistical and computational barriers for estimation of the underlying ranking. These barriers are mainly due to the complexity of the discrete set of permutations. On the one hand, the complexity of the set of permutations is not well understood (see the discussion following Theorem 8 in Collier and Dalalyan, 2016), which leads to logarithmic gaps in the current statistical bounds for permutation-based models. On the other hand, it is computationally challenging to optimize over the set of permutations, so current algorithms either sacrifice nontrivial statistical performance or have impractical time complexity. In this work, we aim to address both questions for the noisy sorting model.

In practice, it is unlikely that all the items are compared to each other. To account for this limitation, a widely used scheme consists in assuming that each pairwise comparison is observed with probability $p \in (0, 1]$ (see, e.g., Chatterjee, 2015; Shah et al., 2017a). In addition to this model of missing comparisons, we study the model where $N$ pairwise comparisons are sampled uniformly at random from the $\binom{n}{2}$ pairs, with replacement and independently of each other. It turns out that sampling with and without replacement yields the same rate of estimation up to a constant when the expected numbers of observations coincide.

Our contributions.

We focus on the noisy sorting model with partial observations, under which a stronger item wins a comparison against a weaker item with probability at least $1/2 + \lambda$, where $\lambda \in (0, 1/2)$. For sampling both with and without replacement, we establish the minimax rate of learning the underlying permutation. In particular, the rate does not involve a logarithmic term, and we explain this phenomenon through a careful analysis of the metric entropy of the set of permutations equipped with the Kendall tau distance, which is of independent theoretical interest.

Moreover, we propose a multistage sorting algorithm that has time complexity $\tilde{O}(n^2)$. For the sampling with replacement model, we prove a theoretical guarantee on the performance of the multistage sorting algorithm, which differs from the minimax rate by only a polylogarithmic factor. In addition, the algorithm is demonstrated to perform similarly for both sampling models using simulated examples.

Related work.

The noisy sorting model was proposed by Braverman and Mossel (2008). In the original paper, the optimal rate of estimation achieved by the maximum likelihood estimator (MLE) is established, and an algorithm with time complexity $O(n^C)$ is shown to find the MLE with high probability in the case of full observations, where $C$ is a large unknown constant. (If the algorithm is allowed to actively choose the pairs to be compared, the sample complexity can be reduced to $O(n \log n)$. However, in the passive setting which we adopt throughout this work, the algorithm still needs all $\binom{n}{2}$ pairwise comparisons.) Moreover, their algorithm does not have a polynomial running time if only random pairwise comparisons are observed. Our work generalizes the optimal rate to the partial observation settings by studying a variant of the MLE for the upper bound. In the model of sampling with replacement, our fast multistage sorting algorithm provably achieves a near-optimal rate of estimation. Since finding the MLE for the noisy sorting model is an instance of the NP-hard feedback arc set problem (Alon, 2006; Kenyon-Mathieu and Schudy, 2007; Ailon et al., 2008; Braverman and Mossel, 2008), our results indicate that, despite the NP-hardness of the worst-case problem, it is still possible to achieve (near-)optimal rates for the average-case statistical setting in polynomial time.

The SST model generalizes the noisy sorting model, and minimax rates in the SST model have been studied by Shah et al. (2017a). However, the upper bound specialized to noisy sorting contains an extra logarithmic factor, which this work shows to be unnecessary. Moreover, the lower bound there is based on noisy sorting models with $\lambda$ shrinking to zero as $n \to \infty$, while we establish a matching lower bound at any fixed $\lambda$. In addition, algorithms of Wauthier et al. (2013); Shah et al. (2017a); Chatterjee and Mukherjee (2016) are all statistically suboptimal for the noisy sorting model. This is partially addressed by our multistage sorting algorithm as discussed above.

In fact, both with- and without-replacement sampling models discussed in this paper are restrictive for applications where the set of observed comparisons is subject to certain structural constraints (Hajek et al., 2014; Shah et al., 2015; Negahban et al., 2017; Pananjady et al., 2017a). Obtaining sharper rates of estimation for these more complex sampling models is of significant interest but is beyond the scope of the current work.

Finally, we mention a few other lines of related work. Besides permutation-based models, low-rank structures have also been proposed by Rajkumar and Agarwal (2016) to generalize classical parametric models. Moreover, there is an extensive literature on active ranking from pairwise comparisons (see, e.g., Jamieson and Nowak, 2011; Heckel et al., 2016; Agarwal et al., 2017, and references therein), where the pairs to be compared are chosen actively and in a sequential fashion by the learner. The sequential nature of the models greatly reduces sample complexity, so we do not compare our results for passive observations to the literature on active learning. However, it is interesting to note that our multistage sorting algorithm is reminiscent of active algorithms, because it uses different batches of samples for different stages. Thus active learning algorithms could potentially be useful even for passive sampling models.


The noisy sorting model together with the two sampling models is formalized in Section 2. In Section 3, we present our main results, the minimax rate of estimation for the latent permutation and the near-optimal rate achieved by an efficient multistage sorting algorithm. To complement our theoretical findings, we inspect the empirical performance of the multistage sorting algorithm on numerical examples in Section 4. Section 5 is devoted to the study of the set of permutations equipped with the Kendall tau distance. Proofs of the main results are provided in Section 6. We discuss directions for future research in Section 7.


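For a positive integer $n$, let $[n] = \{1, \dots, n\}$. For a finite set $S$, we denote its cardinality by $|S|$. Given $a, b \in \mathbb{R}$, let $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. We use $c$ and $C$, possibly with subscripts, to denote universal positive constants that may change at each appearance. For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if there exists a universal constant $C > 0$ such that $a_n \le C b_n$ for all $n$. We define the relation $a_n \gtrsim b_n$ analogously, and write $a_n \asymp b_n$ if both $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ hold. Let $\mathfrak{S}_n$ denote the symmetric group on $[n]$, i.e., the set of permutations $\pi : [n] \to [n]$.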

2 Problem formulation

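The noisy sorting model can be formulated as follows. Fix an unknown permutation $\pi^* \in \mathfrak{S}_n$ which determines the underlying order of the $n$ items. More precisely, $\pi^*$ orders the items from the weakest to the strongest, so that item $i$ is the $\pi^*(i)$-th weakest among the $n$ items. For a fixed $\lambda \in (0, 1/2)$, we define a class of matrices

$$\mathcal{M}_n(\lambda) = \big\{ M \in [0, 1]^{n \times n} : M + M^\top = \mathbf{1}\mathbf{1}^\top, \ M_{i,j} \ge 1/2 + \lambda \text{ if } i > j \big\},$$

where $\mathbf{1}$ is the $n$-dimensional all-ones vector. In addition, we define a special matrix $\bar{M}_n(\lambda) \in \mathcal{M}_n(\lambda)$ by

$$[\bar{M}_n(\lambda)]_{i,j} = \begin{cases} 1/2 + \lambda & \text{if } i > j, \\ 1/2 - \lambda & \text{if } i < j, \\ 1/2 & \text{if } i = j. \end{cases}$$

Note that $\bar{M}_n(\lambda)$ satisfies strong stochastic transitivity but other matrices in $\mathcal{M}_n(\lambda)$ may not. Though this observation plays a crucial role in the design of efficient algorithms, our statistical results hold for general matrices in $\mathcal{M}_n(\lambda)$.

To model pairwise comparisons, fix $M \in \mathcal{M}_n(\lambda)$ and let $M_{\pi^*(i), \pi^*(j)}$ denote the probability that item $i$ beats item $j$ when they are compared, so that a stronger item beats a weaker item with probability at least $1/2 + \lambda$. (The diagonal entries of $M$ are inessential in the model as an item is not compared to itself, and they are set to $1/2$ only for concreteness.) As a result, $\lambda$ captures the signal-to-noise ratio of our problem and our minimax results explicitly capture the dependence on this key parameter.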

2.1 Sampling models

In the noisy sorting model, suppose that for each (unordered) pair $(i, j)$ with $i \ne j$, we observe the outcomes of $N_{i,j}$ comparisons between them, and item $i$ wins a comparison against item $j$ with probability $M_{\pi^*(i), \pi^*(j)}$ independently. The set of nonnegative integers $\{N_{i,j}\}$ is determined by certain sampling models described below. We allow $N_{i,j}$ to be zero, which means that $i$ and $j$ are not compared. We collect sufficient statistics into a matrix $A$ consisting of outcomes of pairwise comparisons, by defining $A_{i,j}$ to be the number of times item $i$ beats item $j$ among the $N_{i,j}$ comparisons between $i$ and $j$. In particular, we have $A_{i,j} + A_{j,i} = N_{i,j}$ for $i \ne j$ and $A_{i,i} = 0$. Our goal is to aggregate the results of pairwise comparisons to estimate $\pi^*$, the underlying order of items.

In the full observation setup of Braverman and Mossel (2008), we have $N_{i,j} = 1$ for each pair $(i, j)$ and the total number of observations is $\binom{n}{2}$. Instead, we are interested here in the regime where the total number of observations is much smaller than $\binom{n}{2}$. We study the following two sampling models in this work:

  1. Sampling without replacement. In this sampling model, instead of observing all the pairwise comparisons, we observe each pair with probability $p \in (0, 1]$ independently. Hence each $N_{i,j}$ is a Bernoulli random variable with parameter $p$, and in expectation we have $p \binom{n}{2}$ observations in total.

  2. Sampling with replacement. We observe $N$ pairwise comparisons between the items, sampled uniformly and independently with replacement from the $\binom{n}{2}$ pairs.

In the sequel, we study the noisy sorting model with either of the above two sampling models. In particular, the minimax rates of estimating $\pi^*$ coincide for the two sampling models if $N \asymp p n^2$, i.e., if the expected numbers of observations are of the same order.
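To make the two sampling models concrete, the following is a minimal simulation sketch. It assumes the special matrix in which a stronger item always wins with probability exactly $1/2 + \lambda$; the function name `sample_comparisons` and its arguments are illustrative and do not come from the paper.

```python
import random
from itertools import combinations

def sample_comparisons(rank, lam, rng, p=None, N=None):
    """Simulate noisy pairwise comparisons and return the win-count matrix A.

    rank[i] is the true rank of item i (0 = weakest). A[i][j] counts how
    often item i beat item j. Give p for sampling without replacement
    (model 1) or N for sampling with replacement (model 2), not both.
    """
    if (p is None) == (N is None):
        raise ValueError("specify exactly one of p and N")
    n = len(rank)
    pairs = list(combinations(range(n), 2))
    if p is not None:
        # Model 1: each pair is observed once with probability p.
        observed = [pair for pair in pairs if rng.random() < p]
    else:
        # Model 2: N pairs drawn uniformly with replacement.
        observed = [rng.choice(pairs) for _ in range(N)]
    A = [[0] * n for _ in range(n)]
    for i, j in observed:
        # The stronger item wins with probability 1/2 + lam.
        prob_i_wins = 0.5 + lam if rank[i] > rank[j] else 0.5 - lam
        if rng.random() < prob_i_wins:
            A[i][j] += 1
        else:
            A[j][i] += 1
    return A
```

For example, `sample_comparisons([2, 0, 3, 1], 0.3, random.Random(0), N=40)` draws 40 comparisons with replacement among four items.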

2.2 Measures of performance

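Having discussed the sampling and comparison models, we turn to the distance used to measure the difference between the underlying permutation $\pi^*$ and an estimated permutation $\hat{\pi}$. Among various distances defined on the symmetric group, we consider primarily the Kendall tau distance, i.e., the number of inversions (or discordant pairs) between permutations, defined as

$$d_{\mathrm{KT}}(\pi, \sigma) = \sum_{i < j} \mathbb{1}\big( (\pi(i) - \pi(j)) (\sigma(i) - \sigma(j)) < 0 \big)$$

for $\pi, \sigma \in \mathfrak{S}_n$. Note that $d_{\mathrm{KT}}(\pi, \sigma) \in [0, \binom{n}{2}]$. The Kendall tau distance between two permutations is a natural metric on $\mathfrak{S}_n$, and it is equal to the minimum number of adjacent transpositions required to change from one permutation to another (Knuth, 1998). A closely related distance on $\mathfrak{S}_n$ is the $\ell_1$-distance, also known as Spearman's footrule, defined as

$$\|\pi - \sigma\|_1 = \sum_{i=1}^{n} |\pi(i) - \sigma(i)|$$

for $\pi, \sigma \in \mathfrak{S}_n$. It is well known (Diaconis and Graham, 1977) that

$$d_{\mathrm{KT}}(\pi, \sigma) \le \|\pi - \sigma\|_1 \le 2\, d_{\mathrm{KT}}(\pi, \sigma). \tag{2.1}$$

Hence the rates of estimation in the two distances coincide. Another distance on $\mathfrak{S}_n$ we use is the $\ell_\infty$-distance, defined as

$$\|\pi - \sigma\|_\infty = \max_{i \in [n]} |\pi(i) - \sigma(i)|.$$

Note that unlike existing literature on ranking from pairwise comparisons where metrics on the probability parameters are studied, we employ here distances that measure how far an item is from its true ranking.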
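As a small illustration of these three distances, the sketch below computes them for permutations given as sequences of ranks, and can be used to check the Diaconis-Graham inequality numerically. The function names are illustrative, and the quadratic-time Kendall tau computation is chosen for readability rather than speed.

```python
def kendall_tau(pi, sigma):
    """Number of discordant pairs between two permutations."""
    n = len(pi)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )

def footrule(pi, sigma):
    """Spearman's footrule, i.e. the l1-distance."""
    return sum(abs(a - b) for a, b in zip(pi, sigma))

def linf_distance(pi, sigma):
    """Largest displacement of any single item."""
    return max(abs(a - b) for a, b in zip(pi, sigma))

pi, sigma = [1, 2, 3, 4], [2, 1, 4, 3]
kt = kendall_tau(pi, sigma)
# Diaconis-Graham: the footrule lies between kt and 2 * kt.
assert kt <= footrule(pi, sigma) <= 2 * kt
```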

3 Main results

In this section, we state our main results. Specifically, we establish the minimax rates of estimating $\pi^*$ in the Kendall tau distance (and thus in $\ell_1$-distance) for noisy sorting under both sampling models 1 and 2. The minimax estimator that we propose is intractable in general and we complement our results with an efficient estimator of $\pi^*$ which achieves near-optimal rates in both the Kendall tau and the $\ell_\infty$-distance, under sampling model 2.

3.1 Minimax rates of noisy sorting

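Under the noisy sorting model with latent permutation $\pi^*$ and matrix of probabilities $M$, we determine the minimax rate of estimating $\pi^*$ in the following theorem. Let $\mathbb{E}_{\pi^*, M}$ denote the expectation with respect to the probability distribution of the observations in the noisy sorting model with underlying permutation $\pi^*$ and matrix of probabilities $M$, in either sampling model.

Theorem 3.1. Fix $\lambda \in (0, 1/2 - c]$ where $c$ is a universal positive constant. It holds that

$$\min_{\tilde{\pi}} \max_{\pi^*, M} \mathbb{E}_{\pi^*, M}\big[ d_{\mathrm{KT}}(\tilde{\pi}, \pi^*) \big] \asymp \begin{cases} \dfrac{n}{p \lambda^2} \wedge n^2 & \text{in sampling model 1}, \\[2mm] \dfrac{n^3}{N \lambda^2} \wedge n^2 & \text{in sampling model 2}, \end{cases}$$

where the maximum is taken over all latent permutations and all probability matrices in the class defined in Section 2, and the minimum is taken over all permutation estimators $\tilde{\pi}$ that are measurable with respect to the observations.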

The theorem establishes the minimax rates for noisy sorting, including the case of partial observations and weak signals. The upper bounds in fact hold with high probability as shown in Theorem 6.1. If the expected numbers of observations in the two sampling models 1 and 2 are of the same order, i.e., $N \asymp p n^2$, then the two rates coincide. In this sense, the two sampling models are statistically equivalent. In sampling model 1, if $p = 1$ and $\lambda$ is larger than a constant, then the rate of order $n$ recovers the upper bound proved by Braverman and Mossel (2008).

Note in particular the absence of logarithmic factor in the rates. Naively bounding the metric entropy of by actually yields a superfluous logarithmic term in the upper bound. To avoid it, we study the doubling dimension of ; see the discussion after Proposition 5. Closing this logarithmic gap for other problems involving latent permutations (Collier and Dalalyan, 2016; Flammarion et al., 2016; Shah et al., 2017a; Pananjady et al., 2017b) remains an open question.

The technical assumption in Theorem 3.1 is very mild, because we are interested in the “noisy” sorting model (meaning that the pairwise comparisons are noisy, or equivalently that is not close to ). In fact the requirement that be bounded away from can be lifted, in which case we establish upper and lower bounds that match up to a logarithmic factor of order , where (see Section 6).

Finally, we note that the proof of Theorem 3.1 holds even in the so-called semi-random setting (Blum and Spencer, 1995; Makarychev et al., 2013), in which observations are generated by one of the random procedures described above, but a “helpful” adversary is allowed to reverse the outcome of any comparison in which a weaker item beat a stronger item. Though these reversals appear benign at first glance, the presence of such an adversary can in fact worsen statistical rates of estimation in more brittle models such as stochastic block models and the related broadcast tree model (Moitra et al., 2016). Our results indicate that no such degradation occurs for the rates of estimation in the noisy sorting problem.

3.2 Efficient multistage sorting

The minimax upper bound in Theorem 3.1 is established using a computationally prohibitive estimator, so we now introduce an efficient estimator of the underlying permutation that can be computed in time . In this section, we prove theoretical guarantees for this estimator under the noisy sorting model with probability matrix and observations sampled with replacement according to 2 when is bounded away from zero by a universal constant. No polynomial-time algorithm was previously known to achieve near-optimal rates even in this simplified setting when pairwise comparisons are observed.

Since we aim to prove guarantees up to constants, we may assume that we have pairwise comparisons, and split them into two independent samples, each containing pairwise comparisons. The first sample is used to estimate the parameter and the second one is used to estimate the permutation .

First, we introduce a fairly simple estimator of that can be described informally as follows: first sort in increasing order the items according to the number of wins. Then for any pair for which item is ranked positions higher than item , it is very likely that item is stronger than item so that it beats item with probability . We then average the variables over all such pairs to obtain an estimator of . More formally, we further split the first sample into two subsamples, each containing pairwise comparisons. Denote by and the number of wins item has against item in the first and second subsample, respectively. The estimator is given by the following procedure:

  1. For each , associate with item a score .

  2. Construct a permutation by sorting the scores in increasing order, i.e., is chosen so that if , with ties broken arbitrarily.

  3. Define

Given the estimator $\hat{\lambda}$, we now describe a multistage procedure to estimate the permutation $\pi^*$. To recover the underlying order of items, it is equivalent to estimate the row sums $\sum_{j=1}^{n} M_{\pi^*(i), \pi^*(j)}$, which we call scores of the items, because the scores are increasing linearly if the items are placed in order. Initially, for each $i \in [n]$, we estimate the score of item $i$ by the number of wins item $i$ has. If item $i$ has a much higher score than item $j$ in the first stage, then we are confident that item $i$ is stronger than item $j$. Hence in the second stage, we can estimate $M_{\pi^*(i), \pi^*(j)}$ by $1/2 + \hat{\lambda}$, which is very close to the truth. For those pairs that we are not certain about, $M_{\pi^*(i), \pi^*(j)}$ is still estimated by its empirical version. The variance of each score is thus greatly reduced in the second stage, thereby yielding a more accurate order of the items. Then we iterate this process to obtain finer and finer estimates of the scores and the underlying order.

To present the Multistage Sorting (MS) algorithm formally, let us fix a positive integer which is the number of stages of the algorithm. We further split the second sample into subsamples each containing pairwise comparisons333We assume without loss of generality that divides to ease the notation.. Similar to the data matrix for the full sample, for we define a matrix by setting to be the number of wins item has against item in the -th sample. The MS algorithm proceeds as follows:

  1. For each , define , and .

  2. At the -th stage where , compute the score of item :

  3. Let and be sufficiently large universal constants444Determined according to Lemma 6.2 and Lemma 6.2 respectively.. If it holds that


    then we set the threshold

    and define the sets

    If (3.1) does not hold, then we define , and . Note that denotes the set of items whose ranking relative to has not been determined by the algorithm at stage .

  4. After repeating Step 2 and 3 for , output a permutation by sorting the scores in increasing order, i.e., is chosen so that if with ties broken arbitrarily.
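The multistage idea can be sketched in a few lines of code. The version below is a simplified illustration and not the exact procedure analyzed in the paper: it takes the noise parameter `lam` as an input instead of estimating it, it recomputes the comparison sets from scratch at every stage, it omits the side condition of Step 3, and it uses a hypothetical threshold of the form `tau_const * sqrt(scale * #uncertain * log n)` whose constant is an assumption. The names `multistage_sort`, `weaker`, `stronger` and `tau_const` are illustrative.

```python
import math

def multistage_sort(subsamples, lam, tau_const=1.0):
    """Simplified multistage sorting sketch (assumptions noted above).

    subsamples: one win-count matrix per stage, where A[i][j] is the number
    of wins of item i over item j, assumed sampled with replacement.
    Returns rank_hat with rank_hat[i] the estimated rank of item i (0 = weakest).
    """
    n = len(subsamples[0])
    num_pairs = n * (n - 1) / 2
    weaker = [set() for _ in range(n)]    # items judged weaker than i
    stronger = [set() for _ in range(n)]  # items judged stronger than i
    scores = [0.0] * n
    for A in subsamples:
        m = sum(map(sum, A))              # comparisons in this subsample
        if m == 0:
            continue                      # nothing observed at this stage
        scale = num_pairs / m             # scale * A[i][j] is unbiased for M
        taus = []
        for i in range(n):
            unsure = [j for j in range(n)
                      if j != i and j not in weaker[i] and j not in stronger[i]]
            # Empirical estimate on uncertain pairs, plug-in value elsewhere.
            scores[i] = (scale * sum(A[i][j] for j in unsure)
                         + (0.5 + lam) * len(weaker[i])
                         + (0.5 - lam) * len(stronger[i]))
            taus.append(tau_const * math.sqrt(scale * len(unsure) * math.log(n)))
        for i in range(n):
            weaker[i] = {j for j in range(n)
                         if scores[j] < scores[i] - taus[i] - taus[j]}
            stronger[i] = {j for j in range(n)
                           if scores[j] > scores[i] + taus[i] + taus[j]}
    order = sorted(range(n), key=lambda i: scores[i])
    rank_hat = [0] * n
    for position, item in enumerate(order):
        rank_hat[item] = position
    return rank_hat
```

Each stage touches every entry of one matrix a constant number of times, so the cost per stage is quadratic in the number of items.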

It is clear that the time complexity of each stage of the algorithm is . Take so that the overall time complexity of the MS algorithm is only . Our main result in this section is the following guarantee on the performance of the estimator given by the MS algorithm.

Suppose that for a sufficiently large constant and that where for a constant . Then, under the noisy sorting model with sampling model 2, the following holds. With probability at least , the MS algorithm with stages outputs an estimator that satisfies


Note that the second statement follows from the first one together with (2.1). Indeed, we have

which is optimal up to a polylogarithmic factor in the regime where is bounded away from according to Theorem 3.1 (and Theorem 6.1). Therefore, the MS algorithm achieves significant computational efficiency while sacrificing little in terms of statistical performance. On the downside, it is limited to the noisy sorting model where —this assumption is necessary to exploit strong stochastic transitivity—and our analysis does not account for the dependence in .

Furthermore, although we only consider model 2 of sampling with replacement in this section, the MS algorithm can be easily modified to handle model 1 of sampling without replacement. It is much more challenging to prove analogous theoretical guarantees in this case, because we cannot split the observations into independent samples. In Section 4, however, we provide empirical evidence showing that the MS estimator has very similar performance for the two sampling models.

Our algorithm bears comparison with the algorithm proposed by Braverman and Mossel (2008). Their algorithm—which works in the full observation case —achieves the statistically optimal rate in time , where is a large positive constant depending on . Though our algorithm’s statistical performance falls short of the optimal rate by a polylogarithmic factor, it runs in time and works in the partial observation setting as long as . Note by way of comparison that Theorem 6.1 indicates that no procedure achieves nontrivial recovery unless .

4 Simulations

To support our theoretical findings in Section 3.2, we implement the MS algorithm on synthetic instances generated from the noisy sorting model. For simplicity, we take and set in the algorithm. Theorem 3.2 predicts a scaling of the estimation error in the Kendall tau distance for model 2 of sampling with replacement, where is the number of items and is the number of pairwise comparisons. This rate is optimal up to a polylogarithmic factor according to Theorem 6.1.

Figure 1: Estimation errors for the observations sampled with and without replacement. Left: and ranging from to ; Right: and ranging from to .

In Figure 1, we plot estimation errors averaged over instances generated from the model. In the left plot, we let range from to and set . For this choice of , Theorem 3.2 predicts that and we indeed observe a near-linear scaling in that plot. In the right plot, we fix and let the proportion of observed entries, range from to . For this choice of parameters, Theorem 3.2 predicts that (recall that here is fixed), and we clearly observe a sublinear relation between and . Note that this does not contradict the lower bound since the latter is stated up to constants.

Moreover, the MS algorithm can be easily modified to work for the without replacement model 1. Namely, given the partially observed pairwise comparisons, we assign each comparison to one of the samples uniformly at random, independent of all the other assignments. After splitting the whole sample into subsamples, we execute the MS algorithm as in the previous case. In Figure 1, we take and plot the estimation errors for sampling without replacement, which closely follow the errors for observations sampled with replacement. Therefore, although it seems difficult to prove analogous guarantees on the performance of the MS algorithm applied to the without replacement model, empirically the algorithm performs very similarly for the two sampling models.

Panels (left to right): Stage 1, Stage 2, Stage 3.
Figure 2: The uncertainty regions at stages of the MS algorithm. The two axes represent the indices of the items. A black pixel at indicates that , i.e., the algorithm is not certain about the relative order of item and item at stage . A white pixel indicates the opposite.

To gain further intuition about the MS algorithm, we consider the set defined in the algorithm. At stage of the algorithm, the set consists of all indices for which we are not certain about the relative order of item and item . The proof of Theorem 3.2 essentially shows that the uncertainty set is shrinking as the algorithm proceeds. To verify this intuition, in Figure 2 we plot the uncertainty regions

at stages of the MS algorithm, for and . The items are ordered according to for visibility of the region. As exhibited in the plots, the uncertainty region is indeed shrinking as the algorithm proceeds.

5 The symmetric group and inversions

Before proving the main results for the noisy sorting model, we study the metric entropy of the symmetric group with respect to the Kendall tau distance. Counting permutations subject to constraints in terms of the Kendall tau distance is of theoretical importance and has interesting applications, e.g., in coding theory (see, e.g, Barg and Mazumdar, 2010; Mazumdar et al., 2013). We present the results in terms of metric entropy, which easily applies to the noisy sorting problem and may find further applications in statistical problems involving permutations.

For and , let and denote respectively the -covering number and the -packing number of with respect to the Kendall tau distance. The following main result of this section provides bounds on the metric entropy of balls in .

Consider the ball centered at with radius . We have that for ,

We now discuss some high-level implications of Proposition 5. Note that if , the lemma states that the -metric entropy of a ball of radius in the Kendall tau distance scales as . In other words, the symmetric group equipped with the Kendall tau metric is a doubling space with doubling dimension . One of the main messages of the current work is that although , the intrinsic dimension of is , which explains the absence of logarithmic factor in the minimax rate.

To start the proof, we first recall a useful tool for counting permutations, the inversion table. Formally, the inversion table $b_1, \dots, b_n$ of a permutation $\pi \in \mathfrak{S}_n$ is defined by

$$b_i = \sum_{j : j < i} \mathbb{1}\big( \pi(j) > \pi(i) \big)$$

for $i \in [n]$. Clearly, we have that $b_i \in \{0, 1, \dots, i - 1\}$ and $d_{\mathrm{KT}}(\pi, \mathrm{id}) = \sum_{i=1}^{n} b_i$. It is easy to reconstruct a unique permutation using an inversion table $\{b_i\}_{i=1}^{n}$ with $b_i \in \{0, 1, \dots, i - 1\}$, so the set of inversion tables is bijective to $\mathfrak{S}_n$ via this relation; see, e.g., Mahmoud (2000). We use this bijection to bound the number of permutations that differ from the identity by at most $k$ inversions. The following lemma appears in a different form in Barg and Mazumdar (2010). We provide a simple proof here for completeness.
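The bijection between permutations and inversion tables can be checked directly. The sketch below assumes the convention that the $i$-th entry counts earlier positions holding a larger value (other conventions exist), uses 0-indexed positions with values in $\{1, \dots, n\}$, and the function names are illustrative.

```python
def inversion_table(pi):
    """b[i] = number of positions j < i with pi[j] > pi[i]."""
    return [sum(1 for j in range(i) if pi[j] > pi[i]) for i in range(len(pi))]

def from_inversion_table(b):
    """Rebuild the unique permutation of 1..n with inversion table b."""
    n = len(b)
    remaining = list(range(1, n + 1))   # unused values, in increasing order
    pi = [0] * n
    for i in range(n - 1, -1, -1):
        # Exactly b[i] of the i + 1 remaining values must exceed pi[i].
        pi[i] = remaining.pop(i - b[i])
    return pi

pi = [3, 1, 4, 2]
b = inversion_table(pi)
assert from_inversion_table(b) == pi
# The entries sum to the number of inversions, i.e. the distance to the identity.
assert sum(b) == 3
```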

For , we have that

According to the discussion above, the cardinality , which we denote by , is equal to the number of inversion tables where such that . On the one hand, if for all , then , so a lower bound on is given by

Using Stirling’s approximation, we see that

On the other hand, if is only required to be a nonnegative integer for each , then we can use a standard “stars and bars” counting argument (Feller, 1968) to get an upper bound of the form

Taking the logarithm finishes the proof.
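The two counting arguments in this proof can be compared against an exact count for small cases. In the sketch below, the exact count uses a standard dynamic program over inversion tables; the lower count restricts every entry to at most $\lfloor k/n \rfloor$, and the upper count is the stars-and-bars number of nonnegative integer vectors of length $n$ with sum at most $k$. These specific closed forms are one way to instantiate the argument and may differ from the paper's displayed bounds by constants; the function names are illustrative.

```python
from math import comb

def count_within(n, k):
    """Exact number of permutations of n items with at most k inversions."""
    counts = [1]  # counts[m] = number of tables built so far with sum m
    for i in range(1, n + 1):
        # The i-th inversion table entry ranges over 0, ..., i - 1.
        new = [0] * (len(counts) + i - 1)
        for m, c in enumerate(counts):
            for b in range(i):
                new[m + b] += c
        counts = new
    return sum(counts[: k + 1])

def lower_count(n, k):
    """Tables with every entry at most floor(k / n); their sums never exceed k."""
    q = k // n
    total = 1
    for i in range(1, n + 1):
        total *= min(i - 1, q) + 1
    return total

def upper_count(n, k):
    """Nonnegative integer vectors of length n with sum at most k."""
    return comb(n + k, n)

n, k = 6, 9
assert lower_count(n, k) <= count_within(n, k) <= upper_count(n, k)
```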

We are ready to prove Proposition 5.

Proof of Proposition 5. The relation between the covering and the packing number is standard.

We employ a standard volume argument to control these numbers. Let be a -packing of so that the balls are disjoint for . Moreover, by the triangle inequality, for each . By the invariance of the Kendall tau distance under composition, Lemma 5 yields

On the other hand, if is an -net of , then the set of balls covers . By Lemma 5, we obtain

as claimed.

The lower bound on the packing number in Proposition 5 becomes vacuous when and are smaller than , so we complement it with the following result, which is useful for proving minimax lower bounds.

Consider the ball where . We have that

Without loss of generality, we may assume that and is even. The sparse Varshamov-Gilbert bound (Massart, 2007, Lemma 4.10) states that there exists a set of -sparse vectors in , such that and any two distinct vectors in are separated by at least in the Hamming distance. We now map every to a permutation by defining

  1. and if , and

  2. and if ,

for . Note that because swaps at most adjacent pairs. Denote by the image of under this mapping. Since the Hamming distance between any two distinct vectors in is lower bounded by , we see that for any distinct . Thus is an -packing of . By construction, , so we can use the standard relation to complete the proof.
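The map from binary vectors to permutations used in this construction can be illustrated as follows. The sketch assumes that each coordinate of the vector controls one disjoint adjacent pair, which is swapped when the coordinate equals one; under that assumption the Kendall tau distance between two images equals the Hamming distance between the vectors. The function names are illustrative.

```python
def swap_permutation(v):
    """Map a 0/1 vector of length m to a permutation of 1..2m.

    Coordinate i controls the adjacent pair (2i + 1, 2i + 2), which is
    swapped when v[i] == 1 and left in place otherwise.
    """
    pi = []
    for i, bit in enumerate(v):
        a, b = 2 * i + 1, 2 * i + 2
        pi.extend([b, a] if bit else [a, b])
    return pi

def kendall_tau(pi, sigma):
    """Number of discordant pairs between two permutations."""
    n = len(pi)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def hamming(u, v):
    """Number of coordinates in which two vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

u, v = [1, 0, 1], [0, 0, 1]
assert kendall_tau(swap_permutation(u), swap_permutation(v)) == hamming(u, v)
```

A well-separated set of sparse vectors therefore maps to an equally well-separated set of permutations, each within a small distance of the identity.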

6 Proofs of the main results

This section is devoted to the proofs of our main results. We start with a lemma giving useful tail bounds for the binomial distribution.

Suppose that has the Binomial distribution where and . Then for and , we have

  1. and

First, for

, by the definition of the Kullback-Leibler divergence, we have


Thus we also have


Moreover, by Theorem 1 of Arratia and Gordon (1989) and symmetry, it holds that

  1. and

The claimed tail bounds hence follow from (6.1) and (6.2).

6.1 Proof of Theorem 3.1

First, to achieve optimal upper bounds, we consider a variant of maximum likelihood estimation. Fix and define in the case of sampling model 1, and in the case of sampling model 2. If or is unknown, one may learn these scalar parameters easily from the observations and define using the estimated values. For readability, we assume that they are given to avoid these technical complications.

Let be a maximal -packing (and thus a -net) of the symmetric group with respect to . Consider the following estimator:


It is easy to see that is the MLE of over . Such an estimator is often called sieve estimator (see, e.g. Le Cam, 1986) in the statistics literature. The estimator satisfies the following upper bounds.

Consider the noisy sorting model with underlying permutation and probability matrix where . Then, with probability at least , the estimator defined in (6.3) satisfies

By integrating the tail probabilities of the above bounds, we easily obtain bounds on the expectation of the same order, which then prove the upper bounds in Theorem 3.1. One may wonder whether the rate in Theorem 6.1 can be achieved by the MLE over defined by

Our current techniques only allow us to prove bounds on that incur an extra factor (resp. ) in model 1 (resp. 2). It is unclear whether these logarithmic factors can be removed for the MLE.

Proof of Theorem 6.1. We assume that $n$ is lower bounded by a constant without loss of generality, and note that the bounds of order $n^2$ are trivial. The proof is split into four parts to improve readability.

Basic setup.

Since is a maximal -packing of , it is also a -net and thus there exists such that . By definition of , Canceling concordant pairs under and , we see that

Splitting the summands according to yields that

Since , we may drop the leftmost term and drop the condition in the rightmost term to obtain that


This inequality is crucial to proving that is close to with high probability.

To set up the rest of the proof, we define, for ,