Imagine we have a large collection of $n$ items, and we are given partially revealed comparisons between pairs of items. These paired comparisons are collected in a non-adaptive fashion, and could be highly noisy and incomplete. The aim is to aggregate these partial preferences so as to identify the $K$ items that receive the highest ranks. This problem, which is called top-$K$ rank aggregation, finds applications in numerous contexts, including web search (Dwork et al., 2001), recommendation systems (Baltrunas et al., 2010), and sports competition (Masse, 1997), to name just a few. The challenge is both statistical and computational: how can one achieve reliable top-$K$ ranking from a minimal number of pairwise comparisons, while retaining computational efficiency?
1.1 Popular approaches
To address the aforementioned challenge, many prior approaches have been put forward based on certain statistical models. Arguably one of the most widely used parametric models is the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959), which assigns a latent preference score $\{w_i^*\}_{1 \le i \le n}$ to each of the $n$ items. The BTL model posits that the chance of each item winning a paired comparison is determined by the relative scores of the two items involved, or more precisely,
$$\mathbb{P}\{\text{item } j \text{ is preferred over item } i\} = \frac{w_j^*}{w_i^* + w_j^*} \qquad (1)$$
in each comparison of item $i$ against item $j$. The items are repeatedly compared in pairs according to this parametric model. The task then boils down to identifying the $K$ items with the highest preference scores, given these pairwise comparisons.
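As a small illustration of the winning probability in the BTL model, the sketch below simulates repeated comparisons between two items. It assumes the standard BTL form, in which item $j$ beats item $i$ with probability $w_j / (w_i + w_j)$; the function names and the scores 1.0 and 3.0 are hypothetical choices made only for this example.

```python
import random

def btl_win_prob(w_i: float, w_j: float) -> float:
    """Probability that item j is preferred over item i under the BTL model."""
    return w_j / (w_i + w_j)

def simulate_comparison(w_i: float, w_j: float, rng: random.Random) -> int:
    """Return 1 if item j wins a single comparison against item i, else 0."""
    return 1 if rng.random() < btl_win_prob(w_i, w_j) else 0

# Illustrative scores: item j is three times as strong as item i.
rng = random.Random(0)
trials = 10_000
wins_j = sum(simulate_comparison(1.0, 3.0, rng) for _ in range(trials))
empirical = wins_j / trials  # should be close to 3 / (1 + 3) = 0.75
```

The empirical win frequency approaches the model probability as the number of repeated comparisons grows, which is what makes the latent scores identifiable from comparison data.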
Among the ranking algorithms tailored to the BTL model, the following two procedures have received particular attention, both of which rank the items based on appropriate estimates of the latent preference scores.
The spectral method. By connecting the winning probability in (1) with the transition probability of a reversible Markov chain, the spectral method attempts recovery of $w^*$ via the leading left eigenvector of a sample transition matrix. This procedure, also known as Rank Centrality (Negahban et al., 2017a), bears similarity to the PageRank algorithm.
The maximum likelihood estimator (MLE). This approach proceeds by finding the score assignment that maximizes the likelihood function (Ford, 1957). When parameterized appropriately, solving the MLE becomes a convex program, and hence is computationally feasible. There are also important variants of the MLE that enforce additional regularization.
Details are postponed to Section 2.2. In addition to their remarkable practical applicability, these two ranking paradigms are appealing in theory as well. For instance, both of them provably achieve intriguing $\ell_2$ accuracy when estimating the latent preference scores (Negahban et al., 2017a).
Nevertheless, the $\ell_2$ error for estimating the latent scores merely serves as a “meta-metric” for the ranking task, which does not necessarily reveal the accuracy of top-$K$ identification. In fact, given that the $\ell_2$ loss only reflects the estimation error in some average sense, it is certainly possible that an algorithm obtains minimal $\ell_2$ estimation loss but incurs (relatively) large errors when estimating the scores of the highest ranked items. Interestingly, a recent work by Chen and Suh (2015) demonstrates that a careful combination of the spectral method and the coordinate-wise MLE is optimal for top-$K$ ranking. This leaves open the following natural questions: where does the spectral method alone, or the MLE alone, stand in top-$K$ ranking? Are they capable of attaining exact top-$K$ recovery from minimal samples? These questions form the primary objectives of our study.
As we will elaborate later, the spectral method part of the preceding questions was recently explored by Jang et al. (2016), for a regime where a relatively large fraction of item pairs have been compared. However, it remains unclear how well the spectral method can perform in a much broader (and often much more challenging) regime, where the fraction of item pairs being compared may be vanishingly small. Additionally, the ranking accuracy of the MLE (and its variants) remains unknown.
1.2 Main contributions
The central focus of the current paper is to assess the accuracy of both the spectral method and the regularized MLE in top-$K$ identification. Assuming that the pairs of items being compared are randomly selected and that the preference scores fall within a fixed dynamic range, our paper delivers a somewhat surprising message:
Both the spectral method and the regularized MLE achieve perfect identification of top-$K$ ranked items under optimal sample complexity (up to some constant factor)!
It is worth emphasizing that these two algorithms succeed even under the sparsest possible regime, a scenario where only an exceedingly small fraction of pairs of items have been compared. This calls for precise control of the entrywise error, as opposed to the $\ell_2$ loss, for estimating the scores. To this end, our theory is established upon a novel leave-one-out argument, which might shed light on how to analyze the entrywise error for more general optimization problems.
As a byproduct of the analysis, we derive an elementary eigenvector perturbation bound for (asymmetric) probability transition matrices, which parallels Davis-Kahan’s theorem for symmetric matrices. This simple perturbation bound immediately leads to an improved $\ell_2$ error bound for the spectral method, which allows us to close the gap between the theoretical performance of the spectral method and the minimax lower limit.
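Before proceeding, we introduce a few notations that will be useful throughout. To begin with, for any strictly positive probability vector $\pi \in \mathbb{R}^n$, we define the inner product space indexed by $\pi$ as a vector space in $\mathbb{R}^n$ endowed with the inner product $\langle x, y \rangle_\pi = \sum_{i=1}^n \pi_i x_i y_i$. The corresponding vector norm and the induced matrix norm are defined respectively as $\|x\|_\pi = \sqrt{\langle x, x \rangle_\pi}$ and $\|A\|_\pi = \sup_{\|x\|_\pi = 1} \|x^\top A\|_\pi$.
Additionally, the notation $f(n) = O(g(n))$ or $f(n) \lesssim g(n)$ means there is a constant $c > 0$ such that $|f(n)| \le c\,|g(n)|$; $f(n) = \Omega(g(n))$ or $f(n) \gtrsim g(n)$ means there is a constant $c > 0$ such that $|f(n)| \ge c\,|g(n)|$; $f(n) = \Theta(g(n))$ or $f(n) \asymp g(n)$ means that there exist constants $c_1, c_2 > 0$ such that $c_1 |g(n)| \le |f(n)| \le c_2 |g(n)|$; and $f(n) = o(g(n))$ means $\lim_{n \to \infty} f(n)/g(n) = 0$.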
2 Statistical models and main results
2.1 Problem setup
We begin with a formal introduction of the Bradley-Terry-Luce parametric model for binary comparisons.
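Preference scores. As introduced earlier, we assume the existence of a positive latent score vector
$$w^* = [w_1^*, \dots, w_n^*]^\top$$
that comprises the underlying preference scores $\{w_i^* > 0\}_{1 \le i \le n}$ assigned to each of the $n$ items. Alternatively, it is sometimes more convenient to reparameterize the score vector by
$$\theta^* = [\theta_1^*, \dots, \theta_n^*]^\top, \quad \text{where } \theta_i^* = \log w_i^*.$$
These scores are assumed to fall within a dynamic range given by
$$w_i^* \in [w_{\min}, w_{\max}], \quad \text{or equivalently} \quad \theta_i^* \in [\theta_{\min}, \theta_{\max}],$$
for all $1 \le i \le n$ and for some $w_{\min} > 0$, $w_{\max} > 0$, $\theta_{\min} = \log w_{\min}$, and $\theta_{\max} = \log w_{\max}$. We also introduce the condition number as
$$\kappa := \frac{w_{\max}}{w_{\min}}.$$
Notably, the current paper primarily focuses on the case with a fixed dynamic range (i.e. $\kappa$ is a fixed constant independent of $n$), although we will also discuss extensions to the large dynamic range regime in Section 3. Without loss of generality, it is assumed that
$$w_{\max} \ge w_1^* \ge w_2^* \ge \dots \ge w_n^* \ge w_{\min},$$
meaning that items $1$ through $K$ are the desired top-$K$ ranked items.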
Comparison graph. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ stand for a comparison graph, where the vertex set $\mathcal{V} = \{1, 2, \dots, n\}$ represents the $n$ items of interest. The items $i$ and $j$ are compared if and only if $(i, j)$ falls within the edge set $\mathcal{E}$. Unless otherwise noted, we assume that $\mathcal{G}$ is drawn from the Erdős–Rényi random graph $\mathcal{G}_{n,p}$, such that an edge between any pair of vertices is present independently with some probability $p$. In words, $p$ captures the fraction of item pairs being compared.
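Pairwise comparisons. For each $(i, j) \in \mathcal{E}$, we obtain $L$ independent paired comparisons between items $i$ and $j$. Let $y_{i,j}^{(l)}$ be the outcome of the $l$-th comparison, which is independently drawn as
$$y_{i,j}^{(l)} = \begin{cases} 1, & \text{with probability } \frac{w_j^*}{w_i^* + w_j^*} = \frac{e^{\theta_j^*}}{e^{\theta_i^*} + e^{\theta_j^*}}, \\ 0, & \text{otherwise.} \end{cases}$$
By convention, we set $y_{i,j}^{(l)} = 1 - y_{j,i}^{(l)}$ for all $(i, j) \in \mathcal{E}$ throughout the paper. This is also known as the logistic pairwise comparison model, due to its strong resemblance to logistic regression. It is self-evident that the sufficient statistics under this model are given by
$$y := \{ y_{i,j} \mid (i, j) \in \mathcal{E} \}, \quad \text{where } y_{i,j} := \frac{1}{L} \sum_{l=1}^{L} y_{i,j}^{(l)}.$$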
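To simplify the notation, we shall also take
$$y_{i,j}^* := \frac{w_j^*}{w_i^* + w_j^*} = \frac{e^{\theta_j^*}}{e^{\theta_i^*} + e^{\theta_j^*}}.$$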
Goal. The goal is to identify the set of top-$K$ ranked items (that is, the set of $K$ items that enjoy the largest preference scores) from the pairwise comparison data $y$.
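The sampling model of this subsection can be sketched in a few lines. The snippet below is a minimal illustration, assuming an Erdős–Rényi comparison graph with edge probability `p`, `L` repeated comparisons per edge, and the BTL convention that `y[(i, j)]` records the fraction of comparisons won by item `j`. The function name `sample_comparisons` and the dictionary layout are hypothetical conventions chosen for this example.

```python
import random

def sample_comparisons(w, p, L, seed=0):
    """Draw an Erdos-Renyi comparison graph and BTL comparison outcomes.

    Returns a dict mapping (i, j), i != j, to y_ij, the fraction of the L
    comparisons between items i and j that were won by item j. Both
    orientations are stored so that y[(i, j)] + y[(j, i)] == 1.
    """
    rng = random.Random(seed)
    n = len(w)
    y = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= p:  # the pair (i, j) is not compared
                continue
            win_prob_j = w[j] / (w[i] + w[j])
            wins_j = sum(rng.random() < win_prob_j for _ in range(L))
            y[(i, j)] = wins_j / L
            y[(j, i)] = 1.0 - y[(i, j)]
    return y
```

Storing both orientations doubles the memory footprint but keeps later steps (building a transition matrix or a likelihood) free of index bookkeeping.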
2.2.1 The spectral method: Rank Centrality
The spectral ranking algorithm, or Rank Centrality (Negahban et al., 2017a), is motivated by the connection between the pairwise comparisons and a random walk over a directed graph. The algorithm starts by converting the pairwise comparison data $y$ into a transition matrix $P = [P_{i,j}]_{1 \le i, j \le n}$ in such a way that
$$P_{i,j} = \begin{cases} \frac{1}{d}\, y_{i,j}, & \text{if } (i, j) \in \mathcal{E}, \\ 1 - \frac{1}{d} \sum_{k : (i,k) \in \mathcal{E}} y_{i,k}, & \text{if } i = j, \\ 0, & \text{otherwise,} \end{cases}$$
for some given normalization factor $d$, and then proceeds by computing the stationary distribution $\pi \in \mathbb{R}^n$ of the Markov chain induced by $P$. As we shall see later, the parameter $d$ is taken to be on the same order as the maximum vertex degree of $\mathcal{G}$ while ensuring the non-negativity of $P$. As asserted by Negahban et al. (2017a), $\pi$ is a faithful estimate of $w^*$ up to some global scaling. The algorithm is summarized in Algorithm 1.
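The spectral procedure described above can be sketched as follows. This is an illustrative implementation rather than the paper's Algorithm 1 verbatim: it assumes the data are given as a dictionary `y[(i, j)]` holding the fraction of comparisons won by item `j` (both orientations present), it picks `d` as twice the maximum degree when none is supplied, and it approximates the stationary distribution by plain power iteration. All of these choices, and the helper names, are assumptions made for the example.

```python
def rank_centrality(y, n, d=None, iters=2000):
    """Estimate the scores (up to scaling) as the stationary distribution of P."""
    deg = [0] * n
    for (i, _j) in y:
        deg[i] += 1
    if d is None:
        d = 2 * max(deg)  # assumed choice; keeps every diagonal entry >= 1/2
    # Off-diagonal entries: P[i][j] = y_ij / d for compared pairs.
    P = [[0.0] * n for _ in range(n)]
    for (i, j), y_ij in y.items():
        P[i][j] = y_ij / d
    # Diagonal entries make each row sum to one.
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])
    # Left power iteration: pi <- pi P, starting from the uniform distribution.
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def top_k(scores, k):
    """Indices of the k largest entries of scores, in decreasing order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

Power iteration is used here only for simplicity; any routine that returns the leading left eigenvector of the transition matrix would serve the same purpose.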
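To develop some intuition regarding why this spectral algorithm gives a reasonable estimate of $w^*$, it is perhaps more convenient to look at the population transition matrix $P^* = [P_{i,j}^*]_{1 \le i, j \le n}$:
$$P_{i,j}^* = \begin{cases} \frac{1}{d} \frac{w_j^*}{w_i^* + w_j^*}, & \text{if } (i, j) \in \mathcal{E}, \\ 1 - \frac{1}{d} \sum_{k : (i,k) \in \mathcal{E}} \frac{w_k^*}{w_i^* + w_k^*}, & \text{if } i = j, \\ 0, & \text{otherwise,} \end{cases}$$
which coincides with $P$ by taking $L \to \infty$. It can be seen that the normalized score vector
$$\pi^* := \frac{1}{\sum_{i=1}^n w_i^*} \, [w_1^*, w_2^*, \dots, w_n^*]^\top$$
is the stationary distribution of the Markov chain induced by the transition matrix $P^*$, since $P^*$ and $\pi^*$ are in detailed balance, namely,
$$\pi_i^* P_{i,j}^* = \pi_j^* P_{j,i}^*, \quad \text{for all } (i, j).$$
As a result, one expects the stationary distribution of the sample version $P$ to form a good estimate of $w^*$, provided the sample size is sufficiently large.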
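The detailed balance property can be checked numerically on a toy instance. The sketch below assumes the population transition matrix takes the Rank Centrality form, with off-diagonal entries $w_j / (w_i + w_j) / d$ on compared pairs; the three-item path graph, the scores, and the value `d = 4` are arbitrary illustrative choices.

```python
def population_matrix(w, edges, d):
    """Population transition matrix for undirected comparison edges."""
    n = len(w)
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        P[i][j] = w[j] / (w[i] + w[j]) / d
        P[j][i] = w[i] / (w[i] + w[j]) / d
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])  # self-loop absorbs the remaining mass
    return P

w = [3.0, 2.0, 1.0]
edges = [(0, 1), (1, 2)]  # a path graph, chosen only for illustration
P_star = population_matrix(w, edges, d=4)
pi_star = [x / sum(w) for x in w]

# Detailed balance: pi_i * P_ij == pi_j * P_ji for every pair (i, j).
balanced = all(
    abs(pi_star[i] * P_star[i][j] - pi_star[j] * P_star[j][i]) < 1e-12
    for i in range(3) for j in range(3)
)
# Stationarity follows: (pi P)_j == pi_j for every j.
pi_next = [sum(pi_star[i] * P_star[i][j] for i in range(3)) for j in range(3)]
```

Note that detailed balance holds for any comparison graph, not only dense ones, because the identity is verified edge by edge.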
2.2.2 The regularized MLE
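Under the BTL model, the negative log-likelihood function conditioned on $\mathcal{G}$ is given by (up to some global scaling)
$$\mathcal{L}(\theta; y) := - \sum_{(i,j) \in \mathcal{E},\, i > j} \left\{ y_{j,i} \log \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} + (1 - y_{j,i}) \log \frac{e^{\theta_j}}{e^{\theta_i} + e^{\theta_j}} \right\}.$$
The regularized MLE then amounts to solving the following convex program
$$\text{minimize}_{\theta \in \mathbb{R}^n} \quad \mathcal{L}_\lambda(\theta; y) := \mathcal{L}(\theta; y) + \frac{1}{2} \lambda \|\theta\|_2^2 \qquad (13)$$
for a regularization parameter $\lambda > 0$. As will be discussed later, we shall adopt the choice $\lambda \asymp \sqrt{\frac{np \log n}{L}}$ throughout this paper. For the sake of brevity, we let $\theta$ represent the resulting penalized maximum likelihood estimate whenever it is clear from the context. Similar to the spectral method, one reports the $K$ items associated with the largest entries of $\theta$.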
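One simple way to solve a program of this kind is plain gradient descent on the $\ell_2$-regularized negative log-likelihood. The sketch below is an illustration under stated assumptions rather than the paper's exact procedure: the data format `y[(i, j)]` (fraction of comparisons won by item `j`, both orientations stored), the step size `1 / (lam + max degree)`, the zero initialization, and the iteration count are all hypothetical choices, and `lam` is assumed to be strictly positive.

```python
import math

def regularized_mle(y, n, lam, iters=2000):
    """Gradient descent on the l2-regularized BTL negative log-likelihood."""
    deg = [0] * n
    for (i, _j) in y:
        deg[i] += 1
    step = 1.0 / (lam + max(deg))  # assumed step size, below 2 / smoothness
    theta = [0.0] * n
    for _ in range(iters):
        grad = [lam * t for t in theta]  # gradient of (lam / 2) * ||theta||^2
        for (i, j), y_ij in y.items():
            # Model probability that item j beats item i under current theta.
            p_ij = 1.0 / (1.0 + math.exp(theta[i] - theta[j]))
            # Each unordered pair appears as (i, j) and (j, i); each key adds
            # the derivative with respect to theta[j] only, so the pair is
            # counted exactly once per coordinate.
            grad[j] += p_ij - y_ij
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta
```

Because the likelihood gradient sums to zero across coordinates and the iterates start at zero, the estimate stays centered, which resolves the shift ambiguity of the BTL parameterization.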
2.3 Main results
The most challenging part of top-$K$ ranking is to distinguish the $K$-th and the $(K+1)$-th items. In fact, the score difference of these two items captures the distance between the item sets $\{1, \dots, K\}$ and $\{K+1, \dots, n\}$. Unless their latent scores are sufficiently separated, the finite-sample nature of the model would make it infeasible to distinguish these two critical items. With this consideration in mind, we define the following separation measure
$$\Delta_K := \frac{w_K^* - w_{K+1}^*}{w_{\max}}.$$
This metric turns out to play a crucial role in determining the minimal sample complexity for perfect top-$K$ identification.
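For concreteness, the separation measure can be computed as in the sketch below. It assumes the form $(w_K - w_{K+1}) / w_{\max}$ with the scores sorted in decreasing order, and it takes $w_{\max}$ to be the largest score in the list, which is a simplification: in the model, $w_{\max}$ is the upper end of the dynamic range and may exceed the largest score.

```python
def separation(w, k):
    """Normalized gap between the k-th and (k+1)-th largest scores."""
    s = sorted(w, reverse=True)
    if not 1 <= k < len(s):
        raise ValueError("k must satisfy 1 <= k < len(w)")
    # Assumption: w_max is approximated by the largest score present.
    return (s[k - 1] - s[k]) / s[0]
```

For example, with scores `[1.0, 4.0, 2.0, 3.0]` and `k = 2`, the second and third largest scores are 3 and 2, so the measure is (3 - 2) / 4.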
The main finding of this paper concerns the optimality of both the spectral method and the regularized MLE in the presence of a fixed dynamic range (i.e. $\kappa = O(1)$). Recall that under the BTL model, the total number $N$ of samples we collect concentrates sharply around its mean, namely,
$$N = (1 + o(1))\, \mathbb{E}[N] = (1 + o(1))\, \frac{n^2 p L}{2}$$
occurs with high probability. Our main result is stated in terms of the sample complexity required for exact top-$K$ identification.
Consider the pairwise comparison model specified in Section 2.1 with . Suppose that and that
for some sufficiently large positive constants and . Further assume for any absolute constants . With probability exceeding , the set of top- ranked items can be recovered exactly by the spectral method given in Algorithm 1, and by the regularized MLE given in (13). Here, we take in the spectral method and in the regularized MLE, where and are some absolute constants.
We emphasize that for is a fundamental requirement for the ranking task. In fact, if for any constant , then the comparison graph is disconnected with high probability. This means that there exists at least one isolated item (which has not been compared with any other item) and cannot be ranked.
In fact, the assumption that for any absolute constants is not needed for the spectral method.
Here, we assume the same number of comparisons to simplify the presentation as well as the proof. The result still holds true if we have distinct ’s for each , as long as .
Theorem 1 asserts that both the spectral method and the regularized MLE achieve a sample complexity on the order of . Encouragingly, this sample complexity coincides with the minimax limit identified in (Chen and Suh, 2015, Theorem 2) in the fixed dynamic range, i.e. .
Theorem 2 (Chen and Suh (2015)).
Fix , and suppose that
where . Then for any ranking procedure , one can find a score vector with separation such that fails to retrieve the top- items with probability at least .
We are now positioned to compare our results with Jang et al. (2016), which also investigates the accuracy of the spectral method for top- ranking. Specifically, Theorem 3 in Jang et al. (2016) establishes the optimality of the spectral method for the relatively dense regime where
In this regime, however, the total sample size necessarily exceeds
which rules out the possibility of achieving minimal sample complexity if is sufficiently large. For instance, consider the case where , then the optimal sample size — as revealed by Theorem 1 or (Chen and Suh, 2015, Theorem 1) — is on the order of
which is a factor of lower than the bound in (18). By contrast, our results hold all the way down to the sparsest possible regime where , confirming the optimality of the spectral method even for the most challenging scenario. Furthermore, we establish that the regularized MLE shares the same optimality guarantee as the spectral method, which was previously out of reach.
2.4 Optimal control of entrywise estimation errors
In order to establish the ranking accuracy as asserted by Theorem 1, the key is to obtain precise control of the $\ell_\infty$ loss of the score estimates. Our results are as follows.
Theorem 3 (Entrywise error of the spectral method).
Theorem 4 (Entrywise error of the regularized MLE).
Consider the pairwise comparison model specified in Section 2.1 with . Suppose that for some sufficiently large constant and that for any absolute constants . Set the regularization parameter to be for some absolute constant . Then the regularized MLE satisfies
with probability exceeding , where and .
Theorems 3–4 indicate that if the number of comparisons associated with each item — which concentrates around — exceeds the order of , then both methods are able to achieve a small error when estimating the scores.
with high probability. Similar theoretical guarantees have been derived for another variant of the MLE (the constrained version) under a uniform sampling model as well (Negahban et al., 2017a). In comparison, our results indicate that the estimation errors for both algorithms are almost evenly spread out across all coordinates rather than being localized or clustered. Notably, the pointwise errors revealed by Theorems 3-4 immediately lead to exact top- identification as claimed by Theorem 1.
Proof of Theorem 1.
In what follows, we prove the theorem for the spectral method part. The regularized MLE part follows from an almost identical argument and hence is omitted.
Since the spectral algorithm ranks the items in accordance with the score estimate , it suffices to demonstrate that
To this end, we first apply the triangle inequality to get
In addition, it follows from Theorem 3 as well as our sample complexity assumption that
These conditions taken collectively imply that as long as exceeds some sufficiently large constant. Substitution into (21) reveals that , as claimed. ∎
2.5 Heuristic arguments
We pause to develop some heuristic explanation as to why the estimation errors are expected to be spread out across all entries. For simplicity, we focus on the case where $p = 1$ and $L$ is sufficiently large, so that $y$ and $P$ sharply concentrate around $y^*$ and $P^*$, respectively.
We begin with the spectral algorithm. Since and are respectively the invariant distributions of the Markov chains induced by and , we can decompose
When and , the entries of (resp. the off-diagonal entries of and ) are all of the same order and, as a result, the energy of the uncertainty term is spread out (using standard concentration inequalities). In fact, we will demonstrate in Section 5.2 that
which coincides with the optimal rate. Further, if we look at each entry of (22), then for all ,
By construction of the transition matrix, one can easily verify that is bounded away from and for all . As a consequence, the identity allows one to treat each as a mixture of three effects: (i) the first term of (24) behaves as an entrywise contraction of the error; (ii) the second term of (24) is a (nearly uniformly weighted) average of the errors over all coordinates, which can essentially be treated as a smoothing operator applied to the error components; and (iii) the uncertainty term . Rearranging terms in (24), we are left with
which further gives,
We now move on to the regularized MLE, following a very similar argument. By the optimality condition that , one can derive (for some to be specified later)
Write , where and denote respectively the diagonal and off-diagonal parts of . Under our assumptions, one can check that for all and for any . With these notations in place, one can write the entrywise error as follows
By choosing for some sufficiently small constant , we get and . Therefore, the right-hand side of the above relation also comprises a contraction term as well as an error smoothing term, similar to (24). Carrying out the same argument as for the spectral method, we see that the estimation errors of the regularized MLE are expected to be spread out.
2.6 Numerical experiments
It is worth noting that extensive numerical experiments on both synthetic and real data have already been conducted in Negahban et al. (2017a) to confirm the practicability of both the spectral method and the regularized MLE. See also Chen and Suh (2015) for the experiments on the Spectral-MLE algorithm. This section provides some additional simulations to complement their experimental results as well as our theory. Throughout the experiments, we set the number of items to be , while the number of repeated comparisons and the edge probability can vary with the experiments. Regarding the tuning parameters, we choose in the spectral method where is the maximum degree of the graph and in the regularized MLE, which are consistent with the configurations considered in the main theorems. Additionally, we also display the experimental results for the unregularized MLE, i.e. . All of the results are averaged over 100 Monte Carlo simulations.
[Figure: panels (a) spectral method, (b) regularized MLE, (c) MLE.]
We first investigate the error of the spectral method and the (regularized) MLE when estimating the preference scores. To this end, we generate the latent scores () independently and uniformly at random over the interval . Figure 1(a) (resp. Figure 1(b)) displays the entrywise error in the spectral score estimation as the number of repeated comparisons (resp. the edge probability ) varies. As is seen from the plots, the error of all methods gets smaller as and increase, confirming our results in Theorems 3-4. Next, we show in Figure 1(c) the relative error while fixing the total number of samples (i.e. ). It can be seen that the performance almost does not change if the sample complexity remains the same. It is also interesting to see that the error of the spectral method and the MLE are very similar. In addition, Figure 2 illustrates the relative error and the relative error in score estimation for all three methods. As we can see, the relative errors are not much larger than the relative errors (recall that ), thus offering empirical evidence that the errors in the score estimates are spread out across all entries.
Further, we examine the top-$K$ ranking accuracy of all three methods. Here, we fix and , set , and let for all and for all . By construction, the score separation satisfies . Figure 3 illustrates the accuracy in identifying the top-$K$ ranked items. Their performance improves as the score separation becomes larger, which matches our theory in Theorem 1.
2.7 Other related works
The problem of ranking based on partial preferences has received much attention during the past decade. Two types of observation models have been considered: the cardinal-based model, where users provide explicit numerical ratings of the items, and the ordinal-based model, where users are asked to make comparative measurements. See Ammar and Shah (2011) for detailed comparisons between them.
In terms of the ordinal-based model (and in particular, ranking from pairwise comparisons), both parametric and nonparametric models have been extensively studied. For example, Hunter (2004) examined variants of the parametric BTL model, and established the convergence properties of the minorization-maximization algorithm for computing the MLE. Moreover, the BTL model falls under the category of low-rank parametric models, since the preference matrix is generated by passing a rank-2 matrix through the logistic link function (Rajkumar and Agarwal, 2016). Additionally, Jiang et al. (2011) proposed a least-squares type method to estimate the full ranking, which generalizes the simple Borda count algorithm (Ammar and Shah, 2011). For many of these algorithms, the sample complexities needed for perfect total ranking were determined by Rajkumar and Agarwal (2014), although the top-$K$ ranking accuracy was not considered there.
Going beyond the parametric models, a recent line of work (Shah et al., 2017; Shah and Wainwright, 2015; Chen et al., 2017; Pananjady et al., 2017) considered the nonparametric stochastically transitive model, where the only model assumption is that the comparison probability matrix follows certain transitivity rules. This type of model subsumes the BTL model as a special case. For instance, Shah and Wainwright (2015) suggested a simple counting-based algorithm which can reliably recover the top-$K$ ranked items for various models. However, the sampling paradigm considered therein is quite different from ours in the sparse regime; for instance, their model does not come close to the setting where $p$ is small but $L$ is large, which is the most challenging regime of the model adopted in our paper as well as in Negahban et al. (2017a) and Chen and Suh (2015).
All of the aforementioned papers concentrate on the case where there is a single ground-truth ordering. It would also be interesting to investigate the scenarios where different users might have different preference scores. To this end, Negahban et al. (2017b); Lu and Negahban (2014) imposed the low-rank structure on the underlying preference matrix and adopted the nuclear-norm relaxation approach to recover the users’ preferences. Additionally, several papers explored the ranking problem for the more general Plackett-Luce model (Hajek et al., 2014; Soufiani et al., 2013), in the presence of adaptive sampling (Jamieson and Nowak, 2011; Busa-Fekete et al., 2013; Heckel et al., 2016; Agarwal et al., 2017), for the crowdsourcing scenario (Chen et al., 2013), and in the adversarial setting (Suh et al., 2017). These are beyond the scope of the present paper.
Speaking of the error metric, the $\ell_\infty$ norm is appropriate for the top-$K$ ranking problem and for other learning problems as well. In particular, perturbation bounds for eigenvectors of symmetric matrices (Koltchinskii and Lounici, 2016; Fan et al., 2016; Eldridge et al., 2017; Abbe et al., 2017) and singular vectors of general matrices (Koltchinskii and Xia, 2016) have been studied. In stark contrast, we study the $\ell_\infty$ norm errors of the leading eigenvector of a class of asymmetric matrices (probability transition matrices) and of the regularized MLE. Furthermore, most existing results require the expectations of data matrices to have low rank, at least approximately. We do not impose such assumptions.
When it comes to the technical tools, it is worth noting that the leave-one-out idea has been invoked to analyze random designs for other high-dimensional problems, e.g. robust M-estimators (El Karoui, 2017), confidence intervals for Lasso (Javanmard and Montanari, 2015), likelihood ratio test (Sur et al., 2017), and nonconvex statistical learning (Ma et al., 2017; Chen et al., 2018). In particular, Zhong and Boumal (2017) and Abbe et al. (2017) use it to precisely characterize entrywise behavior of eigenvectors of a large class of symmetric random matrices, which improves upon prior eigenvector analysis. Consequently, they are able to show the sharpness of spectral methods in many popular models. Our introduction of leave-one-out auxiliary quantities is similar in spirit to these papers.
Finally, the family of spectral methods has been successfully applied in numerous applications, e.g. matrix completion (Keshavan et al., 2010), phase retrieval (Chen and Candès, 2017), graph clustering (Rohe et al., 2011; Abbe et al., 2017), and joint alignment (Chen and Candes, 2016). All of them are designed based on the eigenvectors of some symmetric matrix, or the singular vectors if the matrix of interest is asymmetric. Our paper contributes to this growing literature by establishing a sharp eigenvector perturbation analysis framework for an important class of asymmetric matrices: the probability transition matrices.
3 Extension: general dynamic range
All of the preceding results concern the regime with a fixed dynamic range (i.e. $\kappa = O(1)$). This section moves on to discussing the case with large $\kappa$.
To start with, by going through the same proof technique, we can readily obtain — in the general setting — the following performance guarantees for both the spectral estimate and the regularized MLE .
Consider the pairwise comparison model in Section 2.1. Suppose that for some sufficiently large constant and that for any absolute constants . Set the regularization parameter to be for some absolute constant . Then with probability exceeding ,
the regularized MLE satisfies
where and .
the set of top- ranked items can be recovered exactly by the regularized MLE given in (13), as long as
for some sufficiently large constant .
Notably, the achievability bounds for top- ranking in Theorems 5–6 do not match the lower bound asserted in Theorem 2 in terms of . This is partly because the separation measure fails to capture the information bottleneck for the general setting. In light of this, we introduce the following new measure that seems to be a more suitable metric to reflect the hardness of the top- ranking problem:
which will be termed the generalized separation measure. Informally, is a reasonably tight upper bound on certain normalized KL divergence metric. With this metric in place, we derive another lower bound as follows.
Fix , and let . Consider any preference score vector , and let denote its generalized separation. If
then there exists another preference score vector with the same generalized separation and different top- items such that for any ranking scheme . Here, represents the probability of error in distinguishing these two vectors given .
See Appendix A.∎
The preceding sample complexity lower bound scales inversely proportionally to . To see why this generalized measure may be more suitable compared to the original separation metric, we single out three examples in Appendix B. Unfortunately, our current analyses do not yield a matching upper bound with respect to unless $\kappa$ is a constant. For instance, the analysis of the spectral method relies on the eigenvector perturbation bound (Theorem 8), where the spectral gap and matrix perturbation play a crucial role. However, the current results for controlling these quantities have explicit dependency on $\kappa$ (Negahban et al., 2017a). It is not clear whether we could incorporate the new measure to eliminate such dependency on $\kappa$. This calls for more refined analysis techniques, which we leave for future investigation.
Moreover, it is not obvious whether the spectral method alone or the regularized MLE alone can achieve the minimal sample complexity in the general regime. It is possible that one needs to first screen out those items with extremely high or low scores using methods like Borda count (Ammar and Shah, 2012), as advocated by Negahban et al. (2017a), Chen and Suh (2015), and Jang et al. (2016). All in all, finding tight upper bounds for general $\kappa$ remains an open question.
4 Discussion
This paper justifies the optimality of both the spectral method and the regularized MLE for top-$K$ rank aggregation for the fixed dynamic range case. Our theoretical studies are by no means exhaustive, and there are numerous directions that would be of interest for future investigations. We point out a few possibilities as follows.
General condition number $\kappa$. As mentioned before, our current theory is optimal in the presence of a fixed dynamic range with $\kappa = O(1)$. We have also made a first attempt in considering the large $\kappa$ regime. It is desirable to characterize the statistical and computational limits for more general $\kappa$.
Goodness-of-fit. Throughout this paper, we have assumed the BTL model captures the randomness underlying the data we collect. A practical question is whether the real data actually follows the BTL model. It would be interesting to investigate how to test the goodness-of-fit of this model.
Unregularized MLE. We have studied the optimality of the regularized MLE with the regularization parameter . Our analysis relies on the regularization term to obtain convergence of the gradient descent algorithm (see Lemma 11). It is natural to ask whether such a regularization term is necessary or not. This question remains open.
More general comparison graphs. So far we have focused on a tractable but somewhat restrictive comparison graph, namely, the Erdős–Rényi random graph. It would certainly be important to understand the performance of both methods under a broader family of comparison graphs, and to see which algorithms would enable optimal sample complexities under general sampling patterns.
Entrywise perturbation analysis for convex optimization. This paper provides the perturbation analysis for the regularized MLE using the leave-one-out trick as well as an inductive argument along the algorithmic updates. We expect this analysis framework to carry over to a much broader family of convex optimization problems, which may in turn offer a powerful tool for showing the stability of optimization procedures in an entrywise fashion.
5 Analysis for the spectral method
5.1 Preliminaries
Here, we gather some preliminary facts about reversible Markov chains as well as the Erdős–Rényi random graph.
The first important result concerns the eigenvector perturbation for probability transition matrices, which can be treated as the analogue of the celebrated Davis-Kahan theorem (Davis and Kahan, 1970). Due to its potential importance for other problems, we promote it to a theorem as follows.
Theorem 8 (Eigenvector perturbation).
Suppose that , , and are probability transition matrices with stationary distributions , , , respectively. Also, assume that represents a reversible Markov chain. When , it holds that
See Appendix C.1. ∎
Several remarks regarding Theorem 8 are in order. First, in contrast to standard perturbation results like Davis-Kahan’s theorem, our theorem involves three matrices in total, where , , and can all be arbitrary. For example, one may choose to be the population transition matrix, and and as two finite-sample versions associated with . Second, we only impose reversibility on , whereas and need not induce reversible Markov Chains. Third, Theorem 8 allows one to derive the estimation error in Negahban et al. (2017a) directly without resorting to the power method; in fact, our estimation error bound improves upon Negahban et al. (2017a) by some logarithmic factor.
See Appendix C.2. ∎
Notably, Theorem 9 matches the minimax lower bound derived in (Negahban et al., 2017a, Theorem 3). As far as we know, this is the first result that demonstrates the orderwise optimality of the spectral method when measured by the $\ell_2$ loss.
The next result is concerned with the concentration of the vertex degrees in an Erdős–Rényi random graph.
Lemma 1 (Degree concentration).
Suppose that . Let be the degree of node , and . If for some sufficiently large constant , then the following event
The proof follows from the standard Chernoff bound and is hence omitted. ∎
Since is chosen to be for some constant , we have, by Lemma 1, that the maximum vertex degree obeys with high probability.
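Degree concentration in an Erdős–Rényi graph is easy to observe empirically. The sketch below samples one graph and checks whether all vertex degrees fall in the window $[np/2, 3np/2]$. This window is an illustrative assumption made for the example and is not taken from the lemma statement above; the values `n = 400` and `p = 0.3` are likewise arbitrary.

```python
import random

def er_degrees(n, p, seed=0):
    """Vertex degrees of one sample from the Erdos-Renyi graph G(n, p)."""
    rng = random.Random(seed)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # edge (i, j) is present
                deg[i] += 1
                deg[j] += 1
    return deg

n, p = 400, 0.3
deg = er_degrees(n, p)
d_min, d_max = min(deg), max(deg)
# Illustrative concentration window around the mean degree (about n * p).
within_window = n * p / 2 <= d_min and d_max <= 3 * n * p / 2
```

Each degree is a Binomial(n - 1, p) random variable, so a Chernoff bound combined with a union bound over the n vertices explains why the minimum and maximum degrees stay close to np once np is a large multiple of log n.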
5.2 Proof outline of Theorem 5
In this subsection, we outline the proof of Theorem 5.
Recall that and are the stationary distributions associated with and , respectively. This gives
For each , one can decompose
where (resp. ) denotes the -th column of (resp. ). Then it boils down to controlling , and .
Since is deterministic while is random, we can easily control using Hoeffding’s inequality. The bound is the following.
With probability exceeding , one has
See Appendix C.3. ∎
Next, we show the term behaves as a contraction of .
With probability exceeding , there exists some constant such that for all ,
See Appendix C.4. ∎
The statistical dependency between and introduces difficulty in obtaining a sharp estimate of the third term . Nevertheless, the leave-one-out technique helps us decouple the dependency and obtain effective control of this term. The key component of the analysis is the introduction of a new probability transition matrix , which is a leave-one-out version of the original matrix . More precisely, replaces all of the transition probabilities involving the -th item with their expected values (unconditional on ); that is, for any ,
with . For any