Phase Transitions in Approximate Ranking

11/30/2017 ∙ by Chao Gao, et al. ∙ The University of Chicago

We study the problem of approximate ranking from observations of pairwise interactions. The goal is to estimate the underlying ranks of n objects from data through interactions of comparison or collaboration. Under a general framework of approximate ranking models, we characterize the exact optimal statistical error rates of estimating the underlying ranks. We discover important phase transition boundaries of the optimal error rates. Depending on the value of the signal-to-noise ratio (SNR) parameter, the optimal rate, as a function of SNR, is either trivial, polynomial, exponential or zero. The four corresponding regimes thus have completely different error behaviors. To the best of our knowledge, this phenomenon, especially the phase transition between the polynomial and the exponential rates, has not been discovered before.

1 Introduction

Given pairwise data, we study in this paper the recovery of the underlying ranks of the objects. The observation for a pair of objects i and j can be interpreted as the outcome of an interaction between them. For example, in sports, it can be the match result of a game between team i and team j. In a coauthorship network, it can be the number of scientific papers jointly written by author i and author j. We consider a very general approximate ranking model in this paper. It imposes the mean structure

Here, each object is assigned a rank. The interaction outcome is determined only by the latent positions of objects i and j, and is thus a noisy measurement of the corresponding mean. The goal of approximate ranking is to recover the underlying rank of each object.

In the literature, the problem of exact ranking is a well-studied topic, especially in settings of pairwise comparison with Bernoulli outcomes. Exact ranking assumes that the underlying rank vector is a permutation, so that estimating it is equivalent to sorting the objects, which gives such a problem the alternative name “noisy sorting” [3]. We refer the readers to [14, 16, 15, 12] and references therein for recent developments in this area.

In contrast, this paper studies the approximate ranking problem. We do not impose the constraint that the rank vector must be a permutation. More generally, we allow any rank vector that satisfies

for each entry, plus some moment conditions. This allows possible ties in the ranks, and the ranks of the objects do not necessarily start from 1 or end at n. A rank should instead be interpreted as a discrete latent position of the object. Such an approximate ranking setting is more natural for many applications. For example, it is conceivable in certain situations that a subset of objects behaves very similarly through pairwise interactions; in such a scenario, we can allow the same rank value for all objects in the group. Moreover, the ranks in the approximate ranking setting not only reflect the order of the objects, but also carry information about their relative differences through the interpretation of latent positions. These features and advantages distinguish the approximate ranking problem from the exact ranking problem studied in the literature.

The main contribution of the paper is the exact characterization of the optimal statistical error of the approximate ranking problem. Given an estimator of the ranks, we measure the error through a loss function. With the signal parameter and the noise level defined later in (3) and (2), we show that the optimal rate is a function of the signal-to-noise ratio parameter. Our results are summarized in Figure 1.

Figure 1: Optimal error rate, with respect to the loss, as a function of the SNR.

According to the plot in Figure 1, the optimal error exhibits interesting and delicate phase transition phenomena. Depending on the value of the SNR, the optimal rate falls into four different regimes. In the first regime, where the SNR is below the first threshold, the rate trivially takes the largest possible order of the loss. Next, once the SNR exceeds this threshold, the optimal rate starts to decrease polynomially fast. After the SNR passes a second threshold, the optimal error accelerates to an exponential rate. In the final regime, we achieve exact recovery with high probability. The dashed curve in the final regime is the error in expectation, which continues to decrease at an exponential rate. Besides this loss function, optimal rates and similar phase transition boundaries are also derived for the other losses considered in the paper.

The phase transition between the polynomial rate and the exponential rate is remarkable. To the best of our knowledge, this is a new phenomenon first discovered in the approximate ranking problem. Mathematically speaking, the polynomial rate is driven by an entropy calculation argument: at low SNR, estimating the ranks is like estimating a continuous parameter. In comparison, at high SNR the discrete nature of the ranks comes into effect, and there is sufficient information to distinguish each rank from its neighboring values, which leads to the exponential rate.

The paper is organized as follows. In Section 2, we introduce the approximate ranking model, and then present the main results on optimal rates and phase transitions. In Section 3, we consider two special cases of the approximate ranking model, and derive optimal procedures and adaptive algorithms that can achieve the optimal rates. We then discuss a few topics related to our paper in Section 4. All proofs are given in Section 5.

We close this section by introducing notation that will be used later. For two real numbers, we use the standard maximum and minimum notation. For an integer, the bracket notation denotes the set of the first that many positive integers. Given a set, we denote its cardinality and the associated indicator function as usual. We use generic probability and expectation symbols whose distribution is determined from the context. For two positive sequences, we use the asymptotic inequality notation to mean that one sequence is bounded by a constant multiple of the other, with the constant independent of n. Finally, for two probability measures $P$ and $Q$, the Kullback–Leibler divergence is defined as $D(P\|Q)=\int \log\frac{dP}{dQ}\,dP$.

2 Main Results

2.1 The Approximate Ranking Model

Consider n objects with underlying ranks. We observe pairwise data that follow the generating process

(1)

In other words, the outcome is a noisy version of a mean that depends solely on the ranks of the two objects i and j. In this paper, we consider noise variables with sub-Gaussian tails. In particular, we assume that for any pair,

(2)

The goal of this paper is to recover the underlying ranks from the observations . We consider the following space of ranks

where the number of possible positions satisfies the displayed constraint. The flexibility of this space allows ties. To be more precise, a rank vector should be interpreted as a collection of discrete latent positions of the objects. Therefore, we refer to the problem as approximate ranking. Three loss functions are considered in this paper. We define

The first loss function is the normalized Hamming distance between the estimated and the true ranks. It measures the proportion of objects that are given incorrect ranks. Compared with it, the other two loss functions also measure the magnitude of the error in each estimated rank. It is easy to see that, for any pair of rank vectors,
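
To make the setting concrete, here is a minimal simulation sketch, assuming a hypothetical mean function, Gaussian (hence sub-Gaussian) noise, and that the three losses are the normalized Hamming, absolute, and squared deviations between estimated and true ranks; none of these particular choices are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6
R_true = np.array([1, 1, 3, 3, 5, 6])   # ties are allowed

def mean_fn(a, b):
    # Hypothetical mean function depending only on the two ranks;
    # the paper only requires a general signal condition such as (3).
    return 0.5 * (a + b)

# Observations: mean determined by the ranks, plus Gaussian noise, as in (1)-(2).
mean = np.array([[mean_fn(R_true[i], R_true[j]) for j in range(n)] for i in range(n)])
Y = mean + rng.normal(scale=1.0, size=(n, n))

def losses(R_hat, R):
    """Normalized Hamming, absolute, and squared losses (assumed forms)."""
    hamming = np.mean(R_hat != R)
    l1 = np.mean(np.abs(R_hat - R))
    l2 = np.mean((R_hat - R) ** 2)
    return hamming, l1, l2

print(losses(np.array([1, 2, 3, 3, 5, 6]), R_true))   # one incorrect rank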

Our model is quite general. It characterizes the pairwise relation between two objects through their ranks. Pairwise comparison models are popular in the literature; in such a setting, the mean of the observation is an increasing function of the difference between the two ranks. We are also interested in the pairwise collaboration setting, where a larger mean is implied by larger values of either or both ranks. Without specifying a particular setting, we impose the following general assumption: there exists a signal level such that for any pair of rank values,

(3)

Later in Section 3, various examples will be given to satisfy this condition.

2.2 Minimax Rates

With the observations and knowledge of the model, we consider a least-squares estimator

(4)

This estimator may not be computationally efficient, and it depends on the model parameters, but it serves as an important benchmark for approximate ranking. Adaptive procedures with unknown model parameters will be discussed in Section 3.
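
For intuition only, here is a brute-force sketch of such a least-squares search; it enumerates all rank assignments, so it is exponential in n and only illustrates the definition, and mean_fn and rank_values stand in for the unspecified model ingredients (any additional constraints from the rank space are ignored).

```python
import itertools
import numpy as np

def least_squares_ranks(Y, mean_fn, rank_values):
    """Exhaustive least-squares search in the spirit of (4): among all rank
    assignments, pick the one minimizing the squared residuals.  This is a
    conceptual benchmark only, not a practical algorithm."""
    n = Y.shape[0]
    best, best_err = None, np.inf
    for R in itertools.product(rank_values, repeat=n):
        fitted = np.array([[mean_fn(R[i], R[j]) for j in range(n)] for i in range(n)])
        err = np.sum((Y - fitted) ** 2)
        if err < best_err:
            best, best_err = np.array(R), err
    return best
```

With the toy data from the previous sketch, `least_squares_ranks(Y, mean_fn, range(1, n + 1))` returns an estimated rank vector, though only for very small n.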

With probability and expectation taken under the distribution of (1), the performance of the least-squares estimator is characterized by the following theorem.

Theorem 2.1.

Under the conditions (2) and (3), we have

and

where denotes a vanishing sequence as .

Theorem 2.1 reveals an important signal-to-noise ratio quantity. When the SNR is large, all three loss functions converge to zero exponentially fast. In comparison, when the SNR is small, the three loss functions have polynomial rates, each capped at the order of its maximal value.

Two of the loss functions involve a logarithmic factor in the polynomial regime. This factor can be removed with an extra assumption on the model. Given a constant satisfying the stated bound, we assume that for any pair of rank values,

(5)

The following theorem gives this improvement.

Theorem 2.2.

Under the conditions (2), (3) and (5), we have

and

where denotes a vanishing sequence as .

The rates given by Theorem 2.2 are sharp, and they cannot be further improved. We give matching lower bounds in the next theorem.

Theorem 2.3.

There exists a distribution of (1) that satisfies (2), (3) and (5), such that

and

where denotes a vanishing sequence as .

2.3 Exact Recovery and Phase Transitions

According to Theorem 2.1 and Theorem 2.2, the convergence rates for all three loss functions are exponential when the SNR is large. Therefore, once the SNR is sufficiently large, the convergence rate becomes smaller than the smallest positive value that the losses can take. Since the loss functions that we consider do not take any value strictly between zero and this smallest positive value, such a rate is expected to imply exact recovery of the underlying ranks. This intuition is made rigorous by the following theorem.

Theorem 2.4.

Under the conditions (2) and (3), if we further assume that , then the LSE (4) satisfies

as .

Moreover, the next result shows that the condition in Theorem 2.4 is necessary for exact recovery.

Theorem 2.5.

Assume . There exists a distribution of (1) that satisfies (2), (3) and (5), such that

The results in Theorem 2.2, Theorem 2.3, Theorem 2.4 and Theorem 2.5 together give us a very good picture of the optimal error behavior. Interesting phase transitions are revealed for the approximate ranking problem. Depending on the signal-to-noise ratio , the optimal error rates have different behaviors. A graphical illustration for the loss function is given by Figure 1. We summarize the phase transitions in the following table.

In order of increasing SNR, the four regimes are: trivial, non-trivial, consistent, and strongly consistent.

We call a rate trivial if it is of the same order as the maximal value of the loss; the maxima of the three loss functions are of different orders. A rate is consistent if the loss converges to zero, and we refer to exact recovery as being strongly consistent. Then, for each loss function, there are four different regimes: the trivial regime, the non-trivial regime, the consistent regime and the strongly consistent regime. The only exception is the normalized Hamming loss, whose non-trivial regime coincides with its consistent regime because its maximal value is of constant order. For all three loss functions, the rates in the consistent regime are exponential and the rates in the non-trivial regime are polynomial.

3 Adaptive Procedures

Our paper considers a very general framework for the approximate ranking model. Though we are able to characterize the exact minimax rates of the problem in the various regimes of the signal-to-noise ratio, the least-squares procedure (4) that achieves statistical optimality is usually infeasible in practice. In fact, optimization problems similar to (4) have been considered in the literature on the graph matching/isomorphism problem [19], which is believed to be very unlikely to admit a polynomial-time algorithm [10].

In this section, we consider some special cases of the general model, and then discuss possible adaptive procedures that take advantage of the additional model structures to achieve the optimal statistical rates. Inspired by the latent space model in network analysis [9], we consider examples of the form

(6)

Here, the latent ability parameter of each position appears. In particular, we will analyze a differential comparison model and an additive collaboration model, both of which are of the form (6). Interestingly, we will show that for these two models, the approximate ranking problem reduces to a feature matching problem considered by [6], for which efficient algorithms are available. When the latent parameters are unknown but admit a linear relation with respect to the underlying ranks, we show that a profile least-squares procedure, which can be solved by iterative feature matching, adaptively achieves the optimal statistical rates.

3.1 Differential Comparison

We first consider a differential comparison model, which imposes the structure

(7)

Therefore, the mean of the observation is given by the difference between the abilities of the two objects. We propose the following signal conditions. For any pair of positions, we assume

(8)
(9)

It is easy to check that the general condition (3) is implied by (8) and (9). We remark that the condition (9) is consistent with the definition of the rank space.

In the current setting, the ability parameters are given, but the correspondence between objects and positions is determined by the unknown ranks. Our general strategy for finding the ranks is based on the idea of feature matching [6]. We first define the score of each object by

(10)

Then, the estimator of ranks is defined by

(11)

This optimization can be efficiently solved by feature matching algorithms discussed in [6]. Its statistical performance is given by the following theorem.
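
To illustrate why this reduction is computationally friendly, here is a hedged sketch of rank estimation by feature matching, assuming (purely for illustration) that the score of an object is its row average and that each object is matched to the position whose expected score is closest; the actual score (10) and objective (11) in the paper may differ.

```python
import numpy as np

def feature_matching_ranks(scores, expected_scores, positions):
    """Assign each object the latent position whose expected score is closest
    to the object's observed score.  When the expected scores are monotone in
    the position, this reduces to a sorting step, which is why the procedure
    is computationally efficient."""
    diffs = (scores[:, None] - expected_scores[None, :]) ** 2
    return positions[np.argmin(diffs, axis=1)]

# Toy usage under the differential comparison model, with hypothetical abilities.
rng = np.random.default_rng(1)
positions = np.arange(1, 7)                      # candidate latent positions
theta = 0.8 * positions                          # hypothetical ability parameters
R_true = np.array([1, 1, 3, 4, 6, 6])            # chosen so the average ability equals theta.mean()
ability = theta[R_true - 1]
Y = ability[:, None] - ability[None, :] + rng.normal(scale=0.3, size=(6, 6))
scores = Y.mean(axis=1)                          # row averages as an assumed score
expected = theta - theta.mean()                  # expected row average at each position
print(feature_matching_ranks(scores, expected, positions))
```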

Theorem 3.1.

Under the conditions (2), (8) and (9), we have

and

where denotes a vanishing sequence as . Moreover, when , we have with probability .

It is easy to check that the distribution constructed to prove the lower bound results in Theorem 2.3 satisfies the conditions (8) and (9). This implies that the rates achieved by the computationally efficient estimator in Theorem 3.1 are optimal under the differential comparison model. It is interesting that we do not have any logarithmic factor in the convergence rates, even without any condition similar to (5). This is a consequence of the special structure of the differential comparison model (7).

3.2 Additive Collaboration

Since our framework is quite general, we can also consider an additive collaboration model. It imposes the structure

(12)

Compared with (7), the collaboration model assumes that the mean is increasing with respect to the abilities of both objects. Despite the difference in interpretation, the two models have very similar mathematical structures. We adopt the signal condition (8) here, but we do not need to assume the second condition (9): with the additive structure, the condition (8) alone implies (3). We also use the feature-matching estimator (11), with the score of each object defined as

(13)

The performance of the estimator is given by the following theorem.

Theorem 3.2.

Under the conditions (2) and (8), we have

and

where denotes a vanishing sequence as . Moreover, when , we have with probability .

We obtain the same rates as in Theorem 3.1 for the differential comparison model. Note that the argument in the lower bound proof of Theorem 2.3 can easily be adapted to the additive collaboration model in this section. This implies that the rates in Theorem 3.2 are all optimal.

3.3 Applications in a Parametric Model

In this section, we consider both the comparison and the collaboration models with ability parameters that are a linear function of the latent position, with unknown intercept and slope. It is easy to check that both conditions (8) and (9) are then satisfied.

Since the intercept and the slope are unknown, we cannot directly use the feature matching estimator (11). Instead, we propose the following profile least-squares objective,

(14)

Then, the estimator is found by minimizing this objective. Note that we use a linear model as a surrogate for the ability parameters in the definition of (14). The feature matching procedure and the linear model fit are carried out simultaneously. An equivalent way of writing the objective is

where

(15)

For the differential comparison model, we use the score

For the additive collaboration model, we use

Therefore, the estimator is fully data-driven.

Optimizing the objective can be done in an iterative fashion. At each iteration, one first optimizes over the ranks; then, one updates the linear parameters using the least-squares formula (15). In other words, feature matching and linear regression are run in turn iteratively; a minimal sketch of this alternation is given below. In this section, our focus is on the statistical properties of the global optimizer of the objective. The convergence analysis of the iterative algorithm will be studied in a much more general framework of profile least-squares optimization in the future.
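
Here is a minimal sketch of that alternation, assuming a one-dimensional score per object and abilities that are linear in the latent position; the specific score, objective (14), and update (15) in the paper may differ, so this only illustrates the iterative scheme.

```python
import numpy as np

def profile_least_squares(scores, positions, n_iters=20):
    """Alternate between (i) refitting a line (intercept a, slope b) by
    ordinary least squares given the current ranks, and (ii) feature matching:
    assigning each object the position whose fitted ability a + b * position
    is closest to its score.  Assumes one candidate position per object for
    the initialization step."""
    # Initialize by sorting: the k-th smallest score gets the k-th smallest position.
    R = np.sort(positions)[np.argsort(np.argsort(scores))]
    for _ in range(n_iters):
        X = np.column_stack([np.ones(len(R)), R.astype(float)])
        a, b = np.linalg.lstsq(X, scores, rcond=None)[0]     # linear regression step
        fitted = a + b * positions
        R = positions[np.argmin((scores[:, None] - fitted[None, :]) ** 2, axis=1)]
    return R, (a, b)

# Toy usage with hypothetical scores that are roughly linear in the true ranks.
rng = np.random.default_rng(2)
R_true = np.array([1, 1, 3, 4, 6, 6])
scores = 0.5 + 0.8 * R_true + rng.normal(scale=0.2, size=6)
print(profile_least_squares(scores, np.arange(1, 7)))
```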

To study the statistical error of the profile least-squares estimator, we consider the following space of approximate ranks. Define

where . The set is a subset of with an additional constraint on . This extra constraint does not make the problem easier, and the same minimax lower bound results also hold for the set with a simple modification of the proof of Theorem 2.3.

Theorem 3.3.

Consider the profile least-squares estimator. Under the condition (2), for the differential comparison model or the additive collaboration model with linear ability parameters, we have

and

where denotes a vanishing sequence as . Moreover, when , we have with probability .

4 Discussion

4.1 Comparison with Community Detection

Our approximate ranking model shares some similarity with the stochastic block model that is widely studied in the problem of community detection. The relation between ranking and clustering was previously discussed in the paper [4]. Our discussion here takes a different angle. The goal of community detection is to partition the objects into clusters. In the setting of the stochastic block model, the observation can be written as

(16)

where the mean is the expectation of the observation and characterizes the interaction level between the two objects; its value is determined by their clustering labels. Note that the form (16) is exactly the same as (1). The literature usually studies stochastic block models with Bernoulli observations; however, to make the comparison more intuitive, we consider the same sub-Gaussian setting as in (2).
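
For a side-by-side view of the two models, here is a small sketch that generates data of the form (16) with Gaussian (hence sub-Gaussian) noise; the number of clusters, cluster sizes, and connectivity values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy stochastic block model of the form (16): the mean of Y[i, j] depends
# only on the clustering labels z[i] and z[j], which carry no ordering.
k = 3
z = np.repeat(np.arange(k), 4)            # 12 objects in equal-sized clusters
within, between = 0.8, 0.2                # hypothetical connectivity levels
B = np.full((k, k), between)
np.fill_diagonal(B, within)

mean = B[z[:, None], z[None, :]]          # interaction level between i and j
Y = mean + rng.normal(scale=0.5, size=mean.shape)
```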

Just like in the approximate ranking model, the goal here is to estimate the clustering labels . In the most basic setting, the class is considered as a subset of that consists of clustering labels that induce roughly equal-sized clusters. Recently, the minimax rate of estimating was derived by [20]. They impose the condition that

(17)

The numbers and represent the within-cluster and the between-cluster interaction levels. In fact, we can write down an alternative condition in the style of (3). Assume that for any ,

(18)

when the two clustering labels differ. The loss function here is defined in the same way as the Hamming loss, but up to a permutation of clustering labels, so that it properly measures the difference between two clustering structures. One can check that (17) implies (18) with an appropriate signal level. With techniques similar to those used in this paper, it can be shown that there is an estimator that achieves

This is essentially the same result as in [20], with the corresponding quantity replaced by the variance parameter in their Bernoulli setting. Moreover, it shares the same form of exponential rate as Theorem 2.1 in view of the relation between the two signal-to-noise quantities.

On the other hand, we also point out some major differences between approximate ranking and community detection. First of all, the ranking labels are ordered numbers, while the clustering labels have no ordering. Therefore, for clustering one can only measure whether the estimate of each label is right or wrong through an indicator function, whereas for ranking one can not only measure whether each rank is correctly estimated, but also measure the deviation in terms of the absolute or squared distance between the estimated and the true ranks. Secondly, the approximate ranking model naturally has a large number of latent positions, with each rank taking up to n possible values, while for the stochastic block model there are only as many latent positions as clusters, and the number of clusters is usually assumed to be a constant or to grow very slowly with n in the literature. These two differences lead to the interesting phase transitions for the approximate ranking problem in our paper, a new phenomenon that did not appear in community detection or in other problems before.

4.2 Results for Poisson Distributions

In this section, we consider a natural Poisson model for discrete observations. We assume the observations are independent Poisson random variables whose means depend only on the ranks of the two objects. Note that this mean models both the mean and the variance of the observation. Thus, it is more appropriate to consider the following condition instead of (3). We assume there exists a signal level such that for any pair of rank values,

(19)

Compared with the condition (3), the condition for the Poisson model involves square roots of the means. The square-root transformation dates back to the famous variance-stabilizing transformation [1].
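
As a quick reminder of why the square-root scale is natural here (a standard delta-method fact, not a statement taken from the paper's proofs): if $Y \sim \mathrm{Poisson}(\lambda)$ with $\lambda$ large, then

$\operatorname{Var}\bigl(\sqrt{Y}\bigr) \approx \Bigl(\tfrac{1}{2\sqrt{\lambda}}\Bigr)^{2}\operatorname{Var}(Y) = \frac{\lambda}{4\lambda} = \frac{1}{4},$

so the fluctuations of $\sqrt{Y}$ are of constant order regardless of $\lambda$, and differences of square roots of means give a natural scale on which to measure signal strength.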

Instead of using the least-squares estimator (4), we propose the maximum likelihood estimator
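
For concreteness, here is a brute-force sketch of such a maximum-likelihood search, replacing the squared loss with the Poisson log-likelihood; the mean function, the candidate rank values, and the exhaustive search are illustrative assumptions rather than the paper's actual procedure.

```python
import itertools
import numpy as np

def poisson_mle_ranks(Y, mean_fn, rank_values):
    """Exhaustive maximum-likelihood search for the Poisson model: maximize
    sum_{i,j} [Y_ij * log(lambda_ij) - lambda_ij] with lambda_ij =
    mean_fn(R_i, R_j) > 0 (the log(Y_ij!) term does not depend on the ranks
    and is dropped).  Exponential in n, so a conceptual benchmark only."""
    n = Y.shape[0]
    best, best_ll = None, -np.inf
    for R in itertools.product(rank_values, repeat=n):
        lam = np.array([[mean_fn(R[i], R[j]) for j in range(n)] for i in range(n)])
        ll = np.sum(Y * np.log(lam) - lam)
        if ll > best_ll:
            best, best_ll = np.array(R), ll
    return best
```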

Theorem 4.1.

Under the condition (19), we have

and

where denotes a vanishing sequence as .

Again, Theorem 4.1 exhibits different behaviors of the ranking errors depending on the regime of the signal-to-noise ratio, which plays the same role here as that in Theorem 2.1.

We also give a complementary lower bound.

Theorem 4.2.

There exists a Poisson distribution that satisfies (19), such that

and

where denotes a vanishing sequence as .

The lower bound results imply that the rates we obtain in Theorem 4.1 are optimal up to a logarithmic factor in the polynomial regime. Unlike for the model (1), it is not clear whether such a logarithmic factor can be removed from the upper bounds under some extra condition for the Poisson model.

To close this section, we give necessary and sufficient conditions for exact recovery in the following theorem.

Theorem 4.3.

Under the condition (19), if we further assume that , then the MLE satisfies