# Two-Hop Walks Indicate PageRank Order

This paper shows that pairwise PageRank orders emerge from two-hop walks. The main tool is a specially designed sign-mirror function together with a parameter curve, whose low-order derivative information implies pairwise PageRank orders with high probability. We study the pairwise correct rate by placing the Google matrix G in a probabilistic framework, where G may be equipped with different random ensembles for model-generated or real-world networks with sparse, small-world, or scale-free features; the proof mixes mathematical and numerical evidence. We believe that the underlying spectral distribution of such networks is responsible for the high pairwise correct rate. Moreover, the perspective of this paper naturally leads to an O(1) algorithm for any single pairwise PageRank comparison, assuming that both A = G − I_n, where I_n denotes the identity matrix of order n, and A² are at hand (e.g., constructed offline in an incremental manner). On this basis it is easy to extract the top-k list in O(kn), making it possible for the PageRank algorithm to deal with super-large-scale datasets in real time.


## 1 Introduction

The PageRank algorithm and related variants have attracted much attention in many applications of practical interest head1 ; head3 ; head4 , especially known for their key role in Google's search engine. These principal-eigenvector-based algorithms (the eigenvector corresponding to the largest eigenvalue) share the same spirit and have been rediscovered again and again by different communities since the 1950s. PageRank-type algorithms have appeared in the literature on bibliometrics bib1 ; bib2 ; bib3 , sociometry soc1 ; soc2 , econometrics eco1 , and web link analysis page , etc. Two excellent historical reviews of this technique can be found in survey1 ; survey2 .

Regardless of the various motivations, this family of algorithms stands on similar observations: an entity (person, page, node, etc.) is important if it is pointed to by other important entities, so the resulting importance score should be computed in a recursive manner. More precisely, given an n×n matrix G with its element g_ij encoding some form of endorsement sent from the j-th entity to the i-th entity (both G and the transpose of G are alternately used in the literature, which introduces no essential difference; here the former is adopted for convenience), the importance score vector r is defined as the solution of the linear system:

 Gr = r. (1)

However, some constraints are required on G such that there exists a unique and nonnegative solution of (1). In the PageRank algorithm, G is constructed by page ; page-total1

 G = α(Ĝ + u dᵀ) + (1 − α) v 1ᵀ, (2)

where Ĝ is the column-normalized adjacency matrix of the web graph, i.e., the (i, j) element of Ĝ is one divided by the outdegree of page j if there is a link from page j to page i (zero otherwise), 1 is the all-ones vector, d is the indicator vector of dangling nodes (those having no outgoing edges), u and v are nonnegative and have unit norm (known as the dangling-node and personalization vectors, respectively; by default both are uniform), and α ∈ (0, 1) is the damping factor (which had better not be too close to 1; α = 0.85 is the usual default) for avoiding the "sink effect" caused by modules with in- but no out-links page-total2 . Then it is easy to verify that G constructed as above is a Markov matrix with each column summing to one, and has an unrepeated largest eigenvalue 1 corresponding to the left eigenvector 1 (the modulus of the second largest eigenvalue of G is upper-bounded by α 2nd ). By the Perron-Frobenius theorem eco1 , this means that the (right) positive principal eigenvector of G is the unique PageRank vector in (1). Note that such a solution is only defined up to a positive scale, which introduces no harm in the ranking context.
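To make the construction (2) concrete, the following sketch builds the Google matrix of a toy 4-node graph and recovers the PageRank vector by power iteration. The graph itself, the value α = 0.85, and the uniform choices u = v = (1/n)·1 are illustrative assumptions only.

```python
# Sketch of Eq. (2): G = alpha*(Ghat + u d^T) + (1 - alpha)*v 1^T.
# The toy graph, alpha = 0.85, and uniform u = v are illustrative assumptions.

def google_matrix(out_links, n, alpha=0.85):
    """Column-stochastic Google matrix of an n-node graph."""
    u = [1.0 / n] * n                        # dangling-node vector
    v = [1.0 / n] * n                        # personalization vector
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = out_links.get(j, [])
        for i in range(n):
            if targets:                      # column j of Ghat: 1/outdeg(j) per link
                ghat = 1.0 / len(targets) if i in targets else 0.0
            else:                            # dangling column: replaced by u
                ghat = u[i]
            G[i][j] = alpha * ghat + (1.0 - alpha) * v[i]
    return G

def pagerank(G, iters=200):
    """Power iteration r <- G r; converges to the principal eigenvector."""
    n = len(G)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [sum(G[i][j] * r[j] for j in range(n)) for i in range(n)]
        s = sum(r)
        r = [x / s for x in r]
    return r

# 0 -> {1, 2}, 1 -> {2}, 2 -> {0}; node 3 is dangling
G = google_matrix({0: [1, 2], 1: [2], 2: [0]}, 4)
r = pagerank(G)
```

Each column of the resulting G sums to one, and the dangling node, which only receives the teleportation mass, ends up with the lowest score.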

### 1.2 Outline of Our Algorithm

In this paper, we always assume that G is a nonnegative real matrix with spectral radius 1, and that 1 is an unrepeated eigenvalue. We will use r to denote an arbitrary nonnegative principal eigenvector of G, although the PageRank vector may be defined up to a positive scale. Let A = G − I_n, where I_n is the unit matrix of order n. The main tool used in this paper is a specially designed curve F(A, w, t), t ∈ [0, ∞), where A, w, and t are three parameters. Here, we often drop the dependency of F on A and w, writing F(t), to make notations less cluttered. Throughout the paper, we will indicate vectors and matrices using bold-faced letters.

We expect F to have the following properties: (a) For any positive w, the curve F(t) converges to the positive principal eigenvector of A (thus converges to r) as t → ∞. Let Δij(t) = F_i(t) − F_j(t); the task of comparing the PageRank scores of the i-th and j-th nodes is then reduced to determining the sign of Δij(∞). (b) Denote by F^(m)(t) the m-th order derivative of F w.r.t. t, which had better be a simple function of w and A such that evaluating it at t = 0 causes relatively low computational cost. (c) Around the neighbourhood of t = 0, the shape of F on the (i, j) plane (spanned by the i-th and j-th axes in Rⁿ) can be flexibly controlled by w and t.

With a carefully chosen w, it is possible to find a scalar function φij(A, w), simplified as φij, such that the probability πij that φij shares the sign of Δij(∞) is sufficiently close to one. We call φij the sign-mirror function for Δij(∞), since it reflects the sign of Δij(∞) in the probabilistic sense shown above, although φij itself only contains the local information of F around t = 0. Furthermore, to avoid unnecessary computational cost, we also expect that small derivative orders m can do this job.

Section 2 provides a curve equipped with the above properties with m ≤ 2. There we also construct the corresponding sign-mirror function φij and formulate πij as a function of an angle variable dependent on the eigenvalue distribution of A. In the same section, we discuss some extensions of the algorithm. Section 3 checks the numerical properties of this angle variable, then verifies that it stays small for various types of model-generated or real-world graphs (sparse, scale-free, small-world, etc.). This means that with high probability the proposed algorithm succeeds in extracting the true pairwise PageRank order for those common types of graphs. It is then relatively straightforward to develop a top-k list extraction algorithm based on partial (not total) pairwise orders, which will be discussed in section 4.

Nevertheless, it will be helpful to roughly imagine how such a curve possibly looks. Fig. 1 plots four possible trajectories of F on the (i, j) plane. Intuitively, the curves plotted in Fig. 1(a) and (b) are unpredictable in the sense that we have no confidence to predict whether they will cross the line F_i = F_j at some t > 0 or not. On the contrary, the curves shown in Fig. 1(c) and (d) seem more revealing due to the following facts: with higher probability, those two curves will not cross the line again for t > 0, since both have been tangent to the line at t = 0, and they will locally move away from the line soon, since they have unequal accelerations along the two axes at t = 0. In fact, the imagined Fig. 1(c) and (d) do motivate us to construct an eligible sign-mirror function from a geometric view.

Finally, we point out that the algorithm of this paper is not only valid for the Google matrix defined in (2); it can even be applied to a non-Markov matrix G, as long as G meets the two conditions presented at the beginning of this subsection.

## 2 Model

Let i = √−1 be the imaginary unit and diag[·] denote a block diagonal matrix whose arguments are the square matrices at its diagonal blocks. Unless specially mentioned, G is defined as at the beginning of subsection 1.2. From a practical view, we also assume that G (thus A) is diagonalizable, since any matrix can be perturbed into a diagonalizable one with arbitrarily small perturbation. Thus, A is real and diagonalizable, and all the eigenvalues of A except the unrepeated zero eigenvalue have negative real parts.

### 2.1 Designing the Curve

Lemma 1. For any real and diagonalizable matrix A of order n, there is an invertible matrix P such that

 A = P · diag[λ_1, ⋯, λ_r, A_{r+1}, ⋯, A_{r+s}] · P⁻¹,  r + 2s = n,  0 ≤ r ≤ n, (3)

where

 P = [p_1, ⋯, p_r, p_{R,r+1}, p_{I,r+1}, ⋯, p_{R,r+s}, p_{I,r+s}]  (r real eigenvectors followed by s pairs of complex eigenvectors),
 A_{r+k} = ( λ_{R,r+k}  λ_{I,r+k} ; −λ_{I,r+k}  λ_{R,r+k} ),  k = 1, ⋯, s.

In the above equation, λ_1 ≥ ⋯ ≥ λ_r are the real eigenvalues of A sorted in descending order, corresponding to the real eigenvectors p_1, ⋯, p_r, and λ_{R,r+k} ± i λ_{I,r+k}, k = 1, ⋯, s, are the pairs of complex eigenvalues of A (sorted descendingly w.r.t. their real parts), corresponding to the pairs of complex eigenvectors p_{R,r+k} ± i p_{I,r+k}, respectively.

In this paper, there exists λ_1 = 0, and all the other λ_k's (k = 2, ⋯, r) as well as the λ_{R,r+k}'s (k = 1, ⋯, s) are negative. Moreover, we will use p_k to denote the k-th column of P in lemma 1 for convenience, i.e., p_{r+2k−1} = p_{R,r+k} and p_{r+2k} = p_{I,r+k} for k = 1, ⋯, s. Since p_1, ⋯, p_n are linearly independent, any w takes the form

 w = ∑_{k=1}^{n} w_k p_k. (4)

Let v_kᵀ denote the k-th row of P⁻¹, and define, for k = 1, ⋯, s,

 B_{r+k} = p_{r+2k−1} v_{r+2k−1}ᵀ + p_{r+2k} v_{r+2k}ᵀ,  C_{r+k} = p_{r+2k−1} v_{r+2k}ᵀ − p_{r+2k} v_{r+2k−1}ᵀ.

Then it is ready to construct the following curve with the desired properties given in subsection 1.2:

 F(A, w, t) = (∑_{k=1}^{r} e^{λ_k t} p_k v_kᵀ + ∑_{k=1}^{s} e^{λ_{R,r+k} t} [cos(λ_{I,r+k} t) B_{r+k} + sin(λ_{I,r+k} t) C_{r+k}]) w, (5)

where t ≥ 0 is the time parameter and w is the n-dimensional "shape adjusting" vector. Although the p_k and v_k appear in (5), it is not necessary to compute them anywhere in our algorithm, which will be clear in the sequel.

Lemma 2. There exist F(A, w, 0) = w and F(A, w, ∞) = w_1 p_1, where w_1 is the coefficient of the projection of w on p_1.
Proof. Noting that at t = 0 all the cosine factors equal one and all the sine factors vanish, F(A, w, 0) = (∑_{k=1}^{r} p_k v_kᵀ + ∑_{k=1}^{s} B_{r+k}) w = P P⁻¹ w = w, thus the first equality holds. Since λ_1 = 0, λ_k < 0 for k = 2, ⋯, r, and λ_{R,r+k} < 0 for k = 1, ⋯, s, every term in (5) except the first vanishes as t → ∞. Due to v_1ᵀ p_1 = 1 and v_1ᵀ p_k = 0 for k ≠ 1, this yields

 F(A, w, ∞) = p_1 v_1ᵀ ∑_{k=1}^{n} w_k p_k = w_1 p_1,

thus proving the second equality.

Clearly, w_1 ≠ 0 with probability 1, thus let us assume w_1 ≠ 0. In the sequel, we will also restrict w to be nonnegative, from which it is easy to see that w_1 p_1 is always nonnegative, regardless of p_1 being the nonpositive or nonnegative principal eigenvector of G. Based on the above analysis and lemma 2, we can write F(A, w, ∞) = w_1 p_1 ∝ r, which verifies property (a) presented in subsection 1.2. Thus, the task of comparing the PageRank scores for the pair of pages (i, j) is equivalent to determining the sign of Δij(∞) = F_i(A, w, ∞) − F_j(A, w, ∞).
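As a numerical illustration of lemma 2 (and of property (a)), the sketch below follows the curve by explicit Euler steps on x′ = Ax, an integration scheme chosen here purely for simplicity; the 4×4 column-stochastic G and the starting w are arbitrary illustrative choices. The normalized state settles at a vector x with Ax ≈ 0, i.e., at the PageRank direction.

```python
# Numerical check of lemma 2: for A = G - I and nonnegative w, the curve
# F(A, w, t) tends to the principal-eigenvector direction as t grows.
# Euler discretization and the toy G below are illustrative assumptions.

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def flow_limit(A, w, eta=0.5, steps=4000):
    """Iterate x <- x + eta*A*x (sum-normalized): a discrete walk along F."""
    x = list(w)
    for _ in range(steps):
        Ax = mat_vec(A, x)
        x = [xi + eta * yi for xi, yi in zip(x, Ax)]
        s = sum(x)
        x = [xi / s for xi in x]
    return x

# toy column-stochastic G (each column sums to one)
G = [[0.10, 0.40, 0.30, 0.25],
     [0.50, 0.10, 0.30, 0.25],
     [0.30, 0.10, 0.10, 0.25],
     [0.10, 0.40, 0.30, 0.25]]
n = 4
A = [[G[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
x = flow_limit(A, [1.0, 2.0, 0.5, 1.5])          # arbitrary positive w
residual = max(abs(y) for y in mat_vec(A, x))    # ~0 at the fixed point Gx = x
```

The residual measures how far the limit is from an exact fixed point Gx = x; for a positive G it shrinks to numerical precision.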

The next lemma shows that both the first- and second-order derivatives of F have a neat relation to A and w at t = 0, which coincides with the highly desired property (b) given in subsection 1.2.

Lemma 3. There exist F^(1)(A, w, 0) = Aw and F^(2)(A, w, 0) = A²w.
Proof. From (3), we have

 A = ∑_{k=1}^{r} λ_k p_k v_kᵀ + ∑_{k=1}^{s} [(λ_{R,r+k} p_{r+2k−1} − λ_{I,r+k} p_{r+2k}) v_{r+2k−1}ᵀ + (λ_{I,r+k} p_{r+2k−1} + λ_{R,r+k} p_{r+2k}) v_{r+2k}ᵀ] = ∑_{k=1}^{r} λ_k p_k v_kᵀ + ∑_{k=1}^{s} (λ_{R,r+k} B_{r+k} + λ_{I,r+k} C_{r+k}). (6)

Similarly, from the equality A² = P · diag[λ_1², ⋯, λ_r², A_{r+1}², ⋯, A_{r+s}²] · P⁻¹, a simple computation shows that

 A² = ∑_{k=1}^{r} λ_k² p_k v_kᵀ + ∑_{k=1}^{s} [(λ_{R,r+k}² − λ_{I,r+k}²) B_{r+k} + 2 λ_{R,r+k} λ_{I,r+k} C_{r+k}]. (7)

Based on the definition of F as in (5), a direct computation yields

 F^(1)(A, w, 0) = dF(A, w, t)/dt |_{t=0} = (∑_{k=1}^{r} λ_k p_k v_kᵀ + ∑_{k=1}^{s} (λ_{R,r+k} B_{r+k} + λ_{I,r+k} C_{r+k})) w = Aw  (by (6)),
 F^(2)(A, w, 0) = d²F(A, w, t)/dt² |_{t=0} = (∑_{k=1}^{r} λ_k² p_k v_kᵀ + ∑_{k=1}^{s} (λ_{R,r+k}² B_{r+k} + 2 λ_{R,r+k} λ_{I,r+k} C_{r+k} − λ_{I,r+k}² B_{r+k})) w = A²w  (by (7)),

thus proving the lemma.
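Lemma 3 can also be sanity-checked numerically: for real diagonalizable A, the expansion (5) coincides with the matrix-exponential curve e^{At} w, so central finite differences of e^{At} w at t = 0 should reproduce Aw and A²w. The matrix A, the vector w, the series truncation, and the step h below are all illustrative assumptions.

```python
# Finite-difference check of lemma 3: F(t) = e^{At} w (which matches the
# spectral expansion (5)) has F'(0) = Aw and F''(0) = A^2 w.
# A, w, the series truncation, and the step h are illustrative assumptions.

def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def expm_vec(A, w, t, terms=30):
    """Truncated Taylor series for e^{tA} w (adequate for tiny |t|*||A||)."""
    result, term = list(w), list(w)
    for k in range(1, terms):
        term = [t / k * y for y in mat_vec(A, term)]
        result = [r + y for r, y in zip(result, term)]
    return result

A = [[-0.9, 0.4, 0.3],
     [0.5, -0.9, 0.3],
     [0.4, 0.5, -0.6]]
w = [1.0, 2.0, 0.5]
h = 1e-4

fp, fm = expm_vec(A, w, h), expm_vec(A, w, -h)
d1 = [(p - m) / (2 * h) for p, m in zip(fp, fm)]                # ~ F'(0)
d2 = [(p - 2 * x + m) / h**2 for p, x, m in zip(fp, w, fm)]     # ~ F''(0)
Aw = mat_vec(A, w)
AAw = mat_vec(A, Aw)
```

Both difference quotients agree with Aw and A²w up to the expected discretization and rounding error.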

### 2.2 Designing the Sign-Mirror Function

Let (Aw)_i and (A²w)_i denote the i-th elements of the vectors Aw and A²w, respectively. In this subsection, we will focus on the key part of our eigenvector-computation-free algorithm: constructing the sign-mirror function φij for Δij(∞) (recall the notations defined in subsection 1.2). Obviously, the bigger πij is, the more confidently φij and Δij(∞) share the same sign. In such a manner, we say that the sign of Δij(∞), which indicates the PageRank score order for the pair of pages (i, j), is mirrored by the sign of φij. As mentioned before, Fig. 1 suggests an intuition for constructing the sign-mirror function as follows: let φij = F_i^(2)(A, w, 0) − F_j^(2)(A, w, 0), under the constraints F_i(A, w, 0) = F_j(A, w, 0) and F_i^(1)(A, w, 0) = F_j^(1)(A, w, 0). From lemma 3, the above equations can be rewritten into

 φij = (A²w)_i − (A²w)_j,  with  w_i = w_j,  w ≥ 0,  (Aw)_i = (Aw)_j, (8)

which possibly is the simplest form of φij to adopt in practice. Although other, more sophisticated candidates may be considered, φij constructed as above has worked well enough for our goal.

Note that there exist many choices of w meeting the constraints in (8). To reduce computational cost, in this paper we suggest restricting w to vectors composed of only three distinct values.

Let J be an index subset containing i and j such that ∑_{k∈J} (a_ik − a_jk) ≠ 0, where a_ik denotes the (i, k) element of A. Clearly, regarding A as a random matrix, the event that no such J exists has zero probability. In what follows we assume the existence of J.

Let h ∉ J be any index such that a_ih − a_jh has the opposite sign to that of ∑_{k∈J} (a_ik − a_jk) (the exceptional case where no such h exists will be discussed later). Then, let ζij = ∑_{k∉J∪{h}} (a_ik − a_jk) and define w by

 w_k = [−q (a_ih − a_jh) − ζij] / ∑_{k∈J} (a_ik − a_jk) ≜ z, ∀k ∈ J;  w_h = ε + max(0, −ζij / (a_ih − a_jh)) ≜ q;  w_k = 1 otherwise, (9)

where ε is an adjustable positive constant (a fixed value is used in our simulation). It is easy to verify that w constructed as above meets all the constraints in (8). Let b_ij be the (i, j) element of B = A². A simple simplification shows that φij with w as in (9) can be rewritten as:

 φij = z ∑_{k∈J} (b_ik − b_jk) + q (b_ih − b_jh) + ∑_{k∉J∪{h}} (b_ik − b_jk).

Specially, in the case of J = {i, j}, i.e., a_ii + a_ij − a_ji − a_jj ≠ 0, which corresponds to an almost sure event in practice, let us denote by sum_i(A) and sum_i(B) the sums of the i-th rows of A and B, respectively. In this case, φij takes a more computation-friendly form:

 φij = sum_i(B) − sum_j(B) + (z − 1)(b_ii + b_ij − b_ji − b_jj) + (q − 1)(b_ih − b_jh), (10)

where z and q are computed from (9) with J = {i, j}. Now, we conclude our pairwise PageRank ranking algorithm as follows:

 φij > 0 ⇒ r_i > r_j,  or  φij < 0 ⇒ r_i < r_j. (11)

The whole algorithm flow is depicted in Algorithm 1. As for the exceptional case that no index h exists, i.e., the differences a_ik − a_jk, k = 1, ⋯, n, are all positive (or all negative), which is an almost null event in practice, it is intuitive to claim r_i > r_j (or r_i < r_j, respectively) due to the PageRank principle.

Finally, we provide a complexity analysis for a single run of (11). If A and B (constructed offline) are ready in memory, the time cost comes from two parts: the time for finding the index h, plus a dozen simple algebraic computations involved in (9) and (10). Given (i, j), let ρ be the probability that a_ik − a_jk shares the sign of ∑_{k∈J} (a_ik − a_jk) for a randomly chosen k. Then the mean number of samplings equals 1/(1 − ρ), just a small constant. Thus, the time complexity of a single run of (11) is O(1). Moreover, it is easy to see that both A and B can be constructed incrementally. Actually, the whole algorithm (11) is almost ready to work in an incremental fashion with slight modifications, which is omitted here.
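The pairwise test can be sketched as follows. Instead of the paper's three-valued construction (9), this simplified version adjusts a single coordinate of the all-ones vector to satisfy the constraints in (8) and simply asserts the nonnegativity of w; the toy column-stochastic G below is an illustrative assumption.

```python
# Sketch of the pairwise comparison (8)/(11): phi_ij = (A^2 w)_i - (A^2 w)_j
# for a w with w_i = w_j, w >= 0 and (Aw)_i = (Aw)_j. Instead of the paper's
# three-valued w in (9), one coordinate of the all-ones vector is adjusted
# (an illustrative simplification); w >= 0 is asserted, not guaranteed.

def pairwise_phi(A, B, i, j):
    """A = G - I and B = A^2 are assumed precomputed (e.g., offline)."""
    n = len(A)
    w = [1.0] * n
    zeta = sum(A[i][k] - A[j][k] for k in range(n))      # (A1)_i - (A1)_j
    # adjust the coordinate h with the largest leverage |a_ih - a_jh|
    h = max((k for k in range(n) if k not in (i, j)),
            key=lambda k: abs(A[i][k] - A[j][k]))
    w[h] = 1.0 - zeta / (A[i][h] - A[j][h])              # forces (Aw)_i = (Aw)_j
    assert w[h] >= -1e-12, "fallback to the construction (9) would be needed"
    Bw = [sum(B[p][k] * w[k] for k in range(n)) for p in range(n)]
    return Bw[i] - Bw[j], w

# toy column-stochastic G, A = G - I, B = A^2
G = [[0.10, 0.40, 0.20, 0.30],
     [0.50, 0.10, 0.40, 0.10],
     [0.20, 0.30, 0.10, 0.40],
     [0.20, 0.20, 0.30, 0.20]]
n = 4
A = [[G[p][q] - (1.0 if p == q else 0.0) for q in range(n)] for p in range(n)]
B = [[sum(A[p][k] * A[k][q] for k in range(n)) for q in range(n)] for p in range(n)]
phi, w = pairwise_phi(A, B, 0, 1)    # phi > 0 would suggest r_0 > r_1
```

Once A and B are in memory, each call touches only two rows of A and B, matching the O(1) cost per comparison.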

### 2.3 Evaluating πij

Here, we study the probability πij (recall the notations defined in subsection 1.2) given φij constructed in (8), which determines the correct rate of our algorithm (11). Let Δij(t) = F_i(t) − F_j(t), and let τij_k = p_ki − p_kj denote the difference between the i-th and j-th elements of p_k, k = 1, ⋯, n; thus Δij(∞) = w_1 τij_1 from the second equality in lemma 2. Based on (4), the constraint w_i = w_j in (8) means ∑_{k=1}^{n} w_k τij_k = 0, i.e.,

 Δij(∞) = w_1 τij_1 = −∑_{k=2}^{n} w_k τij_k. (12)

Based on (4) and (6), the constraint (Aw)_i = (Aw)_j in (8) indicates

 0 = ∑_{k=2}^{r} λ_k w_k τij_k + ∑_{k=1}^{s} [λ_{R,r+k} (w_{r+2k−1} τij_{r+2k−1} + w_{r+2k} τij_{r+2k}) + λ_{I,r+k} (w_{r+2k} τij_{r+2k−1} − w_{r+2k−1} τij_{r+2k})], (13)

where we use the facts that λ_1 = 0, B_{r+k} w = w_{r+2k−1} p_{r+2k−1} + w_{r+2k} p_{r+2k}, and C_{r+k} w = w_{r+2k} p_{r+2k−1} − w_{r+2k−1} p_{r+2k} for k = 1, ⋯, s. Similarly, using (4) and (7), φij can be rewritten into

 φij = ∑_{k=2}^{r} λ_k² w_k τij_k + ∑_{k=1}^{s} [(λ_{R,r+k}² − λ_{I,r+k}²)(w_{r+2k−1} τij_{r+2k−1} + w_{r+2k} τij_{r+2k}) + 2 λ_{R,r+k} λ_{I,r+k} (w_{r+2k} τij_{r+2k−1} − w_{r+2k−1} τij_{r+2k})]. (14)

Next, we want to eliminate one redundant item from both (12) and (14) with the help of (13). This redundant item corresponds to the first complex pair if s ≥ 1 (i.e., there is at least one pair of complex eigenvalues, called case 1), or to the λ_2 term if r ≥ 2 (i.e., there are two or more real eigenvalues, called case 2). A direct computation gives the following theorem:

Theorem 4. Given any pair (i, j), the following holds. In case 1, there exists

 βij = [w_2 τij_2, ⋯, w_r τij_r, γij_{r+2}, ⋯, γij_{r+2s}]ᵀ ∈ R^{n−2},
 λ̄_1 = [λ_2/λ_{R,r+1} − 1, ⋯, λ_r/λ_{R,r+1} − 1, λ_{I,r+1}/λ_{R,r+1}, λ_{R,r+2}/λ_{R,r+1} − 1, λ_{I,r+2}/λ_{R,r+1}, ⋯, λ_{R,r+s}/λ_{R,r+1} − 1, λ_{I,r+s}/λ_{R,r+1}]ᵀ ∈ R^{n−2},
 λ̄_2 = [d_2, ⋯, d_r, e_{r+1} − c λ_{I,r+1}, f_{r+2} − c λ_{R,r+2}, e_{r+2} − c λ_{I,r+2}, ⋯, f_{r+s} − c λ_{R,r+s}, e_{r+s} − c λ_{I,r+s}]ᵀ ∈ R^{n−2}, (15)

where the scalars c, d_k (k = 2, ⋯, r), e_k and f_k (k = r+1, ⋯, r+s), and the γij_k's are determined by the eigenvalues of A together with (12) and (13).

In case 2, there exists

 βij = [w_3 τij_3, ⋯, w_r τij_r, γij_{r+1}, ⋯, γij_{r+2s}]ᵀ ∈ R^{n−2},
 λ̄_1 = [λ_3/λ_2 − 1, ⋯, λ_r/λ_2 − 1, λ_{R,r+1}/λ_2 − 1, λ_{I,r+1}/λ_2, λ_{R,r+2}/λ_2 − 1, ⋯, λ_{R,r+s}/λ_2 − 1, λ_{I,r+s}/λ_2]ᵀ ∈ R^{n−2},
 λ̄_2 = [d_3, ⋯, d_r, f_{r+1} − c λ_{R,r+1}, e_{r+1} − c λ_{I,r+1}, f_{r+2} − c λ_{R,r+2}, ⋯, f_{r+s} − c λ_{R,r+s}, e_{r+s} − c λ_{I,r+s}]ᵀ ∈ R^{n−2},

where all variables are the same as those in (15) except the definition of c.

It is worth noting that λ̄_1 and λ̄_2 are two (n−2)-dimensional random vectors dependent only on the eigenvalue distribution of A, while βij is an (n−2)-dimensional random vector w.r.t. the eigenvector distribution of A and the projections of w along the eigenvectors. From now on, we will treat the Google matrix G as a random matrix that encodes the topological structure of a model-generated or real-world network following different ensembles, e.g., scale-free model-sc , or small-world model-sw , etc.

Denote by θij the angle between