Personalized PageRank dimensionality and algorithmic implications

04/09/2018
by   Daniel Vial, et al.
University of Michigan

Many systems, including the Internet, social networks, and the power grid, can be represented as graphs. When analyzing graphs, it is often useful to compute scores describing the relative importance or distance between nodes. One example is Personalized PageRank (PPR), which assigns to each node v a vector whose i-th entry describes the importance of the i-th node from the perspective of v. PPR has proven useful in many applications, such as recommending who users should follow on social networks (if this i-th entry is large, v may be interested in following the i-th user). Unfortunately, computing n such PPR vectors (where n is the number of nodes) is infeasible for many graphs of interest. In this work, we argue that the situation is not so dire. Our main result shows that the dimensionality of the set of PPR vectors scales sublinearly in n with high probability, for a certain class of random graphs and for a notion of dimensionality similar to rank. Put differently, we argue that the effective dimension of this set is much less than n, despite the fact that the matrix containing these vectors has rank n. Furthermore, we show this dimensionality measure relates closely to the complexity of a PPR estimation scheme that was proposed (but not analyzed) by Jeh and Widom. This allows us to argue that accurately estimating all n PPR vectors amounts to computing a vanishing fraction of the n^2 vector elements (when the technical assumptions of our main result are satisfied). Finally, we demonstrate empirically that similar conclusions hold when considering real-world networks, despite the assumptions of our theory not holding.


1 Introduction

Many natural and man-made systems can be represented as graphs, sets of objects (called nodes) and pairwise relations between these objects (called edges). These include the brain, which contains neurons (nodes) that exchange signals through chemical pathways (edges), the Internet, which contains websites (nodes) that are connected via hyperlinks (edges), etc. To study graphs, researchers in diverse domains have used Personalized PageRank (PPR) [22]. Informally, PPR assigns to each node $v$ a vector $\pi_v$, where $\pi_v(w)$ describes the importance of node $w$ from the perspective of $v$. PPR has proven useful in many practical and graph-theoretic applications. Examples include recommending who a user should follow on Twitter [23] (user $v$ may wish to follow user $w$ if $\pi_v(w)$ is large), and partitioning graphs locally around a seed node [3] (the set of nodes $w$ with large $\pi_v(w)$ can be viewed as a community surrounding $v$). Unfortunately, computing all $n$ PPR vectors (where $n$ is the number of nodes) is infeasible for the massive graphs encountered in practice.

In this work, we argue that all $n$ PPR vectors can be accurately estimated by computing only a vanishing fraction of the vector elements, with high probability and for a certain class of random graphs. This arises as a consequence of our main (structural) result, which shows that the dimensionality of the set of PPR vectors scales sublinearly in $n$ with high probability, for the same class of random graphs and for a notion of dimensionality somewhat similar to matrix rank. We note that the estimation scheme considered was first proposed by Jeh and Widom in [25] without a formal analysis, so another contribution of our paper is to address this lacuna.

The paper is organized as follows. We begin in Section 2 with preliminary definitions. Section 3 discusses related work. In Section 4, we state our main result. We then discuss algorithmic implications and present empirical results in Section 5. Finally, we close in Section 6.

2 Preliminaries

We begin by defining the main ingredients of the paper. Most notation is standard or defined as needed, but we note the following is often used: for and , , satisfies (where is the indicator function), and .

2.1 Directed configuration model (DCM)

We consider a random graph model called the directed configuration model (DCM). For the DCM, we are given realizations of random in-degree and out-degree sequences; we assume for simplicity that the two sequences have equal sums, so that half-edges can be perfectly paired. (More precisely, we would like the in- and out-degrees to be i.i.d. samples from given distributions, but this does not guarantee equal sums; for this reason, the authors of [17] provide a method to generate the sequences such that the sums match while the empirical degree distributions still converge to the desired limits.) Our goal is to construct a directed graph in which each node has the prescribed in- and out-degree. For this, we first assign to each node a number of incoming half-edges equal to its in-degree and a number of outgoing half-edges equal to its out-degree; we call these half-edges instubs and outstubs, respectively. We then randomly pair half-edges in a breadth-first search fashion that proceeds as follows:

  1. Choose an initial node uniformly at random. For each of the outstubs assigned to this node, sample an instub uniformly from the set of all instubs (resampling if the sampled instub has already been paired), and pair the outstub and instub to form a directed edge out of the node.

  2. For each out-neighbor of the initial node discovered in Step 1, pair its outstubs using the same method by which the initial node's outstubs were paired in Step 1.

  3. Continue iteratively until all half-edges have been paired. Namely, during the $k$-th iteration we pair the outstubs of all nodes at distance $k-1$ from the initial node (those for which a path of length $k-1$ from the initial node exists, but no shorter path exists).

We define this procedure formally in Appendix A.2. For now, the important points to remember are that the initial node is chosen uniformly at random, and that, at the end of the $k$-th iteration, the $k$-step neighborhood out of the initial node has been constructed. We emphasize the resulting graph will be a multi-graph in general, i.e. it will contain self-loops (edges from a node to itself) and multi-edges (more than one edge from one node to another). In [17], the authors provide conditions under which a simple graph results with positive probability as $n \to \infty$, but these are stronger than the conditions we require to prove our main result. Hence, we allow the constructed graph to be a multi-graph.
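To make the construction concrete, the following is a minimal sketch of sampling a DCM by matching half-edges, given in- and out-degree sequences with equal sums. The breadth-first pairing order described above matters for the analysis, but for simply generating a graph a uniform matching of instubs and outstubs produces the same distribution over multi-graphs; the function name and interface below are illustrative, not the paper's implementation.

```python
import random

def sample_dcm(in_degrees, out_degrees, seed=0):
    """Sample a directed configuration model as an edge list (may contain
    self-loops and multi-edges). Assumes sum(in_degrees) == sum(out_degrees)."""
    assert sum(in_degrees) == sum(out_degrees), "half-edge counts must match"
    rng = random.Random(seed)
    # One entry per instub, labeled by the node it is attached to.
    instubs = [v for v, d in enumerate(in_degrees) for _ in range(d)]
    rng.shuffle(instubs)  # a uniform matching of outstubs to instubs
    edges, i = [], 0
    for u, d in enumerate(out_degrees):
        for _ in range(d):
            edges.append((u, instubs[i]))
            i += 1
    return edges

# Example: a small DCM on 4 nodes.
# edges = sample_dcm([1, 2, 1, 2], [2, 1, 2, 1])
```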

2.2 Personalized PageRank (PPR)

To define PPR, we require some notation. First, let $A$ denote the adjacency matrix for some realization of the DCM, i.e. $A(i,j)$ is the number of directed edges from $i$ to $j$ (possibly greater than one, since the graph may be a multi-graph). Next, let $P$ be the row-stochastic matrix with $P(i,j) = A(i,j) / d_i^{\mathrm{out}}$, where $d_i^{\mathrm{out}}$ denotes the out-degree of $i$. Finally, fix a jump probability $\alpha \in (0,1)$, and let $\mathbf{1}_n$ denote the length-$n$ vector of ones. We then have the following.

Definition 1.

For each node $v$, the PPR row vector $\pi_v$ is the stationary distribution of the Markov chain with transition matrix $(1-\alpha) P + \alpha \mathbf{1}_n e_v^{\mathsf{T}}$, where $e_v$ denotes the $v$-th standard basis (row) vector.

Note that the quantities defined above all depend on $n$. However, to avoid cumbersome notation, we do not denote this explicitly, and the dependence on $n$ will be clear from context.

The Markov chain described in Definition 1 has the following dynamics: take a uniform random walk step with probability $1-\alpha$, and jump to $v$ with probability $\alpha$. This motivates an interpretation of PPR as a centrality measure of the nodes from the perspective of $v$. To see this, let $\{X_k\}_{k \ge 0}$ denote the Markov chain with transition matrix $P$. Then one can show (see Appendix B.1.1)

$$\pi_v(w) = \mathbb{E}\left[\mathbb{1}(X_{N_\alpha} = w) \mid X_0 = v\right] = \alpha \sum_{k=0}^{\infty} (1-\alpha)^k\, \mathbb{P}(X_k = w \mid X_0 = v), \qquad (1)$$

where $N_\alpha$ is a geometric random variable with success probability $\alpha$ (supported on $\{0,1,2,\ldots\}$), and where the expectation is taken with some realization of the DCM held fixed. Hence, $\pi_v(w)$ is large when $w$ is frequently visited (a notion of centrality) on $\mathrm{Geometric}(\alpha)$-length walks beginning at $v$ (a notion of $v$'s perspective).
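To make Definition 1 and (1) concrete, here is a minimal sketch of computing a single PPR vector by iterating the fixed-point relation $\pi_v = \alpha e_v + (1-\alpha)\,\pi_v P$ on a dense row-stochastic matrix; the function name and interface are illustrative, not the paper's implementation.

```python
import numpy as np

def ppr_vector(P, v, alpha, tol=1e-12, max_iter=100_000):
    """PPR vector of node v: the stationary distribution of the chain that
    follows P with prob. 1 - alpha and jumps back to v with prob. alpha."""
    n = P.shape[0]
    e_v = np.zeros(n)
    e_v[v] = 1.0
    pi = e_v.copy()
    for _ in range(max_iter):
        new = alpha * e_v + (1.0 - alpha) * (pi @ P)
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi
```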

We note the typical definition of PPR assumes the jump probability $\alpha$ is constant; in contrast, we let $\alpha \to 0$ as $n \to \infty$. We argue in Section 4.2 that this is appropriate when considering the asymptotic behavior of PPR on the DCM. Specifically, we argue that the size of the set of nodes that are important to $v$ grows with the graph, but grows slowly enough that a notion of $v$'s perspective remains, when $\alpha$ vanishes at an appropriate rate. (In contrast, this set has constant size when $\alpha$ is constant.) Additionally, the spectral gap of the transition matrix in Definition 1 is lower bounded by $\alpha$, so letting $\alpha \to 0$ as $n \to \infty$ means this lower bound vanishes asymptotically. We note a line of work by Boldi et al. [10, 11] analyzed the limit of PPR as $\alpha \to 0$ for a fixed graph; in contrast, we fix a value of $\alpha$ for each $n$.

Finally, we emphasize the distinction between PPR and the more commonly known notion of PageRank, which we refer to as global PageRank. In short, global PageRank is the average of all PPR vectors, i.e. $\frac{1}{n} \sum_{v} \pi_v$. Hence, global PageRank is a centrality measure from the perspective of a uniformly chosen node. More generally, given a distribution $\sigma$ on the nodes, the PPR corresponding to $\sigma$ is $\pi_\sigma = \sum_v \sigma(v)\, \pi_v = \mathbb{E}[\pi_{V_\sigma}]$, where the random variable $V_\sigma$ has $\sigma$ as its distribution.
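As a small illustration of this linearity, the sketch below computes the PPR vector for an arbitrary starting distribution as the weighted average of single-node PPR vectors (it reuses the `ppr_vector` helper sketched above; with `sigma` uniform this gives global PageRank). The function name is illustrative.

```python
def ppr_for_distribution(P, sigma, alpha):
    """PPR for a starting distribution sigma, computed as the sigma-weighted
    average of single-node PPR vectors."""
    n = P.shape[0]
    return sum(sigma[v] * ppr_vector(P, v, alpha) for v in range(n))
```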

2.3 PPR dimensionality and algorithmic implications

Our main goal is to investigate the dimensionality of the set of PPR vectors $\{\pi_v\}_v$. A standard measure of the dimension of such a set is the size of its largest linearly independent subset. However, $\{\pi_v\}_v$ is a linearly independent set itself (to see why, first suppose $I_n - (1-\alpha)P$ is not invertible; then $(I_n - (1-\alpha)P)x = 0$ for some $x \neq 0$, so $Px = x/(1-\alpha)$; but, by the Perron-Frobenius theorem, $P$ cannot have the eigenvalue $1/(1-\alpha) > 1$, since it is row stochastic; hence, $I_n - (1-\alpha)P$ is invertible, so by (4), the matrix with rows $\{\pi_v\}_v$ is invertible as well), so we will instead consider a different notion of dimensionality. This notion is motivated by the following observation: given a set of vectors and a subset $H$ of them, the size of a linearly independent subset can be bounded by $|H|$ plus the number of remaining vectors that do not lie in the span of the subset. We will relax this slightly, by only counting those $\pi_v$ that are not "close" to a linear combination of $\{\pi_h\}_{h \in H}$ and $e_v$. In particular, given $\epsilon > 0$ and a set of nodes $H$, our notion of dimensionality is defined by

(2)
(3)

Note we can also interpret (2) algorithmically: if $\{\pi_h\}_{h \in H}$ is known, $\pi_v$ can be accurately estimated by computing only the coefficients of the combination, when $v \notin H$ and the error event in (3) fails to occur. Hence, (2) is the number of vectors that must be computed to ensure all $n$ vectors are accurately estimated (see Section 5). We note $e_v$ is included in (3) because it is a known component of $\pi_v$; indeed, by Definition 1,

$$\pi_v = \alpha\, e_v \left(I_n - (1-\alpha) P\right)^{-1}. \qquad (4)$$

For ease of analysis, we will upper bound (2) by choosing the subset $H$ solely as a function of the degree sequence. For such a choice, and for a given subset size $K$, we then define

(5)

where the subscript indicates that the right side depends on the degree sequence through the choice of $H$. Our main result, Theorem 1, shows that this quantity scales sublinearly in $n$ with high probability, under certain assumptions on the degree sequence and for a particular choice of $H$. In other words, though $\{\pi_v\}_v$ is a linearly independent set (for every finite $n$), our notion of dimensionality suggests the effective dimension is (asymptotically) much smaller.

We note that, in addition to bounding (2) by (5), we will later bound the error by choosing a specific linear combination, which is not necessarily the minimizer of the optimization problem in (3). Hence, the exact solution of (2) remains an open question. Furthermore, in light of the preceding algorithmic interpretation of (2), another open problem is to solve (2) while ensuring the estimate of $\pi_v$ can be efficiently computed when $v \notin H$ and the error event in (3) fails to occur.
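To illustrate the notion (not the exact optimization in (2)-(3), whose precise form is in the original text), the sketch below counts, for a given hub set, how many PPR vectors cannot be approximated within a tolerance by a linear combination of the hub vectors and $e_v$. It uses a least-squares fit as a stand-in for the minimization in (3), and the function name is an illustrative choice.

```python
import numpy as np

def approximate_dimension(Pi, hubs, eps):
    """Count |hubs| plus the number of remaining rows of Pi that are not within
    eps (entrywise absolute error, summed) of a linear combination of the hub
    rows and e_v. A least-squares surrogate for the relaxed dimensionality."""
    n = Pi.shape[0]
    hub_list = list(hubs)
    count = len(hub_list)
    for v in range(n):
        if v in hub_list:
            continue
        e_v = np.zeros(n)
        e_v[v] = 1.0
        basis = np.vstack([Pi[hub_list], e_v])        # allowed combination, as rows
        coeffs, *_ = np.linalg.lstsq(basis.T, Pi[v], rcond=None)
        if np.abs(Pi[v] - coeffs @ basis).sum() > eps:
            count += 1
    return count
```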

Finally, recall the degree sequence is random; hence, with the subset size fixed, the bound in (5) is random as well. Towards proving our main result, intermediate results will be established with the degree sequence held fixed, after which conditional expectation with respect to the degree sequence will be taken. This motivates defining probability and expectation conditioned on the degree sequence, which we use throughout.

3 Related work

Before proceeding to our results, we comment on relationships to prior work. We focus on [25] and [16], the papers most closely related to our own.

In [25], Jeh and Widom propose a scheme for estimating all $n$ PPR vectors. The scheme relies crucially on the Hubs Theorem in [25], which states that each PPR vector $\pi_v$ can be written as a linear combination of the hub vectors $\{\pi_h\}_{h \in H}$ and another vector. The Hubs Theorem is central to our results as well; an alternative formulation appears as Lemma 2 here. We discuss the algorithm of Jeh and Widom in more detail in Section 5.

Unfortunately, the authors of [25] present no analysis of their scheme. Hence, it is unclear how the hub set $H$ should be chosen and how large it must be to guarantee accurate estimation. Our work addresses this shortcoming. Specifically, as discussed briefly in the introduction and in more detail in Section 5, our dimensionality measure (5) relates to the complexity of this scheme.

In [16], Chen, Litvak, and Olvera-Cravioto consider the limiting value of PPR as $n \to \infty$, showing that the PPR value of a uniformly chosen node weakly converges to a limiting probability distribution. Specifically, they show that this limit is given by the solution of a recursive distributional equation (RDE) [1]. They also show (roughly) that PPR values follow a power law when in-degrees follow a power law, establishing the so-called "power law hypothesis." Similar results were later established for a family of inhomogeneous directed graphs in [27]. On the other hand, [16] was preceded by [15], where the power law hypothesis was established for global PageRank; further back, the hypothesis was studied under more restrictive assumptions in [29, 36, 37].

While [15, 16, 27, 29, 36, 37] share a goal of understanding the power law behavior of PPR on random graphs, our goal is to instead understand structural properties of the PPR vectors collectively, with the focus of this paper being dimensionality. Since dimensionality carries with it algorithmic implications, our work is perhaps more useful from a practical perspective when compared to this body of work. However, the analytical approaches of these works will be extremely useful to us. Specifically, the proof of our main result follows an approach similar to [16], and we use a modified version of Lemma 5.4 from [16], which appears as Lemma 5 here.

In short, our work can be seen as an attempt to combine the strengths of [25], which is entirely algorithmic, and [16], which is entirely analytical. Specifically, we leverage the analytical approach from [16] to obtain guarantees on the algorithm from [25].

More broadly, references for PageRank and PPR include [32], in which PageRank and PPR were first proposed, and [24], an early study of PPR (there called "topic-sensitive" PageRank). Beyond [25], many other works have proposed efficient computation and estimation algorithms for PPR; a small sample includes those using linear algebraic techniques [33, 34], those using dynamic programming [2, 3], and those using randomized schemes [6, 30]. In addition to the body of work on the power law hypothesis, analysis of PPR on random graphs includes [4]. Here it is shown that, for undirected random graphs with a certain expansion property, the PPR vector can be well approximated (in the total variation norm) as a convex combination of $e_v$ and the degree distribution.

The DCM was proposed and analyzed in [17] as an extension of the (undirected) configuration model, the development of which began in [8, 13, 38]. The configuration model (and variants) have been studied in detail; for example, [35] considers graph diameter in this model, while [31] studies the emergence of a giant component.

4 Dimensionality analysis

In this section, we present our dimensionality analysis. We begin by defining our assumptions and proposing a specific choice of the jump probability $\alpha$. We then state the result and comment on our assumptions.

4.1 Assumptions on degree sequence

To prove our main result, we require Assumption 1, which states that certain empirical moments of the degree sequence exist with high probability and, furthermore, converge to limits at a uniform rate. Since we follow the analytical approach of [16], this assumption is similar to the main assumption in that work. We offer more specific comments shortly.

Assumption 1.

We have for some , where and for some constants and ,

(6)

Furthermore, we have , and we define .

These constants will appear in our main result, and both have simple interpretations: they give the limiting expected out-degree and the limiting probability of an instub belonging to the hub set, respectively. (The other constants in Assumption 1 will not appear in our main result, but they have similar interpretations.) We also remark that the latter condition is not strictly necessary to establish our results but, given this interpretation, represents the more interesting case.

4.2 Choice of $\alpha$

As mentioned in Section 2.2, we let $\alpha \to 0$ as $n \to \infty$ in this work. Having defined Assumption 1, we now choose a specific rate for $\alpha$. For this, we first present the following claim.

Claim 1.

Let be a constant, and let uniformly. For , let denote the -step neighborhood out of , i.e. . If for some , let . Then

(7)

If instead is a constant, let . Then

(8)
Proof 1.

See Appendix C.1.

Loosely speaking, Claim 1 states that, for both choices of $\alpha$, all but an arbitrarily small fraction of $v$'s PPR mass concentrates on a small neighborhood surrounding $v$. The difference is the size of this neighborhood: when $\alpha \to 0$, the neighborhood grows with the graph; when $\alpha$ is constant, the neighborhood has constant size. From the PPR interpretation of Section 2.2, this suggests that the number of nodes that are important to $v$ grows in the former case but remains fixed in the latter case. We believe the former case is more appropriate. Additionally, the growth of the important set of nodes remains sublinear in $n$ in the former case; intuitively, this says that a vanishing fraction of all nodes are important to $v$, i.e. a notion of $v$'s perspective remains. Finally, Claim 1 suggests that the dimensionality (2) is necessarily linear in $n$ when $\alpha$ is constant: since PPR vectors are (essentially) supported on constant-size sets in this case, we expect that a linear number of vectors must be computed to cover a linear number of these sets.

4.3 Main result

We now turn to our main result, which relies on the following key lemma.

Lemma 1.

Given Assumption 1, we have for uniformly and for any ,

(9)

where are defined in Assumption 1 and are defined in Claim 1.

The proof of Lemma 1 is lengthy; we outline it in Appendix A and provide the details in Appendix B. At a high level, our approach is similar to [16] and proceeds as follows:

  1. Show that, for a certain choice of the approximating combination, the error term can be bounded by only examining a bounded-depth neighborhood out of $v$.

  2. Argue that, conditioned on certain events not occurring during the initial steps of the graph construction, this bound follows the same distribution as a quantity defined on a tree.

  3. Bound the probability of these events occurring during these initial iterations.

  4. Bound the error term, conditioned on the events not occurring, by analyzing the tree quantity.

Before proceeding, we pause to state the choice of approximating combination from Step 1, which will be used in Section 5. First, for any realization of the DCM and for a given hub set $H$, we define

(10)

where $P$ is defined in Section 2.2. Note that (10) is the transition matrix of a Markov chain similar to that in Definition 1; however, upon reaching a node in $H$, the random walker jumps back to $v$ with probability 1. Letting the stationary distribution of this chain serve as a "partial" PPR vector for $v$, one can show (see Appendix A.1)

(11)

We note (11) is an alternate formulation of the Hubs Theorem. With (11) in mind, we define

(12)

and we take the approximating combination as in (12) in Step 1. We also note this provides another interpretation of Lemma 1. Informally, since the coefficients in (12) nearly sum to one for large $n$, the estimate in (12) is nearly a convex combination. Hence, when the error event in (3) fails to occur, $\pi_v$ is close to the convex hull of $\{\pi_h\}_{h \in H} \cup \{e_v\}$, a small subset of the probability simplex to which $\pi_v$ belongs.
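The sketch below computes the stationary distribution of the modified chain just described, based on the verbal description above (the precise matrix in (10) may be presented or normalized differently in the paper); `hubs` is the hub set, `v` the target node, and the names are illustrative.

```python
import numpy as np

def partial_ppr(P, v, hubs, alpha, tol=1e-12, max_iter=100_000):
    """Stationary distribution of the chain sketched above: from a non-hub node,
    jump to v with prob. alpha and otherwise follow P; from a hub node, return
    to v with prob. 1. Dense-matrix sketch for illustration only."""
    n = P.shape[0]
    e_v = np.zeros(n)
    e_v[v] = 1.0
    Q = (1.0 - alpha) * P + alpha * np.tile(e_v, (n, 1))
    for h in hubs:
        Q[h] = e_v                      # hub rows jump back to v deterministically
    pi = e_v.copy()
    for _ in range(max_iter):
        new = pi @ Q
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi
```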

We now turn to the main result. First, note Lemma 1 will allow us to show the second summand in (5) is bounded (in expectation) by a sublinear quantity. Hence, to ensure (5) is sublinear, it only remains to choose $H$ such that $|H|$ is sublinear as well. On the other hand, Assumption 1 requires $H$ to contain a constant fraction of all instubs, suggesting we should choose $H$ to be nodes with high in-degree. Together, these observations motivate our choice of $H$: for a given size $K$, we choose the $K$ nodes of highest in-degree as $H$. Formally, this is the function that maps the degree sequence to a set of $K$ nodes whose in-degrees are largest.
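A small sketch of this hub choice and of the instub-fraction quantity appearing in Assumption 1 follows; the function names are illustrative.

```python
import numpy as np

def top_indegree_hubs(in_degrees, K):
    """The K nodes of highest in-degree (ties broken arbitrarily by argsort)."""
    return np.argsort(in_degrees)[::-1][:K].tolist()

def hub_instub_fraction(in_degrees, hubs):
    """Fraction of all instubs attached to the hub set; Assumption 1 asks this
    to remain bounded away from zero even for a sublinear hub set size."""
    in_degrees = np.asarray(in_degrees)
    return in_degrees[hubs].sum() / in_degrees.sum()
```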

With this in place, we present Theorem 1. Together with Assumption 1, it states the following: when certain moments of the degree sequence exist, and when a sublinear number of nodes contains a constant fraction of instubs, the dimension of the set of PPR vectors scales sublinearly.

Theorem 1.

Assume such that the sequence satisfies Assumption 1 when . Then ,

(13)

where is defined in Lemma 1. As a consequence, and ,

(14)
Proof 2.

See Appendix C.2.

To illustrate the theorem, we give an example in (15). Here yields satisfying Assumption 1, i.e. the assumptions of Theorem 1 are satisfied with .

(15)

4.4 Comments on assumptions

We begin with comments on in Assumption 1. First, note that, given and , implicitly requires to converge to a specific limit: indeed, assuming it converges,

(16)

With sublinear in Theorem 1, , so we require .

We next argue is not restrictive (at least in its own right). In fact, it is essentially implied by sublinearity of in Theorem 1 and , since then the fraction in satisfies

(17)

Next, we note these conditions are similar to assumptions found in [16] and are fairly standard given our approach, which leverages the fact that the random graph is asymptotically locally treelike [14]. In fact, one of our conditions is weaker than that required in [16], which is why (as mentioned in Section 3) we use a modified version of one of their lemmas. See Appendix A.3 for details.

Finally, one condition of Assumption 1 requires the fraction of instubs attached to the hub set to converge to a positive constant, with the hub set size sublinear in Theorem 1. We offer empirical evidence that this occurs for certain graphs of interest. Specifically, in Figure 1(a), this fraction remains constant and strictly less than 1 as $n$ grows, for a variety of sublinear choices of the hub set size. For this plot, in-degrees were sampled from a power law distribution. This in-degree distribution is commonly seen in real graphs and has been studied extensively, e.g. [7, 18]. As an example, Figure 1(b) compares the histogram of these in-degrees with the in-degrees of the Twitter graph (available at [9] from WebGraph [12]). The histograms are similar for most in-degree values; both are roughly linear (on a log-log scale) with similar slopes over most of the range. In short, a common model of in-degree distributions empirically satisfies this condition with a sublinear hub set.
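A small sketch of the kind of check behind Figure 1(a): sample power-law in-degrees and measure the fraction of instubs captured by a sublinearly sized set of top in-degree nodes. The exponent, set sizes, and sampler here are illustrative choices, not the paper's exact experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [10_000, 100_000, 1_000_000]:
    # Heavy-tailed integer in-degrees via a Pareto draw (illustrative sampler).
    in_deg = np.floor(rng.pareto(a=1.5, size=n) + 1).astype(int)
    K = int(n ** 0.75)                      # a sublinear hub set size
    top = np.sort(in_deg)[::-1][:K]
    frac = top.sum() / in_deg.sum()         # fraction of instubs in the hub set
    print(f"n={n:>9}  K={K:>6}  instub fraction={frac:.3f}")
```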

(a) For power law in-degrees, the hub set contains a constant fraction of instubs even when its size is sublinear.
(b) The in-degrees for Fig. 1(a) are similar to the in-degrees of the Twitter graph from [9].
Figure 1: The instub condition of Assumption 1 is empirically satisfied with a sublinear hub set for power law in-degree distributions.

5 Algorithms and experiments

In this section, we use our dimensionality analysis to analyze the algorithm from [25] mentioned in Section 3. We then present empirical results to complement our analysis.

5.1 Algorithm to estimate all PPR vectors

In Section 4.4.3 of [25], Jeh and Widom propose the following algorithm to estimate all PPR vectors. First, compute $\pi_h$ for each $h \in H$. Next, for each $v \notin H$, compute (12) and estimate $\pi_v$ as

(18)

The basic idea behind this scheme is that, from (11), the estimate (18) may be close to $\pi_v$; however, no formal analysis is provided. Here we show that our dimensionality result provides such an analysis.

For this, letting the estimate be as in (18) and using (96) from Appendix B.2, it is straightforward to show

(19)

In other words, (19) shows we can use the quantities in (12) to compute the estimation error indirectly, i.e. without actually computing $\pi_v$. This suggests a new scheme, which proceeds as follows. First, compute $\pi_h$ for each $h \in H$ (as in the existing scheme). Next, for each $v \notin H$, compute (12). If (19) holds, estimate $\pi_v$ as in (18); else, compute $\pi_v$ exactly.
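A minimal sketch of this estimate-or-compute loop follows (not the paper's implementation): `hub_estimate(v)` and `error_bound(v)` stand in for (18) and (19), whose exact formulas appear in the paper, and are passed in as callables; the exact PPR computation uses the closed form $\pi_v = \alpha\, e_v (I_n - (1-\alpha)P)^{-1}$ via a dense linear solve.

```python
import numpy as np

def exact_ppr(P, v, alpha):
    """pi_v via a dense linear solve of pi_v (I - (1 - alpha) P) = alpha e_v."""
    n = P.shape[0]
    e_v = np.zeros(n)
    e_v[v] = 1.0
    return np.linalg.solve((np.eye(n) - (1.0 - alpha) * P).T, alpha * e_v)

def estimate_all_ppr(P, hubs, alpha, eps, hub_estimate, error_bound):
    """Estimate-or-compute scheme sketched above. Returns the estimates and the
    number of PPR vectors computed exactly (the dominant cost in (20))."""
    n = P.shape[0]
    estimates, exact_count = {}, 0
    for h in hubs:                          # compute hub vectors exactly
        estimates[h] = exact_ppr(P, h, alpha)
        exact_count += 1
    for v in range(n):
        if v in estimates:
            continue
        if error_bound(v) <= eps:           # indirect check, as in (19)
            estimates[v] = hub_estimate(v)  # hub-based estimate, as in (18)
        else:
            estimates[v] = exact_ppr(P, v, alpha)
            exact_count += 1
    return estimates, exact_count
```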

Using this scheme, we either compute $\pi_v$ exactly, or we obtain an estimate within $\epsilon$ of $\pi_v$ (in the norm used in (3)), for every $v$. The remaining question is the scheme's complexity, which we take to be the number of PPR values that are computed. First, for each $h \in H$, the $n$ entries of $\pi_h$ are computed. Next, for each $v \notin H$, the values needed for (12) are computed. Finally, the $n$ entries of $\pi_v$ are additionally computed for each $v$ such that (19) fails; by definition, this occurs for the number of nodes counted in the second summand of (5) when $H$ is chosen as in Section 4.3. Hence, the number of PPR values computed is

(20)

which is sub-quadratic with high probability when Theorem 1 applies. (We have assumed the computation of the quantities in (12) is no more costly than the computation of PPR values on the original graph; this is reasonable because they are computed on a sparser graph.) Hence, all $n$ PPR vectors can be accurately estimated by computing a vanishing fraction of the $n^2$ vector elements.

Finally, we remark that this scheme can also be viewed as approximating the matrix whose $v$-th row is $\pi_v$. To see this, let $\hat{\pi}_v$ be the estimate of $\pi_v$ from the scheme, i.e. the estimate (18) if $v \notin H$ and (19) holds, and $\pi_v$ itself otherwise. Then, by (19), each row of the estimated matrix is within $\epsilon$ of the corresponding row of the true matrix, so the two matrices differ by at most $\epsilon$ in the norm that takes the maximum error over rows. Hence, the scheme approximates the PPR matrix with bounded error in this norm.

5.2 Empirical results

We now demonstrate the performance of this algorithm using two datasets from the Stanford Network Analysis Platform (SNAP) [28]: soc-Pokec, a social network, and web-Google, a partial web graph (see Appendix D.1 for details). For both graphs, we choose the top nodes by in-degree as the hub set $H$ and, for each $v \notin H$, compute a bound on the estimation error using a power iteration scheme described in Appendix D.2. Figure 2(a) shows histograms of the error bound, while Figure 2(b) shows our dimensionality measure. Note (as proven in Appendix D.2) that the error is zero for the nodes characterized by

(21)

(In words, the error is zero when no outgoing neighbors of $v$ lie outside of $H$.) As a result, the spikes at zero in Figure 2(a), and a corresponding effect in Figure 2(b), are due to these nodes. Additionally, we show in Appendix D.2 that the error is bounded above by a worst-case value; hence, the spikes at the right of Figure 2(a), and the "dips" at the right of Figure 2(b), occur at this value. Between these spikes, the soc-Pokec histogram decays quickly; this corresponds to the dimensionality curve being nearly flat over most of its range in Figure 2(b). (For web-Google, similar behavior occurs, though it is less pronounced.) Finally, we highlight two points on Figure 2(b), one for soc-Pokec and one for web-Google. The soc-Pokec point, for example, shows that computing a small fraction of the PPR vectors guarantees that the estimation error for the remaining PPR vectors falls to a third of the worst-case value (i.e. the worst-case error is reduced by a factor of 3). See Appendix D.3 for further empirical results for these datasets.

Figure 2(b) also highlights another aspect of our dimensionality measure. Specifically, the discussion at the end of Section 5.1 and the steep decay in Figure 2(b) suggest that most of the "energy" of the set of PPR vectors is contained in a small number of dimensions, as measured in the norm used above. Hence, our measure is roughly analogous to stable rank, a more common dimensionality measure that instead measures energy using singular values (namely, the stable rank of a matrix is $\sum_i \sigma_i^2 / \sigma_1^2$, where $\sigma_1 \ge \sigma_2 \ge \cdots$ are its ordered singular values).
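For comparison, the stable rank just mentioned can be computed directly from the singular values; a short sketch of this standard definition:

```python
import numpy as np

def stable_rank(M):
    """Stable rank: sum of squared singular values over the largest squared
    singular value, i.e. ||M||_F^2 / ||M||_2^2."""
    s = np.linalg.svd(M, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)
```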

In Appendix D.2, we also describe how the power iteration scheme allows us to compute a bound on the average error indirectly (i.e., without actually computing the error for each $v$). Hence, we show the average error bound for a wider variety of SNAP datasets in Figure 3(a). Interestingly, the two social networks soc-LiveJournal1 and soc-Pokec have similar behavior, as do the two web graphs web-BerkStan and web-Stanford (web-Google is somewhat of an outlier; we believe its average error is lowest in part because the fraction of instubs contained in its hub set is largest). Finally, in Figure 3(b), we show the average error bound computed on a DCM with power law in-degrees. As suggested by Lemma 1, the average error shrinks as $n$ grows (despite the jump probability shrinking as well); this is in part because, as seen in Figure 1(a), the fraction of instubs belonging to the hub set remains constant.

(a) (Normalized) histograms of the estimation error bound.
(b) Computing 9% or 15% of PPR vectors reduces worst-case estimation error by a factor of 3.
Figure 2: Error and dimensionality for soc-Pokec (social network) and web-Google (web graph).
(a) Average error decreases as the hub set grows for a variety of social networks and web graphs.
(b) For the DCM with power law in-degrees, error decreases as $n$ grows, despite the jump probability decreasing.
Figure 3: Average error experiments for real and synthetic datasets.

6 Conclusions

In this work, we argued (analytically for the DCM and empirically for other graphs) that the dimensionality of the set of PPR vectors scales sublinearly in $n$. We also used our analysis to bound the complexity of the algorithm from [25]. Our analysis suggests several avenues for future work. First, the proof of Lemma 1 can be modified to analyze the tail of the error (this would essentially involve replacing Lemma 6 with a tail bound on a maximum instead of a sum). Hence, bounding the absolute error of the estimate of $\pi_v$ for any particular $v$ is a straightforward extension; a more useful but less immediate analysis would involve bounding relative error. Second, examining PPR dimensionality for other random graph models may be of interest. For example, several papers have analyzed PPR on preferential attachment models [5, 21]; we suspect a dimensionality analysis for such graphs would yield a message similar to our work (the hub set should contain the nodes of highest in-degree). A more interesting class of graphs would be the stochastic block model; here it may be more beneficial to choose the hub set such that each community contains a nonempty subset of it.

Note on the organization of appendices: Appendix A outlines the key ideas and intuition behind the proof of Lemma 1, which contains the bulk of our technical analysis and itself requires five lemmas. The proofs of these lemmas are found in subsections of Appendix B, in the order their statements appear in Appendix A. Shorter proofs (those of Claim 1 and Theorem 1) are found in Appendix C. Finally, Appendix D contains details on the experiments of Section 5.

Appendix A Lemma 1 proof outline

In this appendix, we outline the proof of Lemma 1. Our approach follows the outline described in Section 4.3. Specifically, we consider Steps 1-4 of the outline in Appendices A.1-A.4, respectively. In Appendix A.5, we combine the results to prove the lemma.

A.1 Error bound in a bounded-depth neighborhood (Step 1)

Our first goal is to bound the error term