Personalized PageRank (PPR) and shortest distance are both commonly used local features in graph based recommendation and search systems [31, 52]. Instead of computing from scratch, it is more desirable to create data structures that can quickly answer queries online, by pre-processing the graph. Yet storing the data structures can be expensive, especially for very large graphs.
Recent algorithmic results have evaluated the storage requirements on social and information networks for personalized PageRank and shortest path. The synthesis of the experimental results is that surprisingly low complexity data structures can be efficiently computed in practice, based on the following meta-design [4, 23, 40]: each node stores a label consisting of a list of (node, value) pairs offline during the pre-processing phase, and the query answering algorithm computes the output by taking a vectorized operation between the two labels online. Besides the simplicity and storage efficiency, such designs have provided much faster ranking results for personalized search on social graphs [10, 40].
Despite the success found in practice, the limits of existing algorithms are far from well understood. For personalized PageRank, the only known lower bound is a computational lower bound, which says that computing the PPR value between a single pair of vertices requires time in the worst case (without any pre-processing), where is the number of vertices. For shortest path, the seminal work of Gavoille et al. shows that for the family of graphs with maximum degree , any distance labeling scheme requires space in the worst case. However, social and information networks are typically sparse expander graphs with a heavy-tailed degree distribution, which are not always similar to the hard instances that Gavoille et al. constructed. This raises the question of whether similar results can be obtained for average-case instances.
In this work we introduce a general framework for proving lower bounds for labeling schemes, closing the gap described above. A labeling scheme defines a data access model, where two vertices compute the output using just their own information/labels. While such an access model may seem restrictive, the state of the art performance is achieved via labeling schemes for both personalized PageRank  and shortest paths [4, 23] in practice.
We present a tight lower bound for personalized PageRank in the data access model of labeling schemes. Our lower bound matches the existing algorithms [39, 40, 41], and is stated in terms of the desired accuracy threshold — if one starts to care about smaller PPR values, then the lower bound would scale up accordingly.
We then study distance labeling schemes on sparse random graphs with a power law degree distribution. Our techniques yield upper and lower bounds which depend on the power law exponent and are nearly tight in certain regimes. The bounds significantly improve over the worst-case bounds by exploiting the expansion property of random graphs. We hope such theoretical analysis may find applications for proving lower bounds on other data structure problems where labeling methods have found algorithmic success.
1.1 Results for Personalized PageRank
Let be an undirected and unweighted graph. Let be the number of vertices and be the number of edges. Consider a random walk that starts at a vertex with teleport probability . At each step, with probability , the random walk stops; with probability , it moves to a uniformly random neighbor. The personalized PageRank from to a vertex , denoted by , is the probability that the random walk starting at stops at . (Footnote 1: Equivalently, suppose that with probability , the random walk teleports to instead. Then the stationary distribution of this random walk is also equal to the personalized PageRank vector of ; see e.g. Chapter 1 in .) We say that an algorithm is -accurate if for any :
if , then .
if , then .
Following Lofgren et al., is assumed to be larger than for a certain value , to capture PPR values which are above average. And is assumed to be at least , so that the expected length of a random walk is not too large. This setting is critical for personalized PageRank to capture enough local graph structures .
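The random walk definition above can be turned into a simple Monte Carlo estimator. The following is a minimal sketch, not the algorithm of Lofgren et al.: the names `adj`, `s`, `alpha`, and `num_walks` are illustrative assumptions, and the graph is assumed to have no isolated vertices.

```python
import random
from collections import Counter

def sample_walk_endpoint(adj, s, alpha, rng):
    """Run one walk from s: at each step stop with probability alpha,
    otherwise move to a uniformly random neighbor. Return the stopping vertex."""
    v = s
    while rng.random() >= alpha:
        v = rng.choice(adj[v])  # assumes every vertex has at least one neighbor
    return v

def estimate_ppr(adj, s, alpha, num_walks, seed=0):
    """Estimate the PPR vector of s as the empirical distribution of stopping vertices."""
    rng = random.Random(seed)
    counts = Counter(sample_walk_endpoint(adj, s, alpha, rng) for _ in range(num_walks))
    return {v: c / num_walks for v, c in counts.items()}

# Toy example (hypothetical data): the 4-cycle 0-1-2-3-0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
ppr = estimate_ppr(adj, s=0, alpha=0.2, num_walks=20000)
```

The estimates form a probability distribution over stopping vertices, and larger PPR values are estimated with smaller relative error, which is consistent with only requiring accuracy above a threshold.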
It’s not hard to see that simply by sampling at most random walks per vertex, we obtain an -accurate data structure of total size . By combining random walk and linear algebraic methods , Lofgren et al. [40, 41] improved the storage complexity over the above baseline to , when . In this upper bound, each vertex stores a set of random walks and local graph statistics as a label vector . To obtain the personalized PageRank between and , the query algorithm simply computes the dot product between and . The total length of the labeling is simply . Our main result is a matching lower bound to the above algorithm for labeling schemes on sparse graphs, even if the algorithm is only required to approximate the answer within a factor of .
Theorem 1 (informal).
Let be an Erdös-Rényi random graph where an edge is sampled independently between every vertex pair with probability . Let be the number of vertices. Let be the teleport probability for the personalized PageRank random walk on . With high probability over the randomness of , any -accurate labeling data structure will output a labeling of total length , for and .
We remark that for graphs where the number of edges , our results also imply a lower bound of , under technical conditions on (see Section 4 for details). Since labeling schemes are only a special family of data structures, it would be interesting to see whether one can obtain stronger lower bounds. However, in the more general cell-probe model, it has been notoriously difficult to prove super-logarithmic query time lower bounds [34, 43, 44], even for non-adaptive static data structures. Thus we view our work as the first step towards closing the gap between the upper and lower bounds of static personalized PageRank data structures.
1.2 Results for Shortest Path on Power Law Graphs
Our lower bound techniques can be extended to shortest paths labeling schemes on more general random graphs. Along the way, we also present upper bounds to complete the picture. We describe the setup and main results below. The details are deferred to Appendix 5.
We will focus on the Chung-Lu model, which generalizes Erdös-Rényi random graphs to general degree distributions. (Footnote 2: Our results can be extended to other random graph models as well; see the discussion in Appendix 5 for details.) In the Chung-Lu model , each vertex has a weight , which is the expected degree of . For every pair of vertices and , there is an undirected and unweighted edge between them with probability proportional to , independent of other edges. We assume that for each , is drawn independently from a power law distribution with exponent , i.e. the probability that is proportional to .
We point out that when the degree distribution has bounded variance, the techniques from Theorem 1 already imply the optimal bound for shortest path labeling schemes. See Section 3 for details. This lower bound bears resemblance to the seminal work of Gavoille et al., and extends their analysis to obtain average case lower bounds for natural distributions. In the lower bound obtained for distance labeling on bounded degree graphs, the hard instance consists of a set of graphs whose distance labels must all be different from each other. Hence the lower bound is obtained for the worst case instance via a counting argument. In contrast, we show that the lower bound holds with high probability even for an average case instance from the Erdös-Rényi random graph distribution.
For degree distributions with high variance, we present upper and lower bounds on the space complexity of exact distance labeling schemes, which can answer distance queries correctly for all pairs of vertices.
Let be a sparse random power law graph model with average degree and exponent . For a random graph drawn from , we have that with high probability over the randomness of , there exists an exact distance labeling scheme where the label sizes of every vertex in are all bounded by .
Secondly, any exact distance labeling scheme will output a labeling of total length at least for .
The bounds are nearly tight when is close to , and have a small gap when . It would be interesting to close the gap when . We mention that if the labeling scheme is only required to output approximately correct distances, then the amount of space needed can be significantly reduced. Our techniques can also be used to obtain a -stretch scheme where the maximum label length among all the vertices is at most , and a -stretch scheme where the maximum label length among all the vertices is at most . See Figure 1 for a summary of the results.
1.3 Intuitions for The Analyses
We show that on an Erdös-Rényi random graph , each pair of labels conveys a certain amount of information about , as the pair of labels determines their PPR value. To augment the entropy obtained from pairs of vertices, we identify a maximal set of vertices such that their pairwise PPR values are almost independent. As the number of such vertices increases, their total label size, which upper bounds the amount of information they can possibly convey, grows linearly in , whereas the total entropy of pairwise PPR values grows quadratically. We also discover a tight connection between PPR and shortest distance. That is, given that the random walk starting at stops at , the most likely route is to walk directly along the shortest path from to . While PPR is a weighted combination over different paths, the connection to shortest path allows us to extract edge information explicitly from the graph.
To obtain the “independence” of PPR values between a sufficiently large set of vertices , we describe an iterative process to “grow” the local neighborhood of each vertex in . At every iteration of the process, we grow the neighborhood of one vertex up to a certain level , on the subgraph which has not been explored yet. Constructed in this way, the -th level sets of every vertex are disjoint from each other. We show that based on the estimated PPR values, we can infer whether the -th level sets are connected by any edge or not.
For Theorem 2, the upper bound uses the fact that there are many very high degree vertices in the graph. However, when gets close to , even though the degree distribution gets more skewed towards high degree vertices, the space complexity increases again to . The reason is that the average distance also matters. If there exist many short paths that cannot be compressed, then the space complexity will increase. The lower bound builds on the insights from Erdös-Rényi random graphs. However, the neighborhood growth has very high variance. To overcome the issue, we carefully construct a set of “good” paths, so that with high probability, a vertex will follow a “good” path during the neighborhood growth. (Footnote 3: The idea is inspired by Chapter 3 in Van Der Hofstad .) See Section 5 for details.
The rest of the paper is organized as follows. In Section 2 we give a review of personalized PageRank and the Chung-Lu model. In Section 3 we introduce our main technical contribution by illustrating a shortest path lower bound for . In Section 4 we prove the lower bound for personalized PageRank labeling schemes. In Section 5 we present distance labeling schemes for . Then we present the analysis of the algorithm in Appendix A and also evaluate it experimentally in Appendix B. Appendix C describes relevant tools from random graph theory.
For a vertex , denote by the degree of . For a set of vertices , let denote the sum of their degrees. Denote by if there is an edge between . For two disjoint sets and , denote by if there exists an edge between and , and if there does not exist any edge between and . For a graph , let denote the distance of and in . When there is no ambiguity, we drop the subscript and simply denote by the distance between and .
We use to hide absolute multiplicative constants. Similarly, means that there exists an absolute constant such that . We use to hide poly-logarithmic factors.
2 Preliminaries and Related Work
Recall that denotes an undirected and unweighted graph. Let denote the weight of every vertex . Given the weight vector over , the Chung-Lu model defines a probability distribution over the set of all graphs. Let denote the volume of . And let denote the second moment of . Each edge is chosen independently with probability
Thus, is approximately the expected degree of , and is approximately the expected number of edges (multiplied by two). Let denote such a probability distribution over , and denote a sample from the distribution. The following proposition bounds the probability that two sets connect. The proof can be found in the appendix.
Let be a random graph. For any two disjoint sets of vertices and ,
In particular, when , we have that .
The proof is derived from the following calculations.
When the degree distribution has bounded variance, we can bound the neighborhood growth rates as follows. The result is standard (see e.g. Chung and Lu ); we present a proof in Section C.1 for completeness.
Proposition 4 (Growth rates for ).
Let be a random graph model with weight sequence satisfying the following properties:
for some constant ;
for some constant ;
for some positive constant and , where ;
The growth rate is bounded away from ().
Then for any vertex with a constant weight, the set of vertices at distance exactly from satisfy that
for every ;
for every positive integer .
As a corollary, we have that for every , where is any vertex with constant weight.
Random power law graph:
Let denote the probability density function of a power law distribution with exponent , i.e. , where . The expectation of exists when . The second moment is finite only when .
In the random power law graph model, the weight of each vertex is drawn independently from a power law distribution (with the same mean and exponent ). Given the weight vector , we then sample a random graph according to the Chung-Lu model.
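The two-stage sampling described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights follow a Pareto density proportional to x^(-beta) on x >= w_min, and the edge probability takes the commonly used Chung-Lu form min(1, w_x * w_y / vol), which is one way to make it "proportional to" the product of the weights. The names `sample_power_law_weights`, `sample_chung_lu`, `beta`, and `w_min` are hypothetical.

```python
import random

def sample_power_law_weights(n, beta, w_min, rng):
    """Draw n i.i.d. weights with density proportional to x^(-beta) for x >= w_min
    (requires beta > 1), using inverse-transform sampling."""
    return [w_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0)) for _ in range(n)]

def sample_chung_lu(weights, rng):
    """Sample an undirected, unweighted graph in which each pair {x, y} is joined
    independently with probability min(1, w_x * w_y / vol) (assumed form)."""
    n = len(weights)
    vol = sum(weights)
    adj = {v: set() for v in range(n)}
    for x in range(n):
        for y in range(x + 1, n):  # no self-loops
            if rng.random() < min(1.0, weights[x] * weights[y] / vol):
                adj[x].add(y)
                adj[y].add(x)
    return adj

rng = random.Random(1)
weights = sample_power_law_weights(200, beta=2.5, w_min=1.0, rng=rng)
graph = sample_chung_lu(weights, rng)
```

The quadratic loop over vertex pairs is only meant for small illustrations; it is not an efficient generator for large graphs.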
It is known that if , then almost surely a random graph with weight has a unique giant component (see e.g. Chung and Lu ). (Footnote 4: If , almost surely all connected components have at most vertices.) In this paper, we will assume that the average degree is a constant greater than .
2.1 Related work
Landmark based labeling:
There is a rich history of study on how to preprocess a graph for answering shortest path queries [6, 18, 13, 50]. A commonly used approach is landmark based labeling [4, 12, 19, 22, 23], also known as 2-hop covers or hub labeling. The empirical results of Akiba et al. and Delling et al. found that only a few hundred landmarks per vertex suffice to recover all-pairs distances exactly, in a large collection of social, Web, and computer networks with tens of millions of edges. The idea is to find central landmarks that lie on the shortest paths of many sources and destinations. In a landmark based labeling, every vertex stores a set of landmarks as well as its distance to each landmark. To answer a distance query , we simply find a common landmark in the landmark sets of and to minimize the sum of distances . It is NP-hard to compute the optimal landmark based labeling (or 2-hop cover), and a -approximation can be obtained via a greedy algorithm. See also the references [8, 9, 24, 30] for a line of follow-up work. Another closely related line of work is approximate distance oracles [3, 5, 21, 45, 46, 47, 53]. We refer the reader to the excellent survey for further reading.
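The landmark query described above can be sketched in a few lines. This is a minimal illustration, assuming each label is a dictionary mapping landmark to distance; the function name `query_distance` and the toy labels are hypothetical, not taken from the cited systems.

```python
def query_distance(label_x, label_y):
    """Return the minimum of d(x, z) + d(z, y) over common landmarks z,
    or infinity if the two labels share no landmark."""
    common = label_x.keys() & label_y.keys()
    if not common:
        return float("inf")
    return min(label_x[z] + label_y[z] for z in common)

# Toy example (hypothetical data): a star with center 0 and leaves 1..4.
# Every leaf stores the center and itself; the center stores only itself.
labels = {0: {0: 0}}
for leaf in range(1, 5):
    labels[leaf] = {0: 1, leaf: 0}

d_leaf_leaf = query_distance(labels[1], labels[2])
d_center_leaf = query_distance(labels[0], labels[3])
```

The answer is exact only when the labels form a valid 2-hop cover, i.e. every pair of vertices shares a landmark on one of its shortest paths, as is the case for the star above.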
Random graph models:
Existing models for social and information networks build on random graphs with a fixed degree distribution [25, 16, 51]. Informally, we assume that the degree sequence of our graph is given, and then we draw a “uniform” sample from graphs that have the same or very similar degree sequences. Random graphs capture the small world phenomenon , because the average distance grows logarithmically in the number of vertices. They serve as a basic block to richer models with more realistic features, e.g. community structures , shrinking diameters in temporal graphs . It has been empirically observed that many social and information networks have a heavy-tailed degree distribution [17, 26] — concretely, the number of vertices whose degree is , is proportional to .
Previous work of Chen et al.  presented a 3-approximate labeling scheme requiring storage per vertex, on random power law graphs with . Our (+2)-stretch result improves upon this scheme in the amount of storage needed per vertex for , with a strictly better accuracy guarantee. Another related line of work considers compact routing schemes on random graphs. Enachescu et al.  presented a 2-approximate compact routing scheme using space on Erdös-Rényi random graphs, and Gavoille et al.  obtained a 5-approximate compact routing scheme on random power law graphs. Other existing mathematical models on special families of graphs related to distance queries include road networks , planar graphs  and graphs with doubling dimension . However none of them can capture the expansion properties that have been observed on sub-networks of real-world social networks . Apart from the Chung-Lu model and the configuration model that we have mentioned, the preferential attachment graph is also well-understood . It would be interesting to see if our results extend to preferential attachment graphs as well. The Kronecker model  allows a richer set of features by extending previous random graph models, however its mathematical properties are not as well-understood as the other three models.
3 Warm Up and Shortest Paths Lower Bounds for
In this section, we illustrate our main ideas by presenting a lower bound for labeling schemes that can estimate all pairs distances up to , where is equal to . (Footnote 5: Note that the average distance of is ; see e.g. Bollobás .) More formally, we say that a labeling scheme is -accurate if for any :
if , then the labeling scheme returns the exact distance .
if , then the labeling scheme returns “”.
For any integer ,
denote the set of vertices whose distance from is equal to .
And let denote the set of vertices whose distance from is at most .
Let be an integer smaller than .
(Footnote 6: We assume that is odd without loss of generality.)
We may assume without loss of generality for every , the label of stores the distances between and all vertices in . This is because the lower bound we are aiming for is larger than the size of , and hence we can always afford to store them. From the labels of , either we see a non-empty intersection between and , which determines their distance; or the two sets are disjoint, in which case we are certain that . In a random graph, the event that , conditioned on and and are disjoint, happens with probability
by Proposition 3, assuming that . Note that this probability gives us a lower bound on the entropy of the event . Since the labels of and determine their distance, if we can find a large number of pairwise independent pairs such that the entropy of is large (e.g. suffices), then we obtain a lower bound on the total labeling size.
Our discussion so far suggests the following three step proof plan.
Pick a parameter and a maximal set of vertices , such that by “growing” the local neighborhood of up to , are disjoint/independent and have large volume, for a large number of pairs from .
Use the labels of to infer whether there are edges between and , for a large number of pairs from . Obtain a lower bound on the total label length of via entropic arguments.
Partition the graph into disjoint groups of size . Apply the first two steps for each group.
Clearly, given any two vertices, their neighborhood growths are correlated with each other. However, one would expect that the correlation is small, so long as the volume of the neighborhood has not reached . To leverage this observation, we describe an iterative process to grow the neighborhood of up to distance . Let be a set of vertices whose weights are close to the expected degree of . The motivation is to find disjoint sets for each , such that is almost as large as , and if , then there is no edge between and .
The iterative process:
We grow the neighborhood of each vertex in by an arbitrary but fixed order, up to level . Denote by , where and . For any , define to be the set of vertices in whose distance is at most from . Define to be the set of vertices in whose distance is equal to from . More formally,
We then define ( by default). Denote by the induced subgraph of on the remaining vertices .
We note that in the above iterative process, the neighborhood growth of only depends on the degree sequence of . We show that under certain conditions, with high probability, a constant fraction of vertices satisfy that .
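The iterative process can be illustrated with a truncated breadth-first search on the not-yet-explored subgraph. The sketch below is one plausible reading of the construction, stated under assumptions: after processing a vertex, its entire explored ball is removed from the graph, and a vertex already absorbed by an earlier ball gets an empty boundary set. The names `grow_neighborhoods`, `order`, and `level` are hypothetical.

```python
from collections import deque

def grow_neighborhoods(adj, order, level):
    """For each x in `order`, run BFS from x up to depth `level` on the subgraph
    induced by vertices not explored so far. Return, for each x, the set of
    vertices at distance exactly `level` from x in that subgraph."""
    removed = set()   # vertices explored by earlier balls (assumed to be deleted)
    boundary = {}
    for x in order:
        if x in removed:          # assumption: x was swallowed by an earlier ball
            boundary[x] = set()
            continue
        dist = {x: 0}
        queue = deque([x])
        while queue:
            u = queue.popleft()
            if dist[u] == level:  # do not expand beyond the target level
                continue
            for v in adj[u]:
                if v not in removed and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        boundary[x] = {v for v, d in dist.items() if d == level}
        removed.update(dist)      # delete the whole ball before the next vertex
    return boundary

# Toy example (hypothetical data): the path 0-1-2-...-9.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
sets = grow_neighborhoods(path, order=[0, 3, 9], level=2)
```

Because every ball is removed before the next vertex is processed, the resulting boundary sets are pairwise disjoint by construction.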
Lemma 5 (Martingale inequality).
Let . Let be an integer and be a set of vertices whose weights are all within and . Assume that
, for all ;
, for all ;
, for all .
Then with high probability, at least vertices satisfy that , for a certain fixed constant .
Consider the following random variable, for any.
We have by Assumption i). Thus by the Azuma-Hoeffding inequality, with high probability. We will show below that the contribution to from the first two predicates is . Hence by taking a union bound, we obtain the desired conclusion.
First, we show that the number of such that is with high probability. Note that implies that there exists some vertex such that . On the other hand, for any two vertices , , by Assumption iii). Hence, the expected number of vertex pairs in whose distance is at most , is , by the assumption on the size of . By Markov’s inequality, with high probability only vertex pairs have distance at most in . Hence there exists at most ’s such that .
Secondly, for all , with high probability. This is because the set of vertices is a subset of , the set of vertices within distance to on . Thus, by Assumption ii), we have
And the expected volume of is at most
Hence by Markov’s inequality, the probability that is at most . This proves the lemma. ∎
For the rest of this section, we show how to implement the three step proof plan, for distance labeling on random graphs with . For personalized PageRank, we need to show in step b) that PPR values can infer distance information. And for distance labeling on random graphs with , we need to deal with the fact that the local neighborhood growth has high variance in step a). We refer the reader to Section 4 and Section 5 for details.
3.1 The case
We first introduce the following proposition, which instantiates the martingale inequality towards proving the distance labeling lower bound for .
Proposition 6 (Iterative neighborhood growth).
Let and . Let be a set of vertices whose weights are all within . With high probability, at least vertices in satisfy that .
It’s easy to verify that . It suffices to verify the assumptions required in Lemma 5. Note that Assumptions ii) and iii) simply follow from Proposition 4. Hence it suffices to verify Assumption i). Note that the subgraph can be viewed as a random graph sampled from the Chung-Lu model over . By setting
we have that
Hence we see that is equivalent to a random graph drawn from degree sequence . Denote by the growth rate on . When , by Hölder’s inequality,
by straightforward calculation. Hence is a constant strictly greater than . By Proposition 4, with constant probability , because
Since the vertices at distance from in is exactly , we have verified that Assumption i) is correct. ∎
Now we state the main result of this section.
Let be a random power law graph model with average degree and exponent . Let and be a fixed integer. For a random graph drawn from , we have that any -accurate labeling scheme will almost surely output a labeling whose total length is .
We know that there are vertices whose weights are between , by an averaging argument. Divide them into groups of size . Clearly, there are disjoint groups. Denote by a small fixed value (e.g. suffices). We argue that for each group ,
Hence by Markov’s inequality, except for groups, all the other groups will have label size at least . For the rest of the proof, we focus on an individual group .
Given the labels of , we can recover all pairwise distances which are less than in . Let denote the distance function restricted to . Consider the following two cases:
pairs such that . By Lemma 4, we know that , for any . Hence the expected number of pairs with distance at most in , is at most . Hence by Markov’s inequality, the probability that a random graph induces any such distance function is .
The number of pairs such that is at most in . Let
By Lemma 5, the size of is at least . For any , and are clearly disjoint. Conditional on for all , the probabilities of the existence of edges between and are unaffected.
(by Proposition 3)
Note that the number of labelings of size less than is at most . Therefore by a union bound, the probability that the total label size of is at most is at most:
By combining the two cases, we have shown that Equation (1) is true. Hence the proof is complete. ∎
It’s not hard to obtain a matching upper bound to Theorem 7. To see this, in each vertex’s label set, we simply add all the vertices up to distance from the vertex. The proof uses standard arguments from the random graph literature and we omit the details.
4 Personalized PageRank Lower Bounds
In this section, we present the lower bound on the space complexity of labeling schemes for personalized PageRank. The key intuition is based on the following lemma, which states that in an Erdös-Rényi random graph, personalized PageRank values are closely related to distances.
Let be an Erdös-Rényi random graph where every edge is sampled independently with probability , and be a positive integer less than . Then almost surely, for all pairs of vertices such that , we have .
By Chernoff bound, for each vertex , its degree is close to with high probability:
Hence when , by a union bound, almost surely all vertices have degree no more than . For any path , denote by as:
Therefore, almost surely for every path ,
In particular, for all pairs of vertices such that , there is a path from to with length at most . Thus, for all such and , by the random walk definition of personalized PageRank,
This proves the lemma. ∎
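The path argument in the proof can be checked numerically on a small graph. The sketch below assumes the random walk definition from Section 1.1 with teleport probability `alpha`: following a fixed path with k edges and then stopping has probability alpha times the product of (1 - alpha) / degree over the first k vertices of the path, so this quantity lower bounds the PPR value between the endpoints. The function names and the toy graph are hypothetical.

```python
def ppr_vector(adj, s, alpha, num_terms=200):
    """Approximate the PPR vector of s from below by truncating the series
    alpha * sum_k (1 - alpha)^k * (distribution of the walk after k moves)."""
    cur = {v: 0.0 for v in adj}   # mass of walks that are still alive after k moves
    cur[s] = 1.0
    ppr = {v: 0.0 for v in adj}
    for _ in range(num_terms):
        for v in adj:
            ppr[v] += alpha * cur[v]          # walks that stop now
        nxt = {v: 0.0 for v in adj}
        for u, mass in cur.items():
            for v in adj[u]:
                nxt[v] += (1 - alpha) * mass / len(adj[u])
        cur = nxt
    return ppr

def path_lower_bound(adj, path, alpha):
    """Probability that the walk follows `path` exactly and then stops."""
    bound = alpha
    for u in path[:-1]:
        bound *= (1 - alpha) / len(adj[u])
    return bound

# Toy example (hypothetical data): the 4-cycle 0-1-2-3-0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
exact = ppr_vector(cycle, s=0, alpha=0.2)
bound = path_lower_bound(cycle, [0, 1, 2], alpha=0.2)
```

On graphs whose degrees are all bounded by a common value, the same product gives a lower bound that depends only on the distance between the two vertices, which is the shape of the statement used in the lemma.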
Let be an Erdös-Rényi random graph where every edge is sampled independently with probability . Let be the desired accuracy threshold and be the desired approximation ratio. Let be the teleport probability of the personalized PageRank random walk on . For any -accurate labeling data structure for personalized PageRank, with high probability over the randomness of , the total labeling size for is at least , where
for large enough and . Since and , we have . Then
since with high probability. Hence for general number of edges , we obtain a lower bound which also matches the upper bound of Lofgren et al. .
Based on Lemma 8, we present the analysis for our main result.
Proof of Theorem 9.
Divide into groups of size . We will show that for each group ,
where is a certain constant to be specified later. Hence by Markov’s inequality, almost surely there are groups with total label size at least . This implies that the total label size of will be at least . For the rest of the proof, we focus on proving Equation (4).
We apply the iterative process in Section 3 to generate boundary sets for each . Note that Erdös-Rényi random graphs can be generated from the Chung-Lu model where each vertex has weight . Hence the average weight is equal to , and the growth rate is . And we obtain that . We also verify the assumptions required in Lemma 5. Note that the second and third assumptions follow from straightforward calculation, hence we omit the details. For the first assumption, conditional on , we have that the size of is at most , because every vertex of has weight equal to . Thus the number of vertices in is at least , and the growth rate of on is . By the Chernoff bound, the neighborhood growth on concentrates at rate with high probability because . Hence we obtain that the size of is at least with high probability, since .
Thus by Lemma 5, almost surely there exists with size at least , such that for all we have , for a certain fixed constant . Consider the following two cases:
There exist pairs such that