# Closure Operators and Spam Resistance for PageRank

We study the spammability of ranking functions on the web. Although graph-theoretic ranking functions, such as Hubs and Authorities and PageRank, exist, there is no graph-theoretic notion of how spammable such functions are. We introduce a very general cost model that only depends on the observation that changing the links of a page that you own is free, whereas changing the links on pages owned by others requires effort or money. We define spammability to be the ratio between the amount of benefit one receives for one's spamming efforts and the amount of effort/money one must spend to spam. The more effort/money it takes to get highly ranked, the less spammable the function. Our model helps explain why both Hubs and Authorities and standard PageRank are very easy to spam. Although standard PageRank is easy to spam, we show that there exist spam-resistant PageRanks. Specifically, we propose a ranking method, Min-k-PPR, that is the component-wise min of a set of personalized PageRanks centered on k trusted sites. Our main results are that Min-k-PPR is, itself, a type of PageRank and that it is expensive to spam. We elucidate a surprisingly elegant algebra for PageRank. We define the space of all possible PageRanks and show that this space is closed under some operations. Most notably, we show that PageRanks are closed under (normalized) component-wise min, which establishes that (normalized) Min-k-PPR is a PageRank. This algebraic structure is also key to demonstrating the spam resistance of Min-k-PPR.

## 1 Introduction

For a directed graph $G=(V,E)$, representing e.g. the web graph, a ranking function maps elements of $V$ to a numeric value, so that subsets of $V$ can be sorted. Link spam is the manipulation of the edges to game the ranking function.

Spam-fighting is the subject of intense research (see Section 2). In 1998, Brin and Page  [12] introduced PageRank as a graph-theoretic spam-resistant ranking function. However, as we discuss below, the standard formulation of PageRank is easy to spam.

In this paper, we show how to make PageRank [12] spam resistant.

We begin by giving a simple model of link spam. The following discussion relies on the rather mundane observations that if you own a web page, you can edit its links, but acquiring a link from a page you do not own requires effort (or money).

### 1.1 A cost model for spamming

Define the web graph to be a directed graph $G=(V,E)$ together with a partition of $V$ that defines ownership, that is, if two nodes are in the same part of the partition, then they have the same owner. Each part is called the property of its owner. $T \subseteq V$ is a set of trusted sites that are known not to be owned by spammers. The ranking function is only responsible for ranking nodes reachable from some node in $T$.

An owner can perform the following operations on $G$ for free:

• Create a node in its own property;

• Change the out-links of the nodes in its property.

An owner can perform the following operations on $G$, but at a cost:

• Purchase a link from a node outside its property to a node in its property;

• Purchase an existing node (with its in-links) from the current owner.

For any property, we define the boundary $B$ to be the nodes in the property that have links from other properties and the interior $I$ to be those that do not.
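The boundary/interior split can be sketched in a few lines. This is a minimal illustration, assuming the graph is given as a dictionary of out-links and ownership as a dictionary from node to owner label; the function name `boundary_and_interior` and the toy graph are hypothetical and not part of the paper.

```python
def boundary_and_interior(out_links, owner, prop):
    """Split the nodes of property `prop` into (boundary, interior).

    out_links: dict mapping each node to the list of nodes it links to
    owner:     dict mapping each node to its owner label (the ownership partition)
    """
    members = {v for v in owner if owner[v] == prop}
    boundary = set()
    for u, targets in out_links.items():
        if owner[u] == prop:
            continue  # links that start inside the property do not create boundary nodes
        for v in targets:
            if v in members:
                boundary.add(v)  # v is linked from another property
    return boundary, members - boundary


# Toy graph (assumption): "a" and "b" are legitimate pages, "s1".."s3" belong to one spammer.
out_links = {
    "a": ["b"],
    "b": ["a", "s1"],   # the only link entering the spammer's property
    "s1": ["s2", "s3"],
    "s2": ["s1"],
    "s3": ["s1"],
}
owner = {"a": "web", "b": "web", "s1": "spam", "s2": "spam", "s3": "spam"}
boundary, interior = boundary_and_interior(out_links, owner, "spam")
```

In this toy graph only `s1` is on the boundary; `s2` and `s3` are interior nodes that the owner can rewire for free.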

Define the cost of a node to be , where . Normalize ranking function so that . We say that is -spam resistant if there is a way to set so that, for every property , , that is, the amount of ranking function accumulated in the interior of a property is at most proportional to the cost of acquiring a boundary.

Notice that we say a ranking function is spam resistant if there exists some pricing scheme under which you must pay for your rank value. It is too much to require all pricing schemes to yield a pay-for-rank bound. For example, suppose that all but one node cost 0, and one node costs 1. Then any ranking scheme would be free to spam. So our notion of spam resistance is a necessary condition: there exists a market value for nodes so that a spammer is forced to pay for what they get. The more they have to pay, the more resistant the ranking function is.

What about the cost of buying an edge? There is no restriction that the cost function assign non-zero costs only to nodes outside of a partition. If the spammer chooses to buy edges, they have targets inside the property. We can assign the cost of edges to their target node. Thus it is sufficient to assign costs to nodes to capture the cost of both buying nodes (as when a spammer buys an existing domain), or acquiring edges (via all kinds of techniques, including making a high-quality page that people choose to link to, market places for buying edges, etc).

Our (unspecified) cost function allows us to combine costs in terms of money, effort, etc. What about setting the cost of a node to be its rank value? As we will see below (see Theorem 5), this is too restrictive. Ranking schemes can have non-local effects that must be accounted for in the cost scheme.

This definition makes the simplification that a cabal, in which different agents cooperate, say, by exchanging links, is a single owner.

We can assume that a spammer begins disconnected from the main web graph (or even with an empty partition). The spammer must therefore buy one node to connect to the network, and there is always a node that costs . Therefore, the smallest resistance a ranking function can have is . Surprisingly, some ranking functions can be spammed for just this cost. Other ranking functions are spam resistant. The most resistance a function can have is , since a spammer could simply buy all the ranking (and receive benefit ), by buying all the nodes (for cost ). Here we give examples that range from resistance to resistance.

In 1999, Kleinberg [26] introduced the notion of hub and authority scores and the HITS algorithm to compute them. Intuitively, a page receives a high authority score if it is pointed to by many high-quality hubs. A page receives a high hub score if it points to many high-quality authorities. It suffices to note that hub scores depend on out-links and are therefore free to spam. But once an agent owns many pages with high hub scores, it can create pages with high authority scores, once again for free. Such considerations are far from hypothetical. Assano et al. [5] report that HITS was unusable by 2007, due to its spammability. This vulnerability was already well understood in 2004 [19]. In the parlance of this paper, HITS has the lowest possible spam resistance.

PageRank was introduced to be harder to spam. The idea is that if many high-quality nodes point to a node, then that is evidence for the high quality of the target. The stationary distribution of a random walk on a graph has a similar behavior: the more in-edges from high-probability nodes one has, the higher one’s own probability in the random walk.

However, not every directed graph has a well defined stationary distribution. Therefore Brin and Page introduced the notion of a reset. Let $\epsilon$ be the reset probability. Consider a random walk which, with probability $1-\epsilon$, traverses an out edge from the current node, selected uniformly at random. With probability $\epsilon$, it resets to a node chosen according to $r$, the reset probability distribution, or reset vector, over $V$. We refer to this process as the random walk on $G$ with reset vector $r$ and reset probability $\epsilon$. The PageRank of $G$ under $r$ and $\epsilon$ is the stationary distribution of this process.

But this definition presents a problem: it might be possible to link-spam $r$ itself, and thus spam the PageRank! Indeed, the reset vector $r$ is usually taken to be the uniform distribution, as it was in the original paper. Spamming PageRank with a uniform reset vector is basically free. Once a single node is purchased, the spammer can create an arbitrarily large subgraph (for free), which can gather and focus PageRank. Indeed, this method can focus an arbitrarily large fraction of the total PageRank into the spam region, yielding the lowest possible spam resistance.
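The attack on a uniform reset vector can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the legitimate web is a directed cycle, the spammer has purchased a single in-link, and the farm layout (every spam node links to one target with a self loop) is a hypothetical choice; `pagerank` and `spam_share` are names introduced here.

```python
def pagerank(out_links, reset, eps=0.15, iters=200):
    """Power iteration for the random walk with reset.

    Assumes every node has at least one out-link (e.g. a self loop)
    and that `reset` sums to 1.
    """
    p = dict(reset)
    for _ in range(iters):
        nxt = {v: eps * reset[v] for v in out_links}
        for u, targets in out_links.items():
            share = (1 - eps) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p


def spam_share(n_legit, n_spam, eps=0.15):
    """Fraction of uniform-reset PageRank captured by a farm of n_spam nodes."""
    # Legitimate web (assumption): a directed cycle w0 -> w1 -> ... -> w0.
    out_links = {("w", i): [("w", (i + 1) % n_legit)] for i in range(n_legit)}
    # The single purchased link from a legitimate node into the farm.
    out_links[("w", 0)].append(("s", 0))
    # The farm: every spam node links only to the target ("s", 0),
    # and the target links to itself, so no rank flows back out.
    for j in range(n_spam):
        out_links[("s", j)] = [("s", 0)]
    uniform = {v: 1.0 / len(out_links) for v in out_links}
    p = pagerank(out_links, uniform, eps)
    return sum(x for v, x in p.items() if v[0] == "s")
```

Because the farm receives its share of the uniform reset and never leaks rank back out, `spam_share(10, 90)` is at least 0.9 in this model, and the share keeps growing as free nodes are added.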

In fact, both of these ranking functions ignore the trusted sites and suffer from the lowest possible spam resistance. This conjunction of features is not accidental:

###### Lemma 1.

Any ranking that gives the same values to all nodes, no matter what the trusted sites are, is -spam resistant.

###### Proof.

The spammer can make a copy of the rest of the graph for free. If the ranking function ignores the trusted sites, it has no way of knowing which copy is the “real” graph. Thus the spammer can accumulate half the ranking for the cost of one edge. ∎

Therefore, we need to consider ranking functions that use the trusted sites, if we want to have any hope of achieving spam resistance. Consider a PageRank where all the reset goes to a single trusted node, $t$. Such a PageRank is called a personalized PageRank [24] of center node $t$, which we denote $p^{(t)}$. Personalized PageRanks are not susceptible to the direct link spam attack described above, because a spammer can only acquire PageRank by gaining more and more nodes from the “legitimate” web. Furthermore, they are not susceptible to manipulations of the reset vector, since the interior of any property that does not include $t$ receives no reset. To see how spammable this ranking function is, set the cost of a node to be its Personalized PageRank. The total PageRank in the interior of a property is a constant times the amount that flows across the boundary, as long as $\epsilon$ is a constant. Thus, this ranking function is -spam resistant.

However, Personalized PageRanks are not very useful for ranking pages, since any page that’s downstream from and near to the center will have a huge PageRank. Such a PageRank is too biased to serve as a reasonable ranking function.

### 1.2 Our proposed link-spam-fighting method

We would like to keep the spam resistance of Personalized PageRanks, while reducing the large PageRanks downstream from the center node.

Consider a set $T=\{t_1,\dots,t_k\}$ of trusted nodes and their corresponding Personalized PageRanks, $p^{(t_1)},\dots,p^{(t_k)}$. We define the Min-$k$-PPR method to be the component-wise min of these PageRanks, that is, the $i$th component of the Min-$k$-PPR is the min of the $i$th component of all the $p^{(t_j)}$ (perhaps normalized so that the values of the Min-$k$-PPR sum to $1$).
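A minimal sketch of this definition follows, assuming a dictionary-of-out-links graph, power iteration as the solver, and a default reset probability of 0.15. The names `pagerank` and `min_k_ppr` and the toy graph are illustrative assumptions, not the authors' implementation.

```python
def pagerank(out_links, reset, eps=0.15, iters=200):
    """Power iteration; assumes every node has an out-link and `reset` sums to 1."""
    p = dict(reset)
    for _ in range(iters):
        nxt = {v: eps * reset[v] for v in out_links}
        for u, targets in out_links.items():
            share = (1 - eps) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p


def min_k_ppr(out_links, trusted, eps=0.15, normalize=True):
    """Component-wise min of the personalized PageRanks centered on `trusted`."""
    pprs = []
    for t in trusted:
        # Personalized PageRank: all reset probability goes to the center t.
        reset = {v: 1.0 if v == t else 0.0 for v in out_links}
        pprs.append(pagerank(out_links, reset, eps))
    score = {v: min(p[v] for p in pprs) for v in out_links}
    total = sum(score.values())
    if normalize and total > 0:
        score = {v: s / total for v, s in score.items()}
    return score, pprs


# Toy graph (assumption): two trusted pages that both lead to "a" and "b".
out_links = {"t1": ["a"], "t2": ["a"], "a": ["b"], "b": ["t1", "t2"]}
score, pprs = min_k_ppr(out_links, ["t1", "t2"])
```

In the unnormalized min, each center's value is capped by the other personalized PageRank, which is the intended damping of the local bias.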

Now, the node $t_i$ does not receive a huge PageRank, as it would in its own Personalized PageRank. Instead, it gets a value that is set by one of the other Personalized PageRanks. Thus, intuitively, Min-$k$-PPR should avoid the local bias inherent in Personalized PageRanks. We discuss the bias of Min-$k$-PPR in the conclusion.

On the spam-fighting side, component-wise min combinations of personalized PageRanks would appear to inherit the spam resistance of personalized PageRanks, since, in order to spam one’s way to a high min value, one must engineer being in the downstream neighborhood of all the trusted nodes $t_i$ used for each constituent $p^{(t_i)}$.

In the following, we show that this simple analysis of spam resistance is, in fact, too simple. Nonetheless, we show that Min--PPR is -spam resistant, when is a constant.

### 1.3 Our Results

In this paper, we explore the algebra of PageRank. Recall that the space of PageRanks is the set of stationary distributions of the random walk with reset on $G$, over all choices of $r$ and $\epsilon$. Results include:

• We show that for any graph , reset vector and reset probability , the PageRank exists and is unique. This contrasts with claims in the literature [30, 10, 6, 34] that must be carefully chosen to give a well-defined stationary distribution.

• We show necessary and sufficient conditions for a vector to constitute a PageRank for a graph. We show how to compute the source reset vector and reset probability for a putative PageRank vector.

• We demonstrate a class of functions which, when applied component-wise to a set of PageRanks and rescaled, yields another PageRank. That is, this class of functions is closed over the set of PageRanks. Most notably, we establish that component-wise min with rescaling is closed for PageRanks.

We use the machinery so developed to propose a spam-fighting method:

• We propose a method, called Min-$k$-PPR, for fighting link spam. This method consists of taking the component-wise min of a set of personalized PageRanks centered on trusted sites.

• We show that Min--PPR is -spam resistant, when is a constant.

## 2 Related Work

Originally, and most famously, PageRank was used by Google as a ranking function for web pages [2], but since then, it has been used to analyze networks of neurons [14], Twitter recommendation systems [17], protein networks [23], etc. (See [16] for a survey of non-web uses.) Here, we focus on its application to Web link analysis.

As noted above, PageRank is susceptible to link spam. Thus, other ranking functions have been proposed [29, 11, 20]. TrustRank [20] for example is based on assigning higher reputation to a subset of pages curated by an expert, and the assumption that pages linked from these reputable pages are reputable as well. A similar method can be applied for low reputation pages, which is called Anti-Trust Rank [28]. In both, reliability lowers as distance from the reference pages increases.

Other work is geared towards modifications of the PageRank mechanism. For instance, Global Hitting Time [22] was designed as a transformation of PageRank to counter cross-reference link spam, where nodes link each other to increase their rank, but it still suffers if the number of spammers is large. Variants include Personalized Hitting Time [32].

Despite the progress on other ranking mechanisms, PageRank still stands as the most popular [37] ranking function, and therefore the most attractive for link-spammers.

Google discouraged PageRank manipulation through the buying of highly ranked links by announcing that pages discovered to participate in such activity will be left out of the PageRank calculation (hence, their rank lowered) and encouraging the public to notify Google about such pages [1].

Other research has focused on link-spam detection [18] and quantifying the rank increase obtained by creating Sybil pages [13].

Link-spam detection may be useful for excluding pages from the PageRank calculation, but better than building a fortress is to make the attack futile, that is, to develop techniques that yield PageRank spam resistance. Towards that end, some work limits or assigns reset probability selectively [15, 20]. These approaches are generalizations of Personalized PageRank [24].

As expected, personalized PageRank is biased towards the vicinity of the trusted node. This undesired effect can be compensated for to some extent by concentrating reset probability on a subset of nodes rather than one (as in [15, 20]). Indeed, the approach has been successful for particular areas where the search space is relatively small (e.g. in Linguistic Knowledge Builder graph [3], Social Networks [8, 25], and Protein Interaction Networks [23]). But the scale of the web graph may require a large set of trusted pages for a general purpose PageRank.

The question of how to compute PageRank fast enough in practice has attracted a lot of attention, yielding a variety of theoretical and experimental results. The literature includes exact as well as approximation algorithms.

Exact computations involve the application of standard iterative methods, such as the Jacobi, Krylov subspace and power methods. A survey was published in [30]. If the computation is parallel, the web graph is first partitioned in blocks (cf. [27] and the references therein).

Approximations include a variety of techniques, either heuristic or with theoretical guarantees [4, 9, 33, 35]. A common approach is to start multiple random walks from different nodes, aiming to approximate the rank with the number of visits to each node.

Less related work includes distributed implementations [36], stochastic methods [7], and bounds on the second eigenvalue of the web hyperlink matrix. The second eigenvalue is related to the convergence time of computing PageRank by iterative methods [21].

## 3 Preliminaries

In this section, we introduce notation and define a somewhat extended notion of PageRank.

Let $G=(V,E)$ be a graph. For any node with no outgoing links, we assume it has a self loop, in order to simplify definitions and lemmas. For any edge $(u,v)\in E$, let $v$ be called an out-neighbor of $u$ and $u$ be called an in-neighbor of $v$. For any $v\in V$, let $\mathrm{IN}(v)$ and $\mathrm{OUT}(v)$ denote the set of in-neighbors and out-neighbors of $v$ respectively. For $v\in V$, let $\deg_{\mathrm{OUT}}(v)=|\mathrm{OUT}(v)|$. Note that $\deg_{\mathrm{OUT}}(v)\geq 1$ for all $v\in V$, because of the self loops. The definitions of IN and OUT are extended to subsets of nodes in a straightforward manner.

Let be the n-dimensional non-negative unit sphere denoting all possible -dimensional probability vectors.

Note: throughout the rest of the paper, if $\epsilon$ is not specified, we will take it to be a default value. In the literature, this value is almost always $0.15$.

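The transition probability matrix, $A$, of the random walk is

$$A \triangleq (1-\epsilon)M+\epsilon R, \tag{1}$$

where $M$ and $R$ denote $n\times n$ matrices as follows:

$$\forall u,v\in V:\quad M[u,v]=\begin{cases}1/\deg_{\mathrm{OUT}}(u) & \text{if } (u,v)\in E\\ 0 & \text{otherwise}\end{cases} \qquad\qquad R=(r,\dots,r)^T,$$

that is, every row of $R$ is the reset vector $r$. For instance, for the case of a uniform reset vector, $R$ is the all-$1/n$ matrix.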
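Equation 1 can be written out directly for a small graph. The following is a sketch under assumptions: matrices are dense lists of lists, dangling nodes receive a self loop as in the text, and the stationary distribution is approximated by repeated multiplication; the function names `transition_matrix` and `stationary` and the toy graph are illustrative only.

```python
def transition_matrix(nodes, edges, r, eps=0.15):
    """Build A = (1 - eps) * M + eps * R for reset vector r (list aligned with nodes)."""
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    out = {v: set() for v in nodes}
    for u, v in edges:
        out[u].add(v)
    for v in nodes:
        if not out[v]:
            out[v].add(v)  # self loop for nodes without out-links
    M = [[0.0] * n for _ in range(n)]
    for u in nodes:
        for v in out[u]:
            M[idx[u]][idx[v]] = 1.0 / len(out[u])
    # Every row of R equals r, so entry (i, j) of R is r[j].
    return [[(1 - eps) * M[i][j] + eps * r[j] for j in range(n)] for i in range(n)]


def stationary(A, iters=200):
    """Approximate the row vector p with p = pA by repeated multiplication."""
    n = len(A)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * A[i][j] for i in range(n)) for j in range(n)]
    return p


# Toy graph (assumption): "c" links into the cycle a <-> b but nothing reaches "c".
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "a"), ("c", "a")]
A = transition_matrix(nodes, edges, r=[1.0, 0.0, 0.0])
p = stationary(A)
```

With all reset placed on `a`, node `c` is unreachable from the reset and ends with weight 0, while `a` and `b` share the whole distribution.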

The random walk is a Markov chain defined by the sequence of moves of a particle between the nodes of $G$, where the location of the particle at any given time is the state of the system, and the one-step transition matrix is $A$. This random walk on $G$ is well behaved, as summarized in the following.

###### Theorem 1.

The random walk on $G$, as defined above, has a unique stationary distribution for any graph $G$, reset vector $r$, and reset probability $\epsilon$.

###### Proof.

Consider the set of nodes with positive probability in $r$. Notice that these nodes belong to a strongly connected component of the random walk consisting of the nodes reachable from them. These nodes form a unique essential communicating class in the Markov chain of the random walk on $G$. By Proposition 1.26 in [31], such a Markov chain has a unique stationary distribution. ∎

This stationary distribution, $p$, has weight 0 on all nodes not reachable from a node with positive probability in $r$. All nodes reachable from a node with positive probability in $r$ have PageRank defined by the standard random walk algorithm. It is not necessary that all nodes in $G$ form a communicating class, as is often claimed [30, 10, 6, 34].

We call this stationary distribution the PageRank defined by a graph, a reset vector and a reset probability. Theorem 1 states that the PageRank is well defined for any graph, reset vector and reset probability. When it is clear from the context what $G$, $r$ and $\epsilon$ are, we simply write $p$ for this PageRank. We denote the set of all possible PageRanks for $G$ with reset probability $\epsilon$ as $P_\epsilon(G)$, and the set of all possible PageRanks for $G$ with any reset probability as $P(G)$.

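Let $p$ be the PageRank of the random walk on $G$ with reset vector $r$ and reset probability $\epsilon$. Then, $p$ is the unique vector that satisfies:

$$\epsilon r^T = p(I-(1-\epsilon)M). \tag{2}$$

To see why, notice that, being a stationary distribution, $p$ must satisfy

$$\begin{aligned}
pA &= p\\
p((1-\epsilon)M+\epsilon R) &= p\\
p(\epsilon R) &= p(I-(1-\epsilon)M)\\
\epsilon r^T &= p(I-(1-\epsilon)M).
\end{aligned}$$

Then, from Equation 2, we can obtain the following expression for the $i$-th component of a reset vector $r$.

$$r_i=\frac{p_i}{\epsilon}-\frac{1-\epsilon}{\epsilon}\sum_{j\in \mathrm{IN}(i)}\frac{p_j}{\deg_{\mathrm{OUT}}(j)}. \tag{3}$$

Equation 3 establishes a test to see if a vector $p$ belongs to $P(G)$.

###### Lemma 2.

Let $G$ be a graph and let $p$ be an $n$-dimensional probability vector. Then $p$ belongs to $P(G)$ if and only if there exists an $\epsilon$ such that the $r$ obtained by Equation 3 is also a probability vector.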
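The test suggested by Equation 3 can be sketched as follows for a fixed reset probability. The function names `reset_from_pagerank` and `is_pagerank`, the tolerance, and the two toy graphs are assumptions made for illustration; the graph is again a dictionary of out-links in which every node has at least one out-link.

```python
def reset_from_pagerank(out_links, p, eps):
    """Equation 3: r_i = p_i/eps - ((1-eps)/eps) * sum_{j in IN(i)} p_j / degOUT(j)."""
    inflow = {v: 0.0 for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            inflow[v] += p[u] / len(targets)
    return {v: p[v] / eps - (1 - eps) / eps * inflow[v] for v in out_links}


def is_pagerank(out_links, p, eps, tol=1e-9):
    """True if the recovered reset vector is a probability vector (up to tol)."""
    r = reset_from_pagerank(out_links, p, eps)
    non_negative = all(x >= -tol for x in r.values())
    sums_to_one = abs(sum(r.values()) - 1.0) <= tol
    return non_negative and sums_to_one


# A 2-cycle with the uniform vector: the recovered reset is uniform too.
cycle = {"a": ["b"], "b": ["a"]}
ok = is_pagerank(cycle, {"a": 0.5, "b": 0.5}, eps=0.15)

# A path into a sink (with self loop): putting 0.9 on the source would
# require a negative reset on the sink, so this vector is rejected.
path = {"a": ["b"], "b": ["b"]}
bad = is_pagerank(path, {"a": 0.9, "b": 0.1}, eps=0.15)
```

Checking membership in $P(G)$ rather than $P_\epsilon(G)$ would additionally require searching over the reset probability, which this sketch does not attempt.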

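Next, we show how to compute a valid $\epsilon$ from a valid PageRank $p$ and reset vector $r$.

###### Lemma 3.

Let $G$ be a graph and let $p$ be a candidate PageRank vector and $r$ a reset vector for $G$. Let $V'=\{i\in V : r_i\neq (pM)_i\}$ and define for each $i\in V'$

$$\gamma_i\triangleq\frac{p_i-(pM)_i}{r_i-(pM)_i}.$$

Then, one of the following holds:

1. If $V'\neq\emptyset$, then there is a (unique) $\epsilon$ such that

$$(\forall i\in V':\gamma_i=\epsilon) \iff p\in P_\epsilon(G).$$
2. If $V'=\emptyset$, then for every $\epsilon$, $p\in P_\epsilon(G)$.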

###### Proof.

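The proof of the first item is as follows. If $p\in P_\epsilon(G)$, Equation 3 holds. Rearranging it, we get that for any $i\in V$

$$\epsilon r_i-\epsilon\sum_{j\in \mathrm{IN}(i)}\frac{p_j}{\deg_{\mathrm{OUT}}(j)} = p_i-\sum_{j\in \mathrm{IN}(i)}\frac{p_j}{\deg_{\mathrm{OUT}}(j)}.$$

Re-writing in terms of $M$,

$$\epsilon(r_i-(pM)_i) = p_i-(pM)_i.$$

Given that $r_i\neq(pM)_i$ for all $i\in V'$, we can write

$$\epsilon = \frac{p_i-(pM)_i}{r_i-(pM)_i},$$

yielding a unique value as claimed. The other direction of the implication is immediate by reversing the algebra above.

The proof of the second item is as follows. Recall that $p=pA$. Substituting for $A$ (see Equation 1), and setting $r^T=pM$, we get:

$$p = p((1-\epsilon)M+\epsilon R) = (1-\epsilon)pM+\epsilon pM = pM.$$

Thus, any $\epsilon$ suffices. ∎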

The second case happens when or .
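As a numeric illustration of the ratio $\gamma_i$ in Lemma 3, the sketch below computes a PageRank with a known reset probability and then recovers that probability from $p$ and $r$ alone. The helper names, the toy graph and the choice of 0.2 as the reset probability are assumptions for the example.

```python
def pagerank(out_links, reset, eps=0.15, iters=200):
    """Power iteration; assumes every node has an out-link and `reset` sums to 1."""
    p = dict(reset)
    for _ in range(iters):
        nxt = {v: eps * reset[v] for v in out_links}
        for u, targets in out_links.items():
            share = (1 - eps) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p


def inflow(out_links, p):
    """(pM)_i for every node i: the rank arriving over in-links, before damping."""
    pm = {v: 0.0 for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            pm[v] += p[u] / len(targets)
    return pm


def gammas(out_links, p, r, tol=1e-9):
    """gamma_i = (p_i - (pM)_i) / (r_i - (pM)_i) on nodes where the denominator is nonzero."""
    pm = inflow(out_links, p)
    return {
        v: (p[v] - pm[v]) / (r[v] - pm[v])
        for v in out_links
        if abs(r[v] - pm[v]) > tol
    }


out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = {"a": 0.5, "b": 0.5, "c": 0.0}
p = pagerank(out_links, r, eps=0.2)
g = gammas(out_links, p, r)  # every value should be close to 0.2
```

All ratios agree with the reset probability used to generate `p`, which is the uniqueness statement of the first case.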

From Equation 3, we can also compute the $i$-th component of a PageRank vector $p$ obtained from an arbitrary reset vector $r$ as follows.

$$p_i=\epsilon r_i+(1-\epsilon)\sum_{j\in \mathrm{IN}(i)}\frac{p_j}{\deg_{\mathrm{OUT}}(j)}. \tag{4}$$

Finally, it will sometimes be convenient in the following to consider scaled PageRanks, defined as a positive scalar times a PageRank. For any scaled PageRank, $p$, normalizing yields the (standard) PageRank $p/\|p\|$.

## 4 Closed Operators for PageRanks

In this section, we show that PageRank is closed under a class of functions. It is straightforward to show that the (linear) convex combination of any two PageRank vectors is, itself, a PageRank. Here we show that a more general class of operators is also closed for PageRanks.
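The convex-combination claim can be checked in one line from Equation 2. The following is a sketch, assuming both PageRanks $x$ and $y$ use the same reset probability $\epsilon$; the names $r_x$, $r_y$, $z$ and $\alpha$ are introduced here only for illustration.

```latex
\begin{align*}
\epsilon\, r_x^T &= x\,(I-(1-\epsilon)M), \qquad
\epsilon\, r_y^T = y\,(I-(1-\epsilon)M),\\
z &\triangleq \alpha x + (1-\alpha)\, y, \qquad \alpha\in[0,1],\\
z\,(I-(1-\epsilon)M)
  &= \alpha\,\epsilon\, r_x^T + (1-\alpha)\,\epsilon\, r_y^T
   = \epsilon\,\bigl(\alpha r_x + (1-\alpha)\, r_y\bigr)^T .
\end{align*}
```

Since $\alpha r_x+(1-\alpha)r_y$ is again a probability vector, $z$ satisfies Equation 2 for that reset vector and the same $\epsilon$, so it is a PageRank.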

Let $g$ denote an operator whose domain is $k$-tuples of non-negative reals and whose range is the non-negative reals. For example, $g$ might be ‘$+$’ or ‘$\min$’. We abuse the notation by extending $g$ to vectors, where the result is the component-wise application of $g$. If $g$ is ‘$+$’, then applying $g$ to vectors is simply vector addition.

For clarity in the analysis that follows, we define the following notation. We denote the second term in Equation 4 as

$$f_i(p) \triangleq (1-\epsilon)\sum_{j\in \mathrm{IN}(i)}\frac{p_j}{\deg_{\mathrm{OUT}}(j)}.$$

###### Definition 1.

A function $g$ is Semi-Dual-Commutative (SDC) if

$$\forall j\in V:\quad g\bigl(\epsilon r^{x^1}_j+f_j(x^1),\dots,\epsilon r^{x^k}_j+f_j(x^k)\bigr)\geq f_j\bigl(g(x^1,\dots,x^k)\bigr).$$

#### What kind of functions are semi-dual-commutative?

If $g$ is monotone-increasing and $g(f_j(x^1),\dots,f_j(x^k))\geq f_j(g(x^1,\dots,x^k))$, then $g$ is SDC, since for all $j\in V$:

$$g\bigl(\epsilon r^{x^1}_j+f_j(x^1),\dots,\epsilon r^{x^k}_j+f_j(x^k)\bigr) \geq g\bigl(f_j(x^1),\dots,f_j(x^k)\bigr) \geq f_j\bigl(g(x^1,\dots,x^k)\bigr),$$

where in the first inequality we used the monotonicity of $g$ and in the second we used the assumed inequality. Both ‘$\min$’ and ‘$+$’ are SDC, but ‘median’ is not. For instance, consider the following counterexample (the order of the addends is important; “med” means median).

$$25 = \mathrm{med}(1+2+7+8,\ 3+4+18+7,\ 2+3+19+1) \ngeq \mathrm{med}(1,3,2)+\mathrm{med}(2,4,3)+\mathrm{med}(7,18,19)+\mathrm{med}(8,7,1) = 30.$$

Moreover, this counterexample can be adjusted (up to normalization) so that a negative reset is needed in order to compensate for the flow adjustments of the ‘median’.
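The arithmetic of the counterexample is easy to check mechanically. The snippet below is a small verification sketch using the same twelve numbers; the variable names are illustrative, and the `min` comparison is added here only for contrast.

```python
from statistics import median

# Each row holds the addends of one argument; each column is one position.
rows = [(1, 2, 7, 8), (3, 4, 18, 7), (2, 3, 19, 1)]

# Median of the row sums versus the sum of the column-wise medians.
med_of_sums = median(sum(row) for row in rows)
sum_of_meds = sum(median(col) for col in zip(*rows))

# The same comparison for min, which does satisfy the inequality.
min_of_sums = min(sum(row) for row in rows)
sum_of_mins = sum(min(col) for col in zip(*rows))
```

Here the median of the sums (25) falls below the sum of the medians (30), whereas for min the order is the other way around (18 versus 11), in line with min being SDC and median not.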

We now show that SDC functions are closed over the class of PageRanks.

###### Theorem 2.

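Let $x^1,\dots,x^k\in P_\epsilon(G)$ be PageRanks, for some $\epsilon$, and let $g$ be Semi-Dual-Commutative. Then, if $g(x^1,\dots,x^k)/\|g(x^1,\dots,x^k)\|$ is defined, it belongs to $P_\epsilon(G)$.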

###### Proof.

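We obtain the components of each $x^i$ from Equation 4 as follows.

$$\forall j\in V:\quad x^i_j = \epsilon r^{x^i}_j+(1-\epsilon)\sum_{\ell\in \mathrm{IN}(j)}\frac{x^i_\ell}{\deg_{\mathrm{OUT}}(\ell)}.$$

Rearranging, we have that

$$\epsilon r^{x^i}_j = x^i_j-(1-\epsilon)\sum_{\ell\in \mathrm{IN}(j)}\frac{x^i_\ell}{\deg_{\mathrm{OUT}}(\ell)}.$$

Applying $g$ for PageRanks $x^1,\dots,x^k$ on the right-hand side of the latter gives us:

$$g(x^1_j,\dots,x^k_j)-(1-\epsilon)\sum_{\ell\in \mathrm{IN}(j)}\frac{g(x^1_\ell,\dots,x^k_\ell)}{\deg_{\mathrm{OUT}}(\ell)}.$$

Normalizing,

$$\frac{g(x^1_j,\dots,x^k_j)-(1-\epsilon)\sum_{\ell\in \mathrm{IN}(j)}\frac{g(x^1_\ell,\dots,x^k_\ell)}{\deg_{\mathrm{OUT}}(\ell)}}{\|g(x^1,\dots,x^k)\|}.$$

Replacing Equation 4 and the definition of $f_j$, we have

$$\frac{g\bigl(\epsilon r^{x^1}_j+f_j(x^1),\dots,\epsilon r^{x^k}_j+f_j(x^k)\bigr)-f_j\bigl(g(x^1,\dots,x^k)\bigr)}{\|g(x^1,\dots,x^k)\|}.$$

Given that $g$ is SDC, we observe that the latter is non-negative for each $j$. Furthermore, summing over all $j$ yields $\epsilon$. Thus, by Lemma 2, $g(x^1,\dots,x^k)/\|g(x^1,\dots,x^k)\|$ is a PageRank. ∎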
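The closure of PageRanks under normalized component-wise min can be checked numerically on a small example. This is a sketch under assumptions (toy graph, power iteration, reset probability 0.15); the helper names are hypothetical. It takes the min of two personalized PageRanks, normalizes it, recovers a reset vector with Equation 3, and confirms that this reset vector regenerates the same vector.

```python
def pagerank(out_links, reset, eps=0.15, iters=200):
    """Power iteration; assumes every node has an out-link and `reset` sums to 1."""
    p = dict(reset)
    for _ in range(iters):
        nxt = {v: eps * reset[v] for v in out_links}
        for u, targets in out_links.items():
            share = (1 - eps) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p


def reset_from_pagerank(out_links, p, eps):
    """Equation 3 applied component-wise to a candidate vector p."""
    inflow = {v: 0.0 for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            inflow[v] += p[u] / len(targets)
    return {v: p[v] / eps - (1 - eps) / eps * inflow[v] for v in out_links}


eps = 0.15
out_links = {"t1": ["a", "b"], "t2": ["b"], "a": ["t2", "c"], "b": ["c"], "c": ["t1", "t2"]}

# Two personalized PageRanks, centered on t1 and t2.
x1 = pagerank(out_links, {v: float(v == "t1") for v in out_links}, eps)
x2 = pagerank(out_links, {v: float(v == "t2") for v in out_links}, eps)

# Component-wise min, then rescale so the entries sum to 1.
g = {v: min(x1[v], x2[v]) for v in out_links}
norm = sum(g.values())
g_hat = {v: g[v] / norm for v in out_links}

# Recover the reset vector of the normalized min and recompute its PageRank.
r_hat = reset_from_pagerank(out_links, g_hat, eps)
p_check = pagerank(out_links, r_hat, eps)
```

The recovered `r_hat` is a non-negative vector summing to 1, and `p_check` matches `g_hat` up to numerical error.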

## 5 The Cost of Spamming Min-k-PPR

In this section, we describe our method for addressing link spam. We show that this method is spam resistant.

Recall that $T=\{t_1,\dots,t_k\}$ is a set of trusted sites. We define Min-$k$-PPR as

$$t^{(T)}=\min\bigl(p^{(t_1)},\dots,p^{(t_k)}\bigr),$$

where the implicit $\epsilon$ is fixed for all Personalized PageRanks. In certain circumstances we will want to consider the normalized version of $t^{(T)}$, but using the unnormalized version will make the algebra in this section more transparent.

Let $r^{(t_i)}$ be the reset vector of $p^{(t_i)}$, and let $r^{(T)}$ be the (scaled) reset vector derived by applying Equation 3 to $t^{(T)}$.

Our naive analysis in the introduction suggested that the only way to get PageRank from a Min-$k$-PPR system was to find (and acquire) nodes that are in the neighborhoods of all the trusted centers of the Personalized PageRanks. This analysis is incomplete. It is also possible for a property to acquire PageRank by engineering $r^{(T)}$, the reset vector. The main result of this section is that the total $r^{(T)}$ in the interior of a property is bounded by a function of the behavior of the $p^{(t_i)}$ on the boundary. Thus, in order to spam Min-$k$-PPR, it is necessary to spend money/effort on the boundary itself. We conclude by showing that Min-$k$-PPR is spam resistant by exhibiting an appropriate cost function for nodes.

We begin by building up some intuition about $r^{(T)}$, before proving our main theorem and exploring some consequences.

### 5.1 The Structure of $r^{(T)}$

Notice that $t^{(T)}$ induces a natural coloring of nodes in $V$, as follows. Suppose node $v$ derives its value in $t^{(T)}$ from $p^{(t_i)}$, that is, $t^{(T)}_v=p^{(t_i)}_v$. Then we set the color of $v$, denoted $\chi(v)$, to be the minimum such $i$ (in case of ties). We call a node $v$ homogeneous if $\chi(u)=\chi(v)$ for all $u\in\mathrm{IN}(v)$. We begin by showing that homogeneous nodes receive no reset.

###### Lemma 4.

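Node $v$ is homogeneous iff $r^{(T)}_v=0$, that is, $v$ receives no reset.

###### Proof.

Whether $v$ is homogeneous or not,

$$r^{(T)}_v=\frac{t^{(T)}_v}{\epsilon}-\frac{1-\epsilon}{\epsilon}\sum_{u\in \mathrm{IN}(v)}\frac{t^{(t_{\chi(u)})}_u}{\deg_{\mathrm{OUT}}(u)}.$$

If $v$ is homogeneous, then $\chi(u)=\chi(v)$ for all $u\in\mathrm{IN}(v)$. Thus,

$$r^{(T)}_v = \frac{t^{(t_{\chi(v)})}_v}{\epsilon}-\frac{1-\epsilon}{\epsilon}\sum_{u\in \mathrm{IN}(v)}\frac{t^{(t_{\chi(v)})}_u}{\deg_{\mathrm{OUT}}(u)} = r^{(t_{\chi(v)})}_v = 0.$$

Otherwise, there exists some $u$ in $\mathrm{IN}(v)$ such that $\chi(u)\neq\chi(v)$. Therefore,

$$r^{(T)}_v > \frac{t^{(t_{\chi(v)})}_v}{\epsilon}-\frac{1-\epsilon}{\epsilon}\sum_{u\in \mathrm{IN}(v)}\frac{t^{(t_{\chi(v)})}_u}{\deg_{\mathrm{OUT}}(u)} = r^{(t_{\chi(v)})}_v = 0.$$

∎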

In other words, in order to garner more reset, an owner needs to have a mixture of colors in its nodes, and, as we will see, at its boundary. We next show that this effort has limited efficacy.
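The coloring and Lemma 4 can be observed on a toy example. The sketch below assumes a six-node directed cycle with two trusted sites, a reset probability of 0.15, and colors given by the index of the minimizing personalized PageRank; all helper and variable names are hypothetical.

```python
def pagerank(out_links, reset, eps=0.15, iters=200):
    """Power iteration; assumes every node has an out-link and `reset` sums to 1."""
    p = dict(reset)
    for _ in range(iters):
        nxt = {v: eps * reset[v] for v in out_links}
        for u, targets in out_links.items():
            share = (1 - eps) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p


def reset_from_pagerank(out_links, p, eps):
    """Equation 3 applied to a (possibly scaled) vector p."""
    inflow = {v: 0.0 for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            inflow[v] += p[u] / len(targets)
    return {v: p[v] / eps - (1 - eps) / eps * inflow[v] for v in out_links}


eps = 0.15
# Toy graph (assumption): the cycle t1 -> a -> b -> t2 -> c -> d -> t1.
out_links = {"t1": ["a"], "a": ["b"], "b": ["t2"], "t2": ["c"], "c": ["d"], "d": ["t1"]}
trusted = ["t1", "t2"]

pprs = [pagerank(out_links, {v: float(v == t) for v in out_links}, eps) for t in trusted]
t_min = {v: min(p[v] for p in pprs) for v in out_links}  # unnormalized Min-k-PPR

# Color of v: smallest index whose personalized PageRank attains the min at v.
color = {v: min(range(len(pprs)), key=lambda i: pprs[i][v]) for v in out_links}

in_links = {v: [] for v in out_links}
for u, targets in out_links.items():
    for v in targets:
        in_links[v].append(u)

# Homogeneous nodes: every in-neighbor has the same color as the node itself.
homogeneous = {v for v in out_links if all(color[u] == color[v] for u in in_links[v])}
r_T = reset_from_pagerank(out_links, t_min, eps)
```

In this example the four non-trusted nodes are homogeneous and have reset 0 (up to rounding), while the two trusted nodes sit where the colors change and receive strictly positive reset.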

### 5.2 Combining Two PageRanks

Consider two (possibly scaled) PageRanks, which, for ease of reference, we call the yellow PageRank, $y$, and the blue PageRank, $b$, with (correspondingly scaled) reset vectors $r^{(y)}$ and $r^{(b)}$, respectively. Let the green PageRank be $g=\min(y,b)$, and let $r^{(g)}$ be $g$’s scaled reset vector.

In this section we bound the total reset $r^{(g)}$ over the interior of any property, as a function of $y$ and $b$ at the boundary of the property, in addition to the resets $r^{(y)}$ and $r^{(b)}$, themselves, over those nodes. The main theorem of this section serves as the technical workhorse for the results on spam resistance of Min-$k$-PPR, established in subsequent sections.

As with $\chi$ above, we say a node is yellow (resp. blue) if it achieves its minimum in the yellow (resp. blue) computation.

For any edge $(u,v)$, define

$$y(u,v)\triangleq\frac{(1-\epsilon)\,y_u}{\deg_{\mathrm{OUT}}(u)},$$

and similarly for $b(u,v)$. That is, in the flow interpretation of PageRank, $y(u,v)$ is the amount of yellow flow that $v$ receives from $u$. Define

$$\delta_v\triangleq y_v-b_v,$$

and similarly for edges.

Let $r^{(g)}_v$ be the reset of a node $v$ in this two-color Min-$k$-PPR. We can bound the reset as follows.

###### Lemma 5.

For any Min-$k$-PPR and any node $v$,

$$r^{(g)}_v\leq \frac{\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|-\Bigl|\sum_{u\in \mathrm{IN}(v)}\delta(u,v)\Bigr|}{2}+r^{(\chi(v))}_v.$$

###### Proof.

Let $v$ be a blue node. By using Min-$k$-PPR, the reset of $v$, that is $r^{(g)}_v$, increases (above $r^{(b)}_v$) by the difference between the yellow and blue values of $v$’s yellow parents. To see why, notice that the value of $v$ corresponds to the blue PageRank, and its value is the sum of two contributions: reset and fractions of parents’ values in the blue PageRank (cf. Equation 4). Therefore, if some of $v$’s parents are yellow instead of blue, they are contributing less to $v$’s value. But $v$’s value is fixed to the blue PageRank. Hence, it has to gain on reset as stated.

Notice that

$$\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|=\sum_{\substack{u\in \mathrm{IN}(v)\\ \delta(u,v)>0}}|\delta(u,v)|+\sum_{\substack{u\in \mathrm{IN}(v)\\ \delta(u,v)<0}}|\delta(u,v)|,$$

and

$$\Bigl|\sum_{u\in \mathrm{IN}(v)}\delta(u,v)\Bigr|=\sum_{\substack{u\in \mathrm{IN}(v)\\ \delta(u,v)>0}}|\delta(u,v)|-\sum_{\substack{u\in \mathrm{IN}(v)\\ \delta(u,v)<0}}|\delta(u,v)|.$$

Subtracting, we get

$$2\sum_{\substack{u\in \mathrm{IN}(v)\\ \delta(u,v)<0}}|\delta(u,v)|=\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|-\Bigl|\sum_{u\in \mathrm{IN}(v)}\delta(u,v)\Bigr|.$$

Thus,

$$r^{(g)}_v\leq\frac{\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|-\Bigl|\sum_{u\in \mathrm{IN}(v)}\delta(u,v)\Bigr|}{2}+r^{(b)}_v.$$

If $v$ is a yellow node, then the $\delta$ are negated, but the increase in the resets comes out the same, except that $r^{(b)}_v$ is replaced by $r^{(y)}_v$. ∎

Now that we have a bound on the reset at any node, we can sum over the nodes in the interior of a region and prove that the increase in the reset of the interior of a property depends only on Personalized PageRanks at the boundary. This will form the basis for showing that Min-$k$-PPR is spam resistant, since getting more reset requires buying a bigger boundary.

###### Theorem 3 (The Reset Theorem).

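For any two (possibly scaled) PageRanks, $y$ and $b$, over $G$, and for any property with boundary $B$ and interior $I$,

$$2\sum_{v\in I}r^{(g)}_v\leq\sum_{u\in B}|\delta_u|-\Bigl|\sum_{u\in B}\delta_u\Bigr|+\sum_{v\in I}r^{(\chi(v))}_v.$$
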
###### Proof.

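We begin by showing that

$$\begin{aligned}
2\sum_{v\in I}r^{(g)}_v\leq{}& \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|-\Bigl|\sum_{u\in B}\sum_{v\in I}\delta(u,v)\Bigr|\\
&-\Bigl(\sum_{v\in I}\sum_{w\notin I}|\delta(v,w)|-\Bigl|\sum_{v\in I}\sum_{w\notin I}\delta(v,w)\Bigr|\Bigr)\\
&-\Bigl(\sum_{v\in I}|\epsilon\delta_v|-\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr|\Bigr)+\sum_{v\in I}r^{(\chi(v))}_v,
\end{aligned} \tag{5}$$

as follows. Notice that

$$\Bigl|\sum_{u\in \mathrm{IN}(v)}\delta(u,v)\Bigr| = \Bigl|\sum_{u\in \mathrm{OUT}(v)}\delta(v,u)\Bigr|+|\epsilon\delta_v| = \sum_{u\in \mathrm{OUT}(v)}|\delta(v,u)|+|\epsilon\delta_v|.$$

Then, replacing in Lemma 5, we know that

$$\begin{aligned}
2r^{(g)}_v\leq{}& \sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|-\sum_{u\in \mathrm{OUT}(v)}|\delta(v,u)|-|\epsilon\delta_v|+r^{(\chi(v))}_v\\
2\sum_{v\in I}r^{(g)}_v\leq{}& \sum_{v\in I}\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)|-\sum_{v\in I}\sum_{u\in \mathrm{OUT}(v)}|\delta(v,u)|-\sum_{v\in I}|\epsilon\delta_v|+\sum_{v\in I}r^{(\chi(v))}_v.
\end{aligned}$$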

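From the definition of $I$, we know that $v\in I$ only has in-links from nodes in the same property, but some of them may be in the boundary and some in the interior. Thus, $\mathrm{IN}(v)\subseteq B\cup I$. On the other hand, $v$ may have out-links to nodes in the interior as well as any other nodes. Hence, it is $\mathrm{OUT}(v)\subseteq I\cup(V\setminus I)$. Therefore, we know that

$$\begin{aligned}
\sum_{v\in I}\sum_{u\in \mathrm{IN}(v)}|\delta(u,v)| &= \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|+\sum_{u\in I}\sum_{v\in I}|\delta(u,v)|\\
\sum_{v\in I}\sum_{u\in \mathrm{OUT}(v)}|\delta(v,u)| &= \sum_{v\in I}\sum_{u\notin I}|\delta(v,u)|+\sum_{v\in I}\sum_{u\in I}|\delta(u,v)|.
\end{aligned}$$

Substituting, we get

$$\begin{aligned}
2\sum_{v\in I}r^{(g)}_v\leq{}& \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|+\sum_{u\in I}\sum_{v\in I}|\delta(u,v)|\\
&-\Bigl(\sum_{v\in I}\sum_{u\notin I}|\delta(v,u)|+\sum_{v\in I}\sum_{u\in I}|\delta(u,v)|\Bigr)-\sum_{v\in I}|\epsilon\delta_v|+\sum_{v\in I}r^{(\chi(v))}_v\\
2\sum_{v\in I}r^{(g)}_v\leq{}& \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|-\sum_{v\in I}\sum_{u\notin I}|\delta(v,u)|-\sum_{v\in I}|\epsilon\delta_v|+\sum_{v\in I}r^{(\chi(v))}_v.
\end{aligned} \tag{6}$$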

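By symmetry, we have that

$$\Bigl|\sum_{v\in I}\sum_{u\notin I}\delta(v,u)\Bigr|+\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr| = \Bigl|\sum_{u\notin I}\sum_{v\in I}\delta(u,v)\Bigr|,$$

and given that nodes in the interior only have links from the same property, it is

$$\Bigl|\sum_{v\in I}\sum_{u\notin I}\delta(v,u)\Bigr|+\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr| = \Bigl|\sum_{u\in B}\sum_{v\in I}\delta(u,v)\Bigr|,$$

so, it is

$$0 = \Bigl|\sum_{u\in B}\sum_{v\in I}\delta(u,v)\Bigr|-\Bigl|\sum_{v\in I}\sum_{u\notin I}\delta(v,u)\Bigr|-\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr|,$$

and we can introduce the latter in Equation 6, getting

$$\begin{aligned}
2\sum_{v\in I}r^{(g)}_v\leq{}& \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|-\sum_{v\in I}\sum_{u\notin I}|\delta(v,u)|\\
&-\Bigl(\Bigl|\sum_{u\in B}\sum_{v\in I}\delta(u,v)\Bigr|-\Bigl|\sum_{v\in I}\sum_{u\notin I}\delta(v,u)\Bigr|\Bigr)\\
&-\Bigl(\sum_{v\in I}|\epsilon\delta_v|-\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr|\Bigr)+\sum_{v\in I}r^{(\chi(v))}_v,
\end{aligned}$$

and, rearranging, Equation 5 follows.

Finally, notice that

$$0 \leq\sum_{v\in I}\sum_{w\notin I}|\delta(v,w)|-\Bigl|\sum_{v\in I}\sum_{w\notin I}\delta(v,w)\Bigr|, \qquad 0 \leq\sum_{v\in I}|\epsilon\delta_v|-\Bigl|\sum_{v\in I}\epsilon\delta_v\Bigr|,$$

and so, from Equation 5, it is

$$2\sum_{v\in I}r^{(g)}_v\leq \sum_{u\in B}\sum_{v\in I}|\delta(u,v)|-\Bigl|\sum_{u\in B}\sum_{v\in I}\delta(u,v)\Bigr|+\sum_{v\in I}r^{(\chi(v))}_v.$$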

The theorem follows directly by noting that . ∎

### 5.3 Implications of the Reset Theorem

Theorem 3 serves as the base case for a bound on a Min--PPR for properties that do not contain a trusted site in the interior. We begin with some notation. Let , and let . Then, we get the following.

###### Theorem 4.

For the Min-$k$-PPR on any permutation of the PageRank trusted sites in $T$, and for any property with boundary $B$ and interior $I$, such that $I\cap T=\emptyset$, the following holds.

$$2\sum_{v\in I}r^{(g)}_v \leq\sum_{i\leq k}\sum_{u\in B}\bigl|p^{(t_i)}_u-t^{(T_i)}_u\bigr| -\sum_{i\leq k}\Bigl|\sum_{u\in B}\bigl(p^{(t_i)}_u-t^{(T_i)}_u\bigr)\Bigr|$$

###### Proof.

Consider $p^{(t_1)}$ and $p^{(t_2)}$. Given that $I\cap T=\emptyset$, and that $p^{(t_1)}$ and $p^{(t_2)}$ are both Personalized PageRanks, the reset on $I$ is $0$ on both. Then, we can apply Theorem 3. Subsequently, consider adding the $i$th Personalized PageRank to the min of the first $i-1$ (which is a scaled PageRank, by the closure of min). The second term has reset computed inductively, and the first has zero reset. So, once again by Theorem 3, the inductive step holds, thus establishing the theorem. ∎

###### Corollary 1.

If every node in the boundary of a property has the same color, then the interior receives no reset.

###### Proof.

This follows from Theorem 4 by noticing that, if all nodes on the boundary have color , then for , , so all the absolute values in Theorem 4 are of non-negative values, and thus the terms cancel, leaving a reset of 0. ∎

###### Lemma 6.

In Theorem 4, we can ignore any color that does not occur on the boundary.

###### Proof.

Consider a color that does not occur on the boundary. Put it last in the permutation order in Theorem 4. Then , for every . Thus the two summations for this color's contribution to the reset cancel, and this color adds no reset. We can extend this to any number of colors by placing them at the end of the permutation. ∎

### 5.4 Spam Resistance

Here we show that Min-k-PPR is spam resistant by assigning a cost to each node, so that the PageRank in the interior of a property is bounded by the cost of the boundary.

We conclude with the following.

###### Theorem 5.

If is a constant, then Min-k-PPR is -spam resistant.

###### Proof.

Fix the cost of a node to be at least its PageRank, plus the amount by which it contributes to the reset in the interior of a property.

We can derive the amount that any unit of added PageRank contributes to the interior as it flows from node to node by noticing that only a $1-\epsilon$ fraction gets propagated at each step. Then the PageRank multiplier is

$$\sum_{j\ge 1}(1-\epsilon)^j = \frac{1-\epsilon}{\epsilon}.$$

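Then, for any property with boundary $B$ and interior $I$, by Theorem 4, we know that

$$\sum_{v\in I} t^{(T)}_v \le \frac{1-\epsilon}{\epsilon}\left(\sum_{v\in B} t^{(T)}_v + \sum_{i\le k}\sum_{u\in B}\left|p^{(t_i)}_u - t^{(T_i)}_u\right|\right). \tag{7}$$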
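The factor $(1-\epsilon)/\epsilon$ in Equation 7 can be sanity-checked numerically. The sketch below assumes that a $1-\epsilon$ fraction of PageRank is propagated at each step, so that a unit of PageRank on the boundary contributes $(1-\epsilon)^j$ after $j$ steps. The function name `propagation_multiplier` and the value $\epsilon = 0.15$ are hypothetical choices for illustration.

```python
def propagation_multiplier(epsilon, steps):
    """Partial sum of (1 - epsilon)^j for j = 1..steps.

    Assumes a (1 - epsilon) fraction of PageRank survives each step.
    """
    return sum((1 - epsilon) ** j for j in range(1, steps + 1))


epsilon = 0.15  # hypothetical reset probability
closed_form = (1 - epsilon) / epsilon
approx = propagation_multiplier(epsilon, 500)

# The truncated series agrees with the closed form up to rounding error.
print(abs(approx - closed_form) < 1e-9)  # True
```

For constant $\epsilon$ the multiplier is a constant, so it affects the spam-resistance bound only by a constant factor.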

We further upper bound Equation 7 as follows. Consider the following function on any node $u$.

$$h(u) = t^{(T)}_u + \sum_{i\in[1,k]}\left|p^{(t_i)}_u - t^{(T_i)}_u\right|.$$

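Given that

$$t^{(T)}_u + \sum_{i\in[1,k]}\left|p^{(t_i)}_u - t^{(T_i)}_u\right| \le t^{(T)}_u + \sum_{i\in[1,k]}\left(p^{(t_i)}_u + t^{(T_i)}_u\right),$$

and that each of the $2k+1$ vectors on the right-hand side has total mass at most 1, we know that $\sum_u h(u) \le 2k+1$. So, we can define a function $C$ as the normalized version of $h$, where the factor of normalization is at most $2k+1$. Replacing in Equation 7, we get

$$\sum_{v\in I} t^{(T)}_v \le \frac{1-\epsilon}{\epsilon}(2k+1)\sum_{v\in B} C(v),$$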

which establishes the theorem. ∎
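For readers who want to experiment, here is a minimal sketch of computing a normalized Min-k-PPR on a toy graph: run one personalized PageRank per trusted site, take the component-wise min, and rescale to total mass 1. It assumes that $\epsilon$ is the reset probability, that all reset mass goes to the trusted site, and that every node has at least one out-link. The graph, the function names, and the parameter values are hypothetical and are not taken from the paper.

```python
def personalized_pagerank(out_links, source, epsilon=0.15, iterations=200):
    """Power iteration for a PageRank whose reset goes entirely to `source`.

    Assumes every node has at least one out-link, so total mass stays 1.
    """
    nodes = list(out_links)
    rank = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        nxt[source] = epsilon  # reset mass (total rank is 1)
        for v in nodes:
            # Split the surviving (1 - epsilon) fraction over out-links.
            share = (1 - epsilon) * rank[v] / len(out_links[v])
            for w in out_links[v]:
                nxt[w] += share
        rank = nxt
    return rank


def min_k_ppr(out_links, trusted, epsilon=0.15):
    """Normalized component-wise min of the PPRs centered on `trusted`."""
    pprs = [personalized_pagerank(out_links, t, epsilon) for t in trusted]
    raw = {v: min(p[v] for p in pprs) for v in out_links}
    total = sum(raw.values())
    return {v: raw[v] / total for v in raw}


# Hypothetical toy graph: node "d" links out but nobody links to it.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = min_k_ppr(graph, trusted=["a", "b"])
for node in sorted(scores):
    print(node, round(scores[node], 4))
```

In this toy graph, node `d` receives score 0 because no trusted site reaches it. This is only a small-scale intuition for why a page must be reachable from every trusted site to earn a positive Min-k-PPR score.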

## References

• [2] Google press center: Fun facts. Accessed on 10/28/2017.
• [3] Eneko Agirre and Aitor Soroa. Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 33–41. Association for Computational Linguistics, 2009.
• [4] Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Vahab Mirrokni, and Shang-Hua Teng. Local computation of pagerank contributions. Internet Mathematics, 5(1-2):23–45, 2008.
• [5] Yasuhito Asano, Yu Tezuka, and Takao Nishizeki. Improvements of hits algorithms for spam links. In Advances in Data and Web Management, pages 479–490. Springer, 2007.
• [6] Konstantin Avrachenkov and Nelly Litvak. The effect of new links on google pagerank. Stochastic Models, 22(2):319–331, 2006.
• [7] Konstantin Avrachenkov, Nelly Litvak, Danil Nemirovsky, and Natalia Osipova. Monte carlo methods in pagerank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis, 45(2):890–904, 2007.
• [8] Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment, 4(3):173–184, 2010.
• [9] Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment, 4(3):173–184, 2010.
• [10] Pavel Berkhin. A survey on pagerank computing. Internet Mathematics, 2(1):73–120, 2005.
• [11] Rajat Bhattacharjee and Ashish Goel. Incentive based ranking mechanisms. In First Workshop on the Economics of Networked Systems (Netecon?06), pages 62–68, 2006.
• [12] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.
• [13] Alice Cheng and Eric Friedman. Manipulability of pagerank under sybil strategies, 2006.
• [14] Jack McKay Fletcher and Thomas Wennekers. From structure to activity: Using centrality measures to predict neuronal activity. International journal of neural systems, 28(02):1750013, 2016.
• [15] Dániel Fogaras, Balázs Rácz, Károly Csalogány, and Tamás Sarlós. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics, 2(3):333–358, 2005.
• [16] David F. Gleich. Pagerank beyond the web. 2014.
• [17] Pankaj Gupta, Ashish Goel, Jimmy J. Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. WTF: the who to follow service at twitter. In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, pages 505–514, 2013.
• [18] Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.

Link spam detection based on mass estimation.

In Proceedings of the 32nd international conference on Very large data bases, pages 439–450. VLDB Endowment, 2006.
• [19] Zoltan Gyongyi and Hector Garcia-Molina. Web spam taxonomy. Technical Report 2004-25, Stanford InfoLab, March 2004.
• [20] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 576–587. VLDB Endowment, 2004.
• [21] Taher Haveliwala and Sepandar Kamvar. The second eigenvalue of the google matrix. Technical report, Stanford, 2003.
• [22] John Hopcroft and Daniel Sheldon. Manipulation-resistant reputations using hitting time. Internet Mathematics, 5(1-2):71–90, 2008.
• [23] Gábor Iván and Vince Grolmusz. When the web meets the cell: using personalized pagerank for analyzing protein interaction networks. Bioinformatics, 27(3):405–407, 2010.
• [24] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hungary, May 20-24, 2003, pages 271–279, 2003.
• [25] Zhaoyan Jin, Dianxi Shi, Quanyuan Wu, Huining Yan, and Hua Fan. Lbsnrank: personalized pagerank on location-based social networks. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 980–987. ACM, 2012.
• [26] Jon M Kleinberg. Hubs, authorities, and communities. ACM computing surveys (CSUR), 31(4es):5, 1999.
• [27] Christian Kohlschütter, Paul-Alexandru Chirita, and Wolfgang Nejdl. Efficient parallel computation of pagerank. In European Conference on Information Retrieval, pages 241–252. Springer, 2006.
• [28] Vijay Krishnan and Rashmi Raj. Web spam detection with anti-trust rank. In AIRWeb, volume 6, pages 37–40, 2006.
• [29] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Core algorithms in the clever system. ACM Transactions on Internet Technology (TOIT), 6(2):131–152, 2006.
• [30] Amy N Langville and Carl D Meyer. Deeper inside pagerank. Internet Mathematics, 1(3):335–380, 2004.
• [31] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.
• [32] Brandon K Liu, David C Parkes, and Sven Seuken. Personalized hitting time for informative trust mechanisms despite sybils. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1124–1132. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
• [33] Peter Lofgren, Siddhartha Banerjee, and Ashish Goel. Personalized pagerank estimation and search: A bidirectional approach. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, pages 163–172, New York, NY, USA, 2016. ACM.
• [34] J Prystowsky and Levi Gill. Calculating web page authority using the pagerank algorithm, 2005.
• [35] Atish Das Sarma, Anisur Rahaman Molla, Gopal Pandurangan, and Eli Upfal. Fast distributed pagerank computation. Theoretical Computer Science, 561:113–121, 2015.
• [36]