With the advent of the era of social media, graphs have become an increasingly important tool for studying user behavior in such fields as social networks, academic citation, online retails, web analysis, etc. Various graph algorithms have been proposed to attack the problems people encounter during the study of graphs. For example, in order to find the web pages that are of uttermost importance and relevance to a set of keywords, people designed the PageRank algorithm to rank the tens of millions of web pages that are available on linepage1999pagerank . Nodes on a connected graph may form tightly bound communities, which can be detected using an algorithm that aims to maximize the modularity of each potential communityPhysRevE.70.066111 ; newman2006modularity ; PhysRevE.74.036104 . We can equally well study the division of a connected graph into weakly linked communities via the Laplacian matrix, which is derived from a graph’s adjacency matrixfortunato2010community . It is also of significant interest to find a metric that can measure the distance between any two nodes on a connected graph, and the node2vec methods is there to our resortgrover2016node2vec . All of these algorithms depend on a direct or indirect invocation of a graph’s adjacency matrix, an invocation that is understandable due to the fact a graph can be uniquely identified once we have access to its adjacency matrix.
Often times, we also need to explore the influence of a node on another node, or on a specific set of nodes in a graph. PageRank algorithm is a convenient method that can deliver how influential and important a single node is to the graph as a whole, whereas sometimes we also need to know how influential a node is to another specific node, a situation that could arise when we want to know how susceptible a community of nodes is to the presence of a fraudulent user in a social network. For example, we can create a connected graph whose nodes represent the users of an online retailer like Suning.com. If we already know that one user in the graph is a fraudulent user, then we want to know how likely other users that cooccur with the fraudulent user is also fraudulent. Intuitively, this likelihood depends on the distance between these two users. However, there is no universal definition of this distance, and the definition may vary for each specific case. Nowadays, there are already plenty of algorithms that can measure the distance or similarity between each pair of nodes in a graph, such as the node2vec algorithmgrover2016node2vec , and the algorithm for calculating the betweenness of two nodesbaesens2015fraud . These algorithm are essentially considering a random or non-random walk on a graph, and by analyzing the nodes and length of a path connecting the starting node and the target node, we can gain some insight into the distance or similarity between the starting node and the target. Several methods that rely on the concept of random walk on graph have been proposed. One such method starts from a node, say , and selects a node, say , as its target, and performs several random walks starting from and counts how many times this random walker encounters node within a specified number of steps. This encountering frequency for the random walk provides a measure of the distance between nodes and . The larger the frequency, the shorter the distance between and . This method of of measuring distances between two nodes, although valid in some sense, has several drawbacks, the most prominent of which is its strong dependence upon such capricious parameters as the maximum number of nodes each random walk can possibly traverse, and the number of random walks to be performed for the frequency to be statistically stable and meaningful. Another weak point of this method is that it is much too time-consuming to perform enough number of random walks to gain a statistically significant result for two nodes that are located afar in a huge graph. The application of this method to a tiny graph is no less troublesome due to the fact that a random walk starting from node is guaranteed to reach any other node in a connected graph as long as the random walking process lasts long enough, rendering all the nodes have identical distance to a target node. As a result, the differences between these nodes are utterly erased. Considering the time-expensiveness of performing a multitude of random walks on a large graph and the strong dependence of the final results on the hard-to-choose parameters, we prefer to find an alternative method that can deliver the exact solution to the random walk, thus avoiding this lengthy and tedious process of Monte Carlo simulation from the beginning. People already explored this direction, and mathematical results indicate that performing random walks on a graph is equivalent to solving a discrete Laplacian equationdoyle2000random . However, solution of Laplacian equations requires the pre-existence of boundary conditions, which are not always availablezhu2003semi . For example, in order to quantify the influence of a fraudulent user on the other users that cooccur in the social network, we also need the existence of at least one user that is explicitly labeled as “non-fraudulent” or “innocuous”, a label that is not always available.
In order to avoid these conundrums, here we propose a new algorithm that can measure the distance between any two nodes in a graph by giving an exact solution to a random walk problem on undirected graphs. The advantage of this algorithm is that the final result is uniquely obtained by solving a sparse linear system, thus releasing us of the unnecessarily thorny duty of selecting a set of appropriate parameters, and saving us tens of thousands of CPU hours from performing Monte Carlo simulations thanks to the linear algebra packages that are readily available for performing sparse matrix multiplications quickly and conveniently.
The organization of the paper is as follows. In section II, we will show that we can measure the distance between any two nodes and located in a graph by performing a random walk starting from , and see when this random walker hits the target node for the first time, a time that is called hitting time in a standard textbook on stochastic processesross1996stochastic
. The hitting time can be estimated using Monte Carlo simulation which is pretty time-consuming, as we already lamented, or can be solved exactly using mathematical techniques that we will elaborate in exquisite details in this paper. After describing the algorithm, we next apply our analytical and numerical methods to small and huge graphs respectively. In sectionIII.1, we will exemplify our method by giving an analytical solution to a random walk problem on a simple small graph. We will also cross validate our analytical results by performing Monte Carlo simulations on the same graph, showing that they both yield identical results within machine precision tolerance. For a large graph, analytical solutions only involving mathematical formulae are no longer feasible, and we will resort to highly efficient numerical methods. Section III.2 is devoted to a numerical solution to the equations derived in section II for huge graphs. We will also perform Monte Carlo simulations on the large graph as benchmarks to demonstrate the rapidity of our method in delivering solutions of higher precision.
Ii Exact solution to hitting times of random walks on graph
In this section, we will present an analytical method for finding hitting times of a random walk on an undirected and connected graph . The connectedness of the graph does not constitute a major restriction to our method due to the availability of efficient algorithms for finding connected components of an undirected graph. The adjacency matrix of this graph is , which is an matrix, with matrix elements if there is an edge between node and node , otherwise . The matrix dimension
is the number of nodes in the graph, and the number of non-zero matrix elements ofgives us the edge number. is symmetrical if graph
is undirected, or else it is generally non-symmetric. Since we are trying to find the probability of reaching a target node from any other node in the graph, whereas in a directed graph, a node may not be reachable from another node, and that is why we only consider undirected and connected graphs. Thus, the adjacency matrix is always symmetric in our case. Furthermore, our method also applies to weighted graphs, for whichif the edge connecting node to has weight , and if there is no edge connecting between nodes and .
A random walk on a graph is defined as follows:
1. Start from a certain node, and call it our current node, which is denoted as ;
2. Select a node at random from the neighbors of current node, where node is chosen with probability proportional to the weight of the edge connecting node to node ;
3. Designate node as our updated current node ;
4. Repeat steps
until we have reached a target node or util the number of steps reaches a pre-specified limit. For this random walk process, we define its hitting time as the number of steps needed for the random walker to reach a target node for the first time. According to this definition, the hitting time is a random variable that depends on the graph structure, the starting node, and the target node. Here in this section, we will derive a formula for calculating the probability distribution for hitting times. The derivation of this formula requires the knowledge of graph adjacency matrix, the starting node , and the target node . Denote the probability of hitting target after exactly steps starting from node as
where is a random variable that counts the number of steps used before the random walker hits node for the first time. Assume that node has neighboring nodes, of which at most one is the target node . We enumerate these nodes using indices . Then the probability can be recursively represented as
In the above equation, is the total weight associated with node , represents the probability for the random walker to make a transition from node to one of its neighbors , and is the probability of reaching target node from node after exactly steps. Since we are calculating the probability of reaching target for the first time from node after exactly steps, the probability of reaching target starting from the target itself is zero for any non-zero number of steps, i.e., . Thus, we demand that the random walker in the above recursive equation should not make a transition to the target node even if is one of the neighbors of node , which justifies our notation in the summation subscript. If we scan all possible starting vertices , we can obtain a simultaneous system of difference equations for the hitting probabilities , where is the set of all vertices in graph . Moreover, if a vertex has only one single neighbor, and this very neighbor is just our target, then we can directly write out the probability of reaching target after exactly number of steps starting node as , with being the Kronecker function. Since we already know the probability distribution of hitting times for such queer nodes which we call adherents to target , we can ignore those nodes when establishing the simultaneous system of difference equations for .
A node in graph is called an adherent to target node if and only if this node has the target node as its only neighbor.
With these notations, we can write out a system of simultaneous difference equations for the probabilities of hitting target node for the first time with exactly steps after starting from different nodes as
where we have imposed the restriction that the starting node should not be equal to , and should not be an adherent to target . The matrix is called probability transition matrix, the elements of which are for , and whenever . Most of the time, is a sparse matrix. Note that although in an undirected graph the adjacency matrix is always symmetric, the probability transition matrix is generally non-symmetric. Moreover, due to the exclusion of the target node in the definition of probability transition matrix, the sum of matrix elements for each row in is not necessarily equal to 1. In fact, for a connected undirected graph, there is at least one row of whose sum is less than 1. The rule is that if the target node is not a neighbor of node ; otherwise . Since we have excluded the target node and all its adherent nodes from the set of starting nodes, the matrix has a dimension that is smaller than that of matrix . For an undirected graph, the matrix is guaranteed to be square due to the fact that a random walker starting from a node that is not target cannot possibly reach an adherent node to the target. The fact that matrixis also smaller than 1. We will take advantage of this fact later in this section.
Eq.  can be re-written in compact matrix form as
is a column vector, the elements of which are the hitting probabilities. An iterative solution to Eq.  is
The initial probability vector can be conveniently obtained by the observation that if the target node is not a neighbor of node , and that if the target node is a neighbor of node . Now that we have known matrix and the initial probability vector , we can calculate all the hitting probabilities for any valid starting node.
Although we can calculate all the hitting probabilities, most of the time, we are more interested in the observable quantities associated with these probability distributions. We can calculate the moments of the probability distribution by invoking their definitions, which are
The expectation and variance of the first hitting time starting from any nodecan be easily calculated from the first and seconds moments of the hitting probability distribution. At first sight, it seems that we need to know all the hitting probabilities before we can calculate their moments. However, we can exploit the fact the spectral radius of matrix is less than 1 and directly calculate the moments vector from the recursive Eq. [4 as
For a discrete series like , the probability density function is
where is the Dirac function with the property that for any continuous function , . The characteristic function of is
If we further define , then the characteristic function can be more compactly rewritten as
We can read off the expectation value and variance of hitting probabilities from the first and second derivatives of as
If we consider as the th component of , and as the th component of which is a component-wise square of vector , then the above three relations can be simplified into the form
Here, we have defined a vector function as
Since the coefficients of are the hitting probabilities , we call it the generating function of hitting probabilities. Plug Eq.  into the above definition, and we get
The second line of the above equation stems from the fact that and that the spectral radius of matrix is smaller than 1.
We have already known how to calculate the probability transition matrix and the initial probability vector , we can in principle calculate exactly the generating function from Eq. . However, calculating the inverse of matrix is no easy task, especially when the graph is huge. Moreover, the fact that the sparsity of matrix
which we should take full advantage of can get lost after matrix inversion compels us to shun the idea of directly inverting matrixto calculate hitting probability moments. Therefore, we have devised a trick for finding probability moments without resorting to matrix inversion. For this purpose, we rewrite Eq.  as
Performing first order derivative of both sides with respect to and setting yields
By definition, , and gives us the first moment of , which is equal to
The second line of the above equation is due to the identity that , which can be easily verified by plugging into Eq. . We can avoid inverting sparse matrices, an operation that will destroy the sparsity of a matrix, by noting that Eq.  can be rewritten as (remember that the spectral radius of is smaller than 1)
It is noteworthy that the first moments of hitting probabilities starting from each valid vertex are independent of the initial probability vector , and depend only on the probability transition matrix . Taking the second order derivative of Eq.  yields
We have implemented a Python program for the numerical calculation of mean and variance of hitting times from the knowledge of transition matrix alone. The Python code is shown below.
Higher order derivatives of Eq.  yield higher order moments. Now we have already developed the algorithm for calculating the moments of hitting probabilities using both analytical and numerical methods, next we will illustrate the effectiveness of this algorithm using both small and huge graphs.
The significance of the first order moment lies in that it is a measure of the distance from a starting node to a target node. If we already know that the target node is a fraudulent user in the social network, then we can infer that the nodes with average distance smaller than a threshold value could be considered to be potential fraudulent users. From intuition, we make a claim that the smaller the distance between any two nodes, the more similar they are to each other.
iii.1 An analytical calculation of hitting time distribution on a simple graph
In this section, we will show how to calculate the hitting time distribution on a small graph using analytical methods. This graph contains five nodes, which are denoted as , as shown in Fig. .
The Adjacency matrix of this graph is
Our target node is 3, and we want to calculate the probability of hitting the target starting from each vertex for the first time after exactly steps. Since node 4 is an adherent to target 3, we can directly write out its hitting probability as
For nodes 0, 1, 2, we can define the probabilities of starting from each node and ending at node 3 after exactly steps. We encapsulate these probabilities into a column vector as
The probability transition matrix is
The probability vector satisfies this equation
with initial condition
The moment generating function foris
We can easily get the first order moments by differentiating the above equation with respect to at , the results of which are
The above result means that the average distance from nodes 0 and 1 to node 3 are equal, both being 9, and the average distance from node 2 to node 3 is 7. We interpret these results as demonstrating that nodes 0 and 1 have equal distance to node 3, whereas node 2 is nearer to node 3 than both 1 and 2. Node 4, being an adherent to target node 3, always has an average distance 1 to the target. These results are consistent with our intuition and signify to us that node 4 is most susceptible to the influence of node 3, node 2 is second most susceptible, and nodes 0 and 1 are least susceptible to its influence.
In order to make a further test of the analytical results, we also perform a Monte Carlo simulation for the random walk on this graph. In the Monte Carlo program, we use node 3 as our target node, and start each random walk from node 0, 1, and 2. Each random walk terminates at node 3 after some number of steps. By repeating this process thousands of times, we obtain the average number of steps required before the random walker finally reaches the target. For each node in the set , we perform random walks, each of which yields a step number, and then we calculate the mean value of these numbers. The Monte Carlo simulation results we get are pretty similar to the analytical results, which are shown together in Table . We can see that the Monte Carlo simulation results are consistent with our analytical results, with relative errors being approximately . We do not expect Monte Carlo simulation to give us high precision numerical results, and the final results of Monte Carlo simulations may vary slightly for different random number generators.
Using the algorithm outlined in the previous section, we can equally well compute the average step number starting from each node 0, 1, 2 using numerical methods. We can use either Eq.  or Eq.  for this purpose, because the probability transition matrix is small enough for the direct inversion of matrix to be feasible. However, Eq.  is no longer practical when the graph is large, and thus even for this small graph, we still prefer to use Eq. , where we need to calculate the sum of an infinite power series. Due to the quick convergence of this series, we artificially impose a cutoff condition that the summation should terminate if the norm of the summand vector is smaller than a pre-specified error limit , which we choose to be here. We run the Python program listed in the previous section on macOS Mojave, and obtain numerical results that are shown together with analytical results and Mont Carlo simulation results in Table . It is clear that the numerical results obtained using our algorithm have a much higher precision than that of the Monte Carlo simulation results.
|Exact results||Monte Carlo simulation||Numerical results|
iii.2 Numerical computation of hitting time distribution on large graphs
When dealing with large graphs, it is both tedious and impractical to get an analytical formula as Eq. . Instead, we will resort to Eq.  to find numerical values of hitting times. Another method to find the hitting times is to use Monte Carlo simulation, although we will see that for a large graph, the running time of Monte Carlo simulation is much longer than the numerical method, thus making the Monte Carlo simulation an inferior alternative compared to Eq. . In this section, we apply the numerical method and Monte Carlo simulation method to a connected graph as shown in Fig. .
We will calculate the hitting times from each vertex in the graph to the target node which is chosen to be node 1. In order to visualize the results, we sort the vertices in the graph according to their hitting times to the target node. In Fig. , we plot the results from Monte Carlo simulation method and numerical method. We can see from the figure that these two methods give almost the same results, although we know that results from Monte Carlo simulation have a precision that is much lower that obtained from numerical method. Another weak point of Monte Carlo simulation is that it is much more time-consuming than the numerical method. In order to obtain the results shown in Fig. , we need to run random walks from each vertex in the graph except the target node, and the whole process takes about 157 seconds, whereas in the numerical method, we only need to compute the power series in Eq.  up to 6180 terms, and it takes only 3.06 seconds to obtain results with machine precision. Actually, the larger the graph, the more time-saving the numerical method is compared to the Monte Carlo simulation method.
Another feature that is worth mentioning in Fig.  is that we can distinguish the two communities in the original graph by looking at the distribution of hitting times with respect to vertices. It is clear there is a transition region that connects two plateaus in the hitting time vs. sorted vertices curve. The two plateaus correspond to the two communities in the graph. We can interpret the emergence of these two plateaus by noting that a random walker that starts from within one of the two communities tend to get trapped in a community. Once the random walker gets trapped in a community, the hitting times for each vertex in the community would not change substantially from vertex to vertex, which gives rise to a plateau in the curve. However, as soon as the random walker finds a bridge leading from one community to another, it will make a rapid transition across the two communities, and the hitting times will also experience a significant change. Thus, the calculation of hitting times provides us a tool for community detection as a by-product. However, we should make a caveat that this method of community detection is usable only when the number of communities in the graph is small enough and the communities are clearly separated from each other. Or else, this method of community detection is not as good as the ones compiled in Ref. fortunato2010community .
In this paper, we have derived an analytical formula for calculating the hitting time from a starting vertex to a target vertex in a connected undirected graph. This method relies on the probability transition matrix that can be calculated conveniently from the graph’s adjacency matrix. We also propose a quick method for implementing this formula using Python code, without the need to invert a possibly huge sparse matrix. Since hitting time is a core concept in random walk which can directly be simulated using Monte Carlo method, we can obtain an approximate value of the hitting time using Monte Carlo simulation. We tested the validity of our formula by applying the analytical formula and Monte Carlo simulations to undirected connected graphs, and show that these two methods can give similar results within tolerance of error. The advantage of the analytical method over Monte Carlo simulation is that the former is much quicker and more accurate than the latter. Our calculation of the hitting times for vertices in a graph can also give us a glimpse of the community structures of the graph on which we perform random walks, although this method for detecting communities is not as good as other existing algorithms when the communities in graph are numerous and not clearly separated.
- (1) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
- (2) Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Phys. Rev. E, 70:066111, Dec 2004.
- (3) Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.
M. E. J. Newman.
Finding community structure in networks using the eigenvectors of matrices.Phys. Rev. E, 74:036104, Sep 2006.
- (5) Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
- (6) Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke.
Fraud analytics using descriptive, predictive, and social network techniques: a guide to data science for fraud detection. John Wiley & Sons, 2015.
- (8) Peter G Doyle and J Laurie Snell. Random walks and electric networks. arXiv preprint math/0001057, 2000.
Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty.
Semi-supervised learning using gaussian fields and harmonic
Proceedings of the 20th International conference on Machine learning (ICML-03), pages 912–919, 2003.
- (10) Sheldon M Ross, John J Kelly, Roger J Sullivan, William James Perry, Donald Mercer, Ruth M Davis, Thomas Dell Washburn, Earl V Sager, Joseph B Boyce, and Vincent L Bristow. Stochastic processes, volume 2. Wiley New York, 1996.