Graphs naturally represent data arising in many real-world scenarios; important examples include social networks, medical records, protein networks and the web. Graph embeddings that meaningfully encode relevant information about the graph structure have enabled outstanding performance on downstream tasks such as node classification [neville2000iterative], link prediction [liben2007link, akoglu2015graph], and node clustering [nie2017unsupervised, aviles2019labelled]. This has motivated the development of several techniques that generate embeddings by processing the graph in different ways. One set of techniques is based on matrix factorisation, e.g. [balasubramanian2002isomap, anderson1985eigenvalues, roweis2000nonlinear, ou2016asymmetric, pang2017flexible], in which the connections between nodes (graph properties) are represented in matrix form to obtain the embeddings. Other bodies of research have explored generative models [le2014probabilistic, alharbi2016learning, xiao2017ssp] and hybrid techniques that combine several approaches [wei2017cross, guo2015semantically, mousavi2017hierarchical], to name a few.
Of particular interest for this paper are random-walk-based techniques, e.g. [perozzi2014deepwalk, grover2016node2vec, yang2016revisiting, cao2016dngr], which have become popular due to their good performance. The core idea is to create a context
for each vertex by generating sequences of vertices using random walks on the graph, and then to generate vertex embeddings from these contexts. Motivated by the success of neural language models, starting with the seminal DeepWalk [perozzi2014deepwalk] and followed by others including node2vec [grover2016node2vec] and DNGR [cao2016dngr], a variety of algorithms view individual random walks as sentences whose words are the individual vertices; a collection of random walks then corresponds to a text corpus. A corpus of random walks derived from a graph in a first stage can then be analysed in a second stage using language models such as SkipGram [mikolov2013skipgram], which creates word (vertex) embeddings that allow one to predict a word’s context from its embedding. This process is illustrated in Figure 1 using one of our experiments, where the graph input and the latent vertex representation are displayed.

Contributions. Despite their widespread usage, little effort has been spent on developing a firm theoretical understanding of these algorithms. In particular, the study of the convergence of random-walk-based algorithms is relatively new and incomplete. While several works slightly vary the graph sampling procedure in the first stage, e.g. [grover2016node2vec] and [perozzi2017don], in this paper we mainly consider algorithms that share the sampling strategy of DeepWalk [perozzi2014deepwalk], both because of this strategy’s practical advantages (it requires, for example, fewer hyperparameters than node2vec [grover2016node2vec]) and because it continues to be the backbone of many new techniques and applications, e.g. [tu2016max, qiu2018unifying, chen2019deepwalk, chen2020prediction]. Our contributions are:
In Section 4.1, we formalise the random-walk sampling strategy used by DeepWalk [perozzi2014deepwalk] in its first stage and derive the asymptotic limit of the corpora generated. We show that the convergence is almost sure both in the single limit, where the number of random walks goes to infinity, and in the double limit, where the length of each random walk goes to infinity as well. (Theorem 4.1, Corollary 4.1)
We apply the results of this analysis to derive a heuristic that gives optimal values for the hyperparameters (the number and the length of the random walks) for a fixed computational cost. (Section 4.2)
We validate our theoretical results through a set of experiments on medium-sized and large real-world graphs with thousands to millions of vertices and edges. (Section 5)
2 Related Work
This paper is concerned with random-walk-based techniques for vertex embeddings, which have demonstrated outstanding performance. We first briefly review existing techniques of this kind.

Random-Walk Based Vertex Embeddings. The seminal work of Perozzi et al. [perozzi2014deepwalk] introduced the DeepWalk technique, which learns embeddings in two stages. In the first stage, a given graph is sampled using random walks. In the second stage, the SkipGram model is applied to generate vertex embeddings that allow one to predict the vertices that appear within a window around a given vertex in the random walks. The promising performance of DeepWalk motivated the development of subsequent techniques that follow its principles. Grover et al. [grover2016node2vec] introduce additional parameters to control the random walks, allowing the user to interpolate between breadth-first and depth-first graph searches. Another modification of the first stage, called Walklets, was introduced in [perozzi2017don], where the authors skip over steps in each random walk. The HARP model introduced in [chen2018harp] improves the initialisation of the SkipGram stage. The slightly improved performance of these models compared to DeepWalk, however, comes at an increased computational cost. Another set of techniques modifies DeepWalk by introducing different sampling techniques for the random walks (e.g. [yang2016revisiting, pimentel2017unsupervised]) or by enforcing the preservation of some graph properties (e.g. [yang2015multi, pan2016tri, li2016discriminative]). Other variants of this family of techniques are the works of [tang2015line, cao2015grarep, ou2016asymmetric, cao2016dngr]. There also exist techniques that aim to create similar network embeddings without (explicitly) sampling any random walks, for example [chen2019fastrp]. While DeepWalk and its variants have been widely explored in the literature from a practical point of view, the analysis of their theoretical properties is still relatively new, and only a few works have addressed it. In the following, we discuss these works and note the differences with the present work.

Theoretical Analyses of DeepWalk & Comparison to Our Work. The first theoretical analysis of random-walk-based vertex embedding techniques is due to Qiu et al. [qiu2018unifying], who demonstrated a relationship between DeepWalk and a particular matrix factorisation problem. In the notation of Section 3, that work identifies the limiting behaviour as the length of the walk tends to infinity, but does not prove a quantitative rate, and keeps the number of walks fixed rather than identifying the convergence as the number of walks tends to infinity, or the convergence in the joint limit, as done here.
The behaviour of DeepWalk in the limit of many shorter (instead of fewer long) random walks, as investigated in our work, arguably has greater practical relevance, as it allows for much easier parallelisation of the algorithm. Additionally, the proof in [qiu2018unifying] only applies to undirected, non-bipartite graphs, whereas our Theorem 4.1 and Corollaries 4.1 and 12 apply to directed and undirected graphs, both periodic and aperiodic. More recently, [qiu2020matrix] proved a concentration bound in terms of the walk length for the convergence of a corpus of random walks generated from an aperiodic graph. In contrast, we derive concentration bounds in terms of the number of walks for arbitrary walk length (Theorem 4.2), and in terms of both the number and the length of the walks for the joint limit (Theorem 4.2), for both periodic and aperiodic graphs. Furthermore, our bounds do not depend on the size of the graph, a factor that, while of constant order, can be very significant in real-world graphs with vertices and edges potentially numbering in the millions (cf. Table I). In this paper, we also consider the assumptions necessary for convergence of the co-occurrence matrix to imply convergence of the extracted vertex representations (Theorem 37), a question that [qiu2020matrix] does not address. Most recently, [zhang2021consistency] considers random graphs, in either the sparse or the dense regime, in the asymptotic regime where the size of the graph and the number of sampled paths on a fixed time interval both tend to infinity. While that work also investigates the theoretical properties of random-walk-based graph embedding algorithms, its setting differs from the one considered in this paper and, in fact, from that of [qiu2020matrix]. Moreover, and in contrast to the works above, we illustrate the practical relevance of our analysis in Section 4.2 by combining our results to derive a rule for the optimal choice of the number and the length of the random walks. Qiu et al. 
[qiu2020matrix] also derive a heuristic for choosing the walk length, but they implicitly assume a single random walk, which, as our experiments in Section 5 show, yields worse task performance than our heuristic.
3 Formal Description of the DeepWalk Algorithm
The DeepWalk algorithm consists of two phases: in the first phase it generates a corpus of vertex pairs, and in the second phase it executes the SkipGram algorithm (with negative sampling) on this corpus to generate vertex embeddings.
3.1 Generation of Corpus
Informally, in the first phase the corpus is populated with all pairs of vertices that are within a given window size of each other in random walks generated on the graph. For each vertex, the vertices at most removed in the random walk are viewed as that vertex’s context vertices, and each pair is added to the corpus. We take a graph , equipped with a weight function . Consider a random walk on , with a given starting distribution and transition matrix with elements , where is the weight of edge . We write for the joint law of the first steps and, if is a one-hot distribution, write instead. The stationary distribution of this random walk is denoted as , which is given explicitly by . We define a corpus on graph to be a tuple consisting of the set of pairs and the multiplicity function . Its cardinality is the total multiplicity . Since is finite, we may identify
with the vector of its values, and write, and we denote its elements as for all , so for all . This matrix is also called the co-occurrence matrix, as it encodes the number of co-occurrences of two vertices (the number of times they appear within the window size of each other) in the random walks performed. Note that once we have specified a graph, different corpora differ only in their multiplicity functions. In the first phase of DeepWalk, an initially empty corpus is populated by repeatedly updating the multiplicity function. Formally, the algorithm that populates the corpus is given in Algorithm 1; it takes as parameters the number of walks, the walk length and the window size. In the second loop in line 5, we follow [qiu2018unifying] in having take on different values to avoid edge effects (when ). This simplifies the derivation in Section 4.1 while only removing constant-order terms (depending only on the fixed length ) from the equations.
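As an illustration of the first phase, the following Python sketch populates a corpus from random walks. It is a minimal sketch under stated assumptions rather than a transcription of Algorithm 1: the graph is assumed to be given as a dictionary of weighted adjacency dictionaries, the names `sample_walk`, `build_corpus`, `num_walks`, `walk_length` and `window` are hypothetical, and the inner loop only visits positions whose full forward window lies inside the walk, which is one possible reading of the edge-effect convention described above.

```python
import random
from collections import Counter


def sample_walk(adj, start, length, rng):
    """Sample a weighted random walk with `length` vertices, starting at `start`.

    Assumes every vertex has at least one outgoing edge.
    """
    walk = [start]
    for _ in range(length - 1):
        neighbours, weights = zip(*adj[walk[-1]].items())
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def build_corpus(adj, num_walks, walk_length, window, start_dist, seed=0):
    """Return the multiplicity function of the corpus as a Counter over vertex pairs."""
    rng = random.Random(seed)
    vertices, probs = zip(*start_dist.items())
    corpus = Counter()
    for _ in range(num_walks):
        start = rng.choices(vertices, weights=probs, k=1)[0]
        walk = sample_walk(adj, start, walk_length, rng)
        # Assumption: only positions whose forward window fits inside the walk.
        for j in range(walk_length - window):
            for r in range(1, window + 1):
                # Each co-occurrence is recorded in both orders.
                corpus[(walk[j], walk[j + r])] += 1
                corpus[(walk[j + r], walk[j])] += 1
    return corpus
```

With this convention every walk contributes the same number of pairs, namely twice the window size times the number of admissible positions, so the cardinality of the corpus is deterministic even though its content is random.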
3.2 Generation of Vertex Representation
Having obtained a corpus in the first stage, the second phase calculates the vertex representations from this corpus. Traditionally, this is done using either the SkipGram algorithm [mikolov2013skipgram] or SkipGram with Negative Sampling [mikolov2013negativesampling], though in principle other methods can be used as well. The vertex representations are obtained by maximising an objective function of the form
with respect to the matrix . Here, is the matrix of vertex embeddings whose columns are the individual , is the dimension of each embedding, and is some . In the following, we will use to refer to the partial application of an objective function where the corpus has been fixed. This partial application of yields a function that calculates the objective value for a certain vertex embedding; different corpora now correspond to different for the same objective function . The second stage obtains the vertex embeddings by fixing the corpus generated in the first stage from Algorithm 1 in the objective function to create a partial application which then is optimised with respect to the vertex embeddings by an optimisation procedure . To keep our discussion as general as possible, we view the optimisation procedure as a deterministic function which maps an objective function (with the corpus fixed) onto the space of vertex embeddings (the remaining argument of ); stochastic components can be incorporated by fixing the random seed of a pseudo-random number generator.
4 Theoretical Analysis
We can now give our main theoretical results. We first prove convergence of the vertex co-occurrences as and in the joint limit in creftype 4.1. We then quantify the rate of these convergences and show how they lead to a heuristic for choosing the hyperparameters in creftype 4.2. We finally show in creftype 4.3 that, under relatively weak assumptions, convergence of the vertex co-occurrences implies the convergence of the learned vertex representation.
4.1 Convergence of Vertex Co-Occurrences
In this section, we derive the limit of the vertex co-occurrences (the frequencies with which a given vertex pair appears in the corpus) as first the number of random walks and then also their length tend to infinity. [Occurrence Vertex Co-Occurrences] Let be a corpus generated by Algorithm 1. Retaining the notation from that algorithm, the expected relative frequency of occurrence of the pair in the corpus is
A corpus generated from random walks is the union of independent sub-corpora , which are each generated from just a single random walk. Thus the expected value of is the expected value of (the relative occurrence frequency in sub-corpus ), averaged over all . Since the sub-corpora are independent, we have
In the remainder of the proof we can therefore focus on a corpus generated by just a single random walk. We again denote with the th vertex in this random walk generating . The corpus contains all the pairs such that for either or (cf. Algorithm 1).
is the probability with which we would draw the pair from the corpus . Since consists of the pairs from just a single (realisation of the) random walk ,
The number of pairs in the corpus with is the same as the number of pairs with . For a fixed start-vertex , fixed , and fixed , the probability of drawing the vertex pair from the random walk therefore is
The distributions of and are uniform (the number of pairs with some , in is the same for all ), so to evaluate the expectation values with respect to and we take the arithmetic mean. From Algorithm 1 we see that the distribution of is , so the first part of the theorem follows. For the second part, note that
is exactly the mean of the independent and identically distributed random variables, which take values in
. The law of large numbers therefore applies to show the almost sure convergence claimed. ∎
[Effect of Longer Random Walks] Using the ergodic theorem (cf. Theorem 1.10.2 in [norris1997markovchainsbook]), we can evaluate first the sum over , and then the expectation value with respect to , as .
Hence, as ,
This almost sure convergence is stronger than the convergence in probability established in [qiu2018unifying]; the difference arises because that work does not let the number of random walks tend to infinity. [Effect of Starting Distribution] From the definition of the stationary distribution we see that if ,
In that case, (11) holds as an equality for arbitrary walk length, and (12) holds for arbitrary fixed walk length as the number of random walks tends to infinity.

The results from this section, together with Theorem 37, guarantee that if the DeepWalk algorithm performs enough random walks that are long enough, the learned vertex representations converge in the sense of Theorem 37.
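The statements of this section can be checked numerically on a small graph. The NumPy sketch below computes the expected pair frequencies for a finite walk length and compares them with the stationary limit. It rests on assumptions that are not taken from the paper: the function names are hypothetical, the limit is written in the form given in [qiu2018unifying] (the stationary probability of the first vertex times the multi-step transition probability, symmetrised and averaged over the window), and only positions whose forward window fits inside the walk are counted.

```python
import numpy as np


def transition_matrix(weights):
    """Row-normalise a non-negative weight matrix into a transition matrix."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum(axis=1, keepdims=True)


def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to one."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return pi / pi.sum()


def expected_frequencies(P, start, walk_length, window):
    """Expected relative frequency of each ordered vertex pair in the corpus."""
    n = P.shape[0]
    powers = [np.linalg.matrix_power(P, r) for r in range(1, window + 1)]
    freq = np.zeros((n, n))
    law = np.asarray(start, dtype=float)  # law of the walk at the current position
    positions = walk_length - window      # assumption: forward window must fit
    for _ in range(positions):
        for Pr in powers:
            forward = law[:, None] * Pr   # P(position j is u, position j + r is v)
            freq += forward + forward.T   # pairs are recorded in both orders
        law = law @ P                     # advance the law by one step
    return freq / (2 * window * positions)


def limiting_frequencies(P, window):
    """Stationary limit: start from the stationary law and use a single position."""
    pi = stationary_distribution(P)
    return expected_frequencies(P, pi, window + 1, window)
```

Starting the walk from the stationary distribution makes `expected_frequencies` agree with `limiting_frequencies` for every walk length, which mirrors the remark on the effect of the starting distribution; for a one-hot start the two only agree approximately, with a gap that shrinks as the walk length grows.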
4.2 Convergence Rates
Let us first investigate the convergence rate as the number of random walks tends to infinity, for arbitrary walk length. [Convergence Rate as ] The convergence of Theorem 4.1 satisfies the concentration bound
Let again be the multiplicity function of the corpus generated from the th random walk. Each random walk has the same length, so . Since , Hoeffding’s inequality gives
Dividing by and using then yields
Re-scaling then yields the theorem. ∎
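To give a feeling for the kind of guarantee that Hoeffding's inequality provides, the sketch below evaluates the textbook two-sided bound for the mean of independent random variables taking values in the unit interval. This is an illustration only: the function names are hypothetical, and the constants are those of the generic inequality for a single matrix entry, not necessarily those of the theorem above.

```python
import math


def hoeffding_bound(num_walks, eps):
    """Upper bound on P(|sample mean - expectation| >= eps) for values in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * num_walks * eps ** 2))


def walks_needed(eps, delta):
    """Smallest number of walks for which the generic bound drops to delta or below."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Example: deviation of at most 0.05 with failure probability of at most 0.05.
n = walks_needed(eps=0.05, delta=0.05)
print(n, hoeffding_bound(n, 0.05))
```

The required number of walks grows only logarithmically in the inverse failure probability but quadratically in the inverse tolerance, which is the usual trade-off for bounds of this type.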
To derive a concentration bound for the joint limit, we first need to prove two lemmas that quantify the convergence rate of as for undirected and directed graphs. For notational simplicity, define
as the value that converges to (creftypecap 4.1). [Convergence Rate for Undirected Graphs] Given an undirected graph, let be the out-degrees and . Let
be the ordered eigenvalues of the symmetric normalised graph Laplacian
Then we have the convergence estimate
where for non-bipartite graphs and for bipartite graphs and even .
For a general undirected graph, we get by substituting for and from creftype 4.1 for
by Jensen’s inequality. For non-bipartite graphs, we again apply Jensen’s inequality and use Theorem 5.1 in [lovasz1993randomwalkssurvey] to obtain
For even (so that contains an even number of terms) for bipartite graphs, we replace with , where the difference between the two stems from the fact that , so in the derivation of Theorem 5.1 in [lovasz1993randomwalkssurvey] successive terms involving cancel. The Lemma then follows.∎
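The spectral estimate cited from [lovasz1993randomwalkssurvey] can be verified numerically for a concrete undirected graph. The sketch below is written under stated assumptions: it uses the symmetrically normalised adjacency matrix (whose eigenvalues are one minus those of the normalised Laplacian), takes the form of Theorem 5.1 in which the deviation of the multi-step transition probability from the stationary probability is bounded by the square root of the degree ratio times a power of the second-largest eigenvalue modulus, and uses the hypothetical name `spectral_bound_holds`.

```python
import numpy as np


def spectral_bound_holds(A, max_steps=30):
    """Check |P^t(i, j) - pi(j)| <= sqrt(d_j / d_i) * lam**t for t = 1..max_steps."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                          # (weighted) degrees
    P = A / d[:, None]                         # random-walk transition matrix
    pi = d / d.sum()                           # stationary distribution
    N = A / np.sqrt(np.outer(d, d))            # D^{-1/2} A D^{-1/2}
    eig = np.linalg.eigvalsh(N)                # ascending order, largest equals 1
    lam = max(abs(eig[0]), abs(eig[-2]))       # second-largest modulus
    ratio = np.sqrt(d[None, :] / d[:, None])   # entry (i, j) is sqrt(d_j / d_i)
    Pt = np.eye(len(d))
    for t in range(1, max_steps + 1):
        Pt = Pt @ P
        # Small tolerance guards against floating-point round-off.
        if np.any(np.abs(Pt - pi[None, :]) > ratio * lam ** t + 1e-12):
            return False, lam
    return True, lam
```

For a connected non-bipartite graph the returned modulus is strictly below one, so the bound decays geometrically; for a bipartite graph it equals one and the bound is uninformative, which is why the lemma treats that case separately.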
[Convergence Rate for Directed Graphs] For a directed graph with period ,
Furthermore, if is aperiodic (), with constants and as in Theorem 4.9 in [levin2017markovmixing],
To prove this for both aperiodic and periodic graphs, we need the following intermediate result. Let be the transition matrix of a Markov chain with period and stationary distribution . Let be the disjoint subsets of the state space such that . If vertices for some , then there exist (explicitly computable) constants such that
A similar observation is well-known for Markov chains in general (Theorem 4.9, [levin2017markovmixing]); in the present context, it suffices to observe that
is an aperiodic stochastic matrix on the block, for which is an invariant probability measure, and the lemma follows from the cited result. ∎
With this proposition in hand, we are now in a position to prove the Lemma.
Substituting for and from creftypecap 4.1 for and using Jensen’s inequality as in the proof for creftypecap 4.2 again leads to (18). We prove the first part of the theorem by showing that for a general periodic graph . Let again be disjoint subsets of the state space as in creftypecap 4.2. Without loss of generality, take and with , and let so that . We decompose the path from to in steps, , into a part from to an and a part from to :
since if then , and if then , and since . We can use creftypecap 4.2 to obtain upper and lower bounds for
for some constants and . If , we use the upper bound on , and if we use the lower bound to obtain:
Using the fact that to evaluate the prefactor of , we continue
since the remaining geometric sum is bounded by a constant. To prove the second part of the theorem, note that for aperiodic graphs, and . For aperiodic graphs, and still have the same meaning as before. Therefore, in (29) the prefactor of vanishes for aperiodic graphs. The upper bound on therefore evaluates to
Substituting into (18), the theorem follows. ∎
Finally we state and prove our theorem for the convergence rate in the joint limit : [Convergence Rate as ] The convergence of creftype 4.1 satisfies the concentration bound
We have by the triangle inequality
Substituting and using creftypecap 4.2 then yields the result. ∎
We can use Theorem 4.2 to obtain a rough heuristic for an "optimal" choice of the number and the length of the random walks given a fixed computational cost; the window size determines the size of the local structures which the vertex embeddings encode and is therefore usually fixed for a given application. The total computational cost of computing the co-occurrence matrix scales linearly with the number of random-walk steps , but since we define the excess computational cost . For notational simplicity, we also denote the upper bound for the failure probability in Theorem 4.2 as and absorb all terms in that are independent of into a constant . Then we have: Assuming that so that and in , the value for that minimises the error given fixed is
This heuristic minimises the error given an excess computational cost and an upper bound for the failure probability , both of which can be freely chosen. We see that under the heuristic, , which justifies the assumption . Even without this assumption, we still have an upper bound, as the neglected terms can only make smaller. In practice, one could use as a first approximation to the remaining constant , which depends on the graph structure through the spectrum or through . If a more precise estimate for is needed, we remark that this proof (and the results recalled in the Appendix) produces a constructible , in that one can follow the steps of the proofs to find it explicitly for a given example. In the case where we use the lemma for undirected graphs, we remark that many techniques are known ([diaconis1991geometric], and references therein) for bounding spectral gaps, which would lead to a constructible upper bound for .
4.3 Convergence of Vertex Representations
We finally show that convergence of the corpora, suitably normalised, is indeed sufficient to show convergence of the vertex representations. We first remark that, if two corpora differ only by a constant multiple so that
for some , then the respective partial applications of the objective function also differ only up to a constant multiple: , and it follows that the derived vertex embeddings
also agree. To remove this redundancy, we define normalised objective functions
on some space of partial applications of objective functions, with respect to the essential supremum distance , where is the optimisation procedure as before. Informally, creftypecap 4.3 means that if the partial application of the objective function, , changes slightly, the optimisation procedure will find vertex embeddings that have a similar objective value. [Convergence of Vertex Representations] Let be a sequence of corpora generated by creftypecap 1 from the same graph such that the respective vertex pair occurrence frequencies converge to the vertex pair occurrence frequencies in some corpus generated from that graph. Let be the partial application of an objective function given corpus , and the normalised objective function, and similarly define and ; . Then
The map is continuous with respect to the uniform distance on functions, as the are bounded. By continuity of with respect to , implies . By continuity of with respect to , implies . From (35) we know that and and the theorem follows. ∎
Theorem 37 means that if the vertex pair occurrence frequencies converge to those of some corpus , then the vertex representations converge to values that have the same (normalised) objective function value as the representations derived from corpus , and should therefore be similarly useful for downstream tasks.
5 Experimental Results
In this section, we detail the set of experiments that we conducted to illustrate our theoretical findings.
5.1 Datasets Description
We used five real-world datasets for our experiments. Each is a graph whose vertices are grouped into a number of classes. The datasets vary in their numbers of nodes and edges (up to the order of millions) and in their number of classes. The dataset statistics are given in Table I, along with a graphical visualisation (see Figure 2).
The BlogCatalog [tang2009relational] dataset is derived from a blog-sharing website. The edges reflect the bloggers (vertices) following each other. The classes in the dataset are the bloggers’ interests.
The Cora [sen2008collective] dataset is a text classification dataset for Machine Learning papers. Each vertex corresponds to a paper with the edges representing citation links. The classes are different areas in Machine Learning. The dataset also supplies bag-of-word representations for each vertex, though we did not use these in our experiments.
The Wiki [yang2015network] dataset, similarly to Cora, is a collection of text documents (vertices) with edges representing hyperlinks between them.
The Facebook Large [rozemberczki2019multi] dataset is a page-page graph of Facebook sites. The vertices refer to Facebook pages while the edges denote mutual likes between sites.
The YouTube [mislove2007measurement] dataset is based on the online video-sharing platform, with edges reflecting subscription connections between users (vertices).
5.2 Results & Discussion
We support our theoretical analysis with a three-part evaluation scheme.

(E1) Convergence of co-occurrence matrices. We begin by directly validating our results about the speed of convergence for the corpus generated by the first stage of DeepWalk (Theorems 4.2 and 4.2). To do this, we calculate the approximation error of the co-occurrence matrix as the Frobenius norm of the difference between and its asymptotic value , i.e. , for different values of the random walk length and the number of random walks . More precisely, for all experiments run in this part we set the window size and the embedding dimension . To observe the effect of increasing the number of walks, we set and fix (see blue lines in Figure 4). To observe the effect of increasing the length of the walk, we set and fix (see red lines in Figure 4). We ran these experiments using the BlogCatalog and Cora datasets. The results are displayed in Figure 4, where one can observe a convergence rate consistent with our results for the limiting cases when and .

(E2) Heuristic for optimal values for and . We show the practical application of our heuristic for optimal choices of the hyperparameters and for a fixed computational complexity and fixed failure probability (Section 4.2). For our experiments, we plot the task performance against the computational cost when choosing and using three different strategies. In the first strategy, we set and to simulate the extreme of the minimum random walk length (red line in Figure 4). In the second strategy, we set and , the case of performing a single very long walk, which corresponds to the hyperparameter choice suggested in [qiu2020matrix] (green line in Figure 4). Finally, we select and using our heuristic, approximating the constant and fixing (blue line in Figure 4). The task performance is the Micro-F1 score achieved by the one-vs-rest logistic regression suggested by [perozzi2014deepwalk] and also used in [qiu2018unifying, qiu2020matrix]. For all strategies we again set and . We ran these experiments using the BlogCatalog and the Facebook Large datasets for the values , so after necessary rounding the values used to test the heuristic were and . The results are plotted in Figure 4. Choosing and using the heuristic consistently outperforms the other two strategies, illustrating the impact that a better choice of the hyperparameters and can have. The maximally parallelisable choice of and performs considerably worse than the other two strategies, which is likely due to the uniform starting distribution being too different from the stationary distribution on these graphs, so that the vertex contexts generated from the short walks are not very informative.

(E3) Convergence of task performance. In practice, vertex embeddings are seldom the final goal of a machine learning pipeline, and our notion of convergence in Theorem 37 takes even two different sets of vertex embeddings to be equivalent if they result in the same (normalised) objective function value. We therefore use convergence of the task performance of a downstream algorithm as a proxy for the convergence of the vertex embeddings; if the downstream task performance has converged, then the vertex embeddings have very likely converged as well (in the sense of Theorem 37). We again use the standard protocol for network embedding evaluation ([perozzi2014deepwalk, qiu2018unifying, qiu2020matrix]) and compute the Micro-F1 score of a one-vs-rest logistic regression. We again set the window size . The results are plotted in Figure 5. For the experiments in Figure 5 (a) and (b) we use varying embedding dimensions of , while we keep the embedding dimension constant at for the other experiments in plots (c)-(f). When varying , we set , and when varying , we set .
One observes that the performance converges as the number of walks and the length of each walk grow large. This observation holds across all graph sizes and embedding dimensions, for both medium-sized graphs like the Cora dataset and large graphs like the BlogCatalog dataset, and for embedding dimensions ranging from to .
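The error measure used in (E1) can be reproduced in miniature. The following NumPy sketch simulates random walks on a small graph, normalises the resulting co-occurrence counts, and reports the Frobenius norm of the difference to the stationary limit. It is a toy illustration under assumptions that are not taken from the experiments: the function names are hypothetical, the walks start from a uniform distribution, the limit is written in the symmetrised stationary form of [qiu2018unifying], and the graph is tiny compared with the datasets above.

```python
import numpy as np


def empirical_frequencies(P, num_walks, walk_length, window, rng):
    """Simulate walks and return the normalised co-occurrence matrix."""
    n = P.shape[0]
    counts = np.zeros((n, n))
    for _ in range(num_walks):
        walk = [rng.integers(n)]                      # assumption: uniform start
        for _ in range(walk_length - 1):
            walk.append(rng.choice(n, p=P[walk[-1]]))
        for j in range(walk_length - window):         # forward window must fit
            for r in range(1, window + 1):
                counts[walk[j], walk[j + r]] += 1
                counts[walk[j + r], walk[j]] += 1
    return counts / counts.sum()


def limiting_frequencies(P, pi, window):
    """Symmetrised stationary pair frequencies, averaged over the window."""
    total = np.zeros_like(P)
    Pr = np.eye(P.shape[0])
    for _ in range(window):
        Pr = Pr @ P
        forward = pi[:, None] * Pr
        total += forward + forward.T
    return total / (2 * window)


def frobenius_error(P, pi, num_walks, walk_length, window, seed=0):
    """Frobenius norm of the gap between empirical and limiting frequencies."""
    rng = np.random.default_rng(seed)
    emp = empirical_frequencies(P, num_walks, walk_length, window, rng)
    return np.linalg.norm(emp - limiting_frequencies(P, pi, window), ord="fro")
```

Calling `frobenius_error` for increasing values of `num_walks` or `walk_length` gives a small-scale analogue of the convergence curves discussed in (E1).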
We presented a convergence analysis for random-walk-based vertex embeddings. Unlike existing theoretical works, we investigated the convergence of vertex co-occurrences in the generated corpus as the number of random walks tends to infinity for arbitrary random walk lengths, as well as in the simultaneous limit in which the walk length also tends to infinity, proving almost sure convergence and quantifying the convergence rates by deriving concentration bounds for the two limits. We also proved that convergence of vertex co-occurrences does indeed imply convergence of the vertex embeddings themselves, a fact that to our knowledge had not previously received formal theoretical treatment. Moreover, we provided a heuristic for choosing optimal values for the number and the length of the random walks. We supported our theory with a set of experiments on several real-world graphs and demonstrated the practical implications of our findings.

[Extended Mathematical Results] This appendix explicitly gives the mathematical results and some proofs cited, following the notation used in the main paper.

[Ergodic Theorem; Theorem 1.10.2 in [norris1997markovchainsbook]] Let be an irreducible transition matrix and let be any distribution on the state space . If is a Markov chain with transition matrix , starting distribution , and invariant distribution , then
where is the expected return time to state and is the number of visits to state before time . Moreover, if is finite then for any bounded function we have