1. Introduction
Graphs or networks are the natural representation of a collection of entities and the relationships between them. They are fundamental structures that have many examples in the real world, e.g., social networks, transport networks, financial transactions, and communication networks. Recently, several machine learning techniques have been proposed that use graph information to predict attributes of vertices, relationships, and the entire graph (Hamilton et al., 2017b).
An effective approach for incorporating graph information into machine learning models is representation learning. Representation learning seeks to learn low dimensional vector representations for the vertices (vertex embeddings). The goal of the representation learning is to find a mapping of vertices to a vector representation such that distances between these vector representations meaningfully relate to similarities in the local structure of the vertices (Zhang et al., 2018).
Recently, there has been significant research on representation learning and machine learning techniques that explicitly use graph information (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Dong et al., 2017; Hamilton et al., 2017a; Ying et al., 2018; Kipf and Welling, 2016). One common approach in many of the recent methods is unsupervised representation learning, which learns low dimensional vector representations for the vertices only based on the graph structure.
In order to capture the structure of the graphs efficiently, random walks have been shown to be scalable to large graphs (Perozzi et al., 2014). In addition, random walks have been shown to be able to trade off structural equivalence (vertices that have similar local structure have similar embeddings) and homophily (vertices that belong to the same communities have similar embeddings) (Grover and Leskovec, 2016). Random walks are combined with recent representation learning methods from language modelling to give high quality representations of vertices that can be used in downstream machine learning tasks such as vertex classification and edge prediction (Perozzi et al., 2014; Grover and Leskovec, 2016). In addition, random walk based methods have been extended to capture subgraph embeddings (Adhikari et al., [n. d.]), and vertex representations in heterogeneous graphs (Dong et al., 2017). Random walk based methods have been shown to have a fundamental link to matrix factorization (Qiu et al., 2018).
However, most of the previous research on unsupervised representation learning is based on static graphs, whereas most real-world graphs are dynamic, that is, they change over time. For example, in a social network, new users join the network (added vertices) and existing users add friendships (added edges). Static representation learning algorithms cannot measure the extent of the change in the graph. This presents a challenge for static and transductive representation learning algorithms such as DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016) applied to large dynamic graphs, in that it becomes impractical and inefficient to relearn vertex representations from scratch upon each change in the graph. Therefore, a method is desirable that can explicitly utilise information about changes to the graph, so that the vertex representations are learned incrementally as the graph changes and grows over time.
There has been recent research into how to use random-walk-based representation learning for dynamic graphs. Nguyen et al. (2018b) used dynamic random walks on a temporal graph, where each step of the walk is restricted to edges whose time is greater than that of the previous step. The representations learned using these dynamic random walks improved predictive performance in several downstream machine learning tasks. Following on from this work, De Winter et al. ([n. d.]) investigated the use of time-directed walks on dynamic graphs for edge prediction and showed that the past state of the graph can be used to predict future edges. They found that node2vec (Grover and Leskovec, 2016) applied to static snapshots of the graphs performs better in some cases than temporally directed random walks.
In contrast to the previous works, we focus on how unsupervised learning methods based on random walks can be modified when a graph changes over time. While previous techniques have used random-walk-based methods on temporal graph snapshots, they have not shown how to utilise what was learned in the previous snapshot to efficiently calculate the representations in the next snapshot. We break this problem into two parts. Firstly, we look at how a pre-generated set of random walks generated on one graph snapshot can be updated when the graph changes. We show that simplistic methods to perform this update give random walks that do not statistically represent the updated graph. We then propose a general random walk update algorithm that produces an updated set of walks that is statistically indistinguishable from a set of random walks generated from scratch on the new graph. Secondly, we investigate how to update vertex representations incrementally given the current set of random walks, by treating the updates as a fine-tuning step in the DeepWalk and node2vec algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016).
We demonstrate on multiple real-world datasets that our methods for updating the set of random walks and the resulting vertex representations give comparable predictive accuracy for downstream tasks to that obtained by relearning these representations at each time snapshot of the dynamic graph, while being much less expensive computationally. We discuss the trade-offs inherent in updating vertex representations, and how the computational cost of updating random walks and vertex representations depends on the number of edges that are added to the graph at each time step, and on the density of the graph.
Our contributions are as follows:

We propose an efficient algorithm that, given a graph structure change, produces an updated set of random walks statistically indistinguishable from walks generated from scratch on the updated graph. This update algorithm will be useful for any task requiring a set of random walks on a graph that is constantly changing.

We test this algorithm by updating the skip-gram model of (Perozzi et al., 2014) to work on dynamic graphs, reducing the cost of calculating vertex representations (embeddings) by an order of magnitude compared to computing the embeddings from scratch.

We empirically evaluate our algorithms on several real-world datasets for multi-class and multi-label classification tasks.
2. Unsupervised representation learning and random walks
Given an undirected and unweighted graph, $G = (V, E)$, with the set of vertices $V$ and edges $E$, the goal of vertex representation learning is to determine a fixed-length vector, $\mathbf{z}_v \in \mathbb{R}^d$, for each vertex $v \in V$ such that similar vertices are close in the representation space. This paper focuses on methods that learn these representations using the conditional probabilities of vertex pairs derived from random walks on the graph (Hamilton et al., 2017b).
A typical workflow for vertex representation learning on graphs consists of the following steps: (i) update the graph, (ii) generate random walks, (iii) learn vertex representations (embeddings), (iv) train the downstream learning task (e.g., vertex classification). Steps (ii) and (iii), i.e., generating random walks and learning vertex representations, are the most resource-intensive steps in the workflow. Step (iv) involves taking the learned vertex representations and using them as features for predictive tasks such as vertex classification or edge prediction. The vertex representations are learned separately using an unsupervised objective function, as this gives multipurpose vertex representations and also allows the use of a small number of labelled vertices in the downstream prediction task (Perozzi et al., 2014).
In this section, we describe how the DeepWalk and node2vec algorithms (Perozzi et al., 2014; Grover and Leskovec, 2016) on a static graph use random walks to generate vertex pairs that are then used to find the vector representations for all vertices in the graph. In the following section, we will discuss the issues that arise when applying these algorithms to streaming graphs.
2.1. Random Walks
In general, a random walk can be modelled as a $k$th order Markov chain, in which the state space is the set of graph vertices $V$ and the future state depends on the last $k$ steps. For a $k$th order walk the transition probability depends only upon the previous $k$ vertices visited by the walk (Benson et al., 2017).
A random walk of length $l$ starting from a vertex $v_1$ consists of a sequence $(v_1, v_2, \ldots, v_l)$, where $v_i$ represents the vertex at the $i$th position in the walk. In general, a $k$th order random walk is generated by sampling a vertex $v_i$ given the previous $k$ vertices from the transition probability distribution:
(1) $P(v_i \mid v_{i-1}, \ldots, v_{i-k})$
which is nonzero only if there is an edge between vertices $v_{i-1}$ and $v_i$. To generate a $k$th order walk, we must also sample the initial $k$ vertices from another random sampling method in order to calculate the transition probability (Eq. 1). For first-order walks this amounts to sampling the vertices to start the walk from; for higher-order walks we also need another lower-order walk process to generate the initial vertices.
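As an illustration of Eq. 1 in the first-order case, the following minimal Python sketch samples each step uniformly from the neighbours of the current vertex. The adjacency-list dictionary `adj`, the function name `first_order_walk`, and the choice of uniform transition probabilities are assumptions made for this example; they are not taken from the implementation used in this paper.

```python
import random

def first_order_walk(adj, start, length, rng):
    """Uniform first-order walk: the next vertex depends only on the current one."""
    walk = [start]
    while len(walk) < length:
        neighbours = adj[walk[-1]]
        if not neighbours:  # isolated vertex: the walk cannot continue
            break
        walk.append(rng.choice(neighbours))
    return walk

# Hypothetical toy graph: the path A - B - C.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
walk = first_order_walk(adj, "A", 5, random.Random(0))
```

Every consecutive pair in `walk` is an edge of the graph, which is the sense in which the transition probability is nonzero only along edges.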
2.2. Vertex Pairs from Random Walks
Given a set of random walks, we extract vertex pairs that appear close to each other in the walk. These vertex pairs represent the structure of the graph and will be used in the next subsection to learn the vertex representations.
Consider the case where we have a set of random walks, $W = \{w_1, \ldots, w_N\}$, that start from a set of initial vertices in the graph, where $w_n = (v^n_1, \ldots, v^n_l)$ is an independently created random walk starting from the vertex $v^n_1$.
Given such a set of random walks, we sample vertex pairs from each random walk in a way analogous to words being sampled from sentences in the skip-gram model of (Mikolov et al., 2013a). Namely, for each random walk $w_n$ and for each vertex $v^n_i$ we take the $c$ vertices before $v^n_i$ in the walk and create pairs, giving the following set:
(2) $D^-_{n,i} = \{ (v^n_i, v^n_{i-j}) : 1 \le j \le c,\ i - j \ge 1 \}$
The same is done for the $c$ vertices after the vertex $v^n_i$, giving the following set of pairs:
(3) $D^+_{n,i} = \{ (v^n_i, v^n_{i+j}) : 1 \le j \le c,\ i + j \le l \}$
Following the terminology of (Mikolov et al., 2013a) we will call the first item in the pairs the target and the second item the context. The parameter $c$ is called the context window size.
The target-context pairs generated for all random walks in the set $W$ together form the corpus of vertex pairs, with repeated pairs retained. We define the set of all vertex pairs generated from the items before the target vertices as:
(4) $D^- = \bigcup_{n=1}^{N} \bigcup_{i=1}^{l} D^-_{n,i}$
and all the pairs generated from items after the target vertices as:
(5) $D^+ = \bigcup_{n=1}^{N} \bigcup_{i=1}^{l} D^+_{n,i}$
Finally, the complete vertex pair corpus is given by the union of both of these:
(6) $D = D^- \cup D^+$
which is used to optimise the loss function as discussed in the next section.
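The pair extraction described in this subsection can be sketched as follows. This is a minimal illustration, assuming walks are stored as Python lists; the function name `context_pairs` and the parameter name `window` are hypothetical and not part of the paper's implementation.

```python
def context_pairs(walk, window):
    """Return (target, context) pairs taken from before and after each target."""
    before, after = [], []
    for i, target in enumerate(walk):
        for j in range(1, window + 1):
            if i - j >= 0:            # context vertex before the target
                before.append((target, walk[i - j]))
            if i + j < len(walk):     # context vertex after the target
                after.append((target, walk[i + j]))
    return before + after

pairs = context_pairs(["A", "B", "C"], 1)
```

Targets near either end of the walk contribute fewer pairs because part of their window falls outside the walk; this is the edge effect discussed in Section 2.4.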
2.3. The SkipGram Model
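Given a corpus of target-context pairs $D$, the vertex representations are found by learning a vector representation for each vertex that, when combined by a specified function, approximates the probability of co-occurrence of the target and context vertices in the vertex pair corpus. In particular, the skip-gram model of (Mikolov et al., 2013a) models the conditional probability of a vertex pair, $(v_i, v_j)$, by a log-linear function of the inner product between the vectors $\mathbf{z}_i$ and $\mathbf{z}_j$ representing the vertices, as follows:
(7) $P(v_j \mid v_i) = \dfrac{\exp(\mathbf{z}_i \cdot \mathbf{z}_j)}{\sum_{v_k \in V} \exp(\mathbf{z}_i \cdot \mathbf{z}_k)}$
The vertex representation vectors for all vertices can be found by minimising the following cross-entropy loss function:
(8) $\mathcal{L} = -\sum_{(v_i, v_j) \in D} \log \dfrac{\exp(\mathbf{z}_i \cdot \mathbf{z}_j)}{\sum_{v_k \in V} \exp(\mathbf{z}_i \cdot \mathbf{z}_k)}$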
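To make the cost of this loss concrete, the following minimal Python sketch evaluates a full-softmax cross-entropy over a list of target-context pairs. It assumes a single vector per vertex stored in a dictionary `vectors`; the function name `skipgram_loss` is hypothetical, and practical skip-gram implementations commonly keep separate target and context vectors.

```python
import math

def skipgram_loss(pairs, vectors):
    """Full-softmax cross-entropy over (target, context) pairs."""
    loss = 0.0
    for target, context in pairs:
        # Inner product of the target vector with every vertex vector.
        scores = {
            v: sum(a * b for a, b in zip(vectors[target], z))
            for v, z in vectors.items()
        }
        # Log of the normalising sum over all vertices.
        log_partition = math.log(sum(math.exp(s) for s in scores.values()))
        loss -= scores[context] - log_partition
    return loss

# With all-zero vectors every vertex is equally likely, so each pair costs log(3).
vectors = {"A": [0.0, 0.0], "B": [0.0, 0.0], "C": [0.0, 0.0]}
loss = skipgram_loss([("A", "B"), ("B", "C")], vectors)
```

Each pair requires a pass over every vertex in the graph to compute the normalising sum, so the cost per pair grows linearly with the number of vertices.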
As the partition function in the denominator of Eq. 8 is a sum over all vertices, it is computationally intractable for all but the smallest graphs. Therefore, more efficient formulations are used to approximate it; in particular, Perozzi et al. (2014) use a hierarchical softmax to approximate the partition function. In this work, as in more recent works (Grover and Leskovec, 2016; Hamilton et al., 2017a), we use negative sampling (Mikolov et al., 2013b) to estimate Eq. 8, which is more efficient than hierarchical softmax.
2.4. Random Walk Length
Grover and Leskovec (2016) found that walks of length gave the best cross-validated performance on a downstream vertex classification task when all other hyperparameters are fixed. In that setting the number of walks was fixed, so as the walk length is increased the total number of target-context pairs in the corpus also increases.
In contrast, we claim that shorter walks can give embeddings that perform similarly on a downstream task when the number of walks is adjusted to keep the training corpus size the same. Table 1 shows similar performance of a downstream vertex classification task on the embeddings calculated from random walks of lengths and . The performance is measured as the Macro-F1 score for multi-class classification on the Cora and CoCit datasets, where is the number of walks starting from each vertex and is the walk length. We keep the size of the corpus the same in the experiments by increasing the number of walks per vertex when we decrease the length of the walks. More information about the datasets and the experimental setup is given in Section 5.
However, we note that the training corpus will be affected by the length of the walks used to generate it. Specifically, for first-order random walks, the length of the walks will firstly change the unigram distribution of vertices in the training corpus, and secondly change the bigram distribution of vertex pairs through edge effects.
Firstly, the unigram distribution of the vertices appearing in the corpus will change with random walk length. In particular, as the random walks become long, the unigram distribution of vertices will tend to the stationary distribution (Bollobás, 1998). In comparison, short random walks will be dominated by the distribution of the initial vertices, typically a uniform distribution.
The unigram vertex distribution effectively alters the overall weighting of probabilities for each vertex in the cross-entropy loss function Eq. 8. The effect is for longer walks to give a comparatively higher weight to high-degree vertices, as they will appear more frequently in longer walks.
Secondly, edge effects due to the sampling method of Eq. 4 and Eq. 5 will bias the vertex pair corpus towards vertices that are closer together. This is because when the target vertex is closer than the context window size to the start of the walk, not all vertex pairs can be sampled by Eq. 4. Similarly, Eq. 5 is biased to vertex pairs that are closer together at the end of the walk. As the length of the walk increases, the number of vertex pairs sampled with the full context window will increase and the effect of the edges will proportionally decrease. Therefore, the corpus will have a larger bias towards vertices that are closer together for short walks compared to long walks. For higher-order walks, there are similar edge effects due to the choice of the initial nodes affecting the vertex pair distribution more for shorter walks.
In this section, we have described two ways that the length affects the generation of vertex pairs from random walks. We show that the effects of walk length are minimal when the size of the vertex pair corpus is controlled for. Furthermore, the changes to the vertex pair corpus caused by different walk lengths can potentially be controlled for in other ways, for example by changing the distribution of the initial vertices of the random walks.
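The relationship between walk length, number of walks, and corpus size can be made concrete with a small calculation. The sketch below is illustrative only: the function names, and the example walk lengths and window size, are assumptions and not the settings used in the experiments.

```python
import math

def pairs_per_walk(length, window):
    """Number of target-context pairs extracted from one walk."""
    total = 0
    for i in range(length):
        total += min(window, i)               # contexts before position i
        total += min(window, length - 1 - i)  # contexts after position i
    return total

def walks_for_budget(budget, length, window):
    """Walks per vertex needed to reach at least `budget` pairs per vertex."""
    return math.ceil(budget / pairs_per_walk(length, window))

long_pairs = pairs_per_walk(10, 2)           # 34 pairs from one walk of length 10
short_pairs = pairs_per_walk(5, 2)           # 14 pairs from one walk of length 5
needed = walks_for_budget(long_pairs, 5, 2)  # 3 short walks match 1 long walk
```

Halving the walk length more than halves the pairs per walk, because a larger share of targets sits near a walk boundary with a truncated window. This is why the number of walks has to be increased by more than the ratio of the lengths to keep the corpus size fixed.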
Configuration  Cora  CoCit 

0.7825  0.3143  
0.7844  0.3059 
3. Dynamic Graphs
A dynamic graph can be represented as a series of undirected and unweighted graphs, $G_t = (V_t, E_t)$, where $V_t$ are the vertices, $E_t$ are the edges, and $t$ ranges over a discrete series of times. A dynamic graph can be considered as a set of updates taking the graph $G_t$ at time $t$ and producing a modified graph $G_{t+1}$ at time $t+1$. These updates consist of deleting or adding one or more vertices and edges. We represent the set of vertices and edges that are deleted from the graph between times $t$ and $t+1$ as $V_{\mathrm{del}}$ and $E_{\mathrm{del}}$, and the set of vertices and edges that are added to the graph between times $t$ and $t+1$ as $V_{\mathrm{add}}$ and $E_{\mathrm{add}}$ respectively. Therefore, the updated graph at time $t+1$ can be given in terms of the vertices and edges as $V_{t+1} = (V_t \cup V_{\mathrm{add}}) \setminus V_{\mathrm{del}}$ and $E_{t+1} = (E_t \cup E_{\mathrm{add}}) \setminus E_{\mathrm{del}}$.
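A single update step of this kind can be sketched with set operations. The sketch assumes vertices are hashable labels and undirected edges are stored as `frozenset` pairs; the function name `apply_update` is hypothetical, and dropping edges that touch a deleted vertex is an assumption added here so that the result is always a valid graph.

```python
def apply_update(vertices, edges, v_add, e_add, v_del, e_del):
    """Apply added and deleted vertex/edge sets to one graph snapshot."""
    new_vertices = (vertices | v_add) - v_del
    new_edges = (edges | e_add) - e_del
    # Assumption: an edge cannot survive if one of its endpoints was deleted.
    new_edges = {e for e in new_edges if e <= new_vertices}
    return new_vertices, new_edges

V = {"A", "B", "C"}
E = {frozenset("AB"), frozenset("BC")}
V1, E1 = apply_update(V, E, {"D"}, {frozenset("BD")}, set(), set())
V2, E2 = apply_update(V, E, set(), set(), {"C"}, set())
```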
Our aim is, given a corpus of random walks $W_t$ on a graph at time $t$, to update the random walks so that the updated set at time $t+1$, $W_{t+1}$, is statistically representative of the updated graph. Namely, the updated corpus of random walks at time $t+1$ should be statistically the same as random walks drawn only from the current graph snapshot $G_{t+1}$.
When updating random walks on graphs, we consider the set of vertices which have changed directly as a result of the additions and deletions of vertices and edges. Specifically, we denote the set of vertices contained in all added edges as $V(E_{\mathrm{add}})$ and the set of vertices contained in all removed edges as $V(E_{\mathrm{del}})$. There are several different cases to consider:

Deleted vertices that had edges will be removed from the graph, and any random walks containing these vertices will be invalid.

Deleted vertices that had no edges will not be in any valid random walks, and no change to the random walk corpus is needed.

Added vertices without edges will not be included in any random walks from other vertices.

All other added edges and deleted edges without removed vertices affect all random walks that include the vertices connected by these edges.
To analyse the effect of graph changes on the random walks we define the following terms:

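Affected vertices: all vertices that are contained in the set of added edges or the set of removed edges, excluding the vertices in the deleted vertex set:
(9) $A = \left( V(E_{\mathrm{add}}) \cup V(E_{\mathrm{del}}) \right) \setminus V_{\mathrm{del}}$
where $V(E')$ denotes the set of vertices contained in an edge set $E'$, $E_{\mathrm{add}}$ and $E_{\mathrm{del}}$ are the added and removed edges, and $V_{\mathrm{del}}$ is the deleted vertex set.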
Affected walks: All random walks from the corpus of random walks that contain at least one affected vertex.
Importantly, we note that all unaffected walks on the graph represent valid samples of walks in the current graph after the update. On the other hand, the affected walks do not represent the statistics of the current graph. At the point where an affected walk encounters an affected vertex, the next step of the walk has different transition probabilities on the current graph snapshot than on the previous graph snapshot. Therefore, the first encountered affected vertex in an affected walk is of special importance. This generalises to all random walk orders, as it is only when the random walks pass through an affected vertex that the different transition probabilities of the next step of the walk are of consequence.
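The two definitions above can be sketched directly in Python. This is an illustrative sketch that assumes edges are stored as tuples of vertex labels and walks as lists; the function names `affected_vertices` and `split_walks` are hypothetical.

```python
def affected_vertices(e_add, e_del, v_del):
    """Vertices touched by added or removed edges, excluding deleted vertices."""
    touched = set()
    for edge in e_add | e_del:
        touched |= set(edge)
    return touched - v_del

def split_walks(walks, affected):
    """Separate walks that contain at least one affected vertex from the rest."""
    hit, clean = [], []
    for walk in walks:
        if any(v in affected for v in walk):
            hit.append(walk)
        else:
            clean.append(walk)
    return hit, clean

affected = affected_vertices({("B", "D")}, set(), set())
hit, clean = split_walks([["A", "B", "C"], ["E", "F", "E"]], affected)
```

Only the walks in `hit` need any further processing; the walks in `clean` remain valid samples on the updated graph.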
3.1. Updating the Random Walk Corpus
Our goal is to update the random walk corpus so that it is indistinguishable from random walks generated on the updated graph. Therefore, the baseline that we will compare to is to regenerate random walks for the latest snapshot of the graph every time the graph is changed. This baseline random walk algorithm, the m1 algorithm, is given in pseudocode in Algorithm 1.
The input data for the m1 algorithm is the current snapshot of the graph and the random walk parameters: the number of walks per vertex and the walk length. m1 takes all the vertices of the snapshot and initialises the specified number of random walks per vertex (Line 3), with that vertex as the initial step of those random walks. The initialised walks are given to the function randomwalk (Line 4), which executes random walks of the given length by sampling vertices from Eq. 1.
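A minimal sketch of this baseline for uniform first-order walks is given below. It is not the pseudocode of Algorithm 1: the adjacency-list dictionary `adj`, the parameter names, and the uniform neighbour sampling are assumptions made for illustration.

```python
import random

def m1(adj, walks_per_vertex, length, rng):
    """Regenerate the whole corpus: a fixed number of walks from every vertex."""
    corpus = []
    for v in adj:
        for _ in range(walks_per_vertex):
            walk = [v]
            # Extend the walk until it reaches the requested length
            # (or stops early at a vertex without neighbours).
            while len(walk) < length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            corpus.append(walk)
    return corpus

adj = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
corpus = m1(adj, 2, 3, random.Random(1))
```

The amount of work is proportional to the number of vertices times the number of walks per vertex times the walk length, independently of how small the change to the graph was.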
As this baseline algorithm requires a large amount of computation at each update, regardless of how small the numbers of added or deleted vertices and edges are, a more efficient algorithm is clearly needed. Such an algorithm would replace the minimum number of random walks in the current corpus with new random walks such that the updated random walk corpus is statistically representative of the new graph.
In the next section, we first show the problem with naively updating random walks by appealing to a simple example, and we introduce the m4 algorithm.
3.2. Example Naive Random Walk Update
To illustrate how a naive algorithm produces biased random walks, we consider taking first-order random walks of length 3 starting from a vertex uniformly sampled from the graph of Fig. 1a. The set of walks generated will be uniformly distributed over the walks shown in Table 2(a). Next, we generate the corpus of pairs as described in Section 2.2. This set of adjacent pairs will be uniformly distributed over the pairs shown in Table 2(b).
The frequency of the vertex pairs can be directly related to the transition probabilities for the vertex pairs given by Eq. 1 with $k = 1$, namely in the expectation:
(10) $P(v_j \mid v_i) = \dfrac{n(v_i, v_j)}{n(v_i)}$
where $n(v_i, v_j)$ is the number of pairs $(v_i, v_j)$ in the corpus and $n(v_i)$ is the number of all pairs containing the vertex $v_i$ in the first position. For the graph the expected pair probabilities Eq. 10 are given by: , and .
Now consider the graph shown in Fig. 1b, which contains the same three vertices (A, B, and C) as the original graph but adds a new vertex D and an edge BD. The set of affected vertices is {B,D}, shown as shaded in Fig. 1b. To update the walks to reflect the new vertex, we naively generate walks of length 3 from the vertex D, of the same number as the expected number of walks from each of the other three vertices. Doing this, we generate random walks that will be uniformly distributed over those shown in Table 2(c). However, the corpus of pairs for the set of walks consisting of the original walks of Table 2(a) and the walks of Table 2(c) does not represent the graph statistics. The vertex pairs generated from these walks will follow the distribution over the pairs shown in Table 2(d).
Theoretically, the transition probabilities to vertices A, C, and D given that the random walk is at vertex B should all be $1/3$, as there are now three neighbours for that vertex. However, using Eq. 10 to calculate the expected empirical conditional probabilities for the vertex pairs we obtain highly biased probability estimates: , , and .
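The size of this bias can be checked with exact arithmetic. The sketch below makes two assumptions that are not confirmed by the text: that the graph of Fig. 1a is the path A–B–C, and that the pair corpus consists of forward adjacent pairs only (each step of a walk). Under those assumptions it enumerates every walk of length 3 with its probability and evaluates the ratio of pair counts from Eq. 10. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def walk_distribution(adj, start, length):
    """All first-order walks from `start`, each with its exact probability."""
    dist = {(start,): Fraction(1)}
    for _ in range(length - 1):
        nxt = {}
        for walk, p in dist.items():
            nbrs = adj[walk[-1]]
            for n in nbrs:
                nxt[walk + (n,)] = p / len(nbrs)
        dist = nxt
    return dist

def expected_transitions(weighted_walks):
    """Ratio of expected pair counts to expected first-position counts."""
    pair, first = Counter(), Counter()
    for walk, w in weighted_walks:
        for a, b in zip(walk, walk[1:]):
            pair[(a, b)] += w
            first[a] += w
    return {ab: n / first[ab[0]] for ab, n in pair.items()}

old = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
new = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}

# Naive update: keep the old walks from A, B, C and add walks from D only.
naive = [wp for v in old for wp in walk_distribution(old, v, 3).items()]
naive += list(walk_distribution(new, "D", 3).items())
probs = expected_transitions(naive)
```

Under these assumptions `probs[("B", "D")]` is 1/12 and `probs[("B", "A")]` is 11/24, compared with the value of 1/3 that each should take on the updated graph. The retained walks through B were sampled when B had only two neighbours, so the new edge BD is strongly under-represented.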
We see that simply adding random walks generated on the newly added vertices in the updated graph to the walks generated on the old graph gives biased conditional probability distributions and counts of vertex pairs that do not reflect the statistics of the updated graph. As a naive incremental baseline, we introduce a slightly more sophisticated algorithm, the m4 algorithm (see Algorithm 2), which initialises random walks for the affected vertices and creates random walks of the given length. Finally, it updates the random walk corpus by replacing the old walks with their corresponding regenerated walks and adding the new random walks for the new vertices (Line 4 in Algorithm 2).
However, as explained in the example, the m4 algorithm does not give the same statistics of the random walks as rerunning the static algorithm m1 on the current state of the graph. In other words, the random walks given by the m4 algorithm have biased empirical transition probabilities and thus do not represent the statistical structure of the updated graph.
In the next section, we present an algorithm that updates a set of random walks such that all statistics derived from it are identical to generating random walks on the new graph from all vertices.
4. Dynamic Representation Learning Algorithms
In this section, we introduce an algorithm that updates the random walks on a dynamic graph to maintain the statistical properties of the random walks as compared to sampling the random walks from scratch on the current static snapshot of the graph. We also discuss how the vertex representations learned by the skip-gram model should be updated given the vertex representations of the previous state of the graph and the updated random walk corpus.
4.1. Unbiased Random Walk Updates
In the previous section, we showed that the baseline m1 algorithm, namely regenerating all walks on every graph change regardless of the extent of the change, is inefficient and incurs unnecessary computation. We additionally showed that the m4 algorithm produces a random walk corpus that does not match the statistical properties of the updated graph.
To motivate an algorithm that allows an update to our random walk corpus, we return to the definitions of affected vertices and affected walks given in Section 3 and consider a first-order random walk. Consider a random walk that has arrived at some vertex. If this vertex is not affected, there is no change to its neighbours; therefore, the choice of the next vertex in the random walk is unchanged. However, if the vertex is affected, then its neighbours have changed and the random walk is biased from this point on, as it has not used the correct transition probabilities at the affected vertex. A corollary is that only random walks that contain affected vertices, namely the affected walks, need to be updated in the random walk corpus.
We propose the m2 algorithm, which is presented in Algorithm 3. The m2 algorithm updates only the affected walks by resampling these walks starting from the first appearance of an affected vertex. Specifically, the m2 algorithm finds the affected walks (Line 4) and trims them to the first affected vertex (Line 5). After that, m2 resumes the trimmed random walks until they are of the given length.
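The trim-and-resume step can be sketched as follows for uniform first-order walks. This is an illustration rather than a transcription of Algorithm 3: the function name `m2_update` and the adjacency-list representation are assumptions, and the sketch does not generate the additional walks that start from newly added vertices or remove walks that start at deleted vertices.

```python
import random

def m2_update(walks, adj, affected, length, rng):
    """Resample each affected walk from its first affected vertex onwards."""
    updated = []
    for walk in walks:
        # Position of the first affected vertex, or None for an unaffected walk.
        cut = next((i for i, v in enumerate(walk) if v in affected), None)
        if cut is None:
            updated.append(walk)
            continue
        new_walk = walk[:cut + 1]  # keep the prefix up to the affected vertex
        while len(new_walk) < length and adj[new_walk[-1]]:
            new_walk.append(rng.choice(adj[new_walk[-1]]))
        updated.append(new_walk)
    return updated

# Updated graph: edge B-D has been added, so B and D are affected.
adj = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "E"], "D": ["B"], "E": ["C"]}
walks = [["E", "C", "E"], ["A", "B", "A"]]
updated = m2_update(walks, adj, {"B", "D"}, 3, random.Random(0))
```

The prefix before the first affected vertex is kept because those steps were sampled at vertices whose neighbourhoods did not change.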
Due to the computational expense of searching for the affected vertices in the affected walks, we also introduce the m3 algorithm, which is the same as the m2 algorithm except that, instead of trimming the affected walks, it regenerates them from their first step. However, as we show in the rest of this section, regenerating the affected walks from their first vertices creates walks that are biased toward walks that do not visit the affected vertices.
We have claimed that the m4 and m3 algorithms produce random walks that have biased statistics compared to fully updating the random walks. We now show this empirically by plotting the difference between the empirical transition probabilities of the pairs generated by the four random walk update algorithms, calculated using Eq. 10, and the theoretical transition probabilities on the graph. These differences are shown in Fig. 2, which depicts the error in the transition probabilities and the normalised number of random walks (re)generated by each algorithm compared to the number of random walks in the complete corpus. The errors are shown in terms of mean and maximum absolute errors for all the vertices when generating random walks on the Cora dataset. The error bars are the standard deviation over 5 runs of each experiment. At each run, the graph is initialized with of the edges and edges are added to the graph at each time step to build the next snapshot. The details about the datasets and the experiment setup are explained in Section 5. For each snapshot of the graph, we run the four algorithms and compute the errors. We can see that the m1 and m2 algorithms have a very small error compared to the m3 and m4 algorithms, which are seen to give biased transition probabilities. We note that m4 has the highest error because it only generates random walks for new and affected vertices and does not consider the affected walks. We also note that the m2 algorithm can update the random walk corpus by updating less than of the random walk corpus compared to m1. In the rest of the experiments, we ignore m3 as it has no clear advantage over m2 except that it does not require searching the random walks for the first affected vertex. However, we note that by storing the random seed that is used to generate each random walk we can regenerate the random walks up to the first affected vertex by simply reusing this seed when the walk is regenerated. Hence, by storing the random seeds for each walk we can implement the m2 algorithm with the same computational cost as the m3 algorithm.
4.1.1. Complexity Analysis
Consider a snapshot of graph at time with and and a random walk corpus consisting of random walks starting at each vertex of the graph and having length . The computation cost of the m1 algorithm to generate random walks for all the vertices in is steps of the random walk algorithm.
In m2, the major functions are filter, randomwalk, and update (see Algorithm 3). For simplicity, we ignore the time complexity of the trim function and instead we assume all walks have the same length of for the randomwalk function. The computational complexity of filter using a naive linear search is , where is the number of affected vertices. However, filter can be implemented faster by using advanced search techniques such as reverse indexes on vertices. Note that the filter function is embarrassingly parallel and it can be done efficiently on cluster-computing frameworks such as Apache Spark (Zaharia et al., 2010). The time complexity of randomwalk is , where is the number of affected and new walks and . It is important to note that . Usually, and are in the order of tens while is in the order of millions to billions, which is several orders of magnitude larger than and . The smallest update on a graph is when a vertex is updated which incurs number of walks with length . However, the upper bound in m2, is a trivial upper bound that occurs when all the walks on the graph are affected, namely all vertices have an edge that has been either added or deleted. For m4, the computational cost depends only on the affected vertices, i.e., .
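One way to avoid the linear search in filter is a reverse index from each vertex to the walks that contain it. The following sketch is an assumption about how such an index could look, not a description of the implementation used in the experiments; the function names are hypothetical.

```python
from collections import defaultdict

def build_index(walks):
    """Map every vertex to the ids of the walks that visit it."""
    index = defaultdict(set)
    for walk_id, walk in enumerate(walks):
        for v in walk:
            index[v].add(walk_id)
    return index

def affected_walk_ids(index, affected):
    """Union of the walk ids stored under each affected vertex."""
    ids = set()
    for v in affected:
        ids |= index.get(v, set())
    return ids

walks = [["A", "B", "A"], ["E", "C", "E"], ["C", "B", "C"]]
index = build_index(walks)
ids = affected_walk_ids(index, {"B", "D"})
```

With the index in place, finding the affected walks costs one lookup per affected vertex instead of a scan over the whole corpus, at the price of keeping the index up to date whenever walks are replaced.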
4.2. Updating Vertex Representations
Now we consider the problem of updating the vertex representations to give new representations that capture the current state of the graph. Specifically, we define the problem as updating the vertex representations given an updated corpus of random walks and the previous representations of the vertices in the graph for that were generated on the previous graph snapshot .
The baseline algorithm to learn vertex representations is to optimize the objective function Eq. 8 using the skip-gram model of (Mikolov et al., 2013a). The algorithm proceeds by initialising the vertex representations randomly. Then, the algorithm performs stochastic gradient descent to optimise the function Eq. 8 or its negative sampling version over the training data consisting of the target-context pairs in the corpus created by the updated random walks. For simplicity, we call this baseline algorithm u1.
However, u1 is computationally expensive and inefficient for dynamic graphs, as it needs to train the skip-gram model from a random initialisation using the entire corpus regardless of the extent of the change in the graph. In Section 4.1, we saw that the number of random walks that algorithms m2, m3, and m4 create depends on the extent of the change in the graph.
Therefore, we propose to use this information in training the skip-gram model as well. The u2 algorithm starts with the vertex representations from the previous graph snapshot and introduces randomly initialised vectors for the new vertices in the current snapshot. Stochastic gradient descent is then performed to minimize the skip-gram objective function using the newly generated random walks for the updated graph.
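The initialisation step of u2 can be sketched as follows. The function name `warm_start`, the dictionary representation of the embeddings, and the scale of the random initialisation are assumptions for illustration; the subsequent stochastic gradient descent on the new walks is not shown.

```python
import random

def warm_start(previous, vertices, dim, rng):
    """Reuse previous vectors; new vertices get small random vectors."""
    return {
        v: list(previous[v]) if v in previous
        # Assumed initialisation scale for vertices without a previous vector.
        else [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
        for v in vertices
    }

previous = {"A": [0.1, 0.2], "B": [0.3, 0.4], "C": [0.5, 0.6]}
current = warm_start(previous, ["A", "B", "C", "D"], 2, random.Random(0))
```

Vertices that are no longer in the graph are dropped simply by not listing them in `vertices`, and the vectors of existing vertices are copied so that training on the new snapshot does not overwrite the stored representations of the previous one.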
Intuitively, as the optimisation process has not seen the regenerated affected walks, using the vertex pairs generated from the updated affected walks to minimise the skip-gram objective will allow updating the vertex representations with the changes that have occurred in the graph. This is, in essence, a transfer learning process (Bengio, 2012) for the graph, where the previously learned representations are fine-tuned using the vertex pairs from the updated walks, with a suitably specified learning rate. In addition, this is closely related to few-shot learning, which seeks to update a learning objective that has been previously trained on a large corpus of data with a small number of examples of a new class (Ravi and Larochelle, 2016). We reserve as future work the investigation of an approach similar to the few-shot learning approach of (Ravi and Larochelle, 2016) for updating the vertex representations.
As we show in Section 5, the u2 algorithm can learn vertex representations competitive with u1 using far fewer training samples.
5. Experiments
We have implemented the random walk algorithms and the target-context pairs generator in Scala, and the skipgram model in Python with the TensorFlow framework (Abadi et al., 2016); see https://github.com/shps/incrementalrepresentationlearning. We run the experiments on a machine with 8 CPUs and 150GB memory.
The datasets used in the experiments are as follows:

Cora (wik, 2018; McCallum et al., 2000): is based on citations between scientific papers. Each paper belongs to exactly one of seven categories: Neural Networks, Reinforcement Learning, Probabilistic Methods, Genetic Algorithms, Rule Learning, Theory, and Case-Based Reasoning.

Wikipedia (wik, 2018): is the dataset of the Wikipedia web pages from categories and the links between them.

BlogCatalog (Zafarani and Liu, 2009): is a network of bloggers and social relationships among them on the BlogCatalog website. The bloggers are labeled with at least one label that represents interests of the bloggers. The labels are extracted from the metadata provided by the bloggers.
Table 3 presents the statistics of vertices, edges, labels and the density of the datasets, which are represented as undirected graphs.

Name         Vertices  Edges    Labels  Density
Cora         2,485     5,069    7
Wikipedia    2,357     11,592   17
BlogCatalog  10,312    333,983  39
CoCit        42,452    194,410  15
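The density of each dataset can be related to its vertex and edge counts. The sketch below assumes the standard definition for a simple undirected graph, 2|E| / (|V|(|V| - 1)), and reads the first two numeric columns of the table as vertex and edge counts; both are assumptions, and the values it prints are not taken from the paper.

```python
def density(num_vertices, num_edges):
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))


# (vertices, edges) as read from the dataset statistics table.
datasets = {
    "Cora": (2485, 5069),
    "Wikipedia": (2357, 11592),
    "BlogCatalog": (10312, 333983),
    "CoCit": (42452, 194410),
}

densities = {name: density(v, e) for name, (v, e) in datasets.items()}

# Print the datasets from sparsest to densest.
for name, value in sorted(densities.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2e}")
```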
The graphs do not contain temporal information, so we follow the previous research (Du et al., 2018; Li et al., 2017) by creating an initial graph from a randomly selected subset of edges and, at each step, adding a specified number of randomly selected edges to the graph in order to create a new snapshot. We call the number of edges that are added to the graph at each step the update rate. We run each experiment multiple times and report the mean of the results. The vertex representations are given to a one-vs-rest logistic regression classifier. The classifier is set to split the train and test data 10 times and we present the mean Macro-F1 score. The standard deviations for the accuracy are less than unless it is explicitly mentioned.
Throughout the experiments, the walk lengths and the skipgram window size are set to and respectively. The number of walks and the size of vector representation for each vertex are set to and respectively as in previous research (Perozzi et al., 2014). The minibatch size for the skipgram model training is set to . The initial learning rate used is to for the Cora, Wikipedia, and CoCit datasets. For the BlogCatalog dataset, we observed a better performance when the learning rate is set to . We set the number of epochs to for the Cora and Wikipedia datasets, but and for the BlogCatalog and the CoCit datasets respectively.
5.1. Vertex Classification
We evaluate the performance of the proposed dynamic graph algorithms for learning vertex representations, i.e., the m2 and m4 algorithms using the u2 method to update the skipgram model. We compare these dynamic representation learning algorithms against our implementation of DeepWalk (Perozzi et al., 2014)/node2vec (Grover and Leskovec, 2016), which regenerates all the random walks and retrains the skipgram model. Specifically, DeepWalk and node2vec are equivalent to the m1 algorithm combined with the u1 method.
Network snapshots in a dynamic graph often contain multiple disconnected components. Typically, for many real-world social network datasets such as the DBLP co-authorship network (dbl, 2018), the snapshots consist of one connected component containing the majority of vertices and many smaller disconnected components. We have observed the same behaviour when adding edges in random batches for all the datasets used in this paper. As the node representations learned on disconnected components do not have any relation to each other, any downstream task must be performed separately on each disconnected component (Hamilton et al., 2017b). Therefore, while at every step we learn embeddings for all the components in the current snapshot, we train and evaluate the classifier only for the nodes of the largest connected component.
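Restricting the evaluation to the largest connected component can be sketched with a breadth-first search over the snapshot's edge list. The function name `largest_connected_component` and the edge-list input format are illustrative assumptions; vertices without any edge do not appear in an edge list and are therefore ignored here.

```python
from collections import deque


def largest_connected_component(edges):
    """Return the vertex set of the largest connected component.

    `edges` is an iterable of (u, v) pairs of an undirected graph.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


if __name__ == "__main__":
    snapshot = [(1, 2), (2, 3), (4, 5)]
    print(sorted(largest_connected_component(snapshot)))
```

The classifier would then be trained and evaluated only on the labelled vertices contained in the returned set.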
We evaluate the performance of the downstream classification task by using cross-validation on labelled data and times reshuffling and splitting data into train/test sets. Evaluating the performance of the downstream classifier requires a considerable number of vertices for train/test splits. Therefore, we create the initial snapshots of the graphs by randomly selecting a proportion of edges. To that end, we initialised the Cora, Wikipedia, and CoCit graphs with , , and of the edges and created consecutive snapshots of the graphs by adding more edges.
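The snapshot construction used in these experiments (a random initial subset of edges, followed by a fixed number of randomly selected edges per step) can be sketched as follows. The function name `make_snapshots` and its parameters are illustrative assumptions, and the fraction and update rate in the usage example are placeholders rather than the settings used in the paper.

```python
import random


def make_snapshots(edges, initial_fraction, update_rate, seed=0):
    """Split a static edge list into a sequence of growing snapshots.

    The first snapshot holds a random `initial_fraction` of the edges;
    each later snapshot adds up to `update_rate` further random edges.
    """
    if update_rate <= 0:
        raise ValueError("update_rate must be positive")
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    n_initial = int(len(shuffled) * initial_fraction)
    current = shuffled[:n_initial]
    snapshots = [list(current)]
    for start in range(n_initial, len(shuffled), update_rate):
        current = current + shuffled[start:start + update_rate]
        snapshots.append(list(current))
    return snapshots


if __name__ == "__main__":
    toy_edges = [(i, i + 1) for i in range(10)]
    for snapshot in make_snapshots(toy_edges, initial_fraction=0.5, update_rate=2):
        print(len(snapshot))
```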
The results for the different methods are shown in Figure 3, which shows the performance of the downstream classification task as measured by the cross-validated Macro-F1 score. The performance scores are shown for the initial graph snapshot and for two snapshots chosen from the middle and the end of the experiment. We see that both dynamic update algorithms give representations whose performance is competitive with the full m1 algorithm for each snapshot of the graph. In addition, we see that the m2 algorithm gives marginally better performance in the downstream task than the m4 algorithm. This is consistent with the fact that the m4 algorithm produces random walks that are biased in terms of the pairwise transition probabilities, as discussed in Section 3.1.
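For reference, the Macro-F1 score used as the evaluation metric is the unweighted mean of the per-class F1 scores. The following is a minimal sketch for the single-label (multiclass) case; the function name `macro_f1` is an assumption, and the multilabel case used for some datasets is not covered.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class contributes equally to the mean, Macro-F1 is sensitive to performance on rare classes, which matters for datasets with many labels.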
5.2. Update Rate
As we explain in Section 4.1, the computational complexity of the m4 random-walk algorithm is proportional to the number of affected vertices in the graph update. In contrast, the computational complexity of the m2 algorithm is proportional to the number of affected walks. As the number of affected walks will be greater than the number of affected vertices, the m4 algorithm is computationally more efficient than the other algorithms (as can be seen in Fig. 4).
The run time of the m2 algorithm depends on the number of affected walks which in turn depends on the density of the graphs as well as the number of affected vertices. In our experiments, the number of affected vertices is approximately proportional to the update rate. To illustrate this for the datasets used in this paper, Fig. 5 shows the percentage of affected walks as a proportion of all the walks generated as the update rate increases. The graphs are initialised with of the edges. As expected, increasing the update rate increases the number of affected walks and the denser a graph is the larger the number of affected walks for a given update rate. For example, the CoCit dataset has the lowest density of the datasets and hence we see that the proportion of affected walks is smaller for the same update rate as compared to the other datasets.
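The proportion of affected walks for a given batch of new edges can be estimated with a short sketch. It assumes that an affected vertex is an endpoint of a newly added edge and that an affected walk is any stored walk that visits at least one affected vertex; the precise definitions are those of Section 4.1, and the function name `affected_walk_fraction` is illustrative.

```python
def affected_walk_fraction(walks, new_edges):
    """Return the indices of affected walks and their share of all walks.

    Assumption: a walk is affected if it visits an endpoint of a new edge.
    """
    affected_vertices = {vertex for edge in new_edges for vertex in edge}
    affected = [
        index
        for index, walk in enumerate(walks)
        if affected_vertices.intersection(walk)
    ]
    fraction = len(affected) / len(walks) if walks else 0.0
    return affected, fraction


if __name__ == "__main__":
    stored_walks = [[1, 2, 3], [4, 5, 6], [3, 4, 1]]
    indices, share = affected_walk_fraction(stored_walks, [(2, 7)])
    print(indices, share)
```

In a denser graph each vertex appears in more walks, so the same number of new edges touches a larger share of the stored corpus.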
The effect of this is that, for a given update rate, m2 performs better on graphs with low density. Therefore, large dynamic social graphs are natural targets for our m2 algorithm, as their density is usually at least one order of magnitude lower than that of the CoCit graph (net, 2018).
Fig. 6 depicts the run time of the m2 u2 method normalised to the run time of the m1 u1 method. The results are for the last snapshot of the CoCit graph, where the entire dataset has been streamed to the algorithms. We can see that when the update rate is edges per snapshot, the m2 u2 method is two orders of magnitude faster than the m1 u1 method.
5.3. Random Walks and Bias in the m4 Algorithm
As discussed in Section 4.1, m4 generates random walks that are biased in terms of the pairwise transition probabilities. To evaluate the effect of the biased random walks produced by the m4 algorithm, we retrain the skipgram model starting with a random initialisation of the weights and use all the vertex pairs generated from the random walk algorithms in the last snapshot of each graph, where each graph has of the edges. We run each experiment times and the mean Macro-F1 scores for multiclass and multilabel classification are presented in Fig. 7. The performance scores for the m2 algorithm are similar to those for the m1 algorithm. The random-walk corpus generated by m4 is biased, and this seems to have a negative effect on the downstream task performance for the Cora and CoCit datasets (Fig. 7(a) and Fig. 7(b)). However, we do not observe such a negative effect of the random walks generated by the m4 algorithm on the Wikipedia and BlogCatalog datasets (Fig. 7(c) and Fig. 7(d)). This could be due to the fact that the number of random walks that are updated by the m4 algorithm is higher for graphs with higher density. In addition, the fact that the random walk corpus used to update the vertex representations is statistically biased does not necessarily affect the downstream task accuracy if this bias is small and localised. In the future, it would be of interest to investigate the differences in the representations that are found by the biased versus unbiased methods for different datasets.
6. Related Work
Feature engineering has long been studied for graph analysis tasks. It requires domain experts to extract features (Henderson et al., 2011) for vertices by handcrafting features and using feature extraction techniques (Tang and Liu, 2012). In contrast, the focus of this paper is on general-purpose representation learning approaches.
Representation learning for graphs is an important problem due to its application to tasks such as link prediction (Backstrom and Leskovec, 2011) and vertex classification (Tsoumakas and Katakis, 2007). Many of the initially proposed techniques (Belkin and Niyogi, 2002; Tenenbaum et al., 2000; Roweis and Saul, 2000) learn graph representations based on a spectral analysis of the adjacency matrix.
More recently, the first- and second-order proximity of vertices have been used to learn vertex representations (Tang et al., 2015) by modelling the joint probabilities and the transition probabilities of connected vertices. Driven by recent advancements in word embedding (Mikolov et al., 2013a; Mikolov et al., 2013b), a number of new methods learn vertex representations in a similar way by treating random walks in the graph as the sentences used by word embedding methods (Perozzi et al., 2014; Grover and Leskovec, 2016). In addition, there are methods (Dong et al., 2017; Chen and Sun, 2017) that extend these approaches to heterogeneous graphs. A recent method, VERSE (Tsitsulin et al., 2018), proposes a notion of similarity that is more flexible than the local neighbourhood. VERSE explicitly learns any similarity measure among vertices, such as personalised PageRank, to learn vertex representations.
Another approach to learning vertex representations is GraphSAGE (Hamilton et al., 2017a), which learns a neural network that transforms the features of a vertex and a sampled subgraph around that vertex into a vertex representation. Specifically, GraphSAGE takes an inductive approach that can be used to generate vertex representations for vertices that are not in the training graph. Other methods offer a similar approach that requires graphs that have features for each vertex (Li et al., 2017, 2018).
We note that all of the aforementioned methods are only for static graphs. Du et al. (Du et al., 2018) extend LINE (Tang et al., 2015) to support dynamic graphs. Neighbourhood sampling methods based on random walks, which are the focus of this paper, have been shown to perform better than the one-step and two-step sampling in LINE (Perozzi et al., 2014; Grover and Leskovec, 2016). Chang et al. (Chang et al., 2017) propose a real-time recommender system on streaming data. In this paper, we focus on graph data and vertex representations that can be used for different downstream learning tasks.
Recently, a framework based on a modification of DeepWalk for temporal graphs has been proposed (Nguyen et al., 2018a) that uses temporal random walks where at each step the walker is only allowed to move along an edge that has a later time than the edge used to arrive at the current vertex. The framework of (Nguyen et al., 2018a) aims to use temporal information after the fact, whereas in this paper we focus on updating vertex representations given new edges and vertices not previously available.
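The temporal constraint described above can be illustrated with a minimal sketch in which each step may only follow an edge whose timestamp is later than that of the edge used to arrive at the current vertex. This is not the implementation of (Nguyen et al., 2018a); the function name `temporal_walk`, the undirected treatment of edges, the uniform choice among valid edges, and the strict inequality on timestamps are all assumptions made for illustration.

```python
import random


def temporal_walk(temporal_edges, start, length, seed=0):
    """Sample one time-respecting walk of at most `length` vertices.

    `temporal_edges` is an iterable of (u, v, timestamp) triples,
    treated here as undirected edges.
    """
    rng = random.Random(seed)
    neighbours = {}
    for u, v, t in temporal_edges:
        neighbours.setdefault(u, []).append((v, t))
        neighbours.setdefault(v, []).append((u, t))

    walk = [start]
    current, last_time = start, float("-inf")
    while len(walk) < length:
        # Only edges strictly later than the arrival edge are allowed.
        candidates = [(v, t) for v, t in neighbours.get(current, []) if t > last_time]
        if not candidates:
            break  # No time-respecting continuation exists.
        current, last_time = rng.choice(candidates)
        walk.append(current)
    return walk


if __name__ == "__main__":
    edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3)]
    print(temporal_walk(edges, "a", 4))
```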
7. Conclusion
Many real-world graphs are dynamic and change over time. However, contemporary methods for unsupervised representation learning of vertices are mainly designed for static graphs. In this paper, we focused on efficient representation learning methods based on random walks for dynamic graphs. We showed that a naive incremental update of random walks results in random walks that do not statistically represent the graph structure when the graph changes over time. We proposed an intuitive way to capture changes in a dynamic graph based on the notions of affected vertices and affected walks. Following that, we proposed an incremental random walk algorithm, namely m2, and an incremental method for representation learning, namely u2, whose computational cost depends on the extent of the change in the graph. Incremental random walks generated by the m2 algorithm are statistically indistinguishable from random walks regenerated by the m1 algorithm. Through extensive experiments on real-world graphs, we showed that our incremental algorithms can achieve results competitive with the state-of-the-art methods while being considerably more efficient.
References
 ms2 (2016) 2016. Microsoft Academic Graph  KDD cup 2016. https://kddcup2016.azurewebsites.net/Data. (2016).
 dbl (2018) 2018. DBLP graphs. http://projects.csail.mit.edu/dnd/DBLP/. (2018).
 wik (2018) 2018. Deep Neural Networks Based Approaches for Graph Embeddings. (2018). https://github.com/PFEPassau/Evaluation_Framework_For_Graph_Embedding
 net (2018) 2018. Massive network data. http://networkrepository.com/massive.php. (2018).
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for largescale machine learning.. In OSDI, Vol. 16. 265–283.
 Adhikari et al. ([n. d.]) B Adhikari, Y Zhang, N Ramakrishnan PacificAsia Conference, and 2018. [n. d.]. Sub2Vec: Feature Learning for Subgraphs. Springer ([n. d.]).
 Backstrom and Leskovec (2011) Lars Backstrom and Jure Leskovec. 2011. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 635–644.
 Belkin and Niyogi (2002) Mikhail Belkin and Partha Niyogi. 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems. 585–591.
 Bengio (2012) Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 17–36.
 Benson et al. (2017) Austin R Benson, David F Gleich, and LekHeng Lim. 2017. The spacey random walk: A stochastic process for higherorder data. SIAM Rev. 59, 2 (2017), 321–345.
 Bollobás (1998) B Bollobás. 1998. Modern Graph Theory. Vol. Graduate Texts in Mathematics.
 Chang et al. (2017) Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A HasegawaJohnson, and Thomas S Huang. 2017. Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 381–389.
 Chen and Sun (2017) Ting Chen and Yizhou Sun. 2017. Taskguided and pathaugmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 295–304.
 De Winter et al. ([n. d.]) Sam De Winter, Tim Decuypere, Sandra Mitrovic, Bart Baesens, and Jochen De Weerdt. [n. d.]. Combining Temporal Aspects of Dynamic Networks with Node2Vec for a more Efficient Dynamic Link Prediction. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 1234–1241.
 Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 135–144.
 Du et al. (2018) Lun Du, Yun Wang, Guojie Song, Zhicong Lu, and Junshan Wang. 2018. Dynamic Network Embedding: An Extended Approach for Skipgram based Network Embedding.. In IJCAI. 2086–2092.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
 Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017a. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1025–1035.
 Hamilton et al. (2017b) William L Hamilton, Rex Ying, and Jure Leskovec. 2017b. Representation Learning on Graphs: Methods and Applications. arXiv.org (Sept. 2017). arXiv:1709.05584v3
 Henderson et al. (2011) Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina EliassiRad, Hanghang Tong, and Christos Faloutsos. 2011. It’s who you know: graph mining using recursive structural features. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 663–671.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 Li et al. (2018) Jundong Li, Kewei Cheng, Liang Wu, and Huan Liu. 2018. Streaming link prediction on dynamic attributed networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 369–377.
 Li et al. (2017) Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. 2017. Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 387–396.
 McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3, 2 (2000), 127–163.
 Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
 Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
 Nguyen et al. (2018a) Giang Hoang Nguyen, John Boaz Lee, Ryan A Rossi, Nesreen K Ahmed, Eunyee Koh, and Sungchul Kim. 2018a. Continuoustime dynamic network embeddings. In 3rd International Workshop on Learning Representations for Big Networks (WWW BigNet).
 Nguyen et al. (2018b) Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, and Sungchul Kim. 2018b. Dynamic Network Embeddings: From Random Walks to Temporal Random Walks. In IEEE BigData.
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 459–467.
 Ravi and Larochelle (2016) Sachin Ravi and Hugo Larochelle. 2016. Optimization as a model for fewshot learning. (2016).
 Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. science 290, 5500 (2000), 2323–2326.

 Tang and Liu (2012) Jiliang Tang and Huan Liu. 2012. Unsupervised feature selection for linked social media data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 904–912.
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Largescale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
 Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. science 290, 5500 (2000), 2319–2323.
 Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. (2018).
 Tsoumakas and Katakis (2007) Grigorios Tsoumakas and Ioannis Katakis. 2007. Multilabel classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3, 3 (2007), 1–13.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for WebScale Recommender Systems. arXiv preprint arXiv:1806.01973 (2018).
 Zafarani and Liu (2009) Reza Zafarani and Huan Liu. 2009. Social computing data repository at ASU. (2009).
 Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 1010 (2010), 95.
 Zhang et al. (2018) Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2018. Network representation learning: a survey. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TBDATA.2018.2850013