A network embedding method
Network embedding algorithms learn latent feature representations of nodes, transforming networks into lower-dimensional vector representations. Key applications that have been effectively addressed using network embeddings include link prediction, multilabel classification and community detection. In this paper, we propose BiasedWalk, a scalable, unsupervised feature learning algorithm that is based on biased random walks to sample context information about each node in the network. Our random-walk based sampling can behave as Breadth-First-Search (BFS) and Depth-First-Search (DFS) sampling, with the goal of capturing homophily and role equivalence between the nodes in the network. We have performed a detailed experimental evaluation comparing the performance of the proposed algorithm against various baseline methods, on several datasets and learning tasks. The experimental results show that the proposed method outperforms the baselines in most of the tasks and datasets.
Networks (or graphs) are used to model data arising from a wide range of applications, from social network mining to biology and neuroscience. Of particular interest here are applications that concern learning tasks over networks. For example, in social networks, to answer whether two given users belong to the same social community or not, examining their direct relationship does not suffice, since the users can have many further characteristics in common, such as friendships and interests. Similarly, in friendship recommendations, in order to determine whether two unlinked users are similar, we need to obtain an informative representation of the users and their proximity, which potentially is not fully captured by handcrafted features extracted from the graph. In this direction, representation (or feature) learning algorithms are useful tools that can help us address the above tasks. Given a network, those algorithms embed it into a new space (usually a compact vector space), in such a way that both the original network structure and other “implicit” features of the network are captured.
In this paper, we are interested in unsupervised (non-task-specific) feature learning methods; once the vector representations are obtained, they can be used with existing machine learning tools to deal with various data mining tasks on the network, including node classification, community detection and link prediction. Nevertheless, it is quite challenging to design an “ideal” algorithm for network embeddings. The main reason is that, in order to learn task-independent embeddings, we need to preserve as many important properties of the network as possible in the embedding feature space. But such properties might not be apparent, and even after they are discovered, preserving some of them is intractable.
This can explain the fact that most of the traditional methods only aim at retaining the first- and second-order proximity between nodes. In addition, network representation learning methods are expected to be efficient, so that they scale well to large networks, and to handle different types of rich network structures, including (un)weighted, (un)directed and labeled networks.
In this paper, we propose BiasedWalk, an unsupervised, Skip-gram-based network representation learning algorithm that can preserve higher-order proximity information and is able to capture both the homophily and role equivalence relationships between nodes. BiasedWalk relies on a novel node sampling procedure based on biased random walks that can behave as actual depth-first-search (DFS) and breadth-first-search (BFS) explorations, thus forcing the sampling scheme to capture both role equivalence and homophily relations between nodes. Furthermore, BiasedWalk scales to large graphs and is able to handle different types of network structures, including (un)weighted and (un)directed ones. Our extensive experimental evaluation indicates that BiasedWalk outperforms the state-of-the-art methods on various network datasets and learning tasks.
Code and datasets for the paper are available at: https://goo.gl/Easwk4.
Network embedding methods have been well studied since the 2000s. The early works on the topic consider the embedding task as a dimensionality reduction one, and are often based on factorizing some matrix associated with pairwise distances between all nodes (or data points) [22, 19, 2]. Those methods, which rely on the eigen-decomposition of such a matrix, are often sensitive to noise (i.e., missing or incorrect edges). Some recently proposed matrix factorization-based methods focus on the formal problem of representation learning on networks, and are able to overcome limitations of the traditional manifold learning approaches. For example, while traditional ones only focus on preserving the first-order proximity, GraRep aims to preserve a higher-order proximity by using many matrices, each of which holds the k-step transition probabilities between nodes for a different number of steps k. The HOPE algorithm defines some similarity measures between nodes that are also helpful for preserving higher-order proximities, and formulates those measures as a product of sparse matrices to efficiently find the latent representations.
Recently, a category of network embedding methods has emerged that relies on representation learning techniques from natural language processing (NLP). The two best-known methods here are DeepWalk and node2vec, both of which are based on the Skip-gram model. This model was proposed to learn vector representations for words in a corpus by maximizing the conditional probabilities of predicting contexts given the vector representations of those words. DeepWalk is the first method to leverage the Skip-gram model for learning representations on networks, by extracting truncated random walks in the network and considering them as sentences. The sentences are then fed into the Skip-gram model to learn node representations. The node2vec algorithm can essentially be considered an extension of DeepWalk. In particular, the difference between those two methods concerns the node sampling approach: instead of sampling nodes by uniform random walks, node2vec uses biased (and second-order) random walks to better capture the network structure. Because the Skip-gram based methods use random walks to approximate high-order proximities, their main advantages include both scalability and the ability to learn latent features that only appear when we observe a high-order proximity (e.g., the community relation between nodes).
As we will present later on, the proposed method, called BiasedWalk, is also based on the Skip-gram model. The core of BiasedWalk is a node sampling procedure that generates biased (and first-order) random walks that behave as actual depth-first-search (DFS) and breadth-first-search (BFS) explorations, thus helping to capture important structural properties, such as community information and structural roles of nodes. It should be noted that node2vec also aims to sample nodes based on DFS and BFS random walks, but it cannot control the distances between the sampled nodes and the source node. Controlling the distances is crucial because in many specific tasks, such as link prediction, nodes close to the source should have a higher probability of forming a link with it. Similarly, in the community detection task, given a fixed sampling budget, we prefer to visit as many community-wide nodes as possible.
Another shortcoming of node2vec is that its second-order random walks require storing the interconnections between the neighbors of every node. More specifically, node2vec’s space complexity grows quadratically with the average degree of the input network. This can be an obstacle to embedding dense networks, where the node degrees can be very high. Moreover, we prefer random-walk based sampling over pure DFS and BFS sampling because of the Markovian property of random walks: given a random walk, we can immediately obtain context sets for the nodes in the walk by sliding a fixed-size window along it.
The LINE algorithm, though not belonging to any of the previous categories, is also one of the well-known network embedding methods. LINE derives two optimization functions for preserving the first- and second-order proximity, and performs the optimizations by stochastic gradient descent with edge sampling. Nevertheless, in general, its performance tends to be inferior to that of Skip-gram based methods. The interested reader may refer to review articles on representation learning on networks. Lastly, we should mention here the recent progress on deep learning techniques that have also been adopted to learn network embeddings [24, 6, 3, 28, 7, 10].
A network can be represented by a graph G = (V, E), where V is the set of nodes and E is the set of links in the network. Depending on the nature of the relationships, G can be (un)weighted and (un)directed. Table I lists the notation used throughout the paper.
|G = (V, E)|An input network with set of nodes V and set of links E|
||Vector representation of a node|
||Set of neighborhood (context) nodes of a node|
||Vector representation of a node when it is in the role of a context node|
||Probability of a node being the next node of the (uncompleted) random walk|
||Average degree of the input network|
||Parameter for controlling the distances from the source node to sampled nodes|
||Maximum length of random walks|
||Proximity score of a node|
||The number of sampled random walks per node|
||Set of (immediate) neighbors of a node|
||Window size of training context of the Skip-gram|
||Hadamard operator on the vector representations of two nodes|
An embedding of the network is a mapping from the node set to a low-dimensional vector space whose dimension is much smaller than the number of nodes. Network embedding methods are expected to preserve the structure of the original network, as well as to learn latent features of nodes. In this work, inspired by the Skip-gram model, we aim to find embeddings that maximize the log-probability of observing the neighborhood (or “context”) of each node given the node's vector representation, summed over all nodes (Eq. (1)).
Note that the set of neighborhood nodes of a node is not necessarily the set of its immediate neighbors; it can be any set of nodes sampled for that node. The insight is that nodes sharing the same context nodes are expected to be close in the graph or to be similar. For example, the context could be the set of nodes within a fixed number of hops from the node, or the nodes that co-appear with it in a short random walk over the graph. Related studies have shown that the resulting embedding is strongly affected by the strategy adopted for sampling the context nodes. To make the problem tractable, we also assume that the context nodes are predicted independently of each other, so that the objective in Eq. (1) factorizes into a sum, over all nodes and all of their context nodes, of the log-probability of predicting each context node given the node's vector representation (Eq. (2)).
Lastly, we use the softmax function to define the probability of predicting a context node given the vector representation of a node: the probability is proportional to the exponential of the inner product between the context vector of the former and the target vector of the latter, normalized over all nodes in the network. Note that there are two representation vectors for each node: one used when the node is in the role of a target node, and one used when it is considered as a context node.
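For reference, the objective described above can be written out in the standard Skip-gram form. This is a sketch under assumed notation, not necessarily the symbols used by the authors: $\Phi(v)$ is assumed to denote the target vector of node $v$, $\Phi'(u)$ the context vector of node $u$, and $N(v)$ the sampled context set of $v$.

```latex
\begin{align}
\max_{\Phi} \; & \sum_{v \in V} \log \Pr\bigl(N(v) \mid \Phi(v)\bigr) \tag{1} \\
\max_{\Phi} \; & \sum_{v \in V} \sum_{u \in N(v)} \log \Pr\bigl(u \mid \Phi(v)\bigr) \tag{2} \\
\Pr\bigl(u \mid \Phi(v)\bigr) \; &= \frac{\exp\bigl(\Phi'(u) \cdot \Phi(v)\bigr)}{\sum_{x \in V} \exp\bigl(\Phi'(x) \cdot \Phi(v)\bigr)}
\end{align}
```

The step from (1) to (2) uses only the independence assumption on context nodes; the last line is the softmax over all nodes.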
In this section, we introduce BiasedWalk, a method for learning representations for nodes in a network based on the Skip-gram model. To both capture the network structure effectively and leverage the efficient Skip-gram model for learning vector representations, we first propose a novel sampling procedure based on biased random walks that are able to behave as actual depth-first-search (DFS) and breadth-first-search (BFS) sampling schemes. Figure 1 (a) shows an example of our biased random walks. On a limited budget of samples, random walks staying in the local neighborhood of the source node can be an alternative to BFS sampling, and such BFS random walks help in discovering the role of source nodes. For example, hub, leaf and bridge nodes can be determined by their neighborhood nodes only. On the other hand, random walks that move further away from the source node are equivalent to DFS sampling, since such DFS walks can discover nodes in the community-wide area of the source, and this is helpful for understanding the homophily effect between the nodes.
Given a random walk, the contexts of each node in it are the nodes within a fixed-size window centered at that node. Then, we consider each of the generated random walks as a sentence in a corpus and learn vector representations for the words (i.e., the nodes in the random walks) that maximize Eq. (2). It should be noted here that the effectiveness of each Skip-gram based method depends on its own sampling strategy. Our method uses biased random walks instead of uniform ones (as performed by DeepWalk) to better capture the network structure. Furthermore, it does not adopt second-order random walks like node2vec, since they are not able to control the distances between the source node and the sampled nodes.
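The window-based context extraction can be illustrated with a short sketch. The function name `context_sets` and the parameter `half_width` (the number of positions taken on each side of a node) are hypothetical; the exact window convention used in the paper is not recoverable from this text, so a symmetric window is assumed.

```python
from typing import Dict, Sequence, Set


def context_sets(walk: Sequence[int], half_width: int) -> Dict[int, Set[int]]:
    """For every node of one walk, collect the nodes co-occurring in its window.

    Assumption: the window spans `half_width` positions on each side of the
    node's position in the walk.
    """
    contexts: Dict[int, Set[int]] = {}
    for i, node in enumerate(walk):
        lo = max(0, i - half_width)
        hi = min(len(walk), i + half_width + 1)
        # Every position in the window except the center position itself.
        window = [walk[j] for j in range(lo, hi) if j != i]
        contexts.setdefault(node, set()).update(window)
    return contexts


if __name__ == "__main__":
    print(context_sets([0, 1, 2, 3, 4], half_width=1))
```

Each (node, context node) pair produced this way is one training example for the Skip-gram model.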
To be able to simulate both DFS and BFS explorations, BiasedWalk maintains additional information, called the proximity score, for each candidate node of the ongoing random walk, in order to estimate how far (not by the exact number of hops) a candidate is from the source node. More specifically, a node none of whose neighbors has been visited by the random walk has a proximity score of zero. After each new node in the walk is discovered, the proximity score of every node adjacent to that node is increased by an amount that is an exponential function of the current walk length, governed by a parameter that controls the distances from the source to the sampled nodes. (Note that, in case the network is directed, we increase the proximity scores for all of the in- and out-neighbors of that node.) Then, the probability distribution for selecting the next node of the current walk is calculated based on the proximity scores of the neighbors of the most recently visited node, and on which type of sampling (DFS or BFS) we desire. In the case of BFS, the probability of a neighbor being the next node is proportional to its proximity score. In the case of DFS, the probability is inversely proportional to that score. In both cases, the probabilities are normalized over the set of neighbors of the most recently visited node. An illustration of our random-walk based sampling procedure is given in Figure 1 (b), and its main steps are presented in Algorithm 1.
The reason for using an exponential function of the current walk length to increase the proximity scores is that it helps to clearly distinguish candidates that belong to the local neighborhood of the source node from others far away from the source. Since increments made early in the walk are larger than those made later, candidates in the local neighborhood receive much higher scores than the ones outside. In addition, our exponential function guarantees that candidates at the same level of distance from the source have comparable proximity scores. Thus, such proximity scores are a good estimate of the distances from the source node to the candidates, and therefore help in selecting the next node for our desired (DFS or BFS) random walks.
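The sampling procedure described above can be sketched as follows. This is an illustrative reading, not the authors' Algorithm 1: the function name `biased_walk`, the parameter name `beta`, and in particular the increment `beta ** i` for the i-th visited node (with `beta` in (0, 1]) are assumptions, since the exact formula is not recoverable from this text. The sketch handles undirected graphs given as adjacency lists.

```python
import random
from collections import defaultdict
from typing import Dict, List


def biased_walk(adj: Dict[int, List[int]], source: int, max_len: int,
                beta: float, mode: str, rng: random.Random) -> List[int]:
    """Sample one walk; mode "bfs" stays local, mode "dfs" moves away."""
    score: Dict[int, float] = defaultdict(float)  # proximity scores start at zero
    walk = [source]
    while len(walk) < max_len:
        current = walk[-1]
        neighbors = adj[current]
        if not neighbors:
            break  # dead end: the walk cannot be extended
        # Assumed form: the i-th visited node adds beta**i to each neighbor.
        bump = beta ** len(walk)
        for u in neighbors:
            score[u] += bump
        # After the update every neighbor has a positive score (assuming
        # beta**len(walk) has not underflowed), so 1/score is well defined.
        if mode == "bfs":
            weights = [score[u] for u in neighbors]
        else:
            weights = [1.0 / score[u] for u in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
    print(biased_walk(graph, 0, 8, 0.5, "dfs", random.Random(0)))
    print(biased_walk(graph, 0, 8, 0.5, "bfs", random.Random(0)))
```

The scores are kept in a per-walk dictionary, so only the current node's neighbor list is consulted at each step, which is what makes the walk first-order.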
Algorithm 2 depicts the complete procedure of the proposed BiasedWalk feature learning method. Given a desired sampling type (DFS or BFS), the procedure first generates a set of random walks from every node in the network based on Algorithm 1. In order to adopt the Skip-gram model for learning representations for the nodes, it considers each of these walks (a sequence of nodes) as a sentence (a sequence of words) in a corpus. Finally, the Skip-gram model is used to learn vector representations for words in the “network” corpus, which aims to maximize the probability of predicting context nodes as mentioned in Eq. (2).
Consider the average degree of the input network, and assume that the degree of every node is bounded in terms of it. For each node visited in a random walk of bounded maximum length, Algorithm 1 needs to consider all neighbors of that node to update (or initialize) their proximity scores, and then calculate the transition probabilities. The algorithm uses a map to store the proximity scores of such neighbor nodes, so the number of keys (node IDs) in the map is at most the number of neighbors encountered along the walk, i.e., on the order of the degree bound times the maximum walk length. Thus, the time for both accessing and updating a proximity score is logarithmic in the number of keys, as the map can be implemented by a balanced binary search tree. To construct a walk, the algorithm selects at most as many nodes as the maximum walk length, each with a bounded degree. Therefore, the number of update and access operations is on the order of the degree bound times the maximum walk length, and the time complexity of Algorithm 1 is this number multiplied by the logarithmic cost of each operation. With respect to memory, BiasedWalk avoids the additional storage that second-order random walks require, since it adopts first-order random walks.
For the experimental evaluation, we use the vector representations of nodes learned by BiasedWalk and compare them with those of four baseline methods, namely DeepWalk, node2vec, LINE and HOPE, in the tasks of multilabel node classification and link prediction. For each type of our random walks (DFS and BFS), the distance-control parameter of BiasedWalk is varied over a range of values. Then, we use 10-fold cross-validation on labeled data to choose the best parameters (including the walk type and the parameter value) for each graph dataset.
Table II provides a summary of network datasets used in our experiments. More specifically, network datasets for the multilabel classification task include the following:
BlogCatalog: A network of social relationships between the bloggers listed on the BlogCatalog website. The network consists of bloggers and the friendship pairs between them, and the labels of each node form a subset of 39 different labels that represent blogger interests (e.g., political, educational).
Protein-Protein Interaction (PPI): A network of proteins connected by unweighted edges. Each of the labels corresponds to a biological function of the proteins.
IMDb: A network of movies and TV shows, extracted from the Internet Movie Database (IMDb). A pair of nodes in the network is connected by an edge if the corresponding movies (or TV shows) are directed by the same director. Each node can be labeled by a subset of the movie genres in the database, such as drama, comedy, documentary and action.
For the link prediction task, we use the following network datasets (we focus on the largest (strongly) connected components instead of the whole networks):
AstroPhy collaboration: The network represents co-author relationships between 17,903 scientists in AstroPhysics.
Election-Blogs: This is a directed network of front-page hyperlinks between blogs in the context of the 2004 US election. In the network, each node represents a blog and each edge represents a hyperlink between two blogs.
Protein-Protein Interaction (PPI): The same dataset used in the node-classification task.
Epinions: The network represents who-trust-whom relationships between users of the epinions.com product review website.
|PPI (for link prediction)||3,852||21,121||Undirected|
We use DeepWalk, node2vec, LINE and HOPE as baselines for BiasedWalk. The number of dimensions of the output vector representations is set to 128 for all the methods. To get the best embeddings for LINE, the final representation of each node is created by concatenating the first-order and the second-order representations, each of 64 dimensions. The Katz index (with a fixed decay parameter) is selected for HOPE’s high-order proximity measurement, since this setting gave the best performance in the original article. Like BiasedWalk, DeepWalk and node2vec belong to the category of Skip-gram based methods. The parameters for both the node sampling and optimization steps of the three methods are set exactly the same: the number of walks per node, the maximum walk length, and the Skip-gram window size of the training context. Since node2vec requires the in-out and return hyperparameters for its second-order random walks, we have performed a grid search over their values with 10-fold cross-validation on labeled data to select the best embedding, as suggested by previously reported experiments. For a fair comparison, the total number of training samples for LINE is set equal to the number of nodes sampled by the three Skip-gram based methods.
Multilabel node classification is a challenging task, especially for networks with a large number of possible labels. To perform this task, we have used the learned vector representations of nodes and a one-vs-rest logistic regression classifier from the LibLinear library with regularization. The Micro-F1 and Macro-F1 scores for a 50%-50% train-test split of the labeled data are reported in Table III. The best scores obtained for each network across the methods are indicated in bold, and the best parameter settings for BiasedWalk, including the walk type and the corresponding parameter value, are shown in the last row. In our experiments (including the link prediction task), each reported Micro-F1 or Macro-F1 score was calculated as the average score over random instances of a given train-test split ratio.
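For readers unfamiliar with the two metrics, the following sketch computes Micro-F1 and Macro-F1 for multilabel predictions using their standard definitions. The function name `micro_macro_f1` and the toy label sets are illustrative only, and the convention of scoring a label with no true or predicted instances as 0 is an assumption.

```python
from typing import Dict, List, Set, Tuple


def micro_macro_f1(true: List[Set[int]], pred: List[Set[int]]) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for per-node sets of true and predicted labels."""
    labels = sorted(set().union(*true, *pred))
    if not labels:
        return 0.0, 0.0
    # Per-label counts: [true positives, false positives, false negatives].
    counts: Dict[int, List[int]] = {c: [0, 0, 0] for c in labels}
    for t, p in zip(true, pred):
        for c in p & t:
            counts[c][0] += 1
        for c in p - t:
            counts[c][1] += 1
        for c in t - p:
            counts[c][2] += 1

    def f1(tp: int, fp: int, fn: int) -> float:
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0  # assumed convention: 0 when undefined

    # Macro-F1 averages the per-label scores; Micro-F1 pools the counts first.
    macro = sum(f1(*counts[c]) for c in labels) / len(labels)
    tp = sum(counts[c][0] for c in labels)
    fp = sum(counts[c][1] for c in labels)
    fn = sum(counts[c][2] for c in labels)
    return f1(tp, fp, fn), macro


if __name__ == "__main__":
    y_true = [{0, 1}, {1}, {2}]
    y_pred = [{0}, {1, 2}, {2}]
    print(micro_macro_f1(y_true, y_pred))
```

Because Macro-F1 weights every label equally, it is more sensitive than Micro-F1 to performance on rare labels.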
The experimental results show that the Skip-gram based methods outperform the other methods in the multilabel classification task. The main reason is that the other methods mainly aim to capture low-order proximities in networks, which is not enough for node classification, as nodes that are not in the same neighborhood can share the same labels. More precisely, BiasedWalk gives the best results for the BlogCatalog and PPI networks; in BlogCatalog, it improves the Macro-F1 score by 15% and the Micro-F1 score by 6%. In the IMDb network, node2vec and BiasedWalk are comparable, with the former being slightly better. Figure 2 depicts the performance of all the methods for different percentages of training data. For a given percentage of training data, BiasedWalk in most cases performs better than the baselines. An interesting observation from the experiment is that preserving homophily between nodes is important for the node classification task. This explains why the best results obtained by BiasedWalk come from its DFS random-walk based sampling scheme.
|(Best combination of walk type and parameter value)||DFS, 1.0||DFS, 0.5||DFS, 0.25|
To perform link prediction in a network dataset, we first randomly remove half of its edges, in such a way that the network remains connected after the removal. The node representations are then learned from the remaining part of the network. To create the negative labels for the prediction task, we randomly select pairs of nodes that are not connected in the original network. The number of such pairs is equal to the number of removed edges. The “negative” pairs and the pairs from the edges that have been removed are used together to form the labeled data for this task.
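One possible way to build such a labeled set is sketched below. The text does not say how connectivity is preserved during removal; protecting a random spanning tree is an assumption made here, as are the function name `split_edges` and the restriction to undirected, connected, sparse input graphs. When fewer than half of the edges lie outside the spanning tree, the sketch removes only as many as it can.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def split_edges(nodes: List[int], edges: List[Edge],
                rng: random.Random) -> Tuple[List[Edge], List[Edge], List[Edge]]:
    """Return (kept, removed, negatives) for an undirected, connected graph."""
    parent: Dict[int, int] = {v: v for v in nodes}

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Build a random spanning tree with union-find; its edges are never removed,
    # so the remaining graph stays connected.
    shuffled = edges[:]
    rng.shuffle(shuffled)
    tree: List[Edge] = []
    extra: List[Edge] = []
    for u, v in shuffled:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
        else:
            extra.append((u, v))

    n_remove = min(len(edges) // 2, len(extra))
    removed = extra[:n_remove]
    kept = tree + extra[n_remove:]

    # Negative examples: node pairs that are not linked in the original graph.
    # Assumes the graph is sparse enough to contain n_remove such pairs.
    existing: Set[Edge] = {(min(u, v), max(u, v)) for u, v in edges}
    negatives: Set[Edge] = set()
    while len(negatives) < n_remove:
        u, v = rng.sample(nodes, 2)
        pair = (min(u, v), max(u, v))
        if pair not in existing:
            negatives.add(pair)
    return kept, removed, sorted(negatives)


if __name__ == "__main__":
    ring = [(i, (i + 1) % 6) for i in range(6)] + [(0, 2), (1, 4)]
    print(split_edges(list(range(6)), ring, random.Random(0)))
```

The embeddings would then be learned on `kept` only, with `removed` serving as positive examples and `negatives` as negative examples.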
Because the prediction task is for edges, we need some way to combine a pair of node vector representations into a single vector for the corresponding edge. Several operators have been recommended in the literature for this purpose. We have selected the Hadamard operator, as it gave the best overall results. Given the vector representations of two nodes, the Hadamard operator defines the vector representation of the link between them as their element-wise product. Then, we use a Linear Support Vector Classifier with a regularization penalty to predict whether links exist or not.
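As a small illustration of the edge feature construction, the Hadamard (element-wise) product of two node vectors can be computed as follows. The function name `hadamard` and the example vectors are illustrative only.

```python
from typing import List, Sequence


def hadamard(phi_u: Sequence[float], phi_v: Sequence[float]) -> List[float]:
    """Element-wise product of two node vectors of equal dimension."""
    if len(phi_u) != len(phi_v):
        raise ValueError("node vectors must have the same dimension")
    return [a * b for a, b in zip(phi_u, phi_v)]


if __name__ == "__main__":
    print(hadamard([1.0, 2.0, 3.0], [4.0, 0.5, -1.0]))
```

The operator is symmetric in its two arguments, so the resulting edge vector does not depend on the order of the endpoints.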
The experimental results are reported in Table IV. Each score is the average over random instances of a 50%-50% train-test split of the labeled data. In this task, HOPE is still inferior to the other methods, except on the Election-Blogs network. Since HOPE is able to preserve the asymmetric transitivity property of networks, it can work well on directed networks, such as Election-Blogs. LINE’s performance is comparable to that of the Skip-gram based methods, and it is even the best method on the PPI network. The reason is that LINE is designed to preserve the first- and second-order proximity, which is really helpful for the link prediction task: it is possible to infer the existence of an edge if we know the roles of its endpoints, and the roles of nodes can be discovered by examining just their local neighbors. BiasedWalk, again, achieves the best results for this task on most of the networks. Finally, the best results of BiasedWalk in this task are almost all based on its BFS sampling. This supports the view that discovering role equivalence between nodes is crucial for the link prediction task.
|(The best walk type and parameter value)||BFS, 0.125||BFS, 0.125||BFS, 1.0||BFS, 0.25|
We also evaluate how BiasedWalk’s performance changes under different parameter settings. Figure 3 shows the Macro-F1 scores obtained by BiasedWalk in the multilabel classification task on BlogCatalog (the Micro-F1 scores follow a similar trend, so they are not shown here). Except for the parameter under consideration, all other parameters in the experiment are set to their default values. As expected, when the number of walks per node or the walk length is increased, more nodes are sampled by the random walks and BiasedWalk tends to obtain a better score. The effectiveness of BiasedWalk also depends on the number of dimensions of the output vector representations. If the number of dimensions is too small, embeddings in the representation space may not be able to preserve the structural information of the input networks, while if it is set too high, it can negatively affect the subsequent classification tasks. Finally, the Macro-F1 score shows a clear dependence on the distance-control parameter, and this trend can help in inferring the best value of the parameter for each network dataset.
We have also examined the efficiency of the proposed BiasedWalk algorithm. Figure 4 depicts the running time required for sampling (blue curve) and for both sampling and optimization (orange curve) on Erdős-Rényi graphs of various sizes with an average degree of 10. As we can observe, BiasedWalk is able to learn embeddings for graphs of millions of nodes in dozens of hours, and it scales linearly with respect to the size of the graphs. Nearly all of the learning time is spent in the node sampling step, which means that the Skip-gram model is very efficient at solving Eq. (2).
In this work, we have proposed BiasedWalk, a Skip-gram based method for learning node representations on graphs. The core of BiasedWalk is a node sampling procedure using biased random walks that can behave as actual depth-first-search and breadth-first-search explorations, thus enabling BiasedWalk to efficiently capture role equivalence and homophily between nodes. We have compared BiasedWalk to several state-of-the-art baseline methods, demonstrating its good performance on the link prediction and multilabel node classification tasks. As future work, we plan to theoretically analyze the properties of the proposed biased random-walk scheme and to investigate how to adapt the scheme for networks with specific properties, such as signed networks and ego networks, in order to obtain better embedding results.
Deep neural networks for learning graph representations. In: AAAI. (2016) 1145–1152
In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. (7 2018) 2086–2092
Revisiting semi-supervised learning with graph embeddings. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16 (2016) 40–48