Accurately Modeling Biased Random Walks on Weighted Graphs Using Node2vec+

09/15/2021
by   Renming Liu, et al.
Michigan State University

Node embedding is a powerful approach for representing the structural role of each node in a graph. Node2vec is a widely used method for node embedding that works by exploring the local neighborhoods via biased random walks on the graph. However, node2vec does not consider edge weights when computing walk biases. This intrinsic limitation prevents node2vec from leveraging all the information in weighted graphs and, in turn, limits its application to many real-world networks that are weighted and dense. Here, we naturally extend node2vec to node2vec+ in a way that accounts for edge weights when calculating walk biases, but which reduces to node2vec in the cases of unweighted graphs or unbiased walks. We empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs using two synthetic datasets. We also demonstrate that node2vec+ significantly outperforms node2vec on a commonly benchmarked multi-label dataset (Wikipedia). Furthermore, we test node2vec+ against GCN and GraphSAGE using various challenging gene classification tasks on two protein-protein interaction networks. Despite some clear advantages of GCN and GraphSAGE, they show comparable performance with node2vec+. Finally, node2vec+ can be used as a general approach for generating biased random walks, benefiting all existing methods built on top of node2vec. Node2vec+ is implemented as part of PecanPy, which is available at https://github.com/krishnanlab/PecanPy.


Introduction

Graphs and networks naturally appear in many real-world datasets, including social networks and biological networks. The graph structure provides insightful information about the role of each node in the graph, such as protein function in a protein-protein interaction network Liu et al. (2020). However, it is typically impractical to access all neighborhood information of the graph due to the massive scale of many real-world networks with millions of nodes Hu et al. (2021). Node embeddings are one type of solution that aims to resolve this issue by representing all the nodes in a graph in a low-dimensional vector space, where some mutual relations between nodes are preserved Cui et al. (2018); Hamilton et al. (2017).

Node2vec is a second-order random walk based embedding method Grover and Leskovec (2016). It is widely used for unsupervised node embedding for various tasks, particularly in computational biology Nelson et al. (2019), such as gene function prediction Liu et al. (2020); Ata et al. (2018) and essential protein prediction Wang et al. (2021); Zeng et al. (2021), due to its superior performance compared to other matrix factorization and neural network based methods Yue et al. (2019). Some recent works built on top of node2vec aim to adapt node2vec to more specific types of networks Wang et al. (2021); Valentini et al. (2021), generalize node2vec to higher dimensions Hacker (2021), augment node2vec with additional downstream processing Chattopadhyay and Ganguly (2020); Hu et al. (2020), or study node2vec theoretically Grohe (2020); Davison and Austern (2021); Qiu et al. (2018). Nevertheless, none of these follow-up works account for the fact that node2vec is less effective for weighted graphs, where the edge weights reflect the (potentially noisy) similarities between pairs of nodes. This failure stems from the inability of node2vec to differentiate between low- and high-weight edges connecting the previous vertex with a potential next vertex in the random walk, which in turn leads to less accurate modeling of the intended walk bias. In some extreme cases where the weighted graphs are fully connected, such as integrative biological networks Greene et al. (2015), node2vec completely loses its ability to devise a biased random walk.

Meanwhile, another line of recent work on graph neural networks (GNNs) has shown remarkable performance in prediction tasks that involve graph structure, including node classification Bronstein et al. (2021); Zhou et al. (2021); Zhang et al. (2021); Wu et al. (2021); Xia et al. (2021). Although GNNs and embedding methods like node2vec are related in that they both aim to project nodes in the graph to a feature space, two main differences set them apart. First, GNNs typically require labeled data while embedding methods do not. This label dependency ties the embeddings generated by a GNN to the quality of the labels, which in some cases, like in biological networks, are noisy and scarce. Moreover, in a biological network, the node labels, such as gene functions and disease associations, are typically skewed, with many negatives but few positives Ata et al. (2018). This data imbalance poses another challenge to properly training a GNN. Second, GNNs require node features as input to train, which are not always available. In the absence of given node features, one needs to generate them, and GNN algorithms in this case often use trivial node features such as a constant feature or the node degree. These two differences give node embedding methods a unique place in node classification, apart from the GNN methods.

Here, we propose an improved version of node2vec that is more effective for weighted graphs by taking into account the edge weight connecting the previous vertex and the potential next vertex. The proposed method, called node2vec+, is a natural extension of node2vec; when the input graph is unweighted, the resulting embeddings of node2vec+ and node2vec are equivalent in expectation. Moreover, when the bias parameters are set to neutral, node2vec+ recovers a first-order random walk, just as node2vec does. In order to demonstrate the utility of node2vec+, we empirically show that 1) node2vec+ is more robust to additive noise than node2vec using two synthetic datasets; and 2) node2vec+ can achieve equivalent or better performance in multi-label node classification tasks using gene labels in biological networks compared to node2vec and GNNs including GCN Kipf and Welling (2016) and GraphSAGE Hamilton et al. (2017).

Method

We start by briefly reviewing the node2vec method. Then we illustrate that node2vec is less effective for weighted graphs due to its inability to identify out edges. Finally, we present a natural extension of node2vec that resolves this issue.

Node2vec overview

In the setting of node embeddings, we are interested in finding a mapping $f: V \rightarrow \mathbb{R}^d$ that maps each node to a $d$-dimensional vector, so that the mutual proximity between pairs of nodes in the graph is preserved. In particular, a random walk based approach aims to maximize the probability of reconstructing the neighborhoods for any node in the graph, based on some sampling strategy $S$. Formally, given a graph $G = (V, E)$ (the analysis generalizes to directed and/or weighted graphs), we want to maximize the log probability of reconstructing the sampled neighborhood $N_S(u)$ for each $u \in V$:

$$\max_{f} \sum_{u \in V} \log \Pr\big(N_S(u) \mid f(u)\big) \tag{1}$$

Under the conditional independence assumption and the parameterization of the probabilities as softmax-normalized inner products Grover and Leskovec (2016); Mikolov et al. (2013), the objective function above simplifies to:

$$\max_{f} \sum_{u \in V} \Big[ -\log Z_u + \sum_{v \in N_S(u)} f(v) \cdot f(u) \Big], \qquad Z_u = \sum_{v \in V} \exp\big(f(u) \cdot f(v)\big) \tag{2}$$

In practice, the partition function $Z_u$ is approximated by negative sampling Mikolov et al. (2013) to save computational time. Given any sampling strategy $S$, equation (2) can find the corresponding embedding $f$, which is achieved in practice by feeding the generated random walks to the skip-gram model with negative sampling Mikolov et al. (2013).
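To make this pipeline concrete, the following is a minimal sketch (assuming the gensim library and a precomputed list `walks` of random walks with node IDs as strings) of how sampled walks are typically fed to skip-gram with negative sampling; it is an illustration, not the reference implementation.

```python
# Minimal sketch: embed nodes by feeding random walks to skip-gram with
# negative sampling via gensim. `walks` is a list of walks, each walk a
# list of node IDs represented as strings.
from gensim.models import Word2Vec

def embed_walks(walks, dim=128, window=10):
    model = Word2Vec(
        sentences=walks,  # each walk is treated as a "sentence"
        vector_size=dim,  # embedding dimension d
        window=window,    # context window over the walk
        min_count=0,      # keep all nodes, even rarely visited ones
        sg=1,             # skip-gram rather than CBOW
        negative=5,       # negative sampling approximates the partition function
        workers=4,
    )
    return {node: model.wv[node] for node in model.wv.index_to_key}
```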

Node2vec devises a second-order random walk as the sampling strategy. Unlike a first-order random walk Perozzi et al. (2014), where the transition probability $P(c_i = x \mid c_{i-1} = v)$ depends only on the current vertex $v$, a second-order random walk depends also on the previous vertex $t$, with transition probability $P(c_i = x \mid c_{i-1} = v, c_{i-2} = t)$. It does so by applying a bias factor to the edge that connects the current vertex and a potential next vertex. This bias factor is a function of the relation between the previous vertex and the potential next vertex, and is parameterized by the return parameter $p$ and the in-out parameter $q$. For ease of notation, in the following, we denote $x$ as the potential next vertex, $v$ as the current vertex, and $t$ as the previous vertex (see Figure 1). The random walk can then be generated based on the following transition probabilities:

$$P(c_i = x \mid c_{i-1} = v, c_{i-2} = t) = \begin{cases} \dfrac{\alpha_{pq}(t, x)\, w(v, x)}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $Z$ is the normalization constant and the bias factor $\alpha_{pq}$ is defined as:

$$\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } d(t, x) = 0 \\ 1 & \text{if } d(t, x) = 1 \\ \frac{1}{q} & \text{if } d(t, x) = 2 \end{cases} \tag{4}$$

where $d(t, x)$ is the shortest-path distance between $t$ and $x$.

According to this bias factor, node2vec differentiates three types of edges: 1) the return edge, where the potential next vertex $x$ is the previous vertex $t$ (Figure 1a); 2) the out edge, where $x$ is not connected to $t$ (Figure 1b); and 3) the in edge, where $x$ is connected to $t$ (Figure 1c). Note that the first-order (or unbiased) random walk can be seen as a special case of the second-order random walk where both the return parameter and the in-out parameter are set to neutral ($p = q = 1$).
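As a reference point for the extension in the next section, here is a minimal sketch of this bias factor, assuming the graph is given as a dict `nbrs` mapping each node to its set of neighbors:

```python
# Minimal sketch of the node2vec bias factor alpha_pq(t, x) of equation (4).
def node2vec_bias(nbrs, t, x, p, q):
    if x == t:
        return 1.0 / p  # return edge: d(t, x) = 0
    if x in nbrs[t]:
        return 1.0      # in edge: d(t, x) = 1
    return 1.0 / q      # out edge: d(t, x) = 2

# The unnormalized transition weight from v to x given previous vertex t is
# then alpha_pq(t, x) * w(v, x), normalized over all neighbors x of v.
```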

We now turn our attention to weighted networks, where the edge weights are not necessarily zeros or ones. Consider the case where $x$ and $t$ are connected, but with a small weight (Figure 1d), i.e., $w(x, t) = \epsilon$ for some small $\epsilon > 0$. According to the definition of the bias factor, no matter how small $\epsilon$ is, $(x, t)$ would always be considered an in edge. Since $x$ and $t$ are barely connected in this case, $(x, t)$ should in fact be considered an out edge. In the extreme case of a fully connected weighted graph, where $w(u, v) > 0$ for all $u, v \in V$, node2vec completely loses its ability to identify out edges.

Thus, node2vec is less effective for weighted networks due to its inability to identify potential out edges $(x, t)$ where the terminal vertex $x$ is loosely connected to the previous vertex $t$. Next, we propose an extension of node2vec that resolves this issue by taking the edge weight into account in the bias factor.

Figure 1: Illustration of different settings of return and in-out edges. The solid and dotted lines represent edges with large and small edge weights, respectively.

Node2vec+

The main idea of extending node2vec is to identify potential out edges coming from node $v$, where $x$ is loosely connected with $t$. Intuitively, we could determine the "looseness" of $(x, t)$ based on some threshold edge weight. However, given that the distribution of edge weights of any given node in the graph is not known a priori, it is hard to come up with a reasonable threshold value for all networks. Instead, we define the "looseness" of $(x, t)$ by comparing its weight with the average edge weight of the terminal node. Formally, a node $x$ is said to be loosely connected to $t$ if $w(x, t) < \tilde{w}(t)$, where $\tilde{w}(t)$ is the average edge weight of node $t$. For ease of notation, and as a slight abuse of notation, we also consider $x$ to be loosely connected to $t$ if $(x, t) \notin E$. Finally, a (directed or undirected) edge $(x, t)$ is loose if $x$ is loosely connected to $t$, and otherwise it is tight.

Based on this definition of loose and tight edges, and assuming $x \neq t$, there are four types of edges (see Figure 1c-f), depending on the looseness of $(v, x)$ and $(x, t)$. Following node2vec, we categorize these edge types into in and out edges. Furthermore, to prevent the amplification of noisy connections, we add one more edge type, the noisy edge, which is always suppressed.

Out edge (1b, 1d)

As a direct generalization of node2vec, we consider $(x, t)$ to be an out edge if $(v, x)$ is tight and $(x, t)$ is loose. The in-out parameter $q$ then modifies the out edge to differentiate "inward" and "outward" nodes, subsequently leading to Breadth First Search or Depth First Search like searching strategies Grover and Leskovec (2016). Unlike node2vec, however, we further parameterize this bias factor based on $w(x, t)$, and for simplicity, we choose to use linear interpolation. Specifically, for an out edge $(x, t)$, the bias factor is computed as $\frac{1}{q} + \big(1 - \frac{1}{q}\big)\frac{w(x, t)}{\tilde{w}(t)}$. Thus the amount of modification to the out edge depends on the level of looseness of $(x, t)$. When $w(x, t) = 0$, or equivalently $(x, t) \notin E$, the bias factor is $\frac{1}{q}$, the same as that defined in node2vec.

Noisy edge (1e)

We consider $(x, t)$ to be a noisy edge if both $(v, x)$ and $(x, t)$ are loose. Noisy edges are not very informative and thus should be suppressed regardless of the setting of $q$, to prevent the amplification of noise. Thus, the bias factor for a noisy edge is set to $\min\big(1, \frac{1}{q}\big)$.

In edge (1c, 1f)

Finally, we consider $(x, t)$ to be an in edge if $(x, t)$ is tight, regardless of $(v, x)$. The corresponding bias factor is set to neutral (1), as in node2vec.

Combining the above, the bias factor for node2vec+ is defined as follows:

$$\alpha^{+}_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } x = t \ \text{(return)} \\ 1 & \text{if } w(x, t) \geq \tilde{w}(t) \ \text{(in)} \\ \min\big(1, \frac{1}{q}\big) & \text{if } w(x, t) < \tilde{w}(t) \text{ and } w(v, x) < \tilde{w}(x) \ \text{(noisy)} \\ \frac{1}{q} + \big(1 - \frac{1}{q}\big)\frac{w(x, t)}{\tilde{w}(t)} & \text{otherwise (out)} \end{cases} \tag{5}$$

Note that the last case in equation (5) includes cases where $(x, t) \notin E$, i.e., $w(x, t) = 0$. Based on the biased random walk searching strategy using this bias factor, the embedding can be generated accordingly using equation (2). One can verify, by checking equation (5), that this is indeed a natural extension of node2vec in the sense that:

  • For an unweighted graph, node2vec+ is equivalent to node2vec.

  • When $p$ and $q$ are set to 1, node2vec+ recovers a first-order random walk, the same as node2vec does.

Finally, by design, node2vec+ is able to identify potential out edges that would have been overlooked by node2vec. Node2vec+ is implemented as part of PecanPy Liu and Krishnan (2021).
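The following minimal sketch illustrates equation (5) under the definitions above, assuming the graph is given as a dict-of-dicts `w` with `w[u][v]` the edge weight (absent keys meaning weight zero); the official implementation is in PecanPy.

```python
# Minimal sketch of the node2vec+ bias factor of equation (5).
# `w` is a dict-of-dicts: w[u][v] = edge weight; missing keys mean weight 0.
def avg_weight(w, u):
    """Average edge weight of node u, the per-node looseness threshold."""
    return sum(w[u].values()) / len(w[u])

def node2vecplus_bias(w, t, v, x, p, q):
    if x == t:
        return 1.0 / p                # return edge
    w_xt = w[x].get(t, 0.0)
    if w_xt >= avg_weight(w, t):
        return 1.0                    # in edge: (x, t) is tight
    if w[v].get(x, 0.0) < avg_weight(w, x):
        return min(1.0, 1.0 / q)      # noisy edge: (v, x) and (x, t) both loose
    # out edge: linearly interpolate between 1/q (weight 0) and 1 (at threshold)
    return 1.0 / q + (1.0 - 1.0 / q) * w_xt / avg_weight(w, t)
```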

Experiments

Figure 2: Barbell graph. (a) Illustration of the barbell graph, and three different types of nodes. (b)-(d) Embeddings of the barbell graph (or the noisy version of it), with three different settings of q = [1, 100, 0.01], and the color coding is the same as that in (a).

Synthetic datasets

We first demonstrate the ability of node2vec+ to identify potential out edges in weighted graphs using a barbell graph and hierarchical cluster graphs.

Barbell graph

A barbell graph is constructed by connecting two complete graphs of size 20 with a common bridge node (Figure 2a). All edges in the barbell graph have weight 1. There are three types of nodes: 1) the bridge node; 2) the peripheral nodes that connect the two modules with the bridge node; and 3) the interior nodes of the two modules. By changing the in-out parameter $q$, node2vec can place the peripheral nodes closer to either the bridge node or the interior nodes in the embedding space.

When $q$ is large, node2vec suppresses the out edges, e.g., an edge connecting a peripheral node to the bridge node, coming from an interior node. Consequently, the biased random walks are restricted to the network modules. In this case, the transition from the peripheral nodes to the bridge node becomes less likely compared to a first-order random walk, thus pushing the embeddings of the bridge node and the peripheral nodes away from each other. Conversely, when $q$ is small, transitions between the peripheral nodes and the bridge node are encouraged. In this case, the embeddings of the bridge node and the peripheral nodes are pulled together. To see this, we run node2vec with $p$ fixed and three different settings of $q$ (1, 100, and 0.01). Indeed, for $q = 100$, node2vec tightly clusters the interior nodes and pushes the bridge node away from the peripheral nodes, while for $q = 0.01$, the peripheral nodes are pushed away from the interior nodes (Figure 2b).

Next, the barbell graph is perturbed by adding loose edges with edge weights of 0.1, making the graph fully connected. As expected, in this case, node2vec cannot make any difference in the embedding space by changing $q$ (Figure 2c), since none of the edges are identified as out edges. On the other hand, node2vec+ still picks out potential out edges and thus produces the desired outcome (Figure 2d). Note that both node2vec and node2vec+ have similar results when $q = 1$. This confirms that node2vec+ and node2vec are equivalent when $p$ and $q$ are set to neutral, corresponding to embedding with unbiased random walks. Finally, with non-neutral settings of $q$, node2vec+ is able to suppress some noisy edges, resulting in less scattered embeddings of the interior nodes (Figure 2d).
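As a minimal sketch (assuming networkx), the clean and perturbed barbell graphs described above could be constructed as follows:

```python
# Minimal sketch: build the barbell graph (two complete graphs of size 20
# joined by one bridge node) and its noisy, fully connected perturbation.
import itertools
import networkx as nx

def make_barbell():
    g = nx.barbell_graph(20, 1)  # the single middle node is the bridge node
    nx.set_edge_attributes(g, 1.0, "weight")
    return g

def perturb(g, noise_weight=0.1):
    """Add loose edges between all unconnected pairs, making g fully connected."""
    noisy = g.copy()
    for u, v in itertools.combinations(noisy.nodes, 2):
        if not noisy.has_edge(u, v):
            noisy.add_edge(u, v, weight=noise_weight)
    return noisy
```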

Hierarchical CLUSTER graph

We use a modified version of the CLUSTER dataset Dwivedi et al. (2020) to further demonstrate the advantage of node2vec+ in identifying potential out edges. Specifically, the hierarchical cluster graph K3L2 contains two levels of clusters (three including the root level), and each parent cluster is associated with three children clusters (Figure 3a). There are 30 nodes in each cluster, resulting in a total of 390 nodes. To generate the hierarchical cluster graph, we first generate data points via a Gaussian process in a latent space so that the expected Euclidean distance between two points from two sibling clusters is about twice the expected Euclidean distance from one of the two points to a point in the parent cluster, which is set to 1. The noisiness of the clusters is controlled by a noise parameter, which is set to 0.1 by default. These data points are then turned into a fully connected weighted graph using an RBF kernel (see supplement).
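A minimal sketch of this construction, with illustrative assumptions (hand-placed cluster centers instead of the Gaussian process used above, and a hypothetical kernel scale):

```python
# Minimal sketch: sample noisy points around hierarchical cluster centers,
# then build a fully connected weighted graph with an RBF kernel. The center
# placement and kernel scale here are illustrative assumptions.
import numpy as np

def sample_clusters(centers, n_per_cluster=30, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pts, labels = [], []
    for label, c in enumerate(centers):
        pts.append(np.asarray(c) + noise * rng.standard_normal((n_per_cluster, len(c))))
        labels += [label] * n_per_cluster
    return np.vstack(pts), np.array(labels)

def rbf_graph(pts, scale=1.0):
    """Fully connected weighted adjacency from pairwise RBF similarities."""
    sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    adj = np.exp(-sq_dists / (2 * scale**2))
    np.fill_diagonal(adj, 0.0)  # remove self loops
    return adj
```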

Figure 3: Hierarchical CLUSTER graph classification task. (a) Illustrations of the K3L2 hierarchical clusters. Left: top-down view of the clusters. Right: adjacency matrix of K3L2; colored brackets indicate the corresponding cluster levels of the nodes. (b) Classification evaluation on K3L2. (c) Classification evaluation on K3L2c45

Similar to CLUSTER, the classification task is to classify each node in the hierarchical cluster graph into its corresponding cluster (cluster classification). In addition, we can classify the nodes based on the level of their corresponding clusters (level classification). We perform a stratified split to separate the nodes into 10% training and 90% testing, then train a multinomial logistic regression model with l2 regularization and evaluate the performance by the macro F1 score. This evaluation process, including the embedding generation, is repeated ten times. As shown in Figure 3b, the performance of node2vec is not affected by the parameter $q$ because the graph is fully connected. Meanwhile, node2vec+ achieves significantly better performance than node2vec at large $q$ settings for both classification tasks. This is expected: node2vec+ identifies potential out edges, and with large $q$, out walks are discouraged, leading to random walks that are localized in tightly clustered network modules. As illustrated in the previous section, this localization of the random walk draws the nodes of a network module closer together in the embedding space, which is particularly helpful for distinguishing different clusters. Nevertheless, we note that the performance of node2vec+ for level classification on K3L2 is sub-optimal, despite being significantly better than node2vec. Similar results are observed on several other hierarchical cluster graphs, K3L3, K5L1, and K5L2 (see supplement).
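A minimal sketch of this evaluation loop (assuming scikit-learn, a precomputed embedding matrix `X`, and cluster labels `y`):

```python
# Minimal sketch: 10%/90% stratified split, multinomial logistic regression
# with l2 regularization, scored by macro F1.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.1, stratify=y, random_state=seed)
    clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")
```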

However, it is perhaps not surprising that node2vec+ does better than node2vec here, because K3L2 is fully connected. To make a fairer comparison, we maximally sparsify K3L2 using a global edge threshold value of 0.45, which preserves the connectedness of the graph (see supplement). The resulting sparsified graph is denoted K3L2c45. In this case, node2vec shows significant improvement for large $q$, resulting in perfect prediction for the cluster classification task (Figure 3c). Meanwhile, node2vec+ also achieves perfect cluster classification and, more importantly, near perfect level classification performance. This drastic improvement in level classification for node2vec+ as a result of sparsification suggests that even node2vec+ cannot perfectly resolve the problem with fully connected weighted graphs, and further improvement can be made.
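A minimal sketch (assuming networkx) of the global edge thresholding used for sparsification, which drops edges below a cut value while verifying connectivity:

```python
# Minimal sketch: sparsify a weighted graph with a global edge threshold,
# checking that the result stays connected.
import networkx as nx

def threshold_graph(g, cut):
    sparse = g.copy()
    sparse.remove_edges_from(
        [(u, v) for u, v, wt in g.edges(data="weight") if wt < cut])
    if not nx.is_connected(sparse):
        raise ValueError(f"cut={cut} disconnects the graph")
    return sparse

# The maximal cut can be found by sweeping candidate thresholds and keeping
# the largest one for which the graph remains connected.
```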

Figure 4: Fine-grained performance evaluation on K3L2. (a) Varying cut threshold. (b) Varying percentage of testing. (c) Varying noise level.

We then compare the methods in more fine-grained settings by varying the percentage of testing data, the noise level, and the cut threshold. Since both node2vec and node2vec+ favor large $q$ on this dataset, we fix $p$ and $q$ to these favorable settings throughout the analyses on K3L2. Overall, node2vec and node2vec+ exhibit similar trends: 1) the prediction performance increases as the cut threshold increases, and 2) the prediction performance decreases when the testing size increases (leaving a smaller training set) and when the noise increases (Figure 4). Notice that reducing the training size (larger testing size) and including more edges with small weights (smaller cut threshold) have less detrimental effects on node2vec+ than on node2vec, particularly for the cluster classification task. Finally, node2vec+ consistently outperforms (or is at least comparable to) node2vec across all settings and tasks, highlighting the benefits of identifying potential out edges.

Real world datasets

We further assess the quality of node embeddings generated by node2vec+ using node classification tasks on real-world datasets. Two commonly used datasets, BlogCatalog and Wikipedia, are tested. In addition, two popular protein-protein interaction (PPI) networks, STRING Szklarczyk et al. (2021) and GIANT-TN Greene et al. (2015), are tested using a variety of challenging gene classification tasks. Following our previous approach for making a fairer comparison, we also include a sparsified GIANT-TN-c01 network, obtained by applying a global edge threshold of 0.01 to GIANT-TN, which preserves the connectivity of the network (see supplement). More detailed information for each network can be found in Table 1.

Network       Weighted  Nodes   Edge density
BlogCatalog   No        10,312  6.28E-03
Wikipedia     Yes       4,777   8.10E-03
GIANT-TN      Yes       25,825  1.00E+00
GIANT-TN-c01  Yes       25,689  1.18E-01
STRING        Yes       17,352  2.42E-02

Table 1: Network information.

Baseline Methods

In our evaluation on BlogCatalog and Wikipedia, the main comparisons are between node2vec and node2vec+, as before. The key parameters, including the embedding dimension, window size, walk length, and number of walks per node, are set to their default values. We exclude comparisons against some popular node embedding methods like DeepWalk Perozzi et al. (2014) and LINE Tang et al. (2015), as they were shown to be inferior to node2vec Grover and Leskovec (2016). We also exclude some neural network based methods like GraRep Cao et al. (2015) and deepNF Gligorijevic et al. (2018), since they cannot scale efficiently to large and dense networks.

For gene classification on STRING and GIANT-TN, we include comparisons against two popular GNNs, GCN Kipf and Welling (2016) and GraphSAGE Hamilton et al. (2017), which have shown exceptional performance in many node classification tasks. We use full-batch GraphSAGE with mean pooling aggregation, following the Open Graph Benchmark Hu et al. (2021). To match the dimension of the node2vec embeddings for a fair comparison, we use one-hidden-layer models for both GCN and GraphSAGE. Since the PPI networks here do not come with node features, we use a constant feature for GCN and the degree feature for GraphSAGE.

As mentioned earlier, one of the challenges of training GNNs on datasets like the gene classification tasks here is data imbalance, with many negative examples but only a few positive examples; when using the GIANT-TN network, the ratios of negatives to positives are heavily skewed for GOBP, KEGGBP, and DisGeNet. To mitigate this imbalance, we up-weight the loss for positive examples using the corresponding negative-to-positive ratio. By doing so, the models converge significantly faster than without the scaling ratios (see supplement). Additionally, we tuned the learning rate via grid search, selecting for each of GCN and GraphSAGE the rate that yields a decent convergence rate without diverging (see supplement). Finally, we set the maximum number of epochs so that the models are decently converged (see supplement).
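A minimal sketch of this re-weighting (assuming PyTorch and a binary multi-label target matrix), where each task's positive class is up-weighted by its negative-to-positive ratio:

```python
# Minimal sketch: up-weight positives in a multi-label loss by the per-task
# negative-to-positive ratio (PyTorch).
import torch

def imbalance_weighted_loss(labels):
    """labels: (num_nodes, num_tasks) binary tensor of training labels."""
    n_pos = labels.sum(dim=0)
    n_neg = labels.shape[0] - n_pos
    pos_weight = n_neg / n_pos.clamp(min=1)  # negative-to-positive ratio per task
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Usage: criterion = imbalance_weighted_loss(y_train)
#        loss = criterion(logits, y_train.float())
```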

Figure 5: Node classification performance for BlogCatalog and Wikipedia

Experiment setup

We follow node2vec and randomly split the datasets into 50% training and 50% testing for BlogCatalog and Wikipedia. The networks are then embedded using a range of values of $q$. In this evaluation, we fix the parameter $p$ at neutral, since the bias factor associated with $p$ is not modified in the node2vec+ algorithm and should have the same effect as in node2vec. Then, a one-vs-rest logistic regression model with l2 regularization is trained on the embeddings. The final test performance is reported as the macro F1 score. This evaluation process, including the embedding generation step, is repeated 10 times.

Meanwhile, for the PPI networks, we test three types of gene classification tasks: two gene function prediction tasks (KEGGBP, GOBP) and one disease gene prediction task (DisGeNet), obtained as in Liu et al. (2020). We call each type of gene classification task a gene set collection. As before, we train a one-vs-rest logistic regression model with l2 regularization on the embeddings from node2vec and node2vec+, while each GNN model is trained on all the labels of a specific combination of network and gene set collection. We note that this comparison is slightly unfair to the embeddings, since the one-vs-rest logistic regression model has access to only one task at a time, while a GNN model has access to labels from all the tasks in a gene set collection, effectively increasing the number of training examples for the GNN model.

            GOBP     KEGGBP   DisGeNet
GIANT-TN    94 (41)  65 (28)  245 (102)
STRING      85 (35)  58 (26)  234 (97)

Table 2: Number of gene sets for each combination of network and gene set collection in the 5-fold cross-validation (study-bias holdout) setting.

We use two types of validation schemes for evaluation, 5-fold cross-validation and study-bias holdout, following Liu et al. (2020). The cross-validation evaluations for GNNs are excluded due to the extra computational time required. In the study-bias holdout evaluation, we use the 60% most well-studied genes, according to the number of PubMed publications associated with the genes in each dataset, for training, the 20% least-studied genes for testing, and the rest for validation. We filter out gene sets with fewer than ten positives in any split for each PPI network. Thus, despite being a more realistic and rigorous evaluation scheme, the study-bias holdout is more restrictive than 5-fold cross-validation, leaving out many scarce gene sets (Table 2). The validation set is used for tuning the $(p, q)$ combination via grid search for each gene set. The optimally tuned $(p, q)$ combination is then used for the final testing, whose performance is reported with an evaluation metric appropriate for imbalanced data Liu et al. (2020). Because of this tuning step, we consider the embedding approach to be semi-supervised.

Experiment results

For the two commonly used datasets, node2vec+ is either comparable to or better than node2vec (Figure 5). In particular, the identical performance on BlogCatalog confirms that node2vec+ reduces to node2vec for unweighted graphs. On the other hand, for Wikipedia, node2vec+ significantly outperforms node2vec at their respective optimal settings of $q$. This performance difference is attributed to the ability of node2vec+ to identify potential out edges, which consequently leads to more faithful out-biased walks.

Figure 6: Gene classification tasks using protein-protein interaction networks. Starred (*) pairs indicate that the performance between two methods is significantly different (Wilcoxon test).

For gene classification in the 5-fold cross-validation setting (Figure 6, top row), node2vec+ significantly outperforms node2vec (Wilcoxon test) when using GIANT-TN for all three gene set collections. When using sparser networks like GIANT-TN-c01 and STRING, node2vec+ still achieves performance similar to node2vec in most cases. We note that although sparsifying GIANT-TN to GIANT-TN-c01 improves the overall performance and reduces the performance gap between node2vec and node2vec+, finding the optimal sparsification is not a trivial task. For example, one can take various sparsification approaches, including the global edge thresholding used here, a more complicated node-specific edge thresholding, or even spectral sparsification Spielman and Teng (2010), and each of these approaches comes with parameters that need to be tuned. Our evaluation results indicate that node2vec+ does no worse than node2vec, and in some cases, when the weighted edges are noisy and an appropriate graph sparsification is not trivial to apply, node2vec+ can perform significantly better.

In the study-bias holdout setting (Figure 6, bottom row), node2vec+ and node2vec are comparable, with no significant difference. This is likely because most of the gene sets that distinguish the two methods, e.g., those that are scarcer and harder to predict, are removed from the gene set collections during the filtering step required for the study-bias holdout. Meanwhile, node2vec+ performs similarly to GCN and, in some cases, such as GOBP and KEGGBP using STRING, significantly outperforms it. GraphSAGE, on the other hand, performs much better than GCN. Yet, except on GIANT-TN-c01, node2vec+ still holds up well against GraphSAGE. Overall, despite having the substantial advantages of more training examples and full supervision, GCN and GraphSAGE did not entirely outperform the semi-supervised node2vec+.

Discussion and conclusion

In this paper, we proposed an improved version of the second-order random walk in node2vec for weighted graphs that takes edge weights into account and thereby effectively identifies potential out edges. Consequently, the corresponding embeddings are improved whenever out-biased walks are appropriate for the task. The proposed modification, node2vec+, is a natural extension of node2vec to weighted graphs. We empirically confirmed, via the evaluation on BlogCatalog, that node2vec+ reduces to node2vec for unweighted graphs. We note that node2vec+ also serves as a general procedure for biased random walks and thus can be easily adapted to other methods built on top of node2vec, such as KG2Vec Wang et al. (2021) and Het-node2vec Valentini et al. (2021).

We illustrated the ability of node2vec+ to identify potential out edges on weighted graphs, as opposed to node2vec, using a synthetic barbell graph with uniformly added noise. Furthermore, using synthetic hierarchical cluster graphs and real-world datasets like Wikipedia, we showed that in many cases, when the graphs are weighted, node2vec+ outperforms node2vec. Finally, we evaluated node2vec+ against GCN and GraphSAGE using a variety of challenging gene classification tasks. We effectively managed the data imbalance issue for GNNs by applying negative-to-positive scaling ratios in the loss function. In our setup, although the GNN methods have some clear advantages over node2vec+, i.e., more training examples and full supervision, the performance of node2vec+ is still comparable to the GNNs on the GIANT-TN and STRING networks.

Although node2vec+ does not outperform GraphSAGE, we emphasize that our main contribution is an improved biased random walk strategy. In fact, as a future direction, we would like to explore unsupervised GraphSAGE, which is trained similarly to the skip-gram with negative sampling using first-order random walks on the graph (equation (2)), by replacing the first-order random walks with the node2vec+ biased random walks. Using node2vec+ as the biased random walk strategy is promising for further improving unsupervised GraphSAGE because 1) node2vec outperforms DeepWalk thanks to its more flexible searching strategies, and 2) node2vec+ identifies potential out edges, thus providing the intended biased searching strategies on weighted graphs.

References

  • S. K. Ata, L. Ou-Yang, Y. Fang, C. Kwoh, M. Wu, and X. Li (2018) Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC Systems Biology 12 (9), pp. 138 (en). External Links: ISSN 1752-0509, Link, Document Cited by: Introduction, Introduction.
  • M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković (2021) Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv:2104.13478 [cs, stat]. Note: arXiv: 2104.13478 External Links: Link Cited by: Introduction.
  • S. Cao, W. Lu, and Q. Xu (2015) GraRep: Learning Graph Representations with Global Structural Information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, New York, NY, USA, pp. 891–900. External Links: ISBN 978-1-4503-3794-6, Link, Document Cited by: Baseline Methods.
  • S. Chattopadhyay and D. Ganguly (2020) Community Structure aware Embedding of Nodes in a Network. arXiv:2006.15313 [physics]. Note: arXiv: 2006.15313 External Links: Link Cited by: Introduction.
  • P. Cui, X. Wang, J. Pei, and W. Zhu (2018) A Survey on Network Embedding. IEEE Transactions on Knowledge and Data Engineering, pp. 1–1. External Links: ISSN 1041-4347, Document Cited by: Introduction.
  • A. Davison and M. Austern (2021) Asymptotics of Network Embeddings Learned via Subsampling. arXiv:2107.02363 [cs, math, stat]. Note: arXiv: 2107.02363 External Links: Link Cited by: Introduction.
  • V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking Graph Neural Networks. arXiv:2003.00982 [cs, stat]. Note: arXiv: 2003.00982 External Links: Link Cited by: Hierarchical CLUSTER graph.
  • V. Gligorijevic, M. Barot, and R. Bonneau (2018) deepNF: deep network fusion for protein function prediction. Bioinformatics (Oxford, England) 34 (22), pp. 3873–3881 (eng). External Links: ISSN 1367-4811, Document Cited by: Baseline Methods.
  • C. S. Greene, A. Krishnan, A. K. Wong, E. Ricciotti, R. A. Zelaya, D. S. Himmelstein, R. Zhang, B. M. Hartmann, E. Zaslavsky, S. C. Sealfon, D. I. Chasman, G. A. FitzGerald, K. Dolinski, T. Grosser, and O. G. Troyanskaya (2015) Understanding multicellular function and disease with human tissue-specific networks. Nature genetics 47 (6), pp. 569–576. External Links: ISSN 1061-4036, Link, Document Cited by: Introduction, Real world datasets.
  • M. Grohe (2020) Word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. PODS. External Links: Document Cited by: Introduction.
  • A. Grover and J. Leskovec (2016) Node2Vec: Scalable Feature Learning for Networks. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 855–864. Note: event-place: San Francisco, California, USA External Links: ISBN 978-1-4503-4232-2, Link, Document Cited by: Introduction, Node2vec overview, Out edge (1b, 1d), Baseline Methods.
  • C. Hacker (2021) K-simplex2vec: a simplicial extension of node2vec. arXiv:2010.05636 [cs, math]. Note: arXiv: 2010.05636 External Links: Link Cited by: Introduction.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive Representation Learning on Large Graphs. arXiv:1706.02216 [cs, stat]. Note: arXiv: 1706.02216 External Links: Link Cited by: Introduction, Introduction, Baseline Methods.
  • F. Hu, J. Liu, L. Li, and J. Liang (2020) Community detection in complex networks using Node2vec with spectral clustering. Physica A: Statistical Mechanics and its Applications 545, pp. 123633 (en). External Links: ISSN 0378-4371, Link, Document Cited by: Introduction.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2021) Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv:2005.00687 [cs, stat]. Note: arXiv: 2005.00687 External Links: Link Cited by: Introduction, Baseline Methods.
  • T. N. Kipf and M. Welling (2016) Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907 [cs, stat]. Note: arXiv: 1609.02907 External Links: Link Cited by: Introduction, Baseline Methods.
  • R. Liu and A. Krishnan (2021) PecanPy: a fast, efficient and parallelized Python implementation of node2vec. Bioinformatics (btab202). External Links: ISSN 1367-4803, Link, Document Cited by: In edge (1c, 1f).
  • R. Liu, C. A. Mancuso, A. Yannakopoulos, K. A. Johnson, and A. Krishnan (2020) Supervised learning is an accurate method for network-based gene classification. Bioinformatics 36 (11), pp. 3457–3465. Note: _eprint: https://academic.oup.com/bioinformatics/article-pdf/36/11/3457/33329234/btaa150.pdf External Links: ISSN 1367-4803, Link, Document Cited by: Introduction, Introduction, Experiment setup, Experiment setup.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs]. Note: arXiv: 1301.3781 External Links: Link Cited by: Node2vec overview.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 [cs, stat]. Note: arXiv: 1310.4546 External Links: Link Cited by: Node2vec overview.
  • W. Nelson, M. Zitnik, B. Wang, J. Leskovec, A. Goldenberg, and R. Sharan (2019) To Embed or Not: Network Embedding as a Paradigm in Computational Biology. Frontiers in Genetics 10, pp. 381. External Links: ISSN 1664-8021, Link, Document Cited by: Introduction.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) DeepWalk: Online Learning of Social Representations. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pp. 701–710. Note: arXiv: 1403.6652 External Links: Link, Document Cited by: Node2vec overview, Baseline Methods.
  • J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018) Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, New York, NY, USA, pp. 459–467. External Links: ISBN 978-1-4503-5581-0, Link, Document Cited by: Introduction.
  • D. A. Spielman and S. Teng (2010) Spectral Sparsification of Graphs. arXiv:0808.4134 [cs]. Note: arXiv: 0808.4134 External Links: Link Cited by: Experiment results.
  • D. Szklarczyk, A. L. Gable, K. C. Nastou, D. Lyon, R. Kirsch, S. Pyysalo, N. T. Doncheva, M. Legeay, T. Fang, P. Bork, L. J. Jensen, and C. von Mering (2021) The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Research 49 (D1), pp. D605–D612. External Links: ISSN 0305-1048, Link, Document Cited by: Real world datasets.
  • J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) LINE: Large-scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15, Republic and Canton of Geneva, CHE, pp. 1067–1077. External Links: ISBN 978-1-4503-3469-3, Link, Document Cited by: Baseline Methods.
  • G. Valentini, E. Casiraghi, L. Cappelletti, V. Ravanmehr, T. Fontana, J. Reese, and P. Robinson (2021) Het-node2vec: second order random walk sampling for heterogeneous multigraphs embedding. arXiv:2101.01425 [physics]. Note: arXiv: 2101.01425 External Links: Link Cited by: Introduction, Discussion and conclusion.
  • N. Wang, M. Zeng, Y. Li, F. Wu, and M. Li (2021) Essential Protein Prediction Based on node2vec and XGBoost. Journal of Computational Biology 28 (7), pp. 687–700. Note: Publisher: Mary Ann Liebert, Inc., publishers External Links: Link, Document Cited by: Introduction.
  • Y. Wang, L. Dong, X. Jiang, X. Ma, Y. Li, and H. Zhang (2021) KG2Vec: A node2vec-based vectorization model for knowledge graph. PLOS ONE 16 (3), pp. e0248552 (en). Note: Publisher: Public Library of Science External Links: ISSN 1932-6203, Link, Document Cited by: Introduction, Discussion and conclusion.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2021) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 4–24. Note: arXiv: 1901.00596 External Links: ISSN 2162-237X, 2162-2388, Link, Document Cited by: Introduction.
  • F. Xia, K. Sun, S. Yu, A. Aziz, L. Wan, S. Pan, and H. Liu (2021) Graph Learning: A Survey. IEEE Transactions on Artificial Intelligence 1 (01), pp. 1–1 (English). Note: Publisher: IEEE Computer Society External Links: ISSN 2691-4581, Link, Document Cited by: Introduction.
  • X. Yue, Z. Wang, J. Huang, S. Parthasarathy, S. Moosavinasab, Y. Huang, S. M. Lin, W. Zhang, P. Zhang, and H. Sun (2019) Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations. arXiv:1906.05017 [cs]. Note: arXiv: 1906.05017 External Links: Link Cited by: Introduction.
  • M. Zeng, M. Li, Z. Fei, F. Wu, Y. Li, Y. Pan, and J. Wang (2021) A Deep Learning Framework for Identifying Essential Proteins by Integrating Multiple Types of Biological Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 18 (1), pp. 296–305. Note: Conference Name: IEEE/ACM Transactions on Computational Biology and Bioinformatics External Links: ISSN 1557-9964, Document Cited by: Introduction.
  • X. Zhang, L. Liang, L. Liu, and M. Tang (2021) Graph Neural Networks and Their Current Applications in Bioinformatics. Frontiers in Genetics 12, pp. 1073. External Links: ISSN 1664-8021, Link, Document Cited by: Introduction.
  • J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2021) Graph Neural Networks: A Review of Methods and Applications. arXiv:1812.08434 [cs, stat]. Note: arXiv: 1812.08434 External Links: Link Cited by: Introduction.