NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such a matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix, which is dense, is prohibitively expensive in terms of both time and space, making it not scalable for large networks. In this work, we present the algorithm of large-scale network embedding as sparse matrix factorization (NetSMF). NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved efficiency in embedding learning. The sparsified matrix is spectrally close to the original dense one with a theoretically bounded approximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. Results show that among both popular benchmarks and factorization-based methods, NetSMF is the only method that achieves both high efficiency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, while it would cost DeepWalk months and is computationally infeasible for the dense matrix factorization solution. The source code of NetSMF is publicly available (https://github.com/xptree/NetSMF).
Recent years have witnessed the emergence of network embedding, which offers a revolutionary paradigm for modeling graphs and networks (Hamilton et al., 2017). The goal of network embedding is to automatically learn latent representations for objects in networks, such as vertices and edges. Significant lines of research have shown that the latent representations are capable of capturing the structural properties of networks, facilitating various downstream network applications, such as vertex classification and link prediction (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Dong et al., 2017).
Over the course of its development, the DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016) models have been commonly considered as powerful benchmark solutions for evaluating network embedding research. The advantage of LINE lies in its scalability for large-scale networks as it only models the first- and second-order proximities. That is to say, its embeddings lose the multi-hop dependencies in networks. DeepWalk and node2vec, on the other hand, leverage random walks on graphs and skip-gram (Mikolov et al., 2013a) with large context sizes to model nodes further away (i.e., global structures). Consequently, it is computationally more expensive for DeepWalk and node2vec to handle large-scale networks. For example, with the default parameter settings (Perozzi et al., 2014), DeepWalk requires months to embed an academic collaboration network of 67 million vertices and 895 million edges. [Footnote: With the default DeepWalk parameters (walk length: 40 and walks per node: 80), 214+ billion nodes (67M × 40 × 80) with a vocabulary size of 67 million are fed into skip-gram. As a reference, Mikolov et al. reported that training on Google News of 6 billion words and a vocabulary size of only 1 million cost 2.5 days with 125 CPU cores (Mikolov et al., 2013a).] The node2vec model, which performs high-order random walks, takes more time than DeepWalk to learn embeddings.
More recently, a study showed that both the DeepWalk and LINE methods can be viewed as implicit factorization of a closed-form matrix (Qiu et al., 2018). Building upon this theoretical foundation, the NetMF method was instead proposed to explicitly factorize this matrix, achieving more effective embeddings than DeepWalk and LINE. Unfortunately, it turns out that the matrix to be factorized is an $n \times n$ dense one, with $n$ being the number of vertices in the network, making it prohibitively expensive to directly construct and factorize for large-scale networks.
In light of these limitations of existing methods (see the summary in Table 1), we propose to study representation learning for large-scale networks with the goal of achieving efficiency, capturing global structural contexts, and having theoretical guarantees. Our idea is to find a sparse matrix that is spectrally close to the dense NetMF matrix implicitly factorized by DeepWalk. The sparsified matrix requires a lower cost for both construction and factorization. Meanwhile, making it spectrally close to the original NetMF matrix can guarantee that the spectral information of the network is maintained, and that the embeddings learned from the sparse matrix are as powerful as those learned from the dense NetMF matrix.
In this work, we present the solution to network embedding learning as sparse matrix factorization (NetSMF). NetSMF comprises three steps. First, it leverages the spectral graph sparsification technique (Cheng et al., 2015b, a) to find a sparsifier for a network’s random-walk matrix-polynomial. Second, it uses this sparsifier to construct a matrix with significantly fewer non-zeros than, but spectrally close to, the original NetMF matrix. Finally, it performs randomized singular value decomposition to efficiently factorize the sparsified NetSMF matrix, yielding the embeddings for the network.
With this design, NetSMF offers both efficiency and effectiveness with guarantees, as the approximation error of the sparsified matrix is theoretically bounded. We conduct experiments on five networks, which are representative of different scales and types. Experimental results show that for million-scale or larger networks, NetSMF achieves orders of magnitude speedup over NetMF, while maintaining competitive performance for the vertex classification task. In other words, both NetSMF and NetMF outperform well-recognized network embedding benchmarks (i.e., DeepWalk, LINE, and node2vec), but NetSMF addresses the computation challenge faced by NetMF.
To summarize, we introduce the idea of network embedding as sparse matrix factorization and present the NetSMF algorithm, which makes the following contributions to network embedding:
Efficiency. NetSMF reaches significantly lower time and space complexity than NetMF. Remarkably, NetSMF is able to generate embeddings for a large-scale academic network of 67 million vertices and 895 million edges on a single server in 24 hours, while it would cost months for DeepWalk and node2vec, and is computationally infeasible for NetMF on the same hardware.
Effectiveness. NetSMF is capable of learning embeddings that maintain the same representation power as the dense matrix factorization solution, making it consistently outperform DeepWalk and node2vec by up to 34% and LINE by up to 100% for the multi-label vertex classification task in networks.
Theoretical Guarantee. NetSMF’s efficiency and effectiveness are theoretically backed up. The sparse NetSMF matrix is spectrally close to the exact NetMF matrix, and the approximation error can be bounded, maintaining the representation power of its sparsely learned embeddings.
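|Notation||Description|
|$V$||vertex set of $G$ with $\lvert V \rvert = n$|
|$E$||edge set of $G$ with $\lvert E \rvert = m$|
|$A$||adjacency matrix of $G$|
|$D$||degree matrix of $G$|
|$b$||number of negative samples|
|$T$||context window size|
|$L$||random-walk molynomial of $G$ (Eq. (4))|
|$\mathrm{trunc\_log}^{\circ}\left(\frac{\mathrm{vol}(G)}{b}\tilde{M}\right)$||NetMF matrix sparsifier|
|$M$||number of non-zeros in $\tilde{L}$|
|$[T]$||set $\{1, 2, \ldots, T\}$ for positive integer $T$|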
Commonly, the problem of network embedding is formalized as follows: Given an undirected and weighted network $G = (V, E, A)$ with $V$ as the vertex set of $n$ vertices, $E$ as the edge set of $m$ edges, and $A$ as the adjacency matrix, the goal is to learn a function $f: V \to \mathbb{R}^d$ that maps each vertex to a $d$-dimensional ($d \ll n$) vector that captures its structural properties, e.g., community structures. The vector representation of each vertex can be fed into downstream applications such as link prediction and vertex classification.
One of the pioneering works on network embedding is the DeepWalk model (Perozzi et al., 2014), which has been consistently considered a powerful benchmark over the past years (Hamilton et al., 2017). In brief, DeepWalk consists of two steps. First, it generates several vertex sequences by random walks over a network; second, it applies the skip-gram model (Mikolov et al., 2013b) on the generated vertex sequences to learn the latent representations for each vertex. Commonly, skip-gram is parameterized with the context window size $T$ and the number of negative samples $b$. Recently, a theoretical study (Qiu et al., 2018) revealed that DeepWalk essentially factorizes a matrix derived from the random walk process. More formally, it proves that when the length of random walks goes to infinity, DeepWalk implicitly and asymptotically factorizes the following matrix:
$$\log^{\circ}\left(\frac{\mathrm{vol}(G)}{b} M\right), \tag{1}$$
where $\mathrm{vol}(G) = \sum_i \sum_j A_{ij}$ denotes the volume of the graph, and
$$M = \frac{1}{T} \sum_{r=1}^{T} \left(D^{-1} A\right)^r D^{-1}, \tag{2}$$
where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix with $d_i = \sum_j A_{ij}$ as the generalized degree of the $i$-th vertex. Note that $\log^{\circ}(\cdot)$ represents the element-wise matrix logarithm (Horn and Johnson, 1991), which is different from the matrix logarithm. In other words, the matrix in Eq. (1) can be characterized as the result of applying element-wise matrix logarithm (i.e., $\log^{\circ}$) to matrix $\frac{\mathrm{vol}(G)}{b} M$.
The matrix in Eq. (1) offers an alternative view of the skip-gram based network embedding methods. Further, Qiu et al. provide an explicit matrix factorization approach named NetMF to learn the embeddings (Qiu et al., 2018). They show that the accuracy for vertex classification based on the embeddings from NetMF outperforms that based on DeepWalk and LINE. Note that the matrix in Eq. (1) would be ill-defined if there exists a pair of vertices unreachable in $T$ hops, because $\log(0) = -\infty$. So following Levy and Goldberg (Levy and Goldberg, 2014), NetMF uses the logarithm truncated at point one, that is, $\mathrm{trunc\_log}(x) = \max(0, \log(x))$. Thus, NetMF targets to factorize the matrix
$$\mathrm{trunc\_log}^{\circ}\left(\frac{\mathrm{vol}(G)}{b} M\right). \tag{3}$$
In the rest of this work, we refer to the matrix in Eq. (3) as the NetMF matrix.
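To make the definition concrete, the following is a minimal sketch that builds the dense NetMF matrix for a toy graph with NumPy. The function name `netmf_matrix` and the toy path graph are illustrative assumptions, not part of the original implementation; the sketch also assumes no isolated vertices, so that every degree is positive.

```python
import numpy as np

def netmf_matrix(A, T=10, b=1.0):
    """Dense NetMF matrix: trunc_log(vol(G)/(b*T) * sum_{r=1..T} (D^-1 A)^r D^-1)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)              # generalized degrees d_i (assumed > 0)
    vol = deg.sum()                  # vol(G)
    P = A / deg[:, None]             # random-walk transition matrix D^-1 A
    P_power = np.eye(A.shape[0])
    S = np.zeros_like(A)
    for _ in range(T):
        P_power = P_power @ P        # (D^-1 A)^r, a dense n x n product
        S += P_power
    M = S / T / deg[None, :]         # average the powers, then right-multiply by D^-1
    # trunc_log(x) = max(0, log x): entries <= 1 (including zeros) map to 0
    return np.log(np.maximum(vol / b * M, 1.0))

# Toy example (hypothetical data): a path graph on 4 vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
X = netmf_matrix(A, T=2, b=1.0)
```

Each loop iteration is a dense matrix product, which is the cost that motivates the sparsification developed later.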
However, there exist a couple of challenges when leveraging the NetMF matrix in practice. First, almost every pair of vertices within distance $T$ corresponds to a non-zero entry in the NetMF matrix. Recall that many social and information networks exhibit the small-world property where most vertices can be reached from each other in a small number of steps. For example, as of the year 2012, 92% of the reachable pairs in Facebook are at distance five or less (Backstrom et al., 2012). As a consequence, even with a moderate context window size (e.g., the default setting $T = 10$ in DeepWalk), the NetMF matrix in Eq. (3) would be a dense matrix with $O(n^2)$ non-zeros. The exact construction and factorization of such a matrix is impractical for large-scale networks. More concretely, computing the matrix power in Eq. (2) involves dense matrix multiplication, which costs $O(n^3)$ time; factorizing a dense $n \times n$ matrix is also time consuming. To reduce the construction cost, NetMF approximates $M$ with its top eigen pairs. However, the approximated matrix is still dense, making this strategy unable to handle large networks.
In this work, we aim to address the efficiency and scalability limitation of NetMF, while maintaining its superiority in effectiveness. We list necessary notations and their descriptions in Table 2.
In this section, we develop network embedding as sparse matrix factorization (NetSMF). We present the NetSMF method to construct and factorize a sparse matrix that approximates the dense NetMF matrix. The main technique we leverage is random-walk matrix-polynomial (molynomial) sparsification.
We first introduce the definition of spectral similarity and the theorem of random-walk molynomial sparsification.
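Definition 1 (Spectral Similarity of Networks). Suppose $G = (V, E, A)$ and $\tilde{G} = (V, \tilde{E}, \tilde{A})$ are two weighted undirected networks. Let $L$ and $\tilde{L}$ be their Laplacian matrices, respectively. We define $G$ and $\tilde{G}$ to be $(1+\epsilon)$-spectrally similar if
$$\forall x \in \mathbb{R}^n, \quad (1-\epsilon)\, x^\top \tilde{L} x \le x^\top L x \le (1+\epsilon)\, x^\top \tilde{L} x.$$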
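Theorem 2 (Spectral Sparsifiers of Random-Walk Molynomials; Cheng et al., 2015b). For a random-walk molynomial
$$L = D - \sum_{r=1}^{T} \alpha_r D \left(D^{-1} A\right)^r, \tag{4}$$
where $\sum_{r=1}^{T} \alpha_r = 1$ and the $\alpha_r$ are non-negative, one can construct a $(1+\epsilon)$-spectral sparsifier $\tilde{L}$ with $O(n \log n\, \epsilon^{-2})$ non-zeros.
To achieve a sparsifier with $O(n \log n\, \epsilon^{-2})$ non-zeros, the sparsification algorithm consists of two steps: The first step obtains an initial sparsifier for $L$ with $O(T m \epsilon^{-2} \log n)$ non-zeros. The second step then applies the standard spectral sparsification algorithm (Spielman and Srivastava, 2011) to further reduce the number of non-zeros to $O(n \log n\, \epsilon^{-2})$. In this work, we only adopt the first step because a sparsifier with $O(T m \epsilon^{-2} \log n)$ non-zeros is sparse enough for our task. Thus we skip the second step, which involves additional computations. From now on, when referring to the random-walk molynomial sparsification algorithm in this work, we mean its first step only.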
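Thm. 2 can help us construct a sparsifier $\tilde{L}$ for the random-walk molynomial $L$ of Eq. (4) with uniform coefficients $\alpha_r = 1/T$, i.e., $L = D - \frac{1}{T} \sum_{r=1}^{T} D \left(D^{-1} A\right)^r$. With this choice, the matrix $M$ in Eq. (2) can be written as
$$M = D^{-1} (D - L) D^{-1}. \tag{5}$$
Then we define $\tilde{M} = D^{-1} (D - \tilde{L}) D^{-1}$ by replacing $L$ in Eq. (5) with its sparsifier $\tilde{L}$. One can observe that matrix $\tilde{M}$ is still a sparse one with the same order of magnitude of non-zeros as $\tilde{L}$. Consequently, instead of factorizing the dense NetMF matrix in Eq. (3), we can factorize its sparse alternative, i.e.,
$$\mathrm{trunc\_log}^{\circ}\left(\frac{\mathrm{vol}(G)}{b} \tilde{M}\right). \tag{6}$$
In the rest of this work, the matrix in Eq. (6) is referred to as the NetMF matrix sparsifier.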
In this section, we formally describe the NetSMF algorithm, which consists of three steps: random-walk molynomial sparsification, NetMF sparsifier construction, and truncated singular value decomposition.
Step 1: Random-Walk Molynomial Sparsification. To achieve the sparsifier $\tilde{L}$, we adopt the algorithm in Cheng et al. (2015b). The algorithm starts by creating a network $\tilde{G}$ that has the same vertex set as $G$ and an empty edge set (Alg. 1, Line 1). Next, the algorithm constructs a sparsifier with $O(M)$ non-zeros by repeating the PathSampling algorithm $M$ times. In each iteration, it picks an edge $e = (u, v) \in E$ and an integer $r \in [T]$ uniformly (Alg. 1, Line 3-4). Then, the algorithm uniformly draws an integer $k \in [r]$ and performs $(k-1)$-step and $(r-k)$-step random walks starting from the two endpoints of edge $e$, respectively (Alg. 2, Line 3-4). The above process samples a length-$r$ path $p = (u_0, u_1, \ldots, u_r)$. At the same time, the algorithm keeps track of $Z(p)$, which is defined by
$$Z(p) = \sum_{i=1}^{r} \frac{2}{A_{u_{i-1}, u_i}}, \tag{7}$$
and then adds a new edge $(u_0, u_r)$ with weight $\frac{2rm}{M Z(p)}$ to $\tilde{G}$ (Alg. 1, Line 6). [Footnote: Details about how the edge weight is derived can be found in Thm. 7 in the Appendix.] Parallel edges in $\tilde{G}$ will be merged into one single edge, with their weights summed up together. Finally, the algorithm computes the Laplacian of $\tilde{G}$, which is the sparsifier $\tilde{L}$ we desire (Alg. 1, Line 8). This step gives us a sparsifier with $O(M)$ non-zeros.
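The sampling loop described in Step 1 can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' implementation: the function name `path_sampling`, the dict-of-dicts adjacency format, and the edge-weight expression `2 * r * m / (num_samples * z)` (taken from the usual statement of the algorithm) are assumptions of this sketch.

```python
import random
from collections import defaultdict

def path_sampling(adj, edges, T, num_samples, seed=0):
    """adj: {u: {v: weight}} for an undirected graph; edges: list of (u, v) pairs."""
    rng = random.Random(seed)
    m = len(edges)
    weights = defaultdict(float)

    def walk(start, steps):
        """Weighted random walk; returns the end vertex and the sum of 2/A over its edges."""
        u, z = start, 0.0
        for _ in range(steps):
            nbrs = list(adj[u].items())
            v, w = rng.choices(nbrs, weights=[w for _, w in nbrs])[0]
            z += 2.0 / w
            u = v
        return u, z

    for _ in range(num_samples):
        u, v = rng.choice(edges)        # uniform edge
        r = rng.randint(1, T)           # path length, uniform in {1, ..., T}
        k = rng.randint(1, r)           # position of the sampled edge on the path
        u0, z_left = walk(u, k - 1)     # (k-1)-step walk from one endpoint
        ur, z_right = walk(v, r - k)    # (r-k)-step walk from the other endpoint
        z = z_left + z_right + 2.0 / adj[u][v]
        key = (min(u0, ur), max(u0, ur))
        # Parallel edges are merged by summing their weights.
        weights[key] += 2.0 * r * m / (num_samples * z)
    return dict(weights)

# Toy example (hypothetical data): an unweighted triangle.
adj = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0}}
edges = [(0, 1), (0, 2), (1, 2)]
sparsifier = path_sampling(adj, edges, T=3, num_samples=1000)
```

On an unweighted graph every path of length r has Z(p) = 2r, so each sample contributes weight m / num_samples and the total weight of the sampled graph equals m.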
Step 2: Construct a NetMF Matrix Sparsifier. As we have discussed at the end of Section 3.1, after constructing a sparsifier $\tilde{L}$, we can plug it into Eq. (5) to obtain a NetMF matrix sparsifier as shown in Eq. (6) (Alg. 1, Line 9-10). This step does not change the order of magnitude of non-zeros in the sparsifier.
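As a rough illustration of Step 2, the sketch below applies the degree normalization and the truncated logarithm to sparsifier entries stored as triplets. The function name `netmf_sparsifier_entries` and the input layout are hypothetical; for simplicity the sketch treats only the off-diagonal entries given by the sampled edge weights and assumes both directions of each undirected edge are present.

```python
import numpy as np

def netmf_sparsifier_entries(rows, cols, weights, degrees, b=1.0):
    """Apply the degree scaling and trunc_log to COO triplets of the sparsifier.

    rows, cols, weights: merged edges of the sparsifier graph.
    degrees: generalized degrees d_i of the ORIGINAL network.
    """
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    weights = np.asarray(weights, dtype=float)
    degrees = np.asarray(degrees, dtype=float)
    vol = degrees.sum()
    # Entry (u, v) of vol(G)/b * D^-1 (.) D^-1 for an edge of weight w_uv.
    x = vol / b * weights / (degrees[rows] * degrees[cols])
    values = np.log(np.maximum(x, 1.0))   # trunc_log(x) = max(0, log x)
    keep = values > 0                     # entries truncated to 0 drop out
    return rows[keep], cols[keep], values[keep]

# Toy triplets (hypothetical data): one strong edge and one weak edge.
rows, cols, vals = netmf_sparsifier_entries(
    rows=[0, 1, 1, 2], cols=[1, 0, 2, 1],
    weights=[1.0, 1.0, 0.1, 0.1], degrees=[1.0, 2.0, 1.0])
```

Entries whose scaled value is at most one are removed by the truncation, so this step can only reduce the number of stored non-zeros.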
Step 3: Truncated Singular Value Decomposition. The final step is to perform truncated singular value decomposition (SVD) on the constructed NetMF matrix sparsifier (Eq. (6)). However, even though the sparsifier has only $O(M)$ non-zeros, performing exact SVD is still time consuming. In this work, we leverage a modern randomized matrix approximation technique, Randomized SVD, developed by Halko et al. (2011). Due to space constraints, we cannot include many details. Briefly speaking, the algorithm projects the original matrix to a low-dimensional space through a Gaussian random matrix, so that one only needs to perform traditional SVD (e.g., Jacobi SVD) on a small matrix. We list the pseudocode in Alg. 3. Another advantage of SVD is that we can determine the dimensionality of embeddings by using, for example, Cattell’s Scree test (Cattell, 1966). In the test, we plot the singular values and select a rank such that there is a clear drop in the magnitudes or the singular values start to even out. More details will be discussed in Section 4.
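The idea of projecting through a Gaussian random matrix and decomposing a small matrix can be sketched with NumPy as below. This is a generic Halko-style randomized SVD, not a transcription of Alg. 3; the function name, the `oversample` parameter, and the choice of scaling the left singular vectors by the square root of the singular values are assumptions of this sketch, and the input is taken to be a dense array.

```python
import numpy as np

def randomized_svd_embedding(A, d, oversample=10, seed=0):
    """Rank-d embedding of A via a Gaussian range finder and a small exact SVD."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    omega = rng.standard_normal((n, d + oversample))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ omega)                      # orthonormal basis of the sampled range
    B = Q.T @ A                                         # small (d + oversample) x n matrix
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                                     # lift back to the original space
    return U[:, :d] * np.sqrt(s[:d]), s[:d]

# Toy check (hypothetical data): an exactly rank-3 symmetric matrix.
X = np.random.default_rng(1).standard_normal((50, 3))
A = X @ X.T
embedding, sing_vals = randomized_svd_embedding(A, d=3)
```

The only operations touching the large matrix are the two products `A @ omega` and `Q.T @ A`, which is where a sparse input pays off.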
Complexity Analysis. Now we analyze the time and space complexity of NetSMF, as summarized in Table 3. As for step 1, we call the PathSampling algorithm $M$ times, during each of which it performs $O(T)$ steps of random walks over the network. For unweighted networks, sampling a neighbor requires $O(1)$ time, while for weighted networks, one can use roulette wheel selection to choose a neighbor in $O(\log n)$ time. It takes $O(M)$ space to store $\tilde{G}$, while the additional $O(n + m)$ space comes from the storage of the input network. As for step 2, it takes $O(M)$ time to perform the transformation in Eq. (5) and the element-wise truncated logarithm in Eq. (6). The additional $O(n)$ space is spent in storing the degree matrix. As for step 3, $O(Md)$ time is required to compute the product of a row-major sparse matrix and a dense matrix (Alg. 3, Lines 3 and 5); $O(nd^2)$ time is spent in Gram-Schmidt orthogonalization (Alg. 3, Lines 4 and 8); $O(d^3)$ time is spent in Jacobi SVD (Alg. 3, Line 10).
Connection to NetMF. The major difference between NetMF and NetSMF lies in the approximation strategy of the NetMF matrix in Eq. (3). As we mentioned in Section 2, NetMF approximates it with a dense matrix, which brings new space and computation challenges. In this work, NetSMF aims to find a sparse approximator to the NetMF matrix by leveraging theories and techniques from spectral graph sparsification.
Example. We provide a running example to help understand the NetSMF algorithm. Suppose we want to learn embeddings for a network with $n = 10^6$ vertices, $m = 10^7$ edges, context window size $T = 10$, and approximation factor $\epsilon = 0.1$. The NetSMF method calls the PathSampling algorithm $M = T m \epsilon^{-2} \log n \approx 1.4 \times 10^{11}$ times and provides us with a NetMF matrix sparsifier with at most $1.4 \times 10^{11}$ non-zeros (notice that the reducer in Step 1 and the truncated logarithm in Step 2 will further sparsify the matrix, making $1.4 \times 10^{11}$ an upper bound). The density of the sparsifier is at most $M / n^2 \approx 14\%$. Then, when computing the sparse-dense matrix product in randomized SVD (Alg. 3, Lines 3 and 5), the sparseness of the factorized matrix can greatly accelerate the calculation. In comparison, NetMF must construct a dense matrix with $n^2 = 10^{12}$ non-zeros, which is an order of magnitude larger in terms of density. Also, the density of the sparsifier in NetSMF can be further reduced by using a larger $\epsilon$, while NetMF does not have this flexibility.
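The arithmetic behind such an example can be reproduced with a few lines of Python. The problem sizes below are illustrative values, the constant hidden in the $O(\cdot)$ bound is assumed to be 1, and the logarithm is assumed to be natural; all three are assumptions of this sketch.

```python
import math

# Illustrative problem size (assumed values).
n = 10**6        # vertices
m = 10**7        # edges
T = 10           # context window size
eps = 0.1        # approximation factor

# Number of PathSampling calls, taking the constant inside O(.) to be 1.
M = T * m * eps**-2 * math.log(n)

density_sparse = M / n**2      # upper bound on the sparsifier's density
dense_nnz = n**2               # non-zeros of a fully dense n x n matrix

print(f"samples M          ~ {M:.2e}")
print(f"sparsifier density <= {density_sparse:.1%}")
print(f"dense non-zeros    =  {dense_nnz:.0e}")
```

Because $M$ scales with $\epsilon^{-2}$, doubling $\epsilon$ in this sketch cuts the density bound by a factor of four.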
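In this section, we analyze the approximation error of the sparsification. We assume that we choose an approximation factor $\epsilon < 0.5$. We first see how the constructed $\tilde{M}$ approximates $M$ and then compare the NetMF matrix (Eq. (3)) against the NetMF matrix sparsifier (Eq. (6)). We use $\sigma_i$ to denote the $i$-th descending-order singular value of a matrix. We also assume the vertices’ degrees are sorted in ascending order, that is, $d_{\min} = d_1 \le d_2 \le \cdots \le d_n$.
Theorem 3. The singular value of $\tilde{M} - M$ satisfies $\sigma_i(\tilde{M} - M) \le \frac{4\epsilon}{\sqrt{d_i d_{\min}}}$ for all $i \in [n]$.
Theorem 4. Let $\lVert \cdot \rVert_F$ be the matrix Frobenius norm. Then
$$\left\lVert \mathrm{trunc\_log}^{\circ}\left(\frac{\mathrm{vol}(G)}{b} \tilde{M}\right) - \mathrm{trunc\_log}^{\circ}\left(\frac{\mathrm{vol}(G)}{b} M\right) \right\rVert_F \le \frac{4 \epsilon\, \mathrm{vol}(G)}{b \sqrt{d_{\min}}} \sqrt{\sum_{i=1}^{n} \frac{1}{d_i}}.$$
Proof. See Appendix. ∎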
Discussion on the Approximation Error. The above bound is achieved without making assumptions about the input network. If we introduce some assumptions, say a bounded lowest degree or a specific random graph model (e.g., Planted Partition Model or Extended Planted Partition Model), it is promising to explore tighter bounds by leveraging theorems in literature (Dasgupta et al., 2004; Chaudhuri et al., 2012).
Each step of NetSMF can be parallelized, enabling it to scale to very large networks. The parallelization design of NetSMF is introduced in Figure 1. Below we discuss the parallelization of each step in detail.
At the first step, the paths in the PathSampling algorithm are sampled independently of each other. Thus we can launch multiple PathSampling workers simultaneously. Each worker handles a subset of the samples. Herein, we require that each worker is able to access the network data efficiently. There are many options to meet this requirement. The easiest one is to load a copy of the network data into each worker’s memory. When the network is extremely large (e.g., trillion scale) or workers have memory constraints, the graph engine should be designed to expose efficient graph query APIs to support graph operations such as random walks. At the end of this step, a reducer is designed to merge parallel edges and sum up their weights. If this step is implemented in a big data system such as Spark (Zaharia et al., 2010), the reduction step can be simply achieved by running a reduceByKey(_+_) function (see https://spark.apache.org/docs/latest/rdd-programming-guide.html). After the reduction, the sparsifier is organized as a collection of triplets $(u, v, w)$, a.k.a. COOrdinate format, with each triplet indicating an entry of the sparsifier.
The second step is the most straightforward step to scale up. When processing a triplet $(u, v, w)$, we can simply query the degrees of vertices $u$ and $v$ and perform the transformation defined in Eq. (5) as well as the truncated logarithm in Eq. (6), which can be well parallelized.
For the last step, we organize the sparsifier into row-major format. This format allows efficient multiplication between a sparse and a dense matrix (Alg. 3, Lines 3 and 5). Other dense matrix operators (e.g., Gaussian random matrix generation, Gram-Schmidt orthogonalization, and Jacobi SVD) can be easily accelerated by using multi-threading or common linear algebra libraries.
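The reducer that merges parallel edges can be illustrated with a small single-process analogue of a reduce-by-key operation. The function name `merge_parallel_edges` and the list-of-lists input format are assumptions for illustration; a distributed or multi-threaded implementation would partition the keys across workers instead.

```python
from collections import defaultdict

def merge_parallel_edges(worker_outputs):
    """Sum the weights of identical (u, v) keys across all workers' samples.

    worker_outputs: iterable of lists of (u, v, weight) triplets, one list per worker.
    Returns the merged sparsifier as sorted COO triplets.
    """
    merged = defaultdict(float)
    for samples in worker_outputs:
        for u, v, w in samples:
            merged[(u, v)] += w
    return [(u, v, w) for (u, v), w in sorted(merged.items())]

# Two hypothetical workers that both sampled the edge (0, 1).
triplets = merge_parallel_edges([
    [(0, 1, 0.5), (1, 2, 0.25)],
    [(0, 1, 0.5)],
])
```

Sorting the keys by row index also yields the row-major ordering that the factorization step benefits from.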
In this work, we adopt a single-machine shared-memory implementation. We use OpenMP (Dagum and Menon, 1998) to parallelize NetSMF in our implementation. The code is publicly available at https://github.com/xptree/NetSMF.
In this section, we evaluate the proposed NetSMF method on the multi-label vertex classification task, which has been commonly used to evaluate previous network embedding techniques (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Qiu et al., 2018). We introduce our datasets and baselines in Section 4.1 and Section 4.2. We report experimental results and parameter analysis in Section 4.3 and Section 4.4, respectively.
We employ five datasets for the prediction task. Four of them (BlogCatalog, PPI, Flickr, and YouTube) are of relatively small scale but have been widely used in the network embedding literature. The remaining one is a large-scale academic co-authorship network, which is at least two orders of magnitude larger than the largest one (YouTube) used in most network embedding studies. The statistics of these datasets are listed in Table 4.
Protein-Protein Interactions (PPI) (Stark et al., 2010) is a subgraph of the PPI network for Homo Sapiens. The vertex labels are obtained from the hallmark gene sets and represent biological states.
Flickr (Tang and Liu, 2009a) is the user contact network in Flickr. The labels represent the interest groups of the users.
YouTube (Tang and Liu, 2009b) is a video-sharing website that allows users to upload, view, rate, share, add to their favorites, report, and comment on videos. The users are labeled by the video genres they liked.
Open Academic Graph (OAG) (www.openacademic.ai/oag/) is an academic graph indexed by Microsoft Academic (Sinha et al., 2015) and AMiner.org (Tang et al., 2008). We construct an undirected co-authorship network from OAG, which contains 67,768,244 authors and 895,368,962 collaboration edges. The vertex labels are defined to be the top-level fields of study of each author, such as computer science, physics, and psychology. In total, there are 19 distinct fields (labels), and authors may publish in more than one field, so the associated vertices can have multiple labels.
We compare NetSMF with NetMF (Qiu et al., 2018), LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), and node2vec (Grover and Leskovec, 2016). For NetSMF, NetMF, DeepWalk, and node2vec that allow multi-hop structural dependencies, the context window size is set to be 10, which is also the default setting used in both DeepWalk and node2vec. Across all datasets, we set the embedding dimension to be 128. We follow the common practice for the other hyper-parameter settings, which are introduced below.
LINE. We use LINE with the second order proximity (i.e., LINE (2nd) (Tang et al., 2015)). We use the default setting of LINE’s hyper-parameters: the number of edge samples to be 10 billion and the negative sample size to be 5.
DeepWalk. We present DeepWalk’s results with the authors’ preferred parameters, that is, walk length to be 40, the number of walks from each vertex to be 80, and the number of negative samples in skip-gram to be 5.
node2vec. For the return parameter and in-out parameter in node2vec, we adopt the default setting that was used by its authors if available. Otherwise, we grid search . For a fair comparison, we use the same walk length and the number of walks per vertex as DeepWalk.
NetMF. In NetMF, the hyper-parameter indicates the number of eigen pairs used to approximate the NetMF matrix. We choose for the BlogCatalog, PPI and Flickr datasets.
NetSMF. In NetSMF, we set the number of samples for the PPI, Flickr, and YouTube datasets, for BlogCatalog, and for OAG in order to achieve desired performance. For both NetMF and NetSMF, we have .
Prediction Setting. We follow the same experiment and evaluation procedures that were performed in DeepWalk (Perozzi et al., 2014). First, we randomly sample a portion of labeled vertices for training and use the remaining for testing. For the BlogCatalog and PPI datasets, the training ratio varies from 10% to 90%. For Flickr, YouTube, and OAG, the training ratio varies from 1% to 10%. We use the one-vs-rest logistic regression model implemented by LIBLINEAR (Fan et al., 2008) for the multi-label vertex classification task. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. To avoid the thresholding effect, we take the assumption that was made in DeepWalk, LINE, and node2vec, that is, the number of labels for vertices in the test data is given (Perozzi et al., 2014; Tang et al., 2009; Grover and Leskovec, 2016). We repeat the prediction procedure ten times and evaluate the average performance in terms of both Micro-F1 and Macro-F1 scores (Tsoumakas et al., 2009). All the experiments are performed on a server with Intel Xeon E7-8890 CPU (64 cores), 1.7TB memory, and 2TB SSD hard drive.
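The evaluation rule above (take the top-ranked labels, with the number of true labels per vertex assumed known, then score with Micro-F1 and Macro-F1) can be sketched as follows. The function names and the toy score matrix are hypothetical, and the classifier itself is omitted: `scores` stands in for the per-label outputs of any one-vs-rest model.

```python
import numpy as np

def top_k_predict(scores, y_true):
    """For each vertex, predict its k highest-scoring labels, k = number of true labels."""
    pred = np.zeros_like(y_true)
    k = y_true.sum(axis=1)
    order = np.argsort(-scores, axis=1)       # labels sorted by decreasing score
    for i, ki in enumerate(k):
        pred[i, order[i, :ki]] = 1
    return pred

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools counts over labels; Macro-F1 averages per-label F1 scores."""
    tp = (y_true & y_pred).sum(axis=0)
    fp = ((1 - y_true) & y_pred).sum(axis=0)
    fn = (y_true & (1 - y_pred)).sum(axis=0)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    denom = 2 * tp + fp + fn
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return micro, per_label.mean()

# Toy data (hypothetical): 2 vertices, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0]], dtype=int)
scores = np.array([[0.9, 0.2, 0.1], [0.3, 0.4, 0.8]])
y_pred = top_k_predict(scores, y_true)
micro, macro = micro_macro_f1(y_true, y_pred)
```

Fixing the number of predicted labels per vertex removes the need to tune a decision threshold, which is the thresholding effect the setting is designed to avoid.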
|Dataset||LINE||DeepWalk||node2vec||NetMF||NetSMF|
|BlogCatalog||40 mins||12 mins||56 mins||2 mins||13 mins|
|PPI||41 mins||4 mins||4 mins||16 secs||10 secs|
|Flickr||42 mins||2.2 hours||21 hours||2 hours||48 mins|
|YouTube||46 mins||1 day||4 days||–||4.1 hours|
|OAG||2.6 hours||–||–||–||24 hours|
NetSMF vs. NetMF. We first focus on the comparison between NetSMF and NetMF, since the goal of NetSMF is to address the efficiency and scalability issues of NetMF while maintaining its superiority in effectiveness. From Table 5, we observe that for YouTube and OAG, both of which contain more than one million vertices, NetMF fails to complete because of the excessive space and memory consumption, while NetSMF is able to finish in four hours and one day, respectively. For the moderate-size network Flickr, both methods are able to complete within one week, though NetSMF is 2.5× faster (i.e., 48 mins vs. 2 hours). For small-scale networks, NetMF is faster than NetSMF in BlogCatalog and is comparable to NetSMF in PPI in terms of running time. This is because when the input networks contain only thousands of vertices, the advantage of sparse matrix construction and factorization over its dense alternative could be marginalized by other components of the workflow.
In terms of prediction performance, Figure 2 suggests NetSMF and NetMF consistently yield the best results among all compared methods, empirically demonstrating the power of the matrix factorization framework for network embedding. In BlogCatalog, NetSMF has slightly worse performance than NetMF (on average less than 3.1% worse regarding both Micro- and Macro-F1). In PPI, the performance of the two leading methods is relatively indistinguishable in terms of both metrics. In Flickr, NetSMF achieves significantly better Macro-F1 than NetMF (by 3.6% on average), and also higher Micro-F1 (by 5.3% on average). Recall that NetMF uses a dense approximation of the matrix to factorize. These results show that the sparse spectral approximation used by NetSMF does not necessarily yield worse performance than the dense approximation used by NetMF.
Overall, NetSMF not only improves the scalability and the running time of NetMF by orders of magnitude for large-scale networks, but also has competitive, and sometimes better, performance. This demonstrates the effectiveness of our spectral sparsification based approximation algorithm.
NetSMF vs. DeepWalk, LINE, and node2vec. We also compare NetSMF against common graph embedding benchmarks: DeepWalk, LINE, and node2vec. For the OAG dataset, DeepWalk and node2vec fail to finish the computation within one week, while NetSMF requires only 24 hours. Based on the publicly reported running time of skip-gram (Mikolov et al., 2013a), we estimate that DeepWalk and node2vec may require months to generate embeddings for the OAG dataset. In BlogCatalog, DeepWalk and NetSMF require similar computing time, while in Flickr, YouTube, and PPI, NetSMF is 2.75×, 5.9×, and 24× faster than DeepWalk, respectively. In all the datasets, NetSMF achieves 4–24× speedup over node2vec.
Moreover, the performance of NetSMF is significantly better than DeepWalk in BlogCatalog, PPI, and Flickr, by 7–34% in terms of Micro-F1 and 5–25% in terms of Macro-F1. In YouTube, NetSMF achieves comparable results to DeepWalk. Compared with node2vec, NetSMF achieves comparable performance in BlogCatalog and YouTube, and significantly better performance in PPI and Flickr. In summary, NetSMF consistently outperforms DeepWalk and node2vec in terms of both efficiency and effectiveness.
LINE has the best efficiency among all the five methods and together with NetSMF, they are the only methods that can generate embeddings for OAG within one week (and both finish in one day). However, it also has the worst prediction performance and consistently loses to others by a large margin across all datasets. For example, NetSMF beats LINE by 21% and 39% in Flickr, and by 30% and 100% in OAG in terms of Micro-F1 and Macro-F1, respectively.
In summary, LINE achieves efficiency at the cost of ignoring multi-hop dependencies in networks, which are supported by all the other four methods—DeepWalk, node2vec, NetMF, and NetSMF, demonstrating the importance of multi-hop dependencies for learning network representations.
More importantly, among these four methods, DeepWalk achieves neither efficiency nor effectiveness superiority; node2vec achieves relatively good performance at the cost of efficiency; NetMF achieves effectiveness at the expense of significantly increased time and space costs; NetSMF is the only method that achieves both high efficiency and effectiveness, empowering it to learn effective embeddings for billion-scale networks (e.g., the OAG network with 0.9 billion edges) in one day on one modern server.
In this section, we discuss how the hyper-parameters influence the performance and efficiency of NetSMF. We report all the parameter analyses on the Flickr dataset with training ratio set to be 10%.
How to Set the Embedding Dimension $d$. As mentioned in Section 3.1, SVD allows us to determine a “good” embedding dimension without supervised information. There are many methods available such as captured energy and Cattell’s Scree test (Cattell, 1966). Here we propose to use Cattell’s Scree test. Cattell’s Scree test plots the singular values and selects a rank such that there is a clear drop in the magnitudes or the singular values start to even out. In Flickr, if we sort the singular values in decreasing order, we can observe that the singular values approach 0 when the rank increases to around 100, as shown in Figure 3. In our experiments, by varying $d$ from to , we reach the best performance at , as shown in Figure 3, demonstrating the ability of our matrix factorization based NetSMF to automatically determine the embedding dimension.
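The Scree test itself is a visual procedure, but the "captured energy" alternative mentioned above is easy to automate. The sketch below picks the smallest rank whose squared singular values reach a given share of the total; the function name and the 90% threshold are assumptions for illustration, not values used in the experiments.

```python
import numpy as np

def rank_by_captured_energy(singular_values, threshold=0.9):
    """Smallest rank whose cumulative squared singular values reach `threshold` of the total."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    energy = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(energy, threshold)) + 1
    return min(rank, len(s))     # guard against floating-point round-off at threshold = 1

# Hypothetical spectrum with a sharp drop after the second singular value.
chosen_rank = rank_by_captured_energy([10.0, 5.0, 1.0, 0.5, 0.1], threshold=0.9)
```

Plotting the same sorted singular values gives the scree plot, on which the chosen rank can be checked against the visible elbow.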
The Number of Non-Zeros. In theory, = is required to guarantee the approximation error (see Section 3.1). Without loss of generality, we empirically set to be where is chosen from 1, 10, 100, 200, 500, 1000, 2000 and investigate how the number of non-zeros influences the quality of the learned embeddings. As shown in Figure 3, when the number of non-zeros increases, NetSMF tends to have better prediction performance because the original matrix is approximated more accurately. On the other hand, although increasing has a positive effect on the prediction performance, its marginal benefit diminishes gradually. One can observe that setting (the second-to-the-right data point on each line in Figure 3) is a good choice that balances NetSMF’s efficiency and effectiveness.
The Number of Threads. In this work, we use a single-machine shared-memory implementation with multi-threading acceleration. We report the running time of NetSMF when setting the number of threads to 1, 10, 20, 30, and 60, respectively. As shown in Figure 3, NetSMF takes 12 hours to embed the Flickr network with one thread and 48 minutes with 30 threads, achieving a 15× speedup (the ideal being 30×). This relatively good sub-linear speedup allows NetSMF to scale to very large networks.
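The speedup figure follows directly from the two reported running times. The short sketch below reproduces the arithmetic; the helper names `speedup` and `parallel_efficiency` are illustrative and not part of NetSMF.

```python
def speedup(serial_minutes, parallel_minutes):
    """Ratio of single-thread running time to multi-thread running time."""
    return serial_minutes / parallel_minutes

def parallel_efficiency(speedup_ratio, n_threads):
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup_ratio / n_threads

# Reported running times on Flickr: 12 hours with 1 thread, 48 minutes with 30 threads.
ratio = speedup(12 * 60, 48)
efficiency = parallel_efficiency(ratio, 30)

print(f"speedup: {ratio:.0f}x, efficiency: {efficiency:.0%}")
```

A parallel efficiency of one half means that each of the 30 threads contributes, on average, half of what it would under perfect linear scaling.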
In this section, we review the related work of network embedding, large-scale embedding algorithms, and spectral graph sparsification.
Network embedding has been extensively studied over the past years (Hamilton et al., 2017). The success of network embedding has driven many downstream network applications, such as recommendation systems (Ying et al., 2018). Briefly, recent work on network embedding can be categorized into three genres: (1) Skip-gram based methods that are inspired by word2vec (Mikolov et al., 2013a), such as LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), metapath2vec (Dong et al., 2017), and VERSE (Tsitsulin et al., 2018); (2) Deep learning based methods such as (Ying et al., 2018; Kipf and Welling, 2017); (3) Matrix factorization based methods such as GraRep (Cao et al., 2015) and NetMF (Qiu et al., 2018). Among them, NetMF bridges the first and the third categories by unifying a collection of skip-gram based network embedding methods into a matrix factorization framework. In this work, we leverage the merit of NetMF and address its limitation in efficiency. In the literature, PinSage is a notable network embedding framework for billion-scale networks (Ying et al., 2018). NetSMF differs from PinSage in its goal: NetSMF pre-trains general network embeddings in an unsupervised manner, while PinSage is a supervised graph convolutional method that incorporates both the objective of recommender systems and existing node features. That being said, the embeddings learned by NetSMF can be consumed by PinSage for downstream network applications.
Studies have attempted to optimize embedding algorithms for large datasets from different perspectives. Some focus on improving the skip-gram model, while others treat embedding learning as matrix factorization.
Distributed Skip-Gram Model. Inspired by word2vec (Mikolov et al., 2013b), most modern embedding learning algorithms are based on the skip-gram model. A line of work tries to accelerate the skip-gram model in a distributed system. For example, Ji et al. (2016) replicate the embedding matrix on multiple workers and synchronize them periodically; Ordentlich et al. (2016) distribute the columns (dimensions) of the embedding matrix to multiple executors and synchronize them with a parameter server (Li et al., 2014). Negative sampling is a key step in skip-gram, which requires drawing samples from a noise distribution. Stergiou et al. (2017) focus on the optimization of negative sampling by replacing the roulette wheel selection with a hierarchical sampling algorithm based on the alias method. More recently, Wang et al. (2018) propose a billion-scale network embedding framework that heuristically partitions the input graph into small subgraphs and processes them separately in parallel. However, the performance of their framework relies heavily on the quality of the graph partition. The drawback of partition-based embedding learning is that the embeddings learned in different subgraphs do not share the same latent space, making it impossible to compare nodes across subgraphs.
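To make the negative-sampling discussion above more concrete, the following is a minimal sketch of the plain alias method (Vose's construction), which draws from a fixed discrete distribution in constant time per sample after linear-time preprocessing. It is not the hierarchical variant of Stergiou et al. (2017); the function names and the toy weights are assumptions made for illustration.

```python
import random

def build_alias_table(weights):
    """Build (prob, alias) tables for O(1) sampling proportional to `weights`."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]  # mean of scaled weights is 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # keep s with this probability ...
        alias[s] = l          # ... otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:   # leftovers fill their bucket entirely
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Draw one index: pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

# Toy noise distribution over four nodes (hypothetical weights).
prob, alias = build_alias_table([1, 2, 3, 4])
samples = [alias_draw(prob, alias) for _ in range(10)]
print(samples)
```

By contrast, roulette wheel selection needs a linear scan or a binary search over cumulative weights for every draw, which is the per-sample cost the alias method avoids.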
Efficient Matrix Factorization. Factorizing the NetMF matrix, either implicitly (e.g., LINE (Tang et al., 2015) and DeepWalk (Perozzi et al., 2014)) or explicitly (e.g., NetMF (Qiu et al., 2018)), encounters two issues. First, the denseness of this matrix makes computation expensive even for a moderate context window size (e.g., ). Second, the non-linear transformation, i.e., the element-wise matrix logarithm, is hard to approximate. LINE (Tang et al., 2015) sidesteps this problem by setting the context window size to 1. With such simplification, it achieves good scalability at the cost of prediction performance. NetSMF addresses these issues by efficiently sparsifying the dense NetMF matrix with a theoretically bounded approximation error.
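The denseness issue can be seen on a toy graph. The sketch below assumes the closed form of the NetMF matrix from Qiu et al. (2018), namely the element-wise truncated logarithm of vol(G)/(bT) · (Σ_{r=1..T} (D⁻¹A)^r) · D⁻¹, where T is the context window size and b the number of negative samples. The function names and the path-graph example are illustrative, and the dense computation is exactly what NetSMF is designed to avoid on large networks.

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an undirected path on n nodes (a very sparse graph)."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A

def walk_polynomial(A, T):
    """Dense (sum_{r=1..T} (D^-1 A)^r) D^-1, the matrix inside the logarithm."""
    deg = A.sum(axis=1)
    P = A / deg[:, None]              # row-normalised transition matrix D^-1 A
    S = np.zeros_like(A)
    P_r = np.eye(A.shape[0])
    for _ in range(T):
        P_r = P_r @ P                 # r-step transition probabilities
        S += P_r
    return S / deg[None, :]           # right-multiply by D^-1

def netmf_matrix_dense(A, T, b=1.0):
    """Element-wise truncated log, log(max(1, x)), of the scaled polynomial."""
    vol = A.sum()
    M = (vol / (b * T)) * walk_polynomial(A, T)
    return np.log(np.maximum(M, 1.0))

A = path_graph(10)
nnz_T1 = np.count_nonzero(walk_polynomial(A, 1))
nnz_T5 = np.count_nonzero(walk_polynomial(A, 5))
print("non-zeros before the log: T=1 ->", nnz_T1, ", T=5 ->", nnz_T5)
```

With a window size of 1 the matrix has the same sparsity pattern as the graph itself, which is why the LINE-style simplification scales well; each additional hop fills in more entries.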
Spectral graph sparsification has been studied for decades in graph theory (Teng et al., 2016). The task of graph sparsification is to approximate a “dense” graph by a “sparse” one that can be effectively used in place of the dense one (Teng et al., 2016). This task arises in many applications such as scientific computing (Higham and Lin, 2011), machine learning (Cheng et al., 2015a; Calandriello et al., 2018), and data mining (Zhao, 2015). Our NetSMF model is the first work that incorporates spectral sparsification algorithms (Cheng et al., 2015b, a) into network embedding, which offers a powerful and efficient way to approximate and analyze the random-walk matrix-polynomial in the NetMF matrix.
In this work, we study network embedding with the goal of achieving both efficiency and effectiveness. To address the scalability challenges faced by the NetMF model, we propose to study large-scale network embedding as sparse matrix factorization. We present the NetSMF algorithm, which achieves a sparsification of the (dense) NetMF matrix. Both the construction and the factorization of the sparsified matrix are fast enough to support very large-scale network embedding learning. For example, NetSMF can embed the Open Academic Graph in 24 hours, a network whose size makes the dense matrix factorization solution (NetMF) computationally intractable. Theoretically, the sparsified matrix is spectrally close to the original NetMF matrix with an approximation bound. Empirically, our extensive experimental results show that the embeddings learned by NetSMF from the sparsified matrix are as effective as those from the factorization of the NetMF matrix, allowing NetSMF to outperform the common network embedding benchmarks (DeepWalk, LINE, and node2vec). In other words, among both matrix factorization based methods (NetMF and NetSMF) and common skip-gram based benchmarks (DeepWalk, LINE, and node2vec), NetSMF is the only model that achieves both efficiency and performance superiority.
Future Work. NetSMF brings an efficient, effective, and guaranteed solution to network embedding learning. There are multiple tangible research fronts we can pursue. First, our current single-machine implementation limits the number of samples we can take for large networks. We plan to develop a multi-machine solution in the future to further scale NetSMF. Second, building upon NetSMF, we would like to efficiently and accurately learn embeddings for large-scale directed (Cohen et al., 2016), dynamic (Kapralov et al., 2017), and/or heterogeneous networks. Third, given the demonstrated advantage of matrix factorization methods, we are also interested in exploring other matrix definitions that may be effective in capturing different structural properties of networks. Last, it would also be interesting to bridge matrix factorization based network embedding methods with graph convolutional networks.
Acknowledgements. We would like to thank Dehua Cheng and Youwei Zhuo from USC for helpful discussions. Jian Li is supported in part by the National Basic Research Program of China Grant 2015CB358700, the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and a grant from Microsoft Research Asia. Jie Tang is the corresponding author.
(Courant-Fischer Theorem) Let be a symmetric matrix with eigenvalues , then for ,
(Horn and Johnson, 1991) Let be two symmetric matrices. Then for the decreasingly-ordered singular values of and ,
holds for any and .
Let and similarly . Then all the singular values of are smaller than , i.e., , .
which is a normalized graph Laplacian whose eigenvalues lie in the interval , i.e., for , (Von Luxburg, 2007). Since is a -spectral sparsifier of , we know that for ,
Let which is bijective, we have
The last inequality holds because we assume . Then, by the Courant-Fischer Theorem (Lemma 2), we immediately get, ,
Then, by Lemma 1, . ∎
It is easy to observe that is 1-Lipschitz w.r.t. the Frobenius norm. So we have
We finally explain the remaining question in Step 1 of NetSMF: After sampling a length- path , why does the algorithm add a new edge to the sparsifier with weight ? Our proof relies on two lemmas from (Cheng et al., 2015b).
(Theorem 2.2 in (Cheng et al., 2015b)) After sampling a length- path , the weight corresponding to the new edge added to the sparsifier should be .
After sampling a length- path using the PathSampling algorithm (Alg. 2), the weight of the new edge added to the sparsifier is .
Spectral analysis of random graphs with skewed degree distributions. In FOCS ’04. 602–610.
Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD ’18.