NetSMF
NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization
We study the problem of large-scale network embedding, which aims to learn latent representations for network mining applications. Previous research shows that 1) popular network embedding benchmarks, such as DeepWalk, are in essence implicitly factorizing a matrix with a closed form, and 2) the explicit factorization of such a matrix generates more powerful embeddings than existing methods. However, directly constructing and factorizing this matrix, which is dense, is prohibitively expensive in terms of both time and space, making it not scalable for large networks. In this work, we present the algorithm of large-scale network embedding as sparse matrix factorization (NetSMF). NetSMF leverages theories from spectral sparsification to efficiently sparsify the aforementioned dense matrix, enabling significantly improved efficiency in embedding learning. The sparsified matrix is spectrally close to the original dense one with a theoretically bounded approximation error, which helps maintain the representation power of the learned embeddings. We conduct experiments on networks of various scales and types. Results show that among both popular benchmarks and factorization-based methods, NetSMF is the only method that achieves both high efficiency and effectiveness. We show that NetSMF requires only 24 hours to generate effective embeddings for a large-scale academic collaboration network with tens of millions of nodes, while it would cost DeepWalk months and is computationally infeasible for the dense matrix factorization solution. The source code of NetSMF is publicly available (https://github.com/xptree/NetSMF).
Recent years have witnessed the emergence of network embedding, which offers a revolutionary paradigm for modeling graphs and networks (Hamilton et al., 2017). The goal of network embedding is to automatically learn latent representations for objects in networks, such as vertices and edges. Significant lines of research have shown that the latent representations are capable of capturing the structural properties of networks, facilitating various downstream network applications, such as vertex classification and link prediction (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Dong et al., 2017).
Over the course of its development, the DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016) models have been commonly considered as powerful benchmark solutions for evaluating network embedding research. The advantage of LINE lies in its scalability for large-scale networks as it only models the first- and second-order proximities. That is to say, its embeddings lose the multi-hop dependencies in networks. DeepWalk and node2vec, on the other hand, leverage random walks on graphs and skip-gram (Mikolov et al., 2013a) with large context sizes to model nodes further away (i.e., global structures). Consequently, it is computationally more expensive for DeepWalk and node2vec to handle large-scale networks. For example, with the default parameter settings (Perozzi et al., 2014), DeepWalk requires months to embed an academic collaboration network of 67 million vertices and 895 million edges. (Footnote 2: With the default DeepWalk parameters (walk length: 40 and walks per node: 80), 214+ billion nodes (67M × 40 × 80) with a vocabulary size of 67 million are fed into skip-gram. As a reference, Mikolov et al. reported that training on Google News of 6 billion words and a vocabulary size of only 1 million cost 2.5 days with 125 CPU cores (Mikolov et al., 2013a).) The node2vec model, which performs high-order random walks, takes more time than DeepWalk to learn embeddings.
More recently, a study shows that both the DeepWalk and LINE methods can be viewed as implicit factorization of a closed-form matrix (Qiu et al., 2018). Building upon this theoretical foundation, the NetMF method was instead proposed to explicitly factorize this matrix, achieving more effective embeddings than DeepWalk and LINE. Unfortunately, it turns out that the matrix to be factorized is an $n \times n$ dense one, with $n$ being the number of vertices in the network, making it prohibitively expensive to directly construct and factorize for large-scale networks.
In light of these limitations of existing methods (see the summary in Table 1), we propose to study representation learning for large-scale networks with the goal of achieving efficiency, capturing global structural contexts, and having theoretical guarantees. Our idea is to find a sparse matrix that is spectrally close to the dense NetMF matrix implicitly factorized by DeepWalk. The sparsified matrix requires a lower cost for both construction and factorization. Meanwhile, making it spectrally close to the original NetMF matrix can guarantee that the spectral information of the network is maintained, and that the embeddings learned from the sparse matrix are as powerful as those learned from the dense NetMF matrix.
In this work, we present the solution to network embedding learning as sparse matrix factorization (NetSMF). NetSMF comprises three steps. First, it leverages the spectral graph sparsification technique (Cheng et al., 2015b, a) to find a sparsifier for a network's random-walk matrix-polynomial. Second, it uses this sparsifier to construct a matrix with significantly fewer non-zeros than, but spectrally close to, the original NetMF matrix. Finally, it performs randomized singular value decomposition to efficiently factorize the sparsified NetSMF matrix, yielding the embeddings for the network.
Table 1: Comparison of LINE, DeepWalk, node2vec, NetMF, and NetSMF along four properties: efficiency, global context, theoretical guarantee, and high-order proximity.
With this design, NetSMF offers both efficiency and effectiveness with guarantees, as the approximation error of the sparsified matrix is theoretically bounded. We conduct experiments on five networks, which are representative of different scales and types. Experimental results show that for million-scale or larger networks, NetSMF achieves orders of magnitude speedup over NetMF, while maintaining competitive performance for the vertex classification task. In other words, both NetSMF and NetMF outperform well-recognized network embedding benchmarks (i.e., DeepWalk, LINE, and node2vec), but NetSMF addresses the computation challenge faced by NetMF.
To summarize, we introduce the idea of network embedding as sparse matrix factorization and present the NetSMF algorithm, which makes the following contributions to network embedding:
Efficiency. NetSMF reaches significantly lower time and space complexity than NetMF. Remarkably, NetSMF is able to generate embeddings for a large-scale academic network of 67 million vertices and 895 million edges on a single server in 24 hours, while it would cost months for DeepWalk and node2vec, and is computationally infeasible for NetMF on the same hardware.
Effectiveness. NetSMF is capable of learning embeddings that maintain the same representation power as the dense matrix factorization solution, making it consistently outperform DeepWalk and node2vec by up to 34% and LINE by up to 100% for the multi-label vertex classification task in networks.
Theoretical Guarantee. NetSMF’s efficiency and effectiveness are theoretically backed up. The sparse NetSMF matrix is spectrally close to the exact NetMF matrix, and the approximation error can be bounded, maintaining the representation power of its sparsely learned embeddings.
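Table 2: Notations.
Notation | Description
$G$ | input network
$V$ | vertex set of $G$ with $\lvert V\rvert = n$
$E$ | edge set of $G$ with $\lvert E\rvert = m$
$\boldsymbol{A}$ | adjacency matrix of $G$
$\boldsymbol{D}$ | degree matrix of $G$
$\operatorname{vol}(G)$ | volume of $G$
$b$ | number of negative samples
$T$ | context window size
$d$ | embedding dimension
$\boldsymbol{L}$ | random-walk molynomial of $G$ (Eq. (4))
$\widetilde{\boldsymbol{L}}$ | $\boldsymbol{L}$'s sparsifier
$\widetilde{\boldsymbol{M}}$ | $\boldsymbol{M}$'s sparsifier
$\operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\boldsymbol{M}\right)$ | NetMF matrix
$\operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\widetilde{\boldsymbol{M}}\right)$ | NetMF matrix sparsifier
$M$ | number of non-zeros in $\widetilde{\boldsymbol{L}}$
$\epsilon$ | approximation factor
$[n]$ | set $\{1, 2, \ldots, n\}$ for positive integer $n$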
Commonly, the problem of network embedding is formalized as follows: Given an undirected and weighted network $G = (V, E, \boldsymbol{A})$ with $V$ as the vertex set of $n$ vertices, $E$ as the edge set of $m$ edges, and $\boldsymbol{A}$ as the adjacency matrix, the goal is to learn a function $V \to \mathbb{R}^{d}$ that maps each vertex to a $d$-dimensional ($d \ll n$) vector that captures its structural properties, e.g., community structures. The vector representation of each vertex can be fed into downstream applications such as link prediction and vertex classification.
One of the pioneering works on network embedding is the DeepWalk model (Perozzi et al., 2014), which has been consistently considered as a powerful benchmark over the past years (Hamilton et al., 2017). In brief, DeepWalk consists of two steps. First, it generates several vertex sequences by random walks over a network. Second, it applies the skip-gram model (Mikolov et al., 2013b) on the generated vertex sequences to learn the latent representations for each vertex. Commonly, skip-gram is parameterized with the context window size $T$ and the number of negative samples $b$. Recently, a theoretical study (Qiu et al., 2018) reveals that DeepWalk essentially factorizes a matrix derived from the random walk process. More formally, it proves that when the length of random walks goes to infinity, DeepWalk implicitly and asymptotically factorizes the following matrix:
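$$\log^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\boldsymbol{M}\right), \tag{1}$$
where $\operatorname{vol}(G) = \sum_{i}\sum_{j}\boldsymbol{A}_{ij}$ denotes the volume of the graph, and
$$\boldsymbol{M} = \frac{1}{T}\sum_{r=1}^{T}\left(\boldsymbol{D}^{-1}\boldsymbol{A}\right)^{r}\boldsymbol{D}^{-1}, \tag{2}$$
where $\boldsymbol{D} = \operatorname{diag}(d_1, \ldots, d_n)$ is the degree matrix with $d_i = \sum_{j}\boldsymbol{A}_{ij}$ as the generalized degree of the $i$-th vertex. Note that $\log^{\circ}(\cdot)$ represents the element-wise matrix logarithm (Horn and Johnson, 1991), which is different from the matrix logarithm. In other words, the matrix in Eq. (1) can be characterized as the result of applying the element-wise matrix logarithm (i.e., $\log^{\circ}$) to the matrix $\frac{\operatorname{vol}(G)}{b}\boldsymbol{M}$.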
The matrix in Eq. (1) offers an alternative view of the skip-gram-based network embedding methods. Further, Qiu et al. provide an explicit matrix factorization approach named NetMF to learn the embeddings (Qiu et al., 2018). It shows that the accuracy for vertex classification based on the embeddings from NetMF outperforms that based on DeepWalk and LINE. Note that the matrix in Eq. (1) would be ill-defined if there exists a pair of vertices unreachable in $T$ hops, because $\log(0) = -\infty$. So following Levy and Goldberg (Levy and Goldberg, 2014), NetMF uses the logarithm truncated at point one, that is, $\operatorname{trunc\_log}(x) = \max(0, \log(x))$. Thus, NetMF aims to factorize the matrix
$$\operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\boldsymbol{M}\right). \tag{3}$$
In the rest of this work, we refer to the matrix in Eq. (3) as the NetMF matrix.
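To make the definition concrete, the following is a minimal sketch of how the dense NetMF matrix could be built for a toy graph. It assumes NumPy, a dense symmetric adjacency matrix with no isolated vertices, and the form of $\boldsymbol{M}$ given by the random-walk sum above; the function name `netmf_matrix` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def netmf_matrix(A, T=10, b=1.0):
    """Dense trunc_log(vol(G)/b * M) for a small graph (illustrative only).

    Assumes A is a symmetric adjacency matrix and every vertex has degree > 0.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)                  # generalized degrees d_i
    vol = degrees.sum()                      # vol(G) = sum_ij A_ij
    P = A / degrees[:, None]                 # random-walk matrix D^{-1} A
    power = np.eye(A.shape[0])
    walk_sum = np.zeros_like(A)
    for _ in range(T):                       # accumulate sum_{r=1..T} P^r
        power = power @ P
        walk_sum += power
    M = (walk_sum / T) / degrees[None, :]    # right-multiply by D^{-1}
    X = vol / b * M
    # Element-wise truncated logarithm: log(max(x, 1)) equals max(0, log x)
    # and avoids evaluating log(0) for unreachable pairs.
    return np.log(np.maximum(X, 1.0))
```

Every matrix power here is a dense $n \times n$ product, which is exactly the cost that becomes prohibitive on large networks.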
However, there exist a couple of challenges when leveraging the NetMF matrix in practice. First, almost every pair of vertices within distance $T$ corresponds to a non-zero entry in the NetMF matrix. Recall that many social and information networks exhibit the small-world property where most vertices can be reached from each other in a small number of steps. For example, as of the year 2012, 92% of the reachable pairs in Facebook are at distance five or less (Backstrom et al., 2012). As a consequence, even if setting a moderate context window size (e.g., the default setting $T = 10$ in DeepWalk), the NetMF matrix in Eq. (3) would be a dense matrix with $O(n^2)$ non-zeros. The exact construction and factorization of such a matrix is impractical for large-scale networks. More concretely, computing the matrix power in Eq. (2) involves dense matrix multiplication, which costs $O(n^3)$ time; factorizing an $n \times n$ dense matrix is also time consuming. To reduce the construction cost, NetMF approximates $\boldsymbol{M}$ with its top eigenpairs. However, the approximated matrix is still dense, making this strategy unable to handle large networks.
In this work, we aim to address the efficiency and scalability limitation of NetMF, while maintaining its superiority in effectiveness. We list necessary notations and their descriptions in Table 2.
In this section, we develop network embedding as sparse matrix factorization (NetSMF). We present the NetSMF method to construct and factorize a sparse matrix that approximates the dense NetMF matrix. The main technique we leverage is random-walk matrix-polynomial (molynomial) sparsification.
We first introduce the definition of spectral similarity and the theorem of random-walk molynomial sparsification.
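Definition (Spectral Similarity of Networks). Suppose $G$ and $\widetilde{G}$ are two weighted undirected networks. Let $\boldsymbol{L}$ and $\widetilde{\boldsymbol{L}}$ be their Laplacian matrices, respectively. We define $G$ and $\widetilde{G}$ to be $(1+\epsilon)$-spectrally similar if
$$\forall \boldsymbol{x} \in \mathbb{R}^{n}, \quad (1-\epsilon)\,\boldsymbol{x}^{\top}\widetilde{\boldsymbol{L}}\boldsymbol{x} \le \boldsymbol{x}^{\top}\boldsymbol{L}\boldsymbol{x} \le (1+\epsilon)\,\boldsymbol{x}^{\top}\widetilde{\boldsymbol{L}}\boldsymbol{x}.$$
Theorem 2 (Spectral Sparsifiers of Random-Walk Molynomials). For the random-walk molynomial
$$\boldsymbol{L} = \boldsymbol{D} - \sum_{r=1}^{T}\alpha_{r}\boldsymbol{D}\left(\boldsymbol{D}^{-1}\boldsymbol{A}\right)^{r}, \tag{4}$$
where $\sum_{r=1}^{T}\alpha_{r} = 1$ and the $\alpha_{r}$ are non-negative, one can construct, in time $O(T^{2}m\epsilon^{-2}\log^{2}n)$, a $(1+\epsilon)$-spectral sparsifier $\widetilde{\boldsymbol{L}}$ with $O(n\epsilon^{-2}\log n)$ non-zeros.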
To achieve a sparsifier $\widetilde{\boldsymbol{L}}$ with $O(n\epsilon^{-2}\log n)$ non-zeros, the sparsification algorithm consists of two steps: The first step obtains an initial sparsifier for $\boldsymbol{L}$ with $O(Tm\epsilon^{-2}\log n)$ non-zeros. The second step then applies the standard spectral sparsification algorithm (Spielman and Srivastava, 2011) to further reduce the number of non-zeros to $O(n\epsilon^{-2}\log n)$. In this work, we only adopt the first step because a sparsifier with $O(Tm\epsilon^{-2}\log n)$ non-zeros is sparse enough for our task. Thus we skip the second step, which involves additional computations. From now on, when referring to the random-walk molynomial sparsification algorithm in this work, we mean its first step only.
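One can immediately observe that, if we set $\alpha_{r} = \frac{1}{T}$ for $r \in [T]$, the matrix $\boldsymbol{L}$ in Eq. (4) has a strong connection with the desired matrix $\boldsymbol{M}$ in Eq. (2). Formally, we have the following equation:
$$\boldsymbol{M} = \boldsymbol{D}^{-1}\left(\boldsymbol{D} - \boldsymbol{L}\right)\boldsymbol{D}^{-1}. \tag{5}$$
Thm. 2 can help us construct a sparsifier $\widetilde{\boldsymbol{L}}$ for matrix $\boldsymbol{L}$. Then we define $\widetilde{\boldsymbol{M}} = \boldsymbol{D}^{-1}(\boldsymbol{D} - \widetilde{\boldsymbol{L}})\boldsymbol{D}^{-1}$ by replacing $\boldsymbol{L}$ in Eq. (5) with its sparsifier $\widetilde{\boldsymbol{L}}$. One can observe that matrix $\widetilde{\boldsymbol{M}}$ is still a sparse one with the same order of magnitude of non-zeros as $\widetilde{\boldsymbol{L}}$. Consequently, instead of factorizing the dense NetMF matrix in Eq. (3), we can factorize its sparse alternative, i.e.,
$$\operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\widetilde{\boldsymbol{M}}\right). \tag{6}$$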
In the rest of this work, the matrix in Eq. (6) is referred to as the NetMF matrix sparsifier.
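The link between the random-walk molynomial and the matrix factorized by NetMF can be checked in two lines. The sketch below assumes the molynomial has the form $\boldsymbol{L} = \boldsymbol{D} - \sum_{r=1}^{T}\alpha_r\boldsymbol{D}(\boldsymbol{D}^{-1}\boldsymbol{A})^r$ with uniform coefficients $\alpha_r = 1/T$, and that $\boldsymbol{M}$ is the averaged random-walk sum used throughout this section.

```latex
\begin{aligned}
\boldsymbol{D} - \boldsymbol{L}
  &= \frac{1}{T}\sum_{r=1}^{T}\boldsymbol{D}\left(\boldsymbol{D}^{-1}\boldsymbol{A}\right)^{r}
  && \text{(set } \alpha_r = 1/T\text{)} \\
\boldsymbol{D}^{-1}\left(\boldsymbol{D} - \boldsymbol{L}\right)\boldsymbol{D}^{-1}
  &= \frac{1}{T}\sum_{r=1}^{T}\left(\boldsymbol{D}^{-1}\boldsymbol{A}\right)^{r}\boldsymbol{D}^{-1}
   = \boldsymbol{M}
  && \text{(cancel } \boldsymbol{D}^{-1}\boldsymbol{D}\text{ on the left)}
\end{aligned}
```

Because only $\boldsymbol{L}$ changes when a sparsifier is substituted, the diagonal scaling by $\boldsymbol{D}^{-1}$ on both sides leaves the sparsity pattern of $\boldsymbol{D} - \widetilde{\boldsymbol{L}}$ untouched.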
In this section, we formally describe the NetSMF algorithm, which consists of three steps: random-walk molynomial sparsification, NetMF sparsifier construction, and truncated singular value decomposition.
Step 1: Random-Walk Molynomial Sparsification. To achieve the sparsifier $\widetilde{\boldsymbol{L}}$, we adopt the algorithm in Cheng et al. (2015b). The algorithm starts from creating a network $\widetilde{G}$ that has the same vertex set as $G$ and an empty edge set (Alg. 1, Line 1). Next, the algorithm constructs a sparsifier with $O(M)$ non-zeros by repeating the PathSampling algorithm $M$ times. In each iteration, it picks an edge $e = (u, v) \in E$ and an integer $r \in [T]$ uniformly (Alg. 1, Lines 3-4). Then, the algorithm uniformly draws an integer $k \in [r]$ and performs $(k-1)$-step and $(r-k)$-step random walks starting from the two endpoints of edge $e$, respectively (Alg. 2, Lines 3-4). The above process samples a length-$r$ path $\boldsymbol{p} = (u_0, u_1, \ldots, u_r)$. At the same time, the algorithm keeps track of $Z(\boldsymbol{p})$, which is defined by
$$Z(\boldsymbol{p}) = \sum_{i=1}^{r}\frac{2}{\boldsymbol{A}_{u_{i-1},u_{i}}}, \tag{7}$$
and then adds a new edge $(u_0, u_r)$ with weight $\frac{2rm}{MZ(\boldsymbol{p})}$ to $\widetilde{G}$ (Alg. 1, Line 6). (Footnote 3: Details about how the edge weight is derived can be found in Thm. 7 in the Appendix.) Parallel edges in $\widetilde{G}$ will be merged into one single edge, with their weights summed up together. Finally, the algorithm computes the Laplacian of $\widetilde{G}$, which is the sparsifier $\widetilde{\boldsymbol{L}}$ as we desired (Alg. 1, Line 8). This step gives us a sparsifier with $O(M)$ non-zeros.
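The sampling loop of Step 1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the released implementation: it assumes an unweighted, undirected graph given as an edge list (so every adjacency entry is 1), stores each sampled undirected edge in both directions, and uses the hypothetical names `path_sampling` and `sparsify`.

```python
import random
from collections import defaultdict

def path_sampling(nbrs, edge, r, rng):
    """Sample a length-r path through `edge`; return its endpoints and Z."""
    u, v = edge
    k = rng.randint(1, r)          # the seed edge is the k-th hop of the path
    z = 2.0                        # 2 / A_uv for the seed edge (A_uv = 1)
    for _ in range(k - 1):         # (k-1)-step random walk from u
        u = rng.choice(nbrs[u])
        z += 2.0
    for _ in range(r - k):         # (r-k)-step random walk from v
        v = rng.choice(nbrs[v])
        z += 2.0
    return u, v, z

def sparsify(edges, T, num_samples, seed=0):
    """Return {(u, v): weight} for the sampled graph, stored symmetrically."""
    rng = random.Random(seed)
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    m = len(edges)
    weights = defaultdict(float)
    for _ in range(num_samples):
        e = rng.choice(edges)                     # uniform edge
        r = rng.randint(1, T)                     # uniform path length in [T]
        u0, ur, z = path_sampling(nbrs, e, r, rng)
        w = 2.0 * r * m / (num_samples * z)       # edge weight 2rm / (M Z)
        weights[(u0, ur)] += w                    # parallel edges are merged
        weights[(ur, u0)] += w
    return dict(weights)
```

For an unweighted graph $Z = 2r$, so every sample contributes weight $m/M$ and the total weight of the sampled graph matches that of the input graph regardless of the random choices.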
Step 2: Construct a NetMF Matrix Sparsifier. As we have discussed at the end of Section 3.1, after constructing a sparsifier $\widetilde{\boldsymbol{L}}$, we can plug it into Eq. (5) to obtain a NetMF matrix sparsifier as shown in Eq. (6) (Alg. 1, Lines 9-10). This step does not change the order of magnitude of non-zeros in the sparsifier.
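Step 2 is an entry-wise transformation, which the following sketch illustrates on a list of `(u, v, weight)` triplets. It assumes the triplets hold the entries of $\boldsymbol{D} - \widetilde{\boldsymbol{L}}$ (the weighted adjacency of the sampled graph) and that `degree` maps each vertex to its degree in the original network; the function name is hypothetical.

```python
import math

def netmf_sparsifier_entries(triplets, degree, vol, b=1.0):
    """Apply D^{-1} (.) D^{-1}, scale by vol(G)/b, then truncate the logarithm."""
    entries = []
    for u, v, w in triplets:
        value = vol / b * w / (degree[u] * degree[v])
        # trunc_log(x) = max(0, log x): entries with x <= 1 become zero,
        # so they are simply dropped from the sparse representation.
        if value > 1.0:
            entries.append((u, v, math.log(value)))
    return entries
```

Dropping the entries that truncate to zero is why this step can only reduce the number of non-zeros, never increase it.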
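Table 3: Time and space complexity of NetSMF.
Step | Time | Space
Step 1 | $O(MT\log n)$ for weighted networks; $O(MT)$ for unweighted networks | $O(M + n + m)$
Step 2 | $O(M)$ | $O(M + n)$
Step 3 | $O(Md + nd^{2} + d^{3})$ | $O(M + nd)$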
Step 3: Truncated Singular Value Decomposition. The final step is to perform truncated singular value decomposition (SVD) on the constructed NetMF matrix sparsifier (Eq. (6)). However, even though the sparsifier has only $O(M)$ non-zeros, performing exact SVD is still time consuming. In this work, we leverage a modern randomized matrix approximation technique, Randomized SVD, developed by Halko et al. (2011). Due to space constraints, we cannot include many details. Briefly speaking, the algorithm projects the original matrix to a low-dimensional space through a Gaussian random matrix. One then only needs to perform traditional SVD (e.g., Jacobi SVD) on a small $d \times d$ matrix. We list the pseudocode in Alg. 3. Another advantage of SVD is that we can determine the dimensionality of embeddings by using, for example, Cattell's Scree test (Cattell, 1966). In the test, we plot the singular values and select a rank such that there is a clear drop in the magnitudes or the singular values start to even out. More details will be discussed in Section 4.
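The projection idea behind Randomized SVD can be illustrated with a generic range-finder sketch. This is the textbook scheme from Halko et al. written with NumPy on a dense array, not a transcription of Alg. 3; the oversampling parameter and the function name `randomized_svd_embedding` are assumptions made for illustration.

```python
import numpy as np

def randomized_svd_embedding(X, d, oversample=10, seed=0):
    """Approximate the top-d SVD of X and return (U_d * sqrt(S_d), S_d)."""
    rng = np.random.default_rng(seed)
    k = min(d + oversample, min(X.shape))
    omega = rng.standard_normal((X.shape[1], k))   # Gaussian random matrix
    Q, _ = np.linalg.qr(X @ omega)                 # orthonormal basis of the sketch
    B = Q.T @ X                                    # small k x n projected matrix
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                                # lift back to the original space
    return U[:, :d] * np.sqrt(s[:d]), s[:d]
```

The only operations that touch the large matrix are the two products `X @ omega` and `Q.T @ X`, which is where a sparse input pays off; the exact SVD is performed on the small projected matrix only.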
Complexity Analysis. Now we analyze the time and space complexity of NetSMF, as summarized in Table 3. As for step 1, we call the PathSampling algorithm $M$ times, during each of which it performs $O(T)$ steps of random walks over the network. For unweighted networks, sampling a neighbor requires $O(1)$ time, while for weighted networks, one can use roulette wheel selection to choose a neighbor in $O(\log n)$. It takes $O(M)$ space to store $\widetilde{G}$, while the additional $O(n + m)$ space comes from the storage of the input network. As for step 2, it takes $O(M)$ time to perform the transformation in Eq. (5) and the element-wise truncated logarithm in Eq. (6). The additional $O(n)$ space is spent in storing the degree matrix. As for step 3, $O(Md)$ time is required to compute the product of a row-major sparse matrix and a dense matrix (Alg. 3, Lines 3 and 5); $O(nd^{2})$ time is spent in Gram-Schmidt orthogonalization (Alg. 3, Lines 4 and 8); $O(d^{3})$ time is spent in Jacobi SVD (Alg. 3, Line 10).
Connection to NetMF. The major difference between NetMF and NetSMF lies in the approximation strategy of the NetMF matrix in Eq. (3). As we mentioned in Section 2, NetMF approximates it with a dense matrix, which brings new space and computation challenges. In this work, NetSMF aims to find a sparse approximator to the NetMF matrix by leveraging theories and techniques from spectral graph sparsification.
Example. We provide a running example to help understand the NetSMF algorithm. Suppose we want to learn embeddings for a network with $n = 10^{6}$ vertices, $m = 10^{7}$ edges, context window size $T = 10$, and approximation factor $\epsilon = 0.1$. The NetSMF method calls the PathSampling algorithm $M = Tm\epsilon^{-2}\log n \approx 1.4 \times 10^{11}$ times and provides us with a NetMF matrix sparsifier with at most $1.4 \times 10^{11}$ non-zeros (notice that the reducer in Step 1 and the truncated logarithm in Step 2 will further sparsify the matrix, making $1.4 \times 10^{11}$ an upper bound). The density of the sparsifier is at most $M / n^{2} \approx 14\%$. Then, when computing the sparse-dense matrix product in randomized SVD (Alg. 3, Lines 3 and 5), the sparseness of the factorized matrix can greatly accelerate the calculation. In comparison, NetMF must construct a dense matrix with $n^{2} = 10^{12}$ non-zeros, which is an order of magnitude larger in terms of density. Also, the density of the sparsifier in NetSMF can be further reduced by using a larger $\epsilon$, while NetMF does not have this flexibility.
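The arithmetic behind such an example is easy to reproduce. The sketch below assumes illustrative values ($n = 10^6$, $m = 10^7$, $T = 10$, $\epsilon = 0.1$), a natural logarithm, and a hidden constant of 1 in the sample bound $Tm\epsilon^{-2}\log n$; all of these are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

def sparsifier_budget(n, m, T, eps):
    """Return (number of path samples, upper bound on sparsifier density)."""
    num_samples = T * m * eps ** -2 * math.log(n)   # M = T m eps^-2 log n
    density = num_samples / n ** 2                  # at most M non-zeros out of n^2
    return num_samples, density

# Illustrative values only; they are assumptions, not measurements.
samples, density = sparsifier_budget(n=10**6, m=10**7, T=10, eps=0.1)
```

Doubling $\epsilon$ divides the sample budget by four, which is the knob that a dense construction does not offer.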
In this section, we analyze the approximation error of the sparsification. We assume that we choose an approximation factor $\epsilon < 0.5$. We first see how the constructed $\widetilde{\boldsymbol{M}}$ approximates $\boldsymbol{M}$ and then compare the NetMF matrix (Eq. (3)) against the NetMF matrix sparsifier (Eq. (6)). We use $\sigma_{i}$ to denote the $i$-th descending-order singular value of a matrix. We also assume the vertices' degrees are sorted in ascending order, that is, $d_{\min} = d_{1} \le d_{2} \le \cdots \le d_{n}$.
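Theorem. The singular values of $\widetilde{\boldsymbol{M}} - \boldsymbol{M}$ satisfy $\sigma_{i}(\widetilde{\boldsymbol{M}} - \boldsymbol{M}) \le \frac{4\epsilon}{\sqrt{d_{i}d_{\min}}}$ for all $i \in [n]$.
Theorem. Let $\|\cdot\|_{F}$ be the matrix Frobenius norm. Then
$$\left\| \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\widetilde{\boldsymbol{M}}\right) - \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{b}\boldsymbol{M}\right) \right\|_{F} \le \frac{4\epsilon\operatorname{vol}(G)}{b\sqrt{d_{\min}}}\sqrt{\sum_{i=1}^{n}\frac{1}{d_{i}}}.$$
Proof. See Appendix. ∎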
Discussion on the Approximation Error. The above bound is achieved without making assumptions about the input network. If we introduce some assumptions, say a bounded lowest degree or a specific random graph model (e.g., Planted Partition Model or Extended Planted Partition Model), it is promising to explore tighter bounds by leveraging theorems in literature (Dasgupta et al., 2004; Chaudhuri et al., 2012).
Each step of NetSMF can be parallelized, enabling it to scale to very large networks. The parallelization design of NetSMF is introduced in Figure 1. Below we discuss the parallelization of each step in detail.
At the first step, the paths in the PathSampling algorithm are sampled independently of each other. Thus we can launch multiple PathSampling workers simultaneously. Each worker handles a subset of the samples. Herein, we require that each worker is able to access the network data efficiently. There are many options to meet this requirement. The easiest one is to load a copy of the network data into each worker's memory. When the network is extremely large (e.g., trillion scale) or workers have memory constraints, the graph engine should be designed to expose efficient graph query APIs to support graph operations such as random walks. At the end of this step, a reducer is designed to merge parallel edges and sum up their weights. If this step is implemented in a big data system such as Spark (Zaharia et al., 2010), the reduction step can be simply achieved by running a reduceByKey(_+_) function (Footnote 4: https://spark.apache.org/docs/latest/rdd-programming-guide.html). After the reduction, the sparsifier is organized as a collection of triplets, a.k.a. COOrdinate format, with each $(u, v, w)$ indicating an entry of the sparsifier.
The second step is the most straightforward step to scale up. When processing a triplet $(u, v, w)$, we can simply query the degrees of vertices $u$ and $v$ and perform the transformation defined in Eq. (5) as well as the truncated logarithm in Eq. (6), which can be well parallelized.
For the last step, we organize the sparsifier into row-major format. This format allows efficient multiplication between a sparse and a dense matrix (Alg. 3, Lines 3 and 5). Other dense matrix operators (e.g., Gaussian random matrix generation, Gram-Schmidt orthogonalization, and Jacobi SVD) can be easily accelerated by using multi-threading or common linear algebra libraries.
In this work, we adopt a single-machine shared-memory implementation. We use OpenMP (Dagum and Menon, 1998) to parallelize NetSMF in our implementation. (Footnote 5: Code is publicly available at https://github.com/xptree/NetSMF.)
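The reduction that closes the first step amounts to a reduce-by-key over edge triplets. The following single-process sketch mimics that behavior with a dictionary; it assumes each worker returns a list of `(u, v, weight)` triplets, and the function name `merge_parallel_edges` is hypothetical.

```python
from collections import defaultdict

def merge_parallel_edges(worker_outputs):
    """Sum the weights of identical (u, v) pairs across all workers."""
    merged = defaultdict(float)
    for triplets in worker_outputs:
        for u, v, w in triplets:
            merged[(u, v)] += w
    # Return COO-style triplets sorted by (row, column), which is also a
    # convenient starting point for building a row-major sparse matrix.
    return [(u, v, w) for (u, v), w in sorted(merged.items())]
```

Sorting by row first keeps all entries of a row contiguous, which is what the row-major layout used in the last step needs.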
In this section, we evaluate the proposed NetSMF method on the multi-label vertex classification task, which has been commonly used to evaluate previous network embedding techniques (Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016; Qiu et al., 2018). We introduce our datasets and baselines in Section 4.1 and Section 4.2. We report experimental results and parameter analysis in Section 4.3 and Section 4.4, respectively.
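Table 4: Statistics of datasets.
Dataset | BlogCatalog | PPI | Flickr | YouTube | OAG
$\lvert V\rvert$ | 10,312 | 3,890 | 80,513 | 1,138,499 | 67,768,244
$\lvert E\rvert$ | 333,983 | 76,584 | 5,899,882 | 2,990,443 | 895,368,962
#labels | 39 | 50 | 195 | 47 | 19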
We employ five datasets for the prediction task, four of which are of relatively small scale but have been widely used in network embedding literature, including BlogCatalog, PPI, Flickr, and YouTube. The remaining one is a large-scale academic co-authorship network, which is at least two orders of magnitude larger than the largest one (YouTube) used in most network embedding studies. The statistics of these datasets are listed in Table 4.
BlogCatalog (Tang and Liu, 2009a; Agarwal et al., 2009) is a network of social relationships of online bloggers. The vertex labels represent the interests of the bloggers.
Protein-Protein Interactions (PPI) (Stark et al., 2010) is a subgraph of the PPI network for Homo Sapiens. The vertex labels are obtained from the hallmark gene sets and represent biological states.
Flickr (Tang and Liu, 2009a) is the user contact network in Flickr. The labels represent the interest groups of the users.
YouTube (Tang and Liu, 2009b) is a video-sharing website that allows users to upload, view, rate, share, add to their favorites, report, and comment on videos. The users are labeled by the video genres they liked.
Open Academic Graph (OAG) (Footnote 6: www.openacademic.ai/oag/) is an academic graph indexed by Microsoft Academic (Sinha et al., 2015) and AMiner.org (Tang et al., 2008). We construct an undirected co-authorship network from OAG, which contains 67,768,244 authors and 895,368,962 collaboration edges. The vertex labels are defined to be the top-level fields of study of each author, such as computer science, physics, and psychology. In total, there are 19 distinct fields (labels), and authors may publish in more than one field, so the associated vertices may have multiple labels.
We compare NetSMF with NetMF (Qiu et al., 2018), LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), and node2vec (Grover and Leskovec, 2016). For NetSMF, NetMF, DeepWalk, and node2vec, which allow multi-hop structural dependencies, the context window size $T$ is set to 10, which is also the default setting used in both DeepWalk and node2vec. Across all datasets, we set the embedding dimension $d$ to 128. We follow the common practice for the other hyperparameter settings, which are introduced below.
LINE. We use LINE with the second order proximity (i.e., LINE (2nd) (Tang et al., 2015)). We use the default setting of LINE’s hyperparameters: the number of edge samples to be 10 billion and the negative sample size to be 5.
DeepWalk. We present DeepWalk's results with the authors' preferred parameters, that is, walk length to be 40, the number of walks from each vertex to be 80, and the number of negative samples in skip-gram to be 5.
node2vec. For the return parameter $p$ and in-out parameter $q$ in node2vec, we adopt the default setting that was used by its authors if available. Otherwise, we grid search $p, q \in \{0.25, 0.5, 1, 2, 4\}$. For a fair comparison, we use the same walk length and the number of walks per vertex as DeepWalk.
NetMF. In NetMF, the hyperparameter $h$ indicates the number of eigenpairs used to approximate the NetMF matrix. We choose $h = 256$ for the BlogCatalog, PPI, and Flickr datasets.
NetSMF. In NetSMF, we set the number of samples $M = 10^{3} \times T \times m$ for the PPI, Flickr, and YouTube datasets, $M = 10^{4} \times T \times m$ for BlogCatalog, and $M = 10 \times T \times m$ for OAG in order to achieve desired performance. For both NetMF and NetSMF, we have $b = 1$.
Prediction Setting. We follow the same experiment and evaluation procedures that were performed in DeepWalk (Perozzi et al., 2014). First, we randomly sample a portion of labeled vertices for training and use the remaining for testing. For the BlogCatalog and PPI datasets, the training ratio varies from 10% to 90%. For Flickr, YouTube, and OAG, the training ratio varies from 1% to 10%. We use the one-vs-rest logistic regression model implemented by LIBLINEAR (Fan et al., 2008) for the multi-label vertex classification task. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. To avoid the thresholding effect, we take the assumption that was made in DeepWalk, LINE, and node2vec, that is, the number of labels for vertices in the test data is given (Perozzi et al., 2014; Tang et al., 2009; Grover and Leskovec, 2016). We repeat the prediction procedure ten times and evaluate the average performance in terms of both Micro-F1 and Macro-F1 scores (Tsoumakas et al., 2009). All the experiments are performed on a server with an Intel Xeon E7-8890 CPU (64 cores), 1.7TB memory, and a 2TB SSD hard drive.
We summarize the prediction performance in Figure 2. To compare the efficiency of different algorithms, we also list the running time of each algorithm across all datasets, if available, in Table 5.
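The test-phase protocol (rank the labels, keep as many as the vertex truly has, then score) can be sketched without any library. The code below is an illustrative sketch under stated assumptions: `scores` holds one list of per-label scores per test vertex, labels are integer indices, a label with no true or predicted instances contributes an F1 of 0 to the macro average, and both function names are hypothetical.

```python
def predict_top_k(scores, num_true_labels):
    """Keep the k highest-scoring labels per vertex, with k assumed known."""
    preds = []
    for row, k in zip(scores, num_true_labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        preds.append(set(ranked[:k]))
    return preds

def micro_macro_f1(truth, preds, num_labels):
    """Compute (Micro-F1, Macro-F1) from per-label confusion counts."""
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for t, p in zip(truth, preds):
        for j in range(num_labels):
            if j in p and j in t:
                tp[j] += 1
            elif j in p:
                fp[j] += 1
            elif j in t:
                fn[j] += 1

    def f1(a, b, c):
        denom = 2 * a + b + c
        return 2 * a / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))            # pool counts over labels
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / num_labels
    return micro, macro
```

Micro-F1 pools the counts and is therefore dominated by frequent labels, whereas Macro-F1 weights every label equally, which is why the two metrics can rank methods differently.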
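Table 5: Running time of each method. "–" indicates that the method did not finish within one week; "×" indicates that the method failed to complete because of excessive space and memory consumption.
Dataset | LINE | DeepWalk | node2vec | NetMF | NetSMF
BlogCatalog | 40 mins | 12 mins | 56 mins | 2 mins | 13 mins
PPI | 41 mins | 4 mins | 4 mins | 16 secs | 10 secs
Flickr | 42 mins | 2.2 hours | 21 hours | 2 hours | 48 mins
YouTube | 46 mins | 1 day | 4 days | × | 4.1 hours
OAG | 2.6 hours | – | – | × | 24 hours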
NetSMF vs. NetMF. We first focus on the comparison between NetSMF and NetMF, since the goal of NetSMF is to address the efficiency and scalability issues of NetMF while maintaining its superiority in effectiveness. From Table 5, we observe that for YouTube and OAG, both of which contain more than one million vertices, NetMF fails to complete because of the excessive space and memory consumption, while NetSMF is able to finish in four hours and one day, respectively. For the moderate-size network Flickr, both methods are able to complete within one week, though NetSMF is 2.5× faster (i.e., 48 mins vs. 2 hours). For small-scale networks, NetMF is faster than NetSMF in BlogCatalog and is comparable to NetSMF in PPI in terms of running time. This is because when the input networks contain only thousands of vertices, the advantage of sparse matrix construction and factorization over its dense alternative could be marginalized by other components of the workflow.
In terms of prediction performance, Figure 2 suggests NetSMF and NetMF consistently yield the best results among all compared methods, empirically demonstrating the power of the matrix factorization framework for network embedding. In BlogCatalog, NetSMF has slightly worse performance than NetMF (on average less than 3.1% worse regarding both Micro- and Macro-F1). In PPI, the two leading methods' performance is relatively indistinguishable in terms of both metrics. In Flickr, NetSMF achieves significantly better Macro-F1 than NetMF (by 3.6% on average), and also higher Micro-F1 (by 5.3% on average). Recall that NetMF uses a dense approximation of the matrix to factorize. These results show that the sparse spectral approximation used by NetSMF does not necessarily yield worse performance than the dense approximation used by NetMF.
Overall, NetSMF not only improves the scalability and the running time of NetMF by orders of magnitude for large-scale networks, but also has competitive, and sometimes better, performance. This demonstrates the effectiveness of our spectral sparsification based approximation algorithm.
NetSMF vs. DeepWalk, LINE, and node2vec. We also compare NetSMF against common graph embedding benchmarks: DeepWalk, LINE, and node2vec. For the OAG dataset, DeepWalk and node2vec fail to finish the computation within one week, while NetSMF requires only 24 hours. Based on the publicly reported running time of skip-gram (Mikolov et al., 2013a), we estimate that DeepWalk and node2vec may require months to generate embeddings for the OAG dataset. In BlogCatalog, DeepWalk and NetSMF require similar computing time, while in Flickr, YouTube, and PPI, NetSMF is 2.75×, 5.9×, and 24× faster than DeepWalk, respectively. In all the datasets, NetSMF achieves a 4-24× speedup over node2vec.
Moreover, the performance of NetSMF is significantly better than DeepWalk in BlogCatalog, PPI, and Flickr, by 7-34% in terms of Micro-F1 and 5-25% in terms of Macro-F1. In YouTube, NetSMF achieves comparable results to DeepWalk. Compared with node2vec, NetSMF achieves comparable performance in BlogCatalog and YouTube, and significantly better performance in PPI and Flickr. In summary, NetSMF consistently outperforms DeepWalk and node2vec in terms of both efficiency and effectiveness.
LINE has the best efficiency among all the five methods, and together with NetSMF, they are the only methods that can generate embeddings for OAG within one week (and both finish in one day). However, it also has the worst prediction performance and consistently loses to others by a large margin across all datasets. For example, NetSMF beats LINE by 21% and 39% in Flickr, and by 30% and 100% in OAG in terms of Micro-F1 and Macro-F1, respectively.
In summary, LINE achieves efficiency at the cost of ignoring multi-hop dependencies in networks, which are supported by all the other four methods (DeepWalk, node2vec, NetMF, and NetSMF), demonstrating the importance of multi-hop dependencies for learning network representations.
More importantly, among these four methods, DeepWalk achieves neither efficiency nor effectiveness superiority; node2vec achieves relatively good performance at the cost of efficiency; NetMF achieves effectiveness at the expense of significantly increased time and space costs; NetSMF is the only method that achieves both high efficiency and effectiveness, empowering it to learn effective embeddings for billion-scale networks (e.g., the OAG network with 0.9 billion edges) in one day on one modern server.
In this section, we discuss how the hyperparameters influence the performance and efficiency of NetSMF. We report all the parameter analyses on the Flickr dataset with training ratio set to be 10%.
How to Set the Embedding Dimension. As mentioned in Section 3.1, SVD allows us to determine a “good” embedding dimension without supervised information. Many methods are available, such as captured energy and Cattell’s Scree test (Cattell, 1966). Here we propose to use Cattell’s Scree test, which plots the singular values and selects a rank at which there is a clear drop in the magnitudes or the singular values start to even out. In Flickr, if we sort the singular values in decreasing order, we observe that they approach 0 when the rank increases to around 100, as shown in Figure 3. In our experiments, by varying the embedding dimension over a range of values, we find the best-performing dimension, as shown in Figure 3, demonstrating the ability of the matrix factorization based NetSMF to determine the embedding dimension automatically.
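As a rough illustration of the two rank-selection heuristics named above, the following sketch picks a rank from a list of singular values. The function names, the 0.9 energy threshold, and the use of the largest consecutive gap as a numerical stand-in for the (visual) Scree test are assumptions made for this example; they are not part of NetSMF.

```python
import numpy as np


def rank_by_captured_energy(singular_values, threshold=0.9):
    """Smallest rank whose squared singular values hold `threshold` of the total."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, threshold) + 1)


def rank_by_largest_gap(singular_values):
    """Rank just before the largest drop between consecutive singular values."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    gaps = s[:-1] - s[1:]
    return int(np.argmax(gaps) + 1)


# Toy spectrum (hypothetical): three dominant singular values, then a flat tail.
toy_matrix = np.diag([10.0, 9.0, 8.0, 0.5, 0.4, 0.3])
toy_singular_values = np.linalg.svd(toy_matrix, compute_uv=False)

energy_rank = rank_by_captured_energy(toy_singular_values)
gap_rank = rank_by_largest_gap(toy_singular_values)
print(energy_rank, gap_rank)
```

On a real spectrum the two heuristics need not agree, which is one reason the Scree test is usually applied by inspecting the plot rather than by a fixed rule.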
The Number of Non-Zeros. In theory, the number of non-zeros must be large enough to guarantee the approximation error (see Section 3.1). Without loss of generality, we empirically scale the number of non-zeros by a multiplier chosen from 1, 10, 100, 200, 500, 1000, and 2000, and investigate how the number of non-zeros influences the quality of the learned embeddings. As shown in Figure 3, when increasing the number of non-zeros, NetSMF tends to have better prediction performance because the original matrix is approximated more accurately. However, although increasing the number of non-zeros has a positive effect on prediction performance, its marginal benefit diminishes gradually. One can observe that setting the multiplier to 1000 (the second-to-the-right data point on each line in Figure 3) is a good choice that balances NetSMF’s efficiency and effectiveness.
The Number of Threads. In this work, we use a single-machine shared-memory implementation with multi-threading acceleration. We report the running time of NetSMF when setting the number of threads to 1, 10, 20, 30, and 60. As shown in Figure 3, NetSMF takes 12 hours to embed the Flickr network with one thread and 48 minutes with 30 threads, achieving a 15× speedup (the ideal being 30×). This relatively good sublinear speedup allows NetSMF to scale up to very large-scale networks.
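The speedup figure above follows directly from the two reported running times. A minimal sketch of the arithmetic, where the helper name and the notion of "parallel efficiency" (speedup divided by thread count) are introduced here for illustration only:

```python
def speedup_and_efficiency(serial_minutes, parallel_minutes, threads):
    """Return (speedup, parallel efficiency) for a multi-threaded run."""
    speedup = serial_minutes / parallel_minutes
    return speedup, speedup / threads


# Reported Flickr timings: 12 hours with 1 thread, 48 minutes with 30 threads.
speedup, efficiency = speedup_and_efficiency(12 * 60, 48, 30)
print(f"speedup = {speedup:.1f}x, efficiency = {efficiency:.0%}")
```

Under these numbers, each of the 30 threads contributes about half of what perfect linear scaling would give.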
In this section, we review related work on network embedding, large-scale embedding algorithms, and spectral graph sparsification.
Network embedding has been extensively studied over the past years (Hamilton et al., 2017). The success of network embedding has driven many downstream network applications, such as recommendation systems (Ying et al., 2018). Briefly, recent work on network embedding can be categorized into three genres: (1) Skip-gram based methods that are inspired by word2vec (Mikolov et al., 2013a), such as LINE (Tang et al., 2015), DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016), metapath2vec (Dong et al., 2017), and VERSE (Tsitsulin et al., 2018); (2) Deep learning based methods such as those of Ying et al. (2018) and Kipf and Welling (2017); (3) Matrix factorization based methods such as GraRep (Cao et al., 2015) and NetMF (Qiu et al., 2018). Among them, NetMF bridges the first and the third categories by unifying a collection of skip-gram based network embedding methods into a matrix factorization framework. In this work, we leverage the merit of NetMF and address its limitation in efficiency. In the literature, PinSage is notably a network embedding framework for billion-scale networks (Ying et al., 2018). NetSMF and PinSage differ in their goals: NetSMF aims to pre-train general network embeddings in an unsupervised manner, while PinSage is a supervised graph convolutional method that incorporates both the objective of recommender systems and existing node features. That being said, the embeddings learned by NetSMF can be consumed by PinSage for downstream network applications.
Studies have attempted to optimize embedding algorithms for large datasets from different perspectives. Some focus on improving the skip-gram model, while others treat it as matrix factorization.
Distributed Skip-Gram Model. Inspired by word2vec (Mikolov et al., 2013b), most modern embedding learning algorithms are based on the skip-gram model. A line of work tries to accelerate the skip-gram model in a distributed system. For example, Ji et al. (2016) replicate the embedding matrix on multiple workers and synchronize them periodically; Ordentlich et al. (2016) distribute the columns (dimensions) of the embedding matrix to multiple executors and synchronize them with a parameter server (Li et al., 2014). Negative sampling is a key step in skip-gram, which requires drawing samples from a noise distribution. Stergiou et al. (2017) focus on the optimization of negative sampling by replacing the roulette wheel selection with a hierarchical sampling algorithm based on the alias method. More recently, Wang et al. (2018) propose a billion-scale network embedding framework that heuristically partitions the input graph into small subgraphs and processes them separately in parallel. However, the performance of their framework relies heavily on the quality of the graph partition. The drawback of partition-based embedding learning is that the embeddings learned in different subgraphs do not share the same latent space, making it impossible to compare nodes across subgraphs.
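For readers unfamiliar with the alias method mentioned above in connection with negative sampling, the following is a minimal single-level sketch (Vose's variant). It is not the hierarchical algorithm of Stergiou et al. (2017); the function names and the toy weights are assumptions made for this example.

```python
import random


def build_alias_table(weights):
    """Preprocess non-negative weights into (prob, alias) tables in O(n)."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # keep s with this probability
        alias[s] = l                 # otherwise fall through to l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias               # leftovers keep prob 1.0


def alias_draw(prob, alias, rng):
    """Draw one index in O(1): pick a column, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def table_distribution(prob, alias):
    """Exact distribution implied by the tables (for checking only)."""
    n = len(prob)
    dist = [0.0] * n
    for i in range(n):
        dist[i] += prob[i] / n
        dist[alias[i]] += (1.0 - prob[i]) / n
    return dist


weights = [1, 2, 3, 4]               # hypothetical unnormalized noise weights
prob, alias = build_alias_table(weights)
implied = table_distribution(prob, alias)
samples = [alias_draw(prob, alias, random.Random(0)) for _ in range(3)]
print(implied, samples)
```

Compared with roulette wheel selection, which needs a linear or logarithmic search per draw, the alias tables give constant-time draws after linear-time preprocessing.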
Efficient Matrix Factorization. Factorizing the NetMF matrix, either implicitly (e.g., LINE (Tang et al., 2015) and DeepWalk (Perozzi et al., 2014)) or explicitly (e.g., NetMF (Qiu et al., 2018)), encounters two issues. First, the denseness of this matrix makes computation expensive even for a moderate context window size. Second, the non-linear transformation, i.e., the element-wise matrix logarithm, is hard to approximate. LINE (Tang et al., 2015) solves this problem by setting the context window size to 1. With this simplification, it achieves good scalability at the cost of prediction performance. NetSMF addresses these issues by efficiently sparsifying the dense NetMF matrix with a theoretically bounded approximation error.
Spectral graph sparsification has been studied for decades in graph theory (Teng et al., 2016). The task of graph sparsification is to approximate a “dense” graph by a “sparse” one that can be effectively used in place of the dense one (Teng et al., 2016). It arises in many applications such as scientific computing (Higham and Lin, 2011), other areas (Cheng et al., 2015a; Calandriello et al., 2018), and data mining (Zhao, 2015). Our NetSMF model is the first work that incorporates spectral sparsification algorithms (Cheng et al., 2015b, a) into network embedding, which offers a powerful and efficient way to approximate and analyze the random-walk matrix polynomial in the NetMF matrix.
In this work, we study network embedding with the goal of achieving both efficiency and effectiveness. To address the scalability challenges faced by the NetMF model, we propose to study large-scale network embedding as sparse matrix factorization. We present the NetSMF algorithm, which achieves a sparsification of the (dense) NetMF matrix. Both the construction and factorization of the sparsified matrix are fast enough to support very large-scale network embedding learning. For example, NetSMF can embed the Open Academic Graph in 24 hours, a network whose size is computationally intractable for the dense matrix factorization solution (NetMF). Theoretically, the sparsified matrix is spectrally close to the original NetMF matrix with an approximation bound. Empirically, our extensive experimental results show that the embeddings learned by NetSMF from the sparsified matrix are as effective as those from the factorization of the NetMF matrix, allowing it to outperform the common network embedding benchmarks (DeepWalk, LINE, and node2vec). In other words, among both matrix factorization based methods (NetMF and NetSMF) and common skip-gram based benchmarks (DeepWalk, LINE, and node2vec), NetSMF is the only model that achieves both efficiency and performance superiority.
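To make the denseness issue concrete, the sketch below counts the non-zeros of a random-walk matrix polynomial, the sum of the first few powers of the transition matrix, on a small ring graph. The ring graph, the window sizes 1 and 10, and all function names are assumptions chosen for illustration; the full NetMF matrix additionally applies a scaling and an element-wise logarithm, which are omitted here.

```python
import numpy as np


def ring_graph(n):
    """Adjacency matrix of an undirected cycle on n nodes."""
    adjacency = np.zeros((n, n))
    idx = np.arange(n)
    adjacency[idx, (idx + 1) % n] = 1.0
    adjacency[(idx + 1) % n, idx] = 1.0
    return adjacency


def random_walk_polynomial(adjacency, window):
    """Sum of P^r for r = 1..window, where P is the row-normalized adjacency."""
    degrees = adjacency.sum(axis=1)
    transition = adjacency / degrees[:, None]
    power = np.eye(adjacency.shape[0])
    total = np.zeros_like(transition)
    for _ in range(window):
        power = power @ transition
        total += power
    return total


ring = ring_graph(50)
nnz_one_hop = np.count_nonzero(random_walk_polynomial(ring, window=1))
nnz_ten_hop = np.count_nonzero(random_walk_polynomial(ring, window=10))
print(nnz_one_hop, nnz_ten_hop)
```

Even on this very sparse graph the number of non-zeros grows roughly linearly with the window size; on real networks with high-degree nodes the fill-in is far faster, which is the cost that sparsification is meant to avoid.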
Future Work. NetSMF brings an efficient, effective, and guaranteed solution to network embedding learning. There are multiple tangible research fronts we can pursue. First, our current single-machine implementation limits the number of samples we can take for large networks. We plan to develop a multi-machine solution in the future to further scale NetSMF. Second, building upon NetSMF, we would like to efficiently and accurately learn embeddings for large-scale directed (Cohen et al., 2016), dynamic (Kapralov et al., 2017), and/or heterogeneous networks. Third, given the demonstrated advantage of matrix factorization methods, we are also interested in exploring other matrix definitions that may be effective in capturing different structural properties in networks. Last, it would also be interesting to bridge matrix factorization based network embedding methods with graph convolutional networks.
Acknowledgements. We would like to thank Dehua Cheng and Youwei Zhuo from USC for helpful discussions. Jian Li is supported in part by the National Basic Research Program of China Grant 2015CB358700, the National Natural Science Foundation of China Grant 61822203, 61772297, 61632016, 61761146003, and a grant from Microsoft Research Asia. Jie Tang is the corresponding author.
We first prove the theorems stated in Section 3.3. The following lemmas will be useful in our proof.
Lemma 1 (Trefethen and Bau III, 1997). Singular values of a real symmetric matrix are the absolute values of its eigenvalues.
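The statement that the singular values of a real symmetric matrix equal the absolute values of its eigenvalues can be checked numerically. The sketch below does so on one random symmetric matrix; the matrix size and the seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.standard_normal((6, 6))
symmetric = (raw + raw.T) / 2.0  # symmetrize a random matrix

eigenvalues = np.linalg.eigvalsh(symmetric)                    # real eigenvalues
singular_values = np.linalg.svd(symmetric, compute_uv=False)   # sorted descending

# Compare singular values with |eigenvalues| sorted in the same order.
abs_eigs_desc = np.sort(np.abs(eigenvalues))[::-1]
lemma_holds = bool(np.allclose(singular_values, abs_eigs_desc))
print(lemma_holds)
```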
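Lemma 2 (Courant-Fischer Theorem). Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, then for $i \in [n]$,
$$\lambda_i = \min_{\dim(U) = i} \; \max_{x \in U,\, x \neq 0} \frac{x^\top A x}{x^\top x}.$$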
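Lemma 3 (Horn and Johnson, 1991). Let $B, C \in \mathbb{R}^{n \times n}$ be two symmetric matrices. Then for the decreasingly ordered singular values $\sigma$ of $B$, $C$, and $BC$,
$$\sigma_{i+j-1}(BC) \le \sigma_i(B)\, \sigma_j(C)$$
holds for any $1 \le i, j \le n$ and $i + j \le n + 1$.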
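Assuming the lemma above is the standard product inequality for singular values, namely that the singular value of index i + j - 1 of a product is at most the product of the i-th and j-th singular values of the factors (all ordered decreasingly), it can be spot-checked on random symmetric matrices. The helper names, the matrix size, and the tolerance are choices made for this sketch.

```python
import numpy as np


def random_symmetric(n, rng):
    raw = rng.standard_normal((n, n))
    return (raw + raw.T) / 2.0


def product_inequality_holds(b, c, tol=1e-9):
    """Check sigma_{i+j-1}(BC) <= sigma_i(B) * sigma_j(C) for all valid i, j."""
    n = b.shape[0]
    s_b = np.linalg.svd(b, compute_uv=False)
    s_c = np.linalg.svd(c, compute_uv=False)
    s_bc = np.linalg.svd(b @ c, compute_uv=False)
    for i in range(n):                 # 0-based i corresponds to 1-based i + 1
        for j in range(n - i):         # keeps the 1-based constraint i + j <= n + 1
            if s_bc[i + j] > s_b[i] * s_c[j] + tol:
                return False
    return True


rng = np.random.default_rng(1)
all_hold = all(
    product_inequality_holds(random_symmetric(5, rng), random_symmetric(5, rng))
    for _ in range(100)
)
print(all_hold)
```

A numerical check of this kind is not a proof, but it is a quick guard against index mistakes when applying the inequality.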
Let and similarly . Then all the singular values of are smaller than , i.e., , .
Notice that
which is a normalized graph Laplacian whose eigenvalues lie in the interval , i.e., for , (Von Luxburg, 2007). Since is a spectral sparsifier of , we know that for ,
Let which is bijective, we have
The last inequality is because we assume . Then, by the Courant-Fischer Theorem (Lemma 2), we can immediately get, ,
Then, by Lemma 1, . ∎
Given the above lemmas, we can see how the constructed approximates and how the constructed NetMF matrix sparsifier (Eq. (6)) approximates the NetMF matrix (Eq. (3)).
It is easy to observe that the element-wise truncated logarithm is 1-Lipschitz w.r.t. the Frobenius norm. So we have
∎
We finally explain the remaining question in Step 1 of NetSMF: After sampling a length path , why does the algorithm add a new edge to the sparsifier with weight ? Our proof relies on two lemmas from (Cheng et al., 2015b).
(Lemma 3.3 in (Cheng et al., 2015b)) Given the path length , the probability for the PathSampling algorithm to sample a path is , where is defined in Eq. (7) and
(Theorem 2.2 in (Cheng et al., 2015b)) After sampling a length path , the weight corresponding to the new edge added to the sparsifier should be .
After sampling a length path using the PathSampling algorithm (Alg. 2). The weight of the new edge added to the sparsifier is .
Spectral analysis of random graphs with skewed degree distributions. In FOCS ’04. 602–610.
Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD ’18.