
# NetMF+: Network Embedding Based on Fast and Effective Single-Pass Randomized Matrix Factorization

In this work, we propose NetMF+, a fast, memory-efficient, scalable, and effective network embedding algorithm developed for a single machine with CPU only. NetMF+ is based on the theoretically grounded embedding method NetMF and leverages the theory of randomized matrix factorization to learn embeddings efficiently. We first propose a fast randomized eigen-decomposition algorithm for the modified Laplacian matrix. Then, sparse-sign randomized single-pass singular value decomposition (SVD) is used to avoid constructing a dense matrix and to generate a promising embedding. To enhance the performance of the embedding, we apply spectral propagation in NetMF+. Finally, a high-performance parallel graph processing stack, GBBS, is used to achieve memory efficiency. Experimental results show that NetMF+ can learn a powerful embedding from a network with more than 10^11 edges within 1.5 hours at a lower memory cost than state-of-the-art methods. The result on ClueWeb, with 0.9 billion vertices and 75 billion edges, shows that NetMF+ saves more than half of the memory and runtime compared with the state of the art while achieving better performance. The source code of NetMF+ will be publicly available after the anonymous peer review.


## 1. Introduction

The past few years have witnessed the growing complexity of social networks and the challenges of processing them. How to learn important information from networks at a lower cost therefore remains an open problem. Recently, representation learning on graphs has provided a new paradigm for modeling networks (Hamilton et al., 2017). Network embedding, i.e., learning latent representations for networks, aims to map a network into a latent space. Recent studies have shown that the learned embedding can facilitate various network applications such as vertex classification and link prediction (Tang et al., 2015; Perozzi et al., 2014). These learned embeddings are widely used in online services. For example, a vital part of the item recommendation system on Alibaba's e-commerce platform requires re-embedding when new users or items arrive online (Wang et al., 2018), so the embedding must be computed quickly for billions of users and items. With today's billion-scale social networks, there is an urgent need to compute embeddings with low latency and low memory overhead.

Recent research has shown the superiority of skip-gram based models such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016). The embedding learned by LINE loses multi-hop dependencies because it only captures first- and second-order proximities. DeepWalk and node2vec cannot handle large-scale networks because of their expensive computation costs. Network embedding as matrix factorization (NetMF) (Qiu et al., 2018) gives a theoretical foundation showing that DeepWalk, LINE, and node2vec can be viewed as implicit factorizations of a closed-form matrix. NetMF performs better on the multi-label vertex classification task than DeepWalk and LINE, which indicates that methods based on matrix factorization can learn effective embeddings. However, the matrix to be factorized in NetMF is an $n \times n$ dense one, where $n$ is the number of vertices in the network. It is thus prohibitively expensive to construct and factorize directly for billion-scale networks. Fortunately, network embedding as sparse matrix factorization (NetSMF) (Qiu et al., 2019) makes this dense matrix sparse through spectral sparsification theory. However, NetSMF is still relatively costly in time and memory because it needs a large number of random walks to construct a matrix that is only slightly sparse. ProNE (Zhang et al., 2019) proposes to first factorize a sparse matrix and then apply a spectral propagation scheme to enhance embedding quality. LightNE (Qiu et al., 2021) combines the strengths of NetSMF and ProNE into a cost-effective and scalable embedding system that beats several state-of-the-art systems such as GraphVite (Zhu et al., 2019), PyTorch-BigGraph (Lerer et al., 2019), and ProNE (Zhang et al., 2019). However, the sparsifier construction process of LightNE still incurs a huge memory overhead, which prevents LightNE from learning effective embeddings from networks with hundreds of billions of edges. To obtain a high-quality, scalable, memory-efficient, and cost-effective network embedding method that can learn promising embeddings from networks with billions of nodes and hundreds of billions of edges, we design NetMF+.

Our Contribution. To tackle the challenge of billion-scale networks on a ubiquitous machine equipped with only shared memory and multi-core CPUs, we build a network embedding system, NetMF+, by leveraging the following techniques:

Firstly, we review the NetMF algorithm, whose first step is eigen-decomposition, followed by forming a dense matrix from the eigenvalues and eigenvectors. It then generates the embedding by performing singular value decomposition (SVD) on the dense matrix. For the first step, we propose a fast randomized eigen-decomposition algorithm that is fast, scalable, and accurate. Experiments show that it is up to 93X faster than the standard eigsh routine in scipy (a wrapper around ARPACK (Lehoucq et al., 1998)) without sacrificing accuracy on the YouTube dataset, which contains 1.1M vertices and 2.9M edges. We use this algorithm in place of the first step of NetMF. We then keep the eigenvectors and eigenvalues and avoid constructing the dense matrix in memory.

Secondly, we combine two randomized matrix factorization techniques: sparse-sign random projection and the single-pass SVD algorithm. Instead of performing conventional SVD on the dense matrix, we present a sparse-sign randomized single-pass SVD algorithm that generates an approximate low-rank result without constructing an $n \times n$ dense matrix in memory. The spectral propagation strategy proposed in ProNE can then be used to enhance the quality of the embedding.

Thirdly, considering that the CSR format of a network with hundreds of billions of edges is a huge memory overhead, we design a sparse matrix-matrix multiplication operation based on state-of-the-art shared-memory graph processing techniques, which compress the graph and support parallel graph operations. Memory efficiency is thus achieved while preserving accuracy and efficiency. The main computation of NetMF+ is basic linear algebra, so we can use the Intel Math Kernel Library (MKL) for these computations. These techniques make NetMF+ the fastest and most memory-efficient network embedding system compared with other state-of-the-art methods. In particular, NetMF+ can embed one of the largest publicly available graphs, the Web Data Commons hyperlink 2012 graph, which has over 200 billion edges. In summary, the main advantages of NetMF+ are:

1. Accurate: NetMF+ achieves the highest accuracy in the vertex classification and link prediction tasks under the same time budget.

2. Efficient: NetMF+ is the fastest among state-of-the-art methods in most cases, according to experiments.

3. Memory-efficient: NetMF+ has the lowest memory cost among network embedding systems designed for a single machine.

4. Scalable: NetMF+ can embed networks with 200 billion edges within 1.5 hours.

## 2. Related Work

In this section, we review the related work of network embedding and randomized matrix factorization.

Network Embedding. Network embedding has been comprehensively studied over the past decade. In general, recent work on network embedding can be divided into three categories. The first category is based on skip-gram methods inspired by word2vec (Mikolov et al., 2013), including DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016). These methods rely on stochastic gradient descent to optimize a logistic loss. The second category is based on deep learning methods, including GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018), GIN (Xu et al., 2019), GraphSAGE (Hamilton et al., 2017), and PinSAGE (Ying et al., 2018). They are beyond the scope of this paper because we focus on graphs with no additional information and pre-train general network embeddings in an unsupervised manner, whereas these methods pre-train in a supervised manner with vertex attributes. The third category is based on matrix factorization, using SVD to generate the best low-rank approximation (Eckart and Young, 1936). Examples in this category include GraRep (Cao et al., 2015), HOPE (Ou et al., 2016), NetMF (Qiu et al., 2018), NetSMF (Qiu et al., 2019), ProNE (Zhang et al., 2019), and LightNE (Qiu et al., 2021). Several embedding systems for large graphs have also been developed. GraphVite (Zhu et al., 2019), a CPU-GPU hybrid network embedding system, is based on DeepWalk (Perozzi et al., 2014) and LINE (Tang et al., 2015). In GraphVite, the CPU performs graph operations and the GPU performs linear algebra. Nevertheless, limited GPU memory is a disadvantage when processing billion-scale networks, which restricts widespread use. PyTorch-BigGraph (Lerer et al., 2019), a distributed-memory system, is also based on DeepWalk (Perozzi et al., 2014) and LINE (Tang et al., 2015). It achieves load balancing by graph partitioning and synchronization through a shared parameter server. In this work, we propose NetMF+, which leverages the merits of NetMF and addresses its limitations in speed and memory overhead.

Randomized Matrix Factorization. Randomized matrix computation has gained significant popularity as data becomes larger and larger. Randomized SVD can be an alternative to conventional SVD methods because it involves the same or fewer floating-point operations and, by exploiting modern computing architectures, is more efficient for truly large high-dimensional data sets (Halko et al., 2011; Drineas and Mahoney, 2016). frPCA (Feng et al., 2018) is developed for large-scale sparse matrices and performs better than conventional PCA algorithms. Randomized SVD has shown its superiority in recent network embedding methods such as NetSMF, ProNE, and LightNE. Over the past few years, randomized matrix approximation for streaming data has been extensively studied (Tropp et al., 2017). Several single-pass SVD algorithms (Yu et al., 2017; Tropp et al., 2017; Boutsidis et al., 2016; Tropp et al., 2019) with different performance characteristics have been introduced; they fit the streaming data scenario and only need to visit the target matrix once. The efficiency and memory limitations of NetMF can be addressed by randomized matrix factorization.

## 3. Preliminaries

The Matlab conventions are used for specifying row/column indices of a matrix and some operations on sparse matrices. The list of notations used in this paper can be found in Table 1. The problem of network embedding is as follows: given an undirected network $G = (V, E, A)$ with $V$ as the vertex set of $n$ vertices, $E$ as the edge set of $m$ edges, and $A$ as the adjacency matrix, the goal is to learn an embedding matrix with $n$ rows and $d$ columns whose $i$-th row captures the structural property of the $i$-th vertex. The embedding matrix can be fed into downstream applications such as multi-label vertex classification and link prediction.

NetMF. One of the important network embedding methods based on matrix factorization is NetMF (Qiu et al., 2018). Formally, for an unweighted, undirected graph $G$, $A$ is the adjacency matrix: the entry of $A$ corresponding to each edge is $1$, and the rest are $0$. $D$ is the diagonal degree matrix, whose $i$-th diagonal entry contains the degree of the $i$-th vertex. The main contribution of NetMF is to prove that DeepWalk is approximately factorizing the matrix

$$M' = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} M\right) = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} \sum_{r=1}^{T} \left(D^{-1}A\right)^{r} D^{-1}\right), \tag{1}$$

where $T$ is the length of random walks, $b$ is the number of negative samples, and $\operatorname{vol}(G)$ is the total number of edges in $G$. $\operatorname{trunc\_log}^{\circ}$ is the element-wise truncated logarithm function ($\operatorname{trunc\_log}(x) = \max(0, \log x)$), and another element-wise function with a similar effect can optionally be used in its place (Qiu et al., 2019). NetMF also shows that LINE approximately factorizes a matrix of the same form but with $T = 1$. The bottleneck of factorizing the matrix in Eq. (1) is that $M$ tends to become dense as $T$ increases, and constructing the matrix is cost-prohibitive because of the enormous memory requirement. In addition, computing the matrix powers in Eq. (1) involves dense matrix multiplication, which costs $O(n^3)$ time. To reduce the time cost of Eq. (1) for large $T$ (e.g., the default setting in DeepWalk), NetMF proposes to perform the truncated eigen-decomposition $D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \approx U_k \Lambda_k U_k^{\top}$ and approximates Eq. (1) by

$$\tilde{M}' \approx \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} \tilde{M}\right) = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} D^{-\frac{1}{2}} U_k \left(\sum_{r=1}^{T} \Lambda_k^{r}\right) U_k^{\top} D^{-\frac{1}{2}}\right). \tag{2}$$

However, the approximated matrix $\tilde{M}'$ is still dense, making it impossible to construct and factorize for large networks. It is worth noting that the truncated logarithm is very important to embedding quality and cannot be omitted; otherwise, there would be a shortcut to the factorization without constructing the dense matrix, similar to NPR (Yang et al., 2020) and RandNE (Zhang et al., 2018).
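As a small illustration of Eq. (1) and Eq. (2), the following sketch builds the NetMF matrix for a toy graph both directly and through the eigen-decomposition of $D^{-1/2} A D^{-1/2}$. It is a minimal example, not the authors' implementation: the function names, the toy graph, and the choice of keeping the $k$ eigenvalues of largest magnitude are assumptions, and $\operatorname{vol}(G)$ is taken to be the sum of all entries of $A$ as in the NetMF paper. With $k = n$ the two forms coincide, which is what the final check relies on.

```python
import numpy as np

def trunc_log(x):
    # max(0, log x), written as log(max(x, 1)) so zeros never reach the logarithm.
    return np.log(np.maximum(x, 1.0))

def netmf_direct(A, T, b):
    # Eq. (1): trunc_log( vol(G)/(bT) * sum_{r=1..T} (D^{-1}A)^r D^{-1} ), built densely.
    deg = A.sum(axis=1)
    vol = A.sum()                      # assumption: vol(G) = sum of all entries of A
    P = A / deg[:, None]               # D^{-1} A
    power = np.eye(A.shape[0])
    acc = np.zeros_like(A, dtype=float)
    for _ in range(T):
        power = power @ P
        acc += power
    M = acc / deg[None, :]             # right-multiply by D^{-1}
    return trunc_log(vol / (b * T) * M)

def netmf_spectral(A, T, b, k):
    # Eq. (2): the same matrix expressed through k eigenpairs of D^{-1/2} A D^{-1/2}.
    deg = A.sum(axis=1)
    vol = A.sum()
    d_isqrt = 1.0 / np.sqrt(deg)
    N = d_isqrt[:, None] * A * d_isqrt[None, :]
    lam, U = np.linalg.eigh(N)
    keep = np.argsort(-np.abs(lam))[:k]        # assumption: keep largest-magnitude eigenvalues
    lam, U = lam[keep], U[:, keep]
    filt = sum(lam ** r for r in range(1, T + 1))   # sum_{r=1..T} lambda^r per eigenvalue
    left = d_isqrt[:, None] * U                     # D^{-1/2} U_k
    M = (left * filt[None, :]) @ left.T
    return trunc_log(vol / (b * T) * M)

# Toy undirected graph on 5 vertices (hypothetical example data).
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]], dtype=float)
M_direct = netmf_direct(A, T=3, b=1)
M_full = netmf_spectral(A, T=3, b=1, k=5)    # k = n: no truncation
M_rank2 = netmf_spectral(A, T=3, b=1, k=2)   # truncated variant, still an n x n dense matrix
```

Even the truncated variant returns a dense $n \times n$ array, which is the memory bottleneck the rest of the paper works to avoid.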

ProNE. ProNE (Zhang et al., 2019) proposes to perform truncated singular value decomposition (SVD) on a sparse matrix whose entries are defined in closed form, which produces an initial embedding matrix. ProNE then improves the embedding performance with a spectral propagation strategy: the initial embedding is multiplied by a polynomial in the normalized graph Laplacian matrix, where the polynomial coefficients are those of Chebyshev polynomials and the polynomial degree is the number of spectral propagation steps (the default setting is 10). Because the spectral propagation scheme is universal, we can also use it to enhance our embedding.

The Basic Randomized SVD. Previous work has witnessed the advantages of randomized matrix factorization methods for solving low-rank matrix approximation (Drineas and Mahoney, 2016; Halko et al., 2011; Musco and Musco, 2015) and network embedding problems (Qiu et al., 2021, 2019; Zhang et al., 2019). The basic randomized SVD algorithm can be described as Algorithm 1.

The basic randomized SVD can be used to compute an approximate truncated SVD of an input matrix $X$. In Alg. 1, $\Omega$ is a Gaussian i.i.d. matrix. The randomized low-rank approximation mainly relies on random projection to identify the subspace capturing the dominant actions of the target matrix $X$. This is realized by multiplying $X$ by a random matrix on its right side and orthonormalizing the product to obtain the subspace's orthonormal basis matrix $Q$. With the sketch matrix $B = Q^{\top} X$, we have the approximation $X \approx Q B = Q Q^{\top} X$. The approximate truncated SVD of $X$ can then be obtained by performing SVD on the small matrix $B$. In Alg. 1, steps 3~6 are the power iteration scheme, which improves the accuracy of the approximation. The orthonormalization operation in Alg. 1 alleviates round-off error in floating-point computation; QR factorization is a usual choice for it.
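The description above follows the standard randomized SVD of Halko et al. (2011). A minimal NumPy sketch is given below. Since Algorithm 1 itself is not reproduced in the text, the function name, the parameter names (`oversample`, `power_iters`), and the exact ordering of steps are assumptions, and QR factorization is used for orthonormalization.

```python
import numpy as np

def basic_randomized_svd(X, k, oversample=10, power_iters=2, seed=0):
    """Approximate rank-k SVD of X via random projection (sketch, not the paper's Alg. 1)."""
    rng = np.random.default_rng(seed)
    # Gaussian i.i.d. test matrix with k + oversample columns.
    Omega = rng.standard_normal((X.shape[1], k + oversample))
    # Orthonormal basis for the range of X @ Omega.
    Q, _ = np.linalg.qr(X @ Omega)
    # Power iteration with re-orthonormalization to limit round-off error.
    for _ in range(power_iters):
        G, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ G)
    # Small sketch matrix; X is approximated by Q @ B.
    B = Q.T @ X
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Usage on an exactly rank-5 matrix (hypothetical test data).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
U, s, Vt = basic_randomized_svd(X, k=5)
```

For a matrix that is exactly rank $k$, the recovered factors reproduce the input up to floating-point error; for matrices with slowly decaying spectra the power iterations matter much more.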

## 4. The NetMF+ Based on Randomized Matrix Factorization

In this section, we develop NetMF+, a network embedding algorithm based on NetMF and fast randomized matrix factorization, which approximates Eq. (1). First, we propose a fast randomized eigen-decomposition algorithm to generate $U_k \Lambda_k U_k^{\top}$, which approximates the modified Laplacian matrix. Second, we compute the embedding with a sparse-sign randomized single-pass SVD algorithm, which avoids constructing the dense matrix in memory. Finally, spectral propagation is used to improve the embedding quality, and the Graph Based Benchmark Suite (GBBS) (Dhulipala et al., 2018) is adopted as the framework to reduce the peak memory cost.

### 4.1. Fast Randomized Eigen-Decomposition

We first introduce the fast randomized eigen-decomposition algorithm. The one-sided approximation $X \approx Q Q^{\top} X$ from Alg. 1 is not suitable for eigen-decomposition. According to (Halko et al., 2011), for a symmetric matrix $X$ the approximation should take the symmetric form $X \approx Q Q^{\top} X Q Q^{\top}$, and the truncated eigen-decomposition of $X$ can be derived by performing eigen-decomposition on the small matrix $Q^{\top} X Q$. Combining the power iteration scheme, the symmetric approximation formula, and the acceleration strategy of (Feng et al., 2018), fast randomized eigen-decomposition can be described as Algorithm 2.

In Alg. 2, "eigSVD" is used as the orthonormalization operation. According to (Feng et al., 2018), "eigSVD" is much faster than QR factorization, especially for tall and thin matrices. Its shortcoming is that numerical issues may arise if the matrix input to "eigSVD" does not have full column rank. Fortunately, $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is a symmetric matrix whose eigenvalues lie in the range $[-1, 1]$ according to (Qiu et al., 2018). Therefore, the orthonormalization operation in step 3 and step 5 is stable. Following the analysis in (Feng et al., 2018), the costs of steps 2~3, steps 4~6, step 7, step 8, and step 9 of Alg. 2 can be expressed in terms of the number of vertices $n$ in $G$, the number of edges $m$, and the small algorithm parameters, and the resulting total time cost is consistent with that of eigsh (Lehoucq et al., 1998). However, benefiting from the high parallelism of randomized matrix algorithms, Alg. 2 shows up to a 93X speedup over eigsh with almost the same result in the experiments.
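The following sketch shows one way the pieces named above can fit together: an eigSVD-style orthonormalization (eigen-decomposition of the small Gram matrix), power iteration, and the symmetric approximation $Q Q^{\top} X Q Q^{\top}$. It is an illustration under assumptions, not a transcription of Algorithm 2: the function names, the default parameters, and the step order are hypothetical, and the test matrix is synthetic with a clear spectral gap.

```python
import numpy as np

def eig_svd(X):
    # Orthonormalize a tall, full-column-rank matrix through its small Gram matrix:
    # X^T X = V diag(w) V^T, so X V diag(w)^{-1/2} has orthonormal columns.
    w, V = np.linalg.eigh(X.T @ X)
    s = np.sqrt(w)                       # breaks down if X is rank deficient
    return (X @ V) / s, s, V

def randomized_eigh(A_sym, k, oversample=10, power_iters=2, seed=0):
    rng = np.random.default_rng(seed)
    n = A_sym.shape[0]
    Omega = rng.standard_normal((n, k + oversample))
    Q, _, _ = eig_svd(A_sym @ Omega)             # random projection
    for _ in range(power_iters):                 # power iteration
        Q, _, _ = eig_svd(A_sym @ (A_sym @ Q))
    S = Q.T @ A_sym @ Q                          # small symmetric matrix
    S = (S + S.T) / 2                            # remove round-off asymmetry
    lam, W = np.linalg.eigh(S)
    keep = np.argsort(-np.abs(lam))[:k]          # assumption: largest-magnitude eigenvalues
    return Q @ W[:, keep], lam[keep]

# Synthetic symmetric matrix with eigenvalues in [-1, 1] and five dominant ones.
rng = np.random.default_rng(1)
n = 200
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate(([1.0, -0.9, 0.8, -0.7, 0.6],
                           rng.uniform(-0.05, 0.05, n - 5)))
A_sym = (V * spectrum) @ V.T
U_k, lam_k = randomized_eigh(A_sym, k=5)
```

The example uses a dense matrix for brevity; in the setting of the paper the products with `A_sym` would be sparse-dense multiplications.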

For a large-scale matrix $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, the eigenvalues decay extremely slowly and eigsh converges with difficulty. Randomized algorithms face the same dilemma: their approximation error is unsatisfactory for such a matrix (Halko et al., 2011). To alleviate this accuracy problem, we propose to perform Alg. 2 on a modified Laplacian matrix $D^{-\alpha} A D^{-\alpha}$ instead, where $\alpha$ is a parameter. Therefore, we have $U_k \Lambda_k U_k^{\top} \approx D^{-\alpha} A D^{-\alpha}$. It means that $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is computed approximately as $D^{-\frac{1}{2}+\alpha} U_k \Lambda_k U_k^{\top} D^{-\frac{1}{2}+\alpha}$. The approximation error has an upper bound described by the following theorem, whose proof can be found in the Appendix.

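###### Theorem 1.

Let $\|\cdot\|_F$ be the matrix Frobenius norm. Then

$$\left\| D^{-\frac{1}{2}} A D^{-\frac{1}{2}} - D^{-\frac{1}{2}+\alpha} U_k \Lambda_k U_k^{\top} D^{-\frac{1}{2}+\alpha} \right\|_F \le (1+\varepsilon) \sqrt{\sum_{j=k+1}^{n} |\lambda_j|^2}$$

with high probability. Here $|\lambda_j|$ is the $j$-th largest absolute value of the eigenvalues of the modified Laplacian matrix $D^{-\alpha} A D^{-\alpha}$.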

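Next, the matrix $M$ is approximated by $\hat{M}$:

$$\hat{M} = D^{-1+\alpha} U_k \Lambda_k \left(\sum_{r=1}^{T} K^{r-1}\right) U_k^{\top} D^{-1+\alpha}, \tag{3}$$

where $K = U_k^{\top} D^{-1+2\alpha} U_k \Lambda_k$ is a $k \times k$ matrix, so computing its powers is cheap. For convenience, we denote $F = D^{-1+\alpha} U_k$ and $C = \Lambda_k \sum_{r=1}^{T} K^{r-1}$, so that $\hat{M} = F C F^{\top}$. Then Eq. (1) can be approximated by

$$\hat{M}' = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} \hat{M}\right) = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} F C F^{\top}\right). \tag{4}$$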
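To make the factored form in Eq. (3) concrete, the sketch below builds an $n \times k$ factor and a $k \times k$ core from the eigen-decomposition of $D^{-\alpha} A D^{-\alpha}$ and checks, on a toy graph with no truncation ($k = n$), that their product equals $\sum_{r=1}^{T} (D^{-1}A)^{r} D^{-1}$. The definitions used in the code, $K = U_k^{\top} D^{-1+2\alpha} U_k \Lambda_k$, $F = D^{-1+\alpha} U_k$, and $C = \Lambda_k \sum_{r} K^{r-1}$, are derived here from Eq. (3) and should be read as assumptions about the notation; the function name and the value $\alpha = 0.4$ are hypothetical.

```python
import numpy as np

def build_F_C(A, k, T, alpha):
    deg = A.sum(axis=1)
    # Modified Laplacian D^{-alpha} A D^{-alpha} and its k dominant eigenpairs.
    scale = deg ** (-alpha)
    L_mod = scale[:, None] * A * scale[None, :]
    lam, U = np.linalg.eigh(L_mod)
    keep = np.argsort(-np.abs(lam))[:k]
    lam, U = lam[keep], U[:, keep]
    # K = U^T D^{-1+2 alpha} U Lambda is only k x k, so its powers are cheap.
    K = (U.T * deg ** (-1 + 2 * alpha)) @ U @ np.diag(lam)
    power_sum = np.zeros((k, k))
    K_power = np.eye(k)
    for _ in range(T):                   # accumulates K^0 + K^1 + ... + K^{T-1}
        power_sum += K_power
        K_power = K_power @ K
    C = np.diag(lam) @ power_sum
    F = (deg ** (-1 + alpha))[:, None] * U
    return F, C

# Toy undirected graph on 5 vertices (hypothetical example data).
A = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]], dtype=float)
T = 3
F, C = build_F_C(A, k=5, T=T, alpha=0.4)
M_hat = F @ C @ F.T

# Reference: sum_{r=1..T} (D^{-1} A)^r D^{-1}, built densely.
deg = A.sum(axis=1)
P = A / deg[:, None]
M_ref = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / deg[None, :]
```

Only `F` and `C` need to be stored for large graphs; the dense product `F @ C @ F.T` is formed here solely to verify the identity.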

### 4.2. Sparse-Sign Randomized Single-Pass SVD

In this subsection, we introduce how to generate the embedding without constructing the dense matrix $\hat{M}'$ once we have $F$ and $C$. We first define the sparse-sign matrix, another type of randomized dimension reduction map whose performance is similar to that of the Gaussian matrix (Tropp et al., 2019). Consider a sparse-sign matrix $S \in \mathbb{R}^{n \times h}$, where $h = d + s$ and $s$ is the oversampling parameter. We fix a column sparsity parameter $z$ in the range $2 \le z \le n$ and independently generate the columns of the matrix at random. For each column, we draw $z$ i.i.d. random signs and place them in $z$ uniformly random coordinates. According to (Tropp et al., 2019), $z = \min(n, 8)$ will usually be a good choice. Algorithm 3 describes how to generate the sparse-sign matrix.
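The construction just described can be sketched in a few lines. This is an illustration rather than Algorithm 3 itself: the function name and the argument names `n`, `h`, and `z` (matrix height, number of columns, nonzeros per column) are assumptions, and the matrix is stored densely only to keep the example short.

```python
import numpy as np

def sparse_sign_matrix(n, h, z, rng):
    """n x h matrix whose columns each hold z random signs in z distinct random rows."""
    S = np.zeros((n, h))
    for j in range(h):
        rows = rng.choice(n, size=z, replace=False)   # uniformly random coordinates
        S[rows, j] = rng.choice([-1.0, 1.0], size=z)  # i.i.d. random signs
    return S

rng = np.random.default_rng(0)
S = sparse_sign_matrix(n=100, h=12, z=8, rng=rng)
```

A practical implementation would keep only the row indices and signs of the nonzeros, since the matrix has just `z * h` nonzero entries in total.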

After generating the sparse-sign matrix $S$, we use it as the random matrix that multiplies $\hat{M}'$ on its right side to obtain the sketch matrix $Y = \hat{M}' S$. Since a column of the sparse-sign matrix has only $z$ nonzeros, the nonzeros of $S$ occupy at most $zh$ distinct row coordinates in the range $[1, n]$. Therefore, we can perform row sampling according to these unique coordinates $p$. Let $v = |p|$, which satisfies $v \le zh$. One can immediately observe that we obtain a sampling matrix $\Psi = S(p, :) \in \mathbb{R}^{v \times h}$. Therefore, the computation is accomplished by

$$Y = \operatorname{trunc\_log}^{\circ}\left(\frac{\operatorname{vol}(G)}{bT} F C F(p,:)^{\top}\right) \Psi. \tag{5}$$

Evaluating Eq. (5) in one shot materializes an $n \times v$ intermediate matrix, whose memory cost is still a disaster for a network with billions of vertices. To solve this, we adopt batched matrix-matrix multiplication, selecting fixed-size blocks of rows of $F$ in turn to complete the multiplication, so that only one block of the intermediate matrix is held in memory at a time. After orthogonalizing $Y$, we have an orthonormal basis matrix $Q$ that contains the information of $\hat{M}'$.
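The row-sampling and batching idea of Eq. (5) can be illustrated as follows. The sketch assumes that the sketch matrix is the product of the truncated-log matrix with a sparse-sign matrix `S`, that `p` collects the rows of `S` containing a nonzero, and that `Psi = S[p, :]`; the function name, `batch_size`, the scalar `scale` (standing in for $\operatorname{vol}(G)/(bT)$), and the random test data are all hypothetical.

```python
import numpy as np

def trunc_log(x):
    return np.log(np.maximum(x, 1.0))

def sketch_Y(F, C, S, scale, batch_size):
    # Rows of S that hold at least one nonzero: only these columns of the
    # dense matrix ever contribute to the product with S.
    p = np.flatnonzero(np.abs(S).sum(axis=1))
    Psi = S[p, :]
    CFp = C @ F[p, :].T                           # k x v, shared by all batches
    n = F.shape[0]
    Y = np.empty((n, S.shape[1]))
    for start in range(0, n, batch_size):
        rows = slice(start, start + batch_size)
        block = trunc_log(scale * (F[rows] @ CFp))  # batch_size x v dense block
        Y[rows] = block @ Psi
    return Y

# Hypothetical small instance.
rng = np.random.default_rng(0)
n, k, h, z = 40, 4, 6, 3
F = rng.standard_normal((n, k))
C = rng.standard_normal((k, k))
C = (C + C.T) / 2
S = np.zeros((n, h))
for j in range(h):
    S[rng.choice(n, size=z, replace=False), j] = rng.choice([-1.0, 1.0], size=z)

Y = sketch_Y(F, C, S, scale=5.0, batch_size=16)
Y_dense = trunc_log(5.0 * (F @ C @ F.T)) @ S      # reference that forms the n x n matrix
```

The batched version never holds more than a `batch_size` by `v` block of the dense matrix, whereas the reference line builds all $n^2$ entries.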

There is still a dilemma: we cannot generate the sketch $Q^{\top} \hat{M}'$ as in Alg. 1, because $\hat{M}'$ is a dense matrix that is never constructed. To generate the embedding, we leverage a randomized single-pass SVD algorithm (Tropp et al., 2019). The randomized single-pass SVD algorithm draws multiple sketch matrices that capture the dominant row and column actions of the target matrix and computes the SVD based on these sketch matrices. In (Tropp et al., 2019), four random matrices are drawn for the target matrix, and three sketch matrices $X$, $Y$, and $Z$ are generated from them. In our case, $\hat{M}'$ is symmetric, which means that $X$ and $Y$ can be the same and $Z$ can be computed with one random matrix. In brief, we need two random matrices to generate two sketch matrices $Y$ and $Z$. From $Y$ we obtain the orthonormal basis $Q$. We then get the core approximation matrix by solving least-squares problems. Finally, a low-rank approximation of the target matrix is formed by multiplying the core matrix by $Q$ on the left and $Q^{\top}$ on the right, and the approximate truncated SVD of $\hat{M}'$ is derived by performing SVD on the core matrix.

With the sparse-sign matrix as the random matrix, the sparse-sign randomized single-pass SVD algorithm for $\hat{M}'$ can be described as Algorithm 4.

In Alg. 4, step 4 uses "eigSVD" from Alg. 2 to orthonormalize the sketch matrix $Y$, and steps 2 and 5 generate the two random matrices. Because the sketch parameters are small numbers, the time cost of Alg. 4 is determined by the sketch computations in steps 3 and 6, the orthonormalization in step 4, and the small dense computations in steps 7~8 and step 9.
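A compact illustration of a symmetric single-pass scheme in the spirit of (Tropp et al., 2019) is sketched below. It is not Algorithm 4: the exact sketches, the least-squares formulation, the function name, and the use of Gaussian rather than sparse-sign test matrices are assumptions made to keep the example short. The target matrix is touched only when forming the two sketches `Y` and `Z`; everything afterwards works on small matrices.

```python
import numpy as np

def single_pass_symmetric(Y, Z, S2, d):
    """Rank-d factors of a symmetric M from Y = M @ S1 and Z = S2.T @ M @ S2."""
    Q, _ = np.linalg.qr(Y)                     # orthonormal basis for range(Y)
    W = S2.T @ Q                               # h2 x h1, tall
    # Core matrix: least-squares solution of W @ core @ W.T = Z,
    # i.e. core = pinv(W) @ Z @ pinv(W).T, obtained with two lstsq solves.
    half = np.linalg.lstsq(W, Z, rcond=None)[0]
    core = np.linalg.lstsq(W, half.T, rcond=None)[0].T
    Uc, sig, Vct = np.linalg.svd(core)
    return Q @ Uc[:, :d], sig[:d], Q @ Vct[:d].T

# Hypothetical symmetric rank-6 target and Gaussian test matrices.
rng = np.random.default_rng(0)
n, r, h1, h2 = 80, 6, 10, 21
G = rng.standard_normal((n, r))
M = (G * np.array([3.0, -2.5, 2.0, 1.5, -1.0, 0.5])) @ G.T
S1 = rng.standard_normal((n, h1))
S2 = rng.standard_normal((n, h2))
Y = M @ S1                                     # range sketch
Z = S2.T @ M @ S2                              # core sketch
U, sig, V = single_pass_symmetric(Y, Z, S2, d=r)
embedding = U * np.sqrt(sig)                   # NetMF-style embedding: U_d * sqrt(Sigma_d)
```

In this example the second sketch size `h2` is chosen larger than `h1`, so that the least-squares problems for the core matrix are overdetermined.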

### 4.3. The NetMF+ Algorithm

In this section, we formally describe the NetMF+ algorithm as Algorithm 5, which consists of three stages: fast randomized eigen-decomposition, sparse-sign randomized single-pass SVD, and spectral propagation. In Alg. 5, step 1 computes the fast randomized eigen-decomposition of the modified Laplacian matrix, and steps 2~3 form the matrices $F$ and $C$ for Eq. (4). Step 4 computes the SVD of $\hat{M}'$ from $F$ and $C$ through sparse-sign randomized single-pass SVD. Step 6 can be used to improve the embedding quality.

Complexity Analysis. We now analyze the time and space complexity of NetMF+ step by step. In step 1, the input matrix is still an $n \times n$ sparse matrix with $O(m)$ nonzeros, and its time and space cost follows the analysis in Section 4.1. Steps 2~3 only involve the factor matrices of Eq. (4). The time and space cost of step 4 is that of Alg. 4. Steps 5~6 account for the remaining time and space. The total time and space cost of NetMF+ is the sum over these steps.

### 4.4. System Design

Figure 1 is the overview of the system design of NetMF+.

Each step of NetMF+ can be parallelized, and the storage overhead mainly depends on the sparse adjacency matrix and the temporary matrices. Below we discuss the detailed system design of each step.

Compression. First, the system is built on the Graph Based Benchmark Suite (GBBS) (Dhulipala et al., 2018), which is an extension of the Ligra (Shun and Blelloch, 2013) interface. GBBS is easy to use and has already shown its practicality for many large-scale fundamental graph problems. LightNE (Qiu et al., 2021) introduced GBBS to network embedding problems and showed its superiority on real-world networks with hundreds of billions of edges. The main benefit of GBBS to NetMF+ is data compression. A sparse adjacency matrix is usually stored in CSR format, which is normally regarded as an excellent compressed graph representation (Kepner and Gilbert, 2011). However, the CSR format still incurs a huge memory overhead for networks with hundreds of billions of edges. For example, storing a network with 1 billion vertices and 100 billion edges costs 1121 GB of memory, which is unsatisfactory. Therefore, we need to compress the graph further to reduce the memory cost. A compressed CSR format for graphs from Ligra+ (Shun et al., 2015), which supports fast parallel graph encoding and decoding, is a good solution.
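The 1121 GB figure can be reproduced with simple arithmetic under one particular set of assumptions about the CSR layout. The text does not state the byte widths, so the values below (4-byte column indices, 8-byte edge values, 4-byte row offsets, one stored entry per edge, and sizes reported in units of $2^{30}$ bytes) are only a guess that happens to match the quoted number.

```python
def csr_bytes(num_vertices, num_stored_entries,
              index_bytes=4, value_bytes=8, offset_bytes=4):
    # CSR keeps one column index and one value per stored entry,
    # plus one row offset per vertex (and a final sentinel offset).
    entries = num_stored_entries * (index_bytes + value_bytes)
    offsets = (num_vertices + 1) * offset_bytes
    return entries + offsets

# 1 billion vertices and 100 billion stored entries (assumed byte widths).
csr_gib = csr_bytes(10**9, 100 * 10**9) / 2**30
```

Under these assumptions `csr_gib` evaluates to roughly 1121.3, dominated almost entirely by the per-edge storage, which is why compressing the neighbor lists pays off.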

Parallelization. Second, the main time cost of NetMF+ comes from sparse matrix-matrix multiplication (SPMM) and dense matrix-matrix multiplication (GEMM). Step 1 of the NetMF+ system performs fast randomized eigen-decomposition on the modified Laplacian matrix. The random projection and the power iteration process in Alg. 2 are, in essence, products of an $n \times n$ sparse matrix and a dense matrix with $n$ rows (initially a Gaussian random matrix), which is supported by MKL's Sparse BLAS routines. However, these routines require the sparse matrix in CSR format as input, which contradicts the original intention of using GBBS. Fortunately, GBBS supports traversing all neighbors of a vertex in the compressed CSR format, so we can build an SPMM operation on top of GBBS. It is computed as follows. First, we traverse the $n$ vertices in parallel and, for each vertex $u$, traverse every neighbor $v$ to obtain the element of the sparse matrix corresponding to the edge $(u, v)$. Then, with the support of cblas_saxpy in MKL, we multiply this element with the $v$-th row of the row-major dense matrix and add the result to the $u$-th row of the target matrix. The GBBS-based SPMM operation is slightly slower than MKL's SPMM operation, but it ensures memory efficiency. "eigSVD", "eig", and other operations are well supported by the Intel MKL BLAS and LAPACK routines. Step 3 of the NetMF+ system also involves repeated SPMM between an $n \times n$ sparse Laplacian matrix and a dense temporary matrix, which MKL's Sparse BLAS routines support. Step 2 of the system involves matrix row sampling and batched GEMM: we need to compute Eq. (5) for one batch of row indices of $F$ at a time. For each column of $\Psi$, we obtain its nonzeros, and the corresponding columns of the batch's intermediate matrix are combined by element-wise multiplication and summation to complete the matrix multiplication. This process is easy to implement with a for loop under OpenMP (Dagum and Menon, 1998).
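The neighbor-traversal SPMM described above can be sketched sequentially in Python. The real system is written in C++ on GBBS with parallel loops and `cblas_saxpy`; here plain CSR arrays stand in for the compressed graph, the outer loop is sequential, and the function and variable names are hypothetical.

```python
import numpy as np

def dense_to_csr(A):
    # Helper for the example only: plain CSR arrays from a dense matrix.
    rows, cols = np.nonzero(A)                 # row-major order, so sorted by row
    indptr = np.zeros(A.shape[0] + 1, dtype=np.int64)
    np.add.at(indptr, rows + 1, 1)             # count entries per row
    return np.cumsum(indptr), cols, A[rows, cols]

def spmm_by_neighbors(indptr, indices, values, X):
    n = len(indptr) - 1
    out = np.zeros((n, X.shape[1]))
    for u in range(n):                         # parallel over vertices in the real system
        for e in range(indptr[u], indptr[u + 1]):
            v = indices[e]                     # neighbor of u
            out[u] += values[e] * X[v]         # axpy: out[u] += a_uv * X[v, :]
    return out

# Hypothetical sparse matrix and dense right-hand side.
rng = np.random.default_rng(0)
A = (rng.random((30, 30)) < 0.1) * rng.random((30, 30))
X = rng.standard_normal((30, 4))
indptr, indices, values = dense_to_csr(A)
result = spmm_by_neighbors(indptr, indices, values, X)
```

Each output row depends only on the neighbor list of one vertex, so rows can be computed independently, which is what makes the per-vertex parallel loop possible.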

In conclusion, NetMF+ is implemented in C++. We use GBBS to reduce memory usage and implement a GBBS-based SPMM operation. For efficiency, we use the Intel MKL library for basic linear algebra operations, and we apply OpenMP for parallel programming, especially in step 4 and step 7 of Alg. 4.

## 5. Experiments

In this section, we evaluate the proposed NetMF+ method on the multi-label vertex classification task and link prediction task, which have been commonly used to evaluate previous network embedding techniques (Zhang et al., 2019; Qiu et al., 2019; Perozzi et al., 2014; Qiu et al., 2021). We introduce our datasets, experimental setup, and results in Section 5.1, Section 5.2, and Section 5.3, respectively. The ablation study is in Section 5.4.

### 5.1. Datasets

The statistics of the datasets are listed in Table 2. We employ five datasets for the multi-label vertex classification task. BlogCatalog (Tang and Liu, 2009a) and YouTube (Tang and Liu, 2009b) are small graphs with fewer than 10M edges, while the others are large graphs with more than 10M but fewer than 10B edges. For the link prediction task, we have four datasets in which vertex labels are not available. Livejournal (Leskovec et al., 2009) is a large graph, while the others are very large graphs with more than 10B edges. These datasets are of different scales but have been widely used in the network embedding literature (Qiu et al., 2021). The two largest networks are hyperlink networks extracted from the Common Crawl web corpus, containing billions of edges.

### 5.2. Experimental Setup

For NetMF+, all experiments are conducted on a server with two Intel® Xeon® E5-2699 v4 CPUs (88 virtual cores in total) and 1.5TB memory. LightNE (Qiu et al., 2021) is essentially an enhanced version of NetSMF (Qiu et al., 2019) and is the most important method to compare against, so we make a full comparison between NetMF+ and LightNE and follow the experiment and evaluation procedures used in LightNE. For PyTorch-BigGraph (PBG) (Lerer et al., 2019), we compare with NetMF+ on Livejournal, which is an example dataset from the original paper (Lerer et al., 2019). We compare NetMF+ with GraphVite (Zhu et al., 2019), ProNE (Zhang et al., 2019), and LightNE on the vertex classification task. We use the evaluation scripts and hyperparameters provided by the GitHub repositories of the corresponding papers. For ProNE, we use the high-performance version released in the LightNE GitHub repository. Considering that GraphVite and PBG are developed for different scenarios, we evaluate them and NetMF+ based on the cost of cloud server rental. We list the most suitable Azure Cloud server for each system in Table 3.

Vertex Classification Setting. After generating the embedding, we randomly sample a portion of labeled vertices for training and use the remaining ones for testing. The task is completed by a one-vs-rest logistic regression model implemented with LIBLINEAR (Fan et al., 2008). We repeat the prediction procedure five times and evaluate the average performance in terms of both Micro-F1 and Macro-F1 scores (Tsoumakas et al., 2009). Across all datasets, we set the embedding dimension to 128, except on Friendster, where we set it to 96. We recommend a small oversampling parameter such as 10, 30, or 50. Most NetMF+ parameters are kept fixed across datasets. The sketch-size parameters are set following (Tropp et al., 2019), which indicates that the second sketch should be considerably larger than the first for a better result. The parameters to which the experimental results are most sensitive are selected by cross-validation.
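For reference, the two scores can be computed from binary label matrices as in the sketch below. This is a generic illustration of Micro-F1 and Macro-F1 for multi-label prediction, not the evaluation script used in the experiments; the function name, the convention of scoring a label with no positives and no predictions as 0, and the tiny example data are assumptions.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: (num_vertices, num_labels) arrays of 0/1."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0).astype(float)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0).astype(float)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0).astype(float)
    # Micro-F1 pools the counts over all labels before computing F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro-F1 averages the per-label F1 scores with equal weight.
    denom = 2 * tp + fp + fn
    per_label = np.where(denom > 0, 2 * tp / np.maximum(denom, 1.0), 0.0)
    return micro, per_label.mean()

y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
```

In this toy example the per-label F1 scores are 1, 2/3, and 2/3, so the macro score is their mean, while the micro score is computed from the pooled counts (4 true positives, 1 false positive, 1 false negative).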

Link Prediction Setting. We follow LightNE to set up the link prediction evaluation for very large graphs: we randomly exclude 0.00001% of the edges from the training graph for evaluation. When training NetMF+ on the three very large graphs, we skip the spectral propagation step (due to memory constraints) and use the same parameter setting for all of them except Hyperlink2012, which uses its own setting. The ranking metrics on the test set are obtained by ranking positive edges among randomly sampled corrupted edges. We evaluate the link prediction task with HITS@1, HITS@10, HITS@50, and AUC.
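The ranking metrics can be illustrated with a short sketch that takes precomputed scores as input. How edges are scored from the embedding and how corrupted edges are sampled are not specified here, so the function only assumes one score per held-out positive edge and a fixed number of corrupted-edge scores per positive; the function name, the tie handling, and the example numbers are hypothetical.

```python
import numpy as np

def ranking_metrics(pos_scores, neg_scores, ks=(1, 10, 50)):
    """pos_scores: (P,), neg_scores: (P, N) scores of corrupted edges per positive edge."""
    pos = pos_scores[:, None]
    # Rank of each positive edge among its corrupted edges (1 = best); ties count against it.
    ranks = 1 + np.sum(neg_scores >= pos, axis=1)
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    # AUC: probability that a positive edge outscores a corrupted one (ties count half).
    auc = float(np.mean((pos > neg_scores) + 0.5 * (pos == neg_scores)))
    return hits, auc

pos_scores = np.array([0.9, 0.2])
neg_scores = np.array([[0.1, 0.5, 0.3],
                       [0.4, 0.1, 0.3]])
hits, auc = ranking_metrics(pos_scores, neg_scores, ks=(1, 3))
```

Here the first positive edge ranks first and the second ranks third, so HITS@1 is 0.5 and HITS@3 is 1.0, and four of the six positive-versus-corrupted comparisons are won.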

### 5.3. Experiment Results

Comparison with PBG. First, we follow exactly the same experimental setting as PBG (Lerer et al., 2019), with the NetMF+ parameters fixed for this comparison. Combined with the prices in Table 3, the results are reported in Table 4.

NetMF+ clearly achieves better performance on all metrics in less time. In particular, NetMF+ is both faster and cheaper than PBG.

Vertex Classification Results. We summarize the multi-label vertex classification performance in Figure 2. To compare the runtime and memory cost of the different algorithms, we also list the efficiency comparison in Table 5. On BlogCatalog, NetMF+ achieves the best performance among all methods: its Micro-F1 and Macro-F1 are significantly better than those of the second-place method LightNE (by 3.5% on average), while it costs only 2 seconds and 17 GB. On YouTube, NetMF+ shows Micro-F1 and Macro-F1 comparable to LightNE and better results than ProNE. On Friendster-small (Yang and Leskovec, 2015), NetMF+ shows slightly better Micro-F1 than the other methods while achieving comparable Macro-F1. On Friendster (Yang and Leskovec, 2015), NetMF+ is comparable to LightNE and ProNE. On OAG (Sinha et al., 2015), the result of LightNE is produced with a large number of edge samples, and NetMF+ still achieves better performance than LightNE (Micro-F1 improved by 4.2% on average). Overall, NetMF+ has significantly better or comparable classification results compared with other methods at a lightweight memory and time cost. On OAG, NetMF+ is slower than ProNE, but both the Micro-F1 and Macro-F1 of ProNE are inferior, which demonstrates the effectiveness of the NetMF+ algorithm on challenging datasets. In general, the vertex classification results illustrate that NetMF+ is both effective and lightweight.

Next, we compare the rental costs of GraphVite and NetMF+ on large graph data. Based on Table 3 and Table 5, the cost of GraphVite on Friendster-small and Friendster is $23.2 and $168.1, respectively, while the cost of NetMF+ is $1.1 and $3.9, respectively. Therefore, NetMF+ is up to faster and cheaper than GraphVite.
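The cost ratios implied by the four dollar figures above can be checked with a few lines of arithmetic. The dictionary layout is only for illustration, and the snippet covers cost only, since runtimes are not stated in this paragraph.

```python
# Rental costs in US dollars, taken from the paragraph above.
costs = {
    "Friendster-small": {"GraphVite": 23.2, "NetMF+": 1.1},
    "Friendster": {"GraphVite": 168.1, "NetMF+": 3.9},
}

# How many times more GraphVite costs than NetMF+ on each dataset.
ratios = {name: c["GraphVite"] / c["NetMF+"] for name, c in costs.items()}

for name, ratio in ratios.items():
    print(f"{name}: GraphVite costs about {ratio:.1f}x as much as NetMF+")
```

On these figures GraphVite costs roughly 21 times as much as NetMF+ on Friendster-small and roughly 43 times as much on Friendster.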

Link Prediction Results. We also compare the performance of different algorithms on the link prediction task for very large graphs. Due to limited memory, ProNE cannot finish learning an embedding, and GraphVite and PBG cannot finish the job within one day. Therefore, we only report the results of LightNE and NetMF+ in Table 6. The results of LightNE are obtained with a large number of edge samples to get the best possible result on the 1.5 TB memory machine. NetMF+ not only achieves better performance on all metrics but also reduces time and memory costs significantly. The runtime of NetMF+ on ClueWeb (Boldi and Vigna, 2004) is 38.9 minutes in Table 5, and the peak memory is 612 GB, which is less than half of that of LightNE. It is worth mentioning that NetMF+ takes 27 minutes and 1498 GB to learn on ClueWeb if MKL’s Sparse BLAS routine is used in step 1 of the system. The uncompressed ClueWeb graph (in CSR format) occupies 839 GB, while the total peak memory cost of NetMF+ (using GBBS-based SPMM) on ClueWeb is 612 GB, which demonstrates the benefit of the graph compression in GBBS. The experimental results on Hyperlink2014 (Meusel et al., 2015) also demonstrate the effectiveness and efficiency of NetMF+. For the largest network, Hyperlink2012 (Meusel et al., 2015), LightNE can only afford a small number of edge samples, which makes its embedding less effective in the link prediction task than that of NetMF+. This shows that NetMF+ performs better when it uses the same or a similar amount of memory as LightNE. In conclusion, the results on the three very large graphs show that NetMF+ is significantly better than LightNE while using less time and memory.

### 5.4. Ablation Study

We first perform ablation studies on the effects of each step of the NetMF+ system and then focus on the parameter analysis.

Efficiency and Effectiveness of Each Step of NetMF+. First, we focus on the fast randomized eigen-decomposition (step 1 of the NetMF+ system). Halko et al. (Halko et al., 2011) have shown that it is challenging to perform eigen-decomposition on . Although Halko et al. stated that this problem could be alleviated by using the power iteration scheme, it is still impractical for large real-world networks. We choose YouTube as an example, and its related eigenvalues are shown in Figure 3. The eigenvalues of with and are far from the correct result, in which the largest eigenvalue should be 1. eigsh (Lehoucq et al., 1998) is not able to complete this job in one day. The results illustrate the need to use the modified Laplacian matrix as input for the fast randomized eigen-decomposition. When we choose , the eigenvalues computed by the fast randomized eigen-decomposition () on are indistinguishable from the eigenvalues computed by eigsh when . The runtime of the fast randomized eigen-decomposition is 20 seconds while eigsh costs 31 minutes, which demonstrates the accuracy and efficiency of the fast randomized eigen-decomposition.
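To make the idea concrete, here is a minimal dense sketch of a randomized eigen-decomposition with power iteration applied to a matrix of the form D^{-α} A D^{-α} (the form used in the appendix). It follows the generic range-finder recipe of Halko et al. and is not the authors' Alg. 2; the function names, the toy graph, and the default values of `oversample` and `q` are assumptions.

```python
import numpy as np

def modified_laplacian(A, alpha):
    """Return D^{-alpha} A D^{-alpha} for a symmetric adjacency matrix A."""
    d = A.sum(axis=1)
    scale = d ** (-alpha)
    return scale[:, None] * A * scale[None, :]

def randomized_eigh(M, k, oversample=10, q=5, seed=0):
    """Approximate the k eigenpairs of symmetric M with the largest |eigenvalue|."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    width = min(n, k + oversample)
    # Random range finder: sample the column space of M.
    Q, _ = np.linalg.qr(M @ rng.standard_normal((n, width)))
    # Power iteration, re-orthonormalized at every step for stability.
    for _ in range(q):
        Q, _ = np.linalg.qr(M @ Q)
    # Solve the small projected eigenproblem and lift the eigenvectors back.
    B = Q.T @ M @ Q
    lam, V = np.linalg.eigh((B + B.T) / 2)
    order = np.argsort(-np.abs(lam))[:k]
    return lam[order], Q @ V[:, order]

def toy_graph(n=60, p=0.3, seed=1):
    """Random undirected graph plus a ring, so that every vertex has degree >= 2."""
    rng = np.random.default_rng(seed)
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    A = A + A.T
    idx = np.arange(n)
    A[idx, (idx + 1) % n] = 1.0
    A[(idx + 1) % n, idx] = 1.0
    return A

M = modified_laplacian(toy_graph(), alpha=0.5)
approx_eigenvalues, approx_eigenvectors = randomized_eigh(M, k=8)
```

With `alpha=0.5` the matrix is the normalized adjacency matrix, whose largest eigenvalue is 1, which gives a simple sanity check for the approximation. A real system would keep `A` sparse and never form `M` densely.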

Next, we focus on the effects of the sparse-sign randomized single-pass SVD and of spectral propagation. NetMF+ (w/o spectral) and NetMF+ (w/ spectral) denote the embeddings produced by step 2 and step 3 of the NetMF+ system, respectively. Their vertex classification results are shown in Figure 4.

The results of NetMF+ (w/o spectral) are satisfactory, which demonstrates the effectiveness of the sparse-sign randomized single-pass SVD. After adding step 3 of the system, NetMF+ (w/ spectral) shows better results, with an especially significant improvement in Macro-F1, which illustrates the effect of spectral propagation.
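The sketch below illustrates the general shape of a single-pass SVD with sparse-sign test matrices, in the style of the two-sketch methods of Tropp et al. cited in the references. It is not the NetMF+ step 2 itself, which avoids ever forming the target matrix densely. The function names and the parameters `k`, `l` (sketch sizes, with `l >= k`) and `s` (nonzeros per column) are assumptions, and the sign matrices are stored densely only to keep the example short.

```python
import numpy as np

def sparse_sign(rows, cols, s, rng):
    """rows x cols matrix in which every column holds s random +/-1 entries."""
    S = np.zeros((rows, cols))
    for j in range(cols):
        idx = rng.choice(rows, size=s, replace=False)
        S[idx, j] = rng.choice([-1.0, 1.0], size=s)
    return S

def single_pass_svd(A, rank, k, l, s=4, seed=0):
    """Rank-`rank` SVD of A built from two sketches that both touch A only once."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = sparse_sign(k, n, s, rng).T   # n x k, right test matrix
    Psi = sparse_sign(l, m, s, rng)       # l x m, left test matrix
    # Both sketches are linear in A, so they can be accumulated in one pass.
    Y = A @ Omega                         # m x k, captures the column space
    W = Psi @ A                           # l x n, captures the row space
    Q, _ = np.linalg.qr(Y)
    # Recover X ~ Q^T A from the sketches alone by least squares.
    X, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)
    Ux, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return (Q @ Ux)[:, :rank], sig[:rank], Vt[:rank]
```

After `Y` and `W` are formed, the matrix `A` is not needed again, which is what makes the scheme single-pass. When `A` has exact rank at most `k`, the reconstruction is exact up to rounding as long as `Psi @ Q` has full column rank.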

Parameter Analysis. The peak memory cost is constant when we fix the oversampling parameters , column sparsity , eigen-decomposition rank , and embedding dimension . The parameters that strongly impact the quality of the learned embedding are , and . determines how slowly the eigenvalues of the modified Laplacian decay and thus determines the accuracy of the fast randomized eigen-decomposition and the approximation error of . In general, and or will always be appropriate choices. We need to pay attention to , which determines the accuracy of the fast randomized eigen-decomposition. Thus, we make a trade-off between the quality of the learned embedding and the overall system runtime. We choose the OAG dataset for this test. When we fix and enumerate from , the efficiency-effectiveness trade-off on OAG is shown in Figure 5. We also add LightNE results with different numbers of edge samples from . The peak memory of NetMF+ is 384 GB, while that of LightNE is 553 GB, 682 GB, 776 GB, 936 GB, 1118 GB, and 1391 GB, respectively. Figure 5 shows that NetMF+ works better than LightNE when the runtime is fixed. The experiments show that users can adjust NetMF+ flexibly according to time/memory budgets and performance requirements.

## 6. Conclusion

In this work, we propose NetMF+, which combines fast randomized matrix factorization, sparse-sign randomized single-pass SVD, and GBBS into a fast and memory-efficient network embedding system. Compared with state-of-the-art network embedding systems, NetMF+ achieves the best performance on nine benchmarks. With the help of the fast randomized eigen-decomposition and a novel sparse-sign randomized single-pass SVD, NetMF+ consumes much less computational time and memory while generating a promising embedding. With GBBS and GBBS-based SPMM, the memory cost is further reduced. The main computations of NetMF+ are highly parallelizable and are thus well supported by the MKL library and OpenMP. With these techniques, NetMF+ can learn high-quality embeddings from networks with more than 100 billion edges in about an hour with 1 TB or less memory cost.

In the future, we will extend the method to handle dynamic network embedding problems.

## References

• P. Boldi and S. Vigna (2004) The WebGraph framework I: Compression techniques. In WWW ’04, pp. 595–601. Cited by: §5.3.
• C. Boutsidis, D. P. Woodruff, and P. Zhong (2016) Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 236–249. Cited by: §2.
• S. Cao, W. Lu, and Q. Xu (2015) Grarep: learning graph representations with global structural information. In CIKM ’15, pp. 891–900. Cited by: §2.
• L. Dagum and R. Menon (1998) OpenMP: an industry-standard api for shared-memory programming. 5 (1), pp. 46–55. External Links: ISSN 1070-9924 Cited by: §4.4.
• L. Dhulipala, G. E. Blelloch, and J. Shun (2018) Theoretically efficient parallel graph algorithms can be fast and scalable. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 393–404. Cited by: §4.4, §4.
• P. Drineas and M. W. Mahoney (2016) RandNLA: randomized numerical linear algebra. Commun. ACM 59 (6), pp. 80–90. External Links: ISSN 0001-0782 Cited by: §2, §3.
• C. Eckart and G. Young (1936) The approximation of one matrix by another of lower rank. Psychometrika 1 (3), pp. 211–218. Cited by: §2.
• R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, pp. 1871–1874. Cited by: §5.2.
• X. Feng, Y. Xie, M. Song, W. Yu, and J. Tang (2018) Fast randomized pca for sparse data. In Proceedings of The 10th Asian Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 95, pp. 710–725. Cited by: §2, §4.1, §4.1.
• A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD ’16, pp. 855–864. Cited by: §1, §2.
• N. Halko, P. Martinsson, and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53 (2), pp. 217–288. Cited by: Lemma 2, §2, §3, §4.1, §4.1, §5.4.
• W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. IEEE Data(base) Engineering Bulletin 40, pp. 52–74. Cited by: §1, §2.
• R. A. Horn and C. R. Johnson (1994) Topics in matrix analysis. Cambridge University Press. Cited by: Lemma 3.
• J. Kepner and J. Gilbert (2011) Graph algorithms in the language of linear algebra. SIAM. Cited by: §4.4.
• T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR ’17, Cited by: §2.
• R. B. Lehoucq, D. C. Sorensen, and C. Yang (1998) ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted arnoldi methods. SIAM. Cited by: §1, §4.1, §5.4.
• A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich (2019) Pytorch-biggraph: a large-scale graph embedding system. arXiv preprint arXiv:1903.12287. Cited by: §1, §2, §5.2, §5.3.
• J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6 (1), pp. 29–123. Cited by: §5.1.
• R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer (2015) The graph structure in the web–analyzed on different aggregation levels. The Journal of Web Science 1. Cited by: §5.3.
• T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Cited by: §2.
• C. Musco and C. Musco (2015) Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 1396–1404. Cited by: Lemma 2, §3.
• M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu (2016) Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp. 1105–1114. External Links: ISBN 9781450342322 Cited by: §2.
• B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In KDD ’14, pp. 701–710. Cited by: §1, §1, §2, §5.
• J. Qiu, L. Dhulipala, J. Tang, R. Peng, and C. Wang (2021) LightNE: a lightweight graph processing system for network embedding. In Proceedings of the 2021 International Conference on Management of Data, pp. 2281–2289. Cited by: §1, §2, §3, §4.4, §5.1, §5.2, §5.
• J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang (2019) Netsmf: large-scale network embedding as sparse matrix factorization. In The World Wide Web Conference, pp. 1509–1520. Cited by: §1, §2, §3, §3, §5.2, §5.
• J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang (2018) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. In WSDM ’18, pp. 459–467. Cited by: §1, §2, §3, §4.1.
• J. Shun and G. E. Blelloch (2013) Ligra: a lightweight graph processing framework for shared memory. SIGPLAN Not. 48 (8), pp. 135–146. External Links: ISSN 0362-1340 Cited by: §4.4.
• J. Shun, L. Dhulipala, and G. E. Blelloch (2015) Smaller and faster: parallel processing of compressed graphs with ligra+. In 2015 Data Compression Conference, pp. 403–412. Cited by: §4.4.
• A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. (P.) Hsu, and K. Wang (2015) An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, pp. 243–246. External Links: ISBN 9781450334730 Cited by: §5.3.
• J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In WWW ’15, pp. 1067–1077. Cited by: §1, §1, §2.
• L. Tang and H. Liu (2009a) Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 817–826. External Links: ISBN 9781605584959 Cited by: §5.1.
• L. Tang and H. Liu (2009b) Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pp. 1107–1116. External Links: ISBN 9781605585123 Cited by: §5.1.
• L. N. Trefethen and D. Bau III (1997) Numerical linear algebra. Vol. 50, Siam. Cited by: Lemma 1.
• J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2017) Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications 38 (4), pp. 1454–1485. Cited by: §2.
• J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2019) Streaming low-rank matrix approximation with an application to scientific simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. Cited by: §2, §4.2, §4.2, §5.2.
• G. Tsoumakas, I. Katakis, and I. Vlahavas (2009) Mining multi-label data. In Data mining and knowledge discovery handbook, pp. 667–685. Cited by: §5.2.
• P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §2.
• J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee (2018) Billion-scale commodity embedding for e-commerce recommendation in alibaba. In KDD ’18, pp. 839–848. Cited by: §1.
• K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §2.
• J. Yang and J. Leskovec (2015) Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42 (1), pp. 181–213. Cited by: §5.3.
• R. Yang, J. Shi, X. Xiao, Y. Yang, and S. S. Bhowmick (2020) Homogeneous network embedding for massive graphs via reweighted personalized pagerank. Proc. VLDB Endow. 13 (5), pp. 670–683. External Links: ISSN 2150-8097 Cited by: §3.
• R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, pp. 974–983. External Links: ISBN 9781450355520 Cited by: §2.
• W. Yu, Y. Gu, and J. Li (2017) Single-pass pca of large high-dimensional data. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pp. 3350–3356. External Links: ISBN 9780999241103 Cited by: §2.
• J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding (2019) ProNE: fast and scalable network representation learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4278–4284. Cited by: §1, §2, §3, §3, §5.2, §5.
• Z. Zhang, P. Cui, H. Li, X. Wang, and W. Zhu (2018) Billion-scale network embedding with iterative random projection. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 787–796. Cited by: §3.
• Z. Zhu, S. Xu, J. Tang, and M. Qu (2019) GraphVite: a high-performance cpu-gpu hybrid system for node embedding. In The World Wide Web Conference, pp. 2494–2504. Cited by: §1, §2, §5.2.

## Appendix A Appendix

The following lemmas will be useful in our proof.

###### Lemma 1.

(Trefethen and Bau III, 1997) Singular values of a real symmetric matrix are the absolute values of its eigenvalues.
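The statement about singular values and eigenvalues of a real symmetric matrix is easy to confirm numerically. The following check on a random symmetric matrix is only an illustration; the function name and matrix size are arbitrary choices.

```python
import numpy as np

def singular_values_match_abs_eigenvalues(n=6, seed=0):
    """Compare singular values with sorted |eigenvalues| of a random symmetric matrix."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    S = (B + B.T) / 2                                   # real symmetric matrix
    singular_values = np.linalg.svd(S, compute_uv=False)  # returned in decreasing order
    abs_eigenvalues = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
    return bool(np.allclose(singular_values, abs_eigenvalues))
```

This is why the proof can switch freely between the eigenvalues of a symmetric matrix and its singular values when bounding Frobenius norms.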

###### Lemma 2.

(Halko et al., 2011; Musco and Musco, 2015) For an input matrix $X$, Alg. 1 with the PI scheme has the following guarantee:

$$\|X - QQ^\top X\|_F = \|X - U\Sigma V^\top\|_F \le (1+\varepsilon)\,\|X - X_k\|_F,$$

with high probability. Here, $X_k$ is the best rank-$k$ approximation of $X$.

###### Lemma 3.

(Horn et al., 1994) Let be two symmetric matrices. Then for the decreasingly ordered singular values of and , holds for any and .
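For reference, a standard singular-value inequality from Horn and Johnson that fits the way Lemma 3 is applied in Eq. (3) is given below. The symbols $B$ and $C$ for the two $n \times n$ matrices and the index ranges are assumptions, since the statement above does not show them.

```latex
\sigma_{i+j-1}(BC) \;\le\; \sigma_i(B)\,\sigma_j(C),
\qquad 1 \le i, j \le n, \quad i + j \le n + 1 .
```

Taking $j = 1$ gives $\sigma_i(BC) \le \sigma_i(B)\,\sigma_1(C)$, and taking $i = 1$ gives $\sigma_j(BC) \le \sigma_1(B)\,\sigma_j(C)$; these are the two special cases needed to peel a diagonal factor off each side of a product.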

###### Theorem 4.

Let be the matrix Frobenius norm. Then
with a high probability. Here is the -th largest absolute value of eigenvalue of .

###### Proof.

Applying Lemma 1 and Lemma 2, we obtain the following approximation guarantee for Alg. 2:

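$$\left\|D^{-\alpha}AD^{-\alpha} - U_k\Lambda_kU_k^\top\right\|_F \le (1+\varepsilon)\left\|D^{-\alpha}AD^{-\alpha} - \left(D^{-\alpha}AD^{-\alpha}\right)_k\right\|_F = (1+\varepsilon)\sqrt{\sum_{j=k+1}^{n}|\lambda_j|^2}, \tag{1}$$

where $|\lambda_j|$ is the $j$-th largest absolute value of the eigenvalues of $D^{-\alpha}AD^{-\alpha}$.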

Considering the approximation in Thm. 1, we have

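$$
\begin{aligned}
\left\|D^{-\frac{1}{2}}AD^{-\frac{1}{2}} - D^{-\frac{1}{2}+\alpha}U_k\Lambda_kU_k^\top D^{-\frac{1}{2}+\alpha}\right\|_F
&= \left\|D^{-\frac{1}{2}+\alpha}D^{-\alpha}AD^{-\alpha}D^{-\frac{1}{2}+\alpha} - D^{-\frac{1}{2}+\alpha}U_k\Lambda_kU_k^\top D^{-\frac{1}{2}+\alpha}\right\|_F \\
&= \sqrt{\sum_{i=1}^{n}\sigma_i\!\left(D^{-\frac{1}{2}+\alpha}D^{-\alpha}AD^{-\alpha}D^{-\frac{1}{2}+\alpha} - D^{-\frac{1}{2}+\alpha}U_k\Lambda_kU_k^\top D^{-\frac{1}{2}+\alpha}\right)^2}.
\end{aligned}
\tag{2}
$$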

Here, we have and we use to denote . Since every is positive, the decreasingly ordered singular values of the matrix can be obtained by sorting in non-increasing order. In particular, $\sigma_1(D^{-\frac{1}{2}+\alpha}) = d_{\min}^{-\frac{1}{2}+\alpha}$ is the largest singular value, where $d_{\min}$ is the smallest vertex degree. Applying Lemma 3 twice to Eq. (2), we have

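$$
\begin{aligned}
\left\|D^{-\frac{1}{2}}AD^{-\frac{1}{2}} - D^{-\frac{1}{2}+\alpha}U_k\Lambda_kU_k^\top D^{-\frac{1}{2}+\alpha}\right\|_F
&\le \sqrt{\sum_{i=1}^{n}\sigma_1\!\left(D^{-\frac{1}{2}+\alpha}\right)^2\sigma_i\!\left(D^{-\alpha}AD^{-\alpha} - U_k\Lambda_kU_k^\top\right)^2\sigma_1\!\left(D^{-\frac{1}{2}+\alpha}\right)^2} \\
&= \sigma_1\!\left(D^{-\frac{1}{2}+\alpha}\right)^2\sqrt{\sum_{i=1}^{n}\sigma_i\!\left(D^{-\alpha}AD^{-\alpha} - U_k\Lambda_kU_k^\top\right)^2} \\
&= d_{\min}^{-1+2\alpha}\sqrt{\sum_{i=1}^{n}\sigma_i\!\left(D^{-\alpha}AD^{-\alpha} - U_k\Lambda_kU_k^\top\right)^2} \\
&\le \sqrt{\sum_{i=1}^{n}\sigma_i\!\left(D^{-\alpha}AD^{-\alpha} - U_k\Lambda_kU_k^\top\right)^2}.
\end{aligned}
\tag{3}
$$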

Combining Eq. (1) with Eq. (3), we have

 (4)
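One way to read the combination of Eq. (1) and Eq. (3) is the chain below. It assumes only that the square root of the sum of squared singular values is the Frobenius norm, and it is a sketch derived from the two earlier equations rather than the authors' typeset formula.

```latex
\begin{aligned}
\left\| D^{-\frac{1}{2}} A D^{-\frac{1}{2}} - D^{-\frac{1}{2}+\alpha} U_k \Lambda_k U_k^\top D^{-\frac{1}{2}+\alpha} \right\|_F
&\le \sqrt{\sum_{i=1}^{n} \sigma_i\!\left(D^{-\alpha} A D^{-\alpha} - U_k \Lambda_k U_k^\top\right)^2} \\
&= \left\| D^{-\alpha} A D^{-\alpha} - U_k \Lambda_k U_k^\top \right\|_F \\
&\le (1+\varepsilon) \sqrt{\sum_{j=k+1}^{n} |\lambda_j|^2} .
\end{aligned}
```

The first line is the last step of Eq. (3), the second rewrites the sum as a Frobenius norm, and the third is Eq. (1).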