Code of the paper "Generalized Neural Graph Embedding with Matrix Factorization"
Recent advances in language modeling such as word2vec motivate a number of graph embedding approaches by treating random walk sequences as sentences to encode structural proximity in a graph. However, most of the existing principles of neural graph embedding do not incorporate auxiliary information such as node content flexibly. In this paper we take a matrix factorization perspective of graph embedding which generalizes to structural embedding as well as content embedding in a natural way. For structure embedding, we validate that the matrix we construct and factorize preserves the high-order proximities of the graph. Label information can be further integrated into the matrix via the process of random walk sampling to enhance the quality of embedding. In addition, we generalize the Skip-Gram Negative Sampling model to integrate the content of the graph in a matrix factorization framework. As a consequence, graph embedding can be learned in a unified framework integrating graph structure and node content as well as label information simultaneously. We demonstrate the efficacy of the proposed model with the tasks of semi-supervised node classification and link prediction on a variety of real-world benchmark network datasets.
The rapid growth of network-based applications has posed major challenges for the effective processing of network data, among which a critical task is network data representation. The primitive representation of a network is usually very sparse and suffers from overwhelmingly high dimensionality, which limits its generalization in statistical learning. To deal with this issue, network embedding aims to learn latent representations of nodes on a network while preserving the structure and the inherent properties of the network. These representations can be effectively exploited by classical vector-based machine learning models for tasks including node classification, link prediction, and community detection [6, 3, 10, 1].
Recently, inspired by the advances of neural representation learning in language modeling, which is based on the principle of learning the embedding vector of a word by predicting its context [12, 13], a number of network embedding approaches have been proposed with the paradigm of learning the embedding vector of a node by predicting its neighborhood [17, 20, 3]. Specifically, latent representations of network nodes are learned by treating short random walk sequences as sentences to encode structural proximity in a network. Existing results demonstrate the effectiveness of the neural network embedding approaches in the tasks of node classification, behavior prediction, etc.
However, existing network embedding methods, including DeepWalk, LINE and node2vec, are typically based on structural proximities only and do not flexibly incorporate other information such as node content. In this paper, we explore the question of whether network structure and auxiliary properties of the network, such as node content and label information, can be integrated in a unified framework of network embedding. To achieve that, we take a matrix factorization perspective of network embedding, which naturally integrates structural embedding and content embedding simultaneously and allows label information to be incorporated flexibly.
Specifically, motivated by recent work that explains the word embedding model of Skip-Gram Negative Sampling (SGNS) as a matrix factorization of the words’ co-occurrence matrix, we build a co-occurrence matrix of structural proximities for a network based on a random walk sampling procedure. The process of SGNS can then be formulated as minimizing a matrix factorization loss, which can be naturally integrated with representation learning of node content. In addition, label information can be exploited in the process of building the co-occurrence matrix to enhance the quality of network embedding. This is achieved by decomposing the context of a node into the structure context generated with random walks and the label context based on the given label information.
Our main contributions can be summarized as follows:
We propose a unified framework of Auxiliary information Preserved Network Embedding with matrix factorization, abbreviated as APNE, which can effectively learn the latent representations of nodes, and provide a flexible integration of network structure, node content, as well as label information without leveraging downstream classifiers.
We verify that the structure matrix we generate is an approximation of the high-order proximity of the network known as rooted PageRank.
We extensively evaluate our framework on four benchmark datasets and two tasks including semi-supervised classification and link prediction. Results show that the representations learned by our proposed method are general and powerful, producing significantly increased performance over the state of the art on both tasks.
Network embedding has been extensively studied in the literature. Recently, motivated by the advances of neural representation learning in language modeling, a number of embedding learning methods have been proposed based on the Skip-Gram model. A representative model is DeepWalk, which exploits random walks to generate sequences of instances as the training corpus and then uses the Skip-Gram model to obtain the embedding vectors of nodes. Node2vec extends DeepWalk with sophisticated random walk schemes. Similarly, in LINE and GraRep, network embedding is learned by directly optimizing an objective function inspired by the Skip-Gram model. Many efforts have been made to further incorporate auxiliary information into network embedding. Among them, TADW formulates DeepWalk in a matrix factorization framework, and jointly learns embeddings with the structure information and preprocessed features of text information. This work is further extended by HSCA, DMF and MMDW with various additional information. In SPINE, structural identities are incorporated to jointly preserve local proximity and global proximity of the network in the learned embeddings. However, none of the above models jointly considers structure, content and label information, which are the fundamental elements of a network, in a unified model. Recently, TriDNR and LANE have both tackled this problem through implicit interactions between the three elements. TriDNR leverages multiple skip-gram algorithms between node-word and word-label pairs, while LANE first constructs three network affinity matrices from the three elements respectively, and then executes SVD on the affinity matrices with additional pairwise interactions. Although empirically effective, these methods do not provide a clear objective articulating how the three aspects are integrated in the learned embeddings, and they are relatively inflexible to generalize to other scenarios.
In contrast to the above models, we propose a unified framework to learn network embeddings from structure, content and label simultaneously. The superiority of our framework is threefold: a) the structure matrix we generate contains high-order proximities of the network, and the label information is incorporated by explicitly manipulating the constructed matrix rather than through implicit multi-hop interactions [16, 5]; b) instead of leveraging label information through an explicitly learned classifier (e.g., SVM, linear classifier and neural networks [23, 6]), whose performance depends not only on the quality of embeddings but also on the specific classifier being used, we exploit the label information without leveraging any downstream classifiers, which gives our model the flexibility to generalize to different tasks; c) while most of the above models only consider text descriptions of nodes, we use the raw features contained in datasets as the content information, which generalizes better to various types of real-world networks such as social networks.
Matrix Factorization (MF) has been proven effective in various machine learning tasks, such as dimensionality reduction, representation learning, recommendation systems, etc. Recently, connections have been built between MF and word embedding models. It has been shown that the Skip-Gram with Negative Sampling (SGNS) model is an Implicit Matrix Factorization (IMF) that factorizes a word-context matrix, where the value of each entry is the pointwise mutual information (PMI) between a word and context pair, indicating the strength of association. It has further been pointed out that the SGNS objective can be reformulated in a representation learning view with an Explicit Matrix Factorization (EMF) objective, where the matrix being factorized is the co-occurrence matrix among words and contexts.
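As a small illustration of the word-context matrix mentioned above, the sketch below computes PMI values from a list of observed (word, context) pairs, using the standard definition PMI(w, c) = log(#(w, c) · N / (#(w) · #(c))). The function names `pmi_matrix` and `shifted_pmi`, the toy pairs, and the shift by log k (the form commonly associated with the implicit factorization view of SGNS with k negative samples) are illustrative assumptions, not code from the paper.

```python
import math
from collections import Counter

def pmi_matrix(pairs):
    """Return {(word, context): PMI} for the observed (word, context) pairs."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    return {
        (w, c): math.log(n * total / (word_counts[w] * ctx_counts[c]))
        for (w, c), n in pair_counts.items()
    }

def shifted_pmi(pmi, k):
    """Shift every PMI entry by log k (k = assumed number of negative samples)."""
    return {key: value - math.log(k) for key, value in pmi.items()}

# Toy corpus of (word, context) pairs, purely hypothetical.
pairs = [("a", "b"), ("a", "b"), ("a", "c"), ("d", "c")]
pmi = pmi_matrix(pairs)
```

Pairs that co-occur more often than their marginal frequencies would suggest get a positive PMI, and pairs that co-occur less often get a negative one.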
In this paper, we extend the matrix factorization perspective of word embedding to the task of network embedding. More importantly, we learn the network embedding by jointly factorizing the structure matrix and the content matrix of the network, which can be further improved by leveraging auxiliary label information. Most existing network embedding methods based on matrix factorization employ either trivial objective functions (the F-norm used in TADW) or traditional factorization algorithms (the SVD used in GraRep) for optimization. In contrast, we design a novel objective function based on SGNS in our framework. Furthermore, the proposed method is general and not confined to specific downstream tasks, such as link prediction and node classification, and we do not leverage any classifiers either.
In this section, we propose a novel approach for network embedding based on a unified matrix factorization framework, which consists of three procedures as illustrated in Figure 1. We follow the paradigm of treating random walk sequences as sentences to encode structural proximities in a network. However, unlike the EMF objective for word embedding where the matrix to factorize is clearly defined as the word-context co-occurrence matrix, for network embedding, there is a gap between the random walk procedure and the co-occurrence matrix. Therefore, we start with proposing a random walk sampling process to build a co-occurrence matrix, followed by theoretical justification of its property of preserving the high-order structural proximity in the network, based on which we present the framework of network embedding with matrix factorization.
Given an undirected network $G = (V, E)$, which includes a set of nodes $V$ connected by a set of edges $E$, the corresponding adjacency matrix is $A$, where $A_{ij} = w_{ij}$ indicates an edge with weight $w_{ij}$ between the $i$-th node $v_i$ and the $j$-th node $v_j$. We denote the transition matrix of $G$ as $P$, where $P_{ij} = A_{ij} / \sum_{k} A_{ik}$. Next, a list of node sequences can be generated with random walks on the network.
Given , we can generate the co-occurrence matrix of with the -gram algorithm. The procedure is summarized in Algorithm 1. In short, for a given node in a node sequence, we increase the co-occurrence count of two nodes if and only if they are in a window of size .
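The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes an unweighted graph stored as an adjacency list, uniform transitions, and hypothetical names (`random_walks`, `cooccurrence`, `num_walks`, `walk_length`, `window`).

```python
import random
from collections import defaultdict

def random_walks(adj, num_walks, walk_length, rng):
    """Generate `num_walks` uniform random walks starting from every node."""
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            # Stop early only if the current node has no neighbors.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks

def cooccurrence(walks, window):
    """Count ordered pairs of nodes that lie within `window` steps in a walk."""
    counts = defaultdict(int)
    for walk in walks:
        for i, node in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(node, walk[j])] += 1
    return counts

# A 4-node path graph 0 - 1 - 2 - 3 (hypothetical toy input).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walks = random_walks(adj, num_walks=10, walk_length=8, rng=random.Random(0))
counts = cooccurrence(walks, window=2)
```

Because each position in a walk looks both left and right, the resulting counts are symmetric, and two nodes that are more than `window` hops apart in the graph never co-occur.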
Next we show that the co-occurrence matrix generated by Algorithm 1 preserves the high-order structural proximity in the network with the following theorem:
Define the high-order proximity of the network as
where denotes the order of the proximity as well as the window size in Algorithm 1. Then, under the condition that the random walk procedure is repeated enough times and the generated list of node sequences covers all paths in the network , we can derive that according to :
where is the window size in Algorithm 1, and the matrix denotes the expectation of row normalized co-occurrence matrix , i.e., .
Note that the -th entry of the left side of Equation (1) can be written as , which is the expected number of times that appears in the left or right -neighborhood of .
To investigate the structural information of the network encoded in the co-occurrence matrix, we first consider a well-known high-order proximity of a network named rooted PageRank (RPR), defined as $S^{\mathrm{RPR}} = (1-\beta)(I - \beta P)^{-1}$, where $P$ is the transition matrix and $\beta \in (0, 1)$ is the probability of randomly walking to a neighbor rather than jumping back. The $(i, j)$-th entry of $S^{\mathrm{RPR}}$ is the probability that a random walk from node $v_i$ will stop at $v_j$ in the steady state, which can be used as an indicator of the node-to-node proximity. $S^{\mathrm{RPR}}$ can be further rewritten as:
$$S^{\mathrm{RPR}} = (1-\beta) \sum_{k=0}^{\infty} \beta^{k} P^{k}$$
We next show that for an undirected network, where is symmetric, the row normalized co-occurrence matrix is an approximation of the rooted PageRank matrix .
When is sufficiently large, for defined as , and , the -2 norm of the difference between and can be bounded by :
Here, the quantity in question is the largest singular value of the transition matrix, which is also an eigenvalue of that matrix because it is symmetric and non-negative. The transition matrix is also known as the Markov matrix, and it can easily be proven that the largest eigenvalue of a Markov matrix is always 1. We eliminate the absolute value sign by splitting the summation at , and then we have:
Note that when is sufficiently large, according to the definition of , we have . Given , we can derive:
With Theorem 3.2, we can conclude that the normalized co-occurrence matrix we construct is an approximation of the rooted PageRank matrix, with the $\ell_2$ norm of their difference bounded.
Note that in TADW and its follow-up works [25, 24], which also apply matrix factorization to learn network embeddings, the matrix constructed to represent the structure of a network is , which is a special case of when . By comparison, we construct a general matrix while preserving high-order proximities of the network with theoretical justification.
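To make the rooted PageRank proximity discussed in this section concrete, the sketch below evaluates it numerically on a toy graph. It assumes the standard definition (1 - β)(I - βP)^{-1} and its series form (1 - β) Σ_k β^k P^k, truncated after a fixed number of terms; the helper names, the toy adjacency matrix and β = 0.85 are illustrative choices rather than values from the paper.

```python
def transition_matrix(adj_matrix):
    """Row-normalize an adjacency matrix (every node needs at least one edge)."""
    return [[a / sum(row) for a in row] for row in adj_matrix]

def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def rooted_pagerank(p, beta, num_terms):
    """Truncated series (1 - beta) * sum_{k=0}^{num_terms-1} beta^k P^k."""
    n = len(p)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    result = [[0.0] * n for _ in range(n)]
    weight = 1.0 - beta
    for _ in range(num_terms):
        for i in range(n):
            for j in range(n):
                result[i][j] += weight * power[i][j]
        power = matmul(power, p)
        weight *= beta
    return result

# Small undirected toy graph (hypothetical).
adj_matrix = [[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]]
P = transition_matrix(adj_matrix)
rpr = rooted_pagerank(P, beta=0.85, num_terms=200)
```

Each row of `rpr` is a probability distribution over stopping nodes, and the result satisfies the fixed-point relation S = (1 - β) I + β P S up to the truncation error, which is how the series and the inverse form agree.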
Apparently, the co-occurrence value between node and context indicates the similarity between them. A larger value of co-occurrence indicates closer proximity in the network, hence higher probability of belonging to the same class. This intuition coincides with the label information of nodes. Therefore, with the benefit of integer values in , label information can be explicitly incorporated in the procedure of sampling to enhance the proximity between nodes, which can additionally alleviate the problem of isolated nodes without co-occurrence in structure, i.e., we consider isolated nodes through label context instead of structure context.
Specifically, we randomly sample one node among labeled instances, followed by uniformly choosing another node with the same label and update the corresponding co-occurrence count in . As a consequence, the co-occurrence matrix captures both structure co-occurrence and label co-occurrence of instances. The complete procedure is summarized in Algorithm 2, where is a parameter controlling the ratio between the structure and label context.
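The label-context sampling step described above might look like the sketch below. It is only an illustration under stated assumptions: co-occurrence counts are kept in a dictionary keyed by node pairs, both directions of a pair are incremented to keep the matrix symmetric, and `num_samples` is a hypothetical stand-in for the parameter that controls the ratio between structure and label context in Algorithm 2.

```python
import random
from collections import defaultdict

def add_label_context(counts, labels, num_samples, rng):
    """Add label co-occurrences to `counts` in place.

    `labels` maps each labeled node to its class.
    """
    by_class = defaultdict(list)
    for node, label in labels.items():
        by_class[label].append(node)
    labeled_nodes = sorted(labels)
    for _ in range(num_samples):
        node = rng.choice(labeled_nodes)
        peers = [v for v in by_class[labels[node]] if v != node]
        if not peers:
            continue  # the class has a single labeled node, nothing to pair
        other = rng.choice(peers)
        counts[(node, other)] += 1
        counts[(other, node)] += 1  # assumption: keep the matrix symmetric
    return counts

counts = defaultdict(int)
labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}  # hypothetical labels
add_label_context(counts, labels, num_samples=50, rng=random.Random(0))
```

In this sketch, a labeled node with no structural neighbors can still receive non-zero co-occurrence counts as long as another labeled node shares its class.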
In this way, while preserving high-order proximities of the network, we can incorporate supervision into the model flexibly without leveraging any downstream classifiers, which is another important advantage of our method. By contrast, most existing methods are either purely unsupervised or leverage label information through downstream classifiers [21, 24].
The method proposed above generates the co-occurrence matrix from a network and bridges the gap between word embedding and network embedding, allowing us to apply the matrix factorization paradigm to network embedding. With the flexibility of the matrix factorization principle, we propose a joint matrix factorization model that can learn network embeddings exploiting not only the topological structure and label information but also the content information of the network simultaneously.
Given the co-occurrence matrix and the content matrix , where and represent the number of nodes in the network and the dimensionality of node features respectively. Let be the dimensionality of embedding. The objective here is to learn the embedding of a network , denoted as the matrix , by minimizing the loss of factorizing the matrices and jointly as:
where is the reconstruction loss of matrix factorization which will be introduced later, and can be regarded as the feature embedding matrix, thus is the feature embedding dictionary of nodes.
By solving the joint matrix factorization problem in (5), the structure information in and the content information in are integrated to learn the network embeddings . This is inspired by Inductive Matrix Completion, a method originally proposed to complete a gene-disease matrix with gene and disease features. However, we take a completely different loss function here in light of the word embedding model of SGNS with a matrix factorization perspective.
We first rewrite (5) in a representation learning view as:
where is the representation loss functions evaluating the discrepancy between the column of and . is the feature embedding dictionary, and the embedding vector of the node, , can be learned by minimizing the loss of representing its structure context vector via the feature embedding .
We then proceed to the objective of factorizing the co-occurrence matrix and the content matrix jointly, denoted as . We follow the paradigm of explicit matrix factorization of the SGNS model and derive the following theorem according to :
For a node in the network, we denote as a pre-defined upper bound for the possible co-occurrence count between node and context . With the equivalence of Skip-Gram Negative Sampling (SGNS) and Explicit Matrix Factorization (EMF) , the representation loss can be defined as the negative log probability of observing the structure vector given and when is set to . To be more concrete,
where is the -th column of the content matrix , i.e., the feature vector of node , is the co-occurrence count between node and , , , and is the negative sampling ratio.
Based on Theorem 3.3, we can derive:
Finally, we can formulate the objective of the joint matrix factorization framework with parameters and as:
To minimize the loss function in (7) which integrates structure, label and content simultaneously, we utilize a novel optimization algorithm leveraging the alternating minimization scheme (ALM), which is a widely adopted method in the matrix factorization literature.
First we derive the gradients of (7) as:
is the sigmoid function.
The Alternating Minimization (ALM) algorithm is summarized in Algorithm 3. It can be divided into solving two convex subproblems (starting from line 3 and line 6, respectively), which guarantees that the optimal solution of each subproblem can be reached at a sublinear convergence rate with a properly chosen step-size . One can easily show that the objective (7) descends monotonically. As a consequence, Algorithm 3 converges because the objective function (7) is bounded from below.
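The alternating scheme can be illustrated with a deliberately simplified stand-in. The sketch below does not use the SGNS-based objective (7); it swaps in a plain squared-error loss 0.5 · ||D - UᵀV||² so that the two-block structure is easy to follow, and it replaces a fixed step-size with step halving so that the loss never increases. All names (`alternating_minimization`, `step_block`, the toy matrix `D`) are hypothetical.

```python
import random

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def loss(d, u, v):
    """Squared-error stand-in: 0.5 * ||D - U^T V||_F^2."""
    approx = matmul(transpose(u), v)
    return 0.5 * sum((a - b) ** 2 for ra, rb in zip(d, approx) for a, b in zip(ra, rb))

def gradients(d, u, v):
    """Gradients of the stand-in loss with respect to U (k x n) and V (k x m)."""
    approx = matmul(transpose(u), v)
    residual = [[b - a for a, b in zip(ra, rb)] for ra, rb in zip(d, approx)]
    return matmul(v, transpose(residual)), matmul(u, residual)

def step_block(block, grad, step, evaluate):
    """Gradient step on one block; halve the step until the loss does not rise."""
    base = evaluate(block)
    while step > 1e-12:
        candidate = [[x - step * g for x, g in zip(rx, rg)]
                     for rx, rg in zip(block, grad)]
        if evaluate(candidate) <= base:
            return candidate
        step /= 2
    return block

def alternating_minimization(d, k, iterations, step, rng):
    n, m = len(d), len(d[0])
    u = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(k)]
    v = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(k)]
    history = [loss(d, u, v)]
    for _ in range(iterations):
        grad_u, _ = gradients(d, u, v)  # update U with V fixed
        u = step_block(u, grad_u, step, lambda cand: loss(d, cand, v))
        _, grad_v = gradients(d, u, v)  # update V with U fixed
        v = step_block(v, grad_v, step, lambda cand: loss(d, u, cand))
        history.append(loss(d, u, v))
    return u, v, history

D = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
U, V, history = alternating_minimization(D, k=2, iterations=50, step=0.1,
                                         rng=random.Random(0))
```

With one block held fixed, the stand-in loss is a convex quadratic in the other block, which mirrors the two convex subproblems described above; the recorded `history` is non-increasing by construction.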
The time complexity of one iteration in Algorithm 3 is , where is the number of non-zero elements in . For datasets with sparse node content, e.g., Cora, Citeseer, Facebook, etc., we implement in Equation (8) efficiently as a product of a sparse matrix with a dense matrix, which reduces the complexity from to .
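The saving from treating sparse node content as a sparse operand can be sketched as follows. The representation (one `{column: value}` dictionary per row) and the function name are illustrative assumptions; the point is only that the work is proportional to the number of non-zero entries times the width of the dense matrix, rather than to the full size of the sparse matrix.

```python
def sparse_dense_product(sparse_rows, dense):
    """Multiply a sparse matrix (list of {col: value} rows) by a dense matrix."""
    width = len(dense[0])
    out = []
    for row in sparse_rows:
        acc = [0.0] * width
        # Only the stored non-zero entries of this row contribute.
        for col, value in row.items():
            for j in range(width):
                acc[j] += value * dense[col][j]
        out.append(acc)
    return out

# Hypothetical 3 x 3 sparse matrix with three non-zero entries.
sparse_rows = [{0: 1.0, 2: 2.0}, {}, {1: 3.0}]
dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
product = sparse_dense_product(sparse_rows, dense)
```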
The proposed framework is independent of specific downstream tasks, therefore in experiments, we test the model with different tasks including link prediction and node classification. Below we first introduce the datasets we use and the baseline methods that we compare to.
| Dataset | # Classes | # Nodes | # Edges | # Features |
Datasets. We test our models on four benchmark datasets. The statistics of the datasets are summarized in Table 1. For the node classification task, we employ the Citation Networks datasets, where nodes represent papers and edges represent citations. Each paper is described by a one-hot vector or a TF-IDF word vector. For the link prediction task, we additionally include the social network dataset Facebook. This dataset consists of ego-networks from the online social network Facebook, where nodes and edges represent users and their relations, respectively. Each user is described by the user's properties, which are represented by a one-hot vector.
Baselines. For both tasks, we compare our method with network embedding algorithms including DeepWalk, node2vec, TADW and HSCA. For the node classification task, we further include DMF, LANE and two neural network based methods, Planetoid and GCN. To measure the performance of link prediction, we also evaluate our method against some popular heuristic scores defined in node2vec. Note that we do not consider TriDNR as a baseline because it uses text descriptions as node content in citation networks, while in social networks such as Facebook there is no natural text description for each user, which prevents TriDNR from generalizing to various types of networks. In addition, as MMDW and DMF are both semi-supervised variants of TADW with similar performance in our setting, we only compare our model with DMF for brevity.
Experimental Setup. For our model, the hyper-parameters are tuned on the Citeseer dataset and kept fixed on the others. The dimensionality of embedding is set to for the proposed methods. In terms of the optimization parameters, the number of iterations is set to , the step-size in Algorithm 3 is set to . The parameters in Algorithm 2 are set in consistency with DeepWalk, i.e., walk length with window size . We use APNE to denote our unsupervised model of network embedding where the co-occurrence matrix is generated by Algorithm 1, and APNE+label denotes the semi-supervised model which uses Algorithm 2 to incorporate label context into the co-occurrence matrix. Unless otherwise specified, in all the experiments, we use one-vs-rest logistic regression as the classifier for the embedding-based methods. Code is available at https://github.com/lemmonation/APNE.
We first consider the semi-supervised node classification task on three citation network datasets. To facilitate the comparison between our model and the baselines, we use the same partition scheme of training set and test set as in . To be concrete, we randomly sample instances from each class as training data, and instances from all samples in the rest of the dataset as test data.
The experimental results are reported in Table 2. In the comparison of unsupervised models, the proposed APNE method learns embeddings from the network structure and node content jointly in a unified matrix factorization framework. As a consequence, APNE outperforms notably on all datasets with improvement from to . Compared with TADW and HSCA, which both incorporate network topology and text features of nodes in a matrix factorization model simultaneously, our method is superior in the following: a) the matrix we construct and factorize represents the network topology better as proven in Section 3.1; b) the loss function we derive from SGNS is tailored for representation learning.
Meanwhile, in the comparison of semi-supervised methods, the proposed APNE model outperforms embedding-based baselines significantly, illustrating the gain brought by explicitly manipulating the constructed matrix rather than implicitly executing multi-hop interactions. In addition, LANE suffers from high complexity in both time and space, which prevents it from being generalized to larger networks such as Pubmed. Although APNE is slightly inferior to GCN on the Cora dataset, it is a feature learning method independent of downstream tasks and classifiers, so its competitive results against the state-of-the-art CNN-based method GCN indicate that the node representations learned by APNE preserve the network information well.
In general, the proposed matrix factorization framework outperforms embedding-based baselines and performs competitively with the state-of-the-art CNN-based model, demonstrating the quality of the embeddings learned by our methods in representing the network from the aspects of content and structure. Between the two variants of our proposed framework, APNE and APNE+label, the latter performs consistently better on all datasets, indicating the benefits of incorporating label context.
We further visualize the embeddings learned by our unsupervised model APNE and two unsupervised embedding-based baselines on the Cora dataset with t-SNE, a widely used dimension reduction method, and the results are shown in Figure 2. One can observe that different classes are better separated by our model, and nodes in the same class are clustered more tightly.
In order to test the sensitivity of our framework to hyper-parameters, we choose different values of the negative sampling parameter in Theorem 3.3 and the number of iterations of label context sampling in Algorithm 2 and evaluate the model on Citeseer on the node classification task.
For each pair of parameters, we repeat the experiments 10 times and compute the mean accuracy. Results are shown in Figure 3. The horizontal axis represents different values of . And represents results when the model is purely unsupervised, otherwise results are from semi-supervised models. The vertical axis is the classification accuracy on Citeseer. Clearly, increasing brings a boost of the performance of the model, as we infer in Section 3.2. This justifies the effectiveness of the approach we propose to incorporate the label context. In addition, the performance of the proposed models with different values of is relatively stable.
We further test our unsupervised model on the link prediction task. In link prediction, a snapshot of the current network is given, and the goal is to predict edges that will be added in the future. The experiment is set up as follows: we first randomly remove a portion of the existing edges from the network as positive node pairs, while ensuring that the residual network stays connected. To generate negative examples, we randomly sample an equal number of node pairs that are not connected. Node representations are then learned based on the residual network. At test time, given a node pair in the samples, we compute the cosine similarity between their representation vectors as the edge’s score. Finally, the Area Under Curve (AUC) score and Mean Average Precision (MAP) are used to evaluate the consistency between the labels and the similarity scores of the samples.
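The scoring and evaluation steps described above can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the embeddings and node pairs are made up, AUC is computed directly as the fraction of positive/negative pairs ranked correctly, and `average_precision` computes average precision over a single ranked list, which is one possible way to obtain a MAP-style score.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def auc(scores, labels):
    """Probability that a random positive pair outscores a random negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Average precision of the list ranked by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical embeddings and test pairs (label 1 = removed edge, 0 = non-edge).
emb = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [-1.0, 0.2]}
pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]
labels = [1, 1, 0, 0]
scores = [cosine_similarity(emb[u], emb[v]) for u, v in pairs]
auc_score = auc(scores, labels)
ap_score = average_precision(scores, labels)
```

In this toy example both positive pairs score higher than both negative pairs, so the ranking is perfect under either metric.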
Results are summarized in Table 3. As shown in the table, our method APNE outperforms all the baselines consistently across the different evaluation metrics. We lead topology-only methods by a large margin, especially on sparser networks such as Citeseer, which indicates the importance of leveraging node features on networks with high sparsity. Again, we consistently outperform TADW and HSCA, which also consider text features of nodes.
The stable performance of our proposed APNE model on different datasets shows that embeddings learned by jointly factorizing the co-occurrence matrix and node features can effectively represent the network. More importantly, the problem of sparsity can be alleviated by incorporating node features in a unified framework.
| Title | Same Class | Connected | Cosine Similarity |
| A cooperative coevolutionary approach to function; Multi-parent reproduction in genetic algorithms | | | |
| A Class of Algorithms for Identification in; On the Computational Power of Neural Nets | | | |
To further illustrate the effectiveness of APNE, we present some instances of link prediction on the Cora dataset. We randomly choose node pairs from all node samples and compute the cosine similarity for each pair. Results are summarized in Table 4. The superiority of APNE is obvious in the first instance, where TADW gives a negative correlation to a positive pair. For this pair, although the first paper is cited by the second one, their neighbors do not coincide. As a consequence it is easy to wrongly separate these two nodes into different categories if the structure information is not sufficiently exploited.
As for the second instance, both papers belong to the Neural Networks class but are not connected in the network. Specifically, the first paper focuses on H-Infinity methods in control theory while the second paper is about recurrent neural networks, and there exist papers linking these two domains together in the dataset. As a consequence, although these two nodes can hardly co-occur in random walk sequences on the network, their features may overlap in the dataset. Therefore, the pair of nodes will have a higher feature similarity than topology similarity. Thus, by jointly considering the network topology and the node features, our method gives a higher correlation score to the two nodes that are disconnected but belong to the same category.
In this paper, we aim to learn a generalized network embedding preserving structure, content and label information simultaneously. We propose a unified matrix factorization based framework which provides a flexible integration of network structure, node content, as well as label information. We bridge the gap between word embedding and network embedding by designing a method to generate the co-occurrence matrix from the network, which is actually an approximation of high-order proximities of nodes in the network. The experimental results on four benchmark datasets show that the joint matrix factorization method we propose brings substantial improvement over existing methods. One of our future directions would be to apply our framework to social recommendations to combine the relationship between users with the corresponding feature representations.
This research was supported by the National Natural Science Foundation of China (No. 61673364, No. U1605251 and No. 61727809), and the Fundamental Research Funds for the Central Universities (WK2150110008).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
Song, H.H., Cho, T.W., Dave, V., Zhang, Y., Qiu, L.: Scalable proximity estimation and link prediction in online social networks. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM (2009)