1 Introduction
Graph node embedding learns a mapping that represents nodes as points in a low-dimensional vector space, where the geometric relationships reflect the structure of the original graph. Nodes that are “close” in the graph are embedded to have similar vector representations (surveyCAI). The learned node vectors benefit a number of graph analysis tasks, such as node classification (bhagat2011node), link prediction (liben2007link), community detection (fortunato2010community), and many others (surveyjure). To preserve the geometric relations among nodes in the embedded space, the similarity/proximity/distance of a node to its neighborhood is generally taken as input by graph embedding approaches. For example, matrix-factorization approaches work on predefined pairwise similarity measures (e.g., different orders of the adjacency matrix). Deepwalk (perozzi2014deepwalk), node2vec (grover2016node2vec) and other recent approaches (dong2017metapath2vec) consider a flexible, stochastic measure of node similarity based on node co-occurrence in short random walks over the graph (surveyGoyal). Neighborhood autoencoder methods compress the information about a node’s local neighborhood, described as a neighborhood vector containing the node’s pairwise similarity to all other nodes in the graph (SDNE; DNGR). Neural network based approaches such as graph convolutional networks (GCN) and GraphSAGE apply convolution-like functions to a node’s surrounding neighborhood to aggregate neighborhood information (kipf2016semi; GraphSAGE).
Although effective, all existing approaches need to specify the neighborhood and the form of dependence on the neighborhood, which significantly degrades their flexibility for general graph representation learning. In this work, we propose a novel graph node embedding method, namely Graph Embedding via Set Function (GESF), that can

learn node representations via a universal graph embedding function, without predefining pairwise similarity, specifying random walk parameters, or choosing among aggregation functions such as element-wise mean, a max-pooling neural network, or LSTMs;

capture arbitrary relationships between the target node and its neighbors at different distances, and automatically decide their significance;

be generally applied to any graph, from simple homogeneous graphs to heterogeneous graphs with complicated types of nodes.
The core difficulty of graph node embedding is to characterize an arbitrary relationship to the neighborhood. From a local point of view, switching any two neighbors of a node within the same category does not affect the representation of this node. Based on this key observation, we propose to learn the embedding vector of a node via a partially permutation invariant set function applied to its neighbors’ embedding vectors. We provide a neat form to represent such a set function and prove that it can characterize an arbitrary partially permutation invariant set function. Evaluation results on benchmark datasets show that the proposed GESF outperforms the state-of-the-art approaches in producing node vectors for classification tasks.
2 Related Work
The main difference among various graph embedding methods lies in how they define the “closeness” between two nodes (surveyCAI). First-order proximity, second-order proximity and even high-order proximity have been widely studied for capturing the structural relationship between nodes (tang2015line; yang2017fast). In this section, we discuss the relevant graph embedding approaches in terms of how node closeness to neighboring nodes is measured, in order to highlight our contribution of utilizing neighboring nodes in the most general manner. Comprehensive reviews of graph embedding can be found in (surveyCAI; surveyjure; surveyGoyal; yang2017fast).
Matrix Analysis on Graph Embedding
As early as 2011, a spectral clustering method (tang2011leveraging) took the eigenvalue decomposition of a normalized Laplacian matrix of a graph as an effective approach to obtain the embeddings of nodes. Other similar approaches work on different node similarity matrices, applying various similarity functions to trade off between modeling “first-order similarity” and “higher-order similarity” (GraRep; HOPE). Node content information can also be easily fused into the pairwise similarity measure, e.g., in TADW (yang2015network), as can node label information, which results in semi-supervised graph embedding methods, e.g., MMDW (tu2016max).
Random Walk on a Graph to Node Representation Learning
Both deepwalk (perozzi2014deepwalk) and node2vec (grover2016node2vec) are outstanding graph embedding methods for the node representation learning problem. They convert the graph structure into a sequential context format with random walks (lovasz1993random). Building on the word representation learning method of (mikolov2013distributed), deepwalk adopts the framework for learning word representations from sentences to generate node representations from random walk contexts. node2vec extends this idea with additional hyperparameters that trade off between depth-first search (DFS) and breadth-first search (BFS) to control the direction of the random walk. Planetoid (yang2016revisiting) proposed a semi-supervised learning framework that guides random walks with available node label information.
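As a rough illustration of how a graph is turned into a sequential context, the sketch below generates uniform random walks over an adjacency list. It assumes unbiased neighbor sampling in the spirit of deepwalk; the function name `random_walks` and its parameters are illustrative choices and are not taken from any of the cited implementations.

```python
import random


def random_walks(adj_list, num_walks, walk_length, seed=0):
    """Generate num_walks uniform random walks starting from every node.

    adj_list maps each node to a list of its neighbors; walk_length counts nodes.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj_list:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj_list[walk[-1]]
                if not neighbors:          # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy path graph 0 - 1 - 2 (hypothetical data).
graph = {0: [1], 1: [0, 2], 2: [1]}
walks = random_walks(graph, num_walks=2, walk_length=5, seed=1)
```

The resulting node sequences would then play the role of sentences for a word2vec-style model.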
Neighborhood Encoders to Graph Embedding
There are also methods that focus on aggregating or encoding the neighbors’ information to generate node embeddings. DNGR (DNGR) and SDNE (wang2016structural) introduce autoencoders to construct the similarity function between the neighborhood vectors and the embedding of the target node. DNGR defines neighborhood vectors based on random walks, while SDNE uses the adjacency matrix and Laplacian eigenmaps in the definition of neighborhood vectors. Although the autoencoder idea is a great improvement, these methods scale poorly when the graph grows to millions of nodes. Therefore, methods with neighborhood aggregation and convolutional encoders have been introduced to construct a local aggregation for node embedding, such as GCN (kipf2016semi; kipf2016variational; schlichtkrull2017modeling; van2017graph), column networks (pham2017column) and the GraphSAGE algorithm (GraphSAGE). The main idea of these methods is an iterative or recursive aggregation procedure, e.g., convolutional kernels or pooling procedures, that generates the embedding vectors for all nodes; the aggregation procedure is shared by all nodes in a graph.
The above-mentioned methods differ in how they use neighboring nodes for node representation learning. They require predefining a pairwise similarity measure between nodes, specifying random walk parameters, or choosing aggregation functions. In practice, it takes nontrivial effort to tune these parameters or try different measures, especially when graphs are complicated with nodes of multiple types, i.e., heterogeneous graphs. This work hence aims to let neighboring nodes play their roles in the most general manner, such that their contributions are learned rather than user-defined. The resulting embedding method has the flexibility to work on any type of homogeneous or heterogeneous graph. The heterogeneity of graph nodes is handled by a heterogeneous random walk procedure in dong2017metapath2vec and by deep neural networks in chang2015heterogeneous. GESF has a natural advantage in avoiding any manual design of random walk strategies or of the relationships between different types of nodes. Owing to the set function formulation of (zaheer2017deep), all valid mapping strategies from the neighborhood to the target node can be represented by set functions, which GESF learns automatically.
3 A Universal Graph Embedding Model based on Set Function
In this section, we first introduce a universal mapping function that generates the embedding vector of a node from its neighborhood at various steps from the target node, and then propose a permutation invariant set function as this universal mapping function. Subsequently, a matrix function is introduced to process the knowledge of different orders of neighborhood. Finally, we propose the overall learning model to solve for the proper embedding vectors of nodes with respect to a specific learning problem.
3.1 A universal graph embedding model
We aim to design graph embedding models for the most general graphs, which may include different types of nodes. Formally, consider a graph $G = (V, E)$, where the node set $V = \bigcup_{k=1}^{K} V_k$, i.e., $V$ is composed of $K$ disjoint types of nodes. One instance of such a graph is the academic publication network, which includes different types of nodes for papers, publication venues, author names, author affiliations, research domains, etc.
Given a graph $G$, our goal is to learn the embedding vector for each node in this graph. The position of a node in the embedded space is collaboratively determined by its neighboring nodes. Therefore, we propose a universal embedding model where the embedding vector $x_v$ of node $v$ can be represented by its neighbors’ embedding vectors via a set function $f$:
$$x_v = f(X_v^1, X_v^2, \dots, X_v^K),$$
where $X_v^k$ is a matrix with column vectors corresponding to the embeddings of node $v$’s neighbors of type $k$. Note that the neighbors can be step-1 (or immediate) neighbors, step-2 neighbors, or even higher-degree neighbors. However, all neighboring nodes that are reachable from a node in the same number of steps play the same role when localizing this node in the embedded space. Therefore, function $f$ should be a partially permutation invariant function. That is, if we swap any columns in each $X_v^k$, the function value remains the same. Unfortunately, the set function is not directly learnable due to the permutation property.
One straightforward idea to represent the partially permutation invariant function is to define it in the following form
(1) 
where denotes the set of dimensional permutation matrices, and denote the representation matrix consisting of the vectors in , respectively. is to permute the columns in . It is easy to verify that the function defined in (1) is partially permutation invariant, but it is almost not learnable because it involves “sum” items.
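To make the cost of this construction concrete, the following sketch symmetrizes an order-sensitive function by brute force for a single group of neighbors. It averages over orderings rather than summing them (a normalization chosen here only for readability), and `symmetrize` and `order_sensitive` are hypothetical names introduced for illustration.

```python
from itertools import permutations
from math import factorial


def order_sensitive(xs):
    # Position-dependent weights: the value changes when inputs are reordered.
    return sum((i + 1) * x for i, x in enumerate(xs))


def symmetrize(fn, xs):
    """Average fn over every ordering of xs, which needs len(xs)! evaluations."""
    orderings = list(permutations(xs))
    return sum(fn(p) for p in orderings) / len(orderings)


neighbors = [1.0, 2.0, 4.0]
reordered = [4.0, 1.0, 2.0]

raw_a, raw_b = order_sensitive(neighbors), order_sensitive(reordered)
sym_a = symmetrize(order_sensitive, neighbors)
sym_b = symmetrize(order_sensitive, reordered)
terms_for_20_neighbors = factorial(20)   # size of the sum for a 20-neighbor group
```

The symmetrized value is the same for both orderings, but the number of terms grows factorially with the group size, which is why this form is impractical to learn for realistic neighborhoods.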
Our solution for learning this function is based on the following theorem, which gives a neat way to represent any partially permutation invariant function. The proof is in the Appendix.
Theorem 3.1.
Let $f$ be a continuous real-valued function defined on a compact set with the following form
$$f(\underbrace{x_{1,1}, \dots, x_{1,N_1}}_{G_1}, \underbrace{x_{2,1}, \dots, x_{2,N_2}}_{G_2}, \dots, \underbrace{x_{K,1}, \dots, x_{K,N_K}}_{G_K}).$$
If $f$ is partially permutation invariant, that is, any permutation of the values within the group $G_k$ for any $k$ does not change the function value, then there must exist functions $h$ and $\{g_k\}_{k=1}^{K}$ to approximate $f$ with arbitrary precision in the following form
$$h\left(\sum_{n=1}^{N_1} g_1(x_{1,n}), \sum_{n=1}^{N_2} g_2(x_{2,n}), \dots, \sum_{n=1}^{N_K} g_K(x_{K,n})\right). \qquad (2)$$
Based on this theorem, we only need to parameterize $h$ and $\{g_k\}_{k=1}^{K}$ to learn the node embedding function. We next formulate the embedding model when considering different orders of neighborhood.
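A minimal sketch of the sum-decomposed form in (2) is given below. It assumes one per-group feature map and one outer function, here called `gs` and `h`; both are fixed hand-written functions standing in for the learned networks, and all names and numbers are hypothetical.

```python
import math


def pool(group, g):
    """Sum g(x) over one neighbor group; a sum does not depend on ordering."""
    mapped = [g(x) for x in group]
    return [sum(column) for column in zip(*mapped)]


def set_function(groups, gs, h):
    """Evaluate h applied to the per-group sums of g_k(x)."""
    return h([pool(group, g) for group, g in zip(groups, gs)])


def g_type1(x):
    return [x, x * x]


def g_type2(x):
    return [math.sin(x)]


def h(pooled):
    return pooled[0][0] * pooled[1][0] + pooled[0][1]


gs = [g_type1, g_type2]
groups = [[1.0, 2.0, 3.0], [0.5, 1.5]]
shuffled = [[3.0, 1.0, 2.0], [1.5, 0.5]]   # reorder within each group
crossed = [[0.5, 2.0, 3.0], [1.0, 1.5]]    # swap one element across groups

value = set_function(groups, gs, h)
value_shuffled = set_function(shuffled, gs, h)
value_crossed = set_function(crossed, gs, h)
```

Reordering inside a group leaves the value unchanged, while moving an element to a different group changes it, which is the partial permutation invariance described above.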
1-step neighbors
From Theorem 3.1, any mapping function of a node can be characterized by appropriately defined functions , and :
where denotes the step neighbors of node in node type .
Multi-step neighbors
High-order proximity has been shown to be beneficial for generating high-quality embedding vectors (yang2017fast). Extending the 1-step neighbor model, we obtain a more general model in which the representation of each node can depend on immediate (1-step) neighbors, 2-step neighbors, 3-step neighbors, and even infinite-step neighbors.
(3)  
where are the weights for neighbors at different steps. Let be the adjacency matrix indicating all edges by . If we define the polynomial matrix function on the adjacency matrix as , we can cast (3) into its matrix form
(4)  
where denotes the representation matrix for nodes in type , denotes the submatrix of indexed by column and rows in , and function (with on the top of function ) is defined as the function extension
Note that the embedding vectors for different types of nodes may have different dimensions. A homogeneous graph is a special case of a heterogeneous graph with a single node type ($K = 1$). The proposed model is thus naturally applicable to homogeneous graphs.
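The polynomial matrix function used for multi-step neighbors can be sketched as a weighted sum of powers of the adjacency matrix. The sketch below truncates the sum to a finite number of steps and uses arbitrary step weights; `polynomial_matrix`, the weights, and the toy graph are assumptions made for illustration only.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def polynomial_matrix(adj, weights):
    """Return the sum over t of weights[t-1] * adj**t, for t = 1..len(weights)."""
    n = len(adj)
    result = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for w in weights:
        power = matmul(power, adj)          # adj**t counts t-step walks
        for i in range(n):
            for j in range(n):
                result[i][j] += w * power[i][j]
    return result


# Path graph 0 - 1 - 2; 1-step weight 1.0 and 2-step weight 0.5 (arbitrary).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
mixed = polynomial_matrix(adj, [1.0, 0.5])
features = [[1.0], [2.0], [3.0]]            # one scalar feature per node
aggregated = matmul(mixed, features)        # weighted sum over multi-step neighbors
```

Here node 0 receives a contribution from node 2 through the 2-step term even though the two nodes are not directly connected.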
To avoid optimizing an infinite number of coefficients, we propose to use a 1-dimensional NN function to equivalently represent the polynomial matrix function, which reduces the number of parameters based on the following observation
where
is the singular value decomposition. We parameterize
using a 1-dimensional NN, which allows us to easily control the number of variables to optimize by choosing the number of layers and the number of nodes in each layer.
3.2 The overall model
For short, we denote the representation function for in (4) by
To fulfill the requirement of a specific learning task, we propose the following learning model involving a supervised component
(5) 
where denotes the set of labeled nodes, and
balances the representation error and prediction error. The first, unsupervised learning component restricts the representation error between the target node and its neighbors with
norm, since a practical graph is allowed to contain noise. The supervised component is flexible and can be replaced with any learning task designed on the nodes of a graph. For example, for a regression problem, a least-squares loss can be chosen as the supervised loss, and a cross-entropy loss can be used to formulate a classification problem. To solve the problem in Eq. (5), we apply stochastic gradient descent (SGD) to optimize all learning variables simultaneously.
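The structure of this objective, an unsupervised representation error plus a weighted supervised loss, can be sketched as follows. Several details are assumptions made only for this example: the representation error uses the Euclidean norm, the balance weight `lam` multiplies the supervised term, the supervised loss is a softmax cross-entropy with a linear classifier, and the neighbor-based reconstructions are passed in as precomputed vectors.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under a softmax, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def objective(embeddings, reconstructions, classifier, labels, lam):
    """Representation error over all nodes plus lam times the loss on labeled nodes."""
    rep_error = sum(
        l2_norm([a - b for a, b in zip(embeddings[v], reconstructions[v])])
        for v in embeddings
    )
    supervised = 0.0
    for v, y in labels.items():
        logits = [sum(w * x for w, x in zip(row, embeddings[v])) for row in classifier]
        supervised += softmax_cross_entropy(logits, y)
    return rep_error + lam * supervised


# Toy values (hypothetical): two nodes, two classes, only node 0 is labeled.
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
reconstructions = {0: [1.0, 0.0], 1: [3.0, 5.0]}
classifier = [[0.0, 0.0], [0.0, 0.0]]
total = objective(embeddings, reconstructions, classifier, labels={0: 0}, lam=2.0)
```

In an actual training loop, the embeddings, the mapping networks, and the classifier would all be updated by SGD on this quantity rather than evaluated once.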
4 Experiments
This section reports experimental results that validate the proposed method, comparing it to state-of-the-art algorithms on benchmark datasets including both homogeneous and heterogeneous graphs.
4.1 Comparison on homogeneous graphs
We consider the multi-class classification problem on homogeneous graphs. Given a graph with partially labeled nodes, the goal is to learn a representation for each node in order to predict the class of unlabeled nodes.
Datasets
We evaluate GESF and the compared methods on five datasets.

Cora (mccallum2000automating) is a paper citation network. Each node is a paper. There are 2,708 papers and 5,429 citation links in total. Each paper is associated with one of 7 classes.

CiteSeer (giles1998citeseer) is another paper citation network. It contains 3,312 papers and 4,732 citations in total. All papers are classified into 6 classes.

Pubmed (sen2008collective) is a larger and more complex citation network than the previous two datasets. There are 19,717 vertices and 88,651 citation links. Papers are classified into 3 classes.

Wikipedia (sen2008collective) contains 2,405 online webpages with 17,981 undirected links between pairs of them. The pages come from 17 categories.

Email-eu (leskovec2007graph) is an email communication network that captures the email exchanges between researchers in a large European research institution. There are 1,005 researchers and 25,571 links between them. The department affiliations of the researchers (42 in total) are the labels to predict.
Baseline methods
The compared baseline algorithms are listed below:

Deepwalk (perozzi2014deepwalk) is an unsupervised graph embedding method that relies on random walks and the word2vec method. For each vertex, we take 80 random walks of length 40 and set the window size to 10. Since deepwalk is unsupervised, we apply logistic regression to the generated embeddings for node classification.

Node2vec (grover2016node2vec) is an improved graph embedding method based on deepwalk. We set the window size to 10, the walk length to 80 and the number of walks per node to 100. Node2vec is unsupervised as well, so we apply the same evaluation procedure to its embeddings as for deepwalk.

MMDW (tu2016max) is a semi-supervised graph embedding framework that combines matrix decomposition and SVM classification. We tune the method multiple times and set its hyperparameter to 0.01, as recommended by the authors.

Planetoid (yang2016revisiting) is a semi-supervised learning framework. We disable the node attributes used in planetoid, since we focus only on extracting information from the graph structure.

GCN (Graph Convolutional Networks) (kipf2016semi) applies convolutional neural networks to the semi-supervised embedding learning of graphs. We remove the node attributes for fairness as well.
Experiment setup and results
For fair comparison, the dimension of representation vectors is chosen to be the same for all algorithms (the dimension is ). The hyperparameters are fine-tuned for all of them. The details of GESF for the multi-class case are as follows.

Supervised Component: Softmax function is chosen to formulate our supervised component in Eq. (5) which is defined as . For an arbitrary embedding
, we have the probability term as
for such node to be predicted as class , where is a classifier for class . Therefore, the whole supervised component in Eq. (5) is , where is an regularization for and is chosen to be . 
Unsupervised embedding mapping component: We design a two-layer NN with hidden dimension 64 to form the mapping from the embeddings of neighbors to the target node, and a two-layer 1-to-1 NN with a 3-dimensional hidden layer to construct the matrix function for the adjacency matrix. We preprocess the matrix with an eigenvalue decomposition that preserves the highest 1000 eigenvalues by default. The balance hyperparameter is set to be .
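The eigenvalue preprocessing and the 1-to-1 function on the spectrum can be sketched with NumPy as follows. The sketch assumes a symmetric adjacency matrix (an undirected graph), interprets "highest" as largest by value, and uses a fixed polynomial `scalar_fn` as a stand-in for the learned 1-to-1 network; the function names are hypothetical.

```python
import numpy as np


def spectral_matrix_function(adj, scalar_fn, k=None):
    """Apply scalar_fn to the spectrum of a symmetric matrix.

    If k is given, only the k largest eigenvalues (by value) are kept.
    """
    eigvals, eigvecs = np.linalg.eigh(adj)        # eigenvalues in ascending order
    if k is not None:
        eigvals, eigvecs = eigvals[-k:], eigvecs[:, -k:]
    # Scale each eigenvector column by the transformed eigenvalue, then project back.
    return (eigvecs * scalar_fn(eigvals)) @ eigvecs.T


def scalar_fn(s):
    # Stand-in for the learned 1-to-1 network: the fixed polynomial s + 0.5 * s**2.
    return s + 0.5 * s ** 2


adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
full = spectral_matrix_function(adj, scalar_fn)            # full spectrum
truncated = spectral_matrix_function(adj, scalar_fn, k=1)  # keep the top eigenvalue only
```

With the full spectrum, the result coincides with evaluating the same polynomial directly on the matrix, so a function of one scalar variable is enough to represent the matrix function; truncating to the top eigenvalues trades accuracy for a smaller decomposition.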
We run experiments on each dataset and compare the performance of all methods mentioned above. Since this is a multi-class classification scenario, we use accuracy as the evaluation criterion. The percentage of labeled samples is chosen from 10% to 90% and the remaining samples are used for evaluation, except for planetoid (yang2016revisiting). Fixed training and test sets are used for planetoid because its optimization strategy depends on the matching between the order of the vertices in the graph and their order in the training and test sets. Therefore, we only provide the results of a single run for planetoid. All other experiments are repeated three times and we report the mean and standard deviation of their performance in Table 1. We highlight the best performance for each dataset in bold and mark the second best results with a “*”. We can observe that in most cases, our method outperforms the other methods.
training%  10.00%  20.00%  30.00%  40.00%  50.00%  60.00%  70.00%  80.00%  90.00%

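Cora  deepwalk
Cora  node2vec
Cora  MMDW
Cora  planetoid
Cora  GCN
Cora  GESF
CiteSeer  deepwalk
CiteSeer  node2vec
CiteSeer  MMDW
CiteSeer  planetoid
CiteSeer  GCN
CiteSeer  GESF
Pubmed  deepwalk
Pubmed  node2vec
Pubmed  MMDW
Pubmed  planetoid
Pubmed  GCN
Pubmed  GESF
Wikipedia  deepwalk
Wikipedia  node2vec
Wikipedia  MMDW
Wikipedia  planetoid
Wikipedia  GCN
Wikipedia  GESF
Email-eu  deepwalk
Email-eu  node2vec
Email-eu  MMDW
Email-eu  planetoid
Email-eu  GCN
Email-eu  GESF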
MMDW takes over 64GB memory on the experiment of Pubmed. The results of MMDW are not available in this comparison.
4.2 Comparison on heterogeneous graphs
We next conduct evaluation on heterogeneous graphs, where learned node embedding vectors are used for multilabel classification.
Datasets
The datasets used include

DBLP (ji2010graph) is an academic community network. We obtain a subset of the large network with two types of nodes: authors and keywords from the authors’ publications. The generated subgraph includes (authors) + (key words) vertices. A link between a pair of authors indicates a coauthor relationship, and a link between an author and a word means that the word appears in at least one publication of this author. There are 66,832 edges between pairs of authors and 338,210 edges between authors and words. Each node can have multiple labels out of four in total.

BlogCatalog (Wangetal10) is a social media network with 55,814 users, who are classified into multiple overlapping groups according to their interests. We take the five largest groups to evaluate the performance of the methods. Users and tags are the two types of nodes. The 5,413 tags are generated by users as keywords for their blogs. Therefore, tags are shared among different users and are also connected to each other, since some tags are generated from the same blogs. The numbers of edges between users, between tags, and between users and tags are about 1.4M, 619K and 343K, respectively.
Methods for Comparison
To validate the performance of GESF on heterogeneous graphs, we conduct the experiments in two stages: (1) comparing GESF with Deepwalk (perozzi2014deepwalk) and node2vec (grover2016node2vec) on the graphs by treating all nodes as the same type (GESF in a homogeneous setting); (2) comparing GESF with the state-of-the-art heterogeneous graph embedding method, metapath2vec (dong2017metapath2vec), in a heterogeneous setting. The hyperparameters of this method are fine-tuned and metapath2vec++ is chosen as the variant for the comparison.
Experiment Setup and Results
For fair comparison, the dimension of representation vectors is chosen to be the same for all algorithms (the dimension is 64). We fine-tune the hyperparameters for all of them. The details of GESF for the multi-label case are as follows.

Supervised Component: Since it is a multilabel classification problem, each label can be treated as a binary classification problem. Therefore, we apply logistic regression for each label and for an arbitrary instance and the th label , the supervised component is formulated as , where is the classifier for the th label. Therefore, the supervised component in Eq. (5) is defined as and is the regularization component for , where is chosen to be .

Unsupervised Embedding Mapping Component: For each type of node, we design a two-layer NN with a 64-dimensional hidden layer, one per node type in its neighborhood, to formulate the mapping from the embeddings of neighbors to the embedding of the target node. We also form a two-layer 1-to-1 NN with a 3-dimensional hidden layer to construct the matrix function for the adjacency matrix of the whole graph. We preprocess the matrix with an eigenvalue decomposition that preserves the highest 1000 eigenvalues by default. We denote the nodes to be classified as type 1 and the other type as type 2. The balance hyperparameter is set to be [0.2, 200].
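The per-label logistic regression used as the supervised component can be sketched as below. The sketch assumes labels encoded as -1/+1, one linear classifier per label, and a squared L2 penalty on each classifier scaled by `reg`; these choices and the name `multilabel_loss` are assumptions made for illustration.

```python
import math


def logistic_loss(score, target):
    """log(1 + exp(-target * score)) for target in {-1, +1}, computed stably."""
    z = -target * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def multilabel_loss(x, targets, classifiers, reg):
    """Sum of independent logistic losses, one per label, plus an L2 penalty."""
    total = 0.0
    for w, y in zip(classifiers, targets):
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += logistic_loss(score, y) + reg * sum(wi * wi for wi in w)
    return total


# Toy embedding with two labels (hypothetical values).
x = [1.0, 2.0]
classifiers = [[0.0, 0.0], [1.0, 0.0]]
loss = multilabel_loss(x, targets=[1, -1], classifiers=classifiers, reg=0.1)
```

Treating each label as its own binary problem lets a node carry several labels at once, which a single softmax over classes would not allow.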
We carry out the experiments on the DBLP and BlogCatalog datasets and compare the performance of all methods mentioned above. Since this is a multi-label classification task, we use the F1 score (macro and micro) as the evaluation metric. The percentage of labeled samples is chosen from 10% to 90%, while the remaining samples are used for evaluation. We repeat all experiments three times and report the mean and standard deviation of their performance in Table 2. We can observe that in most cases, GESF in the heterogeneous setting has the best performance, while GESF in the homogeneous setting achieves the second best results, demonstrating the validity of our proposed universal graph embedding mechanism.
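For reference, the macro- and micro-averaged F1 scores used in this evaluation can be computed as in the sketch below. It assumes binary indicator matrices for the true and predicted labels; the helper names are hypothetical, and this is not the evaluation script used for the reported results.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for lists of 0/1 label-indicator rows."""
    num_labels = len(y_true[0])
    counts = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / num_labels   # average of per-label F1
    micro = f1(*(sum(col) for col in zip(*counts)))    # F1 of pooled counts
    return macro, micro


# Three nodes, two labels (hypothetical predictions).
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
macro, micro = macro_micro_f1(y_true, y_pred)
```

Macro F1 weights every label equally, whereas micro F1 pools the counts over labels first and is therefore dominated by the frequent labels.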
training%  10.00%  20.00%  30.00%  40.00%  50.00%  60.00%  70.00%  80.00%  90.00%  

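DBLP (macro)  deepwalk
DBLP (macro)  node2vec
DBLP (macro)  metapath2vec++
DBLP (micro)  deepwalk
DBLP (micro)  node2vec
DBLP (micro)  metapath2vec++
BlogCatalog (macro)  deepwalk
BlogCatalog (macro)  node2vec
BlogCatalog (macro)  metapath2vec++
BlogCatalog (micro)  deepwalk
BlogCatalog (micro)  node2vec
BlogCatalog (micro)  metapath2vec++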
5 Conclusion and Future Work
In summary, we propose GESF as a general graph embedding solution, with a theoretical guarantee for the expressiveness of the model and strong experimental results compared to state-of-the-art algorithms. For future work, our model can be extended to more general cases, e.g., incorporating node content or attributes into the embedding learning. One possible solution is to introduce the attributes as a special type of neighbor in the graph, so that multiple set functions can be used to map the embeddings within a more complex heterogeneous graph structure.