1 Introduction
Graph node embedding (called graph node representation learning in some literature [1]) learns a vector representation in a Euclidean space for each node of a graph, such that the geometric relationships among the vectors reflect the structure of the original graph. Nodes that are “close” in the graph are embedded to have similar vector representations [2]. The learned node vectors benefit a number of graph analysis tasks, such as node classification [3], link prediction [4], community detection [5], recommendation [6], and many others [7].
A graph can be uniquely determined by defining the neighborhood of each node. Therefore, the key issue for graph embedding lies in how to model the dependence of each node on its neighbors.
Existing approaches mostly specify (either explicitly or implicitly) certain dependencies on neighbors. DeepWalk [8], node2vec [9], and their variants [10, 11] randomly generate a set of fixed-length paths to learn the representation of each node, which implicitly defines the neighborhood and the dependence among nodes. The method in [12] uses the adjacency matrix to represent the neighborhood of every node in a graph and applies matrix factorization to learn the node embeddings, which implicitly defines a linear dependence among nodes. Neighborhood autoencoders [13, 14, 15] use a neighborhood vector to represent the neighborhood relations of a node; this vector contains the node’s pairwise similarity to all other nodes in the graph. Graph2Gauss [15] embeds each node as a Gaussian distribution based on the graph knowledge. Deep neural networks for graph representations (DNGR) [14] use a stacked denoising autoencoder to extract complex nonlinear features for each node. The Structural Deep Network Embedding method (SDNE) [13] preserves the first-order and second-order proximity of each node in a graph via a semi-supervised autoencoder learning model. Neural-network-based approaches such as graph convolutional networks (GCN) [16] and GraphSAGE [17] define fixed-depth neural network layers to capture neighborhood information from one-step neighbors, two-step neighbors, and so on up to a fixed number of steps, and they apply convolution-like functions on these neighbors as the aggregation strategy. Graph attention networks (GATs) [18] and the Attention-based Graph Neural Network (AGNN) [19] employ an attention mechanism when aggregating the neighbors.

However, predefining the neighbors and the dependence (whether explicitly or implicitly) may cause a subtle but important loss of structural information within the graph and of dependence among neighbors. For example, the family of random-walk-based methods [8, 9, 10] ignores the influence on the center node of nodes that lie beyond the predefined path length. GCN [16] restricts the form of the dependence on the neighbor nodes to a two-layer aggregation function and inherits considerable complexity from its deep learning lineage [20]. These limitations raise a fundamental question: can we design a model that gives maximal flexibility in defining dependencies on neighbors?

In this work, we propose a Partial Permutation Invariant Node Embedding method (PINE) by developing a new notion of partial permutation invariant set function, that can

learn node representations via a universal graph embedding function, without predefining pairwise similarity, specifying random walk parameters, or choosing aggregation functions such as element-wise mean, a max-pooling neural network, or long short-term memory units (LSTMs);

capture the arbitrary permutation invariant relationship of each node to its neighbors;

be applied to both homogeneous and heterogeneous graphs with complicated types of nodes.
Evaluation results on benchmark data sets show that the proposed PINE outperforms state-of-the-art approaches in producing node vectors for classification tasks.
Notations: Throughout this paper, we use the following notation.

denotes a graph with vertex set and edge set . The corresponding lower case characters and represent a single vertex and edge.

denotes an approximation to a function f.

denotes a permutation matrix.

denotes a symmetric group.

denotes a permutation operator.

represents a nonlinear activation function.
2 Related Work
The main difference among various graph embedding methods lies in how they define the “closeness” between two nodes [2]. First-order proximity, second-order proximity, and even high-order proximity have been widely studied for capturing the structural relationship between nodes [21, 22, 23]. Comprehensive reviews of graph embedding can be found in [2, 7, 24, 22]. In this section, we discuss the relevant graph embedding approaches in terms of how node closeness is measured, to highlight our contribution of capturing neighborhood dependency in the most general manner. The section ends with a review of set functions, which underlie the technique used in this paper.
Matrix Analysis on Graph Embedding: As early as 2011, a spectral clustering method [25] was proposed that takes the eigenvalue decomposition of the normalized Laplacian matrix of a graph as an effective way to obtain node embeddings. Other similar approaches choose different similarity matrices (different from the Laplacian matrix) to trade off between modeling the “first-order similarity” and modeling “higher-order similarity” [26, 27, 28]. Node content information can also be fused into the pairwise similarity measure, e.g., in text-associated DeepWalk (TADW) [29], as can node label information, which results in semi-supervised graph embedding methods, e.g., max-margin DeepWalk (MMDW) [30]. Recently, an arbitrary-order proximity-preserving graph embedding method was introduced in [12]; it is based on matrix eigendecomposition applied to a predefined high-order proximity matrix. Furthermore, [31] proposes the Lanczos network, which uses the Lanczos algorithm to construct low-rank approximations of the graph Laplacian for graph convolution. For heterogeneous networks, [32] proposes a label-involving matrix analysis to learn the classification result of each vertex within a semi-supervised framework.

Random Walk on a Graph to Node Representation: Both DeepWalk [8] and node2vec [9] address the node embedding problem by converting the graph structure into sequential contexts with random walks [33]. Building on the pioneering work of [34] on word representation learning, DeepWalk adopts the framework for learning word representations in text to generate node representations from random-walk contexts. node2vec extends this idea with additional hyperparameters that trade off between depth-first search (DFS) and breadth-first search (BFS) to control the direction of the random walk. Struc2vec [35] also uses a multilayer graph to construct the node representations. [36] proposes a self-paced graph embedding that introduces a dynamic negative sampling method to select difficult negative context nodes during training. Planetoid [37] is a semi-supervised learning framework that guides the random walk with the available node label information. The heterogeneity of graph nodes is often handled by a heterogeneous random walk procedure [10] or by selected relation pairs [38]. [39] considers the predictive text embedding problem on a large-scale heterogeneous text network; the proposed method is also based on predefined heterogeneous random walks.

Neighborhood Encoders to Graph Embedding: Other methods focus on aggregating or encoding the neighbors’ information to generate node embeddings. DNGR [14] and SDNE [13] introduce autoencoders to construct the similarity function between the neighborhood vectors and the embedding of the target node. DNGR defines neighborhood vectors based on random walks, whereas SDNE uses the adjacency matrix and Laplacian eigenmaps in the definition of neighborhood vectors. GraphWave [40] learns the representation of each node’s neighborhood by leveraging heat wavelet diffusion patterns. Although autoencoders are a great improvement, these methods become prohibitively expensive computationally when the graph scales to millions of nodes. As a result, neighborhood aggregation and convolutional encoders are employed to integrate local aggregation for node embedding, such as GCN [16, 41, 42, 43], FastGCN [44], column networks [45], GraphSAGE [17], and GAT [18]. The recent DRNE method [46] uses a layer-normalized LSTM to approximate the embedding of a target node by aggregating its neighbors’ embeddings, and [47] uses a set function as a universal approximator to distinguish different graphs in graph classification tasks. The main idea of these methods is an iterative or recursive aggregation procedure, e.g., convolutional kernels or pooling procedures, that generates the embedding vectors for all nodes; the aggregation procedure is shared by all nodes in a graph.

The above-mentioned methods differ in how they use neighboring nodes for node embedding. They require predefining a pairwise similarity measure between nodes, specifying random walk parameters, or choosing aggregation functions. In practice, it usually takes considerable effort to tune these parameters or to try different measures, especially when graphs are complicated and contain nodes of multiple types, i.e., heterogeneous graphs. This work hence aims to let neighboring nodes play their roles in the most general manner, such that their contributions are learned rather than user-defined. The resulting embedding method has the flexibility to work on any type of homogeneous or heterogeneous graph. Our proposed method PINE thus naturally avoids any manual manipulation of random walk strategies or hand-designed relationships between different types of nodes.
Set functions: [48] introduces the notion of set functions as a universal approximator for permutation invariant functions on sets but provides only a sketch of the proof. A very recent work [49] further improves the theoretical analysis of approximating invariant maps by neural networks. The notion of partial permutation invariant set functions proposed in this paper is a more generic version of the set function. We find a neater form than [49] even in the special case and also provide rigorous proofs for the representation theorem.
3 The Proposed PINE Framework
In this section, we first formally define the problem and then introduce a new definition: the partial permutation invariant set function. The section ends with the proposed PINE framework, whose key ingredient is the representation theorem for partial permutation invariant set functions.
We aim to design graph embedding models for general graphs that may include several types of nodes (a single node type corresponds to a homogeneous graph and more than one type to a heterogeneous graph). Formally, a graph , where the node set , i.e., is composed of disjoint types of nodes. One instance of such a graph is the academic publication network, which includes different types of nodes for papers, publication venues, author names, author affiliations, research domains, etc. Given such a graph , our goal is to learn the embedding vector for each node in this graph.
We use to denote the representation of node . The node can be represented by its neighbors’ embedding vectors via a function
(1) 
where is a matrix with column vectors corresponding to the embedding of node ’s neighbors in type . could also be the representation vectors associated with node ’s type neighbors. We use to denote the dimension of the embedding vector. Note that the way we have defined the function implies that it is node dependent. For learning to be possible, the embedding functions for different nodes will share common parameters, as will become clear in Section 3.2.
3.1 Partial permutation invariant set functions
An undirected graph can be uniquely determined by defining the set of neighborhoods. Therefore, the key to defining the graph embedding lies in how to model the dependence of each node on its neighbors, that is, what function in (1) to choose. Most existing approaches emphasize (either explicitly or implicitly) only some specific forms of dependence between each node and its neighbors while ignoring other potential dependence.
We propose a universal graph embedding model that does not predefine the form of dependence between each node and its neighbors. It rests on a key observation: all neighboring nodes reachable from a target node are indistinguishable from the view of the target node if they belong to the same type. To formally define functions satisfying this property, we introduce a new notion named the partial permutation invariant set function.
Definition 3.1 (Partial permutation invariant set function). Given where , a continuous real-valued map is partially permutation invariant if
(2) 
for all permutation matrices and .
This definition essentially requires the function value of to be invariant to swapping any two columns of .
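The property in Definition 3.1 can be checked numerically on small inputs. The sketch below is illustrative only: `aggregate`, `order_sensitive`, and the list-of-lists encoding of the neighbor blocks are hypothetical names and choices introduced here, not part of the paper. It enumerates every reordering of the neighbors inside each type and compares the function values.

```python
import itertools
import math


def aggregate(groups):
    """Toy partially permutation invariant map (illustrative only).

    `groups` holds one list of neighbor vectors per node type. Features are
    summed inside each group, so the order within a group cannot matter.
    """
    total = 0.0
    for type_index, columns in enumerate(groups):
        group_sum = sum(math.tanh(sum(col)) for col in columns)
        total += (type_index + 1) * group_sum  # type-specific weight
    return total


def order_sensitive(groups):
    """Counterexample: weights each type-1 neighbor by its position."""
    return sum((pos + 1) * col[0] for pos, col in enumerate(groups[0]))


def is_partially_permutation_invariant(f, groups, tol=1e-9):
    """Brute-force check over all within-group reorderings."""
    reference = f(groups)
    per_group = [list(itertools.permutations(cols)) for cols in groups]
    for choice in itertools.product(*per_group):
        if abs(f([list(cols) for cols in choice]) - reference) > tol:
            return False
    return True


groups = [
    [[0.1, 0.2], [0.3, -0.4], [0.5, 0.0]],  # three neighbors of type 1
    [[1.0, -1.0], [0.2, 0.7]],              # two neighbors of type 2
]
print(is_partially_permutation_invariant(aggregate, groups))
print(is_partially_permutation_invariant(order_sensitive, groups))
```

The first check should report True and the second False. Note that only reorderings within a type are tested; exchanging neighbors across types is allowed to change the value, which is what distinguishes partial from full permutation invariance.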
3.2 PINE: the representation of partial permutation invariant set functions
Unfortunately, such a function is not straightforward to learn because the permutation invariance property is hard to guarantee directly. One natural idea for representing a partial permutation invariant set function is to define it in the following form
(3)  
(4) 
where denotes type neighbors of node and denotes the set of permutation matrices for any , permutes the columns in , and is a properly designed function. It is easy to verify that the function defined in (4) is partial permutation invariant, but it is intractable because the sum ranges over all column permutations, whose number grows factorially with the neighborhood sizes. Our solution for learning the function is therefore based on the following theorem, which gives a neat and general way to represent any partial permutation invariant set function.
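To make the cost of this symmetrization concrete, here is a minimal sketch under assumed names (`phi`, `symmetrize`, and `num_terms` are hypothetical and only illustrate the idea). It sums an order-sensitive function over every within-type reordering, which yields an invariant value, and counts how many terms such a sum contains.

```python
import itertools
import math


def phi(groups):
    """Deliberately order-sensitive: weights each column by its position."""
    return sum((pos + 1) * col[0]
               for cols in groups
               for pos, col in enumerate(cols))


def symmetrize(phi_fn, groups):
    """Sum phi_fn over every within-group reordering of the columns."""
    per_group = [itertools.permutations(cols) for cols in groups]
    return sum(phi_fn([list(cols) for cols in choice])
               for choice in itertools.product(*per_group))


def num_terms(group_sizes):
    """Number of summands: the product of the factorials of the group sizes."""
    return math.prod(math.factorial(n) for n in group_sizes)


groups = [[[1], [2], [3]], [[10], [20]]]
shuffled = [[[3], [1], [2]], [[20], [10]]]
print(phi(groups), phi(shuffled))                # differ: phi depends on order
print(symmetrize(phi, groups), symmetrize(phi, shuffled))  # agree
print(num_terms([3, 2]), num_terms([10, 10]))
```

With three and two neighbors the sum has only 12 terms, but two types with ten neighbors each already require more than 10^13 evaluations, which is why a direct use of this construction is impractical.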
Theorem 3.2 (Representation theorem of partial permutation invariant set functions). Let be a continuous real-valued function defined on a compact set with the following form
where . If function is partial permutation invariant, that is, any permutation of the elements within the group for any does not change the function value, then there exist functions and that approximate with arbitrary precision in the following form
(5) 
The rigorous proof is provided in Appendix A.2. This result suggests a neat but universal way to represent any partial permutation invariant set function. For instance, a popular permutation invariant set function widely used in deep learning can be approximated with an arbitrary precision by
with , and , as long as is large enough. This is because
Theorem 3.2 only establishes the existence of the approximation. To obtain concrete forms of and ’s, one can always use three-layer neural networks to approximate them (to any precision) [50, 51], for example, . Our following theorem shows that we can choose an even simpler and neater form than a three-layer neural network for and to approximate an arbitrary as a whole. More specifically, a two-layer neural network is enough. For simplicity, we consider the case in which the image of or is one-dimensional. The case of a higher-dimensional image follows directly from the one-dimensional case.
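As a numerical illustration of the sum-decomposition form in Theorem 3.2, the sketch below approximates the maximum of a set with an outer function applied to a sum of per-element features. The specific choice used here, an exponential inner function and a scaled logarithm as the outer function (the standard log-sum-exp construction), is an assumption made for illustration and is not claimed to be the exact pair of functions used in the paper; the names `g`, `h`, and the sharpness parameter `a` are likewise hypothetical.

```python
import math


def g(x, a):
    """Assumed per-element feature map."""
    return math.exp(a * x)


def h(z, a):
    """Assumed outer function applied to the pooled sum."""
    return math.log(z) / a


def sum_decomposed_max(xs, a):
    """h(sum_n g(x_n)): order-independent because of the inner sum."""
    return h(sum(g(x, a) for x in xs), a)


xs = [0.3, -1.2, 2.5, 2.4, 0.0]
for a in (1.0, 10.0, 100.0):
    approx = sum_decomposed_max(xs, a)
    # The gap to the true maximum is at most log(len(xs)) / a.
    print(a, approx, approx - max(xs))
```

For this construction the approximation never falls below the true maximum and exceeds it by at most log(N)/a for N elements, so the error shrinks as `a` grows. Very large values of `a` times the inputs would overflow `math.exp`; a production version would subtract the maximum before exponentiating.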
Theorem 3.3.
The functions and in Theorem 3.2 can be chosen in the following form (assuming that the image of is one-dimensional):
where is the element-wise squashing activation function, and we omit the subscript of all hyperparameters of .
Based on Theorem 3.3, we use neural networks with the structure specified by the theorem to approximate the embedding function for a node in (1). In particular, for with a one-dimensional image, it can be cast into the following form:
(6) 
with large enough hyperparameters ’s and ’s and proper real-valued coefficients and .
Generally, for an arbitrary node , we need to specify a function to aggregate the neighborhood information. However, defining a completely different function for each node would overfit the dataset. Hence, we define the function for a node by varying only the range of the summation over neighbors according to the node’s neighborhood size, while reusing the other parameters, such as and , across the entire graph.
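The parameter sharing described above can be sketched as follows. This is a minimal illustration under an assumed parameterization: an inner layer of squashing units whose outputs are summed over the neighbors of each type, followed by an outer layer of squashing units and a linear read-out. The parameter names (`w`, `c`, `a`, `b`, `omega`), the widths, and the use of tanh are assumptions for this sketch rather than the paper's exact formulation.

```python
import math
import random


def sigma(z):
    """Squashing activation (tanh is an assumed choice)."""
    return math.tanh(z)


def make_params(num_types, dim, inner_width, outer_width, seed=0):
    """Random parameters shared by every node in the graph."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(-0.5, 0.5)

    return {
        "w": [[[u() for _ in range(dim)] for _ in range(inner_width)]
              for _ in range(num_types)],
        "c": [[u() for _ in range(inner_width)] for _ in range(num_types)],
        "a": [[[u() for _ in range(inner_width)] for _ in range(num_types)]
              for _ in range(outer_width)],
        "b": [u() for _ in range(outer_width)],
        "omega": [u() for _ in range(outer_width)],
    }


def pine_like_aggregate(params, neighbors_by_type):
    """One output coordinate from a two-layer, sum-pooled aggregator."""
    pooled = []
    for k, neighbors in enumerate(neighbors_by_type):
        feats = [0.0] * len(params["c"][k])
        for x in neighbors:  # only this loop length depends on the node
            for m, (w_km, c_km) in enumerate(zip(params["w"][k], params["c"][k])):
                feats[m] += sigma(sum(wi * xi for wi, xi in zip(w_km, x)) + c_km)
        pooled.append(feats)
    out = 0.0
    for a_t, b_t, omega_t in zip(params["a"], params["b"], params["omega"]):
        pre = b_t + sum(a_tkm * p
                        for a_tk, feats in zip(a_t, pooled)
                        for a_tkm, p in zip(a_tk, feats))
        out += omega_t * sigma(pre)
    return out


params = make_params(num_types=2, dim=3, inner_width=4, outer_width=5)
node_a = [[[0.1, 0.2, 0.3], [0.0, -0.5, 0.4], [0.9, 0.1, -0.2]], [[0.3, 0.3, 0.3]]]
node_b = [[[0.2, 0.2, 0.2]], [[0.5, -0.1, 0.0], [0.1, 0.1, 0.1]]]
print(pine_like_aggregate(params, node_a))
print(pine_like_aggregate(params, node_b))
```

The same `params` object serves both nodes even though their neighborhoods have different sizes, because only the length of the inner sum changes. A full embedding of a given dimension would stack that many such scalar outputs, each with its own outer-layer parameters.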
The objective function for optimizing the neural network parameters depends on the application. As one example, a two-norm cost between the embedding vectors and the embedding output produced by the neural network can be minimized for consistency. This is a joint optimization problem: both the embedding vectors ’s and the embedding functions are jointly optimized. The numerical optimization algorithm and its complexity are similar to those for standard deep neural networks. In a semi-supervised setting, it is also possible to incorporate a supervised component into the objective function; see the numerical examples in Section 4, and specifically (7).
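A consistency cost of this kind can be sketched in a few lines. The example below assumes a squared two-norm penalty summed over nodes, and it uses a placeholder aggregator (`toy_aggregate`, a hypothetical name) in place of a learned embedding function; both choices are assumptions for illustration.

```python
import math


def toy_aggregate(neighbor_vectors):
    """Placeholder permutation invariant aggregator: tanh of the coordinate-wise sum."""
    dim = len(neighbor_vectors[0])
    return [math.tanh(sum(v[i] for v in neighbor_vectors)) for i in range(dim)]


def consistency_loss(embeddings, adjacency, aggregate=toy_aggregate):
    """Sum over nodes of the squared distance between a node's embedding
    and the output of the aggregator applied to its neighbors' embeddings."""
    loss = 0.0
    for node, neighbors in adjacency.items():
        if not neighbors:
            continue  # isolated nodes contribute nothing in this sketch
        predicted = aggregate([embeddings[n] for n in neighbors])
        loss += sum((x - p) ** 2 for x, p in zip(embeddings[node], predicted))
    return loss


adjacency = {0: [1, 2], 1: [0], 2: [0]}
embeddings = {0: [0.2, -0.1], 1: [0.4, 0.0], 2: [-0.3, 0.5]}
print(consistency_loss(embeddings, adjacency))
```

In the joint optimization described above, a gradient-based optimizer would update both the entries of `embeddings` and the aggregator's parameters to reduce this value, possibly together with a supervised term.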
4 Empirical Study
In this section, we evaluate the proposed approach based on the partial permutation invariant set function theorem on several aspects of graph node embedding learning and compare it with state-of-the-art algorithms: (1) to evaluate the applicability of PINE to the embedding problem on general graphs, we conduct experiments on both homogeneous and heterogeneous graphs; (2) we visualize the embedding vectors obtained by PINE in 2D space via t-SNE with respect to the true and predicted labels; and (3) as an ablation study, we investigate the impact of hyperparameters on PINE and show the performance on two datasets, Cora and Wikipedia. More experimental results of the ablation study are provided in Appendix B.
4.1 Evaluation on Homogeneous Graphs
First, we consider the multi-class node classification problem on homogeneous graphs. Given a graph with partially labeled nodes, the goal is to learn a representation of each node for predicting the classes of the unlabeled nodes. Our basic assumption is that the embedding of an arbitrary node in the graph can be computed via PINE with its neighborhood as the input. For short, we denote the PINE embedding function that aggregates neighborhood information for a node by , where contains the embeddings of all neighbors of . Since this section evaluates homogeneous graphs, the number of node types is set to 1. To fulfill the requirements of a specific learning task, we propose an overall learning model with PINE that involves an unsupervised component and a supervised component at the same time
(7) 
The first term of the objective in (7) is the unsupervised learning component, which restricts the representation error between the target node and its neighbors with norm since noise is allowed in a practical graph. The second term is the supervised component, which can be replaced flexibly with any learning task designed on the nodes of a graph. For example, a least squares loss can be chosen to replace for a regression problem, and a cross-entropy loss can be used to formulate a classification problem.

The details of PINE for the multi-class case are as follows: (1) Supervised Component: The softmax function is chosen to formulate the supervised component in (7). For an arbitrary embedding , we have the probability term as for predicting with class , where and are classifier parameters for class , and is the number of classes. Therefore, the supervised component in (7) is formulated as , where is the true label for training, is an regularization for , and is chosen to be ; (2) Unsupervised Embedding Mapping Component: The balance hyperparameter is set to be . We follow the formulation in (6) and , , and . We apply the Adam algorithm to compute effective solutions for all learning variables simultaneously.

Table 1: Statistics of the datasets.
Dataset   Cora   Citeseer  Pubmed  Wikipedia  Emaileu  DBLP              BlogCatalog
#Node     2,708  3,312     19,717  2,405      1,005    +                 55,814 + 5,413
#Edge     5,429  4,732     88,651  17,981     25,571   338,210 + 66,832  1.4M + 619K + 343K
#Classes  7      6         3       17         42       4 (multi-label)   5 (multi-label)
Datasets: We evaluate the performance of PINE and other methods for comparison on five benchmark datasets: Cora [52], Citeseer [53], Pubmed [54], Wikipedia [54], and Emaileu [55]. The details of these five datasets are presented in Table 1.
Baseline methods: To evaluate the learning capability of PINE, we compare it with baseline algorithms listed below:

DeepWalk [8] is an unsupervised graph embedding method that relies on random walks and the word2vec method. For each vertex, we take 80 random walks of length 40 and set the window size to 10. Since DeepWalk is unsupervised, we apply logistic regression to the generated embeddings for node classification.

Node2vec [9] is an improved graph embedding method based on DeepWalk. We set the window size to 10, the walk length to 80, and the number of walks per node to 100. Node2vec is unsupervised as well, so we apply the same evaluation procedure to its embeddings as for DeepWalk.

Struc2vec [35]: we set the window size to 10, the walk length to 80, and the number of walks from each node to 10, and use 5 SGD iterations in total.

GraphWave [40]: we set the heat coefficient to 1000, the number of characteristic functions to 50, the number of Chebyshev approximations to 100, and the number of steps to 20.

WYS (Watch Your Step) [56]: we set the learning rate to 0.2, the highest power of the normalized adjacency matrix to 5, and the regularization coefficient to 0.1, and use the “Log Graph Likelihood” as the objective function.

MMDW [30] is a semi-supervised graph embedding framework that combines matrix decomposition and SVM classification. We tune the method multiple times and set its hyperparameter to 0.01, as recommended by the authors.

Planetoid [37] is a semi-supervised learning framework. We set the batch size to 200, the learning rate to 0.01, and the batch size for the label context loss to 200; we mute the node attributes as input and use softmax for the model output.

GCN (Graph Convolutional Networks) [16] brings convolutional neural networks into semi-supervised graph embedding learning. We remove the node attributes for fairness as well.

GATs (Graph Attention Networks) [18]: we set the learning rate to 0.005, the regularization coefficient to 0.0005, and the number of hidden units to 64. To make the comparison fair, we mute the node attributes in the training of GATs as well.
Experiment setup and results. For a fair comparison, the dimension of the representation vectors is chosen to be the same for all algorithms (the dimension is ). The hyperparameters are fine-tuned for all of them. More details on the experiment environment and the comparison are presented in the Appendix.

In this multi-class classification scenario, we use accuracy as the evaluation criterion. The percentage of labeled nodes is chosen from to and the remaining nodes are used for evaluation. All experiments are repeated five times, and we report the mean and standard deviation of the performance of each graph embedding method in Figure 1. We observe that PINE outperforms the other methods in most cases and, in a few cases, performs second best behind MMDW.

In addition to the accuracy reported for the node classification task, we present a visualization of the node embeddings in 2D space obtained with the t-SNE method. The results on Cora, Citeseer, Emaileu, Pubmed, and Wikipedia are presented in Figure 2. All figures use a 50% ratio of unlabeled nodes for the node classification task. For each dataset, we show the t-SNE results with respect to the true labels and the predicted labels on the test set. As expected, the embedding vectors form clearly separated clusters under t-SNE for both true and predicted labels. The two plots differ only where the predicted labels differ from the true labels.
Then, we examine the hyperparameter sensitivity of PINE. As a case study, we consider the sensitivity to the embedding dimension on the Cora and Wikipedia datasets. The results are shown in Figure 3. For dimensions of 8, 16, 32, and 64, we run the experiments five times with a 50% ratio of unlabeled nodes for the node classification task and compute the mean and standard deviation of the accuracy for GCN, MMDW, and PINE. As shown in the figure, the performance of GCN, MMDW, and PINE rises as the dimension of the node representations increases, and PINE consistently achieves higher performance than the other two.
Exploration under the graph neural network framework. In addition to comparing node classification performance with our overall model in (7), we explore the potential of the partial permutation invariant function within existing graph neural networks (GNNs). Existing GNNs rely on some specific aggregation function to measure the relationship between each target node and its neighborhood. This motivates substituting the neighborhood aggregation component of existing GNNs, e.g., GraphSAGE [17] and GAT [18], with PINE and evaluating its capability for neighborhood aggregation. It is expected that PINE under the GNN framework performs better than the original GraphSAGE [17] and GAT [18]. (1) PPI is a set of protein-protein interaction graphs, which poses an inductive learning problem. We validate PINE by learning from 20 graphs, with another 2 graphs for validation, and then classifying nodes in 2 other graphs into 121 classes. We compare PINE with GraphSAGE and GAT on this task. The results in Table 2 show that PINE has higher classification accuracy on the nodes of the previously unseen graphs, which supports the performance of PINE in the inductive setting. (2) Reddit [17] is a large dataset with 232,965 nodes. We validate PINE on this dataset to test how well it generalizes from old data to new data. We use the first 20 days’ data as the training set and split the rest into validation (30%) and test (70%) sets; each node should be classified as accurately as possible with labels chosen from 50 labels. Table 3 shows that PINE has the best performance on this challenging task. Overall, we conclude that PINE can be a suitable aggregator in the GNN framework without requiring careful design of the aggregation structure.
Table 2: Results on PPI.
Methods  GraphSAGE-pool [17]  PINE   GraphSage [18]  GAT [18]  PINE
Results  0.600                0.637  0.768           0.973     0.985

Table 3: Results on Reddit.
Methods  DeepWalk  DeepWalk+features  GraphSAGE-pool [17]  PINE
Results  0.324     0.691              0.949                0.951
4.2 Comparison on heterogeneous graphs
We next evaluate on heterogeneous graphs, where the learned node embedding vectors are used for multi-label classification. Since multiple types of nodes are present in heterogeneous graphs, we substitute the unsupervised embedding mapping component with
The supervised component in a multi-label setting can be addressed by formulating a set of binary classification problems (one for each label). Therefore, (1) Supervised Component: we apply logistic regression to each instance and each of its labels by letting , where and are the classifier parameters for that label. Then, defining to be the true label for training, the supervised component in (7) is formulated as , where is the number of labels, is the regularization term for , and is chosen as ; (2) Unsupervised Embedding Mapping Component: The balance hyperparameter is set to be [0.2, 200]. The hyperparameter in (6) is set to be .
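The one-classifier-per-label formulation described above can be sketched as a sum of binary cross-entropy terms. In the sketch below, the squared two-norm regularizer and the names `multilabel_loss` and `reg` are assumptions for illustration; the regularization norm and coefficient used in the paper are not reproduced here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_loss(x, labels, weights, biases, reg=0.0):
    """Binary cross-entropy summed over labels for one embedding `x`.

    `labels[j]` is 0 or 1; `weights[j]` and `biases[j]` parameterize the
    logistic classifier of label j. `reg` scales an assumed squared
    two-norm penalty on the weights.
    """
    loss = 0.0
    for y, w, b in zip(labels, weights, biases):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    loss += reg * sum(wi * wi for w in weights for wi in w)
    return loss


x = [0.5, -1.0, 0.25]
labels = [1, 0, 0, 1]
zero_w = [[0.0, 0.0, 0.0] for _ in labels]
zero_b = [0.0 for _ in labels]
print(multilabel_loss(x, labels, zero_w, zero_b))
```

With all-zero classifier parameters every label receives probability 0.5, so the loss equals the number of labels times log 2, which is a convenient sanity check before training.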
Datasets: We use two datasets. DBLP [57] is an academic community network. We obtain a subset of the large network with two types of nodes: authors and keywords from the authors’ publications. The generated subgraph includes (authors) + (keywords) vertices. A link between a pair of authors indicates a coauthor relationship, and a link between an author and a word means that the word belongs to at least one publication of this author. There are 66,832 edges between pairs of authors and 338,210 edges between authors and words. Each node can have multiple labels out of four. BlogCatalog [58] is a social media network with 55,814 users, who are classified into multiple overlapping groups according to their interests. We take the five largest groups to evaluate the performance of the methods. Users and tags are the two types of nodes. The 5,413 tags are generated by users as keywords for their blogs. Tags are therefore shared among different users and are also connected to one another, since some tags are generated from the same blogs. The numbers of edges between users, between tags, and between users and tags are about 1.4M, 619K, and 343K, respectively. Because of the multi-label classification setting, each user may be associated with several of the five labels.
Baseline Methods: To assess the performance of PINE on heterogeneous graphs, we conduct the experiments in two stages: (1) comparing PINE with DeepWalk [8] and node2vec [9] on the graphs by treating all nodes as the same type (PINE with in a homogeneous setting); (2) comparing PINE with the state-of-the-art heterogeneous graph embedding method, metapath2vec [10], in a heterogeneous setting. The hyperparameters of the method are fine-tuned, and metapath2vec++ is chosen as the option for the comparison.
Experiment Setup and Results: We conduct the experiments on both DBLP and BlogCatalog and compare the performance of all methods mentioned above. Since this is a multi-label classification task, we take the F1-score (macro and micro) as the evaluation metric. The users in BlogCatalog and the authors in DBLP are the classification targets. We vary the ratio of labeled nodes from 10% to 90%, repeat all experiments five times, and report the mean and standard deviation of the performance in Figure 4.

We observe that, in most cases, PINE in the heterogeneous setting has the best performance. PINE in the homogeneous setting is better than DeepWalk and node2vec in the same homogeneous setting, and is even better than metapath2vec++ in the heterogeneous setting (achieving the second-best results). Overall, the superior performance of PINE in Figures 1 and 4 demonstrates the validity of our proposed universal graph embedding mechanism.
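For reference, the micro- and macro-averaged F1 scores used as metrics in this section can be computed from per-label counts as in the sketch below. The function names and the convention of scoring 0.0 when a label has no true or predicted positives are assumptions of this sketch, not details taken from the experiments.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for binary indicator label matrices."""
    num_labels = len(y_true[0])
    counts = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1_from_counts(*c) for c in counts) / num_labels
    # Micro: pool the counts over labels first, so frequent labels dominate.
    micro = f1_from_counts(sum(c[0] for c in counts),
                           sum(c[1] for c in counts),
                           sum(c[2] for c in counts))
    return micro, macro


y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [0, 1], [0, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

In this small example the first label scores 0.8 and the second 0.5, giving a macro-F1 of 0.65, while pooling the counts gives a micro-F1 of 2/3. Reporting both is informative when label frequencies are unbalanced.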
5 Conclusion and Future Work
In summary, we propose PINE, a general graph embedding solution built on the novel notion of partial permutation invariant set functions, which in principle can capture arbitrary dependence among neighbors and automatically decide the significance of neighbor nodes at different distances for both homogeneous and heterogeneous graphs. We provide a theoretical guarantee for the effectiveness of the whole model. Through extensive experimental evaluation, we show that PINE offers better performance on both homogeneous and heterogeneous graphs than state-of-the-art algorithms based on stochastic trajectories, matrix analytics, and graph neural networks. In future work, our model can be extended to more general cases, e.g., by incorporating rich content information beyond graph neighborhood structures.
References
 [1] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” KnowledgeBased Systems, vol. 151, pp. 78–94, 2018.
 [2] H. Cai, V. W. Zheng, and K. C. Chang, “A comprehensive survey of graph embedding: Problems, techniques and applications,” CoRR, vol. abs/1709.07604, 2017.
 [3] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” in Social network data analytics. Springer, 2011, pp. 115–148.
 [4] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” Journal of the Association for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
 [5] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.
 [6] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, “Personalized entity recommendation: A heterogeneous information network approach,” in Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 2014, pp. 283–292.
 [7] W. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” CoRR, vol. abs/1709.05584, 2017.
 [8] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in KDD ’14, ACM. New York, NY, USA: ACM, 2014, pp. 701–710.
 [9] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in KDD ’16, ACM. New York, NY, USA: ACM, 2016, pp. 855–864.
 [10] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in KDD ’17, ACM. New York, NY, USA: ACM, 2017, pp. 135–144.
 [11] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim, “Continuous-time dynamic network embeddings,” in Companion Proceedings of the The Web Conference 2018, ser. WWW ’18. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2018, pp. 969–976.
 [12] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, and W. Zhu, “Arbitrary-order proximity preserved network embedding,” in KDD ’18. New York, NY, USA: ACM, 2018, pp. 2778–2786.
 [13] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in KDD ’16. New York, NY, USA: ACM, 2016, pp. 1225–1234.
 [14] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16, 2016, pp. 1145–1152.
 [15] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in ICLR, 2018.
 [16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [17] W. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS ’17, 2017, pp. 1024–1034.
 [18] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, vol. 1, no. 2, 2017.
 [19] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li, “Attention-based graph neural network for semi-supervised learning,” arXiv preprint arXiv:1803.03735, 2018.
 [20] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in ICML 2019, 2019, pp. 6861–6871.
 [21] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 1067–1077.
 [22] C. Yang, M. Sun, Z. Liu, and C. Tu, “Fast network embedding enhancement via high order proximity approximation,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, 2017, pp. 19–25.
 [23] D. Zhu, P. Cui, D. Wang, and W. Zhu, “Deep variational network embedding in wasserstein space,” in KDD ’18. New York, NY, USA: ACM, 2018, pp. 2827–2836.
 [24] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A survey,” CoRR, vol. abs/1705.02801, 2017.
 [25] L. Tang and H. Liu, “Leveraging social media networks for classification,” Data Mining and Knowledge Discovery, vol. 23, no. 3, pp. 447–478, 2011.
 [26] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ser. CIKM ’15, 2015, pp. 891–900.
 [27] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD ’16. New York, NY, USA: ACM, 2016, pp. 1105–1114.
 [28] R. A. Rossi, N. K. Ahmed, and E. Koh, “Higher-order network representation learning,” in Companion of the The Web Conference 2018 on The Web Conference 2018, 2018, pp. 3–4.
 [29] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Network representation learning with rich text information.” in IJCAI, 2015, pp. 2111–2117.
 [30] C. Tu, W. Zhang, Z. Liu, and M. Sun, “Max-margin deepwalk: Discriminative learning of network representation.” in IJCAI, 2016, pp. 3889–3895.
 [31] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel, “Lanczosnet: Multi-scale deep graph convolutional networks,” arXiv preprint arXiv:1901.01484, 2019.
 [32] X. Huang, J. Li, and X. Hu, “Label informed attributed network embedding,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017, pp. 731–739.
 [33] L. Lovász, “Random walks on graphs,” Combinatorics, Paul Erdős is Eighty, vol. 2, no. 146, p. 4, 1993.
 [34] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
 [35] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo, “struc2vec: Learning node representations from structural identity,” in KDD ’17, ACM. New York, NY, USA: ACM, 2017, pp. 385–394.
 [36] H. Gao and H. Huang, “Self-paced network embedding,” in KDD ’18. New York, NY, USA: ACM, 2018, pp. 1406–1415.
 [37] Z. Yang, W. W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, pp. 40–48.
 [38] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Heterogeneous network embedding via deep architectures,” in KDD ’15, ACM. New York, NY, USA: ACM, 2015, pp. 119–128.
 [39] J. Tang, M. Qu, and Q. Mei, “Pte: Predictive text embedding through large-scale heterogeneous text networks,” in KDD ’15, ACM. New York, NY, USA: ACM, 2015, pp. 1165–1174.
 [40] C. Donnat, M. Zitnik, D. Hallac, and J. Leskovec, “Learning structural node embeddings via diffusion wavelets,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1320–1329.
 [41] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
 [42] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in European Semantic Web Conference. Springer, 2018, pp. 593–607.
 [43] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” stat, vol. 1050, p. 7, 2017.
 [44] J. Chen, T. Ma, and C. Xiao, “FastGCN: Fast learning with graph convolutional networks via importance sampling,” in ICLR, 2018.
 [45] T. Pham, T. Tran, D. Q. Phung, and S. Venkatesh, “Column networks for collective classification.” in AAAI, 2017, pp. 2485–2491.
 [46] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu, “Deep recursive network embedding with regular equivalence,” in KDD ’18. New York, NY, USA: ACM, 2018, pp. 2357–2366.
 [47] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
 [48] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in Neural Information Processing Systems, 2017, pp. 3394–3404.
 [49] D. Yarotsky, “Universal approximations of invariant maps by neural networks,” arXiv preprint arXiv:1804.10306, 2018.
 [50] G. Cybenko, “Approximations by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, pp. 183–192, 1989.
 [51] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural networks, vol. 4, no. 2, pp. 251–257, 1991.
 [52] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automating the construction of internet portals with machine learning,” Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
 [53] C. L. Giles, K. D. Bollacker, and S. Lawrence, “Citeseer: An automatic citation indexing system,” in Proceedings of the third ACM conference on Digital libraries. ACM, 1998, pp. 89–98.
 [54] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI magazine, vol. 29, no. 3, p. 93, 2008.
 [55] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graph evolution: Densification and shrinking diameters,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 2, 2007.
 [56] S. Abu-El-Haija, B. Perozzi, R. Al-Rfou, and A. Alemi, “Watch your step: Learning node embeddings via graph attention,” in NeurIPS ’18, 2018.
 [57] M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, “Graph regularized transductive classification on heterogeneous information networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 570–586.
 [58] X. Wang, L. Tang, H. Gao, and H. Liu, “Discovering overlapping groups in social media,” in the 10th IEEE International Conference on Data Mining series (ICDM2010), Sydney, Australia, December 14–17, 2010.
 [59] H. Weyl, The classical groups: their invariants and representations. Princeton University Press, 1946.
 [60] M. H. Stone, “Applications of the theory of boolean rings to general topology,” Transactions of the American Mathematical Society, vol. 41, no. 3, pp. 375–481, 1937.
 [61] ——, “The generalized weierstrass approximation theorem,” Mathematics Magazine, vol. 21, no. 5, pp. 237–254, 1948.
 [62] H. Kraft and C. Procesi, Classical Invariant Theory, a Primer. Lecture notes, 2000.
 [63] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of control, signals and systems, vol. 2, no. 4, pp. 303–314, 1989.
Appendix A Partial permutation invariant maps
A.1 Definitions: Permutation Invariant Maps and Polynomials
Definition A.1.
[Symmetric group ] Given an index set , the set of all one-to-one mappings forms the symmetric group with function composition as the group operation.
For brevity, we will denote . And its permutation is denoted as for an arbitrary , where , is the index at position after permutation and is the th column of . The defined here is equivalent to the permutation matrix notation we used in Definition 3.1. As the proofs are based on symmetric group action, we also give the definition of permutation invariant map and partially permutation invariant map based on symmetric group action in the following.
Definition A.2.
[Permutation invariant map or invariant map] A continuous real valued map is permutation invariant if
(8) 
for all and all .
Definition A.3.
[Partially permutation invariant map] Given a series of symmetric groups , and where , a continuous real valued map is partially permutation invariant if
(9) 
for all and all .
We call a polynomial that is a permutation invariant map a permutation invariant polynomial. When the symmetric group involved matters, we will use terms like invariant polynomial. Similarly, a polynomial that is a partially permutation invariant map will be called a partially (permutation) invariant polynomial.
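The two definitions above can be checked numerically. The displayed conditions (8) and (9) are not reproduced here, so the following is only a minimal sketch under the assumption, taken from the remark in A.2.1 that the group only permutes columns, that each block is a matrix whose columns may be reordered independently of the other blocks. The function `f`, the block shapes, and the specific permutations are hypothetical choices made for illustration.

```python
import numpy as np

def f(X1, X2):
    # Hypothetical test function: symmetric in the columns of each block,
    # but the two blocks play different roles (tanh vs. square).
    return np.tanh(X1).sum() + 2.0 * (X2 ** 2).sum() + X1.sum() * X2.sum()

rng = np.random.default_rng(0)
X1 = rng.normal(size=(3, 4))  # block 1: four columns
X2 = rng.normal(size=(3, 5))  # block 2: five columns

base = f(X1, X2)

# Reorder the columns inside each block independently.
permuted = f(X1[:, [2, 0, 3, 1]], X2[:, [4, 3, 2, 1, 0]])
is_partially_invariant = bool(np.isclose(base, permuted))

# Exchange one column between the two blocks: this is not an allowed
# partial permutation, and the value generally changes.
Y1, Y2 = X1.copy(), X2.copy()
Y1[:, 0], Y2[:, 0] = X2[:, 0].copy(), X1[:, 0].copy()
changes_across_blocks = not bool(np.isclose(base, f(Y1, Y2)))
```

The second check illustrates why partial invariance is weaker than full invariance: the function need not be symmetric under exchanges between different blocks.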
A.2 Proof of Theorem 3.2
A sketch of the proof ideas is as follows. The Stone-Weierstrass theorem states that polynomials are dense in the space of continuous functions on a compact Hausdorff space. Based on this result, we first show in Lemma A.5 that partially permutation invariant polynomials are dense in the space of partially invariant continuous functions. Next, our main idea is to find a finite generating set for the partially invariant functions so that any partially invariant function can be represented as a polynomial of that generating set. There is a well-known generating set for permutation invariant polynomials : the power sum polynomials. Polarization is introduced to extend the generating set to the case of .
A.2.1 Polarization
The goal of polarization is to represent a polynomial invariant of a higher dimension in terms of invariant polynomials of a lower dimension. The following lemma provides a representation of a matrix-argument invariant polynomial in terms of vector-argument invariant polynomials.
The group only permutes the columns of . So given two spaces and , and a linear mapping
(10) 
commutes with permutation; i.e., . If is a permutation invariant on then is a permutation invariant on . In the following lemma, we take .
Lemma A.4.
[Weyl’s Polarization] For any polynomial invariant on , there exist a series of vectors and a series of invariant polynomials on , such that can be represented by
A.2.2 Partially Permutation Invariant Polynomials
The following lemma shows that it is sufficient to use partially invariant polynomials to approximate partially invariant functions.
Lemma A.5.
[Denseness of Partially Invariant Polynomials] For any partially invariant function on where and for any , there exists a partially invariant polynomial such that for all .
Proof.
By the Stone-Weierstrass theorem [60, 61], polynomials on a compact Hausdorff space are dense in the space of continuous functions on that same space. So for any partially invariant function on and any , there exists a polynomial on such that for all .
Let , and . Construct
(11) 
which is a partially invariant polynomial. We have
The function in (11) thus fulfills the requirement of the lemma. ∎
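The symmetrization step in this proof can be illustrated with a small numerical sketch. Equation (11) is not reproduced here, so the code assumes the standard construction: average the polynomial over every allowed column permutation of every block. The polynomial `p`, the block sizes, and the helper name `symmetrize` are hypothetical. Under this assumption the usual argument applies: if the original polynomial is within a given error of a partially invariant function everywhere, then so is each permuted copy, and hence so is their average.

```python
import itertools
import numpy as np

def p(X1, X2):
    # Hypothetical polynomial that is NOT invariant: it singles out entries.
    return X1[0, 0] * X2[1, 2] + X1[1, 1] ** 2 - 3.0 * X2[0, 0]

def symmetrize(poly, n1, n2):
    # Average poly over all column permutations of block 1 and of block 2.
    perms1 = [list(s) for s in itertools.permutations(range(n1))]
    perms2 = [list(s) for s in itertools.permutations(range(n2))]

    def averaged(X1, X2):
        total = 0.0
        for s1 in perms1:
            for s2 in perms2:
                total += poly(X1[:, s1], X2[:, s2])
        return total / (len(perms1) * len(perms2))

    return averaged

rng = np.random.default_rng(1)
X1 = rng.normal(size=(2, 3))
X2 = rng.normal(size=(2, 3))

p_sym = symmetrize(p, 3, 3)
value = p_sym(X1, X2)
value_permuted = p_sym(X1[:, [2, 0, 1]], X2[:, [1, 0, 2]])

raw = p(X1, X2)
raw_permuted = p(X1[:, [2, 0, 1]], X2[:, [1, 0, 2]])
```

The averaged function returns the same value on permuted inputs, while the raw polynomial generally does not. The cost grows factorially with the block sizes, so this is a proof device rather than a practical algorithm.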
The following lemma gives one form of explicit expansion for partially invariant polynomials.
Lemma A.6.
Any partially invariant polynomial on , where , can be expressed in the following form:
(12) 
where is an integer, and are invariant.
Proof.
Since is a polynomial, it is possible to write as
where is a suitable integer depending on the degree of , and are polynomials on .
Define the symmetrized versions of as follows:
(13) 
As is partially invariant, its value does not change if we perform (partial) symmetrization on . Therefore,
where the last step follows by exchanging the order of summations, and distributing the symmetrization sums to their respective functions. ∎
Lemma A.7.
[Hilbert’s finiteness theorem, e.g., [62]] There exist finitely many invariant polynomials such that any invariant polynomial can be expressed as
(14) 
with some polynomial of variables.
Lemma A.8.
[Power sums as generating set] One generating set of symmetric polynomials on is power sums up to degree :
(15) 
where is the th entry of .
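As a small worked example of Lemma A.8, the sketch below computes the power sums of a three-entry vector and recovers two other symmetric polynomials from them through Newton's identities. The function name `power_sums` and the example vector are illustrative choices, not notation from the paper.

```python
import numpy as np

def power_sums(x, max_degree=None):
    # Returns [sum(x^1), sum(x^2), ..., sum(x^n)], with n = len(x) by default.
    x = np.asarray(x, dtype=float)
    n = x.size if max_degree is None else max_degree
    return np.array([np.sum(x ** k) for k in range(1, n + 1)])

x = np.array([1.0, 2.0, 3.0])
p1, p2, p3 = power_sums(x)  # 6, 14, 36

# Newton's identities express the elementary symmetric polynomials
# as polynomials in the power sums.
e2 = (p1 ** 2 - p2) / 2                      # x1*x2 + x1*x3 + x2*x3 = 11
e3 = (p1 ** 3 - 3 * p1 * p2 + 2 * p3) / 6    # x1*x2*x3 = 6

# Power sums do not depend on the order of the entries.
same_after_reordering = bool(np.allclose(power_sums(x), power_sums(x[::-1])))
```

Because every symmetric polynomial is a polynomial in the elementary symmetric polynomials, this is what makes the power sums a generating set.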
A.2.3 Proof of Theorem 3.2
Proof.
By Lemma A.5, any partially invariant function can be approximated by a partially invariant polynomial, which in turn can be written in the form of (12), due to Lemma A.6. Using Lemma A.4, each (fully) invariant polynomial term can be expressed as follows:
(16) 
where is an invariant polynomial.
Based on Hilbert’s finiteness theorem (Lemma A.7) and the power-sum generating set result (Lemma A.8), each function is expressible as a polynomial of the following variables:
(17) 
Let denote the matrix whose ’s column is , . Define the power-sum vector function
(18) 
and the function
(19) 
It then follows that is a polynomial of
(20) 
Let , and . Recalling Lemma A.6, we establish that the function can be approximated arbitrarily well by a function of the form
(21) 
where is a polynomial. The vector version of the result follows from the scalar version. In the statement of the theorem, we have removed the explicit parameters ’s. ∎
A.3 Partially Permutation Invariant Neural Network: Proof of Theorem 3.3
The main idea of the proof is the following: since a neural network is a universal approximator, we use neural networks with one hidden layer to approximate in (21) and the power functions. We then obtain an approximator of partially permutation invariant functions in the form of a structured neural network.
Proof.
We see that (21) can approximate any partially permutation invariant polynomial on . By the universal approximation theorem of neural networks [63], we can approximate the polynomial with a shallow (one-hidden-layer) neural network. Any partially permutation invariant function can be approximated as
(22) 
where , and we have combined the double indices and of in (17) into a single index for a fixed .
It is clear that the function in (22) is partially permutation invariant as all the are treated the same. We can approximate by . Combining the two neural network approximators, it follows that the functions and in Theorem 3.2 can be chosen in the following form:
where we have omitted the index on and used to denote for simplicity. ∎
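The structure suggested by this proof can be sketched as a forward pass: a shared one-hidden-layer network is applied to every column of a block, the outputs are summed within the block, and an outer one-hidden-layer network combines the pooled vectors. The exact layer form of Theorem 3.3 is not reproduced here, so everything below is an assumption made for illustration: the names `init_params` and `forward`, the use of tanh as the sigmoidal activation, the layer sizes, and the omission of any separate input for the center node.

```python
import numpy as np

def init_params(rng, d, hidden, m, num_groups, out_hidden):
    # One inner network per neighbor group, shared by all columns of that group.
    inner = [
        (rng.normal(size=(hidden, d)), rng.normal(size=hidden), rng.normal(size=(m, hidden)))
        for _ in range(num_groups)
    ]
    # One outer network acting on the concatenated pooled features.
    outer = (
        rng.normal(size=(out_hidden, num_groups * m)),
        rng.normal(size=out_hidden),
        rng.normal(size=out_hidden),
    )
    return inner, outer

def forward(params, groups):
    inner, outer = params
    pooled = []
    for (W, b, A), X in zip(inner, groups):
        H = np.tanh(W @ X + b[:, None])      # shape (hidden, N_k): same weights for every column
        pooled.append((A @ H).sum(axis=1))   # sum over columns removes their order
    z = np.concatenate(pooled)
    U, c, a = outer
    return float(a @ np.tanh(U @ z + c))

rng = np.random.default_rng(2)
params = init_params(rng, d=3, hidden=8, m=4, num_groups=2, out_hidden=6)
X1 = rng.normal(size=(3, 4))
X2 = rng.normal(size=(3, 5))

out = forward(params, [X1, X2])
out_permuted = forward(params, [X1[:, [3, 1, 0, 2]], X2[:, [1, 4, 0, 2, 3]]])
```

Summation is the only place where the columns of a block interact, which is why the output is unchanged when columns are reordered within a block, in line with the remark that all columns are treated the same.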
Appendix B Additional Experiment Setups
B.1 Configurations of Hardware and Software
PINE is implemented with the PyTorch and TensorFlow frameworks, versions 1.1.0 and 1.8.0 respectively (Python 3). All experiments are conducted on a Linux 18.04 machine. The machine has one Core i7-6700K CPU, 64 GB RAM, 512 GB + 2 TB hard disks, and two GTX 1080 graphics cards.
B.2 Train-Test Splits
For all homogeneous datasets, we repeat the training and test splits 5 times, using the same splits for all methods. In each round, we randomly sample a ratio of nodes, from 10% to 90%, as the training set. All the other nodes form the test set, on which we evaluate the classification performance of all the methods. For the heterogeneous case, we only split the author nodes into training and test sets, since only author nodes carry labels in those two datasets. We follow the same split strategy as in the homogeneous case on the heterogeneous graphs as well.
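The split protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the fixed seed, the rounding rule for the training-set size, and the 100-node example are all assumptions.

```python
import numpy as np

def repeated_splits(num_nodes, ratios, repeats=5, seed=0):
    # Yields (ratio, run, train_idx, test_idx); a fixed seed lets every
    # method be evaluated on exactly the same splits.
    rng = np.random.default_rng(seed)
    for ratio in ratios:
        for run in range(repeats):
            order = rng.permutation(num_nodes)
            n_train = int(round(ratio * num_nodes))
            yield ratio, run, order[:n_train], order[n_train:]

def summarize(scores_by_ratio):
    # Mean and standard deviation over the repeated runs for each ratio.
    return {r: (float(np.mean(s)), float(np.std(s))) for r, s in scores_by_ratio.items()}

ratios = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
splits = list(repeated_splits(num_nodes=100, ratios=ratios))
```

For a heterogeneous graph, `num_nodes` would count only the labeled node type (the author nodes), since the other node types carry no labels.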