The code of ''Learning the Implicit Semantic Representation on Graph-Structured Data'' in DASFAA2021.
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole, while the implicit semantic associations behind highly complex interactions of graphs are largely unexploited. In this paper, we propose Semantic Graph Convolutional Networks (SGCN), which explore the implicit semantics by learning latent semantic-paths in graphs. Previous work has explored graph semantics via meta-paths. However, these methods mainly rely on explicit heterogeneous information that is hard to obtain for a large amount of graph-structured data. SGCN is the first to break through this restriction by leveraging the semantic-paths dynamically and automatically during the node aggregating process. To evaluate our idea, we conduct extensive experiments on several standard datasets, and the empirical results show the superior performance of our model.
The representations of objects (nodes) in large graph-structured data, such as social or biological networks, have proven extremely effective as feature inputs for graph analysis tasks. Recently, there have been many attempts in the literature to extend neural networks to representation learning on graphs, such as Graph Convolutional Networks (GCN), GraphSAGE and Graph Attention Networks (GAT).
Despite their enormous success, previous graph neural networks mainly learn representations by describing the neighborhood as a perceptual whole, and they have not gone deep into the exploration of semantic information in graphs. Taking a movie network as an example, the paths based on the composite relations “Movie-Actor-Movie” and “Movie-Director-Movie” may reveal two different semantic patterns, i.e., the two movies share the same actor or the same director. Here a semantic pattern is defined as the specific knowledge expressed by the corresponding path. Although several researchers [35, 30] attempt to capture these graph semantics of composite relations between two objects by meta-paths, existing work relies on given heterogeneous information such as different types of objects and distinct object connections. However, in the real world, a large amount of graph-structured data does not have such explicit characteristics. As shown in Figure 1, in a scholar cooperation network, there are usually no explicit node (relation) types and all nodes are connected through the same relation, i.e., “Co-author”. Fortunately, behind the same relation, there are various implicit factors which may express different connecting reasons, such as “Classmate” and “Colleague” for the same relation “Co-author”. These factors can further compose diverse semantic-paths (e.g., “Student-Advisor-Student” and “Advisor-Student-Advisor”), which reveal sophisticated semantic associations and help to generate more informative representations. How to automatically exploit comprehensive semantic patterns based on the implicit factors behind a general graph is therefore a non-trivial problem.
In general, there are several challenges in solving this problem. Firstly, it is essential to adaptively infer the latent factors behind graphs. We notice that several studies have begun to explore the desired latent factors behind a graph by disentangled representations [20, 18]. However, they mainly focus on inferring the latent factors by disentangled representation learning while failing to discriminatively model the independent implicit factors behind the same connections. Secondly, after discovering the latent factors, how to select the most meaningful semantics and aggregate the diverse semantic information remains largely unexplored. Last but not least, further exploiting the implicit semantic patterns while remaining capable of inductive learning is quite difficult.
To address the above challenges, in this paper, we propose a novel Semantic Graph Convolutional Networks (SGCN) model, which sheds light on the exploration of implicit semantics in the node aggregating process. Specifically, we first propose a latent factor routing method with the DisenConv layer to adaptively infer the probability of each latent factor that may have caused the link from a given node to one of its neighbors. Then, to further explore the diverse semantic information, we transfer the probability between every two connected nodes to the corresponding semantic adjacency matrix, which can represent the semantic-paths in a graph. Afterwards, most semantic enhancement methods, such as the semantic-level attention module, can be easily integrated into our model to aggregate the diverse semantic information from these semantic-paths. Finally, to encourage the independence of the implicit semantic factors and to support inductive learning, we design an effective joint loss function that maintains independent mapping channels for different factors. This loss function is able to focus on different semantic characteristics during the training process.
Specifically, the contributions of this paper can be summarized as follows:
We are the first to break the heterogeneous restriction of semantic representations with an end-to-end framework. It automatically infers the independent factor behind the formation of each edge and explores the semantic associations of latent factors behind a graph.
We propose a novel Semantic Graph Convolutional Networks (SGCN) model to learn node representations by aggregating the implicit semantics from graph-structured data.
We conduct extensive experiments on various real-world graph datasets to evaluate the performance of the proposed model. The results show the superiority of our proposed model over many powerful models.
Graph neural networks (GNNs) [10, 26], especially graph convolutional networks, have proven successful in modeling structured graph data due to their theoretical elegance. They have made new breakthroughs in various tasks, such as node classification and graph classification. In the early days, graph spectral theory was used to derive a graph convolutional layer. Then, polynomial spectral filters greatly reduced the computational cost. Kipf and Welling further simplified the approach by proposing a linear filter. Along with spectral graph convolution, directly performing graph convolution in the spatial domain was also investigated by many researchers [8, 12]. Among them, graph attention networks have aroused considerable research interest, since they adaptively assign weights to the neighbors of a node by an attention mechanism [1, 37].
In semantic learning research, several studies have explored a kind of semantic-path called the meta-path in heterogeneous graph embedding to preserve structural information. ESim learned node representations by searching the user-defined embedding space. Based on random walks, meta-path2vec utilized skip-gram to model semantic-paths. HERec proposed a type constraint strategy to filter the node sequence and captured the complex semantics reflected in heterogeneous graphs. Then, Fan et al. suggested a meta-graph2vec model for malware detection, where both the structures and semantics are preserved. Sun et al. proposed meta-graph-based network embedding models, which simultaneously consider the hidden relations of all meta information of a meta-graph. Meanwhile, there were other influential semantic learning approaches. For instance, many models [4, 17, 25] were applied to various fields because of their latent semantic analysis ability.
In heterogeneous graphs, two objects can be connected via different semantic-paths, which are called meta-paths. This relies on the graph having different types of nodes and relations. A meta-path is defined as a path in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$ (abbreviated as $A_1 A_2 \cdots A_{l+1}$); it describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$, where $\circ$ denotes the composition operator on relations. In a homogeneous graph, the relationships between nodes are also generated for different reasons (latent factors), so we can implicitly construct various types of relationships and extract semantic-paths that correspond to different semantic patterns, thereby improving the performance of the GCN model from the perspective of semantic discovery.
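To make the composition of relations concrete, the following is a minimal sketch in plain Python of how a "Movie-Actor-Movie" meta-path can be obtained by composing a relation with its inverse. The toy data and the helper names `compose` and `transpose` are illustrative assumptions, not part of the original method.

```python
def compose(rel_ab, rel_bc):
    """Compose two relations given as 0/1 matrices (lists of lists).

    Entry (i, j) of the result is 1 when some intermediate object m
    satisfies rel_ab[i][m] == 1 and rel_bc[m][j] == 1.
    """
    inner, cols = len(rel_bc), len(rel_bc[0])
    return [[int(any(row[m] and rel_bc[m][j] for m in range(inner)))
             for j in range(cols)] for row in rel_ab]


def transpose(rel):
    """Inverse relation, e.g. Movie-Actor becomes Actor-Movie."""
    return [list(col) for col in zip(*rel)]


# Hypothetical toy data: 3 movies (rows) and 2 actors (columns).
movie_actor = [[1, 0],
               [1, 1],
               [0, 1]]

# Movie-Actor-Movie: two movies are linked when they share an actor.
movie_actor_movie = compose(movie_actor, transpose(movie_actor))
print(movie_actor_movie)
```

Movies 0 and 2 share no actor in this toy data, so they are not connected by the composite relation, while movie 1 is connected to both.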
In this section, we introduce the Semantic Graph Convolutional Networks (SGCN). We first present the notations, then describe the overall network progressively.
We focus primarily on undirected graphs, and it is straightforward to extend our approach to directed graphs. We define $G = (V, E)$ as a graph, comprised of the node set $V$ and the edge set $E$, and $|V| = N$ denotes the number of nodes. Each node $u \in V$ has a feature vector $\mathbf{x}_u$. We use $(u, v) \in E$ to indicate that there is an edge between node $u$ and node $v$. Most graph convolutional networks can be regarded as an aggregation function $f(\cdot)$ that outputs the representations of nodes when given the features of each node and its neighbors:
$$\mathbf{y}_u = f\left(\mathbf{x}_u, \{\mathbf{x}_v : (u, v) \in E\}\right),$$
where the output $\mathbf{y}_u$ denotes the representation of node $u$. It means that the neighborhood of a node contains rich information, which can be aggregated to describe the node more comprehensively. Different from previous studies [15, 12, 34], in our work the proposed $f(\cdot)$ automatically learns the semantic-paths from graph data to explore the corresponding semantic patterns.
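As an illustration of the general aggregation view described here, the following is a minimal sketch in plain Python. It assumes mean aggregation over a node and its neighbors, which is one simple choice rather than the function used by SGCN; the names `aggregate`, `features` and `neighbors` are hypothetical.

```python
def aggregate(features, neighbors):
    """Return one output vector per node.

    features[u] is the feature vector of node u and neighbors[u] lists the
    ids of its neighbors. Each output is the element-wise mean of the node's
    own features and those of its neighbors (an assumed, simple aggregator).
    """
    outputs = []
    for u, x_u in enumerate(features):
        group = [x_u] + [features[v] for v in neighbors[u]]
        outputs.append([sum(col) / len(group) for col in zip(*group)])
    return outputs


# Hypothetical toy graph with a single undirected edge 0 - 1.
features = [[1.0, 0.0], [3.0, 2.0]]
neighbors = [[1], [0]]
print(aggregate(features, neighbors))
```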
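Here we introduce the disentangled algorithm that calculates the latent factors between every two objects. We assume that each node is composed of $K$ independent components, hence there are $K$ latent factors to be disentangled. For a node $u \in V$, the hidden representation of $u$ is $\mathbf{h}_u = [\mathbf{z}_{u,1}, \mathbf{z}_{u,2}, \ldots, \mathbf{z}_{u,K}]$, where $\mathbf{z}_{u,k}$ denotes the aspect of node $u$ that is pertinent to the $k$-th disentangled factor.
In the initial stage, we project its feature vector $\mathbf{x}_u$ into $K$ different subspaces:
$$\mathbf{z}_{u,k} = \frac{\sigma(\mathbf{W}_k \mathbf{x}_u + \mathbf{b}_k)}{\lVert \sigma(\mathbf{W}_k \mathbf{x}_u + \mathbf{b}_k) \rVert_2}, \quad (1)$$
where $\mathbf{W}_k$ and $\mathbf{b}_k$ are the mapping parameters and bias of the $k$-th subspace, and the nonlinear activation function is $\sigma(\cdot)$. To capture aspect $k$ of node $u$ comprehensively, we construct $\mathbf{z}_{u,k}$ from both $\mathbf{x}_u$ and $\{\mathbf{x}_v : (u, v) \in E\}$, which can be utilized to identify the latent factors. Here we learn the probability of each factor by leveraging the neighborhood routing mechanism [20, 18], which is a DisenConv layer:
$$p^{k,t}_{u,v} = \frac{\exp\left(\mathbf{z}_{v,k}^{\top} \mathbf{z}^{t-1}_{u,k}\right)}{\sum_{k'=1}^{K} \exp\left(\mathbf{z}_{v,k'}^{\top} \mathbf{z}^{t-1}_{u,k'}\right)}, \quad (2)$$
$$\mathbf{z}^{t}_{u,k} = \frac{\mathbf{z}_{u,k} + \sum_{v : (u,v) \in E} p^{k,t}_{u,v}\, \mathbf{z}_{v,k}}{\left\lVert \mathbf{z}_{u,k} + \sum_{v : (u,v) \in E} p^{k,t}_{u,v}\, \mathbf{z}_{v,k} \right\rVert_2}, \quad (3)$$
where iteration $t = 1, 2, \ldots, T$ with $\mathbf{z}^{0}_{u,k} = \mathbf{z}_{u,k}$, $p^{k}_{u,v}$ indicates the probability that factor $k$ is the reason why node $u$ reaches neighbor $v$, and satisfies $\sum_{k=1}^{K} p^{k}_{u,v} = 1$. The neighborhood routing mechanism iteratively infers $p^{k}_{u,v}$ and constructs $\mathbf{z}_{u,k}$. Note that there are $L$ DisenConv layers in total, and $\mathbf{z}_{u,k}$ is assigned the value of $\mathbf{z}^{T}_{u,k}$ at the end of each layer $l \le L$; more details can be found in Algorithm 1.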
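The routing loop can be sketched in plain Python as follows. This is a minimal illustration of neighborhood routing in the style of the cited DisenConv layer, assuming unit-normalized factor vectors and a softmax over factors; the function names, the toy inputs, and the absence of a temperature parameter are assumptions made for brevity.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_routing(z, neighbors, num_iters=3):
    """z[u][k] is the factor-k vector of node u; neighbors[u] lists neighbor ids.

    Returns (c, p): c[u][k] is the routed factor vector of node u and
    p[u][v] is the list of K probabilities that each factor explains edge (u, v).
    """
    num_nodes, num_factors = len(z), len(z[0])
    c = [[list(vec) for vec in node] for node in z]
    p = {}
    for _ in range(num_iters):
        # Infer, for every edge, how likely each factor is to explain it.
        p = {u: {v: softmax([dot(z[v][k], c[u][k]) for k in range(num_factors)])
                 for v in neighbors[u]} for u in range(num_nodes)}
        # Rebuild every factor vector from the node itself plus weighted neighbors.
        new_c = []
        for u in range(num_nodes):
            node = []
            for k in range(num_factors):
                acc = list(z[u][k])
                for v in neighbors[u]:
                    acc = [a + p[u][v][k] * b for a, b in zip(acc, z[v][k])]
                node.append(l2_normalize(acc))
            new_c.append(node)
        c = new_c
    return c, p


# Hypothetical toy input: 2 nodes, 2 factors, 2-dimensional factor vectors.
z = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, -1.0]]]
neighbors = [[1], [0]]
routed, probs = neighborhood_routing(z, neighbors)
print(probs[0][1])
```

In this toy input the two nodes agree on factor 0 and disagree on factor 1, so the routing assigns most of the probability for the edge to factor 0.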
For data in which the relation types between nodes and their neighbors are explicit and fixed, it is easy to construct multiple sub-semantic graphs as the input data for multiple GCN models. As shown in Figure 2(a), a heterogeneous graph contains two different types of meta-paths (meta-path 1, meta-path 2). The heterogeneous graph can then be decomposed into single-semantic graphs, in each of which every node and its neighbors are connected by path-relation 1 (or 2).
However, we cannot simply transfer this method of pre-constructing multiple graphs to all network architectures. In detail, for a graph with no different types of edges, we have to infer the implicit connecting factors of these edges to find semantic-paths, and the probability of each latent factor is calculated during the iterative process described in the last section. To solve this dilemma, we propose a novel algorithm to automatically represent semantic-paths while the model is running.
After the latent factor routing process, we get the soft probability matrix of node latents , where means the possibility that node connects to because of the factor . In our model, the latent factor should identify the certain connecting cause of each connected node pair. Here we transfer the probability matrix to an semantic adjacent matrix , so the element in only has binary value (0 or 1). In detail, for every node pair and , if denotes the biggest value in . As shown in Figure 2(b), each node is represented by components. In this graph, every node may connect with others by one relationship from types, e.g., the relationship between node and is (denotes ). For node , we can find that it has two semantic-path-based neighbors and . And, the semantic-paths of and are two different types which composed by and respectively. We define the adjacent matrix for virtual semantic-path-based edges,
where , , and . For instance, in Figure 2(b), , , and , in this way two semantic-paths start from node can be expressed as and .
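The conversion from soft factor probabilities to binary semantic adjacency matrices, and the enumeration of two-step semantic-paths, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names and toy probabilities are hypothetical, and because the exact path constraints of the original definition are not recoverable here, the sketch only excludes paths that return to their start node.

```python
def semantic_adjacency(p, num_nodes, num_factors):
    """adj[k][u][v] = 1 when factor k has the highest probability on edge (u, v).

    p[u][v] is a list of num_factors probabilities for the edge (u, v).
    """
    adj = [[[0] * num_nodes for _ in range(num_nodes)]
           for _ in range(num_factors)]
    for u, row in p.items():
        for v, probs in row.items():
            best = max(range(num_factors), key=lambda k: probs[k])
            adj[best][u][v] = 1
    return adj


def two_hop_paths(adj, k1, k2):
    """List (start, middle, end) triples joined by a factor-k1 then a factor-k2 edge."""
    n = len(adj[0])
    return [(u, m, v)
            for u in range(n) for m in range(n) for v in range(n)
            if adj[k1][u][m] and adj[k2][m][v] and u != v]


# Hypothetical probabilities on the path graph 0 - 1 - 2 with two factors.
p = {0: {1: [0.9, 0.1]},
     1: {0: [0.9, 0.1], 2: [0.2, 0.8]},
     2: {1: [0.2, 0.8]}}
adj = semantic_adjacency(p, num_nodes=3, num_factors=2)
print(two_hop_paths(adj, 0, 1))
print(two_hop_paths(adj, 1, 0))
```

Here edge 0-1 is attributed to factor 0 and edge 1-2 to factor 1, so node 0 reaches node 2 through a path of type (0, 1), and node 2 reaches node 0 through a path of type (1, 0).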
In the semantic information aggregation process, we aggregate the latent vectors connected by corresponding semantic-path as:
where we just use MeanPooling to avoid large values instead of operator, and are both returned from the last layer of DisenConv operation, in this time that factor probabilities would be stable since the representation of each node considers the influence from neighbors. According to Eq. (5), the aggregation of two latent representations (end points) of one certain semantic-path denotes the mining result of this semantic relation, e.g., and express two different kinds of semantic pattern representations in Figure 2(b), and respectively. And, for all types of semantic-paths start from node , the weight of each type depends on its frequency. Note that, although the semantic adjacent matrix neglects some low probability factors, our semantic paths are integrated with the node states of DisenGCN, which would not lose the crucial information captured by basic GCN model. The advantage of this aggregation method is that our model can distinguish different semantic relations without adding extra parameters, instead of designing various graph convolution networks for different semantic-paths. That is to say, the model does not increase the risk of over fitting after the graph semantic-paths learning. Here we only consider 2-order-paths in our model, however, it can be straightly extended to longer path mining.
In fact, one type of edge in a meta-path denotes one unique meaning, so the latent factors in our work should not overlap. The assumption behind using latent factors to construct semantic-paths is therefore that the different factors extracted by the latent factor routing module focus on different connecting causes. In other words, we should encourage the representations of different factors to be sufficiently independent. Before the probabilities are calculated, the subspaces in Eq. (1) should focus on different aspects of the features. Our solution requires that the distance between independent factor representations be sufficiently long when they are projected into one subspace.
First, we project the input values in Eq. (1) into an unified space to get vectors and as follow:
where is the projection parameter matrix. Then, the independence loss based on distances between unequal factor representations could be calculated as follow:
denotes an identity matrix,is element-wise product, . Specifically, we learn a lesson from  that scaling the dot products by , to counteract the gradients disappear effect for large values. As long as is minimized in the training process, the distances between different factors tend to be larger, that is, the subspaces would capture sufficient different information to encourage independence among learned latent factors.
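To illustrate the idea of penalizing overlap between factors, the following is a minimal sketch in plain Python. It is one plausible form only, not the loss of the original paper: it assumes a row-wise softmax over dot products scaled by the inverse square root of the dimension, and it sums the off-diagonal mass that an identity mask would leave. The name `independence_penalty` and the toy vectors are hypothetical.

```python
import math


def independence_penalty(q, k):
    """Mean off-diagonal softmax mass of the scaled dot-product matrix.

    q and k each hold one projected vector per factor for a single node.
    The value is small when each factor matches only itself, and grows
    when different factors point in similar directions.
    """
    num_factors, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    penalty = 0.0
    for i in range(num_factors):
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                  for j in range(num_factors)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Keep only the entries with j != i, i.e. mask out the diagonal.
        penalty += sum(e for j, e in enumerate(exps) if j != i) / total
    return penalty / num_factors


overlapping = [[1.0, 0.0], [1.0, 0.0]]   # two factors with identical projections
separated = [[5.0, 0.0], [0.0, 5.0]]     # two well-separated factors
print(independence_penalty(overlapping, overlapping))
print(independence_penalty(separated, separated))
```

Minimizing such a penalty pushes the projected factor representations apart, which matches the stated goal of encouraging independence among the learned latent factors.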
Next, we analyze the validity of this optimization. Latent Factor Routing aims to utilize the disentangled algorithm to calculate the latent factors between every two objects. However, this approach is a variant of the von Mises-Fisher (vMF) mixture model, and such an EM algorithm cannot optimize the independence of latent factors within the iterative process. Random initialization of the mapping parameters also cannot guarantee that the subspaces focus on different concerns. To address this shortcoming, we make an assumption:
The features in different subspaces remain sufficiently independent when the margins of their projections in the unified space are sufficiently distinct.
This assumption is inspired by the Latent Semantic Analysis (LSA) algorithm, which projects multi-dimensional features of a vector space model into a semantic space with fewer dimensions while keeping the semantic features of the original space in a statistical sense. So, our optimization approach is listed below:
In the above equation, denotes the training parameter to be optimized. We ignore the and in Eq. (7), because they do not affect the optimization procedure. As the inter-distances of the subspaces increase, the IntraVar of factors in each subspace does not become larger than its original level (at random initialization). The InterVar/IntraVar ratio therefore becomes larger; in other words, the mapping subspaces become more independent.
In this section, we describe the overall algorithm of SGCN for performing node-related tasks. For graph , the ground-truth label of node is , where is the number of classes. The details of our algorithm are shown in Algorithm 1. First, we calculate the independence loss after the factor channels capture features. Then, layers of DisenConv operations return the stable probability matrix . After that, the automatic graph semantic-path representation is learned based on . To apply SGCN to different tasks, we design the final layer as a fully-connected layer , where , . For instance, for the semi-supervised node classification task, we implement as the loss function, where , is the set of labeled nodes, and the independence loss is trained jointly by summing it with the task loss function. For the multi-label classification task, since the label consists of more than one positive bit, we define the multi-label loss function for node as:
Moreover, for the node clustering task, denotes the input feature of K-Means.
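The task losses and their combination with the independence loss can be sketched in plain Python as follows. This is a minimal sketch under assumptions: it uses standard softmax cross-entropy for single-label classification and sigmoid binary cross-entropy for multi-label classification, which may differ from the exact formulas of the paper, and it weights the independence loss by a trade-off coefficient as suggested by the hyper-parameter list. All function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one labeled node (label is a class index)."""
    top = max(logits)
    log_total = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_total - logits[label]


def multi_label_loss(logits, targets):
    """Mean sigmoid binary cross-entropy; targets is a list of 0/1 bits."""
    total = 0.0
    for x, t in zip(logits, targets):
        # Numerically stable form of -t*log(s) - (1-t)*log(1-s), s = sigmoid(x).
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def joint_loss(task_loss, indep_loss, trade_off):
    """Sum the task loss with the weighted independence loss."""
    return task_loss + trade_off * indep_loss


print(cross_entropy([0.0, 0.0], 0))
print(multi_label_loss([0.0, 0.0], [1, 0]))
print(joint_loss(1.0, 0.5, 0.2))
```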
We should notice a problem in Section 3.3 that the time complexity of Eq. (4-5) by matrix calculation is . Such a complex time complexity will bring a lot of computing load, so we optimize this algorithm in the actual implementation. For real-world datasets, one node connects to neighbors that are far less than the total number of nodes in the graph. Therefore, when we create the semantic-paths based adjacent matrix, the matrix is defined to denote 1-order neighbor relationships, is the maximum number of neighbors that we define, and is the id of a neighbor if they are connected by , else . Then the semantic-path relations of type of are denoted by , and the pooling of this semantic pattern is the mean pooling of . According to the analysis above, the time complexity can be reduced to .
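The neighbor-table idea can be sketched in plain Python as follows. This is a minimal illustration assuming a fixed-width table of neighbor ids per factor, padded with a sentinel value; the padding value `-1`, the function names, and the toy data are assumptions, since the corresponding symbols are not recoverable from the text.

```python
PAD = -1  # assumed padding id for missing neighbors


def build_neighbor_table(edges_by_factor, num_nodes, max_neighbors):
    """table[k][u] holds up to max_neighbors ids of factor-k neighbors of u."""
    table = []
    for edges in edges_by_factor:
        rows = [[] for _ in range(num_nodes)]
        for u, v in edges:
            if max_neighbors > len(rows[u]):  # cut off extra neighbors
                rows[u].append(v)
        table.append([row + [PAD] * (max_neighbors - len(row)) for row in rows])
    return table


def path_end_nodes(table, u, k1, k2):
    """End nodes of two-step paths from u: a factor-k1 edge, then a factor-k2 edge."""
    ends = []
    for mid in table[k1][u]:
        if mid == PAD:
            continue
        ends.extend(v for v in table[k2][mid] if v != PAD and v != u)
    return ends


def mean_pool(vectors, dim):
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy graph: edge 0-1 belongs to factor 0, edges 1-2 and 1-3 to factor 1.
edges_by_factor = [[(0, 1), (1, 0)],
                   [(1, 2), (2, 1), (1, 3), (3, 1)]]
table = build_neighbor_table(edges_by_factor, num_nodes=4, max_neighbors=2)
features = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [4.0, 0.0]]
ends = path_end_nodes(table, 0, 0, 1)
print(ends, mean_pool([features[v] for v in ends], 2))
```

With a cut size of C neighbors per node, this sketch performs at most C × C lookups per node and path type, instead of scanning all node pairs as a dense matrix product would.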
In this section, we empirically assess the efficacy of SGCN on several node-related tasks, including semi-supervised node classification, node clustering and multi-label node classification. We then provide node visualization analysis and semantic-path sampling experiments to verify the validity of our idea.
We conduct our experiments on 5 real-world datasets, Citeseer, Cora, Pubmed, POS and BlogCatalog [27, 11, 32], whose statistics are listed in Table 1. The first three citation networks are benchmark datasets for semi-supervised node classification and node clustering. The nodes, edges, and labels in these three datasets represent articles, citations, and research areas, respectively. Their node features correspond to a bag-of-words representation of a document.
POS and BlogCatalog are suitable for the multi-label node classification task. Their labels are part-of-speech tags and user interests, respectively. In detail, BlogCatalog is a social relationship network of bloggers who post blogs on the BlogCatalog website. Its labels represent the bloggers’ interests inferred from the text information provided by the bloggers. POS (Part-of-Speech) is a co-occurrence network of words appearing in the first million bytes of the Wikipedia dump. The labels in POS denote the Part-of-Speech tags inferred via the Stanford POS-Tagger. Because these two graphs do not provide node features, we use the rows of their adjacency matrices in place of node features.
To demonstrate the advantages of our model, we compare SGCN with some representative graph neural networks, including the graph convolutional network (GCN) and the graph attention network (GAT). In detail, GCN is a simplified spectral method of node aggregation, while GAT weights a node’s neighbors by the attention mechanism. GAT achieves state-of-the-art results in many tasks, but it contains far more parameters than GCN and our model. Besides, ChebNet is a spectral graph convolutional network based on a Chebyshev expansion of the graph Laplacian, and MoNet extends CNN architectures by learning local, stationary, and compositional task-specific features. IPGDN is the advanced version of DisenGCN. We also implement other methods that are not graph convolutional networks, including the random-walk-based network embedding DeepWalk, the link-based classification method ICA, the inductive embedding-based approach Planetoid, the label propagation approach LP, the semi-supervised embedding learning model SemiEmb and so on.
In addition, we conduct ablation experiments on node classification and clustering to verify the effectiveness of the main components of SGCN: SGCN-path is our complete model without the independence loss, and SGCN-indep denotes SGCN without the semantic-path representations.
In the multi-label classification experiment, the original implementations of GCN and GAT do not support multi-label tasks. We therefore modify them to use the same multi-label loss function as ours for a fair comparison. We additionally include three node embedding algorithms, DeepWalk, LINE, and node2vec, because they have been shown to perform strongly on multi-label classification. Besides, we remove IPGDN since it is not designed for the multi-label task.
We train our models on one machine with 8 NVIDIA Tesla V100 GPUs. For some experimental results and the settings of common baselines we follow [20, 18], and we optimize the parameters of the models with Adam. Besides, we tune the hyper-parameters of both our model and the baselines using hyperopt. In detail, for semi-supervised classification and node clustering, we set the number of iterations , the layers , the number of components (denotes the number of mapping channels. Therefore, for our model, the dimension of a component in the SGCN model is ), dropout rate , trade-off , the learning rate loguniform , the regularization term loguniform . Besides, it should be noted that, in multi-label node classification, the output dimension is set to 128 to achieve better performance, and the dimension of the node embeddings is set to 128 as well for the other node embedding algorithms. When tuning the hyper-parameters, we set the number of components in the latent factor routing process. Here gives the best result in our experiments.
For semi-supervised node classification, there are only 20 labeled instances for each class. It means that the information of neighbors should be leveraged when predicting the labels of target nodes. Here we follow the experimental settings of previous works [38, 15, 34].
We report the classification accuracy (ACC) results in Table 2. The majority of nodes only connect with neighbors of the same class. According to Table 2, SGCN achieves the best performance among all baselines. SGCN outperforms the strongest baseline, IPGDN, with relative accuracy improvements of 1.55%, 0.27% and 1.1% on the three datasets; compared with the gains of previous models, this is a clear improvement on the node classification task. Our proposed model achieves the best ACC of 85.4% on the Cora dataset. On the other hand, in the ablation experiment (the last three rows of Table 2), the complete SGCN model is superior to either ablated variant on at least two datasets. Moreover, SGCN-indep and SGCN-path both perform better than previous algorithms to some degree, which reveals the effectiveness of our semantic-path mining module and the independence learning for subspaces.

Table 2: Semi-supervised node classification accuracy (%).
Models       Cora   Citeseer   Pubmed
MLP          55.1   46.5       71.4
SemiEmb      59.0   59.6       71.1
LP           68.0   45.3       63.0
DeepWalk     67.2   43.2       65.3
ICA          75.1   69.1       73.9
Planetoid    75.7   64.7       77.2
ChebNet      81.2   69.8       74.4
GCN          81.5   70.3       79.0
MoNet        81.7   -          78.8
GAT          83.0   72.5       79.0
DisenGCN     83.7   73.4       80.5
IPGDN        84.1   74.0       81.2
SGCN-indep   84.2   73.7       82.0
SGCN-path    84.6   74.4       81.6
SGCN         85.4   74.2       82.1

Table 3: Node clustering results (NMI and ARI).
Models       Cora          Citeseer      Pubmed
             NMI    ARI    NMI    ARI    NMI    ARI
SemiEmb      48.7   41.5   31.2   21.5   27.8   35.2
DeepWalk     50.3   40.8   30.5   20.6   29.6   36.6
Planetoid    52.0   40.5   41.2   22.1   32.5   33.9
ChebNet      49.8   42.4   42.6   41.5   35.6   38.6
GCN          51.7   48.9   42.8   42.8   35.0   40.9
GAT          57.0   54.1   43.1   43.6   35.0   41.4
DisenGCN     58.4   60.4   43.7   42.5   36.1   41.6
IPGDN        59.2   61.0   44.3   43.0   37.0   42.0
SGCN-indep   60.2   59.2   44.7   42.8   37.2   42.3
SGCN-path    60.5   60.7   45.1   44.0   37.3   42.8
SGCN         60.7   61.6   44.9   44.2   37.9   42.5
In the multi-label classification experiment, every node is assigned one or more labels from a finite label set. We follow node2vec and report the performance of each method while varying the number of nodes labeled for training from 10% to 90% of the total number of nodes. The remaining nodes are split equally to form a validation set and a test set. Then, with the best hyper-parameters on the validation sets, we report the averaged performance of 30 runs on each multi-label test set. We summarize the results of multi-label node classification by Macro-F1 and Micro-F1 scores in Figure 3.
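For reference, the two reported metrics can be computed as in the following minimal sketch in plain Python. Macro-F1 averages the per-label F1 scores, while Micro-F1 pools the true-positive, false-positive and false-negative counts over all labels before computing F1. The function names and the toy label matrices are illustrative assumptions.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """y_true and y_pred are lists of 0/1 label vectors, one per node."""
    num_labels = len(y_true[0])
    counts = []
    for c in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] and p[c])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[c] and p[c])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] and not p[c])
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / num_labels
    micro = f1(*(sum(col) for col in zip(*counts)))
    return macro, micro


# Hypothetical predictions for 4 nodes and 2 labels; label 1 is rare.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
macro, micro = macro_micro_f1(y_true, y_pred)
print(macro, micro)
```

In this toy case the frequent label is predicted perfectly and the rare label is always wrong, so Micro-F1 stays high while Macro-F1 drops, which is why both scores are reported together.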
Firstly, the proposed SGCN model clearly achieves the best performance on both datasets. Compared with DisenGCN, SGCN combined with semantic-paths achieves its biggest improvement, 20.0%, when 10% of the nodes are labeled on the POS dataset. The reason may be that the relation type of the POS dataset is word co-occurrence, so there are many regular explicit or implicit semantics among the relationships between different words. On the other dataset, SGCN does not lead everywhere, but it still achieves the highest scores on both indicators. We find that the GCN-based algorithms are usually superior to the traditional node embedding algorithms overall, although GCN produces poor results for the Micro-F1 score on BlogCatalog. In addition, SGCN achieves good Macro-F1 and Micro-F1 results at the same time, without performing badly on either of them, because this approach does not ignore the information provided by classes that have few samples but important semantic relationships.
To further evaluate the embeddings learned by the above algorithms, we also conduct the clustering task. For our model and each baseline, we obtain its node embedding via a feed-forward pass once the model is trained. Then we input the node embedding to the K-Means algorithm to cluster nodes. The ground truth is the same as that of the node classification task, and the number of clusters is set to the number of classes. In detail, we employ two metrics, Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI), to validate the clustering results. Since the performance of K-Means is affected by the initial centroids, we repeat the process 20 times and report the average results in Table 3. As can be seen in Table 3, SGCN consistently outperforms all baselines, and GNN-based algorithms usually achieve better performance. Besides, with the semantic-path representation, SGCN and SGCN-path perform significantly better than DisenGCN and IPGDN, and our proposed algorithm gets the best results on both NMI and ARI. This shows that SGCN captures a more meaningful node embedding by learning semantic patterns from the graph.
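As a reference for one of the two clustering metrics, the following is a minimal sketch of the Adjusted Rand Index in plain Python, computed from pair counts of the contingency table. The function name and toy labelings are illustrative; the sketch assumes at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same samples (needs at least 2 samples)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


classes = [0, 0, 1, 1]
print(adjusted_rand_index(classes, [1, 1, 0, 0]))  # same partition, renamed clusters
print(adjusted_rand_index(classes, [0, 1, 0, 1]))  # partition that splits every class
```

The metric is invariant to how cluster ids are named, which is why it suits K-Means output whose cluster numbering is arbitrary.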
We next demonstrate the intuitive changes of node representations after incorporating semantic patterns. We utilize t-SNE to transform the feature representations (node embeddings) of SGCN and DisenGCN into a 2-dimensional space for a more intuitive visualization. Here we visualize the node embedding of Cora (the change in the visualization is similar on other datasets), where different colors denote different research areas. According to Figure 4, the visualization of SGCN is more distinguishable than that of DisenGCN. It demonstrates that the embedding learned by SGCN presents a high intra-class similarity and separates papers into different research areas with distinct boundaries. On the contrary, DisenGCN does not perform well, since the margins between its clusters are not distinguishable enough. In several clusters, many nodes belonging to different areas are mixed with others.
Then, to explore the influence of different scales of semantic-paths on our model performance, we implement a semantic-path sampling experiment on Cora. As mentioned in Section 3.6, to capture different numbers of semantic-paths, we change the cut-size hyper-parameter to restrict the sampling size on each node’s neighbors. As shown in Figure 5, the SGCN model with the path representation achieves higher performance than the first point (). From the perspective of the global trend, with the increase of , the classification accuracy of the SGCN model also improves steadily, although it gets the highest score when . It means that a GCN model combined with semantic-paths of a sufficient scale can indeed learn better node representations.
In this paper, we proposed a novel framework named Semantic Graph Convolutional Networks, which incorporates the semantic-paths automatically during the node aggregating process. SGCN thereby provides semantic learning ability to general graph algorithms. We conducted extensive experiments on various real-world datasets to evaluate the performance of our proposed model, and the results show its superiority. Moreover, our method has good extensibility: all kinds of path-based algorithms in the graph embedding field can be directly applied in SGCN to adapt to different tasks, and we will explore this further in future work.
This research was partially supported by grants from the National Key Research and Development Program of China (No. 2018YFC0832101), and the National Natural Science Foundation of China (Nos. U20A20229 and 61922073). This research was also supported by Meituan-Dianping Group.
Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
Proceedings of the AAAI Conference on Artificial Intelligence.
Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.
Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 912–919.