Heterogeneous Deep Graph Infomax

by Yuxiang Ren, et al.

Graph representation learning aims to learn universal node representations that preserve both node attributes and structural information. The derived node representations can serve various downstream tasks, such as node classification and node clustering. When a graph is heterogeneous, the problem becomes more challenging than in the homogeneous case. Inspired by emerging information-theoretic learning algorithms, in this paper we propose an unsupervised graph neural network, Heterogeneous Deep Graph Infomax (HDGI), for heterogeneous graph representation learning. We use the meta-path structure to analyze the connections involving semantics in heterogeneous graphs, and utilize a graph convolution module and a semantic-level attention mechanism to capture local representations. By maximizing local-global mutual information, HDGI effectively learns high-level node representations that can be utilized in downstream graph-related tasks. Experimental results show that HDGI remarkably outperforms state-of-the-art unsupervised graph representation learning methods on both classification and clustering tasks. By feeding the learned representations into a parametric model, such as logistic regression, we even achieve node classification performance comparable to state-of-the-art supervised end-to-end GNN models.




1 Introduction

Numerous real-world data applications, such as social networks [JZhang19], knowledge graphs [WMWG17] and protein-protein interaction networks [FBSB17], naturally take the form of graph data. Handling graph data is challenging, because each node has its own unique attributes and the extensive connections among nodes convey complex but important information as well. Learning from the information of individual nodes and the connections among them simultaneously makes the task even harder.

Traditional machine learning methods focus on the features of individual nodes, which hinders their ability to process graph data. Graph neural networks (GNNs) for representation learning of graphs learn new feature vectors for nodes through a recursive neighborhood aggregation scheme [XHLJ10], which in essence fuses node attributes with structural information. With the support of sufficient training samples, a rich body of successful supervised graph neural network models has been developed [KW17, VCCRLB18, YYRHL18]. However, labeled data is not always available in graph representation learning tasks, and those algorithms are not applicable to unsupervised learning settings. To alleviate the problem of insufficient training samples, unsupervised graph representation learning has attracted extensive research interest. The task is to learn a low-dimensional representation for each graph node such that the representation preserves the graph topology and the node content. The learned node representations can then be used with conventional sample-based machine learning algorithms as well.

Most of the existing unsupervised graph representation learning models can be roughly grouped into factorization-based models and edge-based models. Factorization-based models capture the global graph information by factorizing the sample affinity matrix [zhang2016collective, yang2015network]. Those methods tend to ignore the node attributes and local neighborhood relationships, which usually contain important information. Edge-based models exploit the local and higher-order neighborhood information through edge connections or random-walk paths. Nodes tend to have similar representations if they are connected or co-occur in the same path [KW16, DN17, HYL17, GL16, PAS14]. Edge-based models tend to preserve node proximity only up to a limited order and lack a mechanism to preserve the global graph structure. The recently proposed Deep Graph Infomax (DGI) [velivckovic2018deep] model provides a novel direction that considers both global and local graph structure. DGI maximizes the mutual information between graph patch representations and the corresponding high-level summaries of graphs. It has shown competitive performance even compared with supervised graph neural networks on benchmark homogeneous graphs.

In this paper, we explore the mutual information-based learning framework in heterogeneous graph representation problems. Networked data in the real world usually contain very complex structures (involving multiple types of nodes and edges), which can be formally modeled as heterogeneous information networks (HIN). In this paper, we use the terms “HIN” and “HG” (heterogeneous graph) interchangeably to refer to such complex networked data. Compared with homogeneous graphs, heterogeneous graphs contain more detailed information and rich semantics with complex connections among multi-typed nodes. Taking the bibliographic network in Figure 1 as an example, it contains three types of nodes (Author, Paper and Subject) as well as two types of edges (Write and Belong-to). Besides, the individual nodes themselves also carry abundant attribute information (e.g., paper textual contents). Due to the diversity of node and edge types, the heterogeneous graph itself becomes more complex, and the diverse (direct or indirect) connections between nodes also convey more semantic information. In heterogeneous graph studies, the meta-path [SHYYW11] has been widely used to represent composite relations with different semantics. As illustrated in Figure 2, the relations between paper nodes can be expressed by PAP and PSP, which represent papers written by the same author and papers belonging to the same subject, respectively. GNNs initially proposed for homogeneous graphs may encounter great challenges in handling these relations with different semantics in heterogeneous graphs.

Figure 1: An example of heterogeneous bibliographic network.
Figure 2: Meta-Paths in the bibliographic network.

To address the above challenges, we propose a novel meta-path based unsupervised graph neural network model for heterogeneous graphs, namely Heterogeneous Deep Graph Infomax (HDGI). Our contributions can be summarized as follows:

  • This paper presents the first model to apply mutual information maximization to representation learning in heterogeneous graphs.

  • Our proposed method, HDGI, is a novel unsupervised graph neural network with the attention mechanism. It handles graph heterogeneity by utilizing an attention mechanism on meta-paths and deals with the unsupervised settings by applying mutual information maximization.

  • Our experiments demonstrate that the representations learned by HDGI are effective for both node classification tasks and clustering tasks. Moreover, its performance is competitive with state-of-the-art supervised graph neural network models, and exceeds them on DBLP, even though those models have access to additional label information.

The rest of this paper is organized as follows. We discuss the related work in Section 2. In Section 3, we present the problem formulation along with important terminologies used in our method. We propose HDGI in Section 4. We present the experimental results and analyses in Section 5. Finally, we conclude the paper in Section 6.

2 Related Work

Graph representation learning. Graph representation learning has become a non-trivial topic [CWPZ18] because of the ubiquity of graphs in the real world. Because graphs contain rich structural information, many models [GL16, TQWZYM15] learn the representations of nodes based on the structure of the graph. DeepWalk [PAS14] feeds a set of random walks over the graph into SkipGram to learn node embeddings. Several methods [MPJZW16, WCWPZY17] attempt to retrieve structural information through matrix factorization. However, all the above methods are proposed for homogeneous graphs.

Heterogeneous graph learning. In order to handle the heterogeneity of graphs, metapath2vec [DCS17] samples random walks under the guidance of meta-paths and learns node embeddings through skip-gram in heterogeneous graphs. HIN2Vec [FCL17] learns the embedding vectors of nodes and meta-paths simultaneously while conducting prediction tasks. Wang et al. [WJSWCYY19] consider the attention mechanism in heterogeneous graph learning, where information from multiple meta-path defined connections can be learned effectively. From the perspective of attributed graphs, SHNE [ZSC19] captures both structural closeness and unstructured semantic relations through joint optimization of heterogeneous SkipGram and deep semantic encoding.

Graph neural network. With the success of deep learning in recent years, graph neural networks (GNNs) have made a lot of progress in graph representation learning. The core idea of GNNs is to aggregate the feature information of neighbors through neural networks, learning new features that combine a node's own information with the corresponding structural information in the graph. Most successful GNNs are based on supervised learning, including GCN [KW17], GAT [VCCRLB18], GraphRNN [YYRHL18], SplineCNN [fey2018splinecnn], AdaGCN [sun2019adagcn] and AS-GCN [huang2018adaptive]. Unsupervised GNNs can be mainly divided into two categories, i.e., random walk-based [PAS14, GL16, KW16, DN17, HYL17] and mutual information-based [velivckovic2018deep].

3 Problem Formulation

In this section, we define the concepts of heterogeneous graph and meta-path based adjacency matrix and formulate the problem of heterogeneous graph representation learning.

Definition 3.1 (Heterogeneous Graph (HG))

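A heterogeneous graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with a node type mapping function $\phi: \mathcal{V} \rightarrow \mathcal{T}$ and an edge type mapping function $\psi: \mathcal{E} \rightarrow \mathcal{R}$. Each node $v \in \mathcal{V}$ belongs to one particular node type in the node type set $\mathcal{T}$, and each edge $e \in \mathcal{E}$ belongs to a particular edge type in the edge type set $\mathcal{R}$. The sets of node types and edge types in heterogeneous graphs have the property that $|\mathcal{T}| + |\mathcal{R}| > 2$.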

Problem Definition. (Heterogeneous Graph Representation Learning): Given a heterogeneous graph $\mathcal{G}$ and the set of node feature vectors $\mathcal{X}$, the representation learning task in $\mathcal{G}$ is to learn a low-dimensional node representation H that contains both structural information from $\mathcal{G}$ and node attributes from $\mathcal{X}$. The learned representation H can be applied to downstream graph-related tasks such as node classification and node clustering. Note that we only focus on learning the representations of one specific type of node in this paper. We represent this set of nodes as the target-type nodes $\mathcal{V}_t$.

In a heterogeneous graph, two neighboring nodes can be connected by different types of edges. Meta-paths, which represent the node types and edge types between two neighboring nodes in an HG, have been proposed to model such rich information [SHYYW11]. The meta-path is a well-known concept in graph studies, and we will not reintroduce its definition in this paper. Formally, we represent the set of meta-paths used in this paper as $\{\Phi_1, \Phi_2, \dots, \Phi_P\}$, where $\Phi_i$ denotes the $i$-th meta-path type. For example, in Figure 2, Paper-Author-Paper (PAP) and Paper-Subject-Paper (PSP) are two types of meta-paths between papers.

Definition 3.2 (Meta-path based Adjacency Matrix)

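For a meta-path $\Phi_i$, if there exists a meta-path instance between nodes $v_j$ and $v_k$, we say that $v_j$ and $v_k$ are “connected neighbors” based on $\Phi_i$. Such neighborhood information can be represented by a meta-path based adjacency matrix $A^{\Phi_i}$, where $A^{\Phi_i}_{jk} = A^{\Phi_i}_{kj} = 1$ if $v_j$, $v_k$ are connected by meta-path $\Phi_i$ and $A^{\Phi_i}_{jk} = A^{\Phi_i}_{kj} = 0$ otherwise.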
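
As an illustration of Definition 3.2, the following is a minimal sketch of how a meta-path based adjacency matrix could be computed for a symmetric two-hop meta-path such as PAP. The function name `metapath_adjacency`, the `keep_self_loops` flag, and the toy paper-author incidence matrix are assumptions made for this example; the paper does not specify how the matrices are built or whether self-connections are kept.

```python
import numpy as np

def metapath_adjacency(incidence, keep_self_loops=False):
    """Binary adjacency for a symmetric two-hop meta-path such as PAP.

    incidence[p, a] = 1 if target node p is linked to intermediate node a.
    """
    counts = incidence @ incidence.T        # number of meta-path instances per node pair
    adj = (counts > 0).astype(np.float64)   # 1 if at least one instance exists
    if not keep_self_loops:
        np.fill_diagonal(adj, 0.0)          # assumption: drop trivial paths p -> a -> p
    return adj

# Hypothetical toy graph: 4 papers (rows) and 3 authors (columns).
paper_author = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])
A_pap = metapath_adjacency(paper_author)
```

In this toy graph, papers 0 and 1 share an author, as do papers 1 and 2, so they become PAP neighbors, while paper 3 has no PAP neighbor.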

4 HDGI Methodology

A high-level illustration of the proposed Heterogeneous Deep Graph Infomax (HDGI) model is shown in Figure 3. We summarize the notations used for model description in Table 1. In the following, we elaborate on the four major components of HDGI: (1) meta-path based local representation encoder, (2) global representation encoder, (3) negative samples generator and (4) mutual information based discriminator.

Figure 3: The high-level structure of Heterogeneous Deep Graph Infomax (HDGI). The local representation encoder is a hierarchical structure: it learns node representations for every meta-path based adjacency matrix separately and then aggregates them through semantic-level attention. The global representation encoder outputs a graph-level summary vector $\vec{s}$. The negative samples generator is responsible for generating negative nodes. The discriminator maximizes the mutual information between positive nodes and the graph-level summary vector $\vec{s}$.
Symbol            Interpretation
$A^{\Phi_i}$      Meta-path based adjacency matrix
$\mathcal{X}$     The set of node feature vectors
X                 The initial node feature matrix
$\mathcal{V}_t$   The set of nodes with the target type
$N$               The number of nodes in $\mathcal{V}_t$
$\mathcal{G}$     The given heterogeneous graph
$\mathcal{D}$     Mutual information based discriminator
$\mathcal{C}$     Negative samples generator
$\mathcal{R}$     Global representation encoder
$\vec{s}$         The graph-level summary vector
$H^{\Phi_i}$      Node-level representations
$\vec{q}$         Semantic-level attention vector
$\beta^{\Phi_i}$  Attention weight of meta-path $\Phi_i$
$\tilde{H}$       Final negative nodes representations
H                 Final positive nodes representations
Table 1: Symbols and Definitions

4.1 HDGI Architecture Overview

The input of HDGI is a heterogeneous graph $\mathcal{G}$ along with the set of node feature vectors $\mathcal{X}$ and the meta-path set $\{\Phi_i\}_{i=1}^{P}$. Based on the original graph and the meta-path set, the set of meta-path based adjacency matrices $\{A^{\Phi_i}\}_{i=1}^{P}$ can be calculated. The meta-path based local representation encoder is a hierarchical structure: it learns individual node representations for every meta-path based adjacency matrix separately and then aggregates them through semantic-level attention. Given the node representation H output by the meta-path based local representation encoder, the global representation encoder outputs a graph-level summary vector $\vec{s}$. The negative samples generator is responsible for generating negative nodes for the graph $\mathcal{G}$, and these negative nodes, along with the positive nodes from $\mathcal{G}$, are used to train the discriminator with the objective of maximizing the mutual information between positive nodes and the graph-level summary vector $\vec{s}$.

4.2 Meta-path based local representation encoder

The meta-path based node encoder has a two-level structure. We first derive a node representation from each meta-path based adjacency matrix $A^{\Phi_i}$ separately. After that, the node representations based on all of $\{A^{\Phi_i}\}_{i=1}^{P}$ are aggregated by an attention mechanism.

4.2.1 Node-level learning

Each $A^{\Phi_i}$ can be viewed as a homogeneous graph. At this step, our target is to derive a node representation $H^{\Phi_i}$ containing the information of the initial node features X and of $A^{\Phi_i}$. The initial node feature matrix X can be constructed by stacking the feature vectors in $\mathcal{X}$. In HDGI, we try both GCN [KW17] and GAT [VCCRLB18] as components of the local representation encoder.

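Graph Convolutional Network (GCN) [KW17] introduces a spectral graph convolution operator for graph representation learning. GCN proposes a first-order approximation, under which the node representations learned by GCN are:

$$H^{\Phi_i} = \tilde{D}^{-\frac{1}{2}} \tilde{A}^{\Phi_i} \tilde{D}^{-\frac{1}{2}} X W^{\Phi_i}$$

where $\tilde{A}^{\Phi_i} = A^{\Phi_i} + I$ and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}^{\Phi_i}$. Matrix $W^{\Phi_i}$ is the filter parameter matrix, which is not shared between different $A^{\Phi_i}$.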
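
The first-order GCN propagation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `gcn_propagate`, the toy adjacency matrix, and the random features and weights are hypothetical, and no nonlinearity is applied.

```python
import numpy as np

def gcn_propagate(adj, features, weight):
    """One first-order GCN step: D^{-1/2} (A + I) D^{-1/2} X W."""
    a_tilde = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))   # degrees are >= 1 thanks to self-loops
    a_hat = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    return a_hat @ features @ weight

# Hypothetical toy input: nodes 0 and 1 are meta-path neighbors, node 2 is isolated.
rng = np.random.default_rng(0)
adj_pap = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
x = rng.normal(size=(3, 4))       # initial node features X
w_pap = rng.normal(size=(4, 2))   # filter matrix for this meta-path only
h_pap = gcn_propagate(adj_pap, x, w_pap)
```

Because the filter matrix is not shared across meta-paths, a separate `weight` would be kept for each meta-path based adjacency matrix.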

Graph Attention Network (GAT) [VCCRLB18] updates node representations by aggregating information from each node's neighbors, including the node itself. The learned hidden representation of node $v_j$ can be represented as

$$\vec{h}_j^{\Phi_i} = \Big\Vert_{m=1}^{K} \sigma\Big( \sum_{v_k \in \mathcal{N}_j^{\Phi_i}} \alpha_{jk}^{\Phi_i, m} W^{\Phi_i, m} \vec{x}_k \Big)$$

where $W^{\Phi_i, m}$ is a meta-path specific weight matrix of the shared linear transformation for head $m$, $\mathcal{N}_j^{\Phi_i}$ is the set of $\Phi_i$-based neighbors of node $v_j$, $\alpha_{jk}^{\Phi_i, m}$ is the attention weight between two connected nodes based on $\Phi_i$, and $K$ is the number of heads in the multi-head attention mechanism.
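
To make the attention-based aggregation concrete, here is a minimal NumPy sketch that follows the standard GAT formulation of [VCCRLB18]: attention scores are computed with a LeakyReLU over concatenated transformed features, normalized with a softmax over each node's neighborhood (including the node itself), and the heads are concatenated. The function names, the LeakyReLU slope of 0.2, the `tanh` output activation, and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(adj, features, weight, attn_vec):
    """One attention head; returns (attention weights, aggregated features)."""
    n = adj.shape[0]
    z = features @ weight                      # shared linear transformation, shape (N, F')
    f_out = z.shape[1]
    # a^T [z_j || z_k] splits into a source term and a target term
    src = z @ attn_vec[:f_out]
    dst = z @ attn_vec[f_out:]
    scores = leaky_relu(src[:, None] + dst[None, :])
    mask = (adj + np.eye(n)) > 0               # neighbors, including the node itself
    scores = np.where(mask, scores, -np.inf)   # non-neighbors get zero attention
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ z

def multi_head_gat(adj, features, weights, attn_vecs):
    """Concatenate the activated outputs of K independent heads."""
    heads = [np.tanh(gat_head(adj, features, w, a)[1])
             for w, a in zip(weights, attn_vecs)]
    return np.concatenate(heads, axis=1)

# Hypothetical toy input: 3 nodes, 4 input features, K = 2 heads of size 3.
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
x = rng.normal(size=(3, 4))
weights = [rng.normal(size=(4, 3)) for _ in range(2)]
attn_vecs = [rng.normal(size=6) for _ in range(2)]
alpha0, _ = gat_head(adj, x, weights[0], attn_vecs[0])
h_gat = multi_head_gat(adj, x, weights, attn_vecs)
```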

In the experiment section, we will show the performance along with the analysis of using these two GNNs as the node-level encoder.

For each meta-path $\Phi_i$, a node-level encoder

$$H^{\Phi_i} = f^{\Phi_i}(X, A^{\Phi_i})$$

will be learned in order to output the high-level representation $H^{\Phi_i}$. After the node-level learning, we obtain the set of node representations $\{H^{\Phi_1}, \dots, H^{\Phi_P}\}$ based on meta-path connections with different semantics.

4.2.2 Semantic-level learning

The representations learned from the structural information of each meta-path contain only semantic-specific information. To obtain more general node representations, we need to combine the representations $\{H^{\Phi_1}, \dots, H^{\Phi_P}\}$. The key issue in this combination is determining how much each meta-path should contribute to the final representations. In other words, we need to learn what weights should be assigned to different meta-paths. Here we add a semantic attention layer $\mathcal{L}_{att}$ to learn the weights that each meta-path should be assigned:

$$\{\beta^{\Phi_1}, \dots, \beta^{\Phi_P}\} = \mathcal{L}_{att}(H^{\Phi_1}, \dots, H^{\Phi_P})$$

We then fuse the representations of multiple semantics according to the learned weights $\{\beta^{\Phi_1}, \dots, \beta^{\Phi_P}\}$. Our semantic attention layer is inspired by HAN [WJSWCYY19], but the learned weights of meta-paths should make the final representations reflect the fact that the node belongs to the original graph, without any bias from known labels. HAN utilizes classification cross-entropy as the loss function, so its learning direction is guided by known labels in the training set. In contrast, the attention weights learned in HDGI are guided by the binary cross-entropy loss, which indicates whether a node belongs to the original graph. Therefore, the weights learned in HDGI reflect the existence of a node, and because no classification labels are involved, the weights carry no bias from known labels.

In order to make representations based on different meta-paths comparable, we first transform each node's representation with a linear transformation, parameterized by a shared weight matrix $W_{sem}$ and a shared bias vector $\vec{b}$. The importance of the representations based on different meta-paths is measured by a shared attention vector $\vec{q}$. The importance of the meta-path $\Phi_i$ can be calculated as:

$$e^{\Phi_i} = \frac{1}{N} \sum_{j=1}^{N} \tanh\big( \vec{q}^{\,T} \cdot [W_{sem} \cdot \vec{h}_j^{\Phi_i} + \vec{b}] \big)$$

We then normalize the importance of the meta-paths using the softmax function:

$$\beta^{\Phi_i} = \frac{\exp(e^{\Phi_i})}{\sum_{l=1}^{P} \exp(e^{\Phi_l})}$$

Once obtained, the weights of different meta-paths are used as coefficients in a linear combination of the corresponding representations:

$$H = \sum_{i=1}^{P} \beta^{\Phi_i} \cdot H^{\Phi_i}$$

The representations H serve as the final output local features. Note that all parameters in the meta-path based local representation encoder are shared between positive nodes and the negative nodes produced by the negative samples generator introduced later. The global representation encoder also uses the representations H to output the graph-level summary $\vec{s}$, as described in the following part.
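
The three semantic-level steps (importance, softmax normalization, weighted fusion) can be sketched as follows. This is a hedged illustration: the function name `semantic_attention`, the parameter shapes, and the placement of `tanh` around the projection onto the attention vector are assumptions of this example, not a confirmed reproduction of the authors' code.

```python
import numpy as np

def semantic_attention(reps, w_sem, b, q):
    """Fuse P meta-path specific representations of shape (N, F).

    w_sem: (D, F) shared weight matrix, b: (D,) shared bias,
    q: (D,) shared attention vector. Returns (beta, fused).
    """
    # importance of each meta-path: node-wise score averaged over all N nodes
    scores = np.array([np.tanh((h @ w_sem.T + b) @ q).mean() for h in reps])
    scores = scores - scores.max()                 # numerical stability for softmax
    beta = np.exp(scores) / np.exp(scores).sum()   # normalized meta-path weights
    fused = sum(w * h for w, h in zip(beta, reps)) # linear combination H
    return beta, fused

# Hypothetical toy input: P = 2 meta-paths, N = 5 nodes, F = 4, D = 8.
rng = np.random.default_rng(0)
h_pap = rng.normal(size=(5, 4))
h_psp = rng.normal(size=(5, 4))
w_sem = rng.normal(size=(8, 4))
b = rng.normal(size=8)
q = rng.normal(size=8)
beta, h_final = semantic_attention([h_pap, h_psp], w_sem, b, q)
```

Because the same `w_sem`, `b`, and `q` are used for every meta-path, two identical inputs receive identical weights.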

4.3 Global Representation Encoder

The learning objective of HDGI is to maximize the mutual information between local representations and the global representation. The local representations of nodes are contained in H, and we need the summary vector $\vec{s}$ to represent the global information of the entire heterogeneous graph. Based on H, we examined three candidate encoder functions:

Averaging encoder function. Our first candidate encoder function is the averaging operator, where we simply take the mean of the node representations to output the graph-level summary vector $\vec{s}$:

$$\vec{s} = \frac{1}{N} \sum_{j=1}^{N} \vec{h}_j$$

Pooling encoder function. In this pooling encoder function, each node's vector is independently fed through a fully-connected layer. An element-wise max-pooling operator is then applied to summarize the information from the node set:

$$\vec{s} = \max\big( \{ \sigma(W_{pool} \vec{h}_j + \vec{b}), \ \forall v_j \in \mathcal{V}_t \} \big)$$

where $\max$ denotes the element-wise max operator and $\sigma$ is a nonlinear activation function.
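
The averaging and pooling encoder functions described above can be sketched as follows. This is a minimal illustration; the function names, the use of the logistic sigmoid as the nonlinearity in the pooling encoder, and the toy values are assumptions of this example.

```python
import numpy as np

def average_readout(h):
    """Graph-level summary as the mean of the node representations."""
    return h.mean(axis=0)

def pooling_readout(h, w_pool, b_pool):
    """Fully-connected layer per node, then element-wise max over nodes."""
    hidden = 1.0 / (1.0 + np.exp(-(h @ w_pool.T + b_pool)))  # sigmoid is an assumption
    return hidden.max(axis=0)

# Hypothetical toy input: N = 2 nodes with 2-dimensional representations.
h = np.array([[1.0, 2.0],
              [3.0, 4.0]])
s_avg = average_readout(h)
s_pool = pooling_readout(h, np.eye(2), np.zeros(2))
```

Both functions are invariant to the order of the rows of `h`, which is the property a summary over an unordered node set needs.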

Set2vec encoder function. The final encoder function we examine is Set2vec [VBK16], which is based on an LSTM architecture. The original set2vec in [VBK16] works on ordered node sequences, whereas here we need a summary of the graph that draws comprehensive information from every node rather than merely the graph structure. Therefore, we apply the LSTMs to a random permutation of the nodes, treating them as an unordered set.

Among these functions, the simple averaging function achieves the best performance in our experiments. We report the results based on different functions in Figure 5.

4.4 HDGI Learning

4.4.1 Negative samples generator

The negative samples generator is responsible for generating negative samples (nodes that do not exist in the original graph), which are used to train the mutual information based discriminator. DIM [HFMGBTB19] produces negative patch representations by simply using another image from the training set as a fake input. However, the heterogeneous graph representation learning tasks we face are normally in a single-graph setting. Here, we borrow the idea from DGI [velivckovic2018deep] and extend it to heterogeneous graphs.

As our target is to maximize the mutual information between positive nodes and the graph-level summary vector, the generated negative samples affect the structural information captured by the model. We therefore need high-quality negative samples that keep the structural information precisely. In the heterogeneous graph $\mathcal{G}$, we have rich and complex structural information from the set of meta-path based adjacency matrices. In our negative samples generator

$$(\tilde{X}, A^{\Phi_1}, \dots, A^{\Phi_P}) = \mathcal{C}(X, A^{\Phi_1}, \dots, A^{\Phi_P})$$

we keep all meta-path based adjacency matrices unchanged, which keeps the overall structure of $\mathcal{G}$ stable. We then shuffle the rows of the initial node feature matrix X, which changes the index of nodes in order to corrupt the node-level connections among them. According to spectral theory, the structure of the whole graph does not change, but the initial feature vector assigned to each node does. We provide a simple example to illustrate the procedure of generating negative samples in Figure 4.
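
The row-shuffling corruption described above is simple enough to sketch directly. The function name `corrupt` and the toy inputs are hypothetical; the only behavior taken from the text is that the feature rows are permuted while every meta-path based adjacency matrix is left untouched.

```python
import numpy as np

def corrupt(features, adjs, rng):
    """Negative sample generation: shuffle feature rows, keep all adjacencies."""
    perm = rng.permutation(features.shape[0])   # random reassignment of features to nodes
    return features[perm], list(adjs)

# Hypothetical toy input: 4 nodes, 3 features, two meta-path adjacency matrices.
rng = np.random.default_rng(0)
x = np.arange(12.0).reshape(4, 3)
adjs = [np.eye(4), np.ones((4, 4)) - np.eye(4)]
x_neg, adjs_neg = corrupt(x, adjs, rng)
```

The corrupted pair is then passed through the same local representation encoder as the original input to obtain the negative node representations.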

Figure 4: The example of generating negative samples

4.4.2 Mutual information based discriminator

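According to the proof in Mutual Information Neural Estimation [BBROBCD18], the mutual information can be estimated by gradient descent over neural networks. Here, we estimate the mutual information by training a discriminator $\mathcal{D}$ to distinguish between $(\vec{h}_j, \vec{s})$ and $(\tilde{\vec{h}}_k, \vec{s})$. The sample $(\vec{h}_j, \vec{s})$ is denoted as positive because node $v_j$ belongs to the original graph, and $(\tilde{\vec{h}}_k, \vec{s})$ is denoted as negative because the node is a generated fake one. The discriminator $\mathcal{D}$ is a binary classifier:

$$\mathcal{D}(\vec{h}_j, \vec{s}) = \sigma\big( \vec{h}_j^{\,T} W_D \, \vec{s} \big)$$

Based on the relationship [HFMGBTB19] between the Jensen-Shannon divergence and the mutual information, we can maximize the mutual information with the binary cross-entropy loss of the discriminator:

$$\mathcal{L} = -\frac{1}{N + M} \Big( \sum_{j=1}^{N} \log \mathcal{D}(\vec{h}_j, \vec{s}) + \sum_{k=1}^{M} \log\big( 1 - \mathcal{D}(\tilde{\vec{h}}_k, \vec{s}) \big) \Big)$$

where $N$ and $M$ are the numbers of positive and negative nodes. The above loss can be optimized through gradient descent, and the representations of nodes are learned once the optimization is completed.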
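
A minimal sketch of the discriminator and its binary cross-entropy loss is given below. It assumes a bilinear scoring function as used in DGI [velivckovic2018deep]; the function names, the small `eps` added for numerical safety, and the toy inputs are assumptions of this example, and no gradient step is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(h, w_d, s):
    """Probability that each row of h belongs to the graph summarized by s."""
    return sigmoid(h @ w_d @ s)   # bilinear score (an assumption borrowed from DGI)

def infomax_loss(h_pos, h_neg, w_d, s, eps=1e-12):
    """Binary cross-entropy: positives should score 1, negatives should score 0."""
    p_pos = discriminator(h_pos, w_d, s)
    p_neg = discriminator(h_neg, w_d, s)
    log_lik = np.log(p_pos + eps).sum() + np.log(1.0 - p_neg + eps).sum()
    return -log_lik / (len(h_pos) + len(h_neg))

# Hypothetical toy input: 5 positive and 5 negative node representations.
rng = np.random.default_rng(0)
h_pos = rng.normal(size=(5, 4))
h_neg = h_pos[rng.permutation(5)]
s = h_pos.mean(axis=0)
loss_untrained = infomax_loss(h_pos, h_neg, np.zeros((4, 4)), s)
```

With an all-zero `w_d` the discriminator outputs 0.5 for every sample, so the loss equals log 2, the value of an uninformed binary classifier; training would aim to push it below that.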

5 Evaluation

In this section, we evaluate the proposed HDGI framework in three real-world heterogeneous graphs. We first introduce the datasets and experimental settings. Then we report the model performance as compared to other state-of-the-art competitive methods. The evaluation results show the superiority of our developed model.

Dataset  Node-type       # Nodes  Edge-type         # Edges  Meta-path
ACM      Paper (P)       3025     Paper-Author      9744     PAP
ACM      Author (A)      5835     Paper-Subject     3025     PSP
ACM      Subject (S)     56       -                 -        -
IMDB     Movie (M)       4275     Movie-Actor       12838    MAM
IMDB     Actor (A)       5431     Movie-Director    4280     MDM
IMDB     Director (D)    2082     Movie-Keyword     20529    MKM
IMDB     Keyword (K)     7313     -                 -        -
DBLP     Author (A)      4057     Author-Paper      19645    APA
DBLP     Paper (P)       14328    Paper-Conference  14328    APCPA
DBLP     Conference (C)  20       Paper-Term        88420    APTPA
DBLP     Term (T)        8789     -                 -        -
Table 2: Summary of heterogeneous graphs in experiments

5.1 Datasets

We evaluate the performance of HDGI in three heterogeneous graphs, and the detailed descriptions of them are shown in Table 2.

  • DBLP: The DBLP dataset we use comes from  [GLFSH09]. We choose Author as the target node, and authors can be divided into 4 areas: database, data mining, information retrieval, and machine learning. We will use the area an author belongs to as the label. The initial features of the target nodes are the bag-of-words embeddings based on profiles. The meta-paths we defined in DBLP are Author-Paper-Author (APA), Author-Paper-Conference-Paper-Author (APCPA) and Author-Paper-Term-Paper-Author (APTPA).

  • ACM: The ACM dataset is proposed by [WJSWCYY19]. The target nodes we choose are Papers, which can be categorized into 3 classes: database, wireless communication, and data mining. We extract 2 meta-paths from this graph: Paper-Author-Paper (PAP) and Paper-Subject-Paper (PSP). The feature of the ACM dataset is the TF-IDF-based embedding of paper keywords, and the dimension is 1870.

  • IMDB: It is a knowledge graph about movies. Movies belonging to three categories (Action, Comedy, and Drama) will be used as target nodes, and the meta-paths we choose are Movie-Actor-Movie (MAM), Movie-Director-Movie (MDM) and Movie-Keyword-Movie (MKM). The feature of the IMDB dataset is composed of {color, title, language, keywords, country, rating, year} with a TF-IDF encoding. The dimension of the IMDB movie node is 6334.

Available data            X            A             A         X, A        X, A    X, A    X, A    X, A, Y  X, A, Y  X, A, Y
Dataset  Train  Metric    Raw Feature  Metapath2vec  DeepWalk  DeepWalk+F  DGI     HDGI-A  HDGI-C  GCN      GAT      HAN
ACM      20%    Micro-F1  0.8590       0.6125        0.5503    0.8785      0.9104  0.9178  0.9227  0.9250   0.9178   0.9267
ACM      20%    Macro-F1  0.8585       0.6158        0.5582    0.8789      0.9104  0.9170  0.9232  0.9248   0.9172   0.9268
ACM      80%    Micro-F1  0.8820       0.6378        0.5788    0.8965      0.9175  0.9333  0.9379  0.9317   0.9250   0.9400
ACM      80%    Macro-F1  0.8802       0.6390        0.5825    0.8960      0.9155  0.9330  0.9379  0.9317   0.9248   0.9403
DBLP     20%    Micro-F1  0.7552       0.6985        0.2805    0.7163      0.8975  0.9062  0.9175  0.8192   0.8244   0.8992
DBLP     20%    Macro-F1  0.7473       0.6874        0.2302    0.7063      0.8921  0.8988  0.9094  0.8128   0.8148   0.8923
DBLP     80%    Micro-F1  0.8325       0.8211        0.3079    0.7860      0.9150  0.9192  0.9226  0.8383   0.8540   0.9100
DBLP     80%    Macro-F1  0.8152       0.8014        0.2401    0.7799      0.9052  0.9106  0.9153  0.8308   0.8476   0.9055
IMDB     20%    Micro-F1  0.5112       0.3985        0.3913    0.5262      0.5728  0.5482  0.5893  0.5931   0.5985   0.6077
IMDB     20%    Macro-F1  0.5107       0.4012        0.3888    0.5293      0.5690  0.5522  0.5914  0.5869   0.5944   0.6027
IMDB     80%    Micro-F1  0.5900       0.4203        0.3953    0.6017      0.6003  0.5861  0.6592  0.6467   0.6540   0.6600
IMDB     80%    Macro-F1  0.5884       0.4119        0.4001    0.6049      0.5950  0.5834  0.6646  0.6457   0.6550   0.6586
Table 3: The results of node classification tasks

5.2 Experimental Setup

There are many ways to measure the quality of learned representations, and the most commonly used tasks in graph-related research are node classification [PAS14, GL16, HYL17b] and node clustering [DCS17, WJSWCYY19]. We evaluate HDGI on both kinds of tasks.

5.2.1 Comparison methods

We compare our method HDGI with the following state-of-the-art methods including both supervised and unsupervised methods:
Unsupervised methods

  • Raw Feature: It represents the bag-of-words embedding, and we will directly test them in tasks.

  • Metapath2vec [DCS17]: A meta-path based heterogeneous graph embedding method, but it can only handle one specific meta-path.

  • DeepWalk [PAS14]: A random walk based graph embedding method, but it is designed for homogeneous graphs.

  • DeepWalk+Raw Feature(DeepWalk+F): We concatenate the embeddings learned from DeepWalk and the bag-of-words embeddings as the final representations.

  • DGI [velivckovic2018deep]: A mutual information based unsupervised learning method proposed for homogeneous graphs.

  • HDGI-C: The proposed method which uses graph convolutional network to capture local representations.

  • HDGI-A: The proposed method which uses attention mechanism (GAT [VCCRLB18]) to learn local representations.

Supervised methods

  • GCN [KW17]: GCN is a semi-supervised method for node classification in homogeneous graphs.

  • GAT [VCCRLB18]: GAT applies the attention mechanism to homogeneous graphs and requires a supervised setting.

  • HAN [WJSWCYY19]: HAN employs node-level attention and semantic-level attention to capture the information from all meta-paths.

For methods designed for homogeneous graphs including DeepWalk, DGI, GCN, GAT, we test the graph ignoring the heterogeneity and graphs constructed from every meta-path based adjacency matrix respectively, then report the best result. Metapath2vec can only handle one kind of meta-path, thus we test all meta-paths for it and report the best results.

5.2.2 Reproducibility

For the proposed HDGI, including HDGI-C and HDGI-A, we optimize the model with Adam [KB15]. The dimension of node-level representations in HDGI-C is set to 512 and the dimension of the semantic-level attention vector $\vec{q}$ is set to 8. For HDGI-A, we set the dimension of node-level representations to 64 and the number of attention heads to 4. The dimension of $\vec{q}$ is set to 8 as well. We employ PyTorch to implement our model and conduct experiments on a server with 4 GTX-1080Ti GPUs. Code is available at


Method        ACM NMI  ACM ARI  DBLP NMI  DBLP ARI  IMDB NMI  IMDB ARI
DeepWalk      25.47    18.24    7.40      5.30      1.23      1.22
Raw Feature   32.62    30.99    11.21     6.98      1.06      1.17
DeepWalk+F    32.54    31.20    11.98     6.99      1.23      1.22
Metapath2vec  27.59    24.57    34.30     37.54     1.15      1.51
DGI           41.09    34.27    59.23     61.85     0.56      2.6
HDGI-A        57.05    50.86    52.12     49.86     0.8       1.29
HDGI-C        54.35    49.48    60.76     62.67     1.87      3.7
Table 4: Evaluation results on the node clustering task

5.3 Results

Figure 5: The comparison between different global representation encoder functions. Panels: (a) Macro-F1, (b) Micro-F1.

5.3.1 Node classification task

In the node classification task, we train a logistic regression classifier for unsupervised learning methods, while the supervised methods output the classification result as end-to-end models. We conduct the experiments with two different training ratios (20% and 80%). To keep the results stable, we repeat the classification process 10 times and report the average Macro-F1 and Micro-F1 of all methods in Table 3.
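
For reference, the two reported metrics can be computed as in the sketch below for single-label multi-class predictions. The function name `f1_scores`, the convention of scoring a class with no true or predicted samples as 0, and the toy labels are assumptions of this example; the paper does not state which implementation was used.

```python
import numpy as np

def f1_scores(y_true, y_pred, num_classes):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for c in range(num_classes):
        tp[c] = np.sum((y_pred == c) & (y_true == c))
        fp[c] = np.sum((y_pred == c) & (y_true != c))
        fn[c] = np.sum((y_pred != c) & (y_true == c))
    # Micro-F1 pools the counts over classes before computing F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro-F1 averages the per-class F1 scores with equal class weight.
    denom = 2 * tp + fp + fn
    per_class = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return micro, per_class.mean()

# Hypothetical predictions for 6 nodes and 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
micro_f1, macro_f1 = f1_scores(y_true, y_pred, num_classes=3)
```

In the single-label setting Micro-F1 coincides with accuracy, while Macro-F1 is more sensitive to performance on small classes.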

In Table 3, we can observe that HDGI-C outperforms all other unsupervised learning methods. Compared with GCN and GAT, which are supervised but designed for homogeneous graphs, HDGI-C performs much better on DBLP and comparably on ACM and IMDB, which suggests that type information and semantic information are very important in heterogeneous graphs and need to be handled carefully instead of being ignored. HDGI is also competitive with the results reported for the supervised model HAN, which is designed for heterogeneous graphs. A likely reason is that HDGI can capture more global structural information when the mutual information plays a strong role in reconstructing the representation, while supervised loss based GNNs overemphasize the direct neighborhoods [velivckovic2018deep]. This also suggests that the features learned through supervised learning on graph structures may have limitations, arising either from the structure or from a task-based preference. These limitations can hurt the learning of more general representations.

5.3.2 Node clustering task

In the node clustering task, we use KMeans to conduct the clustering based on the learned representations. The number of clusters is set to the number of node classes. We do not use any labels in this unsupervised learning task, and we compare all unsupervised learning methods. To keep the results stable, we also repeat the clustering process 10 times and report the average NMI and ARI of all methods in Table 4. DeepWalk does not perform well because it is not able to handle the heterogeneity of graphs. Metapath2vec cannot handle diverse semantic information simultaneously, which makes its representations less effective. The node clustering results also demonstrate that HDGI can learn effective representations that consider the structural information, the semantic information and the information of individual nodes simultaneously.
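
As a reference for one of the two clustering metrics, the sketch below computes normalized mutual information (NMI) between a clustering and the ground-truth classes. It assumes normalization by the arithmetic mean of the two entropies, which is one common convention; the paper does not state which normalization was used, and the function name `nmi` and the toy labels are hypothetical.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information with arithmetic-mean normalization."""
    a_ids = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b_ids = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    contingency = np.zeros((a_ids.max() + 1, b_ids.max() + 1))
    np.add.at(contingency, (a_ids, b_ids), 1)     # co-occurrence counts
    joint = contingency / contingency.sum()
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    nz = joint > 0                                # skip empty cells to avoid log(0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha + hb == 0:                              # both partitions are a single cluster
        return 1.0
    return 2.0 * mi / (ha + hb)

# Hypothetical cluster assignments versus ground-truth classes.
classes = [0, 0, 1, 1, 2, 2]
relabelled = [2, 2, 0, 0, 1, 1]     # same partition, different cluster ids
unrelated = [0, 1, 0, 1]
nmi_same = nmi(classes, relabelled)
nmi_indep = nmi([0, 0, 1, 1], unrelated)
```

NMI depends only on the partition, not on the cluster ids, which is why it suits KMeans output whose cluster numbering is arbitrary.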

5.3.3 HDGI-A vs HDGI-C

The comparison between HDGI-C and HDGI-A in node classification tasks is informative. HDGI-C performs better than HDGI-A in all node classification experiments, which suggests that graph convolution works better than the attention mechanism in capturing local representations. We believe the reason is that the graph attention mechanism is strictly limited to the direct neighbors of nodes, whereas graph convolution, which considers hierarchical dependencies, can see farther than graph attention. This analysis is largely consistent with the results of the clustering task, where HDGI-C leads on DBLP and IMDB.

5.3.4 Comparison between different global representation encoder functions

We present the results of HDGI-C with different global representation encoder functions on the node classification task in Figure 5. The simple averaging function performs best among the candidate functions, but its advantage is very small; in fact, every function performs well on our experimental datasets. For larger and more complex heterogeneous graphs, our view is consistent with DGI [velivckovic2018deep] that a specialized and more sophisticated function may perform better. The selection and design of the global encoder function for heterogeneous graphs with different scales and structures is an open question worthy of further study.

6 Conclusion

In this paper, we propose an unsupervised graph neural network, HDGI, which learns node representations in heterogeneous graphs. HDGI combines several state-of-the-art techniques. It employs convolution-style GNNs along with a semantic-level attention mechanism to capture local representations of individual nodes. By maximizing the local-global mutual information through gradient descent over neural networks, HDGI learns high-level node representations containing graph-level structural information. It additionally exploits the meta-path structure to analyze the connection semantics in heterogeneous graphs. Node attributes are fused into the representations through the same local-global mutual information maximization. We demonstrate the effectiveness of the learned representations across a variety of node classification and clustering tasks on three heterogeneous graphs. HDGI is particularly competitive in node classification tasks with state-of-the-art supervised methods, which have access to additional label information. We are optimistic that mutual information maximization will be a promising future direction for unsupervised representation learning.