Interpretable and Efficient Heterogeneous Graph Convolutional Network

05/27/2020, by Yaming Yang, et al., Deakin University

Graph Convolutional Network (GCN) has achieved extraordinary success in learning effective high-level representations of nodes in graphs. However, the study regarding Heterogeneous Information Network (HIN) is still limited, because the existing HIN-oriented GCN methods suffer from two deficiencies: (1) they cannot flexibly exploit all possible meta-paths, and some even require the user to specify useful ones; (2) they often need to first transform an HIN into meta-path based graphs by computing commuting matrices, which has a high time complexity, resulting in poor scalability. To address the above issues, we propose interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn representations of nodes in HINs. It automatically extracts useful meta-paths for each node from all possible meta-paths (within a length limit determined by the model depth), which brings good model interpretability. It directly takes the entire HIN as input and avoids intermediate HIN transformation. The carefully designed hierarchical aggregation architecture avoids computationally inefficient neighborhood attention. Thus, it is much more efficient than previous methods. We formally prove ie-HGCN evaluates the usefulness of all possible meta-paths within a length limit (model depth), show it intrinsically performs spectral graph convolution on HINs, and analyze the time complexity to verify its quasi-linear scalability. Extensive experimental results on three real-world networks demonstrate the superiority of ie-HGCN over state-of-the-art methods.


1 Introduction

In the real world, a graph usually contains multiple types of nodes and edges; such a graph is called a heterogeneous graph, or Heterogeneous Information Network (HIN) [1]. Figure 1 (left) shows a toy HIN of the DBLP bibliographic network. It contains papers (P), authors (A), conferences (C) and terms (T). The edges from authors to papers are of the "Writing" type, while the edges from papers to conferences are of the "Published" type. By convention, in an HIN, the nodes are called objects, the edges are called links, and the types of links are called relations. The meta-path [1] is an important concept in HINs. It is defined as a composite relation between two object types. A meta-path usually conveys specific semantics, and different meta-paths have different importance for a specific task. For example, in DBLP, the meta-path Author-Paper-Author (abbreviated as APA) expresses the co-author relationship between authors, while Author-Paper-Conference (APC) expresses that authors publish papers in conferences. When predicting an author's affiliation, APA is more helpful than APC, since authors usually collaborate with colleagues in the same institution.

Properly learning representations of objects in an HIN can boost a variety of tasks such as object classification and link prediction [2]. Existing HIN embedding methods learn object representations in a non-parametric way by preserving specific structural properties. Among them, some methods [3, 4, 5] only preserve the first-order proximity conveyed by relations. Although other methods [6, 7, 8, 9, 10, 11, 12] preserve high-order structural proximities conveyed by meta-paths, they either require users to specify meta-paths [6, 7, 8, 9, 10] or cannot learn the importance of meta-paths for a task [11, 12]. To summarize: (1) with unsupervised structure-preserving training, the learned embeddings may not lead to optimal performance for a specific task; (2) none of these methods can automatically explore useful meta-paths from all possible meta-paths for specific tasks.

Fig. 1: A toy HIN of DBLP and its network schema.

Recently, Graph Convolutional Network (GCN) has been successfully applied to many graph analytical tasks such as node classification. Different from graph embedding, GCN encodes structural properties by convolution and uses task-specific objectives for training. Several recent works try to extend GCN to HINs. However, they still fail to fully and efficiently exploit the structural properties of HINs. Table I summarizes the key deficiencies of existing HIN GCN methods: (1) Some of them [13, 14, 15, 16] require the user to specify several useful meta-paths for a specific task, which is difficult for users without professional knowledge. (2) Meta-paths convey diverse structural proximities and rich semantics in an HIN. However, many methods [13, 14, 15, 17, 18, 19] cannot exploit all possible meta-paths, risking the loss of important structural information. They only exploit a subset of all possible meta-paths, such as user-specified symmetric meta-paths [13, 14, 15, 16], fixed-length meta-paths [18], or meta-paths that start from and end with the same object type [17]. HetGNN [19] samples neighbors for a target node by random walk and aggregates them by Bi-LSTM, so most structural information is lost. (3) Some methods [17, 18, 20, 21] do not distinguish the importance of meta-paths, failing to consider that not all meta-paths are useful for a specific task. (4) Many of them [13, 14, 15, 16, 17, 22] need to compute commuting matrices [1] by iterative multiplication of adjacency matrices, which has at least squared time complexity in the number of related objects. The resulting commuting matrices are very dense, and the longer the meta-paths, the denser the commuting matrices, which further increases the time complexity of the final graph convolution on these commuting matrices. Thus, these methods cannot scale well to large HINs.

Very recently, several HIN GCN methods [22, 23, 24] have been proposed that also consider all possible meta-paths. Among them, GTN [22] first computes meta-path based graphs for all possible meta-paths, and then performs graph convolution. However, it has two deficiencies: (1) It only keeps a learnable importance weight for each relation. The weight is shared among all the related objects, which is not flexible enough to capture the "personality" of different objects. For example, suppose we have a task to classify the research areas of papers in DBLP (the complete list of areas is in Section 5.1). Paper $p_1$ is published in an interdisciplinary conference such as WWW, and is connected to the term "Web Search". Paper $p_2$ is published in AAAI, and is connected to the term "Graph Algorithm". Obviously, the connected term of $p_1$ is more helpful for classifying $p_1$ as "information retrieval", while the conference where $p_2$ is published is more helpful for classifying $p_2$ as "artificial intelligence". GTN cannot handle this complexity among different objects. (2) It also needs to compute the commuting matrices (incorporating relation weights) for all possible meta-paths. Even with sparse-sparse matrix multiplication, it has at least squared time complexity. Therefore it cannot scale well to large HINs (Section 5.7). HetSANN [23] and HGT [24] directly aggregate the representations of heterogeneous neighbor objects by the multi-head attention mechanism [25], and add a residual connection after each layer. However, (1) the interpretability of the model is hindered by the multi-head concatenation and the residual connections, since they break the normalization property of probabilities, making it difficult to assess the contribution of different parts; (2) in real-life power-law networks, objects can have very high degrees, which makes the softmax calculation [26] in attention inefficient and further limits scalability.

TABLE I: Summary of Related Methods. The columns correspond to the method groups [20, 21], [23, 24], [13, 14], [15, 16], [18, 19], [17], [22], and ie-HGCN, which are compared on four properties: (1) NU - does not require user prior knowledge; (2) AMP - exploits all possible meta-paths; (3) UMP - automatically discovers and efficiently shows useful meta-paths; (4) LS - linear or quasi-linear scalability. Only ie-HGCN satisfies all four properties.

To fully and efficiently exploit the structural properties of HINs, we propose the interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN), which directly takes an HIN as input and performs multiple layers of heterogeneous convolution on the HIN to learn task-specific object representations. Each layer of ie-HGCN performs three key steps to obtain higher-level object representations: (1) Projection. We define relation-specific projection matrices to project the hidden representations of heterogeneous neighbor objects (input object features in the first layer) into a common semantic space corresponding to the target object type. We additionally define self-projection matrices (one for each object type) to project the representations of the target objects into the new common semantic space as well. (2) Object-level Aggregation. Given the adjacency matrix between the target objects of one type and their neighbor objects of another type, we use its row-normalized version to perform within-type aggregation over the neighbor objects of each target object. We show that the first two steps intrinsically define a heterogeneous spectral graph convolution operation on the bipartite graph described by this adjacency matrix, with the projection matrices in the first step as convolution filters. (3) Type-level Aggregation. We develop a type-level attention mechanism to learn the importance of different types of neighbors for a target object and aggregate the object-level aggregation results accordingly.

Compared to existing HIN GCN methods, the proposed ie-HGCN has two salient features as follows:

(1) Interpretability: By stacking multiple layers, the proposed type-level attention and convolutional aggregation facilitate adaptively learning the importance score of each meta-path for each object, which enhances the interpretability of the model. We formally prove that ie-HGCN can evaluate all possible meta-paths within a length limit (i.e., model depth) in Section 4.5.1.

(2) Efficiency: ie-HGCN evaluates various meta-paths as the multi-layer iterative calculation proceeds. Hence, it avoids the computation of meta-path based graphs, which is quite time-consuming. Moreover, in each layer ie-HGCN first uses row-normalized adjacency matrices (a reasonable choice, discussed in Section 4.2) to aggregate a target object's neighbors of each type into a per-type summary, and then uses type-level attention to fuse these summaries. This hierarchical aggregation architecture makes our model efficient because: (a) it avoids large-scale softmax calculation over the neighborhood of a target object; (b) an HIN usually has a small number of object types, which makes the attention calculation very efficient. In Section 4.5.3, we analyze the time complexity to verify its quasi-linear scalability.

We conduct extensive experiments to show the superior performance of ie-HGCN against state-of-the-art methods on three benchmark datasets.

2 Related Work

HIN Embedding Methods: In recent years, a series of methods have been proposed to learn representations of objects in HINs. EOE [5], PTE [4] and HEER [3] split an HIN into several bipartite graphs, and then use the LINE model [27] to learn object representations by preserving the first-order or the second-order proximities. Based on user-specified meta-paths, ESim [8] first samples path instances and then learns object representations such that objects which co-occur in many path instances have similar representations. HIN2Vec [12] learns representations of objects and meta-paths by predicting whether two objects have a specific relation. HINE [11] learns object representations by minimizing the distance between two distributions that respectively model the meta-path based proximity on the graph and the first-order proximity [27] in the embedding space. Metapath2vec [6] and SHNE [7] first sample path instances guided by a set of user-specified meta-paths, and then learn object representations by their proposed heterogeneous skip-gram. HERec [9] and MCRec [10] first perform meta-path based random walks, and then learn object representations accordingly for recommendation tasks. However, these methods cannot learn task-specific embeddings. Although structural properties are exploited, none of them can automatically learn meta-path importance for all meta-paths within a length limit, let alone task-specific importance.

GCNs for Homogeneous Graphs:

Inspired by the great success of convolutional neural networks in computer vision, researchers have tried to generalize convolution to graphs [28]. Bruna et al. [29] first develop a graph convolution operation based on the graph Laplacian in the spectral domain, inspired by the Fourier transform in signal processing. ChebNet [30] improves its efficiency by using K-order Chebyshev polynomials. Kipf et al. [31] further introduce a first-order approximation of the K-order Chebyshev polynomials to build efficient deep models. GAT [25] learns different importance for the nodes in a node's neighborhood based on a masked self-attention mechanism. Hamilton et al. propose a general inductive framework, GraphSAGE [32], which generates node embeddings by sampling neighbor nodes and aggregating their features with aggregator functions. However, all these methods are developed for homogeneous graphs and cannot be directly applied to HINs because of heterogeneity.

GCNs for Heterogeneous Graphs: Based on user-specified symmetric meta-paths, HAN [13], HAHE [14], DeepHGNN [15], and GraphInception [17] transform an HIN into several homogeneous graphs by computing commuting matrices. Then, they apply GCN to the resulting homogeneous graphs. For each user-specified meta-path, MAGNN [16] first performs intra-metapath aggregation by encoding all the object features along each path instance of the meta-path, and then performs inter-metapath aggregation by an attention mechanism. HetGNN [19] samples a fixed number of neighbors in the vicinity of an object via random walk with restart, and aggregates these neighbors by Bi-LSTM. R-GCN [20] and Decagon [21] use different weight matrices for different relations, and sum the convolution results of different types of neighbors. ActiveHNE [18] takes the entire HIN as input and concatenates the convolution results in each convolution layer. GTN [22] first computes all possible meta-path based graphs by iterative matrix multiplication of two softly selected adjacency matrices, and then performs graph convolution on the resulting graphs. HetSANN [23] and HGT [24] extend GAT [25] to HINs; they directly use the attention mechanism to aggregate different types of neighbors. However, these methods either cannot discover useful meta-paths from all possible meta-paths [13, 14, 15, 17, 16, 18, 19, 20, 21, 23, 24], or have limited scalability [13, 14, 15, 17, 16, 22].

3 Problem Formulation

(a) An instance of ie-HGCN on DBLP
(b) The calculation flow in a block
Fig. 2: The overall architecture of ie-HGCN on DBLP. (a): An instance of ie-HGCN with 5 layers. The solid lines stand for the relation-specific projections, and the dashed lines stand for the dummy self-relation projections. In a classification task, the softmax function can be applied to the target object representations in the last layer to obtain prediction scores. (b): The Paper block in a layer. The relation-specific projection matrices project the representations of conference, author and term objects from their own semantic spaces into a new common "Paper" semantic space, and the self-projection matrix projects the representations of paper objects from the original "Paper" space into the same new space. We use the same shape to denote that the projected object representations are located in the new common "Paper" semantic space. The row-normalized adjacency matrices are used for object-level aggregation, and type-level attention is used for type-level aggregation.

We first introduce some important concepts about HINs [1], and then formally define the problem we study in this paper.

Definition 1.

Heterogeneous Information Network (HIN). A heterogeneous information network is defined as $G = (V, E)$, where $V$ is the set of objects and $E$ is the set of links. $\phi: V \rightarrow \mathcal{A}$ and $\psi: E \rightarrow \mathcal{R}$ are respectively the object type mapping function and the link type mapping function, where $\mathcal{A}$ denotes the set of object types and $\mathcal{R}$ denotes the set of relations, with $|\mathcal{A}| + |\mathcal{R}| > 2$. Let $V_{\Omega}$ denote the set of objects of type $\Omega \in \mathcal{A}$, and let $N_{\Omega}$ denote the set of object types that have relations pointing to $\Omega$; each $\Gamma \in N_{\Omega}$ is a neighbor object type of $\Omega$. We slightly abuse notation and also use $\Omega$ as the index of the object type in $\mathcal{A}$. The relation from $\Gamma$ to $\Omega$ is denoted as $\langle \Gamma, \Omega \rangle$ or $\Gamma\text{-}\Omega$.

Definition 2.

Network Schema. Given an HIN $G = (V, E)$ with $\phi: V \rightarrow \mathcal{A}$ and $\psi: E \rightarrow \mathcal{R}$, the network schema is a directed graph defined over the object types $\mathcal{A}$, with the relations in $\mathcal{R}$ as edges, denoted as $S = (\mathcal{A}, \mathcal{R})$. It is a meta template for $G$.

Definition 3.

Meta-path. A meta-path is essentially a path defined on the network schema $S$. It is denoted in the form $\Omega_1 \xrightarrow{R_1} \Omega_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} \Omega_{l+1}$ (abbreviated as $\Omega_1 \Omega_2 \cdots \Omega_{l+1}$), which describes the composite relation $R_1 \circ R_2 \circ \cdots \circ R_l$ between object types $\Omega_1$ and $\Omega_{l+1}$, where $\circ$ denotes the composition operator on relations. Here $l$ is the length of the meta-path, i.e. the number of relations in it. A meta-path is symmetric if its corresponding composite relation is symmetric. A path instance of a meta-path is a concrete path in an HIN that instantiates the meta-path.

Figure 1 shows a toy HIN of DBLP (left) and its network schema (right). It contains four object types: "Paper" (P), "Author" (A), "Conference" (C) and "Term" (T), and six relations: "Publishing" and "Published" between C and P, "Writing" and "Written" between A and P, and "Containing" and "Contained" between P and T. For object type P, its set of neighbor object types is $N_P = \{A, C, T\}$. The meta-path APA is symmetric, while APC is asymmetric, and both have length 2. As shown in the figure, when an author has written a paper that is published in a conference, the corresponding concrete author-paper-conference path is a path instance of APC.
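To make the schema concrete, the following minimal sketch (our own illustration, not code from the paper; the dictionary encoding is an assumption) represents the DBLP network schema of Figure 1 as plain Python dictionaries, and checks whether a sequence of object types is a valid meta-path on that schema.

```python
# A minimal, hypothetical encoding of the DBLP network schema in Figure 1.
# Object types: Paper (P), Author (A), Conference (C), Term (T).
object_types = ["P", "A", "C", "T"]

# For each object type, the neighbor object types that have a relation
# pointing to it (e.g., authors write papers, so "A" is a neighbor type of "P").
neighbor_types = {
    "P": ["A", "C", "T"],   # Written, Publishing, Containing
    "A": ["P"],             # Writing
    "C": ["P"],             # Published
    "T": ["P"],             # Contained
}

# A meta-path is a sequence of object types consistent with the schema,
# e.g. A-P-C ("authors publish papers in conferences").
def is_valid_metapath(path):
    """Check that every consecutive pair of types is connected in the schema."""
    return all(src in neighbor_types[dst] for src, dst in zip(path, path[1:]))

print(is_valid_metapath(["A", "P", "C"]))  # True
print(is_valid_metapath(["A", "C"]))       # False: no direct A-C relation
```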

Definition 4.

HIN Representation Learning. Given an HIN $G = (V, E)$, the problem is to learn representation matrices for a specific task such as object classification. For each object type $\Omega \in \mathcal{A}$, the representation matrix is denoted as $H^{\Omega} \in \mathbb{R}^{|V_{\Omega}| \times d}$, where $d$ is the representation dimensionality. For an object $v_i \in V_{\Omega}$, its corresponding representation vector is the $i$-th row of $H^{\Omega}$, which is a $d$-dimensional vector.

Notations and Descriptions
$H^{\Gamma}$: Hidden representations of $\Gamma$ from the previous layer
$H'^{\Omega}$: New representations of $\Omega$ in the current layer
$W^{\Omega\text{-}\Omega}$: Dummy self-relation projection matrix
$W^{\Omega\text{-}\Gamma}$: Relation-specific projection matrix
$Y^{\Omega\text{-}\Omega}$ / $Y^{\Omega\text{-}\Gamma}$: Projected representations of $\Omega$ / $\Gamma$
$\hat{A}^{\Omega\text{-}\Gamma}$: Row-normalized adjacency matrix, i.e. $(D^{\Omega\text{-}\Gamma})^{-1} A^{\Omega\text{-}\Gamma}$
$Z^{\Omega\text{-}\Gamma}$: Convolved representations from $\Gamma$ to $\Omega$
$W_q^{\Omega}$ / $W_k^{\Omega}$: Attention query/key parameters
$w_a^{\Omega}$: Attention parameters
$Q^{\Omega}$: Mapped queries for $\Omega$
$K^{\Omega\text{-}\Omega}$ / $K^{\Omega\text{-}\Gamma}$: Mapped keys for $\Omega$ / $\Gamma$
$e^{\Omega\text{-}\Omega}$ / $e^{\Omega\text{-}\Gamma}$: Unnormalized attention coefficients for $\Omega$ / $\Gamma$
$\alpha^{\Omega\text{-}\Omega}$ / $\alpha^{\Omega\text{-}\Gamma}$: Normalized attention coefficients for $\Omega$ / $\Gamma$
TABLE II: Main Notations.

4 Model

In this section, we present the ie-HGCN method. Figure 2(a) shows the overall architecture of ie-HGCN for the network schema of DBLP. Each layer consists of one block per object type. In each block, three key calculation steps are performed. Figure 2(b) shows the calculation flow of the Paper block in a layer. In the following, we elaborate on the three key calculation steps of the block for a target object type $\Omega$; the process is the same in the other blocks. The main notations used in this paper are summarized in Table II. We use bold uppercase/lowercase letters to denote matrices/vectors. For clarity, we omit the layer indices of all layer-specific notations.

4.1 Projection

For different types of objects, their features are located in different semantic spaces. Therefore, in each block, we first project the representations of the different types of neighbor objects into a new common semantic space. The input to the block for target type $\Omega$ is a set of hidden representation matrices (input feature matrices in the first layer) $\{H^{\Omega}\} \cup \{H^{\Gamma} \mid \Gamma \in N_{\Omega}\}$, obtained from the previous layer, where $H^{\Omega}$ and $H^{\Gamma}$ are the representation matrices of $V_{\Omega}$ and $V_{\Gamma}$ respectively. For each neighbor object type $\Gamma \in N_{\Omega}$, we define a relation-specific projection matrix $W^{\Omega\text{-}\Gamma}$ for the relation $\langle \Gamma, \Omega \rangle$. It projects $H^{\Gamma}$ from the semantic space of $\Gamma$ into a new common semantic space associated with $\Omega$. Besides, to project $H^{\Omega}$ from its original feature space into the new common space as well, we additionally define a projection matrix $W^{\Omega\text{-}\Omega}$. Here $W^{\Omega\text{-}\Omega}$ is simply a projection matrix, not a relation-specific projection matrix; for convenience, we call $\langle \Omega, \Omega \rangle$ the dummy self-relation. When the real self-relation exists, i.e. $\Omega \in N_{\Omega}$, $W^{\Omega\text{-}\Omega}$ is a relation-specific projection matrix. Note that each relation has its own relation-specific projection matrix, and different relations have different ones. The projection is formulated as follows:

$Y^{\Omega\text{-}\Omega} = H^{\Omega} \, W^{\Omega\text{-}\Omega}, \qquad Y^{\Omega\text{-}\Gamma} = H^{\Gamma} \, W^{\Omega\text{-}\Gamma}, \quad \Gamma \in N_{\Omega}$    (1)

where $Y^{\Omega\text{-}\Omega}$ and $Y^{\Omega\text{-}\Gamma}$ are the projected hidden representations located in the new common space.

For example, as illustrated in Figure 2(b), $W^{P\text{-}C}$ projects $H^{C}$ from the "Conference" space into a new common "Paper" space, while $H^{P}$ is originally located in the "Paper" space and $W^{P\text{-}P}$ projects it from the original space into the new common space.
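As a concrete illustration of the projection step, here is a hedged PyTorch-style sketch (our own simplification; the class name HeteroProjection and the dictionary interface are assumptions, not the authors' released code). It defines one relation-specific projection per incoming relation plus a self-projection for the target type, as in Eq. (1).

```python
import torch
import torch.nn as nn

class HeteroProjection(nn.Module):
    """Project the target type's and each neighbor type's representations
    into a common semantic space of the target type (Eq. (1))."""

    def __init__(self, in_dims, target, out_dim):
        # in_dims: dict mapping object type -> input dimensionality,
        #          containing the target type and its neighbor types.
        # target:  the target object type of this block, e.g. "P".
        super().__init__()
        self.target = target
        self.self_proj = nn.Linear(in_dims[target], out_dim, bias=False)
        self.rel_proj = nn.ModuleDict({
            nbr: nn.Linear(dim, out_dim, bias=False)
            for nbr, dim in in_dims.items() if nbr != target
        })

    def forward(self, H):
        # H: dict mapping object type -> hidden representation matrix.
        Y = {self.target: self.self_proj(H[self.target])}
        for nbr, proj in self.rel_proj.items():
            Y[nbr] = proj(H[nbr])
        return Y  # all matrices now live in the target type's common space
```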

4.2 Object-level Aggregation

After projecting all the hidden representations of neighbor objects into a common semantic space, we perform object-level aggregation. In the following, we take aggregating hidden representations from $\Gamma$ to $\Omega$ as an example. We cannot directly apply GCN [31] to this aggregation, because the neighbors of an object are of different types in HINs (i.e., the heterogeneity of HINs), and the adjacency matrix between two different types of objects may not even be square. In this paper, given the adjacency matrix $A^{\Omega\text{-}\Gamma}$ between $V_{\Omega}$ and $V_{\Gamma}$, we first compute its row-normalized version $\hat{A}^{\Omega\text{-}\Gamma} = (D^{\Omega\text{-}\Gamma})^{-1} A^{\Omega\text{-}\Gamma}$, where $D^{\Omega\text{-}\Gamma}$ is the corresponding diagonal degree matrix. Then, we define the heterogeneous graph convolution as follows:

$Z^{\Omega\text{-}\Gamma} = \hat{A}^{\Omega\text{-}\Gamma} \, Y^{\Omega\text{-}\Gamma} = \hat{A}^{\Omega\text{-}\Gamma} \, H^{\Gamma} \, W^{\Omega\text{-}\Gamma}$    (2)

Each row of $\hat{A}^{\Omega\text{-}\Gamma}$ serves as normalized coefficients for computing a linear combination of the corresponding projected representations of $V_{\Gamma}$. For symbolic consistency, we let $Z^{\Omega\text{-}\Omega} = Y^{\Omega\text{-}\Omega}$. Thus, we obtain a set of convolved representations $\{Z^{\Omega\text{-}\Omega}\} \cup \{Z^{\Omega\text{-}\Gamma} \mid \Gamma \in N_{\Omega}\}$, each of which contributes to $\Omega$ from one aspect. Take the Paper block in Figure 2(b) as an example: we use $\hat{A}^{P\text{-}C}$, $\hat{A}^{P\text{-}A}$ and $\hat{A}^{P\text{-}T}$ to respectively aggregate the projected representations of the paper objects' neighboring conference, author and term objects, obtaining $\{Z^{P\text{-}P}, Z^{P\text{-}C}, Z^{P\text{-}A}, Z^{P\text{-}T}\}$.

Although Eq. (2) is similar to the aggregation ideas in previous methods [18, 17, 20, 21], our design still has some novel aspects: (1) Different from previous methods, we compute the self-representation $Z^{\Omega\text{-}\Omega}$, which, together with the attentive type-level aggregation introduced in the next subsection, enables ie-HGCN to evaluate the usefulness of all meta-paths within a length limit (the model depth). We prove this in Section 4.5.1. (2) Since $A^{\Omega\text{-}\Gamma}$ is usually not a square matrix and consequently cannot be eigendecomposed to obtain a Fourier basis, no previous work provides a theoretical analysis formally showing that Eq. (2) is a proper convolution. In Section 4.5.2, we show that Eq. (2) is intrinsically a spectral graph convolution on the bipartite graph. Moreover, we could also implement an attention mechanism for object-level aggregation similar to that in [25]. However, in this work, we simply use $\hat{A}^{\Omega\text{-}\Gamma}$ to perform object-level aggregation, considering that object-level attention is computationally inefficient and that the (weighted) adjacency matrices of real-world complex networks are often sufficient to reflect the relative importance among objects. Take IMDB as an example: the rating between a user and a movie naturally reflects the preference of the user towards the movie. Empirical results also support this choice.
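The object-level aggregation of Eq. (2) is essentially a (sparse) matrix product with the row-normalized adjacency matrix. A minimal sketch under the same assumptions as above (the helper functions are ours, not the paper's code):

```python
import torch

def row_normalize(adj):
    """Row-normalize an adjacency matrix: A_hat = D^{-1} A."""
    adj = adj.to_dense() if adj.is_sparse else adj
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)  # avoid division by zero
    return adj / deg

def object_level_aggregate(A_hat, Y_nbr):
    """Eq. (2): Z = A_hat @ Y, aggregating projected neighbor representations
    for each target object. A_hat has shape (|V_target|, |V_neighbor|)."""
    return A_hat @ Y_nbr

# Toy example: 3 papers, 2 conferences, projected conference features of size 4.
A_pc = torch.tensor([[1., 0.], [0., 1.], [1., 0.]])
Y_c = torch.randn(2, 4)
Z_pc = object_level_aggregate(row_normalize(A_pc), Y_c)  # shape (3, 4)
```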

4.3 Type-level Aggregation

To learn more comprehensive representations for $V_{\Omega}$, we need to fuse the representations convolved from different types of neighbor objects. In a specific task, for a target object, the information from different types of neighbor objects may have different importance. Take paper objects in DBLP as an example: in the task of predicting a paper's quality, the representation of the conference where the paper is published could be more important. To this end, we propose type-level attention to automatically learn the importance weights of different types of neighbor objects, and then aggregate the corresponding convolved representations by a weighted sum. The proposed attention mechanism also enables the model to evaluate all possible meta-paths within a length limit (the model depth) for a particular task, which we prove in Section 4.5.1.

An attention mechanism maps a set of queries and a set of key-value pairs to an output. In practice, the queries, keys and values are packed into matrices $Q$, $K$ and $V$ respectively, and the output is computed as $\mathrm{Attention}(Q, K, V) = f(Q, K)\,V$, where $f$ is an attention function such as dot-product [33] or a neural network [25]. Here, the convolved representations $Z^{\Omega\text{-}\Omega}$ and $Z^{\Omega\text{-}\Gamma}$ are the values. We define a weight matrix $W_k^{\Omega}$ to map them into keys, and a weight matrix $W_q^{\Omega}$ to map $Z^{\Omega\text{-}\Omega}$ into the query, where the hidden layer dimensionality of the type-level attention is $d_a$:

$Q^{\Omega} = Z^{\Omega\text{-}\Omega} W_q^{\Omega}, \qquad K^{\Omega\text{-}\Omega} = Z^{\Omega\text{-}\Omega} W_k^{\Omega}, \qquad K^{\Omega\text{-}\Gamma} = Z^{\Omega\text{-}\Gamma} W_k^{\Omega}, \quad \Gamma \in N_{\Omega}$    (3)

Since we want to assess the importance of $Z^{\Omega\text{-}\Omega}$ and the $Z^{\Omega\text{-}\Gamma}$'s with respect to $\Omega$ when calculating the next-layer representations, it is intuitive to map all of them into keys and to map $Z^{\Omega\text{-}\Omega}$ into the query. This is different from previous methods [13, 14, 22], where the query is a parameter vector. Note that mapping $Z^{\Omega\text{-}\Omega}$ into the query is also the key to achieving personalized importance estimation for each object. The attention function is implemented as follows:

$e^{\Omega\text{-}\Omega} = \mathrm{ELU}\big([\,Q^{\Omega} \,\|\, K^{\Omega\text{-}\Omega}\,]\, w_a^{\Omega}\big), \qquad e^{\Omega\text{-}\Gamma} = \mathrm{ELU}\big([\,Q^{\Omega} \,\|\, K^{\Omega\text{-}\Gamma}\,]\, w_a^{\Omega}\big)$    (4)

where $\|$ denotes the row-wise concatenation operation, $w_a^{\Omega}$ is the attention parameter vector, and ELU [34] is the activation function. The $i$-th elements of $e^{\Omega\text{-}\Omega}$ and $e^{\Omega\text{-}\Gamma}$ respectively reflect the unnormalized importance of object $i$ itself and of its $\Gamma$-type neighbors when calculating its higher-level representation. The normalized attention coefficients are then computed by applying the softmax function:

$\big[\alpha^{\Omega\text{-}\Omega} \,\|\, \alpha^{\Omega\text{-}\Gamma_1} \,\|\, \cdots\big] = \mathrm{softmax}\big(\big[\,e^{\Omega\text{-}\Omega} \,\|\, e^{\Omega\text{-}\Gamma_1} \,\|\, \cdots\,\big]\big), \quad \Gamma_1, \ldots \in N_{\Omega}$    (5)

where softmax is applied to the operand row-wise. The normalized attention coefficients are used to compute the higher-level representations of $V_{\Omega}$ as a weighted combination of the corresponding values:

$H'^{\Omega}_{i,:} = \sigma\Big(\alpha^{\Omega\text{-}\Omega}_{i} \, Z^{\Omega\text{-}\Omega}_{i,:} + \sum_{\Gamma \in N_{\Omega}} \alpha^{\Omega\text{-}\Gamma}_{i} \, Z^{\Omega\text{-}\Gamma}_{i,:}\Big)$    (6)

where $\sigma$ is the nonlinearity, the subscript $i$ ($i,:$) denotes the $i$-th element (row) of a vector (matrix), and $i$ corresponds to the $i$-th object in $V_{\Omega}$. The new representations $H'^{\Omega}$ are in turn used as the input of the blocks in the next layer. The final representations of objects are output by the blocks in the last layer.
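A hedged sketch of the type-level attention of Eqs. (3)-(6) follows (our own simplification; the module name and the use of a single key matrix shared across the self part and all neighbor types are assumptions consistent with the text, not the authors' released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeLevelAttention(nn.Module):
    """Fuse the convolved representations {Z_self, Z_nbr1, ...} per Eqs. (3)-(6)."""

    def __init__(self, dim, att_dim=64):
        super().__init__()
        self.Wq = nn.Linear(dim, att_dim, bias=False)    # query mapping (Eq. 3)
        self.Wk = nn.Linear(dim, att_dim, bias=False)    # key mapping (Eq. 3)
        self.wa = nn.Linear(2 * att_dim, 1, bias=False)  # attention vector (Eq. 4)

    def forward(self, Z_self, Z_nbrs):
        # Z_self: (n, dim); Z_nbrs: list of (n, dim) tensors, one per neighbor type.
        values = [Z_self] + Z_nbrs
        q = self.Wq(Z_self)                               # queries from the self part
        scores = []
        for Z in values:
            k = self.Wk(Z)
            e = F.elu(self.wa(torch.cat([q, k], dim=1)))  # unnormalized coefficients (Eq. 4)
            scores.append(e)
        alpha = torch.softmax(torch.cat(scores, dim=1), dim=1)  # (n, num_types), Eq. (5)
        out = sum(alpha[:, i:i + 1] * Z for i, Z in enumerate(values))
        return F.elu(out), alpha                          # new representations (Eq. 6)
```

The returned alpha can also be inspected per object, which is what makes the per-object meta-path analysis in Section 4.5.1 possible.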

4.4 Loss

Once the final representations of objects are obtained from the last layer, they can be used for a variety of tasks such as classification, clustering, etc. The loss function can be defined depending on the specific task. For semi-supervised multi-class object classification, it can be defined as the (possibly weighted) sum of the cross-entropy over all labeled objects of each object type:

$\mathcal{L} = -\sum_{\Omega \in \mathcal{A}} \sum_{i \in \mathcal{Y}_{\Omega}} \sum_{c \in \mathcal{C}_{\Omega}} y^{\Omega}_{ic} \, \ln \hat{y}^{\Omega}_{ic}$    (7)

where $\mathcal{Y}_{\Omega}$ is the set of indices of labeled objects in $V_{\Omega}$, $\mathcal{C}_{\Omega}$ is the set of class indices for $\Omega$, and $y^{\Omega}_{ic}$ and $\hat{y}^{\Omega}_{ic}$ are respectively the ground-truth label indicator and the predicted score of object $i$ on class $c$. We minimize the loss by back-propagation. The overall training procedure of ie-HGCN is shown in Algorithm 1, where we index layers by square brackets.
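Before turning to the pseudocode, a minimal PyTorch sketch of this loss (our own illustration; the dictionary-based bookkeeping and the function name are assumptions):

```python
import torch
import torch.nn.functional as F

def multi_type_classification_loss(logits, labels, labeled_idx):
    """Eq. (7): sum of cross-entropy over labeled objects of each object type.

    logits:      dict type -> (|V_type|, num_classes) prediction scores
    labels:      dict type -> (|V_type|,) ground-truth class indices
    labeled_idx: dict type -> LongTensor of indices of labeled objects
    """
    loss = 0.0
    for obj_type, idx in labeled_idx.items():
        if idx.numel() == 0:
            continue  # no labels for this type (e.g., papers or terms in DBLP)
        loss = loss + F.cross_entropy(logits[obj_type][idx], labels[obj_type][idx])
    return loss
```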

Input:  The HIN $G = (V, E)$, $\phi: V \rightarrow \mathcal{A}$, $\psi: E \rightarrow \mathcal{R}$,
        the object feature matrices $\{H^{\Omega}[1] \mid \Omega \in \mathcal{A}\}$,
        the number of layers $L$.
Output: The final representations $\{H^{\Omega}[L] \mid \Omega \in \mathcal{A}\}$.
1  Initialize parameters, and let $H^{\Omega}[1]$ be the input feature matrix of each object type $\Omega$;
2  for $l = 1, \ldots, L-1$ do
3      for each object type $\Omega \in \mathcal{A}$ do
4          $Z^{\Omega\text{-}\Omega} \leftarrow H^{\Omega}[l] \, W^{\Omega\text{-}\Omega}[l]$;
5          for each neighbor type $\Gamma \in N_{\Omega}$ do
6              $Z^{\Omega\text{-}\Gamma} \leftarrow \hat{A}^{\Omega\text{-}\Gamma} \, H^{\Gamma}[l] \, W^{\Omega\text{-}\Gamma}[l]$;
7          end for
8          Compute the normalized attention coefficients from $Z^{\Omega\text{-}\Omega}$ and the $Z^{\Omega\text{-}\Gamma}$'s according to Eq. (3-5);
9          Compute $H^{\Omega}[l+1]$ according to Eq. (6);
10     end for
11 end for
12 Compute the loss and update parameters by gradient descent;
13 return $\{H^{\Omega}[L] \mid \Omega \in \mathcal{A}\}$.
Algorithm 1: The pseudocode of ie-HGCN.

4.5 Analysis

4.5.1 Automatically learning useful meta-paths.

The most important highlight of ie-HGCN is that it evaluates all possible meta-paths whose length is less than the number of layers of the model. We formalize this property in the following theorem:

Theorem 1.

For an object type $\Omega$, let $\mathcal{P}^{l}_{\Omega}$ denote the set of all possible meta-paths that have length greater than or equal to 0 and less than $l$, and that end with object type $\Omega$. In the $l$-th layer, the output hidden representation $H^{\Omega}[l]$ evaluates all the meta-paths in $\mathcal{P}^{l}_{\Omega}$.

Proof.

We prove the theorem by mathematical induction.

The base case: When $l = 1$, $H^{\Omega}[1]$ is the input feature matrix of $V_{\Omega}$. Obviously, the only meta-path evaluated is $\Omega$ itself, which has length 0 and ends with $\Omega$, i.e., $\mathcal{P}^{1}_{\Omega} = \{\Omega\}$.

The step case: Assume the theorem holds for layer $l$, i.e. $H^{\Omega}[l]$ evaluates $\mathcal{P}^{l}_{\Omega}$ for every object type $\Omega$. For layer $l+1$, $H^{\Omega}[l+1]$ is an attention-weighted combination of $Z^{\Omega\text{-}\Omega}$ and $\{Z^{\Omega\text{-}\Gamma} \mid \Gamma \in N_{\Omega}\}$ computed from the layer-$l$ representations. According to Eq. (2), $Z^{\Omega\text{-}\Omega}$ is a linear projection of $H^{\Omega}[l]$, which evaluates $\mathcal{P}^{l}_{\Omega}$ by assumption; $Z^{\Omega\text{-}\Gamma} = \hat{A}^{\Omega\text{-}\Gamma} H^{\Gamma}[l] W^{\Omega\text{-}\Gamma}$, where $H^{\Gamma}[l]$ evaluates $\mathcal{P}^{l}_{\Gamma}$ by assumption. The heterogeneous graph convolution appends the relation $\langle \Gamma, \Omega \rangle$ at the end of every meta-path in $\mathcal{P}^{l}_{\Gamma}$. Aggregating over all $\Gamma \in N_{\Omega}$ yields all meta-paths of length at least 1 and less than $l+1$ that end with $\Omega$. Uniting these with $\mathcal{P}^{l}_{\Omega}$ from $Z^{\Omega\text{-}\Omega}$, we conclude that $H^{\Omega}[l+1]$ evaluates $\mathcal{P}^{l+1}_{\Omega}$.

Therefore, the theorem holds. ∎

The proposed ie-HGCN can capture objects’ personalized preference for different meta-paths because each object has its own attention coefficients. GTN cannot capture such personalized meta-path importance, since its importance weights for relations are shared by all the related objects. We can obtain the importance score of a meta-path to a specific target object by summing the scores of all its path instances ending with that object. The score of a path instance is intuitively calculated by multiplying the attention coefficients and the link weights (from the corresponding normalized adjacency matrices for real relations, or 1 for dummy self-relations) between objects along the path. Since path instances often share attention coefficients, we could efficiently aggregate subpath scores iteratively during the forward propagation of ie-HGCN, recording in each block the aggregation scores for different meta-paths up to that block.

To show the importance of meta-paths in general, we can first calculate the mean attention distribution in each block, and then calculate the importance scores of meta-paths based on these mean distributions. See the computation details in Section 5.5 and Appendix A.1.
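To make the scoring rule concrete, here is a hedged sketch of computing a meta-path's importance score from the mean attention coefficients (our own illustration; the data structure is hypothetical). It mirrors the computation in Appendix A.1: the score of one path of blocks is the product of the mean attention coefficients picked up at each layer transition, and equivalent paths that differ only in dummy self-relations are summed.

```python
def metapath_score(mean_attention, path_types):
    """Multiply the mean attention coefficients along one path of blocks.

    mean_attention: dict (layer, target_type) -> dict source_type -> coefficient,
                    where source_type may also be the dummy self-relation.
    path_types:     the sequence of block types the information flows through,
                    e.g. ["C", "P", "P", "P", "A"] for A-P-C padded with
                    dummy self-relations.
    """
    score = 1.0
    for layer, (src, dst) in enumerate(zip(path_types, path_types[1:]), start=1):
        score *= mean_attention[(layer, dst)][src]
    return score

def merged_score(mean_attention, equivalent_paths):
    """Sum over all equivalent paths that differ only in where the dummy
    self-relations are inserted."""
    return sum(metapath_score(mean_attention, p) for p in equivalent_paths)
```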

4.5.2 Connection to spectral graph convolution.

We can also derive the heterogeneous graph convolution presented in Eq. (2) by connecting it to the spectral domain of bipartite graphs (when the real self-relation exists, the following derivation still holds by treating $\Omega$ itself as one of its neighbor types). For $\Omega$ and $\Gamma$, given their representation matrices $H^{\Omega}$ and $H^{\Gamma}$ and the adjacency matrices $A^{\Omega\text{-}\Gamma}$ and $A^{\Gamma\text{-}\Omega}$ between them, we cannot directly eigendecompose $A^{\Omega\text{-}\Gamma}$ and $A^{\Gamma\text{-}\Omega}$ as they may not be square matrices. Thus, we define the augmented adjacency matrix $\mathcal{A}$ and the augmented representation matrix $\mathcal{H}$ as follows:

$\mathcal{A} = \begin{bmatrix} 0 & A^{\Omega\text{-}\Gamma} \\ A^{\Gamma\text{-}\Omega} & 0 \end{bmatrix}, \qquad \mathcal{H} = \begin{bmatrix} H^{\Omega} \\ H^{\Gamma} \end{bmatrix},$

where the $0$'s denote square zero matrices, and the narrower of $H^{\Omega}$ and $H^{\Gamma}$ is properly padded by zeros since in general their dimensionalities differ. Our convolution is related to the random walk Laplacian, which is defined as $L_{rw} = I - D^{-1}\mathcal{A}$, where $D$ is the degree matrix of $\mathcal{A}$. We also have $L_{rw} = U \Lambda U^{-1}$, where $U$ and $\Lambda$ are respectively $L_{rw}$'s eigenvectors and eigenvalues. $U^{-1}$ and $U$ define the graph Fourier transform and inverse transform respectively. Then the bipartite graph convolution is defined as the multiplication of a parameterized filter $g_{\theta}$ ($g_{\theta}(\Lambda)$ in the Fourier domain) and a signal $h$ (a column of $\mathcal{H}$) in the Fourier domain:

$g_{\theta} \star h = U \, g_{\theta}(\Lambda) \, U^{-1} h \approx \sum_{k=0}^{K} \theta_k \, T_k(\tilde{L}_{rw}) \, h,$

where $g_{\theta}(\Lambda)$ can be regarded as a function of the eigenvalues and is efficiently approximated by the truncated Chebyshev polynomials $T_k$ [30], with $\tilde{L}_{rw} = \frac{2}{\lambda_{max}} L_{rw} - I$ and $\lambda_{max}$ the largest eigenvalue of $L_{rw}$. We note that in general, $L_{rw}$ has the same eigenvalues as the symmetric normalized Laplacian [35], which lie in $[0, 2]$ [36]. For the purpose of numerical stability, we replace $\lambda_{max}$ with 2 without affecting the Fourier basis, so as to rescale the eigenvalues to $[-1, 1]$ [30]. $\tilde{L}_{rw}$ can then be expressed as follows:

$\tilde{L}_{rw} = L_{rw} - I = -D^{-1}\mathcal{A} = -\begin{bmatrix} 0 & \hat{A}^{\Omega\text{-}\Gamma} \\ \hat{A}^{\Gamma\text{-}\Omega} & 0 \end{bmatrix},$

where $\hat{A}^{\Omega\text{-}\Gamma}$ is the row-normalized adjacency matrix between $V_{\Omega}$ and $V_{\Gamma}$ (and similarly for $\hat{A}^{\Gamma\text{-}\Omega}$). Now, the convolution operation can be expressed as $g_{\theta} \star h \approx \sum_{k=0}^{K} \theta_k \, (-D^{-1}\mathcal{A})^{k} h$, which is K-localized, since it can be easily verified that $(D^{-1}\mathcal{A})^{k}$ denotes the transition probability of objects to their k-order neighborhood. Following GCN [31], we further let $K = 1$ and $\theta = \theta_0 = -\theta_1$, and stack multiple layers to recover a rich class of convolutional filter functions. Then we have $g_{\theta} \star h \approx \theta \, (I + D^{-1}\mathcal{A}) \, h$. Generalizing the filter to multiple ones, and the signal to multiple channels, the two terms can be expressed as follows:

$\mathcal{H}\,\Theta = \begin{bmatrix} H^{\Omega}\,\Theta \\ H^{\Gamma}\,\Theta \end{bmatrix}, \qquad D^{-1}\mathcal{A}\,\mathcal{H}\,\Theta = \begin{bmatrix} \hat{A}^{\Omega\text{-}\Gamma} H^{\Gamma}\,\Theta \\ \hat{A}^{\Gamma\text{-}\Omega} H^{\Omega}\,\Theta \end{bmatrix}.$

The top blocks of the above two terms recover the calculation of $Z^{\Omega\text{-}\Omega}$ and $Z^{\Omega\text{-}\Gamma}$ in Eq. (2). They differ from Eq. (2) only by using the same parameter matrix $\Theta$ for the two types $\Omega$ and $\Gamma$. In ie-HGCN, we use separate parameters, $W^{\Omega\text{-}\Omega}$ and $W^{\Omega\text{-}\Gamma}$, for $\Omega$ and $\Gamma$ respectively, to improve model flexibility. Another difference is that we aggregate $Z^{\Omega\text{-}\Omega}$ and the $Z^{\Omega\text{-}\Gamma}$'s through the type-level attention rather than simply adding them.

4.5.3 High computational efficiency.

Most previous methods [13, 14, 15, 17, 22] need to compute commuting matrices by iterative multiplication of adjacency matrices, which has at least squared time complexity. Our ie-HGCN performs heterogeneous graph convolution on the HIN directly in each layer, which is more efficient. By applying sparse-dense matrix multiplication, the time complexity of the heterogeneous graph convolution for a relation $\langle \Gamma, \Omega \rangle$ is linear in the number of links between $V_{\Omega}$ and $V_{\Gamma}$ (plus a term linear in the number of involved objects for the projection), and the time complexity of the type-level attention is linear in the number of target objects. Taking all types of objects and all types of links into consideration, the overall time complexity is linear in the total number of objects and links in an HIN.

5 Experiments

In this section, we conduct extensive experiments to show the performance of ie-HGCN. The source code will be released on GitHub once the manuscript is accepted. We use three widely used and publicly available real-world networks (IMDB, ACM, DBLP) to construct three HINs. We compare ie-HGCN against three GCN methods for homogeneous graphs: GraphSAGE, GCN and GAT; one HIN embedding method: metapath2vec; five GCN methods for HINs: HAN, HAHE, ActiveHNE, HetSANN and GTN; and one ie-HGCN variant in which the type-level attention is replaced by an element-wise mean (referred to as the mean variant below).

5.1 Datasets

The statistics of the constructed HINs are summarized in Table III. The notation * means the features are real; otherwise, they are generated randomly. Note that most existing methods only require features of the target objects, while ActiveHNE, GTN and our method need input features for all types of objects. However, in the widely used HIN datasets, some types of objects have no real features available. For these objects, some existing methods input their one-hot ids as features, which results in a large number of parameters in the first layer and consequently high space and time complexity. Considering that the general idea is to generate non-informative features for those objects without real features, in this paper we generate a 128-dimensional random vector for each of these objects from the Xavier uniform distribution [37]. In this way, little information can be obtained from their features. For all the methods (except HAHE, which cannot make use of object input features), we input exactly the same object features as shown in Table III.

• IMDB. We extract a subset from the IMDB dataset in HetRec 2011 (https://grouplens.org/datasets/hetrec-2011/), and construct an HIN which contains 4 object types: Movie (M), Actor (A), User (U) and Director (D), and 6 relations: the bidirectional relations between M and A, between M and U, and between M and D. We select 14 numerical and categorical features from the original movie features (task-irrelevant features such as id and url are ignored). Movie (M) objects are labeled with 4 classes: comedy, documentary, drama, and horror.

• ACM. The dataset is provided by the authors of HAN [13]. It was downloaded from the ACM digital library (https://dl.acm.org/) in 2010, and includes data from 14 representative computer science conferences. We construct an HIN with 3 object types: Paper (P), Author (A) and Subject (S), and 4 relations: the bidirectional relations between P and A, and between P and S. Paper (P) objects are labeled with 3 research areas: data mining, database and computer network, and their features are the TF-IDF representations of their titles.

• DBLP. The dataset is provided by the authors of HAN [13] and is extracted from 4 research areas of the DBLP bibliography (https://dblp.org/). The 4 research areas are: data mining (DM: ICDM, KDD, PAKDD, PKDD, SDM), database (DB: SIGMOD, VLDB, PODS, EDBT, ICDE), artificial intelligence (AI: AAAI, CVPR, ECML, ICML, IJCAI) and information retrieval (IR: ECIR, SIGIR, WWW, WSDM, CIKM). Based on the dataset, we construct an HIN with 4 object types: Paper (P), Author (A), Conference (C) and Term (T), and 6 relations: the bidirectional relations between P and A, between P and C, and between P and T. Author (A) objects are labeled with the 4 research areas according to the conferences to which they submitted papers [13]. Although papers in DBLP have titles as their features, the titles provide very similar information to the terms connected to the papers. Hence, we do not incorporate them as real features for papers, so that we can answer an important research question: whether ie-HGCN can exploit useful structural features conveyed by meta-paths to accomplish the task without informative object features.

Dataset Objects Number Features Classes
DBLP A 4057 128 4
P 14328 128 -
C 20 128 -
T 8898 128 -
ACM P 4025 128 3
A 7167 128 -
S 60 128 -
IMDB M 3328 14 4
A 42553 128 -
U 2103 128 -
D 2016 128 -
TABLE III: Dataset Statistics.

5.2 Baselines

We evaluate ie-HGCN against ten baselines as follows.

• GraphSAGE (GSAGE) [32]: It is a homogeneous method that learns a function to aggregate features from a node’s neighborhood. We use the convolutional mean-based aggregator, which corresponds to a rough, linear approximation of localized spectral convolution.

• GCN [31]: It is the state-of-the-art graph convolutional network for homogeneous graphs.

• GAT [25]: It is designed for homogeneous graphs. For each node, it aggregates neighbor representations via the importance scores learned by node-level attention.

• metapath2vec (MP2V) [6]: It is a state-of-the-art HIN embedding method. It first performs random walks guided by user-specified meta-paths and then uses the heterogeneous skip-gram to learn object representations. It cannot learn the importance of the input meta-paths.

• HAN [13]: It transforms an HIN into several homogeneous graphs via given symmetric meta-paths and uses GAT to perform object-level aggregation. Then, it fuses the object representations learned from the different meta-path based graphs by an attention mechanism.

• HAHE [14]: HAHE is similar to HAN, except that it initializes the features of the target objects as the meta-path based structural features. Thus, it cannot exploit object features.

• ActiveHNE (DHNE) [18]: It is an active learning method for HINs. For a fair comparison, we use its discriminative heterogeneous network embedding (DHNE) component. It only considers fixed-length meta-paths, and cannot learn the importance of meta-paths.

• HetSANN (HetSA) [23]: It is a heterogeneous method which directly uses the attention mechanism to aggregate heterogeneous neighbors. We use the variant HetSANN.M.R.V, which achieves the best performance as reported. The attention is implemented by sparse operations.

• GTN [22]: It is a heterogeneous method which considers all possible meta-paths by computing all possible meta-path based graphs, and then performs graph convolution on the resulting graphs.

• ie-HGCN (mean variant): It is a variant of ie-HGCN in which the type-level attention is replaced with the element-wise mean function. We use this variant to show the effectiveness of the type-level attention.

Dataset Metrics Training GSAGE GCN GAT MP2V HAHE HAN DHNE HetSA GTN ie-HGCN (mean variant) ie-HGCN
DBLP Micro F1 20% 0.8882 0.9155 0.9097 0.9015 0.9357 0.9224 0.8445 0.9336 0.9341 0.9368 0.9426
40% 0.8881 0.9110 0.9120 0.9081 0.9418 0.9240 0.8461 0.9372 0.9384 0.9355 0.9422
60% 0.8868 0.9048 0.9080 0.8982 0.9365 0.9280 0.8677 0.9385 0.9401 0.9448 0.9554
80% 0.8887 0.9172 0.9173 0.9089 0.9438 0.9308 0.8736 0.9403 0.9446 0.9520 0.9648
Macro F1 20% 0.8787 0.9060 0.9196 0.9043 0.9311 0.9311 0.8399 0.9228 0.9282 0.9321 0.9385
40% 0.8798 0.9017 0.9216 0.8973 0.9378 0.9330 0.8480 0.9251 0.9334 0.9305 0.9383
60% 0.8805 0.8973 0.9184 0.9048 0.9345 0.9370 0.8624 0.9317 0.9353 0.9400 0.9525
80% 0.8829 0.9099 0.9255 0.9097 0.9424 0.9399 0.8682 0.9348 0.9377 0.9472 0.9629
ACM Micro F1 20% 0.8147 0.7880 0.7418 0.6674 0.7717 0.7358 0.7621 0.7857 0.7785 0.7873 0.8193
40% 0.8086 0.7864 0.7201 0.6901 0.7819 0.7744 0.7841 0.7862 0.7884 0.8023 0.8210
60% 0.8031 0.7778 0.7618 0.7168 0.7809 0.7647 0.7897 0.7925 0.7927 0.8326 0.8373
80% 0.8112 0.7975 0.7720 0.7327 0.8086 0.7613 0.7902 0.8113 0.7964 0.8396 0.8422
Macro F1 20% 0.6340 0.6019 0.5818 0.5092 0.5387 0.6469 0.6494 0.5789 0.5134 0.6895 0.6979
40% 0.6235 0.5912 0.6114 0.5191 0.5482 0.6439 0.6567 0.5880 0.5365 0.6917 0.6931
60% 0.6038 0.5880 0.5379 0.5187 0.5465 0.6531 0.6599 0.5920 0.5536 0.6943 0.7025
80% 0.5960 0.6025 0.5936 0.5481 0.5884 0.6626 0.6761 0.6108 0.5628 0.6936 0.6942
IMDB Micro F1 20% 0.5820 0.5958 0.5530 0.4987 0.5489 0.5647 0.6379 0.6299 - 0.6212 0.6494
40% 0.5711 0.5849 0.5542 0.5014 0.5500 0.5603 0.6439 0.6344 - 0.6661 0.6670
60% 0.5989 0.5981 0.5514 0.5083 0.5501 0.5700 0.6426 0.6364 - 0.6644 0.6822
80% 0.5827 0.5873 0.5406 0.5090 0.5464 0.5718 0.6551 0.6421 - 0.6904 0.6971
Macro F1 20% 0.4093 0.2926 0.3007 0.2095 0.2147 0.4587 0.5032 0.4741 - 0.5419 0.5660
40% 0.4268 0.2915 0.3126 0.2136 0.2352 0.4444 0.5954 0.5459 - 0.5761 0.5981
60% 0.3964 0.3095 0.2982 0.2089 0.2172 0.4482 0.5581 0.5432 - 0.5547 0.6084
80% 0.3922 0.3036 0.2893 0.2211 0.2575 0.4477 0.5338 0.5332 - 0.5245 0.5835
TABLE IV: Object Classification Results.

5.3 Hyper-parameter Settings

On each dataset, we randomly select p% of the objects as the training set, and the remaining objects are divided equally into a validation set and a test set, where p ∈ {20, 40, 60, 80}. For all the methods, we use exactly the same training/validation/test splits for fairness. We only tune hyper-parameters on the validation set of DBLP and use the same setting for ACM and IMDB. This can reflect whether the hyper-parameter setting is sensitive w.r.t. datasets. The hyper-parameter settings of all the methods are detailed as follows.

• Ours: To make model tuning easy, we set the same hidden representation dimensionality for all the object types in a layer. Specifically, we set the number of layers to 5. The first layer is the input layer, and its dimensionalities for different object types are determined by the object features. The dimensionalities of the other 4 hidden layers are set to [64, 32, 16, 8]. The nonlinearity is the ELU function [34]. The hidden layer dimensionality of the type-level attention is set to 64. For optimization, we use the Adam optimizer, and the parameters are initialized by the Xavier uniform distribution [37]. We set the learning rate to 0.01. We apply dropout with rate 0.5 to the output of each layer except the output layer. The regularization weight is set to 5e-4. For a fair comparison, ie-HGCN and its mean variant use the same hyper-parameter setting.

• Baselines: Since GraphSAGE, GCN, GAT, metapath2vec, HAN and HAHE need user-specified meta-paths, we use the meta-paths used in the original papers [13, 14]. Concretely, on DBLP, we use APA, APCPA and APTPA. On ACM, we use PAP and PSP. On IMDB, we use MAM, MUM and MDM. For GraphSAGE, GCN, GAT and metapath2vec, we test them on the homogeneous graphs constructed from the above meta-paths and report their best results. For all the baselines, we tune hyper-parameters such as the number of epochs on the validation set of DBLP based on their default settings. Their key hyper-parameters are set as follows. For GraphSAGE, the neighborhood sample size is set to 5. For GAT, HAN and HetSANN, the number of attention heads is set to 8. For HAHE, the batch size is set to 512, and the neighbor sample size is set to 100. For ActiveHNE, the number of layers is set to 3, so that it can exploit length-2 meta-paths. For GTN, the number of channels is set to 2, and its number of layers is set to 3. For metapath2vec, the window size and the negative sample size are set to 5, and the walk length is set to 100.

5.4 Object Classification

We conduct object classification to compare the performance of all the methods. Each method is run 10 times with random initialization, and the average Micro F1 and Macro F1 are reported in Table IV. Note that due to its high space complexity, GTN runs out of the 12GB GPU memory on our datasets and cannot make use of the GPU. Therefore, in this experiment, we run GTN on CPUs with 128GB main memory, as suggested by its authors [22]. Even then, it runs out of the 128GB main memory on IMDB. We can see that the full ie-HGCN achieves the best overall performance, and the mean variant outperforms the other baselines in most cases, which indicates the effectiveness of our proposed heterogeneous graph convolution for object-level aggregation. Furthermore, the full ie-HGCN performs better than the mean variant, which shows that the proposed type-level attention can discover and exploit the most useful meta-paths for this task.

On DBLP, the heterogeneous methods HAN and HAHE significantly outperform the homogeneous methods GraphSAGE, GCN and GAT, while on ACM and IMDB, the former do not have much superiority over the latter. This may be because on DBLP, the heterogeneous structural features conveyed by meta-paths are more helpful for this task (see Section 5.5), while on ACM and IMDB, the real features of the target objects are more helpful. ie-HGCN always achieves the best results on all the datasets, which indicates that it can not only exploit useful structural features but also take advantage of useful object features. On DBLP, ActiveHNE performs much worse than HAN, HAHE, HetSANN and GTN, because it only exploits fixed-length meta-paths and cannot learn their importance. Even the homogeneous methods GraphSAGE, GCN and GAT perform better than ActiveHNE, as they can exploit useful meta-paths that previous researchers have chosen empirically. On DBLP, HetSANN and GTN perform better than ActiveHNE, which may be because they can exploit all possible meta-paths. GTN performs better than HetSANN, probably because GTN can correctly discover and exploit useful meta-paths for this task while HetSANN cannot. However, GTN performs worse than ie-HGCN, probably because it is not flexible enough to capture the complexity of different objects. Metapath2vec performs worst in most cases, which indicates the superiority of graph convolutional methods over traditional network embedding methods.

(a) Top-4 meta-path importance scores: 0.4228, 0.1098, 0.0935, 0.0736.
(b) Scores of the merged paths for the top-ranked meta-path (they sum to 0.4228): 0.0748, 0.0242, 0.0332, 0.0373, 0.1151, 0.1382.
TABLE V: Useful Meta-paths Discovered by ie-HGCN on DBLP.

5.5 Attention Study

One salient feature of ie-HGCN is its ability to evaluate all possible meta-paths and to discover and exploit the most useful ones for a specific task. We provide experimental evidence in this subsection. In Table V(a), we show the importance scores of the top-4 most useful meta-paths returned by ie-HGCN on DBLP for classifying author objects (see the computation details in Appendix A.1). Note that we need to merge equivalent meta-paths, i.e. meta-paths that differ only in where dummy self-relations are inserted; Table V(b) shows the scores of the paths merged into the top-ranked meta-path. ie-HGCN finds that APC is the most useful meta-path for the task of author object classification. This is reasonable. The semantic meaning of APC is "the conferences where authors have published papers". This correctly reflects the fact that in the DBLP dataset, the class of an author (i.e. his/her research area) is labeled according to the conferences to which he/she submitted papers [13]. Besides, APAPC and APTPC are also useful for the task. APAPC indicates that in addition to the conferences where an author himself/herself has published papers, the conferences where his/her coauthors have published papers are also useful. APTPC suggests we should further consider conferences whose published papers share a lot of common terms with those written by the author. The last meta-path, APCPC, is also intuitive: since a paper can only be published in one conference, the tail CPC essentially does not introduce information of other conferences to a conference. Hence, we can interpret APCPC as APC.

Regarding the best meta-path for each object, we also find the results intuitive. For example, ie-HGCN correctly classifies Yoshua Bengio as "AI" (see Section 5.1 for details of the labels), and assigns the highest score to APC for him. This is intuitive, since all the 7 papers connected to him in our dataset are from "AI" conferences such as ICML. On the other hand, ie-HGCN correctly classifies the scholar Chen Chen as "DM", and assigns the highest score to APAPC for him. This is also reasonable. In our dataset, he has published 3 papers in "DB" conferences and 2 papers in "DM" conferences, but all 5 papers are co-authored with Jiawei Han, who has published many papers in "DM" conferences such as KDD. These observations indicate that ie-HGCN is able to assess the importance of meta-paths according to the information of individual objects.

Regarding the baselines, HAN and HAHE assign the largest attention coefficient to the meta-path APCPA [13, 14]. GraphSAGE, GCN, GAT and metapath2vec also achieve their best classification results when their input meta-path is APCPA. APCPA means that we should resort to authors who have published papers in the same conferences as the target author. However, this is less effective, since it only indirectly exploits the conference information. These methods cannot directly exploit the most useful meta-path APC, because they can only perform homogeneous graph convolution, which requires constructing homogeneous graphs from symmetric meta-paths.

ie-HGCN discovers PAP and PSP as useful meta-paths on ACM, and MAM, MUM and MDM on IMDB. Most of them are widely used in previous works. The results also indicate the effectiveness of the real features of the target objects, since the target objects stay connected to their own input features through the dummy self-relation.

(a) IMDB
(b) ACM
(c) DBLP
Fig. 3: Classification performance of ie-HGCN w.r.t. the hidden layer dimensionality of the type-level attention.
(a) Scalability
(b) Depth
Fig. 4: (a): Running time (Sec.) of all the heterogeneous GCN methods w.r.t. the total number of links and objects; (b): Classification performance of ie-HGCN w.r.t. its number of layers.

5.6 Hyper-parameter Study

In this subsection, we investigate the sensitivity of ie-HGCN's performance to the hidden layer dimensionality of the type-level attention. With the other hyper-parameters fixed, we gradually increase this dimensionality from 8 to 512 and report Micro F1 and Macro F1 in Figure 3. We can see that on DBLP, the performance is not very sensitive to it. On IMDB and ACM, Micro F1 is not very sensitive, while Macro F1 is more sensitive. Considering that Macro F1 is sensitive to skewed classes in classification, this can be explained by the fact that the classes in IMDB and ACM are skewed, while those in DBLP are balanced. The general pattern is that the performance first grows as the dimensionality increases and then begins to decline, likely because of overfitting with more parameters in the attention module. The overall inflection point is at dimensionality 64. Thus, we set the hidden layer dimensionality of the type-level attention to 64 for ie-HGCN.

5.7 Scalability

We test and compare the scalability of the heterogeneous GCN methods on eight constructed HINs with different scales (see the construction details in Appendix A.2). All the methods are run 10 times on the GPU. The average running time (seconds) w.r.t. the total number of links and objects in the HINs is reported in Figure 4(a). We can see that ie-HGCN achieves the best scalability. The time cost of ie-HGCN increases linearly with the HIN scale, which corroborates the time complexity analysis in Section 4.5.3. HAN, HAHE and HetSANN perform worse than ActiveHNE and ie-HGCN because the former perform object-level attention, which is computationally inefficient in practice. The time cost of HAN and HAHE increases more sharply than that of HetSANN, because the former need to construct meta-path based graphs, which are very dense, while the latter implements the object-level attention by sparse operations. HAHE performs worse than HAN, since it uses the high-dimensional meta-path based structural features as input features. GTN shows the worst scalability due to its squared time complexity. HAHE, HAN and GTN cannot run on the large-scale HINs because they run out of the 12 GB GPU memory.

5.8 Depth Study

For ie-HGCN, the more layers there are, the more complex and rich the semantics that can be captured. We implement 8 instances of ie-HGCN with the number of layers increasing from 2 to 9 (see the details in Appendix A.3), and test their classification performance on DBLP. Each of these model instances is run 10 times, and the average Micro F1 and Macro F1 scores are reported in Figure 4(b). We can see that when the model has 2 layers, the performance is very poor. This is not surprising, since an ie-HGCN model with 2 layers can only consider meta-paths of length less than 2. As discussed in Section 5.5, in order to accurately classify author objects on DBLP, it is critical to capture and fuse the information from related conference objects. However, there is no length-1 meta-path between authors and conferences in the network schema of DBLP (Figure 1). When the depth becomes 3, the performance improves dramatically, since ie-HGCN can then evaluate the meta-path APC. The performance then grows slightly as the depth increases, reaching its best at depth 6. After that, the performance starts to decrease, possibly due to overfitting.

6 Conclusion

In this paper, we propose ie-HGCN to learn representations of objects in an HIN. To address the heterogeneity, we first project the representations of different types of neighbor objects into a common semantic space. Then we define the heterogeneous graph convolution operation to perform object-level aggregation. Finally, we use the proposed type-level attention to aggregate the representations of different types of neighbor objects. ie-HGCN automatically evaluates all possible meta-paths in an HIN within a length limit determined by the model depth, and discovers and exploits the most useful meta-paths for a specific task, which brings good model interpretability. The theoretical analysis and the scalability experiment show that it is efficient. Extensive experiments show that ie-HGCN outperforms several state-of-the-art methods.

Appendix A

Layers Coefficients
1-2 [, , , ] [0.06, 0.06, 0.82, 0.06]
[, ] [0.50, 0.50]
[, ] [0.63, 0.37]
[, ] [0.50, 0.50]
2-3 [, , , ] [0.64, 0.04, 0.27, 0.05]
[, ] [0.20, 0.80]
[, ] [0.37, 0.63]
[, ] [0.06, 0.94]
3-4 [, , , ] [0.25, 0.25, 0.25, 0.25]
[, ] [0.49, 0.51]
[, ] [0.42, 0.58]
[, ] [0.19, 0.81]
4-5 [, ] [0.43, 0.57]
TABLE VI: Attention Coefficients.
0.82 * 0.94 * 0.25 * 0.57 = 0.1098
0.82 * 0.80 * 0.25 * 0.57 = 0.0935
0.82 * 0.63 * 0.25 * 0.57 = 0.0736
0.82 * 0.64 * 0.25 * 0.57 = 0.0748
0.63 * 0.27 * 0.25 * 0.57 = 0.0242
0.63 * 0.37 * 0.25 * 0.57 = 0.0332
0.63 * 0.27 * 0.51 * 0.43 = 0.0373
0.82 * 0.64 * 0.51 * 0.43 = 0.1151
0.82 * 0.80 * 0.49 * 0.43 = 0.1382
TABLE VII: Meta-path Scores.

A.1 Attention Study Details

We first compute the mean attention distribution in each block, and then compute the importance score of a meta-path based on these mean distributions. Table VI shows the resulting mean attention coefficients returned by ie-HGCN on DBLP, and Table VII shows the computation of the scores of the most useful meta-paths.
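As a quick sanity check, the products listed in Table VII can be reproduced directly (a small script of our own; the four factors correspond to the layer transitions 1-2, 2-3, 3-4 and 4-5):

```python
# Reproduce the meta-path scores in Table VII from the mean attention
# coefficients in Table VI (four factors, one per layer transition).
products = [
    (0.82, 0.94, 0.25, 0.57),  # = 0.1098
    (0.82, 0.80, 0.25, 0.57),  # = 0.0935
    (0.82, 0.63, 0.25, 0.57),  # = 0.0736
    (0.82, 0.64, 0.25, 0.57),  # = 0.0748
    (0.63, 0.27, 0.25, 0.57),  # = 0.0242
    (0.63, 0.37, 0.25, 0.57),  # = 0.0332
    (0.63, 0.27, 0.51, 0.43),  # = 0.0373
    (0.82, 0.64, 0.51, 0.43),  # = 0.1151
    (0.82, 0.80, 0.49, 0.43),  # = 0.1382
]
scores = [round(a * b * c * d, 4) for a, b, c, d in products]
print(scores)
# The last six scores sum to the score of the top-ranked meta-path in Table V(a):
print(round(sum(scores[3:]), 4))  # 0.4228
```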

A.2 Scalability Details

Based on the original DBLP dataset provided by HAN [13], we use different numbers of authors to induce 8 HINs with different scales. We denote the scale by a tuple (number of authors, total number of objects, total number of links). From small to large, the scales of the resulting HINs are: (800, 6183, 21308), (1500, 9799, 38384), (2500, 13935, 59578), (4000, 18785, 84356), (5500, 22969, 106171), (7000, 26327, 125894), (10000, 31775, 147777), (14475, 37791, 170794).

A.3 Depth Study Settings

We implement 8 instances of ie-HGCN, with layers increasing from 2 to 9. Specifically, fixing other hyper-parameters, we set the layer (except for input layer) dimensionalities of these model instances as: [64], [64, 32], [64, 32, 16], [64, 32, 16, 8], [64, 64, 32, 16, 8], [64, 64, 64, 32, 16, 8], [64, 64, 64, 64, 32, 16, 8], [64, 64, 64, 64, 64, 32, 16, 8].

A.4 Hardware and Software

In the experiment, we use a server with 16 Intel Xeon E5-2620 CPUs, 1 Nvidia GeForce GTX 1080Ti with 12GB GPU memory, and 128GB main memory. Unless otherwise specified, all experiments are performed on GPU to accelerate computation.

References