A heterogeneous information network (HIN) is a network whose nodes and links may belong to different types. HINs are ubiquitous in daily life, and many real-world networks can be modeled as HINs, e.g., bibliographic networks, social networks, and knowledge bases. Mining HINs has become an active research topic because HINs capture meta structures with rich semantic meanings and have wide applications in real-world scenarios, including recommender systems, clustering, and outlier detection (Shi et al., 2018b; Sun et al., 2013; Gupta et al., 2013).
HINs are becoming increasingly popular in the real world; however, directly mining such complex relationships is neither efficient nor effective. Network embedding, given its capability of preserving network structure and node proximity, has drawn much attention from researchers. The representations learned by network embedding models can then be fed to many existing machine learning algorithms to improve performance in tasks such as node classification, clustering, and link prediction. Therefore, the widely studied network embedding techniques have recently been extended to analyse HINs (Shi et al., 2018b; Dong et al., 2017). HIN embedding targets the problem of exploiting the various types of relationships among nodes and the network structures carried by meta-paths, where a meta-path is a sequence of node types and/or edge types. Figure 1 presents a scenario in which authors are connected through papers, together with term and venue information. Here the proximity between authors can be measured by whether they coauthor a paper, publish papers in the same venue, or publish papers with citations between them; each of these relations forms a meta-path that refers to a different semantic space. HIN embedding learns meta-path based node embeddings so that the proximities among nodes in each space are preserved. The comprehensive node embedding should be based on all these meta-path based embeddings.
However, different meta-paths normally contribute differently to HIN embeddings for different tasks, and it is burdensome for users to manually specify the importance of each meta-path for each node. Therefore, one challenge is how to automatically learn these importance weights of meta-paths for each individual node. Existing methods either do not distinguish the importance of different meta-paths or learn the importance weights from the global HIN structure only. Thus they fail to learn a personalized preference towards different meta-paths for each individual node in different tasks. Furthermore, as large-scale real-world HINs become more common, the requirement of many real-world applications that embeddings be quickly generated or updated for new nodes (Hamilton et al., 2017) becomes increasingly difficult to meet. Therefore, how to efficiently generate or update the embeddings for new nodes is another challenging problem. Existing HIN embedding methods only support transductive learning of the embeddings by traversing the whole HIN, and thus cannot generate embeddings for new nodes efficiently.
To address the above challenges, we propose a Hierarchical Attentive Heterogeneous information network Embedding (HAHE) model, which efficiently learns HIN embeddings with personalized preferences over different meta-paths for each individual node. The proposed HAHE model also generates embeddings for new nodes efficiently by aggregating neighbor structure information rather than traversing the whole HIN.
Moreover, our proposed method assumes that each meta-path refers to a semantic space and that nodes connected by different meta-paths share different structural features. In each semantic space, we learn the embedding of each node by a weighted aggregation of neighbor features so that both first order and second order proximities can be preserved. After learning different embeddings with respect to different meta-paths for each node, we aggregate these embeddings for each node by considering the node's personalized preferences towards different meta-paths. We employ an attention mechanism to i) determine the importance of different neighbors with respect to each meta-path for each node and ii) learn personalized preferences towards different meta-paths for each individual node. In particular, the main advantages of using attention on HINs can be summarized as follows: i) attention allows the HAHE model to be robust to noisy parts of the HIN, thus improving the signal-to-noise ratio (SNR); ii) attention allows the HAHE model to assign a relevance score to each node in the HIN to highlight nodes with the most task-relevant information, which also improves the SNR; iii) attention makes HAHE more interpretable.
The contributions of this paper can be summarized as follows:
We propose a Hierarchical Attentive Heterogeneous information network Embedding (HAHE) model to learn HIN node embeddings that can not only be optimized for a specific task but also support meta-path based inductive learning for unseen nodes.
We design a hierarchical attention mechanism to distinguish the importance of neighborhood nodes and meta-paths for learning comprehensive embeddings.
We conduct experiments on real-world datasets to show the superiority of our model over several existing methods, and we give a comprehensive analysis of the learned embeddings and attention coefficients to gain more insight into the datasets.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the notation used in this paper and presents some preliminary knowledge. Then, we propose the HAHE model in Section 4. Experiments and detailed analysis are reported in Section 5. Finally, we conclude the paper in Section 6.
2. Related Work
In this section, we review the related studies in four aspects: heterogeneous information networks, network embedding, heterogeneous information network embedding, and attention mechanisms. We compare our method with the existing methods in Table 1.
|Node2Vec (Grover and Leskovec, 2016)|
|GraphSAGE (Hamilton et al., 2017)|
|GAT/AGNN (Velickovic et al., 2017; Thekumparampil et al., 2018)|
|GAM (Lee et al., 2018)|
|AttentionWalks (Abu-El-Haija et al., 2017)|
|Metapath2Vec (Dong et al., 2017)|
|HIN2vec (Fu et al., 2017)|
|ASPEM/HEER (Shi et al., 2018a; Zhang et al., 2016)|
|HINE (Chang et al., 2015)|
|HNE/PTE (Chang et al., 2015; Tang et al., 2015a)|
Heterogeneous Information Network. As a newly emerging direction, the heterogeneous information network (HIN) has been extensively studied as a powerful and effective paradigm for modeling complex objects and relations. In a HIN, nodes can be reached by paths with different semantic meanings, and these paths (also called meta-paths (Sun et al., 2011)) have been explored for tasks including classification (Ji et al., 2010), clustering (Sun et al., 2013, 2012), recommendation (Yu et al., 2014; Chen and Sun, 2017), and outlier detection (Gupta et al., 2013). However, as data collection advances, the scale of HINs keeps growing and the adjacency representation is typically high-dimensional and sparse. A low-dimensional and dense representation is needed as the basis for different downstream applications.
Network embedding aims at learning low-dimensional vector representations to facilitate a better understanding of the semantic relationships among nodes. Many existing works focus on learning representations for homogeneous networks. Among them, a branch of methods (Grover and Leskovec, 2016; Perozzi et al., 2014; Tang et al., 2015b) employs truncated random walks to generate node sequences, which are treated as sentences in language models and fed to the skip-gram model to learn embeddings. Beyond the skip-gram model, graph structure has also been incorporated into deep auto-encoders to preserve the highly non-linear first order and second order proximities (Tang et al., 2015b; Wang et al., 2016; Huang et al., 2017; Cao et al., 2015). Recently, inspired by GCN (Kipf and Welling, 2016), which uses convolution operators that show promise as an embedding methodology, a wide variety of graph neural network (GNN) models have been proposed (Hamilton et al., 2017; Velickovic et al., 2017; Ying et al., 2018; You et al., 2018, 2018). In many real-world applications, nodes are also associated with attributes, and network embedding has been explored on such attributed networks (Huang et al., 2017; Li et al., 2017a; Pan et al., 2016; Zhou et al., 2018). However, these methods assume that there exists only a single type of node/edge, while real-world datasets usually contain multiple node/edge types.
Heterogeneous Information Network Embedding. Recently, network embedding has been extended to HINs and a number of methods have been proposed. Among them are heterogeneous skip-gram based methods (Dong et al., 2017; Zhang et al., 2018; Huang and Mamoulis, 2017), in which meta-path based random walks are used to generate graph contexts. However, random walk based methods are time-consuming on large-scale HINs and support only transductive learning, which does not extend to new incoming nodes.
Beyond meta-path based random walks and the skip-gram model, neural networks have also been explored on HINs. In HIN2vec (Fu et al., 2017), a single-hidden-layer feedforward neural network is applied to capture the rich semantics of relationships and the details of the network structure when learning representations of nodes in HINs. In HNE (Chang et al., 2015) and PTE (Tang et al., 2015a), node embeddings are learned by capturing 1-hop neighborhood relationships between nodes via deep architectures. In ASPEM (Shi et al., 2018a) and HEER (Zhang et al., 2016), the HIN is decomposed into multiple aspects and embeddings are derived from the aspects. Some methods learn embeddings on specific HINs such as knowledge graphs (KG) (Zhang et al., 2016; Fang et al., 2018) and signed HINs (Wang et al., 2018), or employ HIN embedding for specific tasks such as similarity search (Shang et al., 2016) and recommendation (Shi et al., 2018b; Hu et al., 2018; Chen and Sun, 2017).
However, most of the above methods either use weights given by users or treat all meta-paths equally, which ignores the different semantics of the relationships between nodes. Also, learning embeddings in an inductive manner has not been well solved by these methods.
Attention In Graphs
The attention mechanism was first introduced in the deep learning community to help models attend to important parts of the data (Bahdanau et al., 2014; Mnih et al., 2014). It has proved effective in many computer vision (Wu et al., 2016), natural language processing (Yang et al., 2016a), and other data mining tasks (Yang et al., 2016b).
More recently, there has been growing interest in attention models for graphs, and various techniques have been proposed. Similar to DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover and Leskovec, 2016), AttentionWalks (Abu-El-Haija et al., 2017) uses attention to steer the walk towards a broader neighborhood or to restrict it to a narrower one. Methods like GAT (Velickovic et al., 2017; Ryu et al., 2018) and AGNN (Thekumparampil et al., 2018) extend graph convolutional networks (GCN) (Kipf and Welling, 2016) by incorporating an explicit attention mechanism. In GAM (Lee et al., 2018), two types of attention are used to learn node embeddings: one guides the random walk by determining the relevance of neighboring nodes, and the other determines the relevance of various subgraph embeddings. In this paper, we extend the attention mechanism to HINs and propose a hierarchical attention architecture to distinguish the importance of neighborhood nodes and meta-paths for learning comprehensive embeddings.
In this section, we formally define the problem of Heterogeneous Information Network Embedding and introduce some background definitions.
|G||Heterogeneous Information Network|
|Target Node type|
|Content Node type|
|Adjacent matrix based on meta-path|
|meta-path based path count matrix|
|meta-path based structural feature matrix|
|Neighbor Attention coefficient based on meta-path|
|Embedding of node based on meta-path|
|Meta Attention coefficient of meta-path|
|Comprehensive node embedding|
|Feature transformation matrix|
|Concatenation transformation matrix|
|Preference transformation matrix|
|Preference transformation bias|
Definition 1 (Heterogeneous Information Network). A heterogeneous information network (HIN) (Sun and Han, 2012) is defined as a network with multiple types of nodes and/or multiple types of links. As a mathematical abstraction, we define a HIN as , where denotes the set of nodes and denotes the set of links. A HIN is also associated with a node type mapping function , which maps each node to a predefined node type, and a link type mapping function , which maps each link to a predefined link type.
The input links of a HIN often connect different types of nodes and are represented with a commuting matrix (Sun et al., 2011) where denotes the link between node and . Nodes without direct links can be connected through other types of nodes, and such connections are also called meta-paths.
Definition 2 (Network schema and meta-path). The network schema is a template for a heterogeneous information network; it is a directed graph defined over object types, denoted as . A meta-path (Sun et al., 2011) is defined on the network schema and is denoted in the form of . A path , which goes through nodes , is an instance of the meta-path , if and .
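To make Definitions 1 and 2 concrete, the following is a minimal sketch of a toy HIN and of a check for whether a node sequence is an instance of a meta-path. The `HIN` class, the `is_instance` helper, and the node names are hypothetical and introduced only for illustration; link types are omitted for brevity, so a link is assumed to be valid whenever its endpoints have the required node types.

```python
from dataclasses import dataclass, field


@dataclass
class HIN:
    """Toy heterogeneous information network with typed nodes (illustrative only)."""
    node_type: dict = field(default_factory=dict)  # node -> node type
    links: set = field(default_factory=set)        # undirected links stored as frozensets

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_link(self, u, v):
        self.links.add(frozenset((u, v)))

    def has_link(self, u, v):
        return frozenset((u, v)) in self.links


def is_instance(hin, path, meta_path):
    """Return True if the node sequence `path` follows the type sequence `meta_path`."""
    if len(path) != len(meta_path):
        return False
    types_match = all(hin.node_type.get(n) == t for n, t in zip(path, meta_path))
    linked = all(hin.has_link(u, v) for u, v in zip(path, path[1:]))
    return types_match and linked


hin = HIN()
for node, ntype in [("a1", "A"), ("a2", "A"), ("p1", "P"), ("v1", "V")]:
    hin.add_node(node, ntype)
hin.add_link("a1", "p1")
hin.add_link("a2", "p1")
hin.add_link("p1", "v1")

print(is_instance(hin, ["a1", "p1", "a2"], "APA"))  # co-authorship path
print(is_instance(hin, ["a1", "p1", "v1"], "APA"))  # ends in a venue, not an author
```

The first path is an instance of APA because both the node types and the links match; the second fails the type check on its last node.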
Meta-paths are often given by users with prior knowledge. Some previous works (Sun et al., 2013; Yu et al., 2012) have focused on finding meta-paths automatically. However, this is not the main problem we aim to solve in this paper, and we assume the meta-paths are defined by the user. Given the meta-path set, our work focuses on distinguishing among these meta-paths. In a real-world HIN, not all node types are of interest, and we separate them into the target node type and content node types.
Definition 3 (Target/content type nodes). The target type nodes are the nodes to be embedded in network embedding. They are often associated with labels in semi-supervised learning tasks. The content type nodes are the remaining types of nodes, which serve as connections between target type nodes.
It is worth noting that providing labels for all node types is time-consuming and labor-intensive in real-world applications. The target type nodes can be connected with different types of content nodes with semantic meanings. In this paper, we only learn the embedding for one particular target type of nodes in HIN. Learning embeddings for all node types can be achieved by setting each node type as target type.
Definition 4 (Heterogeneous Information Network Embedding). Given a heterogeneous information network (HIN), denoted as a graph , with the corresponding node type mapping function and edge type mapping function , heterogeneous information network embedding aims at learning a function that projects each node into a vector in a d-dimensional space , where .
In this section, we describe the details of our proposed Hierarchical Attentive Heterogeneous Information Network Embedding (HAHE) model.
The basic idea is that target type nodes can be projected into a different semantic space by each meta-path. We learn meta-path based embeddings that preserve the proximities in each space by aggregating neighborhood information. The comprehensive embedding should be related to all the meta-path based embeddings, which naturally form a hierarchical structure. We design a hierarchical attention mechanism to learn preferences over meta-path based neighborhood information and meta-path based embeddings. The architecture of HAHE is shown in Fig. 2. It consists of three parts: semantic space materialization, a neighborhood attention layer, and a meta-path attention layer.
4.1. Semantic space materialization
Given a HIN and a set of meta-paths , the target type nodes can be connected by meta-paths with different semantic meanings. To learn node embeddings from each semantic space, we materialize the semantic space in two aspects: the meta-path based neighborhood and the structural features. Note that we only study non-attributed HIN embedding in this paper, where nodes carry no additional attributes. However, our proposed HAHE model can easily be extended to attributed heterogeneous information networks (AHIN) (Li et al., 2017b), and we will discuss this later.
From the meta-path based neighborhood aspect, given meta-path , the relation strength between target type nodes can be extracted by multiplying the partial commuting matrices . We use a weighted matrix to denote the meta-path based relation strength, where and is the number of meta-path instances between target type nodes and . The weighted matrix can also be viewed as the adjacency matrix of a weighted homogeneous network corresponding to the semantic space, where nodes are target type nodes and edge weights reflect the relation strength. Based on the relation strength matrix , the meta-path based neighborhood can be defined as:
where is the meta-path and is the meta-path based relation strength matrix. This adjacency relationship is employed in the HAHE model for neighborhood aggregation.
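The path-count construction described above can be sketched in a few lines. The example below is illustrative only: the author-paper incidence matrix is made up, the helper names are hypothetical, and the neighborhood is assumed to contain every other node with a positive relation strength (whether a node counts as its own neighbor is an assumption here, chosen to exclude it).

```python
def transpose(a):
    """Transpose a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical author-paper incidence matrix: rows are authors, columns are papers.
author_paper = [
    [1, 1, 0],  # author 0 wrote papers 0 and 1
    [0, 1, 0],  # author 1 wrote paper 1
    [0, 0, 1],  # author 2 wrote paper 2
]

# Path counts for meta-path APA: (author-paper) x (paper-author).
apa = matmul(author_paper, transpose(author_paper))


def neighborhood(strength, i):
    """Nodes j != i with positive relation strength to node i (assumed definition)."""
    return [j for j, s in enumerate(strength[i]) if s > 0 and j != i]


print(apa)                   # [[2, 1, 0], [1, 1, 0], [0, 0, 1]]
print(neighborhood(apa, 0))  # [1]
print(neighborhood(apa, 2))  # []
```

Longer meta-paths such as APVPA follow the same pattern by chaining further multiplications with the corresponding incidence matrices.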
From the meta-path based structural feature aspect, given a HIN and meta-path , we use the connection distribution as the structural feature of nodes. The proximity between nodes can then be measured by the similarity of their structural feature vectors. To measure the connection between nodes in a HIN, we have the following candidate measures:
Path Count: the number of path instances between target type node and any other target type node following meta-path .
Positive Pointwise Mutual Information (PPMI) (Levy and Goldberg, 2014): based on the statistics of the graph contexts generated by random walks following meta-path ; it reflects the connection strength.
However, Pairwise Random Walk and PPMI are based on random walks on graphs, which cannot be directly applied to unseen nodes. To support inductive learning and extract structural features efficiently, we use Path Count to build the meta-path based structural features. It has been shown (Sun et al., 2011) that nodes with high visibility may affect the similarity measure between nodes. To balance the visibility of nodes, we use the normalized connection strength matrix as the features of nodes under meta-path :
where is the relation strength between node and based on meta-path . Note that for different meta-paths , the structural features can be entirely different, capturing the semantic meaning of each space.
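One way to balance node visibility, in the spirit of the PathSim measure of Sun et al. (2011), is to divide twice the path count between two nodes by the sum of their self path counts. The sketch below uses that normalization as an assumption; it is not necessarily the exact form used by HAHE, and the function name and toy counts are hypothetical.

```python
def pathsim_normalize(counts):
    """PathSim-style normalization of a square path-count matrix (assumed form).

    feats[i][j] = 2 * counts[i][j] / (counts[i][i] + counts[j][j])
    """
    n = len(counts)
    feats = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = counts[i][i] + counts[j][j]
            if denom > 0:  # leave the entry at zero when both nodes have no self paths
                feats[i][j] = 2.0 * counts[i][j] / denom
    return feats


# Toy APA path counts for three authors (made-up values).
counts = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
features = pathsim_normalize(counts)
for row in features:
    print([round(v, 3) for v in row])
```

Each row of `features` is the structural feature vector of one target node. A highly visible node (large self path count) no longer dominates the similarity simply because it has many paths.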
Meta-path based first and second order proximity. The basic motivation of embedding methods is to preserve the proximities between nodes in the embedded space. Based on the semantic space materialization, we obtain the meta-path based neighborhood and structural features . We preserve the first order proximity by expecting nodes with a neighborhood relationship to be close in the embedding space, and the second order proximity by expecting nodes with similar neighborhood distributions to be close in the embedding space.
4.2. Neighbor attention layer
After the semantic space materialization, we extract the meta-path based neighborhood and structural features in each semantic space. Neighbor attention layer aims at learning meta-path based node embeddings in the semantic space so that the proximities between nodes can be preserved in the learned embedding space.
To achieve this goal, some existing network embedding methods (Dong et al., 2017; Zhang et al., 2018) perform random walks on the HIN to generate graph contexts, which are fed into a skip-gram model. However, these methods are time-consuming, especially when the network is large and dense. Other methods feed structure based information into deep neural networks (Wang et al., 2016; You et al., 2018). Recently, some methods (Kipf and Welling, 2016; Velickovic et al., 2017; Hamilton et al., 2017) have been proposed that aggregate feature information from neighborhood nodes and alternately update the node embeddings. Such methods have proved effective in many data mining tasks and support inductive learning. Inspired by these works, we propose a neighborhood attention layer that attends over neighborhood nodes following a self-attention strategy.
To overcome the sparsity and noise of the structural features, we first use a linear transformation, parameterized by weight matrix , to transform the structural features into lower-dimensional features. Note that the transformation can be replaced by a deeper multilayer perceptron (MLP) (Ruck et al., 1990) or other dimension reduction methods. Based on the transformed features, we can learn node embeddings by aggregating neighborhood features. However, neighborhood nodes differ in importance, and a self-attention strategy is applied to distinguish them. The basic assumption is that neighbors with similar structural features should be assigned higher attention coefficients.
The attention coefficients between nodes are based on the similarities of structural feature vectors in the transformed feature space:
where denotes the similarity between node and in transformed feature space, is the transformation matrix for meta-path , is the structural feature vector of node based on meta-path . To make attention coefficients easily comparable across different nodes, we normalize them across all choices of using the softmax function:
where is the attention coefficients for node in learning meta-path based embedding based on meta-path , is the adjacent relationship between node and based on meta-path .
Once obtained, the normalized attention coefficients are used to compute a weighted linear combination of neighborhood transformed features:
where is the activation function, for which we use the Tanh function, and is the aggregated neighborhood feature vector. Some existing work (Velickovic et al., 2017) includes the transformed feature of node in the aggregated neighborhood. However, when a node has a large number of neighbors, its own features may be diluted by the neighborhood features. To keep the node's own structural information while still aggregating neighborhood information, we apply a concatenation followed by a linear transformation to learn the meta-path based node embedding:
where is the weight of linear transformation from the concatenation to the embedding space, is the meta-path based neighborhood embedding, is the transformed node feature, is the concatenation of vectors.
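The steps of the neighbor attention layer can be sketched as follows for a single meta-path. This is a minimal illustration under stated assumptions: the similarity is taken to be the dot product of transformed features, the softmax runs over the meta-path based neighbors only, and all matrices (`w_feat`, `w_concat`) and toy inputs are hypothetical values rather than learned parameters.

```python
import math


def matvec(w, x):
    """Multiply matrix `w` (list of rows) by vector `x`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neighbor_attention(features, adjacency, w_feat, w_concat):
    """Return meta-path based embeddings and attention weights for all nodes."""
    z = [matvec(w_feat, x) for x in features]  # transformed structural features
    embeddings, attentions = [], []
    for i, zi in enumerate(z):
        nbrs = [j for j, a in enumerate(adjacency[i]) if a > 0]
        if nbrs:
            # Similarity scores (assumed dot product), normalized with a stable softmax.
            scores = [dot(zi, z[j]) for j in nbrs]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            alpha = [e / total for e in exps]
            # Weighted sum of neighbor features followed by Tanh.
            agg = [math.tanh(sum(a * z[j][k] for a, j in zip(alpha, nbrs)))
                   for k in range(len(zi))]
        else:
            alpha, agg = [], [0.0] * len(zi)
        attentions.append(dict(zip(nbrs, alpha)))
        # Concatenate neighborhood and own features, then project linearly.
        embeddings.append(matvec(w_concat, agg + zi))
    return embeddings, attentions


features = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
w_feat = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]                # 3 -> 2 dimensions
w_concat = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1]]    # 4 -> 2 dimensions

embeddings, attentions = neighbor_attention(features, adjacency, w_feat, w_concat)
print(attentions[0])  # attention of node 0 over its neighbors 1 and 2
print(embeddings[0])  # 2-dimensional meta-path based embedding of node 0
```

Because each node only reads the features of its own neighbors, the loop body is independent across nodes, and an unseen node can be embedded as soon as its structural features and neighbors are known.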
The proposed neighbor attention strategy has the following properties:
The neighborhood aggregation is parallelizable across nodes, which makes it efficient in real-world applications.
By aggregating neighborhood features, the model can be applied to inductive learning problems, including tasks where the model has to generalize to completely unseen graphs.
Comparison with existing methods. Compared with existing neighborhood aggregation based network embedding methods, our neighbor attention layer has the following advantages:
Compared with GraphSAGE (Hamilton et al., 2017), the neighbor attention layer distinguishes the importance of neighborhood nodes based on the similarity of structural features.
Compared with GAT (Velickovic et al., 2017), the neighbor attention layer not only computes a weighted aggregation of the neighborhood features but also keeps the information of the node's own features through concatenation.
4.3. Meta-path attention layer
Given a HIN and meta-path set , the neighborhood attention layer learns a meta-path based embedding for each meta-path. As discussed in the previous section, each meta-path corresponds to a specific semantic space. The comprehensive node embedding should be based on all these meta-path based embeddings. However, different meta-paths are expected to reflect distinct semantic facets of a HIN. A key challenge in aggregating different meta-paths is to distinguish the contributions of the meta-path based embeddings to the comprehensive node embedding.
Take the bibliographic network as an example: authors can be connected by coauthoring a paper (APA), publishing papers in the same venue (APVPA), or publishing papers with the same keywords (APTPA). Existing methods aim at preserving all the meta-path based proximities. However, some authors publish papers in a wide range of venues, in which case their embeddings rely more on meta-path APVPA, while other authors publish with many co-authors in a limited set of venues, in which case their embeddings rely more on meta-path APA. This phenomenon requires embedding methods to model the personalized preference over meta-paths for each author. Also, understanding the importance of each meta-path helps us better understand the structure of the HIN.
In our model, we leverage a meta-path attention layer to distinguish the embedding learned from each meta-path and learn the comprehensive node embedding. The input of this layer is a set of node embeddings of each meta-path:
The output of meta attention layer is the comprehensive node embeddings.
To model the personalized preference over meta-paths, a dataset-level preference over the meta-paths is learned to help determine the attention coefficients for each node. We use a context vector to denote this preference, where is the dimension of the hidden preference. A meta-path based embedding that is similar to the context vector is assigned a higher attention coefficient. The context vector is shared by all the target type nodes in the HIN as well as by unseen nodes, to support inductive learning. It is randomly initialized and jointly learned during the training process.
To measure the similarity between the context vector and transformed meta-path based embedding, we first use a linear transformation with non-linear activation to transform the meta-path based embedding into -dimension preference space:
where is the weight of the linear transformation and is the bias parameter of the transformation. The meta attention coefficients are then based on the similarity between the context vector and the transformed meta-path based embedding:
where is the L2 normalization of vectors, is the personalized attention coefficients on meta-path for node .
With the learned attention coefficients, we can compute a weighted combination of these meta-path based embeddings to obtain the comprehensive node embedding of node with the following function:
where is the comprehensive embedding of node , is the attention coefficients of meta-path for node .
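The meta-path attention layer can be sketched for a single node as follows. The example assumes that the non-linear activation is Tanh, that the similarity is the dot product of L2-normalized vectors, and that the coefficients are normalized with a softmax over meta-paths; the parameter names (`w_pref`, `b_pref`, `context`) and all values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def meta_path_attention(path_embeddings, w_pref, b_pref, context):
    """Combine one node's meta-path based embeddings into a comprehensive embedding."""
    unit_context = l2_normalize(context)
    scores = []
    for h in path_embeddings:
        # Project the embedding into the preference space (linear map + Tanh).
        t = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(w_pref, b_pref)]
        # Similarity between the normalized projection and the shared context vector.
        scores.append(sum(a * c for a, c in zip(l2_normalize(t), unit_context)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]  # personalized attention over meta-paths
    dim = len(path_embeddings[0])
    combined = [sum(b * h[k] for b, h in zip(beta, path_embeddings)) for k in range(dim)]
    return combined, beta


# Two meta-path based embeddings of the same node (made-up values).
path_embeddings = [[1.0, 0.0], [0.0, 1.0]]
w_pref = [[1.0, 0.0], [0.0, 1.0]]  # identity projection for readability
b_pref = [0.0, 0.0]
context = [1.0, 0.0]               # shared dataset-level preference vector

combined, beta = meta_path_attention(path_embeddings, w_pref, b_pref, context)
print(beta)      # the first meta-path receives the larger weight
print(combined)
```

Since the context vector and the projection are shared across nodes, the same function applies unchanged to a node that was not seen during training.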
The proposed meta-path attention strategy has the following properties:
By employing the data set level preference vector, the meta-path attention can be learned for unseen nodes in the inductive manner. As a result, the HAHE model can be applied on dynamic graphs.
4.4. Model inference
In order to learn useful, predictive representations of nodes in HIN, we learn the parameters of HAHE in a task-specific environment. For multi-class classification, the objective is to minimize the Cross-Entropy loss between the ground-truth and the predictions:
where is ground truth of node on label , is predicted result of node on label . For multi-label classification where nodes may have more than one label, we choose multi-label margin loss which is formulated by:
where is the number of nodes, is the ground truth label of node and is the predicted probability on labels of node . Given partial labels of nodes, we can optimize the HAHE model with mini-batch stochastic gradient descent and the back-propagation algorithm. The overall HAHE model is described in Algorithm 1.
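The two objectives can be sketched as plain functions. The cross-entropy is averaged over nodes, which is an assumption, and the multi-label margin loss is assumed to follow the common definition used by PyTorch's `MultiLabelMarginLoss`: every target label should outscore every non-target label by a margin of 1, with the sum divided by the number of labels. Function names and inputs are illustrative.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class of each node."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def multilabel_margin(scores, targets):
    """Margin loss for one node under the assumed definition."""
    target_set = set(targets)
    others = [i for i in range(len(scores)) if i not in target_set]
    total = sum(max(0.0, 1.0 - (scores[j] - scores[i]))
                for j in target_set for i in others)
    return total / len(scores)


# Multi-class: two nodes, three classes, predicted class probabilities.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(cross_entropy(probs, [0, 1]))

# Multi-label: one node with four labels, of which labels 3 and 0 are relevant.
print(multilabel_margin([0.1, 0.2, 0.4, 0.8], [3, 0]))  # approximately 0.85
```

In the second call, label 0 scores below both irrelevant labels and contributes most of the loss, while label 3 is penalized only because its lead over the irrelevant labels is smaller than the margin.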
In this section, we conduct extensive experiments on real-world HINs to demonstrate the superiority of our proposed HAHE model. We design several experiments to answer the following questions:
Can the HAHE model learn good embeddings for nodes in a HIN for different data mining tasks?
Can HAHE support inductive node embedding in a HIN for unseen nodes?
Can the HAHE model capture the importance of meta-paths and learn meaningful attention coefficients?
How does the dimensionality of the meta-path preference vector affect the performance of HAHE?
We conduct multi-class/multi-label node classification and network visualization experiments on several real-world datasets. The details of datasets are described as follows with dataset statistics summarized in Table 3:
DBLP is a bibliographic network of computer science that is frequently used in the study of heterogeneous networks. It contains four types of objects: papers, authors, venues, and topics. We use a subset of DBLP containing 4,249 papers (P), 1,909 authors (A), 18 venues (V), and 4,474 topics (T) from four areas: database, data mining, machine learning, and information retrieval. We conduct multi-class node classification for authors and consider the meta-path set: APA, APPA, APVPA, APTPA.
IMDB is a movie rating website that contains a social network of users and the movie ratings of each user. We extract a heterogeneous information network with 942 users (U), 1,318 movies (M), 889 directors (D), and 41,485 actors (A). The type of the movie is used as the label of the movie. We conduct multi-label node classification for movies and consider the meta-path set: MDM, MAM, MUM.
Yelp-Restaurant is a social media dataset released in the Yelp Dataset Challenge (https://www.yelp.com/dataset/challenge). We extract information related to restaurant business objects of three sub-categories (Li et al., 2017c): Fast Food, Sushi Bars, and American (New) Food. We construct a HIN of 2,614 business objects (B), 33,360 review objects (R), 1,286 user objects (U), and 82 food-relevant keyword objects (K). We conduct multi-class node classification for businesses and consider the meta-path set: BRKRB, BRURB.
|# Node||# Edge||# Label||meta-path|
Evaluation Metric. For multi-class classification, we follow the evaluation strategy of many existing works and use Micro-F1 and Macro-F1 scores as the evaluation metrics. Micro-F1 calculates the metric globally by counting the total true positives, false negatives, and false positives, while Macro-F1 calculates the metric for each label and takes the unweighted mean.
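The difference between the two averages is easiest to see on a small example. The following sketch computes both scores for single-label multi-class predictions; the function names and toy labels are illustrative.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Return (Micro-F1, Macro-F1) for single-label multi-class predictions."""
    counts = {c: [0, 0, 0] for c in labels}  # [tp, fp, fn] per label
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # predicted label gains a false positive
            counts[t][2] += 1  # true label gains a false negative
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*[sum(c[k] for c in counts.values()) for k in range(3)])
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
micro, macro = micro_macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(round(micro, 4), round(macro, 4))  # 0.6667 0.4889
```

Label 2 is never predicted correctly, so its per-label F1 is zero; this pulls Macro-F1 well below Micro-F1, which in the single-label case equals accuracy.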
For multi-label classification, the choice of evaluation metric has itself been a popular research topic; we follow the setting in (Zhang and Zhou, 2014), which includes Micro-F1, Macro-F1, average precision, and ROC-AUC score.
We evaluate HAHE against six state-of-the-art representation learning models and two HAHE variants.
Node2vec (Grover and Leskovec, 2016) uses truncated random walks to generate node sequences and employs the skip-gram model for node representation learning. It is widely used as a baseline for network embedding methods.
LINE (Tang et al., 2015b) learns node embeddings by preserving both first order and second order proximities. We compare with LINE to show the effectiveness of the neighbor attention layer in learning meta-path based embeddings.
GraphSAGE (Hamilton et al., 2017) learns node embeddings by aggregating local neighborhood features. We compare with this method to demonstrate the superiority of learning neighbor attention and combining different meta-paths. Here, we use the mean aggregator version of GraphSAGE to show the importance of learning weights for neighbors.
GAT (Velickovic et al., 2017) is an attention based network embedding method. The attention coefficients are learned by a single-layer feedforward neural network. We compare with this method to demonstrate the superiority of the proposed meta attention layer.
Metapath2Vec (Dong et al., 2017) is one of the state-of-the-art network embedding algorithms for HINs. The meta-path guided random walks are performed on HIN for context generation.
HIN2vec (Fu et al., 2017) is another state-of-the-art heterogeneous information network embedding method. It learns the embedding through a deep neural network by considering the meta-paths. However, it does not consider the weights of different meta-paths. We compare with this method to demonstrate the superiority of learning meta-path attention.
HAHE-homo is a variant of HAHE in which we use only one meta-path based embedding as the final embedding of nodes. We use this method to analyze the learned attention coefficients. In addition, the pre-trained parameters of HAHE-homo serve as the initialization of the neighbor attention layer.
HAHE-max/HAHE-avg are variants of HAHE in which we use max pooling/mean pooling to learn the comprehensive embedding from the meta-path based embeddings. We compare with these methods to demonstrate the superiority of using the meta-path attention layer to distinguish meta-paths.
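To illustrate how the pooling variants differ from attention-based aggregation, the following minimal sketch combines the meta-path based embeddings of a single node in three ways. The scoring function (a dot product with a context vector followed by a softmax) is an assumed, simplified form chosen for illustration; it is not claimed to be the exact formulation of the HAHE meta-path attention layer, and all values are toy inputs.

```python
import math

def mean_pool(embs):
    """Element-wise mean over meta-path based embeddings (HAHE-avg style)."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]

def max_pool(embs):
    """Element-wise max over meta-path based embeddings (HAHE-max style)."""
    return [max(col) for col in zip(*embs)]

def attention_pool(embs, context):
    """Softmax-weighted sum of embeddings.

    Scores are dot products with `context` (an assumed scoring form).
    """
    scores = [sum(e_d * c_d for e_d, c_d in zip(e, context)) for e in embs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    pooled = [sum(w * e[d] for w, e in zip(weights, embs))
              for d in range(len(context))]
    return pooled, weights

# Toy embeddings of one node under three meta-paths (illustrative values).
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [1.0, 0.0]  # hypothetical learned context vector

avg = mean_pool(embs)
mx = max_pool(embs)
att, weights = attention_pool(embs, context)
```

Mean and max pooling treat every meta-path identically, whereas the attention weights depend on the embeddings themselves, so a meta-path whose embedding aligns with the context vector contributes more to the pooled result.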
We implement the proposed HAHE model with PyTorch. The model parameters are randomly initialized with a Xavier initializer, and the Adam optimizer is employed for optimization. We use the parameters learned by HAHE-homo as the initialization of the neighbor attention layer. We set the learning rate to 0.0005 and the batch size to 512. The vector dimension of all the methods is 128. For the compared methods, we use the code provided by the authors. For random walk based methods, the number of walks per node is set to 80, the walk length to 100, and the number of negative samples to 5. All the experiments are conducted on a Linux server with one NVIDIA Titan Xp GPU and a 24-core Intel Xeon E5-2690 CPU. The code of our model will be released upon acceptance.
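For reference, the reported settings can be collected in a small configuration sketch. The class and field names are hypothetical and chosen only for readability; the values are the ones stated above.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainConfig:
    """Training settings reported for HAHE (field names are hypothetical)."""
    learning_rate: float = 0.0005
    batch_size: int = 512
    embedding_dim: int = 128   # shared by all compared methods
    initializer: str = "xavier"
    optimizer: str = "adam"

@dataclass(frozen=True)
class RandomWalkConfig:
    """Settings reported for the random walk based baselines."""
    walks_per_node: int = 80
    walk_length: int = 100
    negative_samples: int = 5

settings = {
    "hahe": asdict(TrainConfig()),
    "random_walk_baselines": asdict(RandomWalkConfig()),
}
```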
5.3. Transductive Node Classification
5.3.1. Experimental Setup
Node classification has been widely used in the literature to evaluate network embeddings. Since only a few existing methods support inductive learning in HINs, we first compare node classification performance in the transductive setting. For methods designed for homogeneous information networks, including Node2Vec, LINE, GraphSAGE and GAT, we apply the semantic space materialization described in Section 4.1 and extract a homogeneous information network based on each meta-path. The structural features are used as node attributes for GraphSAGE and GAT. For these methods, we select the meta-path with the best classification performance and report its results. We use Micro-F1 and Macro-F1 as the evaluation metrics for multi-class classification on DBLP and YELP. For multi-label classification on the IMDB dataset, we report Micro-F1, Macro-F1, average precision and ROC-AUC score.
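The extraction of a homogeneous network from a meta-path can be sketched as follows. This is a simplified illustration that only records which nodes are reachable along a meta-path; it assumes single-letter node types and unweighted, undirected edges, and it may differ from the materialization defined in Section 4.1 (for example, in how path counts or weights are handled). The helper names and the toy network are hypothetical.

```python
from collections import defaultdict

def adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def metapath_neighbors(adj, node_type, metapath):
    """Map each start node to the nodes reachable by following `metapath`.

    `node_type` maps node -> type letter; `metapath` is a string such as "APA".
    """
    result = {}
    starts = [n for n, t in node_type.items() if t == metapath[0]]
    for s in starts:
        frontier = {s}
        for t in metapath[1:]:
            # Advance one hop, keeping only nodes of the required type.
            frontier = {v for u in frontier for v in adj[u] if node_type[v] == t}
        frontier.discard(s)  # drop the trivial path back to the start node
        result[s] = frontier
    return result

# Toy bibliographic HIN: authors (A), papers (P), venues (V).
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P", "v1": "V"}
edges = [("a1", "p1"), ("a2", "p1"), ("a3", "p2"), ("p1", "v1"), ("p2", "v1")]
adj = adjacency(edges)

apa = metapath_neighbors(adj, node_type, "APA")      # co-authorship
apvpa = metapath_neighbors(adj, node_type, "APVPA")  # same-venue publication
```

In this toy network, APA leaves author a3 without any neighbor, while APVPA connects all three authors, which shows how the choice of meta-path changes the density of the extracted homogeneous network.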
5.3.2. Experiment results and analysis
Table 4 shows the multi-class node classification results on the DBLP and YELP datasets, and Table 5 shows the multi-label node classification results on the IMDB dataset. Based on the results, we have the following observations:
An overall observation is that HAHE achieves the best performance among the compared algorithms and HAHE variants in terms of all the evaluation metrics, which indicates the effectiveness of the proposed HAHE model for learning node embeddings in HINs. With more labeled data for classification, most of the methods achieve better performance.
Among the methods designed for homogeneous information networks, GraphSAGE and GAT perform better than Node2Vec and LINE. This demonstrates the benefit of employing the structural features in embedding learning. GAT also gains slight improvements over GraphSAGE, which indicates that learning weights for the neighborhood helps learn better representations.
Comparing HAHE with its two variants, HAHE-avg and HAHE-max, we observe that learning attention for aggregating meta-path based embeddings achieves better performance. Mean pooling and max pooling are two popular pooling functions for aggregating features, but they ignore the importance of each aspect. The performance improvement gained by HAHE further indicates the advantage of distinguishing meta-path based embeddings.
5.4. Inductive Node Classification
5.4.1. Experimental Setup
To contextualize the empirical results on inductive benchmarks, we compare our method with baseline methods in the inductive setting. The baseline methods that support inductive learning are GraphSAGE and GAT, which are designed for homogeneous information networks. For inductive node classification, given a HIN, we first sample a subset of nodes and extract the relations among them to construct a partial HIN. The remaining nodes are treated as unseen nodes for inductive learning. Only the edges between nodes in the subset are kept in the partial HIN. We then learn node embeddings in the partial HIN and optimize the parameters for neighborhood aggregation and meta attention. For nodes not in the subset, we directly apply the neighbor attention layer to learn meta-path based embeddings from the partial HIN and apply the meta attention layer to learn comprehensive embeddings. The learned embeddings are fed to classification tasks for evaluation.
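The construction of the partial HIN can be sketched as below. The function name, the uniform sampling strategy, the fixed seed and the separate list of "bridge" edges (edges with exactly one unseen endpoint, through which an unseen node could reach seen neighbors at inference time) are assumptions made for illustration; the description above does not specify these details.

```python
import random

def inductive_split(nodes, edges, unseen_ratio, seed=0):
    """Split nodes into seen/unseen and build the edge set of the partial HIN.

    Returns (seen, unseen, partial_edges, bridge_edges), where partial_edges
    connect two seen nodes and bridge_edges connect one seen and one unseen node.
    Edges between two unseen nodes are dropped.
    """
    rng = random.Random(seed)
    ordered = sorted(nodes)
    n_unseen = round(unseen_ratio * len(ordered))
    unseen = set(rng.sample(ordered, n_unseen))
    seen = set(ordered) - unseen
    partial_edges = [(u, v) for u, v in edges if u in seen and v in seen]
    bridge_edges = [(u, v) for u, v in edges if (u in seen) != (v in seen)]
    return seen, unseen, partial_edges, bridge_edges

# Toy ring graph with 10 nodes; 30% of the nodes are held out as unseen.
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
seen, unseen, partial_edges, bridge_edges = inductive_split(nodes, edges, 0.3, seed=1)
```

Training would use only `seen` and `partial_edges`; raising `unseen_ratio` removes more edges from the partial HIN, which is the setting varied in the inductive experiments.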
5.4.2. Experiment results and analysis
We conduct the inductive experiments on the DBLP and YELP datasets; the detailed results are shown in Figure 3. To summarize, we have the following observations:
An overall observation is that HAHE can learn better node embeddings in the inductive manner compared with baseline methods. This indicates that by aggregating meta-path based neighborhood information and meta-path based embedding, HAHE can learn good node embeddings in the inductive manner.
As the percentage of unseen nodes increases, the performance of all methods degrades. This is expected, since with more unseen nodes less information is kept in the partial HIN: the edges between unseen nodes are missing and the structural features are incomplete. Embeddings learned from such a sub-graph may not adequately support inductive learning.
Comparing HAHE with GAT, we find that the performance of HAHE drops more slowly as the percentage of unseen nodes increases. This is expected, since HAHE aggregates meta-path based embeddings from the whole meta-path set, while GAT only works on a homogeneous information network, for which we report the results with the best meta-path. For unseen nodes, considering more meta-paths aggregates more information from the partial HIN.
5.5. Analysis of learned attention
As discussed in the motivation of our model, meta-paths in a HIN are of different importance for learning node embeddings. In the HAHE model, we use the meta-path attention layer to distinguish the meta-paths. To evaluate whether the learned attention coefficients reflect the importance of meta-paths, we compare the learned attention coefficients with the quality of the meta-paths. The quality of each meta-path is measured through its meta-path based embedding, which is the output of the neighbor attention layer in HAHE: we directly feed the meta-path based embeddings into the node classification task. Figure 4 illustrates the comparison of meta-path quality and learned attention coefficients. Based on the results, we have the following observations:
The basic observation is that there is a positive correlation between the quality of a meta-path and its learned attention coefficient: meta-paths with better quality are assigned larger attention coefficients. This suggests that the learned attention coefficients properly reflect the quality of meta-paths. As a result, users can gain deeper insight into the semantic meanings of meta-paths in a HIN.
Another interesting observation is that the qualities of meta-paths differ significantly. For example, in the DBLP dataset, meta-path APVPA achieves the best performance while APA achieves the worst. This is expected, since APA refers to the co-authorship relationship and most authors coauthor with only a limited number of other authors. As a result, only a few neighbors can provide information, and the structural features of authors based on this meta-path are sparse.
5.6. Case Study: Network Visualization
Network visualization is one of the popular applications of network embedding and supports tasks such as data exploration and understanding. Following the experimental setting of existing works (Pan et al., 2016), we first learn a low-dimensional representation for each node and then map the representations into 2-D space with t-SNE (Van der Maaten and Hinton, 2012). Figure 5 illustrates the network visualization results on the DBLP dataset considering the meta-paths APA, APPA, APTPA and APVPA. Each dot denotes a node in the HIN and each color denotes a class label. A good embedding method is expected to place nodes with the same label close to each other and nodes with different labels far apart. As observed in Figure 5, the state-of-the-art baseline methods Metapath2Vec and HIN2Vec do not separate the nodes as well as HAHE. The visualization results of HAHE are quite clear, since most nodes with the same label (color) are close to each other and nodes with different labels (colors) are far from each other. This further supports the effectiveness of the proposed HAHE method.
5.7. Parameter Analysis
In this subsection, we investigate the parameter sensitivity of HAHE. More specifically, we evaluate how the number of embedding dimensions affects the results of node classification. Following the previous experimental settings, we change only the number of embedding dimensions to show how the dimensionality affects the performance of HAHE.
Figure 6 illustrates the Micro-F1 score w.r.t. dimensionality. We can see that the dimensionality only slightly affects the classification performance. On each dataset, as the number of dimensions increases, the Micro-F1 score changes only marginally (within 1%), which shows that HAHE is not very sensitive to the dimensionality of the context vector.
In this paper, we have proposed the HAHE model for heterogeneous information network embedding, which supports inductive learning. In the HAHE model, a hierarchical attention architecture is proposed to distinguish the importance of neighborhood nodes and meta-paths for learning comprehensive embeddings. Experimental results on transductive and inductive node classification and on network visualization with real-world HIN datasets demonstrate the superior performance of HAHE compared to several state-of-the-art heterogeneous information network embedding methods.
This paper suggests several potential directions for future research. First, a recommender system can be viewed as a HIN with rich attribute information, so the problem of recommending items or friends to users can be based on the embeddings of items and users. Another possible direction is to take the attributes of nodes into consideration, since in real-world datasets nodes are often associated with rich information. Finally, learning embeddings for all types of nodes is also an interesting problem to solve.
- Abu-El-Haija et al. (2017) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. 2017. Watch your step: Learning graph embeddings through attention. arXiv preprint arXiv:1710.09599 (2017).
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
- Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 891–900.
- Chang et al. (2015) Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. 2015. Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 119–128.
- Chen and Sun (2017) Ting Chen and Yizhou Sun. 2017. Task-guided and path-augmented heterogeneous network embedding for author identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 295–304.
- Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 135–144.
- Fang et al. (2018) Yang Fang, Xiang Zhao, Zhen Tan, and Weidong Xiao. 2018. TransPath: Representation Learning for Heterogeneous Information Networks via Translation Mechanism. IEEE Access 6 (2018), 20712–20721.
- Fu et al. (2017) Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1797–1806.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
- Gupta et al. (2013) Manish Gupta, Jing Gao, and Jiawei Han. 2013. Community distribution outlier detection in heterogeneous information networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 557–573.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
- Hu et al. (2018) Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging Meta-path Based Context for Top- N Recommendation with A Neural Co-Attention Model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). ACM, New York, NY, USA, 1531–1540. https://doi.org/10.1145/3219819.3219965
- Huang et al. (2017) Xiao Huang, Jundong Li, and Xia Hu. 2017. Accelerated attributed network embedding. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 633–641.
- Huang and Mamoulis (2017) Zhipeng Huang and Nikos Mamoulis. 2017. Heterogeneous information network embedding for meta path based proximity. arXiv preprint arXiv:1701.05291 (2017).
- Ji et al. (2010) Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao. 2010. Graph regularized transductive classification on heterogeneous information networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 570–586.
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Lee et al. (2018) John Boaz Lee, Ryan Rossi, and Xiangnan Kong. 2018. Graph Classification using Structural Attention. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1666–1674.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. 2177–2185.
- Li et al. (2017a) Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. 2017a. Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 387–396.
- Li et al. (2017b) Xiang Li, Yao Wu, Martin Ester, Ben Kao, Xin Wang, and Yudian Zheng. 2017b. Semi-supervised clustering in attributed heterogeneous information networks. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1621–1629.
- Li et al. (2017c) Xiang Li, Yao Wu, Martin Ester, Ben Kao, Xin Wang, and Yudian Zheng. 2017c. Semi-supervised Clustering in Attributed Heterogeneous Information Networks. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1621–1629. https://doi.org/10.1145/3038912.3052576
- Meila and Shi (2001) Marina Meila and Jianbo Shi. 2001. A random walks view of spectral segmentation. (2001).
- Mnih et al. (2014) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. 2014. Recurrent models of visual attention. In Advances in neural information processing systems. 2204–2212.
- Pan et al. (2016) Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. 2016. Tri-party deep network representation. Network 11, 9 (2016), 12.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
- Ruck et al. (1990) Dennis W Ruck, Steven K Rogers, Matthew Kabrisky, Mark E Oxley, and Bruce W Suter. 1990. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks 1, 4 (1990), 296–298.
- Ryu et al. (2018) Seongok Ryu, Jaechang Lim, and Woo Youn Kim. 2018. Deeply learning molecular structure-property relationships using graph attention neural network. arXiv preprint arXiv:1805.10988 (2018).
- Shang et al. (2016) Jingbo Shang, Meng Qu, Jialu Liu, Lance M Kaplan, Jiawei Han, and Jian Peng. 2016. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769 (2016).
- Shi et al. (2018b) Chuan Shi, Binbin Hu, Xin Zhao, and Philip Yu. 2018b. Heterogeneous Information Network Embedding for Recommendation. IEEE Transactions on Knowledge and Data Engineering (2018).
- Shi et al. (2018a) Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, and Jiawei Han. 2018a. AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks. In Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, 144–152.
- Sun et al. (2012) Yizhou Sun, Charu C Aggarwal, and Jiawei Han. 2012. Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. Proceedings of the VLDB Endowment 5, 5 (2012), 394–405.
- Sun and Han (2012) Yizhou Sun and Jiawei Han. 2012. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery 3, 2 (2012), 1–159.
- Sun et al. (2011) Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment 4, 11 (2011), 992–1003.
- Sun et al. (2013) Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip S Yu, and Xiao Yu. 2013. Pathselclus: Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 7, 3 (2013), 11.
- Tang et al. (2015a) Jian Tang, Meng Qu, and Qiaozhu Mei. 2015a. Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1165–1174.
- Tang et al. (2015b) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015b. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1067–1077.
- Thekumparampil et al. (2018) Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. 2018. Attention-based Graph Neural Network for Semi-supervised Learning. arXiv preprint arXiv:1803.03735 (2018).
- Van der Maaten and Hinton (2012) Laurens Van der Maaten and Geoffrey Hinton. 2012. Visualizing non-metric similarities in multiple maps. Machine learning 87, 1 (2012), 33–55.
- Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
- Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1225–1234.
- Wang et al. (2018) Hongwei Wang, Fuzheng Zhang, Min Hou, Xing Xie, Minyi Guo, and Qi Liu. 2018. SHINE: signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 592–600.
- Wu et al. (2016) Jia Wu, Zhibin Hong, Shirui Pan, Xingquan Zhu, Zhihua Cai, and Chengqi Zhang. 2016. Multi-graph-view subgraph mining for graph classification. Knowledge and Information Systems 48, 1 (2016), 29–54.
- Yang et al. (2016a) Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016a. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 21–29.
- Yang et al. (2016b) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016b. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.
- Ying et al. (2018) Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. 2018. Hierarchical Graph Representation Learning with Differentiable Pooling. arXiv preprint arXiv:1806.08804 (2018).
- You et al. (2018) Jiaxuan You, Rex Ying, Xiang Ren, William L Hamilton, and Jure Leskovec. 2018. GraphRNN: A Deep Generative Model for Graphs. arXiv preprint arXiv:1802.08773 (2018).
- Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 283–292.
- Yu et al. (2012) Xiao Yu, Yizhou Sun, Brandon Norick, Tiancheng Mao, and Jiawei Han. 2012. User guided entity similarity search using meta-path selection in heterogeneous information networks. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2025–2029.
- Zhang et al. (2018) Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2018. MetaGraph2Vec: Complex Semantic Path Augmented Heterogeneous Network Embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 196–208.
- Zhang et al. (2016) Jingyuan Zhang, Chun-Ta Lu, Mianwei Zhou, Sihong Xie, Yi Chang, and Philip S Yu. 2016. Heer: Heterogeneous graph embedding for emerging relation detection from news. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 803–812.
- Zhang and Zhou (2014) Min-Ling Zhang and Zhi-Hua Zhou. 2014. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26, 8 (2014), 1819–1837.
- Zhou et al. (2018) Sheng Zhou, Hongxia Yang, Xin Wang, Jiajun Bu, Martin Ester, Pinggang Yu, Jianwei Zhang, and Can Wang. 2018. PRRE: Personalized Relation Ranking Embedding for Attributed Networks. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 823–832.