Hyper-SAGNN: a self-attention based graph neural network for hypergraphs

11/06/2019 · by Ruochi Zhang, et al.

Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real-world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic for various learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms the state-of-the-art methods on traditional tasks while also achieving strong performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.


1 Introduction

Graph structure is a widely used representation for data with complex interactions. Learning on graphs has also been an active research area in machine learning, concerned with predicting or discovering patterns based on the graph structure (hamilton2017representation). Although existing methods can achieve strong performance in tasks such as link prediction and node classification, they are mostly designed for analyzing pair-wise interactions and thus are unable to effectively capture higher-order interactions in graphs. In many real-world applications, however, relationships among multiple instances are key to capturing critical properties, e.g., co-authorship involving more than two authors or relationships among multiple heterogeneous objects such as “(human, location, activity)”. Hypergraphs can be used to represent such higher-order interactions (zhou2007learning). To analyze higher-order interaction data, it is straightforward to expand each hyperedge into pair-wise edges under the assumption that the hyperedge is decomposable, and several previous methods were developed based on this notion (sun2008hypergraph; feng2018learning). However, earlier work on DHNE (Deep Hyper-Network Embedding) (tu2018structural) suggested the existence of heterogeneous indecomposable hyperedges, where relationships within an incomplete subset of a hyperedge do not exist. Although DHNE provides a potential solution by modeling the hyperedge directly without decomposing it, the neural network structure used in DHNE limits the method to heterogeneous hyperedges of a fixed type and fixed size, so it is unable to consider relationships among multiple types of instances with variable size. For example, Fig. 1 shows a heterogeneous co-authorship hypergraph with two types of nodes (corresponding author and coauthor). Due to the variable number of both authors and corresponding authors in a publication, the hyperedges (co-authorships) have different sizes or types. Unfortunately, methods for representation learning of heterogeneous hypergraphs with variable-sized hyperedges, especially those that can predict variable-sized hyperedges, have not been developed.

Figure 1: An example of the co-authorship hypergraph. Here authors are represented as nodes (in dark blue and light blue) and coauthorships are represented as hyperedges.

In this work, we developed a new self-attention based graph neural network, called Hyper-SAGNN, that works with both homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. Using the same datasets as the DHNE paper (tu2018structural), we demonstrated the advantage of Hyper-SAGNN over DHNE on multiple tasks. We further tested the effectiveness of the method in predicting edges and hyperedges and showed that the model achieves better performance in the multi-tasking setting. We also formulated a novel task called outsider identification and showed that Hyper-SAGNN performs strongly on it. Importantly, as an application of Hyper-SAGNN to single-cell genomics, we learned embeddings for recently produced single-cell Hi-C (scHi-C) datasets to uncover the clustering of cells based on their 3D genome structure (ramani2017massively; nagano2017cell). We showed that Hyper-SAGNN achieves improved results in identifying distinct cell populations as compared to existing scHi-C clustering methods. Taken together, Hyper-SAGNN can significantly outperform the state-of-the-art methods and can be applied to a wide range of hypergraphs in different applications.

2 Related Work

Deep learning based models have been developed recently to generalize from graphs to hypergraphs (gui2016large; tu2018structural). The HyperEdge Based Embedding (HEBE) method (gui2016large) aims to learn the embeddings for each object in a specific heterogeneous event by representing it as a hyperedge. However, as demonstrated in tu2018structural, HEBE does not perform well on sparse hypergraphs. Notably, previous methods typically decompose the hyperedge into pair-wise relationships, where the decomposition methods fall into two categories: explicit and implicit. For instance, given a hyperedge $(v_1, v_2, v_3)$, the explicit approach would decompose it directly into three edges, $(v_1, v_2)$, $(v_2, v_3)$, and $(v_1, v_3)$, while the implicit approach would add a hidden node $e$ representing the hyperedge before decomposition, i.e., $(v_1, e)$, $(v_2, e)$, $(v_3, e)$. The DHNE model, however, directly models the tuple-wise relationship using an MLP (Multilayer Perceptron). The method is able to achieve better performance on multiple tasks as compared to other methods designed for graphs or hypergraphs, such as DeepWalk (perozzi2014deepwalk), node2vec (grover2016node2vec), and HEBE. Unfortunately, the MLP takes fixed-size input, making DHNE capable of handling only $k$-uniform hypergraphs, i.e., hypergraphs whose hyperedges all contain exactly $k$ nodes. To use DHNE for non-$k$-uniform hypergraphs or hypergraphs with different types of hyperedges, a separate function needs to be trained for each type of hyperedge, which leads to significant computational cost and the loss of the ability to generalize to unseen types of hyperedges. Another recent method, hyper2vec (huang2019hyper2vec), can also generate embeddings for nodes within the hypergraph. However, hyper2vec cannot solve the link prediction problem directly, as it only generates node embeddings in an unsupervised manner, without a learned function mapping from node embeddings to hyperedges. Also, for $k$-uniform hypergraphs, hyper2vec is equivalent to node2vec, which cannot capture the high-order network structure of indecomposable hyperedges (as shown in tu2018structural). Our Hyper-SAGNN addresses all these challenges with a self-attention based graph neural network that can learn embeddings of the nodes and predict hyperedges for non-$k$-uniform heterogeneous hypergraphs.

3 Method

3.1 Definitions and Notations

Definition 1.

(Hypergraph) A hypergraph is defined as $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ represents the set of nodes in the graph, and $E = \{e_1, \ldots, e_m\}$ represents the set of hyperedges. Any hyperedge $e_i \in E$ can contain more than two nodes (i.e., $|e_i| \geq 2$). If all hyperedges within a hypergraph have the same size $k$, it is called a $k$-uniform hypergraph. Note that even if a hypergraph is $k$-uniform, it can still have different types of hyperedges because the node types can vary within the hyperedges.

Definition 2.

(The hyperedge prediction problem) We formally define the hyperedge prediction problem. For a given tuple of nodes $(v_{i_1}, \ldots, v_{i_k})$, our goal is to learn a function $f$ that satisfies:

$$ f(v_{i_1}, \ldots, v_{i_k}) \begin{cases} \geq \delta, & \text{if } (v_{i_1}, \ldots, v_{i_k}) \in E \\ < \delta, & \text{otherwise} \end{cases} \tag{1} $$

where $\delta$ is the threshold that binarizes the continuous value of $f$ into a label, which indicates whether the tuple is a hyperedge or not. Specifically, when we are given pre-trained embedding vectors or features of the nodes, $(x_{i_1}, \ldots, x_{i_k})$, we can rewrite this function as:

$$ f(v_{i_1}, \ldots, v_{i_k}) = g(x_{i_1}, \ldots, x_{i_k}) \tag{2} $$

where the vectors produced inside $g$ can be considered as the fine-tuned embedding vectors for the nodes. For convenience, we refer to the $x_i$ as the features and to the vectors derived from them within the model as the learned embeddings.
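To make the formulation concrete, the following minimal sketch (in NumPy; `predict_hyperedge` and `toy_f` are hypothetical placeholders, not the paper's implementation) shows how a learned tuple-wise scoring function $f$ and a threshold $\delta$ turn a variable-sized tuple of node features into a hyperedge label:

```python
import numpy as np

def predict_hyperedge(f, node_features, delta=0.5):
    """Binarize the continuous score f(x_{i_1}, ..., x_{i_k}) with threshold delta (Eqn. 1).

    `f` is any learned tuple-wise scoring function that accepts a variable-sized
    (k, d) array of node features and returns a scalar in [0, 1].
    """
    return f(np.asarray(node_features)) >= delta

# Toy stand-in for f: sigmoid of average pooling. The paper notes such linear
# pooling is too weak to model indecomposable hyperedges; it only fixes the API.
toy_f = lambda X: 1.0 / (1.0 + np.exp(-X.mean()))
print(predict_hyperedge(toy_f, np.random.randn(3, 64)))  # size-3 tuple
print(predict_hyperedge(toy_f, np.random.randn(5, 64)))  # variable sizes work too
```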

3.2 Structure of Hyper-SAGNN

Our goal is to learn the functions $f$ and $g$ defined above, which take tuples of node features $(x_{i_1}, \ldots, x_{i_k})$ as input and produce the probability of these nodes forming a hyperedge. Without assuming that the hypergraph is $k$-uniform or that all hyperedges are of the same type, we require that $g$ can take variable-sized, unordered input. Although simple functions such as average pooling satisfy this tuple-wise condition, previous work showed that a linear function is not sufficient to model this relationship (tu2018structural). DHNE used an MLP to model the non-linear function, but it requires that an individual function be trained for each type of hyperedge. Here we propose a new method to tackle the general hyperedge prediction problem.

Figure 2: Structure of the neural network used in Hyper-SAGNN. The input $(x_1, \ldots, x_k)$, representing the features for nodes 1 to $k$, passes through two branches of the network, resulting in static embeddings $(s_1, \ldots, s_k)$ and dynamic embeddings $(d_1, \ldots, d_k)$, respectively. The layer generating the dynamic embeddings is a multi-head attention layer; its mechanism is illustrated for node 1 in the figure. Then the pseudo-euclidean distance of each static/dynamic embedding pair is passed through a one-layer position-wise feed-forward network to produce probability scores $p_1, \ldots, p_k$. These scores are averaged to indicate whether this group of nodes forms a hyperedge or not.

Graph neural network based methods such as GraphSAGE (hamilton2017inductive) typically define a unique computational graph for each node, allowing efficient information aggregation for nodes with different degrees. The Graph Attention Network (GAT) introduced by velivckovic2017graph utilizes a self-attention mechanism in the information aggregation process. Motivated by these properties, we propose our method Hyper-SAGNN, which applies a self-attention mechanism within each tuple to learn the function $g$.

We first briefly introduce the self-attention mechanism, using the same terms as in vaswani2017attention and velivckovic2017graph. Given a group of node features $(x_1, \ldots, x_k)$ and trainable weight matrices $W_Q$, $W_K$, $W_V$ that linearly transform the features before the dot-product attention is applied, we first compute the attention coefficients that reflect the pair-wise importance of nodes:

$$ \alpha_{ij} = \left(W_Q x_i\right)^{\top} \left(W_K x_j\right) \tag{3} $$

We then normalize $\alpha_{ij}$ over all possible $j$ within the tuple through the softmax function, i.e.,

$$ \hat{\alpha}_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{l=1}^{k} \exp(\alpha_{il})} \tag{4} $$

Finally, a weighted sum of the transformed features with an activation function $\sigma$ is calculated:

$$ d_i = \sigma\!\left(\sum_{j} \hat{\alpha}_{ij}\, W_V x_j\right) \tag{5} $$

In GAT, the self-attention mechanism is typically applied to each node together with all of its first-order neighbors. In Hyper-SAGNN, we instead aggregate information for a node only over its neighbors within a given tuple. The structure of Hyper-SAGNN is illustrated in Fig. 2.

The input to our model can be represented as tuples, i.e., $(x_1, \ldots, x_k)$. Each tuple first passes through a position-wise feed-forward network to produce the static embeddings $(s_1, \ldots, s_k)$, where each $s_i$ depends only on $x_i$. We refer to $s_i$ as the static embedding for node $i$ since it remains the same regardless of the given tuple. The tuple also passes through a multi-head graph attention layer to produce a new set of node embedding vectors $(d_1, \ldots, d_k)$, which we refer to as the dynamic embeddings because they depend on all the node features within the tuple.

Note that, unlike the standard attention mechanism described above, when calculating the dynamic embeddings $d_i$ we require that $j \neq i$ in Eqn. (5). In other words, we exclude the term for node $i$ itself in the calculation of its dynamic embedding. Based on our results, we found that including this term leads to either similar or worse performance on the hyperedge prediction and node classification tasks (see Appendix A.1 for details). We will elaborate on the motivation for this choice later in this section.
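A minimal single-head NumPy sketch of Eqns. (3)-(5) with the $j \neq i$ exclusion just described (the tanh activation and the uniform dimensions are illustrative assumptions; the actual model uses a multi-head attention layer):

```python
import numpy as np

def dynamic_embeddings(X, W_Q, W_K, W_V):
    """Within-tuple self-attention for one tuple, excluding the j = i term.

    X: (k, d) node features of the tuple. Returns (k, d) dynamic embeddings.
    """
    alpha = (X @ W_Q) @ (X @ W_K).T                 # Eqn. (3): pair-wise coefficients
    np.fill_diagonal(alpha, -np.inf)                # drop j = i before the softmax
    alpha = np.exp(alpha - alpha.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)       # Eqn. (4): softmax within the tuple
    return np.tanh(alpha @ (X @ W_V))               # Eqn. (5): weighted sum + activation

rng = np.random.default_rng(0)
k, d = 3, 64
X = rng.normal(size=(k, d))
W_Q, W_K, W_V = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
D = dynamic_embeddings(X, W_Q, W_K, W_V)            # one d_i per node in the tuple
```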

With the static and dynamic embedding vectors for each node, we calculate the Hadamard power (element-wise power) of the difference of each corresponding static/dynamic pair. It is then passed through a one-layer neural network with sigmoid as the activation function to produce a probability score $p_i$. Finally, all the outputs are averaged to get the final score $\hat{y}$, i.e.,

$$ p_i = \sigma\!\left(W_o \left(d_i - s_i\right)^{\circ 2} + b_o\right) \tag{6} $$

$$ \hat{y} = \frac{1}{k} \sum_{i=1}^{k} p_i \tag{7} $$

By design, the weighted term inside Eqn. (6) can be regarded as a squared weighted pseudo-euclidean distance between the static embedding $s_i$ and the dynamic embedding $d_i$. We call it a pseudo-euclidean distance because we do not require the weights to be non-negative or to sum to 1. One rationale for allowing negative weights when calculating the distance is Minkowski space, where the squared distance is defined as $x^2 + y^2 + z^2 - t^2$. Therefore, for these high-dimensional embedding vectors, we do not specifically treat them as euclidean vectors.
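A corresponding sketch of the scoring head in Eqns. (6)-(7), again in NumPy (the parameter names `w_o`, `b_o` are hypothetical, and `S`, `D` stand in for the static and dynamic embeddings):

```python
import numpy as np

def hyperedge_score(S, D, w_o, b_o):
    """Per-node scores p_i and tuple score y_hat from static/dynamic embeddings.

    Eqn. (6): p_i = sigmoid(w_o . (d_i - s_i)^{o2} + b_o), with ^{o2} the
    element-wise (Hadamard) square. Eqn. (7): y_hat is the mean of the p_i.
    w_o may contain negative entries, hence the "pseudo-euclidean" distance.
    """
    sq_diff = (D - S) ** 2                               # Hadamard power of the difference
    p = 1.0 / (1.0 + np.exp(-(sq_diff @ w_o + b_o)))     # p_1, ..., p_k (Eqn. 6)
    return p.mean(), p                                   # y_hat (Eqn. 7) and per-node scores

rng = np.random.default_rng(1)
k, d = 3, 64
S, D = rng.normal(size=(k, d)), rng.normal(size=(k, d))  # stand-ins for s_i, d_i
y_hat, p = hyperedge_score(S, D, 0.1 * rng.normal(size=d), 0.0)
```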

Figure 3: Illustration of the methods for generating node features for node $i$ in the hypergraph. In the walk based approach, a biased random walk on the hypergraph is used to produce walking paths (the yellow circles in the walking paths represent node $i$). These walks are then used to train a skip-gram model to obtain features. In the encoder based approach, the $i$-th row of the adjacency matrix (shown in the figure, where the orange/white blocks represent whether or not node $i$ is adjacent to each other node in the graph) is used as the input to an autoencoder. The output of the encoder part is used as the features for node $i$.

Our network essentially aims to correlate the average “distance” between the static/dynamic embedding pairs with the probability of the node group forming a hyperedge. Since the dynamic embedding is the weighted sum of the (potentially non-linearly transformed) features of a node's neighbors within the tuple, this “distance” reflects how well the static embedding of each node can be approximated by the features of its neighbors within that tuple. This design strategy shares some similarities with the CBOW model in natural language processing (mikolov2013skipgram), where the model aims to predict the target word given its context. In principle, we could still include the $j = i$ term when computing the dynamic embedding $d_i$. Alternatively, we could directly pass $d_i$ through a fully connected layer to produce the probability score while leaving the rest of the model unchanged. However, we argue that our proposed model produces static embeddings that can be directly used for tasks such as node classification, which the alternative approaches cannot achieve (see Appendix A.1 for a detailed analysis).

3.3 Approaches for Generating Features

In an inductive learning setting with known node attributes, $x_i$ can simply be the attributes of node $i$. However, in a transductive learning setting without node attributes, we have to generate $x_i$ based solely on the graph structure. Here we use two existing strategies to generate the features $x_i$.

We first define the functions used in the subsequent sections as follows: a hyperedge $e$ with weight $w(e)$ is incident with a vertex $v$ if and only if $v \in e$. We denote the indicator function representing this incidence relationship between $v$ and $e$ by $h(v, e)$, which equals 1 when $v$ is incident with $e$ and 0 otherwise. The degree of a vertex, $d(v)$, and the size of a hyperedge, $\delta(e)$, are defined as:

$$ d(v) = \sum_{e \in E} w(e)\, h(v, e) \tag{8} $$

$$ \delta(e) = \sum_{v \in V} h(v, e) \tag{9} $$
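For concreteness, here are Eqns. (8)-(9) computed with NumPy on a toy hypergraph (the incidence matrix and weights below are made up for illustration):

```python
import numpy as np

# Toy hypergraph: rows = 5 vertices, columns = 3 hyperedges; H[v, e] = h(v, e).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
w = np.array([1.0, 2.0, 1.0])   # hyperedge weights w(e)

d_v = H @ w              # Eqn. (8): d(v) = sum_e w(e) h(v, e)
delta_e = H.sum(axis=0)  # Eqn. (9): delta(e) = sum_v h(v, e), the hyperedge size
print(d_v, delta_e)      # [2. 3. 3. 3. 1.] [3. 3. 3.]
```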

3.3.1 Encoder based approach

As shown on the right side of Fig. 3, the first method to generate features is referred to as the encoder based approach, which is similar to the structure used in DHNE (tu2018structural). We first obtain the incidence matrix $H$ of the hypergraph, with entries $H_{ve} = 1$ if $v \in e$ and 0 otherwise. We also calculate the diagonal degree matrix $D_v$ containing the vertex degrees $d(v)$. We thus have the adjacency matrix $A = H H^{\top} - D_v$, of which the entries $A_{ij}$ denote the number of times nodes $v_i$ and $v_j$ co-occur in hyperedges. The $i$-th row of $A$, denoted by $a_i$, reflects the neighborhood structure of node $v_i$; it passes through a one-layer neural network to produce $x_i$:

$$ x_i = \sigma\!\left(W_e\, a_i + b_e\right) \tag{10} $$

In DHNE, a symmetric structure was introduced, with a corresponding decoder that transforms $x_i$ back to $a_i$. tu2018structural remarked that including this reconstruction error term helps DHNE learn the graph structure better. We also include the reconstruction error term in the loss function, but with tied weights between the encoder and decoder to reduce the number of parameters that need to be trained.
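A minimal sketch of this tied-weight encoder/decoder (the activation choice and parameter names are assumptions; the point is that a single weight matrix `W` is shared between both directions):

```python
import numpy as np

def encode_with_reconstruction(a_i, W, b_enc, b_dec):
    """Encode the i-th adjacency row a_i into features x_i (Eqn. 10) and
    reconstruct it with the transposed (tied) weights, as in our loss term."""
    x_i = np.tanh(W @ a_i + b_enc)             # encoder: feature vector x_i
    a_hat = W.T @ x_i + b_dec                  # decoder reuses W^T (tied weights)
    recon_err = np.mean((a_hat - a_i) ** 2)    # reconstruction error for the loss
    return x_i, recon_err

rng = np.random.default_rng(0)
n, d = 100, 64                                 # number of nodes, feature size
a_i = rng.integers(0, 3, size=n).astype(float) # toy row of A (co-occurrence counts)
x_i, err = encode_with_reconstruction(a_i, 0.05 * rng.normal(size=(d, n)),
                                      np.zeros(d), np.zeros(n))
```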

3.3.2 Random walk based approach

Besides the encoder based approach, we also utilize a random walk based framework to generate the feature vectors (shown on the left side of Fig. 3). We extend the biased 2nd-order random walks proposed in node2vec (grover2016node2vec) to hypergraphs. For a walk that goes from vertex $t$ to $v$ and then to the next vertex $n$, the strategies are described as follows.

The 1st-order random walk strategy, given the current vertex $v$, is to randomly select a hyperedge $e$ incident with $v$ based on the weight $w(e)$, and then to choose the next vertex $n$ from $e$ uniformly (zhou2007learning). Therefore, the 1st-order transition probability is defined as:

$$ P^{(1)}(n \mid v) = \sum_{e \in E} \frac{w(e)\, h(v, e)}{d(v)} \cdot \frac{h(n, e)}{\delta(e)} \tag{11} $$

We then generalize the 2nd-order bias from ordinary graphs to hypergraphs for a walk from $t$ to $v$ to $n$ as:

$$ \alpha_{pq}(t, n) = \begin{cases} \frac{1}{p}, & \text{if } d_{tn} = 0 \\ 1, & \text{if } d_{tn} = 1 \\ \frac{1}{q}, & \text{if } d_{tn} = 2 \end{cases} \tag{12} $$

where $d_{tn} \in \{0, 1, 2\}$ is the shortest-path distance between $t$ and $n$ (on the hypergraph, $d_{tn} = 1$ if and only if $t$ and $n$ share at least one hyperedge), and the parameters $p$ and $q$ control the tendency to encourage outward exploration or to retain a local view.

Next we combine the above terms to set the biased 2nd-order transition probability as:

$$ P^{(2)}(n \mid t, v) = \frac{1}{Z}\, \alpha_{pq}(t, n)\, P^{(1)}(n \mid v) \tag{13} $$

where $Z$ is a normalizing factor.

With the well-defined 2nd-order transition probability $P^{(2)}$, we simulate a random walk of fixed length $l$ through a 2nd-order Markov process $(c_0, c_1, \ldots, c_l)$, where $c_i$ is the $i$-th node in the walk. A skip-gram model (mikolov2013word2vec; mikolov2013skipgram) is then used to extract node features from the sampled walks, such that nodes that appear in similar contexts have similar embeddings.
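A simplified sampling sketch of the walk defined by Eqns. (11)-(13). The data structures and the exact sampling scheme are illustrative assumptions: the hyperedge is drawn proportionally to $w(e)$ and the within-edge candidate uniformly (Eqn. 11), then the node2vec-style bias $\alpha_{pq}$ (Eqn. 12) reweights the candidates, with the normalization $Z$ of Eqn. (13) folded into `random.choices`:

```python
import random

def hypergraph_walk(incident, members, weight, start, length, p, q):
    """Sample one biased 2nd-order random walk (c_0, ..., c_{length-1}).

    incident[v]: hyperedge ids containing v; members[e]: nodes of hyperedge e;
    weight[e]: w(e); p, q: return / in-out parameters as in node2vec.
    """
    walk, prev = [start], None
    for _ in range(length - 1):
        v = walk[-1]
        e = random.choices(incident[v], weights=[weight[h] for h in incident[v]])[0]
        candidates = [u for u in members[e] if u != v]
        if not candidates:
            break
        if prev is None:
            nxt = random.choice(candidates)      # plain 1st-order first step
        else:
            prev_nbrs = {u for e2 in incident[prev] for u in members[e2]}
            bias = [1.0 / p if u == prev else (1.0 if u in prev_nbrs else 1.0 / q)
                    for u in candidates]         # alpha_{pq}(t, n), Eqn. (12)
            nxt = random.choices(candidates, weights=bias)[0]
        prev, walk = v, walk + [nxt]
    return walk

# Toy 3-uniform hypergraph with two hyperedges sharing node 2.
members = {0: [0, 1, 2], 1: [2, 3, 4]}
incident = {0: [0], 1: [0], 2: [0, 1], 3: [1], 4: [1]}
weight = {0: 1.0, 1: 2.0}
print(hypergraph_walk(incident, members, weight, start=0, length=10, p=1.0, q=0.5))
```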

4 Results

4.1 Evaluation Datasets

We sought to compare Hyper-SAGNN with the state-of-the-art method DHNE, which has already been demonstrated to outperform previous algorithms such as DeepWalk, LINE, and HEBE. We did not compare Hyper-SAGNN with hyper2vec (huang2019hyper2vec) for the following reasons: (1) hyper2vec cannot be directly used for the hyperedge prediction task; and (2) for $k$-uniform hypergraphs like the four datasets used in DHNE or the IMDb dataset used in the hyper2vec paper (huang2019hyper2vec), it is equivalent to standard node2vec.

We first used the same four datasets in the original DHNE paper to have a direct comparison:

  • GPS (datasetGPS): GPS network. The hyperedges are based on (user, location, activity) relations.

  • MovieLens (datasetMovieLens): Social network. The hyperedges are based on (user, movie, tag) relations, describing people's tagging activities.

  • drug: Medicine network from FAERS (http://www.fda.gov/Drugs/). The hyperedges are based on (user, drug, reaction) relations.

  • wordnet (datasetwordnet): Semantic network from WordNet 3.0. The hyperedges are based on (head entity, relation, tail entity) relations, expressing the relationships between words.

Details of the datasets, including node types, the number of nodes, and the number of edges, are shown in Table 1.

Dataset     Node types                   #(V)                     #(E)
GPS         (user, location, activity)   (146, 70, 5)             1,436
MovieLens   (user, movie, tag)           (2,113, 5,908, 9,079)    47,957
drug        (user, drug, reaction)       (12, 1,076, 6,398)       171,756
wordnet     (head, relation, tail)       (40,504, 18, 40,551)     145,966
Table 1: Network datasets used for evaluation. Each entry under “#(V)” gives the number of nodes of the corresponding type under “Node types”.

4.2 Parameter Setting

In this section, we describe the parameters used for both Hyper-SAGNN and the other methods in the evaluation. We downloaded the source code of DHNE from its GitHub repository and set the structure of its neural network to be the same as described in tu2018structural. We tuned parameters such as the loss-weighting term and the learning rate following the same procedure, and also tried adding dropout between the representation vectors and the fully connected layer for better DHNE performance. All these parameters were tuned until DHNE replicated or even improved upon the performance reported in the original paper. To make a fair comparison, for all the results below, we made sure that the training and validation data setups were the same across the different methods.

For node2vec, we decomposed the hypergraph into pairwise edges and ran node2vec on the decomposed graph. For the hyperedge prediction task, we first used the learned embeddings to predict pairwise edges, and then used the mean or the min of the pairwise similarities as the probability for a tuple to form a hyperedge. We set the window size to 10, the walk length to 40, and the number of walks per vertex to 10, the same parameters used for node2vec in DHNE. However, we found that when we tuned the hyper-parameters $p$ and $q$ and used a larger walk length, window size, and number of walks per vertex (120, 20, and 80 instead of 40, 10, and 10), the node2vec baseline achieved performance comparable to DHNE on the node classification task. This observation is consistent with the design of our biased hypergraph random walk, but the larger settings result in longer times for sampling the walks and training the skip-gram model. We therefore kept the parameters consistent with those used in the DHNE paper.

For our Hyper-SAGNN, we set the representation size to 64, the same as DHNE. When using the encoder based approach to calculate $x_i$, we set the encoder structure to be the same as the encoder part of DHNE. When using the random walk based approach, we decomposed the hypergraph into a graph as described above and set the window size to 10, the walk length to 40, and the number of walks per vertex to 10, allowing time-efficient generation of the feature vectors $x_i$. The results in Section 4.3 show that even when the pre-trained embeddings are not ideal, Hyper-SAGNN can still capture the structure of the graph well.

4.3 Performance Comparison with Existing Methods

We evaluated the effectiveness of our embedding vectors and the learned function with the network reconstruction task. We compared Hyper-SAGNN using the encoder based approach, and the model using random walk based pre-trained embeddings, against DHNE and the node2vec baseline. We first trained the model and then used the learned embeddings to predict the hyperedges of the original network. We sampled negative examples at 5 times the number of positive examples, following the setup of DHNE, and evaluated performance with both the AUROC and AUPR scores.
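As a sketch of this evaluation protocol (using scikit-learn, with `average_precision_score` as the usual stand-in for AUPR; the scores below are synthetic):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(pos_scores, neg_scores):
    """AUROC/AUPR for hyperedge scores, with negatives sampled at 5x positives."""
    y_true = np.r_[np.ones(len(pos_scores)), np.zeros(len(neg_scores))]
    y_score = np.r_[pos_scores, neg_scores]
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

rng = np.random.default_rng(0)
pos = rng.uniform(0.4, 1.0, size=1_000)   # synthetic scores for true hyperedges
neg = rng.uniform(0.0, 0.7, size=5_000)   # synthetic scores for 5x sampled negatives
auc, aupr = evaluate(pos, neg)
print(f"AUC={auc:.3f}, AUPR={aupr:.3f}")
```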

                 GPS            MovieLens      drug           wordnet
                 AUC    AUPR    AUC    AUPR    AUC    AUPR    AUC    AUPR
node2vec-mean    0.572  0.188   0.557  0.197   0.668  0.246   0.613  0.215
node2vec-min     0.570  0.187   0.535  0.186   0.682  0.257   0.576  0.201
DHNE             0.959  0.836   0.974  0.878   0.952  0.873   0.989  0.953
Hyper-SAGNN-E    0.971  0.877   0.991  0.952   0.977  0.916   0.989  0.950
Hyper-SAGNN-W    0.976  0.857   0.998  0.986   0.988  0.945   0.994  0.956
Table 2: AUC and AUPR values for network reconstruction. Models trained with the random walk based approach and the encoder based approach are marked Hyper-SAGNN-W and Hyper-SAGNN-E, respectively.

As shown in Table 2, Hyper-SAGNN captures the network structure better than DHNE across all datasets, whether using the encoder based or the random walk based approach.

Figure 4: Performance of classification on the MovieLens and wordnet datasets. Hyper-SAGNN trained with the random walk based approach and the encoder based approach is marked as Hyper-SAGNN-W and Hyper-SAGNN-E, respectively. Models trained with a mix of edges and hyperedges are denoted with “(mix)”.

We further assessed the performance of Hyper-SAGNN on the hyperedge prediction task. We randomly split the hyperedge set into training and testing sets at a ratio of 4:1; negative samples were generated in the same way as for the network reconstruction task. As shown in Table 3, our model again achieves significant improvement over DHNE in predicting unseen hyperedges. The most significant improvement is on the wordnet dataset, about a 24.6% increase in the AUPR score. For the network reconstruction and hyperedge prediction tasks, the difference between the random walk based and the encoder based Hyper-SAGNN is minor.

In addition to the tasks related to the prediction of hyperedges, we also evaluated whether the learned node embeddings are effective for node classification. A multi-label classification experiment and a multi-class classification experiment were carried out for the MovieLens dataset and the wordnet dataset, respectively. We used Logistic Regression as the classifier. The proportion of training data was varied from 10% to 90% for the MovieLens dataset, and from 1% to 10% for the wordnet dataset. We used averaged Micro-F1 and Macro-F1 to evaluate performance. The results are shown in Fig. 4. We observed that Hyper-SAGNN consistently achieves both higher Micro-F1 and Macro-F1 scores than DHNE across different fractions of training data. Also, Hyper-SAGNN based on the random walk approach generally achieves the best performance (Hyper-SAGNN-W in Fig. 4).

4.4 Performance on Non-$k$-uniform Hypergraphs

Next, we evaluated Hyper-SAGNN on non-$k$-uniform heterogeneous hypergraphs. For each of the above four datasets, we decomposed each hyperedge into 3 pairwise edges and added them to the existing graph. We trained our model to predict both the hyperedges and the edges (i.e., non-hyperedges), then evaluated link prediction performance for both. We also performed the node classification task with the same setting as above. The results for link prediction are in Table 3; Fig. 4 shows the results for node classification.

                       GPS            MovieLens      drug           wordnet
                       AUC    AUPR    AUC    AUPR    AUC    AUPR    AUC    AUPR
node2vec-mean          0.563  0.191   0.562  0.197   0.670  0.246   0.608  0.213
node2vec-min           0.570  0.185   0.539  0.186   0.684  0.258   0.575  0.200
DHNE                   0.910  0.668   0.877  0.668   0.925  0.859   0.816  0.459
Hyper-SAGNN-E          0.952  0.798   0.926  0.793   0.961  0.895   0.890  0.705
Hyper-SAGNN-W          0.922  0.722   0.930  0.810   0.955  0.892   0.880  0.706
Hyper-SAGNN-E (mix)    0.950  0.795   0.928  0.799   0.956  0.887   0.881  0.694
Hyper-SAGNN-W (mix)    0.920  0.720   0.929  0.811   0.950  0.889   0.884  0.684

                       GPS (2)        MovieLens (2)  drug (2)       wordnet (2)
                       AUC    AUPR    AUC    AUPR    AUC    AUPR    AUC    AUPR
Hyper-SAGNN-E (mix)    0.921  0.899   0.971  0.967   0.981  0.973   0.891  0.897
Hyper-SAGNN-W (mix)    0.931  0.910   0.999  0.999   0.999  0.999   0.923  0.916
Table 3: Performance evaluation based on AUROC and AUPR for hyperedge/edge prediction. Methods annotated with “(mix)” represent Hyper-SAGNN trained with a mixture of edges and hyperedges. Datasets marked with “(2)” give the performance on pairwise edge prediction (i.e., non-hyperedges).

We observed that Hyper-SAGNN can preserve the graph structure on different levels. Compared to training the model with hyperedges only, including the edges in training does not cause obvious changes in hyperedge prediction performance (about a 1% fluctuation in AUC/AUPR).
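The mixed training set above was built by expanding each hyperedge into its pairwise edges; a minimal sketch (assuming hashable, sortable node ids):

```python
from itertools import combinations

def decompose_hyperedges(hyperedges):
    """Expand each hyperedge into its pairwise edges; a size-3 hyperedge such as
    (user, drug, reaction) yields 3 edges. Used to mix edges with hyperedges."""
    edges = set()
    for he in hyperedges:
        edges.update(combinations(sorted(he), 2))
    return edges

print(decompose_hyperedges([("u1", "d7", "r2")]))
# {('d7', 'r2'), ('d7', 'u1'), ('r2', 'u1')}
```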

We then further assessed the model in a new evaluation setting with adequate edges but only a few hyperedges present, asking whether the model can still achieve good hyperedge prediction performance. This scenario arises in real-world applications, especially when a dataset is combined from different sources. For example, in the drug dataset, it is possible that, in addition to the (user, drug, reaction) hyperedges, there are extra edges from other sources, e.g., (drug, reaction) edges from a drug database, and (user, drug) and (user, reaction) edges from medical records. For each dataset tested, we used about 50% of the edges and only 5% of the hyperedges in the network to train the model. The results are in Fig. 5.

When using only the edges to train the model, our method still achieves higher AUROC and AUPR scores for hyperedge prediction than node2vec (Table 3). We found that when the model is trained with both the downsampled hyperedge set and the edge set, it reaches higher performance, or suffers less from overfitting, than when trained with either set individually. This demonstrates that our model can capture consensus information about the graph structure across hyperedges of different sizes.

Figure 5: AUROC and AUPR scores of Hyper-SAGNN for hyperedge prediction on the downsampled dataset over training epochs.

4.5 Outsider Identification

In addition to the standard link prediction and node classification tasks, we further formulated a new task called “outsider identification”. Previous methods such as DHNE can answer whether a specific tuple of nodes forms a hyperedge; in many settings, however, we also want to know why a group of nodes does not form a hyperedge. We first define the outsider of a group of nodes as follows. Node $v_j$ is the outsider of the node group $(v_1, \ldots, v_k)$ if it satisfies:

$$ (v_1, \ldots, v_{j-1}, v_j, v_{j+1}, \ldots, v_k) \notin E \tag{14} $$

$$ \exists\, u \in V:\ (v_1, \ldots, v_{j-1}, u, v_{j+1}, \ldots, v_k) \in E \tag{15} $$

We speculated that Hyper-SAGNN can answer this question by analyzing the probability scores $p_1$ to $p_k$ (defined in Eqn. 6), assuming that the node with the smallest $p_i$ is the outsider. We set up the evaluation as follows. We first trained the model as usual, but at the final stage we replaced the average pooling layer with a min pooling layer and fine-tuned the model for several epochs. We then fed generated triplets with a known outsider node into the trained model and calculated how often the outsider matched the node with the smallest probability. Because this task builds on the prediction of hyperedges, we only tested on the dataset with the best hyperedge prediction performance, i.e., the drug dataset. The outsider had the smallest probability 81.9% of the time and was among the two smallest probabilities 95.3% of the time. These results also show that switching the pooling layer improves outsider identification accuracy (from 78.5% to 81.9%) at the cost of slightly decreased hyperedge prediction performance (AUC from 0.955 to 0.935). This demonstrates that our model can accurately identify the outsider within a group even without additional labeled information. Moreover, the performance of outsider identification can be further improved by including in the loss term the cross-entropy between $p_i$ and the label of whether node $i$ is an outsider, for all applicable triplets. Together, these results demonstrate the advantage of Hyper-SAGNN in terms of the interpretability of hyperedge prediction.
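A small sketch of the identification rule (the per-node scores below are synthetic; `top` ranks candidates by ascending $p_i$):

```python
import numpy as np

def rank_outsiders(p, top=2):
    """Rank the nodes of a tuple by per-node score p_i (Eqn. 6), ascending;
    with min pooling, the lowest-scoring node is the predicted outsider."""
    return np.argsort(p)[:top]

p = np.array([0.91, 0.88, 0.12])     # synthetic scores for one generated triplet
print(rank_outsiders(p))             # [2 1]: node 2 is the top outsider candidate
```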

4.6 Application to Single-cell Hi-C Datasets

We next applied Hyper-SAGNN to recently produced single-cell Hi-C (scHi-C) datasets (ramani2017massively; nagano2017cell). Genome-wide mapping of chromatin interactions by Hi-C (lieberman2009comprehensive; rao20143d) has enabled comprehensive characterization of 3D genome organization, revealing patterns of chromatin interactions between genomic loci. However, unlike bulk Hi-C data, where signals are aggregated from cell populations, scHi-C provides unique information about chromatin interactions at single-cell resolution, thus allowing us to ascertain cell-to-cell variation of 3D genome organization. Specifically, we propose that scHi-C makes it possible to model the cell-to-cell variation of chromatin interaction as a hyperedge, i.e., (cell, genomic locus, genomic locus). For the analysis of scHi-C, the most common strategy is to reveal cell-to-cell variation by embedding the cells based on their contact matrices and then applying clustering algorithms such as K-means or hierarchical clustering to the embedded vectors. We performed the following evaluation to assess the effectiveness of Hyper-SAGNN at learning cell embeddings by representing the scHi-C data as hypergraphs.
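A sketch of this hypergraph representation (the data layout is hypothetical; in practice the loci would be genomic bins at a chosen resolution):

```python
def schic_to_hyperedges(contacts):
    """Turn scHi-C contacts into (cell, locus, locus) hyperedges.

    `contacts` maps a cell id to its list of chromatin contacts, each a pair of
    binned genomic loci; cells and loci become nodes of different types."""
    return [(cell, a, b) for cell, pairs in contacts.items() for a, b in pairs]

toy = {"cell_1": [("chr1:0-1Mb", "chr1:5-6Mb")],
       "cell_2": [("chr2:3-4Mb", "chr2:9-10Mb")]}
print(schic_to_hyperedges(toy))
```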

We tested Hyper-SAGNN on two datasets. The first consists of scHi-C from four human cell lines: HAP1, GM12878, K562, and HeLa (ramani2017massively). The second includes scHi-C profiling the cell cycle of mouse embryonic stem cells (nagano2017cell). For brevity, we refer to the first dataset as “Ramani et al. data” and the second as “Nagano et al. data”.

Figure 6: (A) and (B): Visualization of the learned embedding based on Hyper-SAGNN for the Ramani et al. data. (C) and (D): Visualization of the learned embedding based on Hyper-SAGNN for the Nagano et al. data. Embedding vectors are projected to two-dimensional space using either UMAP or PCA. (E): Quantitative evaluation of Hyper-SAGNN on the two scHi-C datasets.

We trained Hyper-SAGNN on the corresponding datasets. Due to the large average degree of the cell nodes, the random walk approach would take an extensive amount of time to sample walks, so we only applied the encoder version of our method. We visualize the learned embeddings by reducing them to 2 dimensions with PCA and UMAP (mcinnes2018umap) (Fig. 6A-D).

We quantified the effectiveness of the embeddings by applying K-means clustering to the Ramani et al. data and evaluating with the Adjusted Rand Index (ARI). We also assessed the embeddings in a supervised scenario: using Logistic Regression as the classifier with 10% of the cells as training samples, we evaluated the multi-class classification task with Micro-F1 and Macro-F1. We did not run K-means clustering on the Nagano et al. data, as it represents a continuous cell-cycle state that is not suitable for a clustering task; instead, we used the ACROC (Average Circular ROC) metric developed in the HiCRep/MDS paper (liu2018unsupervised) to evaluate the three methods on these data. We compared performance with two recently developed computational methods based on dimensionality reduction of the contact matrix: HiCRep/MDS (liu2018unsupervised) and scHiCluster (zhou2019robust). Because Hyper-SAGNN is not deterministic in generating embeddings for scHi-C, we repeated the training process 5 times and averaged the scores. All these results are in Fig. 6E.
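A sketch of the unsupervised part of this evaluation with scikit-learn (the embeddings and labels below are random stand-ins for the learned cell embeddings and the four cell-line labels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 64))        # stand-in for learned cell embeddings
labels = rng.integers(0, 4, size=400)   # stand-in for the four cell-line labels

pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(emb)
print(adjusted_rand_score(labels, pred))  # ARI against known cell types
```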

For the Ramani et al. data (Fig. 6A-B), the visualization of the embedding vectors learned by Hyper-SAGNN exhibits clear patterns in which cells of the same cell type cluster together. Moreover, HAP1, GM12878, and K562 are all blood-related cell lines and are likely to be more similar to each other in 3D genome organization than to HeLa; indeed, we observed that they are also closer to each other in the embedding space. The quantitative results in Fig. 6E are consistent with the visualization, as our method achieves the highest ARI, Micro-F1, and Macro-F1 scores among the three methods. For the Nagano et al. data (Fig. 6C-D), the embeddings exhibit a circular pattern that corresponds to the cell cycle, and both HiCRep/MDS and Hyper-SAGNN achieve high ACROC scores. All these results demonstrate the effectiveness of representing scHi-C datasets as hypergraphs with Hyper-SAGNN, which has great potential to provide insights into cell-to-cell variation of higher-order genome organization.

5 Conclusion

In this work, we have developed a new graph neural network model called Hyper-SAGNN for representation learning on general hypergraphs. The framework can flexibly handle homogeneous and heterogeneous, uniform and non-uniform hypergraphs. We demonstrated that Hyper-SAGNN improves on or matches state-of-the-art performance for hypergraph representation learning while addressing shortcomings of prior methods, such as the inability to predict hyperedges in non-$k$-uniform heterogeneous hypergraphs. Hyper-SAGNN is also computationally efficient, as the size of the input to the graph attention layer is bounded by the maximum hyperedge size rather than the number of first-order neighbors.

One potential improvement of Hyper-SAGNN as future work would be to allow information aggregation over all first-order neighbors before calculating the static/dynamic embeddings for a node, at additional computational cost. With this design, the static embedding for a node would still satisfy our constraint that it is fixed for a known hypergraph across varying input tuples. This would allow us to incorporate previously developed methods for graphs, such as GraphSAGE (hamilton2017inductive) and GCN (kipf2016semi), as well as methods designed for hypergraphs like HyperGCN (yadati2018hypergcn), into this framework for better link prediction performance. Such an improvement may also extend the application of Hyper-SAGNN to semi-supervised learning.

Acknowledgment

J.M. acknowledges support from the National Institutes of Health Common Fund 4D Nucleome Program grant U54DK107965, National Institutes of Health grant R01HG007352, and National Science Foundation grant 1717205. Y.Z. (Yao Class, IIIS, Tsinghua University) contributed to this work as a visiting undergraduate student at Carnegie Mellon University during summer 2019.

References

Appendix A Appendix

A.1 Comparison of Hyper-SAGNN with Its Variants

As mentioned above, unlike the standard GAT model, we exclude the $j = i$ term in the self-attention mechanism. To test whether this constraint improves or reduces the model's ability to learn, we implemented a variant of our model (referred to as variant type I) that includes this term. Also, as mentioned in the Method section, another potential variant of our model directly uses the dynamic embedding $d_i$ to calculate the probability score $p_i$; we refer to this as variant type II. For variant type II on the node classification task, since it does not have a static embedding, we used the dynamic embeddings instead. The rest of the parameters and the structure of the neural network remain the same.

We then compared the performance of Hyper-SAGNN and the two variants in terms of AUC and AUPR values on the network reconstruction and hyperedge prediction tasks on the following four datasets: MovieLens, wordnet, drug, and GPS. We also compared performance in terms of Micro-F1 and Macro-F1 scores on the node classification task on the MovieLens and wordnet datasets. For the MovieLens dataset, we used 90% of the nodes as training data, while for wordnet we used 1% of the nodes. All evaluation setups are the same as described in the main text. To reduce the effect of randomness in neural network training, we repeated the training process for each experiment five times and plotted the score versus the epoch number. To illustrate the differences more clearly, we start the plots at epoch 3 for the random walk based approach and at epoch 12 for the encoder based approach. The performance of the model using the random walk based approach is shown in Figs. A1 to A4; the performance of the model using the encoder based approach is shown in Figs. A5 to A8.

For models with the random walk based approach, Hyper-SAGNN is the best in terms of all metrics on the GPS, MovieLens, and wordnet datasets. On the drug dataset, Hyper-SAGNN achieves higher AUROC and AUPR scores on the network reconstruction task than the two variants, but a slightly lower AUROC score on the link prediction task (by less than 0.5%).

For models with the encoder based approach, the advantage is less obvious. All three methods achieve similar performance in terms of all metrics on the GPS and drug datasets. On the MovieLens and wordnet datasets, Hyper-SAGNN performs similarly to variant type I and better than variant type II on the network reconstruction and link prediction tasks, while achieving slightly higher accuracy than variant type I on the node classification task.

Therefore, these evaluations demonstrate that the chosen structure of Hyper-SAGNN achieves higher, or at least comparable, performance relative to the two potential variants across multiple tasks and datasets.

Figure A1: Performance comparison of Hyper-SAGNN – Walk and Variant Type I, II (GPS)
Figure A2: Performance comparison of Hyper-SAGNN – Walk and Variant Type I, II (MovieLens)
Figure A3: Performance comparison of Hyper-SAGNN – Walk and Variant Type I, II (drug)
Figure A4: Performance comparison of Hyper-SAGNN – Walk and Variant Type I, II (wordnet)
Figure A5: Performance comparison of Hyper-SAGNN – Encoder and Variant Type I, II (GPS)
Figure A6: Performance comparison of Hyper-SAGNN – Encoder and Variant Type I, II (MovieLens)
Figure A7: Performance comparison of Hyper-SAGNN – Encoder and Variant Type I, II (drug)
Figure A8: Performance comparison of Hyper-SAGNN – Encoder and Variant Type I, II (wordnet)