1 Introduction
Graph structure is a widely used representation for data with complex interactions. Learning on graphs, i.e., predicting or discovering patterns based on the graph structure, has also been an active research area in machine learning
(hamilton2017representation). Although existing methods achieve strong performance on tasks such as link prediction and node classification, they are mostly designed for analyzing pairwise interactions and are thus unable to effectively capture higher-order interactions in graphs. In many real-world applications, however, relationships among multiple instances are key to capturing critical properties, e.g., co-authorship involving more than two authors or relationships among multiple heterogeneous objects such as "(human, location, activity)". Hypergraphs can be used to represent such higher-order interactions (zhou2007learning). To analyze higher-order interaction data, it is straightforward to expand each hyperedge into pairwise edges under the assumption that the hyperedge is decomposable, and several previous methods were developed based on this notion (sun2008hypergraph; feng2018learning). However, earlier work on DHNE (Deep Hyper-Network Embedding) (tu2018structural) pointed out the existence of heterogeneous indecomposable hyperedges, where relationships within an incomplete subset of a hyperedge do not exist. Although DHNE provides a potential solution by modeling the hyperedge directly without decomposing it, the neural network structure used in DHNE limits the method to heterogeneous hyperedges of fixed type and fixed size, so it cannot consider relationships among multiple types of instances with variable size. For example, Fig. 1 shows a heterogeneous co-authorship hypergraph with two types of nodes (corresponding author and co-author). Because the numbers of both authors and corresponding authors vary across publications, the hyperedges (co-authorships) have different sizes or types.
Unfortunately, methods for representation learning on heterogeneous hypergraphs with variable-sized hyperedges, especially methods that can predict such hyperedges, have not been developed. In this work, we developed a new self-attention based graph neural network, called Hyper-SAGNN, that works with both homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. Using the same datasets as the DHNE paper (tu2018structural), we demonstrated the advantage of Hyper-SAGNN over DHNE on multiple tasks. We further tested the effectiveness of the method in predicting edges and hyperedges and showed that the model achieves better performance in the multi-tasking setting. We also formulated a novel task called outsider identification and showed that Hyper-SAGNN performs strongly on it. Importantly, as an application of Hyper-SAGNN to single-cell genomics, we learned embeddings for recently produced single-cell Hi-C (scHi-C) datasets to uncover the clustering of cells based on their 3D genome structure (ramani2017massively; nagano2017cell). We showed that Hyper-SAGNN achieves improved results in identifying distinct cell populations as compared to existing scHi-C clustering methods. Taken together, Hyper-SAGNN significantly outperforms state-of-the-art methods and can be applied to a wide range of hypergraphs for different applications.
2 Related Work
Deep learning based models have been developed recently to generalize from graphs to hypergraphs (gui2016large; tu2018structural). The HyperEdge Based Embedding (HEBE) method (gui2016large) learns the embedding of each object in a specific heterogeneous event by representing the event as a hyperedge. However, as demonstrated in tu2018structural, HEBE does not perform well on sparse hypergraphs. Notably, previous methods typically decompose a hyperedge into pairwise relationships, where the decomposition methods can be divided into two categories: explicit and implicit. For instance, given a hyperedge (a, b, c), the explicit approach would decompose it directly into three edges, (a, b), (a, c), and (b, c), while the implicit approach would add a hidden node e representing the hyperedge before decomposition, i.e., (a, e), (b, e), and (c, e). The Deep Hyper-Network Embedding (DHNE) model, however, directly models the tuple-wise relationship using an MLP (multi-layer perceptron). The method achieves better performance on multiple tasks as compared to other methods designed for graphs or hypergraphs such as DeepWalk
(perozzi2014deepwalk), node2vec (grover2016node2vec), and HEBE. Unfortunately, the MLP takes fixed-size input, making DHNE only capable of handling k-uniform hypergraphs, i.e., hypergraphs whose hyperedges all contain k nodes. To use DHNE for non-uniform hypergraphs or hypergraphs with different types of hyperedges, a separate function needs to be trained for each type of hyperedge, which leads to significant computational cost and loses the ability to generalize to unseen types of hyperedges. Another recent method, hyper2vec (huang2019hyper2vec), can also generate embeddings for nodes within a hypergraph. However, hyper2vec cannot solve the link prediction problem directly, as it only generates node embeddings in an unsupervised manner without a learned function mapping from node embeddings to hyperedges. Also, for uniform hypergraphs, hyper2vec is equivalent to node2vec, which cannot capture the high-order network structure of indecomposable hyperedges (as shown in tu2018structural). Our Hyper-SAGNN addresses all these challenges with a self-attention based graph neural network that can learn node embeddings and predict hyperedges for non-uniform heterogeneous hypergraphs.
3 Method
3.1 Definitions and Notations
Definition 1.
(Hypergraph) A hypergraph is defined as G = (V, E), where V = \{v_1, \dots, v_n\} represents the set of nodes in the graph, and E = \{E_1, \dots, E_m\} represents the set of hyperedges. Any hyperedge E_i \in E can contain more than two nodes (i.e., |E_i| \geq 2). If all hyperedges within a hypergraph have the same size k, it is called a k-uniform hypergraph. Note that even if a hypergraph is uniform, it can still have different types of hyperedges, because the node type can vary for nodes within the hyperedges.
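As a concrete illustration of this definition, a hypergraph can be held as a collection of node sets; the helper below is our own sketch (names are illustrative), not code from the paper:

```python
# Illustrative sketch of Definition 1: a hypergraph as a list of hyperedges,
# each a frozenset of node ids.
def uniform_size(hyperedges):
    """Return k if the hypergraph is k-uniform, otherwise None."""
    sizes = {len(e) for e in hyperedges}
    return sizes.pop() if len(sizes) == 1 else None

E_uniform = [frozenset({"u1", "m1", "t1"}), frozenset({"u2", "m1", "t2"})]  # 3-uniform
E_mixed = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]               # non-uniform
```

Note that `E_uniform` is 3-uniform yet could still mix hyperedge types if its nodes have different types, as the definition points out.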
Definition 2.
(The hyperedge prediction problem) We formally define the hyperedge prediction problem. For a given tuple (v_{i_1}, \dots, v_{i_k}), our goal is to learn a function f that satisfies:

f(v_{i_1}, \dots, v_{i_k}) \begin{cases} \geq \delta, & (v_{i_1}, \dots, v_{i_k}) \in E \\ < \delta, & \text{otherwise} \end{cases}   (1)

where \delta is the threshold to binarize the continuous value of f into a label, which indicates whether the tuple is a hyperedge or not. Specifically, when we are given the pre-trained embedding vectors or the features of nodes (x_1, \dots, x_n), we can rewrite this function as:

f(v_{i_1}, \dots, v_{i_k}) = \psi\left(\phi(x_{i_1}), \dots, \phi(x_{i_k})\right)   (2)

where the vectors \phi(x_i) can be considered as the fine-tuned embedding vectors for the nodes. For convenience, we refer to x_i as the features and \phi(x_i) as the learned embeddings.
3.2 Structure of HyperSAGNN
Our goal is to learn the functions \phi and \psi that take tuples of node features (x_{i_1}, \dots, x_{i_k}) as input and produce the probability of these nodes forming a hyperedge. Without the assumption that the hypergraph is uniform and that the type of each hyperedge is identical, we require that \psi can take variable-sized, non-ordered input. Although simple functions such as average pooling satisfy this tuple-wise condition, previous work showed that a linear function is not sufficient to model this relationship (tu2018structural). DHNE used an MLP to model the non-linear function, but it requires an individual function to be trained for each type of hyperedge. Here we propose a new method to tackle the general hyperedge prediction problem.

Graph neural network based methods such as GraphSAGE (hamilton2017inductive) typically define a unique computational graph for each node, allowing efficient information aggregation for nodes with different degrees. The Graph Attention Network (GAT) introduced by velivckovic2017graph utilizes a self-attention mechanism in the information aggregation process. Motivated by these properties, we propose our method Hyper-SAGNN, which applies a self-attention mechanism within each tuple to learn the function \psi.
We first briefly introduce the self-attention mechanism. We use the same terms as the self-attention mechanism described in vaswani2017attention; velivckovic2017graph. Given a group of nodes with features (x_{i_1}, \dots, x_{i_k}) and trainable weight matrices W_Q, W_K, W_V that linearly transform the features before the dot-product attention is applied, we first compute the attention coefficients that reflect the pairwise importance of nodes:

\alpha_{jl} = \left(W_Q^\top x_{i_j}\right)^\top \left(W_K^\top x_{i_l}\right)   (3)

We then normalize \alpha_{jl} over all possible l within the tuple through the softmax function, i.e.,

\hat{\alpha}_{jl} = \frac{\exp(\alpha_{jl})}{\sum_{l'=1}^{k} \exp(\alpha_{jl'})}   (4)

Finally, a weighted sum of the transformed features with an activation function \sigma is calculated:

d_{i_j} = \sigma\left(\sum_{l=1}^{k} \hat{\alpha}_{jl}\, W_V^\top x_{i_l}\right)   (5)
In GAT, the self-attention mechanism is applied to each node, usually together with all of its first-order neighbors. In Hyper-SAGNN, we aggregate the information for a node only with its neighbors within a given tuple. The structure of Hyper-SAGNN is illustrated in Fig. 2.
The input to our model can be represented as tuples, i.e., (x_{i_1}, \dots, x_{i_k}). Each tuple first passes through a position-wise feed-forward network to produce the static embeddings (s_{i_1}, \dots, s_{i_k}), where s_{i_j} = \tanh(W_S^\top x_{i_j}). We refer to s_{i_j} as the static embedding for node v_{i_j} since it remains the same for that node no matter what the given tuple is. The tuple also passes through a multi-head graph attention layer to produce a new set of node embedding vectors (d_{i_1}, \dots, d_{i_k}), which we refer to as the dynamic embeddings because they depend on all the node features within the tuple.
Note that unlike the standard attention mechanism described above, when calculating d_{i_j}, we require that l \neq j in Eqn. (5). In other words, we exclude the term \hat{\alpha}_{jj} W_V^\top x_{i_j} in the calculation of the dynamic embeddings. Based on our results, we found that including this term leads to either similar or worse performance on the hyperedge prediction and node classification tasks (see Appendix A.1 for details). We will elaborate on the motivation for this choice later in this section.
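As a minimal sketch of Eqns. (3)-(5) with this exclusion, assuming random placeholder weight matrices in place of trained ones:

```python
import numpy as np

# Sketch of the intra-tuple attention of Eqns (3)-(5) for one tuple of k node
# features (rows of X). W_q, W_k, W_v are trainable in the model; here they are
# random placeholders.
def dynamic_embeddings(X, W_q, W_k, W_v, exclude_self=True):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    att = Q @ K.T                                   # Eq. (3): pairwise coefficients
    if exclude_self:                                # Hyper-SAGNN: drop the l = j term
        np.fill_diagonal(att, -np.inf)
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)           # Eq. (4): softmax within the tuple
    return np.tanh(att @ V)                         # Eq. (5): weighted sum + activation

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))                         # a 3-node tuple, 8-dim features
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
D = dynamic_embeddings(X, W_q, W_k, W_v)            # dynamic embeddings, shape (3, 4)
```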
With the static and dynamic embedding vectors for each node, we calculate the Hadamard power (element-wise power) of the difference of each corresponding static/dynamic pair. It is then further passed through a one-layer neural network with sigmoid as the activation function to produce a probability score p_{i_j}. Finally, all the outputs are averaged to get the final \hat{y}, i.e.,

p_{i_j} = \mathrm{sigmoid}\left(W_O^\top (d_{i_j} - s_{i_j})^{\circ 2} + b\right)   (6)

\hat{y} = \frac{1}{k} \sum_{j=1}^{k} p_{i_j}   (7)
By design, W_O^\top (d_{i_j} - s_{i_j})^{\circ 2} can be regarded as a squared weighted pseudo-euclidean distance between the static embedding s_{i_j} and the dynamic one d_{i_j}. It is called a pseudo-euclidean distance because we do not require the weights in W_O to be non-negative or to sum up to 1. One rationale for allowing negative weights when calculating the distance is the Minkowski space, where the squared distance is defined as s^2 = x^2 + y^2 + z^2 - t^2 (in natural units). Therefore, for these high-dimensional embedding vectors, we do not specifically treat them as euclidean vectors.
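The scoring step of Eqns. (6)-(7) can be sketched as follows, with placeholder weights standing in for trained ones (note the entries of the output weight vector may be negative, which is exactly the pseudo-euclidean weighting discussed above):

```python
import numpy as np

# Sketch of Eqns (6)-(7): per-node probabilities from the Hadamard power of
# (dynamic - static), averaged into the tuple score. w_o and b are placeholders.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tuple_score(S, D, w_o, b):
    p = sigmoid(((D - S) ** 2) @ w_o + b)  # Eq. (6): one-layer net on (d - s)^{∘2}
    return p, p.mean()                     # Eq. (7): average over the k nodes

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 4))    # static embeddings of a 3-node tuple
D = rng.normal(size=(3, 4))    # dynamic embeddings of the same tuple
w_o, b = rng.normal(size=4), 0.0
p, y_hat = tuple_score(S, D, w_o, b)
```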
Our network essentially aims to correlate the average "distance" between the static/dynamic embedding pairs with the probability of the node group forming a hyperedge. Since the dynamic embedding is the weighted sum of the features (with a potential non-linear transformation) of the neighbors within the tuple, this "distance" reflects how well the static embedding of each node can be approximated by the features of its neighbors within that tuple. This design strategy shares some similarities with the CBOW model in natural language processing
(mikolov2013skipgram), where the model aims to predict the target word given its context. In principle, we could still include the l = j term in Eqn. (5) to obtain the dynamic embedding. Alternatively, we could directly pass the dynamic embedding through a fully connected layer to produce the probability score while the rest remains the same. However, we argue that our proposed model produces static embeddings that can be directly used for tasks such as node classification, while the alternative approach is unable to achieve that (see Appendix A.1 for a detailed analysis).
3.3 Approaches for Generating Features
In an inductive learning setting with known node attributes, x_i can simply be the attributes of node v_i. However, in a transductive learning setting where node attributes are unknown, we have to generate x_i based solely on the graph structure. Here we use two existing strategies to generate the features x_i.
We first define the functions used in the subsequent sections as follows: a hyperedge e \in E with weight w(e) is incident with a vertex v if and only if v \in e. We denote the indicator function that represents the incidence relationship between v and e by h(v, e), which equals 1 when v is incident with e and 0 otherwise. The degree of a vertex, d(v), and the size of a hyperedge, \delta(e), are defined as:

d(v) = \sum_{e \in E} w(e)\, h(v, e)   (8)

\delta(e) = \sum_{v \in V} h(v, e)   (9)
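On a toy weighted hypergraph (edge names and weights are ours, for illustration only), these two quantities amount to:

```python
# Toy weighted hypergraph for Eqns (8)-(9): each hyperedge maps to
# (member vertices, weight). Names and weights are illustrative.
edges = {"e1": ({"a", "b", "c"}, 1.0), "e2": ({"b", "c"}, 2.0)}

def degree(v):
    """Eq. (8): d(v) = sum of w(e) over hyperedges e incident with v."""
    return sum(w for members, w in edges.values() if v in members)

def size(e):
    """Eq. (9): delta(e) = number of vertices incident with e."""
    return len(edges[e][0])
```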
3.3.1 Encoder based approach
As shown on the right side of Fig. 3, the first method to generate features is referred to as the encoder based approach, which is similar to the structure used in DHNE (tu2018structural). We first obtain the incidence matrix H of the hypergraph, with entries H(v, e) = 1 if v \in e and 0 otherwise. We also calculate the diagonal vertex degree matrix D_v containing the degrees d(v). We then have the adjacency matrix A = H H^\top - D_v, whose entries denote the number of co-occurrences of each pair of nodes. The i-th row of A, denoted by A_i, describes the neighborhood structure of node v_i; it passes through a one-layer neural network to produce x_i:

x_i = \tanh(W_A^\top A_i + b_A)   (10)
In DHNE, a symmetric structure was introduced, with corresponding decoders that transform x_i back to A_i. tu2018structural
remarked that including this reconstruction error term helps DHNE learn the graph structure better. We also include the reconstruction error term in the loss function, but with tied weights between the encoder and the decoder to reduce the number of parameters that need to be trained.
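A toy forward pass of this encoder, with a random placeholder weight matrix standing in for the trained one, can be sketched as:

```python
import numpy as np

# Sketch of the encoder input: incidence matrix H (nodes x hyperedges),
# adjacency A = H H^T - D_v, then a one-layer tied-weight autoencoder pass.
H = np.array([[1, 0],           # node 0 is in hyperedge 0 only
              [1, 1],           # node 1 is in both hyperedges
              [1, 1],
              [0, 1]], dtype=float)
D_v = np.diag(H.sum(axis=1))    # vertex degrees on the diagonal (unit weights)
A = H @ H.T - D_v               # off-diagonal entries count node co-occurrences

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3), scale=0.1)   # encoder weight (placeholder)
X = np.tanh(A @ W)                       # Eq. (10): one feature vector per node
A_rec = np.tanh(X @ W.T)                 # tied-weight decoder used for the
                                         # reconstruction term in the loss
```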
3.3.2 Random walk based approach
Besides the encoder based approach, we also utilize a random walk based framework to generate the feature vectors x_i (shown on the left side of Fig. 3). We extend the biased 2nd-order random walks proposed in node2vec (grover2016node2vec) to hypergraphs. For a walk that moves from node t to node v and then to node x, the strategies are described as follows.
The 1st-order random walk strategy, given the current vertex v, is to randomly select a hyperedge e incident with v based on the weight w(e) and then to choose the next vertex u from e uniformly (zhou2007learning). Therefore, the 1st-order transition probability is defined as:

P(u \mid v) = \sum_{e \in E} \frac{w(e)\, h(v, e)}{d(v)} \cdot \frac{h(u, e)}{\delta(e)}   (11)
We then generalize the 2nd-order bias from ordinary graphs to hypergraphs for a walk from t to v to x as:

\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \frac{1}{q}, & d_{tx} = 2 \end{cases}   (12)

where d_{tx} is the shortest path distance between t and x, and the parameters p and q control the tendencies of the walk to encourage outward exploration or to obtain a local view.
Next, we combine the above terms to define the biased 2nd-order transition probability as:

P(x \mid v, t) = \frac{\alpha_{pq}(t, x)\, P(x \mid v)}{Z}   (13)

where Z is a normalizing factor.
With the well-defined 2nd-order transition probability P(x \mid v, t), we simulate random walks of fixed length l through a 2nd-order Markov process, producing node sequences (c_0, c_1, \dots, c_l), where c_i is the i-th node in the walk. A Skip-gram model (mikolov2013word2vec; mikolov2013skipgram) is then used to extract the node features x_i from the sampled walks, such that nodes that appear in similar contexts have similar embeddings.
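A sketch of one biased step of this walk (Eqns. 11-13) on a toy hypergraph; the hyperedges, node names, and helper functions are our own illustration:

```python
import random

# Toy hypergraph: each hyperedge is (member vertices, weight).
edges = [({"a", "b", "c"}, 1.0), ({"c", "d"}, 1.0), ({"d", "e"}, 2.0)]

def neighbors(v):
    """Unnormalized 1st-order weights of Eq. (11): pick an incident hyperedge
    by weight, then a member uniformly."""
    probs = {}
    for members, w in edges:
        if v in members:
            for u in members - {v}:
                probs[u] = probs.get(u, 0.0) + w / len(members)
    return probs

def step(prev, cur, p=1.0, q=1.0):
    """One 2nd-order step: apply the bias of Eq. (12), sample per Eq. (13)."""
    probs = neighbors(cur)
    for u in list(probs):
        if u == prev:                     # distance 0 from prev: bias 1/p
            probs[u] /= p
        elif u not in neighbors(prev):    # distance 2 from prev: bias 1/q
            probs[u] /= q                 # (shared-hyperedge neighbors keep bias 1)
    nodes = list(probs)
    return random.choices(nodes, weights=[probs[u] for u in nodes])[0]

random.seed(0)
walk = ["a", "b"]
for _ in range(5):
    walk.append(step(walk[-2], walk[-1]))
```

The sampled walks would then feed a Skip-gram model exactly as in node2vec.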
4 Results
4.1 Evaluation Datasets
We sought to compare Hyper-SAGNN with the state-of-the-art method DHNE, as it has already demonstrated superior performance over previous algorithms such as DeepWalk, LINE, and HEBE. We did not compare Hyper-SAGNN with hyper2vec (huang2019hyper2vec) for the following reasons: (1) hyper2vec cannot be directly used for the hyperedge prediction task; and (2) for uniform hypergraphs like the four datasets used in DHNE or the IMDb dataset used in the hyper2vec paper (huang2019hyper2vec), it is equivalent to standard node2vec.
We first used the same four datasets in the original DHNE paper to have a direct comparison:

- GPS (datasetGPS): GPS network. The hyperedges are based on (user, location, activity) relations.
- MovieLens (datasetMovieLens): Social network. The hyperedges are based on (user, movie, tag) relations, describing people's tagging activities.
- drug: Medicine network from FAERS (http://www.fda.gov/Drugs/). The hyperedges are based on (user, drug, reaction) relations.
- wordnet (datasetwordnet): Semantic network from WordNet 3.0. The hyperedges are based on (head entity, relation, tail entity) relations, expressing relationships between words.
Details of the datasets, including node types, the number of nodes, and the number of edges, are shown in Table 1.
Datasets   Node types                  #(V)                      #(E)
GPS        (user, location, activity)  (146, 70, 5)              1,436
MovieLens  (user, movie, tag)          (2,113, 5,908, 9,079)     47,957
drug       (user, drug, reaction)      (12, 1,076, 6,398)        171,756
wordnet    (head, relation, tail)      (40,504, 18, 40,551)      145,966
4.2 Parameter Setting
In this section, we describe details of the parameters used for both Hyper-SAGNN and the other methods in the evaluation. We downloaded the source code of DHNE from its GitHub repository. The structure of the neural network of DHNE was set to be the same as the authors described in tu2018structural. We tuned parameters such as the trade-off term of the loss function and the learning rate following the same procedure. We also tried adding dropout between the representation vectors and the fully connected layer for better DHNE performance. All these parameters were tuned until we were able to replicate or even improve on the performance reported in the original paper. To make a fair comparison, for all the results below, we made sure that the training and validation data setups were the same across the different methods.
For node2vec, we decomposed the hypergraph into pairwise edges and ran node2vec on the decomposed graph. For the hyperedge prediction task, we first used the learned embeddings to predict pairwise edges, and then used the mean or the minimum of the pairwise similarities as the probability for the tuple to form a hyperedge. We set the window size to 10, the walk length to 40, and the number of walks per vertex to 10, which are the same parameters used for node2vec in DHNE. However, we found that when we tuned the hyperparameters p and q and used a larger walk length, window size, and number of walks per vertex (120, 20, 80 instead of 40, 10, 10), the baseline node2vec achieved performance comparable to DHNE on the node classification task. This observation is consistent with our designed biased hypergraph random walk, but it results in a longer time for sampling the walks and training the Skip-gram model. We therefore kept the parameters consistent with those used in the DHNE paper.
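The baseline's tuple scoring described here reduces to aggregating pairwise similarities; a sketch with toy vectors in place of actual node2vec embeddings (node names are illustrative):

```python
import math

# Sketch of the node2vec baseline scoring: a tuple's hyperedge probability is
# the mean (or min) of pairwise cosine similarities between its node embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def tuple_scores(emb, tuple_nodes):
    sims = [cosine(emb[a], emb[b])
            for i, a in enumerate(tuple_nodes)
            for b in tuple_nodes[i + 1:]]
    return sum(sims) / len(sims), min(sims)   # (mean variant, min variant)

emb = {"u": (1.0, 0.0), "m": (0.8, 0.6), "t": (0.0, 1.0)}  # toy 2-d embeddings
mean_s, min_s = tuple_scores(emb, ["u", "m", "t"])
```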
For our Hyper-SAGNN, we set the representation size to 64, the same as DHNE. When using the encoder based approach to calculate x_i, we set the encoder structure to be the same as the encoder part of DHNE. When using the random walk based approach, we decomposed the hypergraph into a graph as described above. We set the window size to 10, the walk length to 40, and the number of walks per vertex to 10 to allow time-efficient generation of the feature vectors x_i. The results in Section 4.3 show that even when the pre-trained embeddings are not ideal, Hyper-SAGNN can still capture the structure of the graph well.
4.3 Performance Comparison with Existing Methods
We evaluated the effectiveness of our embedding vectors and the learned function with the network reconstruction task. We compared Hyper-SAGNN using the encoder based approach, and the model using random walk based pre-trained embeddings, against DHNE and the baseline node2vec. We first trained the model and then used the learned embeddings to predict the hyperedges of the original network. We sampled negative samples at 5 times the number of positive samples, following the same setup as DHNE. We evaluated the performance based on both the AUROC (AUC) score and the AUPR score.
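The ranking metric can be made concrete with a small sketch: AUROC equals the probability that a randomly chosen positive tuple is scored above a randomly chosen negative one (the scores below are made up for illustration):

```python
# Sketch of the evaluation protocol: score positive tuples and 5x as many
# negatives, then compute AUROC as a pairwise ranking statistic (ties count 0.5).
def auroc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.7]                     # model scores on true hyperedges
neg = [0.6, 0.5, 0.75, 0.4, 0.3, 0.2]     # scores on sampled negatives
```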
               GPS           MovieLens     drug          wordnet
               AUC    AUPR   AUC    AUPR   AUC    AUPR   AUC    AUPR
node2vec-mean  0.572  0.188  0.557  0.197  0.668  0.246  0.613  0.215
node2vec-min   0.570  0.187  0.535  0.186  0.682  0.257  0.576  0.201
DHNE           0.959  0.836  0.974  0.878  0.952  0.873  0.989  0.953
Hyper-SAGNN-E  0.971  0.877  0.991  0.952  0.977  0.916  0.989  0.950
Hyper-SAGNN-W  0.976  0.857  0.998  0.986  0.988  0.945  0.994  0.956
As shown in Table 2, Hyper-SAGNN captures the network structure better than DHNE on all datasets, with either the encoder based approach or the random walk based approach.
We further assessed the performance of Hyper-SAGNN on the hyperedge prediction task. We randomly split the hyperedge set into training and testing sets with a ratio of 4:1. Negative samples were generated in the same way as in the network reconstruction task. As shown in Table 3, our model again achieves significant improvement over DHNE in predicting unseen hyperedges. The most significant improvement is on the wordnet dataset, an absolute increase of about 24.6% in the AUPR score. For the network reconstruction and hyperedge prediction tasks, the difference between the random walk based and the encoder based Hyper-SAGNN is minor.
In addition to tasks related to the prediction of hyperedges, we also evaluated whether the learned node embeddings are effective for node classification. A multi-label classification experiment and a multi-class classification experiment were carried out on the MovieLens dataset and the wordnet dataset, respectively. We used logistic regression as the classifier. The proportion of training data was chosen from 10% to 90% for the MovieLens dataset and from 1% to 10% for the wordnet dataset. We used averaged Micro-F1 and Macro-F1 to evaluate the performance. The results are in Fig. 4. We observed that Hyper-SAGNN consistently achieves both higher Micro-F1 and Macro-F1 scores than DHNE for different fractions of training data. Also, Hyper-SAGNN based on the random walk approach generally achieves the best performance (Hyper-SAGNN-W in Fig. 4).
4.4 Performance on Non-uniform Hypergraphs
Next, we evaluated Hyper-SAGNN on non-uniform heterogeneous hypergraphs. For each of the above four datasets, we decomposed each hyperedge into 3 pairwise edges and added them to the existing graph. We trained our model to predict both the hyperedges and the edges (i.e., non-hyperedges). We then evaluated link prediction performance for both the hyperedges and the edges, and performed the node classification task following the same setting as above. The results for link prediction are in Table 3; Fig. 4 shows the results for the node classification task.
                     GPS           MovieLens     drug          wordnet
                     AUC    AUPR   AUC    AUPR   AUC    AUPR   AUC    AUPR
node2vec-mean        0.563  0.191  0.562  0.197  0.670  0.246  0.608  0.213
node2vec-min         0.570  0.185  0.539  0.186  0.684  0.258  0.575  0.200
DHNE                 0.910  0.668  0.877  0.668  0.925  0.859  0.816  0.459
Hyper-SAGNN-E        0.952  0.798  0.926  0.793  0.961  0.895  0.890  0.705
Hyper-SAGNN-W        0.922  0.722  0.930  0.810  0.955  0.892  0.880  0.706
Hyper-SAGNN-E (mix)  0.950  0.795  0.928  0.799  0.956  0.887  0.881  0.694
Hyper-SAGNN-W (mix)  0.920  0.720  0.929  0.811  0.950  0.889  0.884  0.684

                     GPS (2)       MovieLens (2) drug (2)      wordnet (2)
                     AUC    AUPR   AUC    AUPR   AUC    AUPR   AUC    AUPR
Hyper-SAGNN-E (mix)  0.921  0.899  0.971  0.967  0.981  0.973  0.891  0.897
Hyper-SAGNN-W (mix)  0.931  0.910  0.999  0.999  0.999  0.999  0.923  0.916
We observed that Hyper-SAGNN preserves the graph structure at different levels. Compared to training the model with hyperedges only, including the edges in training does not cause obvious changes in hyperedge prediction performance (about a 1% fluctuation in AUC/AUPR).
We then further assessed the model in a new evaluation setting where there are adequate edges but only a few hyperedges present, asking whether the model can still achieve good hyperedge prediction performance. This scenario is plausible in real-world applications, especially when the dataset is combined from different sources. For example, in the drug dataset, it is possible that, in addition to the (user, drug, reaction) hyperedges, there are extra edges from other sources, e.g., (drug, reaction) edges from a drug database and (user, drug) and (user, reaction) edges from medical records. Here, for each dataset we tested, we used about 50% of the edges and only 5% of the hyperedges in the network to train the model. The results are in Fig. 5.
When using only the edges to train the model, our method still achieves higher AUROC and AUPR scores for hyperedge prediction than node2vec (Table 3). We found that when the model is trained with both the down-sampled hyperedge set and the edge set, it reaches higher performance or suffers less from overfitting than when trained with either set individually. This demonstrates that our model can capture consensus information on the graph structure across different hyperedge sizes.
4.5 Outsider Identification
In addition to standard link prediction and node classification, we further formulated a new task called "outsider identification". Previous methods such as DHNE can answer the question of whether a specific tuple of nodes forms a hyperedge. However, in many settings, we might also want to know why a group of nodes will not form a hyperedge. We first define the outsider of a group of nodes as follows. Node u is the outsider of the node group (v_{i_1}, \dots, v_{i_k}, u) if it satisfies:

(v_{i_1}, \dots, v_{i_k}, u) \notin E   (14)

(v_{i_1}, \dots, v_{i_k}) \in E   (15)
We speculated that Hyper-SAGNN can answer this question by analyzing the probability scores p_{i_1} to p_{i_k} (defined in Eqn. 6). We assume that the node with the smallest probability score would be the outsider. We set up the evaluation as follows. We first trained the model as usual, but at the final stage we replaced the average pooling layer with a min pooling layer and fine-tuned the model for several epochs. We then fed generated triplets with a known outsider node into the trained model and calculated how often the outsider node matches the node with the smallest probability. Because this task builds on the prediction of hyperedges, we only tested on the dataset with the best hyperedge prediction performance, i.e., the drug dataset. We found that the outsider receives the smallest probability 81.9% of the time and one of the two smallest probabilities 95.3% of the time. These results show that switching the pooling layer yields better outsider identification accuracy (from 78.5% to 81.9%) at the cost of slightly decreased hyperedge prediction performance (AUC from 0.955 to 0.935). This demonstrates that our model can accurately identify the outsider within a group even without additional labeled information. Moreover, the performance of outsider identification can be further improved by including in the loss term, for all applicable triplets, the cross-entropy between each node's probability score and the label of whether that node is an outsider. Together, these results demonstrate the advantage of Hyper-SAGNN in terms of the interpretability of hyperedge prediction.
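The prediction rule described here, given the per-node probability scores from the model, reduces to an argmin; a sketch with made-up scores (the node names and numbers below are illustrative):

```python
# Sketch of outsider identification: the predicted outsider is the node with
# the smallest per-node probability score p_i produced by the model.
def predict_outsiders(node_probs, top=1):
    """Return the `top` nodes with the smallest probability scores."""
    return [v for v, _ in sorted(node_probs.items(), key=lambda kv: kv[1])[:top]]

probs = {"user": 0.91, "drug": 0.88, "reaction": 0.12}  # illustrative model output
```

With `top=2`, the rule corresponds to the top-2 accuracy reported above.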
4.6 Application to Singlecell HiC Datasets
We next applied HyperSAGNN to the recently produced singlecell HiC (scHiC) datasets (ramani2017massively; nagano2017cell). Genomewide mapping of chromatin interactions by HiC (lieberman2009comprehensive; rao20143d) has enabled comprehensive characterization of the 3D genome organization that reveals patterns of chromatin interactions between genomic loci. However, unlike bulk HiC data where signals are aggregated from cell populations, scHiC provides unique information about chromatin interactions at singlecell resolution, thus allowing us to ascertain celltocell variation of the 3D genome organization. Specifically, we propose that scHiC makes it possible to model the celltocell variation of chromatin interaction as a hyperedge, i.e., (cell, genomic locus, genomic locus). For the analysis of scHiC, the most common strategy would be revealing the celltocell variation by embedding the cells based on the contact matrix and then applying the clustering algorithms such as K
means clustering or hierarchical clustering on the embedded vectors. We performed the following evaluation to assess the effectiveness of HyperSAGNN on learning the embeddings of cells by representing the scHiC data as hypergraphs.
We tested Hyper-SAGNN on two datasets. The first consists of scHi-C data from four human cell lines: HAP1, GM12878, K562, and HeLa (ramani2017massively). The second includes scHi-C data that capture the cell cycle of mouse embryonic stem cells (nagano2017cell). For brevity, we refer to the first dataset as "Ramani et al. data" and the second as "Nagano et al. data".
We trained Hyper-SAGNN on the corresponding datasets. Due to the large average degrees of the cell nodes, the random walk approach would take an extensive amount of time to sample the walks, so we only applied the encoder version of our method. We visualize the learned embeddings by reducing them to 2 dimensions with PCA and UMAP (mcinnes2018umap) (Fig. 6A-D).
We quantified the effectiveness of the embeddings by applying K-means clustering to the Ramani et al. data and evaluating with the Adjusted Rand Index (ARI). In addition, we assessed the effectiveness of the embeddings in a supervised scenario: we used logistic regression as the classifier with 10% of the cells as training samples and evaluated the multi-class classification task with Micro-F1 and Macro-F1. We did not run K-means clustering on the Nagano et al. data, as it represents a continuous cell-cycle state that is not suitable for a clustering task. We instead used the ACROC (Average Circular ROC) metric developed in the HiCRep/MDS paper (liu2018unsupervised) to evaluate the performance of the three methods on the Nagano et al. data. We compared the performance with two recently developed computational methods based on dimensionality reduction of the contact matrix: HiCRep/MDS (liu2018unsupervised) and scHiCluster (zhou2019robust). Because Hyper-SAGNN is not a deterministic method for generating embeddings for scHi-C, we repeated the training process 5 times and averaged the scores. All these results are in Fig. 6E.
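As a reference for the clustering metric, the ARI can be computed from the pair-counting contingency table; a pure-Python sketch (the cell labels below are illustrative, not data from the study):

```python
import math
from collections import Counter

# Sketch of the Adjusted Rand Index used to score a clustering of cell
# embeddings against known cell-type labels.
def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(math.comb(c, 2) for c in pairs.values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

true = ["HAP1", "HAP1", "K562", "K562"]  # illustrative cell-type labels
pred = [0, 0, 1, 1]                      # a perfect clustering gives ARI = 1.0
```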
For the Ramani et al. data (Fig. 6A-B), the visualization of the embedding vectors learned by Hyper-SAGNN exhibits clear patterns in which cells of the same cell type cluster together. Moreover, the cell lines HAP1, GM12878, and K562 are all blood-related and are thus likely to be more similar to each other in terms of 3D genome organization than to HeLa. Indeed, we observed that they are also closer to each other in the embedding space. The quantitative results in Fig. 6E are consistent with the visualization, as our method achieves the highest ARI, Micro-F1, and Macro-F1 scores among the three methods. For the Nagano et al. data, as shown in Fig. 6C-D, the embeddings exhibit a circular pattern that corresponds to the cell cycle, and both HiCRep/MDS and Hyper-SAGNN achieve high ACROC scores. All these results demonstrate the effectiveness of representing scHi-C datasets as hypergraphs using Hyper-SAGNN, which has great potential to provide insights into the cell-to-cell variation of higher-order genome organization.
5 Conclusion
In this work, we have developed a new graph neural network model called Hyper-SAGNN for the representation learning of general hypergraphs. The framework can flexibly handle homogeneous and heterogeneous, uniform and non-uniform hypergraphs. We demonstrated that Hyper-SAGNN improves on or matches state-of-the-art performance for hypergraph representation learning while addressing the shortcomings of prior methods, such as the inability to predict hyperedges for non-uniform heterogeneous hypergraphs. Hyper-SAGNN is computationally efficient, as the size of the input to the graph attention layer is bounded by the maximum hyperedge size rather than by the number of first-order neighbors.
One potential improvement of Hyper-SAGNN as future work would be to allow information aggregation over all first-order neighbors before calculating the static/dynamic embeddings of a node, at additional computational cost. With this design, the static embedding of a node would still satisfy our constraint that it is fixed for a known hypergraph across varying input tuples. This would allow us to incorporate previously developed methods for graphs, such as GraphSAGE (hamilton2017inductive) and GCN (kipf2016semi), as well as methods designed for hypergraphs like HyperGCN (yadati2018hypergcn), into this framework for better link prediction performance. Such an improvement may also extend the application of Hyper-SAGNN to semi-supervised learning.
Acknowledgment
J.M. acknowledges support from the National Institutes of Health Common Fund 4D Nucleome Program grant U54DK107965, National Institutes of Health grant R01HG007352, and National Science Foundation grant 1717205. Y.Z. (Yao Class, IIIS, Tsinghua University) contributed to this work as a visiting undergraduate student at Carnegie Mellon University during summer 2019.
References
Appendix A Appendix
A.1 Comparison of Hyper-SAGNN with Its Variants
As mentioned above, unlike the standard GAT model, we exclude the l = j term in the self-attention mechanism. To test whether this constraint improves or reduces the model's ability to learn, we implemented a variant of our model (referred to as variant type I) that includes this term. Also, as mentioned in the Method section, another potential variant of our model directly uses the dynamic embeddings to calculate the probability scores; we refer to this variant as variant type II. For variant type II on the node classification task, since it does not have static embeddings, we used the dynamic embeddings instead. The rest of the parameters and the structure of the neural network remain the same.
We then compared the performance of Hyper-SAGNN and the two variants in terms of AUC and AUPR values on the network reconstruction and hyperedge link prediction tasks on the following four datasets: MovieLens, wordnet, drug, and GPS. We also compared performance in terms of Micro-F1 and Macro-F1 scores on the node classification task on the MovieLens and wordnet datasets. For the MovieLens dataset, we used 90% of the nodes as training data, while for wordnet we used 1% of the nodes. All the evaluation setups are the same as described in the main text. To avoid the effect of randomness in the neural network training, we repeated the training process for each experiment five times and plotted the score versus the epoch number. To illustrate the differences more clearly, we started the plots at epoch 3 for the random walk based approach and epoch 12 for the encoder based approach. The performance of the model using the random walk based approach is shown in Fig. A1 to Fig. A4; the performance of the model using the encoder based approach is shown in Fig. A5 to Fig. A8.
For models with the random walk based approach, Hyper-SAGNN is the best in terms of all metrics on the GPS, MovieLens, and wordnet datasets. On the drug dataset, Hyper-SAGNN achieves higher AUROC and AUPR scores on the network reconstruction task than the two variants, but a slightly lower AUROC score on the link prediction task (by less than 0.5%).
For models with the encoder based approach, the advantage is less obvious. All three models achieve similar performance in terms of all metrics on the GPS and drug datasets. On the MovieLens and wordnet datasets, Hyper-SAGNN performs similarly to variant type I and better than variant type II on the network reconstruction and link prediction tasks. However, our model achieves slightly higher accuracy than variant type I on the node classification task.
Therefore, these evaluations demonstrate that the chosen structure of Hyper-SAGNN achieves higher, or at least comparable, performance compared to the two potential variants over multiple tasks on multiple datasets.