HyperGCN: Hypergraph Convolutional Networks for Semi-Supervised Classification

Naganand Yadati et al., 09/07/2018

Graph-based semi-supervised learning (SSL) is an important learning problem where the goal is to assign labels to initially unlabeled nodes in a graph. Graph Convolutional Networks (GCNs) have recently been shown to be effective for graph-based SSL problems. GCNs inherently assume existence of pairwise relationships in the graph-structured data. However, in many real-world problems, relationships go beyond pairwise connections and hence are more complex. Hypergraphs provide a natural modeling tool to capture such complex relationships. In this work, we explore the use of GCNs for hypergraph-based SSL. In particular, we propose HyperGCN, an SSL method which uses a layer-wise propagation rule for convolutional neural networks operating directly on hypergraphs. To the best of our knowledge, this is the first principled adaptation of GCNs to hypergraphs. HyperGCN is able to encode both the hypergraph structure and hypernode features in an effective manner. Through detailed experimentation, we demonstrate HyperGCN's effectiveness at hypergraph-based SSL.


1 Introduction

In many real-world network datasets such as co-authorship, co-citation, email communication, etc., relationships are complex and go beyond pairwise associations. Hypergraphs provide a flexible and natural framework to model such complex relationships. For example, in a co-authorship network an author (hyperedge) can be a co-author of more than two documents (vertices).

The obvious existence of such complex relationships in many real-world networks naturally motivates the problem of learning with hypergraphs Zhou et al. (2007); Hein et al. (2013); Zhang et al. (2017); Feng et al. (2019). A popular learning paradigm is graph-based / hypergraph-based semi-supervised learning (SSL), where the goal is to assign labels to initially unlabelled vertices in a graph / hypergraph Chapelle et al. (2010); Zhu et al. (2009); Subramanya and Talukdar (2014). While many techniques have used explicit Laplacian regularisation in the objective Zhou et al. (2003); Zhu et al. (2003); Chapelle et al. (2003); Weston et al. (2008), the state-of-the-art neural methods encode the graph / hypergraph structure implicitly via a neural network Kipf and Welling (2017); Atwood and Towsley (2016); Feng et al. (2019) (the data matrix $X$ contains the initial features on the vertices, for example, text attributes for documents).

While explicit Laplacian regularisation assumes similarity among vertices in each edge / hyperedge, the implicit regularisation of graph convolutional networks (GCNs) Kipf and Welling (2017) avoids this restriction and enables application to a broader range of problems in combinatorial optimisation Gong et al. (2019); Lemos et al. (2019); Prates et al. (2019); Li et al. (2018c), computer vision Chen et al. (2019); Norcliffe-Brown et al. (2018); Wang et al. (2018), natural language processing Vashishth et al. (2019a); Yao et al. (2019); Marcheggiani and Titov (2017), etc. In this work, we propose HyperGCN, a novel training scheme for a GCN on hypergraphs, and show its effectiveness not only in SSL, where hyperedges encode similarity, but also in combinatorial optimisation, where hyperedges do not encode similarity. Combinatorial optimisation on hypergraphs has recently been highlighted as crucial for real-world network analysis Amburg et al. (2019); Nguyen et al. (2019).

Methodologically, HyperGCN approximates each hyperedge of the hypergraph by a set of pairwise edges connecting the vertices of the hyperedge, and treats the learning problem as a graph learning problem on this approximation. While the state-of-the-art hypergraph neural network (HGNN) Feng et al. (2019) approximates each hyperedge by a clique, and hence requires a quadratic number of edges for each hyperedge, our method, HyperGCN, requires only a linear number of edges per hyperedge. The advantage of this linear approximation is evident in Table 1, where a faster variant of our method has lower training time on synthetic data (with higher density as well) for the densest $k$-subhypergraph problem and for SSL on real-world hypergraphs (DBLP and Pubmed). In summary, we make the following contributions:

Model | Density (synthetic) | DBLP | Pubmed
HGNN
FastHyperGCN
Table 1: Average training time of an epoch, in seconds (lower is better).

  • We propose HyperGCN, a novel method of training a graph convolutional network (GCN) on hypergraphs using existing tools from spectral theory of hypergraphs (Section 4).

  • We apply HyperGCN to the problems of SSL on attributed hypergraphs and combinatorial optimisation. Through detailed experimentation, we demonstrate its effectiveness compared to the state-of-the-art HGNN Feng et al. (2019) and other baselines (Sections 5 and 7).

  • We thoroughly discuss when we prefer HyperGCN to HGNN (Sections 6 and 8).

While the motivation of HyperGCN is based on similarity of vertices in a hyperedge, we show that it can be used effectively for combinatorial optimisation where hyperedges do not encode similarity.

2 Related work

In this section, we discuss related work; the background is covered in the next section.

Deep learning on graphs: Geometric deep learning Bronstein et al. (2017) is an umbrella term for emerging techniques that attempt to generalise (structured) deep neural network models to non-Euclidean domains such as graphs and manifolds. The graph convolutional network (GCN) Kipf and Welling (2017) defines the convolution using a simple linear function of the graph Laplacian and has been shown to be effective for semi-supervised classification on attributed graphs. The reader is referred to a comprehensive literature review Bronstein et al. (2017) and extensive surveys Hamilton et al. (2017); Battaglia et al. (2018); Zhang et al. (2018); Sun et al. (2018); Wu et al. (2019) on this topic of deep learning on graphs.

Learning on hypergraphs: The clique expansion of a hypergraph was introduced in a seminal work Zhou et al. (2007) and has become popular Agarwal et al. (2006); Satchidanand et al. (2015); Feng et al. (2018). Hypergraph neural networks Feng et al. (2019) use the clique expansion to extend GCNs to hypergraphs. Another line of work uses mathematically appealing tensor methods Shashua et al. (2006); Bulò and Pelillo (2009); Kolda and Bader (2009), but they are limited to uniform hypergraphs. Recent developments, however, work for arbitrary hypergraphs and fully exploit the hypergraph structure Hein et al. (2013); Zhang et al. (2017); Chan and Liang (2018); Li and Milenkovic (2018b); Chien et al. (2019).

Graph-based SSL: Researchers have shown that using unlabelled data in training can improve learning accuracy significantly. The topic is popular enough to have inspired influential books Chapelle et al. (2010); Zhu et al. (2009); Subramanya and Talukdar (2014).

Graph neural networks for combinatorial optimisation: Graph-based deep models have recently been shown to be effective as learning-based approaches for NP-hard problems such as maximal independent set, minimum vertex cover, etc. Li et al. (2018c), the decision version of the traveling salesman problem Prates et al. (2019), graph colouring Lemos et al. (2019), and clique optimisation Gong et al. (2019).

3 Background: Graph convolutional network

Let $G = (\mathcal{V}, \mathcal{E})$, with $|\mathcal{V}| = n$, be a simple undirected graph with adjacency matrix $A$, and data matrix $X \in \mathbb{R}^{n \times p}$, which has $p$-dimensional real-valued vector representations for each node $v \in \mathcal{V}$.

The basic formulation of graph convolution Kipf and Welling (2017) stems from the convolution theorem Mallat (1999), and it can be shown that the convolution of a real-valued graph signal $S \in \mathbb{R}^n$ and a filter signal is approximately $w_0 S + w_1 \tilde{L} S$, where $w_0$ and $w_1$ are learned weights and $\tilde{L} = \frac{2L}{\lambda_{max}} - I$ is the scaled graph Laplacian; $\lambda_{max}$ is the largest eigenvalue of the symmetrically-normalised graph Laplacian $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$, where $D$ is the diagonal degree matrix with elements $D_{ii} = \sum_j A_{ij}$. The filter depends on the structure of the graph (the graph Laplacian $L$). The detailed derivation from the convolution theorem uses existing tools from graph signal processing Shuman et al. (2013); Hammond et al. (2011); Bronstein et al. (2017) and is provided in the supplementary material. The key point here is that the convolution of two graph signals is a linear function of the graph Laplacian $L$.

Graph | Hypergraph
an undirected simple graph | an undirected hypergraph
set of nodes | set of hypernodes
set of edges | set of hyperedges
number of nodes | number of hypernodes
graph Laplacian | hypergraph Laplacian
graph adjacency matrix | hypergraph incidence matrix
Table 2: Summary of symbols used in the paper.

The graph convolution for the $p$ different graph signals contained in the data matrix $X \in \mathbb{R}^{n \times p}$, with learned weights $\Theta \in \mathbb{R}^{p \times h}$ for $h$ hidden units, is $\bar{A} X \Theta$, where $\bar{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}$ is the diagonal degree matrix of $A + I$. The proof involves a renormalisation trick Kipf and Welling (2017) and is in the supplementary.

GCN Kipf and Welling (2017)

The forward model for a simple two-layer GCN takes the following simple form:

$Z = \mathrm{softmax}\big(\bar{A}\,\mathrm{ReLU}(\bar{A} X \Theta^{(1)})\,\Theta^{(2)}\big) \qquad (1)$

where $\Theta^{(1)} \in \mathbb{R}^{p \times h}$ is an input-to-hidden weight matrix for a hidden layer with $h$ hidden units and $\Theta^{(2)} \in \mathbb{R}^{h \times q}$ is a hidden-to-output weight matrix. The softmax activation function, defined as $\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$, is applied row-wise.

GCN training for SSL: For multi-class classification with $q$ classes, we minimise the cross-entropy error

$\mathcal{L} = -\sum_{v \in \mathcal{V}_L} \sum_{c=1}^{q} Y_{vc} \ln Z_{vc} \qquad (2)$

over the set of labelled examples $\mathcal{V}_L$, where $Y$ contains the one-hot training labels. The weights $\Theta^{(1)}$ and $\Theta^{(2)}$ are trained using gradient descent.
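To make the forward model and the loss concrete, here is a minimal NumPy sketch of Equations 1 and 2. The function and variable names are ours, and the simplifications (dense matrices, no dropout or weight decay) are assumptions for illustration, not details from the paper.

```python
import numpy as np

def normalise_adjacency(A):
    """Renormalisation trick: A_bar = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A_bar, X, W1, W2):
    """Two-layer GCN (Equation 1): Z = softmax(A_bar ReLU(A_bar X W1) W2), row-wise softmax."""
    H = np.maximum(A_bar @ X @ W1, 0.0)             # hidden layer with ReLU
    logits = A_bar @ H @ W2
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def masked_cross_entropy(Z, Y, labelled_idx):
    """Cross-entropy (Equation 2) summed over the labelled nodes only; Y is one-hot."""
    return -np.sum(Y[labelled_idx] * np.log(Z[labelled_idx] + 1e-12))
```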

A summary of the notations used throughout our work is shown in Table 2.

4 HyperGCN: Hypergraph Convolutional Network

We consider semi-supervised hypernode classification on an undirected hypergraph $H = (V, E)$ with $|V| = n$, and a small set $V_L \subseteq V$ of labelled hypernodes. Each hypernode $v \in V$ is also associated with a feature vector of dimension $p$ given by the data matrix $X \in \mathbb{R}^{n \times p}$. The task is to predict the labels of all the unlabelled hypernodes, that is, all the hypernodes in the set $V \setminus V_L$.

Overview: The crucial working principle here is that the hypernodes in the same hyperedge are similar and hence are likely to share the same label Zhang et al. (2017). Suppose we use $h_v$ to denote some representation of hypernode $v$; then, for any hyperedge $e \in E$, the quantity $\max_{i, j \in e} \|h_i - h_j\|^2$ will be "small" only if the vectors corresponding to the hypernodes in $e$ are "close" to each other. Therefore, $\sum_{e \in E} \max_{i, j \in e} \|h_i - h_j\|^2$ as a regulariser is likely to achieve the objective of the hypernodes in the same hyperedge having similar representations. However, instead of using it as an explicit regulariser, we can achieve the same goal by using a GCN over an appropriately defined Laplacian of the hypergraph. In other words, we use the notion of hypergraph Laplacian as an implicit regulariser which achieves this objective.

A hypergraph Laplacian with the same underlying motivation as stated above was proposed in prior works Chan et al. (2018); Louis (2015). We present this Laplacian first, and then run a GCN over the simple graph associated with it. We call the resulting method 1-HyperGCN (as each hyperedge is approximated by exactly one pairwise edge). One epoch of 1-HyperGCN is shown in Figure 1.

Figure 1: Graph convolution on a hypernode using HyperGCN.

4.1 Hypergraph Laplacian

As explained before, the key element for a GCN is the graph Laplacian of the given graph. Thus, in order to develop a GCN-based SSL method for hypergraphs, we first need to define a Laplacian for hypergraphs. One such Laplacian Chan et al. (2018) (see also Louis (2015)) is a non-linear function $\mathbb{L}: \mathbb{R}^n \rightarrow \mathbb{R}^n$ (the Laplacian matrix for graphs can be viewed as a linear function $S \mapsto L S$).

Definition 1 (Hypergraph Laplacian Chan et al. (2018); Louis (2015))¹

¹The problem of breaking ties in choosing $i_e$ (resp. $j_e$) is non-trivial, as shown in Chan et al. (2018). Breaking ties randomly was proposed in Louis (2015), but Chan et al. (2018) showed that this might not work for all applications (see Chan et al. (2018) for more details). Chan et al. (2018) gave a way to break ties, along with a proof of correctness of their tie-breaking rule for the problems they studied. We chose to break ties randomly because of its simplicity and efficiency.

Given a real-valued signal $S \in \mathbb{R}^n$ defined on the hypernodes, $\mathbb{L}(S)$ is computed as follows.

  1. For each hyperedge $e \in E$, let $(i_e, j_e) := \arg\max_{i, j \in e} |S_i - S_j|$, breaking ties randomly¹.

  2. A weighted graph $G_S$ on the vertex set $V$ is constructed by adding the edges $\{\{i_e, j_e\} : e \in E\}$ with weights $w(e)$ to $G_S$, where $w(e)$ is the weight of the hyperedge $e$. Next, self-loops are added to each vertex $v$ such that its degree in $G_S$ equals its degree in the hypergraph. Let $A_S$ denote the weighted adjacency matrix of the graph $G_S$.

  3. The symmetrically normalised hypergraph Laplacian is $\mathbb{L}(S) := \big(I - D^{-\frac{1}{2}} A_S D^{-\frac{1}{2}}\big) S$, where $D$ is the diagonal degree matrix of $G_S$ (an illustrative code sketch of this construction follows below).
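As a rough illustration of this construction, the Python sketch below builds the weighted graph of Definition 1 from a list of hyperedges and a 1-D signal. The function names are ours, ties are broken by iteration order rather than randomly, hyperedge weights are assumed to be 1, and the self-loop adjustment from step 2 is omitted for brevity.

```python
import numpy as np

def one_edge_expansion(hyperedges, S, n):
    """For each hyperedge e, keep only the pair (i_e, j_e) maximising |S_i - S_j|
    (ties broken by iteration order here, randomly in the paper)."""
    A = np.zeros((n, n))
    for e in hyperedges:
        e = list(e)
        i_e, j_e = max(((i, j) for i in e for j in e if i < j),
                       key=lambda pair: abs(S[pair[0]] - S[pair[1]]))
        A[i_e, j_e] += 1.0      # hyperedge weight w(e) assumed to be 1
        A[j_e, i_e] += 1.0
    return A

def normalised_hypergraph_laplacian(A_S):
    """Symmetrically normalised Laplacian I - D^{-1/2} A_S D^{-1/2}."""
    d = A_S.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(A_S.shape[0]) - d_inv_sqrt[:, None] * A_S * d_inv_sqrt[None, :]
```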

4.2 1-HyperGCN

By following the Laplacian construction steps outlined in Section 4.1, we end up with a simple graph $G_S$ with normalised adjacency matrix $\bar{A}_S$. We now perform GCN over this simple graph. The graph convolution operation in Equation (1), when applied to a hypernode $v$, in the neural message-passing framework Gilmer et al. (2017) is $h_v^{(\tau+1)} = \sigma\Big(\big(\Theta^{(\tau)}\big)^{T} \sum_{u \in \mathcal{N}(v)} \big(\bar{A}_S\big)_{v,u}\, h_u^{(\tau)}\Big)$. Here, $\tau$ is the epoch number, $h_v^{(\tau+1)}$ is the new hidden layer representation of node $v$, $\sigma$ is a non-linear activation function, $\Theta^{(\tau)}$ is a matrix of learned weights, $\mathcal{N}(v)$ is the set of neighbours of $v$, $(\bar{A}_S)_{v,u}$ is the weight on the edge $\{v, u\}$ after normalisation, and $h_u^{(\tau)}$ is the previous hidden layer representation of the neighbour $u$. We note that, along with the embeddings of the hypernodes, the adjacency matrix is also re-estimated in each epoch.

Figure 1 shows a hypernode $v$ with five hyperedges incident on it. We consider exactly one representative simple edge $\{i_e, j_e\}$ for each hyperedge $e$, chosen in the current epoch as in Section 4.1. Because of this, the hypernode $v$ may not be a part of all the representative simple edges (only three are shown in the figure). We then apply the traditional graph convolution operation to $v$, considering only the simple edges incident on it. Note that we apply the operation to each hypernode in each epoch of training, until convergence.

Connection to total variation on hypergraphs: Our 1-HyperGCN model can be seen as performing implicit regularisation based on the total variation on hypergraphs Hein et al. (2013). That prior work uses explicit regularisation and only the hypergraph structure for hypernode classification in the SSL setting. HyperGCN, on the other hand, can use both the hypergraph structure and any available features on the hypernodes, e.g., text attributes for documents.

Figure 2: Hypergraph Laplacian Chan et al. (2018) vs. the generalised hypergraph Laplacian with mediators Chan and Liang (2018). Our approach requires at most a linear number of edges per hyperedge ($1$ and $2|e| - 3$ respectively), while HGNN Feng et al. (2019) requires a quadratic number of edges for each hyperedge.

4.3 HyperGCN: Enhancing 1-HyperGCN with mediators

One peculiar aspect of the hypergraph Laplacian discussed above is that each hyperedge $e$ is represented by a single pairwise simple edge $\{i_e, j_e\}$ (with this simple edge potentially changing from epoch to epoch). This hypergraph Laplacian ignores the hypernodes in $e \setminus \{i_e, j_e\}$ in the given epoch. Recently, it has been shown that a generalised hypergraph Laplacian in which the hypernodes in $e \setminus \{i_e, j_e\}$ act as "mediators" Chan and Liang (2018) satisfies all the properties satisfied by the above Laplacian of Chan et al. (2018). The two Laplacians are compared pictorially in Figure 2. Note that if the hyperedge is of size $2$, there are no mediators and we simply connect $i_e$ and $j_e$ with an edge. We also run a GCN on the simple graph associated with the hypergraph Laplacian with mediators Chan and Liang (2018) (right in Figure 2). It has been suggested that the weights on the edges for each hyperedge in the hypergraph Laplacian with mediators should sum to $1$ Chan and Liang (2018). We chose each weight to be $\frac{1}{2|e| - 3}$, as there are $2|e| - 3$ edges for a hyperedge $e$.
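The following is a hedged sketch of the mediator expansion described above: for each hyperedge we connect $(i_e, j_e)$ and attach every mediator to both of them, giving each of the $2|e| - 3$ edges weight $1 / (2|e| - 3)$. The function name and the use of a 1-D signal for the argmax are our illustrative choices.

```python
import numpy as np

def mediator_expansion(hyperedges, S, n):
    """For each hyperedge e: connect (i_e, j_e), the pair maximising |S_i - S_j|,
    and connect every remaining hypernode (a "mediator") to both i_e and j_e.
    Each of the resulting 2|e| - 3 edges gets weight 1 / (2|e| - 3)."""
    A = np.zeros((n, n))
    for e in hyperedges:
        e = list(e)
        i_e, j_e = max(((i, j) for i in e for j in e if i < j),
                       key=lambda pair: abs(S[pair[0]] - S[pair[1]]))
        edges = [(i_e, j_e)]
        for k in e:
            if k not in (i_e, j_e):          # k acts as a mediator
                edges.append((i_e, k))
                edges.append((j_e, k))
        w = 1.0 / len(edges)                 # = 1 / (2|e| - 3) whenever |e| >= 2
        for u, v in edges:
            A[u, v] += w
            A[v, u] += w
    return A
```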

4.4 FastHyperGCN

We use just the initial features $X$ (without the learned weights) to construct the hypergraph Laplacian matrix (with mediators), and we call this method FastHyperGCN. Because the matrix is computed only once before training (and not in each epoch), the training time of FastHyperGCN is much lower than that of the other methods. We have provided the algorithms for all three methods in the supplementary.

5 Experiments for semi-supervised learning

We conducted experiments not only on real-world datasets but also on categorical data (results in the supplementary), which is standard practice in hypergraph-based learning Zhou et al. (2007); Hein et al. (2013); Zhang et al. (2017); Li and Milenkovic (2018b, a); Li et al. (2018a).

5.1 Baselines

We compared HyperGCN, 1-HyperGCN, and FastHyperGCN against the following baselines:

  • Hypergraph neural networks (HGNN) Feng et al. (2019) uses the clique expansion Zhou et al. (2007); Agarwal et al. (2006) to approximate the hypergraph. Each hyperedge of size $s$ is approximated by an $s$-clique.

  • Multi-layer perceptron (MLP) treats each instance (hypernode) as an independent and identically distributed (i.i.d.) instance; in other words, $\bar{A} = I$ in Equation 1. We note that this baseline does not use the hypergraph structure to make predictions.

  • Multi-layer perceptron + explicit hypergraph Laplacian regularisation (MLP + HLR): regularises the MLP by training it with the cross-entropy loss plus an explicit Laplacian regularisation term, using the hypergraph Laplacian with mediators as the regulariser. We used a portion of the test set used for all the above models to select an optimal regularisation weight for this baseline.

  • Confidence Interval-based method (CI) Zhang et al. (2017): a subgradient-based method. We note that this method has consistently been shown to be superior to the primal-dual hybrid gradient (PDHG) method of Hein et al. (2013) and to Zhou et al. (2007). Hence, we did not use these earlier methods as baselines and directly compared HyperGCN against CI.

The task for each dataset is to predict the topic to which a document belongs (multi-class classification). Statistics are summarised in Table 3. For more details about the datasets, please refer to the supplementary. We trained all methods for the same number of epochs and used the same hyperparameters as a prior work Kipf and Welling (2017). We report the mean test error and standard deviation over multiple train-test splits. We sampled sets of the same size of labelled hypernodes from each class to obtain a balanced train split.

DBLP Pubmed Cora Cora Citeseer
(co-authorship) (co-citation) (co-authorship) (co-citation) (co-citation)
# hypernodes,
# hyperedges,
avg.hyperedge size
# features,
# classes,
label rate,
Table 3: Real-world hypergraph datasets used in our work. The distribution of hyperedge sizes is not symmetric about the mean and has a strong positive skew.

Data Method DBLP Pubmed Cora Cora Citeseer
co-authorship co-citation co-authorship co-citation co-citation
CI
MLP
MLP + HLR
HGNN
1-HyperGCN
FastHyperGCN
HyperGCN
Table 4: Results of SSL experiments. We report mean test error ± standard deviation (lower is better) over train-test splits. Please refer to Section 5 for details.

6 Analysis of results

The results on real-world datasets are shown in Table 4. We now attempt to explain them.

Proposition 1:

Given a hypergraph $(V, E)$ with $n = |V|$ and signals $S$ on the vertices, let, for each hyperedge $e \in E$, $(i_e, j_e) := \arg\max_{i, j \in e} |S_i - S_j|$, and let the remaining hypernodes of $e$ be its mediators. Define $G_C$ and $G_M$ to be, respectively, the normalised clique expansion (i.e., the graph of HGNN) and the normalised mediator expansion (i.e., the graph of HyperGCN / FastHyperGCN). A sufficient condition for $G_C = G_M$ is $\max_{e \in E} |e| \le 3$.

Method sDBLP
HGNN
FastHyperGCN
HyperGCN
Table 5: Results (lower is better) on synthetic data and a subset of DBLP, showing that our methods are more effective for noisy hyperedges. The noise ratio is the number of hypernodes of one class divided by that of the other within noisy hyperedges. The best result is in bold and the second best is underlined. Please see Section 6.

Proof:

Observe that we consider hypergraphs in which the size of each hyperedge is at least $2$. It follows from the definitions that the clique expansion uses $\frac{|e|(|e|-1)}{2}$ edges and the mediator expansion uses $2|e| - 3$ edges for each hyperedge $e$. Clearly, a sufficient condition for $G_C = G_M$ is that each hyperedge is approximated by the same subgraph in both expansions. In other words, the condition is $\frac{|e|(|e|-1)}{2} = 2|e| - 3$ for each $e \in E$. Solving the resulting quadratic equation $|e|^2 - 5|e| + 6 = 0$ gives $|e| = 2$ or $|e| = 3$ for each $e \in E$.

Comparable performance on Cora and Citeseer co-citation: We note that HGNN is the most competitive baseline. Also for FastHyperGCN and for HyperGCN. The proposition states that the graphs of HGNN, FastHyperGCN, and HyperGCN are the same, irrespective of the signal values, whenever the maximum size of a hyperedge is $3$.

This explains why the three methods have comparable accuracies on the Cora co-citation and Citeseer co-citation hypergraphs. The mean hyperedge sizes are close to $3$ (with comparatively lower deviations), as shown in Table 3. Hence the graphs of the three methods are more or less the same.

Superior performance on Pubmed, DBLP, and Cora co-authorship

We see that HyperGCN performs statistically significantly better (the p-value of a Welch t-test is less than 0.0001) than HGNN on the other three datasets. We believe this is due to large noisy hyperedges in real-world hypergraphs. An author can write papers on different topics in a co-authorship network, and a paper typically cites papers on different topics in co-citation networks.

The average sizes in Table 3 show the presence of large hyperedges (note the large standard deviations). The clique expansion adds edges between all pairs and hence potentially connects a larger number of hypernode pairs with different labels than the mediator graph of Figure 2, thus accumulating more noise.

Preference of HyperGCN and FastHyperGCN over HGNN: To further illustrate superiority over HGNN on noisy hyperedges, we conducted experiments on synthetic hypergraphs, each consisting of a fixed number of hypernodes split across two classes and randomly sampled hyperedges. For each synthetic hypergraph, a subset of the hyperedges were "pure", i.e., all of their hypernodes were from the same class, while the remaining hyperedges contained hypernodes from both classes. The ratio of hypernodes of one class to the other within the noisy hyperedges was varied from less noisy to most noisy in fixed steps.

Table 5 shows the results on synthetic data. We initialise the hypernode features to random Gaussian vectors. We report mean error and deviation over different synthetically generated hypergraphs. As we can see in the table, for mostly pure hyperedges HGNN is the superior model; however, as the noise increases, our methods begin to outperform HGNN.

Subset of DBLP: We also trained all three models on a subset of DBLP (we call it sDBLP), obtained by removing all hyperedges of small sizes. The resulting hypergraph has fewer hyperedges with a larger average size. We report the mean error over different train-test splits in Table 5.

Conclusion: From the above analysis, we conclude that our proposed methods (HyperGCN and FastHyperGCN) should be preferred to HGNN for hypergraphs with large noisy hyperedges. This is also the case in the combinatorial optimisation experiments (Table 6), which we discuss next.

7 HyperGCN for combinatorial optimisation

Inspired by the recent successes of deep graph models as learning-based approaches for NP-hard problems Li et al. (2018c); Prates et al. (2019); Lemos et al. (2019); Gong et al. (2019), we have used HyperGCN as a learning-based approach for the densest $k$-subhypergraph problem Chlamtác et al. (2018). NP-hard problems on hypergraphs have recently been highlighted as crucial for real-world network analysis Amburg et al. (2019); Nguyen et al. (2019). Our problem is, given a hypergraph $H = (V, E)$, to find a subset $W \subseteq V$ of $k$ hypernodes so as to maximise the number of hyperedges contained in $W$, i.e., we wish to maximise the density of $W$, given by the number of hyperedges $e \in E$ with $e \subseteq W$.

A greedy heuristic for the problem is to select the $k$ hypernodes of maximum degree. We call this "MaxDegree". Another greedy heuristic is to iteratively remove, from the current (residual) hypergraph, a hypernode of minimum degree together with all the hyperedges containing it. We repeat the procedure $n - k$ times and consider the density of the remaining hypernodes. We call this "RemoveMinDegree". Minimal sketches of both heuristics are given below.
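The two greedy heuristics can be sketched as follows; the exact tie-breaking and data structures are our choices, not the paper's.

```python
def max_degree(hyperedges, n, k):
    """MaxDegree: pick the k hypernodes with the largest hypergraph degree."""
    degree = [0] * n
    for e in hyperedges:
        for v in e:
            degree[v] += 1
    return set(sorted(range(n), key=lambda v: -degree[v])[:k])

def remove_min_degree(hyperedges, n, k):
    """RemoveMinDegree: repeatedly delete a minimum-degree hypernode together with
    all hyperedges containing it, until only k hypernodes remain."""
    remaining = set(range(n))
    edges = [set(e) for e in hyperedges]
    while len(remaining) > k:
        degree = {v: 0 for v in remaining}
        for e in edges:
            for v in e:
                degree[v] += 1
        v_min = min(remaining, key=lambda v: degree[v])
        remaining.remove(v_min)
        edges = [e for e in edges if v_min not in e]
    return remaining

def density(hyperedges, subset):
    """Number of hyperedges fully contained in the chosen subset of hypernodes."""
    return sum(1 for e in hyperedges if set(e) <= set(subset))
```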

Dataset Synthetic DBLP Pubmed Cora Cora Citeseer
Approach test set co-authorship co-citation co-authorship co-citation co-citation
MaxDegree
RemoveMinDegree
 MLP
MLP + HLR
HGNN
1-HyperGCN
FastHyperGCN
HyperGCN
# hyperedges,
Table 6: Results on the densest $k$-subhypergraph problem. We report the density (higher is better) of the set of $k$ vertices obtained by each approach. See Section 7 for details.
    
(a) RemoveMinDegree
(b) HyperGCN
Figure 3: Green / pink hypernodes denote those the algorithm labels as positive / negative respectively.

Experiments: Table 6 shows the results. We trained all the learning-based models on a synthetically generated dataset. More details on the approach and the synthetic data are in the supplementary. As seen in Table 6, our proposed HyperGCN outperforms all the other approaches except on the Pubmed dataset, which contains a small number of vertices with large degrees and a large number of vertices with small degrees; the RemoveMinDegree baseline is able to recover all the hyperedges there.

Qualitative analysis: Figure 3 shows the visualisations given by RemoveMinDegree and HyperGCN on the Cora co-authorship hypergraph. We used Gephi’s Force Atlas to space out the vertices. In general, a cluster of nearby vertices has multiple hyperedges connecting them. Clusters of only green vertices indicate the method has likely included all vertices within the hyperedges induced by the cluster. The figure of HyperGCN has more dense green clusters than that of RemoveMinDegree.

8 Comparison of training time

We compared the average training time of an epoch of FastHyperGCN and HGNN in Table 1. Both were run on a GeForce GTX 1080 Ti GPU. We observe that FastHyperGCN is faster than HGNN because it uses a linear number of edges for each hyperedge, while HGNN uses a quadratic number. FastHyperGCN is also superior in terms of performance on hypergraphs with large noisy hyperedges.

9 Conclusion

We have proposed HyperGCN, a new method of training a GCN on hypergraphs using tools from the spectral theory of hypergraphs. We have shown HyperGCN's effectiveness in SSL and in combinatorial optimisation. Approaches that assign importance to nodes Veličković et al. (2018); Monti et al. (2018); Vashishth et al. (2019b) have improved results on SSL. HyperGCN may be augmented with such approaches for further improved performance.

Supplementary: Hypergraph convolutional network

10 Algorithms of our proposed methods

The forward propagation of a $2$-layer graph convolutional network (GCN) Kipf and Welling (2017) is $Z = \mathrm{softmax}\big(\bar{A}\,\mathrm{ReLU}(\bar{A} X \Theta^{(1)})\,\Theta^{(2)}\big)$, where $\bar{A} = \tilde{D}^{-\frac{1}{2}} (A + I) \tilde{D}^{-\frac{1}{2}}$ and $\tilde{D}$ is the diagonal degree matrix with elements $\tilde{D}_{ii} = \sum_j (A + I)_{ij}$. We provide algorithms for our three proposed methods:

  • HyperGCN - Algorithm 1

  • FastHyperGCN - Algorithm 2

  • 1-HyperGCN - Algorithm 3

Input: An attributed hypergraph, with attributes $X$, and a set of labelled hypernodes
      Output: All hypernodes in the unlabelled set labelled

1:for each epoch of training do
2:     for layer of the network do
3:         set For all hypernodes
4:         let be the parameters For the current epoch
5:         for  do
6:              hidden representation matrix of layer
7:              
8:              
9:              
10:              for  do
11:                  
12:                  
13:              end for
14:         end for
15:     end for
16:     
17:     update parameters to minimise cross entropy loss on the set of labelled hypernodes
18:end for
19:label the hypernodes in using
Algorithm 1 Algorithm for HyperGCN
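Since the listing above lost most of its symbols, here is a hedged PyTorch sketch of the overall training loop as we understand it: the mediator graph is re-estimated every epoch from a signal derived from the current parameters, and a two-layer GCN is run on it. The choice of signal, the optimiser, and all names are our assumptions; `mediator_expansion` refers to the sketch in Section 4.3.

```python
import torch
import torch.nn.functional as F

def sym_norm(A):
    """A_bar = D~^{-1/2} (A + I) D~^{-1/2} for a dense torch adjacency matrix."""
    A = A + torch.eye(A.shape[0], dtype=A.dtype)
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

def train_hypergcn(hyperedges, X, Y, labelled_idx, hidden=32, epochs=200, lr=0.01):
    """Hedged sketch of Algorithm 1. X: (n, p) float tensor of features;
    Y: (n,) long tensor of class indices; labelled_idx: indices of labelled hypernodes."""
    n, p = X.shape
    q = int(Y.max().item()) + 1
    W1 = torch.nn.Parameter(0.01 * torch.randn(p, hidden))
    W2 = torch.nn.Parameter(0.01 * torch.randn(hidden, q))
    opt = torch.optim.Adam([W1, W2], lr=lr)
    for _ in range(epochs):
        # 1-D signal used only to pick (i_e, j_e) and the mediators for each hyperedge
        signal = (X @ W1.detach()).sum(dim=1).numpy()
        A = torch.tensor(mediator_expansion(hyperedges, signal, n), dtype=X.dtype)
        A_bar = sym_norm(A)
        Z = A_bar @ torch.relu(A_bar @ X @ W1) @ W2     # two-layer GCN, Equation 1
        loss = F.cross_entropy(Z[labelled_idx], Y[labelled_idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return Z.argmax(dim=1)                              # predicted labels for all hypernodes
```

FastHyperGCN (Algorithm 2) differs only in that the mediator graph is built once from the initial features before the loop, and 1-HyperGCN (Algorithm 3) uses the single-edge expansion of Section 4.1 instead of the mediator expansion.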

Input: An attributed hypergraph, with attributes $X$, and a set of labelled hypernodes
      Output: All hypernodes in the unlabelled set labelled

set for all hypernodes
for  do
     
     
     for  do
         
         
     end for
end for
for each epoch of training do
     let be the parameters for the current epoch
     
     update parameters to minimise cross entropy loss on the set of labelled hypernodes
end for
label the hypernodes in using
Algorithm 2 Algorithm for FastHyperGCN

Input: An attributed hypergraph, with attributes $X$, and a set of labelled hypernodes
      Output: All hypernodes in the unlabelled set labelled

for each epoch of training do
     for layer of the network do
         set for all hypernodes
         let be the parameters for the current epoch
         for  do
               hidden representation matrix of layer
              
              
         end for
     end for
     
     update parameters to minimise cross entropy loss on the set of labelled hypernodes
end for
label the hypernodes in using
Algorithm 3 Algorithm for -HyperGCN

10.1 Time complexity

Given an attributed hypergraph, let $p$ be the number of initial features, $h$ the number of hidden units, and $q$ the number of labels. Further, let $T$ be the total number of epochs of training. Then:

  • HyperGCN takes time

  • 1-HyperGCN takes time

  • FastHyperGCN takes time

  • HGNN takes time

11 HyperGCN for combinatorial optimisation

Inspired by the recent successes of deep graph models as learning-based approaches for NP-hard problems Li et al. (2018c); Prates et al. (2019); Lemos et al. (2019); Gong et al. (2019), we have used HyperGCN as a learning-based approach for the densest $k$-subhypergraph problem Chlamtác et al. (2018), an NP-hard hypergraph problem. The problem is, given a hypergraph $H = (V, E)$, to find a subset $W \subseteq V$ of $k$ hypernodes so as to maximise the number of hyperedges contained in (induced by) $W$, i.e., we intend to maximise the density given by the number of hyperedges $e \in E$ with $e \subseteq W$.

One natural greedy heuristic approach for the problem is to select the $k$ hypernodes of maximum degree. We call this approach "MaxDegree". Another greedy heuristic approach is to iteratively remove, from the current (residual) hypergraph, a hypernode of minimum degree together with all the hyperedges containing it. We repeat the procedure $n - k$ times and consider the density of the remaining hypernodes. We call this approach "RemoveMinDegree".

11.1 Our approach

A natural approach to the problem is to train HyperGCN to perform the labelling. In other words, HyperGCN takes a hypergraph as input and outputs a binary labelling of the hypernodes. A natural output representation is a probability map in $[0, 1]^{|V|}$ that indicates how likely each hypernode is to belong to the densest subset $W$.

Let $\mathcal{T} = \{(H_i, y_i)\}$ be a training set, where $H_i$ is an input hypergraph and $y_i$ is one of the optimal solutions for the NP-hard hypergraph problem. The HyperGCN model learns its parameters by being trained to predict $y_i$ given $H_i$. During training, we minimise the binary cross-entropy loss for each training sample. Additionally, we generate multiple probability maps and minimise the hindsight loss, i.e., $\min_m \ell_m$, where $\ell_m$ is the cross-entropy loss corresponding to the $m$-th probability map. Generating multiple probability maps has the advantage of producing diverse solutions Li et al. (2018c).
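A small PyTorch sketch of the hindsight loss over multiple probability maps; the shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hindsight_loss(prob_maps, target):
    """Hindsight loss: of the M probability maps, only the one with the lowest binary
    cross-entropy against an optimal labelling contributes to the loss (and gradient).
    prob_maps: (M, n) tensor with entries in (0, 1); target: (n,) float tensor of 0/1."""
    losses = torch.stack([F.binary_cross_entropy(p, target) for p in prob_maps])
    return losses.min()
```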

11.2 Experiments: Training data

To generate a sample in the training set, we fix a vertex set $W$ of $k$ vertices chosen uniformly at random. We generate each hyperedge $e$ such that, with high probability, $e$ is contained in $W$; otherwise $e$ contains vertices outside $W$. The algorithm to generate a sample is given below, followed by a sketch in code.

Input: A hypergraph and a dense set of vertices
      Output A hypergraph and a dense set of vertices

subset of of size chosen uniformly randomly
for  do
      chosen uniformly randomly
     sample from with probability
     sample from with probability
end for
Algorithm 4 Algorithm for generating a training sample
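A hedged Python sketch of this generation procedure; the edge-size range and the probability of drawing a hyperedge inside the planted dense set are illustrative values, not the ones used in the paper.

```python
import random

def generate_sample(n, m, k, max_edge_size=10, p_inside=0.9):
    """Plant a dense set W of k hypernodes and draw each of the m hyperedges from W
    with probability p_inside, otherwise from the remaining hypernodes."""
    V = list(range(n))
    W = set(random.sample(V, k))
    outside = [v for v in V if v not in W]
    hyperedges = []
    for _ in range(m):
        size = random.randint(2, max_edge_size)
        if random.random() < p_inside:
            e = random.sample(sorted(W), min(size, len(W)))
        else:
            e = random.sample(outside, min(size, len(outside)))
        hyperedges.append(frozenset(e))
    return hyperedges, W
```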

11.3 Experiments: Results

We generated training samples with the number of hypernodes chosen uniformly at random from a fixed range, and chose the remaining generation parameters to reflect what is mostly the case for real-world hypergraphs. We compared all our proposed approaches, viz. 1-HyperGCN, HyperGCN, and FastHyperGCN, against the baselines MLP and MLP + HLR and the state-of-the-art HGNN. We also compared against the greedy heuristics MaxDegree and RemoveMinDegree. We trained all the deep models using the same hyperparameters as Li et al. (2018c) and report the results in Table 7. We tested all the models on a synthetically generated test set of hypergraphs, and also on the five real-world hypergraphs used for the SSL experiments. As we can see in the table, our proposed HyperGCN outperforms all the other approaches except on the Pubmed dataset, which contains a small number of vertices with large degrees and a large number of vertices with small degrees; the RemoveMinDegree baseline is able to recover all the hyperedges in the Pubmed dataset. Moreover, FastHyperGCN is competitive with HyperGCN as the number of hypergraphs in the training data is large.

11.4 Qualitative analysis

Figure 4 shows the visualisations given by RemoveMinDegree and HyperGCN on the Cora co-authorship hypergraph. We used Gephi’s Force Atlas to space out the vertices. In general, a cluster of nearby vertices has multiple hyperedges connecting them. Clusters of only green vertices indicate the method has likely included all vertices within the hyperedges induced by the cluster. The figure of HyperGCN has more dense green clusters than that of RemoveMinDegree. Figure 5 shows the results of HGNN vs. HyperGCN.

Dataset Synthetic DBLP Pubmed Cora Cora Citeseer
Approach test set co-authorship co-citation co-authorship co-citation co-citation
MaxDegree
RemoveMinDegree
 MLP
MLP + HLR
HGNN
1-HyperGCN
FastHyperGCN
HyperGCN
# hyperedges,
Table 7: Results on the densest $k$-subhypergraph problem. We report the density (higher is better) of the set of $k$ vertices obtained by each approach. See Section 11 for details.
    
(a) RemoveMinDegree
(b) HyperGCN
Figure 4: Green / pink hypernodes denote those the algorithm labels as positive / negative respectively.
    
(a) HGNN
(b) HyperGCN
Figure 5: Green / pink hypernodes denote those the algorithm labels as positive / negative respectively.

12 Sources of the real-world datasets

Co-authorship data: All documents co-authored by an author are in one hyperedge. We used the author data (https://people.cs.umass.edu/~mccallum/data.html) to get the co-authorship hypergraph for Cora. We manually constructed the DBLP dataset from Arnetminer (https://aminer.org/lab-datasets/citation/DBLP-citation-Jan8.tar.bz).

Co-citation data: All documents cited by a document are connected by a hyperedge. We used Cora, Citeseer, and Pubmed from https://linqs.soe.ucsc.edu/data for co-citation relationships. We removed hyperedges which had exactly one hypernode, as our focus in this work is on hyperedges with two or more hypernodes. Each hypernode (document) is represented by bag-of-words features (the feature matrix $X$).

12.1 Construction of the DBLP dataset

We downloaded the entire DBLP data from https://aminer.org/lab-datasets/citation/DBLP-citation-Jan8.tar.bz. The steps for constructing the DBLP dataset used in the paper are as follows:

  • We defined a set of conference categories (classes for the SSL task): "algorithms", "database", "programming", "datamining", "intelligence", and "vision".

  • Out of all the venues in the entire DBLP dataset, we took papers only from the subset of venues listed at https://en.wikipedia.org/wiki/List_of_computer_science_conferences that correspond to the above conference categories.

  • From the venues of the above conference categories, we kept the authors publishing at least two documents.

  • We took the abstracts of all these documents and constructed a dictionary of the most frequent words (words with frequency above a threshold); this dictionary defines the bag-of-words features.

13 Experiments on datasets with categorical attributes

property/dataset mushroom covertype45 covertype67
number of hypernodes,
number of hyperedges,
number of edges in clique expansion
number of classes,
Table 8: Summary of the three UCI datasets used in the experiments in Section 13
Figure 6: Test errors (lower is better) comparing HyperGCN_with_mediators with the non-neural baseline Zhang et al. (2017) on the UCI datasets. HyperGCN_with_mediators offers superior performance. Comparing against GCN on Clique Expansion is unfair. Please see below for details.

We closely followed the experimental setup of the baseline model Zhang et al. (2017). We experimented on three different datasets, viz. mushroom, covertype45, and covertype67, from the UCI machine learning repository Dheeru and Karra Taniskidou (2017). Properties of the datasets are summarised in Table 8. The task for each of the three datasets is to predict one of two labels (binary classification) for each unlabelled instance (hypernode). The datasets contain instances with categorical attributes. To construct the hypergraph, we treat each attribute value as a hyperedge, i.e., all instances (hypernodes) with the same attribute value are contained in one hyperedge. Because of this particular definition of a hyperedge, the clique expansion is destined to produce an almost fully connected graph, and hence GCN on the clique expansion would be unfair to compare against. Having shown that HyperGCN is superior to 1-HyperGCN in the relational experiments, we compare only the former against the non-neural baseline Zhang et al. (2017). We refer to HyperGCN here as HyperGCN_with_mediators. We used the incidence matrix (which encodes the hypergraph structure) as the data matrix $X$. We trained HyperGCN_with_mediators for the full number of epochs and used the same hyperparameters as in Kipf and Welling (2017).
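For concreteness, the following sketch builds the hyperedges and the incidence matrix from categorical instances as described above; the function name and the toy example are ours.

```python
import numpy as np

def categorical_to_hypergraph(rows):
    """Each distinct (attribute, value) pair becomes a hyperedge containing all
    instances that share that value; returns the hyperedges and the incidence
    matrix used as the data matrix X. `rows` is a list of equal-length tuples."""
    hyperedges = {}
    for idx, row in enumerate(rows):
        for attr, value in enumerate(row):
            hyperedges.setdefault((attr, value), set()).add(idx)
    edge_list = list(hyperedges.values())
    H = np.zeros((len(rows), len(edge_list)))   # incidence: hypernodes x hyperedges
    for j, e in enumerate(edge_list):
        for v in e:
            H[v, j] = 1.0
    return edge_list, H

# e.g. three instances with two categorical attributes each
edges, X = categorical_to_hypergraph([("red", "round"), ("red", "long"), ("green", "round")])
```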

As in Zhang et al. (2017), we performed multiple trials for each setting and report the mean accuracy (averaged over the trials). The results are shown in Figure 6. We find that the HyperGCN_with_mediators model generally does better than the baseline. We believe this is because of the powerful feature-extraction capability of HyperGCN_with_mediators.

13.1 GCN on clique expansion

We reiterate that the clique expansion, i.e., HGNN Feng et al. (2019), produces almost fully connected graphs for all three datasets, and hence the clique expansion does not carry any useful information. So GCN on the clique expansion is unfair to compare against (HGNN does not learn any useful weights for classification because of the fully connected nature of the graph).

13.2 Relevance of SSL

The main reason for performing these experiments, as pointed out in the publicly accessible NIPS reviews (https://papers.nips.cc/paper/4914-the-total-variation-on-hypergraphs-learning-on-hypergraphs-revisited) of the total variation on hypergraphs Hein et al. (2013), is to show that the proposed method (the primal-dual hybrid gradient method in their case and the HyperGCN_with_mediators method in our case) has improved results on SSL, even if SSL is not very relevant in the first place.

We do not claim that SSL with HyperGCN_with_mediators is the best way to handle these categorical data, but we do claim that, given this built hypergraph, albeit from non-relational data, it obtains superior results compared to the previous best non-neural hypergraph-based SSL method Zhang et al. (2017) in the literature, which is why we followed their experimental setup.

14 Derivations

We show how the graph convolutional network (GCN) Kipf and Welling (2017) has its roots in the convolution theorem Mallat (1999).

Available data Method
CI
MLP
MLP + HLR
HGNN
1-HyperGCN
FastHyperGCN
HyperGCN
Table 9: Results on the Pubmed co-citation hypergraph. Mean test error ± standard deviation (lower is better) over trials for different amounts of available labelled data. We randomly sampled the same number of labelled hypernodes from each class and hence chose each amount to be divisible by the number of classes.

14.1 Graph signal processing

We now briefly review essential concepts of graph signal processing that are important in the construction of ChebNet and graph convolutional networks. We need convolutions on graphs defined in the spectral domain. Similar to regular 1-D or 2-D signals, real-valued graph signals can be efficiently analysed via harmonic analysis and processed in the spectral domain Shuman et al. (2013). To define spectral convolution, we note that the convolution theorem Mallat (1999) generalises from classical discrete signal processing to take into account arbitrary graphs Sandryhaila and Moura (2013).

Informally, the convolution theorem says that the convolution of two signals in one domain (say the time domain) equals point-wise multiplication of the signals in the other domain (the frequency domain). More formally, given a graph signal $S_1 \in \mathbb{R}^n$ and a filter signal $S_2 \in \mathbb{R}^n$, both of which are defined in the vertex domain (time domain), the convolution of the two signals, $S_1 * S_2$, satisfies

$\widehat{S_1 * S_2} = \hat{S}_1 \odot \hat{S}_2 \qquad (3)$

where $\hat{S}_1$, $\hat{S}_2$, and $\widehat{S_1 * S_2}$ are the graph signals in the spectral domain (frequency domain) corresponding, respectively, to $S_1$, $S_2$, and $S_1 * S_2$, and $\odot$ denotes element-wise multiplication.

An essential operator for computing graph signals in the spectral domain is the symmetrically normalised graph Laplacian operator of $G$, defined as

$L := I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \qquad (4)$

where $D$ is the diagonal degree matrix with elements $D_{ii} = \sum_j A_{ij}$. As the above graph Laplacian operator $L$ is a real symmetric and positive semidefinite matrix, it admits a spectral eigendecomposition of the form $L = U \Lambda U^T$, where $U = [u_1, \dots, u_n]$ forms an orthonormal basis of eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix of the corresponding eigenvalues with $0 = \lambda_1 \le \dots \le \lambda_n$.

The eigenvectors form a Fourier basis and the eigenvalues carry a notion of frequencies as in classical Fourier analysis. The graph Fourier transform of a graph signal $S \in \mathbb{R}^n$ is thus defined as $\hat{S} = U^T S$, and the inverse graph Fourier transform turns out to be $S = U \hat{S}$, which is the same as

$S = \sum_{i=1}^{n} \hat{S}_i\, u_i \qquad (5)$

The convolution theorem generalised to graph signals (Equation 3) can thus be rewritten as $\widehat{S_1 * S_2} = (U^T S_1) \odot (U^T S_2)$. It follows that $S_1 * S_2 = U\big((U^T S_1) \odot (U^T S_2)\big)$, which is the same as

$S_1 * S_2 = U\, \hat{S}_2(\Lambda)\, U^T S_1 \qquad (6)$

where $\hat{S}_2(\Lambda) = \mathrm{diag}\big(\hat{s}_2(\lambda_1), \dots, \hat{s}_2(\lambda_n)\big)$ treats the filter as a function of the Laplacian eigenvalues. A short numerical sketch of this spectral convolution is given after Table 10.
Available data Method
CI
MLP
MLP + HLR
HGNN
1-HyperGCN
FastHyperGCN
HyperGCN
Table 10: Results on the DBLP co-authorship hypergraph. Mean test error ± standard deviation (lower is better) over trials for different amounts of available labelled data. We randomly sampled the same number of labelled hypernodes from each class and hence chose each amount to be divisible by the number of classes.
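A small NumPy sketch of the spectral convolution in Equations 3-6 above: compute the normalised Laplacian, take its eigendecomposition, filter in the spectral domain, and transform back. The helper name and the example filter are illustrative choices of ours.

```python
import numpy as np

def spectral_convolution(A, s1, filter_fn):
    """Convolve the graph signal s1 with a filter defined in the spectral domain
    as a function of the Laplacian eigenvalues."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # Equation 4
    lam, U = np.linalg.eigh(L)                              # L = U diag(lam) U^T
    s1_hat = U.T @ s1                                       # graph Fourier transform
    filtered = filter_fn(lam) * s1_hat                      # point-wise product, Eq. 3
    return U @ filtered                                     # inverse transform, Eqs. 5-6

# e.g. a simple polynomial (low-pass) filter of the eigenvalues:
# out = spectral_convolution(A, s1, lambda lam: 1.0 - 0.5 * lam)
```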

14.2 ChebNet convolution

We could use a non-parametric filter, but there are two limitations: (i) it is not localised in space, and (ii) its learning complexity is $O(n)$. These two limitations contrast with traditional CNNs, where the filters are localised in space and the learning complexity is independent of the input size. It was proposed by Defferrard et al. (2016) to use a polynomial filter to overcome these limitations. A polynomial filter is defined as:

$\hat{s}_2(\lambda) := \sum_{k=0}^{K-1} w_k\, \lambda^{k} \qquad (7)$

Using Equation 7 in Equation 6, we get $S_1 * S_2 = U \big(\sum_{k=0}^{K-1} w_k \Lambda^k\big) U^T S_1$. From the definition of an eigenvalue, we have $L u = \lambda u$ and hence $L^k u = \lambda^k u$ for a positive integer $k$ and an eigenpair $(\lambda, u)$. Therefore,

$U \Lambda^k U^T = L^k \qquad (8)$

Hence,

$S_1 * S_2 = \sum_{k=0}^{K-1} w_k\, L^k\, S_1 \qquad (9)$

The graph convolution provided by Equation 9 uses the monomial basis to learn filter weights. Monomial bases are not optimal for training and are not stable under perturbations because they do not form an orthogonal basis. It was proposed by Defferrard et al. (2016) to use the orthogonal Chebyshev polynomials Hammond et al. (2011) (hence the name ChebNet) to recursively compute the powers of the graph Laplacian.

A Chebyshev polynomial of order $k$ can be computed recursively via the stable recurrence relation $T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$.
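A minimal NumPy sketch of applying such a filter with the Chebyshev recurrence on the rescaled Laplacian; the assumption $\lambda_{max} \approx 2$ and the function name are ours.

```python
import numpy as np

def chebyshev_filter(L, s, weights, lam_max=2.0):
    """Apply a ChebNet-style filter sum_k w_k T_k(L_tilde) s using the recurrence
    T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x), where L_tilde is the rescaled Laplacian."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)   # rescale eigenvalues to [-1, 1]
    T_prev, T_curr = s, L_tilde @ s             # T_0(L~) s and T_1(L~) s
    out = weights[0] * T_prev
    if len(weights) > 1:
        out = out + weights[1] * T_curr
    for w in weights[2:]:
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + w * T_curr
    return out
```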