# A nonlinear diffusion method for semi-supervised learning on hypergraphs

Hypergraphs are a common model for multiway relationships in data, and hypergraph semi-supervised learning is the problem of assigning labels to all nodes in a hypergraph, given labels on just a few nodes. Diffusions and label spreading are classical techniques for semi-supervised learning in the graph setting, and there are some standard ways to extend them to hypergraphs. However, these methods are linear models, and do not offer an obvious way of incorporating node features for making predictions. Here, we develop a nonlinear diffusion process on hypergraphs that spreads both features and labels following the hypergraph structure, which can be interpreted as a hypergraph equilibrium network. Even though the process is nonlinear, we show global convergence to a unique limiting point for a broad class of nonlinearities, which is the global optimum of an interpretable, regularized semi-supervised learning loss function. The limiting point serves as a node embedding from which we make predictions with a linear model. Our approach is much more accurate than several hypergraph neural networks, and also takes less time to train.

## 1 Introduction

In graph-based semi-supervised learning (SSL), one has labels at a small number of nodes, and the goal is to predict labels at the remaining nodes. Diffusions, label spreading, and label propagation are classical techniques for this problem, where known labels are diffused, spread, or propagated over the edges in a graph [41, 43]. These methods were originally developed for graphs where the set of nodes corresponds to a point cloud, and edges are similarity measures such as $k$-nearest neighbors; however, these methods can also be used with relational data such as social networks or co-purchasing [13, 20, 10, 17]. In the latter case, diffusions work because they capture the idea of homophily [27] or assortativity [29], where labels are smooth over the graph.

While graphs are a widely-used model for relational data, many complex systems and datasets are actually described by higher-order relationships that go beyond pairwise interactions [5, 4, 34]. For instance, co-authorship often involves more than two authors, people in social networks gather in small groups and not just pairs, and emails can have several recipients. A hypergraph is a standard representation for such data, where a hyperedge can connect any number of nodes. Directly modeling these higher-order interactions has led to improvements in a number of machine learning problems [42, 6, 22, 23, 39, 32, 2]. Along this line, there are a number of diffusions or label spreading techniques for semi-supervised learning on hypergraphs [42, 14, 40, 21, 24, 37, 35], which are also built on principles of similarity or assortativity. However, these methods are designed for cases where only labels are available, and do not take advantage of rich features or metadata associated with hypergraphs that are potentially useful for making accurate predictions. For instance, coauthorship or email data could have rich textual information.

Hypergraph neural networks (HNNs) are one popular approach for combining both features and network structure for SSL [39, 12, 11]. The hidden layers of HNNs combine the features of neighboring nodes with neural networks and learn the model parameters fitting the available labeled nodes. While combining features according to the hypergraph structure is a key idea, HNNs do not take advantage of the fact that connected nodes likely share similar labels; moreover, they can be expensive to train. In contrast, diffusion-like methods work precisely because of homophily and are typically fast. In the simple case of graphs, combining these two ideas has led to several recent advances [19, 15, 16].

Here, we combine the ideas of HNNs and diffusions for SSL on hypergraphs with a method that simultaneously diffuses both labels and features according to the hypergraph structure. In addition to incorporating features, our new diffusion can incorporate a broad class of nonlinearities to increase the modeling capability, which is critical to the architectures of both graph and hypergraph neural networks. Our nonlinear diffusion can be interpreted as a forward model of a simple deep equilibrium network [3] with infinitely many layers. The limiting point of the process provides an embedding at each node, which can then be combined with a simpler model such as multinomial logistic regression to make predictions at each node.

Remarkably, even though our model is nonlinear, we can still prove a number of theoretical properties about the diffusion process. In particular, we show that the limiting point of the process is unique and provide a simple, globally convergent iterative algorithm for computing it. Furthermore, we show that this limiting point is the global optimum of an interpretable optimization formulation of SSL, similar to the linear case of graphs, where the objective function is a combination of a squared loss term and a Laplacian-like regularization term. From this perspective, the limiting point is close to the known labels and features at each node and is also smooth with respect to the hypergraph, as measured by nonlinear aggregation functions of values at nodes on hyperedges.

Empirically, we find that using the limiting point of our nonlinear hypergraph diffusion as features for a linear model outperforms state-of-the-art HNNs and other diffusions on several real-world datasets. Including the final-layer embedding of HNNs as additional features in this linear model does not improve accuracy.

## 2 Problem set-up

We consider the multi-class semi-supervised classification problem on a hypergraph, in which we are given nodes with features and hyperedges connecting them. A small number of node labels are available and the goal is to assign the labels to the remaining set of nodes. Here we introduce some notation.

Let $H = (V, E)$ be a hypergraph where $V$ is the set of nodes and $E$ the set of hyperedges. Each hyperedge $e \in E$ has an associated positive weight $w(e) > 0$. In our setting every node can belong to an arbitrary number of hyperedges. Let $\delta_i$ denote the (hyper)degree of node $i$, i.e., the weighted number of hyperedges node $i$ participates in,

$$\delta_i = \sum_{e \,:\, i \in e} w(e),$$

and let $D$ be the diagonal matrix of the node degrees, i.e. $D = \mathrm{Diag}(\delta_1, \dots, \delta_n)$ with $n = |V|$. Throughout we assume no isolated nodes, i.e. $\delta_i > 0$ for all $i \in V$. This is a standard assumption, as one can always add self loops or remove isolated vertices.

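We will represent $d$-dimensional features on nodes in $V$ by a matrix $X \in \mathbb{R}^{n \times d}$, where row $x_i$ is the feature vector of node $i$. Suppose each node belongs to one of $c$ classes, denoted $\{1, \dots, c\}$, and we know the label of a (small) training subset $T \subset V$ of the nodes. We denote by $Y \in \mathbb{R}^{n \times c}$ the input-labels matrix of the nodes, with rows $y_i$ entrywise defined by

$$Y_{ij} = (y_i)_j = \begin{cases} 1 & \text{if } i \in T \text{ and node } i \text{ belongs to class } j \\ 0 & \text{otherwise.} \end{cases}$$

Since we know the labels only for the nodes in $T$, all the rows $y_i$ for $i \notin T$ are fully zero, while the rows $y_i$ with $i \in T$ have exactly one nonzero entry.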
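As a small illustration of this set-up, the following sketch builds the input-labels matrix $Y$ from a label vector and a training subset. The function name `label_matrix`, the toy labels, and the 0-based class indices are assumptions made for the example and are not part of the original text.

```python
import numpy as np

def label_matrix(labels, train_idx, num_classes):
    """Return Y with one-hot rows for training nodes and zero rows elsewhere."""
    labels = np.asarray(labels)
    train_idx = np.asarray(train_idx)
    Y = np.zeros((len(labels), num_classes))
    # Only nodes in the training subset T reveal their class.
    Y[train_idx, labels[train_idx]] = 1.0
    return Y

# Hypothetical toy data: 5 nodes, 3 classes (0-based), labels known on T = {0, 2}.
labels = np.array([0, 2, 1, 1, 0])
T = [0, 2]
Y = label_matrix(labels, T, num_classes=3)
```

Rows of `Y` outside `T` stay zero even though `labels` holds their true class, which mirrors the fact that those labels are hidden from the learner.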

## 3 Background and related work on hypergraph semi-supervised learning

Here, we review basic ideas in hypergraph neural networks (HNNs) and hypergraph label spreading (HLS), which will contextualize the methods we develop in the next section.

### 3.1 Neural network approaches

Graph (convolutional) neural networks are a broadly adopted method for semi-supervised learning on graphs. Several generalizations to hypergraphs have been proposed, and we summarize the most fundamental ideas here.

When $|e| = 2$ for all $e \in E$, the hypergraph is a standard graph $G$. The basic formulation of a graph convolutional network (GCN) [18] is based on a first-order approximation of the convolution operator on graph signals [26]. This approximation boils down to a mapping given by $\bar{A} = I - L$, where $L = I - D^{-1/2} A D^{-1/2}$ is the (possibly rescaled) normalized Laplacian matrix of the graph $G$, $A$ is the adjacency matrix, and $\bar{A}$ is the normalized adjacency matrix. The forward model for a two-layer GCN is then

$$Z = \mathrm{softmax}(F) = \mathrm{softmax}\big(\bar{A}\, \sigma(\bar{A} X \Theta^{(1)})\, \Theta^{(2)}\big)$$

where $X$ is the matrix of the graph signals (the node features), $\Theta^{(1)}, \Theta^{(2)}$ are the input-to-hidden and hidden-to-output weight matrices of the network and $\sigma$ is a nonlinear activation function (typically, ReLU). Here, the graph convolutional filter combines features across nodes that are well connected in the input graph. For multi-class semi-supervised learning problems, the weights are then trained by minimizing the cross-entropy loss

$$-\sum_{i \in T} \sum_{j=1}^{c} Y_{ij} \ln Z_{ij}$$

over the training set of known labels $T$.
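The two-layer forward model and its loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy graph, the random weights, and the plain normalization $D^{-1/2} A D^{-1/2}$ (without the self-loop renormalization often used in practice) are choices made here, not details taken from the text.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (assumes no isolated nodes)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def softmax(F):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def gcn_forward(A_bar, X, Theta1, Theta2):
    """Two-layer GCN forward pass with a ReLU hidden activation."""
    hidden = np.maximum(A_bar @ X @ Theta1, 0.0)
    return softmax(A_bar @ hidden @ Theta2)

def cross_entropy(Z, Y, train_idx):
    """Cross-entropy summed over the training nodes only."""
    return -np.sum(Y[train_idx] * np.log(Z[train_idx]))

# Hypothetical toy graph with 4 nodes, 3 features, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.random((4, 3))
Theta1 = rng.standard_normal((3, 5))
Theta2 = rng.standard_normal((5, 2))
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)

Z = gcn_forward(normalized_adjacency(A), X, Theta1, Theta2)
loss = cross_entropy(Z, Y, train_idx=[0, 3])
```

In an actual GCN the weights `Theta1` and `Theta2` would be trained by minimizing `loss`; here they are left random because only the forward computation is being illustrated.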

Several hypergraph variations of this neural network model have been proposed for the more general case $|e| \geq 2$. A common strategy is to consider a hypergraph Laplacian $L$ and define an analogous convolutional filter. One simple option is to define $L$ as the Laplacian of the clique expansion graph of $H$ [1, 42], where the hypergraph is mapped to a graph on the same set of nodes by adding a clique among the nodes of each hyperedge. This is the approach used in HGNN [12], and other variants use mediators instead of cliques in the hypergraph-to-graph reduction [8]. HyperGCN [39] is based on the nonlinear hypergraph Laplacian of [25, 9]. This model uses a GCN on a graph $G_X$ that depends on the features, where nodes $i$ and $j$ are connected if and only if the pair $(i, j)$ maximizes $\|x_i - x_j\|$ over the nodes of some hyperedge $e \in E$. The convolutional filter is then defined in terms of the normalized Laplacian of $G_X$, resulting in the two-layer HyperGCN network (here $A$ denotes this feature-dependent filter)

$$F^{(1)} = \sigma(A X \Theta^{(1)}), \qquad Z = \mathrm{softmax}(F) = \mathrm{softmax}(A F^{(1)} \Theta^{(2)}).$$

### 3.2 Laplacian regularization, and label spreading

Semi-supervised learning methods based on Laplacian-like regularization strategies were developed by [41] for graphs and then by [42] for hypergraphs. The main idea of these approaches is to obtain a classifier $F$ by minimizing the regularized square loss function

$$\min_F \; \ell_\Omega(F) = \|F - Y\|_2^2 + \lambda\, \Omega_H(F) \tag{1}$$

where $\Omega_H$ is a regularization term that takes into account the hypergraph structure. (Note that only labels, and not features, are used here.) In particular, if $f_i$ denotes the $i$-th row of $F$, the clique expansion approach of [42] defines $\Omega_H = \Omega_H^{L^2}$, with

$$\Omega_H^{L^2}(F) = \sum_{e \in E} \sum_{i,j \in e} \frac{w(e)}{|e|} \left\| \frac{f_i}{\sqrt{\delta_i}} - \frac{f_j}{\sqrt{\delta_j}} \right\|_2^2,$$

while the total variation on hypergraph regularizer proposed by [14] is $\Omega_H = \Omega_H^{L^1}$, where

$$\Omega_H^{L^1}(F) = \sum_{e \in E} w(e) \max_{i,j \in e} \|f_i - f_j\|_1.$$

The graph construction in HyperGCN can be seen as a type of regularization based on this total variation approach.

The optimization problems defined by these two regularizers call for different solution strategies. As $\Omega_H^{L^2}$ is quadratic, one can solve (1) via gradient descent with learning rate $\alpha$ to obtain the simple iterative method:

$$F^{(k+1)} = \alpha \bar{A}_H F^{(k)} + (1 - \alpha) Y \tag{2}$$

where $\bar{A}_H$ is the normalized adjacency matrix of the clique-expanded graph of $H$. The sequence (2) converges to the global solution of (1) for any starting point, and the limit is entrywise nonnegative. This method is usually referred to as Hypergraph Label Spreading (HLS), as the iteration in (2) takes the initial labels $Y$ and “spreads” or “diffuses” them throughout the vertices of the hypergraph $H$, following the edge structure of its clique-expanded graph. It is worth noting, in passing, that each step of (2) can also be interpreted as one layer of the forward model of a linear neural network (i.e., with no activation functions), with a bias term given by $(1 - \alpha) Y$. We will further discuss this analogy later on in Section 4.
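The iteration (2) is easy to try on a toy hypergraph. The sketch below is an illustration under an explicit assumption: as the normalized operator it uses $D^{-1/2} K W D_E^{-1} K^\top D^{-1/2}$ (with $K$ the node-by-hyperedge incidence matrix and $D_E$ the diagonal matrix of hyperedge sizes), a normalization chosen here because its largest eigenvalue is exactly one, so the iteration converges for any $0 < \alpha < 1$. The hyperedges, weights, and function names are hypothetical.

```python
import numpy as np

def incidence(hyperedges, n):
    """Dense node-by-hyperedge incidence matrix."""
    K = np.zeros((n, len(hyperedges)))
    for col, e in enumerate(hyperedges):
        K[list(e), col] = 1.0
    return K

def label_spreading(A_bar, Y, alpha, iters=500):
    """Run F <- alpha * A_bar @ F + (1 - alpha) * Y for a fixed number of steps."""
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (A_bar @ F) + (1 - alpha) * Y
    return F

# Hypothetical hypergraph on 6 nodes with unit hyperedge weights.
hyperedges = [(0, 1, 2), (2, 3), (3, 4, 5)]
w = np.ones(len(hyperedges))
K = incidence(hyperedges, n=6)
delta = K @ w                      # weighted node degrees
edge_size = K.sum(axis=0)          # |e| for each hyperedge

# Assumed normalization: D^{-1/2} K W D_E^{-1} K^T D^{-1/2}.
A_bar = (K * (w / edge_size)) @ K.T / np.sqrt(np.outer(delta, delta))

Y = np.zeros((6, 2))
Y[0, 0] = 1.0                      # node 0 is labeled with class 0
Y[5, 1] = 1.0                      # node 5 is labeled with class 1

alpha = 0.8
F = label_spreading(A_bar, Y, alpha)

# The limit of the linear iteration solves (I - alpha * A_bar) F = (1 - alpha) Y.
F_exact = (1 - alpha) * np.linalg.solve(np.eye(6) - alpha * A_bar, Y)
```

Comparing `F` with `F_exact` shows that the iteration is just an inexpensive way of solving a linear system, which is why the method is fast but also limited to linear models.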

The one-norm-based regularizer $\Omega_H^{L^1}$ is related to the $1$-Laplacian energy [7, 36] and has advantages for hyperedge cut interpretations of (2). The regularizer $\Omega_H^{L^1}$ is convex but not differentiable, and computing the solution of (1) requires more sophisticated and computationally demanding optimization schemes [14, 40]. Unlike HLS in (2), this case cannot be easily interpreted as a label diffusion or as a linear forward network.

## 4 Nonlinear hypergraph diffusion

The guiding principle of both the hypergraph neural networks and the regularization approaches discussed above is that nodes that share connections are likely to share the same label. This principle is enforced implicitly by the convolutional networks via the learned node representation and explicitly by label spreading methods via the regularization term $\Omega_H$ in (1). The neural network approaches typically require expensive training to find structure in the features, whereas HLS is a fast linear model that enforces smoothness of labels over the hypergraph.

In this section, we propose HyperND, a new nonlinear hypergraph diffusion method that propagates both input node label and feature embeddings through the hypergraph in a manner similar to (2). The method is a simple “forward model” akin to (2), but allows for nonlinear activations, which increases modeling power and yields a type of a hypergraph deep equilibrium network architecture.

Recall that each node $i \in V$ has a label-encoding vector $y_i$ ($y_i$ is the all-zero vector for initially unlabeled points $i \notin T$) and a feature vector $x_i$. Thus, each node in the hypergraph has an initial $(c+d)$-dimensional embedding, which forms an input matrix $U = [Y \; X]$, with rows $u_i = [y_i \; x_i]$.

Our nonlinear diffusion process will result in a new embedding $F^*$, which we then use to train a logistic multi-class classifier

$$Z^* = \mathrm{softmax}(F^* \Theta)$$

based on the known labels and their new embedding by minimizing the cross-entropy loss

$$-\sum_{i \in T} \sum_{j} Y_{ij} \ln Z^*_{ij}. \tag{3}$$

Unlike HNN, the optimization over $\Theta$ and the computation of $F^*$ are decoupled.
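Because the embedding is computed first and the classifier is fitted afterwards, the training step reduces to ordinary multinomial logistic regression. The sketch below illustrates this decoupling with plain gradient descent on the loss (3); the toy embedding `F_star`, the learning rate, and the number of epochs are hypothetical choices, and any off-the-shelf softmax regression solver could be used instead.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax with a max-shift for numerical stability."""
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(F, Y, train_idx, lr=0.5, epochs=500):
    """Minimize the cross-entropy over Theta by gradient descent; F stays fixed."""
    Theta = np.zeros((F.shape[1], Y.shape[1]))
    F_train, Y_train = F[train_idx], Y[train_idx]
    for _ in range(epochs):
        Z = softmax(F_train @ Theta)
        # Gradient of the mean cross-entropy with respect to Theta.
        Theta -= lr * F_train.T @ (Z - Y_train) / len(train_idx)
    return Theta

# Hypothetical precomputed embedding for 4 nodes (stand-in for F*).
F_star = np.array([[1.0, 0.1],
                   [0.9, 0.2],
                   [0.1, 1.0],
                   [0.2, 0.8]])
Y = np.array([[1, 0], [0, 0], [0, 1], [0, 0]], dtype=float)
T = [0, 2]

Theta = fit_softmax_regression(F_star, Y, T)
pred = softmax(F_star @ Theta).argmax(axis=1)   # predicted class for every node
```

Only `Theta` is learned here, so the cost of training does not depend on the hypergraph once the embedding is available.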

### 4.1 The model

Our proposed hypergraph-based diffusion map is a nonlinear generalization of the clique-expansion hypergraph Laplacian. Specifically, let $K \in \mathbb{R}^{n \times m}$ denote the incidence matrix of $H$, whose rows correspond to nodes and columns to hyperedges:

$$K_{i,e} = \begin{cases} 1 & i \in e \\ 0 & \text{otherwise.} \end{cases}$$

To manage possible weights on hyperedges, we use a diagonal matrix $W$ defined by

$$W = \mathrm{Diag}(w(e_1), \dots, w(e_m)).$$

With this notation, the degree of node $i$ is equal to $\delta_i = (K W \mathbf{1})_i$, where $\mathbf{1}$ is a vector with one in every entry.

For a standard graph, i.e., a hypergraph where all edges have exactly two nodes, $K W K^\top = A + D$, where $A$ is the adjacency matrix of the graph and $D$ is the diagonal matrix of the weighted node degrees. Similarly, for a general hypergraph $H$, we have the identity $K W K^\top = A_H + D$, where $A_H$ is the adjacency matrix of the clique-expansion graph associated with $H$. Then

$$D^{-1/2} K W K^\top D^{-1/2} = \bar{A}_H + I \tag{4}$$

is the clique-expansion hypergraph normalized adjacency [42] that can be used as a hypergraph convolutional filter [12]. Here, we propose a diffusion map which is similar to (4) but defines a nonlinear hypergraph convolutional filter:

$$\Phi(F) = D^{-1/2} K W \sigma(K^\top \varrho(D^{-1/2} F)), \tag{5}$$

where $\sigma$ and $\varrho$ are diagonal maps (that is, each entry of $\sigma(F)$ depends only on the corresponding entry of $F$, and $\varrho$ is similar). Note that when $\sigma$ and $\varrho$ are the identity maps, $\Phi$ reduces to the clique-expansion filter $\Phi(F) = (\bar{A}_H + I) F$ from (4), and that any neural network activation function is a diagonal mapping.
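The identities behind (4) and (5) can be checked numerically. The sketch below builds the incidence matrix of a small hypothetical hypergraph, verifies $K W K^\top = A_H + D$ against a clique expansion built independently, and confirms that the map in (5) with identity nonlinearities coincides with $(\bar{A}_H + I)F$. All names and the toy data are assumptions made for the illustration.

```python
import numpy as np

# Hypothetical weighted hypergraph on 5 nodes.
hyperedges = [(0, 1, 2), (1, 3), (2, 3, 4)]
w = np.array([1.0, 2.0, 0.5])
n, m = 5, len(hyperedges)

K = np.zeros((n, m))
for col, e in enumerate(hyperedges):
    K[list(e), col] = 1.0
W = np.diag(w)
delta = K @ W @ np.ones(m)          # node degrees, delta_i = (K W 1)_i
D = np.diag(delta)

# Clique-expansion adjacency built directly from the hyperedges.
A_H = np.zeros((n, n))
for e, weight in zip(hyperedges, w):
    for i in e:
        for j in e:
            if i != j:
                A_H[i, j] += weight

def Phi(F, sigma, rho):
    """Diffusion map (5): D^{-1/2} K W sigma(K^T rho(D^{-1/2} F))."""
    scale = np.sqrt(delta)[:, None]
    return (K @ W @ sigma(K.T @ rho(F / scale))) / scale

identity = lambda M: M
F = np.random.default_rng(1).random((n, 3))

A_bar_H = A_H / np.sqrt(np.outer(delta, delta))
linear_case = Phi(F, identity, identity)
expected = (A_bar_H + np.eye(n)) @ F
```

Swapping `identity` for entrywise nonlinear functions in the call to `Phi` gives the nonlinear filter without changing any of the surrounding code.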

Our proposed hypergraph semi-supervised classifier uses the normalized fixed point $F^*$ of the nonlinear diffusion process

$$F^{(k+1)} = \alpha \Phi(F^{(k)}) + (1 - \alpha) U. \tag{6}$$

Similarly to (2), each step of (6) can be interpreted as one layer of the forward model of a simple hypergraph neural network, which only uses the convolutional filter and has no weights. Thus, the limit point

$$F^* = \alpha \Phi(F^*) + (1 - \alpha) U \tag{7}$$

corresponds to a simplified hypergraph convolutional network with infinitely many layers.

Networks with infinitely many layers are sometimes called deep equilibrium networks [3], and one of the most challenging questions for this type of network is whether the limit point exists and is unique [38]. Our main theoretical result shows that, under mild assumptions on $\Phi$, a unique fixed point always exists, provided we look for it on a suitable projective slice of the form $\{F : \varphi(F) = 1\}$, where $\varphi$ is a homogeneous scaling function, such as a norm (we will specify a particular $\varphi$ later).

In what follows, we use the notation $F \geq 0$ (resp. $F > 0$) to indicate that $F$ has nonnegative (resp. positive) entries.

###### Theorem 4.1.

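Let $\Phi$ be a homogeneous of degree $p$, positive and order preserving mapping, i.e.,

1. $\Phi(\lambda F) = \lambda^p \Phi(F)$ for all $F \geq 0$ and all $\lambda > 0$,

2. $\Phi(F) > 0$ if $F > 0$, and

3. $\Phi(F) \geq \Phi(G)$ if $F \geq G \geq 0$.

Let $U$ be an entrywise positive input embedding, let $0 < \alpha < 1$, and let $\varphi$ be a real-valued, positive and one-homogeneous function, i.e., $\varphi(F) > 0$ for all $F > 0$ and $\varphi(\lambda F) = \lambda \varphi(F)$, for all $F \geq 0$ and all $\lambda > 0$. The sequence

$$\begin{cases} \tilde{F}^{(k)} = \alpha \Phi(F^{(k)}) + (1 - \alpha) U \\ F^{(k+1)} = \tilde{F}^{(k)} / \varphi(\tilde{F}^{(k)}) \end{cases} \tag{8}$$

converges to the unique nonnegative fixed point $F^*$ such that

$$F^* = \alpha \Phi(F^*) + (1 - \alpha) U, \qquad \varphi(F^*) = 1$$

for any starting point $F^{(0)}$ with nonnegative entries. Moreover, $F^*$ is entrywise positive.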

###### Proof.

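Consider the iteration in (8)

$$\begin{cases} \tilde{F}^{(k)} = \alpha \Phi(F^{(k)}) + (1 - \alpha) U \\ F^{(k+1)} = \tilde{F}^{(k)} / \varphi(\tilde{F}^{(k)}) \end{cases}$$

As $\Phi$ is $p$-homogeneous and $\varphi$ is 1-homogeneous, we have that the $i$-th component of $\Phi(F)$ is bounded and positive on the slice $\varphi(F) = 1$, i.e. there exists a constant $M_i > 0$ such that

$$\max_{F \,:\, \varphi(F) = 1} \Phi(F)_i = \max_{F} \frac{\Phi(F)_i}{\varphi(F)^p} \leq M_i.$$

Thus, if $M = \max_i M_i$ we have that $\Phi(F) \leq M$ entrywise, for all $F$ such that $\varphi(F) = 1$. As a consequence, since $U$ is entrywise positive, there exists a $C > 0$ such that $\Phi(F) \leq C\, U$ for all $F$ such that $\varphi(F) = 1$. The thesis thus follows from Theorem 3.1 in [35]. ∎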
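The normalized iteration (8) can be explored numerically. The sketch below is an illustration under stated assumptions: it uses entrywise power functions for the two nonlinearities (a $p$-mean style choice), takes the Frobenius norm as the scaling function $\varphi$, and runs on a hypothetical toy hypergraph with a random positive input. It starts the iteration from two different positive points to illustrate that both runs end up at the same normalized point, as Theorem 4.1 suggests.

```python
import numpy as np

# Hypothetical weighted hypergraph on 5 nodes.
hyperedges = [(0, 1, 2), (1, 3), (2, 3, 4)]
w = np.array([1.0, 2.0, 0.5])
n = 5
K = np.zeros((n, len(hyperedges)))
for col, e in enumerate(hyperedges):
    K[list(e), col] = 1.0
delta = K @ w
edge_size = K.sum(axis=0)

def Phi(F, p=2.0):
    """Nonlinear diffusion map with assumed power nonlinearities."""
    G = (F / np.sqrt(delta)[:, None]) ** p                       # rho: entrywise power
    M = 2.0 * ((K.T @ G) / edge_size[:, None]) ** (1.0 / p)      # sigma: scaled p-th root
    return (K @ (w[:, None] * M)) / np.sqrt(delta)[:, None]

def phi(F):
    """Scaling function; the Frobenius norm is an assumed choice."""
    return np.linalg.norm(F)

def normalized_diffusion(U, alpha, F0, tol=1e-12, max_iter=1000):
    """Iterate (8): one diffusion step followed by rescaling onto phi(F) = 1."""
    F = F0 / phi(F0)
    for _ in range(max_iter):
        F_tilde = alpha * Phi(F) + (1 - alpha) * U
        F_new = F_tilde / phi(F_tilde)
        if np.linalg.norm(F_new - F) < tol:
            return F_new
        F = F_new
    return F

rng = np.random.default_rng(0)
U = rng.random((n, 3)) + 0.1            # entrywise positive input embedding
F_a = normalized_diffusion(U, alpha=0.3, F0=U)
F_b = normalized_diffusion(U, alpha=0.3, F0=rng.random((n, 3)) + 0.1)
```

The value `alpha=0.3` is arbitrary; larger values put more weight on the diffusion term and typically need more iterations to settle.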

Following (5), the $i$th row of $\Phi(F)$ is

$$\Phi(F)_{i,:} = \frac{1}{\sqrt{\delta_i}} \sum_{e \,:\, i \in e} w(e)\, \sigma\Big( \sum_{j \in e} \varrho\Big( \frac{f_j}{\sqrt{\delta_j}} \Big) \Big),$$

which highlights how $\sigma$ and $\varrho$ combine features and labels along each hyperedge. This operation creates a $(c+d)$-dimensional edge embedding for an input $F$, which we denote by $\mu_e(F)$:

$$\mu_e(F) = \sigma(K^\top \varrho(F))_{e,:},$$

so that $\Phi(F)_{i,:} = \delta_i^{-1/2} \sum_{e \,:\, i \in e} w(e)\, \mu_e(D^{-1/2} F)$.

Thus, each step of (6) or, equivalently, each of the infinitely many layers of the deep equilibrium model (7), mixes the combined label and feature node embedding along the hyperedges as illustrated in Figure 1.

In addition to guarantees on existence and uniqueness, if we choose the slice $\{F : \varphi(F) = 1\}$ appropriately, then our equilibrium model is also minimizing a regularized loss function of the form (1), with regularization term

$$\Omega^*_H(F) = \sum_{i \in V} \sum_{e \,:\, i \in e} w(e) \left\| (D^{-1/2} F)_{i,:} - \frac{1}{2} \mu_e(D^{-1/2} F) \right\|^2.$$

This is characterized by the following result.

###### Theorem 4.2.

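Under the same assumptions of Theorem 4.1, suppose $\Phi$ and $\varphi$ are defined as

$$\Phi(F) = D^{-1/2} K W \sigma(K^\top \varrho(D^{-1/2} F)) \tag{9}$$

$$\varphi(F) = \frac{1}{2} \sqrt{ \sum_{i \in V} \sum_{e \,:\, i \in e} w(e)\, \|\mu_e(D^{-1/2} F)\|_2^2 } \tag{10}$$

where $\sigma$ and $\varrho$ are diagonal mappings. If $\Phi$ is differentiable and one-homogeneous, and if $U$ is initially scaled so that $\varphi(U) = 1$, then the limit $F^*$ of (8) is the global optimum of

$$\min_{F \in \mathbb{R}^{n \times (c+d)}} \ell_\Omega(F) \quad \text{subject to} \quad F \geq 0, \; \varphi(F) = 1,$$

where

$$\ell_\Omega(F) = \left\| F - \frac{U}{\varphi(U)} \right\|^2 + \lambda\, \Omega^*_H(F), \tag{11}$$

and $\lambda = \alpha / (1 - \alpha)$.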

###### Proof.

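Note that, under the assumptions above, the function

$$\varphi(F) = \frac{1}{2} \sqrt{ \sum_{i \in V} \sum_{e \,:\, i \in e} w(e)\, \|\mu_e(D^{-1/2} F)\|_2^2 }$$

is positive and one-homogeneous. Thus, by Theorem 4.1 the iteration (8) converges to the unique fixed point $F^*$ on the slice $\{F \geq 0 : \varphi(F) = 1\}$ for every nonnegative starting point. We show below that this is also the only point where the gradient of $\ell_\Omega(F) - \lambda \varphi(F)^2$ vanishes. Let us denote by $S(F)$ the matrix of the hyperedge embeddings, with rows $S(F)_{e,:} = \mu_e(F)$. We have

$$\begin{aligned} \Omega^*_H(D^{1/2} F) &= \sum_i \sum_{e \,:\, i \in e} w(e) \sum_j \Big( F_{ij} - \frac{1}{2} S(F)_{ej} \Big)^2 \\ &= \sum_i \sum_{e \,:\, i \in e} w(e) \sum_j \big( F_{ij}^2 - F_{ij} S(F)_{ej} \big) + \frac{1}{4} \sum_i \sum_{e \,:\, i \in e} \sum_j w(e)\, S(F)_{ej}^2 \\ &= \sum_i \sum_j \big( F_{ij}^2 \delta_i - F_{ij} B(F)_{ij} \big) + \varphi(D^{1/2} F)^2 \\ &= \langle F, D F - B(F) \rangle + \varphi(D^{1/2} F)^2 \end{aligned}$$

where $B(F) = K W S(F)$. Therefore we get

$$\Omega^*_H(F) - \varphi(F)^2 = \langle F, F - D^{-1/2} B(D^{-1/2} F) \rangle = \langle F, F - \Phi(F) \rangle.$$

As $\Phi$ is 1-homogeneous and differentiable, by the Euler theorem for homogeneous functions we have that

$$\frac{d}{dF} \big\{ \Omega^*_H(F) - \varphi(F)^2 \big\} = \frac{d}{dF} \langle F, F - \Phi(F) \rangle = 2 (F - \Phi(F)).$$

Thus,

$$\frac{d}{dF} \big\{ \ell_\Omega(F) - \lambda \varphi(F)^2 \big\} = 2 \big( F - U/\varphi(U) + \lambda (F - \Phi(F)) \big) = 2 \big( (1 + \lambda) F - \lambda \Phi(F) - U/\varphi(U) \big)$$

which shows that the gradient of $\ell_\Omega(F) - \lambda \varphi(F)^2$ vanishes on a point $F^*$ if and only if $F^*$ is a fixed point

$$F^* = \frac{\lambda}{1 + \lambda} \Phi(F^*) + \frac{1}{1 + \lambda} \frac{U}{\varphi(U)}$$

which coincides with (7) for $\alpha = \lambda / (1 + \lambda)$ and $\varphi(U) = 1$. Finally, as the two losses $\ell_\Omega(F)$ and $\ell_\Omega(F) - \lambda \varphi(F)^2$ have the same minimizers on the slice $\{F : \varphi(F) = 1\}$, we conclude. ∎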
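The algebraic identity at the heart of this proof, $\Omega^*_H(F) - \varphi(F)^2 = \langle F, F - \Phi(F) \rangle$, can be verified numerically. The sketch below does so on a hypothetical toy hypergraph, reading the hyperedge embedding as $\mu_e(G) = \sigma(K^\top \varrho(G))_{e,:}$ evaluated at $G = D^{-1/2} F$ (the reading under which the identity holds term by term) and assuming entrywise power nonlinearities. Function names and data are illustrative only.

```python
import numpy as np

# Hypothetical weighted hypergraph on 5 nodes.
hyperedges = [(0, 1, 2), (1, 3), (2, 3, 4)]
w = np.array([1.0, 2.0, 0.5])
n = 5
K = np.zeros((n, len(hyperedges)))
for col, e in enumerate(hyperedges):
    K[list(e), col] = 1.0
delta = K @ w
edge_size = K.sum(axis=0)

def mu(G, p):
    """Hyperedge embeddings sigma(K^T rho(G)), one row per hyperedge."""
    return 2.0 * ((K.T @ G ** p) / edge_size[:, None]) ** (1.0 / p)

def Phi(F, p):
    G = F / np.sqrt(delta)[:, None]
    return (K @ (w[:, None] * mu(G, p))) / np.sqrt(delta)[:, None]

def phi_squared(F, p):
    """Square of the scaling function (10)."""
    S = mu(F / np.sqrt(delta)[:, None], p)
    return 0.25 * np.sum(K @ (w[:, None] * S ** 2))

def omega_star(F, p):
    """Regularizer: weighted squared distance of each node to half its edge embedding."""
    G = F / np.sqrt(delta)[:, None]
    S = mu(G, p)
    total = 0.0
    for j, e in enumerate(hyperedges):
        for i in e:
            total += w[j] * np.sum((G[i] - 0.5 * S[j]) ** 2)
    return total

F = np.random.default_rng(2).random((n, 3)) + 0.1
p = 3.0
lhs = omega_star(F, p) - phi_squared(F, p)
rhs = np.sum(F * (F - Phi(F, p)))       # Frobenius inner product <F, F - Phi(F)>
```

The two sides agree up to rounding for any positive `F`, since the identity only uses the definitions and no property of the nonlinearities.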

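For example, if we choose

$$\varrho(F) = F^p, \qquad \sigma(F) = 2 \big( D_E^{-1} F \big)^{1/p}, \tag{12}$$

where the powers are taken entrywise and $D_E$ denotes the diagonal matrix with diagonal entries $|e|$, $e \in E$, then the assumptions of both Theorem 4.1 and 4.2 are satisfied and, for every $e \in E$, we have

$$\frac{1}{2} \mu_e(D^{-1/2} F) = \Big( \frac{1}{|e|} \sum_{i \in e} \Big( \frac{f_i}{\sqrt{\delta_i}} \Big)^p \Big)^{1/p} =: \mathrm{mean}_p \Big\{ \frac{f_i}{\sqrt{\delta_i}},\; i \in e \Big\}.$$

In other words, $\frac{1}{2} \mu_e(D^{-1/2} F)$ is the $p$-power mean of the normalized feature vectors of all the nodes in the hyperedge $e$.

Using a power mean for the nonlinear functions $\sigma$ and $\varrho$ yields a natural hypergraph consistency interpretation of the diffusion process in (6). Specifically, the regularization term $\Omega^*_H(F)$ becomes

$$\sum_{i \in V} \sum_{e \,:\, i \in e} w(e) \left\| \frac{f_i}{\sqrt{\delta_i}} - \mathrm{mean}_p \Big\{ \frac{f_j}{\sqrt{\delta_j}},\; j \in e \Big\} \right\|^2.$$

Thus, the embedding $F^*$ that minimizes $\ell_\Omega$ in (11) is such that each node embedding $f_i^*$ must be similar to the $p$-power mean of the node embeddings of the other vertices in the same hyperedge.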
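To get a feel for how the exponent shapes the aggregation, the sketch below evaluates entrywise power means of a few positive vectors for several values of $p$. The helper `power_mean` and the sample values are hypothetical; the only fact relied on is the classical power mean inequality, which says the mean is nondecreasing in $p$ and stays between the smallest and largest value.

```python
import numpy as np

def power_mean(V, p):
    """Entrywise p-power mean over the rows of V (rows play the role of nodes in a hyperedge)."""
    return np.mean(V ** p, axis=0) ** (1.0 / p)

# Three hypothetical node embeddings sharing one hyperedge.
V = np.array([[1.0, 4.0],
              [2.0, 4.0],
              [3.0, 4.0]])

means = {p: power_mean(V, p) for p in (-5.0, 1.0, 2.0, 10.0)}
```

For the first coordinate the mean moves from close to the minimum (negative `p`) towards the maximum (large `p`), while the second coordinate, where all nodes agree, is unchanged for every `p`.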

### 4.2 Algorithm details

A seemingly difficult requirement for our main theoretical results is that we require an entrywise positive input embedding $U$. However, this turns out to not be that stringent in practice.

If $X \geq 0$, i.e. we have nonnegative node features, we can easily obtain a positive embedding by performing an initial label smoothing-type step [28, 33] where we choose a small $\varepsilon > 0$ and let

$$U_\varepsilon = (1 - \varepsilon) [Y \; X] + \varepsilon \mathbf{1} \mathbf{1}^\top. \tag{13}$$

Note that nonnegative input features are not uncommon. For instance, bag-of-words, one-hot encodings, and binary features in general are all nonnegative. In fact, for all of the real-world datasets we consider in our experiments, the features are nonnegative.

Similarly, if some of the input features have negative values (e.g., features coming from a word embedding), one can perform other preprocessing manipulations (such as a simple shift of the feature embedding) to get the required $U > 0$.
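A minimal sketch of this preprocessing is given below. The function names, the default value of `eps`, and the particular shift (subtracting the global minimum) are assumptions made for illustration; the text only requires that the resulting input be entrywise positive.

```python
import numpy as np

def smoothed_input(Y, X, eps=1e-6):
    """Label-smoothing style step (13): (1 - eps) * [Y X] + eps * ones."""
    U = np.hstack([Y, X])
    return (1.0 - eps) * U + eps * np.ones_like(U)

def shift_features(X):
    """One possible shift making features nonnegative (hypothetical choice)."""
    return X - min(X.min(), 0.0)

# Hypothetical data: 3 nodes, 2 classes, 2 features with negative entries.
Y = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
X = np.array([[0.5, -1.0], [0.0, 2.0], [-0.3, 0.0]])

X_shifted = shift_features(X)
U_eps = smoothed_input(Y, X_shifted)
```

After the shift every feature is nonnegative, and the smoothing step then lifts the remaining zeros (including the zero rows of `Y`) to a small positive value.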

Once the new node embedding is computed, we use it to infer the labels of the non-labeled datapoints via cross-entropy minimization. The pseudocode of the classification procedure is shown in Algorithm 1.

Similar to standard LS, the parameter $\alpha$ in Algorithm 1 yields a convex combination of the diffusion mapping $\Phi$ and the “bias” $U$, allowing us to tune the contribution given by the homophily along the hyperedges and the one provided by the input features and labels. Moreover, in view of Theorem 4.2, the parameter $\alpha$ quantifies the strength of the regularization parameter $\lambda$, which allows us to tune the contribution of the regularization term $\Omega^*_H$ over the data-fitting term $\|F - U/\varphi(U)\|^2$.

We also point out that since HyperND is a forward model, it can be implemented efficiently. The cost of each iteration of (8) is dominated by the cost of the two matrix-vector products with the matrices $K$ and $K^\top$, both of which only require a single pass over the input data and can be parallelized with standard techniques. Therefore, HyperND scales linearly with the number and size of the hyperedges, i.e., its computational cost is linear in the size of the data.
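The single-pass products with the incidence matrix and its transpose can be sketched without ever forming a dense matrix. The example below stores the hypergraph as two index arrays (one entry per node-hyperedge incidence) and accumulates with `np.add.at`; the function names and the toy hypergraph are hypothetical, and a sparse matrix library would be an equally reasonable choice.

```python
import numpy as np

# Hypothetical hypergraph on 5 nodes, stored as a list of hyperedges.
hyperedges = [(0, 1, 2), (1, 3), (2, 3, 4)]
n, m = 5, len(hyperedges)

# One (node, hyperedge) pair per incidence; total length is the sum of hyperedge sizes.
node_idx = np.concatenate([np.asarray(e) for e in hyperedges])
edge_idx = np.concatenate([np.full(len(e), j) for j, e in enumerate(hyperedges)])

def incidence_transpose_times(G):
    """Compute K^T @ G by summing node rows into their hyperedges."""
    out = np.zeros((m, G.shape[1]))
    np.add.at(out, edge_idx, G[node_idx])
    return out

def incidence_times(M):
    """Compute K @ M by summing hyperedge rows into their member nodes."""
    out = np.zeros((n, M.shape[1]))
    np.add.at(out, node_idx, M[edge_idx])
    return out

G = np.random.default_rng(3).random((n, 4))
edge_values = incidence_transpose_times(G)
node_values = incidence_times(edge_values)
```

Both routines touch each incidence once per embedding column, which matches the linear cost in the number and size of the hyperedges described above.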

## 5 Experiments

We now evaluate our method on several real-world hypergraph datasets. The datasets we used are co-citation and co-authorship hypergraphs: Cora co-authorship, Cora co-citation, Citeseer, Pubmed [31] and DBLP [30]. Table 1 has summary statistics of these datasets. All nodes in the datasets are documents, features are given by the content of the abstract and hyperedge connections are based on either co-citation or co-authorship. The task for each dataset is to predict the topic to which a document belongs (multi-class classification).

We compare our method to five baselines, based on the discussion in Section 3.2.

• TV: This is a confidence-interval subgradient-based method from [40] for the total variation regularization approach of [14]. This method consistently outperforms other label spreading techniques such as the original PDHG strategy of [14] and the HLS method [42].

• MLP: This is a standard supervised approach, where we train a multilayer perceptron with the features and labels for the nodes in $T$, ignoring the hypergraph.

• MLP+: This is the MLP baseline with a regularization term added to the objective, given by the Laplacian energy associated with the hypergraph Laplacian based on mediators [8].

• HGNN: This is the hypergraph neural network model of [12], which uses the clique-expansion Laplacian [42, 1] for the hypergraph convolutional filter.

• HyperGCN: This is the hypergraph convolutional network model proposed in [39]. See also Section 3. In that paper the authors propose three different variations of this architecture (1-HyperGCN, FastHyperGCN, HyperGCN). In our results we report the best performance across these three models.

Table 2 shows the size of the training dataset for each network and compares the accuracy (mean ± standard deviation) of HyperND, with $\sigma$ and $\varrho$ as in (12), against the different baselines. For each dataset, we use five trials with different samples of the training nodes $T$. All of the algorithms that we use have hyperparameters. For the baselines we use either default hyperparameters or the reported tuned hyperparameters [18, 39]. For all of the neural network-based models, we use two layers and 200 training epochs, following [39] and [12]. For our method, we run 5-fold cross validation with label-balanced 50/50 splits to choose $\alpha$. We use the value of $\alpha$ that gives the best mean accuracy over the five folds. As all the datasets we use here have nonnegative features, we preprocess $U$ via label smoothing as in (13) with a small $\varepsilon$. Our experiments have shown that different choices of $\varepsilon$ do not alter the classification performance of the algorithm.

Due to its simple regularization interpretation, we choose $\sigma$ and $\varrho$ to be the $p$-mean considered in (12), for various $p$. When varying $p$, we change the nonlinear activation functions that define the final embedding $F^*$ in (7). Moreover, in order to highlight the role of $\alpha$ in the performance of the algorithm, we show in Figure 2 the mean accuracy over 10 runs of HyperND, for all of the considered values of $\alpha$ (with no cross-validation).

Our proposed nonlinear diffusion method performs the best overall, with different choices of $p$ yielding better performance on different datasets. In nearly all of the cases, the performance gaps are quite substantial. For example, on the Cora co-citation dataset we achieve nearly 83% accuracy, a level that none of the other baselines reaches.

Moreover, HyperND scales linearly with the number of nonzero elements in $K$, i.e., the number of hyperedges and their sizes. Thus, it is typically cheap to compute (similar to standard hypergraph LS) and is overall faster to train than a two-layer HyperGCN. Training times are reported in Table 3, where we compare mean execution time over ten runs for our HyperND vs HyperGCN. For HyperND, we show mean execution time over the five choices of $p$ shown in Table 2.

As mentioned earlier, the diffusion map (6) can be seen as one layer of a forward neural network model and the limit point $F^*$ is the node embedding resulting from an equilibrium model, i.e., a forward message passing network with infinitely many layers. This model yields a new feature-based representation $F^*$, similar to the last-layer embedding of any neural network approach. A natural question is whether or not $F^*$ is actually a better embedding. To this end, we consider four node embeddings and train a classifier via cross-entropy minimization as in (3), optimizing $\Theta$. Specifically, we consider the following:

1. (E1) We run a nonlinear “purely label” spreading iteration, by using only the labels $Y$ as the input $U$ in (6). The limit point is the fixed point of (7). By Thm. 4.2, this embedding is a Laplacian regularization method analogous to HLS.

2. (E2) An embedding based on the representation generated by HyperGCN before the softmax layer.

3. (E3) $F^*$, the limit point (7) of our HyperND. This is the embedding used for the results in Table 2 and Figure 2.

4. (E4) This combines the representations of our HyperND method and HyperGCN.

Figure 3 shows the accuracy for these embeddings with various values of $p$ for the $p$-mean in HyperND. The best performance is obtained by the two embeddings that contain our learned features $F^*$: (E3) and (E4). In particular, while (E4) includes the final-layer embedding of HyperGCN, it does not improve accuracy over (E3).

## 6 Conclusion

Graph neural networks and hypergraph label spreading are two distinct techniques with different advantages for semi-supervised learning with higher-order relational data. We have developed a method (HyperND) that takes the best from both approaches: feature-based learning, modeling flexibility, label-based regularization, and computational speed. More specifically, HyperND is a nonlinear diffusion that can be interpreted as a deep equilibrium network. Importantly, we can prove that the diffusion converges to a unique fixed point, and we have an algorithm that can compute this fixed point. Furthermore, the fixed point can be interpreted as the global minimizer of an interpretable regularized loss function. Overall, HyperND outperforms neural network and label spreading methods, and we also find evidence that our method learns embeddings that contain information that is complementary to what is contained in the representations learned by neural network methods.

## Acknowledgments

This research was supported in part by ARO Award W911NF19-1-0057, ARO MURI, NSF Award DMS-1830274, and JP Morgan Chase & Co.

## References

• [1] Sameer Agarwal, Kristin Branson, and Serge Belongie. Higher order learning with graphs. In Proceedings of the 23rd International Conference on Machine Learning, pages 17–24, 2006.
• [2] Francesca Arrigo, Desmond J Higham, and Francesco Tudisco. A framework for second-order eigenvector centralities and clustering coefficients. Proceedings of the Royal Society A, 476(2236):20190724, 2020.
• [3] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32:690–701, 2019.
• [4] Federico Battiston, Giulia Cencetti, Iacopo Iacopini, Vito Latora, Maxime Lucas, Alice Patania, Jean-Gabriel Young, and Giovanni Petri. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports, 874:1–92, 2020.
• [5] Austin R Benson, Rediet Abebe, Michael T Schaub, Ali Jadbabaie, and Jon Kleinberg. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221–E11230, 2018.
• [6] Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
• [7] Thomas Bühler and Matthias Hein. Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 81–88, 2009.
• [8] T-H Hubert Chan and Zhibin Liang. Generalizing the hypergraph laplacian via a diffusion process with mediators. Theoretical Computer Science, 806:416–428, 2020.
• [9] T-H Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties of hypergraph laplacian and approximation algorithms. Journal of the ACM, 65(3):1–48, 2018.
• [10] Alex Chin, Yatong Chen, Kristen M. Altenburger, and Johan Ugander. Decoupled smoothing on graphs. In The World Wide Web Conference, pages 263–272, 2019.
• [11] Yihe Dong, Will Sawin, and Yoshua Bengio. Hnhn: Hypergraph networks with hyperedge neurons. arXiv preprint arXiv:2006.12278, 2020.
• [12] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3558–3565, 2019.
• [13] David F Gleich and Michael W Mahoney. Using local spectral methods to robustify graph-based learning algorithms. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 359–368, 2015.
• [14] Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The total variation on hypergraphs - learning on hypergraphs revisited. In Advances in Neural Information Processing Systems, pages 2427–2435, 2013.
• [15] Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993, 2020.
• [16] Junteng Jia and Austin R Benson. A unifying generative model for graph learning algorithms: Label propagation, graph convolutions, and combinations. arXiv preprint arXiv:2101.07730, 2021.
• [17] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Ultra fine-grained image semantic embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 277–285, 2020.
• [18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
• [19] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. In International Conference on Learning Representations, 2018.
• [20] Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman. Algorithms for Lipschitz learning on graphs. In Conference on Learning Theory, pages 1190–1223, 2015.
• [21] Pan Li, Niao He, and Olgica Milenkovic. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research, 21(106):1–49, 2020.
• [22] Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. Advances in Neural Information Processing Systems, 2017:2309–2319, 2017.
• [23] Pan Li and Olgica Milenkovic. Submodular hypergraphs: p-Laplacians, Cheeger inequalities and spectral clustering. In International Conference on Machine Learning, pages 3014–3023. PMLR, 2018.
• [24] Meng Liu, Nate Veldt, Haoyu Song, Pan Li, and David F Gleich. Strongly local hypergraph diffusions for clustering and semi-supervised learning. arXiv preprint arXiv:2011.07752, 2020.
• [25] Anand Louis. Hypergraph Markov operators, eigenvalues and approximation algorithms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 713–722, 2015.
• [26] Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
• [27] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
• [28] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in Neural Information Processing Systems, pages 4696–4705, 2019.
• [29] Mark EJ Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, 2002.
• [30] Ryan Rossi and Nesreen Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
• [31] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
• [32] Balasubramaniam Srinivasan, Da Zheng, and George Karypis. Learning over families of sets–hypergraph representation learning for higher order tasks. arXiv preprint arXiv:2101.07773, 2021.
• [33] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
• [34] Leo Torres, Ann S. Blevins, Danielle S. Bassett, and Tina Eliassi-Rad. The why, how, and when of representations for complex systems. Technical report, arXiv:2006.02870v1, 2020.
• [35] Francesco Tudisco, Austin R Benson, and Konstantin Prokopchik. Nonlinear higher-order label spreading. In Proceedings of the Web Conference, 2021.
• [36] Francesco Tudisco and Matthias Hein. A nodal domain theorem and a higher-order Cheeger inequality for the graph p-Laplacian. Journal of Spectral Theory, 8(3):883–909, 2018.
• [37] Nate Veldt, Austin R Benson, and Jon Kleinberg. Minimizing localized ratio cut objectives in hypergraphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1708–1718, 2020.
• [38] Ezra Winston and J Zico Kolter. Monotone operator equilibrium networks. arXiv preprint arXiv:2006.08591, 2020.
• [39] Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. HyperGCN: A new method for training graph convolutional networks on hypergraphs. Advances in Neural Information Processing Systems, 32:1511–1522, 2019.
• [40] Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting learning on hypergraphs: confidence interval and subgradient method. In International Conference on Machine Learning, pages 4026–4034. PMLR, 2017.
• [41] Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004.
• [42] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems, pages 1601–1608, 2007.
• [43] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.