Recurrently Predicting Hypergraphs

06/26/2021 ∙ by David W. Zhang, et al. ∙ University of Amsterdam ∙ TNO

This work considers predicting the relational structure of a hypergraph for a given set of vertices, as common for applications in particle physics, biological systems and other complex combinatorial problems. A problem arises from the number of possible multi-way relationships, or hyperedges, scaling in 𝒪(2^n) for a set of n elements. Simply storing an indicator tensor for all relationships is already intractable for moderately sized n, prompting previous approaches to restrict the number of vertices a hyperedge connects. Instead, we propose a recurrent hypergraph neural network that predicts the incidence matrix by iteratively refining an initial guess of the solution. We leverage the property that most hypergraphs of interest are sparsely connected and reduce the memory requirement to 𝒪(nk), where k is the maximum number of positive edges, i.e., edges that actually exist. In order to counteract the linearly growing memory cost from training a lengthening sequence of refinement steps, we further propose an algorithm that applies backpropagation through time on randomly sampled subsequences. We empirically show that our method can match an increase in the intrinsic complexity without a performance decrease and demonstrate superior performance compared to state-of-the-art models.







1 Introduction

This work considers the task of inferring the relational structure for a given set of entities. In terms of a hypergraph, the set of entities can be understood as the vertices, while the relations are expressed by the (hyper)edges. A hypergraph generalizes the concept of ordinary graphs to include edges that connect more than two vertices. This setting encompasses a wide spectrum of problems, like vertex reconstruction in particle physics (Shlomi et al., 2020a; Serviansky et al., 2020), inferring higher-order interactions in biological and social systems (Brugere et al., 2018; Battiston et al., 2020) or combinatorial optimization problems, such as finding the convex hull or Delaunay triangulation (Vinyals et al., 2015; Serviansky et al., 2020). The number of possible edges grows in 𝒪(2^n) with the number of vertices n, further implying a super-exponential growth for the number of possible hypergraphs. Simply storing a single hypergraph in memory becomes challenging already for moderately sized n, as observed for Set2Graph by Serviansky et al. (2020), not to mention searching for the correct hypergraph. In this work we assume the hypergraphs of interest only contain a small fraction of all possible edges, as is the case for many relevant tasks. We show it is unnecessary to explicitly represent all non-existent edges and demonstrate the feasibility of solely supervising and predicting the positive edges. Combinatorial optimization problems introduce the additional challenge of computational complexity to the learning task. Optimal solutions commonly require loops or recurrence, in order to repeat an operation multiple times, dependent on the input size (Grötschel et al., 2012). A solution for finding the convex hull of a d-dimensional point set is optimal in 𝒪(n log n + n^⌊d/2⌋) time (Chazelle, 1993)

. Neural networks running on Turing machine equivalents are unable to circumvent the computational complexity intrinsic to a problem (Arora and Barak, 2009), albeit some operations may be parallelizable. Many known optimal solutions to computationally challenging problems are believed to be inherently sequential, i.e., they have no highly parallel solution (Greenlaw et al., 1995). We imitate the sequential approach towards solving complex problems by applying a recurrent hypergraph network that iteratively refines an initial guess of the solution. In order to address tasks that require large numbers of refinement steps, we propose a scalable alternative to backpropagation through time (BPTT) (Robinson and Fallside, 1987; Werbos, 1988; Mozer, 1989) that works within a constant memory budget and is more efficient than alternatives. An extensive ablation highlights the importance of recurrence for addressing intrinsic complexity. On common benchmarks we demonstrate our approach outperforms the previous state-of-the-art, while substantially improving asymptotic memory scaling. We illustrate our approach in Figure 1 and detail our proposal next.


Figure 1: Recurrently predicting hypergraphs. The input set and an initial guess on the edges are combined into a hypergraph. A recurrent network holistically refines the hypergraph into the prediction. During training, we skip some iterations in the backward pass to improve efficiency and scale to a large number of iterations at constant memory cost.

2 Method

Definition. An undirected hypergraph is a 3-tuple H = (V, E, I), where V is the set of vertices, E is the set of (hyper)edges and I is the incidence matrix. We represent vertices and edges as d-dimensional feature vectors. Each entry I_ij in the incidence matrix indicates if the edge e_i is incident to vertex v_j. Notably, in our definition, an edge is not defined as a subset of vertices. Instead, the relational structure of the hypergraph is solely expressed by I. We assume the number of edges k is small relative to the number of potential candidates, which is the case in many applications as we show in the experiments. During training, we leverage this property by setting k equal to the maximum number of edges for any training example and effectively reduce the memory requirement of I to a tractable 𝒪(nk). Furthermore, we relax the constraint on I to continuous values in [0, 1], in order to facilitate the usage of standard neural networks for modelling the incidence structure. The continuous incidence matrix instead indicates the probability that e_i and v_j are incident.

Initialization. V is initialized from a learned affine transformation of the set of input entities. Each element in E is sampled i.i.d. from an isotropic normal distribution with learned parameters. For I, we do not assume the existence of any connections at the beginning and correspondingly initialize it to 0. It is important that the elements in E are independently sampled from a distribution with sufficient variance. For any two edges in E with identical feature vectors, any permutation equivariant neural network will map both elements to the same output vector. In practice, this property can already become a problem when edges are sufficiently similar. We observe it is unnecessary to explicitly lower bound the variance for our chosen architecture. We denote the initial hypergraph as H^0 = (V^0, E^0, I^0).

Refinement step. The refinement function f performs a holistic update at each iteration t. We instantiate f as three different neural networks f_v, f_e, f_I, each responsible for updating the vertices, edges and the incidence matrix, respectively. Each entry I_ij^t in the incidence matrix is updated independently at each iteration t, based on the j-th vertex, the i-th edge and the incidence from the previous step I_ij^{t-1}:

    I_ij^t = f_I(v_j^{t-1}, e_i^{t-1}, I_ij^{t-1})    (1)
where we instantiate f_I as a multilayer perceptron in our experiments. The vertices and edges are further refined based on the updated relational structure:

    v_j^t = f_v([v_j^{t-1}, Σ_i I_ij^t e_i^{t-1}]),    e_i^t = f_e([e_i^{t-1}, Σ_j I_ij^t v_j^{t-1}])    (2)
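As a concrete illustration, one refinement step can be sketched in NumPy. The dimensions and the stand-in affine maps (W_I, W_v, W_e) below are our own illustrative placeholders for the learned MLP and DeepSets networks, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical): n vertices, k candidate edges, d features.
n, k, d = 5, 3, 4
V = rng.normal(size=(n, d))          # vertex features
E = rng.normal(size=(k, d))          # edge features, sampled i.i.d.
I = np.zeros((k, n))                 # incidence matrix, initialised to 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_I = rng.normal(size=(2 * d + 1,))  # stand-in for the MLP f_I
W_v = rng.normal(size=(2 * d, d))    # stand-in for f_v
W_e = rng.normal(size=(2 * d, d))    # stand-in for f_e

def refine(V, E, I):
    # Update every entry I_ij independently from (v_j, e_i, I_ij), Eq. (1).
    feats = np.concatenate(
        [np.broadcast_to(E[:, None, :], (k, n, d)),
         np.broadcast_to(V[None, :, :], (k, n, d)),
         I[:, :, None]], axis=-1)
    I_new = sigmoid(feats @ W_I)     # soft incidence in (0, 1)
    # Refine vertices/edges from incidence-weighted aggregates, Eq. (2),
    # with a skip connection on each update.
    V_new = V + np.concatenate([V, I_new.T @ E], axis=-1) @ W_v
    E_new = E + np.concatenate([E, I_new @ V], axis=-1) @ W_e
    return V_new, E_new, I_new

V1, E1, I1 = refine(V, E, I)
```

Layer normalization is omitted here for brevity; in the method it wraps the input and output of f_v and f_e.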
where [·, ·] denotes concatenation and f_v, f_e are permutation equivariant neural networks, which we instantiate as DeepSets (Zaheer et al., 2017) in the experiments. The aggregation functions sum over all edges and vertices, weighted by their respective incidence probability at time t. We add skip-connections to f_v and f_e, to facilitate training with large numbers of iterations (He et al., 2016). Additionally, we apply layer normalization (Ba et al., 2016) to both the input and output (following the skip connection) of f_v and f_e. Previous works observed both the skip-connections and layer normalization to help speed up the training convergence in the context of a recurrent neural network (Locatello et al., 2020). While the concatenation facilitates cross-communication between vertices and edges, the set neural network directly supplies context from the other vertices and edges, avoiding a detour through the incidence structure.

Variable number of edges. Different hypergraph instances may admit differing numbers of edges. We set k equal to the cardinality of the largest set of edges amongst all training examples and pad the target incidence matrix with additional rows of 0, resulting in an incidence matrix of shape k × n for all examples. An edge that is incident to no vertex, i.e., a row of zeros, can be interpreted as non-existent. Usually the values of the predicted incidence matrix will not be exactly 0 or 1, requiring an additional decision procedure to determine if an edge exists and which vertices it connects to. A simple approach is to decide the incidence structure locally for each entry in the incidence matrix. This approach ignores fixed structures that are inherent in many problems, e.g., each edge connects exactly d (or no) vertices in a convex hull of d-dimensional point sets. To improve the naive local version, while keeping the decision process simple, we add an existence indicator that is computed as a dot product between the edge features e_i

and a learned weight vector w, followed by a sigmoid function, i.e., z_i = σ(w^⊤ e_i). Using the existence indicator, we update the incidence matrix:

    Ĩ_ij^t = z_i · I_ij^t    (3)
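A minimal sketch of the existence-indicator update; the shapes and the stand-in weight vector w are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: k edges with d features, incidence I of shape (k, n).
k, n, d = 3, 5, 4
E = rng.normal(size=(k, d))
I = rng.random(size=(k, n))
w = rng.normal(size=(d,))                # learned weight vector (stand-in)

z = 1.0 / (1.0 + np.exp(-(E @ w)))       # existence indicator per edge
I_tilde = z[:, None] * I                 # scale each edge's incidence row
```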
which replaces the incidence information used for aggregation in f_v and f_e. With an explicit existence indicator we can simply choose the vertices with the highest incidence probability for each existing edge in the convex hull example. Iterative refinement. The prediction at test-time is attained after applying the refinement step, described above, T times. Our approach can be considered an instance of a recurrent neural network (RNN), as each refinement step uses the same set of parameters at each iteration t. Different from conventional RNNs, we structure the latent state of our network as a hypergraph and inject the same input at every time step. Intuitively, the recurrent architecture provides the neural network with multiple processing steps to consider the global context in the prediction of the local structures. In contrast to auto-regressive approaches, we consider all edges simultaneously. Thereby, we do not impose any artificial order on the edges and facilitate parallelization in the problem-solving procedure. BPTT with gradient skips. The common learning method for RNNs is backpropagation through time (BPTT) (Pearlmutter, 1995; Robinson and Fallside, 1987; Werbos, 1988; Mozer, 1989)

. Conceptually, BPTT treats the forward process as a feedforward neural network with weight-tied parameters and applies backpropagation to update the network parameters with respect to the gradients of a loss function. Applying BPTT on the entire sequence of T refinement steps is prohibitive for large T, due to memory costs growing in 𝒪(T). Instead, we approximate the complete gradient by applying BPTT on selected subsequences of length τ, with skips in between different subsequences. By limiting the length of the backward pass in this way, our memory usage remains constant over time. We select subsequences such that there are zero or more iterations in between any pair of subsequences. In the case when there are exactly zero iterations in between any consecutive subsequences, we recover truncated BPTT (Williams and Peng, 1990), which in its more general form also permits overlaps in the iterations. We liken our recurrent network to an iterative optimization algorithm that is learned by optimizing at all time steps t. Supervising all iterations is redundant when the hypergraphs are similar at different steps, motivating us to skip some iterations in the backward passes to reduce the training time.
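Choosing the supervised subsequences amounts to deciding how many refinement steps to skip before each backward pass. A plain-Python sketch of the two skip distributions considered in the method (even spacing and a random partition); the function name and signature are our own:

```python
import random

def sample_skips(total_skipped, num_updates, mode="random", rng=None):
    """Split `total_skipped` refinement steps into `num_updates` skip
    counts s_1..s_U (one count per gradient update), either evenly or
    via a random partition. Illustrative sketch, not the paper's code."""
    rng = rng or random.Random(0)
    if mode == "even":
        base, rem = divmod(total_skipped, num_updates)
        return [base + (1 if u < rem else 0) for u in range(num_updates)]
    # Random partition: draw U-1 cut points in [0, total_skipped] and
    # take the gaps between consecutive (sorted) cuts as skip counts.
    cuts = sorted(rng.randint(0, total_skipped)
                  for _ in range(num_updates - 1))
    bounds = [0] + cuts + [total_skipped]
    return [bounds[u + 1] - bounds[u] for u in range(num_updates)]
```

Both variants always account for the full skip budget, so the overall number of forward refinement calls stays fixed.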

H = initialize(X); s = sample()
for u in range(U):
        with no_grad():
                for _ in range(s[u]):        # skipped refinement steps
                        H = refine(H, X)
        L = 0
        for _ in range(tau):                 # supervised subsequence
                H = refine(H, X)
                L = L + loss(H, Y)
        L.backward(); gradient_step_and_reset()
Algorithm 1 BPTT with skips

Algorithm 1 describes the training procedure in pseudo-code, leaning on PyTorch (BSD) syntax (Paszke et al., 2019). For each mini-batch, we sample a list of integers s = (s_1, …, s_U) that determine the number of skipped steps, i.e., refinement calls that are not part of any backward pass. The total number of skipped steps is Σ_u s_u, where U denotes the number of gradient updates per mini-batch. We consider two alternative distributions for s: 1. Evenly distribute the skipped steps, or 2. Randomly partition the number of skipped steps into U parts (with s_u ≥ 0). This training scheme can be interpreted as adapting the refiner to work on different initial hypergraphs, with a random one at the beginning and arbitrarily well refined ones later on.

Training objective. During training, we minimize the element-wise binary cross entropy of the target and predicted incidence matrix. While the column (vertices) order is given by the input, the rows (edges) may be in arbitrary order due to the independent and identically distributed random initialization. We match the edges between the target and the prediction with Hungarian matching (Kuhn, 1955). We apply the loss on each intermediate incidence matrix to encourage iterative improvements.

Computational complexity. The space complexity of I is in 𝒪(nk), offering an efficient representation for hypergraphs when the maximal number of edges is relatively low. Problems that involve edges connecting many vertices benefit from this choice of representation, as the space requirement does not scale with the maximal connectivity of an edge. During training, BPTT further scales the memory requirement by a factor of T, possibly limiting the number of refinement steps. Our proposed training algorithm does not directly scale with T, offering a tractable solution for problems requiring many iterations. The loss computation is dominated by the Hungarian matching, requiring time in 𝒪(k³). This computational cost is further scaled by the factors U, τ and the mini-batch size. In practice, the impact from τ and the mini-batch size is dampened by applying the matching algorithm in parallel. In contrast to that, each gradient update step happens sequentially, motivating smaller choices for U, to reduce the training time. By randomly skipping refinement steps during training, we require fewer gradient update steps for fixed T and τ, resulting in faster training.
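The Hungarian-matched objective can be sketched with SciPy's assignment solver; variable names and shapes are illustrative, and in the method the loss is applied to every intermediate incidence matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_bce(pred, target, eps=1e-7):
    """Match predicted edge rows to target rows with the Hungarian
    algorithm, then average element-wise binary cross entropy over the
    matched pairs. Sketch of the training objective; names are ours."""
    p = np.clip(pred, eps, 1 - eps)
    # Pairwise BCE cost between every predicted and target row: (k, k).
    cost = -(target[None, :, :] * np.log(p[:, None, :])
             + (1 - target[None, :, :]) * np.log(1 - p[:, None, :])).sum(-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching, O(k^3)
    return cost[rows, cols].mean() / pred.shape[1]

# A permutation of the target rows should incur (near-)zero loss.
target = np.eye(4)
loss = matched_bce(target[[2, 0, 3, 1]], target)
```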

3 Experiments

Next, we evaluate our method on multiple set-to-hypergraph tasks, assessing its performance and examining the main design choices. For additional details, we refer to the appendix. Code is available at:

3.1 Does improving the asymptotic memory requirement degrade predictive performance?

The improved asymptotic memory cost of our method hinges on the feasibility of training with only (or mainly) positive edge examples, as opposed to explicitly supervising negative edges. To assess if the computational efficiency improvement comes at the cost of decreased test-time performance, we compare our method to prior state-of-the-art on three common benchmarks (Serviansky et al., 2020).

Particle partitioning. Particle colliders are an important tool for studying the fundamental particles of nature and their interactions. During a collision, several particles are emanated and measured by nearby detectors, while some particles decay beforehand. Identifying which measured particles share a common progenitor is an important subtask in the context of vertex reconstruction (Shlomi et al., 2020b). For more information, we refer to Shlomi et al. (2020a). We use the simulated dataset (CC-BY 4.0) (Serviansky et al., 2020; Shlomi et al., 2020b), with the same train/validation/test split. Each example is sampled from one of three different generation processes that differ in the typical number of partitions per set: bottom, charm and light. Each set of particles, also called a jet, is represented by per-particle feature vectors. The target partitionings indicate the common progenitors and restrict the valid incidence matrices to those with a single incident edge per vertex. We compare our method to Set2Graph (Serviansky et al., 2020), Set Transformer (Lee et al., 2019) and Slot Attention (Locatello et al., 2020). Set2Graph incurs a prohibitively large memory cost when predicting edges with high connectivity and instead resorts to only predicting edges with a bounded number of connecting vertices, followed by an additional heuristic to infer the partitions. Both Set Transformer and Slot Attention were not designed for set-to-hypergraph tasks, but we consider them as general set-to-set models. Building on the idea of solely predicting the set of positive edges, we construct two additional baselines that map from the set of vertices to the set of edges, followed by an incidence indicator network similar to Equation 1. We train our method with hyperparameters, including the maximum number of edges, as detailed in the appendix. Since each particle can only be part of a single partition, we choose the edge with the highest incidence probability at test time. All models additionally minimize a soft F1 score (Serviansky et al., 2020). In Table 1 we report the performances on each type of jet as the F1 score and Adjusted Rand Index (ARI). Our method outperforms all alternatives on bottom and charm jets, while being competitive on light jets. Coupled with the performance of the set-to-set based baselines, the results highlight that it is indeed feasible to train without explicitly supervising negative edges. Furthermore, the relative improvement of our approach compared to the set-to-set baselines hints at the benefits of a holistic refining approach.
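The test-time decision rule for partitioning reduces to an argmax per vertex. A minimal sketch, with a hypothetical soft incidence matrix of shape (edges, vertices):

```python
import numpy as np

def to_partition(I_soft):
    """Assign every vertex to its highest-probability edge, since each
    particle belongs to exactly one progenitor. Illustrative sketch."""
    return I_soft.argmax(axis=0)        # (k, n) incidence -> n labels

I_soft = np.array([[0.90, 0.10, 0.20],   # edge 0
                   [0.05, 0.80, 0.30],   # edge 1
                   [0.05, 0.10, 0.50]])  # edge 2
labels = to_partition(I_soft)            # vertex -> edge index
```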

[Table 1 values: F1 and ARI for Set2Graph, Set Transformer, Slot Attention and Ours on bottom, charm and light jets.]

Table 1: Particle partitioning results. On three jet types performance measured in F1 score and adjusted rand index (ARI). Our method outperforms the baselines on bottom and charm jets, while being competitive on light jets.

Convex hull. The convex hull of a finite set of d-dimensional points can be efficiently represented as the set of simplices that enclose all points. In the 3D case, each simplex consists of three points that together form a triangle. For the general d-dimensional case, the valid incidence matrices are limited to those with d incident vertices per edge. Finding the convex hull is an important and well understood task in computational geometry, with optimal exact solutions (Chazelle, 1993; Preparata and Shamos, 2012). Nonetheless, predicting the convex hull for a given set of points poses a challenging problem for current machine learning methods, especially when the number of points increases (Vinyals et al., 2015; Serviansky et al., 2020). Each point set is drawn from one of two different distributions: Gaussian and spherical. In the Gaussian case, each point is drawn i.i.d. from a standard normal distribution. In the spherical case we additionally normalize each point to lie on the unit sphere. In subsequent comparisons to prior work, we consider the same point distributions, set sizes and training/validation/test data sizes as in Serviansky et al. (2020). On convex hull finding in 3D, we compare our method to the same baselines as in the previous experiment. Set2Graph (Serviansky et al., 2020) learns to map the set of 3D points to the 3rd order adjacency tensor. Since storing this tensor in memory is not feasible, they instead concentrate on a local version of the problem, which only considers the nearest neighbors of each point (Serviansky et al., 2020). We train our method with k set equal to the highest number of triangles in the training data. At test time, a prediction admits an edge if its existence indicator exceeds a threshold. Each existing edge is incident to the three vertices with the highest incidence probability. In Table 2 we report the F1 score for both Gaussian and spherically distributed point sets at multiple set sizes. The results demonstrate that our method consistently outperforms all the baselines by a considerable margin. Interestingly, our method does not suffer from the same drastic performance decline as Set2Graph when the set size increases. The results for the largest set size with Gaussian distributed points show a stark drop in performance for Set2Graph. Both the set-to-set baselines and our approach do not exhibit this drop, indicating a much weaker sensitivity to the input set size and thus better generalization.
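The decoding step above can be sketched as follows; the 0.5 threshold on the existence indicator is our assumption for illustration:

```python
import numpy as np

def decode_triangles(I_soft, z, thresh=0.5):
    """Keep each edge whose existence indicator z_i exceeds `thresh`
    (threshold value is an assumption) and connect it to the three
    vertices with the highest incidence probability."""
    triangles = []
    for i in range(I_soft.shape[0]):
        if z[i] > thresh:
            top3 = np.argsort(I_soft[i])[-3:]   # 3 most likely vertices
            triangles.append(sorted(int(v) for v in top3))
    return triangles

# Toy example: 2 candidate edges over 4 points; only the first edge exists.
triangles = decode_triangles(
    np.array([[0.9, 0.8, 0.7, 0.1],
              [0.1, 0.2, 0.3, 0.4]]),
    z=np.array([0.9, 0.2]))
```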

[Table 2 values: F1 for Set2Graph, Set Transformer, Slot Attention and Ours on spherical and Gaussian point sets at multiple set sizes.]
Table 2: Convex hull results measured as F1 score. Our method outperforms all baselines considerably for all settings.

Delaunay triangulation. A Delaunay triangulation of a finite set of 2D points is a set of triangles, for which the circumcircles of all triangles are empty, i.e., have no point lying inside. When there exists more than one such set, Delaunay triangulation aims to maximize the minimum angle of all triangles. The problem of Delaunay triangulation is, similar to convex hull finding, a well studied problem in computational geometry and has exact solutions in 𝒪(n log n) (Rebay, 1993). We consider the same learning task and setup as Serviansky et al. (2020), who frame Delaunay triangulation as mapping from a set of 2D points to the set of Delaunay edges, represented by the adjacency matrix. The point sets are sampled uniformly from the unit square at two set sizes. The goal in this task is to predict the adjacency matrix of an ordinary graph — a graph consisting of edges that connect two vertices — where the number of edges is greater than the number of vertices. One could recover the adjacency matrix from the matrix product I^⊤I, by clipping all values above 1 back to 1 and setting the diagonal to 0. This detour through the incidence matrix is clearly unfavorable, as in this case the incidence matrix actually takes up more compute than the adjacency matrix. Instead of applying our method directly, we consider a simple adaptation of our approach to the graph setting. We replace the initial set of edges with the (smaller) set of vertices and apply the same vertex refinements (Equation 2) on both sets. This change effectively reduces the incidence matrix to an adjacency matrix, since it is computed based on all pairwise combinations of vertices. We further replace the concatenation in f_I (Equation 1) with a sum, to ensure that the predicted adjacency matrix is symmetric and represents an undirected graph.

Two of the main design choices of our approach remain in this adaptation: iterative refining of the complete graph with a recurrent neural network and BPTT with gradient skips. At test-time, an edge between two vertices exists if the adjacency value exceeds a threshold. In Table 3, we report the accuracy, precision, recall and F1 score for Set2Graph (Serviansky et al., 2020) and our adapted method. Our method outperforms Set2Graph on all metrics. More importantly, our method does not see a large discrepancy in performance between set sizes. We attribute this to the recurrence of our iterative refinement scheme, which we study in more depth subsequently.
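The adjacency-recovery detour mentioned above takes only a few lines; a sketch with a toy incidence matrix:

```python
import numpy as np

def incidence_to_adjacency(I):
    """Recover the symmetric adjacency matrix from an incidence matrix I
    of shape (k, n): clip entries of I^T I above 1 back to 1 and zero
    the diagonal. Sketch of the detour discussed above."""
    A = np.minimum(I.T @ I, 1.0)     # two vertices adjacent iff they share an edge
    np.fill_diagonal(A, 0.0)         # no self-loops
    return A

# Two edges {0,1} and {1,2} over three vertices.
A = incidence_to_adjacency(np.array([[1.0, 1.0, 0.0],
                                     [0.0, 1.0, 1.0]]))
```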

[Table 3 values: accuracy (Acc), precision (Pre), recall (Rec) and F1 for Set2Graph and Ours at both set sizes.]
Table 3: Delaunay triangulation results for different . Our method outperforms Set2Graph on all metrics.

3.2 What is the role of recurrence in addressing the intrinsic complexity of a problem?

The intrinsic complexity of finding a convex hull for a d-dimensional set of n points is in Θ(n log n + n^⌊d/2⌋) (Chazelle, 1993). This scaling behavior offers an interesting opportunity to study the effects of increasing (time) complexity on model performance. The intrinsic time complexity of the convex hull problem signifies that any solution scales super-linearly with the input set size. Since our learned model is not an exact solution, the implications of the complexity become less clear. In order to assess the relevance of the problem's complexity for our approach, we examine the relation between the number of refining steps and increases in the intrinsic resource requirement. The following experiments are all performed with standard BPTT, in order to not introduce additional hyperparameters that may affect the conclusions.


Figure 2: Performance for increasing complexity. Increasing the iterations counteracts the performance decline from larger set sizes.

First we examine the performance of our approach with a fixed number of iterations, trained on increasing set sizes. In Figure 2 we observe a drop in performance with increasing set sizes when training with the same number of iterations. The observed (negative) correlation between the set size and the performance confirms a relationship between the computational complexity and the difficulty of the learning task. Next, we examine the performance for varying numbers of iterations and set sizes. We refer to the setting with the smallest number of iterations and set size as the base case. All other set sizes and numbers of iterations are chosen such that the performance matches the base case as closely as possible. In Figure 2, we observe that the required number of iterations increases with the input set size. The results highlight that an increase in the number of iterations suffices to counteract the performance drop. The number of refinement steps scales sub-linearly with the set size, different from the intrinsic time complexity. This highlights the property that our approach does not operate incrementally, but instead parallelizes the edge finding process. Notably, increasing the number of iterations does not introduce any additional parameters, as opposed to increasing the depth or width. The two experiments suggest that the intrinsic complexity of a problem manifests similarly for solutions containing errors as it does for exact solutions. An analogy can be made with an exact algorithm that incrementally constructs the convex hull, but is allocated insufficient computing time, in which case the result will have (more) errors. By replacing BPTT with truncated BPTT or BPTT with skips, our approach offers a scalable solution for settings with a fixed memory budget.

3.3 How does recurrence compare to stacking layers?

This section examines the role of recurrence in our approach, by comparing to a baseline that stacks multiple refinement layers, each with a separate set of parameters. We compare models trained for varying numbers of refinement steps. Both versions are trained by backpropagating through the entire sequence of refinement steps. For the stacked version, we only compute the loss at the final refinement step, as we observed that supervising all intermediate steps leads to worse performance. In Figure 2(a), we show that both cases benefit from an increasing number of refinement steps. Beyond a certain depth, adding more iterations only slightly improves the performance of the stacked model, while the recurrent version still benefits, leading to an absolute difference in F1 score at 9 iterations. The results indicate that recurrence constitutes a useful inductive bias on an intrinsically complex problem, especially when supervised over multiple steps. Next we train a recurrent and a stacked model with the same number of iterations, until both achieve a similar validation performance, by stopping training early on the recurrent model. The results in Figure 2(b) show that the recurrent variant performs better when tested at larger set sizes than trained on, indicating an improved generalization ability.




Figure 3: Recurrent vs. stacked. (a) Performance for different numbers of iterations. (b) Extrapolation performance at larger set sizes for models trained on smaller sets. We stop training the recurrent model early, to match the validation performance of the stacked model. The recurrent model derives greater benefits from adding iterations and generalizes better.

3.4 Can we actually learn tasks that encompass higher-order edges?

The particle partitioning experiment exemplifies a case where a single edge can connect many vertices. Set2Graph (Serviansky et al., 2020) demonstrates that in this specific case it is possible to approximate the hypergraph with a graph. They leverage the fact that any vertex is incident to exactly one edge and apply a post-processing step that constructs the edges from noisy cliques. Instead, we consider a task for which no straightforward graph-based approximation exists. Specifically, we consider convex hull finding in higher-dimensional space for standard normal distributed points. We train with the setup detailed in the appendix. The test performance reaches a high F1 score, clearly demonstrating that the model managed to learn the task. This result demonstrates that the improved scaling behavior can be leveraged for tasks that are computationally out of reach for adjacency-based approaches.

3.5 Can we skip gradients without degrading the performance?


Figure 4: Training time of BPTT with skips. Relative training time and performance for different training schemes. All runs require the same memory, except standard BPTT, which requires more.

In order to assess if skipping gradients comes at the cost of decreased test-time performance, we compare our approach to truncated BPTT (Williams and Peng, 1990). We constrict both truncated BPTT and BPTT with skips to a fixed memory budget, by limiting any backward pass to the most recent iterations. We consider two variants of our training algorithm: 1. Skipping iterations at fixed time steps and 2. Skipping randomly sampled time steps. In both the fixed and random skips versions, we skip half of the total iterations. We train all models on convex hull finding for spherically distributed points. In addition, we include baselines trained with standard BPTT that contingently inform us about performance degradation incurred by truncating or skipping gradients. Standard BPTT increases the memory footprint linearly with the number of iterations T, inevitably exceeding the available memory at some threshold. We deliberately choose a small set size in order to accommodate training with standard BPTT for a large number of iterations. The results in Figure 4 demonstrate that skipping half of the iterations during backpropagation significantly decreases the training time without incurring any test-time performance degradation. This suggests that supervising all iterations, like in truncated BPTT, is highly redundant and skipping iterations in the weight update phase constitutes an effective approach to reducing said redundancy. In general, we observe that sampling random subsequences to apply BPTT on is more stable than picking fixed time steps. Both truncated BPTT and BPTT with skips greatly outperform standard BPTT when the memory budget is constricted. The results on standard BPTT appear to indicate that performance worsens when the number of iterations increases further. We observe that applying BPTT on many iterations leads to increasing gradient norms in the course of training, complicating the training process. The memory-limited versions did not exhibit a similar behavior, evident from the improved performance when increasing the iterations.

4 Related Work

Adjacency prediction. Set2Graph (Serviansky et al., 2020) is a family of maximally expressive permutation equivariant neural networks that map from an input set to (hyper)graphs. The authors extend the general idea of applying a scalar-valued adjacency indicator function on all pairs of vertices (Kipf and Welling, 2016) to the l-edge case (Serviansky et al., 2020), i.e., edges that connect l vertices. In Set2Graph, for each l the adjacency structure is modeled by an l-tensor, requiring memory in 𝒪(n^l). This becomes intractable already for small l and moderate set sizes. By pruning the negative edges, our approach scales in 𝒪(nk), making it applicable even when edges connect many vertices. Set prediction. Recent works on set prediction map a learned initial set (Zhang et al., 2019; Lee et al., 2019) or a randomly initialized set (Locatello et al., 2020; Carion et al., 2020; Zhang et al., 2021) to the target space. Out of these, the closest one to our hypergraph refining approach is Slot Attention (Locatello et al., 2020), which recurrently applies the Sinkhorn operator (Adams and Zemel, 2011) in order to associate each element in the input set with a single slot (hyperedge). None of the prior works on set prediction consider the set-to-hypergraph task, but some can be naturally extended by mapping the input set to the set of positive edges, an approach similar to ours. Non-temporal recurrent networks. Although recurrence in the context of neural networks is commonly used to handle time-varying inputs (Elman, 1990) and outputs (Graves et al., 2006), many have applied recurrent networks in non-temporal settings (Hopfield, 1982; Locatello et al., 2020). Deep implicit models (Ghaoui et al., 2019; Bai et al., 2019, 2020; Gu et al., 2020) recurrently apply a nonlinear transformation on the latent state until reaching a fixed point (Pineda, 1987; Simard et al., 1988).
These models avoid memory costs that grow with the number of recurrent steps by analytically deriving the gradient at the fixed point via the implicit function theorem (Krantz and Parks, 2012). This comes at the cost of restricting the model class to those that reliably converge to a fixed point (Bai et al., 2019). While some works enforce the existence of a unique attracting fixed point (Winston and Kolter, 2020; Gu et al., 2020), this property is undesirable in our case: any permutation-equivariant neural network with a unique attracting fixed point will collapse its hidden state to a multiset of identical elements, making this approach inapplicable to set-to-hypergraph prediction without forgoing permutation equivariance. Our approach offers an alternative that retains the constant memory scaling but does not impose any restrictions on the RNN.

Hypergraph neural networks. Feng et al. (2019) propose an extension of graph convolutions (Defferrard et al., 2016; Kipf and Welling, 2017) to hypergraphs and apply it to classification tasks. Bai et al. (2021) apply the attention mechanism to induce incidence matrices, over which they deploy a hypergraph convolution operator. Our recurrent hypergraph neural network holistically refines the whole hypergraph at each iteration, updating vertex and edge features as well as the incidence structure. Furthermore, the hypergraph convolution (Bai et al., 2021) does not benefit from increased depth, a limitation commonly observed for deep graph neural networks (Oono and Suzuki, 2019; Li et al., 2018; Nt and Maehara, 2019), where additional layers lead to over-smoothing. In our experiments, we have demonstrated that our recurrent network greatly benefits from additional depth.
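The collapse argument can be illustrated with a toy example: repeatedly applying a permutation-equivariant contraction that mixes each element with the set mean drives every element to the same value (a minimal sketch, not any model from the cited works):

```python
# Why a unique attracting fixed point is undesirable for set outputs:
# iterating a permutation-equivariant contraction collapses the multiset.

def step(xs):
    m = sum(xs) / len(xs)                   # permutation-invariant pooling
    return [0.5 * x + 0.5 * m for x in xs]  # equivariant contraction

xs = [0.0, 1.0, 4.0]
for _ in range(60):
    xs = step(xs)

# Each deviation from the mean halves per step, so all elements converge
# to the initial mean 5/3: the hidden state becomes a uniform multiset.
print(xs)
```

A hypergraph decoder reading such a collapsed state could only emit identical edges, which motivates our unrestricted recurrent refinement instead.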

5 Limitations and Future Work

The scope of our work is limited to undirected hypergraphs; extending the approach to directed hypergraphs constitutes an interesting future direction. Directed hypergraphs can represent targeted relations in applications like scene graph generation (Chang et al., 2021). We identify the Hungarian matching as the main computational bottleneck during training. Replacing the Hungarian-matched loss with a faster alternative, like a learned energy function (Zhang et al., 2021), would greatly speed up training for tasks with large . In application areas such as social networks, interactions are typically modeled by graphs with pair-wise edges (Zhou et al., 2020). Extending these to higher-order edges may facilitate new applications and opportunities, possibly raising complex ethical and societal questions.

6 Conclusion

We have introduced a new hypergraph RNN, accompanied by a recurrent training algorithm for set-to-hypergraph prediction. By representing and supervising the set of positive edges, we substantially improve the asymptotic scaling and enable learning tasks with higher-order edges. On common benchmarks, we have demonstrated that our method outperforms previous works, while offering a more favorable asymptotic scaling behavior. In further evaluations, we have highlighted the importance of recurrence for addressing the intrinsic complexity of problems.

This work is part of the research programme Perspectief EDL with project number P16-25 project 3, which is financed by the Dutch Research Council (NWO) domain Applied and Engineering Sciences (TTW).


  • R. P. Adams and R. S. Zemel (2011) Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925. Cited by: §4.
  • S. Arora and B. Barak (2009) Computational complexity: a modern approach. Cambridge University Press. Cited by: §1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.
  • S. Bai, J. Z. Kolter, and V. Koltun (2019) Deep equilibrium models. Advances in Neural Information Processing Systems, pp. 690–701. Cited by: §4.
  • S. Bai, V. Koltun, and J. Z. Kolter (2020) Multiscale deep equilibrium models. In Advances in Neural Information Processing Systems, Cited by: §4.
  • S. Bai, F. Zhang, and P. H. Torr (2021) Hypergraph convolution and hypergraph attention. Pattern Recognition 110, pp. 107637. Cited by: §4.
  • F. Battiston, G. Cencetti, I. Iacopini, V. Latora, M. Lucas, A. Patania, J. Young, and G. Petri (2020) Networks beyond pairwise interactions: structure and dynamics. Physics Reports. Cited by: §1.
  • I. Brugere, B. Gallagher, and T. Y. Berger-Wolf (2018) Network structure inference, a survey: motivations, methods, and applications. ACM Computing Surveys (CSUR) 51 (2), pp. 1–39. Cited by: §1.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §4.
  • X. Chang, P. Ren, P. Xu, Z. Li, X. Chen, and A. Hauptmann (2021) Scene graphs: a survey of generations and applications. arXiv preprint arXiv:2104.01111. Cited by: §5.
  • B. Chazelle (1993) An optimal convex hull algorithm in any fixed dimension. Discrete & Computational Geometry 10 (4), pp. 377–409. Cited by: §1, §3.1, §3.2.
  • M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, Cited by: §4.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211. Cited by: §4.
  • Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao (2019) Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3558–3565. Cited by: §4.
  • L. E. Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Y. Tsai (2019) Implicit deep learning. arXiv preprint arXiv:1908.06315. Cited by: §4.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pp. 369–376. Cited by: §4.
  • R. Greenlaw, H. J. Hoover, W. L. Ruzzo, et al. (1995) Limits to parallel computation: p-completeness theory. Oxford University Press on Demand. Cited by: §1.
  • M. Grötschel, L. Lovász, and A. Schrijver (2012) Geometric algorithms and combinatorial optimization. Vol. 2, Springer Science & Business Media. Cited by: §1.
  • F. Gu, H. Chang, W. Zhu, S. Sojoudi, and L. E. Ghaoui (2020) Implicit graph neural networks. In Advances in Neural Information Processing Systems, pp. 11984–11995. Cited by: §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2.
  • J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), pp. 2554–2558. Cited by: §4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
  • T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §4.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §4.
  • S. G. Krantz and H. R. Parks (2012) The implicit function theorem: history, theory, and applications. Springer Science & Business Media. Cited by: §4.
  • H. W. Kuhn (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97. Cited by: §2.
  • J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh (2019) Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. Cited by: §3.1, §4.
  • Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §4.
  • F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf (2020) Object-centric learning with slot attention. In Advances in Neural Information Processing Systems, pp. 11525–11538. Cited by: Appendix A, §2, §3.1, §4.
  • M. C. Mozer (1989) A focused back-propagation algorithm for temporal pattern recognition. Complex systems 3 (4), pp. 349–381. Cited by: §1, §2.
  • H. Nt and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §4.
  • K. Oono and T. Suzuki (2019) Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947. Cited by: §4.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §2.
  • B. A. Pearlmutter (1995) Gradient calculations for dynamic recurrent neural networks: a survey. IEEE Transactions on Neural networks 6 (5), pp. 1212–1228. Cited by: §2.
  • F. Pineda (1987) Generalization of back propagation to recurrent and higher order neural networks. In Advances in Neural Information Processing Systems, pp. 602–611. Cited by: §4.
  • F. P. Preparata and M. I. Shamos (2012) Computational geometry: an introduction. Springer Science & Business Media. Cited by: §3.1.
  • S. Rebay (1993) Efficient unstructured mesh generation by means of delaunay triangulation and bowyer-watson algorithm. Journal of Computational Physics 106 (1), pp. 125–138. Cited by: §3.1.
  • A. Robinson and F. Fallside (1987) The utility driven dynamic error propagation network. University of Cambridge Department of Engineering Cambridge, MA. Cited by: §1, §2.
  • H. Serviansky, N. Segol, J. Shlomi, K. Cranmer, E. Gross, H. Maron, and Y. Lipman (2020) Set2graph: learning graphs from sets. In Advances in Neural Information Processing Systems, Cited by: Appendix A, Appendix A, Appendix A, §1, §3.1, §3.1, §3.1, §3.4, §4.
  • J. Shlomi, P. Battaglia, and J. Vlimant (2020a) Graph neural networks in particle physics. Machine Learning: Science and Technology 2 (2), pp. 021001. Cited by: §1, §3.1.
  • J. Shlomi, S. Ganguly, E. Gross, K. Cranmer, Y. Lipman, H. Serviansky, H. Maron, and N. Segol (2020b) Secondary vertex finding in jets with neural networks. arXiv preprint arXiv:2008.02831. Cited by: §3.1.
  • P. Y. Simard, M. B. Ottaway, and D. H. Ballard (1988) Fixed point analysis for recurrent networks.. In Advances in Neural Information Processing Systems, pp. 149–159. Cited by: §4.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1.
  • P. J. Werbos (1988) Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1 (4), pp. 339–356. Cited by: §1, §2.
  • R. J. Williams and J. Peng (1990) An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural computation 2 (4), pp. 490–501. Cited by: Appendix A, §2, §3.5.
  • E. Winston and J. Z. Kolter (2020) Monotone operator equilibrium networks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 10718–10728. Cited by: §4.
  • M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems, Cited by: Appendix A, §2.
  • D. W. Zhang, G. J. Burghouts, and C. G. M. Snoek (2021) Set prediction without imposing structure as conditional density estimation. In International Conference on Learning Representations, Cited by: §4, §5.
  • Y. Zhang, J. Hare, and A. Prügel-Bennett (2019) Deep set prediction networks. In Advances in Neural Information Processing Systems, Cited by: §4.
  • J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2020) Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81. Cited by: §5.

Appendix A Experiments details

In this section we provide further details on the experimental setup and additional results.

Particle partitioning

Both and are instantiated as a -layer equivariant DeepSets [Zaheer et al., 2017]:
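The rendered equation was lost in extraction; as a stand-in, here is a minimal sketch of one permutation-equivariant DeepSets layer of the standard form f(X)_i = σ(X_i Λ + mean_j(X_j) Γ) [Zaheer et al., 2017]. The weight matrices `Lam` and `Gam` and the ReLU nonlinearity are illustrative assumptions, not the paper's exact parameterization:

```python
# One equivariant DeepSets layer: a per-element linear map plus a
# pooled (mean) term, followed by a nonlinearity. Applying the same
# map to every element makes the layer permutation equivariant.

def equivariant_layer(X, Lam, Gam):
    """X: list of d-dim vectors; Lam, Gam: hypothetical d x d weights."""
    d = len(X[0])
    pooled = [sum(x[j] for x in X) / len(X) for j in range(d)]  # mean pool

    def matvec(W, v):
        return [sum(W[i][j] * v[j] for j in range(d)) for i in range(d)]

    out = []
    for x in X:
        z = [a + b for a, b in zip(matvec(Lam, x), matvec(Gam, pooled))]
        out.append([max(0.0, v) for v in z])  # ReLU
    return out
```

Permuting the input rows permutes the output rows identically, since the pooled term is invariant to the ordering.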


The is instantiated as a -layer fully-connected neural network:


We simplify the hyperparameter search by choosing the same number for the vertices , edges and all output dimensions of the affine maps in (5) and (6). In all runs dedicated to searching , we set the number of total iterations and backpropagate through all iterations. We start with and double it until an increase yields no substantial performance gains on the validation set, resulting in . We train all models using the Adam optimizer [Kingma and Ba, 2014] with a learning rate of for epochs and retain the parameters corresponding to the lowest validation loss. We apply the same to both the Slot Attention and Set Transformer baselines. Similar to the original version [Locatello et al., 2020], we train Slot Attention with iterations. Attempts with more than iterations resulted in frequent divergences of the training loss. We attribute this behavior to the recurrent Sinkhorn operation, which acts as a contraction map, forcing all slots to the same vector in the limit. Our model has trainable parameters, similar to for the Slot Attention baseline, but fewer than for Set Transformer and for Set2Graph [Serviansky et al., 2020]. The total training time is less than 12 hours on a single GTX 1080 Ti and 10 CPU cores. For completeness, we also report the results for the rand index (RI) in Table 4.
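The Hungarian matching identified above as the training bottleneck pairs each predicted edge with a target edge before the loss is accumulated. A minimal sketch of such a matched loss, assuming `scipy.optimize.linear_sum_assignment` and a squared-error cost (not the paper's exact loss), looks as follows:

```python
# Hungarian-matched loss: find the one-to-one assignment between
# predictions and targets that minimizes total cost, then average the
# matched costs. Supervision becomes invariant to output ordering.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_matched_loss(pred, target):
    """pred, target: (k, d) arrays of edge representations."""
    # Pairwise squared-error cost between every prediction and target.
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # optimal matching, O(k^3)
    return cost[rows, cols].mean()

pred = np.array([[0.0, 1.0], [1.0, 0.0]])
target = np.array([[1.0, 0.0], [0.0, 1.0]])
print(hungarian_matched_loss(pred, target))  # 0.0: matched despite ordering
```

The cubic cost of the assignment step is why Section 5 suggests replacing this matching for tasks with many edges.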

bottom jets charm jets light jets
Model RI RI RI
Set Transformer
Slot Attention
Table 4: Additional particle partitioning results. On three jet types performance measured as rand index (RI). Our method outperforms the baselines on bottom and charm jets, while being competitive on light jets.
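For reference, the rand index reported in Table 4 measures the fraction of vertex pairs on which two partitions agree (both together or both apart). A stdlib-only sketch:

```python
# Rand index between two partitions, given as per-vertex cluster labels:
# count pairs where both labelings agree on "same cluster" vs "different".
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Note the index only depends on the partition structure, not on the particular cluster labels.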

Convex hull finding

We apply the same hyperparameters, architectures and optimizer as in the particle partitioning experiment, except for and , as we mention in Section 3. This notably differs from Set2Graph, which reports an increased parameter count of [Serviansky et al., 2020]. We train our method until we observe no improvement in the F1 validation performance for epochs, with a maximum of epochs. The set-to-set baselines are trained for epochs, and we retain the parameters resulting in the highest F1 score on the validation set. The total training time is between 14 and 50 hours on a single GTX 1080 Ti and 10 CPU cores.

Delaunay triangulation

For Delaunay triangulation we adapt the hypergraph refiner to the graph prediction case. We replace with and apply weight sharing in the incidence MLP to reflect the symmetry of undirected graphs. We apply the same neural networks for both and as in the previous experiments, but increase the latent dimensions to , resulting in trainable parameters. This notably differs from Set2Graph, which increases the parameter count to [Serviansky et al., 2020], an order of magnitude larger. The total training time, for the results in Table 2, is less than 9 hours on a single GTX 1080 Ti and 10 CPU cores.

BPTT with skips

In general, truncated BPTT allows for overlaps between subsequent BPTT applications, as we illustrate in Figure 5. We compare BPTT with skips to truncated BPTT [Williams and Peng, 1990] with every iterations, which is the setting that is most similar to ours with regard to training time. The BPTT results serve as a reference point to answer the question: "What if we apply BPTT more frequently, resulting in a better approximation to the true gradients?", without necessitating a grid search over all possible hyperparameter combinations for truncated BPTT.
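The effect of truncating backpropagation can be shown on a scalar linear recurrence, where the gradient is computable by hand. This stdlib-only sketch (an illustration of the truncation idea, not our model) accumulates dL/dw only over the last `trunc` steps, so `trunc == T` recovers full BPTT:

```python
# Truncated BPTT on h_t = w * h_{t-1} + x_t with loss L = h_T.
# The backward pass walks at most `trunc` steps into the past.

def truncated_bptt_grad(w, xs, trunc):
    # Forward pass: record all hidden states, starting from h_0 = 0.
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    T = len(xs)
    # Backward pass: dL/dh_T = 1; stop after `trunc` steps.
    grad, dh = 0.0, 1.0
    for t in range(T, max(T - trunc, 0), -1):
        grad += dh * hs[t - 1]  # local contribution of w at step t
        dh *= w                 # propagate through h_t = w * h_{t-1} + x_t
    return grad

# With w = 0.5 and xs = [1, 2, 3]: L = w^2 + 2w + 3, so dL/dw = 2w + 2 = 3.
print(truncated_bptt_grad(0.5, [1.0, 2.0, 3.0], trunc=3))  # full: 3.0
print(truncated_bptt_grad(0.5, [1.0, 2.0, 3.0], trunc=1))  # truncated: 2.5
```

BPTT with skips applies this backward pass only on randomly sampled subsequences, trading gradient bias for a training cost independent of the total number of refinement steps.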

Figure 5: Adding gradient skips to BPTT (a) Standard BPTT (b) Truncated BPTT, applying BPTT on iterations every nd iteration (c) BPTT with skips at iterations , which effectively reduces the training time, while retaining the same number of refinement steps.