 # Set2Graph: Learning Graphs From Sets

Many problems in machine learning (ML) can be cast as learning functions from sets to graphs, or more generally to hypergraphs; in short, Set2Graph functions. Examples include clustering, learning vertex and edge features on graphs, and learning triplet data in a collection. Current neural network models that approximate Set2Graph functions come from two main ML sub-fields: equivariant learning, and similarity learning. Equivariant models would be in general computationally challenging or even infeasible, while similarity learning models can be shown to have limited expressive power. In this paper we suggest a neural network model family for learning Set2Graph functions that is both practical and of maximal expressive power (universal), that is, can approximate arbitrary continuous Set2Graph functions over compact sets. Testing our models on different machine learning tasks, including an application to particle physics, we find them favorable to existing baselines.


## 1 Introduction

We consider the problem of learning functions mapping sets of vectors in $\mathbb{R}^{d_{\text{in}}}$ to graphs, or more generally hypergraphs; we name this problem Set2Graph, or set-to-graph. Set-to-graph functions appear in machine-learning applications such as clustering, predicting features on edges and nodes in graphs, and learning $k$-edge information in sets.

Mathematically, we represent each set-to-graph function as a collection of set-to-$k$-edge functions, where each set-to-$k$-edge function learns features on $k$-edges. That is, given an input set $X = \{x_1, \ldots, x_n\}$ we consider functions $\mathsf{F}^k$ attaching feature vectors to $k$-edges: each $k$-tuple $(x_{i_1}, \ldots, x_{i_k})$ is assigned an output vector $\mathsf{F}^k(X)_{i_1 \cdots i_k,:}$. Functions mapping sets to hypergraphs with hyperedges of size up to $K$ are then modeled by $(\mathsf{F}^1, \mathsf{F}^2, \ldots, \mathsf{F}^K)$. For example, functions mapping sets to standard graphs are represented by $(\mathsf{F}^1, \mathsf{F}^2)$; see Figure 1.

Figure 1: Set-to-graph functions are represented as collections of set-to-k-edge functions.

Set-to-graph functions are well-defined if they satisfy a property called equivariance (defined later), and therefore the set-to-graph problem is an instance of the larger class of equivariant learning problems. One option for learning an equivariant set-to-graph model is to use an out-of-the-box full equivariant model as in (Maron et al., 2019b). By full we mean that each linear layer is chosen from the space of all linear equivariant layers. Learning would require equivariant layers mapping $1$-st order tensors (representing sets) to $k$-th order tensors (representing $k$-edge hypergraphs). Such models are computationally infeasible: (i) they possess a large number of parameters (combinatorial in $k$); and (ii) they require storing $k$-th order tensors in memory.

Aside from the practical problem of using equivariant models for learning set-to-graph functions, there is a theoretical question of expressive power, or universality, that is, the ability of the models to approximate any continuous equivariant function. In the equivariant learning literature, set-to-set models (Zaheer et al., 2017; Qi et al., 2017) were recently proven equivariant universal (Keriven and Peyré, 2019; Segol and Lipman, 2020; Sannai et al., 2019). In contrast, the situation for graph-to-graph equivariant models is more intricate: some models, such as message passing (a.k.a. graph convolutional networks), are known to be non-universal (Xu et al., 2019; Morris et al., 2018; Maron et al., 2019a; Chen et al., 2019); high-order equivariant models are known to be universal (Maron et al., 2019c) but, as discussed above, are not practical. Universality of equivariant set-to-graph models is not known, as far as we are aware.

Another machine-learning approach for learning set-to-graph functions is similarity learning (Bromley et al., 1994; Chopra et al., 2005; Simo-Serra et al., 2015; Zagoruyko and Komodakis, 2015; Bell and Bala, 2015; Ahmed et al., 2015; Vo and Hays, 2016). This is a simpler approach where a siamese network is used to embed each set element independently and pairwise information is extracted from pairs of embeddings. Although this approach does not suffer from the complexity issues of equivariant networks, it has limited expressive power.

In this paper we introduce a model for the set-to-graph problem that is both practical (i.e., small number of parameters and no need to build high-order tensors in memory) and provably universal. We achieve that with models defined as a composition of three networks: $\mathsf{F}^k = \psi \circ \beta \circ \phi$, where $\phi$ is a set-to-set model, $\beta$ is a non-learnable broadcasting set-to-graph layer, and $\psi$ is a simple graph-to-graph network using only a single Multi-Layer Perceptron (MLP) acting on each $k$-edge feature vector independently.

We have tested our model on four different applications: (i) Set-to-$2$-edges: partitioning (clustering) of simulated particles generated in the Large Hadron Collider (LHC); (ii) Set-to-$2$-edges: predicting Delaunay edges in planar point clouds; (iii) Set-to-graph: improving graph neural networks by augmenting them with a set-to-graph model; and (iv) Set-to-$3$-edges: finding triplets of points on the convex hull of a volumetric point cloud. We show that in all applications our model achieves performance superior to the baselines.

## 2 Previous work

##### Equivariant learning.

In many learning setups the task we would like to learn is invariant or equivariant to certain transformations of the input. Canonical examples are image classification tasks (LeCun et al., 1998; Krizhevsky et al., 2012), which are often assumed to be invariant to translations of the image, and set classification tasks (Zaheer et al., 2017; Qi et al., 2017), which are typically invariant to the specific order of the elements in the set. Restricting models to be invariant or equivariant to these transformations was shown to be an excellent approach for reducing the number of parameters of models while improving generalization. This paradigm for designing deep models was used for many different tasks, data modalities and transformations, e.g., set learning (Zaheer et al., 2017; Qi et al., 2017), graph learning (Kipf and Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Xu et al., 2019; Kondor et al., 2018; Maron et al., 2019b, a), learning images with rotational and reflectional symmetries (Cohen and Welling, 2016b, a; Dieleman et al., 2016; Worrall et al., 2017), learning functions on spheres (Cohen et al., 2018; Esteves et al., 2017) and learning general 3D data (Weiler et al., 2018; Worrall and Brostow, 2018; Weiler et al., 2018). Apart from designing invariant and equivariant networks, there has been keen interest in the analysis of such models (Ravanbakhsh et al., 2017; Kondor et al., 2018), especially the analysis of their approximation power (Zaheer et al., 2017; Qi et al., 2017; Maron et al., 2019c; Keriven and Peyré, 2019; Segol and Lipman, 2020). Most related to this work are the work of (Maron et al., 2019b), which characterized equivariant layers between hypergraph data, including the set-to-graph setup; (Maron et al., 2019c), which proved a universal approximation property for invariant networks for general permutation groups; and the work of (Keriven and Peyré, 2019), which provides a proof that equivariant neural networks are universal. They construct an equivariant network with a single hidden layer that may contain tensors of unbounded degree. In contrast, we construct equivariant networks that only involve tensors of order $1$ and $k$ and are universal.

##### Learning to Cluster.

Deep clustering is a large field (Aljalbout et al., 2018) and we restrict our attention to the methods that are most related to ours. The work that tackles the problem most similar to this paper is that of (Jiang and Verma, 2019), which suggests a method for meta clustering that is based on LSTMs and therefore depends on the order of the set elements. In contrast, our method is blind (equivariant) to the chosen order of the input sets. In another related work, (Hsu et al., 2017) suggest performing transfer learning between tasks and domains by learning how to cluster, where the main idea is to learn a similarity function between set elements according to their labels. This similarity function is learned using a loss that promotes cluster assignments of points with the same label. The main difference from our work is that we assume that all the data can be seen at the same time and aim at universal models approximating set-to-graph functions.

We discuss the differences between our method and equivariant and similarity learning (including learning to cluster) methods in more detail at the end of Section 4.

## 3 Learning hypergraphs from sets

We would like to learn functions mapping sets of $n$ vectors in $\mathbb{R}^{d_{\text{in}}}$ to hypergraphs with $n$ nodes (think of the nodes as corresponding to the set elements), and arbitrary $k$-edge feature vectors in $\mathbb{R}^{d_{\text{out}}}$, where a $k$-edge is defined as a $k$-tuple of set elements. A function mapping sets of vectors to $k$-edges is called a set-to-$k$-edge function and denoted $\mathsf{F}^k : \mathbb{R}^{n \times d_{\text{in}}} \to \mathbb{R}^{n^k \times d_{\text{out}}}$. Consequently, a set-to-hypergraph function is modeled as a sequence $(\mathsf{F}^1, \mathsf{F}^2, \ldots, \mathsf{F}^K)$, for target hypergraphs with hyperedges of maximal size $K$. For example, $\mathsf{F}^2$ learns pairwise relations in a set, and $(\mathsf{F}^1, \mathsf{F}^2)$ is a function from sets to graphs; see Figure 1.

##### Our goal

is to design equivariant neural network models for $\mathsf{F}^k$ that are as efficient as possible in terms of number of parameters and memory usage, but at the same time have maximal expressive power, i.e., are universal.

##### Representing sets and k-edges.

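A matrix $X = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^{n \times d_{\text{in}}}$ represents a set of $n$ vectors $x_i \in \mathbb{R}^{d_{\text{in}}}$ and therefore should be considered up to re-ordering of its rows. We denote by $S_n$ the symmetric group, that is, the group of bijections (permutations) $\sigma : [n] \to [n]$, where $[n] = \{1, \ldots, n\}$. We denote by $\sigma \cdot X$ the matrix resulting from reordering the rows of $X$ by the permutation $\sigma$, i.e., $(\sigma \cdot X)_{i,:} = X_{\sigma^{-1}(i),:}$. In this notation, $X$ and $\sigma \cdot X$ represent the same set, for all permutations $\sigma$.

$k$-edges are represented as a tensor $\mathsf{Y} \in \mathbb{R}^{n^k \times d_{\text{out}}}$, where $\mathsf{Y}_{\mathbf{i},:}$ denotes the feature vector attached to the $k$-edge defined by the $k$-tuple $(x_{i_1}, \ldots, x_{i_k})$, where $\mathbf{i} = (i_1, \ldots, i_k) \in [n]^k$ is a multi-index with non-repeating indices. Similarly to the set case, $k$-edges are considered up to renumbering of the nodes by some permutation $\sigma \in S_n$. That is, if we define the action $\sigma \cdot \mathsf{Y}$ by $(\sigma \cdot \mathsf{Y})_{\mathbf{i},:} = \mathsf{Y}_{\sigma^{-1}(\mathbf{i}),:}$, where $\sigma^{-1}(\mathbf{i}) = (\sigma^{-1}(i_1), \ldots, \sigma^{-1}(i_k))$, then $\mathsf{Y}$ and $\sigma \cdot \mathsf{Y}$ represent the same $k$-edge data, for all $\sigma \in S_n$.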

##### Equivariance.

For $\mathsf{F}^k$ to represent a well-defined map between sets and $k$-edge data it should be equivariant to permutations, namely satisfy

$$\mathsf{F}^k(\sigma \cdot X) = \sigma \cdot \mathsf{F}^k(X), \tag{1}$$

for all sets $X$ and permutations $\sigma \in S_n$. Equivariance guarantees, in particular, that the two equivalent sets $X$ and $\sigma \cdot X$ are mapped to equivalent $k$-edge data tensors $\mathsf{F}^k(X)$ and $\sigma \cdot \mathsf{F}^k(X)$.
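The equivariance condition in equation 1 can be checked numerically on a toy example. The sketch below is only an illustration: the helper names (`permute_set`, `permute_edges`, `pairwise_distance`) are hypothetical, the set-to-2-edge function is a hand-picked one (pairwise Euclidean distance) rather than a learned model, and indices are 0-based.

```python
import itertools
import math

def inverse(sigma):
    """Return sigma^{-1} for a permutation given as a tuple (sigma maps i -> sigma[i])."""
    inv = [0] * len(sigma)
    for i, s in enumerate(sigma):
        inv[s] = i
    return inv

def permute_set(sigma, X):
    """Action on sets: (sigma . X)_i = X_{sigma^{-1}(i)}."""
    inv = inverse(sigma)
    return [X[inv[i]] for i in range(len(X))]

def permute_edges(sigma, Y):
    """Action on 2-edge data: (sigma . Y)_{ij} = Y_{sigma^{-1}(i), sigma^{-1}(j)}."""
    inv = inverse(sigma)
    n = len(Y)
    return [[Y[inv[i]][inv[j]] for j in range(n)] for i in range(n)]

def pairwise_distance(X):
    """Toy set-to-2-edge function: Euclidean distance for every pair of elements."""
    return [[math.dist(a, b) for b in X] for a in X]

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
for sigma in itertools.permutations(range(len(X))):
    lhs = pairwise_distance(permute_set(sigma, X))      # F(sigma . X)
    rhs = permute_edges(sigma, pairwise_distance(X))    # sigma . F(X)
    assert lhs == rhs
```

Any function that fails this check for some permutation would assign different edge data to two representations of the same set.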

##### Set-to-k-edge models.

In this paper we explore the following neural network model family for approximating $\mathsf{F}^k$:

$$\mathsf{F}^k(X; \theta) = \psi \circ \beta \circ \phi(X), \tag{2}$$

where $\phi$, $\beta$, and $\psi$ will be defined soon. For $\mathsf{F}^k$ to be equivariant (as required in equation 1) it is sufficient to require that its constituents, namely $\phi$, $\beta$, $\psi$, are equivariant. That is, $\phi$, $\beta$, $\psi$ all satisfy equation 1.

##### Set-to-graphs models.

Given the model of set-to-$k$-edge functions, a model for a set-to-graph function can now be constructed from a pair of set-to-$k$-edge networks $(\mathsf{F}^1, \mathsf{F}^2)$. Similarly, a set-to-hypergraph function would require $(\mathsf{F}^1, \ldots, \mathsf{F}^K)$, where $K$ is the maximal hyperedge size. Figure 1 shows an illustration of set-to-$k$-edge and set-to-graph functions.

##### ϕ component.

$\phi$ is a set-to-set equivariant model, that is, $\phi$ maps sets of vectors in $\mathbb{R}^{d_{\text{in}}}$ to sets of vectors in $\mathbb{R}^{d_1}$. To achieve the universality goal we will need $\phi$ to be universal as a set-to-set model; that is, $\phi$ can approximate arbitrary continuous set-to-set functions. Several options exist (Keriven and Peyré, 2019; Sannai et al., 2019), although probably the simplest option is either DeepSets (Zaheer et al., 2017) or one of its variations; all were recently proven to be universal in (Segol and Lipman, 2020).

In practice, as will be clear later from the proof of the universality of the model, when building a set-to-graph or set-to-hypergraph model, the $\phi$ (set-to-set) part of the $k$-edge networks can be shared between different set-to-$k$-edge models, $\mathsf{F}^k$, without compromising universality.

##### β component.

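$\beta$ is a non-learnable linear broadcasting layer mapping sets to $k$-edges. In theory, as shown in (Maron et al., 2019b), the space of equivariant linear mappings from sets to $k$-edge tensors is of dimension $\mathrm{bell}(k+1)$ (not counting features), which can be very high since Bell numbers grow exponentially. Interestingly, in the set-to-$k$-edge case one can achieve universality with only $k$ linear operators. We define the broadcasting operator to be

$$\beta(X)_{\mathbf{i},:} = [x_{i_1}, x_{i_2}, \ldots, x_{i_k}], \tag{3}$$

where $\mathbf{i} = (i_1, \ldots, i_k)$ and brackets denote concatenation in the feature dimension, that is, for $A \in \mathbb{R}^{n^k \times d_a}$ and $B \in \mathbb{R}^{n^k \times d_b}$, their concatenation is $[A, B] \in \mathbb{R}^{n^k \times (d_a + d_b)}$. Therefore, the feature output dimension of $\beta$ is $k$ times the feature dimension of its input.

As an example, consider the graph case, where $k = 2$. In this case $\beta(X)_{i_1, i_2, :} = [x_{i_1}, x_{i_2}]$. This function is illustrated in Figure 2, broadcasting data in $\mathbb{R}^{n \times d}$ to a tensor in $\mathbb{R}^{n \times n \times 2d}$.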
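A minimal sketch of the broadcasting operator of equation 3 follows. It is an illustration only: the function name `broadcast` and the dictionary-based storage (keyed by 0-based multi-index) are choices made here for readability, and all multi-indices in $[n]^k$, including repeated ones, are enumerated.

```python
import itertools

def broadcast(X, k):
    """beta(X)_{i,:} = [x_{i_1}, ..., x_{i_k}] for every multi-index i in [n]^k.

    Returns a dict keyed by the multi-index (a k-tuple). A dense tensor holding
    the same data would have shape n^k x (k * d).
    """
    n = len(X)
    out = {}
    for idx in itertools.product(range(n), repeat=k):
        feat = []
        for i in idx:
            feat.extend(X[i])   # concatenate along the feature dimension
        out[idx] = feat
    return out

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # n = 3 elements, d = 2 features
B = broadcast(X, k=2)
print(len(B))        # number of 2-tuples: n^k = 9
print(B[(0, 2)])     # features of element 0 followed by features of element 2
```

No parameters are involved: the layer only copies and concatenates, which is why its output feature dimension is $k$ times the input one.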

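To see that the broadcasting layer is equivariant, it is enough to consider a single feature $\beta(X)_{\mathbf{i}} = x_{i_1}$. Permuting the rows of $X$ by a permutation $\sigma$ we get $\beta(\sigma \cdot X)_{\mathbf{i}} = x_{\sigma^{-1}(i_1)} = \beta(X)_{\sigma^{-1}(\mathbf{i})} = (\sigma \cdot \beta(X))_{\mathbf{i}}$.

Figure 2: The model architecture for the Set-to-graph and set-to-2-edge functions.

##### ψ component.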

$\psi$ is a mapping of $k$-tensors to $k$-tensors. Here the theory of equivariant operators indicates that the space of linear equivariant maps is of dimension $\mathrm{bell}(2k)$, which suggests a huge number of model parameters even for a single linear layer. Surprisingly, universality can be achieved with much less: in fact, a single linear operator (i.e., scaled identity) in each layer, which in the multi-feature multi-layer case boils down to applying a Multi-Layer Perceptron $m$ to each feature vector in the input tensor $\mathsf{X}$. That is, we use

$$\psi(\mathsf{X})_{\mathbf{i},:} = m(\mathsf{X}_{\mathbf{i},:}). \tag{4}$$

Figure 2 illustrates set-to-$k$-edges and set-to-graph models incorporating the three components discussed above.
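Putting the three components together, the composition $\psi \circ \beta \circ \phi$ of equation 2 can be sketched as follows. This is a toy sketch under stated assumptions, not the authors' implementation: `phi` is a single hand-written DeepSets-style layer (each element concatenated with the set mean), `mlp` is a one-hidden-layer network with arbitrary fixed weights, and all names and numbers are hypothetical.

```python
import itertools
import math

def phi(X):
    """Stand-in set-to-set layer: x_i -> [x_i, mean of the set]."""
    n, d = len(X), len(X[0])
    mean = [sum(x[j] for x in X) / n for j in range(d)]
    return [list(x) + mean for x in X]

def beta(H, k):
    """Broadcast: concatenate the features of the k elements of each multi-index."""
    n = len(H)
    return {idx: [v for i in idx for v in H[i]]
            for idx in itertools.product(range(n), repeat=k)}

def mlp(v, W1, b1, w2, b2):
    """One-hidden-layer MLP with tanh, applied to a single edge feature vector."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, v)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def set_to_k_edge(X, k, params):
    """F^k = psi o beta o phi, where psi applies the same MLP to every k-edge."""
    return {idx: mlp(feat, *params) for idx, feat in beta(phi(X), k).items()}

# Arbitrary fixed weights: 4 input features (k=2 times 2 features from phi), 3 hidden units.
params = (
    [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1], [0.3, 0.3, -0.5, 0.2]],
    [0.0, 0.1, -0.1],
    [1.0, -1.0, 0.5],
    0.2,
)
X = [[1.0], [2.0], [4.0]]
Y = set_to_k_edge(X, 2, params)

# Equivariance check for one permutation: F(sigma . X)_{ij} = F(X)_{sigma^{-1}(i), sigma^{-1}(j)}.
sigma = (2, 0, 1)
inv = [sigma.index(i) for i in range(3)]
Y_perm = set_to_k_edge([X[inv[i]] for i in range(3)], 2, params)
assert all(math.isclose(Y_perm[(i, j)], Y[(inv[i], inv[j])])
           for i in range(3) for j in range(3))
```

Note that the only learnable parts are `phi` and the per-edge MLP; no $n^k$-sized parameter tensor appears anywhere.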

## 4 Universality of set-to-graph models.

In this section we prove that the model $\mathsf{F}^k$ introduced above is universal, in the sense that it can approximate arbitrary continuous equivariant set-to-$k$-edge functions over compact domains $K \subset \mathbb{R}^{n \times d_{\text{in}}}$.

###### Theorem 1.

The model $\mathsf{F}^k$ is set-to-$k$-edge universal.

A corollary of Theorem 1 establishes universality of the general set-to-hypergraph model:

###### Theorem 2.

The model $(\mathsf{F}^1, \ldots, \mathsf{F}^K)$ is set-to-hypergraph universal.

Our main tool for proving Theorem 1 is a characterization of the equivariant set-to-$k$-edge polynomials $\mathsf{P}^k$. This characterization can be seen as a generalization of the characterization of set-to-set equivariant polynomials that recently appeared in (Segol and Lipman, 2020).

We consider an arbitrary continuous set-to-$k$-edge mapping $\mathsf{G}^k(X)$ over a compact set $K \subset \mathbb{R}^{n \times d_{\text{in}}}$. Since $\mathsf{G}^k$ is equivariant we can assume $K$ is symmetric, i.e., $\sigma \cdot K = K$ for all $\sigma \in S_n$. The proof consists of three parts: (i) Characterization of the equivariant set-to-$k$-edge polynomials $\mathsf{P}^k$. (ii) Showing that every equivariant continuous set-to-$k$-edge function $\mathsf{G}^k$ can be approximated by some $\mathsf{P}^k$. (iii) Showing that every $\mathsf{P}^k$ can be approximated by our model $\mathsf{F}^k$.

Before providing the full proof, which contains some technical derivations, let us provide a simpler universality proof (under some mild conditions) for the set-to-$2$-edge model, $\mathsf{F}^2$, based on the Singular Value Decomposition (SVD).

### 4.1 A simple proof for universality of $\mathsf{F}^2$

It is enough to consider the $d_{\text{out}} = 1$ case; the general case is implied by applying the argument for each output feature dimension independently. Let $\mathsf{G}^2 : K \to \mathbb{R}^{n \times n}$ be an arbitrary continuous equivariant set-to-$2$-edge function. We want to approximate $\mathsf{G}^2$ with our model $\mathsf{F}^2$. First, note that without loss of generality we can assume $\mathsf{G}^2(X)$ has a simple spectrum (i.e., its eigenvalues are all different) for all $X \in K$. Indeed, if this is not the case we can always choose $\lambda > 0$ sufficiently large and add $\lambda$ times a fixed diagonal matrix with distinct diagonal entries to $\mathsf{G}^2$. This diagonal addition does not change the $2$-edge values assigned by $\mathsf{G}^2$, and it guarantees a simple spectrum using standard Hermitian matrix eigenvalue perturbation theory (see e.g., (Stewart, 1990), Section IV:4).

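Now let $G(X) = U(X) \Sigma(X) V(X)^T$ be the SVD of the resulting matrix $G(X)$, where $U(X), V(X) \in \mathbb{R}^{n \times n}$ hold the singular vectors as columns and $\Sigma(X)$ is the diagonal matrix of singular values. Since $G(X)$ has a simple spectrum, $U, \Sigma, V$ are all continuous in $X$; $\Sigma$ is unique, and $U, V$ are unique up to a sign flip of the singular vectors (i.e., columns of $U, V$) (O’Neil, 2005). Let us first assume that the singular vectors can be chosen uniquely, including their sign; later we show how to achieve this with an additional mild assumption.

Now, uniqueness of the SVD together with the equivariance of $G$ imply that $U, V$ are continuous set-to-set equivariant functions and $\Sigma$ is a continuous set invariant function:

$$(\sigma \cdot U(X)) \, \Sigma(X) \, (\sigma \cdot V(X))^T = \sigma \cdot G(X) = G(\sigma \cdot X) = U(\sigma \cdot X) \, \Sigma(\sigma \cdot X) \, V(\sigma \cdot X)^T.$$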

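Lastly, since $\phi$ is set-to-set universal there is a choice of its parameters so that it approximates arbitrarily well the equivariant set-to-set function that assigns to the $i$-th element the $i$-th rows of $U(X)$ and $V(X)$ together with the singular values. The $\psi$ component can be chosen by noting that $G(X)_{i_1 i_2} = \sum_{j=1}^{n} \sigma_j U_{i_1 j} V_{i_2 j}$, where $\sigma_j$ are the singular values, is a cubic polynomial $p$ in the broadcast features. To conclude, pick $m$ to approximate $p$ sufficiently well so that $\psi \circ \beta \circ \phi$ approximates $G$ to the desired accuracy.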

To achieve uniqueness of the singular vectors up-to a sign we can add, e.g., the following assumption: for all singular vectors and . Using this assumption we can always pick , in the SVD so that , , for all .

We now move to the general proof.

### 4.2 Equivariant set-to-k-edge polynomials

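We need some more notation. Given a vector $x \in \mathbb{R}^d$ and a multi-index $\alpha \in \mathbb{N}_0^d$, we set $x^{\alpha} = \prod_{j=1}^{d} x_j^{\alpha_j}$; $|\alpha| = \sum_{j=1}^{d} \alpha_j$; and define accordingly $X^{\alpha} = (x_1^{\alpha}, \ldots, x_n^{\alpha})^T$. Given two tensors $\mathsf{A} \in \mathbb{R}^{n^{k_1}}$, $\mathsf{B} \in \mathbb{R}^{n^{k_2}}$, we use the notation $\mathsf{A} \otimes \mathsf{B} \in \mathbb{R}^{n^{k_1 + k_2}}$ to denote the tensor product, defined by $(\mathsf{A} \otimes \mathsf{B})_{\mathbf{i}_1, \mathbf{i}_2} = \mathsf{A}_{\mathbf{i}_1} \mathsf{B}_{\mathbf{i}_2}$, where $\mathbf{i}_1, \mathbf{i}_2$ are suitable multi-indices. Lastly, we denote by $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$ a vector of multi-indices, and $X^{\boldsymbol{\alpha}} = X^{\alpha_1} \otimes \cdots \otimes X^{\alpha_k}$.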

###### Theorem 3.

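An equivariant set-to-$k$-edge polynomial $\mathsf{P}^k$ can be written as

$$\mathsf{P}^k(X) = \sum_{\boldsymbol{\alpha}} X^{\boldsymbol{\alpha}} \otimes q_{\boldsymbol{\alpha}}(X), \tag{5}$$

where $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_k)$, $X^{\boldsymbol{\alpha}} = X^{\alpha_1} \otimes \cdots \otimes X^{\alpha_k}$, and $q_{\boldsymbol{\alpha}}$ are $S_n$ invariant polynomials.

As an example, consider the graph case, where $k = 2$. Equivariant set-to-$2$-edge polynomials take the form:

$$\mathsf{P}^k(X) = \sum_{\alpha_1, \alpha_2} X^{\alpha_1} \otimes X^{\alpha_2} \otimes q_{\alpha_1, \alpha_2}(X), \tag{6}$$

and coordinate-wise

$$\mathsf{P}^k_{ijl}(X) = \sum_{\alpha_1, \alpha_2} x_i^{\alpha_1} x_j^{\alpha_2} \, q_{\alpha_1, \alpha_2, l}(X). \tag{7}$$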
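To make the coordinate-wise form in equation 7 concrete, the sketch below evaluates a single term of the sum for scalar set elements and checks its equivariance exhaustively. The exponents and the invariant polynomial `q` (a sum of squares) are arbitrary choices made for this illustration.

```python
import itertools

def q(X):
    """An invariant polynomial: the sum of squares of all (scalar) set elements."""
    return sum(x * x for x in X)

def P(X, a1=2, a2=1):
    """One term of equation 7 with scalar features: P_ij = x_i^a1 * x_j^a2 * q(X)."""
    return [[(xi ** a1) * (xj ** a2) * q(X) for xj in X] for xi in X]

X = [1, 2, 3]
Y = P(X)
for sigma in itertools.permutations(range(3)):
    inv = [sigma.index(i) for i in range(3)]          # sigma^{-1}
    X_perm = [X[inv[i]] for i in range(3)]            # sigma . X
    expected = [[Y[inv[i]][inv[j]] for j in range(3)] for i in range(3)]  # sigma . P(X)
    assert P(X_perm) == expected
```

The index-dependent factors $x_i^{\alpha_1} x_j^{\alpha_2}$ move with the permutation, while the invariant factor $q(X)$ is unchanged, which is exactly why the product is equivariant.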

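The general proof idea is to consider an arbitrary equivariant set-to-$k$-edge polynomial $\mathsf{P}^k$ and use its equivariance property to show that it has the form in equation 5. This is done by looking at a particular output entry $\mathsf{P}^k_{\mathbf{i}_0}$, where say $\mathbf{i}_0 = (1, 2, \ldots, k)$. Then the proof considers two subsets of permutations: First, the subgroup of all permutations that fix the numbers $1, 2, \ldots, k$, i.e., $\sigma(1) = 1, \ldots, \sigma(k) = k$, but permute everything else freely; this subgroup is denoted $\mathrm{stab}(k)$. Second, permutations $\sigma$ satisfying $\sigma^{-1}(\mathbf{i}_0) = \mathbf{i}$, where $\mathbf{i} \in [n]^k$ is a multi-index with non-repeating indices. Each of these permutation subsets reveals a different part of the structure of the equivariant polynomial and its relation to invariant polynomials.

As before, it is enough to prove Theorem 3 for $d_{\text{out}} = 1$. Let $\mathbf{i}_0 = (1, 2, \ldots, k)$ and consider any permutation $\sigma \in \mathrm{stab}(k)$. Then from equivariance of $\mathsf{P}^k$ we have

$$\mathsf{P}^k_{\mathbf{i}_0}(X) = \mathsf{P}^k_{\sigma^{-1}(\mathbf{i}_0)}(X) = \mathsf{P}^k_{\mathbf{i}_0}(\sigma \cdot X),$$

and $\sigma \cdot X = (x_1, \ldots, x_k, x_{\sigma^{-1}(k+1)}, \ldots, x_{\sigma^{-1}(n)})^T$. That is, $\mathsf{P}^k_{\mathbf{i}_0}$ is invariant to permuting its last $n - k$ elements $x_{k+1}, \ldots, x_n$; we say that $\mathsf{P}^k_{\mathbf{i}_0}$ is $\mathrm{stab}(k)$ invariant. We next prove that $\mathrm{stab}(k)$ invariance can be written using a combination of $S_n$ invariant polynomials and tensor products of $X^{\alpha_j}$: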

###### Lemma 1.

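Let $p$ be a $\mathrm{stab}(k)$ invariant polynomial. That is, $p$ is invariant to permuting the last $n - k$ terms $x_{k+1}, \ldots, x_n$. Then

$$p(X) = \sum_{\boldsymbol{\alpha}} x_1^{\alpha_1} \cdots x_k^{\alpha_k} \, q_{\boldsymbol{\alpha}}(X), \tag{8}$$

where $q_{\boldsymbol{\alpha}}$ are $S_n$ invariant polynomials.

We prove this lemma in the supplementary material. So now we know that $\mathsf{P}^k_{\mathbf{i}_0}$ has the form of equation 8. On the other hand, let $\mathbf{i} = (i_1, \ldots, i_k)$ be an arbitrary multi-index with non-repeating indices and consider a permutation $\sigma$ with $\sigma^{-1}(\mathbf{i}_0) = \mathbf{i}$. Again by permutation equivariance of $\mathsf{P}^k$ we have

$$\mathsf{P}^k_{i_1 i_2 \cdots i_k}(X) = \mathsf{P}^k_{\sigma^{-1}(\mathbf{i}_0)}(X) = \mathsf{P}^k_{\mathbf{i}_0}(\sigma \cdot X) = \sum_{\boldsymbol{\alpha}} x_{i_1}^{\alpha_1} \cdots x_{i_k}^{\alpha_k} \, q_{\boldsymbol{\alpha}}(X),$$

which is a coordinate-wise form of equation 5 with $d_{\text{out}} = 1$.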

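##### Approximating $\mathsf{G}^k$ with a polynomial $\mathsf{P}^k$.

We denote for an arbitrary tensor $\mathsf{A}$ its infinity norm by $\|\mathsf{A}\|_{\infty} = \max_{\mathbf{i}} |\mathsf{A}_{\mathbf{i}}|$.

###### Lemma 2.

Let $\mathsf{G}^k$ be a continuous equivariant function over a symmetric domain $K$. For an arbitrary $\epsilon > 0$, there exists an equivariant polynomial $\mathsf{P}^k$ so that

$$\max_{X \in K} \left\| \mathsf{G}^k(X) - \mathsf{P}^k(X) \right\|_{\infty} < \epsilon.$$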

This is a standard lemma, similar to (Yarotsky, 2018; Maron et al., 2019c; Segol and Lipman, 2020); for completeness we provide a proof in the supplementary.

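##### Approximating $\mathsf{P}^k$ with a network $\mathsf{F}^k$.

The final component of the proof of Theorem 1 is showing that an equivariant polynomial $\mathsf{P}^k$ can be approximated over $K$ using a network of the form in equation 2. The key is to use the characterization of Theorem 3 and write $\mathsf{P}^k$ in a form similar to our model in equation 2:

$$\mathsf{P}^k_{\mathbf{i},:}(X) = p(\beta(H(X))_{\mathbf{i},:}), \tag{9}$$

where $H$ is defined by $H(X)_{i,:} = [x_i, q(X)]$, where $q(X) = [q_{\boldsymbol{\alpha}_1}(X), q_{\boldsymbol{\alpha}_2}(X), \ldots]$, and $\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2, \ldots$ are all the multi-indices participating in the sum in equation 5. Note that

$$\beta(H(X))_{\mathbf{i},:} = [x_{i_1}, q(X), x_{i_2}, q(X), \ldots, x_{i_k}, q(X)].$$

Therefore, $p$ is chosen as the polynomial

$$p : [x_1, y, x_2, y, \ldots, x_k, y] \mapsto \sum_{\boldsymbol{\alpha}} x_1^{\alpha_1} \cdots x_k^{\alpha_k} \, y_{\boldsymbol{\alpha}},$$

where $x_1, \ldots, x_k$ stand for the broadcast set elements, and $y = (y_{\boldsymbol{\alpha}})$ stands for the values of the invariant polynomials $q_{\boldsymbol{\alpha}}(X)$.

In view of equation 9 all we have left is to choose $\phi$ and $\psi$ (i.e., $m$) to approximate $H$, $p$ (resp.) to a desired accuracy. We detail the rest of the proof in the supplementary.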

##### Universality of the set-to-hypergraph model.

Theorem 2 follows from Theorem 1 by considering a set-to-hypergraph continuous function as a collection of set-to-$k$-edge functions and approximating each one using our model $\mathsf{F}^k$. Note that universality still holds if all $\mathsf{F}^k$ share the $\phi$ part of the network (assuming sufficient width ).

Note that a single set-to-$k$-edge model $\mathsf{F}^k$ (in equation 2) is not universal when approximating set-to-hypergraph functions:

###### Proposition 1.

The set-to-$2$-edge model, $\mathsf{F}^2$, cannot approximate general set-to-graph functions.

The proof is in the supplementary; it shows that even a constant function that outputs one value for 1-edges (nodes) and a different value for 2-edges cannot be approximated by a set-to-$2$-edge model $\mathsf{F}^2$.
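One way to build intuition for Proposition 1 (an informal sketch, not necessarily the argument given in the supplementary) is to look at a set whose elements coincide. Any equivariant set-to-set map then produces identical rows, so the broadcast features of a node entry $(i, i)$ and of an edge entry $(i, j)$ are identical, and no per-edge MLP can assign them different values. The `phi` below is a hypothetical DeepSets-style stand-in.

```python
def phi(X):
    """Stand-in equivariant set-to-set map on scalars: x_i -> (x_i, mean of the set)."""
    mean = sum(X) / len(X)
    return [(x, mean) for x in X]

def beta2(H):
    """Broadcast for k = 2: concatenate the features of elements i and j."""
    n = len(H)
    return {(i, j): H[i] + H[j] for i in range(n) for j in range(n)}

X = [0.5, 0.5, 0.5]            # a set whose elements all coincide
B = beta2(phi(X))

# Diagonal (node) and off-diagonal (edge) entries carry exactly the same features,
# so psi, which applies one MLP per entry, must output the same value for both.
assert B[(0, 0)] == B[(0, 1)]
assert len(set(B.values())) == 1
```

Using separate networks for nodes and for edges, as in the pair $(\mathsf{F}^1, \mathsf{F}^2)$, removes this coupling.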

##### Relation to similarity learning

Previous models suggested for learning pairwise relations in sets were mostly siamese-type models, in which each element is embedded independently and pairwise information is extracted from pairs of embeddings (Hsu et al., 2017; Bromley et al., 1994; Simo-Serra et al., 2015; Zagoruyko and Komodakis, 2015; Bell and Bala, 2015; Ahmed et al., 2015; Vo and Hays, 2016). This model is similar to the model suggested in this paper for the $k = 2$ case but is not universal for two main reasons: (i) The same MLP is used both for $1$-edge (nodes or self-loops) predictions and $2$-edge predictions; Proposition 1 implies that the model is not universal in this case; (ii) The model used in the role of $\phi$ is an element-wise MLP, which is not set-to-set universal (Segol and Lipman, 2020).

##### Relation to hypergraph equivariant networks.

Our model’s constituents are all built from certain equivariant linear layers and entry-wise non-linearities. Therefore, our model is an instance of the general equivariant hypergraph networks framework (Maron et al., 2019b). The benefit of our suggested model compared to the general equivariant model is that it is much more efficient in terms of computational complexity and memory footprint. In particular, it uses only one basis function for $\psi$ (scaled identity, without counting features) in contrast to $\mathrm{bell}(2k)$ in the full equivariant model, and can be used without constructing $k$-order tensors explicitly in memory. Even though the model is lean in terms of number of parameters (i.e., uses fewer basis functions) it is proven to be universal (i.e., with maximal expressive power). As far as we are aware, this fact was not known before, even for the full equivariant models when approximating set-to-hypergraph functions and restricting the tensor order in the network.
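To give a feel for the gap in basis size, the sketch below computes Bell numbers with the Bell triangle. It assumes the counts from the equivariant-layer characterization of (Maron et al., 2019b): $\mathrm{bell}(k+1)$ basis elements for a linear map from sets to $k$-edge tensors and $\mathrm{bell}(2k)$ for a map between $k$-edge tensors, per feature pair. The function name `bell` is chosen here for illustration.

```python
def bell(m):
    """Bell number B_m (number of partitions of an m-element set), via the Bell triangle."""
    row = [1]
    for _ in range(m):
        new_row = [row[-1]]              # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

for k in (2, 3, 4):
    # Full equivariant basis sizes versus the lean model (k broadcasts, 1 scaled identity).
    print(k, bell(k + 1), bell(2 * k))
```

For $k = 2$ this gives 5 broadcasting basis elements, matching the five operations used by the S2G+ variant, against the 2 used in equation 3.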

## 5 Applications

We have tested our model on a collection of learning tasks that fall into the categories: (1) Set-to-2-edge tasks; (2) Set-to-graph tasks; and (3) Set-to-3-edge tasks.

##### Variants of our model.

We used $\mathsf{F}^2$, $(\mathsf{F}^1, \mathsf{F}^2)$, and $\mathsf{F}^3$ (resp.) for these learning tasks. $\phi$ is implemented using DeepSets (Zaheer et al., 2017) with layers with output dimension ; $\psi$ is implemented with an MLP, $m$, with layers with input dimension defined by $\phi$ and $\beta$. $\beta$ is implemented according to equation 3: for $k = 2$ it uses output features and for $k = 3$, output features. We name this model S2G. For the $k = 2$ case we have also tested a more general (but not more expressive) broadcasting $\beta$ defined using the full equivariant basis from (Maron et al., 2019b) that contains $5$ basis operations: (1-2) as in $\beta$; (3) broadcast the node values to the diagonal; map the sum of all nodes (4) to the diagonal; and (5) to all of the entries. This broadcasting layer gives ; we name this model S2G+.

More architecture, implementation and hyper-parameter details can be found in the supplementary.

##### Baselines.

We compare our results to the following baselines: (1) a model as in equation 2 but with a non-universal set-to-set function as $\phi$, namely, an MLP on each element (vector) in the set; we use the same loss as is used in our model; we name this model MLP. For the particle physics application we also used: (2) the same architecture as (1) but with a triplet loss (Weinberger et al., 2006) on the learned representations based on distance; we name this baseline TRI. (3) A non-learnable geometric-based baseline described later.

### 5.1 Set-to-2-edges

The first type of problem we tackle involves learning set-to-$2$-edge functions. Here, each training example is a pair $(X, A)$ where $X$ is a set and $A \in \{0, 1\}^{n \times n}$ is an adjacency matrix (the diagonal of $A$ is ignored).

#### 5.1.1 Partitioning for particle physics

In particle physics experiments, such as the Large Hadron Collider (LHC), beams of incoming particles are collided at high energies. The results of the collision are outgoing particles, whose properties (such as the trajectory) are measured by detectors surrounding the collision point.

A critical low-level task for analyzing this data is to associate the particle trajectories to their progenitor, which can be formalized as partitioning sets of particle trajectories into subsets according to their unobserved point of origin in space. This task is referred to as vertex reconstruction in particle physics and is illustrated in Figure 2(a).

The measured particle trajectories correspond to elements in the input set and nodes in the output graph, and the parameters that characterize them serve as the node features. An edge between two nodes indicates that the two particles come from a common progenitor or vertex. We enforce that the adjacency matrix of the graph encodes a valid partitioning of tracks to vertices.
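The correspondence between a partition and its adjacency matrix can be sketched as follows. This is an illustration only: the function names are hypothetical, and the validity check shown (symmetry plus transitivity over distinct indices, with the diagonal ignored) is one straightforward way to test that an adjacency matrix encodes a partition, not a description of how the constraint is enforced in the experiments.

```python
def partition_to_adjacency(labels):
    """A_ij = 1 if elements i and j carry the same partition label, else 0."""
    n = len(labels)
    return [[int(labels[i] == labels[j]) for j in range(n)] for i in range(n)]

def is_valid_partition(A):
    """True if A (diagonal ignored) is symmetric and transitive, i.e. encodes a partition."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if A[i][j] != A[j][i]:
                return False
            for l in range(n):
                if l in (i, j):
                    continue
                # i~j and j~l must imply i~l
                if A[i][j] and A[j][l] and not A[i][l]:
                    return False
    return True

A = partition_to_adjacency([0, 0, 1, 1, 1])      # two vertices: tracks {0,1} and {2,3,4}
chain = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]        # 0~1 and 1~2 but not 0~2
print(is_valid_partition(A), is_valid_partition(chain))
```

A raw edge prediction such as `chain` is not a partition, which is why a validity constraint on the output is needed.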

Vertex reconstruction propagates to a number of down-stream data analysis tasks, such as particle identification (a classification problem). Therefore, improvements to vertex reconstruction have a significant impact on the sensitivity of collider experiments. We consider multiple quantities to quantify the performance of the partitioning: the F1 score, the Rand Index (RI), and the Adjusted Rand Index (ARI). We will consider three different types (or flavors) of particle sets (called jets) corresponding to three different fundamental data generating processes labeled bottom-jets, charm-jets, and light-jets (B/C/L). The important distinction between the flavors is the typical number of partitions in each set. Figure 2(b) shows the distribution of the number of partitions (vertices) in each flavor: bottom jets typically have multiple partitions; charm jets also have multiple partitions, but fewer than bottom jets; and light jets typically have only one partition.
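For reference, the Rand Index and Adjusted Rand Index of a predicted partition against a ground-truth partition can be computed as in the sketch below. It uses the standard pair-counting definitions and assumes at least two elements; it is not taken from the authors' evaluation code, and the function names are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(true, pred):
    """Fraction of element pairs on which the two partitions agree (same / different)."""
    pairs = list(combinations(range(len(true)), 2))
    agree = sum((true[i] == true[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(true, pred):
    """Rand Index corrected for chance, computed from the contingency table."""
    n = len(true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:      # degenerate case, e.g. both partitions are all singletons
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

true = [0, 0, 1, 1]
pred = [0, 0, 1, 2]               # the second vertex is wrongly split in two
print(rand_index(true, pred), adjusted_rand_index(true, pred))
```

Both scores depend only on the partitions themselves and not on how the clusters are labeled, which suits a task where vertex identities are arbitrary.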

##### Dataset.

Algorithms for particle physics are typically designed with high-fidelity simulators, which can provide labeled training data. These algorithms are then applied to and calibrated with real data collected by the LHC experiments. Our simulated samples are created with a standard simulation package called pythia (Sjöstrand et al., 2015) and the detector is simulated with delphes  (de Favereau et al., 2014). We use this software to generate synthetic datasets for the three flavors of jets. The generated sets are small, ranging from 2 to 14 elements each.

##### Results

We compare the results of our model (S2G and S2G+), trained to maximize the F1 score, to a typical baseline algorithm used in particle physics, the Adaptive Vertex Reconstruction (AVR) algorithm (Waltenberger, 2011). We ran each model (except AVR) 11 times, and evaluated the model F1 score, RI and ARI over the test set for each run. The results are shown in Table 1. For bottom and charm jets, which have secondary vertices, all of our models reach comparable results and improve over the AVR baseline by about 10% in all performance metrics. In light-jets, without secondary decays, our models reach similar F1 scores.

#### 5.1.2 Learning Delaunay triangulations

In a second set-to-$2$-edge task we test our model’s ability to learn Delaunay triangulations, namely given a set of planar points we want to predict the Delaunay edges between pairs of points, see e.g., (De Berg et al., 1997) Chapter 9. We generated planar point sets as training data and planar point sets as test data; the point sets, $X \in \mathbb{R}^{n \times 2}$, were uniformly sampled in the unit square, and a ground truth matrix in $\{0, 1\}^{n \times n}$ was computed per point set using a Delaunay triangulation algorithm. The number of points in a set, $n$, is either $50$ or varies and is randomly chosen from $\{20, \ldots, 80\}$. Training was stopped after 100 epochs. See more implementation details in the supplementary material. In Table 2 we report accuracy of prediction as well as precision, recall and F1 score. Evidently, both S2G and S2G+ achieve comparable results while outperforming the baseline MLP. See also Figure 4 for visualizations of several triangulations predicted with the trained model versus ground truth.

Figure 4: Results of Delaunay triangulation learning. Top: n=50; Bottom: n∈{20,…,80}.

### 5.2 Set-to-Graphs

One of the main learning tasks in graph analysis is graph classification. Our goal here is to try to improve existing graph learning models, and in particular Graph Neural Networks (GNNs). Since existing GNNs are not graph-to-graph universal we suggest the following procedure to potentially improve their performance: First, compute new node and edge features from the initial set of node features using $\mathsf{F}^1$ and $\mathsf{F}^2$ (resp.), and then concatenate these to the original input and feed this augmented input to the GNN.

For the set-to-graph model, we use $\phi$, as described above, and 2 different MLPs as $\psi$: one for the nodes (for $\mathsf{F}^1$), and one for the edges (for $\mathsf{F}^2$). For the GNN we have used two GNN variants: a message-passing neural network (Gilmer et al., 2017) (MPNN), and Provably Powerful Graph Networks (PPGN) (Maron et al., 2019a); implementation details can be found in the supplementary materials. We compared performance after training the same GNN in two ways: with the original input data, and with the S2G-augmented input data. We used graph datasets from (Wu et al., 2018) using the Open Graph Benchmark (29). All tasks considered are binary or multi-binary graph classification.

Results are presented in Table 3. We report the mean and standard deviation of the AUC-ROC of the models on the test sets. Note that in several datasets (bold in table) there is a significant performance boost when S2G augments MPNN, while for other datasets and for the PPGN model (which is more expressive than MPNN yet heavier computationally and memory-wise) there is no noticeable improvement.

### 5.3 Set to 3-edges

In the last experiment, we demonstrate learning of a set-to-$3$-edge function. The learning task we target is finding supporting triangles in the convex hull of a set of points in $\mathbb{R}^3$. In this scenario, the input is a point set $X \in \mathbb{R}^{n \times 3}$, and the function we want to learn is $\mathsf{F}^3$, where the output is a probability for each triplet of nodes (triangle) to belong to the triangular mesh that describes the convex hull of $X$.

Note that storing $3$-rd order tensors in memory is not feasible, hence we concentrate on a local version of the problem: Given a point set $X$, identify the triangles within the $K$-Nearest-Neighbors of each point that belong to the convex hull of the entire point set $X$. We used . Therefore, for broadcasting ($\beta$) from point data to 3-edge data, instead of holding a $3$-rd order tensor in memory we broadcast only the subset of $K$-NN neighborhoods. This allows working with high-order information with a relatively low memory footprint. Furthermore, since we want to consider $3$-edges (triangles) with no order, we used an invariant universal set model (DeepSets again) as $\psi$.
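The local broadcasting idea can be sketched as follows: rather than enumerating all $n^3$ triplets, only triplets that fall inside the nearest-neighbor neighborhood of some point are kept as candidates. This is a hedged illustration; the function name `knn_triplets`, the brute-force neighbor search, and the neighborhood size used in the example are assumptions made here and are not taken from the experiments.

```python
import itertools
import math

def knn_triplets(points, k):
    """Candidate 3-edges: all unordered triplets inside the k-NN neighbourhood of each point."""
    triplets = set()
    for p in points:
        # Indices sorted by distance to p; p itself comes first with distance 0.
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        neighbourhood = order[:k + 1]            # the point plus its k nearest neighbours
        for tri in itertools.combinations(sorted(neighbourhood), 3):
            triplets.add(tri)
    return triplets

points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0)]
candidates = knn_triplets(points, k=2)           # k=2 is a hypothetical value for this toy set
print(sorted(candidates))                        # far fewer than all C(5, 3) = 10 triplets
```

The number of candidates grows at most like $n \binom{k+1}{3}$, i.e., linearly in $n$ for fixed neighborhood size, instead of cubically.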

We tested our models on two types of data: Gaussian and spherical. For both types we draw point sets in $\mathbb{R}^3$ i.i.d. from a standard normal distribution, $\mathcal{N}(0, I)$, where for the spherical data we normalize each point to unit length. We generated point set samples as a training set, for validation and another for test set. Point sets are in , where , , and . As a baseline, we used MLP. The F1 scores and AUC-ROC of the predicted convex hull triangles are shown in Table 4, where our models outperform the baseline. See Figure 5 for several examples of triangles predicted using our trained model compared to the ground truth.

### Acknowledgments

HS, NS and YL were supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136), the Israel Science Foundation (Grant No. 1830/17) and by a research grant from the Carolito Stiftung (WAIC). JS and EG were supported by the NSF-BSF Grant 2017600 and the ISF Grant 2871/19. KC was supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471 and by the Moore-Sloan data science environment at NYU.

## References

• E. Ahmed, M. Jones, and T. K. Marks (2015) An improved deep learning architecture for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3908–3916. Cited by: §1, §4.2.
• E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers (2018) Clustering with deep learning: taxonomy and new methods. arXiv preprint arXiv:1801.07648. Cited by: §2.
• S. Bell and K. Bala (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34 (4), pp. 98. Cited by: §1, §4.2.
• J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §1, §4.2.
• Z. Chen, S. Villar, L. Chen, and J. Bruna (2019) On the equivalence between graph isomorphism testing and function approximation with gnns. arXiv preprint arXiv:1905.12560. Cited by: §1.
• S. Chopra, R. Hadsell, Y. LeCun, et al. (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pp. 539–546. Cited by: §1.
• T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical cnns. arXiv preprint arXiv:1801.10130. Cited by: §2.
• T. S. Cohen and M. Welling (2016a) Steerable CNNs. (1990), pp. 1–14. External Links: 1612.08498, Link Cited by: §2.
• T. Cohen and M. Welling (2016b) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §2.
• M. De Berg, M. Van Kreveld, M. Overmars, and O. Schwarzkopf (1997) Computational geometry. In Computational geometry, pp. 1–17. Cited by: §5.1.2.
• J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi (2014) DELPHES 3: a modular framework for fast simulation of a generic collider experiment. Journal of High Energy Physics 2014 (2). External Links: ISSN 1029-8479, Link, Document Cited by: §5.1.1.
• S. Dieleman, J. De Fauw, and K. Kavukcuoglu (2016) Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660. Cited by: §2.
• C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis (2017) 3D object classification and retrieval with spherical cnns. arXiv preprint arXiv:1711.06721. Cited by: §2.
• J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §2, §5.2.
• Y. Hsu, Z. Lv, and Z. Kira (2017) Learning to cluster in order to transfer across domains and tasks. arXiv preprint arXiv:1711.10125. Cited by: §2, §4.2.
• M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §6.
• Y. Jiang and N. Verma (2019) Meta-learning to cluster. arXiv preprint arXiv:1910.14134. Cited by: §2.
• N. Keriven and G. Peyré (2019) Universal invariant and equivariant graph neural networks. CoRR abs/1905.04943. External Links: Link, 1905.04943 Cited by: §1, §2, §3.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.
• T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
• R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi (2018) Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144. Cited by: §2.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.
• Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
• H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019a) Provably powerful graph networks. arXiv preprint arXiv:1905.11136. Cited by: §1, §2, §5.2.
• H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman (2019b) Invariant and equivariant graph networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §3, §4.2, §5.
• H. Maron, E. Fetaya, N. Segol, and Y. Lipman (2019c) On the universality of invariant networks. arXiv preprint arXiv:1901.09342. Cited by: §1, §2, §4.2.
• C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2018) Weisfeiler and leman go neural: higher-order graph neural networks. arXiv preprint arXiv:1810.02244. Cited by: §1.
• K. A. O’Neil (2005) Critical points of the singular value decomposition. SIAM journal on matrix analysis and applications 27 (2), pp. 459–473. Cited by: §4.1.
•  (2019) Open graph benchmark. Cited by: §5.2.
• C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §2.
• S. Ravanbakhsh, J. Schneider, and B. Poczos (2017) Equivariance through parameter-sharing. arXiv preprint arXiv:1702.08389. Cited by: §2.
• D. Rydh (2007) A minimal set of generators for the ring of multisymmetric functions. In Annales de l’institut Fourier, Vol. 57, pp. 1741–1769. Cited by: §7.
• A. Sannai, Y. Takai, and M. Cordonnier (2019) Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939. Cited by: §1, §3.
• N. Segol and Y. Lipman (2020) On universal equivariant set networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §3, §4.2, §4.2, §4, §7.
• E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer (2015) Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 118–126. Cited by: §1, §4.2.
• T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands (2015) An introduction to pythia 8.2. Computer Physics Communications 191, pp. 159–177. External Links: ISSN 0010-4655, Link, Document Cited by: §5.1.1.
• G. W. Stewart (1990) Matrix perturbation theory. Cited by: §4.1.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: 1706.03762 Cited by: §6.
• P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2017) Graph Attention Networks. pp. 1–12. External Links: 1710.10903, Link Cited by: §2.
• N. N. Vo and J. Hays (2016) Localizing and orienting street views using overhead imagery. In European conference on computer vision, pp. 494–509. Cited by: §1, §4.2.
• W. Waltenberger (2011) RAVE: A detector-independent toolkit to reconstruct vertices. IEEE Trans. Nucl. Sci. 58, pp. 434–444. External Links: Document Cited by: §5.1.1.
• M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen (2018) 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. External Links: 1807.02547, Link Cited by: §2.
• K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006) Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pp. 1473–1480. Cited by: §5, §6.
• D. Worrall and G. Brostow (2018) Cubenet: equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 567–584. Cited by: §2.
• D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow (2017) Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037. Cited by: §2.
• Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §5.2.
• K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
• D. Yarotsky (2018) Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306. Cited by: §4.2.
• S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361. Cited by: §1, §4.2.
• M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §1, §2, §3, §5, §6.

## 6 Architectures and hyper-parameters

All of our models follow the formula $F^k = \psi \circ \beta \circ \phi$, where $\phi$ is a set-to-set model, $\beta$ is a non-learnable broadcasting set-to-graph layer, and $\psi$ is a simple graph-to-graph network using only a single Multi-Layer Perceptron (MLP) acting on each $k$-edge feature vector independently. We note that all the hyper-parameters were chosen using the validation scores only.

##### Notation.

"DeepSets / MLP of widths " means that we use a DeepSets/MLP network with 3 layers, and each layer's output feature size is its corresponding argument in the array (e.g., the first and second layers have output feature size of , while the third layer output feature size is ). Between the layers we use ReLU as a non-linearity.

##### Partitioning for particle physics applications.

In our models, $\phi$ is implemented using DeepSets (Zaheer et al., 2017) with 5 layers of width . $\beta$ broadcasts the node features in one of the following ways: for model S2G it creates 2 features for each input feature, and for S2G+ it creates 5 features for each input feature (2 features out of the 5 vanish outside the diagonal). $\psi$ is implemented with 2 edge-wise MLP of widths , ending as the edge probability. As a loss, we used a combination of a soft F1 score loss and an edge-wise binary cross-entropy loss.
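To make the composition of a set-to-set model, a broadcasting layer and an edge-wise network concrete, here is a minimal sketch of the S2G variant. It assumes that "2 features for each input feature" means each edge $(i, j)$ receives the concatenation of the features of nodes $i$ and $j$; the names `broadcast_s2g` and `set_to_edges` are hypothetical, and the set-to-set and edge-wise models are passed in as plain callables rather than trained networks.

```python
def broadcast_s2g(X):
    """Non-learnable broadcast from node features to edge features.

    X is a list of n feature lists of length d. Edge (i, j) receives the
    concatenation x_i + x_j, i.e. 2 * d features (assumed S2G convention).
    """
    return [[xi + xj for xj in X] for xi in X]


def set_to_edges(X, set_model, edge_model):
    """Compose: set-to-set model, then broadcast, then edge-wise model."""
    node_features = set_model(X)
    edge_features = broadcast_s2g(node_features)
    # The edge-wise model acts on each edge feature vector independently.
    return [[edge_model(e) for e in row] for row in edge_features]


X = [[1.0], [2.0], [3.0]]
Y = broadcast_s2g(X)
scores = set_to_edges(X, set_model=lambda x: x, edge_model=sum)
```

Because the broadcast has no parameters, all learning happens in the set-to-set model and in the small edge-wise network, which is what keeps the model practical for larger sets.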

Instead of using a max or sum pooling in DeepSets layers, we used a self-attention mechanism based on (Ilse et al., 2018) and (Vaswani et al., 2017):

$$\text{Attention}(X) = \text{softmax}\left(\frac{\tanh\left(f_1(X)\right) \cdot f_2(X)^T}{\sqrt{d_{small}}}\right) \cdot X \qquad (10)$$

where $f_1, f_2$ are implemented by two single MLPs of width .
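The attention pooling of equation (10) can be illustrated with a dependency-free sketch. It reads the equation as applying tanh to $f_1(X)$ only and takes the softmax over each row of the $n \times n$ score matrix; both readings are assumptions. The maps $f_1, f_2$ are represented by hypothetical weight matrices `W1` and `W2` (bias terms omitted).

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(S):
    """Numerically stable softmax applied to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(X, W1, W2):
    """X is n x d; W1 and W2 are d x d_small (hypothetical linear f1, f2)."""
    d_small = len(W1[0])
    Q = [[math.tanh(v) for v in row] for row in matmul(X, W1)]  # tanh(f1(X))
    K = matmul(X, W2)                                           # f2(X)
    K_T = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_small) for v in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), X)                      # n x d output


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = attention(X, W1=[[0.0], [0.0]], W2=[[1.0], [2.0]])
```

With an all-zero `W1` every score is zero, so the attention weights are uniform and each output row reduces to the mean of the input rows; this recovers plain mean pooling as a special case.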

We used a grid search for the following hyper-parameters: learning rate in the range of , DeepSets layers width of , number of layers of , $\psi$ (MLP) of widths , and with or without the attention mechanism in DeepSets. We chose to use 250 epochs with early stopping based on the validation score, a batch size of 2048, and the Adam optimizer (Kingma and Ba, 2014). Our models train in less than 2 hours on a single Tesla V100 GPU.

The deep learning baselines are implemented as follows: MLP is implemented similarly to S2G, with the exception that instead of using DeepSets as $\phi$, we use an MLP of widths . TRI uses a siamese MLP of widths  to extract node features, and the edge logits are the $l_2$ distances between the nodes. For the loss, we use the triplet loss (Weinberger et al., 2006), which has the natural disadvantage that it cannot learn from sets with a single cluster, or from sets of size 2. We draw random triplets (anchor, neg, pos), where anchor and pos are of the same cluster and neg is of a different cluster, and the loss is defined as

$$L_i = \max\left(d_{l_2}(\text{anch}_i, \text{pos}_i) - d_{l_2}(\text{anch}_i, \text{neg}_i) + 2,\ 0\right)$$
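A small sketch of the triplet sampling and loss follows. It uses the standard hinge form of the triplet loss (a maximum with zero) with margin 2, and the helper names `sample_triplet` and `triplet_loss` are hypothetical. The exhaustive enumeration of valid triplets is for illustration only and would be too slow for large sets.

```python
import math
import random


def sample_triplet(labels, rng=random):
    """Pick indices (anchor, pos, neg) from cluster labels.

    anchor and pos share a cluster, neg belongs to a different one.
    Returns None when no valid triplet exists (e.g. a single cluster).
    """
    n = len(labels)
    valid = [
        (a, p, g)
        for a in range(n) for p in range(n) for g in range(n)
        if a != p and labels[a] == labels[p] and labels[a] != labels[g]
    ]
    return rng.choice(valid) if valid else None


def triplet_loss(anchor, pos, neg, margin=2.0):
    """Hinge triplet loss on l2 distances between embedded nodes."""
    return max(math.dist(anchor, pos) - math.dist(anchor, neg) + margin, 0.0)


loss_far = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 5.0))   # negative is far
loss_near = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 2.0))  # negative is close
triplet = sample_triplet([0, 0, 1], random.Random(0))
```

The `None` return value makes the limitation mentioned in the text explicit: a set with a single cluster yields no triplet and therefore contributes no training signal.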

The dataset is made of training, validation and test sets with 543,544, 181,181 and 181,182 instances, respectively. Each of the sets contains all three flavors (bottom, charm and light jets) in roughly equal amounts, while the flavor of each instance is not part of the input.

Each model is evaluated 11 times for stability in the same manner: (1) training over the dataset, stopping when the F1 score over the validation set is maximal; (2) predicting the clusters of the test set; (3) separating the 3 flavors and calculating the metrics for each flavor. Eventually, we have 11 scores for each combination of metric, flavor and model, and we report the mean ± std. Note that AVR is evaluated only once since it is not a learning algorithm.
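The aggregation step of this protocol can be sketched as below. The record layout and the function name `summarize_runs` are illustrative assumptions, the example scores are made up, and the sample standard deviation is used since the text does not say which estimator was reported.

```python
import statistics
from collections import defaultdict


def summarize_runs(records):
    """Group scores by (model, flavor, metric) and return (mean, std) per group.

    records is an iterable of (model, flavor, metric, score) tuples,
    one per training run. Each group needs at least two scores.
    """
    grouped = defaultdict(list)
    for model, flavor, metric, score in records:
        grouped[(model, flavor, metric)].append(score)
    return {
        key: (statistics.mean(scores), statistics.stdev(scores))
        for key, scores in grouped.items()
    }


# Made-up scores for two runs of one model, for illustration only.
records = [
    ("S2G", "bottom", "F1", 0.5),
    ("S2G", "bottom", "F1", 0.7),
    ("S2G", "charm", "F1", 0.4),
    ("S2G", "charm", "F1", 0.4),
]
summary = summarize_runs(records)
```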

##### Learning Delaunay triangulation.

In our models, $\phi$ is implemented using DeepSets with 7 layers of width . $\beta$ broadcasts as before for models S2G and S2G+, thus ending with 160 or 400 features per edge. $\psi$ is implemented with 2 edge-wise MLP of widths , ending as the edge probability. We use an edge-wise binary cross-entropy loss. The implementation of the MLP baseline is identical except for $\phi$, which is an MLP with the same number of layers and widths.

We searched learning rate from .

##### Set to graph.

Our models include Set2Graph followed by a GNN, trained together (as a single network). In our models, $\phi$ is implemented using DeepSets with 5 layers of width . $\beta$ is of S2G, thus ending with 20 features. $\psi$ uses 2 different MLPs for the nodes and edges (i.e., diagonal and off-diagonal; using 2 different MLPs is necessary to separate the learning of $F^1$, the set-to-set function, from that of $F^2$, the set-to-2-edges function), each MLP is of widths