1 Introduction
We consider the problem of learning functions mapping sets of vectors in $\mathbb{R}^{d_{\mathrm{in}}}$ to graphs, or more generally hypergraphs; we name this problem Set2Graph, or set-to-graph. Set-to-graph functions appear in machine-learning applications such as clustering, predicting features on edges and nodes in graphs, and learning $k$-edge information in sets.
Mathematically, we represent each set-to-graph function as a collection of set-to-$k$-edge functions, where each set-to-$k$-edge function learns features on $k$-edges. That is, given an input set we consider functions $F^k$ attaching feature vectors to $k$-edges: each $k$-tuple of set elements is assigned an output vector. Functions mapping sets to hypergraphs with hyperedges of size up to $k$ are then modeled by $(F^1, F^2, \ldots, F^k)$. For example, functions mapping sets to standard graphs are represented by $(F^1, F^2)$; see Figure 1.
Set-to-graph functions are well-defined if they satisfy a property called equivariance (defined later), and therefore the set-to-graph problem is an instance of the bigger class of equivariant learning. One option for learning an equivariant set-to-graph model is to use an out-of-the-box full equivariant model as in (Maron et al., 2019b). By full we mean that each linear layer is chosen from the space of all linear equivariant layers. Learning $F^k$ would require equivariant layers mapping $1$-st order tensors (representing sets) to $k$-th order tensors (representing $k$-edge hypergraphs). Such models are computationally infeasible: (i) they possess a large number of parameters (combinatorial in $k$); (ii) they require storing $k$-th order tensors in memory.
Aside from the practical problem of using equivariant models for learning set-to-graph functions, there is a theoretical question of expressive power, or universality, that is, the ability of the models to approximate any continuous equivariant function. In the equivariant learning literature, set-to-set models (Zaheer et al., 2017; Qi et al., 2017) were recently proven equivariant universal (Keriven and Peyré, 2019; Segol and Lipman, 2020; Sannai et al., 2019). In contrast, the situation for graph-to-graph equivariant models is more intricate: some models, such as message passing (a.k.a. graph convolutional networks), are known to be non-universal (Xu et al., 2019; Morris et al., 2018; Maron et al., 2019a; Chen et al., 2019); high-order equivariant models are known to be universal (Maron et al., 2019c) but, as discussed above, are not practical. Universality of equivariant set-to-graph models is not known, as far as we are aware.
Another machine-learning approach for learning set-to-graph functions is similarity learning (Bromley et al., 1994; Chopra et al., 2005; Simo-Serra et al., 2015; Zagoruyko and Komodakis, 2015; Bell and Bala, 2015; Ahmed et al., 2015; Vo and Hays, 2016). This is a simpler approach in which a siamese network embeds each set element independently and pairwise information is extracted from pairs of embeddings. Although this approach does not suffer from the complexity issues of equivariant networks, it has limited expressive power.
In this paper we introduce a model for the set-to-graph problem that is both practical (i.e., it has a small number of parameters and no need to build high-order tensors in memory) and provably universal. We achieve this with models defined as a composition of three networks: $F^k = \psi \circ \beta \circ \phi$, where $\phi$ is a set-to-set model, $\beta$ is a non-learnable broadcasting set-to-graph layer, and $\psi$ is a simple graph-to-graph network using only a single Multi-Layer Perceptron (MLP) acting on each $k$-edge feature vector independently.
We have tested our model on four different applications: (i) Set-to-2-edges: partitioning (clustering) of simulated particles generated in the Large Hadron Collider (LHC); (ii) Set-to-2-edges: predicting Delaunay edges in planar point clouds; (iii) Set-to-graph: improving graph neural networks by augmenting them with a Set2Graph model; and (iv) Set-to-3-edges: finding triplets of points on the convex hull of a volumetric point cloud. We show that in all applications we achieve superior performance to the baselines.
2 Previous work
Equivariant learning.
In many learning setups the task we would like to learn is invariant or equivariant to certain transformations of the input. Canonical examples are image classification tasks (LeCun et al., 1998; Krizhevsky et al., 2012), which are often assumed to be invariant to translations of the image, and set classification tasks (Zaheer et al., 2017; Qi et al., 2017), which are typically invariant to the specific order of the elements in the set. Restricting models to be invariant or equivariant to these transformations was shown to be an excellent approach for reducing the number of parameters of models while improving generalization. This paradigm for designing deep models was used for many different tasks, data modalities and transformations, e.g., set learning (Zaheer et al., 2017; Qi et al., 2017), graph learning (Kipf and Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Xu et al., 2019; Kondor et al., 2018; Maron et al., 2019b, a), learning images with rotational and reflectional symmetries (Cohen and Welling, 2016b, a; Dieleman et al., 2016; Worrall et al., 2017), learning functions on spheres (Cohen et al., 2018; Esteves et al., 2017) and learning general 3D data (Weiler et al., 2018; Worrall and Brostow, 2018; Weiler et al., 2018). Apart from designing invariant and equivariant networks, there has been keen interest in the analysis of such models (Ravanbakhsh et al., 2017; Kondor et al., 2018), especially the analysis of their approximation power (Zaheer et al., 2017; Qi et al., 2017; Maron et al., 2019c; Keriven and Peyré, 2019; Segol and Lipman, 2020). Most related to this work are the work of (Maron et al., 2019b), which characterized equivariant layers between hypergraph data, a setting that includes the set-to-graph setup; (Maron et al., 2019c), which proved a universal approximation property for invariant networks for general permutation groups; and the work of (Keriven and Peyré, 2019), which provides a proof that equivariant neural networks are universal.
The latter construct an equivariant network with a single hidden layer that may contain tensors of unbounded degree. In contrast, we construct equivariant networks that only involve tensors of order $1$ and $k$ and are universal.
Learning to Cluster.
Deep clustering is a large field (Aljalbout et al., 2018) and we restrict our attention to the methods that are most related to ours. The work that tackles the problem most similar to ours is that of (Jiang and Verma, 2019), who suggest a method for meta clustering that is based on LSTMs and therefore depends on the order of the set elements. In contrast, our method is blind (equivariant) to the chosen order of the input sets. In another related work, (Hsu et al., 2017) suggest performing transfer learning between tasks and domains by learning how to cluster, where the main idea is to learn a similarity function between set elements according to their labels. This similarity function is learned using a loss that promotes cluster assignments of points with the same label. The main difference from our work is that we assume all the data can be seen at the same time and aim at universal models approximating set-to-graph functions.
We discuss the differences between our method and equivariant and similarity learning (including learning to cluster) methods in more detail at the end of Section 4.
3 Learning hypergraphs from sets
We would like to learn functions mapping sets of $n$ vectors in $\mathbb{R}^{d_{\mathrm{in}}}$ to hypergraphs with $n$ nodes (think of the nodes as corresponding to the set elements), and arbitrary $k$-edge feature vectors in $\mathbb{R}^{d_{\mathrm{out}}}$, where a $k$-edge is defined as a $k$-tuple of set elements. A function mapping sets of vectors to $k$-edges is called a set-to-$k$-edge function and denoted $F^k : \mathbb{R}^{n \times d_{\mathrm{in}}} \to \mathbb{R}^{n^k \times d_{\mathrm{out}}}$. Consequently, a set-to-hypergraph function would be modeled as a sequence $(F^1, F^2, \ldots, F^K)$, for target hypergraphs with hyperedges of maximal size $K$. For example, $F^2$ learns pairwise relations in a set, and $(F^1, F^2)$ is a function from sets to graphs; see Figure 1.
Our goal is to design equivariant neural network models for $F^k$ that are as efficient as possible in terms of number of parameters and memory usage, but at the same time have maximal expressive power, i.e., are universal.
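Representing sets and $k$-edges.
A matrix $X \in \mathbb{R}^{n \times d_{\mathrm{in}}}$ with rows $x_1, \ldots, x_n$ represents a set of $n$ vectors and therefore should be considered up to reordering of its rows. We denote by $S_n$ the symmetric group, that is, the group of bijections (permutations) $\sigma : [n] \to [n]$, where $[n] = \{1, \ldots, n\}$. We denote by $\sigma \cdot X$ the matrix resulting from reordering the rows of $X$ by the permutation $\sigma$, i.e., $(\sigma \cdot X)_{i,j} = X_{\sigma^{-1}(i),j}$. In this notation, $X$ and $\sigma \cdot X$ represent the same set, for all permutations $\sigma$.
$k$-edges are represented as a tensor $Y \in \mathbb{R}^{n^k \times d_{\mathrm{out}}}$, where $Y_{\mathbf{i},:}$ denotes the feature vector attached to the $k$-edge defined by the $k$-tuple $(x_{i_1}, \ldots, x_{i_k})$, where $\mathbf{i} = (i_1, \ldots, i_k)$ is a multi-index with non-repeating indices. Similarly to the set case, $k$-edges are considered up to renumbering of the nodes by some permutation $\sigma \in S_n$. That is, if we define the action $\sigma \cdot Y$ by $(\sigma \cdot Y)_{\mathbf{i},j} = Y_{\sigma^{-1}(\mathbf{i}),j}$, where $\sigma^{-1}(\mathbf{i}) = (\sigma^{-1}(i_1), \ldots, \sigma^{-1}(i_k))$, then $Y$ and $\sigma \cdot Y$ represent the same $k$-edge data, for all $\sigma \in S_n$.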
Equivariance.
For $F^k$ to represent a well-defined map between sets and $k$-edge data it should be equivariant to permutations, namely satisfy
$$F^k(\sigma \cdot X) = \sigma \cdot F^k(X), \tag{1}$$
for all sets $X$ and permutations $\sigma \in S_n$. Equivariance guarantees, in particular, that the two equivalent sets $X$ and $\sigma \cdot X$ are mapped to equivalent $k$-edge data tensors $F^k(X)$ and $\sigma \cdot F^k(X)$.
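Set-to-$k$-edge models.
Following the three-network composition described in the introduction, a set-to-$k$-edge model takes the form
$$F^k = \psi \circ \beta \circ \phi, \tag{2}$$
where the components $\phi$, $\beta$ and $\psi$ are detailed below.
Set-to-graph models.
Given the model of set-to-$k$-edge functions, a model for a set-to-graph function can now be constructed from a pair of set-to-$k$-edge networks $(F^1, F^2)$. Similarly, a set-to-hypergraph function would require $(F^1, \ldots, F^K)$, where $K$ is the maximal hyperedge size. Figure 1 shows an illustration of set-to-$k$-edge and set-to-graph functions.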
$\phi$ component.
$\phi$ is a set-to-set equivariant model, that is, $\phi$ maps sets of vectors in $\mathbb{R}^{d_{\mathrm{in}}}$ to sets of vectors in $\mathbb{R}^{d_1}$. To achieve the universality goal we need $\phi$ to be universal as a set-to-set model; that is, $\phi$ can approximate arbitrary continuous set-to-set functions. Several options exist (Keriven and Peyré, 2019; Sannai et al., 2019), although probably the simplest option is either DeepSets (Zaheer et al., 2017) or one of its variations; all were recently proven to be universal in (Segol and Lipman, 2020).
In practice, as will be clear later from the proof of the universality of the model, when building a set-to-graph or set-to-hypergraph model, the $\phi$ (set-to-set) part of the $k$-edge networks can be shared between different set-to-$k$-edge models, $F^k$, without compromising universality.
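$\beta$ component.
$\beta$ is a non-learnable linear broadcasting layer mapping sets to $k$-edges. In theory, as shown in (Maron et al., 2019b), the space of equivariant linear mappings $\mathbb{R}^{n} \to \mathbb{R}^{n^k}$ is of dimension $\mathrm{bell}(k+1)$, which can be very high since Bell numbers grow exponentially. Interestingly, in the set-to-$k$-edge case one can achieve universality with only $k$ linear operators. We define the broadcasting operator $\beta : \mathbb{R}^{n \times d} \to \mathbb{R}^{n^k \times kd}$ to be
$$\beta(X)_{\mathbf{i},:} = [x_{i_1}, x_{i_2}, \ldots, x_{i_k}], \tag{3}$$
where $\mathbf{i} = (i_1, \ldots, i_k)$ and brackets denote concatenation in the feature dimension, that is, for $A \in \mathbb{R}^{n^k \times d_a}$ and $B \in \mathbb{R}^{n^k \times d_b}$, their concatenation is $[A, B] \in \mathbb{R}^{n^k \times (d_a + d_b)}$. Therefore, the feature output dimension of $\beta$ is $kd$.
As an example, consider the graph case, where $k=2$. In this case $\beta(X)_{i_1,i_2,:} = [x_{i_1}, x_{i_2}]$. This function is illustrated in Figure 2, broadcasting data in $\mathbb{R}^{n \times d}$ to a tensor in $\mathbb{R}^{n \times n \times 2d}$.
To see that the broadcasting layer is equivariant, it is enough to consider a single feature $\beta(X)_{\mathbf{i}} = x_{i_1}$. Permuting the rows of $X$ by a permutation $\sigma$ we get $\beta(\sigma \cdot X)_{\mathbf{i},j} = x_{\sigma^{-1}(i_1),j} = \beta(X)_{\sigma^{-1}(\mathbf{i}),j} = (\sigma \cdot \beta(X))_{\mathbf{i},j}$.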
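The broadcasting operator for the graph case and its equivariance can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function names `broadcast_pairs`, `act_on_set` and `act_on_pairs` are hypothetical, and the permutation is stored as an index array playing the role of $\sigma^{-1}$.

```python
import numpy as np

def broadcast_pairs(X):
    """Broadcasting for k = 2: output[i1, i2, :] = [x_{i1}, x_{i2}]."""
    n, d = X.shape
    rows = np.broadcast_to(X[:, None, :], (n, n, d))  # x_{i1}, repeated along i2
    cols = np.broadcast_to(X[None, :, :], (n, n, d))  # x_{i2}, repeated along i1
    return np.concatenate([rows, cols], axis=-1)      # shape (n, n, 2d)

def act_on_set(perm, X):
    """Reorder the rows of X; `perm` is an index array standing in for sigma^{-1}."""
    return X[perm]

def act_on_pairs(perm, Y):
    """Renumber the nodes of pairwise data: Y[perm[i1], perm[i2], :]."""
    return Y[perm][:, perm]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
perm = rng.permutation(5)

Y = broadcast_pairs(X)
# Equivariance: broadcasting the permuted set equals permuting the broadcast.
equivariant = np.allclose(broadcast_pairs(act_on_set(perm, X)),
                          act_on_pairs(perm, Y))
print(Y.shape, equivariant)
```

Note that `np.broadcast_to` returns views, so the only dense allocation here is the final concatenation.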
$\psi$ component.
$\psi$ is a mapping of tensors $\mathbb{R}^{n^k \times d} \to \mathbb{R}^{n^k \times d_{\mathrm{out}}}$. Here the theory of equivariant operators indicates that the space of linear equivariant maps is of dimension $\mathrm{bell}(2k)$, which suggests a huge number of model parameters even for a single linear layer. Surprisingly, universality can be achieved with much less: a single linear operator (i.e., scaled identity) in each layer, which in the multi-feature multi-layer case boils down to applying a Multi-Layer Perceptron $m$ to each feature vector in the input tensor. That is, we use
$$\psi(Y)_{\mathbf{i},:} = m(Y_{\mathbf{i},:}). \tag{4}$$
Figure 2 illustrates set-to-$k$-edge and set-to-graph models incorporating the three components discussed above.
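To make the composition of the three components concrete, here is a small NumPy sketch of a set-to-2-edge model with random, untrained weights. It is an illustration under assumptions rather than the authors' code: the DeepSets-style layer (a per-element linear term plus a mean-pooled term), the ReLU nonlinearity, and all layer sizes are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def deepsets_layer(X, W_self, W_mean, b):
    """Equivariant set layer: per-element term plus a pooled (mean) term, then ReLU."""
    pooled = X.mean(axis=0, keepdims=True)
    return np.maximum(X @ W_self + pooled @ W_mean + b, 0.0)

def phi(X, layers):
    """Set-to-set component: a stack of DeepSets-style layers."""
    for W_self, W_mean, b in layers:
        X = deepsets_layer(X, W_self, W_mean, b)
    return X

def beta(H):
    """Non-learnable broadcasting: out[i1, i2, :] = [h_{i1}, h_{i2}]."""
    n, d = H.shape
    return np.concatenate([np.broadcast_to(H[:, None, :], (n, n, d)),
                           np.broadcast_to(H[None, :, :], (n, n, d))], axis=-1)

def psi(Y, mlp):
    """The same two-layer MLP applied to every edge feature vector independently."""
    (W1, b1), (W2, b2) = mlp
    return np.maximum(Y @ W1 + b1, 0.0) @ W2 + b2

def set_to_2edge(X, layers, mlp):
    return psi(beta(phi(X, layers)), mlp)

# Illustrative sizes: d_in = 3, hidden = 8, d_1 = 4, d_out = 1.
layers = [(rng.normal(size=(3, 8)), rng.normal(size=(3, 8)), np.zeros(8)),
          (rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), np.zeros(4))]
mlp = [(rng.normal(size=(8, 16)), np.zeros(16)),
       (rng.normal(size=(16, 1)), np.zeros(1))]

X = rng.normal(size=(6, 3))
perm = rng.permutation(6)
out = set_to_2edge(X, layers, mlp)
out_perm = set_to_2edge(X[perm], layers, mlp)
# Permuting the input set permutes rows and columns of the output consistently.
is_equivariant = np.allclose(out_perm, out[perm][:, perm])
print(out.shape, is_equivariant)
```

The only learnable parameters live in `phi` and in the per-edge MLP of `psi`; `beta` has none.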
4 Universality of set-to-graph models
In this section we prove that the model $F^k$ introduced above is universal, in the sense that it can approximate arbitrary continuous equivariant set-to-$k$-edge functions over compact domains.
Theorem 1.
The model $F^k$ is set-to-$k$-edge universal.
A corollary of Theorem 1 establishes general set-to-hypergraph universal models:
Theorem 2.
The model $(F^1, \ldots, F^K)$ is set-to-hypergraph universal.
Our main tool for proving Theorem 1 is a characterization of the equivariant set-to-$k$-edge polynomials. This characterization can be seen as a generalization of the characterization of set-to-set equivariant polynomials that recently appeared in (Segol and Lipman, 2020).
We consider an arbitrary continuous set-to-$k$-edge mapping over a compact set. Since the mapping is equivariant we can assume the compact set is symmetric, i.e., closed under the action of every permutation in $S_n$. The proof consists of three parts: (i) characterization of the equivariant set-to-$k$-edge polynomials; (ii) showing that every equivariant continuous set-to-$k$-edge function can be approximated by some equivariant polynomial; (iii) showing that every such polynomial can be approximated by our model $F^k$.
Before providing the full proof, which contains some technical derivations, let us provide a simpler universality proof (under some mild conditions) for the set-to-2-edge model, $F^2$, based on the Singular Value Decomposition (SVD).
4.1 A simple proof for universality of $F^2$
It is enough to consider the case of a single output feature; the general case is implied by applying the argument for each output feature dimension independently. Let $G$ be an arbitrary continuous equivariant set-to-2-edge function. We want to approximate $G$ with our model $F^2$. First, note that without loss of generality we can assume $G(X)$ has a simple spectrum (i.e., eigenvalues are all different) for all $X$ in the domain. Indeed, if this is not the case we can always add a sufficiently large diagonal term to $G(X)$. This diagonal addition does not change the 2-edge values assigned by $G$, and it guarantees a simple spectrum by standard Hermitian matrix eigenvalue perturbation theory (see e.g., (Stewart, 1990), Section IV:4).
Now let $G(X) = U \Sigma V^T$ be the SVD of $G(X)$, where $U, V$ are orthogonal and $\Sigma$ is diagonal. Since $G(X)$ has a simple spectrum, $U, \Sigma, V$ are all continuous in $X$; $\Sigma$ is unique, and $U, V$ are unique up to a sign flip of the singular vectors (i.e., columns of $U, V$) (O’Neil, 2005). Let us first assume that the singular vectors can be chosen uniquely also up to a sign; later we show how we achieve this with some additional mild assumption.
Now, uniqueness of the SVD together with the equivariance of imply that are continuous settoset equivariant and is continuous set invariant function:
Lastly, since is settoset universal there is a choice of its parameters so that it approximates arbitrarily well the equivariant settoset function . The component can be chosen by noting that , where are the singular values, and is a cubic polynomial. To conclude pick to approximate sufficiently well so that approximates to the desired accuracy.
To achieve uniqueness of the singular vectors upto a sign we can add, e.g., the following assumption: for all singular vectors and . Using this assumption we can always pick , in the SVD so that , , for all .
We now move to the general proof.
4.2 Equivariant set-to-$k$-edge polynomials
We start with a characterization of the set-to-$k$-edge equivariant polynomials.
We need some more notation. Given a vector , and a multiindex , we set ; ; and define accordingly . Given two tensors , we use the notation to denote the tensorproduct, defined by , where are suitable multiindices. Lastly, we denote by a vector of multiindices , and .
Theorem 3.
An equivariant settoedge polynomial can be written as
(5) 
where , , and are invariant polynomials.
As an example, consider the graph case, where . Equivariant settoedge polynomials take the form:
(6) 
and coordinatewise
(7) 
The general proof idea is to consider an arbitrary equivariant settoedge polynomial and use its equivariance property to show that it has the form as in equation 5. This is done by looking at a particular output entry , where say . Then the proof considers two subsets of permutations: First, the subgroup of all permutations that fixes the numbers , i.e., , but permute everything else freely; this subgroup is denoted . Second, permutations of the form , where . Each of these permutation subsets reveals a different part in the structure of the equivariant polynomial and its relation to invariant polynomials.
As before, it is enough to prove Theorem 3 for . Let and consider any permutation . Then from equivariance of we have
and . That is is invariant to permuting its last elements ; we say that is invariant. We next prove that invariance can be written using a combination of invariant polynomials and tensor products of :
Lemma 1.
Let be invariant polynomial. That is invariant to permuting the last terms. Then
(8) 
where are invariant polynomials.
We prove this lemma in the supplementary material. So now we know that has the form equation 8. On the other hand let be an arbitrary multiindex and consider the permutation . Again by permutation equivariance of we have
which is a coordinatewise form of equation 5 with .
Approximating with a polynomial .
We denote for an arbitrary tensor its infinity norm by .
Lemma 2.
Let be a continuous equivariant function over a symmetric domain . For an arbitrary , there exists an equivariant polynomial so that
Approximating with a network .
The final component of the proof of Theorem 1 is showing that an equivariant polynomial can be approximated over using a network of the form in equation 2. The key is to use the characterization of Theorem 3 and write in a similar form to our model in equation 2:
(9) 
where defined by , where , and are all the multiindices participating in the sum in equation 5. Note that
Therefore, is chosen as the polynomial
where , and .
In view of equation 9 all we have left is to choose and (i.e., ) to approximate (resp.) to a desired accuracy. We detail the rest of the proof in the supplementary.
Universality of the set-to-hypergraph model.
Theorem 2 follows from Theorem 1 by considering a continuous set-to-hypergraph function as a collection of set-to-$k$-edge functions and approximating each one using our model $F^k$. Note that universality still holds if all $F^k$ share the $\phi$ part of the network (assuming sufficient width $d_1$).
Note that a single set-to-$k$-edge model (equation 2) is not universal when approximating set-to-hypergraph functions:
Proposition 1.
The set-to-2-edge model, $F^2$, cannot approximate general set-to-graph functions.
The proof is in the supplementary; it shows that even a constant function that outputs one value for 1-edges (nodes) and a different value for 2-edges cannot be approximated by a set-to-2-edge model $F^2$.
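One intuition for Proposition 1, which may differ from the argument given in the supplementary, is the following: on a set whose elements are all identical, any equivariant $\phi$ returns identical rows, so the broadcast tensor is the same at every entry and a pointwise $\psi$ must assign the same value to the diagonal (nodes) and to the off-diagonal (pairs). The sketch below checks this numerically for a small model with random weights; the architecture and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W_self, W_mean = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 1))

def model(X):
    # phi: one DeepSets-style equivariant layer
    H = np.tanh(X @ W_self + X.mean(axis=0, keepdims=True) @ W_mean)
    n, d = H.shape
    # beta: broadcast to pairs, out[i, j] = [h_i, h_j]
    Y = np.concatenate([np.broadcast_to(H[:, None], (n, n, d)),
                        np.broadcast_to(H[None, :], (n, n, d))], axis=-1)
    # psi: the same MLP applied to every entry
    return (np.tanh(Y @ W1) @ W2)[..., 0]

X = np.tile(rng.normal(size=(1, 3)), (5, 1))  # a set of five identical elements
out = model(X)
spread = out.max() - out.min()  # difference between any two output entries
print(spread)
```

Because `spread` is numerically zero, the diagonal and off-diagonal outputs coincide on this input regardless of the weights, so no choice of parameters can output different constants for nodes and for pairs.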
Relation to similarity learning.
Previous models suggested for learning pairwise relations in sets were mostly siamese models, in which each element is embedded independently and pairwise information is extracted from pairs of embeddings (Hsu et al., 2017; Bromley et al., 1994; Simo-Serra et al., 2015; Zagoruyko and Komodakis, 2015; Bell and Bala, 2015; Ahmed et al., 2015; Vo and Hays, 2016). This model is similar to the model suggested in this paper for the case $k=2$ but is not universal for two main reasons: (i) the same MLP is used both for 1-edge (nodes or self-loops) prediction and 2-edge prediction, and Proposition 1 implies that the model is not universal in this case; (ii) the model used in the role of $\phi$ is an element-wise MLP, which is not set-to-set universal (Segol and Lipman, 2020).
Relation to hypergraph equivariant networks.
Our model’s constituents are all built from certain equivariant linear layers and entry-wise non-linearities. Therefore, our model is an instance of the general equivariant hypergraph networks framework (Maron et al., 2019b). The benefit of our suggested model compared to the general equivariant model is that it is much more efficient in terms of computational complexity and memory footprint. In particular, it uses only one basis function for $\psi$ (scaled identity, without counting features), in contrast to $\mathrm{bell}(2k)$ in the full equivariant model, and can be used without constructing $k$-th order tensors explicitly in memory. Even though the model is lean in terms of number of parameters (i.e., uses fewer basis functions), it is proven to be universal (i.e., to have maximal expressive power). As far as we are aware, this fact was not known before, even for the full equivariant models when approximating set-to-hypergraph functions and restricting the tensor order in the network.
5 Applications
We have tested our model on a collection of learning tasks that fall into three categories: (1) Set-to-2-edge tasks; (2) Set-to-graph tasks; and (3) Set-to-3-edge tasks.
Variants of our model.
We used $F^2$, $(F^1, F^2)$, and $F^3$ (resp.) for these learning tasks. $\phi$ is implemented using a multi-layer DeepSets network (Zaheer et al., 2017) with output dimension $d_1$; $\psi$ is implemented with a multi-layer MLP, $m$, with input dimension defined by $d_1$ and $\beta$. $\beta$ is implemented according to equation 3: for $k=2$ it uses $2d_1$ output features and for $k=3$, $3d_1$ output features. We name this model S2G. For the $k=2$ case we have also tested a more general (but not more expressive) broadcasting $\beta$ defined using the full equivariant basis from (Maron et al., 2019b), which contains five basis operations: (1-2) as in equation 3; (3) broadcast the node values to the diagonal; (4) map the sum of all nodes to the diagonal; and (5) map the sum of all nodes to all of the entries. This broadcasting layer gives $5d_1$ output features; we name this model S2G+.
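The five broadcasting operations used by S2G+ can be written down directly from their description above. The following NumPy sketch is an illustration, not the released implementation; the function name `broadcast_full_basis` and the ordering of the five feature blocks in the output are assumptions.

```python
import numpy as np

def broadcast_full_basis(X):
    """Map node features (n, d) to pairwise features (n, n, 5d) using five operations."""
    n, d = X.shape
    eye = np.eye(n)[:, :, None]                       # (n, n, 1) diagonal mask
    total = X.sum(axis=0)                             # (d,) sum over all nodes
    rows = np.broadcast_to(X[:, None, :], (n, n, d))  # (1) x_{i1} at entry (i1, i2)
    cols = np.broadcast_to(X[None, :, :], (n, n, d))  # (2) x_{i2} at entry (i1, i2)
    diag = eye * X[:, None, :]                        # (3) x_i on the diagonal, zero elsewhere
    diag_sum = eye * total                            # (4) sum of nodes on the diagonal
    all_sum = np.broadcast_to(total, (n, n, d))       # (5) sum of nodes at every entry
    return np.concatenate([rows, cols, diag, diag_sum, all_sum], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
perm = rng.permutation(4)
Y = broadcast_full_basis(X)
# Each of the five operations is permutation equivariant, hence so is their concatenation.
equivariant = np.allclose(broadcast_full_basis(X[perm]), Y[perm][:, perm])
print(Y.shape, equivariant)
```

With $d = d_1$ input features the output has $5d_1$ features per entry, matching the count stated above.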
More architecture, implementation and hyperparameter details can be found in the supplementary.
Baselines.
We compare our results to the following baselines: (1) a model as in equation 2 but with a non-universal set-to-set function as $\phi$, namely, an MLP applied to each element (vector) in the set; we use the same loss as in our model and name this model MLP. For the particle physics application we also used: (2) the same architecture as (1) but with a triplet loss (Weinberger et al., 2006) on the learned representations based on distance; we name this baseline TRI. (3) A non-learnable geometric-based baseline described later.
5.1 Set-to-2-edges
The first type of problem we tackle involves learning set-to-2-edge functions. Here, each training example is a pair $(X, A)$, where $X$ is a set and $A \in \{0,1\}^{n \times n}$ is an adjacency matrix (the diagonal of $A$ is ignored).
5.1.1 Partitioning for particle physics
In particle physics experiments, such as the Large Hadron Collider (LHC), beams of incoming particles are collided at high energies. The results of the collision are outgoing particles, whose properties (such as the trajectory) are measured by detectors surrounding the collision point.
A critical low-level task for analyzing this data is to associate the particle trajectories to their progenitor, which can be formalized as partitioning sets of particle trajectories into subsets according to their unobserved point of origin in space. This task is referred to as vertex reconstruction in particle physics and is illustrated in Figure 2(a).
The measured particle trajectories correspond to elements in the input set and nodes in the output graph, and the parameters that characterize them serve as the node features. An edge between two nodes indicates that the two particles come from a common progenitor or vertex. We enforce that the adjacency matrix of the graph encodes a valid partitioning of tracks to vertices.
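The text does not spell out how a valid partitioning is enforced. One simple possibility, shown here purely as an assumption and not as the authors' procedure, is to symmetrize the predicted edge scores, threshold them, and take connected components, so that the resulting adjacency matrix is transitive. The function name `partition_from_scores` and the threshold value are hypothetical.

```python
import numpy as np

def partition_from_scores(scores, threshold=0.5):
    """Turn pairwise scores (n, n) into cluster labels and a transitive adjacency matrix."""
    n = scores.shape[0]
    sym = 0.5 * (scores + scores.T)  # make the scores symmetric
    parent = list(range(n))          # union-find forest over the nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sym[i, j] > threshold:
                parent[find(i)] = find(j)  # merge the two components

    labels = np.array([find(i) for i in range(n)])
    adjacency = (labels[:, None] == labels[None, :]).astype(int)
    return labels, adjacency

scores = np.array([[0.0, 0.9, 0.1, 0.0],
                   [0.8, 0.0, 0.7, 0.0],
                   [0.2, 0.6, 0.0, 0.1],
                   [0.0, 0.0, 0.2, 0.0]])
labels, adjacency = partition_from_scores(scores)
print(adjacency)
```

In this example tracks 0 and 2 end up in the same subset through track 1, even though their direct score is below the threshold.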
Vertex reconstruction propagates to a number of downstream data analysis tasks, such as particle identification (a classification problem). Therefore, improvements to vertex reconstruction have a significant impact on the sensitivity of collider experiments. We consider multiple quantities to quantify the performance of the partitioning: the F1 score, the Rand Index (RI), and the Adjusted Rand Index (ARI). We consider three different types (or flavors) of particle sets (called jets) corresponding to three different fundamental data generating processes, labeled bottom-jets, charm-jets, and light-jets (B/C/L). The important distinction between the flavors is the typical number of partitions in each set. Figure 2(b) shows the distribution of the number of partitions (vertices) in each flavor: bottom jets typically have multiple partitions; charm jets also have multiple partitions, but fewer than bottom jets; and light jets typically have only one partition.
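For reference, the three metrics can all be computed from pair counts over the $\binom{n}{2}$ pairs of set elements. The sketch below uses the standard pair-counting definitions; whether the paper computes F1 in exactly this way is an assumption, and the function name `pair_metrics` is hypothetical.

```python
from itertools import combinations

def pair_metrics(truth, pred):
    """Return (F1, RI, ARI) for two partitions given as label lists of equal length."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    f1 = 1.0 if tp + fp + fn == 0 else 2 * tp / (2 * tp + fp + fn)
    ri = (tp + tn) / total
    # Adjusted Rand Index: pair agreement corrected for its expected value.
    expected = (tp + fn) * (tp + fp) / total
    max_index = 0.5 * ((tp + fn) + (tp + fp))
    ari = 1.0 if max_index == expected else (tp - expected) / (max_index - expected)
    return f1, ri, ari

f1, ri, ari = pair_metrics([0, 0, 1, 1], [0, 0, 0, 1])
print(f1, ri, ari)
```

For this example the pair counts are TP = 1, FP = 2, FN = 1, TN = 2, which gives F1 = 0.4, RI = 0.5 and ARI = 0.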
Dataset.
Algorithms for particle physics are typically designed with high-fidelity simulators, which can provide labeled training data. These algorithms are then applied to and calibrated with real data collected by the LHC experiments. Our simulated samples are created with a standard simulation package called PYTHIA (Sjöstrand et al., 2015) and the detector is simulated with DELPHES (de Favereau et al., 2014). We use this software to generate synthetic datasets for the three flavors of jets. The generated sets are small, ranging from 2 to 14 elements each.
Results
We compare the results of our model (S2G and S2G+), trained to maximize the F1 score, to a typical baseline algorithm used in particle physics, the Adaptive Vertex Reconstruction (AVR) algorithm (Waltenberger, 2011). We ran each model (except AVR) 11 times, and evaluated the model F1 score, RI and ARI over the test set for each run. The results are shown in Table 1. For bottom and charm jets, which have secondary vertices, all of our models reach comparable results and improve over the AVR baseline by about 10% in all performance metrics. For light-jets, which have no secondary decays, our models reach F1 scores similar to AVR.
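Table 1: Performance of the partitioning models on bottom (B), charm (C) and light (L) jets (mean ± standard deviation over runs).

| Flavor | Model | F1 | RI | ARI |
|---|---|---|---|---|
| B | AVR | 0.565 | 0.612 | 0.318 |
| B | MLP | 0.606 ± 0.001 | 0.672 ± 0.004 | 0.409 ± 0.004 |
| B | TRI | 0.595 ± 0.002 | 0.665 ± 0.004 | 0.387 ± 0.004 |
| B | S2G | 0.646 ± 0.003 | 0.736 ± 0.004 | 0.491 ± 0.006 |
| B | S2G+ | 0.655 ± 0.004 | 0.747 ± 0.006 | 0.508 ± 0.007 |
| C | AVR | 0.695 | 0.650 | 0.326 |
| C | MLP | 0.729 ± 0.001 | 0.694 ± 0.002 | 0.405 ± 0.004 |
| C | TRI | 0.718 ± 0.001 | 0.706 ± 0.003 | 0.414 ± 0.004 |
| C | S2G | 0.747 ± 0.001 | 0.727 ± 0.003 | 0.457 ± 0.004 |
| C | S2G+ | 0.751 ± 0.002 | 0.733 ± 0.003 | 0.467 ± 0.005 |
| L | AVR | 0.970 | 0.965 | 0.922 |
| L | MLP | 0.973 ± 0.001 | 0.970 ± 0.001 | 0.926 ± 0.003 |
| L | TRI | 0.904 ± 0.002 | 0.888 ± 0.002 | 0.758 ± 0.006 |
| L | S2G | 0.972 ± 0.001 | 0.970 ± 0.001 | 0.931 ± 0.003 |
| L | S2G+ | 0.971 ± 0.002 | 0.969 ± 0.002 | 0.929 ± 0.003 |
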
5.1.2 Learning Delaunay triangulations
In a second set-to-2-edge task we test our model’s ability to learn Delaunay triangulations, namely, given a set of planar points we want to predict the Delaunay edges between pairs of points; see e.g., (De Berg et al., 1997), Chapter 9. We generated planar point sets as training data and as test data; the point sets were uniformly sampled in the unit square, and a ground truth matrix in $\{0,1\}^{n \times n}$ was computed per point set using a Delaunay triangulation algorithm. The number of points in a set, $n$, is either fixed or varies and is randomly chosen per set. Training was stopped after 100 epochs. See more implementation details in the supplementary material. In Table 2 we report accuracy of prediction as well as precision, recall and F1 score. Evidently, both S2G and S2G+ achieve comparable results while outperforming the baseline MLP. See also Figure 4 for visualizations of several triangulations predicted with the trained model versus ground truth.

Table 2: Delaunay edge prediction. The two blocks of rows correspond to the two point-set size settings.

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| S2G | 0.984 | 0.927 | 0.926 | 0.926 |
| S2G+ | 0.983 | 0.927 | 0.925 | 0.926 |
| MLP | 0.939 | 0.769 | 0.647 | 0.702 |
| | | | | |
| S2G | 0.947 | 0.736 | 0.934 | 0.799 |
| S2G+ | 0.947 | 0.735 | 0.934 | 0.798 |
| MLP | 0.917 | 0.658 | 0.772 | 0.686 |

[Figure 4: example triangulations; panels show ground truth, S2G, S2G+, and baseline.]
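To illustrate what the ground-truth adjacency matrix encodes, here is a brute-force NumPy sketch based on the empty-circumcircle property: a triangle belongs to the Delaunay triangulation when no other point lies inside its circumcircle. This is an $O(n^4)$ illustration that assumes points in general position; it is not the algorithm used to produce the paper's data, and the function name `delaunay_adjacency` is hypothetical.

```python
import itertools
import numpy as np

def delaunay_adjacency(points, eps=1e-12):
    """Return the (n, n) 0/1 matrix of Delaunay edges of planar points of shape (n, 2)."""
    n = len(points)
    A = np.zeros((n, n), dtype=int)
    for i, j, k in itertools.combinations(range(n), 3):
        (ax, ay), (bx, by), (cx, cy) = points[i], points[j], points[k]
        det = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if abs(det) < eps:
            continue  # collinear triple: no circumcircle
        a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
        # Circumcenter (ux, uy) and squared circumradius r2 of the triangle.
        ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / det
        uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / det
        r2 = (ax - ux) ** 2 + (ay - uy) ** 2
        dist2 = (points[:, 0] - ux) ** 2 + (points[:, 1] - uy) ** 2
        others = np.ones(n, dtype=bool)
        others[[i, j, k]] = False
        if np.all(dist2[others] >= r2 - eps):  # no other point strictly inside
            for u, v in ((i, j), (j, k), (i, k)):
                A[u, v] = A[v, u] = 1
    return A

# A triangle with one interior point: every pair of points is a Delaunay edge.
inner = delaunay_adjacency(np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]))
# A convex quadrilateral: four sides plus exactly one diagonal.
quad = delaunay_adjacency(np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 1.0], [0.0, 2.0]]))
print(inner.sum() // 2, quad.sum() // 2)
```

For larger point sets a dedicated routine such as `scipy.spatial.Delaunay` would be the practical choice.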

5.2 Set-to-graphs
One of the main learning tasks in graph analysis is graph classification. Our goal here is to try to improve existing graph learning models, and in particular Graph Neural Networks (GNNs). Since existing GNNs are not graph-to-graph universal, we suggest the following procedure to potentially improve their performance: first, compute new node and edge features from the initial set of node features using $F^1$ and $F^2$ (resp.), and then concatenate these to the original input and feed this augmented input to the GNN.
For the set-to-graph model, we use $\phi$, as described above, and two different MLPs as $\psi$: one for the nodes (for $F^1$), and one for the edges (for $F^2$). For the GNN we have used two GNN variants: a message-passing neural network (Gilmer et al., 2017) (MPNN), and Provably Powerful Graph Networks (PPGN) (Maron et al., 2019a); implementation details can be found in the supplementary materials. We compared performance after training the same GNN in two ways: with the original input data, and with the S2G-augmented input data. We used graph datasets from (Wu et al., 2018) using the Open Graph Benchmark (29). All tasks considered are binary or multi-binary graph classification.
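The augmentation step amounts to computing extra node and edge features from the node features alone and concatenating them to the GNN input. The following NumPy sketch shows the shapes involved; the weights are random, each "MLP" is reduced to a single layer, and all sizes and names (`augment`, `W_node`, `W_edge`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d1, d_node, d_edge = 5, 3, 4, 2, 2
W_self, W_mean = rng.normal(size=(d_in, d1)), rng.normal(size=(d_in, d1))
W_node = rng.normal(size=(d1, d_node))      # node head (single layer for brevity)
W_edge = rng.normal(size=(2 * d1, d_edge))  # edge head (single layer for brevity)

def phi(X):
    """Shared set-to-set component (one DeepSets-style layer)."""
    return np.tanh(X @ W_self + X.mean(axis=0, keepdims=True) @ W_mean)

def beta(H):
    """Broadcast node embeddings to pairs: out[i, j] = [h_i, h_j]."""
    m, d = H.shape
    return np.concatenate([np.broadcast_to(H[:, None], (m, m, d)),
                           np.broadcast_to(H[None, :], (m, m, d))], axis=-1)

def augment(X, E):
    """Concatenate learned node and edge features to the original graph input."""
    H = phi(X)
    node_new = np.tanh(H @ W_node)        # set-to-1-edge output: per-node features
    edge_new = np.tanh(beta(H) @ W_edge)  # set-to-2-edge output: per-pair features
    return (np.concatenate([X, node_new], axis=-1),
            np.concatenate([E, edge_new], axis=-1))

X = rng.normal(size=(n, d_in))                              # original node features
E = rng.integers(0, 2, size=(n, n, 1)).astype(float)        # original edge features
X_aug, E_aug = augment(X, E)
print(X_aug.shape, E_aug.shape)
```

The augmented arrays `X_aug` and `E_aug` would then be passed to the GNN in place of the original node and edge inputs.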
Results are presented in Table 3. We report the mean ± standard deviation of the AUC-ROC of the models on the test sets. Note that in several datasets (bold in table) there is a significant performance boost when S2G augments MPNN, while for other datasets and for the PPGN model (which is more expressive than MPNN yet heavier computationally and memory-wise) there is no noticeable improvement.
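Table 3: Test AUC-ROC (mean ± standard deviation) of GNNs with and without S2G augmentation.

| Dataset | PPGN | PPGN+S2G | MPNN | MPNN+S2G |
|---|---|---|---|---|
| BBBP | 67.19 ± 1.72 | 66.50 ± 1.88 | 64.60 ± 2.23 | 65.45 ± 1.89 |
| BACE | 76.10 ± 0.60 | 77.31 ± 1.62 | 77.48 ± 2.60 | 76.40 ± 3.25 |
| TOXCAST | 62.04 ± 0.47 | 61.98 ± 0.68 | 56.95 ± 3.06 | 63.28 ± 0.78 |
| HIV | 75.09 ± 2.71 | 74.30 ± 3.18 | 69.38 ± 1.94 | 72.63 ± 2.48 |
| TOX21 | 72.88 ± 0.93 | 72.90 ± 0.93 | 67.33 ± 0.83 | 73.22 ± 0.79 |
| SIDER | 56.05 ± 1.87 | 57.47 ± 1.72 | 56.08 ± 1.23 | 61.74 ± 1.27 |
| CLINTOX | 75.75 ± 8.36 | 77.93 ± 5.78 | 58.98 ± 7.45 | 76.18 ± 3.42 |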
5.3 Set-to-3-edges
In the last experiment, we demonstrate learning of a set-to-3-edge function. The learning task we target is finding supporting triangles in the convex hull of a set of points in $\mathbb{R}^3$. In this scenario, the input is a point set, and the function we want to learn is $F^3$, whose output is a probability for each triplet of nodes (triangle) to belong to the triangular mesh that describes the convex hull of the point set.
Note that storing 3rd-order tensors in memory is not feasible, hence we concentrate on a local version of the problem: given a point set, identify the triangles within the $K$-nearest neighbors of each point that belong to the convex hull of the entire point set. Therefore, for broadcasting ($\beta$) from point data to 3-edge data, instead of holding a 3rd-order tensor in memory we broadcast only the subset of $K$-NN neighborhoods. This allows working with high-order information with a relatively low memory footprint. Furthermore, since we want to consider 3-edges (triangles) with no order, we used an invariant universal set model (DeepSets again) as $\psi$.
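The memory saving of the local version can be seen in a short sketch: instead of an $n \times n \times n$ tensor, features are gathered only for triplets drawn from each point's neighborhood. The exact candidate set used in the paper is not specified here, so this example assumes each candidate triangle consists of a point and two of its $K$ nearest neighbors; the function name `local_triplet_features` is hypothetical.

```python
import itertools
import numpy as np

def local_triplet_features(H, points, K):
    """Gather [h_i, h_j, h_l] for each point i and each pair (j, l) among its K nearest neighbors."""
    n = len(points)
    dist2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    knn = np.argsort(dist2, axis=1)[:, 1:K + 1]  # column 0 is the point itself
    triplets, feats = [], []
    for i in range(n):
        for j, l in itertools.combinations(sorted(knn[i]), 2):
            triplets.append((i, int(j), int(l)))
            feats.append(np.concatenate([H[i], H[j], H[l]]))
    return np.array(triplets), np.stack(feats)

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 3))  # point set in 3D
H = rng.normal(size=(8, 4))       # per-point embeddings (stand-in for the set-to-set output)
triplets, feats = local_triplet_features(H, points, K=3)
print(triplets.shape, feats.shape)
```

With $n = 8$ and $K = 3$ this stores $8 \cdot \binom{3}{2} = 24$ feature vectors rather than $8^3 = 512$; in general the count grows as $n\binom{K}{2}$ instead of $n^3$.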
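Table 4: Convex hull triangle prediction on spherical and Gaussian point sets.

| Data | Model | # points | F1 | AUC-ROC |
|---|---|---|---|---|
| Spherical | S2G | 30 | 0.780 | 0.988 |
| Spherical | MLP | 30 | 0.425 | 0.885 |
| Spherical | S2G | 50 | 0.686 | 0.975 |
| Spherical | MLP | 50 | 0.424 | 0.890 |
| Spherical | S2G | 20-100 | 0.535 | 0.953 |
| Spherical | MLP | 20-100 | 0.354 | 0.885 |
| Gaussian | S2G | 30 | 0.707 | 0.996 |
| Gaussian | MLP | 30 | 0.275 | 0.946 |
| Gaussian | S2G | 50 | 0.661 | 0.997 |
| Gaussian | MLP | 50 | 0.254 | 0.974 |
| Gaussian | S2G | 20-100 | 0.552 | 0.994 |
| Gaussian | MLP | 20-100 | 0.187 | 0.969 |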
We tested our models on two types of data: Gaussian and spherical. For both types we draw point sets in $\mathbb{R}^3$ i.i.d. from a standard normal distribution, where for the spherical data we normalize each point to unit length. We generated separate collections of point sets for training, validation and testing. Point sets are in $\mathbb{R}^{n \times 3}$, where $n$ is 30, 50, or varies between 20 and 100 (see Table 4). As a baseline, we used MLP. The F1 scores and AUC-ROC of the predicted convex hull triangles are shown in Table 4, where our models outperform the baseline. See Figure 5 for several examples of triangles predicted using our trained model compared to the ground truth.
Acknowledgments
HS, NS and YL were supported in part by the European Research Council (ERC Consolidator Grant, "LiftMatch" 771136), the Israel Science Foundation (Grant No. 1830/17) and by a research grant from the Carolito Stiftung (WAIC). JS and EG were supported by the NSF-BSF Grant 2017600 and the ISF Grant 2871/19. KC was supported by the National Science Foundation under the awards ACI-1450310, OAC-1836650, and OAC-1841471 and by the Moore-Sloan data science environment at NYU.
References

An improved deep learning architecture for person reidentification
. InProceedings of the IEEE conference on computer vision and pattern recognition
, pp. 3908–3916. Cited by: §1, §4.2.  Clustering with deep learning: taxonomy and new methods. arXiv preprint arXiv:1801.07648. Cited by: §2.

Learning visual similarity for product design with convolutional neural networks
. ACM Transactions on Graphics (TOG) 34 (4), pp. 98. Cited by: §1, §4.2.  Signature verification using a” siamese” time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §1, §4.2.
 On the equivalence between graph isomorphism testing and function approximation with gnns. arXiv preprint arXiv:1905.12560. Cited by: §1.
 Learning a similarity metric discriminatively, with application to face verification. In CVPR (1), pp. 539–546. Cited by: §1.
 Spherical cnns. arXiv preprint arXiv:1801.10130. Cited by: §2.
 Steerable CNNs. (1990), pp. 1–14. arXiv:1612.08498. Cited by: §2.
 Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §2.
 Computational geometry. In Computational geometry, pp. 1–17. Cited by: §5.1.2.
 DELPHES 3: a modular framework for fast simulation of a generic collider experiment. Journal of High Energy Physics 2014 (2). ISSN 1029-8479. Cited by: §5.1.1.
 Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660. Cited by: §2.
 3D object classification and retrieval with spherical cnns. arXiv preprint arXiv:1711.06721. Cited by: §2.
 Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. Cited by: §2, §5.2.
 Learning to cluster in order to transfer across domains and tasks. arXiv preprint arXiv:1711.10125. Cited by: §2, §4.2.
 Attention-based deep multiple instance learning. arXiv preprint arXiv:1802.04712. Cited by: §6.
 Meta-learning to cluster. arXiv preprint arXiv:1910.14134. Cited by: §2.
 Universal invariant and equivariant graph neural networks. CoRR abs/1905.04943. Cited by: §1, §2, §3.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.
 Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144. Cited by: §2.
 Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
 Provably powerful graph networks. arXiv preprint arXiv:1905.11136. Cited by: §1, §2, §5.2.
 Invariant and equivariant graph networks. In International Conference on Learning Representations. Cited by: §1, §2, §3, §4.2, §5.
 On the universality of invariant networks. arXiv preprint arXiv:1901.09342. Cited by: §1, §2, §4.2.
 Weisfeiler and Leman go neural: higher-order graph neural networks. arXiv preprint arXiv:1810.02244. Cited by: §1.
 Critical points of the singular value decomposition. SIAM journal on matrix analysis and applications 27 (2), pp. 459–473. Cited by: §4.1.
 Open graph benchmark (2019). https://ogb.stanford.edu/. Cited by: §5.2.
 Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §1, §2.
 Equivariance through parametersharing. arXiv preprint arXiv:1702.08389. Cited by: §2.
 A minimal set of generators for the ring of multisymmetric functions. In Annales de l’institut Fourier, Vol. 57, pp. 1741–1769. Cited by: §7.
 Universal approximations of permutation invariant/equivariant functions by deep neural networks. arXiv preprint arXiv:1903.01939. Cited by: §1, §3.
 On universal equivariant set networks. In International Conference on Learning Representations. Cited by: §1, §2, §3, §4.2, §4, §7.
 Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 118–126. Cited by: §1, §4.2.
 An introduction to PYTHIA 8.2. Computer Physics Communications 191, pp. 159–177. ISSN 0010-4655. Cited by: §5.1.1.
 Matrix perturbation theory. Cited by: §4.1.
 Attention is all you need. arXiv:1706.03762. Cited by: §6.
 Graph Attention Networks. pp. 1–12. arXiv:1710.10903. Cited by: §2.
 Localizing and orienting street views using overhead imagery. In European conference on computer vision, pp. 494–509. Cited by: §1, §4.2.
 RAVE: A detector-independent toolkit to reconstruct vertices. IEEE Trans. Nucl. Sci. 58, pp. 434–444. Cited by: §5.1.1.
 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. arXiv:1807.02547. Cited by: §2.
 Distance metric learning for large margin nearest neighbor classification. In Advances in neural information processing systems, pp. 1473–1480. Cited by: §5, §6.
 Cubenet: equivariance to 3d rotation and translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 567–584. Cited by: §2.
 Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037. Cited by: §2.
 MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §5.2.
 How powerful are graph neural networks?. In International Conference on Learning Representations. Cited by: §1, §2.
 Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306. Cited by: §4.2.
 Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361. Cited by: §1, §4.2.
 Deep sets. In Advances in neural information processing systems, pp. 3391–3401. Cited by: §1, §2, §3, §5, §6.
6 Architectures and hyperparameters
All of our models are compositions of three components applied in sequence: a set-to-set model, a non-learnable broadcasting set-to-graph layer, and a simple graph-to-graph network that uses only a single Multi-Layer Perceptron (MLP) acting on each edge feature vector independently. We note that all hyperparameters were chosen using the validation scores only.
Notation.
The phrase "DeepSets / MLP of widths" followed by an array means that we use a DeepSets/MLP network with one layer per array entry, where each layer's output feature size is the corresponding entry (e.g., for a three-entry array, the network has 3 layers whose output feature sizes are the first, second, and third entries, respectively). Between the layers we use ReLU as a nonlinearity.
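As an illustration of this notation, the following minimal NumPy sketch builds an MLP from a list of widths and applies ReLU between layers only. The helper names `init_mlp` and `mlp_forward`, the input dimension, the initialization scheme, and the widths `[8, 8, 3]` are hypothetical choices for the example and are not taken from the paper.

```python
import numpy as np

def init_mlp(in_dim, widths, seed=0):
    """Create one (weight, bias) pair per entry of `widths`."""
    rng = np.random.default_rng(seed)
    params = []
    for out_dim in widths:
        w = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
        b = np.zeros(out_dim)
        params.append((w, b))
        in_dim = out_dim  # the next layer consumes this layer's output
    return params

def mlp_forward(params, x):
    """Apply the layers in order, with ReLU between layers (not after the last)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

widths = [8, 8, 3]  # hypothetical "MLP of widths [8, 8, 3]": 3 layers
params = init_mlp(in_dim=4, widths=widths)
out = mlp_forward(params, np.ones((5, 4)))  # 5 inputs with 4 features each
```

The last entry of the array fixes the output feature size, so `out` has one row per input and three columns here.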
Partitioning for particle physics applications.
In our models, the set-to-set component is implemented using DeepSets (Zaheer et al., 2017) with 5 layers of width . The broadcasting layer broadcasts the node features in one of the following ways: for model S2G it creates 2 features for each input feature, and for S2G+ it creates 5 features for each input feature (2 of the 5 vanish outside the diagonal). The graph-to-graph component is implemented with a 2-layer edge-wise MLP of widths , whose output is the edge probability. As a loss, we used a combination of a soft F1 score loss and an edge-wise binary cross-entropy loss.
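A combined loss of this kind can be sketched as follows. This is only an illustration under stated assumptions: the soft F1 score is computed by replacing hard counts with predicted probabilities, and the two terms are added with a hypothetical weight `f1_weight`; the text does not specify how the two losses are weighted.

```python
import numpy as np

def soft_f1_loss(probs, targets, eps=1e-8):
    """1 - soft F1, where true positives are counted with probabilities."""
    soft_tp = np.sum(probs * targets)
    soft_f1 = 2.0 * soft_tp / (np.sum(probs) + np.sum(targets) + eps)
    return float(1.0 - soft_f1)

def bce_loss(probs, targets, eps=1e-8):
    """Mean edge-wise binary cross-entropy."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))))

def combined_loss(probs, targets, f1_weight=1.0):
    # Assumption: a plain weighted sum of the two terms.
    return bce_loss(probs, targets) + f1_weight * soft_f1_loss(probs, targets)

# Toy 3-element set with clusters {0, 1} and {2}, encoded as an edge indicator matrix.
targets = np.array([[1.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
uniform = np.full_like(targets, 0.5)
loss_perfect = combined_loss(targets, targets)
loss_uniform = combined_loss(uniform, targets)
```

A perfect prediction drives both terms to (numerically) zero, while an uninformative prediction of 0.5 on every edge is penalized by both.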
Instead of using max or sum pooling in the DeepSets layers, we used a self-attention mechanism based on Ilse et al. (2018) and Vaswani et al. (2017):
(10) 
Here, the two learnable maps are each implemented by a single MLP of width .
We ran a grid search over the following hyperparameters: learning rate in the range of , DeepSets layer widths of , number of layers of , edge-wise MLP widths of , and whether or not to use the attention mechanism in DeepSets. We trained for 250 epochs with early stopping based on the validation score, a batch size of 2048, and the Adam optimizer (Kingma and Ba, 2014). Our models train in less than 2 hours on a single Tesla V100 GPU.
The deep learning baselines are implemented as follows. MLP is implemented similarly to S2G, except that instead of using DeepSets as the set-to-set component, we use an MLP of widths . TRI uses a siamese MLP of widths to extract node features, and the edge logits are the l2 distances between the nodes. As a loss, we use a triplet loss (Weinberger et al., 2006): we draw random triplets (anchor, neg, pos), where anchor and pos belong to the same cluster and neg belongs to a different cluster, and the loss is defined over these triplets. (A natural disadvantage of the triplet loss is that it cannot learn from sets with a single cluster, or from sets of size 2.)
The dataset consists of training, validation and test sets with 543544, 181181 and 181182 instances, respectively. Each of the sets contains all three flavors (bottom, charm and light jets) in roughly equal amounts, while the flavor of each instance is not part of the input.
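For the TRI baseline, triplet sampling and a triplet loss can be sketched as below. The exact loss formula is not reproduced in this text, so the sketch swaps in the standard hinge form with a hypothetical `margin` parameter; the function names are also illustrative. The sampler returns `None` exactly in the cases noted above, where no valid triplet exists.

```python
import numpy as np

def sample_triplet(labels, rng):
    """Draw (anchor, pos, neg) indices, or None if no valid triplet exists."""
    labels = np.asarray(labels)
    idx = range(len(labels))
    # An anchor needs a same-cluster partner and at least one other-cluster element.
    anchors = [i for i in idx
               if np.sum(labels == labels[i]) > 1 and np.any(labels != labels[i])]
    if not anchors:
        return None
    a = int(rng.choice(anchors))
    pos_pool = [j for j in idx if j != a and labels[j] == labels[a]]
    neg_pool = [j for j in idx if labels[j] != labels[a]]
    return a, int(rng.choice(pos_pool)), int(rng.choice(neg_pool))

def triplet_loss(emb, a, p, n, margin=1.0):
    """Hinge triplet loss on l2 distances (assumed form; margin is hypothetical)."""
    d_pos = np.linalg.norm(emb[a] - emb[p])
    d_neg = np.linalg.norm(emb[a] - emb[n])
    return float(max(0.0, d_pos - d_neg + margin))

rng = np.random.default_rng(0)
labels = [0, 0, 1, 1]
triplet = sample_triplet(labels, rng)
emb = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]])
```

With `emb` as above, a well-separated triplet (anchor 0, pos 1, neg 2) incurs zero loss, while swapping the roles of pos and neg is penalized.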
Each model is evaluated 11 times for stability, in the same manner: (1) training on the dataset, with early stopping based on the F1 score over the validation set; (2) predicting the clusters of the test set; (3) separating the 3 flavors and calculating the metrics for each flavor. In the end, we have 11 scores for each combination of metric, flavor and model, and we report the mean ± std. Note that AVR is evaluated only once since it is not a learning algorithm.
Learning Delaunay triangulation.
In our models, the set-to-set component is implemented with 7 layers of width . The broadcasting layer is as before for models S2G and S2G+, thus ending with 160 or 400 features per edge, respectively. The graph-to-graph component is implemented with a 2-layer edge-wise MLP of widths , whose output is the edge probability. We use an edge-wise binary cross-entropy loss. The implementation of the MLP baseline is identical except for the set-to-set component, which is an MLP with the same number of layers and widths.
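The feature counts above can be checked with a small sketch of the broadcasting step. It rests on assumptions: the node feature width of 80 is inferred from 160 / 2 = 400 / 5, and the particular five features used for S2G+ (row copy, column copy, set sum everywhere, node feature on the diagonal, set sum on the diagonal) are one natural equivariant choice consistent with "2 of the 5 vanish outside the diagonal", not a confirmed description of the authors' layer.

```python
import numpy as np

def broadcast_s2g(x):
    """Map node features (n, d) to edge features (n, n, 2d): [x_i, x_j]."""
    n, d = x.shape
    rows = np.broadcast_to(x[:, None, :], (n, n, d))  # entry (i, j) holds x_i
    cols = np.broadcast_to(x[None, :, :], (n, n, d))  # entry (i, j) holds x_j
    return np.concatenate([rows, cols], axis=-1)

def broadcast_s2g_plus(x):
    """Map node features (n, d) to edge features (n, n, 5d) (assumed basis)."""
    n, d = x.shape
    total = x.sum(axis=0)                 # permutation-invariant set summary
    eye = np.eye(n)[:, :, None]           # 1 on the diagonal, 0 elsewhere
    rows = np.broadcast_to(x[:, None, :], (n, n, d))
    cols = np.broadcast_to(x[None, :, :], (n, n, d))
    everywhere = np.broadcast_to(total, (n, n, d))
    diag_x = eye * x[:, None, :]          # x_i on the diagonal only
    diag_sum = eye * total                # set sum on the diagonal only
    return np.concatenate([rows, cols, everywhere, diag_x, diag_sum], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 80))              # 4 points, inferred width 80
edge_s2g = broadcast_s2g(x)               # 2 * 80 = 160 features per edge
edge_s2g_plus = broadcast_s2g_plus(x)     # 5 * 80 = 400 features per edge
```

Neither function has learnable parameters, and both commute with permutations of the input set: reordering the points reorders the rows and columns of the output in the same way.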
We searched over learning rates from .
Set to graph.
Our models include set2graph followed by a GNN, trained together (as a single network). In our models, the set-to-set component is implemented using DeepSets with 5 layers of width . The broadcasting layer is that of S2G, thus ending with 20 features. The graph-to-graph component uses 2 different MLPs for the nodes and edges (i.e., the diagonal and off-diagonal entries), each of widths , whose output is the probability. (The use of 2 different MLPs for the diagonal and off-diagonal is necessary to separate the learning of the set-to-set function from that of the set-to-2-edges function.) The GNN that follows is either PPGN with 3 layers, or MPNN with 3 layers with shared weights. As a loss, we use a binary cross-entropy loss on the targets.
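The use of separate MLPs for diagonal and off-diagonal entries can be sketched as follows. This is a minimal illustration, not the authors' code: the hidden width 16, the sigmoid output, and the helper names are assumptions; only the 20 input features per entry come from the text.

```python
import numpy as np

def mlp(params, x):
    """Dense layers with ReLU between them (none after the last layer)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def random_params(rng, in_dim, widths):
    params = []
    for out_dim in widths:
        params.append((rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim),
                       np.zeros(out_dim)))
        in_dim = out_dim
    return params

def node_edge_head(feats, node_params, edge_params):
    """Apply the node MLP on the diagonal and the edge MLP off the diagonal."""
    n = feats.shape[0]
    on_diag = np.eye(n, dtype=bool)[:, :, None]
    logits = np.where(on_diag, mlp(node_params, feats), mlp(edge_params, feats))
    return sigmoid(logits)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 20))              # 20 features per entry
node_params = random_params(rng, 20, [16, 1])    # hypothetical widths
edge_params = random_params(rng, 20, [16, 1])    # hypothetical widths
probs = node_edge_head(feats, node_params, edge_params)
```

Because the two parameter sets are disjoint, gradients from node targets and edge targets would update different MLPs, which is the separation the parenthetical remark above refers to.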
We used a grid search as following: For all models, learning rate from