On Deep Set Learning and the Choice of Aggregations

03/18/2019 ∙ by Maximilian Soelch, et al. ∙ 4

Recently, it has been shown that many functions on sets can be represented by sum decompositions. These decompositions easily lend themselves to neural approximations, extending the applicability of neural nets to set-valued inputs: Deep Set learning. This work investigates a core component of the Deep Set architecture: aggregation functions. We suggest and examine alternatives to commonly used aggregation functions, including learnable recurrent aggregation functions. Empirically, we show that Deep Set networks are highly sensitive to the choice of aggregation functions: beyond improved performance, we find that learnable aggregations lower hyper-parameter sensitivity and generalize better to out-of-distribution input sizes.




1 Introduction

Machine learning algorithms make implicit assumptions on the data set encoding. For instance, feed-forward neural networks assume that data is encoded in a unique vector representation, e.g., by one-hot encoding categorical variables. Yet, many interesting learning tasks revolve around data sets consisting of sets: depth vision with 3D point clouds, probability distributions represented by finite samples, or operations on unstructured sets of tags [16, 26, 21].

Naively, a population (to disambiguate terms like set and sample, we discuss data sets of populations of particles) is embedded by ordering and concatenating particle vectors into a matrix. While standard neural networks can learn to imitate order-invariant behavior, e.g. by random input permutation at each gradient step, such architectures are no true set functions. Further, they cannot easily handle varying population sizes. This motivated research into order-invariant neural architectures [24, 6, 4, 20]. From this, the Deep Set framework emerged, proving that many interesting invariant functions allow for a sum decomposition [29, 18, 25]. It allows for straightforward application of neural networks that are order-invariant by design, and can handle varying population sizes.

In this work, we study aggregations—the component of a Deep Set architecture that induces order invariance by mapping a variable-sized population to a fixed-sized description. After discussing desirable properties and extending the theory around aggregation functions, we suggest multiple alternatives, including learnable recurrent aggregation functions. Studying them in several experimental settings, we find that the choice of aggregation impacts not only the performance, but also hyper-parameter sensitivity and robustness to varying population sizes. In the light of these findings, we argue for new evaluation techniques for neural set functions.

2 Order-Invariant Deep Architectures

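We discuss populations $X = \{x_1, \dots, x_N\}$ of particles $x_i$ from a particle space $\mathcal{X}$. We are further interested in matrix representations $\mathbf{X}$ achieved by concatenating the particles of $X$. A permutation of the particle axis with a permutation $\pi$ is denoted by $\pi(\mathbf{X})$; it reorders $\mathbf{X}$ but leaves the underlying population $X$ unchanged. Data sets consist of finite populations of potentially varying size.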

2.1 Invariance, Equivariance, and Decomposition of Invariant Functions

We study invariant functions according to

Definition 1 (Invariance)

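A function $f$ on the power set $2^{\mathcal{X}}$ is order-invariant if for any permutation $\pi$ and input $X = \{x_1, \dots, x_N\}$

$$f(x_1, \dots, x_N) = f(x_{\pi(1)}, \dots, x_{\pi(N)}).$$

If it is clear from the context, we will call such functions invariant. When the input is embedded as a matrix $\mathbf{X}$, definition 1 can be formulated as $f(\pi(\mathbf{X})) = f(\mathbf{X})$. A related, important notion is that of equivariant functions: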

Definition 2 (Equivariance)

A function $f$ is equivariant if input permutation results in equivalent output permutation, i.e., $f(\pi(\mathbf{X})) = \pi(f(\mathbf{X}))$ for any permutation $\pi$ and input $\mathbf{X}$.

In [29], a defining structural property of order-invariant functions was proven:

Theorem 2.1 (Deep Sets, [29])

A function $f$ on populations $X$ from a countable particle space $\mathcal{X}$ is invariant if and only if there exists a decomposition

$$f(X) = \rho\Big(\sum_{x \in X} \phi(x)\Big)$$

with appropriate functions $\phi$ and $\rho$.

We call such functions sum-decomposable; this follows [25], where severe pathologies for uncountable input spaces are pointed out:

  1. There exist invariant functions that have no sum decomposition.

  2. There exist sum decompositions that are everywhere-discontinuous.

  3. Even relevant functions such as max cannot be continuously decomposed when the image space of the embedding $\phi$ is smaller than the population size $N$.

As a consequence they refine theorem 2.1 to

Theorem 2.2 (Uncountable Particle Spaces, [25])

A continuous function $f$ on finite populations $X$, $|X| \le M$, is invariant if and only if it is sum-decomposable via $\mathbb{R}^M$.

That is, for arbitrary $M$, the image space of $\phi$ has to have at least dimension $M$, which is both necessary and sufficient. More restrictive in scope than theorem 2.1, it is more applicable in practice, where most function approximators (neural networks, Gaussian processes) are continuous.
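The sum decomposition of theorem 2.1 can be illustrated with a minimal sketch. The embedding `phi` and the read-out `rho` below are hand-picked, hypothetical stand-ins for the neural networks used in practice; the point is only that summing the per-particle embeddings removes any dependence on particle order.

```python
import itertools
import math


def phi(x):
    # Hypothetical per-particle embedding into a 2-dimensional space.
    return (x, x * x)


def rho(z):
    # Hypothetical read-out applied to the summed embedding.
    s1, s2 = z
    return math.tanh(s1) + 0.5 * s2


def sum_decomposition(population):
    # f(X) = rho(sum_x phi(x)); the sum discards the particle order.
    embedded = [phi(x) for x in population]
    summed = tuple(math.fsum(dim) for dim in zip(*embedded))
    return rho(summed)


population = [0.3, -1.2, 2.0, 0.7]
# Evaluate the function on every ordering of the same population.
outputs = [sum_decomposition(p) for p in itertools.permutations(population)]
```

All 24 orderings yield the same value, and the same function accepts populations of any size without modification.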

2.2 Deep Sets



Figure 1: (a) Deep Set framework: Deep Set architecture, eqs. 2 to 5, with a single equivariant layer, eq. 1. (b) Recurrent aggregation: recurrent aggregation function, definition 3. Queries to memory are produced in a forward pass, responses are aggregated in a backward pass.

A generic invariant neural architecture emerges from theorems 2.1 and 2.2 by using neural networks for $\phi$ and $\rho$, respectively. In practice, to allow for higher-level particle interaction during the embedding $\phi$, equivariant neural layers are introduced [29],


where denotes a per-particle feed-forward layer, and denotes an aggregation. Aggregations, our object of study, induce invariance by mapping a population to a fixed-size description, typically sum, mean, or max. The full architecture is


with $\phi$ implemented by a per-particle embedding followed by an equivariant combination function consisting of equivariant layers. Summation is replaced by a generic aggregation operation. In [29, 18], the max operation is suggested as an alternative to summation. Lastly, $\rho$ can be implemented by arbitrary functions, since the aggregation in eq. 4 is already invariant. This framework is depicted in fig. 1(a).
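To make the roles of the components concrete, the following sketch builds a tiny Deep Set on scalar particles. It is not the layer of [29]: the layer form (a per-particle term plus a weighted invariant summary, followed by a ReLU), the weights `w_self` and `w_agg`, and the affine embedding and read-out are all assumptions chosen for illustration.

```python
import math


def mean(values):
    return math.fsum(values) / len(values)


def equivariant_layer(population, agg, w_self=1.0, w_agg=-0.5):
    # Each particle is transformed individually and combined with one
    # invariant summary of the whole population, then passed through a ReLU.
    pooled = agg(population)
    return [max(0.0, w_self * x + w_agg * pooled) for x in population]


def deep_set(population, agg=max):
    embedded = [2.0 * x for x in population]   # per-particle embedding
    hidden = equivariant_layer(embedded, agg)  # equivariant combination
    return 3.0 * agg(hidden) + 1.0             # read-out on the aggregation


population = [0.5, 2.0, -1.0]
permuted = [population[i] for i in (2, 0, 1)]
```

Permuting the input of `equivariant_layer` permutes its output in the same way, while `deep_set` returns the same value for `population` and `permuted`. Swapping `agg=max` for `agg=mean` changes the aggregation without touching any other component.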

2.3 Order Matters

Recurrent neural networks can handle set-valued input by feeding one particle at a time. However, it has been shown that the result is sensitive to order, and an invariant read-process-write architecture has been suggested as a remedy [24]:

\begin{align}
q_t &= \mathrm{LSTM}(q_{t-1}, r_{t-1}) \\
\hat{a}_{i,t} &= \mathrm{attention}(m_i, q_t) \quad (= m_i^\top q_t) \\
a_t &= \mathrm{softmax}(\hat{a}_t) \\
r_t &= \sum_i a_{i,t} m_i \\
r &= r_T
\end{align}

An embedded memory $\{m_i\}$ is queried with $q_t$. The invariant result $r_t$ is iteratively used to refine subsequent queries with an LSTM [8]. It is not obvious how to cast the recurrent structure into the setting of theorems 2.1 and 2.2 and eqs. 2 to 5. To the best of our knowledge, this model has only been discussed in its sequence-to-sequence context. We will revisit and refine this architecture in section 3.2.
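The read-process-write loop can be sketched in a few lines. This is a simplified illustration rather than the model of [24]: the LSTM is replaced by a fixed, hypothetical tanh recurrence, and the number of steps and the zero initial state are assumptions. What it does show is why the result is order-invariant: each attention score depends only on one memory entry and the query, and the weighted sum over the memory ignores order.

```python
import itertools
import math


def dot(a, b):
    return math.fsum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = math.fsum(exps)
    return [e / total for e in exps]


def read_process_write(memory, steps=3):
    dim = len(memory[0])
    q = [0.0] * dim
    r = [0.0] * dim
    for _ in range(steps):
        # Stand-in for the LSTM: a fixed tanh recurrence on (q, r).
        q = [math.tanh(0.5 * qi + ri + 0.1) for qi, ri in zip(q, r)]
        # Dot-product attention scores, normalized over the memory.
        a = softmax([dot(m, q) for m in memory])
        # Attention-weighted sum over the memory (order-invariant).
        r = [math.fsum(ai * m[d] for ai, m in zip(a, memory)) for d in range(dim)]
    return r


memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
outputs = [read_process_write(list(p)) for p in itertools.permutations(memory)]
```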

2.4 Further Related Work

Several papers introduce and discuss a Deep Set framework for dealing with set-valued inputs [18, 29]. A driving force behind research into order-invariant neural networks is point cloud processing [19, 17, 18], where such architectures are used to perform classification and semantic segmentation of objects and scenes represented as point clouds in $\mathbb{R}^3$. It is further shown that a max decomposition allows for arbitrarily close approximation [18].

Generative models of sets have been investigated: in an extension of variational auto-encoders [11, 22], the inference of latent population statistics resembles a Deep Sets architecture [4]. Generative models of point clouds are proposed by [1] and [28].

Permutation-invariant neural networks have been used for predicting dynamics of interacting objects [6]. The authors propose to embed the individual object positions in pairs using a feed-forward neural network. Similar pairwise approaches have been investigated by [3, 2], and applied to relational reasoning in [23].

Weighted averages based on attention have been proposed and applied to multi-instance learning [10]. Several works have focused on higher-order particle interaction, suggesting computationally efficient approximations of Janossy pooling [15], or proposing set attention blocks as an alternative to equivariant layers [14].

3 The Choice of Aggregation

The invariance of the Deep Set architecture emerges from invariance of the aggregation function, eq. 4. Theorem 2.1 theoretically justifies summing the embeddings $\phi(x)$. In practice, mean or max-pooling operations are used. Equally simple and invariant, they are numerically favorable for varying population sizes, controlling input magnitude to downstream layers. This section discusses alternatives and their properties.

3.1 Alternative Aggregations

We start by justifying alternative choices with an extension of theorems 2.2 and 2.1:

Corollary 1 (Sum Isomorphism)

Theorems 2.2 and 2.1 can be extended to aggregations of the form , summations in an isomorphic space.


From , sum decompositions can be constructed from -decompositions and vice versa.

This class includes, e.g., the mean and logsumexp (LSE), $\log \sum_x \exp(x)$, which is a summation in exponential space.

Figure 2: Contour plots, panels (a) to (f), for max (left), sum (right), and logsumexp (LSE) on two inputs. For large ranges, LSE acts like max, shifting towards sum with decreasing input range. Matching square boxes indicate zoom between plots. Plots (a), (c), and (f), which cover the same input range, share contour levels.

In that light, there is an interesting case to be made for LSE: depending on the input magnitudes, LSE can behave akin to max (cf. figs. 2(a) to 2(c)) or like a linear function akin to summation (cf. figs. 2(d) to 2(f)). Operating in log space, LSE further exhibits diminishing returns: $N$ identical scalar particles $x$ yield $\mathrm{LSE}(x, \dots, x) = x + \log N$. The larger $N$, the smaller the output change from additional particles. Beyond making LSE a numerically useful aggregation, diminishing returns are a desirable property from a statistical perspective, where we would like to have asymptotically consistent results.
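These properties of logsumexp are easy to check numerically. The sketch below uses the standard max-shifted formulation for numerical stability; the function name and the test inputs are chosen here for illustration only.

```python
import math


def logsumexp(values):
    # Shift by the maximum so that exp never overflows.
    top = max(values)
    return top + math.log(math.fsum(math.exp(v - top) for v in values))


# Diminishing returns: N identical particles x give x + log(N).
x = 1.5
identical = [logsumexp([x] * n) for n in (1, 10, 100)]

# Large input range: the largest particle dominates, as with max.
max_like = logsumexp([0.0, 100.0])

# Small input range: close to an affine function of the sum,
# log(2) + (a + b) / 2 for two inputs a and b.
a, b = 0.01, 0.02
sum_like = logsumexp([a, b])
```

Going from 1 to 10 particles shifts the output by `log(10)`, and going from 10 to 100 particles shifts it by the same amount again, so each additional particle contributes less and less.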

Divide and Conquer

Commutative and associative binary operations like addition and multiplication yield invariant aggregations. Widening this perspective, we see that divide-and-conquer style operations yield invariant aggregations: order invariance is equivalent to conquering being invariant to division. Examples beyond the previously mentioned operations are logical operators such as any or all, but also sorting (generalizing max and min, and any percentile, e.g., the median). While impractical for typical first-order optimization, we note that aggregations can be of very sophisticated nature.
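The divide-and-conquer view can be sketched as a recursive reduction that splits the population at an arbitrary point and merges the partial results. The helper names `conquer` and `merge_sorted` are hypothetical; the split point is drawn at random to emphasize that, for a commutative and associative merge, the result does not depend on how the population is divided or ordered.

```python
import operator
import random


def conquer(op, population, rng):
    # Recursively split the population at a random point and merge the
    # partial results with the binary operation op.
    if len(population) == 1:
        return population[0]
    split = rng.randrange(1, len(population))
    left = conquer(op, population[:split], rng)
    right = conquer(op, population[split:], rng)
    return op(left, right)


def merge_sorted(a, b):
    # Merging two sorted lists; sorting generalizes max, min, and percentiles.
    return sorted(a + b)


rng = random.Random(0)
population = [3, 1, 4, 1, 5, 9, 2, 6]
shuffled = population[:]
rng.shuffle(shuffled)

largest = conquer(max, shuffled, rng)
total = conquer(operator.add, shuffled, rng)
ordered = conquer(merge_sorted, [[x] for x in shuffled], rng)
median = (ordered[3] + ordered[4]) / 2
```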

3.2 Learnable Aggregation Functions

In [29], cf. eqs. 2 to 5, the aggregation is the only non-learnable component. We will now investigate ways to render the aggregations learnable. As seen in section 2.3, recurrent architectures as suggested by [24] have been overlooked, since it is not straightforward to cast them into the Deep Sets framework of theorem 2.1. Inspired by the read-process-write architecture, we suggest recurrent aggregations:

Definition 3 (Recurrent and Query Aggregation)

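A recurrent aggregation is a function that can be written recursively as:

\begin{align}
q_t &= \mathrm{query}(q_{t-1}, r_{t-1}) \\
\hat{a}_{i,t} &= \mathrm{attention}(m_i, q_t) \\
a_t &= \mathrm{normalize}(\hat{a}_t) \\
r_t &= \mathrm{reduce}(\{a_{i,t} m_i\}) \\
r &= g(r_1, \dots, r_T),
\end{align}

where $\{m_i\}$ is an embedding of the input population and the initial state $(q_0, r_0)$ is a constant. We further call the special case $T = 1$ (a single query $q_1$) a query aggregation.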

As long as reduce is invariant and normalize is equivariant, recurrent and query aggregations are invariant. This architectural block is depicted in fig. 1(b).

Building upon the read-process-write equations of section 2.3, recurrent aggregations introduce two modifications. Firstly, we replace the weighted sum by a general weighted aggregation, giving us a rich combinatorial toolbox on the basis of simple invariant functions such as those mentioned in section 3.1. Secondly, we add post-processing of the step-wise results $r_t$. In practice, we use another recurrent network layer that processes $r_1, \dots, r_T$ in reversed order. Without this modification, later queries tend to be more important, as their result is not as easily forgotten by the forward recurrence. The backward processing reverses this effect, so that the first queries tend to be more important, and the overall architecture is more robust to common fallacies of recurrent architectures, in particular unstable gradients.

Observing definition 3, we note that our learnable aggregation functions wrap around the previously discussed simpler non-learnable aggregations. A major benefit is that the inputs are weighted: sum becomes weighted average, for instance. This also allows the model to effectively exploit non-linearities as discussed with LSE (cf. fig. 2).
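A recurrent aggregation in the sense of definition 3 can be sketched on scalar embeddings as follows. All concrete choices are assumptions made for brevity: the query and the backward post-processing are fixed tanh recurrences instead of learned recurrent layers, attention is a product of embedding and query, normalization is a softmax, and the initial state is zero. The `reduce` argument is the pluggable invariant aggregation.

```python
import itertools
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = math.fsum(exps)
    return [e / total for e in exps]


def logsumexp(values):
    top = max(values)
    return top + math.log(math.fsum(math.exp(v - top) for v in values))


def recurrent_aggregation(memory, reduce=math.fsum, steps=3):
    q, r, results = 0.0, 0.0, []
    for _ in range(steps):
        q = math.tanh(0.5 * q + r + 0.1)                      # query
        weights = softmax([m * q for m in memory])            # attention, normalize
        r = reduce([w * m for w, m in zip(weights, memory)])  # invariant reduce
        results.append(r)
    h = 0.0
    for r_t in reversed(results):                             # g: backward pass
        h = math.tanh(0.5 * h + r_t)
    return h


memory = [0.3, -1.2, 2.0, 0.7]
orderings = list(itertools.permutations(memory))
by_reduce = {
    name: [recurrent_aggregation(list(p), fn) for p in orderings]
    for name, fn in (("sum", math.fsum), ("max", max), ("lse", logsumexp))
}
query_aggregation = recurrent_aggregation(memory, steps=1)
```

With `reduce=math.fsum` each step computes an attention-weighted average; exchanging it for `max` or `logsumexp` keeps the block invariant while changing its inductive bias. Setting `steps=1` gives the query aggregation special case.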

3.3 A Note on Universal Approximation

The key promise of universal approximation is that a family of approximators (neural nets, or neural sum decompositions) is dense within a wider family of interesting functions [12, 7, 9]. The universality granted by theorems 2.1 and 2.2, through constructive proofs, hinges on sum aggregation. Corollary 1 grants flexibility, but does not apply to arbitrary aggregations, like max or the suggested learnable aggregations. (Note that max allows for arbitrary approximation [18].) It remains open to what extent the sum can be replaced. As such, the suggested architectures might not grant universal approximators. As we will see in section 4, however, they provide useful inductive biases in practical settings, much like feed-forward neural nets are usually replaced with architectures targeted towards the task. It is worth noting that the embedding dimension constraint of theorem 2.2 is rarely met, trading theoretical guarantees for test-time performance.

4 Experiments

We consider three simple aggregations: mean (or weighted sum), max, and LSE. These are used in equivariant layers and final aggregations, and may be wrapped into a recurrent aggregation. This combinatorially large space of configurations is tested in four experiments described in the following sections.

4.1 Minimal Enclosing Circle

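Figure 3: Minimal enclosing circle example population.

Figure 4: Minimal enclosing circle results. Row labels: recurrent equivariant layers / recurrent aggregation (✓ used, ✗ not used).

rec. equiv. / rec. agg. | best MSE | radius MSE | center MSE | median best MSE
✗ / ✗                   | 0.71     | 0.06       | 0.66       | 1.57
✗ / ✓                   | 1.02     | 0.14       | 0.88       | 1.30
✓ / ✗                   | 0.54     | 0.08       | 0.47       | 0.87
✓ / ✓                   | 0.42     | 0.09       | 0.33       | 0.58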

In this supervised experiment, we are trying to predict the minimal enclosing circle of a population sampled from a Gaussian mixture model (GMM). A sample population with target circle is depicted in fig. 3. The sample mean does not approximate the center of the minimal enclosing circle well, and the correct solution is defined by at least three particles. The models are trained by minimizing the mean squared error (MSE) towards the center and radius of the true circle (computable in linear time [27]).

Results are given in fig. 4. Each row shows the best result out of 180 runs (20 runs for each of the 9 combinations of aggregations). We can see that both recurrent equivariant layers and recurrent aggregations improve the performance, with equivariant layers granting the larger performance boost. The challenge lies mostly in a better approximation of the center.

The top row indicates that an entirely non-recurrent model performs better than its counterpart with recurrent aggregation (second row). To test for a performance outlier, we compute a bootstrap estimate of the expected peak performance when only performing 20 experiments: we subsample all available experiments (with replacement) into several sets of 20 experiments, recording the best performance in each batch. The last column in fig. 4 reports the median of these best batch performances. It improves monotonically from the top row to the bottom row, showing that the recurrent variants are more robust to hyper-parameters, despite having more of them.
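The bootstrap estimate of expected peak performance can be sketched as below. The function name, the number of bootstrap batches, and the two loss lists are hypothetical; only the procedure (resample batches of 20 runs with replacement, keep the best run per batch, report the median) follows the description above.

```python
import random
import statistics


def expected_peak(losses, batch_size=20, n_batches=1000, seed=0):
    # Resample batches with replacement, keep the best (lowest) loss of
    # each batch, and report the median over all batches.
    rng = random.Random(seed)
    peaks = [min(rng.choices(losses, k=batch_size)) for _ in range(n_batches)]
    return statistics.median(peaks)


# Hypothetical results of 180 runs for two configurations:
# one with a single lucky outlier, one that is consistently decent.
lucky = [1.0] * 179 + [0.1]
robust = [0.5] * 180

best_lucky, best_robust = min(lucky), min(robust)
peak_lucky, peak_robust = expected_peak(lucky), expected_peak(robust)
```

Judged by the single best run, the `lucky` configuration looks better. The bootstrap median reverses the ranking, because a batch of 20 runs rarely contains the outlier.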

4.2 GMM Mixture Weights

Figure 5: Results for the Gaussian mixture model mixture weights experiment. (a) Left: example population. Middle and right: estimator development for increasing population size for a non-learnable and a learnable model, with 50% and 90% empirical confidence intervals. (b) Robustness analysis. The metric is the score ratio of the true mixture weight under a neural model compared to expectation maximization (a negative sign indicates that EM is outperformed; the more negative, the better). Each violin shows the peak performance distribution for batches of 5 experiments. Top row: equivariant layer aggregations. Bottom row: final aggregations.

In this experiment, our goal is to estimate the mixture weights of a Gaussian mixture model directly from particles. The GMM populations in our data set are sampled as follows: each mixture consists of two components; the mixture weights are sampled at random; the means span a diameter of the unit circle, with their position drawn uniformly at random; the component variances are fixed to the same diagonal value such that the clusters are not linearly separable. An example population is shown in fig. 5(a). The model outputs the concentrations $\alpha$ and $\beta$ of a Beta distribution. We train to maximize the log-likelihood of the smaller ground truth weight under this Beta distribution. At training time, the batch population size is chosen randomly for every gradient step. In fig. 5(a), we show how an estimator based on the learned model behaves with growing population size.

We were again interested in the robustness of the models. We compare to expectation maximization (EM)—the classic estimation technique for mixture weights—as a baseline by gathering 100 estimates each from EM and the model for each population size by subsampling (with replacement) the original population. Then we compare the likelihood of the true weight under a kernel density estimate (KDE) of these estimates. The final metric is the log ratio of the scores under the two KDEs. Then, as in the previous section, we compute the peak performance for batches of 5 experiments in order to see which configurations of models consistently perform well.
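The metric can be sketched with a small Gaussian kernel density estimate. The fixed bandwidth, the function names, and the two lists of estimates are assumptions for illustration; the sign convention follows fig. 5, where negative values mean that EM is outperformed.

```python
import math


def kde_density(samples, point, bandwidth=0.05):
    # Gaussian kernel density estimate evaluated at a single point.
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    kernels = (math.exp(-0.5 * ((point - s) / bandwidth) ** 2) for s in samples)
    return math.fsum(kernels) / norm


def log_score_ratio(em_estimates, model_estimates, true_weight, bandwidth=0.05):
    # Log ratio of the true weight's density under both KDEs;
    # negative values mean the model scores higher than EM.
    p_em = kde_density(em_estimates, true_weight, bandwidth)
    p_model = kde_density(model_estimates, true_weight, bandwidth)
    return math.log(p_em) - math.log(p_model)


true_weight = 0.3
# Hypothetical estimates: EM is spread out, the model is concentrated.
em_estimates = [0.25 + 0.002 * i for i in range(100)]
model_estimates = [0.28 + 0.0004 * i for i in range(100)]
metric = log_score_ratio(em_estimates, model_estimates, true_weight)
```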

The results of this analysis are shown in fig. 5(b). The top row indicates that learnable equivariant layers lead to a significant performance boost across all reduction operations. Note that the y-axis is in log scale, indicating multiples of improvements over the EM baseline. We note that benefits most drastically from learnable inputs. Notably, the middle column, which depicts max-type aggregations, indicates that this type of aggregation falls significantly behind the alternatives. Notice that we had to scale the y-axes to even show the violins, and that a significant share of peak performances is worse than EM (indicated by a sign flip of the metric).

4.3 Point Clouds

Test particles | 10 best configurations (equivariant layer type & aggregation type), one per column
1000           | 87.3 | 85.8 | 85.7 | 83.8 | 83.5 | 82.0 | 81.7 | 81.2 | 78.0 | 77.5
100            | 66.5 | 75.3 | 73.0 | 69.5 | 68.4 | 71.9 | 45.3 | 22.0 | 64.0 | 60.3
50             | 47.0 | 62.8 | 58.4 | 52.4 | 51.3 | 61.0 | 35.5 | 14.6 | 51.9 | 46.8

Table 1: Test set accuracy on ModelNet40 classification.

The previous experiment extensively tested the effect of aggregations in controlled scenarios. To test the effect of aggregations on a more realistic data set, we tackle classification of point clouds derived from the ModelNet40 benchmark data set [30]. The data set consists of CAD models describing the surfaces of objects from 40 classes. We sample point cloud populations uniformly from the surface. The training is performed on 1000 particles. For this experiment, we fixed all hyper-parameters—including optimizer parameters and learning rate schedules—as described in [29], and only exchanged the aggregation functions in the equivariant layers and the final aggregation.

The results for the 10 best configurations are summarized in table 1. The original model (first column) performs best in the training scenario (1000 particles, first row), as expected on hyper-parameters that were optimized for the model. Otherwise, learnable final aggregations outperform all non-learnable aggregations. We further observe that max-type aggregations in equivariant layers seem crucial for good final performance. This contrasts with the findings from section 4.2. We believe this to be a result of either (i) the hyper-parameters being optimized for max-type equivariant layers, or (ii) the classification task (as opposed to a regression task) favoring max-normalized embeddings that amplify discriminative features.

The second and third rows highlight an insufficiently investigated problem with invariant neural architectures: the top-performing model overfits to the training population size. Despite sharing all hyper-parameters except the aggregations, the test scenarios with fewer particles show that learnable aggregation functions generalize favorably. Compare the first two columns: each of the two drops for the original model (20.8 and 19.5 points) is comparable to the total drop for the learnable model (23.0 points).
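The comparison of the first two columns of table 1 amounts to the following arithmetic. The dictionary keys `original` and `learnable` are labels chosen here for the first and second column.

```python
# Accuracies from the first two columns of table 1, keyed by the
# number of particles at test time.
accuracy = {
    "original": {1000: 87.3, 100: 66.5, 50: 47.0},
    "learnable": {1000: 85.8, 100: 75.3, 50: 62.8},
}


def drop(config, n_from, n_to):
    # Accuracy lost when going from n_from to n_to test particles.
    return accuracy[config][n_from] - accuracy[config][n_to]


original_drops = (drop("original", 1000, 100), drop("original", 100, 50))
learnable_total = drop("learnable", 1000, 50)
```

The original model loses about 20.8 and 19.5 points in the two steps, while the learnable model loses about 23.0 points over both steps combined.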

4.4 Spatial Attention

(a) Spatial attention example. Each pane shows multiple test time bounding box samples for 5, 20, 200, 1000 particles.
(b) Test-time evidence lower bound values against various population sizes. Dashed vertical line: training population size. Dashed horizontal line: best baseline model.
Figure 6: Results of the spatial attention experiment.

In the previous experiments, we investigated models trained in isolation on supervised tasks. Here, we will test the performance as a building block of a larger model, trained end-to-end and unsupervised. The data consists of canvases containing multiple MNIST digits, cf. fig. 6(a). In [5], an unsupervised algorithm for scene understanding of such canvases was introduced. We plug in an invariant model as the localization module, which repeatedly attends to the input image, at each step returning the bounding box of an object. To turn a canvas into a population, we interpret the gray-scale image as a two-dimensional density and create populations by sampling 200 particles proportional to the pixel intensities. Remarkably, the set-based approach requires an order of magnitude fewer weights, and consequently has a significantly lower memory footprint compared to the original model, which repeatedly processes the entire image.
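Turning a canvas into a population can be sketched as weighted sampling of pixel coordinates. The function name, the use of integer pixel coordinates as particles, and the toy canvas are assumptions; the text above only specifies that 200 particles are sampled proportional to pixel intensity.

```python
import random


def canvas_to_population(canvas, n_particles=200, seed=0):
    # Treat the gray-scale image as an unnormalized 2D density and draw
    # pixel coordinates with probability proportional to intensity.
    rng = random.Random(seed)
    coords = [(row, col) for row in range(len(canvas)) for col in range(len(canvas[0]))]
    weights = [canvas[row][col] for row, col in coords]
    return rng.choices(coords, weights=weights, k=n_particles)


# Toy 6x6 canvas with a single bright 2x2 block.
canvas = [[0.0] * 6 for _ in range(6)]
for row in (2, 3):
    for col in (1, 2):
        canvas[row][col] = 1.0

population = canvas_to_population(canvas)
small_population = canvas_to_population(canvas, n_particles=5)
```

Because the population size is just an argument of the sampler, the same canvas can be turned into populations of 5, 200, or 1000 particles at test time.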

The task is challenging in several ways: the loss is a lower bound to the likelihood of the input canvas, devoid of localization information. The intended localization behavior needs to emerge from interaction with downstream components of the overall model. As with enclosing circles, the bounding box center is correlated with the sample mean of isolated particles from one digit. However, depending on the digit, this can be inaccurate.

As fig. 6(b) indicates, the order-invariant architecture on 200 particles (as in training, vertical line) can serve as a drop-in replacement, performing on a par with or slightly better than the original model baseline, indicated by the horizontal line. This is remarkable, with the original model being notoriously hard to train [13].

We investigate the performance of the model when the population size varies. We observe that the effect on performance varies with different aggregation functions. Learnable aggregation functions exhibit strictly monotonic performance improvements. This is reflected by tightening bounding boxes for increasing population sizes, fig. 6(a). Similar behavior cannot be found reliably for non-learnable aggregations. Note that we can now trade off performance and inference speed at test time by varying the population size.

Lastly, we note that in both this and the point cloud experiment, section 4.3, learnable LSE-aggregations performed well. We attribute this to the properties of diminishing returns and sum-max-interpolation amplified by weighted inputs, cf. section 3.

5 Discussion and Conclusion

We investigated aggregation functions for order-invariant neural architectures. We discussed alternatives to previously used aggregations. Introducing recurrent aggregations, we showed that each component of the Deep Set framework can be learnable. Establishing the notion of sum isomorphism, we created ground for future aggregation models.

Our empirical studies showed that aggregation functions are indeed an orthogonal research axis within the Deep Set framework worth studying. The right choice of aggregation function may depend on the type of task (regression vs. classification). It affects not only training performance, but also model sensitivity to hyper-parameters and test time performance on out-of-distribution population sizes. We showed that the learnable aggregation functions introduced in this work are more robust in their performance and more consistent in their estimates with growing population sizes. Lastly, we showed how to exploit these features in larger architectures by using neural set architectures as drop-in replacements. In the light of our experimental results, we strongly encourage emphasizing desirable properties of invariant functions, and in particular actively challenge models in non-training scenarios in future research.