1 Introduction
Machine learning algorithms make implicit assumptions about the encoding of the data set. For instance, feedforward neural networks assume that each datum is encoded as a single vector, e.g. by one-hot encoding categorical variables. Yet, many interesting learning tasks revolve around data sets consisting of sets: depth vision with 3D point clouds, probability distributions represented by finite samples, or operations on unstructured sets of tags [16, 26, 21]. Naively, a population (disambiguating terms like set and sample, we speak of data sets of populations of particles) is embedded by ordering and concatenating its particle vectors into a matrix. While standard neural networks can learn to imitate order-invariant behavior, e.g. by randomly permuting the input at each gradient step, such architectures are no true set functions. Further, they cannot easily handle varying population sizes. This has motivated research into order-invariant neural architectures [24, 6, 4, 20]. From this, the Deep Set framework emerged, proving that many interesting invariant functions allow for a sum decomposition [29, 18, 25]. It allows for straightforward application of neural networks that are order-invariant by design and can handle varying population sizes.
In this work, we study aggregations: the component of a Deep Set architecture that induces order invariance by mapping a variable-sized population to a fixed-sized description. After discussing desirable properties and extending the theory around aggregation functions, we suggest multiple alternatives, including learnable recurrent aggregation functions. Studying them in several experimental settings, we find that the choice of aggregation impacts not only performance, but also hyperparameter sensitivity and robustness to varying population sizes. In light of these findings, we argue for new evaluation techniques for neural set functions.
2 Order-Invariant Deep Architectures
We discuss populations X = {x_1, …, x_N} of N particles x_i from a particle space 𝔛. We are further interested in matrix representations 𝐗 obtained by ordering and concatenating the particles of X. A permutation of the particle axis with a permutation π is denoted by π𝐗; as sets, πX = X, but in general π𝐗 ≠ 𝐗. Data sets consist of finite populations of potentially varying size.
2.1 Invariance, Equivariance, and Decomposition of Invariant Functions
We study invariant functions according to
Definition 1 (Invariance)
A function f on the power set 2^𝔛 is order-invariant if its output does not depend on the order in which the particles are presented, i.e., f({x_{π(1)}, …, x_{π(N)}}) = f({x_1, …, x_N}) for any permutation π and input X = {x_1, …, x_N}.
If it is clear from the context, we will call such functions invariant. When the input is embedded as a matrix 𝐗, definition 1 can be formulated as f(π𝐗) = f(𝐗). A related, important notion is that of equivariant functions:
Definition 2 (Equivariance)
A function f is equivariant if input permutation results in an equivalent output permutation, f(π𝐗) = π f(𝐗), for any permutation π and input 𝐗.
In [29], a defining structural property of order-invariant functions was proven:
Theorem 2.1 (Deep Sets, [29])
A function f on populations from a countable particle space 𝔛 is invariant if and only if there exists a decomposition

f(X) = ρ( Σ_{x ∈ X} φ(x) )

with appropriate functions φ and ρ.
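To make the decomposition concrete, here is a minimal numpy sketch in which random projections stand in for the networks φ and ρ; the invariance holds by construction, regardless of the particular weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks: random projections for phi and rho.
W_phi = rng.normal(size=(2, 16))
W_rho = rng.normal(size=(16, 1))

def f(X):
    # Sum-decomposed set function: f(X) = rho(sum_x phi(x)).
    emb = np.tanh(X @ W_phi)               # phi, applied per particle
    pooled = emb.sum(axis=0)               # order-invariant summation
    return float(np.tanh(pooled) @ W_rho)  # rho, an arbitrary readout

X = rng.normal(size=(5, 2))                # population of 5 particles in R^2
perm = rng.permutation(5)
assert np.isclose(f(X), f(X[perm]))        # invariant to particle order by construction
```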
We call such functions sum-decomposable; this follows [25], where severe pathologies for uncountable input spaces are pointed out:

- There exist invariant functions that have no sum decomposition.
- There exist sum decompositions that are everywhere-discontinuous.
- Even relevant functions such as max cannot be continuously decomposed when the image space of the embedding φ has fewer dimensions than the population size N.
As a consequence, they refine theorem 2.1 to:
Theorem 2.2 (Uncountable Particle Spaces, [25])
A continuous function f on finite populations of at most N particles from 𝔛 ⊆ ℝ is invariant if and only if it is sum-decomposable via ℝ^N.
That is, for arbitrary f, the image space of φ has to have at least dimension N, which is both necessary and sufficient. While more restrictive in scope than theorem 2.1, this result is more applicable in practice, where most function approximators (neural networks, Gaussian processes) are continuous.
2.2 Deep Sets
A generic invariant neural architecture emerges from theorems 2.2 and 2.1 by using neural networks for φ and ρ, respectively. In practice, to allow for higher-level particle interaction during the embedding φ, equivariant neural layers are introduced [29],

(1) equiv(𝐗)_i = ψ(x_i, agg(𝐗)),

where ψ denotes a per-particle feedforward layer, and agg denotes an aggregation. Aggregations, our object of study, induce invariance by mapping a population to a fixed-size description, typically sum, mean, or max. The full architecture is
(2) 𝐄 = φ(𝐗)
(3) φ = equiv_L ∘ ⋯ ∘ equiv_1 ∘ embed
(4) r = agg(𝐄)
(5) f(X) = ρ(r)
with φ implemented by a per-particle embedding followed by an equivariant combination function consisting of L equivariant layers. Summation is replaced by a generic aggregation operation agg. In [29, 18], the max operation is suggested as an alternative to summation. Lastly, ρ can be implemented by arbitrary functions, since the aggregation in eq. 4 is already invariant. This framework is depicted in fig. 0(a).
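A common way to realize such an equivariant layer is a per-particle transform combined with a broadcast population summary; the following sketch (with arbitrary weights, not the exact parameterization of [29]) checks the equivariance property of definition 2:

```python
import numpy as np

rng = np.random.default_rng(1)
Lam = rng.normal(size=(3, 3))   # per-particle weights
Gam = rng.normal(size=(3, 3))   # weights for the broadcast aggregate

def equiv_layer(X, agg=np.mean):
    # Per-particle transform plus a broadcast, order-invariant population summary.
    pooled = agg(X, axis=0, keepdims=True)   # invariant summary, shape (1, d)
    return np.tanh(X @ Lam + pooled @ Gam)   # broadcast back to every particle

X = rng.normal(size=(6, 3))
perm = rng.permutation(6)
# Equivariance: permuting the input rows permutes the output rows identically.
assert np.allclose(equiv_layer(X)[perm], equiv_layer(X[perm]))
```

The same check passes for other invariant aggregates, e.g. `agg=np.max`.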
2.3 Order Matters
Recurrent neural networks can handle setvalued input by feeding one particle at a time.
However, it has been shown that the result is sensitive to order, and an invariant read-process-write architecture has been suggested as a remedy [24]:
q_t = LSTM(r_{t−1}, q_{t−1})
ê_{i,t} = attention(m_i, q_t)    (e.g., ê_{i,t} = m_i^⊤ q_t)
e_t = softmax(ê_t)
r_t = Σ_i e_{i,t} m_i
r = r_T
An embedded memory 𝐌 = (m_1, …, m_N) is queried with a query vector q_t.
The invariant result r_t is iteratively used to refine subsequent queries via an LSTM [8].
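The read-process-write loop can be sketched as follows; a plain tanh recurrence stands in for the LSTM, and dot-product attention for the attention function (all weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(8, d))          # embedded memory: 8 particles in R^d
W_r = rng.normal(size=(d, d))        # recurrence weights (stand-in for an LSTM)
W_q = rng.normal(size=(d, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def read_process_write(M, T=3):
    q = np.zeros(d)                      # q_0: constant initial query
    r = np.zeros(d)
    for _ in range(T):
        q = np.tanh(r @ W_r + q @ W_q)   # process: refine the query
        e = softmax(M @ q)               # read: dot-product attention weights
        r = e @ M                        # write: invariant weighted sum r_t
    return r                             # r = r_T

perm = rng.permutation(len(M))
# The readout is order-invariant: permuting the memory rows changes nothing.
assert np.allclose(read_process_write(M), read_process_write(M[perm]))
```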
It is not obvious how to cast this recurrent structure into the setting of theorems 2.1 and 2.2 and eqs. 2 to 5.
To the best of our knowledge, this model has only been discussed in its sequencetosequence context.
We will revisit and refine this architecture in section 3.3.
2.4 Further Related Work
Several papers introduce and discuss a Deep Set framework for dealing with set-valued inputs [18, 29]. A driving force behind research into order-invariant neural networks are point clouds [19, 17, 18], where such architectures are used to perform classification and semantic segmentation of objects and scenes represented as point clouds in ℝ³. It is further shown that a max decomposition allows for arbitrarily close approximation [18].
Generative models of sets have been investigated: in an extension of variational autoencoders [11, 22], the inference of latent population statistics resembles a Deep Sets architecture [4]. Generative models of point clouds are proposed by [1] and [28].
Permutation-invariant neural networks have been used for predicting dynamics of interacting objects [6]. The authors propose to embed the individual object positions in pairs using a feedforward neural network. Similar pairwise approaches have been investigated by [3, 2], and applied to relational reasoning in [23].
Weighted averages based on attention have been proposed and applied to multi-instance learning [10]. Several works have focused on higher-order particle interaction, suggesting computationally efficient approximations of Janossy pooling [15], or proposing set attention blocks as an alternative to equivariant layers [14].
3 The Choice of Aggregation
The invariance of the Deep Set architecture emerges from the invariance of the aggregation function, eq. 4. Theorem 2.1 theoretically justifies summing the embeddings. In practice, mean or max pooling operations are used. Equally simple and invariant, they are numerically favorable for varying population sizes, as they control the input magnitude to downstream layers. This section discusses alternatives and their properties.
3.1 Alternative Aggregations
We start by justifying alternative choices with an extension of theorems 2.2 and 2.1:
Corollary 1 (Sum Isomorphism)
Theorems 2.2 and 2.1 can be extended to aggregations of the form agg = g ∘ Σ ∘ h, i.e., summations in an isomorphic space.
Proof
From ρ ∘ (g ∘ Σ ∘ h) ∘ φ = (ρ ∘ g) ∘ Σ ∘ (h ∘ φ), sum decompositions can be constructed from agg decompositions and vice versa.
This class includes, e.g., mean (with h = id and g(y) = y/N) and logsumexp (LSE) (with h = exp and g = ln).
In that light, there is an interesting case to be made for LSE: depending on the input magnitudes, LSE can behave akin to max (cf. figs. 1(c), 1(b) and 1(a)) or like a linear function akin to summation (cf. figs. 1(e), 1(d) and 1(f)). Operating in log space, LSE further exhibits diminishing returns: N identical scalar particles x yield LSE = x + ln N. The larger N, the smaller the output change from additional particles. Beyond making LSE a numerically useful aggregation, diminishing returns are a desirable property from a statistical perspective, where we would like estimates to be asymptotically consistent.
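These properties of LSE are easy to verify numerically; the sketch below checks the diminishing-returns identity and the max-like behavior for large input magnitudes:

```python
import numpy as np

def lse(x):
    # log-sum-exp, stabilized by factoring out the maximum
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Diminishing returns: N identical scalar particles x yield x + ln N.
x, N = 0.7, 100
assert np.isclose(lse(np.full(N, x)), x + np.log(N))

# Large input magnitudes: LSE approaches max (after undoing the scaling).
v = np.array([0.1, 0.2, 0.3])
assert np.isclose(lse(1000 * v) / 1000, v.max(), atol=1e-3)
```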
Divide and Conquer
Commutative and associative binary operations like addition and multiplication yield invariant aggregations. Widening this perspective, we see that divide-and-conquer style operations yield invariant aggregations: order invariance is equivalent to the conquer step being invariant to the division. Examples beyond the previously mentioned operations are logical operators such as any or all, but also sorting-based statistics (generalizing max and min to any percentile, e.g. the median). While impractical for typical first-order optimization, we note that aggregations can be of a very sophisticated nature.
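A quick numerical check of this observation: sorting discards order, so any statistic read off the sorted population is a valid invariant aggregation:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=101)
perm = rng.permutation(101)

# Any sorting-based statistic is an invariant aggregation: the conquer step
# (reading off an order statistic) does not depend on how the input was divided.
for agg in (np.median, np.max, np.min, lambda v: np.percentile(v, 90)):
    assert np.isclose(agg(x), agg(x[perm]))
```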
3.2 Learnable Aggregation Functions
In [29], cf. eqs. 2 to 5, the aggregation is the only non-learnable component. We will now investigate ways to render the aggregation learnable. In section 2.3, we saw that due to the structure of theorem 2.1, recurrent architectures as suggested by [24] have been overlooked, as it is not straightforward to cast them into the Deep Sets framework. Inspired by the read-process-write architecture, we suggest recurrent aggregations:
Definition 3 (Recurrent and Query Aggregation)
A recurrent aggregation is a function that can be written recursively as:
q_t = query(r_{t−1}, q_{t−1})
ê_{i,t} = attention(e_i, q_t)
e_t = normalize(ê_t)
r_t = reduce({e_{i,t} e_i})
agg(𝐄) = g(r_1, …, r_T),
where 𝐄 = (e_1, …, e_N) is an embedding of the input population and the initial query q_0 is a constant.
We further call the special case T = 1 (a single query q) a query aggregation.
As long as reduce is invariant and normalize is equivariant, recurrent and query aggregations are invariant. This architectural block is depicted in fig. 0(b).
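A query aggregation (T = 1) can be sketched in a few lines; the query vector q would be learned in practice, and reduce can be any of the simple invariant aggregations from section 3.1:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
q = rng.normal(size=d)                        # the single query; learned in practice

def query_agg(E, reduce=np.sum):
    scores = E @ q                            # attention(e_i, q) = e_i . q  (equivariant)
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # normalize = softmax          (equivariant)
    return reduce(w[:, None] * E, axis=0)     # weighted invariant reduce

E = rng.normal(size=(7, d))
perm = rng.permutation(7)
assert np.allclose(query_agg(E), query_agg(E[perm]))                  # weighted sum
assert np.allclose(query_agg(E, np.max), query_agg(E[perm], np.max))  # weighted max
```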
Building upon the read-process-write scheme of section 2.3, recurrent aggregations introduce two modifications: firstly, we replace the weighted sum by a general weighted aggregation, giving us a rich combinatorial toolbox on the basis of simple invariant functions such as those mentioned in section 3.1. Secondly, we add post-processing of the step-wise results r_1, …, r_T. In practice, we use another recurrent network layer that processes r_1, …, r_T in reversed order. Without this modification, later queries tend to be more important, as their result is not as easily forgotten by the forward recurrence. The backward processing reverses this effect, so that the first queries tend to be more important, and the overall architecture is more robust to common fallacies of recurrent architectures, in particular unstable gradients.
Observing definition 3, we note that our learnable aggregation functions wrap around the previously discussed simpler, non-learnable aggregations. A major benefit is that the inputs are weighted; a sum becomes a weighted average, for instance. This also allows the model to effectively exploit nonlinearities such as those discussed for LSE (cf. fig. 2).
3.3 A Note on Universal Approximation
The key promise of universal approximation is that a family of approximators (neural nets, or neural sum decompositions) is dense within a wider family of interesting functions [12, 7, 9]. The universality granted by theorems 2.2 and 2.1, through constructive proofs, hinges on sum aggregation. Corollary 1 grants flexibility, but does not apply to arbitrary aggregations like max or the suggested learnable aggregations. (Note that max allows for arbitrarily close approximation [18].) It remains open to what extent the sum can be replaced. As such, the suggested architectures might not yield universal approximators. As we will see in section 4, however, they provide useful inductive biases in practical settings, much like generic feedforward neural nets are usually replaced with architectures targeted towards the task. It is worth noting that the embedding dimension constraint of theorem 2.2 is rarely met in practice, trading theoretical guarantees for test-time performance.
4 Experiments
We consider three simple aggregations: mean (or weighted sum), max, and LSE. These are used in equivariant layers and final aggregations, and may be wrapped into a recurrent aggregation. This combinatorially large space of configurations is tested in four experiments described in the following sections.
4.1 Minimal Enclosing Circle
recurrent equiv. / aggr. | best MSE | radius MSE | center MSE | median best MSE
✗ / ✗                    |   0.71   |    0.06    |    0.66    |      1.57
✗ / ✓                    |   1.02   |    0.14    |    0.88    |      1.30
✓ / ✗                    |   0.54   |    0.08    |    0.47    |      0.87
✓ / ✓                    |   0.42   |    0.09    |    0.33    |      0.58
In this supervised experiment, we are trying to predict the minimal enclosing circle of a population drawn from a Gaussian mixture model (GMM). A sample population with its target circle is depicted in fig. 4. The sample mean does not approximate the center of the minimal enclosing circle well, and the correct solution is determined by as few as three particles. The models are trained by minimizing the mean squared error (MSE) towards the center and radius of the true circle (computable in linear time [27]). Results are given in fig. 4. Each row shows the best result out of 180 runs (20 runs for each of the 9 combinations of aggregations). We can see that both recurrent equivariant layers and recurrent aggregations improve performance, with equivariant layers granting the larger boost. The challenge lies mostly in a better approximation of the center.
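For reference, the target circles can be computed by randomized incremental construction in expected linear time [27]; the following is a sketch of that algorithm (helper names are ours; it assumes points in general position, i.e. no degenerate collinear triples on the hull):

```python
import math, random

def _circle_two(p, q):
    # Smallest circle with p and q on its boundary (diameter circle).
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2, math.dist(p, q) / 2)

def _circle_three(p, q, r):
    # Circumcircle of three non-collinear points.
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy, math.dist((ux, uy), p))

def _contains(c, p, eps=1e-9):
    return c is not None and math.dist((c[0], c[1]), p) <= c[2] + eps

def min_enclosing_circle(points):
    # Randomized incremental construction; expected linear time.
    pts = list(points)
    random.shuffle(pts)
    c = None
    for i, p in enumerate(pts):
        if _contains(c, p):
            continue
        c = (p[0], p[1], 0.0)
        for j, q in enumerate(pts[:i]):
            if _contains(c, q):
                continue
            c = _circle_two(p, q)
            for r in pts[:j]:
                if not _contains(c, r):
                    c = _circle_three(p, q, r)
    return c  # (center_x, center_y, radius)

cx, cy, rad = min_enclosing_circle([(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)])
assert abs(cx - 1.0) < 1e-6 and abs(cy) < 1e-6 and abs(rad - 1.0) < 1e-6
```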
The top row indicates that an entirely non-recurrent model performs better than its counterpart with recurrent aggregation (second row). To test for a performance outlier, we compute a bootstrap estimate of the expected peak performance when only performing 20 experiments: we subsample all available experiments (with replacement) into several sets of 20 experiments, recording the best performance in each batch. The last column in fig. 4 reports the median of these best batch performances. The result shows increased robustness to hyperparameters, despite the recurrent models having more hyperparameters.

4.2 GMM Mixture Weights
In this experiment, our goal is to estimate the mixture weights of a Gaussian mixture model directly from particles. The GMM populations in our data set are sampled as follows: each mixture consists of two components; the mixture weights are sampled at random; the means span a diameter of the unit circle, with their position drawn uniformly at random; component variances are fixed to the same diagonal value such that the clusters are not linearly separable. An example population is shown in fig. 4(a). The model outputs the concentrations α and β of a Beta distribution. We train to maximize the log-likelihood of the smaller ground-truth weight under this Beta distribution. At training time, for every gradient step the batch population size
is chosen randomly. In fig. 4(a), we show how an estimator based on the learned model behaves with growing population size. We were again interested in the robustness of the models. We compare to expectation maximization (EM), the classic estimation technique for mixture weights, as a baseline: for each population size, we gather 100 estimates each from EM and from the model by subsampling (with replacement) the original population. We then compare the likelihood of the true weight under a kernel density estimate (KDE) of each set of estimates; the final metric is the log ratio of the scores under the two KDEs. As in the previous section, we compute the peak performance for batches of 5 experiments in order to see which model configurations consistently perform well.
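The comparison metric can be sketched as follows, with a hand-rolled Gaussian KDE (the bandwidth and the synthetic estimate arrays are illustrative, not the values used in the experiment):

```python
import numpy as np

def kde(samples, x, bw=0.02):
    # Gaussian kernel density estimate of `samples`, evaluated at x.
    z = (x - np.asarray(samples)) / bw
    return np.mean(np.exp(-0.5 * z**2)) / (bw * np.sqrt(2 * np.pi))

def log_score_ratio(model_est, em_est, true_weight):
    # > 0: the model's estimates put more density on the truth than EM's.
    return np.log(kde(model_est, true_weight)) - np.log(kde(em_est, true_weight))

true_w = 0.3
model_est = true_w + 0.01 * np.random.default_rng(5).normal(size=100)  # tight around truth
em_est = true_w + 0.10 * np.random.default_rng(6).normal(size=100)     # more spread out
assert log_score_ratio(model_est, em_est, true_w) > 0
```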
The results of this analysis are shown in fig. 4(b). The top row indicates that learnable equivariant layers lead to a significant performance boost across all reduction operations. Note that the y-axis is in log scale, indicating multiples of improvement over the EM baseline. We note that LSE benefits most drastically from learnable inputs. Notably, the middle column, which depicts max-type aggregations, indicates that this type of aggregation falls significantly behind the alternatives. Notice that we had to scale the y-axes to even show the violins, and that a significant number of peak performances are worse than EM (indicated by a sign flip of the metric).
4.3 Point Clouds
Table 1. Accuracy (%) for the 10 best configurations of equivariant layer type and aggregation type (columns, ordered by accuracy at population size 1000; labels include recurrent (r), query (q), and query-sum (qsum) variants), for test population sizes N = 1000, 100, 50.

N = 1000 | 87.3  85.8  85.7  83.8  83.5  82.0  81.7  81.2  78.0  77.5
N = 100  | 66.5  75.3  73.0  69.5  68.4  71.9  45.3  22.0  64.0  60.3
N = 50   | 47.0  62.8  58.4  52.4  51.3  61.0  35.5  14.6  51.9  46.8
The previous experiment extensively tested the effect of aggregations in controlled scenarios. To test the effect of aggregations on a more realistic data set, we tackle classification of point clouds derived from the ModelNet40 benchmark data set [30]. The data set consists of CAD models describing the surfaces of objects from 40 classes. We sample point cloud populations uniformly from the surfaces. Training is performed on 1000 particles. For this experiment, we fixed all hyperparameters (including optimizer parameters and learning rate schedules) as described in [29], and only exchanged the aggregation functions in the equivariant layers and the final aggregation.
The results for the 10 best configurations are summarized in table 1. The original model performs best in the training scenario (N = 1000, first row), as expected on hyperparameters that were optimized for this model. Otherwise, learnable final aggregations outperform all non-learnable aggregations. We further observe that max-type aggregations in equivariant layers seem crucial for good final performance. This contrasts with the findings from section 4.2. We believe this to be a result of either (i) the hyperparameters being optimized for max-type equivariant layers, or (ii) the classification task (as opposed to a regression task) favoring normalized embeddings that amplify discriminative features.
The second and third rows highlight an insufficiently investigated problem with invariant neural architectures: the top-performing model overfits to the training population size. Despite sharing all hyperparameters except the aggregations, the test scenarios with fewer particles show that learnable aggregation functions generalize favorably. Compare the first two columns: each single drop for the original model is comparable to the total drop for the learnable model.
4.4 Spatial Attention

In the previous experiments, we investigated models trained in isolation on supervised tasks. Here, we test the performance as a building block of a larger model, trained end-to-end and unsupervised. The data consists of canvases containing multiple MNIST digits, cf. fig. 5(a). In [5], an unsupervised algorithm for scene understanding of such canvases was introduced. We plug in an invariant model as the localization module, which repeatedly attends to the input image, at each step returning the bounding box of an object. To turn a canvas into a population, we interpret the grayscale image as a two-dimensional density and create populations by sampling 200 particles proportionally to the pixel intensities. Remarkably, the set-based approach requires an order of magnitude fewer weights, and consequently has a significantly lower memory footprint, compared to the original model, which repeatedly processes the entire image.
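The canvas-to-population conversion can be sketched as follows (the function name and canvas size are illustrative):

```python
import numpy as np

def image_to_population(img, n_particles=200, rng=None):
    # Interpret a grayscale image as an unnormalized 2-D density and sample
    # particle coordinates proportionally to pixel intensity.
    rng = rng or np.random.default_rng()
    p = img.ravel().astype(float)
    p /= p.sum()
    idx = rng.choice(img.size, size=n_particles, p=p)
    ys, xs = np.unravel_index(idx, img.shape)
    return np.stack([xs, ys], axis=1).astype(float)

# Toy canvas: all intensity mass in one quadrant; every particle must land there.
img = np.zeros((28, 28))
img[:14, :14] = 1.0
pop = image_to_population(img, 200, np.random.default_rng(7))
assert pop.shape == (200, 2) and (pop < 14).all()
```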
The task is challenging in several ways: the loss is a lower bound to the likelihood of the input canvas, devoid of localization information. The intended localization behavior needs to emerge from interaction with downstream components of the overall model. As with enclosing circles, the bounding box center is correlated with the sample mean of isolated particles from one digit. However, depending on the digit, this can be inaccurate.
As fig. 5(b) indicates, the order-invariant architecture with 200 particles (as in training, vertical line) can serve as a drop-in replacement, performing on par with or slightly better than the original model baseline. This is remarkable, as the original model is notoriously hard to train [13].
We investigate the performance of the model when the population size varies. We observe that the effect on performance varies with different aggregation functions. Learnable aggregation functions exhibit strictly monotonic performance improvements. This is reflected by tightening bounding boxes for increasing population sizes, fig. 5(a). Similar behavior cannot be found reliably for nonlearnable aggregations. Note that we can now trade off performance and inference speed at test time by varying the population size.
Lastly, we note that in both this and the point cloud experiment (section 4.3), learnable aggregations performed well. We attribute this to the properties of diminishing returns and sum interpolation, amplified by weighted inputs, cf. section 3.1.

5 Discussion and Conclusion
We investigated aggregation functions for order-invariant neural architectures. We discussed alternatives to previously used aggregations. Introducing recurrent aggregations, we showed that every component of the Deep Set framework can be made learnable. Establishing the notion of sum isomorphism, we laid the groundwork for future aggregation models.
Our empirical studies showed that aggregation functions are indeed an orthogonal research axis within the Deep Set framework worth studying. The right choice of aggregation function may depend on the type of task (regression vs. classification). It affects not only training performance, but also model sensitivity to hyperparameters and test-time performance on out-of-distribution population sizes. We showed that the learnable aggregation functions introduced in this work are more robust in their performance and more consistent in their estimates with growing population sizes. Lastly, we showed how to exploit these features in larger architectures by using neural set architectures as drop-in replacements. In light of our experimental results, we strongly encourage emphasizing desirable properties of invariant functions and, in particular, actively challenging models in non-training scenarios in future research.
References
 [1] Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning Representations and Generative Models for 3D Point Clouds (Feb 2018), https://openreview.net/forum?id=BJInEZsTb
 [2] Chang, M.B., Ullman, T., Torralba, A., Tenenbaum, J.B.: A Compositional Object-Based Approach to Learning Physical Dynamics. arXiv:1612.00341 [cs] (Dec 2016), http://arxiv.org/abs/1612.00341
 [3] Chen, X., Cheng, X., Mallat, S.: Unsupervised Deep Haar Scattering on Graphs. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 1709–1717. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5545unsuperviseddeephaarscatteringongraphs.pdf
 [4] Edwards, H., Storkey, A.: Towards a Neural Statistician. arXiv:1606.02185 [cs, stat] (Jun 2016), http://arxiv.org/abs/1606.02185
 [5] Eslami, S.M.A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., Hinton, G.E.: Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. pp. 3233–3241. NIPS'16, Curran Associates Inc., USA (2016), http://dl.acm.org/citation.cfm?id=3157382.3157459
 [6] Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., Kanai, R.: Permutation-equivariant neural networks applied to dynamics prediction. arXiv:1612.04530 [cs, stat] (Dec 2016), http://arxiv.org/abs/1612.04530

 [7] Hecht-Nielsen, R.: Theory of the backpropagation neural network. In: International Joint Conference on Neural Networks. pp. 593–605 vol. 1. IEEE, Washington, DC, USA (1989). https://doi.org/10.1109/IJCNN.1989.118638, http://ieeexplore.ieee.org/document/118638/
 [8] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735, https://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735
 [9] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359–366 (Jan 1989). https://doi.org/10.1016/0893-6080(89)90020-8, http://www.sciencedirect.com/science/article/pii/0893608089900208
 [10] Ilse, M., Tomczak, J.M., Welling, M.: Attention-based Deep Multiple Instance Learning (Feb 2018), https://arxiv.org/abs/1802.04712
 [11] Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat] (Dec 2013), http://arxiv.org/abs/1312.6114
 [12] Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR 114, 953–956 (1957), https://zbmath.org/?q=an%3A0090.27103
 [13] Kosiorek, A., Kim, H., Teh, Y.W., Posner, I.: Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31, pp. 8606–8616. Curran Associates, Inc. (2018), http://papers.nips.cc/paper/8079sequentialattendinferrepeatgenerativemodellingofmovingobjects.pdf
 [14] Lee, J., Lee, Y., Kim, J., Kosiorek, A.R., Choi, S., Teh, Y.W.: Set Transformer (Oct 2018), https://arxiv.org/abs/1810.00825
 [15] Murphy, R.L., Srinivasan, B., Rao, V., Ribeiro, B.: Janossy Pooling: Learning Deep PermutationInvariant Functions for VariableSize Inputs. arXiv:1811.01900 [cs, stat] (Nov 2018), http://arxiv.org/abs/1811.01900

 [16] Poczos, B., Singh, A., Rinaldo, A., Wasserman, L.: Distribution-Free Distribution Regression. In: Artificial Intelligence and Statistics. pp. 507–515 (Apr 2013), http://proceedings.mlr.press/v31/poczos13a.html
 [17] Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D Object Detection from RGB-D Data. arXiv:1711.08488 [cs] (Nov 2017), http://arxiv.org/abs/1711.08488

 [18] Qi, C.R., Su, H., Kaichun, M., Guibas, L.J.: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 77–85 (Jul 2017). https://doi.org/10.1109/CVPR.2017.16
 [19] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5099–5108. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7095pointnetdeephierarchicalfeaturelearningonpointsetsinametricspace.pdf
 [20] Ravanbakhsh, S., Schneider, J., Poczos, B.: Deep Learning with Sets and Point Clouds. arXiv:1611.04500 [cs, stat] (Nov 2016), http://arxiv.org/abs/1611.04500
 [21] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative Adversarial Text to Image Synthesis. In: International Conference on Machine Learning. pp. 1060–1069 (Jun 2016), http://proceedings.mlr.press/v48/reed16.html
 [22] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic Backpropagation and Approximate Inference in Deep Generative Models (Jan 2014), https://arxiv.org/abs/1401.4082
 [23] Santoro, A., Raposo, D., Barrett, D.G.T., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning. arXiv:1706.01427 [cs] (Jun 2017), http://arxiv.org/abs/1706.01427
 [24] Vinyals, O., Bengio, S., Kudlur, M.: Order Matters: Sequence to sequence for sets. arXiv:1511.06391 [cs, stat] (Nov 2015), http://arxiv.org/abs/1511.06391
 [25] Wagstaff, E., Fuchs, F.B., Engelcke, M., Posner, I., Osborne, M.: On the Limitations of Representing Functions on Sets. arXiv:1901.09006 [cs, stat] (Jan 2019), http://arxiv.org/abs/1901.09006
 [26] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic Graph CNN for Learning on Point Clouds. arXiv:1801.07829 [cs] (Jan 2018), http://arxiv.org/abs/1801.07829
 [27] Welzl, E.: Smallest enclosing disks (balls and ellipsoids). In: Maurer, H. (ed.) New Results and New Trends in Computer Science. pp. 359–370. Lecture Notes in Computer Science, Springer Berlin Heidelberg (1991)
 [28] Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.: GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud. arXiv:1812.03320 [cs] (Dec 2018), http://arxiv.org/abs/1812.03320
 [29] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep Sets. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 3391–3401. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/6931deepsets.pdf
 [30] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3D ShapeNets: A deep representation for volumetric shapes. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1912–1920. IEEE, Boston, MA, USA (Jun 2015). https://doi.org/10.1109/CVPR.2015.7298801, http://ieeexplore.ieee.org/document/7298801/