Distribution-Based Invariant Deep Networks for Learning Meta-Features

by Gwendoline de Bie, et al.
École Normale Supérieure

Recent advances in deep learning from probability distributions make it possible to perform classification or regression from distribution samples while remaining invariant under permutation of the samples. This paper extends distribution-based deep neural architectures so that they are also invariant under permutation of the descriptive features. The motivation for this extension is the Auto-ML problem, which aims to identify a priori the ML configuration best suited to a dataset. Formally, a distribution-based invariant deep learning architecture is presented and leveraged to extract the meta-features characterizing a dataset. The contribution of the paper is twofold. On the theoretical side, the proposed architecture inherits the universal approximation properties of neural networks, and its robustness w.r.t. moderate perturbations is established. On the empirical side, a proof of concept is presented, identifying the SVM hyper-parameters best suited to a large benchmark of diversified small-size datasets.




1 Introduction

Deep network architectures, initially devised for structured data such as images [24] and speech [17], have been extended to respect some invariance or equivariance [41] of more complex data sets. This includes for instance point clouds [34], graphs [16] and probability distributions [6], which are invariant with respect to permutations of the input points. In such cases, invariant architectures improve practical performance while inheriting the universal approximation properties of neural nets [5, 25].

1.1 Distribution-based Architectures and AutoML

This paper focuses on distribution-based neural architectures, i.e. deep networks tailored to manipulate distributions of points. For the sake of simplicity, we describe our architectures over discrete distributions, represented as uniform distributions on a set of points of arbitrary size. The extension to arbitrary (possibly continuous) distributions is detailed in supplementary material, Appendix A.

In this paper, distribution-based neural architectures are extended to cope with an additional invariance: the space of features and labels (i.e. the space supporting the distributions) is also assumed to be invariant under permutation of its coordinates. This extra invariance is important to tackle Auto-ML problems [38, 30, 11, 19, 1, 18, 22, 36, 10]. Auto-ML aims to identify a priori the ML configuration (learning algorithm and hyper-parameters thereof) best suited to the dataset under consideration in the sense of a given performance indicator. If a dataset were associated with accurate descriptive features, referred to as meta-features, the Auto-ML problem could be handled by solving yet another supervised learning problem: given archives recording the performance of various ML configurations on various datasets, with each dataset described as a vector of meta-features, the best-performing algorithm (among these configurations) on a new dataset z could be predicted from its meta-features. The design of accurate meta-features has however eluded research since the 80s (with the exception of [20], more in Section 1.2), to such an extent that the prominent AutoML approaches currently rely on learning a performance model specific to each dataset [11, 36].

1.2 Related Works and Contributions

Learning from finite discrete distributions.

Learning from sets of samples subject to invariance or equivariance properties opens up a wide range of applications: in the sequence-to-sequence framework, relaxing the order in which the input is organized might be beneficial [46]. The ability to follow populations at a macroscopic level, using distributions on their evolution over time without having to follow individual trajectories, and regardless of the population size, is appreciated when modelling dynamic cell processes [15]. The use of sets of pixels, as opposed to e.g. voxelized approaches in computer vision [6], offers better scalability in terms of data dimensionality and computational resources.

Most generally, the fact that the considered hypothesis space / neural architecture complies with domain-dependent invariances ensures a better robustness of the eventually learned model, better capturing the data geometry. Such neural architectures have been pioneered by [34, 51] for learning from point clouds subject to permutation invariance or equivariance. These have been extended to permutation equivariance across sets [14]. Characterizations of invariance or equivariance under group actions have been proposed in the finite [13, 3, 37] or infinite case [48, 23]. A general characterization of linear layers on the top of a representation that are invariant or equivariant with respect to the whole permutation group has been proposed by [26, 21]. Universality results are known to hold in the case of sets [51], point clouds [34], equivariant point clouds [40], discrete measures [6], invariant [27] and equivariant [21] graph neural networks. The approach most related to our work is that of [28], presenting a neural architecture invariant w.r.t. the ordering of samples and their features. The originality of our approach is that we do not fix in advance the number of samples, and consider probability distributions instead of point clouds. This allows us to leverage the natural topology of optimal transport to assess theoretically the universality and smoothness of our architectures, which is adapted to tackle the AutoML problem.


The absence of learning algorithms efficient on all datasets [47] makes AutoML, i.e. the automatic identification of the machine learning pipelines yielding the best performance on the task at hand, a main bottleneck toward the so-called democratization of machine learning technology [19]. The AutoML field has been sparking interest for more than four decades [38], spreading from hyperparameter optimization [2] to the optimization of the whole pipeline [11]. Formally, AutoML defines a mixed integer and discrete optimization problem (finding the ML pipeline algorithms and their hyper-parameters), involving an expensive black-box objective function. The organization of international challenges spurred the development of various efficient AutoML systems, intrinsically relying on Bayesian optimization [11, 42], Monte-Carlo tree search [7] on top of a surrogate model, or their combination [36].

As said, the ability to characterize tasks (datasets, in the remainder of the paper) via vectors of meta-features would solve AutoML through learning the performance model. Meta-features, expected to describe the joint distribution underlying the dataset, should also be inexpensive to compute. Particular meta-features called landmarks [33] are given by the performance of fast ML algorithms; indeed, knowing that a decision tree reaches a given level of accuracy on a dataset gives some information on this dataset; see also [30]. Another direction is explored by [20], defining the Dataset2Vec representation. Specifically, meta-features are extracted by solving the classification problem of whether two patches of data (subsets of examples, described according to a subset of features) are extracted from the same dataset. Meta-learning [12, 50] and hyper-parameter transfer learning [31], more remotely related to the presented approach, respectively aim to find a generic model with quick adaptability to new tasks, achieved through few-shot learning, and to transfer the performance model learned for a task to another task.


The contribution of the paper is twofold. On the algorithmic side, a distribution-based invariant deep architecture (Dida) able to learn such meta-features is presented in Section 2. The challenge is that a meta-feature associated to a set of samples must be invariant both under permutation of the samples, and under permutation of their coordinates. Moreover, the architecture must be flexible enough to accept discrete distributions with diverse support and feature sizes. The theoretical properties of these architectures (smoothness and universality) are detailed in Section 3. A proof of concept of the merits of the approach is presented in Section 4, where the AutoML problem is restricted to the identification of the best SVM configuration on a large-size benchmark of diversified datasets.

2 Distribution-Based Invariant Networks for Meta-Feature Learning

This section describes our distribution-based invariant layers, mapping a point distribution to another one while respecting invariances. It details how they can be trained to perform invariant regression and achieve meta-feature learning.

2.1 Invariant Functions of Discrete Distributions

Let $z = \{(x_i, y_i)\}_{i=1}^{n}$ denote a dataset including $n$ labelled samples, with $x_i \in \mathbb{R}^{d_X}$ an instance and $y_i \in \mathbb{R}^{d_Y}$ the associated multi-label. With $d_X$ and $d_Y$ respectively being the dimensions of the instance and label spaces, let $d = d_X + d_Y$. By construction, z is invariant under permutation of the sample ordering; it is viewed as an $n$-size discrete distribution $\frac{1}{n} \sum_{i=1}^{n} \delta_{z_i}$ in $\mathbb{R}^d$, with $z_i = (x_i, y_i)$, as opposed to a point cloud. While the paper focuses on the case of discrete distributions, the approach and theoretical results also hold in the general case of continuous distributions (Appendix A).

We denote $Z_n(\mathbb{R}^d)$ the space of such $n$-size point distributions, with $Z(\mathbb{R}^d) = \cup_n Z_n(\mathbb{R}^d)$ the space of distributions of arbitrary size.

As the performance of an ML algorithm is most generally invariant w.r.t. permutations operating on the feature or label spaces, the neural architectures leveraged to learn the meta-features must enjoy the same property. Formally, let $G = S_{d_X} \times S_{d_Y}$ denote the group of permutations independently operating on the feature and label spaces. For $\sigma = (\sigma_X, \sigma_Y) \in G$, the image $\sigma(z_i)$ of a labelled sample is defined as $(\sigma_X(x_i), \sigma_Y(y_i))$, with $x_i = (x_i[k], k = 1, \dots, d_X)$ and $\sigma_X(x_i) = (x_i[\sigma_X^{-1}(k)], k = 1, \dots, d_X)$. For simplicity and by abuse of notation, the operator mapping a distribution $z = (z_i)_i$ to $\sigma_\sharp z = (\sigma(z_i))_i$ is still denoted $\sigma$.

We denote $Z(\Omega)$ the space of distributions supported on some set $\Omega \subset \mathbb{R}^d$, and we assume that the domain $\Omega$ is invariant under permutations in $G$.

The goal of the paper is to define trainable deep architectures, implementing functions $F$ defined on $Z(\Omega)$ such that these are invariant under $G$, i.e. $F(\sigma_\sharp z) = F(z)$ for any $\sigma \in G$. Such functions will be trained to define meta-features.

2.2 Distribution-Based Invariant Layers

Taking inspiration from [6], the basic building-blocks of the proposed neural architecture are extended to satisfy the feature- and label-invariance requirements.

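(Distribution-based invariant layers) Let an interaction functional $\varphi: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^r$ be $G$-invariant, i.e.

$$\forall \sigma \in G, \ \forall (z_1, z_2) \in \mathbb{R}^d \times \mathbb{R}^d, \quad \varphi(z_1, z_2) = \varphi(\sigma(z_1), \sigma(z_2)).$$

A distribution-based invariant layer $f_\varphi$ is defined as

$$f_\varphi : z = (z_i)_{i=1}^{n} \in Z(\mathbb{R}^d) \ \mapsto \ f_\varphi(z) = \Big( \frac{1}{n} \sum_{j=1}^{n} \varphi(z_1, z_j), \ \dots, \ \frac{1}{n} \sum_{j=1}^{n} \varphi(z_n, z_j) \Big) \in Z(\mathbb{R}^r). \qquad (1)$$

It is easy to see that $f_\varphi$ is $G$-invariant. The construction of such a distribution-based invariant layer is extended to arbitrary (possibly continuous) probability distributions by essentially replacing sums by integrals (Appendix A).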
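The layer just defined can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the interaction functional `phi_example`, the toy dataset and the helper `swap_features` are assumptions introduced for the example, with two feature coordinates and one label per sample.

```python
import random

def invariant_layer(z, phi):
    """Sketch of f_phi: output point i is the mean over j of phi(z_i, z_j)."""
    n = len(z)
    return [
        tuple(sum(col) / n for col in zip(*(phi(zi, zj) for zj in z)))
        for zi in z
    ]

def phi_example(z1, z2):
    # Hypothetical interaction functional on points z = (x[0], x[1], y).
    # It only uses sums over the feature coordinates, so applying the same
    # feature permutation to z1 and z2 does not change its value.
    x1, y1 = z1[:2], z1[2]
    x2, y2 = z2[:2], z2[2]
    dot = sum(a * b for a, b in zip(x1, x2))
    sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return (dot, sq, y1 * y2)

def swap_features(z):
    # Permute the two feature coordinates of every sample; labels are untouched.
    return [(p[1], p[0], p[2]) for p in z]

rng = random.Random(0)
z = [(rng.random(), rng.random(), float(rng.randint(0, 1))) for _ in range(5)]

out = invariant_layer(z, phi_example)
out_swapped = invariant_layer(swap_features(z), phi_example)   # feature permutation
out_reversed = invariant_layer(z[::-1], phi_example)           # sample permutation
```

Under these assumptions, `out_swapped` coincides with `out`, and `out_reversed` contains the same points as `out` in reversed order, i.e. the same distribution. The double loop costs a quadratic number of evaluations of `phi` in the number of samples.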

(Nature of the invariance) Note that the invariance requirement on $\varphi$ is actually less demanding than requiring $\varphi(z, z') = \varphi(\sigma(z), \tau(z'))$ for any two distinct permutations $\sigma$ and $\tau$ in $G$.

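Two particular cases are when the interaction functional $\varphi$ only depends on its first or second input:

  • if $\varphi(z, z') = \psi(z')$, then the layer $f_\varphi$ computes a global “moment” descriptor of the input, as $f_\varphi(z) = \frac{1}{n} \sum_{j=1}^{n} \psi(z_j) \in \mathbb{R}^r$.

  • if $\varphi(z, z') = \xi(z)$, then $f_\varphi$ transports the input distribution via $\xi$, as $f_\varphi(z) = \{\xi(z_i), \ i = 1, \dots, n\} \in Z(\mathbb{R}^r)$. This operation is referred to as a push-forward.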
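The two special cases can be checked on a toy one-dimensional dataset. The sketch below is illustrative only; the maps `psi` and `xi` and the data are hypothetical choices, not taken from the paper.

```python
def invariant_layer(z, phi):
    """Sketch of the invariant layer: mean over j of phi(z_i, z_j) for each i."""
    n = len(z)
    return [
        tuple(sum(col) / n for col in zip(*(phi(zi, zj) for zj in z)))
        for zi in z
    ]

def psi(p):
    # Hypothetical moment map: first and second moments of a 1-D point.
    return (p[0], p[0] ** 2)

def xi(p):
    # Hypothetical point-wise map: scale and shift.
    return (2.0 * p[0] + 1.0,)

z = [(0.0,), (1.0,), (2.0,), (5.0,)]

# Case 1: phi ignores its first argument, so every output point is the
# same vector of empirical moments of the input distribution.
moments = invariant_layer(z, lambda zi, zj: psi(zj))

# Case 2: phi ignores its second argument, so each input point is simply
# moved by xi (push-forward of the distribution).
pushed = invariant_layer(z, lambda zi, zj: xi(zi))
```

In the first case all output points are identical, which is why the output can be read as a single vector rather than a distribution; in the second case the number of points is preserved.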

(Spaces of arbitrary dimension) Both in practice and in theory, it is important to define layers (in particular the first one of the architecture) that can be applied to distributions on of arbitrary dimensions and . This can be achieved by constraining to be of the form, with and :

where and are independent of .
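One way to obtain an interaction functional that accepts any number of features is to pool a shared coordinate-wise map by summation. The sketch below illustrates that general idea only; it is an assumption and not necessarily the exact form used in the paper, and the maps `u`, `v` and the helper `pooled` are hypothetical.

```python
import math

def u(a, b):
    # Hypothetical coordinate-wise map R x R -> R^2, shared by all coordinates.
    return (math.tanh(a * b), math.tanh(a - b) ** 2)

def v(feat_sum, label_sum):
    # Hypothetical map applied to the pooled feature and label statistics.
    return tuple(math.tanh(s) for s in feat_sum + label_sum)

def pooled(a, b):
    # Summing u over coordinates makes the result independent of the
    # coordinate ordering and well defined for any dimension.
    vals = [u(ak, bk) for ak, bk in zip(a, b)]
    return tuple(sum(col) for col in zip(*vals))

def phi(z1, z2):
    """Samples are given as z = (x, y): a tuple of features and a tuple of labels."""
    (x1, y1), (x2, y2) = z1, z2
    return v(pooled(x1, x2), pooled(y1, y2))

# The same functional handles samples with 2 or with 4 features.
small = phi(((0.1, 0.7), (1.0,)), ((0.3, 0.2), (0.0,)))
large = phi(((0.1, 0.7, 0.5, 0.9), (1.0,)), ((0.3, 0.2, 0.4, 0.8), (0.0,)))
# Reversing the feature order of both samples leaves the value unchanged
# up to floating-point rounding.
large_perm = phi(((0.9, 0.5, 0.7, 0.1), (1.0,)), ((0.8, 0.4, 0.2, 0.3), (0.0,)))
```

The output dimension is fixed by `u` and `v` alone, which is what allows the first layer of an architecture to be trained on datasets of varying dimension.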

(Generalization to arbitrary groups) The definition of invariant functions (and the corresponding architectures) can be generalized to arbitrary group operating on (in particular sub-groups of the permutation group). A simple way to design an invariant function is to consider where is -invariant. In the linear case, [28], Theorem 5 shows that these types of functions are the only ones, but this is not anymore true for non-linear functions.

(Localized computation) In practice, the complexity of computing can be reduced by considering only in a neighborhood of . The layer then extracts local information around each of the points.

2.3 Learning Dataset Meta-features from Distributions

The proposed invariant regression neural architectures defined on point distributions (Dida) are defined as


where are the trainable parameters of the architecture (detailed below). Here , and only depends on its second argument (such that should be understood as being a vector, as opposed to a distribution). Note that only is required to be -invariant and dimension-agnostic for the architecture to be as well. In practice, this map defined as in Remark 2.2 is thus learned using inputs of varying dimension as a -invariant layer with , where maps to , maps to , with are affine functions, is a non-linearity and denotes concatenation.

As the following layers () need not be invariant, they are parameterized as using a pair of (matrix,vector). The parameters of the Dida architecture are thus

. They are learned in a supervised fashion, with a loss function depending on the task at hand (see Section 4). By construction, these architectures are invariant w.r.t. the orderings of both the points composing the input distributions and their coordinates. The input distributions can be composed of any number of points in any dimension, which is a distinctive feature with respect to [28].

3 Theoretical Analysis

To get some insight on these architectures, we now detail their robustness to perturbations and their approximation abilities with respect to the convergence in law, which is the natural topology for distributions. Although we expose these contributions for discrete distributions, these results hold for arbitrary (possibly continuous) distributions (supplementary material, Appendix A).

3.1 Optimal Transport Comparison of Datasets

Point clouds vs. distributions.

It is important to note that learning from datasets, referred to as meta-learning for simplicity in the sequel, requires that such datasets be seen as probability distributions, as opposed to point clouds. For instance, having the same point twice in a dataset really corresponds to doubling its mass, i.e. it should have twice the importance of the other points. We thus argue that the natural topology to analyze meta-learning methods is that of the convergence in law, which can be quantified using Wasserstein optimal transport distances. This is in sharp contrast with point cloud architectures (see for instance [34]), which make use of max-pooling and rely on the Hausdorff distance to analyze the architecture properties. While this analysis is standard for low-dimensional (2D and 3D) applications in graphics and vision, it is not suitable for our purpose, because max-pooling is not a continuous operation for the topology of convergence in law.

Wasserstein distance.

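In order to quantify the regularity of the involved functionals, we resort to the 1-Wasserstein distance between two discrete probability distributions $z = (z_i)_{i=1}^{n}$ and $z' = (z'_j)_{j=1}^{m}$ on $\mathbb{R}^d$ (referring the reader to [39, 32] for a comprehensive presentation of Wasserstein distance):

$$W_1(z, z') = \max_{f \in \mathrm{Lip}_1(\mathbb{R}^d)} \ \frac{1}{n} \sum_{i=1}^{n} f(z_i) - \frac{1}{m} \sum_{j=1}^{m} f(z'_j)$$

where $\mathrm{Lip}_1(\mathbb{R}^d)$ is the space of 1-Lipschitz functions $f: \mathbb{R}^d \to \mathbb{R}$. In this paper, as a probability distribution $z$ and its permuted image $\sigma_\sharp z$ under a permutation $\sigma$ of the feature and label coordinates (in the group $G$) are considered to be indistinguishable, one introduces the permutation-invariant 1-Wasserstein distance: for $z, z'$ two distributions on $\mathbb{R}^d$:

$$\overline{W}_1(z, z') = \min_{\sigma \in G} W_1(\sigma_\sharp z, z')$$

such that $\overline{W}_1(z, z') = 0$ if and only if $z$ and $z'$ are equal (in the sense of probability distributions) up to feature permutations (i.e. belong to the same equivalence class, Appendix A).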
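For two uniform discrete distributions with the same number of points, the 1-Wasserstein distance reduces to an optimal assignment between the points, so both distances can be illustrated by brute force on a tiny example. The sketch below is an assumption-laden illustration: it enumerates all assignments (feasible only for a handful of points), uses the Euclidean ground distance, and permutes all coordinates jointly instead of treating features and labels separately.

```python
import itertools
import math

def w1_equal_size(z, zp):
    """1-Wasserstein distance between two uniform discrete distributions
    with the same number of points, by enumeration of assignments."""
    n = len(z)
    assert n == len(zp)
    best = min(
        sum(math.dist(z[i], zp[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n

def permute_coords(z, sigma):
    # Apply the same coordinate permutation to every point of z.
    return [tuple(p[k] for k in sigma) for p in z]

def w1_invariant(z, zp):
    """Minimum of the 1-Wasserstein distance over coordinate permutations of z."""
    d = len(z[0])
    return min(
        w1_equal_size(permute_coords(z, sigma), zp)
        for sigma in itertools.permutations(range(d))
    )

z = [(0.0, 1.0), (2.0, 3.0), (4.0, 0.5)]
zp = [(3.0, 2.0), (1.0, 0.0), (0.5, 4.0)]  # z with coordinates swapped, points shuffled

plain = w1_equal_size(z, zp)
invariant = w1_invariant(z, zp)
```

Here `plain` is strictly positive while `invariant` vanishes, since `zp` is a coordinate-permuted copy of `z`. For realistic sizes a dedicated optimal transport or assignment solver would replace the enumeration.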

Lipschitz property.

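In this context, a map $f: Z(\mathbb{R}^d) \to Z(\mathbb{R}^r)$ is continuous for the convergence in law (aka the weak* topology of distributions, denoted $\rightharpoonup$) if for any sequence $z^{(k)} \rightharpoonup z$, then $f(z^{(k)}) \rightharpoonup f(z)$. (Note that $f$ takes any probability distribution on $\mathbb{R}^d$ as input; in particular, samples of size $n$ belonging to $Z_n(\mathbb{R}^d)$ are accepted for any $n$, as well as continuous distributions, Appendix A.) The Wasserstein distance metrizes the convergence in law, in the sense that $z^{(k)} \rightharpoonup z$ is equivalent to $W_1(z^{(k)}, z) \to 0$. Such a map is furthermore said to be $C$-Lipschitz for the permutation-invariant 1-Wasserstein distance $\overline{W}_1$ if

$$\forall z, z' \in Z(\mathbb{R}^d), \quad \overline{W}_1(f(z), f(z')) \leq C \, \overline{W}_1(z, z'). \qquad (3)$$
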
Lipschitz properties enable us to analyze robustness to input perturbations, since they ensure that if the input distributions are close enough (in the permutation-invariant 1-Wasserstein sense), the corresponding outputs are close too.

3.2 Regularity of Distribution-Based Invariant Layers

The following propositions show the robustness of invariant layers with respect to different variations of their input, assuming the following regularity condition on the interaction functional:


The proofs of this section are detailed in Appendix B. We first show that invariant layers are Lipschitz regular. This ensures that deep architectures of the form (2) map close inputs onto close outputs.

Invariant layers of type  (1) are -Lipschitz in the sense of (3).

Secondly, we consider perturbations with respect to diffeomorphisms. This stability is important, for instance, to cope with situations where an auto-encoder has been trained, so that a dataset and its encoded-decoded representation are expected to yield similar meta-features. The following proposition shows that the outputs of an invariant layer on a dataset and on its image under a deformation are indeed close if the deformation is close to the identity, which is expected when using auto-encoders. It also shows that, similarly, if both inputs and outputs are modified by regular deformations, then the outputs are also close.

For and two Lipschitz maps, one has for all ,

3.3 Universality of Invariant Layers

We now show that our architecture can approximate any continuous invariant map. More precisely, the following proposition shows that the combination of an invariant layer (1) and a fully-connected layer are enough to reach universal approximation capability. This statement holds for arbitrary distributions (not necessarily discrete) and for functions defined on spaces of arbitrary dimension in the sense of Remark 2.2 (assuming some a priori bound on the dimensions).

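Let $F: Z(\Omega) \to \mathbb{R}$ be a $G$-invariant map on a compact $\Omega$, continuous for the convergence in law. Then $\forall \varepsilon > 0$, there exist two continuous maps $\psi, \varphi$ such that

$$\forall z \in Z(\Omega), \quad |F(z) - \psi \circ f_\varphi(z)| < \varepsilon$$

where $f_\varphi$ denotes the invariant layer (1) with interaction functional $\varphi$, and $\varphi$ is $G$-invariant and independent of $F$.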


We give a sketch of the proof, more detail is provided in Appendix C). We consider where: (i) is the collection of elementary symmetric polynomials in the features and elementary symmetric polynomials in the labels, which are invariant to ; (ii) is defined through a discretization of on a grid; (iii) applies function on a discretized version of z – which requires to be bijective: this is achieved by , through a projection on the quotient space and a restriction to its image compact . The sum in definition of computes an expectation which collects integrals over each cell of the grid to approximate measure by a discrete counterpart . Hence applies to . Continuity is obtained as follows: (i) proximity of and is guaranteed (see Lemma C from [6]) and gets tighter as the discretization step tends to 0 ; (ii) the map is regular enough (-Hölder, see theorem 1.3.1 from [35]) such that according to Lemma C, can be upper-bounded; (iii) since is compact, by Banach-Alaoglu theorem, also is. Since is continuous, it is thus uniformly weak-* continuous: choosing a discretization step small enough ensures the result. ∎
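The first ingredient of the proof sketch, elementary symmetric polynomials, can be illustrated directly: they take the same value on every reordering of their input, and together they determine the input up to reordering. The snippet below is a small illustration on an arbitrary vector; the function name and the data are choices made for the example.

```python
import itertools
import math

def elementary_symmetric(x):
    """Return (e_1(x), ..., e_d(x)) for a tuple x of length d, where e_k is
    the sum of the products of all k-element subsets of coordinates."""
    d = len(x)
    return tuple(
        sum(math.prod(c) for c in itertools.combinations(x, k))
        for k in range(1, d + 1)
    )

x = (2.0, 3.0, 5.0)
e = elementary_symmetric(x)

# Every reordering of x yields the same polynomial values.
values = {elementary_symmetric(p) for p in itertools.permutations(x)}
```

Applying such a map separately to the feature block and to the label block of a sample gives a representation that does not depend on the ordering within each block.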

(Approximation by an invariant NN) A consequence of theorem 3.3 is that any continuous invariant regression function taking (compactly supported) distributions can be approximated to arbitrary precision by an invariant neural network. This result is detailed in Appendix C and uses the following ingredients: (i) an invariant layer with that can be approximated by an invariant network; (ii) the universal approximation theorem [5, 25]; (iii) uniform continuity to obtain uniform bounds.

(Extension to different spaces) Theorem 3.3 also extends to distributions supported on different spaces, by considering a joint embedding space of large enough dimension. This way, any invariant prediction function can (uniformly) be approximated by an invariant network, up to setting added coordinates to zero (Appendix C).

4 Learning meta-features: proofs of concept

To showcase the validity of the proposed architecture, two proofs of concept are proposed, extracting meta-features by training Dida (code available at https://github.com/herilalaina/dida) to achieve two tasks, respectively distribution identification and performance model learning.

4.1 Experimental setting

Three benchmarks have been considered (details in supplementary material, Appendix D). Benchmarks TOY and UCI are taken from [20], respectively involving toy datasets with instances in , and 121 datasets from the UCI repository [8]. Benchmark OpenML-3D is derived from 593 datasets extracted from the OpenML repository [44], where each dataset gives rise to compressed datasets using auto-encoders (instance being replaced with its 3d-image in latent space). Twenty such compressed datasets are generated for each initial OpenML dataset. Each benchmark is divided into 70%-30% training-test sets (all compressed datasets generated from a same dataset being either in training or in test sets).
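The constraint that all compressed datasets derived from the same source fall on the same side of the split can be sketched as a split over source identifiers. This is an illustrative sketch, not the authors' code: the function `grouped_split`, the seed and the toy identifiers are assumptions.

```python
import random

def grouped_split(items, group_of, train_fraction=0.7, seed=0):
    """Split items so that all items sharing a group end up on the same side."""
    groups = sorted({group_of(it) for it in items})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = round(train_fraction * len(groups))
    train_groups = set(groups[:n_train])
    train = [it for it in items if group_of(it) in train_groups]
    test = [it for it in items if group_of(it) not in train_groups]
    return train, test

# Hypothetical identifiers: (source dataset id, index of the compressed copy),
# with 20 compressed copies per source dataset.
items = [(src, copy) for src in range(10) for copy in range(20)]
train, test = grouped_split(items, group_of=lambda it: it[0])
```

Splitting at the level of source datasets avoids evaluating on compressed copies of a dataset that was seen during training.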

The Dida neural architecture includes 2 invariant layers followed by three fully connected layers of sizes 256, 128, 64. The first layer processes a dataset z (finite distribution in dimension ), yielding a distribution in dimension 10, while the second layer yields a deterministic vector in dimension 1024. The latter is processed by the FC architecture; denotes the learned meta-features, with Dida parameters (section 2.3).

All experiments are run on 4 NVIDIA-Tesla-V100-SXM2 GPUs with 32GB memory, using Adam optimizer with base learning rate and batch size 32.

4.2 Task 1: Distribution Identification

The patch identification task is introduced by [20]. Let dataset , referred to as patch of dataset , be extracted by uniformly selecting a subset of samples with indices in . To each pair of patches (z,z’) (with same number of instances) is associated the binary meta-label , set to 1 iff z and z’ are extracted from the same initial dataset. In this case, the Dida parameters are trained to build the (dimension-agnostic) model minimizing the (weighted version of) binary cross-entropy loss:


with and meta-features defined as the 64-dimensional output of the last FC layer.
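A weighted binary cross-entropy over pairs of patches can be sketched as follows. The link between meta-features and the predicted meta-label used here (exponential of minus the Euclidean distance) is an assumption made for the illustration, as are the class weights and the toy meta-feature vectors; the sketch does not claim to reproduce the exact loss of the paper.

```python
import math

def similarity(mf_a, mf_b):
    # Assumed link: close meta-feature vectors give a probability near 1.
    return math.exp(-math.dist(mf_a, mf_b))

def weighted_bce(pairs, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """pairs: iterable of (mf_a, mf_b, label), label 1 iff both patches come
    from the same dataset. Returns the weighted mean cross-entropy."""
    total, weight = 0.0, 0.0
    for mf_a, mf_b, label in pairs:
        p = min(max(similarity(mf_a, mf_b), eps), 1.0 - eps)
        if label == 1:
            total -= w_pos * math.log(p)
            weight += w_pos
        else:
            total -= w_neg * math.log(1.0 - p)
            weight += w_neg
    return total / weight

# Toy meta-features: consistent pairs (close when label is 1) versus
# inconsistent pairs (far apart when label is 1).
good = [((0.0, 0.0), (0.0, 0.1), 1), ((0.0, 0.0), (3.0, 4.0), 0)]
bad = [((0.0, 0.0), (3.0, 4.0), 1), ((0.0, 0.0), (0.0, 0.1), 0)]

loss_good = weighted_bce(good)
loss_bad = weighted_bce(bad)
```

The weights `w_pos` and `w_neg` can compensate for the imbalance between same-dataset and different-dataset pairs, which grows with the number of datasets in the benchmark.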

The Dida performance is assessed comparatively to Dataset2Vec (code available at https://github.com/hadijomaa/dataset2vec). Table 1 shows that Dida significantly outperforms Dataset2Vec on all benchmarks (columns 1-3), all the more so as the number of features in the datasets is large (in UCI). Uncertainty estimates are obtained with a 3-fold split of the test set.

Method | TOY (1) | UCI (2) | OpenML (3) | OpenML (4)
Dida | 97.2% ± 0.1 | 89.2% ± 2.1 | 98.54% ± 0.9 | 91.57% ± 2.11
Table 1: Distribution identification: Comparative performances of Dida and Dataset2Vec on patch identification (columns 1-3) and distribution identification (column 4; see text).

An original generalization of patch identification is defined using OpenML-3D, where the label of a pair of patches (z, z′) is set to 1 iff z and z′ are extracted from some u and u′, with u and u′ derived by auto-encoder from the same initial OpenML dataset. The task difficulty is increased compared to patch identification, as patches z and z′ are now extracted from similar distributions, as opposed to the same distribution. (If the composition of the encoder and decoder modules were the identity, the u distribution would be mapped onto the u′ distribution by composing the decoder of the AE used to generate u with the encoder of the AE used to generate u′.) Dida also significantly outperforms Dataset2Vec (Table 1, column (4)).

All experiments are conducted using 10 patches of 100 samples for each dataset. Dida computational time is about 2 hours on TOY and UCI, and 6 hours on OpenML-3D. Dataset2Vec hyperparameters are set to their default values, except the size and number of patches, which are set to the same values as in Dida.

4.3 Task 2: Performance model learning

The set of ML configurations includes 100 SVM configurations (e.g. type and hyper-parameters of the kernel). For each configuration and dataset z, the performance is the predictive accuracy of the SVM learned from z and assessed using a 90%-10% split among training and test sets, with and respectively the best and the median values of for ranging in . Top-k(z) is the set of configurations with highest accuracy on z. The goal of performance modelling is to support the a priori identification of a sufficiently good, or quasi-optimal, configuration for each z.

Dida is trained to approximate the metric induced on OpenML 3D benchmark by the ML configurations . Let the dissimilarity of two datasets z and be defined as:


Based on this dissimilarity, three clusters are defined on each benchmark, and the associated 3-class learning problem is considered, with meta-label the index of the cluster z belongs to. On the top of the last invariant layer (delivering meta-features ) are built the three fully-connected layers followed by a softmax with output for . The Dida parameters are thus learned by classically minimizing the (weighted version of) cross-entropy loss . On the top of meta-features , a metric learning module is trained using ListMLE [49], yielding such that the Euclidean metric based on the be compliant with :


The merits of the meta-features are comparatively established as follows. For each z in the benchmark, let denote the -th nearest neighbor of z according to the metric defined by meta-features MF, be they extracted by Dida, handcrafted as used in [29] or in [11], or based on landmarks [33]. Likewise, let denote the performance on z of the best configuration for , and . The regret of the AutoML process based on MF is defined as .
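The nearest-neighbor evaluation described above can be sketched on toy data. The regret used below (best accuracy achievable on z minus the best accuracy among the configurations recommended by the neighbors, without normalization) is an assumed simplification for illustration; the dataset names, configurations and accuracies are hypothetical.

```python
import math

def nearest_neighbors(query, archive, k):
    """archive: dict mapping a dataset name to its meta-feature vector."""
    return sorted(archive, key=lambda name: math.dist(query, archive[name]))[:k]

def regret(perf_on_z, neighbors, best_config_of):
    """perf_on_z: dict mapping a configuration to its accuracy on z.
    The recommended configurations are the best ones of the neighbors."""
    recommended = {best_config_of[name] for name in neighbors}
    return max(perf_on_z.values()) - max(perf_on_z[c] for c in recommended)

# Hypothetical archive of training datasets and their best configurations.
archive = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
best_config_of = {"a": "rbf", "b": "linear", "c": "poly"}

# Hypothetical accuracies of each configuration on a new dataset z.
perf_on_z = {"rbf": 0.80, "linear": 0.90, "poly": 0.70}
mf_z = (0.1, 0.0)

regret_1 = regret(perf_on_z, nearest_neighbors(mf_z, archive, 1), best_config_of)
regret_2 = regret(perf_on_z, nearest_neighbors(mf_z, archive, 2), best_config_of)
```

With more neighbors, more configurations are tried on z, so this regret cannot increase with k; a regret curve plots it against the number of neighbors.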

Figure 1 displays the regret curve associated with Dida meta-features, compared to that of handcrafted meta-features [29, 11], landmarks [33], or random meta-features; the regret of the configuration that is best on average on the training set is displayed for comparison. Handcrafted and landmark meta-features are normalized, then pre-processed using SVD, retaining the top 10 singular values. These regret curves establish the relevance of the proposed Dida approach; a discussion of its limitations is presented in supplementary material, Appendix D.

Figure 1: Comparative assessment of the AutoML process based on Dida, handcrafted, Auto-Sklearn, landmark and random meta-features on OpenML 3D benchmark.

5 Conclusion

In this paper, we develop Dida, an architecture performing invariant regression on point distributions, invariant w.r.t. feature permutations and accommodating various data sizes, backed by theoretical capabilities of universal approximation and robustness, with natural extensions to continuous distributions.
Tackling the long-known Auto-ML problem, we demonstrate the feasibility and relevance of automatically extracting meta-feature vectors using Dida, outperforming the Dataset2Vec approach [20] and the meta-features manually defined in the last two decades [11, 29].
The ability to pertinently situate a dataset in the landscape defined by ML algorithms paves the way to quite a few applications beyond Auto-ML, ranging from domain adaptation to meta-learning.

6 Acknowledgements

The work of G. De Bie is supported by the Region Ile-de-France. H. Rakotoarison acknowledges funding from the ADEME #1782C0034 project NEXT. The work of G. Peyré is supported by the European Research Council (ERC project NORIA).


  • [1] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag (2013) Collaborative hyperparameter tuning. pp. II–199–II–207. Cited by: §1.1.
  • [2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. pp. 2546–2554. Cited by: §1.2.
  • [3] T. Cohen and M. Welling (2016-20–22 Jun) Group equivariant convolutional networks. 48, pp. 2990–2999. Cited by: §1.2.
  • [4] D. A. Cox, J. Little, and D. O’Shea (2007) Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, 3/e (undergraduate texts in mathematics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387356509 Cited by: item , Appendix C.
  • [5] G. Cybenko (1989)

    Approximation by superpositions of a sigmoidal function

    Mathematics of control, signals and systems 2 (4), pp. 303–314. Cited by: Appendix C, §1, §3.3.
  • [6] G. De Bie, G. Peyré, and M. Cuturi (2019) Stochastic deep networks. pp. 1556–1565. Cited by: Appendix B, Appendix B, item , Appendix C, §1.2, §1.2, §1, §2.2, §3.3.
  • [7] I. Drori, Y. Krishnamurthy, R. Rampin, R. Lourenco, J. One, K. Cho, C. Silva, and J. Freire (2018) AlphaD3M: machine learning pipeline synthesis. Cited by: §1.2.
  • [8] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.1.
  • [9] D. Dua and C. Graff (2017) UCI machine learning repository. External Links: Link Cited by: §D.1.
  • [10] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: A survey. J. Mach. Learn. Res. 20, pp. 55:1–55:21. External Links: Link Cited by: §1.1.
  • [11] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. pp. 2962–2970. External Links: Link Cited by: §1.1, §1.2, §4.3, §4.3, §5.
  • [12] C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9516–9527. Cited by: §1.2.
  • [13] R. Gens and P. M. Domingos (2014) Deep symmetry networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2537–2545. Cited by: §1.2.
  • [14] J. Hartford, D. R. Graham, K. Leyton-Brown, and S. Ravanbakhsh (2018) Deep models of interactions across sets. External Links: 1803.02879 Cited by: §1.2.
  • [15] T. Hashimoto, D. Gifford, and T. Jaakkola (2016-20–22 Jun) Learning population-level diffusions with generative rnns. 48, pp. 2417–2426. Cited by: §1.2.
  • [16] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. CoRR abs/1506.05163. External Links: 1506.05163 Cited by: §1.
  • [17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal processing magazine 29 (6), pp. 82–97. Cited by: §1.
  • [18] F. Hutter, H. H. Hoos, and K. Leyton-Brown (2011) Sequential model-based optimization for general algorithm configuration. pp. 507–523. External Links: ISBN 9783642255656, Link, Document Cited by: §1.1.
  • [19] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. Note: In press, available at http://automl.org/book. Cited by: §1.1, §1.2.
  • [20] H. S. Jomaa, J. Grabocka, and L. Schmidt-Thieme (2019) Dataset2Vec: learning dataset meta-features. External Links: 1905.11063 Cited by: §D.1, §D.2, §1.1, §1.2, §4.1, §4.2, §5.
  • [21] N. Keriven and G. Peyré (2019) Universal invariant and equivariant graph neural networks. pp. 7090–7099. Cited by: §1.2.
  • [22] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2017) Fast Bayesian optimization of machine learning hyperparameters on large datasets. Proceedings of Machine Learning Research 54, pp. 528–536. External Links: Link Cited by: §1.1.
  • [23] R. Kondor and S. Trivedi (2018) On the generalization of equivariance and convolution in neural networks to the action of compact groups. External Links: 1802.03690 Cited by: §1.2.
  • [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. pp. 1097–1105. Cited by: §1.
  • [25] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6 (6), pp. 861–867. Cited by: Appendix C, §1, §3.3.
  • [26] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman (2019) Invariant and equivariant graph networks. Cited by: §1.2.
  • [27] H. Maron, E. Fetaya, N. Segol, and Y. Lipman (2019) On the universality of invariant networks. pp. 4363–4371. Cited by: §1.2.
  • [28] H. Maron, O. Litany, G. Chechik, and E. Fetaya (2020) On learning sets of symmetric elements. External Links: 2002.08599 Cited by: item , §1.2, §2.2, §2.3.
  • [29] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles (2018) Instance spaces for machine learning classification. Machine Learning 107 (1), pp. 109–147. Cited by: §4.3, §4.3, §5.
  • [30] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles (2018) Instance spaces for machine learning classification. Machine Learning 107 (1), pp. 109–147. External Links: ISSN 0885-6125, Document Cited by: §1.1, §1.2.
  • [31] V. Perrone, R. Jenatton, M. W. Seeger, and C. Archambeau (2018) Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6845–6855. Cited by: §1.2.
  • [32] G. Peyré and M. Cuturi (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607. External Links: Link, Document, ISSN 1935-8237 Cited by: §3.1.
  • [33] B. Pfahringer, H. Bensusan, and C. G. Giraud-Carrier (2000) Meta-learning by landmarking various learning algorithms. pp. 743–750. External Links: ISBN 1558607072 Cited by: §1.2, §4.3, §4.3.
  • [34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1.2, §1, §3.1.
  • [35] Q. I. Rahman and G. Schmeisser (2002) Analytic theory of polynomials. Cited by: item , §3.3.
  • [36] H. Rakotoarison, M. Schoenauer, and M. Sebag (2019) Automated machine learning with Monte-Carlo tree search. pp. 3296–3303. External Links: Document Cited by: §1.1, §1.2.
  • [37] S. Ravanbakhsh, J. Schneider, and B. Póczos (2017) Equivariance through parameter-sharing. Proceedings of Machine Learning Research 70, pp. 2892–2901. Cited by: §1.2.
  • [38] J. R. Rice (1976) The algorithm selection problem. Advances in Computers 15, pp. 65–118. Cited by: §1.1, §1.2.
  • [39] F. Santambrogio (2015) Optimal transport for applied mathematicians. Birkhäuser, NY. Cited by: Appendix B, §3.1.
  • [40] N. Segol and Y. Lipman (2019) On universal equivariant set networks. External Links: 1910.02421 Cited by: §1.2.
  • [41] J. Shawe-Taylor (1993) Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks 4 (5), pp. 816–826. External Links: Document, ISSN 1941-0093 Cited by: §1.
  • [42] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. pp. 847–855. Cited by: §1.2.
  • [43] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §D.1, §1.1.
  • [44] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §D.4, §4.1.
  • [45] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. pp. 1096–1103. Cited by: §D.1.
  • [46] O. Vinyals, S. Bengio, and M. Kudlur (2016) Order matters: sequence to sequence for sets. Cited by: §1.2.
  • [47] D. H. Wolpert (1996) The lack of A priori distinctions between learning algorithms. Neural Computation 8 (7), pp. 1341–1390. Note: No Free Lunch for Machine Learning Cited by: §1.2.
  • [48] J. Wood and J. Shawe-Taylor (1996) Representation theory and invariant neural networks. Discrete applied mathematics 69 (1-2), pp. 33–60. Cited by: §1.2.
  • [49] F. Xia, T. Liu, J. Wang, W. Zhang, and H. Li (2008) Listwise approach to learning to rank: theory and algorithm. pp. 1192–1199. Cited by: §4.3.
  • [50] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn (2018) Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7332–7342. Cited by: §1.2.
  • [51] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3391–3401. Cited by: 2nd item, §1.2.


Appendix A Extension to arbitrary distributions

Overall notations.

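Let $X \in \mathcal{R}(\mathbb{R}^d)$ denote a random vector on $\mathbb{R}^d$ with $\alpha_X \in \mathcal{P}(\mathbb{R}^d)$ its law (a positive Radon measure with unit mass). By definition, its expectation, denoted $\mathbb{E}(X)$, reads $\mathbb{E}(X) = \int_{\mathbb{R}^d} x \, \mathrm{d}\alpha_X(x) \in \mathbb{R}^d$, and for any continuous function $f : \mathbb{R}^d \to \mathbb{R}^r$, $\mathbb{E}(f(X)) = \int_{\mathbb{R}^d} f(x) \, \mathrm{d}\alpha_X(x)$. In the following, two random vectors $X$ and $X'$ with the same law $\alpha_X$ are considered indistinguishable, noted $X' \sim X$. Letting $f : \mathbb{R}^d \to \mathbb{R}^r$ denote a function on $\mathbb{R}^d$, the push-forward operator by $f$, noted $f_\sharp : \mathcal{P}(\mathbb{R}^d) \to \mathcal{P}(\mathbb{R}^r)$, is defined as follows, for any continuous function $g$ from $\mathbb{R}^r$ to $\mathbb{R}$ ($g$ in $\mathcal{C}(\mathbb{R}^r; \mathbb{R})$):

$$\int_{\mathbb{R}^r} g \, \mathrm{d}(f_\sharp \alpha) = \int_{\mathbb{R}^d} g(f(x)) \, \mathrm{d}\alpha(x).$$

Letting $\{x_i\}$ be a set of points in $\mathbb{R}^d$ with $w_i > 0$ such that $\sum_i w_i = 1$, the discrete measure $\alpha_X = \sum_i w_i \delta_{x_i}$ is the sum of the Dirac measures $\delta_{x_i}$ weighted by $w_i$.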
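As an illustration of these notations, the following minimal sketch represents a discrete law as a list of (weight, point) pairs and checks the defining property of the push-forward on a toy example. The names `expectation`, `push_forward` and `square`, as well as the numerical values, are assumptions made for this example and do not come from the paper.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
Measure = List[Tuple[float, Point]]  # (weight w_i, atom x_i), weights sum to 1


def expectation(alpha: Measure, f: Callable[[Point], Point] = lambda x: x) -> Point:
    """Return E[f(X)] = sum_i w_i * f(x_i) for the discrete law alpha."""
    images = [(w, f(x)) for w, x in alpha]
    dim = len(images[0][1])
    return tuple(sum(w * y[k] for w, y in images) for k in range(dim))


def push_forward(f: Callable[[Point], Point], alpha: Measure) -> Measure:
    """Return f_# alpha: each atom x_i moves to f(x_i) and keeps its weight."""
    return [(w, f(x)) for w, x in alpha]


def square(x: Point) -> Point:
    """Hypothetical test map, applied coordinate-wise."""
    return tuple(v * v for v in x)


alpha = [(0.5, (0.0, 2.0)), (0.25, (4.0, 0.0)), (0.25, (0.0, 4.0))]
mean = expectation(alpha)

# Integrating g against f_# alpha equals integrating g(f(x)) against alpha
# (here with g taken as the identity).
lhs = expectation(push_forward(square, alpha))
rhs = expectation(alpha, square)
```

Both sides of the last check agree because the push-forward only relocates atoms and leaves their weights untouched.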


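In this paper, we consider functions on probability measures that are invariant with respect to permutations of coordinates. Therefore, denoting $S_d$ the $d$-sized permutation group, we consider measures over a symmetrized compact $\Omega \subset \mathbb{R}^d$ equipped with the following equivalence relation: for $\alpha, \beta \in \mathcal{P}(\Omega)$, $\alpha \sim \beta \iff \exists \sigma \in S_d, \ \beta = \sigma_\sharp \alpha$, such that a measure and its permuted counterpart are indistinguishable in the corresponding quotient space, denoted alternatively $\mathcal{P}(\Omega)_{/\sim}$ or $\mathcal{R}(\Omega)_{/\sim}$. A function $\varphi : \Omega^n \to \mathbb{R}^r$ is said to be invariant (by permutations of coordinates) iff $\forall \sigma \in S_d, \ \varphi(x_1, \ldots, x_n) = \varphi(\sigma(x_1), \ldots, \sigma(x_n))$ (Definition 1).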
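The equivalence relation and the invariance property can be checked by brute force on small examples. The sketch below is only illustrative: the helper names `permute`, `equivalent` and `dot` are hypothetical, the search enumerates all $d!$ coordinate permutations, and weights and coordinates are compared exactly, which is adequate for a toy check but not for noisy data.

```python
from collections import Counter
from itertools import permutations
from math import isclose
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Measure = List[Tuple[float, Point]]  # (weight, atom) pairs


def permute(sigma: Sequence[int], x: Point) -> Point:
    """Apply the coordinate permutation sigma to the point x."""
    return tuple(x[k] for k in sigma)


def equivalent(alpha: Measure, beta: Measure) -> bool:
    """True if beta equals sigma_# alpha for some coordinate permutation sigma.

    Toy check: exact comparison, each atom assumed to be listed once.
    """
    d = len(alpha[0][1])
    target = Counter(beta)
    return any(
        Counter((w, permute(sigma, x)) for w, x in alpha) == target
        for sigma in permutations(range(d))
    )


def dot(x: Point, y: Point) -> float:
    """Hypothetical invariant map: the dot product ignores coordinate order."""
    return sum(a * b for a, b in zip(x, y))


alpha = [(0.5, (1.0, 2.0, 3.0)), (0.5, (4.0, 5.0, 6.0))]
beta = [(0.5, (6.0, 4.0, 5.0)), (0.5, (3.0, 1.0, 2.0))]   # same permutation on every atom
gamma = [(0.5, (1.0, 2.0, 3.0)), (0.5, (6.0, 5.0, 4.0))]  # a different reordering per atom

same_class = equivalent(alpha, beta)
other_class = equivalent(alpha, gamma)
x, y = alpha[0][1], alpha[1][1]
dot_is_invariant = all(
    isclose(dot(permute(s, x), permute(s, y)), dot(x, y))
    for s in permutations(range(3))
)
```

The measure `gamma` falls outside the class of `alpha` because the permutation must be the same for all atoms, which is exactly what distinguishes a feature permutation from an arbitrary per-sample shuffle.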


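Letting $X$ and $Y$ respectively denote two random vectors on $\mathbb{R}^d$ and $\mathbb{R}^p$, the tensor product vector $X \otimes Y$ is defined as: $X \otimes Y = (X', Y') \in \mathcal{R}(\mathbb{R}^d \times \mathbb{R}^p)$, where $X'$ and $Y'$ are independent and have the same law as $X$ and $Y$, i.e. $\mathrm{d}\alpha_{X \otimes Y}(x, y) = \mathrm{d}\alpha_X(x) \, \mathrm{d}\alpha_Y(y)$. In the finite case, for $\alpha_X = \frac{1}{n} \sum_i \delta_{x_i}$ and $\alpha_Y = \frac{1}{m} \sum_j \delta_{y_j}$, then $\alpha_{X \otimes Y} = \frac{1}{nm} \sum_{i,j} \delta_{(x_i, y_j)}$, a weighted sum of Dirac measures on all pairs $(x_i, y_j)$. The $k$-fold tensorization of a random vector $X \sim \alpha_X$, with law $\alpha_X^{\otimes k}$, generalizes the above construction to the case of $k$ independent random variables with law $\alpha_X$. Tensorization will be used to define the law of datasets, and design universal architectures (Appendix C).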
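In the finite case, tensorization amounts to enumerating all pairs of atoms and multiplying their weights. The following sketch, with the hypothetical helpers `tensor` and `tensor_power`, stores the pair of points as one concatenated tuple and allows arbitrary weights, a slight generalization of the uniform case written above.

```python
from functools import reduce
from math import isclose
from typing import List, Tuple

Point = Tuple[float, ...]
Measure = List[Tuple[float, Point]]  # (weight, atom) pairs


def tensor(alpha: Measure, beta: Measure) -> Measure:
    """Product law: atom (x_i, y_j), stored as a concatenated tuple, with weight w_i * v_j."""
    return [(wa * wb, xa + xb) for wa, xa in alpha for wb, xb in beta]


def tensor_power(alpha: Measure, k: int) -> Measure:
    """k-fold tensorization of alpha, for k >= 1 (k independent copies)."""
    return reduce(tensor, [alpha] * k)


alpha = [(0.5, (0.0,)), (0.5, (1.0,))]               # uniform on two points of R
beta = [(0.25, (10.0, 20.0)), (0.75, (30.0, 40.0))]  # two points of R^2

product_law = tensor(alpha, beta)   # 2 * 2 atoms in R^3
cube = tensor_power(alpha, 3)       # 2**3 atoms in R^3
total_mass = sum(w for w, _ in product_law)
```

The number of atoms grows as $n^k$ under $k$-fold tensorization, so the explicit enumeration is only practical for small supports.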

Invariant layers.

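In the general case, an invariant layer $f_\varphi$ with invariant map $\varphi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^r$ such that $\varphi$ satisfies

$$\forall (x_1, x_2) \in (\mathbb{R}^d)^2, \ \forall \sigma \in S_d, \quad \varphi(\sigma(x_1), \sigma(x_2)) = \varphi(x_1, x_2)$$

is defined as

$$f_\varphi : X \in \mathcal{R}(\mathbb{R}^d)_{/\sim} \mapsto \mathbb{E}_{X' \sim X}[\varphi(X, X')] \in \mathcal{R}(\mathbb{R}^r),$$

where the expectation is taken over $X' \sim X$. Note that considering the couple $(X, X')$ of independent random vectors $X' \sim X$ amounts to considering the tensorized law $\alpha_X \otimes \alpha_X$.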

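Taking as input a discrete distribution $\alpha_X = \sum_i w_i \delta_{x_i}$, the invariant layer outputs another discrete distribution $\alpha_Y = \sum_i w_i \delta_{y_i}$ with $y_i = \sum_j w_j \varphi(x_i, x_j)$; each input point $x_i$ is mapped onto $y_i$ summarizing the pairwise interactions with $x_i$ after $\varphi$.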
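The discrete case can be sketched in a few lines. In the example below, `invariant_layer` maps each atom $x_i$ to the weighted average of $\varphi(x_i, x_j)$ over all atoms $x_j$ and keeps the weight of $x_i$. The map `phi` is a hypothetical hand-written invariant function chosen for the example (in the paper's setting it would be a learned network), and the data are toy values.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
Measure = List[Tuple[float, Point]]  # (weight, atom) pairs


def invariant_layer(alpha: Measure, phi: Callable[[Point, Point], Point]) -> Measure:
    """Map each atom x_i to y_i = sum_j w_j * phi(x_i, x_j), keeping the weight w_i."""
    out = []
    for w_i, x_i in alpha:
        values = [(w_j, phi(x_i, x_j)) for w_j, x_j in alpha]
        r = len(values[0][1])
        y_i = tuple(sum(w_j * v[k] for w_j, v in values) for k in range(r))
        out.append((w_i, y_i))
    return out


def phi(x: Point, y: Point) -> Point:
    """Hypothetical invariant map into R^2: dot product and sum of all coordinates."""
    return (sum(a * b for a, b in zip(x, y)), sum(x) + sum(y))


alpha = [(0.5, (1.0, 0.0)), (0.5, (0.0, 2.0))]
swapped = [(w, (x[1], x[0])) for w, x in alpha]  # the two features exchanged

output = invariant_layer(alpha, phi)
output_swapped = invariant_layer(swapped, phi)
```

Since `phi` is unchanged when the same coordinate permutation is applied to both arguments, exchanging the two features of the input leaves the output distribution unchanged. The cost is quadratic in the number of atoms.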

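Invariant layers can also be generalized to handle higher-order interaction functionals, namely $f_\varphi(X) = \mathbb{E}_{X_2, \ldots, X_N \sim X}[\varphi(X, X_2, \ldots, X_N)]$, which amounts to considering, in the discrete case, $N$-tuples of input points $(x_{j_1}, \ldots, x_{j_N})$.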
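A higher-order layer can be sketched the same way by averaging over tuples of atoms instead of single atoms. The helper `higher_order_layer` and the third-order map `triple` below are hypothetical, the map is scalar-valued to keep the example short, and the enumeration costs $n^{N-1}$ evaluations per atom.

```python
from itertools import product
from math import prod
from typing import Callable, List, Tuple

Point = Tuple[float, ...]
Measure = List[Tuple[float, Point]]  # (weight, atom) pairs


def higher_order_layer(
    alpha: Measure, phi: Callable[..., float], order: int
) -> List[Tuple[float, float]]:
    """Map x_i to the average of phi(x_i, x_{j_2}, ..., x_{j_N}) over independent draws."""
    out = []
    for w_i, x_i in alpha:
        y_i = 0.0
        for others in product(alpha, repeat=order - 1):
            weight = prod(w for w, _ in others)       # product of the tuple's weights
            y_i += weight * phi(x_i, *(x for _, x in others))
        out.append((w_i, y_i))
    return out


def triple(x: Point, y: Point, z: Point) -> float:
    """Hypothetical third-order invariant map: sum_k x[k] * y[k] * z[k]."""
    return sum(a * b * c for a, b, c in zip(x, y, z))


alpha = [(0.5, (1.0, 0.0)), (0.5, (0.0, 2.0))]
third_order = higher_order_layer(alpha, triple, order=3)
```

With `order=2` the same function reduces to the pairwise layer for a scalar-valued map.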

Appendix B Proofs on Regularity

Wasserstein distance.

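The regularity of the involved functionals is measured w.r.t. the $1$-Wasserstein distance between two probability distributions $(\alpha, \beta) \in \mathcal{P}(\mathbb{R}^d)^2$:

$$W_1(\alpha, \beta) = \min_{\pi_1 = \alpha, \, \pi_2 = \beta} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\| \, \mathrm{d}\pi(x, y),$$

where the minimum is taken over measures $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\alpha, \beta$. $W_1$ is known to be a norm [39], that can be conveniently computed using $W_1(\alpha, \beta) = W_1(\alpha - \beta) = \max_{\mathrm{Lip}(g) \le 1} \int_{\mathbb{R}^d} g \, \mathrm{d}(\alpha - \beta)$, where $\mathrm{Lip}(g)$ is the Lipschitz constant of $g : \mathbb{R}^d \to \mathbb{R}$ with respect to the Euclidean norm (unless otherwise stated). For simplicity and by abuse of notation, $W_1(X, Y)$ is used instead of $W_1(\alpha, \beta)$ when $X \sim \alpha$ and $Y \sim \beta$. The convergence in law, denoted $\rightharpoonup$, is equivalent to the convergence in Wasserstein distance in the sense that $X_k \rightharpoonup X$ is equivalent to $W_1(X_k, X) \to 0$.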
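For two uniform discrete measures with the same number of atoms, an optimal coupling can be chosen as a one-to-one matching, so the distance can be computed on tiny examples by enumerating matchings. The sketch below makes that assumption explicit; the function name `w1_uniform` is hypothetical, and the brute-force search over $n!$ matchings is a stand-in for a proper optimal transport solver.

```python
from itertools import permutations
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def w1_uniform(xs: Sequence[Point], ys: Sequence[Point]) -> float:
    """Exact W1 between the uniform measures on xs and ys, Euclidean ground cost.

    Assumes both supports have the same size n and uniform weights 1/n,
    in which case the minimum over couplings is attained at a matching.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch needs the same number of atoms on both sides")
    best = min(
        sum(dist(x, ys[j]) for x, j in zip(xs, match))
        for match in permutations(range(n))
    )
    return best / n


xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 1.0), (1.0, 1.0)]
shifted = w1_uniform(xs, ys)  # every atom moves one unit upwards
```

Here the cheapest matching sends each point straight up, giving an average displacement of one unit.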

Permutation-invariant Wasserstein distance.

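The Wasserstein distance is quotiented according to the permutation-invariance equivalence classes: for $\alpha, \beta \in \mathcal{P}(\mathbb{R}^d)$,

$$\overline{W}_1(\alpha, \beta) = \min_{\sigma \in S_d} W_1(\sigma_\sharp \alpha, \beta),$$

such that $\overline{W}_1(\alpha, \beta) = 0 \iff \alpha \sim \beta$. $\overline{W}_1$ defines a norm on $\mathcal{P}(\mathbb{R}^d)_{/\sim}$.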
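The quotiented distance can be illustrated by minimizing a plain 1-Wasserstein distance over all coordinate permutations of one input. The sketch below is a toy computation under stated assumptions: uniform weights, supports of equal size, and brute-force enumeration of both the $d!$ coordinate permutations and the $n!$ matchings. The names `w1_uniform` and `w1_bar` are hypothetical.

```python
from itertools import permutations
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def w1_uniform(xs: Sequence[Point], ys: Sequence[Point]) -> float:
    """Exact W1 between uniform measures on xs and ys (equal sizes), by enumeration."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch needs the same number of atoms on both sides")
    return min(
        sum(dist(x, ys[j]) for x, j in zip(xs, match))
        for match in permutations(range(n))
    ) / n


def w1_bar(xs: Sequence[Point], ys: Sequence[Point]) -> float:
    """Permutation-invariant W1: minimum over coordinate permutations sigma of W1(sigma_# xs, ys)."""
    d = len(xs[0])
    return min(
        w1_uniform([tuple(x[k] for k in sigma) for x in xs], ys)
        for sigma in permutations(range(d))
    )


xs = [(1.0, 0.0), (2.0, 0.0)]
ys = [(0.0, 1.0), (0.0, 2.0)]  # same data with the two features exchanged

plain = w1_uniform(xs, ys)
quotient = w1_bar(xs, ys)
```

The two datasets differ only by the order of their features, so the plain distance is strictly positive while the quotiented one vanishes.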

Lipschitz property.

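A map $f : \mathcal{R}(\mathbb{R}^d) \to \mathcal{R}(\mathbb{R}^r)$ is continuous for the convergence in law (aka the weak-* topology of measures) if for any sequence $X_k \rightharpoonup X$, then $f(X_k) \rightharpoonup f(X)$. Such a map is furthermore said to be $C$-Lipschitz for the permutation invariant 1-Wasserstein distance if

$$\forall (X, Y) \in (\mathcal{R}(\mathbb{R}^d)_{/\sim})^2, \quad \overline{W}_1(f(X), f(Y)) \le C \, \overline{W}_1(X, Y).$$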

Lipschitz properties enable us to analyze robustness to input perturbations, since they ensure that if the input distributions of random vectors are close in the permutation-invariant Wasserstein sense, the corresponding output laws are close, too.

Proofs of section 3.2.


(Proposition 3.2). For , Proposition 1 from [6] yields , hence, for

hence, taking the infimum over yields

Since is invariant, for , ,

Taking the infimum over yields the result. ∎


(Proposition 3.2). To upper bound for , we proceed as follows, using Proposition 3 from [6] and Proposition 3.2:

For , we get

Taking the infimum over yields