1 Introduction
Deep network architectures, initially devised for structured data such as images [24] and speech [17], have been extended to respect some invariance or equivariance [41] of more complex data sets. This includes for instance point clouds [34], graphs [16] and probability distributions [6], which are invariant with respect to permutations of the input points. In such cases, invariant architectures improve practical performance while inheriting the universal approximation properties of neural nets [5, 25].
1.1 Distribution-based Architectures and AutoML
This paper focuses on distribution-based neural architectures, i.e. deep networks tailored to manipulate distributions of points. For the sake of simplicity, we describe our architectures over discrete distributions, represented as uniform distributions on a set of points of arbitrary size. The extension to arbitrary (possibly continuous) distributions is detailed in the supplementary material, Appendix A.
In this paper, distribution-based neural architectures are extended to cope with an additional invariance: the space of features and labels (i.e. the space supporting the distributions) is also assumed to be invariant under permutation of its coordinates. This extra invariance is important to tackle AutoML problems [38, 30, 11, 19, 1, 18, 22, 36, 10]. AutoML aims to identify a priori
the ML configuration (learning algorithm and its hyperparameters) best suited to the dataset under consideration, in the sense of a given performance indicator. Were a dataset associated with accurate descriptive features, referred to as metafeatures, the AutoML problem could be handled by solving yet another supervised learning problem: given archives recording the performance of various ML configurations on various datasets
[43], with each dataset described as a vector of metafeatures, the best-performing algorithm (among these configurations) on a new dataset
z could be predicted from its metafeatures. The design of accurate metafeatures however has eluded research since the 1980s (with the exception of [20], more in Section 1.2), to such an extent that the prominent AutoML approaches currently rely on learning a performance model specific to each dataset [11, 36].

1.2 Related Works and Contributions
Learning from finite discrete distributions.
Learning from sets of samples subject to invariance or equivariance properties opens up a wide range of applications: in the sequence-to-sequence framework, relaxing the order in which the input is organized might be beneficial [46]. The ability to follow populations at a macroscopic level, using distributions on their evolution along time, without requiring individual trajectories to be followed and regardless of the population size, is appreciated when modelling dynamic cell processes [15]. The use of sets of pixels, as opposed to e.g. voxelized approaches in computer vision [6], offers better scalability in terms of data dimensionality and computational resources. Most generally, the fact that the considered hypothesis space / neural architecture complies with domain-dependent invariances ensures a better robustness of the eventually learned model, better capturing the data geometry. Such neural architectures have been pioneered by [34, 51] for learning from point clouds subject to permutation invariance or equivariance. These have been extended to permutation equivariance across sets [14]. Characterizations of invariance or equivariance under group actions have been proposed in the finite [13, 3, 37] or infinite case [48, 23]. A general characterization of linear layers, on top of a representation, that are invariant or equivariant with respect to the whole permutation group has been proposed by [26, 21]. Universality results are known to hold in the case of sets [51], point clouds [34], equivariant point clouds [40], discrete measures [6], invariant [27] and equivariant [21]
graph neural networks. The approach most related to our work is that of
[28], presenting a neural architecture invariant w.r.t. the ordering of samples and their features. The originality of our approach is that we do not fix in advance the number of samples, and consider probability distributions instead of point clouds. This allows us to leverage the natural topology of optimal transport to theoretically assess the universality and smoothness of our architectures, which is adapted to tackle the AutoML problem.

AutoML.
The absence of learning algorithms efficient on all datasets [47] makes AutoML, i.e. the automatic identification of the machine learning pipeline yielding the best performance on the task at hand, a main bottleneck toward the so-called democratization of machine learning technology [19]. The AutoML field has been sparking interest for more than four decades [38], spanning from hyperparameter optimization [2] to the optimization of the whole pipeline [11]. Formally, AutoML defines a mixed integer and discrete optimization problem (finding the ML pipeline algorithms and their hyperparameters), involving a black-box, expensive objective function. The organization of international challenges spurred the development of various efficient AutoML systems, intrinsically relying on Bayesian optimization [11, 42], Monte-Carlo tree search [7] on top of a surrogate model, or their combination [36]. As said, the ability to characterize tasks (datasets, in the remainder of the paper) via vectors of metafeatures
would solve AutoML through learning the performance model. Metafeatures, expected to describe the joint distribution underlying the dataset, should also be inexpensive to compute. Particular metafeatures called landmarks [33] are given by the performance of fast ML algorithms; indeed, knowing that a decision tree reaches a given level of accuracy on a dataset gives some information on this dataset; see also [30]. Another direction is explored by [20], defining the Dataset2Vec representation. Specifically, metafeatures are extracted through solving the classification problem of whether two patches of data (subsets of examples, described according to a subset of features) are extracted from the same dataset. Metalearning [12, 50] and hyperparameter transfer learning
[31], more remotely related to the presented approach, respectively aim to find a generic model with quick adaptability to new tasks, achieved through few-shot learning, and to transfer the performance model learned for a task to another task.

Contributions.
The contribution of the paper is twofold. On the algorithmic side, a distribution-based invariant deep architecture (Dida), able to learn such metafeatures, is presented in Section 2. The challenge is that a metafeature associated to a set of samples must be invariant both under permutation of the samples and under permutation of their coordinates. Moreover, the architecture must be flexible enough to accept discrete distributions with diverse support and feature sizes. The theoretical properties of these architectures (smoothness and universality) are detailed in Section 3. A proof of concept of the merits of the approach is presented in Section 4, where the AutoML problem is restricted to the identification of the best SVM configuration on a large-size benchmark of diversified datasets.
2 Distribution-Based Invariant Networks for Metafeature Learning
This section describes our distribution-based invariant layers, mapping a point distribution to another one while respecting invariances. It details how they can be trained to perform invariant regression and achieve metafeature learning.
2.1 Invariant Functions of Discrete Distributions
Let z denote a dataset including n labelled samples z_i = (x_i, y_i), with x_i an instance and y_i the associated multi-label. With d and d′ respectively being the dimensions of the instance and label spaces, let D = d + d′. By construction, z is invariant under permutations of the sample ordering; it is thus viewed as an n-size discrete distribution in ℝ^D, namely the uniform distribution z = (1/n) ∑_i δ_{z_i}, as opposed to a point cloud. While the paper focuses on the case of discrete distributions, the approach and theoretical results also hold in the general case of continuous distributions (Appendix A).
We denote 𝒫_n(ℝ^D) the space of such n-size point distributions, and 𝒫(ℝ^D) the space of distributions of arbitrary size.
As the performance of an ML algorithm is most generally invariant w.r.t. permutations operating on the feature or label spaces, the neural architectures leveraged to learn the metafeatures must enjoy the same property. Formally, let G = S_d × S_{d′} denote the group of permutations independently operating on the feature and label spaces. For σ = (σ_x, σ_y) ∈ G, the image of a labelled sample z_i = (x_i, y_i) is defined as σ(z_i) = (σ_x(x_i), σ_y(y_i)), with σ_x permuting the coordinates of x_i and σ_y those of y_i. For simplicity and by abuse of notation, the operator mapping a distribution z to the distribution of the σ(z_i) is still denoted σ.
We denote 𝒫(Ω) the space of distributions supported on some set Ω ⊂ ℝ^D, and we assume that the domain Ω is invariant under permutations in G.
The goal of the paper is to define trainable deep architectures implementing functions f defined on 𝒫(Ω), such that these are invariant under G, i.e. f(σ(z)) = f(z) for any σ ∈ G. Such functions will be trained to define metafeatures.
2.2 Distribution-Based Invariant Layers
Taking inspiration from [6], the basic building blocks of the proposed neural architecture are extended to satisfy the feature- and label-invariance requirements.
(Distribution-based invariant layers) Let φ : ℝ^D × ℝ^D → ℝ^q be an interaction functional, assumed to be invariant in the sense that φ(σ(z), σ(z′)) = φ(z, z′) for all σ ∈ G. A distribution-based invariant layer f_φ maps an n-size distribution z = (1/n) ∑_i δ_{z_i} to

    f_φ(z) := (1/n) ∑_{i=1}^n δ_{(1/n) ∑_{j=1}^n φ(z_i, z_j)}.    (1)

It is easy to see that f_φ is invariant. The construction of such a distribution-based invariant layer is extended to arbitrary (possibly continuous) probability distributions by essentially replacing sums with integrals (Appendix A).
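For concreteness, the layer of Eq. (1) can be sketched numerically. The interaction functional phi below is an illustrative choice (not the paper's), built from coordinate-wise symmetric operations so that it is invariant under a simultaneous permutation of the coordinates of its two arguments:

```python
import numpy as np

def invariant_layer(z, phi):
    """Distribution-based layer of Eq. (1): map each point z_i of the
    uniform discrete distribution z to the mean interaction
    (1/n) * sum_j phi(z_i, z_j). Returns the support of the output
    distribution, one row per input point."""
    return np.stack([np.mean([phi(zi, zj) for zj in z], axis=0) for zi in z])

# Illustrative interaction functional: built from inner products and
# coordinate sums, hence invariant to a simultaneous permutation of
# the coordinates of both arguments.
phi = lambda a, b: np.array([np.dot(a, b), np.sum(a + b)])

z = np.random.default_rng(0).normal(size=(5, 3))
out = invariant_layer(z, phi)

# Permuting the coordinates of every point leaves the output unchanged,
# since phi is itself permutation-invariant.
perm = [2, 0, 1]
out_perm = invariant_layer(z[:, perm], phi)
assert np.allclose(out, out_perm)
```

Permuting the sample ordering only permutes the rows of the output, i.e. leaves the output distribution unchanged as well.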
(Nature of the invariance) Note that the invariance requirement on φ actually is less demanding than requiring φ(σ(z), σ′(z′)) = φ(z, z′) for any two distinct permutations σ and σ′ in G: only simultaneous permutations of both arguments are considered.
Two particular cases are when φ only depends on its first or second input: (i) if φ(z, z′) = ψ(z′), then f_φ(z) is the deterministic vector (1/n) ∑_j ψ(z_j), i.e. an expectation against the input distribution, akin to a pooling operation; (ii) if φ(z, z′) = ψ(z), then f_φ transports the input distribution via ψ, as f_φ(z) = ψ_♯ z. This operation is referred to as a push-forward.
(Spaces of arbitrary dimension) Both in practice and in theory, it is important to define layers (in particular the first one of the architecture) that can be applied to distributions on spaces of arbitrary dimensions d and d′. This can be achieved by constraining φ to aggregate coordinate-wise interactions, e.g. to be of the form

    φ(z, z′) = ψ( (1/d) ∑_{k=1}^d u(x_k, x′_k), (1/d′) ∑_{l=1}^{d′} v(y_l, y′_l) ),

where u, v and ψ are independent of (d, d′).
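A minimal sketch of such a dimension-agnostic interaction functional, with illustrative coordinate-wise maps u and v averaged over coordinates (all names are hypothetical, not the paper's implementation):

```python
import numpy as np

def make_phi(u, v, rho):
    """Build a dimension-agnostic interaction functional: u acts on
    pairs of feature coordinates, v on pairs of label coordinates, and
    their averages over coordinates are combined by rho. Averaging over
    coordinates makes the result independent of the dimensions d, d'."""
    def phi(z, zp, d):
        x, y = z[:d], z[d:]
        xp, yp = zp[:d], zp[d:]
        fx = np.mean([u(a, b) for a, b in zip(x, xp)], axis=0)
        fy = np.mean([v(a, b) for a, b in zip(y, yp)], axis=0)
        return rho(np.concatenate([fx, fy]))
    return phi

u = lambda a, b: np.array([a * b, a + b])   # illustrative coordinate map
v = lambda a, b: np.array([abs(a - b)])
phi = make_phi(u, v, np.tanh)

# The same phi applies to points of dimension 3+1 and 7+1 alike.
z3, zp3 = np.ones(3 + 1), np.zeros(3 + 1)
z7, zp7 = np.ones(7 + 1), np.zeros(7 + 1)
assert phi(z3, zp3, 3).shape == phi(z7, zp7, 7).shape
```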
(Generalization to arbitrary groups) The definition of invariant functions (and the corresponding architectures) can be generalized to an arbitrary group G operating on ℝ^D (in particular, subgroups of the permutation group). A simple way to design an invariant function is to consider f_φ where φ is invariant. In the linear case, [28], Theorem 5, shows that these types of functions are the only ones, but this is no longer true for nonlinear functions.
(Localized computation) In practice, the complexity of computing f_φ(z) can be reduced by considering only the points z_j in a neighborhood of z_i. The layer then extracts local information around each of the points.
2.3 Learning Dataset Metafeatures from Distributions
The proposed invariant regression neural architectures on point distributions (Dida) are defined as
    F_θ := f_{φ_T} ∘ … ∘ f_{φ_1},    (2)

where θ denotes the trainable parameters of the architecture (detailed below). Here each f_{φ_t} is an invariant layer, and φ_T only depends on its second argument (such that f_{φ_T}(z) should be understood as being a vector, as opposed to a distribution). Note that only the first layer is required to be invariant and dimension-agnostic for the architecture to be as well. In practice, this map, defined as in Remark 2.2, is thus learned using inputs of varying dimension as an invariant layer with φ_1(z, z′) = ρ([∑_k u(x_k, x′_k), ∑_l v(y_l, y′_l)]), where u maps pairs of feature coordinates to ℝ^p, v maps pairs of label coordinates to ℝ^{p′}, u and v are affine functions, ρ is a nonlinearity and [·,·] denotes concatenation.
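The overall pipeline (a first invariant interaction layer, a pooling layer whose interaction functional depends only on its second argument, then fully connected layers) can be sketched as follows; all maps and sizes are illustrative stand-ins, not the trained Dida model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dida_forward(z, phi, pool_psi, fc_weights):
    """Minimal sketch of an architecture of the form of Eq. (2): one
    invariant interaction layer, one pooling layer (interaction
    depending only on its second argument, so the output is a single
    vector), then fully connected layers."""
    # Layer 1: per-point mean interaction (Eq. 1).
    h = np.stack([np.mean([phi(zi, zj) for zj in z], axis=0) for zi in z])
    # Layer 2: expectation of pool_psi over the support -> vector.
    v = np.mean([pool_psi(hi) for hi in h], axis=0)
    # Fully connected head with ReLU nonlinearities.
    for A, b in fc_weights:
        v = np.maximum(A @ v + b, 0.0)
    return v

phi = lambda a, b: np.tanh(np.array([a @ b, np.sum(a), np.sum(b)]))
pool_psi = lambda h: np.concatenate([h, h ** 2])          # 6-dim
fc = [(rng.normal(size=(8, 6)), np.zeros(8)),
      (rng.normal(size=(4, 8)), np.zeros(4))]

z = rng.normal(size=(10, 5))
mf = dida_forward(z, phi, pool_psi, fc)   # 4-dim "metafeature" vector
# Invariance w.r.t. sample ordering: shuffling rows changes nothing.
assert np.allclose(mf, dida_forward(z[::-1], phi, pool_psi, fc))
```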
As the following layers f_{φ_t} (t ≥ 2) need not be invariant, they are parameterized using a pair (A_t, b_t) of (matrix, vector). The parameters of the Dida architecture are thus θ = (φ_1, (A_t, b_t)_{t≥2}). They are learned in a supervised fashion, with a loss function depending on the task at hand (see Section 4). By construction, these architectures are invariant w.r.t. the orderings of both the points composing the input distributions and their coordinates. The input distributions can be composed of any number of points in any dimension, which is a distinctive feature with respect to [28].

3 Theoretical Analysis
To get some insight into these architectures, we now detail their robustness to perturbations and their approximation abilities with respect to the convergence in law, which is the natural topology for distributions. Although we expose these contributions for discrete distributions, the results hold for arbitrary (possibly continuous) distributions (supplementary material, Appendix A).
3.1 Optimal Transport Comparison of Datasets
Point clouds vs. distributions.
It is important to note that learning from datasets, referred to as metalearning for simplicity in the sequel, requires such datasets to be seen as probability distributions, as opposed to point clouds. For instance, having the same point twice in a dataset really corresponds to doubling its mass, i.e. it should have twice as much importance as the other points. We thus argue that the natural topology to analyze metalearning methods is that of the convergence in law, which can be quantified using Wasserstein optimal transport distances. This is in sharp contrast with point cloud architectures (see for instance [34]), making use of max-pooling and relying on the Hausdorff distance to analyze the architecture properties. While this analysis is standard for low-dimensional (2D and 3D) applications in graphics and vision, it is not suitable for our purpose, because max-pooling is not a continuous operation for the topology of convergence in law.
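The discontinuity of max-pooling for the convergence in law can be checked on a one-dimensional toy example: two datasets sharing the same support but carrying different masses are indistinguishable for max-pooling, while an expectation (mean-pooling) tells them apart:

```python
import numpy as np

# Two datasets with the same support {0, 10} but different masses:
# as point clouds (sets) they are identical; as distributions they
# are not, since repeating a point doubles its mass.
a = np.array([0.0, 10.0])                       # uniform: mass 1/2 each
b = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 10.0])   # mass 5/6 on 0

# Max-pooling only sees the support, so it cannot tell them apart...
assert a.max() == b.max() == 10.0
# ...whereas mean-pooling (an expectation, continuous in law) can.
assert a.mean() == 5.0 and abs(b.mean() - 10 / 6) < 1e-12

# Shrinking the mass on 10 moves the distribution toward delta_0 in
# law; the mean follows continuously, but the max never moves and then
# jumps at the limit, hence is not continuous for convergence in law.
for n in (10, 100, 1000):
    zn = np.array([10.0] + [0.0] * (n - 1))
    assert zn.max() == 10.0 and zn.mean() == 10.0 / n
```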
Wasserstein distance.
In order to quantify the regularity of the involved functionals, we resort to the Wasserstein distance between two discrete probability distributions (referring the reader to [39, 32] for a comprehensive presentation of Wasserstein distances):

    W(z, z′) := max_{g ∈ Lip_1} ( E_z[g] − E_{z′}[g] ),

where Lip_1 is the space of 1-Lipschitz functions g : ℝ^D → ℝ. In this paper, as a probability distribution z and its permuted image σ(z) under σ ∈ G are considered to be indistinguishable, one introduces the permutation-invariant Wasserstein distance: for z, z′ ∈ 𝒫(ℝ^D),

    W̄(z, z′) := min_{σ ∈ G} W(σ(z), z′),

such that W̄(z, z′) = 0 if and only if z and z′ are equal (in the sense of probability distributions) up to feature permutations (i.e. belong to the same equivalence class, Appendix A).
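For small dimensions, the permutation-invariant Wasserstein distance between uniform discrete measures of equal size can be computed exactly: the optimal transport plan between two such measures can be taken to be a permutation of the points (a consequence of Birkhoff's theorem), so an assignment solver suffices. The sketch below brute-forces the minimum over coordinate permutations, acting on all coordinates jointly for brevity rather than separately on features and labels as in the paper:

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def w1(z, zp):
    """W1 between two uniform discrete measures with equally many
    points: solve the optimal assignment between the two supports."""
    cost = cdist(z, zp)                 # pairwise Euclidean costs
    r, c = linear_sum_assignment(cost)
    return cost[r, c].mean()

def w1_perm_invariant(z, zp):
    """Permutation-invariant Wasserstein distance: minimum of W1 over
    all permutations of the coordinates of one argument (brute force,
    only viable for small dimension d)."""
    d = z.shape[1]
    return min(w1(z[:, list(p)], zp)
               for p in itertools.permutations(range(d)))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 3))
zp = z[:, [2, 0, 1]]             # same dataset, features permuted
assert w1(z, zp) > 0             # plain W1 sees a difference
assert w1_perm_invariant(z, zp) < 1e-12   # invariant distance vanishes
```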
Lipschitz property.
In this context, a map F is continuous for the convergence in law (aka the weak-* topology of distributions, denoted ⇀) if for any sequence z_n ⇀ z, one has F(z_n) → F(z). (Note that F takes any probability distribution on ℝ^D as input; hence in particular, n-size samples belonging to 𝒫_n(ℝ^D) for any n are accepted, as well as continuous distributions, Appendix A.) The Wasserstein distance metrizes the convergence in law, in the sense that z_n ⇀ z is equivalent to W(z_n, z) → 0. Such a map is furthermore said to be C-Lipschitz for the permutation-invariant Wasserstein distance if, for all z, z′,

    ‖F(z) − F(z′)‖ ≤ C · W̄(z, z′).    (3)

Lipschitz properties enable us to analyze robustness to input perturbations, since they ensure that if the input distributions are close enough (in the permutation-invariant Wasserstein sense), the corresponding outputs are close too.
3.2 Regularity of Distribution-Based Invariant Layers
The following propositions show the robustness of invariant layers with respect to different variations of their input, assuming the following regularity condition on the interaction functional: φ is Lipschitz, i.e. for some constant C and all (z_1, z_2), (z′_1, z′_2),

    ‖φ(z_1, z_2) − φ(z′_1, z′_2)‖ ≤ C ( ‖z_1 − z′_1‖ + ‖z_2 − z′_2‖ ).    (4)

The proofs of this section are detailed in Appendix B. We first show that invariant layers are Lipschitz regular. This ensures that deep architectures of the form (2) map close inputs onto close outputs.
Secondly, we consider perturbations with respect to diffeomorphisms. This stability is important, for instance, to cope with situations where an autoencoder has been trained, so that a dataset and its encoded-decoded representation are expected to yield similar metafeatures. The following proposition shows that f_φ(z) and f_φ(g_♯ z) are indeed close if the map g is close to the identity, which is expected when using autoencoders. It also shows that, similarly, if both inputs and outputs are modified by regular deformations ψ and ξ, then the outputs are also close. For ψ and ξ two Lipschitz maps, one has, for all z and z′,

    W(ψ_♯ z, ξ_♯ z′) ≤ ‖ψ − ξ‖_∞ + Lip(ξ) · W(z, z′).
3.3 Universality of Invariant Layers
We now show that our architecture can approximate any continuous invariant map. More precisely, the following proposition shows that the combination of an invariant layer (1) and a fully-connected layer is enough to reach universal approximation capability. This statement holds for arbitrary distributions (not necessarily discrete) and for functions defined on spaces of arbitrary dimension in the sense of Remark 2.2 (assuming some a priori bound on the dimensions).

Let f be a G-invariant map on a compact Ω, continuous for the convergence in law. Then, for all ε > 0, there exist two continuous maps φ and ψ such that, for all z ∈ 𝒫(Ω),

    ‖f(z) − ψ(E_z[φ])‖ ≤ ε,

where φ is invariant and independent of f.
Proof.

We give a sketch of the proof; more details are provided in Appendix C. We consider f(z) ≈ ψ(E_z[φ]) where: (i) φ is the collection of elementary symmetric polynomials in the features and elementary symmetric polynomials in the labels, which are invariant to G; (ii) ψ is defined through a discretization of Ω on a grid; (iii) ψ applies the function f on a discretized version of z, which requires the embedding to be bijective: this is achieved through a projection on the quotient space and a restriction to its compact image. The sum in the definition of E_z[φ] computes an expectation which collects integrals over each cell of the grid, so as to approximate the measure z by a discrete counterpart ẑ. Hence ψ applies f to ẑ. Continuity is obtained as follows: (i) proximity of z and ẑ is guaranteed (see Lemma C from [6]) and gets tighter as the discretization step tends to 0; (ii) the embedding map is regular enough (Hölder, see Theorem 1.3.1 from [35]) such that, according to Lemma C, the approximation error can be upper-bounded; (iii) since Ω is compact, by the Banach-Alaoglu theorem, 𝒫(Ω) also is. Since f is continuous, it is thus uniformly weak-* continuous: choosing a discretization step small enough ensures the result. ∎
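Step (i) of the proof relies on elementary symmetric polynomials as a permutation-invariant embedding of the coordinates. This can be checked numerically (the sign convention below is one possible choice):

```python
import numpy as np

def elementary_symmetric(x):
    """Elementary symmetric polynomials e_1..e_d of the coordinates of
    x, read off (up to sign) from the coefficients of the monic
    polynomial whose roots are the coordinates. They are invariant
    under any permutation of the coordinates."""
    coeffs = np.poly(x)               # [1, -e1, +e2, -e3, ...]
    signs = (-1.0) ** np.arange(1, len(x) + 1)
    return signs * coeffs[1:]

x = np.array([2.0, -1.0, 3.0])
e = elementary_symmetric(x)
# e1 = 2 - 1 + 3 = 4, e2 = -2 + 6 - 3 = 1, e3 = 2*(-1)*3 = -6
assert np.allclose(e, [4.0, 1.0, -6.0])
# Invariance under coordinate permutation:
assert np.allclose(e, elementary_symmetric(x[[2, 0, 1]]))
```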
(Approximation by an invariant NN) A consequence of Theorem 3.3 is that any continuous invariant regression function taking (compactly supported) distributions as inputs can be approximated to arbitrary precision by an invariant neural network. This result is detailed in Appendix C and uses the following ingredients: (i) an invariant layer whose interaction functional can be approximated by an invariant network; (ii) the universal approximation theorem [5, 25]; (iii) uniform continuity to obtain uniform bounds.
(Extension to different spaces) Theorem 3.3 also extends to distributions supported on different spaces, by considering a joint embedding space of large enough dimension. This way, any invariant prediction function can (uniformly) be approximated by an invariant network, up to setting added coordinates to zero (Appendix C).
4 Learning Metafeatures: Proofs of Concept
To showcase the validity of the proposed architecture, two proofs of concept are proposed, extracting metafeatures by training Dida (code available at https://github.com/herilalaina/dida) to achieve two tasks, respectively distribution identification and performance model learning.
4.1 Experimental setting
Three benchmarks have been considered (details in supplementary material, Appendix D). Benchmarks TOY and UCI are taken from [20], respectively involving toy datasets, and 121 datasets from the UCI repository [8]. Benchmark OpenML-3D is derived from 593 datasets extracted from the OpenML repository [44], where each dataset gives rise to compressed datasets using autoencoders (each instance being replaced with its 3-dimensional image in latent space). Twenty such compressed datasets are generated for each initial OpenML dataset. Each benchmark is divided into 70%-30% training-test sets (all compressed datasets generated from a same dataset being either in the training or in the test set).
The Dida neural architecture includes 2 invariant layers followed by three fully connected layers of sizes 256, 128, 64. The first layer processes a dataset z (a finite distribution in its original dimension), yielding a distribution in dimension 10, while the second layer yields a deterministic vector in dimension 1024. The latter is processed by the fully connected architecture; the output of the last layer defines the learned metafeatures, parameterized by the Dida parameters θ (Section 2.3).
All experiments are run on 4 NVIDIA Tesla V100 SXM2 GPUs with 32 GB memory, using the Adam optimizer with batch size 32.
4.2 Task 1: Distribution Identification
The patch identification task is introduced by [20]. A dataset z_I, referred to as a patch of dataset u, is extracted by uniformly selecting a subset of samples of u, with indices in I. To each pair of patches (z, z′) (with the same number of instances) is associated the binary meta-label ℓ(z, z′), set to 1 iff z and z′ are extracted from the same initial dataset. In this case, the Dida parameters θ are trained to build the (dimension-agnostic) model minimizing the (weighted version of the) binary cross-entropy loss:

    L(θ) = − ∑_{(z,z′)} [ ℓ(z, z′) log p_θ(z, z′) + (1 − ℓ(z, z′)) log(1 − p_θ(z, z′)) ],    (5)

with p_θ(z, z′) the predicted probability that z and z′ come from the same dataset, computed from the metafeatures defined as the 64-dimensional output of the last FC layer.
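The training signal can be sketched as follows; the patch extraction is as described above, while the probability model standing in for the network head is illustrative (a hypothetical distance-based score, not the actual Dida head):

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patch(u, n_samples):
    """Uniformly select a subset of samples (rows) from dataset u."""
    idx = rng.choice(len(u), size=n_samples, replace=False)
    return u[idx]

def weighted_bce(p, y, w_pos=1.0, w_neg=1.0):
    """Weighted binary cross-entropy between predicted probabilities p
    and binary meta-labels y (class weights illustrative)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

# Two source datasets; positive pairs are patches of the same one.
u1 = rng.normal(size=(200, 4))
u2 = rng.normal(loc=3.0, size=(200, 4))
pairs = [(extract_patch(u1, 50), extract_patch(u1, 50), 1),
         (extract_patch(u1, 50), extract_patch(u2, 50), 0)]

# Hypothetical stand-in "model": probability decreasing with the
# distance between patch means.
probs = np.array([np.exp(-np.linalg.norm(a.mean(0) - b.mean(0)))
                  for a, b, _ in pairs])
labels = np.array([y for _, _, y in pairs])
loss = weighted_bce(probs, labels)
assert loss >= 0.0
```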
The Dida performance is assessed comparatively to Dataset2Vec (code available at https://github.com/hadijomaa/dataset2vec). Table 1 shows that Dida significantly outperforms Dataset2Vec on all benchmarks (columns 1-3), all the more so as the number of features in the datasets is large (as in UCI). Uncertainty estimates are obtained with a 3-fold splitting of the test set.
Method        TOY (1)        UCI (2)        OpenML (3)      OpenML (4)
Dataset2Vec   –              –              –               –
Dida          97.2% ± 0.1    89.2% ± 2.1    98.54% ± 0.9    91.57% ± 2.11
An original generalization of patch identification is defined using OpenML-3D, where the label of a pair of patches (z, z′) is thereafter set to 1 iff z and z′ are extracted from some u and u′, with u and u′ derived by autoencoder from the same initial OpenML dataset. The task difficulty is increased compared to patch identification, as patches z and z′ are now extracted from similar distributions, as opposed to the same distribution. (If the composition of the encoder and decoder modules were the identity, then the u distribution would be mapped onto the u′ distribution by composing the decoder of the AE used to generate u with the encoder of the AE used to generate u′.) Dida also significantly outperforms Dataset2Vec (Table 1, column (4)).
All experiments are conducted using 10 patches of 100 samples for each dataset. Dida computational time is ca. 2 hours on TOY and UCI, and 6 hours on OpenML-3D. Dataset2Vec hyperparameters are set to their default values, except the size and number of patches, set to the same values as in Dida.
4.3 Task 2: Performance model learning
The set of ML configurations includes 100 SVM configurations (e.g. type and hyperparameters of the kernel). For each configuration λ and dataset z, the performance F(λ, z) is the predictive accuracy of the SVM learned from z, assessed using a 90%-10% split between training and test sets; F*(z) and F_med(z) respectively denote the best and the median values of F(λ, z) for λ ranging over the configuration set. Top-k(z) is the set of the k configurations with highest accuracy on z. The goal of performance modelling is to support the a priori identification of a sufficiently good, or quasi-optimal, configuration for each z.
Dida is trained to approximate the metric induced on the OpenML-3D benchmark by the ML configurations. Let the dissimilarity of two datasets z and z′ be defined from the overlap of their best configurations:

    d(z, z′) := 1 − |Top-k(z) ∩ Top-k(z′)| / k.    (6)
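Assuming a Top-k overlap form for the dissimilarity (an illustrative reconstruction, not necessarily the paper's exact formula), the computation is straightforward:

```python
import numpy as np

def topk(perf_row, k):
    """Indices of the k configurations with highest accuracy on a
    dataset (one row of the performance matrix)."""
    return set(np.argsort(perf_row)[-k:])

def dissimilarity(perf_a, perf_b, k=10):
    """Hypothetical Top-k overlap dissimilarity between two datasets:
    1 minus the fraction of shared top-k configurations."""
    ta, tb = topk(perf_a, k), topk(perf_b, k)
    return 1.0 - len(ta & tb) / k

rng = np.random.default_rng(0)
perf = rng.uniform(size=(3, 100))     # 3 datasets x 100 SVM configs
assert dissimilarity(perf[0], perf[0]) == 0.0   # identical rankings
assert 0.0 <= dissimilarity(perf[0], perf[1]) <= 1.0
```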
Based on this dissimilarity, three clusters are defined on each benchmark, and the associated 3-class learning problem is considered, with meta-label the index of the cluster z belongs to. On top of the last invariant layer (delivering the metafeatures) are built three fully-connected layers followed by a softmax, outputting the probability for z to belong to each cluster. The Dida parameters are thus learned by classically minimizing the (weighted version of the) cross-entropy loss. On top of the metafeatures, a metric learning module is trained using the ListMLE ranking objective [49] (7), so that the Euclidean metric on the resulting representations complies with the dissimilarity d.
The merits of the metafeatures are comparatively established as follows. For each z in the benchmark, let N_k(z) denote the k-th nearest neighbor of z according to the metric defined by metafeatures MF, be they extracted by Dida, handcrafted as used in [29] or in [11], or based on landmarks [33]. Likewise, let F(z, k) denote the performance on z of the best configuration for N_k(z). The regret of the AutoML process based on MF is defined as the gap between F*(z) and the performance of the configurations thus recommended.
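The regret computation can be sketched as follows, with a hypothetical nearest-neighbor recommendation scheme (names and normalization illustrative, not the paper's exact protocol):

```python
import numpy as np

def regret(mf, perf, k=10):
    """For each dataset, find its nearest neighbor in metafeature
    space, recommend the neighbor's top-k configurations, and measure
    the accuracy gap to that dataset's true optimum."""
    n = len(mf)
    best = perf.max(axis=1)
    out = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(mf - mf[i], axis=1)
        d[i] = np.inf                       # exclude the dataset itself
        j = int(np.argmin(d))               # nearest neighbor
        rec = np.argsort(perf[j])[-k:]      # its top-k configurations
        out[i] = best[i] - perf[i, rec].max()
    return out

rng = np.random.default_rng(0)
mf = rng.normal(size=(5, 8))          # 5 datasets, 8 metafeatures
perf = rng.uniform(size=(5, 100))     # accuracies of 100 configs
r = regret(mf, perf)
assert np.all(r >= 0) and r.shape == (5,)
```

The regret is non-negative by construction, since the true optimum on a dataset dominates any recommended subset.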
Figure 1 displays the regret curve associated to the Dida metafeatures, comparatively to that of handcrafted metafeatures [29, 11], landmarks [33], or random metafeatures; the regret of the best configuration on average on the training set is displayed for comparison. Handcrafted and landmark metafeatures are normalized, then preprocessed using SVD, retaining the top 10 singular values. These regret curves establish the relevance of the proposed Dida approach; a discussion of its limitations is presented in supplementary material, Appendix D.

5 Conclusion
In this paper, we develop Dida, an architecture performing invariant regression on point distributions, invariant w.r.t. feature permutations and accommodating various data sizes, backed by theoretical guarantees of universal approximation and robustness, with natural extensions to continuous distributions.
Tackling the long-standing AutoML problem, we demonstrate the feasibility and relevance of automatically extracting metafeature vectors using Dida,
outperforming the Dataset2Vec approach [20]
and the metafeatures manually defined in the last two decades [11, 29].
The ability to pertinently situate a dataset in the landscape defined by ML algorithms paves the way to quite a few applications beyond AutoML, ranging from domain adaptation to metalearning.
6 Acknowledgements
The work of G. De Bie is supported by the Région Île-de-France. H. Rakotoarison acknowledges funding from the ADEME #1782C0034 project NEXT. The work of G. Peyré is supported by the European Research Council (ERC project NORIA).
References
 [1] (2013) Collaborative hyperparameter tuning. pp. II–199–II–207. Cited by: §1.1.
 [2] (2011) Algorithms for hyperparameter optimization. pp. 2546–2554. Cited by: §1.2.
 [3] (2016) Group equivariant convolutional networks. 48, pp. 2990–2999. Cited by: §1.2.
 [4] (2007) Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, 3/e (Undergraduate Texts in Mathematics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387356509 Cited by: item , Appendix C.
 [5] (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. Cited by: Appendix C, §1, §3.3.
 [6] (2019) Stochastic deep networks. pp. 1556–1565. Cited by: Appendix B, Appendix B, item , Appendix C, §1.2, §1.2, §1, §2.2, §3.3.
 [7] (2018) AlphaD3M: machine learning pipeline synthesis. Cited by: §1.2.
 [8] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4.1.
 [9] (2017) UCI machine learning repository. External Links: Link Cited by: §D.1.
 [10] (2019) Neural architecture search: A survey. J. Mach. Learn. Res. 20, pp. 55:1–55:21. External Links: Link Cited by: §1.1.
 [11] (2015) Efficient and robust automated machine learning. pp. 2962–2970. External Links: Link Cited by: §1.1, §1.2, §4.3, §4.3, §5.
 [12] (2018) Probabilistic modelagnostic metalearning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 9516–9527. Cited by: §1.2.
 [13] (2014) Deep symmetry networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2537–2545. Cited by: §1.2.
 [14] (2018) Deep models of interactions across sets. External Links: 1803.02879 Cited by: §1.2.
 [15] (2016) Learning population-level diffusions with generative RNNs. 48, pp. 2417–2426. Cited by: §1.2.
 [16] (2015) Deep convolutional networks on graphstructured data. CoRR abs/1506.05163. External Links: 1506.05163 Cited by: §1.
 [17] (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal processing magazine 29 (6), pp. 82–97. Cited by: §1.
 [18] (2011) Sequential modelbased optimization for general algorithm configuration. pp. 507–523. External Links: ISBN 9783642255656, Link, Document Cited by: §1.1.
 [19] F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.) (2018) Automated machine learning: methods, systems, challenges. Springer. Note: In press, available at http://automl.org/book. Cited by: §1.1, §1.2.
 [20] (2019) Dataset2Vec: learning dataset metafeatures. External Links: 1905.11063 Cited by: §D.1, §D.2, §1.1, §1.2, §4.1, §4.2, §5.
 [21] (2019) Universal invariant and equivariant graph neural networks. pp. 7090–7099. Cited by: §1.2.
 [22] (2017) Fast Bayesian optimization of machine learning hyperparameters on large datasets. 54, pp. 528–536. External Links: Link Cited by: §1.1.
 [23] (2018) On the generalization of equivariance and convolution in neural networks to the action of compact groups. External Links: 1802.03690 Cited by: §1.2.
 [24] (2012) Imagenet classification with deep convolutional neural networks. pp. 1097–1105. Cited by: §1.

 [25] (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6 (6), pp. 861–867. Cited by: Appendix C, §1, §3.3.
 [26] (2019) Invariant and equivariant graph networks. Cited by: §1.2.
 [27] (2019) On the universality of invariant networks. pp. 4363–4371. Cited by: §1.2.
 [28] (2020) On learning sets of symmetric elements. External Links: 2002.08599 Cited by: item , §1.2, §2.2, §2.3.
 [29] (2018) Instance spaces for machine learning classification. Machine Learning 107 (1), pp. 109–147. Cited by: §4.3, §4.3, §5.
 [30] (2018) Instance spaces for machine learning classification. Machine Learning 107 (1), pp. 109–147. External Links: ISSN 08856125, Document Cited by: §1.1, §1.2.
 [31] (2018) Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 6845–6855. Cited by: §1.2.
 [32] (2019) Computational optimal transport. Foundations and Trends in Machine Learning 11 (5-6), pp. 355–607. External Links: Link, Document, ISSN 19358237 Cited by: §3.1.
 [33] (2000) Metalearning by landmarking various learning algorithms. pp. 743–750. External Links: ISBN 1558607072 Cited by: §1.2, §4.3, §4.3.

 [34] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1.2, §1, §3.1.
 [35] (2002) Analytic theory of polynomials. Cited by: item , §3.3.
 [36] (2019) Automated machine learning with Monte-Carlo tree search. pp. 3296–3303. External Links: Document Cited by: §1.1, §1.2.
 [37] (2017) Equivariance through parametersharing. 70, pp. 2892–2901. Cited by: §1.2.
 [38] (1976) The algorithm selection problem.. Advances in Computers 15, pp. 65–118. Cited by: §1.1, §1.2.
 [39] (2015) Optimal transport for applied mathematicians. Birkäuser, NY. Cited by: Appendix B, §3.1.
 [40] (2019) On universal equivariant set networks. External Links: 1910.02421 Cited by: §1.2.
 [41] (1993) Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks 4 (5), pp. 816–826. External Links: Document, ISSN 19410093 Cited by: §1.
 [42] (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. pp. 847–855. Cited by: §1.2.
 [43] (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §D.1, §1.1.
 [44] (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §D.4, §4.1.
 [45] (2008) Extracting and composing robust features with denoising autoencoders. pp. 1096–1103. Cited by: §D.1.
 [46] (2016) Order matters: sequence to sequence for sets. Cited by: §1.2.
 [47] (1996) The lack of a priori distinctions between learning algorithms. Neural Computation 8 (7), pp. 1341–1390. Note: No Free Lunch for Machine Learning Cited by: §1.2.
 [48] (1996) Representation theory and invariant neural networks. Discrete Applied Mathematics 69 (1-2), pp. 33–60. Cited by: §1.2.
 [49] (2008) Listwise approach to learning to rank: theory and algorithm. pp. 1192–1199. Cited by: §4.3.
 [50] (2018) Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 7332–7342. Cited by: §1.2.
 [51] (2017) Deep sets. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3391–3401. Cited by: 2nd item, §1.2.
Supplementary
Appendix A Extension to arbitrary distributions
Overall notations.
Let $X$ denote a random vector on $\mathbb{R}^d$ with $\alpha$ its law (a positive Radon measure with unit mass). By definition, its expectation, denoted $\mathbb{E}(X)$, reads $\mathbb{E}(X) = \int_{\mathbb{R}^d} x \, \mathrm{d}\alpha(x)$, and for any continuous function $f : \mathbb{R}^d \to \mathbb{R}^q$, $\mathbb{E}(f(X)) = \int_{\mathbb{R}^d} f(x) \, \mathrm{d}\alpha(x)$. In the following, two random vectors $X$ and $X'$ with the same law are considered indistinguishable, noted $X \sim X'$. Letting $f$ denote a function on $\mathbb{R}^d$, the push-forward operator by $f$, noted $f_\sharp$, is defined as follows, for any continuous function $g$ from $\mathbb{R}^q$ to $\mathbb{R}$ ($g \in \mathcal{C}(\mathbb{R}^q)$):
$$\int g \, \mathrm{d}(f_\sharp \alpha) = \int g(f(x)) \, \mathrm{d}\alpha(x).$$
Letting $(x_1, \dots, x_n)$ be a set of points in $\mathbb{R}^d$ with weights $(w_1, \dots, w_n)$, $w_i \ge 0$, such that $\sum_i w_i = 1$, the discrete measure $\alpha = \sum_{i=1}^n w_i \delta_{x_i}$ is the sum of the Dirac measures $\delta_{x_i}$ weighted by the $w_i$.
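As a sanity check, the push-forward identity above can be verified numerically on a small discrete measure. The following is a minimal sketch; the particular measure and the maps `f` and `g` are arbitrary illustrative choices, not taken from the paper:

```python
import numpy as np

def expectation(points, weights, g):
    """Integrate g against the discrete measure sum_i w_i * delta_{x_i}."""
    return sum(w * g(x) for x, w in zip(points, weights))

# A discrete measure on R^2: three weighted Dirac masses.
points = [np.array([0.0, 1.0]), np.array([2.0, 0.0]), np.array([1.0, 1.0])]
weights = [0.5, 0.25, 0.25]

f = lambda x: x.sum()   # f: R^2 -> R
g = lambda t: t ** 2    # test function g: R -> R

# The push-forward f#alpha is the discrete measure sum_i w_i * delta_{f(x_i)},
# so integrating g against f#alpha equals integrating g(f(.)) against alpha.
lhs = expectation([f(x) for x in points], weights, g)
rhs = expectation(points, weights, lambda x: g(f(x)))
assert np.isclose(lhs, rhs)
```

For discrete measures the identity is immediate, since the push-forward simply transports each Dirac mass $\delta_{x_i}$ to $\delta_{f(x_i)}$ while keeping its weight.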
Invariances.
In this paper, we consider functions on probability measures that are invariant with respect to permutations of coordinates. Therefore, denoting $\mathfrak{S}_d$ the $d$-sized permutation group, we consider measures over a symmetrized compact $\Omega \subset \mathbb{R}^d$ equipped with the following equivalence relation: for $x, x' \in \Omega$, $x \sim x'$ iff there exists $\sigma \in \mathfrak{S}_d$ such that $x' = \sigma(x)$, so that a measure $\alpha$ and its permuted counterpart $\sigma_\sharp \alpha$ are indistinguishable in the corresponding quotient space, denoted alternatively $\mathcal{M}(\Omega/\mathfrak{S}_d)$ or $\mathcal{M}(\Omega)/\mathfrak{S}_d$. A function $f$ is said to be invariant (by permutations of coordinates) iff $f(\sigma_\sharp \alpha) = f(\alpha)$ for all $\sigma \in \mathfrak{S}_d$ (Definition 1).
Tensorization.
Letting $X$ and $Y$ respectively denote two random vectors on $\mathbb{R}^d$ and $\mathbb{R}^{d'}$, the tensor product vector $X \otimes Y$ is defined as the couple $(X', Y')$, where $X'$ and $Y'$ are independent and have the same laws as $X$ and $Y$, i.e. $\operatorname{law}(X \otimes Y) = \operatorname{law}(X) \otimes \operatorname{law}(Y)$. In the finite case, for $\alpha = \frac{1}{n}\sum_i \delta_{x_i}$ and $\beta = \frac{1}{m}\sum_j \delta_{y_j}$, then $\alpha \otimes \beta = \frac{1}{nm}\sum_{i,j} \delta_{(x_i, y_j)}$, the weighted sum of Dirac measures on all pairs $(x_i, y_j)$. The $N$-fold tensorization $X^{\otimes N}$ of a random vector $X$, with law $\alpha^{\otimes N}$, generalizes the above construction to the case of $N$ independent random variables with law $\alpha$. Tensorization will be used to define the law of datasets, and to design universal architectures (Appendix C).
Invariant layers.
In the general case, an invariant layer $f_\varphi$, with invariant map $\varphi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^q$ such that $\varphi(\sigma(x), \sigma(x')) = \varphi(x, x')$, satisfies $f_\varphi(\sigma_\sharp \alpha) = f_\varphi(\alpha)$ and is defined as
$$f_\varphi(\alpha) = \big(x \mapsto \mathbb{E}_{X \sim \alpha}[\varphi(x, X)]\big)_\sharp \alpha,$$
where the expectation is taken over $X \sim \alpha$. Note that considering the couple of independent random vectors $(x, X)$ amounts to considering the tensorized law $\alpha \otimes \alpha$.
Taking as input a discrete distribution $\alpha = \frac{1}{n}\sum_j \delta_{x_j}$, the invariant layer outputs another discrete distribution $f_\varphi(\alpha) = \frac{1}{n}\sum_i \delta_{y_i}$ with $y_i = \frac{1}{n}\sum_j \varphi(x_i, x_j)$; each input point $x_i$ is mapped onto $y_i$, summarizing the pairwise interactions of $x_i$ with the other points after $\varphi$.
Invariant layers can also be generalized to handle higher-order interaction functionals, namely $x \mapsto \mathbb{E}[\varphi(x, X_2, \dots, X_N)]$, which amounts to considering, in the discrete case, $N$-tuples of input points.
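In the discrete case, the invariant layer reduces to a pairwise computation followed by uniform averaging. The sketch below (the interaction map `phi` is a hypothetical illustrative choice, not a layer from the paper) implements $y_i = \frac{1}{n}\sum_j \varphi(x_i, x_j)$ and checks that permuting the input points merely permutes the output points, leaving the output measure unchanged:

```python
import numpy as np

def invariant_layer(points, phi):
    """Discrete invariant layer: maps the support {x_1..x_n} of a uniform
    measure to {y_1..y_n} with y_i = (1/n) * sum_j phi(x_i, x_j).
    The output represents the uniform measure on the rows of the result."""
    n = len(points)
    return np.array([sum(phi(xi, xj) for xj in points) / n for xi in points])

# Hypothetical interaction map phi: R^3 x R^3 -> R^6,
# concatenating x with the difference x - x'.
phi = lambda x, xp: np.concatenate([x, x - xp])

x = np.random.default_rng(0).normal(size=(5, 3))   # 5 points in R^3
y = invariant_layer(x, phi)
assert y.shape == (5, 6)

# Reordering the input points only reorders the output points: the sum over
# j is order-independent, so the uniform output measure is unchanged.
perm = [2, 0, 4, 1, 3]
y_perm = invariant_layer(x[perm], phi)
assert np.allclose(y_perm, y[perm])
```

The cost of one layer is $O(n^2)$ evaluations of $\varphi$; the higher-order variant with $N$-tuples scales as $O(n^N)$, which is why pairwise interactions are the common practical choice.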
Appendix B Proofs on Regularity
Wasserstein distance.
The regularity of the involved functionals is measured w.r.t. the 1-Wasserstein distance between two probability distributions $(\alpha, \beta)$:
$$\mathcal{W}_1(\alpha, \beta) \stackrel{\text{def}}{=} \min_{\pi \in \Pi(\alpha, \beta)} \int \|x - y\| \, \mathrm{d}\pi(x, y),$$
where the minimum is taken over coupling measures $\pi$ on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\alpha$ and $\beta$. $\mathcal{W}_1$ is known to be a norm [39], that can be conveniently computed using its dual form
$$\mathcal{W}_1(\alpha, \beta) = \mathcal{W}_1(\alpha - \beta) = \max_{\operatorname{Lip}(g) \le 1} \int_{\mathbb{R}^d} g \, \mathrm{d}(\alpha - \beta),$$
where $\operatorname{Lip}(g)$ is the Lipschitz constant of $g$ with respect to the Euclidean norm (unless otherwise stated). For simplicity and by abuse of notations, $\mathcal{W}_1(X, Y)$ is used instead of $\mathcal{W}_1(\alpha, \beta)$ when $X \sim \alpha$ and $Y \sim \beta$. The convergence in law, denoted $X_k \rightharpoonup X$, is equivalent to the convergence in Wasserstein distance, in the sense that $X_k \rightharpoonup X$ is equivalent to $\mathcal{W}_1(X_k, X) \to 0$.
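For two uniform discrete measures supported on the same number of points, the minimum over couplings is attained at a permutation (an extreme point of the Birkhoff polytope), so $\mathcal{W}_1$ reduces to a linear assignment problem. A minimal SciPy sketch (the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_uniform(x, y):
    """1-Wasserstein distance between the uniform measures on the rows of
    x and y (same number n of points). With uniform weights the optimal
    coupling is a permutation, recovered by linear assignment."""
    # Pairwise Euclidean costs ||x_i - y_j||, shape (n, n).
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()   # average transport cost, mass 1/n per point

x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
# Optimal matching sends (1,0) to (1,0) at cost 0 and (0,0) to (0,1) at
# cost 1, so W1 = (0 + 1) / 2 = 0.5.
assert np.isclose(w1_uniform(x, y), 0.5)
assert np.isclose(w1_uniform(x, x), 0.0)
```

For measures with non-uniform weights or different support sizes, a general-purpose OT solver (e.g. the network simplex of [32]) is needed instead of the assignment shortcut.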
Permutationinvariant Wasserstein distance.
The Wasserstein distance is quotiented according to the permutationinvariance equivalence classes: for
such that . defines a norm on .
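For small dimension $d$, the quotiented distance can be computed by brute-force minimization over the $d!$ coordinate permutations; a sketch under that assumption (helper names are ours, and the inner distance uses the assignment shortcut valid for uniform equal-size measures):

```python
from itertools import permutations

import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_uniform(x, y):
    """W1 between uniform measures on the rows of x and y (n points each)."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    r, c = linear_sum_assignment(cost)
    return cost[r, c].mean()

def w1_perm_invariant(x, y):
    """Quotiented distance: minimize W1 over the d! permutations of the
    coordinates of x. Brute force, only viable for small d."""
    d = x.shape[1]
    return min(w1_uniform(x[:, list(sigma)], y) for sigma in permutations(range(d)))

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 3))
y = x[:, [2, 0, 1]]   # the same point cloud with its coordinates permuted
# The plain distance sees two different clouds, but the quotiented distance
# recovers the coordinate permutation and vanishes.
assert np.isclose(w1_perm_invariant(x, y), 0.0)
```

The factorial cost in $d$ is precisely why the architectures of the paper build permutation invariance into the layers rather than computing such a quotiented distance explicitly.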
Lipschitz property.
A map $f : \mathcal{M}(\Omega) \to \mathcal{M}(\mathbb{R}^q)$ is continuous for the convergence in law (aka the weak$^\star$ topology of measures) if for any sequence $X_k \rightharpoonup X$, then $f(X_k) \rightharpoonup f(X)$. Such a map is furthermore said to be $C$-Lipschitz for the permutation-invariant 1-Wasserstein distance if
$$\overline{\mathcal{W}}_1(f(\alpha), f(\beta)) \le C \, \overline{\mathcal{W}}_1(\alpha, \beta). \qquad (8)$$
Lipschitz properties enable us to analyze robustness to input perturbations, since they ensure that if the input distributions of random vectors are close in the permutation-invariant Wasserstein sense, the corresponding output laws are close too.