Approximation capability of neural networks on spaces of probability measures and tree-structured domains

06/03/2019 ∙ by Tomáš Pevný, et al. ∙ 0

This paper extends the proof of density of neural networks in the space of continuous (or even measurable) functions on Euclidean spaces to functions on compact sets of probability measures. By doing so, the work parallels more than a decade old results on mean-map embedding of probability measures in reproducing kernel Hilbert spaces. The work has wide practical consequences for multi-instance learning, where it theoretically justifies some recently proposed constructions. The result is then extended to Cartesian products, yielding a universal approximation theorem for tree-structured domains, which naturally occur in data-exchange formats like JSON, XML, YAML, Avro, and ProtoBuffer. This has important practical implications, as it makes it possible to automatically create neural network architectures for processing structured data (AutoML paradigms), as demonstrated by an accompanying library for the JSON format.




1 Motivation

{"weekNumber":"39",
 "workouts":[
   {"sport":"running",
    "distance":19738,
    "duration":1500,
    "calories":375,
    "avgPace":76,
    "speedData":{
      "speed":[10,9,8],
      "altitude":[100,104,103,81],
      "labels":["0.0km","6.6km","13.2km","19.7km"]}},
   {"sport":"swimming",
    "distance":664,
    "duration":1800,
    "calories":250,
    "avgPace":2711}]}
Figure 1: Example of JSON document, adapted from

Prevalent machine learning methods assume their input to be a vector or a matrix of a fixed dimension, or a sequence, but many sources of data have the structure of a tree, imposed by data formats like JSON, XML, YAML, Avro, or ProtoBuffer (see Figure 1 for an example). While the obvious complication is that such a tree structure is harder to handle than a single variable, these formats also contain some “elementary” entries which are already difficult to handle in isolation. Besides strings, for which a plethora of conversions to real-valued vectors exists (one-hot encoding, histograms of n-gram models, word2vec [15], output of a recurrent network, etc.), the most problematic elements seem to be unordered lists (sets) of records (such as the "workouts" element and all of the subkeys of "speedData" in Figure 1), whose length can differ from sample to sample, so the classifier processing this input needs to be able to cope with this variability.

The variability exemplified above by "workouts" and "speedData" is the defining feature of multi-instance learning (MIL) problems (also called Deep Sets in [28]), where it is intuitive to define a sample as a collection of feature vectors. Although all vectors within the collection have the same dimension, their number can differ from sample to sample. In MIL nomenclature, a sample is called a bag and an individual vector an instance. The difference between sequences and bags is that the order of instances in the bag is not important, so the output of the classifier should be the same for an arbitrary permutation of the instances in the bag.

MIL was introduced in [4] as a solution to the problem of learning a classifier on instances from labels available on the level of a whole bag. To date, many approaches to solve the problem have been proposed, and the reader is referred to [1] for an excellent review and taxonomy. The setting has emerged from the assumption of a bag being considered positive if at least one instance was positive. This assumption is nowadays used for problems with weakly-labeled data [2]. While many different definitions of the problem have been introduced (see [7] for a review), this work adopts a general definition of [16], where each sample (bag) is viewed as a probability distribution observed through a set of realizations (instances) of a random variable with this distribution. Rather than working with vectors, matrices or sequences, the classifier therefore classifies probability measures.

Independent works of [28, 6] and [19] have proposed an adaptation of neural networks to MIL problems (hereinafter called MIL NN). The adaptation uses two feed-forward neural networks, where the first network takes individual instances as its input, its output is averaged element-wise over the bag, and the resulting vector describing the whole bag is sent to the second network. This simple approach yields a very general, well performing and robust algorithm, as has been reported by all three works. Since then, the MIL NN has been used in numerous applications, for example in causal reasoning, in computer vision to process point clouds [25, 27], in medicine to predict prostate cancer [12], in training generative adversarial networks [12], or to process network traffic to detect infected computers [18]. The last work has demonstrated that the MIL NN construction can be nested (using sets of sets as an input), which allows the neural network to handle data with a hierarchical structure.

The wide-spread use of neural networks is theoretically justified by their universal approximation property – the fact that any continuous function on (a compact subset of) a Euclidean space to real numbers can be approximated by a neural network with arbitrary precision [11, 14]. However, despite their good performance and increasing popularity, no general analogy of the universal approximation theorem has been proven for MIL NNs. This would require showing that MIL NNs are dense in the space of continuous functions from the space of probability measures to real numbers and – to the best of our knowledge – the only result in this direction is restricted to input domains with finite cardinality [28].

This work fills this gap by formally proving that MIL NNs with two non-linear layers, a linear output layer and mean aggregation after the first layer are dense in the space of continuous functions from the space of probability measures to real numbers (Theorem 2 and Corollary 3). In Theorem 5, the proof is extended to data with an arbitrary tree-like schema (XML, JSON, ProtoBuffer). The reasoning behind the proofs comes from kernel embedding of distributions (mean map) [21, 23] and related work on Maximum Mean Discrepancy [8]. This work can therefore be viewed as a formal adaptation of these tools to neural networks. While these results are not surprising, the authors believe that as the number of applications of NNs to MIL and tree-structured data grows, it becomes important to have a formal proof of the soundness of this approach.

The paper only contains theoretical results — for experimental comparison to prior art, the reader is referred to [28, 6, 19, 20, 25, 27, 12, 18]. However, the authors provide a proof of concept demonstration of processing JSON data at

2 Notation and summary of relevant work

This section provides background for the proposed extensions of the universal approximation theorem [11, 14]. For convenience, it also summarizes solutions to multi-instance learning problems proposed in [19, 6].

By $\mathcal{C}(K, \mathbb{R})$ we denote the space of continuous functions from $K$ to $\mathbb{R}$ endowed with the topology of uniform convergence. Recall that this topology is metrizable by the supremum metric $\|f - g\|_{\sup} = \sup_{x \in K} |f(x) - g(x)|$.

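Throughout the text, $\mathcal{X}$ will be an arbitrary metric space and $\mathcal{P}_{\mathcal{X}}$ will be some compact set of (Borel) probability measures on $\mathcal{X}$. Perhaps the most useful example of this setting is when $\mathcal{X}$ is a compact metric space and $\mathcal{P}_{\mathcal{X}} = \mathcal{P}(\mathcal{X})$ is the space of all Borel probability measures on $\mathcal{X}$. Endowing $\mathcal{P}(\mathcal{X})$ with the $w^\star$ topology turns it into a compact metric space (the metric being $\rho(p, q) = \sum_n 2^{-n} \left| \int f_n \, dp - \int f_n \, dq \right|$ for some dense subset $\{f_n \mid n \in \mathbb{N}\} \subset \mathcal{C}(\mathcal{X}, \mathbb{R})$; see for example Proposition 62 from [9]). Alternatively, one can define a metric on $\mathcal{P}(\mathcal{X})$ using for example integral probability metrics [17] or total variation. In this sense, the results presented below are general, as they are not tied to any particular topology.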

2.1 Universal approximation theorem on compact subsets of $\mathbb{R}^d$

The next definition introduces the set of affine functions forming the basis of linear and non-linear layers of neural networks.

Definition 1.

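For any $d \in \mathbb{N}$, $\mathcal{A}^d$ is the set of all affine functions on $\mathbb{R}^d$, i.e.

$$\mathcal{A}^d = \left\{ a : \mathbb{R}^d \to \mathbb{R} \;\middle|\; a(x) = w^{\mathrm{T}} x + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}. \tag{1}$$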

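The main result of [14] states that feed-forward neural networks with a single non-linear hidden layer and a linear output layer (hereinafter called $\Sigma$-networks) are dense in the space of continuous functions. Lemma 1.1 then implies that the same holds for measurable functions.

Theorem 1 (Universal approximation theorem on $\mathbb{R}^d$).

For any non-polynomial measurable function $\sigma$ on $\mathbb{R}$ and every $d \in \mathbb{N}$, the following family of functions is dense in $\mathcal{C}(\mathbb{R}^d, \mathbb{R})$:

$$\Sigma(\sigma, \mathcal{A}^d) = \left\{ f : \mathbb{R}^d \to \mathbb{R} \;\middle|\; f(x) = \sum_{i=1}^{n} \alpha_i \sigma(a_i(x)),\ n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ a_i \in \mathcal{A}^d \right\}.$$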

The key insight of the theorem is not that a single non-linear layer suffices, but that any continuous function can be approximated by neural networks. Recall that for $K \subset \mathbb{R}^d$ compact, any $f \in \mathcal{C}(K, \mathbb{R})$ can be continuously extended to $\mathbb{R}^d$, and thus the same result holds for $\mathcal{C}(K, \mathbb{R})$. Note that if $\sigma$ were a polynomial of order $k$, $\Sigma(\sigma, \mathcal{A}^d)$ would only contain polynomials of order at most $k$.
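
The remark about polynomial activations can be checked on a small worked case. The block below is only an illustration, with the specific choice of a quadratic activation and a one-dimensional input being assumptions made here, not taken from the paper: every network in the family then collapses to a single quadratic, so the family cannot be dense.

```latex
% Assumed example: \sigma(t) = t^2 and d = 1, so each a_i(x) = w_i x + b_i.
\begin{align*}
f(x) &= \sum_{i=1}^{n} \alpha_i \, \sigma(w_i x + b_i)
      = \sum_{i=1}^{n} \alpha_i \, (w_i x + b_i)^2 \\
     &= \Big( \sum_{i=1}^{n} \alpha_i w_i^2 \Big) x^2
      + \Big( 2 \sum_{i=1}^{n} \alpha_i w_i b_i \Big) x
      + \sum_{i=1}^{n} \alpha_i b_i^2 .
\end{align*}
% Whatever n, \alpha_i, w_i, b_i are, f is a polynomial of order at most 2.
```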

The following metric corresponds to the notion of convergence in measure:

Definition 2 (Def. 2.9 from [11]).

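For a Borel probability measure $\mu$ on $\mathcal{X}$, define a metric

$$\rho_\mu(f, g) = \inf \left\{ \varepsilon > 0 \;\middle|\; \mu \{ x \in \mathcal{X} \mid |f(x) - g(x)| > \varepsilon \} < \varepsilon \right\}$$

on $\mathcal{M}(\mathcal{X}, \mathbb{R})$, where $\mathcal{M}(\mathcal{X}, \mathbb{R})$ denotes the collection of all (Borel) measurable functions.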

Note that for finite $\mu$, uniform convergence implies convergence in $\rho_\mu$ [11, L. A.1]:

Lemma 1.1.

For every finite Borel measure $\mu$ on a compact $K$, $\mathcal{C}(K, \mathbb{R})$ is $\rho_\mu$-dense in $\mathcal{M}(K, \mathbb{R})$.

2.2 Multi-instance neural networks

In multi-instance learning it is assumed that a sample $\mathbf{x}$ consists of multiple vectors of a fixed dimension, i.e. $\mathbf{x} = \{x_1, \dots, x_l\}$, $x_i \in \mathbb{R}^d$. Furthermore, it is assumed that labels are provided on the level of samples $\mathbf{x}$, rather than on the level of individual instances $x_i$.

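To adapt feed-forward neural networks to MIL problems, the following construction has been proposed in [19, 6]. Assuming a mean aggregation function, the network consists of two feed-forward neural networks $\phi : \mathbb{R}^d \to \mathbb{R}^k$ and $\psi : \mathbb{R}^k \to \mathbb{R}^o$. The output of the function $f$ is calculated as follows:

$$f(\mathbf{x}) = \psi\left( \frac{1}{l} \sum_{i=1}^{l} \phi(x_i) \right), \tag{4}$$

where $d$, $k$, and $o$ are the dimensions of the input, of the output of the first neural network, and of the output, respectively. This construction also allows the use of other aggregation functions such as maximum.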
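
The construction of Equation (4) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation of [19, 6]: the tanh non-linearity, the scalar output, the layer sizes, and all names (`phi`, `psi`, `mil_forward`, `random_params`) are assumptions chosen for the example.

```python
import math
import random

def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), bias b and vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def phi(x, W, b):
    """Instance network: one non-linear layer applied to a single instance."""
    return [math.tanh(v) for v in affine(W, b, x)]

def psi(z, W, b, w_out, b_out):
    """Bag network: one non-linear layer followed by a linear (scalar) output."""
    hidden = [math.tanh(v) for v in affine(W, b, z)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def mil_forward(bag, params):
    """psi applied to the element-wise mean of phi over all instances in the bag."""
    embedded = [phi(x, params["W1"], params["b1"]) for x in bag]
    k = len(embedded[0])
    mean = [sum(e[j] for e in embedded) / len(bag) for j in range(k)]
    return psi(mean, params["W2"], params["b2"], params["w_out"], params["b_out"])

def random_params(d, k, h, seed=0):
    """Untrained random weights, used only to exercise the forward pass."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    vec = lambda n: [rng.uniform(-1, 1) for _ in range(n)]
    return {"W1": mat(k, d), "b1": vec(k), "W2": mat(h, k), "b2": vec(h),
            "w_out": vec(h), "b_out": rng.uniform(-1, 1)}

params = random_params(d=2, k=4, h=3)
bag = [[0.1, 0.5], [1.0, -0.3], [-0.7, 0.2]]
out = mil_forward(bag, params)
out_permuted = mil_forward(list(reversed(bag)), params)  # same bag, other order
```

Because the instances only enter through their mean, `out` and `out_permuted` agree up to floating-point rounding, and the same parameters accept bags of any size.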

The general definition of a MIL problem [16] adopted here views instances of a single sample $\mathbf{x}$ as realizations of a random variable with distribution $p \in \mathcal{P}_{\mathcal{X}}$, where $\mathcal{P}_{\mathcal{X}}$ is a set of probability measures on $\mathcal{X}$. This means that the sample is not a single vector but a probability distribution observed through a finite number of realizations of the corresponding random variable.

The main result of Section 3 is that the set of neural networks with (i) $\phi$ being a single non-linear layer, (ii) $\psi$ being one non-linear layer followed by a linear layer, and (iii) the aggregation function being the mean as in Equation (4) is dense in the space of continuous functions on any compact set of probability measures. Lemma 1.1 extends the result to the space of measurable functions.

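The theoretical analysis assumes functions $f : \mathcal{P}_{\mathcal{X}} \to \mathbb{R}$ of the form

$$f(p) = \psi\left( \int \phi(x) \, dp(x) \right),$$

whereas in practice $p$ can only be observed through a finite set of observations $\mathbf{x} = \{x_i \sim p \mid i \in \{1, \dots, l\}\}$. This might seem to be a discrepancy, but the sample $\mathbf{x}$ can be interpreted as a mixture of Dirac probability measures $p_{\mathbf{x}} = \frac{1}{l} \sum_{i=1}^{l} \delta_{x_i}$. By definition of $p_{\mathbf{x}}$, we immediately get

$$\int \phi(x) \, dp_{\mathbf{x}}(x) = \frac{1}{l} \sum_{i=1}^{l} \phi(x_i),$$

from which it is easy to recover Equation (4). Since $p_{\mathbf{x}}$ approaches $p$ as $l$ increases, $f(\mathbf{x})$ can be seen as an estimate of $f(p)$.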

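Indeed, if the non-linearities in the neural networks implementing the functions $\phi$ and $\psi$ are continuous, the function $\phi$ is bounded and from Hoeffding’s inequality [10] it follows that $P\left( |f(p) - f(\mathbf{x})| \ge t \right) \le 2 \exp(-c \, t^2 l)$ for some constant $c > 0$.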
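
The convergence of the sample mean to the integral can be observed numerically. The sketch below is an illustration under assumptions made here: instances are drawn uniformly from [0, 1], the feature function is the square (so the exact integral is 1/3), and the bound used is the textbook two-sided Hoeffding inequality for the inner mean only, not for the full network output.

```python
import math
import random

def empirical_embedding(sample, phi):
    """Integral of phi with respect to the empirical measure, i.e. the sample mean."""
    return sum(phi(x) for x in sample) / len(sample)

def hoeffding_bound(t, l, lo, hi):
    """Upper bound on P(|mean - expectation| >= t) for l i.i.d. values in [lo, hi]."""
    return 2.0 * math.exp(-2.0 * l * t * t / (hi - lo) ** 2)

rng = random.Random(0)
phi = lambda x: x * x      # bounded on [0, 1]; exact integral under U(0, 1) is 1/3
exact = 1.0 / 3.0

errors = {}
for l in (10, 1000, 100000):
    sample = [rng.random() for _ in range(l)]
    errors[l] = abs(empirical_embedding(sample, phi) - exact)

# Probability that a bag of 100000 instances misses the integral by 0.01 or more.
bound = hoeffding_bound(t=0.01, l=100000, lo=0.0, hi=1.0)
```

The bound decays exponentially in the bag size, which is the behaviour the text appeals to when treating the finite-sample output as an estimate of the output on the underlying distribution.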

3 Universal approximation theorem for probability spaces

To extend Theorem 1 to spaces of probability measures, the following definition introduces the set of functions which represent the layer that embeds probability measures into $\mathbb{R}$.

Definition 3.

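For any $\mathcal{X}$ and set of functions $\mathcal{F} \subset \{f : \mathcal{X} \to \mathbb{R}\}$ we define $\mathcal{A}_{\mathcal{F}}$ as

$$\mathcal{A}_{\mathcal{F}} = \left\{ f : \mathcal{P}_{\mathcal{X}} \to \mathbb{R} \;\middle|\; f(p) = b + \sum_{i=1}^{m} w_i \int f_i(x) \, dp(x),\ m \in \mathbb{N},\ w_i, b \in \mathbb{R},\ f_i \in \mathcal{F} \right\}.$$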

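$\mathcal{A}_{\mathcal{F}}$ can be viewed as an analogy of the affine functions defined by Equation (1) in the context of probability measures $\mathcal{P}_{\mathcal{X}}$ on $\mathcal{X}$.

Let $\mathcal{X} \subset \mathbb{R}^d$ and suppose that $\mathcal{F}$ only contains the basic projections $\pi_i : x \mapsto x_i$. If $\mathcal{P}_{\mathcal{X}} = \{\delta_x \mid x \in \mathcal{X}\}$ is the set of Dirac measures, then $\mathcal{A}_{\mathcal{F}}$ coincides with $\mathcal{A}^d$.

Using $\mathcal{A}_{\mathcal{F}}$, the following definition extends the $\Sigma$-networks from Theorem 1 to probability spaces.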

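Definition 4 ($\Sigma$-networks).

For any set of functions $\mathcal{F} \subset \{f : \mathcal{X} \to \mathbb{R}\}$ and a measurable function $\sigma : \mathbb{R} \to \mathbb{R}$, let $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ be the class of functions $f : \mathcal{P}_{\mathcal{X}} \to \mathbb{R}$

$$\Sigma(\sigma, \mathcal{A}_{\mathcal{F}}) = \left\{ f(p) = \sum_{i=1}^{n} \alpha_i \sigma(a_i(p)) \;\middle|\; n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ a_i \in \mathcal{A}_{\mathcal{F}} \right\}.$$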

The main theorem of this work can now be presented. As illustrated in a corollary below, when applied to $\mathcal{F} = \Sigma(\sigma, \mathcal{A}^d)$ it states that three-layer neural networks, where the first two layers are non-linear and interposed with an integration (average) layer, allow arbitrarily precise approximations of continuous functions on $\mathcal{P}_{\mathcal{X}}$. (In other words, this class of networks is dense in $\mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$.)

Theorem 2.

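Let $\mathcal{P}_{\mathcal{X}}$ be a compact set of Borel probability measures on a metric space $\mathcal{X}$, $\mathcal{F}$ be a set of continuous functions dense in $\mathcal{C}(\mathcal{X}, \mathbb{R})$, and finally $\sigma : \mathbb{R} \to \mathbb{R}$ be a measurable non-polynomial function. Then the set of functions $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is dense in $\mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$.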

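Using Lemma 1.1, an immediate corollary is that a similar result holds for measurable functions:

Corollary 1 (Density of MIL NN in $\mathcal{M}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$).

Under the assumptions of Theorem 2, $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is $\rho_\mu$-dense in $\mathcal{M}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$ for any finite Borel measure $\mu$ on $\mathcal{P}_{\mathcal{X}}$.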

The proof of Theorem 2 is similar to the proof of Theorem 2.4 from [11]. One of the ingredients of the proof is the classical Stone-Weierstrass theorem [24]. Recall that a collection of functions is an algebra if it is closed under multiplication and linear combinations.

Stone-Weierstrass Theorem.

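Let $\mathcal{A} \subset \mathcal{C}(K, \mathbb{R})$ be an algebra of functions on a compact $K$. If

  1. $\mathcal{A}$ separates points in $K$: $(\forall x, y \in K,\ x \neq y)(\exists f \in \mathcal{A}) : f(x) \neq f(y)$, and

  2. $\mathcal{A}$ vanishes at no point of $K$: $(\forall x \in K)(\exists f \in \mathcal{A}) : f(x) \neq 0$,

then the uniform closure of $\mathcal{A}$ is equal to $\mathcal{C}(K, \mathbb{R})$.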

Since $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is not closed under multiplication, we cannot apply the SW theorem directly. Instead, we first prove the density of the class of $\Sigma\Pi$ networks (Theorem 3), which does form an algebra, and then we extend the result to $\Sigma$-networks.

Theorem 3.

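Let $\mathcal{P}_{\mathcal{X}}$ be a compact set of Borel probability measures on a metric space $\mathcal{X}$, and $\mathcal{F}$ be a dense subset of $\mathcal{C}(\mathcal{X}, \mathbb{R})$. Then the following set of functions is dense in $\mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$:

$$\Sigma\Pi(\mathcal{F}) = \left\{ f : \mathcal{P}_{\mathcal{X}} \to \mathbb{R} \;\middle|\; f(p) = \sum_{i=1}^{n} \alpha_i \prod_{j=1}^{l_i} \int f_{ij} \, dp,\ n, l_i \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ f_{ij} \in \mathcal{F} \right\}.$$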

The proof shall use the following immediate corollary of Lemma 9.3.2 from [5].

Lemma 3.1 (Lemma 9.3.2 of [5]).

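Let $(\mathcal{X}, \rho)$ be a metric space and let $p$ and $q$ be two Borel probability measures on $\mathcal{X}$. If $p \neq q$, then we have $\int f \, dp \neq \int f \, dq$ for some $f \in \mathcal{C}(\mathcal{X}, \mathbb{R})$.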

Proof of Theorem 3.

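Since $\Sigma\Pi(\mathcal{F})$ is clearly an algebra of continuous functions on $\mathcal{P}_{\mathcal{X}}$, it suffices to verify the assumptions of the SW theorem (separation and non-vanishing properties).

(i) Separation: Let $p_1, p_2 \in \mathcal{P}_{\mathcal{X}}$ be distinct. By Lemma 3.1 there is some $\varepsilon > 0$ and $f \in \mathcal{C}(\mathcal{X}, \mathbb{R})$ such that $\left| \int f \, dp_1 - \int f \, dp_2 \right| = 3\varepsilon$. Since $\mathcal{F}$ is dense in $\mathcal{C}(\mathcal{X}, \mathbb{R})$, there exists $g \in \mathcal{F}$ such that $\sup_{x \in \mathcal{X}} |f(x) - g(x)| < \varepsilon$. Using the triangle inequality yields

$$3\varepsilon = \left| \int f \, dp_1 - \int f \, dp_2 \right| \le \left| \int (f - g) \, dp_1 \right| + \left| \int g \, dp_1 - \int g \, dp_2 \right| + \left| \int (g - f) \, dp_2 \right| < 2\varepsilon + \left| \int g \, dp_1 - \int g \, dp_2 \right|.$$

Denoting $f_g(p) = \int g \, dp$, it is trivial to see that $f_g \in \Sigma\Pi(\mathcal{F})$. It follows that $|f_g(p_1) - f_g(p_2)| > \varepsilon > 0$, implying that $\Sigma\Pi(\mathcal{F})$ separates the points of $\mathcal{P}_{\mathcal{X}}$.

(ii) Non-vanishing: Let $p \in \mathcal{P}_{\mathcal{X}}$. Choose $f \in \mathcal{C}(\mathcal{X}, \mathbb{R})$ such that $f(x) = 1$ for all $x \in \mathcal{X}$. Since $\mathcal{F}$ is dense in $\mathcal{C}(\mathcal{X}, \mathbb{R})$, there exists $g \in \mathcal{F}$ such that $\sup_{x \in \mathcal{X}} |f(x) - g(x)| \le \frac{1}{2}$. Since $\int |f - g| \, dp \le \frac{1}{2}$, we get

$$\int g \, dp = \int f \, dp - \int (f - g) \, dp \ge 1 - \frac{1}{2} = \frac{1}{2}.$$

Denote $f_g(q) = \int g \, dq$ for $q \in \mathcal{P}_{\mathcal{X}}$. It follows that $f_g \in \Sigma\Pi(\mathcal{F})$ and $f_g(p) \ge \frac{1}{2} > 0$, and hence $\Sigma\Pi(\mathcal{F})$ vanishes at no point of $\mathcal{P}_{\mathcal{X}}$.

Since the assumptions of the SW theorem are satisfied, $\Sigma\Pi(\mathcal{F})$ is dense in $\mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$. ∎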
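
The separation and non-vanishing properties can be made concrete on two finitely supported measures. The example below is a toy case chosen here, not one from the paper: the two measures share the same mean, so the identity function does not tell them apart, yet another continuous function does, as Lemma 3.1 guarantees. The `{point: weight}` representation and the helper name `integrate` are illustrative assumptions.

```python
def integrate(f, measure):
    """Integral of f with respect to a finitely supported measure {point: weight}."""
    return sum(weight * f(x) for x, weight in measure.items())

p1 = {-1.0: 0.5, 1.0: 0.5}   # uniform measure on the two points -1 and 1
p2 = {0.0: 1.0}              # Dirac measure at 0

identity = lambda x: x
square = lambda x: x * x
one = lambda x: 1.0

# The identity yields the same integral for both measures ...
same_mean = integrate(identity, p1) == integrate(identity, p2)
# ... but the square separates them (integrals 1 and 0).
separated = integrate(square, p1) != integrate(square, p2)
# The constant function integrates to 1 under any probability measure.
non_vanishing = integrate(one, p1)
```

A dense family of feature functions contains arbitrarily good approximations of both the square and the constant function, which is what the proof exploits.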

The following simple lemma will be useful in proving Theorem 2.

Lemma 3.2.

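If $\mathcal{G}$ is dense in $\mathcal{C}(\mathcal{Y}, \mathbb{R})$, then for any $h : \mathcal{X} \to \mathcal{Y}$, the collection of functions $\{g \circ h \mid g \in \mathcal{G}\}$ is dense in $\{\varphi \circ h \mid \varphi \in \mathcal{C}(\mathcal{Y}, \mathbb{R})\}$.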


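Proof. Let $\varphi \in \mathcal{C}(\mathcal{Y}, \mathbb{R})$, $\varepsilon > 0$, and $g \in \mathcal{G}$ be such that $\sup_{y \in \mathcal{Y}} |g(y) - \varphi(y)| \le \varepsilon$. Then we have

$$\sup_{x \in \mathcal{X}} |(\varphi \circ h)(x) - (g \circ h)(x)| = \sup_{x \in \mathcal{X}} |\varphi(h(x)) - g(h(x))| \le \sup_{y \in \mathcal{Y}} |\varphi(y) - g(y)| \le \varepsilon,$$

which proves the lemma. ∎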

Proof of Theorem 2.

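Theorem 2 is a consequence of Theorem 3 and $\Sigma$-networks being dense in $\mathcal{C}(\mathbb{R}^k, \mathbb{R})$ for any $k \in \mathbb{N}$.

Let $\mathcal{P}_{\mathcal{X}}$, $\mathcal{F}$, and $\sigma$ be as in the assumptions of the theorem. Let $f^* \in \mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$ and fix $\varepsilon > 0$. Then, there exists $f \in \Sigma\Pi(\mathcal{F})$ such that $\max_{p \in \mathcal{P}_{\mathcal{X}}} |f(p) - f^*(p)| \le \frac{\varepsilon}{2}$. This function is of the form

$$f(p) = \sum_{i=1}^{n} \alpha_i \prod_{j=1}^{l_i} \int f_{ij} \, dp$$

for some $\alpha_i \in \mathbb{R}$ and $f_{ij} \in \mathcal{F}$. Moreover, $f$ can be written as a composition $f = g \circ h$, where

$$h : p \in \mathcal{P}_{\mathcal{X}} \mapsto \left( \int f_{11} \, dp, \int f_{12} \, dp, \dots, \int f_{n l_n} \, dp \right), \tag{9}$$

$$g : (x_{11}, x_{12}, \dots, x_{n l_n}) \in \mathbb{R}^k \mapsto \sum_{i=1}^{n} \alpha_i \prod_{j=1}^{l_i} x_{ij}.$$

Denoting $k = \sum_{i=1}^{n} l_i$, we identify the range of $h$ and the domain of $g$ with $\mathbb{R}^k$.

Since $g$ is clearly continuous and $\Sigma(\sigma, \mathcal{A}^k)$ is dense in $\mathcal{C}(\mathbb{R}^k, \mathbb{R})$ (by Theorem 1), there exists $\tilde{g} \in \Sigma(\sigma, \mathcal{A}^k)$ such that $\max_{y \in h(\mathcal{P}_{\mathcal{X}})} |g(y) - \tilde{g}(y)| \le \frac{\varepsilon}{2}$. It follows that $\tilde{f} = \tilde{g} \circ h$ satisfies

$$\max_{p \in \mathcal{P}_{\mathcal{X}}} |f^*(p) - \tilde{f}(p)| \le \max_{p \in \mathcal{P}_{\mathcal{X}}} |f^*(p) - f(p)| + \max_{p \in \mathcal{P}_{\mathcal{X}}} |g(h(p)) - \tilde{g}(h(p))| \le \varepsilon.$$

Since $\tilde{g} \in \Sigma(\sigma, \mathcal{A}^k)$, it is easy to see that $\tilde{f}$ belongs to $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$, which concludes the proof. ∎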
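
The decomposition used in the proof, an embedding of the measure into a Euclidean space followed by a polynomial, can be illustrated with a familiar functional. The sketch below is an example chosen here, not taken from the paper: the variance of a measure on the real line is a sum of products of integrals, with feature functions and coefficients as assumed in the code. All names are hypothetical.

```python
def integrate(f, measure):
    """Integral of f with respect to a finitely supported measure {point: weight}."""
    return sum(w * f(x) for x, w in measure.items())

def h(measure, features):
    """Embed a measure into a Euclidean space: one integral per feature function."""
    return [[integrate(f, measure) for f in row] for row in features]

def g(y, alphas):
    """Polynomial sum_i alpha_i * prod_j y_ij acting on the embedded vector."""
    total = 0.0
    for alpha, row in zip(alphas, y):
        prod = 1.0
        for v in row:
            prod *= v
        total += alpha * prod
    return total

# variance(p) = 1 * int x^2 dp  -  1 * (int x dp) * (int x dp)
features = [[lambda x: x * x], [lambda x: x, lambda x: x]]
alphas = [1.0, -1.0]

def variance(measure):
    """The composition g(h(p)) for the feature functions chosen above."""
    return g(h(measure, features), alphas)

var_two_points = variance({-1.0: 0.5, 1.0: 0.5})
var_dirac = variance({2.0: 1.0})
```

In the proof, the polynomial is subsequently replaced by a single-hidden-layer network that approximates it on the compact image of the embedding.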

The function $h$ in the above construction (Equation (9)) can be seen as a feature extraction layer embedding the space of probability measures into a Euclidean space. It is similar to a mean-map [21, 23], a well-established paradigm in kernel machines, in the sense that it characterizes a class of probability measures but, unlike mean-map, only in parts where positive and negative samples differ.

4 Universal approximation theorem for product spaces

The next result is the extension of the universal approximation theorem to product spaces, which naturally occur in structured data. A motivating example is a sample consisting of a real vector, a set of vectors, and another set of vectors.

Theorem 4.

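Let $\mathcal{X}_1 \times \dots \times \mathcal{X}_l$ be a Cartesian product of metric compacts, $\mathcal{F}_i$, $i = 1, \dots, l$, be dense subsets of $\mathcal{C}(\mathcal{X}_i, \mathbb{R})$, and $\sigma : \mathbb{R} \to \mathbb{R}$ be a measurable function which is not an algebraic polynomial. Then $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}_1 \times \dots \times \mathcal{F}_l})$ is dense in $\mathcal{C}(\mathcal{X}_1 \times \dots \times \mathcal{X}_l, \mathbb{R})$, where

$$\Sigma(\sigma, \mathcal{A}_{\mathcal{F}_1 \times \dots \times \mathcal{F}_l}) = \left\{ f(x_1, \dots, x_l) = \sum_{i=1}^{n} \alpha_i \sigma\left( b_i + \sum_{j=1}^{l} \sum_{r=1}^{m} w_{ijr} f_{jr}(x_j) \right) \;\middle|\; n, m \in \mathbb{N},\ \alpha_i, b_i, w_{ijr} \in \mathbb{R},\ f_{jr} \in \mathcal{F}_j \right\}.$$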

The theorem is general in the sense that it covers cases where some $\mathcal{X}_i$ are compact sets of probability measures as defined in Section 2, some are subsets of Euclidean spaces, and others can be general compact spaces for which the corresponding sets of continuous functions $\mathcal{F}_i$ are dense in $\mathcal{C}(\mathcal{X}_i, \mathbb{R})$.

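The theorem is a simple consequence of the following corollary of the Stone-Weierstrass theorem.

Corollary 2.

For $\mathcal{X}_1$ and $\mathcal{X}_2$ compact, the following set of functions is dense in $\mathcal{C}(\mathcal{X}_1 \times \mathcal{X}_2, \mathbb{R})$:

$$\left\{ f : \mathcal{X}_1 \times \mathcal{X}_2 \to \mathbb{R} \;\middle|\; f(x_1, x_2) = \sum_{i=1}^{n} f_i(x_1) \, g_i(x_2),\ n \in \mathbb{N},\ f_i \in \mathcal{C}(\mathcal{X}_1, \mathbb{R}),\ g_i \in \mathcal{C}(\mathcal{X}_2, \mathbb{R}) \right\}.$$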

Proof of Theorem 4.

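The proof is technically similar to the proof of Theorem 2. Specifically, let $f^*$ be a continuous function on $\mathcal{X}_1 \times \dots \times \mathcal{X}_l$ and $\varepsilon > 0$. By the aforementioned corollary of the SW theorem (applied repeatedly) and the density of each $\mathcal{F}_j$, there are some $f_{ij} \in \mathcal{F}_j$ such that

$$\max_{(x_1, \dots, x_l)} \left| f^*(x_1, \dots, x_l) - \sum_{i=1}^{n} \prod_{j=1}^{l} f_{ij}(x_j) \right| \le \frac{\varepsilon}{2}.$$

Again, the above function can be written as a composition of two functions

$$h : (x_1, \dots, x_l) \mapsto \left( f_{11}(x_1), f_{12}(x_2), \dots, f_{nl}(x_l) \right) \in \mathbb{R}^{nl},$$

$$g : (y_{11}, y_{12}, \dots, y_{nl}) \in \mathbb{R}^{nl} \mapsto \sum_{i=1}^{n} \prod_{j=1}^{l} y_{ij}.$$

Since $g$ is continuous, Theorem 1 can be applied to obtain a function $\tilde{g}$ of the form $\tilde{g}(y) = \sum_{i} \alpha_i \sigma(a_i(y))$ for some $\alpha_i \in \mathbb{R}$ and $a_i \in \mathcal{A}^{nl}$, which approximates $g$ on the compact image of $h$ with error at most $\frac{\varepsilon}{2}$. Applying Lemma 3.2 to $\mathcal{G} = \Sigma(\sigma, \mathcal{A}^{nl})$ and $h$ concludes the proof.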
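
For a sample made of a real vector and two bags, the product-space construction amounts to embedding each component separately, concatenating the results, and feeding them to an ordinary single-hidden-layer network. The sketch below assumes scalar instances, two hand-picked feature functions standing in for elements of the dense sets, tanh as the non-linearity, and fixed untrained weights; every name in it is hypothetical.

```python
import math

def mean_embed(bag, feature_fns):
    """Embed a bag as the vector of sample means of each feature function."""
    return [sum(f(x) for x in bag) / len(bag) for f in feature_fns]

# Hypothetical feature functions standing in for elements of the dense sets.
FEATURES = [math.tanh, lambda x: math.tanh(2.0 * x - 1.0)]

def embed_sample(sample):
    """Map (vector, bag_a, bag_b) to one concatenated Euclidean vector."""
    vector, bag_a, bag_b = sample
    return list(vector) + mean_embed(bag_a, FEATURES) + mean_embed(bag_b, FEATURES)

def sigma_network(y, alphas, weights, biases):
    """Single hidden layer: sum_i alpha_i * tanh(b_i + <w_i, y>)."""
    return sum(a * math.tanh(b + sum(w * v for w, v in zip(ws, y)))
               for a, ws, b in zip(alphas, weights, biases))

def product_network(sample, alphas, weights, biases):
    """Composition of the component-wise embedding and the hidden layer."""
    return sigma_network(embed_sample(sample), alphas, weights, biases)

alphas = [0.7, -1.2]
weights = [[0.1, -0.2, 0.3, 0.4, -0.5, 0.6], [0.5, 0.5, -0.1, 0.2, 0.3, -0.4]]
biases = [0.0, 0.1]

sample = ([1.0, -2.0], [0.3, 0.9, -0.4], [1.5, 0.2])
shuffled = ([1.0, -2.0], [-0.4, 0.3, 0.9], [0.2, 1.5])   # bags reordered

y_sample = product_network(sample, alphas, weights, biases)
y_shuffled = product_network(shuffled, alphas, weights, biases)
```

The output is unchanged (up to rounding) when instances are reordered inside a bag, while the position of each component in the product is fixed by the schema.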

5 Multi-instance learning and tree structured data

The following corollary of Theorem 2 justifies the embedding paradigm of [28, 6, 19] to MIL problems:

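Corollary 3 (Density of MIL NN in $\mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$).

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$ and $\mathcal{P}_{\mathcal{X}}$ a compact set of probability measures on $\mathcal{X}$. Then any function $f \in \mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$ can be arbitrarily closely approximated by a three-layer neural network composed of two non-linear layers with an integral (mean) aggregation layer between them, and a linear output layer.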

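If $\mathcal{F}$ in Theorem 2 is set to all feed-forward networks with a single non-linear layer (that is, when $\mathcal{F} = \Sigma(\sigma, \mathcal{A}^d)$), then the theorem says that for every $f^* \in \mathcal{C}(\mathcal{P}_{\mathcal{X}}, \mathbb{R})$ and $\varepsilon > 0$, there is some $\tilde{f} \in \Sigma(\varrho, \mathcal{A}_{\mathcal{F}})$ such that $\max_{p \in \mathcal{P}_{\mathcal{X}}} |f^*(p) - \tilde{f}(p)| < \varepsilon$. This $\tilde{f}$ can be written as

$$\tilde{f}(p) = W_1 \varrho\left( W_2 \int W_3 \sigma(W_4 x) \, dp(x) \right),$$

where for brevity the bias vectors are omitted, $\sigma$ and $\varrho$ are applied element-wise, and $W_1, \dots, W_4$ are matrices of appropriate sizes. Since the integral in the middle is linear with respect to the matrix-vector multiplication, $W_2$ and $W_3$ can be replaced by a single matrix $W_{23} = W_2 W_3$, which proves the corollary:

$$\tilde{f}(p) = W_1 \varrho\left( W_{23} \int \sigma(W_4 x) \, dp(x) \right).$$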
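
The merging of the two matrices around the aggregation can be verified numerically on an empirical measure, where the integral is a plain average. In the sketch below the matrices and the hidden activations are arbitrary values made up for the check, and the helper names are assumptions.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def mean(vectors):
    """Element-wise average of a list of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

W2 = [[1.0, 2.0], [0.0, -1.0]]
W3 = [[0.5, 0.0, 1.0], [1.0, -1.0, 0.0]]
# Stand-ins for the hidden activations of three instances in one bag.
hidden = [[0.2, -0.4, 0.9], [0.7, 0.1, -0.3], [-0.5, 0.6, 0.0]]

# Linear map per instance, then average, then a second linear map ...
two_layers = matvec(W2, mean([matvec(W3, v) for v in hidden]))
# ... equals a single merged linear map applied to the averaged activations.
one_layer = matvec(matmul(W2, W3), mean(hidden))

max_gap = max(abs(a - b) for a, b in zip(two_layers, one_layer))
```

The two results differ only by floating-point rounding, so the linear layer before the aggregation adds no expressive power once a linear layer follows it.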

Since Theorem 2 does not impose any special conditions on $\mathcal{X}$ except being a compact metric space, and on $\mathcal{F}$ except being a set of continuous functions uniformly dense in $\mathcal{C}(\mathcal{X}, \mathbb{R})$, the theorem can be used as an induction step and the construction can be repeated.

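For example, consider a compact set of probability measures $\mathcal{P}_{\mathcal{P}_{\mathcal{X}}}$ on a set $\mathcal{P}_{\mathcal{X}}$. Then the space of neural networks with four layers is dense in $\mathcal{C}(\mathcal{P}_{\mathcal{P}_{\mathcal{X}}}, \mathbb{R})$. The network consists of three non-linear layers with an integration (mean) layer between each consecutive pair, and a last layer which is linear.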

The above induction is summarized in the following theorem.

Theorem 5.

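Let $\mathcal{S}$ be the class of spaces which (i) contains all compact subsets of $\mathbb{R}^d$, $d \in \mathbb{N}$, (ii) is closed under finite Cartesian products, and (iii) for each $\mathcal{X} \in \mathcal{S}$ we have $\mathcal{P}(\mathcal{X}) \in \mathcal{S}$ (here we assume that $\mathcal{P}(\mathcal{X})$ is endowed with the metric $\rho$ from Section 2). Then for each $\mathcal{X} \in \mathcal{S}$, every continuous function on $\mathcal{X}$ can be arbitrarily well approximated by neural networks.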

By Lemma 1.1, an analogous result holds for measurable functions.


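Proof. It suffices to show that $\mathcal{S}$ is contained in the class $\mathcal{W}$ of all compact metric spaces $W$ for which functions realized by neural networks are dense in $\mathcal{C}(W, \mathbb{R})$. By Theorem 1, $\mathcal{W}$ satisfies (i). The properties (ii) and (iii) hold for $\mathcal{W}$ by Theorems 4 and 2. It follows that $\mathcal{S} \subset \mathcal{W}$. ∎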
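
The inductive structure of the theorem maps directly onto a recursive traversal of a JSON-like document: numbers are Euclidean leaves, objects are Cartesian products, and arrays are bags aggregated by the mean. The sketch below is not the accompanying library; it is a parameter-free illustration that assumes a numeric-only document with a fixed schema and non-empty arrays, and uses a bare tanh where a trained model would place learned affine layers.

```python
import math

def embed(node):
    """Recursively map a JSON-like value to a fixed-length list of floats."""
    if isinstance(node, (int, float)):
        return [float(node)]                    # leaf: a point of a Euclidean space
    if isinstance(node, dict):                  # object: Cartesian product of children
        out = []
        for key in sorted(node):                # fixed key order given by the schema
            out.extend(embed(node[key]))
        return out
    if isinstance(node, list):                  # array: bag of instances, mean-aggregated
        children = [[math.tanh(v) for v in embed(child)] for child in node]
        return [sum(col) / len(children) for col in zip(*children)]
    raise TypeError("unsupported node type: " + type(node).__name__)

doc = {"week": 39,
       "workouts": [{"distance": 19.7, "speed": [10, 9, 8]},
                    {"distance": 0.66, "speed": [2, 3]}]}

# Same document with the workouts listed in the opposite order.
doc_reordered = {"week": 39, "workouts": list(reversed(doc["workouts"]))}

vec = embed(doc)
vec_reordered = embed(doc_reordered)
```

The embedding has the same length regardless of how many workouts or speed readings a document contains, and it does not depend on the order of array elements, so a standard feed-forward network can be applied on top of it.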

6 Related Work

Works most similar to this one are on kernel mean embedding [21, 23], showing that a probability measure can be uniquely embedded into high-dimensional space using characteristic kernel. Kernel mean embedding is widely used in Maximum Mean Discrepancy [8] and in Support Measure Machines [16, 3], and is to our knowledge the only algorithm with proven approximation capabilities comparable to the present work. Unfortunately its worst-case complexity of where is the number of bags and is the average size of a bag, prevents it from scaling to problems above thousands of bags.

The MIL problem has also been studied in [26], which proposes to use an LSTM network augmented by memory. The reduction from sets to vectors is indirect, computed as a weighted average over elements in an associative memory. The aggregation tackled here is therefore an integral part of that architecture. The paper lacks any approximation guarantees.

Problems where input data has a tree structure naturally occur in language models, where they are typically solved by recurrent neural networks [13, 22]. The difference from these models is that there the tree is typically binary and all leaves are homogeneous in the sense that either each of them is a vector representation of a word or each of them is a vector representation of an internal node. By contrast, here it is assumed that the tree can have an arbitrary number of heterogeneous leaves following a certain fixed schema.

Due to lack of space, the authors cannot list all works on MIL. The reader is instead invited to look at the excellent overview in [1] and the works listed in the introductory part of this paper.

7 Conclusion

This work has been motivated by recently proposed solutions to multi-instance learning [28, 19, 6] and by mean-map embedding of probability measures [23]. It generalizes the universal approximation theorem of neural networks to compact sets of probability measures over compact subsets of Euclidean spaces. Therefore, it can be seen as an adaptation of the mean-map framework to the world of neural networks, which is important for comparing probability measures and for multi-instance learning, and it proves the soundness of the constructions of [19, 6].

The universal approximation theorem is extended to inputs with a tree schema (structure) which, being the basis of many data exchange formats like JSON, XML, ProtoBuffer, Avro, etc., are nowadays ubiquitous. This theoretically justifies applications of (MIL) neural networks in this setting.

As the presented proof relies on the Stone-Weierstrass theorem, it restricts non-linear functions in neural networks to be continuous in all but the last non-linear layer. Although this does not have an impact on practical applications (all commonly used non-linear functions within neural networks are continuous), it would be interesting to generalize the result to non-continuous non-linearities, as has been done for feed-forward neural networks in [14].


  • [1] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artif. Intell., 201:81–105, August 2013.
  • [2] Alessandro Bergamo and Lorenzo Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in neural information processing systems, pages 181–189, 2010.
  • [3] Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 406–414. Curran Associates, Inc., 2010.
  • [4] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1):31–71, 1997.
  • [5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.
  • [6] Harrison Edwards and Amos Storkey. Towards a Neural Statistician. 2 2017.
  • [7] James Foulds and Eibe Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(01):1–25, 2010.
  • [8] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012.
  • [9] Petr Habala, Petr Hájek, and Václav Zizler. Introduction to Banach spaces. Matfyzpress, vydavatelství Matematicko-fyzikální fakulty Univerzity Karlovy, 1996.
  • [10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • [11] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251 – 257, 1991.
  • [12] Nathan Ing, Jakub M Tomczak, Eric Miller, Isla P Garraway, Max Welling, Beatrice S Knudsen, and Arkadiusz Gertych. A deep multiple instance model to predict prostate cancer metastasis from nuclear morphology. 2018.
  • [13] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2096–2104. Curran Associates, Inc., 2014.
  • [14] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861 – 867, 1993.
  • [15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [16] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 10–18. Curran Associates, Inc., 2012.
  • [17] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
  • [18] Tomas Pevný and Petr Somol. Discriminative models for multi-instance problems with tree structure. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec ’16, pages 83–91, New York, NY, USA, 2016. ACM.
  • [19] Tomáš Pevný and Petr Somol. Using neural network formalism to solve multiple-instance problems. In Fengyu Cong, Andrew Leung, and Qinglai Wei, editors, Advances in Neural Networks - ISNN 2017, pages 135–142, Cham, 2017. Springer International Publishing.
  • [20] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
  • [21] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
  • [22] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  • [23] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert Lanckriet, and Bernhard Schölkopf. Injective hilbert space embeddings of probability measures. 2008.
  • [24] M. H. Stone. The generalized weierstrass approximation theorem. Mathematics Magazine, 21(4):167–184, 1948.
  • [25] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
  • [26] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2016.
  • [27] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.
  • [28] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.