1 Motivation
Prevalent machine learning methods assume their input to be a vector or a matrix of a fixed dimension, or a sequence, but many sources of data have the structure of a tree, imposed by data formats like JSON, XML, YAML, Avro, or ProtoBuffer (see Figure 1 for an example). While the obvious complication is that such a tree structure is more complicated than having a single variable, these formats also contain some "elementary" entries which are already difficult to handle in isolation. Besides strings, for which a plethora of conversions to real-valued vectors exists (one-hot encoding, histograms of n-gram models, word2vec [15], the output of a recurrent network, etc.), the most problematic elements seem to be unordered lists (sets) of records (such as the "workouts" element and all of the subkeys of "speedData" in Figure 1), whose length can differ from sample to sample, so the classifier processing this input needs to be able to cope with this variability.
The variability exemplified above by "workouts" and "speedData" is the defining feature of multi-instance learning (MIL) problems (also called Deep Sets in [28]), where it is intuitive to define a sample as a collection of feature vectors. Although all vectors within the collection have the same dimension, their number can differ from sample to sample. In MIL nomenclature, a sample is called a bag and an individual vector an instance. The difference between sequences and bags is that the order of instances in the bag is not important, and the output of the classifier should be the same for an arbitrary permutation of the instances in the bag.
MIL was introduced in [4] as a solution to the problem of learning a classifier on instances from labels available only at the level of a whole bag. To date, many approaches to solving the problem have been proposed, and the reader is referred to [1] for an excellent review and taxonomy. The setting emerged from the assumption that a bag is positive if at least one of its instances is positive; this assumption is nowadays used for problems with weakly-labeled data [2]. While many different definitions of the problem have been introduced (see [7] for a review), this work adopts the general definition of [16], where each sample (bag) is viewed as a probability distribution observed through a set of realizations (instances) of a random variable with this distribution. Rather than working with vectors, matrices or sequences, the classifier therefore classifies probability measures.

Independent works [28, 6] and [19]
have proposed an adaptation of neural networks to MIL problems (hereinafter called MIL NNs). The adaptation uses two feed-forward neural networks, where the first network takes individual instances as its input, its outputs are element-wise averaged, and the resulting vector describing the whole bag is passed to the second network. This simple approach yields a very general, well-performing and robust algorithm, as reported by all three works. Since then, the MIL NN has been used in numerous applications, for example in causal reasoning [20], in computer vision to process point clouds [25, 27], in medicine to predict prostate cancer [12], in training generative adversarial networks [12], or to process network traffic to detect infected computers [18]. The last work has demonstrated that the MIL NN construction can be nested (using sets of sets as input), which allows the neural network to handle data with a hierarchical structure.

The widespread use of neural networks is theoretically justified by their universal approximation property: the fact that any continuous function on (a compact subset of) a Euclidean space to real numbers can be approximated by a neural network with arbitrary precision [11, 14]. However, despite their good performance and increasing popularity, no general analogy of the universal approximation theorem has been proven for MIL NNs. This would require showing that MIL NNs are dense in the space of continuous functions from the space of probability measures to real numbers, and, to the best of our knowledge, the only result in this direction is restricted to input domains of finite cardinality [28].
This work fills this gap by formally proving that MIL NNs with two nonlinear layers, a linear output layer, and mean aggregation after the first layer are dense in the space of continuous functions from the space of probability measures to real numbers (Theorem 2 and Corollary 3). In Theorem 5, the proof is extended to data with an arbitrary tree-like schema (XML, JSON, ProtoBuffer). The reasoning behind the proofs comes from the kernel embedding of distributions (mean map) [21, 23] and the related work on Maximum Mean Discrepancy [8]. This work can therefore be viewed as a formal adaptation of these tools to neural networks. While these results are not surprising, the authors believe that, as the number of applications of NNs to MIL and tree-structured data grows, it becomes important to have a formal proof of the soundness of this approach.
The paper only contains theoretical results — for experimental comparison to prior art, the reader is referred to [28, 6, 19, 20, 25, 27, 12, 18]. However, the authors provide a proof of concept demonstration of processing JSON data at https://codeocean.com/capsule/182df5258417441f80ef4d3c02fea970/?ID=f4d3be809b14466c87c45dfabbaccd32.
2 Notation and summary of relevant work
This section provides background for the proposed extensions of the universal approximation theorem [11, 14]. For convenience, it also summarizes the solutions to multi-instance learning problems proposed in [19, 6].
By $C(\mathcal{X}, \mathbb{R})$ we denote the space of continuous functions from $\mathcal{X}$ to $\mathbb{R}$, endowed with the topology of uniform convergence. Recall that this topology is metrizable by the supremum metric $d(f, g) = \sup_{x \in \mathcal{X}} |f(x) - g(x)|$.
Throughout the text, $\mathcal{X}$ will be an arbitrary metric space and $\mathcal{P}^{\mathcal{X}}$ will be some compact set of (Borel) probability measures on $\mathcal{X}$. Perhaps the most useful example of this setting is when $\mathcal{X}$ is a compact metric space and $\mathcal{P}^{\mathcal{X}}$ is the space of all Borel probability measures on $\mathcal{X}$. Endowing $\mathcal{P}^{\mathcal{X}}$ with the $w^{*}$ topology turns it into a compact metric space (the metric being $d(p, q) = \sum_{k} 2^{-k} \left| \int f_k \,\mathrm{d}p - \int f_k \,\mathrm{d}q \right|$ for some dense subset $\{f_k \mid k \in \mathbb{N}\}$ of the unit ball of $C(\mathcal{X}, \mathbb{R})$ — see for example Proposition 62 from [9]). Alternatively, one can define a metric on $\mathcal{P}^{\mathcal{X}}$ using, for example, integral probability metrics [17] or total variation. In this sense, the results presented below are general, as they are not tied to any particular topology.
2.1 Universal approximation theorem on compact subsets of $\mathbb{R}^d$
The next definition introduces the set of affine functions forming the basis of the linear and nonlinear layers of neural networks.
Definition 1.
For any $d \in \mathbb{N}$, $\mathcal{A}_d$ is the set of all affine functions on $\mathbb{R}^d$, i.e.
$$\mathcal{A}_d = \left\{ a: \mathbb{R}^d \to \mathbb{R} \,\middle|\, a(x) = w^{T} x + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}. \tag{1}$$
The main result of [14] states that feed-forward neural networks with a single nonlinear hidden layer and a linear output layer (hereinafter called $\Sigma$-networks) are dense in the space of continuous functions. Lemma 1.1 below then implies that the same holds for measurable functions.
Theorem 1 (Universal approximation theorem on $\mathbb{R}^d$).
For any non-polynomial measurable function $\sigma$ on $\mathbb{R}$ and every $d \in \mathbb{N}$, the following family of functions is dense in $C(\mathbb{R}^d, \mathbb{R})$ (with respect to uniform convergence on compact sets):
$$\Sigma(\sigma, \mathcal{A}_d) = \left\{ x \mapsto \sum_{i=1}^{n} w_i \, \sigma(a_i(x)) \,\middle|\, n \in \mathbb{N},\ w_i \in \mathbb{R},\ a_i \in \mathcal{A}_d \right\}. \tag{2}$$
The key insight of the theorem is not that a single nonlinear layer suffices, but the fact that any continuous function can be approximated by neural networks. Recall that for $K \subset \mathbb{R}^d$ compact, any $f \in C(K, \mathbb{R})$ can be continuously extended to $\mathbb{R}^d$, and thus the same result holds for $C(K, \mathbb{R})$. Note that if $\sigma$ were a polynomial of order $k$, $\Sigma(\sigma, \mathcal{A}_d)$ would only contain polynomials of order at most $k$.
The following metric corresponds to the notion of convergence in measure:
Definition 2 (Def. 2.9 from [11]).
For a Borel probability measure $\mu$ on $\mathcal{X}$, define the metric
$$\rho_\mu(f, g) = \inf \left\{ \varepsilon > 0 \,\middle|\, \mu\left( \{ x \in \mathcal{X} \mid |f(x) - g(x)| > \varepsilon \} \right) < \varepsilon \right\} \tag{3}$$
on $M(\mathcal{X}, \mathbb{R})$, where $M(\mathcal{X}, \mathbb{R})$ denotes the collection of all (Borel) measurable functions $\mathcal{X} \to \mathbb{R}$.
Note that for finite $\mu$, uniform convergence implies convergence in $\rho_\mu$ [11, L. A.1]:
Lemma 1.1.
For every finite Borel measure $\mu$ on a compact $\mathcal{X}$, $C(\mathcal{X}, \mathbb{R})$ is dense in $(M(\mathcal{X}, \mathbb{R}), \rho_\mu)$.
2.2 Multi-instance neural networks
In multi-instance learning it is assumed that a sample $x$ consists of multiple vectors of a fixed dimension, i.e. $x = \{x_1, \ldots, x_k\}$, $x_i \in \mathbb{R}^d$. Furthermore, it is assumed that labels are provided on the level of samples $x$, rather than on the level of individual instances $x_i$.
To adapt feed-forward neural networks to MIL problems, the following construction has been proposed in [19, 6]. Assuming a mean aggregation function, the network consists of two feed-forward neural networks $\phi: \mathbb{R}^d \to \mathbb{R}^m$ and $\psi: \mathbb{R}^m \to \mathbb{R}^o$. The output of the function $f$ on a sample $x = \{x_1, \ldots, x_k\}$ is calculated as follows:
$$f(x) = \psi\!\left( \frac{1}{k} \sum_{i=1}^{k} \phi(x_i) \right), \tag{4}$$
where $d$ is the dimension of the input, $m$ the dimension of the output of the first neural network, and $o$ the dimension of the output. This construction also allows the use of other aggregation functions, such as the maximum.
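The construction of Equation (4) can be sketched in a few lines of NumPy. The weights, dimensions, and ReLU nonlinearities below are illustrative assumptions, not values from the paper; the point is only the shape of the computation and its permutation invariance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights: phi maps instances in R^3 to R^8,
# psi maps the aggregated vector in R^8 to a scalar output.
W_phi, b_phi = rng.normal(size=(8, 3)), rng.normal(size=8)
W_psi, b_psi = rng.normal(size=(1, 8)), rng.normal(size=1)

def relu(z):
    return np.maximum(z, 0.0)

def phi(x):
    """Instance network: one nonlinear layer."""
    return relu(W_phi @ x + b_phi)

def psi(z):
    """Bag network applied to the aggregated vector."""
    return W_psi @ z + b_psi

def mil_nn(bag, aggregate=np.mean):
    """Equation (4): f(x) = psi( mean_i phi(x_i) )."""
    h = np.stack([phi(x_i) for x_i in bag])
    return psi(aggregate(h, axis=0))

bag = rng.normal(size=(5, 3))   # a bag of 5 instances in R^3
out = mil_nn(bag)
# The output is invariant to permutations of the instances,
# because the mean does not depend on their order.
assert np.allclose(out, mil_nn(bag[::-1]))
```

Swapping `np.mean` for `np.max` in the `aggregate` argument yields the maximum-aggregation variant mentioned above without changing anything else in the construction.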
The general definition of a MIL problem [16] adopted here views the instances $x_i$ of a single sample $x$ as realizations of a random variable with distribution $p \in \mathcal{P}^{\mathcal{X}}$, where $\mathcal{P}^{\mathcal{X}}$ is a set of probability measures on $\mathcal{X}$. This means that the sample is not a single vector but a probability distribution observed through a finite number of realizations of the corresponding random variable.
The main result of Section 3 is that the set of neural networks with (i) $\phi$ being a single nonlinear layer, (ii) $\psi$ being one nonlinear layer followed by a linear layer, and (iii) the mean as the aggregation function, as in Equation (4), is dense in the space of continuous functions on any compact set of probability measures. Lemma 1.1 extends the result to the space of measurable functions.
The theoretical analysis assumes functions of the form
$$f(p) = \psi\!\left( \mathbb{E}_{x \sim p}[\phi(x)] \right), \tag{5}$$
whereas in practice $p$ can only be observed through a finite set of observations $x = \{x_i \sim p \mid i = 1, \ldots, k\}$. This might seem like a discrepancy, but the sample $x$ can be interpreted as a mixture of Dirac probability measures $\tilde{p}_x = \frac{1}{k} \sum_{i=1}^{k} \delta_{x_i}$. By the definition of $\tilde{p}_x$, we immediately get
$$f(\tilde{p}_x) = \psi\!\left( \int_{\mathcal{X}} \phi(a) \,\mathrm{d}\tilde{p}_x(a) \right) = \psi\!\left( \frac{1}{k} \sum_{i=1}^{k} \phi(x_i) \right),$$
from which it is easy to recover Equation (4). Since $\tilde{p}_x$ approaches $p$ as $k$ increases, $f(\tilde{p}_x)$ can be seen as an estimate of $f(p)$.
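This convergence can be illustrated numerically. The following sketch uses an assumed setup (a bounded instance map given by a fixed random tanh layer, and a standard normal distribution in the role of $p$) and compares the empirical average over a bag with a large-sample estimate of the expectation:

```python
import numpy as np

rng = np.random.default_rng(1)

# A bounded instance map phi: R^2 -> R^4 (tanh keeps it bounded,
# which is the assumption behind the concentration bound below).
W, b = rng.normal(size=(4, 2)), rng.normal(size=4)
phi = lambda X: np.tanh(X @ W.T + b)

# Reference value: E_p[phi(x)] estimated from a very large sample.
reference = phi(rng.normal(size=(1_000_000, 2))).mean(axis=0)

# Empirical embeddings of bags of growing size k.
errors = []
for k in [10, 1_000, 100_000]:
    bag = rng.normal(size=(k, 2))
    errors.append(np.linalg.norm(phi(bag).mean(axis=0) - reference))

# The error shrinks (roughly as 1/sqrt(k)) as the bag size grows.
assert errors[0] > errors[-1]
```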
Indeed, if the nonlinearities in the neural networks implementing $\phi$ and $\psi$ are continuous, the function $\phi$ is bounded on compact $\mathcal{X}$, and from Hoeffding's inequality [10] it follows that $P\!\left( \left\| \frac{1}{k} \sum_{i=1}^{k} \phi(x_i) - \mathbb{E}_{x \sim p}[\phi(x)] \right\| \geq \varepsilon \right) \leq 2 e^{-c k \varepsilon^2}$ for some constant $c > 0$.

3 Universal approximation theorem for probability spaces
To extend Theorem 1 to spaces of probability measures, the following definition introduces the set of functions which represent the layer that embeds probability measures into a Euclidean space.
Definition 3.
For any $\mathcal{X}$ and any set of functions $\mathcal{F} \subset C(\mathcal{X}, \mathbb{R})$, we define $\mathcal{A}_{\mathcal{F}}$ as
$$\mathcal{A}_{\mathcal{F}} = \left\{ p \mapsto b + \sum_{i=1}^{n} w_i \, \mathbb{E}_{x \sim p}[f_i(x)] \,\middle|\, n \in \mathbb{N},\ w_i, b \in \mathbb{R},\ f_i \in \mathcal{F} \right\}. \tag{6}$$
$\mathcal{A}_{\mathcal{F}}$ can be viewed as an analogy of the affine functions defined by Equation (1) in the context of probability measures on $\mathcal{X}$.
Remark.
Let $\mathcal{X} = \mathbb{R}^d$ and suppose that $\mathcal{F}$ only contains the basic projections $\pi_i: x \in \mathbb{R}^d \mapsto x_i \in \mathbb{R}$. If $\mathcal{P}^{\mathcal{X}}$ is the set of Dirac measures, then $\mathcal{A}_{\mathcal{F}}$ coincides with $\mathcal{A}_d$ from Equation (1) (under the identification of $x \in \mathbb{R}^d$ with the Dirac measure $\delta_x$).
Using $\mathcal{A}_{\mathcal{F}}$, the following definition extends the $\Sigma$-networks from Theorem 1 to probability spaces.
Definition 4 ($\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ networks).
For any set of functions $\mathcal{F} \subset C(\mathcal{X}, \mathbb{R})$ and a measurable function $\sigma: \mathbb{R} \to \mathbb{R}$, let $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ be the class of functions
$$\Sigma(\sigma, \mathcal{A}_{\mathcal{F}}) = \left\{ p \mapsto \sum_{i=1}^{n} w_i \, \sigma(a_i(p)) \,\middle|\, n \in \mathbb{N},\ w_i \in \mathbb{R},\ a_i \in \mathcal{A}_{\mathcal{F}} \right\}. \tag{7}$$
The main theorem of this work can now be presented. As illustrated in a corollary below, when applied to $\mathcal{X} \subset \mathbb{R}^d$ it states that three-layer neural networks, where the first two layers are nonlinear and interposed with an integration (average) layer, allow arbitrarily precise approximations of continuous functions on $\mathcal{P}^{\mathcal{X}}$. (In other words, this class of networks is dense in $C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$.)
Theorem 2.
Let $\mathcal{P}^{\mathcal{X}}$ be a compact set of Borel probability measures on a metric space $\mathcal{X}$, let $\mathcal{F}$ be a set of continuous functions dense in $C(\mathcal{X}, \mathbb{R})$, and finally let $\sigma: \mathbb{R} \to \mathbb{R}$ be a measurable non-polynomial function. Then the set of functions $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is dense in $C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$.
Using Lemma 1.1, an immediate corollary is that a similar result holds for measurable functions:
Corollary 1 (Density of MIL NNs in $M(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$).
Under the assumptions of Theorem 2, $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is dense in $(M(\mathcal{P}^{\mathcal{X}}, \mathbb{R}), \rho_\mu)$ for any finite Borel measure $\mu$ on $\mathcal{P}^{\mathcal{X}}$.
The proof of Theorem 2 is similar to the proof of Theorem 2.4 from [11]. One of the ingredients of the proof is the classical Stone-Weierstrass theorem [24]. Recall that a collection of functions is an algebra if it is closed under multiplication and linear combinations.
Stone-Weierstrass Theorem.
Let $\mathcal{A} \subset C(K, \mathbb{R})$ be an algebra of functions on a compact $K$. If

(i) $\mathcal{A}$ separates points in $K$: $(\forall x, y \in K,\ x \neq y)(\exists f \in \mathcal{A}): f(x) \neq f(y)$, and

(ii) $\mathcal{A}$ vanishes at no point of $K$: $(\forall x \in K)(\exists f \in \mathcal{A}): f(x) \neq 0$,
then the uniform closure of $\mathcal{A}$ is equal to $C(K, \mathbb{R})$.
Since $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ is not closed under multiplication, we cannot apply the SW theorem directly. Instead, we first prove the density of the class of product networks $\mathcal{A}^{\Pi}_{\mathcal{F}}$ (Theorem 3), which does form an algebra, and then we extend the result to $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ networks.
Theorem 3.
Let $\mathcal{P}^{\mathcal{X}}$ be a compact set of Borel probability measures on a metric space $\mathcal{X}$, and let $\mathcal{F}$ be a dense subset of $C(\mathcal{X}, \mathbb{R})$. Then the following set of functions is dense in $C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$:
$$\mathcal{A}^{\Pi}_{\mathcal{F}} = \left\{ p \mapsto \sum_{i=1}^{n} w_i \prod_{j=1}^{l_i} \mathbb{E}_{x \sim p}\left[f_{ij}(x)\right] \,\middle|\, n, l_i \in \mathbb{N},\ w_i \in \mathbb{R},\ f_{ij} \in \mathcal{F} \right\}.$$
The proof shall use the following immediate corollary of Lemma 9.3.2 from [5].
Lemma 3.1 (Lemma 9.3.2 of [5]).
Let $(\mathcal{X}, d)$ be a metric space and let $p$ and $q$ be two Borel probability measures on $\mathcal{X}$. If $p \neq q$, then we have $\int f \,\mathrm{d}p \neq \int f \,\mathrm{d}q$ for some bounded continuous $f$.
Proof of Theorem 3.
Since $\mathcal{A}^{\Pi}_{\mathcal{F}}$ is clearly an algebra of continuous functions on $\mathcal{P}^{\mathcal{X}}$, it suffices to verify the assumptions of the SW theorem (the separation and non-vanishing properties).
(i) Separation: Let $p, q \in \mathcal{P}^{\mathcal{X}}$ be distinct. By Lemma 3.1, there is some $f \in C(\mathcal{X}, \mathbb{R})$ and $\varepsilon > 0$ such that $\left| \int f \,\mathrm{d}p - \int f \,\mathrm{d}q \right| = \varepsilon$. Since $\mathcal{F}$ is dense in $C(\mathcal{X}, \mathbb{R})$, there exists $\tilde{f} \in \mathcal{F}$ such that $\| f - \tilde{f} \|_\infty < \varepsilon / 4$. Using the triangle inequality yields
$$\left| \int \tilde{f} \,\mathrm{d}p - \int \tilde{f} \,\mathrm{d}q \right| \geq \left| \int f \,\mathrm{d}p - \int f \,\mathrm{d}q \right| - \left| \int f \,\mathrm{d}p - \int \tilde{f} \,\mathrm{d}p \right| - \left| \int f \,\mathrm{d}q - \int \tilde{f} \,\mathrm{d}q \right| \geq \varepsilon - \tfrac{\varepsilon}{4} - \tfrac{\varepsilon}{4} = \tfrac{\varepsilon}{2} > 0.$$
Denoting $F: r \mapsto \mathbb{E}_{x \sim r}[\tilde{f}(x)]$, it is trivial to see that $F \in \mathcal{A}^{\Pi}_{\mathcal{F}}$. It follows that $F(p) \neq F(q)$, implying that $\mathcal{A}^{\Pi}_{\mathcal{F}}$ separates the points of $\mathcal{P}^{\mathcal{X}}$.
(ii) Non-vanishing: Let $p \in \mathcal{P}^{\mathcal{X}}$. Choose $f \in C(\mathcal{X}, \mathbb{R})$ such that $\int f \,\mathrm{d}p = 1$ (for example, the constant function $f \equiv 1$). Since $\mathcal{F}$ is dense in $C(\mathcal{X}, \mathbb{R})$, there exists $\tilde{f} \in \mathcal{F}$ such that $\| f - \tilde{f} \|_\infty < \tfrac{1}{2}$. Since $\left| \int f \,\mathrm{d}p - \int \tilde{f} \,\mathrm{d}p \right| \leq \| f - \tilde{f} \|_\infty < \tfrac{1}{2}$, we get $\int \tilde{f} \,\mathrm{d}p \geq \tfrac{1}{2} > 0$.
Denote $F: r \mapsto \mathbb{E}_{x \sim r}[\tilde{f}(x)]$. It follows that $F \in \mathcal{A}^{\Pi}_{\mathcal{F}}$ and $F(p) \neq 0$, and hence $\mathcal{A}^{\Pi}_{\mathcal{F}}$ vanishes at no point of $\mathcal{P}^{\mathcal{X}}$.
Since the assumptions of the SW theorem are satisfied, $\mathcal{A}^{\Pi}_{\mathcal{F}}$ is dense in $C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$. ∎
The following simple lemma will be useful in proving Theorem 2.
Lemma 3.2.
If $\mathcal{G}$ is dense in $C(Y, \mathbb{R})$, then for any continuous $\varphi: X \to Y$, the collection of functions $\{ g \circ \varphi \mid g \in \mathcal{G} \}$ is dense in $\{ h \circ \varphi \mid h \in C(Y, \mathbb{R}) \}$.
Proof.
Let $h \in C(Y, \mathbb{R})$, $\varepsilon > 0$, and let $g \in \mathcal{G}$ be such that $\| h - g \|_\infty \leq \varepsilon$. Then we have
$$\| h \circ \varphi - g \circ \varphi \|_\infty = \sup_{x \in X} | h(\varphi(x)) - g(\varphi(x)) | \leq \sup_{y \in Y} | h(y) - g(y) | \leq \varepsilon, \tag{8}$$
which proves the lemma. ∎
Proof of Theorem 2.
Let $\mathcal{P}^{\mathcal{X}}$, $\mathcal{F}$ and $\sigma$ be as in the assumptions of the theorem. Let $h \in C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$ and fix $\varepsilon > 0$. Then, by Theorem 3, there exists $g \in \mathcal{A}^{\Pi}_{\mathcal{F}}$ such that $\| h - g \|_\infty < \varepsilon / 2$. This function is of the form
$$g(p) = \sum_{i=1}^{n} w_i \prod_{j=1}^{l_i} \mathbb{E}_{x \sim p}[f_{ij}(x)]$$
for some $w_i \in \mathbb{R}$ and $f_{ij} \in \mathcal{F}$. Moreover, $g$ can be written as a composition $g = \kappa \circ \varphi$, where
$$\varphi: p \mapsto \left( \mathbb{E}_p[f_{11}], \ldots, \mathbb{E}_p[f_{1 l_1}], \ldots, \mathbb{E}_p[f_{n 1}], \ldots, \mathbb{E}_p[f_{n l_n}] \right), \tag{9}$$
$$\kappa: (y_{11}, \ldots, y_{n l_n}) \mapsto \sum_{i=1}^{n} w_i \prod_{j=1}^{l_i} y_{ij}. \tag{10}$$
Denoting $s = \sum_{i=1}^{n} l_i$, we identify the range of $\varphi$ and the domain of $\kappa$ with $\mathbb{R}^s$.
Since $\varphi$ is clearly continuous and $\Sigma(\sigma, \mathcal{A}_s)$ is dense in $C(\mathbb{R}^s, \mathbb{R})$ (by Theorem 1), there exists $\tilde{\kappa} \in \Sigma(\sigma, \mathcal{A}_s)$ such that $\| \tilde{\kappa} - \kappa \|_\infty < \varepsilon / 2$ on the compact set $\varphi(\mathcal{P}^{\mathcal{X}})$. By Lemma 3.2, it follows that $\tilde{g} = \tilde{\kappa} \circ \varphi$ satisfies $\| \tilde{g} - g \|_\infty < \varepsilon / 2$, and hence $\| \tilde{g} - h \|_\infty < \varepsilon$.
Since each $a \in \mathcal{A}_s$ composed with $\varphi$ is an affine function of the expectations $\mathbb{E}_p[f_{ij}]$ and thus belongs to $\mathcal{A}_{\mathcal{F}}$, it is easy to see that $\tilde{g}$ belongs to $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$, which concludes the proof. ∎
The function $\varphi$ in the above construction (Equation (9)) can be seen as a feature-extraction layer embedding the space of probability measures into a Euclidean space. It is similar to a mean map [21, 23], a well-established paradigm in kernel machines, in the sense that it characterizes a class of probability measures but, unlike the mean map, only in the parts where positive and negative samples differ.

4 Universal approximation theorem for product spaces
The next result is the extension of the universal approximation theorem to product spaces, which naturally occur in structured data. The motivation here is, for example, a sample consisting of a real vector together with two sets of vectors of possibly different dimensions.
Theorem 4.
Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_l$ be a Cartesian product of metric compacts, let $\mathcal{F}_i$, $i = 1, \ldots, l$, be dense subsets of $C(\mathcal{X}_i, \mathbb{R})$, and let $\sigma: \mathbb{R} \to \mathbb{R}$ be a measurable function which is not an algebraic polynomial. Then $\Sigma(\sigma, \mathcal{A}_{\mathcal{F}_1 \times \cdots \times \mathcal{F}_l})$ is dense in $C(\mathcal{X}, \mathbb{R})$, where $\mathcal{A}_{\mathcal{F}_1 \times \cdots \times \mathcal{F}_l}$ denotes all functions of the form $x = (x_1, \ldots, x_l) \mapsto b + \sum_{j=1}^{l} \sum_{i=1}^{n_j} w_{ij} f_{ij}(x_j)$ with $n_j \in \mathbb{N}$, $w_{ij}, b \in \mathbb{R}$ and $f_{ij} \in \mathcal{F}_j$.
The theorem is general in the sense that it covers cases where some $\mathcal{X}_i$ are compact sets of probability measures as defined in Section 2, some are subsets of Euclidean spaces, and others can be general compact spaces for which the corresponding sets of continuous functions $\mathcal{F}_i$ are dense in $C(\mathcal{X}_i, \mathbb{R})$.
The theorem is a simple consequence of the following corollary of the Stone-Weierstrass theorem.
Corollary 2.
For $\mathcal{X}_1, \ldots, \mathcal{X}_l$ compact, the following set of functions is dense in $C(\mathcal{X}_1 \times \cdots \times \mathcal{X}_l, \mathbb{R})$:
$$\left\{ x \mapsto \sum_{i=1}^{n} w_i \prod_{j=1}^{l} f_{ij}(x_j) \,\middle|\, n \in \mathbb{N},\ w_i \in \mathbb{R},\ f_{ij} \in C(\mathcal{X}_j, \mathbb{R}) \right\}.$$
Proof of Theorem 4.
The proof is technically similar to the proof of Theorem 2. Specifically, let $h$ be a continuous function on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_l$ and $\varepsilon > 0$. By the aforementioned corollary of the SW theorem, there are some $w_i \in \mathbb{R}$ and $f_{ij} \in C(\mathcal{X}_j, \mathbb{R})$ such that $g(x) = \sum_{i=1}^{n} w_i \prod_{j=1}^{l} f_{ij}(x_j)$ satisfies $\| h - g \|_\infty \leq \varepsilon / 2$ (using the density of the $\mathcal{F}_j$, the $f_{ij}$ can moreover be taken from $\mathcal{F}_j$).
Again, the above function can be written as a composition of two functions,
$$\varphi: x \mapsto \left( f_{ij}(x_j) \right)_{i=1, \ldots, n,\ j=1, \ldots, l}, \tag{11}$$
$$\kappa: y \mapsto \sum_{i=1}^{n} w_i \prod_{j=1}^{l} y_{ij}. \tag{12}$$
Since $\kappa$ is continuous, Theorem 1 can be applied to obtain a function $\tilde{\kappa}$ of the form $\tilde{\kappa}(y) = \sum_{i} w'_i \sigma(a_i(y))$ for some $w'_i \in \mathbb{R}$ and $a_i \in \mathcal{A}_{n \cdot l}$, which approximates $\kappa$ with error at most $\varepsilon / 2$ on the compact range of $\varphi$. Applying Lemma 3.2 to $\varphi$ and $\tilde{\kappa}$ concludes the proof.
∎
5 Multi-instance learning and tree-structured data
The following corollary of Theorem 2 justifies the embedding paradigm of [28, 6, 19] for MIL problems:
Corollary 3 (Density of MIL NNs in $C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$).
Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$ and $\mathcal{P}^{\mathcal{X}}$ a compact set of probability measures on $\mathcal{X}$. Then any function $h \in C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$ can be arbitrarily closely approximated by a three-layer neural network composed of two nonlinear layers with an integration (mean) aggregation layer between them, and a linear output layer.
If $\mathcal{F}$ in Theorem 2 is set to the class of all feed-forward networks with a single nonlinear layer (that is, $\mathcal{F} = \Sigma(\sigma, \mathcal{A}_d)$), then the theorem says that for every $h \in C(\mathcal{P}^{\mathcal{X}}, \mathbb{R})$ and $\varepsilon > 0$, there is some $\tilde{h} \in \Sigma(\sigma, \mathcal{A}_{\mathcal{F}})$ such that $\| h - \tilde{h} \|_\infty < \varepsilon$. This $\tilde{h}$ can be written as
$$\tilde{h}(p) = W_3 \, \sigma\!\left( W_2 \int_{\mathcal{X}} \sigma(W_1 x) \,\mathrm{d}p(x) \right),$$
where for brevity the bias vectors are omitted, both $\sigma$ are applied element-wise, and $W_1$, $W_2$, $W_3$ are matrices of appropriate sizes. Since the integral is linear, the linear output layers of the networks in $\mathcal{F}$ can be moved outside of it and merged with the following linear map into the single matrix $W_2$, which proves the corollary.

Since Theorem 2 does not place any special conditions on $\mathcal{X}$ beyond being a compact metric space, nor on $\mathcal{F}$ beyond consisting of continuous functions and being uniformly dense in $C(\mathcal{X}, \mathbb{R})$, the theorem can be used as an induction step, and the construction can be repeated.
For example, consider a compact set of probability measures on $\mathcal{P}^{\mathcal{X}}$, denoted $\mathcal{P}^{\mathcal{P}^{\mathcal{X}}}$. Then the space of neural networks with four layers is dense in $C(\mathcal{P}^{\mathcal{P}^{\mathcal{X}}}, \mathbb{R})$. Such a network consists of three nonlinear layers with integration (mean) layers between them, and a last layer which is linear.
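This induction step can be sketched in code: a sample is a set of sets of vectors, an inner model embeds each inner set into a vector, and an outer model aggregates those embeddings. The weights, dimensions, and ReLU nonlinearities below are the author's illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# Illustrative layer weights for a nested (four-layer) MIL network:
# inner instance layer, inner-bag layer, outer-bag layer, linear output.
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(6, 8)), rng.normal(size=6)
W3, b3 = rng.normal(size=(4, 6)), rng.normal(size=4)
w4, b4 = rng.normal(size=4), rng.normal()

def embed_inner(inner_bag):
    """Embed one set of vectors: nonlinear layer, then mean aggregation."""
    return relu(inner_bag @ W1.T + b1).mean(axis=0)

def nested_mil_nn(outer_bag):
    """Set of sets -> scalar: three nonlinear layers with two mean
    aggregation layers between them, and a final linear layer."""
    inner = np.stack([relu(W2 @ embed_inner(s) + b2) for s in outer_bag])
    return w4 @ relu(W3 @ inner.mean(axis=0) + b3) + b4

# A sample: 3 inner sets with differing numbers of instances in R^3.
sample = [rng.normal(size=(k, 3)) for k in (2, 5, 7)]
out = nested_mil_nn(sample)
# Invariant to reordering the inner sets and the instances within them.
assert np.isclose(out, nested_mil_nn(sample[::-1]))
assert np.isclose(out, nested_mil_nn([s[::-1] for s in sample]))
```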
The above induction is summarized in the following theorem.
Theorem 5.
Let $\mathcal{S}$ be the class of spaces which (i) contains all compact subsets of $\mathbb{R}^d$, $d \in \mathbb{N}$, (ii) is closed under finite Cartesian products, and (iii) for each $\mathcal{X} \in \mathcal{S}$ satisfies $\mathcal{P}^{\mathcal{X}} \in \mathcal{S}$.¹ Then for each $\mathcal{X} \in \mathcal{S}$, every continuous function on $\mathcal{X}$ can be arbitrarily well approximated by neural networks.
¹ Here we assume that $\mathcal{P}^{\mathcal{X}}$ is endowed with the metric from Section 2.
By Lemma 1.1, an analogous result holds for measurable functions.
6 Related Work
Works most similar to this one are those on kernel mean embedding [21, 23], showing that a probability measure can be uniquely embedded into a high-dimensional space using a characteristic kernel. Kernel mean embedding is widely used in Maximum Mean Discrepancy [8] and in Support Measure Machines [16, 3], and it is, to our knowledge, the only algorithm with proven approximation capabilities comparable to the present work. Unfortunately, its worst-case complexity, which grows rapidly with the number of bags $l$ and the average size of a bag $b$, prevents it from scaling to problems above thousands of bags.
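For concreteness, the kernel mean embedding idea can be sketched as follows. The Gaussian kernel and the biased V-statistic estimator below are the author's choices for illustration; the squared MMD compares the mean embeddings of two bags without ever forming the embeddings explicitly:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=0.5):
    """Biased estimate of ||mean-map(X) - mean-map(Y)||^2 in the RKHS."""
    return (gaussian_kernel(X, X, gamma).mean()
            - 2.0 * gaussian_kernel(X, Y, gamma).mean()
            + gaussian_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(3)
same = rng.normal(size=(200, 2))
near = rng.normal(size=(200, 2))            # same distribution
far = rng.normal(loc=3.0, size=(200, 2))    # shifted distribution
# Bags from the same distribution are closer in MMD than shifted ones.
assert mmd2(same, near) < mmd2(same, far)
```

The quadratic cost in bag size for every pair of bags, taken over all pairs of bags, is exactly the scaling bottleneck mentioned above.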
The MIL problem has also been studied in [26], which proposes to use an LSTM network augmented by memory. The reduction from sets to vectors is indirect, computed as a weighted average over elements in an associative memory; the aggregation tackled here is therefore an integral part of that architecture. The paper lacks any approximation guarantees.
Problems where the input data has a tree structure naturally occur in language models, where they are typically solved by recurrent neural networks [13, 22]. The difference between these models and the present setting is that their tree is typically binary and all leaves are homogeneous, in the sense that each of them is either a vector representation of a word or a vector representation of an internal node. In contrast, here it is assumed that the tree can have an arbitrary number of heterogeneous leaves following a certain fixed scheme.

Due to lack of space, the authors cannot list all works on MIL. The reader is instead invited to look at the excellent overview in [1] and the works listed in the introductory part of this paper.
7 Conclusion
This work has been motivated by recently proposed solutions to multi-instance learning [28, 19, 6] and by the mean-map embedding of probability measures [23]. It generalizes the universal approximation theorem of neural networks to compact sets of probability measures over compact subsets of Euclidean spaces. Therefore, it can be seen as an adaptation of the mean-map framework to the world of neural networks, which is important for comparing probability measures and for multi-instance learning, and it proves the soundness of the constructions of [19, 6].
The universal approximation theorem is extended to inputs with a tree schema (structure) which, being the basis of many data exchange formats like JSON, XML, ProtoBuffer, Avro, etc., are nowadays ubiquitous. This theoretically justifies applications of (MIL) neural networks in this setting.
As the presented proof relies on the Stone-Weierstrass theorem, it restricts the nonlinear functions in neural networks to be continuous in all but the last nonlinear layer. Although this does not have an impact on practical applications (all commonly used nonlinear functions within neural networks are continuous), it would be interesting to generalize the result to discontinuous nonlinearities, as has been done for feed-forward neural networks in [14].
References
 [1] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artif. Intell., 201:81–105, August 2013.
 [2] Alessandro Bergamo and Lorenzo Torresani. Exploiting weaklylabeled web images to improve object classification: a domain adaptation approach. In Advances in neural information processing systems, pages 181–189, 2010.
 [3] Andreas Christmann and Ingo Steinwart. Universal kernels on nonstandard input spaces. In J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 406–414. Curran Associates, Inc., 2010.
 [4] Thomas G Dietterich, Richard H Lathrop, and Tomás LozanoPérez. Solving the multiple instance problem with axisparallel rectangles. Artificial intelligence, 89(1):31–71, 1997.
 [5] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.
 [6] Harrison Edwards and Amos Storkey. Towards a neural statistician, 2017.

 [7] James Foulds and Eibe Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(01):1–25, 2010.
 [8] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012.
 [9] Petr Habala, Petr Hájek, and Václav Zizler. Introduction to Banach spaces. Matfyzpress, vydavatelství Matematickofyzikální fakulty Univerzity Karlovy, 1996.
 [10] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
 [11] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251 – 257, 1991.
 [12] Nathan Ing, Jakub M Tomczak, Eric Miller, Isla P Garraway, Max Welling, Beatrice S Knudsen, and Arkadiusz Gertych. A deep multiple instance model to predict prostate cancer metastasis from nuclear morphology. 2018.
 [13] Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2096–2104. Curran Associates, Inc., 2014.

 [14] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
 [15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [16] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 10–18. Curran Associates, Inc., 2012.
 [17] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
 [18] Tomas Pevný and Petr Somol. Discriminative models for multiinstance problems with tree structure. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec ’16, pages 83–91, New York, NY, USA, 2016. ACM.
 [19] Tomáš Pevný and Petr Somol. Using neural network formalism to solve multipleinstance problems. In Fengyu Cong, Andrew Leung, and Qinglai Wei, editors, Advances in Neural Networks  ISNN 2017, pages 135–142, Cham, 2017. Springer International Publishing.
 [20] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
 [21] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A hilbert space embedding for distributions. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

 [22] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
 [23] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert Lanckriet, and Bernhard Schölkopf. Injective Hilbert space embeddings of probability measures. 2008.
 [24] M. H. Stone. The generalized weierstrass approximation theorem. Mathematics Magazine, 21(4):167–184, 1948.

 [25] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018.
 [26] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2016.
 [27] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on point sets with parameterized convolutional filters. arXiv preprint arXiv:1803.11527, 2018.
 [28] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.