A finite mixture model $\mathscr{P}$ is a probability law based on a finite number of probability measures, $\mu_1,\ldots,\mu_m$, and a discrete distribution $w_1,\ldots,w_m$. A realization of a mixture model is generated by first drawing a component index $k$ at random, $1\le k\le m$, with probability $w_k$, and then drawing from $\mu_k$. A mixture model can be associated with a probability measure on probability measures, which we denote $\mathscr{P}=\sum_{i=1}^m w_i\delta_{\mu_i}$. Mixture models are used to model data throughout statistics and machine learning.
A primary theoretical question concerning mixture models is identifiability. A mixture model is said to be identifiable if no other mixture model (of equal or lesser complexity) explains the distribution of the data. Some previous work on identifiability considers the situation where the observations are drawn iid from the mixture model, and conditions on the components $\mu_1,\ldots,\mu_m$ are imposed, such as Gaussianity (Dasgupta and Schulman, 2007; Anderson et al., 2014). In this work we make no assumptions on $\mu_1,\ldots,\mu_m$. Instead, we assume the observations are grouped, such that realizations from the same group are known to be iid from the same component. We call these groups of samples “random groups.” We define a random group to be a random collection $\mathbf{X}_i=(X_{i,1},\ldots,X_{i,n})$, where a component is first drawn from the mixture model and $X_{i,1},\ldots,X_{i,n}$ are then drawn iid from that component.
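The grouped sampling scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_random_groups`, the two example components (a Gaussian and an exponential), and the weights are hypothetical choices, not taken from the paper.

```python
import numpy as np

def sample_random_groups(weights, component_samplers, num_groups, group_size, rng):
    """Draw grouped samples: one latent component per group, then iid draws from it."""
    weights = np.asarray(weights, dtype=float)
    groups, labels = [], []
    for _ in range(num_groups):
        # Latent component index for this group, drawn with the mixing weights.
        k = rng.choice(len(weights), p=weights)
        # All samples in the group come from the same component.
        groups.append(component_samplers[k](rng, group_size))
        labels.append(k)
    return np.array(groups), np.array(labels)

# Hypothetical two-component mixture on the real line (illustrative only).
samplers = [
    lambda rng, n: rng.normal(loc=-2.0, scale=1.0, size=n),
    lambda rng, n: rng.exponential(scale=1.0, size=n),
]
rng = np.random.default_rng(0)
groups, labels = sample_random_groups(
    [0.3, 0.7], samplers, num_groups=5, group_size=3, rng=rng
)
print(groups.shape)  # (5, 3): five random groups of three samples each
```

In the identifiability problem only `groups` is observed; `labels` is returned here purely so the latent structure can be inspected.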
Consider the set of all mixtures of probability measures which yield the same distribution over the random groups as does the mixture model $\mathscr{P}$. If some element of this set other than $\mathscr{P}$ has no more components than $\mathscr{P}$, then $\mathscr{P}$ is not identifiable. In other words, there is no way to differentiate $\mathscr{P}$ from another model of equal or lesser complexity. Fortunately, with a sufficient number of samples in each random group, $\mathscr{P}$ becomes the simplest model which describes the data. In this paper we show that, for any sample space, any mixture of probability measures with $m$ components is identifiable when there are $2m-1$ samples per random group. Furthermore we show that this bound cannot be improved, regardless of sample space.
1.1 Applications of Probability Measures over Probability Measures
Though a somewhat mathematically abstract object, probability measures over spaces of probability measures arise quite naturally in many statistical problems. Any application which uses mixture models, for example clustering, is utilizing a probability measure over probability measures. Moreover, mixture models are a subset of a larger class of models known as latent variable models. One problem in latent variable models which has seen significant interest recently is topic modeling. Topic modeling is concerned with the extraction of some sort of topical structure from a collection of documents. Many popular methods for topic modeling assume that each document in question has a latent variable representing a “topic” or a random convex combination of topics which determines the distribution of words in that document (Blei et al., 2003; Anandkumar et al., 2014; Arora et al., 2012).
Another statistical problem which often utilizes a probability measure over probability measures is transfer learning. In transfer learning one is interested in utilizing several different but related training datasets (perhaps a collection of datasets which correspond to different patients in a study) to construct some sort of classifier or regressor for another different but related testing dataset. There are many approaches to this problem, but one formulation assumes that each dataset is generated from a random probability measure and each random measure is generated from a fixed probability measure over probability measures (Blanchard et al., 2011; Maurer et al., 2013).
1.2 How Does Group Size Affect Consistency?
Many of the applications above assume a model similar to the one we described in the first paragraph. They assume there exists some probability measure, $\mathscr{P}$, over a space of probability measures from which we have observed groups of data $\mathbf{X}_1,\ldots,\mathbf{X}_N$ with $\mathbf{X}_i=(X_{i,1},\ldots,X_{i,M_i})$, $X_{i,1},\ldots,X_{i,M_i}\overset{iid}{\sim}\mu_i$ and $\mu_i\overset{iid}{\sim}\mathscr{P}$. For example, in topic modeling each $\mathbf{X}_i$ is a document which contains $M_i$ words and in transfer learning $\mathbf{X}_i$ is one of the several different training datasets. Proposed algorithms for solving these problems often come with some sort of consistency result, and these results typically require that $N\to\infty$ and either $M_i\to\infty$ for all $i$ or that $\mathscr{P}$ satisfies some properties which make $M_i\to\infty$ unnecessary. When considering such results one may wonder what sort of statistical penalty we incur from fixing $M_i=C$ for all $i$.
While this question is clearly interesting from a theoretical perspective, it also has a couple of important practical implications. Firstly, it is not uncommon for the group size $M_i$ to be restricted in practice. An example of this is topic modeling of Twitter documents, where the restricted character count keeps each $M_i$ quite small. The second important practical consideration is that some latent variable techniques do not utilize the full sample $\mathbf{X}_i$ and instead break down $\mathbf{X}_i$ into many pairs or triplets of samples for analysis (Anandkumar et al., 2014; Arora et al., 2012). It is important to know what, if anything, is lost from doing this. Though we do not provide a direct answer to this question, our results seem to suggest that such techniques may significantly limit what can be known about the underlying mixture $\mathscr{P}$.
2 Related Work
The question of how many samples are necessary in each random group to uniquely identify a finite mixture of measures has come up sporadically over the past couple of decades. Kruskal’s theorem (Kruskal, 1977) has been used to derive various identifiability results for random groups containing three samples. In Allman et al. (2009) it was shown that any mixture of linearly independent measures over a discrete space, or of linearly independent probability distributions on $\mathbb{R}$, is identifiable from random groups containing three samples. In Hettmansperger and Thomas (2000) it was shown that a mixture of $m$ probability measures on $\mathbb{R}$ is identifiable from random groups of size $2m-1$ provided there exists some point in $\mathbb{R}$ where the cdf of each mixture component at that point is distinct. The result most closely resembling our own is in Rabani et al. (2013). In that paper they show that a mixture of $m$ probability measures over a discrete domain is identifiable with $2m-1$ samples in each random group. They also show that this bound is tight and provide a consistent algorithm for estimating arbitrary mixtures of measures over a discrete domain.
Our proofs are quite different from those of other related identifiability results and rely on tools from functional analysis. Other results in the same vein as ours rely on algebraic or spectral-theoretic tools. Our proofs rely on two techniques. The first is the embedding of finite collections of measures in some Hilbert space. The second is taking known properties of symmetric tensors over finite-dimensional spaces and applying them to tensor products of Hilbert spaces. Our proofs are not totally detached from the algebraic techniques, but the algebraic portions are hidden away in previous results about symmetric tensors.
3 Problem Setup
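We will treat this problem in as general a setting as possible. For any measurable space we define $\delta_x$ as the Dirac measure at $x$. For $\mathcal{X}$ a set, $\sigma$-algebra, or measure, we denote $\mathcal{X}^{\times a}$ to be the standard $a$-fold product associated with that object. For any natural number $k$ we define $[k]\triangleq\mathbb{N}\cap[1,k]$. Let $\Omega$ be a set containing more than one element. This set is the sample space of our data. Let $\mathcal{F}$ be a $\sigma$-algebra over $\Omega$. Assume $\mathcal{F}\neq\{\emptyset,\Omega\}$. We denote the space of probability measures over this space as $\mathcal{D}(\Omega,\mathcal{F})$, which we will shorten to $\mathcal{D}$. We will equip $\mathcal{D}$ with the $\sigma$-algebra $2^{\mathcal{D}}$ so that each Dirac measure over $\mathcal{D}$ is unique. Define $\Delta(\mathcal{D})\triangleq\operatorname{span}(\delta_x:x\in\mathcal{D})$. This will be the ambient space where our mixtures of probability measures live. Let $\mathscr{P}$ be a probability measure in $\Delta(\mathcal{D})$. Let $\mu\sim\mathscr{P}$ and $X_1,\ldots,X_n\overset{iid}{\sim}\mu$. We will denote $\mathbf{X}=(X_1,\ldots,X_n)$.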
We will now derive the probability law of . Let , we have
The second equality follows from Lemma 3.10 in Kallenberg (2002). So the probability law of is
We want to view the probability law of as a function of in a mathematically rigorous way, which requires a bit of technical buildup. Let
be a vector space. We will now construct a version of the integral for-valued functions over . Let . From the definition of it follows that admits the representation
From the well-ordering principle there must exist some representation with minimal and we define as the order of . We can show that the representation of any is unique up to permutation of its indices. We call a mixture of measures if it is a probability measure in . We will say that has mixture components if it has order .
Let and admit minimal representations . There exists some permutation such that and for all . Because both representations are minimal it follows that for all and for all . From this we know for all . Because for all it follows that for any there exists some such that . Let be a function satisfying . Because the elements are also distinct must be injective and thus a permutation. Again from this distinctness we get that, for all , and we are done. Henceforth when we define an element of with a summation we will assume that the summation is a minimal representation. Any minimal representation of a mixture of measures with components satisfies with for all and . So any mixture of measures is a convex combination of Dirac measures at elements in .
For a function define
where is a minimal representation of . This integral is well defined as a consequence of Lemma 3.
For a -algebra we define as the space of all finite signed measures over that space. Let . We introduce the operator
For a minimal representation , we have
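From this definition we have that $V_n(\mathscr{P})$, the image of a mixture of measures $\mathscr{P}$ under the operator $V_n$ introduced above, is simply the law of $\mathbf{X}$ which we derived earlier. Two mixtures of measures are different if they admit a different measure over the space of probability measures $\mathcal{D}$. We call a mixture of measures, $\mathscr{P}$, $n$-identifiable if there does not exist a different mixture of measures $\mathscr{P}'$, with order no greater than the order of $\mathscr{P}$, such that $V_n(\mathscr{P})=V_n(\mathscr{P}')$.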
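For a finite sample space the law of a random group can be computed exactly, which makes the definition of identifiability easy to probe numerically. The sketch below assumes the law of a group of size $n$ is the weighted sum of $n$-fold product measures of the components, as described in the text. The function `group_law` and the two Bernoulli mixtures are hypothetical illustrations, chosen so that the first two moments of the success probability agree.

```python
import numpy as np
from functools import reduce

def group_law(weights, components, n):
    """Law of a random group of size n as an n-way array:
    sum_i w_i * (mu_i x ... x mu_i), with each mu_i a probability vector."""
    law = 0.0
    for w, mu in zip(weights, components):
        mu = np.asarray(mu, dtype=float)
        law = law + w * reduce(np.multiply.outer, [mu] * n)
    return law

def bernoulli(p):
    # Probability vector on the sample space {0, 1}.
    return [1.0 - p, p]

# Two different two-component mixtures (hypothetical numbers).
mix_a = ([0.5, 0.5], [bernoulli(0.4), bernoulli(0.6)])
mix_b = ([0.8, 0.2], [bernoulli(0.45), bernoulli(0.7)])

same_n2 = np.allclose(group_law(*mix_a, n=2), group_law(*mix_b, n=2))
same_n3 = np.allclose(group_law(*mix_a, n=3), group_law(*mix_b, n=3))
print(same_n2, same_n3)  # True False
```

With two samples per group the two mixtures induce the same law, so neither can be told apart from the other. With three samples per group the laws differ.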
Definition 3 is the central object of interest in this paper. Given a mixture of measures, then is equal to , the measure from which is drawn. In topic modelling would be the samples from a single document and in transfer learning it would be one of the several collections of training samples. If is not -identifiable then we know that there exists a mixture of measures which is no more complex (in terms of number of mixture components) than which is not discernible from given the data. Practically speaking this means we need more samples in each random group in order for the full richness of to be manifested in .
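Our primary result gives us a bound on the $n$-identifiability of all mixtures of measures with $m$ or fewer components. We also show that this bound is tight.

Theorem. Let $(\Omega,\mathcal{F})$ be a measurable space. Mixtures of measures with $m$ components are $(2m-1)$-identifiable.

Theorem. Let $(\Omega,\mathcal{F})$ be a measurable space with $\mathcal{F}\neq\{\emptyset,\Omega\}$. For all $m$, there exists a mixture of measures with $m$ components which is not $(2m-2)$-identifiable.

Unsurprisingly, if a mixture of measures is $n$-identifiable then it is $q$-identifiable for all $q>n$. Likewise if a mixture of measures is not $n$-identifiable then it is not $q$-identifiable for $q<n$. Thus identifiability is, in some sense, monotonic.

Lemma. If a mixture of measures is $n$-identifiable then it is $q$-identifiable for all $q>n$.

Proof. We will proceed by contradiction. Let $\mathscr{P}=\sum_{i=1}^{l}a_i\delta_{\mu_i}$ be $n$-identifiable, let $\mathscr{P}'=\sum_{j=1}^{r}b_j\delta_{\nu_j}$ be a different mixture of measures with $r\le l$ and
$$\sum_{i=1}^{l}a_i\mu_i^{\times q}=\sum_{j=1}^{r}b_j\nu_j^{\times q}$$
for some $q>n$. Let $A\in\mathcal{F}^{\times n}$ be arbitrary. We have
$$\sum_{i=1}^{l}a_i\mu_i^{\times n}(A)=\sum_{i=1}^{l}a_i\mu_i^{\times q}\left(A\times\Omega^{\times(q-n)}\right)=\sum_{j=1}^{r}b_j\nu_j^{\times q}\left(A\times\Omega^{\times(q-n)}\right)=\sum_{j=1}^{r}b_j\nu_j^{\times n}(A).$$
This implies that $\mathscr{P}$ is not $n$-identifiable, a contradiction.

Lemma. If a mixture of measures is not $n$-identifiable then it is not $q$-identifiable for any $q<n$.

Proof. Let a mixture of measures $\mathscr{P}=\sum_{i=1}^{l}a_i\delta_{\mu_i}$ not be $n$-identifiable. It follows that there exists a different mixture of measures $\mathscr{P}'=\sum_{j=1}^{r}b_j\delta_{\nu_j}$, with $r\le l$, such that
$$\sum_{i=1}^{l}a_i\mu_i^{\times n}=\sum_{j=1}^{r}b_j\nu_j^{\times n}.$$
Let $A\in\mathcal{F}^{\times q}$ be arbitrary. We have
$$\sum_{i=1}^{l}a_i\mu_i^{\times q}(A)=\sum_{i=1}^{l}a_i\mu_i^{\times n}\left(A\times\Omega^{\times(n-q)}\right)=\sum_{j=1}^{r}b_j\nu_j^{\times n}\left(A\times\Omega^{\times(n-q)}\right)=\sum_{j=1}^{r}b_j\nu_j^{\times q}(A)$$
and therefore $\mathscr{P}$ is not $q$-identifiable.

Viewed alternatively, these results say that $n=2m-1$ is the smallest value for which $V_n$ is injective over the set of all minimal mixtures of measures with $m$ or fewer components.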
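The monotonicity argument rests on one observation: the law of a smaller random group is a marginal of the law of a larger one. On a finite sample space this can be checked directly. The sketch below uses a hypothetical three-component mixture on a three-point sample space, and `group_law` is an illustrative helper rather than anything defined in the paper.

```python
import numpy as np
from functools import reduce

def group_law(weights, components, n):
    """Law of a random group of size n: sum_i w_i * (n-fold product of mu_i)."""
    law = 0.0
    for w, mu in zip(weights, components):
        mu = np.asarray(mu, dtype=float)
        law = law + w * reduce(np.multiply.outer, [mu] * n)
    return law

# Hypothetical mixture of three measures on a three-point sample space.
weights = [0.2, 0.5, 0.3]
components = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]

law_4 = group_law(weights, components, 4)
# Evaluating the size-4 law on sets of the form A x Omega amounts to
# summing out the last coordinate.
law_3_from_4 = law_4.sum(axis=-1)
law_3 = group_law(weights, components, 3)
print(np.allclose(law_3_from_4, law_3))  # True
```

Because the size-3 law is a function of the size-4 law, two mixtures that agree on groups of size 4 must also agree on groups of size 3, which is the content of both lemmas.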
5 Tensor Products of Hilbert Spaces
Our proofs will rely heavily on the geometry of tensor products of Hilbert spaces which we will introduce in this section.
5.1 Overview of Tensor Products
First we introduce tensor products of Hilbert spaces. To our knowledge there does not exist a rigorous construction of the tensor product Hilbert space which is both succinct and intuitive. Because of this we will simply state some basic facts about tensor products of Hilbert spaces and hopefully instill some intuition for the uninitiated by way of example. A thorough treatment of tensor products of Hilbert spaces can be found in Kadison and Ringrose (1983).
Let $H$ and $H'$ be Hilbert spaces. From these two Hilbert spaces the “simple tensors” are elements of the form $h\otimes h'$ with $h\in H$ and $h'\in H'$. We can treat the simple tensors as being the basis for some inner product space $H_0$, with the inner product of simple tensors satisfying
$$\langle h_1\otimes h_1',h_2\otimes h_2'\rangle=\langle h_1,h_2\rangle\langle h_1',h_2'\rangle.$$
The tensor product of $H$ and $H'$ is the completion of $H_0$ and is denoted $H\otimes H'$. To avoid potential confusion we note that the notation just described is standard in operator theory literature. In some literature our definition of $H_0$ is denoted as $H\otimes H'$ and our definition of $H\otimes H'$ is denoted $H\,\hat{\otimes}\,H'$.
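As an illustrative example we consider the tensor product $L^2(\mathbb{R})\otimes L^2(\mathbb{R})$. It can be shown that there exists an isomorphism between $L^2(\mathbb{R})\otimes L^2(\mathbb{R})$ and $L^2(\mathbb{R}^2)$ which maps the simple tensors to separable functions, $f\otimes f'\mapsto f(\cdot)f'(\cdot)$. We can demonstrate this isomorphism with a simple example. Let $f,g,f',g'\in L^2(\mathbb{R})$. Taking the $L^2(\mathbb{R}^2)$ inner product of $f(\cdot)f'(\cdot)$ and $g(\cdot)g'(\cdot)$ gives us
$$\int\!\!\int f(x)f'(y)\,g(x)g'(y)\,dx\,dy=\int f(x)g(x)\,dx\int f'(y)g'(y)\,dy=\langle f,g\rangle\langle f',g'\rangle=\langle f\otimes f',g\otimes g'\rangle.$$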
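A finite-dimensional analogue of this example can be checked numerically. In the sketch below vectors in $\mathbb{R}^6$ stand in for square-integrable functions and the outer product stands in for the separable function associated with a simple tensor. The dimension and the random vectors are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four arbitrary vectors standing in for f, f', g, g'.
f, f_prime, g, g_prime = rng.normal(size=(4, 6))

# Inner product of the two "separable functions" f(x)f'(y) and g(x)g'(y):
# sum over all (x, y) pairs of the pointwise product.
lhs = np.sum(np.outer(f, f_prime) * np.outer(g, g_prime))

# Product of the factor-wise inner products, which is the defining
# inner product of the simple tensors.
rhs = np.dot(f, g) * np.dot(f_prime, g_prime)

print(np.isclose(lhs, rhs))  # True
```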
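Beyond the tensor product we will need to define the tensor power. To begin we will first show that tensor products are, in some sense, associative. Let $H_1,H_2,H_3$ be Hilbert spaces. Proposition 2.6.5 in Kadison and Ringrose (1983) states that there is a unique unitary operator, $U:(H_1\otimes H_2)\otimes H_3\to H_1\otimes(H_2\otimes H_3)$, which satisfies the following for all $h_1\in H_1,h_2\in H_2,h_3\in H_3$,
$$U\left((h_1\otimes h_2)\otimes h_3\right)=h_1\otimes(h_2\otimes h_3).$$
This implies that for any collection of Hilbert spaces, $H_1,\ldots,H_n$, the Hilbert space $H_1\otimes\cdots\otimes H_n$ is defined unambiguously regardless of how we decide to associate the products. In the space $H_1\otimes\cdots\otimes H_n$ we define a simple tensor as a vector of the form $h_1\otimes\cdots\otimes h_n$ with $h_i\in H_i$. In Kadison and Ringrose (1983) it is shown that $H_1\otimes\cdots\otimes H_n$ is the closure of the span of these simple tensors. To conclude this primer on tensor products we introduce the following notation. For a Hilbert space $H$ we denote $H^{\otimes n}=\underbrace{H\otimes\cdots\otimes H}_{n\text{ times}}$ and for $h\in H$, $h^{\otimes n}=\underbrace{h\otimes\cdots\otimes h}_{n\text{ times}}$.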
5.2 Some Results for Tensor Product Spaces
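We will now state some technical results which will be useful for the rest of the paper. These lemmas are similar to, or are straightforward extensions of, previous results which we needed to modify for our particular purposes. Let $(\Psi,\mathcal{G},\gamma)$ be a $\sigma$-finite measure space. We have the following lemma which connects the $L^2$ space of products of measures to the tensor products of the $L^2$ space for each measure. The proof of this lemma is straightforward but technical and can be found in the appendix.

Lemma. There exists a unitary transform $U:L^2(\Psi,\mathcal{G},\gamma)^{\otimes n}\to L^2(\Psi^{\times n},\mathcal{G}^{\times n},\gamma^{\times n})$ such that, for all $f_1,\ldots,f_n\in L^2(\Psi,\mathcal{G},\gamma)$, $U(f_1\otimes\cdots\otimes f_n)=f_1(\cdot)\cdots f_n(\cdot)$.

The following lemma is used in the proof of Lemma 5.2 as well as the proof of Theorem 4. The proof of this lemma is also not particularly interesting and can be found in the appendix.

Lemma. Let $H_1,\ldots,H_n,H_1',\ldots,H_n'$ be a collection of Hilbert spaces and $U_1,\ldots,U_n$ a collection of unitary operators with $U_i:H_i\to H_i'$ for all $i$. There exists a unitary operator $U:H_1\otimes\cdots\otimes H_n\to H_1'\otimes\cdots\otimes H_n'$ satisfying $U(h_1\otimes\cdots\otimes h_n)=U_1(h_1)\otimes\cdots\otimes U_n(h_n)$ for all $h_1\in H_1,\ldots,h_n\in H_n$.

Lemma. Let $n>1$ and let $h_1,\ldots,h_n$ be elements of a Hilbert space such that no elements are zero and no pairs of elements are collinear. Then $h_1^{\otimes n-1},\ldots,h_n^{\otimes n-1}$ are linearly independent.

A statement of this lemma for finite-dimensional spaces can be found in Comon et al. (2008). We present our own proof for the Hilbert space setting.

Proof. We will proceed by induction. For $n=2$ the lemma clearly holds. Suppose the lemma holds for $n-1$ and let $h_1,\ldots,h_n$ satisfy the assumptions in the lemma statement. Let $\alpha_1,\ldots,\alpha_n$ satisfy
$$\sum_{i=1}^{n}\alpha_i h_i^{\otimes n-1}=0.\qquad(2)$$
To finish the proof we will show that $\alpha_1$ must be zero, which can be generalized to any $\alpha_i$ without loss of generality. Let $H_1$ and $H_2$ be Hilbert spaces and let $\mathrm{HS}(H_1,H_2)$ be the space of Hilbert-Schmidt operators from $H_1$ to $H_2$. Hilbert-Schmidt operators are a closed subspace of bounded linear operators. Proposition 2.6.9 in Kadison and Ringrose (1983) states that for a pair of Hilbert spaces $H_1,H_2$ there exists a unitary operator $U:H_1\otimes H_2\to\mathrm{HS}(H_1,H_2)$ such that $U(h_1\otimes h_2)=\langle h_1,\cdot\rangle h_2$. Applying this operator to (2) we get
$$\sum_{i=1}^{n}\alpha_i\langle h_i,\cdot\rangle h_i^{\otimes n-2}=0.\qquad(3)$$
Because $h_1$ and $h_n$ are linearly independent we can choose $z$ such that $\langle h_1,z\rangle\neq 0$ and $\langle h_n,z\rangle=0$. Plugging $z$ into (3) yields
$$\sum_{i=1}^{n-1}\alpha_i\langle h_i,z\rangle h_i^{\otimes n-2}=0$$
and therefore $\alpha_1=0$ by the inductive hypothesis.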
6 Proofs of Theorems
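With the tools developed in the previous sections we can now prove our theorems. First we introduce one additional piece of notation. For a function $f$ on a domain $\mathcal{X}$ we define $f^{\times k}$ as simply the product of the function $k$ times on the domain $\mathcal{X}^{\times k}$, $f^{\times k}(x_1,\ldots,x_k)=f(x_1)\cdots f(x_k)$. For a measure the notation continues to denote the standard product measure.

Finally we will need the following technical lemma to connect the product of Radon-Nikodym derivatives to product measures. The proof is straightforward and can be found in the appendix.

Lemma. Let $(\Psi,\mathcal{G})$ be a measurable space, $\gamma$ and $\pi$ a pair of bounded measures on that space, and $f$ a nonnegative function in $L^1(\Psi,\mathcal{G},\pi)$ such that, for all $A\in\mathcal{G}$, $\gamma(A)=\int_A f\,d\pi$. Then for all $n$, for all $B\in\mathcal{G}^{\times n}$ we have
$$\gamma^{\times n}(B)=\int_B f^{\times n}\,d\pi^{\times n}.$$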
Proof of Theorem 4. We will proceed by contradiction. Suppose there exist two different mixtures of measures , such that
and . From our assumption on representation we know for all and similarly for . We will also assume that for all . Were this not true we could simply subtract the smaller of the common terms from both sides of (6) and normalize to yield another pair of distinct mixtures of measures with fewer components and no shared terms, and . Let have components and have with . If then we can apply Lemma 4 to give us and proceed as usual.
Let . Clearly dominates and for all so we can define Radon-Nikodym derivatives , which are in . We can assert that these derivatives are everywhere nonnegative without issue. Clearly no two of these derivatives are equal. If one of the derivatives were a scalar multiple of another, for example for some , it would imply
This is not true so no pair of these derivatives are collinear.
Lemma 6 tells us that, for any we have
-almost everywhere (Proposition 2.23 in Folland (1999)). We will now show for all that and . We will argue this for which will clearly generalize to the other elements. First we will show that -almost everywhere. Suppose this were not true and that there exists with and . Now we would have
a contradiction. Evaluating directly we get
Since Lemma 5.2 states that are all linearly independent and thus and for all , a contradiction.
Proof of Theorem 4. To prove this theorem we will construct a pair of different mixtures of measures, which both contain components and satisfy .
From our definition of we know there exists such that are nonempty. Let and . It follows that are different probability measures on . Because and are dominated by we know that there exists a pair of measurable functions such that, for all , and . We can assert that and are nonnegative without issue.
From the same argument we used in the proof of Theorem 4 we know . Let be the Hilbert space generated from the span of . Let be distinct elements of and let be elements of with . Clearly is a pdf over for all and there are no pairs in this collection which are collinear. Let be the Hilbert space generated from the span of and . Since is isomorphic to there exists a unitary operator . From Lemma 5.2 there exists a unitary operator with . Because is unitary the set maps exactly to the set . An order tensor, , is symmetric if for any and permutation . A consequence of Lemma 4.2 in Comon et al. (2008) is that is exactly the space of all symmetric order tensors over .
From Proposition 3.4 in Comon et al. (2008) it follows that the dimension of is . From this we get that .
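The dimension count and the resulting linear dependence can be illustrated numerically. The standard count for symmetric tensors of order $n$ over a $d$-dimensional space is $\binom{n+d-1}{n}$. The sketch below assumes, as our reading of the construction, that the relevant tensors have order $2m-2$ over a two-dimensional space and that there are $2m$ pairwise non-collinear vectors. The specific vectors and the choice $m=3$ are hypothetical.

```python
import numpy as np
from functools import reduce
from math import comb

def tensor_power(v, n):
    """Flattened n-fold tensor power of the vector v."""
    return reduce(np.multiply.outer, [v] * n).ravel()

m = 3
order = 2 * m - 2  # assumed tensor order for this illustration

# 2m pairwise non-collinear vectors in R^2 (illustrative choice).
vectors = [np.array([1.0, t]) for t in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
powers = np.stack([tensor_power(v, order) for v in vectors])

# Dimension of the space of symmetric order-(2m-2) tensors over R^2.
sym_dim = comb(order + 2 - 1, order)

# All 2m tensor powers together, and each subset with one vector removed.
rank_all = np.linalg.matrix_rank(powers)
rank_drop_one = [
    np.linalg.matrix_rank(np.delete(powers, i, axis=0)) for i in range(2 * m)
]
print(sym_dim, rank_all, rank_drop_one)
```

Here the $2m=6$ tensor powers live in a space of dimension $2m-1=5$, so they must be linearly dependent, while every subset of five of them has full rank, in line with the linear independence lemma for pairwise non-collinear vectors.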
The bound on the dimension of implies that are linearly dependent. Conversely Lemma 5.2 implies that removing a single vector from yields a set of vectors which are linearly independent. It follows that there exists with for all and
Without loss of generality we will assume that for with . From this we have
From Lemma 5.2 we have
Let . We know so dividing both sides of (5) by gives us
and the left and the right side are convex combinations. Let positive numbers with for and for . This gives us
It follows that
We will now show that . Suppose . Then are linearly independent. From this we know that there exists such that for but is not orthogonal to . Using this vector we have
and thus .
Now we have
Applying Lemma 5.2 we get that
From Lemma 6 we have,
Thus setting and gives us and by construction.
6.1 Discussion of the Proof of Theorem 4
In the previous proof we could have replaced with any distinct pair of probability measures on . Thus the pair are not pathological because of some property of each individual mixture component, but because of the geometry of the mixture components considered as a whole. The measures are convex combinations of and and therefore lie in a one dimensional affine subspace of . The space of Bernoulli measures similarly lies in a subspace between two measures, the point mass at $0$ and the point mass at $1$. Given a mixture of Bernoulli distributions, the sum of iid samples of Bernoulli random variables is a binomial distribution. We can draw a connection between our result and the identifiability of mixtures of binomial distributions.
Consider a mixture of Bernoulli distributions with parameters $p_1,\ldots,p_m$ and weights $w_1,\ldots,w_m$. Suppose we have $n$ samples in each random group. If we let $Y_i$ be the sum of the $i$th random group then the probability law of $Y_i$ is a mixture of binomial random variables. Let $\operatorname{Bin}(n,p)$ be the distribution of a binomial random variable with parameters $n$ and $p$. Specifically we have that $Y_i\sim\sum_{j=1}^{m}w_j\operatorname{Bin}(n,p_j)$. In Blischke (1964) it was shown that $n\ge 2m-1$ is a necessary and sufficient condition for the identifiability of the parameters from the samples $Y_i$. We find these similarities thought-provoking but are not prepared to make more precise connections at this time.
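The binomial connection can be made concrete with a small moment-matching sketch. The law of the sum of $n$ grouped Bernoulli samples depends on the mixing distribution only through its first $n$ moments, so two different two-point mixing distributions with the same mean and variance give the same binomial mixture for $n=2$. The helper names `two_point` and `binomial_mixture_pmf`, and the numbers used, are hypothetical.

```python
from math import comb, sqrt, isclose

def two_point(mean, var, w):
    """Two-point distribution with the given mean and variance,
    placing mass w on the lower point and 1 - w on the upper point."""
    sd = sqrt(var)
    low = mean - sd * sqrt((1 - w) / w)
    high = mean + sd * sqrt(w / (1 - w))
    return [w, 1 - w], [low, high]

def binomial_mixture_pmf(weights, params, n):
    """pmf of the sum of n grouped Bernoulli samples under the mixture."""
    return [
        sum(w * comb(n, s) * p**s * (1 - p) ** (n - s)
            for w, p in zip(weights, params))
        for s in range(n + 1)
    ]

# Same mean and variance, different weights: p in {0.4, 0.6} vs {0.45, 0.7}.
mix_a = two_point(0.5, 0.01, 0.5)
mix_b = two_point(0.5, 0.01, 0.8)

def same_law(n):
    pmf_a = binomial_mixture_pmf(*mix_a, n)
    pmf_b = binomial_mixture_pmf(*mix_b, n)
    return all(isclose(x, y, abs_tol=1e-12) for x, y in zip(pmf_a, pmf_b))

print(same_law(2), same_law(3))  # True False
```

For these two-component mixtures the group sums are indistinguishable with two samples per group and distinguishable with three, matching the threshold discussed above for $m=2$.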
In this paper we have proven a fundamental bound on the identifiability of mixture models in a nonparametric setting. Any mixture with $m$ components is identifiable with groups of samples containing $2m-1$ samples from the same latent probability measure. We show that this bound is tight by constructing a mixture of $m$ probability measures which is not identifiable with groups of samples containing $2m-2$ samples. These results hold for any mixture over any domain with at least two elements.
- Allman et al. (2009) Elizabeth S. Allman, Catherine Matias, and John A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist., 37(6A):3099–3132, 12 2009. doi: 10.1214/09-AOS689. URL http://dx.doi.org/10.1214/09-AOS689.
- Anandkumar et al. (2014) Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014. URL http://jmlr.org/papers/v15/anandkumar14b.html.
- Anderson et al. (2014) Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James Voss. The more, the merrier: the blessing of dimensionality for learning large gaussian mixtures. In Proceedings of The 27th Conference on Learning Theory, pages 1135–1164, 2014.
- Arora et al. (2012) Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 145–162, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1245-5. doi: 10.1145/2213977.2213994. URL http://doi.acm.org/10.1145/2213977.2213994.
- Blanchard et al. (2011) Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, NIPS, pages 2178–2186, 2011. URL http://dblp.uni-trier.de/db/conf/nips/nips2011.html#BlanchardLS11.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937.
- Blischke (1964) W. R. Blischke. Estimating the parameters of mixtures of binomial distributions. Journal of the American Statistical Association, 59(306):pp. 510–528, 1964. ISSN 01621459. URL http://www.jstor.org/stable/2283005.
- Comon et al. (2008) Pierre Comon, Gene Golub, Lek-Heng Lim, and Bernard Mourrain. Symmetric tensors and symmetric tensor rank. SIAM Journal on Matrix Analysis and Applications, 30(3):1254–1279, 2008. doi: 10.1137/060661569. URL http://dx.doi.org/10.1137/060661569.
- Dasgupta and Schulman (2007) Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of em for mixtures of separated, spherical gaussians. J. Mach. Learn. Res., 8:203–226, May 2007. ISSN 1532-4435. URL http://portal.acm.org/citation.cfm?id=1248659.1248666.
- Folland (1999) Gerald B. Folland. Real analysis: modern techniques and their applications. Pure and applied mathematics. Wiley, 1999. ISBN 9780471317166. URL http://books.google.com/books?id=uPkYAQAAIAAJ.
- Hettmansperger and Thomas (2000) T. P. Hettmansperger and Hoben Thomas. Almost nonparametric inference for repeated measures in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):811–825, 2000. ISSN 1467-9868. doi: 10.1111/1467-9868.00266. URL http://dx.doi.org/10.1111/1467-9868.00266.
- Kadison and Ringrose (1983) R.V. Kadison and J.R. Ringrose. Fundamentals of the theory of operator algebras. V1: Elementary theory. Pure and Applied Mathematics. Elsevier Science, 1983. ISBN 9780080874166. URL https://books.google.com/books?id=JbxgKOwu2McC.
- Kallenberg (2002) Olav Kallenberg. Foundations of modern probability. Probability and its applications. Springer, New York, Berlin, Paris, 2002. ISBN 0-387-95313-2. URL http://opac.inria.fr/record=b1098179.
- Kruskal (1977) Joseph B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95 – 138, 1977. ISSN 0024-3795.
- Maurer et al. (2013) Andreas Maurer, Massi Pontil, and Bernardino Romera-paredes. Sparse coding for multitask and transfer learning. In Sanjoy Dasgupta and David Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 343–351. JMLR Workshop and Conference Proceedings, May 2013. URL http://jmlr.csail.mit.edu/proceedings/papers/v28/maurer13.pdf.
- Muandet and Schölkopf (2013) Krikamol Muandet and Bernhard Schölkopf. One-class support measure machines for group anomaly detection. CoRR, abs/1303.0309, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1303.html#abs-1303-0309.
- Póczos et al. (2013) Barnabás Póczos, Aarti Singh, Alessandro Rinaldo, and Larry A. Wasserman. Distribution-free distribution regression. In AISTATS, volume 31 of JMLR Proceedings, pages 507–515. JMLR.org, 2013. URL http://dblp.uni-trier.de/db/conf/aistats/aistats2013.html#PoczosSRW13.
- Rabani et al. (2013) Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. Learning mixtures of arbitrary distributions over large discrete domains. ArXiv e-prints, 2013. URL http://arxiv.org/abs/1212.1527.
- Szabo et al. (2014) Z. Szabo, B. Sriperumbudur, B. Poczos, and A. Gretton. Learning Theory for Distribution Regression. ArXiv e-prints, November 2014.