Estimating distributions from observed data is a fundamental task in statistics that has been studied for over a century. This task frequently arises in applied machine learning and it is very common to assume that the distribution can be modeled using a mixture of Gaussians. Popular software packages have implemented heuristics, such as the EM algorithm, for learning a mixture of Gaussians. The theoretical machine learning community also has a rich literature on distribution learning. For example, the recent survey of Diakonikolas (2016) considers learning of structured distributions, and the survey of Kalai et al. (2012) focuses on mixtures of Gaussians.
This paper develops a general technique for distribution learning, then employs these techniques in the important setting of mixtures of Gaussians. The theoretical model that we adopt is density estimation: given i.i.d. samples from an unknown target distribution, find a distribution that is close to the target distribution in total variation (TV) distance. Our focus is on sample complexity bounds: using as few samples as possible to obtain a good estimate of the target distribution. For background on this model see, e.g., Devroye and Lugosi (2001, Chapter 5) and Diakonikolas (2016).
Our new technique for upper bounds on the sample complexity involves a form of sample compression. If it is possible to “encode” members of a class of distributions using a carefully chosen subset of the samples, then this yields an upper bound on the sample complexity of distribution learning for that class. In particular, by constructing compression schemes for mixtures of axis-aligned Gaussians and general Gaussians, we obtain new upper bounds on the sample complexity of learning with respect to these classes, which are optimal up to logarithmic factors.
The compression framework can incorporate a notion of robustness, which leads to sample complexity bounds for agnostic learning. Namely, if the target distribution is close to a mixture of Gaussians (in TV distance), our method uses few samples to find a mixture of Gaussians that is close to the target distribution (in TV distance).
1.1 Main results
In this section, all learning results refer to the problem of producing a distribution within total variation distance $\varepsilon$ of the target distribution.
Our first main result is an upper bound for learning mixtures of multivariate Gaussians. This bound is tight up to logarithmic factors.
The class of $k$-mixtures of $d$-dimensional Gaussians can be learned using $\widetilde{O}(kd^2/\varepsilon^2)$ samples. This result generalizes to the agnostic setting.
Previously, the best upper bounds on the sample complexity of this problem were the bound of Ashtiani et al. (2017) and a bound based on a VC-dimension argument discussed later; both are loose by polynomial factors. For the case of a single Gaussian (i.e., $k=1$), a sample complexity bound of $\widetilde{O}(d^2/\varepsilon^2)$ is well known (see, e.g., Ashtiani et al. (2017, Theorem 13)).
Our second main result is a lower bound matching Theorem 1.1 up to logarithmic factors.
Any method for learning the class of $k$-mixtures of $d$-dimensional Gaussians has sample complexity $\widetilde{\Omega}(kd^2/\varepsilon^2)$.
Previously, the best lower bound on the sample complexity was $\Omega(kd/\varepsilon^2)$ (Suresh et al., 2014). Even for a single Gaussian (i.e., $k=1$), an $\Omega(d^2/\varepsilon^2)$ lower bound was not known prior to this work.
Our third main result is an upper bound for learning mixtures of axis-aligned Gaussians, i.e., Gaussians with diagonal covariance matrix. This bound is tight up to logarithmic factors.
The class of $k$-mixtures of axis-aligned $d$-dimensional Gaussians can be learned using $\widetilde{O}(kd/\varepsilon^2)$ samples. This result generalizes to the agnostic setting.
Although our approach for proving sample complexity upper bounds is algorithmic, our focus is not on computational efficiency. The resulting algorithms are efficient in terms of sample complexity, but their runtime is exponential in the dimension $d$ and the number of mixture components $k$. The existence of a polynomial-time algorithm for density estimation is unknown even for the class of mixtures of axis-aligned Gaussians (Diakonikolas et al., 2017a, Question 1.1).
Even for the case of a single Gaussian, the published proofs of the $\widetilde{O}(d^2/\varepsilon^2)$ bound are not algorithmically efficient. Using ideas from our proof of Theorem 1.1, we show in Appendix B that an algorithmically efficient learner for the single Gaussian case can be obtained simply by computing the empirical mean and covariance matrix of the samples.
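For the single-Gaussian case just mentioned, the estimator is short enough to state in code. The following Python sketch (our own illustration, not the paper's implementation) outputs the Gaussian whose parameters are the empirical mean and covariance of the sample:

```python
import numpy as np

def learn_single_gaussian(samples):
    """Output the Gaussian with the empirical mean and covariance
    of the sample (the simple estimator discussed above)."""
    samples = np.asarray(samples, dtype=float)
    mu_hat = samples.mean(axis=0)
    sigma_hat = np.cov(samples, rowvar=False)  # rows are observations
    return mu_hat, np.atleast_2d(sigma_hat)

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=20000)
mu_hat, sigma_hat = learn_single_gaussian(X)
```

With a sample of this size, the estimated parameters are close to the true ones, matching the claim that this trivial learner is sample-efficient for a single Gaussian.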
1.2 Related work
Distribution learning is a vast topic and many approaches have been considered in the literature. We briefly review approaches that are most relevant to our problem.
For parametric families of distributions, a common approach is to use the samples to estimate the parameters of the distribution, possibly in a maximum likelihood sense, or possibly aiming to approximate the true parameters. For the specific case of mixtures of Gaussians, there is a substantial theoretical literature on algorithms that approximate the mixing weights, means and covariances. Kalai et al. (2012) gave a recent survey of this literature. The strictness of this objective cuts both ways. On the one hand, a successful learner uncovers substantial structure of the target distribution. On the other hand, this objective is clearly impossible when the means and covariances are extremely close. Thus, algorithms for parameter estimation of mixtures necessarily require some assumptions on the target parameters. Also, the basic definition of parameter estimation does not immediately extend to an agnostic setting, although there is literature on agnostic parameter estimation, e.g., Lai et al. (2016).
The density estimation approach taken in this paper does not aim to recover the parameters; see Devroye and Lugosi (2001) for general background. It was first studied in the computational learning theory community under the name PAC learning of distributions by Kearns et al. (1994), whose focus was on the computational complexity of the learning algorithm.
For density estimation there are various possible measures of distance between distributions, the most popular ones being the TV distance and the Kullback-Leibler (KL) divergence. Here we focus on the TV distance since it has several appealing properties, such as being a metric and having a natural probabilistic interpretation. In contrast, KL divergence is not even symmetric and can be unbounded even for intuitively close distributions. For a detailed discussion on why TV is a natural choice, see Devroye and Lugosi (2001, Chapter 5).
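The contrast between the two distances can be seen numerically. In the sketch below (an illustration of ours, using the standard closed form for the KL divergence between one-dimensional Gaussians), shrinking the widths of two fixed Gaussians keeps their TV distance bounded by 1 while their KL divergence grows without bound:

```python
import numpy as np

def tv_and_kl_1d(mu1, s1, mu2, s2):
    """TV distance (numerical integration on a grid) and KL divergence
    (closed form) between N(mu1, s1^2) and N(mu2, s2^2)."""
    lo = min(mu1 - 8 * s1, mu2 - 8 * s2)
    hi = max(mu1 + 8 * s1, mu2 + 8 * s2)
    x = np.linspace(lo, hi, 200001)
    dx = x[1] - x[0]
    p = np.exp(-(x - mu1) ** 2 / (2 * s1 ** 2)) / (s1 * np.sqrt(2 * np.pi))
    q = np.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)) / (s2 * np.sqrt(2 * np.pi))
    tv = 0.5 * np.sum(np.abs(p - q)) * dx
    kl = np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5
    return tv, kl

# Same means, shrinking widths: TV stays below 1 while KL blows up.
tv_wide, kl_wide = tv_and_kl_1d(0.0, 1.0, 1.0, 1.0)
tv_narrow, kl_narrow = tv_and_kl_1d(0.0, 0.01, 1.0, 0.01)
```

The two Gaussians in the second call are intuitively "maximally far" but their TV distance is still at most 1, whereas the KL divergence is in the thousands.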
A popular method for distribution learning in practice is kernel density estimation (see, e.g., Devroye and Lugosi (2001, Chapter 9)). The few rigorously proven sample complexity bounds for this method require certain smoothness assumptions on the class of densities (e.g., Devroye and Lugosi (2001, Theorem 9.5)). The class of Gaussians is not universally Lipschitz and does not satisfy these assumptions, so those results do not apply to the problems we consider.
Another elementary method for density estimation is using histogram estimators. Straightforward calculations show that histogram estimators for mixtures of Gaussians would result in a sample complexity that is exponential in the dimension. The same is true for estimators based on piecewise polynomials.
The minimum distance estimate (Devroye and Lugosi, 2001, Section 6.8) is another approach for deriving sample complexity upper bounds for distribution learning. This approach is based on uniform convergence theory. In particular, an upper bound for any class of distributions can be achieved by bounding the VC-dimension of an associated set system, called the Yatracos class (see Devroye and Lugosi (2001, page 58) for the definition). For example, Diakonikolas et al. (2017b) used this approach to bound the sample complexity of learning high-dimensional log-concave distributions. However, for mixtures of Gaussians and axis-aligned Gaussians in $\mathbb{R}^d$, the best known VC-dimension bound (Anthony and Bartlett, 1999, Theorem 8.14) results in upper bounds that are loose by polynomial factors in $k$ and $d$.
Another approach is to first approximate the mixture class using a more manageable class such as piecewise polynomials, and then study the associated Yatracos class; see, e.g., Chan et al. (2014). However, piecewise polynomials do a poor job of approximating $d$-dimensional Gaussians, resulting in an exponential dependence on $d$.
For density estimation of mixtures of Gaussians, the previous best sample complexity upper bounds (in terms of $k$ and $d$) for general Gaussians and for axis-aligned Gaussians are both due to Ashtiani et al. (2017). For the general Gaussian case, their method takes an i.i.d. sample and partitions it in every possible way into $k$ subsets. Based on those partitions, exponentially many "candidate distributions" are generated, and the problem is then reduced to learning with respect to that finite class of candidates. Their sample complexity is suboptimal by polynomial factors: one factor of $1/\varepsilon^2$ arises in their approach for choosing the best candidate, and a further loss is due to the number of candidates being exponential in the size of the initial sample.
Our approach via compression schemes also ultimately reduces the problem to learning with respect to finite classes. However, our compression technique leads to a more refined bound. In the case of mixtures of Gaussians, one factor of $1/\varepsilon^2$ is again incurred due to learning with respect to finite classes. The key is that the number of compressed samples carries no additional factor of $1/\varepsilon^2$, so the overall sample complexity depends on $\varepsilon$ only through a single factor of $1/\varepsilon^2$ (up to logarithmic factors).
As for lower bounds on the sample complexity, much fewer results are known for learning mixtures of Gaussians. The only lower bound of which we are aware is due to Suresh et al. (2014), who show a bound of $\Omega(kd/\varepsilon^2)$ for learning mixtures of axis-aligned Gaussians (and hence for general Gaussians as well). This bound is tight for the axis-aligned case, as we show in Theorem 1.3, but loose in the general case, as we show in Theorem 1.2.
1.3 Our techniques
We introduce a novel method for learning distributions via a form of sample compression. Given a class of distributions, suppose there is a method for “compressing” the samples generated by any distribution in the class. Further, suppose there exists a fixed decoder for the class, such that given the compressed set of instances and a sequence of bits, it approximately recovers the original distribution. In this case, if the size of the compressed set and the number of bits is guaranteed to be small, we show that the sample complexity of learning that class is small as well.
More precisely, say a class of distributions admits $(\tau, t, m)$-compression if there exists a decoder function such that upon generating $m(\varepsilon)$ i.i.d. samples from any distribution in the class, we are guaranteed, with reasonable probability, to have a subset of size at most $\tau(\varepsilon)$ of that sample, and a sequence of at most $t(\varepsilon)$ bits, on which the decoder outputs an $\varepsilon$-approximation to the original distribution. Note that $\tau$, $t$, and $m$ can be functions of $\varepsilon$, the accuracy parameter.
This definition is generalized to a stronger notion of robust compression, where the target distribution is to be encoded using samples that are not necessarily generated from the target itself, but from a distribution that is close to the target. We prove that robust compression implies agnostic learning. In particular, if a class admits $(\tau, t, m)$ robust compression, then the sample complexity of agnostic learning with respect to this class is bounded by $\widetilde{O}\!\left(m(\varepsilon) + \frac{\tau(\varepsilon) + t(\varepsilon)}{\varepsilon^2}\right)$ (Theorem 3.5).
An attractive property of robust compression is that it enjoys two closure properties. Specifically, if a base class admits robust compression, then the class of $k$-mixtures of that base class, as well as the class of products of the base class, are robustly compressible (Lemmas 3.6 and 3.7).
Consequently, it suffices to provide a robust compression scheme for the class of single Gaussian distributions in order to obtain a compression scheme for classes of mixtures of Gaussians (and therefore, to be able to bound their sample complexity). We prove that the class of $d$-dimensional Gaussian distributions admits robust compression (Lemma 4.2). The high-level idea is that by generating $O(d)$ samples from a Gaussian, one can obtain a rough sketch of the geometry of the Gaussian. In particular, the convex hull of the points drawn from a Gaussian encloses an ellipsoid centered at the mean, whose principal axes are the eigenvectors of the covariance matrix. Using ideas from convex geometry and random matrix theory, we show one can in fact encode the center of the ellipsoid and the principal axes using a convex combination of these samples. Then we discretize the coefficients and obtain an approximate encoding.
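The geometric picture can be sanity-checked numerically: the eigenvectors of the empirical second-moment matrix of a Gaussian sample already line up with the principal axes of the true covariance. (The following Python snippet is our own illustration; the dimension and eigenvalues are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
eigvals = np.array([9.0, 4.0, 1.0])           # well-separated axis lengths
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthonormal axes
Sigma = Q @ np.diag(eigvals) @ Q.T

X = rng.multivariate_normal(np.zeros(d), Sigma, size=50000)
S_hat = (X.T @ X) / len(X)    # empirical covariance (the mean is zero)
w, V = np.linalg.eigh(S_hat)  # eigenvalues in ascending order

top_true = Q[:, 0]   # true principal axis (eigenvalue 9)
top_emp = V[:, -1]   # empirical principal axis
alignment = abs(float(top_true @ top_emp))  # 1.0 means perfect alignment
```

The top empirical eigenvector aligns (up to sign) with the true top axis, reflecting the "rough sketch of the geometry" that the compression scheme exploits.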
The above results together imply tight (up to logarithmic factors) upper bounds of $\widetilde{O}(kd^2/\varepsilon^2)$ for mixtures of Gaussians and $\widetilde{O}(kd/\varepsilon^2)$ for mixtures of axis-aligned Gaussians over $\mathbb{R}^d$. The robust compression framework we introduce is quite flexible, and can be used to prove sample complexity upper bounds for other distribution classes as well.
For proving our lower bound for mixtures of Gaussians, we first prove a lower bound of $\widetilde{\Omega}(d^2/\varepsilon^2)$ for learning a single Gaussian. Although the approach is quite intuitive, the details are intricate and much care is required to make the proof formal. The main step is to construct a large family (of size $2^{\Omega(d^2)}$) of covariance matrices such that the associated Gaussian distributions are well separated in total variation distance while their pairwise Kullback-Leibler divergences are small. Once this is established, we apply a generalized version of Fano's inequality to complete the proof.
To construct this family of covariance matrices, we sample matrices from the following probabilistic process: start with the identity covariance matrix, choose a random subspace of dimension proportional to $d$, and slightly increase the eigenvalues along that subspace. It is easy to bound the KL divergence between the constructed Gaussians. To lower bound the total variation distance, we show that for every pair of these distributions, there is some subspace for which a vector drawn from one Gaussian will have a slightly larger projection than a vector drawn from the other Gaussian. Quantifying this gap then gives the desired lower bound on the total variation distance.
1.4 Paper outline
We set up our formal framework and notations in Section 2. In Section 3, we define compression schemes for distributions, prove their closure properties, and show their connection with density estimation. Theorem 1.1 and Theorem 1.3 are proved in Section 4. Theorem 1.2 is proved in Section 5. All omitted proofs can be found in the appendix.
2 Preliminaries

A distribution learning method or density estimation method is an algorithm that takes as input a sequence of i.i.d. samples generated from a distribution $p$, and outputs (a description of) a distribution $\hat{p}$ as an estimate for $p$. Let $p$ and $q$ be two probability distributions defined over the Borel $\sigma$-algebra $\mathcal{B}$. The total variation (TV) distance between $p$ and $q$ is defined by
$$\mathrm{TV}(p, q) = \sup_{B \in \mathcal{B}} |p(B) - q(B)| = \frac{1}{2}\|p - q\|_1,$$
where $\|p - q\|_1$ is the $L_1$ norm of $p - q$. The Kullback-Leibler (KL) divergence between $p$ and $q$ is defined by
$$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \,\mathrm{d}x.$$
In the following definitions, $\mathcal{F}$ is a class of probability distributions, and $p$ is a distribution (not necessarily in $\mathcal{F}$).
Definition 2.1 ($\varepsilon$-approximation, $(C, \varepsilon)$-approximation).
A distribution $\hat{p}$ is an $\varepsilon$-approximation for $p$ if $\mathrm{TV}(\hat{p}, p) \le \varepsilon$. A distribution $\hat{p}$ is a $(C, \varepsilon)$-approximation for $p$ with respect to $\mathcal{F}$ if $\mathrm{TV}(\hat{p}, p) \le C \cdot \inf_{q \in \mathcal{F}} \mathrm{TV}(q, p) + \varepsilon$.
Definition 2.2 (PAC-learning distributions, realizable setting).
A distribution learning method is called a (realizable) PAC-learner for $\mathcal{F}$ with sample complexity $m(\varepsilon, \delta)$ if, for all distributions $p \in \mathcal{F}$ and all $\varepsilon, \delta \in (0, 1)$, given $\varepsilon$, $\delta$, and a sample of size $m(\varepsilon, \delta)$ generated i.i.d. from $p$, with probability at least $1 - \delta$ (over the samples) the method outputs an $\varepsilon$-approximation of $p$.
Definition 2.3 (PAC-learning distributions, agnostic setting).
For $C > 0$, a distribution learning method is called a $C$-agnostic PAC-learner for $\mathcal{F}$ with sample complexity $m(\varepsilon, \delta)$ if, for all distributions $p$ and all $\varepsilon, \delta \in (0, 1)$, given $\varepsilon$, $\delta$, and a sample of size $m(\varepsilon, \delta)$ generated i.i.d. from $p$, with probability at least $1 - \delta$ the method outputs a $(C, \varepsilon)$-approximation of $p$ w.r.t. $\mathcal{F}$.
We sometimes say a class can be "$C$-learned in the agnostic setting" to indicate the existence of a $C$-agnostic PAC-learner for the class. The case $C > 1$ is sometimes called semi-agnostic learning. Let $\Delta_k := \{w \in [0,1]^k : \sum_{i=1}^{k} w_i = 1\}$ denote the $k$-dimensional simplex.
Definition 2.4 ($k$-mix).
Let $\mathcal{F}$ be a class of probability distributions. Then the class of $k$-mixtures of $\mathcal{F}$, written $k$-mix($\mathcal{F}$), is defined as
$$k\text{-mix}(\mathcal{F}) := \left\{ \sum_{i=1}^{k} w_i f_i \;:\; w \in \Delta_k,\ f_1, \dots, f_k \in \mathcal{F} \right\}.$$
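A $k$-mixture is sampled by first drawing a component index according to the mixing weights and then drawing from that component; the following Python sketch (our own helper, not from the paper, specialized to Gaussian components) makes this concrete:

```python
import numpy as np

def sample_mixture(weights, components, n, rng):
    """Draw n i.i.d. points from a k-mixture: pick component i with
    probability w_i, then draw from component i. Components are
    (mean, covariance) pairs of Gaussians (an illustrative choice)."""
    weights = np.asarray(weights, dtype=float)
    idx = rng.choice(len(weights), size=n, p=weights)
    out = np.empty((n, len(components[0][0])))
    for i, (mu, Sigma) in enumerate(components):
        mask = idx == i
        out[mask] = rng.multivariate_normal(mu, Sigma, size=int(mask.sum()))
    return out

rng = np.random.default_rng(2)
comps = [(np.array([-5.0, 0.0]), np.eye(2)),
         (np.array([5.0, 0.0]), np.eye(2))]
X = sample_mixture([0.25, 0.75], comps, 100000, rng)
frac_right = float(np.mean(X[:, 0] > 0))  # ~ weight of the right component
```

With well-separated components, the fraction of points on each side recovers the mixing weights, which is the sense in which a mixture interleaves its components.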
Let $d$ denote the dimension. A Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ is denoted by $\mathcal{N}(\mu, \Sigma)$. If $\Sigma$ is a diagonal matrix, then $\mathcal{N}(\mu, \Sigma)$ is called an axis-aligned Gaussian. For a distribution $p$, we write $X \sim p$ to mean that $X$ is a random variable with distribution $p$, and we write $S \sim p^m$ to mean that $S$ is an i.i.d. sample of size $m$ generated from $p$.
A random variable $X$ is said to be $\sigma$-subgaussian if $\Pr(|X| > t) \le 2\exp(-t^2/(2\sigma^2))$ for any $t > 0$.
Note that if $Z \sim \mathcal{N}(0, \sigma^2)$ then $Z$ is $\sigma$-subgaussian; see, e.g., Abramowitz and Stegun (1984, formula (7.1.13)).
Let $A, B$ be symmetric, positive definite matrices of the same size $n \times n$. The log-det divergence of $A$ and $B$ is defined as $\mathrm{L}(A, B) := \mathrm{tr}(AB^{-1}) - \log\det(AB^{-1}) - n$.
We will use $\|v\|$ or $\|v\|_2$ to denote the Euclidean norm of a vector $v$, $\|M\|$ or $\|M\|_2$ to denote the operator norm of a matrix $M$, and $\|M\|_F$ to denote the Frobenius norm of a matrix $M$. For a positive integer $n$, we will write $[n] := \{1, 2, \dots, n\}$.
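For concreteness, the log-det divergence can be computed as follows (a sketch of ours; the normalization $\mathrm{tr}(AB^{-1}) - \log\det(AB^{-1}) - n$ is the standard one and is assumed to match the definition above):

```python
import numpy as np

def logdet_divergence(A, B):
    """Log-det divergence tr(A B^{-1}) - logdet(A B^{-1}) - n between
    symmetric positive definite matrices A and B of size n x n."""
    X = A @ np.linalg.inv(B)
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "A B^{-1} must have positive determinant"
    return float(np.trace(X) - logdet - A.shape[0])

# The divergence is zero iff A == B and positive otherwise.
zero = logdet_divergence(np.eye(3), np.eye(3))
pos = logdet_divergence(2.0 * np.eye(3), np.eye(3))
```

Like the KL divergence it resembles, this quantity is asymmetric in its arguments and vanishes exactly when $A = B$.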
3 Compression schemes and their connection with learning
Let $\mathcal{F}$ be a class of distributions over a domain $Z$.
Definition 3.1 (distribution decoder).
A distribution decoder for $\mathcal{F}$ is a deterministic function $\mathcal{J} : \bigcup_{n \ge 0} Z^n \times \bigcup_{n \ge 0} \{0,1\}^n \to \mathcal{F}$, which takes a finite sequence of elements of $Z$ and a finite sequence of bits, and outputs a member of $\mathcal{F}$.
Definition 3.2 (robust distribution compression schemes).
Let $\tau, t, m : (0,1) \to \mathbb{Z}_{\ge 0}$ be functions, and let $r \ge 0$. We say $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression if there exists a decoder $\mathcal{J}$ for $\mathcal{F}$ such that for any distribution $g \in \mathcal{F}$, and for any distribution $q$ on $Z$ with $\mathrm{TV}(g, q) \le r$, the following holds:
For any $\varepsilon \in (0,1)$, if a sample $S$ is drawn from $q^{m(\varepsilon)}$, then with probability at least 2/3, there exists a sequence $L$ of at most $\tau(\varepsilon)$ elements of $S$, and a sequence $B$ of at most $t(\varepsilon)$ bits, such that $\mathrm{TV}(\mathcal{J}(L, B), g) \le \varepsilon$.
Essentially, the definition asserts that with high probability, there should be a (small) subset $L$ of $S$ and some (small number of) additional bits $B$, from which $g$ can be approximately reconstructed. We say that the distribution $g$ is "encoded" with $L$ and $B$, and in general we would like to have a compression scheme of a small size. The compression scheme is called "robust" since it requires $g$ to be approximately reconstructed from a sample generated from $q$ rather than from $g$ itself.
In the definition above we required the probability of existence of $L$ and $B$ to be at least 2/3, but one can boost this probability to $1 - \delta$ by generating a sample of size $O(m(\varepsilon)\log(1/\delta))$.
Next we show that if a class of distributions can be compressed, then it can be learned; this builds the connection between robust compression and agnostic learning. We will need the following useful result about PAC-learning of finite classes of distributions, which immediately follows from Devroye and Lugosi (2001, Theorem 6.3) and a standard Chernoff bound. It states that a finite class of size $M$ can be 3-learned in the agnostic setting using $O(\log(M/\delta)/\varepsilon^2)$ samples. Denote by $[M]$ the set $\{1, 2, \dots, M\}$.
Theorem 3.4 (Devroye and Lugosi (2001)).
There exists a deterministic algorithm that, given candidate distributions $p_1, \dots, p_M$, a parameter $\varepsilon > 0$, and $O(\log(M/\delta)/\varepsilon^2)$ i.i.d. samples from an unknown distribution $q$, outputs an index $j \in [M]$ such that
$$\mathrm{TV}(p_j, q) \le 3 \min_{i \in [M]} \mathrm{TV}(p_i, q) + \varepsilon$$
with probability at least $1 - \delta$.
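One standard way to realize such a selector is a Scheffé tournament: every pair of candidates is compared on its Scheffé set, and a candidate that wins the most pairwise duels is returned. The Python sketch below (our own instantiation over one-dimensional densities on a grid; not necessarily the exact procedure of Devroye and Lugosi) illustrates the idea:

```python
import numpy as np

def scheffe_select(pdfs, samples, grid):
    """Scheffé-tournament selection among candidate densities.
    For each pair (i, j), compare each candidate's mass on the set
    A_ij = {x : p_i(x) > p_j(x)} with the empirical frequency of A_ij;
    the closer candidate wins the duel. Return the index with the
    most wins."""
    dx = grid[1] - grid[0]
    vals = [f(grid) for f in pdfs]           # densities on the grid
    at_samples = [f(samples) for f in pdfs]  # densities at the data points
    wins = np.zeros(len(pdfs), dtype=int)
    for i in range(len(pdfs)):
        for j in range(i + 1, len(pdfs)):
            A = vals[i] > vals[j]                         # Scheffé set A_ij
            emp = np.mean(at_samples[i] > at_samples[j])  # empirical mass
            mi = np.sum(vals[i][A]) * dx                  # candidate masses
            mj = np.sum(vals[j][A]) * dx
            if abs(mi - emp) <= abs(mj - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return int(np.argmax(wins))

def gauss_pdf(mu, s):
    return lambda x: np.exp(-(np.asarray(x) - mu) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=5000)   # truth: N(0, 1)
cands = [gauss_pdf(0.0, 1.0), gauss_pdf(2.0, 1.0), gauss_pdf(0.0, 3.0)]
grid = np.linspace(-12, 12, 100001)
best = scheffe_select(cands, samples, grid)
```

Here the first candidate equals the data-generating distribution, so it wins all of its duels by a wide margin.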
The proof of the following theorem appears in Appendix C.1.
Theorem 3.5 (compressibility implies learnability).
Suppose $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression for some $r > 0$. Then $\mathcal{F}$ can be learned in the agnostic setting (with a constant approximation factor) using
$$\widetilde{O}\!\left( m(\varepsilon) + \frac{\tau(\varepsilon) + t(\varepsilon)}{\varepsilon^2} \right)$$
samples. If $\mathcal{F}$ admits 0-robust compression, then $\mathcal{F}$ can be learned in the realizable setting using the same number of samples.
We next prove two closure properties of compression schemes. First, Lemma 3.6 below implies that if a class $\mathcal{F}$ of distributions can be compressed, then the class of distributions that are formed by taking products of members of $\mathcal{F}$ can also be compressed. If $p_1, \dots, p_d$ are distributions over domains $Z_1, \dots, Z_d$, then $p_1 \times \dots \times p_d$ denotes the standard product distribution over $Z_1 \times \dots \times Z_d$. For a class $\mathcal{F}$ of distributions, define $\mathcal{F}^d := \{p_1 \times \dots \times p_d : p_1, \dots, p_d \in \mathcal{F}\}$. The following lemma is proved in Appendix C.2.
Lemma 3.6 (compressing product distributions).
If $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression, then $\mathcal{F}^d$ admits $r$-robust compression whose parameters are larger by roughly a factor of $d$ (each of the $d$ coordinates is compressed to accuracy $\varepsilon/d$).
Our next lemma implies that if a class of distributions can be compressed, then the class of distributions that are formed by taking mixtures of members of can also be compressed. The proof appears in Appendix C.3.
Lemma 3.7 (compressing mixtures).
If $\mathcal{F}$ admits $(\tau, t, m)$ $r$-robust compression, then $k$-mix($\mathcal{F}$) admits $r$-robust compression whose parameters are larger by roughly a factor of $k$ (plus the bits needed to encode the mixing weights).
4 Upper bound: learning mixtures of Gaussians by compression schemes
4.1 Warm-up: learning mixtures of axis-aligned Gaussians by compression schemes
In this section, we give a simple application of our compression framework to prove an upper bound of $\widetilde{O}(kd/\varepsilon^2)$ for the sample complexity of learning mixtures of axis-aligned Gaussians in the realizable setting. In the following section, we generalize these arguments to handle general Gaussians in the agnostic setting.
Lemma 4.1 (compressing one-dimensional Gaussians).
The class of single-dimensional Gaussians admits a 0-robust compression scheme.
Proof. Let $\mathcal{N}(\mu, \sigma^2)$ be the target distribution, and let the sample size be large enough that each of the events below holds with probability at least 5/6. We first show how to encode $\sigma$. With probability at least 5/6, among the samples there are two points $X_i, X_j$ whose distance $|X_i - X_j|$ is within a constant factor of $\sigma$. Conditioned on this event, there is an integer $\ell$ with $|\ell| = O(\log(1/\varepsilon))$ such that $(1+\varepsilon)^{\ell} |X_i - X_j|$ approximates $\sigma$ to within a multiplicative factor of $1 \pm \varepsilon$. We encode the standard deviation by the two points $X_i, X_j$ together with $\ell$, and the decoder estimates $\hat{\sigma} = (1+\varepsilon)^{\ell} |X_i - X_j|$. Note that $|\hat{\sigma} - \sigma| \le \varepsilon\sigma$ and that the encoding requires two sample points and $O(\log(1/\varepsilon))$ bits (for encoding $\ell$).
Now we turn to encoding $\mu$. With probability at least 5/6, some sample point $X_r$ satisfies $|X_r - \mu| \le \sigma$. We condition on this event, which implies the existence of an integer $s$ with $|s| = O(1/\varepsilon)$ such that $|X_r + s\varepsilon\hat{\sigma} - \mu| = O(\varepsilon\sigma)$. We encode the mean by the point $X_r$ together with $s$, and the decoder estimates $\hat{\mu} = X_r + s\varepsilon\hat{\sigma}$. Again, note that $|\hat{\mu} - \mu| = O(\varepsilon\sigma)$. Moreover, encoding the mean requires one sample point and $O(\log(1/\varepsilon))$ bits.
To summarize, the decoder has $\hat{\mu}$ and $\hat{\sigma}$ with $|\hat{\mu} - \mu| = O(\varepsilon\sigma)$ and $|\hat{\sigma} - \sigma| \le \varepsilon\sigma$. Plugging these bounds into Lemma A.4 gives $\mathrm{TV}(\mathcal{N}(\hat{\mu}, \hat{\sigma}^2), \mathcal{N}(\mu, \sigma^2)) = O(\varepsilon)$, as required. ∎
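The scheme above can be rendered as toy code. In the Python sketch below (our own illustration: the retained points, constants, and discretization steps are chosen for simplicity and are not the ones in the proof), the encoder keeps three sample points and two small integers, and the decoder reconstructs $(\hat{\mu}, \hat{\sigma})$ from them:

```python
import numpy as np

def encode_1d_gaussian(samples, eps):
    """Toy encoder: three retained sample points plus two small
    integers. Percentile choices, constants, and discretization are
    illustrative stand-ins for those used in the proof."""
    x = np.sort(np.asarray(samples, dtype=float))
    a, b = x[int(0.16 * len(x))], x[int(0.84 * len(x))]  # gap ~ 2*sigma
    sigma_hat = float(np.std(samples))
    # integer l with (1+eps)^l * (b - a) ~ sigma_hat
    l = int(round(np.log(sigma_hat / (b - a)) / np.log1p(eps)))
    m = x[len(x) // 2]                                   # point near the mean
    mu_hat = float(np.mean(samples))
    k = int(round((mu_hat - m) / (eps * sigma_hat)))     # discretized shift
    return (a, b, m), (l, k)

def decode_1d_gaussian(points, bits, eps):
    (a, b, m), (l, k) = points, bits
    sigma = (b - a) * (1.0 + eps) ** l
    mu = m + k * eps * sigma
    return mu, sigma

rng = np.random.default_rng(4)
samples = rng.normal(3.0, 2.0, size=20000)  # target N(3, 2^2)
eps = 0.01
pts, bits = encode_1d_gaussian(samples, eps)
mu, sigma = decode_1d_gaussian(pts, bits, eps)
```

The decoder sees only three sample points and two integers of $O(\log(1/\varepsilon))$ bits each, yet it recovers the parameters to within roughly $\varepsilon\sigma$, which is the essence of the compression argument.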
To complete the proof of Theorem 1.3 in the realizable setting, we note that Lemma 4.1 combined with Lemma 3.6 implies that the class of axis-aligned Gaussians in $\mathbb{R}^d$ admits a 0-robust compression scheme. Then, by Lemma 3.7, the class of $k$-mixtures of axis-aligned Gaussians admits a 0-robust compression scheme. Applying Theorem 3.5 implies that the class of $k$-mixtures of axis-aligned Gaussians in $\mathbb{R}^d$ can be learned using $\widetilde{O}(kd/\varepsilon^2)$ samples in the realizable setting.
4.2 Agnostic learning mixtures of Gaussians by compression schemes
In this section we prove an upper bound of $\widetilde{O}(kd^2/\varepsilon^2)$ for the sample complexity of learning mixtures of $k$ Gaussians in $d$ dimensions, and an upper bound of $\widetilde{O}(kd/\varepsilon^2)$ for the sample complexity of learning mixtures of $k$ axis-aligned Gaussians, both in the agnostic sense. The heart of the proof is to show that Gaussians admit robust compression schemes in any dimension.
Lemma 4.2 (compressing $d$-dimensional Gaussians).
For any positive integer $d$, the class of $d$-dimensional Gaussians admits a 2/3-robust compression scheme.
This lemma can be boosted to give compression schemes with a larger robustness parameter at the expense of worse constants hidden in the big-O notation, but this would not yield any improvement in the final results.
In the special case $d = 1$, there also exists a constant-size robust compression scheme using completely different ideas. The proof appears in Appendix D.4. Remarkably, the number of sample points this scheme retains is independent of $\varepsilon$ (unlike Lemma 4.2). This scheme could be used instead of Lemma 4.2 in the proof of Theorem 1.3, although it would not improve the sample complexity bound asymptotically.
Proof of Theorem 1.3. Let $\mathcal{G}$ denote the class of one-dimensional Gaussian distributions. By Lemma 4.2 (applied with $d = 1$), $\mathcal{G}$ admits a robust compression scheme. By Lemma 3.6, the class $\mathcal{G}^d$ of axis-aligned Gaussians admits a robust compression scheme, and by Lemma 3.7, so does the class $k$-mix($\mathcal{G}^d$). Applying Theorem 3.5 implies that the class of $k$-mixtures of axis-aligned Gaussians in $\mathbb{R}^d$ can be 3-agnostically learned using $\widetilde{O}(kd/\varepsilon^2)$ samples. ∎
4.3 Proof of Lemma 4.2
Let $q$ denote the target distribution, which is promised to be close in total variation to some Gaussian $g = \mathcal{N}(\mu, \Sigma)$ that we are to encode. Note that this implies $q(A) \ge g(A) - \mathrm{TV}(q, g)$ for every event $A$.
The case of rank-deficient $\Sigma$ can easily be reduced to the case of full-rank $\Sigma$. If the rank of $\Sigma$ is $r < d$, then any sample from $g$ lies in some affine subspace $F$ of dimension $r$. Thus, any $X \sim q$ lies in $F$ with probability at least 2/3. With high probability, after seeing sufficiently many samples from $q$, at least $r + 1$ affinely independent points in $F$ will appear in the sample, and these determine $F$. We encode $F$ using these samples, and for the rest of the process we work within this affine subspace and discard the points lying outside it. Hence, we may assume $\Sigma$ has full rank.
We first prove a lemma that is similar to known results in random matrix theory (see Litvak et al., 2005, Corollary 4.1), but is tailored to our purposes. Its proof appears in Appendix D.1.
Let be i.i.d. samples from a distribution where Let
Then for a large enough constant , if then
Suppose , where the vectors are orthogonal. Let . Note that both and are positive definite, and that . Moreover, it is easy to see that and .
The following lemma is proved in Appendix D.2.
Let be a sufficiently large constant. Given samples from , where , with probability at least , one can encode vectors satisfying
using bits and the points in .
Suppose , where are orthogonal and is full rank, and that
5 The lower bound for Gaussians and their mixtures
In this section, we establish a lower bound of $\widetilde{\Omega}(d^2/\varepsilon^2)$ for learning a single Gaussian, and then lift it to obtain a lower bound of $\widetilde{\Omega}(kd^2/\varepsilon^2)$ for learning mixtures of $k$ Gaussians in $d$ dimensions. Both of our lower bounds are for the realizable setting (so they also hold in the agnostic setting).
Lemma 5.1 (generalized Fano's inequality).
Let $\mathcal{F}$ be a class of distributions such that for all small enough $\varepsilon$ there exist densities $f_1, \dots, f_M \in \mathcal{F}$ with $\mathrm{TV}(f_i, f_j) \ge \varepsilon$ for all $i \ne j$ and $\mathrm{KL}(f_i \| f_j) \le \kappa$ for all $i, j$.
Then any algorithm that learns $\mathcal{F}$ to within total variation distance $\varepsilon/2$ with success probability at least 2/3 has sample complexity $\Omega(\log(M)/\kappa)$.
Any algorithm that learns a general Gaussian in $\mathbb{R}^d$ in the realizable setting to within total variation distance $\varepsilon$ and with success probability at least 2/3 has sample complexity $\Omega(d^2/\varepsilon^2)$.
Guided by Lemma 5.1, we will build $M$ Gaussian distributions of the form $\mathcal{N}(0, \Sigma_i)$, where each $\Sigma_i$ is a mild perturbation of the identity determined by a matrix $U_i$ with orthonormal columns. To apply Lemma 5.1, we need to give an upper bound on the KL divergence between any two $\mathcal{N}(0, \Sigma_i)$ and $\mathcal{N}(0, \Sigma_j)$, and a lower bound on their total variation distance. Upper bounding the KL divergence is easy by Lemma A.1, since each $\Sigma_i$ is close to the identity.
Our next goal is to give a lower bound on the total variation distance between $\mathcal{N}(0, \Sigma_i)$ and $\mathcal{N}(0, \Sigma_j)$. For this, we would like the matrices $U_i$ to be "spread out," in the sense that the columns of distinct $U_i$ should be nearly orthogonal. This is formalized in Lemma 5.3 below, where we show that choosing the $U_i$ randomly achieves this for every pair. Then, if $W$ is the subspace spanned by the columns of $U_i$, we expect that a vector drawn from $\mathcal{N}(0, \Sigma_i)$ should have a slightly larger projection onto $W$ than a vector drawn from $\mathcal{N}(0, \Sigma_j)$. This allows us to lower bound the total variation distance between $\mathcal{N}(0, \Sigma_i)$ and $\mathcal{N}(0, \Sigma_j)$. More precisely, in Lemma 5.4 we show that the near-orthogonality guaranteed by Lemma 5.3 implies the required total variation lower bound, completing the proof. ∎
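The two halves of the argument can be checked numerically for a small instance. In the sketch below (our own illustration: the subspace dimension and the inflation factor `alpha` are arbitrary stand-ins for the quantities used in the proof), two covariances built by inflating the identity along random subspaces have small KL divergence, computed via the standard closed form:

```python
import numpy as np

def kl_zero_mean_gaussians(S1, S2):
    """KL( N(0, S1) || N(0, S2) ) via the closed-form expression
    for the KL divergence between Gaussians (cf. Lemma A.1)."""
    d = S1.shape[0]
    S2inv = np.linalg.inv(S2)
    _, ld1 = np.linalg.slogdet(S1)
    _, ld2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2inv @ S1) - d + ld2 - ld1)

rng = np.random.default_rng(5)
d, k, alpha = 20, 10, 0.1   # subspace dimension and inflation: stand-ins
covs = []
for _ in range(2):
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # random orthonormal columns
    covs.append(np.eye(d) + alpha * (U @ U.T))    # inflate along the subspace

kl12 = float(kl_zero_mean_gaussians(covs[0], covs[1]))
kl21 = float(kl_zero_mean_gaussians(covs[1], covs[0]))
```

Both KL divergences come out small even though the two covariances inflate different random subspaces, which is exactly the "small KL, separated TV" regime the Fano argument needs.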
Lemma 5.3.
Suppose . Then there exist orthonormal matrices such that for any we have .
Lemma 5.4.
Suppose that , and . If , then .
Finally, in Appendix E.4 we prove our lower bound for mixtures.
Any algorithm that learns a mixture of $k$ general Gaussians in $\mathbb{R}^d$ in the realizable setting to within total variation distance $\varepsilon$ and with success probability at least 2/3 has sample complexity $\Omega(kd^2/\varepsilon^2)$.
6 Further discussion
A central open problem in distribution learning and density estimation is characterizing the sample complexity of learning a distribution class. An insight from supervised learning theory is that the sample complexity of learning a class (of concepts, functions, or distributions) may be proportional to some kind of intrinsic dimension of the class divided by $\varepsilon^2$, where $\varepsilon$ is the error tolerance. For the case of agnostic binary classification, the intrinsic dimension is captured by the VC-dimension of the concept class (see Vapnik and Chervonenkis (1971); Blumer et al. (1989)). For the case of distribution learning with respect to "natural" parametric classes, we expect this dimension to equal the number of parameters. In this paper, we showed that this is indeed the case for the classes of Gaussians, axis-aligned Gaussians, and their mixtures in any dimension.
In binary classification, the combinatorial notion of Littlestone-Warmuth compression has been shown to be sufficient (Littlestone and Warmuth, 1986) and necessary (Moran and Yehudayoff, 2016) for learning. In this work, we showed that the new but related notion of robust distribution compression is sufficient for distribution learning. Whether the existence of compression schemes is necessary for learning an arbitrary class of distributions remains an intriguing open problem.
We would like to mention that while it may at first seem that the VC-dimension of the Yatracos class associated with a class of distributions characterizes its sample complexity, it is not hard to come up with examples where this VC-dimension is infinite while the class can be learned with finitely many samples. Covering numbers do not work either; for instance, the class of Gaussians does not have a bounded covering number in the TV metric, yet it is learnable with finitely many samples.
A concept related to compression is that of core-sets. In a sense, core-sets can be viewed as a special case of compression, where the decoder is required to be the empirical error minimizer. See the work of Lucic et al. (2017) for the use of core-sets in maximum likelihood estimation.
Appendix A Standard results
Lemma A.1 (Rasmussen and Williams (2006, Equation A.23)).
For two full-rank Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, their KL divergence is
$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_2, \Sigma_2)\big) = \frac{1}{2}\left( \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \ln\frac{\det \Sigma_2}{\det \Sigma_1} \right).$$
Lemma A.2 (Pinsker’s Inequality (Tsybakov, 2009, Lemma 2.5)).
For any two distributions $p$ and $q$, we have $\mathrm{TV}(p, q) \le \sqrt{\mathrm{KL}(p \,\|\, q)/2}$.
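Pinsker's inequality can be stress-tested numerically on random discrete distributions (our own illustration):

```python
import numpy as np

def tv_discrete(p, q):
    return 0.5 * float(np.abs(p - q).sum())

def kl_discrete(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(6)
# Minimum over trials of sqrt(KL/2) - TV; Pinsker says this is >= 0.
worst_slack = np.inf
for _ in range(1000):
    p = rng.random(16)
    p /= p.sum()
    q = rng.random(16)
    q /= q.sum()
    worst_slack = min(worst_slack,
                      np.sqrt(kl_discrete(p, q) / 2) - tv_discrete(p, q))
```

Across all trials the slack stays nonnegative, as the inequality guarantees.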
Lemma A.3.
For two full-rank Gaussians and , their total variation distance is bounded by
Lemma A.4.
For any with and and we have
By Lemma A.3,
Since and , using the inequality valid for all , we find
And the lemma follows since the $L_1$ distance is twice the TV distance. ∎
Lemma A.5.
Let $X$ and $Y$ be arbitrary random variables on the same space. For any measurable function $f$, we have $\mathrm{TV}(f(X), f(Y)) \le \mathrm{TV}(X, Y)$.
This follows from the observation that for any event $A$,
$$\Pr[f(X) \in A] - \Pr[f(Y) \in A] = \Pr[X \in f^{-1}(A)] - \Pr[Y \in f^{-1}(A)] \le \mathrm{TV}(X, Y),$$
so taking the supremum on the left-hand side gives the result. ∎
Lemma A.6 (Laurent and Massart (2000, Lemma 1)).
Let $Z$ have the chi-squared distribution with parameter $D$; that is, $Z = \sum_{i=1}^{D} Y_i^2$ where the $Y_i$ are i.i.d. standard normal. Then, for any $x > 0$,
$$\Pr\big[Z - D \ge 2\sqrt{Dx} + 2x\big] \le e^{-x} \qquad \text{and} \qquad \Pr\big[D - Z \ge 2\sqrt{Dx}\big] \le e^{-x}.$$
The first inequality above implies, in particular, that $\Pr[Z \ge 5Dx] \le e^{-x}$ for any $x \ge 1$.
Let $X_1, \dots, X_m$ be independent samples from $\mathcal{N}(0, 1)$.
Note that $\sum_{i=1}^{m} X_i^2$ has the chi-squared distribution with parameter $m$. Applying Lemma A.6 with $D = m$ shows the claim. ∎
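The Laurent-Massart tail bounds can be checked by simulation (our own illustration; the choices of $D$ and $x$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
D, x, trials = 50, 5.0, 200000
Z = rng.chisquare(D, size=trials)
# Empirical frequencies of the two tail events from Lemma A.6.
upper = float(np.mean(Z - D >= 2 * np.sqrt(D * x) + 2 * x))
lower = float(np.mean(D - Z >= 2 * np.sqrt(D * x)))
bound = float(np.exp(-x))  # the bound e^{-x} for both tails
```

Both empirical tail frequencies fall well below $e^{-x}$, as the lemma predicts; the bound is not tight, which is typical of such concentration inequalities.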
Lemma A.8 (Theorem 3.1.1 in Vershynin (2018)).
Let $X \sim \mathcal{N}(0, I_d)$. Then $\|X\|_2 - \sqrt{d}$ is $O(1)$-subgaussian. Consequently, $\|X\|_2 - \mathbb{E}\|X\|_2$ is also $O(1)$-subgaussian.
Lemma A.9 (Proposition 2.5.2 in Vershynin (2018)).
A random variable $X$ is $\sigma$-subgaussian if and only if $\mathbb{E}\exp\!\big(X^2/(c\sigma)^2\big) \le 2$ for some global constant $c$.
Lemma A.10 (Hoeffding’s Inequality, Proposition 2.6.1 in Vershynin (2018)).
Let $X_1, \dots, X_m$ be independent, mean-zero random variables, and suppose $X_i$ is $\sigma_i$-subgaussian. Then, for some global constant $c$ and any $s > 0$,
$$\Pr\left[\,\Big|\sum_{i=1}^{m} X_i\Big| \ge s\,\right] \le 2\exp\!\left(\frac{-c\,s^2}{\sum_{i=1}^{m} \sigma_i^2}\right).$$
Lemma A.11 (Bernstein’s Inequality, Theorem 2.8.1 in Vershynin (2018)).