These days unsupervised learning is very popular due to the amount of available unlabeled data. The general goal in unsupervised learning is to find structure in the data. This ‘structure’ can be the clusters in the data, the principal components of the data, or the intrinsic dimension of the data, and so on. Distribution learning (also known as density estimation) is the task of explicitly estimating the distribution underlying the data, which can then be explored to find the desired structure, or to generate new data. We mention two examples.
The first example, taken from , is novelty detection in medical imaging: a density is learned from the data. When a new input is presented to the system, a high value of the density at that input indicates a normal image, while a low value indicates a novel input, which might be characteristic of an abnormality; the patient is then referred to a clinician for further examination. The second example, taken from , is synthesis and sampling, or generative models: in many cases we would like to generate new examples that are similar to those in the training data, e.g., in media applications, where it can be expensive or boring for an artist to generate large volumes of content by hand. Given the training data, the algorithm estimates a probability density function that models the data, and then generates new examples according to this distribution. For example, video games can automatically generate (random but reasonable) textures for large objects or landscapes, rather than requiring an artist to manually colour each pixel.
For supervised learning (in particular, classification problems), there are by now a variety of mathematical tools to understand the hardness of the problem (VC-dimension, Rademacher complexity, covering numbers, margins, etc., see [1, 20]). We lack such a satisfactory mathematical understanding in the case of unsupervised learning (in particular, distribution learning); determining the sample complexity of learning with respect to a general class of distributions is an open problem (see [11, Open Problem 15.1]).
More specifically, distribution learning refers to the following task: given data generated from an unknown target probability distribution $f$, find a distribution $\hat{f}$ that is ‘close’ to $f$. To define this problem more precisely, one needs to specify:
What is assumed about the target distribution? This question is more pertinent than ever in this era of large high-dimensional data sets. Typically one assumes the target belongs to some class of distributions, or it is close to some distribution in that class.
How is data sampled from the distribution? One usually assumes access to i.i.d. data, but in some settings other models such as Markov Chain-based sampling may be more appropriate.
Once the above questions are answered, we have a well defined problem, for which we can propose algorithms. Such an algorithm is evaluated using two metrics: (i) the sample complexity, i.e., the number of samples needed to guarantee a small error, and (ii) the computational complexity, or the running time of the algorithm.
In this survey, we assume the target class consists of mixtures of Gaussians in high dimensions, or is a subclass of this class. We focus on the $L^1$ distance as the measure of closeness, and we assume i.i.d. sampling. Our goal is to give bounds for the sample complexity for distribution learning, or density estimation. We shall use these two phrases interchangeably here; distribution learning (or PAC-learning of distributions) is usually used in the computer science/machine learning community and is a broader term, whereas density estimation is usually used in the statistics literature (see [11, Section 2] for a discussion).
The literature on density estimation is vast and we have not tried to be comprehensive. We shall just review some techniques that have been particularly successful in proving rigorous bounds for sample complexity of learning mixtures of Gaussians. The reader is referred to  for a broader, recent survey. For a general, well written introduction to density estimation, read . This survey is based on the papers [2, 3, 6, 21]; the reader is referred to the original papers for full proofs. Most of the material in Section 3 also appears in .
2. The formal framework
A distribution learning method or density estimation method is an algorithm that takes as input an i.i.d. sample generated from a distribution $f$, and outputs (a description of) a distribution $\hat{f}$ as an estimate for $f$. Furthermore, we assume that $f$ belongs to some known class $\mathcal{F}$ of distributions, but $\hat{f}$ is not required to belong to $\mathcal{F}$ (if it does, then the method is called a ‘proper’ learner).
We only consider continuous distributions in this survey, and so we identify a ‘probability distribution’ with its ‘probability density function.’ Let $Z$ be a Euclidean space, and let $f_1$ and $f_2$ be two distributions defined over the Borel $\sigma$-algebra $\mathcal{B} \subseteq 2^Z$. The total variation distance between $f_1$ and $f_2$ is defined by
$$\mathrm{TV}(f_1, f_2) := \sup_{B \in \mathcal{B}} \int_B \big(f_1(x) - f_2(x)\big)\,dx = \frac{1}{2}\,\|f_1 - f_2\|_1,$$
where $\|f\|_1 := \int_Z |f(x)|\,dx$ is the $L^1$ norm of $f$. In the following definitions, $\mathcal{F}$ is a class of probability distributions, and $f$ is an arbitrary distribution. The total variation distance and the $L^1$ distance are within a constant factor of each other, and we generally do not worry about constants in this survey, so we will use them interchangeably, except when confusion might occur.
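As a numerical illustration of the relation between the two distances, the following sketch approximates both for a pair of one-dimensional Gaussians. It assumes the standard identity that the total variation distance equals half the $L^1$ distance; the helper names, the integration range, and the step count are illustrative choices, not part of the survey.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the one-dimensional Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def l1_distance(f1, f2, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of |f1 - f2| over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(abs(f1(lo + (i + 0.5) * h) - f2(lo + (i + 0.5) * h))
                   for i in range(steps))

def tv_distance(f1, f2, lo, hi, steps=100_000):
    """Integral of (f1 - f2) over the set where f1 > f2, which attains the supremum."""
    h = (hi - lo) / steps
    return h * sum(max(f1(lo + (i + 0.5) * h) - f2(lo + (i + 0.5) * h), 0.0)
                   for i in range(steps))

# Two unit-variance Gaussians whose means differ by 1 (an arbitrary example).
f1 = lambda x: gaussian_pdf(x, 0.0, 1.0)
f2 = lambda x: gaussian_pdf(x, 1.0, 1.0)
l1 = l1_distance(f1, f2, -12.0, 13.0)
tv = tv_distance(f1, f2, -12.0, 13.0)
```

Up to integration error, `tv` should agree with `l1 / 2`, which is why constants aside the two distances can be used interchangeably.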
We write $X \sim f$ to mean that the random variable $X$ has distribution $f$, and we write $S \sim f^m$ to mean that $S$ is an i.i.d. sample of size $m$ generated from $f$.
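Definition 1 ($\varepsilon$-approximation, $\varepsilon$-close).
A distribution $\hat{f}$ is an $\varepsilon$-approximation for $f$, or is $\varepsilon$-close to $f$, if $\|\hat{f} - f\|_1 \le \varepsilon$.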
Definition 2 (density estimation method, sample complexity).
A density estimation method for $\mathcal{F}$ has sample complexity $m_{\mathcal{F}}(\varepsilon, \delta)$ if, for any distribution $f \in \mathcal{F}$ and any $\varepsilon, \delta \in (0,1)$, given $\varepsilon$, $\delta$, and an i.i.d. sample of size $m_{\mathcal{F}}(\varepsilon, \delta)$ from $f$, with probability at least $1-\delta$ it outputs an $\varepsilon$-approximation of $f$.
In the machine learning literature, such a density estimation method for class $\mathcal{F}$ is called a ‘PAC distribution learning method for $\mathcal{F}$ in the realizable setting,’ or an ‘$\mathcal{F}$-learner,’ with sample complexity $m_{\mathcal{F}}(\varepsilon, \delta)$. We also say we can ‘learn class $\mathcal{F}$ with $m_{\mathcal{F}}(\varepsilon, \delta)$ samples’. Typically the dependence on $\delta$ is logarithmic and hence non-crucial, and sometimes we may just say the sample complexity is $m_{\mathcal{F}}(\varepsilon)$, ignoring its dependence on $\delta$ (this means we take $\delta$ to be a fixed constant). Note that the sample complexity should not depend on the specific underlying probability distribution $f$, but should hold uniformly for all $f \in \mathcal{F}$. This uniform notion of learning is sometimes called minimax density estimation in the statistics literature.
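Let $\Delta_k := \{(w_1, \dots, w_k) : w_i \ge 0,\ \sum_{i=1}^k w_i = 1\}$ denote the $k$-dimensional simplex.
Definition 3 ($k$-mix($\mathcal{F}$)).
The class of $k$-mixtures of $\mathcal{F}$, written $k$-mix($\mathcal{F}$), is defined as
$$k\text{-mix}(\mathcal{F}) := \Big\{ \sum_{i=1}^k w_i f_i \;:\; (w_1, \dots, w_k) \in \Delta_k,\ f_1, \dots, f_k \in \mathcal{F} \Big\}.$$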
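A one-dimensional Gaussian random variable with mean $\mu$ and variance $\sigma^2$ is denoted by $N(\mu, \sigma^2)$. Let $d$ be a positive integer, denoting the dimension. A $d$-dimensional Gaussian with mean $\mu \in \mathbb{R}^d$ and (positive definite) covariance matrix $\Sigma$ is a probability distribution over $\mathbb{R}^d$, denoted $\mathcal{N}(\mu, \Sigma)$, with probability density function
$$\mathcal{N}(\mu, \Sigma)(x) := \frac{\exp\big(-(x-\mu)^{\mathsf{T}} \Sigma^{-1} (x-\mu)/2\big)}{\sqrt{(2\pi)^d \det \Sigma}}.$$
A random variable with density $\mathcal{N}(\mu, \Sigma)$ is denoted by $N(\mu, \Sigma)$. Let $\mathcal{G}_d$ denote the class of $d$-dimensional Gaussian distributions, and let $\mathcal{G}_{d,k}$ denote the class of $k$-mixtures of $d$-dimensional Gaussian distributions: $\mathcal{G}_{d,k} := k$-mix($\mathcal{G}_d$).
If $\Sigma$ is a diagonal matrix, then $\mathcal{N}(\mu, \Sigma)$ is called an axis-aligned Gaussian, since in this case the eigenvectors of $\Sigma$ coincide with the standard basis. Let $\mathcal{A}_d$ denote the class of $d$-dimensional axis-aligned Gaussian distributions, and let $\mathcal{A}_{d,k} := k$-mix($\mathcal{A}_d$).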
| Distribution family | Bound on sample complexity | Reference | Section |
We demonstrate some of the techniques used in density estimation by proving the upper and lower bounds in Table 1. For proving the upper bounds, we provide a density estimation method. For the lower bounds, we show that any density estimation method for the corresponding class must use at least the given number of samples. Throughout the survey, $\tilde{O}$ and $\tilde{\Omega}$ allow for polylogarithmic factors. We say a bound is tight if it is tight up to polylogarithmic factors.
3. Sample complexity upper bounds via VC-dimension
The VC-dimension of a set system, first introduced by Vapnik and Chervonenkis, has applications in diverse areas such as graph theory, discrete geometry , and the theory of empirical processes , and is known to precisely capture the sample complexity of learning in the setting of binary classification [1, 20]. In this section we show it can also be used to give upper bounds for sample complexity of density estimation. The methods of this section have been developed in , which is the first place where VC-dimension is used for bounding the sample complexity of density estimation. The main results of this section appear in .
We start by defining the Vapnik-Chervonenkis dimension (VC-dimension for short) of a set system.
Definition 4 (VC-dimension).
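Let $\mathcal{A} \subseteq 2^Z$ be a family of subsets of a set $Z$. We say a set $S \subseteq Z$ is shattered by $\mathcal{A}$ if, for any $T \subseteq S$, there exists some $A \in \mathcal{A}$ such that $A \cap S = T$. The VC-dimension of $\mathcal{A}$, denoted $\mathrm{VC}(\mathcal{A})$, is the size of the largest shattered set.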
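The definition can be checked by brute force on small finite set systems. The sketch below is only an illustration: the function names and the toy family of intervals are assumptions made here, and the exhaustive search is exponential in the size of the ground set.

```python
from itertools import combinations

def is_shattered(points, family):
    """True if every subset of `points` equals A ∩ points for some A in `family`."""
    points = frozenset(points)
    traces = {frozenset(A) & points for A in family}
    return len(traces) == 2 ** len(points)

def vc_dimension(ground_set, family):
    """Size of the largest subset of `ground_set` shattered by `family` (brute force)."""
    best = 0
    for size in range(1, len(ground_set) + 1):
        if any(is_shattered(S, family) for S in combinations(ground_set, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best

# Toy set system: all intervals {a, ..., b} of the ground set {0, ..., 5}, plus the empty set.
ground = list(range(6))
intervals = [set(range(a, b + 1)) for a in ground for b in ground if a <= b] + [set()]
vc = vc_dimension(ground, intervals)
```

For this family no three points can be shattered, since an interval containing the two outer points must contain the middle one.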
In this section we show an upper bound of $O(d^2/\varepsilon^2)$ for learning the class $\mathcal{G}_d$ of $d$-dimensional Gaussians, and an upper bound of $O(d/\varepsilon^2)$ for learning the class $\mathcal{A}_d$ of axis-aligned ones. The plan is to first connect the sample complexity of learning an arbitrary class $\mathcal{F}$ to the VC-dimension of a related set system, called the Yatracos class of $\mathcal{F}$ (this is done in Theorem 10), and then provide upper bounds on the VC-dimension of this set system. Let $Z$ be an arbitrary set, which will be the domain of our probability distributions.
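Definition 5 ($\mathcal{A}$-distance).
Let $\mathcal{A} \subseteq 2^Z$, and let $p$ and $q$ be two probability distributions over $Z$. The $\mathcal{A}$-distance between $p$ and $q$ is defined as
$$\|p - q\|_{\mathcal{A}} := \sup_{A \in \mathcal{A}} |p(A) - q(A)|.$$
Definition 6 (empirical distribution).
Let $S = (x_1, \dots, x_m)$ be a sequence of members of $Z$. The empirical distribution corresponding to this sequence is defined by $\hat{p}_S(A) := \frac{1}{m} \sum_{i=1}^m \mathbf{1}[x_i \in A]$ for any $A \subseteq Z$.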
Lemma 7 (uniform convergence theorem).
Let be a probability distribution over . Let and let be the VC-dimension of . Then, there exist universal positive constants such that
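Definition 8 (Yatracos class).
For a class $\mathcal{F}$ of functions from $Z$ to $\mathbb{R}$, the associated Yatracos class is the family of subsets of $Z$ defined as
$$\mathcal{Y}(\mathcal{F}) := \big\{ \{x \in Z : f_1(x) > f_2(x)\} \;:\; f_1, f_2 \in \mathcal{F} \big\}.$$
Observe that if $f_1, f_2 \in \mathcal{F}$ then $\mathrm{TV}(f_1, f_2) = \|f_1 - f_2\|_{\mathcal{Y}(\mathcal{F})}$. To see this, let $A := \{x : f_1(x) > f_2(x)\} \in \mathcal{Y}(\mathcal{F})$, and observe that
$$\mathrm{TV}(f_1, f_2) = \int_A \big(f_1(x) - f_2(x)\big)\,dx = |f_1(A) - f_2(A)| \le \|f_1 - f_2\|_{\mathcal{Y}(\mathcal{F})}.$$
The other direction, namely $\|f_1 - f_2\|_{\mathcal{Y}(\mathcal{F})} \le \mathrm{TV}(f_1, f_2)$, follows from the definition of the total variation distance.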
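Definition 9 (empirical Yatracos minimizer).
Let $\mathcal{F}$ be a class of distributions over domain $Z$. The empirical Yatracos minimizer is a function $L^{\mathcal{F}} : \bigcup_{m \ge 1} Z^m \to \mathcal{F}$ defined as
$$L^{\mathcal{F}}(S) := \operatorname*{argmin}_{q \in \mathcal{F}} \|q - \hat{p}_S\|_{\mathcal{Y}(\mathcal{F})},$$
where $\hat{p}_S$ is the empirical distribution of the sample $S$.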
If the argmin is not unique, we may choose one arbitrarily.
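To make the definition concrete, here is a minimal sketch for a finite class of distributions on a finite domain. This is a simplifying assumption made only for illustration, since the survey works with continuous densities; the representation of distributions as dictionaries and all function names are hypothetical.

```python
from itertools import permutations

def yatracos_class(candidates):
    """Sets {x : f1(x) > f2(x)} over ordered pairs of distinct candidates.

    Each candidate is a dict mapping domain points to probabilities.
    """
    domain = set().union(*candidates)
    return [frozenset(x for x in domain if f1.get(x, 0.0) > f2.get(x, 0.0))
            for f1, f2 in permutations(candidates, 2)]

def mass(dist, A):
    """Probability that `dist` assigns to the set A."""
    return sum(dist.get(x, 0.0) for x in A)

def empirical_distribution(sample):
    """Assign mass 1/m to each of the m sample points."""
    m = len(sample)
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0.0) + 1.0 / m
    return counts

def empirical_yatracos_minimizer(candidates, sample):
    """Return the candidate closest to the empirical distribution in Yatracos-class distance."""
    sets = yatracos_class(candidates)
    emp = empirical_distribution(sample)
    def a_distance(q):
        return max(abs(mass(q, A) - mass(emp, A)) for A in sets)
    return min(candidates, key=a_distance)  # ties are broken by list order

# Three candidate distributions on {0, 1}; the sample has nine 0s and one 1.
candidates = [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]
chosen = empirical_yatracos_minimizer(candidates, [0] * 9 + [1])
```

Note that the minimizer only compares masses on the sets of the Yatracos class, never on arbitrary events, which is what allows the uniform convergence bound to be applied.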
Theorem 10 (density estimation via empirical Yatracos minimizer).
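Let $\mathcal{F}$ be a class of probability distributions, and let $S \sim p^m$, where $p \in \mathcal{F}$. Then, with probability at least $1-\delta$ we have
$$\mathrm{TV}\big(p, L^{\mathcal{F}}(S)\big) \le \sqrt{c\,\frac{v + \log(1/\delta)}{m}},$$
where $v$ is the VC-dimension of $\mathcal{Y}(\mathcal{F})$, and $c$ is a universal constant.
Proof. Let $\hat{q}$ denote the output of the empirical Yatracos minimizer on $S$, and let $\hat{p}_S$ denote the empirical distribution of $S$. Then
$$\mathrm{TV}(p, \hat{q}) = \|p - \hat{q}\|_{\mathcal{Y}(\mathcal{F})} \le \|p - \hat{p}_S\|_{\mathcal{Y}(\mathcal{F})} + \|\hat{p}_S - \hat{q}\|_{\mathcal{Y}(\mathcal{F})} \le 2\,\|p - \hat{p}_S\|_{\mathcal{Y}(\mathcal{F})} \le \sqrt{c\,\frac{v + \log(1/\delta)}{m}}.$$
The equality is because $p, \hat{q} \in \mathcal{F}$. The first inequality is the triangle inequality. The second inequality is because $\hat{q}$ is the empirical minimizer of the $\mathcal{Y}(\mathcal{F})$-distance. The third inequality holds by Lemma 7 with probability at least $1-\delta$. ∎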
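Corollary 11.
For any class $\mathcal{F}$, the sample complexity for learning $\mathcal{F}$ is bounded by $O\big((\mathrm{VC}(\mathcal{Y}(\mathcal{F})) + \log(1/\delta))/\varepsilon^2\big)$.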
In view of Corollary 11, to prove the sample complexity bounds for $\mathcal{G}_d$ and $\mathcal{A}_d$, it remains to show upper bounds on the VC-dimensions of the Yatracos classes $\mathcal{Y}(\mathcal{G}_d)$ and $\mathcal{Y}(\mathcal{A}_d)$. We provide the proof for general Gaussians only; the proof for axis-aligned Gaussians is very similar.
For classes and of functions from to , let
and notice that
We upper bound the VC-dimension of
via the following well known result in statistical learning theory, which first appeared in this form in [14, Theorem 7.2] (see [9, Lemma 4.2] for a historical discussion).
Theorem 12 (VC-dimension of vector spaces).
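Let $\mathcal{F}$ be an $n$-dimensional vector space of real-valued functions. Then the set system $\{\{x : f(x) > 0\} : f \in \mathcal{F}\}$ has VC-dimension at most $n$.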
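Now let $h$ be the indicator function for an arbitrary element in $\mathcal{Y}(\mathcal{G}_d)$, say $\{x : f_1(x) > f_2(x)\}$, where $f_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $f_2 = \mathcal{N}(\mu_2, \Sigma_2)$. Then $h$ is a $\{0,1\}$-valued function and, taking logarithms of the two densities, we have:
$$h(x) = \mathbf{1}\Big[ (x-\mu_2)^{\mathsf{T}} \Sigma_2^{-1} (x-\mu_2) - (x-\mu_1)^{\mathsf{T}} \Sigma_1^{-1} (x-\mu_1) + \log\frac{\det \Sigma_2}{\det \Sigma_1} > 0 \Big].$$
The inner expression is a quadratic function of $x$, and the linear dimension of all quadratic functions of $x \in \mathbb{R}^d$ is $\binom{d+2}{2} = O(d^2)$. Hence, by Theorem 12, we have $\mathrm{VC}(\mathcal{Y}(\mathcal{G}_d)) = O(d^2)$. Combined with Corollary 11, this gives a sample complexity upper bound of $O(d^2/\varepsilon^2)$ for learning $\mathcal{G}_d$, which is the main result of this section. An upper bound of $O(d/\varepsilon^2)$ for $\mathcal{A}_d$ can be proved similarly.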
The problem with extending these results to mixtures of Gaussians is that it is not easy to bound the VC-dimension of the Yatracos class of the family of mixtures of Gaussians. It is an intriguing open problem whether . One can also ask a more ambitious question: is it true that for any class of distributions, ? We believe the answer to this latter question is no, but this is yet to be disproved.
4. Sample complexity upper bounds via piecewise polynomials
In this section we give an upper bound for learning the class . This was proved in , which also gives a polynomial time algorithm for density estimation of this class. The main idea is to approximate a Gaussian with a piecewise polynomial function. For positive integers , let denote the class of density functions that are piecewise polynomials with at most pieces, where each piece is a polynomial of degree at most . First, we give a sample complexity upper bound for learning using the ideas from Section 3.
We need to bound . Note that . And since any has at most roots and is continuous, any element in is a union of at most intervals. The VC-dimension of the class of unions of at most intervals can be easily seen to be . This gives , hence by Corollary 11, the sample complexity of learning is .
For any there exists with . (This is obtained by taking the Taylor polynomial for the main body of the Gaussian, and taking the zero polynomial for the two tails, see  for the details.) Also, -mix. This implies that, for any there exists with .
Let be the target distribution. Now, consider the empirical Yatracos minimizer (see Definition 9) for the class . Given samples from , the minimizer ‘imagines’ the samples are coming from , and after taking samples, outputs an estimate such that . Then, the triangle inequality gives
There is an issue with the above argument; our proof for Theorem 10 assumed the samples are from a distribution in the known class of distributions ( in this case), whereas in this case, they are not. However, one can amend the argument (by applying two careful triangle inequalities) to show that, if the samples are coming from a distribution that is not necessarily in , then with high probability the empirical Yatracos minimizer outputs a distribution satisfying:
which will be in our case, as required (see  for the proof of (4.1)). Such a result is called agnostic learning, since it does not assume the target belongs to the known class, but only assumes it can be approximated well by some element of the class.
Unfortunately, the idea of piecewise polynomial approximation cannot be extended to higher dimensions, because to approximate a high-dimensional Gaussian, one needs a piecewise polynomial with either the degree or the number of pieces being exponential in the dimension. The ideas for extending the bounds to higher dimensions are quite different and are described next.
5. A generic upper bound for mixtures
We consider a more general problem in this section. Assume that we have a method to learn an arbitrary class $\mathcal{F}$. Does this mean that we can learn $k$-mix($\mathcal{F}$)? And if so, what is the sample complexity of this task? We give an affirmative answer to the first question, and provide a bound for the sample complexity of learning $k$-mix($\mathcal{F}$). As an application of this general result, we give an upper bound for the case of mixtures of Gaussians in high dimensions. This section is based on .
Theorem 13 (sample complexity of learning mixtures).
Assume that can be learned with sample complexity for some and some function . Then there exists a density estimation method for -mix() requiring samples.
One may wonder about tightness of this theorem. In Theorem 2 in , it is shown that if is the class of spherical Gaussians, we have , therefore, the factor of is necessary in general. However, it is not clear whether the additional factor of in the theorem is tight.
If we apply this theorem to the class , which has sample complexity as proved in Section 3, we immediately obtain an upper bound of for the sample complexity of learning , and a sample complexity upper bound of for .
We now give a sketch of the proof of Theorem 13. Suppose the target distribution is $f = \sum_{i=1}^k w_i f_i$, where each $f_i \in \mathcal{F}$. The $w_i$ are called the mixing weights, and the $f_i$ are called the components. Consider a die with $k$ faces, such that when you roll it, the $i$th face comes up with probability $w_i$. To generate a point according to $f$, one can roll this die, and if the $i$th face comes up, generate a point according to distribution $f_i$. So, any i.i.d. sample generated from $f$ can be coloured with $k$ colours, such that roughly a $w_i$ fraction of points have colour $i$, and the points with colour $i$ are i.i.d. distributed as $f_i$.
Now, if the colouring were given to the algorithm, there would be a clear way to proceed: estimate each of the $f_i$ using the $\mathcal{F}$-learner, estimate $w_i$ by the proportion of points with colour $i$, and then output the resulting mixture. The issue is that the colouring is not given to the algorithm. But, in principle, it can do an exhaustive search over all possible colourings, and ‘choose the best one.’
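The die-rolling view of a mixture can be sketched as follows. The weights, the Gaussian components, and the function name are illustrative assumptions; the point is only that each sampled point carries a hidden colour, and that the colour frequencies concentrate around the mixing weights.

```python
import random

def sample_mixture(weights, component_samplers, m, rng):
    """Draw m points from the mixture, returning (point, colour) pairs."""
    labelled = []
    for _ in range(m):
        i = rng.choices(range(len(weights)), weights=weights)[0]  # roll the die
        labelled.append((component_samplers[i](rng), i))          # sample component i
    return labelled

rng = random.Random(0)
weights = [0.2, 0.3, 0.5]
components = [lambda r: r.gauss(-5.0, 1.0),
              lambda r: r.gauss(0.0, 1.0),
              lambda r: r.gauss(5.0, 1.0)]
sample = sample_mixture(weights, components, 10_000, rng)

# Fraction of points of each colour; a learner would not observe the colours.
fractions = [sum(1 for _, c in sample if c == i) / len(sample) for i in range(3)]
```

A learning algorithm sees only the points, not the colours, which is why the proof resorts to an exhaustive search over colourings.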
More precisely, the algorithm has two main steps. In the first step we generate a finite set of ‘candidate distributions,’ such that at least one of them is -close to in distance. These candidates are of the form , where the ’s are extracted from samples and are estimates for the real components , and the ’s come from a fixed discretization of , and are estimates for the real mixing weights . In the second step, we take lots of additional samples and use the following result to choose the best one among them, giving a distribution that is -close to .
The following theorem provides an algorithm that chooses the almost-best one among a finite set of candidate distributions. It follows from [9, Theorem 6.3] and a standard Chernoff bound.
Theorem 14 (handpicking from a finite set of candidates).
Suppose we are given candidate distributions and we have access to i.i.d. samples from an unknown distribution . Then there exists an algorithm that given the ’s and , takes samples from , and with probability outputs an index such that
We now analyze the sample complexity of our proposed algorithm. First consider the simpler case that all mixing weights are equal to . To estimate within distance , it suffices to estimate each within distance . Therefore, we need total data points from , with some large constant , so that we get samples from each with probability . For each fixed way of colouring these data points with colours, we provide the points of each colour to the -learner, and get an estimate , and then we add to the set of candidate distributions (recall that we have assumed the mixture weights are ). Hence, the total number of candidate distributions is .
We now show that at least one of the candidate distributions is -close to the target. Consider the colouring that assigns points to components correctly. Then the -learner would provide us with that each is -close to the corresponding with probability . So, by the union bound, they are simultaneously close, with probability . Thus, when we apply the algorithm of Theorem 14, with probability it provides us with one of the candidate distributions that is -close to the target. The total sample complexity of the whole algorithm is thus
The general case of arbitrary mixing weights brings two challenges. First, we do not know the weights, and so we also do an exhaustive search over a fine finite grid on the simplex to make sure that at least one of the candidate distributions also gets the weights right; it turns out that this does not increase the sample complexity by more than a constant factor. The second, more important, problem is that, for components with very small weight, we may not get enough samples if we take a total of samples from the mixture. The solution is to have different precision for different components: from small-weight components we will have fewer data points, so we will estimate them with larger error; this is compensated by the fact that their weight is small, so the effect of this error on the total estimation error can be controlled. This is where, for the error-control calculations to work out, we need the technical conditions in the theorem, namely that for some and that . See  for the details.
6. Sample complexity upper bounds via compression schemes
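The method of the previous section gives sample complexity upper bounds of $\tilde{O}(kd^2/\varepsilon^4)$ for learning the class $\mathcal{G}_{d,k}$ of $k$-mixtures of $d$-dimensional Gaussians, and $\tilde{O}(kd/\varepsilon^4)$ for learning the class $\mathcal{A}_{d,k}$ of $k$-mixtures of axis-aligned ones. In this section, we show how the work of  improves these to $\tilde{O}(kd^2/\varepsilon^2)$ and $\tilde{O}(kd/\varepsilon^2)$ using a technique called ‘compression.’ As before, let $\mathcal{F}$ be a class of distributions over a domain $Z$.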
Definition 15 (distribution decoder).
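A distribution decoder for $\mathcal{F}$ is a deterministic function $\mathcal{J} : \bigcup_{n \ge 0} Z^n \times \bigcup_{n \ge 0} \{0,1\}^n \to \mathcal{F}$, which takes a finite sequence of elements of $Z$ and a finite sequence of bits, and outputs a member of $\mathcal{F}$.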
Definition 16 (distribution compression scheme).
Let $\tau, t, m : (0,1) \to \mathbb{Z}_{\ge 0}$ be functions. We say $\mathcal{F}$ admits $(\tau, t, m)$-compression if there exists a decoder $\mathcal{J}$ for $\mathcal{F}$ such that for any distribution $f \in \mathcal{F}$ the following holds:
For any $\varepsilon \in (0,1)$, if $S \sim f^{m(\varepsilon)}$, then with probability at least $2/3$, there exists a sequence $L$ of at most $\tau(\varepsilon)$ elements of $S$, and a sequence $B$ of at most $t(\varepsilon)$ bits, such that $\mathcal{J}(L, B)$ is $\varepsilon$-close to $f$.
Essentially, the definition asserts that with high probability, there should be a (small) subset of $S$ and some (small number of) additional bits, from which $f$ can be approximately reconstructed, or ‘decoded.’ We say that the distribution $f$ is ‘encoded’ with $L$ and $B$, and in general we would like to have a compression scheme of a small size, for a reason that will be clarified soon.
In the above definition we required the probability of existence of $L$ and $B$ to be at least 2/3, but one can boost this probability to $1-\delta$ by generating a sample of size $O(m(\varepsilon)\log(1/\delta))$.
We next establish a connection between compression and learning, and also show some properties of compression schemes. The proofs can be found in .
Lemma 18 (compression implies learning).
Suppose admits -compression. Let . Then can be learned using
The proof resembles that for Theorem 13: perform an exhaustive search over all possibilities for the ‘defining sequences’ to generate some candidates; one of these candidates would give the ‘correct’ ; then apply Theorem 14 to find the best one among the candidates.
Compression schemes have two nice closure properties. First, if a class $\mathcal{F}$ of distributions can be compressed, then the class of distributions that are formed by taking products of distributions in $\mathcal{F}$ can also be compressed. For a class $\mathcal{F}$ of distributions, we define $\mathcal{F}^d := \{\prod_{i=1}^d f_i : f_1, \dots, f_d \in \mathcal{F}\}$. The proof of the following lemma is not too difficult.
Lemma 19 (compressing product distributions).
If admits -compression, then admits -compression.
Second, if a class of distributions can be compressed, then the class of mixtures of distributions in can also be compressed.
Lemma 20 (compressing mixtures).
If admits -compression, then admits -compression.
The proof of this lemma also resembles that for Theorem 13: just take enough samples so that you have enough samples from each component, and also encode the weights of the mixture using the additional bits.
One can easily show that a single 1-dimensional Gaussian can be compressed.
Lemma 21 (compressing 1-dimensional Gaussians).
The class admits -compression.
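The proof is simple: given $C/\varepsilon$ i.i.d. samples from a Gaussian $N(\mu, \sigma^2)$ for a large constant $C$, with high probability there exist two generated points $x_1$ and $x_2$ with $|x_1 - (\mu - \sigma)| \le \varepsilon\sigma$ and $|x_2 - (\mu + \sigma)| \le \varepsilon\sigma$. The decoder estimates $\hat{\mu} = (x_1 + x_2)/2$ and $\hat{\sigma} = (x_2 - x_1)/2$. It is not hard to verify that $N(\hat{\mu}, \hat{\sigma}^2)$ is $O(\varepsilon)$-close to $N(\mu, \sigma^2)$.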
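The two-point encoding can be sketched as below, assuming the encoder picks the sample points nearest to one standard deviation below and above the mean, and the decoder recovers the mean as their midpoint and the standard deviation as half their gap. The function names, the sample size, and the parameter values are hypothetical; the encoder uses the true parameters, which is allowed because a compression scheme only asserts that a good encoding exists.

```python
import random

def encode(sample, mu, sigma):
    """Pick the sample points closest to mu - sigma and mu + sigma (uses the true parameters)."""
    x1 = min(sample, key=lambda x: abs(x - (mu - sigma)))
    x2 = min(sample, key=lambda x: abs(x - (mu + sigma)))
    return x1, x2

def decode(x1, x2):
    """Reconstruct (mean, standard deviation) from the two stored points alone."""
    return (x1 + x2) / 2.0, (x2 - x1) / 2.0

rng = random.Random(1)
mu, sigma = 3.0, 2.0
sample = [rng.gauss(mu, sigma) for _ in range(2000)]
mu_hat, sigma_hat = decode(*encode(sample, mu, sigma))
```

The decoder never sees the true parameters: it receives only two sample points and no extra bits, so the encoding size does not grow with the sample size.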
Using the above properties, one can show a tight upper bound on the sample complexity of learning mixtures of axis-aligned Gaussians.
Theorem 22 (learning mixtures of axis-aligned Gaussians).
The class $\mathcal{A}_{d,k}$ of $k$-mixtures of axis-aligned Gaussians in $\mathbb{R}^d$ can be learned using $\tilde{O}(kd/\varepsilon^2)$ samples.
Using more complicated arguments, one can also prove that the class of $d$-dimensional Gaussian distributions admits compression. The high level idea is that by generating samples from a Gaussian, one gets a rough sketch of the geometry of the Gaussian. In particular, the convex hull of the points drawn from a Gaussian encloses an ellipsoid centred at the mean whose principal axes are the eigenvectors of the covariance matrix. Using ideas from convex geometry and random matrix theory, one can in fact encode the centre of the ellipsoid and the principal axes using a convex combination of these samples. Then we discretize the coefficients and obtain an approximate encoding. See  for details.
Lemma 23 (compressing high-dimensional Gaussians).
For any positive integer , the class of -dimensional Gaussians admits an -compression scheme.
In the next section we will see the following bound is tight up to polylogarithmic factors.
Theorem 24 (learning mixtures of Gaussians).
The class $\mathcal{G}_{d,k}$ of $k$-mixtures of $d$-dimensional Gaussians can be learned using $\tilde{O}(kd^2/\varepsilon^2)$ samples.
7. Lower bounds via Fano’s inequality
In the previous sections we gave several techniques for upper bounding the sample complexity of density estimation. This survey would feel incomplete if we did not discuss at least one technique for proving lower bounds. Note that each of our upper bounds holds uniformly over a class: the sample complexity does not depend on the specific distribution. Similarly, the lower bounds we discuss in this section also hold for a class of distributions rather than for a specific distribution. Such a bound is called a minimax lower bound in the statistics literature, and a worst-case lower bound in the computer science literature.
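In this section, which is based on [21, 2], we give a sample complexity lower bound of $\tilde{\Omega}(kd^2/\varepsilon^2)$ for the class $\mathcal{G}_{d,k}$ of $k$-mixtures of $d$-dimensional Gaussians, and a lower bound of $\tilde{\Omega}(kd/\varepsilon^2)$ for the class $\mathcal{A}_{d,k}$ of $k$-mixtures of axis-aligned ones. That is, we show that any density estimation method that learns the class $\mathcal{G}_{d,k}$ uniformly, in the sense of Definition 2, must have a sample complexity of $\tilde{\Omega}(kd^2/\varepsilon^2)$. This shows the density estimation method described in the previous section has optimal sample complexity, up to polylogarithmic factors.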
We will need the definition of the Kullback-Leibler divergence (KL-divergence, also called the relative entropy) between two distributions.
Definition 25 (Kullback-Leibler divergence).
Let $f_1$ and $f_2$ be densities over domain $Z$. Their KL-divergence is defined as
$$\mathrm{KL}(f_1 \,\|\, f_2) := \int_Z f_1(x) \log\frac{f_1(x)}{f_2(x)}\,dx.$$
The KL-divergence is a measure of closeness between distributions. It is always non-negative, and is zero if and only if the two distributions are equal almost everywhere. However, it is not a metric, since it is not symmetric, and it can be $+\infty$.
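The asymmetry is easy to see numerically. The sketch below uses the standard closed form for the KL-divergence between two one-dimensional Gaussians and cross-checks it against a direct midpoint-rule evaluation of the defining integral; the function names, parameter values, and integration range are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the one-dimensional Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kl_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2) - 0.5)

def kl_numeric(f1, f2, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f1 * log(f1 / f2) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p, q = f1(x), f2(x)
        if p > 0.0:  # the integrand is taken to be 0 where f1 vanishes
            total += p * math.log(p / q) * h
    return total

forward = kl_gaussian_1d(0.0, 1.0, 1.0, 2.0)   # KL(N(0,1) || N(1,4))
backward = kl_gaussian_1d(1.0, 2.0, 0.0, 1.0)  # KL(N(1,4) || N(0,1))
check = kl_numeric(lambda x: gaussian_pdf(x, 0.0, 1.0),
                   lambda x: gaussian_pdf(x, 1.0, 2.0), -12.0, 12.0)
```

Here `forward` and `backward` differ substantially, illustrating why the KL-divergence is not a metric.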
The proof of the following lemma, which is called the ‘generalized Fano’s inequality,’ uses Fano’s inequality from information theory. It was first proved in [8, page 77]. We write here a slightly stronger version, which appears in [24, Lemma 3].
Lemma 26 (generalized Fano’s inequality).
Suppose we have distributions with
Consider any density estimation method that gets i.i.d. samples from some , and outputs an estimate (the method does not know ). For each , define as follows: assume the method receives samples from , and outputs . Then . Then, we have
This immediately leads to the following sample complexity lower bound for learning a class .
Suppose for all small enough there exist densities with
Then the sample complexity of learning is .
We start by describing the construction that gives a sample complexity lower bound of $\tilde{\Omega}(kd/\varepsilon^2)$ for the class $\mathcal{A}_{d,k}$ of $k$-mixtures of axis-aligned Gaussians. This lower bound was proved in .
We claim it suffices to give a lower bound of $\tilde{\Omega}(d/\varepsilon^2)$ for the class $\mathcal{A}_d$ of single axis-aligned Gaussians. To see this, consider a mixture of $k$ axis-aligned Gaussians whose components are extremely far apart, such that the total variation distance between any two components is very close to 1. To learn the mixture distribution, one needs to learn each component. But each data point can help in learning only one of the components. Since for learning any of the components one needs $\tilde{\Omega}(d/\varepsilon^2)$ samples, one will need $\tilde{\Omega}(kd/\varepsilon^2)$ samples to learn the mixture. Some nontrivial work has to be done to make this intuitive argument rigorous, but we omit that, and focus on proving the lower bound of $\tilde{\Omega}(d/\varepsilon^2)$ for $\mathcal{A}_d$.
Let . To prove a lower bound of for , we will build densities satisfying the conditions of the corollary.
By the Gilbert-Varshamov bound in coding theory, there exist elements in such that any two of them differ in at least components. (To see this, note that the size of a Hamming ball of radius is
hence one can start from an empty set , then add elements from one by one to , and delete the Hamming ball of each added element. So long as has fewer than elements, the number of deleted elements is less than , so there are still undeleted elements in , and one can still add more elements to .) Call them , and let . The densities are Gaussians with identity covariance matrix, with their means chosen carefully at vertices of a -dimensional hypercube with side length .
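The greedy argument behind the Gilbert-Varshamov bound can be run directly for small parameters. In the sketch below, the dimension `d=10` and the minimum distance `min_dist=3` are arbitrary illustrative values, not the ones used in the proof; keeping a string only if it is far from all chosen ones is equivalent to deleting the Hamming ball of radius `min_dist - 1` around each chosen string.

```python
from itertools import product

def hamming(u, v):
    """Number of coordinates in which the binary strings u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_code(d, min_dist):
    """Greedily pick binary strings of length d that pairwise differ in >= min_dist coordinates."""
    code = []
    for candidate in product((0, 1), repeat=d):
        # Keep the candidate only if it lies outside the deleted balls of all chosen strings.
        if all(hamming(candidate, c) >= min_dist for c in code):
            code.append(candidate)
    return code

code = greedy_code(d=10, min_dist=3)
# Size of a Hamming ball of radius min_dist - 1 = 2 in {0,1}^10: 1 + 10 + 45.
ball_size = 1 + 10 + 45
```

Since the greedy set is maximal, the deleted balls cover the whole cube, so the code has at least `2 ** 10 / ball_size` elements, which is the counting step used in the text.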
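The KL-divergence between two general Gaussians $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$ is given by (see, e.g., [13, Section 9])
$$\mathrm{KL}\big(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\big) = \frac{1}{2}\Big( \operatorname{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^{\mathsf{T}} \Sigma_1^{-1} (\mu_1 - \mu_0) - d + \ln\frac{\det \Sigma_1}{\det \Sigma_0} \Big),$$
thus, for $\Sigma_0 = \Sigma_1 = I$ we get
$$\mathrm{KL}\big(\mathcal{N}(\mu_0, I) \,\|\, \mathcal{N}(\mu_1, I)\big) = \frac{1}{2}\,\|\mu_0 - \mu_1\|_2^2.$$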
Fix distinct and . We now lower bound the distance between and . Let and . Any two and differ in at least coordinates. Fix such coordinates, and, without loss of generality, assume that in these coordinates are 0, and they are 1 in . If we project onto one such coordinate, we get an random variable, so the sum over these coordinates of has distribution . Similarly, if we project onto one such coordinate, we get an random variable, so the sum over these coordinates of has distribution . The total variation distance between and equals the total variation distance between and , which is (see Lemma 28 below). Hence, the total variation distance between and is also , as required.
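The last step reduces to the total variation distance between two one-dimensional Gaussians with equal variance, for which there is a standard closed form in terms of the error function. The sketch below evaluates it for the sum of `r` coordinates whose means differ by `delta` each; both parameter names and the function names are assumptions made for illustration, and the actual values used in the proof are not reproduced here.

```python
import math

def tv_equal_variance(mu1, mu2, sigma):
    """Closed-form TV distance between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return math.erf(abs(mu1 - mu2) / (2.0 * math.sqrt(2.0) * sigma))

def tv_after_summing(delta, r):
    """TV between the sums of r i.i.d. N(0, 1) and of r i.i.d. N(delta, 1) coordinates.

    The sums are distributed as N(0, r) and N(r * delta, r).
    """
    return tv_equal_variance(0.0, r * delta, math.sqrt(r))

# Example: 25 differing coordinates, each with mean gap 0.1.
tv = tv_after_summing(0.1, 25)
```

For small gaps the distance grows like `delta * sqrt(r)`, so many coordinates with a tiny individual gap still yield a noticeable total variation distance.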
Let . Then,