We are given uniformly random samples from an unknown convex body in , how many samples are needed to approximately reconstruct the body? It seems intuitively clear, at least for , that if we are given sufficiently many such samples then we can reconstruct (or learn) the body with very little error. For general , it is known to require samples [GR09] (see also [KOS08] for a similar lower bound in a different but related model of learning). This is an information-theoretic lower bound and no computational considerations are involved. As mentioned in [GR09], it turns out that if the body has few facets (e.g. polynomial in ), then polynomial in samples are sufficient for approximate reconstruction. This is an information-theoretic upper bound and no efficient algorithms (i.e., with running time poly) are known. (We remark that to our knowledge the same situation holds for polytopes with poly vertices.) In this paper we study the reconstruction problem for the special case when the input bodies are restricted to be (full-dimensional) simplices. We show that in this case one can in fact learn the body efficiently. More precisely, the algorithm knows that the input body is a simplex but only up to an affine transformation, and the problem is to recover this affine transformation. This answers a question of [FJK96, Section 6].
The problem of learning a simplex is also closely related to the well-studied problem of learning intersections of half-spaces. Suppose that the intersection of half-spaces in is bounded, and we are given poly uniformly random samples from it. Then our learning simplices result directly implies that we can learn the half-spaces. This also has the advantage of being a proper learning algorithm, meaning that the output of the algorithm is a set of half-spaces, unlike many of the previous algorithms.
Perhaps the first approach to learning simplices that comes to mind is to find a minimum volume simplex containing the samples. This can be shown to be a good approximation to the original simplex. (Such minimum volume estimators have been studied in machine learning literature, see e.g.[SPST01]
for the problem of estimating the support of a probability distribution. We are not aware of any technique that applies to our situation and provides theoretical guarantees.) However, the problem of finding a minimum volume simplex is in general NP-hard[Pac02]
. This hardness is not directly applicable for our problem because our input is a random sample and not a general point set. Nevertheless, we do not have an algorithm for directly finding a minimum volume simplex; instead we use ideas similar to those used in Independent Component Analysis (ICA). ICA studies the following problem: Given a sample from an affine transformation of a random vector with independently distributed coordinates, recover the affine transformation (up to some unavoidable ambiguities).[FJK96] gave an efficient algorithm for this problem (with some restrictions on the allowed distributions, but also with some weaker requirements than full independence) along with most of the details of a rigorous analysis (a complete analysis of a special case can be found in [AGMS12]; see also [VX11] for a generalization of ICA to subspaces along with a rigorous analysis). The problem of learning parallelepipeds from uniformly random samples is a special case of this problem. [FJK96] asked if one could learn other convex bodies, and in particular simplices, efficiently from uniformly random samples. [NR09] gave a simpler and rigorous algorithm and analysis for the case of learning parallelepipeds with similarities to the popular FastICA algorithm of [Hyv99]. The algorithm in [NR09] is a first order algorithm unlike Frieze et al.’s second order algorithm.
is the random variable distributed according to the input distribution. The moment function can be estimated from the samples. The independent components of the distribution correspond to local maxima or minima of the moment function, and can be approximately found by finding the local maxima/minima of the moment function estimated from the sample.
More information on ICA including historical remarks can be found in [HKO01, CJ10]. Ideas similar to ICA have been used in statistics in the context of projection pursuit since the mid-seventies. It is not clear how to apply ICA to the simplex learning problem directly as there is no clear independence among the components. Let us note that [FJK96] allow certain kinds of dependencies among the components, however this does not appear to be useful for learning simplices.
Learning intersections of half-spaces is a well-studied problem in learning theory. The problem of PAC-learning intersections of even two half-spaces is open, and there is evidence that it is hard at least for sufficiently large number of half-spaces: E.g., [KS09] prove that learning intersections of half-spaces in (for constant ) is hard under standard cryptographic assumptions (PAC-learning is possible, however, if one also has access to a membership oracle in addition to random samples [KP98]). Because of this, much effort has been expended on learning when the distribution of random samples is some simple distribution, see e.g. [KS07, Vem10b, Vem10a] and references therein. This line of work makes substantial progress towards the goal of learning intersections of half-spaces efficiently, however it falls short of being able to do this in time polynomial in andFJK96] and [GR09] consider the uniform distribution over the intersection. Note that this requires that the intersection be bounded. Note also that one only gets positive samples in this case unlike other work on learning intersections of half-spaces. The problem of learning convex bodies can also be thought of as learning a distribution or density estimation problem for a special class of distributions.
[GLPR12] show how to reconstruct a polytope with vertices in , given its first moments in random directions. In our setting, where we have access to only a polynomial number of random samples, it’s not clear how to compute moments of such high orders to the accuracy required for the algorithm of [GLPR12] even for simplices.
A recent and parallel work of [AGH12]
is closely related to ours. They show that tensor decomposition methods can be applied to low-order moments of various latent variable models to estimate their parameters. The latent variable models considered by them include Gaussian mixture models, hidden Markov models and latent Dirichlet allocations. The tensor methods used by them and the local optima technique we use seem closely related. One could view our work, as well as theirs, as showing that the method of moments along with existing algorithmic techniques can be applied to certain unsupervised learning problems.
For clarity of the presentation, we use the following machine model for the running time: a random access machine that allows the following exact arithmetic operations over real numbers in constant time: addition, subtraction, multiplication, division and square root.
The estimation error is measured using total variation distance, denoted (see Section 2).
There is an algorithm (Algorithm 1 below) such that given access to random samples from a simplex , with probability at least over the sample and the randomness of the algorithm, it outputs vectors that are the vertices of a simplex so that . The algorithm runs in time polynomial in , and .
As mentioned earlier, our algorithm uses ideas from ICA. Our algorithm uses the third moment instead of the fourth moment used in certain versions of ICA. The third moment is not useful for learning symmetric bodies such as the cube as it is identically . It is however useful for learning a simplex where it provides useful information, and is easier to handle than the fourth moment. One of the main contributions of our work is the understanding of the third moment of a simplex and the structure of local maxima. This is more involved than in previous work as the simplex has no obvious independence structure, and the moment polynomial one gets has no obvious structure unlike for ICA.
The probability of success of the algorithm can be “boosted” so that the dependence of the running time on is only linear in as follows: The following discussion uses the space of simplices with total variation distance as the underlying metric space. Let be the target distance. Take an algorithm that succeeds with probability and error parameter to be fixed later (such as Algorithm 1 with ). Run the algorithm times to get simplices. By a Chernoff-type argument, at least simplices are within of the input simplex with probability at least .
By sampling, we can estimate the distances between all pairs of simplices with additive error less than in time polynomial in and so that all estimates are correct with probability at least . For every output simplex, compute the number of output simplices within estimated distance . With probability at least both of the desirable events happen, and then necessarily there is at least one output simplex, call it , that has output simplices within estimated distance . Any such must be within of the input simplex. Thus, set .
While our algorithm for learning simplices uses techniques for ICA, we have to do substantial work to make those techniques work for the simplex problem. We also show a more direct connection between the problem of learning a simplex and ICA: a randomized reduction from the problem of learning a simplex to ICA. The connection is based on a known representation of the uniform measure on a simplex as a normalization of a vector having independent coordinates. Similar representations are known for the uniform measure in an -dimensional ball (denoted ) [BGMN05] and the cone measure on the boundary of an ball [SZ90, RR91, SG97] (see Section 2 for the definition of the cone measure). These representations lead to a reduction from the problem of learning an affine transformation of an ball to ICA. These reductions show connections between estimation problems with no obvious independence structure and ICA. They also make possible the use of any off-the-shelf implementation of ICA. However, the results here do not supersede our result for learning simplices because to our knowledge no rigorous analysis is available for the ICA problem when the distributions are the ones in the above reductions.
Idea of the algorithm.
The new idea for the algorithm is that after putting the samples in a suitable position (see below), the third moment of the sample can be used to recover the simplex using a simple FastICA-like algorithm. We outline our algorithm next.
As any full-dimensional simplex can be mapped to any other full-dimensional simplex by an invertible affine transformation, it is enough to determine the translation and linear transformation that would take the given simplex to some canonical simplex. As is well-known for ICA-like problems (see, e.g.,[FJK96]), this transformation can be determined up to a rotation from the mean and the covariance matrix of the uniform distribution on the given simplex. The mean and the covariance matrix can be estimated efficiently from a sample. A convenient choice of an -dimensional simplex is the convex hull of the canonical vectors in . We denote this simplex and call it the standard simplex. So, the algorithm begins by picking an arbitrary invertible affine transformation that maps
onto the hyperplane. We use a so that is an isotropic111See Section 2. simplex. In this case, the algorithm brings the sample set into isotropic position and embeds it in using . After applying these transformations we may assume (at the cost of small errors in the final result) that our sample set is obtained by sampling from an unknown rotation of the standard simplex that leaves the all-ones vector (denoted from now on) invariant (thus this rotation keeps the center of mass of the standard simplex fixed), and the problem is to recover this rotation.
To find the rotation, the algorithm will find the vertices of the rotated simplex approximately. This can be done efficiently because of the following characterization of the vertices: Project the vertices of the simplex onto the hyperplane through the origin orthogonal to and normalize the resulting vectors. Let denote this set of points. Consider the problem of maximizing the third moment of the uniform distribution in the simplex along unit vectors orthogonal to . Then is the complete set of local maxima and the complete set of global maxima (Theorem 8). A fixed point-like iteration (inspired by the analysis of FastICA [Hyv99] and of gradient descent in [NR09]) starting from a random point in the unit sphere finds a local maximum efficiently with high probability. By the analysis of the coupon collector’s problem, repetitions are highly likely to find all local maxima.
Idea of the analysis.
In the analysis, we first argue that after putting the sample in isotropic position and mapping it through , it is enough to analyze the algorithm in the case where the sample comes from a simplex that is close to a simplex that is the result of applying a rotation leaving invariant to the standard simplex. The closeness here depends on the accuracy of the sample covariance and mean as an estimate of the input simplex’s covariance matrix and mean. A sample of size guarantees ([ALPTJ10, Theorem 4.1], [SV11, Corollary 1.2]) that the covariance and mean are close enough so that the uniform distributions on and are close in total variation. We show that the subroutine that finds the vertices (Subroutine 1), succeeds with some probability when given a sample from . By definition of total variation distance, Subroutine 1 succeeds with almost as large probability when given a sample from (an argument already used in [NR09]). As an additional simplifying assumption, it is enough to analyze the algorithm (Algorithm 1) in the case where the input is isotropic, as the output distribution of the algorithm is equivariant with respect to affine invertible transformations as a function of the input distribution.
Organization of the paper.
Starting with some preliminaries in Sec. 2, we state some results on the third moment of simplices in Sec. 3. In Sec. 4 we give an algorithm that estimates individual vertices of simplices in a special position; using this algorithm as a subroutine in Sec. 5 we give the algorithm for the general case. Sec. 6 characterizes the set of local maxima of the third moment. Sec. 7 gives the probabilistic results underlying the reductions from learning simplices and balls to ICA. Sec. 8 explains those reductions.
An -simplex is the convex hull of points in that do not lie on an -dimensional affine hyperplane. It will be convenient to work with the standard -simplex living in defined as the convex hull of the canonical unit vectors ; that is
The canonical simplex living in is given by
Note that is the facet of opposite to the origin.
Let denote the -dimensional Euclidean ball.
The complete homogeneous symmetric polynomial of degree in variables , denoted , is the sum of all monomials of degree in the variables:
Also define the -th power sum as
For a vector , we define
Vector denotes the all ones vector (the dimension of the vector will be clear from the context).
A random vector is isotropic if and . A compact set in is isotropic if a uniformly distributed random vector in it is isotropic. The inradius of an isotropic -simplex is , the circumradius is .
The total variation distance between two probability measures is for measurable . For two compact sets , we define the total variation distance as the total variation distance between the corresponding uniform distributions on each set. It can be expressed as
This identity implies the following elementary estimate:
Let be two compact sets in . Let such that . Then .
We have . Triangle inequality implies the desired inequality. ∎
Consider the coupon collector’s problem with coupons where every coupon occurs with probability at least . Let . Then with probability at least all coupons are collected after trials.
The probability that a particular coupon is not collected after that many trials is at most
The union bound over all coupons implies the claim. ∎
For a point , is the standard norm. The unit ball is defined by
The Gamma distribution is denoted asand has density , with shape parameters .
is the exponential distribution, denoted. The Gamma distribution also satisfies the following additivity property: If is distributed as and is distributed as , then is distributed as .
It is easy to see that is uniform on for .
Let be iid random variables with density proportional to . Then the random vector is independent of . Moreover, is distributed according to .
From [BGMN05], we also have the following variation, a representation of the uniform distribution in :
Let be iid random variables with density proportional to . Let be a random variable distributed as , independent of . Then the random vector
is uniformly distributed in .
See e.g. [Bil95, Section 20] for the change of variable formula in probability.
3 Computing the moments of a simplex
The -th moment over is the function
From the above we can easily derive a formula for integration over :
The variant of Newton’s identities for the complete homogeneous symmetric polynomial gives the following relations which can also be verified easily by direct computation:
Divide the above integral by the volume of the standard simplex to get the moment:
4 Subroutine for finding the vertices of a rotated standard simplex
In this section we solve the following simpler problem: Suppose we have samples from a rotated copy of the standard simplex, where the rotation is such that it leaves invariant. The problem is to approximately estimate the vertices of the rotated simplex from the samples.
We will analyze our algorithm in the coordinate system in which the input simplex is the standard simplex. This is only for convenience in the analysis and the algorithm itself does not know this coordinate system.
As we noted in the introduction, our algorithm is inspired by the algorithm of [NR09] for the related problem of learning hypercubes and also by the FastICA algorithm in [Hyv99]. New ideas are needed for our algorithm for learning simplices; in particular, our update rule is different. With the right update rule in hand the analysis turns out to be quite similar to the one in [NR09].
We want to find local maxima of the sample third moment. A natural approach to do this would be to use gradient descent or Newton’s method (this was done in [FJK96]). Our algorithm, which only uses first order information, can be thought of as a fixed point algorithm leading to a particularly simple analysis and fast convergence. Before stating our algorithm we describe the update rule we use.
We will use the abbreviation . Then, from the expression for we get
Solving for we get
While the above expressions are in the coordinate system where the input simplex is the canonical simplex, the important point is that all terms in the last expression can be computed in any coordinate system that is obtained by a rotation leaving invariant. Thus, we can compute as well independently of what coordinate system we are working in. This immediately gives us the algorithm below. We denote by the sample third moment, i.e., for samples. This is a polynomial in
, and the gradient is computed in the obvious way. Moreover, the gradient of the sample moment is clearly an unbiased estimator of the gradient of the moment; a bound on the deviation is given in the analysis (Lemma6). For each evaluation of the gradient of the sample moment, we use a fresh sample.
It may seem a bit alarming that the fixed point-like iteration is squaring the coordinates of , leading to an extremely fast growth (see Equation 1 and Subroutine 1). But, as in other algorithms having quadratic convergence like certain versions of Newton’s method, the convergence is very fast and the number of iterations is small. We show below that it is , leading to a growth of that is polynomial in and . The boosting argument described in the introduction makes the final overall dependence in to be only linear in .
We state the following subroutine for instead of (thus it is learning a rotated copy of instead of ). This is for notational convenience so that we work with instead of .
Let be a constant, , and . Suppose that Subroutine 1 uses a sample of size for each evaluation of the gradient and runs for iterations. Then with probability at least Subroutine 1 outputs a vector within distance from a vertex of the input simplex. With respect of the process of picking a sample and running the algorithm, each vertex is equally likely to be the nearest.
Note that if we condition on the sample, different vertices are not equally likely over the randomness of the algorithm. That is, if we try to find all vertices running the algorithm multiple times on a fixed sample, different vertices will be found with different likelihoods.
Our analysis has the same outline as that of [NR09]. This is because the iteration that we get is the same as that of [NR09] except that cubing is replaced by squaring (see below); however some details in our proof are different. In the proof below, several of the inequalities are quite loose and are so chosen to make the computations simpler.
We first prove the lemma assuming that the gradient computations are exact and then show how to handle samples. We will carry out the analysis in the coordinate system where the given simplex is the standard simplex. This is only for the purpose of the analysis, and this coordinate system is not known to the algorithm. Clearly, . It follows that,
Now, since we choose randomly, with probability at least one of the coordinates of is greater than all the other coordinates in absolute value by a factor of at least , where . (A similar argument is made in [NR09] with different parameters. We briefly indicate the proof for our case: The probability that the event in question does not happen is less than the probability that there are two coordinates and such that their absolute values are within factor , i.e. . The probability that for given this event happens can be seen as the Gaussian area of the four sectors (corresponding to the four choices of signs of ) in the plane each with angle less than . By symmetry, the Gaussian volume of these sectors is . The probability that such a pair exists is less than .) Assuming this happens, then after iterations, the ratio between the largest coordinate (in absolute value) and the absolute value of any other coordinate is at least . Thus, one of the coordinates is very close to and others are very close to , and so is very close to a vertex of the input simplex.
Now we drop the assumption that the gradient is known exactly. For each evaluation of the gradient we use a fresh subset of samples of points. Here is chosen so that each evaluation of the gradient is within -distance from its true value with probability at least , where will be set at the end of the proof. An application of the Chernoff bound yields that we can take ; we omit the details. Thus all the evaluations of the gradient are within distance from their true values with probability at least .
We assumed that our starting vector has a coordinate greater than every other coordinate by a factor of in absolute value; let us assume without loss of generality that this is the first coordinate. Hence . When expressing in terms of the gradient, the gradient gets multiplied by (we are assuming ), keeping this in mind and letting we get for
If , where will be determined later, then we get
where we used and in the last inequality.
We choose so that
For this, or equivalently suffices.
For satisfying (3) we have . It then follows from (4) that the first coordinate continues to remain the largest in absolute value by a factor of at least after each iteration. Also, once we have , we have for all .
(4) gives that after iterations we have
Now if is such that , we will be guaranteed that . This condition is satisfied if we have , or equivalently . Now using (3) it suffices to choose so that . Thus we can take .
Hence we get . It follows that for , the -distance from the vertex is at most for ; we omit easy details.
Now we set our parameters: and and satisfies all the constraints we imposed on . Choosing , we get that the procedure succeeds with probability at least . Now setting gives the overall probability of error , and the number of samples and iterations as claimed in the lemma. ∎
5 Learning simplices
In this section we give our algorithm for learning general simplices, which uses the subroutine from the previous section. The learning algorithm uses an affine map that maps some isotropic simplex to the standard simplex. We describe now a way of constructing such a map: Let be a matrix having as columns an orthonormal basis of in . To compute one such , one can start with the -by- matrix that has ones in the diagonal and first column, everything else is zero. Let
be a QR-decomposition of. By definition we have that the first column of is parallel to and the rest of the columns span . Given this, let be the matrix formed by all columns of except the first. We have that the set is the set of vertices of a regular -simplex. Each vertex is at distance
from the origin, while an isotropic simplex has vertices at distance from the origin. So an affine transformation that maps an isotropic simplex in to the standard simplex in is .
To simplify the analysis, we pick a new sample to find every vertex, as this makes every vertex equally likely to be found when given a sample from an isotropic simplex. (The core of the analysis is done for an isotropic simplex; this is enough as the algorithm’s first step is to find an affine transformation that puts the input simplex in approximately isotropic position. The fact that this approximation is close in total variation distance implies that it is enough to analyze the algorithm for the case of exact isotropic position, the analysis carries over to the approximate case with a small loss in the probability of success. See the proof below for the details.) A practical implementation may prefer to select one such sample outside of the for loop, and find all the vertices with just that sample—an analysis of this version would involve bounding the probability that each vertex is found (given the sample, over the choice of the starting point of gradient descent) and a variation of the coupon collector’s problem with coupons that are not equally likely.
Proof of Theorem 1.
As a function of the input simplex, the distribution of the output of the algorithm is equivariant under invertible affine transformations. Namely, if we apply an affine transformation to the input simplex, the distribution of the output is equally transformed.222To see this: the equivariance of the algorithm as a map between distributions is implied by the equivariance of the algorithm on any given input sample. Now, given the input sample, if we apply an affine transformation to it, this transformation is undone except possibly for a rotation by the step . A rotation may remain because of the ambiguity in the characterization of . But the steps of the algorithm that follow the definition of are equivariant under rotation, and the ambiguous rotation will be removed at the end when is applied again in the last step. The notion of error, total variation distance, is also invariant under invertible affine transformations. Therefore, it is enough to analyze the algorithm when the input simplex is in isotropic position. In this case (see Section 2) and we can set so that with probability at least (by an easy application of Chernoff’s bound), for some to be fixed later. Similarly, using results from [ALPTJ10, Theorem 4.1], a choice of implies that the empirical second moment matrix
satisfies with probability at least . We have and this implies . Now, is an iid sample from a simplex . Simplex is close in total variation distance to some isotropic simplex333The isotropic simplex will typically be far from the (isotropic) input simplex, because of the ambiguity up to orthogonal transformations in the characterization of . . More precisely, Lemma 7 below shows that
with probability at least .
Assume for a moment that are from . The analysis of Subroutine 1 (fixed point-like iteration) given in Lemma 6 would guarantee the following: Successive invocations to Subroutine 1 find approximations to vertices of within Euclidean distance for some to be determined later and . We ask for each invocation to succeed with probability at least with . Note that each vertex is equally likely to be found. The choice of is so that, if all invocations succeed (which happens with probability at least ), then the analysis of the coupon collector’s problem, Lemma 3, implies that we fail to find a vertex with probability at most . Overall, we find all vertices with probability at least .
But in reality samples are from , which is only close to . The estimate from (4) with appropriate gives
which implies that the total variation distance between the joint distribution of allsamples used in the loop and the joint distribution of actual samples from the isotropic simplex is at most , and this implies that the loop finds approximations to all vertices of when given samples from with probability at least . The points in are still within Euclidean distance of corresponding vertices of .
To conclude, we turn our estimate of distances between estimated and true vertices into a total variation estimate, and map it back to the input simplex. Let . As maps an isotropic simplex to a standard simplex, we have that is an isometry, and therefore the vertices of are within distance of the corresponding vertices of . Thus, the corresponding support functions are uniformly within
of each other on the unit sphere. This and the fact that imply
Thus, by Lemma 2, and this implies that the total variation distance between the uniform distributions on and the input simplex is at most . Over all random choices, this happens with probability at least . We set . ∎
Let be an -dimensional isotropic simplex. Let be an -by- positive definite matrix such that . Let be an -dimensional vector such that . Let be an -by- matrix such that . Let be the simplex . Then there exists an isotropic simplex such that .