Suppose we are given sample access to a distribution over the hypercube, where each sample is generated in the following manner: there are several product distributions over the hypercube (the “centers” of the distribution), and each sample is drawn from one of the centers, with each center chosen with a fixed probability. This distribution is called a product mixture over the hypercube.
Given such a distribution, our goal is to recover from samples the parameters of the individual product distributions. That is, we would like to estimate the probability of drawing from each product distribution, and furthermore we would like to estimate the parameters of each product distribution itself. This problem has been studied extensively and approached with a variety of strategies (see e.g. [FM99, CR08, FOS08]).
A canonical approach to problems of this type is to empirically estimate the moments of the distribution, from which it may be possible to calculate the distribution parameters using linear-algebraic tools (see e.g. [AM05, MR06, FOS08, AGH14], and many more). For product distributions over the hypercube, this technique runs into the problem that the square moments are always 1, and so they provide no information.
The seminal work of Feldman, O’Donnell and Servedio [FOS08] introduces an approach to this problem which compensates for the missing higher-order moment information using matrix completion. Via a restricted brute-force search, Feldman et al. check all possible square moments, resulting in an algorithm that is triply-exponential in the number of distribution centers. Continuing this line of work, Jain and Oh [JO13] recently gave an alternative to the brute-force search, obtaining a polynomial-time algorithm for a restricted class of product mixtures. In this paper we extend these ideas, giving a polynomial-time algorithm for a wider class of product mixtures, and a quasi-polynomial time algorithm for an even broader class of product mixtures (including product mixtures with centers which are not linearly independent).
Our main tool is a matrix-completion-based algorithm for completing tensors of order from their multilinear moments in time , which we believe may be of independent interest. There has been ample work in the area of noisy tensor decomposition (and completion), see e.g. [JO14, BKS15, TS15, BM15]. However, these works usually assume that the tensor is obscured by random noise, while in our setting the “noise” is the absence of all non-multilinear entries. An exception to this is the work of [BKS15], where to obtain a quasi-polynomial algorithm it suffices to have the injective tensor norm of the noise be bounded via a Sum-of-Squares proof. (It may be possible that this condition is met for some symmetric tensors when only multilinear entries are known, but we do not know an SOS proof of this fact.) To our knowledge, our algorithm is the only -time algorithm that solves the problem of completing a symmetric tensor when only multilinear entries are known.
1.1 Our Results
Our main result is an algorithm for learning a large subclass of product mixtures, with up to even a linear number of centers, in polynomial (or quasi-polynomial) time. The subclass of distributions on which our algorithm succeeds is described by characteristics of the subspace spanned by the bias vectors. Specifically, the rank and incoherence of the span of the bias vectors cannot simultaneously be too large. Intuitively, the incoherence of a subspace measures how close the subspace is to a coordinate subspace. We give a formal definition of incoherence later, in Section 2.2.
More formally, we prove the following theorem:
Let be a mixture over product distributions on , with bias vectors and mixing weights . Let have dimension and incoherence . Suppose we are given as input the moments of .
If are linearly independent, then as long as , there is a algorithm that recovers the parameters of .
Otherwise, if for every and , then as long as , there is an time algorithm that recovers the parameters of .
In the case that are not linearly independent, the runtime depends on the separation between the vectors. We remark however that if we have some for , then the distribution is equivalently representable with fewer centers by taking the center with mixing weight . If there is some , then our algorithm can be modified to work in that case as well, again by considering and as one center; we detail this in Section 4.
In the main body of the paper we assume access to exact moments; in app:error we prove thm:learn-big, a version of thm:main_learn_prod which accounts for sampling error.
The foundation of our algorithm for learning product mixtures is an algorithm for completing a low-rank incoherent tensor of arbitrary order given access only to its multilinear entries:
Let be a symmetric tensor of order , so that for some vectors and scalars . Let have incoherence and dimension . Given perfect access to all multilinear entries of , if , then there is an algorithm which returns the full tensor in time .
1.2 Prior Work
We now discuss in more detail prior work on learning product mixtures over the hypercube, and contextualize our work in terms of previous results.
The pioneering papers on this question gave algorithms for a very restricted setting: the works of [FM99] and [C99, CGG01] introduced the problem and gave algorithms for learning a mixture of exactly two product distributions over the hypercube.
The first general result is the work of Feldman, O’Donnell and Servedio, who give an algorithm for learning a mixture over product distributions in dimensions in time with sample complexity . Their algorithm relies on brute-force search to enumerate all possible product mixtures that are consistent with the observed second moments of the distribution. After this, they use samples to select the hypothesis with the maximum likelihood. Their paper leaves as an open question the more efficient learning of discrete mixtures of product distributions, with a smaller exponential dependence (or even a quasipolynomial dependence) on the number of centers. (We do not expect better than quasipolynomial dependence on the number of centers, as learning the parity distribution on bits is conjectured to require at least time, and this distribution can be realized as a product mixture over centers.)
More recently, Jain and Oh [JO13] extended this approach: rather than generate a large number of hypotheses and pick one, they use a tensor power iteration method of [AGH14] to find the right decomposition of the second- and third-order moment tensors. To learn these moment tensors in the first place, they use alternating minimization to complete the (block)-diagonal of the second moments matrix, and they compute a least-squares estimation of the third-order moment tensor. Using these techniques, Jain and Oh were able to obtain a significant improvement for a restricted class of product mixtures, obtaining a polynomial time algorithm for linearly independent mixtures over at most centers. In order to ensure the convergence of their matrix (and tensor) completion subroutine, they introduce constraints on the span of the bias vectors of the distribution (see Section 2.3 for a discussion of incoherence assumptions on product mixtures). Specifically, letting be the rank of the span, letting be the incoherence of the span, and letting be the dimension of the samples, they require that . (The conditions are actually more complicated, depending on the condition number of the second-moment matrix of the distribution. For precise conditions, see [JO13].) Furthermore, in order to extract the bias vectors from the moment information, they require that the bias vectors be linearly independent. When these conditions are met by the product mixture, Jain and Oh learn the mixture in polynomial time.
In this paper, we improve upon this result, and can handle as many as centers in some parameter settings. Similarly to [JO13], we use as a subroutine an algorithm for completing low-rank matrices with adversarially missing entries. However, unlike [JO13], we use an algorithm with more general guarantees, the algorithm of [HKZ11]. (A previous version of this paper included an analysis of a matrix completion algorithm almost identical to that of [HKZ11], and claimed to be the first adversarial matrix completion result of this generality. Thanks to the comments of an anonymous reviewer, we were notified of our mistake.) These stronger guarantees allow us to devise an algorithm for completing low-rank higher-order tensors from their multilinear entries, and this algorithm allows us to obtain a polynomial time algorithm for a more general class of linearly independent mixtures of product distributions than [JO13].
Furthermore, because of the more general nature of this matrix completion algorithm, we can give a new algorithm for completing low-rank tensors of arbitrary order given access only to the multilinear entries of the tensor. Leveraging our multilinear tensor completion algorithm, we can reduce the case of linearly dependent bias vectors to the linearly independent case by going to higher-dimensional tensors. This allows us to give a quasipolynomial algorithm for the general case, in which the centers may be linearly dependent. To our knowledge, thm:main_learn_prod is the first quasi-polynomial algorithm that learns product mixtures whose centers are not linearly independent.
Restrictions on Input Distributions
We detail our restrictions on the input distribution. In the linearly independent case, if there are bias vectors and is the incoherence of their span, and is the dimension of the samples, then we learn a product mixture in time so long as . Compare this to the restriction that , which is the restriction of Jain and Oh: we are able to handle even a linear number of centers so long as the incoherence is not too large, while Jain and Oh can handle at most centers. If the bias vectors are not independent, but their span has rank and if they have maximum pairwise inner product (when scaled to unit vectors), then we learn the product mixture in time so long as (we also require a quasipolynomial number of samples in this case).
While the quasipolynomial runtime for linearly dependent vectors may not seem particularly glamorous, we stress that the runtime depends on the separation between the vectors. To illustrate the additional power of our result, we note that a choice of random in an -dimensional subspace meets this condition extremely well, as we have with high probability. For, say, , the algorithm of [JO13] would fail in this case, since are not linearly independent, but our algorithm succeeds in time .
This quasipolynomial time algorithm resolves an open problem of [FOS08], when restricted to distributions whose bias vectors satisfy our condition on their rank and incoherence. We do not solve the problem in full generality; for example, our algorithm fails to work when the distribution can have multiple decompositions into few centers. In such situations, the centers do not span an incoherent subspace, and thus the completion algorithms we apply fail to work. In general, the completion algorithms fail whenever the moment tensors admit many different low-rank decompositions (which can happen even when the decomposition into centers is unique, for example parity on three bits). In this case, the best algorithm we know of is the restricted brute force of Feldman, O’Donnell and Servedio.
One note about sample complexity: in the linearly dependent case, we require a quasipolynomial number of samples to learn our product mixture. That is, if there are product centers, we require samples, where the tilde hides a dependence on the separation between the centers. In contrast, Feldman, O’Donnell, and Servedio require samples. This dependence on in the sample complexity is not explicitly given in their paper, as for their algorithm to be practical they consider only constant .
Parameter Recovery Using Tensor Decomposition
The strategy of employing the spectral decomposition of a tensor in order to learn the parameters of a distribution is not new, and has indeed been employed successfully in a number of settings. In addition to the papers already mentioned which use this approach for learning product mixtures ([JO14] and in some sense [FOS08], though the latter uses matrices rather than tensors), the works of [MR06, AHK12, HK13, AGHK14, BCMV14], and many more also use this idea. In our paper, we extend this strategy to learn a more general class of product distributions over the hypercube than could previously be tractably learned.
The remainder of our paper is organized as follows. In Section 2, we give definitions and background, then outline our approach to learning product mixtures over the hypercube, and briefly discuss the restrictions we place on the bias vectors of the distribution. In Section 3, we give an algorithm for completing symmetric tensors given access only to their multilinear entries, using adversarial matrix completion as an algorithmic primitive. In Section 4, we apply our tensor completion result to learn mixtures of product distributions over the hypercube, assuming access to the precise second- and third-order moments of the distribution. The appendices contain discussions of matrix completion and of learning product mixtures in the presence of sampling error, as well as further details about the algorithmic primitives used in learning product mixtures.
We use to denote the th standard basis vector.
For a tensor , we use to denote the entry of the tensor indexed by , and we use to denote the th slice of the tensor, or the subset of entries in which the first coordinate is fixed to . For an order- tensor , we use to represent the entry indexed by the string , and we use to denote the slice of indexed by the string . For a vector , we use the shorthand to denote the -tensor .
We use for the set of observed entries of the hidden matrix , and denotes the projection onto those coordinates.
In this section we present background necessary to prove our results, as well as provide a short discussion on the meaning behind the restrictions we place on the distributions we can learn. We start by defining our main problem.
2.1 Learning Product Mixtures over the Hypercube
A distribution over is called a product distribution if every bit in a sample is independently chosen. Let be a set of product distributions over . Associate with each a vector whose th entry encodes the bias of the th coordinate, that is
Define the distribution to be a convex combination of these product distributions, sampling , where and . The distributions are said to be the centers of , the vectors are said to be the bias vectors, and are said to be the mixing weights of the distribution.
Problem 2.1 (Learning a Product Mixture over the Hypercube).
Given independent samples from a distribution which is a mixture over centers with bias vectors and mixing weights , recover and .
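To make the sampling model concrete, the following is a minimal sketch of a sampler for a product mixture over the hypercube. It assumes the hypercube is the set of vectors with entries in {-1, +1} and that each entry of a bias vector is the expectation of the corresponding coordinate, so a coordinate with bias b equals +1 with probability (1 + b) / 2. The function name, the example weights and the example bias vectors are hypothetical and chosen only for illustration.

```python
import random

def sample_product_mixture(weights, biases, num_samples, seed=0):
    """Draw samples from a mixture of product distributions over {-1, +1}^n.

    weights[i] is the mixing weight of center i. biases[i][j] is assumed to be
    the expectation of coordinate j under center i, so P(x_j = +1) = (1 + b) / 2.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # Pick a center according to the mixing weights.
        (center,) = rng.choices(range(len(weights)), weights=weights)
        # Draw every coordinate independently with the chosen center's bias.
        x = [1 if rng.random() < (1 + b) / 2 else -1 for b in biases[center]]
        samples.append(x)
    return samples

# Hypothetical example: two centers in dimension three.
weights = [0.3, 0.7]
biases = [[0.8, -0.5, 0.0], [-0.2, 0.6, 0.4]]
samples = sample_product_mixture(weights, biases, num_samples=20000)

# Empirical first moments; these approach sum_i weights[i] * biases[i].
n = len(biases[0])
empirical_mean = [sum(x[j] for x in samples) / len(samples) for j in range(n)]
expected_mean = [sum(w * b[j] for w, b in zip(weights, biases)) for j in range(n)]
```

The learning problem runs in the opposite direction: only `samples` is available, and the task is to recover `weights` and `biases`.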
This framework encodes many subproblems, including learning parities, a notorious problem in learning theory; the best current algorithm requires time , and the noisy version of this problem is a standard cryptographic primitive [MOS04, Fel07, Reg09, Val15]. We do not expect to be able to learn an arbitrary mixture over product distributions efficiently. We obtain a polynomial-time algorithm when the bias vectors are linearly independent, and a quasi-polynomial time algorithm in the general case, though we do require an incoherence assumption on the bias vectors (which parities do not meet); see the definition of incoherence in Section 2.2.
In [FOS08], the authors give an -time algorithm for the problem based on the following idea. With great accuracy in polynomial time we may compute the pairwise moments of ,
The matrix is a diagonal matrix which corrects for the fact that always. If we were able to learn and thus access , the “augmented second moment matrix,” we may hope to use spectral information to learn .
The algorithm of [FOS08] performs a brute-force search to learn , leading to a runtime exponential in the rank. By making additional assumptions on the input and computing higher-order moments as well, we avoid this brute force search and give a polynomial-time algorithm for product distributions with linearly independent centers: If the bias vectors are linearly independent, a power iteration algorithm of [AGH14] allows us to learn given access to both the augmented second- and third-order moments. (There are actually several algorithms in this space; we use the tensor-power iteration of [AGH14] specifically. There is a rich body of work on tensor decomposition methods, based on simultaneous diagonalization and similar techniques; see e.g. Jennrich’s algorithm [Har70] and [LCC07].) Again, sampling the third-order moments only gives access to , where is a tensor which is nonzero only on entries of multiplicity at least two. To learn and , Jain and Oh used alternating minimization and a least-squares approximation. For our improvement, we develop a tensor completion algorithm based on recursively applying the adversarial matrix completion algorithm of Hsu, Kakade and Zhang [HKZ11]. In order to apply these completion algorithms, we require an incoherence assumption on the bias vectors (which we define in the next section).
In the general case, when the bias vectors are not linearly independent, we exploit the fact that high-enough tensor powers of the bias vectors are independent, and we work with the th moments of , applying our tensor completion to learn the full moment tensor, and then using [AGH14] to find the tensor powers of the bias vectors, from which we can easily recover the vectors themselves. (the tilde hides a dependence on the separation between the bias vectors). Thus if the distribution is assumed to come from bias vectors that are incoherent and separated, then we can obtain a significant runtime improvement over [FOS08].
2.2 Matrix Completion and Incoherence
As discussed above, the matrix (and tensor) completion problem arises naturally in learning product mixtures as a way to compute the augmented moment tensors.
Problem 2.2 (Matrix Completion).
Given a set of observed entries of a hidden rank- matrix , the Matrix Completion Problem is to successfully recover the matrix given only .
However, this problem is not always well-posed. For example, consider the input matrix . is rank-, and has only nonzero entries on the diagonal, and zeros elsewhere. Even if we observe almost the entire matrix (and even if the observed indices are random), it is likely that every entry we see will be zero, and so we cannot hope to recover . Because of this, it is standard to ask for the input matrix to be incoherent:
Let be a subspace of dimension . We say that is incoherent with parameter if . If is a matrix with left and right singular spaces and , we say that is -incoherent if (resp. ) is incoherent with parameter (resp ). We say that are incoherent with parameter if their span is incoherent with parameter .
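As an illustration of the definition, the sketch below computes the incoherence parameter of the span of a set of vectors. Since the displayed condition is not reproduced here, the code assumes the standard form used in the matrix completion literature: a subspace U of dimension r in n-dimensional space is incoherent with parameter mu if the squared norm of the projection of every standard basis vector onto U is at most mu * r / n. The helper names are hypothetical.

```python
import math

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([a / norm for a in w])
    return basis

def coherence(vectors):
    """Smallest mu with max_i ||P_U e_i||^2 <= mu * r / n, for U = span(vectors)."""
    basis = orthonormal_basis(vectors)
    r, n = len(basis), len(vectors[0])
    # ||P_U e_i||^2 is the sum over orthonormal basis vectors q of q[i]^2.
    max_projection = max(sum(q[i] ** 2 for q in basis) for i in range(n))
    return max_projection * n / r

spread = coherence([[1, 1, 1, 1], [1, -1, 1, -1]])  # well-spread subspace
spiky = coherence([[1, 0, 0, 0]])                   # coordinate subspace
```

Under this assumed definition the parameter ranges from 1 (perfectly spread, as in `spread`) to n / r (a coordinate subspace, as in `spiky`).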
Incoherence means that the singular vectors are well-spread over their coordinates. Intuitively, this asks that every revealed entry actually gives information about the matrix. For a discussion on what kinds of matrices are incoherent, see e.g. [CR09]. Once the underlying matrix is assumed to be incoherent, there are a number of possible algorithms one can apply to try and learn the remaining entries of . Much of the prior work on matrix completion has been focused on achieving recovery when the revealed entries are randomly distributed, and the goal is to minimize the number of samples needed (see e.g. [CR09, Rec09, GAGG13, Har14]). For our application, the revealed entries are not randomly distributed, but we have access to almost all of the entries ( entries as opposed to the entries needed in the random case). Thus we use a particular kind of matrix completion theorem we call “adversarial matrix completion,” which can be achieved directly from the work of Hsu, Kakade and Zhang [HKZ11]:
Let be an rank- matrix which is -incoherent, and let be the set of hidden indices. If there are at most elements per column and elements per row of , and if , then there is an algorithm that recovers .
For the application of learning product mixtures, note that the moment tensors are incoherent exactly when the bias vectors are incoherent. In Section 3 we show how to apply the adversarial matrix completion theorem above recursively to perform a special type of adversarial tensor completion, which we use to recover the augmented moment tensors of after sampling.
Further, we note that thm:main_matrix_completion is almost tight. That is, there exist matrix completion instances with , and for which finding any completion is NP-hard [HMRW14, Pee96] (via a reduction from three-coloring), so the constant on the right-hand side is necessarily at most six. We also note that the tradeoff between and in thm:main_matrix_completion is necessary because for a matrix of fixed rank, one can add extra rows and columns of zeros in an attempt to reduce , but this process increases by an identical factor. This suggests that improving thm:main_learn_prod by obtaining a better efficient adversarial matrix completion algorithm is not likely.
2.3 Incoherence and Decomposition Uniqueness
In order to apply our completion techniques, we place the restriction of incoherence on the subspace spanned by the bias vectors. At first glance this may seem like a strange condition which is unnatural for probability distributions, but we try to motivate it here. When the bias vectors are incoherent and separated enough, even high-order moment-completion problems have unique solutions, and moreover that solution is equal to. In particular, this implies that the distribution must have a unique decomposition into a minimal number of well-separated centers (otherwise those different decompositions would produce different minimum-rank solutions to a moment-completion problem for high-enough order moments). Thus incoherence can be thought of as a special strengthening of the promise that the distribution has a unique minimal decomposition. Note that there are distributions which have a unique minimal decomposition but are not incoherent, such as a parity on any number of bits.
3 Symmetric Tensor Completion from Multilinear Entries
In this section we use adversarial matrix completion as a primitive to give a completion algorithm for symmetric tensors when only a special kind of entry in the tensor is known. Specifically, we call a string multilinear if every element of is distinct, and we will show how to complete a symmetric tensor when only given access to its multilinear entries, i.e. is known if is multilinear. In the next section, we will apply our tensor completion algorithm to learn mixtures of product distributions over the boolean hypercube.
Our approach is a simple recursion: we complete the tensor slice-by-slice, using the entries we learn from completing one slice to provide us with enough known entries to complete the next. The following definition will be useful in precisely describing our recursive strategy:
Define the histogram of a string to be the multiset containing the number of repetitions of each character making at least one appearance in .
For example, the string and the string both have the histogram . Note that the entries of the histogram of a string of length always sum to , and that the length of the histogram is the number of distinct symbols in the string.
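The definition can be sketched in a few lines of code. The representation of a histogram as a sorted tuple of counts, the function names, and the two example strings are choices made here for illustration.

```python
from collections import Counter

def histogram(string):
    """Multiset of repetition counts of the symbols, as a sorted tuple."""
    return tuple(sorted(Counter(string).values(), reverse=True))

def is_multilinear(string):
    """A string is multilinear when every symbol in it is distinct."""
    return len(set(string)) == len(string)

a = (1, 1, 2, 3)  # symbol 1 appears twice
b = (4, 2, 7, 7)  # symbol 7 appears twice

hist_a = histogram(a)
hist_b = histogram(b)
```

Here `a` and `b` share a histogram even though they use different symbols; a string of length m is multilinear exactly when its histogram has length m.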
Having defined a histogram, we are now ready to describe our tensor completion algorithm.
Algorithm 3.2 (Symmetric Tensor Completion from Multilinear Moments).
Input: The multilinear entries of the tensor , for vectors and scalars and some error tensor .
Goal: Recover the symmetric tensor .
1. Initialize the tensor with the known multilinear entries of .
2. For each subset with no repetitions:
(a) Let be the tensor slice indexed by .
(b) Remove the rows and columns of corresponding to indices present in .
(c) Complete the matrix using the algorithm of [HKZ11] from the adversarial matrix completion theorem of Section 2.2 and add the learned entries to .
3. For : For each with a histogram of length , if is empty:
(a) If there is an element appearing at least three times, let .
(b) Else there are elements each appearing twice, let .
(c) Let be the tensor slice indexed by .
(d) Complete the matrix using the algorithm from the adversarial matrix completion theorem of Section 2.2 and add the learned entries to .
4. Symmetrize by taking each entry to be the average over entries indexed by the same subset.
Output: .
One might ask why we go through the effort of completing the tensor slice-by-slice, rather than simply flattening it to an matrix and completing that. The reason is that when has incoherence and dimension , may have incoherence as large as , which drastically reduces the range of parameters for which recovery is possible (for example, if then we would need ). Working slice-by-slice keeps the incoherence of the input matrices small, allowing us to complete even up to rank .
Let be a symmetric tensor of order , so that for some vectors and scalars . Let have incoherence and dimension . Given perfect access to all multilinear entries of (i.e. ), if , then alg:tensor-complete returns the full tensor in time .
In app:error, we give a version of thm:tensor-complete-alg that accounts for error in the input.
We prove that Algorithm 3.2 successfully completes all the entries of by induction on the length of the histograms of the entries. By assumption, we are given as input every entry with a histogram of length . For an entry with a histogram of length , exactly one of its elements has multiplicity two, call it , and consider the set . When the first loop of the algorithm reaches , the algorithm attempts to complete a matrix revealed from , where , and is the projector to the matrix with the rows and columns corresponding to indices appearing in removed. Exactly the diagonal of is missing since all other entries are multilinear moments, and the th entry should be . Because the rank of this matrix is equal to and , by the adversarial matrix completion theorem of Section 2.2, we can successfully recover the diagonal, including . Thus by the end of the first loop, contains every entry with a histogram of length .
For the inductive step, we prove that each time the second loop of the algorithm completes an iteration, contains every entry with a histogram of length at least . Let be an entry with a histogram of length . When the second loop reaches in the th iteration, if does not already contain , the algorithm attempts to complete a matrix with entries revealed from , where is a substring of with a histogram of the same length. Since has a histogram of length , every entry of corresponds to an entry with a histogram of length at least , except for the principal submatrix whose rows and columns correspond to elements in . Thus by the inductive hypothesis, is only missing the aforementioned submatrix, and since , by the adversarial matrix completion theorem of Section 2.2, we can successfully recover this submatrix, including . Once all of the entries of are filled in, the algorithm terminates.
Finally, we note that the runtime is , because the algorithm from thm:main_matrix_completion runs in time , and we perform at most matrix completions because there are strings of length over the alphabet , and we perform at most one matrix completion for each such string. ∎
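To give some intuition for why a missing diagonal is determined by the off-diagonal entries of a low-rank matrix, the sketch below treats the simplest possible case: a symmetric rank-one matrix whose diagonal is hidden. This is a toy substitute and not the algorithm of [HKZ11]; it relies on the rank-one identity M[i][i] = M[i][j] * M[i][k] / M[j][k], which does not extend to higher rank. The function name and the example values are hypothetical.

```python
from itertools import combinations

def complete_rank_one_diagonal(matrix):
    """Fill the missing (None) diagonal of a symmetric rank-one matrix.

    If M = w * v v^T, then M[i][i] = M[i][j] * M[i][k] / M[j][k] for distinct
    i, j, k whenever M[j][k] is nonzero.
    """
    n = len(matrix)
    completed = [row[:] for row in matrix]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Any pair (j, k) with a nonzero observed entry determines M[i][i].
        # (Raises StopIteration if no such pair exists.)
        j, k = next((j, k) for j, k in combinations(others, 2) if matrix[j][k] != 0)
        completed[i][i] = matrix[i][j] * matrix[i][k] / matrix[j][k]
    return completed

# Hypothetical rank-one example with the diagonal hidden.
w, v = 0.5, [0.5, -1.0, 2.0, 4.0]
full = [[w * a * b for b in v] for a in v]
observed = [[None if i == j else full[i][j] for j in range(4)] for i in range(4)]
recovered = complete_rank_one_diagonal(observed)
```

For matrices of higher rank no such closed form is available, which is where the incoherence assumption and the guarantees of adversarial matrix completion come in.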
4 Learning Product Mixtures over the Hypercube
In this section, we apply our symmetric tensor completion algorithm (Algorithm 3.2) to learning mixtures of product distributions over the hypercube, proving the main theorem of Section 1.1. Throughout this section we will assume exact access to moments of our input distribution, deferring finite-sample error analysis to the appendix. We begin by introducing convenient notation.
Let be a mixture over centers with bias vectors and mixing weights . Define to be the tensor of order- moments of the distribution , so that . Define to be the symmetric tensor given by the weighted bias vectors of the distribution, so that .
Note that and are equal on their multilinear entries, and not necessarily equal elsewhere. For example, when is even, entries of indexed by a single repeating character (the “diagonal”) are always equal to 1. Also observe that if one can sample from distribution , then estimating is easy.
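The agreement on multilinear entries can be checked by brute force on a small example. The sketch below computes exact moments of a product mixture by enumerating the hypercube and compares them with the entries of the weighted sum of tensor powers of the bias vectors. It assumes samples lie in {-1, +1}^n and that a bias is the expectation of a coordinate; the function names and example parameters are hypothetical.

```python
from itertools import product

def moment(weights, biases, index):
    """Exact E[prod_{j in index} x_j] under the mixture, by enumeration."""
    n = len(biases[0])
    total = 0.0
    for x in product([-1, 1], repeat=n):
        # Probability of x: mix the product probabilities of the centers.
        prob = 0.0
        for w, b in zip(weights, biases):
            p = w
            for xj, bj in zip(x, b):
                p *= (1 + xj * bj) / 2
            prob += p
        value = 1
        for j in index:
            value *= x[j]
        total += prob * value
    return total

def bias_tensor_entry(weights, biases, index):
    """Entry at `index` of sum_i w_i * (v_i tensored with itself len(index) times)."""
    total = 0.0
    for w, b in zip(weights, biases):
        term = w
        for j in index:
            term *= b[j]
        total += term
    return total

weights = [0.3, 0.7]
biases = [[0.8, -0.5, 0.0], [-0.2, 0.6, 0.4]]

multilinear_gap = abs(moment(weights, biases, (0, 1, 2))
                      - bias_tensor_entry(weights, biases, (0, 1, 2)))
diagonal_moment = moment(weights, biases, (0, 0))
diagonal_entry = bias_tensor_entry(weights, biases, (0, 0))
```

On the multilinear index `(0, 1, 2)` the two quantities coincide, while on the repeated index `(0, 0)` the moment is 1 regardless of the parameters and the tensor entry is not.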
Suppose that the bias vectors of are linearly independent. Then by Theorem 4.1 (due to [AGH14], with similar statements appearing in [AHK12, HK13, AGHK14]), there is a spectral algorithm which learns given and (we give an account of the algorithm in the appendix on whitening). (We remark again that the result in [AGH14] is quite general, and applies to a large class of probability distributions of this character. However, the work deals exclusively with distributions for which and , and assumes access to and through moment estimation.)
Theorem 4.1 (Consequence of Theorem 4.3 and Lemma 5.1 in [AGH14]).
Let be a mixture over centers with bias vectors and mixing weights . Suppose we are given access to and . Then there is an algorithm which recovers the bias vectors and mixing weights of within in time .
Because and are equal to and on their multilinear entries, the tensor completion algorithm of the previous section allows us to find and from and (this is only possible because and are low-rank, whereas and are high-rank). We then learn by applying Theorem 4.1.
A complication is that Theorem 4.1 only allows us to recover the parameters of if the bias vectors are linearly independent. However, if the vectors are not linearly independent, we can reduce to the independent case by working instead with for sufficiently large . The tensor power we require depends on the separation between the bias vectors:
We call a set of vectors -separated if for every such that ,
Suppose that are vectors which are -separated, for . Let . Then are linearly independent.
For vectors and for an integer , we have that . If are -separated, then for all ,
Now considering the Gram matrix of the vectors , we have a matrix with diagonal entries of value and off-diagonal entries with maximum absolute value . This matrix is strictly diagonally dominant, and thus full rank, so the vectors must be linearly independent. ∎
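The argument can be checked numerically on a small example. The sketch below takes three unit vectors in the plane, which are necessarily linearly dependent, and verifies that their third tensor powers have a strictly diagonally dominant (hence nonsingular) Gram matrix. It uses the standard identity that the inner product of two m-th tensor powers equals the m-th power of the inner product of the vectors. The helper names and the choice of vectors are hypothetical.

```python
import math
from itertools import product

def tensor_power(v, m):
    """Flattened m-fold tensor power of v (a vector with len(v)**m entries)."""
    return [math.prod(v[j] for j in idx) for idx in product(range(len(v)), repeat=m)]

def gram(vectors):
    """Matrix of pairwise inner products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def det(matrix):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0.0
    for c, entry in enumerate(matrix[0]):
        minor = [row[:c] + row[c + 1:] for row in matrix[1:]]
        total += (-1) ** c * entry * det(minor)
    return total

def strictly_diagonally_dominant(matrix):
    return all(abs(row[i]) > sum(abs(x) for j, x in enumerate(row) if j != i)
               for i, row in enumerate(matrix))

s = 1 / math.sqrt(2)
vectors = [[1.0, 0.0], [0.0, 1.0], [s, s]]  # three unit vectors in the plane

gram_1 = gram(vectors)                                  # singular: rank 2
gram_3 = gram([tensor_power(v, 3) for v in vectors])    # nonsingular
```

Raising to the third tensor power shrinks each off-diagonal Gram entry from `s` to `s**3` while leaving the diagonal at 1, which is exactly the effect the proof exploits.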
We reiterate here that in the case where , we can reduce our problem to one with fewer centers, and so our runtime is never infinite. Specifically, if for some , then we can describe the same distribution by omitting and including with weight . If , in the even moments we will see the center with weight , and in the odd moments we will see with weight . So we simply solve the problem by taking for the first odd so that the are linearly independent, so that both the - and -order moments are even to learn and , and then given the decomposition into centers we can extract and from the order- moments by solving a linear system.
Thus, in the linearly dependent case, we may choose an appropriate power , and instead apply the tensor completion algorithm to and to recover and . We will then apply Theorem 4.1 to the vectors in the same fashion.
Here we give the algorithm assuming perfect access to the moments of and defer discussion of the finite-sample case to app:error.
Algorithm 4.5 (Learning Mixtures of Product Distributions).
Input: Moments of the distribution .
Goal: Recover and .
1. Let be the smallest odd integer such that are linearly independent. Let and be approximations to the moment tensors of order and .
2. Set the non-multilinear entries of and to “missing,” and run Algorithm 3.2 on and to recover and .
3. Flatten to the matrix and similarly flatten to the tensor .
4. Run the “whitening” algorithm from Theorem 4.1 (see the appendix on whitening) on to recover and .
5. Recover entry-by-entry, by taking the th root of the corresponding entry in .
Output: and .
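The final recovery step of the algorithm can be sketched as follows. The sketch assumes that the "corresponding entry" for coordinate j is the diagonal entry of the tensor power indexed by j repeated m times, which equals the m-th power of the j-th coordinate; because m is odd, the real m-th root is unique and preserves sign. The function names and example vector are hypothetical.

```python
import math

def signed_root(value, m):
    """Real m-th root for odd m, preserving the sign of `value`."""
    if m % 2 == 0:
        raise ValueError("m must be odd so that the real root is unique")
    return math.copysign(abs(value) ** (1.0 / m), value)

def recover_vector(diagonal_entries, m):
    """Recover v from the entries v[j]**m of its m-th tensor power."""
    return [signed_root(entry, m) for entry in diagonal_entries]

v = [0.5, -0.25, 0.0, 1.0]
m = 3
diagonal = [a ** m for a in v]       # what the decomposition would expose
recovered = recover_vector(diagonal, m)
```

This is one reason an odd power is convenient: with an even power the signs of the coordinates would be lost in this step.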
Now the main theorem of Section 1.1 is a direct result of the correctness of Algorithm 4.5:
Proof of the main theorem.
The proof follows immediately by combining Theorem 4.1 with the tensor completion theorem of Section 3, and noting that the parameter is bounded by . ∎
We would like to thank Prasad Raghavendra, Satish Rao, and Ben Recht for helpful discussions, and Moritz Hardt and Samuel B. Hopkins for helpful questions. We also thank several anonymous reviewers for very helpful comments.
- [AGH14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky, Tensor decompositions for learning latent variable models, Journal of Machine Learning Research 15 (2014), no. 1, 2773–2832.
- [AGHK14] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M. Kakade, A tensor approach to learning mixed membership community models, Journal of Machine Learning Research 15 (2014), no. 1, 2239–2312.
- [AGJ14a] Anima Anandkumar, Rong Ge, and Majid Janzamin, Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models, CoRR abs/1411.1488 (2014).
- [AGJ14b] Animashree Anandkumar, Rong Ge, and Majid Janzamin, Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates, CoRR abs/1402.5180 (2014).
- [AHK12] Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade, A method of moments for mixture models and hidden markov models, COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, 2012, pp. 33.1–33.34.
- [AM05] Dimitris Achlioptas and Frank McSherry, On spectral learning of mixtures of distributions, Learning Theory, 18th Annual Conference on Learning Theory, COLT, 2005, pp. 458–469.
- [BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan, Smoothed analysis of tensor decompositions, Symposium on Theory of Computing, STOC, 2014.
- [BKS15] Boaz Barak, Jonathan A. Kelner, and David Steurer, Dictionary learning and tensor decomposition via the sum-of-squares method, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, 2015, pp. 143–151.
- [BM15] Boaz Barak and Ankur Moitra, Tensor prediction, rademacher complexity and random 3-xor, CoRR abs/1501.06521 (2015).
- [C99] Ph.D. thesis.
- [CGG01] Mary Cryan, Leslie Ann Goldberg, and Paul W. Goldberg, Evolutionary trees can be learned in polynomial time in the two-state general markov model, SIAM J. Comput. 31 (2001), no. 2, 375–397.
- [CP09] Emmanuel J. Candès and Yaniv Plan, Matrix completion with noise, CoRR abs/0903.3131 (2009).
- [CR08] Kamalika Chaudhuri and Satish Rao, Learning mixtures of product distributions using correlations and independence, 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008, 2008, pp. 9–20.
- [CR09] Emmanuel J. Candès and Benjamin Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics 9 (2009), no. 6, 717–772.
- [Fel07] Vitaly Feldman, Attribute-efficient and non-adaptive learning of parities and DNF expressions, Journal of Machine Learning Research 8 (2007), 1431–1460.
- [FM99] Yoav Freund and Yishay Mansour, Estimating a mixture of two product distributions, Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT Santa Cruz, CA, USA, July 7-9, 1999, pp. 53–62.
- [FOS08] Jon Feldman, Ryan O’Donnell, and Rocco A. Servedio, Learning mixtures of product distributions over discrete domains, SIAM J. Comput. 37 (2008), no. 5, 1536–1564.
- [GAGG13] Suriya Gunasekar, Ayan Acharya, Neeraj Gaur, and Joydeep Ghosh, Noisy matrix completion using alternating minimization, Machine Learning and Knowledge Discovery in Databases (Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Železný, eds.), Lecture Notes in Computer Science, vol. 8189, Springer Berlin Heidelberg, 2013, pp. 194–209 (English).
- [GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan, Escaping from saddle points - online stochastic gradient for tensor decomposition, CoRR abs/1503.02101 (2015).
- [Har70] Richard A. Harshman, Foundations of the PARAFAC Procedure: Models and Conditions for an “Explanatory” Multimodal Factor Analysis.
- [Har14] Moritz Hardt, Understanding alternating minimization for matrix completion, 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014, pp. 651–660.
- [HK13] Daniel Hsu and Sham M. Kakade, Learning mixtures of spherical gaussians: moment methods and spectral decompositions, Innovations in Theoretical Computer Science, ITCS ’13, Berkeley, CA, USA, January 9-12, 2013, 2013, pp. 11–20.
- [HKZ11] Daniel Hsu, Sham M. Kakade, and Tong Zhang, Robust matrix decomposition with sparse corruptions, IEEE Transactions on Information Theory 57 (2011), no. 11, 7221–7234.
- [HMRW14] Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz, Computational limits for matrix completion, CoRR abs/1402.2331 (2014).
- [JO13] Prateek Jain and Sewoong Oh, Learning mixtures of discrete product distributions using spectral decompositions, CoRR abs/1311.2972 (2013).
- [JO14] Prateek Jain and Sewoong Oh, Provable tensor factorization with missing data, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 1431–1439.
- [LCC07] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso, Fourth-order cumulant-based blind identification of underdetermined mixtures, IEEE Transactions on Signal Processing 55 (2007), no. 6-2, 2965–2973.
- [MOS04] Elchanan Mossel, Ryan O’Donnell, and Rocco A. Servedio, Learning functions of k relevant variables, J. Comput. Syst. Sci. 69 (2004), no. 3, 421–434.
- [MR06] Elchanan Mossel and Sébastien Roch, Learning nonsingular phylogenies and hidden markov models, The Annals of Applied Probability 16 (2006), no. 2, 583–614.
- [Pee96] René Peeters, Orthogonal representations over finite fields and the chromatic number of graphs, Combinatorica 16 (1996), no. 3, 417–431 (English).
- [Rec09] Benjamin Recht, A simpler approach to matrix completion, CoRR abs/0910.0651 (2009).
- [Reg09] Oded Regev, On lattices, learning with errors, random linear codes, and cryptography, J. ACM 56 (2009), no. 6.
- [TS15] Gongguo Tang and Parikshit Shah, Guaranteed tensor decomposition: A moment approach.
- [Val15] Gregory Valiant, Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem, J. ACM 62 (2015), no. 2, 13.
Appendix A Tensor Completion with Noise
Here we present a version of thm:tensor-complete-alg which accounts for noise in the input to the algorithm.
We will first require a matrix completion algorithm which is robust to noise. The work of [HKZ11] provides us with such an algorithm; the following theorem is a consequence of their work. (Footnote 7: In a previous version of this paper, we derive thm:noisy_matrix_completion as a consequence of thm:main_matrix_completion and the work of [CP09]; we refer the interested reader to http://arxiv.org/abs/1506.03137v2 for the details.)
Let be an rank- matrix which is -incoherent, and let be the set of hidden indices. If there are at most elements per column and elements per row of , and if , then let and . In particular, and . Then for every , there is a semidefinite program that computes outputs satisfying
We now give an analysis for the performance of our tensor completion algorithm, alg:tensor-complete, in the presence of noise in the input moments. This will enable us to use the algorithm on empirically estimated moments.
Let be a symmetric tensor of order , so that for some vectors and scalars . Let have incoherence and dimension . Suppose we are given access to , where is a noise tensor with for every . Then if
Then alg:tensor-complete recovers a symmetric tensor such that
for any slice indexed by a string , in time . In particular, the total Frobenius norm error is bounded by .
We proceed by induction on the histogram length of the entries: we will prove that an entry with a histogram of length has error at most .
In the base case of , we have that by assumption, every entry of is bounded by .
Now, for the inductive step, consider an entry with a histogram of length . In filling in the entry , we only use information from entries with shorter histograms, which by the inductive hypothesis each have error at most . Summing over the squared errors of the individual entries, the squared Frobenius norm error of the known entries in the slice in which was completed is, pre-completion, at most . Due to the assumptions on , by thm:noisy_matrix_completion, matrix completion amplifies the Frobenius norm error of to at most a Frobenius norm error of . Thus, the Frobenius norm error of the slice in which was completed is, post-completion, at most , and therefore the error in the entry is at most , as desired.
This concludes the induction. Finally, as our error bound is per entry, it is not increased by the symmetrization in step:symmetrize. Any slice has at most one entry with a histogram of length one, entries with a histogram of length two, and entries with a histogram of length three. Thus the total error in a slice is at most , and there are slices. ∎
Appendix B Empirical Moment Estimation for Learning Product Mixtures
In sec:pdist, we detailed our algorithm for learning mixtures of product distributions while assuming access to exact moments of the distribution . Here, we will give an analysis which accounts for the errors introduced by empirical moment estimation. We note that we made no effort to optimize the sample complexity, and that a tighter analysis of the error propagation may well be possible.
Algorithm B.1 (Learning product mixture over separated centers).
Input: independent samples from , where has bias vectors with separation .
Goal: Recover the bias vectors and mixing weights of .
Let be the smallest odd integer for which become linearly independent.
1. Empirically estimate and by calculating and .
2. Run alg:recovery on and .
Output: The approximate mixing weights , and the approximate vectors .
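The empirical estimation step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: samples are taken to lie in the hypercube with entries in {-1, +1}, each center is described by a vector of coordinate means (biases), and the names `sample_product_mixture` and `empirical_moment` as well as all numerical values are hypothetical.

```python
import numpy as np

def sample_product_mixture(weights, biases, num_samples, rng):
    """Draw samples in {-1, +1}^n from a product mixture.

    Center i is chosen with probability weights[i]; given center i,
    coordinate j is +1 independently with probability (1 + biases[i, j]) / 2,
    so its mean is biases[i, j].
    """
    k, n = biases.shape
    which = rng.choice(k, size=num_samples, p=weights)
    prob_plus = (1.0 + biases[which]) / 2.0
    return np.where(rng.random((num_samples, n)) < prob_plus, 1.0, -1.0)

def empirical_moment(samples, order):
    """Empirical average of the `order`-fold tensor power of each sample."""
    modes = "abcdefgh"[:order]
    spec = ",".join("z" + s for s in modes) + "->" + modes
    return np.einsum(spec, *([samples] * order)) / samples.shape[0]

rng = np.random.default_rng(0)
weights = np.array([0.6, 0.4])
biases = np.array([[0.5, -0.3, 0.2],
                   [-0.4, 0.1, 0.6]])
samples = sample_product_mixture(weights, biases, 200_000, rng)

M2 = empirical_moment(samples, 2)
exact_M2 = np.einsum("i,ia,ib->ab", weights, biases, biases)

# Only entries with distinct indices carry information about the centers;
# the diagonal is identically 1 because every coordinate squares to 1.
off_diag = ~np.eye(3, dtype=bool)
max_err = np.abs(M2 - exact_M2)[off_diag].max()
print(np.diag(M2), max_err)
```

The diagonal of the estimated second moment being exactly 1 is the reason the non-multilinear entries have to be treated as missing and filled in by completion.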
Theorem B.2 (thm:main_learn_prod with empirical moment estimation).
Let be a product mixture over centers with bias vectors and mixing weights . Let be the smallest odd integer for which are linearly independent (if are -separated for , then ). Define . Suppose
where and are the incoherence and dimension of the space respectively. Furthermore, let be suitably small, and let the parameter in alg:learning_dep satisfy for satisfying
Finally, pick any . Then with probability at least , alg:learning_dep returns vectors and mixing weights such that
and runs in time . In particular, a choice of gives sub-constant error, where the tilde hides the dependence on and .
Before proving thm:learn-big, we will state the guarantees of the whitening algorithm of [AGH14] on noisy inputs, which is used as a black box in alg:recovery. We have somewhat modified the statement in [AGH14] for convenience; for a brief account of their algorithm, as well as an account of our modifications to the results as stated in [AGH14], we refer the reader to app:whitening.
Theorem B.3 (Corollary of Theorem 4.3 in [AGH14]).
Let be vectors and let be weights. Define and , and suppose we are given and , where and are symmetric error terms such that
Then there is an algorithm that recovers vectors and weights such that for all ,
with probability in time , where is .
Having stated the guarantees of the whitening algorithm, we are ready to prove thm:learn-big.
Proof of thm:learn-big.
We account for the noise amplification in each step.
Step 1: In this step, we empirically estimate the multilinear moments of the distribution. We will apply concentration inequalities on each entry individually. By a Hoeffding bound, each entry concentrates within of its expectation with probability . Taking a union bound over the moments we must estimate, we conclude that with probability at least , all moments concentrate to within of their expectation. Setting , we have that with probability , every entry concentrates to within of its expectation.
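The concentration argument in Step 1 can be made concrete with a short calculation. The sketch below uses the standard two-sided Hoeffding inequality for averages of independent variables in [-1, 1], namely a failure probability of at most 2·exp(-N·eps²/2) per entry, followed by a union bound over all estimated entries. The constants are those of this standard form and are an assumption; they are not necessarily the ones used in the paper's stripped formulas, and the function names are hypothetical.

```python
import math

def hoeffding_samples(num_entries, eps, delta):
    """Samples sufficient for all entries to be within eps of their means.

    Solves 2 * num_entries * exp(-N * eps^2 / 2) <= delta for N, where each
    entry is an average of N independent variables taking values in [-1, 1].
    """
    return math.ceil(2.0 * math.log(2.0 * num_entries / delta) / eps ** 2)

def failure_bound(num_entries, eps, num_samples):
    """Union bound on the probability that some entry deviates by >= eps."""
    return 2.0 * num_entries * math.exp(-num_samples * eps ** 2 / 2.0)

# Hypothetical setting: 1000 moment entries, accuracy 0.1, failure prob. 0.01.
N = hoeffding_samples(num_entries=1000, eps=0.1, delta=0.01)
print(N, failure_bound(1000, 0.1, N))
```

The sample count grows only logarithmically in the number of entries, so the dominant cost comes from the accuracy required per entry.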
Now, we run alg:recovery on the estimated moments.
Step a of alg:recovery: Applying thm:completion-err, we see that the error satisfies and .
Step b of alg:recovery: No error is introduced in this step.
Step c of alg:recovery: Here, we apply thm:whitening-err out of the box, where our vectors are now the . The desired result now follows immediately for the estimated mixing weights, and for the estimated tensored vectors we have , for as defined in thm:learn-big. Note that , so let .
Step d of alg:recovery: Let be the restriction of to the single-index entries, and let be the same restriction for . The bound on the error of the applies to restrictions, so we have So the error in each entry is bounded by . By the concavity of the th root, we thus have that