1 Introduction
Tensor decomposition is one of the key algorithmic tools for learning many latent variable models Chang [1996], Mossel and Roch [2005], Hsu et al. [2012], Anandkumar et al. [2012]. In practice, tensor decomposition methods based on gradient descent and the power method have been observed to work well Kolda and Mayo [2011], Ge and Ma [2017]. Theoretically, determining the minimum number of rank-one components in a tensor decomposition is known to be NP-hard in the worst case Håstad [1990], Hillar and Lim [2013], so tensor decomposition is usually analyzed in the average case. Several algorithms have been analyzed in the average case, where the input tensor is produced according to some probabilistic model; see, for example, Bhaskara et al. [2014], Goyal et al. [2014], De Lathauwer et al. [2007], as well as sum-of-squares-based algorithms like Barak et al. [2015], Ge and Ma [2015], Hopkins et al. [2016], Ma et al. [2016].
The average case models studied in the literature generally fall into two categories. They either assume components of the tensor are fully random, i.e., generated from a known distribution (e.g., Gaussian), or they follow a smoothed analysis setting where some adversarially chosen instance is perturbed by random noise, see for example Bhaskara et al. [2014], Goyal et al. [2014], Ma et al. [2016]. Our work falls into the second category.
We build upon the framework used in Bhaskara et al. [2014], which reduces decomposing sums of rank-one tensors to showing robust linear independence of related rank-one tensors, by using Jennrich’s algorithm, also known as Chang’s lemma Chang [1996], Leurgans et al. [1993]. The main point of departure of our work is our smoothed analysis of linear independence, which we base on a new notion we call echelon trees, a generalization of Gaussian elimination and echelon form to high-order tensors, which might be of independent interest. We also get improved guarantees compared to Bhaskara et al. [2014] when the tensors are of high enough order.
The main feature of our analysis is that it can handle discrete perturbations. To illustrate, suppose that vectors are drawn from some unknown distribution and our goal is to recover them by (noisily) observing for small values of . Bhaskara et al. [2014] showed that up to constant factor blowups in an efficient algorithm can do this as long as are linearly independent in a robust sense. Note that the set of vector tuples for which are linearly dependent can be defined by polynomial equations, using determinants, and is therefore an algebraic variety. As long as , this variety will have dimension smaller than the whole space, so we expect most vector tuples to fall outside. Bhaskara et al. [2014] showed that starting from an arbitrary set of vectors , by adding Gaussian noise, the new tuple will lie far away from this variety. Our analysis, on the other hand, handles a much wider class of perturbations. For example, if each is independently chosen at random from a “large enough” discrete set such as the vertices of an arbitrary hypercube, we show that with very high probability the resulting tensors are linearly independent, again in a robust sense.
For our main application, described in the next section, it is important to assume components of the tensor come from a discrete set.
1.1 Assemblies of neurons and recovering sparse Venn diagrams
Experiments by neuroscientists over the past three decades Quiroga [2012] have identified neurons which are selectively activated when a real-world object (or person; these are commonly known as “Jennifer Aniston neurons”) is seen (or more generally sensed). It is now widely accepted Buzsáki [2010] that these neurons are part of large cell assemblies, stable sets of highly interconnected neurons whose firing (more or less simultaneous and in unison) is tantamount to a cognitive event such as the sensing or imagining of a person, or of a word or concept (hence the other common name “concept cells”).
In a recent experiment Ison et al. [2015], a neuron firing when one real-world entity is seen (say, the Eiffel tower) but not another (e.g., Barack Obama) may start firing on presentation of an image of Obama after a visual experience associating the two, for example a picture of Obama in front of the Eiffel tower. This experiment has taught us that assemblies seem to be “mobile” and able to intersect in complex ways reflecting perceived varying degrees of association between the corresponding entities. The stronger the association between the entities, the larger the intersection of the corresponding assemblies. During one’s life, presumably a complex mesh of entities and associations will be created, of some degree of permanence, reflecting the sum total of one’s cognitive experiences.
All said, this complex mesh of memories in somebody’s brain can be modeled as a Venn diagram where each set or assembly consists of neurons firing for a particular concept, and each region of the Venn diagram, a minimal set obtained from an intersection of assemblies and their complements, represents a class of neurons behaving the same way towards all concepts.
Alternatively to the Venn diagram, one may record associations between assemblies in a hypergraph. The entities are the sets or nodes, and the edges reflect associations between the nodes. Furthermore, the hypergraph representing a person’s state of knowledge can be adorned with edge weights reflecting the degree of affinity between a set of nodes (or equivalently, the size of the intersection of their corresponding sets).
This gives rise to several natural questions. The first question concerns reconstruction. How many experiments or observations are needed to identify the structure of cell assembly intersections, or in other words the Venn diagram? Here, we make two crucial assumptions. First, we assume that we can only measure the degree of association between a small number of entities or concepts. Second, the total number of classes of neurons (which behave similarly in response to stimuli) is bounded. In the language of sets, we assume the number of nonempty regions of the Venn diagram is upper bounded by some number and we can measure the sizes of wise intersections of any of our sets for for some small . We also allow for measurement errors.
Our main result here is as follows: As long as the cell assemblies are slightly randomly perturbed, and as long as the number of measurements, , is polynomially larger than the number of nonempty regions of the Venn diagram, , we can fully reconstruct the Venn diagram. The perturbation of cell assemblies, a process which likely occurs naturally in the brain, is a mild assumption that we need in order to escape idiosyncratic cases. We solve the problem of reconstructing the Venn diagram by casting it as a tensor decomposition problem where the elements of the decomposition come from high order tensors of the vertices of the hypercube.
We also explore a simpler graph-theoretic model of assembly association, motivated by more recent experimental findings Ison et al. [2015], De Falco et al. [2016]: Assume that all assemblies have the same size , and that two assemblies are associated if their intersection is of size at least , and are not associated if the intersection is less than another threshold ; the results of Ison et al. [2015], De Falco et al. [2016] suggest that is of , while is of . We show that an unreasonably rich and complex family of graphs can be realized by associations (roughly, any graph of degree ).
1.2 Problem formulation
Suppose that we have a Venn diagram formed by some sets . We will assume that this Venn diagram has at most nonempty regions. For our main application, each set corresponds to neurons that respond to a particular stimulus, so we are assuming that there are at most classes of neurons. We let denote the set of neuron classes. We also have a weight function representing the sizes of various classes. Each set is an assembly and is its weight. Our main question is the following:
Question 1.
Given the sizes of wise intersections of for some constant , i.e., for all , can we recover the full Venn diagram of , i.e., the weight of all intersections formed by these sets and their complements?
Our main result is that as long as the set memberships of elements are slightly perturbed to avoid worst case scenarios, and as long as is polynomially larger than , the answer is yes and moreover there is an efficient algorithm that performs recovery. Our algorithm is also robust to inverse polynomial noise in the input.
We pose the question as a tensor decomposition problem in the following way: To each element assign a vector , where indicates whether . Then the entries of the following tensor capture all wise intersections:
For simplicity of exposition, we assume weights are all equal to , but our results easily generalize, since each weight can be absorbed into .
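The formulation above can be made concrete with a small sketch. It assumes unit weights, represents each neuron class by a 0/1 membership list, and stores the tensor as a dictionary keyed by index tuples; the names `intersection_tensor`, `classes`, and `sets` are hypothetical and chosen only for illustration. Each entry of the sum of tensor powers counts the classes that belong to all of the indexed sets.

```python
from itertools import product

def intersection_tensor(vectors, order):
    """Sum of order-fold tensor powers of 0/1 membership vectors.

    vectors: one 0/1 list per neuron class; entry i says whether the class
    lies in set i (hypothetical encoding, unit weights assumed).
    Returns a dict mapping index tuples (i_1, ..., i_order) to entries.
    """
    n = len(vectors[0])
    tensor = {}
    for idx in product(range(n), repeat=order):
        # A class contributes 1 exactly when it belongs to every indexed set.
        tensor[idx] = sum(all(x[i] for i in idx) for x in vectors)
    return tensor

# Three sets over four unit-weight classes.
classes = [
    [1, 0, 0],  # only in set 0
    [1, 1, 0],  # in sets 0 and 1
    [1, 1, 1],  # in all three sets
    [0, 0, 1],  # only in set 2
]
T = intersection_tensor(classes, order=2)

# The same Venn diagram written as explicit sets of class ids.
sets = [{u for u, x in enumerate(classes) if x[i]} for i in range(3)]
```

Here `T[(0, 1)]` agrees with `len(sets[0] & sets[1])`, and a diagonal entry such as `T[(0, 0)]` is the size of set 0. Non-unit weights would multiply each class's contribution by its weight.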
2 Notations and preliminaries
We denote the set by . For a matrix , we denote the minimum and maximum singular values of by and . We use to denote the standard inner product. We denote the tensor product of two vectors and by which belongs to . We use the notation to denote
By abuse of notation we identify tensors with multilinear maps from to . In other words we let denote . We also use the notation to denote the multilinear map from to given by:
In general we can use in place of any of the arguments of . So for example is interpreted as living in . With a slight abuse of notation we let some of the inputs of be merged together by tensor operations. In other words we let be the same as .
We use to denote the standard basis of . For a tuple of coordinates we let denote . With this notation, the entry corresponding to coordinate of a tensor can be written as .
3 Tensor decomposition
Suppose that we have a finite universe of elements with a vector assigned to each . Our goal is to recover ’s by observing . A necessary condition is for ’s to be linearly independent, otherwise it is an easy exercise to show that there is another decomposition for some positive weights not all equal to . The framework introduced by Bhaskara et al. [2014] shows that linear independence is not just necessary, but up to a constant factor blowup in , it is sufficient. A more detailed account is given in supplementary materials.
We also use another trick from this framework which allows us to replace symmetric tensors with asymmetric ones. If we divide the coordinates into roughly equal-sized parts and define to be the projection of onto the th part, then is a subtensor of . So linear independence of these tensors proves linear independence of ’s. The advantage of this trick is that when we introduce perturbations to , we do not have to worry about consistently perturbing the same coordinates and we can potentially use independent randomness. For simplicity of notation, from here on, we use (as opposed to ) to denote the dimension of each . So now we can work with the following tensor:
Our main result is that the components of this sum are robustly linearly independent, assuming the components are randomly perturbed. We remark that this implies robust linear independence of as well, so we can recover them from the sum .
We first define our model of perturbations:
Definition 2.
Assume that a vector is drawn according to some distribution . We call a nondeterministic distribution if for every coordinate and any interval of the form we have
where represents the projection of onto the coordinates .
For a set of random vectors , we call their joint distribution nondeterministic iff their concatenation is nondeterministic. In our setting, we will assume that for each , the vectors are chosen from a nondeterministic distribution.
Two examples of nondeterministic perturbations can be obtained as follows:
Example 3.
Suppose that each is chosen adversarially from , but then each bit is independently flipped with some probability . This distribution is nondeterministic.
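As a rough illustration of this bit-flip model, the sketch below perturbs an adversarial starting point (five identical hypercube vectors, whose tensor powers are trivially dependent) and compares the exact rank of the flattened second tensor powers before and after the flips. Everything here is an assumption made for illustration: the helper names, the flip probability 0.3, and the use of symmetric tensor powers with exact rank rather than the asymmetric tensors and singular-value bounds used in the analysis.

```python
import random
from fractions import Fraction
from itertools import product

def flip_bits(x, p, rng):
    """Flip each bit of the 0/1 list x independently with probability p."""
    return [b ^ 1 if rng.random() < p else b for b in x]

def flattened_power(x, order):
    """Flatten the order-fold tensor power of x into a plain list."""
    out = []
    for idx in product(range(len(x)), repeat=order):
        entry = 1
        for i in idx:
            entry *= x[i]
        out.append(entry)
    return out

def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in r] for r in rows]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            if f != 0:
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

rng = random.Random(0)
start = [[1, 0, 1, 0, 1, 0]] * 5          # adversarial: five identical vectors
noisy = [flip_bits(x, 0.3, rng) for x in start]
rank_before = rank([flattened_power(x, 2) for x in start])   # all rows equal
rank_after = rank([flattened_power(x, 2) for x in noisy])
```

Before the perturbation the rank is 1, since all rows coincide; comparing it with `rank_after` shows how much independence the flips restored for this particular seed. Exact rank only detects linear independence, whereas the results in this section concern the quantitative (robust) version.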
Example 4.
Suppose that each is chosen adversarially from , but a standard Gaussian noise of total variance is added to each one. Then for any , this distribution is nondeterministic.
Gaussian perturbations are the model used in Bhaskara et al. [2014]. Our main result is the following:
Theorem 5.
Assume that for each , the concatenation of the dimensional vectors is drawn from a distribution that is nondeterministic. Let be the matrix whose columns are given by flattened for various . Then, assuming , we have
This theorem shows how the nondeterministic property ensures robust linear independence. To prove it, we use a strategy similar to Bhaskara et al. [2014], by proving a bound on the leave-one-out distance. The leave-one-out distance is closely related to , and only differs from it by a factor of at most Bhaskara et al. [2014]. It is enough to prove that for any fixed
with probability at least . Here measures the distance of a vector to the closest point in a linear subspace. A union bound implies the leave-one-out distance for all is large. As in Bhaskara et al. [2014], we simplify the analysis by treating as a generic linear subspace , and only using the fact that . Noting that , it is enough to prove the following
Lemma 6.
Assume that vectors are drawn according to a nondeterministic distribution. Further assume that is a subspace of dimension at most . Then
In the rest of this section we prove lemma 6.
Let be the linear subspace of all tensors that vanish on , or in other words have zero dot product with every member of . Then . We will show that with high probability there is an element such that and
This implies that
and the proof would be complete.
We find it instructive to first prove this fact for and then for general .
3.1 The case
Proof of lemma 6 for .
We will generate a sequence , such that for all . This ensures that . We will then show that
(1) 
We will first pick to be any nonzero element of . By rescaling, we can assume that and that for some . Let us call the pivot point of . By rearranging the coordinates we can assume without loss of generality that . In other words and .
In order to pick , consider the subspace . This subspace has dimension at least , and we can pick to be any nonzero element of it. As before, we can without loss of generality and by scaling assume that and .
When picking , we pick any nonzero element of and by rescaling and rearranging the coordinates assume that and . Thus we make sure that the pivot point of is . A keen observer would notice that can also be obtained by a modified Gaussian elimination procedure run on some basis of the space .
Now that we have fixed it remains to prove eq. 1.
To do this, let us fix the coordinates of the random vector one by one, starting from and going backwards to . Once we have fixed we can argue about the probability of the event . Since for , we have
But is a constant once we have fixed . So if and only if . Because is distributed according to a nondeterministic distribution, this event happens with probability at most . In other words
If this event does not occur, we are already done. Otherwise we can condition on , and look at the event . Once we condition on , this event becomes independent of the previous event and we can again upper bound its probability by . So we have
which implies
By continuing this, in the end we get
which is the complement of eq. 1. ∎
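The modified Gaussian elimination mentioned in the proof can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure verbatim: the function name `pivoted_vectors` is hypothetical, the input is assumed to be a linearly independent basis of the subspace, exact rational arithmetic is used for clarity, and instead of rearranging coordinates the pivot positions are simply recorded.

```python
from fractions import Fraction

def pivoted_vectors(basis):
    """Return (pivot, vector) pairs built from a basis of a subspace.

    Each returned vector lies in the span of the basis, has largest absolute
    entry 1, attained with value exactly 1 at its pivot, and is zero at the
    pivots of all previously returned vectors.
    Assumes the rows of `basis` are linearly independent.
    """
    rows = [[Fraction(v) for v in r] for r in basis]
    out = []
    while rows:
        v = rows.pop(0)                       # any nonzero element works
        p = max(range(len(v)), key=lambda i: abs(v[i]))
        v = [a / v[p] for a in v]             # rescale: v[p] == 1, sup-norm 1
        out.append((p, v))
        # Restrict the remaining subspace to vectors vanishing at coordinate p.
        rows = [[a - r[p] * b for a, b in zip(r, v)] for r in rows]
    return out

result = pivoted_vectors([[2, 4, 1], [1, 1, 1]])
```

With this structure, the inner product of the k-th vector with a random vector depends on the pivot coordinate with coefficient exactly 1 and does not involve earlier pivot coordinates, which is what allows the coordinates to be revealed one at a time in the argument above.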
3.2 The general case
Here we describe a structure that we name echelon tree. This definition is motivated by the Gaussian elimination procedure for matrices that produces an echelon form. Our definition can be seen as a generalization of this form for tensor spaces.
We first describe an index tree for : Consider an abstract rooted tree of height where the nodes at level are labeled by different partial indices from ; the root has the empty label and resides at level , and all leaves reside at level . We require the indices to be consistent with the tree structure, i.e., all children (and by extension descendants) of a node labeled must contain as the prefix of their label. We further assume that is ordered, i.e., each node of has an ordering over its children. This enables us to talk about postorder traversal of the tree, a linear ordering of the nodes of the tree, which we denote by the binary relation . For two nodes labeled and , we let exactly when (i) is a descendant of or (ii) there are ancestors of with a common parent who places before (according to the ordering induced by the parent on its children).
Definition 7.
An index tree for is a height tree of partial indices together with a postorder traversal ordering on its nodes as described above.
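To make the ordering concrete, here is a small sketch of an ordered index tree and its postorder traversal. The nested-list encoding, the function names, and the example tree are assumptions made for illustration: each node is labeled by the tuple of indices on the path from the root, so children automatically extend their parent's label as a prefix.

```python
def postorder(tree, label=()):
    """Yield the labels of an ordered index tree in postorder.

    tree: list of (index, subtree) pairs giving the ordered children of the
    current node; a node's label is the tuple of indices on the path to it.
    """
    for index, subtree in tree:
        yield from postorder(subtree, label + (index,))
    yield label

# Height-2 tree: the root has children 3 and 1 (in that order), each with two leaves.
tree = [
    (3, [(0, []), (2, [])]),
    (1, [(2, []), (1, [])]),
]
order = list(postorder(tree))
position = {lab: k for k, lab in enumerate(order)}

def precedes(a, b):
    """True when label a comes strictly before label b in postorder."""
    return position[a] < position[b]
```

In this example `precedes((3, 0), (3,))` holds because the first node is a descendant of the second, and `precedes((3, 2), (1, 2))` holds because their ancestors `(3,)` and `(1,)` share a parent that orders `(3,)` first.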
We emphasize that nodes of an index tree have different labels, so we consider the partial indices the same as the nodes. For example, an index tree of height is identical to an ordered list of elements in , with no repetitions allowed. Next we define an echelon tree.
Definition 8.
An echelon tree is an index tree where each leaf is additionally labeled by an element . We require that and that for every node that appears before in the postorder traversal, i.e., , the following identity holds:
Note that the identity in the above definition is requiring an entire subarray of to be zero. For example a height 1 echelon tree is a list of unique indices of together with vectors such that has zeros in the entries and has a nonzero th entry. Notice the similarity to the echelon form obtained by Gaussian elimination in a matrix. In particular, for a height 1 echelon tree, the vectors must be linearly independent.
We say that is an echelon tree for the linear subspace if for all leaves , we have . Notice that we can collapse or flatten consecutive levels of an echelon tree, and the result would remain an echelon tree. In this operation, nodes of a particular level are removed, and each orphaned node of level is assigned to its grandparent (of level ). We then treat the indices as coming from , i.e., we merge the st level indices. This also corresponds to partially flattening tensors and considering them as elements of . It is easy to check that these operations preserve the properties in definition 8:
Fact 9.
Collapsing an echelon tree at level produces an echelon tree.
The main question we would like to address here is how large of an echelon tree can be constructed for a subspace . For example, for one can get a full tree, where nodes at level have branching factor , by simply placing the standard basis for at the leaves. We measure the size of a tree by its fractional branching factor.
Definition 10.
An echelon tree has fractional branching if each node at level has at least children. For a single number , we say has fractional branching when it has fractional branching .
Note that fractional branching implies that the tree has at least leaves. On the other hand, repeated applications of fact 9 on the echelon tree would produce a height 1 echelon tree, and we have already observed that the vectors assigned to the leaves in such a tree must be linearly independent. So this implies that . There is a partial inverse to this statement: If has dimension , then there is an echelon tree with fractional branching for . However, this fact is not “robust”, since the elements of assigned to the leaves can have arbitrarily small or large entries. Instead we prove the following:
Theorem 11.
If has dimension , then there is an echelon tree with fractional branching for such that for every leaf we have and .
Let us first see why theorem 11 is enough to prove lemma 6.
Proof of lemma 6 for general .
Note that implies that . So it suffices to show that for some with high probability.
Let us say that an echelon tree is large when for all leaves . Theorem 11 guarantees that the echelon tree produced by it is large.
Our strategy is to fix in that order, and simultaneously reduce the height of our echelon tree by each time. When we fix , we can get a smaller echelon tree in the following way: For each leaf in the echelon tree, consider the reduced tensor as a candidate tensor for the parent of . Now let be a node of level . Its children have produced candidate tensors for . Pick the candidate with the highest to be . In this way we have removed the lowest level of the tree and have assigned appropriate tensors to the new leaves.
Our goal is to prove that if we start with an large echelon tree, then with high probability the next echelon tree is large. Inductively this would prove that with high probability over the choice of , we have for some leaf of the original echelon tree, completing the proof.
For a fixed node of level , we want to show that the quantity is at least in magnitude for some child of . But this is very similar to the case of lemma 6, which we have already proved. The difference is that the pivots are not necessarily equal to , but are at least in magnitude. This implies that
The number of nodes at level is at most , so by a union bound, we get that with probability at least , the tree produced at the next level is large (the union bound is over fewer than events, each corresponding to one ). Induction completes the proof. ∎
Now we give a proof of theorem 11. We use induction to prove a stronger version. Theorem 11 will be a corollary of the following by setting .
Theorem 12.
If is a subspace, and are such that
then there is an echelon tree for with fractional branching such that for each leaf we have and .
Proof.
We use induction on . For the base case of , we have and we want an echelon tree with branching factor . We have already proved this case.
Now assume we have proved the statement for and want to prove it for . Consider partially flattening the tensor space by merging the first two dimensions, i.e., considering as a subspace of . Let us fix such that the premise of the induction hypothesis holds and we can get an echelon tree of height with fractional branching . Nodes at level of this tree have indices in , and there are of them. Considering these indices as living in , by the pigeonhole principle at least of them will have the same first component; let’s call this component . We can now extract the subtrees of these elements and join them into an echelon tree of height . The common parent of these nodes will have index . So far we have constructed an echelon tree of height with fractional branching .
Now consider the subspace . We think of as living in , since index has been eliminated from the first dimension. We can again apply the induction hypothesis to this space and as long as the premise holds obtain an echelon tree of height with fractional branching . We can apply the pigeonhole principle again to find level nodes having the same first index . We extract a height echelon tree from them and join this with the height echelon tree we already have. At the end we will have an echelon tree with fractional branching .
Suppose we have repeated this procedure many times and currently have a height echelon tree with fractional branching . As long as the premise of the induction hypothesis holds we can grow this echelon tree. The current subspace is which lives in . The dimension of this subspace is at least . So the premise of the induction hypothesis holds as long as
This means that as long as , we can grow the echelon tree.
To finish the proof, we set , which means that while , we can grow the echelon tree. So when this procedure stops we have an echelon tree with fractional branching . ∎
3.3 Implications for the main question
Our result, theorem 5, together with results from Bhaskara et al. [2014] (see the supplementary material), implies that under very mild assumptions we can recover from their wise intersections as long as . These mild assumptions are necessary to prevent adversarially constructed examples that have no hope of unique recovery.
To get a sense of the mild assumptions that we need, let us discuss the parameters that appear in theorem 5. We assume that is a constant that does not grow with . We can take to be some fixed constant as well. For example , or even . If we perturb our cell assemblies according to example 3, i.e., flip assembly memberships for each neuron class and assembly pair with probability , how large of a do we need for the conditions of theorem 5 and Bhaskara et al. [2014] to be satisfied? The distribution we get for s is going to be nondeterministic as long as . So is a constant. The only condition we need is now for the failure probability to be small. This roughly translates to
which will be satisfied for . In other words, we only have to flip each coordinate of with probability . On average, each neuron’s membership will be changed in about of the assemblies, which is a very small fraction of the assemblies. For slightly larger values of , e.g., , the probability of failure becomes exponentially small similar to Bhaskara et al. [2014].
We also assumed that for all . In general this is not needed. As long as the weights are in a range whose upper bound is at most a polynomially bounded factor larger than the lower bound, we can absorb the weights into the vector and the running time and accuracy will only suffer by a polynomially bounded factor.
We also remark that recovering a vector within an additive error of is the same as exact recovery (by rounding the coordinates). So by setting the recovery error (see supplementary material) to we get exact recovery.
Finally, we remark that even though we are mostly interested in the case where , our dependencies on seem to be better than the results of Bhaskara et al. [2014] even in the setting of Gaussian perturbations. In particular, our running time (as well as our tolerance for error) grows polynomially with , whereas the running time of Bhaskara et al. [2014] grows with . When adding Gaussian noise of total variance as in example 4, we can treat our vectors as coming from a nondeterministic distribution. This means our probability of failure will be at most . To have a fair comparison, we need to allow for the number of components to be roughly half the total dimension, so we need to let . So the probability of failure will be roughly . For large enough values of this is much better than the guarantee of of Bhaskara et al. [2014].
4 Association graphs and the soft model
When the number of observations is smaller than what is needed for reconstruction, we can still ask whether there exists some Venn diagram that is consistent with the observations. Which classes of weighted graphs (or hypergraphs) can be represented by Venn diagrams?
Interestingly, a similar model was formulated almost three decades ago, motivated by quantum mechanics and spin glass systems, and a mathematical object called the correlation polytope was defined to frame that investigation Pitowsky [1991]. It is not hard to show that membership in the polytope is an NP-hard problem and that natural optimization variants of it are hard to approximate.
In this section we formulate a promise version of the problem where either the intersection is above a certain threshold (corresponding to association) or below another (corresponding to nonassociation) which seems to be more tractable.
More precisely, we are given a graph that is unweighted. The nodes still stand for assemblies of neurons, all of the same size , out of a universe of neurons, and the edges signify association; the difference is that, in this model, if two assemblies are associated then they have an intersection of size at least ; whereas if they are not, then their intersection is at most . The intended relationship between these numbers is that is much larger than (we take it to be a power of ), and is in turn much larger than , while is quite a bit larger than . To fix ideas, in the sequel we take and small constant fractions of ; in the experiment in Ison et al. [2015], De Falco et al. [2016] and are found to be about and of , respectively. We call a graph representable with parameters if every node of can be associated with a set of neurons such that for any two adjacent nodes the corresponding sets have intersection at least , while for any two nonadjacent nodes the corresponding sets have intersection at most . The question is, which graphs are representable?
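The definition of representability can be checked mechanically for a candidate assignment of sets. The sketch below is only an illustration; the function name and the parameter names `size`, `high`, and `low` (for the common assembly size and the association and non-association thresholds) are hypothetical stand-ins for the symbols in the definition.

```python
from itertools import combinations

def represents(edges, sets, size, high, low):
    """Check whether `sets` represents the graph with the given thresholds.

    edges: iterable of node pairs; sets: dict mapping node -> set of neurons.
    size, high, low: hypothetical names for the common assembly size and the
    association / non-association intersection thresholds.
    """
    adjacent = {frozenset(e) for e in edges}
    if any(len(s) != size for s in sets.values()):
        return False
    for u, v in combinations(sets, 2):
        overlap = len(sets[u] & sets[v])
        if frozenset((u, v)) in adjacent:
            if overlap < high:      # associated pairs need a large overlap
                return False
        elif overlap > low:         # non-associated pairs need a small overlap
            return False
    return True

# A path a - b - c realized by sliding windows over eight neurons.
assemblies = {"a": {0, 1, 2, 3}, "b": {2, 3, 4, 5}, "c": {4, 5, 6, 7}}
path_ok = represents([("a", "b"), ("b", "c")], assemblies, size=4, high=2, low=0)
triangle_ok = represents(
    [("a", "b"), ("b", "c"), ("a", "c")], assemblies, size=4, high=2, low=0
)
```

The same assemblies represent the path but not the triangle, because `a` and `c` are disjoint. Deciding whether any valid assignment exists for a given graph is the harder question addressed by the theorems below.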
Theorem 13.
Any graph of maximum degree at most is representable, and so is any tree of maximum degree .
The bound follows from the fact that the edges of a regular Eulerian graph can be decomposed into cycles, while the follows from the theory of block designs. Recalling that is a small fraction of , we conclude that rather rich and complex “association graphs” can be represented in principle. But can these sophisticated combinatorial constructions be carried out with surgical precision in the wet chaos of the brain?
Here is a more realistic framework which we call the soft model: Suppose that we are given an association graph . We wish to determine whether a model of exists, i.e., sets corresponding to nodes of whose pairwise intersections realize according to the rules above involving and . We wish to create sets of expected size representing the nodes, starting from the universe of neurons and executing instructions of the following form (in the following are previously constructed sets, and is the set being constructed):
where by we denote the result of sampling each node in set with probability — a simple and realistic enough primitive. The question is, which graphs can be realized in such a way that the intended relations between the nodes and their intersections are not corrupted, with high enough probability, by the randomness of the process? We can show the following:
Theorem 14.
Any graph with maximum degree can be realized in the soft model with high probability.
References

 Anandkumar et al. [2012] Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pages 33–1, 2012.
 Barak et al. [2015] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 143–151. ACM, 2015.
 Bhaskara et al. [2014] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603. ACM, 2014.
 Buzsáki [2010] György Buzsáki. Neural syntax: cell assemblies, synapsembles, and readers. Neuron, 68(3):362–385, 2010.
 Chang [1996] Joseph T Chang. Full reconstruction of markov models on evolutionary trees: identifiability and consistency. Mathematical biosciences, 137(1):51–73, 1996.
 De Falco et al. [2016] Emanuela De Falco, Matias J Ison, Itzhak Fried, and Rodrigo Quian Quiroga. Longterm coding of personal and universal associations underlying the memory web in the human brain. Nature communications, 7:13408, 2016.
 De Lathauwer et al. [2007] Lieven De Lathauwer, Josphine Castaing, and JeanFranois Cardoso. Fourthorder cumulantbased blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007.
 Ge and Ma [2015] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sumofsquares algorithms. arXiv preprint arXiv:1504.05287, 2015.
 Ge and Ma [2017] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3656–3666, 2017.
 Goyal et al. [2014] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 584–593. ACM, 2014.
 Håstad [1990] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644–654, 1990.
 Hillar and Lim [2013] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.
 Hopkins et al. [2016] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191. ACM, 2016.
 Hsu et al. [2012] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
 Ison et al. [2015] Matias J Ison, Rodrigo Quian Quiroga, and Itzhak Fried. Rapid encoding of new memories by individual neurons in the human brain. Neuron, 87(1):220–230, 2015.
 Kolda and Mayo [2011] Tamara G Kolda and Jackson R Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095–1124, 2011.
 Leurgans et al. [1993] SE Leurgans, RT Ross, and RB Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
 Ma et al. [2016] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 438–446. IEEE, 2016.
 Mossel and Roch [2005] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the thirty-seventh annual ACM symposium on theory of computing, pages 366–375. ACM, 2005.
 Pitowsky [1991] Itamar Pitowsky. Correlation polytopes: their geometry and complexity. Mathematical Programming, 50(1):395–414, 1991.
 Quiroga [2012] Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature reviews. Neuroscience, 13(8):587, 2012.
Appendix A Reduction to linear independence
We now state the main result of Bhaskara et al. [2014]:
Theorem 15 (Bhaskara et al. [2014]).
Let for some constant . Assume that for each and , we choose a vector by starting from an adversarially chosen vector of norm at most and adding a standard Gaussian noise with variance to each coordinate of . Now define the order tensor
and assume that we get as input where is an order (measurement error) tensor, whose entries are bounded by for some . Then there is an algorithm that recovers all the tensors up to an additive error. This algorithm runs in time and succeeds with probability .
The algorithm behind theorem 15 is based on a robust version of order-3 tensor decomposition, widely known as “simultaneous diagonalization” or Chang’s lemma Bhaskara et al. [2014], Leurgans et al. [1993], Chang [1996]. Roughly speaking, the tensor
can be viewed as an order-3 tensor by grouping some of the factors together:
(2) 
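As a concrete illustration of this grouping, the following NumPy sketch builds a small sum of rank-one tensors and reshapes it into an order-3 tensor. The sizes (n = 3, R = 4, order 5, grouped as 2 + 2 + 1) and the helper name rank_one are assumptions chosen for illustration only; they are not taken from Bhaskara et al. [2014].

```python
import numpy as np

def rank_one(vectors):
    """Outer product of a list of 1-D vectors, returned as a dense tensor."""
    t = vectors[0]
    for v in vectors[1:]:
        t = np.multiply.outer(t, v)
    return t

rng = np.random.default_rng(0)
n, R, order = 3, 4, 5  # illustrative sizes (assumptions)

# factors[i, j] is the j-th factor of the i-th rank-one component.
factors = rng.standard_normal((R, order, n))
T = sum(rank_one(list(factors[i])) for i in range(R))  # shape (n,) * 5

# Group modes (1, 2), (3, 4) and (5): an n^2 x n^2 x n order-3 tensor.
T3 = T.reshape(n**2, n**2, n)

def flat(vs):
    """Flatten the outer product of a group of factors into one long vector."""
    return rank_one(vs).reshape(-1)

# The same order-3 tensor built directly from the grouped factors.
T3_direct = sum(
    rank_one([flat(list(factors[i, 0:2])),
              flat(list(factors[i, 2:4])),
              factors[i, 4]])
    for i in range(R)
)
grouping_matches = np.allclose(T3, T3_direct)
```

The reshape and the flattened outer products use the same row-major index order, which is why the two constructions agree entry by entry.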
Then the algorithm from Bhaskara et al. [2014] depends on using a robust version of Chang’s lemma to decompose . It only needs the collection of first factors to be “robustly” linearly independent, the collection of the second factors to be “robustly” linearly independent, and the collection of the third factors to “robustly” not contain vectors parallel to each other (a weaker notion than linear independence). We give the precise required conditions below:
Theorem 16 (Bhaskara et al. [2014]).
Consider the tensor
and assume that the following conditions are satisfied:
1. The condition numbers of the matrices are bounded by , where is formed by taking s as columns and by taking s as columns,
2. For any , the vectors and are far from being parallel: ,
3. All of the vectors have norms bounded by , a polynomially bounded quantity.
Then there is an efficient algorithm, running in time , that recovers for all within additive error , by only observing where is a noise tensor whose entries are bounded by .
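To make the role of these conditions concrete, here is a minimal sketch of the classical noiseless simultaneous diagonalization procedure for an order-3 tensor. It is not the robust version analyzed in Bhaskara et al. [2014]; the function name jennrich, the dimensions, and the random test instance are assumptions made for illustration. It relies on the first two factor collections being linearly independent, while the third factors only need to be pairwise non-parallel (here there are 4 of them in dimension 3, so they cannot be linearly independent).

```python
import numpy as np

def jennrich(T, R, rng):
    """Recover the R rank-one terms of T = sum_i a_i (x) b_i (x) c_i.

    Noiseless sketch: assumes the a_i are linearly independent, the b_i are
    linearly independent, and no two c_i are parallel.
    """
    m1 = T.shape[0]
    x = rng.standard_normal(T.shape[2])
    y = rng.standard_normal(T.shape[2])
    # Contract the third mode: Mx = A diag(<c_i, x>) B^T, and likewise My.
    Mx = np.einsum("abc,c->ab", T, x)
    My = np.einsum("abc,c->ab", T, y)
    # Mx My^+ = A diag(<c_i, x> / <c_i, y>) A^+, so the a_i are eigenvectors
    # with nonzero eigenvalues; these are distinct when no two c_i are parallel.
    vals, vecs = np.linalg.eig(Mx @ np.linalg.pinv(My))
    top = np.argsort(-np.abs(vals))[:R]
    A = np.real(vecs[:, top])  # columns are the a_i up to scaling
    # With A fixed, the remaining factors solve a linear system.
    W = np.linalg.pinv(A) @ T.reshape(m1, -1)  # row i is b_i (x) c_i, rescaled
    return [np.outer(A[:, i], W[i]).reshape(T.shape) for i in range(R)]

rng = np.random.default_rng(0)
m, m3, R = 6, 3, 4  # illustrative sizes (assumptions)
A0 = rng.standard_normal((m, R))
B0 = rng.standard_normal((m, R))
C0 = rng.standard_normal((m3, R))
true_terms = [np.einsum("a,b,c->abc", A0[:, i], B0[:, i], C0[:, i])
              for i in range(R)]
T = sum(true_terms)

recovered = jennrich(T, R, rng)
# Relative distance from each true term to its closest recovered term.
max_err = max(
    min(np.linalg.norm(r - t) for r in recovered) / np.linalg.norm(t)
    for t in true_terms
)
```

The sketch returns whole rank-one terms rather than individual factors, which sidesteps the scaling ambiguity between factors of the same term.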
Condition 1 is arguably the most difficult one to satisfy. Condition 2 is satisfied with high probability for many distributions of interest, but it can also be automatically reduced to condition 1 if one is willing to change the grouping in eq. 2. If instead of having three groups, the first two composed of factors and the last one composed of one factor, we create three equal-sized groups (each consisting of factors), then the last group would also have a bounded condition number (by an extension of condition 1) and will automatically satisfy condition 2. This makes the dependency on worse but would still give us something similar to theorem 15 with replaced by . Finally, note that condition 3 is also automatically satisfied with very high probability for Gaussian perturbations and also for our model, in which we sample vectors from the hypercube . We assume that is not only nondeterministic but also that it satisfies condition 3 with high probability.
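The effect of random perturbation on condition 1 can be checked numerically on a toy instance. The sketch below is only an illustration under assumed parameters (n = 4, R = 10, perturbation scale rho = 0.1, two factors per group, and a worst-case starting point in which all vectors coincide); the helper name khatri_rao_columns is hypothetical and none of these values come from the paper.

```python
import numpy as np

def khatri_rao_columns(X, Y):
    """Matrix whose i-th column is the flattened x_i (x) y_i.

    X and Y have shape (n, R); the result has shape (n * n, R).
    """
    n, R = X.shape
    return np.einsum("ai,bi->abi", X, Y).reshape(n * n, R)

rng = np.random.default_rng(1)
n, R, rho = 4, 10, 0.1  # illustrative values (assumptions)

# Adversarial starting point: every vector is the all-ones vector.
base = np.ones((n, R))
U_adv = khatri_rao_columns(base, base)
rank_adv = np.linalg.matrix_rank(U_adv)  # all columns coincide

# Smoothed instance: add independent Gaussian noise to every coordinate.
X = base + rho * rng.standard_normal((n, R))
Y = base + rho * rng.standard_normal((n, R))
U = khatri_rao_columns(X, Y)

sing = np.linalg.svd(U, compute_uv=False)
rank_smoothed = np.linalg.matrix_rank(U)
cond = sing[0] / sing[-1]  # finite once the columns are linearly independent
```

Note that R exceeds n here, so the perturbed vectors themselves cannot be linearly independent; it is their tensor products, living in dimension n squared, that become independent.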
In section 3 we focus only on proving condition 1 in theorem 16. To make the notation simpler we replace by , and assume s are tensors of factors. In order to bound the condition number of in theorem 16, we need to lower-bound the minimum singular value and upper-bound the maximum singular value of . An upper bound on is readily given by condition 3 of theorem 16. The matrix has columns with norms bounded by and therefore