# Learning Binary Latent Variable Models: A Tensor Eigenpair Approach

Latent variable models with hidden binary units appear in various applications. Learning such models, in particular in the presence of noise, is a challenging computational problem. In this paper we propose a novel spectral approach to this problem, based on the eigenvectors of both the second order moment matrix and third order moment tensor of the observed data. We prove that under mild non-degeneracy conditions, our method consistently estimates the model parameters at the optimal parametric rate. Our tensor-based method generalizes previous orthogonal tensor decomposition approaches, where the hidden units were assumed to be either statistically independent or mutually exclusive. We illustrate the consistency of our method on simulated data and demonstrate its usefulness in learning a common model for population mixtures in genetics.


## 1 Introduction

In this paper we propose a spectral method for learning the following binary latent variable model, shown in Figure 1. The hidden layer h = (h1, …, hd)⊤ consists of d binary random variables with an unknown joint distribution Ph. The observed vector x ∈ R^m of features is modeled as

 x=W⊤h+σξ, (2)

where W ∈ R^{d×m}, with d ≤ m, is an unknown weight matrix assumed to be of full rank d. Here, σ is the noise level and ξ ∈ R^m is an additive noise vector independent of h, whose m coordinates are all i.i.d. zero mean and unit variance random variables. For simplicity we assume it is Gaussian, though our method can be modified to handle other noise distributions.
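Concretely, sampling from model (2) can be sketched as follows (a minimal numpy sketch; the independent-Bernoulli choice for h and all dimensions are illustrative assumptions, since the model allows an arbitrary joint distribution Ph):

```python
import numpy as np

def sample_model(W, n, sigma, p, rng):
    """Draw n i.i.d. samples x = W^T h + sigma * xi from model (2).

    W : (d, m) full-rank weight matrix.
    p : (d,) marginal activation probabilities for the hidden units
        (independent Bernoulli draws -- an illustrative choice only; the
        model itself allows an arbitrary joint distribution over h).
    Returns X of shape (m, n) and the hidden matrix H of shape (d, n).
    """
    d, m = W.shape
    H = (rng.random((d, n)) < p[:, None]).astype(float)  # binary hidden units
    X = W.T @ H + sigma * rng.standard_normal((m, n))    # additive Gaussian noise
    return X, H

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))          # d = 3 hidden units, m = 8 features
X, H = sample_model(W, n=1000, sigma=0.1, p=np.full(3, 0.5), rng=rng)
```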

The model in (2) appears, for example, in overlapping clustering [8, 7], in various problems in bioinformatics [45, 9, 47], and in blind source separation [49]. A special instance of model (2) is the Gaussian-Bernoulli restricted Boltzmann machine (G-RBM), where the distribution Ph is further assumed to have a parametric energy-based structure [25, 15, 50]. G-RBMs were used, e.g., in modeling human motion [48] and natural image patches [37].

Given n i.i.d. samples x1, …, xn from model (2), the goal is to estimate the weight matrix W. A common approach for learning W is maximum likelihood. As the likelihood is non-convex, common optimization schemes include the EM algorithm and alternating least squares (ALS). In addition, several works developed iterative methods specialized to G-RBMs [24, 15]. All these methods, however, often lack consistency guarantees and may not be well suited for large datasets due to their potentially slow convergence. This is not surprising, as learning under model (2) is believed to be computationally hard; see for example Mossel and Roch [39].

Over the past years, several works considered variants and specific instances of model (2) under additional assumptions on the distribution Ph or on the weight matrix W. For example, when Ph is a product distribution, the learning problem becomes that of independent component analysis (ICA) with binary signals [28]. In this case, several methods have been derived for estimating W, and under suitable non-degeneracy conditions these were proven to be both computationally efficient and statistically consistent [46, 18, 44, 28, 4, 30]. Similarly, when the hidden units are mutually exclusive, namely h has support {e1, …, ed}, the model is a Gaussian mixture (GMM) with d spherical components with linearly independent means. Efficient and consistent algorithms have been derived for this case as well [38, 2, 3, 26]. Among those, most relevant to this work are orthogonal tensor decomposition methods [4]. Interestingly, these methods can learn some additional latent models, with hidden units that are not necessarily binary, such as latent Dirichlet allocation and other correlated topic models [5].

Learning W given the observed data can also be viewed as a noisy matrix factorization problem. If W is known to be non-negative, then various non-negative matrix factorization methods can be used. Moreover, under appropriate conditions, some of these methods were proven to be computationally efficient and consistent [17, 6]. For general full rank W, the matrix factorization method in Slawski et al. [47] (SHL) exactly recovers W when σ = 0, with a runtime exponential in d. This method, however, can handle only low levels of noise and has no consistency guarantees when σ > 0.

### A tensor eigenpair approach

In this paper we propose a novel spectral method for learning W which is based on the eigenvectors of both the second order moment matrix and the third order moment tensor of the observed data. We prove that our method is consistent under mild non-degeneracy conditions and achieves the parametric rate OP(n^−1/2) for any noise level σ.

The non-degeneracy conditions we pose are significantly weaker than those required by previous tensor decomposition methods mentioned above. In particular, their assumptions and resulting methods can be viewed as specific cases of our more general approach.

Similarly to the matrix factorization method in Slawski et al. [47], our algorithm has runtime linear in the sample size n, polynomial in the dimension m, and in general exponential in the number of hidden units d. With our current Matlab implementation, most of the runtime is spent on computing the eigenpairs of a tensor. Practically, our method, implemented without any particular optimization, can learn a model with 12 hidden units in less than ten minutes on a standard PC. Furthermore, the overall runtime can be significantly reduced, since the step of computing the tensor eigenpairs can be embarrassingly parallelized.

### Paper outline

In the next section we fix the notation and provide necessary background on tensor eigenpairs. In Section 3 we introduce our method in the noiseless case σ = 0. The case σ > 0 is treated in Section 4. Experiments with our method and comparison to other approaches appear in Section 5. All proofs are deferred to the appendices.

## 2 Preliminaries

### Notation

We abbreviate [d] = {1, …, d} and denote by ei the i-th unit vector, with entries (ei)j = δij. We slightly abuse notation and view a matrix also as the set of its columns, namely w ∈ W is some column of W and span(W) is the span of all its columns. The unit sphere in R^d is denoted by S^{d−1}.

A tensor T ∈ R^{d×d×d} is symmetric if Ti1i2i3 = Tiπ(1)iπ(2)iπ(3) for all permutations π of (1, 2, 3). Here, we consider only symmetric tensors. T can also be seen as a multi-linear operator: for matrices Wk ∈ R^{d×mk} with k = 1, 2, 3, the tensor-mode product, denoted T(W1, W2, W3), is a m1×m2×m3 tensor whose (i1, i2, i3)-th entry is

 ∑j1,j2,j3∈[d]W1j1i1W2j2i2W3j3i3Tj1j2j3.
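In code, this multi-linear product is a single contraction; a minimal numpy sketch (function names are ours):

```python
import numpy as np

def tensor_mode_product(T, W1, W2, W3):
    """Compute T(W1, W2, W3): contract each mode of the tensor T
    (shape d x d x d) with the corresponding matrix (shape d x m_k)."""
    return np.einsum('jkl,ja,kb,lc->abc', T, W1, W2, W3)

# Symmetrize a random tensor for a quick sanity check.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4, 4))
T = (A + A.transpose(0, 2, 1) + A.transpose(1, 0, 2)
       + A.transpose(1, 2, 0) + A.transpose(2, 0, 1) + A.transpose(2, 1, 0)) / 6
I = np.eye(4)
```

Contracting all three modes with the identity leaves T unchanged, which is a convenient correctness check.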

### Tensor eigenpairs

Several types of eigenpairs of a tensor have been proposed. Here, we consider the following definition, termed Z-eigenpairs by Qi [43] and ℓ²-eigenpairs by Lim [36]. Henceforth we just call them eigenpairs.

###### Definition 1.

A pair (u, λ) ∈ S^{d−1} × R is an eigenpair of T if

 T(I,u,u)=λu  and  ∥u∥=1. (3)

Note that if (u, λ) is an eigenpair then the eigenvalue is simply λ = T(u, u, u). In addition, (−u, −λ) is also an eigenpair. Following common practice, we treat these two pairs as one. So, without loss of generality, we make the convention that λ ≥ 0.

In contrast to the matrix case, the number of eigenvalues of a tensor can be much larger than its dimension d. As shown by Cartwright and Sturmfels [12], for a d×d×d tensor there are at most finitely many of them. With precise definitions appearing in Cartwright and Sturmfels [12], for a generic tensor, all its eigenvalues have multiplicity one and the number of eigenpairs is at most 2^d − 1.

In principle, computing the set of all eigenpairs of a general symmetric tensor is NP-hard [23]. Nevertheless, several methods have been proposed for computing at least some eigenpairs, including iterative higher-order power methods [31, 32], homotopy continuation [13], semidefinite programming [16], and iterative Newton-based methods [29, 22]. We conclude this section with the definition of Newton-stable eigenpairs [29], which are most relevant to our work.

### Newton-stable eigenpairs

Equivalently to (3), eigenpairs of T can also be characterized by the function g: R^d → R^d,

 g(u)=T(I,u,u)−T(u,u,u)⋅u. (4)

It is easy to verify that a pair (u, λ) with ∥u∥ = 1 is an eigenpair of T if and only if g(u) = 0 and λ = T(u, u, u). The stability of an eigenpair is determined by the Jacobian matrix ∇g(u), more precisely, by its projection onto the (d−1)-dimensional subspace orthogonal to u. Formally, let Lu ∈ R^{d×(d−1)} be a matrix with orthonormal columns that span the subspace orthogonal to u and define the projected Jacobian matrix

 Jp(u)=L⊤u∇g(u)Lu. (5)
###### Definition 2.

An eigenpair (u, λ) of T is Newton-stable if the projected Jacobian matrix Jp(u) has full rank d − 1.

The homotopy continuation method in Chen et al. [13] is guaranteed to compute all the Newton-stable eigenpairs of a tensor. Alternatively, Newton-stable eigenpairs are attracting fixed points for the iterative orthogonal Newton correction method (O–NCM) in Jaffe et al. [29]. Moreover, O–NCM converges to any Newton-stable eigenpair at a quadratic rate given a sufficiently close initial guess. Finally, for a generic tensor, all its eigenpairs are Newton-stable.
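To make Definitions 1 and 2 concrete, the residual g(u) of (4), its Jacobian, and the Newton-stability check on the projected Jacobian (5) can be sketched as follows (a minimal numpy sketch; the closed-form gradient for a symmetric tensor below is our own derivation, not code from the paper):

```python
import numpy as np

def g(T, u):
    """Residual g(u) = T(I, u, u) - T(u, u, u) * u from Eq. (4)."""
    Tu = np.einsum('ijk,j,k->i', T, u, u)   # T(I, u, u)
    return Tu - (u @ Tu) * u                # note T(u, u, u) = u^T T(I, u, u)

def grad_g(T, u):
    """Jacobian of g at u for a symmetric T (our own closed form):
    grad g(u) = 2 T(I, I, u) - 3 u T(I, u, u)^T - T(u, u, u) I."""
    M = np.einsum('ijk,k->ij', T, u)        # T(I, I, u)
    t = M @ u                               # T(I, u, u)
    lam = u @ t                             # T(u, u, u)
    return 2.0 * M - 3.0 * np.outer(u, t) - lam * np.eye(len(u))

def orth_complement(u):
    """Orthonormal basis L_u of the subspace orthogonal to u (via SVD)."""
    _, _, Vt = np.linalg.svd(u.reshape(1, -1))
    return Vt[1:].T

def is_newton_stable(T, u, tol=1e-8):
    """Definition 2: J_p(u) = L_u^T grad g(u) L_u has full rank d - 1."""
    L = orth_complement(u)
    Jp = L.T @ grad_g(T, u) @ L
    return np.linalg.matrix_rank(Jp, tol) == len(u) - 1
```

For an orthogonally decomposable tensor with distinct positive diagonal entries, each unit vector ei is an eigenpair, and these checks behave as expected.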

## 3 Learning in the noiseless case

To motivate our approach for estimating the matrix W it is instructive to first consider the ideal noiseless case where σ = 0. In this case, model (2) takes the form x = W⊤h. Our problem then becomes that of factorizing the observed matrix X = [x1, …, xn] ∈ R^{m×n} of n samples into a product of real and binary low-rank matrices,

 Find W∈Rd×m, H∈{0,1}d×n  s.t.  X=W⊤H. (6)

To be able to recover W we first need conditions under which the decomposition of X into W and H is unique. Clearly, such a factorization can be unique at most up to a permutation of its components; we henceforth ignore this degeneracy. A sufficient condition for uniqueness, similar to the one posed in Slawski et al. [47], is that H is rigid. Formally, H is rigid if any non-trivial linear combination of its rows yields a non-binary vector: for all u ≠ 0,

 u⊤H∈{0,1}n ⇔ u∈{e1,…,ed}. (7)

Condition (7) is satisfied, for example, when the columns of H include ei and ei + ej for all i ≠ j ∈ [d].

The following proposition, similar in nature to the (affine constrained) uniqueness guarantee in Slawski et al. [47], shows that under condition (7) the factorization in (6) is unique and fully characterized by the binary constraints.

###### Proposition 1.

Let X = W⊤H with H ∈ {0,1}^{d×n} rigid and W ∈ R^{d×m} of full rank d ≤ m. Let W† ∈ R^{m×d} be the unique right pseudo-inverse of W, so that WW† = Id. Then W and H are unique and for all v ≠ 0,

 v⊤X∈{0,1}n⇔v∈W†. (8)

Hence, under the rigidity condition (7), the matrix factorization problem in (6) is equivalent to the problem of finding the unique set of d non-zero vectors v that satisfy the binary constraints v⊤X ∈ {0,1}^n. The weight matrix W is then obtained as the pseudo-inverse of the matrix whose columns are these d vectors.

### Algorithm outline

We recover W via a two step procedure. First, a finite set V of candidate vectors is computed with a guarantee that W† ⊆ V. Specifically, V is computed from the set of eigenpairs of a tensor, constructed from the low order moments of x. Typically, the size of V will be much larger than d, so in the second step V is filtered by selecting all v ∈ V that satisfy v⊤X ∈ {0,1}^n.

Before describing the two steps in more detail we first state the additional non-degeneracy conditions we pose. To this end, denote the unknown first, second, and third order moments of the latent binary vector h by

 p = E[h] ∈ R^d,  C2 = E[h⊗h] ∈ R^{d×d},  C3 = E[h⊗h⊗h] ∈ R^{d×d×d}.

### Non-degeneracy conditions

We assume the following:

• (I) H is rigid.

• (II) For all i ∈ [d], the matrix (WKLui)⊤(2C3(I,I,ei)−C2)(WKLui) appearing in (28) is non-singular.

Condition (I) implies that the rows of H are linearly independent, so that C2 is full rank. This in turn implies pi > 0 for all i ∈ [d] and that at most one variable hi has pi = 1. Such an “always on” variable can model a fixed bias to x. As far as we know, condition (II) is new and its nature will become clear shortly.

We now describe each step of our algorithm in more detail.

### Computing the candidate set

To compute a set V that is guaranteed to include the columns of W† we make use of the second and third order moments of x,

 M2 = E[x⊗x] ∈ R^{m×m},  M3 = E[x⊗x⊗x] ∈ R^{m×m×m}.

Given a large number of samples n, these can be easily and accurately estimated from the sample X. For simplicity, in this section we consider the population setting where n = ∞, so M2 and M3 are known exactly. M2 and M3 are related to the unknown second and third order moments of h via [4]

 M2=W⊤C2W,  M3=C3(W,W,W). (9)

Since both W and C2 are full rank, the number of latent units can be deduced as d = rank(M2). Since M2 is positive semidefinite of rank d, there is a whitening matrix K ∈ R^{m×d} such that

 K⊤M2K=Id. (10)

Such a K can be computed, for example, by an eigen-decomposition of M2. Although K is not unique, any K that satisfies (10) suffices for our purpose. Define the lower dimensional whitened tensor

 W=M3(K,K,K). (11)

Denote the set of eigenpairs of W by

 U={(u,λ)∈Sd−1×R+:W(I,u,u)=λu}. (12)

Our set of candidates is then

 V={Ku/λ:(u,λ)∈U with λ≥1}⊆Rm. (13)
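Steps (10)-(13) can be sketched end to end as follows (population version; for simplicity the eigenpairs are sought here by plain tensor power iterations from random starts, one of the methods mentioned in Section 2, which in general finds only a stable subset of the eigenpairs):

```python
import numpy as np

def whitening_matrix(M2, d):
    """K in R^{m x d} with K^T M2 K = I_d, via eigen-decomposition of M2."""
    vals, vecs = np.linalg.eigh(M2)
    idx = np.argsort(vals)[::-1][:d]          # top-d eigenpairs of M2
    return vecs[:, idx] / np.sqrt(vals[idx])

def tensor_apply(T, u):
    return np.einsum('ijk,j,k->i', T, u, u)   # T(I, u, u)

def candidates(M2, M3, d, n_starts=200, iters=200, seed=0):
    """Candidate set V = {K u / lambda : (u, lambda) eigenpair, lambda >= 1}."""
    K = whitening_matrix(M2, d)
    W_t = np.einsum('ijk,ia,jb,kc->abc', M3, K, K, K)   # whitened tensor (11)
    rng = np.random.default_rng(seed)
    V = []
    for _ in range(n_starts):
        u = rng.standard_normal(d); u /= np.linalg.norm(u)
        for _ in range(iters):                # tensor power iterations
            u = tensor_apply(W_t, u)
            u /= np.linalg.norm(u)
        lam = u @ tensor_apply(W_t, u)        # eigenvalue = T(u, u, u)
        if lam >= 1 and not any(np.allclose(K @ u / lam, v, atol=1e-6) for v in V):
            V.append(K @ u / lam)
    return V
```

In the mutually exclusive (GMM) case the whitened tensor is orthogonally decomposable, so even plain power iterations recover the relevant candidates, which then coincide with the columns of W†.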

The following lemma shows that under condition (I) the set V is guaranteed to contain the columns of W†.

###### Lemma 1.

Let W be the tensor in (11) corresponding to model (2) with σ = 0 and let V be as in (13). If condition (I) holds then W† ⊆ V. In particular, each (u, λ) in the set of relevant eigenpairs

 U∗={(u,λ)∈U:Ku/λ∈W†} (14)

has the eigenvalue λ = 1/√pi, where pi = P(hi = 1).

### Computing the tensor eigenpairs

By Lemma 1, we may construct a candidate set V that contains W† by first calculating the set U of eigenpairs of W. Unfortunately, computing the set of all eigenpairs of a general symmetric tensor is computationally hard [23]. Moreover, besides the columns of W†, the set V in (13) may contain many spurious candidates, as the number of eigenpairs of W is typically exponential in d, and thus much larger than d [12].

Nevertheless, as discussed in Section 2, several methods have been proposed for computing some eigenpairs of a tensor under appropriate stability conditions. The following lemma highlights the importance of condition (II) for the stability of the eigenpairs in U∗. Note that conditions (I)-(II) do not depend on W, but only on the distribution Ph of the latent variables h.

###### Lemma 2.

Let W be the whitened tensor in (11) corresponding to model (2) with σ = 0. If conditions (I)-(II) hold, then all (u, λ) ∈ U∗ are Newton-stable eigenpairs of W.

Hence, under conditions (I)-(II), the homotopy method in Chen et al. [13], or alternatively the O–NCM with a sufficiently large number of random initializations [29], is guaranteed to compute a candidate set V which includes all the columns of W†. The next step is to extract W† out of V.

### Filtering

As suggested by Eq. (8), we select the subset of vectors v ∈ V that satisfy the binary constraints,

 ¯V={v∈V:v⊤X∈{0,1}n}. (15)

Indeed, under condition (I), Proposition 1 implies that ¯V = W† and the weight matrix is thus W = (¯V)†.
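In the noiseless setting, the filtering step (15) amounts to keeping the candidates whose projections are binary. A minimal sketch (the explicit numerical tolerance is our own addition):

```python
import numpy as np

def filter_binary(V, X, tol=1e-8):
    """Keep candidates v whose projections v^T X are (numerically) binary,
    as in Eq. (15)."""
    kept = []
    for v in V:
        proj = v @ X                                 # v^T X, shape (n,)
        rounded = np.round(proj)
        if (np.max(np.abs(proj - rounded)) < tol
                and set(np.unique(rounded)) <= {0.0, 1.0}):
            kept.append(v)
    return kept
```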

Algorithm 1 summarizes our method for estimating W in the noiseless case, and has the following recovery guarantee.

###### Theorem 1.

Let X ∈ R^{m×n} be a matrix of n samples from model (2) with σ = 0. If conditions (I)-(II) hold, then Algorithm 1 recovers W exactly.

We note that when σ = 0 and conditions (I)-(II) hold for the empirical latent moments (rather than the population moments), Algorithm 1 exactly recovers W when M2 and M3 are replaced by their finite sample estimates. The matrix factorization method SHL in Slawski et al. [47] also exactly recovers W in the case σ = 0. While its runtime is also exponential in d, practically it may be much faster than our proposed tensor based approach. This is because SHL constructs a candidate set of size 2^d that can be computed by a suitable linear transformation of the fixed set {0,1}^d, as opposed to our candidate set which is constructed from eigenpairs of a tensor. However, SHL does not take advantage of the large number of samples n, since only small sub-matrices of the sample matrix X are used for constructing its candidate set. Indeed, in the noisy case where σ > 0, SHL has no consistency guarantees, and as demonstrated by the simulation results in Section 5 it may fail at high levels of noise. In the next section we derive a robust version of our method that consistently estimates W for any noise level σ.

## 4 Learning in the presence of noise

The method in Section 3 to estimate W is clearly inadequate when σ > 0. However, we now show that by making several adjustments, the two steps of computing the candidate set and its filtering can both be made robust to noise, yielding a consistent estimator of W for any σ.

### Computing the candidate set

As in the σ = 0 case, our goal in the first step is to compute a finite candidate set that is guaranteed to contain accurate estimates for the columns of W†. To this end, in addition to the second and third order moments M2 and M3, we also consider the first order moment μ = E[x] and define the following noise corrected moments,

 M2,σ = M2 − σ²Im,
 M3,σ = M3 − σ² ∑i=1m (μ⊗ei⊗ei + ei⊗μ⊗ei + ei⊗ei⊗μ).

By assumption, the noise ξ satisfies E[ξ] = 0, E[ξ⊗ξ] = Im, and E[ξ⊗ξ⊗ξ] = 0. Thus, similarly to the moment equations in (9), the corrected moments are related to those of h via [4]

 M2,σ=W⊤C2W,  M3,σ=C3(W,W,W). (16)

Hence, if M2,σ and M3,σ were known exactly, a candidate set that contains W† could be obtained exactly as in the noiseless case, but with M2 and M3 replaced by M2,σ and M3,σ; namely, first calculate a whitening matrix Kσ such that Kσ⊤M2,σKσ = Id and then compute the eigenpairs of the population whitened tensor

 Wσ=M3,σ(Kσ,Kσ,Kσ). (17)

In practice, μ, M2, M3, σ, and d are all unknown and need to be estimated from the sample matrix X. Assuming m > d, the parameters σ and d can be consistently estimated, for example, by the methods in Kritchman and Nadler [33]. For simplicity, we assume they are known exactly. Similarly, μ, M2, M3 are consistently estimated by their empirical means ^μ, ^M2, and ^M3. So, after computing the plugin estimates ^Kσ and ^Wσ such that ^Kσ⊤^M2,σ^Kσ = Id and ^Wσ = ^M3,σ(^Kσ, ^Kσ, ^Kσ), we compute the set ^Uσ of eigenpairs of ^Wσ and, for some small τ > 0, take our candidate set as

 ^Vσ={^Kσu/λ:(u,λ)∈^Uσ with 1−τ≤λ}. (18)
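The noise-corrected empirical moments defined above can be computed directly from the sample matrix X. A minimal numpy sketch (function names are ours; σ is assumed known, as in the text):

```python
import numpy as np

def corrected_moments(X, sigma):
    """Empirical noise-corrected moments M2_sigma and M3_sigma from samples
    X (shape m x n), following the correction formulas of Section 4."""
    m, n = X.shape
    mu = X.mean(axis=1)
    M2 = X @ X.T / n
    M3 = np.einsum('ia,ja,ka->ijk', X, X, X) / n
    M2s = M2 - sigma**2 * np.eye(m)
    I = np.eye(m)
    corr = (np.einsum('i,jk->ijk', mu, I)    # sum_i mu (x) e_i (x) e_i
            + np.einsum('j,ik->ijk', mu, I)  # sum_i e_i (x) mu (x) e_i
            + np.einsum('k,ij->ijk', mu, I)) # sum_i e_i (x) e_i (x) mu
    M3s = M3 - sigma**2 * corr
    return M2s, M3s
```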

The following lemma shows that under conditions (I)-(II) the above procedure is stable to small perturbations. Namely, for perturbations of order δ in Kσ and Wσ, the method computes a candidate set that contains a subset of vectors that are O(δ) close to the columns of W†. Furthermore, these vectors all correspond to Newton-stable eigenpairs of the perturbed tensor and are separated from the other candidates in ^Vσ.

###### Lemma 3.

Let (Kσ, Wσ) be the population quantities in (17) and let (^Kσ, ^Wσ) be their perturbed versions, inducing the candidate set ^Vσ in (18). If conditions (I)-(II) hold, then there are constants c, δ0, δ1 > 0 such that for all δ < δ0 the following holds: If the perturbed versions satisfy

 max{∥^Wσ−Wσ∥F,∥^Kσ−Kσ∥F}≤δ, (19)

then any v∗ ∈ W† has a unique ^v ∈ ^Vσ such that

 ∥^v−v∗∥≤cδ. (20)

Moreover, ^v corresponds to a Newton-stable eigenpair of ^Wσ with eigenvalue λ ≥ 1 − τ, and for all other candidates ~v ∈ ^Vσ,

 ∥~v−v∗∥≥δ1>2cδ. (21)

The proof is based on the implicit function theorem [27]; small perturbations to a tensor result in small perturbations to its Newton-stable eigenpairs.

Now, by the delta method, the plugin estimates ^Kσ and ^Wσ are both close to their population quantities,

 ∥^Kσ−Kσ∥F = OP(n^−1/2),  ∥^Wσ−Wσ∥F = OP(n^−1/2).

Hence (19) holds with δ = OP(n^−1/2), and by Lemma 3 the eigenpairs of ^Wσ provide a candidate set ^Vσ that contains vectors that are OP(n^−1/2) close to the columns of W†. In addition, any irrelevant candidate is far away from W†. As we show next, these properties ensure that with high probability the d relevant candidates can be identified in ^Vσ.

### Filtering

Given the candidate set ^Vσ computed in the first step, our goal now is to find a set of d vectors that accurately estimate the columns of W†. To simplify the theoretical analysis, we assume we are given a fresh sample X of size n that is independent of ^Vσ. This can be achieved by first splitting a sample of size 2n into two sets of size n, one for each step.

Recall that for a vector x from model (2) and any v ∈ R^m,

 v⊤x=v⊤W⊤h+σv⊤ξ. (22)

Obviously, when σ > 0, the filtering procedure in (15) for the noiseless case is inadequate, as typically no v will exactly satisfy v⊤X ∈ {0,1}^n. Nevertheless, we expect that for a sufficiently small noise level σ, any ^v that is close to some v ∈ W† will result in ^v⊤X that is close to being binary, while any ^v sufficiently far from W† will result in ^v⊤X that is far from being binary. A natural measure for how far v⊤X is from being binary, similar to the one used for filtering in Slawski et al. [47], is simply its normalized deviation from its binary rounding,

 minb∈{0,1}n ∥v⊤X−b∥² / (n∥v∥²). (23)

Eq. (23) works extremely well for small σ, but fails at high noise levels. Here we instead propose a filtering procedure based on the classical Kolmogorov-Smirnov (KS) goodness of fit test [34]. As we show below, this approach gives consistent estimates of W for any σ.

Before describing the test, we first introduce a probabilistic analogue of the rigidity condition (7). For any u ∈ R^d, define its corresponding expected binary rounding,

 r(u)=Eh∼Ph[minb∈{0,1}(u⊤h−b)2].

Clearly, r(u) ≥ 0 for all u and r(ei) = 0 for all i ∈ [d]. We pose the following expected rigidity condition: for all u ≠ 0,

 r(u)=0 ⇔ u∈{e1,…,ed}. (24)

Analogously to the deterministic rigidity condition in (7), condition (24) is satisfied, for example, when Ph[h = ei] > 0 and Ph[h = ei + ej] > 0 for all i ≠ j.

To introduce our filtering test, recall that under model (2), ξ ∼ N(0, Im). Hence, for any fixed v ∈ R^m, the random variable v⊤x in (22) is distributed according to the following univariate Gaussian mixture model (GMM),

 v⊤x∼∑h∈{0,1}dPh[h]⋅N(v⊤W⊤h,σ2∥v∥2). (25)

Denote the cumulative distribution function of v⊤x by Fv. For a general v, this mixture may have up to 2^d distinct components. However, for v ∈ W†, it reduces to a mixture of two components with means at 0 and 1. More precisely, for any candidate v with corresponding eigenvalue λ(v), define the following GMM with two components,

 (1−1/λ(v)²)⋅N(0,σ²∥v∥²)+(1/λ(v)²)⋅N(1,σ²∥v∥²). (26)

Denote its cumulative distribution function by Gv. The following lemma shows that under condition (24), the equality Fv = Gv fully characterizes the columns of W†.

###### Lemma 4.

Let (Kσ, Wσ) be the population quantities in (17) and let Vσ be the set of population candidates as computed from the eigenpairs of Wσ. If conditions (I)-(II) and the expected rigidity condition (24) hold, then for any v ∈ Vσ and its corresponding eigenvalue λ(v),

 Fv=Gv⇔v∈W†.

Given the empirical candidate set ^Vσ, Lemma 4 suggests ranking all ^v ∈ ^Vσ according to their goodness of fit to G^v and taking the d candidates with the best fit. More precisely, given a sample x1, …, xn that is independent of ^Vσ, for each candidate ^v we compute the empirical cumulative distribution function,

 ^F^v(t)=(1/n)∑j=1n 1{^v⊤xj≤t},  t∈R,

and calculate its Kolmogorov-Smirnov score

 Δn(^v)=supt∈R|^F^v(t)−G^v(t)|. (27)

Our estimator for W† is then the set ¯V of the d vectors with the smallest scores Δn(^v). The estimator for W is the pseudo-inverse, ^W = (¯V)†.
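The KS score (27) against the two-component GMM CDF Gv of (26) is straightforward to compute. A minimal sketch using scipy (function names are ours; the supremum over t is evaluated at the sorted sample points, where it is attained):

```python
import numpy as np
from scipy.stats import norm

def ks_score(v, lam, X, sigma):
    """Kolmogorov-Smirnov score Delta_n(v) of Eq. (27): distance between the
    empirical CDF of v^T x_j and the two-component GMM CDF G_v of Eq. (26)."""
    proj = np.sort(v @ X)                       # v^T x_j, sorted
    n = proj.size
    w1 = 1.0 / lam**2                           # weight of the N(1, .) component
    s = sigma * np.linalg.norm(v)
    G = (1 - w1) * norm.cdf(proj, 0.0, s) + w1 * norm.cdf(proj, 1.0, s)
    # sup_t |F_hat(t) - G(t)| is attained just before or at a sample point.
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(np.abs(ecdf_hi - G)), np.max(np.abs(ecdf_lo - G)))
```

A candidate drawn from the correctly specified mixture gets a score of order n^−1/2, while a misspecified mixture weight yields an order-one score.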

The following lemma shows that for sufficiently large n, the score Δn accurately distinguishes between candidates ^v that are close to the columns of W† and those that are not.

###### Lemma 5.

Let v∗ ∈ W† and let ^v(n) be a sequence of random vectors such that ∥^v(n) − v∗∥ = oP(1). Then,

 Δn(^v(n))=oP(1).

In contrast, if ∥^v(n) − v∗∥ = ΩP(1) for every v∗ ∈ W†, then

 Δn(^v(n))=ΩP(1),

provided the expected rigidity condition (24) holds.

Lemma 5 follows from classical and well studied properties of the Kolmogorov-Smirnov test, see for example Lehmann and Romano [34], Billingsley [10].

Algorithm 2 summarizes our method for estimating W in the general case where σ > 0 and m > d. The following theorem establishes its consistency.

###### Theorem 2.

Let x1, …, xn be i.i.d. samples from model (2). If conditions (I)-(II) and the expected rigidity condition (24) hold, then the estimator ^W computed by Algorithm 2 is consistent, achieving the parametric rate,

 ^W=W+OP(n^−1/2).

### Runtime

The runtime of Algorithm 2 is composed of three main parts. First, O(nm³) operations are needed to compute all the relevant moments from the data and to construct the whitened tensor ^Wσ. The most time consuming task is computing the eigenpairs of ^Wσ, which can be done by either the homotopy method or O–NCM. Currently, no runtime guarantees are available for either of these methods. In practice, since there are exponentially many eigenpairs, these methods spend a number of operations exponential in d in total. Finally, since there are exponentially many candidates and each KS test takes O(n log n) operations [20], the filtering procedure runtime is exponential in d as well.

### Power-stability and orthogonal decomposition

The exponential runtime of our algorithm stems from the fact that the set of Newton-stable eigenpairs of ^Wσ is typically exponentially large. Indeed, the above algorithm becomes intractable for large values of d. However, in some cases, the set U∗ of relevant eigenpairs has additional structure, so that a smaller candidate set may be computed instead of ^Vσ. Specifically, consider the subset of power-stable eigenpairs of ^Wσ.

###### Definition 3.

An eigenpair (u, λ) is power-stable if its projected Jacobian Jp(u) is either positive or negative definite.

Typically, the number of power-stable eigenpairs is significantly smaller than the number of Newton-stable eigenpairs. (We currently do not know whether the number of power-stable eigenpairs of a generic tensor is polynomial or exponential in d.) In addition, they can be computed by the shifted higher-order power method [31, 32].

Similarly to Lemma 2, one can show that the set of power-stable eigenpairs is guaranteed to contain U∗ whenever the following stronger version of condition (II) holds: for all i ∈ [d], the matrix

 (WKLui)⊤(2C3(I,I,ei)−C2)(WKLui) (28)

is either positive-definite or negative-definite.

As an example, consider the case where h has the support {e1, …, ed}. Then model (2) corresponds to a GMM with d spherical components with linearly independent means. In this case, both C2 and C3 are diagonal with p1, …, pd on their diagonal. Thus, the matrices in (28) take the form −(WKLui)⊤C2(WKLui), which by condition (I) are all negative-definite. In fact, in this case, Wσ has an orthogonal decomposition and the orthogonal eigenpairs in U∗ are the only negative-definite power-stable eigenpairs of Wσ [4]. Similarly, when Ph is a product distribution, the same orthogonal structure appears if the centered moments of x are used instead of M2,σ and M3,σ. As shown in Anandkumar et al. [4], the power method, accompanied with a deflation procedure, decomposes an orthogonal tensor in polynomial time, thus implying an efficient algorithm in these cases.

## 5 Experiments

We demonstrate our method in two scenarios: (I) simulations from the exact binary model (2); and (II) learning a common population genetic admixture model. Code to reproduce the simulation results can be found at https://github.com/arJaffe/BinaryLatentVariables.

### Simulations

We generated n samples from model (2) with d hidden units, m observable features, and Gaussian noise ξ ∼ N(0, Im). The columns of W were drawn uniformly from the unit sphere S^{m−1}. Fixing a mean vector and a covariance matrix, each hidden vector h was generated independently by first drawing a Gaussian vector with these parameters and then taking its binary rounding.

Figure 2 shows the estimation error, in Frobenius norm, averaged over independent realizations of W, as a function of the sample size n (left panel) and the noise level σ (right panel), for five methods: (i) our spectral approach, detailed in Algorithm 2 (Spectral); (ii) Algorithm 2 followed by an additional single weighted least squares step detailed in Appendix I (Spectral+WLS); (iii) SHL, the matrix decomposition approach of Slawski et al. [47] (code taken from https://sites.google.com/site/slawskimartin/code; for each realization, we aggregated over several runs of SHL and chose the output with the smallest error); (iv) ALS with a random starting point (see Appendix J); and (v) an oracle estimator that is given the exact matrix H and computes W via least squares.

As one can see, as opposed to SHL, our method is consistent for σ > 0 and achieves an error rate corresponding to a slope of −1/2 in the left panel of Fig. 2. In addition, as seen in the right panel of Fig. 2, at low levels of noise our method is comparable to SHL, whereas at high levels of noise it is far more accurate. Finally, adding a weighted least squares step reduces the error for low noise levels, but increases the error for high noise levels.

### Population genetics

We present an application of our method to a fundamental problem in population genetics, known as admixture, illustrated in Fig. 3. Admixture refers to the mixing of ancestral populations that were long separated, e.g., due to geographical or cultural barriers [42, 1, 35]. The observed data is an m × n matrix X, where n is the number of modern “admixed” individuals and m is the number of relevant locations in their DNA, known as single nucleotide polymorphisms (SNPs). Each SNP corresponds to two alleles and different individuals may have different alleles. Fixing a reference allele for each location, Xij takes values in {0, 1/2, 1} according to the number of reference alleles appearing in the genotype of individual j at locus i.

Given the genotypes X, an important problem in population genetics is to estimate the following two quantities: the allele frequency matrix W, whose entry Wki is the frequency of the reference allele at locus i in ancestral population k; and the admixture proportion matrix H, whose columns sum to 1 and whose entry Hkj is the proportion of individual j's genome that was inherited from population k.

A common model for X in terms of W and H is to assume that the number of reference alleles, 2Xij, is the sum of two i.i.d. Bernoulli random variables with success probability Fij, where F = W⊤H. Namely,

 Xij|H∼(1/2)⋅Binomial(2,Fij).

Note that under this model

 E[X|H]=F=W⊤H. (29)

Although (29) has a similar form to model (2), there are two main differences: the noise is not normally distributed and the matrix of allele frequencies is non-binary. Yet, model (2) is expected to be a good approximation whenever various alleles are rare in some populations but abundant in others. Specifically, for ancestral populations that have been long separated, some alleles may become fixed in one population (i.e., reach a frequency of 1) while being totally absent in others.
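Under the binomial model above, generating genotypes from given W and H is immediate. A minimal sketch (illustrative names and small dimensions for clarity):

```python
import numpy as np

def sample_genotypes(W, H, rng):
    """Sample the genotype matrix X: 2*X_ij ~ Binomial(2, F_ij) with
    F = W^T H, so X takes values in {0, 1/2, 1}."""
    F = W.T @ H                      # per-locus, per-individual allele frequency
    return rng.binomial(2, F) / 2.0
```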

### Genetic simulations

We followed a standard simulation scheme appearing, for example, in Xue et al. [51], Gravel [21], Price et al. [41]. First, using SCRM [40], we simulated d ancestral populations, separated for many generations, and generated the genomes of multiple individuals for each. W was then computed as the frequency of the reference alleles in each population. Next, the columns of H were sampled from a symmetric Dirichlet distribution. Finally, the genomes of n admixed individuals were generated as mosaics of genomic segments of individuals from the ancestral populations with proportions H. The mosaic nature of the admixed genomes is an important realistic detail, due to the linkage (correlation) between SNPs [51]. A detailed description is in Appendix K.

We compare our algorithm to two methods. The first is Admixture [1], one of the most widely used algorithms in population genetics, which aims to maximize the likelihood of X. A recent spectral approach is ALStructure [11], where an estimation of the latent subspace via Chen and Storey [14] is followed by constrained ALS iterations over W and H. For our method, two modifications to Algorithm 2 are needed. First, since the distribution of X is not Gaussian, the corrected moments as calculated in Section 4 do not satisfy (16). Instead, we implemented a matrix completion algorithm derived in [30] for a similar setup; see Appendix H for more details. In addition, the filtering process described in Section 4 is no longer valid. However, as d is relatively small, we were able to perform an exhaustive search over all candidate subsets of size d and choose the one that maximized the likelihood.

Figure 4 compares the results of the three methods. The spectral method outperforms Admixture and ALStructure in some settings and performs similarly to Admixture in the others.

### Acknowledgments

This research was funded in part by NIH Grant 1R01HG008383-01A1.

## References

• Alexander et al. [2009] David H Alexander, John Novembre, and Kenneth Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9):1655–1664, 2009.
• Anandkumar et al. [2012a] Animashree Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012a.
• Anandkumar et al. [2012b] Animashree Anandkumar, Daniel Hsu, and Sham M Kakade. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, page 4, 2012b.
• Anandkumar et al. [2014] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
• Arabshahi and Anandkumar [2017] Forough Arabshahi and Anima Anandkumar. Spectral methods for correlated topic models. In Artificial Intelligence and Statistics, pages 1439–1447, 2017.
• Arora et al. [2012] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 145–162. ACM, 2012.
• Baadel et al. [2016] Said Baadel, Fadi Thabtah, and Joan Lu. Overlapping clustering: A review. In SAI Computing Conference (SAI), 2016, pages 233–237. IEEE, 2016.
• Banerjee et al. [2005] Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh, Sugato Basu, and Raymond J Mooney. Model-based overlapping clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 532–537. ACM, 2005.
• Becker et al. [2011] Emmanuelle Becker, Benoît Robisson, Charles E Chapple, Alain Guénoche, and Christine Brun. Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics, 28(1):84–90, 2011.
• Billingsley [2013] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
• Cabreros and Storey [2017] Irineo Cabreros and John D Storey. A nonparametric estimator of population structure unifying admixture models and principal components analysis. bioRxiv, page 240812, 2017.
• Cartwright and Sturmfels [2013] Dustin Cartwright and Bernd Sturmfels. The number of eigenvalues of a tensor. Linear algebra and its applications, 438(2):942–952, 2013.
• Chen et al. [2016] Liping Chen, Lixing Han, and Liangmin Zhou. Computing tensor eigenvalues via homotopy methods. SIAM Journal on Matrix Analysis and Applications, 37(1):290–319, 2016.
• Chen and Storey [2015] Xiongzhi Chen and John D Storey. Consistent estimation of low-dimensional latent structure in high-dimensional data. arXiv preprint arXiv:1510.03497, 2015.
• Cho et al. [2011] KyungHyun Cho, Alexander Ilin, and Tapani Raiko. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines. Artificial Neural Networks and Machine Learning – ICANN 2011, pages 10–17, 2011.
• Cui et al. [2014] Chun-Feng Cui, Yu-Hong Dai, and Jiawang Nie. All real eigenvalues of symmetric tensors. SIAM Journal on Matrix Analysis and Applications, 35(4):1582–1601, 2014.
• Donoho and Stodden [2004] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in neural information processing systems, pages 1141–1148, 2004.
• Frieze et al. [1996] Alan Frieze, Mark Jerrum, and Ravi Kannan. Learning linear transformations. In Foundations of Computer Science, 1996. Proceedings., 37th Annual Symposium on, pages 359–368. IEEE, 1996.
• Golub and Van Loan [2012] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
• Gonzalez et al. [1977] Teofilo Gonzalez, Sartaj Sahni, and William R. Franta. An efficient algorithm for the Kolmogorov-Smirnov and Lilliefors tests. ACM Transactions on Mathematical Software (TOMS), 3(1):60–64, 1977.
• Gravel [2012] Simon Gravel. Population genetics models of local ancestry. Genetics, 191(2):607–619, 2012.
• Guo et al. [2017] Chun-Hua Guo, Wen-Wei Lin, and Ching-Sung Liu. A modified Newton iteration for finding nonnegative Z-eigenpairs of a nonnegative tensor. arXiv preprint arXiv:1705.07487, 2017.
• Hillar and Lim [2013] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.
• Hinton [2010] Geoffrey Hinton. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926, 2010.
• Hinton and Salakhutdinov [2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
• Hsu and Kakade [2013] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
• Hubbard and Hubbard [2015] John H Hubbard and Barbara Burke Hubbard. Vector calculus, linear algebra, and differential forms: a unified approach. Matrix Editions, 2015.
• Hyvärinen et al. [2004] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.
• Jaffe et al. [2017] Ariel Jaffe, Roi Weiss, and Boaz Nadler. Newton correction methods for computing real eigenpairs of symmetric tensors. arXiv preprint arXiv:1706.02132, 2017.
• Jain and Oh [2014] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. In Conference on Learning Theory, pages 824–856, 2014.
• Kolda and Mayo [2011] Tamara G Kolda and Jackson R Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095–1124, 2011.
• Kolda and Mayo [2014] Tamara G Kolda and Jackson R Mayo. An adaptive shifted power method for computing generalized tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 35(4):1563–1581, 2014.

• Kritchman and Nadler [2009] Shira Kritchman and Boaz Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing, 57(10):3930–3941, 2009.
• Lehmann and Romano [2006] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science & Business Media, 2006.
• Li et al. [2008] Jun Z Li, Devin M Absher, Hua Tang, Audrey M Southwick, Amanda M Casto, Sohini Ramachandran, Howard M Cann, Gregory S Barsh, Marcus Feldman, Luigi L Cavalli-Sforza, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319(5866):1100–1104, 2008.
• Lim [2005] Lek-Heng Lim. Singular values and eigenvalues of tensors: a variational approach. In Computational Advances in Multi-Sensor Adaptive Processing, 2005 1st IEEE International Workshop on, pages 129–132. IEEE, 2005.
• Melchior et al. [2017] Jan Melchior, Nan Wang, and Laurenz Wiskott. Gaussian-binary restricted Boltzmann machines for modeling natural image statistics. PLoS ONE, 12(2):e0171015, 2017.
• Moitra and Valiant [2010] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.
• Mossel and Roch [2005] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375. ACM, 2005.
• Staab et al. [2015] Paul R. Staab, Sha Zhu, Dirk Metzler, and Gerton Lunter. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10):1680–1682, 2015.
• Price et al. [2009] Alkes L Price, Arti Tandon, Nick Patterson, Kathleen C Barnes, Nicholas Rafaels, Ingo Ruczinski, Terri H Beaty, Rasika Mathias, David Reich, and Simon Myers. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS genetics, 5(6):e1000519, 2009.
• Pritchard et al. [2000] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
• Qi [2005] Liqun Qi. Eigenvalues of a real supersymmetric tensor. Journal of Symbolic Computation, 40(6):1302–1324, 2005.
• Regalia and Kofidis [2003] Phillip A Regalia and Eleftherios Kofidis. Monotonic convergence of fixed-point algorithms for ICA. IEEE Transactions on Neural Networks, 14(4):943–949, 2003.
• Segal et al. [2002] Eran Segal, Alexis Battle, and Daphne Koller. Decomposing gene expression into cellular processes. In Proceedings of the Pacific Symposium on Biocomputing, volume 8, pages 89–100, 2002.
• Shalvi and Weinstein [1993] Ofir Shalvi and Ehud Weinstein. Super-exponential methods for blind deconvolution. IEEE Transactions on Information Theory, 39(2):504–519, 1993.
• Slawski et al. [2013] Martin Slawski, Matthias Hein, and Pavlo Lutsik. Matrix factorization with binary components. In Advances in Neural Information Processing Systems, pages 3210–3218, 2013.
• Taylor et al. [2007] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. Modeling human motion using binary latent variables. In Advances in neural information processing systems, pages 1345–1352, 2007.
• Van der Veen [1997] A-J Van der Veen. Analytical method for blind binary signal separation. IEEE Transactions on Signal Processing, 45(4):1078–1082, 1997.
• Wang et al. [2012] Nan Wang, Jan Melchior, and Laurenz Wiskott. An analysis of Gaussian-binary restricted Boltzmann machines for natural images. In ESANN, 2012.
• Xue et al. [2017] James Xue, Todd Lencz, Ariel Darvasi, Itsik Pe’er, and Shai Carmi. The time and place of European admixture in Ashkenazi Jewish history. PLoS genetics, 13(4):e1006644, 2017.

## Appendix A Proof of Proposition 1

Uniqueness of the factorization readily follows from (8), so we proceed to prove (8). First note that $X = W^\top H$. Since $W$ is full rank, we have $WW^\dagger = I$. Hence,

$$(W^\dagger)^\top X = (WW^\dagger)^\top H = H \in \{0,1\}^{d\times n}.$$

So any satisfies the binary constraint . For the other direction, let be such that . Since , the rigidity condition (7) implies . Since is full rank and , must be a column of .
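A quick numerical check of this pseudoinverse identity, assuming the noiseless model $X = W^\top H$ with a full (row) rank $W$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 4, 10, 50                   # hidden units, observed dim, samples
W = rng.normal(size=(d, m))           # full row rank with probability 1
H = rng.integers(0, 2, size=(d, n))   # binary hidden units
X = W.T @ H                           # noiseless observations (assumed model)

W_pinv = np.linalg.pinv(W)            # W†, a right inverse here: W @ W† = I_d
H_hat = W_pinv.T @ X                  # (W†)ᵀX = (WW†)ᵀH = H
assert np.allclose(H_hat, H)          # binary hidden units recovered exactly
```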

## Appendix B Proof of Lemma 1

Since the hidden vector $h$ is binary, its second and third order moments are related as follows. For all $i, j$,

$$C_{iij} = C_{iji} = C_{jii} = \mathbb{E}[h_i^2 h_j] = \mathbb{E}[h_i h_j] = C_{ij}. \qquad (30)$$
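This identity holds exactly even for empirical moments, since $h_i^2 = h_i$ holds sample-wise for binary data. A quick NumPy check (the correlated-binary construction below is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 10_000
# Correlated binary hidden units -- independence is not required.
z = rng.normal(size=(n, 1))
H = (rng.normal(size=(n, d)) + z > 0).astype(float)

C2 = H.T @ H / n                               # empirical second moments C_ij
C3 = np.einsum('ni,nj,nk->ijk', H, H, H) / n   # empirical third moments C_ijk

# h_i^2 = h_i sample-wise, so C_iij equals C_ij exactly, not just in expectation.
for i in range(d):
    for j in range(d):
        assert np.isclose(C3[i, i, j], C2[i, j])
```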

Since $W$ is full rank, $WW^\dagger = I$. Hence, applying $W^\dagger$ multi-linearly to the moment equations in (9), we obtain

$$C = (W^\dagger)^\top M W^\dagger, \qquad C = M(W^\dagger, W^\dagger, W^\dagger).$$

Thus, the equality in (30) is equivalent to

$$[M(W^\dagger, W^\dagger, W^\dagger)]_{iij} = [(W^\dagger)^\top M W^\dagger]_{ij}. \qquad (31)$$

Let $Y^*$ be the full rank matrix that satisfies $W^\dagger = K Y^*$, where $K$ is the whitening matrix in (10). Then,

$$M(W^\dagger, W^\dagger, W^\dagger) = M(KY^*, KY^*, KY^*) = \mathcal{W}(Y^*, Y^*, Y^*) \qquad (32)$$

where $\mathcal{W}$ is the whitened tensor in (11). Similarly, by (10),

$$(W^\dagger)^\top M W^\dagger = (Y^*)^\top (K^\top M K) Y^* = (Y^*)^\top Y^*.$$

Inserting these into (31), the matrix $Y^*$ must satisfy

$$[\mathcal{W}(Y^*, Y^*, Y^*)]_{iij} = [(Y^*)^\top Y^*]_{ij}, \qquad \forall\, i, j \in [d]. \qquad (33)$$

The following lemma, proved in Appendix G, shows that Eq. (33) is nothing but a tensor eigen-problem. Specifically, the columns of $Y^*$ are, up to scaling, eigenvectors of $\mathcal{W}$.
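To make the multilinear notation and the eigen-relation concrete, here is a small synthetic check. The tensor is built with known orthonormal eigenvectors, and the columns are scaled as $y_k = v_k/\lambda_k$, a scaling one can verify balances both sides of (33) in this construction; all values are illustrative:

```python
import numpy as np

def multilinear(T, A):
    """T(A, A, A)_{ijk} = sum_{abc} T_{abc} A_{ai} A_{bj} A_{ck}."""
    return np.einsum('abc,ai,bj,ck->ijk', T, A, A, A)

rng = np.random.default_rng(3)
d = 3
V, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal eigenvectors
lam = np.array([2.0, 1.5, 0.5])                # nonzero eigenvalues (illustrative)

# Symmetric tensor with known eigenpairs: T = sum_k lam_k v_k ⊗ v_k ⊗ v_k
T = np.einsum('k,ak,bk,ck->abc', lam, V, V, V)

Y = V / lam                                    # columns scaled as y_k = v_k / lam_k

lhs = multilinear(T, Y)                        # [T(Y, Y, Y)]_{iij}
rhs = Y.T @ Y                                  # [(Y^T Y)]_{ij}
for i in range(d):
    for j in range(d):
        assert np.isclose(lhs[i, i, j], rhs[i, j])
```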

###### Lemma 6.

Let $\mathcal{T}$ be an arbitrary symmetric tensor. Then, a matrix $Y$ of rank $d$ satisfies (33) if and only if for all $i \in [d]$, $y_i = v_i / \lambda_i$, where $(v_i, \lambda_i)$ are eigenpairs of $\mathcal{T}$ with linearly independent $v_i$.

By Lemma 6, the set of scaled eigenpairs of $\mathcal{W}$ is guaranteed to contain the columns of $Y^*$. Since $W^\dagger = KY^*$, this set, mapped through $K$, is guaranteed to contain the columns of $W^\dagger$.
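For intuition, the simplest way to produce a candidate eigenpair of a symmetric tensor is plain power iteration, $v \leftarrow \mathcal{T}(I, v, v)/\|\mathcal{T}(I, v, v)\|$; the paper's actual computation of the full eigenpair set may differ (e.g. Newton-type corrections [29]). A sketch:

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Plain power iteration for one Z-eigenpair of a symmetric 3-tensor T."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = np.einsum('abc,b,c->a', T, v, v)    # T(I, v, v)
        v = w / np.linalg.norm(w)
    lam = np.einsum('abc,a,b,c->', T, v, v, v)  # T(v, v, v)
    return lam, v

# Illustration on a tensor with known orthonormal eigenvectors:
rng = np.random.default_rng(4)
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))
T = np.einsum('k,ak,bk,ck->abc', np.array([2.0, 1.5, 0.5]), V, V, V)
lam, v = tensor_power_iteration(T)
# At convergence the eigen-relation T(I, v, v) = lam * v holds.
assert np.allclose(np.einsum('abc,b,c->a', T, v, v), lam * v, atol=1e-8)
```

For orthogonally decomposable tensors with positive eigenvalues, this iteration converges to one of the (robust) eigenvectors; which one depends on the random initialization, so recovering all columns requires multiple restarts or deflation.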

To show that each