## 1 Introduction

Independent component analysis (ICA) is a statistical model which has become ubiquitous in a variety of applications including image processing [5, 25], neuroscience [17], and genomics [23, 18, 10]. The ICA model expresses an observed random vector

of a latent random vector with independent components, called *sources*. Here, is an unknown deterministic *mixing matrix* [13]. Arguably the most studied problem in the ICA model is *blind source separation* where the goal is to recover both the mixing matrix and the sources from observations of . Another problem, called *feature extraction*, is that of recovering just the mixing matrix [14]

. In this paper we focus on the feature extraction problem. Unlike blind source separation, feature extraction may be solved even in the

*overcomplete*setting where . Nevertheless, for to be identifiable, additional assumptions need to be imposed beyond independence of the latent components. Indeed, if , then is a sufficient statistic for the distribution of , and this is unchanged if is replaced by

for any orthogonal matrix

. Thus, is at best identifiable up to right multiplication by an orthogonal matrix. It turns out that the above example is essentially the only case when is not identifiable. More precisely, a classical result states that if at most one component of is Gaussian, then can be recovered up to a permutation and rescaling of its columns [8].### Previous work.

In view of the identifiability issues arising in the Gaussian case, practical algorithms for ICA have traditionally relied on the fourth cumulants of to exploit non-Gaussianity [2]. Perhaps the most widely known method in this line of work is the FastICA algorithm by Hyvärinen and Oja [16]

, which iteratively finds the one-dimensional projections maximizing kurtosis of the data. However, all such methods fail when

has fourth moments close to those of a Gaussian. Moreover, traditional methods for ICA almost universally use of an initial

*whitening*step, which transforms to have covariance matrix . This step is fragile to independent additive noise on . Voss et al. [24] introduce algorithms to overcome this problem, and Arora et al. [4] use a

*quasi-whitening*step which allows them to prove guarantees for ICA in the presence of additive Gaussian noise. Nevertheless, all these methods exploit non-gaussianity of the sources.

### Our contribution.

In this paper we take a radically different approach that removes distributional assumptions on . In fact we do not even require to have independent components, but only that the components be uncorrelated. In addition our methods are robust to additive independent noise on with any centered distribution. We achieve this by instead making structural assumptions on . Specifically, we assume that is *sparse* and *generic*. Sparsity has been a key idea to reduce dimensionality in signal processing and statistics [7, 11, 12, 6].
Beyond reducing complexity and avoiding overfitting, sparsity of the mixing matrix is of great practical interest because it leads to *interpretable* results. Hyvärinen and Raju [15] have previously proposed to impose sparsity conditions on the mixing matrix in order to improve the performance of ICA. This work presents the first rigorous treatment of an algorithm taking advantage of a sparse mixing matrix.

In addition to being sparse we require that be generic, which we enforce by choosing as the realization of a Bernoulli-Gaussian ensemble. Similar structural assumptions have recently been employed in dictionary learning for example [22, 3, 1]. While the two problems are related, fundamental differences preclude the use of standard dictionary learning machinery (See section 2).

### Notation.

We use the shorthand . We write the th entry of vector as . For and indices , we write the restriction of to as . is the th standard unit vector, and for . Similarly for a matrix , is the submatrix . is the submatrix which keeps all rows but only columns indexed by . We say that a matrix is *fully dense* if all its entries are nonzero. For tuples and , is the entrywise ratio, . is the diagonal matrix with along the diagonal.
For a set , is its cardinality. We define the *support* of a vector and write . denotes the -norm, and is the entrywise supremum norm.
denotes the Wishart distribution with scale matrix and degrees of freedom, i.e., for i.i.d. samples of a Gaussian vector, .

is the binomial distribution with

trials and success probability

, and is the distribution of a Bernoulli trial with expectation. We write the identity matrix in

as and the all-ones vector as . We use the notation and .## 2 Statistical Model

Let be a random vector of sources with independent components. is transformed by multiplication with the unknown mixing matrix

to be estimated. We also allow independent additive noise

on the transformed vector, where is some diagonal nonnegative matrix (possibly zero). We arrange i.i.d. copies of in a matrix whose entries are i.i.d. with distribution .^{1}

^{1}1The Gaussianity assumption can be relaxed. Similarly, concatenate copies of to construct independent of and with independent entries . The observed data is then given by

(1) |

The columns of are i.i.d. with distribution , where

Write the sample covariance matrix as . Then follows a Wishart distribution with scale matrix and degrees of freedom. Our goal is to learn up to permutations and sign changes of its columns, i.e., to recover where is some permutation matrix, and is a diagonal matrix with diagonal entries in .

### Relation to dictionary learning.

Dictionary learning, also known as sparse coding [21], is a matrix factorization problem which is formally equivalent with (1) without the noise term. With our notation the problem can be stated as follows. An unknown matrix , called the *dictionary*, is assumed to have various properties for identifiability purposes. These include *incoherence* [3, 1], or invertibility [22]. The columns of are called the *atoms*. A sequence of vectors is observed, each of which is a sparse linear combination of atoms. Appending the observed vectors yields a matrix

where is sparse and generic. The task is to recover and the dictionary . While this problem is formally equivalent with (1), dictionary learning traditionally treats the regime , i.e., the number of samples is larger than the number of dictionary elements [22, 20]. This assumption is overly restrictive for our purposes as we allow the number of features to exceed the number of observed variables , so we cannot employ existing results on dictionary learning. More specifically, while our results also cover the case , we are primarily interested in the regime where .

In order to ensure that the mixing matrix is generic we generate by the following random model.

###### Definition 1.

Let

be mutually independent random variables where

If , we say that matrix arises from a Bernoulli-Gaussian ensemble and write .

The mixing matrix has, in expectation entries and we refer to has the *sparsity parameter*. It represents the fraction of nonzero-entries. Let and denote the columns and rows of respectively,

Because of ambiguities from permutations and sign changes of we evaluate the performance of the recovery algorithm in terms of the following distance measure.

###### Definition 2.

Given , write

where ranges over all permutation matrices, and ranges over diagonal matrices with diagonal entries taking values in .

## 3 Main result

We prove that when and the sparsity parameter is of order , can be efficiently recovered from the covariance matrix . In this setting, at most a small constant fraction of the entries of are nonzero. Moreover, we show that when is of order , the sample covariance matrix suffices to approximately recover . This is our main theorem.

###### Theorem 1.

There exist such that the following holds. Let be such that

Let where and is diagonal. Put

Then there is a randomized algorithm outputting on input in expected time such that with probability over the randomness of and , .

The quantity can also be characterized in terms of the largest squared norm of a row of , i.e., . Coupling , and yields the following theorem, which gives a qualitative illustration of theorem 1. This asymptotic result uses a bound stating that concentrates around .

###### Corollary 2.

Let , and let , where

for some fixed exponent , then there is an algorithm which takes input and outputs , such that in probability over the randomness of as . Moreover, the same guarantee holds for input if , where

## 4 Algorithm

We now describe our algorithm, beginning in the population setting where is known exactly. The input to our algorithm is , whose th row we write as

We recover the columns of one at a time by applying the following Single Column Identification Procedure (). takes as input and a pair of indices , where is a randomly chosen nonzero off-diagonal entry of , and it outputs a column , possibly with its sign changed. Denote the (unknown) supports of the columns and rows of by

, implies that . Equivalently, for some . Using this fact, and supposing that there is only one such (this turns out to be the typical situation), outputs . proceeds in two steps on input .

The first step of finds a subset containing a large fraction of the unknown support . To illustrate how this step works, assume

is the zero matrix

^{2}

^{2}2 does not affect the output of because of its sparsity, so we assume for notational convenience., and write as a sum of (unknown) rank one matrices

(2) |

where for . Note that . Then the -by- matrix is a fully dense matrix of rank one. It turns out that because the supports of the matrices have small overlaps, this property is approximately preserved when adding the contributions from the other terms in (2). That is, agrees on all but a small number of entries. Hence, can be made to have rank one by removing the columns where these entries appear. Equivalently, letting be the indices of the remaining columns, we get that is a set of indices such that is a rank one fully dense matrix. Therefore we can identify a large subset by picking the largest set such that is fully dense and has rank one. Another way of formulating this is that we define to be the largest set of indices such that and are fully dense and collinear. This concludes step 1 of .

We now begin step 2 of . At this stage we have where , so we could already make a crude approximation to by extending with zeroes outside of . However, this approximation misses the entries in . Instead, consider the identity . is known to us and , so we can rewrite the identity as

(3) |

The RHS is just copies of written side by side. So if were known, then we could easily recover up to a scalar by taking any column of (3). It turns out that replacing by in (3) changes only a small fraction of the entries in each row. Hence, each row of

is sparse. Moreover, both and are known to us, so we can compute . Now it is easy to compute , since its th entry is repeated several times in the th row of . This row can also be written . We identify using the fact that when , and we output . We have motivated the following procedure.

### Step 1.

Take input and . Intersect the supports of the st and nd rows of and store the resulting set of indices as . Compute the mode (most frequent value) of the entrywise ratio . Then let be the set of indices where the mode is attained.

### Step 2.

Restrict attention to the submatrix consisting of the columns indexed by . Construct vector by defining to be the median of for each . Multiply by a scalar and output the resulting rescaled vector .

### Deflation.

The algorithm to construct on input works by repeatedly applying and a deflation step replacing with . First, initialize as a width 0 matrix. At the start of each iteration, pick a nonzero entry , and assign . Append to , and subtract from . Repeat the above procedure until is diagonal, and output . A potential problem with the deflation procedure is that an incorrect output from column could impede the subsequent applications of . And indeed, fails for the (atypical) pairs such that . To overcome the problem of an incorrect output of we let and verify that is sparser than . This ensures that is correct. Only then do we append to and deflate by subtracting .

## 5 Finite sample case

When the input to is an approximation to we have to relax the requirement that and be collinear. The statement that the two vectors and in are collinear is equivalent with saying that the points are collinear, where ranges over . We take this as the starting point of the relaxed definition of . First, to bound the noise/signal ratio we pick a small constant and disregard points within a distance to a coordinate axis in . This means that is replaced by . We then approximate the entries of by values in a discrete set . This corresponds to placing the points into bins where each bin is a cone in . The appropriate choice of discretization is where

In particular, the set of bins arising from this discretization is symmetric about the axis because implies .

###### Definition 3.

For vectors , , and , define

Pick such that . Such can be estimated as from using the fact that a large fraction of the entries of are zero.

The over can be computed by computing two modes. First divide into two sets and according to the sign of . Then for make a list of integers and take the mode, pairing neighboring integers (for example by appending a copy of the list with added to each integer). This yields a mode for each set , and we finish by taking the most frequent of the two.

## 6 Structure of the mixing matrix

The proof of theorem 1 uses the fact that satisfies a condition , which we define below, and which captures the generic structure of .

###### Definition 4.

For indices let

and define .

A graphical illustration of the sets is given in Figure 4. The following *overlap condition* () is required for our recovery guarantee.

###### Condition 1 ().

Let and . We say that satisfies the overlap condition if for every ,

is a condition on which ensures that each square has a small overlap with . This holds in the regime where is sparse, which is clear by considering it as a sparsity condition on . More precisely we have the following theorem.

###### Theorem 3.

Let . There exists a choice of constants such that if

(4) |

then satisfies condition with probability at least ,

This theorem requires implicitly that . In essence, (13) enforces that, with high probability, each row of has at most non-zero entries.

The subprocedure identifies a large subset by searching for of size such that is nearly singular. If such approximately singular -by- submatrices appear by chance in , then this step fails. We therefore define as the maximum width such that this occurs. As the approximation gets better, i.e. as decreases, our notion of approximately singular becomes more restrictive. This implies that is an increasing function of (decreasing in the accuracy ).

###### Definition 5.

For with and , define

Akin to the restricted isometry property pervasive to compressed sensing, the randomly generated matrix satisfies condition with high probability.

###### Theorem 4.

## 7 Proof of recovery

We first show that our algorithm recovers when the population covariance matrix is known exactly. In this case we can set , so we benefit from the fact that .

###### Theorem 5.

Let be such that holds. Then our algorithm halts in iterations of and outputs such that , where , and the expectation is over the randomness of the algorithm.

###### Proof.

Let

(6) |

and note that iff . We first show that for , such that , it holds that .

### Correctness of SCIP.

Let . We then have that

(7) |

Indeed, for all it holds that , hence and . It follows that , i.e. . This shows that .

Consider the loop over . For every it holds that