# On the achievability of blind source separation for high-dimensional nonlinear source mixtures

For many years, a combination of principal component analysis (PCA) and independent component analysis (ICA) has been used as a blind source separation (BSS) technique to separate hidden sources of natural data. However, it is unclear why these linear methods work well because most real-world data involve nonlinear mixtures of sources. We show that a cascade of PCA and ICA can solve this nonlinear BSS problem accurately as the variety of input signals increases. Specifically, we present two theorems that guarantee asymptotically zero-error BSS when sources are mixed by a feedforward network with two processing layers. Our first theorem analytically quantifies the performance of an optimal linear encoder that reconstructs independent sources. Zero-error is asymptotically reached when the number of sources is large and the numbers of inputs and nonlinear bases are large relative to the number of sources. The next question involves finding an optimal linear encoder without observing the underlying sources. Our second theorem guarantees that PCA can reliably extract all the subspace represented by the optimal linear encoder, so that a subsequent application of ICA can separate all sources. Thereby, for almost all nonlinear generative processes with sufficient variety, the cascade of PCA and ICA performs asymptotically zero-error BSS in an unsupervised manner. We analytically and numerically validate the theorems. These results highlight the utility of linear BSS techniques for accurately recovering nonlinearly mixed sources when observations are sufficiently diverse. We also discuss a possible biological BSS implementation.

Comments

There are no comments yet.

## Authors

• 5 publications
• 6 publications
11/15/2020

### Deep-RLS: A Model-Inspired Deep Learning Approach to Nonlinear PCA

In this work, we consider the application of model-based deep learning i...
12/11/2018

### Sparse component separation from Poisson measurements

Blind source separation (BSS) aims at recovering signals from mixtures. ...
03/23/2018

### Trace your sources in large-scale data: one ring to find them all

An important preprocessing step in most data analysis pipelines aims to ...
02/04/2021

### Nonlinear Independent Component Analysis for Continuous-Time Signals

We study the classical problem of recovering a multidimensional source p...
11/11/2019

### Multidataset Independent Subspace Analysis with Application to Multimodal Fusion

In the last two decades, unsupervised latent variable models—blind sourc...
12/09/2014

### Sparsity and adaptivity for the blind separation of partially correlated sources

Blind source separation (BSS) is a very popular technique to analyze mul...
10/22/2016

### Independent Component Analysis by Entropy Maximization with Kernels

Independent component analysis (ICA) is the most popular method for blin...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Abstract

For many years, a combination of principal component analysis (PCA) and independent component analysis (ICA) has been used as a blind source separation (BSS) technique to separate hidden sources of natural data. However, it is unclear why these linear methods work well because most real-world data involve nonlinear mixtures of sources. We show that a cascade of PCA and ICA can solve this nonlinear BSS problem accurately as the variety of input signals increases. Specifically, we present two theorems that guarantee asymptotically zero-error BSS when sources are mixed by a feedforward network with two processing layers. Our first theorem analytically quantifies the performance of an optimal linear encoder that reconstructs independent sources. Zero-error is asymptotically reached when the number of sources is large and the numbers of inputs and nonlinear bases are large relative to the number of sources. The next question involves finding an optimal linear encoder without observing the underlying sources. Our second theorem guarantees that PCA can reliably extract all the subspace represented by the optimal linear encoder, so that a subsequent application of ICA can separate all sources. Thereby, for almost all nonlinear generative processes with sufficient variety, the cascade of PCA and ICA performs asymptotically zero-error BSS in an unsupervised manner. We analytically and numerically validate the theorems. These results highlight the utility of linear BSS techniques for accurately recovering nonlinearly mixed sources when observations are sufficiently diverse. We also discuss a possible biological BSS implementation.

## Introduction

Blind source separation (BSS) is the problem of separating sensory inputs into hidden sources (or causes) without observing the sources or knowing how they have been mixed Cichocki et al 2009 ; Comon Jutten 2010 . While numerous BSS algorithms have been developed, the combination of principal component analysis (PCA) and independent component analysis (ICA) is widely used today Bishop 2006 . PCA Pearson 1901 ; Oja 1982 ; Oja 1989 ; Sanger 1989 ; Xu 1993 ; Jolliffe 2002 finds a low-dimensional compressed representation of sensory inputs, i.e., principal components, that best describes the original high-dimensional inputs. These major principal components constitute the essential features to represent the hidden sources. Conversely, ICA Comon 1994 ; Bell Sejnowski 1995 ; Bell Sejnowski 1997 ; Amari et al 1996 ; Hyvarinen Oja 1997 finds a representation that makes outputs independent of each other. The sequential application of PCA and ICA is commonly used to find an encoder that removes noise and finds a hidden independent representation. Researchers believe that the brain also uses PCA- and ICA-like learning Brown et al 2001

, or more generally Bayesian inference, for sensory processing

Helmholtz 1925 ; Dayan et al 1995 ; Knill Pouget 2004 ; Friston et al 2006 ; Friston 2010 . For example, high-dimensional visual inputs are produced by a superposition of signals from objects, and the visual system performs segmentation and dimensionality reduction to perceive the underlying objects DiCarlo et al 2012 .

A classical setup for BSS assumes a linear generative process Bell Sejnowski 1995 , in which sensory inputs are a linear superposition of independent sources. This linear BSS problem has been extensively studied both analytically and numerically Amari et al 1997 ; Oja Yuan 2006 ; Erdogan 2009 . In this case, the cascade of PCA and ICA is well known to provide a linear encoder that separates the hidden sources Baldi Hornik 1989 ; Chen et al 1998 ; Papadias 2000 ; Erdogan 2007 . Conversely, a more general BSS setup involves a nonlinear generative process. Here, we study such a nonlinear BSS problem, where the goal is to learn the inverse of the nonlinear generative process and thereby infer the original independent hidden sources solely based on sensory inputs. There are five requirements for solving this problem. The first two involve the representational capacity of the encoder: (1) The encoder’s parameter space must cover a genuine solution that well approximates the inverse of the generative process. (2) The encoder’s parameter space should not be too large; otherwise, a nonlinear BSS problem can have infinitely many spurious solutions, at which all outputs are independent but dissimilar to the original hidden sources Hyvarinen Pajunen 1999 ; Jutten Karhunen 2004

. Hence, it is nontrivial to constrain the encoder’s representational capacity to satisfy these two opposing requirements. The remaining three requirements focus on the unsupervised learning algorithm, whose purpose is to find the optimal encoder’s parameters : (3) the learning dynamics must have a fixed point at the solution that well approximates the inverse; (4) the fixed point must be linearly stable so that the learning process can converge to the solution; and (5) the probability of not converging to this fixed point must be small, i.e., most realistic initial conditions must be within the basin of attraction of this genuine solution.

The cascade of PCA and ICA has been applied to real-world BSS problems that most likely involve nonlinear generative components Calhoun et al 2009 ; however, there is no guarantee that this linear method will solve a nonlinear BSS problem. Generally speaking, a linear encoder cannot represent the inverse of a nonlinear generative process, and this violates Requirement 1 above. A typical approach for solving a nonlinear BSS problem is to use a nonlinear BSS method Lappalainen Honkela 2000 ; Karhunen 2001

, in which a nonlinear encoder such as a multilayer neural network

Hinton Salakhutdinov 2006 ; Kingma Welling 2013 ; Dinh et al 2014

is trained to invert the generative process. If the representation capacity of the encoder is large enough (e.g., involving many neurons in the network), this approach satisfies Requirement 1, and learning algorithms that satisfy Requirements 3 and 4 are known

Dayan et al 1995 ; Friston 2008 ; Friston et al 2008 . However, it is still nontrivial to reliably solve a nonlinear BSS problem. If the representation capacity of the encoder is too large (i.e., violates Requirement 2), the encoder can have infinitely many spurious solutions. For a simple two-dimensional toy problem, it is possible to limit the representation capacity of the nonlinear encoder so that its parameter space only includes the true solution and no spurious solutions Hyvarinen Pajunen 1999 ; however, designing a good model representation is more difficult in more general cases. To our knowledge, there is no theoretical guarantee for solving a nonlinear BSS problem, except in some low-dimensional examples. Moreover, even when Requirement 2 is satisfied, there is no guarantee that a learning algorithm converges to the genuine source representation. It may be trapped in a local minimum, at which outputs are not independent of each other. Thus, the core of the problem is the lack of knowledge of how to simplify the parameter space of the inverse model to remove spurious solutions and local minima (to satisfy Requirements 2 and 5) while retaining the representation capacity (Requirement 1).

Here, we give a theoretical guarantee for the use of a linear encoder in solving a nonlinear BSS problem in cases in which the sensory input is sufficiently high-dimensional and diverse. If the number of the effecting dimensions of sensory inputs is much greater than the number of sources, even a linear superposition of sensory inputs can represent all the hidden sources accurately (satisfying Requirement 1). The use of the linear encoder is beneficial because it can asymptotically find an optimal linear encoder solely based on sensory data (Requirements 3,4) while avoiding any spurious solution or local minimum at which the estimated sources are distinct from the original sources (Requirements 2,5).

In what follows, we estimate the accuracy of and accessibility to a BSS solution by a linear encoder and find the conditions under which the nonlinear BSS problem is reliably solvable. We propose two theorems assuming that sensory inputs are generated by a nonlinear generative process composed of a two-layer network. As the dimensions of sources increase and the number of units in each layer increases relative to the number of sources, an optimal linear encoder can progressively more effectively decompose sensory inputs into hidden sources. Our first theorem guarantees that this reconstruction error reaches zero asymptotically. Our second theorem states that, despite the high-dimensional sensory inputs, PCA can pick up the relevant subspace spanned by the hidden sources in a completely unsupervised manner. Thus, by first performing PCA and then performing ICA, even a linear neural network can reliably find an optimal linear encoder whose reconstruction error asymptotically reaches zero. Altogether, these results reveal a mathematical condition in which the linear BSS technique can invert a nonlinear generative process without being trapped in a spurious solution or local minimum.

## Results

### Model inversion by linear neural network

First, let us see how an optimal linear encoder can approximately invert a two-layer nonlinear generative process to separate hidden sources. Suppose

as hidden sources independently following an identical probability distribution

, and as higher- and lower-layer mixing matrices, respectively,

as a constant vector of offsets,

as a nonlinear function, and as nonlinear bases (see also Fig. 1). Sensory inputs are defined by

 x≡Bf(As+a)≡Bf. (1)

This generative process is universal Hornik et al 1989 ; Barron 1993 and can represent an arbitrary mapping as increases by adjusting the parameters , , and . The model is also universal if each element of and

are independently generated from a Gaussian distribution

Rahimi Recht 2008a ; Rahimi Recht 2008b as long as is tuned to minimize the cost function . Here, describes the average over . The scaling of is to ensure that the argument of is order 1. The offset is introduced for this model to express any generative process but it is negligibly small relative to for large . We show that a robust nonlinear BSS is possible if the generative process includes sufficient variety of nonlinear mappings from sources to sensory inputs, which we formulate as a mathematical condition later. In the following, we assume .

Next, we consider a linear encoder (a single-layer linear neural network) defined by

 u≡W(x−E[x]), (2)

where are neural outputs, and is a synaptic strength matrix.

In the following, we first consider a synaptic strength matrix that minimizes the cost function and show that the resulting encoder extracts all hidden sources with asymptotically zero error if the inputs have sufficient variability. Note that both and are required to compute this encoder but is unknown in the BSS setup. In the section after next, we show that it is possible to find approximately the same encoder without knowing .

The minimization of the above cost function simply gives . Note that is approximately Gaussian distributed for large

by the central limit theorem and, in this case,

 E[xsT]=BE[f(As+a)sT]≈Bdiag(E[f′i])A≈¯¯¯¯¯f′BA (3)

from the Bussgang theorem Bussgang 1952 , where is a diagonal matrix whose -th diagonal element is . Note that, in case of random and small , all basis functions have asymptotically identical statistics and their derivatives approach a constant . The above approximations become exact if is Gaussian. Without losing the universality of the model representation, we can assume . Hence, we can compute the solution to be

 W∗=1¯¯¯¯¯f′(ATBTBA)−1ATBT. (4)

This result is intuitively understandable if is a linear function, but it also holds with a general . In the special case that

is an identity matrix, the above expression simplifies to

 W∗≈Ns¯¯¯¯¯f′NfATBT (5)

because follows and converges in probability to for large .

If we use the encoder in (5), the resulting neural outputs asymptotically converge to hidden sources as the dimensions of hidden sources and ratios increase. This was confirmed by numerical calculations, where is approximately expressed as a function only of (Fig. 2A). This also means that becomes almost independent from (Fig. 2B) and other sources. This means that, while is generally a function of , is a function only of asymptotically when the encoder is used.

### Asymptotic linearization theorem

In this section, we analytically quantify errors in expressing the hidden sources using the optimal linear encoder. Before introducing a general theorem, we intuitively show that the output vector asymptotically converges to the source vector . In this section, we assume and as the identity matrix. (We relax these assumptions in later sections.) The output here is simply given by from (5). The difference between the outputs and hidden sources is quantified by

 ⟨(u−s)(u−s)T⟩=N2s¯¯¯¯¯f′2N2f⟨ATCov[f]A⟩−I. (6)

from (3) and again for a random , where is the covariance. Here, describes an average over , , and . If two zero-mean and unit-variance Gaussian variables and have small correlation , we find with from the Taylor expansion by Toyoizumi Abbott 2011 (see Lemma 1 in Methods). Applying this expansion, we obtain the leading order . Altogether, we find

 ⟨(u−s)(u−s)T⟩≈⎛⎜⎝NsNf¯¯¯¯¯¯f2−¯¯¯f2¯¯¯¯¯f′2+¯¯¯¯¯¯¯¯¯f(3)22Ns¯¯¯¯¯f′2⎞⎟⎠I, (7)

Because the squared error cannot be negative, this equation shows that the output converges to the hidden sources as and tend to infinity for almost all and .

In the above expression, we focused on characterizing the distribution of while treating all nonlinear components such as for

as the error. Below, we take such nonlinear terms into account and formally quantify the conditional probability distribution of

given , , and , namely, with .

#### Theorem 1 (asymptotic linearization)

Suppose , as a random mixing matrix and as a random offset vector whose elements independently follow , and as a matrix that satisfies . When we use the encoder with the synaptic weight matrix in (5), the conditional probability of the -th neural output asymptotically follows Gaussian distribution , where the conditional mean and conditional variance follow

 p(μi|si,A,a) (8) p(σ2i|si,A,a) =N⎡⎢⎣NsNf¯¯¯¯¯f′2⟨Varv[wf(v+wsi√Ns)]⟩w+12¯¯¯¯¯f′2⟨wf′′(v+wsi√Ns)⟩2v,w,NsN2f⎤⎥⎦

respectively. The variability in and reflects the dependency on and . Note that and are the expectation over and , respectively, and that is the variance over , where and are unit random Gaussian variables. Further, the covariance between two different neural outputs and () is distributed according to . Thus, as increases, the conditional means, conditional variances, and covariances quickly converge to values that are not dependent on and . See the Methods section for the proof of Theorem 1.

In short, we demonstrated that, when the number of sources and the ratio of the number of basis functions to the number of sources are both large, we can analytically quantify the accuracy of the linear encoder in inverting the nonlinear generative process to separate hidden sources . The accuracy increases as and increase, and the output asymptotically converges to itself. In this manner, the linear encoder expresses the inverse of the nonlinear generative process, and its outputs express the genuine hidden sources.

Theorem 1 is empirically examined by numerical experiments. Examples are illustrated in Fig. 2. When , approximates but not (Fig. 2A,B) as expected. Indeed, approximate , respectively, but are independent of other sources (Fig. 2C). Whereas, when is a random synaptic matrix with the same variance as , are approximately independent of all sources and their variances are almost zero (Fig. 2D). Theorem 1 well predicts the marginal output distribution for a wide range of , , and (Fig. 2E-H) for different nonlinear basis functions (Fig. 2I-K). Note that when is uncorrelated with unit Gaussian variable , only a noise component is extracted; no linear (signal) component is extracted (Fig. 2L). Then, Theorem 1 is quantitatively verified (Fig. 3). The log-log plot illustrates that when and , the mean square error of from reduces inversely proportional to but saturates around (Fig. 3A). This shows the existence of the order error. In contrast, when is a random synaptic matrix, the variance of decreases inversely proportional to regardless of (Fig. 3B). All these results are predicted by Theorem 1.

### Extracting the optimal encoder without knowing the sources

Theorem 1 states that with the proposed synaptic strength matrix, a linear neural network can accurately extract hidden sources. However, whether the network can find this linear encoder through unsupervised learning is another problem. If the nonlinear generative process is unknown and only is observed as in the BSS problem, finding may be difficult especially when is large. Indeed, when is randomly chosen, rate containing components of optimal linear encoders decreases inversely proportional to (Fig. 3C). Hence, with large , only a fraction of in the entire space is close to by chance. However, we show below that it is possible to extract within the BSS problem setup.

Thus far, we have considered a special case in which and is an identity matrix. Here, we study a general case involving and non-identity matrix . Numerical experiments show that the scale of the uncertainty term depends on both and when is randomly sampled from a Gaussian distribution. The log-log plot illustrates that the mean square error of from is determined by a smaller one between and (Fig. 3D).

To analytically study the general case, we assume that

is an odd nonlinear function. Although it simplifies mathematical expressions, this assumption does not weaken our claims because the presumed generative process in (

1) remains universal.

In the special case of (i.e., ), we also know from Theorem 1 that the best encoder that minimizes the cost function is for large . Hence, the residual must be uncorrelated with the hidden sources, i.e., . Further, using Lemma 1, the covariance of the residual is to the leading order. Hence, the inputs are described by

 x−E[x]≈¯¯¯¯¯f′BAs+√¯¯¯¯¯¯f2−¯¯¯¯¯f′2Bz. (9)

Here we denote the residual as . The first term represents the signal that linearly encodes the hidden sources, and the second term represents the noise introduced by the nonlinearity. Hence, the hidden sources can be extracted if the first term is sufficiently larger than the second term.

To clearly determine which factor in the above equation dominates, let us consider a coordinate transformation. We introduce a rotation matrix , with block dimensions and , to decompose into the two orthogonal subspaces—one parallel to the matrix , i.e., , and the other perpendicular to it, i.e., . The orthogonality condition imposes that and are identity matrices with the corresponding dimensions; thus, and are satisfied. Using this notation, we obtain

. We assume that the generative process includes a sufficient variety of nonlinearity in the sense that all singular values of

are of the same order of magnitude. Applying this coordinate transformation, the inputs are described by

 x−E[x]≈¯¯¯¯¯f′B∥(RT∥A)s+√¯¯¯¯¯¯f2−¯¯¯¯¯f′2(B∥RT∥+B⊥RT⊥)z. (10)

There is an important dependency in the above equation, namely, in the limit of . Altogether, we find that the covariance of the inputs is

 Cov[x]≈NfNs¯¯¯¯¯f′2B∥BT∥+(¯¯¯¯¯¯f2−¯¯¯¯¯f′2)(B∥BT∥+B⊥BT⊥). (11)

Because and have the same order of magnitude, the first signal term (signal covariance) asymptotically dominates the major principal components for large . Hence, all the directions of must be extracted from as the major components. Specifically, by applying PCA to , the signal term is described by the first to the eigenmodes:

#### Theorem 2 (eigenvalue decomposition)

We sort the real eigenvalues of

in descending order and express the first to the major eigenvalues by diagonal matrix , and the corresponding eigenvector matrix . From the above argument, we asymptotically obtain

 PΛPT≈NfNs¯¯¯¯¯f′2B∥BT∥ (12)

for large . See the Methods section for a more quantitative condition for finite .

This expression directly provides a key factor to express the optimal encoder, i.e., , where is an arbitrary rotation matrix. Indeed, the synaptic strength matrix of the optimal encoder (4) is summarized by

 QW∗ =Q[QTΛ1/2PTPΛ1/2Q]−1QTΛ1/2PT=Λ−1/2PT (13)

This shows that we can compute the optimal encoding weight up to an arbitrary rotation factor from the major eigenvalues and eigenvectors of , which are available under the BSS setting. In the limit of large and , the outputs of the encoder asymptotically converge to . This means that the outputs are not independent of each other because of the multiplication with . However, what is remarkable about this approach is that it converts the original nonlinear BSS problem to a linear BSS problem. Namely, is now a linear mixture of the hidden sources, such that we can extract all the hidden sources by further applying a linear ICA method to the outputs .

As shown in Fig. 4, it is numerically validated that the first to -th major principal components of well approximate the signal covariance (Fig. 4A), i.e., . Moreover, as increases, the subspace of the major principal components asymptotically matches to the subspace of the signal components (Fig. 4B). This indicates that PCA is a promising method for finding an optimal linear encoder up to a linear random rotation of sources.

### Hebbian-like learning rules can find an asymptotically optimal linear encoder through a cascade of PCA and ICA

In the previous section, we showed that PCA can reliably extract a subspace spanned by components of the optimal linear encoder as the first to the -th major principal components. As the dimensions of sensory inputs and nonlinear bases become large, PCA can more accurately extract the sources. We next explore if a more biologically plausible learning rule for the encoder can also extract the subspace spanned by the optimal linear encoder.

Oja’s subspace rule Oja 1989 , a type of modified Hebbian plasticity (see also Methods for its update rule), is known to perform PCA and extract a subspace of the first to -th major principal components without being trapped by a spurious solution or local minimum Baldi Hornik 1989 ; Chen et al 1998 . Hence, as a corollary of Theorem 2, starting from almost all random initial conditions, Hebbian-like learning rules can reliably find an optimal linear encoder in an unsupervised manner:

Suppose a synaptic strength matrix is given by with coefficient matrices and . As described above, is the major eigenvector matrix of , while is the remaining minor eigenvector matrix perpendicular to . Together, is a rotation matrix. From (13), the synaptic strength matrix of the optimal encoder is expressed by the term. The so-called Hebbian factor is expressed as the product of the outputs and inputs:

 E[u(x−E[x])T] =WCov[x] =(CΛ)PT+(CmΛm)PTm, (14)

where we write the eigenvalue decomposition of the input covariance as , by introducing its minor eigenvalues . From the argument in the previous section, the signal components are typically times greater than the noise components. If we consider iteratively updating the synaptic strength matrix by with small positive learning rate , the signal components much more rapidly glow and dominate . Therefore, if an appropriate normalization of synaptic strengths Oja 1989 is imposed, we can extract with some coefficient matrix . Hence, the outputs are asymptotically expressed by . In this manner, the Hebbian plasticity also converts the nonlinear BSS problem to a linear BSS problem, and the hidden sources can be extracted if a linear ICA method is subsequently applied to the output .

Equation (Hebbian-like learning rules can find an asymptotically optimal linear encoder through a cascade of PCA and ICA) shows that Hebbian plasticity enhances the signal components while filtering out the noise components. Hence, Hebbian plasticity can extract the subspace of the optimal encoder even when initially starts with small signal components. This also indicates that the basin in which converges to an optimal linear encoder covers almost the entire parameter space of . This speaks the global convergence and the absence of any spurious solutions or local minima.

Numerical experiments show that the accuracy of Hebbian product in extracting components of increased as the time steps for the Hebbian update increased, and that it saturated at a containing rate of around 95% when and while was randomly initialized (Fig. 5A,B). The synaptic strength matrix updated by Oja’s subspace rule for PCA Oja 1989 also converged to up to the multiplication of the random matrix from the left. We found that Oja’s subspace rule could extract all components of the optimal linear encoders as the first to the -th major principal components. The transition of is shown in Fig. 5C,D. While started from random initial states, reliably converged to a matrix that contains approximately 95% components. These results indicate that Hebbian plasticity can reliably and accurately find all components of the optimal linear encoders.

As we have shown, PCA as well as Hebbian plasticity successfully extract the linear components of the hidden sources covered in the nonlinear components. Thereby, the original nonlinear BSS problem has now become a simple linear BSS problem. Thus, the linear ICA (e.g., Amari’s ICA rule Amari et al 1996

; see also Methods for its update rule) can reliably separate all hidden sources from the features extracted by Hebbian plasticity. We numerically confirmed that this is the case. Crucially, those independent components match to the true sources of the nonlinear generative process up to the permutation and sign-flips. We quantify the BSS error, i.e., the difference between those independent components and the genuine hidden sources, by asking how much the mapping from sources to outputs is different from the identity mapping. We found that a nearly zero BSS error is achieved when

and while the error remains when (Fig. 5E). These results highlight that the cascade of PCA and ICA can find an optimal linear encoder with high accuracy when .

## Discussion

In this study, we theoretically quantified the accuracy of a model inversion performed using an optimal linear encoder when the sensory inputs are generated from a two-layer nonlinear generative process. First, we introduced the asymptotic linearization theorem, which states that as the dimension of hidden sources increases and the dimensions of sensory inputs and nonlinear bases increase relative to the source dimension, an optimal linear encoder can accurately separate sensory inputs into genuine hidden sources (Theorem 1). By applying the optimal linear encoder to sensory inputs, the linearly encodable component of hidden sources is magnified relative to the nonlinear component in proportion to the ratio of the numbers of bases and outputs to the number of sources; thereby, the nonlinear transformations of sources are effectively removed. Hence, an optimal linear encoder can approximately express the inverse of the nonlinear generative process. We analytically and numerically validated this theorem. Next, we showed that the first to -th major principal components approximately express a subspace spanned by the outputs of the optimal linear encoder, and the accuracy of the expression increases as the variety of sensory inputs increases relative to the source dimension (Theorem 2) because the gap between the minimum eigenvalue of these linear components (i.e., signals) and the maximum eigenvalue of the nonlinear components (i.e., noise) increases. This means that through Hebbian plasticity, a synaptic strength matrix reliably extracts the subspace spanned by the optimal linear encoder, when started from most initial states. Applying linear ICA to the extracted principal components reliably gives an optimal linear encoder up to permutations and sign-flips. Thus, the PCA-ICA cascade provides an optimal linear BSS method that separates sources of the nonlinear generative process, when the dimensions and the variety are sufficiently large. Unlike conventional nonlinear BSS methods that have spurious solutions Hyvarinen Pajunen 1999 ; Jutten Karhunen 2004 , the PCA-ICA cascade is guaranteed to find the genuine hidden sources in the asymptotic condition, which successfully satisfies Requirements 1-5 mentioned in the introduction.

Nonlinear variants of PCA, such as the autoencoder

Hinton Salakhutdinov 2006 , have been widely used for representation learning Arora 2017 . Because natural sensory data are highly redundant and occupy only a fraction of the entire input space Chandler Field 2007 , these algorithms seek compressed representation of the sensory data. Generally, if a large nonlinear neural network is used, many equally good solutions exist Kawaguchi 2016 ; Lu Kawaguchi 2017 ; Nguyen Hein 2017 . (This property is related to the fact that there are generally infinitely many spurious solutions for nonlinear ICA Hyvarinen Pajunen 1999 ; Jutten Karhunen 2004

.) Hence, there is no objective reason to choose one solution or another as long as they show similar reconstruction accuracy. Thereby, the results of these nonlinear algorithms are intrinsically ambiguous. Which of these solutions is actually found may depend on the heuristic design of regularization parameters

Dahl et al 2012 ; Hinton et al 2012 ; Wan et al 2013 . We have approached this representation learning from a different perspective, namely by inverting the nonlinear generative process and extracting independent hidden sources. Unlike the nonlinear approaches above, our approach using a linear encoder has a global convergence proof to extract an optimally compressed and unique set of hidden sources, if the sensory inputs have sufficient variety. That is, the algorithm always finds the same and true hidden sources regardless of how they are nonlinearly mixed by the generative process. This property would be related to a property of nonlinear neural networks in which a solution with high generalization performance tends to have a large volume of basin of attraction Wu et al 2017 . In short, we found that the linear PCA suffices to asymptotically achieve the optimal compression by extracting a linear mixture of hidden sources and eliminating their nonlinear components. Thus, the combination of linear PCA and ICA can find the inverse of the nonlinear generative process and it can be used, if needed, to further learn the forward model of this generative process.

Neural networks in the brain are known to exhibit Hebbian plasticity Hebb 1949 ; Malenka Bear 2004 . Researchers believe that Hebbian plasticity plays a key role in representation learning Dayan Abbott 2001 ; Gerstner Kistler 2002 and, more specifically, BSS Brown et al 2001 . Indeed, major BSS algorithms such as PCA Oja 1982 ; Oja 1989 and ICA Bell Sejnowski 1995 ; Bell Sejnowski 1997 ; Amari et al 1996 ; Hyvarinen Oja 1997 are formulated as a variant of Hebbian plasticity. Moreover, biological neural networks learn to separately represent independent signals only in the presence of Hebbian plasticity Isomura et al 2015 . Nonetheless, it is still unclear how the brain can possibly perform a nonlinear BSS, which has been suggested as a prerequisite for many cognitive processes such as visual recognition DiCarlo et al 2012 . Our theorems indicate that a combination of PCA and ICA algorithms can reliably separate hidden sources that are nonlinearly mixed in the environment, when sufficiently rich sensory inputs are provided. While we have used Oja’s subspace rule for PCA and Amari’s ICA rule in this paper, more biologically plausible local Hebbian learning rules are proposed Foldiak 1990 ; Linsker 1997 ; Isomura Toyoizumi 2016 ; Pehlevan et al 2017 ; Isomura Toyoizumi 2018 ; Leugering Pipa 2018 . A recent work showed that even a single-layer neural network can perform both PCA and ICA through a local learning rule Isomura Toyoizumi 2018 . Altogether, here we suggest a hypothesis to explain how the brain solves a nonlinear BSS problem; that is, first applying a Hebbian PCA rule to convert a nonlinear BSS problem to a linear ICA problem and then using Hebbian ICA rule to separate hidden independent sources mixed behind the scenes.

Because most natural data (including biological, chemical, and social data) are generated from nonlinear generative processes, broad applications of nonlinear BSS are considered. The proposed theorems provide a license to apply standard linear PCA and ICA to natural data for the purpose of inferring hidden sources of a nonlinear generative process. An interesting possibility is that living organisms might have developed high-dimensional sensors to perform nonlinear BSS with a linear encoder, which might be related to the large numbers of sensory cells in humans; e.g., about 100 million rod cells and six million cone cells in the retina and about 16,000 hair cells in the cochlea Kandel et al 2013 .

In summary, we showed that as the dimensions of sensory inputs, nonlinear bases, and hidden sources increase, a single-layer linear network can more accurately establish an optimal linear encoder to decompose the genuine hidden sources. This optimal linear encoder can be reliably found by PCA as well as Hebbian plasticity followed by subsequent ICA in an unsupervised manner. This is because, with the increases in dimension and variety, sensory inputs can provide greater evidence about the hidden sources, which remove the possibility of finding spurious solutions. The result guarantees that the PCA-ICA cascade can reliably find genuine hidden sources from high-dimensional nonlinear source mixtures.

## Methods

### Proof of Lemma 1

Suppose and are zero-mean and unit-variance Gaussian variables and and are arbitrary functions. When and have small correlation , satisfies , where is a zero-mean and unit-variance Gaussian variable that is independent of . When we define as the covariance between and , its derivative with respect to is given by

 ϕ′(c) =Cov[f(v),g′(cv+√1−c2ξ)(v−c√1−c2ξ)] =E[(f(v)−E[f(v)])g′(cv+√1−c2ξ)(v−c√1−c2ξ)]. (15)

From the Bussgang theorem Bussgang 1952 ,

 E[(f(v)−E[f(v)])g′(cv+√1−c2ξ)v]=E[f′(v)g′(cv+√1−c2ξ)+(f(v)−E[f(v)])g′′(cv+√1−c2ξ)c] (16)

and

 E[(f(v)−E[f(v)])g′(cv+√1−c2ξ)ξ]=E[(f(v)−E[f(v)])g′′(cv+√1−c2ξ)√1−c2]. (17)

Thus, becomes

 ϕ′(c)=E[f′(v)g′(cv+√1−c2ξ)] (18)

Hence, we find

 ϕ(n)(c)=E[f(n)(v)g(n)(cv+√1−c2ξ)]. (19)

From the Taylor expansion by ,

 ϕ(c)=∞∑n=1ϕ(n)(0)cnn!=∞∑n=1¯¯¯¯¯¯¯¯¯f(n)¯¯¯¯¯¯¯¯g(n)cnn!, (20)

where indicates the expectation of over .

Applying this expansion, we obtain to the leading order

 ⟨ATCov[f]A⟩ ≈⟨AT⎡⎢⎣diag(Cov[f])+∞∑n=1¯¯¯¯¯¯¯¯¯f(n)2n!{(AAT)⊙n−diag((AAT)⊙n)}⎤⎥⎦A⟩ ≈⎛⎜ ⎜⎝(¯¯¯¯¯¯f2−¯¯¯f2)NfNs+¯¯¯¯¯f′2N2fN2s+¯¯¯¯¯¯¯¯¯f(3)2N2f2N3s⎞⎟ ⎟⎠I, (21)

where describes the element-wise power. Note that the diagonal components of are evaluated separately from the non-diagonal components in the above equation.

### Proof of Theorem 1

Here we suppose is the identity matrix. Let us separately consider the term in the argument of

from other terms. We define a new random variable

. The expectation of over and follows a Gaussian distribution , where a fluctuation follows the unit Gaussian distribution . When the synaptic strength matrix is optimal (see (5)), from a Taylor expansion of , we obtain

 ui=NsNf¯¯¯¯¯f′Nf∑j=1Aji{f(yji+Ajisi)−E[f(v)]}=NsNf¯¯¯¯¯f′⎧⎨⎩∞∑n=0snin!Nf∑j=1f(n)(yji)An+1ji−¯¯¯fNf∑j=1Aji⎫⎬⎭. (22)

#### The probability distribution of the mean

Because and are independent of each other, the conditional expectation of the mean of under fixed and is given by

 μi(si)=E[ui|si,Aj1,…,AjNs]=NsNf¯¯¯¯¯f′⎧⎨⎩∞∑n=0snin!Nf∑j=1E[f(n)(yji)]An+1ji−¯¯¯fNf∑j=1Aji⎫⎬⎭≈NsNf¯¯¯¯¯f′∞∑n=1¯¯¯¯¯¯¯¯¯f(n)snin!Nf∑j=1An+1ji. (23)

The expectation of is given by . Hence, the mean of is

 E[μi|s