# Blind Source Separation: Fundamentals and Recent Advances (A Tutorial Overview Presented at SBrT-2001)

Blind source separation (BSS), i.e., the decoupling of unknown signals that have been mixed in an unknown way, has been a topic of great interest in the signal processing community for the last decade, covering a wide range of applications in such diverse fields as digital communications, pattern recognition, biomedical engineering, and financial data analysis, among others. This course aims at an introduction to the BSS problem via an exposition of well-known and established as well as some more recent approaches to its solution. A unified presentation is followed for the various results, so as to more easily bring out their similarities and differences and to emphasize their relative advantages and disadvantages. Only a representative sample of the existing knowledge on BSS is included in this course. Interested readers are encouraged to consult the list of bibliographical references for more details on this exciting and always active research topic.


## 1 Introduction

Consider the following scenario. A number of people are found in a room and involved in loud conversations in groups, just as would happen at a cocktail party. There might also be some background noise, which could be music, car noise from outside, etc. Each person in this room is therefore forced to listen to a mixture of speech sounds coming from various directions, along with some noise. These sounds may come directly to one's ear or may have first suffered a sequence of reverberations because of their reflections on the room's walls. The problem of focusing one's listening attention on a particular speaker among this cacophony of conversations and noise has been known as the cocktail party problem [6]. It consists of separating a mixture of speech signals of different characteristics with noise added to it. The signals are a priori unknown (one listens only to a combination of them), as is the way they have been mixed. The above scenario is a good analog for many other situations that demand a separation of mixed signals with no presupposed knowledge of the signals and the system mixing them. The list is long:

• Reception for single- and multi-user communications with no training data [76, 175, 125, 108]

• Analysis of biomedical signals (e.g., electroencephalographic (EEG) [119] and electrocardiographic (ECG) signals [7], fetal electrocardiography [67], functional Magnetic Resonance Imaging (fMRI) data [70], etc.)

• Restoration of images taken from inaccessible scenes [116] (e.g., ultrasonography [1], astronomical imaging [8], images of a nuclear reactor, etc.)

• Uncalibrated (or partially calibrated) array processing [22, 25]

• Geophysical exploration [117]

• Denoising [95]

• Voice-controlled machines [172]

• Circuit testing [172]

• Semiconductor manufacturing [172]

This kind of problem is commonly referred to as Blind Source Separation (BSS). The term blind is used to emphasize that the signals are to be separated on the basis of their mixture only, without accessing the signals themselves and/or knowing the mixing system, i.e., blindly. Additional information is usually required, involving however only general statistical and/or structural properties of the sources and the mixing mechanism. For example, in a multi-user digital communications system one might assume the signals transmitted to be mutually independent, i.i.d., and BPSK modulated, and the channel(s) to be linear, time-invariant, but otherwise unknown. If no training data are allowed to be transmitted along, due for example to bandwidth limitations, then reception has to be performed blindly.

BSS sounds like, and indeed is, a difficult problem, especially in its most general version. Note that humans generally experience little difficulty in coping successfully with the cocktail party problem. However, this fact should not be overemphasized, in view of the amazing complexity and performance of the human cognitive system. It would not be that easy to separate speakers talking at the same time by relying, for example, solely on the recordings of a few microphones placed in the room. Nevertheless, a lot of progress has been made in the last ten years or so in the study of the BSS problem, and a great number of successful methods, both general and specialized, have been developed and their applicability demonstrated in several contexts. Self-organizing neural networks [83] have, as will be seen in this course, directly or indirectly played an important role in this research effort, exploiting in a way the human capabilities for BSS through their analogy with related biological mechanisms.

This course aims at providing a tutorial overview of the most important aspects of the BSS problem and approaches to solving it. For reasons to be explained, only methods based on higher-order statistics are discussed. In addition to well-known and established results and algorithms, some more recent developments are also presented that show promise in improving/extending the capabilities and removing the limitations of earlier, classical approaches. BSS is a vast topic, impossible to fit into a short course. Only a selected, yet representative, part of it is thus treated here, hopefully sufficient to motivate interested readers to further explore it with the aid of the (necessarily incomplete) provided bibliography.

The rest of this text is organized as follows. The BSS problem is stated in precise mathematical terms in Section 2, along with a set of assumptions that allow its solution. Section 3 examines the possibility of solving the problem by relying on second-order statistics (SOS) only. It is argued there that SOS are insufficient to provide a complete solution in all possible scenarios. Higher-order statistics (HOS) are generally required. An elegant and intuitively pleasing manner of expressing and exploiting this additional information is provided by information-theoretic (IT) criteria. Information is defined in Section 4 and its relevance to the BSS problem is explained. The criterion of minimal mutual information (MI) is formulated and interpreted in terms of the Kullback-Leibler divergence. A unification of seemingly unrelated IT and statistical criteria is put forward in Section 5, based on the criterion of minimum MI. This results in a single, "universal" criterion, which is shown to be interpretable in terms of the so-called nonlinear Principal Component Analysis (PCA). A corresponding algorithmic scheme, resulting from its gradient optimization, is presented in Section 6, along with its interpretation as a nonlinear Hebbian learning scheme and a discussion of the choice of the involved nonlinearities and their implications for the algorithm's stability. It is also shown how the use of the relative gradient in the above algorithm transforms it into one that enjoys the so-called equivariance property. In Section 7 the close relation of the above scheme with the recently proposed FastICA algorithm is demonstrated. The latter is shown in turn to specialize to the super-exponential algorithm, and its variant, the constant-modulus algorithm, well known from the channel equalization problem. Section 8 discusses algebraic methods for BSS, based on the concept of cumulant tensor diagonalization.
Their connection with the above approaches is pointed out, via their interpretation as cost-optimization schemes. Some recently reported deterministic methods, demonstrated to work for short data records, are also included. The model employed thus far involves a static (memoryless), linear, time-invariant and perfectly invertible mixing system with stationary sources. These simplifying assumptions are relaxed in the following sections. Non-invertible mixing systems are dealt with in Section 9, where both algebraic and criterion-based approaches are discussed. Dynamic (convolutive) mixtures are discussed in Section 10, where methods for dealing with nonlinear or time-varying mixtures and nonstationary sources are also briefly reviewed. Some concluding remarks are included in Section 11.

Notation:

Bold lower case and upper case letters will designate vectors and matrices, respectively. Unless stated otherwise, capital, calligraphic, boldface letters will denote higher-order tensors.

$\mathbf{I}_N$ will be the $N\times N$ identity matrix. The subscript will be omitted whenever understood from the context. The superscript $(\cdot)^T$ will denote transposition. $\otimes$ is the Kronecker product. The operation of building a vector from a matrix by stacking its columns one above another will be denoted by $\mathrm{vec}(\cdot)$.

## 2 Problem Statement

Consider the mapping

$$\mathbf{u}(n)=\mathcal{F}(\mathbf{a}(n),\mathbf{v}(n),n) \qquad (1)$$

mixing the source signals

$$\mathbf{a}(n)=[a_1(n)\ a_2(n)\ \cdots\ a_N(n)]^T \qquad (2)$$

and the noise signals

$$\mathbf{v}(n)=[v_1(n)\ v_2(n)\ \cdots\ v_K(n)]^T \qquad (3)$$

via the mixing system to yield the mixture

$$\mathbf{u}(n)=[u_1(n)\ u_2(n)\ \cdots\ u_M(n)]^T. \qquad (4)$$

$\mathcal{F}$ is in general nonlinear, and its time argument, $n$, signifies that it can also be time-varying. BSS can thus be stated as the problem of identifying the mapping $\mathcal{F}$ and/or extracting the sources $\mathbf{a}(n)$ while having access only to the mixture $\mathbf{u}(n)$. Both the system and its inputs are assumed unknown. Nevertheless, for the problem to have a solution, some general structural and/or statistical properties of the quantities involved are assumed. The mixing system is usually supposed to be linear and time-invariant. The most common assumption for the sources is that they are mutually independent and independent of the noise. Depending on the application context, more information may be available. For example, in a digital communications system the source signals might be known to belong to some discrete alphabet. The set of assumptions made for $\mathcal{F}$, $\mathbf{a}(n)$, and $\mathbf{v}(n)$ constitutes the model adopted for the problem and determines the range of possible solution approaches.

Due to its relative simplicity and wide applicability, the case of a linear, time-invariant mixing system (with stationary sources) has been the focus of the great majority of the works on BSS. This can be written as

$$\mathbf{u}(n)=\mathbf{H}\mathbf{a}(n)+\mathbf{v}(n) \qquad (5)$$

where $\mathbf{H}$ is an $M\times N$ mixing matrix and $K=M$. The following set of assumptions is typical:

• A1: The signals $\mathbf{a}(n)$ and $\mathbf{v}(n)$ are stationary and zero-mean.

• A2: The sources are statistically independent.

• A3: The noises are statistically independent and independent of the sources.

• A4: The number of sensors exceeds or equals the number of sources: $M\geq N$.

The mixing system in (5) is moreover static, i.e., memoryless. This model could also encompass dynamic systems if it included an assumption of independence between samples of a signal and a structural constraint (Toeplitz) on $\mathbf{H}$. Linear systems with memory will be briefly addressed in Section 10.

The goal of linear BSS is to determine a separating matrix $\mathbf{G}$ such that

$$\mathbf{y}(n)=\mathbf{G}\mathbf{u}(n) \qquad (6)$$

is a good approximation of the source signals. Assumption A4 signifies that the mixture is not under-determined, i.e., in the absence of noise, if $\mathbf{H}$ is identified and of full rank, then it can be inverted to yield the sources. If the noise is not weak enough to be neglected, perfect recovery of all sources is impossible. In fact, the noisy mixture may be viewed as a special case of a noise-free mixture that does not meet A4. Just write (5) in the form

$$\mathbf{u}(n)=\mathbf{H}'\mathbf{a}'(n)$$

where $\mathbf{H}'=[\mathbf{H}\ \ \mathbf{I}_M]$ and $\mathbf{a}'(n)=[\mathbf{a}^T(n)\ \mathbf{v}^T(n)]^T$. Then $\mathbf{H}'$ is $M\times(N+M)$ and cannot be (linearly) inverted. (Possible ways of recovering the sources in such under-determined mixtures, via nonlinear inversion or one-by-one source extraction, will be discussed in Section 9.) Hence, one can, without loss of generality, write

$$\mathbf{u}(n)=\mathbf{H}\mathbf{a}(n) \qquad (7)$$

for the linear mixing process. The general linear BSS setup is shown schematically in Fig. 1.
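The noise-free instantaneous model (7) is easy to simulate, which is useful for experimenting with the methods discussed in this course. The following is a minimal NumPy sketch (the particular source distributions and the random mixing matrix are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 10000          # number of sources, number of samples

# Independent, zero-mean, unit-variance, non-Gaussian sources (A1, A2, A5):
a = np.vstack([
    rng.choice([-1.0, 1.0], size=T),          # BPSK-like source
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),  # uniform source
    rng.laplace(0, 1 / np.sqrt(2), T),        # Laplacian source
])

# Unknown square (M = N), almost surely invertible mixing matrix H:
H = rng.standard_normal((N, N))

u = H @ a                # noise-free instantaneous mixture, eq. (7)
```

In this noise-free square case, knowing $\mathbf{H}$ would trivially solve the problem ($\mathbf{G}=\mathbf{H}^{-1}$); the difficulty, of course, is that only `u` is observed.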

The (nonlinear) function $\mathbf{g}(\cdot)$ at the separator's output may not actually be present. It is needed, however, in the development of the criteria and associated algorithms.

It must be noted that, unless additional information is available, there is an indeterminacy with respect to the scaling and order of the separated sources [172]. The result of the mixing in (7) will not change if a source is multiplied by a scalar and the corresponding column of $\mathbf{H}$ is divided by it. A permutation of the source signals, accompanied by an analogous permutation of the columns of the mixing matrix, will not have any effect on $\mathbf{u}(n)$ either. The best one could expect, therefore, is to recover the $a_i(n)$'s scaled and/or permuted, i.e.,

$$\mathbf{y}(n)=\mathbf{P}\mathbf{D}\mathbf{a}(n)$$

where $\mathbf{P}$ is a permutation matrix and $\mathbf{D}$ is a diagonal (invertible) matrix. The following assumption on the source energy can therefore be made without harming the generality:

• A5: The sources have unit variance: $E[a_i^2(n)]=1$, $i=1,2,\dots,N$.

This course, in its main part, will focus on the noise-free, linear, static (instantaneous) mixture case (7), with as many sensors as sources, i.e., $M=N$, an invertible mixing matrix, and sources satisfying assumptions A1, A2, and A5. Furthermore, to simplify the presentation, we shall frequently omit the inherent scale and order indeterminacy and write $\mathbf{y}(n)=\mathbf{a}(n)$ for a separation solution. Moreover, all quantities are assumed to be real. It will usually be straightforward to extend the results to the complex case.

The model adopted may not be realistic in many practical scenarios. It is sufficient however to convey the basic ideas underlying the problem, while keeping things simple. Besides, as the following analysis should show, its study is already far from being trivial.

## 3 Are Second-Order-Statistics Sufficient?

Based on the model assumptions made above, the BSS problem is solved once the mixing matrix has been identified. Thus the question arises whether $\mathbf{H}$ could somehow be extracted by measuring appropriate quantities related to the mixture signals. Note that the autocorrelation matrix of the observation vector is given by:

$$\mathbf{R}_u=E[\mathbf{u}(n)\mathbf{u}^T(n)]=\mathbf{H}\mathbf{R}_a\mathbf{H}^T=\mathbf{H}\mathbf{H}^T \qquad (8)$$

where the time index is omitted due to stationarity and the source autocorrelation matrix $\mathbf{R}_a$ equals the identity according to assumptions A1, A2, and A5. The above equation suggests that the mixing matrix might be identifiable as a square root of $\mathbf{R}_u$, i.e., $\mathbf{H}=\mathbf{R}_u^{1/2}$, via its eigenvalue decomposition (EVD) for example. In fact, it is only a basis for the column space of $\mathbf{H}$ that this would provide. (If $\mathbf{H}$ results as a convolution matrix of a filter, this information may suffice to identify it [123].) It is easy to see that not only $\mathbf{H}$ but also any product of it with an orthogonal matrix, $\mathbf{H}\mathbf{Q}$ with $\mathbf{Q}\mathbf{Q}^T=\mathbf{I}$, will verify (8). In other words, (8) allows $\mathbf{H}$ to be identified only up to an orthogonal factor $\mathbf{Q}$.

It is instructive to observe that this ambiguity is an instance of the well-known phase indeterminacy problem in spectral factorization for linear prediction [143], where the innovation sequence samples play the role of the sources. The net effect of the corresponding "separator" is to whiten the observations:

$$\bar{\mathbf{u}}(n)=\mathbf{T}\mathbf{u}(n) \qquad (9)$$

with

$$\mathbf{R}_{\bar{u}}=\mathbf{T}\mathbf{R}_u\mathbf{T}^T=\mathbf{I}_N.$$

The operation (9) amounts to a projection of the available data onto the principal directions determined from $\mathbf{R}_u$ and is known in statistics jargon as Principal Component Analysis (PCA) [58, 78]. The whitened vector $\bar{\mathbf{u}}(n)$, whose entries are the principal components of $\mathbf{u}(n)$, is commonly referred to as standardized or sphered [117, 55].
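The whitening step (9) can be sketched as follows. This is a minimal NumPy illustration assuming the whitening matrix is taken as $\mathbf{T}=\mathbf{D}^{-1/2}\mathbf{E}^T$ from the EVD $\mathbf{R}_u=\mathbf{E}\mathbf{D}\mathbf{E}^T$ (one of several valid square-root choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 3, 20000
a = rng.uniform(-np.sqrt(3), np.sqrt(3), (N, T))   # unit-variance sources
H = rng.standard_normal((N, N))
u = H @ a

# Sample autocorrelation matrix R_u = E[u u^T]:
Ru = (u @ u.T) / T

# Whitening matrix T = D^{-1/2} E^T from the EVD R_u = E D E^T:
d, E = np.linalg.eigh(Ru)
Tw = np.diag(d ** -0.5) @ E.T

ubar = Tw @ u                  # standardized (sphered) data, eq. (9)
Rubar = (ubar @ ubar.T) / T    # equals the identity by construction
```

Note that whitening fixes only the second-order structure: the residual factor $\mathbf{Q}=\mathbf{T}\mathbf{H}$ remains (approximately) orthogonal and is not resolved by this step.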

The above reduces the original problem to one where the mixing matrix is constrained to be orthogonal:

$$\bar{\mathbf{u}}(n)=\mathbf{T}\mathbf{H}\mathbf{a}(n)=\mathbf{Q}\mathbf{a}(n),\qquad \mathbf{Q}\mathbf{Q}^T=\mathbf{I}. \qquad (10)$$

It is clear that, to eliminate the remaining uncertainty about $\mathbf{Q}$, more information is required. In fact it turns out that, subject to some hypotheses on the source signals to be described shortly, the mixing matrix can be identified by exploiting the mixture autocorrelation for a nonzero lag as well. Assume that there is a lag $k$ for which no two sources have the same autocorrelation, i.e.,

$$E[a_i(n)a_i(n-k)]\neq E[a_j(n)a_j(n-k)]\ \text{ for all }i\neq j. \qquad (11)$$

Take the lag- autocorrelation matrix of the standardized data:

$$\mathbf{R}_{\bar{u}}(k)=E[\bar{\mathbf{u}}(n)\bar{\mathbf{u}}^T(n-k)]=\mathbf{Q}\mathbf{R}_a(k)\mathbf{Q}^T. \qquad (12)$$

By A2, $\mathbf{R}_a(k)$ is diagonal; hence the above equation represents an EVD of $\mathbf{R}_{\bar{u}}(k)$. In view of (11), the eigenvalues are distinct. This implies that the columns of $\mathbf{Q}$ are uniquely determined from an EVD of $\mathbf{R}_{\bar{u}}(k)$, subject to sign/order changes.

The above method, known as the Algorithm for Multiple Unknown Signals Extraction (AMUSE) [172], is found at the heart of many other BSS algorithms relying on the mixture's SOS. These approaches provide simple and fast BSS tools whose effectiveness is well documented in the relevant literature [172]. They rely heavily, however, on the assumption that no two sources have the same power spectral contents (see (11)). If two sources have different, but similar, power spectra, the separation will be theoretically possible but practically unreliable [117]. Moreover, it is clear that these methods will not apply to scenarios where the sources should be considered white, or even i.i.d. (The latter source model is very common in digital communications [90].)
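The two steps of AMUSE — sphering, then an EVD of a (symmetrized) lag-$k$ autocorrelation — can be sketched as follows. This is a minimal illustration, not the original implementation; the toy AR(1)/white-noise source pair is chosen so that assumption (11) holds:

```python
import numpy as np

def amuse(u, k=1):
    """Sketch of AMUSE: whiten, then eigendecompose a lag-k autocorrelation."""
    N, T = u.shape
    u = u - u.mean(axis=1, keepdims=True)
    # Step 1: whitening (PCA), as in eq. (9)
    d, E = np.linalg.eigh((u @ u.T) / T)
    Tw = np.diag(d ** -0.5) @ E.T
    ub = Tw @ u
    # Step 2: lag-k autocorrelation of the sphered data, eq. (12);
    # symmetrize to guarantee a real EVD
    Rk = (ub[:, k:] @ ub[:, :-k].T) / (T - k)
    Rk = 0.5 * (Rk + Rk.T)
    _, Q = np.linalg.eigh(Rk)
    return Q.T @ Tw          # separating matrix G

# Two sources with distinct spectra (assumption (11)): an AR(1) process
# and white noise, mixed by a fixed matrix.
rng = np.random.default_rng(2)
T = 50000
w = rng.standard_normal((2, T))
a1 = np.zeros(T)
for n in range(1, T):
    a1[n] = 0.9 * a1[n - 1] + w[0, n]
a = np.vstack([a1 / a1.std(), w[1]])
H = np.array([[1.0, 0.6], [0.4, 1.0]])
G = amuse(H @ a)
y = G @ (H @ a)
# Each output should match one source up to sign/order:
corr = np.abs(np.corrcoef(np.vstack([y, a]))[:2, 2:])
```

With enough samples, each row of `corr` exhibits one entry close to 1, reflecting recovery up to the inherent permutation/sign indeterminacy.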

Since it is our purpose to cover methods for the BSS problem in its generality as described in Section 2, no assumption like (11) will be made henceforth. The above discussion then makes clear that (10) represents the limit of what a SOS-based approach can bring to our problem. In fact, one cannot go any further than that in the case of Gaussian sources, since such a signal is completely described by its mean and covariance. It will be shown later on that at most one source can be Gaussian if the BSS problem is to have any solution at all. Observe that in the above algorithm the sources did not have to be independent; only the fact that they are uncorrelated was made use of. This suggests that confining ourselves to second-order information does not exploit all available knowledge. It is of interest to recall that it is for a Gaussian signal that the notions of independence and uncorrelatedness coincide [143]. However, most of the sources encountered in practical applications, e.g., speech, music, data, images, are distinctly non-Gaussian. It follows, therefore, that another transformation is needed, able to extract from the data more structure than PCA can. This will be what we call Independent Component Analysis (ICA).

## 4 Mutual Information and Independent Component Analysis

### 4.1 Statistical Independence

The independence assumption for the source signals can be equivalently stated as the equality of the joint probability density function (pdf) of the source vector $\mathbf{a}$ with the product of the marginal pdf's of its entries:

$$p_a(\mathbf{a})=\tilde{p}_a(\mathbf{a})\triangleq\prod_{i=1}^{N}p_{a_i}(a_i). \qquad (13)$$

Hence an analogous equality would hold for the output of a successful separating system, i.e.,

$$p_y(\mathbf{y})=\tilde{p}_y(\mathbf{y})\triangleq\prod_{i=1}^{N}p_{y_i}(y_i). \qquad (14)$$

The above represents the goal of the ICA transformation, i.e., (linearly) transforming a random vector into one with (as) independent components (as possible). Information theory constitutes a powerful and elegant framework for expressing and studying this problem.

### 4.2 Entropy and Mutual Information

Recall that the entropy $H(x)$ of a random variable $x$ provides a measure of our uncertainty about the value this variable takes. For a continuous-valued random vector $\mathbf{x}$, it takes the form [143, 91]

$$H(\mathbf{x})=-E[\ln p_x(\mathbf{x})]=-\int p_x(\mathbf{x})\ln p_x(\mathbf{x})\,d\mathbf{x}. \qquad (15)$$

(We will be concerned with the Shannon entropy. For IT approaches to BSS that employ other definitions of entropy, we refer the reader to [147].)

In fact, this is rather what we call the differential entropy of $\mathbf{x}$. The entropy of a continuous random vector goes to infinity (since the uncertainty about the value assumed is infinite) and equals the differential entropy only modulo a reference term [91]. However, since it is the optimization of such quantities, and mainly their differences, that we will be interested in, we may rely on the differential entropy and simply call it entropy from now on.

$H(\mathbf{x})$ is a measure of the uncertainty on the value of the vector $\mathbf{x}$. Let $\mathbf{y}$ be another random vector, and let $p_{x|y}$ denote the pdf of $\mathbf{x}$ conditioned on $\mathbf{y}$. Then a measure of the uncertainty remaining about $\mathbf{x}$ after having observed $\mathbf{y}$ can be given by the conditional entropy:

$$H(\mathbf{x}|\mathbf{y})=-E[\ln p_{x|y}(\mathbf{x}|\mathbf{y})]=-\int p_{x,y}(\mathbf{x},\mathbf{y})\ln p_{x|y}(\mathbf{x}|\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \qquad (16)$$

where

$$p_{x,y}(\mathbf{x},\mathbf{y})=p_{x|y}(\mathbf{x}|\mathbf{y})\,p_y(\mathbf{y}) \qquad (17)$$

is the joint pdf of $\mathbf{x}$ and $\mathbf{y}$. It is then reasonable to say that the difference

$$I(\mathbf{x},\mathbf{y})=H(\mathbf{x})-H(\mathbf{x}|\mathbf{y}) \qquad (18)$$

represents the information concerning $\mathbf{x}$ that is acquired when observing $\mathbf{y}$. It is termed the mutual information (MI) between $\mathbf{x}$ and $\mathbf{y}$. As will be seen shortly, $I(\mathbf{x},\mathbf{y})$ is always nonnegative and vanishes if and only if $\mathbf{x}$ and $\mathbf{y}$ are statistically independent. This is to be expected, since for independent $\mathbf{x}$ and $\mathbf{y}$ the observation of the one does not provide any information concerning the other. This is also seen in the definition (18), since then $H(\mathbf{x}|\mathbf{y})=H(\mathbf{x})$. The MI is thus a meaningful measure of statistical dependence and, in fact, it will prove to be a good starting point for a fruitful study of ICA.

### 4.3 Kullback-Leibler Divergence

The condition for independence stated in (14) implies that independence can also be viewed in terms of the distance between $p_y$ and $\tilde{p}_y$. A measure of closeness between two pdf's $p_x$ and $p_y$ is given by the so-called Kullback-Leibler divergence (KLD), defined as [91]

$$D(p_x\|p_y)=E_x\left[\ln\frac{p_x(\mathbf{r})}{p_y(\mathbf{r})}\right]=\int p_x(\mathbf{r})\ln\frac{p_x(\mathbf{r})}{p_y(\mathbf{r})}\,d\mathbf{r}. \qquad (19)$$

An important property of the KLD is that it is always nonnegative and becomes zero if and only if its two arguments coincide. (A proof follows easily from the property $\ln x\leq x-1$ of the natural logarithm.) Hence, although it is not symmetric with respect to its arguments, it is employed as a metric for the space of pdf's; strictly speaking, it is a semi-distance. The KLD satisfies (under some weak conditions [91]) the following relationship, where $p_x$, $p_y$, $p_z$ are pdf's:

$$D(p_x\|p_y)=D(p_x\|p_z)+D(p_z\|p_y). \qquad (20)$$

This can be viewed as the extension of the Pythagorean theorem for right triangles in Euclidean spaces to the space of pdf's endowed with this metric. (The differential-geometric study of such manifolds has resulted in an exciting new discipline, called information geometry [134].) It will be seen below to provide a means for an insightful analysis of BSS criteria.
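The basic properties of the KLD — nonnegativity, vanishing only for identical arguments, asymmetry — are easy to verify numerically. Below is a small sketch for discrete probability mass functions, a discretized counterpart of definition (19); the function name `kld` and the example distributions are illustrative:

```python
import numpy as np

def kld(p, q):
    """D(p||q) between two discrete pmf's, discretizing eq. (19)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # 0*ln(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

d_pq = kld(p, q)   # > 0: the two pmf's differ
d_qp = kld(q, p)   # generally different from d_pq: KLD is not symmetric
d_pp = kld(p, p)   # exactly 0: identical arguments
```

The asymmetry `d_pq != d_qp` is why the KLD is only a semi-distance, as noted above.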

Another property of the KLD that is going to be used in the sequel is that it is invariant under any invertible (linear or nonlinear) transformation, i.e., for any invertible $\mathbf{g}(\cdot)$,

$$D(p_{g(x)}\|p_{g(y)})=D(p_x\|p_y). \qquad (21)$$

The MI between $\mathbf{x}$ and $\mathbf{y}$ can be formulated as a KLD. Indeed, using eqs. (15), (16), and (17) in (18) results in

$$I(\mathbf{x},\mathbf{y})=\int p_{x,y}(\mathbf{x},\mathbf{y})\ln\frac{p_{x,y}(\mathbf{x},\mathbf{y})}{p_x(\mathbf{x})p_y(\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y}=D(p_{x,y}\|p_x p_y), \qquad (22)$$

verifying the appropriateness of the MI as a measure of statistical dependence.

The KLD formulation above makes it easier to also define the mutual information between the entries of an $N$-vector, $\mathbf{y}$. It will simply equal the KLD between $p_y$ and $\tilde{p}_y$ in (14):

$$I(\mathbf{y})=D(p_y\|\tilde{p}_y)=\int\!\int\cdots\int p_y(y_1,y_2,\dots,y_N)\ln\frac{p_y(y_1,y_2,\dots,y_N)}{\prod_{i=1}^{N}p_{y_i}(y_i)}\,dy_1\,dy_2\cdots dy_N \qquad (23)$$

and vanishes if and only if the components of $\mathbf{y}$ are mutually independent. From the above equation, an expression for $I(\mathbf{y})$ in terms of the entropy of $\mathbf{y}$ and the (marginal) entropies of its entries readily results:

$$I(\mathbf{y})=-H(\mathbf{y})+\sum_{i=1}^{N}H(y_i). \qquad (24)$$

Thus, minimizing MI amounts to making the entropy of $\mathbf{y}$ as close as possible to the sum of its marginal entropies.
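For a zero-mean bivariate Gaussian with unit variances and correlation $\rho$, both terms of (24) are available in closed form, and the MI reduces to $I=-\tfrac{1}{2}\ln(1-\rho^2)$. The following sketch evaluates (24) this way, as a quick sanity check of the MI as a dependence measure:

```python
import numpy as np

def gaussian_mi(rho):
    """MI (24) of a bivariate Gaussian with unit variances and correlation
    rho: I = H(y1) + H(y2) - H(y), using the closed-form Gaussian entropies
    H(y_i) = 0.5*ln(2*pi*e) and H(y) = 0.5*ln((2*pi*e)^2 * det C)."""
    h_marg = 0.5 * np.log(2 * np.pi * np.e)
    det_c = 1.0 - rho ** 2                      # det of the covariance
    h_joint = np.log(2 * np.pi * np.e) + 0.5 * np.log(det_c)
    return 2 * h_marg - h_joint                 # = -0.5*ln(1 - rho^2)
```

The MI vanishes for $\rho=0$ (for Gaussians, uncorrelated means independent) and grows without bound as $|\rho|\to 1$, consistent with its role as a dependence measure.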

Negentropy, the distance of a given pdf from the Gaussian pdf with the same mean and covariance, can also be defined with the aid of the KLD as

$$J_G(\mathbf{y})\triangleq D(p_y\|p_{y_G}), \qquad (25)$$

where $\mathbf{y}_G$ denotes a Gaussian random vector with the same mean and covariance as $\mathbf{y}$.

### 4.4 Negative Mutual Information as an ICA Contrast

It will be useful to write the MI of the vector $\mathbf{y}$ in terms of its negentropy. This is easily done by appealing to the Pythagorean theorem for KLD's [35]. Just write (20) for the two triangles shown in Fig. 2.

This yields:

$$D\left(p_y\Big\|\prod_{i=1}^{N}p_{y_{G_i}}\right)=D(p_y\|p_{y_G})+D\left(p_{y_G}\Big\|\prod_{i=1}^{N}p_{y_{G_i}}\right)=J_G(\mathbf{y})+I_G(\mathbf{y})$$

and

$$D\left(p_y\Big\|\prod_{i=1}^{N}p_{y_{G_i}}\right)=D\left(p_y\Big\|\prod_{i=1}^{N}p_{y_i}\right)+D\left(\prod_{i=1}^{N}p_{y_i}\Big\|\prod_{i=1}^{N}p_{y_{G_i}}\right)=I(\mathbf{y})+\sum_{i=1}^{N}J_G(y_i),$$

with $I_G(\mathbf{y})$ denoting the MI of the Gaussian version $\mathbf{y}_G$ of $\mathbf{y}$. It then follows that (an alternative proof of this result can be found in [83])

$$I(\mathbf{y})=I_G(\mathbf{y})+\left(J_G(\mathbf{y})-\sum_{i=1}^{N}J_G(y_i)\right). \qquad (26)$$

The above shows $I(\mathbf{y})$ to consist of two terms responsible for the redundancy within $\mathbf{y}$. The first term accounts for the second-order information in the process, whereas non-Gaussian, higher-order information is measured by the second term. The prewhitening transformation discussed in Section 3 aims at nulling $I_G(\mathbf{y})$. Methods based on higher-order statistics can then be employed to minimize the second term, subject to the constraint that the separating matrix be orthogonal. It should be emphasized that this two-step minimization of the MI is not necessarily optimal [35]. Ideally, the function in (26) should be optimized as a whole. Such approaches have only recently been reported [37, 124, 166]. It is, however, common practice to first sphere the data before going on to extract their higher-order structure. An advantage of such an approach is that the corresponding normalization of the data may help avoid numerical inaccuracies that could result from the subsequent nonlinear operations [117]. Moreover, the fact that the separating matrix is then orthogonal may sometimes facilitate the derivation of separation methods [48, 138]. Bounds on the errors in the separation due to imperfect standardization have been derived and reported in [26]. In general, the standardization step poses no problem for the separation task (provided of course one uses a sufficiently large number of samples to estimate the autocorrelation matrix) and two-step methods have proved their effectiveness in practice [117].

It was seen above that the MI is a meaningful measure of statistical dependence. In fact, its minimization constitutes a valid criterion for BSS, since it can be shown that $-I(\mathbf{y})$ is a contrast over the set of random $N$-vectors [48]. A real-valued function, $\Psi(\cdot)$, defined on the space of $N$-variate pdf's, is said to be a contrast if it satisfies the following requirements:

• C1: $\Psi$ is invariant to permutations:

$$\Psi(p_{\mathbf{P}y})=\Psi(p_y)\ \text{ for any permutation matrix }\mathbf{P}.$$

• C2: $\Psi$ is invariant to scale changes:

$$\Psi(p_{\mathbf{D}y})=\Psi(p_y)\ \text{ for any diagonal matrix }\mathbf{D}.$$

• C3: If $\mathbf{y}$ has independent components, then

$$\Psi(p_{\mathbf{A}y})\leq\Psi(p_y)\ \text{ for any invertible matrix }\mathbf{A}.$$

Requirement C3 shows that the ICA might be obtained by maximizing a contrast of the separator's output $\mathbf{y}$. Requirements C1 and C2 correspond to the permutation/scaling indeterminacy inherent in the BSS problem, as discussed in Section 2. However, there is one more requirement that a contrast should satisfy so that it can be a valid BSS cost functional:

• C4: The equality in C3 holds if and only if $\mathbf{A}$ is a generalized permutation matrix, i.e., $\mathbf{A}=\mathbf{P}\mathbf{D}$, where $\mathbf{P}$ is a permutation matrix and $\mathbf{D}$ is an invertible diagonal matrix.

A contrast satisfying C4 as well is said to be discriminating. It can be shown that the negative MI, $-I(\mathbf{y})$, is a discriminating contrast over all random $N$-vectors with at most one Gaussian component [48]. (The Gaussian MI, $-I_G(\mathbf{y})$, has also been shown to be a discriminating contrast over random vector processes whose components have different spectral densities [146]; the latter constraint is typical of SOS-based BSS approaches, as seen in Section 3.) Henceforth, the term contrast will be used to refer to a discriminating contrast.

Before concluding this section, let us consider simplifying the expression for the negative MI contrast. Taking into account the fact that the separating matrix $\mathbf{G}$ is invertible, one can express the joint pdf of $\mathbf{y}$ in (6) in terms of that of $\mathbf{u}$ using the well-known formula [143]:

$$p_y(\mathbf{y})=\frac{p_u(\mathbf{G}^{-1}\mathbf{y})}{|\det(\mathbf{G})|}. \qquad (27)$$

This in turn allows the entropy of $\mathbf{y}$ to be expressed in terms of that of $\mathbf{u}$:

$$H(\mathbf{y})=H(\mathbf{u})+\ln|\det(\mathbf{G})|. \qquad (28)$$

Thus (24) becomes

$$I(\mathbf{y})=-H(\mathbf{u})-\ln|\det(\mathbf{G})|-E_y\left[\ln\prod_{i=1}^{N}p_{y_i}(y_i)\right].$$

Since $H(\mathbf{u})$ does not depend on $\mathbf{G}$, the corresponding contrast may be written as

$$\psi_{ICA}(\mathbf{G})=\ln|\det(\mathbf{G})|+E_y\left[\ln\prod_{i=1}^{N}p_{y_i}(y_i)\right], \qquad (29)$$

where, to emphasize its role in BSS, the dependence on $\mathbf{G}$ is made explicit. The role of the first term above is to ensure a valid separation solution, with $\mathbf{G}$ invertible. Note that with prewhitened mixture data, $\mathbf{G}$ is orthogonal and the first term becomes zero. The maximization of $\psi_{ICA}$ then amounts to minimizing the sum of the entropies of the components of $\mathbf{y}$. Since, for a fixed variance, maximum entropy corresponds to Gaussianity [143, 91], this is equivalent to making the $y_i$'s as non-Gaussian as possible, which necessarily involves statistics of order higher than two. The central limit theorem [143] provides another interpretation of this, since a linear mixture of independent signals tends to be close to Gaussian. This minimum-entropy, or minimum-Gaussianity, point of view has been central in pioneering works on blind deconvolution [14, 71] and will be further elaborated upon later on.

## 5 A “Universal” Criterion for BSS

It will be of interest to examine the possible connections between the minimum MI criterion and other proposed criteria that are inspired from information theory and statistics. It will be seen that subject to some constraints on the choice of the nonlinearities involved, all these criteria are equivalent and can thus be viewed as special cases of a single, “universal” criterion.

### 5.1 Infomax and MaxEnt

We have seen above that sources are completely separated when the separator's outputs are independent or, equivalently, their mutual information is zero (provided, of course, that at most one of the sources is Gaussian). An alternative way of looking at the separation goal is to formulate it as a maximal information transfer from the mixture, $\mathbf{u}$, to the source estimates, $\mathbf{y}$ [9]. This criterion, similar to the principle of information rate and capacity in the Shannon theory of communication, though less evidently pertinent to the BSS task, will be seen to be closely related to that of minimum MI between the entries of $\mathbf{y}$.

Consider the MI between $\mathbf{y}$ and $\mathbf{u}$, namely:

$$I(\mathbf{y},\mathbf{u})=H(\mathbf{y})-H(\mathbf{y}|\mathbf{u}).$$

Since the transfer from $\mathbf{u}$ to $\mathbf{y}$ is deterministic, knowledge of the former completely determines the latter; hence the second term above is null. Therefore, in this context:

 I(\boldmathy,\boldmathu)=H(\boldmathy). (30)

Since for an unrestricted $\boldsymbol{y}$ its entropy has no bound, we assume an entrywise nonlinearity at the output of the separating system (see Fig. 1), i.e.,

$$ \boldsymbol{y}(n) = \boldsymbol{g}(\boldsymbol{G}\boldsymbol{u}(n)) \tag{31} $$

where $\boldsymbol{g}(\boldsymbol{x}) = [g_1(x_1), g_2(x_2), \ldots, g_N(x_N)]^{\mathrm{T}}$ and the functions $g_i$ are monotonically increasing, with $g_i(-\infty) = 0$ and $g_i(+\infty) = 1$. These are the common assumptions for the nonlinearities in artificial neural networks (ANN) [91]. Then the entropy of $\boldsymbol{y}$ is maximized when its pdf equals $U_N$, the uniform pdf in $[0,1]^N$. Note that in that case, the components of $\boldsymbol{y}$ are statistically independent and uniformly distributed in $[0,1]$. If $g_i$ is chosen to be equal to the cumulative distribution function (cdf) of the source $a_i$, then $y_i$ can only be uniform in $[0,1]$ when $x_i$ is equal to $a_i$ or to another source having the same pdf. Therefore the maximization of the entropy of $\boldsymbol{y}$ yields a meaningful criterion for BSS, called MaxEnt [10]. In view of (30), this criterion is equivalent to the maximization of the MI $I(\boldsymbol{y},\boldsymbol{u})$, referred to as Infomax [10, 9]. The above source–output pdf-matching point of view was also considered in early studies of blind equalization [14] and will be seen to underlie the maximum likelihood approach as well.
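The pdf-matching argument can be verified in one dimension: passing a source through its own cdf produces a variable that is uniform on $[0,1]$ and hence of maximal entropy on that interval. A minimal sketch (the logistic source, chosen because its cdf is the elementary sigmoid, and the sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000

# Logistic source: its cdf is the logistic sigmoid, so the "matched"
# output nonlinearity g_i is available in closed form.
a = rng.logistic(loc=0.0, scale=1.0, size=T)
y = 1.0 / (1.0 + np.exp(-a))       # y = g(a), with g the cdf of the source

# y should be (close to) uniform on [0,1]: a flat histogram.
hist, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
print(hist / T)   # each bin holds roughly 10% of the samples
```

Feeding the same sigmoid with a signal whose law differs from the logistic one (e.g., a mixture of sources) would produce a visibly non-flat histogram, which is the lever the MaxEnt criterion exploits.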

The MaxEnt and Infomax criteria can be seen to be equivalent, under some conditions, to the minimum MI criterion discussed in Section 4. This equivalence was argued in [10] by considering the functions $g_i$ as determined by the source pdf's, as explained above. However, a more rigorous study was reported in [185] and [137]. Here a brief explanation of this relation will be provided, stating the more subtle results without proof. Denote by $\boldsymbol{x}$ the output of the linear part $\boldsymbol{G}$ of the separating system (see Fig. 1), i.e.,

$$ \boldsymbol{x}(n) = \boldsymbol{G}\boldsymbol{u}(n). \tag{32} $$

Then, since each $g_i$ is invertible, it holds that [143]

$$ p_{\boldsymbol{y}}(\boldsymbol{y}) = \frac{p_{\boldsymbol{x}}(\boldsymbol{x})}{\bigl|\prod_{i=1}^{N} g_i'(x_i)\bigr|}, $$

with $g_i'$ denoting the first derivative of $g_i$. The entropy of $\boldsymbol{y}$ can thus be written as (recall that $g_i'$ is nonnegative):

$$ H(\boldsymbol{y}) = H(\boldsymbol{u}) + \ln|\det(\boldsymbol{G})| + E_x\Bigl[\ln\prod_{i=1}^{N} g_i'(x_i)\Bigr]. \tag{33} $$

Hence the Infomax (IM), or equivalently MaxEnt, criterion reads:

$$ \max_{\boldsymbol{G}}\;\Bigl(\psi_{\mathrm{IM}}(\boldsymbol{G}) \triangleq \ln|\det(\boldsymbol{G})| + E_x\Bigl[\ln\prod_{i=1}^{N} g_i'(x_i)\Bigr]\Bigr). \tag{34} $$

Comparing eqs. (29) and (34), one can see that these two criteria are equivalent if the $g_i$'s coincide with the source cdf's (note that, in both cases, the expectation in the last term is with respect to the output of the linear part of the demixing system). Eq. (26) expresses the MI between the components of $\boldsymbol{y}$ in terms of its joint and marginal negentropies. It has been demonstrated that, under the same conditions, Infomax is also equivalent to negentropy maximization [121]. In fact, since it is the stability of the desired stationary point that counts, a mismatch between the $g_i$'s and the source probability laws can be tolerated. This is stated in [137] as a requirement for a sufficiently rich parameterization of $\boldsymbol{g}$. The problem of correctly choosing the nonlinearities will be discussed in more detail later on, when dealing with the corresponding algorithms.
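The Infomax cost of (34) can be estimated directly from samples by replacing the expectation with a sample mean. The sketch below (logistic sources, so that the sigmoid matches the source cdf's, and the particular mixing matrix are illustrative assumptions) checks that the cost is larger at the separating solution than at a non-separating one:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 2, 50_000

# Logistic sources, so that g_i = logistic sigmoid matches the source cdf.
a = rng.logistic(size=(N, T))
H = np.array([[1.0, 0.8],
              [0.6, 1.0]])        # mixing matrix (assumed invertible)
u = H @ a                         # observed mixtures

def psi_IM(G, u):
    """Empirical Infomax cost: ln|det G| + E[sum_i ln g_i'(x_i)], g = sigmoid.
    Uses ln g'(x) = ln sigma(x) + ln sigma(-x), computed stably via logaddexp."""
    x = G @ u
    log_gprime = -np.logaddexp(0.0, x) - np.logaddexp(0.0, -x)
    return np.log(abs(np.linalg.det(G))) + log_gprime.sum(axis=0).mean()

psi_sep = psi_IM(np.linalg.inv(H), u)   # the perfect separator
psi_bad = psi_IM(np.eye(N), u)          # leaving the mixtures as they are
print(psi_sep, psi_bad)
```

Since $\psi_{\mathrm{IM}}$ equals $-D(p_{\boldsymbol{y}}\|U_N)$ up to a constant, the gap between the two values is an empirical estimate of how far the non-separated outputs are from independence and pdf match.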

### 5.2 Maximum Likelihood and Bayesian Approaches

Let us continue with the demixing model (31), where the $y_i$'s always take values in $[0,1]$. Then the entropy of $\boldsymbol{y}$ can be written as the negative of the distance of its pdf from the uniform one:

$$ H(\boldsymbol{y}) = -\int p_{\boldsymbol{y}}(\boldsymbol{y}) \ln\Bigl[\frac{p_{\boldsymbol{y}}(\boldsymbol{y})}{\prod_{i=1}^{N} U(y_i)}\Bigr] d\boldsymbol{y} = -D(p_{\boldsymbol{y}} \,\|\, U_N), \tag{35} $$

where $U$ is the uniform pdf in $[0,1]$ and $U_N(\boldsymbol{y}) = \prod_{i=1}^{N} U(y_i)$. In order to formulate a Maximum Likelihood (ML) approach to the corresponding BSS problem, we need a model for the generation of the observed data, including a hypothesis for the source statistics. According to (7) this will be written as

$$ \boldsymbol{u} = \boldsymbol{\Theta}\tilde{\boldsymbol{a}} $$

where ideally $\boldsymbol{\Theta} = \boldsymbol{H}$ and $\tilde{\boldsymbol{a}} = \boldsymbol{a}$. Parameterize the pdf of $\boldsymbol{u}$ as $p_u(\boldsymbol{u}\mid\boldsymbol{\Theta})$. Although this pdf depends on both the system parameter $\boldsymbol{\Theta}$ and our hypothesized probability model for the sources, $p_{\tilde{a}}$, the latter is omitted from this notation since it is with respect to $\boldsymbol{\Theta}$ that the criterion will be optimized. The idea in the ML principle is to find, among a set of choices for $\boldsymbol{\Theta}$, the one that maximizes the pdf of the data conditioned on the model parameter.

Take $T$ independent realizations of $\boldsymbol{u}$, say $\boldsymbol{u}(1), \boldsymbol{u}(2), \ldots, \boldsymbol{u}(T)$. The likelihood that these samples are drawn from the pdf $p_u(\cdot\mid\boldsymbol{\Theta})$ is given by $\prod_{t=1}^{T} p_u(\boldsymbol{u}(t)\mid\boldsymbol{\Theta})$. The normalized log-likelihood is then

$$ L_T(\boldsymbol{\Theta}) = \frac{1}{T}\ln\prod_{t=1}^{T} p_u(\boldsymbol{u}(t)\mid\boldsymbol{\Theta}) = \frac{1}{T}\sum_{t=1}^{T} \ln p_u(\boldsymbol{u}(t)\mid\boldsymbol{\Theta}). $$

As in (27), we have

$$ p_u(\boldsymbol{u}(t)\mid\boldsymbol{\Theta}) = \frac{p_{\tilde{a}}(\boldsymbol{\Theta}^{-1}\boldsymbol{u}(t))}{|\det(\boldsymbol{\Theta})|} = p_{\tilde{a}}(\boldsymbol{x}(t))\,|\det(\boldsymbol{G})| $$

where we recall that $\boldsymbol{x} = \boldsymbol{G}\boldsymbol{u}$ and we set $\boldsymbol{G} = \boldsymbol{\Theta}^{-1}$. Hence,

$$ L_T(\boldsymbol{\Theta}) = \frac{1}{T}\sum_{t=1}^{T} \ln p_{\tilde{a}}(\boldsymbol{x}(t)) + \ln|\det(\boldsymbol{G})| $$

and from the law of large numbers [143]:

$$ L_T(\boldsymbol{\Theta}) \;\xrightarrow{T\to\infty}\; L(\boldsymbol{G}) \triangleq \int p_{\boldsymbol{x}}(\boldsymbol{x}\mid\boldsymbol{G}) \ln p_{\tilde{a}}(\boldsymbol{x})\, d\boldsymbol{x} + \ln|\det(\boldsymbol{G})|. $$

Using the KLD formulation, the latter is written as

$$ L(\boldsymbol{G}) = -D\bigl(p_{\boldsymbol{x}}(\cdot\mid\boldsymbol{G}) \,\|\, p_{\tilde{a}}\bigr) - H(\boldsymbol{x}\mid\boldsymbol{G}) + \ln|\det(\boldsymbol{G})| = -D\bigl(p_{\boldsymbol{x}}(\cdot\mid\boldsymbol{G}) \,\|\, p_{\tilde{a}}\bigr) - H(\boldsymbol{u}), $$

where eq. (28) was used. The ML criterion is then expressed in the form:

$$ \max_{\boldsymbol{G}}\;\Bigl(\psi_{\mathrm{ML}}(\boldsymbol{G}) \triangleq -D\bigl(p_{\boldsymbol{x}}(\cdot\mid\boldsymbol{G}) \,\|\, p_{\tilde{a}}\bigr)\Bigr). \tag{36} $$

That is, it aims at matching the pdf's of the source estimates with the hypothesized source pdf's.

For the sake of clarity of presentation, we shall abuse the notation and write $D(\boldsymbol{x}\,\|\,\tilde{\boldsymbol{a}})$ for the KLD between the pdf's of the random vectors $\boldsymbol{x}$ and $\tilde{\boldsymbol{a}}$. The ML cost functional is then written in the form:

$$ \psi_{\mathrm{ML}}(\boldsymbol{G}) = -D(\boldsymbol{G}\boldsymbol{H}\boldsymbol{a} \,\|\, \tilde{\boldsymbol{a}}). \tag{37} $$

Now assume that the $g_i$'s are the cdf's corresponding to the hypothesized probability law for the sources, $p_{\tilde{a}}$. In that case, the vector $\boldsymbol{z} = \boldsymbol{g}(\tilde{\boldsymbol{a}})$ is uniformly distributed in $[0,1]^N$ and, in view of (35), $\psi_{\mathrm{IM}}$ can be written as:

$$ \psi_{\mathrm{IM}}(\boldsymbol{G}) = H(\boldsymbol{y}) = -D(\boldsymbol{y} \,\|\, \boldsymbol{z}) = -D\bigl(\boldsymbol{g}(\boldsymbol{G}\boldsymbol{H}\boldsymbol{a}) \,\|\, \boldsymbol{g}(\tilde{\boldsymbol{a}})\bigr) = -D(\boldsymbol{G}\boldsymbol{H}\boldsymbol{a} \,\|\, \tilde{\boldsymbol{a}}) \tag{38} $$

where the invariance of the KLD under an invertible transformation was exploited. Comparing eqs. (37) and (38), the equivalence of ML with the IM criterion is deduced [29]. Clearly, if the source pdf's are exactly known, i.e., $p_{\tilde{a}} = p_a$, both cost functions above are maximized (become zero) when $\boldsymbol{G}\boldsymbol{H}\boldsymbol{a} = \boldsymbol{a}$, or equivalently $\boldsymbol{G} = \boldsymbol{H}^{-1}$, that is, for the perfect separation solution. Nevertheless, as mentioned earlier, there is some tolerance in the mismatch of the source probability model, quantified by the stability conditions stated in Section 6.
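The ML side of this equivalence can be illustrated numerically by comparing the finite-sample log-likelihood at the perfect separator with that at a non-separating matrix. In the sketch below, the Laplacian sources, the mixing matrix, and the choice of the hypothesized model equal to the true source density are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 2, 50_000

a = rng.laplace(size=(N, T))     # true sources, iid Laplace(0, 1)
H = np.array([[1.0, 0.5],
              [0.7, 1.0]])       # mixing matrix (assumed invertible)
u = H @ a

def log_likelihood(G, u):
    """Normalized log-likelihood (1/T) sum_t ln p(G u(t)) + ln|det G|,
    with hypothesized source model p(x_i) = (1/2) exp(-|x_i|)."""
    x = G @ u
    ln_p = (-np.abs(x) - np.log(2.0)).sum(axis=0)   # ln prod_i p(x_i)
    return ln_p.mean() + np.log(abs(np.linalg.det(G)))

L_sep = log_likelihood(np.linalg.inv(H), u)  # the perfect separator
L_bad = log_likelihood(np.eye(N), u)         # leaving the mixtures as they are
print(L_sep, L_bad)
```

Since the model matches the true source density, the likelihood gap is (up to sample fluctuations) the KLD of eq. (36) evaluated at the non-separating matrix.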

An insightful interpretation of the ML criterion can be obtained by applying in (37) the Pythagorean theorem for KLD's (eq. (20)). If the components of $\tilde{\boldsymbol{a}}$ are independent and $\tilde{\boldsymbol{y}}$ denotes the vector with independent components distributed according to the marginal distributions of $\boldsymbol{y} = \boldsymbol{G}\boldsymbol{H}\boldsymbol{a}$, then

$$ D(\boldsymbol{y} \,\|\, \tilde{\boldsymbol{a}}) = D(\boldsymbol{y} \,\|\, \tilde{\boldsymbol{y}}) + D(\tilde{\boldsymbol{y}} \,\|\, \tilde{\boldsymbol{a}}). \tag{39} $$

Eq. (37) then yields:

$$ -\psi_{\mathrm{ML}}(\boldsymbol{G}) = D(\boldsymbol{y} \,\|\, \tilde{\boldsymbol{y}}) + D(\tilde{\boldsymbol{y}} \,\|\, \tilde{\boldsymbol{a}}) = -\psi_{\mathrm{ICA}}(\boldsymbol{G}) + \sum_{i=1}^{N} D(\tilde{y}_i \,\|\, \tilde{a}_i) \tag{40} $$

which leads to the following interpretation of ML [30]:

$$ (\text{Total mismatch}) = (\text{Deviation from independence}) + (\text{Marginal mismatch}) \tag{41} $$

In fact, the above relationship shows that the ICA criterion results from optimizing the KLD of (37) with respect to both $\boldsymbol{G}$ and the source probability model. In other words,

$$ \psi_{\mathrm{ICA}}(\boldsymbol{G}) = \max_{\tilde{\boldsymbol{a}}} \psi_{\mathrm{ML}}(\boldsymbol{G}). $$

This shows the MI to be the quantitative measure of dependence associated with the ML principle [30].

A criterion equivalent to the above also results via the Bayesian approach [111, 129]. By Bayes' theorem [143] one may write the a-posteriori pdf of the model in terms of the likelihood, namely:

$$ p(\boldsymbol{\Theta}, \boldsymbol{a} \mid \boldsymbol{u}, I) = \frac{p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I)\, p(\boldsymbol{\Theta}, \boldsymbol{a} \mid I)}{p(\boldsymbol{u} \mid I)} $$

where $I$ is used to denote any possible additional prior information on the BSS setup. One can simplify the above expression to

$$ p(\boldsymbol{\Theta}, \boldsymbol{a} \mid \boldsymbol{u}, I) \propto p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I)\, p(\boldsymbol{\Theta}, \boldsymbol{a} \mid I) $$

where the prior probability of the data was absorbed into the proportionality constant, since it does not depend explicitly on the model. Since the mixing matrix is in general independent of the sources, we write

$$ p(\boldsymbol{\Theta}, \boldsymbol{a} \mid I) = p(\boldsymbol{\Theta} \mid I)\, p(\boldsymbol{a} \mid I). $$

Finally, since it is the determination of the demixing matrix that is of interest, we consider the following a-posteriori model:

$$ p(\boldsymbol{\Theta} \mid \boldsymbol{u}, I) \propto p(\boldsymbol{\Theta} \mid I) \int p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I)\, p(\boldsymbol{a} \mid I)\, d\boldsymbol{a} $$

or its logarithm:

$$ \ln p(\boldsymbol{\Theta} \mid \boldsymbol{u}, I) = \ln p(\boldsymbol{\Theta} \mid I) + \ln \int p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I)\, p(\boldsymbol{a} \mid I)\, d\boldsymbol{a} \tag{42} $$

where a constant was omitted as it has no effect on the subsequent optimization.

Consider the Maximum A-Posteriori probability (MAP) criterion:

$$ \max_{\boldsymbol{\Theta}}\;\Bigl(\psi_{\mathrm{MAP}}(\boldsymbol{\Theta}) \triangleq E\bigl[\ln p(\boldsymbol{\Theta} \mid \boldsymbol{u}, I)\bigr]\Bigr) \tag{43} $$

The essence of the Bayesian approach is to make an inference on the unknown model parameters on the basis of all available information [111]. The information that $\boldsymbol{u} = \boldsymbol{\Theta}\boldsymbol{a}$ is exploited by setting the likelihood equal to:

$$ p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I) = \prod_{i=1}^{N} \delta\Bigl(u_i - \sum_{j=1}^{N} \Theta_{ij} a_j\Bigr) $$

with $\delta(\cdot)$ denoting the Dirac function. The knowledge concerning the independence of the sources is made use of by setting

$$ p(\boldsymbol{a} \mid I) = \prod_{i=1}^{N} p_{a_i}(a_i \mid I). $$

Since no specific knowledge is available for the mixing matrix $\boldsymbol{\Theta}$ apart from its dimensions and its nonsingularity, $p(\boldsymbol{\Theta} \mid I)$ is set to a constant (of finite extent, of course) and is thus omitted from the MAP criterion. In summary,

$$ \int p(\boldsymbol{u} \mid \boldsymbol{\Theta}, \boldsymbol{a}, I)\, p(\boldsymbol{a} \mid I)\, d\boldsymbol{a} = \int\cdots\int \prod_{i=1}^{N} \delta\Bigl(u_i - \sum_{j=1}^{N} \Theta_{ij} a_j\Bigr) \prod_{i=1}^{N} p_{a_i}(a_i \mid I)\, da_1\, da_2 \cdots da_N = |\det(\boldsymbol{G})| \prod_{i=1}^{N} p_{a_i}\Bigl(\sum_{j=1}^{N} G_{ij} u_j\Bigr) $$

with $\boldsymbol{G} = \boldsymbol{\Theta}^{-1}$. The MAP cost functional then takes the form:

$$ \psi_{\mathrm{MAP}}(\boldsymbol{G}) = \ln|\det(\boldsymbol{G})| + E\Bigl[\ln \prod_{i=1}^{N} p_{a_i}(x_i)\Bigr]. \tag{44} $$

A comparison of the latter expression with that in (34) demonstrates the equivalence of the Bayesian approach with Infomax, provided that the nonlinearities match the source cdf's.

### 5.3 A Universal Criterion

In view of the above results, one can define a general BSS criterion in terms of the maximization, with respect to $\boldsymbol{G}$, of the expectation of the functional

$$ J(\boldsymbol{G}) = \ln|\det(\boldsymbol{G})| + \ln \prod_{i=1}^{N} \phi_i(y_i) \tag{45} $$

where $\boldsymbol{y} = \boldsymbol{G}\boldsymbol{u}$. The functions $\phi_i$ are ideally the pdf's of the source signals. Since these are in general unknown, some approximations have to be made. A short discussion of such approaches, mainly via pdf expansions, is given here. More will be said in the next section, where the role of the nonlinearities in the stability of the separator will be reviewed.

#### 5.3.1 Approximations

The criteria discussed above require knowledge of pdf's connected to the separator's outputs; the marginal pdf's of $\boldsymbol{y}$ are needed, for example, in (29). An approach towards overcoming the lack of this knowledge, which has proved particularly effective, is to employ truncated expansions of the unknown pdf's and thus reduce the unknowns to a limited set of cumulants, computable from the system outputs. To give an idea of this approach, let us recall the Edgeworth expansion of a pdf $p(y)$ of zero mean and unit variance around the standard normal pdf $p_G(y)$ [117]:

$$ p(y) = p_G(y)\Bigl[1 + \frac{1}{3!} c_3(y) h_3(y) + \frac{1}{4!} c_4(y) h_4(y) + \frac{10}{6!} c_3^2(y) h_6(y) + \frac{1}{5!} c_5(y) h_5(y) + \frac{35}{7!} c_3(y) c_4(y) h_7(y) + \cdots \Bigr] \tag{46} $$

where $c_k(y)$ is the $k$th-order cumulant of $y$ and $h_k(y)$ is the Hermite polynomial of degree $k$, defined by [91]:

$$ \frac{d^k p_G(y)}{dy^k} = (-1)^k p_G(y) h_k(y) $$

or equivalently by the recursion

$$ h_0(y) = 1, \qquad h_1(y) = y, \qquad h_{k+1}(y) = y\, h_k(y) - h_k'(y). $$
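The recursion above is easy to implement with polynomial arithmetic; the following sketch (using numpy's `poly1d`, an implementation choice) reproduces the familiar low-order probabilists' Hermite polynomials $h_2 = y^2 - 1$, $h_3 = y^3 - 3y$, $h_4 = y^4 - 6y^2 + 3$:

```python
import numpy as np

def hermite(k):
    """Probabilists' Hermite polynomial h_k via h_{k+1}(y) = y h_k(y) - h_k'(y).
    Returned as a numpy poly1d (coefficients in decreasing degree order)."""
    h = np.poly1d([1.0])                                 # h_0(y) = 1
    for _ in range(k):
        h = np.poly1d([1.0, 0.0]) * h - np.polyder(h)    # y*h_k(y) - h_k'(y)
    return h

# First few polynomials, e.g. h_3(y) = y^3 - 3y:
print(hermite(3))
```

These are exactly the polynomials multiplying the cumulants in the Edgeworth expansion (46).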

Keeping only the first few terms in the sum (46), one comes up with a polynomial approximation of the pdf's appearing in (29). Using a Taylor approximation for the logarithm finally leads to an approximate expression of the cost functional in terms of some cumulants of the $y_i$'s. In the case that prewhitening has been performed on the data, it can be shown that, when the data are symmetrically distributed (so that cumulants of odd order are null), a good approximation is provided by [48]:

$$ \psi_{\mathrm{ICA}}(\boldsymbol{G}) \approx \psi_4(\boldsymbol{G}) \triangleq \sum_{i=1}^{N} c_4^2(y_i) \tag{47} $$

where it should be kept in mind that $\boldsymbol{G}$ is constrained to be orthogonal. As shown in [48], $\psi_4$ is a (discriminating) contrast over the random $N$-vectors with at most one component of null 4th-order cumulant. Note that this condition is stronger than that required for $\psi_{\mathrm{ICA}}$, since a signal has to have all of its higher-order cumulants equal to zero to be Gaussian. The same cost function has been arrived at via a Gram–Charlier truncated expansion for the ML criterion [117]. It is interesting to note that, due to the orthogonality hypothesis, maximizing $\psi_4$ is equivalent to forcing the cross-cumulants of $\boldsymbol{y}$ to zero, thus rendering its components mutually independent. (It has been recently shown, however, that this separation condition may be greatly simplified to the requirement of zeroing a much smaller set of cumulants [135]; we shall have more to say about that in the context of algebraic BSS approaches.)
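The contrast (47) can be checked numerically: for white, independent sources, the sum of squared 4th-order auto-cumulants decreases under a non-trivial orthogonal mixing. A minimal sketch (uniform sources and the 45-degree rotation are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 200_000

# Two unit-variance uniform sources (independent, zero mean: already white).
a = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))

def c4(y):
    """4th-order auto-cumulant of a zero-mean signal: E[y^4] - 3 E[y^2]^2."""
    return np.mean(y**4) - 3.0 * np.mean(y**2)**2

def psi4(y):
    """The contrast of eq. (47): sum of squared 4th-order auto-cumulants."""
    return sum(c4(yi)**2 for yi in y)

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal "mixing" rotation

psi_sources = psi4(a)       # at the separating solution
psi_rotated = psi4(Q @ a)   # after a 45-degree rotation
print(psi_sources, psi_rotated)
```

For these sources each $c_4(y_i)$ equals $-1.2$, so $\psi_4 \approx 2.88$ at separation, while the rotation halves each cumulant and shrinks the contrast, which is what a maximizer of (47) exploits.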

Polynomial approximations of the above type may not be sufficiently accurate in practice. This is due to the fact that finite-sample cumulant estimates are highly sensitive to outliers: a few, possibly erroneous, observations with large values may prove decisive for the accuracy of the resulting estimate. To cope with this problem, alternative methods for estimating the entropy have been devised, which perform better than the cumulant-based estimates and at a comparable computational cost; see, e.g., [96, 139].

#### 5.3.2 Nonlinear PCA

PCA aims at best approximating, in the least-squares (LS) sense, the given data by a linear projection onto a set of orthogonal vectors. As explained in Section 3, this is not sufficient to extract the higher-order structure of the data. A generalization of PCA, involving a nonlinear projection, has been proposed and demonstrated to be effective in the BSS problem. This so-called nonlinear PCA approach can be described in terms of the minimization of the cost functional [109]

$$ \psi_{\mathrm{nPCA}}(\boldsymbol{G}) = E\bigl[\|\boldsymbol{u} - \boldsymbol{G}^{\mathrm{T}}\boldsymbol{g}(\boldsymbol{G}\boldsymbol{u})\|^2\bigr]. \tag{48} $$

Assuming data prewhitening, $\boldsymbol{G}$ is orthogonal and the above can be rewritten as:

$$ \psi_{\mathrm{nPCA}}(\boldsymbol{G}) = E\bigl[\|\boldsymbol{G}\boldsymbol{u} - \boldsymbol{g}(\boldsymbol{G}\boldsymbol{u})\|^2\bigr]. $$
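The nonlinear PCA cost (48) is straightforward to estimate from samples. A minimal sketch (the tanh nonlinearity, Gaussian stand-in data, and all sizes are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 2, 10_000

u = rng.standard_normal((N, T))                     # stand-in for prewhitened mixtures
G = np.linalg.qr(rng.standard_normal((N, N)))[0]    # an orthogonal candidate separator

def psi_nPCA(G, u, g=np.tanh):
    """Empirical nonlinear PCA cost E[||u - G^T g(G u)||^2] of eq. (48)."""
    r = u - G.T @ g(G @ u)
    return np.mean(np.sum(r**2, axis=0))

print(psi_nPCA(G, u))
```

Note that for an orthogonal $G$ the residual norm is unchanged by left-multiplying with $G$, so the same cost can equivalently be computed between $Gu$ and $g(Gu)$, which is the form used with prewhitened data.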