1 Introduction
Independent component analysis (ICA) [1] is a popular technique for multisensor signal processing. Given a data matrix X made of N signals of length T, an ICA algorithm finds an ‘unmixing matrix’ W such that the rows of Y = WX are ‘as independent as possible’. An important class of ICA methods further constrains the rows of Y to be uncorrelated. Assuming zero-mean and unit-variance signals, that is:

(1)  (1/T) Y Y^T = I_N
The ‘whiteness constraint’ (1) is satisfied if, for instance,

(2)  W = O W_0,  with  W_0 = ((1/T) X X^T)^(−1/2),

and the matrix O is constrained to be orthogonal: O O^T = I_N. For this reason, ICA algorithms which enforce signal decorrelation by proceeding as suggested by (2) —whitening followed by rotation— can be called ‘orthogonal methods’.
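The whitening-followed-by-rotation construction in (2) can be sketched numerically. The snippet below is illustrative only (variable names are ours, and numpy is assumed); it checks that, after whitening, any orthogonal rotation preserves the whiteness constraint (1).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 10000
X = rng.standard_normal((N, N)) @ rng.laplace(size=(N, T))  # mixed signals

# Whitening matrix W0 = C^{-1/2}, with C the empirical covariance of X.
C = X @ X.T / T
eigval, eigvec = np.linalg.eigh(C)
W0 = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

# Any orthogonal O keeps the rows of Y = O W0 X white.
O, _ = np.linalg.qr(rng.standard_normal((N, N)))
Y = O @ W0 @ X
assert np.allclose(Y @ Y.T / T, np.eye(N), atol=1e-8)
```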
The orthogonal approach is followed by many ICA algorithms. Among them, FastICA [2] stands out by its simplicity, good scaling, and a built-in ability to deal with both sub-Gaussian and super-Gaussian components. It also enjoys an impressive convergence speed when applied to data which are actual mixtures of independent components [3, 4]. However, real data can rarely be accurately modeled as mixtures of independent components. In that case, the convergence of FastICA may be impaired or may not happen at all.
In [5], we introduced the Picard algorithm for fast non-orthogonal ICA on real data. In this paper, we extend it to an orthogonal version, dubbed Picard-O, which solves the same problem as FastICA, but faster on real data.
In section 2, the non-Gaussian likelihood for ICA is studied in the orthogonal case, yielding the Picard Orthogonal (Picard-O) algorithm in section 3. Section 4 connects our approach to FastICA: Picard-O converges toward the fixed points of FastICA, yet faster thanks to a better approximation of the Hessian matrix. This is illustrated through extensive experiments on four types of data in section 5.
2 Likelihood under whiteness constraint
Our approach is based on the classical non-Gaussian ICA likelihood [6]. The data matrix X is modeled as X = AS, where the mixing matrix A is invertible and where S has statistically independent rows: the ‘source signals’. Further, each row of S is modeled as an i.i.d. signal with probability density function (pdf) d(·) common to all samples. In the following, this assumption is denoted as the mixture model. It never perfectly holds on real problems. Under this assumption, the likelihood of X reads:

(3)  p(X | A) = |det A|^(−T) ∏_{i=1}^{N} ∏_{t=1}^{T} d(y_i(t)),  where Y = A^(−1) X
It is convenient to work with the negative averaged log-likelihood parametrized by the unmixing matrix W = A^(−1), that is, L(W) = −(1/T) log p(X | W). Then, (3) becomes:

(4)  L(W) = −log |det W| − Σ_{i=1}^{N} Ê[ log d(y_i) ]

where Ê denotes the empirical mean (sample average over t) and where, implicitly, Y = WX.
We consider maximum likelihood estimation under the whiteness constraint (1). By (2) and (4), this is equivalent to minimizing L(O W_0) with respect to the orthogonal matrix O. To do so, we propose an iterative algorithm. A given iterate W is updated by replacing it by a more likely orthonormal matrix in its neighborhood. Following classical results of differential geometry over the orthogonal group [7], we parameterize that neighborhood by expressing the update as W ← e^E W, where E is a (small) skew-symmetric matrix: E^T = −E. The second-order Taylor expansion of L(e^E W) reads:

(5)  L(e^E W) = L(W) + ⟨G, E⟩ + (1/2) ⟨E, H E⟩ + O(‖E‖³)
The first-order term is controlled by the matrix G, called relative gradient, and the second-order term depends on the tensor H, the relative Hessian. Both quantities only depend on Y and have simple expressions in the case of a small relative perturbation of the form W ← (I + E)W (see [5] for instance). Those expressions are readily adapted to the case of interest, using the second-order expansion e^E = I + E + E²/2. One gets:

(6)  G_ij = Ê[ψ(y_i) y_j] − δ_ij

(7)  H_ijkl = δ_il δ_jk + δ_ik Ê[ψ′(y_i) y_j y_l]

where ψ = −d′/d is called the score function.
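The relative gradient (6) can be checked against a finite difference of the loss (4). A minimal sketch (names are ours), assuming the score ψ = tanh, i.e. a model density d with −log d(y) = log cosh(y) up to a constant:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 3, 50000
Y = rng.laplace(size=(N, T))

def loss(W, X):
    # negative averaged log-likelihood (4), up to an additive constant
    return -np.linalg.slogdet(W)[1] + np.sum(np.mean(np.log(np.cosh(W @ X)), axis=1))

# Relative gradient (6) at W = I: G_ij = E^[psi(y_i) y_j] - delta_ij
G = np.tanh(Y) @ Y.T / T - np.eye(N)

# Central finite difference of L((I + eps E) W) matches <G, E>.
E = rng.standard_normal((N, N))
eps = 1e-6
num = (loss(np.eye(N) + eps * E, Y) - loss(np.eye(N) - eps * E, Y)) / (2 * eps)
assert abs(num - np.sum(G * E)) < 1e-5 * max(1.0, abs(num))
```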
This Hessian (7) is quite costly to compute, but a simple approximation is obtained by using the form that it would take for large T and independent signals. In that limit, one has, for j, l ≠ i:

(8)  Ê[ψ′(y_i) y_j y_l] ≈ δ_jl Ê[ψ′(y_i)] Ê[y_j²]

hence the approximate Hessian (recall that Ê[y_j²] = 1):

(9)  H̃_ijkl = δ_il δ_jk + δ_ik δ_jl Ê[ψ′(y_i)]
Plugging this expression in the 2nd-order expansion (5) and expressing the result as a function of the free parameters E_ij (i < j) of a skew-symmetric matrix yields, after simple calculations:

(10)  L(e^E W) ≈ L(W) + Σ_{i<j} (G_ij − G_ji) E_ij + (1/2) Σ_{i<j} (κ_i + κ_j) E_ij²

where we have defined the nonlinear moments:

(11)  κ_i = Ê[ψ′(y_i)] − Ê[ψ(y_i) y_i]
If κ_i + κ_j > 0 for all i < j (this assumption will be enforced in the next section), the form (10) is minimized for E_ij = −(G_ij − G_ji)/(κ_i + κ_j): the resulting quasi-Newton step would be

(12)  W ← e^E W,  with  E_ij = −(G_ij − G_ji)/(κ_i + κ_j)
That observation forms the keystone of the new orthogonal algorithm described in the next section.
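The quasi-Newton step (12) can be sketched as follows (a toy illustration with our own variable names, assuming numpy/scipy and ψ = tanh): slightly rotated super-Gaussian sources are pulled back toward separation, decreasing the loss.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
N, T = 4, 100000
S = rng.laplace(size=(N, T)) / np.sqrt(2)      # unit-variance Laplace sources

# Perturb the sources by a small rotation away from separation.
E0 = 0.1 * np.triu(rng.standard_normal((N, N)), 1)
E0 = E0 - E0.T
Y = expm(E0) @ S

def loss(Y):
    # loss (4) for psi = tanh; log|det W| = 0 for orthogonal updates
    return np.sum(np.mean(np.log(np.cosh(Y)), axis=1))

psi, psi_p = np.tanh(Y), 1 - np.tanh(Y) ** 2
G = psi @ Y.T / T - np.eye(N)                              # gradient (6)
kappa = np.mean(psi_p, axis=1) - np.mean(psi * Y, axis=1)  # moments (11)
assert np.all(kappa > 0)     # tanh is adapted to super-Gaussian sources

# Quasi-Newton step (12): E_ij = -(G_ij - G_ji) / (kappa_i + kappa_j)
E = -(G - G.T) / (kappa[:, None] + kappa[None, :])
Y_new = expm(E) @ Y
assert loss(Y_new) < loss(Y)
```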
3 The Picard-O algorithm
As we shall see in Sec. 4, update (12) is essentially the behavior of FastICA near convergence. Hence, one can improve on FastICA by using a more accurate Hessian approximation. Using the exact form (7) would be quite costly for large data sets. Instead, following the same strategy as [5], we base our algorithm on the L-BFGS method (which learns the actual curvature of the problem from the data themselves), using approximation (10) only as a preconditioner.
L-BFGS and its preconditioning: The L-BFGS method keeps track of the previous values of the (skew-symmetric) relative moves E_k and gradient differences ΔG_k, and of auxiliary quantities ρ_k = 1/⟨E_k, ΔG_k⟩. It returns a descent direction by running through one backward loop and one forward loop. It can be preconditioned by inserting a Hessian approximation in between the two loops, as summarized in algorithm 1.
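The two-loop recursion with a preconditioner in the middle can be sketched in generic vector form (function names are ours, not the paper's; numpy assumed):

```python
import numpy as np

def two_loop(g, s_list, y_list, precond):
    """L-BFGS direction from gradient g, stored moves s_k, gradient
    differences y_k; `precond` is applied between the two loops."""
    q = g.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):  # backward loop
        a = rho * np.dot(s, q)
        alphas.append(a)
        q = q - a * y
    r = precond(q)             # Hessian approximation inserted here
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * np.dot(y, r)                                   # forward loop
        r = r + (a - b) * s
    return -r

# With an empty memory, the direction is the preconditioned gradient.
g = np.array([1.0, -2.0])
d = two_loop(g, [], [], lambda v: 0.5 * v)
assert np.allclose(d, -0.5 * g)

# On an exact quadratic with Hessian B, pairs satisfying y = B s and the
# preconditioner B^{-1} recover the Newton direction -B^{-1} g.
B = np.array([[2.0, 0.3], [0.3, 1.0]])
s = np.array([0.4, -0.1])
d2 = two_loop(g, [s], [B @ s], lambda v: np.linalg.solve(B, v))
assert np.allclose(d2, -np.linalg.solve(B, g))
```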
Stability: If the ICA mixture model holds, the sources should constitute a local minimum of L. According to (10), that happens if κ_i + κ_j > 0 for all i ≠ j (see also [8]). We enforce that property by taking, at each iteration:

(13)  ψ_i(·) = sign(κ_i) ψ(·)

where κ_i is computed as in (11) with the fixed nonlinearity ψ (a typical choice is ψ(·) = tanh(·)). This is very similar to the technique in extended Infomax [9]. It enforces κ_i ≥ 0 for each signal, and hence the positivity of the quadratic form in (10). In practice, if for any signal the sign of κ_i changes from one iteration to the next, L-BFGS’s memory is flushed.
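The sign switch (13) can be sketched as follows (our own names; ψ = tanh assumed): a super-Gaussian signal keeps the nonlinearity, a sub-Gaussian one flips it.

```python
import numpy as np

def score_signs(Y):
    """Signs of kappa_i from (11), computed with the fixed psi = tanh."""
    psi, psi_p = np.tanh(Y), 1 - np.tanh(Y) ** 2
    kappa = np.mean(psi_p, axis=1) - np.mean(psi * Y, axis=1)
    return np.sign(kappa)    # effective score: psi_i(.) = sign_i * tanh(.)

rng = np.random.default_rng(4)
T = 100000
laplace = rng.laplace(size=T) / np.sqrt(2)              # super-Gaussian, unit variance
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=T)  # sub-Gaussian, unit variance
signs = score_signs(np.vstack([laplace, uniform]))
assert signs[0] == 1.0 and signs[1] == -1.0
```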
Regularization: The switching technique guarantees that the Hessian approximation is positive, but one may be wary of very small values of κ_i. Hence, the preconditioner uses a floor: κ_i ← max(κ_i, κ_min) for some small positive value κ_min (e.g. of the order of 10⁻²).
Line search: A backtracking line search helps convergence. Using the search direction E returned by L-BFGS, and starting from a step size α = 1: if L(e^{αE} W) < L(W), then set W ← e^{αE} W; otherwise, divide α by 2 and repeat.
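A minimal sketch of this backtracking scheme (illustrative names, scipy assumed for the matrix exponential), exercised on a toy loss where rotating back toward the identity is the right move:

```python
import numpy as np
from scipy.linalg import expm

def line_search(loss, W, E, alpha=1.0, max_halvings=20):
    """Backtracking: try W <- expm(alpha E) W, halving alpha until the
    loss decreases; give up (step 0) after max_halvings halvings."""
    L0 = loss(W)
    for _ in range(max_halvings):
        W_new = expm(alpha * E) @ W
        if loss(W_new) < L0:
            return W_new, alpha
        alpha /= 2.0
    return W, 0.0

# Toy check: W = expm(S); moving along -S returns exactly to the identity.
S = np.array([[0.0, 0.3], [-0.3, 0.0]])
W = expm(S)
loss = lambda M: np.linalg.norm(M - np.eye(2)) ** 2
W_new, alpha = line_search(loss, W, -S)
assert alpha == 1.0 and loss(W_new) < 1e-12
```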
Stopping: The stopping criterion is max_{i,j} |G_ij − G_ji| < ε, where ε is a small tolerance constant.
Combining these ideas, we obtain the Preconditioned Independent Component Analysis for Real Data, Orthogonal (Picard-O) algorithm, summarized in table 2.
The Python code for Picard-O is available online at https://github.com/pierreablin/picard.
4 Link with FastICA
This section briefly examines the connections between Picard-O and symmetric FastICA [10]. In particular, we show that both methods essentially share the same solutions, and that the behavior of FastICA is similar to that of a quasi-Newton method.
Recall that FastICA is based on an N × N matrix C(Y), defined entrywise by:

(14)  C_ij = δ_ij Ê[ψ′(y_i)] − Ê[ψ(y_i) y_j]
The symmetric FastICA algorithm, starting from white signals Y, can be seen as iterating Y ← O Y until convergence, where O is the orthogonal matrix computed as

(15)  O = (C C^T)^(−1/2) C
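One such iteration can be sketched as follows (illustrative code with our own names, not the reference FastICA implementation; ψ = tanh assumed). The polar factor in (15) is conveniently obtained from the SVD of C: if C = U S V^T, then (C C^T)^(−1/2) C = U V^T.

```python
import numpy as np

def fastica_step(Y):
    """One symmetric FastICA iteration, eqs (14)-(15), with psi = tanh."""
    N, T = Y.shape
    psi, psi_p = np.tanh(Y), 1 - np.tanh(Y) ** 2
    C = np.diag(np.mean(psi_p, axis=1)) - psi @ Y.T / T   # matrix (14)
    U, _, Vt = np.linalg.svd(C)
    O = U @ Vt            # polar factor (C C^T)^{-1/2} C
    return O @ Y, O

rng = np.random.default_rng(5)
X = rng.laplace(size=(3, 20000))
d, P = np.linalg.eigh(X @ X.T / X.shape[1])
Y = P @ np.diag(d ** -0.5) @ P.T @ X      # whitened starting signals
Y_new, O = fastica_step(Y)
assert np.allclose(O @ O.T, np.eye(3), atol=1e-10)   # O is orthogonal
# the update preserves whiteness, as required by (1)
assert np.allclose(Y_new @ Y_new.T / Y.shape[1], np.eye(3), atol=1e-8)
```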
In the case of a fixed score function, a sign-flipping phenomenon appears, leading to the following definition of a fixed point: Y is a fixed point of FastICA if O(Y) is a diagonal matrix of ±1 [11]. This behavior can be fixed by changing the score functions as in (13). It is not hard to see that, if ψ is an odd function, such a modified version has the same trajectories as the fixed-score version (up to irrelevant sign flips), and that the fixed points of the original algorithm now all verify O(Y) = I_N.

Stationary points: We first relate the fixed points of FastICA (or rather of the sign-adjusted version described above) to the stationary points of Picard-O.
Denote C^s (resp. C^a) the symmetric (resp. skew-symmetric) part of C, and similarly G^s and G^a for G. It follows from (6) and (14) that C^a = −G^a, since C + G is a diagonal, hence symmetric, matrix.
One can show that Y is a fixed point of FastICA if and only if C is symmetric and positive definite. Indeed, at a fixed point, O = I_N, so that by Eq. (15) one has C = (C C^T)^(1/2), which is a positive matrix (almost surely). Conversely, if C is symmetric, then so is C C^T. If C is also positive, then its polar factor (C C^T)^(−1/2) C is the identity matrix, so that Y is a fixed point of FastICA.

The modification of FastICA ensures that the diagonal of C is positive, but does not guarantee positive definiteness. However, we empirically observed that, on each dataset used in the experiments, the matrix C is positive definite when G^a is small. Under that condition, we see that the stationary points of Picard-O, characterized by G^a = 0, are exactly the fixed points of FastICA.
Asymptotic behavior of FastICA: Let us now expose the behavior of FastICA close to a fixed point Y*, i.e. when Y = e^E Y* for some small skew-symmetric matrix E.

At first order in C^a, the polar factor O = I_N + O^a of C is obtained as the solution of (proof omitted):

(16)  O^a C^s + C^s O^a = 2 C^a
Denote by M the linear mapping E ↦ (E C^s + C^s E)/2. When FastICA performs a small move, it is (at first order) of the form Y ← (I_N + E) Y with E = M^(−1)(C^a) = −M^(−1)(G^a). It corresponds to a quasi-Newton step with M as approximate Hessian.
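Relation (16) is linear in O^a and can be checked numerically: it is a Sylvester equation, solvable with scipy (a check of ours, not code from the paper).

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(6)
N = 5
A = rng.standard_normal((N, N))
Cs = A @ A.T + N * np.eye(N)            # symmetric positive definite part
B = rng.standard_normal((N, N))
Ca = 1e-4 * (B - B.T)                   # small skew-symmetric part

# Exact polar factor of C = Cs + Ca, via its SVD.
U, _, Vt = np.linalg.svd(Cs + Ca)
O_exact = U @ Vt

# First-order solution of (16): O^a Cs + Cs O^a = 2 Ca.
Oa = solve_sylvester(Cs, Cs, 2 * Ca)
assert np.allclose(O_exact, np.eye(N) + Oa, atol=1e-7)
```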
Furthermore, under the mixture model assumption, close to separation and with a large number of samples, C^s becomes the diagonal matrix with entries κ_i, and the step simplifies, giving the same direction as in (12): E_ij = −(G_ij − G_ji)/(κ_i + κ_j).
In summary, we have shown that a slightly modified version of FastICA (with essentially the same iterates as the original algorithm) has the same fixed points as Picard-O. Close to such a fixed point, each of FastICA’s iterations is similar to a quasi-Newton step with an approximate Hessian. This approximation matches the true Hessian if the mixture model holds, but this cannot be expected in practice on real data.
5 Experiments
This section illustrates the relative speeds of FastICA and Picard-O. Both algorithms are coded in the Python programming language. Their computation time being dominated by score evaluations, dot products and sample averages, a fair comparison is obtained by ensuring that both algorithms call the exact same procedures, so that speed differences mostly stem from algorithmics rather than implementation.
Figure 1 summarizes our results. It shows the evolution of the projected gradient norm versus iterations (left column) and versus time (right column). The 4 rows correspond to 4 data types: synthetic mixtures, fMRI signals, EEG signals, and image patches. FastICA and Picard-O speeds are compared using the same nonlinearity ψ(·) = tanh(·). The signals are centered and whitened before running ICA.
Experiments are repeated several times for each setup. The solid line shows the median of the gradient curves, and the shaded area shows the spread across runs: half of the runs complete faster than the solid line, and the bulk of the convergence curves lie inside the shaded area.
Synthetic data
We generate N i.i.d. sources of length T. The first sources follow a uniform law, the remaining ones a Laplace law. The source matrix is multiplied by a random square mixing matrix. This experiment is repeated several times, changing each time the seed generating the signals and the mixing matrix.
fMRI
ICA is applied on functional MRI data from the ADHD-200 consortium [13], processed as in [12].
EEG
ICA is applied on publicly available electroencephalography datasets [14] (https://sccn.ucsd.edu/wiki/BSSComparison). Each recording contains N signals of length T.
Image patches
We use a database of different images of open country [15]. From each image, patches are extracted and vectorized to obtain the signals, before applying ICA.

Results. FastICA is slightly faster than Picard-O on the simulated problem, for which the ICA mixture model holds perfectly. However, on real data, the rate of convergence of FastICA is severely impaired because the underlying Hessian approximation is far from the truth, while our algorithm still converges quickly. Picard-O is also more consistent in its convergence pattern, showing less spread than FastICA.
6 Discussion
In this paper, we show that, close to its fixed points, FastICA’s iterations are similar to quasi-Newton steps for maximizing a likelihood. Furthermore, the underlying Hessian approximation matches the true Hessian of the problem if the signals are independent. However, on real datasets, the independence assumption never perfectly holds. Consequently, FastICA may converge very slowly on applied problems [16] or can get stuck in saddle points [17].
To overcome this issue, we propose the Picard-O algorithm. As an extension of [5], it uses a preconditioned L-BFGS technique to solve the same minimization problem as FastICA. Extensive experiments on three types of real data demonstrate that Picard-O can be orders of magnitude faster than FastICA.
References
[1] P. Comon, “Independent component analysis, a new concept?” Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.
[2] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[3] E. Oja and Z. Yuan, “The FastICA algorithm revisited: Convergence analysis,” IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1370–1381, 2006.
[4] H. Shen, M. Kleinsteuber, and K. Hüper, “Local convergence analysis of FastICA and related algorithms,” IEEE Transactions on Neural Networks, vol. 19, no. 6, pp. 1022–1032, 2008.
[5] P. Ablin, J.-F. Cardoso, and A. Gramfort, “Faster independent component analysis by preconditioning with Hessian approximations,” arXiv preprint, 2017.
[6] D. T. Pham and P. Garat, “Blind separation of mixture of independent sources through a quasi-maximum likelihood approach,” IEEE Transactions on Signal Processing, vol. 45, no. 7, pp. 1712–1725, 1997.
[7] W. Bertram, “Differential Geometry, Lie Groups and Symmetric Spaces over General Base Fields and Rings,” Memoirs of the American Mathematical Society, 2008.
[8] J.-F. Cardoso, “On the stability of some source separation algorithms,” in Proc. of the 1998 IEEE SP Workshop on Neural Networks for Signal Processing (NNSP ’98), 1998, pp. 13–22.
[9] T.-W. Lee, M. Girolami, and T. J. Sejnowski, “Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources,” Neural Computation, vol. 11, no. 2, pp. 417–441, 1999.
[10] A. Hyvärinen, “The fixed-point algorithm and maximum likelihood estimation for independent component analysis,” Neural Processing Letters, vol. 10, no. 1, pp. 1–5, 1999.
[11] T. Wei, “A convergence and asymptotic analysis of the generalized symmetric FastICA algorithm,” IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6445–6458, 2015.
[12] G. Varoquaux, S. Sadaghiani, P. Pinel, A. Kleinschmidt, J.-B. Poline, and B. Thirion, “A group model for stable multi-subject ICA on fMRI datasets,” NeuroImage, vol. 51, no. 1, pp. 288–299, 2010.
[13] The ADHD-200 Consortium, “The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience,” Frontiers in Systems Neuroscience, vol. 6, 2012.
[14] A. Delorme, J. Palmer, J. Onton, R. Oostenveld, and S. Makeig, “Independent EEG sources are dipolar,” PLoS ONE, vol. 7, no. 2, p. e30135, 2012.
[15] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[16] P. Chevalier, L. Albera, P. Comon, and A. Ferréol, “Comparative performance analysis of eight blind source separation methods on radiocommunications signals,” in Proc. of the IEEE International Joint Conference on Neural Networks, vol. 1, 2004, pp. 273–278.
[17] P. Tichavsky, Z. Koldovsky, and E. Oja, “Performance analysis of the FastICA algorithm and Cramér-Rao bounds for linear independent component analysis,” IEEE Transactions on Signal Processing, vol. 54, no. 4, pp. 1189–1203, 2006.