Independent component analysis (ICA) comon1994independent
is a data exploration technique used in many observational sciences. In its classical and most widely spread form, it makes the assumption that a random vectoris a linear mixture of independent sources. It means that there exists a source vector of statistically independent features and a mixing matrix , such that . The aim of ICA is to recover from some realizations of .
One of first and most employed ICA algorithm is Infomax bell1995information. It assumes that each feature of follows a given super-Gaussian distribution . A rigorous definition of super-Gaussianity will be given in Section 2.1, for now on it can be thought of as a heavy-tail distribution. The likelihood of given writes pham1997blind:
It is more convenient to work with the unmixing matrix and the negative log-likelihood, yielding a cost function :
The underlying expected risk is then:
Given a set of i.i.d. samples of , , the empirical risk is:
This article focuses on the inference of in two cases. The first one is the finite-sum setting. Using only samples, is found by searching for a minimizer of . The other case is the online setting, where a stream of samples arriving one by one is considered. In this case, goes to infinity, and then tends towards . It is important to note that it is empirically observed that these criteria to unmix sources of different density than , as long as they are super-Gaussian.
Although not formulated like this in the original article, Infomax has been shown to solve the finite-sum problem cardoso1997infomax. It does so by using a stochastic gradient method. Unfortunately, is not convex, hence it is hard to find a good step-size policy which fits any kind of data bottou2016optimization. It can take an extremely long time before it reaches convergence, or fail to converge at all. Still, the stochasticity of Infomax makes it efficient when the number of samples is large, because the cost of one iteration does not depend on . More recently, several full-batch second-order algorithms have been derived for the exact minimization of . For instance, in zibulevsky2003blind, an approximation of the Hessian of is used to obtain a simple quasi-Newton method. Full-batch methods are robust and can show quadratic convergence speed, but an iteration can take a very long time when the number of samples is large. They also crucially rely on a costly line-search strategy because of the non-convexity of the problem.
We make the following contributions:
We introduce a set of surrogate functions for , allowing for a majorization-minimization method. We show that this view is equivalent to an EM algorithm for ICA. Consequently, techniques likes incremental EM neal1998view and online EM cappe2009line can efficiently be applied to this problem.
Critically, the surrogates can be minimized in closed-form with respect to one row of . It implies that the incremental algorithm possesses the guarantee to decrease the surrogate loss at each iteration, without having to resort to expensive line-search techniques. To the best of our knowledge, this feature is a novelty in the world of ICA algorithms.
Owing to a cheap partial update, the cost of one iteration of the proposed algorithm is similar to the cost of a stochastic gradient descent step. Through experiments, we observe that the proposed methods performs better than the state-of-the-art, while enjoying the robust property of guaranteed decrease.
In the following, scalar values are noted in lower case (e.g. ), vectors in bold font (e.g., ), and matrices in upper case (e.g. ). For a matrix , denotes its -th row, and denotes its -th column. For complexity analysis, we say that a quantity is if is bounded.
2 Representations of super-Gaussian densities
Super-Gaussian densities can be represented in several forms, variationally through a surrogate function, or probabilistically through a Gaussian scale mixture palmer2006variational. The two approaches lead to the same optimization algorithms but with a slightly different view point.
2.1 Surrogate functions
The density is assumed symmetric and super-Gaussian in the sense that is an increasing concave function over palmer2006variational. Following palmer2006variational, there exists a function such that:
and the minimum is reached for an unique value denoted as . Simple computations show that . We introduce a new cost function where , which writes:
and the associated empirical risk, for :
Following Eq. (5), we have:
Lemma 1 (Majorization).
Let . For any , , with equality if and only if , that is, .
Lemma 2 (Same minimizers).
Let , and . Then, minimizes if and only if minimizes .
Given the auxiliary function and the two lemmas above, a natural algorithm relies on alternatively minimizing with respect to and . This will be shown to be equivalent to the EM algorithm for the Gaussian scale mixture interpretation in the next Section.
The rest of the paper focuses on the minimization of rather than , which yields the same unmixing matrix by Lemma 2.
2.2 EM algorithm with Gaussian scale mixtures
Super-Gaussian densities can also be obtained as scale mixtures of Gaussian densities palmer2006variational, as , where
is a centered Gaussian density of variance, and
a distribution on the variance of the Gaussian distribution. It turns out that the EM algorithm using the above form for our ICA model is exactly equivalent to the alternating optimization of(see a proof in the supplementary material). The variable corresponds to the scale parameter in palmer2006variational and the EM algorithm alternates between setting to the posterior mean (E-step) and a descent move in (M-step).
Relationship to the noisy case.
Many articles (e.g. palmer2006variational; girolami2001variational; bermond1999approximate
) propose EM-based techniques for the estimation of the latent parameters of the more general linear model:
where is the mixing matrix, and
is a Gaussian noise random variable. Inpalmer2006variational, the matrix is assumed to be known, as well as the noise covariance . On the contrary, the present article deals with the case where is unknown, and where there is no noise. The noisy case (with unknown ) is studied in e.g. bermond1999approximate; girolami2001variational. An EM algorithm is derived for the estimation of , and . In the appendix, it is shown that this EM algorithm is stuck in the case of no noise (): its update rule for is . It means that EM algorithms found in the literature for the noisy case do not work when there is no noise, while the approach derived in the following section correctly estimates the mixing matrix.
Many choices for can be found in the ICA literature. In the following, we omit the irrelevant normalizing constants. In the original Infomax paper bell1995information, it is proposed to use , yielding and . This density model is one of the most widely spread. However, an ICA algorithm evaluates those functions many times. Thus, in order to have a lighter algorithm, one can use simpler functions. One possibility is to use a Student distribution: , which gives . In the following, we choose the Huber function: if , else. This gives if , otherwise.
3 Stochastic minimization of the loss function
Using an EM strategy, is minimized by alternating descent moves in and in . We propose an incremental technique which minimizes with a finite number of samples, and an online technique where each sample is only used once. They also follow from the majorization-minimization algorithm interpretation mairal2015incremental. The pseudo code for these algorithms is given in Algorithms 1 and 2. The difference between incremental and online technique only reflects through the variable which is estimated at the E-step. Hence, we first discuss the M-step.
3.1 M-step: Descent in W
Expanding , the middle term in the new loss function (6) is quadratic in the rows of :
where denotes the -th row of , and the ’s are matrices given by:
Therefore, when is fixed, with respect to , is the sum of the function and a quadratic term. The minimization of such a function is difficult, mostly because the part introduces non-convexity. However, similarly to a coordinate descent move, it can be partially minimized in closed-form, with a multiplicative update of one of its rows. Let , and . Define such that except its -th row which is equal to . We want to find such that the update minimizes . With respect to , is of the form where we define . This expression can be minimized exactly by canceling the gradient, yielding:
Performing multiplicative updates on the iterate enforces the equivariance of the proposed methods cardoso1996equivariant: denoting by the “algorithm operator" which maps input signals
(be it a stream or a finite set) to the estimated mixing matrix, for any invertible matrix, .
3.2 E-step : Descent in U
For a fixed unmixing matrix , Lemma 1 gives: . Such an operation works on the full batch of samples . When only one sample is available, the operation minimizes with respect to the -th row of . As seen previously (Section 3.1), we only need to compute the ’s to perform a descent in , hence one needs a way to accumulate those matrices.
Incremental algorithm. To do so in an incremental way neal1998view, a memory stores the values of . When a sample is seen by the algorithm, we compute , and update the ’s as:
The memory is then updated by enforcing at each iteration.
Online algorithm. When each sample is only seen once, there is no memory, and a natural update rule following cappe2009line is:
where is the number of samples seen, and is a well chosen factor. Setting yields the unbiased formula . A more aggressive policy for empirically leads to faster estimation of the latent parameters. Note that since we are averaging sufficient statistics, there is no need to multiply by a constant.
3.3 Complexity analysis
Memory: The proposed algorithm stores matrices , which requires a memory of size (since they are symmetric). In the incremental case, it stores the real numbers , requiring a memory of size . In most practical cases of ICA, the number of sources is very small compared to (), meaning that the dominating memory cost is . In the online case, the algorithm only loads one mini-batch of data at a time, leading to a reduced memory size of , where is the mini-batch size.
Time: The -step requires to update each coefficient of the matrices ’s, meaning that it has a time complexity of . The M-step requires to solve linear systems to obtain the matrices . Each system takes (an improvement using a preconditioned conjugate gradient is proposed in the appendix). The total cost of the -step is thus . In practice, , so the overall cost of one iteration is dominated by the E-step, and is . A stochastic gradient descent algorithm with the same mini-batch size , as described later in Section 4, has a lower time complexity of . We now propose a way to reach the same time complexity with the EM approach.
3.4 Greedy update based on the dual gap
In order to reduce the complexity by one order of magnitude, we only update a subset of fixed size of the matrices for each sample. In the incremental setting, it is simple to compute the decrease in induced by the update of one coefficient from to . Following Eq. (5), it is given by the dual gap which is a positive quantity measuring the decrease in provided by updating :
Since all the above quantities are computed during one iteration anyway, computing the dual gap for each signal only adds a negligible computational overhead, which scales linearly with . Then, in a greedy fashion, only the coefficients corresponding to the largest dual-gaps are updated, yielding the largest decrease in possible with updates. In the experiments (Figure 3), we observe that it is much faster than a random selection, and that it does not impair convergence too much compared to the full-selection (). In the online setting, there is no memory, so we simply choose indices among at random.
Related work: The matrices are sufficient statistics of the surrogate ICA model for a given value of . The idea to perform a coordinate descent move (11) after each update of the sufficient statistics is inspired by online dictionary learning mairal2009online and non-negative matrix factorization lefevre2011online.
In this section, we compare the proposed approach to other classical methods to minimize .
4.1 Other algorithms
Stochastic gradient descent (SGD). Given a mini-batch containing samples, the relative gradient is computed. Then, a descent move is performed. The choice of the step size
is critical and difficult. The original article uses a constant step size, but more sophisticated heuristics can be derived. This method can be used both for the finite sum and the online problem. It is important to note that onceand are computed, it needs twice as many elementary operations to compute the gradient as it takes to update one matrix (eq. (12) and eq.(13)) when . The first computation requires operations, while the second takes (since the matrices are symmetric). When is large enough, as it is the case in practice in the experiments, these computations are the bottlenecks of their respective methods. Hence, we take in the experiments for the EM algorithms, so that the theoretical cost of one iteration of the proposed method matches that of SGD.
Variance reduced methods. One of the drawbacks of the stochastic gradient method is its sub-linear rate of convergence, which happens because the stochastic gradient is a very noisy estimate of the true gradient. Variance reduced methods such as SAG schmidt2017minimizing, SAGA defazio2014saga or SVRG johnson2013accelerating reduce the variance of the estimated gradient, leading to better rates of convergence. However, these methods do not solve the other problem of SGD for ICA, which is the difficulty of finding a good step-size policy. We compare our approach to SAG, which keeps the past stochastic gradients in memory and performs a descent step in the averaged direction. This approach is only relevant in the finite-sum setting.
Full batch second order algorithms. We compare our approach to the “Fast-Relative Newton” method (FR-Newton) zibulevsky2003blind. This algorithm uses a simple approximation of the Hessian of , as costly to compute as a gradient, to perform quasi-Newton steps. One iteration requires to compute the gradient and the Hessian on the full dataset, resulting in a cost of , and to evaluate the gradient and loss function for each point tested during the line search, so the overall cost is where
is the number of points tested during the line-search. Thus, one epoch requires more thantimes more computations than one of SGD or of the proposed algorithms. This algorithm cannot be used online.
Full batch EM. For the finite-sum problem, we also compare our approach to the full-batch EM, where the whole is updated at the E step.
4.2 Quality measure
The following quality measures are used to assess the performance of the different algorithms:
Loss on left-out data: It is the value of the loss on some data coming from the same dataset but that have not been used to train the algorithms. This measure is similar to the testing error in machine learning, and can be computed in both the streaming and finite-sum settings.
Amari distance moreau1998self: When the true mixing matrix is available, for a matrix , the product is computed, and the Amari distance is given by: . This distance measures the proximity of and up to scale and permutation indetermination. It cancels if and only if is a scale and permutation matrix, i.e., is the separation if perfect. This measure is relevant both for the online and finite-sum problems.
Relative gradient norm: The norm of the full-batch relative gradient of is another measure of convergence. Since the problem is non-convex, the algorithms may end in different local minima, which is why we favor this metric over the train error. It is only relevant in the finite-sum setting. A converging algorithm should drive this quantity to zero.
The stochastic algorithms (SGD, SAG, and the proposed EM techniques) are used with a batch size of . The proposed EM algorithms are run with a parameter , which ensures that each of their iterations is equivalent to one iteration of the SGD algorithm. In the online setting, we use a power to speed up the estimation. The step-sizes of SGD and SAG are chosen by trial and error on each dataset, by finding a compromise between convergence speed and accuracy of the final mixing matrix. In the online case, the learning rate is chosen as for SGD, where is here to make sure the iterates do not blow up. FR-Newton is run with its default parameters.
4.4 Experiment on simulated data
For this experiment, we generate a matrix with and of independent sources following a super-Gaussian Laplace distribution (). Note that this distribution does not match the Huber function used in the algorithms, but estimation is still possible since the sources are super-Gaussian. Then, we generate a random mixing matrix
of normally distributed coefficients. The algorithms discussed above are then run on, and the sequence of iterates produced is recorded. Finally, the different quality measures are computed on those iterates. We repeat this process times with different random realizations, in order to increase the robustness of the conclusions. The averaged quality measures are displayed in Fig. 1. In order to compare different random initializations, the loss on left-out data is always shifted so that its plateau is at . For the finite sum setting, we can see that the proposed method shows a linear convergence rate (as seen on the gradient curve). FR-Newton, eventually exhibits a faster rate of convergence, but this happens after the generalization metrics (loss on left out data, Amari distance) reach their final values. We also emphasize that one epoch of FR-Newton is about three times costlier than one epoch of SGD and the proposed algorithm. Overall, the proposed methods are the fastest when looking at generalization metrics, both in the online and finite sum setting.
4.5 Experiment on real data
We apply ICA to an openly available EEG dataset delorme2012independent. The initial dimension of the dataset is and , and we reduce the dimension of the data to by PCA. The signals are then multiplied by a random matrix. The different algorithms are applied on this dataset with 10 different random initializations; for 50 epochs in the finite sum setting. Results are displayed in Fig. 2. The proposed methods perform best, even looking at the gradient norm.
4.6 Effect of the greedy update rule
We generate a synthetic dataset like in Sec. 4.4 of sources and samples. On this dataset, we run the incremental algorithm with the greedy coordinate update rule discussed in Sec. 3.4 with and . We compare it to a random approach (where random sources are updated at each iteration) for the same values of , and to the much costlier full-selection algorithm, where each source is updated for each sample. Results are displayed in Figure 3. The greedy approach only adds a negligible computational overhead linear in compared to the random approach, while leading to much faster estimation. In terms of generalization error, it is only slightly outperformed by the full selection approach ().
In this article, we have introduced a new majorization-minimization framework for ICA, and have shown that it is equivalent to an EM approach for Gaussian scale mixtures. This approach has the valuable advantage of guaranteeing a decrease of the surrogate loss-function. This enables stochastic methods with descent guarantees which is, to the best of our knowledge, an unique feature for an ICA algorithm. We have proposed an incremental and an online algorithm for the finite-sum and online problems, with the same complexity as SGD. Experiments show that they are competitive with the state-of-the-art, while being safer thanks to the descent guarantee.
We acknowledge support from the European Research Council (grants SEQUOIA 724063 and SLAB 676943).
Appendix A Proof of equivalence of EM
Given the Gaussian scale mixture formulation of , as , an EM algorithm would do the following:
where the are density functions, and with equality if and only if . With respect to , the upper bound in the last equation is of the form:
where . This is exactly the same dependence in as in . The E-step thus computes using a density . This exactly corresponds to the E-step described in our algorithm. The proof, from palmer2006variational, is as follows. Dropping the indices for readability, and denoting , normalizing gives:
Hence, using , we get
This demonstrates that, although not formulated in this fashion, the proposed method is indeed equivalent to an EM algorithm.
Appendix B The EM algorithm for noisy mixtures is stuck in the noise-free limit
We follow the update rules given in bermond1999approximate, girolami2001variational. The model is where . Key quantities for the update rule are the following expectations:
where is a diagonal matrix. In the case considered in the present article, is square and invertible and . Basic algebra shows that in that case, the above formula simplify to:
The EM update for based on samples : then is
which yields by using eq. (17). The EM algorithm is thus frozen in the case of no noise.
Appendix C Proof of guaranteed descent
Let us demonstrate that one iteration of the incremental algorithm 1 decreases . At the iteration , let be the current unmixing matrix, the state of the memory, and the ’s the current sufficient statistics. As said in Section 2.2, E-step, we have . Hence, the algorithm is in the state , and the corresponding loss is . After the E-step, the memory on the mini-batch is updated to minimize . Hence, the E-step diminishes . Then, each descent move in the M-step guarantees a decrease of . Both steps decrease , the incremental algorithm overall decreases the surrogate loss function.
Appendix D Fast M-step using conjugate gradient
The M-step (11) involves computing the -th row of the inverse of a given matrix . This amounts to finding such that . Exact solution can be found by Gauss-Jordan elimination with a complexity . However, expanding the expression of yields , where . If the rows of are independent (as it is expected at convergence of the algorithm), then the off-diagonal elements of almost cancel. Hence, the diagonal matrix is an excellent preconditioner for solving the previous equation. The equation is thus solved using a preconditioned conjugate gradient method, which build an excellent approximation of the solution in a fraction of the time taken to obtain the exact solution.