The task of joint diagonalization arises in several formulations of the blind source separation problem. In [cardoso1993blind]
, independent component analysis is performed by joint-diagonalization of a set of cumulant matrices. In[belouchrani1997blind], the separation of non-stationnary signals is carried by joint-diagonalization of a set of autocorrelation matrices. Finally, in [pham2001blind], joint-diagonalization of a set of covariance matrices separates Gaussian sources that have non-stationnary power.
Consider a set of symmetric square matrices of size . Its approximate joint diagonalization consists in finding a matrix such that the matrix set contains matrices that are as diagonal as possible, as measured by some joint-diagonality criterion. This paper considers the joint diagonalization of positive matrices, defined as the minimization of the (non-convex) criterion
This criterion was introduced by Pham in [pham2001blind] who derived it as the negative log-likelihood of a source separation model for Gaussian stationary sources.
Pham [pham2001joint] proposes in its seminal work a block coordinate descent approach for the minimization of . Each iteration of this method guarantees a decrease of . Further, when there exists a matrix such that contains only diagonal matrices (that is, if the set is exactly jointly diagonalizable), then in the vicinity of , the algorithm converges quadratically. Since then, only a few other papers have focused on minimizing this criterion. In [joho2008newton], the Newton method is studied. However, this method is not practical as it requires solving a linear system at each iteration. In [todros2007fast], it is proposed to minimize an approximation of the cost function. The authors do so by alternating minimization over columns. This method minimizes an approximation of , and scales in .
In this paper, we propose a quasi-Newton method for the minimization of . We use sparse approximations of the Hessian which are cheap to compute and match the true Hessian when the set is jointly diagonalized, granting quadratic convergence. The algorithm as a cost per iteration of , which is the natural scaling of the problem since it is the size of the dataset. Through experiments, we show that the proposed method outperforms Pham’s algorithm, on both synthetic and real data.
2 Study of the cost function
The cost function is defined on the group of invertible matrices, with acting as a barrier. Its minimization is performed by iterative algorithms. To exploit the multiplicative structure of the group of invertible matrices, we perform relative updates on [cardoso1996equivariant]. The neighborhood of an iterate is parameterized by a small matrix as . The second-order (in
) Taylor expansion of the loss function
where the matrix is the relative gradient and the tensor is the relative Hessian. Simple algebra yields:
where we define for . Eq. (3) shows that , consistent with the scale-indeterminacy of the criterion: for any diagonal matrix . The criterion is flat in the direction of the scale matrices.
Complexity: The cost of computing a gradient is . It is the natural complexity of an iterative algorithm as it is the same cost as computing the set . Computing the Hessian is , and computing is . This is prohibitively costly when is large compared to a gradient evaluation. As a consequence, Newton’s method is extremely costly to setup.
3 Relative Quasi-Newton method
In this section, we first introduce an approximation of the Hessian and then derive a quasi-Newton algorithm to minimize (1).
3.1 Hessian approximation
The accelerated algorithm presented in this article is based on the massive sparsification of the Hessian tensor when matrices are all diagonal. Indeed, it that case, it becomes:
This Hessian approximation has three key properties. First, it is cheap to compute, at cost , just like a gradient. Then, it is sparse and structured. It only has non-zero coefficients, and can be seen as a block-diagonal operator, with blocks of size . Indeed, for a matrix , we have:
The following lemma establishes the positivity of the approximate Hessian:
Proof: Using eq. (4), one has , where is the matrix with for its coefficient, and elsewhere. Thus has
zero eigenvalues, and the associated eigenvectors are thefor . The other eigenvalues are the eigenvalues of the blocks introduced in eq. (5). The diagonal coefficients of the blocks are the . The determinant of a block is given by . Cauchy-Schwartz inequality yields , with equality if and only if . This happens with probability , concluding the proof.
Finally, the approximation is straightforwardly inverted by inverting each blocks. The Moore-Penrose pseudoinverse of , , satisfies:
and . The cost of inversion is thus .
The proposed quasi-Newton method uses as search direction. Following from the positivity of , this is a descent direction. Unfortunately, there is no guarantee that the iteration decreases the cost function. Therefore, we resort to a line-search to find a step ensuring (condition ). This is done using backtracking, starting from and iterating until the condition is met. The full algorithm is summarized in Algorithm. 1.
Quadratic convergence: Like Pham’s algorithm, the proposed algorithm enjoys quadratic convergence when the matrix set is jointly diagonalizable. Indeed, assume that there exists a matrix such that contains only diagonal matrices. Then, by construction, . It follows that the method converges quadratically in the vicinity of .
For the experiments, three data sets are used – coming either from synthetic or real data – and we compare our approach to Pham’s algorithm. The code to reproduce the experiments is available online111https://github.com/pierreablin/qndiag. We set and .
Initialization: For a dataset , the algorithms start from a whitener (whitening matrix): writing with orthogonal and diagonal, the initial matrix is taken as .
Metrics: To compare the speed of convergence of the algorithms, we monitor the diagonalization error , and the gradient norm . The first quantity goes to if the dataset is perfectly diagonalizable, while the second should always converge to since the algorithm should reach a local minimum.
Synthetic data: We proceed as in [ziehe2004fast] for generating synthetic datasets. We generate diagonal matrices of size ,
for which each diagonal coefficient is drawn from an uniform distribution in. Then, we generate a random ‘mixing’ matrix
with independent normally distributed entries. Finally, in order to add noise, we generatematrices of normally distributed entries. The dataset is then with:
where controls the noise level. In practice we take (experiment (a), perfecty jointly-diagonalizable set) or (experiment (b)).
Magnetoencephalography (MEG) data: We use the MNE sample dataset [gramfort2014mne]. From matrices containing signals of samples, , corresponding to time segments of MEG signals, we jointly diagonalize the set of covariance matrices (experiment (c)). This practical task is in the spirit of [pham2001blind]. Results are displayed in Figure 1
In experiment (a), where the dataset is exactly jointly diagonalizable, we observe the expected quadratic rate of convergence for both the proposed algorithm and Pham’s method. We also observe that breaking the model (experiments (b) and (c)) makes convergence fall back to a linear rate. As expected, the cost function does not go to in those cases. We observe that the proposed quasi-Newton algorithm outperforms Pham’s method by about an order of magnitude on each experiment.