Beyond Pham's algorithm for joint diagonalization

11/28/2018 ∙ by Pierre Ablin, et al. ∙ 28

The approximate joint diagonalization of a set of matrices consists in finding a basis in which these matrices are as diagonal as possible. This problem naturally appears in several statistical learning tasks such as blind signal separation. We consider the diagonalization criterion studied in a seminal paper by Pham (2001), and propose a new quasi-Newton method for its optimization. Through numerical experiments on simulated and real datasets, we show that the proposed method outper-forms Pham's algorithm. An open source Python package is released.



page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The task of joint diagonalization arises in several formulations of the blind source separation problem. In [cardoso1993blind]

, independent component analysis is performed by joint-diagonalization of a set of cumulant matrices. In 

[belouchrani1997blind], the separation of non-stationnary signals is carried by joint-diagonalization of a set of autocorrelation matrices. Finally, in [pham2001blind], joint-diagonalization of a set of covariance matrices separates Gaussian sources that have non-stationnary power.

Consider a set of symmetric square matrices of size . Its approximate joint diagonalization consists in finding a matrix such that the matrix set contains matrices that are as diagonal as possible, as measured by some joint-diagonality criterion. This paper considers the joint diagonalization of positive matrices, defined as the minimization of the (non-convex) criterion


This criterion was introduced by Pham in [pham2001blind] who derived it as the negative log-likelihood of a source separation model for Gaussian stationary sources.

Pham [pham2001joint] proposes in its seminal work a block coordinate descent approach for the minimization of . Each iteration of this method guarantees a decrease of . Further, when there exists a matrix such that contains only diagonal matrices (that is, if the set is exactly jointly diagonalizable), then in the vicinity of , the algorithm converges quadratically. Since then, only a few other papers have focused on minimizing this criterion. In [joho2008newton], the Newton method is studied. However, this method is not practical as it requires solving a linear system at each iteration. In [todros2007fast], it is proposed to minimize an approximation of the cost function. The authors do so by alternating minimization over columns. This method minimizes an approximation of , and scales in .

In this paper, we propose a quasi-Newton method for the minimization of . We use sparse approximations of the Hessian which are cheap to compute and match the true Hessian when the set is jointly diagonalized, granting quadratic convergence. The algorithm as a cost per iteration of , which is the natural scaling of the problem since it is the size of the dataset. Through experiments, we show that the proposed method outperforms Pham’s algorithm, on both synthetic and real data.


: The identity matrix of size

is denoted . The Frobenius scalar product between two matrices is noted as . The corresponding norm is . Given a tensor , the weighted scalar product is . The Kronecker symbol is is , otherwise.

2 Study of the cost function

The cost function is defined on the group of invertible matrices, with acting as a barrier. Its minimization is performed by iterative algorithms. To exploit the multiplicative structure of the group of invertible matrices, we perform relative updates on  [cardoso1996equivariant]. The neighborhood of an iterate is parameterized by a small matrix as . The second-order (in

) Taylor expansion of the loss function


where the matrix is the relative gradient and the tensor is the relative Hessian. Simple algebra yields:


where we define for . Eq. (3) shows that , consistent with the scale-indeterminacy of the criterion: for any diagonal matrix . The criterion is flat in the direction of the scale matrices.

Complexity: The cost of computing a gradient is . It is the natural complexity of an iterative algorithm as it is the same cost as computing the set . Computing the Hessian is , and computing is . This is prohibitively costly when is large compared to a gradient evaluation. As a consequence, Newton’s method is extremely costly to setup.

3 Relative Quasi-Newton method

In this section, we first introduce an approximation of the Hessian and then derive a quasi-Newton algorithm to minimize (1).

3.1 Hessian approximation

The accelerated algorithm presented in this article is based on the massive sparsification of the Hessian tensor when matrices are all diagonal. Indeed, it that case, it becomes:


This Hessian approximation has three key properties. First, it is cheap to compute, at cost , just like a gradient. Then, it is sparse and structured. It only has non-zero coefficients, and can be seen as a block-diagonal operator, with blocks of size . Indeed, for a matrix , we have:


The following lemma establishes the positivity of the approximate Hessian:

Lemma 1

The approximation is positive with

zero eigenvalues. If the matrices

are independently sampled from continuous densities, with probability one, the other

eigenvalues are strictly positive.

Proof: Using eq. (4), one has , where is the matrix with for its coefficient, and elsewhere. Thus has

zero eigenvalues, and the associated eigenvectors are the

for . The other eigenvalues are the eigenvalues of the blocks introduced in eq. (5). The diagonal coefficients of the blocks are the . The determinant of a block is given by . Cauchy-Schwartz inequality yields , with equality if and only if . This happens with probability , concluding the proof.

Finally, the approximation is straightforwardly inverted by inverting each blocks. The Moore-Penrose pseudoinverse of , , satisfies:


and . The cost of inversion is thus .

3.2 Algorithm

Input : Set of matrices , number of iterations .
Initialize .
for  do
       Compute the gradient using eq.(3)
       Compute the Hessian approximation using eq.4
       Compute the search direction using eq. (6)
       Do a backtracking line search in that direction to find a step size decreasing
end for
Output : 
Algorithm 1 Quasi-Newton method for joint-diagonalization

The proposed quasi-Newton method uses as search direction. Following from the positivity of , this is a descent direction. Unfortunately, there is no guarantee that the iteration decreases the cost function. Therefore, we resort to a line-search to find a step ensuring (condition ). This is done using backtracking, starting from and iterating until the condition is met. The full algorithm is summarized in Algorithm. 1.

Quadratic convergence: Like Pham’s algorithm, the proposed algorithm enjoys quadratic convergence when the matrix set is jointly diagonalizable. Indeed, assume that there exists a matrix such that contains only diagonal matrices. Then, by construction, . It follows that the method converges quadratically in the vicinity of .

4 Experiments

4.1 Setting

For the experiments, three data sets are used – coming either from synthetic or real data – and we compare our approach to Pham’s algorithm. The code to reproduce the experiments is available online111 We set and .

Initialization: For a dataset , the algorithms start from a whitener (whitening matrix): writing with orthogonal and diagonal, the initial matrix is taken as .

Metrics: To compare the speed of convergence of the algorithms, we monitor the diagonalization error , and the gradient norm . The first quantity goes to if the dataset is perfectly diagonalizable, while the second should always converge to since the algorithm should reach a local minimum.

Synthetic data: We proceed as in [ziehe2004fast] for generating synthetic datasets. We generate diagonal matrices of size ,

for which each diagonal coefficient is drawn from an uniform distribution in

. Then, we generate a random ‘mixing’ matrix

with independent normally distributed entries. Finally, in order to add noise, we generate

matrices of normally distributed entries. The dataset is then with:


where controls the noise level. In practice we take (experiment (a), perfecty jointly-diagonalizable set) or (experiment (b)).

Magnetoencephalography (MEG) data: We use the MNE sample dataset [gramfort2014mne]. From matrices containing signals of samples, , corresponding to time segments of MEG signals, we jointly diagonalize the set of covariance matrices (experiment (c)). This practical task is in the spirit of [pham2001blind]. Results are displayed in Figure 1

Figure 1: Comparison of the algorithms on three datasets. (a): Jointly diagonalizable synthetic dataset. (b): Same dataset with added noise, rendering perfect joint diagonalization impossible.(c): Real data, covariance matrices from MEG signals. Note the log-log scale.

4.2 Discussion

In experiment (a), where the dataset is exactly jointly diagonalizable, we observe the expected quadratic rate of convergence for both the proposed algorithm and Pham’s method. We also observe that breaking the model (experiments (b) and (c)) makes convergence fall back to a linear rate. As expected, the cost function does not go to in those cases. We observe that the proposed quasi-Newton algorithm outperforms Pham’s method by about an order of magnitude on each experiment.