# Beyond Pham's algorithm for joint diagonalization

The approximate joint diagonalization of a set of matrices consists in finding a basis in which these matrices are as diagonal as possible. This problem naturally appears in several statistical learning tasks such as blind signal separation. We consider the diagonalization criterion studied in a seminal paper by Pham (2001), and propose a new quasi-Newton method for its optimization. Through numerical experiments on simulated and real datasets, we show that the proposed method outper-forms Pham's algorithm. An open source Python package is released.

## Authors

• 8 publications
• 5 publications
• 43 publications
• ### Spatial Blind Source Separation

Recently a blind source separation model was suggested for multivariate ...

12/21/2018 ∙ by François Bachoc, et al. ∙ 0

• ### Adaptive Sampling Quasi-Newton Methods for Derivative-Free Stochastic Optimization

We consider stochastic zero-order optimization problems, which arise in ...

10/29/2019 ∙ by Raghu Bollapragada, et al. ∙ 0

• ### Estimation of large block covariance matrices: Application to the analysis of gene expression data

Motivated by an application in molecular biology, we propose a novel, ef...

06/26/2018 ∙ by Marie Perrot-Dockès, et al. ∙ 0

• ### Generalized Non-orthogonal Joint Diagonalization with LU Decomposition and Successive Rotations

Non-orthogonal joint diagonalization (NJD) free of prewhitening has been...

12/03/2013 ∙ by Xiao-Feng Gong, et al. ∙ 0

• ### Approximate Joint Matrix Triangularization

We consider the problem of approximate joint triangularization of a set ...

07/02/2016 ∙ by Nicolo Colombo, et al. ∙ 0

Recurrent Neural Networks (RNNs) are powerful models that achieve except...

11/04/2015 ∙ by Nitish Shirish Keskar, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The task of joint diagonalization arises in several formulations of the blind source separation problem. In [cardoso1993blind]

, independent component analysis is performed by joint-diagonalization of a set of cumulant matrices. In

[belouchrani1997blind], the separation of non-stationnary signals is carried by joint-diagonalization of a set of autocorrelation matrices. Finally, in [pham2001blind], joint-diagonalization of a set of covariance matrices separates Gaussian sources that have non-stationnary power.

Consider a set of symmetric square matrices of size . Its approximate joint diagonalization consists in finding a matrix such that the matrix set contains matrices that are as diagonal as possible, as measured by some joint-diagonality criterion. This paper considers the joint diagonalization of positive matrices, defined as the minimization of the (non-convex) criterion

 L(B)=12nn∑i=1[logdetdiag(BCiB⊤)−logdet(BCiB⊤)], (1)

This criterion was introduced by Pham in [pham2001blind] who derived it as the negative log-likelihood of a source separation model for Gaussian stationary sources.

Pham [pham2001joint] proposes in its seminal work a block coordinate descent approach for the minimization of . Each iteration of this method guarantees a decrease of . Further, when there exists a matrix such that contains only diagonal matrices (that is, if the set is exactly jointly diagonalizable), then in the vicinity of , the algorithm converges quadratically. Since then, only a few other papers have focused on minimizing this criterion. In [joho2008newton], the Newton method is studied. However, this method is not practical as it requires solving a linear system at each iteration. In [todros2007fast], it is proposed to minimize an approximation of the cost function. The authors do so by alternating minimization over columns. This method minimizes an approximation of , and scales in .

In this paper, we propose a quasi-Newton method for the minimization of . We use sparse approximations of the Hessian which are cheap to compute and match the true Hessian when the set is jointly diagonalized, granting quadratic convergence. The algorithm as a cost per iteration of , which is the natural scaling of the problem since it is the size of the dataset. Through experiments, we show that the proposed method outperforms Pham’s algorithm, on both synthetic and real data.

Notation

: The identity matrix of size

is denoted . The Frobenius scalar product between two matrices is noted as . The corresponding norm is . Given a tensor , the weighted scalar product is . The Kronecker symbol is is , otherwise.

## 2 Study of the cost function

The cost function is defined on the group of invertible matrices, with acting as a barrier. Its minimization is performed by iterative algorithms. To exploit the multiplicative structure of the group of invertible matrices, we perform relative updates on  [cardoso1996equivariant]. The neighborhood of an iterate is parameterized by a small matrix as . The second-order (in

) Taylor expansion of the loss function

 L((Ip+E)B)=L(B)+⟨G|E⟩+12⟨E|H|E⟩+o(||E||2). (2)

where the matrix is the relative gradient and the tensor is the relative Hessian. Simple algebra yields:

where we define for . Eq. (3) shows that , consistent with the scale-indeterminacy of the criterion: for any diagonal matrix . The criterion is flat in the direction of the scale matrices.

Complexity: The cost of computing a gradient is . It is the natural complexity of an iterative algorithm as it is the same cost as computing the set . Computing the Hessian is , and computing is . This is prohibitively costly when is large compared to a gradient evaluation. As a consequence, Newton’s method is extremely costly to setup.

## 3 Relative Quasi-Newton method

In this section, we first introduce an approximation of the Hessian and then derive a quasi-Newton algorithm to minimize (1).

### 3.1 Hessian approximation

The accelerated algorithm presented in this article is based on the massive sparsification of the Hessian tensor when matrices are all diagonal. Indeed, it that case, it becomes:

This Hessian approximation has three key properties. First, it is cheap to compute, at cost , just like a gradient. Then, it is sparse and structured. It only has non-zero coefficients, and can be seen as a block-diagonal operator, with blocks of size . Indeed, for a matrix , we have:

 ([~HM]abba)=(Γab1−2δab1−2δabΓba)(MabMba). (5)

The following lemma establishes the positivity of the approximate Hessian:

###### Lemma 1

The approximation is positive with

zero eigenvalues. If the matrices

are independently sampled from continuous densities, with probability one, the other

eigenvalues are strictly positive.

Proof: Using eq. (4), one has , where is the matrix with for its coefficient, and elsewhere. Thus has

zero eigenvalues, and the associated eigenvectors are the

for . The other eigenvalues are the eigenvalues of the blocks introduced in eq. (5). The diagonal coefficients of the blocks are the . The determinant of a block is given by . Cauchy-Schwartz inequality yields , with equality if and only if . This happens with probability , concluding the proof.

Finally, the approximation is straightforwardly inverted by inverting each blocks. The Moore-Penrose pseudoinverse of , , satisfies:

 [~H+G]ab=ΓbaGab−GbaΓabΓba−1,∀a≠b, (6)

and . The cost of inversion is thus .

### 3.2 Algorithm

The proposed quasi-Newton method uses as search direction. Following from the positivity of , this is a descent direction. Unfortunately, there is no guarantee that the iteration decreases the cost function. Therefore, we resort to a line-search to find a step ensuring (condition ). This is done using backtracking, starting from and iterating until the condition is met. The full algorithm is summarized in Algorithm. 1.

Quadratic convergence: Like Pham’s algorithm, the proposed algorithm enjoys quadratic convergence when the matrix set is jointly diagonalizable. Indeed, assume that there exists a matrix such that contains only diagonal matrices. Then, by construction, . It follows that the method converges quadratically in the vicinity of .

## 4 Experiments

### 4.1 Setting

For the experiments, three data sets are used – coming either from synthetic or real data – and we compare our approach to Pham’s algorithm. The code to reproduce the experiments is available online. We set and .

Initialization: For a dataset , the algorithms start from a whitener (whitening matrix): writing with orthogonal and diagonal, the initial matrix is taken as .

Metrics: To compare the speed of convergence of the algorithms, we monitor the diagonalization error , and the gradient norm . The first quantity goes to if the dataset is perfectly diagonalizable, while the second should always converge to since the algorithm should reach a local minimum.

Synthetic data: We proceed as in [ziehe2004fast] for generating synthetic datasets. We generate diagonal matrices of size ,

for which each diagonal coefficient is drawn from an uniform distribution in

. Then, we generate a random ‘mixing’ matrix

with independent normally distributed entries. Finally, in order to add noise, we generate

matrices of normally distributed entries. The dataset is then with: