 # A Quasi-Newton algorithm on the orthogonal manifold for NMF with transform learning

Nonnegative matrix factorization (NMF) is a popular method for audio spectral unmixing. While NMF is traditionally applied to off-the-shelf time-frequency representations based on the short-time Fourier or Cosine transforms, the ability to learn transforms from raw data attracts increasing attention. However, this adds an important computational overhead. When assumed orthogonal (like the Fourier or Cosine transforms), learning the transform yields a non-convex optimization problem on the orthogonal matrix manifold. In this paper, we derive a quasi-Newton method on the manifold using sparse approximations of the Hessian. Experiments on synthetic and real audio data show that the proposed algorithm out-performs state-of-the-art first-order and coordinate-descent methods by orders of magnitude. A Python package for fast TL-NMF is released online at https://github.com/pierreablin/tlnmf.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Nonnegative matrix factorization (NMF) consists in decomposing a nonnegative data matrix into

 V≈WH (1)

where and are two nonnegative matrices referred to as dictionary and activation matrix, respectively. The rank of the factorization is generally chosen to be smaller than so that the approximation is low-rank. This method has been popularized by the seminal paper of Lee and Seung . In audio signal processing, is typically a magnitude or power spectrogram, where is the short-time Fourier or Cosine transform of some signal (the notation denotes element-wise operations throughout the paper). The short-time frequency transform is computed by applying an orthogonal frequency transform to the frames matrix which contains windowed segments of the original temporal signal in its columns. is the length of the window and is the resulting number of time frames. As such, we have . Factorizing as in (1) can lead to a meaningful decomposition where the dictionary captures spectral patterns and the activation matrix contains data decomposition coefficients. This decomposition can then be used to solve a variety of signal processing problems such as source separation [2, 3, 4] or music transcription [5, 6]. In the latter works, is computed with a given off-the-shelf short-time frequency transform. This sets a limit to the accuracy of the factorization. To adress this issue, transform-learning NMF (TL-NMF) was introduced in [7, 8]. It computes an optimal transform from the input signal: the transform is learned together with the latent factors and . TL-NMF has been employed successfully for source separation examples: it leads to better or comparable performance as compared with traditional fixed-transform NMF [7, 8, 9].

The contribution of this article is to propose a faster solver for TL-NMF. In , an orthogonal transform is learned using a projected gradient descent onto the orthogonal matrix manifold. In , a faster Jacobi approach (in which is searched as a product of Givens rotations) is proposed. In a different framework, 

optimizes a nonsingular transform (not constrained to be orthogonal) with majorization-minimization (MM). In all cases, the cost of TL-NMF remains prohibitively large compared to standard NMF. The estimation of the transform is the computational bottleneck of the algorithms, and takes orders of magnitude more time than standard NMF. The present work aims at reducing the gap in terms of execution time between TL-NMF and traditional NMF in the orthogonal transform setting (which gently relaxes Fourier or Cosine transforms while still imposing orthogonality). To that purpose, we propose a quasi-Newton method on the orthogonal manifold.

The article is organized as follows. Section 2 introduces the optimization problem behind TL-NMF and presents the standard MM updates used for and . Section 3 starts with a brief introduction to optimization on the orthogonal manifold, introduces previous work and presents the new quasi-Newton algorithm. Finally, Section 4 describes comparative experiments with synthetic and real data. Exploiting the reduced computational load, we highlight a previously unnoticed energy concentration phenomenon of the learned transform, and study the structure of the local minima of the objective function.

Notation.  Scalars are written in lower-case (e.g.,

), vectors in bold lower-case (e.g.,

) and matrices in bold upper-case (e.g.,

), while tensors are in calligraphic upper-case (e.g.,

). Entry of a matrix is denoted as or while entry of a tensor is denoted as

. The identity matrix of size

is denoted as . The element-wise operations between two matrices and are written and for the multiplication and division while and denote the element-wise exponentiation and modulus, respectively. The orthogonal matrix set is the set of matrices such that . The Frobenius scalar product is denoted as . Given a fourth-order tensor of size , the weighted Frobenius inner product is . The Itakura-Saito divergence is given by . Finally, is the Kronecker delta function of equal to if and otherwise.

## 2 NMF with transform learning

### 2.1 Objective function

TL-NMF consists in solving a NMF problem while learning a data-adapted transform . This is done by minimizing some measure of fit between the transformed data and the factorized expression where we here assume that is a real-valued orthogonal matrix (of size ). In addition, a penalty is added to promote sparsity of the activation coefficients. The TL-NMF problem thus writes:

 minΦ,W,HCλ(Φ,W,H)=DIS(|ΦY|∘2|WH)+λMK||H||1 s.t.W≥0,H≥0, ∀k,||wk||1=1,ΦΦ⊤=IM, (2)

where is the -th column of and is the Itakura-Saito (IS) divergence defined as . Note that any other measure of fit could be used with no loss of generality. However, the IS divergence has been proven to be particularly relevant for decomposing power spectrograms . The factor allows to the values of the measure of fit and of the penalty term to be of comparable orders of magnitude.

A straightforward estimation procedure to solve the problem (2.1) is to use alternate minimization. It is summarized in Algorithm 1. It alternates between two steps. In the NMF step, the current “spectrogram” is fixed and the algorithm decreases with respect to (w.r.t.) and . This is done using classical NMF MM update rules, described in the next section. In the transform-learning part, the factorization is fixed, and the algorithm decreases w.r.t. , using an optimization algorithm denoted as . This article aims to propose a new fast algorithm for the minimization of w.r.t. .

### 2.2 Majorization-minimization updates of W and H

We update and (step in Algorithm 1) with the standard multiplicative updates that can be derived from a majorization-minimization procedure . The sum-to-one constraint on the columns of (which is necessary to avoid degenerate solutions) can be rigorously enforced using a change of variable, like in [12, 13]. The updates read:

 H←H∘⎡⎢ ⎢⎣WT((WH)∘−2∘|ΦY|∘2)WT(WH)∘−1+λMK1K×N⎤⎥ ⎥⎦∘12,
 W←W∘⎡⎢ ⎢⎣((WH)∘−2∘|ΦY|∘2)HT(WH)∘−1HT+λMK1M×NHT⎤⎥ ⎥⎦∘12.

They should be followed by a joint normalization of the columns of and rows of [12, 13].

## 3 Quasi-Newton update of the transform Φ

### 3.1 Optimization on the orthogonal manifold

This section focuses on the minimization of with respect to . In the following, we define , and let . We may write:

 L(Φ)=M∑m=1N∑n=1f^vmn([ΦY]mn), (3)

where we define . The orthogonality constraint imposed to () implies that (3) should be minimized on the orthogonal matrix manifold . This manifold appears in many optimization problems and its geometry is well-studied . To derive an iterative algorithm that minimizes (3), we propose to parametrize the neighborhood of an iterate via the matrix exponential (following, e.g., ). We set:

 Φt+1=exp(E)Φt, (4)

where is an anti-symmetric matrix. If is orthogonal, this update enforces that remains orthogonal. It thus provides a natural framework for iterative optimization over the orthogonal manifold.

### 3.2 Previous methods

A projected gradient method is presented in . Iterates are of the form:

 Φ←Π((IM−ηG)Φ), (5)

where is the natural gradient  of , is a step-size, and is the projection to the manifold, given by . The main drawback is that, as a first order method, it is hard to have a proper step size policy, and the convergence is at most linear .

A variant was proposed in  where the transform was updated using Givens rotations as:

 Φ←Rpq(θ)Φ (6)

where is a unidirectional rotation matrix with axis and angle . This update rule results in an acceleration because the single-axis rotations are cheap to compute. However, finding the best angle given an axis was shown to involve a highly non-convex problem with the presence of many local minima. As such is selected by grid search which is not entirely satisfactory.

### 3.3 Derivatives of the objective function

In this section, the derivatives of with respect to the parametrization (4) are computed. The gradient is a matrix denoted as , and the Hessian is a tensor denoted as . They are obtained from the following second-order Taylor expansion:

 L(exp(E)Φ)=L(Φ)+⟨G|E⟩+12⟨E|H|E⟩+O(||E||3). (7)

Using , the gradient is given by

 Gij=N∑n=1f′^vin(xin)xjn=2N∑n=1(xin^vin−1xit)xjt (8)

and the Hessian is given by

 Hijkl=δikN∑n=1f′′^vin(xin)xjnxln+δjkGil. (9)

Newton’s method.  Newton method on the manifold would take , where is the projection onto the anti-symmetric matrices:

 ΠA(C)=C−C⊤2. (10)

Note that this projection is much cheaper to compute than . Newton’s method provides fast convergence, but is not practical for several reasons. First, it requires the computation of the Hessian. The complexity of this operation is . Besides, the cost of computing a gradient is . Thus, a gradient method can roughly perform iterations when Newton’s method performs one. Second, because the problem is non-convex, the Hessian should be regularized to enforce its positive-definiteness, thereby guaranteeing that is a descent direction. A standard regularization procedure consists in adding to the Hessian where and where

is the smallest eigenvalue of

. The Hessian is sparse, but its sparsity structure does not help in computing the key quantity . As such one we would have to compute the smallest eigenvalue of a matrix which is prohibitively expensive. Finally, solving the linear system using, e.g., Gaussian elimination has complexity , which is orders of magnitude higher than the computation of the gradient.

### 3.4 A fast algorithm based on Hessian approximation

To derive a practical quasi-Newton algorithm, one can observe that the Hessian of has two terms. The second term, , cancels when the algorithm is close to convergence, so we may ignore it. As an approximation of the first term, we impose that it cancels when , leading to the following Hessian approximation:

 ~Hijkl =δikδjlN∑n=1f′′^vin(xin)x2jn (11) =2δikδjlN∑n=1(1^vin+1x2in)x2jt. (12)

Our approximation provides an even sparser version of the true Hessian. Then, then proposed update for the transform reads:

 Φ←exp(−ηΠA(~H−1G))Φ (13)

where is a step size. The step size is chosen to verify the Wolfe conditions 

and is computed using the classical interpolation algorithm thoroughly described in

[17, pp. 59-60]. Informally, Wolfe conditions guarantee that the objective function is sufficiently decreased by the step size, and that the projected gradient in the search direction is also decreased. These conditions are critical to obtain convergence of quasi-Newton methods, and in practice help in achieving fast convergence.

Denote by the matrix with coefficients , so that . Our quasi-Newton’s method solves all the aforementioned problems of Newton’s method. The approximated Hessian is

• cheap: computing has the same complexity as computing a gradient, i.e., .

• positive definite: the approximation boils down to a diagonal operator, i.e., . Hence, its eigenvalues are the coefficients , which are all nonnegative. As such, our method does not require Hessian regularization.

• easy to invert: because it boils down to a diagonal operator, we have . Inversion is , which is negligible compared to the cost of computing the gradient.

The resulting optimization procedure is described in Algorithm 2.

### 3.5 Relation to independent component analysis (ICA)

This objective function (3) is reminiscent of maximum-likelihood ICA where the maximum-likelihood objective is given by :

 L(Φ)=−Nlog|det(Φ)|+M∑m=1N∑n=1f([ΦY]mn), (14)

where is a pre-specified function. Under the orthogonal constraint, becomes constant. As such, the ICA objective function shares the same dependency in with TL-NMF and the algorithm proposed in this paper is inspired by the ICA acceleration techniques proposed in .

## 4 Experiments

The following experiments are run on a single core of a laptop equipped with an Intel Core i7-6600U @ 2.6 GHz processor and 16 GB of RAM. The Python code is available online.

### 4.1 Synthetic data

We first focus on the sole optimization of , and not on the full TL-NMF procedure. For this experiment, we generate random normal matrices of size , for and , and a random transform . We set , so that the minimum of is . Algorithms start from an orthogonal initialization in the vicinity of . More precisely, we set where . Fig. 1 shows the convergence curve of the proposed method, projected gradient  and Jacobi search . The proposed quasi-Newton approach leads to a drastic improvement in speed of convergence.

### 4.2 Real data

Experimental setup. We consider a seconds-long excerpt from My Heart (Will Always Lead Me Back To You) by Louis Armstrong and His Hot Five. The sampling rate is Hz. Using a ms-long analysis windows () with 50% overlap between two frames, we obtain . The rank of the decomposition is fixed to , which is known empirically to provide a satisfactory decomposition with traditional NMF .

Comparison of the algorithms performance. In a first experiment, we first run traditional IS-NMF on the DCT spectrogram of the input signal and store . Then the three transform learning algorithms are run with fixed and from a random starting point for . This provides a realistic setting to compare their performance in optimizing . Full TL-NMF (with free and ) are computed in a second experiment, using (same) random starting points. The three different transform learning algorithms are run with . Results for the two experiments are shown in Fig. 2 and illustrate the superiority of the proposed quasi-Newton algorithm. Figure 2: Convergence curves with real data. Left: minimization of L(Φ) only. Right: full TL-NMF optimization.

We now discuss some features of the transform learned with (full) TL-NMF using the quasi-Newton algorithm. We will refer to the rows of as atoms (real-valued vectors of size ). The learned atoms are not shown here due to space limitation but are similar to those obtained in [7, 8].

Energy concentration. The contributed energy of a single atom is defined as . Fig. 2(a) shows the cumulative distribution of the energies for three different transforms: estimated by TL-NMF, the DCT, and a random orthogonal matrix. As expected, DCT concentrates the energy while a random orthogonal transform barely does so. The energy concentration phenomenon is accentuated by transform learning. This behavior was observed with other music datasets as well.

Reliability of the learned transform. The problem solved by TL-NMF is non-convex, hence different initializations can lead to different local minima. We investigate the structure of the local minima returned by the proposed quasi-Newton algorithm using a technique similar to ICASSO in ICA . It appears that a subset of atoms are reliably returned by the algorithm, regardless of initialization. To observe this behavior, we consider two transforms obtained from two random initializations. We select the 50 most-contributing atoms based of the values of , yielding two matrices and of size . We compute the correlation matrix of size and find a permutation matrix such that is as block-diagonal as possible. The absolute value of the resulting matrix is displayed in Fig. 2(b). It is well structured and shows in particular that the first atoms (top left) are the same. Furthermore, some pairwise couplings are also uncovered. The diagonal blocks in Fig. 2(b) correspond to sets of atoms such that .

## 5 Conclusion

We introduced a quasi-Newton method on the orthogonal manifold to solve the TL-NMF problem. It relies on a sparse approximation of the Hessian. The proposed method outperforms the state-of-the-art methods by orders of magnitude. On the laptop used for the experiments, the whole estimation took about 10 minutes for a 2-minutes signal, while NMF without transform learning takes roughly 2 minutes. This work is thus a step towards making TL-NMF a practical tool for music signal processing. The shortened time of estimation also helps investigate properties of the learned transform without prohibitive computational burden. Results on the concentration of energy obtained by TL-NMF suggest that an algorithm that only learns a few atoms instead of would not result in too much loss of information. Such an algorithm would further reduce the computational cost of TL-NMF, since the number of parameters would plummet. We intend to study this matter in a future work.

## References

•  Daniel D. Lee and H. Sebastian Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
•  Paris Smaragdis, Cédric Févotte, Gautham J Mysore, Nasser Mohammadiha, and Matthew Hoffman, “Static and dynamic source separation using nonnegative factorizations: A unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66–75, 2014.
•  Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
•  Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation, John Wiley & Sons, 2009.
•  Paris Smaragdis and Judith C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.
•  Emmanuel Vincent, Nancy Bertin, and Roland Badeau, “Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 109–112.
•  Dylan Fagot, Herwig Wendt, and Cédric Févotte, “Nonnegative matrix factorization with transform learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2431–2435.
•  Herwig Wendt, Dylan Fagot, and Cédric Févotte, “Jacobi algorithm for nonnegative matrix factorization with transform learning,” in Proc. European Signal Processing Conference (EUSIPCO), 2018.
•  Kazuyoshi Yoshii, Koichi Kitamura, Yoshiaki Bando, Eita Nakamura, and Tatsuya Kawahara, “Independent low-rank tensor analysis for audio source separation,” in Proc. European Signal Processing Conference (EUSIPCO), 2018.
•  Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu, “Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
•  Cédric Févotte and Jérôme Idier, “Algorithms for nonnegative matrix factorization with the -divergence,” Neural computation, vol. 23, no. 9, pp. 2421–2456, 2011.
•  Augustin Lefevre, Francis Bach, and Cédric Févotte, “Itakura-saito nonnegative matrix factorization with group sparsity,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 21–24.
•  Slim Essid and Cédric Févotte, “Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 415–425, 2013.
•  Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009.
•  Pierre Ablin, Jean-François Cardoso, and Alexandre Gramfort, “Faster ICA under orthogonal constraint,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
•  Shun-Ichi Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
•  Jorge Nocedal and Stephen J Wright, Numerical Optimization, Springer, 1999.
•  Philip Wolfe, “Convergence conditions for ascent methods,” SIAM review, vol. 11, no. 2, pp. 226–235, 1969.
•  Dinh Tuan Pham and Philippe Garat, “Blind separation of mixture of independent sources through a quasi-maximum likelihood approach,” IEEE Trans. on Signal Processing, vol. 45, no. 7, pp. 1712–1725, 1997.
•  Johan Himberg, Aapo Hyvärinen, and Fabrizio Esposito, “Validating the independent components of neuroimaging time series via clustering and visualization,” Neuroimage, vol. 22, no. 3, pp. 1214–1222, 2004.