1 Introduction
Nonnegative matrix factorization (NMF) consists in decomposing a nonnegative data matrix into
(1) 
where and are two nonnegative matrices referred to as dictionary and activation matrix, respectively. The rank of the factorization is generally chosen to be smaller than so that the approximation is lowrank. This method has been popularized by the seminal paper of Lee and Seung [1]. In audio signal processing, is typically a magnitude or power spectrogram, where is the shorttime Fourier or Cosine transform of some signal (the notation denotes elementwise operations throughout the paper). The shorttime frequency transform is computed by applying an orthogonal frequency transform to the frames matrix which contains windowed segments of the original temporal signal in its columns. is the length of the window and is the resulting number of time frames. As such, we have . Factorizing as in (1) can lead to a meaningful decomposition where the dictionary captures spectral patterns and the activation matrix contains data decomposition coefficients. This decomposition can then be used to solve a variety of signal processing problems such as source separation [2, 3, 4] or music transcription [5, 6]. In the latter works, is computed with a given offtheshelf shorttime frequency transform. This sets a limit to the accuracy of the factorization. To adress this issue, transformlearning NMF (TLNMF) was introduced in [7, 8]. It computes an optimal transform from the input signal: the transform is learned together with the latent factors and . TLNMF has been employed successfully for source separation examples: it leads to better or comparable performance as compared with traditional fixedtransform NMF [7, 8, 9].
The contribution of this article is to propose a faster solver for TLNMF. In [7], an orthogonal transform is learned using a projected gradient descent onto the orthogonal matrix manifold. In [8], a faster Jacobi approach (in which is searched as a product of Givens rotations) is proposed. In a different framework, [9]
optimizes a nonsingular transform (not constrained to be orthogonal) with majorizationminimization (MM). In all cases, the cost of TLNMF remains prohibitively large compared to standard NMF. The estimation of the transform is the computational bottleneck of the algorithms, and takes orders of magnitude more time than standard NMF. The present work aims at reducing the gap in terms of execution time between TLNMF and traditional NMF in the orthogonal transform setting (which gently relaxes Fourier or Cosine transforms while still imposing orthogonality). To that purpose, we propose a quasiNewton method on the orthogonal manifold.
The article is organized as follows. Section 2 introduces the optimization problem behind TLNMF and presents the standard MM updates used for and . Section 3 starts with a brief introduction to optimization on the orthogonal manifold, introduces previous work and presents the new quasiNewton algorithm. Finally, Section 4 describes comparative experiments with synthetic and real data. Exploiting the reduced computational load, we highlight a previously unnoticed energy concentration phenomenon of the learned transform, and study the structure of the local minima of the objective function.
Notation. Scalars are written in lowercase (e.g.,
), vectors in bold lowercase (e.g.,
) and matrices in bold uppercase (e.g.,), while tensors are in calligraphic uppercase (e.g.,
). Entry of a matrix is denoted as or while entry of a tensor is denoted as. The identity matrix of size
is denoted as . The elementwise operations between two matrices and are written and for the multiplication and division while and denote the elementwise exponentiation and modulus, respectively. The orthogonal matrix set is the set of matrices such that . The Frobenius scalar product is denoted as . Given a fourthorder tensor of size , the weighted Frobenius inner product is . The ItakuraSaito divergence is given by . Finally, is the Kronecker delta function of equal to if and otherwise.2 NMF with transform learning
2.1 Objective function
TLNMF consists in solving a NMF problem while learning a dataadapted transform [7]. This is done by minimizing some measure of fit between the transformed data and the factorized expression where we here assume that is a realvalued orthogonal matrix (of size ). In addition, a penalty is added to promote sparsity of the activation coefficients. The TLNMF problem thus writes:
(2) 
where is the th column of and is the ItakuraSaito (IS) divergence defined as . Note that any other measure of fit could be used with no loss of generality. However, the IS divergence has been proven to be particularly relevant for decomposing power spectrograms [10]. The factor allows to the values of the measure of fit and of the penalty term to be of comparable orders of magnitude.
A straightforward estimation procedure to solve the problem (2.1) is to use alternate minimization. It is summarized in Algorithm 1. It alternates between two steps. In the NMF step, the current “spectrogram” is fixed and the algorithm decreases with respect to (w.r.t.) and . This is done using classical NMF MM update rules, described in the next section. In the transformlearning part, the factorization is fixed, and the algorithm decreases w.r.t. , using an optimization algorithm denoted as . This article aims to propose a new fast algorithm for the minimization of w.r.t. .
2.2 Majorizationminimization updates of and
We update and (step in Algorithm 1) with the standard multiplicative updates that can be derived from a majorizationminimization procedure [11]. The sumtoone constraint on the columns of (which is necessary to avoid degenerate solutions) can be rigorously enforced using a change of variable, like in [12, 13]. The updates read:
They should be followed by a joint normalization of the columns of and rows of [12, 13].
3 QuasiNewton update of the transform
3.1 Optimization on the orthogonal manifold
This section focuses on the minimization of with respect to . In the following, we define , and let . We may write:
(3) 
where we define . The orthogonality constraint imposed to () implies that (3) should be minimized on the orthogonal matrix manifold . This manifold appears in many optimization problems and its geometry is wellstudied [14]. To derive an iterative algorithm that minimizes (3), we propose to parametrize the neighborhood of an iterate via the matrix exponential (following, e.g., [15]). We set:
(4) 
where is an antisymmetric matrix. If is orthogonal, this update enforces that remains orthogonal. It thus provides a natural framework for iterative optimization over the orthogonal manifold.
3.2 Previous methods
A projected gradient method is presented in [7]. Iterates are of the form:
(5) 
where is the natural gradient [16] of , is a stepsize, and is the projection to the manifold, given by . The main drawback is that, as a first order method, it is hard to have a proper step size policy, and the convergence is at most linear [17].
A variant was proposed in [8] where the transform was updated using Givens rotations as:
(6) 
where is a unidirectional rotation matrix with axis and angle . This update rule results in an acceleration because the singleaxis rotations are cheap to compute. However, finding the best angle given an axis was shown to involve a highly nonconvex problem with the presence of many local minima. As such is selected by grid search which is not entirely satisfactory.
3.3 Derivatives of the objective function
In this section, the derivatives of with respect to the parametrization (4) are computed. The gradient is a matrix denoted as , and the Hessian is a tensor denoted as . They are obtained from the following secondorder Taylor expansion:
(7) 
Using , the gradient is given by
(8) 
and the Hessian is given by
(9) 
Newton’s method. Newton method on the manifold would take , where is the projection onto the antisymmetric matrices:
(10) 
Note that this projection is much cheaper to compute than . Newton’s method provides fast convergence, but is not practical for several reasons. First, it requires the computation of the Hessian. The complexity of this operation is . Besides, the cost of computing a gradient is . Thus, a gradient method can roughly perform iterations when Newton’s method performs one. Second, because the problem is nonconvex, the Hessian should be regularized to enforce its positivedefiniteness, thereby guaranteeing that is a descent direction. A standard regularization procedure consists in adding to the Hessian where and where
is the smallest eigenvalue of
. The Hessian is sparse, but its sparsity structure does not help in computing the key quantity . As such one we would have to compute the smallest eigenvalue of a matrix which is prohibitively expensive. Finally, solving the linear system using, e.g., Gaussian elimination has complexity , which is orders of magnitude higher than the computation of the gradient.3.4 A fast algorithm based on Hessian approximation
To derive a practical quasiNewton algorithm, one can observe that the Hessian of has two terms. The second term, , cancels when the algorithm is close to convergence, so we may ignore it. As an approximation of the first term, we impose that it cancels when , leading to the following Hessian approximation:
(11)  
(12) 
Our approximation provides an even sparser version of the true Hessian. Then, then proposed update for the transform reads:
(13) 
where is a step size. The step size is chosen to verify the Wolfe conditions [18]
and is computed using the classical interpolation algorithm thoroughly described in
[17, pp. 5960]. Informally, Wolfe conditions guarantee that the objective function is sufficiently decreased by the step size, and that the projected gradient in the search direction is also decreased. These conditions are critical to obtain convergence of quasiNewton methods, and in practice help in achieving fast convergence.Denote by the matrix with coefficients , so that . Our quasiNewton’s method solves all the aforementioned problems of Newton’s method. The approximated Hessian is

cheap: computing has the same complexity as computing a gradient, i.e., .

positive definite: the approximation boils down to a diagonal operator, i.e., . Hence, its eigenvalues are the coefficients , which are all nonnegative. As such, our method does not require Hessian regularization.

easy to invert: because it boils down to a diagonal operator, we have . Inversion is , which is negligible compared to the cost of computing the gradient.
The resulting optimization procedure is described in Algorithm 2.
3.5 Relation to independent component analysis (ICA)
This objective function (3) is reminiscent of maximumlikelihood ICA where the maximumlikelihood objective is given by [19]:
(14) 
where is a prespecified function. Under the orthogonal constraint, becomes constant. As such, the ICA objective function shares the same dependency in with TLNMF and the algorithm proposed in this paper is inspired by the ICA acceleration techniques proposed in [15].
4 Experiments
The following experiments are run on a single core of a laptop equipped with an Intel Core i76600U @ 2.6 GHz processor and 16 GB of RAM. The Python code is available online.^{1}^{1}1https://github.com/pierreablin/tlnmf
4.1 Synthetic data
We first focus on the sole optimization of , and not on the full TLNMF procedure. For this experiment, we generate random normal matrices of size , for and , and a random transform . We set , so that the minimum of is . Algorithms start from an orthogonal initialization in the vicinity of . More precisely, we set where . Fig. 1 shows the convergence curve of the proposed method, projected gradient [7] and Jacobi search [8]. The proposed quasiNewton approach leads to a drastic improvement in speed of convergence.
4.2 Real data
Experimental setup. We consider a secondslong excerpt from My Heart (Will Always Lead Me Back To You) by Louis Armstrong and His Hot Five. The sampling rate is Hz.
Using a mslong analysis windows () with 50% overlap between two frames, we obtain .
The rank of the decomposition is fixed to , which is known empirically to provide a satisfactory decomposition with traditional NMF [10].
Comparison of the algorithms performance.
In a first experiment, we first run traditional ISNMF on the DCT spectrogram of the input signal and store . Then the three transform learning algorithms are run with fixed and from a random starting point for . This provides a realistic setting to compare their performance in optimizing . Full TLNMF (with free and ) are computed in a second experiment, using (same) random starting points. The three different transform learning algorithms are run with . Results for the two experiments are shown in Fig. 2 and illustrate the superiority of the proposed quasiNewton algorithm.
We now discuss some features of the transform learned with (full) TLNMF using the quasiNewton algorithm.
We will refer to the rows of as atoms (realvalued vectors of size ).
The learned atoms are not shown here due to space limitation but are similar to those obtained in [7, 8].
Energy concentration.
The contributed energy of a single atom is defined as .
Fig. 2(a) shows the cumulative distribution of the energies for three different transforms: estimated by TLNMF, the DCT, and a random orthogonal matrix.
As expected, DCT concentrates the energy while a random orthogonal transform barely does so. The energy concentration phenomenon is accentuated by transform learning.
This behavior was observed with other music datasets as well.
Reliability of the learned transform. The problem solved by TLNMF is nonconvex, hence different initializations can lead to different local minima. We investigate the structure of the local minima returned by the proposed quasiNewton algorithm using a technique similar to ICASSO in ICA [20]. It appears that a subset of atoms are reliably returned by the algorithm, regardless of initialization. To observe this behavior, we consider two transforms obtained from two random initializations. We select the 50 mostcontributing atoms based of the values of , yielding two matrices and of size . We compute the correlation matrix of size and find a permutation matrix such that is as blockdiagonal as possible. The absolute value of the resulting matrix is displayed in Fig. 2(b). It is well structured and shows in particular that the first atoms (top left) are the same. Furthermore, some pairwise couplings are also uncovered. The diagonal blocks in Fig. 2(b) correspond to sets of atoms such that .

5 Conclusion
We introduced a quasiNewton method on the orthogonal manifold to solve the TLNMF problem. It relies on a sparse approximation of the Hessian. The proposed method outperforms the stateoftheart methods by orders of magnitude. On the laptop used for the experiments, the whole estimation took about 10 minutes for a 2minutes signal, while NMF without transform learning takes roughly 2 minutes. This work is thus a step towards making TLNMF a practical tool for music signal processing. The shortened time of estimation also helps investigate properties of the learned transform without prohibitive computational burden. Results on the concentration of energy obtained by TLNMF suggest that an algorithm that only learns a few atoms instead of would not result in too much loss of information. Such an algorithm would further reduce the computational cost of TLNMF, since the number of parameters would plummet. We intend to study this matter in a future work.
References
 [1] Daniel D. Lee and H. Sebastian Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
 [2] Paris Smaragdis, Cédric Févotte, Gautham J Mysore, Nasser Mohammadiha, and Matthew Hoffman, “Static and dynamic source separation using nonnegative factorizations: A unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66–75, 2014.
 [3] Emmanuel Vincent, Tuomas Virtanen, and Sharon Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
 [4] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shunichi Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation, John Wiley & Sons, 2009.
 [5] Paris Smaragdis and Judith C. Brown, “Nonnegative matrix factorization for polyphonic music transcription,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.
 [6] Emmanuel Vincent, Nancy Bertin, and Roland Badeau, “Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 109–112.
 [7] Dylan Fagot, Herwig Wendt, and Cédric Févotte, “Nonnegative matrix factorization with transform learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2431–2435.
 [8] Herwig Wendt, Dylan Fagot, and Cédric Févotte, “Jacobi algorithm for nonnegative matrix factorization with transform learning,” in Proc. European Signal Processing Conference (EUSIPCO), 2018.
 [9] Kazuyoshi Yoshii, Koichi Kitamura, Yoshiaki Bando, Eita Nakamura, and Tatsuya Kawahara, “Independent lowrank tensor analysis for audio source separation,” in Proc. European Signal Processing Conference (EUSIPCO), 2018.
 [10] Cédric Févotte, Nancy Bertin, and JeanLouis Durrieu, “Nonnegative matrix factorization with the itakurasaito divergence: With application to music analysis,” Neural computation, vol. 21, no. 3, pp. 793–830, 2009.
 [11] Cédric Févotte and Jérôme Idier, “Algorithms for nonnegative matrix factorization with the divergence,” Neural computation, vol. 23, no. 9, pp. 2421–2456, 2011.
 [12] Augustin Lefevre, Francis Bach, and Cédric Févotte, “Itakurasaito nonnegative matrix factorization with group sparsity,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 21–24.
 [13] Slim Essid and Cédric Févotte, “Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring,” IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 415–425, 2013.
 [14] PierreAntoine Absil, Robert Mahony, and Rodolphe Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009.
 [15] Pierre Ablin, JeanFrançois Cardoso, and Alexandre Gramfort, “Faster ICA under orthogonal constraint,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
 [16] ShunIchi Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
 [17] Jorge Nocedal and Stephen J Wright, Numerical Optimization, Springer, 1999.
 [18] Philip Wolfe, “Convergence conditions for ascent methods,” SIAM review, vol. 11, no. 2, pp. 226–235, 1969.
 [19] Dinh Tuan Pham and Philippe Garat, “Blind separation of mixture of independent sources through a quasimaximum likelihood approach,” IEEE Trans. on Signal Processing, vol. 45, no. 7, pp. 1712–1725, 1997.
 [20] Johan Himberg, Aapo Hyvärinen, and Fabrizio Esposito, “Validating the independent components of neuroimaging time series via clustering and visualization,” Neuroimage, vol. 22, no. 3, pp. 1214–1222, 2004.