Nonnegative Tucker Decomposition with Beta-divergence for Music Structure Analysis of audio signals

by Axel Marmoret, et al.

Nonnegative Tucker Decomposition (NTD), a tensor decomposition model, has received increased interest in recent years because of its ability to blindly extract meaningful patterns in tensor data. Nevertheless, existing algorithms to compute NTD are mostly designed for the Euclidean loss. At the same time, NTD has recently proven to be a powerful tool in Music Information Retrieval. This work proposes a Multiplicative Updates algorithm to compute NTD with the beta-divergence loss, often considered a better loss for audio processing. We notably show how to implement the multiplicative rules efficiently using tensor algebra, a naive approach being intractable. Finally, we show on a Music Structure Analysis task that unsupervised NTD fitted with the beta-divergence loss outperforms earlier results obtained with the Euclidean loss.





1 Introduction

Tensor factorization models are powerful tools to interpret multi-way data, and are nowadays used in numerous applications; see [KoldaBader, sidiropoulos2017tensor] for introductory surveys. These models allow the extraction of easily interpretable and meaningful information from the input data, generally in an unsupervised (or weakly supervised) fashion, which can be a great asset when training data is scarce. This is the case for Music Structure Analysis (MSA), which consists in segmenting music recordings solely from the audio signal. For such applications, annotations can be ambiguous and difficult to collect [nieto2020segmentationreview].

Nonnegative Tucker Decomposition (NTD) has previously proven to be a powerful tool for MSA [marmoret2020uncovering, smith2018nonnegative]. While the Euclidean distance is usually used to fit the NTD, audio spectra exhibit large dynamics with respect to frequencies, which leads to a preponderance of a few, typically low, frequencies when using the Euclidean distance. Conversely, β-divergences, and more particularly the Kullback-Leibler and Itakura-Saito divergences, are known to be better suited to time-frequency features. We introduce a new algorithm for NTD where the objective is the minimization of the β-divergence, and we study the resulting decompositions with regard to their benefit on the MSA task on the RWC-Pop database [goto2002rwc]. The proposed algorithm adapts the Multiplicative Updates framework well known for Nonnegative Matrix Factorization (NMF) [fevotte2009nonnegative, lee1999learning] to the tensor case, detailing efficient tensor contractions. It is closely related to [kim2007nonnegative], but studies the more general β-divergence and proposes modified Multiplicative Updates that guarantee global convergence to a stationary point. The code is fully open-source in nn_fac [marmoret2020nn_fac].

2 Mathematical background

2.1 Nonnegative Tucker Decomposition

NTD is a mathematical model in which a nonnegative tensor is approximated as the product of factors (one for each mode of the tensor) and a small core tensor linking these factors. This decomposition results in a low-rank approximation of the original tensor, which can also be seen as a projection of the original tensor onto the multilinear space spanned by the factors. NTD is often used as a dimensionality reduction technique, but it may also be seen as a part-based representation similar to NMF, although its identifiability properties are still not fully understood [zhou2015efficient]. In this work, we focus on third-order tensors for simplicity, but the algorithm and our implementation work for higher-order tensors. Denoting X ∈ ℝ₊^(I1×I2×I3) the tensor to approximate and using conventional tensor-product notation [KoldaBader], computing the NTD boils down to seeking three nonnegative matrices W ∈ ℝ₊^(I1×J1), H ∈ ℝ₊^(I2×J2) and Q ∈ ℝ₊^(I3×J3) and a core tensor G ∈ ℝ₊^(J1×J2×J3) such that:

X ≈ G ×₁ W ×₂ H ×₃ Q.
This decomposition is also presented in Figure 1.

Figure 1: Nonnegative Tucker Decomposition of tensor X into factor matrices W, H, Q and core tensor G.
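The model above can be sketched numerically with NumPy; the dimensions and variable names here are illustrative, not taken from the paper's experiments:

```python
import numpy as np

# Minimal sketch of the NTD model: a nonnegative tensor X of size I1 x I2 x I3
# is approximated by a core G (J1 x J2 x J3) multiplied along each mode by
# nonnegative factors W, H, Q.
rng = np.random.default_rng(0)
I1, I2, I3 = 8, 9, 10   # tensor dimensions (arbitrary)
J1, J2, J3 = 3, 4, 5    # core dimensions (arbitrary)

W = rng.random((I1, J1))
H = rng.random((I2, J2))
Q = rng.random((I3, J3))
G = rng.random((J1, J2, J3))

# Multiway product G x_1 W x_2 H x_3 Q, written as a single contraction.
X_hat = np.einsum('abc,ia,jb,kc->ijk', G, W, H, Q)
assert X_hat.shape == (I1, I2, I3)
assert (X_hat >= 0).all()  # nonnegative factors give a nonnegative approximation
```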

NTD is generally performed by minimizing some distance or divergence function between the original tensor and the approximation. Many algorithms found in the literature [marmoret2020uncovering, zhou2015efficient, phan2011extended, xu2015alternating] are based on the minimization of the squared Euclidean distance, which is the element-wise squared difference between tensor coefficients. In this work, we instead consider the β-divergence, detailed hereafter.

2.2 The β-divergence loss function

In this work, we will focus on the β-divergence function introduced in [Basu]. Given two nonnegative scalars x and y, the β-divergence between x and y, denoted d_β(x|y), is defined as follows:

d_β(x|y) = (x^β + (β−1) y^β − β x y^(β−1)) / (β(β−1))  for β ∈ ℝ∖{0, 1},
d_1(x|y) = x log(x/y) − x + y,
d_0(x|y) = x/y − log(x/y) − 1.
This divergence generalizes the Euclidean distance (β = 2), and the Kullback-Leibler (KL, β = 1) and Itakura-Saito (IS, β = 0) divergences. The β-divergence is homogeneous of degree β, that is, for any λ > 0, we have d_β(λx|λy) = λ^β d_β(x|y). It implies that factorizations obtained with β > 0 (such as the Euclidean distance or the KL divergence) will rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. The IS divergence (β = 0) is scale-invariant, and is the only one in the β-divergence family to possess this property. It implies that entries of low power are as important in the divergence computation as areas of high power. This property is interesting when processing audio signals, as low-power frequency bands can contribute as much as high-power frequency bands to their characterization. Both the KL and IS divergences are notoriously known to be better suited to audio source separation than the Euclidean distance [fevotte2009nonnegative, Lefevre_phd, Fevotte_betadiv].
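As a sanity check of the definition and of the homogeneity property, the scalar β-divergence can be transcribed directly (this is a plain transcription of the formulas above, not code from the paper):

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Elementwise beta-divergence d_beta(x | y) for positive inputs.

    beta = 2 recovers half the squared Euclidean distance,
    beta = 1 the Kullback-Leibler divergence,
    beta = 0 the Itakura-Saito divergence.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    if beta == 1:    # KL divergence
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:    # IS divergence
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x**beta + (beta - 1) * y**beta
                   - beta * x * y**(beta - 1)) / (beta * (beta - 1)))

x, y = 3.0, 2.0
# beta = 2 is half the squared Euclidean distance.
assert abs(beta_divergence(x, y, 2) - 0.5 * (x - y) ** 2) < 1e-12
# Homogeneity of degree beta: d(l*x | l*y) = l**beta * d(x | y).
for beta in (0, 1, 2):
    lam = 5.0
    assert abs(beta_divergence(lam * x, lam * y, beta)
               - lam**beta * beta_divergence(x, y, beta)) < 1e-9
# In particular, beta = 0 (IS) is scale-invariant, as stated above.
```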

Hence, this work focuses on how to compute a candidate solution to approximate NTD with the β-divergence as a loss function:

min_{G, W, H, Q ≥ 0} D_β(X | G ×₁ W ×₂ H ×₃ Q),

with D_β the elementwise β-divergence between two tensors, i.e. the sum of the scalar divergences d_β over all entries.

3 A Multiplicative Updates algorithm

3.1 Multiplicative updates for NTD

The cost function is non-convex with respect to all factors, and computing a global solution to NTD is NP-hard since NTD is a generalization of NMF [vavasis2009complexity]. However, each subproblem obtained when fixing all but one mode is convex as long as β ∈ [1, 2]. Hence, block-coordinate algorithms, which update one factor at a time while fixing all the other factors, are standard to solve both NMF and NTD [lee1999learning, kim2007nonnegative, phan2011extended, Fevotte_betadiv].

In particular, the seminal paper by Lee and Seung [lee1999learning] proposed an alternating algorithm for NMF with β-divergence, later revisited by Févotte et al. [Fevotte_betadiv], which we shall extend to NTD. Indeed, the NTD model can be rewritten using tensor matricization, e.g. along the first mode:

X_(1) ≈ W G_(1) (Q ⊗ H)^T,   (4)

where X_(i) is the matricization of the tensor X along mode i [cohen2015notations] and ⊗ denotes the Kronecker product. The matricizations along the other modes are analogous, involving factors H and Q respectively. One can interpret equation (4) as an NMF of X_(1) with respect to W and G_(1)(Q ⊗ H)^T. This means that the update rules from the Multiplicative Updates (MU) algorithm in [Fevotte_betadiv] can be used to compute the update of each factor in NTD, within the same majorization-minimization framework.

For the core factor, one can use the vectorization property

vec(X) ≈ (Q ⊗ H ⊗ W) vec(G),

and we can make use of the MU rules as well.

The MU rule in approximate NMF X ≈ WH may be defined as

H ← max(ε, H ∘ [ W^T((WH)^(β−2) ∘ X) / (W^T (WH)^(β−1)) ]^γ(β)),   (6)

with ε a small constant, ∘ the elementwise product (divisions and exponents also being elementwise), and γ(β) a function equal to 1/(2−β) if β < 1, 1 if 1 ≤ β ≤ 2, and 1/(β−1) if β > 2 [Fevotte_betadiv]. The elementwise maximum between the matrix update, i.e. the closed-form expression of the minimizer of the majorization built at the current iterate, and ε in (6) aims at avoiding zero entries in the factors, which may cause divisions by zero, and at establishing convergence guarantees to stationary points within the BSUM framework [doi:10.1137/120891009], by considering a modified optimization problem where the factors are elementwise greater than ε instead of zero. Hence the proposed updates do not increase the cost at each iteration and, for any initial factors, every limit point is a stationary point [doi:10.1137/1.9781611976410, Theorem 8.9].
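Rule (6) can be sketched for plain NMF as follows; `mu_update_H` is a hypothetical helper name, and the ε-clipping matches the modified problem described above:

```python
import numpy as np

def gamma(beta):
    # Exponent from Fevotte & Idier ensuring monotone descent of the beta-divergence.
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta <= 2:
        return 1.0
    return 1.0 / (beta - 1.0)

def mu_update_H(X, W, H, beta, eps=1e-12):
    """One multiplicative update of H in X ~ W @ H, clipped at eps as in (6)."""
    WH = W @ H
    numerator = W.T @ (WH ** (beta - 2) * X)
    denominator = W.T @ WH ** (beta - 1)
    return np.maximum(eps, H * (numerator / denominator) ** gamma(beta))

# Sanity check: the update should not increase the KL cost (beta = 1).
rng = np.random.default_rng(1)
X = rng.random((6, 7)) + 0.1
W = rng.random((6, 3)) + 0.1
H = rng.random((3, 7)) + 0.1

def kl(X, Y):
    return np.sum(X * np.log(X / Y) - X + Y)

before = kl(X, W @ H)
H = mu_update_H(X, W, H, beta=1)
assert kl(X, W @ H) <= before + 1e-10
```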

3.2 Low-complexities updates

Following the previous discussion, a naive approach to implement MU for NTD, e.g. for the first-mode factor W, would be to 1) define B = G_(1)(Q ⊗ H)^T and 2) update W according to (6). However, forming matrix B explicitly using a Kronecker product is bound to be extremely inefficient both in terms of memory allocation and computation time. Let us show how to use classical tensor-product identities to make the MU tractable for NTD.

For the MU rule of the factor matrices, we still propose to compute matrix B explicitly, but using the identity:

B = G_(1) (Q ⊗ H)^T = [G ×₂ H ×₃ Q]_(1),

which brings down the complexity of forming B from O(I2 I3 J1 J2 J3) if done naively to O(J1 J3 I2 (J2 + I3))222We consider that the multiway products are computed in lexicographic order. and drastically reduces memory requirements.
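The identity can be checked numerically; the code below (illustrative dimensions, column-major unfoldings in the usual Kolda convention) contrasts the naive Kronecker route with the sequence of small multiway products:

```python
import numpy as np

rng = np.random.default_rng(2)
I2, I3 = 30, 40          # tensor dimensions along modes 2 and 3 (arbitrary)
J1, J2, J3 = 4, 5, 6     # core dimensions (arbitrary)
G = rng.random((J1, J2, J3))
H = rng.random((I2, J2))
Q = rng.random((I3, J3))

def unfold1(T):
    # Mode-1 matricization with the usual column-major (Kolda) convention.
    return T.reshape(T.shape[0], -1, order='F')

# Naive route: form the (large) Kronecker product explicitly.
naive = unfold1(G) @ np.kron(Q, H).T

# Efficient route: two small multiway products, never forming kron(Q, H).
GH = np.einsum('abc,jb->ajc', G, H)      # G x_2 H
GHQ = np.einsum('ajc,kc->ajk', GH, Q)    # (G x_2 H) x_3 Q
assert np.allclose(naive, unfold1(GHQ))  # B = G_(1)(Q kron H)^T = [G x_2 H x_3 Q]_(1)
```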

The core update follows the same idea. Formally, setting V = Q ⊗ H ⊗ W, the MU is written as:

vec(G) ← max(ε, vec(G) ∘ [ V^T((V vec(G))^(β−2) ∘ vec(X)) / (V^T (V vec(G))^(β−1)) ]^γ(β)).

For the core update, contrarily to the factor updates, in most practical cases we cannot compute and store the matrix V explicitly, because it is J1 J2 J3 times larger than the data itself. Instead, all products V vec(Y) for any tensor Y of the size of the core are computed using the identity

V vec(Y) = vec(Y ×₁ W ×₂ H ×₃ Q).

Products with V^T are computed similarly, since transposition is distributive with respect to the Kronecker product.

Algorithm 1 shows one loop of the iterative algorithm and summarizes the proposed MU rules. The overall complexity of such an iteration depends on the effective complexity of the multiway products, but is no worse than O(I1 I2 I3 max_i J_i) if J_i ≤ I_i for all i.

Perform analogous updates for H, Q and the core G, and return (G, W, H, Q)
Algorithm 1 A loop of β-NTD(X, dimensions, β)

As a side note, there could be other ways to perform the updates in Algorithm 1. For instance, we could avoid forming B = [G ×₂ H ×₃ Q]_(1) explicitly and only contract it. Since B is however used several times, it appeared faster in theory to compute it explicitly. Furthermore, our actual implementation features special update rules for β = 1 and β = 2, which make use of the simplifications induced by these particular values. For β = 2, i.e. the Euclidean loss, it is known however that other algorithms such as Hierarchical Alternating Least Squares [phan2011extended] outperform MU; HALS is also implemented in nn_fac [marmoret2020nn_fac].
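To make the loop concrete, here is a hedged NumPy sketch of one iteration, updating only W and the core G (H and Q follow by permuting the modes, as in Algorithm 1); it mirrors the rules above but is not the nn_fac implementation:

```python
import numpy as np

def gamma(beta):
    return 1 / (2 - beta) if beta < 1 else (1.0 if beta <= 2 else 1 / (beta - 1))

def one_mu_ntd_loop(X, G, W, H, Q, beta=1, eps=1e-12):
    """Illustrative sketch: one pass of beta-NTD MU for W and G only."""
    unfold1 = lambda T: T.reshape(T.shape[0], -1, order='F')

    # --- W update: NMF-style MU (6) on X_(1) ~ W @ B with B = [G x_2 H x_3 Q]_(1)
    B = unfold1(np.einsum('abc,jb,kc->ajk', G, H, Q))
    X1 = unfold1(X)
    WB = W @ B
    W = np.maximum(eps, W * (((WB ** (beta - 2) * X1) @ B.T)
                             / (WB ** (beta - 1) @ B.T)) ** gamma(beta))

    # --- core update: MU on vec(X) ~ V vec(G), with V and V^T applied implicitly
    Xhat = np.einsum('abc,ia,jb,kc->ijk', G, W, H, Q)
    # V^T vec(Y) contracts Y with W, H, Q transposed along each mode.
    Vt = lambda Y: np.einsum('ijk,ia,jb,kc->abc', Y, W, H, Q)
    G = np.maximum(eps, G * (Vt(Xhat ** (beta - 2) * X)
                             / Vt(Xhat ** (beta - 1))) ** gamma(beta))
    return G, W

rng = np.random.default_rng(4)
X = rng.random((5, 6, 7)) + 0.1
G = rng.random((2, 3, 4)) + 0.1
W, H, Q = (rng.random((n, r)) + 0.1 for n, r in [(5, 2), (6, 3), (7, 4)])
G, W = one_mu_ntd_loop(X, G, W, H, Q, beta=1)
assert (W > 0).all() and (G > 0).all()  # factors stay strictly positive
```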

4 Experimental Framework

Feature                                          Method          P(0.5s)  R(0.5s)  F(0.5s)  P(3s)   R(3s)   F(3s)
Chromas (initial work [marmoret2020uncovering])  HALS-NTD        58.4%    60.7%    59.0%    72.5%   75.3%   73.2%
Mel-spectrogram                                  Raw features    46.4%    47.0%    46.3%    70.9%   71.3%   70.6%
                                                 HALS-NTD        45.9%    49.5%    47.2%    68.1%   73.0%   69.7%
                                                 MU-NTD (β=1)    53.3%    56.9%    54.6%    70.5%   75.3%   72.2%
                                                 MU-NTD (β=0)    51.1%    56.8%    53.3%    69.9%   78.0%   73.1%
NNLMS                                            Raw features    43.6%    42.7%    42.8%    66.4%   64.5%   65.0%
                                                 HALS-NTD        50.5%    52.7%    51.1%    71.2%   74.6%   72.2%
                                                 MU-NTD (β=1)    57.8%    61.9%    59.3%    73.9%   79.3%   75.9%
                                                 MU-NTD (β=0)    55.9%    61.7%    58.1%    74.0%   81.9%   77.1%
Table 1: Segmentation results on the RWC-Pop dataset [goto2002rwc] with different loss functions. For the song "POP-01" of RWC-Pop, the tensor-spectrogram is of size 80 × 96 × 118, and one iteration of the MU algorithm takes approximately 0.2 s for both the Mel-spectrogram and the NNLMS with the selected core dimensions, while one iteration of the HALS algorithm takes approximately 0.75 s. Nonetheless, the HALS algorithm generally converges in fewer iterations than the MU algorithm.

4.1 NTD for music processing

NTD has already been introduced to process audio signals, and to provide a barwise pattern representation of a song [marmoret2020uncovering, smith2018nonnegative]. This barwise representation can in turn be used as a salient representation to study the structure of music pieces.



Figure 2: Tensor-spectrogram, composed of a frequential mode (chromas, MFCC, …), an inner-bar time mode and a bar-scale time mode.

NTD is performed on a 3rd-order tensor, called a tensor-spectrogram, which is the result of splitting a spectrogram at bar frontiers and concatenating the subsequent barwise spectrograms along a 3rd mode. Hence, a tensor-spectrogram is composed of a frequential mode and two time-related modes: an inner-bar (low-level) time, and a bar (high-level) time. Bars are estimated using the madmom toolbox [madmom]. Each bar contains 96 frames, which are selected as equally spaced on an oversampled spectrogram (hop length of 32 samples), in order to account for bar-length discrepancies [marmoret2020uncovering].
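The construction can be sketched as follows; the function name `tensor_spectrogram` is hypothetical, bar estimation (madmom) is out of scope, and the frame-selection scheme is a simplified stand-in for the one described above:

```python
import numpy as np

def tensor_spectrogram(spec, bar_frames, frames_per_bar=96):
    """Hypothetical sketch: stack barwise slices of a spectrogram into a 3rd-order tensor.

    spec: (n_freq, n_frames) nonnegative spectrogram, oversampled in time.
    bar_frames: frame indices of the estimated bar boundaries.
    Each bar is reduced to `frames_per_bar` equally spaced frames to absorb
    bar-length discrepancies.
    """
    bars = []
    for start, end in zip(bar_frames[:-1], bar_frames[1:]):
        idx = np.linspace(start, end - 1, frames_per_bar).round().astype(int)
        bars.append(spec[:, idx])
    # Modes: frequency x inner-bar time x bar index.
    return np.stack(bars, axis=2)

# Synthetic example: an 80-band "spectrogram" with 4 bars of unequal length.
spec = np.abs(np.random.default_rng(5).standard_normal((80, 1000)))
T = tensor_spectrogram(spec, bar_frames=[0, 250, 480, 730, 1000])
assert T.shape == (80, 96, 4)
```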

In our previous work [marmoret2020uncovering], we computed the NTD on chromagrams of the songs. A chromagram reduces the frequential content of music to the 12 semi-tones of the Western scale (C, C#, …, B). Nonetheless, it discards many details present in the music content. This work extends our previous study of NTD for music segmentation to Mel-spectrograms, using the β-divergence as a loss function. Mel-spectrograms consist of the STFT of a song with frequencies aggregated following a Mel filter bank. They provide a richer representation of music than chromagrams, but they live in a higher-dimensional space.

On the basis of this alternate representation, we compare the algorithm introduced in [marmoret2020uncovering] (HALS-based NTD with Euclidean loss minimization) with the proposed algorithm in the MSA task, on the audio signals of the RWC Pop database [goto2002rwc].

In practice, Mel-spectrograms are computed with Librosa [mcfee2015librosa] and are dimensioned following the work of [grill2015music], which is considered state-of-the-art in this task. Precisely, STFTs are computed as power spectrograms with a window size of 2048 samples for a signal sampling rate of 44.1 kHz. A Mel filter bank of 80 triangular filters between 80 Hz and 16 kHz is then applied. In addition to this raw Mel representation, we study a logarithmic variant, which is generally used as a way to account for the exponential distribution of power in audio spectra. As the logarithmic function is negative for values lower than 1, we introduce the Nonnegative Log Mel spectrogram (NNLMS) as NNLMS = log(1 + Mel). The logarithm is approximately linear around 1, meaning that low Mel coefficients remain low in the NNLMS while higher values are damped.

Finally, each nonnegative tensor-spectrogram has size 80 × 96 × B, with B the number of bars in the song.

4.2 Music structure analysis based on NTD

Music Structure Analysis consists in segmenting a song into sections, as done in [marmoret2020uncovering]. The boundaries are obtained by segmenting the autosimilarity matrix of the bar factor Q, computed as QQ^T, which corresponds to a compressed barwise representation of the song. Segmentation of this matrix is done using a dynamic programming algorithm designed to frame square regions of high similarity along the diagonal of the autosimilarity matrix. Details about the technique can be found in [marmoret2020uncovering] or in the experimental notebooks associated with the segmentation code.
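The autosimilarity computation can be sketched in a few lines; the row normalization below is a common choice for such similarity matrices and is an assumption here, as the reference implementation may normalize differently:

```python
import numpy as np

def autosimilarity(Q):
    """Cosine-style autosimilarity of the bar factor Q (one row per bar).

    Assumption (not from the paper): rows are l2-normalized before the
    product, so that similar bars score close to 1 regardless of loudness.
    """
    Qn = Q / np.maximum(np.linalg.norm(Q, axis=1, keepdims=True), 1e-12)
    return Qn @ Qn.T

Q = np.random.default_rng(6).random((20, 8))   # 20 bars, 8 patterns
S = autosimilarity(Q)
assert S.shape == (20, 20)
assert np.allclose(np.diag(S), 1.0)  # each bar is maximally similar to itself
assert np.allclose(S, S.T)           # autosimilarity is symmetric
```

The dynamic programming segmentation then looks for square blocks of high values along the diagonal of S.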

Segmentation results are presented in Table 1, where we compare the performance of segmenting the chromagram using the HALS-NTD (Euclidean loss) [marmoret2020uncovering] with segmenting the Mel and NNLMS representations. The condition "Raw features" presents results when segmenting the autosimilarity of the barwise feature representation, i.e. the raw Mel or NNLMS representation of the song instead of that resulting from NTD. Core dimensions (J1, J2, J3) are fitted with two-fold cross-validation, by splitting the RWC-Pop dataset between even and odd songs.

These results show that, for both representations, using the β-divergence instead of the Euclidean loss enhances segmentation performance. Segmentation results are also higher when using NTD on the NNLMS rather than on the Mel-spectrogram, whereas segmenting the raw barwise feature autosimilarity matrix leads to the opposite trend.

4.3 Qualitative assessment of patterns

As a qualitative study of the different β-divergences (β ∈ {0, 1, 2}), we computed the NTD with these three values on the STFT of the song "Come Together" by The Beatles. Using the Griffin-Lim algorithm [griffin1984signal], the approximation of the original spectrogram obtained with the NTD and all factorized patterns (W G(:, :, k) H^T, with k the pattern index) are reconstructed into listenable signals. Results are available online, and qualitatively confirm that the KL and IS divergences are better adapted to musical signals than the Euclidean loss. Nonetheless, these reconstructed signals are not high-quality musical signals, either due to inaccuracies in the decomposition, or because of the Griffin-Lim algorithm, which introduces many artifacts when reconstructing the phase information.

5 Conclusion

Nonnegative Tucker Decomposition is a powerful dimensionality reduction technique, able to extract salient patterns in numerical data. This article has proposed a tractable and globally convergent algorithm to perform the NTD with the β-divergence as loss function (in particular the KL and IS divergences). This appears to be of particular interest for a wide range of signals and applications, notably in the audio domain, as supported in this paper by quantitative results on a Music Structure Analysis task and qualitative examples of the impact of the value of β on the reconstruction of music signals.

Future work may consider the introduction of sparsity constraints, which generally improve the interpretability of nonnegative decompositions, and seek additional strategies to accelerate the algorithm itself, including all-at-once optimization approaches in the same spirit as those proposed for NMF in the recent work by Marmin et al. [marmin2021joint].