# Distributionally Robust and Multi-Objective Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for analyzing nonnegative data. A key aspect of NMF is the choice of the objective function, which depends on the noise model (or statistics of the noise) assumed on the data. In many applications, the noise model is unknown and difficult to estimate. In this paper, we define a multi-objective NMF (MO-NMF) problem, where several objectives are combined within the same NMF model. We propose to use Lagrange duality to judiciously optimize for a set of weights to be used within the framework of the weighted-sum approach, that is, we minimize a single objective function which is a weighted sum of all the objective functions. We design a simple algorithm using multiplicative updates to minimize this weighted sum. We show how this can be used to find distributionally robust NMF (DR-NMF) solutions, that is, solutions that minimize the largest error among all objectives. We illustrate the effectiveness of this approach on synthetic, document and audio datasets. The results show that DR-NMF is robust to our incognizance of the noise model of the NMF problem.



## 1 Introduction

Nonnegative matrix factorization (NMF) consists in the following problem: given a nonnegative matrix $X \in \mathbb{R}^{m \times n}_+$ and a factorization rank $r$, find two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ such that $X \approx WH$. NMF is a linear dimensionality reduction technique for nonnegative data. In fact, assuming each column of $X$ is a data point, it is reconstructed via a linear combination of basis elements given by the columns of $W$, while the columns of $H$ provide the weights (or coefficients) to reconstruct each column of $X$ within that basis, that is, for all $j$,

$$X(:,j) \;\approx\; \sum_{k=1}^{r} W(:,k)\, H(k,j).$$

NMF has attracted a lot of attention since the seminal paper of Lee and Seung [lee1999learning], with applications in image analysis, document classification and music analysis, to cite a few; see, e.g., [cichocki2006new, gillis2014] and the references therein. Many NMF models have been proposed over the years. They mostly differ in two aspects:

1. Additional constraints are added to the factor matrices $W$ and $H$, such as sparsity [hoyer2004non], spatial coherence [liu2011approach] or smoothness [essid2013smooth]. These constraints are motivated by a priori information on the sought solution and depend on the application at hand. Note that these additional constraints are in most cases imposed via a penalty term in the objective function.

2. The choice of the objective function that assesses the quality of an approximation by evaluating some distance between $X$ and $WH$ differs. This choice is usually motivated by the noise model/statistics assumed on the data matrix $X$. The most widely used class of objective functions is component-wise and based on the $\beta$-divergences, defined as follows: for $x, y > 0$,

$$D_\beta(x,y) = \begin{cases} \dfrac{x}{y} - \log \dfrac{x}{y} - 1 & \text{for } \beta = 0, \\[1mm] x \log \dfrac{x}{y} - x + y & \text{for } \beta = 1, \\[1mm] \dfrac{1}{\beta(\beta-1)} \left( x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta-1} \right) & \text{for } \beta \neq 0, 1. \end{cases}$$

We will use the following matrix-wise notation,

$$D_\beta(X, WH) = \sum_{i,j} D_\beta\big(X_{ij}, (WH)_{ij}\big).$$

The following special cases are of particular interest (see for example [fevotte2009nonnegative] for a discussion):

• $\beta = 2$: $D_2(X, WH) = \tfrac{1}{2}\|X - WH\|_F^2$ is half the squared Frobenius norm (additive Gaussian noise).

• $\beta = 1$: $D_1$ is the Kullback-Leibler (KL) divergence (Poisson noise).

• $\beta = 0$: $D_0$ is the Itakura-Saito (IS) divergence (multiplicative Gamma noise).
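To make the definition concrete, here is a small NumPy sketch (the function name and interface are our own choosing, not from the paper) that evaluates $D_\beta$ in the three regimes; it can also be used to check the scaling identity $D_\beta(\alpha x, \alpha y) = \alpha^\beta D_\beta(x, y)$ discussed later in Section 2.1.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of component-wise beta-divergences D_beta(x_i, y_i), for x, y > 0."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 0:        # Itakura-Saito divergence
        d = x / y - np.log(x / y) - 1
    elif beta == 1:      # Kullback-Leibler divergence
        d = x * np.log(x / y) - x + y
    else:                # generic case, beta not in {0, 1}
        d = (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) \
            / (beta * (beta - 1))
    return float(d.sum())
```

For instance, `beta_divergence(x, y, 2)` equals $\tfrac12\|x-y\|_2^2$, and multiplying both arguments by $\alpha$ multiplies the divergence by $\alpha^\beta$.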

In this paper, we focus on the second aspect, namely, the choice of the objective function. We consider a multi-objective NMF (MO-NMF) formulation based on a weighted sum of the different objective functions, which is standard in multi-objective optimization; see, for example, [marler2010]. Our main motivation for considering this class of models is that in many applications it is not clear which objective function to use because the statistics of the noise are unknown. To the best of our knowledge, there are currently three main classes of methods to handle this situation:

• The user chooses the objective function she/he believes is the most suitable for the application at hand. This is, as far as we know, the simplest and most widely-used approach; however, it is ad hoc.

• The objective function is automatically selected using cross-validation, where the training is done on a subset of the entries of the input data matrix and the testing on the remaining entries; see, e.g., [mollah2007robust, choi2010learning].

• The most suitable objective function is chosen using some statistically motivated criteria such as score matching [lu2012selecting] or maximum likelihood [dikmen2015learning].

However, in all the above approaches, if the choice of the objective function is wrong, the NMF solution provided could be far from the desired solution (as we will show in our numerical experiments in Section 4). Another possibility which we propose in this paper is to compute an NMF solution that is robust to different types of noise distributions; this is referred to as distributionally robust, and is closely related to robust optimization [ben2009robust]. In mathematical terms, we will consider the problem

$$\min_{W, H \geq 0} \; \max_{\beta \in \Omega} \; D_\beta(X, WH),$$

where $\Omega$ is a subset of values of $\beta$ of interest. As we will see, this problem can be tackled by minimizing a weighted sum of the different objective functions [marler2010], exactly as for MO-NMF, but where the weights assigned to the different objective functions are automatically tuned within the iterative process.

##### Outline of the paper

In Section 2, we first define MO-NMF and explain how to scale the objective functions so that the constituent NMF objectives are comparable. Then we give our main motivation for considering MO-NMF, namely to be able to compute distributionally robust NMF (DR-NMF) solutions, that is, solutions that minimize the largest objective function value. In Section 3, we propose simple multiplicative updates (MU) to tackle a weighted-sum approach for MO-NMF. We then show how it can be used to solve the DR-NMF problem. Finally, we illustrate in Section 4 the effectiveness of our approach on synthetic, document and audio datasets.

## 2 Multi-Objective NMF

Let $\Omega$ be a finite set of values of $\beta$. We consider in this paper the following multi-objective NMF (MO-NMF) problem:

$$\min_{W, H \geq 0} \; \big\{ D_\beta(X, WH) \big\}_{\beta \in \Omega}.$$

Note that we focus on $\beta$-divergences to simplify our presentation and because these are the most widely-used divergences to measure the “distance” between the given matrix and its approximation in the NMF literature. However, our approach can be adapted to other objective functions (e.g., $\alpha$-divergences [cichocki2009nonnegative]). To tackle this problem, we consider the standard weighted-sum approach [deb2014multi], which consists in solving the following minimization problem involving a single objective function:

$$\min_{W, H \geq 0} D_\Omega^\lambda(X, WH), \quad \text{where } D_\Omega^\lambda(X, WH) = \sum_{\beta \in \Omega} \lambda_\beta D_\beta(X, WH),$$

with $\lambda_\beta \geq 0$ for all $\beta \in \Omega$ and $\sum_{\beta \in \Omega} \lambda_\beta = 1$. Using different values for $\lambda$ allows one to generate different Pareto-optimal solutions; see Section 4.1 for some examples. Note however that it does not allow one to generate all Pareto-optimal solutions [deb2014multi]. A Pareto-optimal solution is a solution that is not dominated by any other solution, that is, $(W, H)$ is a Pareto-optimal solution if there does not exist a feasible solution $(W', H')$ such that

• $D_\beta(X, W'H') \leq D_\beta(X, WH)$ for all $\beta \in \Omega$, and

• there exists $\beta \in \Omega$ such that $D_\beta(X, W'H') < D_\beta(X, WH)$.

Multi-objective optimization has already been considered for NMF problems. However, most of the existing literature combines a single data fitting term with penalty terms on the factor matrices, e.g., an $\ell_1$ penalty to obtain sparse solutions [gong2018multiobjective]. As far as we know, the only paper where several objectives are used to balance different data fitting terms is [zhu2016biobjective]. The authors therein combined two objectives, one being a standard data fitting term (more precisely, the Frobenius norm $\|X - WH\|_F^2$) and the other being a data fitting term in a feature space obtained using a nonlinear kernel. Hence, this approach is rather different from ours, where we allow more than two objectives and focus only on the input space. Moreover, we optimize the weights in a principled optimization-theoretic fashion, whereas [zhu2016biobjective] combines the two terms in an ad hoc manner.

### 2.1 Scaling of the objectives

It can be easily checked that for any constant $\alpha > 0$, we have

$$D_\beta(\alpha X, \alpha WH) = \alpha^\beta \, D_\beta(X, WH).$$

Hence, the values of the divergences for different values of $\beta$ depend highly on the scaling of the input matrix. This is usually not a desirable property in practice, since most datasets are not particularly well scaled, and since scaling simply multiplies the noise by a constant, which in most cases does not change its distribution (only its parameters). Therefore, we scale the objective functions so that their linear combination is meaningful, in the sense that each term in the sum has a similar importance. This will be particularly crucial for our DR-NMF model described in the next section. In fact, as we will see in Section 4, DR-NMF generates solutions that have small error for all objectives instead of just one; as such, the solutions inherit the superior qualities of the ones generated by the different divergences. We use the following approach to scale the different objective functions. First, for each $\beta \in \Omega$, we compute a solution $(W_\beta, H_\beta)$ of $\min_{W,H \geq 0} D_\beta(X, WH)$ to obtain the error $e_\beta = D_\beta(X, W_\beta H_\beta)$. Note that we can only perform this minimization approximately because the NMF problem is NP-hard [vavasis2009complexity]. Then, we define

$$\bar{D}_\beta(X, WH) = \frac{D_\beta(X, WH)}{e_\beta},$$

so that $\bar{D}_\beta(X, W_\beta H_\beta) = 1$. Finally, we will only consider the MO-NMF problem where the objectives $D_\beta$ are replaced by their normalized versions $\bar{D}_\beta$, that is,

$$\min_{W, H \geq 0} \bar{D}_\Omega^\lambda(X, WH), \quad \text{where} \quad \bar{D}_\Omega^\lambda(X, WH) = \sum_{\beta \in \Omega} \lambda_\beta \bar{D}_\beta(X, WH). \tag{1}$$

In Section 3, we propose an MU algorithm to tackle this problem.

### 2.2 Main motivation: Distributionally robust NMF

If the noise model on the data is unknown, but corresponds to a distribution associated with a $\beta$-divergence for some $\beta \in \Omega$ (e.g., the Tweedie distribution as discussed in [tan2013automatic]), it makes sense to consider the following distributionally robust NMF (DR-NMF) problem:

$$\min_{W, H \geq 0} \; \max_{\beta \in \Omega} \; \bar{D}_\beta(X, WH). \tag{2}$$

Note that we use $\bar{D}_\beta$, not $D_\beta$, because otherwise, in most cases, the above problem amounts to minimizing a single objective corresponding to the $\beta$-divergence with the largest value; cf. the discussion in Section 2.1. Let us show how DR-NMF can be solved via a weighted sum of the different objective functions. We first observe that

$$\max_{\beta \in \Omega} \bar{D}_\beta(X, WH) = \max_{\lambda \geq 0, \, \sum_{\beta \in \Omega} \lambda_\beta = 1} \; \sum_{\beta \in \Omega} \lambda_\beta \bar{D}_\beta(X, WH).$$

Indeed, let $\beta^* \in \operatorname{argmax}_{\beta \in \Omega} \bar{D}_\beta(X, WH)$; then the problem on the right-hand side attains its optimal value at the vector $\lambda^*$ with $\lambda^*_{\beta^*} = 1$ and $\lambda^*_\beta = 0$ for $\beta \neq \beta^*$, and we have $\sum_{\beta \in \Omega} \lambda^*_\beta \bar{D}_\beta(X, WH) = \bar{D}_{\beta^*}(X, WH)$. Hence (2) can be reformulated as

$$\min_{W, H \geq 0} \; \max_{\lambda \geq 0, \, \|\lambda\|_1 = 1} \; \sum_{\beta \in \Omega} \lambda_\beta \bar{D}_\beta(X, WH). \tag{3}$$

Denote $g(\lambda) = \min_{W, H \geq 0} \bar{D}_\Omega^\lambda(X, WH)$, which is concave. The dual problem of (3) is given by

$$\max_{\lambda \geq 0, \, \|\lambda\|_1 = 1} g(\lambda). \tag{4}$$

We know that when $\bar{D}_\Omega^\lambda$ is convex with respect to $(W, H)$ and there exists a Slater point (a point in the relative interior of the feasible domain, which is clearly the case here), then strong duality holds by minimax theory, that is,

$$\min_{W, H \geq 0} \; \max_{\lambda \geq 0, \, \|\lambda\|_1 = 1} \bar{D}_\Omega^\lambda(X, WH) = \max_{\lambda \geq 0, \, \|\lambda\|_1 = 1} \; \min_{W, H \geq 0} \bar{D}_\Omega^\lambda(X, WH).$$

As we are considering a non-convex optimization problem, strong duality may not hold. Assuming there exists a saddle point $(W^*, H^*; \lambda^*)$ of $L(W, H; \lambda) = \bar{D}_\Omega^\lambda(X, WH)$ such that

$$L(W^*, H^*; \lambda) \leq L(W^*, H^*; \lambda^*) \leq L(W, H; \lambda^*), \tag{5}$$

for all $W, H \geq 0$ and all $\lambda \geq 0$ with $\|\lambda\|_1 = 1$, we have

$$\lambda^* = \operatorname{argmax}_{\lambda \geq 0, \, \|\lambda\|_1 = 1} g(\lambda), \quad \text{and} \quad (W^*, H^*) = \operatorname{argmin}_{W, H \geq 0} \bar{D}_\Omega^{\lambda^*}(X, WH).$$

Therefore, an optimal solution for DR-NMF can be computed by solving the dual (4) to obtain $\lambda^*$, and using the weighted-sum minimization problem in (1) to compute $(W^*, H^*)$. We adopt this approach in Section 3 to design an algorithm for DR-NMF in (2).

## 3 Multiplicative updates for (1)

In this section, we propose MU for (1), which we will be able to use to tackle MO-NMF and DR-NMF. As with most NMF algorithms, we use an alternating strategy, that is, we first optimize over the variable $W$ for $H$ fixed and then reverse their roles. By the symmetry of the problem ($D_\beta(X, WH) = D_\beta(X^T, H^T W^T)$), we focus on the update of $H$; the update of $W$ can be obtained similarly.

### 3.1 Deriving MU

Let us recall the standard way MU are derived (see, e.g., [lee2001algorithms, fevotte2009nonnegative, yang2011unified]) on the following general optimization problem with nonnegativity constraints

$$\min_{x \geq 0} f(x). \tag{6}$$

Let us apply a rescaled gradient descent method to (6), that is, use the following update

$$x^+ = x - B \, \nabla f(x),$$

where $x$ is the current iterate, $x^+$ is the next iterate, and $B$ is a diagonal matrix with positive diagonal elements. Let $\nabla^+ f(x)$ and $\nabla^- f(x)$ be such that $\nabla f(x) = \nabla^+ f(x) - \nabla^- f(x)$. Taking $B_{ii} = x_i / \nabla^+_i f(x)$ for all $i$, we obtain the following MU rule:

$$x^+ = x - \frac{[x]}{[\nabla^+ f(x)]} \circ \big( \nabla^+ f(x) - \nabla^- f(x) \big) = x \circ \frac{[\nabla^- f(x)]}{[\nabla^+ f(x)]}, \tag{7}$$

where $\circ$ (resp. $\frac{[\,\cdot\,]}{[\,\cdot\,]}$) refers to component-wise multiplication (resp. division) between two vectors or matrices. Note that we need strict positivity of $\nabla^+ f(x)$ and $\nabla^- f(x)$, otherwise we would encounter divisions by zero or variables directly set to zero, which is not desirable. Using the above simple rule with proper choices for $\nabla^+ f$ and $\nabla^- f$ leads to algorithms that are, in many cases, guaranteed to not increase the objective function, that is, $f(x^+) \leq f(x)$; see below for some examples, and [yang2011unified] for a discussion and a unified rule to design such updates. This is a desirable property since it avoids any line search procedure and also preserves nonnegativity naturally. If we cannot guarantee that the updates are non-increasing, the step length can be reduced, that is, use

$$x^+_\gamma = x - \gamma B \, \nabla f(x),$$

for some $\gamma \in (0, 1]$, which leads to

$$x^+_\gamma = (1 - \gamma)\, x + \gamma\, x^+.$$

For example, one can repeatedly reduce the step size $\gamma$ until the error decreases; such a $\gamma$ is guaranteed to exist since the rescaled gradient direction is a descent direction. We implemented such a line search; see Algorithm 1 below. Note that this idea is similar to that in [lin2007convergence]. Moreover, it would be worth investigating the use of regularizers to guarantee convergence to stationary points without the use of a line search [zhao2018unified].
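As a concrete illustration of the update (7) combined with the step-length safeguard, the sketch below (names and the step-halving schedule are our own choices) applies one damped MU step to a generic objective given its gradient split $\nabla f = \nabla^+ f - \nabla^- f$; the nonnegative least-squares usage is just an example problem.

```python
import numpy as np

def mu_step(x, grad_plus, grad_minus, f, max_halvings=20):
    """One multiplicative update x+ = x o (grad_minus / grad_plus), cf. (7),
    damped as x_gamma = (1 - gamma) x + gamma x+ until f does not increase."""
    x_plus = x * grad_minus / grad_plus   # full MU step (gamma = 1)
    fx, gamma = f(x), 1.0
    for _ in range(max_halvings):
        x_gamma = (1 - gamma) * x + gamma * x_plus
        if f(x_gamma) <= fx:              # accept: objective did not increase
            return x_gamma
        gamma /= 2                        # shrink the step and retry
    return x

# Example: nonnegative least squares, f(x) = 0.5 * ||Ax - b||^2, where
# grad f = A.T A x - A.T b, so grad_plus = A.T A x and grad_minus = A.T b.
rng = np.random.default_rng(0)
A, b = rng.random((20, 5)) + 0.1, rng.random(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
x = np.ones(5)
for _ in range(200):
    x = mu_step(x, A.T @ (A @ x), A.T @ b, f)
```

Because the iterates stay strictly positive and the objective never increases, no projection or line search bookkeeping beyond the halving loop is needed.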

For $x_i = 0$, we have $x^+_i = 0$ and the MU are not able to modify $x_i$: this is the so-called zero-locking phenomenon [berry2007algorithms]. A possible way to fix this issue in practice is to use a small lower bound $\epsilon > 0$ on the entries of $x$, replacing $x^+$ with $\max(\epsilon, x^+)$. This allows such algorithms to be guaranteed to converge to a stationary point of $\min_{x \geq \epsilon} f(x)$ [gillis2011nonnegative, takahashi2014global]; more precisely, any sequence of solutions generated by the modified MU has at least one convergent subsequence, and the limit of any convergent subsequence is a stationary point [takahashi2014global]. Moreover, it can also be shown that such stationary points are close to stationary points of the original problem (6) [gillis2011nonnegative, Chap. 4.1]. We will use this simple strategy in this paper.

### 3.2 Multiplicative Updates for (1)

We now provide more details on how to choose $\nabla^+$ and $\nabla^-$ for the family of $\beta$-divergences in order to tackle (1). For all $\beta$, we have

$$\nabla_H D_\beta(X, WH) = W^T \Big( (WH)^{\circ (\beta - 2)} \circ (WH - X) \Big),$$

where $\nabla_H$ denotes the gradient with respect to the variable $H$, and $A^{\circ p}$ is the component-wise exponentiation of the matrix $A$ by $p$. To derive MU as described in the previous section, the standard choice in the literature is the following [tan2013automatic]:

$$\nabla_H^+ D_\beta(X, WH) = W^T (WH)^{\circ (\beta - 1)} \quad \text{and} \quad \nabla_H^- D_\beta(X, WH) = W^T \Big( (WH)^{\circ (\beta - 2)} \circ X \Big).$$

For example, plugging these in (7) gives, for the Frobenius norm ($\beta = 2$),

$$H \leftarrow H \circ \frac{[W^T X]}{[W^T W H]},$$

and for the KL-divergence ($\beta = 1$),

$$H \leftarrow H \circ \frac{\Big[ W^T \Big( \frac{[X]}{[WH]} \Big) \Big]}{[W^T e_m e_n^T]},$$

where $e_n$ is the vector of all ones of dimension $n$ (hence $e_m e_n^T$ is the $m$-by-$n$ all-ones matrix). These correspond to the MU from [lee2001algorithms], which are guaranteed to not increase the objective function. This was also shown to hold for any $\beta \in [1, 2]$; see [kompass2007generalized]. To solve (1) using MU, we simply use the linear combination of the above standard choices, that is,

$$\nabla_H^{\pm} \bar{D}_\Omega^\lambda(X, WH) = \nabla_H^{\pm} \Big( \sum_{\beta \in \Omega} \lambda_\beta \bar{D}_\beta(X, WH) \Big) = \sum_{\beta \in \Omega} \lambda_\beta \Big( \nabla_H^{\pm} \bar{D}_\beta(X, WH) \Big).$$

Algorithm 1 summarizes the MU for the update of $H$. Note that the line search procedure (steps 3 to 6) is very rarely entered (in all our numerical experiments described in Section 4, we only observed it for $\Omega = \{0\}$, that is, for IS-NMF alone).
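For readers who want to experiment, a minimal NumPy sketch of one such update of $H$ is given below (the naming is ours; the line search of steps 3-6 and the update of $W$ are omitted). It accumulates the $\nabla_H^\pm$ terms of each $\beta \in \Omega$, weighted by $\lambda_\beta$ and, optionally, by the scaling constants $e_\beta$ of Section 2.1.

```python
import numpy as np

EPS = 1e-16  # lower bound avoiding the zero-locking phenomenon

def mu_update_H(X, W, H, lambdas, e=None):
    """One multiplicative update of H for sum_beta lambda_beta * D_beta / e_beta.
    `lambdas` maps beta -> weight lambda_beta; `e` maps beta -> scaling e_beta
    (e=None means unnormalized objectives)."""
    WH = W @ H
    num = np.zeros_like(H)  # accumulates the negative parts, nabla^-_H
    den = np.zeros_like(H)  # accumulates the positive parts, nabla^+_H
    for beta, lam in lambdas.items():
        w = lam / (e[beta] if e is not None else 1.0)
        num += w * (W.T @ (WH ** (beta - 2) * X))
        den += w * (W.T @ (WH ** (beta - 1)))
    return np.maximum(EPS, H * num / den)
```

With `lambdas={2: 1.0}`, this reduces to the classical Frobenius update $H \leftarrow H \circ [W^T X]/[W^T W H]$.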

Because of the step length procedure that guarantees the objective function to not increase (steps 3-6), using Algorithm 1 in an alternating scheme to solve (1), updating $W$ and $H$ alternately, is guaranteed to not increase the objective function. Since the objective function is bounded below, this guarantees that the objective function value converges as the number of iterations goes to infinity.

### 3.3 Algorithm for DR-NMF

As explained in Section 2, DR-NMF can be tackled by solving

$$\max_{\lambda \geq 0, \, \|\lambda\|_1 = 1} g(\lambda), \quad \text{where} \quad g(\lambda) = \min_{W, H \geq 0} \bar{D}_\Omega^\lambda(X, WH).$$

Given $\lambda$, we update $W$ and $H$ using the MU to decrease the value of $\bar{D}_\Omega^\lambda(X, WH)$; see Algorithm 1. For a fixed $(W, H)$, let $\beta^* \in \operatorname{argmax}_{\beta \in \Omega} \bar{D}_\beta(X, WH)$. This means that

$$\bar{D}_{\beta^*}(X, WH) \geq \bar{D}_\beta(X, WH) \quad \text{for all } \beta \in \Omega.$$

Therefore, since we are trying to solve the min-max problem (2), the $\beta^*$-divergence should be given more importance at the next iteration, which forces the maximum to decrease. This can be achieved by increasing the corresponding entry of $\lambda$. More formally, the maximizer $\lambda^*$ of $\sum_{\beta \in \Omega} \lambda_\beta \bar{D}_\beta(X, WH)$ over the unit simplex is given by

$$\lambda^*_\beta = \begin{cases} 1 & \text{if } \beta = \beta^*, \\ 0 & \text{if } \beta \neq \beta^*. \end{cases}$$

Hence, at each step, we update $\lambda$ as follows:

$$\lambda^{(k+1)} = \lambda^{(k)} + \rho_k \lambda^*, \tag{8}$$

then we normalize $\lambda^{(k+1)}$ so that its entries sum to one:

$$\lambda^{(k+1)} \leftarrow \frac{\lambda^{(k+1)}}{\|\lambda^{(k+1)}\|_1}. \tag{9}$$

In (8), $(\rho_k)$ is an appropriately chosen sequence of step-size parameters. The above procedure for updating $\lambda$ means that all entries of $\lambda$ will be decreased (relative to $\lambda^{(k)}$) except for the entry corresponding to $\beta^*$. If we were able to solve the subproblem in $(W, H)$ exactly and use a subgradient direction to update $\lambda$, choosing $\rho_k$ such that $\rho_k \to 0$ and $\sum_k \rho_k = \infty$ would guarantee convergence; see, e.g., [anstreicher2009two] and the references therein. However, as we are not solving the subproblems exactly (each is an NMF problem, which is NP-hard in general [vavasis2009complexity]), we use this update as a heuristic. In our implementation, we used a decreasing sequence $(\rho_k)$ that works well in practice.
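The updates (8)-(9) are simple enough to state in a few lines; the sketch below (names are our own) takes the current weights, the normalized errors $\bar{D}_\beta$, and a step size $\rho_k$, and returns the renormalized weights.

```python
import numpy as np

def update_lambda(lam, errors, rho):
    """Dual update (8)-(9): move weight toward the objective with the largest
    normalized error, then renormalize lam onto the unit simplex.
    `lam` and `errors` are 1-D arrays aligned with the betas in Omega."""
    lam_star = np.zeros_like(lam)
    lam_star[np.argmax(errors)] = 1.0   # indicator of the worst objective
    lam = lam + rho * lam_star          # step (8)
    return lam / lam.sum()              # normalization (9)
```

For instance, with `lam = [0.5, 0.5]`, `errors = [1.2, 0.9]` and `rho = 0.5`, the first objective is the worst, and the update returns `[2/3, 1/3]`.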

## 4 Numerical Experiments

In this section, we apply our algorithms to several datasets. In all cases, we perform 1000 iterations. All tests are performed using Matlab R2015a on a laptop with an Intel Core i7-7500U CPU @ 2.9GHz and 24GB RAM. The code is available from https://sites.google.com/site/nicolasgillis/code.

### 4.1 MO-NMF: Examples of Pareto frontier on synthetic data

In this section, we illustrate the use of Algorithm 1 to compute Pareto-optimal solutions. We focus on the case $\Omega \subseteq \{0, 1, 2\}$, that is, the IS- and KL-divergences and the Frobenius norm. Note, however, that our algorithm and code can deal with any $\beta$ and any finite set $\Omega$.

We generate the input matrix $X$ as follows: $X = \max(0, X_t + N)$, where the component matrices $W_t$, $H_t$ and the noise matrix $N$ are generated as follows:

• The entries of $W_t$ and $H_t$ are generated using the uniform distribution on the interval [0,1]. We define $X_t = W_t H_t$, which is the noiseless low-rank matrix.

• Let us define $x_\beta = 1$ if $\beta \in \Omega$, and $x_\beta = 0$ otherwise. Let also

$$\tilde{N} = x_0 \frac{N_{\mathrm{IS}}}{\|N_{\mathrm{IS}}\|_F} + x_1 \frac{N_{\mathrm{KL}}}{\|N_{\mathrm{KL}}\|_F} + x_2 \frac{N_{\mathrm{F}}}{\|N_{\mathrm{F}}\|_F},$$

where

• $N_{\mathrm{IS}}$ is multiplicative Gamma noise, where each entry is generated using the normal distribution of mean 0 and variance 1,

• each entry of $N_{\mathrm{KL}}$ is generated using the Poisson distribution of parameter 1,

• each entry of $N_{\mathrm{F}}$ is generated using the normal distribution of mean 0 and variance 1.

We set $N = 0.2 \, \|X_t\|_F \, \frac{\tilde{N}}{\|\tilde{N}\|_F}$.

Finally, $X$ is the low-rank matrix $X_t$ contaminated with 20% of noise (that is, $\|N\|_F = 0.2 \|X_t\|_F$) and then projected onto the nonnegative orthant. The noise is constructed using the distributions corresponding to the $\beta$'s in $\Omega$.

Figure 1 shows the Pareto-optimal solutions for MO-NMF. More precisely, it provides the solutions of the problems

$$\min_{W, H \geq 0} \bar{D}_\Omega^\lambda(X, WH),$$

where the weights $\lambda$ are varied over the unit simplex for pairs of objectives in $\Omega$. To simplify computation, we have used the true underlying solution as the initialization (random or other initializations sometimes generate solutions that are not on the Pareto frontier, because NMF may have many local minima). The Pareto frontier is as expected: the smallest possible value for each objective is 1 (because of the scaling), for which the other objective function is the largest. As $\lambda$ changes, one objective increases while the other decreases. The DR-NMF solution finds the point on the Pareto frontier where the two normalized objectives are equal.

For DR-NMF, we observe that

• The solution of DR-NMF does not necessarily coincide with a value of $\lambda$ close to $(0.5, 0.5)$, as illustrated by the case of the IS divergence combined with the Frobenius norm.

• Using DR-NMF allows us to obtain a solution with low error for both objectives, always at most 2% worse than the lowest error. Minimizing a single objective sometimes leads to a solution with error up to 35% higher than the lowest (in the case of the IS divergence with the Frobenius norm). We will observe a similar behaviour on real datasets.

### 4.2 Sparse document datasets: Ω={1,2}

For sparse datasets, it is known that only the $\beta$-divergences with $\beta \in \{1, 2\}$ can exploit the sparsity structure. In fact, in all other cases, all entries of the product $WH$ have to be computed explicitly, which is impractical for large sparse matrices since $WH$ can be dense. In other words, let $K$ denote the number of nonzero entries of $X$. Then the MU for NMF with the $\beta$-divergence for $\beta \in \{1, 2\}$ can be run in $O(rK)$ operations, while for the other values of $\beta$ it requires $O(rmn)$ operations.
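To illustrate the complexity argument, here is a sketch (using SciPy sparse matrices, with naming of our own choosing) of the KL update of $H$ that evaluates $WH$ only at the nonzero positions of $X$, so the cost is $O(rK)$ rather than $O(rmn)$.

```python
import numpy as np
import scipy.sparse as sp

def kl_mu_H_sparse(X, W, H, eps=1e-16):
    """KL-divergence MU for H touching WH only at the nonzeros of sparse X."""
    Xc = X.tocoo()
    i, j = Xc.row, Xc.col
    # (WH)_ij at the K nonzero positions only: O(rK) work and memory
    wh_nnz = np.einsum('kr,rk->k', W[i, :], H[:, j])
    ratio = Xc.data / np.maximum(wh_nnz, eps)            # [X]/[WH] on the support
    R = sp.coo_matrix((ratio, (i, j)), shape=X.shape).tocsr()
    num = (R.T @ W).T                                    # W^T ([X]/[WH]), r-by-n
    den = W.sum(axis=0)[:, None]                         # W^T e_m e_n^T: column sums of W
    return np.maximum(eps, H * num / den)
```

On a dense matrix with zeros in the same positions, this produces the same update as the dense KL rule, but it never forms the dense $m \times n$ product $WH$.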

As explained in [chi2012tensors], for such sparse and discrete matrices, Poisson noise is the most appropriate noise model (in fact, Gaussian noise does not make much sense on sparse datasets). Hence we expect KL-NMF to provide better results than other forms of NMF.

In this section, we use the 15 sparse document datasets from [ZG05]. These are large and highly sparse matrices whose entries record the number of times each word appears in each document. We apply KL-NMF, Fro-NMF and DR-NMF with $\Omega = \{1, 2\}$. To simplify the comparison, reduce the computational load and have a good initial solution, we use the same initial matrices in all cases, namely the recently proposed SVD-based initialization for NMF [syed2018improved]. We perform rank-$r$ factorizations, where $r$ is the number of classes reported in these datasets. Table 1 reports the results. The first and second columns report the name of the dataset and the number of classes, respectively. The next three columns report the accuracy of the clustering obtained with the factorizations produced by KL-NMF, Fro-NMF and DR-NMF with $\Omega = \{1, 2\}$, respectively. Given the true disjoint clusters $C_i$ for $1 \leq i \leq r$ and a computed disjoint clustering $\{\tilde{C}_i\}_{i=1}^r$, its accuracy is defined as

$$\text{accuracy}\big(\{\tilde{C}_i\}_{i=1}^r\big) = \max_{\pi \in \Pi[1, 2, \ldots, r]} \frac{1}{m} \sum_{i=1}^r \big| C_i \cap \tilde{C}_{\pi(i)} \big|,$$

where $\Pi[1, 2, \ldots, r]$ is the set of permutations of $\{1, 2, \ldots, r\}$ and $m$ is the number of documents. For simplicity, given an NMF $(W, H)$ where each column of $W$ corresponds to a topic, we cluster the documents by selecting the largest entry in the corresponding column of $H$ (after normalizing each column of $W$ so that its entries sum to one). The next two columns report how much higher the KL error (in percent) of the solutions of Fro-NMF and DR-NMF is compared to KL-NMF, that is, they report

$$\bar{D}_1(X, WH) - 1 = \frac{\mathrm{KL}(X, WH)}{\mathrm{KL}(X, W_1 H_1)} - 1,$$

where $(W_1, H_1)$ is the solution computed by KL-NMF. The last two columns report how much higher the Frobenius error (in percent) of the solutions of KL-NMF and DR-NMF is compared to Fro-NMF, that is,

$$\bar{D}_2(X, WH) - 1 = \frac{\|X - WH\|_F^2}{\|X - W_2 H_2\|_F^2} - 1,$$

where $(W_2, H_2)$ is the solution computed by Fro-NMF.
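The accuracy above can be computed by brute force over the $r!$ permutations when $r$ is small (for larger $r$, the Hungarian algorithm would be used instead). A small sketch with label vectors, using an interface of our own choosing:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, pred_labels, r):
    """Best-match accuracy: max over permutations pi of the fraction of
    points whose permuted predicted cluster matches the true cluster."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    best = 0.0
    for pi in permutations(range(r)):     # r! permutations: fine for small r
        mapped = np.array([pi[c] for c in pred_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best
```

E.g., `clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0], 2)` returns 1.0, since swapping the two labels gives a perfect match.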

We observe the following:

• In terms of clustering, DR-NMF is indeed robust in the sense that, in all cases but one, it provides at least the second-highest clustering accuracy. On two datasets, it even provides the highest accuracy. Globally, DR-NMF does not perform as well as KL-NMF, although on average their accuracies differ by only 3.58%. However, it significantly outperforms Fro-NMF, with 9.62% higher accuracy on average.

• In terms of error, as already noted in the previous section, DR-NMF is able to simultaneously provide solutions with small KL and Frobenius error, on average 2.92% higher than the solution computed with a single objective. On the other hand, optimizing a single objective often leads to very large errors for the other one, up to 142% on NG20, with an average 18.45% for Fro-NMF and 32.17% for KL-NMF.

### 4.3 Dense time-frequency matrices of audio signals: Ω={0,1}

NMF has been used successfully to separate sources from a single audio recording. However, there is a debate in the literature as to whether the KL or the IS divergence should be used; see, e.g., [virtanen2007monaural, fevotte2009nonnegative] and the references therein. In fact, as we will see, IS-NMF and KL-NMF provide rather different results on different audio datasets. On one hand, due to its insensitivity to scaling (see Section 2.1), IS-NMF gives the same relative importance to all entries of the data matrix; e.g., the error for approximating 1 by 10 is the same as for approximating 10 by 100, that is, $D_0(1, 10) = D_0(10, 100)$. On the other hand, KL-NMF gives more importance to larger entries as it is (linearly) sensitive to scaling; e.g., the error for approximating 1 by 10 is ten times smaller than for approximating 10 by 100, that is, $D_1(10, 100) = 10 \, D_1(1, 10)$. We illustrate this difference on the spectrogram of an audio signal in Section 4.3.2.

#### 4.3.1 Quantitative results

Using our DR-NMF approach, we can overcome the issue of having to choose between the IS- and KL-divergences by generating solutions that possess small IS and KL errors simultaneously. Table 2 gives the errors for the three different approaches on 10 diverse audio datasets.

For these datasets, the results are even more striking than for the sparse text datasets in Section 4.2. In particular, DR-NMF has on average an error about 10% higher than both IS-NMF and KL-NMF, while KL-NMF (resp. IS-NMF) has on average an increase of 147% in IS error (resp. 186% in KL error). As we will see in the next section, using DR-NMF leads to much more robust results than using IS-NMF or KL-NMF alone.

#### 4.3.2 Qualitative results

In the previous section, we presented quantitative results showing that DR-NMF is able to obtain solutions with low KL and IS divergences simultaneously. In this section, we investigate the dataset pianoMary in more detail and show that DR-NMF also leads to better separation in three comparative studies described in detail below: (1) no noise added to the signal, (2) Poisson noise added, and (3) Gamma noise added. This dataset is the first 4.7 seconds of “Mary had a little lamb”. The sequence is composed of three notes. The short-time Fourier transform (STFT) of the downsampled input signal is computed using a Hamming window, leading to a temporal resolution of 32 ms and a frequency resolution of 31.25 Hz, with 50% overlap between two frames. Figure 2 displays the musical score, the time-domain signal and its amplitude spectrogram.

##### No added noise

Figure 3 displays the evolution of the IS- and KL-divergences along iterations, the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS- and KL-divergences, and for DR-NMF with $\Omega = \{0, 1\}$. As expected, DR-NMF is able to compute a solution with low IS and KL errors, which is not the case of IS-NMF and KL-NMF (in particular, KL-NMF has an IS error almost 9 times larger than IS-NMF).

However, the three solutions generated by IS-NMF, KL-NMF and DR-NMF all give a nice separation, with similar results for $W$ and $H$. The three notes are extracted, and a fourth component (last column of $W$ and last row of $H$ in Figure 3) captures the very first offset of each note in the musical sequence. This numerical result makes sense and corresponds to some common mechanical vibration acting in the piano just before triggering a specific note. In order to validate the nature of the three source estimates, Table 3 gives the frequency peaks corresponding to the one-lined, two-lined and three-lined octaves obtained by the three NMF solutions (which in this case coincide) compared to the theoretical values of equal temperament. As can be observed, the peaks for the three notes are nicely estimated. Furthermore, the activation coefficients (rows of $H$) are coherent with the sequence of the notes.

Figure 4 shows the amplitude spectrogram of the input signal and the reconstructed spectrograms obtained with IS-NMF, KL-NMF and DR-NMF, respectively. Two regions (low frequency and high frequency) of the spectrograms are also highlighted; see Figures 4-(b) and 4-(c). One can see that DR-NMF takes advantage of both the IS- and KL-divergences to accurately reconstruct the amplitude spectrogram in both the high-frequency and low-frequency regions, whereas KL-NMF has a better reconstruction only for low frequencies (for which amplitudes are larger) and IS-NMF only for high frequencies (for which amplitudes are lower).

##### Poisson noise

The second comparative study is performed on the same dataset with Poisson noise added to the input audio spectrogram, following the methodology described in Section 4.1. Figure 5 displays the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS- and KL-divergences, and for DR-NMF with $\Omega = \{0, 1\}$.

As expected with this noise model, IS-NMF is not able to extract the three notes, while KL-NMF and DR-NMF correctly identify the three sources, with similar results for $W$ and $H$. This illustrates that DR-NMF is robust to different types of noise (in this case, Poisson noise).

##### Gamma noise

The third comparative study is performed on the same dataset with multiplicative Gamma noise, according to the methodology described in Section 4.1. For this experiment, we overestimate the number of sources present in the input spectrogram by choosing $r = 5$; this highlights the differences between the NMF variants better. Figure 6 displays the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS- and KL-divergences, and for DR-NMF with $\Omega = \{0, 1\}$.

KL-NMF identifies five sources, among which one has no physical meaning and seems to be a mixture of several notes. IS-NMF correctly identifies the three notes; the fourth estimate (common offset) is less accurately estimated in terms of the amplitude of the activations, but IS-NMF is able to set the fifth estimate to zero, which is appealing as it automatically removes an unnecessary component. DR-NMF again takes advantage of both divergences: it extracts the three notes correctly, the fourth estimate (common offset) is well extracted, and the fifth estimate is close to zero. This again illustrates that DR-NMF is robust to different types of noise (in this case, multiplicative Gamma noise).

## 5 Conclusion and further work

In this paper, we have proposed, for the first time, a multi-objective model for NMF that takes into account several data fitting terms. We then proposed to tackle this problem with a weighted-sum approach with carefully chosen weights, and designed MU to minimize the corresponding objective function. We used this model to design a DR-NMF algorithm that computes NMF solutions with low reconstruction errors with respect to several objective functions. We illustrated the effectiveness of this approach on synthetic, document and audio datasets. For audio datasets, DR-NMF provided particularly striking results, obtaining solutions with significantly lower IS and KL errors (simultaneously), while generating meaningful solutions under different noise models or statistics. It is our hope that the algorithm for DR-NMF that we proposed in Algorithm 2 resolves the long-standing debate [virtanen2007monaural, fevotte2009nonnegative] on whether to use IS- or KL-NMF for audio datasets: DR-NMF provides a safe alternative when one is uncertain of the noise statistics of audio datasets, and the noise statistics are rarely, if at all, known in practice.

Possible directions for further research include the design of more efficient algorithms to solve multi-objective NMF, the extension of our distributionally robust model to low-rank tensor decompositions, and the refinement of our model by adding penalty terms or constraints to exploit properties such as sparsity, smoothness or minimum volume [cichocki2009nonnegative, gillis2014, fu2018nonnegative] in the decompositions. Another challenging direction of research is to consider the DR-NMF problem with a countably infinite uncertainty set $\Omega$.