1 Introduction
Nonnegative matrix factorization (NMF) consists in the following problem: given a nonnegative matrix $X \in \mathbb{R}^{m \times n}_+$ and a factorization rank $r$ (a positive integer), find two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ such that $X \approx WH$. NMF is a linear dimensionality reduction technique for nonnegative data. In fact, assuming each column of $X$ is a data point, it is reconstructed via a linear combination of basis elements given by the columns of $W$, while the columns of $H$ provide the weights (or coefficients) to reconstruct each column of $X$ within that basis, that is, $X(:, j) \approx W H(:, j)$ for all $j$.
NMF has attracted a lot of attention since the seminal paper of Lee and Seung [lee1999learning], with applications in image analysis, document classification and music analysis, to cite a few; see, e.g., [cichocki2006new, gillis2014] and the references therein. Many NMF models have been proposed over the years. They mostly differ in two aspects:

Additional constraints are added to the factor matrices $W$ and $H$, such as sparsity [hoyer2004non], spatial coherence [liu2011approach] or smoothness [essid2013smooth]. These constraints are motivated by a priori information on the sought solution and depend on the application at hand. Note that these additional constraints are in most cases imposed via a penalty term in the objective function.

The choice of the objective function that assesses the quality of an approximation by evaluating some distance between $X$ and $WH$ differs. This choice is usually motivated by the noise model/statistics assumed on the data matrix $X$. The most widely used class of objective functions is componentwise and based on the $\beta$-divergences, defined as follows: for $x \ge 0$ and $y > 0$,
$$d_\beta(x \,|\, y) = \frac{x^\beta + (\beta - 1)\, y^\beta - \beta\, x\, y^{\beta - 1}}{\beta (\beta - 1)} \quad \text{for } \beta \in \mathbb{R} \setminus \{0, 1\},$$
with the limit cases $d_1(x \,|\, y) = x \log\frac{x}{y} - x + y$ and $d_0(x \,|\, y) = \frac{x}{y} - \log\frac{x}{y} - 1$.
We will use the following matrixwise notation: $D_\beta(X, WH) = \sum_{i,j} d_\beta\big(X_{ij} \,\big|\, (WH)_{ij}\big)$.
The following special cases are of particular interest (see for example [fevotte2009nonnegative] for a discussion):

$\beta = 2$: $D_2(X, WH) = \frac{1}{2}\|X - WH\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm (additive Gaussian noise).

$\beta = 1$: $D_1$ is the Kullback-Leibler (KL) divergence (additive Poisson noise).

$\beta = 0$: $D_0$ is the Itakura-Saito (IS) divergence (multiplicative Gamma noise).
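To make the special cases above concrete, the following minimal NumPy sketch (the function name `beta_divergence` is ours, chosen for illustration; the paper's code is in Matlab) evaluates the $\beta$-divergence and checks its behaviour under scaling, a property discussed further in Section 2.1:

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Componentwise beta-divergence d_beta(x | y), summed over all entries.

    beta = 2: half squared Frobenius norm; beta = 1: KL; beta = 0: IS.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 2:
        return 0.5 * np.sum((x - y) ** 2)
    if beta == 1:  # Kullback-Leibler (with the affine terms making it nonnegative)
        return np.sum(x * np.log(x / y) - x + y)
    if beta == 0:  # Itakura-Saito
        return np.sum(x / y - np.log(x / y) - 1)
    return np.sum((x ** beta + (beta - 1) * y ** beta - beta * x * y ** (beta - 1))
                  / (beta * (beta - 1)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 2.5])
# IS (beta = 0) is invariant to scaling: d_0(c*x | c*y) = d_0(x | y)
assert np.isclose(beta_divergence(10 * x, 10 * y, 0), beta_divergence(x, y, 0))
# Frobenius (beta = 2) scales quadratically: d_2(c*x | c*y) = c^2 * d_2(x | y)
assert np.isclose(beta_divergence(10 * x, 10 * y, 2), 100 * beta_divergence(x, y, 2))
```

The two assertions illustrate why the IS divergence is scale-invariant while the Frobenius norm is not, which is what makes the normalization of Section 2.1 necessary.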

In this paper, we focus on the second aspect, namely the choice of the objective function. We consider a multi-objective NMF (MONMF) formulation based on a weighted sum of the different objective functions, which is standard in multi-objective optimization; see, for example, [marler2010]. Our main motivation to consider this class of models is that in many applications it is not clear which objective function to use because the statistics of the noise are unknown. To the best of our knowledge, there are currently three main classes of methods to handle this situation:

The user chooses the objective function she/he believes is the most suitable for the application at hand. This is, as far as we know, the simplest and most widely used approach; however, it is ad hoc.

The objective function is automatically selected using cross-validation, where the training is done on a subset of the entries of the input data matrix and the testing on the remaining entries; see, e.g., [mollah2007robust, choi2010learning].

The most suitable objective function is chosen using some statistically motivated criteria such as score matching [lu2012selecting] or maximum likelihood [dikmen2015learning].
However, in all the above approaches, if the choice of the objective function is wrong, the NMF solution provided could be far from the desired solution (as we will show in our numerical experiments in Section 4). Another possibility, which we propose in this paper, is to compute an NMF solution that is robust to different types of noise distributions; this is referred to as distributionally robust, and is closely related to robust optimization [ben2009robust]. In mathematical terms, we will consider the problem
$$\min_{W \ge 0,\, H \ge 0} \; \max_{\beta \in \mathcal{B}} \; D_\beta(X, WH),$$
where $\mathcal{B}$ is a subset of $\beta$'s of interest. As we will see, this problem can be tackled by minimizing a weighted sum of the different objective functions [marler2010], exactly as for MONMF, but where the weights assigned to the different objective functions are automatically tuned within the iterative process.
Outline of the paper
In Section 2, we first define MONMF and explain how to scale the objective functions to make the constituent NMF objective functions comparable. Then we give our main motivation for considering MONMF, namely to be able to compute distributionally robust NMF (DRNMF) solutions, that is, solutions that minimize the largest objective function value. In Section 3, we propose simple multiplicative updates (MU) to tackle a weighted-sum approach for MONMF. We then show how it can be used to solve the DRNMF problem. Finally, we illustrate in Section 4 the effectiveness of our approach on synthetic, document and audio datasets.
2 MultiObjective NMF
Let $\mathcal{B}$ be a finite subset of $\mathbb{R}$. We consider in this paper the following multi-objective NMF (MONMF) problem:
$$\min_{W \ge 0,\, H \ge 0} \; \big\{ D_\beta(X, WH) \big\}_{\beta \in \mathcal{B}}.$$
Note that we focus on $\beta$-divergences to simplify our presentation and because these are the most widely used divergences to measure the “distance” between the given matrix $X$ and its approximation $WH$ in the NMF literature. However, our approach can be adapted to other objective functions (e.g., $\alpha$-divergences [cichocki2009nonnegative]). To tackle this problem, we consider the standard weighted-sum approach [deb2014multi], which consists in solving the following minimization problem involving a single objective function
with $\lambda_\beta \ge 0$ for all $\beta \in \mathcal{B}$ and $\sum_{\beta \in \mathcal{B}} \lambda_\beta = 1$. Using different values for $\lambda$ allows one to generate different Pareto-optimal solutions; see Section 4.1 for some examples. Note, however, that it does not allow one to generate all Pareto-optimal solutions [deb2014multi]. A Pareto-optimal solution is a solution that is not dominated by any other solution, that is, $(W, H)$ is a Pareto-optimal solution if there does not exist a feasible solution $(W', H')$ such that

$D_\beta(X, W'H') \le D_\beta(X, WH)$ for all $\beta \in \mathcal{B}$, and

there exists $\beta \in \mathcal{B}$ such that $D_\beta(X, W'H') < D_\beta(X, WH)$.
Multi-objective optimization has already been considered for NMF problems. However, most of the existing literature considers combining a single data fitting term with penalty terms on the factor matrices, e.g., an $\ell_1$ penalty to obtain sparse solutions [gong2018multiobjective]. As far as we know, the only paper where several objectives are used to balance different data fitting terms is [zhu2016biobjective]. The authors therein combined two objectives, one being a standard data fitting term (more precisely, the Frobenius norm) and the other being a data fitting term in a feature space obtained using a nonlinear kernel (that is, an error term measured in the norm of the feature space). Hence, this approach is rather different from ours, where we allow more than two objectives and where we only focus on the input space. Moreover, we optimize the weights in a principled optimization-theoretic fashion, whereas [zhu2016biobjective] combines the two terms in an ad hoc manner.
2.1 Scaling of the objectives
It can be easily checked that for any constant $\lambda > 0$, we have
$$D_\beta(\lambda X, \lambda Y) = \lambda^\beta D_\beta(X, Y).$$
Hence, the values of the $\beta$-divergences for different values of $\beta$ depend strongly on the scaling of the input matrix. This is usually not a desirable property in practice, since most datasets are not particularly well scaled, and since scaling simply multiplies the noise by a constant, which in most cases does not change its distribution (only its parameters). Therefore, the objectives must be normalized to obtain a meaningful linear combination of several objective functions, in the sense that each term in the sum has a similar importance. This will be particularly crucial for our DRNMF model described in the next section. In fact, as we will see in Section 4, DRNMF will generate solutions that have small error for all objectives instead of just one; as such, the solutions inherit the superior qualities of the ones generated by the different divergences. We will use the following approach to scale the different objective functions. First, for each $\beta \in \mathcal{B}$, we compute a solution $(W_\beta, H_\beta)$ of NMF with objective $D_\beta$ to obtain the error $e_\beta = D_\beta(X, W_\beta H_\beta)$. Note that we can only compute this minimization in an approximate fashion because the NMF problem is NP-hard [vavasis2009complexity]. Then, we define the normalized objectives $\bar{D}_\beta = D_\beta / e_\beta$, so that $\bar{D}_\beta(X, W_\beta H_\beta) = 1$. Finally, we will only consider the MONMF problem where the objectives $D_\beta$ are replaced by their normalized versions $\bar{D}_\beta$, that is,
(1) $\min_{W \ge 0,\, H \ge 0} \; \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH).$
In Section 3, we propose a MU algorithm to tackle this problem.
2.2 Main motivation: Distributionally robust NMF
If the noise model on the data is unknown, but is known to correspond to a distribution associated with a $\beta$-divergence for some $\beta \in \mathcal{B}$ (e.g., the Tweedie distribution as discussed in [tan2013automatic]), it makes sense to consider the following distributionally robust NMF (DRNMF) problem
(2) $\min_{W \ge 0,\, H \ge 0} \; \max_{\beta \in \mathcal{B}} \; \bar{D}_\beta(X, WH).$
Note that we use $\bar{D}_\beta$, not $D_\beta$, because otherwise, in most cases, the above problem amounts to minimizing a single objective corresponding to the divergence with the largest value; cf. the discussion in Section 2.1. Let us show how DRNMF can be solved via a weighted sum of the different objective functions. We first observe that
$$\max_{\beta \in \mathcal{B}} \bar{D}_\beta(X, WH) \;=\; \max_{\lambda \in \Delta} \; \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH), \quad \text{where } \Delta = \Big\{ \lambda \ge 0 : \sum_{\beta \in \mathcal{B}} \lambda_\beta = 1 \Big\}.$$
Indeed, let $\beta^* \in \arg\max_{\beta \in \mathcal{B}} \bar{D}_\beta(X, WH)$; then the problem on the right-hand side attains its optimal value at the vector $\lambda^*$ with $\lambda^*_{\beta^*} = 1$ and $\lambda^*_\beta = 0$ for $\beta \ne \beta^*$, and we have that $\sum_{\beta} \lambda^*_\beta \bar{D}_\beta(X, WH) = \bar{D}_{\beta^*}(X, WH)$. Hence (2) can be reformulated as
(3) $\min_{W \ge 0,\, H \ge 0} \; \max_{\lambda \in \Delta} \; \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH).$
Denote $g(\lambda) = \min_{W \ge 0,\, H \ge 0} \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH)$, which is concave as the pointwise minimum of functions that are linear in $\lambda$. The dual problem of (3) is given by
(4) $\max_{\lambda \in \Delta} \; g(\lambda).$
We know that when the objective is convex with respect to $(W, H)$ and there exists a Slater point (a point in the relative interior of the feasible domain, which is clearly the case here), strong duality holds by minimax theory, that is,
$$\min_{W \ge 0,\, H \ge 0} \; \max_{\lambda \in \Delta} \; \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH) \;=\; \max_{\lambda \in \Delta} \; g(\lambda).$$
As we are considering a nonconvex optimization problem, strong duality may not hold. Assuming there exists a saddle point $(W^*, H^*, \lambda^*)$ such that
(5) $\sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, W^*H^*) \;\le\; \sum_{\beta \in \mathcal{B}} \lambda^*_\beta \bar{D}_\beta(X, W^*H^*) \;\le\; \sum_{\beta \in \mathcal{B}} \lambda^*_\beta \bar{D}_\beta(X, WH)$
for all $W \ge 0$, $H \ge 0$ and $\lambda \in \Delta$, we have that $(W^*, H^*)$ is an optimal solution of (2) and $\lambda^*$ is an optimal solution of (4).
Therefore, an optimal solution for DRNMF can be computed by solving the dual (4) to obtain $\lambda^*$, and then using the weighted-sum minimization problem in (1) with $\lambda = \lambda^*$ to compute $(W^*, H^*)$. We will adopt this approach in Section 3 to design an algorithm for DRNMF in (2).
3 Multiplicative updates for (1)
In this section, we propose MU for (1), which we will be able to use to tackle MONMF and DRNMF. As with most NMF algorithms, we use an alternating strategy, that is, we first optimize over the variable $W$ for fixed $H$ and then reverse their roles. By the symmetry of the problem ($X \approx WH \Leftrightarrow X^\top \approx H^\top W^\top$), we focus on the update of $W$; the update of $H$ can be obtained similarly.
3.1 Deriving MU
Let us recall the standard way MU are derived (see, e.g., [lee2001algorithms, fevotte2009nonnegative, yang2011unified]) on the following general optimization problem with nonnegativity constraints
(6) $\min_{x \ge 0} f(x).$
Let us apply a rescaled gradient descent method to (6), that is, use the following update
$$x' = x - S \nabla f(x),$$
where $x$ is the current iterate, $x'$ is the next iterate, and $S$ is a diagonal matrix with positive diagonal elements. Let $\nabla^+ f(x) \ge 0$ and $\nabla^- f(x) \ge 0$ be such that $\nabla f(x) = \nabla^+ f(x) - \nabla^- f(x)$. Taking $S_{kk} = x_k / [\nabla^+ f(x)]_k$ for all $k$, we obtain the following MU rule:
(7) $x' = x \circ \big( \nabla^- f(x) \oslash \nabla^+ f(x) \big),$
where $\circ$ (resp. $\oslash$) refers to componentwise multiplication (resp. division) between two vectors or matrices. Note that we need strict positivity of $\nabla^+ f(x)$ and $\nabla^- f(x)$, otherwise we would encounter divisions by zero or variables directly set to zero, which is not desirable. Using the above simple rule with proper choices for $\nabla^+ f$ and $\nabla^- f$ leads to algorithms that are, in many cases, guaranteed to not increase the objective function, that is, $f(x') \le f(x)$; see below for some examples, and [yang2011unified] for a discussion and a unified rule to design such updates. This is a desirable property since it avoids any line search procedure and also preserves nonnegativity naturally. If we cannot guarantee that the updates are nonincreasing, the step length can be reduced, that is, use
$$x' = x - \gamma S \nabla f(x)$$
for some $\gamma \in (0, 1]$, which leads to
$$x' = x \circ \Big( (1 - \gamma)\, e + \gamma \, \nabla^- f(x) \oslash \nabla^+ f(x) \Big),$$
where $e$ is the vector of all ones. For example, one can repeatedly halve the step size $\gamma$ until the error decreases; such a $\gamma$ is guaranteed to exist since the rescaled gradient direction is a descent direction. We implemented such a line search; see Algorithm 1 below. Note that this idea is similar to that in [lin2007convergence]. Moreover, it would be worth investigating the use of regularizers to guarantee convergence to stationary points without the use of a line search [zhao2018unified].
For $x_k = 0$, we have $x'_k = 0$ and the MU are not able to modify $x_k$: this is the so-called zero-locking phenomenon [berry2007algorithms]. A possible way to fix this issue in practice is to use a small lower bound $\epsilon > 0$ on the entries of $x$, replacing the update (7) with $x' = \max\big(\epsilon,\, x \circ ( \nabla^- f(x) \oslash \nabla^+ f(x) )\big)$. This allows such algorithms to be guaranteed to converge to a stationary point [gillis2011nonnegative, takahashi2014global]; more precisely, any sequence of solutions generated by the modified MU has at least one convergent subsequence, and the limit of any convergent subsequence is a stationary point [takahashi2014global]. Moreover, it can also be shown that such stationary points are close to stationary points of the original problem (6) [gillis2011nonnegative, Chap. 4.1]. We will use this simple strategy in this paper.
3.2 Multiplicative Updates for (1)
We now provide more details on how to choose $\nabla^+ f$ and $\nabla^- f$ for the family of $\beta$-divergences in order to tackle (1). For all $\beta$, we have
$$\nabla_W D_\beta(X, WH) = \Big( (WH)^{[\beta - 1]} - X \circ (WH)^{[\beta - 2]} \Big) H^\top,$$
where $\nabla_W$ denotes the gradient with respect to the variable $W$, and $A^{[p]}$ is the componentwise exponentiation by $p$ of the matrix $A$. To derive MU as described in the previous section, the standard choice in the literature is the following [tan2013automatic]:
$$\nabla^+_W D_\beta = (WH)^{[\beta - 1]} H^\top \quad \text{and} \quad \nabla^-_W D_\beta = \Big( X \circ (WH)^{[\beta - 2]} \Big) H^\top.$$
For example, plugging these in (7) gives for the Frobenius norm ($\beta = 2$),
$$W \leftarrow W \circ \big( X H^\top \big) \oslash \big( W H H^\top \big),$$
and for the KL divergence ($\beta = 1$),
$$W \leftarrow W \circ \big( ( X \oslash (WH) ) H^\top \big) \oslash \big( \mathbf{1}_{m \times n} H^\top \big),$$
where $\mathbf{1}_{m \times n}$ is the $m$-by-$n$ all-ones matrix. These correspond to the MU from [lee2001algorithms], which are guaranteed to not increase the objective function. This was also shown to hold for other values of $\beta$; see [kompass2007generalized]. To solve (1) using MU, we simply use the linear combination of the above standard choices, that is,
$$\nabla^+_W = \sum_{\beta \in \mathcal{B}} \lambda_\beta \nabla^+_W \bar{D}_\beta \quad \text{and} \quad \nabla^-_W = \sum_{\beta \in \mathcal{B}} \lambda_\beta \nabla^-_W \bar{D}_\beta.$$
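As an illustration, here is a minimal NumPy sketch (ours, not the authors' Matlab implementation) of one multiplicative update of $W$ for the weighted objective, using the combined numerator/denominator above; the dictionary `lambdas` maps each $\beta$ to its weight $\lambda_\beta$, and for simplicity the normalization constants $1/e_\beta$ are assumed to be folded into the weights:

```python
import numpy as np

def mu_update_W(X, W, H, lambdas, eps=1e-12):
    """One multiplicative update of W for sum_beta lambdas[beta] * D_beta(X, WH),
    using the standard split grad = grad_plus - grad_minus for beta-divergences."""
    WH = np.maximum(W @ H, eps)
    num = np.zeros_like(W)  # weighted sum of the grad_minus terms
    den = np.zeros_like(W)  # weighted sum of the grad_plus terms
    for beta, lam in lambdas.items():
        num += lam * ((X * WH ** (beta - 2)) @ H.T)
        den += lam * (WH ** (beta - 1) @ H.T)
    # for lambdas = {2: 1}, this reduces to the Lee-Seung Frobenius update
    return np.maximum(W * num / np.maximum(den, eps), eps)

rng = np.random.default_rng(0)
X = rng.random((20, 15)) + 0.1
W = rng.random((20, 4)) + 0.1
H = rng.random((4, 15)) + 0.1
lambdas = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # equal weights on IS, KL, Frobenius
for _ in range(5):
    W = mu_update_W(X, W, H, lambdas)
```

The small lower bound `eps` implements the anti-zero-locking strategy of Section 3.1; the line search of Algorithm 1 is omitted from this sketch.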
Algorithm 1 summarizes the MU for the update of $W$. Note that the line search procedure (steps 3 to 6) is very rarely entered: in all our numerical experiments described in Section 4, we have only observed it when all the weight is on $\beta = 0$, that is, only for ISNMF alone.
Because of the step length procedure that guarantees the objective function to not increase (steps 3 to 6), using Algorithm 1 in an alternating scheme to solve (1), updating $W$ and $H$ alternately, is guaranteed to not increase the objective function. Since the objective function is bounded below, this guarantees that the objective function value converges as the number of iterations grows.
3.3 Algorithm for DRNMF
As explained in Section 2, DRNMF can be tackled by solving
$$\min_{W \ge 0,\, H \ge 0} \; \max_{\lambda \in \Delta} \; \sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH).$$
Given $\lambda$, we update $W$ and $H$ using the MU to decrease the value of $\sum_{\beta \in \mathcal{B}} \lambda_\beta \bar{D}_\beta(X, WH)$; see Algorithm 1. For a fixed $(W, H)$, let $\beta^* \in \arg\max_{\beta \in \mathcal{B}} \bar{D}_\beta(X, WH)$. This means that $\bar{D}_{\beta^*}$ attains the maximum in (2). Therefore, since we are trying to solve (2), the divergence $\bar{D}_{\beta^*}$ should be given more importance at the next iteration; this forces the maximum to decrease. This can be achieved by increasing the corresponding entry of $\lambda$.
Hence, at each step, we update $\lambda$ as follows
(8) $\lambda'_{\beta^*} = \lambda_{\beta^*} + \rho_t, \qquad \lambda'_\beta = \lambda_\beta \; \text{ for } \beta \ne \beta^*,$
then we normalize $\lambda'$ so that its entries sum up to one as follows:
(9) $\lambda \leftarrow \lambda' \Big/ \sum_{\beta \in \mathcal{B}} \lambda'_\beta.$
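A minimal sketch of this weight update follows (one plausible instantiation in NumPy, with hypothetical variable names; `errors` holds the scaled errors $\bar{D}_\beta(X, WH)$ for the current iterate):

```python
import numpy as np

def update_weights(lam, errors, rho_t):
    """Increase the weight of the objective currently attaining the max (8),
    then renormalize onto the simplex (9)."""
    lam = lam.copy()
    k = int(np.argmax(errors))  # index beta* of the largest scaled error
    lam[k] += rho_t             # give more importance to the worst objective
    return lam / lam.sum()      # normalization (9)

lam = np.array([1 / 3, 1 / 3, 1 / 3])
errors = np.array([1.2, 1.0, 1.05])  # hypothetical scaled errors
lam = update_weights(lam, errors, rho_t=0.1)
# the weight of the worst objective increased, the others decreased
assert np.argmax(lam) == 0 and np.isclose(lam.sum(), 1.0)
```

After normalization, every weight except the one of the worst objective is smaller than before, which matches the behaviour described below.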
In (8), $(\rho_t)_{t \ge 0}$ is an appropriately chosen sequence of step size parameters. The above procedure for updating $\lambda$ means that all entries of $\lambda$ will be decreased (relative to $\lambda_{\beta^*}$) except for the entry corresponding to $\beta^*$. If we were able to solve the subproblem in $(W, H)$ exactly and use a subgradient direction to update $\lambda$, choosing $\rho_t$ such that $\sum_t \rho_t = \infty$ and $\rho_t \to 0$ would guarantee convergence; see, e.g., [anstreicher2009two] and the references therein. However, as we are not solving the subproblems exactly (each subproblem is an NMF problem, which is NP-hard in general [vavasis2009complexity]), we use this as a heuristic. In our implementation, we used a diminishing sequence $\rho_t$, which works well in practice.
4 Numerical Experiments
In this section, we apply our algorithms to several datasets. In all cases, we perform 1000 iterations. All tests are performed using Matlab R2015a on a laptop with an Intel Core i7-7500U CPU @ 2.9GHz and 24GB RAM. The code is available from https://sites.google.com/site/nicolasgillis/code.
4.1 MONMF: Examples of Pareto frontier on synthetic data
In this section, we illustrate the use of Algorithm 1 to compute Pareto-optimal solutions. We will focus on the case $\mathcal{B} = \{0, 1, 2\}$, that is, the IS and KL divergences and the Frobenius norm. Note, however, that our algorithm and code can deal with any $\beta$ and any finite set $\mathcal{B}$.
We generate the input matrix as $X = \max(0, Y + N)$, where the low-rank matrix $Y$ and the noise matrix $N$ are generated as follows:

The entries of $W$ and $H$ are generated using the uniform distribution on the interval [0,1]. We define $Y = WH$, which is the noiseless low-rank matrix.
The noise matrix $N$ is built from one of the following three models:

multiplicative Gamma noise $N_0$, where each entry of the underlying matrix is generated using the normal distribution of mean 0 and variance 1,

additive Poisson noise $N_1$, where each entry is generated using the Poisson distribution of parameter 1,

additive Gaussian noise $N_2$, where each entry is generated using the normal distribution of mean 0 and variance 1.

The noise $N$ is set to the matrix $N_\beta$ corresponding to the assumed noise model.

Finally, $X = \max(0, Y + N)$ is the low-rank matrix $Y$ contaminated with 20% of noise (that is, the noise is scaled so that $\|N\|_F = 0.2\,\|Y\|_F$) and then projected onto the nonnegative orthant. The noise is constructed using the distribution corresponding to the chosen $\beta$.
Figure 1 shows the Pareto-optimal solutions for MONMF. More precisely, it provides the solutions of the weighted-sum problems for pairs of objectives, with weights $(\lambda, 1 - \lambda)$ where $\lambda$ ranges over $[0, 1]$. To simplify computation, we have used the true underlying solution as the initialization (random or other initializations often generate solutions that are not on the Pareto frontier because NMF may have many local minima). The Pareto frontier is as expected: the smallest possible value for each objective is 1 (because of the scaling), for which the other objective function is the largest. As $\lambda$ changes, one objective increases while the other decreases. The DRNMF solution finds the point on the Pareto frontier where the two scaled objectives are equal.
For DRNMF, we observe that

The solution of DRNMF does not necessarily correspond to a balanced value of the weight $\lambda$. For example, for the case of the IS divergence combined with the Frobenius norm, the corresponding weight is far from balanced.

Using DRNMF allows one to obtain a solution with low error for both objectives, always at most 2% worse than the lowest error. Minimizing a single objective sometimes leads to solutions with error up to 35% higher than the lowest (in the case of the IS divergence with the Frobenius norm). We will observe a similar behaviour on real datasets.
4.2 Sparse document datasets:
For sparse datasets, it is known that only the $\beta$-divergences with $\beta \in \{1, 2\}$ can exploit the sparsity structure. In fact, in all other cases, all entries of the product $WH$ have to be computed explicitly, which is impractical for large sparse matrices since $WH$ can be dense. In other words, let $\mathrm{nnz}(X)$ denote the number of nonzero entries of $X$. Then the MU for NMF with the $\beta$-divergence for $\beta \in \{1, 2\}$ can be run in $O(r \, \mathrm{nnz}(X))$ operations (up to lower-order terms), while for the other values of $\beta$, it requires $O(mnr)$ operations.
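To sketch why the KL case ($\beta = 1$) remains tractable on sparse data: the only place $WH$ is needed is inside $X \oslash (WH)$, which is supported on the nonzero entries of $X$. A minimal SciPy illustration (ours, with hypothetical function names, not the authors' code):

```python
import numpy as np
import scipy.sparse as sp

def kl_grad_minus(X, W, H):
    """grad_minus = (X / (WH)) H^T for the KL divergence (beta = 1),
    evaluating WH only at the nonzero entries of the sparse matrix X."""
    Xcoo = X.tocoo()
    i, j = Xcoo.row, Xcoo.col
    # (WH)_{ij} at the nnz positions only: sum_k W[i,k] * H[k,j]
    wh = np.einsum('ik,ki->i', W[i, :], H[:, j])
    R = sp.coo_matrix((Xcoo.data / wh, (i, j)), shape=X.shape)
    return R @ H.T  # O(r * nnz(X)) flops for this product

rng = np.random.default_rng(1)
X = sp.random(50, 40, density=0.05, random_state=1,
              data_rvs=lambda s: rng.random(s) + 0.1)
W = rng.random((50, 5)) + 0.1
H = rng.random((5, 40)) + 0.1
G = kl_grad_minus(X.tocsr(), W, H)
# matches the dense computation restricted to the support of X
Gd = (X.toarray() / (W @ H) * (X.toarray() != 0)) @ H.T
assert np.allclose(G, Gd)
```

The denominator $\nabla^+ = \mathbf{1} H^\top$ is independent of $X$ and costs only $O(nr)$, so the whole KL update indeed avoids forming the dense $m \times n$ matrix $WH$.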
As explained in [chi2012tensors], for such sparse and discrete matrices, Poisson noise is the most appropriate noise model (in fact, Gaussian noise does not make much sense on sparse datasets). Hence we expect KLNMF to provide better results than other forms of NMF.
In this section, we use the 15 sparse document datasets from [ZG05]. These are large and highly sparse matrices whose entries count the number of times each word appears in each document. We apply KLNMF, FroNMF and DRNMF with $\mathcal{B} = \{1, 2\}$. To simplify the comparison, reduce the computational load, and have a good initial solution, we use the same initial matrices in all cases, namely the recently proposed SVD-based initialization for NMF [syed2018improved]. We perform a rank-$r$ factorization where $r$ is the number of classes reported in these datasets. Table 1 reports the results. The first and second columns report the name of the dataset and the number of classes, respectively. The next three columns report the accuracy of the clustering obtained with the factorizations produced by KLNMF, FroNMF and DRNMF, respectively. Given the true disjoint clusters and a computed disjoint clustering, the accuracy is defined as the largest fraction of correctly clustered data points over all matchings of the computed clusters to the true ones, that is,
where $\Pi$ is the set of permutations of the $r$ cluster labels. For simplicity, given an NMF $X \approx WH$ where each row of $H$ corresponds to a cluster, we cluster the documents by selecting, for each document, the largest entry in the corresponding column of $H$ (after we have normalized each column of $W$ so that its entries sum to one). The next two columns report how much higher the KL error (in percent) of the solutions of FroNMF and DRNMF is compared to KLNMF, that is, they report
$$100 \cdot \frac{D_1(X, \tilde{W}\tilde{H}) - D_1(X, W_1 H_1)}{D_1(X, W_1 H_1)},$$
where $(W_1, H_1)$ is the solution computed by KLNMF and $(\tilde{W}, \tilde{H})$ is the solution of FroNMF or DRNMF. The last two columns report how much higher the Frobenius error (in percent) of the solutions of KLNMF and DRNMF is compared to FroNMF, that is, the analogous relative increase in $D_2$,
where the reference $(W_2, H_2)$ is the solution computed by FroNMF.
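The accuracy measure can be sketched as follows (a brute-force search over the $r!$ permutations; for the larger values of $r$ in Table 1, one would instead use the Hungarian algorithm, e.g., `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations
import numpy as np

def clustering_accuracy(true_labels, pred_labels, r):
    """Best-matching accuracy over all permutations of the r cluster labels."""
    n = len(true_labels)
    # confusion[k, l] = number of points with true cluster k, predicted cluster l
    confusion = np.zeros((r, r), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        confusion[t, p] += 1
    return max(sum(confusion[k, pi[k]] for k in range(r))
               for pi in permutations(range(r))) / n

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # same clustering, labels 0 and 1 swapped
assert clustering_accuracy(true, pred, 3) == 1.0
```

The example shows that the measure is invariant to relabeling of the clusters, which is the point of maximizing over permutations.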
Dataset  number of classes  Clustering accuracy (%)  KL error increase (%)  Frobenius error increase (%)  

KLNMF  FroNMF  DRNMF  FroNMF  DRNMF  KLNMF  DRNMF  
NG20  20  50.15  17.78  27.60  22.92  4.15  142.58  4.13 
NG3SIM  3  59.07  34.29  68.05  17.70  3.01  16.51  3.01 
classic  4  65.53  49.21  58.98  14.62  0.51  2.06  0.51 
ohscal  10  41.54  35.71  40.23  8.63  1.34  9.46  1.34 
k1b  6  54.40  73.50  62.35  8.85  1.46  5.25  1.46 
hitech  6  41.03  48.28  41.68  8.67  1.28  3.88  1.28 
reviews  5  78.10  45.24  75.33  8.85  0.72  6.29  0.72 
sports  7  53.48  49.24  62.60  10.30  1.34  7.66  1.34 
la1  6  70.69  45.47  66.67  9.96  1.07  3.54  1.07 
la12  6  71.24  47.91  67.75  8.52  0.68  2.63  0.68 
la2  6  70.34  51.58  68.62  8.83  0.84  3.08  0.84 
tr11  9  52.90  46.38  46.62  27.69  5.35  51.94  5.35 
tr23  6  30.39  39.71  34.80  58.08  9.71  72.63  9.70 
tr41  10  60.25  35.31  49.20  20.87  4.33  60.01  4.32 
tr45  10  56.67  38.12  31.59  42.31  8.63  94.97  8.63 
Average  57.05  43.85  53.47  18.45  2.96  32.17  2.96 
We observe the following:

In terms of clustering, DRNMF is indeed robust: in all cases but one, it provides at least the second-highest clustering accuracy. On two datasets, it even provides the highest accuracy. Globally, DRNMF does not perform quite as well as KLNMF, although on average their accuracies only differ by 3.58%. However, it significantly outperforms FroNMF, with 9.62% higher accuracy on average.

In terms of error, as already noted in the previous section, DRNMF is able to simultaneously provide solutions with small KL and Frobenius error, on average 2.92% higher than the solution computed with a single objective. On the other hand, optimizing a single objective often leads to very large errors for the other one, up to 142% on NG20, with an average 18.45% for FroNMF and 32.17% for KLNMF.
4.3 Dense timefrequency matrices of audio signals:
NMF has been used successfully to separate sources from a single audio recording. However, there is a debate in the literature as to whether the KL or the IS divergence should be used; see, e.g., [virtanen2007monaural, fevotte2009nonnegative] and the references therein. In fact, as we will see, ISNMF and KLNMF provide rather different results on different audio datasets. On one hand, due to its insensitivity to scaling (see Section 2.1), ISNMF gives the same relative importance to all entries of the data matrix; e.g., the error for approximating 1 by 10 is the same as for approximating 10 by 100, that is, $d_0(1, 10) = d_0(10, 100)$. On the other hand, KLNMF gives more importance to larger entries as it is (linearly) sensitive to scaling; e.g., the error for approximating 1 by 10 is ten times smaller than for approximating 10 by 100, that is, $d_1(10, 100) = 10\, d_1(1, 10)$. We illustrate this difference on the spectrogram of an audio signal in Section 4.3.2.
4.3.1 Quantitative results
Using our DRNMF approach, we can overcome the issue of having to choose between the IS and KL divergences by generating solutions that possess small IS and KL errors simultaneously. Table 2 gives the errors for the three different approaches on 10 diverse audio datasets:

voicecell, syntBassDrum and syntCCcyGC were downloaded from http://isse.sourceforge.net/demos.html.

preludeJSB is from the Well-Tempered Clavier performed by Glenn Gould (1/13), between the 19th and 49th seconds, downloaded from https://www.youtube.com/watch?v=IrJjPYi_vhM.

ShanHursunrise was downloaded from http://bassdb.gforge.inria.fr/fasst/.

trioBrahms and triobapitru were derived from the TRIOS dataset [fritsch2012high]; see https://c4dm.eecs.qmul.ac.uk/rdr/handle/123456789/27.

sisecmixdrums and sisecmixfemale come from the SISEC dataset; see http://sisec.wiki.irisa.fr/tikiindexbfd7.html?page=Underdetermined+speech+and+music+mixtures.

pianoMary is a recording at the third author’s house.
Dataset  $n$  IS error increase (%)  KL error increase (%)  

KLNMF  DRNMF  ISNMF  DRNMF  
syntBassDrum  543  38.18 ± 3.81  5.49 ± 2.81  114.08 ± 22.78  5.41 ± 2.96 
pianoMary  586  373.64 ± 227.30  8.07 ± 1.06  174.22 ± 35.88  7.61 ± 1.57 
preludeJSB  2582  33.81 ± 3.67  12.51 ± 1.43  151.09 ± 35.22  12.51 ± 1.43 
syntCCcyGC  1377  9.83 ± 0.46  2.33 ± 0.36  43.74 ± 6.82  2.33 ± 0.36 
trioBrahms  14813  364.66 ± 32.47  12.39 ± 2.21  365.77 ± 346.03  12.40 ± 2.21 
triobapitru  6200  349.97 ± 62.86  7.67 ± 2.26  255.60 ± 48.78  7.63 ± 2.22 
voicecell  2181  183.29 ± 17.43  13.14 ± 4.54  174.26 ± 23.15  13.14 ± 4.54 
ShanHursunrise  4102  54.71 ± 4.42  11.33 ± 1.38  185.79 ± 39.03  11.33 ± 1.38 
sisecmixdrums  1249  26.03 ± 0.84  11.36 ± 1.27  300.60 ± 40.23  11.36 ± 1.27 
sisecmixfemale  1249  37.86 ± 2.56  11.68 ± 0.95  98.73 ± 9.35  11.67 ± 0.95 
Average  147.20 ± 35.58  9.60 ± 1.83  186.39 ± 60.73  9.54 ± 1.89 
The table reports the average and standard deviation over 10 initializations.
For these datasets, the results are even more striking than for the sparse text datasets in Section 4.2. In particular, DRNMF has on average an error higher by only about 10% compared to both ISNMF and KLNMF, while KLNMF (resp. ISNMF) has on average an increase in IS (resp. KL) error of 147% (resp. 186%). As we will see in the next section, using DRNMF allows one to obtain much more robust results than using ISNMF or KLNMF alone.
4.3.2 Qualitative results
In the previous section, we presented quantitative results showing that DRNMF is able to obtain solutions with low KL and IS divergence simultaneously. In this section, we investigate the dataset pianoMary in more detail and show that DRNMF also leads to better separation in three comparative studies described in detail below: (1) no noise added to the signal, (2) Poisson noise added, and (3) Gamma noise added. This dataset is the first 4.7 seconds of “Mary had a little lamb”. The sequence is composed of three notes, namely $C$, $D$ and $E$. The short-time Fourier transform (STFT) of the downsampled input signal is computed using a Hamming window, leading to a temporal resolution of 32 ms and a frequency resolution of 31.25 Hz, with 50% overlap between two consecutive frames. Figure 2 displays the musical score, the time-domain signal and its amplitude spectrogram.
No added noise
Figure 3 displays the evolution of the IS and KL divergences along iterations, the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS and KL divergences, and for DRNMF with $\mathcal{B} = \{0, 1\}$. As expected, DRNMF is able to compute a solution with low IS and KL error, which is not the case for ISNMF and KLNMF (in particular, KLNMF has an IS error almost 9 times larger than ISNMF).
However, the three solutions generated by ISNMF, KLNMF and DRNMF all give a nice separation with similar results for $W$ and $H$. The three notes are extracted, and a fourth note (last column of $W$ and last row of $H$ in Figure 3) is the very first offset of each note in the musical sequence. This numerical result makes sense and corresponds to some common mechanical vibration acting in the piano just before triggering a specific note. In order to validate the nature of the three source estimates, Table 3 gives the frequency peaks corresponding to the one-lined, two-lined and three-lined octaves obtained by the three NMF solutions (which in this case coincide), compared with the equal-temperament theoretical values. As can be observed, the peaks for the three notes are nicely estimated. Furthermore, the activation coefficients (rows of $H$) are coherent with the sequence of the notes.
Notes/Octaves  One-lined  Two-lined  Three-lined  
$C$  Theoretical  262  523  1046.5  
Measured (NMF)  250  531.3  1031 
$D$  Theoretical  294  587  1175 
Measured (NMF)  281.3  593.8  1188 
$E$  Theoretical  330  659  1318.5 
Measured (NMF)  343.8  656.3  1313 
Figure 4 shows the amplitude spectrogram of the input signal and the reconstructed spectrograms obtained with ISNMF, KLNMF and DRNMF, respectively. Two regions (low frequency and high frequency) of the spectrograms are also highlighted; see Figures 4(b) and 4(c). One can see that DRNMF takes advantage of both the IS and KL divergences to accurately reconstruct the amplitude spectrogram in both the high-frequency and low-frequency regions, while KLNMF has a better reconstruction only for low frequencies (for which amplitudes are larger) and ISNMF only for high frequencies (for which amplitudes are lower).
Poisson noise
The second comparative study is performed on the same dataset with Poisson noise added to the input audio spectrogram, following the methodology described in Section 4.1. Figure 5 displays the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS and KL divergences, and for DRNMF with $\mathcal{B} = \{0, 1\}$.
As expected with this noise model, ISNMF is not able to extract the three notes, while KLNMF and DRNMF correctly identify the three sources with similar results for $W$ and $H$. This illustrates that DRNMF is robust to different types of noise (in this case, additive Poisson noise).
Gamma noise
The third comparative study is performed on the same dataset with multiplicative Gamma noise, according to the methodology described in Section 4.1. For this experiment, we overestimate the number of sources present in the input spectrogram by choosing a larger factorization rank; this highlights the differences between the NMF variants better. Figure 6 displays the columns of $W$ (dictionary matrix) and the rows of $H$ for NMF with the IS and KL divergences, and for DRNMF with $\mathcal{B} = \{0, 1\}$.
KLNMF identifies five sources, among which one has no physical meaning and seems to be a mixture of several notes. ISNMF correctly identifies the three notes; the fourth estimate (common offset) is less accurately estimated in terms of the amplitude of the activations, but ISNMF is able to set the fifth estimate to zero, which is appealing as it automatically removes an unnecessary component. DRNMF again takes advantage of both divergences: it extracts the three notes correctly, the fourth estimate (common offset) is well extracted, and the fifth estimate is close to zero. This again illustrates that DRNMF is robust to different types of noise (in this case, multiplicative Gamma noise).
5 Conclusion and further work
In this paper, we have proposed, for the first time, a multi-objective model for NMF that takes into account several data fitting terms. We then proposed to tackle this problem with a weighted-sum approach with carefully chosen weights, and designed MU to minimize the corresponding objective function. We used this model to design a DRNMF algorithm that allows one to obtain NMF solutions with low reconstruction errors with respect to several objective functions. We illustrated the effectiveness of this approach on synthetic, document and audio datasets. For audio datasets, DRNMF provided particularly striking results, obtaining solutions with significantly lower IS and KL errors (simultaneously), while generating meaningful solutions under different noise models or statistics. It is our hope that the DRNMF algorithm we proposed in Algorithm 2 resolves the long-standing debate [virtanen2007monaural, fevotte2009nonnegative] on whether to use ISNMF or KLNMF for audio datasets: DRNMF provides a safe alternative when one is uncertain of the noise statistics, which are rarely, if at all, known in practice.
Possible further research includes the design of more efficient algorithms to solve multi-objective NMF, the extension of our distributionally robust model to low-rank tensor decompositions, and the refinement of our model by adding penalty terms or constraints to exploit properties, such as sparsity, smoothness or minimum volume [cichocki2009nonnegative, gillis2014, fu2018nonnegative], in the decompositions. Another challenging direction of research is to consider the DRNMF problem with a countably infinite uncertainty set $\mathcal{B}$, e.g., a sequence of values of $\beta$.