Minimum-Volume Multichannel Nonnegative Matrix Factorization for Blind Source Separation

by Jianyu Wang, et al.

Multichannel blind source separation aims to recover the latent sources from their multichannel mixtures without prior information. A state-of-the-art blind source separation method called independent low-rank matrix analysis (ILRMA) unifies independent vector analysis (IVA) and nonnegative matrix factorization (NMF). However, the speech spectra modeled by NMF may not admit a compact representation, and the decomposition does not guarantee that each source is identifiable. To address this problem, we propose a modified blind source separation method that enhances the identifiability of the source model: it combines ILRMA with a minimum-volume penalty term. The proposed method is optimized by a multiplicative update rule derived under the standard majorization-minimization framework, which ensures stable convergence. Experimental results demonstrate the effectiveness of the proposed method compared with AuxIVA, MNMF, and ILRMA.






1 Introduction

Blind audio source separation separates a mixture of multiple sources into its components without prior information about the recording environment, mixing system, or source locations [5, 4, 18]. A typical approach to blind audio source separation is based on unsupervised learning of a probabilistic model. It can be categorized into single-channel source separation and multichannel source separation. This paper focuses on multichannel source separation. A multichannel source separation method usually consists of a source model representing the time-frequency structure of the source images and a spatial model representing their inter-channel covariance structure. A widely used source model is the low-rank model based on nonnegative matrix factorization (NMF), which mitigates the permutation problem. The time-frequency bins of each source in the spatial model are usually assumed to be multivariate complex Gaussian distributed.


A representative multichannel source separation method is multichannel nonnegative matrix factorization (MNMF) [23, 24, 17, 25]. It consists of a low-rank source model and a full-rank spatial model. The full-rank spatial model is capable of representing a wide variety of source directivities under echoic conditions. However, MNMF tends to get stuck at bad local optima, since a large number of unconstrained spatial covariance matrices need to be estimated iteratively. To address this problem, Kitamura et al. [15, 16] proposed independent low-rank matrix analysis (ILRMA), which makes a rank-1 assumption for the spatial model. It performs well for directional sources in practice. Essentially, the spatial model and source model of ILRMA are independent vector analysis (IVA) [14] and NMF respectively, which are optimized alternately. The aforementioned NMF-based methods, e.g. MNMF, ILRMA [16] and its variants [21], use NMF to decompose a given spectrogram into several spectral bases and temporal activations. Although the spatial properties of the source images constrain the bases of NMF toward a unique decomposition, this does not guarantee that the spectral content of each source is identifiable. Therefore, a better source model has the potential to improve the separation performance [16].

To improve the source identifiability of separation algorithms, here we propose a new geometric inference method for MNMF, named MinVol. It penalizes the columns of the spectral bases of NMF by volume minimization [9, 12], so that their convex hull has a small volume. Volume minimization factorizes a given data matrix into a basis matrix and a structured coefficient matrix by finding a minimum-volume simplex that encloses all columns of the data matrix [10]. It guarantees the identifiability of the factorized matrices under a so-called sufficiently scattered condition [19, 11]. We associate the minimum-volume penalty with the Itakura-Saito (IS) divergence for MNMF. To our knowledge, this is the first time that the minimum-volume penalty is used in MNMF. Also, the minimum-volume constraint implicitly enhances the sparsity of the temporal activations, so that many frequency bands will be located on the facets of the cone of the spectral bases. The proposed MinVol method is optimized by a multiplicative update (MU) rule under the standard majorization-minimization framework. Experimental results show that the proposed method outperforms Auxiliary-IVA (AuxIVA) [22], MNMF [24], and ILRMA [16] in speech separation tasks.

Figure 1: Principle of the proposed MinVol algorithm.

2 Methods

2.1 Problem formulation

Suppose the short-time Fourier transform (STFT) of a multichannel mixture is $\mathbf{x}_{ij} = [x_{ij,1}, \ldots, x_{ij,M}]^{\mathsf{T}} \in \mathbb{C}^{M}$, where $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $m = 1, \ldots, M$ are the indices of the frequency bins, time frames, and microphones, respectively, and $(\cdot)^{\mathsf{T}}$ denotes the transpose operator. Its source components are denoted as $\mathbf{s}_{ij} = [s_{ij,1}, \ldots, s_{ij,N}]^{\mathsf{T}}$, where $N$ is the number of sources and $n = 1, \ldots, N$ is the index of the sources.

We assume that each source of the mixture is a point source; then the mixture and its sources have the following connection:

$\mathbf{x}_{ij} = \mathbf{A}_i \mathbf{s}_{ij}, \qquad (1)$

where $\mathbf{A}_i \in \mathbb{C}^{M \times N}$ is the mixing matrix at the $i$th frequency bin. If $\mathbf{A}_i$ is invertible and $M = N$, we can find a demixing matrix $\mathbf{W}_i = \mathbf{A}_i^{-1}$ for recovering $\mathbf{s}_{ij}$.

The problem of source separation is to find an estimate of $\mathbf{A}_i^{-1}$, denoted as $\mathbf{W}_i = [\mathbf{w}_{i,1}, \ldots, \mathbf{w}_{i,N}]^{\mathsf{H}}$, such that when we apply $\mathbf{W}_i$ to $\mathbf{x}_{ij}$, we obtain the separated signal $\mathbf{y}_{ij}$:

$\mathbf{y}_{ij} = \mathbf{W}_i \mathbf{x}_{ij}, \qquad (2)$

where $(\cdot)^{\mathsf{H}}$ denotes the Hermitian transpose, and $\mathbf{y}_{ij} = [y_{ij,1}, \ldots, y_{ij,N}]^{\mathsf{T}}$ is an estimate of $\mathbf{s}_{ij}$.
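To make the demixing step concrete, the following is a minimal numpy sketch (the shapes and the identity initialization are hypothetical illustrations, not the authors' setup) of applying a per-frequency demixing matrix to the mixture STFT:

```python
import numpy as np

# Hypothetical shapes: I freq bins, J frames, M mics, N sources (M == N).
I, J, M, N = 257, 100, 2, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((I, M, J)) + 1j * rng.standard_normal((I, M, J))  # mixture STFT
W = np.stack([np.eye(N, M, dtype=complex) for _ in range(I)])             # demixing matrices

# y_ij = W_i x_ij, applied independently at every frequency bin.
Y = np.einsum('inm,imj->inj', W, X)
assert Y.shape == (I, N, J)
```

With the identity initialization used here the "separated" signal simply equals the mixture; in practice each $\mathbf{W}_i$ is estimated iteratively as described below.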

Many MNMF methods model the power spectrogram by $\mathbf{X}_{ij} = \mathbf{x}_{ij} \mathbf{x}_{ij}^{\mathsf{H}}$, and use NMF [23, 3, 24] to decompose $\mathbf{X}_{ij}$ by:

$\mathbf{X}_{ij} \approx \hat{\mathbf{X}}_{ij} = \sum_{n=1}^{N} \Big( \sum_{k=1}^{K} t_{ik,n} v_{kj,n} \Big) \mathbf{H}_{i,n}, \qquad (3)$

where $K$ is the number of bases, $t_{ik,n}$ is the element of a spectral basis matrix $\mathbf{T}_n \in \mathbb{R}_{+}^{I \times K}$ for the $n$th source, $v_{kj,n}$ is the element of a temporal activation matrix $\mathbf{V}_n \in \mathbb{R}_{+}^{K \times J}$ for the $n$th source, and $\mathbf{H}_{i,n}$ is the spatial covariance at the $i$th frequency band for the $n$th source. We denote the full representation of $\mathbf{H}_{i,n}$ at all frequency bands for all sources as a tensor $\mathcal{H}$, and the full representation of $\hat{\mathbf{X}}_{ij}$ at all time-frequency bins as a tensor $\hat{\mathcal{X}}$.
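The low-rank source model above, in which each source's power spectrogram is approximated by the product of a spectral basis matrix and a temporal activation matrix, can be sketched as follows (toy dimensions, assumed for illustration):

```python
import numpy as np

# Rank-K source model: the power spectrogram of one source is approximated
# by T @ V with nonnegative factors (spectral bases x temporal activations).
I, J, K = 6, 8, 2
rng = np.random.default_rng(1)
T = rng.random((I, K))        # spectral basis matrix (I x K)
V = rng.random((K, J))        # temporal activation matrix (K x J)
R = T @ V                     # low-rank model r_{ij} of the power spectrogram
assert R.shape == (I, J) and np.all(R >= 0)
```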

2.2 Minimum-volume multichannel source separation

Because there exist several valid solutions of (3), the decomposition of the source model of MNMF is not unique. To improve the identifiability of ILRMA (see Section 2.4 for the definition of identifiability), we propose the minimum-volume based MNMF (MinVol). The principle of MinVol is shown in Fig. 1. Its objective function is:

$\min_{\mathcal{H}, \{\mathbf{T}_n, \mathbf{V}_n\}_{n=1}^{N}} \; D(\mathcal{X} \,|\, \hat{\mathcal{X}}) + \lambda \sum_{n=1}^{N} \mathrm{vol}(\mathbf{T}_n), \quad \text{s.t. } \mathbf{1}^{\mathsf{T}} \mathbf{T}_n = \mathbf{1}^{\mathsf{T}}, \; \mathbf{T}_n \ge 0, \; \mathbf{V}_n \ge 0, \qquad (4)$

where $\mathbf{1}$ is an all-one vector and

$\mathrm{vol}(\mathbf{T}_n) = \log\det\big(\mathbf{T}_n^{\mathsf{T}} \mathbf{T}_n + \delta \mathbf{I}_K\big) \qquad (5)$

is the minimum-volume regularization, with $\delta$ a small positive constant that ensures $\mathrm{vol}(\mathbf{T}_n)$ is bounded from below, unlike the quantity $\log\det(\mathbf{T}_n^{\mathsf{T}} \mathbf{T}_n)$. $\mathbf{I}_K$ is the identity matrix with dimensions $K \times K$, and $D(\mathcal{X} \,|\, \hat{\mathcal{X}})$ is the loss of the approximation.
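As a small numerical illustration, assuming the regularizer takes the usual min-vol NMF form $\mathrm{vol}(\mathbf{T}) = \log\det(\mathbf{T}^{\mathsf{T}}\mathbf{T} + \delta\mathbf{I})$, pulling the columns of a basis matrix toward each other shrinks the volume term:

```python
import numpy as np

def volume(T, delta=0.5):
    """vol(T) = log det(T^T T + delta * I); delta keeps it bounded below."""
    K = T.shape[1]
    sign, logdet = np.linalg.slogdet(T.T @ T + delta * np.eye(K))
    return logdet  # sign is +1 since the argument is positive definite

T = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
shrunk = 0.5 * T + 0.5 * T.mean(axis=1, keepdims=True)  # columns pulled together
assert volume(shrunk) < volume(T)  # closer columns -> smaller volume
```

Without $\delta$, the regularizer would tend to $-\infty$ as the columns of $\mathbf{T}$ become linearly dependent, which is exactly the degeneracy the constant prevents.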

The reason for using the minimum-volume regularization is that minimizing the volume of $\mathbf{T}_n$ makes the columns of $\mathbf{T}_n$ as close as possible to each other within the unit simplex. The loss $D(\mathcal{X} \,|\, \hat{\mathcal{X}})$ should be chosen according to the assumed data distribution. Because we assume in this paper that the data follows a multiplicative Gamma distribution, we choose the IS divergence as the loss. The IS divergence is the only member of the $\beta$-divergence family that has the scale-invariance property, which implies that time-frequency bins with low power are as important as those with high power during the divergence computation [6].
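The scale-invariance property of the IS divergence is easy to verify numerically; the sketch below uses the elementwise definition $d_{\mathrm{IS}}(x \,|\, y) = x/y - \log(x/y) - 1$:

```python
import numpy as np

def is_div(x, y):
    """Itakura-Saito divergence d_IS(x|y) = x/y - log(x/y) - 1, summed."""
    r = x / y
    return np.sum(r - np.log(r) - 1.0)

rng = np.random.default_rng(2)
x = rng.random((4, 5)) + 0.1
y = rng.random((4, 5)) + 0.1
# Scale invariance: d_IS(c*x | c*y) = d_IS(x | y) for any c > 0, so
# low-power time-frequency bins weigh as much as high-power ones.
assert np.isclose(is_div(10.0 * x, 10.0 * y), is_div(x, y))
```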

2.3 Optimization algorithm

The objective function based on the IS divergence is formulated as:

$\mathcal{J} = \sum_{i,j} \Big[ \operatorname{tr}\big(\mathbf{X}_{ij} \hat{\mathbf{X}}_{ij}^{-1}\big) - \log\det\big(\mathbf{X}_{ij} \hat{\mathbf{X}}_{ij}^{-1}\big) - M \Big] + \lambda \sum_{n=1}^{N} \mathrm{vol}(\mathbf{T}_n). \qquad (6)$

According to ILRMA [16], the spatial covariance can be modeled under the rank-1 assumption $\mathbf{H}_{i,n} = \mathbf{a}_{i,n} \mathbf{a}_{i,n}^{\mathsf{H}}$. With this assumption, (6) can be reformulated as:

$\mathcal{J} = \sum_{i,j,n} \bigg[ \frac{|\mathbf{w}_{i,n}^{\mathsf{H}} \mathbf{x}_{ij}|^2}{r_{ij,n}} + \log r_{ij,n} \bigg] - 2J \sum_{i} \log\big|\det \mathbf{W}_i\big| + \lambda \sum_{n=1}^{N} \mathrm{vol}(\mathbf{T}_n), \qquad (7)$

where $r_{ij,n} = \sum_{k} t_{ik,n} v_{kj,n}$, the terms involving the demixing matrices $\mathbf{W}_i$ are called the spatial model, and the sum of all other terms is called the source model. The spatial and source models of the objective are optimized alternately.

For each iteration, to optimize the spatial model, an IVA-based auxiliary function [22] is used, which results in the following iterative-projection solution:

$\mathbf{U}_{i,n} = \frac{1}{J} \sum_{j} \frac{\mathbf{x}_{ij} \mathbf{x}_{ij}^{\mathsf{H}}}{r_{ij,n}}, \quad \mathbf{w}_{i,n} \leftarrow \big(\mathbf{W}_i \mathbf{U}_{i,n}\big)^{-1} \mathbf{e}_n, \quad \mathbf{w}_{i,n} \leftarrow \frac{\mathbf{w}_{i,n}}{\sqrt{\mathbf{w}_{i,n}^{\mathsf{H}} \mathbf{U}_{i,n} \mathbf{w}_{i,n}}}, \qquad (8)$

where $\mathbf{e}_n$ denotes the $n$th column vector of the identity matrix, $r_{ij,n}$ is the estimated power spectrogram of the $n$th source, and $y_{ij,n} = \mathbf{w}_{i,n}^{\mathsf{H}} \mathbf{x}_{ij}$ is the $n$th element of $\mathbf{y}_{ij}$.
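For illustration only, an AuxIVA-style iterative-projection sweep at one frequency bin can be sketched as below; the shapes, single-bin restriction, and identity initialization are assumptions for the sketch, not the authors' implementation:

```python
import numpy as np

def ip_update(W, X, R):
    """One iterative-projection sweep at a single frequency bin.
    W: (N, N) demixing matrix, X: (N, J) mixture STFT frames,
    R: (N, J) current source-model variances r_{j,n}."""
    N = W.shape[0]
    J = X.shape[1]
    for n in range(N):
        U = (X / R[n]) @ X.conj().T / J               # weighted covariance U_n
        w = np.linalg.solve(W @ U, np.eye(N)[:, n])   # w_n = (W U_n)^{-1} e_n
        w = w / np.sqrt(np.real(w.conj() @ U @ w))    # scale normalization
        W[n] = w.conj()                               # row n of W stores w_n^H
    return W

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 50)) + 1j * rng.standard_normal((2, 50))
W = ip_update(np.eye(2, dtype=complex), X, np.ones((2, 50)))
assert np.all(np.isfinite(W))
```

Each source's demixing vector is refreshed in turn while the others are held fixed, which is what makes the auxiliary-function update stable without step-size tuning.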

Substituting the solution (8) into (7) yields the following optimization objective for the $n$th source model:

$\mathcal{J}_n = \sum_{i,j} \bigg[ \frac{|y_{ij,n}|^2}{\sum_k t_{ik,n} v_{kj,n}} + \log \sum_k t_{ik,n} v_{kj,n} \bigg] + \lambda\, \mathrm{vol}(\mathbf{T}_n), \qquad (9)$

where each source model is optimized independently as follows.

Because the first term of (9) is difficult to optimize directly, we propose to optimize an auxiliary function instead. The design of the auxiliary function follows that in [7]:

Lemma 1 ([7]).

Let $d(x \,|\, y) = \breve{d}(x \,|\, y) + \hat{d}(x \,|\, y) + \bar{d}(x)$ be the decomposition of the divergence into its convex, concave, and constant parts, and let $\tilde{r}_{ij} = \sum_k \tilde{t}_{ik} v_{kj}$. Then, the function:

$G(\mathbf{T} \,|\, \tilde{\mathbf{T}}) = \sum_{i,j} \bigg[ \sum_k \frac{\tilde{t}_{ik} v_{kj}}{\tilde{r}_{ij}}\, \breve{d}\Big(x_{ij} \,\Big|\, \tilde{r}_{ij} \frac{t_{ik}}{\tilde{t}_{ik}}\Big) + \hat{d}(x_{ij} \,|\, \tilde{r}_{ij}) + \hat{d}'(x_{ij} \,|\, \tilde{r}_{ij}) \Big(\sum_k t_{ik} v_{kj} - \tilde{r}_{ij}\Big) + \bar{d}(x_{ij}) \bigg] \qquad (10)$

is an auxiliary function for $\sum_{i,j} d(x_{ij} \,|\, r_{ij})$ at $\mathbf{T} = \tilde{\mathbf{T}}$, where $\breve{d}$ is the convex part of $d$, $\hat{d}$ is the concave part of $d$, and $\bar{d}$ is the constant part of $d$; $\hat{d}'$ is the derivative of $\hat{d}$ with respect to its second argument. For the IS divergence, $\breve{d}(x \,|\, y) = x/y$, $\hat{d}(x \,|\, y) = \log y$, $\hat{d}'(x \,|\, y) = 1/y$, and $\bar{d}(x) = -\log x - 1$.

Because the second term of (9), i.e. the minimum-volume regularization, is also difficult to optimize, we exploit the concavity of $\log\det(\cdot)$ and use its first-order Taylor expansion, which constructs an upper bound:

$\mathrm{vol}(\mathbf{T}) \le \operatorname{tr}\big(\mathbf{Q}\, \mathbf{T}^{\mathsf{T}} \mathbf{T}\big) + c_1, \qquad (11)$

where $\mathbf{Q} = (\mathbf{Y} + \delta \mathbf{I}_K)^{-1}$ with $c_1$ a constant, and $\mathbf{Y}$ is an arbitrary positive semidefinite matrix. We can set $\mathbf{Y} = \tilde{\mathbf{T}}^{\mathsf{T}} \tilde{\mathbf{T}}$ in the experiments, since $\tilde{\mathbf{T}}^{\mathsf{T}} \tilde{\mathbf{T}} + \delta \mathbf{I}_K$ is a positive definite matrix. Finally, the right side of (11) is an auxiliary function for $\mathrm{vol}(\mathbf{T})$. However, it is quadratic and inseparable, which makes the problem hard to optimize over the nonnegative orthant. We therefore use a separable majorizer of the right side of (11). The non-constant part can be written as $\operatorname{tr}(\mathbf{Q} \mathbf{T}^{\mathsf{T}} \mathbf{T}) = \sum_i \mathbf{t}_i \mathbf{Q} \mathbf{t}_i^{\mathsf{T}}$, where $\mathbf{t}_i$ is the $i$th row of $\mathbf{T}$. Let $\mathbf{Q} = \mathbf{Q}^{+} - \mathbf{Q}^{-}$ with $\mathbf{Q}^{+} = \max(\mathbf{Q}, 0)$ and $\mathbf{Q}^{-} = \max(-\mathbf{Q}, 0)$. Then, the right side of (11) can be majorized as:

$\operatorname{tr}\big(\mathbf{Q} \mathbf{T}^{\mathsf{T}} \mathbf{T}\big) \le \sum_i \Big[ \mathbf{t}_i \boldsymbol{\Phi}_i \mathbf{t}_i^{\mathsf{T}} - 2\, \mathbf{t}_i \mathbf{Q}^{-} \tilde{\mathbf{t}}_i^{\mathsf{T}} \Big] + c_2, \qquad (12)$

where $\boldsymbol{\Phi}_i = \mathrm{Diag}\big((\mathbf{Q}^{+} \tilde{\mathbf{t}}_i^{\mathsf{T}}) \oslash \tilde{\mathbf{t}}_i^{\mathsf{T}}\big)$, $\oslash$ is the component-wise division between $\mathbf{Q}^{+} \tilde{\mathbf{t}}_i^{\mathsf{T}}$ and $\tilde{\mathbf{t}}_i^{\mathsf{T}}$, $\mathrm{Diag}(\cdot)$ is the diagonal matrix built from its argument, and $c_2$ is a constant.

At last, we replace the first term of (9) by (10) and the second term of (9) by (12), which results in the following auxiliary function at $\tilde{\mathbf{T}}_n$:

$G_T(\mathbf{T}_n \,|\, \tilde{\mathbf{T}}_n) = G(\mathbf{T}_n \,|\, \tilde{\mathbf{T}}_n) + \lambda \sum_i \Big[ \mathbf{t}_{i,n} \boldsymbol{\Phi}_i \mathbf{t}_{i,n}^{\mathsf{T}} - 2\, \mathbf{t}_{i,n} \mathbf{Q}^{-} \tilde{\mathbf{t}}_{i,n}^{\mathsf{T}} \Big] + c, \qquad (13)$

where $c$ is a constant for $\mathbf{T}_n$. Similarly to (13), we obtain:

$G_V(\mathbf{V}_n \,|\, \tilde{\mathbf{V}}_n) \qquad (14)$

as an auxiliary function at $\tilde{\mathbf{V}}_n$ for $\mathbf{V}_n$, constructed from Lemma 1 with the roles of $\mathbf{T}_n$ and $\mathbf{V}_n$ exchanged (the regularization term does not involve $\mathbf{V}_n$).

Setting the derivative of the auxiliary function (13) with respect to each entry $t_{ik,n}$ to zero yields a quadratic equation:

$a\, t_{ik,n}^2 + b\, t_{ik,n} + c = 0, \qquad (15)$

and solving (15) by Vieta's theorem [26], i.e. taking its nonnegative root, derives the update rule of $t_{ik,n}$. Similarly, setting the derivative of (14) to zero derives the update rule of $v_{kj,n}$:

$v_{kj,n} \leftarrow \tilde{v}_{kj,n} \left( \frac{\sum_i t_{ik,n}^{\star}\, |y_{ij,n}|^2 / \tilde{r}_{ij,n}^2}{\sum_i t_{ik,n}^{\star} / \tilde{r}_{ij,n}} \right)^{\frac{1}{2}}, \qquad (16)$

where $t_{ik,n}^{\star}$ is the solution of (15). Thus (9) is solved.
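Taking the nonnegative root of a per-entry quadratic via Vieta's relation, as the update above suggests, can be done in a numerically stable way. The helper below is a generic sketch: the coefficients a, b, c stand in for whatever the derivative of the auxiliary function produces and are not the paper's exact expressions:

```python
import numpy as np

def nonneg_root(a, b, c):
    """Nonnegative root of a*t^2 + b*t + c = 0, assuming a > 0 and c < 0
    so that the two real roots have opposite signs. Vieta's relation
    t1 * t2 = c / a recovers the second root without cancellation error."""
    disc = np.sqrt(b * b - 4.0 * a * c)
    # Compute the root with the larger magnitude first (no cancellation)...
    t1 = (-b - disc) / (2.0 * a) if b > 0 else (-b + disc) / (2.0 * a)
    t2 = (c / a) / t1  # ...then the other one via Vieta's product of roots.
    return max(t1, t2)

assert np.isclose(nonneg_root(1.0, -1.0, -2.0), 2.0)  # roots are 2 and -1
assert np.isclose(nonneg_root(1.0, 5.0, -6.0), 1.0)   # roots are 1 and -6
```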

The regularization coefficient $\lambda$ affects the model performance, so we update $\lambda$ automatically. First, the variables $\mathbf{T}_n$ and $\mathbf{V}_n$ are initialized with the successive nonnegative projection algorithm [20]; then $\lambda$ is updated by:

$\lambda \leftarrow \tilde{\lambda}\, \frac{D(\mathcal{X} \,|\, \hat{\mathcal{X}})}{\big| \sum_n \mathrm{vol}(\mathbf{T}_n) \big|}, \qquad (17)$

where $\tilde{\lambda}$ is the value of $\lambda$ at the previous iteration; its value at the first iteration should be chosen as a small positive constant.

2.4 Theoretical analysis

Similar to [8], we prove the identifiability of $(\mathbf{T}_n, \mathbf{V}_n)$ in MinVol, which theoretically supports the superiority of the proposed MinVol-ILRMA over ILRMA.

Theorem 1.

Let $(\mathbf{T}^{\star}, \mathbf{V}^{\star})$ be an optimal solution of (4). If the ground truth $(\mathbf{T}^{\natural}, \mathbf{V}^{\natural})$ satisfies the sufficiently scattered condition [8] and $\operatorname{rank}(\mathbf{T}^{\natural}) = K$, then $\mathbf{T}^{\star} = \mathbf{T}^{\natural} \boldsymbol{\Pi}$ and $\mathbf{V}^{\star} = \boldsymbol{\Pi}^{\mathsf{T}} \mathbf{V}^{\natural}$, where $\boldsymbol{\Pi}$ is a permutation matrix.

Proof 1.

The proof technique of [8] can be repeated here.


Denote the optimal solution of (18) as $\mathbf{T}^{\star}$ and $\mathbf{V}^{\star}$. We aim to show that there exists a permutation matrix $\boldsymbol{\Pi}$ such that $\mathbf{T}^{\star} = \mathbf{T}^{\natural} \boldsymbol{\Pi}$ and $\mathbf{V}^{\star} = \boldsymbol{\Pi}^{\mathsf{T}} \mathbf{V}^{\natural}$. Because $\operatorname{rank}(\mathbf{T}^{\natural}) = K$, there exists a non-singular matrix $\mathbf{Q}$ such that $\mathbf{T}^{\star} = \mathbf{T}^{\natural} \mathbf{Q}$ and $\mathbf{V}^{\star} = \mathbf{Q}^{-1} \mathbf{V}^{\natural}$. Because we assume $\mathbf{T}^{\star}$ and $\mathbf{V}^{\star}$ are the optimal solution, we have


On the other hand, because $(\mathbf{T}^{\natural}, \mathbf{V}^{\natural})$ is an optimal solution of (19), we have:


We assume that $\mathbf{V}^{\natural}$ is sufficiently scattered. Then, due to the Hadamard inequality, we have:


Combining (20) and (21) implies that the columns of $\mathbf{Q}$ can only be selected from the columns of the identity matrix. Since $\mathbf{Q}$ is non-singular, $\mathbf{Q}$ must be a permutation matrix.

Figure 2: Average SDR improvement of the comparison methods over the mixed speech in anechoic environments. Panels (a), (b): female+female; (c), (d): male+male; (e), (f): female+male. Panels (a), (c), (e) are the results in condition 1; (b), (d), (f) are the results in condition 2.

3 Experiments

Experimental settings: We followed the environment of the SiSEC challenge [2] to construct a determined multichannel speech separation task with $M = N = 2$. We used the Wall Street Journal (WSJ0) corpus [13] as the speech source. We evaluated the comparison methods on all gender combinations.

We generated two test conditions, denoted as condition 1 and condition 2. In both conditions, the room size was fixed, and the two speakers were positioned 2 m from the center of the two microphones. The differences between the two conditions are that (i) the microphone spacing is 5.66 cm and 2.83 cm respectively, and (ii) the incident angles of the two speakers follow [21, Figs. 9a and 9b]. The image source model [1] was used to generate the room impulse responses, with the reverberation time $T_{60}$ selected from a set of values in ms. For each gender combination and each $T_{60}$ in each condition, we generated 200 mixtures for evaluation. The sampling rate was set to 16 kHz.

The parameter $\delta$ of MinVol in (5) was set to 0.5. Note that MinVol is insensitive to the selection of $\delta$, since it is only used to prevent (5) from being unbounded. We compared MinVol with AuxIVA [22], MNMF [24], and ILRMA [16]. For each comparison method, we set the frame length and frame shift of the STFT to 64 ms and 32 ms respectively. A Hamming window was applied to each frame. The number of basis vectors $K$ was set to the same value in MNMF, ILRMA, and MinVol by default.

The evaluation metric is the signal-to-distortion ratio (SDR) [27].
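As a rough illustration (a simplified projection-based variant, not the full BSS Eval toolkit of [27]), the SDR of an estimate against a reference can be computed as:

```python
import numpy as np

def sdr(reference, estimate):
    """Signal-to-distortion ratio in dB: the estimate's projection onto the
    reference is the target; everything else counts as distortion."""
    alpha = np.dot(reference, estimate) / np.dot(reference, reference)
    target = alpha * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

t = np.linspace(0.0, 1.0, 1600)
s = np.sin(2 * np.pi * 5 * t)                      # reference source
mildly_noisy = s + 0.1 * np.cos(2 * np.pi * 50 * t)
very_noisy = s + 0.5 * np.cos(2 * np.pi * 50 * t)
assert sdr(s, mildly_noisy) > sdr(s, very_noisy)   # less distortion, higher SDR
```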


Results: We first conducted an experiment in anechoic environments. Fig. 2 shows the average SDR improvement of the comparison methods over the mixed speech. From the figure, we see that the performance of the proposed MinVol is significantly better than that of MNMF. Compared to AuxIVA and ILRMA, MinVol achieves an SDR improvement of about 3 dB on average.

Then, we studied the performance of the comparison methods in reverberant environments. Fig. 3 shows the SDR improvement over the mixed speech with respect to the reverberation time $T_{60}$. From the figure, we see that the curves of the SDR improvement produced by MinVol are consistently higher than those produced by the comparison methods.

To clearly show the general improvement of MinVol over the reference methods, we average the SDR improvement over the different gender combinations and reverberation times for each condition. The average results are listed in Table 1. From the table, we see that the average SDR improvement brought by the proposed MinVol is about 2 dB higher than that of ILRMA in condition 1, and about 3 dB higher in condition 2.

Figure 3: The curves of the SDR improvement of the comparison methods in reverberant environments. Panels (a), (b): female+female; (c), (d): male+male; (e), (f): female+male. Panels (a), (c), (e) are in condition 1; (b), (d), (f) are in condition 2.
                  Condition 1            Condition 2
                  f+f    m+m    f+m      f+f    m+m    f+m
AuxIVA [22]       2.98   3.40   2.95     5.92   7.55   7.60
MNMF [24]         1.25   1.84   1.97     1.47   2.00   2.11
ILRMA [16]        5.03   6.89   5.72     5.17   7.31   6.00
MinVol            7.39   8.77   7.87     8.31  10.06   9.29

Table 1: The average SDR improvement (dB).

4 Conclusion

This paper proposed the MinVol source separation method. It constrains ILRMA with volume minimization to improve the identifiability of the source model estimation of ILRMA, and it unifies the IVA-based blind spatial optimization with the minimum-volume constrained MNMF. The method is optimized by multiplicative updates derived under the majorization-minimization framework. We have also proved the identifiability of the source model under the minimum-volume regularizer. Experimental results show that the proposed algorithm outperforms three representative blind audio source separation methods.


  • [1] J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §3.
  • [2] S. Araki, F. Nesta, E. Vincent, Z. Koldovskỳ, G. Nolte, A. Ziehe, and A. Benichoux (2012) The 2011 signal separation evaluation campaign (SiSEC 2011): audio source separation. In International Conference on Latent Variable Analysis and Signal Separation, pp. 414–422. Cited by: §3.
  • [3] S. Arberet, A. Ozerov, N. Q. Duong, E. Vincent, R. Gribonval, F. Bimbot, and P. Vandergheynst (2010) Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation. In 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), pp. 1–4. Cited by: §2.1.
  • [4] A. Belouchrani, K. Abed-Meraim, J. Cardoso, and E. Moulines (1997) A blind source separation technique using second-order statistics. IEEE Transactions on signal processing 45 (2), pp. 434–444. Cited by: §1.
  • [5] J. Cardoso (1997) Infomax and maximum likelihood for blind source separation. IEEE Signal processing letters 4 (4), pp. 112–114. Cited by: §1.
  • [6] C. Févotte, N. Bertin, and J. Durrieu (2009) Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural computation 21 (3), pp. 793–830. Cited by: §2.2.
  • [7] C. Févotte and J. Idier (2011) Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation 23 (9), pp. 2421–2456. Cited by: §2.3, Lemma 1.
  • [8] X. Fu, K. Huang, N. D. Sidiropoulos, Q. Shi, and M. Hong (2019) Anchor-free correlated topic modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (5), pp. 1056–1071. Cited by: §2.4, Theorem 1.
  • [9] X. Fu, K. Huang, and N. D. Sidiropoulos (2018) On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters 25 (3), pp. 328–332. Cited by: §1.
  • [10] X. Fu, K. Huang, B. Yang, W. Ma, and N. D. Sidiropoulos (2016) Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Transactions on Signal Processing 64 (23), pp. 6254–6268. Cited by: §1.
  • [11] X. Fu, K. Huang, N. D. Sidiropoulos, and W. Ma (2019) Nonnegative matrix factorization for signal and data analytics: identifiability, algorithms, and applications.. IEEE Signal Process. Mag. 36 (2), pp. 59–80. Cited by: §1.
  • [12] X. Fu, W. Ma, K. Huang, and N. D. Sidiropoulos (2015) Blind separation of quasi-stationary sources: exploiting convex geometry in covariance domain. IEEE Transactions on Signal Processing 63 (9), pp. 2306–2320. Cited by: §1.
  • [13] J. Garofolo, D. Graff, D. Paul, and D. Pallett (1993) CSR-I (WSJ0) Complete LDC93S6A. Web download. Philadelphia: Linguistic Data Consortium 83. Cited by: §3.
  • [14] T. Kim, H. T. Attias, S. Lee, and T. Lee (2006) Blind source separation exploiting higher-order frequency dependencies. IEEE transactions on audio, speech, and language processing 15 (1), pp. 70–79. Cited by: §1.
  • [15] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari (2015) Efficient multichannel nonnegative matrix factorization exploiting rank-1 spatial model. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 276–280. Cited by: §1.
  • [16] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari (2016) Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1626–1641. Cited by: §1, §1, §2.3, Table 1, §3.
  • [17] D. Kitamura, H. Saruwatari, H. Kameoka, Y. Takahashi, K. Kondo, and S. Nakamura (2015) Multichannel signal separation combining directional clustering and nonnegative matrix factorization with spectrogram restoration. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (4), pp. 654–669. Cited by: §1.
  • [18] Y. Kubo, N. Takamune, D. Kitamura, and H. Saruwatari (2020) Blind speech extraction based on rank-constrained spatial covariance matrix estimation with multivariate generalized Gaussian distribution. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1948–1963. Cited by: §1.
  • [19] V. Leplat, N. Gillis, and A. M. S. Ang (2020) Blind audio source separation with minimum-volume beta-divergence nmf. IEEE Transactions on Signal Processing 68 (), pp. 3400–3410. Cited by: §1.
  • [20] V. Leplat, A. M. Ang, and N. Gillis (2019) Minimum-volume rank-deficient nonnegative matrix factorizations. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3402–3406. Cited by: §2.3.
  • [21] S. Mogami, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, and N. Ono (2020) Independent low-rank matrix analysis based on time-variant sub-gaussian source model for determined blind source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (), pp. 503–518. Cited by: §1, §1, §3.
  • [22] N. Ono (2011) Stable and fast update rules for independent vector analysis based on auxiliary function technique. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Vol. , pp. 189–192. Cited by: §1, §2.3, Table 1, §3.
  • [23] A. Ozerov and C. Févotte (2009) Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 18 (3), pp. 550–563. Cited by: §1, §2.1.
  • [24] H. Sawada, H. Kameoka, S. Araki, and N. Ueda (2013) Multichannel extensions of non-negative matrix factorization with complex-valued data. IEEE Transactions on Audio, Speech, and Language Processing 21 (5), pp. 971–982. Cited by: §1, §1, §2.1, Table 1, §3.
  • [25] K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, and T. Kawahara (2020) Fast multichannel nonnegative matrix factorization with directivity-aware jointly-diagonalizable spatial covariance matrices for blind source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Cited by: §1.
  • [26] J. S. Tanton (2005) Encyclopedia of mathematics. Infobase Publishing. Cited by: §2.3.
  • [27] E. Vincent, R. Gribonval, and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469. Cited by: §3.