1 Introduction
Blind audio source separation separates a mixture of multiple sources into its components without prior information about the recording environment, mixing system, or source locations [5, 4, 18]. A typical approach to blind audio source separation is based on unsupervised learning of a probabilistic model. Such methods can be categorized into single-channel and multichannel source separation; this paper focuses on the latter. A multichannel source separation method usually consists of a source model representing the time-frequency structure of the source images and a spatial model representing their inter-channel covariance structure. A widely used source model is the low-rank model based on nonnegative matrix factorization (NMF), which mitigates the permutation problem. The time-frequency bins of each source in the spatial model are usually assumed to follow a multivariate complex Gaussian distribution [21].
A representative multichannel source separation method is multichannel nonnegative matrix factorization (MNMF) [23, 24, 17, 25]. It consists of a low-rank source model and a full-rank spatial model. The full-rank spatial model can represent a wide variety of source directivities under echoic conditions. However, MNMF tends to get stuck at bad local optima, since a large number of unconstrained spatial covariance matrices need to be estimated iteratively. To address this problem, Kitamura et al. [15, 16] proposed independent low-rank matrix analysis (ILRMA), which makes a rank-1 assumption for the spatial model. It performs well for directional sources in practice. Essentially, the spatial model and source model of ILRMA are independent vector analysis (IVA) [14] and NMF, respectively, which are optimized iteratively.
The aforementioned NMF-based methods, e.g. MNMF, ILRMA [16], and its variants [21], use NMF to decompose a given spectrogram into several spectral bases and temporal activations. Although the spatial properties of the source images constrain the NMF bases and thereby promote the uniqueness of the decomposition, they may not guarantee that the spectral content of each source is identifiable. Therefore, a good source model has the potential to improve the source separation performance [16].
To improve the source identifiability of separation algorithms, we propose a new geometric inference method for MNMF, named MinVol. It penalizes the columns of the spectral bases of NMF by volume minimization [9, 12], so that their convex hull has a small volume. Volume minimization factorizes a given data matrix into a basis matrix and a structured coefficient matrix by finding a minimum-volume simplex that encloses all columns of the data matrix [10]. It guarantees the identifiability of the factorized matrices under the so-called sufficiently scattered condition [19, 11]. We associate the minimum-volume penalty with the Itakura-Saito (IS) divergence for MNMF. To our knowledge, this is the first time that the minimum-volume penalty has been used in MNMF. The minimum-volume constraint also implicitly enhances the sparsity of the temporal activations, so that many frequency bands will be located on the facets of the cone of the spectral bases. The proposed MinVol method is optimized by a multiplicative update (MU) rule under the standard majorization-minimization framework.
Experimental results show that the proposed method outperforms Auxiliary-IVA (AuxIVA) [22], MNMF [24], and ILRMA [16] in speech separation tasks.
2 Methods
2.1 Problem formulation
Suppose the shorttime Fourier transform (STFT) of a multichannel mixture is
, where , and are the indices of the frequency bins, time frames, and microphones, respectively, and denotes the transpose operator. Its source components are denoted as , where is the number of sources and is the index of the sources. We assume that each source of the mixture is a point source, so the mixture and its sources are related by:
(1) 
where is the mixing matrix at the th frequency bin. If is invertible and , we can find a demixing matrix for recovering .
The problem of source separation is to find an estimate of , denoted as , such that applying to yields the separated signal :
(2) 
where denotes the Hermitian transpose, and is an estimate of .
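As an illustration of (1) and (2), the following minimal sketch builds a determined two-channel mixture at a single time-frequency bin and recovers the sources with the inverse of the mixing matrix. The names `A`, `s`, `x`, `W_H`, and `y` and all numerical values are assumptions made for this example. A blind method has no access to `A` and must estimate the demixing matrix from the mixture alone.

```python
# Determined case (2 sources, 2 microphones) at one time-frequency bin.
# All values are made up for illustration.

def matvec(m, v):
    """Multiply a 2x2 complex matrix by a length-2 complex vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def inverse_2x2(m):
    """Closed-form inverse of an invertible 2x2 complex matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Hypothetical mixing matrix for one frequency bin.
A = [[1.0 + 0.0j, 0.6 - 0.3j],
     [0.4 + 0.2j, 1.0 + 0.0j]]
# Hypothetical source STFT coefficients at one time frame.
s = [0.5 + 1.0j, -1.2 + 0.3j]

x = matvec(A, s)        # observed mixture, as in (1)
W_H = inverse_2x2(A)    # oracle demixing matrix (unknown in the blind setting)
y = matvec(W_H, x)      # separated signal, as in (2)

print(y)
```

With the oracle demixing matrix the sources are recovered up to rounding error; in the blind setting they can only be recovered up to scaling and permutation.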
Many MNMF methods model the power spectrogram by , and use NMF [23, 3, 24] to decompose by:
(3) 
where is the number of bases, is the element of a spectral basis matrix for the th source, is the element of a temporal activation matrix for the th source, and is the spatial covariance at the th frequency band for the th source. We denote the full representation of at all frequency bands for all sources as a tensor , and the full representation of at all time-frequency bins as a tensor .
2.2 Minimum-volume multichannel source separation
Because there exist several valid solutions of in (3), the decomposition of the source model of MNMF is not unique. To improve the identifiability of ILRMA (see Section 2.4 for the definition of identifiability), we propose minimum-volume-based MNMF (MinVol). The principle of MinVol is shown in Fig. 1. Its objective function is:
(4) 
where is an all-one vector and
(5) 
is the minimum-volume regularization with as a small positive constant that ensures is bounded from below, unlike the quantity . is the identity matrix with dimensions , and is the loss of the approximation.
The reason for using the minimum-volume regularization is that minimizing the volume of makes the columns of as close as possible to each other within the unit simplex. The loss should be chosen according to the assumed data distribution. Because we assume that the data follow a multiplicative Gamma distribution in this paper, we choose the IS divergence as the loss. The IS divergence is the only member of the β-divergence family that has the scale-invariance property. It implies that the time-frequency bins with low power are as important as those with high power in the divergence computation [6].
2.3 Optimization algorithm
The objective function based on the IS divergence is formulated as:
(6) 
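To make the ingredients of this objective concrete, the sketch below evaluates an IS-divergence data-fit term plus a log-determinant volume penalty for a single-source, single-channel NMF. The penalty form log det(T^T T + delta I) follows the minimum-volume NMF literature [19, 20]. The names `T`, `V`, `lam`, and `delta`, the toy matrices, and the restriction to one source are assumptions of this example, not the exact objective (6).

```python
import math

def is_divergence(x, y):
    """Itakura-Saito divergence between two positive scalars."""
    r = x / y
    return r - math.log(r) - 1.0

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def logdet_spd(m):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)
                total += 2.0 * math.log(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return total

def min_vol_penalty(T, delta):
    """Assumed penalty: log det(T^T T + delta * I)."""
    gram = matmul([list(col) for col in zip(*T)], T)
    for k in range(len(gram)):
        gram[k][k] += delta
    return logdet_spd(gram)

def objective(X, T, V, lam, delta):
    """IS data-fit term plus lam times the volume penalty."""
    X_hat = matmul(T, V)
    fit = sum(is_divergence(x, y)
              for xr, yr in zip(X, X_hat) for x, y in zip(xr, yr))
    return fit + lam * min_vol_penalty(T, delta)

# Toy data: 3 frequency bins, 2 frames, 2 bases (all values hypothetical).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
T = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [1.0, 1.0]]
print(objective(X, T, V, lam=0.1, delta=0.5))
```

Two properties mentioned in Section 2.2 can be read off this sketch: `is_divergence` depends only on the ratio of its arguments, so it is scale invariant, and the penalty cannot fall below K log(delta) for K bases, which it attains when `T` is all zeros.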
Following ILRMA [16], the spatial covariance can be modeled under the rank-1 assumption. With this assumption, (6) can be formulated as:
(7) 
where the term is called the spatial model, and the sum of all other terms is called the source model. The spatial and source models of the objective are optimized iteratively.
In each iteration, the spatial model is optimized with an IVA-based auxiliary function [22], which results in the following solution:
(8) 
where denotes the th column vector of the identity matrix, is the estimated spectrogram of the th source, and is the element of .
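For reference, the spatial update of ILRMA [16], which is derived from the auxiliary-function technique of AuxIVA [22], is usually written as below. The notation is an assumption taken from that literature and may differ from the symbols in (8): $\mathbf{W}_i$ is the demixing matrix whose rows are $\mathbf{w}_{i,n}^{\mathsf{H}}$, $r_{ij,n}$ is the source-model variance of the $n$th source, $\mathbf{x}_{ij}$ is the mixture, $J$ is the number of frames, and $\mathbf{e}_n$ is the $n$th column of the identity matrix.

```latex
\begin{align*}
\mathbf{V}_{i,n} &= \frac{1}{J}\sum_{j=1}^{J}\frac{1}{r_{ij,n}}\,
  \mathbf{x}_{ij}\mathbf{x}_{ij}^{\mathsf{H}}, \\
\mathbf{w}_{i,n} &\leftarrow \left(\mathbf{W}_i\mathbf{V}_{i,n}\right)^{-1}\mathbf{e}_n, \\
\mathbf{w}_{i,n} &\leftarrow \mathbf{w}_{i,n}
  \left(\mathbf{w}_{i,n}^{\mathsf{H}}\mathbf{V}_{i,n}\mathbf{w}_{i,n}\right)^{-1/2}.
\end{align*}
```

The first line is a weighted covariance of the mixture, the second solves for the demixing filter of one source, and the third normalizes its scale.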
Substituting the solution (8) into (7) yields the following optimization objective for the th source model:
(9) 
where each source model is optimized independently as follows.
Because the first term of (9) is difficult to optimize directly, we optimize an auxiliary function instead. The design of the auxiliary function follows [7]:
Lemma 1 ([7]).
Let and , . Then, the function:
(10) 
is an auxiliary function for at , where is a convex function of , is a concave function of , and is constant with respect to . is the derivative of with respect to . For the IS divergence, , , , .
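For the IS divergence, the convex-concave-constant decomposition used in Lemma 1 can be written out explicitly. The block below is a sketch in assumed notation ($\check{d}$, $\hat{d}$, and $\bar{d}$ for the convex, concave, and constant parts as functions of $y$, following the convention of [7]); the symbols in (10) may differ.

```latex
\begin{align*}
d_{\mathrm{IS}}(x \mid y) &= \frac{x}{y} - \log\frac{x}{y} - 1
  = \underbrace{\frac{x}{y}}_{\check{d}(x \mid y)}
  + \underbrace{\log y}_{\hat{d}(x \mid y)}
  + \underbrace{\left(-\log x - 1\right)}_{\bar{d}(x)}, \\
\hat{d}'(x \mid y) &= \frac{1}{y}.
\end{align*}
```

In the construction of [7], the convex part is majorized with Jensen's inequality and the concave part with its tangent at the current estimate, which makes the resulting auxiliary function separable in the entries being updated.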
Because the second term of (9), i.e. the minimum-volume regularization, is also difficult to optimize directly, we use its first-order Taylor expansion, which gives an upper bound of the regularization term:
(11) 
where with , is an arbitrary positive definite matrix. We can set in the experiments, since is a positive definite matrix. The right side of (11) is thus an auxiliary function for . However, it is quadratic and inseparable, which makes the problem hard to optimize over the nonnegative orthant. We therefore approximate the right side of (11). The non-constant part can be written as . Let with and . Then, the right side of (11) can be written as:
(12) 
where is the component-wise division between and , is the diagonal matrix, and .
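The upper bound in (11) rests on the concavity of the log-determinant over positive definite matrices. As a sketch in assumed notation, with $\mathbf{Q}$ standing for the positive definite argument of the regularizer and $\mathbf{Q}_0$ for its value at the current iterate, the tangent of a concave function lies above the function:

```latex
\begin{equation*}
\log\det(\mathbf{Q}) \;\le\; \log\det(\mathbf{Q}_0)
  + \operatorname{tr}\!\left(\mathbf{Q}_0^{-1}\left(\mathbf{Q}-\mathbf{Q}_0\right)\right),
\qquad \mathbf{Q},\,\mathbf{Q}_0 \succ 0,
\end{equation*}
```

with equality at $\mathbf{Q} = \mathbf{Q}_0$. When $\mathbf{Q}$ is a quadratic expression of the basis matrix, the right side is quadratic in the basis matrix, which is consistent with the remark that the bound is quadratic and inseparable.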
Finally, we replace the first term of (9) with (10) and the second term of (9) with (12), which results in the following auxiliary function at :
(13) 
where is a constant for . Similarly to (13), we obtain:
(14) 
as an auxiliary function at for .
Setting the derivative of the auxiliary function to zero gives:
(15) 
Solving (15) by Vieta’s theorem [26] yields the update rule for . Similarly, setting the derivative of (14) to zero yields the update rule for :
(16) 
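To illustrate the kind of multiplicative update that the majorization-minimization framework produces, the sketch below implements the standard MM update for plain IS-NMF from [7] on a toy matrix. It deliberately omits the minimum-volume term, the automatic update of the regularization coefficient, and the multichannel spatial model, so it is not the MinVol update rule derived above. The names `T`, `V`, and `mu_step` and the toy sizes are assumptions of this example.

```python
import math
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def is_cost(X, X_hat):
    """Sum of element-wise Itakura-Saito divergences d(X | X_hat)."""
    return sum(x / y - math.log(x / y) - 1.0
               for xr, yr in zip(X, X_hat) for x, y in zip(xr, yr))

def ratio_terms(X, X_hat):
    """Return the element-wise matrices X / X_hat^2 and 1 / X_hat."""
    num = [[x / (y * y) for x, y in zip(xr, yr)] for xr, yr in zip(X, X_hat)]
    den = [[1.0 / y for y in yr] for yr in X_hat]
    return num, den

def scale(M, N, D):
    """Element-wise M * sqrt(N / D); the exponent 1/2 is the MM choice for IS."""
    return [[m * math.sqrt(n / d) for m, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(M, N, D)]

def mu_step(X, T, V):
    """Update the bases T, then the activations V, for plain IS-NMF."""
    num, den = ratio_terms(X, matmul(T, V))
    T = scale(T, matmul(num, transpose(V)), matmul(den, transpose(V)))
    num, den = ratio_terms(X, matmul(T, V))
    V = scale(V, matmul(transpose(T), num), matmul(transpose(T), den))
    return T, V

random.seed(0)
n_freq, n_frames, n_bases = 6, 8, 2  # toy sizes
X = [[random.uniform(0.1, 1.0) for _ in range(n_frames)] for _ in range(n_freq)]
T = [[random.uniform(0.1, 1.0) for _ in range(n_bases)] for _ in range(n_freq)]
V = [[random.uniform(0.1, 1.0) for _ in range(n_frames)] for _ in range(n_bases)]

costs = [is_cost(X, matmul(T, V))]
for _ in range(30):
    T, V = mu_step(X, T, V)
    costs.append(is_cost(X, matmul(T, V)))

print(costs[0], costs[-1])
```

Because each factor is multiplied by a nonnegative ratio, nonnegativity is preserved without projection, and the MM construction keeps the cost from increasing between iterations.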
The regularization coefficient affects the model performance, so we update it automatically. First, the variables and are initialized with the successive nonnegative projection algorithm [20]; then is updated by:
(17) 
where is the value of at the previous iteration, and is recommended to be chosen between and at the first iteration.
2.4 Theoretical analysis
Similar to [8], we prove the identifiability of in MinVol, which theoretically supports the superiority of the proposed MinVol-ILRMA over ILRMA.
Theorem 1.
Proof 1.
The method can be repeated here
(18) 
Denote the optimal solution of (18) as and . There exists a permutation matrix such that , . Because , there exists a nonsingular matrix such that , . Because we assume that and are an optimal solution, we have
(19) 
On the other hand, because is an optimal solution of (19), we have:
(20) 
We assume that is sufficiently scattered, therefore . Then, due to the Hadamard inequality, we have:
(21) 
Combining (20) and (21) yields . The above conclusions imply that the columns of can only be selected from the columns of the identity matrix. Therefore, is a nonsingular permutation matrix.
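For reference, Hadamard's inequality, on which (21) relies, bounds the determinant of a square matrix by the product of the Euclidean norms of its columns. In the assumed notation below, $\mathbf{F} \in \mathbb{R}^{K \times K}$ is a generic square matrix with columns $\mathbf{f}_k$; the matrix to which the proof applies the inequality may be denoted differently.

```latex
\begin{equation*}
\lvert \det(\mathbf{F}) \rvert \;\le\; \prod_{k=1}^{K} \lVert \mathbf{f}_k \rVert_2 .
\end{equation*}
```

For a matrix with nonzero columns, equality holds if and only if the columns are mutually orthogonal.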
3 Experiments
Experimental settings: We followed the environment of the SiSEC challenge [2] to construct a determined multichannel speech separation task with . We used the Wall Street Journal (WSJ0) corpus [13] as the speech source. We evaluated the comparison methods on all gender combinations.
We generated two test conditions, denoted as condition 1 and condition 2. In both conditions, the room size was set to m; the two speakers were positioned 2 m from the center of the two microphones. The differences between the two conditions are that (i) the microphone spacing is 5.66 cm and 2.83 cm respectively, and (ii) the incident angles of the two speakers follow [21, Figs. 9a and 9b]. The image source model [1] was used to generate the room impulse responses with the reverberation time selected from ms. For each gender combination and each in each condition, we generated 200 mixtures for evaluation. The sampling rate was set to 16 kHz.
The parameter of MinVol in (5) was set to 0.5. Note that MinVol is insensitive to the choice of , since it is only used to keep (5) bounded. We compared MinVol with AuxIVA [22], MNMF [24], and ILRMA [16]. For each method, we set the frame length and frame shift of the STFT to 64 ms and 32 ms, respectively, and applied a Hamming window to each frame. The number of basis vectors was set to in MNMF, ILRMA, and MinVol by default.
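The STFT settings above translate into sample counts as follows. This is a small sketch under stated assumptions: the FFT size equals the frame length (no zero-padding), only full frames are counted, and the symmetric textbook Hamming window is used; none of these details is specified in the text, and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the experimental settings
FRAME_LENGTH_MS = 64
FRAME_SHIFT_MS = 32

frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000   # samples per frame
frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000     # hop size in samples
# Assumption: FFT size equals the frame length, so a real-valued frame
# yields frame_length // 2 + 1 non-redundant frequency bins.
num_freq_bins = frame_length // 2 + 1

def num_frames(num_samples):
    """Number of full frames that fit in a signal, without padding."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift

def hamming(n):
    """Symmetric Hamming window of length n (textbook definition)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

window = hamming(frame_length)
print(frame_length, frame_shift, num_freq_bins, num_frames(SAMPLE_RATE_HZ))
```

Under these assumptions a frame spans 1024 samples with a hop of 512 samples, giving 513 frequency bins per frame.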
The evaluation metric is the signal-to-distortion ratio (SDR) [27].
Results: We first conducted an experiment in anechoic environments. Fig. 2 shows the average SDR improvement of the comparison methods over the mixed speech. From the figure, we see that the proposed MinVol performs significantly better than MNMF. Compared to AuxIVA and ILRMA, MinVol achieves an SDR improvement of about 3 dB on average.
Then, we studied the performance of the comparison methods in reverberant environments. Fig. 3 shows the SDR improvement over the mixed speech with respect to . From the figure, we see that the SDR improvement curves of MinVol are always higher than those of the comparison methods.
To show the overall improvement of MinVol over the reference methods, we averaged the SDR improvement over the different gender combinations and for each condition. The average results are listed in Table 1. From the table, we see that the average SDR improvement of the proposed MinVol is 2 dB higher than that of ILRMA in condition 1, and 3 dB higher in condition 2.
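The notion of SDR improvement over the mixed speech used in this section can be illustrated with a simplified energy-ratio definition. The sketch below is not the BSS Eval procedure of [27], which first decomposes the estimate into target and distortion components; here the distortion is simply the difference between the estimate and the reference. The function names and signals are hypothetical.

```python
import math

def simple_sdr_db(reference, estimate):
    """Simplified SDR: 10 log10(||s||^2 / ||s - s_hat||^2), in dB."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_energy / error_energy)

def sdr_improvement_db(reference, mixture, estimate):
    """SDR of the separated signal minus SDR of the unprocessed mixture."""
    return simple_sdr_db(reference, estimate) - simple_sdr_db(reference, mixture)

# Hypothetical time-domain samples.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]     # reference plus strong interference
estimate = [1.1, -0.9, 1.1, -0.9]    # reference plus weak residual

print(sdr_improvement_db(reference, mixture, estimate))
```

A positive value means that the separated signal is closer to the reference than the mixture was.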
4 Conclusion
This paper proposes the MinVol source separation method. It constrains ILRMA with volume minimization to improve the identifiability of the source model estimated by ILRMA. It further unifies the IVA-based blind spatial optimization and the minimum-volume-constrained MNMF. It is optimized by multiplicative updates derived under the majorization-minimization framework. We have also proved the identifiability of the minimum-volume regularizer. Experimental results show that the proposed algorithm outperforms three representative blind audio source separation methods.
References
[1] (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950.
[2] (2012) The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation. In International Conference on Latent Variable Analysis and Signal Separation, pp. 414–422.
[3] (2010) Nonnegative matrix factorization and spatial covariance model for underdetermined reverberant audio source separation. In 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), pp. 1–4.
[4] (1997) A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing 45 (2), pp. 434–444.
[5] (1997) Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters 4 (4), pp. 112–114.
[6] (2009) Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Computation 21 (3), pp. 793–830.
[7] (2011) Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation 23 (9), pp. 2421–2456.
[8] (2019) Anchor-free correlated topic modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (5), pp. 1056–1071.
[9] (2018) On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters 25 (3), pp. 328–332.
[10] (2016) Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Transactions on Signal Processing 64 (23), pp. 6254–6268.
[11] (2019) Nonnegative matrix factorization for signal and data analytics: identifiability, algorithms, and applications. IEEE Signal Processing Magazine 36 (2), pp. 59–80.
[12] (2015) Blind separation of quasi-stationary sources: exploiting convex geometry in covariance domain. IEEE Transactions on Signal Processing 63 (9), pp. 2306–2320.
[13] (1993) CSR-I (WSJ0) Complete LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium 83.
[14] (2006) Blind source separation exploiting higher-order frequency dependencies. IEEE Transactions on Audio, Speech, and Language Processing 15 (1), pp. 70–79.
[15] (2015) Efficient multichannel nonnegative matrix factorization exploiting rank-1 spatial model. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 276–280.
[16] (2016) Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1626–1641.
[17] (2015) Multichannel signal separation combining directional clustering and nonnegative matrix factorization with spectrogram restoration. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (4), pp. 654–669.
[18] (2020) Blind speech extraction based on rank-constrained spatial covariance matrix estimation with multivariate generalized Gaussian distribution. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1948–1963.
[19] (2020) Blind audio source separation with minimum-volume beta-divergence NMF. IEEE Transactions on Signal Processing 68, pp. 3400–3410.
[20] (2019) Minimum-volume rank-deficient nonnegative matrix factorizations. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3402–3406.
[21] (2020) Independent low-rank matrix analysis based on time-variant sub-Gaussian source model for determined blind source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 503–518.
[22] (2011) Stable and fast update rules for independent vector analysis based on auxiliary function technique. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 189–192.
[23] (2009) Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 18 (3), pp. 550–563.
[24] (2013) Multichannel extensions of nonnegative matrix factorization with complex-valued data. IEEE Transactions on Audio, Speech, and Language Processing 21 (5), pp. 971–982.
[25] (2020) Fast multichannel nonnegative matrix factorization with directivity-aware jointly-diagonalizable spatial covariance matrices for blind source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[26] (2005) Encyclopedia of mathematics. Infobase Publishing.
[27] (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.