1 Introduction
The ubiquity of smart devices in the present day provides a diverse range of audio-based applications, such as detecting threat-indicating sounds, transcribing speech and building AI assistants, to facilitate our day-to-day activities [1]. Depending on the application, the information we wish to extract from the audio signals changes. Real-world audio signals are often obscured by undesired information, which necessitates the development of source separation methods [2]. As most smart devices are equipped with two or more microphones, the recorded multichannel data can be leveraged to improve the separation performance [3]. However, the arrangement of microphones varies across devices, as do the source characteristics and locations. Separating sources without such device or source information is termed blind source separation (BSS) [4, 5].
Fundamental BSS techniques include matrix factorization methods such as independent component analysis (ICA) [6] and nonnegative matrix factorization (NMF) [7]. These techniques formulate the audio signals captured by microphones as linear mixtures of two or more source signals. In other words, the spectra of mixture signals are treated as linear combinations of the spectra of source signals, and are solved for by assuming independence among the distributions of source spectra. This assumption of independence is often not true, because the spectrum of a typical audio source is highly correlated with itself across different time intervals. Such correlations or fundamental patterns in each source are efficiently extracted using NMF [8]; these patterns are referred to as bases. The way these bases are linearly combined to reconstruct the spectra is referred to as activations, which are extracted simultaneously by NMF. BSS methods are therefore extended by modelling such correlations using multichannel NMF [7, 9]. It is intuitive that source separation becomes more reliable as the number of available microphones increases. When there are as many microphones as there are sources, the situation is said to be determined. Ono et al. [10] proposed independent vector analysis (IVA) for determined BSS, using an auxiliary function technique to derive stable and fast update rules for the demixing parameters. Kitamura et al. [11] proposed independent low-rank matrix analysis (ILRMA) by unifying IVA and multi-source NMF modelling.
In the NMF-based BSS methods discussed above, it is required to provide an appropriate value for NMF's model complexity, i.e. the number of bases extracted for each source model. When this latent variable is too low or too high, it leads to underfitting or overfitting respectively. Its value is often chosen depending on source characteristics or by parameter tuning. However, real-world source characteristics are unknown, and their complexities range from simple keyboard clicking, to moderately complex sounds of drums, to complex music pieces. Although nonparametric methods exist for estimating the number of sources, to the best of our knowledge there are no NMF-based BSS techniques which can adaptively model sources having different complexities.
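As a concrete illustration of bases and activations, the following sketch factorizes a toy power spectrogram with plain Euclidean multiplicative-update NMF. This is a minimal illustration with our own variable names; the NMF-based BSS methods discussed above use the Itakura-Saito criterion rather than the Euclidean one.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, K, n_iter=200, eps=1e-12):
    """Factorize a nonnegative spectrogram V (F x T) as V ~= W @ H, where
    the columns of W are basis spectra and the rows of H their activations,
    using the classic Euclidean multiplicative updates."""
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update bases
    return W, H

V = np.abs(rng.standard_normal((64, 100))) ** 2   # toy power spectrogram
W, H = nmf(V, K=4)                                # K is the model complexity
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The choice of K here is exactly the model complexity parameter whose tuning the proposed method avoids.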
We propose a nonparametric framework of multi-source modelling unified with IVA for determined BSS, which overcomes the problem of tuning NMF's model complexity parameter. Our proposed method utilizes variational Bayesian inference to statistically estimate each source's complexity, thereby optimally separating the sources. The proposed method therefore serves as a generalization of ILRMA that adapts to varying source complexities.
2 Conventional Method
Most BSS methods model the given mixture spectra as linear combinations of source spectra and are formulated as

$\mathbf{x}_{ij} = \mathbf{A}_i \mathbf{s}_{ij}$, (1)

where $\mathbf{x}_{ij} = (x_{ij,1}, \ldots, x_{ij,M})^\mathsf{T}$ and $\mathbf{s}_{ij} = (s_{ij,1}, \ldots, s_{ij,N})^\mathsf{T}$ denote the mixture and source spectra respectively for each frequency index $i = 1, \ldots, I$ and time index $j = 1, \ldots, J$. $(\cdot)^\mathsf{T}$ denotes a transpose, and $I$, $J$, $M$, $N$ denote the number of frequency bins, time frames, microphones and sources respectively. $\mathbf{A}_i$ is an $M \times N$ mixing matrix comprising the steering vectors for the respective sources. In a determined case, $M = N$ and the square matrix $\mathbf{A}_i$ has a valid inverse. Therefore [10] proposed IVA by defining a demixing matrix $\mathbf{W}_i = (\mathbf{w}_{i,1}, \ldots, \mathbf{w}_{i,N})^\mathsf{H} = \mathbf{A}_i^{-1}$ and reformulated Eq. (1) as

$\mathbf{y}_{ij} = \mathbf{W}_i \mathbf{x}_{ij}$, (2)

where $\mathbf{y}_{ij} = (y_{ij,1}, \ldots, y_{ij,N})^\mathsf{T}$ denotes the estimated source spectra and $(\cdot)^\mathsf{H}$ denotes a Hermitian transpose.
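The mixing and demixing models of Eqs. (1) and (2) can be verified numerically in the noiseless determined case, where inverting the mixing matrices recovers the sources exactly. A minimal sketch with our own array names and toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, N = 4, 6, 2          # frequency bins, time frames, sources (= mics)

# Random per-frequency mixing matrices A_i and true source spectra s_ij
A = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
s = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))

# Eq. (1): x_ij = A_i s_ij  (observed mixtures)
x = np.einsum('imn,ijn->ijm', A, s)

# Eq. (2): with W_i = A_i^{-1}, y_ij = W_i x_ij recovers the sources
W = np.linalg.inv(A)
y = np.einsum('inm,ijm->ijn', W, x)
assert np.allclose(y, s)   # exact recovery in the noiseless determined case
```

In practice $\mathbf{A}_i$ is unknown, so BSS must estimate $\mathbf{W}_i$ from the mixtures alone.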
2.1 ILRMA: Unifying IVA and NMF
This method extends IVA by independently modelling each source with an isotropic complex Gaussian distribution. For each source index $n = 1, \ldots, N$, the distribution variance, denoted as $r_{ij,n}$, is nonnegative and modelled using NMF as

$r_{ij,n} = \sum_{k=1}^{K_n} t_{ik,n} v_{kj,n}$, (3)

where $t_{ik,n}$ and $v_{kj,n}$ are the elements of the basis and activation vectors respectively. $K_n$ is NMF's model complexity parameter, signifying the number of basis vectors for source $n$. The cost function of ILRMA is derived in [11] as

$\mathcal{L} = 2J \sum_{i} \log |\det \mathbf{W}_i| - \sum_{i,j,n} \left( \log r_{ij,n} + \frac{|y_{ij,n}|^2}{r_{ij,n}} \right) + \text{const.}$ (4)

ILRMA estimates the demixing parameters by maximizing this cost function with the multi-source NMF modelling of Eq. (3).
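For intuition, the NMF variance model of Eq. (3) and the cost of Eq. (4) can be evaluated on toy data. This is a sketch under our own naming and shape conventions, not the reference ILRMA implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, N, K = 4, 8, 2, 3

x = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
W = np.stack([np.eye(N, dtype=complex)] * I)   # demixing matrices W_i
T = rng.random((N, I, K)) + 0.1                # basis elements t_{ik,n}
V = rng.random((N, K, J)) + 0.1                # activation elements v_{kj,n}

def ilrma_loglik(W, T, V, x):
    """Eq. (4) up to an additive constant: the demixed signals y = W x are
    scored under zero-mean complex Gaussians whose variances r follow the
    NMF model of Eq. (3)."""
    y = np.einsum('inm,ijm->ijn', W, x)        # y_ij = W_i x_ij
    r = np.einsum('nik,nkj->ijn', T, V)        # r_{ij,n} = sum_k t v
    J_frames = x.shape[1]
    det_term = 2 * J_frames * np.sum(np.log(np.abs(np.linalg.det(W))))
    return det_term - np.sum(np.log(r) + np.abs(y) ** 2 / r)

val = ilrma_loglik(W, T, V, x)
```

Optimizing this quantity alternates between the demixing matrices and the NMF source models.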
2.2 Limitation: Number of Source Bases
In the formulation of ILRMA, note that a set of complexity parameters $\{K_n\}$ is to be specified by the user. Failing to provide a reasonable estimate of each source's complexity will lead to overfitting or underfitting. This is discussed in [11] by comparing the cases of separating speech signals with those of music signals: the former requires only a small number of bases for optimal modelling, whereas the latter requires considerably more. Therefore the model complexity parameter can affect separation performance, depending on the types of sources being separated. For simplicity, ILRMA assumes an equal number of bases for each source, i.e. $K_n = K$. This further limits ILRMA's ability to optimally separate a low-complexity source and a high-complexity source from their mixture signals.
3 Proposed Multi-Source Modelling
We overcome the limitation of ILRMA by proposing a probabilistic modelling of the variance of the source distributions. In such techniques, it is common to introduce hidden variables to capture the structure of the given observed data, and to utilize inference algorithms to estimate their posterior distribution. Accordingly, our proposed method flexibly models each source using a large number of basis vectors and incorporates a reliability value for each basis vector. We denote each reliability value as $\theta_{k,n}$, which can be interpreted as the quantified contribution of basis $k$ towards source $n$.
3.1 Model Formulation
In contrast to the NMF-based source modelling of Eq. (3), we propose a probabilistic model for the source variance as

$r_{ij,n} = \sum_{k=1}^{K} \theta_{k,n} \, t_{ik,n} \, v_{kj,n}$, (5)

where the prior distributions for each of $t_{ik,n}$, $v_{kj,n}$ and $\theta_{k,n}$ are drawn from a random process as

$t_{ik,n} \sim \text{Gamma}(a, a), \quad v_{kj,n} \sim \text{Gamma}(b, b), \quad \theta_{k,n} \sim \text{Gamma}(\alpha/K, \alpha c)$, (6)

where $a$, $b$, $\alpha$, $c$ are positive constants and $\text{Gamma}(\cdot, \cdot)$ is a gamma distribution defined over a shape parameter and a rate (inverse-scale) parameter. When $\alpha/K \ll 1$, a sparse prior is set over the reliability values, which therefore adapts to different model complexities depending on each source's characteristics [12]. As each source's expected variance should correspond to the expectation of its power, the choice of prior parameters requires that

$\mathbb{E}[r_{ij,n}] = \sum_{k=1}^{K} \mathbb{E}[\theta_{k,n}]\,\mathbb{E}[t_{ik,n}]\,\mathbb{E}[v_{kj,n}] = \frac{1}{c}$, (7)

which is set to match the average power of the observed mixtures.
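The sparsity induced by the gamma prior over the reliability values can be seen by direct sampling. This is a toy sketch assuming a GaP-style prior $\theta_{k,n} \sim \text{Gamma}(\alpha/K, \alpha c)$ in the spirit of [12, 17]; the constants are illustrative, not the values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
K, alpha, c = 50, 1.0, 1.0

# theta_k ~ Gamma(shape=alpha/K, rate=alpha*c).  NumPy parameterizes the
# gamma distribution by shape and *scale*, so scale = 1 / (alpha * c).
theta = rng.gamma(shape=alpha / K, scale=1.0 / (alpha * c), size=K)

# With K large the shape alpha/K is small: most reliabilities are
# negligible and only a handful of bases carry appreciable weight.
active = int(np.sum(theta > 1e-3 * theta.max()))
```

The number of "active" bases, rather than K itself, then determines each source's effective model complexity.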
Maximizing the cost function in Eq. (4), with the source variance model replaced by Eq. (5), is the central focus of our approach.
3.2 Variational Bayesian Inference
The isotropic Gaussian distribution of the sources is not conjugate to the gamma distributions of the source parameters $\theta_{k,n}$, $t_{ik,n}$, $v_{kj,n}$. This non-conjugacy precludes the use of Markov chain Monte Carlo (MCMC) techniques such as Metropolis-Hastings and Gibbs sampling for inferring our desired posterior [13, 14]. However, variational approaches provide alternatives for non-conjugate models that are also faster for large amounts of data. We adopt a fully factorized mean-field variational inference technique, as it assumes conditional independence among the hidden variables [15] and approximates them from a family of conditional distributions over variational parameters [16]. Generalized inverse Gaussian (GIG) distributions are chosen as our variational family and are expressed as

$\text{GIG}(x; \gamma, \rho, \tau) = \frac{\rho^{\gamma/2} \, x^{\gamma - 1} \exp(-\rho x - \tau / x)}{2 \, \tau^{\gamma/2} \, K_\gamma(2\sqrt{\rho \tau})}$, (8)

where $K_\gamma(\cdot)$ is a modified Bessel function of the second kind and $\gamma$, $\rho$, $\tau$ are the variational hyperparameters. The sufficient statistics of the GIG include both $x$ and $x^{-1}$, which eases the optimization of our cost function's $1/r_{ij,n}$ term and motivates our choice of the GIG [17]. We define the conditional distributions as

$q(t_{ik,n}) = \text{GIG}(\gamma^{(t)}_{ik,n}, \rho^{(t)}_{ik,n}, \tau^{(t)}_{ik,n}), \quad q(v_{kj,n}) = \text{GIG}(\gamma^{(v)}_{kj,n}, \rho^{(v)}_{kj,n}, \tau^{(v)}_{kj,n}), \quad q(\theta_{k,n}) = \text{GIG}(\gamma^{(\theta)}_{k,n}, \rho^{(\theta)}_{k,n}, \tau^{(\theta)}_{k,n})$. (9)
We now derive update equations from the cost function in Eq. (4) using a first-order Taylor expansion and Jensen's inequality [18], by introducing respective auxiliary positive constants $\omega_{ij,n}$ and $\phi_{ijk,n}$ (with $\sum_k \phi_{ijk,n} = 1$) as

$\mathcal{L} \geq 2J \sum_i \log|\det \mathbf{W}_i| - \sum_{i,j,n} \left[ \log \omega_{ij,n} + \frac{\sum_k \mathbb{E}[\theta_{k,n} t_{ik,n} v_{kj,n}] - \omega_{ij,n}}{\omega_{ij,n}} + |y_{ij,n}|^2 \sum_k \phi_{ijk,n}^2 \, \mathbb{E}\!\left[(\theta_{k,n} t_{ik,n} v_{kj,n})^{-1}\right] \right] + C$, (10)

where $C$ denotes a leftover constant. Note that the pairs $(t, t^{-1})$, $(v, v^{-1})$ and $(\theta, \theta^{-1})$ are sufficient statistics for their respective GIG distributions. This allows us to avoid taking partial derivatives and to directly derive the analytic coordinate-ascent updates for our hyperparameters by comparing the coefficients of the sufficient statistics [17]. The constants $\phi_{ijk,n}$ and $\omega_{ij,n}$ retighten the inequality (10) when:

$\phi_{ijk,n} = \frac{\mathbb{E}\!\left[(\theta_{k,n} t_{ik,n} v_{kj,n})^{-1}\right]^{-1}}{\sum_{k'} \mathbb{E}\!\left[(\theta_{k',n} t_{ik',n} v_{k'j,n})^{-1}\right]^{-1}}$, (11)

$\omega_{ij,n} = \sum_{k} \mathbb{E}[\theta_{k,n}] \, \mathbb{E}[t_{ik,n}] \, \mathbb{E}[v_{kj,n}]$. (12)
Expectations of the source parameters in Eqs. (11) and (12) can be obtained from the expectations of a random variable $x$ defined over a GIG distribution in Eq. (8) as

$\mathbb{E}[x] = \frac{K_{\gamma+1}(2\sqrt{\rho\tau})}{K_{\gamma}(2\sqrt{\rho\tau})} \sqrt{\frac{\tau}{\rho}}, \qquad \mathbb{E}[x^{-1}] = \frac{K_{\gamma-1}(2\sqrt{\rho\tau})}{K_{\gamma}(2\sqrt{\rho\tau})} \sqrt{\frac{\rho}{\tau}}$. (13)

Since the variational posterior fully factorizes, $\mathbb{E}[(\theta t v)^{-1}] = \mathbb{E}[\theta^{-1}]\,\mathbb{E}[t^{-1}]\,\mathbb{E}[v^{-1}]$.
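The GIG expectations of Eq. (13) are straightforward to compute with SciPy's modified Bessel function of the second kind. A small sketch with our own function name:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_expectations(gamma, rho, tau):
    """E[x] and E[1/x] for x ~ GIG(gamma, rho, tau) in the
    exp(-rho*x - tau/x) * x^(gamma-1) parameterization of Eq. (8)."""
    z = 2.0 * np.sqrt(rho * tau)
    ex = kv(gamma + 1, z) / kv(gamma, z) * np.sqrt(tau / rho)
    ex_inv = kv(gamma - 1, z) / kv(gamma, z) * np.sqrt(rho / tau)
    return ex, ex_inv

ex, ex_inv = gig_expectations(0.7, 2.0, 3.0)
# Consistency check: E[1/x] under GIG(g, rho, tau) equals E[x] under
# GIG(-g, tau, rho), since K_{-v} = K_v.
ex_swap, _ = gig_expectations(-0.7, 3.0, 2.0)
```

Both expectations are needed because the bound in Eq. (10) involves the parameters and their reciprocals.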
The update equations for our hyperparameters are derived as

$\rho^{(t)}_{ik,n} = a + \sum_{j} \mathbb{E}[\theta_{k,n}] \, \mathbb{E}[v_{kj,n}] \, \omega_{ij,n}^{-1}$, (14)

$\tau^{(t)}_{ik,n} = \sum_{j} |y_{ij,n}|^2 \, \phi_{ijk,n}^2 \, \mathbb{E}[\theta_{k,n}^{-1}] \, \mathbb{E}[v_{kj,n}^{-1}]$, (15)

$\rho^{(v)}_{kj,n} = b + \sum_{i} \mathbb{E}[\theta_{k,n}] \, \mathbb{E}[t_{ik,n}] \, \omega_{ij,n}^{-1}$, (16)

$\tau^{(v)}_{kj,n} = \sum_{i} |y_{ij,n}|^2 \, \phi_{ijk,n}^2 \, \mathbb{E}[\theta_{k,n}^{-1}] \, \mathbb{E}[t_{ik,n}^{-1}]$, (17)

$\rho^{(\theta)}_{k,n} = \alpha c + \sum_{i,j} \mathbb{E}[t_{ik,n}] \, \mathbb{E}[v_{kj,n}] \, \omega_{ij,n}^{-1}$, (18)

$\tau^{(\theta)}_{k,n} = \sum_{i,j} |y_{ij,n}|^2 \, \phi_{ijk,n}^2 \, \mathbb{E}[t_{ik,n}^{-1}] \, \mathbb{E}[v_{kj,n}^{-1}]$, (19)

while the shape parameters remain fixed at $\gamma^{(t)} = a$, $\gamma^{(v)} = b$ and $\gamma^{(\theta)} = \alpha/K$ by matching against the gamma priors of Eq. (6).
3.3 Update Rules for Demixing Matrix
Given each source's variance, the proposed method does not alter the partial derivatives of the cost function with respect to the demixing matrix $\mathbf{W}_i$. Hence its update equations coincide with those described for IVA using an auxiliary function technique [10] and are derived as follows:

$\mathbf{V}_{i,n} = \frac{1}{J} \sum_{j} \frac{1}{r_{ij,n}} \mathbf{x}_{ij} \mathbf{x}_{ij}^\mathsf{H}$, (20)

$\mathbf{w}_{i,n} \leftarrow (\mathbf{W}_i \mathbf{V}_{i,n})^{-1} \mathbf{e}_n$, (21)

$\mathbf{w}_{i,n} \leftarrow \mathbf{w}_{i,n} \left( \mathbf{w}_{i,n}^\mathsf{H} \mathbf{V}_{i,n} \mathbf{w}_{i,n} \right)^{-1/2}$, (22)

where $\mathbf{e}_n$ is a unit vector whose $n$-th element equals one. After the elements of the demixing matrix are estimated, the separated source spectra can be extracted as

$y_{ij,n} = \mathbf{w}_{i,n}^\mathsf{H} \mathbf{x}_{ij}$. (23)
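One sweep of the auxiliary-function demixing updates of Eqs. (20)-(22) can be sketched as follows (shapes and names are our own conventions):

```python
import numpy as np

def update_demixing(W, x, r):
    """One sweep of the auxiliary-function (iterative projection) updates
    of Eqs. (20)-(22).  W: (I, N, N) demixing matrices, x: (I, J, N)
    observed mixture spectra, r: (I, J, N) estimated source variances."""
    I, J, N = x.shape
    for n in range(N):
        # Eq. (20): weighted covariance V_{i,n} = (1/J) sum_j x x^H / r
        Vn = np.einsum('ijm,ijp,ij->imp', x, x.conj(), 1.0 / r[:, :, n]) / J
        # Eq. (21): w_{i,n} <- (W_i V_{i,n})^{-1} e_n
        e = np.zeros((I, N, 1), dtype=complex)
        e[:, n, 0] = 1.0
        w = np.linalg.solve(W @ Vn, e)[:, :, 0]
        # Eq. (22): normalize so that w^H V w = 1
        scale = np.sqrt(np.einsum('im,imp,ip->i', w.conj(), Vn, w).real)
        W[:, n, :] = (w / scale[:, None]).conj()  # row n of W_i is w_{i,n}^H
    return W

rng = np.random.default_rng(5)
I, J, N = 3, 50, 2
x = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
r = rng.random((I, J, N)) + 0.5                  # placeholder variances
W = update_demixing(np.stack([np.eye(N, dtype=complex)] * I), x, r)
```

Each source index is updated in turn, with the already-updated rows of $\mathbf{W}_i$ used when processing subsequent sources.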
In each iteration of the proposed method, the hyperparameters are updated using Eqs. (14)-(19). The demixing matrices and the separated source spectra are then updated using Eqs. (20)-(23). As each source's modelling begins with a large $K$, our variational inference is computationally demanding. However, over a few iterations, the sparse prior placed over $\theta_{k,n}$ identifies a small number of bases which are reliable. We therefore employ a thresholding technique [19] to skip the optimization of the less reliable bases in subsequent iterations.
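The thresholding step above can be sketched as follows; the helper name and the relative threshold value are illustrative choices, not those of [19]:

```python
import numpy as np

def prune_bases(expected_theta, rel_threshold=1e-4):
    """Return indices of bases whose expected reliability E[theta_k] exceeds
    a small fraction of the largest one; the remaining bases are skipped in
    subsequent iterations."""
    t = np.asarray(expected_theta, dtype=float)
    return np.flatnonzero(t > rel_threshold * t.max())

# Toy reliabilities: only three of the five bases are worth keeping.
keep = prune_bases([0.9, 1e-7, 0.3, 1e-9, 0.05])
```

Restricting the updates of Eqs. (14)-(19) to the kept indices reduces the per-iteration cost as the model sparsifies.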
4 Simulations and Results
4.1 Experimental Conditions
We evaluate our proposed method on the DSD100 dataset of professionally produced songs from the 2016 SiSEC challenge [20]. Each song consists of clean sounds for vocals, drums, bass and other accompaniments, and lasts for several minutes. We therefore choose only a short excerpt of the clean sources and downsample them for creating the mixture signals. For each of these songs, we randomly choose between the pairs (drums, vocals) or (bass, vocals) and create synthetic two-channel reverberant mixture signals using the recording conditions shown in Fig. 1. The room impulse responses (E2A) for the above recording conditions were obtained from the RWCP Sound Scene Database [21].
Mixture spectra are estimated from the time-domain signals using a Hamming window, with the window length and shift set to the values found to be optimal by [22]. Each demixing matrix $\mathbf{W}_i$ is initialized with an identity matrix. We model each source's variance with a large number of bases $K$. Hyperparameters $\rho$ and $\tau$ are initialized randomly from gamma distributions. Parameters are optimized over a fixed number of iterations, and the separated spectra are then converted to the time domain using a back-projection technique [23].
4.2 Evaluations and Comparisons
Three metrics, namely the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artifacts ratio (SAR) [24], are used to evaluate the quality of the separated sources. The separation performance of the proposed method is compared with three conventional BSS methods, i.e. IVA [10], MNMF [9] and ILRMA [11] (with a fixed number of source bases). Each separation is repeated for several different random initializations, and the averages of the above performance metrics are reported in Table 1.
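For intuition about these metrics, a simplified SDR treats the entire residual between the reference and the estimate as distortion. This is only a sketch: the full bss_eval metrics of [24] used in Table 1 further decompose the residual into interference (SIR) and artifact (SAR) components.

```python
import numpy as np

def simple_sdr(ref, est):
    """Simplified signal-to-distortion ratio in dB, treating the whole
    residual as distortion.  The full bss_eval metrics [24] additionally
    split the residual into interference (SIR) and artifact (SAR) parts."""
    err = ref - est
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

# Toy example: a sinusoid corrupted by a small amount of noise.
t = np.linspace(0.0, 1.0, 1000)
clean = np.sin(2.0 * np.pi * 5.0 * t)
estimate = clean + 0.01 * np.random.default_rng(4).standard_normal(1000)
sdr_db = simple_sdr(clean, estimate)
```

Higher values indicate a cleaner estimate for all three metrics.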



Methods  SDR  SIR  SAR 
IVA [10]  dB  dB  dB 
MNMF[9]  dB  dB  dB 
ILRMA [11]  dB  dB  dB 
Proposed  dB  dB  dB 

Although the proposed method outperforms ILRMA, it is important to verify that we overcome its limitation of having to tune the complexity parameter $K$. This limitation can be seen in Fig. 2, where ILRMA's SDR, averaged over all the mixture signals containing (bass, vocals), increases as $K$ increases, while the opposite trend is seen for mixture signals containing (drums, vocals). The proposed method, on the other hand, starts with a large value for $K$ and tunes itself depending on each source's characteristics. Hence it optimally separates the sources from both types of mixture signals. The choice of $K$, if sufficiently large, does not significantly impact the proposed method's performance.
4.3 Possible Extension
As each source is modelled individually using NMF, it is possible to extend our approach by considering a common basis and activation matrix capturing the variance of all sources, i.e. $t_{ik,n} = t_{ik}$ and $v_{kj,n} = v_{kj}$ for all $n$. This requires fewer parameters to be estimated, and its computational complexity is reduced by at least half compared to that of the proposed method. We note that this extension is similar to Ozerov's MNMF [7] and has also been considered for ILRMA [11]. However, the key difference is that we do not restrict the overall contribution of each basis across sources to sum to one, i.e. we do not impose $\sum_n \theta_{k,n} = 1$. Due to space constraints, this extension will be explored in future work.
5 Conclusions
This work proposes a Bayesian generalization of ILRMA for determined blind source separation by performing multi-source modelling using nonparametric NMF. Our formulation for individual source modelling is able to overcome the limitation of the conventional method, whose separation performance is affected by NMF's model complexity parameter. The proposed approach is flexible in modelling sources of different complexities and is therefore able to optimally separate them. We further show that our approach outperforms state-of-the-art NMF-based techniques.
References
 [1] J. Foote, “An overview of audio information retrieval,” Multimedia systems, vol.7, no.1, pp.2–10, 1999.
 [2] M. Davies, “Audio source separation,” in Institute of mathematics and its applications conference series, 2002, vol. 71, pp. 57–68.
 [3] E. Weinstein, M. Feder, and A. V. Oppenheim, “Multichannel signal separation by decorrelation,” IEEE transactions on Speech and Audio Processing, vol.1, no.4, pp.405–413, 1993.
 [4] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent component analysis and applications, Academic press, 2010.
 [5] X. Cao and R. Liu, “General approach to blind source separation,” IEEE Transactions on signal Processing, vol.44, no.3, pp.562–571, 1996.
 [6] P. Comon, “Independent component analysis, a new concept?,” Signal processing, vol.36, no.3, pp.287–314, 1994.

 [7] A. Ozerov, C. Févotte, R. Blouet, and J. Durrieu, “Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 257–260.
 [8] C. Narisetty, T. Komatsu, and R. Kondo, “Modelling of sound events with hidden imbalances based on clustering and separate sub-dictionary learning,” in 26th European Signal Processing Conference (EUSIPCO), 2018.
 [9] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of nonnegative matrix factorization with complexvalued data,” IEEE Transactions on Audio, Speech, and Language Processing, vol.21, no.5, pp.971–982, 2013.
 [10] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011, pp. 189–192.
 [11] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol.24, no.9, pp.1622–1637, 2016.
 [12] V. Y. Tan and C. Févotte, “Automatic relevance determination in nonnegative matrix factorization,” in SPARS'09 – Signal Processing with Adaptive Sparse Structured Representations, 2009.
 [13] W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol.57, no.1, pp.97–109, 1970.
 [14] D. M. Blei, J. D. Lafferty, et al., “A correlated topic model of science,” The Annals of Applied Statistics, vol.1, no.1, pp.17–35, 2007.

 [15] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” The Journal of Machine Learning Research, vol.14, no.1, pp.1303–1347, 2013.
 [16] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol.37, no.2, pp.183–233, 1999.
 [17] D. M. Blei, P. R. Cook, and M. Hoffman, “Bayesian nonparametric matrix factorization for recorded music,” in Proceedings of the 27th International Conference on Machine Learning (ICML10), 2010, pp. 439–446.
 [18] J. D. Lafferty and D. M. Blei, “Correlated topic models,” in Advances in neural information processing systems, 2006, pp. 147–154.
 [19] J. Paisley and L. Carin, “Nonparametric factor analysis with beta process priors,” in ACM Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 777–784.
 [20] A. Liutkus, F. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, “The 2016 signal separation evaluation campaign,” in Latent Variable Analysis and Signal Separation: 13th International Conference, LVA/ICA, Grenoble, France, 2017, pp. 323–332.

 [21] S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, “Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,” in LREC, 2000. [Online; accessed 29-Oct-2018] Available: http://www.openslr.org/13/.
 [22] D. Kitamura, N. Ono, and H. Saruwatari, “Experimental analysis of optimal window length for independent low-rank matrix analysis,” in 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 1170–1174.
 [23] N. Murata, S. Ikeda, and A. Ziehe, “An approach to blind source separation based on temporal structure of speech signals,” Neurocomputing, vol.41, no.1–4, pp.1–24, 2001.
 [24] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE transactions on audio, speech, and language processing, vol.14, no.4, pp.1462–1469, 2006.