1 Introduction
©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The microphone input signal in teleconferencing systems, speaker diarization systems, and automatic speech recognition systems is typically a mixture of multiple speech sources and is also contaminated by room reverberation. Speech source separation techniques have therefore attracted much attention. As an unsupervised approach, blind source separation (BSS) [1, 2, 3, 4, 5, 6, 7, 8] has been actively studied. The parameters needed for speech source separation can be optimized in an unsupervised manner with a statistical model. A speech source model is highly important for estimating a separation filter and for solving the well-known inter-frequency permutation problem [9]. There are two requirements for a speech source model in BSS. First, the speech source model should capture the complicated spectral characteristics of a speech source. Second, there should be a computationally efficient algorithm for optimizing parameters based on the speech source model. However, it is highly difficult to define a statistical model which fulfills both requirements. As supervised speech source separation techniques, deep neural network (DNN) based approaches have recently been widely studied; they rely on a training dataset that contains microphone input signals and the corresponding oracle clean data, e.g., deep clustering (DC) [10, 11], permutation invariant training (PIT) [12, 13], deep attractor networks [14, 15], and hybrid approaches with BSS [16, 17, 18]. DNN based approaches can capture the complicated spectral characteristics of a speech source. Parameter optimization can be done efficiently by forward calculation of the DNN. However, it is hard to obtain oracle clean data in a target environment. Thus, it is highly desirable to train the DNN by utilizing only observed microphone input signals which contain multiple speech sources, without oracle clean data.
Recently, unsupervised DNN training techniques have been proposed [19, 20]. These techniques estimate a time-frequency mask based on the DC. The DNN is trained without an oracle time-frequency mask; a time-frequency mask estimated by a BSS technique in an unsupervised manner is adopted as an alternative to the oracle time-frequency mask. In the BSS technique, the time-frequency mask is estimated under the assumption that each component of the microphone input signal is sufficiently sparse in the time-frequency domain. However, when there are reverberation and background noise, the sparseness assumption does not hold and speech source separation performance degrades.
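The sparseness assumption underlying these mask-based techniques can be illustrated with a minimal sketch (the function name and array shapes are ours, for illustration only, not from any cited implementation): each time-frequency bin is assigned entirely to the dominant source, which is only a good approximation when the sources rarely overlap in the same bin.

```python
import numpy as np

def sparseness_masks(source_spectrograms):
    """Binary time-frequency masks under the sparseness assumption:
    every time-frequency bin is assigned to the source with the
    largest magnitude in that bin."""
    mags = np.abs(np.stack(source_spectrograms))   # (n_src, frames, freqs)
    dominant = np.argmax(mags, axis=0)             # index of the winning source
    return [(dominant == i).astype(float) for i in range(mags.shape[0])]

# Toy example: two random complex "spectrograms" and their mixture.
rng = np.random.default_rng(0)
s1 = rng.normal(size=(10, 8)) + 1j * rng.normal(size=(10, 8))
s2 = rng.normal(size=(10, 8)) + 1j * rng.normal(size=(10, 8))
m1, m2 = sparseness_masks([s1, s2])
mixture = s1 + s2
est1 = m1 * mixture   # recovers s1 only in bins where s1 dominates
```

Reverberation smears each source across neighboring frames and bins, so the dominance assumption breaks down exactly as described above.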
In this paper, we propose an unsupervised DNN training technique which utilizes a speech signal estimated by unsupervised speech source separation with a time-varying spatial filter as an alternative to the clean speech signal. The time-varying spatial filter is constructed based on a time-varying spatial covariance matrix (SCM) model [5, 21] which includes sub-models of reverberation and background noise, so as to increase speech source separation performance under reverberant and noisy environments. The proposed method also estimates a time-varying spatial filter via the DNN. The DNN infers intermediate variables which are utilized for constructing the time-varying spatial filter. Since a separated signal contains several errors, both the separated signal produced by the unsupervised method and the separated signal produced via the DNN are modeled as probabilistic signals, so as to avoid overfitting to the separation errors in the training phase. The proposed method adopts a loss function which evaluates the Kullback-Leibler divergence (KLD) between the posterior probability density function (PDF) of the separated signal produced by the unsupervised method and that of the separated signal produced via the DNN. Although there are multiple intermediate variables which should be inferred by the DNN, the gradient of the loss function can be backpropagated into the DNN through all the intermediate variables jointly, thanks to evaluating the output signal in the loss function. Experimental results under reverberant and noisy conditions show that the proposed method can train the DNN more effectively in an unsupervised manner than conventional methods, even when the number of training utterances is small, i.e., 1K. The proposed KLD loss function is also shown to achieve better performance than a loss function that evaluates the output signal as a deterministic signal.

2 Microphone input signal model
In this paper, speech source separation is performed in the time-frequency domain. The multichannel microphone input signal $\mathbf{x}_{l,k}$ ($l$ is the frame index and $k$ is the frequency index) is modeled as follows:

$\mathbf{x}_{l,k} = \sum_{i=1}^{N_s} \mathbf{s}_{i,l,k} + \mathbf{r}_{l,k} + \mathbf{n}_{l,k}$, (1)

where $N_s$ is the number of speech sources, $\mathbf{s}_{i,l,k}$ is the $i$-th speech signal, $\mathbf{r}_{l,k}$ is the late reverberation term, and $\mathbf{n}_{l,k}$ is the multichannel background noise term. The objective of speech source separation is estimation of each $\mathbf{s}_{i,l,k}$.
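The signal model in Eq. 1 can be sketched numerically as follows (the array shapes and the reverberation/noise levels are arbitrary illustration values chosen by us, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)
n_src, n_mic, n_frames, n_freq = 2, 2, 50, 129   # illustrative sizes

def crandn(*shape):
    """Circular complex Gaussian samples, standing in for STFT coefficients."""
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# Per-source multichannel images s_{i,l,k}, late reverberation r_{l,k},
# and background noise n_{l,k}, all in the STFT domain.
sources = crandn(n_src, n_frames, n_freq, n_mic)
late_reverb = 0.3 * crandn(n_frames, n_freq, n_mic)
noise = 0.05 * crandn(n_frames, n_freq, n_mic)

# Eq. (1): the microphone input is the sum of all components.
x = sources.sum(axis=0) + late_reverb + noise
```

The separation task is then to recover each slice `sources[i]` from `x` alone.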
3 Proposed method
3.1 Overview
The proposed method trains a DNN which infers the parameters of speech source separation without any clean data. A block diagram of the proposed method is shown in Fig. 1.
The proposed method consists of two major parts. In each part, the input signal is a signal dereverberated by the weighted prediction error (WPE) method [22]. Let $\mathbf{d}_{l,k}$ be the output signal of the WPE, $\mathbf{d}_{l,k} = \mathbf{x}_{l,k} - \sum_{\tau=L_b}^{L_b+L_w-1} \mathbf{F}_{\tau,k}^{H} \mathbf{x}_{l-\tau,k}$ ($L_b$ is the tap length of early reverberation and $L_w$ is the tap length of the dereverberation filter $\mathbf{F}_{\tau,k}$). The first part is a pseudo clean signal generator (PCSG). As an alternative to the clean signal, the PCSG generates a separated speech signal in an unsupervised manner based on local Gaussian modeling (LGM) [5]. The PCSG regards the pseudo clean signal (PCS) as a probabilistic signal and estimates the posterior probability density function (PDF) of the PCS, $p_{i,l,k} = p(\mathbf{s}_{i,l,k} \mid \mathbf{d}_{l,k}, \Theta_p)$, in which $\Theta_p$ is the separation parameter that is estimated in an iterative manner. The second part is the DNN based estimation part of each speech source. In the DNN part, each speech source is also regarded as a probabilistic signal and the posterior PDF $q_{j,l,k} = p(\mathbf{s}_{j,l,k} \mid \mathbf{d}_{l,k}, \Theta_q)$ is estimated, where $\Theta_q$ is the separation parameter which is estimated via the DNN. As the PCS and the signal estimated by the DNN are both probabilistic signals, we evaluate the difference between them by a loss function which evaluates a difference between the two posterior PDFs. By considering the uncertainty of the PCS and the estimated signal, the gradient of the loss function propagates into the DNN not only through the mean vector but also through the covariance matrix term of the posterior PDF inferred by the DNN, which leads to efficient DNN training.
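As a rough illustration of the dereverberation front end, the delayed linear prediction at the core of WPE can be sketched for a single channel and a single frequency bin (this deliberately omits WPE's iterative variance weighting and multichannel filtering; the function name and default values are ours):

```python
import numpy as np

def delayed_linear_prediction(x, delay=2, taps=5):
    """Single-channel sketch of delayed linear prediction dereverberation:
    predict the current frame from frames at least `delay` frames old and
    subtract the prediction, leaving the early (desired) part of the signal.
    x : (L,) complex STFT coefficients of one frequency bin."""
    L = x.shape[0]
    # Matrix of delayed copies of the signal, one column per filter tap.
    X = np.zeros((L, taps), dtype=complex)
    for t in range(taps):
        shift = delay + t
        X[shift:, t] = x[:L - shift]
    # Least-squares prediction filter (WPE instead re-weights this iteratively).
    g, *_ = np.linalg.lstsq(X, x, rcond=None)
    return x - X @ g
```

Frames closer than `delay` are excluded from the prediction so that the direct path and early reflections, which carry the desired source image, are preserved.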
3.2 Pseudo Clean Signal Generator: Unsupervised speech source separation with local Gaussian modeling
The LGM based speech source separation [5]
separates multiple speech sources assuming that the PDF of each speech source follows a time-varying Gaussian distribution. The PDF of the dereverberated signal $\mathbf{d}_{l,k}$ is modeled as the zero-mean multivariate complex Gaussian $N_c(\mathbf{d}_{l,k} \mid \mathbf{0}, \mathbf{R}_{d,l,k})$. The multichannel spatial covariance matrix (SCM) $\mathbf{R}_{d,l,k}$ of the dereverberated signal is modeled as follows:

$\mathbf{R}_{d,l,k} = \sum_{i=1}^{N_s} v_{i,l,k}\,\mathbf{R}_{i,k} + \mathbf{R}^{(r)}_{l,k} + \mathbf{R}^{(n)}_{k}$, (2)
where the first term is the SCM of each speech source, $v_{i,l,k}$ is the time-frequency variance of the $i$-th speech source, $\mathbf{R}_{i,k}$ is the multichannel covariance matrix of the $i$-th speech source, the second term $\mathbf{R}^{(r)}_{l,k}$ is the SCM of the residual late reverberation which is not removed by the WPE, and the third term $\mathbf{R}^{(n)}_{k}$ is the SCM of the background noise. Reflecting that the amount of late reverberation depends on the past speech source variance, the late reverberation term is modeled as a convolution of the past time-varying speech source variance with a time-invariant covariance matrix [21] as follows:

$\mathbf{R}^{(r)}_{l,k} = \sum_{i=1}^{N_s} \sum_{\tau=1}^{L_r} v_{i,l-\tau,k}\,\mathbf{R}^{(r)}_{i,k,\tau}$, (3)

where $L_r$ is the tap length of the residual late reverberation and $\mathbf{R}^{(r)}_{i,k,\tau}$ is the time-invariant covariance matrix of the $i$-th speech source. The third term in Eq. 2 is the time-invariant SCM of the background noise. Thus, $\Theta_p$ is $\{v_{i,l,k}, \mathbf{R}_{i,k}, \mathbf{R}^{(r)}_{i,k,\tau}, \mathbf{R}^{(n)}_{k}\}$. As all the PDFs are Gaussian distributions, the posterior PDF of the $i$-th speech source is estimated as the following Gaussian distribution:
$p(\mathbf{s}_{i,l,k} \mid \mathbf{d}_{l,k}, \Theta_p) = N_c(\mathbf{s}_{i,l,k} \mid \mu_{p,i,l,k}, \mathbf{V}_{p,i,l,k})$, (4)

where $\mu_{p,i,l,k}$ and $\mathbf{V}_{p,i,l,k}$ are calculated as $\mu_{p,i,l,k} = \mathbf{W}_{i,l,k}\mathbf{d}_{l,k}$ and $\mathbf{V}_{p,i,l,k} = (\mathbf{I} - \mathbf{W}_{i,l,k})\, v_{i,l,k}\mathbf{R}_{i,k}$, $\mathbf{I}$ is an $N_m \times N_m$ identity matrix ($N_m$ is the number of microphones), and $\mathbf{W}_{i,l,k} = v_{i,l,k}\mathbf{R}_{i,k}\mathbf{R}_{d,l,k}^{-1}$ is the multichannel Wiener filter (MWF). The separation parameter $\Theta_p$ is iteratively updated so as to maximize the log-likelihood function with an auxiliary function [6, 21]. After the update, the inter-frequency permutation problem is solved by [23].
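The posterior estimation of this section can be sketched as a naive per-frequency reference implementation (the array layouts and parameter names are ours; the actual method refines the parameters with the auxiliary-function updates of [6, 21], which are omitted here):

```python
import numpy as np

def lgm_posterior(d, v, R, R_rev, R_noise):
    """Posterior mean and covariance of each source under the LGM,
    for a single frequency bin.

    d       : (L, M) dereverberated observations over L frames, M mics
    v       : (N, L) time-frequency variances of the N sources
    R       : (N, M, M) source spatial covariance matrices
    R_rev   : (N, Lr, M, M) residual-late-reverberation covariances
    R_noise : (M, M) background noise covariance
    """
    N, L = v.shape
    M = d.shape[1]
    Lr = R_rev.shape[1]
    means = np.zeros((N, L, M), dtype=complex)
    covs = np.zeros((N, L, M, M), dtype=complex)
    for l in range(L):
        # Time-varying observation SCM: source terms plus a convolution of
        # past variances with time-invariant reverberation covariances.
        Rd = R_noise.copy().astype(complex)
        for i in range(N):
            Rd += v[i, l] * R[i]
            for tau in range(1, Lr + 1):
                if l - tau >= 0:
                    Rd += v[i, l - tau] * R_rev[i, tau - 1]
        Rd_inv = np.linalg.inv(Rd)
        for i in range(N):
            W = v[i, l] * R[i] @ Rd_inv                      # multichannel Wiener filter
            means[i, l] = W @ d[l]                           # posterior mean
            covs[i, l] = (np.eye(M) - W) @ (v[i, l] * R[i])  # posterior covariance
    return means, covs
```

When one source dominates the observation SCM, the Wiener filter approaches the identity and the posterior mean approaches the observation itself.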
3.3 Posterior PDF estimation via deep neural network
In the DNN part, the posterior PDF of each speech source is also estimated based on the LGM with the time-varying multichannel SCM model defined in Eq. 2. The posterior PDF $q_{j,l,k}$ is calculated with the estimated $\Theta_q$. The mean vector $\mu_{q,j,l,k}$ and the covariance matrix $\mathbf{V}_{q,j,l,k}$ are calculated in the same way as $\mu_{p,i,l,k}$ and $\mathbf{V}_{p,i,l,k}$, respectively. In the DNN part, the parameter $\Theta_q$ is estimated via the DNN. The DNN structure is shown in Fig. 2. The input feature is a concatenation of the log spectrum of the dereverberated signal and the phase differences between microphones. Time-frequency masks and a time-frequency variance of each speech source are inferred via the DNN, which contains four bidirectional long short-term memory (BLSTM) layers with 1200 hidden units and five dense layers. All of the covariance matrices are estimated via the time-frequency masks inferred by the DNN, i.e., $M_{j,l,k}$, $M^{(r)}_{j,l,k,\tau}$, and $M^{(n)}_{l,k}$, as follows:

$\mathbf{R}_{j,k} = \frac{\sum_{l} M_{j,l,k}\, \mathbf{\Phi}_{l,k}}{\sum_{l} M_{j,l,k}}$, (5)

$\mathbf{R}^{(r)}_{j,k,\tau} = \frac{\sum_{l} M^{(r)}_{j,l,k,\tau}\, \mathbf{\Phi}_{l,k}}{\sum_{l} M^{(r)}_{j,l,k,\tau}}$, (6)

$\mathbf{R}^{(n)}_{k} = \frac{\sum_{l} M^{(n)}_{l,k}\, \mathbf{\Phi}_{l,k}}{\sum_{l} M^{(n)}_{l,k}}$, (7)

where $\mathbf{\Phi}_{l,k} = \mathbf{d}_{l,k}\mathbf{d}_{l,k}^{H}$ ($^{H}$ is the Hermitian transpose of a matrix/vector).
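Mask-weighted spatial covariance estimation of the kind used in Eqs. 5–7 can be sketched as a normalized, mask-weighted sum of outer products (the function name and shapes are ours, for illustration):

```python
import numpy as np

def masked_scm(d, mask):
    """Mask-weighted spatial covariance estimate for one frequency bin:
    a normalized sum over frames of the outer products d_l d_l^H,
    weighted by the (nonnegative) time-frequency mask.

    d    : (L, M) observations, L frames, M microphones
    mask : (L,) mask values for this bin
    """
    outer = np.einsum('l,lm,ln->mn', mask, d, d.conj())
    return outer / np.maximum(mask.sum(), 1e-10)
```

The same routine serves for the source, residual-reverberation, and noise covariances; only the mask fed to it changes.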
3.3.1 Loss function for deep neural network training
The loss function for the DNN training is set to a divergence between the two posterior PDFs, i.e., the posterior PDF $p_{i,l,k}$ estimated by the LGM and the posterior PDF $q_{j,l,k}$ estimated via the DNN. As the loss function, the proposed method adopts a Kullback-Leibler divergence defined as

$\mathcal{L} = \min_{\pi \in \mathcal{P}} \sum_{l,k} \sum_{i=1}^{N_s} D(p_{i,l,k} \,\|\, q_{\pi(i),l,k})$,

where the utterance-level permutation invariant training (PIT) [12] is utilized similarly to conventional supervised speech source separation [13, 24], $\mathcal{P}$ is the set of possible permutations, and

$D(p_{i,l,k} \,\|\, q_{j,l,k}) = (\mu_{q,j,l,k} - \mu_{p,i,l,k})^{H}\, \mathbf{V}_{q,j,l,k}^{-1}\, (\mu_{q,j,l,k} - \mu_{p,i,l,k}) + \mathrm{tr}\!\left(\mathbf{V}_{q,j,l,k}^{-1} \mathbf{V}_{p,i,l,k}\right) + \log\frac{|\mathbf{V}_{q,j,l,k}|}{|\mathbf{V}_{p,i,l,k}|} - N_m$.

The term $\mathrm{tr}(\mathbf{V}_{q,j,l,k}^{-1} \mathbf{V}_{p,i,l,k})$ acts as a regularization term in this loss, which avoids overfitting of the MAP estimate $\mu_{q,j,l,k}$ to the pseudo clean signal $\mu_{p,i,l,k}$ that contains separation errors, and the gradient of the loss function propagates not only through $\mu_{q,j,l,k}$ but also through $\mathbf{V}_{q,j,l,k}$, which is favorable for the DNN training of time-frequency masks.
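The closed-form KLD between two circular complex Gaussians, together with its permutation-invariant aggregation, can be sketched as follows (a numpy reference for checking the formula, not the TensorFlow training code):

```python
import numpy as np
from itertools import permutations

def gaussian_kld(mu_p, V_p, mu_q, V_q):
    """KLD between circular complex Gaussians N_c(mu_p, V_p) and N_c(mu_q, V_q):
    quadratic term + trace term + log-determinant ratio - dimension."""
    n = mu_p.shape[0]
    Vq_inv = np.linalg.inv(V_q)
    diff = mu_q - mu_p
    quad = (diff.conj() @ Vq_inv @ diff).real
    trace = np.trace(Vq_inv @ V_p).real
    _, ld_q = np.linalg.slogdet(V_q)
    _, ld_p = np.linalg.slogdet(V_p)
    return quad + trace + ld_q - ld_p - n

def pit_kld(mus_p, Vs_p, mus_q, Vs_q):
    """Utterance-level PIT: minimum over source permutations of the summed KLD."""
    n_src = len(mus_p)
    return min(
        sum(gaussian_kld(mus_p[i], Vs_p[i], mus_q[j], Vs_q[j])
            for i, j in enumerate(perm))
        for perm in permutations(range(n_src))
    )
```

Because the trace and log-determinant terms depend on the predicted covariance, gradients of this loss flow through the covariance as well as the mean, which is the property the section above relies on.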
3.3.2 Output signal in inference phase
In the inference phase, the parameter $\Theta_q$ is inferred via the DNN. After that, $\Theta_q$ is iteratively updated so as to minimize the auxiliary function in the same way as in the PCSG. Finally, the separated signal is obtained as the mean vector of the posterior PDF, $\mu_{q,j,l,k}$, by the MWF.
4 Experiment
Table 1: Experimental results for the LGM based methods ($L_r$ is the number of covariance matrices of residual late reverberation).

Loss Func. | $L_r$ | SDR (dB) | SIR (dB) | CD (dB) | FWSeg.SNR (dB) | PESQ
Unprocessed | – | 2.01 | 0.52 | 5.60 | 6.56 | 1.52
None (LGM w/o DNN) | 1 | 4.75 | 8.42 | 5.05 | 9.05 | 1.90
None (LGM w/o DNN) | 4 | 4.12 | 8.19 | 5.12 | 8.20 | 1.82
None (LGM w/o DNN) | 8 | 3.84 | 7.76 | 5.17 | 7.93 | 1.78
MSE | 1 | 5.14 | 8.99 | 5.05 | 8.79 | 1.92
MSE | 4 | 4.87 | 8.71 | 5.04 | 8.14 | 1.89
MSE | 8 | 3.62 | 6.02 | 5.27 | 7.04 | 1.75
KLD | 1 | 5.44 | 9.74 | 4.95 | 8.63 | 1.98
KLD | 4 | 5.53 | 9.84 | 4.89 | 8.88 | 1.99
KLD | 8 | 5.71 | 10.27 | 4.86 | 8.95 | 2.02
Table 2: Comparison with the conventional CACGMM based unsupervised DNN training methods.

Approaches | Filtering | Loss Func. | Phase Diff. | SDR (dB) | SIR (dB) | CD (dB) | FWSeg.SNR (dB) | PESQ
Unprocessed | – | – | – | 2.01 | 0.52 | 5.60 | 6.56 | 1.52
CACGMM | Mask | – | – | 4.63 | 9.70 | 5.01 | 7.58 | 1.87
CACGMM | MVDR | – | – | 5.11 | 7.91 | 5.19 | 8.87 | 1.87
CACGMM | Mask | DC | No | 3.03 | 5.26 | 5.37 | 6.46 | 1.65
CACGMM | MVDR | DC | No | 3.35 | 4.36 | 5.42 | 7.57 | 1.68
CACGMM | Mask | DC | Yes | 4.10 | 8.32 | 5.15 | 7.06 | 1.79
CACGMM | MVDR | DC | Yes | 4.58 | 6.85 | 5.26 | 8.39 | 1.81
CACGMM | Mask | MSA+PIT | No | 3.41 | 5.74 | 5.33 | 6.87 | 1.66
CACGMM | MVDR | MSA+PIT | No | 3.63 | 4.92 | 5.41 | 7.61 | 1.70
CACGMM | Mask | MSA+PIT | Yes | 4.64 | 8.87 | 5.04 | 7.76 | 1.85
CACGMM | MVDR | MSA+PIT | Yes | 4.95 | 7.41 | 5.21 | 8.70 | 1.84
LGM (proposed) | MWF ($L_r = 8$) | KLD | Yes | 5.71 | 10.27 | 4.86 | 8.95 | 2.02

4.1 Experimental setup
Speech source separation performance of the proposed method was evaluated by using measured impulse responses from the Multichannel Impulse Response Database (MIRD) [25] and the TIMIT speech corpus [26]. In the training phase, the TIMIT train corpus was utilized; in the evaluation phase, the TIMIT test corpus was utilized. The reverberation time was randomly set to 0.36 sec or 0.61 sec. The sampling rate was set to 8000 Hz. The number of microphones was set to 2, and the number of speech sources was set to 2. Two microphone indices were randomly selected for each sample both in the training phase and in the evaluation phase. A 3-3-3-8-3-3-3 cm spacing microphone array, a 4-4-4-8-4-4-4 cm spacing microphone array, and an 8-8-8-8-8-8-8 cm spacing microphone array were utilized. The frame size was 256 pt and the frame shift was 64 pt; the number of frequency bins was 129. The distance between the speech sources and the microphones was set to m. The azimuth of each talker was randomly selected for each utterance. The number of total training utterances was set to 1000, which is a smaller dataset than the conventional one, e.g., 30000 [19], because a small number of required utterances is preferable in practice. The number of total test utterances was 200. As a background noise signal, white Gaussian noise was added. The signal-to-noise ratio (SNR) was randomly set from 20 dB to 30 dB. The SNR between the two speakers was randomly set from −5 dB to 5 dB. The minibatch size was set to 128. Each utterance was split into 100-frame segments. Neural network parameters were updated times. The Adam optimizer [27] (learning rate was ) with gradient clipping was utilized. The proposed method calculates complex-valued gradients with TensorFlow [28]. In each method, WPE was utilized (tap length was and was set to ).

4.2 Evaluation measures and comparative methods
We utilized the cepstrum distance (CD), the frequency-weighted segmental SNR (FWSeg.SNR), and PESQ as dereverberation performance measures. For speech source separation performance evaluation, we utilized the SDR and SIR from BSS_EVAL [29]. Four methods were evaluated: 1) The conventional unsupervised training method with the complex angular central Gaussian mixture model (CACGMM) [19]: a time-frequency mask of each source is inferred under the sparseness assumption. This model does not have any reverberation model. A loss function which evaluates an intermediate variable is adopted. The DNN has four BLSTM layers; only the output dense layer of the DNN is different from that of the proposed method. 2) Unsupervised speech source separation based on the LGM without a DNN: the separation parameter is updated iteratively based on [6]. This method is identical to the PCSG in the proposed method. 3) Unsupervised training with the LGM based PCSG and an MSE loss function: the loss function evaluates the difference between the MAP estimate of the PCS posterior PDF and that of the posterior PDF estimated via the DNN, i.e., the mean-squared error between $\mu_{p,i,l,k}$ and $\mu_{q,j,l,k}$. 4) The proposed unsupervised training with the LGM based PCSG and the KLD loss function.

4.3 Experimental results
First, we evaluated the three types of LGM based methods. The number $L_r$ of covariance matrices of residual late reverberation was set to 1, 4, or 8. In Table 1, experimental results for the LGM based methods are shown. The proposed unsupervised training method with the KLD loss function is shown to be more effective than the unsupervised training method with the MSE loss function. The proposed method also outperformed the LGM without the DNN (PCSG). This result confirms that the proposed method is robust against separation errors of the PCSG. In the MSE loss function cases, performance degraded when $L_r$ was 4 or 8. It can be interpreted that the DNN parameters were not correctly learned by backpropagation only through the mean vector of the posterior PDF. In the proposed KLD loss function cases, performance monotonically increased with $L_r$. This shows that backpropagation via the covariance matrix term of the posterior PDF is effective.

In Table 2, we compared the proposed KLD loss function based method with the conventional unsupervised DNN training method with the CACGMM. Unlike the proposed method, the CACGMM does not have a reverberation model. Originally, a CACGMM based method without the phase difference feature was proposed in [19]; however, the proposed method utilizes the phase differences between microphones as an input feature. To evaluate each method fairly, we also evaluated CACGMM based methods with the phase difference feature. In addition to deep clustering (DC) based methods, in which the dimension of the embedding vector was set to , PIT based methods which evaluate time-frequency masks were also evaluated. The time-frequency mask is evaluated by the magnitude spectrum approximation (MSA), because the pseudo oracle time-frequency mask is real-valued and the phase-sensitive spectrum approximation (PSA) [30] cannot be utilized. We also evaluated the original CACGMM [8] without DNN training. As an output signal, we evaluated time-frequency masking results and minimum variance distortionless response (MVDR) results. The proposed method outperformed all variants of the CACGMM based methods. This result confirms the effectiveness of the proposed reverberation and background noise models and of the DNN training with the proposed probabilistic KLD based loss function.
5 Conclusions
We proposed an unsupervised multichannel speech source separation method in which the deep neural network (DNN) is trained with no oracle clean signal. As a pseudo clean signal, the proposed method adopts the signal separated by conventional unsupervised local Gaussian modeling. To reduce reverberation and background noise effectively, the proposed method estimates a time-varying covariance matrix of the microphone input signal which contains reverberation and background noise components. Since both the pseudo clean signal and the signal estimated via the DNN are probabilistic signals, we proposed a loss function which evaluates the Kullback-Leibler divergence (KLD) between the two posterior probability density functions. Experimental results showed that the proposed method can separate speech sources more accurately than the conventional methods under reverberant and noisy environments.
References
[1] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, July 2004.

[2] P. Comon, “Independent component analysis, a new concept?,” Signal Processing, vol. 36, no. 3, pp. 287–314, April 1994.
[3] A. Hiroe, “Solution of permutation problem in frequency domain ICA using multivariate probability density functions,” in Proceedings ICA, Mar. 2006, pp. 601–608.
[4] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, “Independent vector analysis: an extension of ICA to multivariate components,” in Proceedings ICA, Mar. 2006, pp. 165–172.
[5] N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans. Audio, Speech, and Language Process., vol. 18, no. 7, pp. 1830–1840, 2010.
[6] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Trans. Audio, Speech, and Language Process., vol. 21, no. 5, pp. 971–982, May 2013.
[7] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, Determined Blind Source Separation with Independent Low-Rank Matrix Analysis, chapter 6, pp. 125–155, Springer Publishing Company, Incorporated, 2018.
[8] N. Ito, S. Araki, and T. Nakatani, “Complex angular central Gaussian mixture model for directional statistics in mask-based microphone array signal processing,” in EUSIPCO 2016, Aug. 2016, pp. 1153–1157.
[9] H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516–527, March 2011.
 [10] J.R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in ICASSP 2016, 2016, pp. 31–35.
[11] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in ICASSP 2018, 2018, pp. 1–5.
[12] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in ICASSP 2017, March 2017, pp. 241–245.
[13] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in ICASSP 2018, April 2018, pp. 5739–5743.
[14] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in ICASSP 2017, March 2017, pp. 246–250.
[15] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, April 2018.
 [16] A.A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 9, pp. 1652–1664, 2016.
 [17] A.A. Nugraha, A. Liutkus, and E. Vincent, “Deep neural network based multichannel audio source separation,” in Audio Source Separation. Springer, Mar. 2018.
 [18] S. Mogami, H. Sumino, D. Kitamura, N. Takamune, S. Takamichi, H. Saruwatari, and N. Ono, “Independent deeply learned matrix analysis for multichannel audio source separation,” in EUSIPCO 2018, Sep. 2018, pp. 1557–1561.
 [19] L. Drude, D. Hasenklever, and R. HaebUmbach, “Unsupervised training of a deep clustering model for multichannel blind source separation,” in ICASSP 2019, May 2019, pp. 695–699.
 [20] E. Tzinis, S. Venkataramani, and P. Smaragdis, “Unsupervised deep clustering for source separation: Direct learning from mixtures using spatial information,” in ICASSP 2019, May 2019, pp. 81–85.
 [21] M. Togami, “Multichannel timevarying covariance matrix model for late reverberation reduction,” arXiv:1910.08710, 2019.
[22] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, Sept. 2010.
 [23] N. Murata, S. Ikeda, and A. Ziehe, “An approach to blind source separation based on temporal structure of speech signals,” Neurocomputing, vol. 41, no. 1, pp. 1 – 24, 2001.
 [24] M. Togami, “Multichannel Itakura Saito distance minimization with deep neural network,” in ICASSP 2019, May 2019, pp. 536–540.
 [25] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” IWAENC 2014, pp. 313–317, 2014.
 [26] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” 1993.
 [27] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR 2015, 2015.

[28] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org.
[29] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.

[30] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in ICASSP 2015, April 2015, pp. 708–712.