1 Introduction
Speech source separation is a fundamental technique with many applications including automatic speech recognition (ASR) [1, 2] and hearing aids [3]. Although speech source separation with a single microphone is possible [4], separation with multiple microphones is more effective because it can take advantage of spatial information [5]. There exist several unsupervised approaches to multichannel speech source separation, including independent component analysis based methods [6, 7, 8] and the local Gaussian model (LGM) based method [9]. Meanwhile, motivated by the strong capability of a deep neural network (DNN) to model a speech spectrogram, supervised approaches have attracted increasing attention [10, 11, 12, 13].
In supervised speech source separation, beamforming using a DNN has been mainly studied [10, 11, 12, 13]. It has also been studied for speech enhancement and noise-robust ASR [14, 15, 16]. One approach is to estimate the complex-valued filter coefficients directly by a DNN [17, 18]. This approach is applicable only to the same microphone configuration as in its training. Another approach is called mask-based beamforming, where a TF-mask is used for estimating spatial covariance matrices [14]. After estimating the spatial covariance matrices, several beamformers, such as the minimum variance distortionless response (MVDR) beamformer [19], the generalized eigenvalue (GEV) beamformer [20], and the time-invariant multichannel Wiener filter (MWF) [21], can be constructed in accordance with the application. This approach does not depend on the microphone configuration, and the effectiveness of mask-based beamforming has been shown in noise-robust ASR [14, 15].
While mask-based beamforming for speech enhancement and noise-robust ASR has been well studied, that for speaker-independent multi-talker separation is still a challenging problem due to the utterance-level permutation problem. In order to address this issue, permutation invariant training (PIT) was proposed, which solves the permutation problem so that its loss function takes the lowest value [22, 23]. In contrast to other approaches to speaker-independent multi-talker separation [24, 25, 26], PIT allows the loss function to be designed freely, and thus the choice of the loss function is important.
Recently, using PIT, speaker-independent multi-talker separation by mask-based beamforming was presented [11, 10, 12]. In these studies, loss functions designed for monaural speech enhancement/separation, such as the phase-sensitive approximation (PSA) [27], are employed in the training. However, monaural loss functions do not consider inter-microphone information. A TF-mask considers the signal-to-noise ratio (SNR) at each TF bin, which is not directly related to the spatial covariance matrices. Meanwhile, the performance of beamforming significantly depends on the estimated spatial covariance matrices. Hence, the performance of monaural TF-masking does not directly correspond to that of mask-based beamforming.
In this paper, we propose two mask-based beamforming methods with multichannel loss functions. As illustrated in Fig. 1, the multichannel loss functions evaluate the estimated spatial covariance matrices which are used for constructing beamformers. A multichannel loss function was originally proposed for the time-varying MWF based on the multichannel Itakura–Saito divergence (MISD) [28]. We first import it to time-invariant mask-based beamforming. Furthermore, since the loss function presented in [28] is redundant for time-invariant mask-based beamforming, we also propose mask-based beamforming with a low-computational loss function. By using PIT, both proposed methods can easily be applied to speaker-independent multi-talker separation. Our main contributions are twofold: (1) proposing mask-based beamforming with multichannel loss functions; (2) clarifying the effectiveness of the multichannel loss functions for several beamformers.
2 Preliminaries
2.1 Mask-based beamforming
Let N source signals be observed by M microphones, x(t, f) be the observed mixture, and s_n^(m)(t, f) be the n-th source signal observed at the m-th microphone, where t and f are time and frequency indices, respectively. A separated source obtained by beamforming is given as
y_n(t, f) = w_n(f)^H x(t, f),  (1)
where w_n(f) is the time-invariant filter coefficients for extracting the n-th source, and (·)^H is the Hermitian transpose. For constructing beamformers, the spatial covariance matrices are required. Assuming the sparsity of the speeches in the TF domain, the spatial covariance matrix of the n-th speech source can be estimated as [29]
V_n(f) = ( Σ_t M_n(t, f) x(t, f) x(t, f)^H ) / ( Σ_t M_n(t, f) ),  (2)
where M_n(t, f) is a TF-mask for extracting the n-th source, x(t, f) = [x^(1)(t, f), ..., x^(M)(t, f)]^T, and (·)^T is the transpose. Thus, the complex-valued spatial covariance estimation is replaced by real-valued TF-mask estimation, which is independent of the number of microphones.
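As an illustration, the mask-weighted covariance estimate of Eq. (2) can be sketched in numpy as follows; the array shapes, the function name, and the small `eps` regularizer are assumptions made for this sketch, not details from the paper.

```python
import numpy as np

def spatial_covariance(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency, cf. Eq. (2).

    X:    (F, T, M) complex STFT of the M-channel mixture.
    mask: (F, T) real-valued TF-mask for one source.
    Returns V: (F, M, M) time-invariant spatial covariance matrices.
    """
    # Mask-weighted outer products x x^H, accumulated over time.
    num = np.einsum('ft,ftm,ftn->fmn', mask, X, X.conj())
    # Normalize by the total mask weight per frequency.
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den
```

With an all-ones mask this reduces to the plain time-averaged covariance, which is a quick sanity check for the normalization.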
2.2 Loss function for TF-mask estimation
To train a DNN for TF-mask estimation, several training criteria have been presented, such as PSA, which minimizes the mean square error between the clean and estimated sources on the complex plane. PSA considers the following loss function:
L_PSA = Σ_{n,t,f} | M_n(t, f) |x(t, f)| − |s_n(t, f)| cos( ∠s_n(t, f) − ∠x(t, f) ) |²,  (3)
where the microphone index is omitted because PSA does not require multichannel observation. Note that the oracle phase-sensitive mask (PSM) achieves the highest SNR in real-valued TF-masking [27], and it was recently applied to mask-based beamforming [10].
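A minimal single-channel sketch of a PSA-style loss follows, assuming the usual phase-sensitive target |s| cos(phase difference); the truncation and weighting details of [27] are omitted, and the variable names are this sketch's own.

```python
import numpy as np

def psa_loss(mask, X, S):
    """Phase-sensitive approximation loss at the reference microphone.

    mask: (F, T) real-valued TF-mask.
    X, S: (F, T) complex STFTs of mixture and clean source.
    The mask-scaled mixture magnitude is matched to the source
    magnitude projected onto the mixture phase.
    """
    target = np.abs(S) * np.cos(np.angle(S) - np.angle(X))
    return np.mean((mask * np.abs(X) - target) ** 2)
```

When the mixture equals the source and the mask is one everywhere, the loss vanishes, matching the intuition that PSA is an oracle-consistent criterion.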
However, the performance of monaural speech enhancement/separation does not directly correspond to that of mask-based beamforming. This is because such a monaural loss function does not consider inter-microphone information. Furthermore, a TF-mask considers the SNR at each TF bin, but it does not directly correspond to the accuracy of the time-invariant spatial covariance matrix calculated by Eq. (2).
2.3 Beamformers
2.3.1 MVDR beamformer
The MVDR beamformer, which aims to minimize the total power of the extracted source without distortion of the target, is one of the most popular beamformers. Based on [19], it is given as
w_n(f) = ( V_i(f)^{-1} V_n(f) / tr( V_i(f)^{-1} V_n(f) ) ) u,  (4)
where V_n(f) and V_i(f) are the spatial covariance matrices of the target and interference, respectively, tr(·) is the trace, and u is the one-hot vector corresponding to the reference microphone.
2.3.2 GEV beamformer
The GEV beamformer, which aims to maximize the SNR in each frequency subband, is formulated as [20]
w_n(f) = argmax_w ( w^H V_n(f) w ) / ( w^H V_i(f) w ).  (5)
Note that there exists an ambiguity of complex-valued scalar multiplication in w_n(f). In [30], it was resolved by minimizing the difference between the estimated source and the observation, and this resolution was used in our experiment.
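Eq. (5) is a generalized Rayleigh quotient, so the filter is the principal generalized eigenvector of the matrix pair. A sketch using SciPy's generalized Hermitian eigensolver follows; the scaling-ambiguity correction of [30] is deliberately omitted, and shapes/names are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def gev_weights(V_s, V_n):
    """Max-SNR (GEV) beamformer per frequency, cf. Eq. (5).

    V_s, V_n: (F, M, M) target and noise/interference covariances.
    Returns W: (F, M), the principal generalized eigenvector of
    (V_s, V_n) at each frequency (up to a complex scalar).
    """
    F, M, _ = V_s.shape
    W = np.zeros((F, M), dtype=complex)
    for f in range(F):
        # eigh solves V_s w = lambda V_n w; eigenvalues are ascending,
        # so the last column maximizes the SNR quotient.
        _, vecs = eigh(V_s[f], V_n[f])
        W[f] = vecs[:, -1]
    return W
```

Because the output is only defined up to a complex scalar, a post-filter such as the one in [30] (or blind analytic normalization) is needed before resynthesis.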
2.3.3 Multichannel Wiener filter
Assuming each source signal independently follows a zero-mean complex-valued Gaussian distribution [9]:
s_n(t, f) ~ N_C( 0, R_n(t, f) ),  (6)
R_n(t, f) = v_n(t, f) V_n(f),  (7)
where v_n(t, f) is the time-varying activation of the n-th source, the observed mixture follows
x(t, f) ~ N_C( 0, Σ_n v_n(t, f) V_n(f) ).  (8)
Then, the time-varying MWF can be obtained in the minimum mean square error sense as
ŝ_n(t, f) = v_n(t, f) V_n(f) ( Σ_k v_k(t, f) V_k(f) )^{-1} x(t, f).  (9)
While Eq. (9) is a time-varying filter, its time-invariant version can be calculated by replacing v_n(t, f) V_n(f) with V_n(f) [21, 31].
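The time-invariant variant, obtained by dropping the activations from Eq. (9), can be sketched as follows; the shapes, function name, and `eps` loading are assumptions of this sketch.

```python
import numpy as np

def mwf_separate(X, V_list, eps=1e-8):
    """Time-invariant multichannel Wiener filter: Eq. (9) with
    v_n(t, f) V_n(f) replaced by V_n(f).

    X:      (F, T, M) mixture STFT.
    V_list: list of (F, M, M) spatial covariances, one per source.
    Returns a list of (F, T, M) separated multichannel source images.
    """
    F, T, M = X.shape
    # Mixture covariance with a little diagonal loading for stability.
    V_mix = sum(V_list) + eps * np.eye(M)
    outs = []
    for V_n in V_list:
        W = V_n @ np.linalg.inv(V_mix)              # (F, M, M) Wiener gain
        outs.append(np.einsum('fmn,ftn->ftm', W, X))
    return outs
```

A useful property to check: the separated images sum back to (approximately) the mixture when the source covariances add up to the mixture covariance.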
3 Proposed mask-based beamforming with multichannel loss functions
In this paper, we propose two mask-based beamforming methods using DNNs trained by multichannel loss functions, which evaluate the estimated spatial covariance matrices as illustrated in Fig. 1. After reviewing a multichannel loss function for the time-varying MWF [28], the proposed time-invariant mask-based beamforming is introduced, which is based on the same loss function as used in [28]. Since that loss function focuses on the time-varying MWF, it requires the estimated time-varying activation, which is redundant for time-invariant beamforming. Hence, we also propose a mask-based beamforming method based on another loss function which does not require the estimation of the time-varying activation.
3.1 Multichannel loss function for time-varying MWF [28]
For the time-varying MWF, we proposed a multichannel loss function which evaluates the estimated time-varying spatial covariance matrices [28]. In [28], a DNN estimates the time-varying activation and the TF-mask. Based on the DNN's outputs, the time-varying spatial covariance matrices are calculated as in Eq. (7), where the time-invariant part is given by Eq. (2). Then, the loss function based on the MISD [32] between the clean source signal and the estimated one is given by
(10)  
(11)  
(12)  
(13) 
where I is the identity matrix, and the time-varying MWF is calculated as in Eq. (9). Note that the multichannel loss function given in Eq. (10) corresponds to the negative log-likelihood of the posterior distribution of the source signals.
3.2 Mask-based beamforming with the multichannel loss function given in Eq. (10)
The effectiveness of the multichannel loss function given in Eq. (10) was confirmed for the time-varying MWF [28]. As a time-invariant version of [28], we propose a mask-based beamforming method based on the multichannel loss function given in Eq. (10). Specifically, the proposed method uses the same DNN as in [28], which estimates both the time-varying activation and the time-invariant spatial covariance matrices in its training. In the testing phase, the DNN estimates only the time-invariant spatial covariance matrices for constructing several time-invariant beamformers.
In conventional mask-based beamforming, a DNN is trained to maximize the performance of monaural speech enhancement/separation. In contrast, the proposed approach trains a DNN based on the model of the multichannel signal [9], and the TF-masks are trained to estimate accurate spatial covariance matrices. The effectiveness of this approach is confirmed in the experiments in Section 4, where it is referred to as Prop. 1.
3.3 Mask-based beamforming with a low-computational multichannel loss function
In the aforementioned method, a DNN estimates both the time-varying activation and the spatial covariance matrices, but the estimation of the time-varying activation is redundant for mask-based beamforming because it is not used for constructing time-invariant beamformers. In addition, minimizing the loss function given in Eq. (10) requires huge computation for estimating the clean sources by the time-varying MWF.
In order to address these problems, we propose a mask-based beamforming method using another multichannel loss function given by
(14)  
(15) 
where the time-invariant spatial covariance matrix is calculated by Eq. (2) as in mask-based beamforming, and the time-varying activation is calculated from the oracle multichannel signal as
(16) 
which represents the fluctuation from the average power of each source. While the loss function given in Eq. (10) considers the estimated clean source, that in Eq. (14) corresponds to the MISD between the oracle and estimated spatial covariance matrices, which corresponds to the maximum likelihood estimation [32]. The proposed loss function given in Eq. (14) requires less computation compared with that in Eq. (10) thanks to avoiding the time-varying MWF calculation. In addition, by avoiding the estimation of the time-varying activation, the DNN parameters that are redundant for mask-based beamforming are eliminated. This approach is referred to as Prop. 2 in the experiment.
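Assuming Eq. (14) is built from the multichannel Itakura–Saito divergence of [32], tr(R̂^{-1}R) − log det(R̂^{-1}R) − M, the per-bin quantity can be sketched as below. The exact weighting and summation over sources and TF bins used in the paper may differ; the `eps` loading and names are this sketch's assumptions.

```python
import numpy as np

def misd(R_hat, R_ref, eps=1e-6):
    """Multichannel Itakura-Saito divergence between two PSD matrices:
    tr(R_hat^{-1} R_ref) - log det(R_hat^{-1} R_ref) - M  [32].
    Non-negative, and zero iff R_hat equals R_ref.
    """
    M = R_hat.shape[-1]
    # A = R_hat^{-1} R_ref, with diagonal loading for invertibility.
    A = np.linalg.solve(R_hat + eps * np.eye(M), R_ref)
    _, logdet = np.linalg.slogdet(A)
    return np.real(np.trace(A, axis1=-2, axis2=-1) - logdet - M)
```

Compared with Eq. (10), evaluating this quantity needs only the covariance matrices themselves, which is the source of the computational saving.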
When applied to speaker-independent multi-talker separation, there exists a permutation problem between the estimated spatial covariance matrices and the oracle time-varying activations. In order to solve this problem, we can use PIT [23]. That is, the permutation is chosen so that the loss function takes the smallest value.
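Utterance-level PIT reduces to an exhaustive search over source permutations, which is feasible for small source counts. A minimal sketch, assuming a precomputed pairwise loss matrix (names are this sketch's own):

```python
import numpy as np
from itertools import permutations

def pit_loss(loss_matrix):
    """Utterance-level PIT [23].

    loss_matrix[i, j]: loss of pairing estimate i with reference j.
    Returns the minimum total loss and the best reference permutation.
    """
    N = loss_matrix.shape[0]
    best_perm, best = None, np.inf
    for perm in permutations(range(N)):
        # Total loss under this assignment of references to estimates.
        total = sum(loss_matrix[i, p] for i, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return best, best_perm
```

For larger N the factorial search is usually replaced by the Hungarian algorithm, but for the two-talker case considered here the brute-force form is standard.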
4 Experiment
In order to confirm the effectiveness of the multichannel loss functions, DNNs trained by PSA in Eq. (3) [10] and by the multichannel loss functions were compared in speaker-independent multi-talker separation by mask-based beamforming. Based on the spatial covariance matrices estimated by TF-masking, three beamformers (the MVDR beamformer in Eq. (4), the GEV beamformer in Eq. (5), and the time-invariant MWF) were tested. In addition, we also evaluated [28], which uses the time-varying MWF, and mask-based beamformers with the oracle PSM.
Table 1: Training and testing conditions (microphone arrangement [cm], corpus, and reverberation time [ms] for the training set and test Conditions 1–3).
4.1 Experimental conditions
Table 2: Separation performance for Condition 1 (SDR, SIR, and CD in dB).

Approaches         | MVDR beamformer      | GEV beamformer       | MWF
                   | SDR   SIR   CD       | SDR   SIR   CD       | SDR   SIR   CD
Mixed              | 0.20  0.91  4.44     | -     -     -        | -     -     -
PSA [10]           | 6.87  7.63  3.68     | 7.57  9.12  3.44     | 5.79  6.21  3.93
Prop. 1            | 8.54  9.42  3.14     | 8.35  9.76  3.13     | 7.10  7.57  3.40
Prop. 2            | 7.86  8.92  3.25     | 7.83  9.35  3.16     | 6.76  7.37  3.56
Time-varying [28]  | -     -     -        | -     -     -        | 8.69  11.56 3.11
Oracle PSM         | 10.75 11.87 2.84     | 10.75 12.15 2.82     | 10.43 11.04 3.09
Table 3: Separation performance for Condition 2 (SDR, SIR, and CD in dB).

Approaches         | MVDR beamformer      | GEV beamformer       | MWF
                   | SDR   SIR   CD       | SDR   SIR   CD       | SDR   SIR   CD
Mixed              | 0.20  0.86  4.48     | -     -     -        | -     -     -
PSA [10]           | 6.31  6.93  3.78     | 6.82  8.26  3.58     | 5.40  5.80  4.00
Prop. 1            | 7.88  8.62  3.27     | 7.72  8.98  3.24     | 6.42  6.93  3.53
Prop. 2            | 7.05  7.94  3.43     | 7.07  8.45  3.37     | 6.17  6.78  3.69
Time-varying [28]  | -     -     -        | -     -     -        | 7.75  10.49 3.24
Oracle PSM         | 10.52 11.47 2.96     | 10.47 11.74 2.94     | 10.36 10.98 3.18
Table 4: Separation performance for Condition 3 (SDR, SIR, and CD in dB).

Approaches         | MVDR beamformer      | GEV beamformer       | MWF
                   | SDR   SIR   CD       | SDR   SIR   CD       | SDR   SIR   CD
Mixed              | 0.18  0.90  4.05     | -     -     -        | -     -     -
PSA [10]           | 3.69  4.32  3.74     | 3.82  5.54  3.68     | 3.68  4.04  3.84
Prop. 1            | 4.59  5.56  3.49     | 4.40  6.08  3.49     | 4.26  4.77  3.60
Prop. 2            | 4.09  4.94  3.56     | 4.02  5.64  3.54     | 4.26  4.79  3.68
Time-varying [28]  | -     -     -        | -     -     -        | 5.91  8.29  3.35
Oracle PSM         | 6.45  7.80  3.26     | 6.32  8.30  3.25     | 7.21  7.94  3.38
4.1.1 Datasets
In both the training and testing phases, the measured impulse responses in the Multichannel Impulse Response Database (MIRD) [33] and the clean speech in the TIMIT corpus [34] were used for making multichannel signals. The training and testing conditions are summarized in Table 1. The microphones were randomly selected for each sample from the microphone arrangement shown in Table 1. For training, the selected speeches were split into fixed-length segments of frames in the TF domain. While Condition 1 used the same microphone arrangement as training in the testing phase, Condition 2 employed a different microphone array. In Condition 3, the performance in a longer reverberation case was evaluated. In all conditions, the distance between the speech sources and the microphones was fixed, and the azimuth of each talker was randomly selected for each sample. All the speeches were resampled, and the short-time Fourier transform was computed using the Hann window.
4.1.2 DNN architecture and training setup
The DNN used in this experiment is illustrated in Fig. 2. It contains two bidirectional long short-term memory (BLSTM) layers, followed by parallel dense layers, where the branch estimating the time-varying activation was used only for Prop. 1 and [28]. Dropout was applied to the output of each BLSTM. In all methods, the input feature was calculated by
(17) 
where utterance-level mean and variance normalization is applied. The DNN parameters were updated with the Adam optimizer.
4.2 Experimental results
The performance of speech source separation was evaluated by the signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) from BSS_EVAL [35], and the cepstrum distortion (CD). The separation results are summarized in Tables 2–4, where the scores of the unprocessed mixed signal are omitted for the GEV beamformer and MWF because they are the same as those for the MVDR beamformer. Prop. 1 achieved the highest scores in mask-based beamforming, and Prop. 2 also resulted in better scores than PSA. In addition, the MVDR beamformer with Prop. 1 achieved SDR and CD comparable with the time-varying MWF in the shorter reverberation conditions. That is, the multichannel loss functions can be applied not only to the time-varying MWF but also to several time-invariant beamformers. We stress that the MVDR beamformer is preferred in many applications such as ASR because it does not cause artificial noise. Comparing Tables 2 and 3, both proposed methods with the multichannel loss functions worked well even when the microphone arrangement differed from training. That is, they can be applied to mask-based beamforming with different microphone arrangements, as with the conventional monaural losses. Furthermore, they also worked with longer reverberation, as shown in Table 4.
Prop. 1 achieved better scores than Prop. 2 in most cases. That is, the joint estimation of the spatial covariance matrices with the time-varying activation improved the quality of the estimated spatial covariance matrices, where the joint estimation can be interpreted as multi-task training. However, the training of Prop. 1 was considerably slower than that of Prop. 2 on an NVIDIA Tesla V100 because Prop. 1 requires the calculation of the time-varying MWF as in Eq. (12).
5 Conclusion
In this paper, we proposed two mask-based beamforming methods using DNNs trained by multichannel loss functions. The two multichannel loss functions used in the proposed methods evaluate the spatial covariance matrices based on two types of MISD. The experimental results indicate that mask-based beamforming with the multichannel loss functions outperformed that with the monaural loss function regardless of the microphone arrangement. Hence, we conclude that the multichannel loss functions are effective for various mask-based beamforming techniques.
6 Acknowledgements
The authors would like to thank Dr. Kohei Yatabe for his valuable comments and discussion.
References
[1] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon, "Acoustic modeling for Google Home," in Proc. Interspeech, Aug. 2017, pp. 399–403.
[2] S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds., New Era for Robust Speech Recognition: Exploiting Deep Learning. Springer, 2017.
[3] M. Sunohara, C. Haruta, and N. Ono, "Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with truncation of non-causal components," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2017, pp. 216–220.
[4] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[5] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multimicrophone speech enhancement and source separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 4, pp. 692–730, Apr. 2017.
[6] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, no. 1, pp. 21–34, 1998.
[7] H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, "Blind source separation based on a fast-convergence algorithm combining ICA and beamforming," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 666–678, Mar. 2006.
[8] K. Yatabe and D. Kitamura, "Determined blind source separation via proximal splitting algorithm," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 776–780.
[9] N. Q. K. Duong, E. Vincent, and R. Gribonval, "Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830–1840, Sep. 2010.
[10] L. Yin, Z. Wang, R. Xia, J. Li, and Y. Yan, "Multi-talker speech separation based on permutation invariant training and beamforming," in Proc. Interspeech, Sep. 2018, pp. 851–855.
[11] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, "Multi-microphone neural speech separation for far-field multi-talker speech recognition," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 5739–5743.
[12] Z. Wang and D. Wang, "Combining spectral and spatial features for deep learning based blind speaker separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 2, pp. 457–468, Feb. 2019.
[13] L. Drude and R. Haeb-Umbach, "Tight integration of spatial and spectral features for BSS with deep clustering embeddings," in Proc. Interspeech, Aug. 2017, pp. 2650–2654.
[14] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2016, pp. 196–200.
[15] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, "BEAMNET: End-to-end training of a beamformer-supported multi-channel ASR system," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017.
[16] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, "Unified architecture for multichannel end-to-end speech recognition with neural beamforming," IEEE J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1274–1288, Dec. 2017.
[17] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, "Neural network adaptive beamforming for robust multichannel speech recognition," in Proc. Interspeech, 2016, pp. 1976–1980.
[18] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, "Deep beamforming networks for multi-channel speech recognition," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2016, pp. 5745–5749.
[19] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260–276, Feb. 2010.
[20] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1529–1539, 2007.
[21] S. Doclo and M. Moonen, "GSVD-based optimal filtering for single and multimicrophone speech enhancement," IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002.
[22] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2017, pp. 241–245.
[23] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, pp. 1901–1913, Oct. 2017.
[24] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2016, pp. 31–35.
[25] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2017, pp. 246–250.
[26] Z. Wang, J. Le Roux, and J. R. Hershey, "Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 1–5.
[27] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2015, pp. 708–712.
[28] M. Togami, "Multi-channel Itakura-Saito distance minimization with deep neural network," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2019.
[29] T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, "Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 531–535.
[30] S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum SNR beamformers," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 1, Apr. 2007, pp. 41–44.
[31] S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales-Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, "Robust ASR using neural network based speech enhancement and feature simulation," in IEEE Workshop Autom. Speech Recognit. Underst. (ASRU), Dec. 2015, pp. 482–489.
[32] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "Multichannel extensions of non-negative matrix factorization with complex-valued data," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 971–982, May 2013.
[33] E. Hadad, F. Heese, P. Vary, and S. Gannot, "Multichannel audio database in various acoustic environments," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sep. 2014, pp. 313–317.
[34] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," 1993.
[35] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.