1 Introduction
Audio source separation has many useful applications, for example as a frontend for robust speech recognition and to improve voice quality for telephony, augmented reality, and assistive hearing devices. Leveraging multiple microphones has great potential to improve separation quality, since the spatial relationship between microphones provides additional useful information beyond the spectral cues of acoustic sources that are exploited by single-microphone approaches. Multi-microphone processing can also improve the rejection of reverberation and diffuse background noise.
Recently, a new paradigm has emerged as a promising alternative to conventional beamforming approaches: neural beamforming, where the key advance is to utilize the nonlinear modeling power of DNNs to identify time-frequency (TF) units dominated by each source for spatial covariance matrix computation [1, 2]. Unlike traditional approaches, neural beamforming methods have the potential to learn and adapt from massive training data, which improves their robustness to unknown position and orientation of microphones and sources, types of acoustic sources, and room geometry. An initial success of neural beamforming approaches was improving time-invariant beamforming using TF-domain mask prediction, where predicted masks were used to obtain time-invariant spatial covariance matrices for all sources. This has proven useful in ASR tasks such as CHiME-3/4 [3]. Recent studies considered online or low-latency beamforming [4, 5] and time-varying beamforming [6] for improved performance in certain scenarios. In addition, spatial features that can encode spatial information, such as inter-channel phase differences (IPDs) [7], cosine and sine IPDs [8], and target-direction compensated IPDs [9], are utilized as additional network input to improve mask estimation in masking-based beamforming. Other cues, such as visual information [10], location-guided side information [11], and speaker embeddings [12, 13], can also be used as additional inputs for both target speaker selection and performance improvement in neural speech separation, in both single- and multi-microphone setups.
In this paper, we explore alternating between spectral estimation using DNN-based masking and spatial separation using linear beamforming with a multichannel Wiener filter (MCWF). This is inspired by the single-channel iterative network of [14], which we use as a baseline. By doing so, the linear beamforming is effectively driven by DNN-based masking. For beamforming, we consider both time-invariant and time-varying ways of calculating covariance matrices to improve spatial separation. We also use the spectral masking network in the second alternation as a post-filter, which takes the beamformed signal and the mixture as input to produce a refined spectral estimate. Evaluation results on four challenging sound separation tasks demonstrate the effectiveness of the proposed algorithms.
2 Methods
Assume an $M$-channel time-domain signal consisting of $N$ sources that has been recorded in a reverberant environment. The short-time Fourier transform (STFT) of this signal can be written as

$\mathbf{Y}(t,f) = \sum_{n=1}^{N} \mathbf{X}_n(t,f),$  (1)

where $\mathbf{Y}(t,f) \in \mathbb{C}^M$ and $\mathbf{X}_n(t,f) \in \mathbb{C}^M$ respectively represent the mixture and the $n$-th source image at time $t$ and frequency $f$. Our study proposes multiple algorithms to recover the constituent reverberant sources as received by a reference microphone, with or without leveraging the spatial information contained in $\mathbf{Y}$. We assume an offline processing scenario, and the sources are non-moving throughout each utterance.
Each spectral masking stage in the proposed system uses a time-domain convolutional network (TDCN) [15]. The first stage performs single-channel processing to estimate each source via TF masking. The estimated sources are then used to compute statistics for time-invariant or time-varying beamforming. The second masking stage combines spectral and spatial information by taking in the mixture and the beamformed results for post-filtering. See Figure 1 for an illustration of the overall system. As shown in Figure 1, we train through multiple iSTFT/STFT projection layers. These layers address the phase inconsistency problem, a common issue of real-valued masking in the magnitude domain [16, 17]. By using different window sizes for the iSTFT/STFT pairs, beamforming can be performed with a larger window size, and hence a higher frequency resolution, than the masking network. This strategy is found to dramatically improve time-invariant MCWF in our experiments.
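To make the window-size handoff concrete, the following is a minimal NumPy/SciPy sketch of passing a masking-stage signal through an iSTFT/STFT pair into a higher-resolution beamforming domain. The function name, the 75% overlap, and the window lengths are our own illustrative assumptions, not details from the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def reproject(x, fs=16000, win_in=0.032, win_out=0.128):
    """Sketch: move a signal between two STFT resolutions.

    A masking network operating with a short window (win_in) can hand its
    time-domain output to a beamformer that uses a longer window (win_out),
    by going through an iSTFT/STFT pair.
    """
    n_in, n_out = int(fs * win_in), int(fs * win_out)
    # Analysis at the masking-network resolution (75% overlap assumed).
    _, _, Z = stft(x, fs=fs, nperseg=n_in, noverlap=3 * n_in // 4)
    # ... a real-valued mask would be applied to Z here ...
    # Synthesis back to the time domain (iSTFT projection layer).
    _, y = istft(Z, fs=fs, nperseg=n_in, noverlap=3 * n_in // 4)
    # Re-analysis at the higher-resolution beamforming window.
    _, _, Z_bf = stft(y, fs=fs, nperseg=n_out, noverlap=3 * n_out // 4)
    return y, Z_bf
```

Because the Hann window at 75% overlap satisfies the COLA constraint, the iSTFT/STFT pair is a consistency projection rather than a lossy resampling.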
2.1 Spectral Mask Estimation for Sound Separation
We use TDCN-based TF masking for single-channel speech enhancement and speaker separation to produce estimates $\hat{x}_n = \mathrm{iSTFT}(\hat{M}_n \odot Y_r)$, where $\hat{M}_n$ is the mask network output for source $n$ and $Y_r$ is the reference-microphone mixture [18]. The loss function maximizes signal-to-noise ratio (SNR) in the time domain:

$\mathcal{L}_{\mathrm{SNR}} = \min_{\pi \in \mathcal{P}} \sum_{n=1}^{N} -10 \log_{10} \frac{\|x_n\|^2}{\|x_n - \hat{x}_{\pi(n)}\|^2},$  (2)

where $x_n$ is the time-domain reference signal for source $n$, $\odot$ denotes elementwise multiplication, $\mathcal{P}$ is the set of permutations over sources, and $\pi^*(n)$ denotes the optimal permutation index for source $n$. We use permutation invariance for speaker separation, but not for speech enhancement, which corresponds to fixing $\pi^*(n) = n$.
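The permutation-invariant negative-SNR objective can be sketched as follows; this is our own NumPy illustration of the loss in (2) (an exhaustive search over permutations, which is fine for small source counts), not the paper's training code:

```python
import itertools
import numpy as np

def neg_snr(ref, est, eps=1e-8):
    """Negative SNR (dB) between a reference and an estimated waveform."""
    return -10.0 * np.log10(np.sum(ref**2) / (np.sum((ref - est)**2) + eps) + eps)

def pit_neg_snr(refs, ests):
    """Permutation-invariant negative-SNR loss over N sources.

    refs, ests: arrays of shape (N, num_samples). Returns the loss under
    the best permutation, and that permutation itself.
    """
    n = refs.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):  # exhaustive; fine for small N
        loss = sum(neg_snr(refs[i], ests[p]) for i, p in enumerate(perm))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

For speech enhancement, the permutation search would simply be skipped, i.e., sources are matched by index.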
2.2 Multichannel Wiener Filter for Spatial Separation
We perform beamforming in the frequency domain using an STFT whose window size is not necessarily the same as that of the mask network. We use the estimated signals from the first TDCN to compute the spatial covariance of each source for a time-invariant MCWF (TI-MCWF):

$\hat{\mathbf{w}}_n(f) = \hat{\Phi}^{(Y)}(f)^{-1}\, \hat{\Phi}^{(n)}(f)\, \mathbf{u},$  (3)

where $\mathbf{u} \in \{0,1\}^M$ is a one-hot vector with the reference microphone index set to one, the mixture covariance matrix is estimated as

$\hat{\Phi}^{(Y)}(f) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{Y}(t,f)\, \mathbf{Y}(t,f)^{\mathsf{H}},$  (4)

and $\hat{\Phi}^{(n)}(f)$ is the source covariance matrix estimated as

$\hat{\Phi}^{(n)}(f) = \frac{1}{T} \sum_{t=1}^{T} \hat{\mathbf{X}}_n(t,f)\, \hat{\mathbf{X}}_n(t,f)^{\mathsf{H}},$  (5)

$\hat{\mathbf{X}}_n(t,f) = \hat{M}'_n(t,f)\, \mathbf{Y}(t,f).$  (6)

This approach follows recent developments in neural beamforming [3, 19, 2]. The key idea is to use TF units dominated by source $n$ to compute its covariance matrix for beamforming. Here the mask $\hat{M}'_n$ is recomputed in an alternate STFT domain from the time-domain output signal for source $n$ from the masking network. For convenience, the mask is considered the same across microphones, which is a good approximation for compact arrays in far-field conditions. The beamforming result for source $n$ is computed as

$\hat{X}^{\mathrm{BF}}_n(t,f) = \hat{\mathbf{w}}_n(f)^{\mathsf{H}}\, \mathbf{Y}(t,f).$  (7)
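As a concrete illustration, the TI-MCWF of (3)-(7) can be sketched per frequency in NumPy. The function name, array layout, and diagonal-loading term `eps` are our own assumptions for the sketch, not details from the paper:

```python
import numpy as np

def ti_mcwf(Y, X_hat, ref_mic=0, eps=1e-6):
    """Time-invariant multichannel Wiener filter, a sketch of eqs. (3)-(7).

    Y:     mixture STFT, shape (mics, frames, freqs), complex.
    X_hat: masked estimate STFT for one source, same shape.
    Returns the beamformed single-channel STFT, shape (frames, freqs).
    """
    M, T, F = Y.shape
    u = np.zeros(M); u[ref_mic] = 1.0   # one-hot reference selector, eq. (3)
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        Yf = Y[:, :, f]                       # (M, T)
        Xf = X_hat[:, :, f]
        phi_y = (Yf @ Yf.conj().T) / T        # mixture covariance, eq. (4)
        phi_x = (Xf @ Xf.conj().T) / T        # source covariance, eq. (5)
        # w = phi_y^{-1} phi_x u, with small diagonal loading for stability.
        w = np.linalg.solve(phi_y + eps * np.eye(M), phi_x) @ u
        out[:, f] = w.conj() @ Yf             # w^H Y, eq. (7)
    return out
```

With an oracle source estimate and a single point source, this filter recovers the reference-microphone source image up to the regularization error.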
2.3 Time-Varying Beamforming for Spatial Estimation
A TI-MCWF has limited separation power, as it is only a linear filter per frequency. Similar to recursive averaging, one straightforward way to improve the separation capability is to estimate the covariances within a sliding window, i.e.,

$\hat{\Phi}^{(Y)}(t,f) = \frac{1}{2K+1} \sum_{t'=t-K}^{t+K} \mathbf{Y}(t',f)\, \mathbf{Y}(t',f)^{\mathsf{H}},$  (8)

$\hat{\Phi}^{(n)}(t,f) = \frac{1}{2K+1} \sum_{t'=t-K}^{t+K} \hat{\mathbf{X}}_n(t',f)\, \hat{\mathbf{X}}_n(t',f)^{\mathsf{H}},$  (9)

where $K$ is half the window size in frames and $\hat{\mathbf{X}}_n$ is from (6).
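A minimal sketch of the per-frame sliding-block covariance in (8)-(9), with our own choice of array layout and edge truncation at the utterance boundaries (the paper does not specify boundary handling):

```python
import numpy as np

def sliding_covariance(Z, half_win):
    """Per-frame covariance over a sliding block of frames, as in eqs. (8)-(9).

    Z: STFT of shape (mics, frames, freqs). Returns an array of shape
    (frames, freqs, mics, mics) holding one covariance per TF frame.
    """
    M, T, F = Z.shape
    # Outer products per TF unit: result[t, f, m, n] = Z[m,t,f] * conj(Z[n,t,f]).
    outer = np.einsum('mtf,ntf->tfmn', Z, Z.conj())
    cov = np.empty_like(outer)
    for t in range(T):
        lo, hi = max(0, t - half_win), min(T, t + half_win + 1)
        cov[t] = outer[lo:hi].mean(axis=0)  # average over the block
    return cov
```

When the half-window exceeds the utterance length, every frame receives the same global average, recovering the time-invariant estimates of (4)-(5).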
Another way to compute a time-varying covariance matrix for each source is to factorize it as a product of a time-varying power spectral density (PSD) and a time-invariant coherence matrix [20, 21, 22]. The rationale is that for a non-moving source, its coherence matrix is time-invariant, assuming the beamforming STFT window is long enough to capture most of the reverberation. Unlike conventional methods, which typically use maximum likelihood estimation or non-negative matrix factorization to estimate the PSD and spatial coherence, the proposed algorithm leverages the estimated source signals produced by the TDCN to compute these statistics. Mathematically,

$\hat{\Phi}^{(n)}(t,f) = \hat{\phi}_n(t,f)\, \hat{\Phi}^{(n)}(f)\, /\, [\hat{\Phi}^{(n)}(f)]_{rr},$  (10)

where $\hat{\phi}_n(t,f) = |\hat{X}_{n,r}(t,f)|^2$ is the PSD estimate, $\hat{\Phi}^{(n)}(f)$ can be computed either as in (5) or as in (9) to utilize time-invariant or time-varying coherence estimation, $r$ is the index of the reference microphone, and dividing by $[\hat{\Phi}^{(n)}(f)]_{rr}$ normalizes the spatial component to have a unit diagonal entry at the reference microphone. In far-field conditions, $[\hat{\Phi}^{(n)}(f)]_{mm} \approx [\hat{\Phi}^{(n)}(f)]_{rr}$ for any microphone $m$.

A time-varying factorized (TVF) MCWF is then computed as

$\hat{\mathbf{w}}_n(t,f) = \hat{\Phi}^{(Y)}(t,f)^{-1}\, \hat{\Phi}^{(n)}(t,f)\, \mathbf{u},$  (11)

where $\hat{\Phi}^{(Y)}(t,f) = \sum_{n'=1}^{N} \hat{\Phi}^{(n')}(t,f)$. The beamforming result for source $n$ is computed as

$\hat{X}^{\mathrm{BF}}_n(t,f) = \hat{\mathbf{w}}_n(t,f)^{\mathsf{H}}\, \mathbf{Y}(t,f).$  (12)
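The factorized time-varying filter can be sketched as follows. This is our own NumPy illustration under the time-invariant-coherence option of (10), with a regularizer `eps` and naive per-TF loops that a real implementation would vectorize:

```python
import numpy as np

def tvf_mcwf(Y, X_hats, ref_mic=0, eps=1e-6):
    """Time-varying factorized MCWF, a sketch of eqs. (10)-(12).

    Y: mixture STFT (mics, frames, freqs). X_hats: list of per-source masked
    STFTs, each (mics, frames, freqs). The time-invariant coherence of each
    source is scaled per frame by the reference-channel PSD estimate
    |X_hat_ref(t, f)|^2; the mixture covariance is the sum over sources.
    """
    M, T, F = Y.shape
    u = np.zeros(M); u[ref_mic] = 1.0
    # Time-invariant covariance per source, (F, M, M), as in eq. (5).
    coh = [np.einsum('mtf,ntf->fmn', X, X.conj()) / T for X in X_hats]
    # Normalize so the reference-mic diagonal entry is one (spatial part).
    coh = [c / (c[:, ref_mic, ref_mic].real[:, None, None] + eps) for c in coh]
    psd = [np.abs(X[ref_mic]) ** 2 for X in X_hats]           # (T, F) each
    outs = [np.zeros((T, F), dtype=complex) for _ in X_hats]
    for t in range(T):
        for f in range(F):
            phis = [p[t, f] * c[f] for p, c in zip(psd, coh)]  # eq. (10)
            phi_y = sum(phis) + eps * np.eye(M)                # mixture cov.
            for n, phi_n in enumerate(phis):
                w = np.linalg.solve(phi_y, phi_n) @ u          # eq. (11)
                outs[n][t, f] = w.conj() @ Y[:, t, f]          # eq. (12)
    return outs
```

With oracle estimates and spatially orthogonal toy sources, the per-frame filter separates the mixture almost exactly, which the TI filter of (3) cannot do within a single frequency.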
2.4 Spectral Estimation Revisited: Postfiltering
Given a beamformed estimate $\hat{X}^{\mathrm{BF}}_n$, we extract its magnitude and combine it with the mixture magnitude as input to a second network to estimate a second, post-filtering, mask $\hat{M}^{(2)}_n$ for each source. The magnitude of the beamformed estimate can be considered a directional feature that guides the network to attend to a particular direction [9]. We explore the following three different ways of applying this post-filtering mask:

BF: $\hat{X}_n(t,f) = \hat{M}^{(2)}_n(t,f)\, \hat{X}^{\mathrm{BF}}_n(t,f),$  (13)

Noisy: $\hat{X}_n(t,f) = \hat{M}^{(2)}_n(t,f)\, Y_r(t,f),$  (14)

Hybrid: $\hat{X}_n(t,f) = \hat{M}^{(2)}_n(t,f)\, |Y_r(t,f)|\, e^{j \angle \hat{X}^{\mathrm{BF}}_n(t,f)}.$  (15)

The first method (13), denoted BF, applies the mask to the beamformed signal, so the phase produced by beamforming serves as the phase estimate. The second (14), denoted Noisy, applies the mask directly to the mixture, employing the mixture phase for signal resynthesis. The third (15), denoted Hybrid, applies the mask to the mixture magnitude and takes the phase produced by beamforming as the phase estimate. The loss function for post-filtering models sums the negative-SNR loss of (2) over the separated waveforms of both masking stages:

$\mathcal{L} = \mathcal{L}^{(1)}_{\mathrm{SNR}} + \mathcal{L}^{(2)}_{\mathrm{SNR}}.$  (16)
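The three mask-application choices in (13)-(15) differ only in which magnitude and phase they combine, which a short sketch makes explicit (function and argument names are ours):

```python
import numpy as np

def apply_postfilter_mask(mask, Y_ref, X_bf, mode="Noisy"):
    """Apply a second-stage mask in one of the three ways of eqs. (13)-(15).

    mask:  real-valued post-filtering mask, shape (frames, freqs).
    Y_ref: reference-microphone mixture STFT, complex, same shape.
    X_bf:  beamformed source estimate STFT, complex, same shape.
    """
    if mode == "BF":       # mask the beamformed signal; beamformer phase
        return mask * X_bf
    if mode == "Noisy":    # mask the mixture; mixture phase
        return mask * Y_ref
    if mode == "Hybrid":   # mixture magnitude, beamformer phase
        return mask * np.abs(Y_ref) * np.exp(1j * np.angle(X_bf))
    raise ValueError(f"unknown mode: {mode}")
```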
Additional subsequent spatial and spectral iterations could be performed, but here we stop at the second spectral mask estimator.
3 Data and Tasks
We use simulated room impulse responses (RIRs) generated using an image-method room simulator with frequency-dependent wall filters. During simulation, all source image locations are randomly perturbed by up to 8 cm in each direction to avoid the "sweeping echo" effect [23]. For each example, the RIRs are created by sampling a random position of a cube-shaped microphone array within a room of random size: width from 3 to 7 meters, length from 4 to 8 meters, and height from 2.13 to 3.05 meters. These RIRs are used to generate mixtures of sounds. Speech is from the LibriTTS database [24], and non-speech sounds are from freesound.org. Using user-annotated tags, we filtered out artificial sounds (such as synthesizer noises) and used a sound classification network trained on AudioSet [25] to avoid clips with a high probability of speech. The training set consists of about 366 hours, and the validation and test sets consist of about 10 hours each. These datasets will be publicly released at the time of final publication.
We validate the proposed algorithms on one-, two-, and eight-microphone setups. The eight microphones at the eight corners of the cube are used for separation in the eight-channel setup, the two microphones at the two ends of one side are used for the two-channel setup, and the first microphone on that side is used for the single-channel case. The microphone used in the single-channel setup is considered the reference microphone for two- and eight-channel separation.
We evaluate the proposed algorithms on four sound separation corpora. These include a two-speaker separation dataset used in [8], which is constructed from WSJ0-2mix using a room simulator with random room configurations and microphone positions, and three datasets mixed by us for two-speaker separation, three-speaker separation, and speech enhancement. For the speech enhancement task, a speech source is mixed with three directional noise sources, and the goal is to separate the speech from the noise. For each task, a random speech clip is selected from the clean source data, and each of the other sources is then scaled to an SNR, in dB, randomly drawn from a fixed range relative to the initial speech clip.
The network architecture of the two TDCNs used for mask estimation is similar to the recently proposed Conv-TasNet [15]. It consists of 4 repeats of 8 convolutional blocks. Each block consists of a dilated separable convolution with global layer normalization and a residual connection, where the dilation factor for the $b$-th block within a repeat is $2^{b-1}$. In contrast to Conv-TasNet, we utilize an STFT basis with 32 ms windows rather than a learned basis with a very small window size, as initial results showed that the former leads to better performance. This is likely because an STFT with a larger window can better deal with room reverberation. The hop size is 8 ms and the sampling rate is 16 kHz. A 512-point FFT is used to extract 257-dimensional magnitude features for mask estimation. Scale-invariant signal-to-noise ratio improvement (SI-SNRi) [26] over unprocessed speech is used as the evaluation metric.
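For reference, SI-SNR can be sketched as below; this follows the common projection-based definition discussed in [26] (the mean subtraction and `eps` regularizer are implementation conventions of ours):

```python
import numpy as np

def si_snr(ref, est, eps=1e-8):
    """Scale-invariant SNR in dB.

    The estimate is compared against the reference after projecting onto it,
    so the metric ignores overall gain differences between ref and est.
    """
    ref = ref - ref.mean()
    est = est - est.mean()
    # Optimal scaling of the reference toward the estimate.
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```

SI-SNRi is then the difference between the SI-SNR of the processed output and that of the unprocessed mixture, both measured against the same reference.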
As a single-channel baseline, we consider an iterative masking network [14] in which no spatial information is used. This network consists of two masking stages. The separated time-domain outputs of the first stage are concatenated with the time-domain mixture signal and fed to the second masking stage to produce the final separated estimates. This model is trained with a negative SNR loss on the separated waveforms of both stages.
4 Results
Figures 2, 3, and 4 show the performance of various methods under different conditions, using either TI or TVF covariance estimation, with either 2 or 8 microphones, for each of the four tasks.
Table 1: SI-SNRi (dB) on the validation and test sets for all beamforming conditions.

| Experiment | Method | No. of Mics | Block (s) | Window (ms) | Speech Enh. (Val / Test) | 2 Spk. Sep. (Val / Test) | 3 Spk. Sep. (Val / Test) | WSJ0 2 Spk. Sep. (Val / Test) |
|---|---|---|---|---|---|---|---|---|
| Single Channel | Mask Network | 1 | - | - | 15.8 / 15.1 | 16.7 / 15.6 | 13.0 / 12.3 | 6.5 / 6.1 |
| Sliding Block | TI | 2 | 0.8 | 32 | 13.9 / 13.3 | 13.8 / 12.9 | 10.5 / 10.1 | 6.6 / 6.3 |
| Sliding Block | TI | 8 | 0.8 | 32 | 15.5 / 14.9 | 13.4 / 12.6 | 10.9 / 10.5 | 8.9 / 8.7 |
| Sliding Block | TVF | 2 | full | 32 | 15.4 / 14.8 | 14.6 / 13.6 | 10.5 / 10.2 | 9.3 / 9.0 |
| Sliding Block | TVF | 8 | full | 32 | 16.2 / 15.5 | 14.3 / 13.5 | 10.9 / 10.5 | 9.4 / 9.2 |
| Window Sizes | TI | 2 | full | 128 | 12.1 / 11.5 | 7.6 / 7.3 | 7.7 / 7.5 | 7.2 / 6.9 |
| Window Sizes | TI | 8 | full | 128 | 16.3 / 15.7 | 12.8 / 12.4 | 11.6 / 11.3 | 10.1 / 9.8 |
| Window Sizes | TVF | 2 | full | 32 | 15.4 / 14.8 | 14.6 / 13.6 | 10.5 / 10.2 | 9.3 / 9.0 |
| Window Sizes | TVF | 8 | full | 32 | 16.2 / 15.5 | 14.3 / 13.5 | 10.9 / 10.5 | 9.4 / 9.2 |
| Post-Filtering | TI + PF Noisy | 2 | full | 128 | 16.6 / 15.9 | 17.4 / 16.4 | 14.4 / 13.9 | 10.3 / 9.9 |
| Post-Filtering | TI + PF Noisy | 8 | full | 128 | 17.4 / 16.7 | 19.2 / 18.2 | 15.4 / 14.8 | 10.7 / 10.4 |
| Post-Filtering | TVF + PF Hybrid | 2 | full | 32 | 15.9 / 15.2 | 17.2 / 16.2 | 13.3 / 12.7 | 9.0 / 8.5 |
| Post-Filtering | TVF + PF Noisy | 8 | full | 32 | 16.1 / 15.4 | 16.9 / 15.7 | 12.7 / 12.1 | 9.3 / 9.0 |
| Oracle | Oracle Mask | 1 | - | - | 18.5 / 17.9 | 23.0 / 22.1 | 21.2 / 20.7 | 12.6 / 12.4 |
| Oracle | Oracle Mask + TI | 2 | full | 128 | 12.7 / 12.2 | 10.4 / 10.2 | 11.1 / 10.9 | 8.6 / 8.4 |
| Oracle | Oracle Mask + TI | 8 | full | 128 | 18.2 / 17.6 | 18.5 / 18.2 | 19.4 / 19.1 | 12.9 / 12.7 |
| Oracle | Oracle Mask + TVF | 2 | full | 64 | 18.0 / 17.5 | 21.6 / 20.8 | 20.7 / 20.2 | 12.5 / 12.3 |
| Oracle | Oracle Mask + TVF | 8 | full | 64 | 18.9 / 18.3 | 22.0 / 21.3 | 21.7 / 21.2 | 13.3 / 13.1 |
Table 1 summarizes the best results in terms of SI-SNRi on the validation and test sets for all conditions. For each set of experiments, we choose a particular beamforming parameter, marked with (*), that obtains the best average performance on the validation data. In the sliding window experiment, for each combination of method (TI and TVF) and number of microphones (2 and 8), we choose the sliding window size, from those shown in Figure 2, that has the best validation set performance, with the beamforming window size fixed at 32 ms, and show its validation and test performance on all tasks. In the beamforming window size experiment, we hold the block size fixed to the full condition and optimize over 32 ms, 64 ms, and 128 ms beamforming window sizes. The TI condition had an optimal window size of 128 ms, whereas the TVF method had an optimal window size of 32 ms. In the post-filtering experiment, we optimize over the three masking conditions, BF, Noisy, and Hybrid, with the beamforming window sizes held at the optimum determined from the beamforming window size experiment. The optimal condition was to mask the noisy signal for all but the TVF 2-mic condition, in which the hybrid masking worked best. We also report the performance of oracle methods, with the optimized window size marked with (*). These oracle methods include oracle binary masking on the reference microphone, as well as our TI and TVF beamforming methods, where the source estimate is given by an oracle binary mask on the reference microphone. Overall, the best non-oracle approach across all four tasks performs post-filtering with TI beamforming using a 128 ms window and masking on the mixture at the reference microphone.
The figures show the complete results of all experiments summarized in the table. Figure 2 plots the performance of MCWF beamforming for varying lengths of the sliding window. As the sliding window length increases, TI methods degrade in performance, likely because they are less capable of modeling spatial dynamics. In contrast, the performance of TVF methods is fairly consistent, indicating that they are better at modeling the dynamic spatial statistics of the different sources. The single-channel model (denoted Mask Network) is quite competitive with all of the beamforming approaches on all tasks except the WSJ0 reverberant two-speaker separation task. This is possibly due to the significant overlap of the speech signals in the WSJ0 reverberant dataset, which is substantially greater than the overlap present in our own mixed datasets. Generally, the single-channel baseline outperforms the spatial approaches because the beamformer is limited to linear transformations of the mixture, while nonlinear single-channel masking has more flexibility.
Figure 3 shows the performance of MCWF beamforming versus the length of the beamforming STFT window, for 32, 64, and 128 ms. We observe that longer STFT windows generally improve performance, especially for TI methods. Again, the single-channel baseline is quite competitive on all tasks except WSJ0 two-speaker separation. We find that TVF generally achieves the best results, regardless of the number of microphones. In particular, TVF substantially boosts performance over TI in the two-microphone conditions.
Figure 4 reports the results of our post-filtering models for the different masking methods described in (13)-(15). Clearly, using post-filtering dramatically outperforms the single-channel iterative masking baseline, demonstrating the effectiveness of multi-microphone processing. In contrast to our beamforming-only results, using TI beamforming before the post-filter is better than TVF beamforming.
5 Conclusions
We have explored a strategy of alternating between spectral estimation using a mask-based network and spatial estimation using beamformers. For the spatial estimation, we compared multiple ways of computing covariance matrices for time-invariant and time-varying beamforming. For the final spectral estimation, we investigated a deep-network post-filtering method with various masking schemes. The proposed methods were evaluated on four sound separation tasks. Experimental results suggest that, when combined with neural-network-based post-filtering, time-invariant beamforming with a reasonably large window size yields the best separation performance for non-moving sources, although time-varying beamforming shows clear improvements over time-invariant beamforming when post-filtering is not performed. Future research will investigate further iterations of alternation between spatial and spectral estimation, as well as more complex tasks with moving sources.
References
 [1] J. Heymann, L. Drude, and R. HaebUmbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. ICASSP, 2016.
 [2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using singlechannel mask prediction networks,” in Proc. Interspeech, 2016.
 [3] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes,” Computer Speech and Language, vol. 46, 2017.
 [4] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. HaebUmbach, “Exploring practical aspects of neural maskbased beamforming for farfield speech recognition,” in Proc. ICASSP, 2018.
 [5] T. Yoshioka, Z. Chen, C. Liu, X. Xiao, H. Erdogan, and D. Dimitriadis, “Lowlatency speakerindependent continuous speech separation,” in Proc. ICASSP, 2019.
 [6] Y. Kubo, T. Nakatani, M. Delcroix, K. Kinoshita, and S. Araki, “Maskbased MVDR beamformer for noisy multisource environments: Introduction of timevarying spatial covariance model,” in Proc. ICASSP, 2019.
 [7] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multimicrophone neural speech separation for farfield multitalker speech recognition,” in Proc. ICASSP, 2018.
 [8] Z.Q. Wang, J. Le Roux, and J. R. Hershey, “Multichannel deep clustering: Discriminative spectral and spatial embeddings for speakerindependent speech separation,” in Proc. ICASSP, 2018.
 [9] Z.Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM TASLP, vol. 27, no. 2, 2018.
 [10] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speakerindependent audiovisual model for speech separation,” ACM Transactions on Graphics, 2018.
 [11] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multichannel overlapped speech recognition with location guided speech extraction network,” in Proc. SLT, 2018.
 [12] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. ICASSP, 2018.
 [13] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speakerconditioned spectrogram masking,” in Proc. Interspeech, 2019.
 [14] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. L. Roux, and J. R. Hershey, “Universal sound separation,” Proc. WASPAA, 2019.
 [15] Y. Luo and N. Mesgarani, “ConvTasNet: Surpassing ideal timefrequency magnitude masking for speech separation,” IEEE/ACM TASLP, vol. 27, no. 8, 2019.
 [16] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in Proc. ICASSP, 2019.
 [17] Z.-Q. Wang, K. Tan, and D. Wang, "Deep learning based phase reconstruction for speaker separation: A trigonometric perspective," in Proc. ICASSP, 2019.
 [18] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM TASLP, vol. 26, Aug. 2018.
 [19] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, "BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge," in Proc. ASRU, 2015.
 [20] N. Duong, E. Vincent, and R. Gribonval, “Underdetermined reverberant audio source separation using a fullrank spatial covariance model,” IEEE TASLP, vol. 18, no. 7, 2010.
 [21] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in Proc. ICASSP, 2016.

 [22] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition," IEEE/ACM TASLP, vol. 27, no. 5, 2019.
 [23] E. De Sena, N. Antonello, M. Moonen, and T. Van Waterschoot, "On the modeling of rectangular geometries in room acoustic simulations," IEEE/ACM TASLP, vol. 23, no. 4, 2015.
 [24] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for texttospeech,” in Proc. Interspeech, 2019.
 [25] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and humanlabeled dataset for audio events,” in Proc. ICASSP, 2017.
 [26] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–halfbaked or well done?” in Proc. ICASSP, 2019.