Alternating Between Spectral and Spatial Estimation for Speech Separation and Enhancement

11/18/2019 ∙ by Zhong-Qiu Wang, et al.

This work investigates alternation between spectral separation using masking-based networks and spatial separation using multichannel beamforming. In this framework, the spectral separation is performed using a mask-based deep network. The result of mask-based separation is used, in turn, to estimate a spatial beamformer. The output of the beamformer is fed back into another mask-based separation network. We explore multiple ways of computing time-varying covariance matrices to improve beamforming, including factorizing the spatial covariance into a time-varying amplitude component and time-invariant spatial component. For the subsequent mask-based filtering, we consider different modes, including masking the noisy input, masking the beamformer output, and a hybrid approach combining both. Our best method first uses spectral separation, then spatial beamforming, and finally a spectral post-filter, and demonstrates an average improvement of 2.8 dB over baseline mask-based separation, across four different reverberant speech enhancement and separation tasks.







1 Introduction

Audio source separation has many useful applications, for example as a frontend for robust speech recognition and to improve voice quality for telephony, augmented reality, and assistive hearing devices. Leveraging multiple microphones has great potential to improve separation quality, since the spatial relationship between microphones provides additional useful information beyond cues provided by spectral patterns of acoustic sources that are exploited by single-microphone approaches. Multimicrophone processing can also improve the rejection of reverberation and diffuse background noise.

Recently, a new paradigm has emerged as a promising alternative to conventional beamforming approaches: neural beamforming, where the key advance is to utilize the non-linear modeling power of DNNs to identify time-frequency (T-F) units dominated by each source for spatial covariance matrix computation [1, 2]. Unlike traditional approaches, neural beamforming methods have the potential to learn and adapt from massive training data, which improves their robustness to unknown position and orientation of microphones and sources, types of acoustic sources, and room geometry. An initial success of neural beamforming approaches was improving time-invariant beamforming using T-F domain mask prediction, where predicted masks were used to obtain time-invariant spatial covariance matrices for all sources. This has proven useful in ASR tasks such as CHiME-3/4 [3]. Recent studies considered online or low-latency beamforming [4, 5] and time-varying beamforming [6] for improved performance in certain scenarios. In addition, spatial features such as inter-channel phase differences (IPD) [7], cosine and sine IPDs [8] and target direction compensated IPDs [9], which can encode spatial information, are utilized as additional network input to improve the mask estimation in masking-based beamforming. Other cues, such as visual information [10], location-guided side information [11] and speaker embeddings [12, 13] can also be used as additional inputs for both target speaker selection and performance improvement in neural speech separation for both single- and multi-microphone setups.

In this paper, we explore alternating between spectral estimation using DNN-based masking and spatial separation using linear beamforming with a multichannel Wiener filter (MCWF). This is inspired by the single-channel iterative network of [14], which we use as a baseline. By doing so, the linear beamforming is effectively driven by DNN-based masking. For beamforming, we consider both time-invariant and time-varying ways of calculating covariance matrices to improve spatial separation. We also consider the spectral masking network in the second alternation as a post-filter, which takes the beamformed signal and the mixture as input to produce a refined spectral masking estimate. Evaluation results on four challenging sound separation tasks demonstrate the effectiveness of the proposed algorithms.

2 Methods

Assume a $P$-channel time-domain signal consisting of $C$ sources, $\mathbf{y} = \sum_{c=1}^{C} \mathbf{x}_c$, that has been recorded in a reverberant environment. The short-time Fourier transform (STFT) of this signal can be written as

$$\mathbf{Y}(t,f) = \sum_{c=1}^{C} \mathbf{X}_c(t,f),$$

where $\mathbf{Y}(t,f)$ and $\mathbf{X}_c(t,f)$ respectively represent the mixture and the $c$th source image at time $t$ and frequency $f$. Our study proposes multiple algorithms to recover the constituent reverberant sources from a reverberant mixture received by a reference microphone, with or without leveraging the spatial information contained in $\mathbf{Y}$. We assume an offline processing scenario, and that the sources are non-moving throughout each utterance.

Each spectral masking stage in the proposed system uses a time-domain convolutional network (TDCN) [15]. The first stage performs single-channel processing to estimate each source via T-F masking. The estimated sources are then used to compute statistics for time-invariant or time-varying beamforming. The second masking stage combines spectral and spatial information by taking in the mixture and beamformed results for post-filtering. See Figure 1 for an illustration of the overall system.

As shown in Figure 1, we train through multiple iSTFT/STFT projection layers. These layers address the phase inconsistency problem, a common issue of real-valued masking in the magnitude domain [16, 17]. By using different window sizes for the iSTFT/STFT pairs, beamforming can be performed using a larger window size and hence a higher frequency resolution than the masking network. This strategy is found to dramatically improve time-invariant MCWF in our experiments.
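To make the resolution change concrete, here is a minimal SciPy sketch of an iSTFT/STFT projection pair, using a 32 ms masking-domain window and a 128 ms beamforming window as in our experiments; the 75% overlap assumed for the beamforming STFT is a choice of this sketch, not a detail stated above.

```python
import numpy as np
from scipy.signal import stft, istft

def reproject(x, fs=16000, win_mask=0.032, win_bf=0.128):
    """iSTFT/STFT projection: re-analyze a masked time-domain signal with a
    longer STFT window for beamforming. The 32 ms / 128 ms windows follow the
    text; the 75%-overlap hops are an assumption of this sketch."""
    n1, n2 = int(fs * win_mask), int(fs * win_bf)
    # Masking-domain analysis/synthesis pair (a consistency projection)...
    _, _, Z1 = stft(x, fs=fs, nperseg=n1, noverlap=3 * n1 // 4)
    _, x1 = istft(Z1, fs=fs, nperseg=n1, noverlap=3 * n1 // 4)
    # ...then re-analysis with the longer beamforming window.
    _, _, Z2 = stft(x1, fs=fs, nperseg=n2, noverlap=3 * n2 // 4)
    return Z2
```

Because both window/hop pairs satisfy the COLA condition, each iSTFT/STFT pair is invertible up to edge padding, so the projection changes only the T-F resolution, not the signal.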

Figure 1: System overview.

2.1 Spectral Mask Estimation for Sound Separation

We use TDCN-based T-F masking for single-channel speech enhancement and speaker separation to produce estimates $\hat{X}_c = \hat{M}_c \odot Y_r$, where $\hat{M}_c$ is the mask network output [18] and $Y_r$ is the mixture STFT at the reference microphone. The loss function maximizes signal-to-noise ratio (SNR) in the time domain:

$$\mathcal{L} = -\sum_{c=1}^{C} 10\log_{10}\frac{\|x_c\|^2}{\|x_c - \hat{x}_{\pi^*(c)}\|^2},$$

where $\hat{x}_c = \mathrm{iSTFT}(\hat{M}_c \odot Y_r)$, $\odot$ denotes element-wise multiplication, $\mathcal{P}$ is the set of permutations over sources, and $\pi^* = \arg\max_{\pi\in\mathcal{P}}\sum_c \mathrm{SNR}(x_c, \hat{x}_{\pi(c)})$ gives the optimal permutation index $\pi^*(c)$ for source $c$. We use permutation invariance for speaker separation, but not for speech enhancement, where $\mathcal{P}$ contains only the identity permutation.
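The permutation-invariant negative-SNR loss can be sketched in NumPy with an exhaustive search over permutations, which is feasible for the small source counts considered here:

```python
import itertools
import numpy as np

def snr_db(est, ref, eps=1e-8):
    """SNR in dB between a time-domain estimate and reference."""
    return 10 * np.log10(np.sum(ref**2) / (np.sum((ref - est)**2) + eps) + eps)

def pit_neg_snr_loss(est, ref):
    """Permutation-invariant negative-SNR loss.

    est, ref: arrays of shape (num_sources, num_samples).
    Returns the loss under the best source permutation and that permutation.
    """
    C = ref.shape[0]
    best = None
    for perm in itertools.permutations(range(C)):
        total = sum(snr_db(est[p], ref[c]) for c, p in enumerate(perm))
        if best is None or total > best[0]:
            best = (total, perm)
    return -best[0], best[1]
```

For speech enhancement, the permutation search collapses to the identity assignment.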

2.2 Multichannel Wiener Filter for Spatial Separation

We perform beamforming in the frequency domain using an STFT whose window size is not necessarily the same as that of the mask network. We use the estimated signals from the first TDCN to compute the spatial covariance of each source for a time-invariant MCWF (TI-MCWF):

$$\hat{\mathbf{w}}_c(f) = \hat{\mathbf{\Phi}}_y(f)^{-1}\,\hat{\mathbf{\Phi}}_c(f)\,\mathbf{u},$$

where $\mathbf{u}$ is a one-hot vector with the reference microphone index set to one, the mixture covariance matrix is estimated as

$$\hat{\mathbf{\Phi}}_y(f) = \frac{1}{T}\sum_{t=1}^{T} \mathbf{Y}(t,f)\,\mathbf{Y}(t,f)^{\mathsf{H}},$$

and $\hat{\mathbf{\Phi}}_c(f)$ is the source covariance matrix estimated as

$$\hat{\mathbf{\Phi}}_c(f) = \frac{1}{T}\sum_{t=1}^{T} \hat{\mathbf{X}}_c(t,f)\,\hat{\mathbf{X}}_c(t,f)^{\mathsf{H}}.$$

This approach follows recent developments in neural beamforming [3, 19, 2]. The key idea is to use T-F units dominated by source $c$ to compute its covariance matrix for beamforming. Here $\hat{\mathbf{X}}_c$ is recomputed in an alternate STFT domain from the time-domain output signal for source $c$ from the masking network. For convenience, the mask is considered the same across microphones, which is a good approximation for compact arrays in far-field conditions. The beamforming result for source $c$ is computed as

$$\hat{X}^{\mathrm{BF}}_c(t,f) = \hat{\mathbf{w}}_c(f)^{\mathsf{H}}\,\mathbf{Y}(t,f).$$
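The TI-MCWF computation can be sketched per frequency as follows; the diagonal loading `eps` is a numerical-stability assumption of this sketch, not part of the formulation above.

```python
import numpy as np

def ti_mcwf(Y, X_hat, ref_mic=0, eps=1e-6):
    """Time-invariant multichannel Wiener filter.

    Y:     mixture STFT, shape (mics, frames, freqs), complex.
    X_hat: estimated source-image STFT from the masking stage, same shape.
    Returns the beamformed source estimate, shape (frames, freqs).
    """
    P, T, F = Y.shape
    out = np.zeros((T, F), dtype=complex)
    u = np.zeros(P)
    u[ref_mic] = 1.0  # one-hot reference-microphone selector
    for f in range(F):
        Yf = Y[:, :, f]                      # (P, T)
        Xf = X_hat[:, :, f]
        Phi_y = (Yf @ Yf.conj().T) / T       # mixture covariance
        Phi_c = (Xf @ Xf.conj().T) / T       # source covariance
        # w = Phi_y^{-1} Phi_c u, with small diagonal loading (an assumption)
        w = np.linalg.solve(Phi_y + eps * np.eye(P), Phi_c @ u)
        out[:, f] = w.conj() @ Yf            # w^H y per frame
    return out
```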
2.3 Time-Varying Beamforming for Spatial Estimation

A TI-MCWF has limited separation power, as it is only a time-invariant linear filter per frequency. Similar to recursive averaging, one straightforward way to improve the separation capability is to estimate the covariances $\hat{\mathbf{\Phi}}_y$ and $\hat{\mathbf{\Phi}}_c$ within a sliding window, i.e.,

$$\hat{\mathbf{\Phi}}_y(t,f) = \frac{1}{2D+1}\sum_{t'=t-D}^{t+D} \mathbf{Y}(t',f)\,\mathbf{Y}(t',f)^{\mathsf{H}}, \qquad \hat{\mathbf{\Phi}}_c(t,f) = \frac{1}{2D+1}\sum_{t'=t-D}^{t+D} \hat{\mathbf{X}}_c(t',f)\,\hat{\mathbf{X}}_c(t',f)^{\mathsf{H}},$$

where $D$ is half the window size in frames. The resulting time-varying filter $\hat{\mathbf{w}}_c(t,f)$ and its beamforming output are computed as in the time-invariant case, with the covariances replaced by their windowed counterparts.
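A sketch of the sliding-window covariance estimate, using cumulative sums of per-frame outer products so each (edge-truncated) window is averaged in constant time:

```python
import numpy as np

def sliding_covariances(Y, half_len):
    """Per-frame covariance estimates over a window of 2*half_len + 1 frames
    (truncated at utterance edges).

    Y: STFT at one frequency, shape (mics, frames), complex.
    Returns an array of shape (frames, mics, mics).
    """
    P, T = Y.shape
    # Per-frame outer products Y(t) Y(t)^H, stacked as (T, P, P).
    outer = np.einsum('pt,qt->tpq', Y, Y.conj())
    csum = np.cumsum(outer, axis=0)
    cov = np.empty_like(outer)
    for t in range(T):
        lo, hi = max(0, t - half_len), min(T - 1, t + half_len)
        seg = csum[hi] - (csum[lo - 1] if lo > 0 else 0)
        cov[t] = seg / (hi - lo + 1)
    return cov
```

The same routine serves for both $\hat{\mathbf{\Phi}}_y(t,f)$ (from the mixture) and $\hat{\mathbf{\Phi}}_c(t,f)$ (from the masked source estimates).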

Another way to compute a time-varying covariance matrix of each source is to factorize it as a product of a time-varying power spectral density (PSD) and a time-invariant coherence matrix [20, 21, 22]. The rationale is that for a non-moving source, its coherence matrix is time-invariant, assuming that the beamforming STFT window is long enough to capture most of the reverberation. Unlike conventional methods, which typically use maximum likelihood estimation or non-negative matrix factorization to estimate the PSD and spatial coherence, the proposed algorithm leverages the estimated source signals produced by the TDCN to compute these statistics. Mathematically,

$$\hat{\mathbf{\Phi}}_c(t,f) = \hat{\phi}_c(t,f)\,\hat{\mathbf{R}}_c(f),$$

where $\hat{\phi}_c(t,f) = |\hat{X}_{c,r}(t,f)|^2$ is the PSD estimate, $r$ is the index of the reference microphone, and $\hat{\mathbf{R}}_c(f) = \hat{\mathbf{\Phi}}_c(f)\,/\,[\hat{\mathbf{\Phi}}_c(f)]_{r,r}$ is the coherence matrix, in which $\hat{\mathbf{\Phi}}_c(f)$ can be computed either time-invariantly or within a sliding window to obtain time-invariant or time-varying coherence estimation. Dividing by $[\hat{\mathbf{\Phi}}_c(f)]_{r,r}$ normalizes the spatial component to have approximately unit diagonal, since in far-field conditions $[\hat{\mathbf{\Phi}}_c(f)]_{p,p} \approx [\hat{\mathbf{\Phi}}_c(f)]_{r,r}$ for every microphone $p$.

A time-varying factorized (TVF) MCWF is then computed as

$$\hat{\mathbf{w}}_c(t,f) = \hat{\mathbf{\Phi}}_y(t,f)^{-1}\,\hat{\mathbf{\Phi}}_c(t,f)\,\mathbf{u},$$

where the time-varying mixture covariance can be obtained as $\hat{\mathbf{\Phi}}_y(t,f) = \sum_{c'=1}^{C} \hat{\mathbf{\Phi}}_{c'}(t,f)$, assuming uncorrelated sources. The beamforming result for source $c$ is computed as

$$\hat{X}^{\mathrm{BF}}_c(t,f) = \hat{\mathbf{w}}_c(t,f)^{\mathsf{H}}\,\mathbf{Y}(t,f).$$
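The factorized covariance can be sketched as follows, with the coherence normalized by its reference-microphone diagonal entry as described above; the small `eps` guard is an assumption of this sketch.

```python
import numpy as np

def tvf_covariance(X_hat, ref_mic=0, eps=1e-8):
    """Factorized time-varying source covariance: a time-varying PSD at the
    reference microphone times a time-invariant spatial coherence matrix.

    X_hat: estimated source-image STFT at one frequency, shape (mics, frames).
    Returns Phi of shape (frames, mics, mics).
    """
    P, T = X_hat.shape
    Phi_ti = (X_hat @ X_hat.conj().T) / T                 # time-invariant covariance
    R = Phi_ti / (Phi_ti[ref_mic, ref_mic].real + eps)    # coherence, ref diagonal = 1
    psd = np.abs(X_hat[ref_mic]) ** 2                     # time-varying PSD estimate
    return psd[:, None, None] * R[None, :, :]
```

For an ideal point source the factorization is exact: the per-frame covariance is rank-1 and equals the PSD times the fixed coherence matrix.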
2.4 Spectral Estimation Revisited: Post-filtering

Given a beamformed estimate $\hat{X}^{\mathrm{BF}}_c$, we extract its magnitude and combine it with the mixture magnitude as input to a second network that estimates a second, post-filtering mask $\hat{M}^{\mathrm{PF}}_c$ for each source. The magnitude of the beamformed mixture can be considered a directional feature that guides the network to attend to a particular direction [9]. We explore the following three ways of applying this post-filtering mask:

$$\text{BF:}\quad \hat{X}_c(t,f) = \hat{M}^{\mathrm{PF}}_c(t,f)\,\hat{X}^{\mathrm{BF}}_c(t,f), \qquad (13)$$
$$\text{Noisy:}\quad \hat{X}_c(t,f) = \hat{M}^{\mathrm{PF}}_c(t,f)\,Y_r(t,f), \qquad (14)$$
$$\text{Hybrid:}\quad \hat{X}_c(t,f) = \hat{M}^{\mathrm{PF}}_c(t,f)\,|Y_r(t,f)|\,e^{j\angle \hat{X}^{\mathrm{BF}}_c(t,f)}. \qquad (15)$$

The first method (13), denoted as BF, applies a mask to the beamformed mixture, where the phase produced by beamforming is considered as the phase estimate. The second (14), denoted as Noisy, applies the mask directly to the mixture, employing the mixture phase for signal re-synthesis. The third one (15), denoted as Hybrid, applies the mask to the mixture magnitude and takes the phase produced by beamforming as the phase estimate. The loss function for post-filtering models is defined as
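The three masking modes can be sketched as below; the function and argument names are illustrative, not part of the system description.

```python
import numpy as np

def apply_postfilter_mask(mask, Y_ref, X_bf, mode):
    """Combine a post-filter mask with mixture / beamformer magnitude and phase.

    mask:  real-valued mask from the second network, shape (frames, freqs).
    Y_ref: mixture STFT at the reference microphone, complex, same shape.
    X_bf:  beamformer output STFT, complex, same shape.
    mode:  'bf' masks the beamformed signal (beamformer phase),
           'noisy' masks the mixture (mixture phase),
           'hybrid' masks the mixture magnitude with the beamformer phase.
    """
    if mode == 'bf':
        return mask * X_bf
    if mode == 'noisy':
        return mask * Y_ref
    if mode == 'hybrid':
        return mask * np.abs(Y_ref) * np.exp(1j * np.angle(X_bf))
    raise ValueError(mode)
```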


The loss function for post-filtering models is the same negative-SNR loss as in the first spectral stage, applied to the time-domain re-synthesis of the estimates in (13)-(15). Additional subsequent spatial and spectral iterations could be performed, but here we stop at the second spectral mask estimator.

3 Data and Tasks

We use simulated room impulse responses (RIRs) generated using an image-method room simulator with frequency-dependent wall filters. During simulation, all source image locations are randomly perturbed by up to 8 cm in each direction to avoid the "sweeping echo" effect [23]. For each example, the RIRs are created by sampling a random position of a cube-shaped microphone array within a room of random size: width from 3 to 7 m, length from 4 to 8 m, and height from 2.13 to 3.05 m. These RIRs are used to generate mixtures of sounds. Speech is from the LibriTTS database [24]. For the non-speech sounds, we filtered out artificial sounds (such as synthesizer noises) using user-annotated tags, and used a sound classification network trained on AudioSet [25] to avoid clips with a high probability of speech. The training set consists of about 366 hours, and the validation and test sets consist of about 10 hours each. These datasets will be publicly released at final publication time.

We validate the proposed algorithms on one-, two- and eight-microphone setups. The eight microphones at the eight corners of the cube are used for separation in the eight-channel setup, the two microphones at the two ends of a side are used for the two-channel setup, and the first microphone on the side used in the two-channel setup is used for the single-channel case. The microphone used in the single-channel setup is considered as the reference microphone for two- and eight-channel separation.

We evaluate the proposed algorithms on four sound separation corpora. These include a two-speaker separation dataset used in [8], which is constructed using WSJ0-2mix and a room simulator with random room configurations and microphone positions, and three datasets mixed by us for two-speaker separation, three-speaker separation, and speech enhancement. For the speech enhancement task, a speech source is mixed with three directional noise sources, and the goal is to separate the speech from the noise. For each task, a random speech clip is selected from the clean source data, and each of the other sources is then scaled to a randomly drawn SNR (in dB) relative to that initial speech clip.

The network architecture of the two TDCNs used for mask estimation is similar to the recently proposed Conv-TasNet [15]. It consists of 4 repeats of 8 convolutional blocks. Each block consists of a dilated separable convolution with global layer normalization and a residual connection, where the dilation factor doubles with each block within a repeat (1, 2, 4, ..., 128), as in Conv-TasNet. In contrast to Conv-TasNet, we utilize an STFT basis with 32 ms windows rather than a learned basis with a very small window size, as initial results showed that the former leads to better performance. This is likely because an STFT with a larger window can better deal with room reverberation. The hop size is 8 ms and the sampling rate is 16 kHz. A 512-point FFT is used to extract 257-dimensional magnitude features for mask estimation. Scale-invariant signal-to-noise ratio improvement (SI-SNRi) [26] over unprocessed speech is used as the evaluation metric.
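SI-SNR can be computed by projecting the (mean-removed) estimate onto the reference and measuring the residual energy; SI-SNRi is then the SI-SNR of the processed signal minus that of the unprocessed mixture. A minimal sketch:

```python
import numpy as np

def si_snr_db(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between time-domain signals of equal length."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to get the target component.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) /
                         (np.dot(e_noise, e_noise) + eps) + eps)
```

Because of the projection, rescaling the estimate leaves the metric unchanged, which prevents trivial gain changes from inflating scores.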

As a single-channel baseline, we consider an iterative masking network [14] where no spatial information is used. This network consists of two masking networks. The separated time-domain outputs of the first network are concatenated with the time-domain mixture signal, then fed to the second masking stage to produce the final separated estimates. This model is trained with negative SNR loss on the separated waveforms of both stages.

4 Results

Figures 2, 3, and 4 show the performance of various methods under different conditions, using either TI or TVF covariance estimation, with either 2 or 8 microphones, for each of the four tasks.

Figure 2: SI-SNRi vs. sliding window size.
Figure 3: SI-SNRi vs. BF STFT window size.
Figure 4: SI-SNRi vs. post-filtering method.
| Experiment | Method | Mics | Block (s) | Window (ms) | Speech Enh. (Val / Test) | 2-Spk Sep. (Val / Test) | 3-Spk Sep. (Val / Test) | WSJ0 2-Spk (Val / Test) |
|---|---|---|---|---|---|---|---|---|
| Single channel | Mask Network | 1 | - | - | 15.8 / 15.1 | 16.7 / 15.6 | 13.0 / 12.3 | 6.5 / 6.1 |
| Sliding block | TI | 2 | 0.8 | 32 | 13.9 / 13.3 | 13.8 / 12.9 | 10.5 / 10.1 | 6.6 / 6.3 |
| | TI | 8 | 0.8 | 32 | 15.5 / 14.9 | 13.4 / 12.6 | 10.9 / 10.5 | 8.9 / 8.7 |
| | TVF | 2 | full | 32 | 15.4 / 14.8 | 14.6 / 13.6 | 10.5 / 10.2 | 9.3 / 9.0 |
| | TVF | 8 | full | 32 | 16.2 / 15.5 | 14.3 / 13.5 | 10.9 / 10.5 | 9.4 / 9.2 |
| Window sizes | TI | 2 | full | 128 | 12.1 / 11.5 | 7.6 / 7.3 | 7.7 / 7.5 | 7.2 / 6.9 |
| | TI | 8 | full | 128 | 16.3 / 15.7 | 12.8 / 12.4 | 11.6 / 11.3 | 10.1 / 9.8 |
| | TVF | 2 | full | 32 | 15.4 / 14.8 | 14.6 / 13.6 | 10.5 / 10.2 | 9.3 / 9.0 |
| | TVF | 8 | full | 32 | 16.2 / 15.5 | 14.3 / 13.5 | 10.9 / 10.5 | 9.4 / 9.2 |
| Post-filtering | TI + PF Noisy | 2 | full | 128 | 16.6 / 15.9 | 17.4 / 16.4 | 14.4 / 13.9 | 10.3 / 9.9 |
| | TI + PF Noisy | 8 | full | 128 | 17.4 / 16.7 | 19.2 / 18.2 | 15.4 / 14.8 | 10.7 / 10.4 |
| | TVF + PF Hybrid | 2 | full | 32 | 15.9 / 15.2 | 17.2 / 16.2 | 13.3 / 12.7 | 9.0 / 8.5 |
| | TVF + PF Noisy | 8 | full | 32 | 16.1 / 15.4 | 16.9 / 15.7 | 12.7 / 12.1 | 9.3 / 9.0 |
| Oracle | Oracle Mask | 1 | - | - | 18.5 / 17.9 | 23.0 / 22.1 | 21.2 / 20.7 | 12.6 / 12.4 |
| | Oracle Mask + TI | 2 | full | 128 | 12.7 / 12.2 | 10.4 / 10.2 | 11.1 / 10.9 | 8.6 / 8.4 |
| | Oracle Mask + TI | 8 | full | 128 | 18.2 / 17.6 | 18.5 / 18.2 | 19.4 / 19.1 | 12.9 / 12.7 |
| | Oracle Mask + TVF | 2 | full | 64 | 18.0 / 17.5 | 21.6 / 20.8 | 20.7 / 20.2 | 12.5 / 12.3 |
| | Oracle Mask + TVF | 8 | full | 64 | 18.9 / 18.3 | 22.0 / 21.3 | 21.7 / 21.2 | 13.3 / 13.1 |

Table 1: SI-SNRi (dB) results of different beamforming methods, using time-invariant (TI) versus time-varying factorized (TVF) covariances, for different numbers of mics, sliding-window block sizes ("full" indicates the whole utterance), beamforming window sizes, and the four tasks. Block and window sizes were optimized on the validation set while holding the other conditions constant.

Table 1 summarizes the best results in terms of SI-SNRi on the validation and test sets for all conditions. For each set of experiments, we choose the beamforming parameter that obtains the best average performance on the validation data. In the sliding-window experiment, we choose the sliding window size, from those shown in Figure 2, that has the best validation performance for each combination of method (TI or TVF) and number of microphones (2 or 8), with the beamforming window size fixed at 32 ms, and report its validation and test performance on all tasks. In the beamforming-window-size experiment, we hold the block size fixed to the full condition and optimize over 32 ms, 64 ms, and 128 ms beamforming window sizes. The TI condition had an optimal window size of 128 ms, whereas the TVF method had an optimal window size of 32 ms. In the post-filtering experiment, we optimize over the three masking conditions, BF, Noisy, and Hybrid, with the beamforming window sizes held at the optimum determined from the window-size experiment. The optimal condition was to mask the noisy signal for all but the TVF 2-mic condition, in which hybrid masking worked best. We also report the performance of oracle methods with optimized window sizes. These include oracle binary masking on the reference microphone, as well as our TI and TVF beamforming methods where the source estimate is given by an oracle binary mask on the reference microphone. Overall, the best non-oracle approach across all four tasks is post-filtering with TI beamforming using a 128 ms window and masking on the mixture reference microphone.

The figures show the complete results of all experiments that are summarized in the table. Figure 2 plots the performance of MCWF beamforming for varying lengths of sliding window. As the sliding window length increases, TI methods degrade in performance, likely because they are less capable of modeling spatial dynamics. In contrast, the performance of TVF methods is fairly consistent, indicating that they are better at modeling the dynamic spatial statistics of different sources. The single-channel model (denoted as Mask Net) is quite competitive with all the beamforming approaches on all the tasks except the WSJ0 reverberant 2-speaker separation task. This is possibly due to the significant overlap of speech signals in the WSJ0 dataset, which is substantially more than the overlap present in our own mixed datasets. Generally the single-channel baseline outperforms the spatial approaches because the beamformer is limited to linear transformations of the mixture, while nonlinear single-channel masking has more flexibility.

Figure 3 shows the performance of MCWF beamforming versus the length of the beamforming STFT window, for 32, 64, and 128 ms. We observe that longer STFT windows generally improve performance, especially for TI methods. Again, the single-channel baseline is quite competitive on all tasks except WSJ0 2-speaker separation. We find that TVF generally achieves the best results, regardless of the number of microphones. In particular, TVF substantially boosts performance over TI in the two-microphone conditions.

Figure 4 reports the results of our post-filtering models for the different masking methods described in (13)-(15). Clearly, using post-filtering dramatically outperforms the single-channel iterative masking baseline, suggesting the effectiveness of multimicrophone processing. In contrast to our beamforming results, using TI beamforming before the post-filter is better than TVF beamforming.

5 Conclusions

We have explored an alternating strategy between spectral estimation using a mask-based network and spatial estimation using beamformers. For the spatial estimation, we compared multiple ways of computing covariance matrices for time-invariant and time-varying beamforming. For the final spectral estimation, we investigated a deep network post-filtering approach with various masking methods. The proposed methods were evaluated on four sound separation tasks. Experimental results suggest that when combined with neural network based post-filtering, time-invariant beamforming with a reasonably large window size yields the best separation performance for non-moving sources, although time-varying beamforming shows clear improvements over time-invariant beamforming when post-filtering is not performed. Future research will investigate further iterations of alternation between spatial and spectral estimation, as well as more complex tasks with moving sources.


  • [1] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. ICASSP, 2016.
  • [2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. Interspeech, 2016.
  • [3] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes,” Computer Speech and Language, vol. 46, 2017.
  • [4] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, “Exploring practical aspects of neural mask-based beamforming for far-field speech recognition,” in Proc. ICASSP, 2018.
  • [5] T. Yoshioka, Z. Chen, C. Liu, X. Xiao, H. Erdogan, and D. Dimitriadis, “Low-latency speaker-independent continuous speech separation,” in Proc. ICASSP, 2019.
  • [6] Y. Kubo, T. Nakatani, M. Delcroix, K. Kinoshita, and S. Araki, “Mask-based MVDR beamformer for noisy multisource environments: Introduction of time-varying spatial covariance model,” in Proc. ICASSP, 2019.
  • [7] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in Proc. ICASSP, 2018.
  • [8] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in Proc. ICASSP, 2018.
  • [9] Z.-Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM TASLP, vol. 27, no. 2, 2018.
  • [10] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” ACM Transactions on Graphics, 2018.
  • [11] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multi-channel overlapped speech recognition with location guided speech extraction network,” in Proc. SLT, 2018.
  • [12] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. ICASSP, 2018.
  • [13] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. Interspeech, 2019.
  • [14] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. L. Roux, and J. R. Hershey, “Universal sound separation,” Proc. WASPAA, 2019.
  • [15] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM TASLP, vol. 27, no. 8, 2019.
  • [16] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in Proc. ICASSP, 2019.
  • [17] Z.-Q. Wang, K. Tan, and D. Wang, “Deep learning based phase reconstruction for speaker separation: A trigonometric perspective,” in Proc. ICASSP, 2019.
  • [18] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM TASLP, vol. 26, Aug. 2018.
  • [19] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proc. ASRU, 2015.
  • [20] N. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE TASLP, vol. 18, no. 7, 2010.
  • [21] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” in Proc. ICASSP, 2016.
  • [22] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition,” IEEE/ACM TASLP, vol. 27, no. 5, 2019.
  • [23] E. De Sena, N. Antonello, M. Moonen, and T. Van Waterschoot, “On the modeling of rectangular geometries in room acoustic simulations,” IEEE/ACM TASLP, vol. 23, no. 4, 2015.
  • [24] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” in Proc. Interspeech, 2019.
  • [25] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017.
  • [26] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?” in Proc. ICASSP, 2019.