1 Introduction
Speech separation refers to recovering the voice of each speaker from an overlapped speech mixture. It is also known as the cocktail party problem [1], which has been studied in the signal processing literature for decades. Leveraging the power of deep learning, many methods have been proposed for multi-channel speech separation (MCSS), including time-frequency (TF) masking [2, 3, 4, 5], integration of TF masking and beamforming [6, 7], and end-to-end approaches [8]. TF masking based methods formulate speech separation as a supervised learning task in the frequency domain. The network learns to estimate a TF mask for each speaker based on the magnitude spectrogram and interaural differences calculated from the complex spectrograms of the observed multi-channel mixture signals, such as the phase difference between two microphone channels, known as the interaural phase difference (IPD).
However, one limitation of TF masking based methods is the phase reconstruction problem. To avoid the complex phase estimation, time-domain speech separation has attracted increasing attention recently. A single-channel time-domain state-of-the-art approach, referred to as SC-ConvTasNet [9], replaces the short-time Fourier transform (STFT) and inverse STFT with an encoder-decoder structure. Under supervision from the clean waveforms of the speakers, SC-ConvTasNet's encoder learns an audio representation optimized for speech separation. However, the performance of SC-ConvTasNet is still limited in far-field scenarios due to the smearing effects brought by reverberation. To tackle this problem, in
[8], we proposed a new MCSS solution in which handcrafted IPD features are used to provide spatial characteristic difference information between directional sources. With the aid of the additional spatial cues, improved performance has been observed. However, the IPDs are computed in the frequency domain with fixed complex filters (i.e., the STFT), while the encoder output is learned in a data-driven manner. This causes a data mismatch, which suggests that IPDs may not be the optimal spatial features to incorporate into an end-to-end MCSS framework.
Bearing the above discussion in mind, this work aims to design an end-to-end MCSS model endowed with the capability to learn effective spatial cues using a speech separation objective function in a purely data-driven fashion. As illustrated in Figure 1, inspired by the success of SC-ConvTasNet and [8]
, the main body of our proposed MCSS model adopts an encoder-decoder structure. In this design, time-domain filters spanning all signal channels are trained to perform spatial filtering for the multi-channel setting. These filters are implemented by a 2-d convolution (conv2d) layer to extract spatial features. Furthermore, inspired by the formulation of the IPD, a novel conv2d kernel is designed to compute inter-channel convolution differences (ICDs). Note that the ICDs are learned in a data-driven manner and are expected to provide spatial cues that help to distinguish the directional sources without introducing the data mismatch issue of handcrafted spatial features. Finally, the end-to-end MCSS model is trained with the scale-invariant signal-to-distortion ratio (SI-SDR) loss function. Performance evaluation is conducted on a simulated spatialized WSJ0-2mix dataset. Experimental results demonstrate that our proposed ICD based MCSS model outperforms the IPD based MCSS model by 10.4% in terms of SI-SDRi.
2 Proposed Architecture
2.1 Multichannel speech separation
The baseline MCSS system [8] adopts an encoder-decoder structure, where the data-driven encoder and decoder respectively replace the STFT and iSTFT operations of existing speech separation pipelines, as shown in Figure 1. First, the encoder transforms each frame of the first (reference) channel's mixture waveform into a mixture encode in a real-valued feature space. Specifically, the learned encoder consists of a set of basis functions, as illustrated in Figure 2 (a). Most learned filters are tuned to lower frequencies, a property shared with mel filter banks [10] and the frequency distribution of the human auditory system [11]. Second, the IPDs computed by STFT and the mixture encode are concatenated along the feature dimension and fed into the separation module. The separation module learns to estimate a mask in the encoder output domain for each speaker, sharing the same concept as TF masking based methods. Finally, the decoder reconstructs the separated speech waveform from the masked mixture encode for each speaker. To optimize the network end-to-end, the scale-invariant signal-to-distortion ratio (SI-SDR) [12] is utilized as the training objective:
\[ \text{SI-SDR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2}, \qquad \alpha = \frac{\hat{s}^{\mathsf{T}} s}{\lVert s \rVert^2} \tag{1} \]

where $s$ and $\hat{s}$ are the reverberant clean and estimated source waveforms, respectively, and $\alpha$ is the optimal scaling factor. Zero-mean normalization is applied to $s$ and $\hat{s}$ to guarantee scale invariance.
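As a concrete illustration, the SI-SDR objective of Eq. 1 can be sketched in plain Python (a minimal sketch with our own variable names, not the authors' implementation):

```python
import math

def si_sdr(estimate, source):
    """SI-SDR in dB (Eq. 1), a minimal pure-Python sketch.

    Both signals are zero-mean normalized to guarantee scale invariance,
    then the estimate is projected onto the clean source."""
    n = len(source)
    s = [x - sum(source) / n for x in source]
    e = [x - sum(estimate) / n for x in estimate]
    # optimal scaling factor alpha = <e, s> / ||s||^2
    alpha = sum(a * b for a, b in zip(e, s)) / sum(a * a for a in s)
    target = [alpha * a for a in s]          # scaled reference ("target")
    noise = [a - b for a, b in zip(e, target)]  # residual ("noise")
    return 10 * math.log10(
        sum(a * a for a in target) / sum(a * a for a in noise))
```

Because of the projection, rescaling the estimate leaves the score unchanged, which is exactly the property that makes SI-SDR robust as a waveform-level training objective.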
However, the combination of IPDs and the encoder output may cause a data mismatch. Different from the encoder, which is learned in a data-driven way, the IPD is calculated with fixed complex filters (i.e., the STFT), whose center frequencies are evenly distributed, as illustrated in Figure 2 (b). Also, as [9] points out, the STFT is a generic transformation for signal analysis that is not necessarily optimal for speech separation.
2.2 Spatial feature learning
To perform spatial feature learning jointly with the rest of the network, we propose to learn spatial features directly from the multi-channel waveforms with an integrated architecture. The main idea is to learn time-domain filters spanning all signal channels to perform adaptive spatial filtering [13, 14, 15]. These filter parameters are jointly optimized with the encoder using Eq. 1 in a purely data-driven fashion.
Denote these filters as $\mathbf{h} = \{h_n\}_{n=1}^{N}$, where $h_n \in \mathbb{R}^{C \times L}$ is a set of filters spanning the $C$ signal channels with a window size of $L$. Then, the multi-channel features are computed by summing up the convolution products between the $c$-th channel mixture signal $y_c$ and the filter $h_{n,c}$ along the signal channel dimension, named the multi-channel convolution sum (MCS):

\[ \text{MCS}_n = \sum_{c=1}^{C} y_c * h_{n,c} \tag{2} \]
where $*$ denotes the convolution operation. The design principle behind Eq. 2 is similar to that of a delay-and-sum beamformer, where the signals arriving at each microphone are summed up with certain time delays to emphasize sound from a particular direction. Each set of filters is expected to steer at a different direction; therefore, different spatial views of the multi-channel mixture signals can be obtained by the MCS, enhancing the separation accuracy.
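To make the operation concrete, here is a minimal pure-Python sketch of the multi-channel convolution sum of Eq. 2 (framing and names are our own; a real model implements this as a strided conv2d layer and learns the taps):

```python
def mcs(mixture, filters, hop):
    """Multi-channel convolution sum (Eq. 2), a minimal sketch.

    mixture: C lists of samples, one per microphone channel
    filters: N filter sets, each a C x L list of time-domain taps
    Returns N feature sequences, one value per hop (as in conv layers,
    this is cross-correlation rather than flipped convolution)."""
    C, L = len(mixture), len(filters[0][0])
    T = len(mixture[0])
    out = []
    for h in filters:                       # one learned "look direction" each
        frames = []
        for t in range(0, T - L + 1, hop):  # slide the window by the hop size
            # sum the per-channel inner products over all C channels
            frames.append(sum(
                mixture[c][t + i] * h[c][i]
                for c in range(C) for i in range(L)))
        out.append(frames)
    return out
```

With taps that realize per-channel delays, each filter set behaves like one fixed steering direction of a delay-and-sum beamformer; here the taps are free parameters learned from data.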
To implement these learnable filters within the network, we employ a 2-d convolution (conv2d) layer. The generation of the MCS with conv2d is illustrated in Figure 3. The kernel size is $C \times L$ (height $\times$ width) and there are $N$ convolution channels in total. The conv2d layer's stride along the width axis represents the hop size and is fixed as $L/2$ in our experiments.
Furthermore, inspired by the formulation of interaural differences (e.g., IPDs), we design a special conv2d kernel to extract inter-channel convolution differences (ICDs). The IPD is a well-established frequency-domain feature widely used in spatial clustering algorithms and recent deep learning based MCSS methods. The rationale is that the IPDs of TF bins dominated by the same source naturally form a cluster within each frequency band, since their time delays are approximately the same. The standard IPD is computed as the phase difference between channels of the complex spectrogram, $\text{IPD}^{(m)} = \angle Y_{m_1} - \angle Y_{m_2}$, where $Y$ is the multi-channel complex spectrogram computed by the STFT of the multi-channel waveform $y$, and $m_1$, $m_2$ are the two microphone indexes of the $m$-th microphone pair.
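For reference, the standard IPD of one windowed frame of a microphone pair can be sketched as follows (a naive DFT in pure Python for clarity; a real pipeline would use an FFT over all frames):

```python
import cmath

def ipd(frame_a, frame_b):
    """Per-bin IPD for one frame of a microphone pair: a sketch.

    Takes one windowed time-domain frame per microphone, computes the
    DFT up to the Nyquist bin, and returns the phase difference
    angle(Y_a[f]) - angle(Y_b[f]) for each frequency bin f."""
    n = len(frame_a)
    def dft(x):
        return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n)) for f in range(n // 2 + 1)]
    return [cmath.phase(a) - cmath.phase(b)
            for a, b in zip(dft(frame_a), dft(frame_b))]
```

For a pure tone delayed by one sample between the two microphones, the IPD at the tone's bin equals $2\pi f \tau$, which is why TF bins sharing the same source delay cluster together.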
Following this concept, the $n$-th ICD between the $m$-th pair of signal channels can be computed by:

\[ \text{ICD}_n^{(m)} = (w_{m_1} \odot k_n) * y_{m_1} + (w_{m_2} \odot k_n) * y_{m_2} \tag{3} \]
where $k_n$ is a filter shared between the signal channels to ensure an identical mapping, and $w$ is a window function designed to smooth the ICD and prevent potential spectral leakage. When $w_{m_1}$ is fixed as all ones and $w_{m_2}$ as all negative ones, Eq. 3 calculates the exact inter-channel convolution difference of the $m$-th microphone pair.
Figure 4 illustrates our designed conv2d kernel and the generation of different pairs of ICDs. The conv2d kernel height is set to 2 to span a microphone pair. Note that different configurations of dilation and stride along the kernel height axis extract ICDs from different pairs of signal channels $(m_1, m_2)$. For example, for a 6-channel signal, setting the dilation to 3 and the stride to 1, we obtain the three pairs of channels (1, 4), (2, 5) and (3, 6).
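As a hedged sketch (our own naming, not the authors' code), the ICD of Eq. 3 for one microphone pair, and the channel pairs spanned by a height-2 kernel under a given dilation and stride, could look like:

```python
def icd(pair, kernel, windows, hop):
    """Eq. 3 for one microphone pair, a minimal sketch: each channel is
    filtered by the shared kernel k_n, elementwise-windowed by w_m1 / w_m2,
    and the two results are summed per frame."""
    L = len(kernel)
    w1, w2 = windows
    return [
        sum(pair[0][t + i] * w1[i] * kernel[i] for i in range(L)) +
        sum(pair[1][t + i] * w2[i] * kernel[i] for i in range(L))
        for t in range(0, len(pair[0]) - L + 1, hop)]

def pairs_from(C, dilation, stride):
    """1-based channel pairs covered by a kernel of height 2 moving along
    the channel axis with the given dilation and stride."""
    return [(c + 1, c + 1 + dilation) for c in range(0, C - dilation, stride)]
```

With `windows = (all +1, all -1)` the first function reduces to filtering the plain channel difference, matching the special case discussed above; making the windows learnable generalizes it.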
To shed light on the properties of the learned filters $k_n$, we visualize them in Figure 5. It can be observed that these learned filters show frequency tuning characteristics similar to those of the encoder (Figure 2 (a)). This suggests that the learned ICDs may be more consistent with the encoder output and enable more efficient feature incorporation.
3 Experiments and Result Analysis
3.1 Dataset
We simulated a spatialized reverberant dataset derived from the Wall Street Journal 0 (WSJ0) 2mix corpus, an open and well-studied dataset used for speech separation [16, 17, 9, 18]. There are 20,000, 5,000 and 3,000 multi-channel, reverberant, two-speaker mixtures in the training, development and test sets, respectively. All data is sampled at 16 kHz. Performance evaluation is done entirely on the test set, whose speakers are all unseen during training. In this study, we use a 6-microphone circular array of 7 cm diameter, with the speakers and the microphone array randomly located in the room. The two speakers and the microphone array are on the same plane, and all of them are at least 0.3 m away from the walls. The image method [19] is employed to simulate RIRs randomly from 3,000 different room configurations with sizes (length × width × height) ranging from 3 m × 3 m × 2.5 m to 8 m × 10 m × 6 m. The reverberation time T60 is sampled in a range of 0.05 s to 0.5 s. Samples with an angle difference between the two simultaneous speakers of 0–15°, 15–45°, 45–90° and 90–180° account for 16%, 29%, 26% and 29%, respectively.
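The room sampling described above can be sketched as follows (parameter names are ours; the actual RIR generation would come from an image-method simulator [19], which this sketch does not implement):

```python
import random

def sample_room():
    """Draw one room configuration from the stated ranges:
    size 3x3x2.5 m up to 8x10x6 m, reverberation time 0.05-0.5 s."""
    length = random.uniform(3.0, 8.0)
    width = random.uniform(3.0, 10.0)
    height = random.uniform(2.5, 6.0)
    t60 = random.uniform(0.05, 0.5)   # reverberation time in seconds
    return length, width, height, t60
```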
Table 1: SI-SDRi (dB) under different angle differences between speakers, average SI-SDRi (dB), and SDRi (dB) for different conv2d setups.

| Setup | window | # filters | <15° | 15°–45° | 45°–90° | >90° | Ave. SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|---|---|
| Single-channel ConvTasNet | – | – | 8.5 | 9.0 | 9.1 | 9.3 | 9.1 | 9.4 |
| +MCS (conv2d (6×40)) | – | 256 | 5.7 | 10.3 | 11.9 | 12.9 | 10.8 | 11.2 |
| +ICD (conv2d (2×40)) | fix ±1 | 256 | 5.5 | 10.9 | 12.3 | 12.9 | 11.0 | 11.4 |
| +ICD (conv2d (2×40)) | init. ±1 | 256 | 6.2 | 11.2 | 12.6 | 13.2 | 11.4 | 11.8 |
| +ICD (conv2d (2×40)) | init. randomly | 33 | 8.2 | 8.1 | 9.0 | 9.1 | 8.9 | 9.2 |
| +ICD (conv2d (2×40)) | fix ±1 | 33 | 6.9 | 11.1 | 12.3 | 12.9 | 11.3 | 11.7 |
| +ICD (conv2d (2×40)) | init. ±1 | 33 | 6.7 | 11.7 | 13.1 | 13.9 | 11.9 | 12.3 |
3.2 Network and Training details
All hyperparameters are the same as in the best setup of ConvTasNet version 2 in [20], except that the filter length $L$ is set to 40 samples and the encoder stride to 20. Batch normalization (BN) is used in all experiments to speed up training.
The microphone pairs for extracting IPDs and ICDs are (1, 4), (2, 5), (3, 6), (1, 2), (3, 4) and (5, 6) in all experiments. These pairs are selected because the distance between the microphones of each pair is either the largest or the smallest. In this case, there are two setups of dilation and stride for the conv2d layer: (dilation 3, stride 1) and (dilation 1, stride 2), respectively. The first channel of the mixture waveform is used as the reference channel for the encoder input. To match the encoder output's time steps, both IPDs and ICDs are extracted with a 2.5 ms (40-point) window length and a 1.25 ms (20-point) hop size, with 64 FFT points for the IPDs. SI-SDR (Eq. 1) is utilized as the training objective. Training uses chunks of 4.0 seconds duration, and the batch size is set to 32. Permutation invariant training [17] is adopted to tackle the label permutation problem.
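The window/hop choice keeps the IPD/ICD frame count aligned with the encoder output. Under a no-padding assumption (our simplification), the frame count of any strided window is:

```python
def n_frames(T, win, hop):
    """Number of frames a strided window of length `win` and hop `hop`
    produces over T samples, assuming no padding; identical (win, hop)
    pairs therefore yield identical time steps."""
    return (T - win) // hop + 1
```

For a 4 s chunk at 16 kHz, both the encoder (window 40, stride 20) and the IPD/ICD extraction (40-point window, 20-point hop) produce `n_frames(64000, 40, 20)` frames, so the features can be concatenated frame by frame.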
3.3 Result Analysis
Following the common speech separation metrics [12, 21], we adopt the average SI-SDR and SDR improvements over the mixture (SI-SDRi, SDRi) as the evaluation metrics. We also report the performance under different ranges of angle difference between the speakers to give a more comprehensive assessment of the model.
Different configurations for the conv2d layer. We explore different conv2d configurations for computing the ICD, including different numbers of filters and different initialization methods for the window function ($w$ in Section 2.2). The numbers of filters are chosen as 256 and 33: 256 matches the number of basis functions of the encoder, while 33 is the number of frequency bins of a 64-point FFT, 64 being the closest power of 2 to the 40-point frame length. The results are listed in Table 1. SC-ConvTasNet serves as the baseline system, achieving 9.1 dB SI-SDRi on the far-field dataset. By learning spatial filters, the MCS based model outperforms the baseline by 1.7 dB SI-SDRi. For the ICD setups, we found that the performance with 33 filters is relatively superior to that with 256 filters. One possible reason is that, according to the sampling theorem, the frequency resolution achievable with a sampling rate of 16 kHz and a frame length of 40 samples is limited.
Furthermore, the value of $w$ contributes significantly to the separation performance (9.2 dB vs. 12.3 dB SDRi for the model with 33 filters). If $w$ is randomly initialized, in other words, if there is no explicit subtraction operation between the signal channels, the model is not able to automatically learn useful spatial cues. If $w$ is initialized and fixed as $\pm 1$ (fix ±1), the exact convolution difference between the signal channels is computed as the ICD. Relaxing $w$ to be learnable (init. ±1) produces a much better result, which demonstrates the validity of the ICD formulation.
Table 2: SI-SDRi (dB) of IPD based and ICD based MCSS models.

| Features | <15° | ≥15° | Ave. |
|---|---|---|---|
| cosIPD, sinIPD | 7.7 | 12.2 | 11.5 |
| cosIPD, sinIPD (trainable kernel) [8] | 7.9 | 12.3 | 11.6 |
| ICD | 6.7 | 12.9 | 11.9 |
| ICD, cosIPD, sinIPD | 8.1 | 13.2 | 12.4 |
IPD versus ICD. We examine the performance of the IPD versus the proposed ICD for MCSS and report the results in Table 2. In addition, the performance of the IPD with trainable kernel based MCSS model [8] is listed for comparison. Specifically, in [8], the standard STFT operation is reformulated as a time-domain convolution with a trainable kernel, which is optimized for the speech separation task. Combining cosIPD and sinIPD obtains an SI-SDRi of 11.5 dB, a 2.4 dB gain over the single-channel baseline. This suggests that IPDs provide beneficial spatial information about the sources. With the trainable kernel, the performance improves slightly. The proposed ICD based separation model obtains a 0.4 dB improvement over the cosIPD+sinIPD based model, benefiting from the data-driven learning fashion. Note that the performance under 15° for the ICD based model is worse than that of the IPD based model. One possible reason is that the portion of data under 15° is relatively small, which causes difficulty in learning effective ICDs. The incorporation of ICDs and IPDs achieves a further 0.5 dB improvement. In this case, we found that almost all filters are tuned to relatively low frequencies. This indicates that the ICDs learn complementary spatial information to compensate for the IPD ambiguity at low frequencies.
4 Conclusion
This work proposes an end-to-end multi-channel speech separation model that is able to learn effective spatial cues directly from the multi-channel speech waveforms in a purely data-driven fashion. Experimental results demonstrate that the MCSS model based on the learned ICDs outperforms that based on the well-established IPDs.
References
 [1] C. Cherry and J. A. Bowles, “Contribution to a study of the “cocktail party problem”,” Journal of the Acoustical Society of America, vol. 32, no. 7, pp. 884–884, 1960.
 [2] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1–5.
 [3] L. Chen, M. Yu, D. Su, and D. Yu, “Multi-band PIT and model integration for improved multi-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 705–709.
 [4] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multichannel overlapped speech recognition with location guided speech extraction network,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 558–565.
 [5] Z.Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust asr,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5709–5713.
 [6] Z. Chen, T. Yoshioka, X. Xiao, L. Li, M. L. Seltzer, and Y. Gong, “Efficient integration of fixed beamformers and speech separation networks for multi-channel far-field speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5384–5388.
 [7] L. Drude and R. Haeb-Umbach, “Tight integration of spatial and spectral features for BSS with deep clustering embeddings,” in Interspeech, 2017, pp. 2650–2654.
 [8] R. Gu, J. Wu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “End-to-end multi-channel speech separation,” arXiv preprint arXiv:1905.06286, 2019.
 [9] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 27, no. 8, pp. 1256–1266, Aug 2019.
 [10] S. Imai, “Cepstral analysis synthesis on the mel frequency scale,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 8, 1983, pp. 93–96.
 [11] C. Humphries, E. Liebenthal, and J. R. Binder, “Tonotopic organization of human auditory cortex,” Neuroimage, vol. 50, no. 3, pp. 1202–1211, 2010.
 [12] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
 [13] Y. Hoshen, R. J. Weiss, and K. W. Wilson, “Speech acoustic modeling from raw multichannel waveforms,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4624–4628.

 [14] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, “Neural network adaptive beamforming for robust multichannel speech recognition,” in Interspeech, 2016, pp. 1976–1980.
 [15] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, B. Li, E. Variani, I. Shafran, A. Senior, K. Chin et al., “Raw multichannel processing using deep neural networks,” in New Era for Robust Speech Recognition. Springer, 2017, pp. 105–133.
 [16] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
 [17] D. Yu, M. Kolbæk, Z.H. Tan, and J. Jensen, “Permutation invariant training of deep models for speakerindependent multitalker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
 [18] Z.Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 27, no. 2, pp. 457–468, 2019.
 [19] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating smallroom acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
 [20] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
 [21] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 14, no. 4, pp. 1462–1469, 2006.