1 Introduction
Leveraging deep learning techniques, close-talk speech separation has made great progress in recent years. Several deep-model-based methods have been proposed, such as deep clustering [10, 12], deep attractor network [11], permutation invariant training (PIT) [8, 9], chimera++ network [13] and time-domain audio separation network (TasNet) [3, 4]. These approaches have opened the door towards cracking the cocktail party problem. However, most of them operate on the time-frequency (TF) representation of the raw waveform obtained by the short-time Fourier transform (STFT). The computed spectrogram is complex-valued and can be decomposed into a magnitude part and a phase part. Because phase retrieval is difficult and human ears are insensitive to phase distortion to some extent, the magnitude spectrogram is the usual choice for a separation network to work with, and the phase of the speech mixture is combined with the predicted magnitude to reconstruct waveforms. Recently, more researchers have recognized the negative influence of reconstructing waveforms with the mixture phase and have started to work on phase modeling and retrieval [18, 15, 16, 14]. For example, in the extreme case where the mixture phase is opposite to the oracle phase, the reconstructed waveform is far from the ground truth even if the magnitude is perfectly predicted [17]. To incorporate the phase into modeling, considerable effort has been devoted to end-to-end methods that operate in the time domain [19, 3, 4]. Notably, the convolutional time-domain audio separation network (Conv-TasNet) [4] surpassed ideal TF masking methods and achieved state-of-the-art results on the widely used close-talk dataset WSJ0 2mix.
Although close-talk speech separation has made great progress, the performance of far-field speech separation is still far from satisfactory due to reverberation. A microphone array is commonly used to record multi-channel data. Correlation cues among the multi-channel signals, such as the inter-channel time difference (ITD), phase difference (IPD) and level difference (ILD), can indicate the sound source position. These spatial features have been shown to be beneficial, especially when combined with spectral features, for frequency-domain separation models [1, 2, 6, 5, 7]. Unfortunately, these spatial features (e.g., IPD) are hard to incorporate into time-domain methods, as they are typically extracted in the frequency domain using a different analysis window type/length and hop size.
In this work we propose a novel method to extract spatial information in the time domain using neural networks. This leads to an integrated waveform-in waveform-out system for multi-channel separation in a single neural network architecture that can be trained end to end. This work can also be viewed as a multi-channel extension of Conv-TasNet for time-domain far-field speech separation.
2 Single-Channel End-to-End Separation
There have been several representative works on single-channel end-to-end speech separation. One of them is TasNet, proposed by Luo et al. [3], which performs speech separation in the time domain. The block diagram is illustrated in Figure 1. The network is designed as an encoder-decoder structure, in which the mixed speech waveform $\mathbf{y}$ is decomposed over a series of basis functions into non-negative activations that can later be inverted back to the time-domain signal. Both the encoder and the decoder are a Conv1D layer. The number of channels is the number of basis functions, and the kernel size and stride are the window length and hop size, respectively. The analysis window of TasNet is shortened from the 32 ms (512 sampling points) used in STFT-based methods to 5 ms (80 sampling points). Bidirectional long short-term memory (BLSTM) layers are adopted in the separation module, which computes a mask for each source from the encoded mixture representation, similar to TF masking. Furthermore, instead of a time-domain mean squared error (MSE) loss, the scale-invariant signal-to-noise ratio (SI-SNR) is used to directly optimize the separation performance. It is defined as:
$$\mathbf{s}_{\text{target}} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle\, \mathbf{s}}{\|\mathbf{s}\|_2^2}, \qquad \mathbf{e}_{\text{noise}} = \hat{\mathbf{s}} - \mathbf{s}_{\text{target}}, \qquad \text{SI-SNR} = 10 \log_{10} \frac{\|\mathbf{s}_{\text{target}}\|_2^2}{\|\mathbf{e}_{\text{noise}}\|_2^2} \tag{1}$$
where $\hat{\mathbf{s}}$ and $\mathbf{s}$ are the estimated and clean source waveforms, respectively. Zero-mean normalization is applied to $\hat{\mathbf{s}}$ and $\mathbf{s}$ to guarantee scale invariance.
Recently, in [4], Luo et al. pointed out two major limitations of LSTM networks. First, with a smaller kernel size the number of frames grows, so the LSTM has to handle very long temporal dependencies, which leads to accumulated error. Second, LSTM networks have high computational complexity and a large number of parameters. To alleviate these problems, Luo et al. proposed an improved version of TasNet in which the separator is replaced with a temporal fully-convolutional network (TCN) [22]. The separation module consists of a stack of dilated Conv1D blocks that is repeated several times, which supports a large receptive field in a non-causal setup. Meanwhile, in these dilated Conv1D blocks, the standard convolution is replaced with a depthwise separable convolution to further reduce the number of parameters.
Another representative audio source separation work is Wave-U-Net [20], a multi-scale, end-to-end neural network introduced by Stoller et al. The model operates directly on the raw audio waveform, which allows it to encode phase information. Multi-resolution features output by the downsampling and upsampling blocks are computed and combined, which makes it possible to incorporate long temporal context. In [14], end-to-end training is also implemented by defining a time-domain MSE loss, where the TF masking network is trained through an iterative phase reconstruction procedure.
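As an illustration of the SI-SNR objective (Eq. 1) used by TasNet, the following minimal sketch computes it for two waveforms given as plain Python lists. The function name `si_snr` and the stabilising constant `eps` are assumptions made for this example and do not come from the original work.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    if len(estimate) != len(target):
        raise ValueError("waveforms must have the same length")
    # Zero-mean normalization of both signals.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the scaled reference.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps  # eps is an assumed stabiliser
    s_target = [dot / t_energy * b for b in t]
    # Everything in the estimate that is not explained by the target.
    e_noise = [a - b for a, b in zip(e, s_target)]
    num = sum(x * x for x in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the value essentially unchanged, which is the scale invariance that motivates the metric.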
3 Multi-channel End-to-End Separation
Deep-learning-based multi-channel speech separation methods usually combine spectral features with inter-channel spatial features (e.g., IPD) [1, 2]. These IPD features have proven effective in many frequency-domain frameworks [5, 7]. The rationale is that, for spatially separated directional sources with different time delays, the IPD computed from the STFT ratio of any two microphone channels forms clusters within each frequency due to source sparseness [26]. In Section 3.1 we first incorporate the IPD features extracted in the frequency domain into TasNet and perform cross-domain joint training. In Section 3.2 we reformulate the STFT and IPD as functions of a time-domain convolution with a special kernel. We then relax those fixed kernels to be learnable. This leads to an integrated waveform-in waveform-out system for multi-channel separation in a single neural network architecture that can be trained end to end.
3.1 Cross-domain learning
First, we integrate the well-established spatial features extracted in the frequency domain into TasNet and jointly train the network. The block diagram is illustrated in Figure 2.
Dual-stream models applied in action recognition [wang2018two, 23, 24] suggest that cross-domain feature fusion is feasible and effective. In our particular case, the main branch for the time-domain waveform estimates speaker masks for the first channel's mixture representation. The frequency-domain features are extracted by performing the STFT on each individual channel of the mixture waveform to compute the complex spectrogram $\mathbf{Y}$. Given a window function $w$ with length $T$, the spectrum is calculated by:
$$\mathbf{Y}_{nk} = \sum_{m} \mathbf{y}[m]\, w[n-m]\, e^{-i \frac{2\pi m}{T} k} \tag{2}$$
where $n$ is the index of samples and $k$ is the index of frequency bands. We then compute the IPD as the phase difference between channels of the complex spectrogram:
$$\text{IPD}^{(u)}_{nk} = \angle \mathbf{Y}^{u_1}_{nk} - \angle \mathbf{Y}^{u_2}_{nk} \tag{3}$$
where $u_1$ and $u_2$ are the indices of the two microphones in the $u$-th microphone pair. Since the window length in the encoder is much shorter than the window size in the STFT, upsampling is applied to the frequency-domain features to match the dimension of the encoded mixture representation. Since the mixture representation subspace and the frequency domain do not share the same nature, the frequency-domain features are encoded by a separate 1×1 convolution layer rather than being directly concatenated with the mixture representation. Apart from the early fusion method illustrated in Figure 2, which concatenates the spatial embedding with the mixture representation before all 4×8 1-D ConvBlocks in the separator, we also investigate two other fusion methods: middle fusion (after two individual branches of 2×8 1-D ConvBlocks) and late fusion (after two individual branches of all 4×8 1-D ConvBlocks).
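To make the frequency-domain feature extraction concrete, the sketch below computes a frame-based STFT for two channels and the IPD between them. It is an illustration under stated assumptions rather than the authors' implementation: the helper names `stft` and `ipd`, the periodic Hann window and the naive DFT loop are all choices made for this example.

```python
import cmath
import math

def stft(y, win_len, hop):
    """Frame-based STFT with a periodic Hann window.

    Returns a list of frames, each holding win_len // 2 + 1 complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * m / win_len) for m in range(win_len)]
    frames = []
    for start in range(0, len(y) - win_len + 1, hop):
        bins = []
        for k in range(win_len // 2 + 1):
            acc = 0j
            for m in range(win_len):
                acc += y[start + m] * window[m] * cmath.exp(-2j * math.pi * m * k / win_len)
            bins.append(acc)
        frames.append(bins)
    return frames

def ipd(spec_a, spec_b):
    """Phase of channel a minus phase of channel b, wrapped to (-pi, pi]."""
    # phase(a * conj(b)) equals angle(a) - angle(b) up to a multiple of 2*pi.
    return [[cmath.phase(a * b.conjugate()) for a, b in zip(frame_a, frame_b)]
            for frame_a, frame_b in zip(spec_a, spec_b)]
```

For a pure tone that reaches the second microphone one sample later, the IPD at the tone's frequency bin equals the phase lag of that delay; taking `math.cos` of each value gives the cosIPD feature.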
3.2 End-to-End learning
In the approach above, IPD features are computed using the STFT with one pre-designed analysis window type ($w$) and length ($T$) for all frequency bands ($k$). Meanwhile, the time-domain encoder (see Fig. 2) automatically learns a series of kernels in a data-driven fashion with a different kernel size and stride. This may cause a potential mismatch in the model. Motivated by this, the STFT in Eq. 2 can be reformulated as
$$\mathbf{Y}_{nk} = e^{-i \frac{2\pi n}{T} k} \sum_{m} \mathbf{y}[m]\, w[n-m]\, e^{i \frac{2\pi (n-m)}{T} k} = e^{-i \frac{2\pi n}{T} k}\, (\mathbf{y} \circledast \mathbf{K}_k)[n], \qquad \mathbf{K}_k[m] = w[m]\, e^{i \frac{2\pi m}{T} k} \tag{4}$$
where $n$ is the index of samples, $k$ is the index of frequency bands and $\circledast$ is the convolution. This implies that we can compute the STFT using a time-domain convolution with a special kernel. When computing the IPD between two microphones, one can substitute Eq. 4 into Eq. 3. Note that the phase factor $e^{-i 2\pi n k / T}$ in Eq. 4 is identical for the corresponding frequency bands of the two microphones and thus cancels out. This means the phase factor affects neither the magnitude ($|\mathbf{Y}_{nk}|$) nor the IPD. The only thing that matters to the IPD features is the kernel $\mathbf{K}_k$ in Eq. 4. Note that the STFT kernel in Eq. 4 is complex-valued. It can be split into real and imaginary parts:
$$\mathbf{K}^{\text{re}}_k[m] = w[m] \cos\!\left(\frac{2\pi m k}{T}\right), \qquad \mathbf{K}^{\text{im}}_k[m] = w[m] \sin\!\left(\frac{2\pi m k}{T}\right) \tag{5}$$
The shape of the kernel is determined by the window function $w$. The size of the kernel is the window length $T$. The stride of the convolution is equivalent to the hop size of the STFT. Given the kernels, the $u$-th pair of IPD can be computed by:
$$\text{IPD}^{(u)}_{nk} = \arctan \frac{(\mathbf{y}^{u_1} \circledast \mathbf{K}^{\text{im}}_k)[n]}{(\mathbf{y}^{u_1} \circledast \mathbf{K}^{\text{re}}_k)[n]} - \arctan \frac{(\mathbf{y}^{u_2} \circledast \mathbf{K}^{\text{im}}_k)[n]}{(\mathbf{y}^{u_2} \circledast \mathbf{K}^{\text{re}}_k)[n]} \tag{6}$$
where $\mathbf{y}^{u_1}$ and $\mathbf{y}^{u_2}$ are the waveforms of the two microphones in the $u$-th pair.
Interestingly, although we derived Eqs. 4 and 5 from the STFT, they can be generalized to arbitrary kernels. This means we are not tied to the specific kernel shape determined by $w$, but can use any kernels that are automatically learned from the data. Thus everything can be done in one integrated neural network (see Figure 2) and trained end to end using the SI-SNR loss (Eq. 1). We can also customize the number of kernels and use kernel sizes other than $T$. The stride of the convolution is now also configurable rather than fixed to the hop size of the STFT. To extract the IPD features directly from the time domain, all we need to do is apply Eq. 6 with the kernels $\mathbf{K}^{\text{re}}$ and $\mathbf{K}^{\text{im}}$ learned in each epoch.
To shed light on the properties of the kernel learning, Figure 3 visualizes the learned kernels at different training stages. In the first column, kernel values are initialized with the STFT kernel in Eq. 5, but with a kernel size and stride matched to the encoder in Figure 2. In (a), the kernels are unconstrained during training: $\mathbf{K}^{\text{re}}$ and $\mathbf{K}^{\text{im}}$ are separately learnable parameters and therefore no longer conform to the physical definition of the magnitude and phase of these kernels. In (b), $w$ in Eq. 5 is a trainable parameter, so that the kernels learn a dynamic representation for the new “IPD” that is optimized for speech separation.
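The fixed-kernel case can be sketched in a few lines: build cosine and sine kernels from a window, convolve each microphone signal with them, and take the difference of the resulting phases. This is a minimal illustration, not the authors' code. The names `stft_kernels`, `conv1d` and `conv_ipd`, the symmetric Hann window and the use of a true (flipped) convolution are assumptions of this example; `math.atan2` is used in place of a plain arctangent so that the quadrant is resolved.

```python
import math

def stft_kernels(win_len, num_bins):
    """Kernels w[m]*cos(2*pi*m*k/T) and w[m]*sin(2*pi*m*k/T) for k < num_bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * m / (win_len - 1)) for m in range(win_len)]
    k_re = [[window[m] * math.cos(2 * math.pi * m * k / win_len) for m in range(win_len)]
            for k in range(num_bins)]
    k_im = [[window[m] * math.sin(2 * math.pi * m * k / win_len) for m in range(win_len)]
            for k in range(num_bins)]
    return k_re, k_im

def conv1d(y, kernel, stride):
    """Strided 'valid' convolution: out[t] = sum_m y[n - m] * kernel[m]."""
    size = len(kernel)
    return [sum(y[n - m] * kernel[m] for m in range(size))
            for n in range(size - 1, len(y), stride)]

def conv_ipd(y_a, y_b, k_re, k_im, stride):
    """IPD indexed as [kernel][frame]: phase of channel a minus phase of channel b."""
    out = []
    for kr, ki in zip(k_re, k_im):
        re_a, im_a = conv1d(y_a, kr, stride), conv1d(y_a, ki, stride)
        re_b, im_b = conv1d(y_b, kr, stride), conv1d(y_b, ki, stride)
        out.append([math.atan2(ia, ra) - math.atan2(ib, rb)
                    for ra, ia, rb, ib in zip(re_a, im_a, re_b, im_b)])
    return out
```

With these fixed kernels the result matches the IPD of a frame-based STFT up to multiples of 2π, since the phase factor shared by both channels cancels. Making the kernel values trainable parameters instead of constants gives the learnable variant discussed above.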
4 Experimental Setup
4.1 Dataset
We simulated a multi-channel reverberant version of the two-speaker mixed Wall Street Journal (WSJ0 2mix) corpus introduced in [10]. A 6-element uniform circular array is used as the signal receiver; its radius is 0.035 m and the six microphones are placed at 60-degree intervals. Each multi-channel two-speaker mixture is generated as follows. The clean speech of two speakers is mixed at an SNR in the range from −2.5 dB to 2.5 dB. Then the classic image method is used to add multi-channel room impulse responses (RIRs) to the anechoic mixture, with T60 ranging from 0.05 to 0.5 seconds. The room configuration (length × width × height) is randomly sampled from 3×3×2.5 m to 8×10×6 m. The microphone array and the speakers are at least 0.3 m away from the walls. Unlike [5, 7], we do not constrain the angle difference between speakers, so our dataset contains samples covering the full range of angle differences, i.e., 0 to 180 degrees. All data is sampled at 16 kHz. The mixing SNRs, speaker pairs and dataset partition are identical to those of the anechoic monaural WSJ0 2mix. Note that all speakers in the test set are unseen during training, so our systems are evaluated in a speaker-independent scenario.
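The array geometry described above can be written down directly. The sketch below places six microphones on a circle of radius 0.035 m and measures the spacing of the microphone pairs that Section 4.2 selects for the IPD features. It assumes, as a labelled guess that the text does not state, that the microphones are numbered consecutively around the circle; the helper names are hypothetical.

```python
import math

def circular_array(num_mics=6, radius=0.035):
    """(x, y) positions in metres of a uniform circular array, mic 1 at angle 0."""
    step = 2 * math.pi / num_mics  # 60 degrees for six microphones
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(num_mics)]

def pair_distance(mics, a, b):
    """Distance in metres between microphones a and b (1-based indices)."""
    (xa, ya), (xb, yb) = mics[a - 1], mics[b - 1]
    return math.hypot(xa - xb, ya - yb)

# Assumption: microphones are numbered consecutively around the circle.
mics = circular_array()
pairs = [(1, 4), (2, 5), (3, 6), (1, 2), (3, 4), (5, 6)]
spacings = {pair: pair_distance(mics, *pair) for pair in pairs}
```

Under that numbering assumption, the first three pairs are diametrically opposite (0.07 m apart) and the last three are adjacent (0.035 m apart), so the selected pairs would cover both the largest and the smallest spacing of the array.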
Table 1: SI-SNRi (dB) on far-field WSJ0 2mix for different angle differences between speakers.
| Approach | Input features | <15° | 15-45° | 45-90° | >90° | Ave. |
|---|---|---|---|---|---|---|
| TasNet (Conv) | 1st ch wav | 8.5 | 9.0 | 9.1 | 9.3 | 9.1 |
| Freq-LSTM | 1st ch LPS + multi-channel IPD | 3.0 | 6.7 | 7.9 | 8.2 | 6.9 |
| Freq-BLSTM | 1st ch LPS + multi-channel cosIPD | 6.5 | 9.0 | 9.4 | 9.0 | 8.7 |
| Freq-TCN | 1st ch LPS + multi-channel cosIPD | 5.6 | 8.6 | 8.7 | 8.3 | 8.0 |
| cascaded networks | 1st ch LPS + multi-channel IPD | 3.5 | 8.5 | 10.1 | 10.6 | 8.8 |
| parallel encoder | multi-channel wav | 5.7 | 10.3 | 11.9 | 12.9 | 10.8 |
| cross-domain training | 1st ch wav + LPS / early fusion | 8.7 | 9.6 | 9.5 | 9.5 | 9.4 |
| | 1st ch wav + LPS / middle fusion | 9.2 | 9.7 | 9.6 | 10.0 | 9.7 |
| | 1st ch wav + LPS / late fusion | 8.7 | 9.4 | 9.3 | 9.5 | 9.3 |
| | 1st ch wav + LPS + multi-channel cosIPD / middle fusion | 8.3 | 11.4 | 11.7 | 11.3 | 11.0 |
| end-to-end separation | multi-channel wav / fixed kernel / cosIPD | 8.5 | 11.8 | 12.0 | 11.6 | 11.2 |
| | multi-channel wav / fixed kernel / cosIPD + sinIPD | 7.7 | 11.6 | 12.3 | 12.6 | 11.5 |
| | multi-channel wav / trainable kernel (unconstrained) / cosIPD | 7.9 | 11.3 | 12.0 | 11.3 | 11.0 |
| | multi-channel wav / trainable kernel ($w$) / cosIPD | 8.2 | 11.6 | 12.0 | 11.4 | 11.1 |
| | multi-channel wav / trainable kernel ($w$) / cosIPD + sinIPD | 7.9 | 11.6 | 12.5 | 12.9 | 11.6 |
| IBM | - | 11.6 | 11.5 | 11.5 | 11.5 | 11.5 |
| IRM | - | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| IPSM | - | 13.7 | 13.6 | 13.6 | 13.6 | 13.6 |
4.2 Network training and Feature extraction
All hyperparameters are the same as in the best setup of Conv-TasNet, except that the encoder kernel size is set to 40 and the encoder stride to 20. Batch normalization (BN) is adopted because it gives the most stable performance. SI-SNR is used as the training objective. Training uses chunks of 4.0 seconds. The batch size is 32. The microphone pairs selected for IPDs are (1, 4), (2, 5), (3, 6), (1, 2), (3, 4) and (5, 6) in all experiments. For cross-domain training, both the log power spectrum (LPS) and IPDs serve as frequency-domain features. These features are extracted with a 32 ms window length and a 16 ms hop size with 512 FFT points. For end-to-end separation, the number of filters is set to 33, since we round the window length up to the next power of 2, i.e., 64, which gives 64/2 + 1 = 33 frequency bands.
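The sizes quoted in this setup follow from simple arithmetic at a 16 kHz sampling rate, which the short sketch below reproduces. The helper names are hypothetical, and reading "the window length" as the 40-sample encoder kernel is an assumption of this example.

```python
SAMPLE_RATE = 16000  # Hz, as stated for the dataset

def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(ms * sample_rate / 1000))

def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    return 1 << (n - 1).bit_length()

def num_bins(fft_size):
    """Number of non-redundant frequency bands of a real-input FFT."""
    return fft_size // 2 + 1

stft_window = ms_to_samples(32)       # 512 samples
stft_hop = ms_to_samples(16)          # 256 samples
stft_bands = num_bins(512)            # 257 bands
kernel_fft = next_power_of_two(40)    # 64, assuming the 40-sample encoder kernel
num_filters = num_bins(kernel_fft)    # 33 filters
```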
4.3 Rival Systems
Besides the methods proposed in Section 3, we investigated several multi-channel speech separation approaches as baseline systems. We also propose a few alternative multi-channel separation systems.
Freq-LSTM/BLSTM. Several works have proven the effectiveness of integrating spatial and spectral features in LSTM-based frequency-domain separation networks [5, 7, 1]. For simplicity, we call this approach Freq-LSTM. LPS and 6 pairs of IPD features are concatenated at the input level, and Freq-LSTM estimates a TF mask for each speaker. The network contains three LSTM layers with 300 cells, followed by a 512-node fully-connected layer with a Rectified Linear Unit (ReLU) activation function. The output layer, with sigmoid activation, consists of 257×2 nodes, as it outputs estimated masks for two speakers.
Freq-TCN. We replace the separation network used in Freq-LSTM with a TCN, which features a long-range receptive field and deep extraction ability [21, 22], and name the result Freq-TCN. In addition, a 256-node BLSTM layer is appended after the TCN to guarantee the temporal continuity of the output sequence. The training criterion and the output are the same as for Freq-LSTM. The number of repeats and the number of dilated blocks are changed to 6 and 4. The details of this system are discussed in Appendix A.
Cascaded Networks. One way of extending time-domain methods (like TasNet) from single-channel to multi-channel is to first apply a traditional multi-channel frequency-domain method (like Freq-LSTM) to incorporate the spatial information, followed by a single-channel TasNet. Refer to Appendix B for a detailed description.
Parallel encoder. In this method we replace the single encoder in Conv-TasNet with multiple parallel encoders to extract the mixture representation and spatial information simultaneously. The details are presented in Appendix C.
5 Results and Analysis
SI-SNR improvement (SI-SNRi) is used to measure the quality of the separated speech, as described in Section 2. In addition to the overall performance, we list results for different ranges of angle difference. The results are presented in Table 1. We reproduce the single-channel Conv-TasNet and achieve 15.2 dB SI-SNRi, compared to the 14.6 dB reported in [4], on the close-talk WSJ0 2mix dataset. The single-channel Conv-TasNet, Freq-LSTM, Freq-BLSTM and Freq-TCN serve as our baselines and achieve 9.1 dB, 6.9 dB, 8.7 dB and 8.0 dB, respectively, on the far-field WSJ0 2mix dataset. For reference, we also report the results achieved by ideal time-frequency masks, including the ideal binary mask (IBM), ideal ratio mask (IRM) and ideal phase-sensitive mask (IPSM). These masks are calculated using the STFT with a 512-point FFT size and a 256-point hop size.
Cascaded networks: Freq-LSTM alone achieves 6.9 dB, and Conv-TasNet refines the estimation and improves the performance to 8.8 dB. The overall performance is slightly worse than the single-channel Conv-TasNet baseline; however, when the angle difference is larger than 45°, it outperforms all baseline systems.
Parallel encoder: The parallel encoder pushes the overall performance up to 10.8 dB. This demonstrates the effectiveness of data-driven spatial coding, and the system achieves the best performance among all systems for samples with an angle difference larger than 90°. However, the single-channel mixture representation seems not to be well preserved, since the performance degrades on samples with a small angle difference.
Cross-domain training: Compared to the Conv-TasNet baseline, joint training with the LPS feature slightly improves the performance for all angle differences. Moreover, middle fusion outperforms the other two fusion methods and achieves 9.7 dB overall. With both LPS and cosIPD features, spectral and spatial information are incorporated by cross-domain training and the result increases substantially to 11.0 dB.
End-to-End separation: The convolution kernels enable IPD calculation inside the network, making it an end-to-end approach. With only cosIPD, fixing the kernels at their initialized values, i.e., DCT coefficients, achieves 11.2 dB SI-SNRi, and the performance of the learnable kernels is slightly worse than that of the fixed ones. However, with the complementary information from sinIPD, the trainable kernel achieves the best result among all systems, i.e., 11.6 dB, surpassing the performance of the ideal binary mask and the ideal ratio mask.
6 Conclusions
In this paper, we propose a new end-to-end approach for multi-channel speech separation. First, an integrated neural architecture is proposed to achieve waveform-in waveform-out speech separation. Second, we reformulate the traditional STFT and IPD as functions of a time-domain convolution with a fixed special kernel. Third, we relax the fixed kernels to be learnable, so that the entire architecture becomes purely data-driven and can be trained end to end. Experimental results on far-field WSJ0 2mix validate the effectiveness of our proposed end-to-end systems as well as the other multi-channel extensions.
References
 [1] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multichannel overlapped speech recognition with location guided speech extraction network,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 558–565.
 [2] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5739–5743.
 [3] Y. Luo and N. Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 696–700.
 [4] Y. Luo and N. Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
 [5] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5.
 [6] L. Drude and R. Haeb-Umbach, “Tight integration of spatial and spectral features for BSS with deep clustering embeddings,” in Interspeech, 2017, pp. 2650–2654.
 [7] Z. Wang and D. Wang, “Integrating spectral and spatial features for multi-channel speaker separation,” in Proc. Interspeech, vol. 2018, 2018, pp. 2718–2722.
 [8] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
 [9] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
 [10] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
 [11] Y. Luo, Z. Chen, and N. Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
 [12] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
 [13] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 686–690.
 [14] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” arXiv preprint arXiv:1804.10204, 2018.
 [15] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, “PhaseNet: Discretized phase modeling with deep neural networks for audio source separation,” in Proc. Interspeech, 2018, pp. 2713–2717.
 [16] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, “Phase-aware speech enhancement with deep complex U-Net,” arXiv preprint arXiv:1903.03107, 2019.
 [17] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” IEEE Journal of Selected Topics in Signal Processing, 2019.
 [18] P. Mowlaee, R. Saeidi, and R. Martin, “Phase estimation for signal reconstruction in single-channel source separation,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
 [19] S. Venkataramani, J. Casebeer, and P. Smaragdis, “Adaptive front-ends for end-to-end source separation,” in Proc. NIPS, 2017.
 [20] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
 [21] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
 [22] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 47–54.
 [23] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
 [24] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.
 [25] S. Venkataramani, J. Casebeer, and P. Smaragdis, “End-to-end source separation with adaptive front-ends,” in 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 684–688.
 [26] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multi-microphone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, pp. 692–730, 2017.
Appendix
In this section, we describe the three rival systems in detail: Freq-TCN, cascaded networks and parallel encoder.
A Freq-TCN
As shown in Figure 4, Freq-TCN shares the same processing architecture as Freq-LSTM except for the backbone of the separation network. First, the multi-channel mixture waveform is transformed to a complex STFT representation. Then, spectral and spatial features, i.e., magnitude spectra and IPDs, are extracted and serve as the input of the separation network. In the separation network, the LSTM layers are replaced with a temporal convolutional network (TCN), which features a long-range receptive field and deep extraction ability [21, 22]. The TCN's structure is the same as that used in Conv-TasNet. In addition, a 256-node BLSTM layer is appended after the TCN to guarantee the temporal continuity of the output sequence. The separation module learns to estimate a time-frequency mask for each source in the mixture. The network training criterion is phase-sensitive spectrogram approximation (PSA):
$$\mathcal{L}_{\text{PSA}} = \sum_{c=1}^{C} \left\| \hat{\mathbf{M}}_c \odot |\mathbf{Y}| - |\mathbf{S}_c| \odot \cos(\angle \mathbf{Y} - \angle \mathbf{S}_c) \right\|_F^2 \tag{7}$$
where $C$ indicates the number of speakers in the speech mixture, $\hat{\mathbf{M}}_c$ is the phase-sensitive mask (PSM) estimated by the network, $|\mathbf{Y}|$ and $\angle \mathbf{Y}$ are the magnitude and phase of the mixture spectrogram, and $|\mathbf{S}_c|$ and $\angle \mathbf{S}_c$ are the magnitude and phase of the $c$-th source's complex spectrogram, respectively.
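The PSA criterion can be illustrated with a small sketch that evaluates the loss on flattened spectrograms. It follows the common form of phase-sensitive spectrogram approximation and is an assumption-laden example rather than the training code of this work: the function name `psa_loss` and the flat-list data layout are hypothetical.

```python
import cmath
import math

def psa_loss(masks, mixture, sources):
    """Phase-sensitive spectrogram approximation loss.

    masks:   one list of real mask values per source, one value per TF bin
    mixture: complex mixture spectrogram flattened to a list of TF bins
    sources: one complex spectrogram per source, flattened the same way
    """
    loss = 0.0
    for mask, source in zip(masks, sources):
        for m, y, s in zip(mask, mixture, source):
            # Target: source magnitude scaled by the cosine of the phase
            # difference between mixture and source.
            target = abs(s) * math.cos(cmath.phase(y) - cmath.phase(s))
            loss += (m * abs(y) - target) ** 2
    return loss
```

The loss is zero when each mask equals the source-to-mixture magnitude ratio multiplied by the cosine of the phase difference, which is the ideal phase-sensitive mask.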
B Cascaded Networks
It is rather simple and convenient to combine established spatial features such as IPDs with spectral features in the frequency domain. Several works have proven the effectiveness of this integration in LSTM-based frequency-domain separation networks. For convenience, we call them Freq-LSTM methods. Thus, instead of exploring exploitable spatial features for time-domain separation, we cascade a Freq-LSTM network with a single-channel Conv-TasNet. This method is illustrated in Figure 5.
Specifically, under the supervision of the ideal ratio mask (IRM), Freq-LSTM learns to estimate the magnitude mask for each source. Then, using the mixture phase of the first channel's mixture spectrogram for reconstruction, pre-separated waveforms are obtained by the inverse STFT (iSTFT). Next, Conv-TasNet takes these two pre-separated waveforms as input and concatenates their representations along the feature dimension after the TasNet encoder. The separation module aims to generate a mask for each source's representation. In this approach, the multi-channel Freq-LSTM serves as a separation front-end, which takes both spatial and spectral features into consideration and gives a reasonable yet somewhat coarse estimate. The back-end Conv-TasNet can refine the inaccurate phase and improve the separation performance of these pre-estimated results, thanks to its powerful separation capability in the time domain. Unfortunately, the iSTFT operation hinders joint training of Freq-LSTM and Conv-TasNet, so estimation errors accumulate across the two networks. In addition, the large mismatch between the training and evaluation datasets limits the applicability of this method.
C Parallel Encoder
To automatically extract cross-correlations between the channels of multi-channel speech, we adopt a parallel encoder instead of a single encoder, as illustrated in Figure 6. The parallel encoder carries out both waveform encoding and cross-channel cue extraction. It contains a set of convolution kernels for each channel of the input multi-channel speech mixture y and sums the convolution outputs across channels to form the mixture feature maps. The following steps are the same as in the single-channel Conv-TasNet.
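The summation across channels can be sketched as follows. The example filters each microphone signal with its own bank of kernels and adds the responses frame by frame. The function name `parallel_encode`, the data layout and the cross-correlation form (as used by typical deep-learning Conv1D layers) are assumptions of this illustration, and any non-linearity after the encoder is omitted.

```python
def parallel_encode(channels, kernels, stride):
    """Sum per-channel strided 1-D filter responses into shared feature maps.

    channels: one waveform (list of floats) per microphone
    kernels:  kernels[c][f] is the f-th filter (list of floats) for channel c
    Returns feature maps indexed as [filter][frame].
    """
    size = len(kernels[0][0])
    num_filters = len(kernels[0])
    starts = range(0, len(channels[0]) - size + 1, stride)
    return [[sum(channels[c][t + m] * kernels[c][f][m]
                 for c in range(len(channels)) for m in range(size))
             for t in starts]
            for f in range(num_filters)]
```

Summing per-channel filter outputs in this way is mathematically the same as a single 1-D convolution whose number of input channels equals the number of microphones.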