Multi-talker monaural speech separation has a vast range of applications. For example, in a home or conference environment where many people talk simultaneously, the human auditory system can easily track and follow a target speaker's voice within the multi-talker mixture. In this case, a clean speech signal of the target speaker needs to be separated from the mixed speech before subsequent recognition can be performed. It is thus a problem that must be solved in order to achieve satisfactory performance in speech or speaker recognition tasks. There are two difficulties in this problem. The first is that, since we do not have any prior information about the user, a practical system must be speaker-independent. The second is that beamforming algorithms cannot be applied to a single-microphone signal. Many traditional methods, such as computational auditory scene analysis (CASA) [27, 18, 5], non-negative matrix factorization (NMF) [22, 10], and probabilistic models, do not handle these two difficulties well.
More recently, a large number of deep-learning-based techniques have been proposed for this task. These methods can be briefly grouped into three categories. The first category is based on deep clustering (DPCL) [4, 6], which maps the time-frequency (TF) points of the spectrogram into embedding vectors; these embedding vectors are then clustered into several classes corresponding to different speakers, and finally the clusters are used as masks to inversely transform the spectrogram into the separated clean voices. The second is permutation invariant training (PIT) [9, 32], which solves the label permutation problem by minimizing the lowest error among all possible permutations of the source assignment. The third category is end-to-end speech separation in the time domain [15, 16, 24, 33, 14], which is a natural way to overcome both the upper bound on source-to-distortion ratio improvement (SDRi) in short-time Fourier transform (STFT) mask-estimation-based methods and the real-time processing requirements of practical use.
This paper is based on the end-to-end method [15, 16, 24, 33, 14], which has achieved better results than DPCL-based or PIT-based approaches, since most DPCL- and PIT-based methods use the STFT as a front-end. Specifically, the mixed speech signal is first transformed from a one-dimensional signal in the time domain to a two-dimensional spectrum in the TF domain; the mixed spectrum is then separated into spectra corresponding to the different source speeches by a deep clustering or mask estimation method; finally, each clean source speech signal is restored by an inverse STFT on its spectrum. This framework has several limitations. Firstly, it is unclear whether the STFT is the optimal transformation of the signal for speech separation (even assuming the parameters it depends on, such as the size and overlap of audio frames and the window type, are optimal). Secondly, most STFT-based methods assume the phase of the separated signal to be equal to the mixture phase, which is generally incorrect and imposes an obvious upper bound on separation performance even when ideal masks are used. To overcome these problems, several speech separation models that operate directly on time-domain speech signals were recently proposed [15, 16, 24, 33, 14]. Inspired by these first results, we propose 'La Furca' ('furca' is Latin for 'fork', by which we mean that the mixed speech is divided into two streams by our network, just like water), a general name for a series of fully end-to-end time-domain separation methods, which includes: 1) a dual-path network with intra- and inter-parallel BiLSTM components (La Furca I), which replaces the intra- and inter-BiLSTM with multiple parallel BiLSTM modules to reduce the variance of the model; the intra- and inter-parallel BiLSTM modules replicate weight matrices and average the feature maps produced by those layers, a convenient technique that effectively improves separation performance; 2) a local context-aware network with attention BiLSTM (La Furca II): since adjacent frames in the input of the inter-BiLSTM are far apart in the original speech, the inter-BiLSTM cannot model the signal well, so an attention BiLSTM is employed to compensate with context information; 3) a global context-aware inter-intra cross-parallel BiLSTM (La Furca III): in order to further perceive global contextual information, the intra- and inter-BiLSTM are placed side by side for mutual reference; 4) a multiple spiral iterative refinement dual-path BiLSTM (La Furca IV), inspired by [6, 7], in which the signal estimates from an initial mask-based separation network serve as input, along with the original mixture, to a second identical separation network.
The remainder of this paper is organized as follows: Section 2 introduces end-to-end monaural speech separation based on deep neural networks with dual-path BiLSTM blocks. Section 3 describes our proposed 'La Furca' and the separation algorithm in detail. The experimental setup and results are presented in Section 4. We conclude this paper in Section 5.
2 Speech separation with dual-path BiLSTM blocks
In this section, we review the formal definition of the monaural speech separation task and the original dual-path BiLSTM based separation architecture.
The goal of monaural speech separation is to estimate the individual target signals from a linearly mixed single-microphone signal, in which the target signals overlap in the TF domain. Let x_i(t), i = 1, …, C, denote the C target speech signals and y(t) denote the mixed speech. If we assume the target signals are linearly mixed, the mixture can be represented as

y(t) = Σ_{i=1}^{C} x_i(t),

and monaural speech separation aims at estimating the individual target signals x_i(t) from the given mixed speech y(t). In this work it is assumed that the number C of target signals is known.
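The linear mixing model can be sketched in a few lines of NumPy (a toy example where random noise stands in for the speakers' waveforms; the length and sampling rate are illustrative assumptions):

```python
import numpy as np

# Toy stand-ins for two speakers' waveforms (1 s at an assumed 8 kHz rate).
rng = np.random.default_rng(0)
x1 = rng.standard_normal(8000)
x2 = rng.standard_normal(8000)

# Single-microphone linear mixture: y(t) = x1(t) + x2(t).
y = x1 + x2
```

A separation model receives only y and must recover estimates of x1 and x2.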
In order to deal with this ill-posed problem, Luo et al. [16, 14] introduced adaptive front-end methods that achieve high speech separation performance on the WSJ0-2mix dataset [4, 6]. Such methods contain three processing stages; here the state-of-the-art DPRNN-TasNet architecture is used as an illustration. As shown in Figure 1, the architecture consists of an encoder (a Conv1d followed by a PReLU), a separator (consisting, in order, of a LayerNorm, a 1×1 conv, 6 dual-path BiLSTM layers, another 1×1 conv, and a softmax operation) and a decoder (an FC layer). First, the encoder module converts short segments of the mixed waveform into their corresponding representations. Then, the representation is used to estimate a multiplicative function (mask) for each source and each encoder output at each time step. The source waveforms are then reconstructed by transforming the masked encoder features with a linear decoder module. This framework is called DPRNN-TasNet.
DPRNN-TasNet first splits the output of the encoder into chunks with or without overlaps and concatenates them to form a 3-D tensor, as shown in Figure 2. The dual-path BiLSTM modules map these 3-D tensors to 3-D tensor masks, as shown in Figure 3. The output 3-D tensor masks and the original 3-D tensor are converted back to a sequential output by a 'Merge' operation, as shown in Figure 4.
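The split-and-merge bookkeeping can be sketched with plain NumPy (a dependency-free sketch; the 50% overlap and the exact padding scheme are illustrative assumptions, not necessarily the paper's settings):

```python
import numpy as np

def segment(x, K):
    """Split a (N, T) encoder output into 50%-overlapped chunks of
    length K, stacked into a 3-D tensor of shape (N, K, S)."""
    P = K // 2
    N, T = x.shape
    pad = (P - T % P) % P                   # pad T up to a multiple of the hop
    x = np.pad(x, ((0, 0), (P, P + pad)))   # extra hop of zeros at both ends
    S = (x.shape[1] - K) // P + 1
    return np.stack([x[:, s * P:s * P + K] for s in range(S)], axis=2)

def merge(chunks, T):
    """Inverse of segment: overlap-add the chunks and trim the padding."""
    N, K, S = chunks.shape
    P = K // 2
    out = np.zeros((N, P * (S - 1) + K))
    count = np.zeros(P * (S - 1) + K)
    for s in range(S):
        out[:, s * P:s * P + K] += chunks[:, :, s]
        count[s * P:s * P + K] += 1
    return (out / count)[:, P:P + T]
```

Because merge divides by the per-sample coverage count, segment followed by merge reconstructs the input exactly.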
Some architectures similar to the dual-path BiLSTM have been proposed as alternatives to the recurrent neural network (RNN) in various tasks [34, 12]. The dual-path BiLSTM can organize any type of RNN layer and model long sequence inputs in a very simple way. The intuition is to divide the input sequence into shorter blocks and interleave two BiLSTMs, an intra-BiLSTM and an inter-BiLSTM, for local and global modeling, respectively. In a dual-path BiLSTM, the intra-BiLSTM first processes each local block independently, and then the inter-BiLSTM summarizes the information from all the blocks to perform global-level processing.
As shown in Figure 3, the input of the intra-BiLSTM is a segment composed of several consecutive frames in time, and an utterance is divided into several such segments. These segments are passed through a BiLSTM, a fully connected projection, and a group normalization (GroupNorm) operation, respectively. A residual connection is added to the output of the GroupNorm to produce the final output of the intra-BiLSTM, which has the same shape as the input. The output of the intra-BiLSTM is used as the input of the inter-BiLSTM, but a permutation is first performed on this input so that the inter-BiLSTM can capture global dependencies. That is to say, adjacent frames in the input of the inter-BiLSTM are far apart in the original signal and spread across the whole time dimension of the input mixed utterance.
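The permutation logic of one intra/inter pass can be sketched with array transposes (a NumPy sketch; `intra_fn` and `inter_fn` are placeholders for the actual BiLSTM + projection + GroupNorm sub-blocks):

```python
import numpy as np

def intra_inter_pass(tensor, intra_fn, inter_fn):
    """One dual-path pass over a 3-D tensor of shape (N, K, S):
    N features, K frames per chunk, S chunks. The *_fn arguments map a
    (batch, length, N) array to an array of the same shape."""
    x = tensor.transpose(2, 1, 0)   # (S, K, N): S sequences of K frames
    x = intra_fn(x) + x             # intra path + residual connection
    x = x.transpose(1, 0, 2)        # (K, S, N): sequences across chunks
    x = inter_fn(x) + x             # inter path + residual connection
    return x.transpose(2, 0, 1)     # back to (N, K, S)
```

With both placeholders set to a zero function, the pass reduces to the identity, which makes the round-trip of transposes easy to check.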
Although DPRNN-TasNet has achieved an amazing signal-to-distortion ratio improvement (SDRi) [3, 25] on some public datasets, this structure has a clear disadvantage: all consecutive frames in the input of the inter-BiLSTM are far apart in the original utterance, so there is little sequential information or relationship between adjacent frames in its input. If context information or a suitable mechanism can be added to the neighboring frames or to the structure of the inter-BiLSTM, it is believed the performance will improve. At the same time, in the training of DPRNN-TasNet, the performance variance across different training runs is large, so some ensemble methods are tried to strengthen DPRNN-TasNet. Also, the output of DPRNN-TasNet can be refined again by feeding it, combined with the original mixed utterance, back into a DPRNN-TasNet to obtain a better SDRi. These are the motivations for all the improvements in the next section.
3 Speech separation with ‘La Furca’
3.1 La Furca I: Dual-path parallel BiLSTM based speech separation
The performance of a single predictive model can always be improved by ensembling, that is, by combining a set of independently trained networks. The most commonly used method is model averaging, which at least helps to reduce the variance of the performance. As shown in Figure 5, three identical parallel branches are added to the intra-BiLSTM and inter-BiLSTM blocks, respectively. The total output of each intra- and inter-parallel BiLSTM component is obtained by averaging the outputs of all the different branches. The reason for this ensemble is to reduce the sub-variance of each block.
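The branch-averaging idea reduces to a one-liner (a sketch; the branch modules here are arbitrary callables standing in for the replicated BiLSTM branches):

```python
import numpy as np

def parallel_branch_avg(x, branches):
    """Run several identically-shaped branches on the same input and
    average their outputs, as the intra-/inter-parallel components do."""
    return sum(b(x) for b in branches) / len(branches)
```

If the K branches make independent errors of equal variance, the averaged output's error variance drops by a factor of K.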
3.2 La Furca II: Dual-path BiLSTM with attention for speech separation
The standard LSTM cannot detect which parts of the input are important for local and global mask prediction. To address this issue, we employ an attention mechanism that can capture the key parts of the utterance for separation, from local to global. Figure 6 shows the architecture of an attention-based BiLSTM for sequence modeling. The outputs of the BiLSTM are first passed through a local filter to calculate attention weights, and these weights are then multiplied with the original BiLSTM outputs to produce the final outputs of the attention BiLSTM. The attention BiLSTM replaces the plain BiLSTM in the dual-path BiLSTM to form La Furca II, which can make use of local and global context information to help separation, compared with DPRNN-TasNet.
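A minimal sketch of such a gating attention, assuming a learned linear scoring filter followed by a sigmoid (the exact filter used in La Furca II is not specified here, so `w` and `b` are hypothetical parameters):

```python
import numpy as np

def attention_reweight(h, w, b):
    """Reweight BiLSTM outputs h of shape (T, d): score each time step
    with a linear filter, squash to (0, 1), and scale h by the weights."""
    scores = h @ w + b                      # (T,) per-step scores
    alpha = 1.0 / (1.0 + np.exp(-scores))   # sigmoid attention weights
    return alpha[:, None] * h               # reweighted outputs, (T, d)
```

Since the weights lie in (0, 1), the gate can only attenuate unimportant time steps, never amplify them.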
3.3 La Furca III: Cross dual-path parallel BiLSTM based speech separation
Since the consecutive input frames of the intra-BiLSTM are continuous on the original time axis, the intra-BiLSTM models speech signals more reasonably than the inter-BiLSTM in DPRNN-TasNet. Therefore, as shown in Figure 7, we put the intra-BiLSTM and inter-BiLSTM in parallel instead of the original serial order. Their inputs differ only in the arrangement of the data. The outputs of the intra-BiLSTM and inter-BiLSTM are averaged so that they can make use of each other's global and local information. In particular, the inter-BiLSTM can use the context information from the output of the intra-BiLSTM to compensate for its weakness in this regard.
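The cross-parallel arrangement can be sketched by running both permutations on the same input and averaging (a NumPy sketch with placeholder path functions, as before):

```python
import numpy as np

def cross_parallel_pass(tensor, intra_fn, inter_fn):
    """La Furca III sketch over a 3-D tensor (N, K, S): the intra and
    inter paths see the same input under their own permutations, and
    their outputs are averaged instead of being chained serially."""
    intra_out = intra_fn(tensor.transpose(2, 1, 0)).transpose(2, 1, 0)
    inter_out = inter_fn(tensor.transpose(1, 2, 0)).transpose(2, 0, 1)
    return (intra_out + inter_out) / 2
```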
3.4 La Furca IV: Iterative multi-stage refined dual-path BiLSTM for speech separation
The separated outputs and the mixed input of a speech separation network must satisfy a consistency condition: the sum of the separated outputs should be consistent with the mixed input. This condition can therefore be used to refine the separated outputs of the network. Inspired by [6, 7], as shown in Figure 8, we propose a multi-stage iterative network for monaural speech separation. Each stage contains a complete separation pipeline as described earlier, such as any DPRNN-TasNet. The output of each stage is two separated utterances, and these two utterances are sent to the next-stage sub-network along with the original mixed utterance, which then runs the exact same pipeline, such as DPRNN-TasNet, except that its input channel dimension is tripled.
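The stage chaining can be sketched independently of the separation network itself (each stage is an arbitrary callable here, standing in for a full DPRNN-TasNet-style pipeline):

```python
def iterative_refine(mixture, stages):
    """La Furca IV sketch: stage 1 sees only the mixture; every later
    stage sees the mixture plus the previous stage's two separated
    streams. All per-stage outputs are returned for the training loss."""
    estimates = stages[0](mixture, None)
    outputs = [estimates]
    for stage in stages[1:]:
        estimates = stage(mixture, estimates)
        outputs.append(estimates)
    return outputs
```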
In our implementation, we tried different numbers of stages, including 2 and 3. In other words, as shown in Figure 8, 2 or 3 DPRNN-TasNets are connected in series to form an iterative refinement network. The insight we gained is that 3 or more stages did not improve the performance; using only two stages is enough. When using three stages, the separation performance in SDR of the first and second stages is basically the same, as can be seen from Figure 9. We elaborate on this in the experimental part.
3.5 Perceptual metric: Utterance-level SDR objective
Since the loss functions of many STFT-based methods are not directly applicable to waveform-based end-to-end speech separation, a perceptual-metric-based loss function is used in this work. The perception of speech is greatly affected by distortion [31, 1]. To evaluate the performance of speech separation, the BSS_Eval metrics signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) [3, 25], and the short-time objective intelligibility (STOI) measure have often been employed. In this work, we directly use the SDR, the most commonly used metric for evaluating source separation performance, as the training objective. The SDR measures the amount of distortion introduced in the output signal and is defined as the ratio between the energy of the clean signal and the energy of the distortion.
The SDR captures the overall separation quality of the algorithm. There is a subtle point here: we first concatenate the outputs of La Furca into a complete utterance and then compare it with the full target utterance to calculate the SDR at the utterance level, instead of calculating the SDR one frame at a time. These two approaches differ greatly in both procedure and performance. If we denote the output of the network by ŝ, which should ideally be equal to the target source s, then the SDR can be given as [3, 25]

SDR(s, ŝ) = 10 log₁₀ ( ‖s‖² / ‖s − ŝ‖² ).

Our target is then to maximize the SDR, i.e., to minimize the negative SDR as the loss function, with respect to ŝ.
To solve the source tracing and permutation problem, the PIT training criterion [9, 32] is employed in this work. We calculate the SDRs for all permutations, pick the maximum, and take its negative as the loss; we call this the uSDR loss. For the loss of La Furca IV, the uSDR losses of the separated speech outputs of all stages with respect to the ground truth are calculated and then averaged as the final loss.
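The utterance-level loss can be sketched directly from the definitions above (a NumPy sketch; the small `eps` is an assumed numerical guard, not from the paper):

```python
import itertools
import numpy as np

def usdr(s, s_hat, eps=1e-8):
    """Utterance-level SDR in dB: clean-signal energy over the energy
    of the distortion s - s_hat."""
    return 10 * np.log10(np.sum(s ** 2) / (np.sum((s - s_hat) ** 2) + eps))

def pit_usdr_loss(targets, estimates):
    """uSDR loss with PIT: negative of the best mean SDR over all
    permutations of the source-to-output assignment."""
    C = len(targets)
    best = max(
        np.mean([usdr(targets[i], estimates[p[i]]) for i in range(C)])
        for p in itertools.permutations(range(C))
    )
    return -best
```

Because the maximum runs over all assignments, swapping the order of perfect estimates leaves the loss unchanged.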
During training, Adam serves as the optimizer to minimize the uSDR loss, with an initial learning rate of 0.001 scaled down by a factor of 0.98 every two epochs. When the loss on the development set increases, training is restarted from the current checkpoint with a halved initial learning rate; in other words, the initial learning rates of the successive restarts are 0.001, 0.0005, 0.00025, and so on. Due to GPU memory limitations, the batch size is set to 1, 2, or 3 according to the size of the GPU memory.
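The restart schedule described above can be written as a small helper (a sketch of our reading of the schedule; the function name and signature are our own):

```python
def learning_rate(epoch, restarts=0, base=1e-3, gamma=0.98):
    """Learning rate within a training run: the initial rate is halved
    once per restart, then decayed by 0.98 every two epochs."""
    return base / (2 ** restarts) * gamma ** (epoch // 2)
```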
4 Experiments

4.1 Dataset and neural network
We evaluate the proposed models on the WSJ0-2mix dataset [4, 6], which contains 30 hours of training and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from 49 male and 51 female speakers in the Wall Street Journal (WSJ0) training set si_tr_s, and mixing them at signal-to-noise ratios (SNR) chosen uniformly between 0 dB and 5 dB. A 5-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset.
We evaluate the systems with the SDRi [3, 25] metric used in [6, 13, 30, 2, 9]. The original SDR, that is, the average SDR of the mixed speech with respect to the original target speech, is 0.15 dB. Table 1 lists the average SDRi obtained by the different structures in 'La Furca' and almost all the results of the past two years, where IRM means the ideal ratio mask M_i = |X_i| / Σ_j |X_j|, applied to the STFT Y of the mixture y to obtain the separated speech; it is evaluated to show the upper bound of STFT-based methods, where X_i is the STFT of the target x_i.
4.2 Results and Discussions
In this experiment, as baselines, we re-implemented several classical approaches: DPCL, TasNet, Conv-TasNet, and DPRNN-TasNet. Table 1 lists the SDRis obtained by our methods and almost all the results of the past three years, where IRM means the ideal ratio mask. Compared with these baselines, an average increase of nearly 1 dB in SDRi is obtained. La Furca IV achieves the most significant performance improvement over the baseline systems, and it breaks through the upper bound of STFT-based methods by a large margin (more than 7 dB).
| Method | SDRi (dB) |
| La Furca I | 19.3 |
| La Furca II | 19.4 |
| La Furca III | 19.4 |
| La Furca IV | 19.86 |
Figure 9 shows the losses of the different stages, from models at different epochs, on the training and validation data during the training of La Furca IV. Here a 3-stage iterative refinement network is trained. It can be seen that the SDR obtained from the separated voices of the first-stage network and that of the second-stage network almost completely coincide, both on the training data and on the validation data. That is to say, in practice, 2 stages are enough for La Furca IV.
Figure 10 shows a comparison of the separation results (in SDR) between DPRNN-TasNet and La Furca IV on the WSJ0-2mix test set. It can be seen that most of our separated SDRs are concentrated above 16 dB, and that La Furca IV is higher than DPRNN-TasNet on most utterances and by about 0.9 dB overall on average.
5 Conclusion

In this paper, we investigated the effectiveness of dual-path BiLSTM based networks for multi-talker monaural speech separation, and proposed a series of structures under the name La Furca to perform speech separation. Benefiting from the strengths of end-to-end processing and the novel structural improvements, the best structure in La Furca achieves 19.86 dB SDRi on the public WSJ0-2mix corpus, a 16% relative improvement, establishing a new state of the art on this corpus. As for further work: although the SDR is widely used and useful, it has some weaknesses. In the future we may use the SNR to evaluate our models; it would be interesting to see how consistent the SDR and SNR are.
Acknowledgments

We would like to thank Yi Luo at Columbia University and Kaituo Xu at Beijing Kuaishou Technology for sharing their implementations of Conv-TasNet and the DPRNN block, and for valuable discussions on the training of DPRNN-TasNet.
References

-  (2004) The perception of speech under adverse conditions. In Speech processing in the auditory system, pp. 231–308. Cited by: §3.5.
-  (2017) Deep attractor network for single-microphone speaker separation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 246–250. Cited by: §4.1, Table 1.
-  (2005) BSS_EVAL toolbox user guide–revision 2.0. Cited by: §2, §3.5, §3.5, §4.1.
-  (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 31–35. Cited by: §1, §2, §2, §4.1, §4.2, Table 1.
-  (2013) An unsupervised approach to cochannel speech separation. IEEE Transactions on audio, speech, and language processing 21 (1), pp. 122–131. Cited by: §1.
-  (2016) Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173. Cited by: §1, §1, §2, §2, §3.4, §4.1, §4.1, Table 1.
-  (2019) Universal sound separation. arXiv preprint arXiv:1905.03330. Cited by: §1, §3.4.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.6.
-  (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25 (10), pp. 1901–1913. Cited by: §1, §3.5, §4.1.
-  (2015) Sparse nmf–half-baked or well done?. Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023. Cited by: §1.
-  (2018) CBLDNN-based speaker-independent speech separation via generative adversarial training. Cited by: Table 1.
-  (2017) Global context-aware attention lstm networks for 3d action recognition. In , pp. 1647–1656. Cited by: §2.
-  (2018) Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (4), pp. 787–796. Cited by: §4.1, Table 1.
-  (2019) Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation. arXiv preprint arXiv:1910.06379. Cited by: La Furca: Iterative Context-Aware End-to-End Monaural Speech Separation Based on Dual-Path Deep Parallel Inter-Intra Bi-LSTM with Attention, §1, §1, Figure 1, §2, §2, §2, §4.2, Table 1.
-  (2017) TasNet: time-domain audio separation network for real-time, single-channel speech separation. arXiv preprint arXiv:1711.00541. Cited by: §1, §1, §4.2, Table 1.
-  (2018) TasNet: surpassing ideal time-frequency masking for speech separation. arXiv preprint arXiv:1809.07454. Cited by: §1, §1, §2, §4.2, Table 1.
-  (2018) SDR-half-baked or well done?. arXiv preprint arXiv:1811.02508. Cited by: §5.
-  (2006) Model-based sequential organization in cochannel speech. IEEE Transactions on Audio, Speech, and Language Processing 14 (1), pp. 289–298. Cited by: §1.
-  (2019) Is cqt more suitable for monaural speech separation than stft? an empirical study. arXiv preprint arXiv:1902.00631. Cited by: §1.
-  FurcaX: end-to-end monaural speech separation based on deep gated (de)convolutional neural networks with adversarial example training. In Proc. ICASSP, Cited by: Table 1.
-  (2019) FurcaNet: an end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation. arXiv preprint arXiv:1902.00651. Cited by: Table 1.
-  (2007) Convolutive speech bases and their application to supervised speech separation. IEEE Transactions on audio speech and language processing 15 (1), pp. 1. Cited by: §1.
-  (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 4214–4217. Cited by: §3.5.
-  (2017) Adaptive front-ends for end-to-end source separation. In Proc. NIPS, Cited by: §1, §1.
-  (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469. Cited by: §2, §3.5, §3.5, §4.1.
-  Speech recognition using factorial hidden markov models for separation in the feature space. In Ninth International Conference on Spoken Language Processing, Cited by: §1.
-  (2006) Computational auditory scene analysis: principles, algorithms, and applications. Wiley-IEEE press. Cited by: §1.
-  (2018) Alternative objective functions for deep clustering. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: Table 1.
-  (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §2.
-  (2018) SINGLE channel speech separation with constrained utterance level permutation invariant training using grid lstm. Cited by: §4.1, Table 1.
-  (1998) Performance of the modified bark spectral distortion as an objective speech quality measure. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, Vol. 1, pp. 541–544. Cited by: §3.5.
-  (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 241–245. Cited by: §1, §3.5, Table 1.
-  (2020) FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks. In International Conference on Multimedia Modeling, pp. 653–665. Cited by: §1, §1, Table 1.
-  (2018) Spatial–temporal recurrent neural network for emotion recognition. IEEE transactions on cybernetics 49 (3), pp. 839–847. Cited by: §2.