1 Introduction
Multi-talker monaural speech separation has a wide range of applications. For example, in a home or conference environment in which many people talk, the human auditory system can easily track and follow a target speaker’s voice within the multi-talker mixture. If automatic speech recognition or speaker recognition is to be performed in such a setting, a clean speech signal of the target speaker must first be separated from the mixed speech. Speech separation is therefore a problem that must be solved in order to achieve satisfactory performance in speech or speaker recognition tasks. The problem poses two difficulties. First, since we have no prior information about the user, a truly practical system must be speaker-independent. Second, beamforming algorithms cannot be applied to a single-microphone signal. Many traditional methods, such as computational auditory scene analysis (CASA) [1, 2, 3], non-negative matrix factorization (NMF) [4, 5], and probabilistic models [6], do not handle these two difficulties well.
More recently, a large number of techniques based on deep learning have been proposed for this task. These methods can be roughly grouped into three categories. The first category is based on deep clustering (DPCL) [7, 8], which maps the time-frequency (TF) points of the spectrogram to embedding vectors; these embedding vectors are then clustered into classes corresponding to different speakers, and finally the clusters are used as masks to recover the separated clean voices from the spectrogram. The second is permutation invariant training (PIT) [9, 10], which solves the label permutation problem by minimizing the lowest error among all possible assignments of the N outputs to the N mixed sources. The third category is end-to-end speech separation in the time domain [11, 12, 13], which is a natural way to overcome both the upper bound on source-to-distortion ratio improvement (SDRi) of short-time Fourier transform (STFT) mask estimation based methods and the real-time processing requirements of practical use.
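The permutation search behind PIT can be illustrated with a minimal sketch. It assumes a mean squared error between waveforms as the per-assignment error, which is a choice made here for illustration and not necessarily the error used in [9, 10]; the function name `pit_mse` and the toy data are hypothetical.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Return the lowest MSE over all output-to-speaker assignments.

    estimates, targets: arrays of shape (num_sources, num_samples).
    """
    num_sources = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_sources)):
        # Pair estimate i with target perm[i] and average the squared error.
        loss = np.mean((estimates - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy example: the estimates come out in swapped speaker order.
rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 16))
estimates = targets[::-1] + 0.01 * rng.standard_normal((2, 16))
loss, perm = pit_mse(estimates, targets)
```

The search visits N! assignments, which is cheap for the two-speaker case considered in this paper but grows quickly with the number of sources.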
This paper is based on the end-to-end approach [11, 12, 13], which has achieved better results than DPCL based or PIT based approaches. Most DPCL and PIT based methods use the STFT as a front-end. Specifically, the mixed speech signal is first transformed from a one-dimensional signal in the time domain to a two-dimensional spectrum in the TF domain; the mixed spectrum is then separated into spectra corresponding to the different source speeches, for example by a deep clustering method; and finally each cleaned source speech signal is restored by applying an inverse STFT to its spectrum. This framework has several limitations. Firstly, it is unclear whether the STFT is the optimal transformation of the signal for speech separation, even assuming that the parameters it depends on (such as the size and overlap of audio frames and the window type) are optimal. Secondly, most STFT based methods assume that the phase of the separated signal equals the mixture phase, which is generally incorrect and imposes an obvious upper bound on separation performance even when ideal masks are used. To overcome these problems, several speech separation models that operate directly on time-domain speech signals were recently proposed [11, 12]. Inspired by these first results, we propose FurcaNet (Furca is Latin for “fork”; we use this word to mean that the speech is split into two streams by our network, like water), a fully end-to-end time-domain separation system based on a deep gated convolutional neural network (GCNN) [14], bidirectional long short-term memory (BiLSTM), and a deep neural network (DNN), a combination that has shown promising performance on both clean and noisy Voice Search tasks [15].
The remainder of this paper is organized as follows. Section 2 introduces monaural speech separation and describes the proposed FurcaNet and the separation algorithm in detail. The experimental setup and results are presented in Section 3. We conclude the paper in Section 4.
2 The FurcaNet Model
The proposed end-to-end deep learning approach consists of two main components: the FurcaNet pipeline, which consists of a GCNN, BiLSTM and DNN; and the perceptual loss function.
In this section, we first review the formal definition of the monaural speech separation task. We then introduce the GCNN architecture and the details of the FurcaNet structure we investigated. Finally, we introduce the perceptual metric used as the loss function.
2.1 Monaural speech separation
The goal of monaural speech separation is to estimate the individual target signals from a linearly mixed single-microphone signal, in which the target signals overlap in the TF domain. Let $x_i(t)$, $i = 1, \dots, N$, denote the $N$ target speech signals and $y(t)$ denote the mixed speech. If we assume the target signals are linearly mixed, the mixture can be represented as
$$y(t) = \sum_{i=1}^{N} x_i(t),$$
and monaural speech separation aims at estimating the individual target signals $x_i(t)$ from the given mixed speech $y(t)$. In this work it is assumed that the number of target signals is known.
In this work, we propose an end-to-end deep learning approach to separate the mixed utterance. The input of FurcaNet is a mixed utterance $y(t)$, and the outputs of the network are the separated utterances, which ideally are exactly the target signals $x_i(t)$. To do this, the mixed speech is first split into frames. Each frame of the mixed utterance is then forward propagated through FurcaNet directly as a raw waveform, and the output activations are the separated frames, each of which corresponds to only one speaker. Finally, the separated frames are concatenated to form the output utterances.
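The frame, separate, and concatenate procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the frames are taken as non-overlapping, the last frame is zero-padded, and `dummy_network` is a hypothetical placeholder standing in for the trained network. All function names are invented for this sketch.

```python
import numpy as np

def split_into_frames(signal, frame_len):
    """Zero-pad the signal and cut it into non-overlapping frames."""
    num_frames = -(-len(signal) // frame_len)  # ceiling division
    padded = np.zeros(num_frames * frame_len, dtype=signal.dtype)
    padded[:len(signal)] = signal
    return padded.reshape(num_frames, frame_len)

def separate_utterance(mixture, separate_frame, frame_len, num_speakers):
    """Separate frame by frame, then concatenate the frames per speaker."""
    frames = split_into_frames(mixture, frame_len)
    # separate_frame maps (frame_len,) -> (num_speakers, frame_len);
    # stacking on axis 1 gives (num_speakers, num_frames, frame_len).
    outputs = np.stack([separate_frame(f) for f in frames], axis=1)
    # Concatenate the frames of each speaker and drop the padding.
    return outputs.reshape(num_speakers, -1)[:, :len(mixture)]

def dummy_network(frame):
    """Placeholder: gives half of the frame to each of two outputs."""
    return np.stack([0.5 * frame, 0.5 * frame])

mixture = np.arange(200, dtype=np.float64)
separated = separate_utterance(mixture, dummy_network,
                               frame_len=80, num_speakers=2)
```

With a real network, `separate_frame` would be the forward pass, and the same output index would have to correspond to the same speaker across frames for the concatenation to be meaningful.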
2.2 Network architecture
The proposed FurcaNet model is similar to the model in [15], with some fine adjustments. The FurcaNet separation system comprises a GCNN, BiLSTM and DNN, and its structure is illustrated in Fig. 1. A deep GCNN as proposed in [14] is adopted here to build the front-end. The GCNN is implemented by stacking multiple 1D gated convolutional (GConv) layers on top of each other.
Fig. 2 shows the structure of a 1D GConv layer. The main difference between a GConv layer and a plain convolutional layer is that a gated linear unit (GLU) [14], namely the gate of Eq. (1), is used as the nonlinear control function instead of a tanh activation or regular rectified linear units (ReLUs) [14]:
$$\mathbf{h}_l = (\mathbf{h}_{l-1} * \mathbf{W}_l + \mathbf{b}_l) \otimes \sigma(\mathbf{h}_{l-1} * \mathbf{V}_l + \mathbf{c}_l), \tag{1}$$
where $\mathbf{h}_{l-1}$ and $\mathbf{h}_l$ are the input and output of the $l$-th layer, $\mathbf{W}_l$, $\mathbf{b}_l$, $\mathbf{V}_l$, and $\mathbf{c}_l$ are learned parameters, $*$ denotes convolution, $\sigma$ is the sigmoid function, and $\otimes$ is the element-wise product between vectors or matrices. Similar to LSTMs, GLUs control the information passed on in the hierarchy. This gating mechanism allows us to capture long-range context dependencies effectively by deepening the layers without encountering the vanishing gradient problem. To stabilize training and reduce training time, a layer normalization (LNorm) operator [16] is added after each GConv layer.
To capture long-term contextual dependencies, BiLSTM is applied in place of the bottleneck architectures or dilated convolutional networks [17]. BiLSTM is a natural choice for modeling long time series, since its recurrent connections allow the network to make predictions using the entire input sequence. After the gated convolution, we pass the GCNN output to the BiLSTM layers. Finally, we pass the output of the BiLSTM to one fully connected DNN layer, which maps the signal further into a more separable space. FurcaNet incorporates the GConv, BiLSTM, and DNN layers into a unified framework and thereby combines the advantages of the different layers. All the layers are trained jointly. During training we need to provide the correct reference to the corresponding output layer for supervision.
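A single gated convolutional layer followed by layer normalization can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: it follows the GLU form of [14] (a linear convolution multiplied element-wise by a sigmoid-gated convolution), computes the convolution as a zero-padded cross-correlation as deep learning toolkits usually do, and normalizes over the channel axis without learned gain or bias. The parameter names W, b, V, c and all shapes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(x, kernel, bias):
    """x: (in_ch, T), kernel: (out_ch, in_ch, K) -> (out_ch, T), zero-padded."""
    out_ch, _, k = kernel.shape
    pad_left = (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad_left, k - 1 - pad_left)))
    out = np.empty((out_ch, x.shape[1]))
    for n in range(x.shape[1]):
        window = xp[:, n:n + k]  # (in_ch, K) slice centred near step n
        out[:, n] = np.tensordot(kernel, window, axes=([1, 2], [0, 1])) + bias
    return out

def gated_conv1d(h_prev, W, b, V, c):
    """GLU: linear path multiplied element-wise by a sigmoid gate."""
    return conv1d_same(h_prev, W, b) * sigmoid(conv1d_same(h_prev, V, c))

def layer_norm(h, eps=1e-5):
    """Normalize every time step over the channel axis."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h0 = rng.standard_normal((1, 50))          # one input channel, 50 steps
W = rng.standard_normal((4, 1, 5)); b = np.zeros(4)
V = rng.standard_normal((4, 1, 5)); c = np.zeros(4)
h1 = layer_norm(gated_conv1d(h0, W, b, V, c))
```

When the gate parameters are all zero the sigmoid outputs 0.5 everywhere, so the layer reduces to a scaled linear convolution; the learned gate is what lets the layer decide which activations to pass on.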
2.3 Perceptual metric: Utterance-level SDR objective
Since the loss functions of many STFT-based methods are not directly applicable to waveform-based end-to-end speech separation, a loss function based on a perceptual metric is used in this work. The perception of speech is greatly affected by distortion [18, 19]. To evaluate the performance of speech separation, the BSS_Eval metrics signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR) [20, 21], as well as short-time objective intelligibility (STOI) [22], are often employed. In this work we directly use SDR, the most commonly used metric for evaluating source separation performance, as the training objective. SDR measures the amount of distortion introduced by the output signal and is defined as the ratio between the energy of the clean signal and the energy of the distortion.
SDR captures the overall separation quality of the algorithm. There is a subtle point here: we first concatenate the outputs of FurcaNet into a complete utterance and then compare it with the full target utterance to calculate the SDR at the utterance level (uSDR), instead of calculating the SDR one frame at a time. The two approaches differ considerably in both formulation and performance. If we denote the output of the network by $s$, which should ideally be equal to the target source $x$, then the SDR can be given as [20, 21]
$$\mathrm{SDR} = 10 \log_{10} \frac{\|x\|^2}{\|x - s\|^2}.$$
Our objective is then to maximize the SDR or, equivalently, to minimize the negative SDR as the loss function with respect to $s$.
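The difference between an utterance-level and a frame-level SDR objective can be made concrete with a small sketch. It assumes the plain energy-ratio definition of SDR stated in the text (clean-signal energy over distortion energy, in dB); the BSS_Eval toolkit [20, 21] additionally decomposes the distortion, which is omitted here. The function names and the toy signals are hypothetical.

```python
import numpy as np

def sdr_db(target, estimate, eps=1e-8):
    """Energy of the clean signal over energy of the distortion, in dB."""
    distortion = target - estimate
    return 10.0 * np.log10((np.sum(target ** 2) + eps)
                           / (np.sum(distortion ** 2) + eps))

def negative_usdr(target_frames, output_frames):
    """Concatenate all frames first, then compute a single SDR."""
    return -sdr_db(target_frames.reshape(-1), output_frames.reshape(-1))

def negative_frame_sdr(target_frames, output_frames):
    """Compute one SDR per frame and average them."""
    return -np.mean([sdr_db(t, o)
                     for t, o in zip(target_frames, output_frames)])

# Toy utterance: 10 frames of 80 samples, the second half nearly silent.
rng = np.random.default_rng(0)
target = rng.standard_normal((10, 80))
target[5:] *= 0.01
output = target + 0.05 * rng.standard_normal((10, 80))

usdr_loss = negative_usdr(target, output)
frame_loss = negative_frame_sdr(target, output)
```

In this toy case the frame-level average is dominated by the quiet frames, where a small absolute error yields a very low SDR, while the utterance-level value weights each sample by its energy. In a training setup the same expressions would be written in an automatic differentiation framework so that the loss can be minimized with respect to the network output.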
3 Experiments
3.1 Dataset and neural network
We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset [7, 8], which contains 30 hours of training and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from 49 male and 51 female speakers in the Wall Street Journal (WSJ0) training set si_tr_s and mixing them at signal-to-noise ratios (SNR) drawn uniformly between 0 dB and 5 dB. A 5-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset.
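Mixing two utterances at a randomly drawn SNR can be sketched as follows. This is not the official WSJ0-2mix generation script; it is a minimal illustration that assumes the SNR is defined as the ratio of the mean powers of the two sources and that the longer utterance is truncated. The name `mix_at_snr` and the random stand-in signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech_a, speech_b, snr_db):
    """Scale speech_b so that speech_a is snr_db louder, then add them."""
    n = min(len(speech_a), len(speech_b))
    a, b = speech_a[:n], speech_b[:n]
    power_a = np.mean(a ** 2)
    power_b = np.mean(b ** 2)
    # Choose the gain so that 10*log10(power_a / power(gain*b)) == snr_db.
    gain = np.sqrt(power_a / (power_b * 10.0 ** (snr_db / 10.0)))
    b_scaled = gain * b
    return a + b_scaled, a, b_scaled

rng = np.random.default_rng(0)
snr_db = rng.uniform(0.0, 5.0)  # drawn uniformly between 0 dB and 5 dB
mixture, src_a, src_b = mix_at_snr(rng.standard_normal(8000),
                                   rng.standard_normal(6000), snr_db)
achieved_snr = 10.0 * np.log10(np.mean(src_a ** 2) / np.mean(src_b ** 2))
```

The scaled sources returned alongside the mixture serve as the training references, since the mixture is exactly their sum.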
In this work, we slide a window over the raw waveform with a shift of 5 ms to produce a sequence of 10 ms frames. The FurcaNet instance used in this work is structured as follows. The front-end GCNN has five 1D GConv layers. Since the input to FurcaNet is a speech frame of 10 ms (80 sample points), the kernel size of the first convolution is 80, and the other four 1D GConv layers have kernels of size 1000. After each GConv layer, we add a layer normalization operation [16] to stabilize training. Two BiLSTM layers with 1000 hidden units in each direction follow the GCNN. The DNN has two hidden layers of 2000 nodes each.
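The frame sizes quoted above imply a sampling rate, which a short calculation makes explicit. The sketch assumes that a 10 ms frame contains exactly 80 samples and that the 5 ms figure is the hop between consecutive frames; the variable and function names are hypothetical.

```python
FRAME_MS = 10        # frame length stated in the text
SHIFT_MS = 5         # window shift stated in the text
FRAME_SAMPLES = 80   # samples per frame stated in the text

# 80 samples in 10 ms implies an 8 kHz sampling rate.
sample_rate = FRAME_SAMPLES * 1000 // FRAME_MS
# A 5 ms shift at that rate corresponds to a hop of 40 samples.
hop_samples = sample_rate * SHIFT_MS // 1000

def num_frames(num_samples):
    """Number of full frames obtained from a signal of num_samples."""
    if num_samples < FRAME_SAMPLES:
        return 0
    return 1 + (num_samples - FRAME_SAMPLES) // hop_samples

frames_per_second = num_frames(sample_rate)
```

Under these assumptions, one second of audio yields 199 overlapping frames.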
3.2 Training trick
During training, Adam [23] serves as the optimizer to minimize the uSDR loss, with an initial learning rate of 0.001 that is scaled down by 0.5 whenever the loss on the development set increases. Each minibatch contains 8 randomly selected utterances. The uSDR loss function is somewhat hard to optimize, and Adam [23] often gets stuck in a local minimum. In Fig. 3, the horizontal axis is the number of training epochs and the vertical axis is the negative SDR. In failed training runs, the uSDR gets stuck at 3 dB or 4 dB and does not improve any further. We found an ad hoc trick to deal with this problem: since the weights of FurcaNet are randomly initialized, we simply restart the training program until the initial SDR on the development set (before any training epoch) is greater than a threshold (for example 30 dB; Fig. 3 only shows the SDRs after each training epoch), and only then let the program start training.
3.3 Results
We evaluate the systems with the SDR improvement (SDRi) [20, 21] metric used in [8, 24, 25, 26, 9]. The original SDR, that is, the average SDR of the mixed speech $y(t)$ with respect to the original target speech $x_1(t)$ and $x_2(t)$, is 0.15 dB. Table 1 lists the average SDRi obtained by FurcaNet and by most of the systems reported in the past two years, where IRM denotes the ideal ratio mask
$$M_i(t, f) = \frac{|X_i(t, f)|}{\sum_{j=1}^{N} |X_j(t, f)|} \tag{2}$$
applied to the STFT $Y(t, f)$ of $y(t)$ to obtain the separated speech, where $X_i(t, f)$ is the STFT of $x_i(t)$; it is evaluated to show the upper bound of STFT based methods. In this experiment, Chimera++ [28, wang2018end] gives the best SDRi among all baselines shown in Table 1. FurcaNet achieves an improvement of 1.3 dB SDRi over this best baseline, the largest gain among the compared systems, and it breaks through the upper bound of STFT based methods.
| Method | SDRi (dB) |
| --- | --- |
| DPCL [7] | 5.9 |
| DPCL* | 10.7 |
| DPCL++ [8] | 10.8 |
| DANet [26] | 10.5 |
| ADANet [24] | 10.5 |
| uPIT-BLSTM [10] | 10.0 |
| cuPIT-Grid-RD [25] | 10.2 |
| CBLDNN-GAT [27] | 11.0 |
| TasNet [11] | 11.2 |
| TasNet* | 11.8 |
| Chimera++ [28, wang2018end] | 12.0 |
| FurcaNet | 13.3 |
| IRM | 12.7 |
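The ideal ratio mask and the SDRi metric used in Table 1 can be sketched as follows. The sketch assumes the common magnitude-ratio definition of the IRM and the plain energy-ratio SDR; random complex arrays stand in for real STFTs, and all function names are hypothetical.

```python
import numpy as np

def ideal_ratio_masks(source_stfts, eps=1e-8):
    """source_stfts: (num_sources, freq, time) complex -> real masks."""
    mags = np.abs(source_stfts)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)

def sdr_db(target, estimate):
    return 10.0 * np.log10(np.sum(target ** 2)
                           / np.sum((target - estimate) ** 2))

def sdr_improvement(target, estimate, mixture):
    """SDRi: SDR of the estimate minus SDR of the unprocessed mixture."""
    return sdr_db(target, estimate) - sdr_db(target, mixture)

rng = np.random.default_rng(0)
shape = (2, 129, 20)
source_stfts = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture_stft = source_stfts.sum(axis=0)   # the STFT is linear
masks = ideal_ratio_masks(source_stfts)
# Masked spectra keep the phase of the mixture, not of the true source.
masked_stfts = masks * mixture_stft

# SDRi on toy waveforms: the estimate removes 90% of the interferer.
target = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mixture = target + interferer
estimate = target + 0.1 * interferer
sdri = sdr_improvement(target, estimate, mixture)
```

Because the masks are real-valued, each masked spectrum inherits the mixture phase, which is why even ideal masks give only a bounded SDRi. In the toy waveform example, reducing the interferer amplitude by a factor of 10 corresponds to an SDRi of 20 dB.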
4 Conclusion
In this study, we proposed an end-to-end architecture called FurcaNet for monaural speech separation. FurcaNet combines the advantages of different neural networks such as GCNN, BiLSTM, and DNN, and at the same time it can directly optimize its parameters using perceptual indicators such as SDR. Our results on the two-speaker mixed speech separation task indicate that FurcaNet achieves state-of-the-art performance. Future research will include extending the experiments to the three-speaker mixture task to see whether the performance is independent of the number of sound sources.
5 Acknowledgment
We would like to thank Jian Wu at Northwestern Polytechnical University, Yi Luo at Columbia University, and Zhong-Qiu Wang at Ohio State University for valuable discussions on the WSJ0-2mix database, DPCL, and end-to-end speech separation.
References
 [1] DeLiang Wang and Guy J Brown, Computational auditory scene analysis: Principles, algorithms, and applications, Wiley-IEEE Press, 2006.
 [2] Yang Shao and DeLiang Wang, “Model-based sequential organization in cochannel speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 289–298, 2006.
 [3] Ke Hu and DeLiang Wang, “An unsupervised approach to cochannel speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 122–131, 2013.
 [4] Paris Smaragdis et al., “Convolutive speech bases and their application to supervised speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 1, 2007.
 [5] Jonathan Le Roux, Felix J Weninger, and John R Hershey, “Sparse NMF–half-baked or well done?,” Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023, 2015.
 [6] Tuomas Virtanen, “Speech recognition using factorial hidden Markov models for separation in the feature space,” in Ninth International Conference on Spoken Language Processing, 2006.
 [7] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 31–35.
 [8] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
 [9] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
 [10] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 241–245.
 [11] Yi Luo and Nima Mesgarani, “TasNet: Time-domain audio separation network for real-time, single-channel speech separation,” arXiv preprint arXiv:1711.00541, 2017.
 [12] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, “Adaptive front-ends for end-to-end source separation,” in Proc. NIPS, 2017.
 [13] Yi Luo and Nima Mesgarani, “TasNet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
 [14] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
 [15] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
 [17] Li Li and Hirokazu Kameoka, “Deep clustering with gated convolutional networks,” in Proc. ICASSP, 2018, pp. 16–20.
 [18] Wonho Yang, Majid Benbouchta, and Robert Yantorno, “Performance of the modified Bark spectral distortion as an objective speech quality measure,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. IEEE, 1998, vol. 1, pp. 541–544.
 [19] Peter Assmann and Quentin Summerfield, “The perception of speech under adverse conditions,” in Speech processing in the auditory system, pp. 231–308. Springer, 2004.
 [20] Cédric Févotte, Rémi Gribonval, and Emmanuel Vincent, “BSS_EVAL toolbox user guide–revision 2.0,” 2005.
 [21] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
 [22] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4214–4217.
 [23] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [24] Yi Luo, Zhuo Chen, and Nima Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
 [25] Chenglin Xu, Wei Rao, Xiong Xiao, Eng Siong Chng, and Haizhou Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM,” 2018.
 [26] Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 246–250.
 [27] Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, and Bo Xu, “CBLDNN-based speaker-independent speech separation via generative adversarial training,” 2018.
 [28] Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey, “Alternative objective functions for deep clustering,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.