Multi-talker monaural speech separation has a vast range of applications. For example, in a home or conference environment in which many people talk, the human auditory system can easily track and follow a target speaker’s voice within the mixed speech of multiple talkers. If automatic speech recognition or speaker recognition is to be performed in such a setting, a clean speech signal of the target speaker must first be separated from the mixed speech. Speech separation is therefore a problem that must be solved in order to achieve satisfactory performance in speech or speaker recognition tasks. The problem has two difficulties. First, since we do not have any prior information about the user, a truly practical system must be speaker-independent. Second, beamforming algorithms cannot be applied to a single-microphone signal. Many traditional methods, such as computational auditory scene analysis (CASA) [1, 2, 3], non-negative matrix factorization (NMF) [4, 5], and probabilistic models [6], do not solve these two difficulties well.
More recently, a large number of techniques based on deep learning have been proposed for this task. These methods can be broadly grouped into three categories. The first category is based on deep clustering (DPCL) [7, 8], which maps the time-frequency (TF) points of the spectrogram to embedding vectors; these embedding vectors are then clustered into several classes corresponding to different speakers, and finally the clusters are used as masks on the spectrogram to recover the separated clean voices. The second is permutation invariant training (PIT) [9, 10], which solves the label permutation problem by minimizing the lowest error among all possible permutations of the assignment between the N outputs and the N mixing sources. The third category is end-to-end speech separation in the time domain [11, 12, 13], which is a natural way to overcome the upper bound on source-to-distortion ratio improvement (SDRi) of short-time Fourier transform (STFT) mask estimation based methods and to meet real-time processing requirements in actual use.
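The permutation search behind PIT can be illustrated with a small NumPy sketch. It assumes mean squared error as the per-assignment error (the text above only says "error"), and the function name `pit_mse` is hypothetical rather than taken from [9, 10].

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Permutation invariant MSE for N estimated and N target signals.

    estimates, targets: arrays of shape (N, T).
    Returns (lowest mean squared error, best permutation of target indices).
    """
    n = estimates.shape[0]
    best_err, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        # Pair output k with target perm[k] and average the squared error.
        err = np.mean((estimates - targets[list(perm)]) ** 2)
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm

rng = np.random.default_rng(0)
targets = rng.standard_normal((2, 16))
# The estimates come out in swapped speaker order, plus a little noise.
estimates = targets[::-1] + 0.01 * rng.standard_normal((2, 16))
err, perm = pit_mse(estimates, targets)
```

The search visits all N! assignments, which is cheap for the two-speaker case considered in this paper.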
This paper is based on the end-to-end method [11, 12, 13], which has achieved better results than DPCL based or PIT based approaches. Most DPCL and PIT based methods use the STFT as front-end. Specifically, the mixed speech signal is first transformed from a one-dimensional signal in the time domain to a two-dimensional spectrum in the TF domain; the mixed spectrum is then separated into spectra corresponding to the different source speakers, for example by a deep clustering method; and finally each clean source speech signal is restored by an inverse STFT on the corresponding spectrum. This framework has several limitations. Firstly, it is unclear whether the STFT is the optimal transformation of the signal for speech separation, even assuming that the parameters it depends on, such as the size and overlap of audio frames and the window type, are optimal. Secondly, most STFT based methods assume that the phase of the separated signal is equal to the mixture phase, which is generally incorrect and imposes an obvious upper bound on separation performance, even when ideal masks are used. To overcome these problems, several speech separation models that operate directly on time-domain speech signals were recently proposed [11, 12]. Inspired by these first results, we propose FurcaNet (Furca is Latin for “fork”; we use this word to mean that the speech is split into two streams by our network, like water), a fully end-to-end time-domain separation system based on a deep gated convolutional neural network (GCNN) [14], bidirectional long short-term memory (BiLSTM), and a deep neural network (DNN), an architecture that has shown promising performance on both clean and noisy Voice Search tasks [15].
The remainder of this paper is organized as follows: Section 2 introduces monaural speech separation and describes our proposed FurcaNet and the separation algorithm in detail. The experimental setup and results are presented in Section 3. We conclude this paper in Section 4.
2 The FurcaNet Model
The proposed end-to-end deep learning approach consists of two main components: one is the FurcaNet pipeline, which consists of GCNN, BiLSTM and DNN; and the other is the perceptual loss function.
In this section, we first review the formal definition of the monaural speech separation task and the GCNN architecture. We then describe the details of the FurcaNet structure we investigated. Finally, we introduce the perceptual metric used as a loss function.
2.1 Monaural speech separation
The goal of monaural speech separation is to estimate the individual target signals from a linearly mixed single-microphone signal, in which the target signals overlap in the TF domain. Let $x_i(t)$, $i = 1, \dots, N$, denote the $N$ target speech signals and $y(t)$ denote the mixed speech. If we assume the target signals are linearly mixed, the mixture can be represented as
$$y(t) = \sum_{i=1}^{N} x_i(t),$$
and monaural speech separation aims at estimating the individual target signals $x_i(t)$ from the given mixed speech $y(t)$. In this work it is assumed that the number of target signals is known.
In this work, we propose an end-to-end deep learning approach to separate the mixed utterance. The input of FurcaNet is a mixed utterance, and the outputs of the network are the separated utterances, which ideally should be identical to the target signals. To do this, the mixed speech is first divided into frames. Each frame of the mixed utterance is then forward propagated through FurcaNet directly as a raw waveform, and the output activations are the separated frames, each of which corresponds to only one speaker. Finally, the separated frames are concatenated to form the output utterances.
2.2 Network architecture
The proposed FurcaNet model is similar to [15], but with fine adjustments. The FurcaNet separation system comprises GCNN, BiLSTM, and DNN components, and its structure is illustrated in Fig. 1. The deep GCNN proposed in [14] is adopted here to build the front-end. The GCNN is implemented by stacking multiple 1D gated convolutional (GConv) layers on top of each other.
A gated linear unit (GLU) is used as a nonlinear control function instead of tanh activation or regular rectified linear units (ReLUs):
$$h_l = (h_{l-1} * W_l + b_l) \otimes \sigma(h_{l-1} * V_l + c_l),$$
where $h_{l-1}$ and $h_l$ are the input and output of the $l$-th layer, $W_l$, $b_l$, $V_l$, and $c_l$ are learned parameters, $\sigma$ is the sigmoid function, and $\otimes$ is the element-wise product between vectors or matrices. Similar to LSTMs, GLUs play the role of controlling the information passed on in the hierarchy. This special gating mechanism allows us to effectively capture long-range context dependencies by deepening layers without encountering the problem of vanishing gradients. In order to stabilize the training and also reduce the training time, a layer normalization (LNorm) operator [16] is added after each GConv layer.
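A single GConv layer with GLU gating followed by layer normalization can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the parameter names `w`, `b`, `v`, `c`, the "valid" convolution, and the layer normalization without learned gain and bias are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d(x, w, b):
    """Valid 1D convolution. x: (T, C_in), w: (K, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    t_out = x.shape[0] - k + 1
    out = np.empty((t_out, w.shape[2]))
    for t in range(t_out):
        # Contract the (K, C_in) window with the (K, C_in, C_out) kernel.
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return out

def gated_conv_layer(x, w, b, v, c, eps=1e-5):
    """GLU-gated convolution followed by layer normalization over channels."""
    linear = conv1d(x, w, b)          # content path
    gate = sigmoid(conv1d(x, v, c))   # gate values in (0, 1)
    h = linear * gate                 # element-wise gating
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 1))      # 10 samples, 1 input channel
w = rng.standard_normal((3, 1, 4))
v = rng.standard_normal((3, 1, 4))
b = np.zeros(4)
c = np.zeros(4)
h = gated_conv_layer(x, w, b, v, c)
```

The gate path has the same shape as the content path, so each output value is scaled by its own factor between 0 and 1.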
In order to capture long-term contextual dependencies, BiLSTM is applied in place of the bottleneck architectures or dilated convolutional networks [17]. BiLSTM is a natural choice for modeling long-term time series data, since its recurrent connections allow the network to make predictions using the entire input time series. After the gated convolution, we pass the GCNN output to the BiLSTM layers. Finally, we pass the output of the BiLSTM to one fully connected DNN layer, which maps the signal further to a more separable space. FurcaNet incorporates the GConv, BiLSTM, and DNN layers into a unified framework and thereby combines the advantages of the different layers. All the layers are trained jointly. During training we need to provide the correct reference to the corresponding output layer for supervision.
2.3 Perceptual metric: Utterance-level SDR objective
Since the loss function of many STFT-based methods is not directly applicable to waveform-based end-to-end speech separation, a perceptual metric based loss function is tried in this work. The perception of speech is greatly affected by distortion [18, 19]. To evaluate the performance of speech separation, the BSS_Eval metrics signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR) [20, 21], as well as short-time objective intelligibility (STOI) [22], are often employed. In this work we directly use SDR, which is the most commonly used metric for evaluating the performance of source separation, as the training objective. SDR measures the amount of distortion introduced by the output signal and is defined as the ratio between the energy of the clean signal and the energy of the distortion.
SDR captures the overall separation quality of the algorithm. There is a subtle point here. We first concatenate the outputs of FurcaNet into a complete utterance and then compare it with the full target utterance to calculate the SDR at the utterance level (uSDR), instead of calculating the SDR one frame at a time. These two approaches differ considerably in both procedure and performance. If we denote the output of the network by $s$, which should ideally be equal to the target source $x$, then the SDR between $s$ and $x$ is computed following [20, 21].
Our objective is then to maximize the SDR or, equivalently, to minimize the negative SDR as a loss function with respect to $s$.
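An utterance-level negative SDR loss can be sketched as follows. The sketch assumes the simplest reading of the description above, namely the energy of the clean signal divided by the energy of the difference between the clean signal and the network output; the BSS_Eval definition in [20, 21] involves additional projection steps that are not reproduced here. The names `sdr_db` and `usdr_loss` are hypothetical, and the frames are assumed to be non-overlapping so that plain concatenation rebuilds the utterance.

```python
import numpy as np

def sdr_db(estimate, target, eps=1e-8):
    """10 * log10(clean-signal energy / distortion energy), in dB."""
    distortion = target - estimate
    return 10.0 * np.log10(
        (np.sum(target ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )

def usdr_loss(estimated_frames, target_utterance):
    """Negative SDR computed once over the whole concatenated utterance."""
    estimate = np.concatenate(estimated_frames)
    return -sdr_db(estimate, target_utterance)

target = np.sin(np.linspace(0.0, 20.0, 160))
# Two 80-sample output frames that are a slightly attenuated copy of the target.
frames = [0.9 * target[:80], 0.9 * target[80:]]
loss = usdr_loss(frames, target)
```

In this toy case the distortion is one tenth of the target in amplitude, so the energy ratio is 100 and the loss is about -20 dB.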
3.1 Dataset and neural network
We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset [7, 8], which contains 30 hours of training data and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from 49 male and 51 female speakers in the Wall Street Journal (WSJ0) training set si_tr_s and mixing them at various signal-to-noise ratios (SNRs) drawn uniformly between 0 dB and 5 dB. A 5-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset.
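Mixing two utterances at an SNR drawn uniformly between 0 dB and 5 dB can be illustrated with the sketch below. It is not the WSJ0-2mix generation script: the function name `mix_at_snr`, the random stand-in signals, and the power-based SNR definition are assumptions made for illustration.

```python
import numpy as np

def mix_at_snr(speech_a, speech_b, snr_db):
    """Scale speech_b so that speech_a is snr_db louder, then sum the two.

    Returns the mixture and the two sources as they appear in the mixture.
    """
    power_a = np.mean(speech_a ** 2)
    power_b = np.mean(speech_b ** 2)
    gain = np.sqrt(power_a / (power_b * 10.0 ** (snr_db / 10.0)))
    scaled_b = gain * speech_b
    return speech_a + scaled_b, speech_a, scaled_b

rng = np.random.default_rng(0)
a = rng.standard_normal(8000)        # stand-in for one utterance
b = rng.standard_normal(8000)        # stand-in for the other utterance
snr_db = rng.uniform(0.0, 5.0)       # SNR drawn uniformly from [0, 5] dB
mixture, src_a, src_b = mix_at_snr(a, b, snr_db)
measured = 10.0 * np.log10(np.mean(src_a ** 2) / np.mean(src_b ** 2))
```

The mixture is the plain sum of the two scaled sources, which matches the linear mixing assumption of Section 2.1.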
In this work, we slide a window over the raw waveform with a shift of 5 ms to produce a set of 10 ms frames. The structure of the FurcaNet instance used in this work is as follows. The front-end GCNN has five 1D GConv layers. Since the input to FurcaNet is a speech frame of 10 ms (80 sample points), the size of the first convolution kernel is 80, and the other four 1D GConv layers have kernels of size 1000. After each GConv layer, we add a layer normalization operation [16] in order to stabilize the training. Two BiLSTM layers with 1000 hidden units in each direction are then employed after the GCNN. The DNN has two hidden layers of 2000 nodes each.
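The framing step can be sketched as follows. A 10 ms frame of 80 sample points implies a sampling rate of 8 kHz, so a 5 ms shift corresponds to 40 samples; the function name `frame_signal` and the choice to drop any incomplete final frame are assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len=80, hop=40):
    """Cut a 1D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

signal = np.arange(8000, dtype=float)   # one second of audio at 8 kHz
frames = frame_signal(signal)
```

Each row of `frames` is one 80-sample network input, and consecutive rows overlap by 40 samples.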
3.2 Training trick
During training, Adam [23] serves as the optimizer to minimize the uSDR loss, with an initial learning rate of 0.001 that is scaled down by 0.5 when the loss increases on the development set. Each mini-batch has 8 randomly selected utterances. The uSDR loss function is somewhat hard to optimize, and Adam [23] often gets stuck in a local minimum. In Fig. 3, the horizontal axis is the number of trained epochs and the vertical axis is the negative SDR. In failed training runs, the uSDR gets stuck at 3 dB or 4 dB and does not improve any further. We found an ad hoc trick to deal with this problem: since the weights of FurcaNet are randomly initialized, we restart the training program until the initial SDR (before any training epoch) on the development dataset is greater than a threshold (for example, -30 dB; Fig. 3 only shows the SDRs after each training epoch), and only then let the program start training.
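The restart trick amounts to rejection sampling over random initializations, which can be sketched as below. The helper names `init_until_promising`, `toy_init`, and `toy_eval` are hypothetical; the toy functions only stand in for building FurcaNet and measuring its initial SDR on the development set, and the retry limit is an added safeguard rather than part of the described procedure.

```python
import numpy as np

def init_until_promising(init_weights, eval_sdr, threshold_db=-30.0,
                         max_tries=100, seed=0):
    """Redraw random weights until the untrained SDR exceeds threshold_db."""
    rng = np.random.default_rng(seed)
    for attempt in range(1, max_tries + 1):
        weights = init_weights(rng)
        sdr = eval_sdr(weights)       # SDR before any training epoch
        if sdr > threshold_db:
            return weights, sdr, attempt
    raise RuntimeError("no initialization exceeded the threshold")

def toy_init(rng):
    # Stand-in for random network initialization.
    return rng.standard_normal(4)

def toy_eval(weights):
    # Stand-in for evaluating the untrained model on the development set.
    return -40.0 + 10.0 * float(np.abs(weights).mean())

weights, initial_sdr, attempts = init_until_promising(toy_init, toy_eval)
```

Training would start only from the returned `weights`, which mirrors restarting the program until the initial SDR passes the threshold.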
We evaluate the systems with the SDR improvement (SDRi) [20, 21] metric used in [8, 24, 25, 26, 9]. The original SDR, that is, the average SDR of the mixed speech $y(t)$ with respect to the original target speech signals $x_1(t)$ and $x_2(t)$, is 0.15 dB. Table 1 lists the average SDRi obtained by FurcaNet and by almost all the results of the past two years, where IRM means the ideal ratio mask
$$M_i(t, f) = \frac{|X_i(t, f)|}{\sum_{j=1}^{N} |X_j(t, f)|}$$
applied to the STFT $Y(t, f)$ of $y(t)$ to obtain the separated speech, which is evaluated to show the upper bound of STFT based methods, and $X_i(t, f)$ is the STFT of $x_i(t)$. In this experiment, Chimera++ [28, wang2018end] gives the best SDRi among all baselines shown in Table 1. FurcaNet achieves an improvement of 1.3 dB SDRi over this best baseline. FurcaNet thus achieves the most significant performance improvement compared with the baseline systems, and it breaks through the upper bound of STFT based methods.
|Chimera++ [28, wang2018end]||12.0|
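How an ideal ratio mask produces the STFT-domain reference result can be sketched in NumPy. The magnitude-ratio definition used here is one common form and is an assumption, as is the function name `ideal_ratio_masks`; random complex arrays stand in for the STFTs of real speech.

```python
import numpy as np

def ideal_ratio_masks(source_stfts, eps=1e-8):
    """Magnitude-ratio masks for source STFTs of shape (N, F, T)."""
    mags = np.abs(source_stfts)
    return mags / (mags.sum(axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
# Stand-ins for the STFTs of two target signals (5 frequency bins, 7 frames).
sources = rng.standard_normal((2, 5, 7)) + 1j * rng.standard_normal((2, 5, 7))
mixture = sources.sum(axis=0)         # the STFT is linear, so the mixture STFT is the sum
masks = ideal_ratio_masks(sources)
separated = masks * mixture           # every estimate keeps the mixture phase
```

Because the masks are real and non-negative, each separated spectrum inherits the phase of the mixture, which is the limitation of STFT mask based methods noted in the introduction.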
In this study, we proposed an end-to-end architecture called FurcaNet for monaural speech separation. FurcaNet combines the advantages of different neural networks such as GCNN, BiLSTM, and DNN, and at the same time it can directly optimize its parameters using perceptual metrics such as SDR. Our results on the two-speaker mixed speech separation task indicate that FurcaNet achieves state-of-the-art performance. Future research will include extending the experiments to the three-speaker mixture task to see whether the approach is independent of the number of sound sources.
We would like to thank Jian Wu at Northwestern Polytechnical University, Yi Luo at Columbia University, and Zhong-Qiu Wang at Ohio State University for valuable discussions on WSJ0-2mix database, DPCL, and end-to-end speech separation.
- [1] DeLiang Wang and Guy J Brown, Computational auditory scene analysis: Principles, algorithms, and applications, Wiley-IEEE press, 2006.
- [2] Yang Shao and DeLiang Wang, “Model-based sequential organization in cochannel speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 289–298, 2006.
- [3] Ke Hu and DeLiang Wang, “An unsupervised approach to cochannel speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 122–131, 2013.
- [4] Paris Smaragdis et al., “Convolutive speech bases and their application to supervised speech separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 1, 2007.
- [5] Jonathan Le Roux, Felix J Weninger, and John R Hershey, “Sparse nmf–half-baked or well done?,” Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023, 2015.
- [6] “Speech recognition using factorial hidden markov models for separation in the feature space,” in Ninth International Conference on Spoken Language Processing, 2006.
- [7] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 31–35.
- [8] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
- [9] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 10, pp. 1901–1913, 2017.
- [10] Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 241–245.
- [11] Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” arXiv preprint arXiv:1711.00541, 2017.
- [12] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis, “Adaptive front-ends for end-to-end source separation,” in Proc. NIPS, 2017.
- [13] Yi Luo and Nima Mesgarani, “Tasnet: Surpassing ideal time-frequency masking for speech separation,” arXiv preprint arXiv:1809.07454, 2018.
- [14] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
- [15] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, “Learning the speech front-end with raw waveform cldnns,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- [16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [17] Li Li and Hirokazu Kameoka, “Deep clustering with gated convolutional networks,” in Proc. ICASSP, 2018, pp. 16–20.
- [18] Wonho Yang, Majid Benbouchta, and Robert Yantorno, “Performance of the modified bark spectral distortion as an objective speech quality measure,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. IEEE, 1998, vol. 1, pp. 541–544.
- [19] Peter Assmann and Quentin Summerfield, “The perception of speech under adverse conditions,” in Speech processing in the auditory system, pp. 231–308. Springer, 2004.
- [20] Cédric Févotte, Rémi Gribonval, and Emmanuel Vincent, “Bss_eval toolbox user guide–revision 2.0,” 2005.
- [21] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
- [22] Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4214–4217.
- [23] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [24] Yi Luo, Zhuo Chen, and Nima Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
- [25] Chenglin Xu, Wei Rao, Xiong Xiao, Eng Siong Chng, and Haizhou Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid lstm,” 2018.
- [26] Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 246–250.
- [27] Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, and Bo Xu, “Cbldnn-based speaker-independent speech separation via generative adversarial training,” 2018.
- [28] Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey, “Alternative objective functions for deep clustering,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.