1 Introduction
Real-world speech communication often takes place in crowded, multi-talker environments. A speech processing system designed to operate in such conditions needs the ability to separate the speech of different talkers. This task, which is effortless for humans, has proven very difficult to model in machines. In recent years, deep learning approaches have significantly advanced the state of this problem compared to traditional methods [1, 2, 3, 4, 5, 6].
A typical neural network speech separation algorithm starts by calculating the short-time Fourier transform (STFT) to create a time-frequency (T-F) representation of the mixture sound. The T-F bins that correspond to each source are then separated and used to synthesize the source waveforms with the inverse STFT. Several issues arise in this framework. First, it is unclear whether Fourier decomposition is the optimal transformation of the signal for speech separation. Second, because the STFT transforms the signal into a complex domain, the separation algorithm needs to deal with both the magnitude and the phase of the signal. Because of the difficulty in modifying the phase, the majority of proposed methods only modify the magnitude of the STFT by calculating a time-frequency mask for each source, and synthesize using the masked magnitude spectrogram with the original phase of the mixture. This imposes an upper bound on separation performance. Even though several systems have been proposed that use the phase information to design the masks, such as the phase-sensitive mask [7] and the complex ratio mask [8], the upper bound still exists since the reconstruction is not exact. Moreover, effective speech separation in the STFT domain requires high frequency resolution, which results in a relatively long time window, typically more than 32 ms for speech [3, 4, 5] and more than 90 ms for music separation [9]. Because the minimum latency of the system is bounded by the length of the STFT time window, this limits the use of such systems when very short latency is required, such as in telecommunication systems or hearable devices.
A natural way to overcome these obstacles is to directly model the signal in the time domain. In recent years, this approach has been successfully applied in tasks such as speech recognition, speech synthesis and speech enhancement [10, 11, 12, 13, 14], but waveform-level speech separation with deep learning has not been investigated yet. In this paper, we propose the Time-domain Audio Separation Network (TasNet), a neural network that directly models the mixture waveform using an encoder-decoder framework, and performs the separation on the output of the encoder. In this framework, the mixture waveform is represented by a nonnegative weighted sum of basis signals, where the weights are the outputs of the encoder, and the basis signals are the filters of the decoder. The separation is done by estimating the weights that correspond to each source from the mixture weight. Because the weights are nonnegative, the estimation of source weights can be formulated as finding the masks which indicate the contribution of each source to the mixture weight, similar to the T-F masks that are used in STFT systems. The source waveforms are then reconstructed using the learned decoder.
This signal factorization technique shares the motivation behind independent component analysis (ICA) with a nonnegative mixing matrix [15] and semi-nonnegative matrix factorization (semi-NMF) [16]. However, unlike ICA or semi-NMF, the weights and the basis signals are learned in a nonnegative autoencoder framework [17, 18, 19, 20], where the encoder is a 1-D convolutional layer and the decoder is a 1-D deconvolutional layer (also known as transposed convolutional). In this scenario, the mixture weights replace the commonly used STFT representations.
Since TasNet operates on waveform segments that can be as small as 5 ms, the system can be implemented in real-time with very low latency. In addition to having lower latency, TasNet outperforms the state-of-the-art STFT-based system. In applications that do not require real-time processing, a noncausal separation module can also be used to further improve the performance by using information from the entire signal.
2 Model Description
2.1 Problem formulation
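The problem of single-channel speech separation is formulated as estimating $C$ sources $s_1(t), \ldots, s_C(t) \in \mathbb{R}^{1 \times T}$, given the discrete waveform of the mixture $x(t) \in \mathbb{R}^{1 \times T}$:
$$x(t) = \sum_{i=1}^{C} s_i(t) \tag{1}$$
We first segment the mixture and clean sources into $K$ non-overlapping vectors of length $L$ samples (note that $K$ varies from utterance to utterance):
$$x_k[t] = x\big((k-1)L + t\big), \quad s_{i,k}[t] = s_i\big((k-1)L + t\big), \quad t \in [0, L), \; k = 1, 2, \ldots, K \tag{2}$$
For simplicity, we drop the notation $k$ where there is no ambiguity. Each segment of mixture and clean signals can be represented by a nonnegative weighted sum of $N$ basis signals $B = [b_1, b_2, \ldots, b_N] \in \mathbb{R}^{N \times L}$:
$$x = wB, \quad s_i = d_i B \tag{3}$$
where $w \in \mathbb{R}^{1 \times N}$ is the mixture weight vector, and $d_i \in \mathbb{R}^{1 \times N}$ is the weight vector for the source $i$. Separating the sources in this representation is then reformulated as estimating the weight vector $d_i$ of each source given the mixture weight $w$, subject to:
$$w = \sum_{i=1}^{C} d_i, \quad d_i \geq 0 \tag{4}$$
Because all weights ($w$ and $d_i$) are nonnegative, estimating the weight of each source can be thought of as finding its corresponding mask-like vector, $m_i$, which is applied to the mixture weight, $w$, to recover $d_i$:
$$w = \sum_{i=1}^{C} d_i = w \odot \sum_{i=1}^{C} (d_i \oslash w) := w \odot \sum_{i=1}^{C} m_i \tag{5}$$
$$d_i = m_i \odot w \tag{6}$$
where $m_i \in \mathbb{R}^{1 \times N}$ represents the relative contribution of source $i$ to the mixture weight, and $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.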
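The mask formulation in Eqs. (3)-(6) can be checked on a toy example. The sketch below is only an illustration: the basis signals and weights are made-up numbers rather than learned values, and the helper `matvec_row` is a hypothetical name introduced here.

```python
# Toy illustration of Eqs. (3)-(6): nonnegative weights, masks, and synthesis.
# All numbers are made up; in TasNet the basis and weights are learned.

def matvec_row(w, B):
    """Multiply a row vector w (length N) by a matrix B (N x L)."""
    return [sum(w[n] * B[n][t] for n in range(len(w))) for t in range(len(B[0]))]

# N = 3 basis signals of length L = 4 (basis entries may take any sign).
B = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]

# Nonnegative weight vectors d_1 and d_2 for C = 2 sources.
d = [
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
]

# Eq. (4): the mixture weight is the sum of the source weights.
w = [sum(d_i[n] for d_i in d) for n in range(3)]

# Eq. (5): mask-like vectors m_i = d_i / w (element-wise); 0 where w is 0.
m = [[d_i[n] / w[n] if w[n] > 0 else 0.0 for n in range(3)] for d_i in d]

# Eq. (6): applying each mask to w recovers the source weights.
d_rec = [[m_i[n] * w[n] for n in range(3)] for m_i in m]

# Eq. (3): time-domain synthesis of the mixture and source segments.
x = matvec_row(w, B)
s = [matvec_row(d_i, B) for d_i in d_rec]
```

Because the weights are nonnegative, the masks lie in [0, 1] and sum to one wherever the mixture weight is nonzero, and the synthesized source segments add up to the mixture segment.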
In comparison to other matrix factorization algorithms such as ICA, where the basis signals are required to have distinct statistical properties or explicit frequency band preferences, no such constraints are imposed here. Instead, the basis signals are jointly optimized with the other parameters of the separation network during training. Moreover, the synthesis of the source signal from the weights and basis signals is done directly in the time domain, unlike the inverse STFT step which is needed in T-F based solutions.
2.2 Network design
Figure 1 shows the structure of the network. It contains three parts: an encoder for estimating the mixture weight, a separation module, and a decoder for source waveform reconstruction. The combination of the encoder and the decoder modules constructs a nonnegative autoencoder for the waveform of the mixture, where the nonnegative weights are calculated by the encoder and the basis signals are the 1-D filters in the decoder. The separation is performed on the mixture weight matrix using a subnetwork that estimates a mask for each source.
2.2.1 Encoder for mixture weight calculation
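The estimation of the nonnegative mixture weight $w_k$ for segment $k$ is done by a 1-D gated convolutional layer:
$$w_k = \mathrm{ReLU}(x_k \circledast U) \odot \sigma(x_k \circledast V), \quad k = 1, 2, \ldots, K \tag{7}$$
where $U \in \mathbb{R}^{N \times L}$ and $V \in \mathbb{R}^{N \times L}$ are $N$ vectors with length $L$, and $w_k \in \mathbb{R}^{1 \times N}$ is the mixture weight vector. $\sigma$ denotes the Sigmoid activation function and $\circledast$ denotes the convolution operator. $x_k \in \mathbb{R}^{1 \times L}$ is the $k$th segment of the entire mixture signal $x(t)$ with length $L$, and is normalized to have unit norm to reduce the variability. The convolution is applied on the rows (time dimension).
2.2.2 Separation network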
The estimation of the source masks is done with a deep LSTM network to model the time dependencies across the $K$ segments, followed by a fully-connected layer with Softmax activation function for mask generation. The input to the LSTM network is the sequence of $K$ mixture weight vectors $w_1, \ldots, w_K \in \mathbb{R}^{1 \times N}$, and the output of the network for source $i$ is $K$ mask vectors $m_{i,1}, \ldots, m_{i,K} \in \mathbb{R}^{1 \times N}$. The procedure for estimation of the masks is the same as the T-F mask estimation in [4], where a set of masks is generated by several LSTM layers followed by a fully-connected layer with Softmax function as activation.
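To speed up and stabilize the training process, we normalize the mixture weight vector $w_k$ in a way similar to layer normalization [22]:
$$\hat{w}_k = \frac{g}{\sigma_k} \odot (w_k - \mu_k) + b \tag{8}$$
$$\mu_k = \frac{1}{N} \sum_{j=1}^{N} w_{k,j}, \quad \sigma_k = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (w_{k,j} - \mu_k)^2} \tag{9}$$
where parameters $g \in \mathbb{R}^{1 \times N}$ and $b \in \mathbb{R}^{1 \times N}$ are gain and bias vectors that are jointly optimized with the network. This normalization step results in scale-invariant mixture weight vectors and also enables more efficient training of the LSTM layers.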
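As a rough illustration of Eqs. (8)-(9), the sketch below normalizes one mixture weight vector with its own mean and standard deviation and then applies a gain and a bias. The function name `normalize_weights` and the small constant `eps` (added to avoid division by zero) are assumptions of this sketch, not details given in the text.

```python
import math

def normalize_weights(w, g, b, eps=1e-8):
    """Layer-norm-style normalization of one weight vector (sketch of Eqs. 8-9).

    w: mixture weight vector, g: gain vector, b: bias vector (all length N).
    eps is an assumed numerical safeguard, not part of the paper's equations.
    """
    n = len(w)
    mu = sum(w) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in w) / n)
    return [g[j] * (w[j] - mu) / (sigma + eps) + b[j] for j in range(n)]

# Example: rescaling the input leaves the normalized vector (almost) unchanged.
w = [0.0, 1.0, 2.0, 5.0]
g = [1.0] * 4   # gain initialised to ones for the illustration
b = [0.0] * 4   # bias initialised to zeros for the illustration
w_hat = normalize_weights(w, g, b)
w_hat_scaled = normalize_weights([10.0 * v for v in w], g, b)
```

With unit gain and zero bias the output has zero mean, and multiplying the input by a positive constant changes the result only through `eps`, which is the scale invariance mentioned above.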
Starting from the second LSTM layer, an identity skip connection [23] is added between every two LSTM layers to enhance the gradient flow and accelerate the training process.
2.2.3 Decoder for waveform reconstruction
The separation network produces a mask matrix $M_i \in \mathbb{R}^{K \times N}$ for each source $i$ from the mixture weight across all the $K$ segments. The source weight matrices can then be calculated by
$$D_i = W \odot M_i \tag{10}$$
where $D_i \in \mathbb{R}^{K \times N}$ is the weight matrix for source $i$. Note that $M_i$ is applied to the original mixture weight $W \in \mathbb{R}^{K \times N}$, whose rows are $w_1, \ldots, w_K$, instead of the normalized weight $\hat{W}$. The time-domain synthesis of the sources is done by matrix multiplication between $D_i$ and the basis signals $B \in \mathbb{R}^{N \times L}$:
$$S_i = D_i B \tag{11}$$
For each segment, this operation can also be formulated as a linear deconvolutional operation (also known as transposed convolution) [24], where each of the $N$ rows in $B$ corresponds to a 1-D filter which is jointly learned together with the other parts of the network. This is the inverse operation of the convolutional layer in Section 2.2.1.
Finally, we scale the recovered signals to reverse the effect of the normalization of $x_k$ discussed in Section 2.2.1. Concatenating the recoveries across all segments reconstructs the entire signal for each source:
$$s_i(t) = [S_{i,k}], \quad k = 1, 2, \ldots, K \tag{12}$$
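The encoder of Section 2.2.1 and the decoder above can be sketched for a single segment as follows. This is a minimal sketch under several assumptions: the filters are random rather than trained, the sizes are much smaller than the ones used in the experiments, the norm is taken to be Euclidean, and the masks are fixed constants standing in for the output of the LSTM separation network. Because each filter has the same length as the segment, the convolution reduces to one dot product per filter here. The names `encode` and `decode` are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def encode(x_k, U, V):
    """Gated encoder for one segment: ReLU(x U^T) * sigmoid(x V^T)."""
    norm = math.sqrt(dot(x_k, x_k)) or 1.0      # avoid dividing by zero
    x_n = [v / norm for v in x_k]               # unit-norm segment
    w_k = [relu(dot(x_n, u)) * sigmoid(dot(x_n, v)) for u, v in zip(U, V)]
    return w_k, norm

def decode(w_k, m_k, B, norm):
    """Mask the mixture weight, synthesize with the basis, undo the scaling."""
    d_k = [wi * mi for wi, mi in zip(w_k, m_k)]
    seg = [sum(d_k[n] * B[n][t] for n in range(len(B))) for t in range(len(B[0]))]
    return [norm * v for v in seg]

random.seed(0)
L, N = 8, 16  # tiny illustrative sizes
U = [[random.gauss(0, 1) for _ in range(L)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(L)] for _ in range(N)]
B = [[random.gauss(0, 1) for _ in range(L)] for _ in range(N)]

x_k = [random.uniform(-1, 1) for _ in range(L)]
w_k, norm = encode(x_k, U, V)

# Placeholder masks for two sources (a trained network would predict these).
masks = [[0.7] * N, [0.3] * N]
segments = [decode(w_k, m, B, norm) for m in masks]
```

Since the decoder is linear in the masked weights, masks that sum to one split the decoded mixture segment exactly between the sources.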
2.2.4 Training objective
Since the outputs of the network are the waveforms of the estimated clean signals, we can directly use source-to-distortion ratio (SDR) as our training target. Here we use scale-invariant source-to-noise ratio (SI-SNR), which is used as the evaluation metric in place of the standard SDR in [3, 5], as the training target. The SI-SNR is defined as:
$$s_{target} = \frac{\langle \hat{s}, s \rangle \, s}{\|s\|^2} \tag{13}$$
$$e_{noise} = \hat{s} - s_{target} \tag{14}$$
$$\text{SI-SNR} := 10 \log_{10} \frac{\|s_{target}\|^2}{\|e_{noise}\|^2} \tag{15}$$
where $\hat{s} \in \mathbb{R}^{1 \times T}$ and $s \in \mathbb{R}^{1 \times T}$ are the estimated and target clean sources respectively, $T$ denotes the length of the signals, and $\hat{s}$ and $s$ are both normalized to have zero mean to ensure scale invariance. Permutation invariant training (PIT) [4] is applied during training to remedy the source permutation problem [3, 4, 5].
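The SI-SNR of Eqs. (13)-(15) and its use with permutation invariant training can be sketched in plain Python. The function names, the small constant `eps`, and the toy signals are assumptions of this sketch; an actual training setup would compute the same quantity on batched tensors and maximize it with gradient-based optimization.

```python
import math
from itertools import permutations

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est_mean = sum(est) / len(est)
    ref_mean = sum(ref) / len(ref)
    est = [v - est_mean for v in est]            # zero-mean both signals
    ref = [v - ref_mean for v in ref]
    scale = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]            # Eq. (13)
    noise = [e - t for e, t in zip(est, target)] # Eq. (14)
    num = sum(t * t for t in target) + eps
    den = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(num / den)          # Eq. (15)

def pit_si_snr(ests, refs):
    """Best mean SI-SNR over all assignments of references to estimates.

    Returns (score, perm) where perm[j] is the reference index for estimate j.
    """
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(refs))):
        score = sum(si_snr(ests[j], refs[p]) for j, p in enumerate(perm)) / len(refs)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

# Toy example: the estimates come out in swapped order with a little noise.
a = [math.sin(0.3 * t) for t in range(50)]
b = [math.cos(1.1 * t) for t in range(50)]
noise = [0.01 * ((t % 3) - 1) for t in range(50)]
refs = [a, b]
ests = [[v + n for v, n in zip(b, noise)], [v + n for v, n in zip(a, noise)]]
score, perm = pit_si_snr(ests, refs)
```

In this toy case the search selects the swapped assignment, and rescaling an estimate leaves its SI-SNR essentially unchanged.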
3 Experiments
3.1 Dataset
We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset [3, 4, 5], which contains 30 hours of training and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from different speakers in the Wall Street Journal (WSJ0) training set si_tr_s, and mixing them at random signal-to-noise ratios (SNR) between 0 dB and 5 dB. A five-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset. To reduce the computational cost, the waveforms were downsampled to 8 kHz.
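Mixing two utterances at a chosen SNR amounts to rescaling one of them before summation. The sketch below shows one way to do this; it assumes the SNR is defined as the power ratio of the first signal to the second, uses random noise as stand-in audio, and introduces the hypothetical helper `mix_at_snr`. It is not the script used to build the dataset.

```python
import math
import random

def power(sig):
    return sum(v * v for v in sig) / len(sig)

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that 10*log10(P(s1)/P(s2)) equals snr_db, then add."""
    gain = math.sqrt(power(s1) / (power(s2) * 10.0 ** (snr_db / 10.0)))
    s2_scaled = [gain * v for v in s2]
    mixture = [p + q for p, q in zip(s1, s2_scaled)]
    return mixture, s2_scaled

random.seed(1)
s1 = [random.gauss(0, 1.0) for _ in range(800)]   # stand-in for utterance 1
s2 = [random.gauss(0, 0.3) for _ in range(800)]   # stand-in for utterance 2
snr_db = random.uniform(0.0, 5.0)                 # SNR range stated in the text
mixture, s2_scaled = mix_at_snr(s1, s2, snr_db)
measured_db = 10.0 * math.log10(power(s1) / power(s2_scaled))
```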
3.2 Network configuration
The parameters of the system include the segment length $L$, the number of basis signals $N$, and the configuration of the deep LSTM separation network. Using a grid search, we found the optimal $L$ to be 40 samples (5 ms at 8 kHz) and $N$ to be 500. We designed a 4-layer deep unidirectional LSTM network with 1000 hidden units in each layer, followed by a fully-connected layer with 1000 hidden units that generates two 500-dimensional mask vectors. For the noncausal configuration with bidirectional LSTM layers, the number of hidden units in each layer is set to 500 for each direction. An identity skip connection is added between the output of the second and last LSTM layers.
During training, the batch size is set to 128, and the initial learning rate is set to for the causal system (LSTM) and
for the noncausal system (BLSTM). We halve the learning rate if the accuracy on the validation set has not improved in 3 consecutive epochs. The criterion for early stopping is no decrease in the cost function on the validation set for 10 epochs. Adam [25] is used as the optimization algorithm. No further regularization or training procedures were used.
3.3 Evaluation metrics
3.4 Results and analysis
Table 1 shows the performance of our system as well as three state-of-the-art deep speech separation systems: Deep Clustering (DPCL++, [3]), Permutation Invariant Training (PIT, [4]), and Deep Attractor Network (DANet, [5]). Here TasNet-LSTM represents the causal configuration with unidirectional LSTM layers. TasNet-BLSTM corresponds to the system with bidirectional LSTM layers, which is noncausal and cannot be implemented in real-time. For the other systems, we show the best performance reported on this dataset.
We see that with the causal configuration, the proposed TasNet system significantly outperforms the state-of-the-art causal system, which uses a T-F representation as input. Under the noncausal configuration, our system outperforms all the previous systems, including the two-stage systems DPCL++ and uPIT-BLSTM-ST, which have a second-stage enhancement network. Note that our system does not contain any regularizers such as recurrent dropout (DPCL++) or post-clustering steps for mask estimation (DANet).
Table 2 compares the latency of different causal systems. The latency of a system is expressed in two parts: $T_i$ is the initial delay of the system that is required in order to receive enough samples to produce the first output, and $T_p$ is the processing time for a segment, estimated as the average per-segment processing time across the entire test set. The model was preloaded on a Titan X Pascal GPU before the separation of the first segment started. The average processing time per segment in our system is less than 0.23 ms, resulting in a total system latency $T_{tot}$ of 5.23 ms. In comparison, an STFT-based system requires at least a 32 ms time interval to start the processing, in addition to the processing time required for calculation of the STFT, separation, and inverse STFT. This enables our system to perform in situations that can tolerate only short latency, such as hearing devices and telecommunication applications.
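The latency figures quoted above follow from simple arithmetic on the segment length and sampling rate. The short sketch below reproduces them; the constant names are chosen for this illustration, and the 32 ms value is the minimum STFT window mentioned in the text, so the resulting ratio is only a lower bound on the advantage over an STFT-based system.

```python
SAMPLE_RATE_HZ = 8000      # waveforms are downsampled to 8 kHz
SEGMENT_SAMPLES = 40       # segment length used by TasNet
PROCESSING_MS = 0.23       # reported average per-segment processing time
STFT_WINDOW_MS = 32.0      # minimum window of the STFT-based system

# Initial delay: time needed to collect one full segment.
initial_delay_ms = 1000.0 * SEGMENT_SAMPLES / SAMPLE_RATE_HZ

# Total latency: initial delay plus per-segment processing time.
total_latency_ms = initial_delay_ms + PROCESSING_MS

# Number of samples the STFT system must buffer before it can start.
stft_window_samples = int(STFT_WINDOW_MS / 1000.0 * SAMPLE_RATE_HZ)

# Ratio of the STFT window alone to the total TasNet latency.
latency_ratio = STFT_WINDOW_MS / total_latency_ms

print(f"initial delay: {initial_delay_ms:.2f} ms")
print(f"total latency: {total_latency_ms:.2f} ms")
print(f"STFT window:   {stft_window_samples} samples")
print(f"ratio:         {latency_ratio:.1f}x")
```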


Table 1: SI-SNR improvement (SI-SNRi, dB) and SDR improvement (SDRi, dB) on WSJ0-2mix.
Method             Causal  SI-SNRi  SDRi
uPIT-LSTM [4]      ✓       –        7.0
TasNet-LSTM        ✓       7.7      8.0
DPCL++ [3]         ×       10.8     –
DANet [5]          ×       10.5     –
uPIT-BLSTM-ST [4]  ×       –        10.0
TasNet-BLSTM       ×       10.8     11.1




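Table 2: Latency of the causal systems in ms: initial delay ($T_i$), per-segment processing time ($T_p$), and total latency ($T_{tot}$).
Method         T_i  T_p   T_tot
uPIT-LSTM [4]  32   –     >32
TasNet-LSTM    5    0.23  5.23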

To investigate the properties of the basis signals $B$, we visualized the magnitude of their Fourier transform in both causal and noncausal networks. Figure 2 shows the frequency response of the basis signals sorted by their center frequencies (i.e. the bin index corresponding to the peak magnitude). We observe a continuous transition from low to high frequency, showing that the system has learned to perform a spectral decomposition of the waveform, similar to the finding in [10]. We also observe that the frequency bandwidth increases with center frequency, similar to mel-filterbanks. In contrast, the basis signals in TasNet have a higher resolution in lower frequencies compared to Mel and STFT. In fact, 60% of the basis signals have center frequencies below 1 kHz (Fig. 2), which may indicate the importance of low-frequency resolution for accurate speech separation. Further analysis of the network representation and transformation may lead to a better understanding of how the network separates competing speakers [28].
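Sorting basis signals by center frequency, as done for Figure 2, can be sketched with a direct discrete Fourier transform. The example below uses pure sinusoids as stand-in basis signals (the learned filters are not available here), and the helper names `dft_magnitude` and `center_bin` are hypothetical.

```python
import cmath
import math

def dft_magnitude(b):
    """Magnitude of the DFT of b for the non-negative frequency bins."""
    n = len(b)
    return [
        abs(sum(b[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)))
        for f in range(n // 2 + 1)
    ]

def center_bin(b):
    """Index of the frequency bin with the largest magnitude."""
    mag = dft_magnitude(b)
    return max(range(len(mag)), key=mag.__getitem__)

L = 40            # segment length in samples
FS = 8000         # sampling rate in Hz

# Stand-in basis signals: sinusoids at DFT bins 7, 2 and 12 (illustrative only).
basis = [[math.sin(2 * math.pi * f * t / L) for t in range(L)] for f in (7, 2, 12)]

# Sort basis indices by center frequency and convert bins to Hz.
order = sorted(range(len(basis)), key=lambda n: center_bin(basis[n]))
center_hz = [center_bin(basis[n]) * FS / L for n in order]
```

With a 40-sample segment at 8 kHz each bin is 200 Hz wide, so the three stand-in signals are ordered by center frequencies of 400 Hz, 1400 Hz and 2400 Hz.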
4 Conclusion
In this paper, we proposed a deep learning speech separation system that directly operates on the sound waveforms. Using an autoencoder framework, we represent the waveform as a nonnegative weighted sum of a set of learned basis signals. The time-domain separation problem is then solved by estimating the source masks that are applied to the mixture weights. Experiments showed that our system was 6 times faster than the state-of-the-art STFT-based systems, and achieved significantly better speech separation performance. Audio samples are available at [29].
5 Acknowledgement
This work was funded by a grant from the National Institutes of Health, NIDCD, DC014279, a National Science Foundation CAREER Award, and the Pew Charitable Trusts.
References

[1] Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 12, pp. 2136–2147, 2015.
[2] Xiao-Lei Zhang and DeLiang Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 5, pp. 967–977, 2016.
[3] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey, “Single-channel multi-speaker separation using deep clustering,” Interspeech 2016, pp. 545–549, 2016.
[4] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, “Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[5] Zhuo Chen, Yi Luo, and Nima Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 246–250.
[6] Yi Luo, Zhuo Chen, and Nima Mesgarani, “Speaker-independent speech separation with deep attractor network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, 2018.
[7] Hakan Erdogan, John R Hershey, Shinji Watanabe, and Jonathan Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 708–712.
[8] Donald S Williamson, Yuxuan Wang, and DeLiang Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[9] Yi Luo, Zhuo Chen, John R Hershey, Jonathan Le Roux, and Nima Mesgarani, “Deep clustering and conventional networks for music separation: Stronger together,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 61–65.
[10] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[11] Pegah Ghahremani, Vimal Manohar, Daniel Povey, and Sanjeev Khudanpur, “Acoustic modelling from the signal domain using CNNs,” in INTERSPEECH, 2016, pp. 3434–3438.
[12] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[13] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.
[14] Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “SEGAN: Speech enhancement generative adversarial network,” Proc. Interspeech 2017, pp. 3642–3646, 2017.
[15] Fa-Yu Wang, Chong-Yung Chi, Tsung-Han Chan, and Yue Wang, “Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 875–888, 2010.
[16] Chris HQ Ding, Tao Li, and Michael I Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, 2010.
[17] Ehsan Hosseini-Asl, Jacek M Zurada, and Olfa Nasraoui, “Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2486–2498, 2016.
[18] Andre Lemme, René Felix Reinhart, and Jochen Jakob Steil, “Online learning and generalization of parts-based image representations by nonnegative sparse autoencoders,” Neural Networks, vol. 33, pp. 194–203, 2012.
[19] Jan Chorowski and Jacek M Zurada, “Learning understandable neural networks with nonnegative weight constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 62–69, 2015.
[20] Paris Smaragdis and Shrikant Venkataramani, “A neural network alternative to nonnegative audio models,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 86–90.
[21] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” in International Conference on Machine Learning, 2017, pp. 933–941.
[22] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[24] Vincent Dumoulin and Francesco Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[25] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[26] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston, “Curriculum learning,” in Proc. ICML, 2009, pp. 41–48.
[27] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[28] Tasha Nagamine and Nima Mesgarani, “Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition,” in International Conference on Machine Learning, 2017, pp. 2564–2573.
[29] “Audio samples for TasNet,” http://naplab.ee.columbia.edu/tasnet.html.