Two-Step Sound Source Separation: Training on Learned Latent Targets

by   Efthymios Tzinis, et al.

In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal to distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.









1 Introduction

Single-channel audio source separation is a fundamental problem in audio analysis, where one extracts the individual sources that constitute a mixture signal [1]. Popular algorithms for source separation include independent component analysis [2], non-negative matrix factorization [14] and more recently supervised [8, 6, 11, 19, 26] and unsupervised [24, 22, 4] deep learning approaches. In many of the recent approaches, separation is performed by applying a mask on a latent representation, which is often a Fourier-based or a learned domain. Specifically, a separation module produces an estimated masked latent representation for the input sources and a decoder translates them back to the time domain.

Many approaches have used the short-time Fourier transform (STFT) as an encoder to obtain this latent representation, and conversely the inverse STFT (iSTFT) as a decoder. Using this representation, separation networks have been trained using a loss defined over various targets, such as raw magnitude spectrogram representations [8], ideal STFT masks [9, 7] and ideal affinity matrices [6, 10]. Other works have supplemented this by additionally reconstructing the phase of the sources [26, 27]. However, the ideal STFT masks impose an upper bound on the separation performance, and the aforementioned criteria do not necessarily translate to optimal separation. To address this, recent works have proposed end-to-end separation schemes where the encoder, decoder and separation modules are jointly optimized using a time-domain loss between the reconstructed source waveforms and their clean targets [19, 25, 12]. However, a joint time-domain end-to-end training approach might not always yield an optimal decomposition of the input mixtures, resulting in worse performance than the fixed STFT bases [12].

Some studies have reported significant benefits when performing source separation in two stages. In [5], the sources are first separated and, in a second stage, the interference between the estimated sources is reduced. Similarly, an iterative scheme is proposed in [12], where the separation estimates from the first network are used as input to the final separation network. In [17], speaker separation is performed by first separating frame-level spectral components of speakers and later sequentially grouping them using a clustering network. Lately, state-of-the-art results in most natural language processing tasks have been achieved by pre-training the encoder transformation network [3].


In this work, we propose a general two-step approach for performing source separation which can be used in any mask-based separation architecture. First we pre-train an encoder and decoder in order to learn a suitable latent representation. In the second step, we train a separation module using as loss the negative permutation-invariant [28] scale-invariant SDR (SI-SDR) [16] w.r.t. the learned latent representation. Moreover, we prove that when the decoder is a transposed convolutional layer [19, 12], SI-SDR on the latent space bounds the time-domain SI-SDR from below. Our experiments show that by maximizing SI-SDR on the learned latent targets, a consistent performance improvement is achieved across multiple sound separation tasks compared to the time-domain end-to-end training approach when using the exact same model architecture. The SI-SDR upper bound using the learned latent space is also significantly higher than that of STFT-domain masks. Finally, we observe that the pre-trained encoder representations are sparser and more structured compared to the joint training approach.

2 Two-step source separation

Assuming a mixture that consists of sources with samples each in the time-domain, we propose to perform source separation in two independent steps: A) We first obtain a latent representation for the source signals and for the input mixture. B) Then, we train a separation module which operates on the latent representation of the mixture and is trained to estimate the latent representation of the clean sources (or their masks in that space).

(a) Step 1: Learning the latent targets
(b) Step 2: Training the separation module only
Figure 1: Training a separation network in two independent steps. For each step, the non-trainable parts are represented with a dashed line.

2.1 Step 1: Learning the Latent Targets

As a first step we train an encoder in order to obtain a latent representation for the mixture . We also provide the clean sources as inputs to this encoder to obtain and apply a softmax function (across the dimension of the sources) in order to obtain separation masks for each source. An element-wise multiplication of these masks with the latent representation of the mixture , can be used as an estimate for each source. The decoder module is then trained to transform these latent representations back to time-domain using . In order to train the encoder and the decoder we optimize the permutation invariant [28] SI-SDR [16] between the clean sources s and the estimated sources :


where denotes the permutation of the sources that maximizes SI-SDR and the scalar ensures that the loss is scale invariant. A schematic representation of the aforementioned step for two sources is depicted in Fig. 1(a). The objective of this step is to find a latent representation transformation which facilitates source separation through masking.
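As a rough illustration of the objective described above, the following NumPy sketch computes a negative permutation-invariant SI-SDR for a small number of sources. The function names `si_sdr` and `pit_si_sdr_loss`, the `eps` stabilizer and the exhaustive search over permutations are assumptions made for this sketch; they are not taken from the authors' implementation.

```python
import itertools

import numpy as np


def si_sdr(target, estimate, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D target and a 1-D estimate."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal rescaling of the target: alpha = <estimate, target> / ||target||^2
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled = alpha * target
    noise = scaled - estimate
    return 10.0 * np.log10((np.dot(scaled, scaled) + eps) / (np.dot(noise, noise) + eps))


def pit_si_sdr_loss(sources, estimates):
    """Negative mean SI-SDR under the best assignment of sources to estimates.

    sources, estimates: arrays of shape (n_sources, n_samples).
    Returns (loss, best_perm), where best_perm[i] is the index of the source
    matched to estimates[i].
    """
    n_sources = sources.shape[0]
    best_score, best_perm = -np.inf, None
    # Exhaustive search is fine for two or three sources (assumption of this sketch).
    for perm in itertools.permutations(range(n_sources)):
        score = np.mean([si_sdr(sources[p], estimates[i]) for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return -best_score, best_perm
```

Because the target is rescaled by the projection coefficient before the ratio is formed, multiplying an estimate by a non-zero constant leaves the value essentially unchanged, which is the scale invariance the text refers to.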

2.2 Step 2: Training the Separation Module

Once the weights of the encoder and decoder modules are fixed using the training recipe described in Step 1, we can train a separation module . Given the latent representation of an input mixture , is trained to produce an estimate of the latent representation for each clean source , i.e. . During inference, we can use the pre-trained decoder to transform the source estimates back into the time-domain . The block diagram describing the training of the separation module with a fixed encoder and decoder is shown in Fig. 1(b).

2.2.1 Training using SI-SDR on the Latent Separation Targets

In contrast to recent time-domain source-separation approaches [18, 19] which train all modules , , and using a variant of the loss defined in Eq. 1, we propose to use the permutation invariant SI-SDR directly on the latent representation. For simplicity of notation we assume that each source has a vector latent representation in a high dimensional space. The loss for training the separation module could then be: . The exact same training procedure could be followed, but now we can use as targets the optimal separation targets on the latent space as opposed to the time domain signals. The premise is that if the separation module is trained on producing latent representations which are close to the ideal ones (assuming ideal permutation order) then the estimates of the sources after the decoding layer would also approximate the clean sources in time-domain . The latter might not hold for any arbitrary embedding process, but in the next section we prove that SI-SDR in the latent representations lower-bounds the SI-SDR in the time-domain.
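A minimal sketch of a latent-space SI-SDR loss follows, under the assumption that the latent targets are formed as softmax masks (across sources) applied to the mixture's latent representation, as described in Section 2.1. Random non-negative arrays stand in for encoder outputs, the shapes are toy values, and the permutation search is omitted for brevity; all names are hypothetical.

```python
import numpy as np


def softmax_masks(latents):
    """latents: (n_sources, channels, frames) -> masks that sum to 1 across sources."""
    shifted = latents - latents.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def latent_si_sdr(target, estimate, eps=1e-8):
    """SI-SDR in dB between two latent tensors, each treated as one flat vector."""
    t = np.ravel(target).astype(float)
    e = np.ravel(estimate).astype(float)
    alpha = np.dot(e, t) / (np.dot(t, t) + eps)
    scaled = alpha * t
    noise = scaled - e
    return 10.0 * np.log10((np.dot(scaled, scaled) + eps) / (np.dot(noise, noise) + eps))


rng = np.random.default_rng(0)
# Stand-ins for encoder outputs (non-negative, as after a ReLU):
# 2 sources, 8 channels, 50 frames. These sizes are illustrative only.
source_latents = np.maximum(rng.standard_normal((2, 8, 50)), 0.0)
mixture_latent = np.maximum(rng.standard_normal((8, 50)), 0.0)

masks = softmax_masks(source_latents)
latent_targets = masks * mixture_latent  # broadcasts to (2, 8, 50)

# Pretend output of a separation module: the targets plus a small error.
estimates = latent_targets + 0.05 * rng.standard_normal(latent_targets.shape)

# Negative mean SI-SDR over sources, computed entirely in the latent space.
loss = -np.mean([latent_si_sdr(t, e) for t, e in zip(latent_targets, estimates)])
```

In an actual training loop the same quantity would be written with a differentiable framework so that gradients reach the separation module only, with the encoder and decoder kept fixed.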

2.2.2 Relation to maximization of SI-SDR on Time-Domain

We restrict ourselves to a decoder that consists of a 1-D transposed convolutional layer which is the same as the decoder selection in most of the current end-to-end source separation approaches [18, 19, 12, 25]. For this part we focus on the th target latent representation that corresponds to a source time-domain signal . Because the encoder-decoder modules are trained as described in Section 2.1, the separation target produced by the auto-encoder would be close to the clean source , namely:


The separation network produces an estimated latent vector that corresponds to an estimated time-domain signal . Because the decoder is just a convolutional layer we can express it as a linear projection using the matrix :


Assuming the Moore-Penrose pseudo-inverse of P is well defined, we express the inverse mapping from time to the latent-space as:
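The linear-projection view of the decoder can be checked numerically. The sketch below builds, for toy sizes, a matrix `P` that reproduces a 1-D transposed convolution (overlap-add of scaled kernels) and then takes its Moore-Penrose pseudo-inverse with `numpy.linalg.pinv`. The helper names, the channel-major flattening of the latent representation and all sizes are assumptions of this example.

```python
import numpy as np


def transposed_conv1d(latent, kernels, hop):
    """Overlap-add decoding. latent: (channels, frames); kernels: (channels, kernel_len)."""
    channels, frames = latent.shape
    kernel_len = kernels.shape[1]
    out = np.zeros((frames - 1) * hop + kernel_len)
    for c in range(channels):
        for l in range(frames):
            out[l * hop : l * hop + kernel_len] += latent[c, l] * kernels[c]
    return out


def decoder_matrix(kernels, frames, hop):
    """Matrix P such that transposed_conv1d(v, kernels, hop) == P @ v.ravel()."""
    channels, kernel_len = kernels.shape
    n_samples = (frames - 1) * hop + kernel_len
    P = np.zeros((n_samples, channels * frames))
    for c in range(channels):
        for l in range(frames):
            # Column index follows the row-major flattening of a (channels, frames) array.
            P[l * hop : l * hop + kernel_len, c * frames + l] = kernels[c]
    return P


rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 6))  # 4 channels, kernel length 6 (toy sizes)
latent = rng.standard_normal((4, 10))  # 10 frames
hop = 3

P = decoder_matrix(kernels, frames=10, hop=hop)
A = np.linalg.pinv(P)  # maps time-domain vectors back to the latent space
```

In this toy setting `P` has more columns than rows, so `A` is a generalized inverse rather than an exact left inverse; the identity `P @ A @ P == P` still holds up to floating-point error.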

Proposition 1.

Let and their corresponding projections through to defined as Ay and , respectively. If then the absolute value of their inner product on the projection space is bounded above from the absolute value of their inner product in , namely: , where and depends only on the values of A.


The inner product in the projection space can be rewritten as:


Moreover, we can bound the first term of Eq. 5 by applying Cauchy-Schwarz inequality to the inner products and using the fact that as shown next:


Similarly, we use Cauchy-Schwarz inequality and inequality 6 in order to bound the second term of Eq. 5 as well:


Then by applying inequalities 6 and 7 to Eq. 5 we get:


where always . Finally, we conclude that . ∎

Proposition 2.

Let , with unit norms, then maximizing w.r.t. is equivalent to maximizing w.r.t. .


By assuming that there is an optimal solution :


Which means that the two optimization goals are equivalent. ∎

Now we focus on the relationship of the maximization of SI-SDR for the th source when it is performed directly on the latent space and when it is performed on the time-domain using the clean source as a target . Again because all the SI-SDR measures are scale-invariant, we can assume that the separation targets and the estimates vectors have unit norms on both the time-domain and the latent space, namely . By using Proposition 1 we get:


Thus, by using the auto-encoder property (Eq. 2) and Proposition 2 we conclude that on the latent space lower bounds the corresponding value on the time-domain. The same proof holds for any encoder and for other targets on the latent space such as the masks . Empirically, we indeed notice that the maximization of on the latent space leads to the maximization of on the time-domain.

3 Experimental Framework

To experimentally verify our approach we perform a set of source separation experiments as described in the following sections.

3.1 Audio Data

We use two audio data collections. For speech sources we use speech utterances from the Wall Street Journal (WSJ0) corpus [20]. Training, validation and test speaker mixtures are generated by randomly selecting various speakers from the sets si_tr_s, si_dt_05 and si_et_05, respectively.

For non-speech sounds we use the secs audio clips, which are equally balanced between classes, from the environmental sound classification (ESC50) data collection [21]. ESC50 spans various sound categories such as non-speech human sounds, animal sounds, natural soundscapes, interior sounds and urban noises. We split the data into train, validation and test sets with a ratio of , respectively. For each set, the same prior is used across classes (i.e., each class has the same number of clips). Also, the sets do not share clips which originate from the same initial source file.

3.2 Sound Source Separation Tasks

In order to develop a system capable of performing universal sound source separation [12], we evaluate our two-step approach under three distinct sound separation tasks. For all separation tasks, each input mixture consists of two sources which are always mixed using secs of their total duration. All audio clips are downsampled to kHz for efficient processing. We discuss the audio collection(s) that we utilize and the mixture generation process in the sections below.

3.2.1 Speech Separation

We only use audio clips containing human speech from WSJ0. In accordance with other studies performing experiments on single-channel speech source separation [19, 23, 15, 27, 26], we use the publicly available WSJ0-2mix dataset [6]. In total there are , and mixtures for training, validation and testing, respectively.

3.2.2 Non-Speech Separation

We use audio clips only from ESC50. In this case, the total number of available clean source sounds is small, and thus we propose an augmented mixture generation process which enables the generation of much more diverse mixtures. In order to generate each mixture, we randomly select a sec segment from two audio files from two distinct audio classes. We mix these two segments with a random signal-to-noise ratio (SNR) between dB. For each epoch, training mixtures are generated which generally are not the same as the ones generated for other epochs. For validation and test sets we fix their random seeds in order to always evaluate on the same and generated mixtures, respectively.
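The mixing step described above can be sketched as follows: crop a random segment from each clip, then rescale the second segment so that the power ratio between the two matches a randomly drawn SNR. The segment length, the SNR range and the function names are placeholders chosen for this example, since the exact values are not given in this copy of the text.

```python
import numpy as np


def random_segment(clip, segment_len, rng):
    """Crop a random contiguous segment of `segment_len` samples from a 1-D clip."""
    start = rng.integers(0, len(clip) - segment_len + 1)
    return clip[start : start + segment_len]


def mix_at_snr(first, second, snr_db):
    """Rescale `second` so that power(first) / power(second) equals snr_db, then sum."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    power_first = np.mean(first ** 2)
    power_second = np.mean(second ** 2)
    gain = np.sqrt(power_first / (power_second * 10.0 ** (snr_db / 10.0)))
    scaled_second = gain * second
    return first + scaled_second, first, scaled_second


rng = np.random.default_rng(0)  # a fixed seed gives a reproducible mixture set
SEGMENT_LEN = 400               # placeholder segment length in samples
SNR_RANGE_DB = (-2.5, 2.5)      # placeholder range, not taken from the text

clip_a = rng.standard_normal(1000)  # stand-ins for two clips of different classes
clip_b = rng.standard_normal(1000)
snr_db = rng.uniform(*SNR_RANGE_DB)
mixture, source_a, source_b = mix_at_snr(
    random_segment(clip_a, SEGMENT_LEN, rng),
    random_segment(clip_b, SEGMENT_LEN, rng),
    snr_db,
)
```

Re-creating the generator with the same seed reproduces the same mixtures, which mirrors the fixed seeds used for the validation and test sets, while a fresh seed per epoch yields new training mixtures.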

3.2.3 Mixed Separation

All four possible mixture combinations between speech and non-speech audio are considered by using both WSJ0 and ESC50 sources. Building upon the data augmentation training idea, we also add a random variable which controls the data collection (ESC50 or WSJ0) from which a source waveform is going to be chosen. Specifically, we set an equal probability of choosing a source file from the two collections (ESC50 and WSJ0). For WSJ0 each speaker is considered a distinct sound class; thus, no mixture consists of utterances from the same speaker. After the two source waveforms are chosen, we follow the mixture generation process described in Section 3.2.2.
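One way to sketch the source selection is shown below: each of the two sources picks its collection with equal probability, then a class (or speaker) within it, and a draw is repeated if it would reuse the class already chosen. The dictionary layout, the class labels and the rejection step are assumptions of this example, not details from the text.

```python
import random


def pick_two_sources(collections, rng):
    """Pick two clips from distinct classes.

    collections: dict mapping collection name -> dict mapping class label -> list of clips.
    Returns a list of two (collection, label, clip) tuples.
    """
    picks = []
    while len(picks) < 2:
        name = rng.choice(sorted(collections))          # equal probability per collection
        label = rng.choice(sorted(collections[name]))   # class, or speaker for speech
        if any(p[0] == name and p[1] == label for p in picks):
            continue  # resample: the two sources must come from distinct classes
        clip = rng.choice(collections[name][label])
        picks.append((name, label, clip))
    return picks


# Hypothetical toy catalogue; real labels and file names would come from the datasets.
toy_collections = {
    "ESC50": {"dog": ["dog_1", "dog_2"], "rain": ["rain_1"]},
    "WSJ0": {"speaker_a": ["utt_a1"], "speaker_b": ["utt_b1", "utt_b2"]},
}
rng = random.Random(0)
pair = pick_two_sources(toy_collections, rng)
```

Repeated calls produce all four collection combinations (speech with speech, speech with non-speech in either order, and non-speech with non-speech), and the chosen pair can then be passed to the SNR-based mixing procedure.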


3.3 Selected Network Architectures

Based on recent state-of-the-art approaches on both speech and universal sound source separation with learnable encoder and decoder modules, we consider configurations for the encoder-decoder parts as well as the separation module which are based on a similar time-dilated convolutional network (TDCN) architecture. In particular, we consider our implementations of ConvTasNet [19], which we refer to simply as TDCN, and its improved version proposed in [12], which we refer to as residual-TDCN (RTDCN).

3.3.1 Encoder-Decoder Architecture

The encoder consists of one 1-D convolutional layer and a ReLU activation on top in order to ensure a non-negative latent representation of each audio input. Following the assumptions stated in Section 2.2.2, we use a 1-D transposed convolutional layer for the decoder . Both encoder and decoder have the same number of channels (or number of bases) and their 1-D kernels have a length corresponding to ms ( samples) and a hop-size equivalent to ms ( samples). For each task we select a different number of channels for the encoder and the decoder modules (, and for speech only, mixed and non-speech only separation tasks, respectively).
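A minimal NumPy sketch of such an encoder follows: the signal is cut into overlapping frames, each frame is projected onto a bank of kernels (one per channel), and a ReLU keeps the result non-negative. The kernel length, hop size and channel count below are placeholders for illustration, and random kernels stand in for learned ones.

```python
import numpy as np


def encode(signal, kernels, hop):
    """Strided 1-D convolution without padding, followed by a ReLU.

    signal: (n_samples,); kernels: (channels, kernel_len).
    Returns a non-negative latent representation of shape (channels, frames).
    """
    kernel_len = kernels.shape[1]
    n_frames = 1 + (len(signal) - kernel_len) // hop
    # Stack overlapping frames as columns: (kernel_len, n_frames).
    frames = np.stack(
        [signal[i * hop : i * hop + kernel_len] for i in range(n_frames)], axis=1
    )
    # As in deep learning frameworks, the "convolution" is a sliding dot product.
    return np.maximum(kernels @ frames, 0.0)


rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 21))  # placeholder: 4 channels, 21-sample kernels
signal = rng.standard_normal(100)
latent = encode(signal, kernels, hop=10)  # placeholder hop size
```

Each row of `latent` can be read as the activation of one basis over time, which is what makes the spectrogram-like visualizations of the latent space possible.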

3.3.2 Separation Modules Architectures

Our implementation of TDCN consists of the same architecture and parameter configuration for the separation module as described in [19], with an additional batch normalization layer before the final mask estimation, which improved its performance over the original version on all separation tasks. Inspired by the original RTDCN separation module [12], we keep the same parameter configuration as TDCN and additionally use a feature-wise normalization between layers instead of global normalization. We also add long-term residual connections from previous layers. Moreover, before summing the residual connections, we concatenate them, normalize them and feed them through a dense layer, as the latter yields some further improvement in separation performance. Code is available online.

3.4 Training and Evaluation Details

In order to show the effectiveness of our proposed two-step approach, we use the same network architecture when we perform end-to-end time-domain source separation, and use as a loss the negative SI-SDR between the estimated signals on the time-domain and the clean waveforms . Instead, in our two-step approach, we train the encoder-decoder parts separately as described in Section 2.1. In the second step, we use the pre-trained encoders for each task and train the separation module using as loss the negative SI-SDR on the latent space targets or their corresponding masks (see Section 2.2). We train all models using the Adam optimizer [13]; the batch size is equal to , the initial learning rate is set to and we decrease it by a factor of at the th epoch. We train TDCN and RTDCN separation networks for epochs and epochs, respectively. The encoder-decoder parts for each task are trained independently for epochs ( times faster than training the separation network). We evaluate the separation performance for all models using SI-SDR improvement (SI-SDRi) on the time domain, which is the difference between the SI-SDR of the estimated signal and that of the input mixture signal [19, 12]. As the STFT oracle mask we choose the ideal ratio mask (IRM) using a Hanning window with ms length and ms hop-size [19].
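The SI-SDRi metric described above can be sketched in a few lines of NumPy: compute the SI-SDR of the estimate against the clean source, then subtract the SI-SDR obtained when the unprocessed mixture is used as the estimate. The function names, the `eps` stabilizer and the synthetic signals are assumptions of this sketch.

```python
import numpy as np


def si_sdr(target, estimate, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D target and a 1-D estimate."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled = alpha * target
    noise = scaled - estimate
    return 10.0 * np.log10((np.dot(scaled, scaled) + eps) / (np.dot(noise, noise) + eps))


def si_sdr_improvement(target, estimate, mixture):
    """SI-SDR of the estimate minus SI-SDR of the input mixture, both w.r.t. the target."""
    return si_sdr(target, estimate) - si_sdr(target, mixture)


rng = np.random.default_rng(0)
source_1 = rng.standard_normal(8000)
source_2 = rng.standard_normal(8000)
mixture = source_1 + source_2

# A synthetic "separated" signal that still leaks 10% of the interfering source.
estimate_1 = source_1 + 0.1 * source_2
improvement_db = si_sdr_improvement(source_1, estimate_1, mixture)
```

Returning the mixture itself as the estimate gives an improvement of exactly 0 dB, so the metric measures only what the separation adds beyond doing nothing.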

4 Results & Discussion

4.1 Comparison with Time-Domain Separation

In Table 1, the mean separation performance of the best models is reported for each task. We notice that the proposed two-step approach with training on the latent space leads to a consistent improvement over the end-to-end approach where we train the same architecture using the time-domain SI-SDR loss. This observation holds when different separation modules are used and when we test them under different separation tasks. The non-speech separation task seems to be the hardest one, since the models have access to only a limited number of training mixtures, which further underlines the importance of our proposed data-augmentation technique as described in Section 3.2.2. Our two-step approach yields an absolute SI-SDR improvement over the end-to-end baseline of up to 0.7 dB, 0.5 dB and 0.7 dB for speech, non-speech and mixed separation tasks, respectively. Notably, this performance improvement is achieved using the exact same architecture, but instead of training it end-to-end using a time-domain loss, we pre-train the auto-encoder part and use a loss on the latent representations of the sources.


Separation Module   Target Domain   Speech   Non-speech   Mixed
TDCN                Time            15.4     7.7          11.7
TDCN                Latent          16.1     8.2          12.4
RTDCN               Time            15.6     8.3          12.0
RTDCN               Latent          16.2     8.4          12.6
Oracle Masks        STFT            13.0     14.8         14.5
Oracle Masks        Latent          34.1     39.2         39.5
Table 1: Mean SI-SDRi (dB) of best performing models.
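The gain of latent-target training over time-domain training can be read directly off Table 1. The short script below copies the table values and computes, per module and task, the difference between the Latent and Time rows, along with the largest gain per task; the variable names are illustrative.

```python
# Mean SI-SDRi (dB) values copied from Table 1, ordered as (speech, non-speech, mixed).
table = {
    "TDCN": {"Time": (15.4, 7.7, 11.7), "Latent": (16.1, 8.2, 12.4)},
    "RTDCN": {"Time": (15.6, 8.3, 12.0), "Latent": (16.2, 8.4, 12.6)},
}
tasks = ("speech", "non-speech", "mixed")

# Per-module gain of latent-space training over time-domain training, in dB.
gains = {
    module: tuple(
        round(latent - time, 1) for latent, time in zip(rows["Latent"], rows["Time"])
    )
    for module, rows in table.items()
}

# Largest gain per task across the two separation modules.
best_gain = {task: max(g[i] for g in gains.values()) for i, task in enumerate(tasks)}

print(gains)
print(best_gain)
```

Every entry of `gains` is positive, which is the consistency claim made in Section 4.1; the largest margins come from the TDCN rows.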
(a) Proposed two-step learning approach.
(b) Joint end-to-end training using time-domain SI-SDR loss.
Figure 2: Latent representations of a sec mixture and its constituent sources when training the same encoder architecture: a) individually using the proposed two-step approach (top) b) jointly with the TDCN separation module using SI-SDR loss on time-domain (bottom). We sort the basis indexes w.r.t. their energy and we raise the value of each cell to for better visualization.

4.2 Separation Targets in the Latent Space

In Table 1, we see that the oracle mask obtained from the two-step approach gives a much higher upper bound of separation performance, for all tasks, compared to ideal masks on the STFT domain. This is in line with the prior work that proposed to decompose signals using learned transforms [19, 25]. In Fig. 2 we can qualitatively compare the latent representations obtained from the same encoder when trained with our proposed two-step approach and with the baseline joint training of all modules. When the encoder and decoder are trained individually, fewer bases are used to encode the input, which leads to a sparser representation ( norm is roughly smaller compared to the joint training approach). Finally, the latent representations obtained from our proposed approach exhibit a spectrogram-like structure, in the sense that speech is encoded using fewer bases than high-frequency sounds like bird chirping.

5 Conclusion

We show that pre-learning an optimal latent space can result in better source separation performance compared to a time-domain end-to-end training approach. Our experiments show that the proposed two-step approach yields a consistent performance improvement under multiple sound separation tasks. Additionally, the obtained sound latent representations remain sparse and structured while they also enjoy a much higher upper bound of separation performance compared to STFT-domain masks. Although this approach was demonstrated on TDCN architectures, it can be easily adapted for use with any other mask-based system.


  • [1] A. Belouchrani and M. G. Amin (1998) Blind source separation based on time-frequency signal representations. IEEE Transactions on Signal Processing 46 (11), pp. 2888–2897. Cited by: §1.
  • [2] S. Choi, A. Cichocki, H. Park, and S. Lee (2005) Blind source separation and independent component analysis: a review. Neural Information Processing-Letters and Reviews 6 (1), pp. 1–57. Cited by: §1.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1.
  • [4] L. Drude, D. Hasenklever, and R. Haeb-Umbach (2019) Unsupervised training of a deep clustering model for multichannel blind source separation. In Proc. ICASSP, pp. 695–699. Cited by: §1.
  • [5] E. M. Grais, G. Roma, A. J. Simpson, and M. D. Plumbley (2017) Two-stage single-channel audio source separation using deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25 (9), pp. 1773–1783. Cited by: §1.
  • [6] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35. Cited by: §1, §1, §3.2.1.
  • [7] J. Heymann, L. Drude, and R. Haeb-Umbach (2016) Neural network based spectral mask estimation for acoustic beamforming. In Proc. ICASSP, pp. 196–200. Cited by: §1.
  • [8] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis (2014) Deep learning for monaural speech separation. In Proc. ICASSP, pp. 1562–1566. Cited by: §1, §1.
  • [9] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis (2015) Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (12), pp. 2136–2147. Cited by: §1.
  • [10] Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey (2016) Single-channel multi-speaker separation using deep clustering. In Proc. Interspeech, Cited by: §1.
  • [11] A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde (2017) Singing voice separation with deep u-net convolutional networks. In Proc. ISMIR, pp. 323–332. Cited by: §1.
  • [12] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey (2019) Universal sound separation. Proc. WASPAA. Cited by: §1, §1, §1, §2.2.2, §3.2, §3.3.2, §3.3, §3.4.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
  • [14] J. Le Roux, J. R. Hershey, and F. Weninger (2015) Deep nmf for speech separation. In Proc. ICASSP, pp. 66–70. Cited by: §1.
  • [15] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey (2019) Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing 13 (2), pp. 370–382. Cited by: §3.2.1.
  • [16] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR–half-baked or well done?. In Proc. ICASSP, pp. 626–630. Cited by: §1, §2.1.
  • [17] Y. Liu and D. Wang (2019) Divide and conquer: a deep casa approach to talker-independent monaural speaker separation. arXiv preprint arXiv:1904.11148. Cited by: §1.
  • [18] Y. Luo and N. Mesgarani (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In Proc. ICASSP, pp. 696–700. Cited by: §2.2.1, §2.2.2.
  • [19] Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §1, §1, §1, §2.2.1, §2.2.2, §3.2.1, §3.3.2, §3.3, §3.4, §4.2.
  • [20] D. B. Paul and J. M. Baker (1992) The design for the wall street journal-based CSR corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, Cited by: §3.1.
  • [21] K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proc. ACM International Conference on Multimedia, pp. 1015–1018. Cited by: §3.1.
  • [22] P. Seetharaman, G. Wichern, J. Le Roux, and B. Pardo (2019) Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures. In Proc. ICASSP, pp. 356–360. Cited by: §1.
  • [23] Z. Shi, H. Lin, L. Liu, R. Liu, S. Hayakawa, and J. Han (2019) Furcax: end-to-end monaural speech separation based on deep gated (de) convolutional neural networks with adversarial example training. In Proc. ICASSP, pp. 6985–6989. Cited by: §3.2.1.
  • [24] E. Tzinis, S. Venkataramani, and P. Smaragdis (2019) Unsupervised deep clustering for source separation: direct learning from mixtures using spatial information. In Proc. ICASSP, pp. 81–85. Cited by: §1.
  • [25] S. Venkataramani, J. Casebeer, and P. Smaragdis (2018) End-to-end source separation with adaptive front-ends. In Proc. Asilomar Conference on Signals, Systems, and Computers, pp. 684–688. Cited by: §1, §2.2.2, §4.2.
  • [26] Z. Wang, K. Tan, and D. Wang (2019) Deep learning based phase reconstruction for speaker separation: a trigonometric perspective. In Proc. ICASSP, pp. 71–75. Cited by: §1, §1, §3.2.1.
  • [27] G. Wichern and J. Le Roux (2018) Phase reconstruction with learned time-frequency representations for single-channel speech separation. In Proc. IWAENC, pp. 396–400. Cited by: §1, §3.2.1.
  • [28] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, pp. 241–245. Cited by: §1, §2.1.