A two step optimization for sound source separation on the adaptive front-end domain
In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and it's inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal to distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance as compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.READ FULL TEXT VIEW PDF
A two step optimization for sound source separation on the adaptive front-end domain
Single-channel audio source separation is a fundamental problem in audio analysis, where one extracts the individual sources that constitute a mixture signal 
. Popular algorithms for source separation include independent component analysis, non-negative matrix factorization  and more recently supervised [8, 6, 11, 19, 26] and unsupervised [24, 22, 4]
deep learning approaches. In many of the recent approaches, separation is performed by applying a mask on a latent representation, which is often a Fourier-based or a learned domain. Specifically, a separation module produces an estimated masked latent representation for the input sources and a decoder translates them back to the time domain.
Many approaches have used the short-time Fourier transform (STFT) as an encoder to obtain this latent representation, and conversely the inverse STFT (iSTFT) as a decoder. Using this representation, separation networks have been trained using a loss defined over various targets, such as: raw magnitude spectrogram representations, ideal STFT masks [9, 7] and ideal affinity matrices [6, 10]. Other works have supplemented this by additionally reconstructing the phase of the sources [26, 27]. However, the ideal STFT masks impose an upper bound on the separation performance the aforementioned criteria do not necessarily translate to optimal separation. In order to address this, recent works have proposed end-to-end separation schemes where the encoder, decoder and separation modules are jointly optimized using a time-domain loss between the reconstructed sources waveforms and their clean targets [19, 25, 12]. However, a joint time-domain end-to-end training approach might not always yield an optimal decomposition of the input mixtures resulting to worse performance than the fixed STFT bases .
Some studies have reported significant benefits when performing source-separation in two stages. In , first the sources are separated and in a second stage the interference between the estimated sources is reduced. Similarly, an iterative scheme is proposed in , where the separation estimates from the first network are used as input to the final separation network. In 
, speaker separation is performed by first separating frame-level spectral components of speakers and later sequentially grouping them using a clustering network. Lately, state-of-the-art results in most natural language processing tasks have been achieved by pre-training the encoder transformation network.
In this work, we propose a general two-step approach for performing source separation which can be used in any mask-based separation architecture. First we pre-train an encoder and decoder in order to learn a suitable latent representation. In the second step, we train a separation module using as loss the negative permutation invariant  scale invariant SDR (SI-SDR)  w.r.t. the learned latent representation. Moreover, we prove that for the case that the decoder is a transpose convolutional layer [19, 12], SI-SDR on the latent space bounds from below time-domain SI-SDR. Our experiments show that by maximizing SI-SDR on the learned latent targets, a consistent performance improvement is achieved across multiple sound separation tasks compared to the time-domain end-to-end training approach when using the exact same model architecture. The SI-SDR upper bound using the learned latent space is also significantly higher than that of STFT-domain masks. Finally, we also observe that the pre-trained encoder representations are also more sparse and structured compared to the joint training approach.
Assuming a mixture that consists of sources with samples each in the time-domain, we propose to perform source separation in two independent steps: A) We first obtain a latent representation for the source signals and for the input mixture. B) Then, we train a separation module which operates on the latent representation of the mixture and is trained to estimate the latent representation of the clean sources (or their masks in that space).
As a first step we train an encoder in order to obtain a latent representation for the mixture . We also provide the clean sources as inputs to this encoder to obtain and apply a softmax function (across the dimension of the sources) in order to obtain separation masks for each source. An element-wise multiplication of these masks with the latent representation of the mixture , can be used as an estimate for each source. The decoder module is then trained to transform these latent representations back to time-domain using . In order to train the encoder and the decoder we optimize the permutation invariant  SI-SDR  between the clean sources s and the estimated sources :
where denotes the permutation of the sources that maximizes SI-SDR and the scalar ensures that the loss is scale invariant. A schematic representation of the aforementioned step for two sources is depicted in Fig. 0(a). The objective of this step is to find a latent representation transformation, which facilitates source separation through masking.
Once the weights of the encoder and decoder modules are fixed using the training recipe described in Step 1, we can train a separation module . Given the latent representation of an input mixture , is trained to produce an estimate of the latent representation for each clean source , i.e. . During inference, we can use the pre-trained decoder to transform the source estimates back into the time-domain . The block diagram describing the training of the separation module with a fixed encoder and decoder is shown in Fig. 0(b).
, we propose to use the permutation invariant SI-SDR directly on the latent representation. For simplicity of notation we assume that each source has a vector latent representationin a high dimensional space. The loss for training the separation module could then be: . The exact same training procedure could be followed, but now we can use as targets the optimal separation targets on the latent space as opposed to the time domain signals. The premise is that if the separation module is trained on producing latent representations which are close to the ideal ones (assuming ideal permutation order) then the estimates of the sources after the decoding layer would also approximate the clean sources in time-domain . The latter might not hold for any arbitrary embedding process, but in the next section we prove that SI-SDR in the latent representations lower-bounds the SI-SDR in the time-domain.
We restrict ourselves to a decoder that consists of a 1-D transposed convolutional layer which is the same as the decoder selection in most of the current end-to-end source separation approaches [18, 19, 12, 25]. For this part we focus on the th target latent representation that corresponds to a source time-domain signal . Because the encoder-decoder modules are trained as described in Section 2.1, the separation target produced by the auto-encoder would be close to the clean source , namely:
The separation network produces an estimated latent vector that corresponds to an estimated time-domain signal . Because the decoder is just a convolutional layer we can express it as a linear projection using the matrix :
Assuming the Moore-Penrose pseudo-inverse of P is well defined, we express the inverse mapping from time to the latent-space as:
Let and their corresponding projections through to defined as Ay and , respectively. If then the absolute value of their inner product on the projection space is bounded above from the absolute value of their inner product in , namely: , where and depends only on the values of A.
The inner product in the projection space can be rewritten as:
Moreover, we can bound the first term of Eq. 5 by applying Cauchy-Schwarz inequality to the inner products and using the fact that as shown next:
where always . Finally, we conclude that . ∎
Let , with unit norms, then maximizing w.r.t. is equivalent to maximizing w.r.t. .
By assuming that there is an optimal solution :
Which means that the two optimization goals are equivalent. ∎
Now we focus on the relationship of the maximization of SI-SDR for the th source when it is performed directly on the latent space and when it is performed on the time-domain using the clean source as a target . Again because all the SI-SDR measures are scale-invariant, we can assume that the separation targets and the estimates vectors have unit norms on both the time-domain and the latent space, namely . By using Proposition 1 we get:
Thus, by using the auto-encoder property (Eq. 2) and Proposition 2 we conclude that on the latent space lower bounds the corresponding value on the time-domain. The same proof holds for any encoder and for other targets on the latent space such as the masks . Empirically, we indeed notice that the maximization of on the latent space leads to the maximization of on the time-domain.
To experimentally verify our approach we perform a set of source separation experiments as described in the following sections.
We use two audio data collections. For speech sources we use speech utterances from Wall street journal (WSJ0) corpus . Training, validation and test speaker mixtures are generated by randomly selecting various speakers from the sets si_tr_s, si_dt_05 and si_et_05, respectively.
For non-speech sounds we use the secs audio clips which are equally balanced between classes from the environmental sound classification (ESC50) data collection . ESC50 spans various sound categories such as: non-speech human sounds, animal sounds, natural soundscapes, interior sounds and urban noises. We split the data to train, validation and test sets with a ratio of , respectively. For each set, the same prior is used across classes (e.g., each class has the same number of clips). Also, the sets do not share clips which originate from the same initial source file.
In order to develop a system capable of performing universal sound source separation , we evaluate our two-step approach under three distinct sound separation tasks. For all separation tasks, each input mixture consists of two sources which are always mixed using secs of their total duration. All audio clips are downsampled to kHz for efficient processing. We discuss the audio collection(s) that we utilize and the mixture generation process in the sections below.
We only use audio clips containing human speech from WSJ0. In accordance to other studies performing experiments on single-channel speech source separation [19, 23, 15, 27, 26], we use the publicly available WSJ0-2mix dataset . In total there are , and mixtures for training, validation and testing, correspondingly.
We use audio clips only from ESC50. In this case, the total number of the available clean sources sounds is small, and thus, we propose an augmented mixture generation process which enables the generation of much more diverse mixtures. In order to generate each mixture, we randomly select a
sec segment from two audio files from two distinct audio classes. We mix these two segments with a random signal to noise ratio (SNR)s betweenand
dB. For each epoch,training mixtures are generated which generally are not the same with the ones generated for other epochs. For validation and test sets we fix their random seeds in order to always evaluate on the same and generated mixtures, respectively.
All four possible mixture combinations between speech and non-speech audio are considered by using both WSJ0 and ESC50 sources. Building upon the data augmentation training idea, we also add a random variable which controls the data collection (ESC50 or WSJ0) from which a source waveform is going to be chosen. Specifically, we set an equal probability of choosing a source file from the two collections (ESC50 and WSJ0). For WSJ0 each speaker is considered a distinct sound class, thus, no mixture consists utterances from the same speaker. After the two source waveforms are chosen, we follow the mixture generation process described in Section3.2.2.
Based on recent state-of-the-art approaches on both speech and universal sound source separation with learnable encoder and decoder modules, we consider configurations for the encoder-decoder parts as well as the separation module which are based on a similar time-dilated convolutional network (TDCN) architecture. In particular, we consider our implementations of ConvTasNet  that we refer simply as TDCN and its improved version proposed in  that we refer as residual-TDCN (RTDCN).
The encoder consists of one
D convolutional layer and a ReLU activation on top in order to ensure a non-negative latent representation of each audio input. Following the assumptions stated in Section2.2.2, we use a D transposed convolutional layer for the decoder . Both encoder and decoder have the same number of channels (or number of bases) and their D kernels have a length corresponding to ms ( samples) and a hop-size equivalent to ms ( samples). For each task we select a different number of channels for the encoder and the decoder modules (, and for speech only, mixed and non-speech only separation tasks, respectively).
Our implementation of TDCN consists of the same architecture and parameter configuration for the separation module as described in 
with an additional batch normalization layer before the final mask estimation which improved its performance over the original version on all separation tasks. Inspired by the original RTDCN separation module
, we keep the same parameter configuration as TDCN and we additionally use a feature-wise normalization between layers instead of global normalization. We also add long-term residual connections from previous layers. Moreover, before summing the residual connections, we concatenate them, normalize them and feed them through a dense layer as the latter yields some further improvement in separation performance. (Code is available online111Source code: github.com/etzinis/two_step_mask_learning.)
In order to show the effectiveness of our proposed two-step approach, we use the same network architecture when we perform end-to-end time-domain source separation and use as a loss the negative SI-SDR between the estimated signals on the time-domain and the clean waveforms . Instead in our two-step approach, we train the encoder-decoder parts separately as described in Section 2.1. In the second step, we use the pre-trained encoders for each task and train the separation module using as loss the negative SI-SDR on the latent space targets or their corresponding masks (see Section 2.2). We train all models using the Adam optimizer , the batch size is equal to , the initial learning rate is set to and we decrease it by a factor of at the th epoch. We train TDCN and RTDCN separation networks for epochs and epochs, respectively. The encoder-decoder parts for each task are trained independently for epochs ( times faster than training the separation network). We evaluate the separation performance for all models using SI-SDR improvement (SI-SDRi) on time domain which is the difference of SI-SDR of the estimated signal and the input mixture signal [19, 12]. As the STFT oracle mask we choose the ideal ratio mask (IRM) using a Hanning window with ms length and ms hop-size .
In Table 1, the mean separation performance of best models is reported for each task. We notice that the proposed two-step approach and training on the latent space leads to a consistent improvement over the end-to-end approach where we train the same architecture using the time-domain SI-SDR loss. This observation holds when different separation modules are used and when we test them under different separation tasks. The non-speech separation task seems the hardest one since the models have access to only a limited number of training mixtures which further underlines the importance of our proposed data-augmentation technique as described in Section 3.2.2. Our two-step approach yields an absolute SI-SDR improvement over the end-to-end baseline of up to dB, dB and dB for speech, non-speech and mixed separation tasks, respectively. Notably, this performance improvement is achieved using the exact same architecture but instead of training it end-to-end using a time-domain loss, we pre-train the auto-encoder part and use a loss on the latent representations of the sources.
|Separation||Target||Sound Separation Task|
In Table 1, we see that the oracle mask obtained from the two-step approach gives a much higher upper bound of separation performance, for all tasks, compared to ideal masks on the STFT domain. This is in line with the prior work that proposed to decompose signals using learned transforms [19, 25]. In Fig. 2 we can qualitatively compare the latent representations obtained from the same encoder when trained with our proposed two-step approach and with the baseline joint training of all modules. When the encoder and decoder are trained individually, a fewer number of bases are used to encode the input which leads to a sparser representation ( norm is roughly smaller compared to the joint training approach). Finally, the latent representations obtained from our proposed approach exhibit a spectrogram-like structure in a way that Speech is encoded using less bases than high frequency sounds like Bird Chirping.
We show how by pre-learning an optimal latent space can result in better source separation performance compared to a time-domain end-to-end training approach. Our experiments show that the proposed two-step approach yields a consistent performance improvement under multiple sound separation tasks. Additionally, the obtained sound latent representations remain sparse and structured while they also enjoy a much higher upper bound of separation performance compared to STFT-domain masks. Although this approach was demonstrated on TDCN architectures, it can be easily adapted for use with any other mask-based system.
Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (12), pp. 2136–2147. Cited by: §1.
Furcax: end-to-end monaural speech separation based on deep gated (de) convolutional neural networks with adversarial example training. In Proc. ICASSP, pp. 6985–6989. Cited by: §3.2.1.