1 Introduction
Single-channel source separation deals with the problem of extracting the speaker or sound of interest from a mixture of multiple simultaneous speakers or audio sources. In order to identify the source, we assume the availability of a few unmixed training examples, which are used to build representative models for the corresponding source. With the development of deep learning, several neural network architectures have been proposed to solve the supervised single-channel source separation problem [1, 2, 3]. The latest deep-learning approaches to source separation have started to focus on performing separation by operating directly on the mixture waveforms [4, 5, 6, 7]. To train these end-to-end models, existing papers restrict themselves to minimizing a mean-squared error loss [4, 5, 6], an L1 loss [7], or a source-to-distortion ratio [4, 5] based cost function between the separated speech and the corresponding ground truth. A potential direction for improving end-to-end models is to use loss functions that capture the salient aspects of source separation. Predominantly, the BSS_Eval metrics source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) [8], along with short-time objective intelligibility (STOI) [9], have been used to evaluate the performance of source separation algorithms. Fu et al. have proposed an end-to-end neural network that captures the effect of STOI in performing source separation [10]. Alternatively, we could develop suitable cost functions for end-to-end source separation by interpreting these metrics as loss functions themselves. Proposing and evaluating these new cost functions for source separation would also allow us to mix and match a combination of these metrics to suit our requirements and improve source separation performance, for any neural network architecture.
Section 2 describes the neural network used for end-to-end source separation. Section 3 presents our approach for interpreting the BSS_Eval and STOI metrics as loss functions for end-to-end source separation. We evaluate our cost functions using subjective listening tests. The details of our experiments, the subjective listening tests, and the corresponding results are discussed in Section 4, and we conclude in Section 5.
2 End-to-end separation
We first describe the end-to-end neural network architecture used for source separation. We begin with a short-time Fourier transform (STFT) based source separation network, which can be transformed into an end-to-end separation network by replacing the STFT analysis and synthesis operations with their neural network alternatives [4]. Figure 1(a) shows the architecture of a source separation network [4]. The flow of data through the network can be explained by the following sequence of steps. The mixture is first transformed into its equivalent time-frequency (TF) representation using the STFT. The TF representation is then split into its magnitude and phase components. The magnitude spectrogram of the mixture is fed to the separation neural network, which is trained to estimate the magnitude spectrogram of the source of interest from the magnitude spectrogram of the mixture. The estimated magnitude spectrogram is multiplied by the phase of the mixture and transformed back into the time domain using the overlap-and-add approach to invert the STFT.
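The sequence of steps above can be sketched as follows (a minimal NumPy illustration, not the paper's implementation; the `model` callable and the frame settings are placeholders, and non-overlapping rectangular frames are used for simplicity in place of overlap-and-add):

```python
import numpy as np

def separate_stft(mix, model, frame=512):
    """Sketch of the STFT separation pipeline of Figure 1(a).

    `model` is a hypothetical map from a mixture magnitude spectrogram to the
    estimated magnitude spectrogram of the source of interest."""
    n = len(mix) // frame
    F = np.fft.rfft(mix[: n * frame].reshape(n, frame), axis=1)  # STFT analysis
    mag, phase = np.abs(F), np.angle(F)          # split magnitude and phase
    est = model(mag)                             # magnitude estimate of the source
    Y = est * np.exp(1j * phase)                 # reuse the mixture phase
    return np.fft.irfft(Y, n=frame, axis=1).ravel()  # synthesis back to time domain
```

With an identity `model`, the pipeline reconstructs the mixture, which is a useful sanity check of the analysis/synthesis pair.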
As described in [4], we can transform this network into an end-to-end source separation network by replacing the STFT blocks with corresponding neural network layers, using the following sequence of steps. (i) The STFT and inverse STFT operations can be replaced by 1D convolution and transposed-convolution layers. This enables the network to learn an adaptive TF representation directly from the waveform of the mixture. (ii) The front-end convolutional layer is followed by a smoothing convolutional layer, which produces a smooth modulation spectrogram similar to an STFT magnitude spectrogram. The carrier component, obtained using an element-wise division operation, incorporates the rapid variations of the adaptive TF representation. We will refer to this front end as the autoencoder transform (AET). Figure 1(b) gives the block diagram of the end-to-end separation network using an AET front end.
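The AET front end described above can be sketched as follows (a hedged NumPy illustration with random placeholder bases standing in for the learned ones; the sizes are our own and not the paper's):

```python
import numpy as np

def aet_analysis(mix, bases, hop):
    """Analysis step: a strided 1-D convolution with the basis matrix,
    producing an adaptive TF representation of shape (n_bases, n_frames)."""
    win = bases.shape[1]
    n_frames = 1 + (len(mix) - win) // hop
    frames = np.stack([mix[m * hop : m * hop + win] for m in range(n_frames)], axis=1)
    return bases @ frames

def aet_front_end(mix, bases, smooth_len=5, hop=16):
    """Sketch of the AET front end: the magnitude of the adaptive representation
    is smoothed along time to give the modulation spectrogram; element-wise
    division recovers the carrier holding the rapid variations."""
    h = aet_analysis(mix, bases, hop)
    kernel = np.ones(smooth_len) / smooth_len
    mod = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"),
                              1, np.abs(h))
    carrier = h / (mod + 1e-8)
    return mod, carrier
```

In the actual system, `bases` and the smoothing kernel are learned jointly with the separation network, and a transposed convolution performs the synthesis.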
2.1 Examining the adaptive bases
We can better understand the performance of end-to-end source separation by examining the learned TF bases and TF representations. Figure 2 plots the modulation spectrograms of a male-female speech mixture, along with the leading TF bases and their corresponding magnitude spectra. We rank the TF bases according to their dominant frequency component. We show these plots for two cases: the analysis convolution and synthesis transposed-convolution layers are independent (top), and the analysis and synthesis layers share their weights (bottom). We observe that, like STFT bases, the adaptive bases are frequency selective in nature. However, the adaptive bases are concentrated at lower frequencies and spread out at higher frequencies, similar to the filters of a Mel filter bank.
3 Performance-based cost functions
Source separation approaches have traditionally relied on magnitude spectrograms as the TF representation of choice. Magnitude spectrograms have been interpreted as probability distribution functions (pdfs) drawn from random variables of varying characteristics. This has motivated the use of several cost functions for source separation, such as the mean squared error [11], the Itakura-Saito divergence [12], and Bregman divergences [13]. Since these interpretations do not extend to waveforms, there is a need to propose and experiment with additional cost functions suitable for use in the waveform domain. As stated before, the BSS_Eval metrics (SDR, SIR, SAR) and STOI are the most commonly used metrics to evaluate the performance of source separation algorithms. We now discuss how these metrics can be interpreted as suitable loss functions for our neural network.
3.1 BSS_Eval based cost functions
In the absence of external noise, the distortions present in the output of a source separation algorithm can be categorized as interference and artifacts. Interference refers to the lingering effects of the other sources on the separated source. Thus, the source-to-interference ratio (SIR) captures the ability of the algorithm to eliminate the other sources and preserve the source of interest. The processing steps of an algorithm may also introduce artifacts, i.e., additional sounds in the separation results that do not exist in the original sources. The source-to-artifact ratio (SAR) measures the ability of the network to produce high-quality results without introducing such artifacts, including unwanted nonlinear processing effects caused by the network itself. These metrics can be combined into the source-to-distortion ratio (SDR), which captures the overall separation quality of the algorithm. We denote the output of the network by x. This output should ideally be equal to the target source y and completely suppress the interfering source z. These symbols refer to the time-domain waveforms of each signal. Thus, y and z are constants with respect to any optimization (maximization or minimization) applied to the network output x.
We will also use the standard definition of the inner product between time-domain vectors, ⟨x, y⟩ = x^T y.
Maximizing the SDR with respect to the network output x can be written in terms of these inner products.
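Writing x for the network output and y for the target source, a form consistent with the BSS_Eval definition of SDR [8] is:

```latex
\max_{x} \; \mathrm{SDR}(x)
  \;=\; \max_{x} \;
    \frac{\langle x, y \rangle^{2}}
         {\langle x, x \rangle \langle y, y \rangle - \langle x, y \rangle^{2}}
  \;\equiv\; \max_{x} \;
    \frac{\langle x, y \rangle^{2}}{\langle x, x \rangle},
```

where the equivalence holds because ⟨y, y⟩ is a constant with respect to x and the SDR is monotonically increasing in ⟨x, y⟩² / ⟨x, x⟩.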
Thus, maximizing the SDR is equivalent to maximizing the correlation between the network output x and the target y, while producing the solution with the least energy. We next consider maximizing the SIR cost function.
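Writing x for the network output, y for the target source, and z for the interference, a form consistent with the BSS_Eval definition of SIR [8] is:

```latex
\max_{x} \; \mathrm{SIR}(x)
  \;=\; \max_{x} \;
    \frac{\langle x, y \rangle^{2} / \langle y, y \rangle}
         {\langle x, z \rangle^{2} / \langle z, z \rangle}
  \;\equiv\; \max_{x} \;
    \frac{\langle x, y \rangle^{2}}{\langle x, z \rangle^{2}},
```

since ⟨y, y⟩ and ⟨z, z⟩ are constants with respect to x.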
Maximizing the SIR is equivalent to maximizing the correlation between the network output x and the target source y while minimizing the correlation between x and the interference z. In informal listening tests, we found that a network trained purely on SIR boosts time-frequency (TF) bins where the target is present and the interference is absent, and suppresses TF bins where both sources are present or where the interference dominates the target. This results in a network output consisting of sinusoidal tones near TF bins dominated by the target source.
For the SAR cost function, we assume that the clean target source y and the clean interference z are orthogonal in time, which allows the SAR expression to be simplified.
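With P denoting the orthogonal projection onto the span of the target y and the interference z (which decomposes term by term when y and z are orthogonal), a form consistent with the BSS_Eval definition of SAR [8] is:

```latex
P x \;=\; \frac{\langle x, y \rangle}{\langle y, y \rangle}\, y
      \;+\; \frac{\langle x, z \rangle}{\langle z, z \rangle}\, z,
\qquad
\mathrm{SAR}(x) \;=\; \frac{\lVert P x \rVert^{2}}{\lVert x - P x \rVert^{2}}.
```

Note that y and z enter this expression symmetrically.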
This simplification shows that SAR does not distinguish between the target source and the interference. Consequently, optimizing the SAR cost function does not directly optimize the quality of separation. The purpose of optimizing the SAR cost function should instead be to reduce audio artifacts in conjunction with a loss function, such as the SIR, that penalizes the presence of interference. In practice, a network that optimizes SAR directly would simply apply the identity transformation to the input mixture.
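The three objectives discussed above can be illustrated with a hedged NumPy sketch (the signal names x, y, z and the small stabilizing constants are our own; this is not the paper's training code):

```python
import numpy as np

def sdr_obj(x, y):
    """Quantity maximized by the SDR cost: correlation with the target,
    normalized by the output energy (the target-energy term is constant)."""
    return (x @ y) ** 2 / (x @ x)

def sir_obj(x, y, z):
    """Quantity maximized by the SIR cost: correlation with the target
    relative to correlation with the interference."""
    return (x @ y) ** 2 / ((x @ z) ** 2 + 1e-12)

def sar_obj(x, y, z):
    """SAR under the assumption that y and z are orthogonal in time:
    energy of the projection onto span{y, z} over the residual energy."""
    p = (x @ y) / (y @ y) * y + (x @ z) / (z @ z) * z
    return (p @ p) / ((x - p) @ (x - p) + 1e-12)
```

Note that `sar_obj(mix, y, z)` is maximal when the network passes the mixture through unchanged, consistent with the identity-transformation observation above.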
3.2 STOI-based cost function
The drawback of the BSS_Eval metrics is that they fail to capture the "intelligibility" of the separated signal. Short-time objective intelligibility (STOI) [9] is a metric that correlates well with subjective speech intelligibility. STOI assesses the short-time correlation between TF representations of the target speech y and the network output x. We now describe the sequence of steps involved in interpreting STOI as a cost function.
The network output x and target source y waveforms are first transformed into the TF domain using an STFT step with Hann-windowed, zero-padded frames. This STFT step was implemented using a 1D convolution operation. The resulting magnitude spectrograms are transformed into an octave-band representation by grouping frequencies into one-third octave bands; this was implemented as a matrix multiplication applied to the magnitude spectrograms. We denote the resulting representations by X and Y, corresponding to x and y respectively, where X_j(m) is the value of the j-th one-third octave band at the m-th time frame.
Given the one-third octave band representations X and Y, we construct new vectors x_{j,m} and y_{j,m}, each consisting of the N frames up to and including the m-th time frame.
Let x_{j,m}(n) denote the n-th entry of the vector x_{j,m}. The octave-band representation of the network output is then normalized and clipped to match the scale of the target source; we denote the result by x̄_{j,m}. The clipping procedure bounds the network output so that its signal-to-distortion ratio with respect to the target stays above a fixed lower bound β.
We then compute the intermediate intelligibility matrix, denoted d_{j,m}, by taking the correlation between x̄_{j,m} and y_{j,m}.
To obtain the final STOI cost function, we average the short-time correlations over all M time frames and all J one-third octave bands.
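Following the definitions in the original STOI paper [9], and with β the clipping bound in dB, these steps can be written as:

```latex
x_{j,m} = \bigl[ X_j(m - N + 1), \; \ldots, \; X_j(m) \bigr]^{\top},
\qquad
\bar{x}_{j,m}(n) = \min\!\left(
    \frac{\lVert y_{j,m} \rVert}{\lVert x_{j,m} \rVert} \, x_{j,m}(n),
    \; \bigl(1 + 10^{-\beta/20}\bigr) \, y_{j,m}(n) \right),
```

```latex
d_{j,m} = \frac{\bigl( \bar{x}_{j,m} - \mu_{\bar{x}_{j,m}} \bigr)^{\top}
                \bigl( y_{j,m} - \mu_{y_{j,m}} \bigr)}
               {\lVert \bar{x}_{j,m} - \mu_{\bar{x}_{j,m}} \rVert \,
                \lVert y_{j,m} - \mu_{y_{j,m}} \rVert},
\qquad
D = \frac{1}{J M} \sum_{j=1}^{J} \sum_{m} d_{j,m},
```

where μ denotes the sample mean of the corresponding vector.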
It is clear from this procedure that maximizing the STOI cost function is equivalent to maximizing the average short-time correlation between the TF representations of the target source and the separation network output.
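A minimal NumPy sketch of this procedure, assuming the default parameters of the original STOI measure [9] since the exact values used here may differ (10 kHz signals, 256-sample Hann frames zero-padded to 512 points with 50% overlap, 15 one-third octave bands starting at 150 Hz, N = 30 frames of context, and a clipping bound of β = −15 dB):

```python
import numpy as np

def stft_mag(x, frame=256, nfft=512, hop=128):
    """Magnitude STFT with a Hann window (parameters are STOI defaults)."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[m * hop : m * hop + frame] * win for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))       # (frames, nfft//2 + 1)

def octave_band_matrix(nfft=512, sr=10000, n_bands=15, f_min=150.0):
    """One-third octave grouping matrix over the FFT bins."""
    freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    centers = f_min * 2.0 ** (np.arange(n_bands) / 3.0)
    bands = np.zeros((n_bands, freqs.size))
    for k, fc in enumerate(centers):
        bands[k] = (freqs >= fc * 2 ** (-1 / 6)) & (freqs < fc * 2 ** (1 / 6))
    return bands

def stoi_like(y, x, n_ctx=30, beta_db=-15.0):
    """Average short-time correlation between octave-band envelopes of the
    target y and the estimate x; equals 1.0 for identical inputs."""
    B = octave_band_matrix()
    Y = np.sqrt(B @ (stft_mag(y) ** 2).T)                    # (bands, frames)
    X = np.sqrt(B @ (stft_mag(x) ** 2).T)
    clip = 10.0 ** (-beta_db / 20.0)
    scores = []
    for m in range(n_ctx, Y.shape[1] + 1):
        Ys, Xs = Y[:, m - n_ctx : m], X[:, m - n_ctx : m]
        for j in range(Y.shape[0]):
            yj, xj = Ys[j], Xs[j]
            # normalize the estimate's segment to the target's scale, then clip
            alpha = np.linalg.norm(yj) / (np.linalg.norm(xj) + 1e-12)
            xbar = np.minimum(alpha * xj, yj * (1.0 + clip))
            yc, xc = yj - yj.mean(), xbar - xbar.mean()
            denom = np.linalg.norm(yc) * np.linalg.norm(xc) + 1e-12
            scores.append(float(yc @ xc) / denom)
    return float(np.mean(scores))
```

To use this as a training loss, each step would be implemented with differentiable operations so that the negative score can be backpropagated.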
4 Experiments
Since this paper interprets source separation metrics as cost functions, evaluating the trained networks with those same metrics would be circular. Instead, we use subjective listening tests targeted at evaluating the separation, artifacts, and intelligibility of the separation results to compare the different loss functions. We use the crowdsourced audio quality evaluation (CAQE) toolkit [14] to set up the listening tests on Amazon Mechanical Turk (AMT). The details and results of our experiments follow.
4.1 Experimental Setup
For our experiments, we use the end-to-end network shown in Figure 1(b). The separation was performed on an AET representation computed at a fixed stride, with smoothing applied by the smoothing convolutional layer. The separation network consisted of dense layers, each followed by a softplus nonlinearity. This network was trained using the different proposed cost functions and their combinations. We compare the cost functions by evaluating their performance on isolating the female speaker from a mixture of a male speaker and a female speaker, using the above end-to-end network.
To train the network, we randomly selected male-female speaker pairs from the TIMIT database [15]; disjoint sets of pairs were used for training and testing. For each pair, the speakers' recorded sentences were mixed at a fixed level to form the training mixtures. The trained networks were compared on their separation performance on the test sentences. The test speakers were not part of the training data, ensuring that the network learns to separate female speech from a mixture of male and female speakers rather than memorizing the speakers themselves.
In the subjective listening tests, we compare the performance of end-to-end source separation under the following cost functions:
(i) Mean squared error
(ii)
(iii)
(iv)
(v)
(vi)
(vii) .
These combinations were selected to understand the effects of the individual cost functions on separation performance. We scale the value of each cost function to unity before starting the training procedure. This was done to control the relative weighting of terms in the case of composite cost functions.
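The unit-scaling step can be sketched as follows (a hypothetical helper illustrating the idea, not the paper's code):

```python
def calibrated_weights(initial_losses):
    """Scale factors that bring each cost function's initial value to unit
    magnitude, so composite cost functions start with comparable term weights."""
    return [1.0 / (abs(v) + 1e-12) for v in initial_losses]

def composite_loss(losses, weights):
    """Weighted sum of the individual cost-function values."""
    return sum(w * v for w, v in zip(weights, losses))
```

With this calibration, each term of a composite cost function contributes equally at the start of training, regardless of the metrics' different natural scales.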
4.2 Evaluation
Using CAQE over a web environment like AMT has been shown to give results consistent with listening tests performed in controlled lab environments [14]. Thus, we use the same approach for our listening tests. The details are briefly described below.
4.2.1 Recruiting Listeners
For the listening tasks, we recruited listeners on Amazon Mechanical Turk who were over the age of 18 and had no previous history of hearing impairment. Each listener had to pass a brief hearing test that consisted of identifying the number of sinusoidal tones within two segments of audio. If a listener failed to identify the correct number of tones within two attempts, their response was rejected. All participants for the listening tests were recruited over AMT.
4.2.2 Subjective Listening Tests
We assigned each of the accepted listeners to one of four evaluation tasks. Each task asked listeners to rate the quality of separation according to one of four perceptual criteria: preservation of the target source, suppression of interference, absence of additional artifacts, and speech intelligibility. These perceptual criteria correspond to the objective metrics SDR, SIR, SAR, and STOI, respectively.
Accepted listeners were given the option to submit multiple evaluations across the different tasks. For each task, we trained listeners by giving each listener an audio sample of the isolated target source as well as a mixture of the source and interfering speech. We also provided 13 separation examples of poor quality and 13 examples of high quality according to the perceptual criterion assigned to the listener. The audio files used to train the listener all had exceptionally high or low objective scores (SDR, SIR, SAR, STOI) for the task at hand, so that listeners could anchor their ratings against the best and worst separation examples.
After training, the listeners were asked to rate eight unlabelled, randomly ordered separation samples from 0 to 100 according to the assigned criterion. The isolated target source was included in the evaluation as a baseline. The other seven audio samples correspond to separation examples produced by the neural network trained with the different cost functions listed in Section 4.1.
4.3 Results and Discussion
Figure 3 gives the results of the subjective listening tests performed through AMT for each of the four tasks. The results are shown as box plots, with the median value (solid line in the middle) and the lower- and upper-percentile points (box boundaries). The vertical axis gives the distribution of listener scores over the range 0-100 obtained from the tests. The horizontal axis shows the different cost functions used for evaluation, as listed in Section 4.1.
These results also help us understand the nature of the proposed cost functions. For example, Figure 3(b) (bars 5, 6, 7) shows that explicitly incorporating the SIR term into the cost function helps the network suppress the interfering sources better. Similarly, adding a STOI term to the cost function improves the results in terms of speech intelligibility, as seen in Figure 3(d). We also observe that adding STOI to the SDR cost function helps preserve the target source better (Figure 3(a), bars 2, 3, and 4). One possible reason is that increasing the intelligibility of the separation results leads to a perceptual impression that the target source is better preserved. The BSS_Eval cost functions appear comparable in terms of preserving the target source (Figure 3(a), bars 2, 5, 6, 7) and slightly better than MSE. In terms of artifacts in the separated source, SDR outperforms the other cost functions, all of which seem to introduce a comparable level of artifacts into the separation results (Figure 3(c)). The use of SAR in the cost function does not appear to have favorable or adverse effects on the perceived artifacts in the separation results.
5 Conclusion and Future Work
In this paper, we have proposed and experimented with novel cost functions motivated by the BSS_Eval and STOI metrics for end-to-end source separation. We have shown that these cost functions capture different salient aspects of source separation depending on their characteristics. This enables the flexibility to use composite cost functions that can potentially improve the performance of existing source separation algorithms.
References
 [1] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," 2016.
 [2] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 246–250.
 [3] Y. Luo, Z. Chen, and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 787–796, April 2018.
 [4] S. Venkataramani, J. Casebeer, and P. Smaragdis, "Adaptive front-ends for end-to-end source separation." [Online]. Available: http://media.aau.dk/smc/wp-content/uploads/2017/12/ML4AudioNIPS17_paper_39.pdf
 [5] Y. Luo and N. Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
 [6] S. W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec 2017.
 [7] D. Rethage, J. Pons, and X. Serra, "A wavenet for speech denoising," 2018.
 [8] C. Févotte, R. Gribonval, and E. Vincent, "BSS_Eval toolbox user guide – revision 2.0," 2005.
 [9] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010, pp. 4214–4217.
 [10] S. W. Fu, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
 [11] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
 [12] C. Févotte, N. Bertin, and J. L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
 [13] S. Sra and I. S. Dhillon, "Generalized nonnegative matrix approximations with Bregman divergences," in Advances in Neural Information Processing Systems, 2006, pp. 283–290.
 [14] M. Cartwright, B. Pardo, G. J. Mysore, and M. Hoffman, “Fast and easy crowdsourced perceptual audio evaluation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 619–623.
 [15] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic phonetic continuous speech corpus," Philadelphia, 1993.