Audio source separation is the process of separating a mixture (e.g. a pop band recording) into isolated sounds from individual sources (e.g. just the lead vocals). Deep learning models are the state of the art for source separation, provided that the mixture to be separated is similar to the mixtures the model was trained on. When a model is applied to a mixture unlike its training data (e.g. trying to separate speech with a model trained on music), it often fails. Thus, the end user must know enough about each model's training to select the correct model for a given audio mixture. This limits how models can be deployed and creates a bottleneck when adding source separation to an automation chain.
A method to automatically select the best model for the current mixture would transform the range of applications where source separation can be applied. For example, imagine a hearing aid that automatically switches models when the user moves from an outdoor construction site (a model trained to separate speech from environmental noise) to an indoor restaurant (a model trained to separate speech from speech).
There has been work on combining multiple algorithms or models for specific separation tasks [13, 2, 10]. These ensemble methods combined multiple methods that work for one application domain (e.g. vocals separation or speech enhancement), whereas our method chooses which model is the most appropriate for a given mixture across multiple domains.
The most closely related work is Manilow et al. [12]. They trained a deep net to estimate separation quality, given an audio mixture and its separated source estimates. They used the quality estimate to guide switching between separation algorithms. Network training required ground-truth source estimates. In contrast, our method does not require any training.
Error estimation in recognition problems such as ASR [4, 6] does not apply directly to source separation. In computer vision (e.g. image segmentation), failure detection has employed a measure of confidence that is learned during training, or has compared the input image to the training distribution for the model, or has manipulated the image to see if it behaves similarly to the training data [1, 5, 20]. Our work is for audio, does not require training, and does not need access to the training distribution.
Kim et al. [8] proposed a run-time selection method for speech enhancement. It compares the reconstruction error of different auto-encoders on some input sound. The model with the lowest reconstruction error is used. We cannot use reconstruction error in the case where ground-truth separations are not known, which is the situation in actual deployment.
2 Proposed Method
We automate selection of the appropriate domain-specific deep clustering source separation model for an audio mixture of unknown domain. We present a confidence measure that does not require ground truth to estimate separation quality, given a model and an audio mixture. We use this confidence measure to automatically select the best model output for the mixture. A system overview is shown in Figure 1.
2.1 Deep clustering
We apply our method to an ensemble of deep clustering [7] source separation networks, one trained on each of three audio domains: music, speech, and environmental sounds. We choose deep clustering because it has high performance [21, 11].
In deep clustering, a neural network is trained to map each time-frequency bin in a magnitude spectrogram of an audio mixture to a higher-dimensional embedding, such that bins that primarily contain energy from the same source are near each other and bins whose energy primarily comes from different sound sources are far from each other. Given a good mapping, the assignment of bins to sources can then be determined by a simple clustering method. All members of the same cluster are assigned to the same source. Because deep clustering performs separation via clustering, our method relies on an analysis of the embedding produced by deep clustering to establish a confidence measure.
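The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the network has already produced an embedding for every time-frequency bin, uses plain K-means as the "simple clustering method", and turns cluster membership into binary masks. The function names `kmeans` and `masks_from_embeddings` are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain K-means on the rows of `points`; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

def masks_from_embeddings(embeddings, k):
    """embeddings: (freq, time, dim) array -> (k, freq, time) binary masks."""
    f, t, d = embeddings.shape
    labels, _ = kmeans(embeddings.reshape(-1, d), k)
    # Every bin in cluster j is assigned to source j.
    return np.stack([(labels == j).reshape(f, t) for j in range(k)]).astype(float)
```

Multiplying each mask with the mixture spectrogram would give one source estimate per cluster.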
2.2 Confidence measure
Deep clustering is effective as long as the given mixture is similar to the mixtures on which the model was trained. Otherwise, clusters in the embedding space will not correspond to individual sources. Our confidence measure builds on the one in [19], which was designed to predict the performance of a direction-of-arrival algorithm for source separation. In that work, the confidence measure was designed specifically for speech mixtures and required the clustering algorithm to be based on Gaussian Mixture Models. Here, we make the confidence measure more general so that it can be applied to any source separation algorithm that clusters time-frequency points (e.g. KAM [10], DUET [15]). In this paper we apply it to an ensemble of deep clustering networks.
Our measure does not require access to the ground-truth separation of the mixture to estimate the separation quality produced by a given model on a given mixture. In a clustering-based algorithm, time-frequency points are mapped to a space in which the clustering is performed. The core insight behind the confidence measure is that the distribution of the embedded time-frequency points of a mixture is predictive of the performance of the algorithm. Figure 2 shows a visualization of the confidence measure as applied to the distribution of points in a mixture produced by three deep clustering networks, each trained on a different domain. The input is a music mixture. The speech (left) and environmental (right) models return distributions with no clear clusters. The music model (middle) returns a more clusterable distribution, which is reflected by a higher confidence score. The confidence measure combines the silhouette score and posterior strength through multiplication so that it is high only when both are high. That is, $C(X) = S(X)\,P(X)$, where $S(X)$ is the silhouette score and $P(X)$ is the posterior strength of the mixture.
2.2.1 Silhouette score
The silhouette score [16] captures how much separation exists between the clusters (intercluster distance) and how dense the clusters are (intracluster distance). Define $X = \{x_i\}$ as the set of embeddings for every time-frequency point in an audio mixture, where $x_i$ is the embedding of one point. $X$ is partitioned into $K$ clusters $C_k$: $X = \bigcup_{k=1}^{K} C_k$. Consider a data point $x_i$ assigned to cluster $C_k$:

$$a(x_i) = \frac{1}{|C_k| - 1} \sum_{x_j \in C_k,\, j \neq i} d(x_i, x_j), \qquad b(x_i) = \min_{o \neq k} \frac{1}{|C_o|} \sum_{x_j \in C_o} d(x_i, x_j).$$

Here, the intracluster distance $a(x_i)$ is the mean distance (using a distance function $d$) between $x_i$ and all other points in $C_k$, and the intercluster distance $b(x_i)$ is the mean distance between $x_i$ and all the points in the nearest cluster $C_o$. Compute the silhouette score of $x_i$:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\left(a(x_i),\, b(x_i)\right)}.$$

Note that $s(x_i)$ ranges from $-1$ to $1$. Computing the silhouette score for every point in a mixture is intractable, so we compute it for a subset of points drawn from the loudest time-frequency bins in the embedding space and take the mean across them to estimate the silhouette score for the mixture, $S(X)$.
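As a rough illustration of the silhouette computation described above, the sketch below scores each point from its mean intracluster and nearest-cluster distances, then averages over a random subset of the loudest bins. Euclidean distance, the function names, and the `n_points` and `top_fraction` defaults are assumptions made for this example; the text does not fix them.

```python
import numpy as np

def silhouette_scores(points, labels):
    """Per-point silhouette score given one cluster label per row of `points`."""
    # Pairwise Euclidean distances (the choice of distance is an assumption).
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    if len(clusters) < 2:
        return scores  # undefined with a single cluster; return zeros
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster scores 0
        a = dist[i, own].sum() / (n_own - 1)  # the self-distance is 0
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores

def mixture_silhouette(embeddings, labels, magnitudes,
                       n_points=1000, top_fraction=0.01, seed=0):
    """Mean silhouette over points sampled from the loudest bins.

    embeddings: (n_bins, dim), labels: (n_bins,), magnitudes: (n_bins,).
    n_points and top_fraction are hypothetical values, not from the paper.
    """
    rng = np.random.default_rng(seed)
    n_top = max(2, int(round(top_fraction * len(magnitudes))))
    loudest = np.argsort(magnitudes)[-n_top:]
    chosen = rng.choice(loudest, size=min(n_points, n_top), replace=False)
    return float(silhouette_scores(embeddings[chosen], labels[chosen]).mean())
```

Restricting the computation to a sample keeps the pairwise distance matrix small, which is the point of the subsampling step.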
2.2.2 Posterior strength
For every point $x_i$ in a dataset $X$, the clustering algorithm (soft K-Means) produces $\gamma_{ik} \in [0, 1]$, which indicates the membership of the point $x_i$ in some cluster $C_k$, also called the posterior of the point $x_i$ with regard to the cluster $C_k$. The closer that $\gamma_{ik}$ is to $0$ (not in the cluster) or $1$ (in the cluster), the more sure the assignment of that point. We compute the posterior strength of $x_i$ as follows:

$$P(x_i) = \frac{K \max_{k} \gamma_{ik} - 1}{K - 1}.$$

The equation maps points that have a maximum posterior of $1/K$ (equal assignment to all clusters) to $0$, and points that have a maximum posterior of $1$ to $1$. The posterior strength of the mixture, $P(X)$, is the mean posterior strength over the loudest time-frequency bins.
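The mapping described above (a maximum posterior of 1/K goes to 0, and a maximum posterior of 1 goes to 1) can be sketched as below, assuming the mapping is linear in between. The soft assignment helper is only one plausible form of soft K-means posteriors: a softmax over negative squared distances with a hypothetical sharpness parameter `beta` that the text does not specify.

```python
import numpy as np

def soft_assignments(points, centroids, beta=5.0):
    """Soft K-means style posteriors: softmax of -beta * squared distance.

    beta is a hypothetical sharpness parameter chosen for illustration.
    """
    sq = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -beta * sq
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def posterior_strength(gamma):
    """Map each row's maximum posterior from [1/K, 1] linearly onto [0, 1].

    gamma: (n_points, K) array whose rows sum to 1.
    """
    k = gamma.shape[1]
    return (k * gamma.max(axis=1) - 1.0) / (k - 1.0)
```

Averaging `posterior_strength` over the selected loud bins would give the mixture-level value that is multiplied with the silhouette score.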
3 Experimental design
In our experiments, we first show that the confidence measure correlates well with source-to-distortion ratio (SDR), a widely used measure of source separation quality; we use a specific variant called scale-dependent SDR [9]. Then we evaluate the performance of our confidence-based ensemble compared to an oracle ensemble, a random ensemble, and each domain-specific model on general mixtures.
For each domain we considered (separating two speakers in a speech mixture, separating vocals from accompaniment in a music mixture, and separating environmental sounds from one another), we trained one deep clustering network, giving three networks with identical setups. Each network has 2 BLSTM layers with 300 hidden units each and an embedding size of 20 with sigmoid activation. We trained each network for 80 epochs using the Adam optimizer with a learning rate of 2e-4.
We created three datasets, each with training, validation, and testing subsets. The speech dataset is made from WSJ0 [3]. The music dataset is made from MUSDB [14]. The environmental dataset is made from UrbanSound8k [17].
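For reference, a minimal sketch of scale-dependent SDR is given below. It assumes the common definition in which the reference is rescaled by its optimal projection coefficient while the error is measured against the unscaled reference; the function name `sd_sdr` is hypothetical and this is not the authors' evaluation code.

```python
import numpy as np

def sd_sdr(estimate, reference):
    """Scale-dependent SDR in dB for two 1-D signals of equal length.

    Assumed definition: 10 * log10(||alpha * s||^2 / ||s - s_hat||^2),
    with alpha = <s_hat, s> / ||s||^2.
    """
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target_energy = np.sum((alpha * reference) ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10(target_energy / error_energy)
```

Unlike scale-invariant variants, this form penalizes an estimate that has the right shape but the wrong gain.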
The MUSDB music dataset has four isolated stems for each song: vocals, drums, bass, and other. We define the accompaniment as the sum of the drums, bass, and other stems and the vocals consist of the vocals stem. The network is trained to separate the vocals from the accompaniment given the music mixture. The WSJ0 speech dataset consists of utterances from 119 speakers. For both, we use well-established training, validation, and test splits from previous work [7, 14].
The UrbanSound8k dataset has 10 sound classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren and street music. Of these, we only use 5 classes which we believe to be separable: car horn, dog bark, gun shot, jackhammer, siren. UrbanSound8k comes split into 10 folds. We use folds 1-8 for training sources, 9 for validation, and 10 for testing.
We downsampled the audio to reduce computational costs. Using Scaper [18], we made 5-second mixtures of 2 sources chosen randomly from each split. The mixtures generated for MUSDB were coherent: the vocals and accompaniment sources came from the same song at the same time. For each domain, we created 20,000 mixtures for training, 5,000 for validation, and 3,000 for testing.
| Approach | Speech | Music | Environmental |
|---|---|---|---|
| Ensemble - oracle | 8.37 | 6.55 | 12.21 |
| Ensemble - random | 4.86 | 4.25 | 2.82 |
| Ensemble - confidence | 7.61 | 6.47 | 10.52 |
First, we investigate whether the confidence measure correlates well with performance for a single model. Figure 3 shows a clear relationship between confidence and performance for the speech model as applied to the speech test mixtures. Further, we see that both confidence and performance are a function of the mixture type. Same-sex mixtures are harder to separate due to the frequency overlap between the speakers, and this is reflected in Figure 3. For the other domains, we also observe strong correlations in linear fits between the confidence measure and SDR: for music mixtures separated by the deep clustering model trained on music (for both vocals and instrumentals), and for environmental sounds separated by the model trained on environmental sounds.
Table 1 shows the performance of various approaches to separating each dataset. The bottom three rows show the performance of individual domain-specific models on all of the domains we consider. Predictably, every model shows poor performance on domains it was not trained on.
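The r-value of a linear fit between confidence and SDR is the Pearson correlation coefficient. A minimal sketch of how such a fit could be computed is shown below; the function name and the example arrays in the usage note are hypothetical and are not results from the paper.

```python
import numpy as np

def linear_fit_with_r(confidence, sdr):
    """Least-squares line sdr ~ slope * confidence + intercept, plus Pearson r."""
    x = np.asarray(confidence, dtype=float)
    y = np.asarray(sdr, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    xc, yc = x - x.mean(), y - y.mean()
    r = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float(slope), float(intercept), float(r)
```

For example, `linear_fit_with_r([0.1, 0.4, 0.8], [1.0, 4.5, 9.0])` would fit one line through three hypothetical (confidence, SDR) pairs; an r close to 1 indicates that confidence tracks separation quality.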
The top three rows of Table 1 show the performance of three ensemble approaches. These ensemble approaches switch between the three domain-specific models via different strategies. The oracle ensemble switches between them with knowledge of the true performance of each model. This is the upper bound for any switching system. The random ensemble randomly selects the model to apply to a given mixture, with equal probability. The confidence ensemble uses our confidence measure to select between the models. For each mixture, all three models are run and confidence measures are computed. The output from the model with the highest confidence is then chosen as the separation.
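The three switching strategies can be sketched as follows, assuming the per-mixture confidence and SDR of every model have already been computed and stored as arrays of shape (mixtures, models). The function names are hypothetical; the oracle needs ground-truth SDR, so it is only available in evaluation.

```python
import numpy as np

def select_models(confidence, sdr, seed=0):
    """Return the chosen model index per mixture for each ensemble strategy.

    confidence, sdr: arrays of shape (n_mixtures, n_models).
    """
    rng = np.random.default_rng(seed)
    n_mixtures, n_models = confidence.shape
    return {
        "confidence": confidence.argmax(axis=1),  # highest confidence wins
        "oracle": sdr.argmax(axis=1),             # true best model (upper bound)
        "random": rng.integers(0, n_models, size=n_mixtures),  # uniform pick
    }

def mean_sdr(sdr, choice):
    """Mean SDR obtained when model choice[i] is used for mixture i."""
    return float(sdr[np.arange(len(choice)), choice].mean())
```

By construction, the oracle's mean SDR is never lower than that of any other strategy evaluated on the same mixtures.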
Results in Table 1 show the mean separation quality of each model (based on SDR) when evaluated on all 9,000 test mixtures. The confidence-based ensemble clearly outperforms the random ensemble in every column of Table 1. In the case of music mixtures, the confidence-based ensemble achieves almost oracle performance, with a mean SDR of 6.47 compared to 6.55. Table 2 shows the confusion matrix. The precision of picking the best model when the best model is speech, music, and environmental sound is 97.50%, 60.17%, and 93.43%, respectively. The corresponding recall rates are 72.87%, 98.33%, and 57.80%.
We have presented a method for effectively combining the output of multiple deep clustering models by switching between them based on mixture domain in an unsupervised fashion. Our method works by analyzing the embedding produced by each deep clustering network to produce a confidence measure that is predictive of separation performance. This confidence measure can be applied to ensembles of any clustering-based separation algorithm.
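Per-model precision and recall follow directly from a confusion matrix between the truly best model and the selected model. The sketch below assumes rows index the best (oracle) model and columns index the model chosen by the confidence measure; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(best, chosen, n_models):
    """cm[i, j] counts mixtures whose best model is i and whose chosen model is j."""
    cm = np.zeros((n_models, n_models), dtype=int)
    np.add.at(cm, (np.asarray(best), np.asarray(chosen)), 1)
    return cm

def precision_recall(cm):
    """Per-model precision (column-wise) and recall (row-wise).

    A model that is never chosen (or never best) yields a division by zero.
    """
    diag = np.diag(cm).astype(float)
    precision = diag / cm.sum(axis=0)  # correct picks / all picks of that model
    recall = diag / cm.sum(axis=1)     # correct picks / all mixtures where it is best
    return precision, recall
```

High precision with low recall for a model means it is rarely chosen wrongly but is often passed over when it would have been the best choice.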
[1] (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
[2] (2015) Extracting singing voice from music recordings by cascading audio decomposition techniques. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 126–130.
[3] (1993) CSR-I (WSJ0) Complete LDC93S6A. Web Download. Philadelphia: Linguistic Data Consortium.
[4] (1997) A probabilistic approach to confidence estimation and evaluation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 879–882.
[5] (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
[6] (2013) Mean temporal distance: predicting ASR error from temporal properties of speech signal. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7423–7426.
[7] (2016) Deep clustering: discriminative embeddings for segmentation and separation. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 31–35.
[8] (2017) Collaborative deep learning for speech enhancement: a run-time model selection method using autoencoders. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 76–80.
[9] (2018) SDR-half-baked or well done?. arXiv preprint arXiv:1811.02508.
[10] (2014) Kernel additive models for source separation. IEEE Transactions on Signal Processing 62 (16), pp. 4298–4310.
[11] (2016) Deep clustering and conventional networks for music separation: stronger together. arXiv preprint arXiv:1611.06265.
[12] (2017) Predicting algorithm efficacy for adaptive multi-cue source separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA.
[13] (2016) Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 450–454.
[14] (2017) The MUSDB18 corpus for music separation.
[15] (2007) The DUET blind source separation algorithm. Blind Speech Separation, pp. 217–241.
[16] (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65.
[17] (2014) A dataset and taxonomy for urban sound research. In 22nd ACM International Conference on Multimedia (ACM-MM'14), Orlando, FL, USA.
[18] (2017) Scaper: a library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA.
[19] (2019) Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 356–360.
[20] (2017) Confidence estimation in deep neural networks via density modelling. arXiv preprint arXiv:1707.07013.
[21] (2018) Alternative objective functions for deep clustering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).