1 Introduction
Speech enhancement includes removing background noise and reverberation [5, 15]. Deep learning approaches provide state-of-the-art improvements, but they typically require supervised training on a large and diverse dataset of paired noisy and clean speech [22]. Since real-world applications contain speech already mixed with noise, most deep learning methods rely on synthetic data that mixes clean speech with noise [18]. However, synthetic data cannot fully capture real-world conditions, leading to poor performance when models are deployed to large, diverse customer bases [23]. One solution is to identify new conditions and augment current test sets with additional representative data. However, labeling a large batch of customer data is expensive, and often prohibited for privacy and compliance reasons.

This paper presents Aura, an ‘ears-off’ methodology to identify, in a sample-efficient manner, scenarios that are: (i) representative of the customer workload, and (ii) challenging for noise suppression models. Real-world scenarios are instrumental to generating a low-variance estimate of how much noise suppressors improve speech quality when used by customers. Keeping the test set small (i.e., sample efficiency) preserves agility in development and production deployment pipelines. Production workload datasets can be on the order of millions of clips, so sample efficiency is an important practical challenge.
The main contribution of Aura is to operate in an ‘ears-off’ mode by relying on a pre-trained audio feature extractor [10] and a pre-trained, accurate objective speech quality metric called DNSMOS P.835 [16]. Both the feature extractor and the speech quality metric are convolution-based models trained on large-scale open-sourced data and do not risk memorizing private customer information.
Aura maps customer data into an embedding space, partitions the data via clustering, and identifies the most challenging scenarios for noise suppressors in each cluster. Since we cannot share results on customer data, we evaluate Aura’s sampling on simulated data. We apply Aura to rank noise suppression models from the DNS Challenge [18] in terms of overall mean opinion scores (MOS) [16]. We show that the Spearman’s rank correlation coefficient (SRCC) between the model ranking obtained on a sample produced by Aura and the ranking obtained on the entire mixture of noisy speech improves substantially over random sampling of noisy speech. We also apply Aura to augment a current benchmark test set for noise suppressors [18]. The resulting test set is more challenging for noise suppressors, with a predicted decline in DNSMOS P.835 [16]. Moreover, the Aura-based test set captures more diverse audio scenarios, with increased diversity (as measured by χ² distance) and a large share of out-of-distribution (OOD) samples compared to the current benchmark test set [18].
Related work. As part of the Deep Noise Suppression (DNS) challenge [18], Reddy et al. created an extensive open-sourced noisy speech test set with about 20 noise types, 15 languages, 200+ speakers, emotions, and singing. However, the resulting DNS benchmark test set does not cover all the scenarios that customers who use real-time communication platforms experience. These gaps motivate additional effort to identify challenging scenarios where DNS models underperform.
Active learning is the common paradigm for selecting which data points to label from a pool of unlabeled data when the labeling budget is limited [20]. Many active learning heuristics are uncertainty-based and prioritize sampling hard examples [6]. Kossen et al. show that such heuristics lead to biased estimates of model performance because they do not preclude selecting redundant or outlier scenarios [13]. On the other hand, Sener et al. propose a diversity-based solution that identifies the set of data points most representative of the entire dataset [19]. This paper integrates diversity- and uncertainty-based techniques into a pipeline that creates a test set for noise suppression models while operating in an ‘ears-off’ environment.

Our ‘ears-off’ pipeline protects customers from privacy leakage, including membership attacks [21] and the extraction of identifiable information [3]. A growing body of work has developed differentially private learning algorithms that do not expose private customer data [1]. Federated learning is a partial solution to the ‘ears-off’ problem since it allows training models on decentralized devices without a central server accessing private data. Latif et al. and Guliani et al. present applications of the federated learning paradigm to speech-related tasks [14, 9]. However, these methods rely on the availability of labels at client endpoints.
We obtain an ‘ears-off’ pipeline by transferring knowledge from models trained on open-source data. Self-supervision uses surrogate tasks to learn representations, or embeddings, that are useful for real-world downstream tasks. The method has been successful in image-based tasks (e.g., [12]) and in audio/video tasks (e.g., [8]). In this paper, our pretext task is to label the type of background noise in noisy speech. Moreover, we use DNSMOS P.835 [16] to accurately measure speech quality without the need for a clean reference. DNSMOS P.835 correlates highly with human ratings when comparing noise suppression models and achieves a 0.74 correlation per clip. This gives us the flexibility to estimate speech quality before and after noise suppression without listening to the file.
2 Solution

Our objective is to estimate a performance metric of noise suppression on target data D that represents real-world conditions. For each speech clip, the P.835 protocol generates a MOS for signal quality (SIG), background noise (BAK), and overall quality (OVRL). We derive two performance metrics from MOS: (i) the differential MOS (DMOS) between after and before denoising; and (ii) the stack ranking of competing noise suppressors according to their average DMOS calculated on the target data D.

The constraints are that only a small subset of files can be sampled out of all the noisy speech in D, and that audio files in D cannot be used to fit a model that could potentially encode identifiable information in its parameters. The size restriction keeps the test set small and allows rapid testing of models during development. We want the sampled estimate of the performance metric to have a small error relative to its expected value on the target data D. To minimize this error, we trade off bias and variance [4]. A random sample of audio files would result in zero bias, but high variance. On the other hand, probability-proportional-to-size (PPS) sampling would minimize the variance of the estimator by sampling audio files with a probability proportional to their contribution to the metric [11]. One shortcoming of PPS sampling used alone is that it does not consider the diversity of scenarios.

To trade off variance and bias, we propose Aura, the pipeline shown in Figure 1. Aura learns a partition of the target data D into clusters in an embedding space obtained from a pre-trained feature extractor. Then, it applies PPS within each cluster. Aura samples audio files in two steps. First, it samples files across all clusters of the embedding space to capture the diversity represented in the target data. Second, within each cluster, it favors the audio files expected to influence the performance metric the most, which reduces the variance of the estimate.
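The following Python sketch illustrates this two-step procedure. All names (`aura_sample`, `emb`, `weights`) are ours, not the paper’s: `emb` stands for pre-computed feature-extractor embeddings, `weights` for a strictly positive per-clip importance score, and the budget-allocation rule across clusters is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def aura_sample(emb, weights, n_clusters, budget, seed=0):
    """Two-step sampler: (1) spread the labeling budget over clusters of
    the embedding space, (2) weighted (PPS-style) sampling within each
    cluster. Returns the indices of the selected clips."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(emb)
    per_cluster = max(1, budget // n_clusters)  # step 1: cover every cluster
    picks = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue
        p = weights[idx] / weights[idx].sum()   # step 2: PPS within cluster
        k = min(per_cluster, idx.size)
        picks.extend(rng.choice(idx, size=k, replace=False, p=p))
    return np.asarray(picks)
```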
Test set for noise suppression models. The first application of Aura is to sample audio files from the target data D to form a test set for noise suppression models. In each cluster of the embedding space, we sample files with a probability inversely proportional to their predicted DMOS; a smaller DMOS indicates a more challenging (noisier) speech scenario.
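Continuing the sketch above, the weights for this application could be derived from a hypothetical `dmos` array of per-clip predicted DMOS (cluster count and budget are assumed values):

```python
# Shift predicted DMOS to keep weights positive, then invert so clips the
# noise suppressor degrades most (lowest DMOS) are sampled most often.
weights = 1.0 / (dmos - dmos.min() + 1e-3)
test_set_idx = aura_sample(emb, weights, n_clusters=100, budget=1700)
```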
Model ranking. A model evaluation task in noise suppression involves comparing the speech quality produced by different models on the same set of audio files. Given a set of competing noise suppressors, we adjust the sampling procedure as follows: within each cluster, an audio file is selected with a probability proportional to the variance, across models, of the predicted DMOS for that file. In this application, sampling performance is defined as the SRCC between the ranking of models obtained on the sample and the ranking obtained on the entire target data D.
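A sketch of this ranking variant, assuming a hypothetical `dmos_matrix` of shape `(n_clips, n_models)` holding each model’s predicted DMOS per clip:

```python
from scipy.stats import spearmanr

var_weights = dmos_matrix.var(axis=1) + 1e-6  # model disagreement per clip
sample = aura_sample(emb, var_weights, n_clusters=100, budget=1700)

rank_on_sample = dmos_matrix[sample].mean(axis=0)  # average DMOS per model
rank_on_full = dmos_matrix.mean(axis=0)
srcc, _ = spearmanr(rank_on_sample, rank_on_full)  # sampling performance
```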
3 Experiments
3.1 Pipeline Components
Choice of feature extractor. Our feature extractor is built on VGGish, a model pre-trained on a dataset of 100M YouTube videos [10]. It generates a 128-dimensional embedding for each audio clip. VGGish embeddings are trained to identify the type of foreground sound in an audio clip. In the context of noise suppression, we instead aim to detect the noise in the background of a speech clip. Therefore, we fine-tune the feature extractor by feeding its embeddings into two fully connected layers and classifying the type of background noise.
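A minimal PyTorch sketch of this fine-tuning head (the hidden size and class count are assumptions; the paper does not specify them):

```python
import torch.nn as nn

class NoiseTypeHead(nn.Module):
    """Two fully connected layers on top of frozen 128-d VGGish embeddings,
    classifying the background-noise type of a noisy-speech clip."""
    def __init__(self, n_noise_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_noise_classes),
        )

    def forward(self, vggish_emb):      # vggish_emb: (batch, 128)
        return self.net(vggish_emb)     # logits over noise types
```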
We follow Reddy et al. [18] to synthesize 1.5 million instances of 10-second noisy speech containing noise sounds drawn from AudioSet categories [7]. We add each noise sound to the background of speech clips randomly drawn from the bank of clean speech clips created by [18]. Our background noise type classifier trained on this synthetic data achieved high mean average precision (MAP) and area under the curve (AUC) on a hold-out sample. It provides us with a feature extractor that maps noisy speech to a 128-dimensional embedding space that is semantically aligned with the type of background noise. Using k-means++ on the embedding space, we partition the noisy speech into clusters [2].¹ We validated the quality of the clusters using subjective listening tests: raters were presented clips from 24 random clusters and asked to report whether the audio files in each cluster share a common audio property. We found that 80% of the clusters share a common background noise.

¹ The number of clusters is selected as the one that achieves the lowest Davies-Bouldin index, computed using Euclidean distance in the embedding space.
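The footnote’s cluster-count selection could be implemented as follows (the candidate range is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def pick_num_clusters(emb, candidates=range(10, 201, 10)):
    """Pick the cluster count that minimizes the Davies-Bouldin index,
    computed with Euclidean distance in the embedding space."""
    best_k, best_db = None, float("inf")
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(emb)
        db = davies_bouldin_score(emb, labels)  # lower is better
        if db < best_db:
            best_k, best_db = k, db
    return best_k
```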
Choice of MOS predictor. We use a pre-trained model, DNSMOS P.835, to predict subjective speech quality [16].
3.2 Dataset
We apply Aura to two target datasets.
Augmented DNS challenge test set. We create a target dataset of noisy speech candidates to add to the INTERSPEECH 2021 DNS Challenge test set [18]. We mix sounds from the balanced AudioSet data [7] with clean speech clips [17], using AudioSet segments as background noise and following [17] to generate noisy speech with signal-to-noise ratios (SNR) between -5 dB and 5 dB. The resulting target data combines 1.7K files from the DNS Challenge test set with 22K newly synthesized 10-second noisy speech clips covering 527 noise types, with at least 59 clips per class. The pool of noisy speech candidates is at least partially out-of-distribution compared to [18], which covers 120 noise types. Moreover, it does not overlap the datasets used to fine-tune our feature extractor and to train DNSMOS P.835, and thus reproduces the conditions of an ‘ears-off’ environment.

Augmented DNS challenge test set + clean speech. To simulate real-world conditions, for each noisy speech clip in the previous target data, we randomly draw 10 clean speech clips from [18]. Clean speech presents a challenge for stack ranking models in development because it does not discriminate the performance of noise suppressors.
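As a sketch of the SNR-controlled mixing used to synthesize the noisy candidates (a generic helper, not the DNS Challenge scripts; `clean_clip` and `noise_clip` are hypothetical waveform arrays):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db`, then add it to the clean speech."""
    noise = noise[: len(speech)]                      # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
noisy = mix_at_snr(clean_clip, noise_clip, rng.uniform(-5, 5))
```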
3.3 Augmenting DNS challenge test set
We evaluate the following question: from the pool of noisy speech candidates, can Aura generate a new test set of the same size (1.7K) as the INTERSPEECH 2021 benchmark test set [18], but with audio scenarios that are more challenging and diverse?
Protocol. We measure the diversity of a test set as the χ² distance between the distribution of audio segments across the embedding clusters and a uniform distribution over clusters. We normalize the value by computing χ² on the contingency table of the percentage of data points in each cluster rather than on raw frequencies. The lower the χ² distance, the more of the audio properties encoded in the embedding space the resulting test set covers.
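A sketch of this normalized χ² diversity score (our formulation of the description above):

```python
import numpy as np

def chi2_diversity(cluster_labels, n_clusters):
    """Chi-squared distance between the per-cluster percentage of clips
    and a uniform split across clusters; lower means more diverse."""
    counts = np.bincount(cluster_labels, minlength=n_clusters)
    observed = 100.0 * counts / counts.sum()  # percentages, not raw counts
    expected = 100.0 / n_clusters             # uniform over clusters
    return np.sum((observed - expected) ** 2 / expected)
```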
Table 1: DMOS (SIG, BAK, OVRL), χ² distance, and share of out-of-distribution (OOD) clips for the benchmark test set and the proposed test sets.

| Test set | DMOS SIG | DMOS BAK | DMOS OVRL | χ² | OOD % |
|---|---|---|---|---|---|
| DNS Challenge | -0.24 | 0.99 | 0.18 | 498 | 0% |
| Proposed (no cluster) | -0.57 | 0.91 | -0.32 | 3843 | 69% |
| Proposed + clustering | -0.47 | 0.89 | -0.18 | 287 | 73% |
Results. Table 1 compares DMOS and χ² distance for the benchmark test set (top row) and the Aura-generated test set (bottom row). The proposed test set replaces 73% of the noisy speech in the benchmark test set with new noisy speech from our synthetic data. Aura forms a test set with clips for which noise suppressors are more likely to degrade the signal and overall quality of the speech than for clips in the benchmark dataset (lower DMOS). Moreover, its χ² distance is not significantly different from zero, which indicates good coverage of all clusters in the embedding space and thus a diverse set of audio conditions. Table 1 also shows that sampling the most challenging audio files in each cluster significantly improves diversity (the χ² distance decreases from 3843 to 287) compared to sampling the most challenging files without a diversity constraint (middle row in Table 1).
Figure 2 shows the top-10 noise types (in green) from the augmenting dataset of noisy speech that Aura adds to the test set. Aura prioritizes new out-of-distribution audio scenarios compared to the top-10 noise types present in the benchmark test set.
3.4 Model ranking
Table 2: SRCC between the model ranking on the sampled data and the ranking on the full target data, for signal (SIG), background (BAK), and overall (OVRL) DMOS, along with the χ² distance of each sample.

| Sampling method | SRCC SIG | SRCC BAK | SRCC OVRL | χ² |
|---|---|---|---|---|
| Random | | | | |
| Diversity | | | | |
| Variance | | | | |
| Aura | 0.84 ± 0.01 | 0.93 ± 0.01 | 0.91 ± 0.01 | 1 |
We evaluate Aura’s sampling efficiency in accurately stack ranking noise suppression models with few samples from the target data D. The objective is to estimate a model ranking that has a high SRCC with the ranking that would be obtained on the entire data.

Protocol. For each speech clip, we run the 28 noise suppression models from [18] and rank them based on the DMOS predicted by our DNSMOS P.835 model rather than on human ratings. We bootstrap the sampling 200 times and report the mean and standard deviation of the resulting rank correlation coefficients. We compare Aura’s sampling performance to three alternative strategies: (i) Random, which draws a random subset of the data; (ii) Diversity, which samples uniformly across embedding clusters; (iii) Variance, which samples proportionally to the variance of predicted DMOS across the 28 models.

Results. Table 2 shows the SRCC for rankings based on signal, background, and overall DMOS. Aura’s sampling leads to a marked improvement over random sampling for signal-based ranking. We also observe that the confidence interval of the ranking obtained from Aura-based samples is narrower than the one obtained by random sampling. Compared to the alternative approaches, Aura generates the sample with the lowest χ² distance, which indicates better coverage of audio scenarios. On the other hand, Random has the highest χ² distance because it mostly samples clean speech. Figure 3 shows that even for larger samples, Aura still outperforms Random in terms of rank correlation.
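A sketch of the bootstrap protocol above, reusing the hypothetical helpers from Section 2 (`aura_sample`, `dmos_matrix`, `var_weights`; cluster count and budget are assumed values):

```python
import numpy as np
from scipy.stats import spearmanr

full_rank = dmos_matrix.mean(axis=0)  # ranking signal on the full data
srccs = []
for i in range(200):                  # 200 bootstrap resamples
    sample = aura_sample(emb, var_weights, n_clusters=100, budget=1700, seed=i)
    srccs.append(spearmanr(dmos_matrix[sample].mean(axis=0), full_rank)[0])
print(f"SRCC: {np.mean(srccs):.2f} +/- {np.std(srccs):.2f}")
```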

4 Conclusion
Aura presents an end-to-end system for improving the test sets used to evaluate deep noise suppression models. Aura is designed to work on ‘ears-off’ customer workloads to preserve privacy. It detects out-of-distribution samples by relying on an objective quality metric and a pre-trained feature extractor. Although targeted at noise suppression, the method is generic and can be extended to other media (e.g., audio, video, image) given a sufficiently powerful pre-trained feature extractor. This work is an important step toward balancing customer privacy with measuring model performance in real-world scenarios.
References
- [1] (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC, pp. 308-318.
- [2] (2006) K-means++: the advantages of careful seeding. Technical report, Stanford.
- [3] (2020) Extracting training data from large language models. arXiv preprint arXiv:2012.07805.
- [4] (2010) Learning bounds for importance weighting. In NIPS, Vol. 10, pp. 442-450.
- [5] (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (2), pp. 443-445.
- [6] (2017) Deep Bayesian active learning with image data. In PMLR, pp. 1183-1192.
- [7] (2017) Audio Set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.
- [8] (2021) Analyzing the potential of pre-trained embeddings for audio classification tasks. In EUSIPCO 2020, pp. 790-794.
- [9] (2021) Training speech recognition models with federated learning: a quality/cost framework. In ICASSP 2021, pp. 3080-3084.
- [10] (2017) CNN architectures for large-scale audio classification. In 2017 IEEE ICASSP, pp. 131-135.
- [11] (2020) Optimal sampling in unbiased active learning. In AISTATS, pp. 559-569.
- [12] (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF, pp. 1920-1929.
- [13] (2021) Active testing: sample-efficient model evaluation. arXiv preprint arXiv:2103.05331.
- [14] (2020) Federated learning for speech emotion recognition applications. In 2020 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 341-342.
- [15] (2005) Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP Journal on Advances in Signal Processing 2005 (7), pp. 1-17.
- [16] (2021) DNSMOS P.835: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. arXiv preprint arXiv:2101.11665.
- [17] (2021) ICASSP 2021 Deep Noise Suppression Challenge. In ICASSP 2021, pp. 6623-6627.
- [18] (2021) INTERSPEECH 2021 Deep Noise Suppression Challenge. Interspeech.
- [19] (2018) Active learning for convolutional neural networks: a core-set approach. In ICLR.
- [20] (2009) Active learning literature survey.
- [21] (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3-18.
- [22] (2014) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1), pp. 7-19.
- [23] (2021) Deep noise suppression with non-intrusive PESQNet supervision enabling the use of real training data. arXiv preprint arXiv:2103.17088.