
Aura: Privacy-preserving augmentation to improve test set diversity in noise suppression applications

Noise suppression models running in production environments are commonly trained on publicly available datasets. However, this approach leads to regressions in production environments due to the lack of training/testing on representative customer data. Moreover, due to privacy reasons, developers cannot listen to customer content. This ‘ears-off’ situation motivates augmenting existing datasets in a privacy-preserving manner. In this paper, we present Aura, a solution to make existing noise suppression test sets more challenging and diverse while limiting the sampling budget. Aura is ‘ears-off’ because it relies on a feature extractor and a metric of speech quality, DNSMOS P.835, both pre-trained on data obtained from public sources. As an application of Aura, we augment a current benchmark test set in noise suppression by sampling audio files from a new batch of data of 20K clean speech clips from Librivox mixed with noise clips obtained from AudioSet. Aura makes the existing benchmark test set harder by 100%, improves Spearman's rank correlation coefficient (SRCC) compared to random sampling, and identifies 73% out-of-distribution samples.



1 Introduction

Speech enhancement includes removing background noise and reverberation [5, 15]. Deep learning based approaches provide state-of-the-art improvement but typically involve supervised training on a large and diverse dataset of noisy speech and clean speech pairs [22]. Since real-world applications contain speech mixed with noise, most deep learning methods rely on synthetic data that mix clean speech and noise [18]. However, synthetic data cannot fully capture real-world conditions, leading to poor performance when deployed to large, diverse customer bases [23]. One solution is to identify new conditions to augment current test sets with additional representative data. However, labeling a large batch of customer data is expensive, and often prohibitive due to privacy and compliance reasons.

This paper presents Aura, an ‘ears-off’ methodology to identify, in a sample-efficient manner, scenarios that are: (i) representative of the customer workload, and (ii) challenging to noise suppression models. Real-world scenarios are instrumental in generating a low-variance estimate of how noise suppressors improve speech quality when used by customers. Keeping the test set small (i.e., sample efficiency) ensures agility of development and deployment in production pipelines. Production workload datasets can be on the order of millions of clips, so sample efficiency is an important practical challenge.

The main contribution of Aura is to operate in an ‘ears-off’ mode by relying on a pre-trained audio feature extractor [10] and a pre-trained accurate objective speech quality metric called DNSMOS P.835 [16]. Both the feature extractor and speech quality metric are convolution-based models trained on large-scale open-sourced processed data and do not risk memorizing customer private information.

Aura maps customer data into an embedding space, partitions the data via clustering, and identifies the most challenging scenarios for noise suppressors in each cluster.

Since we cannot share results on customer data, we evaluate Aura’s sampling on simulated data. We apply Aura to rank noise suppression models from the DNS Challenge [18] in terms of overall mean opinion scores (MOS) [16]. We show that the Spearman’s rank correlation coefficient (SRCC) between a sample produced by Aura and the entire mixture of noisy speech is . This is an improvement of compared to a random sampling of noisy speech. We also apply Aura to augment a current benchmark test set for noise suppressors [18]. The resulting test set is more challenging to noise suppressors with a predicted absolute decline of (a relative decline of ) in DNSMOS P.835 [16]. Moreover, the Aura-based test set captures more diverse audio scenarios with an increase of in diversity (as measured by $\chi^2$ distance), and an addition of out-of-distribution (OOD) samples compared to the current benchmark test set [18].

Related work. As part of the Deep Noise Suppression (DNS) challenge [18], Reddy et al. created an extensive open-sourced noisy speech test set with about 20 noise types, 15 languages, 200+ speakers, emotions, and singing. However, the resulting DNS benchmark test set does not cover all the scenarios that customers who use real-time communication platforms experience. These gaps motivate additional effort to identify challenging scenarios where DNS models underperform.

Active learning is the common paradigm to select which data points to label from a pool of unlabeled data [20] when the labeling budget is limited. Many active learning heuristics are uncertainty-based and prioritize sampling hard examples [6]. Kossen et al. show that heuristics in active learning lead to biased estimates of model performance because they do not preclude selecting redundant/outlier scenarios [13]. On the other hand, Sener et al. propose a diversity-based solution identifying a set of data points that are the most representative of the entire data set [19]. This paper integrates diversity and uncertainty-based techniques into a pipeline that creates a test set for noise suppression models while operating in an ‘ears-off’ environment.

Our ‘ears-off’ pipeline protects customers from privacy leaks, including membership inference attacks [21] and the extraction of identifiable information [3]. A growing body of work has developed differentially private learning algorithms that do not expose private customer data [1]. Federated learning is a partial solution to the ‘ears-off’ problem since it allows training models on decentralized devices without the need for a central server to access private data. Latif et al. and Guliani et al. present applications of the federated learning paradigm to speech-related tasks [14, 9]. However, these methods rely on the availability of labels at client endpoints.

We obtain an ‘ears-off’ pipeline by transferring knowledge from models trained on open-source data. Self-supervision uses surrogate tasks to learn representations or embeddings that are useful to real-world downstream tasks. The method has been successful in image-based tasks (e.g., [12]), and audio/video (e.g., [8]). In this paper, our pretext task is to label the type of background noise in noisy speech. Moreover, we use DNSMOS P.835 [16] to accurately measure the quality of speech without the need for clean reference. DNSMOS P.835 has a correlation with human ratings for noise suppression models and a 0.74 correlation per clip. This gives the flexibility to estimate speech quality before and after noise suppression without listening to the file.

2 Solution

Figure 1: Overall structure of Aura. Aura reduces the variance of the performance metric (red) while maximizing the diversity component (gray) to ensure good coverage of audio scenarios in the embedding space.

Our objective is to estimate a performance metric of noise suppression on target data that represents real-world conditions. For each speech clip, the P.835 protocol generates a MOS for signal quality (SIG), background noise (BAK), and overall quality (OVRL). We derive two performance metrics from MOS: (i) differential MOS (DMOS), the difference in MOS after versus before denoising; (ii) stack ranking of competing noise suppressors according to their average DMOS calculated on the target data.
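As an illustration of the two metrics, the following minimal sketch computes DMOS per clip and stack ranks two suppressors by their average DMOS. The scores and the names `model_a`, `model_b`, `dmos`, and `stack_rank` are hypothetical and are not taken from the paper.

```python
from statistics import mean

def dmos(mos_before, mos_after):
    """Differential MOS: quality after denoising minus quality before."""
    return mos_after - mos_before

def stack_rank(dmos_by_model):
    """Order models from best to worst by average DMOS on the same clips."""
    avg = {name: mean(values) for name, values in dmos_by_model.items()}
    return sorted(avg, key=avg.get, reverse=True)

# Hypothetical OVRL scores for three clips, before and after two suppressors.
before = [2.5, 3.0, 2.0]
after = {"model_a": [3.0, 3.5, 2.5], "model_b": [2.75, 3.0, 2.5]}

dmos_by_model = {
    name: [dmos(b, a) for b, a in zip(before, scores)]
    for name, scores in after.items()
}
ranking = stack_rank(dmos_by_model)
print(ranking)
```

A negative DMOS for a clip means the suppressor made the predicted quality worse than the unprocessed input.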

The constraints are that only a small subset of files can be sampled out of all the noisy speech in the target data, and that audio files in the target data cannot be used to fit a model that could potentially encode identifiable information in its parameters. The sampling restriction limits the size of the test set and allows rapid testing of models during development. We want the sampling estimate of the performance metric to have a small error compared to the expected value of the metric on the target data. To minimize this error, we trade off bias and variance [4]. A random sample of audio files would result in zero bias, but high variance. On the other hand, probability-proportional-to-size sampling (PPS) would minimize the variance of the estimator by sampling audio files with a probability proportional to a per-file size measure [11]. One shortcoming of PPS sampling performed solely on this measure is that it does not consider the diversity of scenarios.
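The variance argument can be checked with a small simulation. The sketch below is an assumption-laden illustration, not the paper's estimator: it uses the classical Hansen-Hurwitz re-weighted mean with draws made with replacement, synthetic per-file metric values, and an idealized size measure equal to the true metric. In practice only a prediction of the metric would be available.

```python
import random
from statistics import mean, pvariance

def hansen_hurwitz_mean(values, probs, n_draws, rng):
    """Estimate the population mean from draws made with replacement under `probs`."""
    n_pop = len(values)
    draws = rng.choices(range(n_pop), weights=probs, k=n_draws)
    # Each draw is re-weighted by 1 / (N * p_i) so the estimate stays unbiased.
    return mean(values[i] / (n_pop * probs[i]) for i in draws)

rng = random.Random(0)
values = [rng.uniform(0.1, 2.0) for _ in range(500)]  # hypothetical per-file metric
true_mean = mean(values)
total = sum(values)

uniform = [1 / len(values)] * len(values)   # plain random sampling
pps = [v / total for v in values]           # probability proportional to size

uniform_runs = [hansen_hurwitz_mean(values, uniform, 20, rng) for _ in range(200)]
pps_runs = [hansen_hurwitz_mean(values, pps, 20, rng) for _ in range(200)]

print("uniform variance:", pvariance(uniform_runs))
print("PPS variance:    ", pvariance(pps_runs))
```

Both estimators are unbiased here; the PPS runs collapse onto the true mean because every re-weighted draw equals the population average when the size measure matches the metric exactly.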

To trade off variance and bias, we propose Aura, the pipeline shown in Figure 1. Aura learns a partition of the target data into clusters in an embedding space obtained from a pre-trained feature extractor. Then, it applies PPS to each cluster. Aura samples audio files in two steps. First, it samples files in all the clusters of the embedding space to capture the diversity represented in the target data. Second, within each cluster, it favors audio files with the largest PPS size measure to reduce the variance of the performance metric.
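The two-step sampling can be sketched as follows. This is a simplified reading under stated assumptions: cluster labels are already available, each file has a non-negative weight, and sequential weighted draws without replacement stand in for PPS. The function name `cluster_then_pps_sample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def cluster_then_pps_sample(cluster_ids, weights, per_cluster, seed=0):
    """Draw up to `per_cluster` files from every cluster, favoring large weights.

    cluster_ids[i] is the cluster of file i and weights[i] >= 0 is its size
    measure. Files are drawn without replacement inside each cluster.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    chosen = []
    for cid in sorted(by_cluster):          # step 1: visit every cluster
        pool = list(by_cluster[cid])
        for _ in range(min(per_cluster, len(pool))):
            pool_weights = [weights[i] for i in pool]
            if sum(pool_weights) <= 0:      # no signal left: fall back to uniform
                pick = rng.choice(pool)
            else:                           # step 2: weighted draw inside the cluster
                pick = rng.choices(pool, weights=pool_weights, k=1)[0]
            pool.remove(pick)
            chosen.append(pick)
    return chosen

cluster_ids = [0, 0, 0, 1, 1, 2]
weights = [0.1, 5.0, 0.2, 1.0, 1.0, 0.0]
chosen = cluster_then_pps_sample(cluster_ids, weights, per_cluster=1)
print(chosen)
```

Visiting every cluster guarantees coverage of the embedding space, while the weighted draw concentrates the budget on the files that matter most for the metric.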

Test set for noise suppression models. The first application of Aura is to sample audio files in the target data to form a test set for noise suppression models. In each of the clusters of the embedding space, we sample files with a probability inversely proportional to their predicted DMOS. Note that a smaller DMOS indicates a more challenging (or noisier) speech scenario.
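Because predicted DMOS can be zero or negative, a literal reciprocal is not always usable as a sampling weight. The sketch below assumes one possible workaround, shifting the scores so that every weight is positive and grows as DMOS shrinks. The shift and the `eps` parameter are illustrative choices, not the paper's formula.

```python
def difficulty_weights(predicted_dmos, eps=1e-3):
    """Map predicted DMOS to sampling probabilities that grow as DMOS shrinks."""
    top = max(predicted_dmos)
    # Shift so the easiest file gets a small positive weight (eps).
    raw = [top - d + eps for d in predicted_dmos]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical predicted DMOS for four files in one cluster.
cluster_dmos = [0.4, -0.6, 0.1, -0.2]
probs = difficulty_weights(cluster_dmos)
hardest = probs.index(max(probs))
print(hardest, probs)
```

The file with the lowest predicted DMOS receives the highest probability, which matches the intent of favoring the most challenging clips in each cluster.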

Model ranking. A model evaluation task in noise suppression involves comparing the speech quality produced by different models on the same set of audio files. Given noise suppressors, we adjust our sampling procedure in the following manner: within each cluster, an audio file is selected with a probability proportional to the variance across models of the predicted DMOS for that file. In this application, the sampling performance is defined as the SRCC between the ranking of models obtained on the sample and the one obtained with the entire target data.
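A minimal sketch of the two ingredients of this procedure: per-file weights given by the variance of predicted DMOS across models, and the SRCC computed as the Pearson correlation of ranks. The helper names and the toy scores are hypothetical, and `srcc` assumes neither input is constant.

```python
from statistics import mean, pvariance

def ranks(values):
    """Average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def disagreement_weights(dmos_per_file):
    """dmos_per_file[i] lists the predicted DMOS of every model on file i."""
    return [pvariance(scores) for scores in dmos_per_file]

# Hypothetical predicted DMOS of three models on three files.
dmos_per_file = [[0.1, 0.1, 0.1], [0.5, -0.5, 0.0], [0.2, 0.1, 0.3]]
weights = disagreement_weights(dmos_per_file)

# Hypothetical average DMOS per model on a sample and on the full data.
print(weights, srcc([0.3, 0.1, 0.2], [0.9, 0.2, 0.5]))
```

Files on which all models agree carry no information about their relative order, so they receive zero weight; files where models disagree are sampled more often.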

3 Experiments

3.1 Pipeline Components

Choice of feature extractor. Our feature extractor is constructed from VGGish, a model pre-trained on a dataset of 100M YouTube videos [10]. It generates a 128-dimensional embedding for each audio clip. VGGish embeddings are trained to identify the type of foreground sound in an audio clip. In the context of noise suppression, we aim to detect the noise in the background of a speech clip. Therefore, we fine-tune the feature extractor by using its embeddings as inputs to two fully connected layers and by classifying the type of background noise.

We follow Reddy et al. [18] to synthesize 1.5 million instances of 10-second noisy speech containing categories of noise sounds from AudioSet [7]. We add each noise sound to the background of speech clips randomly drawn from the bank of clean speech clips created by [18]. Our background noise type classifier trained on this synthetic data achieved a mean average precision (MAP) of and area-under-the-curve (AUC) of on a hold-out sample. It provides us with a feature extractor that maps noisy speech to a 128-dimensional embedding space that is semantically aligned with the type of background noise. Using k-means++ on the embedding space, we partition million noisy speech into clusters [2]. The number of clusters is selected as the number that achieves the lowest Davies-Bouldin index, where the index is computed using Euclidean distance in the embedding space. We validated the quality of the clusters using subjective listening tests. The raters were presented clips from 24 random clusters and asked to report whether audio files in the cluster share a common audio property. We found that 80% of the clusters share a common background noise.
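The cluster-count selection can be illustrated with a small, self-contained Davies-Bouldin computation using Euclidean distance. This sketch assumes the candidate cluster labelings have already been produced (for example by k-means++ runs with different cluster counts, which are not reproduced here); the 2-D points and labelings are hypothetical stand-ins for the 128-dimensional embeddings.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distance (lower is better).

    Requires at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    keys = sorted(clusters)

    centroids = {
        k: [sum(coord) / len(clusters[k]) for coord in zip(*clusters[k])]
        for k in keys
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in clusters[k]) / len(clusters[k])
        for k in keys
    }
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys if j != i
        )
    return total / len(keys)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {
    "compact": [0, 0, 1, 1],   # groups nearby points together
    "mixed": [0, 1, 0, 1],     # splits nearby points apart
}
scores = {name: davies_bouldin(points, labels) for name, labels in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The labeling with the lowest index is kept, which favors clusters that are tight relative to the distance between their centroids.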

Choice of MOS predictor. We use a pre-trained model, DNSMOS P.835 to predict subjective speech quality [16].

3.2 Dataset

We will apply Aura to two target test sets.

Augmented DNS challenge test set. We create a target dataset of potential noisy speech candidates to add to the INTERSPEECH 2021 DNS Challenge test set [18]. We mix sounds from the balanced AudioSet data [7] with clean speech clips [17]. We use segments from AudioSet as background noise and follow [17] to generate noisy speech with a signal-to-noise ratio (SNR) between -5 dB and 5 dB. The resulting target data combines 1.7K files from the DNS challenge test set with 22K 10-second clips from the newly synthesized noisy speech, which covers 527 noise types with at least 59 clips per class. The pool of noisy speech candidates is at least partially out-of-distribution compared to [18], which covers 120 noise types. Moreover, it does not overlap with the dataset of audio clips used to fine-tune our feature extractor and train DNSMOS P.835 and thus reproduces the conditions of an ‘ears-off’ environment.
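For readers unfamiliar with SNR-controlled mixing, here is a minimal sketch of scaling a noise segment so that the mixture reaches a target SNR drawn from the -5 dB to 5 dB range. It is a generic textbook procedure on synthetic signals, not the synthesis script of [17], which may apply additional level normalization; `mix_at_snr` and `power` are hypothetical names.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

rng = random.Random(0)
# Synthetic stand-ins: a 220 Hz tone as "speech" and white noise, 1 s at 16 kHz.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

snr_db = rng.uniform(-5.0, 5.0)
mixed = mix_at_snr(speech, noise, snr_db)

# Recover the scaled noise and check the achieved SNR.
residual = [m - s for m, s in zip(mixed, speech)]
achieved = 10 * math.log10(power(speech) / power(residual))
print(round(snr_db, 3), round(achieved, 3))
```

At negative SNR the noise carries more power than the speech, which is what makes the low end of this range challenging for suppressors.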

Augmented DNS challenge test set + clean speech. To simulate real-world conditions, for each noisy speech clip in the previous target data, we randomly draw 10 clean speech clips from [18]. Clean speech presents a challenge when stack ranking models in development because it does not help discriminate between the performance of noise suppressors.

3.3 Augmenting DNS challenge test set

We evaluate the following: From the pool of noisy speech candidates, can Aura generate a new test set of the same size (1.7K) as the INTERSPEECH 2021 benchmark test set [18], but with audio scenarios that are more challenging and diverse?

Protocol. We measure the diversity of a test set as the $\chi^2$ distance between the distribution of audio segments across the embedding clusters and a uniform distribution between clusters. We normalize the value by calculating $\chi^2$ of the contingency table over the percentage of data points in each cluster rather than raw frequency. The lower the $\chi^2$ distance, the more audio properties encoded in the embedding space the resulting test set covers.
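The diversity measure can be sketched as a chi-squared statistic between the percentage of clips in each cluster and a uniform split. This reading is an assumption based on the mention of a contingency table and p-values; the exact normalization used by the authors is not recoverable from the text, and `chi2_to_uniform` is a hypothetical helper that expects clusters labeled 0 to `n_clusters - 1`.

```python
from collections import Counter

def chi2_to_uniform(cluster_ids, n_clusters):
    """Chi-squared statistic between cluster shares (in percent) and a uniform split."""
    counts = Counter(cluster_ids)
    n_files = len(cluster_ids)
    expected = 100.0 / n_clusters
    stat = 0.0
    for cluster in range(n_clusters):
        observed = 100.0 * counts.get(cluster, 0) / n_files
        stat += (observed - expected) ** 2 / expected
    return stat

balanced = chi2_to_uniform([0, 1, 2, 3], n_clusters=4)
skewed = chi2_to_uniform([0, 0, 0, 1], n_clusters=4)
print(balanced, skewed)
```

Working on percentages rather than raw counts keeps the statistic comparable between test sets of different sizes; a value of zero means every cluster is equally represented.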

| Test set | DMOS SIG | DMOS BAK | DMOS OVRL | $\chi^2$ | OOD |
|---|---|---|---|---|---|
| DNS Challenge | -0.24 | 0.99 | 0.18 | 498 | 0% |
| Proposed (no cluster) | -0.57 | 0.91 | -0.32 | 3843 | 69% |
| Proposed + clustering | -0.47 | 0.89 | -0.18 | 287 | 73% |
Table 1: Aura-based DNS test set. The sample provided by Aura on the new noisy data set increases the difficulty level (lower DMOS) and diversity (lower $\chi^2$). The evaluation set is compared to the INTERSPEECH test set [18]. ± represents the 95% confidence interval and p-values are in parentheses. Low p-values indicate that the sample does not cover the clusters formed by the target data; large p-values mean almost uniform coverage.

Results. Table 1 compares DMOS and $\chi^2$ for the benchmark test set (top row) and the Aura-generated test set (bottom row). The proposed test set replaces of noisy speech in the benchmark test set with new noisy speech from our synthetic data. Aura forms a test set with clips for which noise suppressors are more likely to degrade the signal and overall quality of the speech than for clips in the benchmark dataset (lower DMOS). Moreover, its $\chi^2$ distance is not significantly different from zero (p-value ), which indicates good coverage of all clusters in the embedding space and thus a diverse set of audio conditions. Table 1 also shows that sampling the most challenging audio files in each cluster significantly improves diversity ($\chi^2$ distance decreases from 3843 to 287) compared to sampling the most challenging files without a diversity constraint (middle row in Table 1).

Figure 2 shows the top-10 noise types (in green) from the augmenting dataset of noisy speech that Aura adds to the test set. Aura prioritizes new out-of-distribution audio scenarios compared to the top-10 noise types present in the benchmark test set.

3.4 Model ranking

| Sampling method | SRCC SIG | SRCC BAK | SRCC OVRL | $\chi^2$ |
|---|---|---|---|---|
| Aura | 0.84 ± 0.01 | 0.93 ± 0.01 | 0.91 ± 0.01 | 1 |
Table 2: SRCC between the ranking of 28 NS models obtained from an Aura sample and the model ranking obtained from using the entire data. Diversity is measured using $\chi^2$ distance (lower is better). The 95% confidence interval is depicted using ±.

We evaluate Aura’s sampling efficiency in accurately stack ranking noise suppression models with few samples from the target data. The objective is to estimate a model ranking that has a high SRCC with the ranking that would be obtained on the entire data.

Protocol. For each speech clip, we run the 28 noise suppression models from [18] and rely on the DMOS predicted by our DNSMOS P.835 model as opposed to human ratings. We bootstrap the sampling 200 times and report the mean and standard deviation of the resulting rank correlation coefficients. We compare Aura’s sampling performance to three alternative strategies: (i) Random, which draws randomly of data; (ii) Diversity, which samples uniformly across embedding clusters; (iii) Variance, which samples proportionally to the variance of predicted DMOS across the 28 models.
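The bootstrap protocol can be sketched for the Random baseline as follows. The data are synthetic (six hypothetical models with slightly different average DMOS), the rank correlation uses the simple no-ties formula, and the sample size is an arbitrary illustrative choice. Another strategy could be evaluated by swapping the `rng.sample` call for its own sampler.

```python
import random
from statistics import mean, stdev

def rank_positions(values):
    """Zero-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = [0] * len(values)
    for rank, index in enumerate(order):
        positions[index] = rank
    return positions

def srcc_no_ties(x, y):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_positions(x), rank_positions(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def model_means(dmos, file_ids):
    """dmos[m][f] is the predicted DMOS of model m on file f."""
    return [mean(row[f] for f in file_ids) for row in dmos]

def bootstrap_srcc(dmos, sample_size, n_boot=200, seed=0):
    """Mean and standard deviation of SRCC between sample and full rankings."""
    rng = random.Random(seed)
    n_files = len(dmos[0])
    full = model_means(dmos, range(n_files))
    scores = []
    for _ in range(n_boot):
        files = rng.sample(range(n_files), sample_size)  # Random baseline
        scores.append(srcc_no_ties(model_means(dmos, files), full))
    return mean(scores), stdev(scores)

data_rng = random.Random(1)
dmos = [[0.2 * m + data_rng.uniform(-0.5, 0.5) for _ in range(300)] for m in range(6)]
mean_srcc, std_srcc = bootstrap_srcc(dmos, sample_size=30)
print(mean_srcc, std_srcc)
```

Reporting the standard deviation alongside the mean shows how stable a sampling strategy is across repeated draws, not only how well it ranks on average.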

Results. Table 2 shows the SRCC for ranking based on signal, background and overall DMOS. Aura’s sampling leads to a improvement over random sampling for signal-based ranking. We also observe that the confidence interval of the ranking obtained from Aura-based samples is narrower than the one obtained by random sampling. Compared to alternative approaches, Aura generates the sample with the lowest $\chi^2$ distance, which indicates a better coverage of audio scenarios. On the other hand, Random has the highest $\chi^2$ distance because it mostly samples clean speech. Figure 3 shows that even for larger samples, Aura still outperforms Random in terms of rank correlation.

Figure 2: Top-10 out-of-distribution (green) and in-distribution (blue) noise categories in Aura-based samples. In-distribution relates to categories in the INTERSPEECH 2021 test set.
Figure 3: SRCC of Aura-based rank estimates as a function of sample size.

4 Conclusion

Aura is an end-to-end system for improving the test set used to evaluate deep noise suppression models. Aura is designed to work for ‘ears-off’ customer workloads to preserve privacy. Aura detects out-of-distribution samples by relying on an objective quality metric and a pre-trained feature extractor. Although targeted at noise suppression, the method is generic and can be extended to other media (e.g., audio/video/image) given the successful development of a powerful pre-trained feature extractor. This work is an important step toward balancing customer privacy and measuring model performance in real-world scenarios.


  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC, pp. 308–318.
  • [2] D. Arthur and S. Vassilvitskii (2006) K-means++: the advantages of careful seeding. Technical report Stanford.
  • [3] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. (2020) Extracting training data from large language models. arXiv preprint arXiv:2012.07805.
  • [4] C. Cortes, Y. Mansour, and M. Mohri (2010) Learning bounds for importance weighting. In Nips, Vol. 10, pp. 442–450.
  • [5] Y. Ephraim and D. Malah (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE transactions on acoustics, speech, and signal processing 33 (2), pp. 443–445.
  • [6] Y. Gal, R. Islam, and Z. Ghahramani (2017) Deep bayesian active learning with image data. In PMLR, pp. 1183–1192.
  • [7] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.
  • [8] S. Grollmisch, E. Cano, C. Kehling, and M. Taenzer (2021) Analyzing the potential of pre-trained embeddings for audio classification tasks. In EUSIPCO 2020, pp. 790–794.
  • [9] D. Guliani, F. Beaufays, and G. Motta (2021) Training speech recognition models with federated learning: a quality/cost framework. In ICASSP 2021, pp. 3080–3084.
  • [10] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. (2017) CNN architectures for large-scale audio classification. In 2017 IEEE ICASSP, pp. 131–135.
  • [11] H. Imberg, J. Jonasson, and M. Axelson-Fisk (2020) Optimal sampling in unbiased active learning. In AISTATS, pp. 559–569.
  • [12] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF, pp. 1920–1929.
  • [13] J. Kossen, S. Farquhar, Y. Gal, and T. Rainforth (2021) Active testing: sample-efficient model evaluation. arXiv preprint arXiv:2103.05331.
  • [14] S. Latif, S. Khalifa, R. Rana, and R. Jurdak (2020) Federated learning for speech emotion recognition applications. In 2020 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 341–342.
  • [15] T. Lotter and P. Vary (2005) Speech enhancement by map spectral amplitude estimation using a super-gaussian speech model. EURASIP Journal on Advances in Signal Processing 2005 (7), pp. 1–17.
  • [16] C. Reddy, V. Gopal, and R. Cutler (2021) DNSMOS P.835: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. arXiv preprint arXiv:2101.11665.
  • [17] C. K. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) ICASSP 2021 Deep Noise Suppression Challenge. In ICASSP 2021, pp. 6623–6627.
  • [18] C. K. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) Interspeech 2021 Deep Noise Suppression Challenge. Interspeech.
  • [19] O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: a core-set approach. In ICLR.
  • [20] B. Settles (2009) Active learning literature survey.
  • [21] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
  • [22] Y. Xu, J. Du, L. Dai, and C. Lee (2014) A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1), pp. 7–19.
  • [23] Z. Xu, M. Strake, and T. Fingscheidt (2021) Deep noise suppression with non-intrusive pesqnet supervision enabling the use of real training data. arXiv preprint arXiv:2103.17088.