Subjective evaluation of speech quality is the most reliable way to evaluate Speech Enhancement (SE) methods. However, subjective tests do not scale easily: they require a considerable number of listeners, and the process is laborious, time-consuming, and expensive. Conventional objective speech quality metrics such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), and Signal to Distortion Ratio (SDR) are widely used to evaluate SE algorithms optimized for human perception. Some of these metrics are designed to predict the subjective Mean Opinion Score (MOS) obtained using the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) Recommendation P.800. However, they have been shown to correlate poorly with human ratings when used for SE tasks that involve perceptually invariant transformations. Also, intrusive metrics cannot be used to evaluate real recordings, because a clean reference is unavailable in realistic scenarios.
2 Related work
showed that an accurate linear model of OVRL can be estimated as a function of SIG and BAK. Naderi and Cutler used this linear relationship to analyze the results of the 3rd Deep Noise Suppression challenge and to estimate the potential improvement in OVRL given a noise suppressor that maximized BAK. Hu and Loizou released an intrusive speech quality assessment tool based on P.835, with Pearson Correlation Coefficients (PCC) to subjective quality of PCC(SIG)=0.70, PCC(CBAK)=0.58, and PCC(OVRL)=0.73. A commercial tool, 3QUEST, is used to measure the speech (S-MOS), noise (N-MOS), and overall (G-MOS) quality of speech as part of the ETSI EG 202 396-3 standard for mobile telephone quality. This intrusive model performs well by condition, with PCC(S-MOS)=0.92, PCC(N-MOS)=0.94, and PCC(G-MOS)=0.94; it was trained with 179 conditions and tested with 81 conditions, but the duration of the training or testing data is not reported.
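The linear relationship between OVRL, SIG, and BAK mentioned above can be illustrated with an ordinary least-squares fit. The sketch below is only an illustration: the ratings are synthetic, and the coefficients used to generate them are made up rather than taken from the cited work.

```python
import numpy as np

def fit_ovrl_linear(sig, bak, ovrl):
    """Least-squares fit of OVRL ~ a*SIG + b*BAK + c; returns (a, b, c)."""
    design = np.column_stack([sig, bak, np.ones_like(sig)])
    coeffs, *_ = np.linalg.lstsq(design, ovrl, rcond=None)
    return coeffs

# Synthetic ratings on the 1-5 MOS scale (hypothetical, for illustration only).
rng = np.random.default_rng(0)
sig = rng.uniform(1.0, 5.0, 200)
bak = rng.uniform(1.0, 5.0, 200)
ovrl = 0.6 * sig + 0.3 * bak + 0.2  # made-up coefficients, not from the paper

a, b, c = fit_ovrl_linear(sig, bak, ovrl)
print(f"OVRL ~ {a:.2f}*SIG + {b:.2f}*BAK + {c:.2f}")
```

With real P.835 labels, the fitted coefficients would indicate how much a gain in BAK alone could raise OVRL.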
ITU-T Recommendation P.563 is a non-intrusive technique that operates directly on the degraded signal. However, it was developed for narrow-band applications, works on limited impairment types, and correlates poorly with human ratings. Recently, Deep Neural Network (DNN) based approaches have been proposed to estimate speech quality scores [6, 8, 1, 22, 26, 20, 7, 3, 4]. Some of these learning-based approaches use other objective metrics as the ground truth to train their speech quality predictor. Other methods use MOS obtained using P.800 as the ground truth to train their models. In another approach, the authors trained the model to identify the Just Noticeable Difference (JND). MOS predictors trained on actual human ratings are more reliable than those trained to predict other objective metrics like PESQ or POLQA. The accuracy and robustness of the learned models depend on the quality of the human labels and on the quantity and diversity of the audio clips. A comparison of some common DNN-based non-intrusive speech quality assessment (NI-SQA) methods is given in Table 1. ACR is Absolute Category Rating. DNSMOS P.835 is the first P.835 based NI-SQA model we are aware of.
In earlier work, we showed that DNSMOS, an NI-SQA metric trained using subjective quality labels, is more robust and reliable than some of the other popular intrusive metrics. DNSMOS is used for model training and model selection during noise suppression development, and for ablation studies of noise suppressors [16, 15]. DNSMOS has been quite popular, with over a hundred researchers using it within several months of its release.
However, DNSMOS only gives the overall score of the audio clip. In this paper, we extend that work to predict the quality of speech (SIG), background noise (BAK), and overall quality (OVRL) of the audio clip. We use the subjective quality labels obtained with ITU-T P.835 from Deep Noise Suppression (DNS) Challenge 3 and from noisy clips processed by several noise suppression models internally at Microsoft. The labels were obtained using our crowdsourcing-based extension of P.835. The model uses a log power spectrogram as the input feature to a Convolutional Neural Network (CNN) based model. It can stack rank different DNS methods based on MOS estimates with great accuracy, hence the name DNSMOS P.835. We provide DNSMOS as an Azure service for other researchers to use. The details of the API are at www.microsoft.com/en-us/research/dns-challenge/dnsmos.
3 Data and subjective ratings
We used the labeled data from the DNS Challenge V3 to train DNSMOS P.835. The DNS Challenge V3 test set comprised 600 noisy speech clips processed by about 40 different noise suppression models. The real recordings in the test set were captured with a variety of noise types, Signal to Noise Ratios (SNRs), and target levels. The test set comprises over 100 noise types and speakers. More details about the creation of these test sets can be found in the DNS Challenge paper. The speech quality ratings of the processed clips varied from very poor (MOS=1) to excellent (MOS=5) for SIG, BAK, and OVRL. The distribution of the MOS scores in the training data is shown in Figure 1. The scores are highly skewed with most ratings populated in the range 3
The subjective quality ratings were obtained in several P.835 runs conducted over several months. Multiple noise suppression methods were compared in each run. Each run included the best performing noise suppressor, the original noisy speech, and a couple of methods with intermediate perceptual quality from previous runs as anchors. Hence, some of the clips were rated multiple times. In total, we have about 30,000 audio clips with associated MOS scores as ground truth. The average length of each audio clip was about 9 seconds, giving us a total of 75 hours of data.
A subset of the dataset is summarized in Figure 2. What makes this dataset unique is that (1) it is by far the largest P.835 dataset we know of and the only one used to train a DNN non-intrusive speech quality assessment model, and (2) the 40 deep noise suppression models used in the dataset give a large variety of suppression artifacts, which we think is needed for a speech quality assessment model to generalize across noise suppressors.
4 DNSMOS P.835
They show that using the time-domain waveform requires a larger model trained on a larger and more diverse data set to ensure generalization. The ground truth MOS scores are obtained for audio clips with an average length of 9 seconds sampled at 16 kHz. Treated as a vector, such a clip has a very large input dimension, and the model would require many layers to compress it and extract features. Instead, we used a log power spectrogram extracted over a 9-second duration as the input feature, as it correlates well with human perception and is proven to work very well for analyzing speech quality. For the spectral features, we used a frame size of 20 ms with a hop length of 10 ms. The input features are then converted to dB scale.
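The feature extraction described above can be sketched with NumPy as follows. This is a minimal sketch, not the authors' implementation: the Hann window, the zero padding at the end of the clip, and the small `eps` floor are assumptions made here so that a 9-second clip yields exactly 900 frames, matching the 900 x 161 input dimension stated in Section 4.2.

```python
import numpy as np

def log_power_spectrogram(wave, sr=16000, frame_ms=20, hop_ms=10, eps=1e-10):
    """Return a (frames, bins) log power spectrogram in dB."""
    frame_len = sr * frame_ms // 1000  # 320 samples at 16 kHz
    hop_len = sr * hop_ms // 1000      # 160 samples at 16 kHz
    # Assumption: zero-pad the tail so the frame count equals len(wave) / hop_len.
    padded = np.pad(wave, (0, frame_len - hop_len))
    n_frames = 1 + (len(padded) - frame_len) // hop_len
    idx = hop_len * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    frames = padded[idx] * np.hanning(frame_len)  # assumed window type
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    return 10.0 * np.log10(power + eps)           # convert to dB scale

# Synthetic 9-second clip at 16 kHz, standing in for real audio.
clip = np.random.default_rng(0).standard_normal(9 * 16000)
features = log_power_spectrogram(clip)
print(features.shape)
```

A 320-point FFT of each 20 ms frame yields 320/2 + 1 = 161 frequency bins, and a 10 ms hop over 9 seconds yields 900 frames.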
4.2 Prediction model
For predicting the MOS scores, we explored different configurations of CNN-based models. The architecture of the best performing model is shown in Table 2. The input to the model is a log power spectrogram with an FFT size of 320, computed over a 9-second clip sampled at 16 kHz with a frame size of 20 ms and a hop length of 10 ms. This results in an input dimension of 900 x 161. We trained two models with almost the same architecture except for the last layer: one is trained to predict all three outputs (SIG, BAK, OVRL) and the other to predict only SIG. The reason is that we found the prediction of SIG to be a much harder task and less correlated with BAK and OVRL. The models were trained with a batch size of 32 using the Adam optimizer and an MSE loss function until the loss saturated. We experimented with adding batch normalization layers after every Conv layer in Table 2. However, batch normalization reduced the prediction accuracy for low-volume clips. Humans tend to give lower ratings to clips with low amplitudes, and we want the model to capture the variations in the target levels of the data. Hence, we avoid any kind of feature normalization. We also explored different network architectures, including a CNN followed by an LSTM. The model in Table 2 generalized the best and had the least complexity.
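The concern about normalization can be illustrated numerically. In the sketch below, per-clip standardization is used as a simple stand-in for feature normalization (an assumption; it is not batch normalization itself), and the audio is synthetic. A clip and a copy attenuated by 20 dB differ by a constant offset in the raw log power features, but become practically indistinguishable once standardized.

```python
import numpy as np

def log_power_db(wave, frame_len=320):
    """Log power spectrogram (dB) over non-overlapping frames, for brevity."""
    n = len(wave) // frame_len
    frames = wave[: n * frame_len].reshape(n, frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-12)

def standardize(feat):
    """Per-clip zero-mean, unit-variance normalization (the step being avoided)."""
    return (feat - feat.mean()) / feat.std()

rng = np.random.default_rng(0)
loud = rng.standard_normal(16000)  # 1 s of synthetic audio at 16 kHz
quiet = 0.1 * loud                 # same content, 20 dB lower level

raw_gap = (log_power_db(loud) - log_power_db(quiet)).mean()
norm_gap = np.abs(
    standardize(log_power_db(loud)) - standardize(log_power_db(quiet))
).max()
print(f"level gap in raw features: {raw_gap:.1f} dB")
print(f"max gap after standardization: {norm_gap:.2e}")
```

Without normalization the level difference stays visible to the model, which is what allows it to reflect the lower ratings listeners give to low-amplitude clips.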
|Input||900 x 161 x 1|
|Conv: 128, (3 x 3), ‘ReLU’||900 x 161 x 128|
|Conv: 64, (3 x 3), ‘ReLU’||900 x 161 x 64|
|Conv: 64, (3 x 3), ‘ReLU’||900 x 161 x 64|
|Conv: 32, (3 x 3), ‘ReLU’||900 x 161 x 32|
|MaxPool: (2 x 2), Dropout(0.3)||450 x 80 x 32|
|Conv: 32, (3 x 3), ‘ReLU’||450 x 80 x 32|
|MaxPool: (2 x 2), Dropout(0.3)||225 x 40 x 32|
|Conv: 32, (3 x 3), ‘ReLU’||225 x 40 x 32|
|MaxPool: (2 x 2), Dropout(0.3)||112 x 20 x 32|
|Conv: 64, (3 x 3), ‘ReLU’||112 x 20 x 64|
|GlobalMaxPool||1 x 64|
|Dense: 128, ‘ReLU’||1 x 128|
|Dense: 64, ‘ReLU’||1 x 64|
|Dense: 1 or 3||1 x 1 or 1 x 3|
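The output shapes of the convolutional stack can be checked by tracing the layer list from the 900 x 161 x 1 input dimension stated in the text. The sketch below assumes 'same' padding for every 3 x 3 convolution and floor division in every 2 x 2 max pooling; neither is stated explicitly in the paper, and `trace_shapes` is a hypothetical helper written for this illustration.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer."""
    h, w, c = input_shape
    shapes = []
    for kind, arg in layers:
        if kind == "conv":    # 'same' padding assumed: only channels change
            c = arg
        elif kind == "pool":  # 2 x 2 max pooling with floor division assumed
            h, w = h // arg, w // arg
        shapes.append((h, w, c))
    return shapes

# Layer list in the order of the architecture table.
layers = [
    ("conv", 128), ("conv", 64), ("conv", 64), ("conv", 32), ("pool", 2),
    ("conv", 32), ("pool", 2),
    ("conv", 32), ("pool", 2),
    ("conv", 64),
]

shapes = trace_shapes((900, 161, 1), layers)
for (kind, arg), shape in zip(layers, shapes):
    print(f"{kind:>4} {arg:>3} -> {shape[0]} x {shape[1]} x {shape[2]}")
```

Global max pooling then reduces the final feature map to one value per channel before the dense layers.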
5 Experimental Results
5.1 Test set
The unseen real test set used to validate the trained model consists of the P.835 evaluation of 17 different Microsoft internal noise suppression models on an unseen set of 850 clips. The clips span various categories such as Emotional, English, Non-English with and without Tonal languages, and Stationary noises. This unseen test set was created for a future DNS challenge and has categories similar to those of the training data, adding mouse clicks and improving the quality of emotional speech. The test set was created using crowdsourcing.
5.2 Evaluation metric
PCC or MSE between the predictions of the developed objective metric and the ground truth human ratings are commonly used to measure the accuracy of the model [8, 22]. We know that P.835 is highly repeatable between runs when averaged across a set of clips per condition, where a condition can be formed by grouping clips enhanced by a particular SE model or based on other criteria such as SNR or reverb RT60 times. The PCC computed on the average of ratings per group across different runs is >0.9. We also found that the PCC computed on the same clips but from two different P.835 runs is only about 0.7-0.8 due to the high rating noise per clip.
Hence, for stack ranking different noise suppressors we evaluate by computing the average of ratings across the entire test set for each model. Therefore, we compute Spearman’s Rank Correlation Coefficient (SRCC) and PCC between averaged human ratings and averaged DNSMOS per model. SRCC gives us the stack ranking accuracy of various SE models.
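The per-model evaluation described above can be sketched as follows: scores are averaged per noise suppression model, then PCC and SRCC are computed on the averages. The data here is synthetic, and the noise levels are hypothetical values chosen only to show why averaging per model gives much higher correlations than per-clip comparison. The simple ranking used for SRCC assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson Correlation Coefficient (PCC)."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman's Rank Correlation Coefficient (SRCC): PCC on ranks, no ties."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

def per_model_mean(model_ids, scores):
    """Average the scores of all clips processed by each model."""
    return np.array([scores[model_ids == m].mean() for m in np.unique(model_ids)])

# Synthetic setup: 17 models, 850 clips each; noise levels are made up.
rng = np.random.default_rng(0)
n_models, clips_per_model = 17, 850
quality = np.linspace(2.5, 4.0, n_models)        # hypothetical true model quality
model_ids = np.repeat(np.arange(n_models), clips_per_model)
human = quality[model_ids] + rng.normal(0.0, 0.7, model_ids.size)
predicted = quality[model_ids] + rng.normal(0.0, 0.7, model_ids.size)

human_avg = per_model_mean(model_ids, human)
pred_avg = per_model_mean(model_ids, predicted)
print("per clip  PCC:", round(pearson(human, predicted), 2))
print("per model PCC:", round(pearson(human_avg, pred_avg), 2))
print("per model SRCC:", round(spearman(human_avg, pred_avg), 2))
```

Averaging over many clips suppresses the per-clip rating noise, so the per-model correlations reflect how well the metric ranks the models rather than how noisy individual ratings are.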
Table 3 shows the per model and per clip PCC and SRCC between human ratings and DNSMOS P.835 on the unseen test set described in Section 5.1. When DNSMOS is aggregated by model, the results are excellent, though there is still room for improvement in SIG. The results on this unseen test set show that DNSMOS P.835 generalizes well, at least for these categories of noises and environments. We cannot compare DNSMOS P.835 with other metrics since it is the first NI-SQA metric for P.835 we are aware of.
Table 4 shows the clip level correlation of two P.835 runs, which gives an upper limit for the per clip correlation in Table 3. The main reason for the relatively low run-to-run clip correlations is that we use only 5 ratings per clip for economy. If we increased this to 30 ratings, the correlations would be significantly better, as should the per clip results in Table 3.
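The effect of the number of ratings per clip on run-to-run correlation can be illustrated with a small simulation. It rests on a simple assumed model: each clip has a true score, each rating adds independent Gaussian noise, and ratings are treated as continuous rather than on the discrete 1-5 scale. Under this model the expected correlation between two runs is σ_c² / (σ_c² + σ_r²/n), where σ_c is the spread of true clip scores, σ_r is the rating noise, and n is the number of ratings per clip. The values of `clip_std` and `rating_std` are hypothetical.

```python
import numpy as np

def run_to_run_pcc(n_ratings, n_clips=2000, clip_std=0.6, rating_std=1.0, seed=0):
    """PCC between the clip-level MOS of two simulated rating runs."""
    rng = np.random.default_rng(seed)
    true_mos = rng.normal(3.0, clip_std, n_clips)  # hypothetical true clip scores

    def one_run():
        noise = rng.normal(0.0, rating_std, (n_clips, n_ratings))
        return (true_mos[:, None] + noise).mean(axis=1)  # MOS = mean of ratings

    return float(np.corrcoef(one_run(), one_run())[0, 1])

for n in (5, 30):
    print(f"{n:>2} ratings per clip: run-to-run PCC = {run_to_run_pcc(n):.2f}")
```

In this model, more ratings per clip shrink the noise term σ_r²/n, which raises the ceiling on any per-clip correlation a predictor can reach.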
6 Conclusion and Future work
DNSMOS P.835 is a speech quality metric designed to stack rank noise suppressors with great accuracy. We attribute the excellent performance of DNSMOS P.835 to (1) a large high-quality dataset, (2) a limited speech quality impairment category, (3) significant optimizations of the model architecture and training, and (4) aggregation by noise suppression model. The per clip performance can be improved by significantly increasing the number of ratings per clip, which is currently only 5 because of cost restrictions. We can also expand the complexity of the model to further improve performance.
References
- [1] (2019) Non-intrusive speech quality assessment using neural networks. In ICASSP, pp. 631–635.
- [2] (2013-06) Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II-Perceptual Model. AES: Journal of the Audio Engineering Society 61, pp. 385–402.
- [3] (2020) WaweNets: a no-reference convolutional waveform-based approach to estimating narrowband and wideband speech quality. In ICASSP, pp. 331–335.
- [4] (2019) Non-intrusive speech quality prediction using modulation energies and lstm-network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (7), pp. 1151–1163.
- [5] (2020) Real time speech enhancement in the waveform domain. In INTERSPEECH.
- [6] (2020) An attention enhanced multi-task model for objective speech assessment in real-world environments. In ICASSP, pp. 911–915.
- [7] (2018) Quality-net: an end-to-end non-intrusive speech quality assessment model based on blstm. In INTERSPEECH.
- [8] (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 85–89.
- [9] (2015) Speech acoustic modeling from raw multichannel waveforms. In ICASSP, pp. 4624–4628.
- [10] (2006) Evaluation of objective measures for speech enhancement. In International Conference on Spoken Language Processing.
- [11] (2004) ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrowband telephony applications.
- [12] (1998-02) ITU-T Recommendation P.800: Methods for subjective determination of transmission quality.
- [13] (2018) Subjective evaluation of speech quality with a crowdsourcing approach. International Telecommunication Union, Geneva.
- [14] (2003) Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm. International Telecommunication Union, Geneva.
- [15] (2021) A simultaneous denoising and dereverberation framework with target decoupling. In INTERSPEECH.
- [16] (2021) DCCRN+: channel-wise subband dccrn with snr estimation for speech enhancement. In INTERSPEECH.
- [17] (2020) A differentiable perceptual audio metric learned from just noticeable differences. In INTERSPEECH.
- [18] (2021) Subjective evaluation of noise suppression algorithms in crowdsourcing. In INTERSPEECH.
- [19] (2008) 3QUEST: 3-fold Quality Evaluation of Speech in Telecommunications Systems.
- [20] (2019) Improving deep models of speech quality prediction through voice activity detection and entropy-based measures. In ICASSP, pp. 636–640.
- [21] (2021) INTERSPEECH 2021 Deep Noise Suppression Challenge. In INTERSPEECH.
- [22] (2021) DNSMOS: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP, pp. 6493–6497.
- [23] (2019) A scalable noisy speech dataset and online subjective test framework. Proc. INTERSPEECH 2019, pp. 1816–1820.
- [24] (2001) Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In ICASSP, Vol. 2, pp. 749–752.
- [25] SESQA: semi-supervised learning for speech quality assessment. In ICASSP, pp. 381–385.
- [26] Novel deep autoencoder features for non-intrusive speech quality assessment. In 2016 24th European Signal Processing Conference (EUSIPCO), pp. 2315–2319.
- [27] (2019) WaveCycleGAN2: Time-domain neural post-filter for speech waveform generation. arXiv preprint arXiv:1904.02892.