Performance optimizations on deep noise suppression models

10/08/2021 ∙ by Jerry Chee, et al.

We study the role of magnitude structured pruning as an architecture search to speed up the inference time of a deep noise suppression (DNS) model. While deep learning approaches have been remarkably successful in enhancing audio quality, their increased complexity inhibits their deployment in real-time applications. We achieve up to a 7.25X inference speedup over the baseline, with smooth degradation of model quality. Ablation studies indicate that our proposed network re-parameterization (i.e., size per layer) is the major driver of the speedup, and that magnitude structured pruning does comparably to directly training a model of the smaller size. We report inference speed because a reduction in parameter count does not necessarily translate into a speedup, and we measure model quality using an accurate non-intrusive objective speech quality metric.




1 Introduction

There has been much work in compressing deep learning methods so that they can efficiently operate within the real-time and hardware constraints of many audio enhancement applications [1, 8, 18, 19]. This interest stems from the fact that deep learning methods – while typically providing superior audio enhancement – come with a greater computational complexity than classical signal processing methods [1]. In real-time applications, the computational complexity becomes the primary constraint: the available memory per device varies, but the available time per computation does not. Thus we measure and present our compression results in terms of inference speed. Calculating the memory or parameter reduction is not an accurate proxy (see Section 5.3).

We investigate the application of structured pruning and fine-tuning to speed up our baseline CRUSE model [1]. Structured pruning aims to find a dense sub-network that well approximates the original. This type of model compression immediately transfers to an inference speedup and reduced storage costs because we are performing smaller, dense matrix multiplications. In addition, we propose a new scalable per-layer parameter configuration for the CRUSE architecture class to specify the pruned network sizes.

1.1 Contributions

Using the CRUSE [1] architecture we demonstrate up to a 7.25X speedup over the baseline model, with a smooth degradation of model quality. Ablation studies indicate that the proposed network parameter configuration is in fact responsible for the successful scalability; our structured pruning method does no better than directly training a model of a given size. The value of structured pruning, then, is in the architecture search: discovering which network parameterizations can reduce model complexity with minimal model degradation.

2 Related work

Tan and Wang [18, 19] use sparse regularization, iterative pruning, and clustering-based quantization to compress DNN speech enhancement models. However, they use STOI and PESQ [7] to evaluate quality after compression, metrics which have been shown to have low correlation with subjective quality [2, 15]. In addition, no run-time benchmarks are given to show real-world improvements, the noise suppression model used is relatively simple and not state of the art, and the training and test sets are simplistic. Therefore, it is not clear from this study which optimizations work well for a best-in-class noise suppressor on a much more challenging test set such as [15].

Kim et al. [8] use a combination of unstructured pruning, quantization, and knowledge distillation to compress keyword spotting models. The authors motivate their work via edge computing, but do not provide any complexity measurements to indicate real-world improvements. In addition, no comparisons are made to any other compression methods.

2.1 Deep noise suppression

Braun et al. [1] developed the CRUSE class of models for real-time deep noise suppression. It is based on the U-Net architecture [20], another real-time model for DNS. Unlike earlier network architectures primarily based on recurrent neural networks, whose models have hit a performance saturation [21, 23, 25], CRUSE is in the class of convolutional recurrent networks [20, 17, 22, 24]. These models have increased performance, albeit at a computational cost that prohibits their real-time deployment on consumer devices. We investigate two versions of the CRUSE model: a more complex model called CRUSE32 and a less complex model called CRUSE16.

2.2 Model compression

Pruning aims to remove parts of a neural network while maintaining its accuracy. It can either remove individual parameters, resulting in sparse matrices (unstructured pruning), or it can remove groups of parameters such as channels or neurons (structured pruning). There are many strategies to prune a neural network, but simple magnitude pruning has been shown to be superior to more complicated methods on ImageNet [5]. Other methods of model compression include quantization, matrix factorization, and knowledge distillation [11]. Frankle and Carbin [4] present the Lottery Ticket Hypothesis: dense, randomly initialized neural networks contain sparse subnetworks (winning tickets) that can be trained to comparable accuracy as the original network in a comparable number of epochs. The work of Liu et al. [10] instead focuses on structured pruning and presents a different message: fine-tuning a pruned model is comparable or worse than directly training a model of that size.
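The distinction between the two pruning families can be made concrete with a small numpy sketch (our own illustration, not any of the cited implementations): unstructured pruning zeros individual entries and keeps the matrix shape, while structured pruning deletes whole rows and shrinks the matrix.

```python
import numpy as np

W = np.array([[0.1, -2.0, 0.3],
              [1.5,  0.2, -1.1],
              [0.05, 0.1, 0.02]])

# Unstructured: zero the ~50% of entries with smallest magnitude.
# The matrix stays 3x3 (sparse), so dense kernels do the same work.
k = W.size // 2
thresh = np.sort(np.abs(W), axis=None)[k - 1]
W_unstructured = np.where(np.abs(W) > thresh, W, 0.0)

# Structured: drop the row with the smallest L1 norm.
# The matrix becomes 2x3 (dense but smaller), so matmuls get cheaper.
row_norms = np.abs(W).sum(axis=1)
keep = np.sort(np.argsort(row_norms)[1:])  # indices of the rows we keep
W_structured = W[keep]

print(W_unstructured.shape)  # (3, 3)
print(W_structured.shape)    # (2, 3)
```

Only the structured variant changes the shapes that a dense inference engine multiplies, which is why it translates directly into a speedup (Section 5.3 returns to this point).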

Operator Type      Inference Time Proportion
GRU                75%
(De)Convolution    18%
Activations, etc.   7%

Table 1: Profiling CRUSE32 to identify the main targets for inference speedup.

3 Experimental methodology

We employ two simple but important experimental methodologies which enable a more accurate assessment of performance in real-world scenarios. First, we provide timing results in the ONNX Runtime inference engine [3]. Timing results are crucial because of the strict real-time requirements for background noise suppression: a reduced parameter count does not necessarily give a faster inference time. For example, we measured no meaningful speedup for sparse pruned models in ONNX Runtime (see Section 5.3). Benchmark results are conducted on an Intel Core i7-10610U CPU. Second, model quality is evaluated using the INTERSPEECH 2021 and ICASSP 2021 DNS Challenge test sets [15, 14]. We use a new non-intrusive objective speech quality metric called DNSMOS P.835 [13], employing the ITU-T P.835 [12] standard, which provides separate scores for speech (SIG), background (BAK), and overall (OVRL) quality. DNSMOS P.835 has a Pearson correlation coefficient of 0.94 for speech, 0.98 for background, and 0.98 for overall compared to subjective quality, which gives sufficient accuracy for our pruning experiments.
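Our timing protocol can be sketched as follows. The helper below is an illustrative sketch, not our exact harness; `run_inference` stands in for a call such as `session.run` on an ONNX Runtime `InferenceSession`, and the model path in the usage comment is hypothetical.

```python
import time
import statistics

def benchmark(run_inference, warmup=10, iters=100):
    """Return (mean_ms, ci95_ms) for repeated calls to run_inference()."""
    for _ in range(warmup):  # warm up caches and lazy initialization
        run_inference()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    mean = statistics.mean(times_ms)
    # 95% confidence interval of the mean under a normal approximation
    ci95 = 1.96 * statistics.stdev(times_ms) / iters ** 0.5
    return mean, ci95

# Usage with ONNX Runtime would look roughly like:
#   import onnxruntime as ort
#   sess = ort.InferenceSession("cruse32.onnx")  # hypothetical path
#   mean, ci = benchmark(lambda: sess.run(None, {"input": frame}))
```

Reporting the confidence interval alongside the mean is what lets us say in Section 5.3 that two benchmark times are or are not distinguishable.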

Figure 1: Speedup via structured pruning for the CRUSE class of models. DNSMOS P.835 reported for the 2021 DNS challenge test sets [15, 14]. Degree 2 polynomial trend line with values.

4 Performance optimizations via pruning architecture search

We aim to improve the inference time of the CRUSE architecture class while incurring minimal model degradation. The CRUSE architecture we study consists of 4 convolution-encoder and deconvolution-decoder layer pairs with a central parallel GRU layer; full details of the architecture can be found in Braun et al. [1]. Because DNS models operate in a real-time environment, we focus on reducing inference time. We first profile the inference time components of CRUSE, with results in Table 1. The GRU operations dominate the computation, and so we focus on compressing them.

Config    Network Param      Mem (MB)  Benchmark (ms)
CRUSE32   32  64  128  256   33.6      3.41 (±.02)
CRUSE16   16  32   64  128    8.41     1.19 (±.02)
P.125     32  64  128  224   26.0      2.61 (±.01)
P.250     32  64  128  192   19.5      2.10 (±.01)
P.500     32  64  128  128    9.22     1.39 (±.02)
P.5625    32  64  112  112    7.16     1.14 (±.01)
P.625     32  64   96   96    5.34     0.96 (±.04)
P.6875    32  64   80   80    3.79     0.79 (±.01)
P.750     32  64   64   64    2.52     0.65 (±.003)
P.875     32  32   32   32    0.68     0.47 (±.01)

Table 2: Structured pruning configurations for the CRUSE architecture. Model memory is of the ONNX format. Benchmarks include 95% confidence intervals. CRUSE{32,16} are the baseline models from Braun et al. [1].
Structured pruning requires that the number of parameters in each layer be adjusted. Our DNS model can be parameterized by a length-4 vector (c1, c2, c3, c4). Because of the skip connections between encoder-decoder layer pairs, CRUSE is symmetric about the central GRU layer. For example, c1 specifies the output channels of the first convolution layer and the input channels of the second convolution layer, and c4 specifies the output channels of the last convolution layer and the hidden state of the GRU layer. Table 2 specifies the baseline CRUSE32 network parameterization, as well as the 8 configurations we consider.

We develop a heuristic framework for resizing the CRUSE architecture. First, reduce the dimensions of the GRU layer via a pruning configuration parameter p. This reduction corresponds to modifying c4 in the network parameterization. For example, with the P.625 configuration the pruning parameter is p = 0.625. Thus, we change c4 from 256 to (1 - 0.625) * 256 = 96. Note that we must also change the output channels of the last convolution layer. Then, we enforce what we call "network monotonicity": c1 <= c2 <= c3 <= c4. This monotonicity condition is commonly seen in neural network architecture design, where the number of channels increases through the network [16, 6]. Recall that c4 = 96, and thus c3 = 128 > c4. We then set c3 = c4 to satisfy the condition. Our proposed configuration scheme can be applied to other U-Net [20] style architectures, as it only requires a symmetric encoder-decoder structure.
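The resizing heuristic can be written down compactly. Below is a minimal sketch under our reading of the procedure (the function name is ours; the base configuration (32, 64, 128, 256) is the CRUSE32 row of Table 2):

```python
def pruned_config(p, base=(32, 64, 128, 256)):
    """Resize a CRUSE parameterization (c1, c2, c3, c4) for pruning fraction p.

    Step 1: shrink the GRU dimension, c4 <- (1 - p) * base c4.
    Step 2: enforce network monotonicity c1 <= c2 <= c3 <= c4 by
            capping every earlier layer at the new c4.
    """
    c4 = round((1 - p) * base[-1])
    return tuple(min(c, c4) for c in base)

print(pruned_config(0.625))  # (32, 64, 96, 96)  -- the P.625 row of Table 2
print(pruned_config(0.875))  # (32, 32, 32, 32)  -- the P.875 row
```

Capping every earlier layer at the new c4 reproduces each of the eight pruning configurations listed in Table 2.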

We employ structured magnitude pruning to construct a dense sub-network of the size specified by our network parameterization. Consider a generic tensor W of dimensions d_1 x ... x d_n. To construct a sub-tensor with dimension i reduced to d'_i < d_i, choose a set S of coordinates in dimension i satisfying:

    S = argmax_{|S| = d'_i} sum_{j in S} || W_j^(i) ||_1,

where W_j^(i) represents the tensor indexed at coordinate j in dimension i. We apply this generic structured magnitude pruning method to reduce the number of input or output channels in (de)convolution layers, as well as the input and hidden dimensions of the GRU layer. We prune the CRUSE32 model, our baseline. Fine-tuning is critical to recover model accuracy after pruning. We use the same optimization hyper-parameters used to train the baseline CRUSE32 model [1].
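The criterion above amounts to keeping the coordinates with the largest L1 norms along the pruned dimension. A minimal numpy sketch of this step (our own illustration, with the function name ours):

```python
import numpy as np

def structured_magnitude_prune(W, dim, new_size):
    """Keep the new_size coordinates of dimension dim with largest L1 norm.

    Norms are taken over all other dimensions; the surviving coordinates
    are kept in their original order, yielding a smaller dense tensor.
    """
    other_dims = tuple(i for i in range(W.ndim) if i != dim)
    norms = np.abs(W).sum(axis=other_dims)         # one L1 norm per slice
    keep = np.sort(np.argsort(norms)[-new_size:])  # top-k, original order
    return np.take(W, keep, axis=dim)

W = np.array([[1.0, -1.0],   # row L1 norms: 2.0
              [0.1,  0.1],   #               0.2
              [3.0,  0.0]])  #               3.0
print(structured_magnitude_prune(W, dim=0, new_size=2))  # keeps rows 0 and 2
```

Applied to a convolution weight this removes whole channels, and applied to GRU weights it removes input and hidden units, which is exactly what lets the pruned model run as smaller dense matrix multiplications.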

5 Results

Figure 1 plots the speedup achieved via structured magnitude pruning to the configurations specified in Table 2, against the resulting signal, background, and overall DNSMOS P.835. Importantly, structured pruning achieves a smooth trade-off between complexity and model quality. This trade-off is quite flat; we can decrease complexity without incurring much model degradation. For example, a 3.64X memory reduction and a 2.46X inference speedup over the baseline CRUSE32 model only incurs a 0.01 degradation in overall DNSMOS P.835. The P.125 configuration achieves higher DNSMOS P.835 than the baseline CRUSE32 because we are doing additional retraining on top of the baseline model. At the extreme, a 7.25X inference speedup incurs a 0.2 overall DNSMOS P.835 degradation. Overall, we have shown the CRUSE class of models is viable at complexity levels previously thought untenable.

5.1 Ablation studies

Config    Struct Prune (SIG, BAK, OVRL)   Direct Train (SIG, BAK, OVRL)
P.125     3.86  4.34  3.60                3.85  4.33  3.59
P.250     3.85  4.33  3.59                3.85  4.34  3.59
P.500     3.83  4.31  3.56                3.82  4.32  3.56
P.750     3.76  4.25  3.47                3.77  4.26  3.48
P.875     3.70  4.15  3.37                3.68  4.17  3.36

Table 3: Pruning vs. directly training a configuration.

We have introduced two new variables in compressing the CRUSE class of models: the network re-parameterization, and the structured magnitude pruning method itself. The CRUSE16 model follows the previous network parameterization [1], but with half the number of parameters of CRUSE32. In Table 2 we see that CRUSE16 lies between the P.500 and P.5625 prune configurations in terms of memory and benchmark speed. However, Figure 1 shows CRUSE16 has worse signal and overall DNSMOS P.835, with equivalent background quality. Holding complexity constant, our new network parameterization in Table 2 achieves superior audio quality.

Table 3 shows the results of directly training a model in a given configuration, using the same total number of training epochs as the pruned models. The results are effectively indistinguishable, which indicates that structured magnitude pruning is not providing any added value. Additional tuning of the fine-tuning procedure for structured pruning did not improve results, see Table 4.

Init LR   DNSMOS P.835 (SIG, BAK, OVRL)
1e-3*     3.76  4.25  3.47
1e-4      3.70  4.18  3.39
1e-5      3.61  4.08  3.24
1e-6      3.46  3.80  3.04

Table 4: Tuning the learning rate for structured pruning fine-tuning. (*) is the selected hyper-parameter, used for fine-tuning and direct training, and is the original CRUSE32 training parameter.

Baseline   (SIG, BAK, OVRL)
CRUSE32    3.80  4.35  3.57
CRUSE16    3.73  4.30  3.49

Table 5: DNSMOS P.835 for the baseline CRUSE32 and CRUSE16 models.

5.2 The value of fine-tuning

Prune     Before Fine-Tuning   After Fine-Tuning
P.125     2.22  2.87  2.11     3.86  4.34  3.60
P.250     2.23  2.87  2.15     3.85  4.33  3.59
P.500     2.03  2.83  1.96     3.83  4.31  3.56
P.5625    2.04  2.77  2.04     3.80  4.27  3.52
P.625     2.11  2.74  2.02     3.79  4.28  3.52
P.6875    2.21  2.81  2.00     3.79  4.26  3.50
P.750     2.02  2.75  1.89     3.76  4.25  3.47
P.875     2.66  2.74  2.22     3.70  4.15  3.37

Table 6: DNSMOS P.835 (SIG, BAK, OVRL) of structure-pruned CRUSE32 models before and after fine-tuning.

Frac GRU  Before Fine-Tuning   After Fine-Tuning
0.25      3.80  4.34  3.56     3.86  4.32  3.59
0.50      3.78  4.32  3.54     3.86  4.33  3.60
0.75      3.68  4.22  3.40     3.86  4.33  3.59

Table 7: DNSMOS P.835 (SIG, BAK, OVRL) of unstructured (sparse) pruned CRUSE32 models before and after fine-tuning.

In Table 6 the DNSMOS P.835 is reported after structured magnitude pruning, and again after fine-tuning. We see that fine-tuning is crucial to recover model quality: the MOS scores increase by approximately 1, a large margin. Table 7 reports the DNSMOS P.835 after unstructured (sparse) magnitude pruning, and again after fine-tuning. Remarkably, unstructured pruning does not degrade model quality nearly as much; consequently, fine-tuning has less to recover. Note that the "Frac GRU 0.25" unstructured setting does not modify any convolution layers and is therefore not equivalent to the "P.250" configuration.

5.3 Sparsity does not imply inference speedup

Frac GRU  Benchmark (ms)
0.00      4.91 (±.07)
0.25      4.92 (±.04)
0.50      4.92 (±.08)
0.75      4.90 (±.03)

Table 8: Benchmark results of unstructured (sparse) magnitude pruning of the GRU layers in CRUSE32.

Despite the promising results in Table 7, reducing the parameter count via sparsity does not readily translate into an inference speedup. Table 8 shows benchmark results for sparse magnitude pruning of the GRU layer in CRUSE32, where a given fraction of the GRU weights is set to zero. The times are indistinguishable within their confidence intervals, meaning that ONNX Runtime provides no speedup from sparse linear algebra operations. Specialized sparse inference support is needed to realize a speedup from sparsity. The Neural Magic inference engine [9] is one such option, but in our experimentation we found that its support for RNN layers was still in development.
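The reason is visible at the tensor level: unstructured pruning leaves every matrix shape intact, so a dense GEMM performs exactly the same number of multiply-adds regardless of how many weights are zero. A small sketch (ours, for illustration; the 256x256 matrix stands in for a GRU weight):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))  # stand-in for a GRU weight matrix
x = rng.standard_normal(256)

# Unstructured pruning: zero the 75% smallest-magnitude weights.
thresh = np.quantile(np.abs(W), 0.75)
W_sparse = np.where(np.abs(W) >= thresh, W, 0.0)

# The shape -- and therefore the dense matmul cost -- is unchanged;
# only a specialized sparse kernel could exploit the zeros.
print(W_sparse.shape == W.shape)  # True
print((W_sparse == 0).mean())     # ~0.75
y = W_sparse @ x                  # same FLOPs as W @ x under dense BLAS
```

This is precisely the situation measured in Table 8: three quarters of the GRU weights are zero, yet the dense kernels in ONNX Runtime take the same time.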

6 Conclusions

We have achieved up to a 7.25X inference speedup over the baseline CRUSE32 model. Our experiments indicate that the proposed network parameterization (the size of each layer) is the primary driver of our speedup results, not structured magnitude pruning. These conclusions support Liu et al. [10] in that the value of structured pruning is in conducting an architecture search. Additionally, our methodological choice of measuring inference speed – rather than using parameter count as a proxy – revealed the difficulty of realizing practical benefits from sparse pruning methods. Our proposed network parameterization only requires a symmetric encoder-decoder structure, and thus can be applied to other U-Net style architectures.


  • [1] S. Braun, H. Gamper, C. K.A. Reddy, and I. Tashev (2021) Towards efficient models for real-time deep noise suppression. In ICASSP, Cited by: §1.1, §1, §1, §2.1, Table 2, §4, §4, §5.1.
  • [2] R. Cutler, A. Saabas, T. Parnamaa, M. Loide, S. Sootla, M. Purin, H. Gamper, S. Braun, K. Sorensen, R. Aichner, et al. (2021) INTERSPEECH 2021 acoustic echo cancellation challenge. In INTERSPEECH, Cited by: §2.
  • [3] O. R. developers (2021) ONNX Runtime. Cited by: §3.
  • [4] J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. International Conference on Learning Representations. Cited by: §2.2.
  • [5] T. Gale, E. Elsen, and S. Hooker (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Cited by: §2.2.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.
  • [7] (2001-02) ITU-T recommendation P.862: perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Cited by: §2.
  • [8] J. Kim, S. Chang, and N. Kwak (2021) PQK: model compression via pruning, quantization, and knowledge distillation. INTERSPEECH. Cited by: §1, §2.
  • [9] M. Kurtz, J. Kopinsky, R. Gelashvili, A. Matveev, J. Carr, M. Goin, W. Leiserson, S. Moore, B. Nell, N. Shavit, and D. Alistarh (2020) Inducing and exploiting activation sparsity for fast inference on deep neural networks. In ICML, Proceedings of Machine Learning Research, Vol. 119, pp. 5533–5543. Cited by: §5.3.
  • [10] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. International Conference on Learning Representations. Cited by: §2.2, §6.
  • [11] R. Mishra, P. H. Gupta, and T. Dutta (2021) A survey on deep neural network compression: challenges, overview, and solutions. arXiv preprint arXiv:2010.03954. Cited by: §2.2.
  • [12] B. Naderi and R. Cutler (2021) Subjective evaluation of noise suppression algorithms in crowdsourcing. In INTERSPEECH, Cited by: §3.
  • [13] C. Reddy, V. Gopal, and R. Cutler (2021) DNSMOS P.835: a non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. arXiv preprint arXiv:2101.11665. Cited by: §3.
  • [14] C. K. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) ICASSP 2021 deep noise suppression challenge. ICASSP. Cited by: Figure 1, §3.
  • [15] C. K. Reddy, H. Dubey, K. Koishida, A. Nair, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan (2021) INTERSPEECH 2021 deep noise suppression challenge. In INTERSPEECH, Cited by: §2, Figure 1, §3.
  • [16] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §4.
  • [17] M. Strake, B. Defraene, K. Fluyt, W. Tirry, and T. Fingscheidt (2019) Separate noise suppression and speech restoration: LSTM-based speech enhancement in two stages. In WASPAA, Cited by: §2.1.
  • [18] K. Tan and D. Wang (2021) Compressing deep neural networks for efficient speech enhancement. In ICASSP, pp. 8358–8362. Cited by: §1, §2.
  • [19] K. Tan and D. Wang (2021) Towards model compression for deep learning based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1785–1794. Cited by: §1, §2.
  • [20] D. Wang and K. Tan (2018) A convolutional recurrent neural network for real-time speech enhancement. In INTERSPEECH, Cited by: §2.1, §4.
  • [21] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller (2015) Speech enhancement with lstm recurrent neural networks and its applications to noise-robust asr. In Proc. Latent Variable Analysis and Signal Separation, Cited by: §2.1.
  • [22] G. Wichern and A. Lukin (2017) Low-latency approximation of bidirectional recurrent networks for speech denoising. In WASPAA, Cited by: §2.1.
  • [23] D. S. Williamson and D. Wang (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. In IEEE/ACM Trans. Audio, Speech, Lang. Process, Cited by: §2.1.
  • [24] S. Wisdom, J. R. Hershey, R. Wilsom, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous (2019) Differentiable consistency constraints for improved deep speech enhancement. In ICASSP, Cited by: §2.1.
  • [25] R. Xia, S. Braun, C. Reddy, H. Dubey, R. Cutler, and I. Tashev (2020) Weighted speech distortion losses for neural-network-based real-time speech enhancement. In ICASSP, Cited by: §2.1.