Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques

05/31/2019 ∙ by Jyun-Yi Wu, et al. ∙ 0

Most recent studies on deep learning based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in real scenarios. In this study, we propose a novel parameter pruning (PP) technique, which removes redundant channels in a neural network. In addition, a parameter quantization (PQ) technique is applied to reduce the size of a neural network by representing weights with fewer cluster centroids. Because the two techniques are derived from different concepts, PP and PQ can be integrated to provide even more compact SE models. The experimental results show that the PP and PQ techniques produce a compacted SE model with a size of only 10.03% compared to that of the original model, resulting in minor performance losses of 1.43% (from 0.70 to 0.69) for STOI and 3.24% (from 1.85 to 1.79) for PESQ. The promising results suggest that the PP and PQ techniques can be used in an SE system in devices with limited storage and computation resources.




1 Introduction

The goal of speech enhancement (SE) is to generate enhanced speech with better quality and intelligibility than the original noisy speech. Many SE methods have been proposed in the past. One class of approaches directly subtracts the estimated noise components from noisy speech in the spectral domain; notable examples include spectral subtraction [1] and its extensions. Another class of approaches considers the characteristics of speech and noise signals during the design of a gain function, which is used to filter out the noise components; well-known examples include the Wiener filter [2], minimum mean-square-error (MMSE) [3], and maximum-likelihood spectral amplitude (MLSA) [4] algorithms. These traditional approaches perform well when the assumed properties of speech and noise signals hold, but their performance degrades notably when dealing with non-stationary noises or operating at very low signal-to-noise ratios (SNRs).

Recently, deep learning algorithms have been successfully introduced to the SE field [5]. Generally speaking, a deep-learning model is used as a mapping function with the aim of transforming noisy speech into clean speech. Notable approaches include the deep denoising auto-encoder (DDAE) [6], deep feedforward neural network [7], convolutional neural network (CNN) [8], and long short-term memory model (LSTM) [9]; all of these models have shown promising results for transforming noisy spectral features into clean ones. More recently, several studies proposed the use of convolutional structures for speech and audio signal analyses and reconstruction [10, 11, 12, 13, 14], and thus the SE tasks can be directly carried out in the time domain.

Numerous studies have confirmed the outstanding denoising capability of deep learning-based methods, especially under more challenging conditions (e.g., non-stationary noises and low SNR conditions). However, a notable disadvantage of deep learning-based solutions is that the SE models require large storage space and incur high online computational costs, which makes them difficult to implement in a device with limited resources. In this study, we propose two techniques, namely parameter pruning (PP) and parameter quantization (PQ), to increase the compactness of deep learning-based SE models. The PP technique removes redundant channels, and the PQ technique groups similar weights and represents them using a cluster centroid. To evaluate the effectiveness of these two techniques, we used the TIMIT database [15] with several noise sources. In this study, we focus on the waveform mapping based SE method using the Fully Convolutional Neural Network (FCN) model. The experimental results show that both PP and PQ techniques can effectively improve the model compactness with modest degradations in quality and intelligibility performance.

The rest of this paper is organized as follows. Related research is reviewed in Section 2. Section 3 introduces the proposed techniques. Section 4 presents the experimental setup and results. Our concluding remarks are stated in Section 5.

2 Related Research

As mentioned earlier, we focus our attention on waveform mapping-based SE using the FCN model. The FCN model is a specialized CNN model that consists of only convolutional layers. In our previous studies, we showed that the FCN model can be used to directly map a noisy speech waveform to a clean waveform [10, 11]. There are two advantages of the waveform mapping-based SE process. First, possible distortions caused by imperfect phase information can be alleviated; second, the computational cost for converting a waveform to frame-based spectral features can be reduced.

Many algorithms have been derived to increase the compactness of neural network models, such as pruning, sparse constraints, and quantization. Pruning algorithms are designed to reduce the network complexity and to address overfitting [16]. By determining a threshold, any weight values lower than the threshold are removed from the model, thus reducing the total number of weights. Another class of approaches builds compact models by applying sparse constraints to reduce trivial filters in the models [17]. On the other hand, quantization algorithms compress the size of the original network by reducing the number of bits required to represent each weight [18]. Han et al. [19] applied a k-means scalar quantization to the parameter values. These quantization methods significantly reduced memory usage with a modest loss in recognition accuracy.

Based on our literature survey, only a few studies have investigated potential approaches to increase the compactness of SE models. Sun and Li proposed using a quantization technique to increase the compactness of an SE model [20]. Ko et al. investigated the correlation of precision scaling and neuron numbers in an SE model [21]. In [22], a two-stage quantization approach was derived to optimally reduce the number of bits when the model parameters are encoded in floating point representation. In the present study, the proposed PP technique adopts a different and novel concept that directly removes redundant channels to form a compact FCN model. The size of this model is then reduced further with the PQ technique.

3 The Proposed PP and PQ Techniques

This section introduces the proposed PP and PQ techniques, as well as the integration of these two techniques.

3.1 The Parameter Pruning (PP) Technique

3.1.1 FCN-based Waveform Mapping

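Figure 1(a) shows the waveform mapping process based on the FCN model. In the figure, we have $I$ filters: $F_1, F_2, \ldots, F_I$; $F_i$ is the $i$-th filter, and $F_i^j$ is the $j$-th channel of $F_i$. $F_i^j = [w_{j,1}, w_{j,2}, \ldots, w_{j,n_w}]$, where $w_{j,k}$ is a channel weight and $n_w$ is the number of weights in a channel. Assume that the receptive field and output sample of filter $F_i$ are $R(t)$ and $y(t)$, respectively. The resulting convolution operation is:

$$y(t) = \sum_{j=1}^{n_c} F_i^j \cdot R^j(t)$$

where $n_c$ is the number of channels in the filter and $R^j(t)$ is the part of $R(t)$ that lies in the $j$-th channel.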

Figure 1: The PP process: (a) Original model; (b) pruned model, and (c) the pruning and retraining process.
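As a rough illustration of the convolution described in this subsection, the following NumPy sketch computes one output sample of a single filter as the sum, over channels, of the dot product between each channel's weights and the matching slice of the receptive field. The function name, the array shapes, and the toy values are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def filter_output(filter_weights, receptive_field):
    """One output sample of one filter.

    filter_weights:  shape (n_channels, n_weights), one row per channel.
    receptive_field: shape (n_channels, n_weights), the input samples
                     covered by the filter at time t.
    """
    filter_weights = np.asarray(filter_weights, dtype=float)
    receptive_field = np.asarray(receptive_field, dtype=float)
    if filter_weights.shape != receptive_field.shape:
        raise ValueError("filter and receptive field must have the same shape")
    # Sum over channels of the per-channel dot products.
    return float(np.sum(filter_weights * receptive_field))

# Toy values (hypothetical): a filter with 2 channels and 3 weights per channel.
weights = np.array([[1.0, 0.0, -1.0],
                    [0.5, 0.5, 0.5]])
field = np.array([[2.0, 3.0, 4.0],
                  [1.0, 1.0, 4.0]])
y_t = filter_output(weights, field)
```

Sliding the receptive field along the waveform and repeating this computation for every filter gives the full layer output.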

3.1.2 Definition of Sparsity

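We define the redundancy criterion in terms of the sparsity of each channel in a filter. For filter $F_i$ in an arbitrary layer of the FCN, we first compute the mean absolute value of all filter weights:

$$M = \frac{1}{n_c \, n_w} \sum_{j=1}^{n_c} \sum_{k=1}^{n_w} |w_{j,k}|$$

where $n_c$ and $n_w$ are the total number of channels in a filter and the number of weights in a channel, respectively, and $w_{j,k}$ is a weight parameter. The sparsity of the $j$-th channel in filter $F_i$ can then be defined as:

$$S_j = \frac{1}{n_w} \sum_{k=1}^{n_w} \sigma(w_{j,k}), \qquad \sigma(x) = \begin{cases} 1, & |x| < M \\ 0, & \text{otherwise} \end{cases}$$

When $S_j$ is close to 1, most of the weights in the channel are smaller in magnitude than $M$, and the channel is considered more redundant.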
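The sparsity measure can be sketched in a few lines of NumPy: take the mean absolute weight of the whole filter, then compute, for each channel, the fraction of its weights whose magnitude falls below that mean. The function name, the strict "less than" comparison, and the toy filter are assumptions for illustration only.

```python
import numpy as np

def channel_sparsity(filter_weights):
    """Return (mean_abs, sparsity) for one filter of shape (n_channels, n_weights)."""
    w = np.abs(np.asarray(filter_weights, dtype=float))
    mean_abs = w.mean()
    # Fraction of weights in each channel whose magnitude is below the filter mean.
    sparsity = (w < mean_abs).mean(axis=1)
    return mean_abs, sparsity

# Hypothetical filter with three channels of four weights each.
toy_filter = np.array([[0.9, -0.8, 0.7, 1.0],      # large weights
                       [0.01, -0.02, 0.03, 0.0],   # near-zero weights
                       [0.5, 0.01, -0.6, 0.02]])   # mixed
mean_abs, sparsity = channel_sparsity(toy_filter)
```

In this toy case the second channel consists entirely of weights below the filter mean, so its sparsity is 1 and it would be the first candidate for removal.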

3.1.3 Channel Pruning

In our proposed parameter pruning (PP) technique, the pruning mechanism contains a retraining step. As shown in Fig. 1(c), if the sparsity $S_j$ of a channel $F_i^j$ is larger than a predefined threshold value $\theta$, the weights within the channel $F_i^j$ are set to zero. Next, we retrain the model. After several iterations, we remove $F_i^j$, as shown in Fig. 1(b), and obtain $\hat{F}_i$ as the channel-pruned filter. Because $F_i$ is reduced, $R(t)$ is reduced accordingly after the PP process. Finally, we can compute the output as follows:

$$y(t) = \sum_{j=1}^{n_c - p} \hat{F}_i^j \cdot \hat{R}^j(t)$$

where $n_c$ is the number of channels before pruning, $p$ is the number of pruned channels, and $\hat{R}^j(t)$ is the part of the reduced receptive field that lies in the $j$-th remaining channel. This PP technique ensures that the compacted model remains stable after the pruning steps, while the retraining steps allow the model to adapt to the zero-weighted channels. We believe that our approach, unlike other pruning methods that directly remove filters, can effectively prevent severe performance drops.
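The two pruning steps (zeroing redundant channels, then physically removing them) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and toy filter are hypothetical, sparsity is taken as the fraction of a channel's weights whose magnitude is below the filter's mean absolute weight, and the retraining that the text places between the two steps is omitted.

```python
import numpy as np

def prune_channels(filter_weights, threshold):
    """Zero out, then drop, channels whose sparsity exceeds `threshold`.

    Returns (zeroed, pruned, keep_mask). Retraining between the two
    steps is not shown here.
    """
    w = np.asarray(filter_weights, dtype=float)
    mean_abs = np.abs(w).mean()
    sparsity = (np.abs(w) < mean_abs).mean(axis=1)
    keep_mask = sparsity <= threshold
    zeroed = w * keep_mask[:, None]   # step 1: redundant channels set to zero
    pruned = w[keep_mask]             # step 2: redundant channels removed
    return zeroed, pruned, keep_mask

# Hypothetical filter: the second channel is almost entirely near zero.
toy_filter = np.array([[0.9, -0.8, 0.7, 1.0],
                       [0.01, -0.02, 0.03, 0.0],
                       [0.5, 0.01, -0.6, 0.02]])
zeroed, pruned, keep_mask = prune_channels(toy_filter, threshold=0.7)
```

The same `keep_mask` would be applied to the receptive field (for example `field[keep_mask]`), which is how the input seen by the filter shrinks together with the filter itself.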

3.2 The Parameter Quantization (PQ) Technique

In this study, the parameter quantization is carried out based on the k-means algorithm. By applying the k-means quantization, the parameters in a neural network model are grouped into several clusters, where each cluster of parameters shares a centroid value. Fig. 2 shows an example of the k-means-based PQ process. In this figure, each weight parameter in the original model is represented by a 32-bit floating point number. By applying the k-means algorithm with k=4, we can obtain a look-up table with 4 cluster centroids. Each weight in the model is then denoted with a cluster index that is linked to the corresponding cluster centroid. Therefore, the 10 weights (each represented as a 32-bit floating point number) in the original model can be represented with 10 cluster indices (each taking one of 4 values, i.e., 2 bits) and 4 centroids. The corresponding compression rate is: $(10 \times 32) / (10 \times 2 + 4 \times 32) = 320 / 148 \approx 2.16$.

Figure 2: Example of the PQ technique.
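The PQ example can be reproduced with a small scalar k-means written in NumPy. This is a sketch under assumptions: the evenly spaced centroid initialisation, the fixed iteration count, the function names, and the ten toy weights are all illustrative choices and are not taken from the paper.

```python
import math
import numpy as np

def kmeans_quantize(weights, k, n_iter=50):
    """Scalar k-means: return (centroids, indices) for a flat weight array."""
    w = np.asarray(weights, dtype=float).ravel()
    # Assumed initialisation: centroids evenly spaced over the weight range.
    centroids = np.linspace(w.min(), w.max(), k)
    for _ in range(n_iter):
        indices = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = w[indices == c]
            if members.size:          # an empty cluster keeps its old centroid
                centroids[c] = members.mean()
    indices = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids, indices

def compression_rate(n_weights, k, bits_per_weight=32):
    """Original bits divided by (index bits + codebook bits)."""
    index_bits = max(1, math.ceil(math.log2(k)))
    original = n_weights * bits_per_weight
    compressed = n_weights * index_bits + k * bits_per_weight
    return original / compressed

# Ten hypothetical weights, quantized with k = 4 as in the example.
weights = np.array([2.1, 1.9, 0.1, -0.1, -1.0, -1.1, 0.0, 2.0, -0.9, 1.0])
centroids, indices = kmeans_quantize(weights, k=4)
dequantized = centroids[indices]             # weights as seen at inference time
rate = compression_rate(n_weights=10, k=4)   # 320 / 148
```

With only ten weights the 4-entry look-up table dominates the compressed size; for a model with hundreds of thousands of weights the codebook cost becomes negligible and the rate approaches 32 divided by the number of index bits.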

3.3 The Integration of PP and PQ

Although both methods aim to increase the model compactness, the PP and PQ techniques are derived based on different concepts. In this section, we investigate the compatibility of these two techniques. Fig. 3 shows the proposed integration system. PP is applied to remove redundant channels and establish a compact SE model. PQ is subsequently used to further reduce the model size.

Figure 3: Integration system of the PP and PQ techniques.

4 Experiments

In this section, we first introduce the experimental setup and then demonstrate the experimental results.

4.1 Experimental Setup

The TIMIT corpus was used to prepare the training and test sets. All 4620 utterances in the TIMIT training set were selected as training data. These utterances were corrupted with five noise types (Babble, Car, Jackhammer, Pink, and Street) at five SNR levels (-10 dB, -5 dB, 0 dB, 5 dB, and 10 dB). 100 utterances were randomly selected from the TIMIT testing set as the testing data. These utterances were artificially corrupted with three noise types (Babycry, White, and Engine) at four SNR levels (-12 dB, -6 dB, 0 dB, and 6 dB). Note that we intentionally designed mismatched noise types and SNR levels for training and testing conditions in order to simulate a more realistic scenario. We evaluated the PP and PQ techniques using two standardized metrics: perceptual evaluation of speech quality (PESQ) [23] and short-time objective intelligibility (STOI) [24]. PESQ measures the quality of the processed speech by assigning a score ranging from −0.5 to 4.5; a higher PESQ score denotes better speech quality. STOI measures speech intelligibility by assigning a score ranging from 0 to 1; a higher STOI score denotes better intelligibility.
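Corrupting a clean utterance with noise at a chosen SNR is commonly done by scaling the noise so that the ratio of signal power to noise power matches the target. The sketch below shows one such procedure; it is an assumption about how the mixing could be done, not the authors' data preparation script, and the sine wave and random noise are stand-ins for real TIMIT speech and recorded noise.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or crop the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # stand-in for speech
noise = rng.standard_normal(8000)                            # stand-in for a noise clip
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=-5)
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(scaled_noise ** 2))
```

Looping this function over the listed noise types and SNR levels would produce the kind of training and test conditions described in the setup.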

4.2 Experimental Results

In this section, we present the PESQ and STOI results produced with PQ, PP, and the integrated system.

4.2.1 Parameter Quantization (PQ)

Regarding the PQ technique, we set the number of clusters k to 2, 4, 8, 16, 32, and 64, and the corresponding PESQ and STOI results are shown in Fig. 4 (a) and (b), respectively. It is clear that the scores of both PESQ and STOI decrease when the cluster number k is reduced. In practical SE applications, we must consider performance and computation simultaneously. Thus, we may first define a bound for acceptable performance drop (BAPD) and continue reducing the cluster number until the evaluation scores are lower than a defined bound. In this experiment, we consider this BAPD to be the mean score of the results produced with the original SE model and that of noisy speech. Using Fig.4(a) as an example, the PESQ scores for noisy speech and FCN without pruning are 1.64 and 1.85, respectively. BAPD is then defined as (1.64+1.85)/2 = 1.75. It is clear that this BAPD should be determined based on the target task. Here, we used a remarkably simple method for determining the BAPD value by providing an example. Figs. 4(a) and (b) show that the PESQ and STOI scores are similar with reduced BAPD when k4.
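The BAPD rule from this paragraph can be written down directly. In the sketch below, the function names and the "at or above the bound" comparison are assumptions; the only values taken from the text are the two PESQ scores, 1.64 and 1.85.

```python
def bapd(noisy_score, original_score):
    """Bound for acceptable performance drop: midpoint of the two scores."""
    return (noisy_score + original_score) / 2.0

def smallest_acceptable_k(scores_by_k, bound):
    """Smallest cluster number whose score is still at or above the bound.

    scores_by_k maps a cluster number k to its evaluation score.
    Returns None when no setting reaches the bound.
    """
    acceptable = [k for k, score in scores_by_k.items() if score >= bound]
    return min(acceptable) if acceptable else None

pesq_bound = bapd(1.64, 1.85)   # 1.745, reported as 1.75 in the text
```

Given measured PESQ scores for each k, `smallest_acceptable_k` would return the most aggressive quantization setting that still satisfies the bound.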

(a) PESQ (b) STOI
Figure 4: The average PESQ and STOI scores yielded from the PQ technique with different numbers of clusters. BAPD denotes the bound for acceptable performance drop.

4.2.2 Parameter Pruning (PP)

While implementing the PP technique, we gradually reduced the sparsity threshold from 1 (i.e., without conducting PP) to 0.60 with a step size of 0.05. We retrained the model after each sparsity threshold reduction. The PESQ and STOI results are shown in Figs. 5 (a) and (b), respectively. These results show that both PESQ and STOI scores dropped significantly when the sparsity threshold decreased from 0.70 to 0.65. In Table 1, we list the correlation between the sparsity threshold and the removal ratio in the SE model. The results in Table 1 show that the corresponding removal ratio is 19.8% when the sparsity threshold is set to 0.70.

(a) PESQ (b) STOI
Figure 5: The average PESQ and STOI scores yielded by the PP technique with different sparsity threshold values. BAPD denotes the bound for acceptable performance drop.

Sparsity threshold    Removal ratio    Remaining parameters
1.00                  0.00%            300,300
0.75                  14.0%            258,225
0.70                  19.8%            240,900
0.65                  27.1%            218,900
0.60                  30.1%            209,770
Table 1: Correlation between sparsity threshold and removal ratio, as well as the number of remaining parameters in the SE model.

4.2.3 The Integration of PP and PQ

Finally, we investigated whether integrating the PP and PQ techniques can provide an even more compact SE model. Based on our preliminary experiments, the more effective integration order is PP followed by PQ. From the results in Figs. 4 and 5, setting the sparsity threshold $\theta$=0.70 provides reasonably satisfactory performance. Therefore, we tested $\theta$ values of 0.65, 0.70, and 0.75 with the PP technique while varying the number of clusters k. The results are shown in Fig. 6. We first note that the systems with $\theta$=0.70 and 0.75 provide similar performance; both values of $\theta$ notably outperform the case with $\theta$=0.65. Moreover, the systems with $\theta$=0.70 and 0.75 both suffered considerable performance drops when k=8. The results in Fig. 6 show that the system with $\theta$=0.70 and k=16 provides the best performance: the size of the compacted SE model is only 10.03% of that of the original model, while STOI is reduced by 1.43% (from 0.70 to 0.69) and PESQ is reduced by 3.24% (from 1.85 to 1.79).

(a) PESQ (b) STOI
Figure 6: The average of PESQ & STOI results achieved by the integration of PP and PQ techniques.
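The reported model size of about 10.03% can be reproduced with simple arithmetic from the parameter counts in Table 1. The sketch below is one reading that matches the number, under the assumption that only the per-weight cluster indices are counted (4 bits for k=16 instead of 32 bits) and the small codebook is ignored; the text does not state how the figure was computed, and the function name is hypothetical.

```python
import math

def relative_model_size(remaining_params, original_params, k, bits_per_weight=32):
    """Size of the pruned and quantized model relative to the original.

    Assumption: only the per-weight cluster indices are counted; the
    k-entry codebook is ignored.
    """
    index_bits = max(1, math.ceil(math.log2(k)))
    return (remaining_params * index_bits) / (original_params * bits_per_weight)

# Parameter counts from Table 1 (sparsity threshold 0.70) with k = 16 clusters.
ratio = relative_model_size(240_900, 300_300, k=16)
```

Here pruning keeps roughly 80.2% of the parameters and quantization shrinks each remaining weight to 4/32 of its original size, which multiplies out to roughly one tenth of the original model.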

5 Conclusion

We propose utilizing the PP and PQ techniques to increase the compactness of the FCN model for an SE task. The main contribution of this study is two-fold. First, to the best of our knowledge, the PP technique is the first technique that directly removes redundant channels in the FCN model. Second, we have shown that applying PP, PQ, and an integration of PP and PQ effectively reduces the model size with only modest performance drops. The results suggest that the proposed PP and PQ techniques allow an SE system with a compact neural network model to be installed in embedded devices that have lower computational capabilities. Note also that although compression techniques for deep-learning models have been widely studied for pattern recognition (classification) tasks, only a few works have investigated model compression for signal generation (regression) tasks. Because of the different output formats, the effects of model compression on regression tasks should be very different from those on classification tasks. The present study is the first to investigate the effects of model pruning and quantization on the SE task (a regression task), and the results can serve as useful guidance for future SE studies.