1 Introduction
The goal of speech enhancement (SE) is to generate enhanced speech with better quality and intelligibility than the original noisy speech. Many SE methods have been proposed in the past. One class of approaches directly subtracts the estimated noise components from the noisy speech in the spectral domain; notable examples include spectral subtraction
[1] and its extensions. Another class of approaches considers the characteristics of speech and noise signals during the design of a gain function, which is used to filter out the noise components; well-known examples include the Wiener filter [2], minimum mean-square-error (MMSE) [3], and maximum-likelihood spectral amplitude (MLSA) [4] algorithms. These traditional approaches perform well when the assumed properties of the speech and noise signals hold, but their performance degrades notably when dealing with non-stationary noises or operating at very low signal-to-noise ratios (SNRs).
Recently, deep learning algorithms have been successfully introduced to the SE field [5]. Generally speaking, a deep-learning model is used as a mapping function that transforms noisy speech into clean speech. Notable approaches include the deep denoising autoencoder (DDAE) [6], the deep feedforward neural network [7], the convolutional neural network (CNN) [8], and the long short-term memory model (LSTM) [9]; all of these models have shown promising results for transforming noisy spectral features into clean ones. More recently, several studies proposed the use of convolutional structures for speech and audio signal analysis and reconstruction [10, 11, 12, 13, 14], so that the SE task can be carried out directly in the time domain.
Numerous studies have confirmed the outstanding denoising capability of deep learning-based methods, especially under challenging conditions (e.g., non-stationary noises and low SNR). However, a notable disadvantage of deep learning-based solutions is that they require large storage space for the SE models and incur high online computational costs, which makes them difficult to implement on a device with limited resources. In this study, we propose two techniques, namely parameter pruning (PP) and parameter quantization (PQ), to increase the compactness of deep learning-based SE models. The PP technique removes redundant channels, and the PQ technique groups similar weights and represents each group by a cluster centroid. To evaluate the effectiveness of these two techniques, we used the TIMIT database [15] with several noise sources. We focus on the waveform-mapping-based SE method using the fully convolutional neural network (FCN) model. The experimental results show that both the PP and PQ techniques can effectively improve model compactness with only modest degradations in quality and intelligibility.
The rest of this paper is organized as follows. Related research is reviewed in Section 2. Section 3 introduces the proposed techniques. Section 4 presents the experimental setup and results. Our concluding remarks are stated in Section 5.
2 Related Research
As mentioned earlier, we focus our attention on waveform-mapping-based SE using the FCN model. The FCN model is a specialized CNN that consists of only convolutional layers. In our previous studies, we showed that the FCN model can directly map a noisy speech waveform to a clean waveform [10, 11]. The waveform-mapping-based SE process has two advantages. First, possible distortions caused by imperfect phase information can be alleviated; second, the computational cost of converting a waveform into frame-based spectral features can be reduced.
Many algorithms have been derived to increase the compactness of neural network models, such as pruning, sparse constraints, and quantization. Pruning algorithms are designed to reduce network complexity and to address overfitting [16]. Given a threshold, any weight whose value is lower than the threshold is removed from the model, thus reducing the total number of weights. Another class of approaches builds compact models by applying sparse constraints to remove trivial filters [17]. On the other hand, quantization algorithms compress the size of the original network by reducing the number of bits required to represent each weight [18]. Han et al. [19] applied k-means scalar quantization to the parameter values. These quantization methods significantly reduce memory usage with only a modest loss in recognition accuracy.
Based on our literature survey, only a few studies have investigated approaches to increase the compactness of SE models. Sun and Li proposed a quantization technique to increase the compactness of an SE model [20]. Ko et al. investigated the correlation between precision scaling and the number of neurons in an SE model [21]. In [22], a two-stage quantization approach was derived to optimally reduce the number of bits when the model parameters are encoded in a floating-point representation. In the present study, the proposed PP technique adopts a different and novel concept that directly removes redundant channels to form a compact FCN model. The size of this model is then reduced further with the PQ technique.
3 The Proposed PP and PQ Techniques
This section introduces the proposed PP and PQ techniques, as well as the integration of these two techniques.
3.1 The Parameter Pruning (PP) Technique
3.1.1 FCN-based Waveform Mapping
Figure 1(a) shows the waveform-mapping process based on the FCN model. In the figure, we have $L$ filters $F_1, F_2, \dots, F_L$, where $F_l$ is the $l$-th filter and $F_l^c$ is the $c$-th channel of $F_l$; $F_l^c = \{w_{c,1}, w_{c,2}, \dots, w_{c,M}\}$, where $w_{c,m}$ is a channel weight. Assume that the receptive field and the output sample of filter $F_l$ are $R(t)$ and $y_l(t)$, respectively. The resulting convolution operation is:

$$y_l(t) = \sum_{c=1}^{C} F_l^c * R^c(t), \qquad (1)$$

where $C$ is the number of channels and $R^c(t)$ is the part of the receptive field seen by channel $c$.
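Eq. (1) can be sketched numerically as follows; the filter and receptive-field shapes and values here are illustrative toy data, not taken from an actual FCN.

```python
import numpy as np

def fcn_filter_output(F, R):
    """Eq. (1): the output sample of one filter is the sum of per-channel
    products between the filter weights F and its receptive field R.

    F: (C, M) array -- C channels, M weights per channel (toy shapes)
    R: (C, M) array -- the receptive field aligned with F at time t
    """
    return float(np.sum(F * R))  # sum over all channels and taps

# Toy example: a 2-channel filter with 3 weights per channel.
F = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.0, -0.4]])
R = np.ones((2, 3))
y_t = fcn_filter_output(F, R)
```

In a full FCN layer this product is evaluated at every time step t, producing one output waveform per filter.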
3.1.2 Definition of Sparsity
We define the redundancy criterion based on the sparsity of each channel in a filter. For a filter $F_l$ in an arbitrary layer of the FCN, we first compute the mean absolute value of all filter weights:

$$\bar{w}_l = \frac{1}{C \times M} \sum_{c=1}^{C} \sum_{m=1}^{M} |w_{c,m}|, \qquad (2)$$

where $C$ and $M$ are the total number of channels in a filter and the number of weights in a channel, respectively, and $w_{c,m}$ is a weight parameter. The sparsity of the $c$-th channel in a filter $F_l$ can then be defined as:

$$s_l^c = \frac{1}{M} \sum_{m=1}^{M} g(w_{c,m}), \qquad (3)$$

$$g(w_{c,m}) = \begin{cases} 1, & |w_{c,m}| < \bar{w}_l \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

When $s_l^c$ is close to $1$, most of the weights in the channel are smaller than $\bar{w}_l$, and the channel is considered more redundant.
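The sparsity definition above can be sketched directly in code; the filter values are illustrative.

```python
import numpy as np

def channel_sparsity(F):
    """Eqs. (2)-(4): the sparsity of channel c is the fraction of its
    weights whose magnitude falls below the filter-wide mean absolute
    weight. F has shape (C, M); the shapes and values are toy examples.
    """
    w_bar = np.mean(np.abs(F))                  # Eq. (2): mean absolute weight
    return np.mean(np.abs(F) < w_bar, axis=1)   # Eqs. (3)-(4): one value per channel

F = np.array([[1.0, 0.9, 1.1],      # large weights -> low sparsity
              [0.01, 0.02, 1.0]])   # mostly tiny weights -> high sparsity
s = channel_sparsity(F)  # sparsity close to 1 marks a redundant channel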
3.1.3 Channel Pruning
In our proposed parameter pruning (PP) technique, the pruning mechanism contains a retraining step. As shown in Fig. 1(c), if the sparsity $s_l^c$ of a channel $F_l^c$ is larger than a predefined threshold value $\tau$, the weights within that channel are set to zero. Next, we retrain the model. After several iterations, we remove the zero-weighted channels, as shown in Fig. 1(b), and obtain the channel-pruned filters. Because the number of channels in $F_l$ is reduced, the receptive field $R(t)$ shrinks accordingly after the PP process. Finally, we compute the output as follows:

$$y_l(t) = \sum_{c=1}^{C-P} F_l^c * R^c(t), \qquad (5)$$

where $P$ is the number of pruned channels. This PP technique keeps the compacted model stable across the pruning steps, while the retraining steps let the model adjust to the zero-weighted channels. We believe that this approach, unlike other pruning methods that directly remove filters, can effectively prevent severe performance drops.
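A single pruning step can be sketched as follows. This is our illustrative reading of the procedure, with toy weights; in the full pipeline, retraining would run with the zeroed channels held fixed before they are finally removed.

```python
import numpy as np

def prune_step(filters, tau):
    """One PP iteration, sketched under the sparsity definition above:
    zero out every channel whose sparsity exceeds the threshold tau.
    `filters` is a list of (C, M) weight arrays; names are illustrative.
    """
    pruned = []
    for F in filters:                                  # F has shape (C, M)
        w_bar = np.mean(np.abs(F))                     # Eq. (2)
        sparsity = np.mean(np.abs(F) < w_bar, axis=1)  # Eqs. (3)-(4)
        F = F.copy()
        F[sparsity > tau, :] = 0.0                     # zero-weight redundant channels
        pruned.append(F)
    # retraining would happen here before the zeroed channels are removed
    return pruned

# A channel dominated by tiny weights gets zeroed at tau = 0.6.
F = np.array([[1.0, 1.0, 1.0],
              [0.01, 0.01, 2.0]])
pruned = prune_step([F], tau=0.6)
```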
3.2 The Parameter Quantization (PQ) Technique
In this study, parameter quantization is carried out based on the k-means algorithm. By applying k-means quantization, the parameters in a neural network model are grouped into several clusters, and each cluster of parameters shares a centroid value. Fig. 2 shows an example of the k-means-based PQ process. In this figure, each weight parameter in the original model is represented by a 32-bit floating-point number. By applying the k-means algorithm with k = 4, we obtain a lookup table with 4 cluster centroids. Each weight in the model is then denoted by a cluster index that is linked to the corresponding cluster centroid. Therefore, the 10 weights (each represented as a 32-bit floating-point number) in the original model can be represented by 10 two-bit cluster indices and 4 centroids. The corresponding compression rate is (10 × 32)/(10 × 2 + 4 × 32) ≈ 2.2.
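A minimal sketch of this PQ step is given below, using a plain 1-D k-means written in numpy; the weight values are illustrative, not taken from an actual model.

```python
import numpy as np

def kmeans_quantize(weights, k, iters=20):
    """k-means PQ sketch: map each 32-bit weight to an index into a
    k-entry centroid table. Plain 1-D k-means; names are illustrative."""
    w = np.asarray(weights, dtype=np.float64)
    centroids = np.quantile(w, np.linspace(0, 1, k))   # spread-out init
    for _ in range(iters):
        # assign every weight to its nearest centroid, then update means
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):                       # skip empty clusters
                centroids[c] = w[idx == c].mean()
    return idx, centroids

w = [0.2, 0.22, -0.5, -0.48, 1.0, 0.98, 0.0, 0.02, 0.21, -0.49]
idx, table = kmeans_quantize(w, k=4)   # table[idx] approximates w

# Storage for this example: 10 two-bit indices plus 4 x 32-bit centroids,
# versus 10 x 32 bits originally.
rate = (10 * 32) / (10 * 2 + 4 * 32)   # about 2.2x compression
```

Dequantization is a single table lookup (`table[idx]`), which is why inference cost stays low after PQ.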
3.3 The Integration of PP and PQ
Although both methods aim to increase the model compactness, the PP and PQ techniques are derived based on different concepts. In this section, we investigate the compatibility of these two techniques. Fig. 3 shows the proposed integration system. PP is applied to remove redundant channels and establish a compact SE model. PQ is subsequently used to further reduce the model size.
4 Experiments
In this section, we first introduce the experimental setup and then demonstrate the experimental results.
4.1 Experimental Setup
The TIMIT corpus was used to prepare the training and test sets. All 4620 utterances in the TIMIT training set were selected as training data. These utterances were corrupted with five noise types (Babble, Car, Jackhammer, Pink, and Street) at five SNR levels (−10 dB, −5 dB, 0 dB, 5 dB, and 10 dB). 100 utterances were randomly selected from the TIMIT testing set as testing data. These utterances were artificially corrupted with three noise types (Babycry, White, and Engine) at four SNR levels (−12 dB, −6 dB, 0 dB, and 6 dB). Note that we intentionally designed mismatched noise types and SNR levels between the training and testing conditions in order to simulate a more realistic scenario. We evaluated the PP and PQ techniques using two standardized metrics: perceptual evaluation of speech quality (PESQ) [23] and short-time objective intelligibility (STOI) [24]. PESQ measures the quality of the processed speech with a score ranging from −0.5 to 4.5; a higher PESQ score denotes better speech quality. STOI measures speech intelligibility with a score ranging from 0 to 1; a higher STOI score denotes better intelligibility.
4.2 Experimental Results
In this section, we present the PESQ and STOI results produced with PQ, PP, and the integrated system.
4.2.1 Parameter Quantization (PQ)
For the PQ technique, we set the number of clusters k to 2, 4, 8, 16, 32, and 64; the corresponding PESQ and STOI results are shown in Figs. 4(a) and (b), respectively. The scores of both PESQ and STOI decrease as the cluster number k is reduced. In practical SE applications, we must consider performance and computation simultaneously. Thus, we may first define a bound for the acceptable performance drop (BAPD) and continue reducing the cluster number until the evaluation scores fall below this bound. In this experiment, we set the BAPD to the mean of the score produced by the original SE model and that of the noisy speech. Using Fig. 4(a) as an example, the PESQ scores for noisy speech and the FCN without pruning are 1.64 and 1.85, respectively; the BAPD is then (1.64 + 1.85)/2 ≈ 1.75. Clearly, the BAPD should be determined based on the target task; here we adopt this simple averaging rule merely as an example. Figs. 4(a) and (b) show that the PESQ and STOI scores remain above the BAPD when k ≥ 4.
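The BAPD-driven selection of k can be sketched as a simple loop; the PESQ-by-k numbers below are illustrative placeholders, not the paper's exact curve.

```python
def smallest_k_above_bapd(scores, noisy_score, full_score):
    """BAPD selection sketch: keep shrinking the cluster number k while
    the quantized model's score stays at or above the bound."""
    bapd = (noisy_score + full_score) / 2.0   # mean of noisy and full-model scores
    best = None
    for k in sorted(scores, reverse=True):    # try 64, 32, 16, ...
        if scores[k] >= bapd:
            best = k                          # still acceptable; keep shrinking
        else:
            break                             # dropped below the bound; stop
    return best, bapd

# Hypothetical PESQ scores per cluster number k.
pesq_by_k = {64: 1.84, 32: 1.83, 16: 1.80, 8: 1.76, 4: 1.75, 2: 1.60}
k_min, bapd = smallest_k_above_bapd(pesq_by_k, noisy_score=1.64, full_score=1.85)
```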
4.2.2 Parameter Pruning (PP)
When implementing the PP technique, we gradually reduced the sparsity threshold τ from 1 (i.e., no pruning) to 0.60 with a step size of 0.05, retraining the model after each reduction. The PESQ and STOI results are shown in Figs. 5(a) and (b), respectively. These results show that both the PESQ and STOI scores dropped significantly when the sparsity threshold decreased from 0.70 to 0.65. Table 1 lists the correlation between the sparsity threshold and the removal ratio of the SE model; the removal ratio is 19.8% when the sparsity threshold is set to 0.70.
Table 1. Sparsity threshold vs. removal ratio and remaining parameters.

Sparsity threshold | Removal ratio | Remaining parameters
1.00 | 0.00% | 300,300
0.75 | 14.0% | 258,225
0.70 | 19.8% | 240,900
0.65 | 27.1% | 218,900
0.60 | 30.1% | 209,770
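The removal ratios in Table 1 follow directly from the remaining-parameter counts, as this arithmetic check shows:

```python
# Arithmetic check of Table 1: the removal ratio is the fraction of the
# original 300,300 parameters removed at each sparsity threshold.
total = 300_300
remaining = {1.00: 300_300, 0.75: 258_225, 0.70: 240_900,
             0.65: 218_900, 0.60: 209_770}
removal = {t: 1 - r / total for t, r in remaining.items()}
# e.g., removal[0.70] = 1 - 240900/300300, i.e., 19.8%
```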
4.2.3 The Integration of PP and PQ
Finally, we investigated whether integrating the PP and PQ techniques can provide an even more compact SE model. Based on our preliminary experiments, the more effective integration order is PP followed by PQ. From the results in Figs. 4 and 5, setting the sparsity threshold τ = 0.70 provides reasonably satisfactory performance. Therefore, we tested τ values of 0.65, 0.70, and 0.75 with the PP technique while varying the number of clusters. The results are shown in Fig. 6. We first note that the systems with τ = 0.70 and 0.75 provide similar performance, and both notably outperform the system with τ = 0.65. Moreover, the systems with τ = 0.70 and 0.75 both suffered considerable performance drops at k = 8. The results in Fig. 6 show that the system with τ = 0.70 and k = 16 provides the best trade-off: the size of the compacted SE model is only 10.03% of that of the original model, while STOI drops by 1.43% (from 0.70 to 0.69) and PESQ drops by 3.24% (from 1.85 to 1.79).
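The 10.03% figure is consistent with a simple storage accounting, sketched below under our own assumption (not spelled out in the text) that each remaining weight becomes a 4-bit index into a 16-entry table of 32-bit centroids:

```python
# Consistency check of the 10.03% model-size figure for tau = 0.70
# (240,900 remaining weights, per Table 1) combined with k = 16 PQ:
# each weight -> a 4-bit index (2^4 = 16 clusters), plus the centroid table.
original_bits = 300_300 * 32
compact_bits = 240_900 * 4 + 16 * 32
ratio = compact_bits / original_bits   # fraction of the original model size
```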
5 Conclusion
We propose utilizing the PP and PQ techniques to increase the compactness of the FCN model for the SE task. The main contribution of this study is twofold. First, to the best of our knowledge, the PP technique is the first to directly remove redundant channels from an FCN model. Second, we have shown that applying PP, PQ, and their integration effectively reduces the model size with only modest performance drops. The results suggest that the proposed PP and PQ techniques allow an SE system with a compact neural network model to be installed on embedded devices with limited computational capability. Note also that although compression techniques for deep-learning models have been widely studied for pattern recognition (classification) tasks, only a few works have investigated model compression for signal generation (regression) tasks. Because of the different output formats, the effects of model compression on regression tasks can differ greatly from those on classification tasks. The present study is the first to investigate the effects of model pruning and quantization on the SE task (a regression task), and the results can serve as useful guidance for future SE studies.
References
[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[2] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," in ICASSP, 1996.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[4] R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.
[5] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[6] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in INTERSPEECH, 2013.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[8] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in INTERSPEECH, 2016.
[9] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in INTERSPEECH, 2015, pp. 3274–3278.
[10] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in Proc. APSIPA, 2017.
[11] S.-W. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[12] A. Pandey and D. Wang, "A new framework for supervised speech enhancement in the time domain," in INTERSPEECH, 2018.
[13] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," arXiv preprint arXiv:1809.07454, 2018.
[14] E. M. Grais, D. Ward, and M. D. Plumbley, "Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders," arXiv preprint arXiv:1803.00702, 2018.
[15] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. Y. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
[16] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," CoRR, vol. abs/1506.02626, 2015.
[17] C.-T. Liu, Y.-H. Wu, Y.-S. Lin, and S.-Y. Chien, "A kernel redundancy removing policy for convolutional neural network," CoRR, vol. abs/1705.10748, 2017.
[18] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," CoRR, vol. abs/1603.01025, 2016.
[19] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," CoRR, vol. abs/1510.00149, 2015.
[20] H. Sun and S. Li, "An optimization method for speech enhancement based on deep neural network," in IOP Conference Series: Earth and Environmental Science, vol. 69, p. 012139, 2017.
[21] J. H. Ko, J. Fromm, M. Philipose, I. Tashev, and S. Zarar, "Precision scaling of neural networks for efficient audio processing," CoRR, vol. abs/1712.01340, 2017.
[22] Y.-T. Hsu, Y.-C. Lin, S.-W. Fu, Y. Tsao, and T.-W. Kuo, "A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)," CoRR, vol. abs/1808.06474, 2018.
[23] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs," in ICASSP, 2001, pp. 749–752.
[24] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.