Speech Dereverberation Based on Integrated Deep and Ensemble Learning

Wei-Jen Lee, et al. (01/12/2018)

Reverberation, which is generally caused by sound reflections from walls, ceilings, and floors, can severely degrade the performance of acoustic applications. Owing to a complicated combination of attenuation and time-delay effects, the reverberation property is difficult to characterize, and it remains a challenging task to effectively retrieve the anechoic speech signals from reverberant ones. In the present study, we propose a novel integrated deep and ensemble learning algorithm (IDEA) for speech dereverberation. The IDEA consists of offline and online phases. In the offline phase, we train multiple dereverberation models, each aiming to precisely dereverb speech signals in a particular acoustic environment; a unified fusion function is then estimated to integrate the information of these multiple dereverberation models. In the online phase, an input utterance is first processed by each of the dereverberation models, and the outputs of all models are integrated accordingly to generate the final anechoic signal. We evaluated the IDEA on designed acoustic environments, including both matched and mismatched conditions between the training and testing data. Experimental results confirm that the proposed IDEA outperforms a single deep-neural-network-based dereverberation model with the same model architecture and training data.

1 Introduction

In realistic environments, the perceived speech signal may comprise the original speech and multiple attenuated and time-delayed copies of it [1]. The combination of these signals can cause serious performance degradation in speech-related applications. For example, distant-talking speech significantly degrades the performance of automatic speech recognition (ASR) [2, 3] and speaker identification [4, 5]. Meanwhile, the adverse effects of reverberation lower sound quality and intelligibility for both hearing-impaired and normal-hearing listeners [6, 7, 8]. In the past, various speech dereverberation methods have been developed. The goal of these methods is to extract anechoic speech signals from reverberant ones, thereby enhancing the performance of speech-related applications and simultaneously improving sound quality and intelligibility for listeners in reverberant environments.

Traditional speech dereverberation methods can be roughly divided into three categories [9]. The first category is the source-model-based method, which estimates the clean signal by employing prior knowledge about time–frequency speech structures [10, 11, 12, 13]. The second category is the homomorphic filtering technique, which adopts a homomorphic transformation to decompose the reverberant signal from the time domain into the cepstral domain, and thus separates the reverberation from the input cepstral coefficients with a simple subtraction operation [14]. Channel-inversion methods belong to the third category; they consider the reverberation as a convolution of the original sound with the room impulse response (RIR) and thereby perform inverse filtering to deconvolve the captured signal [15, 16, 17, 18, 19, 20]. Even though these three categories of approaches have been shown to provide satisfactory performance, they usually require an accurate estimation of the time-varying RIR, which may not always be accessible in practice [21].

Recently, deep neural network (DNN) models, which show strong regression capabilities, have been used to address the speech dereverberation issue [21, 22]. The main concept here is to use a DNN model to characterize the non-linear spectral mapping from reverberant to anechoic speech in the training stage. In the testing stage, the trained DNN model is used to generate dereverbed utterances given the input reverberant signals. The same concept has been applied to perform denoising and dereverberation simultaneously [6]. Despite providing notable improvements over traditional algorithms, DNN-based dereverberation methods achieve the optimal performance only in matched training and testing reverberant conditions. To further improve the performance, an environment-aware DNN-based dereverberation system has been proposed, which selects the optimal DNN models online to perform dereverberation [23].

In contrast to the idea used in [23], the present study extends previous work on the deep denoising autoencoder (DDAE) for speech enhancement [24, 25] and proposes a novel integrated deep and ensemble learning algorithm (IDEA) for speech dereverberation. The IDEA consists of offline and online phases. In the offline phase, multiple DDAE-based dereverberation models are prepared, each aiming to precisely dereverb speech signals in a particular acoustic environment. Then, a unified fusion model is estimated to integrate the information of the multiple dereverberation models with the aim of estimating clean speech. In the online phase, an input reverberant speech signal is first processed by all dereverberation models simultaneously, and the outputs are integrated to ultimately generate the anechoic signal. The ensemble learning strategy, which has been proven to improve system performance in speech enhancement [25] and ASR [26, 27], is adopted in this task to increase the generalization ability of the DDAEs. As will be shown in the experiments, conducted using the Mandarin hearing in noise test (MHINT) corpus [28], a DDAE-based dereverberation system achieves the best quality and intelligibility scores when the training and testing conditions are similar (matched condition); however, the performance degrades significantly under mismatched conditions between training and testing. The evaluation results further indicate that the proposed IDEA outperforms the DDAE-based dereverberation system trained in the matched condition and significantly improves speech quality and intelligibility in both matched and mismatched conditions.

The rest of this paper is organized as follows. The spectral-mapping-based speech dereverberation system is reviewed in Section 2. Then, the proposed IDEA is introduced in Section 3. Experimental setup and analyses are presented in Section 4. Section 5 concludes our findings.

2 Spectral-Mapping-based Speech Dereverberation

In the time domain, the relationship between the observed noisy-reverberant signal and the clean signal is formulated as

y(t) = s(t) * h(t) + n(t),    (1)

where s(t) and n(t) represent the clean utterance and the additive noise, respectively; "*" is the convolution operation; and h(t) denotes the environmental filter (room impulse response). Fig. 1 shows the block diagram of the spectral-mapping-based speech dereverberation system, where the goal is to retrieve the anechoic speech, s(t), from the reverberant signal, y(t). As can be seen in Fig. 1, y(t) is first converted to the spectrogram representation by carrying out the short-time Fourier transform (STFT). Next, a feature extraction (FE) process is conducted to extract the logarithmic power spectrogram (LPS) features; then, to incorporate the context information, the features are prepared by concatenating the adjacent static feature frames around the m-th feature vector y_m, i.e., ŷ_m = [y_{m-M}^T, ..., y_m^T, ..., y_{m+M}^T]^T, where the superscript "T" denotes vector transposition. The DNN-based dereverberation system maps the input features to the estimated clean LPS directly, which is further restored to the magnitude spectrum with the spectral restoration (SR) function. Finally, the dereverbed spectrogram, with the updated magnitude and the original phase, is converted back to the time domain via the inverse STFT (ISTFT) to reconstruct the enhanced time signal ŝ(t).

Figure 1: Block diagram of the spectral-mapping-based speech dereverberation system.

It is noted that we only consider the reverberated clean signal in Eq. (1) and set the additive noise n(t) to zero in the present study to focus on the dereverberation task.
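To make the FE and SR steps above concrete, the following minimal Python sketch illustrates the pipeline. It is our own illustration, not the authors' code; the 32-ms frame, 16-ms shift, 257-bin LPS, and the ±5-frame context window are assumptions drawn from the setup described in Section 4.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16000
N_FFT = 512          # 32 ms at 16 kHz -> 257 frequency bins
HOP = 256            # 16 ms frame shift
CONTEXT = 5          # assumed: 5 frames on each side, 11 frames in total

def extract_lps(wave):
    """Return (log-power spectrogram [frames x 257], phase) of a waveform."""
    _, _, spec = stft(wave, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    spec = spec.T                                # frames x bins
    lps = np.log(np.abs(spec) ** 2 + 1e-12)      # logarithmic power spectrogram
    return lps, np.angle(spec)

def add_context(lps, context=CONTEXT):
    """Concatenate +/- `context` neighboring frames around each frame."""
    padded = np.pad(lps, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(lps)] for i in range(2 * context + 1)])

def restore_wave(est_lps, phase):
    """Spectral restoration (SR): rebuild the waveform from an estimated LPS
    and the original reverberant phase."""
    mag = np.sqrt(np.exp(est_lps))
    spec = (mag * np.exp(1j * phase)).T          # bins x frames for istft
    _, wave = istft(spec, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return wave

# Exercise the pipeline with dummy audio; est_lps would come from the DNN.
y = np.random.randn(FS)
lps, phase = extract_lps(y)
features = add_context(lps)                      # frames x (257 * 11) = frames x 2827
y_rec = restore_wave(lps, phase)
```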

3 The Proposed IDEA

3.1 Highway-DDAE dereverberation system

In previous studies, traditional fully connected DNNs were used to perform dereverberation [21, 22, 23]. More recently, the highway strategy has been widely adopted and shown to provide improved performance [29]. Our preliminary experiments show that using the highway strategy can also improve the speech dereverberation performance in our task. In this section, we first introduce the highway DDAE (HDDAE). Fig. 2 shows the flowchart of the HDDAE for dereverberation in the offline phase.

Figure 2: Flowchart of HDDAE in the offline phase.

From the figure, a set of clean–reverb speech pairs in the LPS domain is prepared first to form the training data, where the clean and reverb utterances each contain M frame vectors. The supervised training procedure is then conducted by placing the clean vectors s_m and the (context-expanded) reverb vectors ŷ_m, respectively, at the output and input sides of the HDDAE model. For the model with L hidden layers, we have:

h_1(ŷ_m) = σ(W_1 ŷ_m + b_1),
h_l(ŷ_m) = σ(W_l h_{l-1}(ŷ_m) + b_l), l = 2, ..., L,
s̃_m = W_{L+1} [h_L(ŷ_m); h_1(ŷ_m)] + b_{L+1},    (2)

where σ(·) is a nonlinear mapping function (the ReLU activation function is used in this study), and W_l and b_l with l = 1, ..., L+1 are the weight matrices and bias vectors, respectively. Notably, the output of the L-th hidden layer cascades with h_1(ŷ_m) (the output of the first hidden layer) to possibly address the vanishing gradient problem during the training process (please note that the highway connection may be applied between any two layers; however, the current architecture achieves the best performance in our preliminary experiments). The HDDAE parameter set θ, consisting of all W_l and b_l, is determined accordingly by minimizing the following mean squared error function:

θ* = argmin_θ (1/M) Σ_{m=1}^{M} || s̃_m(θ) - s_m ||²    (3)
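As a concrete reference, the following PyTorch sketch implements a three-hidden-layer HDDAE under our reading of Eq. (2), i.e., the first hidden layer's output is concatenated with the last hidden layer's output before the linear output layer, and trains it with the MSE objective of Eq. (3). Layer sizes follow Section 4; all names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class HDDAE(nn.Module):
    """Highway deep denoising autoencoder: the first hidden layer's output is
    concatenated with the last hidden layer's output (highway connection)."""
    def __init__(self, in_dim=2827, hid_dim=2048, out_dim=257, n_hidden=3):
        super().__init__()
        self.layers = nn.ModuleList()
        prev = in_dim
        for _ in range(n_hidden):
            self.layers.append(nn.Linear(prev, hid_dim))
            prev = hid_dim
        # The output layer sees [last hidden ; first hidden] (highway concatenation).
        self.out = nn.Linear(hid_dim * 2, out_dim)
        self.act = nn.ReLU()

    def forward(self, y):
        h = self.act(self.layers[0](y))
        h1 = h                                  # keep the first hidden-layer output
        for layer in self.layers[1:]:
            h = self.act(layer(h))
        return self.out(torch.cat([h, h1], dim=-1))

# Training-loop sketch with the MSE objective of Eq. (3), using dummy data.
model = HDDAE()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
reverb_feats = torch.randn(32, 2827)            # context-expanded reverb LPS frames
clean_lps = torch.randn(32, 257)                # corresponding clean LPS frames
for _ in range(10):
    optim.zero_grad()
    loss = loss_fn(model(reverb_feats), clean_lps)
    loss.backward()
    optim.step()
```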

3.2 IDEA for dereverberation

In this sub-section, we present the proposed IDEA for speech dereverberation. As mentioned earlier, there are offline and online phases. The offline phase further consists of ensemble preparation (EP) and ensemble integration (EI) stages, which are shown in Figs. 3 and 4, respectively. For the EP stage in Fig. 3, there are N reverberant conditions, and thus the reverb training data are divided into N subsets. With these subsets of training data, together with the N corresponding clean training sets, we have N clean–reverb training sets. Each training pair is then used to train an HDDAE model. Therefore, N HDDAE models, HDDAE_1 to HDDAE_N, are estimated in the EP stage.

Figure 3: Flowchart of the EP stage in the offline phase.

Figure 4: Flowchart of the IDEA in the offline phase (including the EP and EI stages)

Next, for the EI stage in Fig. 4, the input LPS is first processed by the N HDDAE models, as shown in Eq. (4):

s̃_m^(n) = HDDAE_n(ŷ_m), n = 1, ..., N.    (4)

Then, the outputs of all of these HDDAE models are combined as a new input, z_m = [s̃_m^(1), ..., s̃_m^(N)], to train the EI model. In this study, we construct the EI model using a convolutional neural network (CNN) with K hidden layers, as shown in Eq. (5), consisting of convolution operations f_k(·) on the m-th sample (frame) vector of the input and a fully connected hidden layer g(·):

f_1(z_m) = σ(W_1 ⊛ z_m + b_1),
f_k(z_m) = σ(W_k ⊛ f_{k-1}(z_m) + b_k), k = 2, ..., K-1,
ŝ_m = g(f_{K-1}(z_m)),    (5)

where ⊛ denotes the convolution operation.

The convolution operation applies a set of filters to extract feature maps that capture local time–frequency structures and achieve more robust feature representations [30]. The feature maps provided by the last convolutional hidden layer are then fed into the fully connected feed-forward layer g(·) to finally obtain the estimated clean LPS ŝ_m at the output layer of the CNN. Notably, a nonlinear mapping function σ(·) is applied to modulate the output of each hidden layer. In addition, the parameters φ of the CNN are randomly initialized and then optimized by minimizing the objective function in Eq. (6):

φ* = argmin_φ (1/M) Σ_{m=1}^{M} || ŝ_m(φ) - s_m ||²    (6)
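Putting the EP and EI stages together, the sketch below is an illustrative assumption of the procedure rather than the authors' implementation: one dereverberation model is trained per reverberant condition and frozen, a 1-D CNN fusion model is then trained on the stacked ensemble outputs (layer sizes taken from Section 4), and at test time every input frame is passed through all EP models before fusion. The simple MLPs stand in for the HDDAEs of Section 3.1.

```python
import torch
import torch.nn as nn

IN_DIM, LPS_DIM, N_COND = 2827, 257, 3

def make_dereverb_model():
    # Stand-in for one EP-stage dereverberation model (an HDDAE in the paper).
    return nn.Sequential(nn.Linear(IN_DIM, 2048), nn.ReLU(),
                         nn.Linear(2048, LPS_DIM))

class FusionCNN(nn.Module):
    """EI model: 1-D convolutions over the stacked ensemble outputs (one input
    channel per ensemble member), followed by a fully connected layer."""
    def __init__(self, n_models=N_COND, channels=32, fc_nodes=2048):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_models, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU())
        self.fc = nn.Sequential(nn.Linear(channels * LPS_DIM, fc_nodes), nn.ReLU(),
                                nn.Linear(fc_nodes, LPS_DIM))

    def forward(self, stacked):                    # stacked: (batch, n_models, 257)
        return self.fc(self.convs(stacked).flatten(start_dim=1))

# EP stage: train one dereverberation model per reverberant condition (dummy data).
subsets = [(torch.randn(64, IN_DIM), torch.randn(64, LPS_DIM)) for _ in range(N_COND)]
ep_models = []
for reverb, clean in subsets:
    model = make_dereverb_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(5):
        opt.zero_grad()
        nn.functional.mse_loss(model(reverb), clean).backward()
        opt.step()
    model.requires_grad_(False)                    # freeze before the EI stage
    ep_models.append(model)

# EI stage: train the fusion CNN on the stacked ensemble outputs.
fusion = FusionCNN()
opt = torch.optim.Adam(fusion.parameters(), lr=1e-4)
reverb_all = torch.cat([s[0] for s in subsets])
clean_all = torch.cat([s[1] for s in subsets])
stacked = torch.stack([m(reverb_all) for m in ep_models], dim=1)  # (B, N_COND, 257)
for _ in range(5):
    opt.zero_grad()
    nn.functional.mse_loss(fusion(stacked), clean_all).backward()
    opt.step()

# Online phase: every input frame is processed by all EP models, then fused.
test_feats = torch.randn(10, IN_DIM)
est_lps = fusion(torch.stack([m(test_feats) for m in ep_models], dim=1))
```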

4 Experiment and Analysis

We evaluated the proposed IDEA using the MHINT sentences [28], which contain 300 utterances pronounced by a native male Mandarin speaker and recorded in a reverberation-free environment at a sampling rate of 16 kHz. From this database, 250 utterances were selected as the clean training data, and the remaining 50 utterances were used as the testing data for the speech dereverberation task.

Three distinct reverberant rooms of different sizes (in meters) were simulated, denoted rooms 1, 2, and 3. The positions of the speakers and receivers were randomly initialized for each room and then fixed when generating the RIRs for the considered reverberation times, T60 (in seconds). For each T60, three different reverberant environments were provided for deriving RIRs to contaminate the clean training data and thereby form the clean–reverb training set. In addition, one RIR was generated for each of the six T60 values to deteriorate all testing utterances and form the testing set. All RIRs were produced with the image method using an RIR generator [31]. Finally, the reverberant utterances for the training and testing sets were prepared accordingly.
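The reverberant data are obtained by convolving the clean utterances with the simulated RIRs, i.e., Eq. (1) with n(t) = 0. The sketch below shows this step with a random exponentially decaying placeholder standing in for an image-method RIR from the generator of [31]; the 16-kHz rate follows the MHINT recordings, and the energy normalization is our own convenience choice, not the authors' procedure.

```python
import numpy as np
from scipy.signal import fftconvolve

FS = 16000

def reverberate(clean, rir):
    """Convolve a clean utterance with a room impulse response and trim to the
    original length (Eq. (1) with the additive noise set to zero)."""
    reverb = fftconvolve(clean, rir, mode="full")[:len(clean)]
    # Keep comparable energy so later comparisons are not dominated by scale.
    return reverb * np.sqrt(np.sum(clean ** 2) / (np.sum(reverb ** 2) + 1e-12))

clean = np.random.randn(3 * FS)                        # placeholder clean utterance
n_taps = int(0.6 * FS)
rir = np.random.randn(n_taps) * np.exp(-np.linspace(0, 8, n_taps))
rir /= np.max(np.abs(rir))                             # crude decaying "RIR"
reverb = reverberate(clean, rir)
```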

In this study, a speech utterance was first windowed into successive frames with a frame size of 32 ms and a shift of 16 ms. For each frame, a 257-dimensional LPS vector was derived through the STFT and further extended to 2827 dimensions (i.e., 11 concatenated frames) in terms of the context-expanded vector ŷ_m mentioned in Section 2. As a result, the sizes of the input and output layers of the DDAE-based dereverberation system shown in Fig. 1 were 2827 and 257, respectively. Four types of HDDAE-based architectures were implemented for comparison: (a) a single HDDAE model with three hidden layers (L = 3 in Eq. (2)) trained with the entire training dataset (denoted as "HDDAE_all"); (b) single HDDAE models with three hidden layers, each trained with the data of one specific reverberant condition (denoted as "HDDAE_n" with n = 1, 2, 3); (c) a single HDDAE model with six hidden layers (L = 6 in Eq. (2)) trained with the entire training dataset (denoted as "HDDAE_deep"); and (d) the proposed IDEA model (denoted as "IDEA"), with HDDAE_1, HDDAE_2, and HDDAE_3 in the EP stage and a CNN model with three hidden layers (K = 3 in Eq. (5); two convolutional layers, each containing 32 channels, and a fully connected layer with 2048 nodes) in the EI stage in Fig. 4. Notably, each hidden layer of the HDDAEs in (a)-(d) is composed of 2048 nodes.

The speech dereverberation performance was evaluated by (a) a quality test in terms of the perceptual evaluation of speech quality (PESQ) [32], (b) an intelligibility test in terms of the short-time objective intelligibility (STOI) measure [33], and (c) the speech distortion index (SDI) [34]. The score ranges of PESQ and STOI are [-0.5, 4.5] and [0, 1], respectively, and higher scores denote better sound quality and intelligibility, respectively. The SDI, on the other hand, measures the degree of speech distortion; a lower SDI indicates smaller speech distortion and thus better performance.
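PESQ and STOI are typically computed with their standard reference implementations. For the SDI, one common time-domain formulation is the energy of the estimation error normalized by the clean-speech energy; this is the definition we assume in the sketch below, which may differ in detail from the computation used in [34].

```python
import numpy as np

def speech_distortion_index(clean, estimate):
    """Assumed SDI: energy of the difference between the processed signal and
    the clean reference, normalized by the clean-speech energy (lower is better)."""
    n = min(len(clean), len(estimate))
    clean, estimate = clean[:n], estimate[:n]
    return np.sum((estimate - clean) ** 2) / (np.sum(clean ** 2) + 1e-12)

# Example with dummy signals: a slightly distorted copy yields a small SDI.
s = np.random.randn(16000)
s_hat = s + 0.1 * np.random.randn(16000)
print(speech_distortion_index(s, s_hat))
```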

Fig. 5 shows the speech spectrograms of the clean signal, the reverberant signal, and the signals processed by the HDDAE baseline and by IDEA. From the figure, the spectrogram produced by IDEA presents clearer spectral characteristics than that produced by the HDDAE; note in particular the regions marked by the white blocks. The harmonic structures of the high-frequency components are also clearer.

Figure 5: Spectrogram comparison of clean, reverberant, HDDAE-processed, and IDEA-processed speech.

We first list the PESQ scores of HDDAE_1, HDDAE_2, and HDDAE_3 evaluated in either the matched or mismatched testing reverberant conditions in Table 1. The results of the baseline (i.e., no dereverberation process was conducted) and HDDAE_all are also listed in the table for comparison. In addition, the averaged PESQ scores (Avg.) for all methods over all testing environments are shown in the last column of the table. In the table, for HDDAE_1, HDDAE_2, and HDDAE_3, the best PESQ score in each testing condition is achieved by the HDDAE trained on the matched condition. In addition, the quality of utterances degrades significantly for these dereverberation systems in the mismatched environments, in which the PESQ scores can even be lower than those of the baseline (unprocessed input). These observations indicate that the DDAE-based dereverberation system can effectively enhance speech quality when the property of the reverberation is known beforehand, but the performance may degrade dramatically in new environments, where the training and testing conditions differ. Meanwhile, HDDAE_all provides the best averaged PESQ score. This result indicates that the model trained on the diverse training set is more robust to varying testing environments.

Testing          Cond. 1   Cond. 2   Cond. 3   Avg.
Reverberation    2.0666    1.5534    1.1839    1.5661
HDDAE_1          2.4830    1.4784    1.0755    1.6373
HDDAE_2          1.6744    2.2539    1.2274    1.7072
HDDAE_3          1.3696    1.6525    2.1021    1.7217
HDDAE_all        2.4702    2.3064    2.1466    2.2838
Table 1: PESQ scores of HDDAE_1, HDDAE_2, HDDAE_3, and HDDAE_all tested in either the matched or mismatched reverberant conditions (Cond. n denotes the reverberant condition used to train HDDAE_n; Avg. is computed over all testing environments).

Table 2 lists the averaged results of PESQ, STOI, and SDI for the unprocessed speech, HDDAE_all, HDDAE_deep, and IDEA over all the testing utterances. From the table, we find that all evaluation metrics of the DDAE-based approaches outperform those of the unprocessed reverberant speech. These results indicate the effectiveness of the HDDAE-based dereverberation systems. In addition, the better PESQ, STOI, and SDI scores of HDDAE_all compared with HDDAE_deep indicate that additional hidden layers do not necessarily increase the system performance in this task. On the other hand, IDEA (also with six hidden layers in total) yields the highest sound quality and intelligibility and the lowest signal distortion, confirming the effectiveness of the proposed IDEA for the dereverberation task.

        Reverberant   HDDAE_all   HDDAE_deep   IDEA
PESQ    1.5611        2.2838      2.2672       2.3808
STOI    0.6692        0.8598      0.8527       0.8691
SDI     8.0304        1.0520      1.5393       0.8916
Table 2: Averaged results over all testing data for the unprocessed reverberant speech and the HDDAE_all-, HDDAE_deep-, and IDEA-processed utterances.

To further analyze the performance of the proposed algorithm, we compare the PESQ scores of IDEA with those of HDDAE_deep in both matched and mismatched testing environments; the results are listed in Tables 3 and 4, respectively (please note that the testing data in Table 4 cover T60 values that were not seen in the training data). From these tables, we observe that the PESQ scores obtained by IDEA and HDDAE_deep consistently decrease with increasing T60, revealing that the dereverberation performance is negatively correlated with the T60 value. In addition, IDEA outperforms HDDAE_deep for all testing T60s, confirming that ensemble modeling can achieve better results than a single model trained with the same data and the same number of layers.

Testing       Matched cond. 1   Matched cond. 2   Matched cond. 3
HDDAE_deep    2.4349            2.2990            2.1408
IDEA          2.5669            2.4249            2.2479
Table 3: PESQ scores of HDDAE_deep and IDEA evaluated in the matched testing conditions (T60 values seen during training).

Testing       Mismatched cond. 1   Mismatched cond. 2   Mismatched cond. 3
HDDAE_deep    2.3575               2.2309               2.1399
IDEA          2.4676               2.3323               2.2452
Table 4: PESQ scores of HDDAE_deep and IDEA evaluated in the mismatched testing conditions (T60 values not seen during training).

5 Conclusion

From the experimental results, we first noted that the single-HDDAE-based systems could achieve good dereverberation performance in matched conditions, but the performance degraded significantly when the systems were tested in mismatched conditions, showing that HDDAE models trained to address specific reverberation conditions may have limited generalization capabilities. In addition, HDDAE_all, which was trained using all the training data, outperformed the individual condition-specific HDDAE models in terms of the PESQ score averaged over all testing environments. Moreover, compared with HDDAE_all, the proposed IDEA provided better results, confirming that by collecting information from multiple environments to train matched HDDAE models and then integrating the outputs of these models, diverse reverberation conditions can be covered and high dereverberation performance can be achieved.

6 Acknowledgment

This research was supported in part by the Ministry of Science and Technology of Taiwan (MOST 107-2633-E-002-001), National Taiwan University, Intel Corporation, and Delta Electronics.

References

  • [1] P. Naylor and N. D. Gaubitch, Speech dereverberation. Springer Science & Business Media, 2010.
  • [2] X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in Proc. ICASSP, pp. 1759–1763, 2014.
  • [3] K. Kinoshita et al., “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in Proc. WASPAA, pp. 1–4, 2013.
  • [4] X. Zhao, Y. Wang, and D. Wang, “Robust speaker identification in noisy and reverberant conditions,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 4, pp. 836–845, 2014.
  • [5] S. O. Sadjadi and J. H. Hansen, “Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions,” in Proc. ICASSP, pp. 5448–5451, 2011.
  • [6] K. Han et al., “Learning spectral mapping for speech dereverberation and denoising,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982–992, 2015.
  • [7] K. Kokkinakis, O. Hazrati, and P. C. Loizou, “A channel-selection criterion for suppressing reverberation in cochlear implants,” The Journal of the Acoustical Society of America, vol. 129, no. 5, pp. 3221–3232, 2011.
  • [8] N. Roman and J. Woodruff, “Speech intelligibility in reverberation with ideal binary masking: Effects of early reflections and signal-to-noise ratio threshold,” The Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1707–1717, 2013.
  • [9] J. Benesty, M. M. Sondhi, and Y. Huang, Springer handbook of speech processing: Ch. 4.6. Springer Science & Business Media, 2007.
  • [10] B. W. Gillespie, H. S. Malvar, and D. A. Florêncio, “Speech dereverberation via maximum-kurtosis subband adaptive filtering,” in Proc. ICASSP, pp. 3701–3704, 2001.
  • [11] Y. Huang, J. Benesty, and J. Chen, “Speech acquisition and enhancement in a reverberant, cocktail-party-like environment,” in Proc. ICASSP, pp. V–V, 2006.
  • [12] S. C. Douglas and X. Sun, “Convolutive blind separation of speech mixtures using the natural gradient,” Speech Communication, vol. 39, no. 1, pp. 65–78, 2003.
  • [13] J. Li, R. Xia, Q. Fang, A. Li, and Y. Yan, “Speech intelligibility enhancement in noisy reverberant conditions,” in Proc. ISCSLP, pp. 1–5, 2016.
  • [14] D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral processing,” in Proc. ICASSP, pp. 977–980, 1991.
  • [15] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
  • [16] H. Kameoka, T. Nakatani, and T. Yoshioka, “Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms,” in Proc. ICASSP, pp. 45–48, 2009.
  • [17] N. Mohanan, R. Velmurugan, and P. Rao, “Speech dereverberation using nmf with regularized room impulse response,” in Proc. ICASSP, pp. 4955–4959, 2017.
  • [18] M. Wu and D. Wang, “A two-stage algorithm for one-microphone reverberant speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 774–784, 2006.
  • [19] I. Kodrasi, T. Gerkmann, and S. Doclo, “Frequency-domain single-channel inverse filtering for speech dereverberation: Theory and practice,” in Proc. ICASSP, pp. 5177–5181, 2014.
  • [20] T. Hikichi, M. Delcroix, and M. Miyoshi, “Inverse filtering for speech dereverberation less sensitive to noise and room transfer function fluctuations,” EURASIP Journal on Applied Signal Processing, vol. 2007, no. 1, pp. 62–62, 2007.
  • [21] X. Xiao et al., “Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 4, 2016.
  • [22] K. Han, Y. Wang, and D. Wang, “Learning spectral mapping for speech dereverberation,” in Proc. ICASSP, pp. 4628–4632, 2014.
  • [23] B. Wu et al., “A reverberation-time-aware approach to speech dereverberation based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 1, pp. 102–111, 2017.
  • [24] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. INTERSPEECH, pp. 436–440, 2013.
  • [25] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Ensemble modeling of denoising autoencoder for speech spectrum restoration,” in Proc. INTERSPEECH, 2014.
  • [26] Y. Tsao, P. Lin, T.-y. Hu, and X. Lu, “Ensemble environment modeling using affine transform group,” Speech Communication, vol. 68, pp. 55–68, 2015.
  • [27] Y. Tsao and C.-H. Lee, “An ensemble speaker and speaking environment modeling approach to robust speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 5, pp. 1025–1037, 2009.
  • [28] L. L. Wong et al., “Development of the Mandarin hearing in noise test (MHINT),” Ear and Hearing, vol. 28, no. 2, pp. 70S–74S, 2007.
  • [29] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” CoRR, vol. abs/1505.00387, 2015.
  • [30] S.-W. Fu, Y. Tsao, and X. Lu, “SNR-aware convolutional neural network modeling for speech enhancement,” in Proc. INTERSPEECH, pp. 3768–3772, 2016.
  • [31] E. A. Habets, “Room impulse response generator,” Technische Universiteit Eindhoven, Tech. Rep, vol. 2, no. 2.4, p. 1, 2006.
  • [32] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, pp. 749–752, 2001.
  • [33] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  • [34] J. Chen, J. Benesty, Y. Huang, and E. Diethorn, “Fundamentals of noise reduction,” in Springer Handbook of Speech Processing, Ch. 43, Springer, 2008.