Impulsive Noise Detection for Intelligibility and Quality Improvement of Speech Enhancement Methods Applied in Time-Domain

10/07/2019, by C. Medina et al.

This letter introduces a novel speech enhancement method in the Hilbert-Huang Transform domain to mitigate the effects of acoustic impulsive noises. The estimation and selection of noise components are based on the impulsiveness index of the decomposition modes. Speech enhancement experiments are conducted considering five acoustic noises with different impulsiveness indices and non-stationarity degrees under various signal-to-noise ratios. Three speech enhancement algorithms, operating in the spectral and time domains, are adopted as baselines in the evaluation analysis. The proposed solution achieves the best results in terms of objective quality measures and speech intelligibility rates similar to those of the competing methods.


1 Introduction

Impulsive background noise conditions may severely impact the accuracy of acoustic classification systems and applications. Impulsive noises (slamming doors, industrial machinery, falling objects) are encountered in real environments. They are commonly characterized by almost instantaneous sharp sounds with high acoustic energy and wide spectral bandwidth. Impulsive sample sequences are generally defined in the literature by heavy-tailed distributions parameterized by their impulsiveness degree. Due to this impulsive nature, a key challenge in this research area is the accurate estimation of noise components, especially from real acoustic noisy signals.

In recent years, many studies have been dedicated to mitigating the effects of non-stationary acoustic noise in different domains [1, 2, 3]. In particular, speech enhancement solutions have been applied in the Hilbert-Huang Transform (HHT) domain. These techniques adopt the Empirical Mode Decomposition (EMD) [4] or one of its variations to analyze the noisy speech signal. This powerful decomposition has also become attractive for processing and analyzing other signals, e.g., electroencephalogram signals [5] and multimodal sensing data [6]. HHT-based approaches have achieved interesting speech quality improvements in noisy scenarios [3, 7, 8]. Impulsive noises may be considered a particular kind of non-stationary source.

This letter introduces an HHT-domain method to enhance speech signals corrupted by impulsive acoustic noises. The proposed HHT-α solution applies the Ensemble EMD (EEMD) [9] to decompose a target noisy signal into a series of intrinsic mode functions (IMFs). The noise components of each IMF are identified and selected based on the impulsiveness index [10] on a frame-by-frame basis. The speech signal is then reconstructed excluding the components that are mainly composed of noise. In HHT-α, no assumption is made about the speech and noise distributions.

Several experiments are conducted to examine the effectiveness of the proposed solution. HHT-α is evaluated considering three quality and two intelligibility objective measures that present high correlation with subjective listening tests. Five real acoustic noises with different impulsiveness degrees are used to corrupt speech utterances. Five values of signal-to-noise ratio (SNR) are considered in this work: -10 dB, -5 dB, 0 dB, 5 dB, and 10 dB. Three speech enhancement techniques are adopted as baselines: the spectral Wiener filtering with unbiased minimum mean-square error estimator (UMMSE) [1], and the time-domain EMD-based filtering (EMDF) [8] and EMD-Hurst-based (EMDH) [3] approaches. Experiments demonstrate that the HHT-α method achieves interesting speech quality results, especially for highly impulsive noises. HHT-α also shows an average intelligibility rate similar to that of the competing techniques.

2 HHT-α: Speech Enhancement Scheme

The HHT-α speech enhancement method includes three main steps: noisy signal decomposition, estimation and selection of noise components, and speech signal reconstruction. Fig. 1 illustrates the block diagram of the proposed method.

2.1 Noisy Signal Decomposition

HHT [4] is a nonlinear adaptive approach that locally analyzes a signal x(t) to define a local high-frequency part, also called the detail d(t), and a local trend m(t), such that x(t) = d(t) + m(t). An oscillatory IMF is derived from the detail function d(t). The high- versus low-frequency separation procedure is iteratively repeated over the residual m(t), leading to a new detail and a new residual. Thus, the decomposition leads to a series of IMFs and a residual, such that x(t) = IMF_1(t) + ... + IMF_M(t) + r_M(t), where IMF_i(t) is the i-th mode of x(t) and r_M(t) is the residual. As opposed to other kinds of signal decomposition, the HHT does not require a set of basis functions. In fact, HHT results in fully data-driven decomposition modes and does not require stationarity of the target signal.

The EEMD was introduced in [9] to overcome the mode mixing problem that generally occurs in the original EMD. The key idea is to average the IMFs obtained after corrupting the original signal with several realizations of white Gaussian noise (WGN). Thus, the EEMD algorithm can be described as follows (a short code sketch is given after the list):

  1. Generate x_j(t) = x(t) + w_j(t), where w_j(t), j = 1, ..., J, are different realizations of WGN;

  2. Apply EMD to decompose each x_j(t), j = 1, ..., J, into a series of components IMF_{i,j}(t), i = 1, ..., M;

  3. Assign the i-th mode of x(t) as the ensemble average IMF_i(t) = (1/J) [IMF_{i,1}(t) + ... + IMF_{i,J}(t)];

  4. Finally, x(t) = IMF_1(t) + ... + IMF_M(t) + r_M(t), where r_M(t) is the residual.
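
The ensemble averaging above can be illustrated with a short sketch. This is a minimal illustration under assumptions, not the authors' implementation: it relies on an external EMD routine (here the EMD class of the PyEMD package, an assumption) and fixes the maximum number of modes so that all realizations can be averaged mode by mode; the 50 realizations and the 30 dB noise level anticipate the settings reported in Section 3.

```python
import numpy as np
from PyEMD import EMD  # assumption: the PyEMD ("EMD-signal") package is available


def eemd_sketch(x, num_realizations=50, noise_snr_db=30.0, max_imf=10, seed=0):
    """Ensemble EMD sketch: average the IMFs of several noise-corrupted copies of x."""
    rng = np.random.default_rng(seed)
    # WGN whose power sits noise_snr_db below the signal power (used in step 1)
    noise_std = np.sqrt(np.mean(x ** 2) / 10.0 ** (noise_snr_db / 10.0))
    emd = EMD()
    acc = np.zeros((max_imf, len(x)))
    counts = np.zeros(max_imf)
    for _ in range(num_realizations):
        xj = x + noise_std * rng.standard_normal(len(x))   # step 1: corrupt the signal
        imfs = emd.emd(xj, max_imf=max_imf)                 # step 2: decompose each copy
        m = min(max_imf, imfs.shape[0])                     # realizations may yield fewer modes
        acc[:m] += imfs[:m]
        counts[:m] += 1
    counts[counts == 0] = 1
    return acc / counts[:, None]                            # step 3: ensemble average


# Step 4: the residual is whatever remains after summing the averaged modes,
# r(t) = x(t) - sum_i IMF_i(t).
```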

2.2 Estimation and Selection of Noise Components

In the literature, impulsive signals and noises are generally defined by a sequence of random samples with a symmetric heavy-tailed distribution, i.e., P(|X| > x) ≈ c·x^(-α), where c is a positive constant and α is the impulsiveness index. The exponent α is also related to the α-stable distribution, in which it is described as the characteristic exponent [10].
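
As an illustration of how the impulsiveness index can be gauged from data, the sketch below fits the slope of the empirical tail on a log-log scale, following the heavy-tail law above. This crude fit is only a stand-in under the stated power-law assumption; the letter itself adopts McCulloch's quantile estimator [12, 13], and the function tail_alpha and its tail_fraction parameter are illustrative choices, not the authors'.

```python
import numpy as np


def tail_alpha(x, tail_fraction=0.05):
    """Rough impulsiveness estimate: slope of log P(|X| > x) versus log x over the upper tail.

    Illustrative stand-in for the McCulloch quantile estimator cited in the letter;
    it assumes the tail behaves as P(|X| > x) ~ c * x**(-alpha).
    """
    a = np.sort(np.abs(np.asarray(x, dtype=float)))
    n = len(a)
    k = max(int(tail_fraction * n), 10)              # number of tail samples used in the fit
    xs = a[-k:]                                      # largest magnitudes
    probs = 1.0 - (np.arange(n - k, n) + 0.5) / n    # empirical survival probabilities
    slope, _ = np.polyfit(np.log(xs), np.log(probs), 1)
    return -slope                                    # small alpha -> heavy tail / impulsive


# Quick check with synthetic data:
rng = np.random.default_rng(0)
print(tail_alpha(rng.standard_normal(100_000)))   # large value: Gaussian tails are lighter than any power law
print(tail_alpha(rng.standard_cauchy(100_000)))   # close to 1, the Cauchy characteristic exponent
```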

In [11], the authors showed that for α-stable noises the EMD behaves like a quasi-dyadic filterbank. The speech signals investigated in this work are impulsive and present heavy tails, whereas the acoustic noises commonly encountered in real urban scenarios generally present higher α values [11]. Thus, in this letter, the EMD is applied to highlight the noise impulsiveness of the corrupted speech signal. The estimator proposed by McCulloch [12, 13] is adopted here for the estimation of the α index.

Figure 1: Block diagram of the HHT-α speech enhancement method.

Fig. 2(a)-(c) show the spectrograms of a clean speech signal collected from the TIMIT database [14], an impulsive Sliding Door Closing noise, and the corresponding corrupted signal. Note from Fig. 2(b) that the noise energy is mostly concentrated at low frequencies and that the spectrogram presents sharp wideband components at a few time instants. Fig. 2(d) presents the average values of the impulsiveness index α estimated from the IMFs of the clean speech, impulsive noise, and noisy speech signals. It can be seen that the α values of all signals increase with the mode index. For the highest IMF indexes, the acoustic noise and the noisy speech signal have similar α values, which are greater than those obtained from the clean speech signal. This indicates that these IMFs are more noise-like, which is consistent with previous works (see, for example, [3]).

Similar behavior can be observed in Fig. 3, where the α values estimated from the different IMFs are shown for the other four impulsive noises: Train, Horn, Babble, and Helicopter. Once again, the α values indicate that IMFs with high indexes are mostly composed of noise. Note from Fig. 2(d) and Fig. 3 that, for intermediate IMF indexes, the α values of the noisy signal generally lie between those estimated from the noise and from the clean speech signal. This demonstrates that the impulsiveness index is an appropriate criterion to select the IMFs with more speech-like characteristics and reject the noise-like components.


Figure 2: Spectrograms of (a) clean speech, (b) Sliding Door Closing noise, and (c) noisy speech. (d) Average values of α estimated from the IMFs.
Figure 3: Average values of α estimated from the IMFs of the impulsive noise sources: (a) Train, (b) Horn, (c) Babble, and (d) Helicopter.
Figure 4: Spectrograms and INS obtained for 2.4-second segments of the acoustic impulsive noises: (a) Sliding Door Closing, (b) Train, (c) Horn, (d) Babble, and (e) Helicopter. Dashed lines indicate the stationarity test threshold.

The selection of noise components is performed as follows. After the decomposition of the target noisy signal with the EEMD algorithm, each mode IMF_i(t) is segmented into a set of overlapping short-time frames of N samples each. In this proposal, the selection of noise components is based on the α values of each windowed IMF. For each frame, the impulsiveness index is estimated from every decomposition mode, leading to a set of values α_1, ..., α_M. The next step is to determine the index k of the last IMF whose impulsiveness index is below a given threshold α_th, i.e., k = max{i : α_i < α_th}. IMFs whose α values exceed the threshold are considered noise-like components.
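
A minimal sketch of this frame-wise selection is shown below. It builds on the earlier sketches: it assumes the IMFs are already available (e.g., from the EEMD sketch), it takes an impulsiveness estimator as a parameter (for instance the illustrative tail_alpha above, standing in for the McCulloch estimator), and the frame length, step size, and threshold values are placeholders rather than the letter's settings.

```python
import numpy as np


def last_speech_mode_per_frame(imfs, alpha_fn, frame_len=2048, step=1024, alpha_th=1.9):
    """For each frame, return the number k of IMFs kept as speech-like.

    imfs: array of shape (num_modes, num_samples); alpha_fn: impulsiveness
    estimator applied to each windowed mode. k is the index of the last IMF
    whose impulsiveness index stays below alpha_th; higher-index IMFs are
    treated as noise-like. frame_len, step, and alpha_th are placeholder values.
    """
    num_modes, num_samples = imfs.shape
    ks = []
    for start in range(0, num_samples - frame_len + 1, step):
        frame = imfs[:, start:start + frame_len]
        alphas = np.array([alpha_fn(frame[i]) for i in range(num_modes)])
        below = np.nonzero(alphas < alpha_th)[0]
        ks.append(below[-1] + 1 if below.size else 0)
    return ks
```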

2.3 Speech Signal Reconstruction

Let ŝ(t) denote the enhanced speech signal. Each frame ℓ is reconstructed as ŝ_ℓ(t) = w(t) [IMF_{1,ℓ}(t) + ... + IMF_{k,ℓ}(t)], where k is the index of the last mode considered as speech and w(t) is a window function used to avoid discontinuities in the reconstructed signal (for more details see [3]). Finally, ŝ(t) is reconstructed by overlapping and adding all frames, ŝ(t) = (1/C) Σ_ℓ ŝ_ℓ(t), where C is a normalization factor that depends on the window function w(t), the frame length N, and the step size.
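
The overlap-add step can be sketched as follows. This is an illustrative reconstruction under assumed choices (a Hann window, placeholder frame length and step size, and normalization by the summed window), not the authors' exact configuration; it reuses the per-frame mode counts produced by last_speech_mode_per_frame above.

```python
import numpy as np


def reconstruct(imfs, ks, frame_len=2048, step=1024):
    """Overlap-add reconstruction keeping, in each frame, only the first k IMFs.

    imfs: (num_modes, num_samples) array; ks: one entry per frame, as produced by
    last_speech_mode_per_frame. The Hann window and the normalization are
    illustrative choices, not necessarily those of the letter.
    """
    num_modes, num_samples = imfs.shape
    window = np.hanning(frame_len)
    out = np.zeros(num_samples)
    norm = np.zeros(num_samples)
    starts = range(0, num_samples - frame_len + 1, step)
    for k, start in zip(ks, starts):
        # sum of the modes kept as speech-like in this frame, tapered by the window
        frame = window * imfs[:k, start:start + frame_len].sum(axis=0)
        out[start:start + frame_len] += frame
        norm[start:start + frame_len] += window
    return out / np.maximum(norm, 1e-12)   # compensate for window overlap
```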

3 Evaluation Experiments

Extensive speech enhancement experiments are conducted with a subset of speech segments from the TIMIT speech database [14]. The speech utterances have a sampling rate of 16 kHz. Five impulsive non-stationary acoustic noises are used to corrupt the speech utterances: Sliding Door Closing, Train, Horn, and Helicopter are selected from Freesound.org (https://freesound.org), while Babble is obtained from the RSG-10 database [15]. These files are also available at lasp.ime.eb.br.

Fig. 4 presents the spectrogram and the index of non-stationarity (INS) [16] obtained from segments of the five acoustic noises. The INS values are shown here to objectively examine the non-stationarity of the impulsive noises. The time scale Th/T is the ratio between the length Th of the short-time spectral analysis window and the total duration T (2.4 seconds) of the noise sample sequences. For each window length Th, a threshold γ is defined to guarantee the stationarity assumption with a confidence degree of 95%. Thus, if INS ≤ γ, the noise is considered stationary; otherwise, it is designated as non-stationary. The γ values are also exhibited in Fig. 4.

Sliding Door Closing, Train, and Babble noises are here classified as highly non-stationary, since their INS reaches values much greater than the stationarity threshold. Horn noise presents intermediate INS results and is thus defined as moderately non-stationary. Helicopter noise is considered stationary, since its INS values are quite similar to the stationarity threshold for all time scales.

The performance of the proposed and baseline methods is examined using five objective measures. Perceptual evaluation of speech quality (PESQ), log-likelihood ratio (LLR), and frequency-weighted segmental SNR (fwSNRseg) [17] are used to evaluate the enhanced speech signals in terms of quality. These measures present high correlation with subjective overall quality and signal distortion results [17]. The coherence speech intelligibility index (CSII) [18] and the short-time objective intelligibility measure (STOI) [19] are adopted for speech intelligibility assessment. Intelligibility prediction scores are obtained by applying a mapping function to each objective measure, with distinct mapping parameters adopted for the CSII and the STOI.
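
A common choice for such a mapping, for example in the STOI literature, is a logistic function of the raw score; a minimal sketch with that assumed form is given below. The constants a and b are measure-specific, and the values used in the example are illustrative placeholders, not the ones adopted in the letter.

```python
import numpy as np


def intelligibility_percent(score, a, b):
    """Map a raw objective score to a predicted intelligibility rate in percent.

    Logistic form commonly used with STOI/CSII-type measures; a and b are
    measure-specific constants (placeholders here, not the letter's values).
    """
    return 100.0 / (1.0 + np.exp(a * score + b))


# Hypothetical example: map a STOI score of 0.75 with placeholder constants.
print(intelligibility_percent(0.75, a=-17.49, b=9.69))
```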

For the HHT-α method, the EEMD algorithm is applied considering 50 different realizations of WGN with an SNR of 30 dB to obtain 10 IMFs. The decision threshold α_th is crucial to determine the components to be removed from each corrupted speech frame. In this letter, an adaptive threshold is introduced as a function of the impulsiveness index estimated from the corrupted speech frame, bounded below by a minimum allowed value α_min. A scaling factor adjusts the amount of noise components to be removed, while α_min is used to avoid excessive component removal in speech-dominant segments of the signal. The selection of noise components is performed over overlapping short-time frames.
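
The description above does not pin down a single formula; the sketch below encodes one plausible reading of it (a scaled version of the frame's estimated index, floored at α_min). Both the scaling factor lam and the floor alpha_min are hypothetical placeholder parameters, not the letter's values.

```python
def adaptive_threshold(alpha_frame, lam=0.95, alpha_min=1.2):
    """Hypothetical adaptive threshold: scale the impulsiveness index estimated
    from the corrupted frame and floor it at alpha_min.

    One plausible reading of the description in the letter; lam and alpha_min
    are placeholder values.
    """
    return max(lam * alpha_frame, alpha_min)
```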

Tab. 1 shows the PESQ results obtained with the proposed and baseline speech enhancement techniques for the different impulsive acoustic noises and SNR values. Note that HHT-α outperforms the competing time-domain approaches in most of the noisy scenarios. Particularly for the Sliding Door Closing noise, which presents the lowest α value, HHT-α achieves the highest average PESQ result, surpassing even the spectral UMMSE method. On average, the overall PESQ obtained with the proposed solution is 1.93, which is 0.04 and 0.10 higher than EMDH and EMDF, respectively. The spectral UMMSE achieves an overall PESQ of 2.11.

Table 1: PESQ results obtained with the proposed (HHT-α) and baseline (UMMSE, EMDF, EMDH) methods for the Sliding Door Closing, Train, Horn, Babble, and Helicopter noises under different SNR values.

Fig. 5 exhibits the average fwSNRseg improvement obtained by the proposed and baseline methods for the five noises. Once again, HHT-α achieves the best results for the highly impulsive Sliding Door Closing noise. It is interesting to mention that, for this noise, UMMSE does not improve the speech signals in terms of fwSNRseg. For the other noise sources, HHT-α outperforms the time-domain EMDF and EMDH techniques. Moreover, the fwSNRseg gain of HHT-α is slightly superior to that obtained with UMMSE for the Babble and Helicopter noises.

Figure 5: Average fwSNRseg gain obtained for different noise sources.
Figure 6: Average LLR results obtained for different noise sources.
Table 2: STOI intelligibility rate predictions (%) obtained with the proposed (HHT-α) and baseline (UMMSE, EMDF, EMDH) methods for the Sliding Door Closing, Train, Horn, Babble, and Helicopter noises under different SNR values.
Figure 7: CSII intelligibility prediction rates obtained for (a) Sliding Door Closing, (b) Train, (c) Horn, (d) Babble, and (e) Helicopter acoustic noises.

Fig. 6 depicts the average LLR values obtained for each impulsive noise. Note that the proposed solution again achieves the highest LLR for four noise sources, the only exception being the Horn noise, for which HHT-α still outperforms the spectral UMMSE. The overall LLR obtained with HHT-α is 0.73, which is 0.04, 0.05, and 0.09 higher than the results achieved with EMDH, EMDF, and UMMSE, respectively.

Tab. 2 presents the intelligibility prediction rates obtained with STOI. Note that HHT-α and the competing solutions achieve quite close results, especially for the higher SNR values. On average, intelligibility prediction rates vary by at most 2.2 percentage points, i.e., from 57.0% with HHT-α to 59.2% with EMDH. This similar behavior in terms of speech intelligibility is reinforced by the CSII results depicted in Fig. 7. Once again, the proposed and baseline algorithms show similar speech intelligibility prediction values.

4 Conclusion

This letter introduced the HHT-α speech enhancement technique based on the Hilbert-Huang Transform. The EEMD algorithm is used to decompose the noisy speech signal in the time domain. The estimation and selection of noise components are performed frame-by-frame based on the impulsiveness index α of the decomposition modes. The enhanced version of the speech signal is finally reconstructed using the IMFs that are mainly composed of speech. Several experiments were conducted using five non-stationary acoustic noises with different values of the impulsiveness index. Particularly for the most impulsive noise, the proposed solution outperformed the three competing approaches in terms of the PESQ, fwSNRseg, and LLR objective quality measures. In terms of speech intelligibility, HHT-α is similar to the other state-of-the-art methods for all the impulsive noise sources.

Acknowledgment

R. Coelho is partially supported by the National Council for Scientific and Technological Development (CNPq) 307866/2015 and Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) 203075/2016 research grants.

References

  • [1] T. Gerkmann and R. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, pp. 1383–1393, 2012.
  • [2] R. Tavares and R. Coelho, “Speech enhancement with nonstationary acoustic noise detection in time domain,” IEEE Signal Processing Letters, vol. 23, no. 1, pp. 6–10, 2016.
  • [3] L. Zão, R. Coelho, and P. Flandrin, “Speech enhancement with EMD and Hurst-based mode selection,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 5, pp. 899–911, 2014.
  • [4] N. Huang, Z. Shen, S. Long, M. Wu, H. Shih, Q. Zheng, N. Yen, C. Tung, and H. Liu, “The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis,” Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1971, pp. 903–995, 1998.
  • [5] V. Bajaj, S. Taran, E. Tanyildizi, and A. Sengur, “Robust approach based on convolutional neural networks for identification of focal EEG signals,” IEEE Sensors Letters, vol. 3, no. 5, pp. 1–4, May 2019.
  • [6] A. Shamsan, W. Dan, and C. Cheng, “Multimodal data fusion using multivariate empirical mode decomposition for automatic process monitoring,” IEEE Sensors Letters, vol. 3, no. 1, pp. 1–4, January 2019.
  • [7] R. Coelho and L. Zão, “Empirical mode decomposition theory applied to speech enhancement,” in Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering and Processing, R. Coelho, V. Nascimento, R. Queiroz, J. Romano, and C. Cavalcante, Eds.   Boca Raton, Florida: CRC Press, 2015.
  • [8] N. Chatlani and J. Soraghan, “EMD-based filtering (EMDF) of low frequency noise for speech enhancement,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, pp. 1158–1166, May 2012.
  • [9] M. E. Torres, M. A. Colominas, G. Schlotthauer, and P. Flandrin, “A complete ensemble empirical mode decomposition with adaptive noise,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
  • [10] C. Nikias and M. Shao, Signal Processing with Alpha-Stable Distributions and Applications.   Wiley, 1995.
  • [11] A. Komaty, A. Boudraa, J. Nolan, and D. Dare, “On the behavior of EMD and MEMD in presence of symmetric alpha-stable noise,” IEEE Signal Processing Letters, vol. 22, no. 7, pp. 818–822, 2015.
  • [12] J. H. McCulloch, “Simple consistent estimators of stable distribution parameters,” Communications in Statistics - Simulation and Computation, vol. 15, no. 5, pp. 1109–1136, 1986.
  • [13] ——, “Maximum likelihood estimation of symmetric stable parameters,” Ohio State University, Department of Economics, Tech. Rep., 1998.
  • [14] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, Philadelphia, PA, USA, 1993.
  • [15] H. Steeneken and F. Geurtsen, “Description of the RSG-10 noise database,” Report IZF 1988-3, 1988.
  • [16] P. Borgnat, P. Flandrin, P. Honeine, C. Richard, and J. Xiao, “Testing stationarity with surrogates: A time-frequency approach,” IEEE Transactions on Signal Processing, vol. 58, no. 7, pp. 3459–3470, July 2010.
  • [17] Y. Hu and P. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.
  • [18] J. Kates and K. Arehart, “Coherence and the speech intelligibility index,” Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 2224–2237, 2005.
  • [19] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, pp. 2125–2136, 2011.