On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering

10/31/2018, by Nikolaos Dionelis et al., Imperial College London

This report focuses on algorithms that perform single-channel speech enhancement. The author of this report uses modulation-domain Kalman filtering algorithms for speech enhancement, i.e. noise suppression and dereverberation, in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be applied to both noise and late reverberation suppression; in [2], [1], [3] and [4], various model-based speech enhancement algorithms that perform modulation-domain Kalman filtering are designed, implemented and tested. The model-based enhancement algorithm in [2] estimates and tracks the speech phase. The short-time-Fourier-transform-based enhancement algorithm in [5] uses the active speech level estimator presented in [6]. This report describes how different algorithms perform speech enhancement; the algorithms discussed are addressed to researchers interested in monaural speech enhancement. The algorithms are composed of different processing blocks and techniques [7]; understanding the implementation choices made during the system design is important because this provides insights that can assist the development of new algorithms.

Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter, minimum mean squared error estimation.

I. Introduction

Technology is evolving at a tremendous pace and the demand for speech enhancement systems is evident. The need for speech enhancement for human listeners is apparent given the increase in the number of smartphone users. Speech enhancement for listeners is also needed in hearing aids. The requirements for speech enhancement for human listeners are not the same as for automatic speech recognition (ASR); nevertheless, the algorithms that perform speech enhancement for human listeners can also be used for ASR. Examples that support the latter argument can be found in the REVERB challenge [8], [9]. In [10], speech enhancement is presented as a front-end to ASR. Nowadays, many technology-based applications need speech enhancement as a front-end system [10]. For example, ASR algorithms for robot audition can benefit from the use of speech enhancement as a front-end system. Smartphone applications also need speech enhancement as a front-end system. Digital assistants that answer users' questions, such as “Google Home” [11] and Amazon’s “Alexa”, can also benefit from front-end speech enhancement. Front-end adaptive dereverberation has been used in [11] and [12].

Single-channel speech enhancement is different from multi-channel speech enhancement. Multi-channel speech enhancement can take advantage of the correlation between the different microphone signals and of the spatial cues that are related to the configuration of the microphones [13] [14]. Multi-channel speech enhancement can be performed using beamforming followed by single-channel speech enhancement [15] [12]. Beamforming is utilised for spatial discrimination and is usually followed by single-channel speech enhancement. The problem of single-channel (monaural) speech enhancement continues to be of significant interest to the speech community mainly because multi-channel enhancement can be performed with a beamformer followed by single-channel enhancement. Considering the enormous increase in the number of smartphone users, multi-channel (and thus single-channel) enhancement is needed as a front-end in many applications.

The two main causes of speech degradation are additive noise and room reverberation, as described in, for example, the ACE challenge [16]. Speech recordings are degraded by noise and reverberation when captured using a near-field or far-field distant microphone within a confined acoustic space. Noise and reverberation have a detrimental impact on speech quality and speech intelligibility [1] [12]. Providing robustness to speech systems remains a challenge due to noise and reverberation. Background noise, which is also known as ambient noise, can be stationary or non-stationary [12]. Noise can have tonal components that may have strong phase correlation with speech. Reverberation is a convolutive distortion; a room impulse response (RIR) includes components at both short and long delays, resulting in both coloration [17] and reverberation and/or echoes. Reverberation can be quite long, with a reverberation time, T60, that can exceed several hundred milliseconds. Noise is uncorrelated with speech [1], early reflections are strongly correlated with speech and late reverberation is uncorrelated with speech. Early reverberation is not perceived as a separate sound source and is correlated with clean speech [12].

The goal of speech enhancement is to reduce and ideally eliminate the effects of both additive noise and room reverberation without distorting the speech signal [12] [18]. The aim is to enhance speech in situations where the noise level is high enough to degrade the speech quality [18] and in situations where abrupt changes of noise occur. Such situations arise commonly when the microphone is some distance away from the target speaker, because the acoustic energy that the microphone receives from the target speaker decreases with the square of the distance whereas the noise energy typically remains constant. The aim of speech enhancement is to improve the perceived quality of speech by suppressing noise and late reverberation [12]. In particular, we aim to suppress late reverberation because early reflections are not perceived as separate sound sources and usually improve the speech quality and intelligibility of the degraded signal.

II. Literature Review

Single-channel speech enhancement can be performed in different domains [2]. The ideal domain should be chosen such that (i) good statistical models of speech and noise exist in this domain, and (ii) speech and noise are separable in this domain. Speech and noise are additive in the time domain and therefore in the complex Short Time Fourier Transform (STFT) domain [12]. Speech and noise are not additive in other domains such as the amplitude, power or log-power spectral domains. The relation between speech and noise becomes incrementally more complicated in the amplitude spectral domain, the power spectral domain, the log-spectral domain and the cepstral domain. Modeling speech spectral log-amplitudes as Gaussian distributions leads to good speech modeling because the logarithmic scale is a good perceptual measure and because researchers use super-Gaussian distributions that resemble the log-normal, such as the Gamma distribution [19], to model speech in the amplitude spectral domain. In this context, using the log-normal distribution in the amplitude spectral domain is equivalent to using the Gaussian distribution in the log-spectral domain. Speech signals can be modeled more accurately using super-Gaussian Laplacian distributions than using Gaussians in terms of the amplitude spectral coefficients [20], [21].
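To make the non-additivity concrete, the short numerical sketch below (illustrative values only, numpy) verifies that complex STFT coefficients add exactly while their amplitudes, powers and log-powers do not:

```python
import numpy as np

# Hypothetical complex STFT coefficients of clean speech and noise in one
# time-frequency cell (illustrative values only).
X = 0.8 * np.exp(1j * 0.3)   # clean speech coefficient
N = 0.5 * np.exp(1j * 2.1)   # noise coefficient
Y = X + N                    # additivity holds in the complex STFT domain

# Additivity does NOT carry over to the derived domains.
print(abs(Y), abs(X) + abs(N))                  # amplitude: |Y| != |X| + |N|
print(abs(Y)**2, abs(X)**2 + abs(N)**2)         # power:     |Y|^2 != |X|^2 + |N|^2
print(np.log(abs(Y)**2),
      np.log(abs(X)**2) + np.log(abs(N)**2))    # log-power: not additive either
```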

The research work in [1] focuses on model-based speech enhancement aiming towards both noise suppression and dereverberation. Speech enhancement is performed in the log-spectral time-frequency domain using a Kalman filter (KF) to model temporal inter-frame correlations [2]. The reasons for choosing the log-spectral time-frequency domain relate to point (i) in the previous paragraph: good statistical models of speech and noise exist in the log-spectral time-frequency domain [3]. Speech spectra are well modelled by Gaussians in the log-spectral domain (and not so well in other domains) [2], mean squared errors in the log-spectral domain are a good measure of perceptual speech quality, and the log-spectral domain, which is not restricted to non-negative values, is most suitable for infinite-support Gaussian modeling. The log-spectral domain is used for the aforementioned reasons and because the loudness perception of the peripheral human auditory system is logarithmic.

Regarding (ii), the extent to which speech and noise are separable in the log-spectral time-frequency domain: some noise types are sparse in time and some are sparse in both time and frequency [12]. Speech is sparse in both time and frequency. Intermittent noise is sparse in time and some noise types are fairly sparse in both time and frequency. In addition, speech and noise are correlated over successive frames.

Monaural speech enhancement is most commonly done in a time-frequency domain because both speech and, in many cases, interfering noise are relatively sparse in this domain. Speech is sparse in both time and frequency, intermittent noise is sparse in time and some noise types are fairly sparse in both time and frequency. A recent paper that advocates the argument that speech signals are sparse in both time and frequency is [22]. The sparse nature of speech spectrograms is also utilised in the dereverberation algorithm in [23].

Speech enhancement can also be performed in the time domain, even though speech is not sparse in the time domain. Early speech enhancement was performed in this domain.

Kalman filtering can be performed in the time domain; there is a plethora of enhancement algorithms that use a KF in the time domain and they all have originated from [24]. Kalman filtering in the time domain needs a KF state of large dimension, and the required state dimension grows with the sample rate. Kalman filtering in the time domain, [25] [24], is different from modulation-domain Kalman filtering, [26] [27]. Kalman filtering in the time domain, as performed in [25] and in [24], operates in the time domain and changes the spectrum, without explicitly computing the spectrum. In the same way, modulation-domain Kalman filtering, as performed in [26] [27], operates in a spectral time-frequency domain and changes the modulation spectrum, without explicitly computing it.
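For illustration, the sketch below is a minimal time-domain Kalman filter with an autoregressive (AR) speech model, in the spirit of [24]; the function name, AR order and noise variances are assumptions for this example and not the exact published algorithm:

```python
import numpy as np

def time_domain_kf(y, a, q, r):
    """Minimal time-domain KF sketch. y: noisy samples, a: AR coefficients
    (order p) of the clean speech model, q: driving-noise variance,
    r: additive observation-noise variance. Returns enhanced samples."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    p = len(a)
    # Companion-form transition matrix: the newest state element is an AR
    # combination of the previous p samples; the rest are shifted copies.
    F = np.vstack([a, np.eye(p - 1, p)])
    H = np.zeros(p); H[0] = 1.0           # we observe only the newest sample
    Q = np.zeros((p, p)); Q[0, 0] = q
    x = np.zeros(p); P = np.eye(p)
    out = np.empty_like(y)
    for t, yt in enumerate(y):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        k = P @ H / (H @ P @ H + r)       # Kalman gain
        x = x + k * (yt - H @ x)
        P = P - np.outer(k, H) @ P
        out[t] = x[0]
    return out
```

In practice the AR coefficients and variances would be re-estimated frame by frame, e.g. by linear prediction on pre-cleaned speech.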

The model-based speech enhancement algorithms in [1] and in [2] (the latter of which estimates and tracks the clean speech phase) solve the problem of monaural speech enhancement using modulation-domain Kalman filtering, which refers to imposing temporal constraints on a spectral time-frequency domain. Three possible domains are the amplitude spectral domain, the power spectral domain and the log-spectral domain. Non-linear adaptive modulation-domain Kalman filtering refers to tracking the clean speech signal in one of these three spectral domains along with imposing inter-frame constraints [2].

Speech is highly structured and it is mainly structured in its inter-frame component. Speech is a highly self-correlated signal and, by taking the inter-frame correlation into account, we are able to develop more sophisticated algorithms with better noise reduction results [28]. Speech has prominent temporal dependency which provides rich information for speech processing and this is why modulation-domain Kalman filtering can be performed. The speech enhancement algorithms in [2] and [3] model the temporal dynamics of the speech spectral log-powers, assuming that the STFT spectral log-power of the current frame is correlated with the STFT spectral log-power of the neighboring frames. When the algorithms estimate the spectral log-power of the clean speech in the current frame, they use the STFT spectral log-powers of the noisy speech not only in the current frame but also in the previous ones.

Speech enhancement aims to minimize the effects of additive noise and room reverberation on the quality and intelligibility of the speech signal. Speech quality is the measure of noise remaining after the processing on the speech signal and of how pleasant the resulting speech sounds, while intelligibility refers to the accuracy of understanding speech. Enhancement algorithms are designed to remove noise and reverberation with minimum speech distortion [12]. There is a trade-off between speech distortion and noise and reverberation suppression. Enhancement is challenging due to lack of knowledge about both the speech and the corrupting noise.

Speech enhancement is most commonly performed in a time-frequency domain that is related to the STFT and thus using STFT bins [18]. The main advantage of utilising the (high) frequency resolution of STFT bins is that perfect reconstruction is possible in the STFT domain. Different frequency bands, such as Mel-spaced bands and Bark-spaced bands, can also be used. The Mel-frequency scale is a perceptually motivated scale that is approximately linear below 1 kHz and logarithmic above 1 kHz. Gammatone time-domain filters can also be used. The STFT is popular because it can be made to have perfect reconstruction; however, Mel-bank, Bark-bank or gammatone filters more closely match the frequency resolution of human hearing [4]. Matching the frequency resolution of human hearing is important for reducing the computational complexity of signal processing algorithms. Human hearing mainly depends on low and medium frequencies [20] [7] and high spectral resolution is not always needed at high frequencies [4].
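For reference, one widely used analytic form of the Mel scale is shown below (the report does not commit to a specific formula; the constants here are one common variant and the function names are chosen for this example):

```python
import numpy as np

def hz_to_mel(f_hz):
    # A common Mel-scale formula: approximately linear below ~1 kHz
    # and logarithmic above it.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel-spaced band edges between 0 and 8 kHz (e.g. for a 16 kHz sample rate).
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26))
```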

Gammatone filters are easy-to-implement real-valued filters, usually of the eighth order, that match human hearing [29]. One of the main advantages of gammatone filters is that no frame segmentation is needed; the signal is in the time domain during the entire processing and the time-frequency trade-off is not evident. In this way, no artifacts are created from frame segmentation. The gammatone time-domain filters transform the signal into bands and then real-valued gains are computed for each band. One of the main disadvantages of gammatone filters is that perfect signal reconstruction is not possible.

Speech enhancement can be performed in different time-frequency domains, such as the complex STFT domain, the amplitude spectral domain and the power spectral domain. Other possible time-frequency domains are the log-spectral domain, the cepstral domain and the (spectral) phase domain [30] [31]. Moreover, speech enhancement can be performed either using the real and the imaginary parts of the complex STFT domain [32] [33] or using the log real and the log imaginary parts of the complex STFT domain. Most enhancement algorithms modify only the amplitude of the spectral components and leave the phase unchanged for three reasons: (i) estimating the phase reliably is difficult [18], (ii) the ear is largely insensitive to phase, and (iii) the optimum estimate of the clean speech phase is the noisy phase under reasonable assumptions. The enhancement problem is to estimate a real-valued time-frequency gain to apply to the noisy signal.

The real-valued time-frequency gain can be applied in STFT bins but can be calculated in Mel-spaced frequency bands, as in [34] [35]. According to [34] [35], the speech enhancement algorithms can first estimate and then interpolate the real-valued gain in Mel-spaced frequency bands to estimate and apply the real-valued gain in uniformly-spaced STFT bins.

Spectral subtraction in the magnitude spectral domain (or in the power spectral domain) was one of the earliest enhancement techniques. Furthermore, regarding traditional enhancement algorithms, Minimum Mean Square Error (MMSE) [36] and Log-MMSE [37] are two of the most popular model-based enhancement techniques. The superiority of Log-MMSE over MMSE can be considered as motivation for using the log-spectral domain and thus for minimizing the error in the log-spectral domain. Both MMSE and Log-MMSE assume a uniform speech phase distribution [7] and also assume that speech and noise are additive in the complex STFT domain [38].

MMSE and Log-MMSE can be considered as one group of algorithms since they are variants of time-frequency gain manipulation. In [39], a description of the MMSE and Log-MMSE statistical-based noise reduction algorithms is given. The Log-MMSE estimator is better in terms of speech quality than the MMSE estimator since it attenuates the noise power more without introducing much speech distortion [40]. According to [41], MMSE estimators using the decision-directed approach do not introduce musical noise. However, according to listening experiments, this claim of [41] is not actually true. In MMSE, [36], the a posteriori SNR is the noisy speech power divided by the noise power and the a priori SNR is the clean speech power divided by the noise power. The traditional MMSE approach, [36], uses the decision-directed approach to estimate the a priori SNR from the a posteriori SNR. The traditional Log-MMSE approach, [37], uses the log-power domain. In MMSE, the model assumes that the STFT coefficient of noisy speech is the sum of two zero-mean complex Gaussian random variables; the STFT coefficients of clean speech and noise are modeled with a zero-mean complex Gaussian distribution [39]. For complex Gaussian random variables, the magnitude and phase are independent and this is a common assumption in speech processing algorithms. In addition, the distribution of the magnitude is Rayleigh and the distribution of the phase is uniform in [-π, π); the latter assumption is common in speech enhancement algorithms. Several variants of the MMSE and Log-MMSE estimators exist; super-Gaussian models for speech in the amplitude or power spectral domains have been proposed after the success of Log-MMSE. Alternative versions of the MMSE are presented, for example, in [38], in [42] and in [43].
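As an illustration of the decision-directed approach described above, the sketch below updates the a priori SNR from the a posteriori SNR for one frequency bin; a simplified Wiener-type gain is used as a stand-in for the full MMSE-STSA gain of [36], and the function name, smoothing constant and floor are assumed typical values for this example:

```python
import numpy as np

def decision_directed(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Per-bin decision-directed a priori SNR estimation over time.
    noisy_power, noise_power: 1-D arrays of |Y|^2 and estimated noise PSD
    for one frequency bin across frames (illustrative sketch)."""
    n_frames = len(noisy_power)
    xi = np.empty(n_frames)                              # a priori SNR estimate
    gain = np.empty(n_frames)
    prev_clean_power = 0.0
    for l in range(n_frames):
        gamma = noisy_power[l] / noise_power[l]          # a posteriori SNR
        xi[l] = alpha * prev_clean_power / noise_power[l] \
                + (1.0 - alpha) * max(gamma - 1.0, 0.0)  # decision-directed update
        xi[l] = max(xi[l], xi_min)
        gain[l] = xi[l] / (1.0 + xi[l])                  # Wiener-type gain (stand-in
                                                         # for the MMSE-STSA gain)
        prev_clean_power = (gain[l] ** 2) * noisy_power[l]
    return xi, gain
```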

More recently, researchers have tried to incorporate phase in speech modeling [30], [31]. The speech phase is not irrelevant, [44], and at low SNR levels the ear is sensitive to the phase. Incorporating the phase leads to applying a complex-valued time-frequency gain to the noisy speech signal in the complex STFT domain. In [30] and [45], several speech phase estimation algorithms are discussed, analysed and tested. The speech separation algorithm in [18] discretises the difference between the noisy and clean speech phases in a non-uniform way and treats the estimation of the difference between the noisy and clean speech phases as a supervised learning classification problem. In [18], the (ideal) ratio mask is also discretised.

Regarding speech phase estimation in non-stationary noisy environments, the model-based speech enhancement algorithm presented in [2] estimates and tracks the clean speech phase. The STFT-based enhancement algorithm in [2] performs adaptive non-linear Kalman filtering in the log-magnitude spectral domain to track the speech phase in adverse conditions.

Recently, researchers have started to consider the inter-frame correlation of speech. In traditional speech enhancement, each time-frame was considered on its own and inter-frame correlation was not explicitly modeled. In traditional speech enhancement, such as in MMSE or Log-MMSE, the local SNR estimate (i.e. either the a priori or the a posteriori local SNR) was smoothed and this is how inter-frame correlation was indirectly considered; there was no explicit model for the inter-frame correlation of speech. Nowadays, the inter-frame correlation of speech can be modeled using the modulation domain. Regarding modulation-domain algorithms, the relative spectra (RASTA) and Gabor modulation filters have been used for enhancement [46] and are popular as pre-processing front-end methods for ASR. The RASTA filter is a band-pass filter in the modulation domain that eliminates low and high modulation frequencies [46].

Modulation-domain Kalman filtering [26] [27] is different from the aforementioned modulation filters in the sense that the modulation-domain Kalman filtering algorithms do not compute the modulation spectrum: they alter the modulation spectrum without explicitly computing it. Modulation-domain Kalman filtering considers the inter-frame correlation of speech in the spectral domain. With modulation-domain Kalman filtering, temporal constraints are imposed on a specific time-frequency domain of speech.

The modulation-domain Kalman filtering technique was first presented in [26] [27] in 2010. Enhancement algorithms can benefit from including a model of the temporal inter-frame correlation of speech. With modulation-domain Kalman filtering, each time-frame is not treated independently and temporal constraints are imposed on a specific time-frequency spectral domain of speech. In [26] [27], modulation-domain Kalman filtering in the amplitude spectral domain is performed with a linear KF update step; both the inter-frame speech correlation modeling and the speech tracking are performed in the amplitude spectral domain with a modulation-domain Kalman filter, and Gaussian distributions are used in the amplitude spectral domain. The algorithm in [26] [27] assumes a linear distortion equation in the time-frequency amplitude spectral domain and this is why it performs a linear KF update step.

Whereas traditional speech enhancement algorithms treat each time-frame independently, an alternative approach performs filtering in the modulation domain. The modulation domain models the time correlation of frames. The modulation domain models the time evolution of the clean STFT amplitude domain coefficients in every frequency bin. The algorithms described in [38] and [19] use modulation-domain KFs.

The modulation-domain KF is a good low-order linear predictor for modeling the dynamics of slow changes in the modulation domain and produces enhanced speech that “has minimal distortion and residual noise”, according to [26] [27]. The modulation-domain KF is an adaptive MMSE estimator that uses models of the inter-frame changes of the amplitude spectrum, the power spectrum or the log-spectrum of speech. Modulation-domain Kalman filtering for tracking both speech and noise is possible and beneficial according to [38]. Noise tracking using a KF can be beneficial for enhancement [47], [48]. Noise tracking is performed in [47] and subsequently in [49]. In the KF update step, the correlation between speech and noise samples can be estimated, as in [3], [2] and [5].

Modulation-domain Kalman filtering can be performed in the amplitude spectral domain, in the power spectral domain or in the log-magnitude spectral domain [2]. The KF equations are different in each case. Modulation-domain Kalman filtering in the log-spectral domain, minimizing the error in the log-power spectral domain, is performed in [3], in [2] and in [5]. Many papers, such as [50] and [51], relate clean speech and noisy speech in the log-spectral domain. The non-linear log-spectral distortion equation is used in [52] and in [53].

Time-frequency cells of the signal in the amplitude, power or log-power spectral domain can be viewed as features. When speech is distorted by noise and reverberation, the temporal characteristics of the feature trajectories are distorted and need to be enhanced. Filtering has to be performed that removes variations in the signal that are uncharacteristic of speech, adapting according to the underlying environmental conditions.

Modulation-domain Kalman filtering in [26] [27] assumes that speech and noise add in the amplitude spectral domain. Assuming additivity of speech and noise in the amplitude spectrum is an approximation that holds at high instantaneous SNR. The spectral amplitude additivity assumption is not mathematically exact and lacks a clear physical interpretation, even though it produces reasonable results in practice.

The phase factor, α, is the cosine of the phase difference between speech and noise [54], [55]. The phase factor and the additivity in the power or the amplitude spectral domain are related to the in-phase and the in-quadrature components [6]. When speech and noise are in-phase, α = 1; when speech and noise are in-quadrature, α = 0. According to [56], the effect of the phase factor is small when the noise estimates are poor. On the contrary, when the noise estimates are accurate, the effect of α is stronger [56]. It was noted in [57] that the power-sum, log-sum and max-model approximations are usually used in denoising speech enhancement. Both the power-sum and the log-sum approximations assume α = 0 and thus that speech and noise are in-quadrature. The max-model approximation resembles, but is not identical to, the α = 0 assumption. We note that the amplitude-sum approximation is not mentioned in [57]. In modulation-domain Kalman filtering, [26] [27], and in nonnegative matrix factorization (NMF), [58], the amplitude-sum approximation that assumes α = 1 is usually used.

Modeling the effect of noise as additive in the power spectral domain assumes α = 0. According to [59], it is well known that modeling the effect of additive noise as additive in the power spectral domain is only an approximation, which breaks down at SNRs close to 0 dB. In that case, the cross term in the power spectrum can no longer be neglected [59] [60].

The algorithm in [61] assumes that the phase factor is zero, α = 0. In [61], equation (3) is the power spectral domain distortion equation under the assumption that α = 0. In [61], the log-power spectrum notation is used in equations (4)-(5) if we ignore the convolutive distortion and therefore the distortion due to the microphone type and the relative position of the talker or speaker.

The log-power spectrum non-linear distortion equation is y = x + ln(1 + e^(n−x)), where y is the noisy speech log-power, x is the speech log-power and n is the noise log-power [56] [3]. All the variables are defined in the log-power spectral domain. According to [52], the phase factor α can also be included in the distortion equation, giving y = x + ln(1 + e^(n−x) + 2α e^((n−x)/2)).
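A small numerical check of the log-power distortion equation and of its phase-factor special cases is given below (numpy only; the values of x, n and α are illustrative, and α denotes the phase factor as above):

```python
import numpy as np

x, n = -1.2, -2.5          # clean speech and noise log-powers (illustrative)
alpha = 0.3                # phase factor: cos of the speech-noise phase difference

# Phase-sensitive relation in the power domain: |Y|^2 = |X|^2 + |N|^2 + 2a|X||N| ...
power_y = np.exp(x) + np.exp(n) + 2.0 * alpha * np.exp(0.5 * (x + n))
# ... and the equivalent log-power form used in the text.
y_log = x + np.log(1.0 + np.exp(n - x) + 2.0 * alpha * np.exp(0.5 * (n - x)))
assert np.isclose(np.log(power_y), y_log)

# alpha = 0 (in-quadrature): the usual power-sum / log-sum approximation.
y0 = x + np.log(1.0 + np.exp(n - x))
# alpha = 1 (in-phase): amplitudes add, i.e. the amplitude-sum approximation.
y1 = 2.0 * np.log(np.exp(0.5 * x) + np.exp(0.5 * n))
```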

Speech enhancement in non-stationary noise environments is a challenging research area. The modulation domain is an often-used representation in models of the human auditory system; in speech enhancement, the modulation domain models the temporal inter-frame correlation of frames rather than treating each frame independently [26] [27]. Enhancement algorithms can benefit from including a model of the inter-frame correlation of speech and a number of authors have found that the performance of a speech enhancer can be improved by using a speech model that imposes temporal structure [17], [62], [63]. Temporal inter-frame speech correlation modelling can be performed with a KF with a state of low dimension, as in [26] and [19]. The algorithms in [38] track the time evolution of the clean STFT amplitude domain coefficients in every frequency bin. In [64], speech inter-frame correlation is modeled. Considering KF algorithms, many papers, such as [50] [51] and [53], use the non-linear observation model relating clean and noisy speech in the log-spectral domain.

Modulation-domain Kalman filtering can be applied for both noise and late reverberation suppression and this is why this report discusses both noise reduction and dereverberation.

The modulation domain models the time correlation of frames and does not treat each time-frame independently. The algorithms in [38] track the time evolution of the clean STFT amplitude domain coefficients in every frequency bin. Denoising algorithms that operate in the modulation domain use overlapping modulation frames and use the KF. Considering KF-related algorithms, many papers, such as [50] and [51], use the observation model relating clean speech and noisy speech in the log-power spectrum. The non-linear log-spectral distortion equation is also used in [53]. In [64], inter-frame speech correlation modeling is performed and is then followed by NMF.

According to [26], [19], [38] and [3], temporal inter-frame speech correlation modeling requires the use of a KF with a state of low dimension. Motivated by the fact that inter-frame speech correlation modeling requires only a KF with a low-dimensional hidden state, we claim that a KF with a low-dimensional hidden state can effectively be utilized for both inter-frame and intra-frame/frequency speech correlation modeling. We use the KF prediction step for both inter-frame and intra-frame speech correlation modeling. Autoregressive (AR) modeling is a mathematical technique that models correlation, and any local correlation can be modeled with the Markov assumption. In this report, we use both inter-frame and intra-frame KF prediction steps and claim that the intra-frame KF prediction step can be used for frequencies around the pitch and harmonics. AR modeling for intra-frames will model the correlation among neighboring frequencies around the pitch and harmonics. In this way, we can better discriminate clean speech from noise in the log-magnitude spectral domain.

The algorithms in [38] operate in the modulation domain and treat every frequency bin on its own. In this report, as the main innovation, we advance intra-frame correlation modeling based on modulation-domain Kalman filtering by utilizing both inter-frame and intra-frame KF prediction steps. We use Kalman filtering in the log-power STFT spectrum. Log-spectral features are highly correlated: the behaviour of a certain frequency band is very similar to the behaviour of the adjacent frequency bands. Therefore, the log-power STFT spectrum is highly suitable for intra-frame modeling.

The procedure that is followed in algorithms that perform modulation-domain Kalman filtering is as follows. The first step of the procedure is to transform the time domain signals into a suitable time-frequency representation using the STFT. In this step, the algorithm divides the time domain signal into overlapping frames, obtained by sliding a window through the signal. These frames are then transformed into the frequency domain at a suitable resolution using the Fourier transform. The sliding window is shifted through the signal with a suitable hop to obtain a sub-sampled time-frequency representation that allows for perfect reconstruction. These steps constitute the STFT [28] [65]. The short-time spectra are then divided into their magnitude and phase components. The magnitude of the short-time spectra is usually considered on its own to separate speech from noise, leaving the phase of the short-time spectra unaltered. In modulation-domain Kalman filtering algorithms, adjacent magnitude short-time spectra are referred to as modulation frames; modulation frames, with a suitable length and increment, are used for AR modeling.
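The sketch below follows this procedure with numpy and scipy: STFT analysis with overlapping frames, magnitude extraction, and grouping of adjacent magnitude spectra into overlapping modulation frames on which a low-order AR model is fitted per frequency bin. The function name, frame lengths, increments and AR order are assumed typical values for this illustration, not those of any specific cited algorithm.

```python
import numpy as np
from scipy.signal import stft

def modulation_frames_ar(x, fs=16000, nperseg=512, hop=128,
                         mod_len=8, mod_hop=2, ar_order=2):
    """Split a time-domain signal into STFT magnitude spectra, group adjacent
    spectra into overlapping modulation frames and fit a low-order AR model
    per frequency bin (illustrative parameter values)."""
    # 1. STFT: overlapping acoustic frames -> complex spectra.
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    mag, phase = np.abs(Z), np.angle(Z)      # the phase is kept unaltered

    ar_coeffs = []   # per modulation frame: one set of AR coefficients per bin
    for start in range(0, mag.shape[1] - mod_len + 1, mod_hop):
        frame = mag[:, start:start + mod_len]        # (n_bins, mod_len)
        coeffs_per_bin = []
        for bin_traj in frame:                        # temporal trajectory of one bin
            # Least-squares AR fit: predict each value from the previous ar_order.
            A = np.column_stack([bin_traj[ar_order - k - 1:-(k + 1)]
                                 for k in range(ar_order)])
            b = bin_traj[ar_order:]
            a, *_ = np.linalg.lstsq(A, b, rcond=None)
            coeffs_per_bin.append(a)
        ar_coeffs.append(np.array(coeffs_per_bin))
    return mag, phase, ar_coeffs
```

In a full modulation-domain Kalman filter, these per-bin AR coefficients would define the KF prediction step, with the update step carried out in the chosen spectral domain.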

The modulation domain models the inter-frame correlation of clean speech and does not consider each time-frame independently. In [64], inter-frame speech correlation is modeled and is then followed by NMF. Inter-frame correlations of speech are considered in several papers and books by J. Benesty, e.g. [28]. Section 4 in [28] presents linear filters for inter-frame temporal correlation modeling of speech [2].

Nowadays, speech enhancement algorithms can model the inter-frame correlation of the speech spectrum. Short-term inter-frame relationships can be created based on the Markov property with the KF. The algorithm in [3] uses modulation-domain KFs. The KF framework, which is described amongst others in [66], is convenient in that it allows for statistically grounded approaches to tracking. Kalman filtering uses local inter-frame priors due to the temporal dynamics modeling of the KF prediction. Inter-frame correlation modeling of speech is performed in [63] using Markov Random Fields.

Inter-frame and intra-frame speech correlation modeling has been considered since 1987 in [67] and, subsequently, since 1991 in [68]. According to [68], inter-frame constraints are imposed on speech to reduce frame-to-frame pole jitter. In [63], Markov Random Fields are used for both inter-frame and intra-frame speech correlation modeling. Regarding intra-frame speech correlation modeling in voiced frames, equation (2.6) in [63] correlates a specific harmonic with the previous and next harmonics using the observation that harmonics are integer multiples of the fundamental frequency [69] [7].

According to Sec. 2.3 in [17], assuming independence between time-frames is uncommon and “this assumption could be relaxed by imposing temporal structure to the speech model with a recurrent neural network (RNN)”. According to [62], in speech enhancement algorithms, the KF can be used to create short-term dependencies due to the Markov property while RNNs can be utilised to create long-term dependencies between time-frames. The latter statement may be true for the examples considered in [62] but it is not generally true for the RNN in Sec. 3 in [62]. According to [70], it can be shown that memory either decays or explodes in RNNs that do not have long short-term memory (LSTM) and it is thus not clear that one can do better than KFs and the Markov property.

Speech signals can be considered to be correlated only for short-time periods. In the STFT time-frequency domain, inter-frame speech correlation exists due to both the speech characteristics and the STFT framing overlaps [2] [1].

According to [71], “noise reduction using inter-frame speech correlation modeling has been addressed partially in [72], [32] and [48] where, in the KF prediction step of a noise reduction method based on Kalman filtering, complex-valued prediction weights are used to exploit the temporal correlation of successive speech and noise STFT coefficients”. The authors in [71] do not discuss modulation-domain Kalman filtering and omit the references of [26] [27] and of [19] [38]. In addition, the authors in [71] claim that “algorithms that perform inter-frame speech correlation modeling assume perfect knowledge of theoretical inter-frame correlation”, which is not valid since any prediction errors are encapsulated in the AR residual. Modulation-domain Kalman filtering algorithms [38] assume small errors from AR modeling on the pre-cleaned noisy spectrum but they also compute the AR residual [2].

Kalman filtering is related to using Gaussian distributions; in modulation-domain Kalman filtering, at every time step, the posterior is computed using the KF-based local prior that is assumed to follow a Gaussian distribution. According to [73], speech enhancement based on spectral features, such as the amplitude, power and log-power spectrum, degrades when the spectral prior does not accurately model the distribution of the speech spectra and when the speech and the noise/interference have similar spectral distributions. Regarding the latter case, babble noise has a speech-shaped spectral distribution [7].

The modulation-domain Kalman filtering algorithms in [26] [27] perform a linear KF update step [74]; on the contrary, the modulation-domain Kalman filtering algorithms in [38], in [75] and in [76] perform a non-linear KF update step. For example, in [19], the modulation-domain Kalman filter performs a non-linear KF update step involving the Gamma distribution; the linear KF prediction step is performed in the amplitude spectral domain and then moment matching is used to obtain a Gamma prior so that the modified non-linear KF update step is performed using the Gamma distribution.
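The moment-matching step mentioned above can be sketched as follows: given the mean and variance of the (Gaussian) KF prediction in the amplitude domain, a Gamma prior with the same first two moments is obtained. This is a generic moment-matching sketch with a hypothetical function name and illustrative values, not the exact derivation of [19].

```python
def gamma_from_moments(mean, var):
    """Match a Gamma(shape k, scale theta) distribution to a given mean and
    variance, using mean = k * theta and variance = k * theta**2."""
    theta = var / mean        # scale
    k = mean / theta          # shape (equivalently mean**2 / var)
    return k, theta

# Example: a Gaussian KF prediction with mean 0.6 and variance 0.04 in the
# amplitude spectral domain -> Gamma prior with the same two moments.
k, theta = gamma_from_moments(0.6, 0.04)   # k = 9.0, theta ~ 0.0667
```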

Modulation-domain Kalman filtering can be related to Bayesian filtering and particle filtering. The algorithm in [77] uses particle filtering to track time-varying harmonic components in noisy speech. Furthermore, non-linear adaptive Kalman filtering can be related to state-space modeling, which is used in the algorithm in [49] that performs both noise reduction and dereverberation. In Sec. IV.B in [49], the algorithm tracks the noise in a spectral domain using AR modeling.

Non-linear Kalman filtering can be used along with uncertainty decoding, [78] [79], in ASR because it estimates the speech amplitude spectrum and its variance. According to [79], uncertainty decoding is a promising approach for dynamically tackling the distortions remaining after speech enhancement using posterior distributions instead of point estimates. The uncertainty is computed either directly in the ASR feature domain or propagated from the spectral domain to the feature domain [79]. With modulation-domain Kalman filtering, the uncertainty/variance is computed in the spectral domain.

Adaptive modulation-domain Kalman filtering with a non-linear KF update step can be related to the hidden dynamic model that is discussed and explained in section 13.6 in [60]. The non-linear mapping from the hidden states to the continuous-valued acoustic features in equation (13.39) in [60] resembles the KF update step that non-linearly relates the continuous-valued clean acoustic features with the continuous-valued noisy acoustic features. In section 13.6 in [60], the top-down generative process of the hidden dynamic model is analysed; the KF can be explained as a top-down process.

Speech enhancement is difficult, especially when the noisy speech signal is only available from a single channel. Although many single-channel speech enhancement algorithms have been proposed that can improve the SNR of the noisy speech, they also introduce speech distortion and spurious tonal artefacts known as musical noise. In noisy conditions, the trade-off between speech distortion and noise removal is apparent. According to the literature and to [20] and [80], if the evolution of noise is slower than the evolution of speech, and thus if noise is more stationary than speech, then noise can efficiently be estimated during the speech pauses. On the contrary, if noise is non-stationary, then it is more difficult to estimate the noise and this results in speech degradation [80]. In this research work, coloured noise is considered. According to the literature and to [7] and [20], real-world noise is coloured and does not affect the speech signal uniformly over the entire spectrum [38].

Common/typical speech enhancement algorithms work on the STFT magnitudes, on the STFT powers or on the STFT log-powers, leaving the phase unaltered [12]. Other speech enhancement approaches alter the phase by considering the complex STFT domain, the real and imaginary parts of the complex STFT domain or the log real and log imaginary parts of the complex STFT domain. Furthermore, according to the literature [81] [60], some speech enhancement algorithms operate on the cepstrum and leave the phase unaltered.

Regarding the complex STFT domain, according to [32], performing complex AR modeling produces more accurate results than tracking the real and imaginary parts separately and there is no correlation in successive phase samples.

The cepstral domain is a possible speech processing domain. The cepstrum, which is different from the complex cepstrum [82], can be considered as a smoothed version of the log-spectral domain. On the one hand, the cepstrum is the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform. On the other hand, the complex cepstrum is based on both the magnitude and the phase of the Fourier transform; the complex cepstrum is the inverse Fourier transform of the complex logarithm, log|X(ω)| + j∠X(ω), of the Fourier transform X(ω) [82]. The cepstrum can be used for enhancement and it is usually used with Mel bands.

According to the literature and to [81] and [60], the front-end of a speech recognition system is as follows. A discrete Fourier transform (DFT) is applied after windowing; next, the power spectrum is computed, Mel-spaced bands are applied, the log operator is used and then a second transform is performed. The second transform is usually a Discrete Cosine Transform (DCT). The DCT is performed on the Mel-spaced log-spectrum to compute the cepstrum. The output of the DCT is approximately decorrelated; hence, the decorrelated features can be modelled with a Gaussian distribution that has a diagonal covariance matrix [81] [60]. The latter observation, that the decorrelated DCT output features are usually modelled with a Gaussian distribution that has a diagonal covariance matrix, is interesting. Speech enhancement as a front-end to speech recognition aims to enhance either the final cepstral feature or any intermediate feature.
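A compact sketch of this front-end chain (windowed DFT, power spectrum, Mel bands, log, DCT) is given below; the function name, the Mel-scale constants and the parameter values are assumed typical choices for this example, and the DCT-II is written out with numpy only.

```python
import numpy as np

def cepstral_features(frame, fs=16000, n_mels=26, n_ceps=13):
    """One analysis frame -> Mel-frequency cepstral coefficients (sketch)."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2            # power spectrum

    # Triangular Mel filterbank (edges uniformly spaced on the Mel scale).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, fs / 2.0, len(power))
    fbank = np.zeros((n_mels, len(power)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                      (hi - freqs) / (hi - mid)), 0.0, 1.0)

    log_mel = np.log(fbank @ power + 1e-12)               # log Mel-band energies

    # DCT-II of the log Mel energies gives approximately decorrelated cepstra.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_mels)
    return dct @ log_mel
```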

The speech enhancement algorithms that work on the STFT magnitudes try to minimize the error in the amplitude spectral domain. Likewise, the algorithms that work on the STFT powers try to minimize the error in the power spectral domain and the algorithms that operate on the STFT log-powers try to minimize the error in the log-spectral domain. In this sense, the enhancement algorithms that work on the STFT log-powers resemble the algorithms that use the log mean squared error (MSE) spectral distortion metric [40] [20]. In [40], P. C. Loizou examines the use of perceptual distortion metrics, such as the Itakura-Saito (IS) distortion and the hyperbolic-cosine (COSH) distortion, instead of the MSE and the log-MSE. Perceptual distortion metrics had been used for speech recognition before 2005 and, in 2005 [40], perceptual distortion metrics were used for speech enhancement and for estimating clean speech in the amplitude spectral domain.

Considering the amplitude, power and log-power spectral domain and the perceptual distortion metrics [40] [20], speech can be estimated and/or tracked in perceptually motivated time-frequency domains, such as the IS-spectral domain or the COSH-spectral domain. Perceptually motivated spectral time-frequency domains have not been used for speech tracking.

III. Additional Literature Review

The non-linear KF algorithm in [2] is a model-based speech enhancement algorithm based on parametric estimation. KF algorithms are different from data-driven algorithms, such as [83] and [84]. Data-driven neural network algorithms consider all frequency bins simultaneously and are different from parametric estimation algorithms that operate on a per-frequency-bin basis [85] [14]. In [83], an LSTM RNN is used to estimate late reverberation that is then subtracted from the reverberant speech signal to estimate the anechoic dry speech. Supervised learning is examined in the PhD theses [86] and [87].

A more recent direction in speech enhancement is the use of neural networks (NNs) and deep NNs [10] [12]. NN-based speech enhancement, which has been examined in [18], [87] and [86], can be used. Amongst other places, deep NNs are mathematically described and discussed in chapter 4 in [60]; several examples of NN-based enhancement algorithms can be found in [88], [58], [89] and [90]. NNs perform frequency intra-frame correlation modeling since their inputs are the noisy speech in the amplitude spectral domain, the power spectral domain or the log-spectral domain. In NNs, inter-frame correlation of speech is modeled by considering context frames, which can be considered as overlapping modulation frames, as inputs to the NN. However, this speech inter-frame correlation modeling often leads to artefacts, decreasing the speech artefact ratio in source separation, according to slide 35 in [88]. Specifically, according to slide 35 in [88], frame-by-frame denoising with NNs produces comparable results to NNs with context frames in terms of separation metrics.

In contrast to NNs [18], model-based enhancement algorithms that perform modulation-domain Kalman filtering use few parameters and utilise the equations relating speech and noise in the complex STFT domain. Specific equations relating speech and noise in the spectral domain are used and the relationship between speech and noise is not learned from training data. Non-linear Kalman filtering algorithms model the speech inter-frame correlation in the STFT domain but not the speech intra-frame correlation in the STFT domain. NNs are robust to small variations of the training data [91] and are sensitive to training techniques and training samples [92] [91]. NNs over-parametrise the speech enhancement problem and, moreover, NNs assume that training and testing samples are independent and identically distributed (iid) in most cases.

The preceding paragraphs are not just a discussion of machine-learning versus model-based techniques, which is a well-rehearsed discussion [12]. The observation that NNs over-parametrise the problem while modulation-domain Kalman filtering algorithms use few parameters for each frequency bin to parametrise the speech enhancement problem is important. The observation that unseen noise types, unseen SNRs, unseen reverberation times and other unseen conditions affect the performance of NNs is also significant. Furthermore, another important observation is that the training of NNs involves non-convex optimization and can converge to local minima [12], so the use of good priors is critical. Good priors can be considered as regularization, like dropout, to avoid overfitting. The training procedure has to reach a good local minimum that will lead to network parameters that make the NN generalize well to unseen test data [92]. During inference, NNs are very fast and they also require little computation [88].

Ideal ratio masks and complex ideal ratio masks usually utilise an NN to estimate the real and the imaginary parts of the complex STFT of speech, as discussed in [93]. Ideal ratio masks compute a real-valued time-frequency gain; complex ideal ratio masks find a complex-valued time-frequency gain. Binary masking is different from ratio masking because it is based on classification and on hard labels (not soft labels).

Another contemporary direction in speech enhancement refers to the use of end-to-end systems. End-to-end systems operate in the time domain and depend on NN training, both on the training data and the training procedure [12] [18].

Regarding dereverberation [94], a few KF-based dereverberation algorithms exist in the literature. Dereverberation aims to remove echo and reverberation effects from speech signals for improved speech quality and intelligibility. Reverberation causes smearing across time and frequency; reverberation tends to spread speech energy over time. This time-energy spreading has two distinct effects: (i) the energy in individual phonemes becomes more spread out in time and, consequently, plosives have a delayed onset and decay and fricatives are smoothed, and (ii) preceding phonemes blur into the current phonemes. According to the literature [9] [94], the effect of (ii) is most apparent when a vowel precedes a consonant. Both (i) and (ii) reduce speech quality and speech intelligibility.

Speech captured with a distant microphone inevitably contains both reverberation and noise. In the time domain, the reverberant noisy speech signal, y(t), can be expressed as y(t) = h(t) ∗ s(t) + v(t), where h(t) is the RIR between the talker and the microphone, s(t) is the clean speech signal, v(t) is the noise signal and ∗ is the convolution operator. Most dereverberation algorithms are concerned mainly with the effects of the late reflections. The temporal masking properties of the human ear cause the early reflections to reinforce the direct sound [94], and this is why early reverberation and early reflections enhance the quality of degraded speech signals.
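This signal model can be simulated directly. The sketch below synthesises a toy RIR as exponentially decaying white noise (the same kind of parametric RIR model mentioned later in connection with SPENDRED) and forms y(t) = h(t) ∗ s(t) + v(t); all parameter values and signals are illustrative stand-ins, not data from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                                  # sample rate (Hz), illustrative
t60 = 0.5                                   # reverberation time T60 (s), illustrative

# Toy RIR: white noise with an exponential envelope whose decay is set by T60
# (the level drops by 60 dB after t60 seconds).
t = np.arange(int(t60 * fs)) / fs
h = rng.standard_normal(len(t)) * 10.0 ** (-3.0 * t / t60)

s = rng.standard_normal(fs)                 # stand-in for a 1 s clean speech signal
v = 0.01 * rng.standard_normal(fs)          # additive noise

y = np.convolve(h, s)[:len(s)] + v          # reverberant noisy observation
```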

The reverberation time, T60, and the Direct to Reverberant energy ratio (DRR) are the two main parameters of reverberation [95] [12]. The T60 quantifies the reverberation duration along time and is defined as the time interval required for a sound level to decay by 60 dB after ceasing its original stimulus. The DRR describes the reverberation effect in the space domain, providing insight on the relative positions of the sound source and of the receiver [9] [12]. According to the literature, the reverberation time, T60, is independent of the source to microphone configuration; in contrast to the RIR, the T60 measured in the diffuse sound field is independent of the source to microphone configuration. This is important for blindly estimating T60 from noisy reverberant speech [1].

The reverberation time, T60, is independent of the source to microphone configuration and depends on the room. The impact of reverberation on human auditory perception depends on the reverberation time. If T60 is small, the environment reinforces the sound, which may enhance the sound perception [95]. On the contrary, if T60 is large, a spoken syllable may persist for a long time and interfere with future spoken syllables.

According to [96], dereverberation algorithms that operate in the power spectral domain are robust and relatively insensitive to speaker movements and minor variations in the spatial placement of sources. In this context, algorithms that leave the phase unaltered and operate in the amplitude, power or log-power spectral domain are insensitive to speaker movements and to minor variations in the spatial placement of sources.

Enhancement algorithms that perform reverberation suppression, as opposed to reverberation cancellation, do not require an estimate of the RIR. In this report, we focus on enhancement algorithms that perform reverberation suppression. In addition, we also focus on algorithms that assume that the early and late reverberant speech components are independent and aim to suppress the late reverberant speech component.

Dereverberation can be performed using spectral subtraction to remove reverberant speech energy by cancelling the energy of preceding speech phonemes in the current time-frame.

In [97], spectral enhancement methods based on a time-frequency gain, originally developed for the purpose of noise suppression, have been modified and used for dereverberation. Such algorithms suppress late reverberation assuming that the early and late reverberation components are independent. The novelty of the algorithms in [97] is that denoising algorithms can be adjusted to operate in noisy and reverberant conditions. Spectral enhancement dereverberation methods can be easily implemented in the STFT domain and have low computational complexity. The spectral enhancement dereverberation methods in [97] estimate the late reverberant spectral variance (LRSV) and use it in place of the noise spectral variance; these algorithms reduce the problem of late reverberation suppression to the problem of estimating the LRSV blindly from reverberant speech observations [98].

The idea that late reverberation can be treated as an additive disturbance originates from [98]. In [97], this idea of treating late reverberation as an additive disturbance is expanded and utilised in various spectral enhancement dereverberation algorithms. The late reverberation suppression algorithm in [98] statistically models the RIR in the time domain, estimates the LRSV and uses spectral subtraction to enhance speech.
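A hedged sketch of this idea, treating late reverberation as an additive disturbance: the late reverberant spectral variance is approximated by an exponentially attenuated, delayed version of the reverberant power spectrum (following the general form of the statistical RIR model of [98]; the exact estimator and constants differ in the cited work), and is then used like a noise estimate in a spectral-subtraction-style gain. The function names, the delay of 6 frames and the gain floor are assumptions for this illustration.

```python
import numpy as np

def late_reverb_psd(rev_power, t60, hop_s, n_late=6):
    """Approximate late-reverberant spectral variance per frame: a delayed copy
    of the reverberant power spectrum, attenuated by the exponential energy
    decay implied by T60.  rev_power: (n_bins, n_frames); hop_s: frame hop (s)."""
    delta = 3.0 * np.log(10.0) / t60            # decay constant: -60 dB at t = T60
    atten = np.exp(-2.0 * delta * n_late * hop_s)
    lrsv = np.zeros_like(rev_power)
    lrsv[:, n_late:] = atten * rev_power[:, :-n_late]
    return lrsv

def suppression_gain(rev_power, lrsv, gain_floor=0.1):
    """Spectral-subtraction-style gain that treats the LRSV like a noise PSD."""
    gain = 1.0 - lrsv / np.maximum(rev_power, 1e-12)
    return np.maximum(gain, gain_floor)
```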

The seminal work of [98] is discussed in [95] where a dereverberation algorithm based on blind spectral weighting is developed to suppress late reverberation and reduce its overlap-masking effect. According to [95], the late reverberant speech component causes overlap-masking that smears the high energy phonemes, such as the vowels, over time, fills envelope gaps and increases the prominence of low-frequency energy in the speech spectrum. The spectral weighting algorithm in [95] mitigates the effect of overlap-masking using the uncorrelated assumption for late reverberation [98] [97].

Estimation of the LRSV is also referred to as reverberation noise estimation. Several spectral enhancement algorithms that employ different methods for reverberation noise estimation have been developed in the past. According to the literature and to [99], the LRSV estimator presented in [100] is a continuation and an extension of the LRSV estimator in [98]. The dereverberation algorithm in [100] statistically models the RIR in the STFT domain, and not in the time domain as in [98]. Late reverberation is estimated and suppressed in [100] by considering the reverberation time, T60, and the energy contribution of the direct path and reverberant parts of speech in the STFT domain. The DRR is externally estimated in [100]. Two common criticisms of spectral enhancement algorithms that are based on reverberation noise estimation are that they introduce musical noise and that they suppress speech onsets when they over-estimate the true reverberation noise.

According to the literature and to [93], ideal ratio masks and complex ideal ratio masks have been used by researchers for dereverberation. Complex ideal ratio masks take account of the speech phase since they estimate the real and imaginary parts of the complex STFT domain of clean speech. Complex ideal ratio masks estimate either the real and imaginary parts or the log real and log imaginary parts of the complex STFT domain of speech. In particular, complex ideal ratio masks utilise supervised learning and NNs to estimate either the real and imaginary parts or the log real and log imaginary parts of the complex STFT domain of clean speech. The NN-based data-driven speech enhancement algorithm in [93] uses complex ideal ratio masks for joint denoising and dereverberation.

In [101], the authors do not agree with the claim that complex ideal ratio masks can be used for dereverberation. In particular, the data-driven enhancement algorithm in [101] performs NN-based blind dereverberation using the Fourier transform of the STFT of the reverberant speech signal.

Supervised learning and NNs can be used for joint denoising and dereverberation that is not based on ideal ratio masks and complex ideal ratio masks. The NN that is used in the speech enhancement algorithm in [84] operates in the log-spectral domain, utilises context frames (i.e. neighboring frames, past and future frames at every time step) and estimates clean speech from noisy and reverberant speech in the log-spectral domain. In [102], two supervised dereverberation algorithms are examined: the one NN-based algorithm predicts speech in the amplitude spectral domain using direct mapping and the other NN-based algorithm predicts the ideal ratio mask. According to the results of [102], NNs used for ideal ratio masking [12] outperform NNs used for predicting the speech spectrum in terms of quality and intelligibility metrics.

We note that the NN-based data-driven speech enhancement algorithm in [84] estimates the clean speech phase using a post-processing technique. More specifically, the supervised algorithm in [84] uses an iterative procedure to reconstruct the time-domain signal that is based on [103], which was published in 1984. According to [84] and to [103], the enhancement algorithm “iteratively updates the phase at each step by replacing it with the phase of the STFT of its ISTFT”, while keeping the target magnitude from the NN fixed.

In [104], NMF is extended to include reverberation. More specifically, the two single-channel speech enhancement algorithms that are introduced in [104] model the room acoustics using a non-negative approximation of the convolutive transfer function and model speech in the amplitude spectral domain using NMF. The two speech enhancement algorithms in [104] enhance the quality of speech in noisy and reverberant conditions. A particular advantage of NMF-based algorithms is the use of iterative multiplicative update rules. Regarding NMF and dereverberation, the speech enhancement algorithm in [22] performs joint denoising and dereverberation using nonnegative matrix deconvolution and nonnegative speech dictionary models in the amplitude STFT spectral domain.

The modeling of the speech temporal dynamics can be beneficial in reverberant conditions [49], especially in severe reverberant conditions where the DRR is low and the T60 is long. The enhancement algorithm in [49] performs both noise reduction and dereverberation using state-space modeling and speech and noise tracking. Moreover, the SPENDRED algorithm [34] [35] also considers speech temporal dynamics.

The SPENDRED algorithm, which is presented in [34] [35], performs time-varying T60 and DRR estimation and it internally (and not externally) estimates T60 and DRR at every time step. However, unless the source or the microphone is moving around, the T60 and DRR will presumably be constant throughout the recording. In addition, SPENDRED also performs frequency-dependent T60 and DRR estimation; according to the ACE challenge [16], performing frequency-dependent T60 and DRR estimation is important. Furthermore, SPENDRED performs intra-frame speech correlation modeling; typical speech enhancement algorithms do not perform intra-frame frequency correlation modeling and decouple the different frequency dimensions, treating each frequency bin on its own. Decoupling different frequency dimensions makes the algorithms easier to implement since frequency bins can be processed in parallel [2]. On the contrary, modeling the intra-frame correlation of the clean speech signal is important in order to enhance the pitch and the harmonics of speech.

Reverberation is frequency dependent and the SPENDRED algorithm takes advantage of this observation: estimating frequency-dependent reverberation parameters is beneficial, and obtaining a T60 estimate for each individual frequency bin, or for every Mel-spaced frequency band as in [34] [35], is advantageous.

The SPENDRED dereverberation algorithm is a model-based technique; it uses the reverberation model that is described by equations (9.4) and (9.5) in section 9.2 of [81]. The SPENDRED algorithm does not use the coarser reverberation model described by equation (9.6) in [81], which approximates the square of the RIR with its envelope only. Like the joint denoising and dereverberation algorithm in [59], SPENDRED employs a parametric model of the RIR based on white noise with an exponentially decaying envelope, in which the decay time of the envelope is determined by the T60.
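
A minimal sketch of such a parametric RIR model, i.e. white Gaussian noise shaped by an exponentially decaying envelope in the spirit of Polack's statistical model, is given below; the sampling rate, T60 value and RIR duration are illustrative assumptions.

```python
import numpy as np

def synthetic_rir(t60=0.5, fs=16000, duration=None):
    """Generate a synthetic RIR as white Gaussian noise multiplied by an
    exponentially decaying envelope. The amplitude envelope decays as
    exp(-delta * t) with delta = 3 * ln(10) / T60, so that the energy of
    the RIR has dropped by 60 dB after T60 seconds."""
    duration = 1.5 * t60 if duration is None else duration
    n_samples = int(duration * fs)
    t = np.arange(n_samples) / fs
    delta = 3.0 * np.log(10.0) / t60
    envelope = np.exp(-delta * t)
    rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples) * envelope
```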

The enhancement algorithms described in [105], [59] and [55] are based on statistical observation models of noisy and reverberant speech in the logarithmic Mel-power spectral domain. Observation models of this kind are used in the KF update step of modulation-domain Kalman filtering algorithms. Equations (9) and (10) in [105] define the observation model that relates noisy and reverberant speech, speech, reverberation and noise in this domain: equation (9) relates noisy reverberant speech, reverberant speech and noise, and equation (10) relates reverberant speech, clean speech and reverberation. In this context, reverberation in the logarithmic Mel-power spectral domain refers to a representation of the RIR in that domain. The algorithm in [105] uses the instantaneous reverberant-to-noise ratio together with this observation model.
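
For orientation, a commonly used phase-sensitive observation model relating the log-power spectra of an observed signal y, a target signal x and an additive signal n within a single time-frequency band has the form

    y = x + log( 1 + e^(n - x) + 2 α e^((n - x)/2) ),

where α is the phase factor between the two signals in that band; setting α = 0 yields the common phase-insensitive approximation e^y = e^x + e^n. This generic relation is discussed, for example, in [54] and [106]; it is stated here only for reference and is not copied from equations (9) and (10) of [105].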

The model-based enhancement algorithms presented in [105], [59] and [55] are also discussed and explained in section 9.7.3 of [81]. Section 9.3 of [81] examines the equations that relate reverberant and noisy speech, reverberant speech, clean speech, reverberation and noise in different spectral time-frequency domains, and section 9.7.3 of [81] discusses several model-based dereverberation algorithms that use these relations in a specific spectral time-frequency domain.

As described and discussed in [106], the phase factor in Mel-spaced frequency bands has a different equation, different properties, a different distribution and different moments from the phase factor in STFT bins. In addition, as discussed in [105], [59] and [55], the phase factor between reverberant speech and noise is different from the phase factor between clean speech and noise. In [59] and [55], the phase factor between reverberant speech and noise in Mel-spaced frequency bands is examined and modeled.
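
As a reminder, in a single STFT bin the phase factor between two signals, for example speech X(t,f) and noise N(t,f), is usually defined as the cosine of their phase difference,

    α(t,f) = cos( θ_X(t,f) - θ_N(t,f) ) = Re{ X(t,f) N*(t,f) } / ( |X(t,f)| |N(t,f)| ),

whereas the corresponding quantity in a Mel-spaced frequency band is an amplitude-weighted combination of the bin-wise phase factors within that band, which is why its distribution and moments differ from the STFT-bin case [106]. The notation used here is generic and is not taken from [106].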

In noisy and reverberant conditions, finding the onsets of speech phonemes and determining which frames are unvoiced or silent is difficult because reverberation tends to spread speech energy over time. In addition, in noisy and reverberant conditions, noise estimation is difficult because unvoiced/silence frames are hard to identify and the noise estimate is affected by the reverberation present in the noisy reverberant signal. According to [49], it is not efficient for enhancement algorithms to perform a two-step procedure comprising a denoising stage followed by a dereverberation stage; the concatenation of separate techniques for noise reduction and dereverberation is inefficient because denoising and blind dereverberation are not performed jointly [49].

Despite the claim that a two-step procedure comprising a denoising stage followed by a dereverberation stage is not efficient, long-term linear prediction with pre-denoising can be used to suppress noise and late reverberation. According to the literature [107] [81], with long-term linear prediction, the effect of reverberation may be represented as a one-dimensional convolution in each frequency bin. The convolutive nature of reverberation induces a long-term correlation between the current observation and past observations of reverberant speech [108], and this long-term correlation can be exploited to suppress reverberation. According to [109] [108], long-term linear prediction using the weighted prediction error (WPE) algorithm can be used for late reverberation reduction and is robust to noise. In [71], long-term linear prediction is discussed along with inter-frame speech correlation modeling. The algorithm in [109] applies the WPE algorithm and long-term linear prediction in the complex STFT domain. According to [108], the WPE algorithm can also be used in the power spectral domain: the algorithm in [108] examines subtracting the power spectra of the reverberation estimates from the observed power spectra, leaving the phase unchanged, instead of subtracting the reverberation estimates in the STFT domain.

The speech enhancement algorithm in [110] performs a two-step procedure comprising a denoising stage followed by a dereverberation stage. In [110], NN-based pre-denoising is used and the dereverberation step is performed with the WPE algorithm. Figure 1.b and Section 4 in [110] describe the NN, operating in the log-spectral domain, that is used for pre-cleaning the noisy and reverberant speech signal. The pre-cleaned power spectrum is then used by the WPE algorithm; a particular feature of the algorithm in [110] is that the WPE method does not need more than one iteration.

The WPE linear filtering approach removes reverberation in the complex STFT domain taking consecutive reverberant observations into account [81]. An adaptive and multi-channel variant of the WPE algorithm has recently been used as a front-end dereverberation method in “Google Home” [11].

According to the literature and to [108] and [107], for dereverberation, linear filtering can either exploit both the spectral amplitudes and phases of the signal or exploit the spectral amplitudes and leave the spectral phase unaltered. The speech spectral phase is severely affected by reverberation because reverberation is a superposition of numerous time-shifted and attenuated versions of the clean speech signal. It is worth noting that reverberation is strongly correlated with clean speech both in the short term and in the long term.

According to [109], the reverberant component can be estimated in the complex STFT domain by performing a few iterations over the observations of the entire speech utterance; WPE performs batch processing and operates on all the frames of the utterance at once.

The WPE method is an iterative algorithm that alternately estimates the reverberation prediction coefficients and the speech spectral variance using batch processing of speech utterances. The WPE method needs the entire speech utterance for processing; therefore, one of its drawbacks is that it requires at least a few seconds of the observed speech utterance in order to ensure the convergence of the reverberation prediction coefficients [107]. In addition, it is worth noting that the RIR is assumed to remain constant over the utterance [107].
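
The sketch below illustrates batch WPE for a single frequency bin of an STFT: it alternates between estimating the speech spectral variance and solving for the reverberation prediction coefficients, and then subtracts the predicted late reverberation. The prediction delay, filter order, iteration count and regularisation constant are illustrative assumptions, and multi-channel extensions and practical safeguards are omitted.

```python
import numpy as np

def wpe_single_bin(y, delay=3, order=10, n_iter=3, eps=1e-8):
    """Batch weighted prediction error (WPE) dereverberation for one frequency
    bin. y: complex STFT coefficients of one bin over all frames (1-D array).
    Alternately estimates the speech spectral variance and the reverberation
    prediction coefficients, then subtracts the predicted late reverberation."""
    T = len(y)
    # Matrix of delayed observations: row t holds y[t-delay], ..., y[t-delay-order+1]
    Y = np.zeros((T, order), dtype=complex)
    for k in range(order):
        shift = delay + k
        Y[shift:, k] = y[:T - shift]
    x = y.copy()                                   # initial clean-speech estimate
    for _ in range(n_iter):
        lam = np.maximum(np.abs(x) ** 2, eps)      # speech spectral variance estimate
        Yw = Y / lam[:, None]                      # variance-weighted observations
        R = Yw.conj().T @ Y                        # weighted correlation matrix
        r = Yw.conj().T @ y                        # weighted cross-correlation vector
        g = np.linalg.solve(R + eps * np.eye(order), r)  # prediction coefficients
        x = y - Y @ g                              # subtract predicted late reverberation
    return x
```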

According to [99], using WPE to estimate the reverberant component of speech leads to a processing delay. The WPE method is a batch processing technique and it requires processing of the entire speech utterance in order to provide an accurate estimate of the reverberant component of speech [99]. Batch processing is not suitable when dealing with time-varying acoustic environments with varying RIRs. In [99], WPE is applied to non-overlapping blocks of fixed duration. Equations (15)-(19) in [99] describe the block-wise WPE method that can be used in real-world environments.

In summary, in this literature review, several different enhancement algorithms for noise suppression and dereverberation were presented, explained and discussed. One of the main points is that different enhancement algorithms operate in different spectral time-frequency domains and follow different methodologies and frameworks. Speech is a non-white signal and its correlation structure should not be destroyed; the speech enhancement algorithm needs to be able to distinguish between the correlation introduced by the RIR and the correlation of the speech signal itself [81]. A final remark is that real-world speech recordings are inevitably distorted by both noise and frequency-dependent reverberation [111] [12].

IV Conclusion

This report focuses on speech enhancement considering both noise and convolutive distortions [7] [12]. Additive noise and room reverberation are two different types of distortion and the effects of both need to be suppressed and ideally eliminated [12]. The effects of additive noise are limited to a single frame of short-time signal analysis while the effects of room reverberation span a number of consecutive time frames. Non-linear adaptive modulation-domain Kalman filtering algorithms can be used for speech enhancement, i.e. noise suppression and dereverberation, as in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be applied for both noise and late reverberation suppression; in [2], [1], [3] and [4], various model-based speech enhancement algorithms that perform modulation-domain Kalman filtering are designed, implemented and tested. The model-based speech enhancement algorithm presented in [2] tracks and estimates the clean speech phase and the STFT-based algorithm described in [5] uses the active speech level estimator presented in [6].

References

  • [1] N. Dionelis and M. Brookes, “Modulation-domain Kalman filtering for monaural blind speech denoising and dereverberation,” Submitted to IEEE Trans. on Audio, Speech and Language Process., 2018, [Online]. Available: https://arxiv.org/pdf/1807.10236.pdf.
  • [2] N. Dionelis and M. Brookes, “Phase-aware single-channel speech enhancement with modulation-domain Kalman filtering,” IEEE Trans. on Audio, Speech and Language Process., vol. 26, no. 5, pp. 937-950, May 2018.
  • [3] N. Dionelis and M. Brookes, “Modulation-domain speech enhancement using a Kalman filter with a Bayesian update of speech and noise in the log-spectral domain,” in Proc. IEEE Int. Work. Hands-free Speech Communication and Microphone Arrays, San Francisco, March 2017.
  • [4] N. Dionelis and M. Brookes, “Speech enhancement using Kalman filtering in the logarithmic Bark power spectral domain,” in Proc. European Signal Process. Conf., Rome, Sept. 2018.
  • [5] N. Dionelis and M. Brookes, “Speech enhancement using modulation-domain Kalman filtering with active speech level normalized log-spectrum global priors,” in Proc. European Signal Process. Conf., Kos, Aug. 2017.
  • [6] N. Dionelis and M. Brookes, “Active speech level estimation in noisy signals with quadrature noise suppression,” in Proc. European Signal Process. Conf., Budapest, Aug. 2016.
  • [7] N. Dionelis, “Adaptive power spectrum estimation of non-stationary acoustic noise,” Master’s thesis, Imperial College London, London, U.K., 2015, [Online] Available: https://drive.google.com/open?id=1LhvpX7Pk8G7XN2dbH8TpCcZbTOVDT2rJ.
  • [8] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets et al., “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, Oct. 2013.
  • [9] K. Kinoshita, M. Delcroix, S. Gannot, E. Habets et al., “A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP J. Adv. Signal Process., vol. 7, pp. 1-19, 2016.
  • [10] Z. Zhang, J. Geiger, J. Pohjalainen, A. El-Desoky Mousa, W. Jin and B. W. Schuller, “Deep learning for environmentally robust speech recognition: An overview of recent developments,” ACM Transactions on Intelligent Systems and Technology, vol. 9, no. 5, Article 49, April 2018, [Online] Available: https://doi.org/10.1145/3178115.
  • [11] B. Li, T. Sainath, A. Narayanan et al., “Acoustic modeling for Google Home,” in Proc. Conf. Int. Speech Communication Association, Stockholm, Aug. 2017.
  • [12] S. Watanabe, M. Delcroix, F. Metze and J. R. Hershey, New Era for Robust Speech Recognition: Exploiting Deep Learning, Ch. 6: Novel Deep Architectures in Speech Processing, Ch. 7: Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio, Ch. 9: Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition, Ch. 14: The CHiME Challenges: Robust Speech Recognition in Everyday Environments, Ch. 15: The REVERB Challenge: A Benchmark Task for Reverberation-Robust ASR Techniques, Ch. 17: Toolkits for Robust Speech Processing.   Springer, ISBN: 978-3-319-64679-4, 2017.
  • [13] H. Erdogan, J. Hershey, S. Watanabe, M. Mandel and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), Sept. 2016.
  • [14] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), March 2016.
  • [15] T. N. Sainath, R. J. Weiss, K. W. Wilson, et al., “Multichannel signal processing with deep neural networks for automatic speech recognition,” IEEE Trans. on Audio, Speech and Language Process., vol. 25, no. 5, pp. 965-979, May 2017.
  • [16] J. Eaton, N. D. Gaubitch, A. H. Moore and P. A. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE Trans. on Audio, Speech and Language Process., vol. 24, no. 10, pp. 1681-1693, Oct. 2016.
  • [17] D. Liang, M. D. Hoffman and G. J. Mysore, “Speech dereverberation using a learned speech model,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., April 2015.
  • [18] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff and J. R. Hershey, “Phasebook and friends: Leveraging discrete representations for source separation,” arXiv:1810.01395v1 [cs.SD], Oct. 2018, [Online] Available: https://arxiv.org/pdf/1810.01395.pdf.
  • [19] Y. Wang and M. Brookes, “Speech enhancement using an MMSE spectral amplitude estimator based on a modulation domain Kalman filter with a Gamma prior,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., March 2016.
  • [20] P. C. Loizou, Speech Enhancement: Theory and Practice, Part II: Algorithms, Part III: Evaluation.   Taylor & Francis, 2013.
  • [21] B. Fodor and T. Gerkmann, “A posteriori speech presence probability estimation based on averaged observations and a super-Gaussian speech model,” in Proc. Int. Workshop on Acoustic Signal Enhancement, Antibes - Juan les Pins, Sept. 2014.
  • [22] D. Baby and H. Van hamme, “Joint denoising and dereverberation using exemplar-based sparse representations and decaying norm constraint,” IEEE Trans. on Audio, Speech and Language Process., vol. 25, no. 10, pp. 2024-2035, Oct. 2017.
  • [23] H. Kameoka, T. Nakatani and T. Yoshioka, “Robust speech dereverberation based on non-negativity and sparse nature of speech spectrograms,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Taipei, Taiwan, pp. 45-48, 2009.
  • [24] K. Paliwal and A. Basu, “A speech enhancement method based on Kalman filtering,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., vol. 12, 1987, pp. 177–180.
  • [25] S. So and K. K. Paliwal, “Suppressing the influence of additive noise on the Kalman gain for low residual noise speech enhancement,” Speech Communication, vol. 53, pp. 355–378, 2010.
  • [26] S. So and K. K. Paliwal, “Modulation-domain Kalman filtering for single-channel speech enhancement,” Speech Communication, vol. 53, no. 6, pp. 818-829, July 2011.
  • [27] S. So, K. K. Wójcicki, and K. K. Paliwal, “Single-channel speech enhancement using Kalman filtering in the modulation domain,” in Proc. Conf. Int. Speech Communication Association, Makuhari, Sept. 2010.
  • [28] J. Benesty and J. Chen, A Conceptual Framework for Noise Reduction, Springer, DOI: 10.1007/978-3-319-12955-6, 2015.
  • [29] S. Alaya, N. Zoghlami and Z. Lachiri, “Speech enhancement based on perceptual filter bank improvement,” Int. Journal of Speech Technology, no. 17, pp. 253-258, DOI: 10.1007/s10772-014-9226-8, Febr. 2014.
  • [30] P. Mowlaee, R. Saeidi and Y. Stylianou, “Advances in phase-aware signal processing in speech communication,” Speech Communication, vol. 81, pp. 1-29, July 2016.
  • [31] J. Kulmer and P. Mowlaee, “Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR,” in Proc. IEEE Int. Conf. Audio and Speech Signal Process., Brisbane, April 2015.
  • [32] T. Esch and P. Vary, “Speech enhancement using a modified Kalman filter based on complex linear prediction and supergaussian priors,” in Proc. IEEE Int. Conf. Audio and Speech Signal Process., pp. 4877-4880, Las Vegas, April 2008.
  • [33] T. Esch, “Model-based speech enhancement exploiting temporal and spectral dependencies, Ch. 3: Speech enhancement incorporating temporal correlation,” Ph.D. dissertation, Aachen University, 2012.
  • [34] C. S. J. Doire, M. Brookes, P. A. Naylor et al, “Single-channel online enhancement of speech corrupted by reverberation and noise,” IEEE Trans. on Audio, Speech and Language Process., vol. 25, no. 3, pp. 572-587, March 2017.
  • [35] C. S. J. Doire, “Single-channel enhancement of speech corrupted by reverberation and noise, Ch. 4: Single-channel enhancement of speech,” Ph.D. dissertation, Imperial College London, 2016.
  • [36] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
  • [37] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Trans. on Acoustics, Speech and Signal Process., vol. 33, no. 2, pp. 443–445, April 1985.
  • [38] Y. Wang, “Speech enhancement in the modulation domain, Ch. 5: Model-based speech enhancement in the modulation domain,” Ph.D. dissertation, Imperial College London, 2015.
  • [39] P. C. Loizou, Speech Enhancement: Theory and Practice.   Taylor & Francis, Second Edition, ISBN: 978-1-4665-0421-9, pp. 209-234, 2013.
  • [40] P. C. Loizou, “Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 857-869, 2005.
  • [41] O. Cappe, “Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor,” IEEE Trans Speech and Audio Processing, vol. 2, no. 2, pp. 345–349, Apr. 1994.
  • [42] P. J. Wolfe and S. J. Godsill, “Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement,” EURASIP Journal on Advances in Signal Process., vol. 10, 2003.
  • [43] C. Boubakir and D. Berkani, “Speech enhancement using minimum mean-square error amplitude estimators under normal and generalized gamma distribution,” Journal of Computer Science, vol. 6, no. 7, pp. 700-705, 2010.
  • [44] K. Paliwal, K. Wojcicki, and B. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465-494, 2011.
  • [45] T. Gerkmann, M. Krawczyk-Becker and J. Le Roux, “Phase processing for single-channel speech enhancement, History and recent advances,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 55-66, DOI: 10.1109/MSP.2014.2369251, March 2015.
  • [46] H. Hermansky, E. A. Wan and C. Avendano, “Speech enhancement based on temporal processing,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 405-408, 1995.
  • [47] B. Raj, R. Singh and R. Stern, “On tracking noise with linear dynamical system models,” in Proc. Conf. Int. Speech Communication Association, pp. 965-968, Lisbon, Oct. 2004.
  • [48] T. Esch and P. Vary, “Exploiting temporal correlation of speech and noise magnitudes using a modified Kalman filter for speech enhancement,” in Proc. Conf. Voice Communication (SprachKommunikation), pp. 1-4, Aachen, Oct. 2008.
  • [49] M. Wölfel, “Enhanced speech features by single-channel joint compensation of noise and reverberation,” IEEE Trans. on Audio, Speech and Language Process., vol. 17, no. 2, pp. 312-323, Feb. 2009.
  • [50] J. Li, L. Deng, D. Yu, Y. Gong and A. Acero, “High-performance HMM adaption with joint compensation of additive and convolutive distortions via vector Taylor series,” in Proc. IEEE Work. Automatic Speech Recognition and Understanding, pp. 65-70, Kyoto, Dec. 2007.
  • [51] J. Li, L. Deng, D. Yu, Y. Gong and A. Acero, “A unified framework of HMM adaption with joint compensation of additive and convolutive distortions,” Computer Speech and Language, vol. 23, no. 3, pp. 389-405, July 2009.
  • [52] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust automatic speech recognition, Ch. 6.2: Vector Taylor series, ISBN: 978-0-12-802398-3.   Elsevier, 2016.
  • [53] Y. Gong, “A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition,” IEEE Trans. on Speech Audio Process., vol. 13, no. 5, pp. 975-983, 2005.
  • [54] L. Deng, J. Droppo, and A. Acero, “Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise,” IEEE Trans. on Speech and Audio Process., vol. 12, no. 2, pp. 133-143, April 2004.
  • [55] V. Leutnant, “Bayesian estimation employing a phase-sensitive observation model for noise and reverberation robust automatic speech recognition, Ch. 4: Bayesian estimation of the speech feature posterior,” Ph.D. dissertation, Paderborn University, 2015.
  • [56] J. Li and L. Deng and R. Haeb-Umbach and Y. Gong, Robust automatic speech recognition, Ch. 3.2: Modelling distortions of speech in acoustic environments, and Ch. 3.3: Impact of acoustic distortion on Gaussian modelling, ISBN: 978-0-12-802398-3.   Elsevier, 2016.
  • [57] J. Le Roux, E. Vincent and H. Erdogan, “Learning-based approaches to speech enhancement and separation, Section 2: The pre-deep-learning era, Slides: 1-13,” 2016, [Online] Available: https://www.merl.com/publications/docs/TR2016-113.pdf.
  • [58] P. Smaragdis and S. Venkataramani, “A neural network alternative to non-negative audio models,” arXiv preprint arXiv:1609.03296 [cs.SD], 2016.
  • [59] V. Leutnant, A. Krueger and R. Haeb-Umbach, “A new observation model in the logarithmic Mel power spectral domain for the automatic recognition of noisy reverberant speech,” IEEE Trans. on Audio, Speech and Language Process., vol. 22, no. 1, pp. 95-109, Jan. 2014.
  • [60] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach.   Springer, ISBN: 978-1-4471-5779-3, 2015.
  • [61] Y. Gong, “A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition,” IEEE Trans. on Speech and Audio Process., vol. 13, no. 5, pp. 975-983, 2005.
  • [62] N. Boulanger-Lewandowski, G. J. Mysore and M. Hoffman, “Exploiting long-term temporal dependencies in NMF using recurrent neural networks with application to source separation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., Florence, May 2014.
  • [63] I. Andrianakis and P. R. White, “On the application of Markov Random Fields to speech enhancement,” in Proc. Int. Conf. Signal Process., pp. 198-201, Cirencester, Dec. 2006.
  • [64] N. Mohammadiha, P. Smaragdis and A. Leijon, “Prediction based filtering and smoothing to exploit temporal dependencies in NMF,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process., Vancouver, May 2013.
  • [65] K. Paliwal, K. Wójcicki, and B. Shannon, “The importance of phase in speech enhancement,” Speech Commun. 53, 4: 465-494, 2011.
  • [66] G. Welch and G. Bishop, “An introduction to the Kalman filter,” TR 95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2006.
  • [67] J. Hansen and M. Clements, “Iterative speech enhancement with spectral constraints,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., vol. 12, Apr. 1987, pp. 189–192.
  • [68] J. H. L. Hansen and M. A. Clements, “Constrained iterative speech enhancement with application to speech recognition,” IEEE Trans Signal Processing, vol. 39, no. 4, pp. 795–805, Apr. 1991.
  • [69] S. Gonzalez and M. Brookes, “A pitch estimation algorithm robust to high levels of noise,” IEEE Trans. on Audio, Speech and Language Process., vol. 22, no. 2, pp. 518-530, Febr. 2014.
  • [70] Y. Bengio, P. Simard and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
  • [71] M. Parchami, W.-P. Zhu and B. Champagne, “Speech dereverberation using weighted prediction error with correlated inter-frame speech components,” Speech Communication, vol. 87, pp. 49-57, 2017.
  • [72] T. Esch and P. Vary, “Model-based speech enhancement using SNR dependent MMSE estimation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., pp. 4652-4655, Prague, May 2011.
  • [73] T. Ogunfunmi, R. Togneri and M. S. Narasimha, Speech and Audio Processing for Coding, Enhancement and Recognition, Chapter 9: Maximum A Posteriori Spectral Estimation with Source Log-Spectral Priors for Multichannel Speech Enhancement.   Springer, 2015.
  • [74] S. Roweis and Z. Ghahramani, “A unifying review of linear gaussian models,” Neural Computation, vol. 11, no. 2, pp. 305-345, [Online] Available: http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf, Febr. 1999.
  • [75] Y. Wang and M. Brookes, “Speech enhancement using an MMSE spectral amplitude estimator based on a modulation domain Kalman filter with a Gamma prior,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., Shanghai, 2016.
  • [76] Y. Wang and M. Brookes, “Model-based speech enhancement in the modulation domain,” IEEE Trans. on Audio, Speech and Language Process., vol. 26, no. 3, pp. 580-594, March 2018.
  • [77] C. Dubois and M. Davy, “Joint detection and tracking of time-varying harmonic components: A flexible bayesian approach,” IEEE Trans. on Audio, Speech and Language Process., vol. 15, no. 4, pp. 1283-1295, 2007.
  • [78] R. F. Astudillo, “Integration of short-time Fourier domain speech enhancement and observation uncertainty techniques for robust automatic speech recognition, Ch. 5: Fourier domain uncertainty models,” Ph.D. dissertation, Technical University of Berlin, 2010, [Online] Available: https://depositonce.tu-berlin.de/bitstream/11303/2780/1/Dokument_43.pdf.
  • [79] K. Nathwani, J. A. Morales-Cordovilla et al., “An extended experimental investigation of DNN uncertainty propagation for noise robust ASR,” in Proc Wkshp on Hands-free Speech Communication and Microphone Arrays, San Francisco, Mar. 2017.
  • [80] B. Ravi and T. K. Kumar, “Speech enhancement using kernel and normalized kernel affine projection algorithm,” Signal and Image Processing: An International Journal (SIPIJ), vol. 4, no. 4, 2013.
  • [81] J. Li and L. Deng and R. Haeb-Umbach and Y. Gong, Robust automatic speech recognition: A bridge to practical applications, Ch. 3: Background of robust speech recognition, ISBN: 978-0-12-802398-3.   Elsevier, 2016.
  • [82] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing.   Prentice Hall, 1975.
  • [83] Y. Zhao, D. Wang, B. Xu and T. Zhang, “Late reverberation suppression using recurrent neural networks with long short-term memory,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5434-5438, April 2018.
  • [84] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks and T. Zhang, “Learning spectral mapping for speech dereverberation and denoising,” IEEE Trans. on Audio, Speech and Language Process., vol. 23, no. 6, pp. 982-992, June 2015.
  • [85] B. Wu, K. Li, M. Yang and C.-H. Lee, “A reverberation-time-aware approach to speech dereverberation based on deep neural networks,” IEEE Trans. on Audio, Speech and Language Process., vol. 25, no. 1, pp. 102-111, Jan. 2017.
  • [86] D. S. Williamson, “Deep learning methods for improving the perceptual quality of noisy and reverberant speech, Ch. 6: Time-frequency masking in the complex domain for speech dereverberation and denoising,” Ph.D. dissertation, The Ohio State University, 2016. [Online]. Available: https://etd.ohiolink.edu/pg_10?0::NO:10:P10_ACCESSION_NUM:osu1461018277
  • [87] A. A. Nugraha, “Deep neural networks for source separation and noise-robust speech recognition, Ch. 4: On improving DNN spectral models,” Ph.D. dissertation, University of Lorraine, 2017. [Online]. Available: http://theses.eurasip.org/theses/757/deep-neural-networks-for-source-separation-and/
  • [88] P. Smaragdis, “Striving for computational and physical efficiency in speech enhancement,” Keynote, Int. Work. Hands-free Speech Communication and Microphone Arrays, San Francisco, March 2017, [Online] Available: http://hscma2017.org/KeynoteSpeakers.asp.
  • [89] R. Rehr and T. Gerkmann, “On the importance of super-Gaussian speech priors for machine-learning based speech enhancement,” IEEE Trans. on Audio, Speech and Language Process., vol. 26, no. 2, pp. 357-366, Febr. 2018.
  • [90] A. Kumar and D. Florencio, “Speech enhancement in multiple-noise conditions using deep neural networks,” arXiv:1605.02427 [cs], May 2016.
  • [91] I. Goodfellow, Y. Bengio and A. Courville, “Deep learning,” 2016, book in preparation for MIT Press. [Online]. Available: http://www.deeplearningbook.org
  • [92] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer Science and Business Media, LLC, 2006.
  • [93] D. S. Williamson and DeLiang Wang, “Time-frequency masking in the complex domain for speech dereverberation and denoising,” IEEE Trans. on Audio, Speech and Language Process., vol. 25, no. 7, pp. 1492-1501, July 2017.
  • [94] P. Naylor and N. D. Gaubitch, Speech Dereverberation.   Springer, ISBN: 978-1-84996-056-4, 2010.
  • [95] S. O. Sadjadi and J. H. L. Hansen, “Blind spectral weighting for robust speaker identification under reverberation mismatch,” IEEE Trans. on Audio, Speech and Language Process., vol. 22, no. 5, pp. 937-945, May 2014.
  • [96] A. Maezawa, K. Itoyama, K. Yoshii and H. G. Okuno, “Nonparametric Bayesian dereverberation of power spectrograms based on infinite-order autoregressive processes,” IEEE Trans. on Audio, Speech and Language Process., vol. 22, no. 12, pp. 1918-1930, Dec. 2014.
  • [97] E. A. P. Habets, “Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement, Ch. 6: Late Reverberant Spectral Variance Estimation,” Ph.D. dissertation, Technische Universiteit Eindhoven, 2007.
  • [98] K. Lebart, J. M. Boucher, and P. N. Denbigh, “A New Method Based on Spectral Subtraction for Speech Dereverberation,” Acta Acoustica, vol. 87, pp. 359–366, 2001.
  • [99] M. Parchami, W.-P. Zhu and B. Champagne, “Model-based estimation of late reverberant spectral variance using modified weighted prediction error method,” Speech Communication, vol. 92, pp. 100-113, 2017.
  • [100] E. A. P. Habets, S. Gannot and I. Cohen, “Late reverberant spectral variance estimation based on a statistical model,” IEEE Signal Processing Letters, vol. 16, no. 9, pp. 770-773, Sept. 2009.
  • [101] T.-H. Chen, C. Huang and T.-S. Chi, “Dereverberation based on bin-wise temporal variations of complex spectrogram,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, New Orleans, March 2017.
  • [102] F. Xiong, B. T. Meyer, B. Cauchi, A. Jukic, S. Doclo and S. Goetze, “Performance comparison of real-time single-channel speech dereverberation algorithms,” in Proc. IEEE Int. Work. Hands-free Speech Communication and Microphone Arrays, San Francisco, March 2017.
  • [103] D. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. on Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
  • [104] N. Mohammadiha and S. Doclo, “Speech dereverberation using non-negative convolutive transfer function and spectro-temporal modeling,” IEEE Trans. on Audio, Speech and Language Process., vol. 24, no. 2, pp. 276-289, Febr. 2016.
  • [105] V. Leutnant, A. Krueger and R. Haeb-Umbach, “Investigations into a statistical observation model for logarithmic Mel power spectral density features of noisy reverberant speech,” in Proc. Speech Communication 10. ITG Symposium, Braunschweig, Germany, Sept. 2012.
  • [106] V. Leutnant and R. Haeb-Umbach, “An analytic derivation of a phase-sensitive observation model for noise-robust speech recognition,” in Proc. Conf. Int. Speech Communication Association, pp. 2395-2398, Brighton, Sept. 2009.
  • [107] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, et al., “Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 114-126, 2012.
  • [108] M. Delcroix, T. Yoshioka, A. Ogawa et al., “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal on Advances in Signal Processing, vol. 60, July 2015.
  • [109] M. Delcroix, T. Yoshioka, A. Ogawa et al., “Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge,” REVERB’14, 2014.
  • [110] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, T. Nakatani, “Neural network-based spectrum estimation for online WPE dereverberation,” in Proc. Conf. Int. Speech Communication Association, Stockholm, Aug. 2017.
  • [111] W. Kellermann, S. Makino, P. A. Naylor and M. Omologo, “The AcouSP recommendation for annotation of acoustic data collections,” 2010, [Online] Available: www.commsp.ee.ic.ac.uk/~acousp.