Technology is evolving rapidly and the demand for speech enhancement systems is evident. Speech enhancement in noisy reverberant environments, for human listeners, is challenging. Speech is degraded by noise and reverberation when captured using a near-field or far-field microphone. A room impulse response (RIR) can include components at long delays, hence resulting in reverberation and echoes. Reverberation is a convolutive distortion that can be quite long with a reverberation time, T60, of more than s. Due to convolution, reverberation induces long-term correlation between consecutive observations. Reverberation and noise, which can be stationary or non-stationary, have a detrimental impact on speech quality and intelligibility. Reverberation, especially in the presence of non-stationary noise, damages the intelligibility of speech.
The direct to reverberant energy ratio (DRR) and the reverberation time, T60, are the two main parameters of a reverberation model. The DRR describes reverberation in the space domain, depending on the positions of the sound source and the receiver. The T60 is the time interval required for a sound level to decay by 60 dB after its original stimulus ceases. The reverberation time, when measured in the diffuse sound field, is independent of the source to microphone configuration and mainly depends on the room. The impact of reverberation on auditory perception depends on the T60. If the T60 is short, the environment reinforces the sound, which may enhance the sound perception. On the contrary, if the T60 is long, spoken syllables interfere with subsequent spoken syllables. Reverberation spreads energy over time and this smearing across time has two effects: (a) the energy of individual phonemes spreads out in time and, hence, plosives have a delayed decay and fricatives are smoothed, and (b) preceding phonemes blur into the current phonemes.
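The definition of the T60 can be turned into a per-frame decay, which is the quantity a frame-based enhancer works with. The following is a minimal sketch, assuming an ideal exponential free decay; the function names and the frame increment argument `hop_s` are illustrative and do not come from the paper.

```python
def decay_db_per_frame(t60_s: float, hop_s: float) -> float:
    """Level drop in dB over one frame increment.

    Assumes an ideal exponential decay of 60 dB in t60_s seconds.
    """
    return 60.0 * hop_s / t60_s


def power_decay_factor(t60_s: float, hop_s: float) -> float:
    """Multiplicative decay of the reverberant power over one frame."""
    return 10.0 ** (-decay_db_per_frame(t60_s, hop_s) / 10.0)
```

Under this assumption, a longer T60 gives a power decay factor closer to one, so the energy of a phoneme persists over more frames.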
The aim of speech enhancement is to reduce and ideally eliminate the effects of both noise and reverberation without distorting the speech signal . Enhancement algorithms typically aim to suppress noise and late reverberation because early reverberation is not perceived as separate sound sources and usually improves the quality and intelligibility of speech. Noise is assumed to be uncorrelated with speech, early reverberation is correlated with speech and late reverberation is commonly assumed to be uncorrelated with speech  .
Speech enhancement can be performed in different domains. The ideal domain should be chosen such that (a) good statistical models of speech and noise exist in this domain, and (b) speech and noise are separable in this domain. Speech and noise are additive in the time domain and the Short Time Fourier Transform (STFT) domain. The relation between speech and noise becomes progressively more complicated in the amplitude, power and log-power spectral domains. Noise suppression algorithms usually operate in a time-frequency STFT domain and these techniques have been extended to address dereverberation. In , spectral enhancement methods based on a time-frequency gain, originally developed for noise suppression, have been modified and employed for dereverberation. Such algorithms suppress late reverberation assuming that the early and late reverberation components are uncorrelated. The spectral enhancement methods in estimate the late reverberant spectral variance (LRSV) and use it in the place of the noise spectral variance, reducing the problem of late reverberation suppression to that of estimating the LRSV. In , blind spectral weighting is employed to reduce the overlap-masking effect of reverberation using an uncorrelated and additive assumption for late reverberation.
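The statement that speech and noise are additive in the STFT domain but not in the power spectral domain can be checked numerically. The sketch below uses random complex numbers as stand-ins for STFT coefficients (an assumption for illustration only) and shows that the power of the sum contains a phase-dependent cross term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the STFT coefficients of speech and noise in a few bins.
S = rng.standard_normal(8) + 1j * rng.standard_normal(8)
N = 0.5 * (rng.standard_normal(8) + 1j * rng.standard_normal(8))

# Additive in the complex STFT domain.
Y = S + N

# In the power domain a cross term appears that depends on the phase
# difference between the two components.
phase_diff = np.angle(S) - np.angle(N)
cross_term = 2.0 * np.abs(S) * np.abs(N) * np.cos(phase_diff)
power_sum = np.abs(S) ** 2 + np.abs(N) ** 2
noisy_power = np.abs(Y) ** 2
```

Here `noisy_power` equals `power_sum + cross_term`, whereas `power_sum` alone does not match it. Taking logarithms makes the relation more involved still, which is why the observation models in the log-spectral domain are non-linear.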
Dereverberation algorithms that leave the phase unaltered and operate in the amplitude, power or log-power spectral domains are relatively insensitive to minor variations in the spatial placement of sources . Two criticisms of spectral enhancement algorithms based on LRSV reverberation noise estimation are that they introduce musical noise and suppress speech onsets when they over-estimate reverberation . The LRSV estimator in , which is a continuation of , models the RIR in the STFT domain and not in the time domain  , using the same model of the RIR that is attributed to J. Polack or J. Moorer . Reverberation is estimated in  considering the STFT energy contribution of the direct path of speech and an external estimate.
Modelling the speech temporal dynamics is beneficial when the T60 is long and the DRR is low. Joint denoising and dereverberation using speech and noise tracking is performed in . The SPENDRED algorithm, which is a model-based method with a convolution model for reverberation based on the T60 and the DRR, considers the speech temporal dynamics. SPENDRED employs a parametric model of the RIR and performs frequency-dependent and time-varying T60 and DRR estimation. However, unless the source or the microphone is moving, the T60 and the DRR will be constant throughout the recording. The SPENDRED algorithm assumes that where is the acoustic frame increment. For example, when s and ms, then dB is assumed. In addition, SPENDRED performs intra-frame correlation modelling, which can be beneficial in adverse conditions, while typical algorithms decouple different frequency dimensions.
Statistical-based models, such as the SPENDRED algorithm, describe reverberation by a convolution in the power spectral domain while LRSV models describe reverberation as an additive distortion in the power spectral domain. A model with an infinite impulse response is used either with the two parameters of the T60 and the DRR, as in , or with a finite number of parameters. The infinite-order convolution model of reverberation with the T60 and the DRR is sparse and contrasts with the higher-order autoregressive processes in the complex STFT domain, used in .
The algorithms described in ,  and  create non-linear observation models of noisy reverberant speech in the log Mel-power spectral domain, using the reverberation-to-noise ratio (RNR). As discussed in , phase differences in Mel-frequency bands have different properties from phase differences in STFT bins. The phase factor between reverberant speech and noise is different from that between speech and noise . In , the phase factor between reverberant speech and noise in Mel-frequency bands is examined.
In noisy reverberant conditions, finding the onset of speech phonemes and determining which frames are unvoiced/silence is difficult, due to the smearing across time, often leading to noise over-estimation. The concatenation of different techniques for denoising and dereverberation has lower performance than unified methods due to over-estimating noise when estimating noise and reverberation separately  .
Despite the claim that it is inefficient to perform a two-step procedure comprising denoising followed by dereverberation, long-term linear prediction with pre-denoising can be used to suppress noise and reverberation. With the weighted prediction error (WPE) algorithm, reverberation is represented as a one-dimensional convolution in each frequency bin. In , the WPE algorithm is discussed along with inter-frame correlation. In , the WPE algorithm is used in the complex STFT domain performing batch processing and iteratively estimating, first, the reverberation prediction coefficients and, then, the speech spectral variance. The WPE linear filtering approach, which can be employed in the power spectral domain, takes into account past frames, from the -rd to the -th past frame.
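The alternation described above, between estimating the prediction coefficients and estimating the speech spectral variance, can be sketched for a single channel and a single frequency bin. This is a simplified illustration and not the implementation of the cited works: the default `delay` and `taps`, the per-frame variance estimate and the variance floor are all assumptions made here.

```python
import numpy as np


def wpe_one_bin(y, delay=3, taps=10, iterations=5):
    """Delayed linear prediction in one STFT bin (single channel).

    y: complex STFT coefficients of one frequency bin over time.
    Returns the prediction residual d and the prediction coefficients g.
    """
    y = np.asarray(y, dtype=complex)
    T = y.shape[0]

    # Column k holds the observation delayed by (delay + k) frames.
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Y[shift:, k] = y[:T - shift]

    # Floor on the variance estimate to keep the weights bounded.
    floor = 1e-3 * np.mean(np.abs(y) ** 2)
    d = y.copy()
    g = np.zeros(taps, dtype=complex)
    for _ in range(iterations):
        # Speech spectral variance estimated from the current residual.
        lam = np.maximum(np.abs(d) ** 2, floor)
        # Weighted least squares for the prediction coefficients.
        Yw = Y / lam[:, None]
        R = Yw.conj().T @ Y
        p = Yw.conj().T @ y
        g = np.linalg.solve(R + 1e-10 * np.eye(taps), p)
        # Subtract the predicted late reverberation.
        d = y - Y @ g
    return d, g
```

The prediction delay leaves the most recent frames untouched, so that the short-term correlation of speech is not removed together with the reverberation.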
This paper presents an adaptive denoising and dereverberation Kalman filtering framework that tracks the speech and reverberation spectral log-magnitudes. In this paper, we extend the enhancer in to include dereverberation. Enhancement is performed using a Kalman filter (KF) to model inter-frame correlations. We use an integrated structure of two parallel signal models to track speech, reverberation and the T60 and DRR reverberation parameters. The T60 and the DRR are updated in every frame to improve the estimation of the speech log-magnitude spectrum. We create an observation model and a series of non-linear KF update steps performing joint noise and reverberation suppression by estimating the first two moments of the posterior distribution of the speech log-spectrum given the noisy reverberant log-spectrum. The log-spectral domain is chosen, as in , because good speech models exist in this domain. Modelling spectral log-amplitudes as Gaussian distributions leads to good speech modelling in noisy reverberant environments since super-Gaussian distributions that resemble the log-normal, such as the Gamma, are used to model the speech amplitude spectrum. Mean squared errors (MSEs) in the log-spectral domain are a good measure to use for perceptual quality and speech log-spectra are well modelled by Gaussian distributions, as in and .
The structure of this paper is as follows. Section II describes the signal model and Sec. III presents the enhancement algorithm and its non-linear KF. The implementation and the validation of the algorithm are in Sec. IV. The algorithm’s evaluation is in Sec. V. Conclusions are drawn in Sec. VI.
II Signal model and notation
In the complex STFT domain, the noisy speech, , is given by where is the direct speech component, is the reverberant speech component and is the noise, as for example in  . The time-frame index is and the frequency bin index is . For clarity, we also define . We drop the time and frequency indexes and we obtain . We define the log-magnitude spectrum of as and we also define , , and similarly.
In the signal model, signal quantities with capital letters, such as , are complex numbers with magnitude and phase values, and . In the complex STFT domain, using , the reverberation signal model is given by
where is the acoustic frame increment. In (1), the factors and , where and are uniformly distributed phases, are used. In (2), the DRR is defined in the power spectral domain and the T60 and the DRR are both time and frequency dependent, as described in .
The expression in (1) is the convolution model for reverberation; the most common reverberation model is this single-pole filter that is described by the pole and zero positions that depend on the T60 and the DRR. A convolution of infinite order is used, with the two parameters of the T60 and the DRR, to describe reverberation. Models that describe reverberation by a convolution are also discussed in . The signal model is defined by (1) and by
where , and .
Figure 1 shows graphs of against for a fixed DRR and of against DRR for a fixed . If dB, then . If , then and .
Figure 2 illustrates the flowchart of the signal model. The reverberation signal model in (1) uses and because the and reverberation parameters, in (2), and the DRR are defined in the power spectral domain, as in . The and parameters are mapped to and using (4) and (5). The signals , and are the total disturbance, the old (decaying) reverberation and the new reverberation, respectively. We note that is defined in the first paragraph of this section and that and are defined in (4) and (5).
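The structure of the signal model, with old (decaying) reverberation, new reverberation and a total disturbance, can be simulated for one frequency bin as follows. The symbols of (1) to (5) are not recoverable from the text, so the names `alpha` (power-domain decay per frame) and `beta` (power-domain coupling from the previous speech frame) are hypothetical, as is the exact form of the recursion; the sketch only assumes a first-order recursion with uniformly distributed phase factors.

```python
import numpy as np


def simulate_bin(S, N, alpha, beta, rng):
    """Simulate noisy reverberant STFT coefficients in one frequency bin.

    S, N: complex STFT coefficients of direct speech and noise over time.
    alpha, beta: hypothetical power-domain parameters, so their square
    roots scale the complex amplitudes.
    """
    S = np.asarray(S, dtype=complex)
    N = np.asarray(N, dtype=complex)
    T = S.shape[0]
    R = np.zeros(T, dtype=complex)
    for t in range(1, T):
        # Old reverberation: decayed copy of the previous reverberation.
        old = np.sqrt(alpha) * np.exp(1j * rng.uniform(-np.pi, np.pi)) * R[t - 1]
        # New reverberation: driven by the previous frame's speech.
        new = np.sqrt(beta) * np.exp(1j * rng.uniform(-np.pi, np.pi)) * S[t - 1]
        R[t] = old + new
    D = R + N  # total disturbance: reverberation plus noise
    Y = S + D  # noisy reverberant observation
    return Y, R, D
```

With a single speech impulse and no noise, the reverberant power in this sketch decays geometrically with ratio `alpha`, which is the single-pole behaviour described in the text.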
The signal model of how the reverberation parameters of and change over time is a random walk model. This is used in the algorithm’s KF prediction step for and .
The signal model in Fig. 2 is directly linked to the alternating and interacting KFs of the enhancement algorithm. The algorithm is a collection of two KFs, the speech KF and the reverberation KF, that estimate the speech and reverberation log-amplitude spectra and the and reverberation parameters. This KF algorithm is described in detail in Sec. III.
III The speech enhancement algorithm
The KF algorithm operates in the log-magnitude spectral domain, tracking speech and reverberation. Figure 3 depicts the denoising and dereverberation algorithm that formulates a model of reverberation as a first-order autoregressive process and propagates the means and variances of the random variables. Almost all the signals follow a Gaussian distribution and the distribution of conditioned on observations up to time is given by . In Fig. 3, a Gaussian distribution is denoted by its mean, .
The core of the algorithm in Fig. 3 is the KF that is defined by the gray blocks in the flowchart diagram. The non-linear KF estimates and tracks the posterior distributions of the speech log-magnitude spectrum, , the reverberation log-magnitude spectrum, , and the reverberation parameters, and .
The input to the algorithm in Fig. 3 is the noisy reverberant speech in the time domain. The algorithm’s first step is to perform a STFT and obtain the signal in the complex STFT domain. The algorithm does not alter the noisy reverberant phase, , and uses the noisy reverberant amplitude spectrum, , in three ways: in the speech KF prediction step, in the KF update step and in the noise power modelling. The main part of the algorithm is the KF and the speech KF state, , is the speech log-spectrum from the previous frames,
The speech KF prediction step is based on autoregressive (AR) modelling on the log-spectrum of pre-cleaned speech . The reverberation KF state is and the KF states of the reverberation parameters are and . The KF observation is the noisy reverberant speech log-spectrum, , which is used in the KF update step to compute the first two moments of the posterior of the speech log-spectrum. The mean of the speech log-spectrum posterior is used together with to create the enhanced speech signal using the inverse STFT (ISTFT).
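The final synthesis step, which combines the estimated speech log-magnitude with the unaltered noisy reverberant phase, can be sketched as below. The function name is illustrative and natural logarithms are assumed; the ISTFT that follows is not shown.

```python
import numpy as np


def synthesize_stft(noisy_stft, speech_logmag_mean):
    """Combine an estimated log-magnitude with the noisy phase.

    noisy_stft: complex STFT of the noisy reverberant input.
    speech_logmag_mean: posterior mean of the speech log-magnitude
    (natural logarithm assumed), same shape as noisy_stft.
    """
    phase = np.angle(noisy_stft)
    return np.exp(speech_logmag_mean) * np.exp(1j * phase)
```

Because only the magnitude is modified, the output reduces to the input STFT whenever the estimated log-magnitude equals the observed one.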
Apart from the speech log-spectrum, the non-linear KF also tracks the reverberation log-spectrum, , and the and reverberation parameters. The KF, as defined by the gray blocks in Fig. 3, has a speech KF prediction step, a reverberation KF prediction step and a series of KF update steps. The reverberation KF is comprised of the blocks “Reverberation KF prediction”, “KF Update” and “, KF Update”. These three blocks perform joint denoising and dereverberation and estimate and to enhance noisy reverberant speech.
The structure of the rest of this algorithm description section is as follows. Sections III.A and III.B present the speech and reverberation KF prediction steps, respectively. Section III.C describes the KF update step and Sec. III.D the priors for the and parameters that are needed so that the KF (a) distinguishes between speech and reverberation, and (b) does not diverge to non-realistic T60 and DRR estimates. Section III.E describes the unshaded peripheral blocks in Fig. 3.
III-A The Speech KF Prediction Step
The speech KF prediction step is linear and is related to the “Speech KF prediction”, “Decorrelate” and “Recorrelate” blocks in Fig. 3. The speech KF prediction step is described in and in and is based on conditional distributions to model short-term dependencies. Decorrelation and recorrelation of the speech KF state in (6) are performed after and before the speech KF prediction step, respectively. The decorrelation and recorrelation operations in Fig. 3, which are performed so that the non-linear KF update step can be applied, perform vector-matrix and matrix-matrix multiplications for the speech KF state mean and its covariance matrix, respectively, using . The outputs of the “Decorrelate” block are: (a) the first element of the speech KF state, and (b) the remaining elements of the speech KF state.
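One way to realise such a decorrelation is a linear transformation that removes the correlation between the first state element and the remaining ones. The sketch below is an assumption about the form of the transformation, which the text does not specify; the function names are illustrative.

```python
import numpy as np


def decorrelating_transform(cov):
    """Matrix A such that, in A @ x, the first element is uncorrelated
    with the remaining elements (assumed form of the transformation)."""
    n = cov.shape[0]
    A = np.eye(n)
    A[1:, 0] = -cov[1:, 0] / cov[0, 0]
    return A


def apply_transform(A, mean, cov):
    """Vector-matrix product for the mean, matrix-matrix for the covariance."""
    return A @ mean, A @ cov @ A.T
```

Recorrelation would apply the inverse, `np.linalg.inv(A)`, in the same way after the first element has been updated, so that the change in the first element is passed on to the rest of the state.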
The KF prediction step propagates the first and second moments of the speech KF state. Inter-frame linear relationships are used for the speech KF prediction step that uses AR modelling in the log-magnitude spectral domain. In the speech KF prediction step, is predicted as a linear combination of using the speech AR coefficients that are obtained from the “Speech AR(p)” block in Fig. 3, which uses pre-cleaned speech as an input. After the speech KF prediction step, is correlated; we decorrelate the speech KF state with a linear transformation (using ) to simplify the KF update step and impose the observation constraint . The KF update step changes only the first element of the speech KF state and after the KF update step, recorrelation is applied with a linear transformation (using ) to continue the KF recursion.
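The AR-based prediction of the speech KF state described above follows the standard linear KF prediction with a companion-form transition matrix. This is a minimal sketch; the state ordering (most recent frame first) and the scalar excitation variance `q` are assumptions, and any mean handling of the log-spectrum is omitted.

```python
import numpy as np


def ar_predict(mean, cov, ar_coeffs, q):
    """Linear KF prediction step for an AR(p) state.

    mean, cov: moments of the state holding the p previous log-spectra,
    most recent first (assumed ordering).
    ar_coeffs: the p AR coefficients.
    q: variance of the AR excitation (assumed scalar).
    """
    p = len(ar_coeffs)
    F = np.zeros((p, p))
    F[0, :] = ar_coeffs          # new value: linear combination of the past
    F[1:, :-1] = np.eye(p - 1)   # shift the older values down by one
    Q = np.zeros((p, p))
    Q[0, 0] = q
    return F @ mean, F @ cov @ F.T + Q
```

After this step the first state element is correlated with the others through the rows of the transition matrix, which is why a decorrelating transformation is applied before the scalar update.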
III-B The Reverberation KF Prediction Step
The presented algorithm uses a KF prediction step for and that assumes that the variance of and increases over time, preserving their mean. The KF algorithm implements a random-walk prediction step, performing the operations of
where is a fixed error variance for and is a fixed error variance for . The values used for the prediction error variances, and , depend on the rate at which the T60 and the DRR are likely to change in a real situation.
After the reverberation KF prediction step, the algorithm computes and imposes priors on and using Gaussian-Gaussian multiplication. The internally computed priors for and in the “, priors” block in Fig. 3 are explained in Sec. III.D. After imposing the priors, the outputs are , and , . We note that a prime diacritic, , is used in (7) and (8) to denote quantities before the priors.
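The random-walk prediction followed by the imposition of a Gaussian prior can be sketched for a scalar parameter as below. The function and argument names are illustrative; the product of two Gaussian densities is renormalised to a Gaussian, which is the standard result.

```python
def random_walk_predict(mean, var, step_var):
    """Random-walk prediction: the mean is preserved, the variance grows."""
    return mean, var + step_var


def gaussian_product(mean_a, var_a, mean_b, var_b):
    """Moments of the normalised product of two Gaussian densities."""
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mean = var * (mean_a / var_a + mean_b / var_b)
    return mean, var
```

For example, `gaussian_product(*random_walk_predict(m, v, q), prior_mean, prior_var)` would give the moments after the prior has been imposed. The product always has a smaller variance than either factor, so the prior counteracts the variance growth of the random walk.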
The “Reverberation KF prediction” block in Fig. 3 estimates the first two moments of the prior distribution of the reverberation spectral log-amplitude, i.e. and its variance. The algorithm performs a reverberation KF prediction step based on the previous posterior of both speech and reverberation using the signal model in (1), where is less than unity and this makes the reverberation KF prediction step stable.
From (1) and Fig. 2, the STFT-domain reverberation is the sum of two components arising, respectively, from the reverberation and speech components of the previous frame. The old reverberation, , and the new reverberation, , are defined in (4) and (5), respectively. The KF algorithm calculates the prior distributions of these two components in the log-amplitude spectral domain using and . These equations are based on (4) and (5) with a common condition added to all terms. Assuming that and are uncorrelated with and , respectively, the means and variances of the two Gaussian distributions are added. The variances therefore add,
As shown in Fig. 3, the final operation of the “Reverberation KF prediction” block is to compute the prior distribution, . The reverberation log-amplitude spectrum is estimated by modelling the addition, in the complex STFT domain, of two random variables that are defined in the log-amplitude spectral domain. Given two disturbance sources, we combine them into a single disturbance source.
From this point onwards in Sec. III.B, the time-frame subscript is omitted for clarity. For example, is used instead of , which is defined in Sec. III.
From (1), the reverberation component is the STFT-domain sum of two elements arising, respectively, from the reverberation and speech components in the previous frame. The log-amplitude spectral domain distributions of these two elements, and , were calculated in the preceding paragraphs. A two-dimensional Gaussian distribution is used for assuming independence between and . We assume that the phase difference, , between the two disturbance sources, and , is uniformly distributed, i.e. , and independent of their magnitudes. We write and , which takes account of  . Next, we calculate
where sigma points are used to evaluate the inner integral over  and the outer integral over .
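The moments of the sum of two components that are Gaussian in the log-amplitude domain, with a uniformly distributed phase difference, can be approximated numerically. The sketch below uses Gauss-Hermite nodes in place of the sigma points used in the paper, and a uniform grid over the phase difference; both choices, and all names, are assumptions made for illustration.

```python
import numpy as np


def log_sum_moments(mean_a, var_a, mean_b, var_b, n_nodes=9, n_phase=64):
    """Mean and variance of c = log|exp(a) + exp(b) * exp(j*phi)|.

    a and b are independent Gaussian log-amplitudes and phi is uniform
    on [0, 2*pi), independent of a and b.
    """
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    w = w / np.sqrt(np.pi)  # weights for a standard Gaussian expectation
    a = mean_a + np.sqrt(2.0 * var_a) * x
    b = mean_b + np.sqrt(2.0 * var_b) * x
    # Midpoint grid over the phase difference (avoids phi = pi exactly).
    phi = 2.0 * np.pi * (np.arange(n_phase) + 0.5) / n_phase

    A = a[:, None, None]
    B = b[None, :, None]
    power = np.exp(2 * A) + np.exp(2 * B) + 2 * np.exp(A + B) * np.cos(phi)
    c = 0.5 * np.log(np.maximum(power, 1e-300))

    weights = w[:, None, None] * w[None, :, None] / n_phase
    mean_c = np.sum(weights * c)
    var_c = np.sum(weights * (c - mean_c) ** 2)
    return mean_c, var_c
```

When one component is much weaker than the other, the result approaches the moments of the stronger component, which is a useful sanity check for any implementation of this step.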
III-C The Non-Linear KF Update Step
The KF algorithm decomposes the noisy reverberant observation, , into its component parts using distributions in the log-magnitude spectral domain. The decompositions are based on Fig. 2 and the signal model in (3) and (1). The KF algorithm performs a series of low-dimensional operations instead of a high-dimensional one in the KF update step. The adaptive KF algorithm propagates backwards through Fig. 2 and decomposes: (a) into speech, , and into reverberation and noise, , (b) the reverberation and noise, , into reverberation, , and into noise, , and (c) the reverberation, , into “old reverberation”, , and into “new reverberation”, . The reverberation and noise log-spectrum, , is a variable of the “KF Update” block in Fig. 3.
The KF algorithm in Fig. 3 uses the noisy reverberant observation, , to first update and and then update . The posterior is computed in the “KF Update” block in Fig. 3. In the proposed KF algorithm, the observed affects directly and, in turn, affects . We hence divide the observation update into two steps: (a) we use the log-spectrum observation, , to estimate the posterior distributions and because in (3), and (b) we use as an “observation” to obtain the posterior distributions and because in (3). In (b), we calculate an updated version of by using the posterior as a KF observation constraint. Hence, according to (a) and (b), the log-spectrum observation, , provides new information about and .
The sequence of operations involved in the “Reverberation KF prediction”, “KF update” and “, KF update” blocks is listed in Table I. The “Reverberation KF prediction” block in Fig. 3, which was presented in Sec. III.B, performs the first five operations in Table I. The “KF update” and “, KF update” blocks in Fig. 3 perform the next seven operations in Table I, i.e. steps 6-12. The bottom “” block in Fig. 3 performs step 13. These 13 steps constitute the dereverberation KF update step. The non-linear dereverberation KF update step computes the first two moments of the posterior distributions for most signal quantities and, moreover, includes the prediction step as well for some quantities. Both means and variances are computed for the tracked Gaussian signals; for clarity, the variances, such as for , are not included in Table I.
Table I: The operations performed in the “Reverberation KF prediction”, “KF Update” and “, KF Update” blocks in Fig. 3. Steps 7-12 perform specific signal decompositions propagating backwards through Fig. 2.
Inputs: (a) from the speech KF prediction step, (b) from external noise estimation, (c) from observation, and (d) from in step 7 and from Sec. III.A.
1: from step 13
2: from step 13
3: from steps 1, 13
4: from steps 2, 13
5: from steps 3, 4
6: from step 5 and input (b)
7: from step 6 and inputs (a), (c)
8: from steps 5, 7 and input (b)
9: from step 2 and input (d)
10: from steps 3, 8, 9
11: from steps 1, 10, 13
12: from steps 2, 10 and input (d)
13:
For clarity, from this point onwards in Sec. III.C, the time subscript is included only if it differs from . Table I shows the time subscripts. We also denote by .
The KF algorithm computes the total disturbance, , from (3) in step 6 using similar equations to step 5. A two-dimensional Gaussian distribution is used for and independence is assumed between and . Hence, . The phase-sensitive KF algorithm assumes that the phase difference, , between the two disturbance sources, and , is uniformly distributed and independent of their magnitudes. From (3), we write and thus . Next, using , the first two moments of are given by
where sigma points are used to evaluate the inner integral over  and the outer integral over .
The KF algorithm performs noise suppression with steps 6 and 7. Step 7 decomposes into and , as shown in the signal model in Fig. 2, estimating both and .
Step 7 performs the first signal decomposition, into and , when propagating backwards through the signal model in Fig. 2. Step 7 applies the observation constraint, , according to (3). As in  , the variables are first transformed according to where and . This variable transformation is performed to allow the imposition of the scalar KF observation, .
The noisy reverberant log-amplitude spectrum, , is given by . The KF update step assumes that is uniformly distributed, . Therefore, . The first two moments of the posteriors of and are computed using
where the Jacobian determinant is and the moment indexes, and , are integers, . We denote the variables for two moment indexes by and .
The first two moments of the posterior distributions of and are estimated. In (15), the priors of the speech and of the noise and late reverberation are assumed to be independent, i.e. . In addition, in (15), weighted sigma points are used to evaluate the outer integral over .
Step 8 performs the second signal decomposition, into and , when propagating backwards through the proposed signal model in Fig. 2. Step 8 decomposes into and , according to (3), estimating both and . Step 8 performs an integral over where the integrand is similar to step 7 and (15) and to the KF update step in  . Instead of a scalar observation, as in step 7, the observation in step 8 is a distribution; step 8 performs an outer integral over the observation distribution and the integrand is similar to step 7. Step 8 uses where and . Step 7 computes a two-dimensional integral over the variables of and of the phase difference between and , using
that reduces the probability space from three to two dimensions. In step 8, the KF algorithm calculates a three-dimensional integral over: (a), (b) the phase difference between and , i.e. , and (c) the posterior of . Assuming , step 8 computes
In Table I, the signals in square brackets are calculated but are not used in the KF recursion. In step 8, the posterior of is estimated using (16) but is not used in the KF recursion.
Step 9 determines a preliminary estimate of the posterior distribution of the new reverberation component, , in (5). This preliminary estimate, denoted , combines an updated estimate of the previous frame’s speech, , with the prior estimate of the reverberation parameter, . In step 9, two random variables in the log-amplitude spectral domain are added; the means and variances of the two Gaussian variables are: and . In this addition, and are assumed to be independent.
Step 10 decomposes the reverberation, , into a new reverberation component, and an old decaying reverberation component, , using (1), (4) and (5). Step 10 uses the prior distributions for these two components from steps 3 and 9. Step 10 performs the same operation as step 8. In analogy to step 7, the variable transformations in steps 8 and 10 are and , respectively. In step 10, the KF algorithm performs an integral over where the integrand is similar to the KF update step in . Step 10 estimates the posterior distributions of and ; it estimates both and using and . Step 10 computes
where and is the phase difference of the STFT-domain signals of and . Using (17), we decompose into old reverberation and new reverberation, estimating and . In (17), sigma points are used to evaluate the integral over and the integral over .
Steps 10-12 perform the final signal decomposition, into and , when propagating backwards through the signal model in Fig. 2. In steps 11 and 12, the KF algorithm computes the first two moments of the posterior distributions of and . In step 11, the algorithm performs an integral over using weighted sigma points where the integrand models the addition of two random variables in the log-amplitude spectral domain, and . Likewise, in step 12, the algorithm performs an integral over using sigma points where the integrand models the addition of two random variables in the log-amplitude spectral domain, and . Steps 11 and 12 perform a linear KF update and impose a straight line observation constraint because the signal model is that and are additive and that and are additive, according to (3) and (1).
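The linear update used in steps 11 and 12, where two quantities are additive in the log-amplitude domain and their sum is constrained, can be illustrated with a standard linear KF update. This sketch simplifies the description above by treating the constraint as a Gaussian observation with a given mean and variance instead of integrating over sigma points; all names are illustrative.

```python
import numpy as np


def sum_constraint_update(prior_mean, prior_var, obs_mean, obs_var):
    """Update two independent Gaussian variables given their noisy sum.

    prior_mean, prior_var: length-2 sequences for the two variables.
    obs_mean, obs_var: mean and variance of the observed sum.
    Returns the posterior mean vector and covariance matrix.
    """
    m = np.asarray(prior_mean, dtype=float)
    P = np.diag(np.asarray(prior_var, dtype=float))
    H = np.ones((1, 2))                      # observation: x1 + x2
    S = (H @ P @ H.T).item() + obs_var       # innovation variance
    K = (P @ H.T) / S                        # Kalman gain, shape (2, 1)
    post_mean = m + K[:, 0] * (obs_mean - (H @ m).item())
    post_cov = P - K @ H @ P
    return post_mean, post_cov
```

The constraint makes the two variables negatively correlated in the posterior: if one is larger than estimated, the other must be smaller for the sum to hold.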
In step 11, the KF algorithm decomposes the old reverberation component, , according to (5) into the sum of a reverberation parameter, , and the previous frame’s reverberation, . We define where , is zero-mean Gaussian with variance , and where is a diagonal matrix with the elements of j on the main diagonal of the matrix. If , then
where and . We compute and and set equal to the first element of this computed mean and equal to the first element of this variance.
Likewise, for step 12, we define where , is zero-mean Gaussian with variance , and, moreover, . Next, we use
where . The KF algorithm computes and and sets equal to the first element of this computed mean and equal to the first element of this computed variance.
In steps 11 and 12, the presented KF algorithm computes , and , , respectively. Steps 11 and 12 use and , respectively.
Step 13 applies a one-frame delay, i.e. , to continue the KF recursion as shown in Fig. 3. In summary, the 13 steps have the four main operations of: (a) step 5, (b) step 7, (c) step 8, and (d) step 11. The operations performed in the other steps are either simple or identical to these operations.
According to the 13 steps in Table I, the and reverberation parameters are affected by the observed through a series of operations that estimate the first two moments of the posterior distributions. The estimate of is affected by the previous estimate of because of the sequence of operations: . The estimate of depends on the previous estimate of because of the sequence of operations: .
The 13 steps do not include the speech KF prediction step shown in Fig. 3, which is used to calculate , which is needed in steps 9 and 12. After step 7, according to Fig. 3 and Sec. III.B, recorrelation of the speech KF state is performed with . Using , from the recorrelation operation, is obtained. In steps 9 and 12, is used as a better estimate of than .
In step 9, we note that two sub-indexes of and give rise to a sub-index of : . In step 9, we introduce the notation to avoid using the same symbol for two different posterior distributions, and .
III-D The Priors for the Reverberation Parameters
This section describes the priors for the and reverberation parameters, which are based on   and . The priors for and are imposed using Gaussian-Gaussian multiplication; is modelled with a Gaussian distribution and its internal prior is also a Gaussian. Likewise, is modelled with a Gaussian and its internal prior is also a Gaussian.
The priors for and are estimated from spectral log-amplitude observations in the free decay region (FDR), which is comprised of consecutive frames with decreasing energy. We define the look-ahead factor and the frame index . The least squares (LS) fit to the FDR is found using where is the time index in seconds and depends on . The parameters of the straight line are the slope, , and the y-intercept, . For clarity, the frame subscript is omitted from and . We define . Its Gaussian distribution is
where and .
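The least squares line fit to a free decay region described above can be sketched as follows. The sketch assumes that the levels are expressed in dB, so that the slope is in dB per second and relates to the T60 through the 60 dB definition; the function names are illustrative and the detection of the free decay region itself is not shown.

```python
import numpy as np


def fit_free_decay(times_s, level_db):
    """Least squares straight line through a free decay region.

    times_s: frame times in seconds.
    level_db: levels of consecutive decaying frames in dB (assumed unit).
    Returns the slope (dB/s) and the y-intercept (dB).
    """
    slope, intercept = np.polyfit(times_s, level_db, 1)
    return slope, intercept


def t60_from_slope(slope_db_per_s):
    """Time needed to decay by 60 dB at the fitted decay rate."""
    return -60.0 / slope_db_per_s
```

A steeper negative slope corresponds to a shorter T60, and a short decay region gives a less reliable slope, which motivates treating the fitted parameters as a distribution instead of as point estimates.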
The log-likelihood of is given by where and are constants.
We define the vectors and from and , respectively. We define as the covariance matrix of r. The regression coefficients as a Gaussian distribution are where and ,