In many hands-free scenarios, the measured microphone signals are corrupted by additive background noise, which may originate from both environmental sources and microphone responses. Apart from noise, if the recording takes place in an enclosed space, the recorded signals may also contain multiple sound reflections from walls and other objects in the room, resulting in reverberation. As the level of noise and reverberation increases, the perceived quality and intelligibility of the speech signal deteriorate, which in turn affects the performance of speech communication systems, as well as automatic speech recognition (ASR) systems.
In order to reduce the effects of reverberation and noise, speech enhancement algorithms are required, which aim at recovering the clean speech source from the recorded microphone signals. Speech dereverberation and noise reduction algorithms often require the power spectral densities (PSDs) of the speech, reverberation and noise components. In the multichannel framework, a commonly used assumption (see e.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) is that the late reverberant signal is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a spherically diffuse time-invariant spatial coherence matrix. As the spatial coherence matrix depends only on the microphone geometry, it can be calculated in advance. However, the reverberation PSD is an unknown parameter that must be estimated. Numerous methods exist for estimating the reverberation PSD. They are broadly divided into two classes, namely non-blocking-based estimators and blocking-based estimators. The non-blocking-based approach jointly estimates the PSDs of the late reverberation and speech. The estimation is carried out using the maximum likelihood (ML) criterion [3, 6] or in the least-squares (LS) sense, by minimizing the Frobenius norm of an error PSD matrix . In the blocking-based method, the desired speech signal is first blocked using a blocking matrix (BM), and then the reverberation PSD is estimated. Estimators in this class are also based on the ML approach [1, 5, 7, 9] or the LS criterion [2, 4].
None of the previously mentioned methods includes an estimator for the noise PSD. In [1, 3], a noiseless scenario is assumed. In [2, 4, 5, 6, 7, 8, 9, 10], the noise PSD matrix is assumed to be known in advance, or an estimate of it is assumed to be available. Typically, the noise PSD matrix is assumed to be time-invariant, and can therefore be estimated during speech-absent periods using a voice activity detector (VAD). However, in practical acoustic scenarios the spectral characteristics of the noise might be time-varying, e.g. when the noise environment includes a background radio or TV, and thus a VAD-based algorithm may fail. Therefore, the noise PSD matrix has to be included in the estimation procedure.
Some papers in the field deal with performance analysis of the proposed estimators. We give a brief review of the commonly used tools to assess the quality of an estimator. Theoretical analysis of estimators typically consists of calculating the bias and the mean square error (MSE). The Cramér-Rao Bound (CRB) is an important tool to evaluate the quality of any unbiased estimator, since it gives a lower bound on the MSE. An estimator that is unbiased and attains the CRB is called efficient. The maximum likelihood estimator (MLE) is asymptotically efficient , namely it attains the CRB when the number of samples is large.
Theoretical analysis of PSD estimators in the noise-free scenario was addressed in [12, 13]. In , CRBs were derived for the reverberation and the speech MLEs proposed in . These MLEs are efficient, i.e. attain the CRB for any number of samples. In addition, it was pointed out that the non-blocking-based reverberation MLE derived in  is identical to the blocking-based MLE proposed in . In , it was shown that the non-blocking-based reverberation MLE of  obtains a lower MSE compared to a noiseless version of the blocking-based LS estimator derived in .
In the noisy case, quality assessment was discussed in  and . In , it was numerically demonstrated that an iterative blocking-based MLE yields lower MSE than the blocking-based LS estimator proposed in . In , closed-form CRBs were derived for the two previously proposed MLEs of the reverberation PSD, namely the blocking-based estimator in  and the non-blocking-based estimator in . The CRB for the non-blocking-based reverberation estimator was shown to be lower than the CRB for the blocking-based estimator. However, it was shown that in the noiseless case, both reverberation MLEs are identical and both CRBs coincide.
As opposed to previous works, the assumption of known noise PSD matrix is not made in . The noise is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a time-invariant spatial coherence matrix. It is assumed that the spatial coherence matrix of the noise is known in advance, while the time-varying PSD is unknown. Two different estimators were developed, based on the LS method. In the first one, a joint estimator for the speech, noise and late reverberation PSDs was developed. As an alternative, a blocking-based estimator was proposed, in which the speech signal is first blocked by a BM, and then the noise and reverberation PSDs are jointly estimated. However, this model only fits spatially homogeneous noise fields that are characterized by a full-rank covariance matrix. Moreover, in  it was claimed that the ML approach is preferable over the LS estimation procedure.
In this paper, we treat the noise PSD matrix as an unknown parameter. We assume that the noise PSD matrix is a rank-deficient matrix, as opposed to the spatially homogeneous assumption considered in . This scenario arises when the noise signal consists of a set of directional interfering sources, whose number is smaller than the number of microphones. We assume that the positions of the interfering sources are fixed, while their PSD matrix is time-varying, e.g. when the acoustic environment includes a radio or TV. It should be emphasized that, in contrast to  which estimates only a scalar PSD of the noise, in our model the entire PSD matrix of the noise is estimated, and thus the case of multiple non-stationary noise sources can be handled. We derive closed-form MLEs of the various PSDs, for both the non-blocking-based and the blocking-based methods. The proposed estimators are analytically studied and compared, and the corresponding MSE expressions are derived. Furthermore, CRBs for estimating the various PSDs are derived.
An important benefit of treating the rank-deficient noise case as a separate problem lies in the form of the solution. In the ML framework, a closed-form solution exists for the noiseless case [1, 3] but not for the full-rank noise scenario, which therefore requires iterative optimization techniques [5, 6, 9] (as opposed to the LS method, which has closed-form solutions in both cases). However, we show here that when the noise PSD matrix is rank-deficient, a closed-form MLE exists, which yields a simpler and faster estimation procedure with low computational complexity that is not sensitive to local maxima.
The remainder of the paper is organized as follows. Section II introduces some notations and preliminary notes. Section III presents the problem formulation, and describes the probabilistic model. Section IV derives the MLEs for both the non-blocking-based and the blocking-based methods, and Section V presents the CRB derivation. Section VI demonstrates the performance of the proposed estimators by an experimental study based on both simulated data and recorded room impulse responses. The paper is concluded in Section VII.
II Notation and Preliminaries
In this work, scalars are denoted with regular lowercase letters, vectors are denoted with bold lowercase letters and matrices are denoted with bold uppercase letters. A list of notations used in our derivations is given in Table I.
|determinant of a matrix|
|trace of a matrix|
|stacking the columns of a matrix on top of one another|
where is the mean vector and is an Hermitian positive definite complex covariance matrix. For a positive definite Hermitian form , where and a Hermitian matrix, the variance is given by [11, p. 513, Eq. (15.29-15.30)]:
For the Kronecker product, the following identities hold :
III Problem Formulation
III-A Signal Model
Consider a speech signal received by microphones, in a noisy and reverberant acoustic environment.
We work with the short-time Fourier transform (STFT) representation of the measured signals. Let denote the frequency bin index, and denote the time frame index. The -channel observation signal writes
where is defined as the direct speech component, as received by the first microphone (designated as a reference microphone), is the time-invariant relative direct-path transfer function (RDTF) vector between the reference microphone and all microphones, denotes the late reverberation and denotes the noise. It is assumed that the noise signal consists of interfering sources, i.e.
where denotes the vector of noise sources and is the noise acoustic transfer function (ATF) matrix, assumed to be time-invariant. It is assumed that .
III-B Probabilistic Model
The speech STFT coefficients are assumed to follow a zero-mean complex Gaussian distribution with a time-varying PSD . Hence, the PDF of the speech writes:
The late reverberation signal is modelled by a zero-mean complex multivariate Gaussian distribution:
The reverberation PSD matrix is modelled as a spatially homogeneous and isotropic sound field, with a time-varying PSD, . It is assumed that the time-invariant coherence matrix can be modelled by a spherically diffuse sound field :
where , is the inter-distance between microphones and , denotes the sampling frequency and is the sound velocity.
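As a rough illustration of the spherically diffuse coherence model, the following Python sketch evaluates the usual sinc-shaped coherence between every pair of microphones. The function name `diffuse_coherence`, its argument names, the use of a physical frequency in Hz in place of a bin index and sampling frequency, and the default sound velocity of 343 m/s are assumptions made for this example only.

```python
import numpy as np

def diffuse_coherence(mic_positions, freq_hz, c=343.0):
    """Coherence sin(x)/x with x = 2*pi*f*d_ij/c for every microphone pair (i, j)."""
    pos = np.asarray(mic_positions, dtype=float)
    if pos.ndim == 1:                      # 1-D coordinates of a linear array
        pos = pos[:, None]
    # Pairwise inter-microphone distances d_ij
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # np.sinc(x) is sin(pi*x)/(pi*x), hence the argument 2*f*d/c
    return np.sinc(2.0 * freq_hz * dist / c)

# Hypothetical 4-microphone linear array with 6 cm spacing, evaluated at 2 kHz
gamma = diffuse_coherence([0.0, 0.06, 0.12, 0.18], freq_hz=2000.0)
```

Because the matrix depends only on the geometry and the frequency, it can be tabulated once per frequency bin before any estimation is run.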
The noise sources vector is modelled by a zero-mean complex multivariate Gaussian distribution with a time-varying PSD matrix :
The PDF of therefore writes
where is the PSD matrix of the input signals. Assuming that the components in (6) are independent, is given by
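The structure implied by the independence assumption can be sketched numerically: a rank-one speech term, a scaled coherence matrix for the reverberation, and a low-rank term for the directional noise sources. The helper name `input_psd_matrix`, the variable names and all numerical values below are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def input_psd_matrix(phi_s, g, phi_r, gamma, A, psi_s):
    """Sum of the speech, reverberation and noise PSD matrices (independent components)."""
    g = np.asarray(g).reshape(-1, 1)
    speech = phi_s * (g @ g.conj().T)      # rank-one direct speech term
    reverb = phi_r * gamma                 # diffuse late reverberation
    noise = A @ psi_s @ A.conj().T         # T directional interferers, rank <= T
    return speech + reverb + noise

rng = np.random.default_rng(0)
N, T = 6, 2                                # hypothetical: 6 microphones, 2 noise sources
idx = np.arange(N)
gamma = np.sinc(2 * 2000.0 * 0.1 * np.abs(idx[:, None] - idx[None, :]) / 343.0)
g = np.exp(-1j * np.pi * 0.3 * idx)        # toy RDTF with unit entry at the reference microphone
A = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
psi_s = S @ S.conj().T                     # Hermitian PSD matrix of the noise sources
noise_term = A @ psi_s @ A.conj().T
phi_y = input_psd_matrix(1.0, g, 0.5, gamma, A, psi_s)
```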
A commonly used dereverberation and noise reduction technique is to estimate the speech signal using the multichannel minimum mean square error (MMSE) estimator, which yields the multichannel Wiener filter (MCWF), given by :
denotes the total interference PSD matrix. For implementing (15), we assume that the RDTF vector and the spatial coherence matrix are known in advance. The RDTF depends only on the direction of arrival (DOA) of the speaker and the geometry of the microphone array, and thus it can be constructed based on a DOA estimate. The spatial coherence matrix is calculated using (10), based on the spherical diffuseness assumption.
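For readers who want to experiment, here is a minimal sketch of a multichannel Wiener filter for a single desired source, assuming the standard rank-one form in which the filter is the interference-whitened RDTF divided by the sum of the beamformer output SNR term and the inverse speech PSD. The function name `mcwf` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def mcwf(g, phi_s, phi_i):
    """w = Phi_i^{-1} g / (g^H Phi_i^{-1} g + 1/phi_s), assuming a rank-one speech model."""
    x = np.linalg.solve(phi_i, g)                  # Phi_i^{-1} g
    return x / (np.vdot(g, x).real + 1.0 / phi_s)  # np.vdot conjugates its first argument

rng = np.random.default_rng(1)
N = 5
g = np.exp(-1j * 0.7 * np.arange(N))               # toy RDTF
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
phi_i = M @ M.conj().T + np.eye(N)                 # arbitrary Hermitian positive definite interference
phi_s = 2.0
w = mcwf(g, phi_s, phi_i)
speech_gain = np.vdot(w, g)                        # w^H g, the filter response to the direct speech
```

Under these assumptions the same filter can be obtained from the full input PSD matrix as `phi_s * solve(phi_y, g)`, which is a convenient consistency check.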
The noise ATF matrix is in general not available (since such an estimate would require each noise source to be active separately). To circumvent this problem, we assume that a speech-absent segment (in which all noise sources are active) is available, and apply the eigenvalue decomposition (EVD) to the noise PSD matrix of that segment. Note that , i.e.– rank representation of the noise PSD matrix is given by
where is the eigenvalues matrix (comprised of the non-zero eigenvalues) and is the corresponding eigenvectors matrix. is a basis that spans the noise ATFs subspace, and thus 
where . It follows that the noise PSD matrix in (12) can be recast as
where . Using this basis change, the MCWF in (15) is now computed with
As a result, rather than requiring the knowledge of the exact noise ATF matrix, we use that is learned from a speech-absent segment. Due to this basis change, we will need to estimate instead of .
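The basis change can be illustrated with a small numerical sketch: the dominant eigenvectors of a rank-deficient noise PSD matrix span the same subspace as the unknown ATFs, so the noise PSD matrix can be re-expressed in that basis. The names `noise_subspace`, `G` and `psi_w`, and the assumption that an exact noise PSD matrix is available, are illustrative choices; in practice the matrix would be estimated from noisy training data.

```python
import numpy as np

def noise_subspace(noise_psd, T):
    """Return the T dominant eigenvectors (N x T) and eigenvalues of a Hermitian matrix."""
    eigvals, eigvecs = np.linalg.eigh(noise_psd)   # eigenvalues in ascending order
    return eigvecs[:, -T:], eigvals[-T:]

rng = np.random.default_rng(2)
N, T = 6, 2
A = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))   # unknown ATFs
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
psi_s = S @ S.conj().T
phi_v = A @ psi_s @ A.conj().T             # rank-T noise PSD matrix of a speech-absent segment

V, lam = noise_subspace(phi_v, T)
G = V.conj().T @ A                         # coordinates of the ATFs in the learned basis
psi_w = G @ psi_s @ G.conj().T             # noise PSD matrix expressed in that basis
```

The point of the exercise is that `V` and `psi_w` reproduce the noise PSD matrix without ever using `A` at estimation time.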
Clearly, estimators of the late reverberation , speech and noise PSD are required for evaluating the MCWF. For the sake of brevity, the frame index and the frequency bin index are henceforth omitted whenever possible.
IV ML Estimators
We propose two ML-based methods: (i) Non-blocking-based estimation: Simultaneous ML estimation of the speech, reverberation and noise PSDs; and (ii) Blocking-based estimation: Elimination of the speech PSD using a BM, and then joint ML estimation of the reverberation and noise PSDs. Both methods are then compared and analyzed.
IV-A Non-Blocking-Based Estimation
We start with the joint ML estimation of the reverberation, speech and noise PSDs. Based on the short-time stationarity assumption [12, 9], it is assumed that the PSDs are approximately constant across a small number of consecutive time frames, denoted by . We therefore denote as the concatenation of previous observations of :
The set of unknown parameters is denoted by , where . Assuming that the consecutive signals in are i.i.d., the PDF of writes (see e.g. ):
where is the sample covariance matrix, given by
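A sample covariance matrix of this kind is simply the average of the outer products of the concatenated snapshots. The sketch below assumes the snapshots are stored as the columns of an array; the name `sample_covariance` is hypothetical.

```python
import numpy as np

def sample_covariance(Y):
    """Average of y y^H over the columns of Y (shape: microphones x snapshots)."""
    Y = np.asarray(Y)
    return (Y @ Y.conj().T) / Y.shape[1]

# Two microphones, two snapshots
Y = np.array([[1.0, 1.0j],
              [2.0, 0.0]])
R_y = sample_covariance(Y)
```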
The MLE of the set is therefore given by
To the best of our knowledge, for the general noisy scenario this problem is considered as having no closed-form solution. However, we will show that when the noise PSD matrix is rank-deficient, with , a closed-form solution exists. In the following, we present the proposed estimators. The detailed derivations appear in the Appendices.
In Appendix A, it is shown that the MLE of is given by:
where is given by
and is the speech-plus-noise subspace
The matrix is a projection matrix onto the subspace orthogonal to the speech-plus-noise subspace. The role of is to block the directions of the desired speech and noise signals, in order to estimate the reverberation level.
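The projection idea can be sketched in code. The snippet assumes the estimator takes the form of a trace of the projected, coherence-whitened sample covariance, normalized by the dimension of the subspace left after removing the speech-plus-noise directions; the function name `reverb_psd_mle` and the exact expression of the projector are assumptions consistent with the description above, not a transcription of the paper's equations.

```python
import numpy as np

def reverb_psd_mle(R_y, gamma, g, V):
    """Reverberation PSD from the part of R_y orthogonal to the speech-plus-noise subspace."""
    N = R_y.shape[0]
    A_t = np.column_stack([g, V])                     # speech-plus-noise subspace, N x (T+1)
    Gi_A = np.linalg.solve(gamma, A_t)                # Gamma^{-1} A_t
    # Oblique projector Q = I - A_t (A_t^H Gamma^{-1} A_t)^{-1} A_t^H Gamma^{-1}
    Q = np.eye(N) - A_t @ np.linalg.solve(A_t.conj().T @ Gi_A, Gi_A.conj().T)
    trace = np.trace(Q @ R_y @ np.linalg.inv(gamma))
    return trace.real / (N - A_t.shape[1])

rng = np.random.default_rng(3)
N, T = 6, 2
idx = np.arange(N)
gamma = np.sinc(2 * 2000.0 * 0.1 * np.abs(idx[:, None] - idx[None, :]) / 343.0)
g = np.exp(-1j * 0.8 * idx)
V, _ = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
psi_w = S @ S.conj().T
phi_s, phi_r = 1.0, 0.5

# Feed the exact PSD matrix in place of the sample covariance
phi_y = phi_s * np.outer(g, g.conj()) + V @ psi_w @ V.conj().T + phi_r * gamma
phi_r_hat = reverb_psd_mle(phi_y, gamma, g, V)
```

When the exact PSD matrix is supplied, the projector removes the speech and noise terms entirely and the function returns the true reverberation PSD, which is the numerical counterpart of unbiasedness.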
Once we obtain the MLE for the late reverberation PSD, the MLEs for the speech and noise PSDs can be computed. In Appendix B, it is shown that the MLE for the speech PSD writes
where is a minimum variance distortionless response (MVDR) beamformer that extracts the speech signal while eliminating the noise, given by
and is a projection matrix onto the subspace orthogonal to the noise subspace, given by
In Appendix C, it is shown that the MLE of the noise PSD can be computed with
where is a multi-source linearly constrained minimum variance (LCMV) beamformer that extracts the noise signals while eliminating the speech signal:
and is a projection matrix onto the subspace orthogonal to the speech subspace, given by
Interestingly, the projection matrix can be recast as a linear combination of the above beamformers (see Appendix D):
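To make the relation between the beamformers and the projection matrix concrete, the sketch below builds both beamformers from a single constrained minimum-variance design with the coherence matrix as the weighting: one column passes the speech and nulls the noise basis, and the remaining columns pass each noise basis vector and null the speech. This generic LCMV form, the subtraction of the scaled coherence matrix before beamforming, and all names (`constrained_beamformers`, `speech_noise_psd`) are assumptions made for illustration and may differ in notation from the expressions in the paper.

```python
import numpy as np

def constrained_beamformers(gamma, g, V):
    """Columns of W satisfy W^H [g, V] = I while minimizing the diffuse-field output power."""
    A_t = np.column_stack([g, V])
    Gi_A = np.linalg.solve(gamma, A_t)
    W = Gi_A @ np.linalg.inv(A_t.conj().T @ Gi_A)
    return W[:, 0], W[:, 1:]                     # speech beamformer, noise beamformers

def speech_noise_psd(R_y, gamma, phi_r, w_s, W_v):
    """Beamform the covariance after removing the reverberation contribution."""
    D = R_y - phi_r * gamma
    return np.vdot(w_s, D @ w_s).real, W_v.conj().T @ D @ W_v

rng = np.random.default_rng(4)
N, T = 6, 2
idx = np.arange(N)
gamma = np.sinc(2 * 2000.0 * 0.1 * np.abs(idx[:, None] - idx[None, :]) / 343.0)
g = np.exp(-1j * 0.8 * idx)
V, _ = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
psi_w = S @ S.conj().T
phi_s, phi_r = 1.0, 0.5
phi_y = phi_s * np.outer(g, g.conj()) + V @ psi_w @ V.conj().T + phi_r * gamma

w_s, W_v = constrained_beamformers(gamma, g, V)
phi_s_hat, psi_w_hat = speech_noise_psd(phi_y, gamma, phi_r, w_s, W_v)
# Projection matrix written as a combination of the two beamformers
Q = np.eye(N) - np.outer(g, w_s.conj()) - V @ W_v.conj().T
```

With the exact PSD matrix and the true reverberation level as inputs, the speech PSD and the noise PSD matrix are recovered exactly, and `Q` annihilates both the RDTF and the noise basis.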
In the noiseless case, i.e. when , reduces to
where , leading to the same closed-form estimators as in [3, Eq. (7)]:
IV-B Blocking-Based Estimation
As a second approach, we first block the speech component using a BM, and then jointly estimate the PSDs of the reverberation and noise. Let denote the BM, which satisfies . The output of the BM is given by
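One simple way to obtain a matrix that blocks the speech direction is to take an orthonormal basis of the orthogonal complement of the RDTF vector. The sketch below does this with a full singular value decomposition; this particular construction and the name `blocking_matrix` are assumptions, since any full-column-rank matrix orthogonal to the RDTF would serve the same purpose.

```python
import numpy as np

def blocking_matrix(g):
    """N x (N-1) matrix B with orthonormal columns such that B^H g = 0."""
    U, _, _ = np.linalg.svd(np.asarray(g).reshape(-1, 1), full_matrices=True)
    return U[:, 1:]          # left singular vectors orthogonal to g

g = np.exp(-1j * 0.9 * np.arange(5))   # toy RDTF for 5 microphones
B = blocking_matrix(g)
y = 3.0 * g                            # snapshot containing only direct speech
z = B.conj().T @ y                     # BM output: the speech component is removed
```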
The PDF of therefore writes:
where the PSD matrix is given by
where is the total interference matrix, defined in (16). Under this model, the parameter set of interest is . Similarly to , it is assumed that is fixed during the entire segment. Let be defined similarly to in (22). Assuming again i.i.d. concatenated snapshots, the PDF of writes
where is given by
The MLE of is obtained by solving:
To the best of our knowledge, this problem is also considered as having no closed-form solution. Again, we argue that if the noise PSD matrix satisfies , then we can obtain a closed-form solution. Multiplying (20) from left by and from right by , the noise PSD matrix at the output of the BM writes
where is the reduced noise subspace:
In Appendix E, the following MLE is obtained:
where is given by
After the BM was applied, the remaining role of is to block the noise signals, in order to estimate the reverberation level. Note that .
Given , it is shown in Appendix F that the MLE for the noise PSD writes
where is a multi-source LCMV beamformer, directed towards the noise signals after the BM, given by
Note that with this notation, in (51) can be recast as
Since , it also follows that
Also, in Appendix G it is shown that
namely the LCMV of (34), used in the non-blocking-based approach, can be factorized into two stages: The first is a BM that blocks the speech signal, followed by a modified LCMV, which recovers the noise signals at the output of the BM.
IV-C Comparing the MLEs
In this section, the obtained blocking-based and non-blocking-based MLEs are compared. We will use the following identity, that is proved in [14, Appendix A]:
IV-C1 Comparing the reverberation PSD estimators
It should be noted that in [12, 14] the two MLEs of the reverberation PSD were shown to be identical in the noiseless case. Here we extend this result to the noisy case, when the noise PSD matrix is a rank-deficient matrix.
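The claimed equivalence can be checked numerically on arbitrary data. The sketch below applies one generic projection-and-trace estimator twice: once to the microphone signals with the speech-plus-noise subspace, and once to the BM output with the reduced noise subspace. The estimator form, the SVD-based blocking matrix and all names are assumptions made for this illustration.

```python
import numpy as np

def reverb_mle(R, gamma, S):
    """Trace of the coherence-whitened covariance projected away from span(S), normalized."""
    n = R.shape[0]
    Gi_S = np.linalg.solve(gamma, S)
    Q = np.eye(n) - S @ np.linalg.solve(S.conj().T @ Gi_S, Gi_S.conj().T)
    return np.trace(Q @ R @ np.linalg.inv(gamma)).real / (n - S.shape[1])

rng = np.random.default_rng(6)
N, T, L = 6, 2, 30
idx = np.arange(N)
gamma = np.sinc(2 * 2000.0 * 0.1 * np.abs(idx[:, None] - idx[None, :]) / 343.0)
g = np.exp(-1j * 0.8 * idx)
V, _ = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))

# Arbitrary snapshots: the identity does not rely on the data following the model
Y = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))
R_y = Y @ Y.conj().T / L

# Non-blocking: project away from [g, V] in the microphone domain
phi_nonblocking = reverb_mle(R_y, gamma, np.column_stack([g, V]))

# Blocking: remove g with a BM first, then project away from the reduced noise subspace
U, _, _ = np.linalg.svd(g.reshape(-1, 1), full_matrices=True)
B = U[:, 1:]
phi_blocking = reverb_mle(B.conj().T @ R_y @ B, B.conj().T @ gamma @ B, B.conj().T @ V)
```

Both calls return the same value up to numerical precision, in line with the statement that blocking the speech first does not change the reverberation estimate.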
IV-C2 Comparing the noise PSD estimators
IV-D MSE Calculation
In the sequel, the theoretical performance of the proposed PSD estimators is analyzed. Since the non-blocking-based and the blocking-based MLEs were proved in Section IV-C to be identical for both reverberation and noise PSDs, it suffices to analyze the non-blocking-based MLEs.
IV-D1 Theoretical performance of the reverberation PSD estimators
It is well known that for an unbiased estimator, the MSE is identical to the variance. We therefore start by showing that the non-blocking-based MLE in (26) is unbiased. Using (24), the expectation of (26) writes
Then, we use the following property (see (85d)):
It follows that the reverberation MLE is unbiased, and thus the MSE is identical to the variance. Using the i.i.d. assumption, the variance of the non-blocking-based MLE in (26) is given by
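A small Monte-Carlo experiment can be used to sanity-check the unbiasedness and the variance empirically. The sketch assumes the projection-and-trace form of the estimator and compares its empirical variance with the value one would expect under the Gaussian model, namely the squared reverberation PSD divided by the number of snapshots times the dimension of the projected subspace. That reference value is derived here from the stated assumptions and is not copied from the paper; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, L, trials = 6, 2, 50, 4000
phi_r, phi_s = 0.5, 1.0

idx = np.arange(N)
gamma = np.sinc(2 * 2000.0 * 0.1 * np.abs(idx[:, None] - idx[None, :]) / 343.0)
g = np.exp(-1j * 0.8 * idx)
V, _ = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))

def cgauss(*shape):
    """Circular complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Projector away from the speech-plus-noise subspace and the matrix Gamma^{-1} Q
A_t = np.column_stack([g, V])
Gi_A = np.linalg.solve(gamma, A_t)
Q = np.eye(N) - A_t @ np.linalg.solve(A_t.conj().T @ Gi_A, Gi_A.conj().T)
M = np.linalg.solve(gamma, Q)

# Draw all trials at once: Y has shape (trials, N, L)
C = np.linalg.cholesky(gamma)
s = np.sqrt(phi_s) * cgauss(trials, 1, L)          # speech
s_w = cgauss(trials, T, L)                         # noise sources in the V basis
r = np.sqrt(phi_r) * (C @ cgauss(trials, N, L))    # reverberation with covariance phi_r * Gamma
Y = g[:, None] * s + V @ s_w + r

# Tr[Q R_y Gamma^{-1}] equals the average over snapshots of y^H Gamma^{-1} Q y
est = np.einsum('tnl,nm,tml->t', Y.conj(), M, Y).real / (L * (N - T - 1))
expected_var = phi_r ** 2 / (L * (N - T - 1))
```

Comparing `est.mean()` with `phi_r` and `est.var()` with `expected_var` gives a quick empirical check of the analysis for the chosen parameters.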
IV-D2 Theoretical performance of the noise PSD estimators
Next, we calculate the variance of the diagonal terms of . To this end, we write the entry of in (33) as
for , where is the column of the matrix in (34). Using a partitioned matrix to simplify , it can be shown that
where is composed of all the vectors in except , i.e. , and is the corresponding projection matrix onto the subspace orthogonal to
It can be verified that . Denote the diagonal terms of by . In Appendix H, it is shown that
where is defined as the noise-to-reverberation ratio at the output of :
IV-D3 Theoretical performance of the speech PSD estimator
where is defined as the signal-to-reverberation ratio at the output of :
V CRB Derivation
In this section, we derive the CRB on the variance of any unbiased estimator of the various PSDs.
V-A CRB for the Late Reverberation PSD
V-B CRB for the Speech and Noise PSDs
VI Experimental Study
In this section, the proposed MLEs are evaluated in a synthetic Monte-Carlo simulation as well as on measurements of a real room environment. In Section VI-A, a Monte-Carlo simulation is conducted in which signals are generated synthetically based on the assumed statistical model. The sensitivity of the proposed MLEs is examined with respect to the various model parameters, and the MSEs of the proposed MLEs are compared to the corresponding CRBs. In Section VI-B, the proposed estimators are examined in a real room environment, by utilizing them for the task of speech dereverberation and noise reduction using the MCWF.
VI-A Monte-Carlo Simulation
VI-A1 Simulation Setup
In order to evaluate the accuracy of the proposed estimators, synthetic data was generated according to the signal model in (6), by simulating i.i.d. snapshots of single-tone signals, having a frequency of Hz. The signals are captured by a uniform linear array (ULA) with microphones, and inter-distance between adjacent microphones. The desired signal component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD . The RDTF is given by
where is the time difference of arrival (TDOA) w.r.t. the reference microphone, given by , and is the DOA, defined as the broadside angle measured w.r.t. the perpendicular to the array. The reverberation component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD matrix , where is modelled as an ideal spherical diffuse sound field, given by . The noise component was constructed as where denotes the noise sources, drawn according to a zero-mean complex Gaussian distribution with a random PSD matrix , and is an random ATF matrix. For the estimation procedure, is extracted by applying the EVD to a set of noisy training samples, generated with different .
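For concreteness, a far-field RDTF of a uniform linear array can be generated as follows. The sketch assumes the common phase-ramp model in which each microphone is delayed by an integer multiple of the TDOA between adjacent microphones; the sign convention, the function name `ula_rdtf`, and the numerical values (8 microphones, 6 cm spacing, 30 degrees, 2 kHz) are assumptions, not the settings used in the paper.

```python
import numpy as np

def ula_rdtf(num_mics, spacing_m, doa_rad, freq_hz, c=343.0):
    """Relative transfer function of a far-field source w.r.t. the first microphone."""
    tau = spacing_m * np.sin(doa_rad) / c          # TDOA between adjacent microphones
    return np.exp(-2j * np.pi * freq_hz * tau * np.arange(num_mics))

g = ula_rdtf(num_mics=8, spacing_m=0.06, doa_rad=np.deg2rad(30.0), freq_hz=2000.0)
g_broadside = ula_rdtf(8, 0.06, 0.0, 2000.0)       # source perpendicular to the array
```

A broadside source reaches all microphones simultaneously, so its RDTF is a vector of ones.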
In the sequel, we examine the proposed estimators and bounds as a function of the model parameters. Specifically, the influence of the following parameters is examined: i) number of snapshots ; ii) reverberation PSD value ; iii) speech PSD value ; and iv) noise power , which is defined as the Frobenius norm of the noise PSD matrix, i.e. . In each experiment, we changed the value of one parameter, while keeping the rest fixed. The nominal values of the parameters are presented in Table II.
For each scenario, we carried out Monte-Carlo trials. The reverberation PSD was estimated in each trial with both (26) and (50), the noise PSD was estimated with both (33) and (52) and the speech PSD was estimated with (29). The accuracy of the estimators was evaluated using the normalized mean square error (nMSE), by averaging over the Monte-Carlo trials and normalizing w.r.t. the square of the corresponding PSD value. For each quantity, the corresponding normalized CRB was also computed, in order to demonstrate the theoretical lower bound on the nMSE.
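The normalized error measure described above can be written in a few lines. The helper name `nmse` is hypothetical, and the sketch covers only scalar PSD values.

```python
import numpy as np

def nmse(estimates, true_value):
    """Mean squared error over the trials, normalized by the squared true value."""
    est = np.asarray(estimates)
    return np.mean(np.abs(est - true_value) ** 2) / np.abs(true_value) ** 2

# Two trials that miss a unit PSD by 10% in opposite directions
example = nmse([1.1, 0.9], 1.0)
```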
VI-A2 Simulation Results
In Fig. 1(a), the nMSEs are presented as a function of the number of snapshots, . Clearly, the nMSEs of all the estimators decrease as the number of snapshots increases. As expected from the analytical study, it is evident that the non-blocking-based and the blocking-based MLEs yield the same nMSE, for both the reverberation and noise PSDs. Furthermore, for all quantities the nMSEs coincide with the corresponding CRBs.