ML Estimation and CRBs for Reverberation, Speech and Noise PSDs in Rank-Deficient Noise-Field

07/22/2019 ∙ by Yaron Laufer, et al. ∙ Bar-Ilan University

Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional interference sources, whose number is less than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator, which jointly estimates the speech, reverberation and noise PSDs, and the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.

I Introduction

In many hands-free scenarios, the measured microphone signals suffer from additive background noise, which may originate both from environmental sources and from microphone responses. Apart from noise, if the recording takes place in an enclosed space, the recorded signals may also contain multiple sound reflections from walls and other objects in the room, resulting in reverberation. As the level of noise and reverberation increases, the perceived quality and intelligibility of the speech signal deteriorate, which in turn affects the performance of speech communication systems, as well as automatic speech recognition (ASR) systems.

In order to reduce the effects of reverberation and noise, speech enhancement algorithms are required, which aim at recovering the clean speech source from the recorded microphone signals. Speech dereverberation and noise reduction algorithms often require the power spectral densities (PSDs) of the speech, reverberation and noise components. In the multichannel framework, a commonly used assumption (see e.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) is that the late reverberant signal is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a spherical diffuse time-invariant spatial coherence matrix. As the spatial coherence matrix depends only on the microphone geometry, it can be calculated in advance. However, the reverberation PSD is an unknown parameter that should be estimated. Numerous methods exist for estimating the reverberation PSD. They are broadly divided into two classes, namely non-blocking-based estimators and blocking-based estimators. The non-blocking-based approach jointly estimates the PSDs of the late reverberation and speech. The estimation is carried out using the maximum likelihood (ML) criterion [3, 6] or in the least-squares (LS) sense, by minimizing the Frobenius norm of an error PSD matrix [8]. In the blocking-based method, the desired speech signal is first blocked using a blocking matrix (BM), and then the reverberation PSD is estimated. Estimators in this class are also based on the ML approach [1, 5, 7, 9] or the LS criterion [2, 4].

None of the previously mentioned methods includes an estimator for the noise PSD. In [1, 3], a noiseless scenario is assumed. In [2, 4, 5, 6, 8, 7, 9, 10], the noise PSD matrix is assumed to be known in advance, or an estimate of it is assumed to be available. Typically, the noise PSD matrix is assumed to be time-invariant, and can therefore be estimated during speech-absent periods using a voice activity detector (VAD). However, in practical acoustic scenarios the spectral characteristics of the noise might be time-varying, e.g. when the noise environment includes a background radio or TV, and thus a VAD-based algorithm may fail. Therefore, the noise PSD matrix has to be included in the estimation procedure.

Some papers in the field deal with performance analysis of the proposed estimators. We give a brief review of the commonly used tools to assess the quality of an estimator. Theoretical analysis of estimators typically consists of calculating the bias and the mean square error (MSE), which coincides with the variance for unbiased estimators. The Cramér-Rao Bound (CRB) is an important tool to evaluate the quality of any unbiased estimator, since it gives a lower bound on the MSE. An estimator that is unbiased and attains the CRB is called efficient. The maximum likelihood estimator (MLE) is asymptotically efficient [11], namely it attains the CRB when the number of samples is large.

Theoretical analysis of PSD estimators in the noise-free scenario was addressed in [12, 13]. In [12], CRBs were derived for the reverberation and the speech MLEs proposed in [3]. These MLEs are efficient, i.e. attain the CRB for any number of samples. In addition, it was pointed out that the non-blocking-based reverberation MLE derived in [3] is identical to the blocking-based MLE proposed in [1]. In [13], it was shown that the non-blocking-based reverberation MLE of [3] obtains a lower MSE compared to a noiseless version of the blocking-based LS estimator derived in [2].

In the noisy case, quality assessment was discussed in [9] and [14]. In [9], it was numerically demonstrated that an iterative blocking-based MLE yields lower MSE than the blocking-based LS estimator proposed in [2]. In [14], closed-form CRBs were derived for the two previously proposed MLEs of the reverberation PSD, namely the blocking-based estimator in [5] and the non-blocking-based estimator in [6]. The CRB for the non-blocking-based reverberation estimator was shown to be lower than the CRB for the blocking-based estimator. However, it was shown that in the noiseless case, both reverberation MLEs are identical and both CRBs coincide.

As opposed to previous works, the assumption of a known noise PSD matrix is not made in [15]. The noise is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a time-invariant spatial coherence matrix. It is assumed that the spatial coherence matrix of the noise is known in advance, while the time-varying PSD is unknown. Two different estimators were developed, based on the LS method. In the first one, a joint estimator for the speech, noise and late reverberation PSDs was developed. As an alternative, a blocking-based estimator was proposed, in which the speech signal is first blocked by a BM, and then the noise and reverberation PSDs are jointly estimated. However, this model only fits spatially homogeneous noise fields that are characterized by a full-rank covariance matrix. Moreover, in [9] it was claimed that the ML approach is preferable to the LS estimation procedure.

In this paper, we treat the noise PSD matrix as an unknown parameter. We assume that the noise PSD matrix is a rank-deficient matrix, as opposed to the spatially homogeneous assumption considered in [15]. This scenario arises when the noise signal consists of a set of directional interfering sources, whose number is smaller than the number of microphones. We assume that the positions of the interfering sources are fixed, while their spectral PSD matrix is time-varying, e.g. when the acoustic environment includes radio or TV. It should be emphasized that, in contrast to [15], which estimates only a scalar PSD of the noise, in our model the entire spectral PSD matrix of the noise is estimated, and thus the case of multiple non-stationary noise sources can be handled. We derive closed-form MLEs of the various PSDs, for both the non-blocking-based and the blocking-based methods. The proposed estimators are analytically studied and compared, and the corresponding MSE expressions are derived. Furthermore, CRBs for estimating the various PSDs are derived.

An important benefit of treating rank-deficient noise as a separate problem lies in the form of the solution. In the ML framework, a closed-form solution exists for the noiseless case [1, 3] but not for the full-rank noise scenario, which thus requires iterative optimization techniques [5, 6, 9] (as opposed to the LS method, which has closed-form solutions in both cases). However, we show here that when the noise PSD matrix is rank-deficient, a closed-form MLE exists, which yields a simpler and faster estimation procedure with low computational complexity that is not sensitive to local maxima.

The remainder of the paper is organized as follows. Section II introduces some notations and preliminary notes. Section III presents the problem formulation, and describes the probabilistic model. Section IV derives the MLEs for both the non-blocking-based and the blocking-based methods, and Section V presents the CRB derivation. Section VI demonstrates the performance of the proposed estimators by an experimental study based on both simulated data and recorded room impulse responses. The paper is concluded in Section VII.

II Notation and Preliminaries

In this work, scalars are denoted with regular lowercase letters, vectors are denoted with bold lowercase letters and matrices are denoted with bold uppercase letters. A list of notations used in our derivations is given in Table I.

(·)^T      transpose
(·)^H      conjugate transpose
(·)^*      complex conjugate
|·|        determinant of a matrix
Tr[·]      trace of a matrix
⊗          Kronecker product
vec(·)     stacking the columns of a matrix on top of one another
TABLE I: Notation

For a random vector , a multivariate complex Gaussian probability density function (PDF) is given by [16]:

(1)

where is the mean vector and is an Hermitian positive definite complex covariance matrix. For a positive definite Hermitian form , where and a Hermitian matrix, the variance is given by [11, p. 513, Eq. (15.29-15.30)]:

(2)

For the Kronecker product, the following identities hold [17]:

(3)
(4)
(5)
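The identities (3)–(5) themselves are not reproduced above. As a hedged illustration only, the following sketch numerically checks three standard Kronecker/vec identities of the kind commonly used in such derivations; which of them correspond to (3)–(5) is an assumption, and the helper names `crandn` and `vec` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def crandn(*shape):
    """Random complex matrix (illustrative helper, not from the paper)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

def vec(M):
    """Stack the columns of M on top of one another."""
    return M.reshape(-1, order="F")

A, X, B = crandn(3, 4), crandn(4, 2), crandn(2, 5)
C, D = crandn(4, 3), crandn(5, 2)
Y = crandn(3, 4)

# vec(A X B) = (B^T kron A) vec(X)
id1 = np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))

# Tr(A^H Y) = vec(A)^H vec(Y)
id2 = np.isclose(np.trace(A.conj().T @ Y), vec(A).conj() @ vec(Y))

# (A kron B)(C kron D) = (A C) kron (B D)
id3 = np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

print(id1, id2, id3)
```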

III Problem Formulation

III-A Signal Model

Consider a speech signal received by microphones, in a noisy and reverberant acoustic environment. We work with the short-time Fourier transform (STFT) representation of the measured signals. Let denote the frequency bin index, and denote the time frame index. The -channel observation signal writes

(6)

where is defined as the direct speech component, as received by the first microphone (designated as a reference microphone), is the time-invariant relative direct-path transfer function (RDTF) vector between the reference microphone and all microphones, denotes the late reverberation and denotes the noise. It is assumed that the noise signal consists of interfering sources, i.e.

(7)

where denotes the vector of noise sources and is the noise acoustic transfer function (ATF) matrix, assumed to be time-invariant. It is assumed that .

III-B Probabilistic Model

The speech STFT coefficients are assumed to follow a zero-mean complex Gaussian distribution with a time-varying PSD . Hence, the PDF of the speech writes:

(8)

The late reverberation signal is modelled by a zero-mean complex multivariate Gaussian distribution:

(9)

The reverberation PSD matrix is modelled as a spatially homogeneous and isotropic sound field, with a time-varying PSD, . It is assumed that the time-invariant coherence matrix can be modelled by a spherically diffuse sound field [18]:

(10)

where , is the inter-distance between microphones and , denotes the sampling frequency and is the sound velocity.
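As a minimal sketch of the spherically diffuse coherence model in (10), the snippet below builds the coherence matrix from microphone positions, assuming the usual sin(x)/x form with x = 2πf·d_ij/c. The function name `diffuse_coherence`, the array geometry and the frequency are hypothetical; the frequency is passed directly in Hz rather than through a bin index and sampling frequency.

```python
import numpy as np

def diffuse_coherence(mic_pos, freq_hz, c=343.0):
    """Spherically diffuse coherence: Gamma_ij = sin(x)/x, x = 2*pi*f*d_ij/c."""
    mic_pos = np.asarray(mic_pos, dtype=float)
    # Pairwise inter-microphone distances d_ij in metres.
    d = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    # np.sinc(t) = sin(pi t)/(pi t), hence the argument 2 f d / c.
    return np.sinc(2.0 * freq_hz * d / c)

# Hypothetical 4-microphone linear array with 6 cm spacing, evaluated at 2 kHz.
positions = np.array([[0.06 * i, 0.0, 0.0] for i in range(4)])
Gamma = diffuse_coherence(positions, freq_hz=2000.0)
print(np.round(Gamma, 3))
```

Since the matrix depends only on the geometry and the frequency, it can be computed once per frequency bin and reused across all time frames.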

The noise sources vector is modelled by a zero-mean complex multivariate Gaussian distribution with a time-varying PSD matrix :

(11)

Using (7) and (11), it follows that has a zero-mean complex multivariate Gaussian distribution with a PSD matrix , given by

(12)

The PDF of therefore writes

(13)

where is the PSD matrix of the input signals. Assuming that the components in (6) are independent, is given by

(14)

A commonly used dereverberation and noise reduction technique is to estimate the speech signal using the multichannel minimum mean square error (MMSE) estimator, which yields the multichannel Wiener filter (MCWF), given by [19]:

(15)

where

(16)

denotes the total interference PSD matrix. For implementing (15), we assume that the RDTF vector and the spatial coherence matrix are known in advance. The RDTF depends only on the direction of arrival (DOA) of the speaker and the geometry of the microphone array, and thus it can be constructed based on a DOA estimate. The spatial coherence matrix is calculated using (10), based on the spherical diffuseness assumption.

The noise ATF matrix is in general not available (since such an estimate requires each noise source to be active separately). To circumvent the problem, we assume that a speech-absent segment (where all noise sources are active) is available, in which we apply the eigenvalue decomposition (EVD) to the noise PSD matrix . Note that , i.e. is a rank-deficient matrix. Based on the computed eigenvalues and eigenvectors, a – rank representation of the noise PSD matrix is given by

(17)

where is the eigenvalues matrix (comprised of the non-zero eigenvalues) and is the corresponding eigenvectors matrix. is a basis that spans the noise ATFs subspace, and thus [20]

(18)

where consists of projection coefficients of the original ATFs on the basis vectors. Substituting (18) into (7) and using (6), yields

(19)

where . It follows that the noise PSD matrix in (12) can be recast as

(20)

where . Using this basis change, the MCWF in (15) is now computed with

(21)

As a result, rather than requiring the knowledge of the exact noise ATF matrix, we use that is learned from a speech-absent segment. Due to this basis change, we will need to estimate instead of .
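The basis change in (17)–(20) can be sketched numerically as follows. This is an illustration under assumed dimensions with randomly drawn matrices; the variable names (`A`, `Psi_v`, `V`, `G`, `Psi_u`) are chosen here for readability and are assumptions about the notation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 2  # hypothetical: number of microphones and noise sources (T < N)

# Random noise ATF matrix (N x T) and a Hermitian positive definite source PSD matrix (T x T).
A = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
Psi_v = S @ S.conj().T + np.eye(T)

Phi_v = A @ Psi_v @ A.conj().T          # rank-T noise PSD matrix, as in (12)

# EVD of the Hermitian matrix; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Phi_v)
V = eigvecs[:, -T:]                      # eigenvectors of the T non-zero eigenvalues
Lam = np.diag(eigvals[-T:])              # matrix of non-zero eigenvalues

G = V.conj().T @ A                       # projection coefficients of the ATFs on V
Psi_u = G @ Psi_v @ G.conj().T           # noise PSD matrix expressed in the basis V

print(np.linalg.matrix_rank(Phi_v))
```

In this single-segment toy case `Psi_u` coincides with `Lam`; in the model above the basis is kept fixed while the PSD matrix in that basis is allowed to vary over time.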

Clearly, estimators of the late reverberation , speech and noise PSD are required for evaluating the MCWF. For the sake of brevity, the frame index and the frequency bin index are henceforth omitted whenever possible.

IV ML Estimators

We propose two ML-based methods: (i) Non-blocking-based estimation: Simultaneous ML estimation of the speech, reverberation and noise PSDs; and (ii) Blocking-based estimation: Elimination of the speech PSD using a BM, and then joint ML estimation of the reverberation and noise PSDs. Both methods are then compared and analyzed.

IV-A Non-Blocking-Based Estimation

We start with the joint ML estimation of the reverberation, speech and noise PSDs. Based on the short-time stationarity assumption [12, 9], it is assumed that the PSDs are approximately constant across a small number of consecutive time frames, denoted by . We therefore denote as the concatenation of previous observations of :

(22)

The set of unknown parameters is denoted by , where . Assuming that the consecutive signals in are i.i.d., the PDF of writes (see e.g. [14]):

(23)

where is the sample covariance matrix, given by

(24)
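The sample covariance matrix in (24) is the average of the outer products of the concatenated snapshots. A minimal sketch, assuming the snapshots are stored as the columns of an array (the function name and sizes are hypothetical):

```python
import numpy as np

def sample_covariance(Y):
    """Y has shape (N, L): L snapshots of an N-channel STFT vector."""
    L = Y.shape[1]
    return (Y @ Y.conj().T) / L

rng = np.random.default_rng(2)
N, L = 4, 8  # hypothetical sizes
Y = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))
R_y = sample_covariance(Y)

# Equivalent explicit form: average of the outer products y y^H.
R_loop = sum(np.outer(Y[:, l], Y[:, l].conj()) for l in range(L)) / L
print(np.allclose(R_y, R_loop))
```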

The MLE of the set is therefore given by

(25)

To the best of our knowledge, for the general noisy scenario this problem is considered as having no closed-form solution. However, we will show that when the noise PSD matrix is rank-deficient, with , a closed-form solution exists. In the following, we present the proposed estimators. The detailed derivations appear in the Appendices.

In Appendix A, it is shown that the MLE of is given by:

(26)

where is given by

(27)

and is the speech-plus-noise subspace

(28)

The matrix is a projection matrix onto the subspace orthogonal to the speech-plus-noise subspace. The role of is to block the directions of the desired speech and noise signals, in order to estimate the reverberation level.
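The following sketch illustrates the structure described above: a projection that blocks the speech-plus-noise subspace, followed by a normalized trace. The exact expressions (26)–(28) are not reproduced here, so the form used in the code is an assumption consistent with the description. The coherence matrix is a hypothetical positive definite stand-in and all numerical values are illustrative.

```python
import numpy as np

def reverb_psd_mle(R_y, Gamma, g, V):
    """Sketch of a closed-form reverberation PSD estimate (assumed form)."""
    N, T = V.shape
    C = np.column_stack([g, V])                    # speech-plus-noise subspace, N x (T+1)
    Gi = np.linalg.inv(Gamma)
    M = C.conj().T @ Gi
    # Projection onto the complement of span(C), weighted by Gamma^{-1}.
    Q = np.eye(N) - C @ np.linalg.solve(M @ C, M)
    phi_R = np.real(np.trace(Q @ R_y @ Gi)) / (N - (T + 1))
    return phi_R, Q

rng = np.random.default_rng(3)
N, T = 6, 2
idx = np.arange(N)
Gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # stand-in coherence matrix (assumption)
g = np.exp(-1j * 0.7 * idx)                         # stand-in RDTF, reference entry equals 1
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
Psi_u = np.array([[2.0, 0.5 + 0.2j], [0.5 - 0.2j, 1.0]])
phi_S, phi_R = 1.5, 0.4

# True PSD matrix of the observations: speech + reverberation + rank-T noise.
Phi_y = phi_S * np.outer(g, g.conj()) + phi_R * Gamma + V @ Psi_u @ V.conj().T

# Feeding the true PSD matrix instead of a sample covariance returns phi_R exactly.
phi_R_hat, Q = reverb_psd_mle(Phi_y, Gamma, g, V)
print(phi_R_hat)
```

Because the projection annihilates both the speech and the noise directions, only the reverberation term survives the trace, and the normalization by N − (T + 1) equals the trace of the projection.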

Once we obtain the MLE for the late reverberation PSD, the MLEs for the speech and noise PSDs can be computed. In Appendix B, it is shown that the MLE for the speech PSD writes

(29)

where is a minimum variance distortionless response (MVDR) beamformer that extracts the speech signal while eliminating the noise, given by

(30)

and is a projection matrix onto the subspace orthogonal to the noise subspace, given by

(31)

Note that

(32)

The estimator in (29) can be interpreted as the variance of the noisy observations minus the estimated variance of the reverberation, at the output of the MVDR beamformer [12].

In Appendix C, it is shown that the MLE of the noise PSD can be computed with

(33)

where is a multi-source linearly constrained minimum variance (LCMV) beamformer that extracts the noise signals while eliminating the speech signal:

(34)

and is a projection matrix onto the subspace orthogonal to the speech subspace, given by

(35)

Note that

(36)

Interestingly, the projection matrix can be recast as a linear combination of the above beamformers (see Appendix D):

(37)

Using (32), (36) and (37), it can also be noted that is orthogonal to both beamformers

(38)
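The relations (29)–(38) can be checked numerically with the sketch below. It assumes that both beamformers can be obtained jointly as the columns of a single LCMV solution with constraint matrix [g, V]. This is an assumption made for compactness, not necessarily the form of (30) and (34), although it satisfies the distortionless and nulling constraints described above. All values and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 6, 2
idx = np.arange(N)
Gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # stand-in coherence matrix (assumption)
g = np.exp(-1j * 0.7 * idx)                         # stand-in RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
Psi_u = S @ S.conj().T + np.eye(T)
phi_S, phi_R = 2.0, 0.5
Phi_y = phi_S * np.outer(g, g.conj()) + phi_R * Gamma + V @ Psi_u @ V.conj().T

C = np.column_stack([g, V])
Gi = np.linalg.inv(Gamma)
W = Gi @ C @ np.linalg.inv(C.conj().T @ Gi @ C)     # joint LCMV, satisfies W^H C = I
w_s, W_u = W[:, 0], W[:, 1:]                        # speech and noise beamformers

Q = np.eye(N) - C @ W.conj().T                      # projection as a combination of beamformers
phi_R_hat = np.real(np.trace(Q @ Phi_y @ Gi)) / (N - T - 1)

E = Phi_y - phi_R_hat * Gamma                       # subtract the reverberation contribution
phi_S_hat = np.real(w_s.conj() @ E @ w_s)           # variance at the speech beamformer output
Psi_u_hat = W_u.conj().T @ E @ W_u                  # PSD matrix at the noise beamformer output

print(phi_S_hat, np.allclose(Psi_u_hat, Psi_u))
```

With the true PSD matrix as input, the three estimates reproduce the true speech, reverberation and noise PSDs, and the projection is annihilated by both beamformers.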

In the noiseless case, i.e. when , reduces to

(39)

where , leading to the same closed-form estimators as in [3, Eq. (7)]:

(40)
(41)

IV-B Blocking-Based Estimation

As a second approach, we first block the speech component using a BM, and then jointly estimate the PSDs of the reverberation and noise. Let denote the BM, which satisfies . The output of the BM is given by

(42)

The PDF of therefore writes:

(43)

where the PSD matrix is given by

(44)

where is the total interference matrix, defined in (16). Under this model, the parameter set of interest is . Similarly to , it is assumed that is fixed during the entire segment. Let be defined similarly to in (22). Assuming again i.i.d. concatenated snapshots, the PDF of writes

(45)

where is given by

(46)

The MLE of is obtained by solving:

(47)

To the best of our knowledge, this problem is also considered as having no closed-form solution. Again, we argue that if the noise PSD matrix satisfies , then we can obtain a closed-form solution. Multiplying (20) from the left by and from the right by , the noise PSD matrix at the output of the BM writes

(48)

where is the reduced noise subspace:

(49)

In Appendix E, the following MLE is obtained:

(50)

where is given by

(51)

After the BM is applied, the remaining role of is to block the noise signals, in order to estimate the reverberation level. Note that .

Given , it is shown in Appendix F that the MLE for the noise PSD writes

(52)

where is a multi-source LCMV beamformer, directed towards the noise signals after the BM, given by

(53)

Note that with this notation, in (51) can be recast as

(54)

Since , it also follows that

(55)

Also, in Appendix G it is shown that

(56)

namely the LCMV of (34), used in the non-blocking-based approach, can be factorized into two stages: The first is a BM that blocks the speech signal, followed by a modified LCMV, which recovers the noise signals at the output of the BM.
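As a hedged numerical illustration of the two-stage structure, the sketch below computes a reverberation PSD estimate once directly from the microphone signals and once at the output of a blocking matrix, using an arbitrary sample covariance. The estimator forms are assumptions consistent with the descriptions of (26) and (50), and the blocking matrix is one hypothetical choice (an orthonormal basis of the complement of the RDTF vector).

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, L = 6, 2, 20
idx = np.arange(N)
Gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # stand-in coherence matrix (assumption)
g = np.exp(-1j * 0.7 * idx)                         # stand-in RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]

# Arbitrary data: the comparison does not rely on the data following the model.
Y = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))
R_y = Y @ Y.conj().T / L

# Non-blocking-based estimate: block speech and noise in one projection.
C = np.column_stack([g, V])
Gi = np.linalg.inv(Gamma)
M = C.conj().T @ Gi
Q = np.eye(N) - C @ np.linalg.solve(M @ C, M)
phi_nb = np.real(np.trace(Q @ R_y @ Gi)) / (N - T - 1)

# Blocking matrix: N x (N-1) with B^H g = 0, taken from the SVD of g.
U, _, _ = np.linalg.svd(g[:, None], full_matrices=True)
B = U[:, 1:]

# Blocking-based estimate: quantities at the output of the BM.
R_z = B.conj().T @ R_y @ B
Gz_i = np.linalg.inv(B.conj().T @ Gamma @ B)
Vt = B.conj().T @ V                                 # reduced noise subspace
Mz = Vt.conj().T @ Gz_i
Qz = np.eye(N - 1) - Vt @ np.linalg.solve(Mz @ Vt, Mz)
phi_b = np.real(np.trace(Qz @ R_z @ Gz_i)) / (N - 1 - T)

print(phi_nb, phi_b)
```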

IV-C Comparing the MLEs

In this section, the obtained blocking-based and non-blocking-based MLEs are compared. We will use the following identity, that is proved in [14, Appendix A]:

(57)

Substituting (35) into (57) yields

(58)

IV-C1 Comparing the reverberation PSD estimators

First, we compare the reverberation PSD estimators in (26) and (50). Substituting (54) into (50) and then using (46), (49), (56) and (57), yields the following equation:

(59)

Using (37) and noting that , yields (26). It follows that both estimators are identical:

(60)

It should be noted that in [12, 14] the two MLEs of the reverberation PSD were shown to be identical in the noiseless case. Here we extend this result to the noisy case, when the noise PSD matrix is a rank-deficient matrix.

IV-C2 Comparing the noise PSD estimators

The noise PSD estimators in (33) and (52) are now compared. Substituting (46) into (52) and then using (56) and (60), yields the same expression as in (33), and therefore

(61)

IV-D MSE Calculation

In the sequel, the theoretical performance of the proposed PSD estimators is analyzed. Since the non-blocking-based and the blocking-based MLEs were proved in Section IV-C to be identical for both reverberation and noise PSDs, it suffices to analyze the non-blocking-based MLEs.

IV-D1 Theoretical performance of the reverberation PSD estimators

It is well known that for an unbiased estimator, the MSE is identical to the variance. We therefore start by showing that the non-blocking-based MLE in (26) is unbiased. Using (24), the expectation of (26) writes

(62)

Then, we use the following property (see (85d)):

(63)

to obtain

(64)

It follows that the reverberation MLE is unbiased, and thus the MSE is identical to the variance. Using the i.i.d. assumption, the variance of the non-blocking-based MLE in (26) is given by

(65)

In order to simplify (65), we use the identity in (2). Since and is a Hermitian matrix (note that ), we obtain

(66)

Finally, using (63) and (85b), the variance writes

(67)

Note that in the noiseless case, namely , the variance reduces to the one derived in [12, 13].
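The variance computation can be sketched numerically using the quadratic-form identity for zero-mean complex Gaussian vectors, var(y^H A y) = Tr(A Φ A Φ) for Hermitian A. The closed form it is compared against, φ_R² / (L (N − T − 1)), is an assumption about the final expression in (67), chosen because it reduces to the familiar noiseless form when T = 0. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, L = 6, 2, 10
idx = np.arange(N)
Gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # stand-in coherence matrix (assumption)
g = np.exp(-1j * 0.7 * idx)                         # stand-in RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
Psi_u = S @ S.conj().T + np.eye(T)
phi_S, phi_R = 2.0, 0.5
Phi_y = phi_S * np.outer(g, g.conj()) + phi_R * Gamma + V @ Psi_u @ V.conj().T

C = np.column_stack([g, V])
Gi = np.linalg.inv(Gamma)
M = C.conj().T @ Gi
Q = np.eye(N) - C @ np.linalg.solve(M @ C, M)

# The estimate is the average over L snapshots of the quadratic form y^H A y.
A = Gi @ Q / (N - T - 1)

# Variance of one quadratic form, then of the average of L i.i.d. terms.
var_single = np.real(np.trace(A @ Phi_y @ A @ Phi_y))
var_mle = var_single / L

var_closed = phi_R**2 / (L * (N - T - 1))           # assumed closed form
print(var_mle, var_closed)
```

The speech and noise PSDs do not appear in the result, since the projection removes both components before the quadratic form is evaluated.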

IV-D2 Theoretical performance of the noise PSD estimators

Using (24), (III-B) and (36), and based on the unbiasedness of , it can be shown that in (33) is an unbiased estimator of .

Next, we calculate the variance of the diagonal terms of . To this end, we write the entry of in (33) as

(68)

for , where is the column of the matrix in (34). Using a partitioned matrix to simplify , it can be shown that

(69)

where is composed of all the vectors in except , i.e. , and is the corresponding projection matrix onto the subspace orthogonal to

(70)

It can be verified that . Denote the diagonal terms of by . In Appendix H, it is shown that

(71)

where is defined as the noise-to-reverberation ratio at the output of :

(72)

IV-D3 Theoretical performance of the speech PSD estimator

Using (24), (III-B) and (32) and based on the unbiasedness of , it can be shown that is an unbiased estimator of . In a similar manner to (IV-D2), the variance of (29) can be shown to be

(73)

where is defined as the signal-to-reverberation ratio at the output of :

(74)

When the reverberation level is low, (IV-D3) reduces to In the noiseless case, i.e. , (IV-D3) becomes identical to the variance derived in [12, 13].

V CRB Derivation

In this section, we derive the CRB on the variance of any unbiased estimator of the various PSDs.

V-A CRB for the Late Reverberation PSD

In Appendix I, it is shown that the CRB on the reverberation PSD writes

(75)

The resulting CRB is identical to the MSE derived in (67), and thus the proposed MLE is an efficient estimator.

V-B CRB for the Speech and Noise PSDs

The CRB on the speech PSD is identical to the MSE derived in (IV-D3), as outlined in Appendix J. The CRB on the noise PSD can be derived similarly. We conclude that the proposed PSD estimators are efficient.

VI Experimental Study

In this section, the proposed MLEs are evaluated in a synthetic Monte-Carlo simulation as well as on measurements of a real room environment. In Section VI-A, a Monte-Carlo simulation is conducted in which signals are generated synthetically based on the assumed statistical model. The sensitivity of the proposed MLEs is examined with respect to the various model parameters, and the MSEs of the proposed MLEs are compared to the corresponding CRBs. In Section VI-B, the proposed estimators are examined in a real room environment, by utilizing them for the task of speech dereverberation and noise reduction using the MCWF.

VI-A Monte-Carlo Simulation

VI-A1 Simulation Setup

In order to evaluate the accuracy of the proposed estimators, synthetic data was generated according to the signal model in (6), by simulating i.i.d. snapshots of single-tone signals, having a frequency of  Hz. The signals are captured by a uniform linear array (ULA) with microphones, and inter-distance between adjacent microphones. The desired signal component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD . The RDTF is given by

(76)

where is the time difference of arrival (TDOA) w.r.t. the reference microphone, given by , and is the DOA, defined as the broadside angle measured w.r.t. the perpendicular to the array. The reverberation component was drawn according to a complex Gaussian distribution with a zero-mean and a PSD matrix , where is modelled as an ideal spherical diffuse sound field, given by . The noise component was constructed as where denotes the noise sources, drawn according to a zero-mean complex Gaussian distribution with a random PSD matrix , and is an random ATF matrix. For the estimation procedure, is extracted by applying the EVD to a set of noisy training samples, generated with different .
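A data generator along the lines of this setup can be sketched as follows. The far-field ULA steering form assumed for (76), the helper names, and every numerical value (array size, spacing, frequency, DOA, PSD levels) are hypothetical placeholders, since the actual settings are not reproduced above.

```python
import numpy as np

def ula_rdtf(N, d, theta_deg, freq_hz, c=343.0):
    """Far-field RDTF of a ULA relative to the first microphone (assumed form)."""
    tau = d * np.sin(np.deg2rad(theta_deg)) / c      # TDOA between adjacent microphones
    return np.exp(-1j * 2 * np.pi * freq_hz * tau * np.arange(N))

def complex_gaussian(rng, cov, L):
    """Draw L i.i.d. zero-mean circular complex Gaussian vectors with covariance cov."""
    n = cov.shape[0]
    chol = np.linalg.cholesky(cov)
    w = (rng.standard_normal((n, L)) + 1j * rng.standard_normal((n, L))) / np.sqrt(2)
    return chol @ w

rng = np.random.default_rng(7)
N, T, L = 4, 2, 100                 # microphones, noise sources, snapshots (placeholders)
d, f, c = 0.06, 2500.0, 343.0       # spacing [m], frequency [Hz], sound velocity [m/s]
phi_S, phi_R = 1.0, 0.5             # placeholder PSD levels

g = ula_rdtf(N, d, theta_deg=30.0, freq_hz=f, c=c)
idx = np.arange(N)
Gamma = np.sinc(2.0 * f * d * np.abs(idx[:, None] - idx[None, :]) / c)  # diffuse coherence

A = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))      # random ATF matrix
S = rng.standard_normal((T, T)) + 1j * rng.standard_normal((T, T))
Psi_v = S @ S.conj().T + np.eye(T)                                       # random PSD matrix

s = np.sqrt(phi_S) * complex_gaussian(rng, np.eye(1), L)[0]  # direct speech at the reference
r = complex_gaussian(rng, phi_R * Gamma, L)                   # late reverberation
v = complex_gaussian(rng, Psi_v, L)                           # noise sources

Y = np.outer(g, s) + r + A @ v       # observations: one column per snapshot
print(Y.shape)
```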

In the sequel, we examine the proposed estimators and bounds as a function of the model parameters. Specifically, the influence of the following parameters is examined: i) number of snapshots ; ii) reverberation PSD value ; iii) speech PSD value ; and iv) noise power , which is defined as the Frobenius norm of the noise PSD matrix, i.e. . In each experiment, we changed the value of one parameter, while keeping the rest fixed. The nominal values of the parameters are presented in Table II.

TABLE II: Nominal Parameters

For each scenario, we carried out Monte-Carlo trials. The reverberation PSD was estimated in each trial with both (26) and (50), the noise PSD was estimated with both (33) and (52) and the speech PSD was estimated with (29). The accuracy of the estimators was evaluated using the normalized mean square error (nMSE), by averaging over the Monte-Carlo trials and normalizing w.r.t. the square of the corresponding PSD value. For each quantity, the corresponding normalized CRB was also computed, in order to demonstrate the theoretical lower bound on the nMSE.
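The evaluation procedure can be sketched for the reverberation PSD as follows. The estimator form and the normalized bound 1 / (L (N − T − 1)) are assumptions, and the dimensions, PSD values, number of trials and stand-in coherence matrix are hypothetical placeholders rather than the nominal values of Table II.

```python
import numpy as np

rng = np.random.default_rng(8)
N, T, L, trials = 6, 2, 50, 2000
idx = np.arange(N)
Gamma = 0.6 ** np.abs(idx[:, None] - idx[None, :])  # stand-in coherence matrix (assumption)
g = np.exp(-1j * 0.7 * idx)                         # stand-in RDTF
V = np.linalg.qr(rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))[0]
Psi_u = 2.0 * np.eye(T)
phi_S, phi_R = 1.0, 0.5
Phi_y = phi_S * np.outer(g, g.conj()) + phi_R * Gamma + V @ Psi_u @ V.conj().T

C = np.column_stack([g, V])
Gi = np.linalg.inv(Gamma)
M = C.conj().T @ Gi
Q = np.eye(N) - C @ np.linalg.solve(M @ C, M)
A = Gi @ Q / (N - T - 1)            # the estimate is the mean over snapshots of y^H A y

# Draw all snapshots of all trials at once from CN(0, Phi_y).
chol = np.linalg.cholesky(Phi_y)
W = (rng.standard_normal((N, trials * L)) + 1j * rng.standard_normal((N, trials * L))) / np.sqrt(2)
Y = chol @ W

# Quadratic form per snapshot, then average the L snapshots of each trial.
q = np.real(np.einsum("it,ij,jt->t", Y.conj(), A, Y))
phi_hat = q.reshape(trials, L).mean(axis=1)

nmse = np.mean((phi_hat - phi_R) ** 2) / phi_R**2   # normalized MSE over the trials
ncrb = 1.0 / (L * (N - T - 1))                      # assumed normalized bound
print(nmse, ncrb)
```

With a fixed seed the empirical nMSE lands within a few percent of the assumed bound; the remaining gap is Monte-Carlo fluctuation and shrinks as the number of trials grows.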

VI-A2 Simulation Results

In Fig. 1(a), the nMSEs are presented as a function of the number of snapshots, . Clearly, the nMSEs of all the estimators decrease as the number of snapshots increases. As expected from the analytical study, it is evident that the non-blocking-based and the blocking-based MLEs yield the same nMSE, for both the reverberation and noise PSDs. Furthermore, for all quantities the nMSEs coincide with the corresponding CRBs.