Robust Joint Estimation of Multi-Microphone Signal Model Parameters

10/12/2018 ∙ by Andreas I. Koutrouvelis, et al. ∙ 0

One of the biggest challenges in multi-microphone applications is the estimation of the parameters of the signal model such as the power spectral densities (PSDs) of the sources, the early (relative) acoustic transfer functions of the sources with respect to the microphones, the PSD of late reverberation, and the PSDs of microphone-self noise. Typically, the existing methods estimate subsets of the aforementioned parameters and assume some of the other parameters to be known a priori. This may result in inconsistencies and inaccurately estimated parameters and potential performance degradation in the applications using these estimated parameters. So far, there is no method to jointly estimate all the aforementioned parameters. In this paper, we propose a robust method for jointly estimating all the aforementioned parameters using confirmatory factor analysis. The estimation accuracy of the signal-model parameters thus obtained outperforms existing methods in most cases. We experimentally show significant performance gains in several multi-microphone applications over state-of-the-art methods.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Microphone arrays (see e.g., [1] for an overview) are used extensively in many applications, such as source separation [2, 3, 4, 5, 6], multi-microphone noise reduction [1, 7, 8, 9, 10, 11, 12, 13], dereverberation [14, 15, 16, 17, 18, 19], sound source localization [20, 21, 22, 23], and room geometry estimation [24, 25]. All the aforementioned applications are based on a similar multi-microphone signal model, typically depending on the following parameters: i) the early relative acoustic transfer functions (RATFs) of the sources with respect to the microphones; ii) the power spectral densities (PSDs) of the early components of the sources, iii) the PSD of the late reverberation, and, iv) the PSDs of the microphone-self noise. Other parameters, like the target cross power spectral density matrix (CPSDM), the noise CPSDM, source locations and room geometry information, can be inferred from (combinations of) the above mentioned parameters. Often, none of these parameters are known a priori, while estimation is challenging. Often, only a subset of the parameters is estimated, see e.g., [14, 15, 16, 17, 26, 19, 27, 28, 29, 30], typically requiring rather strict assumptions with respect to stationarity and/or knowledge of the remaining parameters.

In [15, 17] the target source PSD and the late reverberation PSD are jointly estimated assuming that the early RATFs of the target with respect to all microphones are known and all the remaining noise components (e.g., interferers) are stationary in time intervals typically much longer than a time frame. In [31, 26, 19], it was shown that the method in [15, 17] may lead to inaccurate estimates of the late reverberation PSD, when the early RATFs of the target include estimation errors. In [26, 19], a more accurate estimator for the late reverberation PSD was proposed, independent of early RATF estimation errors.

The methods proposed in [27, 28] do not assume that some noise components are stationary like in [17], but assume that the total noise CPSDM has a constant [27] or slow-varying [28] structure over time (i.e., it can be written as an unknown scaling parameter multiplied with a constant spatial structure matrix). This may not be realistic in practical acoustical scenarios, where different interfering sources change their power and location across time more rapidly and with different patterns. Moreover, these methods do not separate the late reverberation from the other noise components and only differentiate between the target source PSD and the overall noise PSD. As in [17], these methods assume that the early RATFs of the target are known. In [28], the structure of the noise CPSDM is estimated only in target-absent time-frequency tiles using a voice activity detector (VAD), which may lead to erroneous estimates if the spatial structure matrix of the noise changes during target-presence.

In [30]

, the early RATFs and the PSDs of all sources are estimated using the expectation maximization (EM) method 

[32]. This method assumes that only one source is active per time-frequency tile and the noise CPSDM (excluding the contributions of the interfering point sources) is estimated assuming it is time-invariant. Due to the time-varying nature of the late reverberation (included in the noise CPSDM), this assumption is often violated. This method does not estimate the time-varying PSD of the late reverberation, neither the PSDs of the microphone-self noise.

While the aforementioned methods focus on estimation of just one or several of the required model parameters, the method presented in [4] jointly estimates the early RATFs of the sources, the PSDs of the sources and the PSDs of the microphone-self noise. Unlike [30], the method in [4] does not assume single source activity per time-frequency tile and, thus, it is applicable to more general acoustic scenarios. The method in [4] is based on the non-orthogonal joint-diagonalization of the noisy CPSDMs. This method unfortunately does not guarantee non-negative estimated PSDs and, thus, the obtained target CPSDM may not be positive semidefinite resulting in performance degradation. Moreover, this approach does not estimate the PSD of the late reverberation. In conclusion, most methods only focus on the estimation of a subset of the required model parameters and/or rely on assumptions which may be invalid and/or impractical.

In this paper, we propose a method which jointly estimates all the aforementioned parameters of the multi-microphone signal model. The proposed method is based on confirmatory factor analysis (CFA) [33, 34, 35, 36] and on the non-orthogonal joint-diagonalization principle introduced in [4]. The combination of these two theories and the adjustment to the multi-microphone case gives us a robust method, which is applicable for temporally and spatially non-stationary sources. The proposed method uses linear constraints to reduce the feasibility set of the parameter space and thus increase robustness. Moreover, the proposed method guarantees positive estimated PSDs and, thus, positive semidefinite target and noise CPSDMs. Although generally applicable, in this manuscript, we will compare the performance of the proposed method with state-of-the-art approaches in the context of source separation and dereverberation.

The remaining of the paper is organized as follows. In Sec. II, the signal model, notation and used assumptions are introduced. In Sec. III, we review the CFA theory and its relation to the non-orthogonal joint diagonalization principle. In Sec. IV, the proposed method is introduced. In Sec. V, we introduce several constraints to increase the robustness of the proposed method. In Sec. VI, we discuss the implementation and practicality of the proposed method. In Sec. VII, we conduct experiments in several multi-microphone applications using the proposed method and existing state-of-the-art approaches. In Sec. VIII, we draw conclusions.

Ii Preliminaries

Ii-a Notation

We use lower-case letters for scalars, bold-face lower-case letters for vectors, and bold-face upper-case letters for matrices. A matrix

can be expressed as , where is its -th column. The elements of a matrix are denoted as . We use the operand to denote the trace of a matrix,

to denote the expected value of a random variable,

to denote the vector formed from the diagonal of a matrix , and to denote the Frobenius norm of a matrix. We use to form a square diagonal matrix with diagonal . A hermitian positive semi-definite matrix is expressed as , where

and its eigenvalues are real non-negative. The cardinality of a set is denoted as

. The minimum element of a vector is obtained via the operation .

Ii-B Signal Model

Consider an -element microphone array of arbitrary structure within a possibly reverberant enclosure, in which there are acoustic point sources (target and interfering sources). The

-th microphone signal (in the short-time Fourier transform (STFT) domain) is modeled as

(1)

where is the frequency-bin index; the time-frame index; and the early and late components of the -th point source, respectively; and denotes the microphone self-noise. The early components include the line of sight and a few initial strong reflections. The late components describe the effect of the remaining reflections and are usually referred to as late reverberation. The -th early component is given by

(2)

where is the corresponding RATF with respect to the -th microphone, the -th point-source at the reference microphone, is the index of a time-segment, which is a collection of time-frames. That is, we assume that the source signal can vary faster (from time-frame to time-frame) than the early RATFs, which are assumed to be constant over multiple time-frames (which we call a time-segment). By stacking all microphone recordings into vectors, the multi-microphone signal model is given by

(3)

where and all the other vectors can be similarly represented. If we assume that all sources in (3) are mutually uncorrelated and stationary within a time-frame, the signal model of the CPSDM of the noisy recordings is given by

(4)

where , is the PSD of the -th source at the reference microphone, the CPSDM of the late reverberation and is a diagonal matrix, which has as its diagonal elements the PSDs of the microphone-self noise. Note that and are time-frame varying, while the microphone-self noise PSDs are typically time-invariant. The CPSDM model in (4) can be re-written as

(5)

where and is commonly referred to as mixing matrix and has as its columns the early RATFs of the sources. As we work with RATFs, the row of corresponding to the reference microphone is equal to a vector with only ones. Moreover, is a diagonal matrix, where .

Ii-C Late Reverberation Model

A commonly used assumption (adopted in this paper) is that the late reverberation CPSDM has a known spatial structure, , which is time-invariant but varying over frequency [14, 17]. Under the constant spatial-structure assumption, is modeled as [14, 17]

(6)

with the PSD of the late reverberation which is unknown and needs to be estimated. By combining (5), and (6), we obtain the final CPSDM model given by

(7)

There are several existing methods [15, 17, 18, 26, 19] to estimate under the assumption that is known. There are mainly two methodologies of obtaining . The first is to use many pre-calculated impulse responses measured around the array as in [7]. The second is to use a model which is based on the fact that the off-diagonal elements of depend on the distance between every microphone pair. The distances between any two microphone pairs is described by the symmetric microphone-distance matrix with elements which is the distance between microphones and . Two commonly used models for the spatial structure are the cylindrical and spherical isotropic noise fields [37, 10]. The cylindrical isotropic noise field is accurate for rooms where the ceiling and the floor are more absorbing than the walls. These models are accurate for sufficiently large rooms [10].

Ii-D Estimation of CPSDMs Using Sub-Frames

The estimation of , is achieved using overlapping multiple sub-frames. The set of all used sub-frames within the -th time-frame is denoted by , and the number of used sub-frames is . We assume that the noisy microphone signals within a time-frame are stationary and, thus, we can estimate the noisy CPSDM using the sample CPSDM, i.e.,

(8)

with the sub-frame index. Fig. 1 summarizes how we split time using sub-frames, time-frames and time-segments.

Ii-E Problem Formulation

The goal of this paper is to jointly estimate the parameters , , , and for the -th time-segment of the signal model in (7) by only having estimates of the noisy CPSDM matrices for all time frames belonging to the -th time-segment and possibly having an estimate and/or . From now on, we will neglect time-frequency indices to simplify notation wherever is necessary.

Iii Confirmatory Factor Analysis

Confirmatory factor analysis (CFA) [33, 34, 36] aims at estimating the parameters of the following CPSDM model:

(9)

where and . In CFA, some of the elements in and are fixed such that the remaining variables are uniquely identifiable (see below). More specifically, let and denote the sets of the selected row-column index-pairs of the matrices and , respectively, where their elements are fixed to some known constants , and .

There are several existing CFA methods (see e.g. [36], for an overview). Most of these are special cases of the following general CFA problem

(10)

with a cost function, which is typically one of the following cost functions: maximum likelihood (ML), least squares (LS), or generalized least squares (GLS). That is,

(11)

where is given in (9). Notice, that the problem in (10) is not convex (due to the non-convex terms ) and may have multiple local minima.

TF

TF

SF

TS

Fig. 1: Splitting time into time-segments (TS), time-frames (TF), and sub-frames (SF).

There are two necessary conditions for the parameters of the CPSDM model in (9) to be uniquely identifiable111We say that the parameters of a function are uniquely identifiable if there is one-to-one relationship between the parameters and the function value.. The first identifiability condition states that the number of equations should be larger than the number of unknowns [40, 36]. Since , there are known values, while there are unknowns due to , unknowns due to (because ), and unknowns due to (because is diagonal). Therefore, the first identifiability condition is given by [40]

(12)

The identifiability condition in (12) is not sufficient for guaranting unique identifiability [36]. Specifically, for any arbitary non-singular matrix , we have and, therefore [34]

(13)

This means that there are infinitly many optimal solutions () of the problem in (10). Since there are variables in , the second identifiability condition of the CPSDM model in (9) states that we need to fix at least of the parameters in and  [40, 34], i.e.,

(14)

This second condition is necessary but not sufficient, since we need to fix the proper parameters and not just any parameters [40, 34] such that . For a general full-element , a recipe on how to select the constraints in order to achieve unique identifiability is provided in [34].

Iii-a Simultaneous CFA (SCFA) in Multiple Time-Frames

The -th time-segment consists of the following time-frames: , where is the set of the time-frames in the -th time-segment. For ease of notation, we can alternatively re-write this as . The problem in (10) considered time-frame. Now we assume that we estimate for time-frames in the -th time-segment. We also assume that , if . Recall that the mixing matrix is assumed to be static within a time-segment. Moreover, is time-invariant and, thus, shared among different time-frames within the same time-segment. One can exploit these two facts in order to increase the ratio between the number of equations and the number of unknown parameters [33, 35] and thus satisfy the first and second identifiability conditions with less microphones. This can be done by solving the following general simultaneous CFA (SCFA) problem [35]

(15)

The CFA problem in (10) is a special case of SCFA, when we select . The first identifiability condition for the SCFA problem becomes

(16)

We conclude from (12) and (16) that the SCFA problem (for ) needs less microphones compared to the problem in (10) to satisfy the first identifiability condition, assuming both problems have the same number of sources. Moreover, the second identifiability condtion in the SCFA problem becomes

(17)

From (14) and (17), we conclude that the SCFA problem (for ) satisfies easier the second identifiability condition compared to the problem in (10), if both problems have the same number of sources and microphones.

Iii-B Special Case (S)CFA: is Diagonal

A special case of (S)CFA, which is more suitable for the application at hand, is when are constrained to be diagonal due to the signal model in (5). We refer to this special case as the diagonal (S)CFA problem. By constraining to be diagonal, the total number of fixed parameters in is

(18)

It has been shown in [41, 42] that in this case, and for , the class of the only possible is , where is a permutation matrix and is a scaling matrix, if the following condition is satisfied

(19)

where

(20)

and are the Kruskal-ranks [41] of the matrices and , respectively. We conclude, that if (16) is satisfied, and there are at least fixed variables in and , and the condition in (19) is satisfied, then the parameters of (9) (for diagonal) will be uniquely identifiable up to a possible scaling and/or permutation.

Iii-C Diagonal SCFA vs Non-Orthogonal Joint Diagonalization

The diagonal SCFA problem in Sec. III-B is very similar to the joint diagonalization method in [4] apart from the two positive semidefinite constraints that avoid improper solutions, and which are lacking in [4]. Finally, it is worth mentioning that the method proposed in [4] solves the scaling ambiguity by setting (corresponding to a varying reference microphone per-source), which means fixed elements in , i.e., . Therefore, in [4], the total number of fixed parameters in is given by

(21)

By combining (21) and (17), the second identifiability condition becomes

(22)

Note that for , if , the second identifiability condition is always satisfied, but the permutation ambiguity still exists and needs extra steps to be resolved [4]. However, for , the second identifiability condition is satisfied for and there are no permutation ambiguities. By combining (21), and (16), the first identifiability condition for the diagonal SCFA with becomes

(23)

Iv Proposed Diagonal SCFA Problems

In this section, we will propose two methods based on the diagonal SCFA problem from Sec. III-B to estimate the different signal model parameters in (7). Unlike the diagonal SCFA problem and the non-orthogonal joint diagonalization method in [4], the first proposed method also estimates the late reverberation PSD. The second proposed method skips the estimation of the late reverberation PSD and thus is more similar to the diagonal SCFA problem and the non-orthogonal joint diagonalization method in [4]. Since we are using the early RATFs as columns of , we fix all the elements of the -th row of equal to 1, where is the reference microphone index. Thus, unlike the method proposed in [4], which uses a varying reference microphone (i.e., ), we use a single reference microphone (i.e., ).

Although our proposed constraints will resolve the scaling ambiguity (described in Sec III-B), the permutation ambiguity (described in Sec III-B) still exists and needs extra steps to be resolved. In this paper, we do not focus on this problem and we assume that we know the perfect permutation matrix per time-frequency tile. The interested reader can find more information on how to solve permutation ambiguities in [4, 5, 6]. An exception occurs in the context of dereverberation where, typically, a single point source (i.e., ) exists and, therefore, a single fixed parameter in is sufficient to solve both the permutation and scaling ambiguities.

Iv-a Proposed Basic Diagonal SCFA Problem

The proposed basic diagonal SCFA problem is based on the signal model in (7), which takes into account the late reverberation. Here we assume that we have computed a priori . The proposed diagonal SCFA problem is given by

(24)

We will refer to the problem in (24) as the problem. The objective function of the problem depends on . This means that we have additional unknowns in (23). Thus, the first identifiability condition becomes

(25)

A simplified version of the problem is obtained when the reverberation parameter is omitted. This problem therefore uses the signal model of (9) instead of (7). We will refer to this simplified problem as the problem. The only differences between the and the method proposed [4], is that in the we use a fixed reference microphone and positivity constraints for the PSDs.

Since, we have fixed parameters in corresponding to the reference microphone, in both proposed methods, the total number of fixed parameters in and is the same as in (21). The second identifiability condition of all proposed methods is therefore the same as in (22).

Iv-B versus

Although the method typically fits a more accurate signal model to the noisy measurements compared to the method, it does not necessarily guarantee a better performance over the method. In other words, the model-mismatch error is not the only critical factor in achieving good performance. Another important factor is how over-determined is the system of equations to be solved is, i.e., what is the ratio of knowns and unknowns. With respect to the over-determination factor, the method is more efficient, since it has less parameters to estimate, if is the same in both methods. Consequently, the problem boils down to how much is the model-mismatch error and the over-determination. Thus, it is natural to expect that for not highly reverberant environments, the method may perform better than the method, while for highly reverberant environments the inverse may hold.

V Robust Estimation of Parameters

In Secs. V-AV-E, we propose additional constraints in order to increase the robustness of the initial versions of the two diagonal SCFA problems proposed in Sec. IV. The robustness is needed in order to overcome CPSDM estimation errors and model-mismatch errors. We use linear inequality constraints (mainly simple box constraints) on the parameters to be estimated. These constraints limit the feasibility set of the parameters to be estimated and avoid unreasonable values.

A less efficient alternative procedure to increase robustness would be to solve the proposed problems with a multi-start optimization technique such that a good local optimum will be obtained. Note that this procedure is more computational demanding and also (without the box constraints) does not guarantee estimated parameters that belong in a meaningful region of values.

V-a Constraining the Summation of PSDs

If the model in (7) perfectly describes the acoustic scene, the sum of the PSDs of the point sources, late reverberation, and microphone self-noise at the reference microphone equals (where is the reference microphone index and is the element of ). That is,

(26)

where is the -th diagonal element of . In practice, the model is not perfect and we do not know , but an estimate . Therefore, a box constraint is used, instead of an equality constraint. That is,

(27)

where is a constant which controls the underestimation or overestimation of the PSDs. This box constraint can be used to improve the robustness of the problem, but cannot be used by the problem, since it does not estimate . A less tight box constraint that can be used for both , problems is

(28)

One may see the inequality in (28) as a sparsity constraint, natural in audio and speech processing as the number of the active sound sources is small (typically much smaller than the maximum number of sources, , existing in the acoustic scene) for a singe time-frequency tile. In this case, controls the sparsity. A low implies large sparsity, while a large implies low sparsity. The sparsity is over frequency and time.

V-B Box Constraints for the Early RATFs

Extra robustness can be achieved if the elements of the early RATFs are box-constrained as follows:

(29)

where are some complex-valued upper and lower bounds, respectively222An alternative method would be to constrain with real lower and upper bounds but that would lead to a non-linear inequality constraint and, thus, a more complicated implementation.. We select the values of based on relative Green functions. Let us denote with the location of the -th source, with the location of the -th microphone, and with the distance between the -th source and -th microphone. The anechoic ATF (direct path only) at the frequency-bin between the -th source -th microphone is given by [43]

(30)

where is the FFT length, is the speed of sound, and is the time of arrival (TOA) of the -th source to the -th microphone. The corresponding anechoic relative ATF with respect to the reference microphone is given by

(31)

where is the time difference of arrival (TDOA) of the -th source between microphones and . What becomes clear from (31) is that the anechoic relative ATF depends only on the two unknown parameters . The upper and lower bounds of the real part of (31) can be written compactly using the following box inequality

(32)

and similarly for the imaginary part of .

Among all the points on the circle with any constant radius and center the middle point between microphones with indices and , the inequality in (32) becomes maximally relaxed for the maximum possible and minimum possible , i.e., when the ratio becomes maximum. This happens when the -th source is in the endfire direction of the two microphones and closest to -th microphone. In this case we have and, therefore,  (32) becomes

(33)

The imaginary part of is constrained similarly to (33). In the inequality in (33), the parameters are unknown. Now, we try to relax this inequality and find ways that are independent of these unknown parameters.

Note that the quantity should not be allowed to be greater than the sub-frame length in seconds, i.e., , where is the sub-frame length in samples. If it is greater than , the signal model given in (7) is invalid, i.e., the CPSDM of the -th point source cannot be written as a rank-1 matrix, because it will not be fully correlated between microphones . Therefore, we have

(34)

Note that the inequality in (34) should also hold in the endfire direction of the two microphones, which means

(35)

The inequality in (33) is maximally relaxed for the maximum possible and the minimum possible . The maximum allowable is given by (35). Moreover, another practical observation is that the sources cannot be in the same location as the microphones. Therefore, we have

(36)

where is a very small distance (e.g., m). Therefore, the maximum range of the real part of the relative anechoic ATF is given by

(37)

The imaginary part of is constrained similar to (37). The above inequality is based on anechoic free-field RATFs. In practice, we have early RATFs which include early echoes and/or directivity patterns which means that we might want to make the box constraint in (37) less tight.

V-C Tight Box Constraints for the Early RATFs based on

In Sec. V-B we proposed the box constraints in (37) based on practical considerations without knowing the distance between sensors or between sources and sensors. In this section we assume that we have an estimate of the distance matrix (see Sec. II-C), . Consequently we know and, therefore, we can make the box constraint in (37) even tighter. Specifically, the inequality in (33) is maximally relaxed as follows

(38)

The imaginary part of is constrained similar to (38).

V-D Box Constraints for the Late Reverberation PSD

In this section, we take into consideration the late reverberation. We can be almost certain that the following box constraint is satisfied:

(39)

This box constraint is only applicable in the problem. The box-constraint in (39) prevents large overestimation errors which may result in speech intelligibility reduction in noise reduction applications [44, 18].

V-E All microphones have the same microphone-self noise PSD

Here we examine the special case where , i.e., all microphones have the same self-noise PSD. Moreover, since the microphone self-noise is stationary, we can be almost certain that the following box-constraint holds

(40)

Similar to the constraint in (39), the constraint in (40) avoids large overestimation errors.

By having a common self-noise PSD for all microphones, the number of parameters are reduced by , since we have only one microphone-self noise PSD for all microphones. Hence, the first identifiability condition for the problem is now given by

(41)

Similarly, the first identifiability condition for the problem is now given by

(42)

Vi Practical Considerations

In this section, we discuss practical problems regarding the choice of several parameters of the proposed methods and implementation aspects. Although, we have already explained the problem of over-determination in Sec. IV-B, in Sec VI-A, we discuss additional ways of achieving over-determination. In Sec. VI-B, we discuss about some limitations of the proposed methods. Finally, in Secs. VI-C and VI-D, we discuss how to implement the proposed methods.

Vi-a Over-determination Considerations

Increasing the ratio of the number of equations over the number of unknowns obviously fits better the CPSDM model to the measurements under the assumption that the model is accurate enough and the early RATFs do not change within a time-segment. There are two main approaches to increase the ratio of the number of equations over the number of unknowns. The first approach is to reduce the number of the parameters to be estimated while fixing the number of equations as already explained in Sec. IV-B. In addition to the explanation provided in IV-B, we could also reduce the number of parameters by source counting per time-frequency tile and adapt . However, this is out of the scope of the present paper and here we assume that we have a constant in the entire time-frequency horizon which is the maximum possible . The second approach is to increase the number of time-frames in a time-segment and/or the number of microphones . Increasing is not practical, because typically, the acoustic sources are moving. Thus, should not be too small but also not too large. Note that is also effected by the time-frame length denoted by . If is small we can use a larger , while if is large, we should use a small in order to be able to also track moving sources. However, if we select to be very small, the number of sub-frames will be smaller and consequently the estimation error in will be large and will cause performance degradation.

Vi-B Limitations of the Proposed Methods

From the identifiability conditions in (23), (25), (41) and (42) for fixed and , we can obtain the minimum number of microphones needed to satisfy these inequalities. Alternatively, for a fixed and we can obtain the minimum number of time-frames needed to satisfy these inequalities. Finally, for a fixed and we can find the maximum number of sources for which we can identify their parameters (early RATFs and PSDs). Let , , and the minimum number of microphones satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. Moreover, let , , and the minimum number of time-frames satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. In addition, let , , and the maximum number of sources satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. The following inequalities can be easily proved:

Vi-C Online Implementation Using Warm-Start

The estimation of the parameters is carried out for all time-frames within one time-segment. Subsequently, in order to have low latency, we shift the time-segment one time-frame. For the time-frames in the current time-segment that overlap with the time-frames in the previous time-segment, the parameters are initialized using the estimates from the corresponding

time-frames in the previous time-segment. The parameters of the most recent time-frame are initialized by selecting a value that is drawn from a uniform distribution with boundaries corresponding to the lower and upper bound of the corresponding box constraint. Only for the first time-segment, the early RATFs are initialized with the

most dominant relative eigenvectors from the averaged CPSDM over all time-frames of the first time-segment.

Vi-D Solver

The non-convex optimization problems that we proposed can be solved with various existing solvers within the literature such as [45, 46, 47, 48]. In this paper, we used the standard MATLAB optimization toobox to solve the optimization problems which implements a combination of the methods in [46, 47, 48]. These methods require first and sometimes second-order derivatives of the objective function. The first-order derivatives of the objective functions in (11) with respect to most parameters have been obtained already in [34, 35, 4, 36] without taking into account the late reverberation PSD. Thus, here we provide only the first-order derivatives with respect to the late reverberation PSD parameter. We have

(43)
(44)
(45)

For the second-order derivatives, we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approximated Hessian [36].

Vii Experiments

In this section, we show the performance of the proposed methods in the context of two multi-microphone applications. The first application is dereverberation of a single point source (). The second application is source separation combined with dereverberation examined in an acoustic scene with point sources. In this paper, we use the perfect permutation matrix for all compared methods in the source separation experiments. For these experiments we selected the maximum-likelihood objective function in (11). The values of the parameters that we selected for both applications are summarized in Table I. All methods based on the diagonal SCFA methodology are implemented using the online implementation in Sec. VI-C. The acoustic scene we consider for the source separation example is depicted in Fig. 2. The acoustic scene we consider for the dereverberation example is similar with the only difference that the music signal and male talker sources (see Fig. 2) are not present. The room dimensions are  m. The reverberation time for the dereverberation application is selected  s, while for the source separation, and  s. The microphone signals have a duration of  s and the duration of the impulse responses used to construct the microphone signals is s. The microphone signals were constructed using the image method [43]. The microphone array is circular with a consecutive microphone distance of  cm. The reference microphone is the right-top microphone in Fig. 2. Moreover, we assume that the microphone-self noise has the same PSD at all microphones. Finally, it is worth mentioning that the early part of a room impulse response (see Sec. II-B) is of the same length as the sub-frame length.

Parameter Definition Value
number of microphones
FFT length
time-frame length samples (0.125 s)
sub-frame length samples (0.0125 s)
overlapping of sub-frames
spatial coherence matrix spherical isotropic model
reference microphone index
controls overestimation
underestimation
controls sparsity
speed of sound
minimum possible
source-microphone distance
cm
sampling frequency kHz
mic. self noise PSD
TABLE I: Summary of parameters used in the experiments.

Vii-a Performance Evaluation

We will perform two types of performance evaluations in both applications. The first one measures the error of the estimated parameters, while the second one measures the performance by using the estimated parameters in a source estimation algorithm and measure instrumental intelligibility and sound quality of the estimated source waveforms. We measure the average PSD errors of the sources, the average PSD error of the late reverberation, and the average PSD error of the microphone-self noise using the following three measures [49]:

(46)
(47)
(48)

We also compute the underestimates (denoted as above with superscript un) and overestimates (denoted as above with superscript ov) of the above averages as in [44] since a large overestimation error in the noise PSDs and a large underestimation error in the target PSD typically results in large target source distortions in the context of a noise reduction framework [44]. On the other hand, a large underestimation error in the noise PSDs may result in musical noise [44]. We also evaluate the average early RATF estimation error using the Hermitian angle measure [50] given by

(49)

If the PSD of a source in a frequency-bin is negligible for all time-frames within a time-segment, the estimated PSD and RATF of this source at that time-frequency tile are skipped from the above averages.

To evaluate the intelligibility and quality of the

-th target source signal, the estimated parameters are used to construct a multi-channel Wiener filter (MWF) as a concatenation of a single-channel Wiener filter (SWF) and a minimum variance distortionless response (MVDR) beamformer 

[1]. That is,

(50)

and

(51)

where

(52)

The noise reduction of the -th source is evaluated using the segmental-signal-to-noise-ratio (SSNR) for the -th source only in sub-frames where the -th source is active after which we average the SSNRs over all sources. Moreover, for speech sources, we measure the predicted intelligibility with the SIIB measure [51, 52] and average SIIB over all speech sources.

Fig. 2: Acoustic scene with sources and microphones.

Vii-B Reference State-of-the-Art Dereverberation and Parameter-Estimation Methods

For the dereverberation we first estimate the PSD of the late reverberation using the method proposed in [26, 19]. Specifically, we first compute the Cholesky decomposition after which we compute the whitened estimated noisy CPSDM as

(53)

Next, we compute the eigenvalue decomposition , where the diagonal entries of are sorted in descending order. The PSD of the late reverberation is then computed as

(54)

Having an estimate of the late reverberation, we compute the noise CPSDM matrix as and use it to estimate the early RATF and PSD of the target in the sequel.

We estimate the early RATF of the target using the method proposed in [8, 53]. We first compute the Cholesky decomposition . We then compute the whitened estimated noisy CPSDM as . Next, we compute the eigenvalue decomposition , where the diagonal entries of are sorted in descending order. We compute the early RATF as

(55)

with . We improve even further the accuracy of the estimated RATF by estimating the RATFs of all time frames within each time-segment and then use the average of these as the RATF estimate. Finally, the target PSD is estimated as proposed in [28, 15], i.e.,

(56)

where is given in (51).

Fig. 3: