I Introduction
Microphone arrays (see e.g., [1] for an overview) are used extensively in many applications, such as source separation [2, 3, 4, 5, 6], multimicrophone noise reduction [1, 7, 8, 9, 10, 11, 12, 13], dereverberation [14, 15, 16, 17, 18, 19], sound source localization [20, 21, 22, 23], and room geometry estimation [24, 25]. All the aforementioned applications are based on a similar multimicrophone signal model, typically depending on the following parameters: i) the early relative acoustic transfer functions (RATFs) of the sources with respect to the microphones; ii) the power spectral densities (PSDs) of the early components of the sources, iii) the PSD of the late reverberation, and, iv) the PSDs of the microphoneself noise. Other parameters, like the target cross power spectral density matrix (CPSDM), the noise CPSDM, source locations and room geometry information, can be inferred from (combinations of) the above mentioned parameters. Often, none of these parameters are known a priori, while estimation is challenging. Often, only a subset of the parameters is estimated, see e.g., [14, 15, 16, 17, 26, 19, 27, 28, 29, 30], typically requiring rather strict assumptions with respect to stationarity and/or knowledge of the remaining parameters.
In [15, 17] the target source PSD and the late reverberation PSD are jointly estimated assuming that the early RATFs of the target with respect to all microphones are known and all the remaining noise components (e.g., interferers) are stationary in time intervals typically much longer than a time frame. In [31, 26, 19], it was shown that the method in [15, 17] may lead to inaccurate estimates of the late reverberation PSD, when the early RATFs of the target include estimation errors. In [26, 19], a more accurate estimator for the late reverberation PSD was proposed, independent of early RATF estimation errors.
The methods proposed in [27, 28] do not assume that some noise components are stationary like in [17], but assume that the total noise CPSDM has a constant [27] or slowvarying [28] structure over time (i.e., it can be written as an unknown scaling parameter multiplied with a constant spatial structure matrix). This may not be realistic in practical acoustical scenarios, where different interfering sources change their power and location across time more rapidly and with different patterns. Moreover, these methods do not separate the late reverberation from the other noise components and only differentiate between the target source PSD and the overall noise PSD. As in [17], these methods assume that the early RATFs of the target are known. In [28], the structure of the noise CPSDM is estimated only in targetabsent timefrequency tiles using a voice activity detector (VAD), which may lead to erroneous estimates if the spatial structure matrix of the noise changes during targetpresence.
In [30]
, the early RATFs and the PSDs of all sources are estimated using the expectation maximization (EM) method
[32]. This method assumes that only one source is active per timefrequency tile and the noise CPSDM (excluding the contributions of the interfering point sources) is estimated assuming it is timeinvariant. Due to the timevarying nature of the late reverberation (included in the noise CPSDM), this assumption is often violated. This method does not estimate the timevarying PSD of the late reverberation, neither the PSDs of the microphoneself noise.While the aforementioned methods focus on estimation of just one or several of the required model parameters, the method presented in [4] jointly estimates the early RATFs of the sources, the PSDs of the sources and the PSDs of the microphoneself noise. Unlike [30], the method in [4] does not assume single source activity per timefrequency tile and, thus, it is applicable to more general acoustic scenarios. The method in [4] is based on the nonorthogonal jointdiagonalization of the noisy CPSDMs. This method unfortunately does not guarantee nonnegative estimated PSDs and, thus, the obtained target CPSDM may not be positive semidefinite resulting in performance degradation. Moreover, this approach does not estimate the PSD of the late reverberation. In conclusion, most methods only focus on the estimation of a subset of the required model parameters and/or rely on assumptions which may be invalid and/or impractical.
In this paper, we propose a method which jointly estimates all the aforementioned parameters of the multimicrophone signal model. The proposed method is based on confirmatory factor analysis (CFA) [33, 34, 35, 36] and on the nonorthogonal jointdiagonalization principle introduced in [4]. The combination of these two theories and the adjustment to the multimicrophone case gives us a robust method, which is applicable for temporally and spatially nonstationary sources. The proposed method uses linear constraints to reduce the feasibility set of the parameter space and thus increase robustness. Moreover, the proposed method guarantees positive estimated PSDs and, thus, positive semidefinite target and noise CPSDMs. Although generally applicable, in this manuscript, we will compare the performance of the proposed method with stateoftheart approaches in the context of source separation and dereverberation.
The remaining of the paper is organized as follows. In Sec. II, the signal model, notation and used assumptions are introduced. In Sec. III, we review the CFA theory and its relation to the nonorthogonal joint diagonalization principle. In Sec. IV, the proposed method is introduced. In Sec. V, we introduce several constraints to increase the robustness of the proposed method. In Sec. VI, we discuss the implementation and practicality of the proposed method. In Sec. VII, we conduct experiments in several multimicrophone applications using the proposed method and existing stateoftheart approaches. In Sec. VIII, we draw conclusions.
Ii Preliminaries
Iia Notation
We use lowercase letters for scalars, boldface lowercase letters for vectors, and boldface uppercase letters for matrices. A matrix
can be expressed as , where is its th column. The elements of a matrix are denoted as . We use the operand to denote the trace of a matrix,to denote the expected value of a random variable,
to denote the vector formed from the diagonal of a matrix , and to denote the Frobenius norm of a matrix. We use to form a square diagonal matrix with diagonal . A hermitian positive semidefinite matrix is expressed as , whereand its eigenvalues are real nonnegative. The cardinality of a set is denoted as
. The minimum element of a vector is obtained via the operation .IiB Signal Model
Consider an element microphone array of arbitrary structure within a possibly reverberant enclosure, in which there are acoustic point sources (target and interfering sources). The
th microphone signal (in the shorttime Fourier transform (STFT) domain) is modeled as
(1) 
where is the frequencybin index; the timeframe index; and the early and late components of the th point source, respectively; and denotes the microphone selfnoise. The early components include the line of sight and a few initial strong reflections. The late components describe the effect of the remaining reflections and are usually referred to as late reverberation. The th early component is given by
(2) 
where is the corresponding RATF with respect to the th microphone, the th pointsource at the reference microphone, is the index of a timesegment, which is a collection of timeframes. That is, we assume that the source signal can vary faster (from timeframe to timeframe) than the early RATFs, which are assumed to be constant over multiple timeframes (which we call a timesegment). By stacking all microphone recordings into vectors, the multimicrophone signal model is given by
(3) 
where and all the other vectors can be similarly represented. If we assume that all sources in (3) are mutually uncorrelated and stationary within a timeframe, the signal model of the CPSDM of the noisy recordings is given by
(4) 
where , is the PSD of the th source at the reference microphone, the CPSDM of the late reverberation and is a diagonal matrix, which has as its diagonal elements the PSDs of the microphoneself noise. Note that and are timeframe varying, while the microphoneself noise PSDs are typically timeinvariant. The CPSDM model in (4) can be rewritten as
(5) 
where and is commonly referred to as mixing matrix and has as its columns the early RATFs of the sources. As we work with RATFs, the row of corresponding to the reference microphone is equal to a vector with only ones. Moreover, is a diagonal matrix, where .
IiC Late Reverberation Model
A commonly used assumption (adopted in this paper) is that the late reverberation CPSDM has a known spatial structure, , which is timeinvariant but varying over frequency [14, 17]. Under the constant spatialstructure assumption, is modeled as [14, 17]
(6) 
with the PSD of the late reverberation which is unknown and needs to be estimated. By combining (5), and (6), we obtain the final CPSDM model given by
(7) 
There are several existing methods [15, 17, 18, 26, 19] to estimate under the assumption that is known. There are mainly two methodologies of obtaining . The first is to use many precalculated impulse responses measured around the array as in [7]. The second is to use a model which is based on the fact that the offdiagonal elements of depend on the distance between every microphone pair. The distances between any two microphone pairs is described by the symmetric microphonedistance matrix with elements which is the distance between microphones and . Two commonly used models for the spatial structure are the cylindrical and spherical isotropic noise fields [37, 10]. The cylindrical isotropic noise field is accurate for rooms where the ceiling and the floor are more absorbing than the walls. These models are accurate for sufficiently large rooms [10].
IiD Estimation of CPSDMs Using SubFrames
The estimation of , is achieved using overlapping multiple subframes. The set of all used subframes within the th timeframe is denoted by , and the number of used subframes is . We assume that the noisy microphone signals within a timeframe are stationary and, thus, we can estimate the noisy CPSDM using the sample CPSDM, i.e.,
(8) 
with the subframe index. Fig. 1 summarizes how we split time using subframes, timeframes and timesegments.
IiE Problem Formulation
The goal of this paper is to jointly estimate the parameters , , , and for the th timesegment of the signal model in (7) by only having estimates of the noisy CPSDM matrices for all time frames belonging to the th timesegment and possibly having an estimate and/or . From now on, we will neglect timefrequency indices to simplify notation wherever is necessary.
Iii Confirmatory Factor Analysis
Confirmatory factor analysis (CFA) [33, 34, 36] aims at estimating the parameters of the following CPSDM model:
(9) 
where and . In CFA, some of the elements in and are fixed such that the remaining variables are uniquely identifiable (see below). More specifically, let and denote the sets of the selected rowcolumn indexpairs of the matrices and , respectively, where their elements are fixed to some known constants , and .
There are several existing CFA methods (see e.g. [36], for an overview). Most of these are special cases of the following general CFA problem
(10) 
with a cost function, which is typically one of the following cost functions: maximum likelihood (ML), least squares (LS), or generalized least squares (GLS). That is,
(11) 
where is given in (9). Notice, that the problem in (10) is not convex (due to the nonconvex terms ) and may have multiple local minima.
There are two necessary conditions for the parameters of the CPSDM model in (9) to be uniquely identifiable^{1}^{1}1We say that the parameters of a function are uniquely identifiable if there is onetoone relationship between the parameters and the function value.. The first identifiability condition states that the number of equations should be larger than the number of unknowns [40, 36]. Since , there are known values, while there are unknowns due to , unknowns due to (because ), and unknowns due to (because is diagonal). Therefore, the first identifiability condition is given by [40]
(12) 
The identifiability condition in (12) is not sufficient for guaranting unique identifiability [36]. Specifically, for any arbitary nonsingular matrix , we have and, therefore [34]
(13) 
This means that there are infinitly many optimal solutions () of the problem in (10). Since there are variables in , the second identifiability condition of the CPSDM model in (9) states that we need to fix at least of the parameters in and [40, 34], i.e.,
(14) 
This second condition is necessary but not sufficient, since we need to fix the proper parameters and not just any parameters [40, 34] such that . For a general fullelement , a recipe on how to select the constraints in order to achieve unique identifiability is provided in [34].
Iiia Simultaneous CFA (SCFA) in Multiple TimeFrames
The th timesegment consists of the following timeframes: , where is the set of the timeframes in the th timesegment. For ease of notation, we can alternatively rewrite this as . The problem in (10) considered timeframe. Now we assume that we estimate for timeframes in the th timesegment. We also assume that , if . Recall that the mixing matrix is assumed to be static within a timesegment. Moreover, is timeinvariant and, thus, shared among different timeframes within the same timesegment. One can exploit these two facts in order to increase the ratio between the number of equations and the number of unknown parameters [33, 35] and thus satisfy the first and second identifiability conditions with less microphones. This can be done by solving the following general simultaneous CFA (SCFA) problem [35]
(15) 
The CFA problem in (10) is a special case of SCFA, when we select . The first identifiability condition for the SCFA problem becomes
(16) 
We conclude from (12) and (16) that the SCFA problem (for ) needs less microphones compared to the problem in (10) to satisfy the first identifiability condition, assuming both problems have the same number of sources. Moreover, the second identifiability condtion in the SCFA problem becomes
(17) 
From (14) and (17), we conclude that the SCFA problem (for ) satisfies easier the second identifiability condition compared to the problem in (10), if both problems have the same number of sources and microphones.
IiiB Special Case (S)CFA: is Diagonal
A special case of (S)CFA, which is more suitable for the application at hand, is when are constrained to be diagonal due to the signal model in (5). We refer to this special case as the diagonal (S)CFA problem. By constraining to be diagonal, the total number of fixed parameters in is
(18) 
It has been shown in [41, 42] that in this case, and for , the class of the only possible is , where is a permutation matrix and is a scaling matrix, if the following condition is satisfied
(19) 
where
(20) 
and are the Kruskalranks [41] of the matrices and , respectively. We conclude, that if (16) is satisfied, and there are at least fixed variables in and , and the condition in (19) is satisfied, then the parameters of (9) (for diagonal) will be uniquely identifiable up to a possible scaling and/or permutation.
IiiC Diagonal SCFA vs NonOrthogonal Joint Diagonalization
The diagonal SCFA problem in Sec. IIIB is very similar to the joint diagonalization method in [4] apart from the two positive semidefinite constraints that avoid improper solutions, and which are lacking in [4]. Finally, it is worth mentioning that the method proposed in [4] solves the scaling ambiguity by setting (corresponding to a varying reference microphone persource), which means fixed elements in , i.e., . Therefore, in [4], the total number of fixed parameters in is given by
(21) 
By combining (21) and (17), the second identifiability condition becomes
(22) 
Note that for , if , the second identifiability condition is always satisfied, but the permutation ambiguity still exists and needs extra steps to be resolved [4]. However, for , the second identifiability condition is satisfied for and there are no permutation ambiguities. By combining (21), and (16), the first identifiability condition for the diagonal SCFA with becomes
(23) 
Iv Proposed Diagonal SCFA Problems
In this section, we will propose two methods based on the diagonal SCFA problem from Sec. IIIB to estimate the different signal model parameters in (7). Unlike the diagonal SCFA problem and the nonorthogonal joint diagonalization method in [4], the first proposed method also estimates the late reverberation PSD. The second proposed method skips the estimation of the late reverberation PSD and thus is more similar to the diagonal SCFA problem and the nonorthogonal joint diagonalization method in [4]. Since we are using the early RATFs as columns of , we fix all the elements of the th row of equal to 1, where is the reference microphone index. Thus, unlike the method proposed in [4], which uses a varying reference microphone (i.e., ), we use a single reference microphone (i.e., ).
Although our proposed constraints will resolve the scaling ambiguity (described in Sec IIIB), the permutation ambiguity (described in Sec IIIB) still exists and needs extra steps to be resolved. In this paper, we do not focus on this problem and we assume that we know the perfect permutation matrix per timefrequency tile. The interested reader can find more information on how to solve permutation ambiguities in [4, 5, 6]. An exception occurs in the context of dereverberation where, typically, a single point source (i.e., ) exists and, therefore, a single fixed parameter in is sufficient to solve both the permutation and scaling ambiguities.
Iva Proposed Basic Diagonal SCFA Problem
The proposed basic diagonal SCFA problem is based on the signal model in (7), which takes into account the late reverberation. Here we assume that we have computed a priori . The proposed diagonal SCFA problem is given by
(24) 
We will refer to the problem in (24) as the problem. The objective function of the problem depends on . This means that we have additional unknowns in (23). Thus, the first identifiability condition becomes
(25) 
A simplified version of the problem is obtained when the reverberation parameter is omitted. This problem therefore uses the signal model of (9) instead of (7). We will refer to this simplified problem as the problem. The only differences between the and the method proposed [4], is that in the we use a fixed reference microphone and positivity constraints for the PSDs.
IvB versus
Although the method typically fits a more accurate signal model to the noisy measurements compared to the method, it does not necessarily guarantee a better performance over the method. In other words, the modelmismatch error is not the only critical factor in achieving good performance. Another important factor is how overdetermined is the system of equations to be solved is, i.e., what is the ratio of knowns and unknowns. With respect to the overdetermination factor, the method is more efficient, since it has less parameters to estimate, if is the same in both methods. Consequently, the problem boils down to how much is the modelmismatch error and the overdetermination. Thus, it is natural to expect that for not highly reverberant environments, the method may perform better than the method, while for highly reverberant environments the inverse may hold.
V Robust Estimation of Parameters
In Secs. VA—VE, we propose additional constraints in order to increase the robustness of the initial versions of the two diagonal SCFA problems proposed in Sec. IV. The robustness is needed in order to overcome CPSDM estimation errors and modelmismatch errors. We use linear inequality constraints (mainly simple box constraints) on the parameters to be estimated. These constraints limit the feasibility set of the parameters to be estimated and avoid unreasonable values.
A less efficient alternative procedure to increase robustness would be to solve the proposed problems with a multistart optimization technique such that a good local optimum will be obtained. Note that this procedure is more computational demanding and also (without the box constraints) does not guarantee estimated parameters that belong in a meaningful region of values.
Va Constraining the Summation of PSDs
If the model in (7) perfectly describes the acoustic scene, the sum of the PSDs of the point sources, late reverberation, and microphone selfnoise at the reference microphone equals (where is the reference microphone index and is the element of ). That is,
(26) 
where is the th diagonal element of . In practice, the model is not perfect and we do not know , but an estimate . Therefore, a box constraint is used, instead of an equality constraint. That is,
(27) 
where is a constant which controls the underestimation or overestimation of the PSDs. This box constraint can be used to improve the robustness of the problem, but cannot be used by the problem, since it does not estimate . A less tight box constraint that can be used for both , problems is
(28) 
One may see the inequality in (28) as a sparsity constraint, natural in audio and speech processing as the number of the active sound sources is small (typically much smaller than the maximum number of sources, , existing in the acoustic scene) for a singe timefrequency tile. In this case, controls the sparsity. A low implies large sparsity, while a large implies low sparsity. The sparsity is over frequency and time.
VB Box Constraints for the Early RATFs
Extra robustness can be achieved if the elements of the early RATFs are boxconstrained as follows:
(29) 
where are some complexvalued upper and lower bounds, respectively^{2}^{2}2An alternative method would be to constrain with real lower and upper bounds but that would lead to a nonlinear inequality constraint and, thus, a more complicated implementation.. We select the values of based on relative Green functions. Let us denote with the location of the th source, with the location of the th microphone, and with the distance between the th source and th microphone. The anechoic ATF (direct path only) at the frequencybin between the th source th microphone is given by [43]
(30) 
where is the FFT length, is the speed of sound, and is the time of arrival (TOA) of the th source to the th microphone. The corresponding anechoic relative ATF with respect to the reference microphone is given by
(31) 
where is the time difference of arrival (TDOA) of the th source between microphones and . What becomes clear from (31) is that the anechoic relative ATF depends only on the two unknown parameters . The upper and lower bounds of the real part of (31) can be written compactly using the following box inequality
(32) 
and similarly for the imaginary part of .
Among all the points on the circle with any constant radius and center the middle point between microphones with indices and , the inequality in (32) becomes maximally relaxed for the maximum possible and minimum possible , i.e., when the ratio becomes maximum. This happens when the th source is in the endfire direction of the two microphones and closest to th microphone. In this case we have and, therefore, (32) becomes
(33) 
The imaginary part of is constrained similarly to (33). In the inequality in (33), the parameters are unknown. Now, we try to relax this inequality and find ways that are independent of these unknown parameters.
Note that the quantity should not be allowed to be greater than the subframe length in seconds, i.e., , where is the subframe length in samples. If it is greater than , the signal model given in (7) is invalid, i.e., the CPSDM of the th point source cannot be written as a rank1 matrix, because it will not be fully correlated between microphones . Therefore, we have
(34) 
Note that the inequality in (34) should also hold in the endfire direction of the two microphones, which means
(35) 
The inequality in (33) is maximally relaxed for the maximum possible and the minimum possible . The maximum allowable is given by (35). Moreover, another practical observation is that the sources cannot be in the same location as the microphones. Therefore, we have
(36) 
where is a very small distance (e.g., m). Therefore, the maximum range of the real part of the relative anechoic ATF is given by
(37) 
The imaginary part of is constrained similar to (37). The above inequality is based on anechoic freefield RATFs. In practice, we have early RATFs which include early echoes and/or directivity patterns which means that we might want to make the box constraint in (37) less tight.
VC Tight Box Constraints for the Early RATFs based on
In Sec. VB we proposed the box constraints in (37) based on practical considerations without knowing the distance between sensors or between sources and sensors. In this section we assume that we have an estimate of the distance matrix (see Sec. IIC), . Consequently we know and, therefore, we can make the box constraint in (37) even tighter. Specifically, the inequality in (33) is maximally relaxed as follows
(38) 
The imaginary part of is constrained similar to (38).
VD Box Constraints for the Late Reverberation PSD
In this section, we take into consideration the late reverberation. We can be almost certain that the following box constraint is satisfied:
(39) 
This box constraint is only applicable in the problem. The boxconstraint in (39) prevents large overestimation errors which may result in speech intelligibility reduction in noise reduction applications [44, 18].
VE All microphones have the same microphoneself noise PSD
Here we examine the special case where , i.e., all microphones have the same selfnoise PSD. Moreover, since the microphone selfnoise is stationary, we can be almost certain that the following boxconstraint holds
(40) 
Similar to the constraint in (39), the constraint in (40) avoids large overestimation errors.
By having a common selfnoise PSD for all microphones, the number of parameters are reduced by , since we have only one microphoneself noise PSD for all microphones. Hence, the first identifiability condition for the problem is now given by
(41) 
Similarly, the first identifiability condition for the problem is now given by
(42) 
Vi Practical Considerations
In this section, we discuss practical problems regarding the choice of several parameters of the proposed methods and implementation aspects. Although, we have already explained the problem of overdetermination in Sec. IVB, in Sec VIA, we discuss additional ways of achieving overdetermination. In Sec. VIB, we discuss about some limitations of the proposed methods. Finally, in Secs. VIC and VID, we discuss how to implement the proposed methods.
Via Overdetermination Considerations
Increasing the ratio of the number of equations over the number of unknowns obviously fits better the CPSDM model to the measurements under the assumption that the model is accurate enough and the early RATFs do not change within a timesegment. There are two main approaches to increase the ratio of the number of equations over the number of unknowns. The first approach is to reduce the number of the parameters to be estimated while fixing the number of equations as already explained in Sec. IVB. In addition to the explanation provided in IVB, we could also reduce the number of parameters by source counting per timefrequency tile and adapt . However, this is out of the scope of the present paper and here we assume that we have a constant in the entire timefrequency horizon which is the maximum possible . The second approach is to increase the number of timeframes in a timesegment and/or the number of microphones . Increasing is not practical, because typically, the acoustic sources are moving. Thus, should not be too small but also not too large. Note that is also effected by the timeframe length denoted by . If is small we can use a larger , while if is large, we should use a small in order to be able to also track moving sources. However, if we select to be very small, the number of subframes will be smaller and consequently the estimation error in will be large and will cause performance degradation.
ViB Limitations of the Proposed Methods
From the identifiability conditions in (23), (25), (41) and (42) for fixed and , we can obtain the minimum number of microphones needed to satisfy these inequalities. Alternatively, for a fixed and we can obtain the minimum number of timeframes needed to satisfy these inequalities. Finally, for a fixed and we can find the maximum number of sources for which we can identify their parameters (early RATFs and PSDs). Let , , and the minimum number of microphones satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. Moreover, let , , and the minimum number of timeframes satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. In addition, let , , and the maximum number of sources satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. The following inequalities can be easily proved:
ViC Online Implementation Using WarmStart
The estimation of the parameters is carried out for all timeframes within one timesegment. Subsequently, in order to have low latency, we shift the timesegment one timeframe. For the timeframes in the current timesegment that overlap with the timeframes in the previous timesegment, the parameters are initialized using the estimates from the corresponding
timeframes in the previous timesegment. The parameters of the most recent timeframe are initialized by selecting a value that is drawn from a uniform distribution with boundaries corresponding to the lower and upper bound of the corresponding box constraint. Only for the first timesegment, the early RATFs are initialized with the
most dominant relative eigenvectors from the averaged CPSDM over all timeframes of the first timesegment.
ViD Solver
The nonconvex optimization problems that we proposed can be solved with various existing solvers within the literature such as [45, 46, 47, 48]. In this paper, we used the standard MATLAB optimization toobox to solve the optimization problems which implements a combination of the methods in [46, 47, 48]. These methods require first and sometimes secondorder derivatives of the objective function. The firstorder derivatives of the objective functions in (11) with respect to most parameters have been obtained already in [34, 35, 4, 36] without taking into account the late reverberation PSD. Thus, here we provide only the firstorder derivatives with respect to the late reverberation PSD parameter. We have
(43)  
(44)  
(45) 
For the secondorder derivatives, we used the BroydenFletcherGoldfarbShanno (BFGS) approximated Hessian [36].
Vii Experiments
In this section, we show the performance of the proposed methods in the context of two multimicrophone applications. The first application is dereverberation of a single point source (). The second application is source separation combined with dereverberation examined in an acoustic scene with point sources. In this paper, we use the perfect permutation matrix for all compared methods in the source separation experiments. For these experiments we selected the maximumlikelihood objective function in (11). The values of the parameters that we selected for both applications are summarized in Table I. All methods based on the diagonal SCFA methodology are implemented using the online implementation in Sec. VIC. The acoustic scene we consider for the source separation example is depicted in Fig. 2. The acoustic scene we consider for the dereverberation example is similar with the only difference that the music signal and male talker sources (see Fig. 2) are not present. The room dimensions are m. The reverberation time for the dereverberation application is selected s, while for the source separation, and s. The microphone signals have a duration of s and the duration of the impulse responses used to construct the microphone signals is s. The microphone signals were constructed using the image method [43]. The microphone array is circular with a consecutive microphone distance of cm. The reference microphone is the righttop microphone in Fig. 2. Moreover, we assume that the microphoneself noise has the same PSD at all microphones. Finally, it is worth mentioning that the early part of a room impulse response (see Sec. IIB) is of the same length as the subframe length.
Parameter  Definition  Value  
number of microphones  
FFT length  
timeframe length  samples (0.125 s)  
subframe length  samples (0.0125 s)  
overlapping of subframes  
spatial coherence matrix  spherical isotropic model  
reference microphone index  



controls sparsity  
speed of sound  

cm  
sampling frequency  kHz  
mic. self noise PSD 
Viia Performance Evaluation
We will perform two types of performance evaluations in both applications. The first one measures the error of the estimated parameters, while the second one measures the performance by using the estimated parameters in a source estimation algorithm and measure instrumental intelligibility and sound quality of the estimated source waveforms. We measure the average PSD errors of the sources, the average PSD error of the late reverberation, and the average PSD error of the microphoneself noise using the following three measures [49]:
(46) 
(47) 
(48) 
We also compute the underestimates (denoted as above with superscript un) and overestimates (denoted as above with superscript ov) of the above averages as in [44] since a large overestimation error in the noise PSDs and a large underestimation error in the target PSD typically results in large target source distortions in the context of a noise reduction framework [44]. On the other hand, a large underestimation error in the noise PSDs may result in musical noise [44]. We also evaluate the average early RATF estimation error using the Hermitian angle measure [50] given by
(49) 
If the PSD of a source in a frequencybin is negligible for all timeframes within a timesegment, the estimated PSD and RATF of this source at that timefrequency tile are skipped from the above averages.
To evaluate the intelligibility and quality of the
th target source signal, the estimated parameters are used to construct a multichannel Wiener filter (MWF) as a concatenation of a singlechannel Wiener filter (SWF) and a minimum variance distortionless response (MVDR) beamformer
[1]. That is,(50) 
and
(51) 
where
(52) 
The noise reduction of the th source is evaluated using the segmentalsignaltonoiseratio (SSNR) for the th source only in subframes where the th source is active after which we average the SSNRs over all sources. Moreover, for speech sources, we measure the predicted intelligibility with the SIIB measure [51, 52] and average SIIB over all speech sources.
ViiB Reference StateoftheArt Dereverberation and ParameterEstimation Methods
For the dereverberation we first estimate the PSD of the late reverberation using the method proposed in [26, 19]. Specifically, we first compute the Cholesky decomposition after which we compute the whitened estimated noisy CPSDM as
(53) 
Next, we compute the eigenvalue decomposition , where the diagonal entries of are sorted in descending order. The PSD of the late reverberation is then computed as
(54) 
Having an estimate of the late reverberation, we compute the noise CPSDM matrix as and use it to estimate the early RATF and PSD of the target in the sequel.
We estimate the early RATF of the target using the method proposed in [8, 53]. We first compute the Cholesky decomposition . We then compute the whitened estimated noisy CPSDM as . Next, we compute the eigenvalue decomposition , where the diagonal entries of are sorted in descending order. We compute the early RATF as
(55) 
with . We improve even further the accuracy of the estimated RATF by estimating the RATFs of all time frames within each timesegment and then use the average of these as the RATF estimate. Finally, the target PSD is estimated as proposed in [28, 15], i.e.,
(56) 
where is given in (51).
Comments
There are no comments yet.