I. Introduction
Speech recordings in the real world consist of the convolutive images of multiple audio sources plus some additive noise. A convolutive image is the convolution between a source signal and the room impulse response (RIR), which is also called the mixing filter in the multisource context. The corresponding distortions of the source signals, i.e. interfering speakers, reverberation and additive noise, heavily degrade speech intelligibility for both human listening and machine recognition. This work aims to suppress these distortions, in other words, to recover the respective source signals from the multichannel recordings. In general, suppressing interfering speakers, reverberation and noise is referred to as source separation, dereverberation and noise reduction, respectively. Each of these is a difficult task that attracts a lot of research attention. In the microphone recordings, there are three unknown terms, i.e. the source signals, the mixing filters, and the noise. Hence, the problem is often split into two subproblems: i) identification of the mixing filters and noise statistics, and ii) estimation of the source signals. This work focuses on the problem of speech source estimation, assuming that the mixing filters, and possibly the noise statistics, are either known or their estimates are available.
Most convolutive source separation and speech enhancement techniques are designed in the short-time Fourier transform (STFT) domain. In this domain, the convolutive process is usually approximated at each time-frequency (TF) bin by a product between the source STFT coefficient and the Fourier transform of the mixing filter. This assumption is called the multiplicative transfer function (MTF) approximation [1], or the narrowband approximation, and the frequency-domain mixing filter is called the acoustic transfer function (ATF). Based on the known ATFs, or the respective relative transfer functions (RTFs) [2, 3]
, the beamforming techniques are widely used for multichannel source separation and speech enhancement, such as the minimum variance/power distortionless response (MVDR/MPDR) beamformer, and the linearly constrained minimum variance/power (LCMV/LCMP) beamformer
[2, 4]. Moreover, the sparsity of audio signals in the TF domain can be exploited. Based on this property, binary masking [5, 6] and norm-minimization [7] approaches have been applied to source separation. For more examples of MTF-based techniques, please refer to the comprehensive review [8] and references therein.

The narrowband assumption is theoretically valid only if the length of the mixing filters is small relative to the length of the STFT window. In practice, this is very rarely the case, even for moderately reverberant environments, since the STFT window is kept short to assume local stationarity of the audio signals. Hence the narrowband assumption fundamentally hampers the speech enhancement performance, and this becomes critical for strongly reverberant environments. To avoid the limitation of the narrowband assumption, several source separation methods based on the time-domain representation of the mixing filters have been proposed. In the wideband Lasso method [9], the source signals are estimated by minimizing an ℓ2-norm fitting cost between the microphone signals and the mixing model involving the unknown source signals, in which the exact time-domain (wideband) source-filter convolution is used. Importantly, the ℓ1-norm of the STFT-domain source signals is added to the fitting cost as a regularization term to impose the spectral sparsity of the source spectra. In the presence of additive noise, the ℓ1-norm regularization is able to reduce the noise in the recovered source signals. However, the regularization factor is difficult to set even if the noise power is known. To overcome this, a more flexible scheme is proposed in [10] that relaxes the ℓ2-norm fitting cost to the noise level and minimizes the ℓ1-norm. In addition, a reweighting approach is also proposed in [10] to approximate the ℓ0-norm.
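To see the narrowband limitation concretely, the following sketch (our own illustrative setup, not from the paper: a white source, a synthetic exponentially decaying filter standing in for an RIR, and a 512-sample window) compares the exact time-domain convolution with the per-bin multiplicative (MTF) model for a short and a long filter:

```python
import numpy as np
from scipy.signal import stft

# Compare exact convolution with the narrowband (MTF) model X(p,k) ~ H(k) S(p,k).
rng = np.random.default_rng(0)
fs, nwin, hop = 16000, 512, 256
s = rng.standard_normal(fs)  # 1 s of white "source" signal

def mtf_relative_error(rir_len):
    # exponentially decaying random filter standing in for an RIR
    h = rng.standard_normal(rir_len) * np.exp(-5.0 * np.arange(rir_len) / rir_len)
    x = np.convolve(s, h)[: len(s)]                     # exact wideband convolution
    _, _, S = stft(s, fs, nperseg=nwin, noverlap=nwin - hop)
    _, _, X = stft(x, fs, nperseg=nwin, noverlap=nwin - hop)
    H = np.fft.rfft(h, nwin)                            # one ATF gain per frequency bin
    return np.linalg.norm(X - H[:, None] * S) / np.linalg.norm(X)

err_short = mtf_relative_error(64)    # filter much shorter than the STFT window
err_long = mtf_relative_error(4096)   # filter much longer than the STFT window
```

The approximation error grows once the filter substantially exceeds the analysis window, which is exactly the regime addressed by the wideband and CTF models.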
In the family of multichannel inverse filtering, or multichannel equalization, an inverse filter is estimated with respect to the known mixing filters and applied to the microphone signals, preserving the desired source and suppressing the interfering sources. The multiple-input/output inverse theorem (MINT) method [11] was first proposed for this aim, which however is sensitive to RIR perturbations (misalignment / estimation error) and to microphone noise. To improve the robustness of MINT to RIR perturbations, many techniques have been proposed that preserve not only the direct-path impulse response but also the early reflections, such as channel shortening [12], infinity- and p-norm optimization-based channel shortening/reshaping [13], partial MINT [14, 15], etc. In addition, the energy of the inverse filter was used in [16] as a regularization term to avoid the amplification of filter perturbations and microphone noise. In [17], a two-stage method was proposed that first converts a multiple-input multiple-output (MIMO) system to multiple single-input multiple-output (SIMO) systems for source separation, and then applies inverse filtering for dereverberation.
The wideband models mentioned above all operate in the time domain. The time-domain convolution problem can be transformed to the subband domain, which provides several benefits: i) the original problem is split into subproblems, each with a smaller data size and thus a smaller computational complexity; ii) the subband mixing filters are shorter than the time-domain filters, and hence are likely to have fewer near-common zeros among microphones, which benefits both filter identification and multichannel equalization, even if the former is beyond the scope of this work; and iii) in the TF domain, the sparsity of the speech signal can be more easily exploited. Several variants of subband MINT were proposed based on filter banks [18, 19, 20, 21, 22]. The key issues in filter-bank design are: i) the time-domain RIRs should be well approximated in the subband domain, and ii) the frequency response of each filter in the bank should be fully excited, i.e. it should not contain frequency components with magnitude close to zero. Otherwise, these components are common to all channels, which is problematic for the MINT application. To satisfy the second condition, the filter bank is either critically sampled [18, 19], which suffers from frequency aliasing, or has a flat-top frequency response [20, 21, 22], which may suffer from time aliasing. Generally speaking, the STFT is preferable in the sense that most of the acoustic algorithms in the current literature operate in this domain. To represent the time-domain convolution in the STFT domain, especially for the long filter case, cross-band filters were introduced in [23]. To simplify the analysis, the convolutive transfer function (CTF) approximation is further adopted in [24, 25], which uses only the band-to-band convolution and ignores the cross-band filters. In [25], the CTF is integrated into the generalized sidelobe canceler beamformer.
In our previous works [26] and [27], the blindly estimated CTF, specifically its direct-path part, was used for localizing a single speaker and multiple speakers, respectively. In [28], a CTF-Lasso method was proposed following the spirit of the wideband Lasso [9].
Several probabilistic techniques have also been proposed for wideband source separation via maximizing the likelihood of a generative model. Variational Expectation-Maximization (EM) algorithms are proposed in
[29] and [30] based on the time-domain convolution, and in [31] based on cross-band filters. CTF-based EM algorithms are proposed in [32] and [33] for single-source dereverberation and source separation, respectively. These EM algorithms iteratively estimate the mixing filters and the sources, and intrinsically require a fairly good initialization for both filters and sources.

In this work, we propose the following three source recovery methods in the standard oversampled STFT domain using the CTF approximation:

All the above-mentioned improved MINT methods are proposed for single-source dereverberation. The multisource case has rarely been studied, even though multisource MINT was presented in the original paper [11]. We propose a CTF-based multisource MINT method for both source separation and dereverberation. The oversampled STFT suffers from neither frequency aliasing nor time aliasing. However, the STFT window is not flat-top, namely the subband signals and filters have a frequency region with a magnitude close to zero, which is common to all channels. To overcome this problem, instead of using the conventional impulse function as the target of the inverse filtering, we propose a new target, which has a frequency response corresponding to the STFT window. In addition, a filter energy regularization is adopted following [16] to improve the robustness of the inverse filtering.

For situations where the CTFs of the sources are not all available, we propose a beamforming-like inverse filtering method. The inverse filters are designed i) to preserve one source with known CTFs based on single-source MINT, and ii) to minimize the overall power of the inverse filtering output, and thus suppress the interfering sources and noise. This method shares a similar spirit with the MPDR beamformer.

To overcome the drawback of the CTF-Lasso method [28], namely that the regularization factor is difficult to set with respect to the noise level, following the spirit of [10], we propose to recover the source signals by minimizing the ℓ1-norm of the source spectra under the constraint that the ℓ2-norm fitting cost is less than a tolerance. The setting of the tolerance is studied. In addition, a complex-valued proximal splitting algorithm [34, 35] is investigated to solve the optimization problem.
The remainder of this paper is organized as follows. The problem is formulated based on the CTF in Section II. The two multichannel inverse filtering methods are proposed in Section III. The improved CTF-Lasso method is proposed in Section IV. Experiments are presented in Section V. Section VI concludes the work.
II. CTF-based Problem Formulation
In the time domain, we consider a multichannel convolutive mixture with J sources and I microphones,

x_i(t) = Σ_{j=1}^{J} a_{ij}(t) ⋆ s_j(t) + e_i(t),  i = 1, …, I, (1)

where t is the time index, and i and j are respectively the indices of the microphones and the sources. The signals x_i(t), s_j(t) and e_i(t) are the microphone signals, source signals, and noise signals, respectively. Here ⋆ denotes convolution, and a_{ij}(t) is the RIR relating the j-th source to the i-th microphone. Note that the relation between I and J is not specified here, and this will be discussed afterwards with respect to the proposed methods. The noise signals are uncorrelated with the source signals, and could be spatially uncorrelated, diffuse, or directional.
The goal of this paper is to recover the multiple source signals from the microphone signals, given the RIRs and the noise PSDs. The RIRs and noise PSDs can be blindly estimated from the microphone signals, and the estimates generally suffer from disturbances; these estimation problems are not trivial but are beyond the scope of this work. Overall, the multisource recovery problem implies that source separation, dereverberation, and noise reduction are conducted simultaneously.
II-A Convolutive Transfer Function
In this section, the time-domain convolution is transformed into the STFT-domain CTF convolution. To simplify the exposition, we consider, for the time being, the noise-free situation with only one microphone and one source: x(t) = a(t) ⋆ s(t), where the source and microphone indices are omitted.
The STFT representation of the microphone signal is

x(p,k) = Σ_t x(t) ω(t − pL) e^{−j2πk(t−pL)/N}, (2)

where p and k denote the frame index and the frequency index, respectively, ω(t) is the STFT analysis window, and N and L denote the frame (window) length and the frame step, respectively. In the filter-bank interpretation, the analysis window is considered as the lowpass filter, and L as the decimation factor.
The cross-band filter model [23] consists in representing the STFT coefficient as a summation over multiple convolutions (between the STFT-domain source signal and filter) across frequency bins. Mathematically, the linear time-invariant system can be written in the STFT domain as
(3) 
If , then is non-causal, with non-causal coefficients, where denotes the ceiling function. The number of causal filter coefficients is related to the reverberation time. For notational simplicity, we let the filter index lie in , with being the filter length, i.e. the non-causal coefficients are shifted to the causal part, which only leads to a constant shift of the frame index of the source signal. Let denote the STFT synthesis window. The STFT-domain impulse response is related to the time-domain impulse response by:
(4) 
which represents the convolution with respect to the time index evaluated at frame steps, with
To simplify the analysis, we consider the CTF approximation, i.e., only the band-to-band filters with are considered:
(5) 
II-B STFT-Domain Mixing Model
Based on the CTF approximation, we obtain the STFT-domain mixing model corresponding to the time-domain model (1),
(6) 
Note that here (and hereafter) the frequency index is omitted unless necessary, since the proposed methods are applied frequency-wise. Let and denote the frame indices of the microphone signals and the CTFs, respectively. The goal of this work is to recover the STFT coefficients of the source signals and then apply the inverse STFT to obtain an estimate of the time-domain source signals.
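As a numerical sanity check of the band-to-band model, one can identify a CTF per frequency bin by least squares between the source and microphone STFTs and inspect the residual. This is an illustrative experiment with assumed parameters (a white source, a synthetic 2000-tap filter, 512-sample window, half-overlap), not the identification scheme of the paper:

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import stft

# Identify a band-to-band (CTF) filter per frequency bin by least squares, and
# check how well the CTF convolution along frames explains the microphone STFT.
rng = np.random.default_rng(0)
nwin, hop = 512, 256
s = rng.standard_normal(16000)
a = rng.standard_normal(2000) * np.exp(-4.0 * np.arange(2000) / 2000)  # mock RIR
x = np.convolve(s, a)[: len(s)]

_, _, S = stft(s, nperseg=nwin, noverlap=nwin - hop)
_, _, X = stft(x, nperseg=nwin, noverlap=nwin - hop)

Lc = int(np.ceil((len(a) + nwin) / hop))   # CTF length in frames
errs = []
for k in range(S.shape[0]):
    # Toeplitz matrix of source frames: (Sk @ c) is the band-to-band convolution
    col = S[k]
    row = np.concatenate([[S[k, 0]], np.zeros(Lc - 1, dtype=complex)])
    Sk = toeplitz(col, row)
    c, *_ = np.linalg.lstsq(Sk, X[k], rcond=None)
    errs.append(np.linalg.norm(Sk @ c - X[k]) / np.linalg.norm(X[k]))
mean_err = float(np.mean(errs))
```

The residual is nonzero because the cross-band terms are ignored, but it remains small, which is the empirical justification of the CTF approximation.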
III. Multichannel Inverse Filtering
The multichannel inverse filtering method is based on the MINT method. In this section, we propose two MINT-based methods in the CTF domain for the multisource case.
III-A Problem Formulation for Inverse Filtering
Define the CTF-domain inverse filters as with and , where denotes the length of the inverse filters. The output of the inverse filtering is
(7) 
which comprises the mixture of the inverse filtered sources and the inverse filtered noise.
To facilitate the analysis, we denote the convolution in vector form. We define the convolution matrix for the microphone signal
as:

(8) 
and the vector of filter as
where denotes the vector or matrix transpose. Then the convolution can be written as . The inverse filtering (7) can be written as:
(9) 
with:
Similarly, we define the convolution matrix for the CTF as , and write as . Moreover, we define , and write as .
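The convolution-matrix construction used in (8)-(9) can be sketched as follows; `conv_matrix` is a hypothetical helper name, and the Toeplitz layout shown is the standard full-convolution form:

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(x, Lh):
    """Convolution matrix X of signal x such that X @ h equals np.convolve(x, h)
    for any filter h of length Lh (full convolution, cf. Eq. (8))."""
    col = np.concatenate([x, np.zeros(Lh - 1, dtype=complex)])
    row = np.concatenate([[x[0]], np.zeros(Lh - 1, dtype=complex)])
    return toeplitz(col, row)

x = np.array([1.0 + 1j, 2.0, 3.0 - 2j])   # a toy complex subband signal
h = np.array([1.0, -1j])                   # a toy length-2 filter
y = conv_matrix(x, len(h)) @ h             # matrix-vector form of the convolution
```

The same construction, applied to the CTFs instead of the microphone signals, yields the matrices used in the MINT equations below.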
III-B The CTF-MINT Formulation
To preserve a desired source, e.g. the th source, the inverse filtering of its CTF filters, i.e. , should target an impulse function of length . To suppress the interfering sources, the inverse filtering of the CTF filters of the other sources, i.e. , should target a zero signal. Let denote the vector form of , and denote a zero vector of the corresponding dimension. We define the following input-output MINT equation
which can be rewritten in a compact form as
(10) 
When the matrix is either square or wide, namely and thus , (10) has an exact solution, which means that exact inverse filtering can be achieved. This condition implies an overdetermined recording system, i.e. .
From [11], the solvability condition of (10) is that the CTFs of the desired source do not share any common zero. On the one hand, the subband filters, i.e. the CTFs, are much shorter than the time-domain filters, and are thus likely to have far fewer near-common zeros, which is a major benefit. On the other hand, the filter banks induced by the short-time windows lead to some structured common zeros. From (4), for any RIR , its CTF (with ) is computed as
(11) 
with
being the cross-correlation of the analysis window and the synthesis window modulated (frequency shifted) by . This cross-correlation has a frequency response similar to those of the windows and , in the sense that it is also a lowpass filter with the same bandwidth, denoted by . The frequency response of is the frequency response of multiplied by the frequency response of , and then folded by downsampling with a period of . To avoid frequency aliasing, the period should not be smaller than the bandwidth, so as not to fold the passband of the lowpass filter. In this work, we use the Hamming window, and the width of the main lobe is taken as the bandwidth, i.e. . Consequently, we set the constraint . If we consider the magnitude of the side lobes to be zero, the frequency response of can be interpreted as the th frequency band of multiplied by the frequency response of the downsampled , i.e. . When , the frequency response of involves some side lobes, which have a magnitude close to zero. When , only the main lobe is involved, and because the magnitude decreases dramatically from the center of the main lobe to its margin, the frequency region close to the margin of the main lobe has magnitude close to zero. This phenomenon, namely that the frequency responses of and thus of are not fully excited, is common to all microphones, which is problematic for solving (10). Fortunately, these common zeros are known to be introduced by the frequency response of . To make (10) solvable, we propose a desired target with the same frequency response as , instead of the impulse function, which has a full-band frequency response. To this end, the target is designed as:
(12) 
where denotes the vector form of . The zeros before introduce a modeling delay. As shown in [16], this delay is important for making the inverse filtering robust to perturbations of the CTF.
The solution of (10) gives an exact recovery of the th source plus the filtered noise, as shown in (7). In this method, a directional noise can be treated as an interfering source and be modeled in the MINT formulation. Therefore, here we only need to consider spatially uncorrelated or diffuse noise. To suppress the noise, a straightforward way is to minimize the power of the filtered noise under the MINT constraint (10). As proposed in [16], an alternative way to suppress the noise is to reduce the energy of the inverse filter. This strategy is equivalent to minimizing the power of the filtered noise if we approximately assume that the noise correlation matrix is the identity. In addition, this strategy is also capable of suppressing the perturbations of the CTFs, if the disturbance noise is also assumed to have an identity correlation matrix. This leads to the following optimization problem:
(13) 
where is the CTF energy of the desired source (summed over channels and frames), used as a normalization term, and is the regularization factor. Indeed, the power of the inverse filter is at the level of , and is thus normalized by . As a result, the choice of , which controls the tradeoff between the two terms in (13), is made independent of the energy level of the CTF filters. This property is especially relevant for the present frequency-wise algorithm, since all frequencies can share the same regularization factor , although the CTF energy may vary significantly across frequencies. The solution of (13), i.e. the CTF-based regularized MINT inverse filter, is
(14) 
where denotes the identity matrix of the corresponding dimension. We refer to this method as CTF-MINT.
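A small numerical sketch of the regularized solve (13)-(14), with arbitrary toy dimensions (3 microphones, 2 sources, length-4 random complex CTFs standing in for one frequency bin) and a plain delayed impulse as target rather than the window-shaped target of (12):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)
I, J, K, O = 3, 2, 4, 8        # mics, sources, CTF length, inverse-filter length
T = K + O - 1                  # length of each CTF * inverse-filter convolution

def conv_mat(c, L):
    # convolution matrix of c acting on a length-L filter (full convolution)
    col = np.concatenate([c, np.zeros(L - 1, dtype=complex)])
    row = np.concatenate([[c[0]], np.zeros(L - 1, dtype=complex)])
    return toeplitz(col, row)

ctf = rng.standard_normal((J, I, K)) + 1j * rng.standard_normal((J, I, K))

# stacked system: J*T target equations, I*O unknown filter taps (wide since I > J)
G = np.vstack([np.hstack([conv_mat(ctf[j, i], O) for i in range(I)])
               for j in range(J)])
d = np.zeros(J * T, dtype=complex)
d[2] = 1.0                     # desired source (j = 0): delayed impulse; others: zero

# regularized MINT inverse filter, cf. (14); weight scaled by the desired CTF energy
delta = 1e-10 * np.linalg.norm(ctf[0]) ** 2
h = np.linalg.solve(G.conj().T @ G + delta * np.eye(I * O), G.conj().T @ d)
residual = np.linalg.norm(G @ h - d)
```

With the system wide (I*O ≥ J*T) and a tiny δ, the residual is essentially zero; increasing δ trades this exactness for robustness to CTF perturbations and noise amplification.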
As mentioned above, to perform exact inverse filtering, the matrix should be either square or wide. In (13), the exact match between and is relaxed, which means that exact inverse filtering is given up to improve the robustness of the inverse filter estimate. Let denote the ratio between the number of columns and the number of rows of ; then we have . Renaming as , we obtain:
(15) 
For an overdetermined recording system, i.e. , we can set to have a square or wide . When , should be less than ; consequently is narrow. However, as opposed to solving (10), the optimization problem (13) is still feasible. Note that when ; hence in practice should be sufficiently small to avoid a very large .
III-C The CTF-MPDR Formulation
The above CTF-MINT approach requires knowledge of the CTFs of all the sources. In this section, we consider the situation where the CTFs of the sources are not all available. One source is recovered based on its own CTFs only.
For the desired source, the inverse filter should still satisfy , to achieve a distortionless response for the desired source. At the same time, the power of the output, i.e. , should be minimized. Again, by relaxing the match between and , we define the following optimization problem
(16) 
where is the energy of the microphone signals. Similarly to CTF-MINT, the normalization factor makes the choice of the regularization factor independent of the energy of the CTF filters and of the microphone signals. Therefore, all frequencies can share the same regularization factor , even if the energy of the microphone signals varies significantly across frequencies. This optimization problem treats any type of noise signal equally by minimizing the overall output power.
The solution of (16), i.e. the CTF-based beamforming-like inverse filter, is
(17) 
This method is similar in spirit to the MPDR beamformer, and more precisely to the speech-distortion-weighted multichannel Wiener filter [36], since the distortionless constraint on the source is relaxed. We nevertheless refer to this method as CTF-MPDR.
Similarly, let denote the ratio between the number of columns and the number of rows of ; then we have . Renaming as , we obtain
(18) 
Because the inverse filter is constrained by only one source, i.e. the desired source, it can always be set as in order to have either square or wide .
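The CTF-MPDR design can be sketched in the same toy setting; the following uses one consistent reading of (16)-(17), with illustrative dimensions, a delayed-impulse target, and a crude energy normalization:

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
I, K, O = 3, 4, 8              # mics, CTF length, inverse-filter length
T = K + O - 1

def conv_mat(c, L):
    col = np.concatenate([c, np.zeros(L - 1, dtype=complex)])
    row = np.concatenate([[c[0]], np.zeros(L - 1, dtype=complex)])
    return toeplitz(col, row)

# two active sources at the microphones, but only source 0's CTFs are known
ctf = rng.standard_normal((2, I, K)) + 1j * rng.standard_normal((2, I, K))
s = rng.standard_normal((2, 60)) + 1j * rng.standard_normal((2, 60))
x = [sum(np.convolve(s[j], ctf[j, i]) for j in range(2)) for i in range(I)]

G0 = np.hstack([conv_mat(ctf[0, i], O) for i in range(I)])  # desired source only
Xm = np.hstack([conv_mat(np.asarray(xi), O) for xi in x])   # mic-signal matrices
d = np.zeros(T, dtype=complex)
d[2] = 1.0                                                  # delayed impulse target

# distortionless fit plus a normalized output-power penalty, cf. (16)-(17)
mu = 1e-3 * np.linalg.norm(ctf[0]) ** 2 / np.linalg.norm(Xm) ** 2
h = np.linalg.solve(G0.conj().T @ G0 + mu * (Xm.conj().T @ Xm), G0.conj().T @ d)
distortion = np.linalg.norm(G0 @ h - d)
```

The penalty on the output power implicitly suppresses the interfering source and noise, while the small weight keeps the desired-source response close to the target.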
For both CTF-MINT and CTF-MPDR, the source signals are estimated by taking each source in turn as the desired source and applying (7). Neither method requires knowledge of the noise statistics.
IV. CTF-based Constrained Lasso
Instead of explicitly estimating an inverse filter, the source signals can be directly recovered by matching the microphone signals and the mixing model involving the unknown source signals. To this end, the spectral sparsity of the speech signals could be exploited as prior knowledge.
IV-A Problem Formulation for the Mixing Model
The mixing model (6) can be rewritten in vector/matrix form as
(19) 
where , and denote the matrices of the microphone signals, source signals and noise signals, respectively, and denotes the three-way CTF array. The convolution is carried out along the frame axis. Recall that this equation is defined for each frequency bin and that we omit the frequency index for clarity of presentation. In Section III, the convolution between two signals was formulated as the multiplication of the convolution matrix of one signal with the vector form of the other signal. In the present section, the convolution operator is considered in its conventional form. The reason is that, in the method proposed here, only the convolution operation itself is used, which can be computed with the fast Fourier transform.
In our previous work [28], we proposed to estimate the source signals by solving an ℓ2-norm fitting cost minimization problem with an ℓ1-norm regularization term
(20) 
where is the regularization factor. Note that both the ℓ1- and ℓ2-norms on matrices are defined here entrywise, as vector norms. The first term minimizes the fitting cost, and the second term imposes sparsity on the speech source signals. In the presence of additive noise , the regularization factor can be adjusted to impose sparsity and thus to remove the noise from the estimated source signals. However, it is difficult to tune automatically even when the noise PSD is known. In particular, the source recovery is performed frequency by frequency in this work, and the noise PSD commonly takes different values at different frequencies. This requires a specific value of for each frequency, which further increases the difficulty of choosing . In this work, we solve this problem by transforming the above problem into a constrained optimization problem.
IV-B CTF-based Constrained Lasso
Problem (20) is equivalent to the following formulation
(21) 
for some unknown and . The ℓ2-norm fitting cost is relaxed to at most a tolerance . This formulation was first proposed in [10] for audio source separation in the time domain. We adapted it to the CTF-magnitude domain in our previous work [37] for single-source dereverberation. In the present work, we further extend it to the complex-valued CTF domain for multisource recovery.
The setting of the tolerance is critical to the quality of the recovered source signals. The tolerance is related to the noise power in the microphone signals. The noise signal is assumed to be stationary. Let denote the noise PSD at the th microphone, which can be estimated from a pure noise signal or by a noise PSD estimator, e.g. [38]. Let denote the noise signal at the th microphone in vector form. The squared ℓ2-norm of the noise signal, i.e. the noise energy , follows an Erlang distribution with mean and variance [39]. Assuming the noise signals are spatially uncorrelated, the squared ℓ2-norm summed over all microphones has mean and variance . To relax the fitting cost to the noise power, we set the noise relaxing term as:
(22) 
Here, the standard deviation is subtracted twice because: i) this makes the probability that the fitting cost is larger than the noise energy very small; when the fitting cost is allowed to be larger than the noise energy, the ℓ1-norm minimization will distort the source signal; here we favor less source-signal distortion at the price of less noise reduction; and ii) the ℓ1-norm minimization tends to make the residual noise in the estimated source signals sparse; sparse noise is perceptually noticeable even when its power is low, and as a result some perceptible noise remains in the estimated source signal. This method needs only an estimate of the single-channel noise auto-PSD, but not the cross-PSD among microphones or among frames. Note that a directional noise cannot be treated as a source, since the method depends on the spectral sparsity of the source signal.

Besides, the fit should also be relaxed with respect to the CTF approximation error and the CTF filter perturbations. The corresponding tolerance is related to the energy of the noise-free signal, which can be estimated by spectral subtraction as:
(23) 
Empirically, the tolerance with respect to the noise-free signal is set to . Overall, the tolerance is set to .
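The tolerance computation of (22)-(23) can be sketched as follows; `tolerance` is a hypothetical helper, and the weight `lam` on the noise-free-energy term is a placeholder, since the paper's empirical value is not reproduced here:

```python
import numpy as np

def tolerance(x_stft, sigma2, lam=0.1):
    """Fitting tolerance for CTF-CLasso at one frequency bin.
    x_stft: (mics, frames) complex STFT coefficients of the microphone signals.
    sigma2: per-microphone noise PSD at this bin.
    lam:    placeholder weight for CTF-approximation/perturbation errors."""
    sigma2 = np.asarray(sigma2, dtype=float)
    P = x_stft.shape[1]
    mean = P * np.sum(sigma2)                    # expected total noise energy
    std = np.sqrt(P * np.sum(sigma2 ** 2))       # its standard deviation (Erlang)
    eps_noise = mean - 2.0 * std                 # noise term of the tolerance
    e_clean = max(np.sum(np.abs(x_stft) ** 2) - mean, 0.0)  # spectral subtraction
    return eps_noise + lam * e_clean
```

With a zero noise PSD, the tolerance reduces to `lam` times the observed signal energy, i.e. the relaxation then only accounts for model and filter errors.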
Thanks to the sparsity constraint, the optimization problem (21) is feasible for (over)determined configurations as well as underdetermined ones. We refer to this method as CTF-based Constrained Lasso (CTF-CLasso).
IV-C Convex Optimization Algorithm
The optimization algorithm presented in this section mainly follows the principle proposed in [10]. Unlike [10], the target optimization problem (21) is formulated in the complex domain, and thus the optimization algorithm is also complex-valued. The optimization problem consists of an ℓ1-norm minimization and a quadratic constraint, which are both convex. The difficulty of this convex optimization problem is that the ℓ1-norm objective function is not differentiable.
The constrained optimization problem (21) can be recast as the following unconstrained optimization problem
(24) 
where denotes the convex set of signals satisfying the constraint, , and denotes the indicator function of , which equals 0 if , and otherwise. This unconstrained problem consists of two lower semicontinuous, non-differentiable (non-smooth), convex functions. For this problem, the Douglas-Rachford splitting method [34] is suitable, which is an iterative method. At each iteration, the two functions are split, and their proximity operators and (see below) are applied individually. The Douglas-Rachford method does not require the differentiability of either of the two functions, and is a generalization of the proximal splitting method [35]. Algorithm 1 summarizes the Douglas-Rachford method. Here and are set as constant values over the iterations, e.g. 1 and 0.01 respectively in our experiments. The initialization of is set as the matrix composed of replications of the first microphone signal. The convergence criterion checks whether the optimization objective is almost invariant from one iteration to the next. The threshold is set to in our experiments. In addition, the maximum number of iterations is set to 20.
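Algorithm 1 itself is not reproduced here, but the Douglas-Rachford iteration can be illustrated on a toy instance of (24) in which the mixing operator is the identity, so that the projection onto the constraint set is closed-form (in the paper it must be computed iteratively by Algorithm 2); all parameter values below are illustrative:

```python
import numpy as np

def shrink(z, t):
    # complex soft-thresholding: proximity operator of t * ||.||_1
    mag = np.abs(z)
    return np.where(mag > t, (1.0 - t / np.maximum(mag, 1e-12)) * z, 0.0)

def proj_ball(z, x, eps):
    # projection onto the constraint set {s : ||s - x||_2 <= eps}
    r = z - x
    n = np.linalg.norm(r)
    return z if n <= eps else x + eps * r / n

x = np.array([3.0 + 0.0j, 0.1, -2.0, 0.05])   # toy "microphone" vector
eps, gamma, mu = 0.2, 0.05, 1.0
y = x.copy()
for _ in range(3000):
    s = proj_ball(y, x, eps)                   # prox of the indicator function
    y = y + mu * (shrink(2 * s - y, gamma) - s)
s = proj_ball(y, x, eps)                       # feasible l1-minimizing estimate
```

The iterate stays feasible while the ℓ1-norm is reduced within the allowed tolerance, driving the small entries toward zero, which is the sparsifying behavior exploited by CTF-CLasso.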
The proximity operator plays the most important role in the optimization of non-smooth functions. In Hilbert space, the proximity operator of a complex-valued function is
(25) 
The proximity operator of the ℓ1-norm at point , also known as the shrinkage operator, is given entrywise by
(26) 
The proximity operator of the indicator function is the projection of onto . To compute this projection, based on the proximal splitting method and the Fenchel-Rockafellar duality [40], an iterative method was derived in [41] and used in [10]. However, this method converges at a rate of O(1/k), which is slow, especially when the convex set (also ) is small. As hinted in [41], it can be accelerated to a rate of O(1/k²) via Nesterov's scheme [42, 43]. The accelerated method is summarized in Algorithm 2. The acceleration procedure is composed of Steps 3 and 4, which are based on the derivation in [43]. Here is the adjoint of , obtained by conjugate-transposing the source and channel indices and then temporally reversing the filters. Here is the tightest frame bound of the quadratic operator in the indicator function, and thus is the largest spectral value of the frame operator . The power iteration method is used to compute , as summarized in Algorithm 3. We set as a constant value over the iterations, e.g. in the experiments. In Step 2, the projection of a variable onto the convex set can be easily obtained as
(27) 
In Algorithm 2, the variable iteratively moves from the initial point to its projection; hence a convergence criterion is set to check the feasibility of the constraint. The slack factor is set to avoid the time-consuming long tail of the convergence, which however may lead to a small bias in the ℓ2-norm constraint. In addition, the maximum number of iterations is set to 300.
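The frame bound used as the step size in Algorithm 2 can be computed by power iteration, as in Algorithm 3; here is a minimal sketch on a small fixed real operator (a stand-in for the CTF mixing operator):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])      # toy linear operator standing in for the CTF mixing

v = np.array([1.0, 1.0])        # non-zero starting vector
for _ in range(100):
    v = A.T @ (A @ v)           # apply the frame operator A^H A
    v = v / np.linalg.norm(v)   # renormalize to avoid overflow
beta = v @ (A.T @ (A @ v))      # Rayleigh quotient -> largest eigenvalue of A^H A
```

Here A^T A = [[5, 1], [1, 2]], whose largest eigenvalue is (7 + sqrt(13))/2, so beta converges to that value; it equals the squared spectral norm of A, i.e. the tightest frame bound.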
V. Experiments
In this section, we evaluate the quality of the estimated source signals, in terms of the performance of source separation, speech dereverberation and noise reduction.
V-A Experimental Configuration
V-A1 Dataset
The multichannel impulse response dataset [44] is used, which was recorded using an 8-channel linear microphone array in the speech and acoustics lab of Bar-Ilan University, with a room size of 6 m × 6 m × 2.4 m. The reverberation time is controlled by 60 panels covering the room facets. In the reported experiments, we used the recordings with s. The RIRs are truncated to correspond to , and have a length of 5600 samples. Speech signals from the TIMIT dataset [45], with a duration of about 3 s, are taken as the source signals. A TIMIT utterance is convolved with an RIR to form the image of one source, and multiple image sources are summed up. For each such mixture, the source direction and the microphone-to-source distance of each source are randomly selected from the available directions and from {1 m, 2 m}, respectively. Note that the multiple sources consist of different TIMIT utterances and different impulse responses in terms of source directions. To generate noisy microphone signals, a spatially uncorrelated stationary speech-like noise is added to the noise-free mixture; the noise level is controlled by a wideband input signal-to-noise ratio (SNR). Note that SNR refers to the single source-to-noise ratio averaged over the multiple sources. To evaluate the robustness of the methods to perturbations of the RIRs/CTFs, a proportional random Gaussian noise is added to the original filters in the time domain to generate the perturbed filters, denoted as . The perturbation level is quantified by the normalized projection misalignment (NPM) [46] in decibels (dB). Various acoustic conditions in terms of the number of microphones and sources, SNRs, and NPMs are tested. For each condition, 20 runs are executed, and the averaged performance measures are reported.
V-A2 Performance Metrics
The signal-to-distortion ratio (SDR) [47] in dB is used to evaluate the overall quality of the outputs. The unprocessed microphone signals are evaluated to obtain the baseline scores. The overall outputs, i.e. (7) for CTF-MINT and CTF-MPDR, and (21) for CTF-C-Lasso, are evaluated to obtain the output scores.
The signal-to-interference ratio (SIR) [47] in dB is specifically used to evaluate the source separation performance. This metric focuses on the suppression of interfering sources, hence the additive noise should be excluded. The unprocessed noise-free mixtures, i.e. , are evaluated as the baseline scores. For CTF-MINT and CTF-MPDR, we can simply take the noise-free output, i.e. in (7), for evaluation. However, for CTF-C-Lasso, we have to evaluate the overall outputs, since the noise-free output is not available. Experimental results show that CTF-C-Lasso has low residual noise, thus the SIR measure is assumed not to be significantly influenced by the output additive noise.
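The SDR/SIR metrics are computed with the BSS Eval toolbox [47]; a simplified, scale-invariant energy-ratio variant (not the exact BSS Eval decomposition, which additionally allows short distortion filters) can be sketched as:

```python
import numpy as np

def si_ratio(reference, estimate):
    """Scale-invariant signal-to-error ratio in dB: project the estimate
    onto the reference and compare the target energy to the residual
    energy. A simplified stand-in for the BSS Eval SDR/SIR measures."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(error ** 2))

# toy example: a scaled sinusoid plus a small amount of noise
ref = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
est = 0.8 * ref + 0.01 * np.random.default_rng(0).standard_normal(16000)
sir = si_ratio(ref, est)  # high, since the residual is small
```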
The perceptual evaluation of speech quality (PESQ) [48] is specifically used to evaluate the dereverberation performance, so the interfering sources and noise should be excluded. For each source, its unprocessed image sources, i.e. , are evaluated as the baseline scores. For CTF-MINT and CTF-MPDR, the noise-free single-source output, i.e. , is evaluated. For CTF-C-Lasso, again we have to evaluate the overall outputs. However, the residual interfering sources and noise affect the PESQ measure to a large extent. Therefore, it should be noted that the PESQ scores of CTF-C-Lasso are highly underestimated.
The output SNR in dB is used to evaluate the noise reduction performance. The input SNR is taken as the baseline score. For CTF-MINT and CTF-MPDR, the output SNR is computed as the power ratio between the noise-free outputs and the output noise, i.e. . For CTF-C-Lasso, the noise PSDs in the output signals are first blindly estimated using the method proposed in [38]. The power of the noise-free outputs is estimated by spectral subtraction following the principle in (23), and the output SNR is obtained as the ratio of the two. It is shown in [38] that the estimation error of the noise PSD is around 1 dB, hence the estimated output SNRs are reliable.
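The spectral-subtraction-based output SNR estimate used for CTF-C-Lasso can be sketched as follows; here the per-frequency noise PSD is assumed to be given, whereas in the experiments it is blindly estimated with the method of [38]:

```python
import numpy as np

def output_snr_db(noisy_spec, noise_psd):
    """Output SNR from a noisy STFT and a per-frequency noise PSD, via
    power spectral subtraction: the noise-free power is estimated as
    max(noisy power - noise PSD, 0), then the power ratio is taken."""
    noisy_power = np.mean(np.abs(noisy_spec) ** 2, axis=1)   # per-frequency power
    clean_power = np.maximum(noisy_power - noise_psd, 0.0)   # spectral subtraction
    return 10 * np.log10(np.sum(clean_power) / np.sum(noise_psd))

# toy example: 4 frequency bins, 50 frames, assumed noise PSD of 0.5 per bin
rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 50)) + 1j * rng.standard_normal((4, 50))
psd = np.full(4, 0.5)
snr = output_snr_db(spec, psd)
```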
SDR, SIR and PESQ are evaluated in the time domain; hence, the signals mentioned above are actually their corresponding time-domain counterparts reconstructed using the inverse STFT. The output SNR for CTF-MINT and CTF-MPDR can be computed either in the time domain or in the STFT domain, while the output SNR for CTF-C-Lasso is computed in the STFT domain.
VA3 Parameter Settings
The sampling rate is 16 kHz. The STFT is computed using a Hamming window, with a window length and frame step of (64 ms) and , respectively. The CTFs are computed from the time-domain filters using (11). The CTF length is 29. For the overdetermined recording system, i.e. , the length of the inverse filter of CTF-MINT, i.e. , is computed via (15) with , which makes square. Pilot experiments show that a longer inverse filter (or a larger ) does not noticeably improve the performance measures, while leading to a larger computational cost. For the case of , is set to be less than and close to , and should be small to avoid an unreasonably long inverse filter. The exact values of will be given in the following experiments, depending on the specific values of and . The length of the inverse filter of CTF-MPDR, i.e. , is computed via (18) with , so that is square. The optimal setting of the modeling delay in is related to the length of the inverse filters. In the experiments, it is set to 6 and 3 taps for CTF-MINT and CTF-MPDR, respectively, as a good tradeoff for the different inverse filter lengths in various acoustic conditions.
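The STFT analysis described above can be sketched as follows; the 1024-sample (64 ms at 16 kHz) Hamming window matches the text, while the 512-sample hop is an assumed illustrative value, since the frame step is not recoverable from the extracted text:

```python
import numpy as np

def stft(x, win_len=1024, hop=512):
    """STFT with a Hamming window. win_len matches the 64 ms window of
    the text; hop is an assumed illustrative frame step."""
    win = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[t * hop: t * hop + win_len] * win
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # shape: (frequency bins, frames)

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise at 16 kHz
X = stft(x)
print(X.shape)  # (513, 30)
```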
Thanks to the normalization factors in (13) and (16), the same regularization factors and are suitable for all frequencies. Moreover, they are robust to the possible numerical scales of the filters and the signals in different datasets. Fig. 1 shows the performance measures of CTF-MINT and CTF-MPDR as a function of and , respectively. For CTF-MINT, as increases, the inaccuracy of the inverse filtering increases, while the energy of the inverse filters decreases. From the left plot of Fig. 1, it is observed that the output SNR grows with , which confirms that the additive noise can be suppressed by decreasing the energy of the inverse filter. However, the SIR and PESQ scores become smaller as increases, due to the larger inverse filtering inaccuracy, which leads to more residual interfering sources and reverberation. Combining these effects, SDR first increases and then decreases with . Similarly, the energy of the inverse filters also affects the robustness of the inverse filtering to the CTF perturbations. In summary, we consider two representative choices of : i) a relatively small one, i.e. , leads to accurate inverse filtering but a large inverse filter energy; this is suitable when both the microphone noise and the CTF perturbations are small; and ii) a large one, i.e. , achieves an output SNR slightly larger than the input SNR, thus avoiding amplification of the additive noise. In the following experiments, the former is used for the noise-free case, and the latter for the noisy case. This partially oracle configuration is somewhat unrealistic, but it is useful to show the full potential of CTF-MINT. See [14] for further discussion on the optimal setting of .
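The effect of the regularization factor on the inverse filter energy can be illustrated with a generic Tikhonov-regularized least-squares solve; this is a toy stand-in for the regularized computation in (14), with a random placeholder convolution matrix and target:

```python
import numpy as np

def regularized_inverse_filter(A, d, delta):
    """Solve min_h ||A h - d||^2 + delta ||h||^2. A larger delta shrinks
    the filter energy (more noise-robust), at the cost of a larger
    inverse filtering inaccuracy (residual ||A h - d||)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ d)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))   # placeholder convolution matrix
d = rng.standard_normal(64)         # placeholder target response
h_small = regularized_inverse_filter(A, d, 1e-4)  # accurate, high energy
h_large = regularized_inverse_filter(A, d, 1e+1)  # inaccurate, low energy
print(np.linalg.norm(h_small) > np.linalg.norm(h_large))  # True
```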
For CTF-MPDR, controls the tradeoff between the distortionless response to the desired source and the power of the output. Minimizing the output power suppresses both the interfering sources and the noise. From the right plot of Fig. 1, we observe that PESQ decreases as increases, due to the increased distortion of the desired source. SIR and output SNR can be increased by increasing up to . A larger , e.g. , leads to a smaller SIR and output SNR even though the output power is smaller, since the desired signal is also heavily distorted and suppressed. Overall, is set to , which achieves a high PESQ score together with good values for the other measures.
VB Influence of the Number of Microphones
Fig. 2 shows the results as a function of the number of microphones. The number of sources is fixed to three. In this experiment, the microphone signals are noise-free, thus the output SNR is not reported. For CTF-MINT, is set to 0.55 and 0.8 for the cases of two and three microphones, respectively. Consequently, the length of the inverse filters is about five times the CTF length.
For CTF-MINT, the scores of all three metrics dramatically decrease when the number of microphones goes from four to three and then to two, namely from the overdetermined case to the determined and then the underdetermined case. This indicates that the inaccuracy of the inverse filtering is large in the non-overdetermined cases, due to the insufficient degrees of freedom of the inverse filters as spatial parameters. CTF-MPDR suppresses the interfering sources by minimizing the power of the output, and implicitly also by inverse filtering with a target of zero signal. Therefore, for CTF-MPDR, the metrics that measure the suppression of interfering sources, i.e. SDR and SIR, also significantly degrade in the non-overdetermined cases. As the number of microphones increases, the PESQ score varies only slightly, which means that the inverse filtering of the desired source is not considerably affected, due to the small variation of the output power. The performance measures of CTF-C-Lasso increase almost linearly with the number of microphones, regardless of whether the system is underdetermined or overdetermined, thanks to the exploitation of spectral sparsity. For the overdetermined case, i.e. four microphones or more, the SDR and SIR of the three methods slowly increase with the number of microphones, with CTF-MINT having the largest rate of change. CTF-C-Lasso achieves the worst PESQ score due to the influence of the residual interfering sources; however, informal listening to its outputs indicates that they are not perceived as more reverberant.
Overall, when noise reduction is not considered, CTF-MINT performs best in the overdetermined case. For instance, CTF-MINT achieves an SDR of 21.9 dB using four microphones, which is a very good source recovery score. CTF-C-Lasso performs best in the underdetermined case. For instance, CTF-C-Lasso achieves an SDR of 8.4 dB using only two microphones. Since it uses the mixing filters of only one source, the source separation performance of CTF-MPDR is worse than that of the other two methods.
VC Performance for Various Numbers of Sources
Fig. 3 shows the results as a function of the number of sources. In this experiment, the number of microphones is fixed to six. The microphone signals are noise-free, thus the output SNR is not reported. From this figure, we observe that the performance measures of the three methods degrade as the number of sources increases, except for the PESQ score of CTF-MPDR. CTF-MINT achieves the best performance, even if it exhibits the largest performance degradation. This is consistent with the experiments on the number of microphones: good performance requires a large ratio between the number of microphones and the number of sources. Both CTF-MPDR and CTF-C-Lasso show smaller performance degradation. At first sight, it is surprising that CTF-MPDR achieves a larger PESQ score when more sources are present in the mixture. The reason is that the normalized output power, i.e. , becomes smaller as the number of sources increases due to a larger . Correspondingly, the inverse filtering inaccuracy for the desired source, i.e. , becomes smaller as well.
Table I. SDR [dB] (top) and computation time per mixture [s] (bottom) for six representative acoustic conditions. '-' in the SNR and NPM columns means noise-free and perturbation-free, respectively; '-' elsewhere means the condition was not tested.

SDR [dB]
Mics  Srcs  SNR     NPM      CTF-MINT  CTF-MPDR  CTF-C-Lasso  LCMP  TD-MINT  W-Lasso
4     3     -       -        21.9      6.7       11.0         3.6   -        18.9
6     2     -       -        30.4      10.4      16.6         0.3   30.0     31.2
6     3     -       -        26.3      8.2       12.6         0.6   -        23.8
6     5     -       -        13.6      4.5       8.2          6.4   -        14.7
4     3     15 dB   -        3.8       0.9       10.6         14.7  -        -
4     3     -       -15 dB   1.7       4.3       4.2          4.1   -        0.5

Computation time per mixture [s]
Mics  Srcs  SNR     NPM      CTF-MINT  CTF-MPDR  CTF-C-Lasso  LCMP  TD-MINT  W-Lasso
4     3     -       -        25.4      4.9       1987         1.1   -        4284
6     2     -       -        5.8       4.2       1688         1.1   142      3843
6     3     -       -        12.2      5.9       2827         1.2   -        5961
6     5     -       -        229.6     12.4      5679         1.9   -        10134
4     3     15 dB   -        21.9      6.7       1500         1.1   -        -
4     3     -       -15 dB   21.9      6.7       1440         1.1   -        4245
VD Influence of Additive Noise
Fig. 4 shows the results as a function of the input SNR. The numbers of microphones and sources are fixed to four and three, respectively. As mentioned above, for the noisy case, the regularization factor is set to . The inverse filter of CTF-MINT is invariant across input SNRs, since it depends only on the CTF filters and not on the microphone signals. As a result, the SIR and PESQ scores are constant, but much smaller than in the noise-free case with , see Fig. 2. The SNR improvement is also constant, at about 1 dB. For CTF-MPDR, SIR and PESQ are smaller when the input SNR is lower, since a larger input noise leads to a larger output noise, which degrades the suppression of the interfering sources and distorts the inverse filtering of the desired source. As the input SNR increases, the output SNR increases, but the SNR improvement decreases. The SNR improvement is negative when the input SNR is larger than 5 dB, which means the microphone noise is amplified. For CTF-MINT and CTF-MPDR, the residual noise is significant, which indicates that the inverse filtering is not able to efficiently suppress white noise. Therefore, a single-channel noise reduction process is needed as post-processing, as in [49, 50]. The output SNR of CTF-C-Lasso is always larger than the input SNR, which means that the microphone noise is efficiently reduced. The SDR and SIR of CTF-C-Lasso degrade in the low SNR cases, but not by much.
VE Influence of CTF Perturbations
Fig. 5 shows the results as a function of the NPM. For CTF-MINT, two choices of the regularization factor, i.e. and , are tested. As expected, all the metrics become worse as the NPM increases, thus we only analyze the SDR scores. Note that, at an NPM of -65 dB, the three methods achieve almost the same performance measures as in the perturbation-free case. As the NPM increases, the performance of CTF-MINT with dramatically degrades from a large score to a very small one, which indicates its high sensitivity to CTF perturbations. In contrast, CTF-MINT with has a small performance degradation rate, but its performance is poor even in the low NPM cases. The performance measures of CTF-MPDR decrease almost linearly, with a relatively large degradation rate. The performance of CTF-C-Lasso is stable up to an NPM of -35 dB, and quickly degrades when the NPM is larger than -25 dB.
In CTF-MINT, the inverse filter is designed to separately satisfy the targets for the desired source and for the interfering sources. Therefore, the CTF perturbations of the desired source do not significantly affect the suppression of the interfering sources, and vice versa. Moreover, in CTF-MPDR, the inverse filter depends only on the CTFs of the desired source, hence the CTF perturbations of the interfering sources do not affect the inverse filtering at all. In contrast, in CTF-C-Lasso, all sources are simultaneously recovered based on the CTFs of all of them; consequently, the CTF perturbations of one source affect the recovery of all sources. These assertions have been verified by pilot experiments.
VF Comparison with Baseline Methods
To benchmark the proposed methods, we compare them with three baseline methods:

LCMP beamformer [4], based on the narrowband assumption. Based on the steering vectors and the correlation matrix of the microphone signals, a beamformer is computed to preserve one desired source, zero out the others, and minimize the power of the output. The RIRs are longer than the STFT window, thus the steering vector should be computed as the Fourier transform of the truncated RIRs. In this experiment, the steering vector is set to the CTF tap with the largest power.

Time-domain MINT (TD-MINT) [16]. This method is also set to recover the direct-path source signal with an energy regularization. In this experiment, we extend it to the multisource case. We only test the condition with and ; following the principle of the proposed method, the length of the inverse filter and the modeling delay are set to 2800 and 1024, respectively. The other conditions require inverse filters that are too long to be implemented within the memory resources of a standard personal computer.

Wideband Lasso (W-Lasso) [9]. The regularization factor is set to , which is empirically suitable for the noise-free case.
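For reference, the LCMP beamformer used as a baseline has the standard closed-form weights w = R^{-1}C(C^H R^{-1}C)^{-1}g [4]; a minimal sketch with random placeholder narrowband data:

```python
import numpy as np

def lcmp_weights(R, C, g):
    """LCMP beamformer: minimize w^H R w subject to C^H w = g, where R is
    the microphone correlation matrix and the columns of C are the
    steering vectors of the constrained sources."""
    Ri_C = np.linalg.solve(R, C)
    return Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, g)

rng = np.random.default_rng(0)
M, J = 6, 3                                  # microphones, sources
A = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))
R = A @ A.conj().T + np.eye(M)               # toy correlation matrix
g = np.array([1.0, 0.0, 0.0])                # preserve source 1, null the others
w = lcmp_weights(R, A, g)
print(np.allclose(A.conj().T @ w, g))        # True: constraints satisfied
```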
Table I presents the SDR scores for six representative acoustic conditions, as well as the computation times, which will be analyzed in the next section. Note that '-' means noise-free and perturbation-free in the SNR and NPM columns, respectively. LCMP performs poorly in all conditions, which verifies the assertion that the narrowband assumption is not suitable for long RIRs. CTF-MINT achieves a slightly higher SDR score than TD-MINT, despite the fact that the CTF-based filtering is an approximation of the time-domain filtering. This is mainly due to the much shorter filters in the STFT/CTF domain. W-Lasso noticeably outperforms CTF-C-Lasso in the noise-free and perturbation-free cases, owing to its exact time-domain convolution. W-Lasso has a noise reduction capability similar to that of CTF-C-Lasso; however, its regularization factor is difficult to set for proper noise reduction, hence the results of W-Lasso for the noisy case are not reported. Compared to CTF-C-Lasso, W-Lasso degrades faster as the number of sources and the filter perturbations increase.
VG Analysis of Computational Complexity
Table I also presents the averaged computation time for one mixture with a duration of 3 s. All methods were implemented in MATLAB. The CTF-MINT and CTF-MPDR computation times comprise the computation of the inverse filters and the inverse filtering of the microphone signals, with the former dominating. From (14) and (17), the computations include matrix multiplications and inversions, hence the complexity is cubic in the matrix dimension. We consider the square matrices in (14) and in (17), whose dimension is equal to . From (15) and (18), is proportional to the filter length , to for CTF-MINT, and to for CTF-MPDR. The inverse filters are computed separately for each source and each frequency. Overall, CTF-MINT and CTF-MPDR have computational complexities of and , respectively, where is the number of frequency bins. The complexity of TD-MINT can be derived from that of CTF-MINT by replacing the CTF length with the RIR length and setting to 1. Since it is proportional to the cube of the RIR length, this complexity is prohibitive for most settings. The LCMP beamformer is similar to CTF-MINT, but uses an instantaneous steering vector and an instantaneous inverse filter, namely CTF and inverse filter lengths of 1, hence it has the lowest computational complexity. All these methods have a closed-form solution and thus a low computational complexity, as confirmed by the computation times in Table I.
The iterative optimization of CTF-C-Lasso leads to a high computational complexity. Unlike Newton-style methods employing the second-order derivative, the Douglas-Rachford optimization method is a first-order method, hence its per-iteration complexity is linear with respect to the problem size, specifically the length of the microphone signals and filters, and the number of microphones and sources. The most time-consuming procedure in Algorithm 1 is the computation of the proximity of the indicator function, i.e. the projection. To verify this, we can compare the Douglas-Rachford method with the optimization algorithm for the Lasso problem (20), which does not have a norm constraint and thus no indicator function. In [28], we solved the unconstrained Lasso problem using the fast iterative shrinkage-thresholding algorithm (FISTA) [43], which is also a proximal splitting method, just without the proximity of the indicator function. As reported in [28], FISTA needs only tens of seconds per mixture, while Douglas-Rachford here needs thousands of seconds per mixture, see Table I. As stated in Section IVC, in Algorithm 2, the variable iteratively moves from the initial point to its projection onto the convex set. Therefore, a larger convex set, caused by a larger noise power (a larger ), requires fewer iterations to reach the projection, and thus less computation time. This is confirmed by the fact that the case with an SNR of 15 dB needs less computation time than the noise-free case. When the CTF perturbations are large, e.g. at an NPM of -15 dB, the optimized objective, i.e. , is large, hence fewer iterations (and less computation time) are needed to converge. The CTF convolution at one frequency has a much smaller data size than the time-domain convolution; as a result, the CTF-based Douglas-Rachford method only requires on the order of ten iterations to converge, while the time-domain W-Lasso method requires tens of thousands of iterations.
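The projection mentioned above, i.e. the proximity operator of the indicator function of an l2-norm ball, has a simple closed form; a minimal sketch, where the ball center and radius stand in for the noise-dependent constraint set of Algorithm 2:

```python
import numpy as np

def project_l2_ball(x, center, eps):
    """Projection onto the set {z : ||z - center||_2 <= eps}, i.e. the
    proximity operator of the indicator function of that convex set."""
    r = x - center
    norm = np.linalg.norm(r)
    if norm <= eps:
        return x.copy()             # already feasible: nothing to do
    return center + eps * r / norm  # move to the ball surface

y = np.array([3.0, 4.0])            # ||y|| = 5, outside the unit ball
p = project_l2_ball(y, np.zeros(2), eps=1.0)
print(p)  # [0.6 0.8]
```

A larger eps (a larger noise power) enlarges the feasible set, so iterates reach it in fewer steps, consistent with the computation times discussed above.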
As shown in Table I, the W-Lasso method needs more computation time than CTF-C-Lasso, even though it is unconstrained and optimized with FISTA.
VI Conclusion
Three CTF-based source recovery methods have been proposed in this paper. CTF-MINT is an ideal overdetermined source recovery method when the microphone noise and the mixing filter perturbations are small, and it has a relatively low computational complexity. However, it is sensitive to microphone noise and filter perturbations. CTF-MPDR is also more suitable for the overdetermined case than for the non-overdetermined case. It achieves the worst performance among the three proposed methods, but with the lowest computational cost. The major virtue of CTF-MPDR is that it only requires the mixing filters of the desired source, which makes it more practical. Thanks to the exploitation of spectral sparsity, CTF-C-Lasso is able to perform well in the underdetermined case and to efficiently reduce the microphone noise. However, it requires the mixing filters of all sources, which are not easy to obtain in practice. In addition, its computational cost is high due to the iterative optimization procedure.
References
 [1] Y. Avargel and I. Cohen, “On multiplicative transfer function approximation in the shorttime Fourier transform domain,” IEEE Signal Processing Letters, vol. 14, no. 5, pp. 337–340, 2007.
 [2] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” IEEE Transactions on Signal Processing, vol. 49, no. 8, pp. 1614–1626, 2001.
 [3] X. Li, L. Girin, R. Horaud, and S. Gannot, “Estimation of relative transfer function in the presence of stationary noise based on segmental power spectral density matrix subtraction,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 320–324, 2015.
 [4] H. L. Van Trees, Detection, estimation, and modulation theory. John Wiley & Sons, 2004.
 [5] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via timefrequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
 [6] M. I. Mandel, R. J. Weiss, and D. P. Ellis, “Modelbased expectationmaximization source separation and localization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394, 2010.

 [7] S. Winter, W. Kellermann, H. Sawada, and S. Makino, “MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and l1-norm minimization,” EURASIP Journal on Applied Signal Processing, vol. 2007, no. 1, pp. 81–81, 2007.
 [8] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
 [9] M. Kowalski, E. Vincent, and R. Gribonval, “Beyond the narrowband approximation: Wideband convex methods for underdetermined reverberant audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1818–1829, 2010.
 [10] S. Arberet, P. Vandergheynst, R. E. Carrillo, J.-P. Thiran, and Y. Wiaux, “Sparse reverberant audio source separation via reweighted analysis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1391–1402, 2013.
 [11] M. Miyoshi and Y. Kaneda, “Inverse filtering of room acoustics,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 2, pp. 145–152, 1988.
 [12] M. Kallinger and A. Mertins, “Multichannel room impulse response shaping: a study,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V101–V104, 2006.
 [13] A. Mertins, T. Mei, and M. Kallinger, “Room impulse response shortening/reshaping with infinity- and one-norm optimization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 249–259, 2010.
 [14] I. Kodrasi, S. Goetze, and S. Doclo, “Regularization for partial multichannel equalization for speech dereverberation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1879–1890, 2013.
 [15] I. Kodrasi and S. Doclo, “Joint dereverberation and noise reduction based on acoustic multichannel equalization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 680–693, 2016.
 [16] T. Hikichi, M. Delcroix, and M. Miyoshi, “Inverse filtering for speech dereverberation less sensitive to noise and room transfer function fluctuations,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, pp. 1–12, 2007.
 [17] Y. Huang, J. Benesty, and J. Chen, “A blind channel identificationbased twostage approach to separation and dereverberation of speech signals in a reverberant environment,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 882–895, 2005.
 [18] H. Yamada, H. Wang, and F. Itakura, “Recovering of broadband reverberant speech signal by subband MINT method,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 969–972, 1991.
 [19] H. Wang and F. Itakura, “Realization of acoustic inverse filtering through multimicrophone subband processing,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 75, no. 11, pp. 1474–1483, 1992.
 [20] S. Weiss, G. W. Rice, and R. W. Stewart, “Multichannel equalization in subbands,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 203–206, 1999.
 [21] N. D. Gaubitch and P. A. Naylor, “Equalization of multichannel acoustic systems in oversampled subbands,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1061–1070, 2009.
 [22] F. Lim and P. A. Naylor, “Robust speech dereverberation using subband multichannel least squares with variable relaxation,” in European Signal Processing Conference (EUSIPCO), 2013.
 [23] Y. Avargel and I. Cohen, “System identification in the shorttime Fourier transform domain with crossband filtering,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1305–1319, 2007.
 [24] R. Talmon, I. Cohen, and S. Gannot, “Relative transfer function identification using convolutive transfer function approximation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 546–555, 2009.
 [25] R. Talmon, I. Cohen, and S. Gannot, “Convolutive transfer function generalized sidelobe canceler,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 7, pp. 1420–1434, 2009.
 [26] X. Li, L. Girin, R. Horaud, and S. Gannot, “Estimation of the directpath relative transfer function for supervised soundsource localization,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 2171–2186, 2016.
 [27] X. Li, L. Girin, R. Horaud, and S. Gannot, “Multiplespeaker localization based on directpath features and likelihood maximization with spatial sparsity regularization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1997–2012, 2017.
 [28] X. Li, L. Girin, and R. Horaud, “Audio source separation based on convolutive transfer function and frequencydomain lasso optimization,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
 [29] S. Leglaive, R. Badeau, and G. Richard, “Multichannel audio source separation: variational inference of timefrequency sources from timedomain observations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
 [30] S. Leglaive, R. Badeau, and G. Richard, “Separating timefrequency sources from timedomain convolutive mixtures using nonnegative matrix factorization,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
 [31] R. Badeau and M. D. Plumbley, “Multichannel highresolution NMF for modeling convolutive mixtures of nonstationary signals in the timefrequency domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 11, pp. 1670–1680, 2014.

 [32] B. Schwartz, S. Gannot, and E. A. Habets, “Online speech dereverberation using Kalman filter and EM algorithm,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 2, pp. 394–406, 2015.
 [33] X. Li, L. Girin, and R. Horaud, “An EM algorithm for audio source separation based on the convolutive transfer function,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.
 [34] P. L. Combettes and J.-C. Pesquet, “A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery,” IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 564–574, 2007.
 [35] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forwardbackward splitting,” Multiscale Modeling & Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
 [36] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, “Speech distortion weighted multichannel wiener filtering techniques for noise reduction,” Speech enhancement, pp. 199–228, 2005.
 [37] X. Li, R. Horaud, and S. Gannot, “Blind multichannel identification and equalization for dereverberation and noise reduction based on convolutive transfer function,” CoRR, vol. abs/1706.03652, 2017.
 [38] X. Li, L. Girin, S. Gannot, and R. Horaud, “Nonstationary noise power spectral density estimation based on regional statistics,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 181–185, 2016.
 [39] C. Forbes, M. Evans, N. Hastings, and B. Peacock, “Erlang distribution,” Statistical Distributions, Fourth Edition, pp. 84–85, 2010.
 [40] R. T. Rockafellar, Convex Analysis. Princeton University Press, 2015.
 [41] M. J. Fadili and J.L. Starck, “Monotone operator splitting for optimization problems in sparse recovery,” in IEEE International Conference on Image Processing, pp. 1461–1464, 2009.
 [42] Y. Nesterov, “Gradient methods for minimizing composite objective function,” tech. rep., International Association for Research and Teaching, 2007.
 [43] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
 [44] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in International Workshop on Acoustic Signal Enhancement, pp. 313–317, 2014.
 [45] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “Getting started with the DARPA TIMIT CDROM: An acoustic phonetic continuous speech database,” National Institute of Standards and Technology (NIST), Gaithersburgh, MD, vol. 107, 1988.
 [46] D. R. Morgan, J. Benesty, and M. M. Sondhi, “On the evaluation of estimated impulse responses,” IEEE Signal Processing Letters, vol. 5, no. 7, pp. 174–176, 1998.
 [47] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
 [48] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 749–752, 2001.
 [49] I. Cohen, S. Gannot, and B. Berdugo, “An integrated realtime beamforming and postfiltering system for nonstationary noise environments,” EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 1064–1073, 2003.
 [50] S. Gannot and I. Cohen, “Speech enhancement based on the general transfer function GSC and postfiltering,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 6, pp. 561–571, 2004.