A unified convolutional beamformer for simultaneous denoising and dereverberation

12/20/2018
by Tomohiro Nakatani, et al.

This paper proposes a method for estimating a convolutional beamformer that can perform denoising and dereverberation simultaneously in an optimal way. The application of dereverberation based on weighted prediction error (WPE) followed by denoising based on a minimum variance distortionless response (MVDR) beamformer has conventionally been considered a promising approach; however, the optimality of this approach is not guaranteed. To realize the optimal integration of denoising and dereverberation, we present a method that unifies WPE and a variant of MVDR, namely a minimum power distortionless response (MPDR) beamformer, into a single convolutional beamformer, and we optimize it based on a single unified optimization criterion. The proposed beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer. Experiments show that the proposed method substantially improves the speech enhancement performance in terms of both objective speech enhancement measures and automatic speech recognition (ASR) performance.


I Introduction

When a speech signal is captured by distant microphones, e.g., in a conference room, it will inevitably contain additive noise and reverberation components. These components are detrimental to the perceived quality of the observed speech signal and often cause serious degradation in many applications such as hands-free teleconferencing and ASR.

Microphone array signal processing techniques have been developed to minimize the aforementioned detrimental effects by reducing the noise and the reverberation in the acquired signal. A filter-and-sum beamformer [1], MVDR and MPDR [2, 3, 4, 5, 6], and a maximum signal-to-noise ratio beamformer [7, 8, 9] are widely-used techniques for denoising, while WPE and its variants [10, 11, 12, 13, 14] are emerging techniques for dereverberation. The usefulness of these techniques, particularly for improving ASR performance, has been extensively studied, e.g., at the REVERB challenge [15] and the CHiME-3/4/5 challenges [16, 17, 18]. Advances in this technological area have led to recent progress in commercial devices with far-field ASR capability, such as smart speakers [19, 20, 21].

It is, however, still challenging to reduce both noise and reverberation simultaneously in an optimal way. For example, researchers have proposed using WPE and MVDR in a cascade manner [22, 23], where the signal is first dereverberated by WPE and then denoised with MVDR. With this approach, dereverberation may not be optimally performed due to the influence of the noise, and denoising may be disturbed by the remaining reverberation. Certain joint optimization techniques have also been proposed [24, 25, 26]; however, they still perform dereverberation and denoising separately, which makes the optimality of the integration unclear, resulting in only a marginal performance improvement compared with the cascade system.

To achieve optimal integration, this paper proposes a method for unifying WPE and MPDR into a single convolutional beamformer and for optimizing the beamformer based on a single unified optimization criterion. We can derive a globally optimal closed-form solution for this beamformer, assuming that the time-varying power and steering vector of the desired signal are given. The optimality of the beamformer is guaranteed under the assumed condition. The proposed beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer because it estimates the filter coefficients as ones that minimize the average power of the enhanced signal weighted by the inverse of the time-varying power of the desired signal, under a distortionless constraint determined based on the steering vector. Note that the signal power and the steering vector must also be given for WPE and MPDR, respectively, and several useful techniques for their estimation have already been proposed.

In experiments, we compare the proposed method with WPE, MPDR, and both approaches in a cascade configuration in terms of objective speech enhancement measures and ASR performance. The experiments show that the proposed method substantially outperforms all the conventional methods on almost all the performance metrics. For example, in comparison with the cascade system, the proposed method achieves an average relative word error rate reduction of 7.5% for real data taken from the REVERB Challenge dataset.

In the remainder of this paper, we define the signal model in Section II, and then overview three conventional methods, WPE, MPDR, and the two approaches in a cascade configuration, in Section III. Then, Section IV describes our proposed beamformer. The experimental results and concluding remarks are given in Sections V and VI, respectively.

II Signal model

Assume that a single speech signal is captured by $M$ microphones in a noisy reverberant environment. Letting $(\cdot)^{\top}$ and $(\cdot)^{\mathsf{H}}$ denote non-conjugate and conjugate transpose, respectively, the captured signal in the short-time Fourier transform domain is approximately modeled at each frequency bin by

$\mathbf{x}_t = \sum_{\tau=0}^{\infty} \mathbf{a}_{\tau} s_{t-\tau} + \mathbf{n}_t,$   (1)

where $\mathbf{x}_t \in \mathbb{C}^{M}$ is a column vector containing all the microphone signals at a time frame $t$. In this paper, the frequency indices of the symbols are omitted for brevity, on the assumption that each frequency bin is processed independently in the same way. $s_t$ is a clean speech signal, $\mathbf{a}_{\tau}$ for $\tau \ge 0$ is a set of column vectors containing convolutional acoustic transfer functions from the speaker location to all the microphones, and $\mathbf{n}_t$ is the additive noise.

The first term in eq. (1) can be further decomposed into two parts, one composed of the direct signal and early reflections, and the other corresponding to the late reverberation [27]. With this decomposition, eq. (1) is rewritten as

$\mathbf{x}_t = \mathbf{d}_t + \mathbf{r}_t + \mathbf{n}_t,$   (2)
$\mathbf{d}_t = \sum_{\tau=0}^{\Delta-1} \mathbf{a}_{\tau} s_{t-\tau},$   (3)
$\mathbf{r}_t = \sum_{\tau=\Delta}^{\infty} \mathbf{a}_{\tau} s_{t-\tau},$   (4)

where $\Delta$ is a time index that separates the reverberation into the two parts. The goal of the speech enhancement is to reduce the late reverberation $\mathbf{r}_t$ and the noise $\mathbf{n}_t$ from the captured signal while preserving $\mathbf{d}_t$, hereafter referred to as a desired signal.
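To make the notation concrete, the following NumPy sketch simulates eqs. (1)-(4) at a single frequency bin. All sizes (number of microphones, transfer-function length, noise level) and the randomly drawn transfer functions are illustrative assumptions, not values from the paper.

    # Minimal simulation of the signal model in eqs. (1)-(4) at one frequency bin.
    import numpy as np

    rng = np.random.default_rng(0)
    M, T = 8, 500        # microphones, time frames (assumed values)
    K, delta = 20, 3     # transfer-function length, split index Delta (assumed)

    s = rng.standard_normal(T) + 1j * rng.standard_normal(T)  # clean speech s_t
    # Decaying convolutional transfer functions a_tau (one M-vector per tap).
    a = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) \
        / np.arange(1, K + 1)[:, None]
    n = 0.1 * (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M)))

    d = np.zeros((T, M), complex)   # desired signal (direct + early), eq. (3)
    r = np.zeros((T, M), complex)   # late reverberation, eq. (4)
    for t in range(T):
        for tau in range(min(K, t + 1)):
            part = a[tau] * s[t - tau]
            if tau < delta:
                d[t] += part
            else:
                r[t] += part
    x = d + r + n                   # observed signal, eq. (2)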

III Conventional methods

This section gives a brief overview of the conventional methods, including WPE, MPDR, and the two approaches in a cascade configuration.

III-A Dereverberation by WPE

If we disregard the additive noise, $\mathbf{n}_t$, we can rewrite eq. (1) using a multichannel autoregressive model [28, 29, 10] as

$\mathbf{x}_t = \mathbf{d}_t + \sum_{\tau=\Delta}^{\Delta+L-1} \mathbf{W}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau},$   (5)

where $\mathbf{W}_{\tau}$ for $\Delta \le \tau \le \Delta+L-1$ are $M \times M$ dimensional matrices containing coefficients that predict the current captured signal, $\mathbf{x}_t$, from the past captured signals, $\mathbf{x}_{t-\tau}$ for $\Delta \le \tau \le \Delta+L-1$. The prediction error $\mathbf{d}_t$ corresponds to the desired signal in the above equation.

WPE estimates the prediction coefficients based on maximum likelihood estimation, assuming that the desired signal at each microphone follows a time-varying complex Gaussian distribution with a mean of 0 and a time-varying variance, $\sigma_t^2$, which corresponds to the time-varying power of the desired signal. Then, the prediction coefficients, $\mathbf{W}_{\tau}$, are estimated as those that minimize the average power of the prediction error weighted by the inverse of $\sigma_t^2$. The estimation is represented by

$\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_t \frac{1}{\sigma_t^2} \left\| \mathbf{x}_t - \sum_{\tau=\Delta}^{\Delta+L-1} \mathbf{W}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau} \right\|^2,$   (6)

where $\|\mathbf{y}\|^2 = \mathbf{y}^{\mathsf{H}}\mathbf{y}$ is the squared norm of a vector $\mathbf{y}$. It is known that the prediction delay $\Delta$ also works as a distortionless constraint to prevent the desired signal components from being distorted by the dereverberation [10]. As for the estimation of $\sigma_t^2$, several useful techniques have been proposed, such as an iterative estimation method.

With the estimated prediction coefficients, the dereverberation is performed by

$\hat{\mathbf{d}}_t = \mathbf{x}_t - \sum_{\tau=\Delta}^{\Delta+L-1} \hat{\mathbf{W}}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau}.$   (7)

It was experimentally confirmed that WPE can function robustly even in noisy environments, reducing the late reverberation with only a slight increase in the noise [10].
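As a concrete illustration of eqs. (6)-(7), here is a minimal NumPy sketch of the closed-form weighted least-squares solution, assuming the observed signal at one frequency bin is given as a (T, M) array and the time-varying power sigma2 is known; the default tap settings are illustrative assumptions, not the paper's values.

    # WPE sketch: solve the weighted normal equations of eq. (6) and apply eq. (7).
    import numpy as np

    def wpe_dereverberate(x, sigma2, delta=3, L=10, eps=1e-8):
        T, M = x.shape                              # assumes T > delta + L
        # Stack the delayed observations x_{t-tau}, tau = delta..delta+L-1.
        X = np.zeros((T, M * L), complex)
        for i, tau in enumerate(range(delta, delta + L)):
            X[tau:, i * M:(i + 1) * M] = x[:T - tau]
        w = 1.0 / np.maximum(sigma2, eps)           # weights 1 / sigma_t^2
        R = (X * w[:, None]).conj().T @ X           # weighted covariance of the past
        P = (X * w[:, None]).conj().T @ x           # weighted correlation with x_t
        G = np.linalg.solve(R + eps * np.eye(M * L), P)  # prediction coefficients
        return x - X @ G                            # prediction error = desired signal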

III-B Beamforming by MPDR

Assuming that the transfer function corresponding to the desired signal can be approximated by a product of a vector $\mathbf{v}$ with the clean speech signal, i.e., $\mathbf{d}_t = \mathbf{v} s_t$, and taking the late reverberation, $\mathbf{r}_t$, as a part of the noise, $\mathbf{n}'_t = \mathbf{r}_t + \mathbf{n}_t$, eq. (2) becomes

$\mathbf{x}_t = \mathbf{v} s_t + \mathbf{n}'_t.$   (8)

MPDR is defined as a vector, $\mathbf{w}$, that minimizes the average power of the beamformed captured signal, $\mathbf{w}^{\mathsf{H}} \mathbf{x}_t$, under a distortionless constraint, $\mathbf{w}^{\mathsf{H}} \mathbf{v} = 1$, that preserves the clean speech, $s_t$, unchanged by the beamforming [2, 3]. Here, $\mathbf{v}$ is also termed a steering vector, and techniques to estimate it from a captured signal have been proposed. Due to scale ambiguity in the steering vector estimation, a relative transfer function (RTF) [30] is in practice substituted for it. An RTF is defined as the steering vector normalized by its value at a reference channel, calculated by $\tilde{\mathbf{v}} = \mathbf{v} / v_{\mathrm{ref}}$, where $v_{\mathrm{ref}}$ denotes the value of $\mathbf{v}$ at the reference channel. This makes the distortionless constraint work to keep the desired signal at the reference channel, $d_{t,\mathrm{ref}} = v_{\mathrm{ref}} s_t$, unchanged by the beamforming.

The beamformer is estimated as follows:

$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \sum_t \left| \mathbf{w}^{\mathsf{H}} \mathbf{x}_t \right|^2 \ \ \text{subject to} \ \ \mathbf{w}^{\mathsf{H}} \tilde{\mathbf{v}} = 1.$   (9)

Then, the desired signal is estimated as

$\hat{s}_t = \hat{\mathbf{w}}^{\mathsf{H}} \mathbf{x}_t.$   (10)

With MPDR, the resultant signal is composed of only one channel signal corresponding to the reference channel.

According to the above interpretation, MPDR can perform both denoising and dereverberation [31] by reducing $\mathbf{n}'_t$, which contains the additive noise and the late reverberation. However, the dereverberation capability of this beamformer is limited because it cannot reduce reverberation components that come from the target speaker direction, especially when there are few microphones.
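The constrained minimization in eq. (9) has the well-known closed form $\hat{\mathbf{w}} = R^{-1}\tilde{\mathbf{v}} / (\tilde{\mathbf{v}}^{\mathsf{H}} R^{-1} \tilde{\mathbf{v}})$, with $R$ the spatial covariance of the captured signal. A minimal NumPy sketch, assuming the RTF v_rtf has been estimated elsewhere:

    # MPDR sketch: closed-form solution of eq. (9), then eq. (10).
    import numpy as np

    def mpdr_beamform(x, v_rtf, eps=1e-8):
        T, M = x.shape
        R = x.conj().T @ x / T + eps * np.eye(M)  # spatial covariance of x_t
        Rinv_v = np.linalg.solve(R, v_rtf)
        w = Rinv_v / (v_rtf.conj() @ Rinv_v)      # w = R^{-1} v / (v^H R^{-1} v)
        return x @ w.conj()                       # \hat{s}_t = w^H x_t per frame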

III-C Cascade of WPE and MPDR

To achieve better speech enhancement in noisy reverberant environments, researchers have also proposed using both WPE and MPDR in a cascade configuration [22]. Because WPE can dereverberate all the microphone signals individually, MPDR can be applied to the signals after WPE has been applied. Techniques have also been proposed to estimate the steering vector and the power of the desired signal, for example, by iteratively and alternately applying WPE and MPDR to the captured signal [25].
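Under the same assumptions as the sketches above (known time-varying power sigma2 and RTF v_rtf), the cascade then amounts to two calls:

    # Cascade configuration: dereverberate all channels with WPE, then denoise
    # the multichannel WPE output with MPDR (sigma2 and v_rtf assumed given).
    d_wpe = wpe_dereverberate(x, sigma2)
    s_hat = mpdr_beamform(d_wpe, v_rtf)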

IV Proposed method

This section describes a method for unifying WPE and MPDR into a single convolutional beamformer. A globally optimal closed-form solution can be obtained for the beamformer given the steering vector and the time-varying power of the desired signal, and thus we can perform more effective speech enhancement than with a simple cascade of WPE and MPDR. Figure 1 illustrates the processing flow of the method.

IV-A Convolutional beamforming by WPD

First, the signal obtained using the cascade consisting of WPE and MPDR, i.e., eqs. (7) and (10), can be rewritten as

$\hat{s}_t = \hat{\mathbf{w}}^{\mathsf{H}} \left( \mathbf{x}_t - \sum_{\tau=\Delta}^{\Delta+L-1} \hat{\mathbf{W}}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau} \right)$   (11)
$\phantom{\hat{s}_t} = \bar{\mathbf{w}}_0^{\mathsf{H}} \mathbf{x}_t + \sum_{\tau=\Delta}^{\Delta+L-1} \bar{\mathbf{w}}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau}$   (12)
$\phantom{\hat{s}_t} = \bar{\mathbf{w}}^{\mathsf{H}} \bar{\mathbf{x}}_t,$   (13)

where we set $\bar{\mathbf{w}}_0 = \hat{\mathbf{w}}$ and $\bar{\mathbf{w}}_{\tau} = -\hat{\mathbf{W}}_{\tau} \hat{\mathbf{w}}$ for $\tau \ge \Delta$ to obtain the above second line, and we set $\bar{\mathbf{w}} = [\bar{\mathbf{w}}_0^{\top}, \bar{\mathbf{w}}_{\Delta}^{\top}, \ldots, \bar{\mathbf{w}}_{\Delta+L-1}^{\top}]^{\top}$ and $\bar{\mathbf{x}}_t = [\mathbf{x}_t^{\top}, \mathbf{x}_{t-\Delta}^{\top}, \ldots, \mathbf{x}_{t-\Delta-L+1}^{\top}]^{\top}$ to obtain the third line. Note that $\bar{\mathbf{x}}_t$ and $\bar{\mathbf{w}}$ contain a time gap between their first and second elements, corresponding to the prediction delay $\Delta$ in eq. (7).

Next, the optimization criterion is defined based on the model of the desired speech used for WPE, namely the time-varying Gaussian distribution, and based on the distortionless constraint used for MPDR. Specifically, we estimate the convolutional filter, $\bar{\mathbf{w}}$, as one that minimizes the average weighted power of the beamformed signal under a distortionless constraint. It is represented by

$\hat{\bar{\mathbf{w}}} = \operatorname*{argmin}_{\bar{\mathbf{w}}} \sum_t \frac{\left| \bar{\mathbf{w}}^{\mathsf{H}} \bar{\mathbf{x}}_t \right|^2}{\sigma_t^2} \ \ \text{subject to} \ \ \bar{\mathbf{w}}_0^{\mathsf{H}} \tilde{\mathbf{v}} = 1.$   (14)

Here, all the filter coefficients are optimized based on the average weighted power minimization criterion, although the conventional MPDR does not use this criterion. This modification is important for unifying the two methods into a single form. Note that the use of the time-varying weight makes the distribution of the enhanced speech obtained by beamforming closer to that of the desired speech.

Eq. (14) can be viewed as a variation of eq. (9) used for the conventional MPDR. Unlike eq. (9), eq. (14) evaluates the average weighted power of the signal, and considers not only the spatial covariance but also the temporal covariance. The solution is obtained as follows:

$\hat{\bar{\mathbf{w}}} = \frac{\bar{R}^{-1} \bar{\mathbf{v}}}{\bar{\mathbf{v}}^{\mathsf{H}} \bar{R}^{-1} \bar{\mathbf{v}}},$   (15)

where $\bar{\mathbf{v}} = [\tilde{\mathbf{v}}^{\top}, 0, \ldots, 0]^{\top}$ is a column vector containing $\tilde{\mathbf{v}}$ followed by $ML$ zeros, and $\bar{R}$ is a power-normalized temporal-spatial covariance matrix with a prediction delay, which is defined as

$\bar{R} = \sum_t \frac{\bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^{\mathsf{H}}}{\sigma_t^2}.$   (16)

Finally, with the estimated convolutional filter, $\hat{\bar{\mathbf{w}}}$, the target speech is estimated as

$\hat{s}_t = \hat{\bar{\mathbf{w}}}^{\mathsf{H}} \bar{\mathbf{x}}_t.$   (17)
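The following NumPy sketch puts eqs. (13)-(17) together: it stacks the current frame with the delayed frames (keeping the gap of $\Delta$ frames), forms the power-normalized temporal-spatial covariance of eq. (16), and applies eq. (15). As before, the tap settings are illustrative, and sigma2 and the RTF v_rtf are assumed given.

    # WPD sketch: convolutional beamformer with a prediction delay, eqs. (15)-(17).
    import numpy as np

    def wpd_beamform(x, v_rtf, sigma2, delta=3, L=10, eps=1e-8):
        T, M = x.shape
        taps = [0] + list(range(delta, delta + L))   # note the gap between 0 and Delta
        Xbar = np.zeros((T, M * len(taps)), complex) # stacked \bar{x}_t
        for i, tau in enumerate(taps):
            Xbar[tau:, i * M:(i + 1) * M] = x[:T - tau]
        w = 1.0 / np.maximum(sigma2, eps)
        R = (Xbar * w[:, None]).conj().T @ Xbar      # eq. (16)
        vbar = np.zeros(M * len(taps), complex)      # \bar{v}: v_rtf followed by zeros
        vbar[:M] = v_rtf
        Rinv_v = np.linalg.solve(R + eps * np.eye(len(vbar)), vbar)
        wbar = Rinv_v / (vbar.conj() @ Rinv_v)       # eq. (15)
        return Xbar @ wbar.conj()                    # eq. (17): \bar{w}^H \bar{x}_t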

Interestingly, the same solution can be derived for the proposed method even when we concatenate MPDR and WPE in reverse order. The signal obtained in this case is represented by

$\hat{s}_t = \hat{\mathbf{w}}^{\mathsf{H}} \mathbf{x}_t - \sum_{\tau=\Delta}^{\Delta+L-1} \mathbf{g}_{\tau}^{\mathsf{H}} \mathbf{W}^{\mathsf{H}} \mathbf{x}_{t-\tau},$   (18)

where $\hat{\mathbf{w}}$ is the MPDR beamformer applied to $\mathbf{x}_t$, $\mathbf{W}$ is an arbitrary denoising matrix that contains $\hat{\mathbf{w}}$ in its first column, and $\mathbf{g}_{\tau}$ is a coefficient vector that predicts the current denoised signal, $\hat{\mathbf{w}}^{\mathsf{H}} \mathbf{x}_t$, from the past denoised signals, $\mathbf{W}^{\mathsf{H}} \mathbf{x}_{t-\tau}$. Then, eq. (12) is obtained by setting $\bar{\mathbf{w}}_{\tau} = -\mathbf{W} \mathbf{g}_{\tau}$ for $\tau \ge \Delta$, and optimized in the same way as discussed above.

It is also worth noting that we can derive a different solution for the proposed method by adopting the average power of the signal with no power weighting (and possibly with prewhitening of the captured signal [29, 32]) as its minimization criterion. However, this solution does not work well for speech enhancement in noisy reverberant environments, just as WPE does not work well with such a criterion [10].

Fig. 1: Processing flow of WPD (proposed method).

V Experiments

V-A Dataset and evaluation metrics

We evaluated the performance of the proposed method using the REVERB Challenge dataset [15]. The dataset is composed of a training set, a development set (dev set), and an evaluation set (eval set). While the dev and eval sets are composed of simulated data (SimData) and real recordings (RealData), the training set is composed only of simulated data. Each utterance in the dataset contains reverberant speech uttered by a speaker and stationary additive noise. The distance between the speaker and the microphone array is varied from 0.5 m to 2.5 m. For SimData, the reverberation time is varied from about 0.25 s to 0.7 s, and the signal-to-noise ratio (SNR) is set at about 20 dB.

Evaluation metrics prepared for the challenge were used in the experiments. As objective measures for evaluating speech enhancement performance [33], we used the cepstrum distance (CD), the log likelihood ratio (LLR), the frequency-weighted segmental SNR (FWSSNR), and the speech-to-reverberation modulation energy ratio (SRMR) [34]. To evaluate the enhanced speech in terms of ASR performance, we used a baseline ASR system recently developed using Kaldi [35]. This is a fairly competitive system composed of a TDNN acoustic model trained using lattice-free MMI and online i-vector extraction, and a tri-gram language model.

V-B Methods to be compared and analysis conditions

We compared WPD (Proposed) with WPE, MPDR, and WPE followed by MPDR (WPE+MPDR). For all the methods, a Hann window was used for short-time analysis, with the frame length and the shift set at 32 ms and 8 ms, respectively. The sampling frequency was 16 kHz, and the same microphone array was used for all the experiments. For WPE, WPE+MPDR, and the proposed method, a common prediction delay $\Delta$ was used, and the length of the prediction filter $L$ was set at a different value for each of five frequency ranges between 0 and 8 kHz.

As for the time-varying power, $\sigma_t^2$, and the steering vector, $\tilde{\mathbf{v}}$, of the target speech, we estimated them from the captured signal based on a method used in [25], and used the same estimates for all the methods. Figure 2 shows the processing flow of this estimation. Adopting the power of the captured signal as the initial value of $\sigma_t^2$, we repeatedly applied WPE+MPDR to the captured signal, and updated the steering vector and the power of the signal using the outputs of WPE and MPDR, respectively. The number of iterations was set at two for this estimation. The steering vector was estimated based on the generalized eigenvalue decomposition with covariance whitening [36, 37], assuming that each utterance has noise-only periods of 225 ms and 75 ms, respectively, at its beginning and ending parts.

Fig. 2: Processing flow for estimating $\sigma_t^2$ and $\tilde{\mathbf{v}}$ by iterating WPE+MPDR.
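A rough sketch of this loop, reusing the WPE and MPDR sketches above; estimate_rtf() stands in for the covariance-whitening steering-vector estimator of [36, 37] and is a hypothetical helper, not code from the paper.

    # Iterative estimation of sigma_t^2 and the RTF (cf. Fig. 2); x is the
    # (T, M) captured signal at one frequency bin.
    import numpy as np

    sigma2 = np.mean(np.abs(x) ** 2, axis=1)    # initialize with the captured power
    for _ in range(2):                          # two iterations, as in Sec. V-B
        d_wpe = wpe_dereverberate(x, sigma2)    # WPE output -> update steering vector
        v_rtf = estimate_rtf(d_wpe)             # hypothetical RTF estimator [36, 37]
        s_hat = mpdr_beamform(d_wpe, v_rtf)     # MPDR output -> update signal power
        sigma2 = np.abs(s_hat) ** 2             # time-varying power of desired signal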

V-C Evaluation by objective speech enhancement measures

           SimData                       RealData
           CD     SRMR   LLR    FWSSNR  SRMR
No Enh     3.97   3.68   0.58   3.62    3.18
WPE        3.76   4.77   0.53   4.99    5.00
MPDR       3.67   4.50   0.54   4.66    4.82
WPE+MPDR   3.01   5.37*  0.48   7.52    6.57
Proposed   2.64*  5.34   0.39*  8.18*   6.64*

TABLE I: Audible quality of enhanced speech evaluated using the REVERB Challenge eval set. No Enh means no speech enhancement. An asterisk (*) marks the best score for each metric.

Table I summarizes the evaluation results we obtained using the objective speech enhancement measures. First, all the methods improved all the measures relative to no enhancement. In addition, WPE+MPDR greatly outperformed WPE and MPDR, while the proposed method further outperformed WPE+MPDR for all the metrics except for SRMR on SimData. These results clearly show the superiority of our proposed method.

V-D Evaluation using ASR

Table II shows the word error rates (WERs) obtained using the baseline ASR system. First, all the methods reduced the WERs under all the conditions except for the SimData-Room1-Near condition. Furthermore, the proposed method greatly outperformed WPE, MPDR, and WPE+MPDR under all the conditions. These results again clearly indicate the superiority of the proposed method. Note that although it was hard to reduce the WERs under the SimData-Room1-Near condition as it contains very little noise and reverberation, the increase in the WER when using the proposed method was the smallest among the compared methods.

SimData
           Room1         Room2         Room3         Ave
           Near   Far    Near   Far    Near   Far
No Enh     3.13*  3.94   4.71   7.31   4.70   7.50   4.35
WPE        3.20   3.47   4.64   5.91   4.28   5.32   4.47
MPDR       3.37   3.98   3.89   4.77   4.18   5.21   4.23
WPE+MPDR   3.32   3.76   4.10   4.61   4.57   5.71   4.35
Proposed   3.20   3.45*  3.87*  4.22*  3.75*  4.19*  3.78*

RealData
           Near    Far     Average
No Enh     17.53   19.68   18.61
WPE        12.33   13.88   13.11
MPDR       10.60   13.81   12.20
WPE+MPDR    8.75   11.31   10.03
Proposed    7.86*  10.67*   9.27*

TABLE II: Word error rate (WER) in % evaluated using the REVERB Challenge eval set. No Enh means no speech enhancement. An asterisk (*) marks the best score for each condition.

VI Concluding remarks

This paper presented a method for unifying WPE and MPDR that enables denoising and dereverberation to be performed both optimally and simultaneously based on microphone array signal processing. A convolutional beamformer, referred to as WPD, was derived and shown to improve the speech enhancement performance in noisy reverberant environments, in terms of both the objective speech enhancement measures and WERs, in comparison with conventional methods, including WPE, MPDR, and WPE+MPDR. Future work will include a comprehensive evaluation of the proposed method in various noisy reverberant environments, the introduction of different optimization criteria, and the extension of the proposed method to online processing.

References

  • [1] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP, vol. 15, no. 7, pp. 2011–2022, 2007.
  • [2] H. L. Van Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, Wiley-Interscience, New York, 2002.
  • [3] H. Cox, “Resolving power and sensitivity to mismatch of optimum array processors,” The Journal of the Acoustical Society of America, vol. 54, pp. 771–785, 1973.
  • [4] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
  • [5] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” Proc. Interspeech, pp. 1981–1985, 2016.
  • [6] S. Emura, S. Araki, T. Nakatani, and N. Harada, “Distortionless beamforming optimized with ℓ1-norm minimization,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 936–940, 2018.
  • [7] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, 2007.
  • [8] S. Araki, H. Sawada, and S. Makino, “Blind speech separation in a meeting situation with maximum SNR beamformer,” Proc. IEEE ICASSP, pp. 41–44, 2007.
  • [9] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “BeamNet: end-to-end training of a beamformer-supported multi-channel ASR system,” Proc. IEEE ICASSP, pp. 5235–5239, 2017.
  • [10] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
  • [11] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
  • [12] T. Yoshioka, H. Tachibana, T. Nakatani, and M. Miyoshi, “Adaptive dereverberation of speech signals with speaker-position change detection,” Proc. IEEE ICASSP, pp. 3733–3736, 2009.
  • [13] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Multi-channel linear prediction-based speech dereverberation with sparse priors,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 9, pp. 1509–1520, 2015.
  • [14] D. Giacobello and T. L. Jensen, “Speech dereverberation based on convex optimization algorithms for group sparse linear prediction,” Proc. IEEE ICASSP, pp. 446–450, 2018.
  • [15] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, doi:10.1186/s13634-016-0306-6, 2016.
  • [16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” Proc. IEEE ASRU-2015, pp. 504–511, 2015.
  • [17] E. Vincent, S. Watanabe, J. Barker, and R. Marxer, “CHiME4 Challenge,” http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/.
  • [18] J. Barker, S. Watanabe, and E. Vincent, “CHiME5 Challenge,” http://spandh.dcs.shef.ac.uk/chime_challenge/.
  • [19] B. Li, T. N. Sainath, J. Caroselli, A. Narayanan, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon, “Acoustic modeling for Google home,” Proc. Interspeech, 2017.
  • [20] Audio Software Engineering and Siri Speech Team, “Optimizing Siri on HomePod in far-field settings,” Apple Machine Learning Journal, vol. 1, no. 12, 2018.
  • [21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. Seltzer, and M. Souden, “Speech processing for digital home assistants,” submitted to IEEE Signal Processing Magazine, 2019.
  • [22] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, and T. Nakatani, “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal on Advances in Signal Processing, Article ID 2015:60, doi:10.1186/s13634-015-0245-7, 2015.
  • [23] W. Yang, G. Huang, W. Zhang, J. Chen, and J. Benesty, “Dereverberation with differential microphone arrays and the weighted-prediction-error method,” Proc. IWAENC, 2018.
  • [24] M. Togami, “Multichannel online speech dereverberation under noisy environments,” Proc. EUSIPCO, pp. 1078–1082, 2015.
  • [25] L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, “Integrating neural network based beamforming and weighted prediction error dereverberation,” Proc. Interspeech, pp. 3043–3047, 2018.
  • [26] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, “Joint multi-microphone speech dereverberation and noise reduction using integrated sidelobe cancellation and linear prediction,” Proc. IWAENC, 2018.
  • [27] J. S. Bradley, H. Sato, and M. Picard, “On the importance of early reflections for speech in rooms,” The Journal of the Acoustical Society of America, vol. 113, pp. 3233–3244, 2003.
  • [28] K. Abed-Meraim and P. Loubaton, “Prediction error method for second-order blind identification,” IEEE Trans. on Signal Processing, vol. 45, no. 3, pp. 694–705, 1997.
  • [29] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, “Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 534–545, 2009.
  • [30] I. Cohen, “Relative transfer function identification using speech signals,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.
  • [31] J. Heymann, L. Drude, and R. Haeb-Umbach, “A generic neural acoustic beamforming architecture for robust multi-channel speech processing,” Computer Speech and Language, vol. 46, pp. 374–385, 2017.
  • [32] T. Dietzen, A. Spriet, W. Tirry, S. Doclo, M. Moonen, and T. van Waterschoot, “On the relation between data-dependent beamforming and multichannel linear prediction for dereverberation,” Proc. AES 60th International Conference, pp. 1–8, 2016.
  • [33] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE T-ASLP, vol. 16, no. 1, pp. 229–238, 2008.
  • [34] T. H. Falk, C. Zheng, and W.-Y. Chan, “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech,” IEEE T-ASLP, vol. 18, no. 7, pp. 1766–1774, 2010.
  • [35] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” Proc. IEEE ASRU, 2011.
  • [36] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” Proc. IEEE ICASSP, pp. 681–685, 2017.
  • [37] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” Proc. IEEE ICASSP, pp. 544–548, 2015.