I Introduction
When a speech signal is captured by distant microphones, e.g., in a conference room, it inevitably contains additive noise and reverberation components. These components are detrimental to the perceived quality of the observed speech signal and often cause serious degradation in applications such as hands-free teleconferencing and automatic speech recognition (ASR).
Microphone array signal processing techniques have been developed to minimize these detrimental effects by reducing the noise and the reverberation in the acquired signal. The filter-and-sum beamformer [1], the minimum variance distortionless response (MVDR) and minimum power distortionless response (MPDR) beamformers [2, 3, 4, 5, 6], and the maximum signal-to-noise ratio beamformer [7, 8, 9] are widely used techniques for denoising, while weighted prediction error (WPE) dereverberation and its variants [10, 11, 12, 13, 14] are emerging techniques for dereverberation. The usefulness of these techniques, particularly for improving ASR performance, has been extensively studied, e.g., in the REVERB challenge [15] and the CHiME-3/4/5 challenges [16, 17, 18]. Advances in this technological area have led to recent progress in commercial devices with far-field ASR capability, such as smart speakers [19, 20, 21].
It is, however, still challenging to reduce both noise and reverberation simultaneously in an optimal way. For example, researchers have proposed using MVDR and WPE in a cascade manner [22, 23], where the signal is first dereverberated by WPE and then denoised with MVDR. With this approach, dereverberation may not be performed optimally due to the influence of the noise, and denoising may be disturbed by the remaining reverberation. Certain joint optimization techniques have also been proposed [24, 25, 26]; however, they still perform dereverberation and denoising separately, which makes the optimality of the integration unclear and results in only a marginal performance improvement over the cascade system.
To achieve optimal integration, this paper proposes a method for unifying WPE and MPDR into a single convolutional beamformer, and for optimizing that beamformer based on a single unified optimization criterion. We can derive a globally optimal closed-form solution for this beamformer, assuming that the time-varying power and the steering vector of the desired signal are given; the optimality of the beamformer is guaranteed under this assumption. The proposed beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer because it estimates the filter coefficients as those that minimize the average power of the enhanced signal, weighted by the time-varying power of the desired signal, under a distortionless constraint determined by the steering vector. Note that the signal power and the steering vector must also be given for WPE and MPDR, respectively, and several useful techniques for their estimation have already been proposed.
In the experiments, we compare the proposed method with WPE, MPDR, and both approaches in a cascade configuration in terms of objective speech enhancement measures and ASR performance. The experiments show that the proposed method substantially outperforms all the conventional methods on almost all the performance metrics. For example, in comparison with the cascade system, the proposed method achieves a relative word error rate reduction of 7.5% on real data taken from the REVERB Challenge dataset.
In the remainder of this paper, we define the signal model in Section II, and then overview three conventional methods, WPE, MPDR, and the two in a cascade configuration, in Section III. Section IV describes the proposed beamformer. The experimental results and concluding remarks are given in Sections V and VI, respectively.
II Signal model
Assume that a single speech signal is captured by $M$ microphones in a noisy reverberant environment. Letting $(\cdot)^\top$ and $(\cdot)^{\mathsf{H}}$ denote non-conjugate and conjugate transpose, respectively, the captured signal in the short-time Fourier transform (STFT) domain is approximately modeled at each frequency bin by

\mathbf{x}_t = \sum_{\tau \ge 0} \mathbf{a}_\tau s_{t-\tau} + \mathbf{n}_t,   (1)

where $\mathbf{x}_t = [x_{t,1}, \ldots, x_{t,M}]^\top$ is a column vector containing all the microphone signals at a time frame $t$. In this paper, the frequency indices of the symbols are omitted for brevity, on the assumption that each frequency bin is processed independently in the same way. $s_t$ is a clean speech signal, $\mathbf{a}_\tau$ for $\tau \ge 0$ is a set of column vectors containing the convolutional acoustic transfer functions from the speaker location to all the microphones, and $\mathbf{n}_t$ is the additive noise.
The first term in eq. (1) can be further decomposed into two parts, one composed of the direct signal and early reflections, and the other corresponding to the late reverberation [27]. With this decomposition, eq. (1) is rewritten as
\mathbf{x}_t = \sum_{\tau=0}^{b-1} \mathbf{a}_\tau s_{t-\tau} + \sum_{\tau \ge b} \mathbf{a}_\tau s_{t-\tau} + \mathbf{n}_t   (2)

= \mathbf{d}_t + \mathbf{r}_t + \mathbf{n}_t,   (3)

\mathbf{d}_t = \sum_{\tau=0}^{b-1} \mathbf{a}_\tau s_{t-\tau}, \qquad \mathbf{r}_t = \sum_{\tau \ge b} \mathbf{a}_\tau s_{t-\tau},   (4)
where $b$ is a time index that separates the reverberation into the two parts. The goal of the speech enhancement is to reduce the late reverberation $\mathbf{r}_t$ and the noise $\mathbf{n}_t$ from the captured signal while preserving $\mathbf{d}_t$, hereafter referred to as the desired signal.
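To make the model concrete, the following NumPy sketch synthesizes one frequency bin of eq. (1) and splits it into the desired and late-reverberation parts of eqs. (2)-(4). All sizes (4 microphones, 200 frames, transfer-function length 12, split index b = 2) are illustrative choices, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, La, b = 4, 200, 12, 2          # mics, frames, ATF length, early/late split

s = rng.standard_normal(T) + 1j * rng.standard_normal(T)                # clean speech (one bin)
a = rng.standard_normal((La, M)) + 1j * rng.standard_normal((La, M))    # convolutional ATFs a_tau
n = 0.1 * (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M)))  # additive noise

# x_t = sum_tau a_tau s_{t-tau} + n_t, split at tau = b into desired + late parts
x = np.zeros((T, M), complex)
d = np.zeros((T, M), complex)   # direct signal + early reflections (desired signal)
r = np.zeros((T, M), complex)   # late reverberation
for t in range(T):
    for tau in range(min(t + 1, La)):
        (d if tau < b else r)[t] += a[tau] * s[t - tau]
    x[t] = d[t] + r[t] + n[t]   # eq. (3): x_t = d_t + r_t + n_t
```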
III Conventional methods
This section gives a brief overview of the conventional methods, including WPE, MPDR, and the two approaches in a cascade configuration.
III-A Dereverberation by WPE
If we disregard the additive noise, $\mathbf{n}_t$, we can rewrite eq. (1) using a multichannel autoregressive model [28, 29, 10] as

\mathbf{x}_t = \sum_{\tau=b}^{b+L_w-1} \mathbf{W}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} + \mathbf{d}_t,   (5)

where $\mathbf{W}_\tau$ for $b \le \tau \le b+L_w-1$ are $M \times M$ matrices containing coefficients that predict the current captured signal, $\mathbf{x}_t$, from the past captured signals, $\mathbf{x}_{t-\tau}$ for $b \le \tau \le b+L_w-1$. The prediction error $\mathbf{d}_t$ corresponds to the desired signal in the above equation.
WPE estimates the prediction coefficients based on maximum likelihood estimation, assuming that the desired signal at each microphone follows a time-varying complex Gaussian distribution with mean 0 and a time-varying variance, $\lambda_t$, which corresponds to the time-varying power of the desired signal. Then, the prediction coefficients, $\mathbf{W}_\tau$, are estimated as those that minimize the average power of the prediction error weighted by the inverse of $\lambda_t$. The estimation is represented by

\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} \sum_t \frac{1}{\lambda_t} \left\| \mathbf{x}_t - \sum_{\tau=b}^{b+L_w-1} \mathbf{W}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau} \right\|^2,   (6)
where $\|\mathbf{x}\|^2$ is the squared norm of a vector $\mathbf{x}$. It is known that the prediction delay $b$ also works as a distortionless constraint that prevents the desired signal components from being distorted by the dereverberation [10]. As for the estimation of $\lambda_t$, several useful techniques have been proposed, such as an iterative estimation method.
With the estimated prediction coefficients, the dereverberation is performed by

\hat{\mathbf{d}}_t = \mathbf{x}_t - \sum_{\tau=b}^{b+L_w-1} \hat{\mathbf{W}}_\tau^{\mathsf{H}} \mathbf{x}_{t-\tau}.   (7)
It has been experimentally confirmed that WPE functions robustly even in noisy environments, reducing the late reverberation with only a slight increase in the noise [10].
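As a minimal sketch of the estimation in eq. (6) and the dereverberation in eq. (7), the following single-frequency-bin NumPy routine alternates between a closed-form weighted least-squares update of the prediction filter and a re-estimate of the time-varying power from the current desired-signal estimate. The function name `wpe`, its defaults, and the simple power update are illustrative choices, not the authors' implementation.

```python
import numpy as np

def wpe(X, b=2, Lw=8, n_iter=3, eps=1e-8):
    """Single-bin WPE: X is (T, M) STFT frames; returns the dereverberated signal."""
    T, M = X.shape
    # stacked delayed frames: row t holds [x_{t-b}^T, ..., x_{t-b-Lw+1}^T]
    Xbar = np.zeros((T, M * Lw), dtype=complex)
    for k in range(Lw):
        d = b + k
        Xbar[d:, k * M:(k + 1) * M] = X[:T - d]
    D = X.copy()
    for _ in range(n_iter):
        lam = np.maximum(np.mean(np.abs(D) ** 2, axis=1), eps)  # time-varying power estimate
        Xw = Xbar / lam[:, None]
        R = Xw.conj().T @ Xbar          # weighted correlation of the delayed frames
        P = Xw.conj().T @ X             # weighted cross-correlation with the current frame
        G = np.linalg.solve(R + eps * np.eye(M * Lw), P)   # stacked prediction filters
        D = X - Xbar @ G                # prediction error = desired signal, cf. eq. (7)
    return D
```

With an artificial single-tap reverberation inside the prediction window, the routine recovers most of the direct signal after a few iterations.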
III-B Beamforming by MPDR
Assuming that the transfer function corresponding to the desired signal can be approximated by a product of a vector with the clean speech signal, i.e., $\mathbf{d}_t \approx \mathbf{v} s_t$, and taking the late reverberation, $\mathbf{r}_t$, as a part of the noise, $\mathbf{n}'_t = \mathbf{r}_t + \mathbf{n}_t$, eq. (2) becomes

\mathbf{x}_t = \mathbf{v} s_t + \mathbf{n}'_t.   (8)
MPDR is defined as a vector, $\mathbf{h}$, that minimizes the average power of the enhanced signal, $\mathbf{h}^{\mathsf{H}}\mathbf{x}_t$, under a distortionless constraint, $\mathbf{h}^{\mathsf{H}}\mathbf{v} = 1$, that keeps the clean speech, $s_t$, unchanged by the beamforming [2, 3]. Here, $\mathbf{v}$ is also termed a steering vector, and techniques to estimate it from a captured signal have been proposed. Due to the scale ambiguity in steering vector estimation, a relative transfer function (RTF) [30] is in practice substituted for it. An RTF is defined as the steering vector normalized by its value at a reference channel, calculated as $\mathbf{v}/v_q$, where $v_q$ denotes the value of $\mathbf{v}$ at the reference channel $q$. This makes the distortionless constraint keep the desired signal at the reference channel, $d_{t,q}$, unchanged by the beamforming.
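One common choice for estimating the steering vector is the covariance-whitening method based on generalized eigenvalue decomposition. Below is a minimal NumPy sketch, assuming sample covariance matrices of the noisy speech and of a noise-only period are available; `estimate_rtf` is a hypothetical name, not an interface from this paper.

```python
import numpy as np

def estimate_rtf(Rx, Rn):
    """Covariance-whitening RTF estimate.
    Rx: (M, M) covariance of noisy speech, Rn: (M, M) noise covariance."""
    L = np.linalg.cholesky(Rn)                 # Rn = L L^H
    Linv = np.linalg.inv(L)
    Rw = Linv @ Rx @ Linv.conj().T             # whitened covariance
    _, U = np.linalg.eigh(Rw)                  # eigh returns ascending eigenvalues
    u = U[:, -1]                               # principal eigenvector
    v = L @ u                                  # de-whiten to get the steering direction
    return v / v[0]                            # normalize to the reference channel
```

The normalization by the first element removes the scale (and phase) ambiguity, yielding an RTF with value 1 at the reference channel.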
The beamformer is estimated as follows:

\hat{\mathbf{h}} = \frac{\mathbf{R}^{-1}\mathbf{v}}{\mathbf{v}^{\mathsf{H}}\mathbf{R}^{-1}\mathbf{v}}, \qquad \mathbf{R} = \sum_t \mathbf{x}_t \mathbf{x}_t^{\mathsf{H}},   (9)

where $\mathbf{R}$ is the spatial covariance matrix of the captured signal.
Then, the desired signal is estimated as

\hat{d}_{t,q} = \hat{\mathbf{h}}^{\mathsf{H}}\mathbf{x}_t.   (10)
With MPDR, the resultant signal is composed of only one channel, corresponding to the reference channel $q$.
According to the above interpretation, MPDR can perform both denoising and dereverberation [31] by reducing $\mathbf{n}'_t$, which contains the additive noise and the late reverberation. However, the dereverberation capability of this beamformer is limited because it cannot reduce reverberation components that come from the target speaker direction, especially when there are few microphones.
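A minimal single-bin NumPy sketch of the MPDR beamformer of eqs. (9) and (10); `mpdr` is an illustrative name, and the spatial covariance is simply the sample average over frames.

```python
import numpy as np

def mpdr(X, v):
    """Single-bin MPDR: X is (T, M) STFT frames, v the steering vector (RTF)."""
    T, M = X.shape
    R = X.T @ X.conj() / T              # sample spatial covariance R = E[x_t x_t^H]
    Rinv_v = np.linalg.solve(R, v)
    h = Rinv_v / (v.conj() @ Rinv_v)    # eq. (9): h = R^{-1} v / (v^H R^{-1} v)
    return h                            # enhanced signal: y_t = h^H x_t, i.e., X @ h.conj()
```

By construction `h.conj() @ v` evaluates to 1, so the desired signal at the reference channel passes through the beamformer undistorted.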
III-C Cascade of WPE and MPDR
To achieve better speech enhancement in noisy reverberant environments, researchers have also proposed using both WPE and MPDR in a cascade configuration [22]. Because WPE can dereverberate all the microphone signals individually, MPDR can be applied to the signals after WPE has been applied. Techniques have also been proposed to estimate the steering vector and the power of the desired signal, for example, by iteratively and alternately applying WPE and MPDR to the captured signal [25].
IV Proposed method
This section describes a method for unifying WPE and MPDR into a single convolutional beamformer. A globally optimal closed-form solution can be obtained for this beamformer given the steering vector and the time-varying power of the desired signal, and thus we can perform more effective speech enhancement than with a simple cascade of WPE and MPDR. Figure 1 illustrates the processing flow of the method.
IV-A Convolutional beamforming by WPD
First, the signal obtained using the cascade of WPE and MPDR, i.e., eqs. (7) and (10), can be rewritten as

\hat{d}_{t,q} = \hat{\mathbf{h}}^{\mathsf{H}}\left(\mathbf{x}_t - \sum_{\tau=b}^{b+L_w-1} \hat{\mathbf{W}}_\tau^{\mathsf{H}}\mathbf{x}_{t-\tau}\right)   (11)

= \mathbf{w}_0^{\mathsf{H}}\mathbf{x}_t + \sum_{\tau=b}^{b+L_w-1} \mathbf{w}_\tau^{\mathsf{H}}\mathbf{x}_{t-\tau}   (12)

= \bar{\mathbf{w}}^{\mathsf{H}}\bar{\mathbf{x}}_t,   (13)

where we set $\mathbf{w}_0 = \hat{\mathbf{h}}$ and $\mathbf{w}_\tau = -\hat{\mathbf{W}}_\tau\hat{\mathbf{h}}$ to obtain the above second line, and we set $\bar{\mathbf{w}} = [\mathbf{w}_0^\top, \mathbf{w}_b^\top, \ldots, \mathbf{w}_{b+L_w-1}^\top]^\top$ and $\bar{\mathbf{x}}_t = [\mathbf{x}_t^\top, \mathbf{x}_{t-b}^\top, \ldots, \mathbf{x}_{t-b-L_w+1}^\top]^\top$ to obtain the third line. Note that $\bar{\mathbf{w}}$ and $\bar{\mathbf{x}}_t$ contain a time gap between their first and second elements, corresponding to the prediction delay $b$ in eq. (7).
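The rewrite in eqs. (11)-(13) is purely algebraic, so it can be checked numerically: the sketch below applies an arbitrary (non-optimized) beamformer and WPE-style filters in cascade and compares the result with the single convolutional beamformer obtained by stacking the coefficients as above. All sizes and values are synthetic, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, Lw, b = 3, 50, 4, 2
X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)                    # beamformer
W = rng.standard_normal((Lw, M, M)) + 1j * rng.standard_normal((Lw, M, M))  # WPE filters

# cascade, eq. (11): y_t = h^H (x_t - sum_k W_k^H x_{t-b-k})
y_casc = np.zeros(T, complex)
for t in range(T):
    d = X[t].copy()
    for k in range(Lw):
        if t - b - k >= 0:
            d -= W[k].conj().T @ X[t - b - k]
    y_casc[t] = h.conj() @ d

# single convolutional beamformer, eq. (13): wbar = [h; -W_0 h; ...; -W_{Lw-1} h]
wbar = np.concatenate([h] + [-(W[k] @ h) for k in range(Lw)])
y_conv = np.zeros(T, complex)
for t in range(T):
    xbar = np.concatenate([X[t]] + [X[t - b - k] if t - b - k >= 0 else np.zeros(M)
                                    for k in range(Lw)])
    y_conv[t] = wbar.conj() @ xbar

assert np.allclose(y_casc, y_conv)   # the two forms coincide exactly
```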
Next, the optimization criterion is defined based on the model of the desired speech used for WPE, namely the time-varying Gaussian distribution, and based on the distortionless constraint used for MPDR. Specifically, we estimate the convolutional filter, $\bar{\mathbf{w}}$, as the one that minimizes the average weighted power of the signal under a distortionless constraint. This is represented by

\hat{\bar{\mathbf{w}}} = \operatorname*{argmin}_{\bar{\mathbf{w}}:\, \mathbf{w}_0^{\mathsf{H}}\mathbf{v} = 1} \sum_t \frac{\bigl|\bar{\mathbf{w}}^{\mathsf{H}}\bar{\mathbf{x}}_t\bigr|^2}{\lambda_t}.   (14)
Here, all the filter coefficients are optimized based on the average weighted power minimization criterion, although the conventional MPDR does not use this criterion. This modification is important for unifying the two methods into a single form. Note that the use of the time-varying weight makes the distribution of the enhanced speech obtained by beamforming closer to that of the desired speech.
Eq. (14) can be viewed as a variation of eq. (9) used for the conventional MPDR. Unlike eq. (9), eq. (14) evaluates the average weighted power of the signal, and considers not only the spatial covariance but also the temporal covariance. The solution is obtained as follows:
\hat{\bar{\mathbf{w}}} = \frac{\bar{\mathbf{R}}^{-1}\bar{\mathbf{v}}}{\bar{\mathbf{v}}^{\mathsf{H}}\bar{\mathbf{R}}^{-1}\bar{\mathbf{v}}},   (15)

where $\bar{\mathbf{v}} = [\mathbf{v}^\top, 0, \ldots, 0]^\top$ is a column vector containing $\mathbf{v}$ followed by zeros, and $\bar{\mathbf{R}}$ is a power-normalized temporal-spatial covariance matrix with a prediction delay, which is defined as

\bar{\mathbf{R}} = \sum_t \frac{\bar{\mathbf{x}}_t \bar{\mathbf{x}}_t^{\mathsf{H}}}{\lambda_t}.   (16)
Finally, with the estimated convolutional filter, $\hat{\bar{\mathbf{w}}}$, the target speech is estimated as

\hat{d}_{t,q} = \hat{\bar{\mathbf{w}}}^{\mathsf{H}}\bar{\mathbf{x}}_t.   (17)
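Combining eqs. (15)-(17), a single-frequency-bin WPD sketch might look as follows; `wpd` is an illustrative name, and the time-varying power of the desired signal is assumed to be given (e.g., an oracle or iterative estimate).

```python
import numpy as np

def wpd(X, v, lam, b=2, Lw=8, eps=1e-8):
    """Single-bin WPD convolutional beamformer.
    X: (T, M) STFT frames, v: steering vector (RTF), lam: (T,) desired-signal powers."""
    T, M = X.shape
    K = M * (Lw + 1)
    # xbar_t = [x_t; x_{t-b}; ...; x_{t-b-Lw+1}]  (note the gap of b frames)
    Xbar = np.zeros((T, K), dtype=complex)
    Xbar[:, :M] = X
    for k in range(Lw):
        d = b + k
        Xbar[d:, (k + 1) * M:(k + 2) * M] = X[:T - d]
    lam = np.maximum(lam, eps)
    Rbar = (Xbar / lam[:, None]).T @ Xbar.conj() / T   # eq. (16), power-normalized covariance
    vbar = np.concatenate([v, np.zeros(K - M)])        # v followed by zeros
    Rinv_v = np.linalg.solve(Rbar + eps * np.eye(K), vbar)
    wbar = Rinv_v / (vbar.conj() @ Rinv_v)             # eq. (15)
    return Xbar @ wbar.conj()                          # eq. (17): y_t = wbar^H xbar_t
```

On synthetic data with both noise and artificial late reverberation, this single closed-form solve reduces the error at the reference channel relative to the unprocessed signal.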
Interestingly, the same solution can be derived for the proposed method even when we concatenate MPDR and WPE in reverse order. The signal obtained in this case is represented by

\hat{d}_{t,q} = \hat{\mathbf{h}}^{\mathsf{H}}\mathbf{x}_t - \sum_{\tau=b}^{b+L_w-1} \mathbf{c}_\tau^{\mathsf{H}}\mathbf{H}^{\mathsf{H}}\mathbf{x}_{t-\tau},   (18)

where $\hat{\mathbf{h}}$ is the MPDR beamformer applied to $\mathbf{x}_t$, $\mathbf{H}$ is an arbitrary denoising matrix that contains $\hat{\mathbf{h}}$ in its first column, and $\mathbf{c}_\tau$ is a coefficient vector that predicts the current denoised signal, $\hat{\mathbf{h}}^{\mathsf{H}}\mathbf{x}_t$, from the past denoised signals, $\mathbf{H}^{\mathsf{H}}\mathbf{x}_{t-\tau}$. Then, eq. (12) is obtained by setting $\mathbf{w}_0 = \hat{\mathbf{h}}$ and $\mathbf{w}_\tau = -\mathbf{H}\mathbf{c}_\tau$, and optimized in the same way as discussed above.
It is also worth noting that a different solution can be derived for the proposed method by adopting the average power of the signal with no power weighting (and possibly with pre-whitening of the captured signal [29, 32]) as the minimization criterion. However, this solution does not work well for speech enhancement in noisy reverberant environments, just as WPE does not work well with such a criterion [10].
V Experiments
V-A Dataset and evaluation metrics
We evaluated the performance of the proposed method using the REVERB Challenge dataset [15]. The dataset is composed of a training set, a development set (dev set), and an evaluation set (eval set). While the dev and eval sets are composed of simulated data (SimData) and real recordings (RealData), the training set is composed only of SimData. Each utterance in the dataset contains reverberant speech uttered by a speaker and stationary additive noise. The distance between the speaker and the microphone array varies from 0.5 m to 2.5 m. For SimData, the reverberation time varies from about 0.25 s to 0.7 s, and the signal-to-noise ratio (SNR) is set at about 20 dB.
Evaluation metrics prepared for the challenge were used in the experiments. As objective measures for evaluating speech enhancement performance [33], we used the cepstrum distance (CD), the log likelihood ratio (LLR), the frequency-weighted segmental SNR (FWSSNR), and the speech-to-reverberation modulation energy ratio (SRMR) [34]. To evaluate the enhanced speech in terms of ASR performance, we used a baseline ASR system recently developed using Kaldi [35]. This is a fairly competitive system composed of a TDNN acoustic model trained using lattice-free MMI and online i-vector extraction, and a trigram language model.
V-B Methods to be compared and analysis conditions
We compared WPD (Proposed) with WPE, MPDR, and WPE followed by MPDR (WPE+MPDR). For all the methods, a Hanning window was used for the short-time analysis, with the frame length and shift set at 32 ms and 8 ms, respectively. The sampling frequency was 16 kHz, and all the microphones were used in all the experiments. For WPE, WPE+MPDR, and the proposed method, the prediction delay was set to a common value, and the length of the prediction filter was set separately for five frequency sub-bands covering the range from 0 to 8 kHz.
As for the time-varying power, $\lambda_t$, and the steering vector, $\mathbf{v}$, of the target speech, we estimated them from the captured signal based on a method used in [25], and used the same estimates for all the methods. Figure 2 shows the processing flow of this estimation. Adopting the power of the captured signal as the initial value of $\lambda_t$, we repeatedly applied WPE+MPDR to the captured signal, and updated the steering vector and the power of the signal using the outputs of WPE and MPDR, respectively. The number of iterations was set at two for this estimation. The steering vector was estimated based on generalized eigenvalue decomposition with covariance whitening
[36, 37], assuming that each utterance has noise-only periods of 225 ms and 75 ms at its beginning and ending parts, respectively.

V-C Evaluation by objective speech enhancement measures
                 SimData                      RealData
             CD    SRMR    LLR   FWSSNR        SRMR
No Enh      3.97   3.68   0.58     3.62        3.18
WPE         3.76   4.77   0.53     4.99        5.00
MPDR        3.67   4.50   0.54     4.66        4.82
WPE+MPDR    3.01   5.37   0.48     7.52        6.57
Proposed    2.64   5.34   0.39     8.18        6.64
Table I summarizes the evaluation results we obtained using the objective speech enhancement measures. First, all the methods improved all the measures relative to no enhancement. In addition, WPE+MPDR greatly outperformed WPE and MPDR alone, while the proposed method further outperformed WPE+MPDR on all the metrics except SRMR on SimData. These results clearly show the superiority of the proposed method.
V-D Evaluation using ASR
Table II shows the word error rates (WERs) obtained using the baseline ASR system. First, all the methods reduced the WERs under all the conditions except for the SimData Room1-Near condition. Furthermore, the proposed method greatly outperformed WPE, MPDR, and WPE+MPDR under all the conditions. These results again clearly indicate the superiority of the proposed method. Note that although it was hard to reduce the WER under the SimData Room1-Near condition, as it contains very little noise and reverberation, the increase in the WER when using the proposed method was the smallest among the compared methods.
                      SimData (WER %)
             Room1        Room2        Room3
            Near   Far   Near   Far   Near   Far    Ave
No Enh      3.13  3.94   4.71  7.31   4.70  7.50   5.22
WPE         3.20  3.47   4.64  5.91   4.28  5.32   4.47
MPDR        3.37  3.98   3.89  4.77   4.18  5.21   4.23
WPE+MPDR    3.32  3.76   4.10  4.61   4.57  5.71   4.35
Proposed    3.20  3.45   3.87  4.22   3.75  4.19   3.78

               RealData (WER %)
             Near     Far   Average
No Enh      17.53   19.68     18.61
WPE         12.33   13.88     13.11
MPDR        10.60   13.81     12.20
WPE+MPDR     8.75   11.31     10.03
Proposed     7.86   10.67      9.27
VI Concluding remarks
This paper presented a method for unifying WPE and MPDR that enables denoising and dereverberation to be performed both optimally and simultaneously based on microphone array signal processing. A convolutional beamformer, referred to as WPD, was derived and shown to improve speech enhancement performance in noisy reverberant environments, in terms of both the objective speech enhancement measures and WERs, in comparison with conventional methods including WPE, MPDR, and WPE+MPDR. Future work will include a comprehensive evaluation of the proposed method in various noisy reverberant environments, the introduction of different optimization criteria, and the extension of the proposed method to online processing.
References
 [1] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP, vol. 15, no. 7, pp. 2011–2022, 2007.
 [2] H. L. Van Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, Wiley-Interscience, New York, 2002.
 [3] H. Cox, “Resolving power and sensitivity to mismatch of optimum array processors,” The Journal of the Acoustical Society of America, vol. 54, pp. 771–785, 1973.
 [4] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
 [5] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” Proc. Interspeech, pp. 1981–1985, 2016.
 [6] S. Emura, S. Araki, T. Nakatani, and N. Harada, “Distortionless beamforming optimized with ℓ1-norm minimization,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 936–940, 2018.
 [7] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, 2007.
 [8] S. Araki, H. Sawada, and S. Makino, “Blind speech separation in a meeting situation with maximum SNR beamformer,” Proc. IEEE ICASSP, pp. 41–44, 2007.
 [9] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “BEAMNET: end-to-end training of a beamformer-supported multichannel ASR system,” Proc. IEEE ICASSP, pp. 5325–5329, 2017.
 [10] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
 [11] T. Yoshioka and T. Nakatani, “Generalization of multichannel linear prediction methods for blind MIMO impulse response shortening,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
 [12] T. Yoshioka, H. Tachibana, T. Nakatani, and M. Miyoshi, “Adaptive dereverberation of speech signals with speaker-position change detection,” Proc. IEEE ICASSP, pp. 3733–3736, 2009.
 [13] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Multi-channel linear prediction-based speech dereverberation with sparse priors,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 9, pp. 1509–1520, 2015.
 [14] D. Giacobello and T. L. Jensen, “Speech dereverberation based on convex optimization algorithms for group sparse linear prediction,” Proc. IEEE ICASSP, pp. 446–450, 2018.
 [15] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, doi:10.1186/s13634-016-0306-6, 2016.
 [16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” Proc. IEEE ASRU2015, pp. 504–511, 2015.
 [17] E. Vincent, S. Watanabe, J. Barker, and R. Marxer, “CHiME4 Challenge,” http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/.
 [18] J. Barker, S. Watanabe, and E. Vincent, “CHiME5 Challenge,” http://spandh.dcs.shef.ac.uk/chime_challenge/.
 [19] B. Li, T. N. Sainath, J. Caroselli, A. Narayanan, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon, “Acoustic modeling for Google Home,” Proc. Interspeech, 2017.
 [20] Audio Software Engineering and Siri Speech Team, “Optimizing Siri on HomePod in far-field settings,” Apple Machine Learning Journal, vol. 1, no. 12, 2018.
 [21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. Seltzer, and M. Souden, “Speech processing for digital home assistants,” submitted to IEEE Signal Processing Magazine, 2019.
 [22] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, and T. Nakatani, “Strategies for distant speech recognition in reverberant environments,” EURASIP J. Adv. Signal Process, Article ID 2015:60, doi:10.1186/s13634-015-0245-7, 2015.
 [23] W. Yang, G. Huang, W. Zhang, J. Chen, and J. Benesty, “Dereverberation with differential microphone arrays and the weighted-prediction-error method,” Proc. IWAENC, 2018.
 [24] M. Togami, “Multichannel online speech dereverberation under noisy environments,” Proc. EUSIPCO, pp. 1078–1082, 2015.
 [25] L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, “Integrating neural network based beamforming and weighted prediction error dereverberation,” Proc. Interspeech, pp. 3043–3047, 2018.
 [26] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, “Joint multi-microphone speech dereverberation and noise reduction using integrated sidelobe cancellation and linear prediction,” Proc. IWAENC, 2018.
 [27] J. S. Bradley, H. Sato, and M. Picard, “On the importance of early reflections for speech in rooms,” The Journal of the Acoustical Society of America, vol. 113, pp. 3233–3244, 2003.
 [28] K. Abed-Meraim and P. Loubaton, “Prediction error method for second-order blind identification,” IEEE Trans. on Signal Processing, vol. 45, no. 3, pp. 694–705, 1997.
 [29] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, “Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 534–545, 2009.
 [30] I. Cohen, “Relative transfer function identification using speech signals,” IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.
 [31] J. Heymann, L. Drude, and R. Haeb-Umbach, “A generic neural acoustic beamforming architecture for robust multi-channel speech processing,” Computer Speech and Language, vol. 46, pp. 374–385, 2017.
 [32] T. Dietzen, A. Spriet, W. Tirry, S. Doclo, M. Moonen, and T. van Waterschoot, “On the relation between data-dependent beamforming and multichannel linear prediction for dereverberation,” Proc. AES 60th International Conference, pp. 1–8, 2016.
 [33] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE TASLP, vol. 16, no. 1, pp. 229–238, 2008.
 [34] T. H. Falk, C. Zheng, and W.-Y. Chan, “A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech,” IEEE TASLP, vol. 18, no. 7, pp. 1766–1774, 2010.
 [35] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” Proc. IEEE ASRU, 2011.
 [36] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” Proc. IEEE ICASSP, pp. 681–685, 2017.
 [37] S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” Proc. IEEE ICASSP, pp. 544–548, 2015.