1 Introduction
Speech enhancement has been studied extensively because of its various applications, including mobile communication [1] and hearing aids [2]. When multiple microphones are available, multichannel speech enhancement is an effective approach because it takes advantage of spatial information [3]. Recently, deep neural network (DNN)-based multichannel speech enhancement has gained increasing attention [4, 5, 6], motivated by the strong modeling capability of DNNs. DNN-based multichannel speech enhancement methods often manipulate an observed signal in the time-frequency (TF) domain because spatial filtering can be efficiently implemented there. Ordinarily, the estimated spectrogram or TF mask is passed to an objective function defined in the TF domain. However, it is the enhanced time-domain signal that matters for human listeners. Hence, this paper proposes a DNN-based multichannel speech enhancement system in which speech enhancement is conducted in the TF domain, while the objective function is computed on the reconstructed time-domain signal to improve human perception.
Recently, various DNN-based approaches to multichannel speech enhancement have been studied [4, 5, 7]. A well-known approach is mask-based beamforming (MB), in which a TF mask is used to estimate the spatial covariance matrix (SCM) [5, 6]. Although it has achieved excellent performance as a front-end of automatic speech recognition (ASR) [8, 9, 10], it has a few drawbacks. First, the DNN is often trained to minimize the estimation error of TF masks instead of directly maximizing the quality of the estimated signal [11, 12]. In addition, its performance is limited under noisy and reverberant environments because it does not consider the nonstationary characteristics of the speech signal [13].
To address these problems, we previously proposed DNN-based multichannel Wiener filtering (MWF) with a multichannel objective function for speech separation [14]. The DNN-based MWF is based on the estimation of time-varying SCMs, and it can adapt to the time-varying speech signal. In addition, the quality of the estimated signal is directly maximized in the TF domain based on a statistical model of the multichannel signal. Hence, the DNN-based MWF can be expected to improve the performance of multichannel speech enhancement as it did for speech separation. However, its result is often an inconsistent spectrogram [15, 16, 17], and thus the estimated amplitude and phase may change when the inverse STFT (iSTFT) and STFT are applied in turn. Although several DNN-based monaural speech enhancement and separation methods improve performance by considering consistency [18, 19], consistency has not been taken into account in DNN-based multichannel speech enhancement.
In this paper, we propose a novel DNN-based multichannel speech enhancement system in which the DNN is trained to directly improve the quality of the enhanced time-domain signal. An overview of the proposed system is illustrated in Fig. 1. The DNN estimates TF masks and power spectral densities of speech and noise, from which the MWF is calculated. Multichannel speech enhancement is conducted by the MWF, represented by the blue blocks in Fig. 1. The estimated spectrogram is converted back to the time domain and passed to an objective function. Thanks to this, the objective function can take into account the reconstruction error due to inconsistency. We investigate two novel objective functions that evaluate the enhanced time-domain signal in the time domain or the TF domain. Our experiments confirmed that the performance of the DNN-based MWF is improved by the proposed objective functions.
2 Preliminaries
2.1 Speech enhancement by multichannel Wiener filtering
Since our proposed system uses MWF, this subsection reviews the MWF that has been applied to multichannel speech enhancement [20]. Let a noisy signal be observed by $M$ microphones, and let $\mathbf{x}(t,f) \in \mathbb{C}^{M}$ be the observed noisy signal in the TF domain, where $t$ and $f$ are the time and frequency indices, respectively. The observed signal is given by the sum of the clean speech $\mathbf{s}(t,f)$ and noise $\mathbf{n}(t,f)$ as
\mathbf{x}(t,f) = \mathbf{s}(t,f) + \mathbf{n}(t,f).   (1)
We assume both speech and noise follow multivariate zero-mean complex Gaussian distributions as in [21]:
\mathbf{s}(t,f) \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{R}_{\mathrm{s}}(t,f)),   (2)
\mathbf{n}(t,f) \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{R}_{\mathrm{n}}(t,f)),   (3)
where $\mathbf{R}_{\mathrm{s}}(t,f)$ and $\mathbf{R}_{\mathrm{n}}(t,f)$ are the time-varying SCMs of speech and noise, respectively. The observed noisy signal also follows a multivariate zero-mean complex Gaussian distribution: $\mathbf{x}(t,f) \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{R}_{\mathrm{s}}(t,f) + \mathbf{R}_{\mathrm{n}}(t,f))$.
Given the time-varying SCMs, the posterior distribution of the speech follows a multivariate complex Gaussian distribution:
p(\mathbf{s}(t,f) \mid \mathbf{x}(t,f)) = \mathcal{N}_{\mathbb{C}}(\boldsymbol{\mu}(t,f), \boldsymbol{\Sigma}(t,f)),   (4)
where its mean and covariance matrix are calculated as
\boldsymbol{\mu}(t,f) = \mathbf{W}(t,f)\,\mathbf{x}(t,f),   (5)
\mathbf{W}(t,f) = \mathbf{R}_{\mathrm{s}}(t,f) \bigl( \mathbf{R}_{\mathrm{s}}(t,f) + \mathbf{R}_{\mathrm{n}}(t,f) \bigr)^{-1},   (6)
\boldsymbol{\Sigma}(t,f) = \bigl( \mathbf{I} - \mathbf{W}(t,f) \bigr) \mathbf{R}_{\mathrm{s}}(t,f),   (7)
where $\mathbf{I}$ is the $M \times M$ identity matrix, and $\mathbf{W}(t,f)$ is called the MWF. The enhanced spectrogram is obtained by applying the MWF, and the result is converted back to the time domain by applying the iSTFT.

2.2 DNN-based multichannel Wiener filtering
We previously proposed a DNN-based MWF to take advantage of the strong modeling capability of a DNN in multichannel speech separation [14]. In the DNN-based MWF, a DNN estimates a TF mask $m_c(t,f)$ and a power spectral density $v_c(t,f)$ for each speaker $c$. The time-varying SCM of the $c$-th speaker is calculated by
\mathbf{R}_c(t,f) = v_c(t,f)\,\mathbf{R}_c(f),   (8)
\mathbf{R}_c(f) = \frac{\sum_t m_c(t,f)\,\mathbf{x}(t,f)\,\mathbf{x}(t,f)^{\mathsf{H}}}{\sum_t m_c(t,f)},   (9)
where $\mathbf{R}_c(f)$ is a time-invariant SCM estimated by using the TF mask, and $(\cdot)^{\mathsf{H}}$ is the Hermitian transpose. Based on the estimated time-varying SCMs, each speech signal is estimated by the MWF.
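As a concrete illustration, the mask-based SCM estimation of Eqs. (8)-(9) and the Wiener filtering of Eqs. (5)-(6) can be sketched in NumPy as follows. This is a minimal sketch with our own variable names; the small `eps` regularizer guarding against empty masks is our addition, not part of the method.

```python
import numpy as np

def scm_time_varying(X, mask, psd, eps=1e-6):
    """Eqs. (8)-(9): mask-weighted time-invariant SCM scaled by the PSD.
    X: (T, F, M) complex spectrogram, mask/psd: (T, F) real.
    Returns the time-varying SCM of shape (T, F, M, M)."""
    outer = X[..., :, None] * X[..., None, :].conj()        # x x^H per TF bin
    R_f = (mask[..., None, None] * outer).sum(axis=0)
    R_f /= mask.sum(axis=0)[..., None, None] + eps          # (F, M, M)
    return psd[..., None, None] * R_f[None]                 # (T, F, M, M)

def mwf(R_s, R_n, X):
    """Eqs. (5)-(6): W = R_s (R_s + R_n)^(-1), enhanced signal mu = W x."""
    W = R_s @ np.linalg.inv(R_s + R_n)
    return (W @ X[..., None])[..., 0]
```

In practice the speech and noise SCMs would each be built from their own mask and PSD estimated by the DNN, and the two time-varying SCMs would then be passed to `mwf`.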
To train the DNN for estimating TF masks and power spectral densities, we proposed the following objective function [14]:
\mathcal{L}_{\mathrm{MC}} = \sum_{t,f} \Bigl[ \log \det \boldsymbol{\Sigma}(t,f) + \bigl( \mathbf{s}(t,f) - \hat{\mathbf{s}}(t,f) \bigr)^{\mathsf{H}} \boldsymbol{\Sigma}(t,f)^{-1} \bigl( \mathbf{s}(t,f) - \hat{\mathbf{s}}(t,f) \bigr) \Bigr],   (10)
\hat{\mathbf{s}}(t,f) = \mathbf{W}(t,f)\,\mathbf{x}(t,f),   (11)
where $\hat{\mathbf{s}}(t,f)$ is the multichannel signal estimated by the MWF, and $\boldsymbol{\Sigma}(t,f)$ is the covariance calculated by Eq. (7). The minimization of this objective function corresponds to the maximization of the posterior distribution $p(\mathbf{s} \mid \mathbf{x})$. In other words, the objective function in Eq. (10) evaluates the quality of the estimated multichannel signal based on the statistical model of multichannel signals. A clear advantage of this objective function is that the separated signal is evaluated directly, whereas conventional methods set auxiliary targets, such as TF masks, in their objective functions [4, 5]. The effectiveness of the multichannel objective function has also been confirmed for various MB methods [22].
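Interpreting Eq. (10) as the negative log-posterior up to constants, i.e., a sum over TF bins of $\log \det \boldsymbol{\Sigma} + \mathbf{e}^{\mathsf{H}} \boldsymbol{\Sigma}^{-1} \mathbf{e}$ with $\mathbf{e} = \mathbf{s} - \hat{\mathbf{s}}$, the objective can be sketched as follows (our reconstruction; function and variable names are ours):

```python
import numpy as np

def multichannel_loss(S_true, S_est, Sigma):
    """Sum over (t, f) of log det Sigma + e^H Sigma^(-1) e, with
    e = s - s_hat; minimizing this maximizes the Gaussian posterior.
    Shapes: S_true, S_est: (T, F, M); Sigma: (T, F, M, M)."""
    err = (S_true - S_est)[..., None]                       # (T, F, M, 1)
    quad = (err.conj().swapaxes(-1, -2)
            @ np.linalg.inv(Sigma) @ err).real[..., 0, 0]
    logdet = np.log(np.linalg.det(Sigma).real)
    return float((logdet + quad).sum())
```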
2.3 STFT consistency
It is known that spectrograms calculated by the STFT satisfy relations between neighboring TF bins, and such spectrograms are called consistent spectrograms [15, 16, 17]. A consistent spectrogram $\mathbf{S}$ satisfies the following relation:
\mathbf{S} = \mathcal{P}(\mathbf{S}), \qquad \mathcal{P} = \mathrm{STFT} \circ \mathrm{iSTFT},   (12)
where $\mathrm{STFT}(\cdot)$ is the STFT, $\mathrm{iSTFT}(\cdot)$ is the iSTFT, and $\mathcal{P}$ is the projection onto the set of consistent spectrograms. When speech enhancement is conducted in the TF domain, the consistency of the estimated spectrogram is not guaranteed. In such a case, the spectrogram calculated by applying the STFT to the reconstructed time-domain signal differs from the estimated spectrogram. In DNN-based speech enhancement, this discrepancy indicates that objective functions defined in the TF domain do not properly evaluate the estimated time-domain speech. Some studies have addressed this problem because the discrepancy decreases the performance of TF masking [18, 19]. For instance, [18] presented the wave approximation (WA), which evaluates the estimated signal in the time domain, and [19] proposed evaluating a spectrogram projected onto the set of consistent spectrograms. Although these studies showed the importance of consistency in monaural speech enhancement and separation, consistency has not been explicitly considered in multichannel speech enhancement.
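A consistency projection of this form can be sketched with SciPy's STFT/iSTFT; the window length and hop below are our own placeholder parameters, not the paper's settings:

```python
import numpy as np
from scipy.signal import istft, stft

def consistency_projection(S, nperseg=256, noverlap=192):
    """P(S): go to the time domain and back, which projects S onto
    the set of consistent spectrograms. S: (F, T) complex."""
    _, x = istft(S, nperseg=nperseg, noverlap=noverlap)
    _, _, S_proj = stft(x, nperseg=nperseg, noverlap=noverlap)
    return S_proj
```

A spectrogram obtained by the STFT of a real signal is (numerically) unchanged by this projection, while an arbitrary complex array generally is not.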
3 Proposed multichannel speech enhancement system
In this section, we propose a DNN-based multichannel speech enhancement system in which the objective function is computed on the estimated time-domain signal, as illustrated in Fig. 1. In the proposed system, multichannel speech enhancement is conducted by the DNN-based MWF as described in Section 2.2. Then, the result of the MWF is converted back to the time domain by the iSTFT and passed to the objective function. In Section 3.1, we extend WA, which is defined in the time domain, so that it applies to our proposed system. Section 3.2 describes another objective function calculated as the sum of the original multichannel objective function [14] and a consistency-aware objective function defined in the TF domain. Both proposed objective functions are summarized in Fig. 2.
3.1 Multichannel wave approximation (MWA)
The multichannel objective function in Eq. (10) is computed on the estimated spectrogram in the TF domain, and it does not consider the reconstruction error due to the inconsistency of the estimated spectrogram. To address this problem, we propose the multichannel wave approximation (MWA), which is computed on the reconstructed time-domain signal as illustrated in Fig. 2. It is formulated as the sum of WA over the channels:
\mathcal{L}_{\mathrm{MWA}} = \sum_{m=1}^{M} \bigl\| \mathrm{iSTFT}(\mathbf{S}_m) - \mathrm{iSTFT}(\hat{\mathbf{S}}_m) \bigr\|_1,   (13)
where $\|\cdot\|_1$ is the $\ell_1$ norm, and $\mathbf{S}_m$ and $\hat{\mathbf{S}}_m$ are the clean and estimated spectrograms at the $m$-th channel, respectively. MWA trains the DNN to maximize the quality of the reconstructed time-domain signal, whereas the original objective function in Eq. (10) focuses on the estimated spectrogram, which may be inconsistent. Recently, WA has achieved promising results in monaural speech enhancement and separation [18]. The proposed MWA is a simple extension of WA to the multichannel case.
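Assuming an $\ell_1$ distance between the reconstructed waveforms, the MWA loss can be sketched as follows (STFT parameters are our own placeholders):

```python
import numpy as np
from scipy.signal import istft

def mwa_loss(S_clean, S_est, nperseg=256, noverlap=192):
    """Eq. (13): sum over channels of the l1 distance between the
    time-domain signals reconstructed by the iSTFT.
    S_clean, S_est: (M, F, T) complex multichannel spectrograms."""
    loss = 0.0
    for S_c, S_e in zip(S_clean, S_est):
        _, x_c = istft(S_c, nperseg=nperseg, noverlap=noverlap)
        _, x_e = istft(S_e, nperseg=nperseg, noverlap=noverlap)
        loss += np.abs(x_c - x_e).sum()
    return float(loss)
```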
3.2 Consistencyaware multichannel objective function
We propose another, consistency-aware multichannel objective function as the sum of the original multichannel objective function in Eq. (10) and a consistency-aware term:
\mathcal{L} = \mathcal{L}_{\mathrm{MC}} + \lambda \bigl\| \mathbf{S} - \mathcal{P}(\hat{\mathbf{S}}) \bigr\|_F^2,   (14)
where $\mathcal{L}_{\mathrm{MC}}$ is the objective function in Eq. (10), $\mathbf{S}$ and $\hat{\mathbf{S}}$ are the clean and estimated spectrograms, $\|\cdot\|_F$ is the Frobenius norm, and $\lambda$ is a hyperparameter for balancing the two terms. In the second term, the estimated spectrogram is projected onto the set of consistent spectrograms, and then the distance to the clean spectrogram is calculated. This projection enables the objective function to consider the reconstruction error due to inconsistency. In other words, the second term corresponds to evaluating the estimated time-domain signal in the TF domain by recomputing the STFT. Note that the second term in Eq. (14) does not have any known statistical meaning, while the first term is based on a statistical model of multichannel signals. Evaluating the posterior distribution under the consistency projection is a possible extension, which is included in our future work.

The proposed objective functions are summarized in Fig. 2. The first proposed objective function, given in Eq. (13), is defined between the clean and estimated time-domain signals. In contrast, the second one, given in Eq. (14), considers both the estimated spectrogram and the STFT of the reconstructed time-domain signal. In an early experiment, this combination achieved better performance compared to using the second term alone. The difference in the domain of the proposed objective functions affects the enhanced signal, as shown in the following experiment.
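A sketch of the combined objective, pairing our reading of Eq. (10) with a projected Frobenius term as in Eq. (14); shapes, STFT parameters, and names are our own assumptions:

```python
import numpy as np
from scipy.signal import istft, stft

def consistency_aware_loss(S_clean, S_est, Sigma, lam,
                           nperseg=256, noverlap=192):
    """First term: statistical multichannel loss (Eq. (10)).
    Second term: lam * || S_clean - P(S_est) ||_F^2, P = STFT o iSTFT.
    Shapes: S_clean, S_est: (F, T, M); Sigma: (F, T, M, M)."""
    err = (S_clean - S_est)[..., None]
    quad = (err.conj().swapaxes(-1, -2)
            @ np.linalg.inv(Sigma) @ err).real[..., 0, 0]
    l_mc = (np.log(np.linalg.det(Sigma).real) + quad).sum()
    # Consistency projection applied per channel.
    l_proj = 0.0
    for m in range(S_est.shape[-1]):
        _, x = istft(S_est[..., m], nperseg=nperseg, noverlap=noverlap)
        _, _, P = stft(x, nperseg=nperseg, noverlap=noverlap)
        l_proj += (np.abs(S_clean[..., m] - P) ** 2).sum()
    return float(l_mc + lam * l_proj)
```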
4 Experiments and results
To confirm the effectiveness of the proposed system, an experiment on multichannel speech enhancement under diffuse noise was conducted. The DNN-based MWFs were compared with various baseline methods, including TF masking and MB. In the following subsections, the DNN-based MWFs using DNNs trained with the proposed objective functions in Eqs. (13) and (14) are referred to as Prop. 1 and Prop. 2, respectively.
                              |   SNR:  dB            |   SNR:  dB            |   SNR:  dB
  Approach                    | SDR [dB] CD [dB] PESQ | SDR [dB] CD [dB] PESQ | SDR [dB] CD [dB] PESQ
Reverberation time:  ms
  Observed                    |   1.14    5.26   1.14 |   6.93    4.67   1.33 |  13.21    3.74   1.77
  TF masking
    PSA                       |   9.15    4.42   1.61 |  13.46    3.70   2.00 |  18.07    2.95   2.59
    PSA+Proj                  |   9.36    4.53   1.62 |  13.76    3.80   2.03 |  18.40    3.08   2.64
    WA                        |   9.60    4.65   1.63 |  14.06    3.90   2.11 |  18.69    3.15   2.73
  Spatial filtering
    MB                        |   5.42    4.88   1.28 |  11.54    4.22   1.65 |  16.85    3.24   2.27
    Original                  |  10.48    4.41   1.77 |  14.90    3.62   2.26 |  19.20    2.71   2.81
    Prop. 1                   |  11.23    4.73   1.73 |  15.57    3.97   2.23 |  19.75    3.15   2.84
    Prop. 2                   |  10.72    4.38   1.82 |  15.01    3.60   2.30 |  19.39    2.68   2.85
Reverberation time:  ms
  Observed                    |   0.84    5.28   1.12 |   6.77    4.58   1.33 |  12.87    3.77   1.75
  TF masking
    PSA                       |   9.11    4.45   1.60 |  13.38    3.65   2.03 |  17.90    2.96   2.57
    PSA+Proj                  |   9.27    4.53   1.60 |  13.66    3.75   2.05 |  18.22    3.06   2.62
    WA                        |   9.56    4.65   1.62 |  13.99    3.85   2.15 |  18.54    3.14   2.75
  Spatial filtering
    MB                        |   5.20    4.90   1.26 |  11.07    4.15   1.62 |  16.34    3.32   2.19
    Original                  |  10.38    4.42   1.76 |  14.50    3.58   2.23 |  18.82    2.76   2.77
    Prop. 1                   |  11.23    4.73   1.71 |  15.32    3.92   2.25 |  19.56    3.19   2.79
    Prop. 2                   |  10.67    4.40   1.80 |  14.71    3.57   2.29 |  18.98    2.74   2.80
4.1 Experimental setup
4.1.1 Dataset
For both training and testing, clean speech from the TIMIT corpus [23] and noise from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [24] were used. The measured impulse responses in the Multichannel Impulse Response Database (MIRD) [25] were convolved with the above dry sources, where the 1st channel of the noise in DEMAND was used as the dry source. The distance between the speaker and the microphones was set to m, and the azimuth of each talker was randomly selected from points (from to with intervals of ). On the other hand, diffuse noise was generated by playing noise from all points. Note that the noise played at each point was obtained by splitting the original noise into periods; the first half was used for training/validation, and the other half for testing. The number of microphones was , and the distance between microphones was set to cm.
A training set of speech files was randomly selected from the training set of TIMIT, and the remaining files were used as a validation set. Since the number of noise recordings was small, we conducted data augmentation.^1

^1 The diffuse noise was augmented by taking convex combinations of two noises $\mathbf{n}_1$ and $\mathbf{n}_2$, randomly selected from DEMAND, as $\alpha \mathbf{n}_1 + (1 - \alpha) \mathbf{n}_2$, where $\alpha$ is randomly generated from a Beta distribution.
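For concreteness, the convex-combination augmentation in the footnote can be sketched as follows; the Beta shape parameters are placeholders, as the paper does not state them:

```python
import numpy as np

def augment_diffuse_noise(noise_a, noise_b, rng, a=2.0, b=2.0):
    """Convex combination of two noise recordings with a
    Beta-distributed weight (shape parameters a, b are placeholders)."""
    alpha = rng.beta(a, b)                  # alpha lies in (0, 1)
    return alpha * noise_a + (1.0 - alpha) * noise_b
```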
The signal-to-noise ratio (SNR) of the training/validation set was adjusted from to dB. For training, the reverberation time was ms. On the other hand, for testing, speech files randomly selected from the testing set of TIMIT were used as the clean speech, and the later periods of the noise were used. We evaluated under two reverberation conditions: ms and ms. All the speech was sampled at kHz, and the STFT was computed using a Hann window whose length was ms with a ms shift.

4.1.2 Baseline methods
We compared the proposed methods with the following baseline methods. First, TF masking was used as a well-known monaural speech enhancement approach. To confirm the effectiveness of considering consistency, three objective functions were compared: the phase-sensitive approximation (PSA) [26], PSA with the consistency projection (PSA+Proj) [19], and WA [18]. MB [5], using a DNN trained with PSA, was also conducted. Although several iterative methods using DNNs have been proposed for multichannel source separation [4, 27, 28], we compared the proposed system only with the aforementioned non-iterative methods because it is itself non-iterative. The performance of the proposed method could be further improved by unifying it with iterative methods.
4.1.3 DNN architecture and setup
In all methods, including TF masking, the input feature was the concatenation of an amplitude feature and phase-difference features. The amplitude feature was calculated by
\phi^{\mathrm{amp}}(t,f) = \mathrm{MVN}\bigl( \log |x_1(t,f)| \bigr),   (15)
where $x_1(t,f)$ is the observed signal at the reference channel, and $\mathrm{MVN}(\cdot)$ is the utterance-level mean and variance normalization. As in a previous study [29], the phase difference between two microphones was also used as an input feature:
\phi^{\cos}_q(t,f) = \cos\bigl( \angle x_q(t,f) - \angle x_1(t,f) \bigr),   (16)
\phi^{\sin}_q(t,f) = \sin\bigl( \angle x_q(t,f) - \angle x_1(t,f) \bigr),   (17)
where $\angle(\cdot)$ is the complex argument.
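Under our reading of Eqs. (15)-(17), the feature extraction can be sketched as follows; the reference channel index, the `eps` floor, and the feature stacking order are our own choices:

```python
import numpy as np

def input_features(X, ref=0, eps=1e-8):
    """Log-amplitude of a reference channel with utterance-level
    mean/variance normalization, plus cos/sin inter-channel phase
    differences. X: (M, F, T) complex. Returns (1 + 2*(M-1), F, T)."""
    log_amp = np.log(np.abs(X[ref]) + eps)
    mvn = (log_amp - log_amp.mean()) / (log_amp.std() + eps)
    feats = [mvn]
    for q in range(X.shape[0]):
        if q == ref:
            continue
        ipd = np.angle(X[q]) - np.angle(X[ref])
        feats += [np.cos(ipd), np.sin(ipd)]
    return np.stack(feats)
```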
The DNN used in the proposed methods is illustrated in Fig. 3. It contains two bidirectional long short-term memory (BLSTM) layers and dense layers. Dropout of was applied to each BLSTM layer and each dense layer except the last layers. The networks were trained on frame segments using the Adam optimizer over epochs. The learning rate was decayed by multiplying it by if the objective function on the validation set did not decrease for consecutive epochs, and the initial learning rate was set to . In Prop. 2, $\lambda$ was set to . In the baseline methods, we used only the TF-mask estimation part of the DNN illustrated in Fig. 3.

Note that all systems were implemented using TensorFlow, in which the STFT and iSTFT are implemented together with their backpropagation. In addition, TensorFlow supports many complex-valued operations and their derivatives. Hence, the MWF can easily be applied during training.
4.2 Experimental results
The performance of multichannel speech enhancement was evaluated by the signal-to-distortion ratio (SDR), cepstrum distortion (CD), and PESQ. The experimental results are summarized in Table 1, in which bold font represents the best score in each condition. Under both reverberation conditions, MB resulted in the lowest performance because it does not consider the nonstationary characteristics of speech. In TF masking, the consistency-aware methods, PSA+Proj and WA, outperformed the original PSA in terms of SDR and PESQ. These results confirm the importance of consistency.
The DNN-based MWF with the original multichannel objective function (Original) [14] outperformed the other conventional methods. Furthermore, the DNN-based MWF with the proposed MWA, Prop. 1, significantly improved SDR. Meanwhile, by using the combined objective function given in Eq. (14), Prop. 2 outperformed the original DNN-based MWF in terms of not only SDR but also CD and PESQ. We stress that the only difference among the three DNN-based MWFs is the objective function, and thus the computational cost of inference is identical.
5 Conclusion
In this paper, we described a DNN-based multichannel speech enhancement system in which the DNN is trained to maximize the quality of the time-domain signal estimated by the DNN-based MWF. We further proposed two objective functions defined on the enhanced time-domain signal. Our experimental results confirmed the effectiveness of the DNN-based MWF and the proposed objective functions in multichannel speech enhancement. Future work includes combining the proposed system with iterative algorithms.
References
[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2nd edition, Feb. 2013.
[2] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, Handbook on Array Processing and Sensor Networks, chapter Acoustic Beamforming for Hearing Aid Applications, pp. 269–302, Wiley Online Library, Jan. 2010.
[3] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 4, pp. 692–730, Apr. 2017.
[4] S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales-Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, “Robust ASR using neural network based speech enhancement and feature simulation,” in IEEE Workshop Autom. Speech Recognit. Underst. (ASRU), Dec. 2015, pp. 482–489.
[5] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 196–200.
[6] H. Erdogan, J. R. Hershey, S. Watanabe, M. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in INTERSPEECH, Sept. 2016, pp. 1981–1985.
[7] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep beamforming networks for multi-channel speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2016, pp. 5745–5749.

[8] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 4, pp. 780–793, Apr. 2017.
[9] S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer, 2017.
[10] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 5739–5743.
[11] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in Int. Conf. Mach. Learn. (ICML), Aug. 2017, pp. 2632–2641.
[12] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2017, pp. 5325–5329.
[13] Z. Wang and D. Wang, “All-neural multi-channel speech enhancement,” in Interspeech, Sept. 2018, pp. 3234–3238.
[14] M. Togami, “Multi-channel Itakura Saito distance minimization with deep neural network,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 536–540.

[15] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[16] J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction,” in ISCA Workshop Stat. Percept. Audit. (SAPA), Sept. 2008, pp. 23–28.
 [17] Y. Masuyama, K. Yatabe, and Y. Oikawa, “Griffin–Lim like phase recovery via alternating direction method of multipliers,” IEEE Signal Process. Lett., vol. 26, no. 1, pp. 184–188, Jan. 2019.
[18] Z. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Interspeech, Sept. 2018, pp. 2708–2712.
 [19] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 900–904.

[20] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 5, pp. 960–971, May 2019.
[21] N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830–1840, Sept. 2010.

[22] Y. Masuyama, M. Togami, and T. Komatsu, “Multichannel loss function for supervised speech source separation by mask-based beamforming,” in Interspeech, Sept. 2019.
[23] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” 1993.
[24] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings,” J. Acoust. Soc. Am., vol. 133, no. 5, pp. 3591–3591, 2013.
[25] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2014, pp. 313–317.

[26] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 708–712.
[27] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1652–1664, Sept. 2016.
[28] N. Makishima, S. Mogami, N. Takamune, D. Kitamura, H. Sumino, S. Takamichi, H. Saruwatari, and N. Ono, “Independent deeply learned matrix analysis for determined audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 10, pp. 1601–1615, Oct. 2019.
 [29] Z. Wang, J. Le Roux, and J. R. Hershey, “Multichannel deep clustering: Discriminative spectral and spatial embeddings for speakerindependent speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1–5.