Consistency-aware multi-channel speech enhancement using deep neural networks

02/14/2020, by Yoshiki Masuyama, et al.

This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions are computed on the estimated T-F mask or spectrogram. However, the estimated spectrogram is often inconsistent, and its amplitude and phase may change when the spectrogram is converted back to the time domain. That is, the objective function does not evaluate the enhanced time-domain signal properly. To address this problem, we propose to use an objective function defined on the reconstructed time-domain signal. Specifically, speech enhancement is conducted by multi-channel Wiener filtering in the T-F domain, and its result is converted back to the time domain. We propose two objective functions computed on the reconstructed signal: the first is defined in the time domain, and the other is defined in the T-F domain. Our experiment demonstrates the effectiveness of the proposed system compared with T-F masking and mask-based beamforming.


1 Introduction

Speech enhancement has been studied extensively because of its various applications, including mobile communication [1] and hearing aids [2]. When multiple microphones are available, multi-channel speech enhancement is an effective approach because it takes advantage of spatial information [3]. Recently, deep neural network (DNN)-based multi-channel speech enhancement has gained increasing attention [4, 5, 6], motivated by the strong modeling capability of DNNs. DNN-based multi-channel speech enhancement methods often manipulate the observed signal in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented there. Ordinarily, the estimated spectrogram or T-F mask is passed to an objective function defined in the T-F domain. However, what ultimately matters to human listeners is the enhanced time-domain signal. Hence, this paper proposes a DNN-based multi-channel speech enhancement system in which speech enhancement is conducted in the T-F domain, while the objective function is computed on the reconstructed time-domain signal to improve perceived quality.

Recently, various DNN-based approaches to multi-channel speech enhancement have been studied [4, 5, 7]. A well-known approach is mask-based beamforming (MB), in which a T-F mask is used to estimate the spatial covariance matrix (SCM) [5, 6]. Although it has achieved excellent performance as a front-end for automatic speech recognition (ASR) [8, 9, 10], it has a few drawbacks. First, the DNN is often trained to minimize the estimation error of T-F masks instead of directly maximizing the quality of the estimated signal [11, 12]. In addition, its performance is limited in noisy and reverberant environments because it does not consider the non-stationary characteristics of the speech signal [13].

To address these problems, we proposed a DNN-based multi-channel Wiener filter (MWF) with a multi-channel objective function for speech separation [14]. The DNN-based MWF is based on the estimation of time-varying SCMs, and it can thus adapt to the time-varying speech signal. In addition, the quality of the estimated signal is directly maximized in the T-F domain based on a statistical model of the multi-channel signal. Hence, the DNN-based MWF can be expected to improve the performance of multi-channel speech enhancement as it did for speech separation. However, its result is often inconsistent [15, 16, 17], and thus the estimated amplitude and phase may change when the inverse short-time Fourier transform (iSTFT) and STFT are applied. Although several DNN-based monaural speech enhancement and separation methods improve performance by considering consistency [18, 19], consistency has not been taken into account in DNN-based multi-channel speech enhancement.

Figure 1: Block diagram of proposed multi-channel speech enhancement system. Green and blue blocks indicate STFT-related layers and multi-channel signal processing layers, respectively. Red block represents DNN, and only this block is trainable. DNN is trained to maximize the quality of estimated time-domain signal.

In this paper, we propose a novel system for DNN-based multi-channel speech enhancement in which the DNN is trained to directly improve the quality of the enhanced time-domain signal. An overview of the proposed system is illustrated in Fig. 1. The DNN estimates T-F masks and power spectral densities of speech and noise, from which the MWF is calculated. Multi-channel speech enhancement is conducted by the MWF, represented by the blue blocks in Fig. 1. The estimated spectrogram is converted back to the time domain and passed to the objective function. Thanks to this, the objective function can account for the reconstruction error caused by inconsistency. We investigate two novel objective functions that evaluate the enhanced time-domain signal in the time domain or in the T-F domain. Our experiment confirms that the performance of the DNN-based MWF is improved by the proposed objective functions.

2 Preliminaries

2.1 Speech enhancement by multi-channel Wiener filtering

Since our proposed system uses the MWF, this subsection reviews the MWF, which has been applied to multi-channel speech enhancement [20]. Let a noisy signal be observed by $C$ microphones, and let $\mathbf{x}_{t,f} \in \mathbb{C}^{C}$ be the observed noisy signal in the T-F domain, where $t$ and $f$ are the time and frequency indices, respectively. The observed signal is given by the sum of the clean speech $\mathbf{s}_{t,f}$ and noise $\mathbf{n}_{t,f}$ as

$\mathbf{x}_{t,f} = \mathbf{s}_{t,f} + \mathbf{n}_{t,f}.$   (1)

We assume that both speech and noise follow multivariate zero-mean complex Gaussian distributions as in [21]:

$\mathbf{s}_{t,f} \sim \mathcal{N}_{\mathbb{C}}\big(\mathbf{0},\, \mathbf{R}^{(\mathrm{s})}_{t,f}\big),$   (2)
$\mathbf{n}_{t,f} \sim \mathcal{N}_{\mathbb{C}}\big(\mathbf{0},\, \mathbf{R}^{(\mathrm{n})}_{t,f}\big),$   (3)

where $\mathbf{R}^{(\mathrm{s})}_{t,f}$ and $\mathbf{R}^{(\mathrm{n})}_{t,f}$ are the time-varying SCMs of speech and noise, respectively. The observed noisy signal also follows a multivariate zero-mean complex Gaussian distribution: $\mathbf{x}_{t,f} \sim \mathcal{N}_{\mathbb{C}}\big(\mathbf{0},\, \mathbf{R}^{(\mathrm{s})}_{t,f} + \mathbf{R}^{(\mathrm{n})}_{t,f}\big)$.

Given the time-varying SCMs, the posterior distribution of the clean speech follows a multivariate complex Gaussian distribution:

$p(\mathbf{s}_{t,f} \mid \mathbf{x}_{t,f}) = \mathcal{N}_{\mathbb{C}}\big(\mathbf{s}_{t,f};\, \hat{\mathbf{s}}_{t,f},\, \boldsymbol{\Sigma}_{t,f}\big),$   (4)

where its mean and covariance matrix are calculated as

$\hat{\mathbf{s}}_{t,f} = \mathbf{W}_{t,f}\, \mathbf{x}_{t,f},$   (5)
$\mathbf{W}_{t,f} = \mathbf{R}^{(\mathrm{s})}_{t,f}\big(\mathbf{R}^{(\mathrm{s})}_{t,f} + \mathbf{R}^{(\mathrm{n})}_{t,f}\big)^{-1},$   (6)
$\boldsymbol{\Sigma}_{t,f} = \big(\mathbf{I} - \mathbf{W}_{t,f}\big)\, \mathbf{R}^{(\mathrm{s})}_{t,f}.$   (7)

Here, $\mathbf{I}$ is the $C \times C$ identity matrix, and $\mathbf{W}_{t,f}$ is called the MWF. The enhanced spectrogram is obtained by applying the MWF to the observed signal, and the result is converted back to the time domain by applying the iSTFT.
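To make the filtering concrete, the following is a minimal NumPy sketch of Eqs. (5)-(7) for a single T-F bin; the function name and variables (`R_s`, `R_n`, `x`) are illustrative and not taken from the paper.

```python
import numpy as np

def mwf_bin(R_s, R_n, x):
    """Apply the multi-channel Wiener filter to one T-F bin.

    R_s, R_n : (C, C) speech / noise spatial covariance matrices (Hermitian).
    x        : (C,)   observed multi-channel STFT coefficients.
    Returns the posterior mean (enhanced coefficients) and covariance, cf. Eqs. (5)-(7).
    """
    W = R_s @ np.linalg.inv(R_s + R_n)           # Eq. (6): MWF
    s_hat = W @ x                                # Eq. (5): posterior mean
    Sigma = (np.eye(R_s.shape[0]) - W) @ R_s     # Eq. (7): posterior covariance
    return s_hat, Sigma

# Toy usage with random Hermitian positive semi-definite matrices (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R_s, R_n = A @ A.conj().T, B @ B.conj().T
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
s_hat, Sigma = mwf_bin(R_s, R_n, x)
```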

2.2 DNN-based multi-channel Wiener filtering

We proposed a DNN-based MWF that takes advantage of the strong modeling capability of a DNN in multi-channel speech separation [14]. In the DNN-based MWF, a DNN estimates a T-F mask $m^{(k)}_{t,f}$ and a power spectral density $v^{(k)}_{t,f}$ for each speaker $k$. The time-varying SCM of the $k$-th speaker is calculated by

$\mathbf{R}^{(k)}_{t,f} = v^{(k)}_{t,f}\, \boldsymbol{\Phi}^{(k)}_{f},$   (8)
$\boldsymbol{\Phi}^{(k)}_{f} = \dfrac{\sum_{t} m^{(k)}_{t,f}\, \mathbf{x}_{t,f} \mathbf{x}_{t,f}^{\mathsf{H}}}{\sum_{t} m^{(k)}_{t,f}},$   (9)

where $\boldsymbol{\Phi}^{(k)}_{f}$ is a time-invariant SCM estimated by using the T-F mask, and $(\cdot)^{\mathsf{H}}$ is the Hermitian transpose. Based on the estimated time-varying SCMs, each speech signal is estimated by the MWF.
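As a rough illustration of Eqs. (8)-(9), the sketch below builds time-varying SCMs from a mask and a PSD by mask-weighted averaging over time; the exact normalization used in [14] may differ, and all names are illustrative.

```python
import numpy as np

def time_varying_scm(X, mask, psd):
    """Time-varying SCM of one source from a T-F mask and PSD (cf. Eqs. (8)-(9)).

    X    : (T, F, C) multi-channel observed STFT.
    mask : (T, F)    estimated T-F mask of the source.
    psd  : (T, F)    estimated power spectral density of the source.
    Returns an array of shape (T, F, C, C).
    """
    # Time-invariant SCM per frequency: mask-weighted outer products (cf. Eq. (9)).
    outer = np.einsum('tfc,tfd->tfcd', X, X.conj())
    Phi = np.einsum('tf,tfcd->fcd', mask, outer) / (mask.sum(axis=0)[:, None, None] + 1e-8)
    # Scale by the time-varying PSD (cf. Eq. (8)).
    return psd[..., None, None] * Phi[None, ...]
```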

To train the DNN to estimate the T-F masks and power spectral densities, we proposed the following objective function [14]:

$\mathcal{L}_{\mathrm{MC}} = \sum_{t,f} \Big[ \big(\mathbf{s}_{t,f} - \hat{\mathbf{s}}_{t,f}\big)^{\mathsf{H}} \boldsymbol{\Sigma}_{t,f}^{-1} \big(\mathbf{s}_{t,f} - \hat{\mathbf{s}}_{t,f}\big) + \log\det\big(\pi \boldsymbol{\Sigma}_{t,f}\big) \Big],$   (10)
$\hat{\mathbf{s}}_{t,f} = \mathbf{W}_{t,f}\, \mathbf{x}_{t,f},$   (11)

where $\hat{\mathbf{s}}_{t,f}$ is the multi-channel signal estimated by the MWF, and $\boldsymbol{\Sigma}_{t,f}$ is the covariance calculated by Eq. (7). The minimization of this objective function corresponds to the maximization of the posterior distribution $p(\mathbf{s}_{t,f} \mid \mathbf{x}_{t,f})$. In other words, the objective function given in Eq. (10) evaluates the quality of the estimated multi-channel signal based on the statistical model of multi-channel signals. One clear advantage of this objective function is that the separated signal is evaluated directly, while conventional methods have set auxiliary targets, such as T-F masks, in their objective functions [4, 5]. The effectiveness of the multi-channel objective function has also been confirmed for various mask-based beamformers [22].
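Assuming that Eq. (10) is the negative log of the Gaussian posterior, as stated above, a per-bin version could be sketched as follows; this is an interpretation for illustration, not the paper's reference implementation.

```python
import numpy as np

def mc_posterior_loss(s_clean, s_hat, Sigma, eps=1e-8):
    """Negative log Gaussian posterior for one T-F bin (cf. Eq. (10)).

    s_clean, s_hat : (C,)   clean and MWF-estimated multi-channel coefficients.
    Sigma          : (C, C) posterior covariance from Eq. (7).
    The constant term C*log(pi) is omitted since it does not affect training.
    """
    e = s_clean - s_hat
    Sigma = Sigma + eps * np.eye(Sigma.shape[0])            # numerical regularization
    mahal = np.real(e.conj() @ np.linalg.solve(Sigma, e))   # Mahalanobis-type term
    _, logdet = np.linalg.slogdet(Sigma)                    # log-determinant term
    return mahal + logdet
```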

2.3 STFT consistency

It is known that spectrograms calculated by the STFT satisfy relations between neighboring T-F bins, and such spectrograms are called consistent spectrograms [15, 16, 17]. A consistent spectrogram $\mathbf{S}$ satisfies

$\mathbf{S} = \mathcal{P}(\mathbf{S}), \qquad \mathcal{P} = \mathcal{G} \circ \mathcal{G}^{\dagger},$   (12)

where $\mathcal{G}$ is the STFT, $\mathcal{G}^{\dagger}$ is the iSTFT, and $\mathcal{P}$ is the projection onto the set of consistent spectrograms. When speech enhancement is conducted in the T-F domain, the consistency of the estimated spectrogram is not guaranteed. In such a case, the spectrogram calculated by applying the STFT to the reconstructed time-domain signal differs from the estimated spectrogram. In DNN-based speech enhancement, this discrepancy means that objective functions defined in the T-F domain do not evaluate the estimated time-domain speech properly. Some studies have addressed this problem because the discrepancy degrades the performance of T-F masking [18, 19]. For instance, [18] presented the wave approximation (WA), which evaluates the estimated signal in the time domain, and [19] proposed to evaluate a spectrogram projected onto the set of consistent spectrograms. Although these studies showed the importance of consistency in monaural speech enhancement and separation, it has not been explicitly considered in multi-channel speech enhancement.
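A minimal sketch of the projection in Eq. (12) using SciPy's STFT/iSTFT; the window and hop sizes are arbitrary illustration values.

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_projection(S, nperseg=512, noverlap=384):
    """Project a spectrogram onto the set of consistent spectrograms,
    i.e., P(S) = STFT(iSTFT(S)) as in Eq. (12)."""
    _, x = istft(S, nperseg=nperseg, noverlap=noverlap)
    _, _, S_proj = stft(x, nperseg=nperseg, noverlap=noverlap)
    return S_proj

# Inconsistency of a modified spectrogram (e.g., after T-F masking):
# the residual is zero only if S is already the STFT of some time-domain signal.
rng = np.random.default_rng(0)
S = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
residual = np.linalg.norm(S - consistency_projection(S))
```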

3 Proposed multi-channel speech enhancement system

In this section, we propose a DNN-based multi-channel speech enhancement system in which the objective function is computed on the estimated time-domain signal, as illustrated in Fig. 1. In the proposed system, multi-channel speech enhancement is conducted by the DNN-based MWF described in Section 2.2. Then, the result of the MWF is converted back to the time domain by the iSTFT and passed to the objective function. In Section 3.1, we extend WA to the proposed system; this objective function is defined in the time domain. Section 3.2 describes another objective function calculated as the sum of the original multi-channel objective function [14] and a consistency-aware objective function defined in the T-F domain. Both proposed objective functions are summarized in Fig. 2.

Figure 2: Illustration of proposed objective functions.

3.1 Multi-channel wave approximation (MWA)

The multi-channel objective function given in Eq. (10) is computed on the estimated spectrogram in the T-F domain, and it does not consider the reconstruction error due to the inconsistency of the estimated spectrogram. To address this problem, we propose the multi-channel wave approximation (MWA), which is computed on the reconstructed time-domain signal as illustrated in Fig. 2. It is formulated as the sum of WA over the channels:

$\mathcal{L}_{\mathrm{MWA}} = \sum_{c=1}^{C} \big\| \mathcal{G}^{\dagger}(\mathbf{S}_{c}) - \mathcal{G}^{\dagger}(\hat{\mathbf{S}}_{c}) \big\|_{1},$   (13)

where $\|\cdot\|_{1}$ is the $\ell_1$ norm used in WA [18], and $\mathbf{S}_{c}$ and $\hat{\mathbf{S}}_{c}$ are the clean and estimated spectrograms at the $c$-th channel, respectively. MWA trains the DNN to maximize the quality of the reconstructed time-domain signal, while the original objective function given by Eq. (10) focuses on the estimated spectrogram, which may be inconsistent. Recently, WA has achieved promising results in monaural speech enhancement and separation [18]. The proposed MWA is a simple extension of WA to the multi-channel case.
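A minimal sketch of Eq. (13), assuming the ℓ1 norm of WA [18] and using the off-line SciPy iSTFT for illustration; in training, a differentiable STFT implementation would be used instead, and the STFT parameters here are arbitrary.

```python
import numpy as np
from scipy.signal import istft

def mwa_loss(S_clean, S_hat, nperseg=512, noverlap=384):
    """Multi-channel wave approximation (cf. Eq. (13)): the sum over channels
    of the l1 distance between clean and enhanced signals after the iSTFT.

    S_clean, S_hat : (C, F, T) clean and estimated multi-channel spectrograms.
    """
    loss = 0.0
    for c in range(S_clean.shape[0]):                      # loop over channels
        _, y_clean = istft(S_clean[c], nperseg=nperseg, noverlap=noverlap)
        _, y_hat = istft(S_hat[c], nperseg=nperseg, noverlap=noverlap)
        loss += np.abs(y_clean - y_hat).sum()              # l1 norm in the time domain
    return loss
```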

3.2 Consistency-aware multi-channel objective function

We propose another consistency-aware multi-channel objective function as the sum of the original multi-channel objective function given in Eq. (10) and a consistency-aware term:

$\mathcal{L}_{\mathrm{MC+C}} = \mathcal{L}_{\mathrm{MC}} + \lambda \sum_{c=1}^{C} \big\| \mathbf{S}_{c} - \mathcal{P}(\hat{\mathbf{S}}_{c}) \big\|_{F}^{2},$   (14)

where $\|\cdot\|_{F}$ is the Frobenius norm, and $\lambda$ is a hyperparameter for balancing the two terms. In the second term, the estimated spectrogram is projected onto the set of consistent spectrograms, and then the distance to the clean spectrogram is calculated. This projection enables the objective function to consider the reconstruction error due to the inconsistency. In other words, the second term corresponds to evaluating the estimated time-domain signal in the T-F domain by recomputing the STFT. Note that the second term in Eq. (14) does not have any known statistical meaning, while the first term is based on a statistical model of multi-channel signals. Evaluating the posterior distribution with the consistency projection incorporated is left for future work.
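A sketch of Eq. (14) under the same illustrative STFT settings as above; `lam` stands for the hyperparameter λ, whose value is not reproduced here, and the multi-channel term `L_mc` is assumed to be computed as in Section 2.2.

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_term(S_clean, S_hat, nperseg=512, noverlap=384):
    """Second term of Eq. (14): squared Frobenius distance between the clean
    spectrogram and the consistency projection of the estimate, summed over channels."""
    total = 0.0
    for c in range(S_hat.shape[0]):
        _, x = istft(S_hat[c], nperseg=nperseg, noverlap=noverlap)     # back to time domain
        _, _, S_proj = stft(x, nperseg=nperseg, noverlap=noverlap)     # re-apply STFT
        total += np.sum(np.abs(S_clean[c] - S_proj) ** 2)
    return total

def combined_loss(L_mc, S_clean, S_hat, lam=1.0):
    """Eq. (14): the original multi-channel loss plus the weighted consistency term."""
    return L_mc + lam * consistency_term(S_clean, S_hat)
```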

The proposed objective functions are summarized in Fig. 2. The first proposed objective function, given in Eq. (13), is defined between the clean and estimated time-domain signals. In contrast, the second one, given in Eq. (14), considers both the estimated spectrogram and the STFT of the reconstructed time-domain signal. In our preliminary experiment, this combination achieved better performance than using the second term alone. The difference in the domains of the proposed objective functions affects the enhanced signal, as shown in the following experiment.

4 Experiments and results

To confirm the effectiveness of the proposed system, an experiment on multi-channel speech enhancement under diffuse noise was conducted. The DNN-based MWFs were compared with various baseline methods, including T-F masking and MB. In the following subsections, the DNN-based MWFs using DNNs trained with the proposed objective functions given by Eqs. (13) and (14) are referred to as Prop. 1 and Prop. 2, respectively.

                                         SNR =  dB                 SNR =  dB                 SNR =  dB
Approach                        SDR [dB]  CD [dB]  PESQ   SDR [dB]  CD [dB]  PESQ   SDR [dB]  CD [dB]  PESQ

Reverberation time:  ms
  Observed                         1.14     5.26   1.14      6.93     4.67   1.33     13.21     3.74   1.77
  T-F masking        PSA           9.15     4.42   1.61     13.46     3.70   2.00     18.07     2.95   2.59
                     PSA+Proj      9.36     4.53   1.62     13.76     3.80   2.03     18.40     3.08   2.64
                     WA            9.60     4.65   1.63     14.06     3.90   2.11     18.69     3.15   2.73
  Spatial filtering  MB            5.42     4.88   1.28     11.54     4.22   1.65     16.85     3.24   2.27
                     Original     10.48     4.41   1.77     14.90     3.62   2.26     19.20     2.71   2.81
                     Prop. 1      11.23     4.73   1.73     15.57     3.97   2.23     19.75     3.15   2.84
                     Prop. 2      10.72     4.38   1.82     15.01     3.60   2.30     19.39     2.68   2.85

Reverberation time:  ms
  Observed                         0.84     5.28   1.12      6.77     4.58   1.33     12.87     3.77   1.75
  T-F masking        PSA           9.11     4.45   1.60     13.38     3.65   2.03     17.90     2.96   2.57
                     PSA+Proj      9.27     4.53   1.60     13.66     3.75   2.05     18.22     3.06   2.62
                     WA            9.56     4.65   1.62     13.99     3.85   2.15     18.54     3.14   2.75
  Spatial filtering  MB            5.20     4.90   1.26     11.07     4.15   1.62     16.34     3.32   2.19
                     Original     10.38     4.42   1.76     14.50     3.58   2.23     18.82     2.76   2.77
                     Prop. 1      11.23     4.73   1.71     15.32     3.92   2.25     19.56     3.19   2.79
                     Prop. 2      10.67     4.40   1.80     14.71     3.57   2.29     18.98     2.74   2.80

Table 1: Results of speech enhancement.

4.1 Experimental setup

Figure 3: Network architecture used in the experiment. Only the colored blocks contain trainable parameters. In T-F masking and MB, only the T-F mask of speech was used.

4.1.1 Dataset

In both training and testing, clean speech from the TIMIT corpus [23] and noise from the Diverse Environments Multi-channel Acoustic Noise Database (DEMAND) [24] were used. The measured impulse responses of the Multichannel Impulse Response Database (MIRD) [25] were convolved with the above dry sources, where the 1st channel of the noise in DEMAND was used as the dry source. The distance between the speaker and the microphones was set to  m, and the azimuth of each talker was randomly selected from  points (from  to  with intervals of ). On the other hand, diffuse noise was generated by playing noise from all points. Note that the noise played at each point was obtained by splitting the original noise into periods: the first half was used for training/validation, and the other half was used for testing. The number of microphones was , and the distance between adjacent microphones was set to  cm.

A training set of  speech files was randomly selected from the training set of TIMIT, and the remaining files were used as a validation set. Since the number of noise recordings was small, we performed data augmentation: the diffuse noise was augmented by taking convex combinations of two noises randomly selected from DEMAND, $\tilde{\mathbf{n}} = \alpha\, \mathbf{n}_{1} + (1 - \alpha)\, \mathbf{n}_{2}$, where $\alpha$ is randomly generated from a Beta distribution (see the sketch at the end of this subsection). The signal-to-noise ratio (SNR) of the training/validation set was adjusted from  to  dB. During training, the reverberation time was  ms. For testing, utterances randomly selected from the testing set of TIMIT were used as clean speech, and the later periods of the noise were used. We evaluated under two reverberation conditions:  ms and  ms. All speech was sampled at  kHz, and the STFT was computed using a Hann window of  ms length with a  ms shift.
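A minimal sketch of the convex-combination noise augmentation described above; the Beta-distribution parameters are illustrative assumptions, as is the excerpt length.

```python
import numpy as np

def augment_diffuse_noise(n1, n2, rng, a=2.0, b=2.0):
    """Convex combination of two DEMAND noise excerpts with a Beta-distributed
    mixing weight (the Beta parameters here are illustrative)."""
    alpha = rng.beta(a, b)
    return alpha * n1 + (1.0 - alpha) * n2

rng = np.random.default_rng(0)
n1, n2 = rng.standard_normal(16000), rng.standard_normal(16000)  # stand-ins for noise excerpts
n_aug = augment_diffuse_noise(n1, n2, rng)
```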

4.1.2 Baseline methods

We compared the proposed methods with the following baseline methods. First, T-F masking was used as a well-known monaural speech enhancement approach. To confirm the effectiveness of considering consistency, three objective functions [the phase-sensitive approximation (PSA) [26], PSA with the consistency projection (PSA+Proj) [19], and WA [18]] were compared. MB [5] was also evaluated, using a DNN trained with PSA. Although several iterative methods using DNNs have been proposed for multi-channel source separation [4, 27, 28], we compared the proposed system only with the aforementioned non-iterative methods because the proposed system is non-iterative. Its performance could be further improved by combining it with iterative methods.

4.1.3 DNN architecture and setup

In all methods, including T-F masking, the input feature was the concatenation of an amplitude feature and phase-difference features. The amplitude feature was calculated by

$\phi^{\mathrm{A}}_{t,f} = \mathrm{MVN}\big(\log |x^{(1)}_{t,f}|\big),$   (15)

where $x^{(1)}_{t,f}$ is the observation at the reference channel, and $\mathrm{MVN}(\cdot)$ is the utterance-level mean and variance normalization. As in a previous study [29], the phase difference between two microphones was also used as an input feature:

$\phi^{\cos}_{t,f} = \cos\big(\angle x^{(p)}_{t,f} - \angle x^{(q)}_{t,f}\big),$   (16)
$\phi^{\sin}_{t,f} = \sin\big(\angle x^{(p)}_{t,f} - \angle x^{(q)}_{t,f}\big),$   (17)

where $\angle(\cdot)$ is the complex argument, and $(p, q)$ denotes a microphone pair.
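A sketch of the input features under the assumptions made in Eqs. (15)-(17) above (MVN of the log-magnitude at a reference channel, plus cosine/sine of the inter-channel phase difference); the paper's exact feature definitions may differ, and all names are illustrative.

```python
import numpy as np

def input_features(X, ref=0, pair=(0, 1), eps=1e-8):
    """Illustrative input features for the mask/PSD estimation DNN.

    X : (C, T, F) multi-channel complex STFT.
    Returns an array of shape (T, F, 3): normalized log-magnitude and cos/sin IPD.
    """
    log_mag = np.log(np.abs(X[ref]) + eps)
    amp = (log_mag - log_mag.mean()) / (log_mag.std() + eps)    # utterance-level MVN
    ipd = np.angle(X[pair[0]]) - np.angle(X[pair[1]])           # inter-channel phase difference
    return np.stack([amp, np.cos(ipd), np.sin(ipd)], axis=-1)
```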

The DNN for the proposed methods is illustrated in Fig. 3; it contains two bidirectional long short-term memory (BLSTM) layers and dense layers. Dropout was applied to each BLSTM layer and each dense layer except the last layers. The networks were trained on fixed-length frame segments using the Adam optimizer. The learning rate started from a fixed initial value and was decayed by a constant factor when the objective function on the validation set did not decrease for consecutive epochs. In Prop. 2, the hyperparameter $\lambda$ in Eq. (14) was set to a fixed value. In the baseline methods, we used only the T-F mask estimation part of the DNN illustrated in Fig. 3.

Note that all systems were implemented in TensorFlow, in which the STFT and iSTFT are implemented together with their backpropagation. In addition, TensorFlow supports many complex-valued operations and their derivatives. Hence, the MWF can easily be applied during training.
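For illustration, a minimal TensorFlow snippet of the differentiable STFT/iSTFT round trip that such training relies on; the frame parameters and the loss are placeholders, and the sigmoid mask merely stands in for the DNN output.

```python
import tensorflow as tf

frame_length, frame_step = 512, 128
x = tf.random.normal([1, 16000])                        # batch of waveforms (placeholder)

with tf.GradientTape() as tape:
    tape.watch(x)
    S = tf.signal.stft(x, frame_length, frame_step)     # complex spectrogram
    mask = tf.sigmoid(tf.math.real(S))                  # stand-in for a DNN-estimated mask
    y = tf.signal.inverse_stft(
        tf.cast(mask, tf.complex64) * S, frame_length, frame_step,
        window_fn=tf.signal.inverse_stft_window_fn(frame_step))
    loss = tf.reduce_mean(tf.abs(y))                     # placeholder time-domain loss

grad = tape.gradient(loss, x)                            # gradients flow through STFT/iSTFT
```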

4.2 Experimental results

The performance of multi-channel speech enhancement was evaluated by the signal-to-distortion ratio (SDR), cepstrum distortion (CD), and PESQ. The experimental results are summarized in Table 1, in which the bold font of the original table represents the best score in each condition. Under both reverberation conditions, MB resulted in the lowest performance among the enhancement methods because it does not consider the non-stationary characteristics of speech. In T-F masking, the consistency-aware methods, PSA+Proj and WA, outperformed the original PSA in terms of SDR and PESQ. These results confirm the importance of consistency.

The DNN-based MWF with the original multi-channel objective function (Original) [14] outperformed the other conventional methods. Furthermore, the DNN-based MWF with the proposed MWA, Prop. 1, significantly improved the SDR. Meanwhile, by using the consistency-aware multi-channel objective function given in Eq. (14), Prop. 2 outperformed the original DNN-based MWF not only in SDR but also in CD and PESQ. We stress that the only difference among the three DNN-based MWFs is the objective function, and thus the computational cost of inference is identical.

5 Conclusion

In this paper, we described a DNN-based multi-channel speech enhancement system in which the DNN is trained to maximize the quality of the time-domain signal estimated by the DNN-based MWF. We further proposed two objective functions defined on the enhanced time-domain signal. Our experimental results confirmed the effectiveness of the DNN-based MWF and the proposed objective functions for multi-channel speech enhancement. Future work includes combining the proposed system with iterative algorithms.

References

  • [1] P. C. Loizou, Speech Enhancement: Theory and Practice, Second Edition, CRC Press, Inc., 2nd edition, Feb. 2013.
  • [2] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, Handbook on Array Processing and Sensor Network, chapter Acoustic Beamforming for Hearing Aid Applications, pp. 269–302, Wiley Online Library, Jan. 2010.
  • [3] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 25, no. 4, pp. 692–730, Apr. 2017.
  • [4] S. Sivasankaran, A. A. Nugraha, E. Vincent, J. A. Morales-Cordovilla, S. Dalmia, I. Illina, and A. Liutkus, “Robust ASR using neural network based speech enhancement and feature simulation,” in IEEE Workshop Autom. Speech Recognit. Underst. (ASRU), Dec. 2015, pp. 482–489.
  • [5] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 196–200.
  • [6] H. Erdogan, J. R. Hershey, S. Watanabe, M. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in INTERSPEECH, Sept. 2016, pp. 1981–1985.
  • [7] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep beamforming networks for multi-channel speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2016, pp. 5745–5749.
  • [8] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 4, pp. 780–793, Apr. 2017.
  • [9] S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer, 2017.
  • [10] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 5739–5743.
  • [11] Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, and John R. Hershey, “Multichannel end-to-end speech recognition,” in Int. Conf. Mach. Learn. (ICML), Aug. 2017, pp. 2632–2641.
  • [12] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), 2017, pp. 5325–5329.
  • [13] Z. Wang and D. Wang, “All-neural multi-channel speech enhancement,” in Interspeech, Sept. 2018, pp. 3234–3238.
  • [14] M. Togami, “Multi-channel Itakura Saito distance minimization with deep neural network,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 536–540.
  • [15] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
  • [16] J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction,” in ISCA Workshop Stat. Percept. Audit. (SAPA), Sept. 2008, pp. 23–28.
  • [17] Y. Masuyama, K. Yatabe, and Y. Oikawa, “Griffin–Lim like phase recovery via alternating direction method of multipliers,” IEEE Signal Process. Lett., vol. 26, no. 1, pp. 184–188, Jan. 2019.
  • [18] Z. Wang, J. Le Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Interspeech, Sept. 2018, pp. 2708–2712.
  • [19] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, “Differentiable consistency constraints for improved deep speech enhancement,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 900–904.
  • [20] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 5, pp. 960–971, May 2019.
  • [21] N. Q. K. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830–1840, Sept. 2010.
  • [22] Y. Masuyama, M. Togami, and T. Komatsu, “Multichannel loss function for supervised speech source separation by mask-based beamforming,” in Interspeech, Sept. 2019.
  • [23] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” 1993.
  • [24] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings,” J. Acoust. Soc. Am., vol. 133, no. 5, pp. 3591–3591, 2013.
  • [25] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2014, pp. 31–317.
  • [26] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2015, pp. 708–712.
  • [27] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, pp. 1652–1664, Sept. 2016.
  • [28] N. Makishima, S. Mogami, N. Takamune, D. Kitamura, H. Sumino, S. Takamichi, H. Saruwatari, and N. Ono, “Independent deeply learned matrix analysis for determined audio source separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 10, pp. 1601–1615, Oct. 2019.
  • [29] Z. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE Int. Conf. on Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 1–5.