I Introduction
Multichannel speech enhancement using a microphone array plays a vital role in distant automatic speech recognition (ASR) in noisy environments. A standard approach to multichannel speech enhancement is beamforming [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Given the spatial information of speech and noise, we can emphasize the speech coming from one direction and suppress the noise coming from the other directions [11, 12, 13, 14, 15]. This approach was empirically shown to achieve significant improvements in ASR performance in the CHiME Challenge [16, 17, 18]. There are many variants of beamforming, such as multichannel Wiener filtering (MWF) [11, 12], minimum variance distortionless response (MVDR) beamforming [13], generalized sidelobe cancelling (GSC) [14], and generalized eigenvalue (GEV) beamforming [15], all of which are performed in the time-frequency (TF) domain.

To calculate demixing filters for beamforming, the steering vector of speech and the spatial covariance matrix (SCM) of noise should be estimated. The steered response power phase transform (SRP-PHAT) [19] and the weighted delay-and-sum (DS) beamforming [20] are not sufficiently robust to real environments [16]. Recently, estimation of TF masks has actively been investigated [1, 2, 3, 4, 5, 6, 7, 8], assuming that each TF bin of an observed noisy speech spectrogram is classified as speech or noise. The SCMs of speech and noise are then calculated from the classified TF bins. The steering vector of the target speech is obtained as the principal component of the SCM of the speech [1, 3, 2].
For such binary classification, an unsupervised method based on complex Gaussian mixture models (CGMMs) [1] and supervised methods based on deep neural networks (DNNs) [3, 4, 2, 7, 8, 5, 6] have been proposed.

Although DNN-based beamforming works well in controlled experimental environments, it has two major problems in real environments. One problem is that the ASR performance in unknown environments is often considerably degraded by overfitting to training data consisting of many pairs of noisy speech spectrograms and ideal binary masks (IBMs) of speech. Although multi-condition training with various kinds of noisy environments mitigates this problem [21], it is still an open question whether DNN-based beamforming works when a microphone array with a different geometry and different frequency characteristics is used in unseen noisy environments. The other problem is that spatial features such as inter-channel level and phase differences (ILDs and IPDs), which play an essential role in conventional multichannel audio signal processing, are simply input to DNNs without considering their physical meanings and generative processes.
To solve these problems, we recently proposed an unsupervised method of speech enhancement [22] based on several types of beamforming using the SCMs of speech and noise estimated by a blind source separation (BSS) method called multichannel nonnegative matrix factorization (MNMF) [23] (Fig. 1). Given the multichannel complex spectrograms of mixture signals, MNMF can estimate the SCMs of multiple sources (i.e., speech and multiple noise sources) while representing the nonnegative power spectrogram of each source as the product of two nonnegative matrices corresponding to a set of basis spectra and a set of temporal activations. The SCMs of speech and noise estimated by MNMF, which decomposes each TF bin into the sum of speech and noise, are expected to be more accurate than those estimated by a CGMM [1] or a DNN [3, 4, 5, 6, 2, 7, 8], which classifies each TF bin as speech or noise. The unsupervised speech enhancement method is also expected to work well even in unknown environments for which no matched training data are available. In this paper, we newly propose an online extension of MNMF-informed beamforming that can process the observed mixture signals in a streaming manner.
The main contribution of this paper is to describe the complete formulation of the proposed MNMF-informed beamforming and to report comprehensive comparative experiments. More specifically, we test the combination of MNMF with various types of beamforming, i.e., the time-variant and time-invariant versions of full-rank MWF [11], rank-1 MWF [12], and MVDR beamforming [13], using the CHiME-3 data and a real internal test set. These variants are compared with state-of-the-art methods of DNN-based beamforming [3] using phase-aware acoustic features [4, 5, 6] and cost functions [24]. In addition, we evaluate the performance of the online extension of the proposed method.
The rest of this paper is organized as follows. Section II describes related work on multichannel speech enhancement. Section III and Section IV explain three types of beamforming (full-rank MWF, rank-1 MWF, and MVDR beamforming) and MNMF, respectively. Section V explains the proposed method of MNMF-informed beamforming. Section VI reports comparative experiments, and Section VII concludes the paper.
II Related Work
We review non-blind beamforming methods based on the steering vector of speech and the SCM of noise for ASR in noisy environments. We also review BSS methods, including several variants of MNMF.
II-A Beamforming Methods
There are several variants of beamforming, such as DS beamforming [20], MVDR beamforming [13], GEV beamforming [15], and MWF [11, 12]. DS beamforming [20] uses only the steering vector of the target speech, and the other methods additionally use the SCM of noise. GEV beamforming aims to maximize the signal-to-noise ratio (SNR) [15] without putting any assumptions on the acoustic transfer function from the speaker to the array or on the SCM of the noise. MVDR beamforming and MWF, on the other hand, assume that the time-frequency (TF) bins of speech and noise spectrograms are distributed according to complex Gaussian distributions [13, 11, 12]. In Section III, we review the relationships between MVDR beamforming and the rank-1 and full-rank MWF in terms of the propagation process and the filter estimation strategy.

TF mask estimation has actively been studied for computing the SCMs of speech and noise [1, 2, 3, 4, 7, 8, 5, 6]. Our unsupervised method differs from DNN-based mask estimation [2, 3, 4, 7, 8, 5, 6] in two ways. First, our method decomposes each TF bin into the sum of speech and noise, whereas the mask-based methods calculate the SCM of speech from noisy TF bins without any decomposition. Second, our method uses no training data, whereas the DNN-based methods generally need a sufficient number of pairs of noisy spectrograms and ideal binary masks (IBMs). The performance of DNN-based mask estimation would be degraded in unseen conditions that are not covered by the training data because of overfitting.
The major limitation of most DNN-based methods is that only single-channel magnitude spectrograms are used for mask estimation, discarding spatial information such as ILDs and IPDs. Recently, Wang et al. [6] and Pertilä [5] have investigated the use of ILDs and IPDs as acoustic features for mask estimation. Erdogan et al. [24] proposed a method for estimating a phase-sensitive filter in single-channel speech enhancement. For comparative evaluation, inspired by these state-of-the-art methods, we use both spatial and magnitude features for DNN-based multichannel mask estimation.
II-B Multichannel Nonnegative Matrix Factorization
Multichannel extensions of NMF [25, 26, 27, 23, 28, 29] represent the complex spectrograms of multichannel mixture signals by using the SCMs and low-rank power spectrograms of multiple source signals. Ozerov et al. [26] pioneered the use of NMF for multichannel source separation, where the SCMs are restricted to rank-1 matrices and a cost function based on the Itakura-Saito (IS) divergence is minimized. This model was extended to have full-rank SCMs [27]. Sawada et al. [23] introduced partitioning parameters to obtain a set of basis spectra shared by all sources and derived a majorization-minimization (MM) algorithm. Nikunen and Virtanen [28] proposed a similar model that represents the SCM of each source as the weighted sum of direction-dependent SCMs. While these methods can be used in an underdetermined case, Kitamura et al. [29] proposed independent low-rank matrix analysis (ILRMA) for a determined case by restricting the SCMs of [23] to rank-1 matrices. This can be viewed as a unified model of NMF and independent vector analysis (IVA) and is robust to initialization.
III Beamforming Methods
This section introduces three major methods of beamforming: the full-rank and rank-1 versions of multichannel Wiener filtering (MWF) and minimum variance distortionless response (MVDR) beamforming (Table I) [11].
       Speech: Full-rank     Speech: Rank-1
       Noise: Full-rank      Noise: Full-rank
MAP    Full-rank MWF         Rank-1 MWF
       Eqs. (8) & (9)        Eqs. (12) & (13)
ML     --                    MVDR
                             Eqs. (15) & (16)
III-A Overview
The goal of beamforming is to extract a source signal of interest from a mixture signal in the short-time Fourier transform (STFT) domain. Let $\mathbf{x}_{ft} \in \mathbb{C}^M$ be the multichannel complex spectrum of the mixture at frequency $f$ and frame $t$ recorded by $M$ microphones, which is assumed to be given by

(1)  $\mathbf{x}_{ft} = \mathbf{s}_{ft} + \mathbf{n}_{ft}$,

where $\mathbf{s}_{ft}$ and $\mathbf{n}_{ft}$ are the multichannel complex spectra of speech and noise (called images), respectively. The notations are listed in Table II. The goal is to estimate a linear demixing filter $\mathbf{w}_{ft} \in \mathbb{C}^M$ that obtains an estimate $\hat{s}_{ft}$ of the target speech from the mixture (speech + noise) as follows:

(2)  $\hat{s}_{ft} = \mathbf{w}_{ft}^{\mathsf{H}} \mathbf{x}_{ft}$.
As shown in Table I, the beamforming methods can be categorized in terms of sound propagation processes.

- The full-rank propagation process considers the various propagation paths caused by reflection and reverberation. It is thus represented by an $M \times M$ full-rank SCM for each source.

- The rank-1 propagation process considers only the direct path from each sound source to the microphones. It is thus represented by an $M$-dimensional steering vector for each source.

The full-rank propagation process reduces to the rank-1 propagation process when the full-rank SCM is restricted to a rank-1 matrix whose principal eigenvector is equal to the steering vector.

The beamforming methods can also be categorized in terms of estimation strategies.

- The maximum a posteriori (MAP) estimation assumes the target speech spectra to be complex Gaussian distributed.

- The maximum likelihood (ML) estimation uses no prior knowledge about the target speech spectra.
                           Observation          Speech               Noise
Multichannel spectrum      $\mathbf{x}_{ft}$    $\mathbf{s}_{ft}$    $\mathbf{n}_{ft}$
Steering vector            --                   $\mathbf{b}_{f}$     --
Spatial covariance matrix  $\mathbf{G}^{s}_{ft} + \mathbf{G}^{n}_{ft}$  $\mathbf{G}^{s}_{ft}$  $\mathbf{G}^{n}_{ft}$
III-B Full-Rank Multichannel Wiener Filtering
The full-rank MWF [30] assumes both the target speech and the noise to follow multivariate circularly-symmetric complex Gaussian distributions as follows:

(3)  $\mathbf{s}_{ft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{G}^{s}_{ft})$,
(4)  $\mathbf{n}_{ft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{G}^{n}_{ft})$,

where $\mathbf{G}^{s}_{ft} \in \mathbb{S}^{M}_{+}$ and $\mathbf{G}^{n}_{ft} \in \mathbb{S}^{M}_{+}$ are the full-rank SCMs of the speech and noise at frequency $f$ and time $t$, respectively, and $\mathbb{S}^{M}_{+}$ indicates the set of $M \times M$ Hermitian positive definite matrices. Using the reproducible property of the Gaussian distribution, we have

(5)  $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \mathbf{G}^{s}_{ft} + \mathbf{G}^{n}_{ft})$.

Given the mixture $\mathbf{x}_{ft}$, the posterior distribution of the multichannel speech image $\mathbf{s}_{ft}$ is obtained as follows:

(6)  $\mathbf{s}_{ft} \mid \mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}(\boldsymbol{\mu}_{ft}, \boldsymbol{\Sigma}_{ft})$,
(7)  $\boldsymbol{\mu}_{ft} = \mathbf{G}^{s}_{ft} (\mathbf{G}^{s}_{ft} + \mathbf{G}^{n}_{ft})^{-1} \mathbf{x}_{ft}, \quad \boldsymbol{\Sigma}_{ft} = \mathbf{G}^{s}_{ft} - \mathbf{G}^{s}_{ft} (\mathbf{G}^{s}_{ft} + \mathbf{G}^{n}_{ft})^{-1} \mathbf{G}^{s}_{ft}$.
To obtain a monaural estimate of the speech, it is necessary to choose a reference channel (dimension) $u$ from the MAP estimate $\boldsymbol{\mu}_{ft}$ of the speech image. The time-variant demixing filter is thus given by

(8)  $\mathbf{w}_{ft} = (\mathbf{G}^{s}_{ft} + \mathbf{G}^{n}_{ft})^{-1} \mathbf{G}^{s}_{ft} \mathbf{e}_{u}$,

where $\mathbf{e}_{u}$ is the $M$-dimensional one-hot vector that takes $1$ in dimension $u$. If the speaker does not move and the noise is stationary, $\mathbf{G}^{s}_{ft}$ and $\mathbf{G}^{n}_{ft}$ are often assumed to be time-invariant, i.e., $\mathbf{G}^{s}_{ft} = \mathbf{G}^{s}_{f}$ and $\mathbf{G}^{n}_{ft} = \mathbf{G}^{n}_{f}$. In this case, the time-invariant demixing filter is given by

(9)  $\mathbf{w}_{f} = (\mathbf{G}^{s}_{f} + \mathbf{G}^{n}_{f})^{-1} \mathbf{G}^{s}_{f} \mathbf{e}_{u}$.
In reality, the speech is not stationary, but such time-invariant linear filtering is known to be effective for speech enhancement with small distortion. In general, the enhanced speech signals obtained by the time-variant filter tend to be more distorted than those obtained by the time-invariant filter.
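As a concrete illustration, the time-invariant filter of Eq. (9) amounts to one linear solve per frequency bin. A minimal numpy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def fullrank_mwf_filter(G_s, G_n, ref=0):
    # Time-invariant full-rank MWF of Eq. (9):
    # w = (G_s + G_n)^{-1} G_s e_u, where e_u is the one-hot
    # vector selecting the reference channel u.
    M = G_s.shape[0]
    e_u = np.zeros(M)
    e_u[ref] = 1.0
    return np.linalg.solve(G_s + G_n, G_s @ e_u)

def apply_filter(w, x):
    # Eq. (2): the monaural estimate is w^H x for one TF bin.
    return np.conj(w) @ x
```

In practice this is evaluated once per frequency bin $f$ with the SCMs estimated for that bin.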
III-C Rank-1 Multichannel Wiener Filtering
The rank-1 MWF [32] is obtained as a special case of the full-rank MWF when the spatial covariance matrix of the speech is restricted to a rank-1 matrix as follows:

(11)  $\mathbf{G}^{s}_{ft} = \lambda_{ft} \mathbf{b}_{f} \mathbf{b}_{f}^{\mathsf{H}}$,

where $\lambda_{ft} \ge 0$ and $\mathbf{b}_{f} \in \mathbb{C}^{M}$ are the power and steering vector of the speech at frequency $f$ and time $t$, respectively. Substituting Eq. (11) into Eq. (8) and using the Woodbury matrix identity, we obtain the time-variant demixing filter as follows:

(12)  $\mathbf{w}_{ft} = \dfrac{\lambda_{ft} (\mathbf{G}^{n}_{ft})^{-1} \mathbf{b}_{f} \mathbf{b}_{f}^{\mathsf{H}}}{1 + \lambda_{ft} \mathbf{b}_{f}^{\mathsf{H}} (\mathbf{G}^{n}_{ft})^{-1} \mathbf{b}_{f}} \mathbf{e}_{u}$.

In practice, to achieve reasonable performance, we assume the time-invariance of the speech, i.e., $\lambda_{ft} = \lambda_{f}$ and $\mathbf{G}^{n}_{ft} = \mathbf{G}^{n}_{f}$. Similarly, the time-invariant filter is given by

(13)  $\mathbf{w}_{f} = \dfrac{\lambda_{f} (\mathbf{G}^{n}_{f})^{-1} \mathbf{b}_{f} \mathbf{b}_{f}^{\mathsf{H}}}{1 + \lambda_{f} \mathbf{b}_{f}^{\mathsf{H}} (\mathbf{G}^{n}_{f})^{-1} \mathbf{b}_{f}} \mathbf{e}_{u}$.
Given the steering vector $\mathbf{b}_{f}$, the power spectral density $\lambda_{ft}$ in Eq. (12) or $\lambda_{f}$ in Eq. (13) can be estimated as follows:

(14)

where $\|\cdot\|_{F}$ represents the Frobenius norm of a matrix.
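After the Woodbury identity, the rank-1 filter of Eq. (12) is a closed form involving a single solve against the noise SCM. A numpy sketch (names ours):

```python
import numpy as np

def rank1_mwf_filter(lam, b, G_n, ref=0):
    # Rank-1 MWF of Eq. (12): substituting G_s = lam * b b^H into
    # Eq. (8) and applying the Woodbury identity gives
    # w = lam * G_n^{-1} b * conj(b[ref]) / (1 + lam * b^H G_n^{-1} b).
    Gn_inv_b = np.linalg.solve(G_n, b)
    denom = 1.0 + lam * np.real(np.vdot(b, Gn_inv_b))
    return lam * Gn_inv_b * np.conj(b[ref]) / denom
```

With $\mathbf{G}^{s} = \lambda \mathbf{b}\mathbf{b}^{\mathsf{H}}$, this reproduces the full-rank MWF filter of Eq. (8) exactly.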
III-D Minimum Variance Distortionless Response Beamforming
MVDR beamforming [13] can be derived as a special case of the rank-1 MWF when the power spectral density $\lambda_{ft}$ of the speech in Eq. (11) (the variance of the Gaussian distribution in Eq. (3)) approaches infinity, i.e., no assumption is put on the target speech. The time-variant and time-invariant demixing filters are given by

(15)  $\mathbf{w}_{ft} = \dfrac{(\mathbf{G}^{n}_{ft})^{-1} \mathbf{b}_{f} \mathbf{b}_{f}^{\mathsf{H}}}{\mathbf{b}_{f}^{\mathsf{H}} (\mathbf{G}^{n}_{ft})^{-1} \mathbf{b}_{f}} \mathbf{e}_{u}$,
(16)  $\mathbf{w}_{f} = \dfrac{(\mathbf{G}^{n}_{f})^{-1} \mathbf{b}_{f} \mathbf{b}_{f}^{\mathsf{H}}}{\mathbf{b}_{f}^{\mathsf{H}} (\mathbf{G}^{n}_{f})^{-1} \mathbf{b}_{f}} \mathbf{e}_{u}$.
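Letting $\lambda \to \infty$ in the rank-1 filter yields the MVDR solution of Eq. (16); a numpy sketch (names ours):

```python
import numpy as np

def mvdr_filter(b, G_n, ref=0):
    # Time-invariant MVDR of Eq. (16):
    # w = G_n^{-1} b * conj(b[ref]) / (b^H G_n^{-1} b),
    # i.e., the limit of the rank-1 MWF as the speech power grows.
    Gn_inv_b = np.linalg.solve(G_n, b)
    return Gn_inv_b * np.conj(b[ref]) / np.real(np.vdot(b, Gn_inv_b))
```

The distortionless property follows by construction: applying the filter to the steering direction, $\mathbf{w}^{\mathsf{H}} \mathbf{b} = b_u$, i.e., the speech component at the reference channel passes unchanged.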
IV Multichannel Nonnegative Matrix Factorization
This section introduces multichannel nonnegative matrix factorization (MNMF) [23]. In this paper, we assume that the observed noisy speech contains $N$ sound sources, one of which corresponds to the target speech; the other $N-1$ sources are regarded as noise. Let $M$ be the number of channels (microphones).
IV-A Probabilistic Formulation
We explain the generative process of the multichannel observations of noisy speech, $\mathbf{X} = \{\mathbf{x}_{ft}\}_{f,t=1}^{F,T}$, where $\mathbf{x}_{ft} \in \mathbb{C}^{M}$ is the multichannel complex spectrum of the mixture at frequency $f$ and time $t$. Let $s_{nft} \in \mathbb{C}$ be the single-channel complex spectrum of source $n$ at frequency $f$ and time $t$ and $\mathbf{x}_{nft} \in \mathbb{C}^{M}$ be the multichannel complex spectrum (image) of source $n$. If the sources do not move, we have

(17)  $\mathbf{x}_{nft} = \mathbf{b}_{nf} s_{nft}$,

where $\mathbf{b}_{nf} \in \mathbb{C}^{M}$ is the time-invariant steering vector of source $n$ at frequency $f$. Here, $s_{nft}$ is assumed to be circularly-symmetric complex Gaussian distributed as follows:

(18)  $s_{nft} \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_{nft})$,

where $\lambda_{nft} \ge 0$ is the power spectral density of source $n$ at frequency $f$ and time $t$. Using Eq. (17) and Eq. (18), $\mathbf{x}_{nft}$ can be said to be multivariate circularly-symmetric complex Gaussian distributed as follows:

(19)  $\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \lambda_{nft} \mathbf{G}_{nf})$,

where $\mathbf{G}_{nf} = \mathbf{b}_{nf} \mathbf{b}_{nf}^{\mathsf{H}}$ is the rank-1 SCM of source $n$ at frequency $f$. In MNMF, the rank-1 assumption on $\mathbf{G}_{nf}$ is relaxed to deal with the underdetermined condition of $M < N$ by allowing $\mathbf{G}_{nf}$ to be any full-rank positive definite matrix. Assuming the instantaneous mixing process (source additivity) in the frequency domain, we have

(20)  $\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{x}_{nft}$.

Using Eq. (19) and Eq. (20), the reproducible property of the Gaussian distribution leads to

(21)  $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \sum_{n=1}^{N} \lambda_{nft} \mathbf{G}_{nf}\right)$.

The nonnegative power spectral density $\lambda_{nft}$ of each source is assumed to be factorized in an NMF style as follows:

(22)  $\lambda_{nft} = \sum_{k=1}^{K} w_{nkf} h_{nkt}$,

where $K$ is the number of basis spectra, $w_{nkf} \ge 0$ is the power of basis $k$ of source $n$ at frequency $f$, and $h_{nkt} \ge 0$ is the activation of basis $k$ of source $n$ at time $t$. This naive model has $NK$ basis spectra in total. One possibility to reduce the number of parameters is to share the $K$ basis spectra between all sources as follows:

(23)  $\lambda_{nft} = \sum_{k=1}^{K} z_{nk} w_{kf} h_{kt}$,

where $z_{nk} \ge 0$ indicates the weight of basis $k$ in source $n$. Substituting Eq. (23) into Eq. (21), we obtain the probabilistic generative model of $\mathbf{X}$ as follows:

(24)  $\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} w_{kf} h_{kt} \mathbf{G}_{nf}\right)$.
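The generative model of Eqs. (23)-(24) is easy to instantiate: the source power spectral densities are a three-way product of $\mathbf{Z}$, $\mathbf{W}$, and $\mathbf{H}$, and the mixture covariance is a PSD-weighted sum of SCMs. A numpy sketch (array shapes and names ours):

```python
import numpy as np

def mnmf_model(Z, W, H, G):
    # Z: (N, K) basis weights, W: (K, F) basis spectra,
    # H: (K, T) activations, G: (N, F, M, M) full-rank SCMs.
    lam = np.einsum('nk,kf,kt->nft', Z, W, H)   # Eq. (23): source PSDs
    Y = np.einsum('nft,nfpq->ftpq', lam, G)     # covariance of x_ft in Eq. (24)
    return lam, Y
```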
IV-B Parameter Estimation
Given $\mathbf{X}$, our goal is to estimate $\mathbf{Z} = \{z_{nk}\}$, $\mathbf{W} = \{w_{kf}\}$, $\mathbf{H} = \{h_{kt}\}$, and $\mathbf{G} = \{\mathbf{G}_{nf}\}$ that maximize the likelihood obtained by multiplying Eq. (24) over all frequencies $f$ and times $t$. Let two positive definite matrices $\mathbf{X}_{ft}$ and $\mathbf{Y}_{ft}$ be as follows:

(25)  $\mathbf{X}_{ft} = \mathbf{x}_{ft} \mathbf{x}_{ft}^{\mathsf{H}}$,
(26)  $\mathbf{Y}_{ft} = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} w_{kf} h_{kt} \mathbf{G}_{nf}$.

The maximization of the likelihood function given by Eq. (24) is equivalent to the minimization of the log-determinant divergence between $\mathbf{X}_{ft}$ and $\mathbf{Y}_{ft}$ given by

(27)  $\mathcal{D}(\mathbf{X}_{ft} \mid \mathbf{Y}_{ft}) = \mathrm{tr}(\mathbf{X}_{ft} \mathbf{Y}_{ft}^{-1}) - \log \det (\mathbf{X}_{ft} \mathbf{Y}_{ft}^{-1}) - M$.

The total cost function to be minimized w.r.t. $\mathbf{Z}$, $\mathbf{W}$, $\mathbf{H}$, and $\mathbf{G}$ is thus given by

(28)  $\mathcal{C} = \sum_{f=1}^{F} \sum_{t=1}^{T} \mathcal{D}(\mathbf{X}_{ft} \mid \mathbf{Y}_{ft})$.
Since Eq. (28) is hard to minimize directly, a convergence-guaranteed MM algorithm was proposed (see [23] for the detailed derivation). The updating rules are given by

(29)
(30)
(31)
The SCM $\mathbf{G}_{nf}$ is obtained as the unique solution of a special case of the continuous-time algebraic Riccati equation. In the original study on MNMF [23], this equation was solved using an iterative optimization algorithm. In the field of information geometry, however, an analytical solution of this equation is known to exist as follows:

(32)
(33)
(34)
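The analytical solution corresponds to the matrix geometric mean $\mathbf{A} \,\#\, \mathbf{B} = \mathbf{A}^{1/2} (\mathbf{A}^{-1/2} \mathbf{B} \mathbf{A}^{-1/2})^{1/2} \mathbf{A}^{1/2}$, which solves the Riccati equation $\mathbf{X} \mathbf{A}^{-1} \mathbf{X} = \mathbf{B}$ for Hermitian positive definite $\mathbf{A}$ and $\mathbf{B}$. A numpy sketch of this known identity (helper names ours):

```python
import numpy as np

def psd_sqrt(A):
    # Hermitian matrix square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def geometric_mean(A, B):
    # A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2},
    # the unique positive definite solution of X A^{-1} X = B.
    A_half = psd_sqrt(A)
    A_half_inv = np.linalg.inv(A_half)
    return A_half @ psd_sqrt(A_half_inv @ B @ A_half_inv) @ A_half
```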
That is, $\mathbf{G}_{nf}$ is updated to the geometric mean of two positive definite matrices [33, 34, 35].

V MNMF-Informed Beamforming
This section explains the proposed MNMF-informed beamforming and its online extension. Our method takes as input the multichannel noisy speech spectrograms $\mathbf{X}$ and outputs an enhanced speech spectrogram, which is then passed to an ASR back-end (Fig. 1). MNMF is used to estimate the SCMs of speech and the other sounds from $\mathbf{X}$. The steering vector of the target speech and the SCM of noise are then computed. Finally, the enhanced speech is obtained by using one of the three kinds of beamforming described in Section III.
V-A Estimation of Spatial Information
To use a beamforming method (Section III), we compute the SCMs $\mathbf{G}^{s}$ and $\mathbf{G}^{n}$ of speech and noise by using the parameters $\mathbf{Z}$, $\mathbf{W}$, $\mathbf{H}$, and $\mathbf{G}$ of MNMF (Section IV). Assuming that source $n = 1$ is the target speech (see Section V-C), we have

(35)  $\mathbf{G}^{s}_{ft} = \lambda_{1ft} \mathbf{G}_{1f}$,
(36)  $\mathbf{G}^{n}_{ft} = \sum_{n=2}^{N} \lambda_{nft} \mathbf{G}_{nf}$,

where $\mathbf{G}^{s}_{ft}$ and $\mathbf{G}^{n}_{ft}$ are the time-variant SCMs of speech and noise, respectively. The time-invariant SCMs $\mathbf{G}^{s}_{f}$ and $\mathbf{G}^{n}_{f}$ are also given by

(37)  $\mathbf{G}^{s}_{f} = \frac{1}{T} \sum_{t=1}^{T} \lambda_{1ft} \mathbf{G}_{1f}$,
(38)  $\mathbf{G}^{n}_{f} = \frac{1}{T} \sum_{t=1}^{T} \sum_{n=2}^{N} \lambda_{nft} \mathbf{G}_{nf}$.

The corresponding steering vectors $\mathbf{b}_{ft}$ and $\mathbf{b}_{f}$ of the target speech are approximated as the principal components of $\mathbf{G}^{s}_{ft}$ and $\mathbf{G}^{s}_{f}$, respectively, as follows:

(39)  $\mathbf{b}_{ft} = \mathrm{PrincipalEig}(\mathbf{G}^{s}_{ft})$,
(40)  $\mathbf{b}_{f} = \mathrm{PrincipalEig}(\mathbf{G}^{s}_{f})$.
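Taking the principal component of a Hermitian SCM, as in Eqs. (39)-(40), is a one-line operation with numpy's `eigh`, which returns eigenvalues in ascending order (function name ours):

```python
import numpy as np

def steering_vector(G_s):
    # Principal component of a Hermitian SCM: the eigenvector
    # associated with the largest eigenvalue (eigh sorts ascending).
    _, vecs = np.linalg.eigh(G_s)
    return vecs[:, -1]
```

The eigenvector is determined only up to a complex scale; in practice a normalization (e.g., with respect to the reference channel) is applied before beamforming.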
V-B Online MNMF
We propose an online extension of MNMF that incrementally updates the parameters $\mathbf{Z}$, $\mathbf{W}$, $\mathbf{H}$, and $\mathbf{G}$. Suppose that $\mathbf{X}$ is given as a series of mini-batches in sequential order, where each mini-batch consists of multiple frames; a statistic with superscript $(b)$ is computed from mini-batch $b$. The past statistics are discounted with a weight $\rho$, which controls how much emphasis is put on the current mini-batch [36]. The online updating rules are as follows:
(41)  
(42)  
(43)  
(44)  
(45)  
(46)  
(47)  
(48)  
(49) 
where the function is defined as follows:
(50)  
(51) 
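While the exact sufficient statistics of Eqs. (41)-(49) are specific to MNMF, they all follow the same recursive pattern of discounting past mini-batch statistics with the weight $\rho$. A generic sketch of such an update (not the paper's exact rules):

```python
def discounted_update(stat_prev, stat_batch, rho):
    # Recursive averaging over mini-batches: the accumulated statistic
    # is discounted by rho and blended with the current mini-batch.
    return rho * stat_prev + (1.0 - rho) * stat_batch
```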
V-C Initialization of MNMF
We randomly initialize all parameters except for $\mathbf{G}$. Since MNMF is sensitive to the initialization of $\mathbf{G}$ [37], we use a constrained version of MNMF called independent low-rank matrix analysis (ILRMA) [29] for initializing $\mathbf{G}$. Since ILRMA can be used only in the determined condition of $N = M$, in this paper we assume $N = M$ for MNMF. In ILRMA, $\mathbf{G}_{nf}$ is restricted to a rank-1 matrix, i.e., $\mathbf{G}_{nf} = \mathbf{b}_{nf} \mathbf{b}_{nf}^{\mathsf{H}}$ (see Section IV-A). Using Eq. (17) and Eq. (20), we have

(52)  $\mathbf{x}_{ft} = \mathbf{A}_{f} \mathbf{s}_{ft}$,

where $\mathbf{s}_{ft} = [s_{1ft}, \ldots, s_{Nft}]^{\mathsf{T}}$ is a set of source spectra and $\mathbf{A}_{f} = [\mathbf{b}_{1f}, \ldots, \mathbf{b}_{Nf}]$ is a mixing matrix. If $\mathbf{A}_{f}$ is a nonsingular matrix, we have

(53)  $\mathbf{s}_{ft} = \mathbf{W}_{f} \mathbf{x}_{ft}$,

where $\mathbf{W}_{f} = \mathbf{A}_{f}^{-1}$ is a demixing matrix whose $n$-th row is a demixing filter of source $n$. We use ILRMA for estimating $\mathbf{W}_{f}$, compute $\mathbf{A}_{f} = \mathbf{W}_{f}^{-1}$, and initialize $\mathbf{G}_{nf} = \mathbf{b}_{nf} \mathbf{b}_{nf}^{\mathsf{H}} + \epsilon \mathbf{I}$, where $\epsilon$ is a small number.
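The initialization step can be sketched as follows: invert the ILRMA demixing matrix to recover the mixing (steering) vectors and form slightly regularized rank-1 SCMs ($\epsilon$ and the names are ours):

```python
import numpy as np

def init_scms_from_demixing(W_f, eps=1e-3):
    # A_f = W_f^{-1}; its n-th column is the steering vector of source n.
    # Each SCM is initialized as a rank-1 matrix plus eps * I so that
    # it is full-rank and positive definite from the start.
    A_f = np.linalg.inv(W_f)
    M = W_f.shape[0]
    return [np.outer(A_f[:, n], A_f[:, n].conj()) + eps * np.eye(M)
            for n in range(M)]
```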
In this paper, we assume that the target speech is predominant in the duration of $\mathbf{X}$ (e.g., one utterance). To deal with longer observations, voice activity detection (VAD) would be needed to segment the signals into multiple utterances. In reality, it is rare that a target utterance largely overlaps another utterance at the same volume. To make source $n = 1$ correspond to the target speech, the steering vector of source $n = 1$ is thus initialized as the principal component of the average empirical SCM as follows:

(54)

In the online version, the average of the empirical SCMs is taken over the first mini-batch.
The procedures of the offline and online versions of speech enhancement are shown in Algorithm 1 and Algorithm 2, respectively. In the online version, the spatial information of the target speech and noise is initialized by using the first, relatively long mini-batch (e.g., 10 s) and then updated in each subsequent mini-batch (e.g., 0.5 s). As described in Section III-C, when the time-variant rank-1 MWF is used, the SCM, the steering vector, and the power of the speech are assumed to be time-invariant, while those of the noise are kept time-variant.
VI Evaluation
This section reports comprehensive experiments conducted to evaluate all the variants of the proposed method based on unsupervised MNMF-informed beamforming (i.e., full-rank MWF, rank-1 MWF, or MVDR; time-variant or time-invariant; and offline or online), in comparison with state-of-the-art methods based on supervised DNN-based mask estimation. To evaluate the performance of ASR, we used a common dataset taken from the third CHiME Challenge [16], for which a sufficient amount of training data is available. We also used an internal dataset consisting of multichannel recordings in real noisy environments whose acoustic characteristics were different from those of the CHiME-3 dataset.
VI-A Configurations
We describe the configurations of the speech enhancement methods used for evaluation. The multichannel complex spectrograms of noisy speech signals recorded by five or six microphones at a sampling rate of 16 kHz were obtained by short-time Fourier transform (STFT) with a Hamming window of 1024 samples (64 ms) and a shift interval of 160 samples (10 ms), i.e., $F = 513$ frequency bins.
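For reference, the analysis settings above (Hamming window of 1024 samples, hop of 160 samples at 16 kHz) can be sketched with plain numpy (function name ours):

```python
import numpy as np

def stft(x, win_len=1024, hop=160):
    # Frame the signal, apply a Hamming window, and take the real FFT.
    # At 16 kHz, win_len = 1024 samples = 64 ms and hop = 160 samples
    # = 10 ms, giving win_len // 2 + 1 = 513 frequency bins.
    win = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # (frequency, time)
```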
VI-A1 MNMF-Informed Beamforming
In MNMF, the number of basis spectra $K$ was fixed in advance, and the number of sources was set to $N = 5$ (one source for the speech and the remaining four sources for the noise). MNMF was combined with the time-variant and time-invariant versions of full-rank MWF, rank-1 MWF, and MVDR beamforming (MNMF-{TV, TI}-{WF, WF1, MV}). The demixing filters were computed from the same SCMs estimated by MNMF to prevent the initialization sensitivity from affecting the ASR performance.
VI-A2 DNN-Based Beamforming
For comparison, we used the time-invariant version of MVDR beamforming based on DNN-based mask estimation, because time-invariant MVDR beamforming has been the most common choice among the various kinds of DNN-based beamforming.

To estimate masks, we used different combinations of magnitude and spatial features [6], described below:

- The log outputs of mel-scale filter banks (LMFBs) were computed at each time $t$ from the magnitude spectrogram of a reference channel, manually specified or automatically selected by Eq. (10).

- The ILDs and IPDs (sines and cosines of phase angle differences) from the reference channel were extracted at each frame $t$ and frequency $f$. This is considered to be more robust to overfitting than using all the ILDs and IPDs between every pair of channels as proposed in [6].

These features were stacked over 11 frames, from time $t-5$ to time $t+5$, at each time $t$ and fed into DNNs.
To train the DNNs, we tested two kinds of cost functions with different target data [24], described below:

- Ideal binary masks (IBMs) are used as target data, i.e., a TF mask takes $1$ when the speech magnitude dominates the noise magnitude in the bin and $0$ otherwise, as in standard DNN-based mask estimation. The cost function is the cross-entropy loss between the target masks and the outputs of a DNN.

- Phase-sensitive filters (PSFs) are used as target data, i.e., a TF filter is defined as $|s_{ft}|/|x_{ft}| \cos(\angle s_{ft} - \angle x_{ft})$, as proposed in [24]. The cost function is based on the phase-sensitive spectrum approximation (PSA) between the filtered and ground-truth speech spectra.
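Both kinds of target data can be computed directly from parallel clean and noisy spectrograms; a sketch under common definitions (the IBM thresholding rule below is one standard choice, not necessarily the paper's exact one):

```python
import numpy as np

def ideal_binary_mask(S, N):
    # IBM: 1 where the speech magnitude dominates the noise magnitude.
    return (np.abs(S) > np.abs(N)).astype(np.float64)

def phase_sensitive_filter(S, X):
    # PSF as in Erdogan et al. [24]:
    # |S| / |X| * cos(angle(S) - angle(X)) per TF bin.
    return (np.abs(S) / np.maximum(np.abs(X), 1e-12)
            * np.cos(np.angle(S) - np.angle(X)))
```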
We defined a baseline using LMFBs and IBMs (DNN-IBM) and its counterpart using LMFBs and PSFs (DNN-PSF). As extensions of DNN-IBM, we tested the additional use of ILDs and/or IPDs (DNN-IBM-{L, P, LP}). A standard feedforward DNN was trained under each configuration. Although a bidirectional long short-term memory network (BLSTM) was originally proposed for DNN-IBM [3], the feedforward DNN slightly outperformed the BLSTM in our preliminary experiments. We thus report the results obtained with the feedforward DNN in this paper. The steering vector of speech and the SCM of noise used in Eq. (16) are given by

(55)
(56)
As a common baseline, the weighted delay-and-sum (DS) beamforming called BeamformIt [20] was also used for comparison.
VI-A3 Automatic Speech Recognition
We used a de facto standard ASR system based on a DNN-HMM [38, 39] and a standard WSJ 5k trigram model as the acoustic and language models, respectively, with the Kaldi WFST decoder [40]. The DNN had four hidden layers with 2,000 rectified linear units (ReLUs) [41] each and a softmax output layer with 2,000 nodes. Its input was a 1,320-dimensional feature vector consisting of 11 frames of 40-channel LMFB outputs and their delta and acceleration coefficients. Mean and variance normalization was applied to the input vectors. Dropout [42, 43] was used in the training of all hidden layers.

VI-A4 Performance Evaluation
The performance of ASR was measured in terms of the word error rate (WER), defined as the ratio of the number of substitution, deletion, and insertion errors to the number of words in the reference text. The performance of speech enhancement was measured in terms of the signal-to-distortion ratio (SDR) [44], defined as the ratio of the energy of the target components to that of the distortion components, including interference, noise, and artifact errors. In addition, the performance of speech enhancement was measured in terms of the perceptual evaluation of speech quality (PESQ) [45] and short-time objective intelligibility (STOI) [46], which are closely related to human auditory perception.
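The WER defined above is computed via a word-level Levenshtein alignment; a minimal sketch:

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    # WER = (#substitutions + #deletions + #insertions) / #reference words,
    # obtained from the edit distance between the two word sequences.
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)
```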
             SCM estimation            Beamforming    Simulated data                        Real data
Method       (Target / Features)       Time   Type    BUS    CAF    PED    STR    Av.      BUS    CAF    PED    STR    Av.
Not enhanced                                          11.64  17.18  14.05  15.33  14.55    31.00  24.62  18.33  14.81  22.19
BeamformIt                             Inv.   DS      9.88   14.59  13.56  15.05  13.27    19.91  15.45  13.32  13.49  15.54
DNN-PSF      PSF / LMFB                Inv.   MV      6.43   8.70   8.52   8.65   8.07     14.51  11.02  10.59  9.41   11.38
DNN-IBM      IBM / LMFB                Inv.   MV      6.41   8.63   8.50   8.39   7.98     14.28  11.23  10.39  9.49   11.35
DNN-IBM-L    IBM / LMFB + ILD          Inv.   MV      6.24   8.12   8.46   7.75   7.64     15.52  11.43  12.71  10.40  12.51
DNN-IBM-P    IBM / LMFB + IPD          Inv.   MV      6.52   8.11   11.41  8.87   8.65     14.11  10.81  11.49  9.47   11.47
DNN-IBM-LP   IBM / LMFB + ILD + IPD    Inv.   MV      6.54   8.18   9.53   8.27   8.13     15.82  10.63  12.43  10.25  12.28
MNMF-TV-WF   ILRMA + MNMF              Var.   WF      7.58   10.59  14.23  13.67  11.52    14.73  11.30  11.21  10.07  11.83
MNMF-TI-WF   ILRMA + MNMF              Inv.   WF      7.43   10.52  14.21  13.56  11.43    14.90  11.77  11.58  10.05  12.07
MNMF-TV-WF1  ILRMA + MNMF              Var.   WF1     7.60   11.09  14.48  13.80  11.74    13.68  11.51  11.77  10.35  11.83
MNMF-TI-WF1  ILRMA + MNMF              Inv.   WF1     7.68   11.34  14.53  13.80  11.84    14.26  11.54  11.51  10.24  11.89
MNMF-TV-MV   ILRMA + MNMF              Var.   MV      7.71   11.34  14.61  14.01  11.92    14.60  11.73  11.49  10.12  11.99
MNMF-TI-MV   ILRMA + MNMF              Inv.   MV      7.75   11.30  14.49  13.77  11.83    14.60  11.65  11.55  10.14  11.99
             Simulated data (SDR)                  Simulated data (PESQ)           Simulated data (STOI)
Method       BUS    CAF    PED    STR    Av.       BUS   CAF   PED   STR   Av.     BUS   CAF   PED   STR   Av.
Not enhanced 6.75   7.74   8.33   6.56   7.35      2.32  2.09  2.13  2.19  2.18    0.88  0.85  0.87  0.86  0.87
BeamformIt   5.45   7.60   8.32   5.46   6.71      2.42  2.21  2.20  2.22  2.26    0.89  0.86  0.87  0.85  0.87
DNN-PSF      8.59   13.85  12.64  9.43   11.13     2.82  2.52  2.60  2.61  2.64    0.96  0.94  0.95  0.94  0.95
DNN-IBM      8.75   13.39  12.74  9.59   11.25     2.82  2.52  2.61  2.61  2.64    0.96  0.94  0.95  0.94  0.95
DNN-IBM-L    10.99  14.38  12.91  11.12  12.35     2.84  2.54  2.60  2.63  2.65    0.96  0.95  0.95  0.95  0.95
DNN-IBM-P    10.68  14.18  12.53  10.61  12.00     2.83  2.54  2.55  2.59  2.63    0.96  0.95  0.93  0.94  0.94
DNN-IBM-LP   11.49  14.55  12.67  11.32  12.51     2.83  2.54  2.54  2.60  2.63    0.96  0.95  0.94  0.94  0.95
MNMF-TV-WF   17.69  16.41  16.28  14.28  16.16     2.91  2.60  2.65  2.65  2.70    0.97  0.95  0.93  0.94  0.94
MNMF-TI-WF   17.36  16.29  16.16  14.08  15.97     2.89  2.60  2.64  2.65  2.69    0.97  0.95  0.93  0.93  0.94
MNMF-TV-WF1  15.65  15.61  14.83  13.12  14.80     2.89  2.58  2.59  2.63  2.67    0.97  0.95  0.92  0.93  0.94
MNMF-TI-WF1  15.81  15.65  14.86  13.21  14.88     2.89  2.58  2.58  2.63  2.67    0.97  0.95  0.92  0.93  0.94
MNMF-TV-MV   13.68  15.17  14.33  12.33  13.87     2.88  2.58  2.58  2.63  2.67    0.96  0.94  0.92  0.93  0.94
MNMF-TI-MV   13.69  15.18  14.33  12.33  13.88     2.88  2.58  2.57  2.63  2.66    0.96  0.94  0.92  0.93  0.94
VI-B Evaluation on CHiME-3 Dataset
We report a comparative experiment using the common dataset used in the third CHiME Challenge [16].
VI-B1 Experimental Conditions
The training set consists of 1,600 real utterances and 7,138 simulated ones obtained by mixing the clean training set of WSJ0 with background noise. The test set includes 1,320 real utterances ("et05_real_noisy") and 1,320 simulated ones ("et05_simu_noisy"). In the real data, each utterance was recorded by six microphones placed on a handheld tablet; five channels were used, excluding the second channel on the back side of the tablet, and the fifth channel facing the speaker was set as the reference channel $u$. There were four types of noisy environments: bus (BUS), cafeteria (CAF), pedestrian area (PED), and street (STR).
To estimate TF masks, the five kinds of DNNs, DNN-{IBM, PSF} and DNN-IBM-{L, P, LP}, were trained using the simulated training set. The DNN-HMM acoustic model was also trained using the same data. The SDRs, PESQs, and STOIs were measured only for the simulated test set because clean speech data were required. The WERs were measured for both the simulated and real test sets.
VI-B2 Noisy Speech Recognition
The ASR performances are listed in Table III. Among the MNMF-based variants, MNMF-TV-WF and MNMF-TV-WF1 attained the best average WER of 11.83% for the real data and were significantly better than BeamformIt, whose average WER was 15.54%. Among the DNN-based variants, DNN-IBM achieved the best average WER of 11.35% for the real data. MNMF-TV-WF and MNMF-TV-WF1 were still comparable with DNN-IBM trained using matched data. This result is promising because our unsupervised method does not need any prior training. In our evaluation, neither the use of spatial information such as ILDs and IPDs nor the PSF-based cost function was effective in terms of the WER.
The WERs obtained by the MNMF-based variants for the simulated PED and STR data were worse than those for the real PED and STR data, while the DNN-based variants worked well for both. As listed in Table IV, the DNN-HMM is considered to mismatch the enhanced speech for the simulated PED and STR data, because the speech enhancement performance for those data was comparable with that for the BUS and CAF data. The WERs for the real BUS data were remarkably worse than those for the simulated BUS data. A main reason would be that the spatial characteristics of speech and noise fluctuated over time due to the vibration of the bus in a real environment. The low-rank assumption of MNMF still held in the bus, and the time-variant types of beamforming thus worked slightly better.

Interestingly, while the WERs obtained by the MNMF-based variants were much worse than those obtained by the DNN-based variants for the simulated data, all the methods yielded similar results for the real data. This indicates that DNN-based beamforming tends to overfit the training data.
VI-B3 Speech Enhancement
The speech enhancement performances are listed in Table IV. The MNMF-based variants were generally excellent in terms of the SDR and almost comparable with the DNN-based variants in terms of the PESQ and STOI. In our evaluation, the SDRs were closely related to the WERs. MNMF-TV-WF achieved the best average SDR of 16.16 dB, while the DNN-based variants showed lower SDRs of up to 12.51 dB. Interestingly, the WERs obtained by the DNN-based variants were nevertheless much better than those obtained by the MNMF-based variants for the simulated data. Figures 2, 3, 4, and 5 show the input noisy speech spectrogram and the enhanced speech spectrograms obtained by BeamformIt, DNN-IBM, and MNMF-TI-MV, respectively. Although the low-frequency noise components were not sufficiently suppressed by the DNN-based methods, those components are considered to have little impact on ASR. MNMF-TI-MV was shown to estimate harmonic structures more clearly.

The full-rank MWF worked best in speech enhancement, and the rank-1 MWF showed the second-highest performance. While the full-rank MWF can consider the various propagation paths caused by reflection and reverberation, the rank-1 MWF and MVDR beamforming consider only the direct paths from the sound sources to the microphones. When the full-rank SCMs were accurately estimated by MNMF, the performance of speech enhancement was thus improved.
VI-C Evaluation on JNAS Dataset
We report a comparative experiment using an internal dataset recorded in a real noisy environment. We also evaluated the online version of the proposed method.
VI-C1 Experimental Conditions
We made an internal dataset consisting of 200 sentences taken from the Japanese newspaper article sentence (JNAS) corpus [47] and spoken by five male speakers in a noisy crowded cafeteria (Fig. 6). The utterances were recorded with a five-channel microphone array, and the total duration was about 20 min. To make a realistic condition, we used a hemispherical array with micro-electro-mechanical system (MEMS) microphones, which are widely used in commercial products. The distance between the speaker and the array was 1 m. The noisy JNAS dataset has significantly different acoustic characteristics from those of the CHiME-3 dataset because it was recorded in a different noisy environment using a different microphone array (Table V).
Test set | CHiME-3 | Noisy JNAS
Noisy environments | 4 (including cafe) | 1 (another cafe)
Microphone type | Condenser | MEMS
Microphone array geometry | Rectangle | Hemisphere
Speaker distance | 0.2–0.5 m | 1 m
Speaker gender | 2 males & 2 females | 5 males
Speaker language | English | Japanese
The ASR performance was evaluated using this dataset. The DNN-HMM acoustic model was also trained using multi-condition data, in which the noise data of CHiME-3 were added to the original 57,071 clean utterances of the JNAS. The model had six hidden layers with 2,048 sigmoid nodes each and an output layer with 3,000 nodes. A trigram language model was also trained using the JNAS. The Julius decoder [48] was used in this evaluation.
The noisy JNAS task was very different from the CHiME-3 task in terms of microphone setups and noise environments. Since the DNNs for mask estimation were trained using the noise data of the CHiME-3 dataset, the noise condition of the noisy JNAS data was unknown to them. For the noisy JNAS test sets, two kinds of DNNs were used for mask estimation. One was the same DNN as that used in the CHiME-3 test sets. The other was trained by adding the noise data of CHiME-3 to the original clean utterances of the JNAS corpus.
The online versions of the proposed MNMF-based variants were also evaluated. To investigate the effect of the first mini-batch size, online enhancement was performed on 10 consecutive utterances of the same speaker. The basic mini-batch size was fixed to 0.5 s, and experiments were conducted by changing the size of the first mini-batch from 5 s to 20 s. These settings were compared with offline processing of the same 10 consecutive utterances. The value of the weight was set to 0.9.
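The role of the weight above can be sketched as an exponentially weighted per-mini-batch update of the SCMs. This is our simplification for illustration, not the full online MNMF update; `alpha` plays the role of the weight set to 0.9, and all names are ours.

```python
import numpy as np

def batch_scm(X):
    """Empirical SCM of one mini-batch; X has shape (frames, channels)."""
    return np.einsum('tm,tn->mn', X, X.conj()) / len(X)

def update_scm(R_prev, X, alpha=0.9):
    """Exponentially weighted SCM update: keep a fraction alpha of the past."""
    return alpha * R_prev + (1.0 - alpha) * batch_scm(X)
```

Under this view, a larger first mini-batch simply means the initial SCM estimate is averaged over more frames, which is why the initial estimates become reliable in all frequency bins before the 0.5-s updates take over.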
VI-C2 Noisy Speech Recognition
Method | Training data | Avg.
Not enhanced | — | 38.52
Weighted DS | — | 32.01
DNN-IBM | CHiME-3 | 12.27
DNN-IBM | JNAS | 11.37
DNN-PSF | CHiME-3 | 12.11
DNN-PSF | JNAS | 14.82
MNMF-TV-WF | — | 10.01
MNMF-TI-WF | — | 9.91
MNMF-TV-WF1 | — | 9.36
MNMF-TI-WF1 | — | 9.30
MNMF-TV-MV | — | 9.30
MNMF-TI-MV | — | 9.36
Method | Offline | Online, by first mini-batch size: 20 s | 15 s | 10 s | 5 s
MNMF-TV-WF | 9.10 | 9.07 | 9.07 | 9.55 | 11.32
MNMF-TI-WF | 9.17 | 9.29 | 8.98 | 9.58 | 11.56
MNMF-TV-WF1 | 8.94 | 10.43 | 10.72 | 10.57 | 12.21
MNMF-TI-WF1 | 9.00 | 10.49 | 10.95 | 10.47 | 12.30
MNMF-TV-MV | 8.71 | 9.95 | 11.94 | 11.21 | 12.84
MNMF-TI-MV | 8.78 | 9.95 | 11.68 | 10.92 | 12.78
The ASR performances are listed in Table VI. Training the mask-estimation DNN using the JNAS data used for training the DNN-HMM was effective (from 12.27% to 11.37%) because the speech data became matched in terms of spoken language and noisy environment. MNMF-TI-WF1 and MNMF-TV-MV achieved the best WER of 9.30%, an 18.21% relative improvement over DNN-IBM trained using the data used for training the DNN-HMM. DNN-based beamforming was thus found to perform worse under unknown recording conditions. This may have been due to overfitting to the CHiME-3 noise data; it is difficult in practice to cover all noisy conditions in training.
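The 18.21% figure follows directly from the WERs in Table VI:

```python
# Relative WER improvement of MNMF-TI-WF1 / MNMF-TV-MV over the
# JNAS-trained DNN-IBM baseline (both WERs taken from Table VI).
baseline_wer = 11.37   # DNN-IBM trained on JNAS
proposed_wer = 9.30    # MNMF-TI-WF1 and MNMF-TV-MV
relative = 100.0 * (baseline_wer - proposed_wer) / baseline_wer
print(f"{relative:.2f}%")  # → 18.21%
```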
In the noisy JNAS tasks, there was also only a small difference among the beamforming methods, but MVDR beamforming was the most effective in combination with the offline versions of the proposed method. The use of a time-variant noise SCM again did not bring notable improvement.
VI-C3 Online Speech Enhancement
The ASR performances obtained by the online versions of the proposed method are listed in Table VII. The online MNMF-TI-WF achieved an average WER of 8.98%, while the offline MNMF-TI-WF achieved 9.17%. The online MNMF-TI-WF with a long first mini-batch outperformed the offline version because the initial estimates of the SCMs were accurate in all frequency bins. On the other hand, the performances of the online MNMF-TI-MV and MNMF-TI-WF1 were worse than those of their offline versions even when the first mini-batch was long. The offline MNMF-TV-MV achieved an average WER of 8.71%, while the online MNMF-TV-MV achieved 9.95%. MVDR beamforming and the rank-1 MWF re-estimate the principal eigenvector of the speech SCM as the steering vector for every mini-batch, which may degrade the ASR performance. The initialization of the online versions depends on the first mini-batch size; the performance degraded when the first mini-batch contained only a few segments of the target speech.
A practical problem of our approach lies in the computational complexity of MNMF related to the repeated inversions of SCMs. The real-time factors of the DNN- and MNMF-based beamforming methods were around 0.42 and 50, respectively. Order-of-magnitude-faster approximations of MNMF [49, 50], which are comparable with ILRMA in speed, were recently proposed and could similarly be extended to an online version for real-time noisy speech recognition.
The remaining problem lies in the long waiting time (10 s or 20 s) before reasonable performance is achieved. This problem could be mitigated in a realistic scenario in which a microphone array (e.g., a smart speaker) is fixed in a room, such as a microphone array placed at the center of a table for meeting recording. Every time sound activities are detected, the SCMs of the corresponding directions can be incrementally updated. A strong advantage of the proposed online method is that it can adapt to the room acoustics on the fly.
VI-D Experimental Findings
The two experiments using the CHiME-3 and JNAS datasets indicate that it is reasonable to use MNMF-informed time-invariant rank-1 Wiener filtering (MNMF-TI-WF1) for noisy speech spectrograms recorded in real unseen environments. In online speech enhancement, MNMF-informed time-invariant full-rank Wiener filtering (MNMF-TI-WF) tends to work best because the steering vector of speech is more difficult to update in an online manner than the SCM of speech. Since the WERs and SDRs obtained by the time-invariant beamforming methods are almost equal to those obtained by the time-variant methods, in practice it would be better to use the time-invariant methods to improve the temporal stability of speech enhancement.
VII Conclusion
This paper described an unsupervised speech enhancement method based on MNMF-guided beamforming. Our method uses MNMF to estimate the SCMs of speech and noise in an unsupervised manner and then generates an enhanced speech signal with beamforming. We extended MNMF to an online version and initialized MNMF with ILRMA. We evaluated various types of beamforming in a wide variety of conditions. The experimental results on real-recording ASR tasks demonstrated that the proposed methods were more robust in an unknown environment than the state-of-the-art beamforming method with DNN-based mask estimation.
We plan to integrate BSS and DNN-based SCM estimation to improve the ASR performance. Learning a basis matrix from a clean speech database is expected to improve the performance of speech enhancement [37]. When a microphone array is specified beforehand, learning the normalized SCM of the target speech is also expected to improve the performance. When noisy environments are covered by the training data used for DNN-based mask estimation, MNMF can be initialized with the results of DNN-based mask estimation [2] to further refine the SCMs of speech and noise. It would also be promising to use recently proposed semi-supervised speech enhancement methods based on NMF or MNMF with a DNN-based prior on speech spectra [51, 52, 53].
References
 [1] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
 [2] T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, “Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 286–290.
 [3] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
 [4] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Annual Conference of the International Speech Communication Association (Interspeech), 2016, pp. 1981–1985.
 [5] P. Pertilä, “Microphone-array-based speech enhancement using neural networks,” in Parametric Time-Frequency Domain Spatial Audio. Wiley-IEEE Press, 2017, pp. 291–325.
 [6] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1–5.
 [7] X. Xiao, S. Zhao, D. L. Jones, E. S. Chng, and H. Li, “On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 3246–3250.
 [8] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in International Conference on Machine Learning (ICML), vol. 70, 2017, pp. 2632–2641.
 [9] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran, A. Senior, K. Chin, A. Misra, and C. Kim, “Multichannel signal processing with deep neural networks for automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 965–979, 2017.
 [10] M. Mimura, Y. Bando, K. Shimada, S. Sakai, K. Yoshii, and T. Kawahara, “Combined multi-channel NMF-based robust beamforming for noisy speech recognition,” in Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 2451–2455.
 [11] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain multichannel linear filtering for noise reduction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 260–276, 2010.
 [12] Z. Wang, E. Vincent, R. Serizel, and Y. Yan, “Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments,” Computer Speech & Language, vol. 49, pp. 37–51, 2017.
 [13] B. D. Van Veen and K. M. Buckley, “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.
 [14] S. Gannot and I. Cohen, “Speech enhancement based on the general transfer function GSC and post-filtering,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 12, no. 6, pp. 561–571, 2004.
 [15] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1529–1539, 2007.
 [16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 504–511.
 [17] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, S. Araki, and T. Nakatani, “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 436–443.
 [18] T. Hori, Z. Chen, H. Erdogan, J. R. Hershey, J. Le Roux, V. Mitra, and S. Watanabe, “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015, pp. 475–481.
 [19] B. Loesch and B. Yang, “Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions,” in International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), 2010, pp. 41–48.
 [20] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
 [21] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech & Language, vol. 46, pp. 535–557, 2017.
 [22] K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Unsupervised beamforming based on multichannel nonnegative matrix factorization for noisy speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5734–5738.
 [23] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971–982, 2013.
 [24] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 708–712.
 [25] K. Itakura, Y. Bando, E. Nakamura, K. Itoyama, K. Yoshii, and T. Kawahara, “Bayesian multichannel audio source separation based on integrated source and spatial models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 831–846, 2018.
 [26] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
 [27] S. Arberet, A. Ozerov, and N. Q. K. Duong, “Nonnegative matrix factorization and spatial covariance model for underdetermined reverberant audio source separation,” in IEEE International Conference on Information Science, Signal Processing and their Applications (ISSPA), 2010, pp. 1–4.
 [28] J. Nikunen and T. Virtanen, “Direction of arrival based spatial covariance model for blind sound source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 3, pp. 727–739, 2014.
 [29] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1626–1641, 2016.
 [30] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single- and multi-microphone speech enhancement,” IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2230–2244, 2002.
 [31] T. C. Lawin-Ore and S. Doclo, “Reference microphone selection for MWF-based noise reduction using distributed microphone arrays,” in ITG Symposium on Speech Communication, 2012, pp. 31–34.
 [32] Z. Wang, E. Vincent, R. Serizel, and Y. Yan, “Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments,” Computer Speech & Language, vol. 49, pp. 37–51, 2018.
 [33] T. Ando, C.-K. Li, and R. Mathias, “Geometric means,” Linear Algebra and its Applications, vol. 385, pp. 305–334, 2004.
 [34] W.-H. Chen, “A review of geometric mean of positive definite matrices,” British Journal of Mathematics & Computer Science, vol. 5, no. 1, pp. 1–12, 2015.
 [35] K. Yoshii, “Correlated tensor factorization for audio source separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 731–735.
 [36] A. Lefevre, F. Bach, and C. Févotte, “Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011, pp. 313–316.
 [37] Y. Tachioka, T. Narita, I. Miura, T. Uramoto, N. Monta, S. Uenohara, K. Furuya, S. Watanabe, and J. Le Roux, “Coupled initialization of multichannel nonnegative matrix factorization based on spatial and spectral information,” in Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 2461–2465.
 [38] A. R. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
 [39] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
 [40] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
 [41] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning (ICML), 2010, pp. 807–814.
 [42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [43] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015, pp. 448–456.
 [44] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
 [45] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” in ITU-T Recommendation. IEEE, 2001, p. 862.
 [46] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
 [47] K. Itou, M. Yamamoto, K. Takeda, T. Takezawa, T. Matsuoka, T. Kobayashi, K. Shikano, and S. Itahashi, “JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research,” Journal of the Acoustical Society of Japan (E), vol. 20, no. 3, pp. 199–206, 1999.
 [48] A. Lee, T. Kawahara, and K. Shikano, “Julius — An open source real-time large vocabulary recognition engine,” in European Conference on Speech Communication and Technology (Eurospeech), 2001, pp. 1691–1694.
 [49] N. Ito and T. Nakatani, “FastMNMF: Joint diagonalization based accelerated algorithms for multichannel matrix factorization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, to appear.
 [50] K. Sekiguchi, A. A. Nugraha, Y. Bando, and K. Yoshii, “Fast multichannel source separation based on jointly diagonalizable spatial covariance matrices,” in European Signal Processing Conference (EUSIPCO), 2019, submitted. [Online]. Available: https://arxiv.org/abs/1903.03237
 [51] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 716–720.
 [52] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2018, pp. 1–6.
 [53] K. Sekiguchi, Y. Bando, K. Yoshii, and T. Kawahara, “Bayesian multichannel speech enhancement with a deep speech prior,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1233–1239.