1 Introduction
Speech enhancement is a classical problem of speech processing, which aims to recover a clean speech signal from the recording of a noisy signal, where the noise is generally considered as additive [1]. In this work we address single-channel speech enhancement, which can be seen as an underdetermined source separation problem where the sources to be separated are of different nature.
Statistical approaches combining a local Gaussian modeling of the time-frequency signal coefficients with a variance model have attracted a lot of attention in the past few years [2]. Within this framework, nonnegative matrix factorization (NMF) techniques are especially popular for structuring the time-frequency-dependent signal variance [3]. Recently, deep neural networks were also successfully investigated for speech enhancement, to generate either time-frequency masks or clean power spectrograms from noisy spectrograms [4], or to model the signal variance [5] (there in a multichannel framework). Similarities between NMF and standard autoencoders have been reported in [6]. The recent approach of [7] also falls into the variance modeling framework. Very interestingly, the authors developed a semi-supervised single-channel speech enhancement method mixing concepts of Bayesian inference and deep learning, through the use of a variational autoencoder [8].
In the present work, we follow a similar approach, using a speaker-independent supervised Gaussian speech model based on a variational autoencoder. For a given time-frequency point, it mainly consists in modeling the speech variance as a nonlinear function of a Gaussian latent random vector, by means of a neural network. Compared with [7], we further introduce a time-frame-dependent gain in order to provide some robustness with respect to the loudness of the training examples. The noise model is Gaussian and based on unsupervised NMF, so as to be free of generalization issues regarding the noisy recording environments. We propose a new Monte Carlo expectation-maximization (MCEM) algorithm [9] for performing semi-supervised speech enhancement. Experimental results show the superiority of our approach compared with a baseline method using NMF, and also compared with a state-of-the-art fully supervised deep learning approach [10].
We first provide in Section 2 a brief presentation of the NMF-based variance modeling framework commonly used in audio source separation. This will help us emphasize the conceptual similarities shared with the speech model used in this work. In Section 3 we present the model. Our inference algorithm is detailed in Section 4. The optimization problem related to variational autoencoder training is presented in Section 5. Experiments are described in Section 6, and we finally conclude in Section 7.
2 Variance Modeling Framework
We work in the short-term Fourier transform (STFT) domain. For all $(f,n) \in \{0,\dots,F-1\} \times \{0,\dots,N-1\}$, where $f$ denotes the frequency index and $n$ the time-frame index, the single-channel speech enhancement problem consists in recovering the clean speech STFT coefficients $s_{fn}$ from the noisy observed ones $x_{fn} = s_{fn} + b_{fn}$, where $b_{fn}$ is a noise signal. The source STFT coefficients are modeled as complex circularly symmetric Gaussian random variables [2]:

$s_{fn} \sim \mathcal{N}_c\big(0, \sigma_{s,fn}^2\big), \qquad b_{fn} \sim \mathcal{N}_c\big(0, \sigma_{b,fn}^2\big),$ (1)

where $\mathcal{N}_c(\mu, \sigma^2)$ denotes the complex proper Gaussian distribution, which is circularly symmetric (rotation invariant in the complex plane) if $\mu = 0$. Under a local stationarity assumption [11], the variances in (1) characterize the short-term power spectral density of the audio source signals. Note that this Gaussian model has recently been extended to other complex circularly symmetric and elliptically contoured distributions [12], such as the symmetric alpha-stable [13] or Student's t [14] distributions.
A widely used linear variance model for the source signals relies on NMF. For $j \in \{s, b\}$, let $\mathbf{W}_j \in \mathbb{R}_+^{F \times K_j}$ be a matrix containing spectral patterns, and let $\mathbf{H}_j \in \mathbb{R}_+^{K_j \times N}$ be a matrix containing the activations of these spectral patterns over the $N$ time frames. At each time-frequency point $(f,n)$, the variance is modeled as follows:
$\sigma_{j,fn}^2 = \big(\mathbf{W}_j \mathbf{H}_j\big)_{fn} = \mathbf{w}_{j,f}^\top \mathbf{h}_{j,n},$ (2)

where $\mathbf{w}_{j,f}^\top$ is the $f$-th row of $\mathbf{W}_j$, $\mathbf{h}_{j,n}$ is the $n$-th column of $\mathbf{H}_j$, and $\cdot^\top$ denotes the transposition operator. Without any additional constraint on the model parameters, the NMF rank is generally chosen such that $K_j(F+N) \ll FN$. Compared with the unconstrained Gaussian model in (1), the number of parameters to be estimated is therefore considerably reduced, which is of great interest for solving an underdetermined problem.
In a semi-supervised speech enhancement setting [15], the speech dictionary $\mathbf{W}_s$ is learned on a clean speech database. The variance model in (2) is therefore seen as a linear function of the activations. Then, in the speech enhancement algorithm, only the speech activation matrix $\mathbf{H}_s$ and the noise NMF parameters $\mathbf{W}_b$ and $\mathbf{H}_b$ are estimated from the observation of the noisy mixture signal. It has been shown in [3] that maximum likelihood estimation of the NMF parameters under this Gaussian model amounts to solving an optimization problem involving the Itakura-Saito divergence. An optimal estimate of the speech signal in the minimum mean square error sense is then obtained by Wiener filtering.
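As a concrete illustration of this semi-supervised baseline, the sketch below estimates the speech activations and the noise NMF factors from a noisy power spectrogram with standard Itakura-Saito multiplicative updates, then recovers the speech by Wiener filtering. The function names, ranks, iteration count, and the exponent-1 update heuristic are illustrative assumptions, not the exact implementation evaluated in Section 6.

```python
import numpy as np

def semi_supervised_is_nmf(X, W_s, K_b=10, n_iter=100, eps=1e-8, seed=0):
    """Semi-supervised IS-NMF sketch: W_s is fixed (learned on clean speech);
    H_s, W_b, H_b are estimated from the noisy power spectrogram X (F x N)
    with the classical Itakura-Saito multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    H_s = rng.random((W_s.shape[1], N)) + eps
    W_b = rng.random((F, K_b)) + eps
    H_b = rng.random((K_b, N)) + eps
    for _ in range(n_iter):
        V = W_s @ H_s + W_b @ H_b            # current mixture variance model
        H_s *= (W_s.T @ (X * V**-2)) / (W_s.T @ V**-1)
        V = W_s @ H_s + W_b @ H_b
        W_b *= ((X * V**-2) @ H_b.T) / (V**-1 @ H_b.T)
        V = W_s @ H_s + W_b @ H_b
        H_b *= (W_b.T @ (X * V**-2)) / (W_b.T @ V**-1)
    return H_s, W_b, H_b

def wiener_speech_estimate(X_stft, W_s, H_s, W_b, H_b):
    """Posterior mean of the speech STFT coefficients (Wiener filter)."""
    V_s, V_b = W_s @ H_s, W_b @ H_b
    return V_s / (V_s + V_b) * X_stft
```

The Wiener gain is the ratio of the modeled speech variance to the total variance, so it always lies in $[0, 1]$.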
3 Model
3.1 Supervised Speech Model
In a supervised setting, the variance model in (2) is limited by its linear nature. Moreover, the number of trainable parameters is restricted by the factorization rank, which is usually chosen to be small. In this work we consider a supervised and nonlinear model of the speech variances. Independently for all $n \in \{0,\dots,N-1\}$, we have the following generative model, involving a latent random vector $\mathbf{z}_n \in \mathbb{R}^L$:

$\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$ (3)
$s_{fn} \,|\, \mathbf{z}_n \sim \mathcal{N}_c\big(0, \sigma_f^2(\mathbf{z}_n)\big), \quad \text{independently for all } f,$ (4)

where $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the multivariate Gaussian distribution for a real-valued random vector, $\mathbf{I}$ is the identity matrix of appropriate size, and $\sigma_f^2 : \mathbb{R}^L \to \mathbb{R}_+$, $f \in \{0,\dots,F-1\}$, represents a nonlinear function parametrized by $\boldsymbol{\theta}_s$. The output of this function is provided by one neuron in the $F$-dimensional output layer of a neural network, whose weights and biases are denoted by $\boldsymbol{\theta}_s$. In the context of variational autoencoders, the model (3)-(4) is called the generative network or probabilistic decoder [8].
As in the case of an NMF-based supervised variance model, we assume that the parameters $\boldsymbol{\theta}_s$ are learned on a clean speech database. This is done using a recognition network or probabilistic encoder in addition to the probabilistic decoder, as will be explained in Section 5. Only the latent random vectors will have to be inferred from the observed data. Moreover, the latent dimension $L$ is set much smaller than $F$.
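The generative model (3)-(4) can be sketched as a tiny decoder network. The single tanh hidden layer, the layer sizes, and the weight initialization below are illustrative assumptions, not the exact architecture evaluated in Section 6.

```python
import numpy as np

class SpeechDecoder:
    """Minimal sketch of the probabilistic decoder: maps a latent vector
    z in R^L to F speech variances through one tanh hidden layer whose
    output layer produces log-variances (illustrative sizes)."""
    def __init__(self, L=64, hidden=128, F=513, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((hidden, L)) / np.sqrt(L)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((F, hidden)) / np.sqrt(hidden)
        self.b2 = np.zeros(F)

    def variance(self, z):
        h = np.tanh(self.W1 @ z + self.b1)   # tanh hidden layer
        return np.exp(self.W2 @ h + self.b2) # exponentiate log-variances

    def sample_speech_frame(self, z, rng):
        """Draw s_fn ~ N_c(0, sigma_f^2(z)) for all f, as in (3)-(4)."""
        var = self.variance(z)
        return np.sqrt(var / 2) * (rng.standard_normal(var.size)
                                   + 1j * rng.standard_normal(var.size))
```

Exponentiating the network output guarantees strictly positive variances, which is why the output layer works in the log-variance domain.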
3.2 Unsupervised Noise Model
As introduced in Section 2, we use an unsupervised NMF-based Gaussian model for the noise. Independently for all $(f,n)$:

$b_{fn} \sim \mathcal{N}_c\big(0, (\mathbf{W}_b \mathbf{H}_b)_{fn}\big).$ (5)
3.3 Mixture Model
The observed mixture signal is modeled as follows for all $(f,n)$:

$x_{fn} = \sqrt{g_n}\, s_{fn} + b_{fn},$ (6)

where $g_n \in \mathbb{R}_+$ represents a frame-dependent but frequency-independent gain. We introduce this term in order to provide some robustness with respect to the loudness of the training examples used to learn the variational autoencoder parameters (this will be illustrated in Section 6). The speech and noise signals are further supposed to be mutually independent given the latent random vectors, such that for all $(f,n)$:

$x_{fn} \,|\, \mathbf{z}_n \sim \mathcal{N}_c\big(0,\; g_n \sigma_f^2(\mathbf{z}_n) + (\mathbf{W}_b \mathbf{H}_b)_{fn}\big),$ (7)

where $\{g_n, \mathbf{w}_{b,f}, \mathbf{h}_{b,n}\}$ is the set of unsupervised model parameters at time-frequency point $(f,n)$.
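Under (7), the log-likelihood of one mixture frame given its latent vector is a sum of complex circular Gaussian log-densities across frequencies. A minimal helper (hypothetical names and interface) reads:

```python
import numpy as np

def mixture_log_likelihood(x_n, var_s, g_n, Vb_n):
    """Log of p(x_n | z_n) in eq. (7): independent complex circular
    Gaussians with variance g_n * sigma_f^2(z_n) + (W_b H_b)_{fn}.
    var_s is sigma_f^2(z_n) (length F) and Vb_n the noise NMF variances."""
    v = g_n * var_s + Vb_n
    # ln N_c(x; 0, v) = -ln(pi * v) - |x|^2 / v, summed over frequencies
    return float(np.sum(-np.log(np.pi * v) - np.abs(x_n)**2 / v))
```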
4 Inference
Let us introduce the following notations:

- $\mathbf{x}_n = \{x_{fn}\}_{f=0}^{F-1}$: the set of mixture STFT coefficients for a given time frame $n$;
- $\mathbf{x} = \{\mathbf{x}_n\}_{n=0}^{N-1}$: the set of all mixture STFT coefficients;
- $\mathbf{z} = \{\mathbf{z}_n\}_{n=0}^{N-1}$: the set of all latent variables;
- $\boldsymbol{\theta}_u = \{\mathbf{W}_b, \mathbf{H}_b, \mathbf{g}\}$, with $\mathbf{g} = [g_0, \dots, g_{N-1}]^\top$: the set of unsupervised model parameters.
From the set of observations $\mathbf{x}$, our primary goal is to estimate the unsupervised model parameters $\boldsymbol{\theta}_u$, which will serve in the final inference of the speech STFT coefficients. Unfortunately, straightforward maximum likelihood estimation is here intractable. A common alternative then consists in exploiting the latent variable structure of the model to derive an expectation-maximization (EM) algorithm [16]. From an initialization $\boldsymbol{\theta}_u^\star$ of the model parameters, it consists in iterating the two following steps until convergence:

- E-step: compute $Q(\boldsymbol{\theta}_u; \boldsymbol{\theta}_u^\star) = \mathbb{E}_{p(\mathbf{z} \,|\, \mathbf{x};\, \boldsymbol{\theta}_u^\star)}\big[\ln p(\mathbf{x}, \mathbf{z}; \boldsymbol{\theta}_u)\big]$;
- M-step: update $\boldsymbol{\theta}_u^\star \leftarrow \arg\max_{\boldsymbol{\theta}_u} Q(\boldsymbol{\theta}_u; \boldsymbol{\theta}_u^\star)$.
4.1 E-Step
Due to the nonlinear relation between the observations and the latent variables in (7), the posterior distribution $p(\mathbf{z} \,|\, \mathbf{x}; \boldsymbol{\theta}_u^\star)$ cannot be computed in analytical form. Therefore we cannot compute the expectation involved in the definition of $Q(\boldsymbol{\theta}_u; \boldsymbol{\theta}_u^\star)$ at the E-step. We consequently approximate it by the following empirical average (Monte Carlo approximation):

$\tilde{Q}(\boldsymbol{\theta}_u; \boldsymbol{\theta}_u^\star) \stackrel{c}{=} -\dfrac{1}{R} \displaystyle\sum_{r=1}^{R} \sum_{f,n} \left[ \ln\!\big(g_n \sigma_f^2(\mathbf{z}_n^{(r)}) + (\mathbf{W}_b\mathbf{H}_b)_{fn}\big) + \dfrac{|x_{fn}|^2}{g_n \sigma_f^2(\mathbf{z}_n^{(r)}) + (\mathbf{W}_b\mathbf{H}_b)_{fn}} \right],$ (8)

where $\stackrel{c}{=}$ denotes equality up to additive constants with respect to $\boldsymbol{\theta}_u$, and $\{\mathbf{z}_n^{(r)}\}_{r=1}^{R}$ is a sequence of samples drawn from the posterior $p(\mathbf{z}_n \,|\, \mathbf{x}_n; \boldsymbol{\theta}_u^\star)$ using a Markov chain Monte Carlo (MCMC) method. Here we use the Metropolis-Hastings algorithm [17]. This approach forms the basis of the Monte Carlo EM (MCEM) algorithm [9]. Note that unlike the standard EM algorithm, it does not ensure that the likelihood increases at each iteration. Nevertheless, some convergence results in terms of stationary points of the likelihood can be obtained under suitable conditions [18].
At the $t$-th iteration of the Metropolis-Hastings algorithm and independently for all $n$, we first draw a sample from a random-walk proposal distribution:
$\mathbf{z}_n^{(t)} \,|\, \mathbf{z}_n^{(t-1)} \sim \mathcal{N}\big(\mathbf{z}_n^{(t-1)}, \epsilon^2 \mathbf{I}\big).$ (9)

Using the fact that this is a symmetric proposal distribution [17], we then compute the acceptance probability $\alpha$ defined by:

$\alpha = \min\left(1,\; \dfrac{p\big(\mathbf{x}_n \,|\, \mathbf{z}_n^{(t)}; \boldsymbol{\theta}_u^\star\big)\, p\big(\mathbf{z}_n^{(t)}\big)}{p\big(\mathbf{x}_n \,|\, \mathbf{z}_n^{(t-1)}; \boldsymbol{\theta}_u^\star\big)\, p\big(\mathbf{z}_n^{(t-1)}\big)}\right),$ (10)

where the likelihood is defined from (7) and the prior from (3). Then we draw $u$ from the uniform distribution $\mathcal{U}([0,1])$. If $u < \alpha$, we accept the proposed sample; otherwise we reject it and set $\mathbf{z}_n^{(t)} = \mathbf{z}_n^{(t-1)}$. We only keep the last $R$ samples for computing (8), i.e. we discard the samples drawn during a so-called burn-in period.

4.2 M-Step
At the M-step of the MCEM algorithm, we want to maximize $\tilde{Q}$ in (8) with respect to the unsupervised model parameters $\boldsymbol{\theta}_u$. As usual in the NMF literature [19], we adopt a block-coordinate approach by successively and individually updating $\mathbf{H}_b$, $\mathbf{W}_b$ and $\mathbf{g}$, using the auxiliary function technique. We provide derivation details only for $\mathbf{H}_b$, as the procedure is very similar for the other parameters.
Let $C(\mathbf{H}_b)$ be the opposite of $\tilde{Q}$ in (8), seen as a function of $\mathbf{H}_b$ only (the other parameters being fixed). Our goal is to minimize $C$ under a nonnegativity constraint.
Definition 1 (Auxiliary function).
The mapping $G: \mathbb{R}_+^{K_b \times N} \times \mathbb{R}_+^{K_b \times N} \to \mathbb{R}$ is an auxiliary function to $C$ if and only if, for all $\mathbf{H}_b$ and $\tilde{\mathbf{H}}_b$:

$C(\mathbf{H}_b) \le G\big(\mathbf{H}_b; \tilde{\mathbf{H}}_b\big),$ (11)
$C(\mathbf{H}_b) = G\big(\mathbf{H}_b; \mathbf{H}_b\big).$ (12)

In other words, $G$ is an upper bound of $C$ which is tight for $\tilde{\mathbf{H}}_b = \mathbf{H}_b$. The original minimization problem can be replaced with an alternate minimization of this upper bound. From an initial point $\mathbf{H}_b^{(0)}$ we iterate:

$\mathbf{H}_b^{(i+1)} = \arg\min_{\mathbf{H}_b} G\big(\mathbf{H}_b; \mathbf{H}_b^{(i)}\big).$ (13)
This procedure corresponds to the majorize-minimize (MM) algorithm [20], which by construction leads to a monotonic decrease of $C$. Moreover, its convergence properties are the same as those of the EM algorithm [21].
Proposition 1 (Auxiliary function to $C$).
The function $G$ given in (14) below, obtained by majorizing the convex part of $C$ with Jensen's inequality and its concave part with a first-order Taylor approximation at the current value $\tilde{\mathbf{H}}_b$, is an auxiliary function to $C$.
(14)
Proof.
The proof is provided in [22] due to space limitation. ∎
The auxiliary function $G$ is separable into convex functions of the individual coefficients of $\mathbf{H}_b$, which is convenient for updating those parameters. By canceling the partial derivative of $G$ with respect to each coefficient, we obtain an update which depends on $\tilde{\mathbf{H}}_b$. According to the MM algorithm, $\tilde{\mathbf{H}}_b$ is set to the previous value of $\mathbf{H}_b$, which yields the following multiplicative update in matrix form:
$\mathbf{H}_b \leftarrow \mathbf{H}_b \odot \left[ \dfrac{\mathbf{W}_b^\top \Big( |\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \big(\mathbf{V}^{(r)}\big)^{\odot -2} \Big)}{\mathbf{W}_b^\top \sum_{r=1}^{R} \big(\mathbf{V}^{(r)}\big)^{\odot -1}} \right]^{\odot 1/2},$ (15)

where $\odot$ denotes elementwise multiplication and exponentiation, matrix division is also elementwise, $\mathbf{X}$ is the matrix of entries $x_{fn}$, and $\mathbf{V}^{(r)}$ is the matrix of entries $g_n \sigma_f^2(\mathbf{z}_n^{(r)}) + (\mathbf{W}_b \mathbf{H}_b)_{fn}$. We can straightforwardly apply the same procedure for maximizing $\tilde{Q}$ with respect to $\mathbf{W}_b$. The resulting multiplicative update rule is given by:
$\mathbf{W}_b \leftarrow \mathbf{W}_b \odot \left[ \dfrac{\Big( |\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \big(\mathbf{V}^{(r)}\big)^{\odot -2} \Big) \mathbf{H}_b^\top}{\Big(\sum_{r=1}^{R} \big(\mathbf{V}^{(r)}\big)^{\odot -1}\Big) \mathbf{H}_b^\top} \right]^{\odot 1/2}.$ (16)
Again, in a very similar fashion, the vector of speech gains $\mathbf{g}$ is updated as follows:

$\mathbf{g}^\top \leftarrow \mathbf{g}^\top \odot \left[ \dfrac{\mathbf{1}^\top \Big( |\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \boldsymbol{\Sigma}_s^{(r)} \odot \big(\mathbf{V}^{(r)}\big)^{\odot -2} \Big)}{\mathbf{1}^\top \sum_{r=1}^{R} \boldsymbol{\Sigma}_s^{(r)} \odot \big(\mathbf{V}^{(r)}\big)^{\odot -1}} \right]^{\odot 1/2},$ (17)

where $\mathbf{1}$ is an all-ones column vector of dimension $F$ and $\boldsymbol{\Sigma}_s^{(r)}$ is the matrix of entries $\sigma_f^2(\mathbf{z}_n^{(r)})$. The nonnegativity of $\mathbf{H}_b$, $\mathbf{W}_b$ and $\mathbf{g}$ is ensured provided that those parameters are initialized with nonnegative entries. In practice, at the M-step of the MCEM algorithm, we perform only one iteration of the updates (15), (16) and (17).
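One such multiplicative update, here for the noise activations, can be sketched as below, assuming the MM-derived exponent-1/2 rule stated above; the function name and argument layout are illustrative.

```python
import numpy as np

def update_Hb(X_abs2, W_b, H_b, V_list):
    """One multiplicative update of H_b in the spirit of eq. (15) (sketch).
    V_list holds the R matrices V_r with entries
    g_n * sigma_f^2(z_n^(r)) + (W_b H_b)_{fn}, one per posterior sample.
    All operations are elementwise except the matrix products."""
    num = W_b.T @ (X_abs2 * sum(V**-2 for V in V_list))
    den = W_b.T @ sum(V**-1 for V in V_list)
    return H_b * (num / den) ** 0.5
```

Because the update is a product of nonnegative quantities, nonnegativity of `H_b` is automatically preserved.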
4.3 Speech Reconstruction
In the following, $\boldsymbol{\theta}_u^\star$ denotes the set of parameters estimated by the above MCEM algorithm. For all $(f,n)$, let $\tilde{s}_{fn} = \sqrt{g_n}\, s_{fn}$ be the scaled version of the speech STFT coefficients as introduced in (6), with $g_n$ taken from the estimated gain vector. Our final goal is to estimate those coefficients according to their posterior mean:

$\hat{\tilde{s}}_{fn} = \mathbb{E}\big[\tilde{s}_{fn} \,|\, x_{fn}; \boldsymbol{\theta}_u^\star\big] = \mathbb{E}_{p(\mathbf{z}_n | \mathbf{x}_n;\, \boldsymbol{\theta}_u^\star)}\!\left[ \dfrac{g_n \sigma_f^2(\mathbf{z}_n)}{g_n \sigma_f^2(\mathbf{z}_n) + (\mathbf{W}_b \mathbf{H}_b)_{fn}} \right] x_{fn}.$ (18)

This estimation corresponds to a soft masking of the mixture signal, similar to standard Wiener filtering. Exactly as before, this expectation cannot be computed in analytical form, but we can approximate it using the same Metropolis-Hastings algorithm as detailed in Section 4.1. Note that this estimation procedure is different from the one proposed in [7]: here the source STFT coefficients are estimated from their true posterior distribution, without conditioning on the latent variables. The time-domain estimate of the speech signal is finally obtained by inverse STFT and overlap-add.
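This posterior-mean estimator can be sketched by averaging the Wiener-like gain over posterior samples of the latent vector. Names are illustrative; `decoder_var` stands for the generative network $\sigma_f^2(\cdot)$ and `z_samples` for Metropolis-Hastings draws.

```python
import numpy as np

def posterior_mean_speech(x_n, z_samples, decoder_var, g_n, Vb_n):
    """Approximate the posterior mean of the scaled speech coefficients
    for one frame (sketch): average the Wiener-like gain over posterior
    samples of z_n, then apply it to the mixture frame."""
    gains = []
    for z in z_samples:
        v_s = g_n * decoder_var(z)           # scaled speech variance
        gains.append(v_s / (v_s + Vb_n))     # soft mask for this sample
    return np.mean(gains, axis=0) * x_n
```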
5 Supervised Training of the Speech Model
For supervised training of the speech model, we use a large dataset of independent clean-speech STFT time frames. In our speech enhancement method detailed above, we are only interested in using the generative model defined in (3) and (4). However, in order to learn the parameters $\boldsymbol{\theta}_s$ of this generative model, we also need to introduce a recognition network or probabilistic encoder $q(\mathbf{z}_n \,|\, \mathbf{s}_n; \boldsymbol{\psi})$, which is an approximation of the true posterior $p(\mathbf{z}_n \,|\, \mathbf{s}_n; \boldsymbol{\theta}_s)$ [8]. Independently for all $l \in \{0,\dots,L-1\}$ and $n$, it is defined similarly as in [7] by:

$z_{l,n} \,|\, \mathbf{s}_n \sim \mathcal{N}\big(\tilde{\mu}_l(\mathbf{s}_n), \tilde{\sigma}_l^2(\mathbf{s}_n)\big),$ (19)

where $\tilde{\mu}_l$ and $\tilde{\sigma}_l^2$, $l \in \{0,\dots,L-1\}$, are nonlinear functions parametrized by $\boldsymbol{\psi}$. The output of each function is provided by one neuron in an $L$-dimensional output layer of a neural network, whose weights and biases are denoted by $\boldsymbol{\psi}$.
As explained in [8], the recognition network parameters and the generative network parameters can be jointly trained by maximizing the evidence lowerbound (ELBO) defined by:
$\mathcal{L}(\boldsymbol{\theta}_s, \boldsymbol{\psi}) = \mathbb{E}_{q(\mathbf{z} | \mathbf{s};\, \boldsymbol{\psi})}\big[\ln p(\mathbf{s} \,|\, \mathbf{z}; \boldsymbol{\theta}_s)\big] - D_{\mathrm{KL}}\big(q(\mathbf{z} \,|\, \mathbf{s}; \boldsymbol{\psi}) \,\|\, p(\mathbf{z})\big),$ (20)

where $D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_q[\ln(q/p)]$ is the Kullback-Leibler divergence. The first term on the right-hand side of (20) is a reconstruction accuracy term, while the second one is a regularization term. From (3), (4) and (19), the ELBO can be developed as follows:

$\mathcal{L}(\boldsymbol{\theta}_s, \boldsymbol{\psi}) \stackrel{c}{=} -\displaystyle\sum_{f,n} \mathbb{E}_{q(\mathbf{z}_n | \mathbf{s}_n;\, \boldsymbol{\psi})}\Big[ d_{\mathrm{IS}}\big(|s_{fn}|^2, \sigma_f^2(\mathbf{z}_n)\big) \Big] + \dfrac{1}{2} \sum_{l,n} \Big[ \ln \tilde{\sigma}_l^2(\mathbf{s}_n) - \tilde{\mu}_l^2(\mathbf{s}_n) - \tilde{\sigma}_l^2(\mathbf{s}_n) \Big],$ (21)

where $d_{\mathrm{IS}}(x, y) = x/y - \ln(x/y) - 1$ is the Itakura-Saito (IS) divergence. We note that maximizing the ELBO with respect to the parameters $\boldsymbol{\theta}_s$ of the generative speech model amounts to minimizing a cost function that involves the IS divergence between the observed power spectrogram $|s_{fn}|^2$ and the variance model $\sigma_f^2(\mathbf{z}_n)$. It is interesting to highlight the similarity with the Gaussian generative speech model (1)-(2), where maximum likelihood estimation of its parameters also amounts to minimizing the IS divergence between the observed power spectrogram and the NMF-based variance parametrization, as proved in [3].
Finally, as the expectation in (21) cannot be computed in analytical form, we again use a Monte Carlo approximation:

$\mathbb{E}_{q(\mathbf{z}_n | \mathbf{s}_n;\, \boldsymbol{\psi})}\Big[ d_{\mathrm{IS}}\big(|s_{fn}|^2, \sigma_f^2(\mathbf{z}_n)\big) \Big] \approx \dfrac{1}{R} \displaystyle\sum_{r=1}^{R} d_{\mathrm{IS}}\big(|s_{fn}|^2, \sigma_f^2(\mathbf{z}_n^{(r)})\big),$ (22)

where for all $r \in \{1,\dots,R\}$, $\mathbf{z}_n^{(r)}$ is independently drawn from (19) using the so-called reparametrization trick [8]. Independently for all $l$, it consists in sampling:

$z_{l,n}^{(r)} = \tilde{\mu}_l(\mathbf{s}_n) + \tilde{\sigma}_l(\mathbf{s}_n)\, \epsilon_l^{(r)}, \qquad \epsilon_l^{(r)} \sim \mathcal{N}(0, 1).$ (23)

Injecting (22) into (21), we obtain an objective function which is differentiable with respect to both $\boldsymbol{\theta}_s$ and $\boldsymbol{\psi}$, and which can be optimized using gradient-ascent-based algorithms.
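A one-sample estimate of the resulting per-frame objective can be sketched as follows, combining the reparametrization (23), the IS reconstruction term, and the Gaussian KL term of (21). The function interface and argument names are illustrative assumptions.

```python
import numpy as np

def elbo_sample(s_abs2, enc_mean, enc_logvar, decoder_var, rng):
    """One-sample Monte Carlo estimate of the per-frame ELBO (sketch):
    reparametrize z, measure the IS divergence between the clean power
    spectrum s_abs2 and the decoded variance, subtract the Gaussian KL.
    enc_mean / enc_logvar come from the recognition network."""
    eps = rng.standard_normal(enc_mean.size)
    z = enc_mean + np.exp(0.5 * enc_logvar) * eps       # reparametrization
    v = decoder_var(z)
    ratio = s_abs2 / v
    recon = -np.sum(ratio - np.log(ratio) - 1)          # minus IS divergence
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = -0.5 * np.sum(1 + enc_logvar - enc_mean**2 - np.exp(enc_logvar))
    return recon - kl
```

Since the sampled `z` is a differentiable function of `enc_mean` and `enc_logvar`, gradients flow through the encoder, which is the point of the reparametrization trick.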
6 Experiments
Database: The supervised speech model parameters are learned from the training set of the TIMIT database [23]. It contains almost 4 hours of speech signals at a 16 kHz sampling rate, distributed over 462 speakers. For the evaluation of the speech enhancement algorithm, we mixed clean speech signals from the test set of the TIMIT database and noise signals from the DEMAND database [24], covering various noise environments: domestic, nature, office, indoor public spaces, street and transportation. We created 168 mixtures at a 0 dB signal-to-noise ratio (one mixture per speaker in the TIMIT test set). Note that both speakers and sentences are different from those in the training set.
Baseline methods: Considering that both approaches belong to the variance modeling framework, we first compare our method with a semi-supervised NMF baseline as described in Section 2. The speech NMF dictionary is learned using the training set of the TIMIT database, and only the speech activation matrix and the noise NMF model parameters are estimated from the noisy mixture signals. The speech signal is then recovered by Wiener filtering. We also compare our approach with the fully supervised deep learning method proposed in [10]. In this work, a deep neural network is trained to map noisy speech log-power spectrograms to clean speech log-power spectrograms. The authors showed that their mapping-based approach outperformed other deep learning techniques that rely on the estimation of time-frequency masks. Moreover, the authors used more than 100 different noise types to train their system, which proved quite effective in handling unseen noise types. It is therefore a very relevant method to compare with. We used the Python code provided by the authors (https://github.com/yongxuUSTC/sednn).
Parameter settings: The STFT is computed using a 64 ms sine window (i.e., 1024 samples at 16 kHz) with 75% overlap. We compare the performance of our method and of the semi-supervised NMF baseline as a function of the dimension $L$ of the latent space and of the rank $K_s$ of the NMF speech model; for a fair comparison, we set $K_s = L$. The rank of the noise model is fixed arbitrarily. Unsupervised NMF parameters are randomly initialized. For the proposed method, the gain vector $\mathbf{g}$ is initialized with an all-ones vector. The iterative algorithms of these two methods are stopped when the improvement of the objective function falls below a threshold (for our method we monitored $\tilde{Q}$ in (8)). The parameters of our MCEM algorithm are the following: at each E-step (see Section 4.1), we run 40 iterations of the Metropolis-Hastings algorithm with the random-walk proposal (9), and we discard the first 30 samples as the burn-in period. At the first iteration of the MCEM algorithm we need to initialize the Markov chain of the Metropolis-Hastings algorithm. For that purpose we use the recognition network with the mixture signal as input, since the clean speech signal is not observed. Then, at each new E-step, we use the last sample drawn at the previous E-step to initialize the Markov chain. For estimating the speech STFT coefficients (see Section 4.3), we run the Metropolis-Hastings algorithm for 100 iterations and discard the first 75 samples.
Variational autoencoder: The structure of our variational autoencoder is represented in Fig. 1. As in [8], we use hyperbolic tangent (tanh) activation functions, except for the encoder/decoder output layers, which use identity activation functions. The values of these output layers thus lie in $\mathbb{R}$, which is the reason why we output the logarithm of the variances. For learning the parameters $\boldsymbol{\theta}_s$ and $\boldsymbol{\psi}$ (see Section 5), we used the Adam optimizer [25]. 20% of the TIMIT training set was kept as a validation set, and early stopping with a patience of 10 epochs was used. Weights were initialized using the uniform initializer described in [26]. For reproducibility, a Python implementation of our algorithm is available online (https://github.com/sleglaive/MLSP2018).
Results: We evaluate the enhanced speech signal in terms of signal-to-distortion ratio (SDR) [27] and perceptual evaluation of speech quality (PESQ) measure [28]. The SDR is expressed in decibels (dB) and the PESQ score lies on a bounded scale. We computed the SDR using the mir_eval Python library (https://github.com/craffel/mir_eval). We first illustrate the interest of the time-frame-dependent gain introduced in (6). Inevitably, the variational autoencoder is trained on examples with a limited loudness range. If we force $g_n = 1$ for all $n$, we can therefore expect the speech enhancement quality to depend on the speech power in the observed noisy mixture signal. This problem is illustrated in Fig. 2: the SDR indicated by black dots suddenly drops for a scaling factor of the mixture signal greater than 12 dB, indicating that the power of the underlying clean speech signal in this mixture went out of the power range observed during training. One solution to this problem could be to perform some kind of data augmentation as in [7]. Here we instead exploited the statistical modeling framework, introducing the gains $g_n$ in order to handle this inherent drawback of learning-based approaches. The gray diamonds in Fig. 2 indeed show that the time-frame-dependent gains provide a solution to this robustness issue.
The speech enhancement results are presented in Fig. 3. We first observe that while the NMF baseline requires a sufficiently large rank $K_s$, our method obtains quite stable results across the tested latent dimensions. Moreover, it always performs better than the NMF baseline in terms of both SDR and PESQ measures. We also see that it outperforms the fully supervised deep learning approach [10]. It is indeed difficult for this kind of approach to generalize to unseen noise types, which strongly justifies the interest of semi-supervised approaches. Audio examples illustrating these results are available online (https://sleglaive.github.io/demomlsp18.html).
We conclude this section with some information about the computational time for enhancing a 2.6-second mixture, using a central processing unit at 2.5 GHz. One iteration of the proposed MCEM algorithm takes around 2.9 seconds. For the semi-supervised NMF baseline, one iteration of the update rules takes around 4 milliseconds. The fully supervised deep-learning approach, which is non-iterative, takes around 7 seconds.
7 Conclusion
In this paper we showed that, within the variance modeling framework, variational autoencoders are a suitable alternative to supervised NMF. We proposed a semi-supervised speech enhancement method based on an MCEM algorithm. Experimental results demonstrated that it outperforms both a semi-supervised NMF baseline and a fully supervised deep learning approach. We used a relatively shallow architecture for the variational autoencoder; we can therefore expect even better results with deeper architectures. Future work includes using variational autoencoders for separating similar sources, e.g. multiple speakers. This is a much harder problem, and we may have to exploit spatial information provided by multichannel mixtures to address it. Another interesting perspective would be to introduce a temporal model for the latent variables of the variational autoencoder.
References
 [1] Philipos C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
 [2] Emmanuel Vincent, Maria G. Jafari, Samer A. Abdallah, Mark D. Plumbley, and Mike E. Davies, “Probabilistic modeling paradigms for audio source separation,” in Machine Audition: Principles, Algorithms and Systems, Wenwu Wang, Ed., pp. 162–185. IGI Global, 2010.
 [3] Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
 [4] DeLiang Wang and Jitong Chen, “Supervised speech separation based on deep learning: an overview,” arXiv preprint arXiv:1708.07524, 2017.
 [5] Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent, “Multichannel audio source separation with deep neural networks,” IEEE Trans. Audio, Speech, Language Process., vol. 24, no. 9, pp. 1652–1664, 2016.
 [6] Paris Smaragdis and Shrikant Venkataramani, “A neural network alternative to nonnegative audio models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 86–90.
 [7] Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, and Tatsuya Kawahara, “Statistical speech enhancement based on probabilistic integration of variational autoencoder and nonnegative matrix factorization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 716–720.
 [8] Diederik P. Kingma and Max Welling, “Autoencoding variational Bayes,” in Proc. Int. Conf. Learning Representations (ICLR), 2014.
 [9] Greg C.G. Wei and Martin A. Tanner, “A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms,” Journal of the American statistical Association, vol. 85, no. 411, pp. 699–704, 1990.
 [10] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 1, pp. 7–19, 2015.
 [11] Antoine Liutkus, Roland Badeau, and Gaël Richard, “Gaussian processes for underdetermined source separation,” IEEE Trans. Signal Process., vol. 59, no. 7, pp. 3155–3167, 2011.
 [12] Esa Ollila, Jan Eriksson, and Visa Koivunen, “Complex elliptically symmetric random variables: generation, characterization, and circularity tests,” IEEE Trans. Signal Process., vol. 59, no. 1, pp. 58–69, 2011.
 [13] Antoine Liutkus and Roland Badeau, “Generalized Wiener filtering with fractional power spectrograms,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2015, pp. 266–270.

 [14] Kazuyoshi Yoshii, Katsutoshi Itoyama, and Masataka Goto, “Student’s t nonnegative matrix factorization and positive semidefinite tensor factorization for single-channel audio source separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2016, pp. 51–55.
 [15] Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proc. Int. Conf. Indep. Component Analysis and Signal Separation, 2007, pp. 414–421.
 [16] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the royal statistical society. Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
 [17] Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
 [18] K.S. Chan and Johannes Ledolter, “Monte Carlo EM estimation for time series models involving counts,” Journal of the American Statistical Association, vol. 90, no. 429, pp. 242–252, 1995.
 [19] Cédric Févotte and Jérôme Idier, “Algorithms for nonnegative matrix factorization with the β-divergence,” Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
 [20] David R. Hunter and Kenneth Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
 [21] C.F. Jeff Wu, “On the convergence properties of the EM algorithm,” The Annals of statistics, pp. 95–103, 1983.
 [22] “Supporting document,” http://sleglaive.github.io/documents/sup_doc_MLSP18.pdf.
 [23] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue, “TIMIT acoustic phonetic continuous speech corpus,” in Linguistic data consortium, 1993.
 [24] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, “The Diverse Environments Multichannel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” in Proc. Int. Cong. on Acoust., 2013.
 [25] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [26] Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artif. Intelligence and Stat., 2010, pp. 249–256.
 [27] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, 2006.
 [28] Antony W. Rix, John G. Beerends, Michael P. Hollier, and Andries P. Hekstra, “Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2001, pp. 749–752.