1 Introduction
Speech enhancement is an important problem in audio signal processing [1]. The objective is to recover a clean speech signal from a noisy mixture signal. In this work, we focus on single-channel (i.e. single-microphone) speech enhancement.
Discriminative approaches based on deep neural networks have been extensively used for speech enhancement. They aim to estimate a clean speech spectrogram or a time-frequency mask from a noisy speech spectrogram; see e.g.
[2, 3, 4, 5, 6]. Recently, deep generative speech models based on variational autoencoders (VAEs) [7] have been investigated for single-channel [8, 9, 10, 11] and multichannel speech enhancement [12, 13, 14]. A pre-trained deep generative speech model is combined with a nonnegative matrix factorization (NMF) [15] noise model whose parameters are estimated at test time, from the observation of the noisy mixture signal only. Compared with discriminative approaches, these generative methods do not require pairs of clean and noisy speech signals for training. This setting was referred to as "semi-supervised source separation" in previous works [16, 17, 18], a terminology which should not be confused with the supervised/unsupervised paradigms of machine learning.
To the best of our knowledge, the aforementioned works on VAE-based deep generative models for speech enhancement have only considered an independent modeling of the speech time frames, through the use of feed-forward and fully-connected architectures. In this work, we propose a recurrent VAE (RVAE) for modeling the speech signal. The generative model is a special case of the one proposed in [19], but the inference model for training is different. At test time, we develop a variational expectation-maximization (VEM) algorithm [20] to perform speech enhancement. The encoder of the RVAE is fine-tuned to approximate the posterior distribution of the latent variables, given the noisy speech observations. This model induces a posterior temporal dynamic over the latent variables, which is further propagated to the speech estimate. Experimental results show that this approach outperforms its feed-forward and fully-connected counterpart.
2 Deep generative speech model
2.1 Definition
Let $\mathbf{s} = \{\mathbf{s}_t \in \mathbb{C}^F\}_{t=1}^{T}$ denote a sequence of short-time Fourier transform (STFT) speech time frames, and $\mathbf{z} = \{\mathbf{z}_t \in \mathbb{R}^L\}_{t=1}^{T}$ a corresponding sequence of latent random vectors. We define the following hierarchical generative speech model, independently for all time frames $t \in \{1, \dots, T\}$:

$$\mathbf{z}_t \sim \mathcal{N}\big(\mathbf{0}, \mathbf{I}\big) \quad \text{and} \quad \mathbf{s}_t \mid \mathbf{z} \sim \mathcal{N}_c\big(\mathbf{0}, \operatorname{diag}\{\mathbf{v}_{s,t}\}\big), \tag{1}$$

where the variance vector $\mathbf{v}_{s,t} \in \mathbb{R}_+^F$ will be defined by means of a decoder neural network. $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the multivariate Gaussian distribution for a real-valued random vector, and $\mathcal{N}_c(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the multivariate complex proper Gaussian distribution [21]. Multiple choices are possible for the neural network defining $\mathbf{v}_{s,t}$, leading to the different probabilistic graphical models represented in Fig. 1.

FFNN generative speech model
Here, $\mathbf{v}_{s,t} = \varphi_{\text{dec}}(\mathbf{z}_t; \theta)$, where $\varphi_{\text{dec}}(\cdot\,; \theta)$ denotes a feed-forward fully-connected neural network (FFNN) of parameters $\theta$. Such an architecture was used in [8, 9, 10, 11, 12, 13, 14]. As represented in Fig. 1(a), this model results in the following factorization of the complete-data likelihood:

$$p(\mathbf{s}, \mathbf{z}; \theta) = \prod_{t=1}^{T} p(\mathbf{s}_t \mid \mathbf{z}_t; \theta)\, p(\mathbf{z}_t). \tag{2}$$

Note that in this case, the speech STFT time frames are not only conditionally independent, but also marginally independent, i.e. $p(\mathbf{s}; \theta) = \prod_{t=1}^{T} p(\mathbf{s}_t; \theta)$.
RNN generative speech model
Here, $\mathbf{v}_{s,t} = \varphi_{\text{dec}}(\mathbf{z}_{1:t}; \theta)$, where $\varphi_{\text{dec}}(\cdot\,; \theta)$ denotes the output at time frame $t$ of a recurrent neural network (RNN) taking as input the sequence of latent random vectors $\mathbf{z}_{1:t} = \{\mathbf{z}_{t'}\}_{t'=1}^{t}$. As represented in Fig. 1(b), we have the following factorization of the complete-data likelihood:

$$p(\mathbf{s}, \mathbf{z}; \theta) = \prod_{t=1}^{T} p(\mathbf{s}_t \mid \mathbf{z}_{1:t}; \theta)\, p(\mathbf{z}_t). \tag{3}$$

Note that for this RNN-based model, the speech STFT time frames are still conditionally independent, but no longer marginally independent.
BRNN generative speech model
Here, $\mathbf{v}_{s,t} = \varphi_{\text{dec}}(\mathbf{z}; \theta)$, where $\varphi_{\text{dec}}(\cdot\,; \theta)$ denotes the output at time frame $t$ of a bidirectional RNN (BRNN) taking as input the complete sequence of latent random vectors $\mathbf{z} = \mathbf{z}_{1:T}$. As represented in Fig. 1(c), we end up with the following factorization of the complete-data likelihood:

$$p(\mathbf{s}, \mathbf{z}; \theta) = \prod_{t=1}^{T} p(\mathbf{s}_t \mid \mathbf{z}; \theta)\, p(\mathbf{z}_t). \tag{4}$$

As for the RNN-based model, the speech STFT time frames are conditionally independent but not marginally independent.
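To make the three choices concrete, the following sketch shows one possible way of producing the variance $\mathbf{v}_{s,t}$ in each case. This is minimal PyTorch code: the layer sizes, cell types and exponential output nonlinearity are illustrative assumptions, not the authors' exact architecture (which is given in Appendix A.3).

```python
# Three decoder variants producing the speech variance v_{s,t} of Eq. (1).
import torch
import torch.nn as nn

class FFNNDecoder(nn.Module):
    """v_{s,t} depends on z_t only (Fig. 1(a))."""
    def __init__(self, L=16, F=513, H=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(L, H), nn.Tanh(), nn.Linear(H, F))

    def forward(self, z):                  # z: (..., T, L)
        return torch.exp(self.net(z))      # positive variance, (..., T, F)

class RNNDecoder(nn.Module):
    """v_{s,t} depends on z_{1:t} through a causal LSTM (Fig. 1(b))."""
    def __init__(self, L=16, F=513, H=128):
        super().__init__()
        self.rnn = nn.LSTM(L, H, batch_first=True)
        self.out = nn.Linear(H, F)

    def forward(self, z):                  # z: (1, T, L)
        h, _ = self.rnn(z)
        return torch.exp(self.out(h))      # (1, T, F)

class BRNNDecoder(nn.Module):
    """v_{s,t} depends on the whole sequence z_{1:T} (Fig. 1(c))."""
    def __init__(self, L=16, F=513, H=128):
        super().__init__()
        self.rnn = nn.LSTM(L, H, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * H, F)

    def forward(self, z):
        h, _ = self.rnn(z)
        return torch.exp(self.out(h))
```

In all three cases the decoder outputs a strictly positive variance; only the conditioning on the latent vectors changes.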
Note that, to avoid cluttered notations, the variance $\mathbf{v}_{s,t}$ in the generative speech model (1) is not made explicitly dependent on the decoder network parameters $\theta$, but it clearly is.

2.2 Training
We would like to estimate the decoder parameters $\theta$ in the maximum likelihood sense, i.e. by maximizing $\sum_{n=1}^{N} \ln p(\mathbf{s}_n; \theta)$, where $\{\mathbf{s}_n\}_{n=1}^{N}$ is a training dataset consisting of $N$ i.i.d. sequences of STFT speech time frames. In the following, because it simplifies the presentation, we simply omit the sum over the sequences and the associated subscript $n$.
Due to the non-linear relationship between $\mathbf{s}$ and $\mathbf{z}$, the marginal likelihood $p(\mathbf{s}; \theta)$ is analytically intractable and cannot be straightforwardly optimized. We therefore resort to the framework of variational autoencoders [7] for parameter estimation, which builds upon stochastic fixed-form variational inference [22, 23, 24, 25, 26].
This methodology first introduces a variational distribution (or inference model) $q(\mathbf{z} \mid \mathbf{s}; \phi)$, parametrized by $\phi$, which is an approximation of the true intractable posterior distribution $p(\mathbf{z} \mid \mathbf{s}; \theta)$. For any variational distribution, we have the following decomposition of the log-marginal likelihood:

$$\ln p(\mathbf{s}; \theta) = \mathcal{L}(\phi, \theta) + D_{\mathrm{KL}}\big(q(\mathbf{z} \mid \mathbf{s}; \phi) \parallel p(\mathbf{z} \mid \mathbf{s}; \theta)\big), \tag{5}$$

where $\mathcal{L}(\phi, \theta)$ is the variational free energy (VFE), also referred to as the evidence lower bound, defined by

$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\big[\ln p(\mathbf{s}, \mathbf{z}; \theta) - \ln q(\mathbf{z} \mid \mathbf{s}; \phi)\big], \tag{6}$$

and $D_{\mathrm{KL}}(q \parallel p) = \mathbb{E}_q[\ln(q/p)]$ is the Kullback-Leibler (KL) divergence. As the latter is always non-negative, we see from (5) that the VFE is a lower bound of the intractable log-marginal likelihood. Moreover, this bound is tight if and only if $q(\mathbf{z} \mid \mathbf{s}; \phi) = p(\mathbf{z} \mid \mathbf{s}; \theta)$. Our objective is therefore to maximize the VFE with respect to (w.r.t.) both $\phi$ and $\theta$. But in order to fully define the VFE in (6), we have to specify the form of the variational distribution $q(\mathbf{z} \mid \mathbf{s}; \phi)$.
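For completeness, the decomposition (5) follows in one line from Bayes' rule and the definition (6):

$$\ln p(\mathbf{s}; \theta) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\big[\ln p(\mathbf{s}, \mathbf{z}; \theta) - \ln p(\mathbf{z} \mid \mathbf{s}; \theta)\big] = \mathcal{L}(\phi, \theta) + \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\left[\ln \frac{q(\mathbf{z} \mid \mathbf{s}; \phi)}{p(\mathbf{z} \mid \mathbf{s}; \theta)}\right],$$

where the last expectation is precisely the KL divergence in (5).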
Using the chain rule for joint distributions, the posterior distribution of the latent vectors can be exactly expressed as follows:

$$p(\mathbf{z} \mid \mathbf{s}; \theta) = \prod_{t=1}^{T} p(\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{s}; \theta), \tag{7}$$

where we use the convention $\mathbf{z}_{1:0} = \emptyset$. The variational distribution is naturally also expressed as

$$q(\mathbf{z} \mid \mathbf{s}; \phi) = \prod_{t=1}^{T} q(\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{s}; \phi). \tag{8}$$

In this work, $q(\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{s}; \phi)$ denotes the probability density function (pdf) of the following Gaussian inference model:

$$\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{s} \sim \mathcal{N}\big(\boldsymbol{\mu}_{z,t}, \operatorname{diag}\{\mathbf{v}_{z,t}\}\big), \tag{9}$$

where the mean and variance vectors $\boldsymbol{\mu}_{z,t} \in \mathbb{R}^L$ and $\mathbf{v}_{z,t} \in \mathbb{R}_+^L$ will be defined by means of an encoder neural network.
Inference model for the BRNN generative speech model
For the BRNN generative speech model, the parameters of the variational distribution in (9) are defined by

$$\big[\boldsymbol{\mu}_{z,t},\, \mathbf{v}_{z,t}\big] = \varphi_{\text{enc}}\big(\mathbf{z}_{1:t-1}, \mathbf{s}; \phi\big), \tag{10}$$

where $\varphi_{\text{enc}}(\cdot\,; \phi)$ denotes the output at time frame $t$ of a neural network whose parameters are denoted by $\phi$. It is composed of:

- a "prediction block": a causal recurrent block processing the past latent vectors $\mathbf{z}_{1:t-1}$;
- an "observation block": a bidirectional recurrent block processing the complete sequence of STFT speech time frames $\mathbf{s}$;
- an "update block": a feed-forward fully-connected block processing the outputs at time frame $t$ of the two previous blocks.
If we want to sample from $q(\mathbf{z} \mid \mathbf{s}; \phi)$ in (8), we have to sample each $\mathbf{z}_t$ recursively, starting from $\mathbf{z}_1$ up to $\mathbf{z}_T$. Interestingly, the posterior is formed by running forward over the latent vectors, and both forward and backward over the input sequence of STFT speech time frames. In other words, the latent vector at a given time frame is inferred by taking into account not only the latent vectors at the previous time steps, but also the speech STFT frames at the current, past and future time steps. These anticausal relationships were not taken into account in the RVAE model of [19].
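The recursive structure of this inference model can be summarized as follows. This is an illustrative PyTorch sketch with assumed cell types and dimensions (the actual architecture is given in Appendix A.3); note how the observation block reads the whole sequence while the latent vectors are sampled one time frame at a time.

```python
# Minimal sketch of the BRNN inference model (9)-(10) and of the recursive
# (ancestral) sampling it requires. Block names follow the text.
import torch
import torch.nn as nn

class BRNNEncoder(nn.Module):
    def __init__(self, L=16, F=513, H=128):
        super().__init__()
        # "observation block": bidirectional pass over all STFT frames
        self.obs = nn.LSTM(F, H, batch_first=True, bidirectional=True)
        # "prediction block": causal recursion over past latent vectors
        self.pred = nn.LSTMCell(L, H)
        # "update block": feed-forward fusion of the two blocks' outputs
        self.update = nn.Sequential(nn.Linear(3 * H, H), nn.Tanh())
        self.mean = nn.Linear(H, L)
        self.logvar = nn.Linear(H, L)

    def sample_posterior(self, s_pow):
        """s_pow: (1, T, F) power spectrogram. Returns one sampled latent
        trajectory together with the posterior means and log-variances."""
        T = s_pow.shape[1]
        obs_out, _ = self.obs(s_pow)                    # (1, T, 2H)
        h = torch.zeros(1, self.pred.hidden_size)
        c = torch.zeros(1, self.pred.hidden_size)
        z_prev = torch.zeros(1, self.mean.out_features)
        z, mu, logv = [], [], []
        for t in range(T):                              # forward recursion over z
            h, c = self.pred(z_prev, (h, c))            # summarizes z_{1:t-1}
            u = self.update(torch.cat([h, obs_out[:, t]], dim=-1))
            mu_t, logv_t = self.mean(u), self.logvar(u)
            eps = torch.randn_like(mu_t)                # reparametrization trick
            z_prev = mu_t + torch.exp(0.5 * logv_t) * eps
            z.append(z_prev); mu.append(mu_t); logv.append(logv_t)
        return torch.stack(z, 1), torch.stack(mu, 1), torch.stack(logv, 1)
```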
Inference model for the RNN generative speech model
Using the fact that $\mathbf{z}_t$ is conditionally independent of all other nodes in Fig. 1(b) given its Markov blanket (defined as the set of parents, children and co-parents of that node) [27], (7) can be simplified as:

$$p(\mathbf{z} \mid \mathbf{s}; \theta) = \prod_{t=1}^{T} p(\mathbf{z}_t \mid \mathbf{z}_{1:t-1}, \mathbf{s}_{t:T}; \theta), \tag{11}$$

where $\mathbf{s}_{t:T} = \{\mathbf{s}_{t'}\}_{t'=t}^{T}$. This conditional independence also applies to the variational distribution in (9), whose parameters are now given by

$$\big[\boldsymbol{\mu}_{z,t},\, \mathbf{v}_{z,t}\big] = \varphi_{\text{enc}}\big(\mathbf{z}_{1:t-1}, \mathbf{s}_{t:T}; \phi\big), \tag{12}$$

where $\varphi_{\text{enc}}(\cdot\,; \phi)$ denotes the same neural network as for the BRNN-based model, except that the observation block is no longer a bidirectional recurrent block, but an anticausal recurrent one. The full approximate posterior is now formed by running forward over the latent vectors and backward over the input sequence of STFT speech time frames.
Inference model for the FFNN generative speech model
For the same reason as before, by studying the Markov blanket of $\mathbf{z}_t$ in Fig. 1(a), the dependencies in (7) can be simplified as follows:

$$p(\mathbf{z} \mid \mathbf{s}; \theta) = \prod_{t=1}^{T} p(\mathbf{z}_t \mid \mathbf{s}_t; \theta). \tag{13}$$

This simplification also applies to the variational distribution in (9), whose parameters are now given by

$$\big[\boldsymbol{\mu}_{z,t},\, \mathbf{v}_{z,t}\big] = \varphi_{\text{enc}}\big(\mathbf{s}_t; \phi\big), \tag{14}$$

where $\varphi_{\text{enc}}(\cdot\,; \phi)$ denotes the output of an FFNN. Such an architecture was used in [8, 9, 10, 11, 12, 13, 14]. This is the only case where, from the approximate posterior, we can sample all latent vectors in parallel for all time frames, without further approximation.
Here also, the mean and variance vectors in the inference model (9) are not made explicitly dependent on the encoder network parameters $\phi$, but they clearly are.
Variational free energy
Given the generative model (1) and the general inference model (9), we can develop the VFE defined in (6) as follows (derivation details are provided in Appendix A.1):

$$\mathcal{L}(\phi, \theta) \overset{c}{=} -\sum_{f,t} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\Big[d_{\mathrm{IS}}\big(|s_{ft}|^2;\, v_{s,ft}\big)\Big] + \frac{1}{2}\sum_{l,t} \mathbb{E}_{q(\mathbf{z}_{1:t-1} \mid \mathbf{s}; \phi)}\Big[\ln v_{z,lt} - \mu_{z,lt}^2 - v_{z,lt}\Big], \tag{15}$$

where $\overset{c}{=}$ denotes equality up to an additive constant w.r.t. $\phi$ and $\theta$, $d_{\mathrm{IS}}(x; y) = x/y - \ln(x/y) - 1$ is the Itakura-Saito (IS) divergence [15], $s_{ft}$ and $v_{s,ft}$ denote respectively the $f$-th entries of $\mathbf{s}_t$ and $\mathbf{v}_{s,t}$, and $\mu_{z,lt}$ and $v_{z,lt}$ denote respectively the $l$-th entry of $\boldsymbol{\mu}_{z,t}$ and $\mathbf{v}_{z,t}$.
The expectations in (15) are analytically intractable, so we compute unbiased Monte Carlo estimates using a set of i.i.d. realizations drawn from $q(\mathbf{z} \mid \mathbf{s}; \phi)$. For that purpose, we use the "reparametrization trick" introduced in [7]. The resulting objective function is differentiable w.r.t. both $\phi$ and $\theta$, and it can be optimized using gradient-ascent-based algorithms. Finally, we recall that in the final expression of the VFE there should actually be an additional sum over the $N$ i.i.d. sequences of the training dataset. For stochastic or mini-batch optimization algorithms, we only consider a subset of these training sequences for each update of the model parameters.
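For illustration, a single training step with one Monte Carlo sample could look as follows. This sketch reuses the encoder and decoder modules sketched above; the optimizer settings are assumptions.

```python
# One training step maximizing a Monte Carlo estimate of the VFE (15) with a
# single sample and the reparametrization trick.
import torch

def itakura_saito(x, y, eps=1e-10):
    """d_IS(x; y) = x / y - ln(x / y) - 1, element-wise."""
    r = (x + eps) / (y + eps)
    return r - torch.log(r) - 1.0

def training_step(s_pow, encoder, decoder, optimizer):
    z, mu, logv = encoder.sample_posterior(s_pow)       # sample from q(z|s; phi)
    v_s = decoder(z)                                    # decoder variance v_{s,t}
    recon = itakura_saito(s_pow, v_s).sum()             # data-fidelity term of (15)
    kl = -0.5 * (logv - mu.pow(2) - logv.exp()).sum()   # regularization term of (15)
    loss = recon + kl                                   # negative VFE, up to a constant
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```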
3 Speech enhancement: Model and algorithm
3.1 Speech, noise and mixture model
The deep generative clean speech model, along with the procedure to learn its parameters, was defined in the previous section. For speech enhancement, we now consider a Gaussian noise model based on an NMF parametrization of the variance [15]. Independently for all time frames $t$, we have:

$$\mathbf{b}_t \sim \mathcal{N}_c\big(\mathbf{0}, \operatorname{diag}\{(\mathbf{W}_b\mathbf{H}_b)_{:,t}\}\big), \tag{16}$$

where $(\mathbf{W}_b\mathbf{H}_b)_{:,t}$ denotes the $t$-th column of $\mathbf{W}_b\mathbf{H}_b$, with $\mathbf{W}_b \in \mathbb{R}_+^{F \times K}$ and $\mathbf{H}_b \in \mathbb{R}_+^{K \times T}$.
The noisy mixture signal is modeled as $\mathbf{x}_t = \sqrt{g_t}\, \mathbf{s}_t + \mathbf{b}_t$, where $g_t \in \mathbb{R}_+$ is a gain parameter scaling the level of the speech signal at each time frame [9]. We further consider the independence of the speech and noise signals, so that the likelihood is defined by

$$\mathbf{x}_t \mid \mathbf{z} \sim \mathcal{N}_c\big(\mathbf{0}, \operatorname{diag}\{\mathbf{v}_{x,t}\}\big), \tag{17}$$

where $\mathbf{v}_{x,t} = g_t\, \mathbf{v}_{s,t} + (\mathbf{W}_b\mathbf{H}_b)_{:,t}$.
3.2 Speech enhancement algorithm
We consider that the speech model parameters $\theta$, which have been learned during the training stage, are fixed, so we omit them in the rest of this section. We now need to estimate the remaining model parameters $\theta_u = \{\mathbf{W}_b, \mathbf{H}_b, \mathbf{g}\}$, with $\mathbf{g} = [g_1, \dots, g_T]^\top$, from the observation of the noisy mixture signal $\mathbf{x} = \{\mathbf{x}_t\}_{t=1}^{T}$. However, very similarly to the training stage (see Section 2.2), the marginal likelihood $p(\mathbf{x}; \theta_u)$ is intractable, and we resort again to variational inference. The VFE at test time is defined by

$$\mathcal{L}(\phi, \theta_u) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x}; \phi)}\big[\ln p(\mathbf{x}, \mathbf{z}; \theta_u) - \ln q(\mathbf{z} \mid \mathbf{x}; \phi)\big]. \tag{18}$$

Following a VEM algorithm [20], we maximize this criterion alternately w.r.t. $\phi$ at the E-step and w.r.t. $\theta_u$ at the M-step. Note that here also, $\ln p(\mathbf{x}; \theta_u) \ge \mathcal{L}(\phi, \theta_u)$, with equality if and only if $q(\mathbf{z} \mid \mathbf{x}; \phi) = p(\mathbf{z} \mid \mathbf{x}; \theta_u)$.
Variational E-step with fine-tuned encoder
We consider a fixed-form variational inference strategy, reusing the inference model learned during the training stage. More precisely, the variational distribution $q(\mathbf{z} \mid \mathbf{x}; \phi)$ is defined exactly as in (8) and (9), except that $\mathbf{s}$ is replaced with $\mathbf{x}$. Remember that the mean and variance vectors $\boldsymbol{\mu}_{z,t}$ and $\mathbf{v}_{z,t}$ in (9) correspond to the VAE encoder network, whose parameters $\phi$ were estimated along with the parameters $\theta$ of the generative speech model. During the training stage, this encoder network took clean speech signals as input. It is now fine-tuned with a noisy speech signal as input. For that purpose, we maximize $\mathcal{L}(\phi, \theta_u)$ w.r.t. $\phi$ only, with $\theta_u$ fixed. This criterion takes the exact same form as (15), except that $|s_{ft}|^2$ is replaced with $|x_{ft}|^2$, where $x_{ft}$ denotes the $f$-th entry of $\mathbf{x}_t$; $q(\mathbf{z} \mid \mathbf{s}; \phi)$ is replaced with $q(\mathbf{z} \mid \mathbf{x}; \phi)$; and $v_{s,ft}$ is replaced with $v_{x,ft}$, the $f$-th entry of the vector $\mathbf{v}_{x,t}$ defined along with (17). Exactly as in Section 2.2, intractable expectations are replaced with a Monte Carlo estimate, and the VFE is maximized w.r.t. $\phi$ by means of gradient-based optimization techniques. In summary, we use the framework of VAEs [7] both at training time, for estimating $\theta$ and $\phi$ from clean speech signals, and at test time, for fine-tuning $\phi$ from the noisy speech signal with $\theta$ fixed. The idea of refitting the encoder was also proposed in [28], in a different context.
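A sketch of this fine-tuning E-step is given below, reusing `itakura_saito` and the encoder/decoder sketches above. The parameter shapes follow Section 3.1 (`Wb`: $F \times K$, `Hb`: $K \times T$, `g`: $T$); the single-sample estimate and optimizer settings are assumptions.

```python
# Variational E-step: optimize the encoder parameters phi on the noisy
# mixture, with the decoder (theta) and the noise/gain parameters fixed.
import torch

def e_step(x_pow, encoder, decoder, Wb, Hb, g, n_steps=1, lr=1e-2):
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)  # only phi is updated
    for _ in range(n_steps):
        z, mu, logv = encoder.sample_posterior(x_pow)    # q(z|x; phi)
        v_x = g.unsqueeze(-1) * decoder(z) + (Wb @ Hb).T # mixture variance (17)
        loss = itakura_saito(x_pow, v_x).sum() \
               - 0.5 * (logv - mu.pow(2) - logv.exp()).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```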
Point-estimate E-step
In the experiments, we will compare this variational E-step with an alternative proposed in [29], which consists in relying only on a point estimate of the latent variables. In our framework, this approach can be understood as assuming that the approximate posterior $q(\mathbf{z} \mid \mathbf{x})$ is a Dirac delta function centered at the maximum a posteriori estimate of $\mathbf{z}$. This estimate can be computed by maximizing $\ln p(\mathbf{x}, \mathbf{z}; \theta_u)$ w.r.t. $\mathbf{z}$ by means of gradient-based techniques, where backpropagation is used to compute the gradient w.r.t. the input of the generative decoder network.
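A sketch of this point-estimate E-step, under the same assumptions as above, optimizes the latent trajectory directly:

```python
# Point-estimate E-step: gradient-based maximum a posteriori estimation of
# the latent trajectory z itself, with the decoder fixed.
import torch

def peem_e_step(x_pow, decoder, Wb, Hb, g, T, L=16, n_steps=10, lr=1e-2):
    z = torch.zeros(1, T, L, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        v_x = g.unsqueeze(-1) * decoder(z) + (Wb @ Hb).T
        # negative log-posterior up to constants:
        # IS data term (likelihood) + standard Gaussian prior on z
        loss = itakura_saito(x_pow, v_x).sum() + 0.5 * z.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```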
M-step
For both the VEM algorithm and the point-estimate alternative, the M-step consists in maximizing $\mathcal{L}(\phi, \theta_u)$ w.r.t. $\theta_u$, under a non-negativity constraint and with $\phi$ fixed. Replacing intractable expectations with Monte Carlo estimates, the M-step can be recast as minimizing the following criterion [9]:

$$\mathcal{C}(\theta_u) = \frac{1}{R} \sum_{r=1}^{R} \sum_{f,t} d_{\mathrm{IS}}\big(|x_{ft}|^2;\, v_{x,ft}^{(r)}\big), \tag{19}$$

where $v_{x,ft}^{(r)}$, the $f$-th entry of $\mathbf{v}_{x,t}^{(r)} = g_t\, \mathbf{v}_{s,t}^{(r)} + (\mathbf{W}_b\mathbf{H}_b)_{:,t}$, implicitly depends on $\theta_u$. For the VEM algorithm, $\{\mathbf{z}^{(r)}\}_{r=1}^{R}$ is a set of i.i.d. sequences drawn from $q(\mathbf{z} \mid \mathbf{x}; \phi)$ using the current value of the parameters $\phi$. For the point-estimate approach, $R = 1$ and $\mathbf{z}^{(1)}$ corresponds to the maximum a posteriori estimate. This optimization problem can be tackled using a majorize-minimize approach [30], which leads to the multiplicative update rules derived in [9] using the methodology proposed in [31] (these updates are recalled in Appendix A.2).
Speech reconstruction
Given the estimated model parameters, we want to compute the posterior mean of the speech coefficients:

$$\hat{\mathbf{s}}_t = \mathbb{E}_{p(\mathbf{s}_t \mid \mathbf{x}; \theta_u)}[\mathbf{s}_t] = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}; \theta_u)}\left[\frac{\sqrt{g_t}\, \mathbf{v}_{s,t}}{\mathbf{v}_{x,t}}\right] \odot \mathbf{x}_t, \tag{20}$$

where vector division is element-wise. In practice, the speech estimate is actually given by the scaled coefficients $\sqrt{g_t}\, \hat{\mathbf{s}}_t$. Note that (20) corresponds to a Wiener-like filtering, averaged over all possible realizations of the latent variables according to their posterior distribution. As before, this expectation is intractable, but we approximate it by a Monte Carlo estimate using samples drawn from $q(\mathbf{z} \mid \mathbf{x}; \phi)$ for the VEM algorithm. For the point-estimate approach, $p(\mathbf{z} \mid \mathbf{x}; \theta_u)$ is approximated by a Dirac delta function centered at the maximum a posteriori estimate.
In the case of the RNN- and BRNN-based generative speech models (see Section 2.1), it is important to remember that sampling from $q(\mathbf{z} \mid \mathbf{x}; \phi)$ is actually done recursively, by sampling from $\mathbf{z}_1$ to $\mathbf{z}_T$ (see Section 2.2). Therefore, there is a posterior temporal dynamic that is propagated from the latent vectors to the estimated speech signal, through the expectation in the Wiener-like filtering of (20). This temporal dynamic is expected to be beneficial compared with the FFNN generative speech model, where the speech estimate is built independently for all time frames.
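The following sketch implements this Monte Carlo approximation of (20), averaging the Wiener-like mask over samples of the latent trajectory, under the same assumed modules and shapes as above; it returns the scaled estimates $\sqrt{g_t}\, \hat{\mathbf{s}}_t$.

```python
# Monte Carlo approximation of the Wiener-like filtering (20).
import torch

def reconstruct(x_stft, encoder, decoder, Wb, Hb, g, R=1):
    """x_stft: (1, T, F) complex STFT of the noisy mixture."""
    x_pow = x_stft.abs().pow(2)
    mask = 0.0
    with torch.no_grad():
        for _ in range(R):                               # samples from q(z|x; phi)
            z, _, _ = encoder.sample_posterior(x_pow)
            v_s = decoder(z)
            v_x = g.unsqueeze(-1) * v_s + (Wb @ Hb).T
            mask = mask + (g.unsqueeze(-1) * v_s) / v_x / R
    return mask * x_stft                                 # scaled speech estimate
```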
4 Experiments
Dataset
The deep generative speech models are trained using around 25 hours of clean speech data, from the "si_tr_s" subset of the Wall Street Journal (WSJ0) dataset [32]. Early stopping with a patience of 20 epochs is performed using the subset "si_dt_05" (around 2 hours of speech). We removed the trailing and leading silences of each utterance. For testing, we used around 1.5 hours of noisy speech, corresponding to 651 synthetic mixtures. The clean speech signals are taken from the "si_et_05" subset of WSJ0 (unseen speakers), and the noise signals from the "verification" subset of the QUT-NOISE dataset [33]. Each mixture is created by uniformly sampling a noise type among {"café", "home", "street", "car"} and a signal-to-noise ratio (SNR) among {-5, 0, 5} dB. The intensity of each signal for creating a mixture at a given SNR is computed using the ITU-R BS.1770-4 protocol [34]. Note that an SNR computed with this protocol is here 2.5 dB lower (on average) than with a simple sum of the squared signal coefficients. Finally, all signals have a 16-kHz sampling rate, and the STFT is computed using a 64-ms sine window (i.e. $F = 513$) with 75% overlap.
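For reference, these analysis settings correspond to the following STFT configuration (a minimal sketch; the use of `torch.stft` and the exact sine-window definition are assumptions):

```python
import torch

fs, wlen = 16000, 1024                 # 16 kHz, 64-ms analysis window
hop = wlen // 4                        # 75% overlap
n = torch.arange(wlen)
window = torch.sin(torch.pi * (n + 0.5) / wlen)   # sine analysis window

x = torch.randn(fs)                    # 1 s of dummy waveform
X = torch.stft(x, n_fft=wlen, hop_length=hop, window=window,
               return_complex=True)    # (F, T) with F = wlen // 2 + 1 = 513
```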
Network architecture and training parameters
All details regarding the encoder and decoder network architectures and their training procedure are provided in Appendix A.3.
Speech enhancement parameters
The dimension of the latent space for the deep generative speech model is fixed to $L = 16$, and the rank of the NMF-based noise model to $K = 10$. $\mathbf{W}_b$ and $\mathbf{H}_b$ are randomly initialized (with a fixed seed to ensure fair comparisons), and $\mathbf{g}$ is initialized with an all-ones vector. For computing (19), we fix the number of samples to $R = 1$, which is also the case for building the Monte Carlo estimate of (20). The VEM algorithm and its point-estimate alternative (referred to as PEEM) are run for 500 iterations. We used Adam [35] with a step size of $10^{-2}$ for the gradient-based iterative optimization technique involved in the E-step. For the FFNN deep generative speech model, it was found that an insufficient number of gradient steps had a strong negative impact on the results, so this number was fixed to 10. For the (B)RNN models, this choice had a much smaller impact, so it was fixed to 1, thus limiting the computational burden.
| Algorithm | Model | SI-SDR (dB) | PESQ | ESTOI |
|---|---|---|---|---|
| MCEM [9] | FFNN | 5.4 ± 0.4 | 2.22 ± 0.04 | 0.60 ± 0.01 |
| PEEM | FFNN | 4.4 ± 0.4 | 2.21 ± 0.04 | 0.58 ± 0.01 |
| | RNN | 5.8 ± 0.5 | 2.33 ± 0.04 | 0.63 ± 0.01 |
| | BRNN | 5.4 ± 0.5 | 2.30 ± 0.04 | 0.62 ± 0.01 |
| VEM | FFNN | 4.4 ± 0.4 | 1.93 ± 0.05 | 0.53 ± 0.01 |
| | RNN | 6.8 ± 0.4 | 2.33 ± 0.04 | 0.67 ± 0.01 |
| | BRNN | 6.9 ± 0.5 | 2.35 ± 0.04 | 0.67 ± 0.01 |
| noisy mixture | | 2.6 ± 0.5 | 1.82 ± 0.03 | 0.49 ± 0.01 |
| oracle Wiener filtering | | 12.1 ± 0.3 | 3.13 ± 0.02 | 0.88 ± 0.01 |

Table 1: Median results and confidence intervals.
Results
We compare the performance of the VEM and PEEM algorithms for the three types of deep generative speech models. For the FFNN model only, we also compare with the Monte Carlo EM (MCEM) algorithm proposed in [9] (which cannot be straightforwardly adapted to the (B)RNN models). The enhanced speech quality is evaluated in terms of scale-invariant signal-to-distortion ratio (SI-SDR) in dB [36], perceptual evaluation of speech quality (PESQ) measure (between -0.5 and 4.5) [37], and extended short-time objective intelligibility (ESTOI) measure (between 0 and 1) [38]. For all measures, higher is better. The median results over all SNRs, along with their confidence intervals, are presented in Table 1. Best results are in bold black font, while bold gray font indicates results that are not significantly different from the best ones. As a reference, we also provide the results obtained with the noisy mixture signal as the speech estimate, and with oracle Wiener filtering. Note that the oracle results are here particularly low, which shows the difficulty of the dataset: the oracle SI-SDR is for instance 7 dB lower than the one in [9]. Therefore, the VEM and PEEM results should not be directly compared with the MCEM results provided in [9], but only with the ones provided here.
From Table 1, we can draw the following conclusions. First, we observe that for the FFNN model, the VEM algorithm performs poorly; in this setting, the performance measures actually strongly decrease after the first 50 to 100 iterations of the algorithm. We did not observe this behavior for the (B)RNN models. We argue that the posterior temporal dynamic over the latent variables helps the VEM algorithm find a satisfactory estimate of the over-parametrized posterior model $q(\mathbf{z} \mid \mathbf{x}; \phi)$. Second, the superiority of the RNN model over the FFNN one is confirmed for all algorithms in this comparison. However, the bidirectional model (BRNN) does not perform significantly better than the unidirectional one. Third, the VEM algorithm outperforms the PEEM one, which shows the interest of using the full (approximate) posterior distribution of the latent variables, and not only the maximum a posteriori point estimate, for estimating the noise and mixture model parameters. Audio examples are available online [39].
5 Conclusion
In this work, we proposed a recurrent deep generative speech model and a variational EM algorithm for speech enhancement. We showed that introducing a temporal dynamic is clearly beneficial in terms of speech enhancement performance. Future work includes developing a Markov chain EM algorithm to measure the quality of the proposed variational approximation of the intractable true posterior distribution.
References
 [1] P. C. Loizou, Speech enhancement: theory and practice, CRC press, 2007.
 [2] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 1, pp. 7–19, 2015.
 [3] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Proc. Int. Conf. Latent Variable Analysis and Signal Separation (LVA/ICA), 2015, pp. 91–99.

 [4] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE Trans. Audio, Speech, Language Process., vol. 26, no. 10, pp. 1702–1726, 2018.
 [5] X. Li, S. Leglaive, L. Girin, and R. Horaud, “Audio-noise power spectral density estimation using long short-term memory,” IEEE Signal Process. Letters, vol. 26, no. 6, pp. 918–922, 2019.
 [6] X. Li and R. Horaud, “Multichannel speech enhancement based on time-frequency masking using subband long short-term memory,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA), 2019.
 [7] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learning Representations (ICLR), 2014.
 [8] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Statistical speech enhancement based on probabilistic integration of variational autoencoder and nonnegative matrix factorization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 716–720.
 [9] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement,” in Proc. IEEE Int. Workshop Machine Learning Signal Process. (MLSP), 2018, pp. 1–6.
 [10] S. Leglaive, U. Şimşekli, A. Liutkus, L. Girin, and R. Horaud, “Speech enhancement with variational autoencoders and alpha-stable distributions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 541–545.
 [11] M. Pariente, A. Deleforge, and E. Vincent, “A statistically principled and computationally efficient approach to speech enhancement using variational autoencoders,” in Proc. Interspeech, 2019.
 [12] K. Sekiguchi, Y. Bando, K. Yoshii, and T. Kawahara, “Bayesian multichannel speech enhancement with a deep speech prior,” in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1233–1239.
 [13] S. Leglaive, L. Girin, and R. Horaud, “Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 101–105.
 [14] M. Fontaine, A. A. Nugraha, R. Badeau, K. Yoshii, and A. Liutkus, “Cauchy multichannel speech enhancement with a deep speech prior,” in Proc. European Signal Processing Conference (EUSIPCO), 2019.
 [15] C. Févotte, N. Bertin, and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
 [16] P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proc. Int. Conf. Indep. Component Analysis and Signal Separation, 2007, pp. 414–421.
 [17] G. J. Mysore and P. Smaragdis, “A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2011, pp. 17–20.
 [18] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2140–2151, 2013.
 [19] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Proc. Adv. Neural Information Process. Syst. (NIPS), 2015, pp. 2980–2988.
 [20] R. M. Neal and G. E. Hinton, “A view of the EM algorithm that justifies incremental, sparse, and other variants,” in Learning in Graphical Models, M. I. Jordan, Ed., pp. 355–368. MIT Press, 1999.
 [21] F. D. Neeser and J. L. Massey, “Proper complex random processes with applications to information theory,” IEEE Trans. Information Theory, vol. 39, no. 4, pp. 1293–1302, 1993.
 [22] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
 [23] A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen, “Approximate Riemannian conjugate gradient learning for fixedform variational Bayes,” Journal of Machine Learning Research, vol. 11, no. Nov., pp. 3235–3268, 2010.

 [24] T. Salimans and D. A. Knowles, “Fixed-form variational posterior approximation through stochastic linear regression,” Bayesian Analysis, vol. 8, no. 4, pp. 837–882, 2013.
 [25] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
 [26] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, vol. 112, no. 518, pp. 859–877, 2017.
 [27] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
 [28] P.A. Mattei and J. Frellsen, “Refit your encoder when new data comes by,” in 3rd NeurIPS workshop on Bayesian Deep Learning, 2018.
 [29] H. Kameoka, L. Li, S. Inoue, and S. Makino, “Supervised determined source separation with multichannel variational autoencoder,” Neural Computation, vol. 31, no. 9, pp. 1–24, 2019.
 [30] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
 [31] C. Févotte and J. Idier, “Algorithms for nonnegative matrix factorization with the β-divergence,” Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
 [32] J. S. Garofalo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Sennheiser LDC93S6B,” https://catalog.ldc.upenn.edu/LDC93S6B, 1993, Philadelphia: Linguistic Data Consortium.
 [33] D. B. Dean, A. Kanagasundaram, H. Ghaemmaghami, M. H. Rahman, and S. Sridharan, “The QUT-NOISE-SRE protocol for the evaluation of noisy speaker recognition,” in Proc. Interspeech, 2015, pp. 3456–3460.
 [34] “Algorithms to measure audio programme loudness and true-peak audio level,” Recommendation BS.1770-4, International Telecommunication Union (ITU), Oct. 2015.
 [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations (ICLR), 2015.
 [36] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 626–630.
 [37] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2001, pp. 749–752.
 [38] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 7, pp. 2125–2136, 2011.
 [39] “Audio examples,” https://bit.ly/30T90ud.
 [40] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
A Appendix
A.1 Variational free energy derivation details
In this section we give derivation details for obtaining the expression of the variational free energy in (15). We will develop the two terms involved in the definition of the variational free energy in (6).
Data-fidelity term
From the generative model defined in (1), and using the fact that $\ln |s_{ft}|^2$ is constant w.r.t. $\phi$ and $\theta$, we have:

$$\mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\big[\ln p(\mathbf{s} \mid \mathbf{z}; \theta)\big] \overset{c}{=} -\sum_{f,t} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\left[\ln v_{s,ft} + \frac{|s_{ft}|^2}{v_{s,ft}}\right] \overset{c}{=} -\sum_{f,t} \mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\Big[d_{\mathrm{IS}}\big(|s_{ft}|^2;\, v_{s,ft}\big)\Big]. \tag{21}$$

Regularization term
From the standard Gaussian prior in (1) and the inference model (8)-(9), and using the fact that $\boldsymbol{\mu}_{z,t}$ and $\mathbf{v}_{z,t}$ are functions of $\mathbf{z}_{1:t-1}$ (and $\mathbf{s}$), we have:

$$\mathbb{E}_{q(\mathbf{z} \mid \mathbf{s}; \phi)}\big[\ln p(\mathbf{z}) - \ln q(\mathbf{z} \mid \mathbf{s}; \phi)\big] \overset{c}{=} \frac{1}{2}\sum_{l,t} \mathbb{E}_{q(\mathbf{z}_{1:t-1} \mid \mathbf{s}; \phi)}\Big[\ln v_{z,lt} - \mu_{z,lt}^2 - v_{z,lt}\Big]. \tag{22}$$

Summing (21) and (22) gives (15).
A.2 Update rules for the M-step
The multiplicative update rules for minimizing (19) using a majorize-minimize technique [30, 31] are given by (see [9] for derivation details):

$$\mathbf{H}_b \leftarrow \mathbf{H}_b \odot \left[\frac{\mathbf{W}_b^\top \Big(|\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \big(\mathbf{V}_x^{(r)}\big)^{\odot -2}\Big)}{\mathbf{W}_b^\top \sum_{r=1}^{R} \big(\mathbf{V}_x^{(r)}\big)^{\odot -1}}\right]^{\odot 1/2}, \tag{23}$$

$$\mathbf{W}_b \leftarrow \mathbf{W}_b \odot \left[\frac{\Big(|\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \big(\mathbf{V}_x^{(r)}\big)^{\odot -2}\Big)\mathbf{H}_b^\top}{\Big(\sum_{r=1}^{R} \big(\mathbf{V}_x^{(r)}\big)^{\odot -1}\Big)\mathbf{H}_b^\top}\right]^{\odot 1/2}, \tag{24}$$

$$\mathbf{g}^\top \leftarrow \mathbf{g}^\top \odot \left[\frac{\mathbf{1}^\top \Big(|\mathbf{X}|^{\odot 2} \odot \sum_{r=1}^{R} \mathbf{V}_s^{(r)} \odot \big(\mathbf{V}_x^{(r)}\big)^{\odot -2}\Big)}{\mathbf{1}^\top \sum_{r=1}^{R} \mathbf{V}_s^{(r)} \odot \big(\mathbf{V}_x^{(r)}\big)^{\odot -1}}\right]^{\odot 1/2}, \tag{25}$$

where $\odot$ denotes element-wise multiplication and exponentiation, matrix division is also element-wise, $|\mathbf{X}|^{\odot 2}$ and $\mathbf{V}_s^{(r)}$ are the $F \times T$ matrices of entries $|x_{ft}|^2$ and $v_{s,ft}^{(r)}$ respectively, $\mathbf{V}_x^{(r)}$ is the matrix of entries $v_{x,ft}^{(r)}$, and $\mathbf{1}$ is an all-ones column vector of dimension $F$. Note that the non-negativity of $\mathbf{W}_b$, $\mathbf{H}_b$ and $\mathbf{g}$ is ensured provided that they are initialized with non-negative values.
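A direct transcription of these updates, as a sketch under the same tensor-shape conventions as before (`x_pow` is the $F \times T$ matrix of squared mixture magnitudes, `v_s` stacks the $R$ matrices $\mathbf{V}_s^{(r)}$; the small `eps` guarding divisions is an addition):

```python
# One pass of the multiplicative updates (23)-(25), recomputing the mixture
# variances after each parameter update as in a sequential MM scheme.
import torch

def m_step(x_pow, v_s, Wb, Hb, g, eps=1e-10):
    """x_pow: (F, T); v_s: (R, F, T) decoder variances; g: (T,) gains."""
    def mix_var():                      # (R, F, T) mixture variances v_x^(r)
        return g * v_s + Wb @ Hb
    v_x = mix_var()
    Hb = Hb * ((Wb.T @ (x_pow * v_x.pow(-2).sum(0)))
               / (Wb.T @ v_x.pow(-1).sum(0) + eps)).sqrt()
    v_x = mix_var()                     # recompute with the updated Hb
    Wb = Wb * (((x_pow * v_x.pow(-2).sum(0)) @ Hb.T)
               / (v_x.pow(-1).sum(0) @ Hb.T + eps)).sqrt()
    v_x = mix_var()                     # recompute with the updated Wb
    num = (x_pow * (v_s * v_x.pow(-2)).sum(0)).sum(0)
    den = (v_s * v_x.pow(-1)).sum(0).sum(0)
    g = g * (num / (den + eps)).sqrt()
    return Wb, Hb, g
```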
A.3 Neural network architectures and training
The decoder and encoder network architectures are represented in Fig. 2 and Fig. 3, respectively. The "dense" (i.e. feed-forward fully-connected) output layers are of dimension $L$ and $F$ for the encoder and decoder, respectively. The dimension of all other layers was arbitrarily fixed to 128. RNN layers correspond to long short-term memory (LSTM) layers [40]. For the FFNN generative model, a batch is made of 128 time frames of clean speech power spectrogram. For the (B)RNN generative models, a batch is made of 32 sequences of 50 time frames. Given an input sequence, all LSTM hidden states of the encoder and decoder networks are initialized to zero. For training, we use the Adam optimizer [35] with a step size of $10^{-3}$, exponential decay rates of 0.9 and 0.999 for the first and second moment estimates, respectively, and an epsilon of $10^{-8}$ for preventing division by zero.