1 Introduction
Deep generative models for approximating complicated and often high-dimensional probability distributions have become a rapidly developing research field. Variational autoencoders (VAEs) were originally introduced by Kingma and Welling [21] and have seen a large number of modifications and improvements for a huge number of quite different applications. For an overview of VAEs, we refer to [22]. Recently, diffusion normalizing flows arising from the Euler discretization of a certain stochastic differential equation were proposed by Zhang and Chen in [39]. On the other hand, finite normalizing flows, including residual neural networks (ResNets) [3, 4, 17], invertible neural networks (INNs) [2, 8, 15, 20, 26, 29] and autoregressive flows [7, 9, 18, 27], are a popular class of generative models. To overcome topological constraints and improve the expressiveness of normalizing flow architectures, Wu, Köhler and Noé introduced stochastic normalizing flows [38], which combine deterministic, learnable flow transformations with stochastic sampling methods. In [14], we considered stochastic normalizing flows from a Markov chain point of view. In particular, we replaced the transition densities by general Markov kernels and established proofs via Radon-Nikodym derivatives. This allowed us to incorporate deterministic flows or Metropolis-Hastings flows, which do not have densities, into the mathematical derivation.

The aim of this tutorial paper is to propose the straightforward and clear framework of Markov chains for combining deterministic normalizing flows and stochastic flows, in particular VAEs and diffusion normalizing flows. More precisely, we establish a pair of Markov chains having certain special properties. This provides a powerful tool for coupling different architectures. We want to highlight the advantage of the Markov chain approach that it can handle distributions with and without densities in a mathematically sound way. We are aware that relations between normalizing flows and other approaches such as VAEs have already been mentioned in the literature, and we point to corresponding references at the end of Section 5.
The outline of the paper is as follows: in the next Section 2, we recall the notion of Markov kernels. Then, in Section 3, we use them to explain normalizing flows. Stochastic normalizing flows are introduced as a pair of Markov chains in Section 4. Afterwards, we show how VAEs fit into the setting of stochastic normalizing flows in Section 5. Related references are given at the end of that section. Finally, we demonstrate in Section 6 how diffusion normalizing flows can be seen as stochastic normalizing flows as well.
2 Markov Kernels
In this section, we introduce the basic notation of Markov chains, see, e.g., [24].
Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. By a probability measure on $\mathbb{R}^d$ we always mean a probability measure defined on the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$. Let $\mathcal{P}(\mathbb{R}^d)$ denote the set of probability measures on $\mathbb{R}^d$. Given a random variable $X\colon \Omega \to \mathbb{R}^d$, we use the push-forward notation $P_X := X_{\#}\mathbb{P} = \mathbb{P}\circ X^{-1}$ for the corresponding measure on $\mathbb{R}^d$. A Markov kernel $\mathcal{K}\colon \mathbb{R}^d \times \mathcal{B}(\mathbb{R}^n) \to [0,1]$ is a mapping such that

- $\mathcal{K}(\cdot\,, B)$ is measurable for any $B \in \mathcal{B}(\mathbb{R}^n)$, and
- $\mathcal{K}(x, \cdot)$ is a probability measure for any $x \in \mathbb{R}^d$.

For a probability measure $\mu$ on $\mathbb{R}^d$, the measure $\mu \times \mathcal{K}$ on $\mathbb{R}^d \times \mathbb{R}^n$ is defined by

$$(\mu \times \mathcal{K})(A \times B) := \int_A \mathcal{K}(x, B)\, d\mu(x), \qquad A \in \mathcal{B}(\mathbb{R}^d),\ B \in \mathcal{B}(\mathbb{R}^n). \tag{1}$$

Note that this definition determines $\mu\times\mathcal{K}$ on all sets in $\mathcal{B}(\mathbb{R}^d\times\mathbb{R}^n)$ since the measurable rectangles form a $\cap$-stable generator of $\mathcal{B}(\mathbb{R}^d\times\mathbb{R}^n)$. Then it holds for all integrable $f\colon\mathbb{R}^d\times\mathbb{R}^n\to\mathbb{R}$ that

$$\int_{\mathbb{R}^d\times\mathbb{R}^n} f(x,y)\, d(\mu\times\mathcal{K})(x,y) = \int_{\mathbb{R}^d}\int_{\mathbb{R}^n} f(x,y)\, \mathcal{K}(x,dy)\, d\mu(x).$$
In the following, we use the notion of the regular conditional distribution $P_{Y|X=\cdot}$ of a random variable $Y$ given a random variable $X$, which is defined as the $P_X$-almost surely unique Markov kernel with the property

$$P_{(X,Y)} = P_X \times P_{Y|X=\cdot}. \tag{2}$$
We will use the abbreviation $P_{Y|X}$ if the meaning is clear from the context. A sequence $(X_0, X_1, \ldots, X_T)$, $T \in \mathbb{N}$, of $d$-dimensional random variables $X_t$, $t = 0,\ldots,T$, is called a Markov chain if there exist Markov kernels $\mathcal{K}_t := P_{X_t|X_{t-1}}$ in the sense of (2) such that

$$P_{(X_0,\ldots,X_T)} = P_{X_0} \times \mathcal{K}_1 \times \cdots \times \mathcal{K}_T. \tag{3}$$
The Markov kernels $\mathcal{K}_t$ are also called transition kernels. If the measure $P_{X_{t-1}}$ has a density $p_{X_{t-1}}$, and the kernels $P_{X_t|X_{t-1}=x}$ resp. $P_{X_{t-1}|X_t=y}$ have densities $p_{X_t|X_{t-1}=x}$ resp. $p_{X_{t-1}|X_t=y}$, then setting $\mu = P_{X_{t-1}}$ and $\mathcal{K} = P_{X_t|X_{t-1}}$ in equation (1) results in Bayes' formula for the conditional densities

$$p_{X_{t-1}|X_t=y}(x)\, p_{X_t}(y) = p_{X_t|X_{t-1}=x}(y)\, p_{X_{t-1}}(x). \tag{4}$$
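To make the kernel notation concrete, the following minimal Python sketch (our own illustration, not code from [24]; all function names are hypothetical) draws joint samples from $P_{X_0}\times\mathcal{K}_1\times\cdots\times\mathcal{K}_T$ as in (3), using Gaussian transition kernels $\mathcal{K}_t(x,\cdot) = \mathcal{N}(x, \sigma_t^2 I_d)$.

```python
import numpy as np

# Sketch (ours): sampling a Markov chain from its transition kernels, cf. (1) and (3).
rng = np.random.default_rng(0)

def sample_markov_chain(x0_sampler, kernel_samplers, n_samples):
    """Draw joint samples (x_0, ..., x_T) from P_{X_0} x K_1 x ... x K_T.

    x0_sampler(n)         returns n samples from P_{X_0}.
    kernel_samplers[t](x) returns one draw from K_{t+1}(x, .) for each row of x.
    """
    path = [x0_sampler(n_samples)]
    for kernel in kernel_samplers:
        path.append(kernel(path[-1]))   # one draw from K_t(x_{t-1}, .) per sample
    return path                         # list of T+1 arrays of shape (n_samples, d)

# Gaussian random walk: X_0 ~ N(0, I_2), K_t(x, .) = N(x, sigma_t^2 I_2).
sigmas = [0.5, 0.5, 0.5]
x0_sampler = lambda n: rng.standard_normal((n, 2))
kernels = [lambda x, s=s: x + s * rng.standard_normal(x.shape) for s in sigmas]
samples = sample_markov_chain(x0_sampler, kernels, n_samples=1000)
```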
3 Normalizing Flows
In this section, we show how normalizing flows can be interpreted as finite Markov chains. A normalizing flow [28] is often understood as a deterministic, learnable, invertible transform, which we call $\mathcal{T}_\theta\colon\mathbb{R}^d\to\mathbb{R}^d$.
For better readability, we skip the dependence of $\mathcal{T}_\theta$ on the parameters $\theta$ and write just $\mathcal{T}$. Normalizing flows can be used to model a distribution $P_X$ with density $p_X$ by a simpler distribution $P_Z$, usually the standard normal distribution, by learning $\mathcal{T}$ such that it holds approximately

$$\mathcal{T}_{\#}P_Z \approx P_X.$$

Note that we have by the change of variables formula for the corresponding densities

$$p_{\mathcal{T}_{\#}P_Z}(x) = p_Z\big(\mathcal{T}^{-1}(x)\big)\,\big|\det \nabla\mathcal{T}^{-1}(x)\big|. \tag{5}$$
The approximation can be done by minimizing the Kullback-Leibler divergence

$$\mathrm{KL}\big(P_X, \mathcal{T}_{\#}P_Z\big) = \mathbb{E}_{x\sim P_X}\big[\log p_X(x)\big] - \mathbb{E}_{x\sim P_X}\big[\log p_{\mathcal{T}_{\#}P_Z}(x)\big] \tag{6}$$
$$= \mathbb{E}_{x\sim P_X}\big[\log p_X(x)\big] - \mathbb{E}_{x\sim P_X}\Big[\log p_Z\big(\mathcal{T}^{-1}(x)\big) + \log\big|\det\nabla\mathcal{T}^{-1}(x)\big|\Big]. \tag{7}$$

Noting that the first summand is just a constant independent of $\theta$, this gives the loss function

$$\mathcal{L}_{\mathrm{NF}}(\theta) := -\mathbb{E}_{x\sim P_X}\Big[\log p_Z\big(\mathcal{T}_\theta^{-1}(x)\big) + \log\big|\det\nabla\mathcal{T}_\theta^{-1}(x)\big|\Big].$$
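As a toy illustration of this loss (a sketch of ours, not code from [28]), take the invertible affine map $\mathcal{T}(z) = Az + b$, for which $\mathcal{T}^{-1}(x) = A^{-1}(x-b)$ and $\log|\det\nabla\mathcal{T}^{-1}(x)| = -\log|\det A|$.

```python
import numpy as np

# Sketch (ours): Monte Carlo estimate of the normalizing flow loss for an affine map.
rng = np.random.default_rng(1)
d = 2
A = np.array([[2.0, 0.3], [0.0, 0.5]])   # invertible matrix
b = np.array([1.0, -1.0])

def log_pZ(z):
    # standard normal log-density on R^d
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def nf_loss(x):
    """Estimate of -E_{x~P_X}[log p_Z(T^{-1}(x)) + log|det grad T^{-1}(x)|]."""
    z = np.linalg.solve(A, (x - b).T).T        # T^{-1}(x) = A^{-1}(x - b)
    log_det_inv = -np.linalg.slogdet(A)[1]     # log|det grad T^{-1}| = -log|det A|
    return -np.mean(log_pZ(z) + log_det_inv)

# Here P_X = T_# P_Z by construction, so the KL term in (6) vanishes.
x_samples = rng.standard_normal((1000, d)) @ A.T + b
print(nf_loss(x_samples))
```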
The network $\mathcal{T}$ is constructed by concatenating smaller blocks

$$\mathcal{T} = \mathcal{T}_T \circ \cdots \circ \mathcal{T}_1,$$

which are invertible networks on their own. Then, the blocks generate a pair of Markov chains $(X_0,\ldots,X_T)$ and $(Y_T,\ldots,Y_0)$ by

$$X_0 := Z, \quad X_t := \mathcal{T}_t(X_{t-1}), \quad t = 1,\ldots,T, \qquad\qquad Y_T := X, \quad Y_{t-1} := \mathcal{T}_t^{-1}(Y_t), \quad t = T,\ldots,1.$$

Here, for all $t = 0,\ldots,T$, the dimension of the random variables $X_t$ and $Y_t$ is equal to $d$. The transition kernels $P_{X_t|X_{t-1}}$ and $P_{Y_{t-1}|Y_t}$ are given by the Dirac distributions

$$\mathcal{K}_t(x,\cdot) := \delta_{\mathcal{T}_t(x)}, \qquad \mathcal{R}_t(y,\cdot) := \delta_{\mathcal{T}_t^{-1}(y)},$$
which can be seen by (2) as follows: for any $A, B \in \mathcal{B}(\mathbb{R}^d)$ it holds

$$P_{(X_{t-1},X_t)}(A\times B) = P_{(X_{t-1},\mathcal{T}_t(X_{t-1}))}(A\times B) \tag{8}$$
$$= \mathbb{P}\big(\{X_{t-1}\in A\}\cap\{\mathcal{T}_t(X_{t-1})\in B\}\big). \tag{9}$$

Since $\delta_{\mathcal{T}_t(x)}$ is by definition concentrated on the set $\{\mathcal{T}_t(x)\}$, this becomes

$$P_{(X_{t-1},X_t)}(A\times B) = \mathbb{P}\big(X_{t-1}\in A\cap\mathcal{T}_t^{-1}(B)\big) \tag{10}$$
$$= \int_A \delta_{\mathcal{T}_t(x)}(B)\, dP_{X_{t-1}}(x) \tag{11}$$
$$= \big(P_{X_{t-1}}\times\mathcal{K}_t\big)(A\times B). \tag{12}$$
Consequently, by (1) and (2), the transition kernel $P_{X_t|X_{t-1}}$ is given by $\mathcal{K}_t(x,\cdot) = \delta_{\mathcal{T}_t(x)}$, and analogously $P_{Y_{t-1}|Y_t}$ is given by $\mathcal{R}_t$. Due to their correspondence to the layers $\mathcal{T}_t$ and $\mathcal{T}_t^{-1}$ of the normalizing flow $\mathcal{T}$, we call the Markov kernels $\mathcal{K}_t$ forward layers, while the Markov kernels $\mathcal{R}_t$ are called reverse layers.
4 Stochastic Normalizing Flows
The idea of stochastic normalizing flows is to replace some of the deterministic layers of a normalizing flow by random transforms. From the Markov chain viewpoint, we replace the kernels $\mathcal{K}_t$ and $\mathcal{R}_t$ given by Dirac measures with more general Markov kernels.
Formally, a stochastic normalizing flow (SNF) is a pair of Markov chains $(X_0,\ldots,X_T)$ and $(Y_T,\ldots,Y_0)$ of $d_t$-dimensional random variables $X_t$ and $Y_t$, $t = 0,\ldots,T$, with the following properties:

- $X_t$ and $Y_t$ have the densities $p_{X_t}$ and $p_{Y_t}$ for any $t = 0,\ldots,T$.
- There exist Markov kernels $\mathcal{K}_t = P_{X_t|X_{t-1}}$ and $\mathcal{R}_t = P_{Y_{t-1}|Y_t}$, $t = 1,\ldots,T$, such that
  $$P_{(X_0,\ldots,X_T)} = P_{X_0}\times\mathcal{K}_1\times\cdots\times\mathcal{K}_T, \tag{13}$$
  $$P_{(Y_T,\ldots,Y_0)} = P_{Y_T}\times\mathcal{R}_T\times\cdots\times\mathcal{R}_1. \tag{14}$$
- For $P_{Y_t}$-almost every $y \in \mathbb{R}^{d_t}$, the measures $P_{X_{t-1}|X_t=y}$ and $P_{Y_{t-1}|Y_t=y}$ are absolutely continuous with respect to each other, $t = 1,\ldots,T$.
We say that the Markov chain $(Y_T,\ldots,Y_0)$ is a reverse Markov chain of $(X_0,\ldots,X_T)$. In applications, the Markov chain usually starts with a latent random variable $X_0 = Z$ on $\mathbb{R}^{d_0}$ whose distribution $P_Z$ is easy to sample from, and we intend to learn the Markov chain such that $X_T$ approximates a target random variable $X$ on $\mathbb{R}^{d_T}$, while the reverse Markov chain is initialized with a random variable $Y_T = X$ from the data space and should approximate the latent variable $Z$. As outlined in the previous section, each deterministic normalizing flow is a special case of an SNF. In the following, let $\mathcal{N}(m,\Sigma)$ denote the normal distribution with density $\mathcal{N}(\cdot\,; m,\Sigma)$.
4.1 Stochastic Layers
In the following, we briefly recall the two stochastic layers which were used in [14, 38]. Another kind of layer arising from VAEs is detailed in the next section. In both cases from [14, 38], we choose, as for the deterministic layers, $d_t = d$ for all $t$, and the basic idea is to push the distribution of $X_{t-1}$ into the direction of some proposal density $p_t$, which is usually chosen as some interpolation between $p_Z$ and $p_X$. For a detailed description of this interpolation, we refer to [14]. As reverse layer, we use the same Markov kernel as the forward layer, i.e., $\mathcal{R}_t = \mathcal{K}_t$.

Metropolis-Hastings (MH) Layer: The Metropolis-Hastings algorithm outlined in Alg. 1 is a frequently used Markov chain Monte Carlo algorithm to sample from a distribution with known density $p$, see, e.g., [30].
Under mild assumptions, the corresponding Markov chain admits the unique stationary distribution with density $p$, and the distributions of its iterates converge to it in the total variation norm as the number of steps tends to infinity, see, e.g., [35].
In the MH layer, the transition from $X_{t-1}$ to $X_t$ is one step of a Metropolis-Hastings algorithm with target density $p_t$. More precisely, let $X_t'$ and $U_t\sim\mathcal{U}_{[0,1]}$ be random variables such that $\sigma(X_{t-1},X_t')$ and $\sigma(U_t)$ are independent, where $P_{X_t'|X_{t-1}=x}$ has the density $q_t(\cdot\,|x)$. Here $\sigma(X)$ denotes the smallest $\sigma$-algebra generated by the random variable $X$. Then, we set

$$X_t := \begin{cases} X_t', & \text{if } U_t \le \alpha_t(X_{t-1}, X_t'),\\ X_{t-1}, & \text{otherwise},\end{cases} \tag{15}$$

where

$$\alpha_t(x,y) := \min\Big\{1, \frac{p_t(y)\,q_t(x|y)}{p_t(x)\,q_t(y|x)}\Big\}$$

with a proposal density $q_t$ which has to be specified. The corresponding transition kernel was derived, e.g., in [34] and is given by

$$\mathcal{K}_t(x,A) = \int_A \alpha_t(x,y)\, q_t(y|x)\, dy + \mathbf{1}_A(x)\int_{\mathbb{R}^d}\big(1 - \alpha_t(x,y)\big)\, q_t(y|x)\, dy. \tag{16}$$
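A minimal sketch of one MH layer transition (15) with a symmetric Gaussian random-walk proposal, for which the ratio $q_t(x|y)/q_t(y|x)$ in $\alpha_t$ cancels; the code and names are our own illustration, not the implementation of [14, 38].

```python
import numpy as np

# Sketch (ours): one Metropolis-Hastings transition targeting the density p_t, cf. (15).
rng = np.random.default_rng(2)

def mh_layer(x, log_pt, step=0.5):
    """x: (n, d) current samples; log_pt: log of the target density p_t; step: proposal std."""
    proposal = x + step * rng.standard_normal(x.shape)
    # Symmetric proposal: acceptance ratio reduces to p_t(proposal) / p_t(x).
    log_alpha = np.minimum(0.0, log_pt(proposal) - log_pt(x))
    accept = np.log(rng.uniform(size=x.shape[0])) <= log_alpha
    return np.where(accept[:, None], proposal, x)

# Example: push samples towards a standard normal target density.
log_target = lambda x: -0.5 * np.sum(x**2, axis=-1)
x = 3.0 + rng.standard_normal((1000, 2))
for _ in range(100):
    x = mh_layer(x, log_target, step=0.5)
```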
Note that another kind of MH layer, arising from the Metropolis-adjusted Langevin algorithm (MALA) [11, 31], was also used in [14, 38] under the name Markov chain Monte Carlo (MCMC) layer.
Langevin Layer:
In the Langevin layer, we model the transition from $X_{t-1}$ to $X_t$ by one step of an explicit Euler discretization of the overdamped Langevin dynamics [37]. Let $\xi_t\sim\mathcal{N}(0, I_d)$ such that $\sigma(X_{t-1})$ and $\sigma(\xi_t)$ are independent. Again we assume that we are given a proposal density $p_t$ which has to be specified. We denote by $u_t := -\log p_t$ the negative log-likelihood of $p_t$ and set

$$X_t := X_{t-1} - a_1\nabla u_t(X_{t-1}) + a_2\,\xi_t,$$

where $a_1, a_2 > 0$ are some predefined constants. To determine the corresponding kernel, we use the independence of $\sigma(X_{t-1})$ and $\sigma(\xi_t)$ to obtain that $X_{t-1}$ and $\xi_t$ have the common density

$$p_{(X_{t-1},\xi_t)}(x,\xi) = p_{X_{t-1}}(x)\, p_{\xi_t}(\xi) \tag{17}$$
$$= p_{X_{t-1}}(x)\, \mathcal{N}(\xi; 0, I_d) \tag{18}$$
$$= (2\pi)^{-d/2}\, p_{X_{t-1}}(x)\, \exp\big(-\tfrac12\|\xi\|^2\big). \tag{19}$$

Then, for $A, B\in\mathcal{B}(\mathbb{R}^d)$, it holds

$$P_{(X_{t-1},X_t)}(A\times B) = \mathbb{P}\big(X_{t-1}\in A,\ X_{t-1} - a_1\nabla u_t(X_{t-1}) + a_2\xi_t\in B\big) \tag{20}$$
$$= \int_A \int_{\mathbb{R}^d} \mathbf{1}_B\big(x - a_1\nabla u_t(x) + a_2\xi\big)\, \mathcal{N}(\xi; 0, I_d)\, d\xi\, dP_{X_{t-1}}(x) \tag{21}$$
$$= \int_A \int_B \mathcal{N}\big(y;\, x - a_1\nabla u_t(x),\, a_2^2 I_d\big)\, dy\, dP_{X_{t-1}}(x) = \big(P_{X_{t-1}}\times\mathcal{K}_t\big)(A\times B), \tag{22}$$

where

$$\mathcal{K}_t(x, B) := \int_B \mathcal{N}\big(y;\, x - a_1\nabla u_t(x),\, a_2^2 I_d\big)\, dy. \tag{23}$$
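The Langevin layer therefore amounts to a single noisy gradient step on $u_t$. A minimal sketch (our own code with hypothetical names, not the implementation of [14, 38]):

```python
import numpy as np

# Sketch (ours): one explicit Euler step of the overdamped Langevin dynamics, cf. (23).
rng = np.random.default_rng(3)

def langevin_layer(x, grad_ut, a1=0.01, a2=None):
    """x: (n, d) samples; grad_ut: gradient of u_t = -log p_t; a1, a2: step and noise scale."""
    if a2 is None:
        a2 = np.sqrt(2.0 * a1)   # common choice recovering unadjusted Langevin sampling
    return x - a1 * grad_ut(x) + a2 * rng.standard_normal(x.shape)

# Example: u_t(x) = 0.5 ||x||^2, i.e. the proposal density p_t is standard normal.
grad_u = lambda x: x
x = 5.0 + rng.standard_normal((1000, 2))
for _ in range(200):
    x = langevin_layer(x, grad_u, a1=0.01)
```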
4.2 Training SNFs
We aim to find parameters $\theta$ of an SNF such that $P_{X_T} \approx P_X$. Recall that for deterministic normalizing flows it holds that $P_{X_T} = \mathcal{T}_{\#}P_Z$, such that the loss function reads as $\mathrm{KL}(P_X, P_{X_T})$ up to a constant. Unfortunately, the stochastic layers make it impossible to evaluate and minimize $\mathrm{KL}(P_X, P_{X_T})$ directly. Instead, we minimize the KL divergence of the joint distributions $P_{(Y_0,\ldots,Y_T)}$ and $P_{(X_0,\ldots,X_T)}$, which is an upper bound of $\mathrm{KL}(P_{Y_T}, P_{X_T}) = \mathrm{KL}(P_X, P_{X_T})$. It was shown in [14, Theorem 5] that this loss function can be rewritten as

$$\mathcal{L}_{\mathrm{SNF}}(\theta) := \mathrm{KL}\big(P_{(Y_0,\ldots,Y_T)},\, P_{(X_0,\ldots,X_T)}\big) \tag{24}$$
$$= \mathbb{E}_{(x_0,\ldots,x_T)\sim P_{(Y_0,\ldots,Y_T)}}\big[\log f(x_0,\ldots,x_T)\big] \tag{25}$$
$$= \mathbb{E}_{(x_0,\ldots,x_T)\sim P_{(Y_0,\ldots,Y_T)}}\Big[\log\frac{p_X(x_T)}{p_Z(x_0)} + \sum_{t=1}^{T}\log\frac{r_t(x_{t-1}\mid x_t)}{k_t(x_t\mid x_{t-1})}\Big], \tag{26}$$

where $f$ is given by the Radon-Nikodym derivative $f = \frac{dP_{(Y_0,\ldots,Y_T)}}{dP_{(X_0,\ldots,X_T)}}$ and the last equality holds whenever the transition kernels $\mathcal{K}_t$ and $\mathcal{R}_t$ admit densities $k_t$ and $r_t$. Finally, note that by [14, Theorem 6] we have for any deterministic normalizing flow that $\mathcal{L}_{\mathrm{SNF}}(\theta) = \mathrm{KL}(P_X, \mathcal{T}_{\#}P_Z)$, i.e., the upper bound is attained and we recover the loss from Section 3 up to a constant.
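If all transition kernels admit densities, the loss (26) can be estimated by Monte Carlo over reverse paths drawn from the data. The following sketch is our own illustration with hypothetical names; it only evaluates the loss and ignores the reparametrization needed to differentiate through the path sampling, for which we refer to [14, 38].

```python
import numpy as np

# Sketch (ours): Monte Carlo estimate of (26) in the case of transition densities k_t, r_t.
def snf_loss_estimate(reverse_paths, log_pZ, log_pX, log_k, log_r):
    """reverse_paths: (n, T+1, d) paths (x_0, ..., x_T) drawn from P_{(Y_0,...,Y_T)},
       i.e. x_T ~ P_X and x_{t-1} ~ R_t(x_t, .).
       log_k[t](x, y) = log k_{t+1}(y | x), log_r[t](y, x) = log r_{t+1}(x | y)."""
    x0, xT = reverse_paths[:, 0], reverse_paths[:, -1]
    loss = log_pX(xT) - log_pZ(x0)   # log(p_X(x_T)/p_Z(x_0)); first term is constant in the parameters
    T = reverse_paths.shape[1] - 1
    for t in range(1, T + 1):
        xprev, xcur = reverse_paths[:, t - 1], reverse_paths[:, t]
        loss += log_r[t - 1](xcur, xprev) - log_k[t - 1](xprev, xcur)
    return np.mean(loss)
```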
5 VAEs as Special SNF Layers
In this section, we introduce variational autoencoders (VAEs) as another kind of stochastic layer of an SNF. First, we briefly revisit the definition of autoencoders and VAEs. Afterwards, we show that a VAE can be viewed as a one-layer SNF.
Autoencoders.
Autoencoders (see [12] for an overview) are a dimensionality reduction technique inspired by principal component analysis. For $n < d$, an autoencoder is a pair of neural networks, consisting of an encoder $E_\phi\colon\mathbb{R}^d\to\mathbb{R}^n$ and a decoder $D_\theta\colon\mathbb{R}^n\to\mathbb{R}^d$, where $\phi$ and $\theta$ are the neural network parameters. The encoder aims to encode samples from a $d$-dimensional distribution $P_X$ in the lower-dimensional space $\mathbb{R}^n$ such that the decoder is able to reconstruct them. Consequently, it is a necessary assumption that the distribution $P_X$ is approximately concentrated on an $n$-dimensional manifold. A possible loss function to train $E_\phi$ and $D_\theta$ is given by

$$\mathcal{L}_{\mathrm{AE}}(\phi,\theta) := \mathbb{E}_{x\sim P_X}\big[\|D_\theta(E_\phi(x)) - x\|^2\big].$$

Using this construction, autoencoders have proven to be very powerful for reducing the dimensionality of very complex datasets.
Variational Autoencoders via Markov Kernels.
Variational autoencoders (VAEs) [21] aim to use the power of autoencoders to approximate a probability distribution $P_X$ on $\mathbb{R}^d$ with density $p_X$ using a simpler distribution $P_Z$ on $\mathbb{R}^n$ with density $p_Z$, which is usually the standard normal distribution. Here, the idea is to learn random transforms that push the distribution $P_Z$ onto $P_X$ and vice versa. Formally, these transforms are defined by the Markov kernels

$$\mathcal{D}_\theta(z,\cdot) := \mathcal{N}\big(\mu_\theta(z), \Sigma_\theta(z)\big), \qquad \mathcal{E}_\phi(x,\cdot) := \mathcal{N}\big(\mu_\phi(x), \Sigma_\phi(x)\big), \tag{27}$$

where

$(\mu_\theta, \Sigma_\theta)$ is a neural network with parameters $\theta$, which determines the parameters of the normal distribution within the definition of $\mathcal{D}_\theta$. Similarly, $(\mu_\phi, \Sigma_\phi)$ determines the parameters within the definition of $\mathcal{E}_\phi$. In analogy to the autoencoders in the previous paragraph, $\mathcal{D}_\theta$ and $\mathcal{E}_\phi$ are called stochastic decoder and encoder. By definition, $\mathcal{D}_\theta(z,\cdot)$ has the density $p_\theta(\cdot\,|z) = \mathcal{N}(\cdot\,;\mu_\theta(z),\Sigma_\theta(z))$ and $\mathcal{E}_\phi(x,\cdot)$ has the density $q_\phi(\cdot\,|x) = \mathcal{N}(\cdot\,;\mu_\phi(x),\Sigma_\phi(x))$.
Now, we aim to learn the parameters $\theta$ and $\phi$ such that it holds approximately

$$P_X \approx \int_{\mathbb{R}^n}\mathcal{D}_\theta(z,\cdot)\, dP_Z(z) \quad\text{and}\quad P_Z \approx \int_{\mathbb{R}^d}\mathcal{E}_\phi(x,\cdot)\, dP_X(x). \tag{28}$$

Assuming that the relations in (28) hold true exactly, we can generate samples from $P_X$ by first sampling $z$ from $P_Z$ and secondly sampling $x$ from $\mathcal{D}_\theta(z,\cdot)$.
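In code, this two-step sampling procedure reads as follows; this is only a sketch of ours, where the hypothetical callables decoder_mean and decoder_std stand for $\mu_\theta$ and a diagonal $\Sigma_\theta^{1/2}$ in (27).

```python
import numpy as np

# Sketch (ours): generating samples via z ~ P_Z, then x ~ D_theta(z, .), cf. (27)-(28).
rng = np.random.default_rng(4)

def sample_vae(decoder_mean, decoder_std, n, latent_dim):
    z = rng.standard_normal((n, latent_dim))   # first step: sample the latent variable from N(0, I_n)
    mu = decoder_mean(z)                       # mean of the decoder kernel
    std = decoder_std(z)                       # elementwise standard deviations (diagonal Sigma_theta)
    return mu + std * rng.standard_normal(mu.shape)
```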
A first idea would be to use the maximum likelihood estimator as loss function, i.e., to maximize

$$\mathbb{E}_{x\sim P_X}\big[\log p_\theta(x)\big], \qquad p_\theta(x) := \int_{\mathbb{R}^n} p_\theta(x|z)\, p_Z(z)\, dz.$$

Unfortunately, computing the integral directly is intractable. Thus, using Bayes' formula

$$p_\theta(z|x) = \frac{p_\theta(x|z)\, p_Z(z)}{p_\theta(x)},$$

we artificially incorporate the stochastic encoder by the computation

$$\log p_\theta(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\log p_\theta(x)\big] \tag{29}$$
$$= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\frac{p_\theta(x|z)\,p_Z(z)}{p_\theta(z|x)}\Big] \tag{30}$$
$$= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\frac{p_\theta(x|z)\,p_Z(z)}{q_\phi(z|x)}\Big] + \mathrm{KL}\big(q_\phi(\cdot|x), p_\theta(\cdot|x)\big) \tag{31}$$
$$\ge \mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\log p_\theta(x|z) + \log p_Z(z) - \log q_\phi(z|x)\big]. \tag{32}$$
Then the function given by

$$\mathrm{ELBO}_{\theta,\phi}(x) := \mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\log p_\theta(x|z) + \log p_Z(z) - \log q_\phi(z|x)\big] \tag{33}$$

is a lower bound on the so-called evidence $\log p_\theta(x)$. Therefore, it is called the evidence lower bound (ELBO). Now the parameters $\theta$ and $\phi$ of the VAE can be trained by maximizing the expected ELBO, i.e., by minimizing the loss function

$$\mathcal{L}_{\mathrm{VAE}}(\theta,\phi) := -\mathbb{E}_{x\sim P_X}\big[\mathrm{ELBO}_{\theta,\phi}(x)\big]. \tag{34}$$
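For Gaussian encoder and decoder with diagonal covariances, the expected ELBO in (34) can be estimated with the reparametrization trick of [21]. The following sketch is our own illustration; encoder and decoder are hypothetical callables returning means and log-variances.

```python
import numpy as np

# Sketch (ours): single-sample Monte Carlo estimate of the VAE loss (34).
rng = np.random.default_rng(5)

def gaussian_logpdf(x, mean, logvar):
    # log N(x; mean, diag(exp(logvar))), summed over the coordinates
    return -0.5 * np.sum(logvar + (x - mean) ** 2 / np.exp(logvar) + np.log(2 * np.pi), axis=-1)

def neg_expected_elbo(x, encoder, decoder):
    """encoder(x) -> (mu_phi(x), logvar_phi(x)); decoder(z) -> (mu_theta(z), logvar_theta(z))."""
    mu_z, logvar_z = encoder(x)
    z = mu_z + np.exp(0.5 * logvar_z) * rng.standard_normal(mu_z.shape)  # z ~ E_phi(x, .), reparametrized
    mu_x, logvar_x = decoder(z)
    elbo = (gaussian_logpdf(x, mu_x, logvar_x)                              # log p_theta(x | z)
            + gaussian_logpdf(z, np.zeros_like(z), np.zeros_like(z))        # log p_Z(z), standard normal
            - gaussian_logpdf(z, mu_z, logvar_z))                           # - log q_phi(z | x)
    return -np.mean(elbo)
```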
VAEs as One-Layer SNFs.
In the following, we show that a VAE is a special case of a one-layer SNF. Let $((X_0,X_1),(Y_1,Y_0))$ be a one-layer SNF, where the forward and reverse layers $\mathcal{K}_1 := \mathcal{D}_\theta$ and $\mathcal{R}_1 := \mathcal{E}_\phi$ are defined as in (27) with densities $p_\theta(\cdot|z)$ and $q_\phi(\cdot|x)$, respectively. Note that, in contrast to the stochastic layers from Section 4, the dimensions $d_0 = n$ and $d_1 = d$ are no longer equal. Now, with $T = 1$, the loss function (25) of the SNF reads as

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{(z,x)\sim P_{(Y_0,Y_1)}}\big[\log f_1(z,x)\big], \tag{35}$$

where $f_1$ is given by the Radon-Nikodym derivative $f_1 = \frac{dP_{(Y_0,Y_1)}}{dP_{(X_0,X_1)}}$. Now we can use the fact that, by the definition of $\mathcal{K}_1$ and $\mathcal{R}_1$, the random variables $(X_0,X_1)$ as well as the random variables $(Y_0,Y_1)$ have a joint density to express $f_1$ by the corresponding densities of $P_{(Y_0,Y_1)}$ and $P_{(X_0,X_1)}$. Together with Bayes' formula we obtain

$$f_1(z,x) = \frac{p_{(Y_0,Y_1)}(z,x)}{p_{(X_0,X_1)}(z,x)} = \frac{p_X(x)\, q_\phi(z|x)}{p_Z(z)\, p_\theta(x|z)}.$$

Inserting this into (35), we get

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{(z,x)\sim P_{(Y_0,Y_1)}}\big[\log p_X(x) + \log q_\phi(z|x) - \log p_Z(z) - \log p_\theta(x|z)\big] \tag{36}$$

and using (2) further

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{x\sim P_X}\Big[\mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\log q_\phi(z|x) - \log p_Z(z) - \log p_\theta(x|z)\big]\Big] + \mathbb{E}_{x\sim P_X}\big[\log p_X(x)\big] \tag{37}$$
$$= -\mathbb{E}_{x\sim P_X}\big[\mathrm{ELBO}_{\theta,\phi}(x)\big] + \mathrm{const} \tag{38}$$
$$= \mathcal{L}_{\mathrm{VAE}}(\theta,\phi) + \mathrm{const}, \tag{39}$$

where $\mathrm{ELBO}_{\theta,\phi}$ denotes the ELBO as defined in (33) and $\mathrm{const} = \mathbb{E}_{x\sim P_X}[\log p_X(x)]$ is a constant independent of $\theta$ and $\phi$. Consequently, minimizing $\mathcal{L}_{\mathrm{SNF}}$ is equivalent to minimizing the negative expected ELBO, which is exactly the loss for VAEs from (34).
The above result could alternatively be derived via the relation of the ELBO to the KL divergence between the probability measures defined by the densities $p_X(x)\,q_\phi(z|x)$ and $p_Z(z)\,p_\theta(x|z)$, see [22, Section 2.7].
Related Combinations of VAEs and Normalizing Flows.
There exist several works which model the latent distribution of a VAE by normalizing flows [6, 29], SNFs [38] or sampling-based Monte Carlo methods [36] and often achieve state-of-the-art results. Using the above derivation, all of these models can be viewed as special cases of an SNF, even though some of them employ different training techniques for minimizing the loss function. Further, the authors of [13] modify the learning of the covariance matrices of the decoder and encoder of a VAE using normalizing flows. However, analogously to the derivation above, this can be viewed as a one-layer SNF.
A similar idea was applied in [25], where the authors model the weight distribution of a Bayesian neural network by a normalizing flow. However, we are not completely sure how this approach relates to SNFs.
Finally, to overcome the problem of expensive training in high dimensions, some recent papers [5, 23] also propose other combinations of dimensionality reduction and normalizing flows. However, the model in [5] can be viewed as a variational autoencoder with a specially structured generator and can therefore be considered as a one-layer SNF. In [23], the authors propose to reduce the dimension in a first step by a (non-variational) autoencoder and to optimize a normalizing flow in the reduced dimension in a second step.
6 Diffusion Normalizing Flows as Special SNFs
Recently, Song et al. [33] proposed to learn the drift $b(x,t)$ and the diffusion coefficient $\sigma(t)$ of a stochastic differential equation

$$dX_t = b(X_t, t)\,dt + \sigma(t)\,dW_t \tag{40}$$

with respect to the Brownian motion $(W_t)_t$, such that at some terminal time the solution is approximately distributed according to a given data distribution $P_X$. The explicit Euler discretization of (40) with step size $\Delta t$ reads as

$$X_t = X_{t-1} + \Delta t\, b(X_{t-1}, t-1) + \sqrt{\Delta t}\,\sigma(t-1)\,\xi_t,$$

where $\xi_t\sim\mathcal{N}(0, I_d)$ is independent of $X_{t-1}$. With a similar computation as for the Langevin layers, this corresponds to the Markov kernel

$$\mathcal{K}_t(x, A) = \int_A \mathcal{N}\big(y;\, x + \Delta t\, b(x, t-1),\, \Delta t\, \sigma(t-1)^2 I_d\big)\, dy. \tag{41}$$
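A sketch of the sampling step realizing the kernel (41) (our own illustration; drift and sigma are placeholder callables for $b$ and $\sigma$):

```python
import numpy as np

# Sketch (ours): one explicit Euler step of the SDE (40), i.e. a draw from the kernel (41).
rng = np.random.default_rng(6)

def euler_step(x, t, drift, sigma, dt):
    """x: (n, d) current state X_{t-1}; drift(x, t): (n, d) array; sigma(t): scalar."""
    return x + dt * drift(x, t) + np.sqrt(dt) * sigma(t) * rng.standard_normal(x.shape)
```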
Song et al. parametrize these functions by some a priori learned score network [19, 32] and achieve competitive performance in image generation. Motivated by the time-reversal [1, 10, 16] of the SDE (40), Zhang and Chen [39] introduce the backward layer

$$\mathcal{R}_t(y, A) = \int_A \mathcal{N}\big(x;\, y - \Delta t\, \bar b(y, t),\, \Delta t\,\sigma(t)^2 I_d\big)\, dx,$$

where $\bar b$ approximates the drift of the time-reversed SDE, and learn the parameters of the neural networks $b$ and $\bar b$ using the loss function (25) to achieve state-of-the-art results. Even though Zhang and Chen call their model a diffusion normalizing flow, it is indeed a special case of an SNF using the forward and backward layers $\mathcal{K}_t$ and $\mathcal{R}_t$.
References
- [1] B. D. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
- [2] L. Ardizzone, J. Kruse, C. Rother, and U. Köthe. Analyzing inverse problems with invertible neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
- [3] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J.-H. Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573–582, 2019.
- [4] R. T. Q. Chen, J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- [5] E. Cunningham, R. Zabounidis, A. Agrawal, I. Fiterau, and D. Sheldon. Normalizing flows across dimensions. arXiv preprint arXiv:2006.13070, 2020.
- [6] B. Dai and D. P. Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019.
- [7] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
- [8] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
- [9] C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 2019.
- [10] C. Durkan and Y. Song. On maximum likelihood training of score-based generative models. arXiv preprint arXiv:2101.09258, 2021.
- [11] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc.: Series B (Statistical Methodology), 73(2):123–214, 2011.
- [12] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
- [13] A. A. Gritsenko, J. Snoek, and T. Salimans. On the relationship between normalising flows and variational- and denoising autoencoders. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, 2019.
- [14] P. Hagemann, J. Hertrich, and G. Steidl. Stochastic normalizing flows for inverse problems: a Markov chains viewpoint. arXiv preprint arXiv:2109.11375, 2021.
- [15] P. L. Hagemann and S. Neumayer. Stabilizing invertible neural networks using mixture models. Inverse Problems, 2021.
- [16] U. G. Haussmann and E. Pardoux. Time reversal of diffusions. The Annals of Probability, pages 1188–1205, 1986.
- [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [18] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In Proc. of the 35th International Conference on Machine Learning, pages 2078–2087. PMLR, 2018.
- [19] A. Hyvärinen and P. Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
- [20] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
- [21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [22] D. P. Kingma and M. Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.
- [23] K. Kothari, A. Khorashadizadeh, M. de Hoop, and I. Dokmanić. Trumpets: Injective flows for inference and inverse problems. arXiv preprint arXiv:2102.10461, 2021.
- [24] J.-F. Le Gall. Brownian motion, martingales, and stochastic calculus, volume 274 of Graduate Texts in Mathematics. Springer, [Cham], 2016.
- [25] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proc. of the 34th International Conference on Machine Learning, pages 2218–2227. PMLR, 2017.
- [26] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák. Neural importance sampling. arXiv preprint arXiv:1808.03856, 2018.
- [27] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
- [28] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.
- [29] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- [30] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
- [31] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
- [32] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, 2019.
- [33] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
- [34] L. Tierney. A note on Metropolis-Hastings kernels for general state spaces. Annals of Applied Probability, 8(1):1–9, 1998.
- [35] D. Tsvetkov, L. Hristov, and R. Angelova-Slavova. On the convergence of the Metropolis-Hastings Markov chains. arXiv preprint arXiv:1302.0654, 2020.
- [36] A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. arXiv preprint arXiv:2106.05931, 2021.
- [37] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, page 681–688, 2011.
- [38] H. Wu, J. Köhler, and F. Noé. Stochastic normalizing flows. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 2020, 2020.
- [39] Q. Zhang and Y. Chen. Diffusion normalizing flow. In Conference on Neural Information Processing Systems, 2021.