
# A Unified Approach to Variational Autoencoders and Stochastic Normalizing Flows via Markov Chains

Normalizing flows, diffusion normalizing flows and variational autoencoders are powerful generative models. In this paper, we provide a unified framework to handle these approaches via Markov chains. Indeed, we consider stochastic normalizing flows as a pair of Markov chains fulfilling some properties, and show that many state-of-the-art models for data generation fit into this framework. The Markov chains point of view enables us to couple both deterministic layers such as invertible neural networks and stochastic layers such as Metropolis-Hastings layers, Langevin layers and variational autoencoders in a mathematically sound way. Besides layers with densities, such as Langevin layers, diffusion layers or variational autoencoders, the framework also handles layers without densities, such as deterministic layers or Metropolis-Hastings layers. Hence our framework establishes a useful mathematical tool to combine the various approaches.


## 1 Introduction

Deep generative models for approximating complicated and often high-dimensional probability distributions have become a rapidly developing research field. Variational autoencoders (VAEs) were originally introduced by Kingma and Welling [21] and have seen a large number of modifications and improvements for a huge number of quite different applications. For an overview on VAEs, we refer to [22]. Recently, diffusion normalizing flows arising from the Euler discretization of a certain stochastic differential equation were proposed by Zhang and Chen in [39]. On the other hand, finite normalizing flows, including residual neural networks (ResNets) [3, 4, 17], invertible neural networks (INNs) [2, 8, 15, 20, 26, 29] and autoregressive flows [7, 9, 18, 27], are a popular class of generative models. To overcome topological constraints and improve the expressiveness of normalizing flow architectures, Wu, Köhler and Noé introduced stochastic normalizing flows [38], which combine deterministic, learnable flow transformations with stochastic sampling methods. In [14], we considered stochastic normalizing flows from a Markov chain point of view. In particular, we replaced the transition densities by general Markov kernels and established proofs via Radon-Nikodym derivatives. This allowed us to incorporate deterministic flows and Metropolis-Hastings flows, which do not have densities, into the mathematical derivation.

The aim of this tutorial paper is to propose the straightforward and clear framework of Markov chains for combining deterministic normalizing flows and stochastic flows, in particular VAEs and diffusion normalizing flows. More precisely, we establish a pair of Markov chains having some special properties. This provides a powerful tool for coupling different architectures. We want to highlight that the Markov chain approach can handle distributions with and without densities in a mathematically sound way. We are aware that relations between normalizing flows and other approaches such as VAEs were already mentioned in the literature, and we point to corresponding references at the end of Section 5.

The outline of the paper is as follows: in Section 2 we recall the notion of Markov kernels. Then, in Section 3, we use them to explain normalizing flows. Stochastic normalizing flows are introduced as a pair of Markov chains in Section 4. Afterwards, we show how VAEs fit into the setting of stochastic normalizing flows in Section 5. Related references are given at the end of that section. Finally, we demonstrate in Section 6 how diffusion normalizing flows can be seen as stochastic normalizing flows as well.

## 2 Markov Kernels

In this section, we introduce the basic notation of Markov chains, see, e.g., [24].

Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. By a probability measure on $\mathbb{R}^d$ we always mean a probability measure defined on the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$, and we denote the set of probability measures on $\mathbb{R}^d$ by $\mathcal{P}(\mathbb{R}^d)$. Given a random variable $X\colon \Omega \to \mathbb{R}^d$, we use the push-forward notation

$$P_X = X_{\#}\mathbb{P} \coloneqq \mathbb{P} \circ X^{-1}$$

for the corresponding measure on $\mathbb{R}^d$. A Markov kernel $K\colon \mathbb{R}^n \times \mathcal{B}(\mathbb{R}^d) \to [0,1]$ is a mapping such that

• $K(\cdot, B)$ is measurable for any $B \in \mathcal{B}(\mathbb{R}^d)$, and

• $K(x, \cdot)$ is a probability measure for any $x \in \mathbb{R}^n$.

For a probability measure $\mu$ on $\mathbb{R}^n$, the measure $\mu \times K$ on $\mathbb{R}^n \times \mathbb{R}^d$ is defined by

$$(\mu \times K)(A \times B) \coloneqq \int_A K(x, B)\, d\mu(x). \tag{1}$$

Note that this definition captures all sets in $\mathcal{B}(\mathbb{R}^n \times \mathbb{R}^d)$, since the measurable rectangles form a $\cap$-stable generator of $\mathcal{B}(\mathbb{R}^n \times \mathbb{R}^d)$. Then it holds for all integrable $f$ that

$$\int_{\mathbb{R}^n \times \mathbb{R}^d} f(x,y)\, d(\mu \times K)(x,y) = \int_{\mathbb{R}^n} \int_{\mathbb{R}^d} f(x,y)\, dK(x,\cdot)(y)\, d\mu(x).$$

In the following, we use the notion of the regular conditional distribution $P_{Y|X=\cdot}(\cdot)$ of a random variable $Y$ given a random variable $X$, which is defined as the $P_X$-almost surely unique Markov kernel with the property

$$P_X \times P_{Y|X=\cdot}(\cdot) = P_{(X,Y)}. \tag{2}$$

We will use the abbreviation $P_{Y|X}$ if the meaning is clear from the context. A sequence $(X_t)_{t=0}^{T}$ of $d_t$-dimensional random variables $X_t$, $t = 0,\dots,T$, is called a Markov chain if there exist Markov kernels

$$K_t = P_{X_t|X_{t-1}}\colon \mathbb{R}^{d_{t-1}} \times \mathcal{B}(\mathbb{R}^{d_t}) \to [0,1]$$

in the sense of (2) such that

$$P_{(X_0,\dots,X_T)} = P_{X_0} \times P_{X_1|X_0} \times \cdots \times P_{X_T|X_{T-1}}. \tag{3}$$

The Markov kernels $K_t$ are also called transition kernels. If the measure $K_t(x,\cdot)$ has a density $k_t(x,\cdot)$, and $P_{X_{t-1}}$ resp. $P_{X_t}$ have densities $p_{t-1}$ resp. $p_t$, then setting $A = \mathbb{R}^{d_{t-1}}$ in equation (1) results in

$$p_t(y) = \int_{\mathbb{R}^{d_{t-1}}} k_t(x,y)\, p_{t-1}(x)\, dx. \tag{4}$$
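To make the transition-kernel formalism concrete, here is a minimal numerical sketch (assuming NumPy; the Gaussian kernels and step sizes are illustrative choices, not taken from the text) that samples a path $(X_0,\dots,X_T)$ according to the factorization (3) by drawing $X_0$ first and then applying one transition kernel $K_t(x,\cdot) = \mathcal{N}(x, \sigma_t^2 I)$ per step:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(x, sigma):
    """Draw one sample from the Markov kernel K(x, .) = N(x, sigma^2 I)."""
    return x + sigma * rng.standard_normal(x.shape)

def run_chain(x0, sigmas):
    """Sample a path (X_0, ..., X_T) as in the factorization (3):
    draw X_0 first, then apply each transition kernel in turn."""
    path = [x0]
    for sigma in sigmas:
        path.append(gaussian_kernel(path[-1], sigma))
    return path

# X_0 ~ N(0, I) in R^2, followed by three Gaussian transitions
x0 = rng.standard_normal(2)
path = run_chain(x0, sigmas=[0.5, 0.5, 0.1])
```

Each marginal density along the path then satisfies the propagation formula (4).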

## 3 Normalizing Flows

In this section, we show how normalizing flows can be interpreted as finite Markov chains. A normalizing flow [28] is often understood as a deterministic, invertible transform, which we call $T = T_\theta\colon \mathbb{R}^d \to \mathbb{R}^d$, where $\theta$ denotes the network parameters.

For better readability, we skip the dependence of $T_\theta$ on the parameters $\theta$ and write just $T$. Normalizing flows can be used to model the density $p_X$ of a distribution $P_X$ by a simpler distribution $P_Z$, usually the standard normal distribution, by learning $T$ such that it holds approximately

$$P_X \approx T_{\#}P_Z, \quad\text{or equivalently}\quad P_Z \approx T^{-1}_{\#}P_X.$$

Note that by the change-of-variables formula we have for the corresponding densities

$$p_{T_{\#}P_Z}(x) = p_Z\big(T^{-1}(x)\big)\, \big|\det \nabla T^{-1}(x)\big|. \tag{5}$$

The approximation can be done by minimizing the Kullback-Leibler divergence

$$\mathrm{KL}(P_X, T_{\#}P_Z) = \mathbb{E}_{x\sim P_X}\Big[\log\Big(\frac{p_X}{p_{T_{\#}P_Z}}\Big)\Big] = \mathbb{E}_{x\sim P_X}[\log p_X] - \mathbb{E}_{x\sim P_X}[\log p_{T_{\#}P_Z}] \tag{6}$$
$$= \mathbb{E}_{x\sim P_X}[\log p_X] - \mathbb{E}_{x\sim P_X}[\log p_Z \circ T^{-1}] - \mathbb{E}_{x\sim P_X}\big[\log \big|\det(\nabla T^{-1})\big|\big]. \tag{7}$$

Noting that the first summand is just a constant, this gives the loss function

$$\mathcal{L}_{\mathrm{NF}}(\theta) = -\mathbb{E}_{x\sim P_X}[\log p_Z \circ T^{-1}] - \mathbb{E}_{x\sim P_X}\big[\log \big|\det(\nabla T^{-1})\big|\big].$$
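As a toy illustration of this loss, the following sketch (a hypothetical one-dimensional affine flow $T(z) = az + b$, an assumption for illustration, not an architecture from the text) evaluates $\mathcal{L}_{\mathrm{NF}}$ by Monte Carlo; here $T^{-1}(x) = (x-b)/a$ and $\log|\det \nabla T^{-1}| = -\log|a|$:

```python
import numpy as np

# Hypothetical one-layer affine flow T(z) = a*z + b with a != 0, so that
# T^{-1}(x) = (x - b)/a and |det grad T^{-1}(x)| = 1/|a| in 1D.
a, b = 2.0, 1.0

def log_pZ(z):
    # standard normal log-density
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def nf_loss(x):
    """Monte Carlo estimate of L_NF:
    -E[log p_Z(T^{-1}(x))] - E[log |det grad T^{-1}(x)|]."""
    z = (x - b) / a                 # T^{-1}(x)
    log_det = -np.log(abs(a))       # log |det grad T^{-1}|
    return -np.mean(log_pZ(z) + log_det)

rng = np.random.default_rng(1)
x_samples = a * rng.standard_normal(1000) + b   # samples from T_# P_Z
loss = nf_loss(x_samples)
```

Since the samples here are drawn from $T_{\#}P_Z$ itself, the estimate approaches the entropy of $T_{\#}P_Z$, the minimal achievable value of the cross-entropy part of the loss.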

The network $T$ is constructed by concatenating smaller blocks

$$T = T_T \circ \cdots \circ T_1,$$

which are invertible networks on their own. Then the blocks generate a pair of Markov chains by

$$X_0 \sim P_Z,\quad X_t = T_t(X_{t-1}) \qquad\text{and}\qquad Y_T \sim P_X,\quad Y_{t-1} = T_t^{-1}(Y_t).$$

Here, for all $t = 0,\dots,T$, the dimension of the random variables $X_t$ and $Y_t$ is equal to $d$. The transition kernels $K_t = P_{X_t|X_{t-1}}$ and $R_t = P_{Y_{t-1}|Y_t}$ are given by the Dirac distributions

$$K_t(x,\cdot) = \delta_{T_t(x)} \qquad\text{and}\qquad R_t(x,\cdot) = \delta_{T_t^{-1}(x)},$$

which can be seen by (2) as follows: for any $A, B \in \mathcal{B}(\mathbb{R}^d)$ it holds

$$P_{(X_{t-1},X_t)}(A\times B) = \int_{\mathbb{R}^d\times\mathbb{R}^d} 1_{A\times B}(x_{t-1},x_t)\, dP_{(X_{t-1},X_t)}(x_{t-1},x_t) \tag{8}$$
$$= \int_{\mathbb{R}^d\times\mathbb{R}^d} 1_A(x_{t-1})\, 1_B(x_t)\, dP_{(X_{t-1},X_t)}(x_{t-1},x_t). \tag{9}$$

Since $P_{(X_{t-1},X_t)}$ is by definition concentrated on the set $\{(x, T_t(x)) : x \in \mathbb{R}^d\}$, this becomes

$$P_{(X_{t-1},X_t)}(A\times B) = \int_{\mathbb{R}^d\times\mathbb{R}^d} 1_A(x_{t-1})\, 1_B(T_t(x_{t-1}))\, dP_{(X_{t-1},X_t)}(x_{t-1},x_t) \tag{10}$$
$$= \int_A 1_B(T_t(x_{t-1}))\, dP_{X_{t-1}}(x_{t-1}) \tag{11}$$
$$= \int_A \delta_{T_t(x_{t-1})}(B)\, dP_{X_{t-1}}(x_{t-1}). \tag{12}$$

Consequently, by (1), the transition kernel is given by $K_t(x,\cdot) = \delta_{T_t(x)}$, and analogously $R_t(x,\cdot) = \delta_{T_t^{-1}(x)}$. Due to their correspondence to the layers $T_t$ and $T_t^{-1}$ of the normalizing flow $T$, we call the Markov kernels $K_t$ forward layers, while the Markov kernels $R_t$ are called reverse layers.

## 4 Stochastic Normalizing Flows

The idea of stochastic normalizing flows is to replace some of the deterministic layers of a normalizing flow by random transforms. From the Markov chains viewpoint, we replace the Dirac-measure kernels $K_t$ and $R_t$ by more general Markov kernels.

Formally, a stochastic normalizing flow (SNF) is a pair of Markov chains of $d_t$-dimensional random variables $(X_t)_t$ and $(Y_t)_t$, $t = 0,\dots,T$, with the following properties:

• $X_t$ and $Y_t$ have densities $p_{X_t}$ and $p_{Y_t}$ for any $t = 0,\dots,T$.

• There exist Markov kernels $K_t = P_{X_t|X_{t-1}}$ and $R_t = P_{Y_{t-1}|Y_t}$ such that

$$P_{(X_0,\dots,X_T)} = P_{X_0}\times P_{X_1|X_0}\times\cdots\times P_{X_T|X_{T-1}}, \tag{13}$$
$$P_{(Y_T,\dots,Y_0)} = P_{Y_T}\times P_{Y_{T-1}|Y_T}\times\cdots\times P_{Y_0|Y_1}. \tag{14}$$

• For $t = 1,\dots,T$ and $P_{X_t}$-almost every $x_t$, the measures $P_{Y_{t-1}|Y_t=x_t}$ and $P_{X_{t-1}|X_t=x_t}$ are absolutely continuous with respect to each other.

We say that the Markov chain $(Y_t)_t$ is a reverse Markov chain of $(X_t)_t$. In applications, the Markov chain usually starts with a latent random variable

$$X_0 = Z$$

on $\mathbb{R}^{d_0}$, which is easy to sample from, and we intend to learn the Markov chain such that $X_T$ approximates a target random variable $X$ on $\mathbb{R}^{d_T}$, while the reverse Markov chain is initialized with a random variable

$$Y_T = X$$

from the data space and $Y_0$ should approximate the latent variable $Z$. As outlined in the previous section, each deterministic normalizing flow is a special case of an SNF. In the following, let $\mathcal{N}(\mu, \Sigma)$ denote the normal distribution with density $N(\cdot\,; \mu, \Sigma)$.

### 4.1 Stochastic Layers

In the following, we briefly recall the two stochastic layers which were used in [14, 38]. Another kind of layer arising from VAEs is detailed in the next section. In both cases from [14, 38], we choose, as for the deterministic layers,

$$d_{t-1} = d_t = d,$$

and the basic idea is to push the distribution of $X_{t-1}$ into the direction of some proposal density $p_t$, which is usually chosen as some interpolation between $p_Z$ and $p_X$. For a detailed description of this interpolation, we refer to [14]. As reverse layer, we use the same Markov kernel as for the forward layer, i.e.,

$$R_t = K_t.$$
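One way to make such an interpolation concrete is a geometric interpolation of the (unnormalized) log-densities, a common annealing choice; this is an assumption for illustration, not the specific construction of [14]:

```python
import numpy as np

def make_log_pt(log_pZ, log_pX, t, T):
    """Geometric interpolation of unnormalized log-densities:
    log p_t = (1 - beta_t) log p_Z + beta_t log p_X with beta_t = t/T."""
    beta = t / T
    return lambda x: (1 - beta) * log_pZ(x) + beta * log_pX(x)

log_pZ = lambda x: -0.5 * np.sum(x**2)           # standard normal (unnormalized)
log_pX = lambda x: -0.5 * np.sum((x - 2.0)**2)   # shifted normal (unnormalized)

log_p1 = make_log_pt(log_pZ, log_pX, t=1, T=4)
val = log_p1(np.zeros(2))
```

The stochastic layers below only require $p_t$ up to a normalization constant, which is why working with unnormalized log-densities suffices.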

Metropolis-Hastings (MH) Layer: The Metropolis-Hastings algorithm outlined in Alg. 1 is a frequently used Markov chain Monte Carlo algorithm to sample from a distribution with known density, see, e.g., [30].

Under mild assumptions, the corresponding Markov chain admits the target as its unique stationary distribution, and the distributions of its iterates converge to it in the total variation norm, see, e.g., [35].

In the MH layer, the transition from $X_{t-1}$ to $X_t$ is one step of a Metropolis-Hastings algorithm. More precisely, let $\xi_t \sim \mathcal{N}(0, \sigma^2 I)$ and $U \sim \mathcal{U}_{[0,1]}$ be random variables such that $\sigma(X_{t-1})$, $\sigma(\xi_t)$ and $\sigma(U)$ are independent. Here $\sigma(X)$ denotes the smallest $\sigma$-algebra generated by the random variable $X$. Then we set

$$X_t \coloneqq X_{t-1} + 1_{[U,1]}\big(\alpha_t(X_{t-1}, X_{t-1}+\xi_t)\big)\,\xi_t, \tag{15}$$

where

$$\alpha_t(x,y) \coloneqq \min\Big\{1, \frac{p_t(y)}{p_t(x)}\Big\}$$

with a proposal density $p_t$ which has to be specified. The corresponding transition kernel was derived, e.g., in [34] and is given by

$$K_t(x,A) \coloneqq \int_A N(y; x, \sigma^2 I)\,\alpha_t(x,y)\, dy + \delta_x(A)\int_{\mathbb{R}^d} N(y; x, \sigma^2 I)\,\big(1-\alpha_t(x,y)\big)\, dy. \tag{16}$$

Note that another kind of MH layer, arising from the Metropolis-adjusted Langevin algorithm (MALA) [11, 31], was also used in [14, 38] under the name Markov chain Monte Carlo (MCMC) layer.
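A single MH transition as in (15) can be sketched as follows (assuming NumPy; the standard-normal target and step size are illustrative choices): a Gaussian proposal is accepted with probability $\alpha_t$, otherwise the old state is kept.

```python
import numpy as np

rng = np.random.default_rng(2)

def mh_step(x, log_p, sigma=0.5):
    """One Metropolis-Hastings step with Gaussian proposal
    y = x + xi, xi ~ N(0, sigma^2 I), and acceptance probability
    alpha(x, y) = min(1, p(y)/p(x)) computed in log-space."""
    y = x + sigma * rng.standard_normal(x.shape)
    alpha = min(1.0, np.exp(log_p(y) - log_p(x)))
    u = rng.uniform()
    return y if u <= alpha else x   # accept the proposal or keep the old state

# illustrative target: standard normal in R^2 (unnormalized log-density)
log_p = lambda x: -0.5 * np.sum(x**2)
x = np.array([3.0, -3.0])
for _ in range(1000):
    x = mh_step(x, log_p)
```

Note that the resulting kernel (16) has a Dirac part (rejection), so it admits no density; the Markov-kernel formalism still covers it.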
Langevin Layer: In the Langevin layer, we model the transition from $X_{t-1}$ to $X_t$ by one step of an explicit Euler discretization of the overdamped Langevin dynamics [37]. Let $\xi_t \sim \mathcal{N}(0, I)$ such that $\sigma(X_{t-1})$ and $\sigma(\xi_t)$ are independent. Again we assume that we are given a proposal density $p_t$ which has to be specified. We denote by $u_t = -\log p_t$ the corresponding negative log-likelihood and set

$$X_t \coloneqq X_{t-1} - a_1 \nabla u_t(X_{t-1}) + a_2 \xi_t,$$

where $a_1, a_2 > 0$ are some predefined constants. To determine the corresponding kernel, we use the independence of $\xi_t$ and $X_{t-1}$ to obtain that $X_{t-1}$ and $X_t$ have the joint density

$$p_{(X_{t-1},X_t)}(x_{t-1},x_t) = a_2^{-d}\, p_{(X_{t-1},\xi_t)}\Big(x_{t-1}, \tfrac{1}{a_2}\big(x_t - x_{t-1} + a_1\nabla u_t(x_{t-1})\big)\Big) \tag{17}$$
$$= a_2^{-d}\, p_{X_{t-1}}(x_{t-1})\, p_{\xi_t}\Big(\tfrac{1}{a_2}\big(x_t - x_{t-1} + a_1\nabla u_t(x_{t-1})\big)\Big) \tag{18}$$
$$= p_{X_{t-1}}(x_{t-1})\, N\big(x_t;\, x_{t-1} - a_1\nabla u_t(x_{t-1}),\, a_2^2 I\big). \tag{19}$$

Then, for $A, B \in \mathcal{B}(\mathbb{R}^d)$, it holds

$$P_{(X_{t-1},X_t)}(A\times B) = \int_{A\times B} p_{X_{t-1}}(x_{t-1})\, N\big(x_t; x_{t-1} - a_1\nabla u_t(x_{t-1}), a_2^2 I\big)\, d(x_{t-1},x_t) \tag{20}$$
$$= \int_A \int_B N\big(x_t; x_{t-1} - a_1\nabla u_t(x_{t-1}), a_2^2 I\big)\, dx_t\; p_{X_{t-1}}(x_{t-1})\, dx_{t-1} \tag{21}$$
$$= \int_A K_t(x_{t-1}, B)\, dP_{X_{t-1}}(x_{t-1}), \tag{22}$$

where

$$K_t(x,\cdot) \coloneqq \mathcal{N}\big(x - a_1\nabla u_t(x),\, a_2^2 I\big). \tag{23}$$

By (1) and (2), this is the Langevin transition kernel $K_t = P_{X_t|X_{t-1}}$.
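One draw from the Langevin kernel (23) thus amounts to a gradient step on $u_t$ plus Gaussian noise. A minimal sketch (assuming NumPy, with an illustrative quadratic $u_t$; the relation $a_2 = \sqrt{2a_1}$ used below is a common choice and an assumption here, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(3)

def langevin_step(x, grad_u, a1, a2):
    """One sample from the Langevin kernel (23):
    K_t(x, .) = N(x - a1 * grad u(x), a2^2 I)."""
    return x - a1 * grad_u(x) + a2 * rng.standard_normal(x.shape)

# illustrative proposal: u(x) = ||x||^2 / 2, i.e. grad u(x) = x,
# the negative log-likelihood of a standard normal
grad_u = lambda x: x
x = np.full(2, 5.0)
for _ in range(500):
    x = langevin_step(x, grad_u, a1=0.05, a2=np.sqrt(2 * 0.05))
```

In contrast to the MH kernel (16), this kernel has a density, namely the Gaussian in (19).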

### 4.2 Training SNFs

We aim to find parameters $\theta$ of an SNF such that $P_{X_T} \approx P_X$. Recall that for deterministic normalizing flows it holds $P_{X_T} = T_{\#}P_Z$, so that the loss function reads as $\mathrm{KL}(P_X, T_{\#}P_Z)$. Unfortunately, the stochastic layers make it impossible to evaluate and minimize $\mathrm{KL}(P_X, P_{X_T})$ directly. Instead, we minimize the KL divergence of the joint distributions

$$\mathcal{L}_{\mathrm{SNF}} = \mathrm{KL}\big(P_{(Y_0,\dots,Y_T)},\, P_{(X_0,\dots,X_T)}\big),$$

which is an upper bound of $\mathrm{KL}(P_X, P_{X_T})$. It was shown in [14, Theorem 5] that this loss function can be rewritten as

$$\mathcal{L}_{\mathrm{SNF}}(\theta) = \mathrm{KL}\big(P_{(Y_0,\dots,Y_T)},\, P_{(X_0,\dots,X_T)}\big) \tag{24}$$
$$= \mathbb{E}_{(x_0,\dots,x_T)\sim P_{(Y_0,\dots,Y_T)}}\Big[\log\Big(\frac{p_X(x_T)}{p_{X_T}(x_T)}\prod_{t=1}^{T} f_t(x_{t-1},x_t)\Big)\Big] \tag{25}$$
$$= \mathbb{E}_{(x_0,\dots,x_T)\sim P_{(Y_0,\dots,Y_T)}}\Big[\log\Big(\frac{p_X(x_T)}{p_Z(x_0)}\prod_{t=1}^{T} f_t(x_{t-1},x_t)\,\frac{p_{X_{t-1}}(x_{t-1})}{p_{X_t}(x_t)}\Big)\Big], \tag{26}$$

where $f_t$ is given by the Radon-Nikodym derivative $f_t(\cdot, x_t) = \frac{dP_{Y_{t-1}|Y_t=x_t}}{dP_{X_{t-1}|X_t=x_t}}$. Finally, note that by [14, Theorem 6] we have for any deterministic normalizing flow that $\mathcal{L}_{\mathrm{SNF}}(\theta) = \mathcal{L}_{\mathrm{NF}}(\theta)$.

## 5 VAEs as Special SNF Layers

In this section, we introduce variational autoencoders (VAEs) as another kind of stochastic layer of an SNF. First, we briefly revisit the definition of autoencoders and VAEs. Afterwards, we show that a VAE can be viewed as a one-layer SNF.

#### Autoencoders.

Autoencoders (see [12] for an overview) are a dimensionality reduction technique inspired by principal component analysis. For $n \ll d$, an autoencoder is a pair of neural networks, consisting of an encoder $E_\phi\colon \mathbb{R}^d \to \mathbb{R}^n$ and a decoder $D_\theta\colon \mathbb{R}^n \to \mathbb{R}^d$, where $\phi$ and $\theta$ are the network parameters. The encoder aims to encode samples from a $d$-dimensional distribution $P_X$ in the lower-dimensional space $\mathbb{R}^n$ such that the decoder is able to reconstruct them. Consequently, it is a necessary assumption that the distribution $P_X$ is approximately concentrated on an $n$-dimensional manifold. A possible loss function to train $E_\phi$ and $D_\theta$ is given by

$$\mathcal{L}_{\mathrm{AE}}(\phi,\theta) = \mathbb{E}_{x\sim P_X}\big[\|x - D_\theta(E_\phi(x))\|^2\big].$$

Using this construction, autoencoders have proven to be very powerful for reducing the dimensionality of very complex datasets.
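As a minimal illustration of this loss (a hypothetical linear encoder/decoder pair, i.e., a PCA-like autoencoder, an assumption for illustration and not from the text), the reconstruction error is small exactly when the data concentrate near an $n$-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical linear autoencoder for d = 3, n = 1:
# E(x) = W^T x, D(z) = W z with a unit-norm column W.
W = np.array([[1.0], [0.0], [0.0]])   # encode onto the first coordinate

def reconstruction_loss(X):
    """Empirical autoencoder loss E ||x - D(E(x))||^2."""
    Z = X @ W            # encode: R^3 -> R^1
    X_hat = Z @ W.T      # decode: R^1 -> R^3
    return np.mean(np.sum((X - X_hat)**2, axis=1))

# data concentrated near the first axis, as the manifold assumption requires
X = np.outer(rng.standard_normal(200), [1.0, 0.0, 0.0])
X += 0.01 * rng.standard_normal(X.shape)
loss = reconstruction_loss(X)
```

Here the loss reduces to the variance of the small off-manifold noise; for data spread over all of $\mathbb{R}^3$, no such low-dimensional encoder/decoder pair could reconstruct well.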

#### Variational Autoenconders via Markov Kernels.

Variational autoencoders (VAEs) [21] aim to use the power of autoencoders to approximate a probability distribution $P_X$ with density $p_X$ using a simpler distribution $P_Z$ with density $p_Z$, which is usually the standard normal distribution. Here, the idea is to learn random transforms that push the distribution $P_Z$ onto $P_X$ and vice versa. Formally, these transforms are defined by the Markov kernels

$$K(z,\cdot) = \mathcal{N}\big(\mu_\theta(z), \Sigma_\theta(z)\big) \qquad\text{and}\qquad R(x,\cdot) = \mathcal{N}\big(\mu_\phi(x), \Sigma_\phi(x)\big), \tag{27}$$

where

$$D(z) = D_\theta(z) = \big(\mu_\theta(z), \Sigma_\theta(z)\big)$$

is a neural network with parameters $\theta$, which determines the parameters of the normal distribution within the definition of $K$. Similarly,

$$E(x) = E_\phi(x) = \big(\mu_\phi(x), \Sigma_\phi(x)\big)$$

determines the parameters within the definition of $R$. In analogy to the autoencoders in the previous paragraph, $D$ and $E$ are called the stochastic decoder and stochastic encoder. By definition, $K(z,\cdot)$ has the density $p_\theta(\cdot|z)$ and $R(x,\cdot)$ has the density $q_\phi(\cdot|x)$.

Now, we aim to learn the parameters $\theta$ such that it holds approximately

$$p_X(x) \approx \int_{\mathbb{R}^n} p_\theta(x|z)\, p_Z(z)\, dz \qquad\text{or equivalently}\qquad P_X(A) \approx \int_{\mathbb{R}^n} K(z, A)\, dP_Z(z). \tag{28}$$

Assuming that the above equation holds exactly, we can generate samples from $P_X$ by first sampling $z$ from $P_Z$ and then sampling from $K(z,\cdot)$.

A first idea would be to use the maximum likelihood estimator as loss function, i.e., to maximize

$$\mathbb{E}_{x\sim P_X}\big[\log(p_\theta(x))\big], \qquad p_\theta(x) = \int_{\mathbb{R}^n} p_\theta(x|z)\, p_Z(z)\, dz.$$

Unfortunately, computing this integral directly is intractable. Thus, using Bayes' formula

$$p_\theta(z|x) = \frac{p_\theta(x|z)\, p_Z(z)}{p_\theta(x)},$$

we artificially incorporate the stochastic encoder by the computation

$$\log(p_\theta(x)) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(p_\theta(x)\,\frac{p_\theta(z|x)}{p_\theta(z|x)}\Big)\Big] \tag{29}$$
$$= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(\frac{p_\theta(x)\, p_\theta(z|x)}{q_\phi(z|x)}\Big)\Big] + \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(\frac{q_\phi(z|x)}{p_\theta(z|x)}\Big)\Big] \tag{30}$$
$$= \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(\frac{p_\theta(x|z)\, p_Z(z)}{q_\phi(z|x)}\Big)\Big] + \mathrm{KL}\big(q_\phi(\cdot|x),\, p_\theta(\cdot|x)\big) \tag{31}$$
$$\geq \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(\frac{p_\theta(x|z)\, p_Z(z)}{q_\phi(z|x)}\Big)\Big]. \tag{32}$$

Then the loss function given by

$$\mathcal{L}_{\theta,\phi}(x) \coloneqq \mathbb{E}_{z\sim q_\phi(\cdot|x)}\Big[\log\Big(\frac{p_\theta(x|z)\, p_Z(z)}{q_\phi(z|x)}\Big)\Big] \tag{33}$$

is a lower bound on the so-called evidence $\log(p_\theta(x))$. Therefore it is called the evidence lower bound (ELBO). Now the parameters $\theta$ and $\phi$ of the VAE can be trained by maximizing the expected ELBO, i.e., by minimizing the loss function

$$\mathcal{L}_{\mathrm{VAE}}(\theta,\phi) = -\mathbb{E}_{x\sim P_X}\big[\mathcal{L}_{\theta,\phi}(x)\big]. \tag{34}$$
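In practice, the expectation in (33) is estimated by sampling from the stochastic encoder. The following sketch (a hypothetical one-dimensional Gaussian VAE with fixed, illustrative parameters, not a trained model) computes such a Monte Carlo estimate of the ELBO:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical 1D Gaussian VAE: encoder R(x,.) = N(mu_phi(x), s_phi^2),
# decoder K(z,.) = N(mu_theta(z), s_theta^2), latent prior p_Z = N(0,1).
mu_phi = lambda x: 0.5 * x
s_phi = 0.8
mu_theta = lambda z: 2.0 * z
s_theta = 1.0

def log_normal(x, mu, s):
    return -0.5 * ((x - mu) / s)**2 - np.log(s) - 0.5 * np.log(2 * np.pi)

def elbo(x, n_samples=100):
    """Monte Carlo estimate of the ELBO (33):
    E_{z ~ q_phi(.|x)} [ log p_theta(x|z) + log p_Z(z) - log q_phi(z|x) ]."""
    z = mu_phi(x) + s_phi * rng.standard_normal(n_samples)
    return np.mean(log_normal(x, mu_theta(z), s_theta)
                   + log_normal(z, 0.0, 1.0)
                   - log_normal(z, mu_phi(x), s_phi))

val = elbo(1.0)
```

By (32), the estimate lies (up to Monte Carlo error) below the evidence $\log(p_\theta(x))$, with equality exactly when the encoder matches the true posterior $p_\theta(\cdot|x)$.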

#### VAEs as one Layer SNFs.

In the following, we show that a VAE is a special case of a one-layer SNF. Let $(X_0, X_1)$ and $(Y_0, Y_1)$ be a one-layer SNF, where the layers $K_1 = K$ and $R_1 = R$ are defined as in (27) with densities $p_\theta(\cdot|z)$ and $q_\phi(\cdot|x)$, respectively. Note that, in contrast to the stochastic layers from Section 4, the dimensions $d_0 = n$ and $d_1 = d$ are no longer equal. Now, with $T = 1$, the loss function (25) of the SNF reads as

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{(z,x)\sim P_{(Y_0,Y_1)}}\Big[-\log\Big(\frac{p_{X_1}(x)}{p_X(x)\, f_1(z,x)}\Big)\Big], \tag{35}$$

where $f_1$ is given by the Radon-Nikodym derivative $f_1(\cdot, x) = \frac{dP_{Y_0|Y_1=x}}{dP_{X_0|X_1=x}}$. Now we can use the fact that, by the definition of $K$ and $R$, the random variables $(X_0, X_1)$ as well as the random variables $(Y_0, Y_1)$ have joint densities, to express $f_1$ by the corresponding conditional densities. Together with Bayes' formula we obtain

$$f_1(z,x) = \frac{p_{Y_0|Y_1=x}(z)}{p_{X_0|X_1=x}(z)} = q_\phi(z|x)\,\frac{p_{X_1}(x)}{p_{X_1|X_0=z}(x)\, p_{X_0}(z)} = \frac{q_\phi(z|x)\, p_{X_1}(x)}{p_\theta(x|z)\, p_Z(z)}.$$

Inserting this into (35), we get

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{(z,x)\sim P_{(Y_0,Y_1)}}\Big[-\log\Big(\frac{p_\theta(x|z)\, p_Z(z)}{q_\phi(z|x)\, p_X(x)}\Big)\Big] \tag{36}$$

and using (2) further

$$\mathcal{L}_{\mathrm{SNF}}(\theta,\phi) = \mathbb{E}_{x\sim P_X}\Big[\mathbb{E}_{z\sim R(x,\cdot)}\Big[-\log\Big(\frac{p_\theta(x|z)\, p_Z(z)}{q_\phi(z|x)\, p_X(x)}\Big)\Big]\Big] \tag{37}$$
$$= -\mathbb{E}_{x\sim P_X}\big[\mathcal{L}_{\theta,\phi}(x)\big] + \mathbb{E}_{x\sim P_X}\big[\log(p_X(x))\big] \tag{38}$$
$$= \mathcal{L}_{\mathrm{VAE}}(\theta,\phi) + \mathrm{const}, \tag{39}$$

where $\mathcal{L}_{\theta,\phi}$ denotes the ELBO as defined in (33) and the constant is independent of $\theta$ and $\phi$. Consequently, minimizing $\mathcal{L}_{\mathrm{SNF}}$ is equivalent to minimizing the negative expected ELBO, which is exactly the VAE loss from (34).

The above result could alternatively be derived via the relation of the ELBO to the KL divergence between the probability measures defined by the corresponding joint densities, see [22, Section 2.7].

#### Related Combinations of VAEs and Normalizing flows.

There exist several works which model the latent distribution of a VAE by normalizing flows [6, 29], SNFs [38] or sampling-based Monte Carlo methods [36], often achieving state-of-the-art results. Using the above derivation, all of these models can be viewed as special cases of an SNF, even though some of them employ different training techniques for minimizing the loss function. Further, the authors of [13] modify the learning of the covariance matrices of the decoder and encoder of a VAE using normalizing flows. However, analogously to the previous paragraph, this can be viewed as a one-layer SNF.

A similar idea was applied in [25], where the authors model the weight distribution of a Bayesian neural network by a normalizing flow. However, we are not completely sure how this approach relates to SNFs.

Finally, to overcome the problem of expensive training in high dimensions, some recent papers [5, 23] propose other combinations of dimensionality reduction and normalizing flows. The model in [5] can be viewed as a variational autoencoder with a specially structured generator and can therefore be considered a one-layer SNF. In [23], the authors propose to first reduce the dimension by a non-variational autoencoder and then to optimize a normalizing flow in the reduced dimensions.

## 6 Diffusion Normalizing Flows as special SNFs

Recently, Song et al. [33] proposed to learn the drift $f_t\colon \mathbb{R}^d \to \mathbb{R}^d$ and diffusion coefficient $g_t$ of a stochastic differential equation

$$dX_t = f_t(X_t)\, dt + g_t\, dB_t \tag{40}$$

with respect to the Brownian motion $(B_t)_t$, such that $X_T \sim P_X$ holds approximately for some $T > 0$ and some data distribution $P_X$. The explicit Euler discretization of (40) with step size $\epsilon$ reads as

$$X_t = X_{t-1} + \epsilon f_{t-1}(X_{t-1}) + \sqrt{\epsilon}\, g_{t-1}\, \xi_{t-1}, \qquad t = 1,\dots,T,$$

where $\xi_{t-1} \sim \mathcal{N}(0, I)$ is independent of $X_{t-1}$. With a similar computation as for the Langevin layers, this corresponds to the Markov kernel

$$K_t(x,\cdot) \coloneqq P_{X_t|X_{t-1}=x} = \mathcal{N}\big(x + \epsilon f_{t-1}(x),\, \epsilon g_{t-1}^2 I\big). \tag{41}$$

Song et al. parametrize these functions via an a-priori learned score network [19, 32] and achieve competitive performance in image generation. Motivated by the time-reversal [1, 10, 16] of the SDE (40), Zhang and Chen [39] introduce the backward layer

$$R_t(x,\cdot) = P_{Y_{t-1}|Y_t=x} = \mathcal{N}\big(x + \epsilon\big(f_t(x) - g_t^2\, s_t(x)\big),\, \epsilon g_t^2 I\big)$$

and learn the parameters of the neural networks $f_t$, $g_t$ and $s_t$ using the loss function (25), achieving state-of-the-art results. Even though Zhang and Chen call their model a diffusion flow, it is indeed a special case of an SNF using the forward and backward layers $K_t$ and $R_t$.
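One forward step of the discretized SDE, i.e., one draw from the kernel (41), can be sketched as follows (assuming NumPy; the Ornstein-Uhlenbeck-type drift $f(x) = -x$ and the constant diffusion coefficient are illustrative choices, not the learned networks of [33, 39]):

```python
import numpy as np

rng = np.random.default_rng(6)

def euler_step(x, f, g, eps):
    """One sample from the forward kernel (41):
    K_t(x, .) = N(x + eps * f(x), eps * g^2 I)."""
    return x + eps * f(x) + np.sqrt(eps) * g * rng.standard_normal(x.shape)

# illustrative Ornstein-Uhlenbeck-type drift f(x) = -x with constant g
f = lambda x: -x
x = np.full(2, 4.0)
for _ in range(200):
    x = euler_step(x, f, g=1.0, eps=0.05)
```

Iterating this step is exactly a chain of Gaussian transition kernels, which is why the discretized SDE fits the SNF framework directly.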

On the other hand, not every SNF can be represented as a discretized SDE. For example, the forward kernel (16) of the MH layer is not of the form (41).