 # Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function

In Bayesian machine learning, the posterior distribution is typically computationally intractable, so variational inference is often required. In this approach, an evidence lower bound on the log likelihood of the data is maximized during training. Variational autoencoders (VAEs) are one important example where variational inference is used. In this tutorial, we derive the variational lower bound loss function of the standard variational autoencoder. We do so for a Gaussian latent prior and a Gaussian approximate posterior, under which assumptions the Kullback-Leibler term in the variational lower bound has a closed-form solution. We derive essentially everything we use along the way, from Bayes' theorem to the Kullback-Leibler divergence.


## Bayes Theorem

Bayes' theorem is a way to update one's belief as new evidence comes into view. The probability of a hypothesis $z$, given some new data $x$, is denoted $p(z|x)$ and is given by

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}, \tag{1}$$

where $p(x)$ is the probability of the data $x$, $p(x|z)$ is the probability of the data given a hypothesis $z$, and $p(z)$ is the probability of that hypothesis. While Bayes' theorem by itself can appear non-intuitive, or at least difficult to intuit, the key to understanding it is to derive it. It arises directly out of the conditional probability axiom, which itself arises out of the definition of the joint probability. The probability of an event $X$ and an event $Y$ occurring jointly is,

$$p(X \cap Y) = p(X|Y)\,p(Y). \tag{2}$$

And since the 'AND' is commutative, we have,

$$p(X \cap Y) = p(Y \cap X) = p(Y|X)\,p(X), \tag{3}$$

$$p(X|Y)\,p(Y) = p(Y|X)\,p(X). \tag{4}$$

Dividing both sides of Equation (4) by $p(Y)$ yields Bayes' theorem,

$$p(X|Y) = \frac{p(Y|X)\,p(X)}{p(Y)}. \tag{5}$$
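
As a quick sanity check of Equation (5), the following sketch builds a small joint distribution over two binary events and compares both sides of Bayes' theorem. The probabilities in `joint` and the helper names are made up for illustration and are not part of the original derivation.

```python
# Hypothetical joint distribution over two binary events X and Y.
# joint[(x, y)] = p(X=x, Y=y); the numbers are illustrative only.
joint = {
    (True, True): 0.20,
    (True, False): 0.10,
    (False, True): 0.20,
    (False, False): 0.50,
}

def marginal_x(x):
    """p(X=x), obtained by summing the joint over Y."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def marginal_y(y):
    """p(Y=y), obtained by summing the joint over X."""
    return sum(p for (_, yv), p in joint.items() if yv == y)

def cond_x_given_y(x, y):
    """p(X=x | Y=y), i.e. Equation (2) rearranged."""
    return joint[(x, y)] / marginal_y(y)

def cond_y_given_x(y, x):
    """p(Y=y | X=x), i.e. Equation (3) rearranged."""
    return joint[(x, y)] / marginal_x(x)

# Left- and right-hand sides of Equation (5) for X=True, Y=True.
lhs = cond_x_given_y(True, True)
rhs = cond_y_given_x(True, True) * marginal_x(True) / marginal_y(True)
print(lhs, rhs)
```

Both sides are computed from the same joint table, so they agree up to floating-point rounding.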

## Kullback-Leibler Divergence

When comparing two distributions, as we often do in density estimation (the central task of generative models), we need a measure of how much the two distributions differ. The Kullback-Leibler divergence is a commonly used measure for this purpose. It is the expectation of the information difference between the two distributions. But first, what is information?

To understand what information is and to see its definition, consider the following: the higher the probability of an event, the lower its information content. This makes intuitive sense: if someone tells us something 'obvious', i.e. highly probable, i.e. something we and almost everyone else already knew, then that informant has not increased the amount of information we have. Hence the information content of a highly probable event is low. Another way to say this is that information is inversely related to the probability of an event. And since $\log p(x)$ is directly related to $p(x)$, it follows that $-\log p(x)$ is inversely related to $p(x)$, and this is how we model information:

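$$\text{Information content of event } x \text{ w.r.t. } p: \quad I_p(x) = -\log p(x) \tag{6}$$

$$\text{Information content of event } x \text{ w.r.t. } q: \quad I_q(x) = -\log q(x) \tag{7}$$

The difference of information between $p$ and $q$ is therefore:

$$\Delta I = I_p - I_q = -\log p(x) + \log q(x) = \log\left(\frac{q(x)}{p(x)}\right) \tag{8}$$

The Kullback-Leibler divergence is the expectation of the above difference, and is given by,

$$D_{KL}(q(x)\,\|\,p(x)) := E_{\sim q}[\Delta I] = \int (\Delta I)\, q(x)\, dx = \int q(x) \log\left(\frac{q(x)}{p(x)}\right) dx \tag{9}$$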

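Similarly, exchanging the roles of $p$ and $q$, so that the information difference becomes $I_q - I_p = -\Delta I$ and the expectation is taken under $p$,

$$D_{KL}(p(x)\,\|\,q(x)) := E_{\sim p}[-\Delta I] = \int (-\Delta I)\, p(x)\, dx = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx \tag{10}$$

Note that the Kullback-Leibler (KL) divergence is not symmetric, i.e., in general,

$$D_{KL}(q(x)\,\|\,p(x)) \neq D_{KL}(p(x)\,\|\,q(x)) \tag{11}$$

In $D_{KL}(q(x)\,\|\,p(x))$, we are taking the expectation of the information difference with respect to the $q$ distribution, while in $D_{KL}(p(x)\,\|\,q(x))$, we are taking the expectation with respect to the $p$ distribution.

Hence the Kullback-Leibler is called a 'divergence' and not a 'metric', as metrics must be symmetric. A number of symmetrization devices have recently been proposed for KL and have been shown to improve generative fidelity [Pu et al. (2017)] [Chen et al. (2017)] [Arjovsky et al. (2017)].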
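
To make the asymmetry concrete, here is a minimal sketch that evaluates the discrete analogue of the KL divergence (a sum in place of the integral) in both directions. The two distributions `q` and `p` are made up for illustration, and `kl_divergence` is a hypothetical helper name.

```python
import math

def kl_divergence(q, p):
    """Discrete analogue of the KL integral: sum_x q(x) * log(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Two made-up distributions over three outcomes.
q = [0.7, 0.2, 0.1]
p = [0.4, 0.4, 0.2]

kl_qp = kl_divergence(q, p)  # expectation taken under q
kl_pq = kl_divergence(p, q)  # expectation taken under p
print(kl_qp, kl_pq)
```

The two numbers differ because each direction weights the same log-ratio by a different distribution.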

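Note that the KL divergence is always non-negative, i.e.,

$$D_{KL}(q(x)\,\|\,p(x)) = -\int q(x) \log\left(\frac{p(x)}{q(x)}\right) dx \geq 0 \tag{12}$$

To see this, note that as depicted in Figure (1),

$$\log t \leq t - 1 \tag{13}$$

Therefore

$$
\begin{aligned}
-D_{KL}(q(x)\,\|\,p(x)) &= \int q(x) \log\left(\frac{p(x)}{q(x)}\right) dx \\
&\leq \int q(x) \left(\frac{p(x)}{q(x)} - 1\right) dx \\
&= \int q(x)\,\frac{p(x)}{q(x)}\, dx - \int q(x)\, dx \\
&= \int p(x)\, dx - \int q(x)\, dx \\
&= 1 - 1 = 0
\end{aligned}
\tag{14}
$$

We have just shown,

$$-D_{KL}(q(x)\,\|\,p(x)) \leq 0 \tag{15}$$

which implies,

$$D_{KL}(q(x)\,\|\,p(x)) \geq 0 \tag{16}$$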
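
The non-negativity result can also be probed numerically. The sketch below checks the bound $\log t \leq t - 1$ at a few sample points and evaluates the discrete KL divergence on randomly drawn pairs of distributions. The helper names, the seed, and the number of trials are arbitrary choices for illustration; this is a spot check, not a proof.

```python
import math
import random

def kl_divergence(q, p):
    """Discrete analogue of the KL integral: sum_x q(x) * log(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def random_distribution(n, rng):
    """Draw n strictly positive weights and normalize them to sum to 1."""
    weights = [rng.random() + 1e-6 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)

# Equation (13): log t <= t - 1 for t > 0, with equality at t = 1.
ts = [0.01, 0.5, 1.0, 2.0, 10.0]
log_bound_holds = all(math.log(t) <= t - 1 for t in ts)

# Equation (16): KL(q || p) >= 0 for randomly drawn q and p.
kls = [
    kl_divergence(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(1000)
]
min_kl = min(kls)
print(log_bound_holds, min_kl)
```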

## VAE Objective

Consider variational autoencoders [Kingma et al. (2013)]. They have many applications, including finer characterization of disease [Odaibo (2019)]. The encoder portion of a VAE yields an approximate posterior distribution $q(z|x)$, and is parametrized on a neural network by weights collectively denoted $\theta$. Hence we more properly write the encoder as $q_\theta(z|x)$. Similarly, the decoder portion of the VAE yields a likelihood distribution $p(x|z)$, and is parametrized on a neural network by weights collectively denoted $\phi$. Hence we more properly denote the decoder portion of the VAE as $p_\phi(x|z)$. The outputs of the encoder are the parameters of the latent distribution, which is sampled to yield the input to the decoder. A VAE schematic is shown in Figure (2).
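
The data flow just described (encoder outputs distribution parameters, a latent is sampled, the decoder consumes the sample) can be sketched with toy stand-ins for the two networks. Everything below is hypothetical: the affine "networks", the weight values in `theta` and `phi`, the scalar latent, and the choice of a Gaussian latent distribution are assumptions made only to illustrate the flow.

```python
import math
import random

# Hypothetical toy "networks": each is a single affine map standing in for
# the neural networks with weights theta (encoder) and phi (decoder).
theta = {"w_mu": 0.5, "b_mu": 0.0, "w_logvar": -0.2, "b_logvar": -1.0}
phi = {"w": 2.0, "b": 0.1}

def encoder(x, theta):
    """Return the parameters (mean, variance) of q_theta(z | x)."""
    mu = theta["w_mu"] * x + theta["b_mu"]
    log_var = theta["w_logvar"] * x + theta["b_logvar"]
    return mu, math.exp(log_var)  # exp keeps the variance positive

def sample_latent(mu, var, rng):
    """Draw z from a Gaussian with the given mean and variance."""
    return rng.gauss(mu, math.sqrt(var))

def decoder(z, phi):
    """Return the mean of an assumed Gaussian likelihood p_phi(x | z)."""
    return phi["w"] * z + phi["b"]

rng = random.Random(0)
x_i = 1.5
mu_q, var_q = encoder(x_i, theta)   # encoder output: latent parameters
z = sample_latent(mu_q, var_q, rng) # sampled latent fed to the decoder
x_mean = decoder(z, phi)
print(mu_q, var_q, z, x_mean)
```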

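The KL divergence between the approximate and the real posterior distributions is given by,

$$D_{KL}(q_\theta(z|x_i)\,\|\,p(z|x_i)) = -\int q_\theta(z|x_i) \log\left(\frac{p(z|x_i)}{q_\theta(z|x_i)}\right) dz \geq 0 \tag{17}$$

Applying Bayes' theorem to the above equation yields,

$$D_{KL}(q_\theta(z|x_i)\,\|\,p(z|x_i)) = -\int q_\theta(z|x_i) \log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)\,p(x_i)}\right) dz \geq 0 \tag{18}$$

This can be broken down using laws of logarithms, yielding,

$$D_{KL}(q_\theta(z|x_i)\,\|\,p(z|x_i)) = -\int q_\theta(z|x_i) \left[\log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right) - \log p(x_i)\right] dz \geq 0 \tag{19}$$

Distributing the integrand then yields,

$$-\int q_\theta(z|x_i) \log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right) dz + \int q_\theta(z|x_i) \log p(x_i)\, dz \geq 0 \tag{20}$$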

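In the above, we note that $\log p(x_i)$ does not depend on $z$; it is a constant with respect to the integration and can therefore be pulled out of the second integral, yielding,

$$-\int q_\theta(z|x_i) \log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right) dz + \log p(x_i) \int q_\theta(z|x_i)\, dz \geq 0 \tag{21}$$

And since $q_\theta(z|x_i)$ is a probability distribution, it integrates to 1 in the above equation, yielding,

$$-\int q_\theta(z|x_i) \log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right) dz + \log p(x_i) \geq 0. \tag{22}$$

Then carrying the integral over to the other side of the inequality, we get,

$$\log p(x_i) \geq \int q_\theta(z|x_i) \log\left(\frac{p_\phi(x_i|z)\,p(z)}{q_\theta(z|x_i)}\right) dz. \tag{23}$$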

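Applying rules of logarithms, we get,

$$\log p(x_i) \geq \int q_\theta(z|x_i) \left[\log p_\phi(x_i|z) + \log p(z) - \log q_\theta(z|x_i)\right] dz. \tag{24}$$

Recognizing the right hand side of the above inequality as an expectation, we write,

$$\log p(x_i) \geq E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z) + \log p(z) - \log q_\theta(z|x_i)\right] \tag{25}$$

$$\log p(x_i) \geq E_{\sim q_\theta(z|x_i)}\left[\log p(x_i, z) - \log q_\theta(z|x_i)\right] \tag{26}$$

From Equation (23) it also follows that:

$$\log p(x_i) \geq \int q_\theta(z|x_i) \log\left(\frac{p(z)}{q_\theta(z|x_i)}\right) dz + \int q_\theta(z|x_i) \log p_\phi(x_i|z)\, dz \tag{27}$$

$$\log p(x_i) \geq -D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) + E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z)\right] \tag{28}$$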

The right hand side of the above inequality is the Evidence Lower Bound (ELBO), also known as the variational lower bound. It is so termed because it bounds from below the log likelihood of the data, which is the term we seek to maximize. Therefore maximizing the ELBO maximizes the log probability of our data by proxy. This is the core idea of variational inference, since maximizing the log probability directly is typically computationally intractable. The Kullback-Leibler term in the ELBO is a regularizer because it is a constraint on the form of the approximate posterior. The second term is called a reconstruction term because it is a measure of the likelihood of the reconstructed data output at the decoder.
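
The bound can be checked exactly on a model small enough to enumerate. The sketch below uses a made-up binary latent $z$ and a single fixed observation, so the integrals become two-term sums. It computes the ELBO in the forms of Equations (26) and (28) and compares them with the exact $\log p(x_i)$. All numerical values and names are assumptions for illustration only.

```python
import math

# Made-up model with a binary latent z and one fixed observation x_i.
prior = [0.5, 0.5]        # p(z) for z = 0, 1
likelihood = [0.9, 0.2]   # p(x_i | z) for z = 0, 1
q = [0.6, 0.4]            # approximate posterior q(z | x_i), chosen arbitrarily

# Exact evidence p(x_i) and exact posterior p(z | x_i) via Bayes' theorem.
evidence = sum(l * pz for l, pz in zip(likelihood, prior))
posterior = [l * pz / evidence for l, pz in zip(likelihood, prior)]

def kl(a, b):
    """Discrete KL divergence: sum_z a(z) * log(a(z) / b(z))."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# ELBO in the form of Equation (28): -KL(q || prior) + E_q[log p(x_i | z)].
reconstruction = sum(qz * math.log(l) for qz, l in zip(q, likelihood))
elbo = -kl(q, prior) + reconstruction

# ELBO in the form of Equation (26): E_q[log p(x_i, z) - log q(z | x_i)].
elbo_joint = sum(
    qz * (math.log(l * pz) - math.log(qz))
    for qz, l, pz in zip(q, likelihood, prior)
)

# The slack in the bound equals KL(q || true posterior), as in Equation (17).
gap = math.log(evidence) - elbo
print(math.log(evidence), elbo, gap, kl(q, posterior))
```

The gap vanishes only when `q` matches the exact posterior, which is why tightening the bound and improving the approximate posterior go together.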

Notably, we have some liberty to choose some structure for our latent variables. We can obtain a closed form for the KL term of the loss function if we choose a Gaussian representation for the latent prior, $p(z)$, and the approximate posterior, $q_\theta(z|x_i)$. In addition to yielding a closed-form loss function, the Gaussian model enforces a form of regularization in which the approximate posterior has variation or spread (like a Gaussian).

## Closed form VAE Loss: Gaussian Latents

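Say we choose, for a scalar latent variable $z$:

$$p(z) \rightarrow \frac{1}{\sqrt{2\pi\sigma_p^2}} \exp\left(-\frac{(z-\mu_p)^2}{2\sigma_p^2}\right) \tag{29}$$

and

$$q_\theta(z|x_i) \rightarrow \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right), \tag{30}$$

then the KL or regularization term in the ELBO becomes:

$$-D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) = \int \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right) \log\left(\frac{\frac{1}{\sqrt{2\pi\sigma_p^2}} \exp\left(-\frac{(z-\mu_p)^2}{2\sigma_p^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right)}\right) dz \tag{31}$$

Evaluating the term in the logarithm simplifies the above into,

$$\int \frac{1}{\sqrt{2\pi\sigma_q^2}} \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right) \times \left\{-\frac{1}{2}\log(2\pi) - \log(\sigma_p) - \frac{(z-\mu_p)^2}{2\sigma_p^2} + \frac{1}{2}\log(2\pi) + \log(\sigma_q) + \frac{(z-\mu_q)^2}{2\sigma_q^2}\right\} dz. \tag{32}$$

This further simplifies into,

$$\frac{1}{\sqrt{2\pi\sigma_q^2}} \int \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right) \left\{-\log(\sigma_p) - \frac{(z-\mu_p)^2}{2\sigma_p^2} + \log(\sigma_q) + \frac{(z-\mu_q)^2}{2\sigma_q^2}\right\} dz, \tag{33}$$

which further simplifies into,

$$\frac{1}{\sqrt{2\pi\sigma_q^2}} \int \exp\left(-\frac{(z-\mu_q)^2}{2\sigma_q^2}\right) \left\{\log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{(z-\mu_p)^2}{2\sigma_p^2} + \frac{(z-\mu_q)^2}{2\sigma_q^2}\right\} dz. \tag{34}$$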

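Expressing the above as an expectation we get,

$$
\begin{aligned}
-D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) &= E_q\left\{\log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{(z-\mu_p)^2}{2\sigma_p^2} + \frac{(z-\mu_q)^2}{2\sigma_q^2}\right\} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) + E_q\left\{-\frac{(z-\mu_p)^2}{2\sigma_p^2} + \frac{(z-\mu_q)^2}{2\sigma_q^2}\right\} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\left\{(z-\mu_p)^2\right\} + \frac{1}{2\sigma_q^2} E_q\left\{(z-\mu_q)^2\right\}
\end{aligned}
\tag{35}
$$

And since the variance $\sigma_q^2$ is the expectation of the squared distance from the mean, i.e.,

$$\sigma_q^2 = E_q\left\{(z-\mu_q)^2\right\}, \tag{36}$$

it follows that,

$$
\begin{aligned}
-D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) &= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\left\{(z-\mu_p)^2\right\} + \frac{1}{2} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\left\{(z-\mu_q+\mu_q-\mu_p)^2\right\} + \frac{1}{2} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\Big\{\big(\underbrace{z-\mu_q}_{a} + \underbrace{\mu_q-\mu_p}_{b}\big)^2\Big\} + \frac{1}{2}
\end{aligned}
\tag{37}
$$

Recall that,

$$(a+b)^2 = a^2 + 2ab + b^2, \tag{38}$$

therefore,

$$
\begin{aligned}
-D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) &= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\Big\{\big(\underbrace{z-\mu_q}_{a} + \underbrace{\mu_q-\mu_p}_{b}\big)^2\Big\} + \frac{1}{2} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} E_q\left\{(z-\mu_q)^2 + 2(z-\mu_q)(\mu_q-\mu_p) + (\mu_q-\mu_p)^2\right\} + \frac{1}{2} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{1}{2\sigma_p^2} \left[E_q\left\{(z-\mu_q)^2\right\} + 2E_q\left\{(z-\mu_q)(\mu_q-\mu_p)\right\} + E_q\left\{(\mu_q-\mu_p)^2\right\}\right] + \frac{1}{2} \\
&= \log\left(\frac{\sigma_q}{\sigma_p}\right) - \frac{\sigma_q^2 + (\mu_q-\mu_p)^2}{2\sigma_p^2} + \frac{1}{2}
\end{aligned}
\tag{39}
$$

In the last step, the first expectation is $\sigma_q^2$ by Equation (36), the cross term vanishes because $E_q\{z-\mu_q\} = 0$, and $(\mu_q-\mu_p)^2$ is a constant.

And when we take $\sigma_p = 1$ and $\mu_p = 0$, we get,

$$
\begin{aligned}
-D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) &= \log(\sigma_q) - \frac{\sigma_q^2 + \mu_q^2}{2} + \frac{1}{2} \\
&= \frac{1}{2}\log(\sigma_q^2) - \frac{\sigma_q^2 + \mu_q^2}{2} + \frac{1}{2} \\
&= \frac{1}{2}\left[1 + \log(\sigma_q^2) - \sigma_q^2 - \mu_q^2\right]
\end{aligned}
\tag{40}
$$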
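
As a check on the algebra, the closed form can be compared against a direct numerical evaluation of the integral in Equation (31). The sketch below uses a plain trapezoidal rule over a wide interval around $\mu_q$. The function names, the test values of $\mu_q$ and $\sigma_q$, and the grid settings are arbitrary assumptions for illustration.

```python
import math

def gaussian_pdf(z, mu, sigma):
    """Density of a univariate Gaussian with mean mu and std sigma."""
    return math.exp(-((z - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_log_pdf(z, mu, sigma):
    """Log-density, used to form log(p/q) without underflow in the tails."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def neg_kl_closed_form(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """Last line of Equation (39); equals Equation (40) when mu_p=0, sigma_p=1."""
    return (math.log(sigma_q / sigma_p)
            - (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            + 0.5)

def neg_kl_numeric(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0, n=20001, width=12.0):
    """Trapezoidal approximation of the integral in Equation (31)."""
    lo = mu_q - width * sigma_q
    hi = mu_q + width * sigma_q
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        z = lo + k * h
        log_ratio = gaussian_log_pdf(z, mu_p, sigma_p) - gaussian_log_pdf(z, mu_q, sigma_q)
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * gaussian_pdf(z, mu_q, sigma_q) * log_ratio * h
    return total

mu_q, sigma_q = 0.8, 0.6
closed = neg_kl_closed_form(mu_q, sigma_q)
numeric = neg_kl_numeric(mu_q, sigma_q)
eq40 = 0.5 * (1 + math.log(sigma_q ** 2) - sigma_q ** 2 - mu_q ** 2)
print(closed, numeric, eq40)
```

The three values should agree closely, and all are non-positive, consistent with $D_{KL} \geq 0$.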

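Recall the ELBO, Equation (28),

$$\log p(x_i) \geq -D_{KL}(q_\theta(z|x_i)\,\|\,p(z)) + E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z)\right]$$

From which it follows that the contribution from a given datum $x_i$ and a single stochastic draw towards the objective to be maximized is,

$$\frac{1}{2}\left[1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\right] + E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z)\right] \tag{41}$$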

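where $\sigma_j$ and $\mu_j$ are parameters of the approximate distribution, $q$, and $j$ is an index into the latent vector $z$. For a batch, the objective function is therefore given by,

$$\mathcal{G} = \sum_{j=1}^{J} \frac{1}{2}\left[1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\right] + \frac{1}{L}\sum_{l} E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z^{(i,l)})\right] \tag{42}$$

where $J$ is the dimension of the latent vector $z$, and $L$ is the number of samples stochastically drawn according to the re-parametrization trick.

Because the objective function we obtain in Equation (42) is to be maximized during training, we can think of it as a 'gain' function as opposed to a loss function. To obtain the loss function, we simply take the negative of $\mathcal{G}$:

$$\mathcal{L} = -\sum_{j=1}^{J} \frac{1}{2}\left[1 + \log(\sigma_j^2) - \sigma_j^2 - \mu_j^2\right] - \frac{1}{L}\sum_{l} E_{\sim q_\theta(z|x_i)}\left[\log p_\phi(x_i|z^{(i,l)})\right] \tag{43}$$

Therefore to train the VAE is to seek the optimal network parameters $(\theta^*, \phi^*)$ that minimize $\mathcal{L}$:

$$(\theta^*, \phi^*) = \underset{(\theta,\phi)}{\operatorname{argmin}}\; \mathcal{L}(\theta,\phi) \tag{44}$$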
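
A minimal sketch of how the loss might be evaluated for one datum follows. Several choices here are assumptions, not part of the derivation: the encoder is taken to output the log-variance $\log \sigma_j^2$ (a common convention), the reconstruction expectation is estimated by averaging the decoder log-likelihood over `num_samples` reparametrized draws, and `toy_log_likelihood` is a made-up unit-variance Gaussian decoder. No training or gradient computation is shown.

```python
import math
import random

def kl_term(mu, log_var):
    """Sum over latent dimensions j of 0.5 * [1 + log(s_j^2) - s_j^2 - mu_j^2]."""
    return sum(0.5 * (1.0 + lv - math.exp(lv) - m ** 2) for m, lv in zip(mu, log_var))

def reparametrize(mu, log_var, rng):
    """Re-parametrization trick: z_j = mu_j + sigma_j * eps_j, eps_j ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def vae_loss(x, mu, log_var, log_likelihood, num_samples, rng):
    """Negative of the KL-plus-reconstruction objective for one datum x."""
    recon = 0.0
    for _ in range(num_samples):
        z = reparametrize(mu, log_var, rng)
        recon += log_likelihood(x, z)
    recon /= num_samples  # Monte Carlo estimate of E_q[log p(x | z)]
    return -kl_term(mu, log_var) - recon

def toy_log_likelihood(x, z):
    """Hypothetical decoder: unit-variance Gaussian centered at sum(z) in every coordinate."""
    center = sum(z)
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xd - center) ** 2 for xd in x)

rng = random.Random(0)
x_i = [0.3, -0.1, 0.4]    # made-up datum
mu = [0.1, -0.2]          # made-up encoder means, J = 2
log_var = [-0.5, 0.2]     # made-up encoder log-variances
loss = vae_loss(x_i, mu, log_var, toy_log_likelihood, num_samples=5, rng=rng)
print(loss)
```

In practice the loss would be averaged over a batch of data and minimized with respect to the encoder and decoder weights by a gradient-based optimizer in an automatic differentiation framework.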

## Conclusion

We have given a step-by-step derivation of the VAE loss function. We illustrated the essence of variational inference along the way, and derived the closed-form loss in the special case of Gaussian latents.

## Acknowledgement

The author thanks Larry Carin for helpful discussion on the consequences of Kullback-Leibler divergence asymmetry, and on KL symmetrization approaches.