1 Variational Auto-Encoders and Normalizing Flows
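Let $\mathbf{x}$ be a vector of $D$ observable variables, $\mathbf{z}$ a vector of stochastic latent variables, and let $p(\mathbf{x}, \mathbf{z})$ be a parametric model of the joint distribution. Given data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, we typically aim at maximizing the marginal log-likelihood, $\ln p(\mathbf{X}) = \sum_{i=1}^{N} \ln p(\mathbf{x}_i)$, with respect to the parameters. However, when the model is parameterized by a neural network (NN), the optimization can be difficult because the marginal likelihood is intractable. One way of overcoming this issue is to apply variational inference and optimize the following lower bound:
$$\ln p(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[\ln p(\mathbf{x}|\mathbf{z})\big] - \mathrm{KL}\big(q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big), \tag{1}$$
where $q(\mathbf{z}|\mathbf{x})$ is the inference model (an encoder), $p(\mathbf{x}|\mathbf{z})$ is called a decoder, and $p(\mathbf{z})$ is the prior. There are various ways of optimizing this lower bound, but for continuous $\mathbf{z}$ this can be done efficiently through a reparameterization of $q(\mathbf{z}|\mathbf{x})$ [KW:13], [RMW:14], which yields a variational auto-encoder (VAE) architecture.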
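Typically, a diagonal covariance matrix of the encoder is assumed, i.e., $q(\mathbf{z}|\mathbf{x}) = \mathcal{N}\big(\mathbf{z} \,|\, \boldsymbol{\mu}(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2(\mathbf{x}))\big)$, where $\boldsymbol{\mu}(\mathbf{x})$ and $\boldsymbol{\sigma}^2(\mathbf{x})$ are parameterized by the NN. However, this assumption can be too restrictive to match the true posterior.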
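The reparameterization and the KL term of the lower bound can be illustrated with a minimal NumPy sketch. It assumes a standard normal prior and an encoder that outputs the mean and the log-variance; neither choice is stated in the text above, and the function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims.

    Assumption: the prior p(z) is a standard normal.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((4, 3))        # stand-in encoder means (batch of 4, 3 latents)
log_var = np.zeros((4, 3))   # stand-in encoder log-variances
z = reparameterize(mu, log_var, rng)
kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
```

Because the noise `eps` is sampled independently of the encoder outputs, gradients can flow through `mu` and `log_var` when the same computation is written in an automatic-differentiation framework.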
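One way of enriching the variational posterior is to apply a normalizing flow [TT:13], [TV:10]. A (finite) normalizing flow is a powerful framework for building a flexible posterior distribution by starting with an initial random variable $\mathbf{z}^{(0)}$ with a simple distribution and then applying a series of invertible transformations $\mathbf{f}^{(t)}$, for $t = 1, \ldots, T$. As a result, the last iteration gives a random variable $\mathbf{z}^{(T)}$ that has a more flexible distribution. Once we choose transformations $\mathbf{f}^{(t)}$ for which the Jacobian-determinant can be computed, we aim at optimizing the following lower bound [RM:15]:
$$\ln p(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z}^{(0)}|\mathbf{x})}\Big[\ln p(\mathbf{x}|\mathbf{z}^{(T)}) + \sum_{t=1}^{T} \ln \Big|\det \frac{\partial \mathbf{f}^{(t)}}{\partial \mathbf{z}^{(t-1)}}\Big|\Big] - \mathrm{KL}\big(q(\mathbf{z}^{(0)}|\mathbf{x}) \,\|\, p(\mathbf{z}^{(T)})\big). \tag{2}$$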
How the Jacobian-determinant is handled determines whether we deal with general normalizing flows or volume-preserving flows. General normalizing flows aim at formulating a flow for which the Jacobian-determinant is relatively easy to compute. In contrast, volume-preserving flows design a series of transformations such that the Jacobian-determinant equals $1$ while still yielding flexible posterior distributions.
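The difference between the two families can be seen numerically for linear maps. The sketch below is illustrative only: the matrices are hypothetical, and it compares the log-absolute-determinant of a general linear transformation with that of a lower-triangular matrix with ones on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4  # hypothetical latent dimensionality

# Lower-triangular matrix with ones on the diagonal: the determinant is the
# product of the diagonal entries, hence exactly 1 (volume-preserving).
L_unit = np.tril(rng.standard_normal((M, M)), k=-1) + np.eye(M)
_, logabsdet_unit = np.linalg.slogdet(L_unit)

# General linear map: rescaling the columns changes the volume, so the
# log-determinant term has to be computed and added to the lower bound.
scales = np.array([2.0, 0.5, 3.0, 1.0])
A = L_unit @ np.diag(scales)          # det(A) = prod(scales) = 3
_, logabsdet_general = np.linalg.slogdet(A)
```

For the unit lower-triangular map the Jacobian term in the bound vanishes, whereas for `A` it equals the logarithm of the product of the scales.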
In this paper, we propose a new volume-preserving flow and show that it performs similarly to the linear general normalizing flow.
2 New Volume-Preserving Flow
In general, we can obtain a more flexible variational posterior if we model a full-covariance matrix using a linear transformation, namely, $\mathbf{z}^{(1)} = \mathbf{L}\mathbf{z}^{(0)}$. However, in order to take advantage of the volume-preserving flow, the Jacobian-determinant of $\mathbf{L}$ must be $1$. This can be accomplished in different ways, e.g., $\mathbf{L}$ is an orthogonal matrix or it is a lower-triangular matrix with ones on the diagonal. The former idea was employed by the Householder flow (HF) [TW:16] and the latter by the linear Inverse Autoregressive Flow (LinIAF) [KSJCSW:16]. In both cases, the encoder outputs an additional set of variables that are further used to calculate $\mathbf{L}$. In the case of the LinIAF, the lower-triangular matrix with ones on the diagonal is given by the NN explicitly.
However, in the LinIAF a single matrix $\mathbf{L}$ may not fully represent variations in the data. In order to alleviate this issue we propose to consider $K$ such matrices, $\{\mathbf{L}_1(\mathbf{x}), \ldots, \mathbf{L}_K(\mathbf{x})\}$. Further, to obtain a volume-preserving flow, we propose to use a convex combination of these matrices, $\sum_{k=1}^{K} y_k(\mathbf{x})\,\mathbf{L}_k(\mathbf{x})$, where $\mathbf{y}(\mathbf{x}) = [y_1(\mathbf{x}), \ldots, y_K(\mathbf{x})]^{\top}$ is calculated using the softmax function, namely, $\mathbf{y}(\mathbf{x}) = \mathrm{softmax}(\mathrm{NN}(\mathbf{x}))$, where $\mathrm{NN}(\mathbf{x})$ is the neural network used in the encoder.
Eventually, we have the following linear transformation with the convex combination of the lower-triangular matrices with ones on the diagonal:
$$\mathbf{z}^{(1)} = \Big(\sum_{k=1}^{K} y_k(\mathbf{x})\,\mathbf{L}_k(\mathbf{x})\Big)\mathbf{z}^{(0)}. \tag{3}$$
A convex combination of lower-triangular matrices with ones on the diagonal is again a lower-triangular matrix with ones on the diagonal, because each diagonal entry equals $\sum_{k} y_k(\mathbf{x}) \cdot 1 = 1$; thus, $\big|\det\big(\sum_{k=1}^{K} y_k(\mathbf{x})\,\mathbf{L}_k(\mathbf{x})\big)\big| = 1$. This formulates the volume-preserving flow we refer to as the convex combination linear IAF (ccLinIAF).
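The transformation in (3) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for the encoder outputs (the free lower-triangular entries and the softmax logits), and all function names are hypothetical.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over the last axis."""
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

def unit_lower_triangular(free_params, M):
    """Place M*(M-1)/2 free parameters below the diagonal, ones on it."""
    L = np.eye(M)
    L[np.tril_indices(M, k=-1)] = free_params
    return L

def cc_lin_iaf(z0, tri_params, logits):
    """Return z1 = (sum_k y_k L_k) z0 and the mixed matrix, y = softmax(logits)."""
    K, M = tri_params.shape[0], z0.shape[0]
    y = softmax(logits)
    Ls = np.stack([unit_lower_triangular(tri_params[k], M) for k in range(K)])
    L_mix = np.tensordot(y, Ls, axes=1)  # convex combination, shape (M, M)
    return L_mix @ z0, L_mix

rng = np.random.default_rng(0)
M, K = 4, 5                                           # hypothetical sizes
z0 = rng.standard_normal(M)                           # sample from q(z0 | x)
tri_params = rng.standard_normal((K, M * (M - 1) // 2))  # stand-in encoder output
logits = rng.standard_normal(K)                       # stand-in encoder output
z1, L_mix = cc_lin_iaf(z0, tri_params, logits)
```

Since the softmax weights are non-negative and sum to one, the diagonal of `L_mix` stays at one and the determinant is 1, so no Jacobian term needs to be added to the lower bound.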
3 Experiments
Datasets
In the experiments we use two datasets: the MNIST dataset [MNIST] (we used the static binary version, as in [LM:11]) and the Histopathology dataset [TW:16]. The first dataset contains images of handwritten digits (50,000 training images, 10,000 validation images and 10,000 test images), and the second contains gray-scale image patches of histopathology scans (6,800 training images, 2,000 validation images and 2,000 test images). For both datasets we used a separate validation set for hyperparameter tuning.
Setup
In both experiments we trained the VAE with  stochastic hidden units, and the encoder and the decoder were parameterized with two-layered neural networks ( hidden units per layer) and the gate activation function [DG:15], [DFAG:16], [OKEVGK:16], [TW:16]. The number of combined matrices was determined using the validation set, and taking more than  matrices resulted in no performance improvement. For training we utilized ADAM [KB:14] with the mini-batch size equal to  and one example for estimating the expected value. The learning rate was set according to the validation set. The maximum number of epochs was  and early stopping with a look-ahead of  epochs was applied. We used warm-up [BVVDJB:15], [SRMSW:16] for the first  epochs. We initialized weights according to [GB:10].
We compared our approach to the linear normalizing flow (VAE+NF) [RM:15] and to finite volume-preserving flows: NICE (VAE+NICE) [DKB:14], HVI (VAE+HVI) [SKW:15], HF (VAE+HF) [TW:16], and linear IAF (VAE+LinIAF) [KSJCSW:16] on the MNIST data, and to VAE+HF on the Histopathology data. The methods were compared according to the lower bound of the marginal log-likelihood measured on the test set.
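The warm-up and early-stopping rules mentioned in the setup can be sketched as below. This is a hedged illustration: a linear warm-up schedule for the weight of the KL term is assumed, the function names are hypothetical, and the parameter values in the usage lines are placeholders rather than the settings used in the experiments.

```python
def warmup_beta(epoch, warmup_epochs):
    """Weight of the KL term: grows linearly from 0 to 1 (assumed schedule)."""
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, epoch / warmup_epochs)

def early_stopping_epoch(val_losses, lookahead):
    """Return the epoch index at which training stops.

    Training stops once the validation loss has not improved for
    `lookahead` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= lookahead:
            return epoch
    return len(val_losses) - 1

# Placeholder values, chosen only to exercise the functions.
beta_mid = warmup_beta(5, 10)
stop_at = early_stopping_epoch([3.0, 2.0, 1.0, 1.5, 1.2, 1.1, 1.3], lookahead=3)
```

During warm-up the training objective would be the reconstruction term minus `warmup_beta(epoch, warmup_epochs)` times the KL term, so the encoder is initially penalized less for deviating from the prior.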
Table 1: MNIST.
Method  Lower bound of $\ln p(\mathbf{x})$ (test set)
VAE  
VAE+NF (T=10)  
VAE+NF (T=80)  
VAE+NICE (T=10)  
VAE+NICE (T=80)  
VAE+HVI (T=1)  
VAE+HVI (T=8)  
VAE+HF (T=1)  
VAE+HF (T=10)  
VAE+LinIAF  
VAE+ccLinIAF (K=5)  
Table 2: Histopathology.
Method  Lower bound of $\ln p(\mathbf{x})$ (test set)
VAE  
VAE+HF (T=1)  
VAE+HF (T=10)  
VAE+HF (T=20)  
VAE+LinIAF  
VAE+ccLinIAF (K=5)  
Discussion
The results presented in Tables 1 and 2 for the MNIST and Histopathology data, respectively, reveal that the proposed flow outperforms all volume-preserving flows and performs similarly to the linear normalizing flow with a large number of transformations. The advantage of using several matrices instead of one is especially apparent on the Histopathology data, where the VAE+ccLinIAF performed better by about  nats than the VAE+LinIAF. Hence, the convex combination of the lower-triangular matrices with ones on the diagonal seems to reflect the data better at a small additional computational cost.
Implementation
The code for the proposed approach can be found at: https://github.com/jmtomczak/vae_vpflows.
Acknowledgments
The research conducted by Jakub M. Tomczak was funded by the European Commission within the Marie Skłodowska-Curie Individual Fellowship (Grant No. 702666, “Deep learning and Bayesian inference for medical imaging”).
References
[BVVDJB:15] Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2015). Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
[DFAG:16] Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2016). Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.
[DG:15] Dauphin, Y. N., & Grangier, D. (2015). Predicting distributions with linearizing belief networks. arXiv preprint arXiv:1511.05622.
[DKB:14] Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
[GB:10] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS (pp. 249–256).
[KB:14] Kingma, D., & Ba, J. (2014). ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[KSJCSW:16] Kingma, D. P., Salimans, T., Józefowicz, R., Chen, X., Sutskever, I., & Welling, M. (2016). Improving variational inference with inverse autoregressive flow. NIPS.
[KW:13] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[LM:11] Larochelle, H., & Murray, I. (2011). The neural autoregressive distribution estimator. AISTATS (p. 2).
[MNIST] LeCun, Y., Cortes, C., & Burges, C. J. (1998). The MNIST database of handwritten digits.
[RM:15] Rezende, D., & Mohamed, S. (2015). Variational inference with normalizing flows. ICML (pp. 1530–1538).
[RMW:14] Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
[SKW:15] Salimans, T., Kingma, D. P., & Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. ICML (pp. 1218–1226).
[SRMSW:16] Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). Ladder variational autoencoders. arXiv preprint arXiv:1602.02282.
[TT:13] Tabak, E., & Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66, 145–164.
[TV:10] Tabak, E. G., & Vanden-Eijnden, E. (2010). Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8, 217–233.
[TW:16] Tomczak, J. M., & Welling, M. (2016). Improving Variational Auto-Encoders using Householder Flow. arXiv preprint arXiv:1611.09630.
[OKEVGK:16] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems (pp. 4790–4798).