A VAE is a deep generative model trained with variational inference. A generative model is an unsupervised learning approach that learns a domain by processing a large amount of data from it and then generates new data like it (Hinton and Ghahramani, 1997; Yu et al., 2018). VAEs, together with Generative Adversarial Networks (Goodfellow et al., 2016) and Deep Autoregressive Networks (Gregor et al., 2013), are amongst the most powerful and popular generative modeling techniques. The VAE has been successfully applied in many domains, such as image processing (Pu et al., 2016), text generation (Semeniuta et al., 2017), and cybersecurity (Chandy et al., 2019).
A VAE works by maximizing a variational lower bound on the likelihood of the data (Kingma and Welling, 2013). A VAE has two halves: a recognition model (an encoder) and a generative model (a decoder). The recognition model learns a latent representation of the input data, and the generative model learns to transform this representation back into the original data. The two models are jointly trained by optimizing the probability of the input data using stochastic gradient ascent.
Application of the VAE involves selection of an approximate posterior distribution for the latent variables. This decision determines the flexibility and tractability of the VAE, and hence the quality and efficiency of the inference made, and poses a core challenge in variational inference. Conventionally, the choice is the normal distribution with a diagonal covariance matrix. This pick helps with computation efficiency but limits the flexibility to match the true posterior. We introduce a new transformation, DT, which approximates the posterior as a normal distribution with full covariance. DT offers theoretical advantages of model flexibility, parallelizability, scalability, and efficiency, which together provide a clear improvement in VAE for its wider adoption for statistical inference in the presence of large, complex datasets.
2 Variational Autoencoder
Let $x$ be a (set of) observed variables and $z$ a (set of) continuous, stochastic latent variables that represent their encoding. The datapoints $x$ are generated by a random process that involves the unobserved random variables $z$. The encoder network with parameters $\phi$ encodes the given dataset with an approximate posterior distribution $q_\phi(z|x)$ defined over the latent variables, while the decoder network with parameters $\theta$ decodes $z$ into $x$ with probability $p_\theta(x|z)$. The encoder tries to approximate the true but intractable posterior $p_\theta(z|x)$. By assuming a standard normal prior $p(z) = \mathcal{N}(0, I)$ for the decoder and given a dataset $X$, we can optimize the network parameters by maximizing the log-probability of the data, i.e., to maximize

$$\log p_\theta(X) = \sum_{x \in X} \log p_\theta(x),$$
where, given our approximation $q_\phi(z|x)$ to the true posterior distribution, for each datapoint $x$ we can write

$$\log p_\theta(x) = D_{KL}\!\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right) + \mathcal{L}(\theta, \phi; x).$$

The RHS term is denoted as $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]$. Because the KL divergence is always non-negative, $\mathcal{L}(\theta, \phi; x) \le \log p_\theta(x)$, and $\mathcal{L}$ is the (variational) lower bound on the marginal likelihood of datapoint $x$.
Therefore, maximizing the lower bound will simultaneously increase the probability of the data and reduce divergence from the true posterior. Thus, we would like to maximize it w.r.t. the encoder and decoder parameters, $\phi$ and $\theta$, respectively.
2.2 Need for model flexibility
The encoder and decoder in a VAE are conventionally modeled using the normal distribution with a diagonal covariance matrix, i.e., $q_\phi(z|x) = \mathcal{N}\!\left(z; \mu, \operatorname{diag}(\sigma^2)\right)$, where $\mu$ and $\sigma$ are commonly nonlinear functions parametrized by neural networks. This practice is mainly driven by the requirement of computational tractability. It, however, limits the flexibility of the model, especially in the case of the encoder, which will not be able to learn the true posterior distribution.
3 Dyadic Transformation
Theoretically, the approximate model will be significantly more flexible if it is modeled as a multivariate normal distribution with a full covariance matrix.
A linear transformation matrix $B$ of size $n \times n$ applied to an $n$-dimensional normal distribution produces another normal distribution. Thus, although $Y \sim \mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right)$ is a normal distribution with diagonal covariance, its transformation through $B$ results in a multivariate normal distribution with full covariance:

$$G = BY \sim \mathcal{N}\!\left(B\mu,\; B \operatorname{diag}(\sigma^2) B^\top\right).$$

This transformation matrix $B$ introduces $n^2$ new parameters. In order to utilize this transformation in our generative model, we need to compute the log-probability and KL divergence of $G$. These computations do not scale well with the size of $B$.
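As a quick numerical illustration (a minimal numpy sketch with arbitrary toy dimensions, not part of the paper's experiments), applying a dense matrix to a diagonal-covariance normal yields a covariance with nonzero off-diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Diagonal-covariance base distribution Y ~ N(mu, diag(sigma^2)).
mu = rng.normal(size=n)
sigma = rng.uniform(0.5, 1.5, size=n)

# A dense linear map B turns it into G = B Y ~ N(B mu, B diag(sigma^2) B^T).
B = rng.normal(size=(n, n))
cov_G = B @ np.diag(sigma**2) @ B.T

# The transformed covariance is generally full, not diagonal:
# correlations between latent dimensions appear.
off_diag = cov_G - np.diag(np.diag(cov_G))
print(np.abs(off_diag).max())
```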
To overcome this issue, we define the transformation matrix $B$ as

$$B = \gamma I + UV,$$

where $I$ is an $n \times n$ identity matrix, $\gamma$ is a scalar parameter, $U$ is an $n \times k$ matrix, and $V$ is a $k \times n$ matrix. Here $k$ is a model hyper-parameter that can be adjusted to set the trade-off between model flexibility and computational efficiency.
In what follows, we show that this affine transformation provides the desired additional flexibility without introducing much additional computational complexity, and thus scales well with $n$.
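Constructing $B$ in this form can be sketched as follows (a toy numpy illustration; the dimensions and scales are arbitrary choices, not the paper's settings). The point is the parameter count: $2nk + 1$ instead of $n^2$ for a dense matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 5  # latent dimension and dyadic rank (illustrative sizes)

# Dyadic transformation B = gamma * I + U @ V, with U (n x k) and V (k x n).
gamma = 0.1
U = rng.normal(scale=0.01, size=(n, k))
V = rng.normal(scale=0.01, size=(k, n))
B = gamma * np.eye(n) + U @ V

# Free parameters: 2*n*k + 1 for the dyadic form vs n^2 for a dense matrix.
print(2 * n * k + 1, "vs", n * n)
```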
3.2 Efficient calculation of matrix determinant and inverse
Computing the log-probability and KL Divergence of the generative model involves the calculation of the determinant and inverse of the dyadic transformation matrix. We show that these operations can be efficiently computed with the help of the following theorems:
Theorem 1 (Sherman-Morrison-Woodbury). Given four matrices $A$, $U$, $C$, and $V$ of conformable sizes,

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1},$$

provided that the matrices $A$, $C$, and $C^{-1} + V A^{-1} U$ are invertible (Woodbury, 1950). With the help of this theorem, we can efficiently calculate the inverse of the Dyadic Transformation matrix $B$.
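The theorem applied to $B = \gamma I + UV$ (taking $A = \gamma I$ and $C = I_k$) reduces the $n \times n$ inversion to a $k \times k$ solve. A small numpy check of this identity (toy sizes, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 4
gamma = 0.5
U = rng.normal(size=(n, k))
V = rng.normal(size=(k, n))
B = gamma * np.eye(n) + U @ V

# Woodbury with A = gamma*I, C = I_k:
# (gamma*I + U V)^-1 = I/gamma - (1/gamma^2) U (I_k + V U / gamma)^-1 V.
# Only a k x k linear system is solved instead of an n x n one.
small = np.eye(k) + (V @ U) / gamma
B_inv = np.eye(n) / gamma - (U @ np.linalg.solve(small, V)) / gamma**2

print(np.allclose(B_inv, np.linalg.inv(B)))
```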
Theorem 2 (Sylvester's Determinant Identity). Given two matrices $U$ and $V$ of sizes $n \times k$ and $k \times n$,

$$\det(I_n + UV) = \det(I_k + VU),$$

where $I_n$ and $I_k$ are identity matrices of orders $n$ and $k$, respectively (Sylvester, 1851). This theorem relates the determinant of an $n \times n$ matrix to the determinant of a $k \times k$ matrix, which is very useful in regimes where $k \ll n$. We use this property to make the determinant calculations for $B$ computationally tractable.
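A numerical check of the identity (a numpy sketch with illustrative sizes): a $k \times k$ determinant replaces an $n \times n$ one:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 300, 3
U = rng.normal(scale=0.1, size=(n, k))
V = rng.normal(scale=0.1, size=(k, n))

# Sylvester: det(I_n + U V) = det(I_k + V U).
# Here a 3x3 determinant replaces a 300x300 one.
lhs = np.linalg.det(np.eye(n) + U @ V)
rhs = np.linalg.det(np.eye(k) + V @ U)
print(np.isclose(lhs, rhs, rtol=1e-5))
```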
3.3 KL divergence between two normal distributions
Using the above theorems we show that the KL divergence for a multivariate normal distribution obtained using Dyadic Transformation can be efficiently computed.
The KL divergence between the independent normal posterior $\mathcal{N}\!\left(\mu, \operatorname{diag}(\sigma^2)\right)$ and the standard normal prior $\mathcal{N}(0, I)$ can be written as (Kingma and Welling, 2013)

$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{n} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right),$$

where $n$ is the dimensionality of $z$. We can show that in general the KL divergence between two normal distributions, with means $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$, is (Duchi, 2007):

$$D_{KL} = \frac{1}{2}\left(\log \frac{\det \Sigma_2}{\det \Sigma_1} - n + \operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1)\right).$$

Given that in our case $\mu_1 = B\mu$, $\Sigma_1 = B \operatorname{diag}(\sigma^2) B^\top$, $\mu_2 = 0$, and $\Sigma_2 = I$, we can write

$$D_{KL} = \frac{1}{2}\left(-\log \det \Sigma_1 - n + \operatorname{tr}(\Sigma_1) + \mu_1^\top \mu_1\right).$$
We observe that the calculation of the KL divergence also involves the calculation of $\det \Sigma_1 = (\det B)^2 \prod_j \sigma_j^2$. This is performed efficiently using Sylvester's determinant identity.
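The full-covariance KL term can thus be evaluated without ever forming or factorizing an $n \times n$ determinant. A numpy sketch checking the Sylvester-based evaluation against the dense formula (toy dimensions; $\gamma = 1$ here is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 3
gamma = 1.0
U = rng.normal(scale=0.1, size=(n, k))
V = rng.normal(scale=0.1, size=(k, n))
B = gamma * np.eye(n) + U @ V

mu = rng.normal(scale=0.1, size=n)
sigma = rng.uniform(0.8, 1.2, size=n)
Sigma1 = B @ np.diag(sigma**2) @ B.T

# KL( N(B mu, B diag(sigma^2) B^T) || N(0, I) )
#   = 0.5 * ( tr(Sigma1) + ||B mu||^2 - n - log det Sigma1 ).
m = B @ mu
# log det Sigma1 = 2 log|det B| + sum_j log sigma_j^2, with det B via Sylvester:
# det(gamma I + U V) = gamma^n * det(I_k + V U / gamma).
log_det_B = n * np.log(gamma) + np.log(abs(np.linalg.det(np.eye(k) + V @ U / gamma)))
log_det_S1 = 2 * log_det_B + np.sum(np.log(sigma**2))
kl = 0.5 * (np.trace(Sigma1) + m @ m - n - log_det_S1)

# Cross-check against the direct dense-determinant formula.
kl_dense = 0.5 * (np.trace(Sigma1) + m @ m - n - np.log(np.linalg.det(Sigma1)))
print(np.isclose(kl, kl_dense, atol=1e-8))
```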
3.4 Calculation of the gradient of matrix determinant and inverse
Given a matrix $D$, the derivatives of the inverse and determinant of $D$ w.r.t. a variable $t$ can be calculated as

$$\frac{\partial D^{-1}}{\partial t} = -D^{-1} \frac{\partial D}{\partial t} D^{-1}, \qquad \frac{\partial \det D}{\partial t} = \det(D) \operatorname{tr}\!\left(D^{-1} \frac{\partial D}{\partial t}\right).$$

We make a key observation from the two derivative equations above: given that the determinant and inverse of a matrix are finite, their gradients will also be finite. Calculation of either derivative thus need not lead to numerical instability even if the matrix is initialized randomly.
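These two identities can be verified numerically against central finite differences (a numpy sketch with an arbitrary small matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
D0 = np.eye(n) + 0.1 * rng.normal(size=(n, n))  # well-conditioned test matrix
E = rng.normal(size=(n, n))                      # perturbation direction dD/dt
D = lambda t: D0 + t * E

# Analytic identities:
#   d det(D)/dt = det(D) * tr(D^-1 dD/dt)
#   d D^-1 /dt  = -D^-1 (dD/dt) D^-1
Dinv = np.linalg.inv(D0)
grad_det = np.linalg.det(D0) * np.trace(Dinv @ E)
grad_inv = -Dinv @ E @ Dinv

# Central finite-difference approximations at t = 0.
h = 1e-6
fd_det = (np.linalg.det(D(h)) - np.linalg.det(D(-h))) / (2 * h)
fd_inv = (np.linalg.inv(D(h)) - np.linalg.inv(D(-h))) / (2 * h)
print(np.isclose(grad_det, fd_det, atol=1e-5), np.allclose(grad_inv, fd_inv, atol=1e-4))
```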
Also, from the two derivative equations above, we can show that for the Dyadic Transformation matrix $B$, if the value of $\gamma$ is small enough, the determinant and inverse of $B$ remain finite. This observation was crucial for making the numerical computations stable.
Pseudo code for VAE with Dyadic Transformation
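The following is a minimal numpy sketch of a single ELBO evaluation for a VAE encoder equipped with the Dyadic Transformation, not the paper's actual pseudo code or implementation. The random vectors stand in for the outputs and weights of real encoder/decoder networks, the Bernoulli decoder is an assumption consistent with binarized data, and the dimensions are toy sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
n_z, k, n_x = 8, 2, 20  # latent dim, dyadic rank, data dim (toy sizes)

# Stand-ins for an encoder network's outputs for one datapoint.
mu = rng.normal(scale=0.1, size=n_z)
log_sigma = rng.normal(scale=0.1, size=n_z)
sigma = np.exp(log_sigma)

# Dyadic transformation parameters (gamma = 0.001 as in the experiments).
gamma = 0.001
U = rng.normal(scale=0.01, size=(n_z, k))
V = rng.normal(scale=0.01, size=(k, n_z))
B = gamma * np.eye(n_z) + U @ V

# Reparameterized sample: z = B(mu + sigma*eps) ~ N(B mu, B diag(sigma^2) B^T).
eps = rng.normal(size=n_z)
z = B @ (mu + sigma * eps)

# KL( q(z|x) || N(0, I) ), with log|det B| via Sylvester's identity.
log_det_B = n_z * np.log(gamma) + np.log(abs(np.linalg.det(np.eye(k) + V @ U / gamma)))
m = B @ mu
Sigma = B @ np.diag(sigma**2) @ B.T
kl = 0.5 * (np.trace(Sigma) + m @ m - n_z - (2 * log_det_B + 2 * log_sigma.sum()))

# Hypothetical Bernoulli decoder: a single linear layer plus sigmoid.
W = rng.normal(scale=0.1, size=(n_x, n_z))
x = rng.integers(0, 2, size=n_x)        # a binarized datapoint
p = 1.0 / (1.0 + np.exp(-(W @ z)))      # decoder output probabilities
recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# ELBO = reconstruction term minus KL term; maximize w.r.t. all parameters.
elbo = recon - kl
print(np.isfinite(elbo))
```

In a real implementation these quantities would be computed batch-wise by an automatic-differentiation framework, with gradients taken through the KL and reconstruction terms.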
4 Related Work
Many recent strategies proposed to improve flexibility of inference models are based on the concept of normalizing flows, introduced by (Rezende and Mohamed, 2015) in the context of stochastic variational inference. Members of this family build a flexible variational posterior by starting with a conventional normal distribution for generating the latent variables and then applying a chain of invertible transformations, such as Householder transformation (Tomczak and Welling, 2016) and inverse autoregressive transformation (Kingma et al., 2016). Our proposed strategy requires only a single transformation and can be applied to both the encoder and the decoder.
5 Experiments
We conducted experiments on the MNIST dataset to empirically evaluate our approach. MNIST is a dataset of 60,000 training and 10,000 test images of handwritten digits with a resolution of 28×28 pixels (LeCun et al., 1998). The dataset was dynamically binarized as in (Salakhutdinov and Murray, 2008).
Our model had 50 stochastic latent units, and the encoder and decoder were each parameterized by a two-layer feed-forward network with 500 units per layer. The model was trained using the ADAM gradient-based optimization algorithm (Kingma and Ba, 2015) with a mini-batch size of 128. For the Dyadic Transformation matrix $B$ we used a value of 0.001 for $\gamma$.
The results of the experiments are presented in Table 1. They indicate that our proposed strategy obtains competitively low log-likelihoods despite its inherent simplicity and low computational requirements. Compared to the standard VAE, DT adds an additional computational cost of $O(nk^2 + k^3)$, which is primarily for the determinant calculation; hence, for small values of $k$, DT adds little computational cost. The additional memory requirement of DT is $O(nk)$, which is also reasonable for small values of $k$.
Our idea is fundamentally different from the other strategies for improving VAE since it does not belong to the existing large family of normalizing flow transformations. Thus, it holds promise for creating a new family of strategies for building flexible distributions in the context of stochastic variational inference.
We presented the Dyadic Transformation, a new transformation that builds a flexible multivariate distribution to enhance variational inference without sacrificing computational tractability. This simple idea boosts model flexibility with only a single transformation step. Our experiments indicate that DT increases VAE performance and that its results are competitive with those of the family of normalizing flows, which involve multiple levels of transformation. Our transformation can be readily integrated with the methods in that family to collectively build powerful hybrids. The Dyadic Transformation can also be straightforwardly applied to the decoder to obtain further performance gains, and it can be applied to binary data by modifying a Restricted Boltzmann Machine. These directions will be explored in future research.
- Chandy et al.  Sarin Chandy, Amin Rasekh, Zachary Barker, and Ehsan Shafiee. Cyberattack detection using deep generative models with variational inference. ASCE Journal of Water Resources Planning and Management, 2019.
- Duchi  John Duchi. Derivations for linear algebra and optimization. Berkeley, California, 2007.
- Goodfellow et al.  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
- Gregor et al.  Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. arXiv preprint arXiv:1310.8499, 2013.
- Hinton and Ghahramani  Geoffrey Hinton and Zoubin Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352(1358):1177–1190, 1997.
- Kingma and Ba  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference for Learning Representations, 2015.
- Kingma and Welling  Diederik Kingma and Max Welling. Auto-encoding variational bayes. Proceedings of the 2nd International Conference on Learning Representations, 2013.
- Kingma et al.  Diederik Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Pu et al.  Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in neural information processing systems, pages 2352–2360, 2016.
- Rezende and Mohamed  Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. Proceedings of the 32nd International Conference on Machine Learning, pages 1530–1538, 2015.
- Salakhutdinov and Murray  Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM, 2008.
- Salimans et al.  Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218–1226, 2015.
- Semeniuta et al.  Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637, 2017.
- Sylvester  James Joseph Sylvester. On the relation between the minor determinants of linearly equivalent quadratic functions. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1(4):295–305, 1851.
- Tomczak and Welling  Jakub Tomczak and Max Welling. Improving variational auto-encoders using householder flow. Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.
- Woodbury  Max Woodbury. Inverting modified matrices. Memorandum report, 42:106, 1950.
- Yu et al.  Tingzhao Yu, Lingfeng Wang, Huxiang Gu, Shiming Xiang, and Chunhong Pan. Deep generative video prediction. Pattern Recognition Letters, 110:58–65, 2018.