Spatial Variational Auto-Encoding via Matrix-Variate Normal Distributions

05/18/2017 ∙ by Zhengyang Wang, et al. ∙ Washington State University ∙ Texas A&M University

The key idea of variational auto-encoders (VAEs) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. However, the latent variables in VAEs are vectors, which are commonly interpreted as multiple feature maps of size 1x1. Such representations can only convey spatial information implicitly when coupled with powerful decoders. In this work, we propose spatial VAEs that use latent variables as feature maps of larger size to explicitly capture spatial information. This is achieved by allowing the latent variables to be sampled from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. Experimental results show that the proposed spatial VAEs outperform original VAEs in capturing rich structural and spatial information.






Deep learning, variational auto-encoders, matrix-variate normal distributions, generative models, unsupervised learning

1 Introduction.

Modeling probability distributions in high-dimensional spaces and generating samples from them are highly useful yet very challenging tasks. With the development of deep learning methods, deep generative models have been shown to be effective and scalable [12, 22, 5, 9, 19, 8, 21] in capturing probability distributions over high-dimensional data spaces and generating samples from them. Among them, variational auto-encoders (VAEs) [12, 22, 6, 11] are one of the most promising approaches. In machine learning, the auto-encoder architecture is applied to train scalable models by learning latent representations. For image modeling tasks, it is preferred to encode spatial information into the latent space explicitly. However, the latent variables in VAEs are vectors, which can be interpreted as $1 \times 1$ feature maps with no explicit spatial information. While this lack of explicit spatial information does not lead to major performance problems on simple tasks such as digit generation from the MNIST dataset [16], it greatly limits the model's abilities when images are more complicated [13, 17].

To overcome this limitation, we propose spatial VAEs that employ feature maps as latent representations. Such latent feature maps are generated from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. Specifically, MVN distributions are able to generate feature maps with appropriate dependencies among locations. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. In this low-rank formulation, the mean matrix of the MVN distribution is computed as the outer product of two vectors computed from the encoder network. Experimental results on image modeling tasks demonstrate the capabilities of our spatial VAEs in complicated image generation tasks.

It is worth noting that the original VAEs can be considered as a special case of spatial VAEs via MVN distributions. That is, if we set the size of feature maps generated via MVN distributions to $1 \times 1$, spatial VAEs via MVN distributions reduce to the original VAEs. More importantly, when the size of feature maps is larger than $1 \times 1$, direct structural ties are built into elements of the feature maps via MVN distributions. Thus, our proposed spatial VAEs are intrinsically different from the original VAEs when the size of feature maps is larger than $1 \times 1$. Specifically, our proposed spatial VAEs cannot be obtained by enlarging the size of the latent representations in the original VAEs.

2 Background and Related Work.

In this section, we introduce the architectures of auto-encoders and variational auto-encoders.

2.1 Auto-Encoder Architectures.

Auto-encoder (AE) is a model architecture used in tasks like image segmentation [30, 23, 18], machine translation [2, 25] and denoising reconstruction [28, 29]. It consists of two parts: an encoder that encodes the input data into lower-dimensional latent representations and a decoder that generates outputs by decoding the representations. Depending on different tasks, the latent representations will focus on different properties of input data. Nevertheless, these tasks usually require outputs to have similar or exactly the same structure as inputs. Thus, structural information is expected to be preserved through the encoder-decoder process.

In computer vision tasks, structural information usually means spatial information of images. There are two main strategies to preserve spatial information in AE for image tasks. One is to apply very powerful decoders, like conditional pixel convolutional neural networks (PixelCNNs) [20, 27, 24, 9], that generate output images pixel-by-pixel. In this way, the decoders can recover spatial information in the form of dependencies among pixels. However, pixel-by-pixel generation is very slow, resulting in major speed problems in practice. The other strategy is to let the latent representations explicitly contain spatial information and apply decoders that can make use of such information. To apply this strategy for image tasks, the latent representations are usually feature maps of size between the size of a pixel ($1 \times 1$) and that of the input image, while the decoders are deconvolutional neural networks (DCNNs) [30]. Since most computer vision tasks only require high-level spatial information like relative locations of objects instead of detailed relationships among pixels, preserving only rough spatial information is enough, and this strategy has proved effective and efficient.

2.2 Variational Auto-Encoders.

In unsupervised learning, generative models aim to model the underlying data distribution. Formally, for a data space $\mathcal{X}$, let $p_{\mathrm{true}}(x)$ denote the probability density function (PDF) of the true data distribution for $x \in \mathcal{X}$. Given a dataset $\mathcal{D}$ of i.i.d. samples from $\mathcal{X}$, generative models try to approximate $p_{\mathrm{true}}(x)$ using a model distribution $p_\theta(x)$, where $\theta$ represents model parameters. To train the model, maximum likelihood (ML) inference is performed on $\theta$; that is, parameters are updated to optimize $\log p_\theta(\mathcal{D})$. The approximation quality of $p_\theta(x)$ relies on the generalization ability of the model. In machine learning, it highly depends on learning latent representations which can encode common features among data samples and disentangle abstract explanatory factors behind the data [3]. In data generation tasks, we apply $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$ for modeling, where $p_\theta(z)$ is the PDF of the distribution of latent representations $z$ and $p_\theta(x \mid z)$ represents a complex mapping from the latent space to the data space. A major advantage of using latent representations is dimensionality reduction of data since they are low-dimensional. The prior $p_\theta(z)$ can be simple and easy to model while the mapping represented by $p_\theta(x \mid z)$ can be learned through complicated deep learning models automatically.

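Recently, [12] point out that the above model has intractability problems and can only be trained by costly sampling-based methods. To tackle this, they propose variational auto-encoders (VAEs), which instead maximize a variational lower bound of the log-likelihood as

$$\log p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left[q_\phi(z \mid x) \,\|\, p_\theta(z)\right], \tag{1}$$

where $q_\phi(z \mid x)$ is an approximation model to the intractable $p_\theta(z \mid x)$, parameterized by $\phi$, and $D_{KL}$ represents the Kullback-Leibler divergence. In VAEs, $p_\theta(z)$, $p_\theta(x \mid z)$, and $q_\phi(z \mid x)$ are modeled as multivariate Gaussian distributions with diagonal covariance matrices. Here, the mean $\mu_\phi(x)$ and covariance $\Sigma_\phi(x)$ of $q_\phi(z \mid x)$, and the parameters of $p_\theta(x \mid z)$, are computed with deep neural networks like CNNs. Figure 1 shows the architecture of VAEs. The model parameters $\theta$ and $\phi$ can be trained using the reparameterization trick [22], where the sampling process $z \sim \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$ is decomposed into two steps as

$$\epsilon \sim \mathcal{N}(0, I), \qquad z = \mu_\phi(x) + \Sigma_\phi(x)^{1/2}\, \epsilon.$$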
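The two-step sampling process and the KL term of the lower bound can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `reparameterize` and `kl_to_standard_normal` are hypothetical, the posterior is assumed to be a diagonal Gaussian described by a mean vector and a vector of variances, and the prior is assumed to be the standard normal $\mathcal{N}(0, I)$.

```python
import math
import random


def reparameterize(mu, var, rng=random):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I) (diagonal covariance)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]  # step 1: parameter-free noise
    # Step 2: deterministic transform of the noise by the encoder outputs.
    return [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]


def kl_to_standard_normal(mu, var):
    """Closed-form KL[N(mu, diag(var)) || N(0, I)] (assumed standard normal prior)."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


# Hypothetical encoder outputs for a 3-dimensional latent vector.
mu = [0.5, -1.0, 0.0]
var = [1.0, 0.25, 4.0]
z = reparameterize(mu, var, random.Random(0))
kl = kl_to_standard_normal(mu, var)
```

Because the noise is drawn independently of the encoder outputs, `z` is a deterministic function of `mu` and `var` given `eps`, which is what allows gradients to flow back to the encoder.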
3 Spatial Variational Auto-Encoders.

In this section, we analyze a problem of the original VAEs and propose spatial VAEs in Section 3.1 to overcome it. Afterwards, several ways to implement spatial VAEs are discussed. A naïve implementation is introduced and analyzed in Section 3.2, followed by a method that incorporates the use of matrix-variate normal (MVN) distributions in Section 3.3. Finally, we propose our final model, spatial VAEs via low-rank MVN distributions, by applying a low-rank formulation of MVN distributions in Section 3.4.

3.1 Overview.

Note that $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ in VAEs resemble the encoder and decoder, respectively, in AE for image reconstruction tasks, where $z$ represents the latent representations. However, in VAEs, $z$ is commonly a vector, which can be considered as multiple $1 \times 1$ feature maps. While $z$ may implicitly preserve some spatial information of the input image $x$, it raises the requirement for a more complex decoder. Given a fixed architecture, the hypothesis space of decoder models is limited. As a result, the optimal decoder may not lie in the hypothesis space [31]. This problem significantly hampers the performance of VAEs, especially when spatial information is important for images in $\mathcal{X}$.

Based on the above analysis, it is beneficial to either have a larger hypothesis space for decoders or let $z$ explicitly contain spatial information. Note that these two methods correspond to the two strategies introduced in Section 2.1. [9] follow the first strategy and propose PixelVAEs whose decoders are conditional PixelCNNs [27] instead of simple DCNNs. As conditional PixelCNNs themselves are also generative models, PixelVAEs can be considered as conditional PixelCNNs with the conditions replaced by $z$. In spite of their impressive results, the performance of PixelVAEs and conditional PixelCNNs is similar, which indicates that conditional PixelCNNs are responsible for capturing most properties of images in $\mathcal{X}$. In this case, $z$ contributes little to the performance. In addition, applying conditional PixelCNNs leads to a very slow generation process in practice. In this work, the second strategy is explored by constructing spatial latent representations in the form of feature maps of size larger than $1 \times 1$. Such feature maps can explicitly contain spatial information. We term VAEs with spatial latent representations as spatial VAEs.

The main distinction between spatial VAEs and the original VAEs is the size of latent feature maps. By having $d \times d$ ($d > 1$) feature maps instead of $1 \times 1$ ones, the total dimension of the latent representations significantly increases. However, spatial VAEs are essentially different from the original VAEs with a higher-dimensional latent vector $z$. Suppose the vector $z$ is extended by $d^2$ times in order to match the total dimension; the number of hidden nodes in each layer of decoders will explode correspondingly. This results in an explosion in the number of decoders' parameters, which slows down the generation process. In spatial VAEs, by contrast, decoders become even simpler since $d \times d$ is closer to the required size of output images. From the other side, when using decoders of similar capacities, spatial VAEs must have higher-dimensional latent representations than the original VAEs. It is demonstrated that this only slightly influences the training process by requiring more outputs from encoders, while the generation process that only involves decoders remains unaffected. Our experimental results show that with proper designs, spatial VAEs substantially outperform the original VAEs when applying similar decoders.

Figure 1: Illustration of the differences between the proposed spatial VAEs via low-rank MVN distributions and the original VAEs. At the top is the architecture of the original VAEs where the latent $z$ is a vector sampled from a multivariate Gaussian distribution with a diagonal covariance matrix. Below is the proposed model which is explained in detail in Section 3.4. Briefly, it modifies the sampling process by incorporating a low-rank formulation of the MVN distributions and produces latent representations that explicitly retain spatial information.

3.2 Naïve Spatial VAEs.

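To achieve spatial VAEs, a direct and naïve way is to simply reshape the original vector $z$ into $N$ feature maps of size $d \times d$. But this naïve way is problematic since the sampling process does not change. Note that in the original VAEs, the vector $z$ is sampled from $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x), \Sigma_\phi(x))$. The covariance matrix $\Sigma_\phi(x)$ is diagonal, meaning that the components are pairwise uncorrelated. In particular, for multivariate Gaussian distributions, uncorrelatedness implies independence. Therefore, the components of $z$ are independent random variables and the variances of their distributions correspond to entries on the diagonal of $\Sigma_\phi(x)$. Specifically, suppose $z$ is a $C$-dimensional vector; its $i$-th component $z_i$ is a random variable that follows the univariate normal distribution $z_i \sim \mathcal{N}\left(\mu_\phi(x)_i, \operatorname{diag}(\Sigma_\phi(x))_i\right)$, where $\operatorname{diag}(\cdot)$ represents the vector consisting of a matrix's diagonal entries. After applying the reparameterization trick, the two-step sampling process of Section 2.2 can be rewritten componentwise as

$$\epsilon_i \sim \mathcal{N}(0, 1), \qquad z_i = \mu_\phi(x)_i + \operatorname{diag}(\Sigma_\phi(x))_i^{1/2}\, \epsilon_i, \qquad i = 1, \ldots, C.$$

To sample $N$ feature maps of size $d \times d$ in naïve spatial VAEs, the above process is followed by a reshape operation while setting $C = d^2 N$.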
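The naïve construction can be sketched as a sample-then-reshape routine. This is an illustrative sketch only: the function name `sample_naive_feature_maps`, the row-major layout of the reshape, and the example sizes are assumptions that do not come from the paper.

```python
import math
import random


def sample_naive_feature_maps(mu, var, n_maps, d, rng=random):
    """Sample a (d*d*n_maps)-dim vector with independent components, then reshape."""
    if len(mu) != n_maps * d * d or len(var) != len(mu):
        raise ValueError("mu and var must both have length n_maps * d * d")
    # Componentwise reparameterization: every entry is sampled independently.
    z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
    # Reshape (assumed row-major): component k*d*d + i*d + j -> entry (i, j) of map k.
    return [[z[k * d * d + i * d: k * d * d + (i + 1) * d] for i in range(d)]
            for k in range(n_maps)]


n_maps, d = 2, 3  # hypothetical sizes
maps = sample_naive_feature_maps([0.0] * 18, [1.0] * 18, n_maps, d, random.Random(0))
```

The reshape only relabels the components of the vector. Neighbouring entries of a feature map share no parameters, which is the weakness discussed next.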

However, between two different components $z_i$ and $z_j$, the only relationship is that their respective distribution parameters $\left(\mu_\phi(x)_i, \operatorname{diag}(\Sigma_\phi(x))_i\right)$ and $\left(\mu_\phi(x)_j, \operatorname{diag}(\Sigma_\phi(x))_j\right)$ are both computed from $x$. Such dependencies are implicit and weak. After reshaping, there is no direct relationship among locations within each feature map, while spatial latent representations should contain spatial information like dependencies among locations. To overcome this limitation, we propose spatial VAEs via matrix-variate normal distributions.

3.3 Spatial VAEs via Matrix-Variate Normal Distributions.

Instead of obtaining $N$ feature maps of size $d \times d$ by first sampling a $d^2N$-dimensional vector from multivariate normal distributions and then reshaping, we propose to directly sample $d \times d$ matrices as feature maps from matrix-variate normal (MVN) distributions [10], resulting in an improved model known as spatial VAEs via MVN distributions. Specifically, we modify $q_\phi(z \mid x)$ in the original VAEs and keep other parts the same. As explained below, MVN distributions can model dependencies between the rows and columns in a matrix. In this way, dependencies among locations within a feature map are established. We proceed by providing the definition of MVN distributions.


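A random matrix $A \in \mathbb{R}^{m \times n}$ is said to follow a matrix-variate normal distribution $\mathcal{MN}_{m,n}(M, \Omega \otimes \Psi)$ with mean matrix $M \in \mathbb{R}^{m \times n}$ and covariance matrix $\Omega \otimes \Psi$, where $\Omega \in \mathbb{R}^{m \times m}$ and $\Psi \in \mathbb{R}^{n \times n}$ are positive definite, if $\operatorname{vec}(A^T)$ follows the multivariate normal distribution $\mathcal{N}_{mn}\left(\operatorname{vec}(M^T), \Omega \otimes \Psi\right)$. Here, $\otimes$ denotes the Kronecker product and $\operatorname{vec}(\cdot)$ denotes transforming a matrix into an $mn$-dimensional vector by concatenating the columns.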
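To make the definition concrete, the sketch below builds the Kronecker-structured covariance and draws one sample for general (non-diagonal) row and column covariances. It relies on a standard construction that is not stated in the text: if $\Omega = L_\Omega L_\Omega^T$ and $\Psi = L_\Psi L_\Psi^T$ are Cholesky factorizations and $E$ has i.i.d. standard normal entries, then $A = M + L_\Omega E L_\Psi^T$ satisfies the definition. All function names and the example matrices are hypothetical.

```python
import math
import random


def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n = len(b)
    size = len(a) * n
    return [[a[r // n][c // n] * b[r % n][c % n] for c in range(size)]
            for r in range(size)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def matmul(a, b):
    """Plain matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sample_mvn(mean, omega, psi, rng=random):
    """Draw A = M + L_omega E L_psi^T, so vec(A^T) ~ N(vec(M^T), omega (x) psi)."""
    m, n = len(mean), len(mean[0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    l_psi_t = [list(col) for col in zip(*cholesky(psi))]  # transpose of L_psi
    shift = matmul(matmul(cholesky(omega), noise), l_psi_t)
    return [[mean[i][j] + shift[i][j] for j in range(n)] for i in range(m)]


omega = [[4.0, 2.0], [2.0, 3.0]]                         # row covariance (2 x 2)
psi = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]]  # column covariance (3 x 3)
cov = kron(omega, psi)  # covariance of vec(A^T), a 6 x 6 matrix
sample = sample_mvn([[0.0] * 3 for _ in range(2)], omega, psi, random.Random(0))
```

With zero-based indices, the covariance between entries $(i, j)$ and $(k, l)$ of $A$ is `cov[i * 3 + j][k * 3 + l]`, which equals `omega[i][k] * psi[j][l]`. This is the sense in which $\Omega$ acts across rows and $\Psi$ across columns.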

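In MVN distributions, $\Omega$ and $\Psi$ capture the relationships across rows and columns, respectively, of a matrix. By constructing the covariance matrix through the Kronecker product of these two matrices, dependencies among values in a matrix can be modeled. In spatial VAEs, a feature map $F$ can be considered as a $d \times d$ matrix that follows an MVN distribution $\mathcal{MN}_{d,d}(M, \Omega \otimes \Psi)$, where $\Omega$ and $\Psi$ are diagonal matrices. Although within $F$ the random variables corresponding to each location are still independent since $\Omega \otimes \Psi$ is diagonal, MVN distributions are able to add direct structural ties among locations through their variances. For example, for two locations $(i_1, j_1)$ and $(i_2, j_2)$ in $F$,

$$F_{(i_1, j_1)} \sim \mathcal{N}\left(M_{(i_1, j_1)}, \operatorname{diag}(\Omega \otimes \Psi)_{(i_1 - 1)d + j_1}\right), \qquad F_{(i_2, j_2)} \sim \mathcal{N}\left(M_{(i_2, j_2)}, \operatorname{diag}(\Omega \otimes \Psi)_{(i_2 - 1)d + j_2}\right).$$

Here, $F_{(i_1, j_1)}$ and $F_{(i_2, j_2)}$ are independently sampled from two univariate Gaussian distributions. However, the two variances are tied together through the Kronecker product: since $\operatorname{diag}(\Omega \otimes \Psi)_{(i - 1)d + j} = \Omega_{ii} \Psi_{jj}$, locations in the same row share the factor $\Omega_{ii}$ and locations in the same column share the factor $\Psi_{jj}$. Based on this, we propose spatial VAEs via MVN distributions, which sample $N$ feature maps of size $d \times d$ from $N$ independent MVN distributions as

$$F_k \sim \mathcal{MN}_{d,d}(M_k, \Omega_k \otimes \Psi_k), \qquad k = 1, \ldots, N, \tag{6}$$

where $M_k$, $\Omega_k$ and $\Psi_k$ are computed through the encoder. Here, compared to the original VAEs, $q_\phi(z \mid x)$ is replaced but $p_\theta(z)$ remains the same. Since MVN distributions are defined based on multivariate Gaussian distributions, the term $D_{KL}\left[q_\phi(z \mid x) \,\|\, p_\theta(z)\right]$ in Equation 1 can be calculated in a similar way.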

To demonstrate the differences with naïve spatial VAEs, we reexamine the original VAEs. Note that naïve spatial VAEs have the same sampling process as the original VAEs. The original VAE samples a $d^2N$-dimensional vector $z$ from $\mathcal{N}(\mu, \Sigma)$, where $\mu$ is a $d^2N$-dimensional vector and $\Sigma$ is a $d^2N \times d^2N$ diagonal matrix. Because $\Sigma$ is diagonal, it can be represented by the $d^2N$-dimensional vector $\operatorname{diag}(\Sigma)$. To summarize, the encoder of the original VAEs outputs $2d^2N$ values, which are interpreted as $\mu$ and $\operatorname{diag}(\Sigma)$.

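In spatial VAEs via MVN distributions, according to Equation 6, $M_k$ is a $d \times d$ matrix while $\Omega_k$ and $\Psi_k$ are $d \times d$ diagonal matrices that can be represented by $d$-dimensional vectors. In this case, the required number of outputs from the encoder is changed to $(d^2 + 2d)N$, corresponding to $M_k$, $\operatorname{diag}(\Omega_k)$ and $\operatorname{diag}(\Psi_k)$. As has been explained in Section 3.2, since $\Omega_k \otimes \Psi_k$ is diagonal, sampling the matrix $F_k$ is equivalent to sampling $d^2$ scalar numbers from $d^2$ independent univariate normal distributions. So the modified sampling process with the reparameterization trick is

$$E_{k,(i,j)} \sim \mathcal{N}(0, 1), \qquad F_k = M_k + \left(\operatorname{diag}(\Omega_k) \operatorname{diag}(\Psi_k)^T\right)^{1/2} \odot E_k, \qquad k = 1, \ldots, N,$$

where the entries of the $d \times d$ noise matrix $E_k$ are sampled independently, the square root is taken elementwise, and $\odot$ denotes the elementwise product.

Here, we take advantage of the fact that for diagonal matrices, the Kronecker product is equivalent to the outer product of vectors. To be specific, suppose $\Omega$ and $\Psi$ are two $d \times d$ diagonal matrices; then $\operatorname{diag}(\Omega)$ and $\operatorname{diag}(\Psi)$ are two $d$-dimensional vectors and satisfy

$$\operatorname{diag}(\Omega \otimes \Psi) = \operatorname{vec}\left(\operatorname{diag}(\Psi) \operatorname{diag}(\Omega)^T\right),$$

that is, entry $(i - 1)d + j$ of $\operatorname{diag}(\Omega \otimes \Psi)$ equals $\Omega_{ii} \Psi_{jj}$, the $(i, j)$ entry of the outer product $\operatorname{diag}(\Omega) \operatorname{diag}(\Psi)^T$.

It is worth noting that, compared to naïve spatial VAEs, the required number of outputs from the encoder decreases from $2d^2N$ to $(d^2 + 2d)N$ whenever $d > 2$. As a result, spatial VAEs via MVN distributions lead to a simpler model while adding structural ties among locations. Note that the original VAEs can be considered as a special case of the spatial VAEs via MVN distributions. That is, if we set $d = 1$, spatial VAEs via MVN distributions reduce to the original VAEs.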
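The equivalence between the Kronecker product of diagonal matrices and the outer product of their diagonals, and the resulting sampling step, can be checked with a short sketch. The function names and the example values are hypothetical, and the diagonal matrices are represented only by their diagonal entries.

```python
import math
import random


def diag_kron(omega_diag, psi_diag):
    """diag(Omega (x) Psi) for diagonal Omega and Psi, listed in Kronecker order."""
    return [w * p for w in omega_diag for p in psi_diag]


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[a * b for b in v] for a in u]


def sample_diag_mvn(mean, omega_diag, psi_diag, rng=random):
    """F = M + sqrt(omega psi^T) * E elementwise, with E_ij ~ N(0, 1) i.i.d."""
    var = outer(omega_diag, psi_diag)  # var[i][j] = Omega_ii * Psi_jj
    return [[mean[i][j] + math.sqrt(var[i][j]) * rng.gauss(0.0, 1.0)
             for j in range(len(psi_diag))] for i in range(len(omega_diag))]


omega_diag, psi_diag = [1.0, 4.0, 9.0], [0.5, 2.0, 1.0]
# Row-major flattening of the outer product matches the Kronecker diagonal.
flat_outer = [x for row in outer(omega_diag, psi_diag) for x in row]
identity_holds = flat_outer == diag_kron(omega_diag, psi_diag)
feature_map = sample_diag_mvn([[0.0] * 3 for _ in range(3)], omega_diag, psi_diag,
                              random.Random(0))
```

Only $2d$ variance parameters are needed per feature map here, instead of the $d^2$ independent variances used by the naïve approach.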

Figure 2: Sample face images generated by different VAEs when trained on the CelebA dataset. The first and second rows show training images and images generated by the original VAEs. The remaining three rows are the results of naïve spatial VAEs, spatial VAEs via MVN distributions and spatial VAEs via low-rank MVN distributions, respectively.

3.4 A Low-Rank Formulation.

The use of MVN distributions makes locations directly related to each other within a feature map by adding restrictions on variances. However, in probability theory, variance only measures the expected squared deviation from the mean. To have more direct relationships, it is preferred to have restricted means. In this section, we introduce a low-rank formulation of MVN distributions [1] for spatial VAEs.

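The low-rank formulation of an MVN distribution is denoted as $\mathcal{MN}_{m,n}(uv^T, \Omega \otimes \Psi)$, where the mean matrix $M$ is computed by the outer product $uv^T$ instead. Here, $u$ and $v$ are $m$-dimensional and $n$-dimensional vectors, respectively. Similar to computing the covariance matrix through the Kronecker product of two separate matrices, it explicitly forces structural interactions among entries of the mean matrix. Applying this low-rank formulation leads to our final model, spatial VAEs via low-rank MVN distributions, which is illustrated in Figure 1. By using two distinct $d$-dimensional vectors to construct $M_k$, Equation 6 is modified as

$$F_k \sim \mathcal{MN}_{d,d}(u_k v_k^T, \Omega_k \otimes \Psi_k), \qquad k = 1, \ldots, N,$$

where $u_k$ and $v_k$ are $d$-dimensional vectors. For the encoder, the number of outputs is further reduced to $4dN$ from $(d^2 + 2d)N$, replacing $d^2N$ outputs for $M_k$ with $dN$ outputs for $u_k$ and another $dN$ outputs for $v_k$. In contrast to the sampling process in Section 3.3, the two-step sampling process can be expressed as

$$E_{k,(i,j)} \sim \mathcal{N}(0, 1), \qquad F_k = u_k v_k^T + \left(\operatorname{diag}(\Omega_k) \operatorname{diag}(\Psi_k)^T\right)^{1/2} \odot E_k, \qquad k = 1, \ldots, N,$$

where the square root is taken elementwise and $\odot$ denotes the elementwise product.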
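A minimal sketch of the low-rank sampling step follows. It assumes that the encoder has already produced, for each feature map, two mean vectors and the two variance diagonals. The names `sample_low_rank_mvn` and `sample_latent` and the example numbers are hypothetical.

```python
import math
import random


def sample_low_rank_mvn(u, v, omega_diag, psi_diag, rng=random):
    """F = u v^T + sqrt(omega psi^T) * E elementwise, with E_ij ~ N(0, 1) i.i.d."""
    return [[u[i] * v[j]
             + math.sqrt(omega_diag[i] * psi_diag[j]) * rng.gauss(0.0, 1.0)
             for j in range(len(v))] for i in range(len(u))]


def sample_latent(params, rng=random):
    """Sample one feature map per (u, v, omega_diag, psi_diag) tuple in params."""
    return [sample_low_rank_mvn(u, v, w, p, rng) for (u, v, w, p) in params]


# Two hypothetical 2 x 2 feature maps, i.e. d = 2 and N = 2.
params = [
    ([1.0, 2.0], [3.0, 4.0], [0.1, 0.2], [0.3, 0.4]),
    ([0.5, -0.5], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]),
]
latent = sample_latent(params, random.Random(0))
```

Each feature map is described by $4d$ numbers, so both the mean and the variance of every location are built from one row-specific factor and one column-specific factor.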
As has been demonstrated in Section 3.1, spatial VAEs require more outputs from encoders than the original VAEs, which slows down the training process. Spatial VAEs via low-rank MVN distributions properly address the problem while achieving appropriate spatial latent representations. According to the experimental results, they outperform the original VAEs in several image generation tasks when similar decoders are used.
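The encoder output counts derived in Section 3 can be compared directly. The helper below simply evaluates the three formulas; the function name and the example values of $d$ and $N$ are illustrative and are not the settings used in the experiments.

```python
def encoder_outputs(d, n_maps):
    """Encoder output counts for n_maps feature maps of size d x d."""
    return {
        "naive": 2 * d * d * n_maps,        # mean and variance per location
        "mvn": (d * d + 2 * d) * n_maps,    # M_k, diag(Omega_k), diag(Psi_k)
        "low_rank_mvn": 4 * d * n_maps,     # two mean vectors, two variance diagonals
    }


counts = encoder_outputs(d=4, n_maps=8)
# counts == {"naive": 256, "mvn": 192, "low_rank_mvn": 128}
```

The low-rank count grows linearly in $d$ while the other two grow quadratically, so the gap widens for larger feature maps. For $d = 2$ all three counts coincide.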

4 Experimental Studies.

We use the original VAEs as the baseline models in our experiments, as most recent improvements on VAEs are derived from the vector latent representations and can be easily incorporated into our matrix-based models. To elucidate the performance differences of various spatial VAEs, we compare the results of the three spatial VAEs introduced in Section 3, namely naïve spatial VAEs, spatial VAEs via MVN distributions and spatial VAEs via low-rank MVN distributions. We train the models on the CelebA, CIFAR-10 and MNIST datasets, and analyze sample images generated from the models to evaluate the performance. For the same task, the encoders of all compared models are composed of the same convolutional neural networks (CNNs) and a fully-connected output layer [15, 14]. While the fully-connected layer may differ as required by different numbers of output units, it only slightly affects the training process. As discussed in Section 3.1, it is reasonable to compare spatial VAEs with the original VAEs in the case that their decoders have similar architectures and model capabilities. Therefore, following the original VAEs, deconvolutional neural networks (DCNNs) are used as decoders in spatial VAEs. Meanwhile, the total number of trainable parameters in the decoders of all compared models is set to be as similar as possible while accommodating different input sizes.

4.1 CelebA.

Figure 3: Sample images generated by different VAEs when trained on the CIFAR-10 dataset. From top to bottom, the five rows are training images and images generated by the original VAEs, naïve spatial VAEs, spatial VAEs via MVN distributions, and spatial VAEs via low-rank MVN distributions, respectively.

The CelebA dataset contains colored face images of size . The generative models are supposed to generate faces that are similar but not identical to those in the dataset. For this task, the CNNs in the encoders have layers while the decoders are or -layer DCNNs corresponding to spatial VAEs and the original VAEs, respectively. This difference is caused by the fact that spatial VAEs have $d \times d$ ($d > 1$) feature maps as latent representations, which require fewer up-sampling operations to obtain outputs. We set and , and the dimension of $z$ in the original VAEs is in order to have decoders with similar numbers of trainable parameters.

Figure 2 shows sample face images generated by the original VAEs and three different variants of spatial VAEs. Spatial VAEs generate images with more details than the original VAEs. Due to the lack of explicit spatial information, the original VAEs produce face images with few details, such as hair, near the borders. While naïve spatial VAEs seem to address this problem, most faces have only incomplete hair, as naïve spatial VAEs cannot capture the relationships among different locations. Theoretically, spatial VAEs via MVN distributions are able to incorporate interactions among locations. However, the results are strange faces with some distortions. We believe the reason is that adding dependencies among locations only through restrictions on distribution variances is not sufficiently effective. Spatial VAEs via low-rank MVN distributions, which have restricted means, tackle this well and generate faces with appealing visual appearances.

4.2 CIFAR-10.

The CIFAR-10 dataset consists of 60,000 color images of size $32 \times 32$ in 10 classes. VAEs usually perform poorly in generating photo-realistic images since there are significant differences among images in different classes, indicating that the underlying true distribution of the data is multi-modal. In this case, VAEs tend to output very blurry images [26, 8, 7]. However, comparison among different models can still demonstrate the differences in terms of generative capabilities. In this experiment, we set and , and the dimension of $z$ in the original VAEs is . The encoders have layers while the decoders have or layers.

Some sample images are provided in Figure 3. The original VAEs only produce images composed of several colored areas, which is consistent with the results of a similar model reported in [22]. All three implementations of spatial VAEs generate images with more details. However, naïve spatial VAEs still produce meaningless images as there is no relationship among different parts. The images generated by spatial VAEs via MVN distributions look like distorted objects, a problem similar to the one observed on the CelebA dataset. Again, spatial VAEs via low-rank MVN distributions outperform the other models, producing blurry but object-like images.

Model                     Log-Likelihood
Original VAE              297
Naïve SVAE                275
SVAE via MVN              267
SVAE via low-rank MVN     296

Table 1: Parzen window log-likelihood estimates of test data on the MNIST dataset. We follow the same procedure as in


4.3 MNIST.

Model                     Training time     Generation time
Original VAE              167.0309s         1.3892s
Naïve SVAE                178.8601s         1.3676s
SVAE via MVN              177.4387s         1.3767s
SVAE via low-rank MVN     172.9639s         1.3686s

Table 2: Training and generation time of different models when trained on the CelebA dataset using an Nvidia Tesla K40C GPU. The average time for training one epoch and the time for generating images are reported and compared.

We perform quantitative analysis on the real-valued MNIST dataset by employing the Parzen window log-likelihood estimates [4]. This evaluation method is used for several generative models where the exact likelihood is not tractable [8, 19]. The results are reported in Table 1, where SVAE is short for spatial VAE. Despite the difference in visual quality of generated images, spatial VAE via low-rank MVN distributions shares similar quantitative results with the original VAE. Note that generative models for images are supposed to capture the underlying data distribution by maximizing log-likelihood and generate images that are similar to real ones. However, it has been pointed out in [26] that these two objectives are not consistent, and generative models need to be evaluated directly with respect to the applications for which they were intended. A model that can generate samples with good visual appearances may have poor average log-likelihood on the test dataset and vice versa. Common examples of deep generative models are VAEs and generative adversarial networks (GANs) [8]. VAEs usually have higher average log-likelihood while GANs can generate more photo-realistic images. This is basically caused by the different training objectives of these two models [7]. Currently there is no commonly accepted standard for evaluating generative models.
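For readers unfamiliar with the metric, a Parzen window estimate fits an isotropic Gaussian kernel around each generated sample and scores test points under the resulting mixture. The sketch below is a generic illustration of that idea, not the evaluation code behind Table 1: the function names are hypothetical, and the bandwidth `sigma` is assumed to be given (in practice it is commonly selected on a validation set).

```python
import math


def parzen_log_density(x, samples, sigma):
    """Log density at x of a Gaussian Parzen window built from generated samples."""
    dim = len(x)
    exponents = [-sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
                 for s in samples]
    top = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = top + math.log(sum(math.exp(e - top) for e in exponents))
    log_norm = 0.5 * dim * math.log(2.0 * math.pi * sigma ** 2)
    return log_sum - math.log(len(samples)) - log_norm


def mean_parzen_log_likelihood(test_points, samples, sigma):
    """Average Parzen window log-likelihood over a set of test points."""
    total = sum(parzen_log_density(x, samples, sigma) for x in test_points)
    return total / len(test_points)


# Toy example with hypothetical 2-D "generated samples" and "test points".
generated = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
held_out = [[0.1, 0.1], [0.9, 0.8]]
score = mean_parzen_log_likelihood(held_out, generated, sigma=0.5)
```

Because the estimate depends only on distances between test points and generated samples, it can rank models differently from visual inspection, which is consistent with the discussion above.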

4.4 Timing Comparison.

To show the influence of different spatial VAEs on the training process, we compare the training time on the CelebA dataset. Theoretically, spatial VAEs slow down training due to the larger numbers of outputs from encoders. To keep the number of trainable parameters in decoders roughly equal, we set the dimension of $z$ in the original VAEs to be while and for spatial VAEs. According to Section 3, the numbers of outputs from their encoders are , , , and for the original VAE, naïve spatial VAE, spatial VAE via MVN distributions and spatial VAE via low-rank MVN distributions, respectively. We train our models on an Nvidia Tesla K40C GPU and report the average time for training one epoch in Table 2. Comparisons of the time for generating images are also provided to show that the increase in the total dimension of latent representations does not affect the generation process.

The results show consistent relationships between the training time and the number of outputs from encoders; that is, spatial VAEs cost more time than the original VAE but spatial VAEs via low-rank MVN distributions can alleviate this problem. Moreover, spatial VAEs only slightly slow down the training process since they only affect one single layer in the models.

5 Conclusion.

In this work, we propose spatial VAEs for image generation tasks, which improve VAEs by requiring the latent representations to explicitly contain spatial information of images. Specifically, in spatial VAEs, $d \times d$ ($d > 1$) feature maps are sampled to serve as spatial latent representations in contrast to a vector. This is achieved by sampling the latent feature maps from MVN distributions, which can model dependencies between the rows and columns in a matrix. We further propose to employ a low-rank formulation of MVN distributions to establish stronger dependencies. Qualitative results on different datasets show that spatial VAEs via low-rank MVN distributions substantially outperform the original VAEs.


This work was supported by the National Science Foundation grants IIS-1633359 and DBI-1641223.