
A Prior of a Googol Gaussians: a Tensor Ring Induced Prior for Generative Models

by   Maksim Kuznetsov, et al.

Generative models produce realistic objects in many domains, including text, image, video, and audio synthesis. The most popular models, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), usually employ a standard Gaussian distribution as a prior. Previous works show that a richer family of prior distributions may help to avoid the mode collapse problem in GANs and to improve the evidence lower bound in VAEs. We propose a new family of prior distributions, Tensor Ring Induced Prior (TRIP), that packs an exponential number of Gaussians into a high-dimensional lattice with a relatively small number of parameters. We show that these priors improve Fréchet Inception Distance for GANs and Evidence Lower Bound for VAEs. We also study generative models with TRIP in the conditional generation setup with missing conditions. Altogether, we propose a novel plug-and-play framework for generative models that can be utilized in any GAN and VAE-like architectures.



1 Introduction

Modern generative models are widely applied to the generation of realistic and diverse images, text, and audio files Karras et al. (2018); Ping et al. (2018); van den Oord et al. (2018); Polykovskiy et al. (2018); Zhavoronkov et al. (2019). Generative Adversarial Networks (GAN) Goodfellow et al. (2014), Variational Autoencoders (VAE) Kingma and Welling (2013), and their variations are the most commonly used neural generative models. Both architectures learn a mapping from some prior distribution $p(z)$, usually a standard Gaussian, to the data distribution $p(x)$. Previous works showed that richer prior distributions might improve the generative models: they reduce mode collapse for GANs Ben-Yosef and Weinshall (2018); Pan et al. (2018) and give a tighter Evidence Lower Bound (ELBO) for VAEs Tomczak and Welling (2018).

Figure 1: (a) 2D Tensor Ring Induced Prior. The TRIP distribution is a multidimensional Gaussian Mixture Model with an exponentially large number of modes located on the lattice nodes. (b) An example Tensor Ring decomposition. To compute the value of a tensor element, one should multiply the highlighted matrices and compute the trace of the product.

If the prior lies in a parametric family, we can learn the most suitable distribution for it during training. In this work, we investigate Gaussian Mixture Models as prior distributions with an exponential number of Gaussians in nodes of a multidimensional lattice. In our experiments, we used a prior with more than a googol ($10^{100}$) Gaussians. To handle such complex distributions, we represent $p(z)$ using a Tensor Ring decomposition Zhao et al. (2016), a method for approximating high-dimensional tensors with a relatively small number of parameters. We call this family of distributions a Tensor Ring Induced Prior (TRIP). For this distribution, we can compute marginal and conditional probabilities and sample from them efficiently.

We also extend TRIP to conditional generation, where a generative model $p(x \mid y)$ produces new objects $x$ with specified attributes $y$. With TRIP, we can produce new objects conditioned only on a subset of attributes, leaving some labels unspecified during both training and inference.

Our main contributions are summarized as follows:

  • We introduce a family of distributions that we call a Tensor Ring Induced Prior (TRIP) and use it as a prior for generative models: VAEs, GANs, and their variations.

  • We investigate an application of TRIP to conditional generation and show that this prior improves quality on sparsely labeled datasets.

  • We evaluate TRIP models on the generation of CelebA faces for both conditional and unconditional setups. For GANs, we show an improvement in Fréchet Inception Distance (FID); for VAEs, we show an improved ELBO. For the conditional generation, we show lower rates of condition violation compared to standard conditional models.

2 Tensor Ring Induced Prior

In this section, we introduce a Tensor Ring-induced distribution for both discrete and continuous variables. We also define a Tensor Ring Induced Prior (TRIP) family of distributions.

2.1 Tensor Ring decomposition

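Tensor Ring decomposition Zhao et al. (2016) represents large high-dimensional tensors (such as discrete distributions) with a relatively small number of parameters. Consider a joint distribution $p(r_1, r_2, \dots, r_d)$ of $d$ discrete random variables $r_k$ taking values from $\{0, 1, \dots, N_k - 1\}$. We write these probabilities as elements of a $d$-dimensional tensor $P[r_1, r_2, \dots, r_d] = p(r_1, r_2, \dots, r_d)$. For the brevity of notation, we use $r_{1:d}$ for $(r_1, \dots, r_d)$. The number of elements in this tensor grows exponentially with the number of dimensions $d$: for only $d$ binary variables the tensor contains $2^d$ real numbers. Tensor Ring decomposition reduces the number of parameters by approximating the tensor $P$ with low-rank non-negative tensor cores $Q_k \in \mathbb{R}_+^{N_k \times m_k \times m_{k+1}}$, where $m_1, \dots, m_{d+1}$ are core sizes, and $m_{d+1} = m_1$:

$$p(r_{1:d}) \propto \widehat{P}[r_{1:d}] = \mathrm{Tr}\Big(\prod_{j=1}^{d} Q_j[r_j]\Big) \tag{1}$$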

To compute $\widehat{P}[r_{1:d}]$, for each random variable $r_k$, we slice a tensor $Q_k$ along the first dimension and obtain a matrix $Q_k[r_k] \in \mathbb{R}_+^{m_k \times m_{k+1}}$. We multiply these matrices for all random variables and compute the trace of the resulting matrix to get a scalar (see Figure 1(b) for an example). In Tensor Ring decomposition, the number of parameters grows linearly with the number of dimensions. With larger core sizes $m_k$, Tensor Ring decomposition can approximate more complex distributions. Note that the order of the variables matters: Tensor Ring decomposition better captures dependencies between closer variables than between the distant ones.
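
The slice-multiply-trace computation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `tensor_ring_value`, the sizes, and the random cores are hypothetical, and all cores are assumed to share one core size.

```python
import numpy as np

def tensor_ring_value(cores, index):
    """Unnormalized tensor entry: trace of the product of the selected core slices."""
    prod = np.eye(cores[0].shape[1])
    for core, r in zip(cores, index):
        prod = prod @ core[r]          # core[r] is one (m, m) matrix slice
    return np.trace(prod)

# Hypothetical sizes: 4 variables, 3 values each, core size 2.
rng = np.random.default_rng(0)
d, N, m = 4, 3, 2
cores = [rng.random((N, m, m)) for _ in range(d)]   # non-negative cores
value = tensor_ring_value(cores, (2, 0, 1, 1))
```

Storing the cores takes `d * N * m * m` numbers, while the full tensor would take `N ** d`.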

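With Tensor Ring decomposition, we can compute marginal distributions without computing the whole tensor $\widehat{P}[r_{1:d}]$. To marginalize out the random variable $r_k$, we replace cores $Q_k[r_k]$ in Eq 1 with matrix $\widetilde{Q}_k = \sum_{r_k=0}^{N_k-1} Q_k[r_k]$:

$$\widehat{P}[r_{1:k-1}, r_{k+1:d}] = \mathrm{Tr}\Big(\prod_{j=1}^{k-1} Q_j[r_j] \cdot \widetilde{Q}_k \cdot \prod_{j=k+1}^{d} Q_j[r_j]\Big) \tag{2}$$

In Supplementary Materials, we show an Algorithm for computing marginal distributions. We can also compute conditionals as a ratio between the joint and marginal probabilities, $p(A \mid B) = p(A, B) / p(B)$; we sample from conditional or marginal distributions using the chain rule.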
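
A minimal NumPy sketch of marginalization and conditioning in a Tensor Ring follows. The helper names (`tr_unnormalized`, `tr_probability`, `tr_conditional`) and the dictionary-based interface are assumptions made for illustration, not the authors' API. A variable that is absent from the dictionary is marginalized out by summing its core over the first axis.

```python
import numpy as np

def tr_unnormalized(cores, values):
    """Trace of the product of per-variable matrices.

    `values` maps a variable position to its value; positions that are
    missing from `values` are marginalized out (their core is summed).
    """
    prod = np.eye(cores[0].shape[1])
    for j, core in enumerate(cores):
        prod = prod @ (core[values[j]] if j in values else core.sum(axis=0))
    return np.trace(prod)

def tr_probability(cores, values):
    # Normalizing constant: every variable marginalized out.
    return tr_unnormalized(cores, values) / tr_unnormalized(cores, {})

def tr_conditional(cores, target, given):
    """p(target | given) as a ratio of joint and marginal probabilities."""
    return tr_probability(cores, {**given, **target}) / tr_probability(cores, given)

rng = np.random.default_rng(0)
cores = [rng.random((3, 2, 2)) for _ in range(4)]     # hypothetical cores
p_marginal = tr_probability(cores, {0: 2, 3: 1})
p_cond = tr_conditional(cores, target={1: 0}, given={0: 2, 3: 1})
```

Each query costs one pass over the cores, so the full tensor is never materialized.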

2.2 Continuous Distributions parameterized with Tensor Ring Decomposition

In this section, we apply the Tensor Ring decomposition to continuous distributions over vectors $z = [z_1, \dots, z_d]$. In our Learnable Prior model, we assume that each component $z_k$ is a Gaussian Mixture Model with fully factorized components. The joint distribution $p(z)$ is a multidimensional Gaussian Mixture Model with modes placed in the nodes of a multidimensional lattice (Figure 1(a)). The latent discrete variables $s_1, \dots, s_d$ indicate the index of the mixture component for each dimension ($s_k$ corresponds to the $k$-th dimension of the latent code $z_k$):

$$p(z_{1:d}) = \sum_{s_{1:d}} p(s_{1:d})\, p(z_{1:d} \mid s_{1:d}) \propto \sum_{s_{1:d}} \widehat{P}[s_{1:d}] \prod_{j=1}^{d} \mathcal{N}(z_j \mid \mu_j^{s_j}, \sigma_j^{s_j}) \tag{3}$$

Here, $p(s)$ is a discrete distribution of prior probabilities of mixture components, which we store as a tensor $\widehat{P}[s]$ in a Tensor Ring decomposition. Note that $p(s)$ is not a factorized distribution, and the learnable prior $p(z)$ may learn complex weightings of the mixture components. We call the family of distributions parameterized in this form a Tensor Ring Induced Prior (TRIP) and denote its learnable parameters (cores, means, and standard deviations) as $\psi$:

$$\psi = \big\{Q_1, \dots, Q_d,\; \mu_1^{0}, \dots, \mu_d^{N_d-1},\; \sigma_1^{0}, \dots, \sigma_d^{N_d-1}\big\} \tag{4}$$

To highlight that the prior distribution is learnable, we further write it as $p_\psi(z)$. As we show later, we can optimize $\psi$ directly using gradient descent for VAE models and REINFORCE Williams (1992) for GANs.

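An important property of the proposed TRIP family is that we can derive its one-dimensional conditional distributions in a closed form. For example, to sample using a chain rule, we need distributions $p_\psi(z_k \mid z_{1:k-1})$:

$$p_\psi(z_k \mid z_{1:k-1}) = \sum_{s_k=0}^{N_k-1} p_\psi(s_k \mid z_{1:k-1})\, p_\psi(z_k \mid s_k, z_{1:k-1}) = \sum_{s_k=0}^{N_k-1} p_\psi(s_k \mid z_{1:k-1})\, \mathcal{N}(z_k \mid \mu_k^{s_k}, \sigma_k^{s_k}) \tag{5}$$

From Equation 5 we notice that one-dimensional conditional distributions are Gaussian Mixture Models with the same means and variances as priors, but with different weights $p_\psi(s_k \mid z_{1:k-1})$ (see Supplementary Materials).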
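
Chain-rule sampling from such a prior can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: all dimensions share the same number of components and the same core size, `mu` and `sigma` have shape `(d, N)`, and the names `suffix_products`, `conditional_weights`, and `sample_trip` are hypothetical. Each step reweights the mixture components of the current dimension given the already sampled values, then draws a component and a Gaussian value.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def suffix_products(cores):
    """suffixes[k] = product of summed (marginalized) cores for dimensions after k."""
    suffixes = [None] * len(cores)
    acc = np.eye(cores[-1].shape[2])
    for k in reversed(range(len(cores))):
        suffixes[k] = acc
        acc = cores[k].sum(axis=0) @ acc
    return suffixes

def conditional_weights(cores, suffixes, prefix, k):
    """Mixture weights of dimension k given the values absorbed into `prefix`."""
    # w[s] is proportional to Tr(prefix @ cores[k][s] @ suffixes[k])
    w = np.einsum('ij,sjk,ki->s', prefix, cores[k], suffixes[k])
    return w / w.sum()

def sample_trip(cores, mu, sigma, rng):
    d = len(cores)
    suffixes = suffix_products(cores)
    prefix = np.eye(cores[0].shape[1])
    z = np.empty(d)
    for k in range(d):
        w = conditional_weights(cores, suffixes, prefix, k)
        s = rng.choice(len(w), p=w)                   # pick a mixture component
        z[k] = rng.normal(mu[k, s], sigma[k, s])      # sample from that Gaussian
        dens = gaussian_pdf(z[k], mu[k], sigma[k])    # densities of all components
        prefix = prefix @ np.einsum('s,sij->ij', dens, cores[k])
    return z

rng = np.random.default_rng(0)
cores = [rng.random((3, 2, 2)) for _ in range(3)]     # hypothetical parameters
mu = rng.normal(size=(3, 3))
sigma = np.full((3, 3), 0.5)
z = sample_trip(cores, mu, sigma, rng)
```

The sketch multiplies raw densities, so a long chain would need the rescaling discussed in the implementation details to avoid underflow.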

Computations for marginal probabilities in the general case are shown in Algorithm 1; conditional probabilities can be computed as a ratio between the joint and marginal probabilities. Note that we compute a normalizing constant on-the-fly.

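  Input: A set $M$ of variable indices for which we compute the probability, and values of these latent codes $z_i$ for $i \in M$
  Output: Joint log-probability $\log p(z_M)$, where $z_M = \{z_i \;\forall i \in M\}$
  Initialize $Q_{buff} = I \in \mathbb{R}^{m_1 \times m_1}$, $Q_{norm} = I \in \mathbb{R}^{m_1 \times m_1}$
  for $j = 1$ to $d$ do
     if $z_j$ is marginalized out ($j \notin M$) then
        $Q_{buff} = Q_{buff} \cdot \big(\sum_{k=0}^{N_j-1} Q_j[k]\big)$
     else
        $Q_{buff} = Q_{buff} \cdot \big(\sum_{k=0}^{N_j-1} Q_j[k] \cdot \mathcal{N}(z_j \mid \mu_j^{k}, \sigma_j^{k})\big)$
     end if
     $Q_{norm} = Q_{norm} \cdot \big(\sum_{k=0}^{N_j-1} Q_j[k]\big)$
  end for
  $\log p(z_M) = \log \mathrm{Tr}(Q_{buff}) - \log \mathrm{Tr}(Q_{norm})$
Algorithm 1 Calculation of marginal probabilities in TRIP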
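
A direct NumPy transcription of this procedure might look as follows. It is a sketch under assumptions: the function name `trip_log_prob`, the dictionary interface for observed dimensions, and the `(d, N)` layout of `mu` and `sigma` are hypothetical choices, and no numerical stabilization is applied.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def trip_log_prob(cores, mu, sigma, z_observed):
    """Log-density of the observed dimensions; all others are marginalized out.

    z_observed maps a dimension index j to the value z_j.
    """
    m = cores[0].shape[1]
    q_buff, q_norm = np.eye(m), np.eye(m)
    for j, core in enumerate(cores):
        summed = core.sum(axis=0)
        if j in z_observed:
            # Weight every component slice by its Gaussian density at z_j.
            dens = gaussian_pdf(z_observed[j], mu[j], sigma[j])
            q_buff = q_buff @ np.einsum('s,sij->ij', dens, core)
        else:
            q_buff = q_buff @ summed
        q_norm = q_norm @ summed          # normalizing constant, computed on the fly
    return np.log(np.trace(q_buff)) - np.log(np.trace(q_norm))

rng = np.random.default_rng(0)
cores = [rng.random((3, 2, 2)) for _ in range(3)]     # hypothetical parameters
mu = rng.normal(size=(3, 3))
sigma = np.full((3, 3), 0.5)
log_p_joint = trip_log_prob(cores, mu, sigma, {0: 0.1, 1: -0.3, 2: 0.7})
log_p_marginal = trip_log_prob(cores, mu, sigma, {1: -0.3})
```

A conditional log-density is then the difference of two such calls, for example `log_p_joint - log_p_marginal` for the first and third dimensions given the second.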

3 Generative Models With Tensor Ring Induced Prior

In this section, we describe how popular generative models—Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs)—can benefit from using Tensor Ring Induced Prior.

3.1 Variational Autoencoder

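Variational Autoencoder (VAE) Kingma and Welling (2013); Rezende et al. (2014) is an autoencoder-based generative model that maps data points $x$ onto a latent space with a probabilistic encoder $q_\phi(z \mid x)$ and reconstructs objects with a probabilistic decoder $p_\theta(x \mid z)$. We used a Gaussian encoder with the reparameterization trick:

$$q_\phi(z \mid x) = \mathcal{N}\big(z \mid \mu_\phi(x), \sigma_\phi(x)\big), \qquad z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{6}$$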

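The most common choice for a prior distribution $p_\psi(z)$ in the latent space is a standard Gaussian distribution $\mathcal{N}(0, I)$. VAEs are trained by maximizing the lower bound of the log marginal likelihood $\log p(x)$, also known as the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta, \phi, \psi) = \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\psi(z)\big) \tag{7}$$

where $\mathrm{KL}$ is a Kullback-Leibler divergence. We get an unbiased estimate of $\mathcal{L}(\theta, \phi, \psi)$ by sampling $\epsilon_i \sim \mathcal{N}(0, I)$ and computing a Monte Carlo estimate

$$\mathcal{L}(\theta, \phi, \psi) \approx \frac{1}{l} \sum_{i=1}^{l} \log \frac{p_\theta(x \mid z_i)\, p_\psi(z_i)}{q_\phi(z_i \mid x)}, \qquad z_i = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon_i \tag{8}$$

When $p_\psi(z)$ is a standard Gaussian, the $\mathrm{KL}$ term can be computed analytically, reducing the estimation variance.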
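
The difference between the analytic and the sampled KL term can be illustrated with a small NumPy sketch. The encoder outputs `mu` and `sigma` below are made-up numbers, and `log_prior` is a placeholder for any prior log-density (for a learnable prior, only the Monte Carlo route is available).

```python
import numpy as np

def log_normal(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma)
                  - 0.5 * np.log(2.0 * np.pi), axis=-1)

def kl_analytic(mu, sigma):
    """Closed-form KL between N(mu, diag(sigma^2)) and a standard Gaussian."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma))

def kl_monte_carlo(mu, sigma, log_prior, n_samples, rng):
    """Estimate E_q[log q(z) - log p(z)] with reparameterized samples."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + sigma * eps
    return np.mean(log_normal(z, mu, sigma) - log_prior(z))

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -1.0]), np.array([0.8, 1.2])   # hypothetical encoder outputs
std_normal = lambda z: log_normal(z, 0.0, 1.0)
kl_exact = kl_analytic(mu, sigma)
kl_mc = kl_monte_carlo(mu, sigma, std_normal, 200_000, rng)
```

With a single sample per data point, as is typical during training, the sampled estimate is much noisier than the closed form.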

For VAEs, flexible priors give a tighter evidence lower bound Tomczak and Welling (2018); Chen et al. (2017) and can help with the problem of the decoder ignoring the latent codes Chen et al. (2017); van den Oord et al. (2017). In this work, we parameterize the learnable prior as a Tensor Ring Induced Prior model and train its parameters jointly with the encoder and decoder (Figure 2). We call this model a Variational Autoencoder with Tensor Ring Induced Prior (VAE-TRIP). We initialize the means and the variances by fitting 1D Gaussian Mixture models for each component using samples from the latent codes and initialize cores with a Gaussian noise. We then re-initialize means, variances and cores after the first epoch, and repeat this procedure every 5 epochs.

Figure 2: A Variational Autoencoder with a Tensor Ring Induced Prior (VAE-TRIP).

3.2 Generative Adversarial Networks

Figure 3: A Generative Adversarial Network with a Tensor Ring Induced Prior (GAN-TRIP).

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) consist of two networks: a generator $G$ that maps latent codes $z$ to objects $x$, and a discriminator $D$ that maps objects $x$ to the interval $[0, 1]$. The discriminator is trying to distinguish real objects from objects produced by a generator. The generator, on the other hand, is trying to produce objects that the discriminator considers real. The optimization setup for all models from the GAN family is a min-max problem. For the standard GAN, the learning procedure alternates between optimizing the generator and the discriminator networks with a gradient descent/ascent:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p(x)} \log D(x) + \mathbb{E}_{z \sim p(z)} \log\big(1 - D(G(z))\big) \tag{9}$$

Similar to VAE, the prior distribution is usually a standard Gaussian, although Gaussian Mixture Models were also previously studied Gurumurthy et al. (2017). In this work, we use a TRIP family of distributions to parameterize a multimodal prior of GANs (Figure 3). We expect that having multiple modes as the prior improves the overall quality of generation and helps to avoid anomalies during sampling, such as partially present eyeglasses.

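During training, we sample multiple latent codes from the prior $p_\psi(z)$ and use REINFORCE Williams (1992) to propagate the gradient through the parameters $\psi$. We reduce the variance by using average discriminator output as a baseline:

$$\nabla_\psi \mathbb{E}_{z \sim p_\psi(z)} \log\big(1 - D(G(z))\big) \approx \frac{1}{l} \sum_{i=1}^{l} \nabla_\psi \log p_\psi(z_i) \Big[ d_i - \frac{1}{l} \sum_{j=1}^{l} d_j \Big] \tag{10}$$

where $d_i = \log\big(1 - D(G(z_i))\big)$ is the discriminator’s output and $z_i$ are samples from the prior $p_\psi(z)$. We call this model a Generative Adversarial Network with Tensor Ring Induced Prior (GAN-TRIP). We initialize means uniformly in a range and standard deviations as .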
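
The score-function (REINFORCE) estimator with a mean baseline can be checked on a toy problem where the answer is known analytically. In the sketch below the prior is a one-dimensional Gaussian with a learnable mean and the "discriminator output" is simply `z ** 2`; both are stand-ins chosen for illustration and are not part of the GAN-TRIP model.

```python
import numpy as np

def reinforce_gradient(score, d):
    """Score-function gradient estimate with the batch mean as a baseline.

    score: array (l, n_params), rows are gradients of log p(z_i) w.r.t. the parameters
    d:     array (l,), per-sample objective values
    """
    advantage = d - d.mean()
    return (score * advantage[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
psi = 1.5                                   # toy prior: z ~ N(psi, 1)
z = rng.normal(psi, 1.0, size=200_000)
score = (z - psi)[:, None]                  # gradient of log N(z | psi, 1) w.r.t. psi
d = z ** 2                                  # toy objective; E[d] = psi^2 + 1
grad = reinforce_gradient(score, d)         # analytic gradient is 2 * psi
```

Subtracting the batch mean leaves the estimate nearly unbiased for large batches while removing the part of the variance caused by the overall offset of `d`.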

4 Conditional Generation

In the conditional generation problem, data objects $x$ (for example, face images) are coupled with properties $y$ describing the objects (for example, sex and hair color). The goal of this model is to learn a distribution $p(x \mid y)$ that produces objects with specified attributes. Some of the attributes $y$ for a given $x$ may be unknown ($y_{un}$), and the model should learn solely from observed attributes ($y_{ob}$): $p(x \mid y_{ob})$.

For VAE-TRIP, we train a joint model $p_\psi(z, y)$ on all attributes $y$ and latent codes $z$ parameterized with a Tensor Ring. For discrete conditions, the joint distribution is:

$$p(z, y) = \sum_{s_{1:d}} \widetilde{P}[s_{1:d}, y] \prod_{j=1}^{d} \mathcal{N}(z_j \mid \mu_j^{s_j}, \sigma_j^{s_j}) \tag{11}$$

where tensor $\widetilde{P}[s_{1:d}, y]$ is represented in a Tensor Ring decomposition. In this work, we focus on discrete attributes, although we can extend the model to continuous attributes with Gaussian Mixture Models as we did for the latent codes.

With the proposed parameterization, we can marginalize out missing attributes and compute conditional probabilities. We can efficiently compute both probabilities similar to Algorithm 1.

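For the conditional VAE model, the lower bound on $\log p(x, y_{ob})$ is:

$$\widetilde{\mathcal{L}}(\theta, \phi, \psi) = \mathbb{E}_{q_\phi(z \mid x, y_{ob})} \log p_\theta(x, y_{ob} \mid z) - \mathrm{KL}\big(q_\phi(z \mid x, y_{ob}) \,\|\, p_\psi(z)\big) \tag{12}$$

We simplify the lower bound by making two restrictions. First, we assume that the conditions $y$ are fully defined by the object $x$, which implies $q_\phi(z \mid x, y_{ob}) = q_\phi(z \mid x)$. For example, an image with a person wearing a hat defines the presence of a hat. The second restriction is that we can reconstruct an object directly from its latent code: $p_\theta(x \mid z, y_{ob}) = p_\theta(x \mid z)$. This restriction also gives:

$$p_\theta(x, y_{ob} \mid z) = p_\theta(x \mid z, y_{ob})\, p_\psi(y_{ob} \mid z) = p_\theta(x \mid z)\, p_\psi(y_{ob} \mid z) \tag{13}$$

The resulting Evidence Lower Bound is

$$\widetilde{\mathcal{L}}(\theta, \phi, \psi) = \mathbb{E}_{q_\phi(z \mid x)} \big[ \log p_\theta(x \mid z) + \log p_\psi(y_{ob} \mid z) \big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\psi(z)\big) \tag{14}$$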

In the proposed model, an autoencoder learns to map objects onto a latent manifold, while the TRIP prior finds areas on the manifold corresponding to objects with the specified attributes.

The quality of the model depends on the order of the latent codes and the conditions in the tensor $\widetilde{P}[s_{1:d}, y]$, since the Tensor Ring poorly captures dependence between variables that are far apart. In our experiments, we found that randomly permuting latent codes and conditions gives good results.

We can train the proposed model on partially labeled datasets and use it to draw conditional samples with partially specified constraints. For example, we can ask the model to generate images of men in hats, not specifying hair color or the presence of glasses.

5 Related Work


Figure 4: Visualization of the first two dimensions of the learned prior $p_\psi(z_1, z_2)$. Left: VAE-TRIP, Right: WGAN-GP-TRIP.

The most common generative models are based on Generative Adversarial Networks Goodfellow et al. (2014) or Variational Autoencoders Kingma and Welling (2013). Both GAN and VAE models usually use continuous unimodal distributions (like a standard Gaussian) as a prior. A space of natural images, however, is multimodal: a person either wears glasses or not—there are no intermediate states. Although generative models are flexible enough to transform unimodal distributions to multimodal, they tend to ignore some modes (mode collapse) or produce images with artifacts (half-present glasses).

A few models with learnable prior distributions were proposed. Tomczak and Welling (2018) used a Gaussian mixture model based on encoder proposals as a prior on the latent space of VAE. Chen et al. (2017) and Rezende and Mohamed (2015) applied normalizing flows Dinh et al. (2015); Kingma et al. (2016); Dinh et al. (2017) to transform a standard normal prior into a more complex latent distribution. Chen et al. (2017); van den Oord et al. (2017) applied auto-regressive models to learn better prior distribution over the latent variables. Bauer and Mnih (2019) proposed to update a prior distribution of a trained VAE to avoid samples that have low marginal posterior, but high prior probability.

Similar to Tensor Ring decomposition, a Tensor-Train decomposition Oseledets (2011) is used in machine learning and numerical methods to represent tensors with a small number of parameters. Tensor-Train was applied to the compression of fully connected Novikov et al. (2015), convolutional Garipov et al. (2016) and recurrent Tjandra et al. (2017) layers. In our models, we can use a Tensor-Train decomposition instead of Tensor Ring, but it requires larger cores to achieve comparable results, as first and last dimensions are farther apart.

Most conditional models work with missing values by imputing them with a predictive model or setting them to a special value. With this approach, we cannot sample objects specifying conditions partially. The VAE TELBO model Vedantam et al. (2018) proposes to train a Product of Experts-based model, where the posterior on the latent codes is approximated as $q(z \mid y_{ob}) \propto p(z) \prod_{y_i \in y_{ob}} q(z \mid y_i)$, which requires training a separate posterior model for each condition. The JMVAE model Suzuki et al. (2017) contains three encoders that take both image and condition, only a condition, or only an image.

6 Experiments

We conducted experiments on CelebFaces Attributes Dataset (CelebA) Liu et al. (2015) of approximately photos with a random train-test split. For conditional generation, we selected binary image attributes, including sex, hair color, and the presence of a mustache and a beard. We compared both GAN and VAE models with and without TRIP. We also compared our best model with known approaches on the CIFAR-10 Krizhevsky et al. dataset with a standard split. Model architecture and training details are provided in Supplementary Materials.

6.1 Generating Objects With VAE-TRIP and GAN-TRIP

| Metric | Model | $\mathcal{N}(0, I)$ | GMM (F) | GMM (L) | TRIP (F) | TRIP (L) |
|---|---|---|---|---|---|---|
| FID | VAE | 86.72 | 85.64 | 84.48 | 85.31 | 83.54 |
| FID | WGAN | 63.46 | 67.10 | 61.82 | 62.48 | 57.6 |
| FID | WGAN-GP | 54.71 | 57.82 | 62.10 | 63.06 | 52.86 |
| ELBO | VAE | -194.16 | -201.60 | -193.88 | -202.04 | -193.32 |
| ELBO (importance-weighted) | VAE | -185.09 | -191.99 | -184.73 | -190.09 | -184.43 |

Table 1: FID for GAN and VAE-based architectures trained on CelebA dataset, and ELBO for VAE. F = Fixed, L = Learnable. We also report ELBO for importance-weighted autoencoder with points Burda et al. (2015)

We evaluate GAN-based models with and without Tensor Ring Learnable Prior by measuring a Fréchet Inception Distance (FID). For the baseline models, we used Wasserstein GAN (WGAN) Arjovsky et al. (2017) and Wasserstein GAN with Gradient Penalty (WGAN-GP) Gulrajani et al. (2017) on CelebA dataset. We also compared learnable priors with fixed randomly initialized parameters $\psi$. The results in Table 1 (CelebA) and Table 2 (CIFAR-10) suggest that with a TRIP prior the quality improves compared to standard models and models with GMM priors. In some experiments, the GMM-based model performed worse than a standard Gaussian, since the KL term had to be estimated with Monte Carlo sampling, resulting in higher gradient variance.

| Model | FID |
|---|---|
| SN-GANs Miyato et al. (2018) | 21.7 |
| WGAN-GP + Two Time-Scale Heusel et al. (2017) | 24.8 |
| WGAN-GP Gulrajani et al. (2017) | 29.3 |
| WGAN-GP-TRIP (ours) | 16.72 |

Table 2: FID for CIFAR-10 GAN-based models

6.2 Visualization of TRIP

In Figure 4, we visualize the first two dimensions of the learned prior in VAE-TRIP and WGAN-GP-TRIP models. For both models, the prior uses most of the components to produce a complex distribution. Also, notice that the components learned different non-uniform weights.

6.3 Generated Images

Here, we visualize the correspondence of modes and generated images by a procedure that we call mode hopping. We start by randomly sampling a latent code and producing the first image. After that, we randomly select five dimensions and sample them conditioned on the remaining dimensions. We repeat this procedure multiple times and obtain a sequence of sampled images shown in Figure 5. With these results, we see that similar images are localized in the learned prior space, and changes in a few dimensions change only a few fine-grained features.

Figure 5: Mode hopping in WGAN-GP-TRIP. We start with a random sample from the prior and conditionally sample five random dimensions on each iteration. Each line shows a single trajectory.

6.4 Generated Conditional Images

In this experiment, we generate images given a subset of attributes to estimate the diversity of generated images. For example, if we specify ‘Young man,’ we would expect different images to have different hair colors and to vary in the presence of glasses or a hat. The generated images shown in Table 3 indicate that the model learned to produce diverse images with multiple varying attributes.

Table 3: Generated images with VAE-TRIP for different attributes. Rows (images omitted): Young man; Smiling woman in eyeglasses; Smiling woman with a hat; Blond woman with eyeglasses.

7 Discussion

We designed our prior utilizing Tensor Ring decomposition due to its higher representation capacity compared to other decompositions. For example, a Tensor Ring with core size $m$ has the same capacity as a Tensor-Train with core size $m^2$ Levine et al. (2018). Although the prior contains an exponential number of modes, in our experiments, its learnable parameters accounted for less than of total weights, which did not cause overfitting. The results can be improved by increasing the core size $m$; however, the computational complexity has a cubic growth with the core size. We also implemented a conditional GAN but found the REINFORCE-based training of this model very unstable. Further research with variance reduction techniques might improve this approach.
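
The gap between the number of modes and the number of parameters can be made concrete with a short calculation. The sizes below are hypothetical and chosen only to make the arithmetic easy; they are not the settings used in the experiments, and every dimension is assumed to share the same number of Gaussians and the same core size.

```python
def trip_mode_count(d, n):
    """Number of lattice nodes: n Gaussians along each of d dimensions."""
    return n ** d

def trip_parameter_count(d, n, m):
    """Cores hold d * n * m * m numbers; means and standard deviations add 2 * d * n."""
    return d * n * m * m + 2 * d * n

d, n, m = 100, 10, 10           # hypothetical sizes, not the paper's configuration
modes = trip_mode_count(d, n)               # 10 ** 100 mixture components
params = trip_parameter_count(d, n, m)      # 102000 learnable numbers
```

The parameter count is linear in `d` and quadratic in `m`, while each of the `d` matrix products in a density evaluation costs on the order of `m ** 3` operations, which is the cubic growth mentioned above.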

8 Acknowledgements

Image generation for Section 6.3 was supported by the Russian Science Foundation grant no. 17-71-20072.


  • Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. International Conference on Learning Representations, 2018.
  • Ping et al. (2018) Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. International Conference on Learning Representations, 2018.
  • van den Oord et al. (2018) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. 2018.
  • Polykovskiy et al. (2018) Daniil Polykovskiy, Alexander Zhebrak, Dmitry Vetrov, Yan Ivanenkov, Vladimir Aladinskiy, Polina Mamoshina, Marine Bozdaganyan, Alexander Aliper, Alex Zhavoronkov, and Artur Kadurin. Entangled conditional adversarial autoencoder for de novo drug discovery. Mol. Pharm., September 2018.
  • Zhavoronkov et al. (2019) Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy, Anastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov, Arip Asadulaev, et al. Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature biotechnology, pages 1–4, 2019.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in neural information processing systems, pages 2672–2680, 2014.
  • Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. International Conference on Learning Representations, 2013.
  • Ben-Yosef and Weinshall (2018) Matan Ben-Yosef and Daphna Weinshall. Gaussian Mixture Generative Adversarial Networks for Diverse Datasets, and the Unsupervised Clustering of Images. arXiv preprint arXiv:1808.10356, 2018.
  • Pan et al. (2018) Lili Pan, Shen Cheng, Jian Liu, Yazhou Ren, and Zenglin Xu. Latent dirichlet allocation in generative adversarial networks. arXiv preprint arXiv:1812.06571, 2018.
  • Tomczak and Welling (2018) Jakub M Tomczak and Max Welling. VAE with a VampPrior. International Conference on Artificial Intelligence and Statistics, 2018.
  • Zhao et al. (2016) Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki. Tensor Ring Decomposition. arXiv preprint arXiv:1606.05535, 2016.
  • Williams (1992) Ronald J Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine learning, 8(3-4):229–256, 1992.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning, 2014.
  • Chen et al. (2017) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. International Conference on Learning Representations, 2017.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 2017.
  • Gurumurthy et al. (2017) Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. DeLiGAN: Generative Adversarial Networks for Diverse and Limited Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 166–174, 2017.
  • Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. International Conference on Machine Learning, 2015.
  • Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. International Conference on Learning Representations Workshop, 2015.
  • Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
  • Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density Estimation Using Real NVP. International Conference on Learning Representations, 2017.
  • Bauer and Mnih (2019) Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. 89:66–75, 16–18 Apr 2019. URL
  • Oseledets (2011) Ivan V Oseledets. Tensor-Train Decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
  • Novikov et al. (2015) Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing Neural Networks. Advances in Neural Information Processing Systems, 2015.
  • Garipov et al. (2016) Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate Tensorization: Compressing Convolutional and FC Layers Alike. Advances in Neural Information Processing Systems, 2016.
  • Tjandra et al. (2017) Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing Recurrent Neural Network with Tensor Train. International Joint Conference on Neural Networks, 2017.
  • Vedantam et al. (2018) Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative Models of Visually Grounded Imagination. International Conference on Learning Representations, 2018.
  • Suzuki et al. (2017) Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint Multimodal Learning with Deep Generative Models. International Conference on Learning Representations Workshop, 2017.
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), 12 2015.
  • (29) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL
  • Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning, 2017.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved Training of Wasserstein Gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
  • Levine et al. (2018) Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum entanglement: Fundamental connections with implications to network design. International Conference on Learning Representations, 2018.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
  • Kingma et al. (2014) Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised Learning with Deep Generative Models. In Advances in neural information processing systems, pages 3581–3589, 2014.

Appendix A Derivations for one-dimensional conditional distributions

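In the paper, we stated that one-dimensional conditional distributions are Gaussian Mixture Models with the same means and variances as priors, but with different weights $p_\psi(s_k \mid z_{1:k-1})$. With Tensor Ring decomposition, we can efficiently compute those weights (we denote $\mathcal{N}(z_j \mid \mu_j^{s_j}, \sigma_j^{s_j})$ as $\mathcal{N}_j^{s_j}$):

$$p_\psi(s_k \mid z_{1:k-1}) = \frac{p_\psi(s_k, z_{1:k-1})}{p_\psi(z_{1:k-1})} \propto \sum_{s_{1:k-1}} \sum_{s_{k+1:d}} \widehat{P}[s_{1:d}] \prod_{j=1}^{k-1} \mathcal{N}_j^{s_j} = \mathrm{Tr}\Big( \prod_{j=1}^{k-1} \Big( \sum_{s_j=0}^{N_j-1} Q_j[s_j]\, \mathcal{N}_j^{s_j} \Big) \cdot Q_k[s_k] \cdot \prod_{j=k+1}^{d} \widetilde{Q}_j \Big), \qquad \widetilde{Q}_j = \sum_{s_j=0}^{N_j-1} Q_j[s_j] \tag{15}$$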

Appendix B Calculation of marginal probabilities in Tensor Ring

In Algorithm 2 we show how to compute marginal probabilities for a distribution parameterized in Tensor Ring format. Note that we compute a normalizing constant on-the-fly.

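  Input: A set $M$ of variable indices, values of these variables $r_i$ for $i \in M$
  Output: Joint log-probability $\log p(r_M)$, where $r_M = \{r_i \;\forall i \in M\}$
  Initialize $Q_{buff} = I \in \mathbb{R}^{m_1 \times m_1}$, $Q_{norm} = I \in \mathbb{R}^{m_1 \times m_1}$
  for $j = 1$ to $d$ do
     if $r_j$ is marginalized out ($j \notin M$) then
        $Q_{buff} = Q_{buff} \cdot \big(\sum_{k=0}^{N_j-1} Q_j[k]\big)$
     else
        $Q_{buff} = Q_{buff} \cdot Q_j[r_j]$
     end if
     $Q_{norm} = Q_{norm} \cdot \big(\sum_{k=0}^{N_j-1} Q_j[k]\big)$
  end for
  $\log p(r_M) = \log \mathrm{Tr}(Q_{buff}) - \log \mathrm{Tr}(Q_{norm})$
Algorithm 2 Calculation of marginal probabilities in Tensor Ring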

Appendix C Model architecture

We manually tuned the hyperparameters: first we selected the best encoder-decoder architecture for a Gaussian prior and then tuned TRIP parameters for a fixed architecture. For models from a GAN family, we used a deconvolutional generator with kernel size

and ReLU activations. The number of channels in layers was

. For the discriminator, we used the symmetric convolutional architecture with a LeakyReLU. We trained a model using Adam Kingma and Ba (2015) optimizer with a learning rate of for iterations with a batch size . We used a schedule of discriminator updates per one generator update. A TRIP prior was -dimensional with Gaussians per dimension and core size (sizes of matrices ). For a baseline Gaussian Mixture Model (GMM) prior we used Gaussians. We conducted all the experiments on Tesla K80.

For VAE models, we used a convolutional encoder and a deconvolutional decoder with a kernel size , and the number of channels for the encoder, and a symmetrical architecture for the decoder. We used LeakyReLU for the encoder and ReLU for the decoder. We trained the model for weight updates with batch size . The latent dimension was for all VAE-based models. For TRIP we used Gaussians per dimension and a Tensor Ring with core size . For a GMM prior we used Gaussians.

For conditional generation with TRIP, the architecture was the same as for unconditional generation. For CVAE we parameterized a posterior model as a fully connected network with layer sizes and LeakyReLU activations. For the VAE TELBO baseline model Vedantam et al. (2018), we used a fully connected network for with layer sizes and LeakyReLU activations.

Appendix D Implementation details

Implementing the TRIP module is straightforward and requires two functions. The first function, which we use during training, computes the log-density $\log p_\psi(z)$ for an arbitrary subset of latent dimensions. The second function is used for sampling: it samples from $p_\psi(z)$ with a chain rule, for which calculations are described in Eq 15.

During training we enforce values of cores to be non-negative by replacing each element of tensors with their absolute values before computation. To make computations more stable, we divide and by the at each iteration when computing .
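
The two stabilization steps can be sketched as follows. The choice of normalizer is an assumption made for this illustration (the sum of the entries of the running product); the function names are hypothetical, and the sketch only covers the normalizing constant.

```python
import numpy as np

def stable_log_trace(matrices):
    """log Tr(M_1 M_2 ... M_d) for non-negative matrices, rescaling after every product."""
    prod = np.eye(matrices[0].shape[0])
    log_scale = 0.0
    for mat in matrices:
        prod = prod @ mat
        scale = prod.sum()            # assumed normalizer: sum of all entries
        prod = prod / scale
        log_scale += np.log(scale)    # remember what was divided out
    return np.log(np.trace(prod)) + log_scale

def trip_log_normalizer(raw_cores):
    """Log normalizing constant of a Tensor Ring with unconstrained raw cores."""
    cores = [np.abs(q) for q in raw_cores]            # enforce non-negativity
    return stable_log_trace([q.sum(axis=0) for q in cores])

rng = np.random.default_rng(0)
raw_cores = [rng.standard_normal((3, 2, 2)) for _ in range(5)]   # hypothetical cores
log_z = trip_log_normalizer(raw_cores)
```

Without the rescaling, a long chain of small matrices can underflow to zero before the logarithm is taken; accumulating the log of each scale factor keeps the result finite.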

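| Core size $m$ | CIFAR-10 ELBO | CIFAR-10 Reconstruction | CIFAR-10 KL | CelebA ELBO | CelebA Reconstruction | CelebA KL |
|---|---|---|---|---|---|---|
| 1 | -89.5 | 60.5 | 29.0 | -243.40 | 177.63 | 65.76 |
| 5 | -89.3 | 60.2 | 29.1 | -231.57 | 166.89 | 64.67 |
| 10 | -89.3 | 60.4 | 28.9 | -223.59 | 156.99 | 66.60 |
| 20 | -89.1 | 60.2 | 28.9 | -215.62 | 158.95 | 56.67 |

Table 4: Impact of core size (CIFAR-10 and CelebA)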

Appendix E Impact of core size

In Table 4 we compare the performance of the VAE-TRIP model with different core sizes on CIFAR-10 and CelebA datasets. Note that for $m = 1$, TRIP is factorized over dimensions, where each dimension is a 1D Gaussian Mixture Model. Notice that models with higher core sizes perform better as the prior becomes more complex. In Table 5 we show the computational complexity and memory usage of the TRIP model to illustrate the trade-off between quality and computational cost.

| Core size $m$ | log-likelihood, ms | sampling, ms | Memory, MB |
|---|---|---|---|
| 1 | 126 ± 7 | 201 ± 21 | 0.023 |
| 10 | 137 ± 4 | 232 ± 13 | 0.77 |
| 20 | 193 ± 15 | 312 ± 18 | 3.1 |
| 50 | 200 ± 20 | 360 ± 17 | 19.5 |
| 100 | 308 ± 12 | 882 ± 15 | 78.1 |

Table 5: Time and memory consumption of operations with prior (per batch). $m$ is a core size, latent space dimension , number of Gaussians per dimension , batch size . Other parameters are the same as used in the paper. We performed the experiments on Tesla K80. ms stands for milliseconds, MB stands for megabytes. Results are averaged over 10 runs; we report mean ± std.
| Model | 0% missing | 90% missing | 99% missing |
|---|---|---|---|
| CVAE Kingma et al. (2014) | 86.69 | 85.31 | 84.61 |
| VAE TELBO Vedantam et al. (2018) | 82.80 | 74.87 | 73.92 |
| JMVAE Suzuki et al. (2017) | 81.87 | 80.65 | 73.68 |
| VAE-TRIP (ours) | 88.7 | 87.08 | 84.89 |

Table 6: Condition satisfaction (accuracy) for conditional generative models with different rates of missing attributes in the training set.

E.1 Conditional Generation

For the conditional generation, we used images of size . We study the model performance for different rates of missing attributes (0%, 90%, and 99%). For each model, we generated

images for randomly sampled complete sets of attributes from the test set. We trained a predictive convolutional neural network on a validation set to predict the attributes with

accuracy and predicted the attributes of generated images. We report the condition matching accuracy—when requested attributes matched the actual attributes. We trained all models except for CVAE Kingma et al. (2014) directly on data with missing attributes. For CVAE, we imputed missing values with a predictive model. For the missing rate of , the predictive test accuracy was , and for . In the results shown in Table 6, we see that the VAE-TRIP model outperforms other baselines.

GMM TRIP Combination with Flow
Parameters (model) 11.4M 11.1M 10.7M 11.3M 10.7M 10.4M
Parameters (prior) 0 0.2M 0.6M 0.3M 0.5M 0.7M
Parameters (total) 11.4M 11.3M 11.1M 11.5M 11.2M 11.1M
ELBO -192.6 -190.05 -189.1 -185.3 -186.0 -184.7
Table 7: Preliminary results on combining TRIP and normalizing flows to form a prior; Number of parameters of model components

E.2 Additional experiments for VAE

In Table 7 we compare VAE models with Gaussian, GMM, and TRIP priors with a comparable number of parameters. We also provide preliminary results on combining normalizing flows with a TRIP prior.