1 Introduction
Modern generative models are widely applied to the generation of realistic and diverse images, text, and audio Karras et al. (2018); Ping et al. (2018); van den Oord et al. (2018); Polykovskiy et al. (2018); Zhavoronkov et al. (2019). Generative Adversarial Networks (GAN) Goodfellow et al. (2014), Variational Autoencoders (VAE) Kingma and Welling (2013), and their variations are the most commonly used neural generative models. Both architectures learn a mapping from some prior distribution p(z), usually a standard Gaussian, to the data distribution. Previous work showed that richer prior distributions can improve generative models: they reduce mode collapse for GANs BenYosef and Weinshall (2018); Pan et al. (2018) and yield a tighter Evidence Lower Bound (ELBO) for VAEs Tomczak and Welling (2018).
Figure 1: (a) The TRIP distribution is a multidimensional Gaussian Mixture Model with an exponentially large number of modes located on the lattice nodes. (b) To compute the probability of a lattice node, one multiplies the highlighted core matrices and computes the trace of the resulting product.

If the prior lies in a parametric family, we can learn the most suitable distribution during training. In this work, we investigate Gaussian Mixture Model priors with an exponential number of Gaussians located in the nodes of a multidimensional lattice. In our experiments, we used a prior with more than a googol (10^100) Gaussians. To handle such complex distributions, we represent the prior using a Tensor Ring decomposition Zhao et al. (2016), a method for approximating high-dimensional tensors with a relatively small number of parameters. We call this family of distributions a Tensor Ring Induced Prior (TRIP). For this distribution, we can compute marginal and conditional probabilities and sample from them efficiently.
We also extend TRIP to conditional generation, where a generative model produces new objects with specified attributes y. With TRIP, we can produce new objects conditioned on only a subset of attributes, leaving some labels unspecified during both training and inference.
Our main contributions are summarized as follows:

We introduce a family of distributions that we call a Tensor Ring Induced Prior (TRIP) and use it as a prior for generative models: VAE, GAN, and their variations.

We investigate an application of TRIP to conditional generation and show that this prior improves quality on sparsely labeled datasets.

We evaluate TRIP models on the generation of CelebA faces in both conditional and unconditional setups. For GANs, we show an improvement in Fréchet Inception Distance (FID), and for VAEs an improved ELBO. For conditional generation, we show lower rates of condition violation compared to standard conditional models.
2 Tensor Ring Induced Prior
In this section, we introduce a Tensor Ring-induced distribution for both discrete and continuous variables. We also define the Tensor Ring Induced Prior (TRIP) family of distributions.
2.1 Tensor Ring decomposition
Tensor Ring decomposition Zhao et al. (2016) represents large high-dimensional tensors (such as discrete distributions) with a relatively small number of parameters. Consider a joint distribution p(r_1, ..., r_d) of d discrete random variables r_k taking values from {0, 1, ..., N_k - 1}. We write these probabilities as elements of a d-dimensional tensor P[r_1, ..., r_d] = p(r_1, ..., r_d); for brevity, we write P[r] for P[r_1, ..., r_d]. The number of elements in this tensor grows exponentially with the number of dimensions d: for d binary variables alone, the tensor contains 2^d real numbers. Tensor Ring decomposition reduces the number of parameters by approximating the tensor P with low-rank non-negative tensors (cores) Q_k of size N_k × m_k × m_{k+1}, where m_k are core sizes and m_{d+1} = m_1:

(1)  P[r_1, ..., r_d] = Tr( Q_1[r_1] · Q_2[r_2] · ... · Q_d[r_d] )
To compute P[r_1, ..., r_d], for each random variable r_k we slice the core Q_k along the first dimension and obtain an m_k × m_{k+1} matrix Q_k[r_k]. We multiply these matrices for all random variables and compute the trace of the resulting matrix to get a scalar (see Figure 1(b) for an example). In Tensor Ring decomposition, the number of parameters grows linearly with the number of dimensions. With larger core sizes m_k, Tensor Ring decomposition can approximate more complex distributions. Note that the order of the variables matters: Tensor Ring decomposition captures dependencies between close variables better than between distant ones.
With Tensor Ring decomposition, we can compute marginal distributions without computing the whole tensor P. To marginalize out the random variable r_k, we replace the cores Q_k[r_k] in Eq 1 with the matrix Q̃_k = Σ_{r_k} Q_k[r_k]:

(2)  p(r_1, ..., r_{k-1}, r_{k+1}, ..., r_d) = Tr( Q_1[r_1] · ... · Q_{k-1}[r_{k-1}] · Q̃_k · Q_{k+1}[r_{k+1}] · ... · Q_d[r_d] )
In Supplementary Materials, we show an algorithm for computing marginal distributions. We can also compute conditional distributions as a ratio between the joint and marginal probabilities; we sample from conditional or marginal distributions using the chain rule.
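The trace-of-products computation and the marginalization trick from Eq 2 can be sketched in a few lines of NumPy. This is a minimal illustration with arbitrary small core sizes, not the authors' implementation; `unnormalized_prob` and `marginal_unnormalized` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, m = 4, 3, 2  # 4 discrete variables, 3 categories each, core size 2
# Non-negative Tensor Ring cores Q_k of shape (N, m, m)
cores = [np.abs(rng.normal(size=(N, m, m))) for _ in range(d)]

def unnormalized_prob(indices):
    """Tr(Q_1[r_1] @ ... @ Q_d[r_d]) for one configuration (Eq 1)."""
    mat = np.eye(m)
    for core, r in zip(cores, indices):
        mat = mat @ core[r]
    return np.trace(mat)

def marginal_unnormalized(k):
    """Marginalize out variable k by replacing its core with the sum of slices (Eq 2)."""
    probs = np.zeros([N] * (d - 1))
    summed = cores[k].sum(axis=0)  # the matrix Q~_k
    for idx in np.ndindex(*([N] * (d - 1))):
        mat = np.eye(m)
        pos = 0
        for j in range(d):
            if j == k:
                mat = mat @ summed
            else:
                mat = mat @ cores[j][idx[pos]]
                pos += 1
        probs[idx] = np.trace(mat)
    return probs

# Normalizing constant: sum of the unnormalized joint over all configurations
Z = sum(unnormalized_prob(idx) for idx in np.ndindex(*([N] * d)))
```

Because the trace is linear, summing the joint over one variable agrees exactly with replacing that variable's core by the sum of its slices, which is what makes marginals cheap.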
2.2 Continuous Distributions parameterized with Tensor Ring Decomposition
In this section, we apply the Tensor Ring decomposition to continuous distributions over vectors z ∈ R^d. In our Learnable Prior model, we assume that each component z_k of z is distributed as a Gaussian Mixture Model with fully factorized components. The joint distribution p(z) is a multidimensional Gaussian Mixture Model with modes placed in the nodes of a multidimensional lattice (Figure 1(a)). The latent discrete variables s_1, ..., s_d indicate the index of the mixture component for each dimension (s_k corresponds to the k-th dimension z_k of the latent code z):

(3)  p(z) = Σ_{s_1, ..., s_d} p(s_1, ..., s_d) ∏_{k=1}^{d} N(z_k | μ_{k,s_k}, σ²_{k,s_k})

Here, p(s_1, ..., s_d) is a discrete distribution of prior probabilities of the mixture components, which we store as a tensor in a Tensor Ring decomposition. Note that p(s_1, ..., s_d) is not a factorized distribution, and the learnable prior may learn complex weightings of the mixture components. We call the family of distributions parameterized in this form a Tensor Ring Induced Prior (TRIP) and denote its learnable parameters (cores, means, and standard deviations) as ψ:

(4)  ψ = { Q_1, ..., Q_d, μ_{k,j}, σ_{k,j} }
To highlight that the prior distribution is learnable, we further write it as p_ψ(z). As we show later, we can optimize ψ directly using gradient descent for VAE models and with REINFORCE Williams (1992) for GANs.
An important property of the proposed TRIP family is that we can derive its one-dimensional conditional distributions in closed form. For example, to sample z using the chain rule, we need the distributions p_ψ(z_k | z_1, ..., z_{k-1}):

(5)  p_ψ(z_k | z_1, ..., z_{k-1}) = Σ_{s_k} p_ψ(s_k | z_1, ..., z_{k-1}) N(z_k | μ_{k,s_k}, σ²_{k,s_k})

From Equation 5 we notice that the one-dimensional conditional distributions are Gaussian Mixture Models with the same means and variances as the prior, but with different weights w̃_{k,s_k} = p_ψ(s_k | z_1, ..., z_{k-1}) (see Supplementary Materials). Computations for marginal probabilities in the general case are shown in Algorithm 1; conditional probabilities can be computed as a ratio between the joint and marginal probabilities. Note that we compute the normalizing constant on the fly.
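Chain-rule sampling from a TRIP-style distribution can be sketched as follows. This is an illustrative reimplementation under assumed shapes rather than the paper's exact algorithm; `sample_z`, the core sizes, and the mixture parameters are all made up for the example. Cores of dimensions already sampled enter the product weighted by the Gaussian densities of the sampled values, cores of future dimensions enter summed over their slices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, m = 3, 4, 2  # 3 latent dimensions, 4 mixture components each, core size 2
cores = [np.abs(rng.normal(size=(N, m, m))) for _ in range(d)]
mu = rng.normal(size=(d, N))          # component means mu_{k,s}
sigma = np.full((d, N), 0.5)          # component std deviations sigma_{k,s}

def gaussian_pdf(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def sample_z():
    # suffix[k] = product of marginalized cores Q~_k ... Q~_{d-1}
    suffix = [np.eye(m) for _ in range(d + 1)]
    for k in range(d - 1, -1, -1):
        suffix[k] = cores[k].sum(axis=0) @ suffix[k + 1]
    prefix = np.eye(m)  # accumulates density-weighted cores of sampled dims
    z = np.zeros(d)
    for k in range(d):
        # component weights w~_{k,s} = p(s_k | z_{<k}), up to normalization
        w = np.array([np.trace(prefix @ cores[k][s] @ suffix[k + 1]) for s in range(N)])
        w /= w.sum()
        s = rng.choice(N, p=w)
        z[k] = rng.normal(mu[k, s], sigma[k, s])
        # fold the evidence for z_k into the prefix
        dens = gaussian_pdf(z[k], mu[k], sigma[k])
        prefix = prefix @ sum(dens[j] * cores[k][j] for j in range(N))
    return z

z = sample_z()
```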
3 Generative Models With Tensor Ring Induced Prior
In this section, we describe how popular generative models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), can benefit from using a Tensor Ring Induced Prior.
3.1 Variational Autoencoder
Variational Autoencoder (VAE) Kingma and Welling (2013); Rezende et al. (2014) is an autoencoder-based generative model that maps data points x onto a latent space with a probabilistic encoder q_φ(z | x) and reconstructs objects with a probabilistic decoder p_θ(x | z). We used a Gaussian encoder with the reparameterization trick:

(6)  q_φ(z | x) = N(z | μ_φ(x), diag σ²_φ(x)),  z = μ_φ(x) + ε ⊙ σ_φ(x),  ε ~ N(0, I)
The most common choice for a prior distribution p(z) in the latent space is a standard Gaussian N(0, I). VAEs are trained by maximizing a lower bound on the log marginal likelihood log p(x), also known as the Evidence Lower Bound (ELBO):

(7)  L(θ, φ) = E_{q_φ(z | x)} log p_θ(x | z) − KL( q_φ(z | x) ‖ p(z) ) ≤ log p(x)
where KL denotes the Kullback-Leibler divergence. We get an unbiased estimate of L(θ, φ) by sampling z_i ~ q_φ(z | x) and computing a Monte Carlo estimate:

(8)  L(θ, φ) ≈ (1/l) Σ_{i=1}^{l} [ log p_θ(x | z_i) + log p(z_i) − log q_φ(z_i | x) ]

When p(z) is a standard Gaussian, the KL term can be computed analytically, reducing the estimation variance.
For VAEs, flexible priors give a tighter evidence lower bound Tomczak and Welling (2018); Chen et al. (2017) and can help with the problem of the decoder ignoring the latent codes Chen et al. (2017); van den Oord et al. (2017). In this work, we parameterize the learnable prior p_ψ(z) as a Tensor Ring Induced Prior model and train its parameters jointly with the encoder and decoder (Figure 2). We call this model a Variational Autoencoder with Tensor Ring Induced Prior (VAE-TRIP). We initialize the means and the variances by fitting 1D Gaussian Mixture Models for each component using samples from the latent codes, and initialize the cores with Gaussian noise. We then reinitialize the means, variances, and cores after the first epoch and repeat this procedure every 5 epochs.
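A minimal sketch of the Monte Carlo ELBO estimate from Eq 8 with a learnable mixture prior. For readability, the prior here is a factorized 1D Gaussian mixture (the special case of TRIP with core size 1); the decoder likelihood is passed in as a callable, and all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_normal(z, mean, std):
    """Log-density of a 1D Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2

# Illustrative learnable prior: a two-component 1D Gaussian mixture
prior_means = np.array([-2.0, 2.0])
prior_stds = np.array([0.7, 0.7])
prior_weights = np.array([0.3, 0.7])

def log_prior(z):
    """log p(z) via log-sum-exp over mixture components."""
    return np.logaddexp.reduce(np.log(prior_weights) + log_normal(z, prior_means, prior_stds))

def elbo_estimate(x, enc_mean, enc_std, log_likelihood, n_samples=64):
    """Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z|x)] (Eq 8)."""
    eps = rng.normal(size=n_samples)
    z = enc_mean + enc_std * eps  # reparameterization trick (Eq 6)
    terms = (log_likelihood(x, z)
             + np.array([log_prior(zi) for zi in z])
             - log_normal(z, enc_mean, enc_std))
    return terms.mean()
```

With a non-Gaussian prior, the KL term no longer has a closed form, which is exactly why the estimate keeps the log p(z) − log q(z|x) difference inside the Monte Carlo average.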
3.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs) Goodfellow et al. (2014) consist of two networks: a generator G and a discriminator D. The discriminator tries to distinguish real objects from objects produced by the generator, while the generator tries to produce objects that the discriminator considers real. The optimization setup for all models from the GAN family is a min-max problem. For the standard GAN, the learning procedure alternates between optimizing the generator and the discriminator networks with gradient descent/ascent:

(9)  min_G max_D  E_{x ~ p_data(x)} log D(x) + E_{z ~ p(z)} log( 1 − D(G(z)) )
Similar to VAEs, the prior distribution is usually a standard Gaussian, although Gaussian Mixture Model priors were also previously studied Gurumurthy et al. (2017). In this work, we use the TRIP family of distributions to parameterize a multimodal prior for GANs (Figure 3). We expect that having multiple modes in the prior improves the overall quality of generation and helps to avoid anomalies during sampling, such as partially present eyeglasses.
During training, we sample multiple latent codes z_1, ..., z_l from the prior p_ψ(z) and use REINFORCE Williams (1992) to propagate the gradient with respect to the prior parameters ψ. We reduce the variance by using the average discriminator output as a baseline:

(10)  ∇_ψ E_{z ~ p_ψ(z)} D(G(z)) ≈ (1/l) Σ_{i=1}^{l} ( D(G(z_i)) − b ) ∇_ψ log p_ψ(z_i),  b = (1/l) Σ_{j=1}^{l} D(G(z_j))

where D(G(z_i)) is the discriminator's output and z_i are samples from the prior p_ψ(z). We call this model a Generative Adversarial Network with Tensor Ring Induced Prior (GAN-TRIP). We initialize the means uniformly in a fixed range and the standard deviations with a constant value.
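The REINFORCE estimator with an average-reward baseline from Eq 10 can be sanity-checked on a toy problem. The sketch below is not the paper's training code: the 1D Gaussian "prior" and the reward (a stand-in for the discriminator output) are chosen so that the true gradient is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)

def reinforce_grad(zs, rewards, grad_log_prob):
    """Score-function gradient estimate with the mean reward as baseline (Eq 10)."""
    b = rewards.mean()
    return np.mean([(r - b) * grad_log_prob(z) for z, r in zip(zs, rewards)], axis=0)

# Toy check: prior N(mu, 1), reward r(z) = z, so the true gradient
# d/d_mu E[z] = 1; grad_mu log p(z) = z - mu for a unit-variance Gaussian.
mu = 0.0
zs = rng.normal(mu, 1.0, size=20000)
rewards = zs  # a "discriminator" that rewards large z
g = reinforce_grad(zs, rewards, lambda z: np.array([z - mu]))
```

The baseline does not bias the estimator (the score function has zero mean) but lowers its variance, which is what makes REINFORCE usable for the prior parameters.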
4 Conditional Generation
In the conditional generation problem, data objects x (for example, face images) are coupled with properties y describing the objects (for example, sex and hair color). The goal of the model is to learn a distribution that produces objects with specified attributes. Some of the attributes for a given x may be unknown, and the model should learn solely from the observed attributes y_obs: p(x | y_obs).
For VAE-TRIP, we train a joint model p_ψ(z, y) on all attributes and latent codes, parameterized with a Tensor Ring. For discrete conditions, the joint distribution is:

(11)  p_ψ(z, y) = Σ_{s_1, ..., s_d} P̂[s_1, ..., s_d, y_1, ..., y_n] ∏_{k=1}^{d} N(z_k | μ_{k,s_k}, σ²_{k,s_k})

where the tensor P̂ is represented in a Tensor Ring decomposition. In this work, we focus on discrete attributes, although we can extend the model to continuous attributes with Gaussian Mixture Models, as we did for the latent codes.
With the proposed parameterization, we can marginalize out missing attributes and compute conditional probabilities. We can efficiently compute both probabilities similar to Algorithm 1.
For the conditional VAE model, the lower bound on log p(x, y_obs) is:

(12)  log p(x, y_obs) ≥ E_{q_φ(z | x, y_obs)} log p_θ(x, y_obs | z) − KL( q_φ(z | x, y_obs) ‖ p_ψ(z) )

We simplify the lower bound by making two restrictions. First, we assume that the conditions are fully defined by the object x, which implies q_φ(z | x, y_obs) = q_φ(z | x). For example, an image of a person wearing a hat fully determines the presence of a hat. The second restriction is that we can reconstruct an object directly from its latent code: p_θ(x | z, y_obs) = p_θ(x | z). This restriction also gives:

(13)  p_θ(x, y_obs | z) = p_θ(x | z) · p_ψ(y_obs | z)

The resulting Evidence Lower Bound is

(14)  L(θ, φ, ψ) = E_{q_φ(z | x)} [ log p_θ(x | z) + log p_ψ(y_obs | z) ] − KL( q_φ(z | x) ‖ p_ψ(z) )
In the proposed model, an autoencoder learns to map objects onto a latent manifold, while the TRIP prior finds areas on the manifold corresponding to objects with the specified attributes.

The quality of the model depends on the order of the latent codes and conditions in the Tensor Ring, since the decomposition poorly captures dependencies between variables that are far apart. In our experiments, we found that randomly permuting latent codes and conditions gives good results.
We can train the proposed model on partially labeled datasets and use it to draw conditional samples with partially specified constraints. For example, we can ask the model to generate images of men in hats, not specifying hair color or the presence of glasses.
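Partial conditioning reduces to fixing the core slices of observed attributes and summing out the missing ones, exactly as in Eq 2. The toy sketch below shows only the discrete attribute part of the ring, with hypothetical shapes and a hypothetical helper name `attr_prob`:

```python
import numpy as np

rng = np.random.default_rng(4)
n_attr, N, m = 3, 2, 2  # 3 binary attributes, core size 2
attr_cores = [np.abs(rng.normal(size=(N, m, m))) for _ in range(n_attr)]

def attr_prob(observed):
    """Unnormalized p(y_obs): fix observed slices, marginalize the missing ones.
    `observed` maps attribute index -> value; absent keys are marginalized."""
    mat = np.eye(m)
    for k, core in enumerate(attr_cores):
        mat = mat @ (core[observed[k]] if k in observed else core.sum(axis=0))
    return np.trace(mat)

# p(attribute 0 = 1, attribute 2 = 0), with attribute 1 left unspecified
p_partial = attr_prob({0: 1, 2: 0})
```

By linearity of the trace, marginalizing attribute 1 this way agrees with explicitly summing the joint over its two values, so "man in a hat, anything else unspecified" is a single cheap query.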
5 Related Work
The most common generative models are based on Generative Adversarial Networks Goodfellow et al. (2014) or Variational Autoencoders Kingma and Welling (2013). Both GAN and VAE models usually use continuous unimodal distributions (such as a standard Gaussian) as a prior. The space of natural images, however, is multimodal: a person either wears glasses or not; there are no intermediate states. Although generative models are flexible enough to transform unimodal distributions into multimodal ones, they tend to ignore some of the modes (mode collapse) or produce images with artifacts (half-present glasses).
A few models with learnable prior distributions were proposed. Tomczak and Welling (2018) used a Gaussian mixture model based on encoder proposals as a prior on the latent space of VAE. Chen et al. (2017) and Rezende and Mohamed (2015) applied normalizing flows Dinh et al. (2015); Kingma et al. (2016); Dinh et al. (2017) to transform a standard normal prior into a more complex latent distribution. Chen et al. (2017); van den Oord et al. (2017) applied autoregressive models to learn better prior distribution over the latent variables. Bauer and Mnih (2019) proposed to update a prior distribution of a trained VAE to avoid samples that have low marginal posterior, but high prior probability.
Similar to Tensor Ring decomposition, the Tensor-Train decomposition Oseledets (2011) is used in machine learning and numerical methods to represent tensors with a small number of parameters. Tensor-Train was applied to the compression of fully connected Novikov et al. (2015), convolutional Garipov et al. (2016), and recurrent Tjandra et al. (2017) layers. In our models, we could use a Tensor-Train decomposition instead of Tensor Ring, but it requires larger cores to achieve comparable results, as the first and last dimensions are farther apart.

Most conditional models handle missing values by imputing them with a predictive model or by setting them to a special value. With this approach, we cannot sample objects with partially specified conditions. The VAE TELBO model Vedantam et al. (2018) proposes to train a Product of Experts-based model, where the posterior on the latent codes is approximated as a product of per-attribute posteriors, requiring a separate posterior model for each condition. The JMVAE model Suzuki et al. (2017) contains three encoders that take both an image and a condition, only a condition, or only an image.

6 Experiments
We conducted experiments on the CelebFaces Attributes Dataset (CelebA) Liu et al. (2015) of approximately 200,000 photos with a random train-test split. For conditional generation, we selected binary image attributes, including sex, hair color, and the presence of a mustache or a beard. We compared both GAN and VAE models with and without TRIP. We also compared our best model with known approaches on the CIFAR-10 Krizhevsky et al. dataset with a standard split. Model architecture and training details are provided in Supplementary Materials.
6.1 Generating Objects With VAETRIP and GANTRIP
Table 1: FID and ELBO on CelebA for models with standard Gaussian, GMM, and TRIP priors (F: fixed prior parameters, L: learned prior parameters).

Metric      Model     Gaussian   GMM-F    GMM-L    TRIP-F   TRIP-L
FID         VAE       86.72      85.64    84.48    85.31    83.54
            WGAN      63.46      67.10    61.82    62.48    57.6
            WGAN-GP   54.71      57.82    62.10    63.06    52.86
ELBO        VAE       194.16     201.60   193.88   202.04   193.32
IWAE ELBO   VAE       185.09     191.99   184.73   190.09   184.43
We evaluate GAN-based models with and without the Tensor Ring learnable prior by measuring the Fréchet Inception Distance (FID). For the baseline models, we used Wasserstein GAN (WGAN) Arjovsky et al. (2017) and Wasserstein GAN with Gradient Penalty (WGAN-GP) Gulrajani et al. (2017) on the CelebA dataset. We also compared learnable priors with fixed, randomly initialized parameters ψ. The results in Table 1 (CelebA) and Table 2 (CIFAR-10) suggest that a TRIP prior improves quality compared to standard models and models with GMM priors. In some experiments, the GMM-based model performed worse than a standard Gaussian, since the KL term had to be estimated with Monte Carlo sampling, resulting in higher gradient variance.
6.2 Visualization of TRIP
In Figure 4, we visualize the first two dimensions of the learned prior in the VAE-TRIP and WGAN-GP-TRIP models. For both models, the prior uses most of the components to produce a complex distribution. Also notice that the components learned different non-uniform weights.
6.3 Generated Images
Here, we visualize the correspondence between modes and generated images with a procedure that we call mode hopping. We start by randomly sampling a latent code and producing the first image. After that, we randomly select five dimensions and resample them conditioned on the remaining dimensions. We repeat this procedure multiple times and obtain the sequence of sampled images shown in Figure 5. These results suggest that similar images are localized in the learned prior space, and that changes in a few dimensions alter only a few fine-grained features.
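The mode-hopping procedure can be sketched on the discrete component indices alone. The code below is illustrative, not from the paper: shapes are hypothetical, `resample_dims` is a made-up helper, and it resamples the chosen dimensions one at a time from their exact conditionals given all other dimensions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, m = 8, 3, 2  # 8 dimensions, 3 components each, core size 2
cores = [np.abs(rng.normal(size=(N, m, m))) for _ in range(d)]

def resample_dims(s, dims):
    """One mode-hopping step: resample entries of the discrete code `s`
    at positions `dims`, each conditioned on all the other entries."""
    s = list(s)
    for k in dims:
        weights = []
        for v in range(N):
            mat = np.eye(m)
            for j in range(d):
                mat = mat @ cores[j][v if j == k else s[j]]
            weights.append(np.trace(mat))
        w = np.array(weights)
        s[k] = rng.choice(N, p=w / w.sum())
    return s

s = [0] * d
for _ in range(3):  # three hops, resampling two random dimensions each time
    s = resample_dims(s, rng.choice(d, size=2, replace=False))
```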
6.4 Generated Conditional Images
In this experiment, we generate images given a subset of attributes to estimate the diversity of generated images. For example, if we specify ‘Young man,’ we would expect different images to have different hair colors, presence and absence of glasses or hat. Generated images shown in Figure 3 indicate that the model learned to produce diverse images with multiple varying attributes.
Young man  
Smiling woman in eyeglasses  
Smiling woman with a hat  
Blond woman with eyeglasses 
7 Discussion
We designed our prior using Tensor Ring decomposition due to its higher representation capacity compared to other decompositions: a Tensor Ring can match the expressive power of a Tensor-Train with a substantially larger core size Levine et al. (2018). Although the prior contains an exponential number of modes, in our experiments its learnable parameters accounted for only a small fraction of the total weights, which did not cause overfitting. The results can be improved by increasing the core size m; however, the computational complexity grows cubically with the core size. We also implemented a conditional GAN but found the REINFORCE-based training of this model very unstable. Further research on variance reduction techniques might improve this approach.
8 Acknowledgements
Image generation for Section 6.3 was supported by the Russian Science Foundation grant no. 17-71-20072.
References
 Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive Growing of GANs for Improved Quality, Stability, and Variation. International Conference on Learning Representations, 2018.
 Ping et al. (2018) Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling TexttoSpeech with Convolutional Sequence Learning. International Conference on Learning Representations, 2018.
 van den Oord et al. (2018) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast HighFidelity Speech Synthesis. 2018.
 Polykovskiy et al. (2018) Daniil Polykovskiy, Alexander Zhebrak, Dmitry Vetrov, Yan Ivanenkov, Vladimir Aladinskiy, Polina Mamoshina, Marine Bozdaganyan, Alexander Aliper, Alex Zhavoronkov, and Artur Kadurin. Entangled conditional adversarial autoencoder for de novo drug discovery. Mol. Pharm., September 2018.
 Zhavoronkov et al. (2019) Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy, Anastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov, Arip Asadulaev, et al. Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature biotechnology, pages 1–4, 2019.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. Advances in neural information processing systems, pages 2672–2680, 2014.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. AutoEncoding Variational Bayes. International Conference on Learning Representations, 2013.
 BenYosef and Weinshall (2018) Matan BenYosef and Daphna Weinshall. Gaussian Mixture Generative Adversarial Networks for Diverse Datasets, and the Unsupervised Clustering of Images. arXiv preprint arXiv:1808.10356, 2018.
 Pan et al. (2018) Lili Pan, Shen Cheng, Jian Liu, Yazhou Ren, and Zenglin Xu. Latent dirichlet allocation in generative adversarial networks. arXiv preprint arXiv:1812.06571, 2018.

 Tomczak and Welling (2018) Jakub M Tomczak and Max Welling. VAE with a VampPrior. International Conference on Artificial Intelligence and Statistics, 2018.
 Zhao et al. (2016) Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, and Andrzej Cichocki. Tensor Ring Decomposition. arXiv preprint arXiv:1606.05535, 2016.

 Williams (1992) Ronald J Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning, 2014.
 Chen et al. (2017) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. International Conference on Learning Representations, 2017.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 2017.

 Gurumurthy et al. (2017) Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. DeLiGAN: Generative Adversarial Networks for Diverse and Limited Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 166–174, 2017.
 Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. International Conference on Machine Learning, 2015.
 Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Nonlinear Independent Components Estimation. International Conference on Learning Representations Workshop, 2015.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
 Dinh et al. (2017) Laurent Dinh, Jascha SohlDickstein, and Samy Bengio. Density Estimation Using Real NVP. International Conference on Learning Representations, 2017.
 Bauer and Mnih (2019) Matthias Bauer and Andriy Mnih. Resampled Priors for Variational Autoencoders. Proceedings of Machine Learning Research, 89:66–75, 2019. URL http://proceedings.mlr.press/v89/bauer19a.html.
 Oseledets (2011) Ivan V Oseledets. Tensor-Train Decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

 Novikov et al. (2015) Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing Neural Networks. Advances in Neural Information Processing Systems, 2015.
 Garipov et al. (2016) Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate Tensorization: Compressing Convolutional and FC Layers Alike. Advances in Neural Information Processing Systems, 2016.

 Tjandra et al. (2017) Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing Recurrent Neural Network with Tensor Train. International Joint Conference on Neural Networks, 2017.
 Vedantam et al. (2018) Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative Models of Visually Grounded Imagination. International Conference on Learning Representations, 2018.
 Suzuki et al. (2017) Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint Multimodal Learning with Deep Generative Models. International Conference on Learning Representations Workshop, 2017.
 Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), 12 2015.
 Krizhevsky et al. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
 Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning, 2017.
 Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved Training of Wasserstein Gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
 Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
 Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
 Levine et al. (2018) Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum entanglement: Fundamental connections with implications to network design. International Conference on Learning Representations, 2018.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
 Kingma et al. (2014) Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised Learning with Deep Generative Models. In Advances in neural information processing systems, pages 3581–3589, 2014.
Appendix A Derivations for one-dimensional conditional distributions
In the paper, we stated that the one-dimensional conditional distributions are Gaussian Mixture Models with the same means and variances as the prior, but with different weights w̃_{k,s_k}. With Tensor Ring decomposition, we can efficiently compute those weights (we denote N(z_j | μ_{j,s_j}, σ²_{j,s_j}) as N_{j,s_j}):

(15)  w̃_{k,s_k} = p_ψ(s_k | z_1, ..., z_{k-1}) ∝ Tr( [ ∏_{j=1}^{k-1} Σ_{s_j} N_{j,s_j} Q_j[s_j] ] · Q_k[s_k] · [ ∏_{j=k+1}^{d} Σ_{s_j} Q_j[s_j] ] )
Appendix B Calculation of marginal probabilities in Tensor Ring
In Algorithm 2, we show how to compute marginal probabilities for a distribution parameterized in the Tensor Ring format. Note that we compute the normalizing constant on the fly.
Appendix C Model architecture
We manually tuned the hyperparameters: first we selected the best encoder-decoder architecture for a Gaussian prior and then tuned the TRIP parameters for the fixed architecture. For models from the GAN family, we used a deconvolutional generator with ReLU activations and a symmetric convolutional discriminator with LeakyReLU activations. We trained the models with the Adam Kingma and Ba (2015) optimizer, using a schedule of several discriminator updates per one generator update. The TRIP prior was a multidimensional mixture with multiple Gaussians per dimension; the core size determines the sizes of the matrices Q_k[s_k]. As a baseline, we used a Gaussian Mixture Model (GMM) prior. We conducted all the experiments on a Tesla K80.

For VAE models, we used a convolutional encoder and a deconvolutional decoder with symmetric architectures, with LeakyReLU activations for the encoder and ReLU activations for the decoder. For TRIP, we again used multiple Gaussians per dimension in a Tensor Ring; for the baseline, a GMM prior.
For conditional generation with TRIP, the architecture was the same as for unconditional generation. For CVAE we parameterized a posterior model as a fully connected network with layer sizes and LeakyReLU activations. For the VAE TELBO baseline model Vedantam et al. (2018), we used a fully connected network for with layer sizes and LeakyReLU activations.
Appendix D Implementation details
Implementing the TRIP module is straightforward and requires two functions. The first function that we use during training computes for an arbitrary subset of latent dimensions. The second function is used for sampling, and samples from with a chain rule, for which calculations are described in Eq 15.
During training, we enforce non-negativity of the cores by replacing each element of the tensors Q_k with its absolute value before the computation. To make the computations more stable, we additionally rescale the intermediate matrix products at each iteration when computing p_ψ(z).
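One standard way to implement such rescaling (a sketch under assumptions, not the authors' exact scheme) is to normalize the running matrix product at each step and accumulate the scale in log-space, so the trace of a long product of non-negative matrices never overflows or underflows:

```python
import numpy as np

def stable_log_trace_product(matrices):
    """Numerically stable log Tr(M_1 @ ... @ M_d) for non-negative matrices:
    rescale the running product at each step and accumulate the log-scale."""
    m = matrices[0].shape[0]
    prod = np.eye(m)
    log_scale = 0.0
    for M in matrices:
        prod = prod @ np.abs(M)   # enforce non-negativity, as during training
        norm = prod.max()
        prod = prod / norm        # keep entries in a stable numeric range
        log_scale += np.log(norm)
    return log_scale + np.log(np.trace(prod))
```

For example, for three copies of the all-twos 2×2 matrix the trace of the product is 64, and the function recovers log 64 without ever forming the unscaled product.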
Table 4: VAE-TRIP performance for different core sizes m on CIFAR-10 and CelebA (ELBO, reconstruction, and KL terms).

       CIFAR-10                        CelebA
m      ELBO   Reconstruction  KL      ELBO     Reconstruction  KL
1      89.5   60.5            29.0    243.40   177.63          65.76
5      89.3   60.2            29.1    231.57   166.89          64.67
10     89.3   60.4            28.9    223.59   156.99          66.60
20     89.1   60.2            28.9    215.62   158.95          56.67
Appendix E Impact of core size
In Table 4, we compare the performance of the VAE-TRIP model with different core sizes m on the CIFAR-10 and CelebA datasets. Note that for a core size of 1, TRIP factorizes over dimensions, and each dimension is a 1D Gaussian Mixture Model. Models with larger core sizes perform better as the prior becomes more complex. In Table 5, we show the computational complexity and memory usage of the TRIP model to illustrate the trade-off between quality and computational cost.
Table 5: Computational complexity and memory usage of the TRIP prior for different core sizes m.

m      log-likelihood, ms   sampling, ms   Memory, MB
1      126 ± 7              201 ± 21       0.023
10     137 ± 4              232 ± 13       0.77
20     193 ± 15             312 ± 18       3.1
50     200 ± 20             360 ± 17       19.5
100    308 ± 12             882 ± 15       78.1
Table 6: Condition matching accuracy (%) for different rates of missing attributes.

Model                               0%     90%    99%
CVAE Kingma et al. (2014)           86.69  85.31  84.61
VAE TELBO Vedantam et al. (2018)    82.80  74.87  73.92
JMVAE Suzuki et al. (2017)          81.87  80.65  73.68
VAE-TRIP (ours)                     88.7   87.08  84.89
E.1 Conditional Generation
For conditional generation, we used CelebA images. We studied the model performance for different rates of missing attributes (0%, 90%, and 99%). For each model, we generated images for randomly sampled complete sets of attributes from the test set. We trained a predictive convolutional neural network on a validation set to predict the attributes and used it to predict the attributes of the generated images. We report the condition matching accuracy: how often the requested attributes matched the predicted attributes. We trained all models except CVAE Kingma et al. (2014) directly on data with missing attributes; for CVAE, we imputed missing values with a predictive model. The results in Table 6 show that the VAE-TRIP model outperforms the other baselines.

Table 7: Number of parameters and ELBO for VAE models with Gaussian, GMM, and TRIP priors, and for their combinations with a normalizing flow.

                     Gaussian   GMM     TRIP    Flow    Flow+GMM   Flow+TRIP
Parameters (model)   11.4M      11.1M   10.7M   11.3M   10.7M      10.4M
Parameters (prior)   0          0.2M    0.6M    0.3M    0.5M       0.7M
Parameters (total)   11.4M      11.3M   11.1M   11.5M   11.2M      11.1M
ELBO                 192.6      190.05  189.1   185.3   186.0      184.7
E.2 Additional experiments for VAE
In Table 7, we compare VAE models with Gaussian, GMM, and TRIP priors with a comparable number of parameters. We also provide preliminary results on combining normalizing flows with a TRIP prior.