A Factorial Mixture Prior for Compositional Deep Generative Models

Ulrich Paquet et al. · December 18, 2018

We assume that a high-dimensional datum, like an image, is a compositional expression of a set of properties, with a complicated non-linear relationship between the datum and its properties. This paper proposes a factorial mixture prior for capturing latent properties, thereby adding structured compositionality to deep generative models. The prior treats a latent vector as belonging to a Cartesian product of subspaces, each of which is quantized separately with a Gaussian mixture model. Some mixture components can be set to represent properties as observed random variables whenever labeled properties are present. Through a combination of stochastic variational inference and gradient descent, a method for learning how to infer discrete properties in an unsupervised or semi-supervised way is outlined and empirically evaluated.


1 Introduction

An image is often compositional in its properties or attributes. We might generate a random portrait by first picking discrete properties that are simple to describe, like whether the person should be male, whether the person should be smiling, whether the person should wear glasses, and maybe whether the person should have one of a discrete set of hair colors. In the same vein, we might include discrete properties that are harder to describe, but are still present in typical portraits. These are properties that we wish to discover in an unsupervised way from data, and may represent background textures, skin tone, lighting, or any property that we didn’t explicitly enumerate.

This paper proposes a factorial mixture prior for capturing such properties, and presents a method for learning how to infer discrete properties in an unsupervised or semi-supervised way. The outlined framework adds structured compositionality to deep generative models.

Figure 1: Samples from a deep generative model. For the left and right blocks of faces, the unsupervised latent factors were sampled once, and the latent code was constructed as a Cartesian product over the settings of the two semi-supervised factors; the remaining sub-vectors were iteratively filled in by sampling them for all settings. With background, skin tone and other properties being kept constant in the left and right blocks, we mark a visible change in gender between the two settings of the first semi-supervised latent variable. The presence or absence of glasses on a portrait is captured by the second. Following the discussion in Section 2.5, with semi-supervision, particular mixture components become responsible for rendering these properties of the portraits.

To make the example of learning to render a portrait concrete, let each property, be it one that we can describe or not, take one of a discrete set of values. Given a sampled latent value for each property, we generate a real-valued sub-vector from the corresponding component in a separate mixture model for that property. The sub-vectors are concatenated into a single latent vector, from which the datum is generated according to any deep generative model parameterized by a decoder network.
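
As an illustrative sketch only (not the paper's implementation), the generative process above can be mimicked in a few lines of NumPy. The dimensions, the random mixture parameters, the linear "decoder" and the Bernoulli pixel likelihood below are all placeholders chosen for this example.

import numpy as np

rng = np.random.default_rng(0)
F, K, D, X_DIM = 4, 10, 8, 784                        # illustrative sizes only

# Fixed mixture parameters for each factor: mixing weights, means, precisions.
weights = rng.dirichlet(np.ones(K), size=F)           # (F, K)
means = rng.normal(size=(F, K, D))                    # (F, K, D)
precisions = rng.gamma(2.0, 1.0, size=(F, K, D))      # diagonal precisions
decoder_W = rng.normal(scale=0.01, size=(F * D, X_DIM))  # placeholder decoder

# 1. Pick a discrete component per factor; 2. sample each sub-vector from the
# chosen Gaussian; 3. concatenate and push through the (placeholder) decoder.
s = [rng.choice(K, p=weights[f]) for f in range(F)]
z = np.concatenate([
    rng.normal(means[f, s[f]], 1.0 / np.sqrt(precisions[f, s[f]]))
    for f in range(F)
])
x_logits = z @ decoder_W
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-x_logits)))  # e.g. Bernoulli pixels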

The parameters of the aforementioned generative process can be trained wholly unsupervised, and a factorial structure of clustered latent properties will naturally emerge if it exists in the training data. However, unsupervised learning provides no guarantee that any particular factor encodes and models a labelled property of the data. Some semi-supervised learning is required to associate a named property with a specific factor. A few training data points might be accompanied by sets of labels. We may choose which labels in a set to map to which properties, and then, when present in a training example, use these labels in a semi-supervised way.

Continuing the illustrative example, let the first factor index the gender of a portrait and the second whether glasses appear in it or not, yielding two semi-supervised latent clusters with two components each. In Figure 1, the remaining factors are left free to capture other combinations of variation in the data. For a given setting of the unsupervised latent properties, Figure 1 illustrates how portraits of males and females with and without glasses are sampled by iterating over the two semi-supervised factors.

Despite the naive simplicity of the generative model, inferring the posterior distribution of the latent vector and the properties' latent cluster assignments for a given datum is far from trivial. Next to inference stands parameter learning. Section 2 is devoted to the model, inference and learning. Section 3 traces the influences on this model, and related work from the body of literature on "untangled", structured latent representations. Empirical evaluations follow in Section 4.

2 A Structured, Factorial Mixture Model

We observe a data set, and model each datum with a likelihood given by a deep and flexible function of its latent code, with decoder parameters. A few data points might have labels, and we denote the set of their indices and the corresponding labels separately. We will ignore the labels for the time being, and return to them in Section 2.5, where we incorporate them as a semi-supervised signal that helps the prior factors correspond to named properties. Additionally, when clear from the context, we will omit data-point superscripts.

2.1 Factorial Mixture Prior on the Latent Vector

We present a prior that decomposes the latent vector in a discrete, structured way. With a subdivision of the latent vector into sub-vectors, let each factor be generated by a separate Gaussian mixture model whose components have non-negative mixing weights that sum to one. Let each factor's components have means and diagonal precision matrices, and collect all of these prior parameters as random variables. Each sub-vector is augmented with a cluster assignment vector, a one-hot encoding of its assignment index.

Seen differently, the assignment indices form a discrete product-quantized (PQ) code for the latent vector. PQ codes have been used with great success for indexing in computer vision Jegou et al. (2011). Product quantization decomposes a vector space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately, so that a vector is represented by its subspace quantization indices.

We let each cluster assignment vector be drawn from a multinomial distribution parameterized by its factor's mixing weights, to yield the joint prior distribution

(1)

Our hope is that the prior provides a factorial representation of the data, with the ability to choose and combine discrete concepts, then generate the continuous sub-vectors from those discrete latent variables, and finally to model each datum from the concatenation of the sub-vectors.
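
To make the factorial structure of (1) concrete, a small sketch follows for evaluating the prior's log density for a given latent vector and given assignments, reusing the illustrative arrays from the sketch above; the function and its arguments are placeholders, not the paper's code.

import numpy as np
from scipy.stats import norm

def log_prior(z, s, weights, means, precisions):
    # z: concatenated latent vector of shape (F * D,); s: list of F indices.
    F, K, D = means.shape
    z_blocks = z.reshape(F, D)
    total = 0.0
    for f in range(F):
        k = s[f]
        total += np.log(weights[f, k])                # log mixing weight
        total += norm.logpdf(z_blocks[f],             # diagonal Gaussian block
                             loc=means[f, k],
                             scale=1.0 / np.sqrt(precisions[f, k])).sum()
    return total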

2.2 Hierarchical Model

Figure 2: The joint density in (2), shown as a probabilistic graphical model. In the semi-supervised setting, some of the random variables are also observed; see Section 2.5.

In (1), the prior is written with the mixture parameters treated as random variables Johnson et al. (2016). As random variables, their hyperprior is judiciously chosen from a conjugate exponential family of distributions, and the rich history of variational Bayes from the 1990s and 2000s is drawn on to additionally infer their posterior distribution Johnson et al. (2016). An alternative, not pursued here, would have been to treat the mixture parameters as model parameters and absorb them into the prior Fraccaro et al. (2017).

With the mixture parameters in (1) being random variables (that are integrated out when any marginal distributions are determined), they are a priori modeled with a hyperprior. In particular, each Gaussian mean and precision is independently governed by a conjugate Normal-Gamma hyperprior; our implementation uses fixed settings of its hyperparameters. The choice of a conjugate hyperprior becomes apparent when the mixture parameters are inferred from data using stochastic variational inference Hoffman et al. (2013); Paquet & Koenigstein (2013). For a similar reason, each mixing weights vector is drawn from a Dirichlet distribution with fixed pseudo-counts. The joint density over the observed random variables and the hidden random variables is

(2)

and is illustrated as a probabilistic graphical model in Figure 2. Equation (2) contains local factors and global random variables, and the posterior approximation in Section 2.3 will contain local and global factors.

The model's prior is related to an independent component analysis (ICA) prior on the latent sub-vectors, after which a non-linear mixing function is applied to yield the datum. It is also worth noting that using a single factor yields Johnson et al. (2016)'s Gaussian mixture model (GMM) structured variational auto-encoder (SVAE), and we generalise here from a single GMM to ICA-style component-wise mixtures. When the mixture parameters are treated as parameters (and not random variables) and a single factor is used, we obtain Jiang et al. (2017)'s variational deep embedding (VaDE) model. Further connections to related work are made in Section 3.

2.3 Posterior Approximation

We approximate the posterior with a fully factorized distribution

(3)

An “encoding network” or “inference network” amortizes inference for the latent vector, mapping each datum to a Gaussian distribution on the latent space; the networks producing the Gaussian's mean and variance are deep neural networks Kingma & Welling (2014). We approximate the posterior over the cluster assignments and the mixture parameters with a “structured VAE” approach Johnson et al. (2016).

There is a local approximation for every data point, yielding a “responsibility” over each block's mixture components; these responsibilities are the parameters of the local factor. Maximization of the ELBO with respect to the local factor is semi-amortized; we'll see in (5) that it can be obtained as an analytic, closed-form expression. In (5) it is a softmax over cluster assignments, and could be interpreted as another “encoding network”, this time for the one-hot discrete representations. (Here, there is room for splitting hairs. We opt for uncluttered notation, following the traditional convention in the variational Bayes (VB) literature. According to (5), the ELBO is locally maximized at an expression for the responsibilities that depends on the encoder output and the global posterior factors. One might also make the “amortized” nature of these local factors clearer by writing them as explicit functions of the datum.)

To express the global posterior approximation, we parameterize each of its Gaussian-parameter factors as a product of Normal-Gamma distributions, in the same form as the prior. Lastly, let each mixing-weights factor be a Dirichlet distribution parameterized by its pseudo-counts vector.

We've stated the factors in terms of their mean parameters here because we require the mean parameters in (6). However, instead of doing gradient descent in the mean parameterization, we will view the global approximation through the lens of its natural parameters. This parameterization has pleasing properties, as it allows gradients to be conveniently expressed as natural gradients; a natural gradient step of length one gives an update to a local minimum (for a batch) Paquet (2015). This is also the framework required for stochastic variational inference (SVI) Hoffman et al. (2013). Here the exponential family representation is minimal and there exists a one-to-one mapping between the mean parameters and natural parameters Wainwright & Jordan (2008); we state the mapping in Appendix B.

2.4 Inference and Learning

The posterior approximation depends on the encoder parameters, the decoder parameters, the local responsibilities, and the parameters of the global factors. They will be found by maximizing a variational lower bound to the log marginal likelihood, also referred to as the "evidence lower bound" (ELBO):

(4a)
(4b)
(4c)
(4d)
(4e)

The full derivation is provided in Appendix A. The contribution of the hyperpriors is equally divided over all data points in lines (4d) and (4e), so that the outer sum runs over the full data set. This aids stochastic gradient descent, and appropriately weighs the hyperpriors' gradients as part of mini-batches. The objective in (4a) to (4e) forms a structured VAE Johnson et al. (2016).

There are four Kullback-Leibler (KL) divergences in the objective. For each data point and factor, (4b) contains the KL divergence between the encoder's Gaussian and the corresponding mixture component of the prior. This is the familiar local KL divergence between the encoder and the prior. Because the mixture parameters are inferred, the KL is averaged over the (approximate) posterior uncertainty about them. With the allocation of sub-vectors to clusters also being inferred, the smoothed KL is further weighted in a convex combination over the cluster responsibilities. A second local KL cost is incurred in line (4c), where the discrete random variables encode information about the datum. Finally, lines (4d) and (4e) penalize the global posterior approximation for moving away from the hyperprior. The KL divergence between two Normal-Gamma distributions is given in Appendix C.1.
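
For concreteness, the following is a small sketch of the KL divergence between two univariate Normal-Gamma distributions, using the standard decomposition into a Gamma KL and an expected Gaussian KL. It is meant to mirror the role of Appendix C.1, which is not reproduced here, and the parameter names are illustrative.

import numpy as np
from scipy.special import digamma, gammaln

def kl_normal_gamma(m_q, beta_q, a_q, b_q, m_p, beta_p, a_p, b_p):
    # KL between the conditional Gaussians N(mu | m, (beta * lam)^-1),
    # averaged over lam ~ Gamma(a_q, b_q), using E[lam] = a_q / b_q.
    kl_gauss = 0.5 * (np.log(beta_q / beta_p)
                      + beta_p / beta_q
                      + beta_p * (a_q / b_q) * (m_q - m_p) ** 2
                      - 1.0)
    # KL between the Gamma factors Gamma(lam | a, b) (rate parameterization).
    kl_gamma = ((a_q - a_p) * digamma(a_q)
                - gammaln(a_q) + gammaln(a_p)
                + a_p * (np.log(b_q) - np.log(b_p))
                + a_q * (b_p - b_q) / b_q)
    return kl_gauss + kl_gamma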

The objective can be optimized using three different techniques, spanning the last three decades of progress in variational inference. This combination has already been employed by a number of authors recently Johnson et al. (2016); Lin et al. (2018). Given a mini-batch of data points at each iteration, the following parameter updates are performed:

  1. The ELBO is locally maximized with respect to the responsibilities with an analytic, closed-form expression. In an Expectation-Maximization (EM) algorithm, this is the variational Bayes E-step Attias (1999); Waterhouse et al. (1996).

  2. Given a batch's responsibilities, a stochastic gradient descent step updates the encoder and decoder parameters, for example using Kingma & Ba (2015)'s Adam optimizer. When line (4b) is expressed analytically using (6), the reparameterization trick can be readily used Kingma & Welling (2014).

  3. After recomputing the responsibilities at their new local optima using the updated encoder, an SVI natural gradient step is taken on the global factors, using a decreasing Robbins-Monro step size Hoffman et al. (2013). The updates on the network parameters and the global factors together form a stochastic M-step.

In practice, we initialize the encoder and decoder by first running a number of gradient updates, using an ELBO with a simpler fixed prior, possibly with an annealed KL term. That gives an initial encoder. The global mixture parameters are then initialized by repeating steps 1 and 3, to seed a basic factorized clustering of the encoded data. We consider steps 1–3 separately below:

2.4.1 Locally maximizing over the responsibilities

The local maximum of the ELBO over the responsibilities is the softmax

(5)

(dropping data-point superscripts). If the encoding network yields a Gaussian distribution, we can determine the expectation required for (5) analytically:

(6)

where the index selects the matching coordinate of the sub-vector, and ψ indicates the digamma function. The expression in (6) appears in line (4b), so that the KL divergence is expressed exactly when stochastic gradients over the encoder and decoder parameters are computed.

Aside from showing that the maximum of the ELBO with respect to the responsibilities is analytically tractable, (5) is a softmax classification into expected cluster allocations. Semi-supervised labels could be used to guide the encoder (since it appears in analytic form in (6)) towards encodings for which the expected cluster allocations correspond to labelled properties or attributes; we turn to this theme in (8) in Section 2.5.
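
The closed-form local update can be sketched as follows, assuming the standard variational Bayes Gaussian-mixture expressions with diagonal Normal-Gamma posteriors and a Dirichlet posterior over mixing weights; the variable names are illustrative, not the paper's notation.

import numpy as np
from scipy.special import digamma

def responsibilities(enc_mean, enc_var, m, beta, a, b, alpha):
    # enc_mean, enc_var: encoder output for one factor's sub-vector, shape (D,).
    # m, beta, a, b: Normal-Gamma posterior parameters, shape (K, D).
    # alpha: Dirichlet pseudo-counts of the mixing weights, shape (K,).
    # Expected log Gaussian, averaged over q(z), q(mu) and q(lambda).
    e_log_lam = digamma(a) - np.log(b)
    e_quad = 1.0 / beta + (a / b) * ((enc_mean - m) ** 2 + enc_var)
    e_log_gauss = 0.5 * np.sum(e_log_lam - np.log(2 * np.pi) - e_quad, axis=1)
    # Expected log mixing weight under the Dirichlet posterior.
    e_log_pi = digamma(alpha) - digamma(np.sum(alpha))
    logits = e_log_pi + e_log_gauss                   # one logit per component
    logits -= logits.max()                            # numerical stability
    r = np.exp(logits)
    return r / r.sum()                                # the softmax in (5)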

2.4.2 Stochastic gradients of the encoder and decoder parameters

Lines (4a) and (4b) depend on the encoder and decoder parameters, and the "reparameterization trick" can be used with a stochastic gradient optimizer Kingma & Ba (2015); Kingma & Welling (2014).

2.4.3 Conjugate, natural gradients of the global factors

We provide only a sketch of natural gradients and their use in SVI here, and give the derivations and detailed expressions of the gradients in Appendices B and D.

At each iteration, for a mini-batch of data points, the rescaled batch objective is analytically minimized with respect to the global factors to yield batch-optimal natural parameters. If we track the natural parameters across iterations, the natural gradient is the difference between the batch-optimal and the currently tracked natural parameters Paquet (2015). We update the tracked natural parameters by a step along this natural gradient, which is equivalent to updating them to a convex combination of the currently tracked minimizer and the batch's minimizer,

(7)

This parameter update is a rewritten form of updating the natural parameters with the natural gradient rescaled by a step size Hoffman et al. (2013). The step sizes should satisfy the usual Robbins-Monro conditions: they sum to infinity while their squares sum to a finite value. In our results, a decreasing step-size schedule with a forgetting rate and delay was used. The updated mean parameters of the global factors are obtained from the updated natural parameters.
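
The update in (7) amounts to a few lines of code. The sketch below assumes the natural parameters are stored as arrays and that the batch-optimal natural parameters have already been computed; the default forgetting rate and delay are illustrative values, not those used in the paper.

def robbins_monro_step(t, delay=1.0, kappa=0.6):
    # Decreasing step sizes whose sum diverges while the sum of squares is finite.
    return (t + delay) ** (-kappa)

def svi_update(eta_current, eta_batch_opt, t):
    # eta_current: currently tracked natural parameters of the global factors.
    # eta_batch_opt: natural parameters optimizing the rescaled batch objective.
    rho = robbins_monro_step(t)
    # A unit-length natural-gradient step, rescaled by rho, is exactly the
    # convex combination in (7).
    return (1.0 - rho) * eta_current + rho * eta_batch_opt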

2.5 Semi-supervised Learning

For some data points, we have access to labels. With appropriate preprocessing, we let each label correspond to a one-hot encoding for a property, and collect the observed properties for each labelled data point. Where present, we treat the labels as observed random variables in the probabilistic graphical model in (2) and Figure 2. The posterior approximation in (3), which is used in (4b) and (4c), hence includes pre-set delta functions over the corresponding assignment vectors. Certain marginals in the posterior approximation are thus clamped.

Merely clamping marginals may not be sufficient for ensuring that a designated block encodes, for example, latent male or female properties in Figure 1. The encoder might still distribute these properties through all of the latent vector, and the decoder might find parameters that recover the datum from that encoding. We would like the encoder to emit two linearly separable clusters in the designated block for male and female inputs, and define the expected cluster allocation as in (5). To encourage a representation of male and female primarily in that block, we incorporate a cross-entropy classification loss into the objective,

(8)

with a tunable scalar knob.
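
In an implementation, (8) reduces to a weighted cross-entropy between the differentiable responsibility logits of a semi-supervised factor and its observed labels. The PyTorch sketch below is illustrative; the function name and the assumption that the logits come from the analytic expression in (6) are ours.

import torch.nn.functional as F

def semi_supervised_loss(resp_logits, labels, gamma):
    # resp_logits: (batch, K) differentiable logits for one factor, from (6).
    # labels: (batch,) integer class labels for the observed property.
    # gamma: tunable scalar weight of the semi-supervised term.
    return gamma * F.cross_entropy(resp_logits, labels)

# total_loss = -elbo + semi_supervised_loss(resp_logits, labels, gamma)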

On closer inspection, the "logits" of the expected cluster allocations are differentiable functions of the encoder, using the analytic expression in (6). For the parameters of the subset of global factors that influence the semi-supervised loss term in (8), the closed-form natural gradient updates of Section 2.4.3 are no longer possible. When the semi-supervised weight is non-zero, the M-step can employ first-order gradients in the mean parameterization of those factors.

As a final thought, we are free to set the semi-supervised weight to zero once we are satisfied that a block primarily represents a latent male and female signal, for instance. In that case we are left with a probabilistic graphical model in which some assignment variables are observed, hence still encouraging a separation of representation of named properties.

2.6 Sampling

We wish to sample from the marginal distribution of a new datum, conditional on all observed data. It is an intractable average over the posterior of the global parameters,

(9)

which we approximate by substituting the posterior with its variational approximation. To sample from the resulting distribution, one-hot cluster indices are first sampled from a categorical distribution whose event probabilities are the posterior mean mixing weights. Then, given the sampled index for each block, each latent coordinate is drawn from a heavy-tailed Student-t distribution whose degrees of freedom are set by the Normal-Gamma posterior,

(10)

where the index runs over each block's entries. Finally, the datum is sampled from the decoder.
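
The sampling procedure can be sketched per factor as follows, assuming per-factor posterior parameters given by Dirichlet pseudo-counts alpha of shape (K,) and Normal-Gamma parameters m, beta, a, b of shape (K, D). The Student-t location and scale below follow the standard Normal-Gamma marginalization, and all names are illustrative.

import numpy as np
from scipy.stats import t as student_t

def sample_factor(alpha, m, beta, a, b, rng):
    # Sample the one-hot assignment from a categorical with the posterior
    # mean mixing weights as event probabilities.
    k = rng.choice(len(alpha), p=alpha / alpha.sum())
    # Marginalizing the Normal-Gamma posterior gives a Student-t with 2a
    # degrees of freedom, location m and scale sqrt(b * (beta + 1) / (a * beta)).
    scale = np.sqrt(b[k] * (beta[k] + 1.0) / (a[k] * beta[k]))
    return student_t.rvs(df=2.0 * a[k], loc=m[k], scale=scale, random_state=rng)

# z = np.concatenate([sample_factor(*params_f, rng) for params_f in factors])
# x = decoder(z)   # placeholder for the deep generative model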

2.7 Predictive Log Likelihood

The predictive likelihood of a new data point is estimated with a lower bound to its log marginal likelihood, which decomposes into a reconstruction term and two terms containing KL divergences,

(11)

The bound in (11) is first maximized over the new data point's responsibilities, using (5), before being evaluated. In our results, we consider the trade-off between encoding into the continuous latent variables and into the discrete assignments, by evaluating the two KL terms separately.

3 Related Work

The factorial representation in this paper is a collection of "Gaussian mixture model structured VAEs (SVAEs)", and our optimization scheme in Section 2.4 mirrors that of Algorithm 1 in Johnson et al. (2016). In a larger historical context, the model has roots in variational approximations for mixtures of factor analyzers Ghahramani & Beal (2000). In our work, the prior on each latent sub-vector is Gaussian given its assignment, and so is the encoder's approximate posterior. As the product of two Gaussians yields an unnormalized Gaussian (with a closed-form normalizer), the prior structure can further be incorporated into the inference network of a VAE Lin et al. (2018).

When the conjugate hyperprior is ignored, the mixture parameters are treated as model parameters, and a single mixture is used, we recover the model of Jiang et al. (2017). Alternatively, the "stick-breaking VAE" uses a stick-breaking prior on the mixing weights of a single mixture, and treats all other mixture parameters as model parameters Nalisnick & Smyth (2017); Nalisnick et al. (2016). The Gaussian mixture VAE Dilokthanakul et al. (2016) uses deep networks to transform a sample from a unit-variance isotropic Gaussian into the mean and variance of each mixture component of a Gaussian mixture model prior on the latent vector. Instead of inferring separate means and variances of the mixture components, they are learned non-linear transformations of the same random variable. Another variation on the mixture theme, the "variational mixture of posteriors" prior writes the mixture prior as a mixture of inference networks, conditioned on learnable pseudo-inputs Tomczak & Welling (2018). In Graves et al. (2018), the notion of a mixture prior is taken to the extreme, storing the approximate posterior distributions of every element in the training set as mixture components, and using hard nearest-neighbour lookups to determine the cluster assignment for a new datum.

The vector-quantized VAE (VQ-VAE) model van den Oord et al. (2017) is similar to the work presented here, with these differences: here, there are three types of random variables, the discrete assignments together with the continuous latent vectors and the mixture parameters, while the VQ-VAE consists only of discrete random variables (a form of the assignments). The VQ-VAE introduces a "stop-gradient", which on the surface resembles an E-step in an EM algorithm (see Section 2.4).

A growing body of literature exists around obtaining "untangled" representations from unsupervised models. These range from bootstrapping partially supervised data Narayanaswamy et al. (2017), to changing the relative contribution of the KL term(s) in the ELBO Higgins et al. (2017), to total correlation penalties Chen et al. (2018b); Kim & Mnih (2018), to specialized domain-adversarial training on the latent space Lample et al. (2017). In this work, the only additional penalty added to the ELBO appears in the form of the semi-supervised loss term in (8). The subdivision of data into a discrete and somewhat interpretable Cartesian product of clusters is purely the result of a hierarchical mixture model. The model presented in this paper can be viewed as a VAE whose latent code vector has a "KD encoding" prior Chen et al. (2018a), learned in a Bayesian probabilistic way.

Instead of letting the discrete properties be parent random variables of the latent vector, they can be treated as "side information" whose prediction helps recover an interpretable common latent variable. Adel et al. (2018)'s "interpretable lens variable model" constructs an invertible mapping (normalizing flow) between the latent variable and an interpretable representation, and uses it to add an interpretable layer to, for example, pre-trained encoders and decoders. As in this work, the interpretable representation might be much smaller than the latent vector.

4 Experimental results

We expand the example of Figure 1 by illustrating a factorization of CelebA faces into a product of cluster indices, the first three of which are interpretable. Using the binary MNIST and Omniglot data sets, we empirically show the trade-off between the costs of representing data as discrete and continuous latent variables in the predictive log likelihood (11), as the number of mixture components varies.

4.1 Binary MNIST and Omniglot


Figure 3: The ELBO of (11), averaged over the respective test sets, as a function of the number of mixture components. Generally, the cost of encoding into the continuous latent variables decreases with more components, at the expense of a higher relative entropy over cluster assignments. More mixture components also aid the "reconstruction term", shown averaged over the respective test sets.

For both the binary MNIST Salakhutdinov & Murray (2008) and Omniglot Lake et al. (2015) data sets, we use a conventional convolutional architecture for the encoder and decoder, taken from Gulrajani et al. (2017). The encoder is a convolutional neural network (CNN) with 8 layers of kernel size 3, alternating strides of 1 and 2, and ReLU activations between the layers. The number of channels is 32 in the first 3 layers, then 64 in the last five layers. Lastly, a linear layer transforms the CNN output to a vector of twice the latent dimensionality, to produce the means and log variances of the posterior approximation. The decoder emits a Bernoulli likelihood for every pixel independently. Training details are provided in Appendix E.
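
A PyTorch sketch of the encoder described above follows. The padding, the flattened feature size for 28-by-28 inputs, and the latent dimensionality are assumptions that are not stated in the text.

import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, latent_dim=32, in_channels=1):
        super().__init__()
        channels = [32, 32, 32, 64, 64, 64, 64, 64]   # 32 for 3 layers, then 64
        strides = [1, 2, 1, 2, 1, 2, 1, 2]            # alternating strides
        layers, prev = [], in_channels
        for i, (ch, st) in enumerate(zip(channels, strides)):
            layers.append(nn.Conv2d(prev, ch, kernel_size=3, stride=st, padding=1))
            if i < len(channels) - 1:                 # ReLU between the layers
                layers.append(nn.ReLU())
            prev = ch
        self.cnn = nn.Sequential(*layers)
        # For 28x28 inputs and the strides above, the spatial size ends at 2x2.
        self.fc = nn.Linear(64 * 2 * 2, 2 * latent_dim)

    def forward(self, x):
        h = self.cnn(x).flatten(start_dim=1)
        mean, log_var = self.fc(h).chunk(2, dim=-1)   # means and log variances
        return mean, log_var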

To demonstrate the benefits of using a structured prior, we ran experiments on MNIST and Omniglot with an increasing number of mixture components. (The data sets did not exhibit enough variability for additional factors not to be pruned away MacKay (2001).) Figure 3 shows the resulting test ELBO, log likelihood (negative reconstruction error) and KL divergences from (11).

We decompose the ELBO in (11) into a log likelihood and two expected KL divergences, one over the continuous latent variables and one over the discrete assignments. In Figure 3 we see the ELBO increase, and then seemingly converge, as model complexity increases. Following (11), the figure shows that there is a trade-off in the penalty that the encoder pays, between encoding into the continuous latent variables and encoding into the discrete assignments. As the number of mixture components increases, the discrete KL increases and the continuous KL decreases, pushing information about the encoding "higher up" in the representation. Beyond a certain number of components, proportionally fewer mixture components are used, reflected in a diminishing growth of the discrete KL. We would expect Figure 3 to be smooth, and its variability can be ascribed to random seeds, the initialization of the global parameters and local minima in the objective's landscape.

An ELBO of -85.77 on binarized MNIST represents a typical state-of-the-art result (-87.4) for unconditional generative models using simple convolutional neural networks Gulrajani et al. (2017). On Omniglot, our best test ELBO of -90.89 is comparable to some of the leading models of the day (-89.8 for the variational lossy auto-encoder in Chen et al. (2017); -95.5 in Rezende et al. (2016); -103.4 in Burda et al. (2016)).

Figure 4: Omniglot cluster means, variances, and samples (for 6 of the clusters).

In Figure 4 we inspect how the latent space clustering of Omniglot is reflected in input space. The figure shows the mean and variance of samples of six mixture components, obtained by repeatedly drawing latent space samples, decoding those, and computing the mean and variance of the resulting images. Additionally, five samples from each cluster are shown.

4.2 CelebA

The images in the CelebA dataset are 64-by-64 pixels with 3 channels (RGB), with each element scaled to the unit interval. For semi-supervised training, we use the annotations of 40% of the images. Hence 40% of data points have labels, for which we use the subset of gender, glasses and smiling labels as a semi-supervised signal. Our primary aim is to illustrate the factorial mixture prior capturing and encouraging compositional latent representations.


Figure 5: The mean, standard deviation, and a sample.
Figure 6: The means of the decoder's output, with latent vectors sampled from (10) using the factorial mixture prior (samples from the likelihood add position-dependent jitter to the mean; see Figure 5). Semi-supervised gender, glasses and smiling labels are incorporated in learning, and the figure shows randomly generated faces using the same set-up as that of Figure 1.

For CelebA, we use a 4-layer convolutional neural network with (128, 256, 512, 1024) channels, kernel size 3, and stride 2. The output of each layer is clipped with rectified linear units (ReLU) before being input to the next. As with MNIST and Omniglot, the final layer output is transformed with a linear layer to a vector of twice the latent dimensionality, to produce the means and log variances of the posterior approximation. The decoder is a deconvolutional neural network whose architecture is the transpose of that of the encoder. The decoder is different for this data set, as it needs to model 3 color channels (RGB). The last layer of the deconvolutional network has 6 channels, which represent the mean and standard deviation of the likelihood for each color channel of each pixel; see Figure 5. The standard deviations are parameterized via a scaled sigmoid function, to lie in [0.001, 0.4].
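
As a small sketch of the output parameterization just described, the last deconvolutional layer's 6 channels can be split into a 3-channel mean and a 3-channel standard deviation squashed into [0.001, 0.4] by a scaled sigmoid; the tensor shapes are assumed to be (batch, 6, 64, 64).

import torch

def decoder_output(raw):
    mean, raw_std = raw.chunk(2, dim=1)               # (batch, 3, 64, 64) each
    std = 0.001 + (0.4 - 0.001) * torch.sigmoid(raw_std)
    return mean, std

# The likelihood treats each pixel and color channel as an independent
# Gaussian with this mean and standard deviation.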

Training started with a prior initialization phase of a number of iterations, followed by a joint optimization phase. The Adam optimizer Kingma & Ba (2015) was used for the encoder and decoder parameters, and a decreasing step-size schedule with a forgetting rate and delay was used for the global factors' natural gradient updates. In training, we used a large weight in (8), as a typical unscaled cross-entropy loss in (8) is a few orders of magnitude smaller than the reconstruction term in (4a), which scales with the number of pixels and color channels. Empirically, we found that initially up-weighting the KL term in (4b) by a factor of 20 encouraged a crisper clustered representation to be learned Higgins et al. (2017).

Figure 6 visually illustrates a model in which the first three factors are semi-supervised with two mixture components each, while the remaining factors are unsupervised. For each row, we sample the unsupervised assignments once and sample the latent vector once from (10). By iterating over the semi-supervised settings and generating random faces from the model, the change in properties is perceptible as the latent code changes. Further results and examples are given in Appendix F.

Latent property Human judgement
male / female 94%
glasses / no glasses 85%
smiling / not smiling 81%
Table 1: Human judgements of properties of generated images, sampled according to Section 2.6 but with the semi-supervised latent variables clamped to intended ground-truth properties.

To formally test the interpretability of samples like those in Figure 6, we uniformly sampled the semi-supervised cluster assignments as the model's "intention", to test all properties equally. These were appended with a sample of the unsupervised assignments, the latent vector was sampled according to (10), and the mean was rendered. For 54 such random images, Table 1 shows the accuracy of human evaluations compared to their latent labels. Ten people of varying race and gender evaluated the generated images. The evaluation is subjective (the criteria for smiling need not be consistent with those of the creators of CelebA Liu et al. (2015)), but gives evidence that the assignments capture interpretable attributes.

On the CelebA test set, the estimated cluster assignments matched the labels 97% of the time for gender, 99% for the presence or absence of glasses, and 91% for whether the subject is smiling or not. Note that unlike the data evaluated for Table 1, the labeled properties are not uniformly distributed over the test set.

Acknowledgements

We are grateful to three anonymous Neural Information Processing Systems reviewers, who provided invaluable suggestions for improving this paper.

References

  • Adel et al. (2018) Adel, T., Ghahramani, Z., and Weller, A. Discovering interpretable representations for both deep generative and discriminative models. In Proceedings of the 35th International Conference on Machine Learning, pp. 50–59, 2018.
  • Attias (1999) Attias, H. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 21–30, 1999.
  • Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
  • Chen et al. (2018a) Chen, T., Min, M. R., and Sun, Y. Learning k-way d-dimensional discrete codes for compact embedding representations. In Proceedings of the 35th International Conference on Machine Learning, pp. 854–863, 2018a.
  • Chen et al. (2018b) Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv:1802.04942, 2018b.
  • Chen et al. (2017) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. In International Conference on Learning Representations, 2017.
  • Dilokthanakul et al. (2016) Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv:1611.02648, 2016.
  • Fraccaro et al. (2017) Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems 30, pp. 3604–3613, 2017.
  • Ghahramani & Beal (2000) Ghahramani, Z. and Beal, M. J. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, pp. 449–455, 2000.
  • Graves et al. (2018) Graves, A., Menick, J., and van den Oord, A. Associative compression networks. arXiv:1804.02476, 2018.
  • Gulrajani et al. (2017) Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. PixelVAE: A latent variable model for natural images. In International Conference on Learning Representations, 2017.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Jegou et al. (2011) Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
  • Jiang et al. (2017) Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1965–1972, 2017.
  • Johnson et al. (2016) Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems 29, pp. 2946–2954, 2016.
  • Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv:1802.05983, 2018.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Lample et al. (2017) Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., and Ranzato, M. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems 30, pp. 5967–5976. 2017.
  • Lin et al. (2018) Lin, W., Hubacher, N., and Khan, M. E. Variational message passing with structured inference networks. International Conference on Learning Representations, 2018.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • MacKay (2001) MacKay, D. J. C. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Technical report, University of Cambridge Cavendish Laboratory, 2001.
  • Nalisnick & Smyth (2017) Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
  • Nalisnick et al. (2016) Nalisnick, E., Hertel, L., and Smyth, P. Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.
  • Narayanaswamy et al. (2017) Narayanaswamy, S., Paige, T. B., van de Meent, J.-W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems 30, pp. 5925–5935. 2017.
  • Paquet (2015) Paquet, U. On the convergence of stochastic variational inference in Bayesian networks. arXiv:1507.04505, 2015.
  • Paquet & Koenigstein (2013) Paquet, U. and Koenigstein, N. One-class collaborative filtering with random graphs. In Proceedings of the 22nd International Conference on World Wide Web (WWW), pp. 999–1008. 2013.
  • Rezende et al. (2016) Rezende, D. J., Mohamed, S., Danihelka, I., Gregor, K., and Wierstra, D. One-shot generalization in deep generative models. arXiv:1603.05106, 2016.
  • Salakhutdinov & Murray (2008) Salakhutdinov, R. and Murray, I. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pp. 872–879, 2008.
  • Tomczak & Welling (2018) Tomczak, J. M. and Welling, M. VAE with a VampPrior. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
  • van den Oord et al. (2017) van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Advances in Neural Information Processing Systems 30, pp. 6306–6315. 2017.
  • Wainwright & Jordan (2008) Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
  • Waterhouse et al. (1996) Waterhouse, S. R., MacKay, D. J. C., and Robinson, A. J. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems, pp. 351–357, 1996.

Appendix A Evidence Lower Bound

We bound the marginal likelihood of the entire data set, assuming data points are generated i.i.d., with Jensen's inequality. Using the joint distribution in (2) and the approximation in (3), we rewrite the bound as