The LORACs prior for VAEs: Letting the Trees Speak for the Data

10/16/2018 ∙ by Sharad Vikram, et al.

In variational autoencoders, the prior on the latent codes z is often treated as an afterthought, but the prior shapes the kind of latent representation that the model learns. If the goal is to learn a representation that is interpretable and useful, then the prior should reflect the ways in which the high-level factors that describe the data vary. The "default" prior is an isotropic normal, but if the natural factors of variation in the dataset exhibit discrete structure or are not independent, then the isotropic-normal prior will actually encourage learning representations that mask this structure. To alleviate this problem, we propose using a flexible Bayesian nonparametric hierarchical clustering prior based on the time-marginalized coalescent (TMC). To scale learning to large datasets, we develop a new inducing-point approximation and inference algorithm. We then apply the method without supervision to several datasets and examine the interpretability and practical performance of the inferred hierarchies and learned latent space.


1 Introduction

Variational autoencoders (VAEs; Kingma and Welling, 2014; Rezende et al., 2014) are a popular class of deep latent-variable models. The VAE assumes that observations $x_n$ are generated by first sampling a latent vector $z_n$ from some tractable prior $p(z)$, and then sampling $x_n$ from some tractable distribution $p_\theta(x \mid z_n)$. For example, $\mu_\theta(\cdot)$ could be a neural network with weights $\theta$, and $p_\theta(x \mid z)$ might be a Gaussian with mean $\mu_\theta(z)$.
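To make this concrete, the following is a minimal sketch of the VAE generative process in NumPy; the two-layer decoder, its random weights, and the Gaussian observation noise are illustrative assumptions rather than the architecture used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim, hidden = 40, 784, 256

# Illustrative decoder weights theta (random here; learned in practice).
W1 = rng.normal(scale=0.1, size=(latent_dim, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, data_dim))

def decode_mean(z):
    """mu_theta(z): a small feed-forward network mapping z to data space."""
    return np.tanh(z @ W1) @ W2

# Generative process: z_n ~ p(z) = N(0, I), then x_n ~ N(mu_theta(z_n), sigma^2 I).
z = rng.normal(size=(5, latent_dim))                        # latent vectors
x = decode_mean(z) + 0.1 * rng.normal(size=(5, data_dim))   # observations
```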

VAEs, like other unsupervised latent-variable models (e.g., Tipping and Bishop, 1999; Blei et al., 2003), can uncover latent structure in datasets. In particular, one might hope that high-level characteristics of the data are encoded more directly in the geometry of the latent space than they are in the data space. For example, when modeling faces one might hope that one latent dimension corresponds to pose, another to hair length, another to gender, and so on.

What kind of latent structure will the VAE actually discover? Hoffman and Johnson (2016) observe that the ELBO encourages the model to make the statistics of the population of encoded vectors resemble those of the prior, so that the aggregate posterior $\frac{1}{N}\sum_n q_\phi(z \mid x_n)$ approximately matches $p(z)$. The prior therefore plays an important role in shaping the geometry of the latent space. For example, if we use the “default” prior $p(z) = \mathcal{N}(0, I)$, then we are asking the model to explain the data in terms of smoothly varying, completely independent factors (Burgess et al., 2018). These constraints may sometimes be reasonable—for example, geometric factors such as pose or lighting angle may be nearly independent and rotationally symmetric. But some natural factors exhibit dependence structure (for example, facial hair length and gender are strongly correlated), and others may have nonsmooth structure (for example, handwritten characters naturally cluster into discrete groups).

In this paper, we propose using a more opinionated prior on the VAE’s latent vectors: the time-marginalized coalescent (TMC; Boyles and Welling, 2012). The TMC is a powerful, interpretable Bayesian nonparametric hierarchical clustering model that can encode rich discrete and continuous structure. Combining the TMC with the VAE combines the strengths of Bayesian nonparametrics (interpretable, discrete structure learning) and deep generative modeling (freedom from restrictive distributional assumptions).

Our contributions are:

  • We propose a deep Bayesian nonparametric model that can discover hierarchical cluster structure in complex, high-dimensional datasets.

  • We develop a minibatch-friendly inference procedure for fitting TMCs based on an inducing-point approximation, which scales to arbitrarily large datasets.

  • We show that our model’s learned latent representations consistently outperform those learned by other variational (and classical) autoencoders when evaluated on downstream classification and retrieval tasks.

2 Background

2.1 Bayesian priors for hierarchical clustering

Figure 1: Independent samples from a time-marginalized coalescent (TMC) prior with a two-dimensional Gaussian random walk likelihood model (10 and 300 leaves, respectively). Contours correspond to the posterior predictive density. As the number of leaves grows, the predictive density grows more complex.

Hierarchical clustering is a flexible tool in exploratory data analysis, as trees offer visual, interpretable summaries of data. Typically, algorithms for hierarchical clustering are either agglomerative (data are recursively, greedily merged to form a tree from the bottom up) or divisive (data are recursively partitioned, forming a tree from the top down). Bayesian nonparametric hierarchical clustering (BNHC) additionally incorporates uncertainty over tree structure by introducing a prior distribution over trees and a likelihood model for the data, with the goal of sampling the posterior distribution over trees given the data (we use $T$ to denote probability distributions relating to the TMC, distinguishing them from the $p$ and $q$ distributions used later in the paper).

In this paper, we focus on rooted binary trees with labeled leaves adorned with branch lengths, called phylogenies. Prior distributions over phylogenies often take the form of a stochastic generative process in which a tree is built with random merges, as in the Kingman coalescent (Kingman, 1982), or random splits, as in the Dirichlet diffusion tree (Neal, 2003). These nonparametric distributions have helpful properties, such as exchangeability, which enable efficient Bayesian inference. In this paper, we focus on the time-marginalized coalescent (TMC; Boyles and Welling, 2012), which decouples the distribution over tree structure from the distribution over branch lengths, a property that helps simplify inference down the line.

2.1.1 Time-marginalized coalescent (TMC)

The time-marginalized coalescent defines a prior distribution over phylogenies. A phylogeny is a directed rooted full binary tree with vertex set $V$ and edge set $E$, together with time labels $t_v \in [0, 1]$ for each vertex $v \in V$. The vertex set is partitioned into leaf vertices and internal vertices, and we identify the leaves with the data points $1, \dots, N$ to simplify notation. The directed edges of the tree are encoded in the edge set $E$; we denote the root vertex as $v_{\mathrm{root}}$, and every other vertex $v$ has a parent $\mathrm{parent}(v)$ with $(\mathrm{parent}(v), v) \in E$.

The TMC samples a random tree structure by a stochastic process in which the leaves are recursively merged uniformly at random until only one vertex is left. This process yields the probability mass function on valid pairs $(V, E)$ given by

$$T(V, E) = \frac{(N-1)!}{\prod_{v \in V_{\mathrm{int}}} i(v)} \prod_{k=2}^{N} \binom{k}{2}^{-1}, \qquad (1)$$

where $i(v)$ denotes the number of internal vertices in the subtree rooted at $v$ (the first factor counts the merge orderings consistent with the tree, and the second factor is the probability of any one such ordering). Given the tree structure, time labels are generated via a stick-breaking process: the root is assigned time 0, the leaves time 1, and each internal vertex receives a time between its parent's time and 1 by breaking off a random fraction of the remaining interval,

(2)

where the stick-breaking fractions are drawn independently for each internal vertex. These time labels encode a branch length $t_v - t_{\mathrm{parent}(v)}$ for each edge $(\mathrm{parent}(v), v) \in E$. Together, (1) and (2) define the overall density on phylogenies with $N$ leaves.
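As an illustration of this generative process, the sketch below samples a tree structure by recursive uniform merging and then assigns times top-down by stick breaking; the Beta(2, 2) stick-breaking fractions and the dictionary-based tree representation are simplifying assumptions for this sketch, not the exact parameterization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tmc_tree(n_leaves):
    """Sample a TMC-style phylogeny: uniform recursive merges, then times.

    Returns (children, times, root): `children[v]` gives the two children of
    internal vertex v, and `times[v]` is the time label of vertex v
    (root at 0, leaves at 1).
    """
    # Topology: recursively merge uniformly chosen pairs of subtree roots.
    active = list(range(n_leaves))
    children, next_id = {}, n_leaves
    while len(active) > 1:
        i, j = rng.choice(len(active), size=2, replace=False)
        a, b = active[i], active[j]
        children[next_id] = (a, b)
        active = [v for v in active if v not in (a, b)] + [next_id]
        next_id += 1
    root = active[0]

    # Times: break off a random fraction of the interval left before time 1.
    times = {v: 1.0 for v in range(n_leaves)}   # leaves live at time 1
    times[root] = 0.0
    def assign(v):
        for c in children.get(v, ()):
            if c in children:                    # internal child
                frac = rng.beta(2.0, 2.0)        # assumed stick-breaking law
                times[c] = times[v] + frac * (1.0 - times[v])
                assign(c)
    assign(root)
    return children, times, root

children, times, root = sample_tmc_tree(10)
```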

Finally, to connect the TMC prior to data, we define a likelihood model over the $N$ data points, with data point $x_n$ corresponding to leaf vertex $n$. We use a Gaussian random walk (GRW): for each non-root vertex, a location is sampled from a Gaussian distribution centered at its parent's location with variance equal to the incoming branch length, and the data are identified with the locations at the leaves. As a result of this choice, we can exploit the Gaussian graphical model structure to efficiently marginalize out the locations associated with internal vertices and evaluate the resulting marginal density of the leaf locations given the phylogeny. For details about this marginalization, please refer to Appendix C. The final overall density is written as

(3)

For further details and derivations related to the TMC, please refer to Boyles and Welling (2012).
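Given a sampled phylogeny, the GRW likelihood can be simulated by walking from the root to the leaves; the standard-normal root location in this sketch is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_grw_leaves(children, times, root, dim=2, root_scale=1.0):
    """Sample leaf locations from a Gaussian random walk along a phylogeny.

    Each vertex's location is Gaussian around its parent's location, with
    variance equal to the branch length (the difference in time labels).
    """
    loc = {root: root_scale * rng.normal(size=dim)}   # assumed root prior
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            branch = times[c] - times[v]
            loc[c] = loc[v] + np.sqrt(branch) * rng.normal(size=dim)
            stack.append(c)
    leaves = sorted(v for v in times if v not in children)
    return np.stack([loc[v] for v in leaves])

# Tiny hand-built phylogeny: leaves 0-2 at time 1, internal nodes 3 and 4 (root).
children = {4: (3, 2), 3: (0, 1)}
times = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.4, 4: 0.0}
print(sample_grw_leaves(children, times, root=4))
```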

2.1.2 TMC posterior predictive density

The TMC with $N$ leaves and a GRW likelihood model can serve as a prior on a set of hierarchically structured data: data points corresponding to leaves with small tree distance should have similar locations. It also acts as a density from which we can sample new data. The posterior predictive density is easy to sample thanks to the exchangeability of the TMC.

To sample a new data point, we select a branch (edge) and a time at which to attach a new leaf node. The probability of selecting a branch is proportional to the probability, under the TMC prior, of the tree with a new leaf attached to that branch. The density over the attachment time is determined by the stick-breaking process (see Appendix C for details). Both of these quantities are easy to compute and sample thanks to the exchangeability of the TMC.

The new location can then be sampled from the Gaussian distribution induced by the GRW likelihood model at the chosen attachment point. Figure 1 shows samples from a TMC prior and GRW likelihood, where contours correspond to the posterior predictive density. In addition to modeling hierarchical structure, the TMC is thus a flexible nonparametric density estimator.

2.1.3 TMC inference

The posterior distribution over phylogenies is analytically intractable because its normalization constant involves a sum over all tree structures, but it can be approximately sampled via Markov chain Monte Carlo (MCMC) methods. We use the Metropolis-Hastings algorithm with a subtree-prune-and-regraft (SPR) proposal distribution (Neal, 2003). An SPR proposal picks a subtree uniformly at random, detaches it, and reattaches it to the tree at a branch and time picked uniformly at random. The Metropolis-Hastings acceptance probability is efficient to compute because the joint density can be evaluated using belief propagation to marginalize the latent values at internal nodes, and many of the messages can be cached. See Appendix C for details.
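Schematically, the sampler is an ordinary Metropolis-Hastings loop around an SPR move. In the sketch below, `spr_propose` and `log_joint` are placeholder callables (the real versions manipulate the phylogeny and run belief propagation with cached messages):

```python
import math
import random

def mh_spr_chain(tree, log_joint, spr_propose, n_steps, seed=0):
    """Schematic Metropolis-Hastings over phylogenies using SPR proposals.

    `log_joint(tree)` should return the log joint density of the tree and the
    leaf data (computed via belief propagation in practice); `spr_propose`
    should return a proposed tree plus the log proposal ratio
    log q(old | new) - log q(new | old). Both are placeholders here.
    """
    rng = random.Random(seed)
    cur_lp = log_joint(tree)
    samples = []
    for _ in range(n_steps):
        proposal, log_q_ratio = spr_propose(tree, rng)
        prop_lp = log_joint(proposal)
        # Accept with probability min(1, exp(log acceptance ratio)).
        if math.log(rng.random()) < prop_lp - cur_lp + log_q_ratio:
            tree, cur_lp = proposal, prop_lp
        samples.append(tree)
    return samples
```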

2.2 Variational autoencoder

The variational autoencoder (VAE) is a generative model for a dataset $x_{1:N}$ in which latent vectors $z_n$ are sampled from a prior distribution $p(z)$ and then individually passed into a neural network observation model with parameters $\theta$,

$$z_n \sim p(z), \qquad x_n \mid z_n \sim p_\theta(x \mid z_n). \qquad (4)$$

We are interested in the posterior distribution $p_\theta(z \mid x)$, which is not analytically tractable but can be approximated with a variational distribution $q_\phi(z \mid x)$, typically a neural network that outputs the parameters of a Gaussian distribution. The weights $\phi$ of the approximate posterior can be learned by optimizing the evidence lower bound (ELBO),

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right). \qquad (5)$$

The parameters of the model, $\theta$ and $\phi$, are learned via stochastic gradient ascent on the ELBO, using the reparametrization trick to reduce gradient variance (Kingma and Welling, 2014; Rezende et al., 2014).
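For a Gaussian encoder, a standard-normal prior, and a Bernoulli decoder, a single-sample Monte Carlo estimate of the ELBO with the reparametrization trick looks like the following sketch; the closed-form KL term is specific to the standard-normal prior, and the decoder is a placeholder callable.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, enc_mean, enc_logvar, decode_logits):
    """One-sample reparametrized ELBO estimate for a single data point.

    enc_mean, enc_logvar: outputs of the encoder q_phi(z | x).
    decode_logits: placeholder callable mapping z to the Bernoulli logits
    of p_theta(x | z).
    """
    eps = rng.normal(size=enc_mean.shape)
    z = enc_mean + np.exp(0.5 * enc_logvar) * eps          # reparametrization
    logits = decode_logits(z)
    # Bernoulli log-likelihood log p_theta(x | z).
    log_lik = np.sum(x * logits - np.logaddexp(0.0, logits))
    # KL(q_phi(z | x) || N(0, I)) in closed form.
    kl = 0.5 * np.sum(np.exp(enc_logvar) + enc_mean**2 - 1.0 - enc_logvar)
    return log_lik - kl
```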

3 The TMC-VAE

The choice of prior distribution in the VAE significantly affects the autoencoder and the resulting latent space. The default standard normal prior, $p(z) = \mathcal{N}(0, I)$, acts as a regularizer on an otherwise unconstrained autoencoder, but it can be restrictive and lead to overpruning (Burda et al., 2015). Extremely flexible, learnable distributions like masked autoregressive flow (MAF) priors (Papamakarios et al., 2017) enable very rich latent spaces, but they do not encode any interpretable bias for organizing the latent space (except perhaps smoothness).

In this paper, we explore the TMC prior for the VAE, which can strike a balance between overly restrictive and overly flexible priors. We generate the latent values of the VAE according to the TMC prior, then generate observations using a neural network observation model,

(6)
(7)

The TMC-VAE is a coherent generative process that captures discrete, interpretable structure in the latent space. A phylogeny not only carries an intuitive inductive bias, but is also useful for exploratory data analysis and for introspecting the latent space itself.

Consider doing inference in this model: first assume variational distributions over the latent vectors (as in the VAE) and over the phylogeny, which results in the ELBO,

(8)

For a fixed variational distribution over the latent vectors, we can sample the optimal variational factor over phylogenies,

(9)

Because the marginal GRW density of the latent vectors given the phylogeny is jointly Gaussian (factorizing according to the tree structure) and the variational distribution over latent vectors is Gaussian, expectations with respect to the latter can be moved inside the former. This enables sampling the optimal factor, which is proportional to the exponentiated expected joint likelihood, using SPR Metropolis-Hastings. However, optimizing this ELBO is problematic. The optimal factor over phylogenies does not factorize over data points, so unbiased gradient estimates from minibatches are not available: every update requires evaluating the entire dataset. Furthermore, the TMC itself is limiting from a computational perspective. Since the phylogeny has as many leaves as there are points in the dataset, belief propagation over the internal nodes of the tree slows down linearly as the dataset grows, and SPR proposals mix very slowly for large trees. We found that these limitations make the model impractical for datasets of more than about 1,000 examples.

In the next section, we address these computational issues, while retaining the interesting properties of the TMC-VAE.

(a) TMC-VAE graphical model
(b) LORACs-VAE graphical model
(c) LORACs-VAE variational factors
Figure 2: Graphical models and variational approximations for the TMC models described in the paper.

4 LORACs prior for VAEs

In this section, we introduce a novel approximation to the TMC prior, which preserves many desirable properties like structure and interpretability, while being computationally viable. Our key idea is to use a set of learned inducing points as the leaves of the tree in the latent space, analogous to inducing-input approximations for Gaussian processes (Snelson and Ghahramani, 2006). In this model, latent vectors are not directly hierarchically clustered, but are rather independent samples from the induced posterior predictive density of a TMC. We call this the Latent ORganization of Arboreal Clusters (LORACs, pronounced “lorax”) prior.

To define the LORACs prior, we first define an auxiliary TMC whose leaves are a set of learned inducing points in the latent space. We treat the inducing points as free, learnable parameters, and define the LORACs prior on phylogenies as the conditional distribution over trees given these pseudo-observations:

(10)

That is, we choose the prior on phylogenies to be the posterior distribution of a TMC whose leaves are the inducing points. Next, we define the LORACs prior on each latent location as a conditionally independent draw from the TMC's posterior predictive distribution, keeping track of the sampled attachment branch and time for each data point:

(11)

To complete the model, we use an observation likelihood parameterized by a neural network, writing

(12)

By using the learned inducing points, we avoid the main difficulty of inference in the TMC-VAE of Section 3, namely the need to do inference over a tree with one leaf per data point. Instead, dependence between data points is mediated by the set of inducing points, whose size is independent of the dataset size. As a result, with the LORACs prior, minibatch-based learning becomes tractable even for very large datasets. The quality of the approximation to the TMC-VAE can be tuned by adjusting the number of inducing points.
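Schematically, the LORACs generative process looks like the sketch below; every helper is a placeholder standing in for the components described above (TMC posterior sampling over the inducing points, predictive attachment, and the decoder). The point of the sketch is that, once the tree over the inducing points is fixed, the latent vectors are drawn independently across data points.

```python
def lorac_generate(inducing_points, n_data, sample_tree, sample_attach,
                   sample_location, decode):
    """Sketch of the LORACs generative process (all callables are stubs).

    sample_tree(psi)             -> phylogeny over the inducing points
    sample_attach(tree)          -> (branch, time) from the TMC predictive
    sample_location(tree, b, t)  -> latent vector z for that attachment
    decode(z)                    -> parameters of the observation likelihood
    """
    tree = sample_tree(inducing_points)      # one tree, shared by all data
    observations = []
    for _ in range(n_data):                  # z_n are i.i.d. given the tree
        branch, time = sample_attach(tree)
        z = sample_location(tree, branch, time)
        observations.append(decode(z))
    return tree, observations
```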

However, this technique presents its own inference challenges. Sampling the optimal variational factor over phylogenies is no longer an option as it was in the TMC-VAE:

(13)

This term contains a sum of expectations over the whole dataset, so evaluating it for the purposes of MCMC would require passing the entire dataset through a neural network. Furthermore, the normalizer of this likelihood is intractable, but it is needed to compute gradients with respect to the inducing points. We therefore avoid using the optimal factor and instead set the variational distribution over phylogenies to the prior. This has the additional computational advantage of cancelling the corresponding term in the ELBO, which also has an intractable normalizing constant. If the inducing points are chosen so that they contain most of the information about the hierarchical organization of the dataset, then this approximation will be reasonable.

We also fit variational factors for each data point's latent location, attachment branch, and attachment time. The factor for attachment times is a recognition network that outputs a posterior over attachment times for a particular branch. Since the terms over phylogenies cancel, we obtain the following ELBO (some notation suppressed for simplicity):

(14)

This ELBO can be optimized by first computing the optimal categorical distribution over attachment branches,

(15)

and then computing gradients with respect to the encoder, decoder, inducing points, and time-recognition network using a Monte Carlo estimate of the ELBO, with samples drawn from the variational factors. The variational distribution over phylogenies can be sampled using vanilla SPR Metropolis-Hastings. The detailed inference procedure can be found in Appendix C.

5 Related work

As mentioned above, LORACs connects various ideas in the literature, including Bayesian nonparametrics (Boyles and Welling, 2012), inducing-point approximations (e.g., Snelson and Ghahramani, 2006; Tomczak and Welling, 2018), and amortized inference (Kingma and Welling, 2014; Rezende et al., 2014).

Also relevant is a recent thread of work endowing VAEs with the interpretability of graphical models (e.g., Johnson et al., 2016; Lin et al., 2018). In this vein, Goyal et al. (2017) propose using a different Bayesian nonparametric tree prior, the nested Chinese restaurant process (nCRP; Griffiths et al., 2004), in a VAE. We chose to base LORACs on the TMC instead, as the posterior predictive distribution of an nCRP is a finite mixture, whereas the TMC's posterior predictive distribution has richer continuous structure. Another distinction is that Goyal et al. (2017) only consider learning from pretrained image features, whereas our approach is completely unsupervised.

6 Results

In this section, we analyze properties of the LORACs prior, focusing on qualitative aspects, like exploratory data analysis and interpretability, and quantitative aspects, like few-shot classification and information retrieval.

Experimental setup

We evaluated the LORACs prior on three separate datasets: dynamically binarized MNIST (LeCun and Cortes, 2010), Omniglot (Lake et al., 2015), and CelebA (Liu et al., 2015). For all three experiments, we used convolutional/deconvolutional encoders/decoders and a 40-dimensional latent space (detailed architectures can be found in Appendix D). We used 200, 1000, and 500 inducing points for MNIST, Omniglot, and CelebA, respectively. The attach-time recognition network was a two-layer, 500-unit-wide neural network with ReLU activations that outputs the parameters of a logistic-normal distribution over stick size, and all parameters were optimized with Adam (Kingma and Ba, 2015). Other implementation details can be found in Appendix D.

6.1 Qualitative results

A hierarchical clustering in the latent space offers a unique opportunity for interpretability and exploratory data analysis, especially when the data are images. Below, we describe several ways a user can obtain useful data summaries and explore a dataset.

Visualizing inducing points

We first inspect the learned inducing points by passing them through the decoder. Figure 3 shows the 200 learned inducing points for MNIST. The inducing points are all distinct and are cleaner than the pseudo-input reconstructions from the VampPrior (shown in Figure A.13). Inducing points can help summarize a dataset: visualizations of the latent space indicate that they spread out and cover the data (see Figure A.11). The inducing points are also visually distinct and sensible for Omniglot and CelebA (see Figures A.14 and A.15).

Figure 3: Learned inducing points for a LORACs(200) prior on MNIST.
Hierarchical clustering

We can sample the variational distribution over phylogenies to obtain trees over the inducing points, and we can visualize these clusterings using the decoded inducing points; subtrees from one sample per dataset are shown in Figure 4. In MNIST, we find that large subtrees correspond to the discrete classes in the dataset. In Omniglot, subtrees sometimes correspond to language groups and letter shapes. In CelebA, subtrees sometimes correspond to pose or to hair color and style.

We can further use the time at each internal node to summarize the data at many levels of granularity. Consider “slicing” the hierarchy at a particular time by taking every branch that spans that time and computing the corresponding expected Gaussian random walk value at that time. At times closer to zero, we slice fewer branches and are closer to the root of the hierarchy, so the values at the slice look more like the mean of the data. In Figure 5, we visualize this process over a subset of the inducing points of CelebA. Visualizing the dataset in this way reveals cluster structure at different granularities and offers an evolutionary interpretation of the data, as leaves that coalesce more “recently” are likely to be closer in the latent space.

Although the hierarchical clustering is only over the inducing points, we can still visualize where real data belong in the hierarchy by computing each data point's posterior attachment distribution and attaching the data to the tree. By doing this for many data points, and then removing the inducing points from the tree, we obtain an induced hierarchical clustering of the data.

(a) MNIST
(b) Omniglot
(c) CelebA
Figure 4: An example learned subtree from a sampled phylogeny for each dataset. Leaves are visualized by passing inducing points through the decoder.
Figure 5: The evolution of a CelebA hierarchy over a subset of inducing points. We create this visualization by taking slices of the tree at particular times and looking at the latent distribution at each of the sliced branches.
Generating samples

Having fit a generative model to our data, we can visualize samples from the model. Although we do not expect the samples to have fidelity and sharpness comparable to those from GANs or state-of-the-art decoding networks (Radford et al., 2015; Salimans et al., 2017), sampling with the LORACs prior can help us understand the latent space. To draw a sample from a TMC's posterior predictive density, we first sample a branch and time, assigning the sample a place in the tree. This gives each generated sample a context, i.e., the branch and subtree it was generated from. Moreover, learning a LORACs prior allows us to conditionally sample in a novel way: by restricting samples to a subtree, we can generate samples from the posterior predictive density restricted to that subtree. This enables conditional sampling at many levels of the hierarchy. We visualize examples of this in Figure 6 and Figure 7.

(a) MNIST
(b) Omniglot
Figure 6: Conditional samples from subtrees.
Figure 7: Samples from subtrees of CelebA.

6.2 Quantitative results

We ran experiments designed to evaluate the usefulness of the latent space learned with the LORACs prior for downstream tasks. We compare the LORACs prior against a set of baseline priors on three different tasks: few-shot classification, information retrieval, and generative modeling. Our datasets are dynamically binarized MNIST and Omniglot (split by instance), and our baselines are representations learned with the same encoder-decoder architecture and latent dimensionality (for DVAE#, following the defaults in the authors' reference implementation, we used statically binarized MNIST with smaller neural networks but a higher-dimensional latent space), substituting the following prior distributions over the latent vectors:

  • No prior

  • Standard normal prior

  • VampPrior (Tomczak and Welling, 2018) - 500 pseudo-inputs for MNIST, 1000 for Omniglot

  • DVAE# (Vahdat et al., 2018) - latent vectors are 400-dimensional, formed by concatenating binary latents; the encoder and decoder are two-layer feed-forward networks with ReLU nonlinearities

  • Masked autoregressive flow (MAF; Papamakarios et al., 2017) - a two-layer, 512-wide MADE

(a) MNIST
(b) Omniglot
Figure 8: Few-shot classification results
Few-shot classification

In this task, we train a classifier with varying numbers of labels and measure test accuracy. We pick equal numbers of labels per class to avoid imbalance, and we use a logistic regression classifier trained to convergence to avoid adding unnecessary degrees of freedom to the experiment. We replicated the experiment across 20 randomly chosen label sets for MNIST and 5 for Omniglot. The test accuracy on these datasets is visualized in Figure 8. For MNIST, we also manually labeled the inducing points and found that training a classifier on 200 and 500 labeled inducing points achieved significantly better test accuracy than randomly chosen labeled points, hinting that the LORACs prior has utility in an active learning setting.
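The evaluation protocol can be reproduced roughly as sketched below with scikit-learn; the regularization settings and the way latent codes are obtained from the encoder are assumptions rather than the exact experimental configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_accuracy(z_train, y_train, z_test, y_test, labels_per_class,
                      rng=np.random.default_rng(0)):
    """Fit logistic regression on a balanced labeled subset of latent codes."""
    idx = []
    for c in np.unique(y_train):
        members = np.flatnonzero(y_train == c)
        idx.extend(rng.choice(members, size=labels_per_class, replace=False))
    clf = LogisticRegression(max_iter=5000)   # effectively "to convergence"
    clf.fit(z_train[idx], y_train[idx])
    return clf.score(z_test, y_test)
```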

The representations learned with the LORACs prior consistently achieve better accuracy, though on MNIST the LORACs prior and MAF reach very similar test accuracy at 100 labels per class. The advantage of the LORACs prior is especially clear on Omniglot (Table B.3 and Table B.4 contain the exact numbers). We believe the advantage on this task comes from the ability of the LORACs prior to model discrete structure. TSNE visualizations in Figure 9 and Figure A.10 indicate that clusters are more concentrated and separated with the LORACs prior than with other priors, though TSNE visualizations should be taken with a grain of salt.

(a) Normal prior
(b) LORACs(200) prior
Figure 9: TSNE visualizations of the latent space of the MNIST test set with different priors, color-coded according to class. The LORACs prior appears to learn a space with more separated, concentrated clusters.
Information retrieval

We evaluated the meaningfulness of Euclidean distances in the learned latent space by measuring precision-recall when querying the test set. We take each element of the test set and sort all other members according to their distance in the latent space. From this ranking, we produce a precision-recall curve for each query and plot the average precision-recall over the entire test set in Figure B.16. We also report the area under the curve (AUC) for each of these curves in Table 1.

AUC numbers for Omniglot are low across the board because of the large number of classes and the low number of instances per class. However, on both datasets the LORACs prior consistently achieves the highest AUC, especially on MNIST. The LORACs prior encourages tree distance to correspond to squared Euclidean distance, since branch lengths in the tree are variances in the Gaussian likelihood. We therefore expect distances in a LORACs latent space to be more informative and better suited to information retrieval.
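The retrieval metric can be computed roughly as follows: rank the rest of the test set by Euclidean distance to each query and average a per-query precision-recall summary. This sketch uses average precision as the PR-AUC estimate and is our reading of the protocol rather than the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def retrieval_auc(z, labels):
    """Mean per-query precision-recall AUC (estimated by average precision)."""
    scores = []
    for i in range(len(z)):
        dists = np.linalg.norm(z - z[i], axis=1)
        mask = np.arange(len(z)) != i                     # exclude the query
        relevant = (labels[mask] == labels[i]).astype(int)
        # Higher score means closer in the latent space.
        scores.append(average_precision_score(relevant, -dists[mask]))
    return float(np.mean(scores))
```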

Held-out log-likelihood

We estimate held-out log-likelihoods for the four VAEs we trained with comparable architectures and different priors. (We exclude DVAE since its architecture is substantially different, and the classical autoencoder since it lacks generative semantics.) We use 1000 importance-weighted samples (Burda et al., 2015) to estimate held-out log-likelihood, and report the results in Table 2. We find that, although LORACs outperforms the other priors on downstream tasks, it only achieves middling likelihood numbers. This result is consistent with the findings of Chang et al. (2009) that held-out log-likelihood is not necessarily correlated with interpretability or usefulness for downstream tasks.
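The importance-weighted estimate of the held-out log-likelihood follows Burda et al. (2015); a sketch with generic log-density callables (placeholders for the encoder, prior, and decoder densities) is below.

```python
import numpy as np
from scipy.special import logsumexp

def iwae_log_likelihood(x, sample_q, log_q, log_p_z, log_p_x_given_z, k=1000):
    """Importance-weighted estimate of log p(x) with k posterior samples.

    log p(x) ~= logsumexp_i [log p(x | z_i) + log p(z_i) - log q(z_i | x)] - log k,
    with z_i ~ q(z | x). All density callables are placeholders.
    """
    log_w = np.empty(k)
    for i in range(k):
        z = sample_q(x)
        log_w[i] = log_p_x_given_z(x, z) + log_p_z(z) - log_q(z, x)
    return logsumexp(log_w) - np.log(k)
```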

Prior MNIST Omniglot
No prior 0.429 0.078
Normal 0.317 0.057
VAMP 0.502 0.063
DVAE# 0.490 0.024
MAF 0.398 0.070
LORACs 0.626 0.087
Table 1: Averaged precision-recall AUC on MNIST/Omniglot test datasets
Prior MNIST Omniglot
Normal -83.789 -89.722
MAF -80.121 -86.298
Vamp -83.0135 -87.604
LORACs -83.401 -87.105
Table 2: MNIST/Omniglot test log-likelihoods

7 Discussion

Learning discrete, hierarchical structure in a latent space opens up a new opportunity: interactive deep unsupervised learning. User-provided constraints have been used in both flat and hierarchical clustering (Wagstaff and Cardie, 2000; Awasthi et al., 2014), so an interesting follow-up to this work would be to incorporate constraints into the LORACs prior, as in Vikram and Dasgupta (2016), which could enable user-guided representation learning.

References

  • Awasthi, P., Balcan, M., and Voevodski, K. (2014). Local algorithms for interactive clustering. In International Conference on Machine Learning, pages 550–558.
  • Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • Boyles, L. and Welling, M. (2012). The time-marginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems, pages 2969–2977.
  • Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
  • Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. (2018). Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
  • Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.
  • Goyal, P., Hu, Z., Liang, X., Wang, C., Xing, E. P., and Mellon, C. (2017). Nonparametric variational auto-encoders for hierarchical representation learning. In ICCV, pages 5104–5112.
  • Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pages 17–24.
  • Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.
  • Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., and Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems.
  • Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
  • Kingman, J. F. C. (1982). The coalescent. Stochastic Processes and their Applications, 13(3):235–248.
  • Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
  • LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.
  • Lin, W., Hubacher, N., and Khan, M. E. (2018). Variational message passing with structured inference networks. arXiv preprint arXiv:1803.05589.
  • Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
  • Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7:619–629.
  • Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.
  • Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286.
  • Salimans, T., Karpathy, A., Chen, X., Kingma, D. P., and Bulatov, Y. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR).
  • Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.
  • Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622.
  • Tomczak, J. and Welling, M. (2018). VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pages 1214–1223.
  • Vahdat, A., Andriyash, E., and Macready, W. G. (2018). DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. In Neural Information Processing Systems (NIPS).
  • Vikram, S. and Dasgupta, S. (2016). Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090.
  • Wagstaff, K. and Cardie, C. (2000). Clustering with instance-level constraints. AAAI/IAAI, 1097:577–584.

The LORACs prior for VAEs: Letting the Trees Speak for the Data - Supplement


Appendix A Additional visualizations

(a) Normal prior
(b) No prior
(c) Vamp(500) prior
(d) MAF prior
(e) LORACs(200) prior
Figure A.10: TSNE visualizations of the latent space of the MNIST test set with various prior distributions, color-coded according to class.
Figure A.11: A TSNE visualization of the latent space for the TMC(200) model, with inducing points and one sampled phylogeny plotted. Internal nodes are visualized by computing their expected posterior values, and branches are plotted in 2-D space.
Figure A.12: MNIST VampPrior learned pseudo-inputs.
Figure A.13: MNIST VampPrior reconstructed pseudo-inputs obtained by deterministically encoding and decoding each pseudo-input.
Figure A.14: Omniglot learned inducing points.
Figure A.15: CelebA learned inducing points.

Appendix B Empirical results

Labels per class 1 10 20 30 40 50 60 70 80 90 100
No prior
Normal
Vamp(500)
DVAE#
MAF
LORACs(200)
Table B.3: MNIST few-shot classification results.
Labels per class 1 2 5 10 15
No prior
Normal
Vamp(1000)
DVAE#
MAF
LORACs(1000)
Table B.4: Omniglot few-shot classification results.
# of inducing points 200 500
Test accuracy 0.9428 0.9474
Table B.5: MNIST few-shot classification with labeled inducing points.
(a) MNIST
(b) Omniglot
Figure B.16: Averaged precision-recall curves over test datasets.

Appendix C Algorithm details

C.1 Stick-breaking process

Consider inserting a new node into the tree between a vertex and its parent, creating a new branch. The inserted node's time lies between the times of the two endpoints, with density given by the stick-breaking process, i.e.

(C.16)

C.2 Belief propagation in TMCs

The TMC is at the core of the LORACs prior. Recall that the TMC is a prior over phylogenies, and that after attaching a Gaussian random walk (GRW) we obtain a distribution over the latent vectors corresponding to the leaves. However, the GRW also samples latent vectors at the internal nodes. Rather than explicitly representing these values, in this work we marginalize them out, i.e.

(C.17)

This marginalization can be done efficiently because the graphical model is tree-shaped and all nodes have Gaussian likelihoods. Belief propagation is a message-passing framework for marginalization, and we use message passing for several TMC inference queries. The main queries we are interested in are:

  1. Joint likelihood - for the purposes of MCMC, we are interested in computing the joint likelihood of a set of observed leaf values and a phylogeny.

  2. Leaf predictive density - this query computes the posterior density over one leaf given all the others; we use this distribution when computing the posterior predictive density of a TMC.

  3. Predictive-density gradients - this query is the gradient of the predictive density of a single leaf with respect to the values at all other leaves; it is used when computing gradients of the ELBO with respect to the inducing points in the LORACs prior.

Message passing

Message passing treats the tree as an undirected graph. We first pick a start node and request messages from each of its neighbors.

Message passing is then defined recursively: when a node receives a request from a source node, it in turn requests messages from all of its neighbors except the source. The base case for this recursion is a leaf node, which returns a message with the following contents:

(C.18)

where bold symbols denote vectors obtained by repeating a scalar across the latent dimensions.

In the recursive case, consider a node receiving a set of messages from its neighbors:

(C.19)

where the branch length in each message is the length of the edge between the node and the corresponding neighbor. These messages are identical to those used in Boyles and Welling (2012).

Additionally, our messages include gradients with respect to every leaf node downstream of the message. We update each of these gradients when computing a new message and pass them along to the source node. Gradients with respect to one of these leaves are calculated as

(C.20)

The most complicated part is combining incoming messages, which depends on their number: the start node receives three incoming messages, while all other nodes receive only two. Combining two incoming messages gives

(C.21)

For three incoming messages:

(C.22)

With these messages, we can answer all of the aforementioned inference queries.

  1. Joint likelihood: we can begin message passing at any internal node and accumulate the message normalizers to compute the joint likelihood of the leaves and the phylogeny (a sketch follows this list).

  2. Leaf predictive density: we start message passing at the leaf of interest; the resulting predictive distribution is a Gaussian whose mean and variance are read off the incoming message.

  3. Predictive-density gradients: the gradient query is assembled from the gradients carried along with the messages.
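Concretely, the upward pass for query #1 can be implemented in the style of Felsenstein pruning for Brownian motion on a tree. The sketch below uses scalar per-dimension variances, the dictionary tree representation from the earlier sketches, and an assumed Gaussian prior on the root location; it is a generic reference implementation, not the authors' Cython code, and it omits the cached messages and gradient bookkeeping described above.

```python
import numpy as np

def _log_normal(diff, var):
    """Log density of independent Gaussians N(diff; 0, var), summed over dims."""
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + diff**2 / var)))

def tree_log_marginal(children, times, root, leaf_values, root_var=1.0):
    """Marginal log-likelihood of leaf values under a GRW on the tree.

    Internal locations are integrated out by passing Gaussian messages
    (mean, variance, accumulated log-normalizer) from the leaves to the root.
    """
    def message(v):
        if v not in children:                   # leaf: observed location
            return leaf_values[v], 0.0, 0.0
        a, b = children[v]
        m_a, v_a, ll_a = message(a)
        m_b, v_b, ll_b = message(b)
        V_a = v_a + (times[a] - times[v])       # add branch variances
        V_b = v_b + (times[b] - times[v])
        ll = ll_a + ll_b + _log_normal(m_a - m_b, V_a + V_b)
        var = 1.0 / (1.0 / V_a + 1.0 / V_b)     # product of two Gaussians
        mean = var * (m_a / V_a + m_b / V_b)
        return mean, var, ll
    m_r, v_r, ll = message(root)
    # Assumed N(0, root_var) prior on the root location.
    return ll + _log_normal(m_r, v_r + root_var)

# Tiny example using the hand-built phylogeny format from the earlier sketches.
children = {4: (3, 2), 3: (0, 1)}
times = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.4, 4: 0.0}
leaves = {0: np.array([0.1, -0.2]), 1: np.array([0.0, 0.1]),
          2: np.array([1.5, 1.0])}
print(tree_log_marginal(children, times, 4, leaves))
```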

Implementation

We chose to implement the TMC and message passing in Cython because we found raw Python to be too slow due to function-call and type-checking overhead. Furthermore, we used diagonal rather than scalar variances in the message-passing implementation in order to support the diagonal variances handed over from the variational posterior over the latent vectors.

C.3 Variational inference for the LORACs prior

The LORACs prior involves first sampling a tree from the posterior distribution of a TMC whose leaves are the inducing points. We then sample an attachment branch and time for each data point according to the posterior predictive distribution described in Section 2.1. We then sample a latent vector from the distribution induced by the GRW likelihood model. Finally, we pass the sampled latent vector through the decoder.

(C.23)

Consider sampling the optimal variational factor over phylogenies:

(C.24)

As discussed in Section 4, we instead set this factor to the prior. We use additional variational factors for each data point's latent vector, attachment branch, and attachment time; the factor over attachment times is a recognition network that outputs a distribution over the attach time for a particular branch. Since the terms over phylogenies cancel, we obtain the following ELBO.

(C.25)
Inference procedure

In general, the variational distribution over phylogenies can be sampled using vanilla SPR Metropolis-Hastings, so samples from this distribution are readily available.

For each data point in the minibatch, we pass it through the encoder to obtain its approximate posterior. We then compute

(C.26)

This quantity is computed by looping over every branch of a sampled phylogeny, storing the incoming messages at each node, and passing the branch's endpoint messages together with a sample of the encoded latent vector into the time-recognition network, which outputs a logistic-normal distribution over times for that branch. We sample that logistic normal to obtain a time to pair with the branch. We can then compute the log-likelihood of the encoded latent vector if it were to attach at that branch and time, using TMC inference query #2. This log-likelihood is added to the TMC prior log-probability of the branch being selected, giving a joint score for the branch. After doing this for every branch, we normalize these scores to obtain the optimal categorical distribution over branches, which we then sample to obtain an attach location and time for each data point in the minibatch.
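The per-branch normalization described above amounts to a softmax over branch scores; the sketch below uses placeholder callables for the TMC prior branch probability, the time-recognition network, and the GRW predictive log-density.

```python
import numpy as np

def attach_distribution(branches, log_prior_branch, sample_time,
                        predictive_log_lik, z_sample):
    """Optimal categorical q over attachment branches for one encoded point.

    log_prior_branch(b):          TMC prior log-probability of attaching to b.
    sample_time(b):               time drawn from the recognition network for b.
    predictive_log_lik(z, b, t):  GRW predictive log-density of z at (b, t).
    All three callables are placeholders for the quantities in the text.
    """
    times = [sample_time(b) for b in branches]
    scores = np.array([log_prior_branch(b) + predictive_log_lik(z_sample, b, t)
                       for b, t in zip(branches, times)])
    scores -= scores.max()                      # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum()
    return probs, times
```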

The next stage is to compute gradients with respect to the learnable parameters of the model (the encoder, the decoder, the inducing points, and the time-recognition network). In the process of calculating the attach distribution, we have already obtained samples of the attachment branches and times. We plug these into the ELBO and compute gradients via automatic differentiation with respect to the encoder, decoder, and time-recognition network. Computing gradients with respect to the inducing points is trickier. We first examine the ELBO.

(C.27)

Consider the gradient of the ELBO with respect to the inducing points.