1 Introduction
Variational autoencoders (VAEs; Kingma and Welling, 2014; Rezende et al., 2014) are a popular class of deep latent-variable models. The VAE assumes that each observation $x_n$ is generated by first sampling a latent vector $z_n$ from some tractable prior $p(z)$, and then sampling from some tractable likelihood $p(x \mid z; \theta)$. For example, $\mu_\theta$ could be a neural network with weights $\theta$, and $p(x \mid z; \theta)$ might be a Gaussian with mean $\mu_\theta(z)$.
VAEs, like other unsupervised latent-variable models (e.g., Tipping and Bishop, 1999; Blei et al., 2003), can uncover latent structure in datasets. In particular, one might hope that high-level characteristics of the data are encoded more directly in the geometry of the latent space than they are in the data space. For example, when modeling faces one might hope that one latent dimension corresponds to pose, another to hair length, another to gender, and so on.
What kind of latent structure will the VAE actually discover? Hoffman and Johnson (2016) observe that the ELBO encourages the model to make the statistics of the population of encoded vectors resemble those of the prior, so that the aggregate posterior approximately matches $p(z)$. The prior therefore plays an important role in shaping the geometry of the latent space. For example, if we use the "default" prior $p(z) = \mathcal{N}(0, I)$, then we are asking the model to explain the data in terms of smoothly varying, completely independent factors (Burgess et al., 2018). These constraints may sometimes be reasonable; for example, geometric factors such as pose or lighting angle may be nearly independent and rotationally symmetric. But some natural factors exhibit dependence structure (for example, facial hair length and gender are strongly correlated), and others may have nonsmooth structure (for example, handwritten characters naturally cluster into discrete groups).
In this paper, we propose using a more opinionated prior on the VAE's latent vectors: the time-marginalized coalescent (TMC; Boyles and Welling, 2012). The TMC is a powerful, interpretable Bayesian nonparametric hierarchical clustering model that can encode rich discrete and continuous structure. Combining the TMC with the VAE combines the strengths of Bayesian nonparametrics (interpretable, discrete structure learning) and deep generative modeling (freedom from restrictive distributional assumptions).
Our contributions are:

We propose a deep Bayesian nonparametric model that can discover hierarchical cluster structure in complex, high-dimensional datasets.

We develop a minibatch-friendly inference procedure for fitting TMCs based on an inducing-point approximation, which scales to arbitrarily large datasets.

We show that our model’s learned latent representations consistently outperform those learned by other variational (and classical) autoencoders when evaluated on downstream classification and retrieval tasks.
2 Background
2.1 Bayesian priors for hierarchical clustering
Figure 1: Independent samples from a time-marginalized coalescent (TMC) prior and two-dimensional Gaussian random walk likelihood model (10 and 300 leaves, respectively). Contours in the plots correspond to the posterior predictive density. As the number of leaves grows, the predictive density grows more complex.

Hierarchical clustering is a flexible tool in exploratory data analysis, as trees offer visual, interpretable summaries of data. Typically, algorithms for hierarchical clustering are either agglomerative (data are recursively, greedily merged to form a tree from the bottom up) or divisive (data are recursively partitioned, forming a tree from the top down). Bayesian nonparametric hierarchical clustering (BNHC) additionally incorporates uncertainty over tree structure by introducing a prior distribution $\pi(\tau)$ over trees and a likelihood model $\pi(x \mid \tau)$ for data, with the goal of sampling the posterior distribution $\pi(\tau \mid x)$.^1

^1 We use $\pi$ to denote probability distributions relating to the TMC, to distinguish them from the $p$ and $q$ distributions used later in the paper.

In this paper, we focus on rooted binary trees with labeled leaves adorned with branch lengths, called phylogenies. Prior distributions over phylogenies often take the form of a stochastic generative process in which a tree is built with random merges, as in the Kingman coalescent (Kingman, 1982), or random splits, as in the Dirichlet diffusion tree (Neal, 2003).
These nonparametric distributions have helpful properties, such as exchangeability, which enable efficient Bayesian inference. In this paper, we focus on the time-marginalized coalescent (TMC; Boyles and Welling, 2012), which decouples the distribution over tree structure from the distribution over branch lengths, a property that helps simplify inference down the line.

2.1.1 Time-marginalized coalescent (TMC)
The time-marginalized coalescent defines a prior distribution over phylogenies. A phylogeny $\tau$ is a directed rooted full binary tree, with vertex set $V$ and edges $E$, together with time labels $t_v \in [0, 1]$ for $v \in V$. The vertex set is partitioned into leaf vertices $V_{\mathrm{leaf}}$ and internal vertices $V_{\mathrm{int}}$, so that $V = V_{\mathrm{leaf}} \cup V_{\mathrm{int}}$, and we take $V_{\mathrm{leaf}} = \{1, \ldots, N\}$ to simplify notation for identifying leaves with data points. The directed edges of the tree are encoded in the edge set $E$, where we denote the root vertex as $\rho$ and, for $v \neq \rho$, we denote the parent of $v$ as $\mathrm{par}(v)$, where $(\mathrm{par}(v), v) \in E$.
The TMC samples a random tree structure by a stochastic process in which the leaves are recursively merged uniformly at random until only one vertex is left. This process yields the probability mass function on valid pairs $(V, E)$ given by

$$\pi(V, E) = \frac{(N-1)!}{\prod_{v \in V_{\mathrm{int}}} s(v)} \prod_{k=2}^{N} \binom{k}{2}^{-1}, \qquad (1)$$

where $s(v)$ denotes the number of internal vertices in the subtree rooted at $v$. Given the tree structure, time labels are generated via the stick-breaking process

$$t_v = t_{\mathrm{par}(v)} + \left(1 - t_{\mathrm{par}(v)}\right) \nu_v, \qquad \nu_v \sim \mathrm{Beta}(a, b), \qquad (2)$$

for each internal vertex $v$, taking $t_{\mathrm{par}(\rho)} = 0$ at the root and $t_v = 1$ at the leaves. These time labels encode a branch length $t_v - t_{\mathrm{par}(v)}$ for each edge $(\mathrm{par}(v), v) \in E$. We denote the overall density on phylogenies with $N$ leaves as $\pi_N(\tau)$.
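The two-stage generative process above (uniform pairwise merges, then stick-breaking times) is simple to simulate. Below is a minimal Python sketch; the function name `sample_tmc` and the parent-pointer representation of a phylogeny are our own illustrative choices, not notation from the paper.

```python
import random

def sample_tmc(n_leaves, a=1.0, b=1.0, rng=None):
    """Sample a phylogeny from the time-marginalized coalescent.

    Leaves are vertices 0..n_leaves-1 (time 1.0); internal vertices are
    numbered upward as they are created, so the root gets the largest id.
    Returns parent pointers (parent[root] == root) and time labels.
    """
    rng = rng or random.Random(0)
    parent, active, next_id = {}, list(range(n_leaves)), n_leaves
    # Tree structure: merge a uniformly random pair of lineages until
    # only one remains.
    while len(active) > 1:
        u, v = rng.sample(active, 2)
        parent[u] = parent[v] = next_id
        active = [w for w in active if w not in (u, v)] + [next_id]
        next_id += 1
    root = active[0]
    parent[root] = root
    # Time labels: stick-breaking from the root toward the leaves; each
    # internal vertex takes a Beta(a, b) fraction of the stick remaining
    # below its parent.  (Internal parent ids exceed child ids, so
    # descending order visits parents before children.)
    time = {v: 1.0 for v in range(n_leaves)}
    for v in range(next_id - 1, n_leaves - 1, -1):
        t_par = 0.0 if v == root else time[parent[v]]
        time[v] = t_par + (1.0 - t_par) * rng.betavariate(a, b)
    return parent, time
```

A phylogeny with $N$ leaves always has $2N - 1$ vertices, and times strictly increase along every root-to-leaf path, consistent with the stick-breaking construction.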
Finally, to connect the TMC prior to data, we define a likelihood model on data points $x_{1:N}$, with $x_n$ corresponding to the leaf vertex $n$. We use a Gaussian random walk (GRW), where for each vertex $v$ a location $z_v$ is sampled according to a Gaussian distribution centered at its parent's location, with variance equal to the branch length,

$$z_v \sim \mathcal{N}\!\left(z_{\mathrm{par}(v)},\; \left(t_v - t_{\mathrm{par}(v)}\right) I\right),$$

and we take $z_{\mathrm{par}(\rho)} = 0$, with $z_n = x_n$ at the leaves. As a result of this choice, we can exploit the Gaussian graphical model structure to efficiently marginalize out the locations associated with internal vertices and evaluate the resulting marginal density $\pi(x_{1:N} \mid \tau)$. For details about this marginalization, please refer to Appendix C. The final overall density is written as

$$\pi(x_{1:N}, \tau) = \pi_N(\tau)\, \pi(x_{1:N} \mid \tau). \qquad (3)$$
For further details and derivations related to the TMC, please refer to Boyles and Welling (2012).
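To make the likelihood model concrete, here is a small Python sketch that samples vertex locations by the Gaussian random walk over a given phylogeny. The parent-pointer-plus-time-label representation and the function name are our own illustrative choices; the convention that the walk starts from the origin above the root follows the text above.

```python
import random

def sample_grw(parent, time, dim=2, rng=None):
    """Sample a location for every vertex of a phylogeny: each vertex is
    Gaussian around its parent's location, with variance equal to the
    incoming branch length (the difference of the two time labels); the
    root diffuses from the origin over its own time."""
    rng = rng or random.Random(1)
    loc, placed = {}, set()
    verts = list(parent)
    while len(placed) < len(verts):          # visit parents before children
        for v in verts:
            if v in placed or (parent[v] != v and parent[v] not in placed):
                continue
            if parent[v] == v:               # root: walk from the origin
                base, var = [0.0] * dim, time[v]
            else:
                base, var = loc[parent[v]], time[v] - time[parent[v]]
            loc[v] = [m + rng.gauss(0.0, var ** 0.5) for m in base]
            placed.add(v)
    return loc
```

Leaf locations produced this way play the role of the data $x_{1:N}$; the marginal over leaves, with internal locations integrated out, is the density the belief-propagation computation evaluates.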
2.1.2 TMC posterior predictive density
The TMC with $N$ leaves and a GRW likelihood model can serve as a prior on a set of hierarchically structured data, i.e., data corresponding to vertices with small tree distance should have similar locations. In addition, it acts as a density from which we can sample new data. The posterior predictive density $\pi(x_{N+1} \mid x_{1:N})$ is easy to sample from thanks to the exchangeability of the TMC.
To sample a new data point $x_{N+1}$, we select a branch (edge) $b \in E$ and a time $s$ at which to attach a new leaf vertex. The probability of selecting branch $b$ is proportional to the probability under the TMC prior of the tree with a new leaf attached to branch $b$. The density over the time label is determined by the stick-breaking process (see Appendix C for details). Both of these quantities are easy to calculate and sample from due to the exchangeability of the TMC.
The new location can then be sampled from $\pi(x_{N+1} \mid b, s, \tau, x_{1:N})$, the Gaussian distribution induced by the GRW likelihood model. Pictured in Figure 1 are samples from a TMC prior and GRW likelihood, where contours correspond to the posterior predictive density. In addition to modeling hierarchical structure, the TMC is thus a flexible nonparametric density estimator.
2.1.3 TMC inference
The posterior distribution $\pi(\tau \mid x_{1:N})$ is analytically intractable, because its normalization constant involves a sum over all tree structures, but it can be approximately sampled via Markov chain Monte Carlo (MCMC) methods. We utilize the Metropolis-Hastings algorithm with a subtree-prune-and-regraft (SPR) proposal distribution (Neal, 2003). An SPR proposal picks a subtree of $\tau$ uniformly at random and detaches it; the subtree is then attached back onto the tree at a branch and time picked uniformly at random. The Metropolis-Hastings acceptance probability is efficient to compute because the joint density can be evaluated using belief propagation to marginalize the latent locations at internal vertices of $\tau$, and many of the messages can be cached. See Appendix C for details.

2.2 Variational autoencoder
The variational autoencoder (VAE) is a generative model for a dataset $x_{1:N}$ wherein latent vectors $z_{1:N}$ are sampled i.i.d. from a prior distribution $p(z)$ and then individually passed into a neural network observation model with parameters $\theta$,

$$p(x_{1:N}, z_{1:N}) = \prod_{n=1}^{N} p(z_n)\, p(x_n \mid z_n; \theta). \qquad (4)$$

We are interested in the posterior distribution $p(z_n \mid x_n)$, which is not analytically tractable but can be approximated with a variational distribution $q(z_n \mid x_n; \phi)$, typically a neural network that outputs the parameters of a Gaussian distribution. The weights $\phi$ of the approximate posterior can be learned by optimizing the evidence lower bound (ELBO),

$$\mathcal{L}(\theta, \phi) = \sum_{n=1}^{N} \mathbb{E}_{q(z_n \mid x_n; \phi)}\!\left[\log p(x_n \mid z_n; \theta)\right] - \mathrm{KL}\!\left(q(z_n \mid x_n; \phi) \,\middle\|\, p(z_n)\right). \qquad (5)$$
The parameters of the model, $\theta$ and $\phi$, are learned via stochastic gradient ascent on the ELBO, using the reparametrization trick for lower-variance gradients (Kingma and Welling, 2014; Rezende et al., 2014).
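As a toy illustration of the reparametrization trick, consider a one-dimensional model with prior $z \sim \mathcal{N}(0,1)$, decoder $x \mid z \sim \mathcal{N}(\theta z, 1)$, and Gaussian variational posterior. The function below is a hypothetical scalar sketch of a single-sample ELBO estimate, not the paper's implementation.

```python
import math
import random

def elbo_estimate(x, mu, log_sigma, theta, rng=None):
    """One-sample reparameterized ELBO estimate for a toy scalar model:
    prior z ~ N(0, 1), decoder x | z ~ N(theta * z, 1), and variational
    posterior q(z | x) = N(mu, sigma^2).  Writing z = mu + sigma * eps
    with eps ~ N(0, 1) makes the sample, and hence the estimate,
    differentiable in (mu, log_sigma)."""
    rng = rng or random.Random(0)
    sigma = math.exp(log_sigma)
    z = mu + sigma * rng.gauss(0.0, 1.0)      # reparameterized sample
    log_px_z = -0.5 * (x - theta * z) ** 2 - 0.5 * math.log(2 * math.pi)
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form
    kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - log_sigma
    return log_px_z - kl
```

In practice the same estimate is computed with an autodiff framework so that gradients flow through `z` into the variational parameters.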
3 The TMCVAE
The choice of prior distribution in the VAE significantly affects the autoencoder and the resulting latent space. The default standard normal prior, $p(z) = \mathcal{N}(0, I)$, acts as a regularizer on an otherwise unconstrained autoencoder, but can be restrictive and result in overpruning (Burda et al., 2015). Extremely flexible, learnable distributions like masked autoregressive flow (MAF) priors (Papamakarios et al., 2017) enable very rich latent spaces, but do not encode any interpretable bias for organizing the latent space (except perhaps smoothness).
In this paper, we explore the TMC prior for the VAE, which can strike a sweet spot between restrictive and flexible priors. We generate the latent values of a VAE according to the TMC prior, then generate observations using a neural network observation model,

$$\tau,\, z_{1:N} \sim \pi_N(\tau)\, \pi(z_{1:N} \mid \tau), \qquad (6)$$
$$x_n \mid z_n \sim p(x_n \mid z_n; \theta). \qquad (7)$$

The TMC-VAE is a coherent generative process that captures discrete, interpretable structure in the latent space. A phylogeny not only has an intuitive inductive bias, but is also useful for exploratory data analysis and for introspecting the latent space itself.
Consider doing inference in this model. First assume variational distributions $q(z_n \mid x_n; \phi)$ (as in the VAE) and $q(\tau)$, which results in the ELBO

$$\mathcal{L} = \mathbb{E}_{q(\tau) \prod_n q(z_n \mid x_n)}\!\left[\log \frac{\pi_N(\tau)\, \pi(z_{1:N} \mid \tau) \prod_n p(x_n \mid z_n; \theta)}{q(\tau) \prod_n q(z_n \mid x_n)}\right]. \qquad (8)$$

For fixed $q(z_n \mid x_n)$, we can sample the optimal $q^*(\tau)$,

$$q^*(\tau) \propto \exp\!\left\{\mathbb{E}_{\prod_n q(z_n \mid x_n)}\!\left[\log \pi(\tau, z_{1:N})\right]\right\}. \qquad (9)$$

Because $\pi(z_{1:N} \mid \tau)$ is jointly Gaussian (factorizing according to the tree structure) and each $q(z_n \mid x_n)$ is Gaussian, the expectations with respect to $q$ can be computed in closed form. This enables sampling from $q^*(\tau)$, whose unnormalized density is the expected joint likelihood, using SPR Metropolis-Hastings. However, optimizing this ELBO is problematic. $\pi(z_{1:N} \mid \tau)$ does not factorize over data points, so unbiased gradient estimates cannot be computed from minibatches; every update requires evaluating all the data. Furthermore, the TMC is limiting from a computational perspective. Since a phylogeny has as many leaves as points in the dataset, belief propagation over internal vertices of the tree slows down linearly as the size of the dataset grows, and SPR proposals mix very slowly for large trees. We found that these limitations make the model impractical for datasets of more than 1000 examples.
In the next section, we address these computational issues while retaining the interesting properties of the TMC-VAE.
4 LORACs prior for VAEs
In this section, we introduce a novel approximation to the TMC prior that preserves many desirable properties, like structure and interpretability, while being computationally viable. Our key idea is to use a set of learned inducing points as the leaves of the tree in the latent space, analogous to inducing-input approximations for Gaussian processes (Snelson and Ghahramani, 2006). In this model, latent vectors are not directly hierarchically clustered, but are instead independent samples from the induced posterior predictive density of a TMC. We call this the Latent ORganization of Arboreal Clusters (LORACs, pronounced "lorax") prior.
To define the LORACs prior, we first define an auxiliary TMC distribution with $M$ leaf locations $B = \{b_1, \ldots, b_M\}$. We treat $B$ as a set of learnable free parameters, and define the conditional $\pi(\tau \mid B)$ as the LORACs prior on phylogenies $\tau$:

$$p(\tau) \triangleq \pi(\tau \mid B) \propto \pi_M(\tau)\, \pi(B \mid \tau). \qquad (10)$$

That is, we choose the prior on phylogenies to be the posterior distribution of a TMC with pseudo-observations $B$. Next, we define the LORACs prior on locations $z_{1:N}$ as conditionally independent draws from the predictive distribution, writing the sampled attachment branch and time for $z_n$ as $a_n$ and $s_n$, respectively:

$$a_n, s_n \sim \pi(a_n, s_n \mid \tau), \qquad z_n \sim \pi(z_n \mid a_n, s_n, \tau, B). \qquad (11)$$

To complete the model, we use an observation likelihood parameterized by a neural network, writing

$$x_n \mid z_n \sim p(x_n \mid z_n; \theta). \qquad (12)$$
By using the learned inducing points $B$, we avoid the main difficulty of inference in the TMC-VAE of Section 3, namely the need to do inference over all points in the dataset. Instead, dependence between data points is mediated by the set of inducing points, whose size $M$ is independent of $N$. As a result, with the LORACs prior, minibatch-based learning becomes tractable even for very large datasets. The quality of the approximation to the TMC-VAE can be tuned by adjusting $M$.
However, this technique presents its own inference challenges. Sampling the optimal variational factor $q^*(\tau)$ is no longer an option as it was in the TMC-VAE:

$$q^*(\tau) \propto \pi(\tau \mid B) \exp\!\left\{\sum_{n=1}^{N} \mathbb{E}_{q(z_n \mid x_n)}\!\left[\log \pi(z_n \mid \tau, B)\right]\right\}. \qquad (13)$$

This term contains a sum of expectations over the whole dataset, so evaluating it for MCMC would require passing the entire dataset through a neural network. Furthermore, the normalizer of this likelihood is intractable, but is necessary for computing gradients with respect to $B$. We therefore avoid using the optimal $q^*(\tau)$ and instead set $q(\tau)$ to the prior $\pi(\tau \mid B)$. This has the additional computational advantage of cancelling the $\log \pi(\tau \mid B)$ term in the ELBO, which also has an intractable normalizing constant. If the inducing points are chosen so that they contain most of the information about the hierarchical organization of the dataset, then the approximation will be reasonable.
We also fit the variational factors $q(a_n)$, $q(s_n \mid a_n)$, and $q(z_n \mid x_n)$. The factor for attachment times, $q(s_n \mid a_n)$, is a recognition network that outputs a posterior over attachment times for a particular branch. Since the $\log \pi(\tau \mid B)$ and $\log q(\tau)$ terms cancel, we obtain the following ELBO (some notation suppressed for simplicity):

$$\mathcal{L} = \mathbb{E}_{q(\tau)}\!\left[\sum_{n=1}^{N} \mathbb{E}_{q(a_n)\, q(s_n \mid a_n)\, q(z_n \mid x_n)}\!\left[\log \frac{\pi(a_n, s_n \mid \tau)\, \pi(z_n \mid a_n, s_n, \tau, B)\, p(x_n \mid z_n; \theta)}{q(a_n)\, q(s_n \mid a_n)\, q(z_n \mid x_n)}\right]\right]. \qquad (14)$$

This ELBO can be optimized by first sampling

$$\tau \sim q(\tau) = \pi(\tau \mid B) \qquad (15)$$
and then computing gradients with respect to $\theta$, $\phi$, $B$, and the parameters of $q(a_n)$ and $q(s_n \mid a_n)$ using a Monte Carlo estimate of the ELBO based on samples from $q(\tau)$, $q(a_n)$, $q(s_n \mid a_n)$, and $q(z_n \mid x_n)$. The factor $q(\tau)$ can be sampled using vanilla SPR Metropolis-Hastings. The detailed inference procedure can be found in Appendix C.
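The SPR move used to sample $q(\tau)$ can be sketched in a few lines. The following is an illustrative Python implementation of the proposal step only (not the Metropolis-Hastings acceptance computation, which requires the belief-propagation likelihood); the parent-pointer representation, the function name, and the handling of the root branch are our own simplifications.

```python
import random

def spr_propose(parent, time, rng=None):
    """One subtree-prune-and-regraft (SPR) proposal on a phylogeny given
    as parent pointers (parent[root] == root) and time labels, with time
    increasing from root to leaves.  Returns a new (parent, time) pair;
    the inputs are left untouched."""
    rng = rng or random.Random(0)
    parent, time = dict(parent), dict(time)
    root = next(v for v in parent if parent[v] == v)
    # Prune: pick a random non-root subtree s and splice out its parent p.
    s = rng.choice([v for v in parent if v != root])
    p = parent[s]
    sib = next(c for c in parent if c != s and c != p and parent[c] == p)
    if p == root:
        parent[sib] = sib                 # sibling becomes the new root
    else:
        parent[sib] = parent[p]

    def in_pruned(v):                     # is v inside the detached part?
        while parent[v] != v and v not in (s, p):
            v = parent[v]
        return v in (s, p)

    def branch_top(v):                    # time at the top of v's branch
        return 0.0 if parent[v] == v else time[parent[v]]

    # Regraft: pick a branch that can hold the subtree above its root's
    # time, and reinsert p on it at a uniformly chosen valid time.
    candidates = [v for v in parent
                  if not in_pruned(v)
                  and min(time[v], time[s]) > branch_top(v)]
    u = rng.choice(candidates)
    t_new = rng.uniform(branch_top(u), min(time[u], time[s]))
    parent[p] = p if parent[u] == u else parent[u]
    parent[u] = p
    parent[s] = p
    time[p] = t_new
    return parent, time
```

A full sampler would accept or reject each proposal with the usual Metropolis-Hastings ratio, evaluating the joint density with cached belief-propagation messages as described in Section 2.1.3.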
5 Related work
As mentioned above, LORACs connects several ideas in the literature, including Bayesian nonparametrics (Boyles and Welling, 2012), inducing-point approximations (e.g., Snelson and Ghahramani, 2006; Tomczak and Welling, 2018), and amortized inference (Kingma and Welling, 2014; Rezende et al., 2014).
Also relevant is a recent thread of work endowing VAEs with the interpretability of graphical models (e.g., Johnson et al., 2016; Lin et al., 2018). In this vein, Goyal et al. (2017) propose using a different Bayesian nonparametric tree prior, the nested Chinese restaurant process (nCRP; Griffiths et al., 2004), in a VAE. We chose to base LORACs on the TMC instead, as the posterior predictive distribution of an nCRP is a finite mixture, whereas the TMC's posterior predictive distribution has more complex continuous structure. Another distinction is that Goyal et al. (2017) only consider learning from pretrained image features, whereas our approach is completely unsupervised.
6 Results
In this section, we analyze properties of the LORACs prior, focusing on qualitative aspects, like exploratory data analysis and interpretability, and quantitative aspects, like few-shot classification and information retrieval.
Experimental setup
We evaluated the LORACs prior on three separate datasets: dynamically binarized MNIST (LeCun and Cortes, 2010), Omniglot (Lake et al., 2015), and CelebA (Liu et al., 2015). For all three experiments, we utilized convolutional/deconvolutional encoders/decoders and a 40-dimensional latent space (detailed architectures can be found in Appendix D). We used 200, 1000, and 500 inducing points for MNIST, Omniglot, and CelebA, respectively, with TMC parameters $(a, b)$ as specified in Appendix D. The attachment-time recognition network $q(s_n \mid a_n)$ was a two-layer, 500-wide neural network with ReLU activations that output the parameters of a logistic-normal distribution over stick size, and all parameters were optimized with Adam (Kingma and Ba, 2015). Other implementation details can be found in Appendix D.

6.1 Qualitative results
A hierarchical clustering in the latent space offers a unique opportunity for interpretability and exploratory data analysis, especially when the data are images. We now describe several ways in which users can obtain useful data summaries and explore a dataset.
Visualizing inducing points
We first inspect the learned inducing points by passing them through the decoder. Visualized in Figure 3 are the 200 learned inducing points for MNIST. The inducing points are all unique and are cleaner than the pseudo-input reconstructions from the VampPrior (shown in Figure A.13). Inducing points can help summarize a dataset, as visualizations of the latent space indicate that they spread out and cover the data (see Figure A.11). Inducing points are also visually unique and sensible for Omniglot and CelebA (see Figures A.14 and A.15).
Hierarchical clustering
We can sample $q(\tau)$ to obtain phylogenies over the inducing points, and can visualize these clusterings using the decoded inducing points; subtrees from a sample in each dataset are visualized in Figure 4. In MNIST, we find that large subtrees correspond to the discrete classes in the dataset. In Omniglot, subtrees sometimes correspond to language groups and letter shapes. In CelebA, subtrees sometimes correspond to pose or to hair color and style.
We can further use the time label at each internal vertex to summarize the data at many levels of granularity. Consider "slicing" the hierarchy at a particular time $t$: take every branch $(\mathrm{par}(v), v)$ with $t_{\mathrm{par}(v)} \le t < t_v$ and compute the corresponding expected Gaussian random walk value at time $t$. At times closer to zero, we slice fewer branches and are closer to the root of the hierarchy, so the values at the slice look more like the mean of the data. In Figure 5, we visualize this process over a subset of the inducing points of CelebA. Visualizing the dataset in this way reveals cluster structure at different granularities and offers an evolutionary interpretation of the data, as leaves that coalesce more "recently" are likely to be closer in the latent space.
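Conditioned on the locations at both ends of a sliced branch, the Gaussian random walk along the branch is a Brownian bridge, so the expected value at the slice time is a linear interpolation between the two endpoint locations. A minimal sketch (the function names and the parent-pointer tree representation are our own):

```python
def slice_branch(t, t_u, z_u, t_v, z_v):
    """Expected Gaussian-random-walk location at time t on the branch
    from vertex u (time t_u, location z_u) down to its child v (time t_v,
    location z_v).  Conditioned on both endpoints, the walk is a Brownian
    bridge, whose mean interpolates linearly between them."""
    w = (t - t_u) / (t_v - t_u)
    return [a + w * (b - a) for a, b in zip(z_u, z_v)]

def slice_tree(t, parent, time, loc):
    """'Slice' a phylogeny at time t: for every branch straddling t,
    return the expected location at the slice."""
    return [slice_branch(t, time[u], loc[u], time[v], loc[v])
            for v, u in parent.items()
            if u != v and time[u] <= t < time[v]]
```

Decoding each sliced location then yields the granularity-dependent summaries shown in Figure 5.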
Although the hierarchical clustering is only over the inducing points, we can still visualize where real data belong in the hierarchy by computing the variational attachment posteriors $q(a_n)$ and $q(s_n \mid a_n)$ and attaching the data to the tree. By doing this for many data points, and then removing the inducing points from the tree, we obtain an induced hierarchical clustering of the data.
Generating samples
Having fit a generative model to our data, we can visualize samples from the model. Although we do not expect the samples to match the fidelity and sharpness of those from GANs or state-of-the-art decoding networks (Radford et al., 2015; Salimans et al., 2017), sampling with the LORACs prior can help us understand the latent space. To draw a sample from a TMC's posterior predictive density, we first sample a branch and time, assigning the sample a place in the tree. This gives each generated sample a context, i.e., the branch and subtree it was generated from. Learning a LORACs prior also allows us to conditionally sample in a novel way: by restricting samples to a subtree, we can generate samples from the posterior predictive density limited to that subtree. This enables conditional sampling at many levels of the hierarchy. We visualize examples of this in Figures 6 and 7.
6.2 Quantitative results
We ran experiments designed to evaluate the usefulness of the latent space learned with the LORACs prior for downstream tasks. We compare the LORACs prior against a set of baseline priors on three different tasks: few-shot classification, information retrieval, and generative modeling. Our datasets are dynamically binarized MNIST and Omniglot (split by instance), and our baselines are representations learned with the same encoder-decoder architecture and latent dimensionality,^2 but substituting the following prior distributions over $z$:

^2 Following the defaults in the authors' reference implementation, we evaluated DVAE# on statically binarized MNIST with smaller neural networks, but with a higher-dimensional latent space.

No prior

Standard normal prior

VampPrior (Tomczak and Welling, 2018): 500 pseudo-inputs for MNIST, 1000 for Omniglot

DVAE# (Vahdat et al., 2018): latent vectors are 400-dimensional, formed by concatenating binary latents; encoder and decoder are two-layer feedforward networks with ReLU nonlinearities

Masked autoregressive flow (MAF; Papamakarios et al., 2017): two-layer, 512-wide MADE
Few-shot classification
In this task, we train a classifier on the learned latent representations with varying numbers of labels and measure test accuracy. We pick equal numbers of labels per class to avoid class imbalance, and we use a logistic regression classifier trained to convergence to avoid adding unnecessary degrees of freedom to the experiment. We replicated the experiment across 20 randomly chosen label sets for MNIST and 5 for Omniglot. Test accuracies are visualized in Figure 8. For MNIST, we also manually labeled inducing points and found that training a classifier on 200 and 500 labeled inducing points achieved significantly better test accuracy than training on randomly chosen labeled points, hinting that the LORACs prior may have utility in an active learning setting.
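The balanced label-selection step of this protocol can be sketched as follows. The helper `balanced_subset` is our own illustration; the classifier trained on the selected latents would be, for example, scikit-learn's logistic regression run to convergence.

```python
import random
from collections import defaultdict

def balanced_subset(labels, k, rng=None):
    """Pick k labeled examples per class, as in the few-shot protocol:
    equal counts per class keep the small training set balanced.
    Returns indices into `labels`."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], k))
    return chosen
```

Repeating the experiment over many random label sets, as the paper does, amounts to calling this helper with different seeds and averaging the resulting test accuracies.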
The representations learned with the LORACs prior consistently achieve better accuracy, though on MNIST the LORACs prior and MAF reach very similar test accuracy at 100 labels per class. The advantage of the LORACs prior is especially clear on Omniglot (Tables B.3 and B.4 contain the exact numbers). We believe this advantage comes from the ability of the LORACs prior to model discrete structure. t-SNE visualizations in Figures 9 and A.10 indicate that clusters are more concentrated and separated with the LORACs prior than with other priors, though t-SNE visualizations should be taken with a grain of salt.
Information retrieval
We evaluated the meaningfulness of Euclidean distances in the learned latent space by measuring precision-recall when querying the test set. We take each element of the test set and sort all other elements according to their distance in the latent space. From this ranking, we produce a precision-recall curve for each query and plot the average precision-recall over the entire test set in Figure B.16. We also report the area under the curve (AUC) for each of these curves in Table 1.
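One simple way to score a single query is average precision over the distance-ranked test set, which approximates the area under that query's precision-recall curve (the paper's exact AUC computation may differ). A sketch, with hypothetical function names:

```python
def rank_by_distance(query, latents):
    """Indices of `latents` sorted by squared Euclidean distance to the
    query vector, nearest first."""
    d = [sum((a - b) ** 2 for a, b in zip(query, z)) for z in latents]
    return sorted(range(len(latents)), key=d.__getitem__)

def average_precision(ranked_relevant):
    """Average precision for one query, given relevance flags over the
    full ranking of the test set.  Because the whole set is ranked, the
    final hit count equals the total number of relevant items."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0
```

Averaging `average_precision` over every query in the test set yields a single retrieval score per prior, comparable across the rows of Table 1.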
AUC numbers for Omniglot are low across the board because of the large number of classes and the low number of instances per class. However, on both datasets the LORACs prior consistently achieves the highest AUC, especially on MNIST. The LORACs prior encourages tree distance to correspond to squared Euclidean distance, since branch lengths in the tree are variances of Gaussian likelihoods. We therefore expect distances in a LORACs latent space to be more informative and better suited to information retrieval.
Held-out log-likelihood
We estimate held-out log-likelihoods for the four VAEs we trained with comparable architectures and different priors. (We exclude DVAE#, since its architecture is substantially different, and the classical autoencoder, since it lacks generative semantics.) We use 1000 importance-weighted samples (Burda et al., 2015) to estimate held-out log-likelihood, and report the results in Table 2. We find that, although LORACs outperforms the other priors on downstream tasks, it achieves only middling likelihood numbers. This result is consistent with the findings of Chang et al. (2009) that held-out log-likelihood is not necessarily correlated with interpretability or usefulness for downstream tasks.
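The importance-weighted estimator of Burda et al. (2015) averages K importance weights p(x, z_k)/q(z_k | x) inside the logarithm. A numerically stable sketch, assuming the per-sample log-weights have already been computed:

```python
import math

def iwae_log_likelihood(log_weights):
    """Importance-weighted estimate of log p(x): the log of the average
    of K importance weights p(x, z_k) / q(z_k | x), computed from the
    log-weights with a log-sum-exp trick for numerical stability."""
    m = max(log_weights)
    total = sum(math.exp(w - m) for w in log_weights)
    return m + math.log(total) - math.log(len(log_weights))
```

With K = 1000 samples per test point, as used here, this estimator gives a tight lower bound on the true held-out log-likelihood.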
Table 1: Area under the precision-recall curve (AUC) for information retrieval.

Prior  MNIST  Omniglot
No prior  0.429  0.078
Normal  0.317  0.057
VAMP  0.502  0.063
DVAE#  0.490  0.024
MAF  0.398  0.070
LORACs  0.626  0.087

Table 2: Held-out log-likelihood estimates.

Prior  MNIST  Omniglot
Normal  83.789  89.722
MAF  80.121  86.298
VAMP  83.0135  87.604
LORACs  83.401  87.105
7 Discussion
Learning discrete, hierarchical structure in a latent space opens up a new opportunity: interactive deep unsupervised learning. User-provided constraints have been used in both flat and hierarchical clustering (Wagstaff and Cardie, 2000; Awasthi et al., 2014), so an interesting follow-up to this work would be incorporating constraints into the LORACs prior, as in Vikram and Dasgupta (2016), which could potentially enable user-guided representation learning.

References
Awasthi et al. (2014) Awasthi, P., Balcan, M., and Voevodski, K. (2014). Local algorithms for interactive clustering. In International Conference on Machine Learning, pages 550–558.
Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
 Boyles and Welling (2012) Boyles, L. and Welling, M. (2012). The timemarginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems, pages 2969–2977.
 Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. (2018). Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
 Chang et al. (2009) Chang, J., Gerrish, S., Wang, C., BoydGraber, J. L., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in neural information processing systems, pages 288–296.
 Goyal et al. (2017) Goyal, P., Hu, Z., Liang, X., Wang, C., Xing, E. P., and Mellon, C. (2017). Nonparametric variational autoencoders for hierarchical representation learning. In ICCV, pages 5104–5112.
Griffiths et al. (2004) Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in neural information processing systems, pages 17–24.
Hoffman and Johnson (2016) Hoffman, M. D. and Johnson, M. J. (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.
 Johnson et al. (2016) Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., and Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Neural Information Processing Systems.
 Kingma and Ba (2015) Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Autoencoding variational Bayes. International Conference on Learning Representations.
 Kingman (1982) Kingman, J. F. C. (1982). The coalescent. Stochastic processes and their applications, 13(3):235–248.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
 LeCun and Cortes (2010) LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database.
 Lin et al. (2018) Lin, W., Hubacher, N., and Khan, M. E. (2018). Variational message passing with structured inference networks. arXiv preprint arXiv:1803.05589.
Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
Neal (2003) Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian statistics, 7:619–629.
 Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286.
Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., Kingma, D. P., and Bulatov, Y. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR).
Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264.
Tipping and Bishop (1999) Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622.
Tomczak and Welling (2018) Tomczak, J. and Welling, M. (2018). VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pages 1214–1223.
Vahdat et al. (2018) Vahdat, A., Andriyash, E., and Macready, W. G. (2018). DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. In Neural Information Processing Systems (NIPS).
Vikram and Dasgupta (2016) Vikram, S. and Dasgupta, S. (2016). Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090.
 Wagstaff and Cardie (2000) Wagstaff, K. and Cardie, C. (2000). Clustering with instancelevel constraints. AAAI/IAAI, 1097:577–584.
The LORACs prior for VAEs: Letting the Trees Speak for the Data  Supplement
 Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudoinputs. In Advances in neural information processing systems, pages 1257–1264.

Tipping and Bishop (1999)
Tipping, M. E. and Bishop, C. M. (1999).
Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622. 
Tomczak and Welling (2018)
Tomczak, J. and Welling, M. (2018).
Vae with a vampprior.
In
International Conference on Artificial Intelligence and Statistics
, pages 1214–1223.  Vahdat et al. (2018) Vahdat, A., Andriyash, E., and Macready, W. G. (2018). Dvae#: Discrete variational autoencoders with relaxed Boltzmann priors. In Neural Information Processing Systems (NIPS).
 Vikram and Dasgupta (2016) Vikram, S. and Dasgupta, S. (2016). Interactive bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090.
 Wagstaff and Cardie (2000) Wagstaff, K. and Cardie, C. (2000). Clustering with instancelevel constraints. AAAI/IAAI, 1097:577–584.
Appendix A Additional visualizations
Appendix B Empirical results
Table: classification accuracy as the number of labels per class ranges over 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, comparing No prior, Normal, Vamp(500), DVAE#, MAF, and LORACs(200). (Accuracy values not recoverable.)
Table: classification accuracy with 1, 2, 5, 10, and 15 labels per class, comparing No prior, Normal, Vamp(1000), DVAE#, MAF, and LORACs(1000). (Accuracy values not recoverable.)
# of inducing points    200       500
Accuracy                0.9428    0.9474
Appendix C Algorithm details
C.1 Stick-breaking process
Consider inserting a new node into the tree on the branch between a parent vertex and a child vertex (the parent's time preceding the child's), splitting that branch in two. The inserted node's time is distributed according to the stick-breaking process, i.e.
(C.16) 
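As an illustration of the mechanics only (not the exact density in Eq. C.16, which is not reproduced here), a single stick-breaking step that places the inserted node's time inside the parent-child interval might look like the following; the function name, the Beta(1, alpha) choice, and the signature are all our own:

```python
import numpy as np

def sample_insertion_time(t_parent, t_child, alpha=1.0, rng=None):
    """Sample a time for a node inserted on the branch (parent, child).

    Illustrative sketch: draw a Beta(1, alpha) fraction of the interval
    (t_parent, t_child), mimicking one stick-breaking step. The exact
    density used by the TMC is the one in Eq. (C.16).
    """
    rng = np.random.default_rng() if rng is None else rng
    frac = rng.beta(1.0, alpha)  # fraction of the interval to break off
    return t_parent + frac * (t_child - t_parent)
```

By construction the sampled time always lies strictly between the parent and child times, so the insertion preserves the ordering constraint on the branch.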
C.2 Belief propagation in TMCs
The TMC is at the core of the LORACs prior. Recall that the TMC is a prior over phylogenies; attaching a Gaussian random walk (GRW) to a phylogeny then yields a distribution over the latent vectors at the leaves. However, the GRW also samples latent vectors at the internal nodes. Rather than explicitly representing these values, in this work we marginalize them out, i.e.
(C.17) 
This marginalization can be done efficiently because our graphical model is tree-shaped and all nodes have Gaussian likelihoods. Belief propagation is a message-passing framework for marginalization, and we use message passing for several TMC inference queries. The main queries we are interested in are:

1. The joint likelihood of a set of observed leaf values and a phylogeny; we need this for MCMC.
2. The posterior density over one leaf given all the others; we use this distribution when computing the posterior predictive density of a TMC.
3. The gradient of the predictive density of a single leaf with respect to the values at all other leaves; we use this query when computing gradients of the ELBO in the LORACs prior.
Message passing
Message passing treats the tree as an undirected graph. We first pick a start node and request messages from each of its neighbors.
Message passing is thereafter defined recursively: when a node receives a request from a source node, it in turn requests messages from all of its neighbors except the source. The base case for this recursion is a leaf node, which returns a message with the following contents:
(C.18) 
where bold numbers denote vectors obtained by repeating a scalar the appropriate number of times.
In the recursive case, consider a node that has received a set of messages from its neighbors; it combines them as follows.
(C.19) 
where the message depends on the length of the edge between the two nodes. These messages are identical to those used in Boyles and Welling (2012).
Additionally, our messages include gradients w.r.t. every leaf node downstream of the message. We update each of these gradients when computing the new message and pass them along to the source node. Gradients with respect to one of these nodes are calculated as
(C.20) 
The most complicated message is the one whose form depends on the number of incoming messages: the node at which message passing starts receives three incoming messages, while all other internal nodes receive only two. Consider first two messages arriving from two neighboring nodes:
(C.21) 
For three messages arriving from three neighboring nodes:
(C.22) 
With these messages, we can answer all of the aforementioned inference queries:

1. To compute the joint likelihood, we can begin message passing at any internal node.
2. To compute the posterior density over a single leaf, we start message passing at that leaf's parent; the result is a Gaussian whose mean and variance are read off from the combined messages.
3. The gradient query is computed from the gradients sent along with the messages.
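The recursion above can be sketched in plain Python for the Gaussian case. This is an illustrative sketch rather than our implementation: it assumes messages are simple (mean, variance) pairs, that each edge adds its branch length to the message variance (as in a Gaussian random walk), and that incoming messages are combined by a precision-weighted product; the function names and data structures are hypothetical.

```python
import numpy as np

def leaf_message(x):
    # Base case: a leaf contributes a point mass at its observed value,
    # i.e. mean x and zero variance.
    return np.asarray(x, dtype=float), 0.0

def pass_message(tree, lengths, leaves, node, source):
    """Recursively collect messages at `node` from all neighbors except `source`.

    `tree` maps node -> list of neighbors, `lengths` maps frozenset({u, v}) ->
    branch length, `leaves` maps leaf node -> observed value. A message is a
    (mean, variance) pair describing a Gaussian over the value at `node`.
    """
    if node in leaves:
        return leaf_message(leaves[node])
    mus, vars_ = [], []
    for nb in tree[node]:
        if nb == source:
            continue
        m, v = pass_message(tree, lengths, leaves, nb, node)
        # Traversing an edge adds its branch length to the variance (GRW).
        mus.append(m)
        vars_.append(v + lengths[frozenset((node, nb))])
    # Combine incoming messages: precision-weighted product of Gaussians.
    prec = sum(1.0 / v for v in vars_)
    mu = sum(m / v for m, v in zip(mus, vars_)) / prec
    return mu, 1.0 / prec
```

Starting the recursion at an internal node with `source=None` combines messages from all of its neighbors, yielding the posterior over that node's value given every leaf.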
Implementation
We chose to implement the TMC and message passing in Cython because raw Python proved too slow due to function-call and type-checking overhead. Furthermore, we used diagonal rather than scalar variances in the message-passing implementation so that it can later accept the diagonal variances produced by the variational posterior.
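The diagonal-variance combination mentioned above amounts to a per-dimension precision-weighted product of Gaussians. A minimal sketch (the function name and signature are our own, not the paper's):

```python
import numpy as np

def combine_diag_gaussians(mu1, var1, mu2, var2):
    """Precision-weighted product of two diagonal-covariance Gaussians.

    Each argument is a vector; using per-dimension variances instead of a
    single scalar lets the messages accept diagonal variances coming from
    a variational posterior.
    """
    prec = 1.0 / var1 + 1.0 / var2          # precisions add
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)    # precision-weighted mean
    return mu, var
```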
C.3 Variational inference for the LORACs prior
The LORACs prior involves first sampling a tree from the posterior distribution over TMCs with the inducing points as leaves. We then sample a branch and time for each data point according to the posterior predictive distribution described in subsection 2.1, and sample a latent vector from the distribution induced by the GRW likelihood model. Finally, we pass the sampled latent vector through the decoder.
(C.23) 
Consider sampling from the optimal variational factor.
(C.24) 
We use additional variational factors, among them a recognition network that outputs a distribution over the attach time for a particular branch. Since the corresponding terms cancel out, we obtain the following ELBO.
(C.25) 
Inference procedure
In general, the TMC posterior can be sampled using vanilla subtree-prune-and-regraft (SPR) Metropolis-Hastings moves, so samples from this distribution are readily available.
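A generic Metropolis-Hastings acceptance step of the kind used with SPR proposals can be sketched as follows; the function and its arguments are illustrative, not the paper's implementation:

```python
import numpy as np

def mh_accept(log_joint_current, log_joint_proposed,
              log_q_forward, log_q_reverse, rng=None):
    """Metropolis-Hastings acceptance test.

    For an SPR move, the proposal detaches a subtree and regrafts it on
    another branch; `log_q_forward` and `log_q_reverse` are the log
    densities of proposing the move and of proposing its reversal.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_alpha = (log_joint_proposed - log_joint_current) \
        + (log_q_reverse - log_q_forward)
    # Accept with probability min(1, alpha), computed in log space.
    return np.log(rng.uniform()) < min(0.0, log_alpha)
```

A proposal that strictly improves the joint (with symmetric proposal densities) is always accepted, so the chain can only benefit from such moves.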
For each data point in the minibatch, we pass it through the encoder to obtain the parameters of its variational posterior. We then compute
(C.26) 
This quantity is computed by looping over every branch of a sample from the TMC posterior, storing the incoming messages at each node, and passing those messages together with a sample from the encoder into the recognition network, which outputs a logistic-normal distribution over times for that branch. We sample this logistic normal to obtain a time to pair with the branch. We can then compute the log-likelihood of the data point if it were to attach at that branch and time, using TMC inference query #2. This log-likelihood is added to the TMC prior log-probability of the branch being selected, yielding a joint probability for that branch. After doing this for every branch, we normalize the joint likelihoods to obtain the optimal categorical distribution over branches. We then sample from this distribution to obtain an attach location and time for each data point in the minibatch.
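The normalization step above is the standard log-sum-exp trick. A minimal sketch, assuming the per-branch prior log-probabilities and attach log-likelihoods have already been computed (the function name and signature are hypothetical):

```python
import numpy as np

def attach_distribution(branch_log_priors, branch_log_likes, rng=None):
    """Form the optimal categorical distribution over attach branches.

    For each branch, add the TMC prior log-probability of selecting it to
    the log-likelihood of the data point attaching there, then normalize
    in a numerically stable way and sample an attach branch.
    """
    rng = np.random.default_rng() if rng is None else rng
    logp = np.asarray(branch_log_priors) + np.asarray(branch_log_likes)
    logp -= logp.max()          # stabilize before exponentiating
    probs = np.exp(logp)
    probs /= probs.sum()        # normalized categorical over branches
    branch = rng.choice(len(probs), p=probs)
    return probs, branch
```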
The next stage is to compute gradients w.r.t. the learnable parameters of the model (the encoder, decoder, recognition network, and inducing points). In the process of calculating the attach distribution, we have already obtained samples from the corresponding variational factors. We plug these into the ELBO and compute gradients via automatic differentiation w.r.t. the encoder, decoder, and recognition-network parameters. Computing gradients w.r.t. the inducing points is trickier. We first examine the ELBO.
(C.27) 
Consider the gradient of the ELBO with respect to the inducing points.