Learning Hierarchical Priors in VAEs

by Alexej Klushyn, et al.

We propose to learn a hierarchical prior in the context of variational autoencoders. Our aim is to avoid over-regularisation resulting from a simplistic prior like a standard normal distribution. To incentivise an informative latent representation of the data by learning a rich hierarchical prior, we formulate the objective function as the Lagrangian of a constrained-optimisation problem and propose an optimisation algorithm inspired by Taming VAEs. To validate our approach, we train our model on the static and binary MNIST, Fashion-MNIST, OMNIGLOT, CMU Graphics Lab Motion Capture, 3D Faces, and 3D Chairs datasets, obtaining results that are comparable to the state of the art. Furthermore, we introduce a graph-based interpolation method to show that the topology of the learned latent representation corresponds to the topology of the data manifold.





1 Introduction

Variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) are a class of latent variable models for unsupervised learning. The learned generative model and the corresponding (approximate) posterior distribution of the latent variables provide a decoder/encoder pair that might capture semantically meaningful features of the data. In this paper, we address the issue of learning informative latent encodings/representations of the data.

The vanilla VAE uses a standard normal prior distribution for the latent variables. However, it has been shown that this often leads to over-regularising the posterior distribution, resulting in a poor latent representation (Alemi et al., 2018). There are several approaches to alleviate this problem: (i) defining and learning complex prior distributions that can model the encoded data manifold (Chen et al., 2016b; Tomczak & Welling, 2018); (ii) using specialised optimisation algorithms that try to find local minima of the training objective that correspond to informative latent representations (Bowman et al., 2016; Sønderby et al., 2016; Higgins et al., 2017; Rezende & Viola, 2018); and (iii) adding mutual-information-based constraints or regularisers to incentivise a good correspondence between the data and the latent variables (Alemi et al., 2018; Zhao et al., 2017; Chen et al., 2016a). In this paper, we address the first two approaches.

As a starting point, we use the approach from (Tomczak & Welling, 2018), where the authors note that the optimal prior (empirical Bayes) is the aggregated posterior—a uniform mixture of approximate posteriors evaluated at the data points. Using this insight, they propose a prior that is a uniform mixture of approximate posterior distributions, evaluated at a few learned pseudo data points. However, a finite mixture does not always provide a good prior (e.g., Sec. 4.2). In this paper, we propose to approximate the aggregated posterior through a continuous mixture/hierarchical distribution. This enables a highly flexible prior, and hence avoids over-regularising the approximate posterior.

In order to learn such hierarchical priors, we extend the optimisation framework introduced in (Rezende & Viola, 2018), where the authors reformulate the VAE objective as the Lagrangian of a constrained-optimisation problem. They impose an inequality constraint on the reconstruction error and choose the KL divergence between the approximate posterior and the prior as the optimisation objective. Instead of a standard normal, we use the hierarchical distribution described above as prior and approximate it by applying an importance-weighted bound (Burda et al., 2015). Concurrently, we introduce the associated optimisation algorithm. It is inspired by GECO (Rezende & Viola, 2018), which does not always lead to good encodings (e.g., Sec. 4.1). Our approach prevents posterior collapse and results in more informative latent representations than previous methods.

We adopt the manifold hypothesis (Cayton, 2005; Rifai et al., 2011) to validate the quality of a latent representation. We do this by proposing a nearest-neighbour graph-based method for interpolating between different data points along the learned data manifold in the latent space.

2 Methods

2.1 VAEs as a Constrained-Optimisation Problem

VAEs model the distribution of i.i.d. data $\mathcal{D} = \{x_i\}_{i=1}^N$ as the marginal

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z. \tag{1}$$

The model parameters $\theta$ are learned through amortised variational EM, which requires learning an approximate posterior distribution $q_\phi(z \mid x) \approx p_\theta(z \mid x)$. It is hoped that the learned $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ result in an informative latent representation of the data. For example, the encodings may cluster w.r.t. some discrete features or important factors of variation in the data. In Sec. 4.1, we show a toy example, where the model can learn the true underlying factors of variation in $\mathcal{D}$.

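Amortised variational EM in VAEs maximises the evidence lower bound (ELBO) (Kingma & Welling, 2013; Rezende et al., 2014):

$$\mathbb{E}_{p_\mathcal{D}(x)}\big[\log p_\theta(x)\big] \ge \mathcal{F}_\text{ELBO}(\theta, \phi) \equiv \mathbb{E}_{p_\mathcal{D}(x)}\Big[\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)\Big], \tag{2}$$

where $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ are typically assumed to be diagonal Gaussians with their parameters defined as neural network functions of the conditioning variables. $p_\mathcal{D}(x)$ stands for the empirical distribution of the data $\mathcal{D}$. The (EM) optimisation problem (e.g. Neal & Hinton, 1998) is formulated as

$$\min_\theta \min_\phi \; -\mathcal{F}_\text{ELBO}(\theta, \phi). \tag{3}$$

The corresponding optimisation algorithm has often been implemented as a double-loop algorithm. However, in the context of VAEs, or neural inference models in general, it is common practice to optimise $(\theta, \phi)$ jointly.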
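For diagonal Gaussians, the KL term of the ELBO against a standard normal prior has a closed form, so a single-sample ELBO estimate is cheap to compute. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the decoder is taken to be a Gaussian with a fixed scale `sigma`, and the posterior is parameterised by a mean and a log-variance per dimension.

```python
import math


def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def gaussian_log_likelihood(x, x_mean, sigma=1.0):
    """log N(x | x_mean, sigma^2 I); sigma is an assumed fixed decoder scale."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_mean))
    return (-0.5 * sq_err / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def elbo_single_sample(x, x_mean, mu, log_var, sigma=1.0):
    """Reconstruction term minus KL term for one data point.

    x_mean is the decoder output for one latent sample z ~ q(z | x).
    """
    return (gaussian_log_likelihood(x, x_mean, sigma)
            - kl_diag_gaussian_to_std_normal(mu, log_var))
```

The KL term vanishes exactly when the posterior equals the prior, which is the over-regularised regime the paper tries to avoid.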

It has been shown that local minima with high ELBO values do not necessarily result in informative latent representations (Alemi et al., 2018; Higgins et al., 2017). In order to address this problem, several approaches have been developed, which typically result in some weighting schedule for either the negative expected log-likelihood or the KL term of the ELBO (Bowman et al., 2016; Sønderby et al., 2016). This is due to the fact that a different ratio targets different regions in the rate-distortion plane, either favouring better compression or reconstruction (Alemi et al., 2018).

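In (Rezende & Viola, 2018), the authors reformulate the VAE objective as the Lagrangian of a constrained-optimisation problem. They choose $\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$ as the optimisation objective and impose the inequality constraint $\mathbb{E}_{q_\phi(z \mid x)}\big[C_\theta(x, z)\big] \le \kappa^2$. Typically, $C_\theta(x, z)$ is defined as the reconstruction-error-related term in $-\log p_\theta(x \mid z)$. Since $\mathbb{E}_{p_\mathcal{D}(x)}\mathbb{E}_{q_\phi(z \mid x)}\big[C_\theta(x, z)\big]$ is the average reconstruction error, this formulation allows for a better control of the quality of generated data. In the resulting Lagrangian objective

$$\mathcal{L}(\theta, \phi; \lambda) \equiv \mathbb{E}_{p_\mathcal{D}(x)}\Big[\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) + \lambda\,\Big(\mathbb{E}_{q_\phi(z \mid x)}\big[C_\theta(x, z)\big] - \kappa^2\Big)\Big], \tag{4}$$

the Lagrange multiplier $\lambda$ can be viewed as a weighting term for $\mathbb{E}_{p_\mathcal{D}(x)}\mathbb{E}_{q_\phi(z \mid x)}\big[C_\theta(x, z)\big]$. This approach leads to a similar optimisation objective as in (Higgins et al., 2017) with $\beta = 1/\lambda$. The authors propose a descent-ascent algorithm (GECO) for finding the saddle point of the Lagrangian. The parameters $(\theta, \phi)$ are optimised through gradient descent and $\lambda$ is updated as

$$\lambda_t = \lambda_{t-1} \cdot \exp\big(\nu \cdot (\hat{C}_t - \kappa^2)\big), \tag{5}$$

corresponding to a quasi-gradient ascent due to $\Delta\lambda_t \cdot \partial_\lambda \mathcal{L} \ge 0$; $\nu$ is the update’s learning rate. In the context of stochastic batch gradient training, the average reconstruction error is estimated as the running average $\hat{C}_t = (1 - \alpha) \cdot \hat{C}_\text{ba} + \alpha \cdot \hat{C}_{t-1}$, where $\hat{C}_\text{ba}$ is the batch average of $C_\theta(x, z)$. To the best of our understanding (the optimisation problem is not explicitly stated in (Rezende & Viola, 2018)), the GECO algorithm solves the optimisation problem

$$\min_\theta \max_\lambda \min_\phi \; \mathcal{L}(\theta, \phi; \lambda) \quad \text{s.t.} \quad \lambda \ge 0. \tag{6}$$

Here, the inner minimisation over $\phi$ can be viewed as corresponding to the E-step of the EM algorithm. However, in general this objective can only be guaranteed to be the ELBO if $\lambda = 1$, or in case of $0 \le \lambda < 1$, a scaled lower bound on the ELBO.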
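The multiplier update described above can be sketched in a few lines. This is an illustrative sketch only, assuming the multiplicative form $\lambda_t = \lambda_{t-1}\exp(\nu(\hat{C}_t - \kappa^2))$ and an exponential running average of the batch reconstruction error; the function name `geco_step` and the default values of `nu` and `alpha` are assumptions, not values from the paper.

```python
import math


def geco_step(lam, c_prev, c_batch, kappa_sq, nu=0.01, alpha=0.99):
    """One GECO-style update of the Lagrange multiplier.

    lam      -- current multiplier (must stay positive)
    c_prev   -- previous running average of the reconstruction error,
                or None on the first step
    c_batch  -- batch-average reconstruction error of the current step
    kappa_sq -- tolerated reconstruction error (the constraint bound)
    Returns (new_lambda, new_running_average).
    """
    if c_prev is None:
        c_hat = c_batch
    else:
        c_hat = (1.0 - alpha) * c_batch + alpha * c_prev
    # The multiplicative form keeps lambda positive; it grows while the
    # constraint is violated and shrinks once it is satisfied.
    lam_new = lam * math.exp(nu * (c_hat - kappa_sq))
    return lam_new, c_hat
```

In a training loop, the returned multiplier would weight the reconstruction term of the next gradient step, while the running average is carried over to the next call.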

2.2 Hierarchical Priors for Learning Informative Latent Representations

In this section, we propose a hierarchical prior for VAEs within the constrained-optimisation setting. Our goal is to incentivise the learning of informative latent representations and to avoid over-regularising the posterior distribution (i) by increasing the complexity of the prior distribution $p(z)$, and (ii) by providing an optimisation method to learn such models.

It has been shown that the optimal empirical Bayes prior is the aggregated posterior distribution

$$p^*(z) = \mathbb{E}_{p_\mathcal{D}(x)}\big[q_\phi(z \mid x)\big] = \frac{1}{N}\sum_{i=1}^{N} q_\phi(z \mid x_i). \tag{7}$$

We follow (Tomczak & Welling, 2018) to approximate this distribution in the form of a mixture distribution. However, we opt for a continuous mixture/hierarchical model

$$p_\Theta(z) = \int p_\Theta(z \mid \zeta)\, p(\zeta)\, \mathrm{d}\zeta, \tag{8}$$

with a standard normal $p(\zeta)$. This leads to a hierarchical model with two stochastic layers. As a result, intuitively, our approach inherently favours the learning of continuous latent features. We refer to this model as the variational-hierarchical prior (VHP).

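In order to learn the parameters in Eq. (8), we propose to adapt the constrained-optimisation problem in Sec. 2.1 to hierarchical models. For this purpose, we use an importance-weighted (IW) bound (Burda et al., 2015) to define a sequence of upper bounds (and constrained-optimisation problems). That is, we use

$$\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\Theta(z)\big] \ge \mathbb{E}_{q_\phi(z \mid x)}\, \mathbb{E}_{\zeta_{1:K} \sim q_\Phi(\zeta \mid z)}\Big[\log \frac{1}{K}\sum_{k=1}^{K} \frac{p_\Theta(z \mid \zeta_k)\, p(\zeta_k)}{q_\Phi(\zeta_k \mid z)}\Big] \tag{9}$$

with $K$ importance weights, defining an upper bound on the Lagrangian of Sec. 2.1 (Eq. (4)):

$$\mathcal{L}(\theta, \phi; \lambda) \le \mathcal{L}_\text{VHP}(\theta, \phi, \Theta, \Phi; \lambda) \equiv \mathbb{E}_{p_\mathcal{D}(x)}\Big[\mathbb{E}_{q_\phi(z \mid x)}\Big[\log q_\phi(z \mid x) - \mathbb{E}_{\zeta_{1:K} \sim q_\Phi(\zeta \mid z)}\Big[\log \frac{1}{K}\sum_{k=1}^{K} \frac{p_\Theta(z \mid \zeta_k)\, p(\zeta_k)}{q_\Phi(\zeta_k \mid z)}\Big]\Big] + \lambda\,\Big(\mathbb{E}_{q_\phi(z \mid x)}\big[C_\theta(x, z)\big] - \kappa^2\Big)\Big]. \tag{10}$$

As a result, we arrive at the optimisation problem

$$\min_{\Theta, \Phi} \min_\theta \max_\lambda \min_\phi \; \mathcal{L}_\text{VHP}(\theta, \phi, \Theta, \Phi; \lambda) \quad \text{s.t.} \quad \lambda \ge 0, \tag{11}$$

which we can optimise by the following double-loop algorithm: (i) in the outer loop we update the bound w.r.t. $(\Theta, \Phi)$; (ii) in the inner loop we solve the optimisation problem $\min_\theta \max_\lambda \min_\phi \mathcal{L}_\text{VHP}$ by applying an update scheme for $(\theta, \phi)$ and $\lambda$, respectively. In the following, we use the $\beta$-parameterisation, $\beta = 1/\lambda$, to be in line with (e.g. Higgins et al., 2017; Sønderby et al., 2016).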
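The importance-weighted bound on the log-density of a hierarchical prior can be illustrated on a toy model where the marginal is known in closed form. The sketch below is not the paper's model: it assumes a one-dimensional linear-Gaussian prior, $\zeta \sim \mathcal{N}(0, 1)$ and $z \mid \zeta \sim \mathcal{N}(\zeta, \sigma^2)$, so that $p(z) = \mathcal{N}(0, 1 + \sigma^2)$, and a Gaussian proposal whose mean and variance are passed in by hand. All function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iw_log_prior(z, k, proposal_mean, proposal_var, sigma_sq=0.25, rng=random):
    """K-sample importance-weighted estimate of log p(z) for the toy model
    zeta ~ N(0, 1), z | zeta ~ N(zeta, sigma_sq).

    In expectation this is a lower bound on log p(z) that tightens with K.
    """
    log_weights = []
    for _ in range(k):
        zeta = rng.gauss(proposal_mean, math.sqrt(proposal_var))
        log_weights.append(
            log_normal_pdf(z, zeta, sigma_sq)       # log p(z | zeta)
            + log_normal_pdf(zeta, 0.0, 1.0)        # log p(zeta)
            - log_normal_pdf(zeta, proposal_mean, proposal_var)  # log q(zeta | z)
        )
    return log_mean_exp(log_weights)


def exact_log_prior(z, sigma_sq=0.25):
    """Closed-form marginal of the toy model: p(z) = N(0, 1 + sigma_sq)."""
    return log_normal_pdf(z, 0.0, 1.0 + sigma_sq)
```

When the proposal equals the exact posterior of the toy model, $\mathcal{N}\big(z/(1+\sigma^2),\, \sigma^2/(1+\sigma^2)\big)$, every importance weight equals $p(z)$ and the bound is tight for any $K$. A less expressive proposal leaves a gap, which is the motivation for using more than one importance sample.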

In the GECO update scheme (Eq. (5)), $\beta$ increases/decreases until $\hat{C}_t = \kappa^2$. However, provided the constraint is fulfilled, we want to obtain a tighter lower bound on the log-likelihood. As discussed in Sec. 2.1, that can be guaranteed when $\beta = 1$ (ELBO). To achieve this, we propose to replace the corresponding $\beta$-version of Eq. (5) by

$$\beta_t = \beta_{t-1} \cdot \exp\big[\nu \cdot f_\beta(\beta_{t-1}, \hat{C}_t - \kappa^2; \tau) \cdot (\hat{C}_t - \kappa^2)\big], \tag{12}$$

where we define

$$f_\beta(\beta, \delta; \tau) = \big(1 - H(\delta)\big) \cdot \tanh\big(\tau \cdot (\beta - 1)\big) - H(\delta). \tag{13}$$

Here, $H$ is the Heaviside function and we introduce a slope parameter $\tau$. This update can be interpreted as follows. If the constraint is violated, i.e. $\hat{C}_t > \kappa^2$, the update scheme is equal to Eq. (5) due to the Heaviside function. In case the constraint is fulfilled, the $\tanh$ term guarantees that we finish at $\beta = 1$, to obtain/optimise the ELBO at the end of the training. Thus, we impose an upper limit of $\beta = 1$, which is reasonable since it does not violate the constraint. A visualisation of the $\beta$-update scheme is shown in Fig. 1. Note that there are alternative ways to modify Eq. (5) (see App. B.1); however, Eq. (12) led to the best results.
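The behaviour of this update can be checked case by case. The following worked sketch assumes the form $f_\beta(\beta, \delta; \tau) = (1 - H(\delta))\tanh(\tau(\beta - 1)) - H(\delta)$ with $\delta = \hat{C}_t - \kappa^2$, $H(\delta) = 1$ for $\delta > 0$ and $H(\delta) = 0$ otherwise, and $\beta_t = \beta_{t-1}\exp[\nu f_\beta \delta]$; this notation is a reconstruction from the surrounding description and should be read as an assumption.

```latex
\begin{align*}
\delta > 0 \ \text{(constraint violated)}: \quad
  & f_\beta = -1,
  & \beta_t &= \beta_{t-1}\, e^{-\nu \delta} < \beta_{t-1}, \\
\delta < 0,\ \beta_{t-1} < 1: \quad
  & f_\beta = \tanh\big(\tau(\beta_{t-1} - 1)\big) < 0,
  & \beta_t &= \beta_{t-1}\, e^{\nu f_\beta \delta} > \beta_{t-1}, \\
\delta < 0,\ \beta_{t-1} = 1: \quad
  & f_\beta = 0,
  & \beta_t &= \beta_{t-1}.
\end{align*}
```

With $\lambda = 1/\beta$, the first case reads $\lambda_t = \lambda_{t-1} e^{\nu\delta}$, which is the GECO update. In the second case both $f_\beta$ and $\delta$ are negative, so the exponent is positive and $\beta$ grows towards 1, where the third case makes it a fixed point.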

Figure 1: $\beta$-update scheme (Eq. (12)). A comparison with the GECO update scheme can be found in App. A. (see Sec. 2.2)

However, the double-loop approach in Eq. (11) is often computationally inefficient. Thus, we decided to run the inner loop only until the constraints are satisfied and then update the bound. That is, we optimise Eq. (11) and skip the outer loop/bound updates when the constraints are not satisfied. It turned out that the bound updates were often skipped in the initial phase, but rarely skipped later on. Hence, the algorithm behaves like layer-wise pretraining (Bengio et al., 2007). For these reasons, we propose Alg. 1 (REWO), which separates training into two phases: an initial phase, where we only optimise the reconstruction error, and a main phase, where all parameters are updated jointly.

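  Initialise $\beta \ll 1$, InitialPhase = True
  while training do
     Read current data batch $x_\text{ba}$
     Sample from variational posterior $z \sim q_\phi(z \mid x_\text{ba})$
     Compute $\hat{C}_\text{ba}$ (batch average)
     $\hat{C}_t = (1 - \alpha) \cdot \hat{C}_\text{ba} + \alpha \cdot \hat{C}_{t-1}$, ($\hat{C}_0 = \hat{C}_\text{ba}$)
     if $\hat{C}_t < \kappa^2$ then
        InitialPhase = False
     end if
     if InitialPhase then
        Optimise $\mathcal{L}_\text{VHP}(\theta, \phi, \Theta, \Phi; \beta)$ w.r.t. $\theta, \phi$
     else
        Update $\beta$ according to Eq. (12)
        Optimise $\mathcal{L}_\text{VHP}(\theta, \phi, \Theta, \Phi; \beta)$ w.r.t. $\theta, \phi, \Theta, \Phi$
     end if
  end while
Algorithm 1 (REWO) Reconstruction-error-based weighting of the objective function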

In the initial phase, we initialise $\beta \ll 1$ to enforce a reconstruction optimisation, that is, to train the first stochastic layer to achieve a good encoding of the data through $q_\phi(z \mid x)$, measured by the reconstruction error. To prevent $\beta$ from becoming smaller than the initial value during the first iteration steps, we start to update $\beta$ when the condition $\hat{C}_t < \kappa^2$ is fulfilled. A good encoding is required to learn the conditionals $q_\Phi(\zeta \mid z)$ and $p_\Theta(z \mid \zeta)$ in the second stochastic layer. Otherwise, $q_\Phi(\zeta \mid z)$ would be strongly regularised towards $p(\zeta)$, resulting in $p_\Theta(z \mid \zeta) = p_\Theta(z)$, from which it typically does not recover (Sønderby et al., 2016). In the main phase, after $\hat{C}_t < \kappa^2$ is fulfilled, we additionally optimise the parameters of the second stochastic layer and start to update $\beta$. This approach avoids posterior collapse in both stochastic layers (see Sec. 4.1 and App. D.2), and thereby helps the hierarchical prior to learn an informative latent representation that prevents the aforementioned over-regularisation.
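The two-phase logic can be isolated from the model itself as a small scheduler that consumes batch-average reconstruction errors and decides which parameter groups to train. This is a sketch under stated assumptions, not the authors' implementation: the class and group names are hypothetical, the default hyperparameters are placeholders, and the update uses the assumed form $f_\beta(\beta, \delta; \tau) = (1 - H(\delta))\tanh(\tau(\beta - 1)) - H(\delta)$.

```python
import math


def f_beta(beta, delta, tau):
    """Assumed switching function: -1 if the constraint is violated
    (delta > 0), otherwise tanh(tau * (beta - 1))."""
    heaviside = 1.0 if delta > 0.0 else 0.0
    return (1.0 - heaviside) * math.tanh(tau * (beta - 1.0)) - heaviside


class RewoSchedule:
    """Tracks the running reconstruction error, the training phase, and beta."""

    def __init__(self, kappa_sq, beta_init=1e-3, nu=0.1, tau=3.0, alpha=0.9):
        self.kappa_sq = kappa_sq
        self.beta = beta_init          # small beta: reconstruction dominates
        self.nu, self.tau, self.alpha = nu, tau, alpha
        self.c_hat = None              # running average of the error
        self.initial_phase = True

    def step(self, c_batch):
        """Consume one batch-average reconstruction error and return the
        parameter groups that should be optimised in this iteration."""
        if self.c_hat is None:
            self.c_hat = c_batch
        else:
            self.c_hat = (1.0 - self.alpha) * c_batch + self.alpha * self.c_hat
        delta = self.c_hat - self.kappa_sq
        if delta < 0.0:
            self.initial_phase = False  # constraint met once: leave for good
        if self.initial_phase:
            return ("first_layer",)
        self.beta *= math.exp(self.nu * f_beta(self.beta, delta, self.tau) * delta)
        return ("first_layer", "second_layer")
```

A training loop would call `step` once per batch, weight the KL term by `schedule.beta`, and apply gradients only to the returned groups. Note that the phase switch is one-way: a later constraint violation lowers $\beta$ but does not return to the initial phase.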

The proposed method, which is a combination of an ELBO-like Lagrangian and an IW bound, can be interpreted as follows: the posterior of the first stochastic layer $q_\phi(z \mid x)$ can learn an informative latent representation due to the flexible hierarchical prior. Since a diagonal Gaussian $q_\Phi(\zeta \mid z)$ is not flexible enough to capture the (true) posterior of the second stochastic layer, we propose to enhance it by using an importance-weighted bound (Cremer et al., 2017) (alternatively, one could use, for example, normalising flows (Rezende & Mohamed, 2015)). This has the following advantages: (i) it facilitates learning a precise conditional $p_\Theta(z \mid \zeta)$ from the standard normal distribution $p(\zeta)$ to the aggregated posterior $\mathbb{E}_{p_\mathcal{D}(x)}\big[q_\phi(z \mid x)\big]$; (ii) it allows $p_\Theta(z \mid \zeta)$ to fully exploit its representational capacity; otherwise, the model could compensate for a less expressive $q_\Phi(\zeta \mid z)$ by regularising $p_\Theta(z \mid \zeta)$ (see App. B.4).

2.3 Graph-Based Interpolations for Verifying Latent Representations

A key reason for introducing hierarchical priors was to facilitate an informative latent representation due to less over-regularisation of the posterior. To verify the quality of the latent representations, we build on the manifold hypothesis, defined in (Cayton, 2005; Rifai et al., 2011). The idea can be summarised by the following assumption: real-world data presented in high-dimensional spaces is likely to concentrate in the vicinity of nonlinear sub-manifolds of much lower dimensionality. Following this hypothesis, the quality of latent representations can be evaluated by interpolating between different data points along the learned data manifold in the latent space—and reconstructing the resulting path to the observable space.

To implement the above idea, we propose a graph-based method (Chen et al., 2018) which summarises the continuous latent space by a graph consisting of a finite number of nodes. The nodes $\mathcal{Z} = \{z_1, \dots, z_N\}$ can be obtained by randomly drawing $N$ samples from the learned prior (Eq. (8)):

$$z_n \sim p_\Theta(z \mid \zeta_n), \quad \zeta_n \sim p(\zeta), \quad n = 1, \dots, N. \tag{14}$$

The graph is then constructed by connecting each node by undirected edges to its k-nearest neighbours. The edge weights are Euclidean distances in the latent space between the related node pairs. Once the graph is built, interpolation between data points $x_i$ and $x_j$ can be done as follows. First, we encode the data points as $z_i = \mu_\phi(x_i)$ and $z_j = \mu_\phi(x_j)$, where $\mu_\phi(x)$ is the mean of $q_\phi(z \mid x)$. Next, the encoded data points are added as new nodes to the graph along with edges to the existing (nearest neighbour) nodes.

To find the shortest path through the graph between nodes $z_i$ and $z_j$, a classic search algorithm such as A* can be used. The result is a sequence of nodes from $\mathcal{Z}$ that starts at $z_i$ and ends at $z_j$, representing the shortest path in the latent space along the learned latent manifold. Finally, to obtain the interpolation, we reconstruct this sequence to the observable space.
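The graph construction and path search can be sketched with the standard library alone. The sketch below takes the node coordinates as given (in the paper they would be prior samples plus the two encoded data points) and uses Dijkstra's algorithm rather than A* for the shortest-path step; both return the same path on graphs with non-negative edge weights. The function names are hypothetical.

```python
import heapq
import math


def build_knn_graph(nodes, k):
    """Undirected k-nearest-neighbour graph with Euclidean edge weights.

    nodes -- list of coordinate tuples; returns {index: {neighbour: weight}}.
    """
    graph = {i: {} for i in range(len(nodes))}
    for i, p in enumerate(nodes):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(nodes) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d  # undirected: store the edge in both directions
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns node indices from start to goal,
    or None if the goal cannot be reached."""
    dist = {start: 0.0}
    prev = {}
    queue = [(0.0, start)]
    visited = set()
    while queue:
        d, u = heapq.heappop(queue)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(queue, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

Because edges only connect nearby nodes, the path is forced to follow regions where samples exist instead of cutting straight through low-density parts of the latent space. Decoding the node coordinates along the returned path then yields the interpolation.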

3 Related Work

(a) VHP + REWO
(b) VHP + GECO
Figure 2: (left) Latent representation of the pendulum data at different iteration steps when optimising $\mathcal{L}_\text{VHP}$ with REWO and GECO, respectively. The top row shows the approximate posterior; the greyscale encodes the variance of its standard deviation. The bottom row shows the hierarchical prior. (right) $\beta$ as a function of the iteration steps; red lines mark the iteration steps at which the latent representation is visualised. (see Sec. 4.1)

Several works improve the VAE by learning more complex priors such as the stick-breaking prior (Nalisnick & Smyth, 2017), a nested Chinese Restaurant Process prior (Goyal et al., 2017), Gaussian mixture priors (Dilokthanakul et al., 2016), or autoregressive priors (Chen et al., 2016b). A closely related line of research is based on the insight that the optimal prior is the aggregated posterior. The VampPrior (Tomczak & Welling, 2018) approximates the aggregated posterior by a uniform mixture of approximate posterior distributions, evaluated at a few learned pseudo data points. In our approach, the aggregated posterior is approximated by using an IW bound. Compared to the VampPrior, the VHP can be viewed as a continuous mixture distribution.

In the context of VAEs, hierarchical latent variable models were already introduced earlier (Rezende et al., 2014; Burda et al., 2015; Sønderby et al., 2016). Compared to our approach, these works have in common the structure of the generative model but differ in the factorisation of the inference models and the optimisation procedure. In our proposed method, the VAE objective is reformulated as the Lagrangian of a constrained-optimisation problem. The prior of this ELBO-like Lagrangian is approximated by an IW bound. It is motivated by the fact that a single stochastic layer with a flexible prior can be sufficient for modelling an informative latent representation. The second stochastic layer is required to learn a sufficiently flexible prior.

The common problem of failing to use the full capacity of the model in VAEs (Burda et al., 2015) has been addressed by applying annealing/warm-up (Bowman et al., 2016; Sønderby et al., 2016). Here, the KL divergence between the approximate posterior and the prior is multiplied by a weighting factor, which is linearly increased from 0 to 1 during training. However, such predefined schedules might be suboptimal. Therefore, (Rezende & Viola, 2018) introduce a constrained-optimisation algorithm called GECO. By reformulating the objective as a constrained-optimisation problem, the above weighting term can be represented by a Lagrange multiplier and updated based on the reconstruction error. Our proposed algorithm builds on GECO, providing several modifications discussed in Sec. 2.2.
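For comparison with the adaptive schemes, a predefined warm-up schedule is a one-line function of the training step. A minimal sketch, assuming the weight is increased linearly per update step; the function name and the `warmup_steps` parameter are hypothetical.

```python
def warmup_weight(step, warmup_steps):
    """KL weight that rises linearly from 0 to 1 over `warmup_steps` updates
    and stays at 1 afterwards."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))
```

The schedule depends only on the step count, not on how well the model currently reconstructs the data, which is the limitation the constrained-optimisation approach addresses.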

In (Higgins et al., 2017), the authors propose a constrained-optimisation framework, where the optimisation objective is the expected negative log-likelihood and the constraint is imposed on the KL term (recall that in (Rezende & Viola, 2018) the roles are reversed). Instead of optimising the resulting Lagrangian, the authors choose Lagrange multipliers ($\beta$ parameter) that maximise a heuristic cost for disentanglement. In contrast to our approach, the goal is not to learn a latent representation that reflects the topology of the data but a disentangled representation, where different dimensions of the latent space correspond to different features of the data.

4 Experiments

To validate our approach, we consider the following experiments. In Sec. 4.1, we demonstrate that our method learns to represent the degree of freedom in the data of a moving pendulum. In Sec. 4.2, we generate human movements based on the learned latent representations of real-world data (CMU Graphics Lab Motion Capture Database). In Sec. 4.3, the marginal log-likelihood on standard datasets such as MNIST, Fashion-MNIST, and OMNIGLOT is evaluated. In Sec. 4.4, we compare our method on the high-dimensional image datasets 3D Faces and 3D Chairs. The model architectures used in our experiments can be found in App. F.

4.1 Artificial Pendulum Dataset

We created a dataset of 15,000 images of a moving pendulum (see Fig. 4). Each image has a size of pixels and the joint angles are distributed uniformly in the range .

(a) VHP + REWO
(b) VHP + GECO
(c) IWAE
Figure 3: Graph-based interpolation of the pendulum movement. The graph is based on the prior, shown in Fig. 2 and App. B.5, respectively. The red curves depict the interpolations, the bluescale indicates the edge weight. (see Sec. 4.1)
top: VHP + REWO, middle: VHP + GECO, bottom: IWAE
Figure 4: Pendulum reconstructions of the graph-based interpolation in the latent space, shown in Fig. 3. Discontinuities are marked by blue boxes. (see Sec. 4.1)
Figure 5: Latent representation of human motion data: VHP + REWO (top), VampPrior (middle), and IWAE (bottom); approximate posterior (left) and prior (right). The colour encodes the five human motions. The different sample densities are caused by a different amount of data points for each motion. (see Sec. 4.2)
(a) VHP + REWO
(b) VampPrior
(c) IWAE
Figure 6: Graph-based interpolation of human motions. The graphs are based on the (learned) prior distributions, depicted in Fig. 5. The bluescale indicates the edge weight. The coloured lines represent four interpolated movements, which can be found in Fig. 7 and App. C. (see Sec. 4.2)

top: VHP + REWO, middle: VampPrior, bottom: IWAE

Figure 7: Human-movement reconstructions of the graph-based interpolations in Fig. 6 (red curve). Reconstruction of the remaining interpolations can be found in App. C. Discontinuities are marked by blue boxes. (see Sec. 4.2)
Figure 8: Smoothness measure of the human-movement interpolations. For each joint, the mean and standard deviation of the smoothness factor are displayed. Smaller values correspond to smoother movements. (see Sec. 4.2)

Fig. 2 shows latent representations of the pendulum data learned by the VHP when applying REWO and GECO, respectively; the same $\kappa$ is used in both cases. In case of REWO, the approximate posterior (Fig. 2(a), top row) is optimised to reach a low reconstruction error at the beginning of the training due to $\beta \ll 1$. The variance of the posterior’s standard deviation, expressed by the greyscale, measures whether the contribution to the ELBO is equally distributed over all data points. Once $\hat{C}_t < \kappa^2$ is fulfilled (Fig. 2(a), iter=350), $\beta$ begins to be updated and the parameters of the second stochastic layer start to be optimised, leading to informative hierarchical prior distributions (Fig. 2(a), bottom row). Between iteration 2000 and 5000, the increase in $\beta$ results in a regularisation of the latent representation, and hence in a higher reconstruction error. At iteration 5000, $\beta$ stops increasing because the reconstruction error reaches the constraint bound (see Eq. (12)). From iteration 5000 to 27,500, $\beta$ is updated, driven by an interplay between the reconstruction error and the KL divergence. After $\beta$ reaches 1, the regularisation impact of the KL divergence does not increase anymore, leading to an improvement of the latent representation (Fig. 2(a), iter=150,000).

To validate whether the obtained latent representation is informative, we apply a linear regression after transforming the latent space to polar coordinates. The goal is to predict the joint angle of the pendulum. We compare REWO with GECO, and additionally with warm-up (WU) (Sønderby et al., 2016), a linear annealing schedule of $\beta$. In the latter, we use a VAE objective, defined as an ELBO/IW bound combination similar to the objective in Sec. 2.2; the related plots are in App. B.2. Tab. 1 shows the absolute errors (OLS regression) for the different optimisation procedures; details on the regression can be found in App. B.3. REWO leads to the most precise prediction of the ground truth.

method                   absolute error
VHP + REWO               0.054
VHP + GECO               0.53
VHP + WU (20 epochs)     0.49
VHP + WU (200 epochs)    0.31
(WU rows are trained with the VAE objective.)
Table 1: OLS regression on the learned latent representations of the pendulum data.
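The evaluation behind Tab. 1 can be sketched as a polar transform followed by a one-dimensional least-squares fit. This is a simplified sketch, not the procedure from App. B.3: it assumes a two-dimensional latent space, uses only the polar angle as the regressor, and all function names are hypothetical.

```python
import math


def to_angle(z):
    """Polar angle of a 2-D latent point, mapped to [0, 2*pi)."""
    return math.atan2(z[1], z[0]) % (2.0 * math.pi)


def ols_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute deviation of the fitted line from the targets."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)
```

If the encoder places the pendulum images on a ring ordered by joint angle, the polar angle of each encoding is an affine function of the true angle and the error is close to zero. A representation with gaps or folds breaks that relation and shows up as a larger error.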

Furthermore, we demonstrate in App. B.4 that a less expressive posterior in the second stochastic layer leads to poor latent representations, since the model compensates for it by restricting $p_\Theta(z \mid \zeta)$, as discussed in Sec. 2.2.

Finally, we compare the latent representations learned by the VHP and the IWAE using our graph-based interpolation method. The graphs, shown in Fig. 3, are built (see Sec. 2.3) based on 1000 samples from the prior of the respective model. The red curves depict the interpolation along the resulting data manifold, between pendulum images with joint angles of 0 and 180 degrees, respectively. The reconstructions of the interpolations are shown in Fig. 4. The top row (VHP + REWO) shows a smooth change of the joint angles, whereas the middle (VHP + GECO) and bottom row (IWAE) contain discontinuities resulting in an unrealistic interpolation.

4.2 Human Motion Capture Database

This section presents the evaluation on the CMU Graphics Lab Motion Capture Database (http://mocap.cs.cmu.edu/), which consists of several human motion recordings. Our experiments are based on data of five different motions. Since different motions have similar body positions in certain frames, the corresponding manifolds are connected, making it a suitable dataset for interpolation experiments. We preprocess the data as in (Chen et al., 2015), such that each frame is represented by a 50-dimensional feature vector.

We compare our method with the VampPrior and the IWAE. The prior and aggregated approximate posterior of the three methods are shown in Fig. 5. In case of the VHP and the VampPrior, the latent representations of different movements are separated and the learned prior matches the aggregated posterior. By contrast, the IWAE is restricted by the Gaussian prior and cannot represent the different motions separately in the latent space. Next, we generate four interpolations (Fig. 6) using our graph-based approach: between two frames within one motion (black line) and between frames of different motions (orange, red, and maroon); the reconstructions are shown in Fig. 7 and App. C. The VampPrior and the VHP lead to smooth interpolations, whereas the IWAE interpolations show abrupt changes in the movements.

Fig. 8 depicts the movement smoothness factor, which we define as the RMS of the second order finite difference along the interpolated path. Thus, smaller values correspond to smoother movements. For each of the three methods, it is averaged across 10 graphs, each with 100 interpolations. The starting and ending points are randomly selected. As a result, the latent representation learned by the VHP leads to smoother movement interpolations than in case of the VampPrior and IWAE.
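The smoothness factor can be sketched for a single joint trajectory as follows. This is a minimal sketch assuming a one-dimensional sequence of equally spaced samples along the interpolated path; the function name is hypothetical, and per-joint aggregation over graphs and interpolations is left out.

```python
import math


def smoothness_factor(path):
    """RMS of the second-order finite difference along a 1-D trajectory.

    Smaller values indicate a smoother movement; a trajectory that changes
    linearly has a smoothness factor of zero.
    """
    if len(path) < 3:
        raise ValueError("need at least three points")
    second_diff = [
        path[i + 1] - 2.0 * path[i] + path[i - 1]
        for i in range(1, len(path) - 1)
    ]
    return math.sqrt(sum(d * d for d in second_diff) / len(second_diff))
```

A jump between two otherwise smooth segments produces a few large second differences, so discontinuous interpolations are penalised even if most of the path is smooth.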

4.3 Evaluation on MNIST, Fashion-MNIST, and OMNIGLOT

We compare our method quantitatively to the VampPrior and the IWAE on MNIST (Lecun et al., 1998; Larochelle & Murray, 2011), Fashion-MNIST (Xiao et al., 2017), and OMNIGLOT (Lake et al., 2015). For this purpose, we report the marginal log-likelihood (LL) on the respective test set. Following the test protocol of previous work (Tomczak & Welling, 2018), we evaluate the LL using importance sampling with 5,000 samples (Burda et al., 2015). The results are reported in Tab. 2.

VHP + REWO performs comparably to or better than the state of the art on these datasets. The same $\kappa$ was used for training VHP with REWO and GECO. The hierarchical IWAE with two stochastic layers does not perform better than VHP + REWO, supporting our claim that a flexible prior in the first stochastic layer and a flexible approximate posterior in the second stochastic layer are sufficient. Additionally, we show that REWO leads to a similar number of active units as WU (see App. D.2).

              dynamic MNIST   static MNIST   Fashion-MNIST   OMNIGLOT
VHP + REWO    78.88           82.74          225.37          101.78
VHP + GECO    95.01           96.32          234.73          108.97
VampPrior     80.42           84.02          232.78          101.97
IWAE (L=1)    81.36           84.46          226.83          101.57
IWAE (L=2)    80.66           82.83          225.39          101.83
Table 2: Negative test log-likelihood estimated with 5,000 importance samples.

4.4 Qualitative Results: 3D Chairs and 3D Faces

We generated 3D Faces (Paysan et al., 2009) based on images of 2000 faces with 37 views each. 3D Chairs (Aubry et al., 2014) consists of 1393 chair images with 62 views each. The images have a size of pixels.

Here, our approach is compared with the IWAE using a 32-dimensional latent space. The learned encodings are evaluated qualitatively by using the graph-based interpolation method. Fig. 9(a) and 9(c) show interpolations along the latent manifold learned by the VHP + REWO. Compared to the IWAE (Fig. 9(b) and 9(d)), they are less blurry and smoother. Further results can be found in App. E.

(a) VHP + REWO
(b) IWAE
(c) VHP + REWO
(d) IWAE
Figure 9: Faces & Chairs: graph-based interpolations—between data points from the test set—along the learned 32-dimensional latent manifold. The graph is based on prior samples. (see Sec. 4.4)

5 Conclusion

In this paper, we have proposed a hierarchical prior in the context of variational autoencoders and extended the constrained-optimisation framework in Taming VAEs to hierarchical models by using an importance-weighted bound on the marginal of the hierarchical prior. Concurrently, we have introduced the associated optimisation algorithm to facilitate good encodings.

We have shown that the learned hierarchical prior is indeed non-trivial. Moreover, it is well-adapted to the latent representation, reflecting the topology of the encoded data manifold. Our method provides informative latent representations and performs particularly well on data where the relevant features change continuously. In case of the pendulum (Sec. 4.1), the prior has learned to represent the degrees of freedom in the data, allowing us to predict the pendulum’s angle by a simple OLS regression. The experiments on the human motion data (Sec. 4.2) and on the high-dimensional Faces and Chairs datasets (Sec. 4.4) have demonstrated that the learned hierarchical prior leads to smoother and more realistic interpolations than a standard normal prior (or the VampPrior). Moreover, we have obtained test log-likelihoods (Sec. 4.3) comparable to the state of the art on standard datasets.


We would like to thank Maximilian Soelch for valuable suggestions and discussions.


  • Alemi et al. (2018) Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. ICML, 2018.
  • Aubry et al. (2014) Aubry, M., Maturana, D., Efros, A. A., Russell, B. C., and Sivic, J. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. CVPR, 2014.
  • Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. NeurIPS, 2007.
  • Bowman et al. (2016) Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. CoNLL, 2016.
  • Burda et al. (2015) Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. CoRR, 2015.
  • Cayton (2005) Cayton, L. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12, 2005.
  • Chen et al. (2015) Chen, N., Bayer, J., Urban, S., and Van Der Smagt, P. Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. HUMANOIDS, 2015.
  • Chen et al. (2018) Chen, N., Ferroni, F., Klushyn, A., Paraschos, A., Bayer, J., and van der Smagt, P. Fast approximate geodesics for deep generative models. arXiv:1812.08284, 2018.
  • Chen et al. (2016a) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS, 2016a.
  • Chen et al. (2016b) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. CoRR, 2016b.
  • Cremer et al. (2017) Cremer, C., Morris, Q., and Duvenaud, D. Reinterpreting importance-weighted autoencoders. arXiv:1704.02916, 2017.
  • Dilokthanakul et al. (2016) Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture variational autoencoders. CoRR, 2016.
  • Goyal et al. (2017) Goyal, P., Hu, Z., Liang, X., Wang, C., and Xing, E. P. Nonparametric variational auto-encoders for hierarchical representation learning. ICCV, 2017.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. Beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, 2013.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350, 2015.
  • Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. AISTATS, 2011.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Nalisnick & Smyth (2017) Nalisnick, E. T. and Smyth, P. Stick-breaking variational autoencoders. ICLR, 2017.
  • Neal & Hinton (1998) Neal, R. M. and Hinton, G. E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models. 1998.
  • Paysan et al. (2009) Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. A 3D face model for pose and illumination invariant face recognition. AVSS, 2009.
  • Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. ICML, 2015.
  • Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming VAEs. arXiv:1810.00597, 2018.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
  • Rifai et al. (2011) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. NeurIPS, 2011.
  • Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. NeurIPS, 2016.
  • Tomczak & Welling (2018) Tomczak, J. and Welling, M. VAE with a VampPrior. AISTATS, 2018.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, 2017.
  • Yeung et al. (2017) Yeung, S., Kannan, A., Dauphin, Y., and Fei-Fei, L. Tackling over-pruning in variational autoencoders. CoRR, 2017.
  • Zhao et al. (2017) Zhao, S., Song, J., and Ermon, S. InfoVAE: Information maximizing variational autoencoders. CoRR, 2017.


Appendix A Comparison: β-Update Scheme in REWO and in GECO

(a) REWO
(b) GECO
Figure 10: β-update scheme: as a function of and for and .

Appendix B Pendulum

B.1 Training Process with Alternative β-Update Scheme

An alternative way to define the β-update scheme such that is to use Eq. (5) with


As in Sec. 2, is defined as the update’s learning rate. This leads to the following β-update scheme:


where is a slope parameter. However, the β-update defined in Eq. (12) is easier to tune, leading to better results. Furthermore, Eq. (12) allows choosing any as starting value.

Figure 11: VHP + REWO (with different β-update scheme): (left) latent representation of the pendulum data at different iteration steps when optimising . The top row shows the approximate posterior, where the colour encodes the rotation angle of the pendulum. The bottom row shows samples from the hierarchical prior. (right) as a function of the iteration steps; the red lines mark the visualised iteration steps.

B.2 Training Process with/without WU

Figure 12: VHP (no REWO/GECO/WU): latent representation of the pendulum data at different iteration steps when optimising . The top row shows the approximate posterior, where the colour encodes the rotation angle of the pendulum. The bottom row shows samples from the hierarchical prior. It took 27,500 iterations until the model learned a representation of the data. However, the latent representation is less informative than in Fig. 2(a).
Figure 13: VHP + WU (20 epochs): latent representation of the pendulum data at different iteration steps when optimising . The top row shows the approximate posterior, where the colour encodes the rotation angle of the pendulum. The bottom row shows samples from the hierarchical prior. The model started to learn a representation (iter=2000) but the fast increase led to an over-regularisation by the KL term, resulting in a less informative representation than in Fig. 2(a).
Figure 14: VHP + WU (200 epochs): latent representation of the pendulum data at different iteration steps when optimising . The top row shows the approximate posterior, where the colour encodes the rotation angle of the pendulum. The bottom row shows samples from the hierarchical prior. The learned latent representation is less informative than in Fig. 2(a).
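
For reference, warm-up (WU) is commonly implemented as a linear ramp of the KL weight from 0 to 1 over a fixed number of epochs. The sketch below assumes this linear form; the names `warmup_weight` and `weighted_objective` and the per-epoch granularity are illustrative assumptions, not details taken from the experiments above.

```python
def warmup_weight(epoch: int, warmup_epochs: int) -> float:
    """Linear KL-weight schedule: 0 at epoch 0, 1 from `warmup_epochs` onwards.

    Assumption: the weight is updated once per epoch and grows linearly.
    """
    if warmup_epochs <= 0:
        return 1.0
    return min(1.0, max(0.0, epoch / warmup_epochs))


def weighted_objective(reconstruction_loss: float, kl: float, weight: float) -> float:
    """Loss with the KL term scaled by the current warm-up weight."""
    return reconstruction_loss + weight * kl
```

Under this schedule, a 20-epoch warm-up reaches full KL weight ten times sooner than a 200-epoch warm-up, which is the kind of difference the two figures above compare.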

B.3 OLS Regression on Learned Latent Representations

Fig. 15 shows the joint angle versus , where is the second component of the latent space and the radius is estimated from the learned latent representation.

Figure 15: Verifying the learned latent representations of the VHP trained with REWO, GECO, or WU: OLS regressions on encodings of the pendulum data. The absolute errors are shown in Tab. 1.
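
The OLS check can be sketched as a one-dimensional least-squares fit of the ground-truth angle on a scalar feature derived from each encoding. The code below is a minimal illustration in plain Python; the helper names and the synthetic data are assumptions, and how the feature is computed from the latent components is left to the caller.

```python
from statistics import fmean


def ols_fit(x, y):
    """Return (slope, intercept) minimising sum((y - slope * x - intercept)**2)."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def mean_absolute_error(x, y, slope, intercept):
    """Mean absolute residual of the fitted line."""
    return fmean([abs(yi - (slope * xi + intercept)) for xi, yi in zip(x, y)])


# Synthetic stand-in for (latent-derived feature, true angle) pairs.
features = [-1.0, -0.5, 0.0, 0.5, 1.0]
angles = [2.0 * f + 0.3 for f in features]
slope, intercept = ols_fit(features, angles)
error = mean_absolute_error(features, angles, slope, intercept)
```

A small absolute error indicates that the angle is close to a linear function of the chosen feature, which is the sense in which the latent representation is informative.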

B.4 VHP with ELBO instead of IW Bound

Figure 16: VHP + REWO (with ELBO instead of IW bound in the second stochastic layer): latent representation of the pendulum data at different iteration steps when optimising . The top row shows the approximate posterior, where the colour encodes the rotation angle of the pendulum. The bottom row shows samples from the hierarchical prior.
Figure 17: REWO: reconstruction-error-dependent increase of . The red lines mark the respective iteration steps shown in Fig. 16. The model compensates for the less expressive posterior in the second stochastic layer by restricting , which leads to poor latent representations.

B.5 Latent Representations Learned by VHP and IWAE

Figure 18: Latent representation of VHP + REWO (left), VHP + GECO (middle), and IWAE (right): approximate posterior (top) and prior (bottom). The colour encodes the rotation angle of the pendulum.

Appendix C CMU Human Motion

(a) VHP + REWO
(b) VampPrior
(c) IWAE
Figure 19: Movement interpolation. The different colours correspond to Fig. 6. Discontinuities are marked by blue boxes.

Appendix D Quantitative Results

D.1 Training Efficiency

Figure 20: NLL vs rate vs distortion on static MNIST

D.2 Active Units

Furthermore, we evaluate whether REWO prevents over-pruning of the latent variables (Yeung et al., 2017). Following Sønderby et al. (2016), we evaluate for different optimisation strategies, where . We show the results for the inner latent variable on several datasets in Fig. 21.

(a) static MNIST
(b) dynamic MNIST
(c) Fashion-MNIST
Figure 21: Expected KL divergence between approximate posterior and prior for the REWO algorithm (left) and WU (right). The latent dimensions are sorted by the KL divergence and the histograms are shown on a logarithmic scale.
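
As an illustration of this diagnostic, the per-dimension KL divergence has a closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal. Both are assumptions of this sketch, as are the function names; the batch is represented as a list of `(mu, log_var)` pairs, one per data point.

```python
import math


def kl_per_dimension(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ) for each latent dimension of one sample."""
    return [0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, log_var)]


def expected_kl_sorted(batch):
    """Average the per-dimension KL over a batch and sort in descending order."""
    n = len(batch)
    dims = len(batch[0][0])
    totals = [0.0] * dims
    for mu, log_var in batch:
        for d, kl in enumerate(kl_per_dimension(mu, log_var)):
            totals[d] += kl
    return sorted((t / n for t in totals), reverse=True)
```

Dimensions whose averaged KL stays close to zero carry almost no information about the input and can be regarded as pruned.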

Appendix E Faces and Chairs

(a) VHP + REWO
(b) IWAE
Figure 22: Faces: interpolations along the learned latent manifold with a latent space of 32 dimensions.
(a) VHP + REWO
(b) IWAE
Figure 23: Chairs: interpolations along the learned latent manifold with a latent space of 32 dimensions.
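
The interpolations rely on a graph built over latent samples (the node and neighbour counts are listed in Table 3). As a rough sketch of the idea, the code below builds a k-nearest-neighbour graph and finds a shortest path with Dijkstra's algorithm. Euclidean edge weights in latent space and all function names are assumptions made for illustration; the edge weighting used in the experiments may differ.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as {node: {neighbour: distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns node indices from start to goal, or [] if unreachable."""
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if goal not in best:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]
```

Decoding the latent points along the returned path, in order, would then yield an interpolation that stays close to regions covered by the samples.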

Appendix F Model Architectures

Dataset: Pendulum
Optimiser: Adam, 1e-4
Architecture:
  Input: 256 (flattened 16×16)
  Latents: 2
  FC 256, 256, 256, 256. ReLU activation.
  FC 256, 256, 256, 256. ReLU activation. Gaussian.
  FC 256, 256, 256, 256. ReLU activation.
  FC 256, 256, 256, 256. ReLU activation.
  Others: = 0.02, = 5, = 16.
  Graph: 1,000 nodes, 18 neighbours.

Dataset: CMU Human
Optimiser: Adam, 1e-4
Architecture:
  Input: 50
  Latents: 2
  FC 256, 256, 256, 256. ReLU activation.
  FC 256, 256, 256, 256. ReLU activation. Gaussian.
  FC 256, 256, 256, 256. ReLU activation.
  FC 256, 256, 256, 256. ReLU activation.
  Others: = 0.02, = 1, = 32.
  Graph: 2,530 nodes, 15 neighbours.

Dataset: Faces, Chairs
Optimiser: Adam, 5e-4
Architecture:
  Input: 64×64×1
  Latents: 32
  Conv 32×5×5 (stride 2), 32×3×3 (stride 1), 48×5×5 (stride 2), 48×3×3 (stride 1), 64×5×5 (stride 2), 64×3×3 (stride 1), 96×5×5 (stride 2), 96×3×3 (stride 1), FC 256. ReLU activation.
  Deconv reverse of encoder. ReLU activation. Bernoulli.
  FC 256, 256. ReLU activation.
  FC 256, 256. ReLU activation.
  Others: = 0.2, = 1, = 16.
  Graph: 10,000 nodes (faces), 8,637 nodes (chairs), 18 neighbours.

Dataset: dynamicMNIST, staticMNIST, Fashion-MNIST, OMNIGLOT
Optimiser: Adam, 5e-4
Architecture:
  Input: 28×28×1
  Latents: 32
  GatedConv 32×7×7 (stride 1), 32×3×3 (stride 2), 64×5×5 (stride 1), 64×3×3 (stride 2), 6×3×3 (stride 1).
  GatedFC 784, GatedConv 64×3×3 (stride 1), 64×3×3 (stride 1), 64×3×3 (stride 1), 64×3×3 (stride 1). Linear activation. Bernoulli.
  GatedFC 256, 256. Linear activation.
  GatedFC 256, 256. Linear activation.
  Others: = 0.18 (dynamicMNIST, staticMNIST, OMNIGLOT), = 0.31 (Fashion-MNIST), = 1, = 16.

Table 3: Model architectures. GatedFC/GatedConv denote pairs of fully-connected/convolutional layers multiplied element-wise, where one of the layers (gate) always uses sigmoid activations.
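
The gating described in the caption of Table 3 can be sketched as follows, using plain Python lists in place of tensors. The weights, the helper names, and the use of an identity (linear) activation on the non-gate branch are illustrative assumptions.

```python
import math


def linear(x, weights, bias):
    """Affine map; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def gated_fc(x, w_h, b_h, w_g, b_g):
    """Element-wise product of a linear branch and a sigmoid-activated gate branch."""
    h = linear(x, w_h, b_h)
    g = [sigmoid(v) for v in linear(x, w_g, b_g)]
    return [hi * gi for hi, gi in zip(h, g)]
```

With a zero-initialised gate branch, every gate equals 0.5, so the layer initially passes half of the linear branch's output; a gated convolution follows the same pattern with convolutions in place of the affine maps.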