1 Introduction
Variational autoencoders (VAEs; DBLP:journals/corr/KingmaW13) are a class of unsupervised representation learning models with a principled probabilistic interpretation that extends the classical autoencoders first described by hinton2006reducing. β-VAE (higgins2016beta) is a followup technique that proposes a special weighting β of the KL divergence term in the VAE loss to obtain disentangled representations. However, unsupervised learning is notoriously brittle even on toy datasets, and a meaningful, mathematically precise definition of disentanglement remains elusive.
It is thus not obvious to what extent β-VAEs can robustly obtain disentangled representations in different settings. The main contributions of our reproducibility report can be summarised as follows:

We add to the evidence provided by followup work (pmlrv80kim18b; locatello2020sober) that the almost perfect performance presented by higgins2016beta is very difficult to reproduce.

We demonstrate that β > 1 does not continue to yield the best quantitative disentanglement results for very complex datasets.

We show that high disentanglement metric scores do not imply a qualitative disentanglement.

We quantitatively assess how lower β values give better reconstructions of the original images.
1.1 VAE framework
VAEs are a special class of deep generative models optimised via variational inference, which allows one to approximate intractable distributions in Bayesian inference by solving an optimisation problem. Assume we have a directed latent variable model

(1)  $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$

and we observed a dataset $\mathcal{D} = \{\mathbf{x}^{(i)}\}_{i=1}^N$. Many standard techniques such as Expectation-Maximization do not scale to the large-scale deep learning setting because they require computing the posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$, for which the normalization constant $p_\theta(\mathbf{x})$ is not available. We can avoid the need for precise normalization constants by introducing a variational approximation $q_\phi(\mathbf{z} \mid \mathbf{x})$ and instead optimising the evidence lower bound (ELBO) as in Equation (2), where $p_\theta(\mathbf{x})$ is the marginal likelihood or model evidence:

(2)  $\mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$

The bound follows from Jensen's inequality:

(3)  $\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z})\, d\mathbf{z}$

(4)  $\phantom{\log p_\theta(\mathbf{x})} = \log \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right]$

(5)  $\phantom{\log p_\theta(\mathbf{x})} \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\phi(\mathbf{z} \mid \mathbf{x})\right] = \mathrm{ELBO}(\theta, \phi)$
However, obtaining gradients of the ELBO with respect to the variational parameters $\phi$ is difficult, because the expectation in Equation (2) is taken with respect to $q_\phi$, which itself depends on $\phi$, so we cannot simply exchange the derivative and the expectation. Instead, VAEs crucially rely on the reparameterization trick for computing Monte Carlo estimates of the gradient, typically using a single minibatch. The reparameterization trick is crucial, since it produces an estimator with much lower variance than more general-purpose Monte Carlo estimators, such as the score function estimator. However, reparameterization requires working with continuous distributions and makes VAEs difficult to apply in the discrete setting, albeit not impossible (DBLP:conf/iclr/MaddisonMT17; NIPS2017_7a98af17). Standard VAEs typically use an isotropic Gaussian prior together with a Gaussian approximate posterior, which enables computing the KL divergence term analytically.
1.2 β-VAE improvements
higgins2016beta propose to increase the weighting β of the KL divergence in Equation (2). This should in turn enforce a greater similarity between the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and the prior $p(\mathbf{z})$, which leads to greater disentanglement. bengio2013representation define disentangled representations by the property that each latent variable is sensitive to one of the ground truth generative factors, but invariant to the others. Because the VAE prior is typically chosen to be a Gaussian with diagonal covariance matrix, its dimensions are independent, which can be seen as disentangled.
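The pieces above fit together in a short numpy sketch (not the authors' implementation; the encoder and decoder outputs are assumed to come from elsewhere): a reparameterized sample, the analytic KL to the isotropic Gaussian prior, and the β-weighted objective with a Bernoulli reconstruction term.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    # through mu and log_var in a framework with autodiff
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_var):
    # Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    # Bernoulli likelihood -> binary cross-entropy reconstruction term;
    # beta = 1 recovers the standard VAE objective (beta value is illustrative)
    eps = 1e-7
    bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps), axis=-1)
    return np.mean(bce + beta * kl_to_standard_normal(mu, log_var))
```

Setting beta above 1 penalises posteriors that drift from the prior more heavily, trading reconstruction quality for disentanglement as discussed throughout this report.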
While the β-weighted KL loss can be seen merely as a heuristic addition to the normal autoencoder, followup work exploiting the information-theoretic nature of the KL divergence has led to further improvements in the algorithm (NEURIPS2018_1ee3dfcd; DBLP:journals/corr/abs180403599). Regardless, large-scale reproduction studies show the importance of random seeds and hyperparameter settings to be at least comparable to model choice (locatello2020sober).
2 Methodology
2.1 Datasets
higgins2016beta use a number of standard image datasets to evaluate the image generation properties of VAEs, as well as specially designed datasets to evaluate the disentanglement of ground truth generative factors.
2.1.1 Disentanglement evaluation  2Dshapes, 3Dshapes, MPI3DToy
To assess the disentanglement quantitatively, we use three synthetic datasets that come with ground truth generative factors, with samples of each shown in Figure 1. 2Dshapes was originally created by higgins2016beta, consisting of 737,280 2D shapes that are generated from the Cartesian product of five ground truth independent latent factors.
We hypothesise that the 2Dshapes dataset is too easy to solve, as evidenced by the fairly high scores of PCA and ICA, and further check the robustness of VAEs on harder, more recent datasets; specifically, RGB datasets 3Dshapes (3Dshapes18) and MPI3DToy (gondal2019transfer). 3Dshapes consists of 480,000 images generated from six data generative factors. MPI3DToy is the most complex dataset containing 1,036,800 images from seven data generative factors captured from realworld robotics experiments. Due to computational constraints, we only use the toy version of this dataset which contains renders of the real scenes, rather than the scenes themselves.
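Because these datasets are Cartesian products of their generative factors, an image index maps to a unique factor combination via mixed-radix arithmetic. A sketch for 2Dshapes, with factor sizes taken from the public dSprites release (3 shapes, 6 scales, 40 orientations, 32 × 32 positions; the row-major ordering is an assumption):

```python
import numpy as np

# Factor sizes for 2Dshapes (dSprites): shape, scale, orientation, pos X, pos Y
FACTOR_SIZES = np.array([3, 6, 40, 32, 32])

def factors_to_index(factors):
    """Map a factor combination to its flat image index (mixed radix, row-major)."""
    bases = np.concatenate([np.cumprod(FACTOR_SIZES[::-1])[::-1][1:], [1]])
    return int(np.dot(factors, bases))

def index_to_factors(index):
    """Inverse mapping: recover the factor combination from a flat index."""
    factors = []
    for size in FACTOR_SIZES[::-1]:
        index, f = divmod(index, size)
        factors.append(f)
    return np.array(factors[::-1])

print(int(np.prod(FACTOR_SIZES)))  # -> 737280, the 2Dshapes dataset size
```

The same construction gives the 480,000 images of 3Dshapes (six factors) and the 1,036,800 images of MPI3DToy (seven factors).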
2.1.2 Image generation  CelebA, Chairs, CIFAR10, and CIFAR100
CelebA and Chairs are the two datasets utilised by higgins2016beta for qualitative evaluation. Due to hardware constraints, we instead qualitatively inspect models trained on CIFAR10 and CIFAR100, as they are standard benchmark datasets for evaluating generative models (shmelkov2018good). These datasets are nonsynthetic, and samples are shown in Figure 2.
2.2 Metrics
To evaluate the disentanglement of a representation, we adopt the approach presented in higgins2016beta. This can be used to evaluate disentanglement directly without having to rely on qualitative inspection.
higgins2016beta suggest that there exists a tradeoff between generated image quality and level of disentanglement. This is at odds with the notion that disentangled representations should lead to superior performance in downstream tasks. To quantitatively investigate the extent to which higher β harms generative model quality while enhancing disentanglement, we adopt the Fréchet Inception Distance, a state-of-the-art metric for evaluating the reconstruction quality of generative models (heusel2017gans); higgins2016beta only evaluate image quality qualitatively, by inspection.
2.2.1 Fréchet Inception Distance
Fréchet Inception Distance (FID) is a metric used to assess the reconstruction quality of images produced by generative models, usually a generative adversarial network (GAN). Rather than comparing generated and real images on a pixelbypixel basis, the FID compares the distribution of the activations of the final layer of a pretrained InceptionV3 model
(szegedy2016rethinking). This layer corresponds to high-level features of objects (such as airplanes) and thus captures the human notion of similarity in images (heusel2017gans). FID is based on the Wasserstein metric and can be computed as

(6)  $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the parameters of Gaussians fit to the 2048-dimensional activations of the last InceptionV3 pooling layer for real and generated samples, respectively (FrechetI11). As we use a pretrained model, images need to be scaled to the correct size and have colour channels added if they were black and white.
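Once the activation statistics are collected, Equation (6) is cheap to evaluate. A sketch (the Inception feature extraction itself is omitted; the matrix square root comes from scipy):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """FID between Gaussians fit to real (r) and generated (g) activations,
    as in Equation (6)."""
    diff = mu_r - mu_g
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        # discard the tiny imaginary parts that arise from numerical noise
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Identical distributions give a distance of zero:
mu, sigma = np.zeros(4), np.eye(4)
print(round(frechet_distance(mu, sigma, mu, sigma), 6))  # -> 0.0
```

In practice one fits the two Gaussians to activations of many thousands of real and generated samples before calling this function.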
We use FID to examine the tradeoff between disentanglement and reconstruction quality, since the best-performing models for image generation (i.e. GANs) use FID for evaluation (dai2019diagnosing; lucic2017gans). A lower FID score corresponds to real and generated samples being more similar. WGANGP (NIPS2017_892c3b1c), a GAN of similar age as the VAE, is noted to achieve an FID of 29.3 on CIFAR10 (heusel2017gans). Early iterations of VAEs are generally understood to have poor image generation capabilities compared to contemporary GANs (shmelkov2018good; makhzani2015adversarial). However, recent VAE-based models (parmar2020dual) achieve a CIFAR10 FID of 17.9 and are competitive with GAN-based approaches.
Limitations when using FID as a metric are indicated by shmelkov2018good. FID cannot separate image quality from image diversity. For example, a poor FID score can be due to the reconstructed images either being unrealistic (low image quality) or too similar to each other (low diversity), with no way to analyze the cause.
2.2.2 Disentanglement Metric
The disentanglement metric refers to the framework proposed by higgins2016beta that is meant to quantify the level of disentanglement in deep generative models by measuring the independence and interpretability of their latent representation. The idea is that for a disentangled representation, images generated by fixing one factor of variation and randomly sampling all others should exhibit relatively lower variance in the latents corresponding to the fixed factor. The lower this variance is, the easier it will be to predict the corresponding data generative factor. Therefore, we can measure disentanglement by reporting the accuracy of a classifier identifying the fixed data generative factor given the latent representation.
An assumption on the dataset is that its elements are generated from a true world simulator

(7)  $\mathbf{x} = \mathrm{Sim}(\mathbf{v}, \mathbf{w})$

where $\mathbf{v}$ are conditionally independent factors and $\mathbf{w}$ are conditionally dependent factors. In particular, these ground truth factors need to be known for computing the metric. In all of the datasets that we consider in this report for computing the metric, the data is generated by independent factors alone. The full procedure is outlined in Algorithm 1.
Note that rather than using the expectation values of the latent Gaussian for one simulated image, a pair of images is sampled, and the absolute difference of their latent representations is computed to reduce the variance of the classifier's inputs and to lower the conditional dependence on the individual input images. The classifier is taken to be linear to ensure the interpretability of the inferred latents and so that it does not learn any nonlinear disentanglement itself.
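One training point for the metric can thus be sketched as follows, where `encode` (returning the latent means) and `sample_pair` (returning two images that agree on the fixed factor) are hypothetical hooks into the model and the data simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

def metric_training_point(encode, sample_pair, n_factors, n_pairs=64):
    """Fix a random generative factor k, average |mu(x1) - mu(x2)| over
    image pairs that share factor k; the label for the linear classifier is k."""
    k = int(rng.integers(n_factors))
    diffs = [np.abs(encode(x1) - encode(x2))
             for x1, x2 in (sample_pair(fixed_factor=k) for _ in range(n_pairs))]
    return np.mean(diffs, axis=0), k
```

Repeating this many times yields the (feature, label) pairs on which the linear classifier is trained; its held-out accuracy is the reported disentanglement score.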
One drawback of this metric is its dependence on hyperparameters, such as the choice of the classifier, the optimiser, and, most significantly, the sample size used to generate its training data. This is already noted by pmlrv80kim18b. We investigate this further by reporting the accuracy scores of a linear classifier and a nonlinear MLP implemented in PyTorch, as well as a logistic regression classifier and a random forest classifier from the scikit-learn library. Furthermore, we compare the scores for different values of the sample size used in generating the training data for the metric. In addition, pmlrv80kim18b show that there is a failure mode in which the classifier reports 100% accuracy while only a subset of the factors is disentangled. Issues like this could account for the discrepancy between quantitative scores and qualitative observations as described in Section 3.1.3.
2.3 Models
higgins2016beta originally use a VAE where both the encoder and decoder are MLPs for the 2Dshapes dataset and a Convolutional VAE for their remaining experiments. Followup work (DBLP:journals/corr/abs180403599; locatello2020sober; pmlrv80kim18b) instead uses a Convolutional VAE across all experiments, including 2Dshapes.
We report results for both the MLP and Convolutional VAEs on 2Dshapes and only use the Convolutional VAE for the other experiments, as we found it to give better results. As optimisers, we use Adagrad (lr = 1e-2) for the MLP and Adam (lr = 5e-4) for the Convolutional VAE, as done in the original papers. We also use PCA and ICA as non-deep baselines, following higgins2016beta. All models are trained with multiple seeds. Full details on the models, hyperparameters, and training protocols are available in Appendix A.
3 Results
In this section, we present the results using the datasets described in Section 2.1 and the metrics from Section 2.2 to evaluate VAEs in terms of their ability to learn a disentangled representation and to reconstruct images.
3.1 Disentanglement
First, we present the scores of the disentanglement metric on the dataset used by higgins2016beta and compare them to the baseline methods PCA and ICA. Next, we investigate the behaviour of β-VAE on the more complex three-dimensional datasets 3Dshapes and MPI3DToy. Finally, we draw a connection between quantitative and qualitative evaluation of disentanglement to see if these two notions coincide.
3.1.1 Disentanglement on 2Dshapes
Table 1 presents the disentanglement scores achieved on the 2Dshapes dataset. It is important to note that among the total of five ground truth factors, higgins2016beta disregard 'shape' when sampling data generative factors for the disentanglement metric. This is presumably because β-VAE struggles to learn a disentangled representation for this data generative factor, as evidenced by the traversals in Figure 7 from higgins2016beta (reprinted in Figure 4) all having a very similar shape.
| Model | Mean (5 factors) | Median (5 factors) | Mean (4 factors) | Median (4 factors) | higgins2016beta |
| VAE | 65.13 ± 4.98% | 64.79% | 81.89 ± 2.33% | 81.82% | (–) |
| VAE | 64.31 ± 6.56% | 63.79% | 82.39 ± 5.79% | 81.37% | 61.58 ± 0.5% |
| VAE | 74.53 ± 15.06% | 76.62% | 89.09 ± 15.82% | 95.15% | 99.23 ± 0.1% |
| VAE | 76.77 ± 11.44% | 79.67% | 86.86 ± 15.5% | 88.81% | (–) |
| PCA | 69.79 ± 4.22% | 71.53% | 88.95 ± 3.89% | 89.91% | 84.9 ± 0.4% |
| ICA | 68.99 ± 1.68% | 69.38% | 83.94 ± 2.16% | 84.12% | 42.03 ± 10.6% |
In Table 1 we compare the metric scores using all five data generative factors to the scores obtained by disregarding the 'shape' data generative factor. Furthermore, we reprint the results presented in higgins2016beta for comparison. We see that the scores using all five data generative factors are lower across all models, confirming the conjecture mentioned above. While our results show the superiority of β-VAE over the regular VAE, we do not obtain a difference as large as that in higgins2016beta. A potential reason for this may be that higgins2016beta discard the worst-performing 50% of their training runs due to training instabilities. We also observe this instability, especially for higher β values, as suggested by the high standard deviations. In the rest of this report, we report the median by default as a robust measure of performance even in case some of the training runs diverge. Furthermore, we note that ICA has a much stronger performance in our results. We found that fine-tuning the ICA parameters was crucial for improving its scores.
3.1.2 Beyond 2D datasets
Next, we investigate the disentanglement scores on the more complex datasets 3Dshapes and MPI3DToy. Because images in these datasets have three colour channels, a convolutional architecture for the encoder and decoder is more suitable. Therefore, we trained β-VAE using the architecture proposed in burgess2018understanding, which is also used by followup work on disentangling in VAEs, on all three of the datasets with ground truth variation factors. For 2Dshapes in particular, it leads to significantly improved disentanglement scores, which often reach the almost 100% accuracy reported by higgins2016beta when using only four generative factors on 2Dshapes, and around 90% for all five. These Convolutional VAE experiments are illustrated in Figure 3, including the five-factor 2Dshapes results.
β-VAE manages to reach very high scores on 3Dshapes as well. However, PCA and ICA experience a significant drop in the disentanglement metric. This may be due to the fact that we flattened the 3-channel images into vectors to make them a suitable input for these methods. On the most complex dataset, MPI3DToy, we see that the best scores are actually reached by the lowest β values.
Compared to MPI3DToy, the 3Dshapes dataset has much higher contrast and its shapes are more regular, as illustrated in Figure 1. Furthermore, it is generated from six factors of variation compared to the seven of MPI3DToy. We hypothesise that the lowest β performs best because MPI3DToy is the hardest dataset and therefore has the highest relative magnitude of reconstruction loss to KL loss. This implicitly creates a different weighting of the two loss terms, suggesting that β needs to be scaled according to the difficulty of the dataset.
3.1.3 Quantitative vs qualitative evaluation of disentanglement
The Higgins disentanglement metric tends to assign very high scores that do not correlate well with human judgement of the level of disentanglement. In Figure 4, we first encode the latents of a true data sample from 2Dshapes and then vary the latent dimensions individually. The true generative factors are position X, position Y, scale, rotation and shape. All models achieve almost perfect disentanglement quantitatively, but upon visual inspection, generally only two out of five latents are learnt well and even those might be entangled.
3.1.4 Latent space visualization
Figure 5 shows posterior latents on the 2Dshapes dataset embedded by the first two components of PCA. In order to generate the data, we first seeded the VAE with a picture that had the median values for each of the five ground truth generative factors. Then we conducted a traversal across each dimension in the ground truth latent space (rather than the model posterior latents as in Figure 4).
When the β parameter is set low, the embeddings of the ground truth traversal are sparse and appear to have a much more regular structure than for higher β values. This is likely because the KL divergence with a Normal prior penalizes embeddings that are far away from the mean, since Normal distributions have thin tails. As a result, the model posterior latents become much more concentrated near the mean and do not extend very far into the latent space.
spinner_towards_2018 explain how a low β makes the embeddings more similar to those of a standard autoencoder. This suggests that the posterior latent distribution of standard autoencoders (or VAEs with low β, for that matter) is strongly concentrated on the training samples. Sampling new images is then difficult: decoding a randomly sampled latent code in the autoencoder case is unlikely to generate meaningful images, because we are almost certainly not going to hit the part of the latent space that the model learnt to decode during training. However, the KL divergence regularization for larger β causes the model to utilise the whole latent space rather than a small part of it. This in turn makes the model work better for generating new data. In Section 4.2, we formalise this intuition of controlling the posterior latent variance when assuming a Gaussian likelihood over the decoded pixels.
3.1.5 Sensitivity analysis of the disentanglement metric
While the metric introduced by higgins2016beta has a number of parameters, no guidance is provided on what values to choose. Followup work pmlrv80kim18b notices a particular importance of the parameter representing the sample size for generating the training data. In Figure 6, we train the linear classifier over a range of sample sizes and find that higher values improve the average disentanglement score significantly, for PCA and ICA in particular. Moreover, this causes the VAEs to have increasingly similar performance, which makes it harder to use the metric for model selection.
As seen in Table 2, we also note that the choice of classifier does not have a significant effect on the disentanglement metric score. This contradicts the assertion of higgins2016beta, who claim that a nonlinear classifier may itself learn to disentangle the representation.
| Dataset | Linear (PyTorch) | MLP | Random Forest | Linear (scikit-learn) |
| 2Dshapes | 77.62 ± 13.04% | 78.8 ± 12.73% | 80.16 ± 12.29% | 80.73 ± 12.03% |
| 3Dshapes | 91.98 ± 6.82% | 92.61 ± 6.82% | 94.45 ± 6.52% | 94.89 ± 5.67% |
| MPI3DToy | 33.21 ± 14.38% | 34.78 ± 15.72% | 36.25 ± 16.91% | 35.86 ± 16.82% |
3.2 Reconstruction
| Dataset | β-VAE (lowest β) | β-VAE | β-VAE | β-VAE (highest β) | WGANGP |
| CIFAR10 | 197.8 | 171.1 | 221.1 | 266.5 | 29.3 |
| CIFAR100 | 132.1 | 163.5 | 204.6 | 252.9 | N/A |
Table 3 shows the computed FID values for various values of the β parameter, and Figure 10 shows the corresponding reconstructions. The already very high FID values achieved by the standard VAE steeply increase as we make β larger. Remarkably, CIFAR10 achieves its best FID for a small but nonzero β, whereas CIFAR100 does so for the lowest β we tested. rybkin2020simple observe similar behaviour across datasets, where a small but nonzero β tends to be the best, suggesting that β cannot be seen as just a regularization hyperparameter, given that it can improve model performance even on training data. Interestingly, CIFAR100 consistently achieves lower FID scores than CIFAR10, even though CIFAR100 is a more complex dataset.
We have already shown in Figure 3 that for MPI3DToy, the dataset with ground truth variation factors whose complexity is closest to CIFAR10, the best disentanglement values were also produced for the lowest β. We found the reconstructions for MPI3DToy to also be of significantly lower quality than for 2Dshapes or 3Dshapes. This suggests that the model first needs to be able to produce good reconstructions before it can specialise in producing disentangled representations. There is likely a range of β values for which both the reconstruction and disentanglement improve simultaneously, and only later does a tradeoff between these two notions of representation quality set in. Based on this, we hypothesise that in order to scale disentanglement beyond toy datasets such as 2Dshapes, it is necessary to use models which are simultaneously very strong at image generation.
In Figure 8, we show FID and reconstruction loss during training on CIFAR100. We found that even though VAEs tend to be evaluated by the negative log-likelihood, there is almost no qualitative difference between the behaviour of FID and NLL in our models (dai2019diagnosing; lucic2017gans). The reconstructions for the CIFAR datasets can be seen in Figure 10. It is clear that lower values of β enable reconstructions of far higher quality, albeit ones that are still inferior to those produced by state-of-the-art GANs, as measured by FID (heusel2017gans; NIPS2017_892c3b1c). Finally, we note that these qualitative inspections coincide with the quantitative evaluation, unlike for the disentanglement metric.
4 Discussion
4.1 Broad reproducibility
Neither we nor followup work have been able to reproduce the original results of higgins2016beta. The work by pmlrv80kim18b and locatello2020sober was unable to significantly exceed 80% accuracy on 2Dshapes, regardless of the parameter settings. This contrasts with the 99% accuracy in the original paper. locatello2020sober extensively show that even other metrics proposed in the literature exhibit high variance and inconsistency across datasets for the same model. In our results, it was crucial both to ignore the 'shape' generative factor in 2Dshapes and to only consider median performance. These details were only mentioned in the Appendix of the original paper but are vital for achieving close to the same performance.
4.2 β-VAE with MSE loss
β-VAE has a particular interpretation when assuming a Gaussian likelihood over the decoded pixels: using an MSE loss is effectively equivalent to the normal VAE formulation with a calibrated likelihood variance. In the Gaussian likelihood case, we normally have $p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\hat{\mathbf{x}}, \mathbf{I})$, and the log-likelihood of the decoder is then

(8)  $\log p_\theta(\mathbf{x} \mid \mathbf{z}) = -\frac{1}{2}\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2 - \frac{D}{2}\log 2\pi$

where $\hat{\mathbf{x}}$ is the prediction of the decoder and $D$ is the dimensionality of the data. rybkin2020simple show that if we instead assume $p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\hat{\mathbf{x}}, \sigma^2 \mathbf{I})$, the log-likelihood becomes

(9)  $\log p_\theta(\mathbf{x} \mid \mathbf{z}) = -\frac{1}{2\sigma^2}\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2 - D\log\sigma - \frac{D}{2}\log 2\pi$

and the full VAE objective (to be minimised) is

(10)  $\mathcal{L} = \frac{1}{2\sigma^2}\,\mathbb{E}_{q}\left[\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2\right] + D\log\sigma + \mathrm{KL}\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$

This is now very similar to the β-VAE objective, since $2\sigma^2$ plays the same weighting role as β, and if σ is assumed to be constant, the $D\log\sigma$ term disappears during optimisation. This shows the equivalence of β-VAE to the standard VAE in this case.
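A quick numeric sanity check of this equivalence: with a squared-error reconstruction term, dividing it by 2σ² while keeping the KL unweighted is, up to an overall factor of 1/(2σ²), the same objective as weighting the KL by β = 2σ². A sketch under the constant-σ assumption above (reconstruction-term scaling conventions vary between implementations):

```python
import numpy as np

def beta_objective(recon_sse, kl, beta):
    # beta-VAE loss with a squared-error reconstruction term
    return recon_sse + beta * kl

def sigma_objective(recon_sse, kl, sigma):
    # Gaussian likelihood with constant variance sigma^2 (D*log(sigma) dropped)
    return recon_sse / (2.0 * sigma**2) + kl

# The two objectives differ only by the overall factor 1/(2 sigma^2)
# when beta = 2 sigma^2, so they share the same minimiser.
recon, kl, sigma = 3.7, 0.9, 1.4
beta = 2.0 * sigma**2
print(np.isclose(sigma_objective(recon, kl, sigma),
                 beta_objective(recon, kl, beta) / (2.0 * sigma**2)))  # -> True
```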
Some of the experiments by higgins2016beta use an MSE loss and show results superior to the standard VAE. However, such examples cannot be used to show the superiority of β-VAE as a model class, since it then becomes equivalent to a standard VAE with a differently calibrated decoder variance.
4.3 Implementation details
We initially adopted an online implementation of various VAE models (yanndubs2019), to which we added the disentanglement metric, more datasets, the FID score, and more model options. In order to reproduce the results better, we later reimplemented the core model and training code while crosschecking it against other sources. The implementation of the disentanglement metric was initially constructed from scratch; we later also tried a freely available implementation (noauthor_googleresearchdisentanglement_lib_2021) and obtained the same results. To ensure our implementation of FID was accurate, we based our code on the standard implementation by Seitzer2020FID.
Our final code used to produce the results in this report is mostly custommade and is available at https://github.com/Mandelbrot99/BetaVAE.
5 Conclusion
In this report, we studied the performance of β-VAE in terms of disentanglement and reconstruction across a variety of datasets. First, we observed that the results originally reported by higgins2016beta are difficult to reproduce. They rely heavily on discarding the worst-performing 50% of random seeds, as well as on simplifying the learning task, despite both of these aspects only being mentioned in the Appendix of the paper. The newly proposed disentanglement metric fails to fully capture human-interpretable disentanglement, as evidenced by qualitative evaluation. We show that even the hyperparameters of the disentanglement metric itself can be used to artificially boost the scores.
On more complex datasets, we noted that the intuition suggested by higgins2016beta, that higher β yields better disentanglement scores, breaks down. Rather, the lowest β gives the best results on MPI3DToy. We hypothesise that this can be attributed to the different relative magnitudes of the reconstruction loss and the KL divergence on more complex datasets. Finally, we confirmed that FID is strongly correlated with the reconstruction loss and that both notions coincide with a visual inspection of the reconstructed images. However, the best FID scores are not necessarily achieved for the lowest β value, which is somewhat counterintuitive, similar to how the best disentanglement can be achieved for low β.
Our work could be further improved by running the experiments for many more random seeds than we did, but we were constrained by limited hardware. It is known that the variance of performance across random seeds and hyperparameters is generally even greater than across model choices (locatello2020sober). This weakens the robustness of our claims, given that we were not able to run very largescale experiments. We would also be very interested in seeing more results on the disentanglementreconstruction tradeoff described in Section 3.2, which would again require very extensive experiments with modern VAEbased models. For example, we would like to use the real MPI3D instead of the toy version to see if our results generalise to a truly realworld setting.
References
Appendix
A Model and Hyperparameters Details
A summary of the model architectures we used is shown in Table 4. The MLP model for 2Dshapes follows higgins2016beta but the Convolutional VAE is adopted from burgess2018understanding because other followup work typically uses the same architecture.
The optimiser settings likewise generally use the same parameters across all experiments, except for 3Dshapes, where we found it necessary to decrease the learning rate to 1e-4. The 2Dshapes experiments use a batch size of 256, while the remaining datasets use 64. We also reduced the learning rate by a factor of 5 for the last 25 epochs of training in all experiments.
All 2Dshapes experiments were run using four seeds: 123, 427, 235, 921. MPI3DToy and 3Dshapes used 123, 427, 235. CIFAR10 and CIFAR100 experiments used 4723, 1263.
B PCA and ICA implementation
Because common implementations of ICA and PCA cannot be trained on minibatches and instead rely on expensive linear algebra routines over the full dataset, we had to significantly limit the training data size in order to make training feasible. For PCA, we use 25,000 samples for BW datasets and 3,500 for RGB datasets, while for ICA, we use 2,500 for BW and 1,000 for RGB datasets, respectively. Additionally, RGB images are flattened, since standard PCA/ICA implementations do not support images with multiple channels.
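The flattening step can be sketched as follows; this numpy-only PCA via SVD is a stand-in for the scikit-learn implementation we actually used:

```python
import numpy as np

def fit_pca(images, n_components=10):
    """PCA baseline on a subsample: flatten HxWxC images to vectors,
    centre them, and take the top right singular vectors as components."""
    X = images.reshape(len(images), -1).astype(np.float64)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_encode(images, mean, components):
    # Project flattened, centred images onto the principal components
    return (images.reshape(len(images), -1) - mean) @ components.T

# e.g. 100 tiny RGB images:
imgs = np.random.default_rng(0).random((100, 8, 8, 3))
mean, comps = fit_pca(imgs, n_components=5)
codes = pca_encode(imgs, mean, comps)
print(codes.shape)  # -> (100, 5)
```

The resulting codes play the same role as the VAE latent means when computing the disentanglement metric for the PCA baseline.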


| Dataset | Optimiser (lr, batch) | Input | Encoder | Latents | Decoder |
| 2Dshapes (MLP) | Adagrad (1e-2, 256) | 4096 (flattened 64×64×1) | FC 1200, 1200; ReLU | 10 | FC 1200, 1200, 1200, 4096; Tanh; Bernoulli |
| 2Dshapes (Conv VAE) | Adam (5e-4, 256) | 4096 (flattened 64×64×1) | Conv 32×4×4 (stride 2) ×3, FC 256 ×2; ReLU | 10 | Deconv reverse of encoder; ReLU; Bernoulli |
| 3Dshapes (Conv VAE) | Adam (1e-4, 64) | 12288 (flattened 64×64×3) | Conv 32×4×4 (stride 2) ×3, FC 256 ×2; ReLU | 10 | Deconv reverse of encoder; ReLU; Bernoulli |
| MPI3DToy (Conv VAE) | Adam (5e-4, 64) | 12288 (flattened 64×64×3) | Conv 32×4×4 (stride 2) ×3, FC 256 ×2; ReLU | 10 | Deconv reverse of encoder; ReLU; Bernoulli |
| CIFAR10 (Conv VAE) | Adam (5e-4, 64) | 12288 (flattened 64×64×3) | Conv 32×4×4 (stride 2) ×3, FC 256 ×2; ReLU | 128 | Deconv reverse of encoder; ReLU; Bernoulli |
| CIFAR100 (Conv VAE) | Adam (5e-4, 64) | 12288 (flattened 64×64×3) | Conv 32×4×4 (stride 2) ×3, FC 256 ×2; ReLU | 128 | Deconv reverse of encoder; ReLU; Bernoulli |