
Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders

In the loss function of Variational Autoencoders there is a well-known tension between two components: the reconstruction loss, which improves the quality of the resulting images, and the Kullback-Leibler divergence, which acts as a regularizer of the latent space. Correctly balancing these two components is a delicate issue, and a wrong balance easily results in poor generative behaviour. In a recent work, Dai and Wipf obtained a significant improvement by allowing the network to learn the balancing factor during training, according to a suitable loss function. In this article, we show that learning can be replaced by a simple deterministic computation, which helps to understand the underlying mechanism and results in faster and more accurate behaviour. On typical datasets such as CIFAR-10 and CelebA, our technique noticeably outperforms all previous VAE architectures.


1 Introduction

Generative models address the challenging task of capturing the probability distribution of high-dimensional data, in order to gain insight into their characteristic manifold and, ultimately, to pave the way to the synthesis of new data samples.

The main frameworks of generative models that have been investigated so far are Generative Adversarial Networks (GANs) (GAN) and Variational Autoencoders (VAEs) (Kingma13; RezendeMW14), both of which have generated an enormous body of work addressing variants, theoretical investigations, and practical applications.

The main feature of Variational Autoencoders is that they offer a strongly principled probabilistic approach to generative modeling. The key insight is the idea of addressing the problem of learning representations as a variational inference problem, coupling the generative model $P(X|z)$ for $X$ given the latent variable $z$ with an inference model $Q(z|X)$ synthesizing the latent representation of the given data.

The loss function of VAEs is composed of two parts: one is just the log-likelihood of the reconstruction, while the second is a term aimed at enforcing a known prior distribution $P(z)$ on the latent space, typically a spherical normal distribution. Technically, this is achieved by minimizing the Kullback-Leibler divergence between $Q(z|X)$ and the prior distribution $P(z)$; as a side effect, this also improves the similarity of the aggregate inference distribution $Q(z)$ with the desired prior, which is our final objective.

Log-likelihood and KL-divergence are typically balanced by a suitable $\lambda$-parameter (called $\beta$ in the terminology of $\beta$-VAE; beta-vae17; understanding-beta-vae18), since they have somewhat contrasting effects: the former tries to improve the quality of the reconstruction, neglecting the shape of the latent space; the KL-divergence, on the other hand, normalizes and smooths the latent space, possibly at the cost of some additional “overlapping” between latent variables, eventually resulting in a noisier encoding (brokenELBOW). If not properly tuned, the KL-divergence can also easily induce a sub-optimal use of network capacity, where only a limited number of latent variables are exploited for generation: this is the so-called overpruning/variable-collapse/sparsity phenomenon (BurdaGS15; overpruning17).

Tuning down $\lambda$ typically reduces the number of collapsed variables and improves the quality of reconstructed images. However, this may not result in a better quality of generated samples, since we lose control over the shape of the latent space, which becomes harder for a random generator to exploit.

Several techniques have been considered for the correct calibration of $\lambda$, including an annealed optimization schedule (Bowman15) or a policy enforcing a minimum KL contribution from subsets of latent units (autoregressive16). Most of these schemes require hand-tuning and, quoting overpruning17, they easily risk to “take away the principled regularization scheme that is built into VAE.”

An interesting alternative, recently introduced in TwoStage, consists in learning the correct value of the balancing parameter during training, which also allows its automatic calibration along the training process. In this context the parameter is called $\gamma$, and it is considered as a normalizing factor for the reconstruction loss.

Measuring the trend of the loss function and of the learned balancing parameter during training, it becomes evident that the parameter is proportional to the reconstruction error, with the result that the relevance of the KL-component within the whole loss function becomes independent of the current error.

Considering the shape of the loss function, it is easy to give a theoretical justification for this behaviour. As a consequence, there is no need for learning, which can be replaced by a simple deterministic computation, eventually resulting in faster and more accurate behaviour.

The structure of the article is the following. In Section 2, we give a quick introduction to Variational Autoencoders, with particular emphasis on generative issues (Section 2.1). In Section 3, we discuss our approach to the problem of balancing reconstruction error and Kullback-Leibler divergence in the VAE loss function; this is obtained from a simple theoretical investigation of the loss function in TwoStage, and essentially amounts to keeping a constant balance between the two components along training. Experimental results are provided in Section 4, relative to standard datasets such as CIFAR-10 (Section 4.1) and CelebA (Section 4.2): to the best of our knowledge, we get the best generative scores in terms of Fréchet Inception Distance ever obtained by means of Variational Autoencoders. In Section 5, we investigate the reasons why our technique seems to be more effective than previous approaches, by considering the evolution of latent variables along training. Concluding remarks and ideas for future investigations are offered in Section 5.1.

2 Variational Autoencoders

In a generative setting, we are interested in expressing the probability of a data point $X$ through marginalization over a vector of latent variables:

$$P(X) = \int P(X|z)\,P(z)\,dz \approx \mathbb{E}_{z \sim P(z)}\, P(X|z) \tag{1}$$

For most values of $z$, $P(X|z)$ is likely to be close to zero, contributing in a negligible way to the estimation of $P(X)$, and hence making this kind of sampling in the latent space practically unfeasible. The variational approach exploits sampling from an auxiliary “inference” distribution $Q(z|X)$, hopefully producing values for $z$ more likely to effectively contribute to the (re)generation of $X$. The relation between $P(X)$ and $Q(z|X)$ is given by the following equation, where KL denotes the Kullback-Leibler divergence:

$$\log P(X) - KL(Q(z|X)\,\|\,P(z|X)) = \mathbb{E}_{z \sim Q(z|X)} \log P(X|z) - KL(Q(z|X)\,\|\,P(z)) \tag{2}$$
KL-divergence is always non-negative, so the term on the right provides a lower bound to the log-likelihood $\log P(X)$, known as the Evidence Lower Bound (ELBO).

If $Q(z|X)$ is a reasonable approximation of $P(z|X)$, the quantity $KL(Q(z|X)\,\|\,P(z|X))$ is small; in this case the log-likelihood $\log P(X)$ is close to the Evidence Lower Bound: the learning objective of VAEs is the maximization of the ELBO.

In traditional implementations, we additionally assume that $Q(z|X)$ is normally distributed around an encoding function $\mu_\theta(X)$, with variance $\sigma^2_\theta(X)$; similarly, $P(X|z)$ is normally distributed around a decoder function $d_\theta(z)$. The functions $\mu_\theta$, $\sigma^2_\theta$ and $d_\theta$ are approximated by deep neural networks. Knowing the variance of latent variables allows sampling during training.

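Provided the model for the decoder function is sufficiently expressive, the shape of the prior distribution $P(z)$ for latent variables can be arbitrary, and for simplicity we may assume it is a normal distribution $P(z) = G(0,1)$. The term $KL(Q(z|X)\,\|\,P(z))$ is hence the KL-divergence between two Gaussian distributions $G(\mu_\theta(X), \sigma^2_\theta(X))$ and $G(0,1)$, which can be computed in closed form:

$$KL(G(\mu_\theta(X), \sigma^2_\theta(X))\,\|\,G(0,1)) = \frac{1}{2}\left(\mu_\theta(X)^2 + \sigma^2_\theta(X) - \log \sigma^2_\theta(X) - 1\right) \tag{3}$$

As for the term $\mathbb{E}_{z \sim Q(z|X)} \log P(X|z)$, under the Gaussian assumption the logarithm of $P(X|z)$ is, up to sign, scaling and an additive constant, just the quadratic distance between $X$ and its reconstruction $d_\theta(z)$; the $\lambda$ parameter balancing reconstruction error and KL-divergence can be understood in terms of the variance of this Gaussian (tutorial-VAE).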
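The closed-form KL term for a diagonal Gaussian against a standard normal prior is easy to sketch in a few lines. The snippet below is only an illustration in plain Python: the function name `kl_to_standard_normal` and the convention of summing over latent dimensions are assumptions made here, not taken from the original code.

```python
import math

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions.

    mu and var are sequences holding one mean and one variance per
    latent variable (hypothetical helper, for illustration only).
    """
    return sum(0.5 * (m * m + v - math.log(v) - 1.0) for m, v in zip(mu, var))

# A posterior that already coincides with the prior costs nothing.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0

# Moving the mean away from 0 or shrinking the variance is penalized.
print(kl_to_standard_normal([1.0, 0.0], [1.0, 0.5]))
```

Note how a small variance is penalized through the logarithm: this is the pressure that keeps the encoder from becoming arbitrarily confident.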

The problem of integrating sampling with backpropagation is solved by the well-known reparametrization trick (VAE13; RezendeMW14).
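As a minimal sketch of the reparametrization trick, the sample is written as a deterministic function of the encoder outputs plus independent noise, so that in an automatic-differentiation framework gradients can flow through the mean and variance. Plain Python has no autodiff, so the snippet only shows the sampling step; the helper name `sample_latent` is hypothetical.

```python
import math
import random

def sample_latent(mu, var, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1), per latent dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = [sample_latent([2.0], [0.25], rng)[0] for _ in range(20000)]

# Empirical moments should be close to the requested mean and variance.
mean_z = sum(samples) / len(samples)
var_z = sum((s - mean_z) ** 2 for s in samples) / len(samples)
print(round(mean_z, 2), round(var_z, 2))
```

All the randomness sits in `eps`, which does not depend on the network parameters; this is what makes the sampling step compatible with backpropagation.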

2.1 Generation of new samples

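The whole point of VAEs is to force the generator to produce a marginal distribution $Q(z) = \mathbb{E}_X Q(z|X)$ (called by some authors the aggregate posterior distribution; AAE) close to the prior $P(z)$. If we average the Kullback-Leibler regularizer over all input data, and expand the KL-divergence in terms of entropy, we get:

$$\begin{aligned} \mathbb{E}_X\, KL(Q(z|X)\,\|\,P(z)) &= -\mathbb{E}_X\, \mathcal{H}(Q(z|X)) + \mathbb{E}_X\, \mathcal{H}(Q(z|X), P(z)) \\ &= -\mathbb{E}_X\, \mathcal{H}(Q(z|X)) + \mathbb{E}_X\, \mathbb{E}_{z \sim Q(z|X)} [-\log P(z)] \\ &= -\mathbb{E}_X\, \mathcal{H}(Q(z|X)) + \mathbb{E}_{z \sim Q(z)} [-\log P(z)] \\ &= -\underbrace{\mathbb{E}_X\, \mathcal{H}(Q(z|X))}_{\text{avg. entropy of } Q(z|X)} + \underbrace{\mathcal{H}(Q(z), P(z))}_{\text{cross-entropy of } Q(z) \text{ vs. } P(z)} \end{aligned} \tag{4}$$

The cross-entropy between two distributions is minimal when they coincide, so we are pushing $Q(z)$ towards $P(z)$. At the same time, we try to increase the entropy of each $Q(z|X)$; under the assumption that $Q(z|X)$ is Gaussian, this amounts to enlarging the variance, further improving the coverage of the latent space, which is essential for generative sampling (at the cost of more overlapping, and hence more confusion between the encodings of different data points).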

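Since our prior distribution is a Gaussian, we expect $Q(z)$ to be normally distributed too, so its mean $\mu$ should be 0 and its variance $\sigma^2$ should be 1. If $Q(z|X) = N(\mu(X), \sigma^2(X))$, we may look at $Q(z) = \mathbb{E}_X Q(z|X)$ as a Gaussian Mixture Model (GMM). Then, we expect

$$\mu = \mathbb{E}_X\, \mu(X) = 0 \tag{5}$$

and especially, assuming the previous equation (see aboutVAE for details),

$$\sigma^2 = \mathbb{E}_X\, \mu(X)^2 + \mathbb{E}_X\, \sigma^2(X) = 1 \tag{6}$$

This rule, which we call the variance law, provides a simple sanity check to test whether the regularization effect of the KL-divergence is working properly.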
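The variance law lends itself to a very simple numerical check. The sketch below assumes that, for one latent variable, the encoder means and variances over a set of inputs are available as plain lists; the function name `variance_law` and the toy numbers are illustrative only.

```python
def variance_law(mus, variances):
    """Return E_X[mu(X)^2] + E_X[sigma^2(X)] for a single latent variable.

    mus and variances hold the encoder outputs over a set of inputs;
    a well-regularized variable should give a value close to 1.
    """
    n = len(mus)
    return sum(m * m for m in mus) / n + sum(variances) / n

# A collapsed variable: mean always 0, variance always 1.
print(variance_law([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]))  # 1.0

# A variable whose means are spread out and whose variances are small
# (toy values): here the sum overshoots 1 slightly.
print(variance_law([1.0, -1.0, 1.0, -1.0], [0.1, 0.1, 0.1, 0.1]))
```

A value well below 1 would indicate that the aggregate distribution of that variable is narrower than the prior, that is, that the KL regularization is too weak.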

The fact that the first two moments of the marginal inference distribution are 0 and 1 does not imply that it looks like a Normal. The possible mismatch between $Q(z)$ and the expected prior $P(z)$ is indeed a problematic aspect of VAEs that, as observed in several works (ELBOsurgery; rosca2018distribution; aboutVAE), could compromise the whole generative framework. To fix this, some works extend the VAE objective by encouraging the aggregated posterior to match $P(z)$ (WAE) or by exploiting more complex priors (autoregressive; Vamp; resampledPriors).

In TwoStage (the current state of the art), a second VAE is trained to learn an accurate approximation of $Q(z)$; samples from a Normal distribution are first used to generate samples of $Q(z)$, which are then fed to the actual generator of data points. Similarly, in deterministic, the authors try to give an ex-post estimation of $Q(z)$, e.g. imposing a distribution of sufficient complexity (they consider a combination of 10 Gaussians, reflecting the ten categories of MNIST and CIFAR-10).

3 The balancing problem

As we already observed, the problem of correctly balancing reconstruction error and KL-divergence in the loss function has been the object of several investigations. Most of the approaches were based on empirical evaluation, and often required manual tuning of the relevant parameters. A more theoretical approach has been recently pursued in TwoStage.

The generative loss (GL), to be summed with the KL-divergence, is defined by the following expression (directly borrowed from the public code):

$$GL = \frac{mse}{2\gamma^2} + \log \gamma + \frac{\log(2\pi)}{2} \tag{7}$$

where mse is the mean square error on the minibatch under consideration and $\gamma$ is a parameter of the model, learned during training. The previous loss is derived in TwoStage by a complex analysis of the behaviour of the VAE objective function, assuming the decoder has a Gaussian error with variance $\gamma^2$, and investigating the case of arbitrarily small but explicitly nonzero values of $\gamma^2$.

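Since $\gamma$ has no additional constraints, we can explicitly minimize equation 7 with respect to it. The derivative $GL'$ of $GL$ with respect to $\gamma$ is

$$GL' = -\frac{mse}{\gamma^3} + \frac{1}{\gamma} \tag{8}$$

having a zero for $\gamma^2 = mse$, corresponding to a minimum for equation 7.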
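As a quick check that this stationary point is indeed a minimum, one can look at the second derivative. The worked step below assumes the generative loss has the form $GL = mse/(2\gamma^2) + \log\gamma + \log(2\pi)/2$ with $\gamma > 0$; the symbol $GL''$ is introduced here only for illustration.

```latex
\begin{aligned}
GL'  &= -\frac{mse}{\gamma^{3}} + \frac{1}{\gamma}, \\
GL'' &= \frac{3\,mse}{\gamma^{4}} - \frac{1}{\gamma^{2}}, \\
GL''\big|_{\gamma^{2}=mse} &= \frac{3}{mse} - \frac{1}{mse} = \frac{2}{mse} > 0, \\
GL\big|_{\gamma^{2}=mse} &= \frac{1}{2} + \frac{1}{2}\log(mse) + \frac{\log(2\pi)}{2}.
\end{aligned}
```

At the optimum the weighted reconstruction term $mse/(2\gamma^2)$ equals the constant $1/2$, whatever the current error is.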

This suggests a very simple deterministic policy for computing $\gamma$ instead of learning it: just use the current estimation of the mean square error. This can be easily computed as a discounted combination of the mse relative to the current minibatch with the previous approximation: in our implementation, we just take the minimum between these two values, in order to have a monotonically decreasing value for $\gamma$ (we work with minibatches of size 100, which is sufficiently large to provide a reasonable approximation of the real mse). $\gamma$ is updated at every minibatch of samples.
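The update policy just described can be sketched as a tiny stateful helper. This is an illustration under stated assumptions, not the authors' implementation: the class name `GammaTracker`, the infinite initial value, and the toy mse sequence are all hypothetical.

```python
class GammaTracker:
    """Keeps gamma^2 as the running minimum of the minibatch mse."""

    def __init__(self):
        # Assumed initialisation: the first minibatch always sets the value.
        self.gamma_sq = float("inf")

    def update(self, batch_mse):
        """Call once per minibatch; returns the gamma^2 to use in the loss."""
        self.gamma_sq = min(self.gamma_sq, batch_mse)
        return self.gamma_sq

tracker = GammaTracker()
history = [tracker.update(m) for m in [0.050, 0.041, 0.044, 0.030]]
print(history)  # [0.05, 0.041, 0.041, 0.03]
```

The third minibatch has a larger error than the second, so the tracked value does not move: the sequence is monotonically non-increasing by construction.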

Compared with the original approach in TwoStage, the resulting technique is both faster and more accurate.

An additional contribution of our approach is to shed some light on the effect of the balancing technique in TwoStage. Neglecting constant addends, which have no role in the loss function, the total loss function for the VAE is simply:

$$GL + KL = \frac{mse}{2\gamma^2} + KL$$

So, computing $\gamma$ according to the previous estimation of mse essentially has the effect of keeping a constant balance between reconstruction error and KL-divergence during the whole training: as mse decreases, we normalize it in order to prevent a prevalence of the KL-component, which would forbid further improvements in the quality of reconstructions.
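The constant-balance effect can be seen numerically. The sketch below assumes a total loss of the form mse / (2 gamma^2) + KL, with gamma^2 treated as a constant rather than a trainable parameter; the function name `balanced_loss` and the sample values are illustrative.

```python
def balanced_loss(batch_mse, kl, gamma_sq):
    """mse / (2 * gamma^2) + KL, with gamma^2 computed outside the graph."""
    return batch_mse / (2.0 * gamma_sq) + kl

# When gamma^2 tracks the current mse, the reconstruction term stays at
# 1/2 no matter how small the error has become (KL set to 0 to isolate it).
for mse in [0.05, 0.01, 0.004]:
    print(balanced_loss(mse, kl=0.0, gamma_sq=mse))  # 0.5 each time
```

With a fixed weight instead, the reconstruction term would shrink as training progresses and the KL term would progressively dominate the loss.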

4 Empirical evaluation

We compared our proposed Two Stage VAE with computed $\gamma$ against the original model with learned $\gamma$, using the same network architectures. In particular, we worked with many different variants of the so-called ResNet version, schematically described in Figure 1 (pictures are borrowed from TwoStage).

Figure 1: “Resnet” architecture, with panels (A) Scale block, (B) Encoder, (C) Decoder. (A) Scale block: a sequence of residual blocks. We mostly worked with a single residual block; two or more blocks make the architecture noticeably heavier and slower to train, with no remarkable improvement. (B) Encoder: the input is first transformed by a convolutional layer and then passed to a chain of Scale blocks; after each Scale block, the input is downsampled by a strided convolutional layer and the channels are doubled. After the Scale blocks, the feature map is flattened to a vector and then fed to another Scale block composed of fully connected layers of dimension 512. The output of this Scale block is used to produce the means and variances of the latent variables. For CIFAR-10 we follow the settings of TwoStage; for CelebA, we tested many different configurations. (C) Decoder: the latent representation is first passed through a fully connected layer, reshaped to 2D, and then passed through a sequence of deconvolutions, halving the number of channels at the same time.

In all our experiments, we used a batch size of 100 and adopted Adam with TensorFlow’s default hyperparameters as optimizer. Other hyperparameters, as well as additional architectural details, will be described below, where we discuss the cases of CIFAR-10 and CelebA separately.

In general, in all our experiments, we observed a high sensitivity of FID scores to the learning rate and to the deployment of auxiliary regularization techniques. As we shall discuss in Section 5, modifying these training configurations may easily result in a different number of inactive latent variables at the end of training (for the purposes of this work, we consider a variable inactive when […]). Having either too few or too many active variables may eventually compromise generative sampling, for opposite reasons: few active variables usually compromise reconstruction quality, but an excessive number of active variables makes controlling the shape of the latent space considerably harder.

The code is available on GitHub. Checkpoints for CIFAR-10 and CelebA are available at the project’s page.

4.1 Cifar10

For Cifar10, we got relatively good results with the basic ResNet architecture with 3 Scale Blocks, a single Resblock for every Scale Block, and 64 latent variables. We trained our model for 700 epochs on the first VAE and 1400 epochs on the second VAE; the initial learning rate was […], halved every […] epochs on the first VAE and every […] epochs on the second VAE. Details about the evolution of reconstruction and generative error during training are provided in Figure 2 and Table 1. The data refer to ten different but “uniform” trainings ending with the same number of active latent variables (17 in this case). A few pathological trainings resulting in lower or higher sparsity (and worse FID scores) have been removed from the statistics.

Figure 2: Evolution during 700 epochs of training on the CIFAR-10 dataset of the FID scores for reconstructed images (blue), first-stage generated images (orange), and second-stage generated images. The number of epochs refers to the first VAE, and it is doubled for the second VAE. The filled region around each line corresponds to the standard deviation from the expected value. Means and variances have been estimated over 10 different trainings.

Table 1: Evolution during training on the CIFAR-10 dataset of several different metrics

model                                   epochs  REC  GEN-1  GEN-2
RAE-l2 deterministic (128 vars)         100
2S-VAE, learned $\gamma$ TwoStage       1000
2S-VAE, learned $\gamma$, replicated    1000
2S-VAE, computed $\gamma$               700

Table 2: CIFAR-10: summary of results

In Table 2, we compare our approach with the original version with learned $\gamma$ (TwoStage). Since some people had problems in replicating the results in TwoStage (see the discussion on OpenReview), we repeated the experiment (also in order to compute the reconstruction FID). Using the learning configuration suggested by the authors, namely 1000 epochs for the first VAE, 2000 epochs for the second one, and an initial learning rate equal to 0.0001, halved every 300 and 600 epochs for the two stages, respectively, we obtained results essentially in line with those declared in TwoStage.

                 learned $\gamma$                     computed $\gamma$
Model  epochs    REC   GEN-1  GEN-2  mse    var.law   REC   GEN-1  GEN-2  mse    var.law
1      40/120    53.8  66.0   59.3   .0059  1.024     45.8  56.9   57.1   .0056  0.805
1      80/210    54.3  65.9   59.8   .0049  0.803     46.0  61.1   58.1   .0047  0.688
1      120/300   54.9  66.5   60.4   .0044  0.775     48.1  63.8   59.9   .0043  0.687
2      40/120    48.5  58.2   54.5   .0059  0.985     41.5  58.7   53.7   .0058  1.024
2      80/210    48.8  60.7   55.5   .0048  0.889     42.0  58.8   55.1   .0048  0.877
2      120/300   49.1  62.8   56.9   .0043  0.880     43.0  60.2   56.2   .0043  0.863
3      40/120    59.4  74.3   63.4   .0050  0.893     56.3  74.0   63.9   .0049  0.637
3      80/210    55.6  72.6   62.2   .0039  0.840     54.4  72.3   61.8   .0038  0.621
3      120/300   55.2  72.0   62.1   .0037  0.785     54.4  71.3   62.0   .0036  0.744
4      40/120    52.8  68.2   60.4   .0072  0.789     48.0  65.0   57.7   .0072  0.742
4      80/210    49.4  67.9   58.5   .0060  0.822     45.3  65.0   53.4   .0059  0.785
4      120/300   49.1  68.0   58.4   .0053  0.844     44.4  65.4   54.0   .0053  0.804

Table 3: CelebA: metrics for different models

For the sake of completeness, we also compare with the FID scores for the recent RAE-l2 model deterministic (variance was not provided by authors). In this case, the comparison is purely indicative, since in deterministic they work, in the CIFAR-10 case, with a latent space of dimension 128. This also explains their particularly good reconstruction error, and the few training epochs.

4.2 CelebA

In the case of CelebA, we had more trouble in replicating the results of TwoStage, although we were working with their own code. As we shall see, this was partly due to a mistake on our side, which pushed us to an extensive investigation of different architectures.

In Table 3 we summarize some of the results we obtained, over a large variety of different network configurations. The metrics given in the table refer to the following models:

  • Model 1: This is our base model, with 4 scale blocks in the first stage, 64 latent variables, and dense layers with inner dimension 4096 in the second stage.

  • Model 2: As Model 1 with l2 regularizer added in upsampling and scale layers in the decoder.

  • Model 3: Two resblocks for every scale block, l2 regularizer added in downsampling layers in the encoder.

  • Model 4: As Model 1 with 128 latent variables, and 3 scale blocks.

All models have been trained with Adam, with an initial learning rate of 0.0001, halved every 48 epochs in the first stage and every 120 epochs in the second stage.
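The step schedule described here can be sketched as follows. The function name `halved_learning_rate` is hypothetical, and the convention that the halving takes effect exactly at epochs 48, 96, ... (counting from epoch 0) is an assumption made for illustration.

```python
def halved_learning_rate(epoch, initial_lr=1e-4, halve_every=48):
    """Learning rate at a given epoch, halved every `halve_every` epochs."""
    return initial_lr * 0.5 ** (epoch // halve_every)

# First-stage schedule: 0.0001 halved every 48 epochs.
for epoch in (0, 47, 48, 96, 119):
    print(epoch, halved_learning_rate(epoch))

# Second-stage schedule: same initial value, halved every 120 epochs.
print(halved_learning_rate(120, halve_every=120))
```

With 120 first-stage epochs, this schedule halves the rate twice, ending at a quarter of the initial value.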

According to the results in Table 3, we can make a few noteworthy observations:

  1. for a given model, the technique computing $\gamma$ systematically outperforms the version learning it, both in reconstruction and in generation on both stages;

  2. after the first 40 epochs, FID scores (including reconstruction FID) do not seem to improve any further, and can even get worse, in spite of the fact that the mean square error keeps decreasing; this is in contrast with the intuitive idea that the FID REC score should be proportional to mse;

  3. the value of the variance law is far from one, which seems to suggest that the KL term is too weak in this case; this explains the mediocre generative scores of the first stage, and the significant improvement obtained with the second stage;

  4. l2-regularization, as advocated in deterministic, seems indeed to have some beneficial effect.

We spent quite a lot of time trying to figure out the reasons for the discrepancy between our observations and the results claimed in TwoStage. Inspecting the elements of the dataset with the worst reconstruction errors, we noticed a particularly bad quality of some of the images, resulting from the resizing of the face crop of dimension 128x128 to the canonical dimension 64x64 expected by the neural network. The resizing function used in the source code of TwoStage was the deprecated imresize function of the scipy library. Following the suggestion in the documentation, we replaced the call to imresize with a call to PILLOW. Unfortunately, and surprisingly, the default resizing mode of PILLOW is Nearest Neighbours which, as shown in Figure 3, introduces annoying jaggies that noticeably deteriorate the quality of images.

Figure 3: Effect of resizing mode on a few CelebA samples. Nearest Neighbours produces bad staircase effects; bilinear, which is the common choice, is particularly smooth and well suited to VAEs; bicubic is slightly sharper.

This probably also explains the anomalous behaviour of FID REC with respect to the mean squared error. The Variational Autoencoder fails to reconstruct images with high-frequency jaggies, while it keeps improving on smoother images. This is experimentally confirmed by the fact that, while the minimum mse keeps decreasing during training, the maximum stabilizes after a while. So, in spite of the fact that the average mse decreases, the overall distribution of reconstructed images may remain far from the distribution of real images, and possibly get even more distant.
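Why nearest-neighbour downsampling damages fine detail can be seen on a toy example. The sketch below is plain Python on a list-of-lists “image” and does not reproduce Pillow's filters: 2x2 averaging is used here only as a stand-in for a smoothing filter such as bilinear, and both helper names are hypothetical.

```python
def downsample_nearest(img):
    """Halve resolution by keeping one pixel out of every 2x2 block."""
    return [row[::2] for row in img[::2]]

def downsample_average(img):
    """Halve resolution by averaging each 2x2 block (simple smoothing)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# 4x4 toy image with a one-pixel-wide bright vertical line in column 1.
img = [[0, 255, 0, 0] for _ in range(4)]

print(downsample_nearest(img))  # [[0, 0], [0, 0]]
print(downsample_average(img))  # [[127.5, 0.0], [127.5, 0.0]]
```

Nearest-neighbour sampling either drops or fully keeps a thin structure depending on its position, which is the origin of staircase artifacts on edges; averaging spreads it smoothly. With Pillow, the analogous precaution is to pass the `resample` argument of `Image.resize` explicitly, since the default filter depends on the library version.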

model                               epochs  REC  GEN-1  GEN-2
RAE-SN deterministic                70
2S-VAE, learned $\gamma$ TwoStage   120
2S-VAE, computed $\gamma$           70
  with latent space norm.

Table 4: CelebA: summary of results

Resizing images with the traditional bilinear interpolation produces a substantial improvement, but not sufficient to obtain the expected generative scores.

Another essential component is again the balance between reconstruction error and KL-divergence. As observed above, in the case of CelebA the KL-divergence seems too weak, as clearly testified by the moments of latent variables expressed by the variance law. As a matter of fact, in the loss function of TwoStage, both mse and KL-divergence are computed as reduced sums, respectively over pixels and latent variables. Now, passing from CIFAR-10 to CelebA, we multiplied the number of pixels by four, passing from 32x32 to 64x64, but kept a constant number of latent variables. So, in order to keep the same balance we used for CIFAR-10, we should multiply the KL-divergence by a factor of 4.
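The arithmetic behind the factor 4 can be written down explicitly. The helper below is a hypothetical generalization: the text only covers the case of a constant number of latent variables, and dividing by the ratio of latent dimensions is an assumption made here for illustration.

```python
def kl_rescale_factor(ref_side, new_side, ref_latents, new_latents):
    """Factor for the KL term that preserves the balance of a reference setup.

    Both loss terms are assumed to be reduced sums (over pixels and over
    latent variables), so the factor is the pixel ratio divided by the
    latent-dimension ratio.
    """
    pixel_ratio = (new_side * new_side) / (ref_side * ref_side)
    latent_ratio = new_latents / ref_latents
    return pixel_ratio / latent_ratio

# From 32x32 CIFAR-10 images to 64x64 CelebA crops, 64 latents in both cases.
print(kl_rescale_factor(32, 64, 64, 64))  # 4.0
```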

Finally, learning seems to proceed quite fast in the case of CelebA, which suggests working with a lower initial learning rate: 0.00005. We also kept l2 regularization on downsampling and upsampling layers.

With these simple expedients, we were already able to improve on the generative scores of TwoStage (see Table 4), but not on those of deterministic.

Figure 4: Examples of generated faces. The resulting images do not show the blurred appearance so typical of variational approaches, noticeably improving their perceptual quality.

Analyzing the moments of the distribution of latent variables generated during the second stage, we observed that the actual variance was noticeably below the expected unit variance (around 0.85). The simplest solution consists in normalizing the generated latent variables to meet the expected variance (this point is a bit outside the scope of this contribution, and will be better investigated in a forthcoming article).
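One simple way to perform such a normalization is sketched below. The text does not specify the exact procedure, so everything here is an assumption: the helper name `normalize_to_unit_variance`, the choice of rescaling around the empirical mean, and the synthetic samples drawn with variance 0.85.

```python
import math
import random

def variance(values):
    """Empirical (population) variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def normalize_to_unit_variance(values):
    """Rescale the samples of one latent variable around their mean so
    that the empirical variance becomes 1 (assumed procedure)."""
    mean = sum(values) / len(values)
    std = math.sqrt(variance(values))
    return [mean + (v - mean) / std for v in values]

rng = random.Random(0)
# Synthetic stand-in for second-stage samples with variance around 0.85.
z = [rng.gauss(0.0, math.sqrt(0.85)) for _ in range(5000)]
z_norm = normalize_to_unit_variance(z)
print(round(variance(z), 2), round(variance(z_norm), 2))
```

The rescaling leaves the empirical mean untouched and only stretches the samples, so the shape of the distribution is preserved.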

This final precaution caused a sudden improvement in the FID score for generated images, allowing us to obtain, to the best of our knowledge, the best generative scores ever produced for CelebA with a variational approach.

In Figure 4 we provide examples of randomly generated faces. Note the particularly sharp quality of the images, so unusual for variational approaches.

5 Discussion

The reason why the balancing policy between reconstruction error and KL-regularization addressed in TwoStage and revisited in this article is so effective seems to rely on its laziness in the choice of the latent representation.

A Variational Autoencoder computes, for each latent variable $z$ and each sample $X$, an expected value $\mu_z(X)$ and a variance $\sigma^2_z(X)$ around it. During training, the variance usually drops very fast to values close to $0$, reflecting the fact that the network is highly confident in its choice of $\mu_z(X)$. The KL-component in the loss function can be understood as a mechanism aimed at reducing this confidence, by forcing a non-negligible variance. As an effect of the KL-regularization, some latent variables may even be neglected by the VAE, inducing sparsity in the resulting encoding (sparsity). The “collapsed” variables have, for any $X$, a value of $\mu_z(X)$ close to $0$ and a mean variance close to $1$. So, typically, at a relatively early stage of training, the mean variance of each latent variable gets either close to $0$, if the variable is exploited, or close to $1$, if the variable is neglected (see Figure 5).
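Given the mean variance of each latent variable over a dataset, active and collapsed variables can be separated with a simple threshold. The sketch below is only illustrative: the function name `split_latents`, the threshold value 0.8 and the toy numbers are all assumptions, not the criterion used in the experiments.

```python
def split_latents(mean_variances, threshold=0.8):
    """Split latent indices into (active, inactive) by mean variance.

    A variable whose mean variance exceeds the (hypothetical) threshold
    is treated as collapsed; the others are treated as active.
    """
    active = [i for i, v in enumerate(mean_variances) if v <= threshold]
    inactive = [i for i, v in enumerate(mean_variances) if v > threshold]
    return active, inactive

# Toy values: variables 1 and 3 sit near 1, the others are exploited or undecided.
active, inactive = split_latents([0.02, 0.97, 0.01, 0.99, 0.45])
print(active, inactive)  # [0, 2, 4] [1, 3]
```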

Figure 5: Typical evolution of the mean variance of latent variables during training in a Variational Autoencoder. Relevant variables have a variance close to $0$, while inactive variables have a variance going to $1$. The picture was borrowed from sparsity and is relative to the first epoch of training for a dense VAE over the MNIST data set.

Traditional balancing policies addressed in the literature start with a low value for the KL-regularization, increasing it during training. The general idea is to start privileging the quality of reconstruction, and then try to induce a better coverage of the latent space. Unfortunately, this reshaping ex post of the latent space looks hard to achieve, in practice.

The balancing property discussed in this article does the opposite: it starts attributing a relatively high importance to KL-divergence, to balance the high initial reconstruction error, progressively reducing its relevance in a way proportional to the improvement of the reconstruction. In this way, the relative importance between the two components of the loss function remains constant during training.

The practical effect is that latent variables are kept for a long time in a sort of limbo from which, one at a time, they are retrieved and put to work by the autoencoder, as soon as it realizes how they can contribute to the reconstruction.

The previous behaviour is evident by looking at the evolution of the mean variance of latent variables during training (not to be confused with the variance of the mean values $\mu_z(X)$, which according to the variance law should approximately be the complement to $1$ of the former).

In Figure 6 we see the evolution of the variance of the 64 latent variables during the first epoch of training on the Cifar10 data set: even after a full epoch, the “status” of most latent variables is still uncertain.

Figure 6: Evolution of the mean variance of the 64 latent variables during the first epoch of training on Cifar10. Due to the “lazy” balancing technique, even after a full epoch, the destiny of most latent variables is still uncertain: they could collapse or be exploited for reconstruction.

During the next 50 epochs, in a very slow process, some of the “dormant” latent variables are woken up by the autoencoder, causing their mean variance to move towards 0: see Figure 7.

Figure 7: Evolution of the mean variance of the 64 latent variables during the first 50 epochs of training on Cifar10. One by one, latent variables are retrieved from the limbo (variance around 0.8) and put to work by the autoencoder.

As training progresses, fewer and fewer variables change their status, until the process finally stabilizes.

It would be nice to think, as hinted in TwoStage, that the number of active latent variables at the end of training corresponds to the actual dimensionality of the data manifold. Unfortunately, this number still depends on too many external factors to justify such a claim. For instance, a mere modification of the learning rate noticeably affects the sparsity of the resulting latent space, as shown in Table 5, where we compare, for different initial learning rates (l.r.), the final number of inactive variables, FID scores, and mean square error.

l.r.    inact.  REC   GEN-1  GEN-2  mse
.00020  13      53.0  80.6   74.5   .0039
.00015  15      53.3  79.9   71.8   .0040
.00010  17      53.8  80.2   68.8   .0041
.00005  19      58.2  83.2   75.8   .0047

Table 5: Effect of the learning rate on sparsity and different metrics. A high learning rate reduces sparsity and improves reconstruction. However, this does not result in a better generative score. With a low rate, too many variables remain inactive.

Specifically, a high learning rate appears to be in conflict with the lazy way we would like latent variables to be chosen for activation; this typically results in less sparsity, which is not always beneficial for generative purposes. The annoying point is that, with respect to the dimensionality of the latent space with the best generative FID, activating more variables can result in a lower reconstruction error, which should not be the case if we had correctly identified the dimensionality of the data manifold.

So, while the balancing strategy discussed in this article (similarly to the one in TwoStage) is eventually beneficial, it could still take advantage of some tuning.

5.1 Conclusions

In this article, we stressed the importance of keeping a constant balance between reconstruction error and Kullback-Leibler divergence during training of Variational Autoencoders. We did so by normalizing the reconstruction error by an estimation of its current value, derived from minibatches. We developed the technique by an investigation of the loss function used in TwoStage, where the balancing parameter was instead learned during training. Our technique seems to outperform all previous Variational Approaches, permitting us to obtain unprecedented FID scores for traditional datasets such as CIFAR-10 and CelebA.

In spite of its relevance, the policy of keeping a constant balance does not seem to entirely solve the balancing issue, which still seems to depend on many additional factors, such as the network architecture, the complexity and resolution of the dataset, or training parameters such as the learning rate.

Also, the regularization effect of the KL-component must be better understood, since it frequently fails to induce the expected distribution of latent variables, possibly requiring and justifying ex-post adjustments.

Credits: All innovative ideas and results contained in this article are to be credited to the first author. The second author mostly contributed on the experimental side.

Conflict of Interest: The authors declare that they have no conflict of interest.