1 Introduction
A common generative model is the variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014). Recent papers have shown that a VAE does not learn meaningful latent representations, i.e., the latent representation becomes statistically independent of the data, if a sufficiently powerful probabilistic decoder is used (Bowman et al., 2016; Chen et al., 2017; Higgins et al., 2017a; Alemi et al., 2018). A focus of many recent works has been to develop generative models based on VAE that learn meaningful latent representations (Tolstikhin et al., 2018; Braithwaite and Kleijn, 2018; Zhao et al., 2019; Alemi et al., 2018; Razavi et al., 2019). Of these systems, the state-of-the-art Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018) and Bounded Information Rate Variational Autoencoder (BIRVAE) (Braithwaite and Kleijn, 2018) are conceptually relatively straightforward, as they do not involve the explicit optimisation of information-theoretic measures. Consider an encoder $q_\phi(z|x)$ and a decoder $p_\theta(x|z)$, with parameters $\phi$ and $\theta$ respectively, and let $p_D(x)$ be the data distribution. WAE and BIRVAE then minimise a mean output error and, additionally, attempt to drive the aggregate posterior distribution $q_\phi(z)$ to a predefined prior $p(z)$, typically an isotropic Gaussian.
The WAE authors advocated the use of deterministic encoders, where the variance of the encoder distribution is zero. However, Braithwaite and Kleijn (2018) show that a fixed amount of encoder stochasticity (e.g., additive white Gaussian noise) in the latent layer during training can prevent overfitting when limited data is available. Additionally, in section 3, we discuss how using stochastic encoders during training results in the data-domain reconstruction cost encouraging neighbourhoods in the data domain to remain connected in the latent space. In contrast, the natural continuity of the encoder network is the only reason for such preservation of neighbourhood connectivity when the encoder is deterministic. This argument suggests that latent space noise is beneficial for generative modelling tasks, allowing for better sampling. Stochastic encoders, however, lead to some interesting challenges.
Consider an extension of WAE that uses stochastic encoders implemented by fixed-variance additive noise in the latent layer. We view the stochastic WAE as transmitting latent codes, given by a deterministic encoder function, through a noisy communication channel. The output of the channel is the latent code plus additive noise drawn from a user-defined distribution. The learned latent representation is jointly optimised to encode the most important information in the data (source coding (Cover and Thomas, 2012)) and to be robust to errors introduced by the noisy communication channel (channel coding (Cover and Thomas, 2012)), so as to minimise the distortion at the output.
WAE attempts to enforce a prespecified shape on the aggregate posterior by minimising a divergence between the aggregate posterior and the prior (Tolstikhin et al., 2018). In the case of stochastic encoders, the learned aggregate posterior is a compromise between the prior and the distribution corresponding to the optimal code that minimises the expected distortion. This attempt to enforce the desired prior distribution prevents the optimal joint source-channel code from being learned, negatively affecting reconstruction performance. This behaviour also affects generative performance because the learned aggregate posterior does not equal the prior and, therefore, an incorrect distribution is assumed when sampling from the generative model. In addition, the regularisation of the aggregate posterior to match the prior is contrary to the objectives of disentanglement. In general, the true generative latent features for a dataset are not necessarily Gaussian. In section 6, we observe that when disentangled features are learnt on 3DShapes (Burgess and Kim, 2018), they do not have a Gaussian distribution. Despite this, many state-of-the-art disentanglement methods (Higgins et al., 2017b; Chen et al., 2018; Kim and Mnih, 2018) regularise the aggregate posterior to have a Gaussian distribution.

Our Contributions:

In section 3, we demonstrate theoretically that in the case of the stochastic WAE, the compromise between the user-defined distribution and the optimal code that minimises the expected distortion causes suboptimal reconstruction and generative performance.

In section 4, we propose the Variance Constrained Autoencoder (VCAE), which applies only a variance constraint to the aggregate posterior rather than additionally constraining the shape (like WAE). Then, in section 6, we demonstrate that VCAE outperforms the stochastic WAE (a state-of-the-art method) and VAE in terms of reconstruction quality and generative modelling on MNIST and CelebA.

In section 6, we show that VCAE equipped with a total correlation penalty (TCVCAE) has equivalent performance to FactorVAE for the task of disentanglement on 3DShapes, while also being a more principled disentanglement approach than FactorVAE.
2 Constrained Wasserstein Autoencoder
In this section, we introduce two state-of-the-art latent variable generative models, the Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2018) and Bounded Information Rate Variational Autoencoder (BIRVAE) (Braithwaite and Kleijn, 2018). Despite WAE and BIRVAE being derived from different perspectives, in practice, they are equivalent. This relationship leads us to the constrained Wasserstein Autoencoder (cWAE), an information-rate limited WAE.
Throughout the remainder of this paper, we denote random variables by capital letters, e.g., $X$, and their realisations by lowercase letters, e.g., $x$. Probability density functions $p_X(x)$ are abbreviated to $p(x)$, and probability distributions are denoted as $P_X$. We will primarily deal with data and latent variables $X$ and $Z$, respectively, where the latent space generally has a lower dimension than the data space.

Both WAE and BIRVAE have the same setup, with two stochastic mappings, $q_\phi(z|x)$ and $p_\theta(x|z)$, implemented by neural networks with parameters $\phi$ and $\theta$, respectively. $q_\phi(z|x)$ and $p_\theta(x|z)$ are referred to as the encoder and decoder, respectively. The aggregated posterior distribution is $q_\phi(z) = \int q_\phi(z|x)\,p_D(x)\,dx$, where $p_D(x)$ is the distribution of $X$ defined by the data. The aggregated generative distribution is $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$. Lastly, let $p(z)$ be a user-selected distribution.

WAE optimises this model by minimising the Wasserstein distance between $p_D(x)$ and $p_\theta(x)$. In general, the Wasserstein distance is not easily computable. However, when the decoder is implemented by a deterministic function, denoted $g_\theta$, the Wasserstein distance can be written as (Bousquet et al., 2017):
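$$W_c(p_D, p_\theta) = \inf_{q_\phi(z|x)\,:\,q_\phi(z) = p(z)} \mathbb{E}_{p_D(x)}\,\mathbb{E}_{q_\phi(z|x)}\big[c\big(x, g_\theta(z)\big)\big] \qquad (1)$$
where $c$ can be any distance metric, $g_\theta$ is the deterministic decoder, the outer expectation is taken over the data distribution $p_D(x)$, and the aggregate posterior $q_\phi(z)$ is constrained to match the user-defined distribution $p(z)$ (e.g., a unit Gaussian). WAE takes advantage of this convenient form of the Wasserstein distance.
The distribution constraint in (1), that $q_\phi(z) = p(z)$, cannot be enforced directly, and must be relaxed using a penalty function. Hence, (1) is written as the following unconstrained optimisation problem, the WAE objective (Tolstikhin et al., 2018):
$$\min_{\phi, \theta}\; \mathbb{E}_{p_D(x)}\,\mathbb{E}_{q_\phi(z|x)}\big[c\big(x, g_\theta(z)\big)\big] + \lambda\, D\big(q_\phi(z), p(z)\big) \qquad (2)$$
where $D$ is a divergence and $\lambda$ (a hyperparameter) controls the trade-off between minimising the expected distortion and attempting to enforce the constraint.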
We now discuss two motivations for stochastic encoders. Firstly, it was shown that fixed-variance stochastic encoders can be used to prevent overfitting when limited data is available (Braithwaite and Kleijn, 2018). Secondly, stochastic encoders explicitly prioritise the expression of local structure in the data domain in the latent structure, improving generalisation. We expand on this in section 3.
One approach for implementing stochastic encoders is to use fixed additive noise (Braithwaite and Kleijn, 2018). In this case, for a given input $x$, the latent code has the form $z = f_\phi(x) + \epsilon$, where $f_\phi$ is a deterministic encoder function, $\epsilon \sim p(\epsilon)$, and $p(\epsilon)$ is a user-defined distribution. WAEs with stochastic encoders can also be implemented (Rubenstein et al., 2018a) using the reparameterisation trick (Kingma and Welling, 2014; Rezende et al., 2014). In this case, the encoder distribution is a diagonal Gaussian and both its mean and variance are functions of $x$. A disadvantage of the latter approach is that, in practice, the variance of the encoder distribution can decay to 0 (Rubenstein et al., 2018a), removing any noise in the latent layer. We chose to implement the stochastic encoders using the method of (Braithwaite and Kleijn, 2018), primarily because fixing the variance of the noise means it cannot decay to 0. Additionally, the method of (Braithwaite and Kleijn, 2018) is simpler than the approach of (Rubenstein et al., 2018b) and, as discussed below, allows the mutual information between the data and the latent representation to be explicitly controlled.
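As an illustration of the fixed additive-noise encoder described above, the following is a minimal sketch in plain Python. The names `stochastic_encode` and `toy_encoder`, the linear toy mapping, and the choice of Gaussian noise are assumptions made for this example only; they are not taken from the paper's implementation.

```python
import random


def stochastic_encode(encode, x, noise_std, rng=random):
    """Return encode(x) plus zero-mean Gaussian noise with a fixed standard deviation."""
    code = encode(x)
    return [c + rng.gauss(0.0, noise_std) for c in code]


def toy_encoder(x):
    # Hypothetical deterministic encoder: a fixed linear map from 3-D data to 2-D codes.
    return [0.5 * x[0] - 0.2 * x[1], 0.1 * x[1] + 0.3 * x[2]]


rng = random.Random(0)
x = [1.0, 2.0, 3.0]
noisy_code = stochastic_encode(toy_encoder, x, noise_std=0.1, rng=rng)
clean_code = stochastic_encode(toy_encoder, x, noise_std=0.0, rng=rng)
```

Because the noise level is a fixed argument rather than an output of the encoder, it cannot shrink during training, which is the property the text relies on.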
A constrained optimisation problem for WAE with Gaussian stochastic encoders can now be formulated. Let the encoder noise be zero-mean Gaussian with a variance that is a hyperparameter. Additionally, let the prior be a unit Gaussian, as is commonly done. Then, an extended version of WAE's objective (2) that includes stochastic encoders can be written as:
(3)  
subject to 
We now introduce concepts relating to source and channel coding. Let $\mathcal{X}$ be a set of datapoints; then the function $f: \mathcal{X} \to \mathcal{C}$ is a code, where $\mathcal{C}$ is an alphabet of codewords. For $x \in \mathcal{X}$, $f(x)$ is the codeword associated with $x$. A communication channel is defined as $p(\hat{z}|z)$, where $\hat{z}$ is the output of the channel. The objective of channel coding is to minimise the overall distortion (e.g., L1 or L2 error) between the input and reconstructed data points after the channel. The channel capacity, $M$, is the theoretical maximum amount of information that can be transmitted through the channel. Only in the case of infinite delay does the optimal code achieve the maximum channel capacity. In the case of finite delay, the optimal code will not necessarily reach this bound. To compute $M$, we first construct the joint distribution $p(z, \hat{z}) = p(\hat{z}|z)\,p(z)$; then $M$ is given by:
$$M = \max_{p(z)} I(Z; \hat{Z}) \qquad (4)$$
under a constraint on the power of transmission, i.e., the summed variance of each dimension of $Z$ is constrained to a fixed value, a hyperparameter. Since $I(Z; \hat{Z}) = h(\hat{Z}) - h(\hat{Z}|Z)$, for a fixed channel $p(\hat{z}|z)$, maximising the mutual information corresponds to choosing $p(z)$ such that the channel output has maximum entropy. Therefore, for infinite delay, the optimal distribution of codewords is a symmetric Gaussian.
A stochastic WAE can be interpreted as transmitting a code, given by the deterministic encoder function, through an AWGN channel. The channel adds zero-mean Gaussian noise to the code, where the noise variance is a hyperparameter. Hence, for a fixed transmission power, increasing the noise variance decreases the theoretical channel capacity, limiting the number of bits that can be transmitted by the latent layer. We denote this model the constrained Wasserstein Autoencoder (cWAE), as it is information-rate limited. cWAE is equivalent to the Bounded Information Rate Variational Autoencoder.
cWAE attempts to enforce a Gaussian distribution on the output of the encoder, which in the infinite-delay case would correspond to the optimal specification of the latent code. However, this situation has finite delay, and as a consequence, this attempt to specify the shape of the aggregate posterior causes a reduction in reconstruction and generative performance. We discuss the disadvantages of cWAE in the following section.
3 Drawbacks of cWAE
In this section, we introduce and discuss the drawbacks of cWAE. Primarily, this section looks at how the specification of a desired latent distribution causes a higher expected distortion at the output.
We first build on concepts introduced in the previous section with the following definitions. A source coder compresses the input dataset, removing redundancy to express data points in as few bits as possible. On the other hand, a channel coder introduces redundancies to make the codewords robust to transmission over a noisy communication channel. Lastly, a joint source-channel coder performs both source and channel coding simultaneously. The separation theorem (Shannon, 1948) proves that source and channel codes can be optimised independently. However, this theorem relies on infinitely long codes, something that does not hold in practice. Consequently, using a joint source-channel coder can lead to lower expected distortion compared with performing source and channel coding individually.
We now return to the interpretation of cWAE as transmitting latent codes across a communication channel and minimising the expected distortion. In this interpretation, the deterministic encoder function represents the code and the addition of noise represents its noisy transmission through the channel, where the noise follows a zero-mean distribution defined by the user. Hence, cWAE is a joint source-channel coder. Learning binary joint source-channel codes has previously been explored (Choi et al., 2018).
Typically, cWAE attempts to enforce a Gaussian distribution on its aggregate posterior, maximising the amount of information being transmitted across the channel for a given restriction on the transmission power. In the case of infinite delay, this would correspond to the optimal distribution of codewords. However, the situation we are considering is not infinite-delay, but rather finite-delay. For transmission over an AWGN communication channel with finite delay, a Gaussian distribution at the output of the encoder is optimal (results in minimum mean square error) only if the input data are Gaussian. In general, the optimal distribution of codewords is dependent on the data distribution (Akyol et al., 2010). In the following arguments, we first decompose the mean reconstruction cost used as the objective function for cWAE, revealing why stochastic encoders cause preservation of data-domain connectivity in the learned representation. These observations then lead to the result that the latent representation depends on the data distribution and, hence, to the conclusion that attempting to enforce a distribution on the aggregate posterior negatively affects reconstruction and generative performance.
Proposition 1.
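Let $f_\phi$ and $g_\theta$ be differentiable functions, and let $p(\epsilon)$ be a zero-mean distribution with uncorrelated components of variance $\sigma^2$. The encoding operation is defined as $z = f_\phi(x) + \epsilon$ with $\epsilon \sim p(\epsilon)$, and the decoding operation is $\hat{x} = g_\theta(z)$. Then, for sufficiently small $\sigma$:
$$\mathbb{E}_{p(\epsilon)}\big[\|x - g_\theta(f_\phi(x) + \epsilon)\|_2^2\big] \approx \|x - g_\theta(f_\phi(x))\|_2^2 + \sigma^2\,\big\|J_{g_\theta}(f_\phi(x))\big\|_F^2 \qquad (5)$$
where $J_{g_\theta}$ is the Jacobian of the decoder function $g_\theta$.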
Proposition 1 (proof in section A.1 of the appendix) shows that minimising the mean square reconstruction cost under the imposed conditions results in an objective regularised by the Jacobian of the decoder. A similar idea has been previously explored twice. The first case is the denoising autoencoder (Vincent et al., 2008), where adding noise in the input domain results in an objective similar to (5), but one that penalises the Jacobian of the full encoder-decoder mapping with respect to the input, instead of the Jacobian of the decoder alone (Bishop, 1995). The second case is the Contractive Autoencoder (CAE) (Rifai et al., ), which explicitly adds the Frobenius norm of the encoder's Jacobian to the standard autoencoder objective. The authors of CAE find that this regularisation results in a learned representation that better represents the data and is robust to perturbations in the input domain.

The squared Frobenius norm of a matrix is equal to the sum of its squared singular values. Additionally, the Lipschitz norm of a matrix is given by its largest singular value. Therefore, for a matrix $A$, we have the following:
$$\|A\|_2 = \max_i s_i(A) \le \sqrt{\sum_i s_i(A)^2} = \|A\|_F$$
where $s_i(A)$ is the $i$th singular value. Consequently, the Frobenius norm of the decoder's Jacobian is an upper bound on the Lipschitz value of the local transform defined by that Jacobian. This analysis affords the interpretation of the objective as regularising the local Lipschitz value of the decoder in the neighbourhood of each latent code. The expressive power of the decoder is restricted by this regularisation. This argument also demonstrates why VCAE preserves data-domain connectivity in the latent representation: optimising the mean squared error resulting from a small noise addition in $Z$ means that the system attempts to make a neighbourhood of any realisation of $Z$ correspond to a neighbourhood of the associated data point. The final result is that nearby points in $Z$ are nearby points in $X$ and that the mapping is, therefore, smooth. If the latent distribution is left entirely unconstrained, then its variance will grow without bound. Hence, at minimum, a constraint on the variance is required. If both the noise distribution and the variance of the latent codes are specified, then a well-defined optimisation problem for the latent representation is obtained.
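The decomposition behind Proposition 1 can be checked numerically in the special case of a linear decoder, where the first-order expansion is exact: for a decoder that multiplies the code by a matrix `W`, the expected noisy squared error equals the clean squared error plus the noise variance times the squared Frobenius norm of `W`. The sketch below is only an illustrative check under that linearity assumption; the matrix, data point, and code are hypothetical values, and isotropic Gaussian noise is assumed.

```python
import random


def matvec(W, z):
    """Multiply matrix W (list of rows) by vector z."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(c * c for c in v)


def frobenius_sq(W):
    """Squared Frobenius norm of a matrix."""
    return sum(w * w for row in W for w in row)


def noisy_reconstruction_error(W, x, z, noise_std, n_samples, rng):
    """Monte Carlo estimate of E||x - W(z + eps)||^2 with eps ~ N(0, noise_std^2 I)."""
    total = 0.0
    for _ in range(n_samples):
        z_noisy = [c + rng.gauss(0.0, noise_std) for c in z]
        x_hat = matvec(W, z_noisy)
        total += sq_norm([a - b for a, b in zip(x, x_hat)])
    return total / n_samples


rng = random.Random(0)
W = [[1.0, 0.5], [-0.3, 2.0], [0.7, 0.1]]  # hypothetical linear decoder (3-D data, 2-D code)
x = [1.0, -1.0, 0.5]                       # hypothetical data point
z = [0.4, -0.2]                            # hypothetical latent code for x
noise_std = 0.1

clean_error = sq_norm([a - b for a, b in zip(x, matvec(W, z))])
jacobian_penalty = noise_std ** 2 * frobenius_sq(W)  # the Jacobian of a linear decoder is W
estimate = noisy_reconstruction_error(W, x, z, noise_std, 50_000, rng)
```

For a non-linear decoder the same comparison would only hold approximately and for small noise, as stated in the proposition.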
Preservation of data-domain connectivity in the latent representation necessarily means that the structure of the aggregate posterior depends on the data distribution. Therefore, when a prespecified distribution is enforced on the aggregate posterior, the resultant learned shape will be a compromise between the prespecified distribution and what is optimal for minimising the mean reconstruction cost. This compromise degrades both reconstructive and generative performance. Reconstruction performance is affected because the mean reconstruction cost is not minimised. On the other hand, generative performance is affected because the latent distribution is not equal to what was enforced. Thus, when sampling from the model, an incorrect latent distribution is assumed.
4 Variance Constrained Autoencoder
In this section, we introduce the Variance Constrained Autoencoder (VCAE), a generative model with the same structure as cWAE, but which applies a variance constraint to the aggregate posterior, rather than constraining its shape. The motivation for this change in constraints is the argument presented in section 3, where we discuss issues with enforcing a shape on the latent distribution when using stochastic encoders.
The Variance Constrained Autoencoder (VCAE) is made up of two probabilistic mappings, the encoder $q_\phi(z|x)$ and the decoder $p_\theta(x|z)$, implemented by neural networks with parameters $\phi$ and $\theta$, respectively. The probabilistic encoder is implemented by adding noise to a deterministic mapping: $z = f_\phi(x) + \epsilon$, where $\epsilon$ is drawn from a user-defined distribution $p(\epsilon)$. It is common to use the mean of the decoder as output (Braithwaite and Kleijn, 2018; Tolstikhin et al., 2018), denoted $g_\theta(z)$, and we do so here as well. The aggregate posterior and generative distributions are defined as before.
The principle of VCAE is to maximise the likelihood of the data while constraining the variance of the aggregate posterior. This is in contrast to WAE (and BIRVAE), where the aggregate posterior is regularised to be a prespecified distribution. We write VCAE's objective as:
(6)  
subject to 
Here, the expectation is taken over the data distribution, and the desired total variance of the aggregate posterior is a hyperparameter. We relax the constraint in (6) using a penalty function, giving:
(7) 
an unconstrained optimisation problem where the penalty weight is a hyperparameter controlling the trade-off between maximising the likelihood and approximating the variance constraint. The variance penalty is computed per batch.
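A per-batch penalised loss of this kind can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names are hypothetical, the likelihood term is represented by a mean squared error (which corresponds to a symmetric Gaussian decoder up to constants), and the penalty is assumed to be the absolute gap between the batch's total variance and the target, since the exact penalty function is not reproduced here.

```python
def total_variance(codes):
    """Sum over latent dimensions of the per-dimension variance within the batch."""
    n = len(codes)
    dims = len(codes[0])
    total = 0.0
    for d in range(dims):
        column = [code[d] for code in codes]
        mean = sum(column) / n
        total += sum((value - mean) ** 2 for value in column) / n
    return total


def vcae_batch_loss(inputs, reconstructions, codes, target_variance, penalty_weight):
    """Reconstruction error plus a weighted penalty on the batch variance gap (assumed form)."""
    mse = sum(
        sum((a - b) ** 2 for a, b in zip(x, x_hat))
        for x, x_hat in zip(inputs, reconstructions)
    ) / len(inputs)
    variance_gap = abs(total_variance(codes) - target_variance)
    return mse + penalty_weight * variance_gap


# Toy batch: two 2-D inputs, their reconstructions, and their 2-D latent codes.
inputs = [[0.0, 1.0], [2.0, 3.0]]
reconstructions = [[0.0, 0.0], [2.0, 3.0]]
codes = [[1.0, 0.0], [-1.0, 0.0]]
loss = vcae_batch_loss(inputs, reconstructions, codes, target_variance=2.0, penalty_weight=3.0)
```

In this toy batch the total variance of the codes is 1.0, so with a target of 2.0 the penalty contributes 3.0 on top of a reconstruction error of 0.5.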
We can similarly view VCAE as transmitting datapoint encodings over a noisy communication channel. The code is given by the deterministic encoder function and the channel is defined by the choice of the noise distribution, where the output from the channel is the code plus noise. Therefore, as for cWAE, this affords the interpretation of VCAE as a joint source-channel coder. However, in this case, only the variance of the aggregate posterior is constrained. This is in contrast to cWAE, which restricts the shape to be that of a predefined distribution. This change in constraints means that the learned latent distribution is no longer a compromise between the desired prior and the optimal distribution of the joint source-channel code that minimises the expected distortion. Consequently, for VCAE, a lower distortion can be achieved.
For WAE, the aggregate posterior is regularised to be the user-defined prior, which theoretically allows for easy sampling from the generative model. In the case of VCAE, the aggregate posterior is not known and therefore sampling from the trained generative model is not directly possible. To facilitate sampling, a chain of normalising flows transforms the aggregate posterior into a known distribution (selected by the user before training). Since normalising flows are invertible, this transform can be undone before the decoder. Consequently, this operation does not affect the training of the encoder or decoder, and the flows can thus be trained as a subsequent step. Moreover, if sampling is not required, then these normalising flows need not be trained.
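Let $p(u)$ be a user-defined distribution (e.g. unit Gaussian), which will be transformed into our aggregate posterior using a chain of normalising flows. Let $\{f_1, \dots, f_K\}$ be a set of invertible and continuous functions. Denote $z = f_K \circ \dots \circ f_1(u)$, where $u \sim p(u)$; this induces the p.d.f. $q_\psi(z)$ (where $\psi$ represents the parameters of the $f_k$'s), which we can write as:
$$\log q_\psi(z) = \log p(u) - \sum_{k=1}^{K} \log\left|\det \frac{\partial f_k}{\partial u_{k-1}}\right| \qquad (8)$$
with $u_0 = u$ and $u_k = f_k(u_{k-1})$. We wish to optimise the functions $f_k$ so that $q_\psi(z)$ matches the aggregate posterior, something that can be achieved by maximising the log-likelihood of samples mapped inversely through the normalising flows, given by:
$$\max_{\psi}\; \mathbb{E}_{q_\phi(z)}\big[\log q_\psi(z)\big] \qquad (9)$$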
We choose normalising flows to implement this transformation because they are well defined and provide a convenient method for constructing an invertible transform from a known p.d.f. to the unknown p.d.f. that describes our learned latent representation. However, it is important to note that the VCAE system is not restricted to normalising flows, and any method of approximating the aggregate posterior can be used. Figure 1 is a diagram of the VCAE architecture and Algorithm 1 (given in section B of the supplementary material) describes an implementation of the VCAE, where we assume the decoder distribution is a symmetric Gaussian, permitting the use of the mean squared error (MSE) at the decoder output.
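To make the change-of-variables idea concrete, the sketch below uses the simplest possible flow: a single one-dimensional affine map from a standard normal base variable to the latent code. This is a stand-in chosen for brevity, not the flow architecture used in the paper; the function names are hypothetical. For this flow, maximising the log-likelihood of the latent samples has a closed-form solution (the sample mean and standard deviation).

```python
import math


def standard_normal_logpdf(u):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (u * u + math.log(2.0 * math.pi))


def affine_flow_logpdf(z, scale, shift):
    """Log-density of z = scale * u + shift with u ~ N(0, 1), via change of variables."""
    u = (z - shift) / scale  # map the sample inversely through the flow
    return standard_normal_logpdf(u) - math.log(abs(scale))


def fit_affine_flow(samples):
    """Maximum-likelihood scale and shift for the affine flow (closed form)."""
    n = len(samples)
    shift = sum(samples) / n
    scale = math.sqrt(sum((s - shift) ** 2 for s in samples) / n)
    return scale, shift


def sample_latent(scale, shift, rng):
    """Draw a latent code by pushing base noise forward through the flow."""
    return scale * rng.gauss(0.0, 1.0) + shift


latent_samples = [1.0, 3.0]  # stand-in for encoder outputs collected after training
scale, shift = fit_affine_flow(latent_samples)


def mean_loglik(s, b):
    return sum(affine_flow_logpdf(z, s, b) for z in latent_samples) / len(latent_samples)
```

Codes drawn with `sample_latent` would then be passed through the trained decoder; the encoder and decoder themselves are untouched by fitting the flow, matching the two-stage procedure described above.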
While not immediately knowing the aggregate posterior is a limitation of VCAE, this issue is often also present for cWAE. In the case of cWAE, the aggregated posterior is regularised to be the prior. However, after training, the aggregate posterior is a compromise between the prior and the distribution of latent vectors that is optimal with respect to minimising the expected reconstruction error. Therefore, the aggregate posterior is also not known in the case of cWAE. This result is demonstrated experimentally in section C.4 of the supplementary material, and by the discrepancy between the FID scores for WAE with the assumed prior and approximated aggregate posterior (shown in table 2).

Similarly to cWAE, (4) shows that there exists an upper bound on the mutual information between the latent code and the channel output for VCAE. In the experimental section of this paper, we fix the noise distribution and the variance constraint, where the noise variance is user-defined. In this case, the upper bound on the information rate can be computed in closed form. For a fixed variance constraint, increasing the noise variance decreases the maximum information rate. On the other hand, decreasing the noise variance allows higher information throughput. However, setting the noise variance too low can allow overfitting (Braithwaite and Kleijn, 2018).
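For intuition about how the noise level limits the information rate, the sketch below evaluates the classical capacity of parallel AWGN channels. It assumes Gaussian noise with the same variance in every latent dimension and a total signal variance spread evenly across dimensions; the function name and argument names are hypothetical, and the expression is the textbook formula rather than a quotation of the paper's bound.

```python
import math


def awgn_capacity_bits(latent_dim, total_signal_variance, noise_variance):
    """Capacity in bits of latent_dim parallel AWGN channels with evenly spread power."""
    per_dim_signal = total_signal_variance / latent_dim
    return 0.5 * latent_dim * math.log2(1.0 + per_dim_signal / noise_variance)


low_noise = awgn_capacity_bits(latent_dim=2, total_signal_variance=2.0, noise_variance=1.0)
high_noise = awgn_capacity_bits(latent_dim=2, total_signal_variance=2.0, noise_variance=4.0)
```

With the signal variance held fixed, raising the noise variance lowers the capacity, which mirrors the trade-off between information throughput and overfitting discussed above.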
As previously mentioned, VCAE is a natural model for learning disentangled representations. This is because: 1) VCAE allows a flexible latent distribution; 2) using stochastic encoders enforces a smooth latent representation, in which local neighbourhoods of points in the data domain are maintained in the latent space. To facilitate disentanglement, a penalty term can be added to VCAE's objective which enforces independence between the different latent features. The penalty term used is the total correlation (TC), i.e., the KL divergence between the aggregate posterior and the product of its marginals, following (Kim and Mnih, 2018; Chen et al., 2018). To implement the TC penalty, we use the method of (Kim and Mnih, 2018). We denote VCAE equipped with the TC penalty as Total Correlation VCAE (TCVCAE). The impossibility result of (Locatello et al., 2019) states that disentanglement is, in general, impossible for factorised priors and that the reason many disentanglement methods work in practice is their implicit modelling assumptions. VCAE also enforces a factorised prior, but optimises for the preservation of data-domain connectivity in the latent space.
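The total correlation has a closed form when the joint distribution is Gaussian, which gives a quick way to see what the penalty measures. The sketch below computes it for two latent features under that Gaussian assumption, where it reduces to a function of the correlation coefficient. This is a deliberately simplified stand-in: the method of Kim and Mnih (2018) used in the paper estimates the TC with a discriminator rather than assuming Gaussianity, and the function name here is hypothetical.

```python
import math


def gaussian_total_correlation_2d(samples):
    """Total correlation (in nats) of two features, assuming they are jointly Gaussian."""
    n = len(samples)
    mean_a = sum(a for a, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in samples) / n
    var_b = sum((b - mean_b) ** 2 for _, b in samples) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in samples) / n
    rho_sq = cov * cov / (var_a * var_b)  # squared correlation coefficient
    return -0.5 * math.log(1.0 - rho_sq)


independent = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
correlated = [(1.0, 1.0), (-1.0, -1.0), (1.0, 0.5), (-1.0, -0.5)]
tc_independent = gaussian_total_correlation_2d(independent)
tc_correlated = gaussian_total_correlation_2d(correlated)
```

Uncorrelated features give a total correlation of zero, while correlated features give a positive value, so minimising the penalty pushes the latent features towards independence.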
5 Related Work
VAE + NF (Rezende and Mohamed, 2015) and VAE + IAF (Kingma et al., 2016) are two extensions of VAE which apply normalising flows to the approximate posterior distribution during training, to allow a more flexible latent distribution. VCAE relies on normalising flows to facilitate sampling from the trained generative model. However, these flows can be trained as a secondary process and are only necessary when sampling is required; this is in contrast to VAE + NF/IAF, where the flows are always needed and must be trained along with the rest of the system.
Two recent disentanglement models, FactorVAE (Kim and Mnih, 2018) and TCVAE (Chen et al., 2018), extend VAE with a total correlation penalty term. These methods have been shown to perform well on disentanglement tasks. However, both methods maximise the ELBO, which can be seen as actively working against the task of disentanglement because the KL-divergence term is at a minimum when the data and the latent variables are independent. WAE has been used for disentanglement by enforcing a factorised prior (Rubenstein et al., 2018a). In addition, these methods all attempt to enforce a user-defined shape on the aggregate posterior. In section 6.4, we show that this constraint does not align with the objectives of disentanglement. TCVCAE is, therefore, a more principled disentanglement approach as it allows the shape of the aggregate posterior to vary.
(Ghosh et al., 2020) implement VAE's encoder deterministically, instead applying L2 regularisation, spectral normalisation (Miyato et al., 2018), or a gradient penalty (Gulrajani et al., 2017) to the decoder function. The proposed model was shown to outperform VAE and WAE in terms of reconstruction and generative performance. However, unlike VCAE, the method proposed in (Ghosh et al., 2020) still enforces a prior distribution, and does not consider the associated information-theoretic disadvantages.
Generative Latent Flow (GLF) (Xiao et al., 2019) was also developed concurrently with VCAE. In GLF, a deterministic autoencoder is trained and regularised by a normalising flow mapping from the AE latent space to a unit Gaussian. While GLF is similar in structure to VCAE, (Xiao et al., 2019) do not address the importance of latent noise or the relationships to cWAE and information theory, nor do they make a connection to disentanglement.
6 Experiments
In this section, we compare VCAE against VAE, VAE + IAF and WAE. A complete description of the experimental setup, including network architectures and hyperparameters, is given in Appendix B. A description of how the models are compared fairly is also given in the appendix. To summarise, we select the VCAE variance constraint such that it has the same maximum encoding channel capacity as cWAE. Additionally, we select hyperparameter settings for all models which result in the respective constraints being sufficiently enforced. Sample implementations for VCAE are available (see supplementary material for code).
We first give two toy examples in section 6.1, demonstrating the efficacy of VCAE over cWAE. Then, in section 6.2, we experiment with preventing overfitting using stochastic encoders. In section 6.3, we evaluate VCAE’s generative modelling performance. Next, in section 6.4, we equip VCAE with a total correlation penalty term (denoted TCVCAE) and evaluate its disentanglement performance. An auxiliary experiment in section C.4 of the supplementary material investigates the structure of the latent space for VCAE and cWAE on MNIST, before and after applying the normalising flows.
6.1 Toy Experiments
In this section, we present two toy experiments that show specific situations where VCAE outperforms cWAE. First, we consider a dataset generated from a mixture of four two-dimensional Gaussians (MoG) (shown in figure 1(a)) and projected into a 100-dimensional space. Secondly, we look at models trained on MNIST with a two-dimensional latent space. These toy examples demonstrate the need for and utility of VCAE. In both cases, attempting to enforce a Gaussian distribution on the aggregate posterior causes a degradation in performance.
Figures 1(b) & 1(c) show the representations learned by VCAE and cWAE, respectively, when trained on this MoG dataset. cWAE enforces a Gaussian distribution, but this comes at the cost of performance: cWAE and VCAE achieve 0.13 and 0.067 MSE, respectively.
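A dataset of the kind used in the first toy experiment can be generated along the following lines. This is only a sketch under assumed settings: the component centres, component standard deviation, random linear projection, and seed are all hypothetical choices, since the exact generation procedure is given in the paper's appendix rather than here.

```python
import random


def make_mog_dataset(n_points, data_dim=100, spread=4.0, component_std=0.5, seed=0):
    """Sample a four-component 2-D Gaussian mixture and project it linearly to data_dim."""
    rng = random.Random(seed)
    # Assumed layout: one component in each quadrant.
    centres = [(spread, spread), (spread, -spread), (-spread, spread), (-spread, -spread)]
    # Assumed projection: a fixed random linear map from 2-D to data_dim dimensions.
    projection = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(data_dim)]
    latents, data = [], []
    for _ in range(n_points):
        cx, cy = rng.choice(centres)
        point = (cx + rng.gauss(0.0, component_std), cy + rng.gauss(0.0, component_std))
        latents.append(point)
        data.append([row[0] * point[0] + row[1] * point[1] for row in projection])
    return latents, data


latents, data = make_mog_dataset(50)
```

Because the data lie on a two-dimensional subspace with four well-separated clusters, a two-dimensional latent space can represent them exactly only if it is allowed to be multi-modal, which is the situation the experiment probes.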
For the second experiment, we look at how these methods perform on the MNIST dataset, with two latent features to facilitate analysis. We find that a large regularisation parameter is required to ensure the desired distribution is enforced for cWAE. Moreover, this prior regularisation negatively affects performance. VCAE achieves training/testing reconstruction errors of 25.52/27.67, and cWAE achieves 30.09/31.50. Figures 1(d) & 1(e) show the latent representations learned by VCAE and cWAE, respectively. Histograms of the features learned by VCAE and cWAE are provided in section C.1 of the appendix. These histograms confirm that cWAE’s distribution is close to a unit Gaussian, while VCAE’s is not.
6.2 Overfitting
In this section, we trained VCAE, (c)WAE and VAE on the ReducedMNIST problem, a 600 element subset of the MNIST training data, to demonstrate that the use of stochastic encoders can prevent overfitting. For these experiments let dVCAE refer to VCAE with a deterministic encoder. The experimental setup for the following experiments is in section B.1 of the supplementary material.
Table 1 shows the results from these experiments, demonstrating that using stochastic encoders does indeed reduce the degree of overfitting, as both VCAE and cWAE improve on WAE and dVCAE. Additionally, VCAE further reduces the amount of overfitting when compared to an equivalent (c)WAE. Lastly, we see the benefit of fixed rather than variable latent noise by observing that VCAE and cWAE improve upon VAE. In this case, VAE had a larger degree of overfitting because the variance in the latent dimension can be driven to zero, maximising performance on the training set but reducing generalisation.
Model      Train Err  Test Err
VCAE ()    22.16      41.96
cWAE ()    18.06      44.56
VCAE ()    17.02      47.39
cWAE ()    15.19      47.39
dVCAE ()   16.56      53.16
WAE ()     14.79      58.50
VAE        10.45      53.32
6.3 Generative Quality: MNIST & CelebA
MNIST  CelebA  

Reconstruction (L2)  Samples (FID)  Reconstruction (L2)  Samples (FID)  
Train Err  Test Err  Train Err  Test Err  
VAE  6.16  8.72  23.20  23.03  85.45  100.16  58.84  54.19 
VAE + IAF  6.65  7.91  29.30  96.18  120.82  50.73  
cWAE  1.36  5.36  30.08  8.22  63.22  96.04  61.29  47.09 
cVCAE  1.36  5.15  7.68  62.76  94.40  43.14 
In this section, we compare the generative and reconstructive performance of VCAE, cWAE, VAE, and VAE + IAF on the two datasets MNIST and CelebA. These models are compared using training set errors, testing set errors, and Fréchet Inception Distance (FID) scores (Heusel et al., 2017). When reporting FID scores for WAE and VAE, we investigate two situations: one where the assumed prior is used, and the other where the true latent distribution is learned using normalising flows. The latter allows for a fair comparison with VCAE. Sections B.3 & B.4 of the appendix give the experimental setup for the MNIST and CelebA experiments, respectively. These sections contain results for different hyperparameter settings for both cWAE and VCAE.
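The FID score is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images. The sketch below evaluates that distance in the simplified case of diagonal covariances, where the matrix square root reduces to elementwise square roots. It is an illustration of the formula only: a real FID computation uses Inception network features and full covariance matrices, and the function name here is hypothetical.

```python
import math


def frechet_distance_diagonal(mean_a, var_a, mean_b, var_b):
    """Squared Fréchet distance between two Gaussians with diagonal covariances."""
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # For diagonal covariances, Tr(A + B - 2 (A B)^(1/2)) = sum (sqrt(a_i) - sqrt(b_i))^2.
    cov_term = sum((math.sqrt(va) - math.sqrt(vb)) ** 2 for va, vb in zip(var_a, var_b))
    return mean_term + cov_term


same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
rescaled = frechet_distance_diagonal([0.0], [4.0], [0.0], [1.0])
```

The distance is zero only when both the means and the variances agree, which is why sampling from an assumed prior that differs from the learned aggregate posterior shows up as a higher score.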
Table 2 shows a quantitative comparison between VCAE, cWAE, VAE and VAE + IAF. The results show that, for both MNIST and CelebA, VCAE achieves the lowest testing set (mean square) error and the lowest FID score (lower is better) of all the models. Figure 3 displays samples from each of the generative models trained on CelebA. A qualitative comparison of figure 3 shows that VCAE and WAE consistently produce higher quality samples than both VAE and VAE + IAF. In section C.3, nearest neighbours (in the training set) of generations for these models trained on CelebA are displayed, showing that overfitting has not occurred. Figure 7, which is found in section C.2 of the supplementary material, gives a qualitative comparison of the models trained on MNIST. Analysis of figure 7 yields the same results as were found for CelebA.
6.4 Disentanglement on Shapes3D
In this section, we evaluate VCAE's ability to learn disentangled representations when equipped with a total correlation penalty term. This extended model is called TCVCAE. As a reference system, we used cWAE equipped with the same total correlation penalty term, denoted TCcWAE. Additionally, we compare against the state-of-the-art FactorVAE (Kim and Mnih, 2018). A complete description of the setup for these experiments is given in section B.5 of the supplementary material. We evaluate these models on the 3DShapes (Burgess and Kim, 2018) dataset. The disentanglement metric used to give the following results is from (Kim and Mnih, 2018).
        TCVCAE  TCcWAE  FactorVAE
Error   3538    3531    3522
Score   0.93    0.58    0.93
Table 3 shows the final errors and disentanglement scores for the best performing models. TCVCAE and FactorVAE had very similar results, both outperforming TCcWAE. Careful examination of the latent traversals in figure 4 shows that both TCVCAE and FactorVAE have captured room orientation, wall hue and floor hue. TCcWAE fails to capture any of these factors. It is important to note that FactorVAE achieves a higher disentanglement score, of 1.0, in (Kim and Mnih, 2018) than what was obtained here. In (Locatello et al., 2019), the authors discuss that the initialisation is more important for successfully performing disentanglement than the hyperparameter configuration. In section C.5 of the appendix, we display results from five runs of these models, demonstrating the variation in performance.
In section C.6 of the supplementary material, further analysis is given in the form of histograms of the learned representations as well as a larger traversal. Examining the feature histograms shown in the supplementary material demonstrates that the features which are closer to Gaussian do not correspond to disentangled features. In general, the underlying factors of generation for a dataset will not be Gaussian. Consequently, attempting to enforce a Gaussian, or any other shape, on the aggregate posterior does not align with the task of disentanglement.
The objectives of both TCcWAE and FactorVAE enforce Gaussianity on the aggregate posterior, yet both models rely on Gaussianity not being enforced in order to disentangle. VCAE, on the other hand, does not enforce any shape on the aggregate posterior. Consequently, where the objectives of TCcWAE and FactorVAE are contradictory for disentanglement, VCAE’s objectives are not.
7 Conclusion
Enforcing a desired prior distribution on the aggregate posterior of a generative model such as the Wasserstein Autoencoder (WAE) facilitates sampling. However, when stochastic encoders are used, this latent distribution constraint negatively affects the model’s reconstruction and generative quality. This issue arises because achieving the minimum expected reconstruction error corresponds to a particular specification of the aggregate posterior. By attempting to enforce a predefined prior instead, the optimisation process must find a compromise between that prior and the optimal specification.
This paper proposed the Variance Constrained Autoencoder (VCAE), which only constrains the variance of the aggregate posterior rather than constraining its shape. This change in constraints means that the shape of the latent distribution is no longer regularised in a way that conflicts with the expected distortion. After training, the distribution of the aggregate posterior is not known. Therefore, to facilitate sampling from the trained VCAE, a chain of normalising flows can be optimised as a secondary stage, learning an invertible transform from a user-defined distribution to the aggregate posterior.
Our experimental results showed that VCAE outperforms VAE, VAE + IAF and cWAE in terms of reconstruction and generative performance on MNIST and CelebA. Moreover, VCAE is a more principled approach for learning disentangled representations as it does not assume a prior. Observing histograms of learned features from FactorVAE and TCcWAE demonstrated that a constraint on the latent distribution shape was counterproductive for disentanglement. This provides evidence that the objectives of TCcWAE and FactorVAE are contrary to disentanglement, whereas VCAE’s objective facilitates the task. When VCAE is equipped with a total correlation penalty term, it performs as well as FactorVAE for the task of disentanglement on 3D Shapes.
Acknowledgments
This research was funded by GN.
References
 Optimal mappings for joint source channel coding. In IEEE Workshop on Information Theory, Cited by: §3.
 Fixing a Broken ELBO. In ICML, Cited by: §1.
 Training with noise is equivalent to tikhonov regularization. Neural computation 7 (1), pp. 108–116. Cited by: §3.
 From optimal transport to generative modeling: the VEGANcookbook. arXiv preprint arXiv:1705.07642. Cited by: §2.
 Generating sentences from a continuous space. In CoNLL, Cited by: §1.
 Bounded Information Rate Variational Autoencoders. arXiv preprint arXiv:1807.07306. Cited by: §1, §1, §2, §2, §2, §4, §4.
 3D shapes dataset. Note: https://github.com/deepmind/3dshapesdataset/ Cited by: §1, §6.4.
 Isolating sources of disentanglement in vaes. In NeurIPS, Cited by: §1, §4, §5.
 Variational Lossy Autoencoder. In ICLR, Cited by: §1.
 NECST: neural joint sourcechannel coding. arXiv preprint arXiv:1811.07557. Cited by: §3.
 Elements of information theory. John Wiley & Sons. Cited by: §1.
 From variational to deterministic autoencoders. In ICLR, Cited by: §5.

 A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: Appendix B.
 Improved Training of Wasserstein GANs. In NeurIPS, Cited by: §5.
 GANs trained by a two timescale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §6.3.
 BetaVAE: Learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §1.
 Betavae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §1.
 Disentangling by factorising. In ICML, Cited by: §B.5, §B.5, Appendix B, §1, §4, §5, §6.4, §6.4.
 Improved variational inference with inverse autoregressive flow. In NeurIPS, Cited by: Appendix B, §5.
 Auto-Encoding Variational Bayes. In ICLR, Cited by: §1, §2.

 Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, Cited by: §4, §6.4.
 Spectral normalization for generative adversarial networks. In ICLR, Cited by: §5.

 Masked autoregressive flow for density estimation. In NeurIPS, Cited by: Table 5.
 Preventing posterior collapse with delta-VAEs. In ICLR, Cited by: §1.

 Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §1, §2.
 Variational inference with normalizing flows. In ICML, Cited by: §5.

 Contractive autoencoders: Explicit invariance during feature extraction. In ICML, Cited by: §3.
 Learning disentangled representations with wasserstein autoencoders. Cited by: §2, §5.
 Wasserstein autoencoders: latent dimensionality and random encoders. Cited by: §2.
 A mathematical theory of communication. Bell system technical journal 27 (3). Cited by: §3.
 Wasserstein AutoEncoders. In ICLR, Cited by: §B.4, Appendix B, Appendix B, §1, §1, §2, §2, §4.

 Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §3.
 Generative Latent Flow: a framework for non-adversarial image generation. arXiv preprint arXiv:1905.10485. Cited by: §5.
 InfoVAE: Balancing Learning and Inference in Variational Autoencoders. In AAAI, Cited by: §1.
Appendix A Proofs
A.1 Proof of Proposition 1
Proof.
Let and , the encoding and decoding operations respectively, be differentiable functions. Let be a zeromean distribution with variance . For sufficiently small , we can approximate the decoding operation as:
where is the Jacobian of the decoder at point . The reconstruction cost can therefore be written as:
To facilitate analysis of this objective, let and . Then:
is simply the square error between and . Now consider the second term, is the dot product between the th row of (denote as for simplicity) and :
The final term is zero because is a zeromean distribution. Therefore, as required, we have:
∎
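The steps of the proof can be written out as in the sketch below. The notation is assumed here for illustration and may differ from the symbols used in the main text: $f$ and $g$ denote the encoder and decoder, $\epsilon$ is the zero-mean latent noise with covariance $\sigma^2 I$ (isotropy is an assumption), $J$ is the Jacobian of $g$ evaluated at $f(x)$, and $J_i$ is its $i$-th row.

```latex
\begin{align}
g\big(f(x)+\epsilon\big) &\approx g\big(f(x)\big) + J\epsilon, \\
\mathbb{E}_{\epsilon}\,\big\lVert x - g\big(f(x)+\epsilon\big)\big\rVert^2
  &\approx \mathbb{E}_{\epsilon}\,\lVert a - b\rVert^2,
  \qquad a = x - g\big(f(x)\big),\quad b = J\epsilon, \\
  &= \lVert a\rVert^2 + \mathbb{E}_{\epsilon}\,\lVert b\rVert^2
     - 2\,a^{\top} J\,\mathbb{E}_{\epsilon}[\epsilon], \\
\mathbb{E}_{\epsilon}\,\lVert b\rVert^2
  &= \sum_i \mathbb{E}_{\epsilon}\big[(J_i\epsilon)^2\big]
   = \sum_i J_i\,\mathbb{E}_{\epsilon}\big[\epsilon\epsilon^{\top}\big]\,J_i^{\top}
   = \sigma^2\,\lVert J\rVert_F^2, \\
\mathbb{E}_{\epsilon}\,\big\lVert x - g\big(f(x)+\epsilon\big)\big\rVert^2
  &\approx \big\lVert x - g\big(f(x)\big)\big\rVert^2 + \sigma^2\,\lVert J\rVert_F^2 .
\end{align}
```

Under these assumptions the cross term vanishes because $\mathbb{E}_{\epsilon}[\epsilon] = 0$, leaving the noise-free square error plus a penalty on the Frobenius norm of the decoder Jacobian, scaled by the noise variance.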
Appendix B Experimental Setup
Table 4 (panels: MNIST, CelebA, 3D Shapes; columns: Encoder, Decoder): Description of neural network architectures used by the experiments in this paper. Conv refers to a 2D convolutional layer with parameters nf (number of filters), ks (kernel size) and s (stride). Similarly, TConv refers to a 2D transpose convolutional layer with the same parameters. BN refers to a batch normalisation layer and ReLU refers to the nonlinear Rectified Linear Unit activation function.
Table 5 (columns: MNIST, CelebA): Normalising flow architectures used for the MNIST and CelebA experiments.
In this section, we give a complete description of the setups for all experiments that were run in section 6. The models evaluated are VAE, VAE + IAF, WAE and VCAE. We chose VAE + IAF over VAE + NF because it uses a more powerful class of functions which perform better in higher dimensions.
The implementation of VCAE follows Algorithm 1, and the implementations of WAE, VAE and VAE + IAF follow their respective papers. In the case of WAE, we only consider the MMD-WAE, where the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) is implemented using the inverse multiquadratic (IMQ) kernel, , as used in Tolstikhin et al. (2018), where kernel parameter is given by . For each given experiment, all models use the same neural network structure, the only exception being VAE and VAE + IAF, where the encoder outputs additional information. In the case of VAE, a final linear layer of the encoder outputs two vectors, each with the dimensionality of the latent space, representing the mean and standard deviation of the approximate posterior. Additionally, in the case of VAE + IAF, the encoder outputs another vector of the same length (for a total of three), which is provided as an additional input to the normalising flows, as per the standard implementation (Kingma et al., 2016). Table 4 describes the neural network architectures used in each experiment. The encoder/decoder structure for MNIST and CelebA follows that of Tolstikhin et al. (2018), and the structure for 3D Shapes follows that of Kim and Mnih (2018).
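As an illustration of the MMD penalty mentioned above, the following is a minimal NumPy sketch of an unbiased squared-MMD estimate with an IMQ kernel of the form c / (c + ||a - b||^2). The function names `imq_kernel` and `mmd_imq` are hypothetical, and the kernel parameter `c` is left as a free argument rather than set to the value used in the experiments.

```python
import numpy as np


def imq_kernel(a, b, c):
    """Pairwise inverse multiquadratic kernel k(a, b) = c / (c + ||a - b||^2)."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return c / (c + sq_dists)


def mmd_imq(z_q, z_p, c):
    """Unbiased estimate of the squared MMD between two sample sets of shape (n, d)."""
    n = z_q.shape[0]
    m = z_p.shape[0]
    k_qq = imq_kernel(z_q, z_q, c)
    k_pp = imq_kernel(z_p, z_p, c)
    k_qp = imq_kernel(z_q, z_p, c)
    # Exclude the diagonal (self-similarity) terms from the within-sample averages.
    term_qq = (k_qq.sum() - np.trace(k_qq)) / (n * (n - 1))
    term_pp = (k_pp.sum() - np.trace(k_pp)) / (m * (m - 1))
    return term_qq + term_pp - 2.0 * k_qp.mean()
```

In a WAE-style setup, `z_q` would hold encoded latent codes for a batch and `z_p` an equally sized batch of samples from the prior; the estimate is close to zero when the two sets follow the same distribution.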
B.1 Overfitting
For these experiments, we used the encoder/decoder structure outlined in table 4 under the MNIST heading. To outline the effect of overfitting, we train on a reduced version of the MNIST dataset (we denote this dataset ReducedMNIST), which consists of a 600-sample subset of the training data. For VCAE we select , and . In the case of WAE, we select , and . A number of experiments were run with different choices for (for VCAE and WAE); these are , and (no noise).
During the training of these models, the Adam optimiser was used with an initial learning rate of and no learning rate schedule. A batch size of 200 was used in all cases.
B.2 Latent Space Analysis
For these experiments, we used the encoder/decoder structure outlined in table 4 under the MNIST heading. For VCAE we select , and . In the case of WAE, we select , and . We selected .
During the training of these models, the Adam optimiser was used with an initial learning rate of and no learning rate schedule. A batch size of 200 was used in all cases.
B.3 MNIST
In this section, we describe the setup for the experiments comparing the generative and reconstructive quality of VCAE, cWAE, VAE and VAE+IAF on MNIST.
B.3.1 Model Setup
For this experiment we use the encoder/decoder setup described under the MNIST heading in table 4. For all experiments we chose . In the case of VCAE and cWAE, we chose . The normalising flow architecture used for these experiments is given in table 5 under the MNIST heading.
During training, we used a batch size of 100 in all cases. For VCAE, WAE and VAE we used an Adam optimiser with an initial learning rate of , for VAE + IAF we used an Adam optimiser with the initial learning rate . For all experiments, the learning rate schedule was the same: after 30 epochs, cut the learning rate in half; after 50 epochs, reduce the learning rate by a factor of five. For VAE, WAE and VCAE, the encoder and decoder are trained for 100 epochs. VAE + IAF is trained for 200 epochs as it has a lower learning rate (because it does not converge for a larger learning rate) and extra parameters (the normalising flows).
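The step schedule described above can be sketched as a small function. The text does not say whether the second reduction is applied on top of the first, so the hypothetical `cumulative` flag below exposes both readings; the function name and the zero-based epoch indexing are also assumptions.

```python
def learning_rate(epoch, base_lr, cumulative=True):
    """Learning rate for a zero-based epoch index under the two-step schedule.

    cumulative=True: halve after 30 epochs, then divide that value by five
    after 50 epochs (base_lr / 10 overall). This reading is an assumption.
    cumulative=False: after 50 epochs use base_lr / 5 directly.
    """
    lr = base_lr
    if epoch >= 30:
        lr = base_lr / 2.0
    if epoch >= 50:
        lr = lr / 5.0 if cumulative else base_lr / 5.0
    return lr
```

For example, `learning_rate(40, base_lr)` returns half of `base_lr` under either reading, while the two readings differ only from epoch 50 onwards.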
When training the normalising flows as a secondary stage, we train for 100 epochs of the dataset using the Adam optimiser with an initial learning rate of and no learning rate schedule. The batch size used was 100.
B.3.2 Hyperparameter Selection
For VCAE and WAE the hyperparameter must be chosen; this parameter controls the trade-off between minimising output distortion and enforcing the variance or distribution constraint. To ensure a fair comparison, we would like to find a parameter setting that sufficiently enforces the desired constraint while performing as well as possible. In tables 6 & 7 we explore various settings of for VCAE and WAE respectively. Under both constraints, we should find that the sum of variances in the latent dimension should equal , hence, we choose the setting of for which the summed latent variance is approximately . The hyperparameters for VCAE are: 1) ; 2) . The hyperparameters for WAE are: 1) ; 2) .
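The variance-based part of this selection rule can be sketched as follows. This is an illustrative sketch only: the helper names `summed_latent_variance` and `select_lambda` are hypothetical, the latent codes are assumed to be available as arrays of shape (n_samples, n_latent), and the sketch considers only the variance criterion, not the reconstruction or FID performance that is also weighed in the selection.

```python
import numpy as np


def summed_latent_variance(codes):
    """Sum over latent dimensions of the per-dimension variance of the codes."""
    return float(np.sum(np.var(codes, axis=0)))


def select_lambda(codes_by_lambda, target):
    """Return the candidate whose codes have summed variance closest to the target.

    codes_by_lambda maps each candidate hyperparameter value to the latent
    codes produced by the model trained with that value.
    """
    return min(
        codes_by_lambda,
        key=lambda lam: abs(summed_latent_variance(codes_by_lambda[lam]) - target),
    )
```

In practice one would compute the codes on held-out data for each trained model and then compare the remaining candidates on error and FID.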
Train  Test  

0.05  1.35  5.19  17.29 
0.1  1.36  5.15  16.05 
0.5  1.47  5.01  16.40 
1.0  1.52  5.60  16.86 
1.5  1.56  5.76  16.70 
Error  FID  

Train  Test  
5  1.24  5.17  27.27  23.45 
10  1.27  5.15  26.26  19.85 
25  1.28  5.33  27.38  17.55 
50  1.36  5.56  30.08  16.51 
75  1.43  6.023  33.46  15.78 
100  1.48  6.26  33.57  15.67 
150  1.53  6.67  34.27  15.55 
B.4 CelebA
In this section, we describe the experimental setup for experiments performed on CelebA. We used a preprocessed version of the CelebA dataset, obtained via the following steps:

Take a pixel centre crop of each image.

Downscale each cropped image to pixels.
The CelebA data set is preprocessed in the same way for the experiments conducted in Tolstikhin et al. (2018).
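The two preprocessing steps can be sketched with NumPy as below. The crop and output sizes are left as arguments rather than fixed, the function names are hypothetical, and nearest-neighbour sampling is used purely as a dependency-free stand-in, since the interpolation method is not stated here.

```python
import numpy as np


def centre_crop(image, crop_size):
    """Crop an (H, W, C) array to a crop_size x crop_size window around its centre."""
    h, w = image.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    return image[top:top + crop_size, left:left + crop_size]


def downscale_nearest(image, out_size):
    """Resize a square (S, S, C) array to out_size x out_size by nearest-neighbour sampling."""
    s = image.shape[0]
    # Map each output pixel index to a source pixel index.
    idx = (np.arange(out_size) * s) // out_size
    return image[idx][:, idx]


def preprocess(image, crop_size, out_size):
    """Apply the centre crop followed by the downscaling step."""
    return downscale_nearest(centre_crop(image, crop_size), out_size)
```

An image library with proper anti-aliased resampling would normally replace `downscale_nearest` in a real pipeline.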
B.4.1 Model Setup
For these experiments we used the encoder/decoder setup described in table 4 under CelebA. For all experiments we chose . In the case of VCAE and cWAE, we chose . The normalising flow architecture used for these experiments is given in table 5 under the CelebA heading.
During training, we used a batch size of 100 in all cases. For VCAE, WAE and VAE we used an Adam optimiser with an initial learning rate of , for VAE + IAF we used an Adam optimiser with the initial learning rate . For all experiments, the learning rate schedule was the same: after 30 epochs, cut the learning rate in half; after 50 epochs, reduce the learning rate by a factor of five. For VAE, WAE and VCAE, the encoder and decoder are trained for 70 epochs. VAE + IAF is trained for 140 epochs as it has a lower learning rate (because it does not converge for a larger learning rate) and extra parameters (the normalising flows).
When training the normalising flows as a secondary stage, we train for 100 epochs of the dataset using the Adam optimiser with an initial learning rate of and no learning rate schedule. The batch size used was 100.
B.4.2 Hyperparameter Selection
For these experiments we must also select and which control the trade-off between minimising the reconstruction cost and enforcing the constraint on the aggregate posterior . To ensure a fair comparison, we would like to find a parameter setting that sufficiently enforces the desired constraint while performing as well as possible. In tables 8 & 9 we explore various settings of for VCAE and WAE respectively. Under both constraints, we should find that the sum of variances in the latent dimension should equal , hence, we choose the setting of for which the summed latent variance is approximately . The hyperparameters for VCAE are: 1) ; 2) . The hyperparameters for WAE are: 1) ; 2) .
Train  Test  

0.05  59.98  97.82  65.24 
0.1  60.26  96.63  64.64 
0.5  62.77  94.40  63.47 
Error  FID  

Train  Test  
100  60.63  97.78  61.99  73.52 
250  61.59  99.25  61.70  68.32 
500  61.24  98.81  59.05  66.93 
750  63.22  96.04  61.29  65.70 
1000  65.94  98.22  64.63  65.72 
B.5 Disentanglement on 3D Shapes
For these experiments, we follow the setup described in Kim and Mnih (2018). However, to make this paper self-contained, we reiterate the setup here. The encoder/decoder setup is described in table 4 under the 3D Shapes heading. The same structure is used for all experiments except for the implementation differences outlined at the start of this section.
The implementation of the total correlation (TC) penalty term requires an additional discriminator network, which consists of six fully-connected layers, each with 1000 hidden units and using the LeakyReLU ( ) activation. The discriminator network outputs two logits.
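To make the role of the discriminator concrete, the following NumPy sketch shows a forward pass through such a network together with a density-ratio estimate of the total correlation in the style of FactorVAE. Several details are assumptions made for illustration: the six layers are read as six hidden layers followed by a linear two-logit output, logit 0 is taken to mean "sampled from the aggregate posterior" and logit 1 "sampled from the product of marginals", the LeakyReLU slope is a required argument because its value is not given here, and all function names are hypothetical.

```python
import numpy as np


def init_discriminator(latent_dim, rng, hidden=1000, n_hidden_layers=6):
    """Random (weight, bias) pairs for an MLP ending in a two-logit output layer."""
    sizes = [latent_dim] + [hidden] * n_hidden_layers + [2]
    return [
        (rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def discriminator_logits(params, z, negative_slope):
    """Forward pass: LeakyReLU after every hidden layer, linear output."""
    h = z
    for w, b in params[:-1]:
        h = h @ w + b
        h = np.where(h > 0, h, negative_slope * h)
    w, b = params[-1]
    return h @ w + b


def permute_dims(z, rng):
    """Shuffle each latent dimension independently across the batch.

    The result approximates samples from the product of the marginals.
    """
    return np.stack([rng.permutation(z[:, j]) for j in range(z.shape[1])], axis=1)


def tc_estimate(params, z, negative_slope):
    """Density-ratio TC estimate: batch mean of (logit_0 - logit_1)."""
    logits = discriminator_logits(params, z, negative_slope)
    return float(np.mean(logits[:, 0] - logits[:, 1]))
```

During training the discriminator would be fitted to separate `z` from `permute_dims(z, rng)`, and `tc_estimate` would be added to the autoencoder loss with the TC weight; the sketch covers only the forward computations, not the optimisation.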
Each model was trained for a total of batches, with a batch size of 64. Six latent dimensions were used. Adding the total correlation penalty introduces another hyperparameter , which controls how strongly the constraint is enforced. For all experiments, we choose , as this was reported to be the optimal setting for FactorVAE (Kim and Mnih, 2018). For VCAE we choose and for cWAE we choose . For both VCAE and cWAE we choose .
Appendix C Auxiliary Experiments
C.1 Toy MNIST Example
In this section we give further figures to complement the analysis given in section 6.1. Figure 7 gives histograms of the latent features learned by cWAE and VCAE for the MNIST toy experiment. In this experiment, we chose . We progressively increased the regularisation parameter for cWAE, resulting in . Increasing resulted in a model that could not train.
C.2 Generative Modeling on MNIST
C.3 Nearest Neighbour Analysis on CelebA
C.4 Latent Space Analysis
In this section, we develop a better understanding of the learned latent distributions and the effects that applying a normalising flow transform has on them. We trained VCAE and WAE on MNIST with a two-dimensional latent space, as this facilitates visualisation of . Additionally, we train a chain of normalising flows to transform (selected to be a unit Gaussian) into for both models; we can then display the distribution inversely mapped through the normalising flow, denoted .
Figure 8 shows four latent embedding plots, showing both and for VCAE and WAE. We observed that, for both VCAE and cWAE, the learned latent distribution contains gaps between the different classes; in these regions, the decoder behaviour is undefined. The Gaussian representation for both models does not include these gaps, meaning we avoid sampling from these undefined regions.
C.5 Variability Analysis
Table 10 shows the error and scores from five runs of FactorVAE, TCVCAE and TCWAE. The results demonstrate how varied performance was for different random initialisations.
Run 1  Run 2  Run 3  Run 4  Run 5  

FactorVAE  
Error  3518.21  3517.20  3507.24  3521.96  3508.93 
DScore  0.78  0.90  0.60  0.93  0.67 
TCWAE  
Error  3518.62  3516.12  3517.69  3513.09  3522.99 
DScore  0.53  0.55  0.56  0.58  0.49 
TCVCAE  
Error  3538.53  3578.88  3549.36  3529.74  3538.55 
DScore  0.93  0.69  0.64  0.91  0.87 
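To summarise the spread across runs, the disentanglement scores in Table 10 can be reduced to a mean and sample standard deviation per model. The snippet below is a small illustrative calculation using only the DScore rows of the table; the variable names are arbitrary.

```python
import statistics

# DScore rows of Table 10, five runs per model.
d_scores = {
    "FactorVAE": [0.78, 0.90, 0.60, 0.93, 0.67],
    "TCWAE": [0.53, 0.55, 0.56, 0.58, 0.49],
    "TCVCAE": [0.93, 0.69, 0.64, 0.91, 0.87],
}

# Mean and sample standard deviation of the score for each model.
summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in d_scores.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.3f}, sample std={std:.3f}")
```

The mean scores are 0.776 for FactorVAE, 0.542 for TCWAE and 0.808 for TCVCAE. With only five runs per model these figures are indicative rather than conclusive.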
C.6 Traversal and Latent Space Analysis
In this section, we display further results from the disentanglement experiments presented in section 6.4. Figure 10 shows a histogram of each latent feature learned by TCVCAE, of which we first focus on features 1, 2 and 5, which correspond to the room orientation, wall hue and floor hue respectively. This correspondence can be seen by inspecting figure 8(a). The histograms show that features 1, 2 and 5 have distributions with 15, 10 and 10 modes respectively. The number of modes for each distribution corresponds to the number of settings of that parameter when generating the dataset. For example, there are 15 different settings of the room orientation in the dataset, which corresponds to the 15 modes in feature 1. Similarly, for wall and floor hue, there are ten possible settings in the dataset, again represented by ten modes in features 2 and 5 respectively. Careful inspection of the histograms and traversal in figures 12 & 8(c) respectively reveals that the same holds true for FactorVAE. However, the same does not hold true for TCWAE, as seen by the histograms and traversal in figures 11 & 8(b).
Figure 9 shows three enlarged traversals, one each for the best-performing TCVCAE, TCWAE and FactorVAE. These results demonstrate that both TCVCAE and FactorVAE can generalise to settings of orientation, wall hue and floor hue that were not present in the dataset: in this case, the traversals contain 30 settings.