1 Introduction
Due to rapidly evolving computational technologies, large amounts of unlabeled data are continuously generated. Considerable time, labor, and resources are dedicated to labeling, preprocessing, and transforming unlabeled data for real-world supervised machine learning applications. As an alternative, unsupervised representation learning algorithms extract meaningful and discriminative representations that are amenable to downstream tasks in the absence of labels. Unsupervised representation learning has found applications in several domains, such as computer vision Kim et al. (2019b); Wang et al. (2020); Lin et al. (2017); Jahanian et al. (2021); Kim et al. (2021), medical image analysis Tang et al. (2017); Yadav et al. (2021); Kolyvakis et al. (2018), and natural language processing Han et al. (2021); Radford et al. (2018).
Deep generative models are widely used in unsupervised representation learning to learn informative representations of data and to parameterize the underlying data manifold, enabling them to generate new samples from the distribution. Widely used methods include density estimation using flow-based models Dinh et al. (2016); Zang and Wang (2020); Stypułkowski et al. (2019), generative adversarial networks (GANs) Tanaka and Aranha (2019); Goodfellow et al. (2014), and variational autoencoders (VAEs) Kingma and Welling (2013); Rezende et al. (2014). This paper focuses on autoencoder-type architectures, such as VAEs, that provide both representation and sample generation capabilities.
Algorithms and models for unsupervised representation learning are based on the manifold assumption, where the data is assumed to be generated by varying a set of parameters. Usually, the number of parameters is much smaller than the dimensionality of the data. For example, different facial images can be generated using a finite set of parameters such as lighting, skin color, expression, facial features, and hair. According to the manifold assumption, a set of samples $\{\mathbf{x}_n\}_{n=1}^{N}$ in a $D$-dimensional data space, where $\mathbf{x}_n \in \mathbb{R}^D$, is said to lie on or near a low-dimensional manifold with intrinsic dimensionality $d \ll D$, which refers to the minimum number of parameters necessary to capture the entire information content present in the representation Gong et al. (2019).
Principal component analysis (PCA) was one of the earliest methods to establish a relation between the data and a low-dimensional latent space. PCA provides a closed-form solution for determining the optimal latent space dimensionality. However, PCA is limited by linearity and imposes oversimplifying Gaussianity assumptions on the data distribution. Moreover, traditional approaches such as PCA are not scalable to large datasets and require algorithmic treatments (e.g., online PCA Cardot and Degras (2018); Boutsidis et al. (2014)) for big data. Deep generative models overcome the limitations of PCA, providing nonlinearity and scalability by using deep networks to parameterize the mapping to the latent space. In deep generative models, however, the dimensionality of the latent space is typically defined upfront for each dataset at design time. The design process may under- or over-provision the number of dimensions for the application at hand. If the dimensionality is not predefined, this parameter is usually determined using time- and resource-consuming cross-validation. A mismatch between the latent dimensionality and the intrinsic dimensionality affects the quality of data representation and sample generation Rubenstein et al. (2018a); Mondal et al. (2021); Rubenstein et al. (2018b). Studying and understanding the effects of dimensionality mismatch in the latent space thus becomes imperative for using deep generative models effectively.
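For concreteness, a minimal sketch of PCA's closed-form dimensionality criterion; the 95% threshold and the synthetic data are illustrative assumptions, not taken from this paper:

```python
# A sketch: estimating intrinsic dimensionality with PCA's closed-form
# explained-variance criterion (threshold chosen for illustration only).
import numpy as np

def pca_intrinsic_dim(X, var_threshold=0.95):
    """Return the number of principal components needed to explain
    `var_threshold` of the total variance of the data matrix X (N x D)."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data give the per-component variances.
    s = np.linalg.svd(X_centered, compute_uv=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)

# Example: noisy data on a 3-dimensional linear subspace of R^50.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 50)) \
    + 0.01 * rng.normal(size=(1000, 50))
print(pca_intrinsic_dim(X))  # prints 3
```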
Multiple studies have targeted improving reconstruction quality, but very few have tried to analyze the impact of latent dimensionality mismatch. Work on Wasserstein autoencoders highlights that a mismatch between the latent dimensionality and the true intrinsic dimensionality leads to an infeasible optimization objective Mondal et al. (2021, 2019). For deterministic encoders and decoders, a high-capacity bottleneck can cause curling of the manifold Tolstikhin et al. (2017), whereas a smaller bottleneck can cause lossy compression of the data and deteriorate representation quality De Boom et al. (2020). For deterministic encoders, studies have concluded that larger bottleneck dimensions are not always better Mondal et al. (2019); Tolstikhin et al. (2017). For VAEs, it has been theoretically proven that increasing the bottleneck capacity beyond the intrinsic dimensionality does not improve reconstruction quality Dai and Wipf (2019).
We propose a principled framework grounded in probabilistic modeling to identify the optimal data-specific latent dimensionality without adding new hyperparameters. To empirically motivate the proposed model, we performed experiments with a vanilla VAE Kingma and Welling (2019) and dpVAE Bhalodia et al. (2020) to analyze the impact of latent dimensionality on the representation learning and sample generation tasks (see Figure 1). The two models were trained on the MNIST dataset with varying latent dimensions, while all other parameters were kept the same. The Fréchet inception distance (FID) score was used as the generation metric, and mean squared error (MSE) as the representation metric. The FID and MSE curves indicate inferior performance when the bottleneck size is under-provisioned. MSE improves with the bottleneck size, but FID scores suffer at higher dimensions such as 64 and 128. The lower MSE and the increase in FID score at a large bottleneck size indicate overfitting and loss of generalization. We conclude that larger dimensions are not guaranteed to produce the best-performing models. Hence, we need automated ways of identifying the dimensionality mismatch and informing the model about the intrinsic dimensionality of a given dataset. Our contributions are as follows:

- Introduce relevance encoding networks (RENs): a framework that facilitates the training of VAEs using a unified formulation to parameterize the data distribution and detect latent-intrinsic dimensionality mismatch. The formulation is general and can be adapted to state-of-the-art VAE-based methods such as regularized VAEs. The framework also provides a PCA-like ordering of the latent dimensions that conveys the variance of each latent dimension supported by the data manifold.
- Derive the evidence lower bound (ELBO) for RENs in the case of vanilla VAEs and decoupled-prior VAEs (dpVAE) Bhalodia et al. (2020), which leverage an invertible bottleneck to improve the matching of the aggregate posterior with the latent prior.
- Use σVAE Rybkin et al. (2021) to calibrate the RENs decoder and reduce the need to tune the weight on the likelihood term of the VAE ELBO.
- Demonstrate the ability of relevance encoding networks to detect the relevant bottleneck dimensionality on three public image datasets without compromising representation and generation quality and with no additional hyperparameter tuning.
2 Related Work
The VAE Kingma and Welling (2019) is a latent variable model specified by an encoder, a decoder, and a prior distribution on the latent space. The encoder maps the input to the latent space (inference), while the decoder reconstructs the original input from the latent space (representation). The prior enables sample generation from a tractable probability distribution Doersch (2016). Several studies have suggested that using a learnable prior can improve the performance of VAEs and reduce the impact of dimensionality mismatch Xu et al. (2020, 2019); Tomczak and Welling (2018); Bhalodia et al. (2020). Dai and Wipf rigorously analyzed the VAE objective under various scenarios of dimensionality mismatch Dai and Wipf (2019). A critical conclusion (see Theorem 5 in Dai and Wipf (2019)) states that optimal reconstruction can be achieved when the latent bottleneck dimensionality matches the intrinsic dimensionality, and that increasing the bottleneck capacity may negatively impact the generation process.
Very few methods have demonstrated ways of identifying and handling the latent-intrinsic dimensionality mismatch. De Boom et al. illustrated the use of Generalized ELBO with Constrained Optimization (GECO) and the L0-augment-REINFORCE-merge (L0-ARM) gradient estimator Li and Ji (2019); Yin and Zhou (2018) to automatically shrink the latent bottleneck dimensionality of VAEs De Boom et al. (2020). The L0 norm was applied to a global binary gating vector that controlled the latent dimensionality. GECO was used to define a constraint on the reconstruction error, giving it more weight during optimization until the desired level of accuracy is reached; once the threshold is reached, narrowing the bottleneck is given priority. Kim et al. proposed relevance factor VAE Kim et al. (2019a), which infers relevance and disentanglement using total correlation. Although the models proposed by De Boom et al. and Kim et al. identify the inactive latent dimensions and eliminate them in the variational posterior distribution, the prior is still an isotropic standard normal, which can result in poor representation and generation quality. Heim et al. proposed the relevance determination in differential equations (Rodent) model Heim et al. (2019), which used automatic relevance determination (ARD) Tipping (1999); Bishop and Tipping (2013) priors to minimize the state size of an ordinary differential equation (ODE) and the number of nonzero parameters required to solve the problem using partial observations. They used a VAE-like architecture, where the encoder was a neural network and the decoder was an ODE solver. An isotropic Gaussian with an ARD prior was used in the latent space, and a point estimate was used for the variance of the prior distribution. The Rodent model formulation is not fully probabilistic and focuses only on solving ODEs.
For autoencoders, several studies (e.g., Rubenstein et al. (2018a, b)) have analyzed how deterministic and random encoder-decoder pairs perform in the presence of a latent-intrinsic dimensionality mismatch. Studies by Rubenstein et al. revealed that deterministic encoders start curling the manifold in the latent space when the latent dimensionality is higher than the intrinsic dimensionality, while random encoders fill the irrelevant dimensions with noise while encoding useful information in the latent space. Random encoders start behaving like deterministic encoders if the dimensionality is increased further. Both deterministic and random encoders exhibit poor sample generation performance as the volume of the holes in the latent space increases. Mondal et al. studied the effect of dimensionality mismatch in the case of deterministic autoencoders Mondal et al. (2019, 2021). Mathematically and empirically, Mondal et al. showed that having a fixed prior distribution, oblivious to the dimensionality of the true latent space, leads to optimization infeasibility, and proposed masked adversarial autoencoders (MAAE) Mondal et al. (2019) as a potential solution. MAAE introduces modifications to the autoencoder architecture to infer a mask at the end of the encoder that suppresses noisy latent dimensions.
Existing approaches that identify the relevant dimensions introduce more hyperparameters that must be tuned to identify the bottleneck size; hence, the complexity of finding the optimal latent dimensionality remains the same. Moreover, these methods treat relevance determination as a separate task, agnostic to the probabilistic formulation of deep generative models, making the solution less interpretable. The RENs framework facilitates the training of VAEs using a unified probabilistic formulation to parameterize the data distribution and detect latent-intrinsic dimensionality mismatch without adding new hyperparameters.
3 Background
Notation: We denote a set of $N$ observations in a $D$-dimensional data space as $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, $\mathbf{x}_n \in \mathbb{R}^D$, and their corresponding latent representations as $\mathbf{Z} = \{\mathbf{z}_n\}_{n=1}^{N}$. A representation learning model maps an observation $\mathbf{x}_n$ to an unobserved latent representation $\mathbf{z}_n$ in an $L$-dimensional latent space, where $L \ll D$. Hereafter, we use boldface lowercase letters to denote vectors, boldface uppercase letters to denote matrices, and non-bold lowercase letters to denote scalars.
Variational Autoencoders (VAEs): VAEs are latent variable models that learn data representations in an unsupervised way by matching the learned model distribution $p_\theta(\mathbf{x})$ to the true data distribution $p(\mathbf{x})$. The generative (i.e., decoder) and inference (i.e., encoder) models in VAEs are jointly trained to maximize a tractable lower bound on the marginal log-likelihood of the training data. The structure of the learned latent representation is controlled via imposing a prior distribution on the latent space, such as $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}_L)$.

$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)$  (1)

where $\theta$ denotes the generative model parameters, $\phi$ denotes the inference model parameters, and $q_\phi(\mathbf{z}|\mathbf{x})$ is the variational posterior distribution that approximates the true posterior $p_\theta(\mathbf{z}|\mathbf{x})$, with $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}^2_\phi(\mathbf{x}))\right)$, and $p_\theta(\mathbf{x}|\mathbf{z})$ the decoding distribution.
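A minimal sketch of the objective in Eq. 1, assuming a Bernoulli pixel likelihood and placeholder encoder/decoder networks (not the exact architectures used in this paper):

```python
# A sketch of the standard VAE ELBO with a Gaussian posterior, a standard
# normal prior, and a Bernoulli likelihood over flattened inputs.
import tensorflow as tf

def vae_elbo(x, encoder, decoder):
    # Inference model: q_phi(z|x) = N(mu, diag(sigma^2)).
    mu, log_var = encoder(x)
    eps = tf.random.normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * log_var) * eps            # reparameterization trick
    # Generative model: log p_theta(x|z) under a per-pixel Bernoulli.
    logits = decoder(z)
    log_px_z = -tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=logits),
        axis=-1)
    # Closed-form KL(q_phi(z|x) || N(0, I)).
    kl = 0.5 * tf.reduce_sum(tf.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return tf.reduce_mean(log_px_z - kl)            # maximize this lower bound
```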
dpVAE: Decoupled Prior for VAE: Maximizing the ELBO by optimizing the marginal log-likelihood does not guarantee good representations. With expressive generative models $p_\theta(\mathbf{x}|\mathbf{z})$, a VAE can ignore the latent representation and encode no information about the data in it, yet still maximize the ELBO Hoffman and Johnson (2016); Alemi et al. (2018); Chen et al. (2016); this is the information preference phenomenon of VAEs. It has been shown that data-driven (i.e., learned during training) priors help mitigate the information preference of VAEs Hoffman and Johnson (2016); Rosca et al. (2018); Xu et al. (2019). Specifically, dpVAE Bhalodia et al. (2020) decouples the latent space that performs the representation from the space that drives sample generation using a bijective mapping $f_\beta$ parameterized by the network parameters $\beta$, where $\mathbf{z}_0 = f_\beta(\mathbf{z})$. Affine coupling layers Dinh et al. (2016) are used to build a flexible bijection function by stacking a sequence of simple bijection blocks. For the $b$-th block with binary mask $\mathbf{m}_b$,

$\mathbf{z}_b = \mathbf{m}_b \odot \mathbf{z}_{b-1} + (1 - \mathbf{m}_b) \odot \left[\mathbf{z}_{b-1} \odot \exp\left(s_b(\mathbf{m}_b \odot \mathbf{z}_{b-1})\right) + t_b(\mathbf{m}_b \odot \mathbf{z}_{b-1})\right]$  (2)

where $\odot$ denotes element-wise multiplication, $s_b$ and $t_b$ are the scaling and translation networks of the $b$-th block, and the binary mask $\mathbf{m}_b$ partitions the dimensions of each block into those that pass through unchanged and those that are scaled and translated.
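A sketch of one such coupling block, following the Real NVP construction of Dinh et al. (2016); `scale_net` and `translate_net` are assumed to be small MLPs:

```python
# A sketch of one affine coupling block (Eq. 2).
import tensorflow as tf

def coupling_block(z, mask, scale_net, translate_net):
    """mask is a binary vector: masked entries pass through unchanged,
    unmasked entries are scaled and translated conditioned on the masked ones."""
    z_masked = mask * z
    s = scale_net(z_masked)          # log-scale, so exp(s) is strictly positive
    t = translate_net(z_masked)
    z_out = z_masked + (1.0 - mask) * (z * tf.exp(s) + t)
    # log|det Jacobian| of the block, needed for the change of variables.
    log_det = tf.reduce_sum((1.0 - mask) * s, axis=-1)
    return z_out, log_det
```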
Sigma Variational Autoencoders (σVAE): It is common practice to model the decoding distribution as a Gaussian with a constant variance representing the data noise (tuned as a hyperparameter). When using a fixed variance, a model with high variance will not retain enough information in the latent space to faithfully reconstruct samples, while a model with low variance will generate poor samples as the divergence term becomes weaker Alemi et al. (2018); Lucas et al. (2019). The σVAE Rybkin et al. (2021) model is a simple yet effective solution for calibrating the decoder variance by using a single learnable parameter $\sigma$ in $p_\theta(\mathbf{x}|\mathbf{z})$. The variance of the decoder is trainable and is learned along with the rest of the model parameters Rybkin et al. (2021). This formulation reduces the time and resources required to tune the variance for each model.
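A sketch of this calibration under a Gaussian likelihood with a single shared learnable log-variance; an assumption consistent with, but not necessarily identical to, the σVAE implementation:

```python
# A sketch of sigma-VAE-style decoder calibration: one learnable log-sigma
# shared across all pixels, trained alongside the other model parameters.
import numpy as np
import tensorflow as tf

log_sigma = tf.Variable(0.0, name="decoder_log_sigma")  # single scalar

def gaussian_nll(x, x_recon):
    """Negative log-likelihood of x under N(x_recon, sigma^2 I). The optimal
    weighting of the reconstruction term is absorbed into the learned sigma
    instead of a hand-tuned hyperparameter."""
    var = tf.exp(2.0 * log_sigma)
    return tf.reduce_sum(
        0.5 * ((x - x_recon) ** 2 / var + 2.0 * log_sigma + np.log(2.0 * np.pi)),
        axis=-1)
```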
Considering the performance merits of dpVAE and σVAE, all our experiments implement the VAE model with the decoupling architecture of dpVAE and the σVAE formulation. This dovetails with the RENs objective of reducing the VAE hyperparameters that require extensive tuning for each dataset, while improving sample generation and representation.
4 Relevance Encoding Networks
Larger latent bottleneck sizes do not guarantee better VAE performance (Figure 1). Therefore, to inform the model about the intrinsic dimensionality, we introduce an automatic relevance determination (ARD) Bishop and Tipping (2013); Tipping (1999) hyperprior $p(\boldsymbol{\alpha})$ over the latent space prior $p(\mathbf{z}|\boldsymbol{\alpha})$.
4.1 RENs Formulation
The ARD hyperprior regularizes the latent space to discover the relevant latent dimensions that are supported by the data, hence reducing the contribution of redundant dimensions. The ARD hyperprior provides the relevance of each dimension in the latent representation, where relevance is defined via precision (i.e., inverse variance). The ARD hyperprior pushes the precision of the spurious dimensions to infinity; thus, the variance of these dimensions is pushed to zero in the latent space. The latent prior is given by:

$p(\mathbf{z}|\boldsymbol{\alpha}) = \mathcal{N}\left(\mathbf{z}; \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1}\right)$  (3)

with the ARD hyperprior given as:

$p(\boldsymbol{\alpha}) = \mathrm{Gamma}\left(\boldsymbol{\alpha}; a\mathbf{1}_L, b\mathbf{1}_L\right) = \prod_{l=1}^{L} \mathrm{Gamma}(\alpha_l; a, b)$  (4)

Here, $\boldsymbol{\alpha}$ is the relevance of the latent dimensions, and $\mathbf{1}_L$ is an $L$-dimensional vector of ones. The concentration parameter $a$ and the rate parameter $b$ of the Gamma distribution are shared across all latent dimensions. The VAE prior now becomes:

$p(\mathbf{z}) = \int p(\mathbf{z}|\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})\, d\boldsymbol{\alpha}$  (5)
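A sketch of sampling from this hierarchical prior with TensorFlow Probability; the values of L, a, and b below are illustrative assumptions:

```python
# A sketch of the hierarchical prior in Eqs. 3-5 with TensorFlow Probability.
import tensorflow_probability as tfp
tfd = tfp.distributions

L, a, b = 16, 2.0, 2.0                                # illustrative values
alpha = tfd.Gamma(concentration=a, rate=b).sample(L)  # per-dimension precision
# p(z | alpha): zero-mean Gaussian whose variance is the inverse precision.
# Spurious dimensions acquire large precision, i.e., near-zero variance.
z_prior = tfd.MultivariateNormalDiag(scale_diag=alpha ** -0.5)
z = z_prior.sample()
```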
We introduce a relevance encoder to the VAE architecture that learns the variational posterior $q_\gamma(\boldsymbol{\alpha}|\mathbf{X}, \mathbf{Z})$, which approximates the true posterior $p(\boldsymbol{\alpha}|\mathbf{X}, \mathbf{Z})$. Here, $\gamma$ denotes the parameters of the relevance encoder network. The relevance of a latent dimension is a statistical property of the underlying latent distribution that is induced by the data distribution. Consequently, relevance cannot be estimated from a single sample; it instead requires access to a finite set of representative samples from the data and latent distributions. Hence, we formulate a set-input problem, where a set of instances is given as input and the relevance encoder parameterizes the relevance for the entire set with permutation invariance. Taking the complete dataset into consideration, the joint probability of the training data can be expressed as

$p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha}) = p(\boldsymbol{\alpha}) \prod_{n=1}^{N} p_\theta(\mathbf{x}_n|\mathbf{z}_n)\, p(\mathbf{z}_n|\boldsymbol{\alpha})$
and the probability of the training data is $p(\mathbf{X}) = \iint p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha})\, d\mathbf{Z}\, d\boldsymbol{\alpha}$. See Figure 2 and Figure 3 for the plate notation of the graphical model and the block diagrams of the architectures, respectively. Samples of the Gamma distribution are reparameterized to enable gradient flow and network training in the presence of probabilistic layers. The derivatives are computed using the implicit reparameterization approach Figurnov et al. (2018). This reparameterization is implemented in TensorFlow Probability (https://www.tensorflow.org/probability/api_docs/python/tfp/distributions/Gamma).
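A small sketch illustrating that gradients flow through TFP Gamma samples via implicit reparameterization:

```python
# TFP's Gamma distribution is fully reparameterized via implicit gradients
# (Figurnov et al., 2018), so samples are differentiable in the parameters.
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

concentration = tf.Variable(2.0)
rate = tf.Variable(1.0)
with tf.GradientTape() as tape:
    alpha = tfd.Gamma(concentration=concentration, rate=rate).sample(8)
    loss = tf.reduce_mean(alpha)   # any differentiable function of the samples
grads = tape.gradient(loss, [concentration, rate])  # non-None gradients
```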
Derivations for the ELBO can be found in Appendix A.1. The ELBO for REN with σVAE is:
$\mathcal{L}_{\sigma\text{VAE}} = \mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right] - \mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (6)
The ELBO for REN with dpVAE is:
$\mathcal{L}_{\text{dpVAE}} = \mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right] - \mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p_\beta(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (7)

where the decoupled prior is evaluated through the bijection via the change of variables, $p_\beta(\mathbf{z}|\boldsymbol{\alpha}) = \mathcal{N}\left(f_\beta(\mathbf{z}); \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1}\right)\left|\det \partial f_\beta(\mathbf{z})/\partial \mathbf{z}\right|$.
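A sketch of the three terms shared by Eqs. 6 and 7, assuming the posteriors are TFP distribution objects produced by the respective encoders; this is a simplification of the full training objective, not the exact implementation:

```python
# A sketch of the REN objective terms: reconstruction, expected KL of the
# latent posterior against the ARD prior, and KL of the relevance posterior
# against the Gamma hyperprior.
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def ren_elbo_terms(recon_nll, q_z, q_alpha, a, b):
    """recon_nll: per-sample -log p(x|z) (e.g., the calibrated Gaussian NLL);
    q_z: MultivariateNormalDiag posterior q(z|x);
    q_alpha: Independent(Gamma) posterior q(alpha|X, Z)."""
    alpha = q_alpha.sample()                        # implicit reparameterization
    p_z = tfd.MultivariateNormalDiag(scale_diag=alpha ** -0.5)
    kl_z = tfd.kl_divergence(q_z, p_z)              # KL(q(z|x) || p(z|alpha))
    p_alpha = tfd.Independent(
        tfd.Gamma(concentration=a * tf.ones_like(alpha),
                  rate=b * tf.ones_like(alpha)),
        reinterpreted_batch_ndims=1)
    kl_alpha = tfd.kl_divergence(q_alpha, p_alpha)  # KL(q(alpha) || p(alpha))
    # Negative ELBO (to minimize): reconstruction + both KL penalties.
    return tf.reduce_mean(recon_nll) + tf.reduce_mean(kl_z) + kl_alpha
```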
4.2 Network and Training Strategies
The ideal scenario would be feeding the relevance encoder all the training samples at once to generate a robust estimate of the relevance. However, using the entire dataset to estimate the relevance negatively impacts the scalability provided by stochastic gradient descent training for large datasets. We train RENs via stochastic gradient descent and alternating optimization with two batch sizes. Each batch in the training dataset is broken down into smaller batches. The alternating optimization proceeds in two steps: (i) the relevance encoder is kept fixed (i.e., not trainable), and the smaller batches are used to update the VAE encoder and decoder several times while keeping the $\boldsymbol{\alpha}$ obtained from the previous iteration fixed; (ii) the original large batch is used to update the entire network end-to-end (i.e., encoder, decoder, and relevance encoder), and the value of $\boldsymbol{\alpha}$ is updated. As mentioned in Section 4.1, we use a set formulation for the relevance encoder, wherein the relevance encoder is fed a set of data samples and their latent representations, and a single $\boldsymbol{\alpha}$ is inferred. The response of the relevance encoder should be invariant to the ordering of the samples in the given batch. Therefore, we use DeepSets Zaheer et al. (2017) to make the relevance encoder permutation invariant for a given batch (see Figure 3(c)). Using this alternating optimization and the DeepSets aggregator, the REN is encouraged to learn the global statistics of the latent representations induced by the data distribution. See Algorithm 1 for more details.
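A sketch of a DeepSets-style permutation-invariant relevance encoder; the layer sizes and the mean aggregator are assumptions for illustration, not the paper's exact architecture:

```python
# A sketch of a permutation-invariant relevance encoder: a per-sample network
# phi, a symmetric (mean) pooling over the batch, and a head rho that outputs
# one Gamma posterior over the L relevances for the whole set.
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

class RelevanceEncoder(tf.keras.Model):
    def __init__(self, latent_dim, hidden=128):
        super().__init__()
        self.phi = tf.keras.Sequential([            # applied to each (x, z) pair
            tf.keras.layers.Dense(hidden, activation="relu"),
            tf.keras.layers.Dense(hidden, activation="relu")])
        self.rho = tf.keras.Sequential([            # applied after aggregation
            tf.keras.layers.Dense(hidden, activation="relu"),
            tf.keras.layers.Dense(2 * latent_dim, activation="softplus")])

    def call(self, x, z):
        h = self.phi(tf.concat([x, z], axis=-1))    # [batch, hidden]
        pooled = tf.reduce_mean(h, axis=0)          # permutation-invariant pooling
        params = self.rho(pooled[None, :])[0]       # positive Gamma parameters
        conc, rate = tf.split(params, 2)
        return tfd.Independent(tfd.Gamma(conc, rate),
                               reinterpreted_batch_ndims=1)
```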
5 Experiments
5.1 Toy Datasets
We use the circle and one-moon datasets for proof-of-concept experiments, both of which exhibit an intrinsic dimensionality of one, parameterized by the radius. We generated three different datasets for the circle and one-moon distributions by varying the standard deviation of the zero-mean additive Gaussian noise to mimic data with different noise levels (a data-generation sketch appears at the end of this subsection). For all our experiments, we implemented the relevance encoder with the dpVAE Bhalodia et al. (2020) and σVAE framework Dai and Wipf (2019), hereafter referenced as RdpVAE. We tested the ability of the models to identify the intrinsic dimensionality, along with their sample reconstruction capacity and realistic sample generation capability. The RdpVAE model was compared with relevance factor VAE (RFVAE) Kim et al. (2019a) and masked adversarial autoencoders (MAAE) Mondal et al. (2019).
Figure 4 shows the results for the one-moon and circle datasets with additive Gaussian noise of zero mean and a standard deviation of 10% of the radius of the data manifold. Compared to RFVAE and MAAE, RdpVAE regularizes the latent space to discover the latent dimension relevance supported by the data, while achieving the lowest mean squared errors on the testing samples and suppressing the spurious dimension (Figures 4.2 and 4.6). For RdpVAE, the variance in the latent space is indicative of the relevance shown in the plots (Figures 4.1c and 4.5c). For one-moon, the relevance estimated by RdpVAE for z-dim1 and z-dim2 corresponds to the x- and y-axes in the latent space. Therefore, the x-axis with low relevance has a larger variance, and the y-axis with high relevance has a low variance, correctly capturing a low-dimensional manifold where the latent dimensionality equals the intrinsic dimensionality.
Although the performance of RFVAE (Figures 4.3 and 4.7) comes close to RdpVAE with the estimated relevance, it fails to generate good-quality samples because the relevance is not factored into the aggregate posterior, and the latent prior is still a standard normal distribution. MAAE (Figures 4.4 and 4.8) identifies the latent manifold but has a higher reconstruction error and generates bad-quality samples due to the weak nature of the regularization in the latent space; holes can be seen in the latent space (Figures 4.4b and 4.8b). Across all experiments, RdpVAE models provide a tighter distribution for the reconstruction error than the other methods. RdpVAE is consistently the best-performing model, even in the presence of higher noise levels that make learning the underlying one-dimensional manifold more challenging. Results with different noise levels can be found in Appendix A.3.
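For reference, the sketch below generates such toy data under the stated noise model; the unit radius is an illustrative assumption:

```python
# A sketch of the toy datasets: points on a circle or one-moon with zero-mean
# Gaussian noise whose standard deviation is a fraction of the radius.
import numpy as np

def make_circle(n, radius=1.0, noise_frac=0.10, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)          # one intrinsic parameter
    pts = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return pts + rng.normal(scale=noise_frac * radius, size=pts.shape)

def make_one_moon(n, radius=1.0, noise_frac=0.10, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, np.pi, n)                # upper half-circle
    pts = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return pts + rng.normal(scale=noise_frac * radius, size=pts.shape)
```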
5.2 Image Datasets
We experimented with three image datasets: MNIST, Fashion MNIST, and dSprites. Similar to the toy experiments, we compared the performance of RdpVAE with relevance factor VAE (RFVAE) Kim et al. (2019a) and masked adversarial autoencoders (MAAE) Mondal et al. (2019), with the addition of a VAE regularized with L0-ARM and GECO (henceforth referenced as GECO) De Boom et al. (2020). To set baselines for comparison, we implemented a vanilla VAE with the σVAE framework and dpVAE with the σVAE framework, sans the relevance. The supplementary material includes details on all the models' implementations, architectures, and hyperparameters. We consider the following quantitative metrics to evaluate and compare the models: (1) Fréchet Inception Distance (FID): this metric calculates the distance between feature vectors of the real and generated images Heusel et al. (2017); lower FID scores are better. (2) Mean Squared Error (MSE): the MSE values reported for reconstructed images are averaged over dimensions and sample size. (3) Latent Dimensionality ($\hat{d}$): the estimated latent bottleneck size, reported alongside the provisioned bottleneck size $L$ of the model.
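A sketch of the FID computation between two sets of feature vectors (e.g., Inception activations of real and generated images):

```python
# A sketch of the Frechet Inception Distance between two Gaussians fitted to
# feature vectors: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)).
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):      # discard tiny numerical imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2.0 * covmean))
```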
Although the ground-truth intrinsic dimensionality of the image datasets is not known, we use previous studies Kim et al. (2019a); Mondal et al. (2019, 2021); De Boom et al. (2020) as points of reference. The design choices of latent dimensions in the experiments are also motivated by the latent dimensions reported in the relevant literature. For MNIST, studies have reported latent dimensionalities between 7 and 10. Hence, we choose 16 as the base dimensionality, to provide the models with enough degrees of freedom to discover the relevant latent dimensions, and 32 as the over-provisioned model, to assess the impact of a significant mismatch with the intrinsic dimensionality. Similarly, for Fashion MNIST the choices were 32 and 64 for the base and over-provisioned dimensionalities, and for dSprites (known to have 6 factors of variation Matthey et al. (2017)) the choices were 10 and 15. The MAAE and GECO models provide the number of active dimensions, whereas the RFVAE model provides a relevance vector with values from 0 to 1. For RFVAE, we estimate the dimensionality by counting the dimensions with relevance values higher than the average of the vector. In the case of RdpVAE, the relevance estimated by the relevance encoding network is the inverse variance, as per Eq. 3. We compute the explained variance as the ratio of the variance in a single dimension to the sum of all the variances. The number of dimensions required to explain 95% of the variability is considered the detected latent bottleneck dimensionality (see the sketch after this discussion).
Table 1 summarizes the FID scores of randomly generated images and the latent dimensionality estimated by each model for all three datasets. The proposed model (RdpVAE) achieves lower FID scores on all three datasets compared to the other models, while estimating a bottleneck dimensionality in the same range as the other methods. RdpVAE also achieves better FID scores than the baseline models that do not perform relevance determination. This performance boost via relevance encoding bolsters our argument for the necessity of identifying and fixing the latent-intrinsic dimensionality mismatch. Similar to the findings from the toy experiments, the models that do not modify the posterior based on the learned relevance of the latent space exhibit inferior sample generation; this is reflected in the higher FID scores of RFVAE, GECO, and MAAE. The MAAE and GECO models show consistency in determining the effective bottleneck size for MNIST and dSprites, irrespective of the provisioned bottleneck size of the model. Although the RdpVAE model estimates the bottleneck size in the same range as MAAE and GECO and shows better generation capabilities, RdpVAE does not estimate the same size across configurations for the same dataset. This behavior could be attributed to the use of random samples in the minibatch used for updating the relevance encoder, which may not capture the entire data variation. Using a stratified sampling approach to generate the minibatch could be a potential solution.
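As referenced above, a sketch of this detection rule, reading variances off the learned relevances:

```python
# A sketch of the detected-bottleneck rule: variance is the inverse of the
# relevance (Eq. 3); count the dimensions explaining 95% of total variance.
import numpy as np

def detected_dim(alpha, threshold=0.95):
    """alpha: learned per-dimension relevance (precision) vector of length L."""
    variances = 1.0 / np.asarray(alpha)
    explained = np.sort(variances)[::-1] / variances.sum()  # PCA-like ordering
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)
```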
Table 2 summarizes the MSE on the testing (i.e., held-out) images for each model for all three datasets. While RdpVAE does not achieve MSE as low as some other models, its overall performance in conjunction with the FID scores indicates that RdpVAE models generalize well with the use of calibrated decoders and relevance encoders. The use of a fixed weight on the reconstruction term in models such as RFVAE, MAAE, and GECO is hypothesized to cause the lower MSE and higher FID scores. Figure 5 shows the reconstructed and generated images for the best-performing models across different bottleneck sizes for each method.
Table 1: FID scores (lower is better) of randomly generated images and the estimated latent bottleneck dimensionality $\hat{d}$ for each model, with the provisioned bottleneck size $L$ in parentheses (base and over-provisioned configurations per dataset).

MNIST                          Fashion MNIST                   dSprites
Model         FID      d̂      Model         FID      d̂       Model         FID      d̂
RdpVAE (16)   6.57     6       RdpVAE (32)   48.61    10       RdpVAE (10)   67.93    7
RdpVAE (32)   4.43     11      RdpVAE (64)   52.68    26       RdpVAE (15)   52.15    9
RFVAE (16)    184.38   9       RFVAE (32)    258.37   27       RFVAE (10)    144.89   8
RFVAE (32)    208.27   21      RFVAE (64)    272.87   52       RFVAE (15)    167.24   12
GECO (16)     29.21    11      GECO (32)     69.67    8        GECO (10)     97.86    5
GECO (32)     30.29    9       GECO (64)     67.21    10       GECO (15)     94.61    5
MAAE (16)     13.81    11      MAAE (32)     104.15   6        MAAE (10)     105.89   7
MAAE (32)     13.04    11      MAAE (64)     79.18    5        MAAE (15)     90.85    11
dpVAE (16)    4.62     16      dpVAE (32)    52.08    32       dpVAE (10)    61.99    10
dpVAE (32)    5.53     32      dpVAE (64)    52.38    64       dpVAE (15)    48.68    15
VAE (16)      8.83     16      VAE (32)      74.40    32       VAE (10)      82.667   10
VAE (32)      9.65     32      VAE (64)      70.18    64       VAE (15)      81.86    15
Table 2: MSE of reconstructed held-out images for each model, with the provisioned bottleneck size $L$ in parentheses.

MNIST                    Fashion MNIST            dSprites
Model         MSE        Model         MSE        Model         MSE
RdpVAE (16)   3.28       RdpVAE (32)   1.21       RdpVAE (10)   1.79
RdpVAE (32)   3.55       RdpVAE (64)   1.17       RdpVAE (15)   1.35
RFVAE (16)    1.82       RFVAE (32)    1.04       RFVAE (10)    3.66
RFVAE (32)    1.83       RFVAE (64)    1.00       RFVAE (15)    3.65
GECO (16)     1.22       GECO (32)     1.06       GECO (10)     2.45
GECO (32)     1.24       GECO (64)     1.17       GECO (15)     3.83
MAAE (16)     1.84       MAAE (32)     2.10       MAAE (10)     1.13
MAAE (32)     1.80       MAAE (64)     2.56       MAAE (15)     1.28
dpVAE (16)    3.15       dpVAE (32)    1.24       dpVAE (10)    1.83
dpVAE (32)    3.22       dpVAE (64)    1.19       dpVAE (15)    9.67
VAE (16)      4.59       VAE (32)      1.36       VAE (10)      2.95
VAE (32)      2.63       VAE (64)      1.35       VAE (15)      1.14
6 Conclusion
Latent dimensionality mismatch can have a detrimental effect on the performance of deep generative models such as VAEs. We have introduced relevance encoding networks (RENs) to identify this mismatch and inform the model about the relevant bottleneck size. The RENs framework facilitates the training of VAEs using a unified probabilistic formulation to parameterize the data distribution and detect latent-intrinsic dimensionality mismatch. A key feature of the RENs framework is that it does not require any extra hyperparameter tuning for relevance determination, and it provides a PCA-like ranking of the latent dimensions based on the learned, data-specific relevance. The proposed model is general and flexible enough to be incorporated into any state-of-the-art VAE-based model, including regularized variants of VAEs. Future directions include extending the formulation of RENs toward explainable VAEs.
References
Alemi et al. (2018). Fixing a broken ELBO. In International Conference on Machine Learning, pp. 159–168.
Bhalodia et al. (2020). dpVAEs: fixing sample generation for regularized VAEs. In Proceedings of the Asian Conference on Computer Vision.
Bishop and Tipping (2013). Variational relevance vector machines. arXiv preprint arXiv:1301.3838.
Boutsidis et al. (2014). Online principal components analysis. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 887–901.
Cardot and Degras (2018). Online principal component analysis in high dimension: which algorithm to choose? International Statistical Review 86(1), pp. 29–50.
Chen et al. (2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
Dai and Wipf (2019). Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789.
De Boom et al. (2020). Dynamic narrowing of VAE bottlenecks using GECO and L0 regularization. arXiv preprint arXiv:2003.10901.
Dinh et al. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
Doersch (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
Figurnov et al. (2018). Implicit reparameterization gradients. Advances in Neural Information Processing Systems 31.
Gong et al. (2019). On the intrinsic dimensionality of image representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3987–3996.
Goodfellow et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems 27.
Han et al. (2021). Unsupervised neural machine translation with generative language models only. arXiv preprint arXiv:2110.05448.
Heim et al. (2019). Rodent: relevance determination in differential equations. arXiv preprint arXiv:1912.00656.
Heusel et al. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30.
Hoffman and Johnson (2016). ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.
Jahanian et al. (2021). Generative models as a data source for multiview representation learning. arXiv preprint arXiv:2106.05258.
Kim et al. (2019a). Relevance factor VAE: learning and identifying disentangled factors. arXiv preprint arXiv:1902.01568.
Kim et al. (2021). Hybrid generative-contrastive representation learning. arXiv preprint arXiv:2106.06162.
Kim et al. (2019b). Diversify and match: a domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12456–12465.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kingma and Welling (2019). An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
Kolyvakis et al. (2018). Biomedical ontology alignment: an approach based on representation learning. Journal of Biomedical Semantics 9(1), pp. 1–20.
Li and Ji (2019). L0-ARM: network sparsification via stochastic binary optimization. arXiv preprint arXiv:1904.04432.
Lin et al. (2017). MARTA GANs: unsupervised representation learning for remote sensing image classification. IEEE Geoscience and Remote Sensing Letters 14(11), pp. 2092–2096.
Lucas et al. (2019). Don't blame the ELBO! A linear VAE perspective on posterior collapse. Advances in Neural Information Processing Systems 32.
Matthey et al. (2017). dSprites: disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/
Mondal et al. (2021). FlexAE: flexibly learning latent priors for Wasserstein autoencoders. In Uncertainty in Artificial Intelligence, pp. 525–535.
Mondal et al. (2019). MaskAAE: latent space optimization for adversarial autoencoders. arXiv preprint arXiv:1912.04564.
Radford et al. (2018). Improving language understanding by generative pre-training.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
Rosca et al. (2018). Distribution matching in variational inference. arXiv preprint arXiv:1802.06847.
Rubenstein et al. (2018a). On the latent space of Wasserstein autoencoders. arXiv preprint arXiv:1802.03761.
Rubenstein et al. (2018b). Wasserstein autoencoders: latent dimensionality and random encoders.
Rybkin et al. (2021). Simple and effective VAE training with calibrated decoders. In International Conference on Machine Learning, pp. 9179–9189.
Stypułkowski et al. (2019). Conditional invertible flow for point cloud generation. arXiv preprint arXiv:1910.07344.
Tanaka and Aranha (2019). Data augmentation using GANs. arXiv preprint arXiv:1904.09135.
Tang et al. (2017). Medical image classification via multiscale representation learning. Artificial Intelligence in Medicine 79, pp. 71–78.
Tipping (1999). The relevance vector machine. Advances in Neural Information Processing Systems 12.
Tolstikhin et al. (2017). Wasserstein autoencoders. arXiv preprint arXiv:1711.01558.
Tomczak and Welling (2018). VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223.
Wang et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(10), pp. 3349–3364.
Xu et al. (2019). On the necessity and effectiveness of learning the prior of variational autoencoder. arXiv preprint arXiv:1905.13452.
Xu et al. (2020). Shallow VAEs with RealNVP prior can perform as well as deep hierarchical VAEs. In International Conference on Neural Information Processing, pp. 650–659.
Yadav et al. (2021). Lung-GANs: unsupervised representation learning for lung disease classification using chest CT and X-ray images. IEEE Transactions on Engineering Management.
Yin and Zhou (2018). ARM: augment-REINFORCE-merge gradient for stochastic binary networks. arXiv preprint arXiv:1807.11143.
Zaheer et al. (2017). Deep sets. Advances in Neural Information Processing Systems 30.
Zang and Wang (2020). MoFlow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626.
Appendix A
A.1 Evidence Lower Bound for RENs
From Eq. 3 and Eq. 4,

$p(\mathbf{Z}|\boldsymbol{\alpha}) = \prod_{n=1}^{N} \mathcal{N}\left(\mathbf{z}_n; \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1}\right)$  (8)

$p(\boldsymbol{\alpha}) = \prod_{l=1}^{L} \mathrm{Gamma}(\alpha_l; a, b)$  (9)

Considering the graphical model in Figure 3(d), the variational posterior is specified by
$q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X}) = q_\gamma(\boldsymbol{\alpha}|\mathbf{X}, \mathbf{Z})\, q_\phi(\mathbf{Z}|\mathbf{X})$.
We begin by defining the marginal likelihood of the training data,
$p(\mathbf{X}) = \iint p_\theta(\mathbf{X}|\mathbf{Z})\, p(\mathbf{Z}|\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})\, d\mathbf{Z}\, d\boldsymbol{\alpha}$
The ELBO maximizes the marginal log-likelihood of the training data:
$\log p(\mathbf{X}) = \log \mathbb{E}_{q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X})}\left[\frac{p_\theta(\mathbf{X}|\mathbf{Z})\, p(\mathbf{Z}|\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X})}\right]$
Applying Jensen's inequality,
$\log p(\mathbf{X}) \geq \mathbb{E}_{q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right] + \mathbb{E}_{q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X})}\left[\log \frac{p(\mathbf{Z}|\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{q(\mathbf{Z}, \boldsymbol{\alpha}|\mathbf{X})}\right]$
The first term is the reconstruction loss:
$\mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right]$
The second term can be further simplified as follows,

$\mathbb{E}_{q}\left[\log \frac{p(\mathbf{Z}|\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})\, q_\phi(\mathbf{Z}|\mathbf{X})}\right] = -\mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (10)

The ELBO for REN with σVAE:

$\mathcal{L}_{\sigma\text{VAE}} = \mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right] - \mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (11)

For dpVAE Bhalodia et al. (2020), please refer to the paper for the basic formulation. For dpVAE with REN, the joint likelihood changes to
$p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\alpha}) = p(\boldsymbol{\alpha})\, p_\theta(\mathbf{X}|\mathbf{Z})\, p_\beta(\mathbf{Z}|\boldsymbol{\alpha})$, where $p_\beta(\mathbf{z}|\boldsymbol{\alpha}) = \mathcal{N}\left(f_\beta(\mathbf{z}); \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1}\right)\left|\det \partial f_\beta(\mathbf{z})/\partial \mathbf{z}\right|$.
The second term for the dpVAE with REN formulation becomes:

$-\mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p_\beta(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (12)

The ELBO for REN with dpVAE:

$\mathcal{L}_{\text{dpVAE}} = \mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X})}\left[\log p_\theta(\mathbf{X}|\mathbf{Z})\right] - \mathbb{E}_{q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z})}\left[D_{KL}\left(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p_\beta(\mathbf{Z}|\boldsymbol{\alpha})\right)\right] - D_{KL}\left(q_\gamma(\boldsymbol{\alpha}|\mathbf{X},\mathbf{Z}) \,\|\, p(\boldsymbol{\alpha})\right)$  (13)